Network & Ingress Configuration

This guide covers the network configuration required to deploy Michelangelo in a Kubernetes cluster: Envoy proxy settings (CORS, cluster hostnames), Ingress setup for the API server and UI, TLS with cert-manager, and connectivity requirements for multi-cluster deployments.


Overview

Michelangelo's network surface has two external-facing entry points:

| Entry Point | Default Port | Purpose |
|---|---|---|
| API Server Ingress | 443 (HTTPS) | gRPC API used by the `ma` CLI, workers, and SDK |
| UI + Envoy Ingress | 443 (HTTPS) | Browser-facing UI and REST/gRPC-Web proxy |

Traffic flow from the public internet to internal components:

Internet
├─ api.your-domain.com ──► Ingress ──► michelangelo-apiserver:15566 (gRPC)
└─ app.your-domain.com ──► Ingress ──► michelangelo-envoy:8081
                                         └─► michelangelo-apiserver:15566 (gRPC-Web)
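Both hostnames must resolve to the Ingress controller's external address. Once the Ingress resources exist, the address can be looked up directly (a sketch; resource names depend on your manifests):

```shell
# Show the external IP or hostname assigned to each Ingress
kubectl get ingress -n michelangelo

# Create DNS A/CNAME records for api.your-domain.com and
# app.your-domain.com pointing at the ADDRESS column shown above.
```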

Envoy Proxy Configuration

The Envoy proxy sits in front of the API server for browser clients. It handles HTTP/1.1 → gRPC transcoding and CORS.

CORS Configuration

Add your UI domain to Envoy's CORS allowed origins. This is required for the browser-based UI to call the API. In the Envoy ConfigMap:

static_resources:
  listeners:
  - address:
      socket_address: { address: 0.0.0.0, port_value: 8081 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http  # required field; Envoy rejects the config without it
          route_config:
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              cors:
                allow_origin_string_match:
                - safe_regex:
                    regex: "https://app\\.your-domain\\.com"
                allow_methods: "GET, POST, OPTIONS"
                allow_headers: "content-type, context-ttl-ms, grpc-timeout, rpc-caller, rpc-encoding, rpc-service, x-grpc-web, x-user-agent"
                expose_headers: "grpc-status, grpc-message"
                max_age: "1728000"
              routes:
              - match: { prefix: "/" }
                route:
                  cluster: michelangelo-apiserver
                  max_grpc_timeout: 0s
          http_filters:
          - name: envoy.filters.http.grpc_web
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.grpc_web.v3.GrpcWeb
          - name: envoy.filters.http.cors
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.cors.v3.Cors
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
  - name: michelangelo-apiserver
    connect_timeout: 30s
    type: LOGICAL_DNS
    http2_protocol_options: {}
    load_assignment:
      cluster_name: michelangelo-apiserver
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: michelangelo-apiserver  # Kubernetes service name
                port_value: 15566

Fields to customize per environment:

| Field | Description |
|---|---|
| `allow_origin_string_match.regex` | Replace with your UI domain regex |
| `socket_address.address` | API server Kubernetes service name (default: `michelangelo-apiserver`) |
| `socket_address.port_value` | API server port (default: `15566`) |
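After changing these fields, Envoy must be restarted to pick up the new ConfigMap. A sketch, assuming the ConfigMap manifest file name and the `michelangelo-envoy` Deployment name match the defaults used elsewhere in this guide:

```shell
# Apply the updated Envoy ConfigMap and restart the proxy pods
kubectl apply -f envoy-configmap.yaml -n michelangelo
kubectl rollout restart deployment/michelangelo-envoy -n michelangelo
kubectl rollout status deployment/michelangelo-envoy -n michelangelo
```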

Envoy TLS Termination

If you terminate TLS at the Envoy pod (rather than at the Ingress), add a transport_socket to the listener:

transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    common_tls_context:
      tls_certificates:
      - certificate_chain:
          filename: /etc/ssl/certs/tls.crt
        private_key:
          filename: /etc/ssl/certs/tls.key

Mount the certificate from a Kubernetes Secret:

volumes:
- name: tls-cert
  secret:
    secretName: michelangelo-envoy-tls
volumeMounts:
- name: tls-cert
  mountPath: /etc/ssl/certs
  readOnly: true
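If you manage this certificate manually rather than with cert-manager, the Secret can be created from existing PEM files (a sketch; the `tls.crt` and `tls.key` file names are placeholders for your certificate and key):

```shell
# Create the TLS Secret that the Envoy pod mounts at /etc/ssl/certs
kubectl create secret tls michelangelo-envoy-tls \
  --cert=tls.crt --key=tls.key \
  -n michelangelo
```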

In most deployments, TLS is terminated at the Ingress layer instead — see TLS with cert-manager below.


Ingress Setup

API Server Ingress

The API server uses gRPC (HTTP/2). Your Ingress controller must support HTTP/2 backend connections. With NGINX Ingress Controller:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: michelangelo-apiserver
  namespace: michelangelo
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.your-domain.com
    secretName: michelangelo-apiserver-tls
  rules:
  - host: api.your-domain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: michelangelo-apiserver
            port:
              number: 15566

HTTP/2 requirement: gRPC requires HTTP/2 end-to-end. If your Ingress controller terminates TLS but connects to the backend over HTTP/1.1, gRPC calls will fail. Ensure backend-protocol: GRPC (or equivalent) is set.
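One way to confirm that TLS, HTTP/2, and gRPC routing all work end-to-end is to call the API server through the Ingress with `grpcurl` (a sketch; it assumes `grpcurl` and `grpc_health_probe` are installed locally, and that the server exposes gRPC reflection, which may not be enabled in your deployment):

```shell
# List services via gRPC server reflection through the Ingress.
# Any successful response proves the gRPC path is working.
grpcurl api.your-domain.com:443 list

# Without reflection, a standard gRPC health check can be used instead:
grpc_health_probe -addr=api.your-domain.com:443 -tls
```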

UI + Envoy Ingress

The UI and gRPC-Web proxy share a single Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: michelangelo-ui
  namespace: michelangelo
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.your-domain.com
    secretName: michelangelo-ui-tls
  rules:
  - host: app.your-domain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: michelangelo-envoy
            port:
              number: 8081

Domain Names to Update in Overlays

After setting hostnames in Ingress resources, propagate them through ConfigMaps:

| Location | Field | Value |
|---|---|---|
| Worker ConfigMap | `worker.address` | `api.your-domain.com:443` |
| UI Public Config | `apiBaseUrl` | `https://app.your-domain.com` |
| Envoy CORS config | `allow_origin_string_match.regex` | Your UI domain |

See Platform Setup — Environment Overrides for the full list.


TLS with cert-manager

Use cert-manager to automate TLS certificate provisioning. Install cert-manager if it is not already present:

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml

ClusterIssuer (Let's Encrypt)

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@your-domain.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx

Referencing the Issuer in Ingress

Add the cert-manager annotation to your Ingress resources:

metadata:
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"

cert-manager will automatically create and renew the TLS Secret referenced in spec.tls[].secretName.
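To confirm that issuance succeeded:

```shell
# READY should show True once the certificate has been issued
kubectl get certificate -n michelangelo

# If a certificate stays pending, inspect the underlying request for events
kubectl describe certificaterequest -n michelangelo
```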

Using an Internal CA

For private clusters that cannot use ACME, use a ClusterIssuer backed by an internal CA:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca
spec:
  ca:
    secretName: internal-ca-key-pair  # Secret containing the CA cert and key
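The CA Secret must exist in the cert-manager namespace before the ClusterIssuer can issue anything. A sketch, where `ca.crt` and `ca.key` are placeholders for your CA certificate and private key files:

```shell
# Create the CA key pair Secret referenced by the ClusterIssuer above
kubectl create secret tls internal-ca-key-pair \
  --cert=ca.crt --key=ca.key \
  -n cert-manager
```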

Multi-Cluster Network Topology

When Michelangelo's control plane dispatches jobs to registered compute clusters, the following connectivity is required:

No automatic failover

Michelangelo does not automatically fail over if the control plane API server becomes unreachable. Task pods in compute clusters cannot report results, and new jobs cannot be dispatched. Configure alerting on the controller manager's health endpoint (:8083/healthz) so on-call is paged before users are impacted — see Monitoring.

Control Plane Cluster                    Compute Cluster
┌────────────────────────┐               ┌──────────────────────────────┐
│ Controller Manager     │──── HTTPS ───►│ Kubernetes API server        │
│ (kubeconfig for each   │               │ (port 443)                   │
│  compute cluster)      │               └──────────────────────────────┘
│                        │               ┌──────────────────────────────┐
│ Worker                 │◄──── gRPC ────│ Task pods (report back       │
│ (port 15566)           │               │  via worker.address)         │
└────────────────────────┘               └──────────────────────────────┘

Required Connectivity

| Direction | Source | Destination | Port | Purpose |
|---|---|---|---|---|
| Outbound from control plane | Controller Manager | Compute cluster K8s API | 443 | Dispatching RayCluster / SparkApplication CRDs |
| Outbound from compute | Task pods | Michelangelo API server | 443 | Worker connectivity for result reporting |
| Outbound from compute | Task pods | S3 / object store | 443 | Artifact reads and writes |

NetworkPolicy for Control Plane → Compute Cluster

If your compute cluster enforces NetworkPolicy, ensure the control plane's egress IP range can reach the Kubernetes API server:

Managed Kubernetes (EKS, GKE, AKS): The API server runs outside the cluster on managed platforms and is not a schedulable pod. This NetworkPolicy only applies to self-managed clusters. For managed clusters, use your cloud provider's security groups or authorized networks instead.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-michelangelo-controller
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      component: kube-apiserver
  policyTypes:
  - Ingress
  ingress:
  - from:
    - ipBlock:
        cidr: <control-plane-egress-cidr>/32
    ports:
    - protocol: TCP
      port: 443

Verifying Cross-Cluster Connectivity

From the Controller Manager pod, verify it can reach each registered compute cluster:

# Exec into the controller manager pod
kubectl exec -it deploy/michelangelo-controllermgr -n michelangelo -- /bin/sh

# Check connectivity to a registered compute cluster's K8s API
curl -sk https://<compute-cluster-api-server>:443/healthz

From a task pod in the compute cluster, verify it can reach the Michelangelo API server:

kubectl exec -it <task-pod> -n <compute-namespace> -- \
curl -sk https://api.your-domain.com/healthz

Checklist

Use this checklist when deploying Michelangelo to a new environment:

  • Ingress controller installed and supports HTTP/2 (for gRPC)
  • API server Ingress created with backend-protocol: GRPC
  • UI + Envoy Ingress created
  • TLS certificates provisioned (cert-manager or manual)
  • Envoy CORS allow_origin updated to match UI domain
  • Worker ConfigMap worker.address updated to api.your-domain.com:443
  • UI config.json apiBaseUrl updated to UI domain
  • Cross-cluster connectivity verified (controller manager → compute K8s API)
  • Task pod → Michelangelo API server connectivity verified