These guides cover deploying, configuring, and integrating Michelangelo in a Kubernetes environment. They target platform engineers and infrastructure operators who are responsible for running Michelangelo in production and for connecting it to the broader ML infrastructure their teams already use — experiment tracking, model registries, compute clusters, schedulers, and serving frameworks.
Getting Started
For a fresh deployment, follow this recommended reading order:
1. Platform Setup: configure each component (API server, controller manager, worker, UI/Envoy) via ConfigMaps and Kustomize overlays
2. Register a Compute Cluster: connect an existing Kubernetes cluster so Michelangelo can dispatch Ray and Spark jobs to it
3. Cluster Setup for Serving: enable model inference on a local or remote cluster
4. Authentication: connect an identity provider and configure RBAC before opening the platform to users
| Guide | Description |
|---|---|
| Platform Setup | ConfigMaps and key fields for API server, controller manager, worker, and UI/Envoy |
| Network & Ingress | Envoy proxy, Ingress setup, TLS with cert-manager, and multi-cluster connectivity |
| API Framework | Architecture overview of the Michelangelo API and control plane |
Jobs & Compute
Model Serving
Integrating with Your ML Stack
Michelangelo is designed to be adopted alongside existing ML infrastructure. These guides cover how to connect Michelangelo to the systems your teams already use.
| Guide | Description |
|---|---|
| Model Registry | Operate Michelangelo's built-in model registry, configure storage and RBAC, and integrate with serving and CI/CD |
| Experiment Tracking | Connect an external experiment tracking server to Michelangelo task pods |
| Custom Serving Backend | Add support for any inference framework — Triton, vLLM, TensorRT-LLM, or your own |
| Custom Job Scheduler | Replace or extend the job scheduler — integrate Kueue, Volcano, or a custom assignment strategy |
| Register a Compute Cluster | Connect an existing Kubernetes cluster so Michelangelo can dispatch jobs to it |
Operations
| Guide | Description |
|---|---|
| Authentication | OIDC identity provider setup, RBAC, session configuration, multi-tenant isolation |
| Monitoring & Observability | Prometheus scrape config, key metrics, alerting rules, Grafana dashboards, structured logging |
| Compliance | SOC 2, GDPR, and HIPAA configuration |
| Troubleshooting | Common failure modes and kubectl diagnostic commands |