Operator Guides

These guides cover deploying, configuring, and integrating Michelangelo in a Kubernetes environment. They target platform engineers and infrastructure operators who are responsible for running Michelangelo in production and for connecting it to the broader ML infrastructure their teams already use — experiment tracking, model registries, compute clusters, schedulers, and serving frameworks.

Getting Started

For a fresh deployment, follow this recommended reading order:

  1. Platform Setup — configure each component (API server, controller manager, worker, UI/Envoy) via ConfigMaps and Kustomize overlays (a minimal overlay sketch follows this list)
  2. Register a Compute Cluster — connect an existing Kubernetes cluster so Michelangelo can dispatch Ray and Spark jobs to it
  3. Cluster Setup for Serving — enable model inference on a local or remote cluster
  4. Authentication — connect an identity provider and configure RBAC before opening to users
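
For the overlay mechanics in step 1, a minimal sketch is below. Everything Michelangelo-specific in it is a placeholder: the ConfigMap name, namespace, and keys are illustrative assumptions, and it assumes the base kustomization generates the ConfigMap (so `behavior: merge` applies). See Platform Setup for the actual ConfigMaps and fields.

```yaml
# kustomization.yaml in an environment overlay (e.g. overlays/prod).
# The ConfigMap name and keys below are hypothetical; substitute the
# fields documented in Platform Setup.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namespace: michelangelo
configMapGenerator:
  - name: michelangelo-api-server           # hypothetical ConfigMap name
    behavior: merge                         # merge onto the base-generated ConfigMap
    literals:
      - LOG_LEVEL=info                      # hypothetical key
      - EXTERNAL_URL=https://ml.example.com # hypothetical key
```

Applying the overlay is standard Kustomize: `kubectl apply -k overlays/prod`.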

Platform Configuration

| Guide | Description |
| --- | --- |
| Platform Setup | ConfigMaps and key fields for the API server, controller manager, worker, and UI/Envoy |
| Network & Ingress | Envoy proxy, Ingress setup, TLS with cert-manager, and multi-cluster connectivity (TLS sketch below) |
| API Framework | Architecture overview of the Michelangelo API and control plane |
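
The Network & Ingress guide covers TLS with cert-manager; a minimal example of the pattern is below. The Ingress schema and the `cert-manager.io/cluster-issuer` annotation are standard, but the host, Service name, port, ingress class, and the existence of a `letsencrypt-prod` ClusterIssuer are assumptions for illustration.

```yaml
# Ingress with cert-manager-issued TLS. Hostname, Service name, and
# issuer are placeholders; adapt to your environment.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: michelangelo
  namespace: michelangelo
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # assumed ClusterIssuer
spec:
  ingressClassName: nginx            # assumes an NGINX ingress controller
  tls:
    - hosts: [ml.example.com]
      secretName: michelangelo-tls   # cert-manager stores the issued cert here
  rules:
    - host: ml.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: michelangelo-envoy  # hypothetical Envoy front-proxy Service
                port:
                  number: 8080
```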

Jobs & Compute

| Guide | Description |
| --- | --- |
| Jobs Overview | Ray and Spark job lifecycle, compute selection, and observability |
| Register a Compute Cluster | Connect an existing Kubernetes cluster to the Michelangelo control plane |
| Run a Pipeline on a Compute Cluster | Submit and monitor a Uniflow pipeline on a registered cluster |
| Extend the Job Scheduler | Custom scheduling backends (Kueue, Volcano) and assignment strategies (Kueue sketch below) |
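
If you back the scheduler with Kueue, the queue objects themselves are ordinary Kueue resources, independent of Michelangelo. The sketch below is a minimal quota setup; queue names and quota values are placeholders, and how Michelangelo's scheduler backend routes jobs to the LocalQueue is specific to your plugin (see Extend the Job Scheduler).

```yaml
# Minimal Kueue setup: one flavor, one cluster-wide queue, one namespaced queue.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-jobs
spec:
  namespaceSelector: {}            # admit workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 64     # placeholder quota
            - name: memory
              nominalQuota: 256Gi  # placeholder quota
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: michelangelo-jobs
  namespace: michelangelo
spec:
  clusterQueue: ml-jobs
```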

Model Serving

| Guide | Description |
| --- | --- |
| Serving Overview | InferenceServer and Deployment lifecycle and architecture (illustrative manifest below) |
| Cluster Setup for Serving | Configure a cluster for inference |
| Integrate a Custom Backend | Plugin interfaces for Triton, vLLM, TensorRT-LLM, and custom frameworks |
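
To make the lifecycle concrete, here is roughly what an InferenceServer object might look like. This is a sketch only: the API group, version, and every field name are assumptions, not the real schema; consult the Serving Overview for the actual resource definition.

```yaml
# Illustrative only. The schema below is invented for this sketch;
# see Serving Overview for the real InferenceServer API.
apiVersion: serving.michelangelo.io/v1alpha1  # hypothetical group/version
kind: InferenceServer
metadata:
  name: fraud-model
  namespace: ml-serving
spec:
  backend: vllm                  # hypothetical: one of the pluggable backends
  model:
    registryRef: fraud-model:v3  # hypothetical reference into the model registry
  replicas: 2
  resources:
    limits:
      nvidia.com/gpu: 1
```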

UI

| Guide | Description |
| --- | --- |
| Deploying the UI | Deploy the Michelangelo web UI to Kubernetes |
| Local UI Development | Run the UI locally for development |

Integrating with Your ML Stack

Michelangelo is designed to be adopted alongside existing ML infrastructure. These guides cover how to connect Michelangelo to the systems your teams already use.

| Guide | Description |
| --- | --- |
| Model Registry | Operate Michelangelo's built-in model registry, configure storage and RBAC, and integrate with serving and CI/CD |
| Experiment Tracking | Connect an external experiment tracking server to Michelangelo task pods (sketch below) |
| Custom Serving Backend | Add support for any inference framework: Triton, vLLM, TensorRT-LLM, or your own |
| Custom Job Scheduler | Replace or extend the job scheduler with Kueue, Volcano, or a custom assignment strategy |
| Register a Compute Cluster | Connect an existing Kubernetes cluster so Michelangelo can dispatch jobs to it |
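
As one concrete pattern for the Experiment Tracking integration: if your tracking server is MLflow, task pods only need the standard client variable pointed at it. The ConfigMap below sketches that wiring. The ConfigMap name, namespace, and the assumption that Michelangelo injects these keys into task pods as environment variables are hypothetical; `MLFLOW_TRACKING_URI` itself is the standard MLflow client setting.

```yaml
# Hypothetical ConfigMap consumed by task pods as environment variables.
apiVersion: v1
kind: ConfigMap
metadata:
  name: experiment-tracking   # hypothetical name
  namespace: michelangelo
data:
  MLFLOW_TRACKING_URI: http://mlflow.mlops.svc.cluster.local:5000
```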

Operations

| Guide | Description |
| --- | --- |
| Authentication | OIDC identity provider setup, RBAC, session configuration, and multi-tenant isolation |
| Monitoring & Observability | Prometheus scrape config, key metrics, alerting rules, Grafana dashboards, and structured logging (sample scrape job below) |
| Compliance | SOC 2, GDPR, and HIPAA configuration |
| Troubleshooting | Common failure modes and kubectl diagnostic commands |
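
For Monitoring & Observability, a typical annotation-driven Prometheus scrape job looks like the following. The relabeling rules are standard Prometheus Kubernetes service discovery; the namespace and the assumption that Michelangelo pods expose `/metrics` and carry `prometheus.io/*` annotations are placeholders to adapt.

```yaml
# Scrape only pods that opt in via the prometheus.io/scrape annotation,
# honoring a custom metrics path if one is annotated.
scrape_configs:
  - job_name: michelangelo
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [michelangelo]   # assumed deployment namespace
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```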