Operator Guides

These guides cover deploying, configuring, and integrating Michelangelo in a Kubernetes environment. They target platform engineers and infrastructure operators who are responsible for running Michelangelo in production and for connecting it to the broader ML infrastructure their teams already use — experiment tracking, model registries, compute clusters, schedulers, and serving frameworks.

Getting Started

For a fresh deployment, follow this recommended reading order:

  1. Platform Setup — configure each component (API server, controller manager, worker, UI/Envoy) via ConfigMaps and Kustomize overlays (a minimal overlay sketch follows this list)
  2. Register a Compute Cluster — connect an existing Kubernetes cluster so Michelangelo can dispatch Ray and Spark jobs to it
  3. Cluster Setup for Serving — enable model inference on a local or remote cluster
  4. Authentication — connect an identity provider and configure RBAC before opening to users
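
For the overlay mechanics in step 1, a minimal sketch is below. Everything Michelangelo-specific in it is a placeholder: the ConfigMap name, namespace, and keys are illustrative assumptions, and it assumes the base kustomization generates the ConfigMap (so `behavior: merge` applies). See Platform Setup for the actual ConfigMaps and fields.

```yaml
# kustomization.yaml in an environment overlay (e.g. overlays/prod).
# The ConfigMap name and keys below are hypothetical; substitute the
# fields documented in Platform Setup.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namespace: michelangelo
configMapGenerator:
  - name: michelangelo-api-server           # hypothetical ConfigMap name
    behavior: merge                         # merge onto the base-generated ConfigMap
    literals:
      - LOG_LEVEL=info                      # hypothetical key
      - EXTERNAL_URL=https://ml.example.com # hypothetical key
```

Applying the overlay is standard Kustomize: `kubectl apply -k overlays/prod`.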

Platform Configuration

| Guide | Description |
| --- | --- |
| Platform Setup | ConfigMaps and key fields for the API server, controller manager, worker, and UI/Envoy |
| Network & Ingress | Envoy proxy, Ingress setup, TLS with cert-manager, and multi-cluster connectivity (TLS sketch below) |
| API Framework | Architecture overview of the Michelangelo API and control plane |
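
The Network & Ingress guide covers TLS with cert-manager; a minimal example of the pattern is below. The Ingress schema and the `cert-manager.io/cluster-issuer` annotation are standard, but the host, Service name, port, ingress class, and the existence of a `letsencrypt-prod` ClusterIssuer are assumptions for illustration.

```yaml
# Ingress with cert-manager-issued TLS. Hostname, Service name, and
# issuer are placeholders; adapt to your environment.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: michelangelo
  namespace: michelangelo
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # assumed ClusterIssuer
spec:
  ingressClassName: nginx            # assumes an NGINX ingress controller
  tls:
    - hosts: [ml.example.com]
      secretName: michelangelo-tls   # cert-manager stores the issued cert here
  rules:
    - host: ml.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: michelangelo-envoy  # hypothetical Envoy front-proxy Service
                port:
                  number: 8080
```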

Jobs & Compute

| Guide | Description |
| --- | --- |
| Jobs Overview | Ray and Spark job lifecycle, compute selection, and observability |
| Register a Compute Cluster | Connect an existing Kubernetes cluster to the Michelangelo control plane |
| Run a Pipeline on a Compute Cluster | Submit and monitor a Uniflow pipeline on a registered cluster |
| Extend the Job Scheduler | Custom scheduling backends (Kueue, Volcano) and assignment strategies (Kueue sketch below) |
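
If you back the scheduler with Kueue, the queue objects themselves are ordinary Kueue resources, independent of Michelangelo. The sketch below is a minimal quota setup; queue names and quota values are placeholders, and how Michelangelo's scheduler backend routes jobs to the LocalQueue is specific to your plugin (see Extend the Job Scheduler).

```yaml
# Minimal Kueue setup: one flavor, one cluster-wide queue, one namespaced queue.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-jobs
spec:
  namespaceSelector: {}            # admit workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: 64     # placeholder quota
            - name: memory
              nominalQuota: 256Gi  # placeholder quota
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: michelangelo-jobs
  namespace: michelangelo
spec:
  clusterQueue: ml-jobs
```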

Model Serving

| Guide | Description |
| --- | --- |
| Serving Overview | InferenceServer and Deployment lifecycle and architecture (illustrative manifest below) |
| Cluster Setup for Serving | Configure a cluster for inference |
| Integrate a Custom Backend | Plugin interfaces for Triton, vLLM, TensorRT-LLM, and custom frameworks |
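
To make the lifecycle concrete, here is roughly what an InferenceServer object might look like. This is a sketch only: the API group, version, and every field name are assumptions, not the real schema; consult the Serving Overview for the actual resource definition.

```yaml
# Illustrative only. The schema below is invented for this sketch;
# see Serving Overview for the real InferenceServer API.
apiVersion: serving.michelangelo.io/v1alpha1  # hypothetical group/version
kind: InferenceServer
metadata:
  name: fraud-model
  namespace: ml-serving
spec:
  backend: vllm                  # hypothetical: one of the pluggable backends
  model:
    registryRef: fraud-model:v3  # hypothetical reference into the model registry
  replicas: 2
  resources:
    limits:
      nvidia.com/gpu: 1
```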

UI

| Guide | Description |
| --- | --- |
| Deploying the UI | Deploy the Michelangelo web UI to Kubernetes |
| Local UI Development | Run the UI locally for development |

Integrating with Your ML Stack

Michelangelo is designed to be adopted alongside existing ML infrastructure. These guides cover how to connect Michelangelo to the systems your teams already use.

| Guide | Description |
| --- | --- |
| Model Registry | Operate Michelangelo's built-in model registry, configure storage and RBAC, and integrate with serving and CI/CD |
| Experiment Tracking | Connect an external experiment tracking server to Michelangelo task pods (sketch below) |
| Custom Serving Backend | Add support for any inference framework: Triton, vLLM, TensorRT-LLM, or your own |
| Custom Job Scheduler | Replace or extend the job scheduler with Kueue, Volcano, or a custom assignment strategy |
| Register a Compute Cluster | Connect an existing Kubernetes cluster so Michelangelo can dispatch jobs to it |
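
As one concrete pattern for the Experiment Tracking integration: if your tracking server is MLflow, task pods only need the standard client variable pointed at it. The ConfigMap below sketches that wiring. The ConfigMap name, namespace, and the assumption that Michelangelo injects these keys into task pods as environment variables are hypothetical; `MLFLOW_TRACKING_URI` itself is the standard MLflow client setting.

```yaml
# Hypothetical ConfigMap consumed by task pods as environment variables.
apiVersion: v1
kind: ConfigMap
metadata:
  name: experiment-tracking   # hypothetical name
  namespace: michelangelo
data:
  MLFLOW_TRACKING_URI: http://mlflow.mlops.svc.cluster.local:5000
```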

Operations

| Guide | Description |
| --- | --- |
| Authentication | OIDC identity provider setup, RBAC, session configuration, and multi-tenant isolation |
| Monitoring & Observability | Prometheus scrape config, key metrics, alerting rules, Grafana dashboards, and structured logging (sample scrape job below) |
| Compliance | SOC 2, GDPR, and HIPAA configuration |
| Troubleshooting | Common failure modes and kubectl diagnostic commands |
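
For Monitoring & Observability, a typical annotation-driven Prometheus scrape job looks like the following. The relabeling rules are standard Prometheus Kubernetes service discovery; the namespace and the assumption that Michelangelo pods expose `/metrics` and carry `prometheus.io/*` annotations are placeholders to adapt.

```yaml
# Scrape only pods that opt in via the prometheus.io/scrape annotation,
# honoring a custom metrics path if one is annotated.
scrape_configs:
  - job_name: michelangelo
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [michelangelo]   # assumed deployment namespace
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```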