Operator Guides

These guides cover deploying, configuring, and integrating Michelangelo in a Kubernetes environment. They target platform engineers and infrastructure operators who are responsible for running Michelangelo in production and for connecting it to the broader ML infrastructure their teams already use — experiment tracking, model registries, compute clusters, schedulers, and serving frameworks.

Getting Started

For a fresh deployment, follow this recommended reading order:

  1. Platform Setup: configure each component (API server, controller manager, worker, UI/Envoy) via ConfigMaps and Kustomize overlays
  2. Register a Compute Cluster: connect an existing Kubernetes cluster so Michelangelo can dispatch Ray and Spark jobs to it
  3. Cluster Setup for Serving: enable model inference on a local or remote cluster
  4. Authentication: connect an identity provider and configure RBAC before opening the platform to users

Setup & Configuration

| Guide | Description |
| --- | --- |
| Helm Chart | Install the Michelangelo control plane with Helm: chart layout, values reference, and migration phases |
| Platform Setup | ConfigMaps and key fields for API server, controller manager, worker, and UI/Envoy |
| Network & Ingress | Envoy proxy, Ingress setup, TLS with cert-manager, and multi-cluster connectivity |
| Authentication | OIDC identity provider setup, RBAC, session configuration, multi-tenant isolation |
| Register a Compute Cluster | Connect an existing Kubernetes cluster to the Michelangelo control plane |

Platform Components

| Guide | Description |
| --- | --- |
| Model Registry | Operate Michelangelo's built-in model registry, configure storage and RBAC, and integrate with serving and CI/CD |
| Ingester Controller | Deploy, configure, and operate the ingester that syncs CRDs into MySQL |

Jobs & Compute

| Guide | Description |
| --- | --- |
| Jobs Overview | Ray and Spark job lifecycle, compute selection, and observability |
| Run a Pipeline on a Compute Cluster | Submit and monitor a Uniflow pipeline on a registered cluster |
| Extend the Job Scheduler | Custom scheduling backends (Kueue, Volcano) and assignment strategies |

Model Serving

| Guide | Description |
| --- | --- |
| Serving Overview | InferenceServer and Deployment lifecycle, architecture |
| Cluster Setup for Serving | Configure a cluster for inference |
| Integrate a Custom Backend | Plugin interfaces for Triton, vLLM, TensorRT-LLM, and custom frameworks |

UI

| Guide | Description |
| --- | --- |
| Deploying the UI | Deploy the Michelangelo web UI to Kubernetes |
| Local UI Development | Run the UI locally for development |

Third-Party Integrations

Michelangelo is designed to run alongside existing ML infrastructure. The guides below cover making external tools reachable from Michelangelo workloads.

| Guide | Description |
| --- | --- |
| Experiment Tracking Setup | Make an experiment tracking server reachable from task pods: network, ConfigMap injection, auth, and operator/user boundary |
| Browse all integrations | MLflow and other third-party integration guides |

Operations

| Guide | Description |
| --- | --- |
| Monitoring & Observability | Prometheus scrape config, key metrics, alerting rules, Grafana dashboards, structured logging |
| Compliance | SOC 2, GDPR, and HIPAA configuration |
| Troubleshooting | Common failure modes and kubectl diagnostic commands |

Architecture & Reference

| Guide | Description |
| --- | --- |
| API Framework | Architecture overview of the Michelangelo API and control plane |
| SQL Key Concepts and Terms | Metadata schema, table naming, indexed fields, and SQL query patterns |