Monitoring & Observability
This guide is for platform operators setting up observability for a Michelangelo deployment.
Prerequisites: A running Michelangelo control plane with the controller manager deployed. Familiarity with Prometheus Operator is helpful but not required.
Michelangelo components expose Prometheus metrics that integrate with a standard Kubernetes observability stack. This guide covers scrape configuration, key metrics to monitor, alerting rules, and logging configuration.
Prometheus Scrape Configuration
Controller Manager
The controller manager exposes metrics on port 8091 (configured via metricsBindAddress). If you are using the Prometheus Operator, create a ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: michelangelo-controllermgr
  namespace: ma-system
  labels:
    app: michelangelo-controllermgr
spec:
  selector:
    matchLabels:
      app: michelangelo-controllermgr
  endpoints:
    - port: metrics # Must match the Service port name for port 8091
      path: /metrics
      interval: 30s
```
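The ServiceMonitor selects a Service, not pods directly, so the controller manager needs a Service with a named metrics port on 8091. A minimal sketch, assuming your controller-manager pods carry the app: michelangelo-controllermgr label:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: michelangelo-controllermgr
  namespace: ma-system
  labels:
    app: michelangelo-controllermgr # matched by the ServiceMonitor selector
spec:
  selector:
    app: michelangelo-controllermgr # assumed pod label; adjust to your deployment
  ports:
    - name: metrics # the port name the ServiceMonitor endpoint references
      port: 8091
      targetPort: 8091
```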
Health Probes
The controller manager exposes health endpoints on port 8083 (configured via healthProbeBindAddress):
| Endpoint | Purpose |
|---|---|
| GET :8083/healthz | Liveness: is the process alive? |
| GET :8083/readyz | Readiness: is the controller ready to reconcile? |
These are used by Kubernetes liveness and readiness probes, but you can also poll them from your monitoring stack for coarser-grained health checks.
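If you define the probes yourself, here is a minimal sketch of the corresponding container probe configuration; the paths and port come from the table above, while the timing values are illustrative assumptions to tune for your cluster:

```yaml
# Probe configuration for the controller-manager container
livenessProbe:
  httpGet:
    path: /healthz
    port: 8083
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz
    port: 8083
  initialDelaySeconds: 5
  periodSeconds: 10
```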
API Server
The API server (port 15566) exposes standard gRPC metrics. If you have a Prometheus scrape job for gRPC services, point it at the API server pod.
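If you manage scrape jobs directly rather than through the Prometheus Operator, a sketch of such a job follows. The app: michelangelo-apiserver pod label is an assumption; adjust the relabeling to whatever labels your deployment applies.

```yaml
scrape_configs:
  - job_name: michelangelo-apiserver
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [ma-system]
    relabel_configs:
      # Keep only API server pods (assumed label; adjust as needed)
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: michelangelo-apiserver
      # Point the scrape target at the API server port
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        replacement: $1:15566
```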
Envoy Proxy
Envoy can expose an admin stats interface for scraping request counts, latency histograms, and upstream error rates. The admin interface is not enabled by default in the Michelangelo Envoy configuration — you must add an admin: block to your Envoy ConfigMap to enable it. See the Envoy admin documentation for setup instructions. Once enabled, add a Prometheus scrape job targeting the admin port.
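A minimal sketch of that admin: block (the bind address and port 9901 are conventional Envoy choices, not Michelangelo defaults):

```yaml
admin:
  address:
    socket_address:
      address: 0.0.0.0 # reachable from outside the pod for scraping
      port_value: 9901 # conventional Envoy admin port; any free port works
```

Envoy serves Prometheus-format stats at /stats/prometheus on the admin port, so set metrics_path accordingly in the scrape job.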
Key Metrics
Pipeline Runs
| Metric | Description | Unit |
|---|---|---|
| pipelinerun_result_total | Pipeline run results, by state, pipeline_type, environment, tier | Count |
| pipelinerun_result_failure_total | Failed pipeline runs, with failure_reason label | Count |
| pipelinerun_duration_seconds | Pipeline run execution duration (histogram) | Seconds |
| pipelinerun_failed | Gauge: 1 if the most recent run failed, 0 if it succeeded | 0/1 |
| pipelinerun_step_success_total | Step completions, by step_name and pipeline_type | Count |
| pipeline_ready_total | Pipelines reaching Ready state | Count |
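Dashboards and alerts often need a failure ratio rather than raw counters. A hedged sketch as a Prometheus recording rule (the rule name is an arbitrary convention):

```yaml
groups:
  - name: michelangelo-recording
    rules:
      # Fraction of pipeline runs failing over the last 5 minutes
      - record: michelangelo:pipelinerun_failure_ratio:rate5m
        expr: >
          sum(rate(pipelinerun_result_failure_total[5m]))
          /
          sum(rate(pipelinerun_result_total[5m]))
```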
Workflow Engine
Workflow metrics are emitted by the Cadence or Temporal server, not by Michelangelo. Consult your workflow engine's documentation for its native Prometheus metrics. Michelangelo's worker-side reconcile metrics are captured under the pipelinerun_* counters above.
Model Serving (Envoy)
If you have enabled the Envoy admin interface, these standard Envoy metrics are available:
| Metric | Description | Unit |
|---|---|---|
| envoy_cluster_upstream_rq_total | Total requests to inference backends | Count |
| envoy_cluster_upstream_rq_5xx | 5xx error responses from inference backends | Count |
| envoy_cluster_upstream_rq_time | Request latency histogram to inference servers | Milliseconds |
Controller Manager Health
The controller manager uses controller-runtime metrics — these are standard across all Kubernetes operators:
| Metric | Description | Unit |
|---|---|---|
| controller_runtime_reconcile_errors_total | Reconcile errors, by controller label | Count |
| controller_runtime_reconcile_time_seconds | Reconcile duration histogram | Seconds |
| workqueue_depth | Work items queued, by name label (one per controller) | Count |
| workqueue_retries_total | Work item retries; elevated values indicate persistent failures | Count |
Alerting Rules
Add these rules to your Prometheus configuration:
```yaml
groups:
  - name: michelangelo
    rules:
      # Pipeline run failure rate
      - alert: PipelineRunFailureRateHigh
        expr: rate(pipelinerun_result_failure_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pipeline run failures detected"
          description: >
            Pipeline runs are failing at {{ $value | humanize }} failures/sec.
            Check failure reasons: kubectl -n ma-system get pipelineruns --field-selector status.phase=Failed

      # Pipeline run duration: P99 above 1 hour
      - alert: PipelineRunDurationHigh
        expr: >
          histogram_quantile(0.99,
            rate(pipelinerun_duration_seconds_bucket[5m])
          ) > 3600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pipeline run P99 duration above 1 hour"
          description: >
            The 99th percentile pipeline run duration is {{ $value | humanize }}s.

      # Controller reconcile errors: sustained error rate from any controller
      - alert: ControllerReconcileErrorRate
        expr: rate(controller_runtime_reconcile_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Controller {{ $labels.controller }} has high reconcile error rate"
          description: >
            The {{ $labels.controller }} controller is failing reconciles at
            {{ $value | humanize }} errors/sec. Check logs:
            kubectl -n ma-system logs deployment/michelangelo-controllermgr

      # Inference latency: P99 above 500ms for 5 minutes
      - alert: InferenceLatencyHigh
        expr: >
          histogram_quantile(0.99,
            rate(envoy_cluster_upstream_rq_time_bucket[5m])
          ) > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Inference P99 latency is above 500ms"
          description: >
            The 99th percentile inference request latency is {{ $value }}ms.
            Check InferenceServer and model-sync sidecar logs.

      # Inference error rate: more than 1% of requests returning 5xx
      - alert: InferenceErrorRateHigh
        expr: >
          rate(envoy_cluster_upstream_rq_5xx[5m])
          / rate(envoy_cluster_upstream_rq_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Inference 5xx error rate above 1%"
          description: >
            {{ $value | humanizePercentage }} of inference requests are returning 5xx errors.
```
Grafana Dashboard
Create a Grafana dashboard with these panels to get operational visibility at a glance.
Overview row
| Panel | Query | Visualization |
|---|---|---|
| Pipeline run results | rate(pipelinerun_result_total[5m]) | Time series |
| Pipeline run failures | pipelinerun_failed | Stat |
| Pipeline readiness | pipeline_ready_total | Stat |
| Reconcile errors | rate(controller_runtime_reconcile_errors_total[5m]) | Time series |
Jobs row
| Panel | Query | Visualization |
|---|---|---|
| Pipeline run duration P50/P99 | histogram_quantile(0.5, rate(pipelinerun_duration_seconds_bucket[5m])); repeat with 0.99 | Time series |
| Failure rate by reason | sum by (failure_reason) (rate(pipelinerun_result_failure_total[5m])) | Time series |
Serving row
| Panel | Query | Visualization |
|---|---|---|
| Request rate | rate(envoy_cluster_upstream_rq_total[5m]) | Time series |
| Request latency P50/P99 | histogram_quantile(0.5, rate(envoy_cluster_upstream_rq_time_bucket[5m])); repeat with 0.99 | Time series |
| 5xx error rate | rate(envoy_cluster_upstream_rq_5xx[5m]) | Time series |
| Active model deployments | sum by (envoy_cluster_name) (rate(envoy_cluster_upstream_rq_total[5m])) | Table |
Controller health row
| Panel | Query | Visualization |
|---|---|---|
| Reconcile error rate by controller | rate(controller_runtime_reconcile_errors_total[5m]) | Time series |
| Reconcile latency P99 | histogram_quantile(0.99, rate(controller_runtime_reconcile_time_seconds_bucket[5m])) | Time series |
| Work queue depth | workqueue_depth | Time series |
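If you provision Grafana declaratively, a sketch of a dashboard provider entry follows; the provider name, folder, and path are assumptions for your setup, and the panels above then live in a JSON dashboard file placed in that directory:

```yaml
# Grafana dashboards provisioning file (location is an assumption,
# e.g. provisioning/dashboards/michelangelo.yaml)
apiVersion: 1
providers:
  - name: michelangelo # arbitrary provider name
    folder: Michelangelo # Grafana folder for the dashboard
    type: file
    options:
      path: /var/lib/grafana/dashboards/michelangelo # assumed mount path
```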
Structured Logging
All Michelangelo components emit structured logs. Configure log format and level in the relevant ConfigMap:
```yaml
logging:
  level: info        # debug | info | warn | error
  development: false # true enables human-readable console output
  encoding: json     # json for production; console for development
```
For production deployments, use encoding: json so your log aggregation system (Loki, Elasticsearch, CloudWatch Logs, etc.) can parse and query fields natively.
Important log fields to index
| Field | Description |
|---|---|
| level | Log severity |
| logger | Component/controller name |
| msg | Log message |
| namespace | Kubernetes resource namespace |
| name | Kubernetes resource name |
| operation | Controller operation (e.g., create_ray_cluster, schedule_job) |
| error | Error message (present on error-level logs) |
Indexing these fields allows you to efficiently query all events for a specific resource (namespace + name), filter by controller (logger), or find all failures across the control plane (level: error).
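As one example of wiring this up, here is a hedged sketch of a Promtail pipeline that parses the JSON log lines and promotes the low-cardinality fields to Loki labels; keep high-cardinality fields such as name as parsed fields rather than labels, and adapt the stages for your aggregation stack:

```yaml
scrape_configs:
  - job_name: michelangelo
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [ma-system]
    pipeline_stages:
      # Parse the structured JSON log line into fields
      - json:
          expressions:
            level: level
            logger: logger
            namespace: namespace
            name: name
      # Promote only low-cardinality fields to Loki labels
      - labels:
          level:
          logger:
```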
What's next?
- Troubleshooting: Use the collected metrics and logs to diagnose issues with the Troubleshooting guide
- Authentication: Secure access to your metrics endpoints with the Authentication guide
- Compliance: Set up audit log retention to meet SOC 2, GDPR, or HIPAA requirements in the Compliance guide