Monitoring & Observability

This guide is for platform operators setting up observability for a Michelangelo deployment.

Prerequisites: A running Michelangelo control plane with the controller manager deployed. Familiarity with Prometheus Operator is helpful but not required.

Michelangelo components expose Prometheus metrics that integrate with a standard Kubernetes observability stack. This guide covers scrape configuration, key metrics to monitor, alerting rules, and logging configuration.

Prometheus Scrape Configuration

Controller Manager

The controller manager exposes metrics on port 8091 (configured via metricsBindAddress). If you are using the Prometheus Operator, create a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: michelangelo-controllermgr
  namespace: ma-system
  labels:
    app: michelangelo-controllermgr
spec:
  selector:
    matchLabels:
      app: michelangelo-controllermgr
  endpoints:
    - port: metrics   # Must match the Service port name for port 8091
      path: /metrics
      interval: 30s
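
If you are not running the Prometheus Operator, an equivalent plain scrape job can be used instead. This is a sketch: it assumes the controller-manager pods carry the app: michelangelo-controllermgr label and serve metrics on port 8091, as above — adjust to your deployment.

```yaml
# Plain Prometheus scrape job for clusters without the Prometheus Operator.
# Assumes controller-manager pods are labeled app: michelangelo-controllermgr
# and expose /metrics on port 8091.
scrape_configs:
  - job_name: michelangelo-controllermgr
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [ma-system]
    relabel_configs:
      # Keep only controller-manager pods
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: michelangelo-controllermgr
        action: keep
      # Rewrite the scrape address to the metrics port
      - source_labels: [__address__]
        regex: ([^:]+)(?::\d+)?
        replacement: $1:8091
        target_label: __address__
```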

Health Probes

The controller manager exposes health endpoints on port 8083 (configured via healthProbeBindAddress):

Endpoint             Purpose
GET :8083/healthz    Liveness — is the process alive?
GET :8083/readyz     Readiness — is the controller ready to reconcile?

These are used by Kubernetes liveness and readiness probes, but you can also poll them from your monitoring stack for coarser-grained health checks.
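
If you manage the controller-manager Deployment manifest yourself, the probes are typically wired to these endpoints as follows. This is illustrative: the delay and period values are example numbers, not shipped defaults.

```yaml
# Illustrative probe configuration for the controller-manager container.
# Timings are example values, not shipped defaults.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8083
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz
    port: 8083
  initialDelaySeconds: 5
  periodSeconds: 10
```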

API Server

The API server (port 15566) exposes standard gRPC metrics. If you have a Prometheus scrape job for gRPC services, point it at the API server pod.
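
A minimal static scrape job for the API server might look like the following. Both assumptions here need verifying against your deployment: that the Service is named michelangelo-apiserver in ma-system, and that metrics are served over plain HTTP at /metrics on port 15566.

```yaml
scrape_configs:
  - job_name: michelangelo-apiserver
    metrics_path: /metrics
    static_configs:
      # Assumes a Service named michelangelo-apiserver in ma-system serving
      # /metrics over HTTP on port 15566 — verify both for your deployment.
      - targets: ['michelangelo-apiserver.ma-system.svc:15566']
```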

Envoy Proxy

Envoy can expose an admin stats interface for scraping request counts, latency histograms, and upstream error rates. The admin interface is not enabled by default in the Michelangelo Envoy configuration — you must add an admin: block to your Envoy ConfigMap to enable it. See the Envoy admin documentation for setup instructions. Once enabled, add a Prometheus scrape job targeting the admin port.
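
A minimal admin: block looks like this. Port 9901 is a conventional choice, not a Michelangelo default; note that the admin interface is unauthenticated, so restrict access in production (for example, bind to 127.0.0.1 and scrape via a sidecar).

```yaml
# Minimal Envoy admin block. 9901 is a conventional port choice, not a
# Michelangelo default. The admin interface is unauthenticated — restrict
# access in production.
admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901
```

Once enabled, Envoy serves Prometheus-formatted stats at /stats/prometheus on the admin port; point your scrape job at that path.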


Key Metrics

Pipeline Runs

Metric                              Description                                                        Unit
pipelinerun_result_total            Pipeline run results, by state, pipeline_type, environment, tier   Count
pipelinerun_result_failure_total    Failed pipeline runs, with failure_reason label                    Count
pipelinerun_duration_seconds        Pipeline run execution duration (histogram)                        Seconds
pipelinerun_failed                  Gauge: 1 if most recent run failed, 0 if succeeded                 Gauge
pipelinerun_step_success_total      Step completions, by step_name and pipeline_type                   Count
pipeline_ready_total                Pipelines reaching Ready state                                     Count

Workflow Engine

Workflow metrics are emitted by the Cadence or Temporal server, not by Michelangelo. Consult your workflow engine's documentation for its native Prometheus metrics. Michelangelo's worker-side reconcile metrics are captured under the pipelinerun_* counters above.

Model Serving (Envoy)

If you have enabled the Envoy admin interface, these standard Envoy metrics are available:

Metric                            Description                                      Unit
envoy_cluster_upstream_rq_total   Total requests to inference backends             Count
envoy_cluster_upstream_rq_5xx     5xx error responses from inference backends      Count
envoy_cluster_upstream_rq_time    Request latency histogram to inference servers   Milliseconds

Controller Manager Health

The controller manager uses controller-runtime metrics — these are standard across all Kubernetes operators:

Metric                                      Description                                                        Unit
controller_runtime_reconcile_errors_total   Reconcile errors, by controller label                              Count
controller_runtime_reconcile_time_seconds   Reconcile duration histogram                                       Seconds
workqueue_depth                             Work items queued, by name label (one per controller)              Count
workqueue_retries_total                     Work item retries — elevated value indicates persistent failures   Count
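
Before wiring alerts to these metrics, it can help to sanity-check them ad hoc in the Prometheus expression browser. Two example queries, using only the metric and label names documented above:

```promql
# Overall pipeline run failure ratio over the last 15 minutes
sum(rate(pipelinerun_result_failure_total[15m]))
  / sum(rate(pipelinerun_result_total[15m]))

# P99 reconcile latency per controller
histogram_quantile(0.99,
  sum by (controller, le) (rate(controller_runtime_reconcile_time_seconds_bucket[5m])))
```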

Alerting Rules

Add these rules to your Prometheus configuration:

groups:
  - name: michelangelo
    rules:

      # Pipeline run failure rate
      - alert: PipelineRunFailureRateHigh
        expr: rate(pipelinerun_result_failure_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pipeline run failures detected"
          description: >
            Pipeline runs are failing at {{ $value | humanize }} failures/sec.
            Check failure reasons: kubectl -n ma-system get pipelineruns --field-selector status.phase=Failed

      # Pipeline run duration: P99 above 1 hour
      - alert: PipelineRunDurationHigh
        expr: >
          histogram_quantile(0.99,
            rate(pipelinerun_duration_seconds_bucket[5m])
          ) > 3600
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pipeline run P99 duration above 1 hour"
          description: >
            The 99th percentile pipeline run duration is {{ $value | humanize }}s.

      # Controller reconcile errors — sustained error rate from any controller
      - alert: ControllerReconcileErrorRate
        expr: rate(controller_runtime_reconcile_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Controller {{ $labels.controller }} has high reconcile error rate"
          description: >
            The {{ $labels.controller }} controller is failing reconciles at
            {{ $value | humanize }} errors/sec. Check logs:
            kubectl -n ma-system logs deployment/michelangelo-controllermgr

      # Inference latency: P99 above 500ms for 5 minutes
      - alert: InferenceLatencyHigh
        expr: >
          histogram_quantile(0.99,
            rate(envoy_cluster_upstream_rq_time_bucket[5m])
          ) > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Inference P99 latency is above 500ms"
          description: >
            The 99th percentile inference request latency is {{ $value }}ms.
            Check InferenceServer and model-sync sidecar logs.

      # Inference error rate: more than 1% of requests returning 5xx
      - alert: InferenceErrorRateHigh
        expr: >
          rate(envoy_cluster_upstream_rq_5xx[5m])
            / rate(envoy_cluster_upstream_rq_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Inference 5xx error rate above 1%"
          description: >
            {{ $value | humanizePercentage }} of inference requests are returning 5xx errors.

Grafana Dashboard

Create a Grafana dashboard with these panels to get operational visibility at a glance.

Overview row

Panel                   Query                                                 Visualization
Pipeline run results    rate(pipelinerun_result_total[5m])                    Time series
Pipeline run failures   pipelinerun_failed                                    Stat
Pipeline readiness      pipeline_ready_total                                  Stat
Reconcile errors        rate(controller_runtime_reconcile_errors_total[5m])   Time series

Jobs row

Panel                           Query                                                                         Visualization
Pipeline run duration P50/P99   histogram_quantile(0.5/0.99, rate(pipelinerun_duration_seconds_bucket[5m]))   Time series
Failure rate by reason          sum by (failure_reason) (rate(pipelinerun_result_failure_total[5m]))          Time series

Serving row

Panel                      Query                                                                            Visualization
Request rate               rate(envoy_cluster_upstream_rq_total[5m])                                        Time series
Request latency P50/P99    histogram_quantile(0.5/0.99, rate(envoy_cluster_upstream_rq_time_bucket[5m]))   Time series
5xx error rate             rate(envoy_cluster_upstream_rq_5xx[5m])                                          Time series
Active model deployments   sum by (envoy_cluster_name) (envoy_cluster_upstream_rq_total)                    Table

Controller health row

Panel                                Query                                                                                  Visualization
Reconcile error rate by controller   rate(controller_runtime_reconcile_errors_total[5m])                                    Time series
Reconcile latency P99                histogram_quantile(0.99, rate(controller_runtime_reconcile_time_seconds_bucket[5m]))   Time series
Work queue depth                     workqueue_depth                                                                        Time series

Structured Logging

All Michelangelo components emit structured logs. Configure log format and level in the relevant ConfigMap:

logging:
  level: info          # debug | info | warn | error
  development: false   # true enables human-readable console output
  encoding: json       # json for production; console for development

For production deployments use encoding: json so your log aggregation system (Loki, Elasticsearch, CloudWatch Logs, etc.) can parse and query fields natively.

Important log fields to index

Field       Description
level       Log severity
logger      Component/controller name
msg         Log message
namespace   Kubernetes resource namespace
name        Kubernetes resource name
operation   Controller operation (e.g., create_ray_cluster, schedule_job)
error       Error message (present on error-level logs)

Indexing these fields allows you to efficiently query all events for a specific resource (namespace + name), filter by controller (logger), or find all failures across the control plane (level: error).
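
As a concrete illustration, a reconcile failure emitted with encoding: json would look something like the line below. The field names are those documented above; the values (timestamp, resource names, error text) are hypothetical.

```json
{"level":"error","ts":"2024-05-01T12:34:56Z","logger":"pipelinerun-controller","msg":"reconcile failed","namespace":"ml-team-a","name":"churn-train-42","operation":"schedule_job","error":"quota exceeded"}
```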

What's next?

  • Troubleshooting: Use the collected metrics and logs to diagnose issues with the Troubleshooting guide
  • Authentication: Secure access to your metrics endpoints with the Authentication guide
  • Compliance: Set up audit log retention to meet SOC 2, GDPR, or HIPAA requirements in the Compliance guide