Skip to main content

Troubleshooting

This guide is for platform operators diagnosing issues with a Michelangelo deployment. All kubectl commands assume access to the control plane cluster.


Jobs not being scheduled

Symptoms: Jobs are submitted but remain in a pending state with no cluster assignment.

Diagnostics:

# Check for scheduler errors in the controller manager
kubectl -n ma-system logs deployment/michelangelo-controllermgr | grep -i "scheduler\|assign\|enqueue"

# List registered compute clusters and their status
kubectl -n ma-system get clusters

# Inspect a specific cluster's status conditions
kubectl -n ma-system describe cluster <cluster-name>

Likely causes:

  • No compute clusters are registered — complete the cluster registration steps
  • The Cluster CRD has the wrong host or port — the control plane cannot reach the compute cluster's API server
  • The ray-manager token Secret in the control plane is missing or has expired
  • The job requested resources (GPU, CPU) that no registered cluster can satisfy

Compute cluster registration failures

Symptoms: The Cluster CRD is created but the cluster status shows unhealthy or unknown.

Diagnostics:

# Inspect cluster conditions
kubectl -n ma-system describe cluster <cluster-name>

# Verify the token and CA secrets exist in the control plane
kubectl -n default get secret cluster-<cluster-name>-client-token
kubectl -n default get secret cluster-<cluster-name>-ca-data

# Confirm the token secret is populated (output should be > 0)
kubectl -n default get secret cluster-<cluster-name>-client-token \
-o jsonpath='{.data.token}' | wc -c

# Test network connectivity from the control plane to the compute API server
kubectl -n ma-system run connectivity-test --rm -it --restart=Never \
--image=curlimages/curl -- curl -k https://<compute-host>:<port>/healthz

Likely causes:

  • Network policy or firewall is blocking the control plane from reaching the compute cluster's API server
  • The token Secret is missing the token key or was not populated (check the kubernetes.io/service-account.name annotation on the Secret)
  • The CA data does not match the compute cluster's TLS certificate (CA data mismatch)

Ray pods not starting on the compute cluster

Symptoms: A RayCluster or RayJob resource is created on the compute cluster, but head or worker pods remain Pending or enter CrashLoopBackOff.

Diagnostics:

# Check on the compute cluster (use its kubectl context)
kubectl --context <compute-context> get rayclusters,rayjobs
kubectl --context <compute-context> describe raycluster <name>

# List pods for the cluster
kubectl --context <compute-context> get pods -l ray.io/cluster=<cluster-name>

# Check head pod logs
kubectl --context <compute-context> logs <head-pod-name>

# Verify storage config is present on the compute cluster
kubectl --context <compute-context> get configmap michelangelo-config
kubectl --context <compute-context> get secret aws-credentials

Likely causes:

  • The michelangelo-config ConfigMap is missing or has the wrong AWS_ENDPOINT_URL
  • The container image cannot be pulled (wrong registry, missing imagePullSecret)
  • Insufficient CPU or memory quota on the compute cluster — check kubectl --context <compute-context> describe nodes

Worker cannot connect to the API server

Symptoms: Worker pods crash-loop or restart repeatedly. Logs show connection refused, TLS errors, or authentication failures connecting to the API server.

Diagnostics:

# Check recent worker logs
kubectl -n ma-system logs deployment/michelangelo-worker --tail=100

# Verify the worker's configured API server address
kubectl -n ma-system get configmap michelangelo-worker-config -o yaml | grep -A3 "worker:"

# Confirm the API server deployment is running
kubectl -n ma-system get deployment michelangelo-apiserver
kubectl -n ma-system get pods -l app=michelangelo-apiserver

Likely causes:

  • worker.address in the worker ConfigMap points to the wrong hostname or port — it must resolve to the API server from within the ma-system namespace
  • worker.useTLS: true is set but the API server's certificate is not trusted — ensure the CA bundle is mounted into the worker pod
  • The API server is not yet ready (check its pod status and readiness probe)

Temporal / Cadence connectivity issues

Symptoms: Workflows fail to start. Worker logs contain errors like failed to connect to temporal, context deadline exceeded, or domain not found.

Diagnostics:

# Check worker logs for workflow engine errors
kubectl -n ma-system logs deployment/michelangelo-worker | grep -i "temporal\|cadence\|workflow"

# Inspect the configured workflow engine endpoint
kubectl -n ma-system get configmap michelangelo-worker-config -o yaml \
| grep -A8 "workflow-engine:"

# Test TCP connectivity to Temporal from a worker pod
kubectl -n ma-system exec deployment/michelangelo-worker -- \
nc -zv temporal.your-domain.com 7233

Likely causes:

  • workflow-engine.host has the wrong hostname or port (Temporal default is 7233)
  • The Temporal domain (uniflow, default) has not been created — create it with the Temporal CLI or admin tools
  • Network policy in ma-system is blocking egress to the Temporal endpoint

InferenceServer not becoming healthy

Symptoms: An InferenceServer resource is created but stays in a non-Ready state. The Deployment controller cannot deploy models to it because the server is not healthy.

Diagnostics:

# Check InferenceServer status and conditions
kubectl get inferenceservers
kubectl describe inferenceserver <name>

# Check the underlying Kubernetes Deployment
kubectl get deployment -l app=<inferenceserver-name>
kubectl describe deployment <inferenceserver-deployment>

# Check model-sync sidecar logs
kubectl logs <inferenceserver-pod-name> -c model-sync

Likely causes:

  • The backend type is not registered in the controller manager — check controller manager logs for unknown backend type
  • The inference server container image cannot be pulled
  • The model-sync sidecar cannot connect to S3 to download models (see S3 errors below)
  • Insufficient GPU resources on the node — check kubectl describe node for allocatable GPU count

Model not loading (Deployment stuck in Asset Preparation)

Symptoms: A Deployment resource is created but remains in the AssetPreparation or ResourceAcquisition stage indefinitely.

Diagnostics:

# Check Deployment status
kubectl get deployments.michelangelo.api
kubectl describe deployment.michelangelo.api <name>

# Check model-sync sidecar for download errors
kubectl logs <inferenceserver-pod> -c model-sync

# Verify the model config ConfigMap was created
kubectl get configmap <inferenceserver-name>-model-config -o yaml

Likely causes:

  • The model artifact is not at the expected S3 path — verify the registered model's artifactUri matches what is actually in S3
  • S3 credentials in the inference pod do not have s3:GetObject permission on the model bucket
  • The inference server has reached its maximum number of loaded models — check the serving framework's capacity limits

S3 / object store errors

Symptoms: Jobs fail with access denied or endpoint unreachable errors. Model downloads fail in the model-sync sidecar.

Diagnostics:

# Check controller manager storage config
kubectl -n ma-system get configmap michelangelo-controllermgr-config -o yaml \
| grep -A5 "minio:"

# Test S3 access from a worker pod
kubectl -n ma-system exec deployment/michelangelo-worker -- \
aws s3 ls s3://your-bucket/ --endpoint-url http://your-minio-endpoint

# Check for IAM role annotation on the relevant ServiceAccount
kubectl -n ma-system get serviceaccount michelangelo-controllermgr -o yaml \
| grep -i iam

Likely causes:

  • useIam: true is set but the pod's ServiceAccount does not have an IAM role annotation, so no credentials are injected
  • awsEndpointUrl is missing the URL scheme (http:// or https://) or has the wrong port
  • The S3 bucket does not exist or is in a different region than awsRegion specifies
  • Pod-level network policy is blocking outbound traffic to the S3 endpoint

UI not loading or API calls failing

Symptoms: The Michelangelo UI shows a blank page, a CORS error in the browser console, or API calls return 502/504.

Diagnostics:

# Check Envoy and UI pod status
kubectl get pods | grep -E "envoy|ui|apiserver"
kubectl logs deployment/michelangelo-ui

# Check Envoy configuration
kubectl get configmap envoy-config -o yaml

Likely causes:

  • apiBaseUrl in the UI's config.json does not match the actual Envoy ingress hostname — they must match exactly
  • The Envoy cluster's socket_address.address for michelangelo-apiserver is wrong — it must be the Kubernetes service name for the API server within the cluster
  • CORS allowed origins in the Envoy config do not include the origin from which users are accessing the UI
  • The Ingress resource for the UI or API server is misconfigured (wrong hostname, missing TLS secret)

What's Next

  • Monitoring: Set up proactive alerting so issues surface before users report them in the Monitoring guide
  • Network & Ingress: Resolve Ingress and CORS issues at the source with the Network guide
  • Authentication: Fix RBAC and OIDC configuration issues with the Authentication guide

Pipeline stuck in Terminating after cascade delete

Symptoms: A Pipeline has a deletionTimestamp set but does not disappear. Under foreground propagation (the ma pipeline delete default) it remains in Terminating state because one or more child runs are still draining.

Diagnostics:

# Check Pipeline finalizers and deletion timestamp
kubectl get pipeline <name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
kubectl get pipeline <name> -n <namespace> -o jsonpath='{.metadata.deletionTimestamp}'

# List child PipelineRuns (and TriggerRuns — same query) that still have drain
# finalizers, with each child's own deletionTimestamp (the per-child 24h clock)
kubectl get pipelineruns -n <namespace> -o json | \
jq '.items[] | select(.metadata.ownerReferences[]?.uid == "<pipeline-uid>") | {name: .metadata.name, deletionTimestamp: .metadata.deletionTimestamp, finalizers: .metadata.finalizers}'

# Check controller manager logs for cascade-related messages
kubectl -n ma-system logs deployment/michelangelo-controllermgr | grep -i "cascade\|drain\|force"

The safety timeout is per child, keyed off each child's own deletionTimestamp (above) — there is no Pipeline-level cascade annotation or clock. Watch the per-kind drain metrics to see which kind is wedged: cascade_child_drain_active{kind="pipeline_run"|"trigger_run"} (currently draining) and cascade_child_drain_timeout_total{kind=...} (drains that blew past the 24h backstop).

Likely causes:

  • Child PipelineRuns or TriggerRuns still have drain finalizers — their workflows have not finished cancelling. Wait for drains to complete (or up to 24 hours after each child's deletionTimestamp for that child's safety timeout to force-remove its finalizer).
  • The controller manager is not running — the drain controllers need to be running to process child drain finalizers. Ensure the controller manager is healthy.
  • The workflow engine (Cadence/Temporal) is unreachable — the drain controller cannot cancel the workflow. Check workflow engine connectivity.

Child resources not deleted after Pipeline cascade delete

Symptoms: The Pipeline is gone, but PipelineRuns or TriggerRuns that belonged to it remain.

Diagnostics — check whether the orphaned children carry an ownerReference to the Pipeline:

kubectl get pipelineruns -n <namespace> -o json | \
jq '.items[] | select(.metadata.ownerReferences[]?.name == "<pipeline-name>") | .metadata.name'

Likely cause: the Pipeline was deleted with a non-cascading propagation policy (--cascade=orphan/background), or the children predate the ownerReference backfill (lazy, applied on each child's next reconcile).

Remedy: re-run with foreground cascade — or, if the Pipeline is already gone, delete the runs directly (ma pipeline_run delete / kubectl delete pipelinerun …), which drains and removes them individually:

kubectl delete pipeline <name> -n <namespace> --cascade=foreground