Troubleshooting
This guide is for platform operators diagnosing issues with a Michelangelo deployment. Unless a --context flag targets a compute cluster, all kubectl commands assume access to the control plane cluster.
Jobs not being scheduled
Symptoms: Jobs are submitted but remain in a pending state with no cluster assignment.
Diagnostics:
# Check for scheduler errors in the controller manager
kubectl -n ma-system logs deployment/michelangelo-controllermgr | grep -i "scheduler\|assign\|enqueue"
# List registered compute clusters and their status
kubectl -n ma-system get clusters
# Inspect a specific cluster's status conditions
kubectl -n ma-system describe cluster <cluster-name>
Likely causes:
- No compute clusters are registered — complete the cluster registration steps
- The Cluster CRD has the wrong `host` or `port` — the control plane cannot reach the compute cluster's API server (a registration sketch follows this list)
- The `ray-manager` token Secret in the control plane is missing or has expired
- The job requested resources (GPU, CPU) that no registered cluster can satisfy
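If no cluster is registered, a minimal registration looks roughly like the sketch below. The API group and spec field names are assumptions (inferred from the resource names elsewhere in this guide); the cluster registration guide has the authoritative schema.
# Hypothetical Cluster manifest; API group and spec fields are assumptions,
# not the authoritative schema. See the cluster registration guide.
kubectl -n ma-system apply -f - <<'EOF'
apiVersion: michelangelo.api/v1
kind: Cluster
metadata:
  name: gpu-cluster-a
spec:
  host: compute-a.example.com   # must be reachable from the control plane
  port: 6443                    # compute cluster API server port
EOF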
Compute cluster registration failures
Symptoms: The Cluster CRD is created but the cluster status shows unhealthy or unknown.
Diagnostics:
# Inspect cluster conditions
kubectl -n ma-system describe cluster <cluster-name>
# Verify the token and CA secrets exist in the control plane
kubectl -n default get secret cluster-<cluster-name>-client-token
kubectl -n default get secret cluster-<cluster-name>-ca-data
# Confirm the token secret is populated (output should be > 0)
kubectl -n default get secret cluster-<cluster-name>-client-token \
-o jsonpath='{.data.token}' | wc -c
# Test network connectivity from the control plane to the compute API server
kubectl -n ma-system run connectivity-test --rm -it --restart=Never \
--image=curlimages/curl -- curl -k https://<compute-host>:<port>/healthz
Likely causes:
- Network policy or firewall is blocking the control plane from reaching the compute cluster's API server
- The token Secret is missing the `token` key or was not populated (check the `kubernetes.io/service-account.name` annotation on the Secret; a recreation sketch follows this list)
- The CA data in the Secret does not match the compute cluster's TLS certificate
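On Kubernetes 1.24+, service-account token Secrets are no longer created automatically, so a missing or expired token usually means recreating the Secret by hand. A sketch, assuming the ServiceAccount created during registration is named michelangelo-agent (adjust to your actual account):
# Run against the COMPUTE cluster. The ServiceAccount name below is an
# assumption; use the account created during cluster registration.
kubectl --context <compute-context> apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: michelangelo-agent-token
  annotations:
    kubernetes.io/service-account.name: michelangelo-agent
type: kubernetes.io/service-account-token
EOF
# Once the token controller populates it, copy the token into the control
# plane Secret (cluster-<cluster-name>-client-token) that the controller reads.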
Ray pods not starting on the compute cluster
Symptoms: A RayCluster or RayJob resource is created on the compute cluster, but head or worker pods remain Pending or enter CrashLoopBackOff.
Diagnostics:
# Check on the compute cluster (use its kubectl context)
kubectl --context <compute-context> get rayclusters,rayjobs
kubectl --context <compute-context> describe raycluster <name>
# List pods for the cluster
kubectl --context <compute-context> get pods -l ray.io/cluster=<cluster-name>
# Check head pod logs
kubectl --context <compute-context> logs <head-pod-name>
# Verify storage config is present on the compute cluster
kubectl --context <compute-context> get configmap michelangelo-config
kubectl --context <compute-context> get secret aws-credentials
Likely causes:
- The `michelangelo-config` ConfigMap is missing or has the wrong `AWS_ENDPOINT_URL`
- The container image cannot be pulled (wrong registry, missing imagePullSecret); recent cluster events usually pinpoint this, as shown below
- Insufficient CPU or memory quota on the compute cluster — check `kubectl --context <compute-context> describe nodes`
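Scheduling and image-pull failures both surface as events on the compute cluster, so filtering recent events is often the fastest way to tell them apart:
# Surface scheduling and image-pull failures via recent events
kubectl --context <compute-context> get events --sort-by=.lastTimestamp \
  | grep -iE "failedscheduling|imagepull|backoff"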
Worker cannot connect to the API server
Symptoms: Worker pods crash-loop or restart repeatedly. Logs show connection refused, TLS errors, or authentication failures connecting to the API server.
Diagnostics:
# Check recent worker logs
kubectl -n ma-system logs deployment/michelangelo-worker --tail=100
# Verify the worker's configured API server address
kubectl -n ma-system get configmap michelangelo-worker-config -o yaml | grep -A3 "worker:"
# Confirm the API server deployment is running
kubectl -n ma-system get deployment michelangelo-apiserver
kubectl -n ma-system get pods -l app=michelangelo-apiserver
Likely causes:
- `worker.address` in the worker ConfigMap points to the wrong hostname or port — it must resolve to the API server from within the `ma-system` namespace (see the sketch after this list)
- `worker.useTLS: true` is set but the API server's certificate is not trusted — ensure the CA bundle is mounted into the worker pod
- The API server is not yet ready (check its pod status and readiness probe)
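A sketch of the relevant ConfigMap keys, assuming the API server is exposed by a Service named michelangelo-apiserver in ma-system; the port is a placeholder, not a documented default:
worker:
  address: michelangelo-apiserver.ma-system.svc.cluster.local:8080   # port is a placeholder
  useTLS: true   # only with a CA bundle mounted into the worker pod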
Temporal / Cadence connectivity issues
Symptoms: Workflows fail to start. Worker logs contain errors like `failed to connect to temporal`, `context deadline exceeded`, or `domain not found`.
Diagnostics:
# Check worker logs for workflow engine errors
kubectl -n ma-system logs deployment/michelangelo-worker | grep -i "temporal\|cadence\|workflow"
# Inspect the configured workflow engine endpoint
kubectl -n ma-system get configmap michelangelo-worker-config -o yaml \
| grep -A8 "workflow-engine:"
# Test TCP connectivity to Temporal from a worker pod
kubectl -n ma-system exec deployment/michelangelo-worker -- \
nc -zv temporal.your-domain.com 7233
Likely causes:
- `workflow-engine.host` has the wrong hostname or port (Temporal's default is `7233`)
- The Temporal domain (`uniflow`, `default`) has not been created — create it with the Temporal CLI or admin tools (see the example after this list)
- Network policy in `ma-system` is blocking egress to the Temporal endpoint
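Registering the missing namespace or domain is a single command. Flag names vary by CLI version, so treat these as sketches and check your CLI's help output; the retention value is illustrative.
# Temporal (legacy tctl CLI); retention value is illustrative
tctl --namespace uniflow namespace register --retention 3
# Cadence equivalent
cadence --domain uniflow domain register --retention 3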
InferenceServer not becoming healthy
Symptoms: An InferenceServer resource is created but stays in a non-Ready state. The Deployment controller cannot deploy models to it because the server is not healthy.
Diagnostics:
# Check InferenceServer status and conditions
kubectl get inferenceservers
kubectl describe inferenceserver <name>
# Check the underlying Kubernetes Deployment
kubectl get deployment -l app=<inferenceserver-name>
kubectl describe deployment <inferenceserver-deployment>
# Check model-sync sidecar logs
kubectl logs <inferenceserver-pod-name> -c model-sync
Likely causes:
- The backend type is not registered in the controller manager — check controller manager logs for `unknown backend type`
- The inference server container image cannot be pulled
- The model-sync sidecar cannot connect to S3 to download models (see S3 errors below)
- Insufficient GPU resources on the node — check `kubectl describe node` for the allocatable GPU count (see the command after this list)
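Assuming the NVIDIA device plugin (which exposes the nvidia.com/gpu resource name), a quick per-node GPU count:
# Allocatable GPUs per node; the resource name assumes the NVIDIA device plugin
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'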
Model not loading (Deployment stuck in Asset Preparation)
Symptoms: A Deployment resource is created but remains in the AssetPreparation or ResourceAcquisition stage indefinitely.
Diagnostics:
# Check Deployment status
kubectl get deployments.michelangelo.api
kubectl describe deployment.michelangelo.api <name>
# Check model-sync sidecar for download errors
kubectl logs <inferenceserver-pod> -c model-sync
# Verify the model config ConfigMap was created
kubectl get configmap <inferenceserver-name>-model-config -o yaml
Likely causes:
- The model artifact is not at the expected S3 path — verify the registered model's `artifactUri` matches what is actually in S3 (see the check after this list)
- S3 credentials in the inference pod do not have `s3:GetObject` permission on the model bucket
- The inference server has reached its maximum number of loaded models — check the serving framework's capacity limits
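A quick existence check, reusing the worker pod's S3 credentials as the S3 diagnostics below do; the bucket and path are placeholders for the registered artifactUri:
# List the path the model registry claims; an empty listing or AccessDenied
# points at a stale artifactUri or missing permissions
kubectl -n ma-system exec deployment/michelangelo-worker -- \
  aws s3 ls s3://<bucket>/<artifact-path>/ --endpoint-url <your-s3-endpoint>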
S3 / object store errors
Symptoms: Jobs fail with access denied or endpoint unreachable errors. Model downloads fail in the model-sync sidecar.
Diagnostics:
# Check controller manager storage config
kubectl -n ma-system get configmap michelangelo-controllermgr-config -o yaml \
| grep -A5 "minio:"
# Test S3 access from a worker pod
kubectl -n ma-system exec deployment/michelangelo-worker -- \
aws s3 ls s3://your-bucket/ --endpoint-url http://your-minio-endpoint
# Check for IAM role annotation on the relevant ServiceAccount
kubectl -n ma-system get serviceaccount michelangelo-controllermgr -o yaml \
| grep -i iam
Likely causes:
- `useIam: true` is set but the pod's ServiceAccount does not have an IAM role annotation, so no credentials are injected (see the annotation example after this list)
- `awsEndpointUrl` is missing the URL scheme (`http://` or `https://`) or has the wrong port
- The S3 bucket does not exist or is in a different region than `awsRegion` specifies
- Pod-level network policy is blocking outbound traffic to the S3 endpoint
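On EKS with IAM Roles for Service Accounts (IRSA), the annotation looks like this; the role ARN is a placeholder for a role with S3 read access:
# EKS/IRSA only; the role ARN is a placeholder
kubectl -n ma-system annotate serviceaccount michelangelo-controllermgr \
  eks.amazonaws.com/role-arn=arn:aws:iam::<account-id>:role/<s3-access-role> \
  --overwrite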
UI not loading or API calls failing
Symptoms: The Michelangelo UI shows a blank page, a CORS error in the browser console, or API calls return 502/504.
Diagnostics:
# Check Envoy and UI pod status
kubectl get pods | grep -E "envoy|ui|apiserver"
kubectl logs deployment/michelangelo-ui
# Check Envoy configuration
kubectl get configmap envoy-config -o yaml
Likely causes:
- `apiBaseUrl` in the UI's `config.json` does not match the actual Envoy ingress hostname — they must match exactly
- The Envoy cluster's `socket_address.address` for `michelangelo-apiserver` is wrong — it must be the Kubernetes Service name for the API server within the cluster (see the snippet after this list)
- CORS allowed origins in the Envoy config do not include the origin from which users are accessing the UI
- The Ingress resource for the UI or API server is misconfigured (wrong hostname, missing TLS secret)
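For orientation, the endpoint fragment inside the Envoy cluster definition looks roughly like this; the Service DNS name and port are assumptions for your deployment:
# Fragment of the Envoy cluster's endpoint config; name and port are placeholders
socket_address:
  address: michelangelo-apiserver.ma-system.svc.cluster.local
  port_value: 8080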
What's next?
- Monitoring: Set up proactive alerting so issues surface before users report them in the Monitoring guide
- Network & Ingress: Resolve Ingress and CORS issues at the source with the Network guide
- Authentication: Fix RBAC and OIDC configuration issues with the Authentication guide