Ingester Controller: Configuration and Operations
The Ingester is a Kubernetes controller that syncs all Michelangelo CRDs into MySQL, decoupling metadata storage from etcd. This guide covers deploying, configuring, and operating the ingester on production clusters.
Architecture Overview
The ingester maintains consistency between Kubernetes and MySQL:
┌─────────────────┐
kubectl/gRPC → │ API Server │ → creates CR in K8s + adds finalizer
└────────┬────────┘
│ K8s event
┌────────▼────────┐
│ Ingester │ → upserts to MySQL
│ Controller │ → removes finalizer on delete
└────────┬────────┘
│ SQL
┌────────▼────────┐
│ MySQL │ 13 tables + labels + annotations
└─────────────────┘
Every CRD object created through the API Server is automatically synced to MySQL. When deleted, the ingester ensures MySQL is updated before the object is removed from Kubernetes.
MySQL Storage
Schema Layout
For each of the 13 CRDs, the ingester creates 3 MySQL tables:
| Table Type | Purpose |
|---|---|
Main (e.g., model) | Core object data (uid, name, namespace, JSON, proto, indexed fields) |
Labels (e.g., model_labels) | Key-value label pairs per object |
Annotations (e.g., model_annotations) | Key-value annotation pairs per object |
Total: 39 tables (13 CRDs × 3 table types)
Supported CRDs: Project, ModelFamily, Model, Pipeline, PipelineRun, InferenceServer, Revision, Cluster, RayCluster, RayJob, TriggerRun, Deployment, SparkJob
Main Table Schema Example
CREATE TABLE model (
uid VARCHAR(64) NOT NULL, -- K8s UID (primary key)
group_ver VARCHAR(128), -- APIVersion string
namespace VARCHAR(256),
name VARCHAR(256),
res_version BIGINT, -- K8s ResourceVersion
create_time DATETIME(6),
update_time DATETIME(6),
delete_time DATETIME(6), -- NULL = active, non-NULL = soft-deleted
proto LONGBLOB, -- serialized protobuf
json JSON, -- full object as JSON
-- CRD-specific indexed fields, e.g.:
algorithm VARCHAR(128), -- for Model
PRIMARY KEY (uid),
INDEX idx_namespace_name (namespace, name),
INDEX idx_delete_time (delete_time)
);
All deletions are soft-deletes: DELETE sets delete_time rather than removing the row. Live object queries use WHERE delete_time IS NULL.
Configuration
Do not enable the ingester if your deployment uses SparkJob. A pre-existing nil pointer panic in go/components/spark/job/client/client.go:185 causes the controller manager to crash when syncing SparkJob objects, which prevents ALL MySQL sync. This blocks other CRD kinds too, not just SparkJob. See Known Limitations for details. This requires an upstream code fix before enabling.
Prerequisites
- Kubernetes cluster with Michelangelo API Server
- MySQL 5.7+ accessible from the controllermgr pod
- Schema init Job (provides 39 MySQL tables)
Enabling the Ingester
The ingester activates through two configuration gates in the controllermgr ConfigMap:
Gate 1: metadataStorage.enableMetadataStorage must be true
Gate 2: MySQL connection details must be provided
Both gates are required. If either is missing or disabled, the ingester silently stays disabled with zero impact on the rest of the controllermgr.
ConfigMap Configuration
Edit the controllermgr ConfigMap to add or update these stanzas:
# michelangelo-controllermgr-config
# Gate 1: Enable metadata storage
metadataStorage:
enableMetadataStorage: true
deletionDelay: 10s
enableResourceVersionCache: false
# Gate 2: MySQL connection details
mysql:
host: mysql
port: 3306
user: root
password: root
database: michelangelo
maxOpenConns: 25
maxIdleConns: 5
connMaxLifetime: 5m
# Optional: ingester-specific concurrency settings
ingester:
concurrentReconciles: 2
requeuePeriod: 30s
| Setting | Purpose | Default |
|---|---|---|
enableMetadataStorage | Master switch for the feature | false |
deletionDelay | Grace period for soft-deletes | 10s |
enableResourceVersionCache | Cache K8s resource versions (experimental) | false |
host | MySQL hostname or IP | (required) |
port | MySQL port | 3306 |
user | MySQL user account | (required) |
password | MySQL password | (required) |
database | MySQL database name | (required) |
maxOpenConns | Max concurrent connections | 25 |
maxIdleConns | Max idle connections to keep | 5 |
connMaxLifetime | Max connection lifetime | 5m |
concurrentReconciles | Parallel reconcilers per CRD | 2 |
requeuePeriod | Retry interval on failure | 30s |
With 13 CRDs and concurrentReconciles: 2, you get 13 independent work queues, each with 2 parallel workers.
Migration: Enabling on a Running Cluster
The ingester is designed to be enabled without downtime. Existing objects in Kubernetes will be picked up automatically on the controller's first startup.
Step-by-Step Enablement
1. Create MySQL Schema
Apply the schema init Job to create the 39 tables:
kubectl apply -f scripts/ingester/ingester-schema-init-job.yaml
kubectl wait --for=condition=complete job/ingester-schema-init --timeout=120s
If the Job fails:
# Inspect why it failed
kubectl describe job/ingester-schema-init
kubectl logs job/ingester-schema-init
# Common causes: MySQL unreachable, wrong credentials, insufficient privileges
# After fixing the root cause, delete the failed Job and reapply:
kubectl delete job ingester-schema-init
kubectl apply -f scripts/ingester/ingester-schema-init-job.yaml
kubectl wait --for=condition=complete job/ingester-schema-init --timeout=120s
2. Update the Controllermgr ConfigMap
Add both required stanzas (metadataStorage and mysql):
kubectl edit configmap michelangelo-controllermgr-config
Add the configuration shown in the ConfigMap Configuration section above.
3. Restart the Controllermgr
Restart the controller manager to pick up the new configuration:
# If controllermgr is a Deployment:
kubectl rollout restart deployment michelangelo-controllermgr
# If controllermgr is a bare Pod (e.g., sandbox):
kubectl delete pod michelangelo-controllermgr
4. Verify Controllers Registered
Check that all 13 controllers registered successfully:
kubectl logs -l app=michelangelo-controllermgr | grep "Ingester controller registered"
You should see 13 log lines (one per CRD kind).
5. Verify Backfill
All existing objects will be reconciled once on startup. Verify MySQL was populated:
for table in project modelfamily model pipeline pipelinerun inferenceserver \
revision cluster raycluster rayjob triggerrun deployment sparkjob; do
COUNT=$(kubectl exec pod/mysql -- mysql -uroot -proot michelangelo -sN \
-e "SELECT COUNT(*) FROM ${table} WHERE delete_time IS NULL;" 2>/dev/null)
echo "${table}: ${COUNT}"
done
All counts should be non-zero (assuming you have existing objects).
Operational Guidance
Deletion Behavior
When an object is deleted through the API Server or kubectl, the ingester ensures MySQL is updated before the object is removed from Kubernetes:
- User issues delete command
- Kubernetes sets
DeletionTimestampon the object (or API Server sets deletion annotation) - Ingester detects change and soft-deletes from MySQL (sets
delete_time) - Ingester removes the finalizer
- Kubernetes deletes the object from etcd
This two-phase delete guarantees no data loss.
Immutable Objects
Objects marked with the michelangelo/Immutable annotation are moved to MySQL only:
- Ingester updates MySQL one final time
- Finalizer is removed
- Object is deleted from Kubernetes
Immutable objects (like completed PipelineRuns) no longer consume etcd memory.
Pre-Existing Objects
Objects created before the ingester was enabled will NOT have the ingester finalizer. They are still synced to MySQL on the controller's first startup, but are not intercepted during deletion via kubectl delete. If a user deletes such an object directly through kubectl, Kubernetes removes it from etcd immediately — the ingester never sees the deletion — leaving behind an orphan row in MySQL with delete_time IS NULL. Over time this causes MySQL to drift out of sync with etcd.
Two mitigations are available:
-
Backfill controller (not yet implemented): A periodic reconciliation controller that compares MySQL rows against etcd and soft-deletes any rows whose corresponding etcd object no longer exists. This is the cleanest long-term solution but requires additional engineering work.
-
Require all deletes through the API Server: The API Server sets the
michelangelo/Deletingannotation instead of issuing a direct Kubernetes delete, which the ingester detects and handles correctly even without a finalizer. Enforce this operationally by restricting directkubectl deleteaccess to CRD objects.
Until a backfill controller is in place, option 2 is the recommended operational workaround.
Schema Evolution
Adding a new indexed column to an existing table requires a schema migration:
- Create and apply an
ALTER TABLEmigration Job - Update the code to populate the new indexed field
- Trigger a reconcile of affected objects (e.g., by bumping a resource version annotation) to backfill the new column
The schema init Job is create-only (CREATE TABLE IF NOT EXISTS). It does not apply ALTER TABLE migrations automatically.
Disabling the Ingester
To disable the ingester without data loss:
- Remove the
mysql:stanza from the controllermgr ConfigMap - Restart the controllermgr
- The ingester module will detect missing MySQL config and skip setup
- MySQL data is preserved; you can re-enable the ingester later
The controllermgr will continue operating normally with the ingester disabled.
Monitoring and Troubleshooting
Verify Ingester is Running
Check controller logs for startup messages:
kubectl logs -l app=michelangelo-controllermgr | grep -i ingester
You should see messages like:
"Ingester controller registered for <Kind>""Metadata storage enabled""MySQL connection established"
Check MySQL Connectivity
Verify the controllermgr can reach MySQL:
kubectl exec -it deployment/michelangelo-controllermgr -- \
mysql -h<mysql-host> -u<user> -p<password> -e "SELECT 1;"
Monitor Sync Status
Query MySQL to verify objects are syncing:
# Check Model count
kubectl exec pod/mysql -- mysql -uroot -proot michelangelo -sN \
-e "SELECT COUNT(*) FROM model WHERE delete_time IS NULL;"
# Check for recent updates
kubectl exec pod/mysql -- mysql -uroot -proot michelangelo -sN \
-e "SELECT namespace, name, update_time FROM model ORDER BY update_time DESC LIMIT 5;"
Detect Requeue Issues
If objects are not syncing, check for requeue errors in logs:
kubectl logs -l app=michelangelo-controllermgr | grep -i "requeue\|error" | tail -20
Requeue errors typically indicate:
- MySQL is unreachable
- Database credentials are wrong
- Permission issues on the MySQL connection
Known Limitations and Issues
| Issue | Severity | Impact |
|---|---|---|
| SparkJob double-panic crash | High | Controllermgr may crash when syncing SparkJob, preventing ALL MySQL sync. Requires fix in SparkJob controller code. |
| Pre-existing objects lack finalizer | Medium | Objects created before ingester enable won't be intercepted during kubectl delete. Mitigation: use API Server for all deletes. |
DeleteCollection not implemented | Medium | Namespace-scoped bulk deletes return error. Use individual object deletes. |
| No schema migration support | Medium | New indexed columns require manual ALTER TABLE and backfill. |
Label selector in List not implemented | Low | SQL label filtering not yet wired. Use JSON extraction in queries. |
SparkJob Double-Panic Issue
Status: Pre-existing bug in go/components/spark/job/client/client.go:185
Symptom: Controllermgr crashes with panic when processing SparkJob objects
Mitigation: Do not enable the ingester until SparkJob controller is fixed, or disable SparkJob reconciliation in your controllermgr config
Fix required: Upstream fix to SparkJob controller error handling
Verified Behavior
When the ingester is correctly enabled:
- All 13 controllers register at startup (one per CRD kind)
- Immediate sync on creation — objects appear in MySQL within milliseconds
- Immediate sync on update —
res_versionandupdate_timeadvance in MySQL after every change - Full JSON stored — complete object JSON in the
jsoncolumn - Indexed fields stored —
algorithm,ray_version,entrypoint, etc. in dedicated indexed columns - Labels synced — label changes reflected in
*_labelscompanion tables - Opt-in disabled by default — ingester only runs when MySQL config is present in
michelangelo-controllermgr-config - SparkJob blocked — pre-existing nil pointer panic in the SparkJob business controller prevents sync (unrelated to ingester)
Next Steps
Once the ingester is running, verify steady-state behavior: all 13 CRD kinds appear in MySQL, object counts match etcd, and update_time advances on changes. Then:
- Monitor: Set up alerting on requeue errors — elevated
workqueue_retries_totalfor ingester controllers indicates MySQL connectivity issues. See Monitoring. - Restrict deletes: Enforce all CRD deletes through the Michelangelo API Server (not raw
kubectl delete) to prevent orphan rows from pre-existing objects. - Review internals: See Ingester Internals for developer documentation on extending the ingester or adding new CRD kinds.