SQL Key Concepts and Terms
This page is the SQL reference for operators querying Michelangelo's metadata database — schema layout, indexed columns, safe query patterns, and known storage limitations. Michelangelo uses SQL for platform metadata, not for storing training datasets or feature values. The ingester syncs Kubernetes custom resources into MySQL so API and operations workflows can query metadata without depending only on etcd.
Audience: Platform operators running Michelangelo's ingester and metadata DB in production.
Prerequisites: The ingester must be deployed and connected to MySQL. See Ingester Controller: Configuration and Operations if you haven't done this yet.
You'll reach for this page when:
- Writing a diagnostic SQL query to investigate a stuck `pipelinerun` or missing model
- Deciding which columns are indexed before adding a filter to an ad-hoc query
- Debugging a label filter that returns unexpected results
- Reading the schema files to understand what the ingester writes
SQL Surfaces
| Surface | Location | Purpose |
|---|---|---|
| Helm schema | helm/michelangelo/files/schema/mysql-init-schema.sql | Schema bundled into the Helm chart and mounted into the API server schema-init container |
| Standalone ingester schema | scripts/ingester/ingester_schema.sql | The file actually referenced by ingester_schema_job.yaml and init_ingester_db.sh at runtime. This is the authoritative copy for ingester setup. |
| Complete ingester schema | scripts/ingester/complete_ingester_schema.sql | A convenience copy kept in sync with ingester_schema.sql by hand. No automation references this file directly. |
| Schema init Job | scripts/ingester/ingester_schema_job.yaml | Kubernetes Job that waits for MySQL and creates the ingester tables |
| Local init script | scripts/ingester/init_ingester_db.sh | Shell helper for initializing a reachable MySQL instance |
| Runtime SQL code | go/storage/mysql/mysql.go | MySQL implementation for upserts, reads, list queries, labels, annotations, and soft deletes |
The three .sql files above are byte-identical copies of the same schema. No automation enforces this — when you change one, update the other two by hand to keep them in sync.
Core Terms
| Term | Meaning |
|---|---|
| Metadata storage | The optional SQL-backed store for Michelangelo custom resource metadata |
| Ingester | Controller that watches Michelangelo CRDs and writes their metadata to MySQL |
| CRD table | Main table for one Kubernetes custom resource kind, such as model or pipelinerun |
| Side table | Per-kind table for labels or annotations, such as model_labels or model_annotations |
| Extracted column | A CRD field copied into a dedicated SQL column so callers can read it without parsing the json payload. Extracted columns are not necessarily indexed — see Extracted Columns and SQL Indexes |
| SQL index | An explicit KEY declared on a column (or column tuple) in the schema, which lets MySQL satisfy filters on those columns without a full table scan |
| Soft delete | Delete behavior that sets delete_time instead of removing the row |
| Resource version | Kubernetes metadata.resourceVersion, stored as res_version for reconciliation ordering |
| Proto column | Serialized protobuf representation of the object, stored in proto |
| JSON column | Full JSON representation of the object, stored in json |
Schema Model
Each supported CRD kind has three tables:
| Table | Example | Stores |
|---|---|---|
| Main table | model | Object identity, timestamps, serialized payloads, and extracted columns |
| Labels table | model_labels | Kubernetes labels for each object UID |
| Annotations table | model_annotations | Kubernetes annotations for each object UID |
The schema currently covers 13 resource kinds: Project, ModelFamily, Model, Pipeline, PipelineRun, InferenceServer, Revision, Cluster, RayCluster, RayJob, TriggerRun, Deployment, and SparkJob. See Extracted Columns and SQL Indexes for the per-kind column detail.
That produces 39 tables total: 13 main tables, 13 label tables, and 13 annotation tables.
Table Relationships
The schema uses object UIDs to connect main tables to their side tables:
```
<kind> (uid, namespace, name, ...)
  |
  |  <kind>.uid = <kind>_labels.obj_uid
  v
<kind>_labels

<kind> (uid, namespace, name, ...)
  |
  |  <kind>.uid = <kind>_annotations.obj_uid
  v
<kind>_annotations
```
Cross-resource relationships are stored as denormalized namespace/name columns instead of foreign keys. For example:
| Relationship | Columns |
|---|---|
| PipelineRun to Pipeline | pipelinerun.pipeline_namespace, pipelinerun.pipeline_name |
| PipelineRun to Revision | pipelinerun.revision_namespace, pipelinerun.revision_name |
| TriggerRun to Pipeline | triggerrun.pipeline_namespace, triggerrun.pipeline_name |
| TriggerRun to Revision | triggerrun.revision_namespace, triggerrun.revision_name |
| Model to ModelFamily | model.model_family_namespace, model.model_family_name |
| Revision to base resource | revision.base_resource_namespace, revision.base_resource_name, revision.base_type |
The schema does not define SQL foreign key constraints. Consistency is maintained by Kubernetes reconciliation and ingester writes.
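Because cross-resource references are plain namespace/name columns, resolving them is an ordinary join with no referential-integrity guarantee. A minimal sketch, using SQLite as a stand-in for MySQL and a cut-down version of the `pipeline` and `pipelinerun` tables (sample values are invented):

```python
# Join pipelinerun rows to their parent pipeline via the denormalized
# namespace/name columns instead of a foreign key. SQLite stands in for
# MySQL; only the columns needed for the join are modeled here.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE pipeline (
    uid TEXT PRIMARY KEY, namespace TEXT, name TEXT, owner TEXT
);
CREATE TABLE pipelinerun (
    uid TEXT PRIMARY KEY, namespace TEXT, name TEXT,
    pipeline_namespace TEXT, pipeline_name TEXT, state TEXT
);
""")
db.execute("INSERT INTO pipeline VALUES ('p1', 'ml', 'train-fraud', 'alice')")
db.execute("INSERT INTO pipelinerun VALUES "
           "('r1', 'ml', 'train-fraud-42', 'ml', 'train-fraud', 'FAILED')")

# Nothing in the schema stops a run from referencing a pipeline row that
# does not exist; the join simply returns no row in that case.
rows = db.execute("""
    SELECT pr.name, p.owner
    FROM pipelinerun AS pr
    JOIN pipeline AS p
      ON p.namespace = pr.pipeline_namespace
     AND p.name = pr.pipeline_name
    WHERE pr.state = 'FAILED'
""").fetchall()
print(rows)  # [('train-fraud-42', 'alice')]
```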
In the side tables, label value columns are typed VARCHAR(63), matching the Kubernetes 63-character limit on label values, while annotation value columns are typed TEXT. A value that reaches MySQL longer than the column allows (for example, one written outside API-server validation, or a multi-byte value that overflows the column) is silently truncated under non-strict sql_mode, so queries that filter on the full value will not match.
Main Table Columns
Every main table shares a common base shape:
| Column | Purpose |
|---|---|
| uid | Kubernetes object UID and primary key |
| group_ver | API group/version for the object |
| namespace | Kubernetes namespace |
| name | Kubernetes object name |
| res_version | Kubernetes resource version |
| create_time | Object creation timestamp, sourced from the Kubernetes resource |
| update_time | Wall-clock time of the last ingester upsert (time.Now().UTC() at write time), not a field copied from the Kubernetes resource |
| delete_time | Soft-delete timestamp, or NULL for active rows |
| proto | Serialized protobuf object |
| json | Full JSON object |
Main tables also include CRD-specific extracted columns. Examples include model.algorithm, model.description, model.owner, pipeline.owner, pipelinerun.state, deployment.state, and inferenceserver.state. Whether a given extracted column also has a SQL index depends on the table — see Extracted Columns and SQL Indexes.
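The value of an extracted column is that callers can read a field in plain SQL instead of deserializing the json payload for every row. A small sketch, with SQLite standing in for MySQL; the `spec.algorithm` field path is illustrative, not the actual CRD layout:

```python
# Compare reading a field from the extracted `algorithm` column vs. parsing
# the full json payload. SQLite stands in for MySQL; the object shape and
# field path below are hypothetical.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE model (uid TEXT PRIMARY KEY, algorithm TEXT, json TEXT)")

obj = {"metadata": {"uid": "m1"}, "spec": {"algorithm": "xgboost"}}
# The ingester copies the field into its own column at upsert time.
db.execute("INSERT INTO model VALUES (?, ?, ?)",
           ("m1", obj["spec"]["algorithm"], json.dumps(obj)))

# Without the extracted column, every row must be deserialized:
raw = db.execute("SELECT json FROM model WHERE uid = 'm1'").fetchone()[0]
via_json = json.loads(raw)["spec"]["algorithm"]

# With it, plain SQL suffices (and the column can also carry an index):
via_col = db.execute("SELECT algorithm FROM model WHERE uid = 'm1'").fetchone()[0]
print(via_col == via_json)  # True
```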
Extracted Columns and SQL Indexes
The schema treats two related ideas as separate concerns, and knowing which is which determines whether a query is cheap or scans the whole table.
Extracted columns are CRD fields the ingester copies from the protobuf payload into dedicated SQL columns at upsert time. With an extracted column you can read or filter on the field directly in SQL without parsing the json column — but the filter is not necessarily fast.
SQL indexes are explicit KEY declarations on a column or column tuple. Filters on indexed columns can use the index; filters on non-indexed columns require a full table scan even when the column is extracted.
Every main table has a PRIMARY KEY on uid. Beyond that, only the columns listed below have a SQL index today.
| Main Table | Indexes (besides the primary key on uid) |
|---|---|
| model | (namespace, name), create_time, algorithm, owner |
| modelfamily | (namespace, name), create_time |
| pipeline | (namespace, name), create_time, owner |
| pipelinerun | (namespace, name), create_time, (pipeline_namespace, pipeline_name), state |
| deployment | (namespace, name), create_time, state |
| inferenceserver | (namespace, name), create_time, state |
| project | (namespace, name), create_time |
| revision | (namespace, name), create_time, (base_resource_namespace, base_resource_name) |
| cluster | (namespace, name), create_time |
| raycluster | (namespace, name), create_time |
| rayjob | (namespace, name), create_time |
| sparkjob | (namespace, name), create_time |
| triggerrun | (namespace, name), create_time, (pipeline_namespace, pipeline_name), state |
Side tables (<kind>_labels, <kind>_annotations) have a PRIMARY KEY on id and indexes on obj_uid. Label tables additionally index (key, value); annotation tables do not (the value is TEXT).
Many extracted columns are not indexed. For example, pipelinerun extracts actor, end_time, exception_type, and several namespace/name reference pairs in addition to the indexed state column — filtering by any of these is supported but requires a table scan. Common extracted columns by table:
| Main Table | Extracted Columns Beyond the Common Base |
|---|---|
| model | algorithm, training_framework, owner, source, description, model_kind, package_type, revision_id, src_pipeline_run_namespace/src_pipeline_run_name, model_family_namespace/model_family_name, plus four eval-report namespace/name pairs (feature_eval_report, performance_eval_report, feature_quality_report, explainability_report) |
| modelfamily | model_family_name |
| pipeline | owner, pipeline_type |
| pipelinerun | pipeline_namespace/pipeline_name, revision_namespace/revision_name, resume_pipeline_run_namespace/resume_pipeline_run_name, state, actor, end_time, exception_type |
| deployment | state, target_definition_type, current_revision_namespace/current_revision_name, deletion_requested_timestamp |
| inferenceserver | state |
| project | tier |
| revision | base_resource_namespace/base_resource_name, base_type, commit_branch, git_ref, owner |
| triggerrun | pipeline_namespace/pipeline_name, revision_namespace/revision_name, state, auto_flip |
| cluster, raycluster, rayjob, sparkjob | None beyond the common base columns |
When you write a query, prefer filters on indexed columns. Filters on extracted-but-unindexed columns still work, but expect linear scan cost. Filters that have to dig into the json column should be reserved for one-off diagnostic queries.
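The indexed-vs-scanned distinction is easy to check before shipping a query. A minimal sketch using SQLite as a stand-in for MySQL (SQLite's EXPLAIN QUERY PLAN plays the role of MySQL's EXPLAIN; the cut-down `model` table below keeps only the columns needed to show the difference, and the index name is invented):

```python
# Show that a filter on an indexed column uses the index while a filter on
# an extracted-but-unindexed column falls back to a full table scan.
# SQLite stands in for MySQL here.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE model (
    uid TEXT PRIMARY KEY,
    namespace TEXT, name TEXT,
    owner TEXT,        -- indexed in the real schema
    description TEXT   -- extracted but NOT indexed
);
CREATE INDEX model_owner ON model(owner);
""")

def plan(query):
    # Column 3 of each EXPLAIN QUERY PLAN row holds the human-readable detail.
    return " ".join(row[3] for row in db.execute("EXPLAIN QUERY PLAN " + query))

indexed = plan("SELECT uid FROM model WHERE owner = 'alice'")
unindexed = plan("SELECT uid FROM model WHERE description = 'x'")
print(indexed)    # e.g. "SEARCH model USING INDEX model_owner (owner=?)"
print(unindexed)  # e.g. "SCAN model" -- a full table scan
```

Running the same check with MySQL's EXPLAIN against the real schema gives the equivalent answer for production queries.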
Current Storage-Layer Limitations
Several MetadataStorage operations are stubbed in go/storage/mysql/mysql.go and either return an error or silently ignore part of the request. Operators relying on these paths should know what does and does not work today:
| Operation | Behavior |
|---|---|
| Upsert with direct = true | Returns the error direct update not yet implemented. The full upsert path (direct = false) works as documented. |
| DeleteCollection | Returns DeleteCollection not yet implemented. Use Delete per object instead. |
| QueryByTemplateID | Returns QueryByTemplateID not yet implemented. |
| Backfill | Returns Backfill not yet implemented. |
| List with a LabelSelector | Returns rows from the main table without applying the selector. The label selector value is silently ignored, so callers receive an unfiltered result set rather than an error. Filter by joining the side label table (see Join Labels for Filtering) until selector support lands. |
Query Patterns
Fetch a Live Object by Namespace and Name
```sql
SELECT proto
FROM model
WHERE namespace = 'default'
  AND name = 'my-model'
  AND delete_time IS NULL;
```
List Live Objects by State
```sql
SELECT namespace, name, state, update_time
FROM pipelinerun
WHERE state = 'FAILED'
  AND delete_time IS NULL
ORDER BY update_time DESC;
```
Join Labels for Filtering
```sql
SELECT m.namespace, m.name, m.update_time
FROM model AS m
JOIN model_labels AS l
  ON l.obj_uid = m.uid
WHERE l.`key` = 'team'
  AND l.`value` = 'fraud'
  AND m.delete_time IS NULL;
```
Note: the storage layer's List API silently ignores LabelSelector (see Current Storage-Layer Limitations). Always join the label table directly when filtering by label.
Inspect a Soft-Deleted Object
```sql
SELECT namespace, name, delete_time
FROM pipeline
WHERE delete_time IS NOT NULL
ORDER BY delete_time DESC;
```
Write Patterns
The ingester owns writes to these tables. Application code should use the Michelangelo API or Kubernetes CRDs rather than writing SQL directly.
Each ingester upsert overwrites the row's payload, timestamps, and extracted columns; labels and annotations are replaced wholesale.
Deletes are soft deletes. The row remains in the main table with delete_time set, which preserves metadata for audits and delayed cleanup workflows.
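The soft-delete behavior above can be sketched in a few lines, with SQLite standing in for MySQL and a table shape trimmed to the relevant columns:

```python
# A "delete" stamps delete_time instead of removing the row, so the row
# stays queryable for audits. SQLite stands in for MySQL.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pipeline "
           "(uid TEXT PRIMARY KEY, name TEXT, delete_time TEXT)")
db.execute("INSERT INTO pipeline VALUES ('p1', 'train', NULL)")

# Soft delete: an UPDATE, not a DELETE.
db.execute("UPDATE pipeline SET delete_time = datetime('now') "
           "WHERE uid = 'p1'")

live = db.execute("SELECT COUNT(*) FROM pipeline "
                  "WHERE delete_time IS NULL").fetchone()[0]
total = db.execute("SELECT COUNT(*) FROM pipeline").fetchone()[0]
print(live, total)  # 0 1 -- the row survives with delete_time set
```

This is why every live-object query in the patterns above carries an `AND delete_time IS NULL` clause.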
SQL File Conventions
- Main table names are lowercase CRD kind names; for example, `ModelFamily` becomes `modelfamily`.
- Side tables use `<main_table>_labels` and `<main_table>_annotations`.
- Column names use snake case.
- Identifiers are quoted with backticks in schema files.