MLflow Integration
This guide explains how platform operators can connect an MLflow Tracking Server to Michelangelo workloads. MLflow overlaps with two Michelangelo capabilities — experiment tracking and the model registry — so this guide covers both, along with the boundary between what operators configure and what users do in their @uniflow.task() code.
Michelangelo does not bundle an MLflow server. This guide assumes you are running a self-hosted MLflow Tracking Server or a managed endpoint (such as Databricks Managed MLflow).
How MLflow Works with Michelangelo
```
┌─────────────────────────────────────────────┐
│ Operator Responsibility                     │
│ ├─ Deploy or point to an MLflow server      │
│ └─ Ensure network reachability from pods    │
└─────────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────┐
│ User Responsibility (task code)             │
│ ├─ Set MLFLOW_TRACKING_URI in workflow code │
│ ├─ Import mlflow inside @uniflow.task()     │
│ └─ Log runs, params, metrics, artifacts     │
└─────────────────────────────────────────────┘
```
Michelangelo does not intercept or wrap MLflow calls. Users call the MLflow client directly inside @uniflow.task() functions and configure the tracking URI themselves. The operator's job is to ensure the MLflow server is reachable from task pods.
Prerequisites
- A running MLflow Tracking Server accessible from your Kubernetes cluster. Replace `http://mlflow.example.com:5000` in the examples below with your actual server address.
- Sufficient RBAC to create NetworkPolicy resources in the compute cluster namespace if egress rules are needed.
- The `mlflow` Python package available in the task's Docker image (users add this to their `requirements.txt`).
Step 1: Verify Network Reachability
Task pods run inside the compute cluster namespace registered with Michelangelo. Confirm that pods in that namespace can reach your MLflow server before proceeding.
```shell
kubectl run mlflow-connectivity-test \
  --image=curlimages/curl \
  --namespace=<compute-namespace> \
  --restart=Never \
  --rm -it -- \
  curl -sv http://mlflow.example.com:5000/health
```
A 200 OK response confirms reachability. If the MLflow server is outside the cluster (for example, Databricks or a SaaS endpoint), also confirm egress is allowed by any NetworkPolicy rules on the namespace.
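If you prefer to script this check (for example in a cluster-bootstrap job), the same probe can be written in a few lines of standard-library Python. This is a minimal sketch that assumes the server exposes the `/health` endpoint shown above; `mlflow_healthy` is a hypothetical helper, not part of any Michelangelo or MLflow API:

```python
import urllib.request


def mlflow_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers GET <base_url>/health with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, DNS failure, timeouts, and non-2xx
        # responses (urllib's URLError/HTTPError subclass OSError).
        return False
```

Run it from a pod in the compute namespace so the result reflects the network path task pods will actually use.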
If you need to add an egress rule for task pods:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-mlflow-egress
  namespace: <compute-namespace>
spec:
  podSelector:
    matchLabels:
      <your-pod-selector-label>: <your-value>
  policyTypes:
    - Egress
  egress:
    # Allow DNS resolution
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow egress to the MLflow server
    - to:
        - ipBlock:
            cidr: <mlflow-server-ip>/32
      ports:
        - protocol: TCP
          port: 5000
```
Replace `<your-pod-selector-label>` with labels that match your task pods. Check the actual labels with `kubectl get pods -n <compute-namespace> --show-labels`.
Step 2: Configure the Tracking URI
`MLFLOW_TRACKING_URI` is a user-space configuration — it belongs in workflow code or the Ray job pod environment, not in the Michelangelo system ConfigMap. Users should set it themselves using one of these approaches.
Option A: Set in workflow code
The simplest approach is to call mlflow.set_tracking_uri() directly in the task or at the top of the workflow module:
```python
import mlflow
import michelangelo.uniflow.core as uniflow
from michelangelo.uniflow.plugins.ray import RayTask

@uniflow.task(config=RayTask(head_cpu=2, head_memory="4Gi"))
def train_model(train_data, config: dict):
    mlflow.set_tracking_uri("http://mlflow.example.com:5000")
    mlflow.set_experiment("fraud-detection")
    ...
```
Option B: Set via pipeline environment
Users can pass MLFLOW_TRACKING_URI as an environment variable when submitting a pipeline run, keeping the URI out of source code:
```shell
ma pipeline dev-run -f pipeline.yaml --env MLFLOW_TRACKING_URI=http://mlflow.example.com:5000
```
In task code, MLflow reads `MLFLOW_TRACKING_URI` from the environment automatically — no explicit `set_tracking_uri()` call is needed when the variable is set.
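The precedence between the two options can be sketched in a few lines. This is a simplified model for illustration, not MLflow internals; `resolve_tracking_uri` is a hypothetical helper, not part of the MLflow API:

```python
import os


def resolve_tracking_uri(explicit=None):
    """Sketch of the resolution order: an explicit set_tracking_uri() value
    wins, then the MLFLOW_TRACKING_URI environment variable, then MLflow's
    local-file default."""
    if explicit:
        return explicit
    return os.environ.get("MLFLOW_TRACKING_URI", "file:./mlruns")
```

In practice this means Option A (code) overrides Option B (environment) when both are present, which is worth remembering when debugging runs that land on an unexpected server.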
Step 3: Handle Authentication
Self-hosted MLflow with basic auth
If your MLflow server requires HTTP basic authentication, pass the credentials as pipeline environment variables:
```shell
ma pipeline dev-run -f pipeline.yaml \
  --env MLFLOW_TRACKING_URI=http://mlflow.example.com:5000 \
  --env MLFLOW_TRACKING_USERNAME=<username> \
  --env MLFLOW_TRACKING_PASSWORD=<password>
```
MLflow's client reads `MLFLOW_TRACKING_USERNAME` and `MLFLOW_TRACKING_PASSWORD` natively.
Avoid hardcoding credentials in source code or pipeline YAML files committed to version control. Pass them at runtime via --env or a secrets manager integrated with your CI/CD system.
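On the wire, these two variables become a standard HTTP Basic `Authorization` header. A minimal sketch of that encoding, useful when reproducing a `PERMISSION_DENIED` error with `curl`; `basic_auth_header` is a hypothetical helper for illustration, not part of the MLflow client:

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Encode username/password as an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}
```

For example, the header value produced here can be passed to `curl -H` to test the server's auth configuration independently of the MLflow client.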
Databricks Managed MLflow
If you are using Databricks Managed MLflow, pass the following environment variables at pipeline submission time:
```shell
ma pipeline dev-run -f pipeline.yaml \
  --env MLFLOW_TRACKING_URI=databricks \
  --env DATABRICKS_HOST=https://<your-workspace>.azuredatabricks.net \
  --env DATABRICKS_TOKEN=<your-personal-access-token>
```
What Users Do (Task Code)
Once the operator has confirmed network reachability (Step 1), users configure their MLflow tracking URI and log experiments from any @uniflow.task() function.
```python
import mlflow
import michelangelo.uniflow.core as uniflow
from michelangelo.uniflow.plugins.ray import RayTask

@uniflow.task(config=RayTask(head_cpu=2, head_memory="4Gi"))
def train_model(train_data, config: dict):
    mlflow.set_experiment("fraud-detection")
    with mlflow.start_run(run_name="xgboost-baseline"):
        mlflow.log_params(config)
        model = _train(train_data, config)
        mlflow.log_metric("auc", model.auc)
        mlflow.log_metric("precision", model.precision)
        mlflow.sklearn.log_model(model, artifact_path="model")
    return model
```
Users are responsible for:
- Including `mlflow` in their task's Docker image (add to `requirements.txt` or the project Dockerfile).
- Starting and ending MLflow runs inside the task function.
- Ensuring their `mlflow` client version is compatible with the server version your organization runs. See the MLflow compatibility matrix for details.
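One way to address the first and last of these together is to pin the client in the image's `requirements.txt`. The version spec below is illustrative only; match it to the version your organization's server runs:

```text
# requirements.txt — pin the mlflow client to match the server's major version
mlflow==2.*
```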
MLflow Model Registry vs Michelangelo Model Registry
MLflow includes its own model registry. Michelangelo also has a built-in model registry backed by a Model Kubernetes custom resource. The two are independent and can be used simultaneously.
| | MLflow Model Registry | Michelangelo Model Registry |
|---|---|---|
| Backed by | MLflow Tracking Server database | Kubernetes `Model` CRD + S3 |
| Queried via | MLflow client / MLflow UI | `kubectl get models` / `ma model get` |
| Integrates with serving | MLflow serving (`mlflow models serve`) | Michelangelo InferenceServer |
| Required for Michelangelo pipelines? | No | No |
When to use MLflow's registry: If your organization already uses MLflow for model governance, lineage, and stage transitions (Staging → Production), continue using it. Michelangelo does not require you to use its own registry.
When to use Michelangelo's registry: If you want models to be deployable via Michelangelo's InferenceServer (Triton, vLLM, etc.), register them in Michelangelo's registry using the @uniflow.task() model registration API. You can do this in addition to logging to MLflow.
Using both: Log experiments and register models to MLflow for lineage and governance, and separately register the deployable artifact to Michelangelo for serving. Both calls can live in the same task function.
Verification
Verify network reachability from within the compute namespace using a temporary curl pod — the same approach as Step 1:
```shell
kubectl run mlflow-verify \
  --image=curlimages/curl \
  --namespace=<compute-namespace> \
  --restart=Never \
  --rm -it -- \
  curl -sv http://mlflow.example.com:5000/health
```
A 200 OK response confirms task pods in that namespace can reach the MLflow server. The pod is automatically deleted after the check (--rm).
Troubleshooting
| Symptom | Likely cause | Resolution |
|---|---|---|
| `ConnectionRefusedError` or `requests.exceptions.ConnectionError` | MLflow server unreachable from pod | Re-run the connectivity test from Step 1; check NetworkPolicy and firewall rules |
| `RestException: PERMISSION_DENIED` | Credentials missing or incorrect | Verify `MLFLOW_TRACKING_USERNAME` / `MLFLOW_TRACKING_PASSWORD` are set at pipeline submission time |
| `mlflow: command not found` / `ModuleNotFoundError` | `mlflow` not in the task's Docker image | Add `mlflow` to `requirements.txt` or the project Dockerfile |
| MLflow run logged but artifacts missing | Artifact store (S3/GCS) unreachable from pod | Confirm the task pod has access to the artifact store configured in the MLflow server |
| `INVALID_PARAMETER_VALUE` on `log_model` | Client/server version mismatch | Pin `mlflow` to the same major version as the server |
Next Steps
- Experiment Tracking Integration — general guide for connecting any experiment tracking server to Michelangelo
- Model Registry Integration — Michelangelo's built-in model registry: storage configuration, RBAC, and serving integration
- Register a Compute Cluster — how to add a Kubernetes cluster so Michelangelo can dispatch jobs to it
- Platform Setup — full ConfigMap reference for all Michelangelo components
- MLflow Documentation — official MLflow docs for tracking, model registry, and deployment