
MLflow Integration

This guide explains how platform operators can connect an MLflow Tracking Server to Michelangelo workloads. MLflow overlaps with two Michelangelo capabilities — experiment tracking and the model registry — so this guide covers both, along with the boundary between what operators configure and what users do in their @uniflow.task() code.

Michelangelo does not bundle an MLflow server. This guide assumes you are running a self-hosted MLflow Tracking Server or a managed endpoint (such as Databricks Managed MLflow).


How MLflow Works with Michelangelo

┌─────────────────────────────────────────────┐
│ Operator Responsibility                     │
│ ├─ Deploy or point to an MLflow server      │
│ └─ Ensure network reachability from pods    │
└─────────────────────────────────────────────┘

┌─────────────────────────────────────────────┐
│ User Responsibility (task code)             │
│ ├─ Set MLFLOW_TRACKING_URI in workflow code │
│ ├─ Import mlflow inside @uniflow.task()     │
│ └─ Log runs, params, metrics, artifacts     │
└─────────────────────────────────────────────┘

Michelangelo does not intercept or wrap MLflow calls. Users call the MLflow client directly inside @uniflow.task() functions and configure the tracking URI themselves. The operator's job is to ensure the MLflow server is reachable from task pods.


Prerequisites

  • A running MLflow Tracking Server accessible from your Kubernetes cluster. Replace http://mlflow.example.com:5000 in the examples below with your actual server address.
  • Sufficient RBAC to create NetworkPolicy resources in the compute cluster namespace if egress rules are needed.
  • The mlflow Python package available in the task's Docker image (users add this to their requirements.txt).

Step 1: Verify Network Reachability

Task pods run inside the compute cluster namespace registered with Michelangelo. Confirm that pods in that namespace can reach your MLflow server before proceeding.

kubectl run mlflow-connectivity-test \
  --image=curlimages/curl \
  --namespace=<compute-namespace> \
  --restart=Never \
  --rm -it -- \
  curl -sv http://mlflow.example.com:5000/health

A 200 OK response confirms reachability. If the MLflow server is outside the cluster (for example, Databricks or a SaaS endpoint), also confirm egress is allowed by any NetworkPolicy rules on the namespace.
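The same health check can also run from inside a Python task, which is useful when debugging reachability from the exact runtime environment. A minimal stdlib-only sketch — the URL and the /health endpoint are placeholders for your actual server:

```python
import urllib.request
import urllib.error

def mlflow_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the MLflow server answers the URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, DNS failures, and timeouts.
        return False

# Inside a task pod, point this at your server's health endpoint:
# mlflow_reachable("http://mlflow.example.com:5000/health")
```

Using only the standard library keeps the check independent of whether mlflow is installed in the image yet.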

If you need to add an egress rule for task pods:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-mlflow-egress
  namespace: <compute-namespace>
spec:
  podSelector:
    matchLabels:
      <your-pod-selector-label>: <your-value>
  policyTypes:
    - Egress
  egress:
    # Allow DNS resolution
    - ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
    # Allow egress to the MLflow server
    - to:
        - ipBlock:
            cidr: <mlflow-server-ip>/32
      ports:
        - protocol: TCP
          port: 5000

Replace <your-pod-selector-label> with labels that match your task pods. Check the actual labels with kubectl get pods -n <compute-namespace> --show-labels.


Step 2: Configure the Tracking URI

MLFLOW_TRACKING_URI is a user-space configuration — it belongs in workflow code or the Ray job pod environment, not in the Michelangelo system ConfigMap. Users should set it themselves using one of these approaches.

Option A: Set in workflow code

The simplest approach is to call mlflow.set_tracking_uri() directly in the task or at the top of the workflow module:

import mlflow
import michelangelo.uniflow.core as uniflow
from michelangelo.uniflow.plugins.ray import RayTask

@uniflow.task(config=RayTask(head_cpu=2, head_memory="4Gi"))
def train_model(train_data, config: dict):
    mlflow.set_tracking_uri("http://mlflow.example.com:5000")
    mlflow.set_experiment("fraud-detection")
    ...

Option B: Set via pipeline environment

Users can pass MLFLOW_TRACKING_URI as an environment variable when submitting a pipeline run, keeping the URI out of source code:

ma pipeline dev-run -f pipeline.yaml --env MLFLOW_TRACKING_URI=http://mlflow.example.com:5000

In task code, MLflow reads MLFLOW_TRACKING_URI from the environment automatically — no explicit set_tracking_uri() call is needed when the variable is set.
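That precedence can be mirrored explicitly when a task wants to log or validate the effective URI before its first MLflow call. A stdlib-only sketch; file:./mlruns is MLflow's local-filesystem default when no URI is configured:

```python
import os

def resolve_tracking_uri(default: str = "file:./mlruns") -> str:
    """Mirror MLflow's precedence: the env var wins, else fall back.

    MLflow itself reads MLFLOW_TRACKING_URI automatically; this helper
    only exists so a task can print or sanity-check the effective URI.
    """
    return os.environ.get("MLFLOW_TRACKING_URI", default)
```

This is purely illustrative — when the variable is set via --env, no code in the task needs to touch it at all.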


Step 3: Handle Authentication

Self-hosted MLflow with basic auth

If your MLflow server requires HTTP basic authentication, pass the credentials as pipeline environment variables:

ma pipeline dev-run -f pipeline.yaml \
  --env MLFLOW_TRACKING_URI=http://mlflow.example.com:5000 \
  --env MLFLOW_TRACKING_USERNAME=<username> \
  --env MLFLOW_TRACKING_PASSWORD=<password>

MLflow's client reads MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD natively.
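A task can fail fast with a clear message when these variables were not passed at submission time, rather than surfacing a PERMISSION_DENIED error mid-run. A small preflight sketch — the variable names are MLflow's own; the check itself is illustrative:

```python
import os

# Environment variables the MLflow client reads natively for basic auth.
REQUIRED_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_TRACKING_USERNAME",
    "MLFLOW_TRACKING_PASSWORD",
)

def check_mlflow_auth_env() -> list[str]:
    """Return the names of any required MLflow variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not os.environ.get(name)]

# Call at the top of the task; raising early beats a cryptic auth failure:
# missing = check_mlflow_auth_env()
# if missing:
#     raise RuntimeError(f"Missing MLflow env vars: {', '.join(missing)}")
```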

Warning: Avoid hardcoding credentials in source code or pipeline YAML files committed to version control. Pass them at runtime via --env or a secrets manager integrated with your CI/CD system.

Databricks Managed MLflow

If you are using Databricks Managed MLflow, pass the following environment variables at pipeline submission time:

ma pipeline dev-run -f pipeline.yaml \
  --env MLFLOW_TRACKING_URI=databricks \
  --env DATABRICKS_HOST=https://<your-workspace>.azuredatabricks.net \
  --env DATABRICKS_TOKEN=<your-personal-access-token>

What Users Do (Task Code)

Once the operator has confirmed network reachability (Step 1), users configure their MLflow tracking URI and log experiments from any @uniflow.task() function.

import mlflow
import michelangelo.uniflow.core as uniflow
from michelangelo.uniflow.plugins.ray import RayTask

@uniflow.task(config=RayTask(head_cpu=2, head_memory="4Gi"))
def train_model(train_data, config: dict):
    mlflow.set_experiment("fraud-detection")

    with mlflow.start_run(run_name="xgboost-baseline"):
        mlflow.log_params(config)

        model = _train(train_data, config)

        mlflow.log_metric("auc", model.auc)
        mlflow.log_metric("precision", model.precision)
        mlflow.sklearn.log_model(model, artifact_path="model")

    return model

Users are responsible for:

  • Including mlflow in their task's Docker image (add to requirements.txt or the project Dockerfile).
  • Starting and ending MLflow runs inside the task function.
  • Ensuring their mlflow client version is compatible with the server version your organization runs. See the MLflow compatibility matrix for details.
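On the last point, a task can assert client/server compatibility up front instead of discovering a mismatch via an INVALID_PARAMETER_VALUE error. A sketch of the major-version comparison; how you obtain the server's version string is deployment-specific:

```python
def majors_match(client_version: str, server_version: str) -> bool:
    """Compare only the major component of two dotted version strings."""
    return client_version.split(".")[0] == server_version.split(".")[0]

# Example usage inside a task (server_version supplied by your platform team):
# import mlflow
# assert majors_match(mlflow.__version__, server_version), "pin mlflow to the server's major version"
```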

MLflow Model Registry vs Michelangelo Model Registry

MLflow includes its own model registry. Michelangelo also has a built-in model registry backed by a Model Kubernetes custom resource. The two are independent and can be used simultaneously.

|  | MLflow Model Registry | Michelangelo Model Registry |
| --- | --- | --- |
| Backed by | MLflow Tracking Server database | Kubernetes Model CRD + S3 |
| Queried via | MLflow client / MLflow UI | kubectl get models / ma model get |
| Integrates with serving | MLflow serving (mlflow models serve) | Michelangelo InferenceServer |
| Required for Michelangelo pipelines? | No | No |

When to use MLflow's registry: If your organization already uses MLflow for model governance, lineage, and stage transitions (Staging → Production), continue using it. Michelangelo does not require you to use its own registry.

When to use Michelangelo's registry: If you want models to be deployable via Michelangelo's InferenceServer (Triton, vLLM, etc.), register them in Michelangelo's registry using the @uniflow.task() model registration API. You can do this in addition to logging to MLflow.

Using both: Log experiments and register models to MLflow for lineage and governance, and separately register the deployable artifact to Michelangelo for serving. Both calls can live in the same task function.
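The dual-registration pattern can be sketched as below. The mlflow calls are the standard client API; registry_register stands in for Michelangelo's model registration call, which is hypothetical here — check the @uniflow.task() model registration docs for the real name and signature. Both collaborators are injected so the ordering can be exercised without a live tracking server:

```python
def log_and_register(model, config: dict, mlflow_client, registry_register):
    """Log the run to MLflow for lineage, then register the same artifact
    with Michelangelo for serving.

    mlflow_client is the mlflow module (or a compatible test double);
    registry_register is a hypothetical stand-in for Michelangelo's
    registration API.
    """
    with mlflow_client.start_run(run_name="dual-registration"):
        mlflow_client.log_params(config)
        mlflow_client.sklearn.log_model(model, artifact_path="model")
    # Register the deployable artifact for Michelangelo serving:
    registry_register(name="fraud-detection", model=model)
```

In a real task you would pass the mlflow module itself and your platform's registration function; the MLflow run is closed before the serving-side registration so lineage is complete either way.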


Verification

Verify network reachability from within the compute namespace using a temporary curl pod — the same approach as Step 1:

kubectl run mlflow-verify \
  --image=curlimages/curl \
  --namespace=<compute-namespace> \
  --restart=Never \
  --rm -it -- \
  curl -sv http://mlflow.example.com:5000/health

A 200 OK response confirms task pods in that namespace can reach the MLflow server. The pod is automatically deleted after the check (--rm).


Troubleshooting

| Symptom | Likely cause | Resolution |
| --- | --- | --- |
| ConnectionRefusedError or requests.exceptions.ConnectionError | MLflow server unreachable from pod | Re-run the connectivity test from Step 1; check NetworkPolicy and firewall rules |
| RestException: PERMISSION_DENIED | Credentials missing or incorrect | Verify MLFLOW_TRACKING_USERNAME / MLFLOW_TRACKING_PASSWORD are set at pipeline submission time |
| mlflow: command not found / ModuleNotFoundError | mlflow not in task's Docker image | Add mlflow to requirements.txt or the project Dockerfile |
| MLflow run logged but artifacts missing | Artifact store (S3/GCS) unreachable from pod | Confirm the task pod has access to the artifact store configured in the MLflow server |
| INVALID_PARAMETER_VALUE on log_model | Client/server version mismatch | Pin mlflow to the same major version as the server |

Next Steps