Managing and Monitoring Deep Learning/Machine Learning Experiments with MLflow

What is MLflow?

MLflow is an open-source platform for managing the complete machine learning lifecycle, including training, evaluation, and deployment of models. During the development of complex models, experiments are repeatedly conducted with changing hyperparameters, data versions, source code, and model architectures. Without proper tracking, it becomes difficult to reproduce results or improve model performance.

MLflow solves these issues with the following four key components:

  • MLflow Tracking: Stores and compares metadata such as parameters, metrics, models, and logs
  • MLflow Projects: Defines code and execution environments for reproducibility
  • MLflow Models: A universal format to save and deploy models trained with various frameworks
  • MLflow Model Registry: Supports model versioning, approval, and stage transitions like Production and Staging
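
As a quick orientation, the sketch below shows how three of these components surface in the Python API; the run ID and registered model name are hypothetical placeholders.

import mlflow

# MLflow Tracking: log parameters and metrics inside a run
with mlflow.start_run():
    mlflow.log_param("lr", 0.01)
    mlflow.log_metric("loss", 0.42)

# MLflow Models: load a previously logged model back as a generic pyfunc
# (assumes a run that logged a model under the artifact path "model")
# model = mlflow.pyfunc.load_model("runs:/<run_id>/model")

# MLflow Model Registry: register that logged model under a versioned name
# mlflow.register_model("runs:/<run_id>/model", "MyModel")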

Why is MLflow important in DL/ML?

Deep learning and machine learning projects involve numerous experiments, each with different hyperparameters, model architectures, and data preprocessing methods. Without systematic management, the following problems may arise:

  • Difficulties reproducing high-performing experiments
  • Challenges sharing or verifying experiments among collaborators
  • Uncertainty about which model version was deployed

MLflow provides the following benefits to address these challenges:

  • Reproducibility: Saves parameters, code, and environment to recreate results consistently
  • Organized experiment management: Compare and analyze hundreds of results at a glance
  • Deployment integration: Manage model versions for serving and easy rollback
  • Collaboration support: Share and review experiments with team members easily

MLflow Tracking Example

The example below shows how to apply MLflow Tracking to a simple classification model written in PyTorch.


import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn
import torch.optim as optim

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Dummy data: 100 samples with 10 features, binary labels
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)

    for epoch in range(5):
        optimizer.zero_grad()
        outputs = model(X)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()

        mlflow.log_metric("loss", loss.item(), step=epoch)

    mlflow.pytorch.log_model(model, "model")
    

The code above logs the learning rate as a parameter, records the loss at each epoch, and saves the trained model as an artifact in MLflow. The MLflow UI then lets you visualize and compare runs across experiments.
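
Runs can also be compared programmatically. The sketch below, assuming the default local tracking store and the experiment used above, pulls the logged runs into a pandas DataFrame and sorts them by loss.

import mlflow

# Fetch the runs of the active experiment as a pandas DataFrame
runs = mlflow.search_runs()

# Metric and parameter columns follow the "metrics.<name>" / "params.<name>" pattern
print(runs[["run_id", "params.learning_rate", "metrics.loss"]].sort_values("metrics.loss"))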

Using with PyTorch Lightning

MLflow integrates well with PyTorch Lightning. Below is an example using LightningModule and MLFlowLogger to track experiments.


import torch
import torch.nn as nn
import mlflow
import mlflow.pytorch
import pytorch_lightning as pl
from pytorch_lightning.loggers import MLFlowLogger
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score

# Define Lightning model
class LitModel(pl.LightningModule):
    def __init__(self, input_dim=10, output_dim=2, lr=0.01, batch_size=16, max_epochs=5):
        super().__init__()
        self.save_hyperparameters()  # Log hyperparameters
        self.model = nn.Linear(input_dim, output_dim)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = self.criterion(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy_score(y.cpu(), preds.cpu())
        self.log("train_loss", loss, on_epoch=True)
        self.log("train_acc", acc, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

# Create dummy data
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))
dataloader = DataLoader(TensorDataset(X, y), batch_size=16)

# Start MLflow experiment
with mlflow.start_run() as run:
    # Hyperparams to log
    hparams = {
        "input_dim": 10,
        "output_dim": 2,
        "lr": 0.01,
        "batch_size": 16,
        "max_epochs": 5
    }
    mlflow.log_params(hparams)

    # Set logger with run ID to bind logs correctly
    mlflow_logger = MLFlowLogger(experiment_name="lightning_exp", run_id=run.info.run_id)

    # Train model
    model = LitModel(**hparams)
    trainer = pl.Trainer(max_epochs=hparams["max_epochs"], logger=mlflow_logger)
    trainer.fit(model, dataloader)

    # Evaluate on training set to log final accuracy
    model.eval()
    with torch.no_grad():
        all_preds = []
        all_labels = []
        for xb, yb in dataloader:
            preds = torch.argmax(model(xb), dim=1)
            all_preds.append(preds)
            all_labels.append(yb)
        all_preds = torch.cat(all_preds)
        all_labels = torch.cat(all_labels)
        final_acc = accuracy_score(all_labels.numpy(), all_preds.numpy())
        mlflow.log_metric("final_train_accuracy", final_acc)

    # Input example and model signature
    input_example = torch.randn(1, 10)
    signature = mlflow.models.infer_signature(input_example.numpy(), model(input_example).detach().numpy())

    # Save model with metadata
    mlflow.pytorch.log_model(
        pytorch_model=model.model,  # Only the inner torch model
        artifact_path="model",
        input_example=input_example.numpy(),
        signature=signature
    )  
    

When PyTorch Lightning is the training backend, calling mlflow.pytorch.autolog() logs hyperparameters, metrics, and the trained model to MLflow automatically, with no explicit log_param or log_metric calls. With autologging enabled, the code above can be simplified as follows.


import torch
import torch.nn as nn
import mlflow
import mlflow.pytorch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score

mlflow.pytorch.autolog()
# Create a new MLflow Experiment
mlflow.set_experiment("lion_cheetah2")


# Define Lightning model
class LitModel(pl.LightningModule):
    def __init__(self, input_dim=10, output_dim=2, lr=0.01, batch_size=16, max_epochs=5):
        super().__init__()
        self.save_hyperparameters()  # Log hyperparameters
        self.model = nn.Linear(input_dim, output_dim)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = self.criterion(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy_score(y.cpu(), preds.cpu())
        self.log("train_loss", loss, on_epoch=True)
        self.log("train_acc", acc, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

# Create dummy data
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))
dataloader = DataLoader(TensorDataset(X, y), batch_size=16)

# Start MLflow experiment
with mlflow.start_run() as run:
    # Hyperparams to log
    hparams = {
        "input_dim": 10,
        "output_dim": 2,
        "lr": 0.01,
        "batch_size": 16,
        "max_epochs": 5
    }

    # Train model
    model = LitModel(**hparams)
    trainer = pl.Trainer(max_epochs=hparams["max_epochs"])
    trainer.fit(model, dataloader)

    # Evaluate on training set to log final accuracy
    model.eval()
    with torch.no_grad():
        all_preds = []
        all_labels = []
        for xb, yb in dataloader:
            preds = torch.argmax(model(xb), dim=1)
            all_preds.append(preds)
            all_labels.append(yb)
        all_preds = torch.cat(all_preds)
        all_labels = torch.cat(all_labels)
        final_acc = accuracy_score(all_labels.numpy(), all_preds.numpy())
        # Autolog does not capture this custom post-training metric, so log it explicitly
        mlflow.log_metric("final_train_accuracy", final_acc)

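With autologging enabled, the hyperparameters captured by save_hyperparameters() and the metrics reported through self.log() reach MLflow without any manual calls; only the custom post-training accuracy above is logged explicitly.
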
Monitoring Experiments with MLflow UI

MLflow provides a web-based UI to visually explore and compare experiment logs. Launch it with the following command:

mlflow ui --port 5000

Visit http://localhost:5000 in your browser to use features such as:

  • Inspect parameters, metrics, and model artifacts
  • Compare experiments using graphs
  • Download and reuse models
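
If your runs are logged to a tracking server rather than the default local ./mlruns store, point the client at it before logging; the host and port below are placeholders matching the command above.

import mlflow

# Direct all subsequent logging to the tracking server
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("my_experiment")

with mlflow.start_run():
    mlflow.log_metric("loss", 0.1)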

[Figure: MLflow result view in the UI]

Using MLflow Model Registry

Systematic model versioning is essential for deploying or rolling back models in production. MLflow Model Registry supports this process. Here's an example of registering and transitioning a model version:


from mlflow.tracking import MlflowClient

client = MlflowClient()

# ID of the run that logged the model (placeholder)
run_id = "<your_run_id>"
model_uri = f"runs:/{run_id}/model"

# Create the registered model once; skip this if "MyModel" already exists
client.create_registered_model("MyModel")

# Register the logged model as a new version under that name
model_version = client.create_model_version(
    name="MyModel",
    source=model_uri,
    run_id=run_id
)

# Promote the new version to the Production stage
client.transition_model_version_stage(
    name="MyModel",
    version=model_version.version,
    stage="Production"
)
  

This enables easy tracking and movement of models across stages such as Staging, Production, and Archived.
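
Once a version is in a stage, downstream code can load it by stage name instead of by run ID. A minimal sketch, assuming the registered model above:

import mlflow.pyfunc

# Load whichever version currently holds the Production stage
model = mlflow.pyfunc.load_model("models:/MyModel/Production")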

Conclusion

MLflow is a powerful tool for managing deep learning and machine learning experiments in a transparent and organized way. It integrates training, comparison, storage, and deployment into one workflow, enhancing reproducibility and collaboration. Widely adopted as a key MLOps component, MLflow is especially useful in automating the pipeline from model development to serving.
