
Managing and Monitoring Deep Learning/Machine Learning Experiments with MLflow


What is MLflow?

MLflow is an open-source platform for managing the complete machine learning lifecycle, including training, evaluation, and deployment of models. During the development of complex models, experiments are repeatedly conducted with changing hyperparameters, data versions, source code, and model architectures. Without proper tracking, it becomes difficult to reproduce results or improve model performance.

MLflow solves these issues with the following four key components:

  • MLflow Tracking: Stores and compares metadata such as parameters, metrics, models, and logs
  • MLflow Projects: Defines code and execution environments for reproducibility
  • MLflow Models: A universal format to save and deploy models trained with various frameworks
  • MLflow Model Registry: Supports model versioning, approval, and stage transitions like Production and Staging

Why is MLflow important in DL/ML?

Deep learning and machine learning projects involve numerous experiments, each with different hyperparameters, model architectures, and data preprocessing methods. Without systematic management, the following problems may arise:

  • Difficulties reproducing high-performing experiments
  • Challenges sharing or verifying experiments among collaborators
  • Uncertainty about which model version was deployed

MLflow provides the following benefits to address these challenges:

  • Reproducibility: Saves parameters, code, and environment to recreate results consistently
  • Organized experiment management: Compare and analyze hundreds of results at a glance
  • Deployment integration: Manage model versions for serving and easy rollback
  • Collaboration support: Share and review experiments with team members easily

MLflow Tracking Example

The example below shows how to apply MLflow Tracking to a simple classification model written in PyTorch.


import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn
import torch.optim as optim

# Simple two-class linear classifier
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Dummy training data: 100 samples, 10 features, 2 classes
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))

with mlflow.start_run():
    # Log the hyperparameter used for this run
    mlflow.log_param("learning_rate", 0.01)

    for epoch in range(5):
        optimizer.zero_grad()
        outputs = model(X)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()

        # Log the loss for this epoch
        mlflow.log_metric("loss", loss.item(), step=epoch)

    # Save the trained model as an MLflow artifact
    mlflow.pytorch.log_model(model, "model")
    

The above code logs the loss at each epoch and saves the trained model to MLflow. The MLflow UI allows visualization and comparison of multiple experiments.
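
Once the run finishes, the logged model can be loaded back from the tracking store for inference. Below is a minimal sketch, assuming the run ID of the run above (the ID is shown in the MLflow UI, or can be retrieved with mlflow.last_active_run()); the run_id value is a placeholder.


import mlflow.pytorch
import torch

run_id = "<run_id>"  # placeholder: copy the actual run ID from the MLflow UI
loaded_model = mlflow.pytorch.load_model(f"runs:/{run_id}/model")

loaded_model.eval()
with torch.no_grad():
    sample = torch.randn(1, 10)   # same feature dimension as the training data
    print(loaded_model(sample))   # raw logits for the two classes
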

Using with PyTorch Lightning

MLflow integrates well with PyTorch Lightning. Below is an example using LightningModule and MLFlowLogger to track experiments.


import torch
import torch.nn as nn
import mlflow
import mlflow.pytorch
import pytorch_lightning as pl
from pytorch_lightning.loggers import MLFlowLogger
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score
import numpy as np

# Define Lightning model
class LitModel(pl.LightningModule):
    def __init__(self, input_dim=10, output_dim=2, lr=0.01, batch_size=16, max_epochs=5):
        super().__init__()
        self.save_hyperparameters()  # Log hyperparameters
        self.model = nn.Linear(input_dim, output_dim)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = self.criterion(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy_score(y.cpu(), preds.cpu())
        self.log("train_loss", loss, on_epoch=True)
        self.log("train_acc", acc, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

# Create dummy data
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))
dataloader = DataLoader(TensorDataset(X, y), batch_size=16)

# Start MLflow experiment
with mlflow.start_run() as run:
    # Hyperparams to log
    hparams = {
        "input_dim": 10,
        "output_dim": 2,
        "lr": 0.01,
        "batch_size": 16,
        "max_epochs": 5
    }
    mlflow.log_params(hparams)

    # Set logger with run ID to bind logs correctly
    mlflow_logger = MLFlowLogger(experiment_name="lightning_exp", run_id=run.info.run_id)

    # Train model
    model = LitModel(**hparams)
    trainer = pl.Trainer(max_epochs=hparams["max_epochs"], logger=mlflow_logger)
    trainer.fit(model, dataloader)

    # Evaluate on training set to log final accuracy
    model.eval()
    with torch.no_grad():
        all_preds = []
        all_labels = []
        for xb, yb in dataloader:
            preds = torch.argmax(model(xb), dim=1)
            all_preds.append(preds)
            all_labels.append(yb)
        all_preds = torch.cat(all_preds)
        all_labels = torch.cat(all_labels)
        final_acc = accuracy_score(all_labels.numpy(), all_preds.numpy())
        mlflow.log_metric("final_train_accuracy", final_acc)

    # Input example and model signature
    input_example = torch.randn(1, 10)
    signature = mlflow.models.infer_signature(input_example.numpy(), model(input_example).detach().numpy())

    # Save model with metadata
    mlflow.pytorch.log_model(
        pytorch_model=model.model,  # Only the inner torch model
        artifact_path="model",
        input_example=input_example.numpy(),
        signature=signature
    )  
    

When PyTorch Lightning is used as the training backend, mlflow.pytorch.autolog() can log hyperparameters, metrics, and the trained model automatically, without explicit logging calls. With autologging enabled, the previous code can be simplified as follows.


import torch
import torch.nn as nn
import mlflow
import mlflow.pytorch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score
import numpy as np

mlflow.pytorch.autolog()
# Create a new MLflow Experiment
mlflow.set_experiment("lion_cheetah2")


# Define Lightning model
class LitModel(pl.LightningModule):
    def __init__(self, input_dim=10, output_dim=2, lr=0.01, batch_size=16, max_epochs=5):
        super().__init__()
        self.save_hyperparameters()  # Log hyperparameters
        self.model = nn.Linear(input_dim, output_dim)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = self.criterion(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy_score(y.cpu(), preds.cpu())
        self.log("train_loss", loss, on_epoch=True)
        self.log("train_acc", acc, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

# Create dummy data
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))
dataloader = DataLoader(TensorDataset(X, y), batch_size=16)

# Start MLflow experiment
with mlflow.start_run() as run:
    # Hyperparameters for the model (autolog records them; no explicit log_params needed)
    hparams = {
        "input_dim": 10,
        "output_dim": 2,
        "lr": 0.01,
        "batch_size": 16,
        "max_epochs": 5
    }

    # Train model
    model = LitModel(**hparams)
    trainer = pl.Trainer(max_epochs=hparams["max_epochs"])
    trainer.fit(model, dataloader)

    # Evaluate on training set to log final accuracy
    model.eval()
    with torch.no_grad():
        all_preds = []
        all_labels = []
        for xb, yb in dataloader:
            preds = torch.argmax(model(xb), dim=1)
            all_preds.append(preds)
            all_labels.append(yb)
        all_preds = torch.cat(all_preds)
        all_labels = torch.cat(all_labels)
        final_acc = accuracy_score(all_labels.numpy(), all_preds.numpy())
        mlflow.log_metric("final_train_accuracy", final_acc)

Monitoring Experiments with MLflow UI

MLflow provides a web-based UI to visually explore and compare experiment logs. Launch it with the following command:

mlflow ui --port 5000

Visit http://localhost:5000 in your browser to use features such as:

  • Inspect parameters, metrics, and model artifacts
  • Compare experiments using graphs
  • Download and reuse models
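
The same information can also be pulled programmatically, which is convenient for selecting runs inside a script. Below is a minimal sketch using mlflow.search_runs, assuming the "lightning_exp" experiment name from the Lightning example above.


import mlflow

# Returns a pandas DataFrame with one row per run;
# metric and parameter columns are prefixed with "metrics." and "params."
runs = mlflow.search_runs(experiment_names=["lightning_exp"])

print(runs[["run_id", "status", "start_time"]].head())
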

MLflow result: (screenshot of the MLflow UI omitted)


Using MLflow Model Registry

Systematic model versioning is essential for deploying or rolling back models in production. MLflow Model Registry supports this process. Here's an example of registering and transitioning a model version:


from mlflow.tracking import MlflowClient

client = MlflowClient()

run_id = "<run_id>"  # placeholder: ID of the run that logged the model
model_uri = f"runs:/{run_id}/model"

# Register the model name once (raises an error if it already exists)
client.create_registered_model("MyModel")

model_version = client.create_model_version(
    name="MyModel",
    source=model_uri,
    run_id=run_id
)

client.transition_model_version_stage(
    name="MyModel",
    version=model_version.version,
    stage="Production"
)
  

This makes it easy to track model versions and move them across stages such as Staging, Production, and Archived.
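
Once a version has been promoted, downstream code can load the model by name and stage rather than by run ID. Below is a minimal sketch, assuming the "MyModel" name registered above.


import mlflow.pytorch

# Loads the latest version currently in the "Production" stage
model = mlflow.pytorch.load_model("models:/MyModel/Production")
model.eval()
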

Conclusion

MLflow is a powerful tool for managing deep learning and machine learning experiments in a transparent and organized way. It integrates training, comparison, storage, and deployment into one workflow, enhancing reproducibility and collaboration. Widely adopted as a key MLOps component, MLflow is especially useful in automating the pipeline from model development to serving.
