Managing and Monitoring Deep Learning/Machine Learning Experiments with MLflow
What is MLflow?
MLflow is an open-source platform for managing the complete machine learning lifecycle, including training, evaluation, and deployment of models. During the development of complex models, experiments are repeatedly conducted with changing hyperparameters, data versions, source code, and model architectures. Without proper tracking, it becomes difficult to reproduce results or improve model performance.
MLflow solves these issues with the following four key components:
- MLflow Tracking: Stores and compares metadata such as parameters, metrics, models, and logs
- MLflow Projects: Defines code and execution environments for reproducibility
- MLflow Models: A universal format to save and deploy models trained with various frameworks
- MLflow Model Registry: Supports model versioning, approval, and stage transitions like Production and Staging
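All of the examples below rely on MLflow Tracking. By default, runs are written to a local ./mlruns directory; to share results with a team, MLflow can instead be pointed at a tracking server and a named experiment. A minimal setup sketch (the server URL and experiment name below are placeholders, not part of the examples that follow):

import mlflow

# Optional: send runs to a tracking server instead of the local ./mlruns directory.
# The URL is a placeholder; omit this call to keep everything local.
mlflow.set_tracking_uri("http://localhost:5000")

# Group subsequent runs under a named experiment (created if it does not exist)
mlflow.set_experiment("my_experiment")

with mlflow.start_run():
    mlflow.log_param("example_param", 1)
    mlflow.log_metric("example_metric", 0.5)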
Why is MLflow important in DL/ML?
Deep learning and machine learning projects involve numerous experiments, each with different hyperparameters, model architectures, and data preprocessing methods. Without systematic management, the following problems may arise:
- Difficulties reproducing high-performing experiments
- Challenges sharing or verifying experiments among collaborators
- Uncertainty about which model version was deployed
MLflow provides the following benefits to address these challenges:
- Reproducibility: Saves parameters, code, and environment to recreate results consistently
- Organized experiment management: Compare and analyze hundreds of results at a glance
- Deployment integration: Manage model versions for serving and easy rollback
- Collaboration support: Share and review experiments with team members easily
MLflow Tracking Example
The example below shows how to apply MLflow Tracking in a simple classification model using PyTorch.
import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn
import torch.optim as optim

# Simple linear classifier: 10 input features, 2 classes
class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

model = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Dummy data: 100 samples with 10 features and binary labels
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    for epoch in range(5):
        optimizer.zero_grad()
        outputs = model(X)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()
        mlflow.log_metric("loss", loss.item(), step=epoch)
    mlflow.pytorch.log_model(model, "model")
The above code logs the loss at each epoch and saves the trained model to MLflow. The MLflow UI allows visualization and comparison of multiple experiments.
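A logged model can later be loaded back for inference by its run ID, which is shown in the MLflow UI or returned by mlflow.start_run(). A minimal sketch; the <RUN_ID> placeholder must be replaced with an actual run ID:

import torch
import mlflow.pytorch

# Load the model that was logged under the artifact path "model"
loaded_model = mlflow.pytorch.load_model("runs:/<RUN_ID>/model")

loaded_model.eval()
with torch.no_grad():
    sample = torch.randn(1, 10)  # same input shape as during training
    print(loaded_model(sample))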
Using with PyTorch Lightning
MLflow integrates well with PyTorch Lightning. Below is an example using LightningModule and MLFlowLogger to track experiments.
import torch
import torch.nn as nn
import mlflow
import mlflow.pytorch
import pytorch_lightning as pl
from pytorch_lightning.loggers import MLFlowLogger
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score

# Define Lightning model
class LitModel(pl.LightningModule):
    def __init__(self, input_dim=10, output_dim=2, lr=0.01, batch_size=16, max_epochs=5):
        super().__init__()
        self.save_hyperparameters()  # Log hyperparameters
        self.model = nn.Linear(input_dim, output_dim)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = self.criterion(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy_score(y.cpu(), preds.cpu())
        self.log("train_loss", loss, on_epoch=True)
        self.log("train_acc", acc, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

# Create dummy data
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))
dataloader = DataLoader(TensorDataset(X, y), batch_size=16)

# Start MLflow experiment
with mlflow.start_run() as run:
    # Hyperparams to log
    hparams = {
        "input_dim": 10,
        "output_dim": 2,
        "lr": 0.01,
        "batch_size": 16,
        "max_epochs": 5,
    }
    mlflow.log_params(hparams)

    # Set logger with the run ID so Lightning logs are bound to the same MLflow run
    mlflow_logger = MLFlowLogger(experiment_name="lightning_exp", run_id=run.info.run_id)

    # Train model
    model = LitModel(**hparams)
    trainer = pl.Trainer(max_epochs=hparams["max_epochs"], logger=mlflow_logger)
    trainer.fit(model, dataloader)

    # Evaluate on the training set to log final accuracy
    model.eval()
    with torch.no_grad():
        all_preds = []
        all_labels = []
        for xb, yb in dataloader:
            preds = torch.argmax(model(xb), dim=1)
            all_preds.append(preds)
            all_labels.append(yb)
        all_preds = torch.cat(all_preds)
        all_labels = torch.cat(all_labels)
    final_acc = accuracy_score(all_labels.numpy(), all_preds.numpy())
    mlflow.log_metric("final_train_accuracy", final_acc)

    # Input example and model signature
    input_example = torch.randn(1, 10)
    signature = mlflow.models.infer_signature(
        input_example.numpy(), model(input_example).detach().numpy()
    )

    # Save model with metadata
    mlflow.pytorch.log_model(
        pytorch_model=model.model,  # Log only the inner torch model
        artifact_path="model",
        input_example=input_example.numpy(),
        signature=signature,
    )
When PyTorch Lightning is used as the training backend, calling mlflow.pytorch.autolog() logs hyperparameters, metrics, and the trained model automatically, without explicit logging calls. In that case, the code above can be simplified as follows.
import torch
import torch.nn as nn
import mlflow
import mlflow.pytorch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score

# Automatically log hyperparameters, metrics, and the model during training
mlflow.pytorch.autolog()

# Create a new MLflow Experiment
mlflow.set_experiment("lion_cheetah2")

# Define Lightning model
class LitModel(pl.LightningModule):
    def __init__(self, input_dim=10, output_dim=2, lr=0.01, batch_size=16, max_epochs=5):
        super().__init__()
        self.save_hyperparameters()  # Log hyperparameters
        self.model = nn.Linear(input_dim, output_dim)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = self.criterion(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy_score(y.cpu(), preds.cpu())
        self.log("train_loss", loss, on_epoch=True)
        self.log("train_acc", acc, on_epoch=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

# Create dummy data
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))
dataloader = DataLoader(TensorDataset(X, y), batch_size=16)

# Start MLflow experiment
with mlflow.start_run() as run:
    # Hyperparameters (captured automatically via save_hyperparameters and autolog)
    hparams = {
        "input_dim": 10,
        "output_dim": 2,
        "lr": 0.01,
        "batch_size": 16,
        "max_epochs": 5,
    }

    # Train model
    model = LitModel(**hparams)
    trainer = pl.Trainer(max_epochs=hparams["max_epochs"])
    trainer.fit(model, dataloader)

    # Evaluate on the training set and log final accuracy manually
    model.eval()
    with torch.no_grad():
        all_preds = []
        all_labels = []
        for xb, yb in dataloader:
            preds = torch.argmax(model(xb), dim=1)
            all_preds.append(preds)
            all_labels.append(yb)
        all_preds = torch.cat(all_preds)
        all_labels = torch.cat(all_labels)
    final_acc = accuracy_score(all_labels.numpy(), all_preds.numpy())
    mlflow.log_metric("final_train_accuracy", final_acc)
Monitoring Experiments with MLflow UI
MLflow provides a web-based UI to visually explore and compare experiment logs. Launch it with the following command:
mlflow ui --port 5000
Visit http://localhost:5000 in your browser to use features such as:
- Inspect parameters, metrics, and model artifacts
- Compare experiments using graphs
- Download and reuse models
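The same information can also be queried programmatically, which is convenient for scripted analysis. A minimal sketch using mlflow.search_runs (the experiment name matches the Lightning example above; the metric and parameter column names depend on what was actually logged):

import mlflow

# Fetch all runs of an experiment as a pandas DataFrame
runs = mlflow.search_runs(experiment_names=["lightning_exp"])

# Sort by the logged training loss and inspect the best runs
best = runs.sort_values("metrics.train_loss").head(5)
print(best[["run_id", "params.lr", "metrics.train_loss"]])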
Using MLflow Model Registry
Systematic model versioning is essential for deploying or rolling back models in production. MLflow Model Registry supports this process. Here's an example of registering and transitioning a model version:
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Create the registered model entry once (skip this if it already exists)
client.create_registered_model("MyModel")

# Replace <RUN_ID> with the ID of the run that logged the model artifact
run_id = "<RUN_ID>"
model_uri = f"runs:/{run_id}/model"

model_version = client.create_model_version(
    name="MyModel",
    source=model_uri,
    run_id=run_id,
)

client.transition_model_version_stage(
    name="MyModel",
    version=model_version.version,
    stage="Production",
)
This enables easy tracking and movement of models across stages such as Staging, Production, and Archived.
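Once a version is in a given stage, downstream code can load the model by registered name and stage rather than by run ID, so promoting a new version requires no code change on the serving side. A minimal sketch using the generic pyfunc flavor:

import mlflow.pyfunc

# Load whichever version of "MyModel" is currently in the Production stage
model = mlflow.pyfunc.load_model("models:/MyModel/Production")

# predictions = model.predict(input_data)  # e.g. a pandas DataFrame or numpy array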
Conclusion
MLflow is a powerful tool for managing deep learning and machine learning experiments in a transparent and organized way. It integrates training, comparison, storage, and deployment into one workflow, enhancing reproducibility and collaboration. Widely adopted as a key MLOps component, MLflow is especially useful in automating the pipeline from model development to serving.