Reproducibility; experiment and metric tracking; model versioning and deployment.
Managing machine learning model development can be a non-trivial task, involving multiple steps: model selection, framework selection, data processing, metric optimization, and lastly, model packaging and deployment. An organized workflow makes model management less complicated and adds reproducibility to experiments.
MLflow is an open-source machine learning lifecycle management tool that facilitates organizing the workflow for training, tracking, and productionizing machine learning models. It is designed to work alongside most of the machine learning libraries and frameworks available today. According to the official website, MLflow currently offers four components:
Record and query experiments: code, data, config, and results
Package data science code in a format to reproduce runs on any platform
Deploy machine learning models in diverse serving environments
Store, annotate, discover, and manage models in a central repository

In the forthcoming sections, we will go over how all of these components can be leveraged to organize the machine learning workflow.
The MLflow Python package can be installed using pip or conda, whichever you prefer.
shell> pip install mlflow
If you are using Databricks, all the ML runtimes come with mlflow pre-installed, and it can readily be used to log model runs to DBFS storage from a Databricks notebook. To test the installation, run the mlflow command in the terminal:
shell> mlflow
You should get an output similar to this:
Usage: mlflow [OPTIONS] COMMAND [ARGS]...

Options:
  --version    Show the version and exit.
  --help       Show this message and exit.

Commands:
  azureml      Serve models on Azure ML.
  download     Downloads the artifact at the specified DBFS...
  experiments  Tracking APIs.
  pyfunc       Serve Python models locally.
  run          Run an MLflow project from the given URI.
  sagemaker    Serve models on SageMaker.
  sklearn      Serve SciKit-Learn models.
  ui           Run the MLflow tracking UI.
The Tracking component consists of a UI and APIs for logging parameters, code versions, metrics, and output files. MLflow runs are grouped into experiments so that the logs for different runs of an experiment can be tracked and compared, which also makes it possible to visualize and compare the logged parameters and metrics. MLflow provides simple API support for the most popular platforms, including Python, REST, R, and Java.
By default, mlflow uses local storage to run the tracking server. MLflow also provides the option to track runs on a remote server, which can be configured by calling mlflow.set_tracking_uri(). The tracking server can be assigned using an SQLAlchemy-compatible database link, a local file path, an HTTP server address, or a data lake path.
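For example, a minimal sketch of pointing the client at a tracking server (the server address here is hypothetical):

import mlflow

# Point the client at a remote tracking server (hypothetical address);
# a local path such as "file:./mlruns" or a database URI such as
# "sqlite:///mlflow.db" can be passed in the same way.
mlflow.set_tracking_uri("http://my-tracking-server:5000")
print(mlflow.get_tracking_uri())  # verify the active tracking URI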
The following snippet shows how to start a run and log parameters and metrics:
import os
import mlflow
from mlflow import log_metric, log_param, log_artifact

with mlflow.start_run() as run:
    # Log a parameter (key-value pair)
    log_param("param1", 5)

    # Log a metric; metrics can be updated throughout the run
    log_metric("foo", 1)
    log_metric("foo", 2)
    log_metric("foo", 3)

    # Log an artifact (output file)
    with open("output.txt", "w") as f:
        f.write("Hello world!")
    log_artifact("output.txt")
An artifact can be a file with model results or outputs; the log_artifact() method is used to log such files generated by a run. By default, MLflow stores all runs under the 'Default' experiment name. We can assign an experiment name using the set_experiment() method before calling start_run(), which will create the run under that experiment.
mlflow.set_experiment('MNIST')
MLflow also provides automatic experiment logging support for major machine learning frameworks, including TensorFlow, PyTorch, Gluon, XGBoost, LightGBM, Spark MLlib, and fastai. Autologging can be enabled by calling the autolog() function from the corresponding framework module of the mlflow package (e.g., mlflow.tensorflow.autolog()).
The following code snippet demonstrates how the autolog feature can be used with TensorFlow:
## MLflow Model Tracking and Versioning Example
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras import models
from tensorflow.keras import layers

import mlflow
from mlflow import pyfunc

# Define a TensorFlow Keras model and load data.
# We define a simple Convolutional Neural Network and train and predict on MNIST data.

# Hyperparameters
batch_size = 1024
epochs = 15
num_classes = 10
learning_rate = 0.01

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# Simple CNN model
model = models.Sequential()
model.add(layers.Conv2D(32, kernel_size=(3, 3),
                        activation='relu',
                        input_shape=(28, 28, 1)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D(pool_size=(2, 2)))
model.add(layers.Dropout(0.25))
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(num_classes, activation='softmax'))

# The training function takes the dataset and model definition,
# compiles the model, and sets some hyperparameters for tuning.
def train(learning_rate=1.0):
    # Adadelta optimizer
    optimizer = keras.optimizers.Adadelta(lr=learning_rate)

    # Compile keras model
    model.compile(optimizer=optimizer,
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    # Train model
    model.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              verbose=2,
              validation_data=(x_test, y_test))

# MLflow offers autologging bindings for deep learning frameworks
import mlflow.tensorflow
mlflow.tensorflow.autolog(every_n_iter=1)

# Start the run
with mlflow.start_run(run_name='mnist-run-2') as run:
    run_id = run.info.run_id
    # Train model
    train(learning_rate)
    # Get the URI of the trained model logged by autolog
    model_uri = mlflow.get_artifact_uri("model")

# Load and evaluate the trained model
import mlflow.keras
keras_model = mlflow.keras.load_model(model_uri)
loss, acc = keras_model.evaluate(x_test, y_test, verbose=0)
print(f"Model Evaluation Results:- loss: {loss:.4f} Acc: {acc:.4f}")
TensorFlow 2 MNIST training example with MLflow
To access the MLflow UI, run the following command in the terminal from the same directory as the code:
mlflow ui
If you are using a remote tracking server, the same tracking URI must be provided as the backend store URI when starting the MLflow UI. This can be done by passing an additional argument:
mlflow ui --backend-store-uri <path>
The MLflow UI can be accessed at: http://localhost:5000.
What have we done so far? We created a script that autologs the necessary parameters and metrics of a TensorFlow model training session into an MLflow run. The MLflow UI shows a list of all the runs for a selected experiment, with a brief description of each run in tabular format. The details of a run can be viewed by clicking on its timestamp.
MLflow's autolog feature automatically logs all the necessary parameters (epochs, batch size, optimizer, learning rate, etc.) and metrics (loss and accuracy for both training and validation data) during the run. It even logs the trained model, which can be seen in the Artifacts section of the run in the UI.
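The logged runs can also be queried programmatically. Here is a minimal sketch using mlflow.search_runs(), which returns the runs of the active experiment as a pandas DataFrame (the exact param/metric column names depend on what autolog recorded):

import mlflow

# Select the experiment created earlier and fetch its runs as a DataFrame;
# logged parameters appear as 'params.*' columns and metrics as 'metrics.*'
mlflow.set_experiment("MNIST")
runs = mlflow.search_runs()
print(runs[["run_id", "params.epochs", "metrics.accuracy"]].head())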
It is important to observe and understand how metrics change throughout a run, and visualizations are the best way to track metric values through the training process. MLflow facilitates this with simple automated plot generation inside the run UI: clicking on a metric brings up its plot.
Plot of training accuracy over time generated in mlflow UI
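The raw per-step values behind these plots can also be pulled through the client API, e.g. with MlflowClient.get_metric_history(). A small sketch, assuming run_id was captured from the run above and that autolog recorded a metric named 'accuracy':

from mlflow.tracking import MlflowClient

client = MlflowClient()
# Each entry in the history carries a step, a timestamp, and a value
for m in client.get_metric_history(run_id, "accuracy"):
    print(m.step, m.value)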
An MLflow Project is a format for packaging data science code in a reusable and reproducible way, based primarily on conventions.
Essentially, an MLflow Project is a convention for organizing machine learning code, along with an API and command-line tools for running projects. Each project is simply a directory of files, or a Git repository, containing your code. This makes it possible to chain multiple projects together into workflows.
Each project contains an MLproject file, which may look something like this:
name: keras-mnist
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      batch_size: {type: int, default: 100}
      epochs: {type: int, default: 1000}
    command: "python train.py --batch_size={batch_size} --epochs={epochs}"
The MLproject file defines the name of the project, the environment used to run the project, and the command to execute. The conda.yaml file defines the environment dependencies for the project; it can easily be generated from an existing conda environment and looks something like this:
name: keras-mnist
channels:
  - defaults
  - anaconda
  - conda-forge
dependencies:
  - python=3.6
  - pip
  - pip:
    - mlflow
    - tensorflow==2.3.0
MLflow supports Docker environments and system environments as well. More information on this is available at https://www.mlflow.org/docs/latest/projects.html#project-environments.
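For instance, a Docker environment can be declared in place of conda_env in the MLproject file. A minimal sketch (the image name here is hypothetical and must exist locally or in a registry):

name: keras-mnist
docker_env:
  image: keras-mnist-image  # hypothetical image containing the project dependencies
entry_points:
  main:
    command: "python train.py"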
The project can be executed by using the mlflow run command in the terminal from the same directory:
shell> mlflow run .
This will build the conda environment and execute the command specified in the MLproject file. Inference scripts can similarly be packaged into a project.
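Projects can also be launched from Python via mlflow.projects.run(), which is what makes chaining runs into a multi-step workflow straightforward. A minimal sketch (the parameter values are illustrative):

import mlflow

# Run the 'main' entry point of the project in the current directory;
# this builds the conda environment and executes the configured command
submitted = mlflow.projects.run(
    uri=".",
    entry_point="main",
    parameters={"batch_size": 256, "epochs": 5},
)
print(submitted.run_id)  # the tracking run created for this project run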
An MLflow Model is a standard format for packaging machine learning models that can be used in a variety of downstream tools.
Using the MLflow Model format, models from various frameworks can be stored in a standard format and consumed in various ways, including real-time serving through a REST API, batch inference on Apache Spark, or even as a generic Python function.
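As a sketch of the python_function route, any logged model can be loaded back in a framework-agnostic form (model_uri here is the artifact URI captured in the training example; the accepted input type depends on the model flavor and MLflow version):

import mlflow.pyfunc

# Load the model as a generic python_function model and run inference
pyfunc_model = mlflow.pyfunc.load_model(model_uri)
predictions = pyfunc_model.predict(x_test)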
Similar to MLflow Projects, an MLflow Model contains two config files: MLmodel and conda.yaml, which hold the model and environment configurations, respectively.
The MLmodel file contains the following:
artifact_path: model
flavors:
  keras:
    data: data
    keras_module: keras
    keras_version: 2.4.3
  python_function:
    data: data
    env: conda.yaml
    loader_module: mlflow.keras
    python_version: 3.7.7
run_id: e256210d0ed94b4886efcfdf6f95aac3
utc_time_created: '2020-09-15 07:08:50.850162'
Here, data is the directory containing the model files in the native flavor format, which in this case is a Keras HDF5 (.h5) model file.
The autologging feature also writes the model to the run directory, and the path to the model can be used to serve it as a REST API:
mlflow models serve -m <mlflow_model_uri>
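Once the server is up (it listens on port 5000 by default; use -p to change it), predictions can be requested over HTTP. A sketch of a scoring request, assuming the pandas-split JSON format accepted by MLflow 1.x scoring servers (the payload below is illustrative, not a real MNIST input):

curl -X POST http://localhost:5000/invocations \
  -H "Content-Type: application/json; format=pandas-split" \
  -d '{"columns": ["x"], "data": [[1.0]]}'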
The MLflow Model Registry component is a centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of an MLflow Model.
An MLflow Model can be registered to the centralized model registry, which provides a convenient way to maintain model versions, annotate different versions, and track their stages (staging, production, and archived).
A registered model has a unique name and contains versions, associated transitional stages, model lineage, and other metadata. An MLflow Model can be registered either through the UI workflow or using the Python API:
from mlflow.tracking import MlflowClient

# Create a registered model
client = MlflowClient()
client.create_registered_model("MNIST-Keras")

# Create a model version
result = client.create_model_version(
    name="MNIST-Keras",
    source="./mlruns/0/e256210d0ed94b4886efcfdf6f95aac3/artifacts/model",
    run_id="e256210d0ed94b4886efcfdf6f95aac3"
)
The create_registered_model() method creates a new registered model in the model registry, and the create_model_version() method creates a new version of it. The latter takes three parameters: name, source, and run_id, where source is the path to the logged MLflow Model.
Another way to do this is using the register_model API:
mlflow.register_model(
    "runs:/e256210d0ed94b4886efcfdf6f95aac3/model",
    "MNIST-Keras"
)
If a registered model with the provided name does not exist, MLflow creates a new one with that name.
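A model can also be registered at logging time, by passing registered_model_name to the framework's log_model() call. A minimal sketch with the Keras model from earlier:

import mlflow.keras

with mlflow.start_run():
    # Logs the model under artifact path 'model' and registers it
    # (or adds a new version) under the 'MNIST-Keras' name in one step
    mlflow.keras.log_model(model, "model", registered_model_name="MNIST-Keras")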
Model stage transition is another useful feature that MLflow provides. As the model evolves, its stage can be updated:
client = MlflowClient()
client.transition_model_version_stage(
    name="MNIST-Keras",
    version=1,
    stage="Production"
)
The above command will update the stage of version 1 of the 'MNIST-Keras' model to Production.

Registered Model UI
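Once stages are assigned, a model version can be resolved by stage rather than by number. A minimal sketch that loads whatever version is currently in Production:

import mlflow.pyfunc

# 'models:/<name>/<stage>' resolves to the latest version in that stage
production_model = mlflow.pyfunc.load_model("models:/MNIST-Keras/Production")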
A registered model can be served using the mlflow CLI:
#!/usr/bin/env sh
# Set environment variable for the tracking URL where the Model Registry resides
export MLFLOW_TRACKING_URI=http://localhost:5000
# Serve the production model from the model registry
mlflow models serve -m "models:/MNIST-Keras/Production"
The MLFLOW_TRACKING_URI environment variable should point to the tracking server (described in the MLflow Tracking section) where the model registry resides.
Thank you for reading this post! I have tried to cover all the major components of MLflow's machine learning management toolkit. Aside from the areas covered here, MLflow also provides deployment APIs for various infrastructures, including AWS SageMaker, Microsoft Azure, and Databricks clusters. In future posts, we will show how to leverage the MLflow deployment APIs to deploy machine learning models to production on one of these major infrastructure options.