Module 3: Model Development & Versioning
This module focuses on the crucial aspects of developing, tracking, and managing machine learning models to ensure reproducibility and maintainability.
3.1 Data Versioning & Pipeline Reproducibility
Reproducibility in machine learning starts with reproducible data. This section covers strategies for versioning your datasets and ensuring that your entire machine learning pipeline can be reliably recreated.
3.1.1 Data Versioning Tools
DVC (Data Version Control): DVC is an open-source version control system for machine learning projects. It works alongside Git, allowing you to version large datasets and models without bloating your Git repository. DVC tracks data and model files using lightweight pointers stored in Git, while the actual data resides in remote storage (e.g., cloud storage, network drives).
Key Features:
Version control for large files.
Integration with Git.
Support for various remote storage solutions.
Pipeline definition and execution.
Experiment tracking.
Basic Workflow:
Initialize DVC:
dvc init
Add data to DVC:
dvc add data/my_dataset.csv
This creates a data/my_dataset.csv.dvc file, which is a small pointer file.
Commit data pointers to Git:
git add data/.gitignore data/my_dataset.csv.dvc .dvc/config
git commit -m "Add dataset versioning"
Push data to remote storage (this assumes a DVC remote has already been configured, e.g. with dvc remote add):
dvc push
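Once the data has been pushed, the versioned file can be read back programmatically from any clone of the repository. Below is a minimal sketch using DVC's Python API; it assumes the dvc package is installed, the code runs inside the repository, and that a Git revision such as the tag "v1.0" (hypothetical here) points at the dataset version you want.

import pandas as pd
import dvc.api

# Open the DVC-tracked file at a specific Git revision.
# "v1.0" is a hypothetical tag; any branch, tag, or commit works.
with dvc.api.open("data/my_dataset.csv", rev="v1.0") as f:
    df = pd.read_csv(f)

print(df.head())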
3.1.2 Pipeline Reproducibility
DVC Pipelines: DVC allows you to define your ML pipeline as a directed acyclic graph (DAG) of dependencies. This ensures that each step in your pipeline is reproducible. When you run dvc repro, DVC intelligently executes only the necessary steps based on changes in input data or code.
Defining a Pipeline: Pipelines are defined in a dvc.yaml file.
stages:
  prepare_data:
    cmd: python scripts/prepare_data.py --input data/raw.csv --output data/processed.csv
    deps:
      - data/raw.csv
      - scripts/prepare_data.py
    outs:
      - data/processed.csv
  train_model:
    cmd: python scripts/train.py --data data/processed.csv --model models/model.pkl
    deps:
      - data/processed.csv
      - scripts/train.py
    outs:
      - models/model.pkl
Running the Pipeline:
dvc repro
This command will execute the defined stages in the correct order, respecting dependencies.
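To make the prepare_data stage above concrete, here is a minimal sketch of what scripts/prepare_data.py could look like. Only the --input/--output interface comes from the dvc.yaml stage; the cleaning steps (dropping empty rows and duplicates) are illustrative assumptions.

import argparse
import pandas as pd

def prepare(input_path, output_path):
    """Read the raw CSV, apply simple cleaning, and write the processed CSV."""
    df = pd.read_csv(input_path)
    df = df.dropna().drop_duplicates()  # placeholder cleaning; adapt to your data
    df.to_csv(output_path, index=False)
    print(f"Wrote {len(df)} rows to {output_path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Prepare raw data for training.")
    parser.add_argument("--input", required=True, help="Path to the raw CSV file.")
    parser.add_argument("--output", required=True, help="Path for the processed CSV file.")
    args = parser.parse_args()
    prepare(args.input, args.output)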
3.2 Experiment Tracking with MLFlow/DVC
Keeping track of your experiments is vital for understanding model performance, comparing different approaches, and debugging. Both MLFlow and DVC offer robust solutions for experiment tracking.
3.2.1 MLFlow
MLFlow is an open-source platform for managing the ML lifecycle, including experimentation, reproducibility, and deployment.
Key Features:
MLFlow Tracking: Log parameters, code versions, metrics, and output files.
MLFlow Projects: Package your code in a reusable and reproducible format.
MLFlow Models: Package models in a standard format that can be used in various downstream tools.
MLFlow Registry: Centralized model store for managing model lifecycles.
Basic Tracking Workflow:
Install MLFlow:
pip install mlflow
Start tracking in your Python script:
import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Start an MLFlow run
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("solver", "liblinear")
    mlflow.log_param("C", 0.1)

    # Load data (example)
    X, y = ...  # Your data loading here
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Train model
    model = LogisticRegression(solver="liblinear", C=0.1)
    model.fit(X_train, y_train)

    # Make predictions and evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    # Log metrics
    mlflow.log_metric("accuracy", accuracy)

    # Log the model
    mlflow.sklearn.log_model(model, "logistic_regression_model")

    print(f"Logged run with accuracy: {accuracy}")
    print(f"MLFlow run ID: {mlflow.active_run().info.run_id}")
Launch the MLFlow UI to view experiments:
mlflow ui
Open your browser to http://localhost:5000 to see the logged experiments.
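The MLFlow Registry listed under Key Features can then be used to promote a logged model to a named, versioned entry. A minimal sketch, assuming the tracking server is backed by a registry-capable store and reusing the run ID and artifact path from the snippet above (the run ID placeholder is hypothetical):

import mlflow

# Register the model logged above under a central name.
run_id = "<your-run-id>"  # taken from the earlier tracking run
model_uri = f"runs:/{run_id}/logistic_regression_model"
registered = mlflow.register_model(model_uri, "logistic_regression_model")
print(f"Registered model version: {registered.version}")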
3.2.2 DVC with Experiment Tracking
DVC also provides experiment tracking capabilities, often integrated with its pipeline and data versioning features. DVC experiments allow you to quickly iterate on different model configurations and compare results.
Key Features:
DVC Experiments: Record and manage changes to parameters, metrics, and outputs for different model runs.
Branching: Create Git branches to isolate experiments.
Metrics Comparison: Easily compare metrics across different experiment runs.
Basic Experiment Tracking Workflow:
Define an experiment script (e.g., scripts/train.py) that logs metrics and outputs. This script should be compatible with DVC's pipeline definitions.
Set up the project for experiments (this creates a boilerplate stage in dvc.yaml if you do not already have one):
dvc exp init
Create a new experiment run:
dvc exp run -n my_first_experiment
This command will execute the pipeline defined in dvc.yaml and record the results as an experiment. You can also specify parameters to change for the experiment:
dvc exp run -n experiment_with_different_params -S model.learning_rate=0.005
View experiments:
dvc exp show
This will display a table of your experiments, their parameters, and metrics.
Compare experiments:
dvc exp diff my_first_experiment experiment_with_different_params
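For dvc exp show and dvc exp diff to have parameters and metrics to display, the training script must write them where DVC can find them. One common approach is DVCLive; the following is a minimal sketch assuming the dvclive package is installed and a params.yaml file containing a model.learning_rate entry (matching the -S flag used above).

import yaml
from dvclive import Live

# Read the parameter that `dvc exp run -S model.learning_rate=...` overrides.
with open("params.yaml") as f:
    params = yaml.safe_load(f)
learning_rate = params["model"]["learning_rate"]

with Live() as live:
    live.log_param("learning_rate", learning_rate)
    # ... train and evaluate the model here ...
    accuracy = 0.0  # placeholder; replace with the real evaluation result
    live.log_metric("accuracy", accuracy)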
3.3 Model Training Scripts with Best Practices
Writing clean, modular, and well-documented model training scripts is crucial for maintainability and collaboration.
3.3.1 Script Structure
A typical model training script should include:
Imports: All necessary libraries.
Argument Parsing: Use libraries like argparse to make scripts configurable.
Data Loading and Preprocessing: Load and prepare data. This can often be a separate function or module.
Model Definition: Instantiate your model.
Training: Train the model.
Evaluation: Evaluate the model's performance.
Saving: Save the trained model and any necessary artifacts (e.g., scalers, encoders).
Logging: Log parameters, metrics, and potentially the model itself (e.g., with MLFlow).
3.3.2 Best Practices
Modularity: Break down the training process into functions (e.g., load_data, preprocess, train_model, evaluate).
Configuration: Use command-line arguments (argparse) or configuration files (YAML, JSON) to manage hyperparameters and paths.
Reproducibility:
Set random seeds (random.seed(), np.random.seed(), tf.random.set_seed()); a small helper sketch appears after this list.
Use specific versions of libraries.
Track your code with Git.
Logging: Log key information (hyperparameters, metrics, input data versions) consistently.
Error Handling: Implement robust error handling and informative error messages.
Testing: Consider writing unit tests for critical functions (e.g., data preprocessing).
Docstrings: Add clear docstrings to functions and classes to explain their purpose, arguments, and return values.
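As referenced in the Reproducibility item above, here is a small seed-setting helper. Only the standard library and NumPy are assumed; the TensorFlow call is guarded because that dependency may not be installed.

import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Seed the common sources of randomness so repeated runs give the same results."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf  # optional dependency
        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow not installed; nothing to seed

set_seed(42)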
Example train.py snippet:
import argparse
import os
import yaml
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib  # For saving models
import mlflow


def load_config(config_path):
    """Loads configuration from a YAML file."""
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    return config


def train_model(data_path, model_save_path, params):
    """Trains a RandomForestClassifier and saves the model."""
    print(f"Loading data from: {data_path}")
    df = pd.read_csv(data_path)

    # Assuming 'target' is the column to predict
    X = df.drop('target', axis=1)
    y = df['target']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=params['test_size'], random_state=params['random_state']
    )

    print("Training model...")
    model = RandomForestClassifier(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        random_state=params['random_state']
    )
    model.fit(X_train, y_train)

    print("Evaluating model...")
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.4f}")

    # Log metrics with MLFlow
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_params({
        "n_estimators": params['n_estimators'],
        "max_depth": params['max_depth'],
        "test_size": params['test_size'],
        "random_state": params['random_state']
    })

    # Save the model
    os.makedirs(os.path.dirname(model_save_path), exist_ok=True)
    joblib.dump(model, model_save_path)
    print(f"Model saved to: {model_save_path}")
    return accuracy


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a machine learning model.")
    parser.add_argument("--data", required=True, help="Path to the processed data CSV file.")
    parser.add_argument("--model-output", default="models/rf_model.joblib", help="Path to save the trained model.")
    parser.add_argument("--config", default="config/train_params.yaml", help="Path to the training parameters YAML file.")
    args = parser.parse_args()

    # Load configuration
    train_params = load_config(args.config)

    # Start MLFlow run
    with mlflow.start_run():
        train_model(args.data, args.model_output, train_params)
Example config/train_params.yaml:
test_size: 0.2
random_state: 42
n_estimators: 100
max_depth: 10
3.4 Setting Up Virtual Environments and Dependency Tracking
Virtual environments isolate your project's dependencies, preventing conflicts with other Python projects. Proper dependency tracking ensures that your project can be reliably set up on any machine.
3.4.1 Virtual Environments
venv (Built-in): Python's standard library module for creating lightweight virtual environments.
Creating a virtual environment:
python -m venv .venv
This command creates a .venv directory in your project root.
Activating the environment:
On Windows:
.venv\Scripts\activate
On macOS/Linux:
source .venv/bin/activate
Your terminal prompt will usually change to indicate the active environment (e.g., (.venv) $).
Deactivating the environment:
deactivate
conda (Anaconda/Miniconda): A popular package and environment manager, especially for data science.
Creating a conda environment:
conda create --name myenv python=3.9
Activating the environment:
conda activate myenv
Deactivating the environment:
conda deactivate
3.4.2 Dependency Tracking
pip freeze > requirements.txt: This command captures all installed packages in the current Python environment and their exact versions, saving them to a requirements.txt file.
Generating requirements.txt:
# Ensure your virtual environment is activated
pip freeze > requirements.txt
Installing dependencies from requirements.txt:
# Create a new virtual environment and activate it first
pip install -r requirements.txt
conda env export > environment.yaml: For conda environments, this exports the environment's configuration, including Python version and channel information, into an environment.yaml file.
Generating environment.yaml:
# Ensure your conda environment is activated
conda env export > environment.yaml
Creating an environment from environment.yaml:
conda env create -f environment.yaml
Poetry/Pipenv (More advanced): Tools like Poetry and Pipenv offer more sophisticated dependency management, including dependency resolution, locking, and virtual environment management within the project. They typically use pyproject.toml (Poetry) or Pipfile (Pipenv) for dependency specifications. These are recommended for larger, more complex projects.
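Whichever tool you use, it can also help to record the exact library versions alongside each experiment run so the environment that produced a model is never in doubt. A minimal sketch, assuming MLFlow is installed and that recording versions of scikit-learn, pandas, and NumPy (an illustrative package list) is enough for your project:

from importlib.metadata import version, PackageNotFoundError
import mlflow

def log_library_versions(packages):
    """Look up installed versions of the given packages and log them as an MLFlow artifact."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = version(pkg)
        except PackageNotFoundError:
            versions[pkg] = "not installed"
    mlflow.log_dict(versions, "library_versions.json")
    return versions

if __name__ == "__main__":
    with mlflow.start_run():
        print(log_library_versions(["scikit-learn", "pandas", "numpy"]))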