Tracking Model Experiments
7 February 2025 | 10 min | Marcus Gawronsky

Experiment tracking is a key component of machine learning (ML) workflows, helping ensure reproducibility, collaboration, versioning, and performance monitoring. This post explores various experiment tracking tools, their use cases, and best practices for integrating them into production workflows.
Definitions
Experiment Tracking refers to the practice of systematically documenting and monitoring machine learning experiments to ensure reproducibility and informed decision-making. It may record details of the training Pipeline alongside system information: the infrastructure (hardware), environment (software), code version, random seeds, resource utilization, and logging (observability).
In order to train and evaluate a model, a training Pipeline needs to feed data to the model, evaluate its performance on that data, and use those results to improve the model.
Machine learning training datasets can be large, comprising mixtures of cleaned data from different sources. For both compliance and reproducibility, the lineage of this data may be important to capture alongside versioned changes to the data over time. As models may have unique requirements on how this data is preprocessed, researchers must document the augmentation, normalization, encoding, or standardization applied, alongside the parameters needed to process new observations in the future. Due to the size of this data, serialization becomes important: the access and storage requirements may vary greatly from many production workloads, requiring techniques in denormalization, partitioning, data compression, and storage optimization.
As business use-cases rely on model performance, Evaluation is critical to ensuring the quality of a model. While the selection of losses (used in computing gradients) and metrics (used in quantifying real-world performance) plays a crucial role in guiding both training and development, quality assurance plays an equally important role in understanding edge cases and possible biases introduced in parts of the Pipeline.
Cross-validation techniques, such as k-fold, stratified sampling, and time-series-based validation, help estimate model performance on unseen data. Choosing the right method depends on the dataset characteristics and the target application. Here, versioning and lineage tracking are important, as researchers should be able to reproduce these development datasets across systems.
The choice of Task (classification, clustering, regression, ...) or Paradigm (supervised, unsupervised, self-supervised, ...) is closely related to the availability and choice of data and architecture. As models get trained, versions of the models should be retained to recover from outages or loss spikes.
Due to the size of these models, formats for model serialization can vary, and can affect both I/O performance and safety (Pickle is not safe). Artefacts such as inference code, model hyperparameters, and metadata may need to be versioned alongside the model to ensure resumability and reusability.
Using evaluation metrics, Hyperparameter Optimization techniques such as Hyperband, Bayesian Optimization, Genetic Algorithms, Randomized Search, or Exhaustive Search (Grid Search) can be used to explore the search space of both model and optimizer hyperparameters, as well as evaluation and loss functions.
Landscape
1. Experiment Tracking
Metrics tracking tools support different storage backends and interfaces, ranging from local file logging to cloud-hosted dashboards.
TensorBoard integrates well with PyTorch and provides visualization tools, though some practitioners find its feature set limited compared to newer platforms. Alternatives such as Weights & Biases, Comet, and MLflow offer cloud-based and local tracking solutions, while Neptune and AzureML provide proprietary options with deep cloud integration. Tools like CodeCarbon and ClearML add specialized capabilities for monitoring carbon emissions and managing large-scale ML workflows.
File-based logging works well for tracking logs alongside model artefacts, though it may not be as durable or scale as well to large teams.
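As a minimal sketch of what metrics tracking looks like in practice, the snippet below logs scalars with PyTorch's TensorBoard writer; the log directory, metric names, and the dummy training loop are illustrative assumptions rather than prescribed conventions.

```python
from torch.utils.tensorboard import SummaryWriter

# Hypothetical log directory; TensorBoard reads runs from here.
writer = SummaryWriter(log_dir="runs/experiment-001")

for step in range(100):
    # Placeholder values standing in for real training/validation losses.
    train_loss = 1.0 / (step + 1)
    val_loss = 1.2 / (step + 1)
    writer.add_scalar("loss/train", train_loss, global_step=step)
    writer.add_scalar("loss/val", val_loss, global_step=step)

# Hyperparameters can be logged alongside final metrics for comparison across runs.
writer.add_hparams({"lr": 1e-3, "batch_size": 32}, {"hparam/final_val_loss": val_loss})
writer.close()
```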
2. Training Pipeline
Many frameworks, such as Ray, PyTorch Lightning, and the HuggingFace Trainer, offer a training pipeline which supports various training configurations. Unlike custom training loops in PyTorch, a Training Pipeline may come with distribution, callbacks, HPO, experiment tracking, and quantization baked in. Users can subclass these Trainers to add custom training loops, optimizers, metrics, or loss functions.
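For illustration, a minimal PyTorch Lightning sketch is shown below; the toy model, synthetic dataset, and hyperparameters are assumptions made for brevity, not a recommended configuration.

```python
import torch
import pytorch_lightning as pl
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class TinyRegressor(pl.LightningModule):
    def __init__(self, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()          # hyperparameters are recorded by the Trainer
        self.net = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.net(x), y)
        self.log("train_loss", loss)         # forwarded to the configured experiment logger
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

# Synthetic data standing in for a real dataset.
x, y = torch.randn(256, 8), torch.randn(256, 1)
loader = DataLoader(TensorDataset(x, y), batch_size=32)

trainer = pl.Trainer(max_epochs=2, log_every_n_steps=1)
trainer.fit(TinyRegressor(), loader)
```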
3. Data
3.1. Data Storage
Efficient and reliable storage solutions are essential for managing the large volumes of data typically involved in machine learning experiments. Here are some common storage options:
- Blob Storage: Blob storage services like Amazon S3, Google Cloud Storage, and Azure Blob Storage offer scalable and cost-effective solutions for storing large amounts of unstructured data. They provide high availability, durability, and integration with various data processing tools.
- Local Storage: Local storage refers to using the storage capacity of the machine running the experiments. While it offers fast access speeds, it is limited by the machine's storage capacity and is not suitable for large-scale or distributed experiments.
- Network File System (NFS): NFS allows multiple machines to access shared storage over a network. It is useful for distributed training setups where multiple nodes need to read and write data concurrently. However, NFS performance can be affected by network latency and bandwidth.
Choosing the right storage solution depends on factors such as cost, throughput, and latency. Options range from local disk storage (e.g., ZFS, Ext4, Btrfs) to networked or cloud-based solutions, with different trade-offs for large-scale machine learning datasets.
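As a sketch of how these backends can be abstracted in code, the example below uses fsspec to read the same file from local disk or blob storage; the bucket and paths are hypothetical, and the S3 call assumes s3fs and credentials are available.

```python
import fsspec

# The same open() call works across backends; only the URL scheme changes.
# Local file (hypothetical path):
with fsspec.open("file:///data/train.parquet", "rb") as f:
    local_bytes = f.read()

# Blob storage (hypothetical bucket; requires s3fs and valid credentials):
with fsspec.open("s3://my-ml-datasets/train.parquet", "rb") as f:
    remote_bytes = f.read()
```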
3.2. Data Registry
A registry is a system that provides metadata management and APIs for accessing and versioning data stored in various storage solutions. It acts as a catalog that keeps track of datasets, enabling efficient organization, retrieval, and sharing of resources. Some key features of a registry include:
- Metadata Management: Registries store metadata about datasets, such as descriptions, versions, and dependencies. This information helps users understand the context and usage of the stored artifacts.
- APIs: Registries provide APIs for programmatically accessing and managing stored artifacts. These APIs enable seamless integration with other tools and workflows, facilitating automation and reproducibility.
- Versioning: Some registries offer versioning capabilities, allowing users to track changes to datasets and models over time. This feature is crucial for maintaining reproducibility and understanding the evolution of experiments.
Examples of registries include the Hugging Face Hub and Data Version Control (DVC), but a registry could also be a database, data warehouse, delta lake, or data lake system. These tools provide robust solutions for managing and versioning data artifacts.
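As an illustrative sketch, the snippet below pulls a dataset from the Hugging Face Hub pinned to a specific revision; the dataset identifier and revision are placeholders.

```python
from datasets import load_dataset

# Pinning a revision (a commit SHA, branch, or tag on the Hub) makes the
# exact data version part of the experiment record.
dataset = load_dataset(
    "imdb",             # placeholder dataset identifier
    revision="main",    # pin to a specific commit SHA in real experiments
)
print(dataset)
```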
3.3. Data Serialization
Efficient data storage formats are essential for handling multimodal datasets. The choice of format impacts data retrieval speed, compression, and ease of analysis.
| Format | Structure | Performance | Compression | Suitability |
|---|---|---|---|---|
| JSONL | Line-based | Medium | Low | Human-readable but not optimized for large-scale ML |
| CSV | Tabular | Medium | Low | Simple but lacks support for complex data types |
| Apache Arrow | Columnar | High | Medium | Fast in-memory processing |
| TFRecords | Row-oriented | High | Medium | Optimized for TensorFlow-based training |
| Avro | Row-oriented | High | Medium | Schema evolution support |
| Parquet | Columnar | High | High | Best trade-off between compression, speed, and flexibility |
Parquet can work well for deep learning workloads since it can store multimodal data efficiently while offering strong compression and flexible query performance. Because Parquet organizes data into row groups and pages, files can be chunked to allow easier memory management and parallelization.
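A minimal sketch of writing and reading Parquet with PyArrow is shown below; the column names, compression codec, and row-group size are illustrative choices rather than recommendations.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy table standing in for a real (possibly multimodal) dataset.
table = pa.table({
    "text": ["a short example", "another sample"],
    "label": [0, 1],
})

# Row-group size and compression both affect scan speed and file size.
pq.write_table(table, "train.parquet", compression="zstd", row_group_size=1_000)

# Column projection: read only what the training job needs.
labels = pq.read_table("train.parquet", columns=["label"])
print(labels.to_pydict())
```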
3.4. Data Preprocessing
Processing and augmentation can be a complex topic. For preprocessing, Ray Data, HuggingFace datasets, and even DuckDB offer competitive options for preparing and optimizing data for training workloads.
For augmentation, tools like torchvision and nlpaug offer strong tooling and can easily be extended by defining custom classes for use in data augmentation.
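As a brief sketch, the pipeline below composes standard torchvision augmentations with a custom transform class; the specific transforms and parameters are illustrative assumptions.

```python
import random
from torchvision import transforms

class RandomCutout:
    """Illustrative custom augmentation: blanks out a small square region of a tensor image."""
    def __init__(self, size: int = 8, p: float = 0.5):
        self.size, self.p = size, p

    def __call__(self, img_tensor):
        if random.random() < self.p:
            _, h, w = img_tensor.shape
            top = random.randint(0, h - self.size)
            left = random.randint(0, w - self.size)
            img_tensor[:, top:top + self.size, left:left + self.size] = 0.0
        return img_tensor

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    RandomCutout(size=8, p=0.5),
])
```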
4. Model
4.1. Model Storage
Model storage requirements depend largely on model and training requirements. For large distributed training jobs, weights may need to be accumulated and synchronized across training nodes. Similar to data storage, a number of design decisions around cloud, network, and local storage can be made, alongside filesystem, network, and hardware decisions.
4.2. Model Registry
While models are often smaller than their training datasets, metadata, authentication, and access APIs can be useful tools on top of versioning and storage to aid in container builds and deployment.
4.3. Model Serialization
Model serialization formats vary in efficiency, security, and interoperability. The two primary formats used in ML are Pickle and safetensors.
| Format | Performance | Security | Compatibility | Use Case |
|---|---|---|---|---|
| Pickle | High | Low | Python-only | General model serialization, but susceptible to code execution vulnerabilities |
| safetensors | High | High | Cross-platform | Optimized for secure and efficient tensor storage, particularly for deep learning models |
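A minimal sketch of saving and loading weights with safetensors is shown below; the model and file name are placeholders.

```python
import torch
from torch import nn
from safetensors.torch import save_file, load_file

model = nn.Linear(8, 2)  # placeholder model

# Serializes only tensors (no arbitrary code), unlike pickle-based torch.save.
save_file(model.state_dict(), "model.safetensors")

# Loading returns a plain dict of tensors that can be fed to load_state_dict.
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict)
```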
4.4. Model Artefacting
While frameworks allow models to be defined in code, meta-frameworks may separate out concerns related to:
- Model Config (Hyperparameters)
- Model Instantiation & Forward methods
- Weights Serialization
- Text Tokenization, Image Processing and Tabular Data Normalization/Encoding
Tools and frameworks in this space include:
- ONNX: couples the weights and the forward-pass computational graph, but makes new ops hard to include in the framework,
- HuggingFace's transformers: separates these concerns into PretrainedConfig, PreTrainedModel, safetensors, and various mixins (e.g., ImageProcessingMixin),
- GGML and GGUF: GGML, the predecessor to GGUF, was designed for efficient, low-memory inference. GGUF, now used by tools like Ollama, llama.cpp, and GPT4All, typically couples configuration and weights into a single optimized file format for rapid inference, though some implementations may still provide hooks for decoupling tokenization or custom preprocessing steps.
- DDUF: Used by HuggingFace Diffusers, DDUF focuses on decoupling the configuration, weights, inference code, and tokenization. This separation promotes flexibility in modifying or fine-tuning individual components without reworking the entire model artefact.
The best tool for model artifacting depends on factors like serving infrastructure, model size, supported tasks, and compatibility with auto-differentiation frameworks. Resources such as official documentation and community blogs can help guide this decision.
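As a hedged illustration of the transformers artefacting pattern, the sketch below loads and re-saves the separated components; the checkpoint identifier and output directory are placeholders.

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

checkpoint = "bert-base-uncased"   # placeholder checkpoint identifier

# Config, weights, and tokenizer are separate, individually versioned artefacts.
config = AutoConfig.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, config=config)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# save_pretrained writes config.json, the weights (safetensors), and tokenizer
# files into one directory that can be versioned or pushed to a hub.
model.save_pretrained("artefacts/my-model")
tokenizer.save_pretrained("artefacts/my-model")
```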
4.5. Model Versioning
Model versioning approaches include using Git Large File Storage (HuggingFace, Git-LFS, and DVC) or database-backed storage in cloud platforms (MLflow, Weights & Biases, Comet, etc.). Generally, the Git-LFS-based systems offer better version control capability, but may require additional tooling and complexity in setting up git submodules.
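As a sketch of the hub-based approach, the snippet below pushes a local model directory to a hypothetical Hugging Face Hub repository, where each upload becomes a git commit that can later be referenced by revision; it assumes an authenticated token with write access.

```python
from huggingface_hub import HfApi

api = HfApi()

# Hypothetical repository identifier.
api.create_repo(repo_id="my-org/my-model", exist_ok=True)
commit = api.upload_folder(
    folder_path="artefacts/my-model",
    repo_id="my-org/my-model",
    commit_message="Add fine-tuned checkpoint",
)
print(commit)  # commit info, usable as a pinned revision in later loads
```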
5. Task & Paradigm
The choice of task (e.g., classification, regression, clustering) and paradigm (e.g., supervised, unsupervised, self-supervised) is heavily influenced by the nature of the data and the architecture of the model. For instance, classification tasks typically require labeled data and architectures like convolutional neural networks (CNNs) for image data, while unsupervised tasks such as manifold or representation learning might use autoencoders or other forms of dimensionality reduction. The paradigm dictates the training approach, with supervised learning relying on labeled datasets to guide the model, whereas unsupervised learning seeks to identify patterns without explicit labels. The alignment between data characteristics and model architecture is crucial for optimizing performance and achieving reliable results.
Additionally, tasks such as reinforcement learning or preference optimization often rely on simulated environments (e.g., OpenAI Gym) and may require different training loops and tools. Frameworks like ray[RLlib], torchrl, or Hugging Face's trl provide specialized support for these paradigms. These tools offer capabilities for environment interaction, policy optimization, and reward management, which are essential for training agents in simulated settings. The choice of tools and frameworks can significantly impact the efficiency and effectiveness of the training pipeline for these specialized tasks.
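To make the environment-interaction loop concrete, a minimal sketch with Gymnasium (the maintained successor to OpenAI Gym) follows; the environment name and random policy are illustrative assumptions.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")      # placeholder environment
observation, info = env.reset(seed=42)

episode_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()   # random policy standing in for a learned one
    observation, reward, terminated, truncated, info = env.step(action)
    episode_reward += reward
    if terminated or truncated:
        break

print(f"Episode reward: {episode_reward}")
env.close()
```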
6. Evaluation
Evaluation is a critical component of the machine learning pipeline, ensuring that models perform well on unseen data and meet the desired performance criteria. This section covers the key aspects of evaluation, including losses, metrics, and quality assurance (QA).
6.1 Losses
Loss functions are used to guide the training process by quantifying the difference between the model's predictions and the actual target values. PyTorch provides a wide range of built-in loss functions, such as CrossEntropyLoss for classification tasks and MSELoss for regression tasks. Hugging Face extends this by offering additional loss functions tailored for specific models and tasks.
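A small sketch of both losses on dummy tensors is shown below; the shapes and values are placeholders.

```python
import torch
from torch import nn

# Classification: logits of shape (batch, num_classes) vs. integer class targets.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])
ce = nn.CrossEntropyLoss()(logits, targets)

# Regression: predictions vs. continuous targets of the same shape.
preds = torch.randn(4, 1)
values = torch.randn(4, 1)
mse = nn.MSELoss()(preds, values)

print(ce.item(), mse.item())
```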
6.2 Metrics
Metrics are used to evaluate the performance of a model on a given task. Unlike loss functions, which are used during training to optimize the model, metrics are used to assess the model's performance on validation and test data. Common metrics include accuracy, precision, recall, F1 score for classification tasks, and mean absolute error (MAE) or root mean squared error (RMSE) for regression tasks.
In PyTorch, metrics can be computed using libraries like PyTorch Lightning's torchmetrics or custom implementations. Hugging Face also provides built-in metrics for various tasks, which can be easily integrated with their Trainer API.
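As a hedged sketch, the snippet below accumulates accuracy over batches with torchmetrics; the number of classes and the dummy predictions are assumptions.

```python
import torch
from torchmetrics.classification import MulticlassAccuracy

accuracy = MulticlassAccuracy(num_classes=3)

for _ in range(10):                      # stand-in for a validation loop
    preds = torch.randn(8, 3)            # logits for a batch
    targets = torch.randint(0, 3, (8,))  # ground-truth labels
    accuracy.update(preds, targets)

print(f"Validation accuracy: {accuracy.compute():.3f}")
accuracy.reset()
```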
6.3 Quality Assurance
Quality Assurance (QA) in machine learning involves a combination of automated and manual processes to ensure that models meet the desired performance and reliability standards. This includes techniques such as:
- Demo and Visual Inspection: Creating visualizations and interactive demos to manually inspect model predictions and identify potential issues. Tools like Streamlit or Gradio can be used to build these interfaces (a minimal sketch follows this list).
- Red-Teaming: Actively searching for model weaknesses by simulating adversarial conditions or edge cases. This can involve generating adversarial examples or testing the model on out-of-distribution data.
- Bias and Fairness Audits: Evaluating the model for biases and ensuring fairness across different demographic groups. Libraries like Fairlearn or Aequitas can assist in conducting these audits.
- Robustness Testing: Assessing the model's robustness to various perturbations, such as noise, occlusions, or adversarial attacks. Techniques like adversarial training or data augmentation can help improve robustness.
- Performance Monitoring: Continuously monitoring the model's performance in production to detect any degradation over time, using methods like drift detection offered in tools like NannyML. Tools like Prometheus and Grafana can be used for setting up performance dashboards and alerts.
- A/B Testing: Running controlled experiments to compare the performance of different model versions or configurations. This involves splitting the traffic between two or more model variants and statistically analyzing the results to determine which variant performs better. Tools like EvidentlyAI or custom scripts can be used to set up and analyze A/B tests.
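As a minimal sketch of the demo-based inspection approach, the Gradio interface below wraps a placeholder predict function; the function, labels, and inputs are hypothetical stand-ins for a real model call.

```python
import gradio as gr

def predict(text: str) -> dict:
    # Placeholder scoring logic standing in for a real model call.
    score = min(len(text) / 100.0, 1.0)
    return {"positive": score, "negative": 1.0 - score}

demo = gr.Interface(
    fn=predict,
    inputs=gr.Textbox(label="Input text"),
    outputs=gr.Label(num_top_classes=2),
    title="Manual inspection demo",
)

if __name__ == "__main__":
    demo.launch()
```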
By incorporating these QA practices, researchers and practitioners can ensure that their models are not only accurate but also reliable, fair, and robust in real-world applications.
7. Hyperparameter Optimization
There are several strategies and tools available for hyperparameter optimization in machine learning:
- Hyperband: An efficient method that dynamically allocates resources and terminates unpromising trials early.
- Bayesian Optimization: A probabilistic approach that models the objective function to identify promising hyperparameter regions.
- Genetic Algorithms: Evolution-inspired techniques that use mutation, crossover, and selection to explore the hyperparameter space.
- Multi-Objective Optimization: Methods that optimize for multiple criteria simultaneously, balancing competing objectives.
Hyperparameter optimization strategies vary based on compute availability, search space, and optimization objectives. Many frameworks, including SigOpt, Weights & Biases, Ray Tune, and Optuna, provide built-in support for efficient tuning within machine learning pipelines.
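As a hedged sketch, the Optuna example below tunes two hypothetical hyperparameters against a dummy objective; in practice the objective would train and evaluate a model and return a validation metric.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space: learning rate and dropout.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)

    # Dummy objective standing in for a real validation loss.
    return (lr - 1e-3) ** 2 + (dropout - 0.1) ** 2

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=25)
print(study.best_params)
```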
8. Cross-validation
Cross-validation helps ensure robust estimates of model performance and aids in hyperparameter tuning. Typically, the data is split into training, validation, and test sets, with the validation set used by Hyperparameter Optimization (HPO) tools like Optuna or Ray Tune to guide parameter selection. Best practices include stratified splits for classification tasks, time-based splits for time series data, and repeated k-fold CV for smaller datasets. Tools like scikit-learn’s CV utilities integrate well with most frameworks, while Ray Tune and Hugging Face’s Trainer can automatically manage cross-validation during HPO. This additional validation split prevents information leakage from the test set and promotes more reliable model comparisons.
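For illustration, a minimal stratified k-fold sketch with scikit-learn is shown below; the synthetic data and the simple classifier are assumptions made for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced classification data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0)

# Stratified splits preserve the class balance in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")

print(f"F1 per fold: {np.round(scores, 3)}")
print(f"Mean F1: {scores.mean():.3f}")
```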