ML experiment tracking: How to track experiments in ML

Never lose track again: experiment tracking saves all the important information from your ML experiments

Machine learning (ML) is only as effective as the data used for training. No matter how sophisticated the ML model is, poor data results in poor performance. Data quality depends on the features, or inputs, used in the dataset. 

Your first thought to ensure effective training may be to use all of the features you have available. However, many features may be redundant or irrelevant to the problem you’re trying to solve using machine learning. Also, using too many features tends to reduce model performance rather than increase it. So, you need to find a subset of the available features that's best for the ML task at hand.

One solution is to break your inputs up into several different datasets, each dataset being a unique subset of the available features, and let them compete. This is a solution, but it creates a new demand for data scientists to track multiple datasets and their associated outputs. Experimenting with how different datasets produce different results is just one example of the many factors that go into the performance of ML models, further increasing the complexity of tracking. The challenge is for you to decide which experiments generate the best results and know how to reproduce them. That's where ML experiment tracking comes in.

What is ML experiment tracking?

ML experiment tracking is the practice of saving important information about your experiments after running them so that you can go back and pick the most successful iterations out of all the results you got. It's an organizational approach that helps data scientists keep track of their inputs and the results each input generated, with an eye toward reproduction, as you move your ML model into production.

With some experiments running up to thousands of different input combinations, keeping track of which inputs generated which outputs is far beyond the average human cognitive load. Even smaller datasets might have dozens of dimensions you want to track. Factors that might be relevant, and that you should consider tracking, include:

  • The feature set used for a particular run of the experiment, and whether the entire set was used or only a portion of it

  • The specific model used to analyze the data, or the set of models if you're using an ensemble

  • Which version of the data was used, in case it's being updated or revised

  • Hyperparameters used for the model(s)

  • The specific codebase version used

  • Information about the dependencies used

  • Model performance metrics

  • Predictions the model makes on each set

  • Comparisons of anything else you logged during the experiment
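At its core, a tracked run is just a structured record of the factors above. A minimal sketch, with illustrative field names that aren't tied to any particular tool:

```python
import json

# One record per experiment run; field names and values are illustrative.
run_record = {
    "run_id": "2024-06-01-run-007",
    "features": ["age", "income", "tenure"],   # subset of available features
    "model": "RandomForestClassifier",
    "data_version": "v2.3",
    "hyperparameters": {"n_estimators": 200, "max_depth": 8},
    "code_version": "a1b2c3d",                 # e.g. a git commit hash
    "dependencies": {"scikit-learn": "1.4.2"},
    "metrics": {"accuracy": 0.91, "f1": 0.88},
}

# Persist as one JSON line per run so records can be appended and searched later.
line = json.dumps(run_record)
restored = json.loads(line)
```

Once every run is captured this way, "which inputs generated which outputs" becomes a query instead of a memory exercise.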

What are the benefits of experiment tracking?

With all this to keep track of, you need a solid approach to tracking data across multiple dimensions and through potentially thousands of iterations of your experiments. The benefits of good experiment tracking include:

Reduced time spent managing experiments

As the number and complexity of experiments increase, the time your team spends tracking it all can consume the time you'd rather spend experimenting. This is all the more true if you're trying to maximize the number of experiments to generate better results.

Reduced potential for human-caused error

Every human hand that touches experimental data introduces the possibility of mistakes, contamination or other impairments of the results. Automating the experiment tracking reduces the potential for human errors.

Reduced strain on human experimenters

As smart as data scientists are, no person could hold the necessary input-result-iteration information in their head across thousands of experiments. Recording and persisting this information removes the need to memorize experiment trends.

Saved information for later retrieval

Some of your experiments will generate positive results, and you'll want to inspect the factors that caused those positive results. Being able to find and reproduce these successful experiments is crucial to a faster development cycle. 

Better comparisons

People draw new insights by comparing items side by side rather than assessing each in isolation. Similarly, tracking input/output information across multiple experiments lets you compare models to see what produces optimal results, what errors have crept into less successful iterations and how to improve going forward.

Better outputs

You’ll need to show the outputs of your ML experiments. Good tracking helps by keeping results organized and easily searchable for data scientists and engineers. Tracking the outputs from different experiments, analyzing the results and refining the inputs leads to more informed experiments and, often, better outputs.

Use cases of experiment tracking: ML experiments

You can use your tracking system for a lot of different purposes. Examples of what ML experiment tracking lets you do, or do better, include:

Experimenting with different hyperparameters

Your ML model might have a small set of hyperparameters to guide it, or it could have a lot. Even a tiny change to one of these can make a world of difference in the outcome of an experiment. This is even more true for models with multiple hyperparameters, which quickly develop a combinatorial effect with countless permutations based on just a few top-level variables. If you're going to change any of these during your experiments, it's essential to track the state of each environment you run your code in, since you may never be able to exactly recreate the initial conditions from scratch.
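The combinatorial effect is easy to see in a sweep. A minimal sketch with a hypothetical grid and a stand-in for real training:

```python
from itertools import product

# Hypothetical hyperparameter grid; even 3 x 3 x 2 means 18 runs to track.
grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [4, 8, 16],
    "subsample": [0.8, 1.0],
}

def run_experiment(params):
    # Stand-in for real training and evaluation; returns a fake score.
    return 1.0 / (1 + params["learning_rate"] * params["max_depth"])

# Record every parameter combination alongside the score it produced.
names = list(grid)
runs = []
for values in product(*grid.values()):
    params = dict(zip(names, values))
    runs.append({"params": params, "score": run_experiment(params)})
```

Add a fourth hyperparameter with three candidate values and the run count triples, which is exactly why this bookkeeping shouldn't live in anyone's head.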

Comparison of different models

In some cases, you may not know which of the tested models is best for a given task before you run the experiments. Some may be better than others for a given application. Good tracking lets you compare results more easily and efficiently. This gets even more important if you're using multiple models in an ensemble, since again, the results quickly develop into countless permutations based on how many models are in use and the sequence you use them in. ML experiment tracking lets you compare outcomes across multiple runs and pick the results that came out best.
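With results logged per model, the comparison itself becomes trivial. A sketch over hypothetical logged results:

```python
# Hypothetical logged results for several candidate models.
runs = [
    {"model": "logistic_regression", "accuracy": 0.86},
    {"model": "random_forest", "accuracy": 0.91},
    {"model": "gradient_boosting", "accuracy": 0.89},
]

# With every run recorded, picking the winner is a one-liner.
best = max(runs, key=lambda r: r["accuracy"])
```

The hard part isn't this comparison; it's having trustworthy, complete records to compare in the first place.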

Debugging scripts

In some circumstances, it's almost inevitable that bugs will creep into your models somewhere. Accurately tracking inputs, processes and outputs lets you go back after a bug is spotted and find out what went wrong. It may also help you discriminate between bugs in your program and errors in the datasets you're feeding it, which can help you clean up your input stream and work with better data.

Transfer learning

Transfer learning is the approach of taking an ML model trained on a similar task and tweaking it to fit the needs of a new task. Different modifications of the original model result in different experimental outcomes, necessitating tracking these modifications and comparing the different outcomes.

Optimal feature selection

ML systems can get big and complicated quickly, and it's not always optimal to use every feature from a data source for every task. Good tracking lets you identify the features that generate the best results for a specific purpose and designate them for use while the others remain dormant.
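Tracking makes exhaustive subset comparisons practical for small feature sets. A sketch with a hypothetical scoring function standing in for training and evaluation:

```python
from itertools import combinations

features = ["age", "income", "tenure", "region"]

def score(subset):
    # Stand-in for training and evaluating on this feature subset;
    # here it just prefers mid-sized subsets, for illustration only.
    return len(subset) * (len(features) - len(subset))

# Enumerate every non-empty subset and track the score each one earns.
results = {
    subset: score(subset)
    for r in range(1, len(features) + 1)
    for subset in combinations(features, r)
}

best_subset = max(results, key=results.get)
```

With n features there are 2^n - 1 non-empty subsets, so beyond small n you'd sample or search rather than enumerate, but the tracking requirement is the same.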

How to get started tracking your ML experiments

There's no single best way to track experiments in MLOps. How your team does it depends on your needs, resources and the size of the data you're working with. Small datasets might be tracked in spreadsheets or JSON files. This manual approach is possible for small, less complex experiments, but it quickly becomes far too cumbersome for most teams. It's also hard to properly track and analyze data stored this way: manual inputs let errors slip into the datasets, and managing the files consumes increasing amounts of everyone's time.
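For a sense of scale, the entire "JSON file" approach fits in a few lines; a minimal, hypothetical file-backed tracker:

```python
import json
import tempfile
from pathlib import Path

# Append one JSON line per run; read them all back for analysis.
def log_run(path, record):
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def load_runs(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(log_path, {"run": 1, "accuracy": 0.82})
log_run(log_path, {"run": 2, "accuracy": 0.88})
runs = load_runs(log_path)
```

This works until it doesn't: no schema enforcement, no deduplication, no concurrent writers, no UI. Those gaps are what purpose-built tools exist to fill.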

A better solution is to integrate purpose-built tools, or tools that were specifically created to work with ML tracking, into your experiments from the beginning. These tools vary somewhat in how they work, but all tend to follow the same general steps toward human-friendly inputs and outputs. Your typical tracking tool will:

  • Connect to the source code directly to automate the collection process

  • Permit you to set parameters for the information you want collected

  • Operate in the background during experimental runs and automatically collect the required data

  • Allow you to sort and apply filters to compare your experiments and make data-driven decisions

Some tools might also observe results and autonomously decide whether to run more experiments with a given model or how to alter the parameters for a next-round run.

As part of the setup process, you have to specify what information you want to track and how you want it used. At a minimum, you probably want to track these five factors for every run:


Code version

You may be working with multiple versions of your codebase. When one performs better than the others, it helps to know which version you have and what's different about it.


Dependencies

You can track Python dependencies, Docker files and the like with a simple requirements.txt file.
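The snapshot can even be taken from inside the run itself. A stdlib-only sketch that builds pip-style pins from whatever is installed in the current environment:

```python
from importlib import metadata

# Build "name==version" pins for every installed distribution,
# the same format a requirements.txt file uses.
pins = sorted(
    f"{dist.metadata['Name']}=={dist.version}"
    for dist in metadata.distributions()
)
requirements_txt = "\n".join(pins)
```

Logging this string alongside each run means you can later rebuild the exact environment that produced a result.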


Data version

The tool needs to track the data version and identify changes between runs. This allows you to quickly revert to a previous version if it performs better.
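A lightweight way to identify a data version, assuming you don't have a dedicated data-versioning system, is to fingerprint the dataset's contents:

```python
import hashlib

def data_fingerprint(rows):
    # Hash the serialized rows so any change yields a new version id.
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()[:12]

v1 = data_fingerprint([("alice", 34), ("bob", 29)])
v2 = data_fingerprint([("alice", 34), ("bob", 30)])  # one value changed
```

Logging the fingerprint with each run tells you immediately whether two runs saw the same data, even when the file name never changed.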


Hyperparameters

Your model might use very few hyperparameters, or it could have many. The tracking tool you use should capture all your command line arguments, function parameters and other variables that have downstream effects on how the algorithm or algorithms perform.


Metrics

Of course, you have to know how the experiments all turned out. Tracking metrics is crucial to knowing which iterations performed better than others.

The rest of what you track depends on the details of your experiments and your specific needs. You might, for instance, choose to include a validation set in addition to your training and test sets, the results of which will have to be tracked as well.

Experiment tracking tools

There are many tools currently on the market to help you track ML experiments. Here's a broad overview of what's available: 


Neptune

Neptune assists with experiment tracking as well as other components of the MLOps lifecycle, including monitoring and visualization, and can compare thousands of models at once. Its main advantages include the ability to log and display all collected experiment metadata, a flexible metadata structure and an easily navigable web UI.


MLflow

MLflow is an open source tool that supports several parts of the machine learning lifecycle, including a tracking component available through both a UI and an API. Its main advantages include its focus on the whole machine learning lifecycle, a large community of users that support each other, an open interface that can be incorporated into any ML library and its built-in autologging capabilities.


Rubicon

Rubicon is our open source tool that helps Python developers refine their ML models and pipelines, for both training and deployment. Its main advantages include git integration that links experiment outcomes to the code version that produced them, a dashboard for visualizing and sharing results with others and its lightweight functionality, since it integrates with other open source tools.

ML tracking key concepts and best practices


Logging

Logging records valuable data and should ideally persist it rather than keeping it in ephemeral memory. You can manually adjust the codebase to log the information you want at intervals relevant to your needs, but a good tracking tool should also support autologging, which captures this data automatically.


Monitoring

Your tracking tool should allow monitoring of intermediate results and give a window into how a specific output was arrived at. This can save time: a run can be discontinued early if it becomes clear it will no longer improve, it's stuck in an unproductive loop or the quality of analysis is degrading.
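The "discontinue if it stops improving" decision is often a simple patience rule over the tracked metric. A minimal sketch, with illustrative numbers:

```python
# Stop when the tracked metric hasn't improved for `patience` checks in a row.
def should_stop(history, patience=3):
    if len(history) <= patience:
        return False
    best_so_far = max(history[:-patience])
    return max(history[-patience:]) <= best_so_far

# Validation scores logged after each training epoch (illustrative values);
# the last three checks never beat the earlier best of 0.81, so we stop.
scores = [0.70, 0.78, 0.81, 0.81, 0.80, 0.79]
stop = should_stop(scores, patience=3)
```

Monitoring only enables this if intermediate results are actually logged as the run progresses, not just at the end.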


Visualization

Even a short run of experiments can generate vast amounts of data that are hard for a human to analyze in raw form. Humans are, however, quite good at interpreting data presented graphically. Several ML tracking tools take advantage of this by presenting outputs in easy-to-read visual formats, including plots and graphs.

Experiment tracking using Rubicon

Rubicon is an open source data science tool that stores experiment outcomes (including parameters and metrics) in a repeatable and searchable fashion, allowing data scientists to determine which experiments produced the best results and how to repeat the best-performing experiments. This includes complete and automated experiment tracking.

Rubicon users integrate it directly into their Python projects and data pipelines. The Rubicon logging library can then be configured to log whatever data you desire during the model training process, and you're able to choose where the logs are stored. The Rubicon dashboard uses the information in the logs to provide a UI where the user can explore, filter and visualize the data of the various experiments. Results can also be shared with collaborators.

Making ML experiments manageable with experiment tracking

Good ML experiment tracking can make or break your ML projects. Even small experimental runs can contain thousands of iterative parameters and outcomes that quickly get too big to track manually, so it helps to have a tool to do it for you. Rubicon-ml is an open source experiment tracking tool developed by Capital One. Machine learning specialists, data scientists and MLOps teams can use this tool to easily log, visualize and share their results in a flexible and intuitive way. Get started tracking your machine learning projects with Rubicon, and start saving time, effort and frustration with a capable ML experiment tracking tool.

Capital One Tech

Stories and ideas on development from the people who build it at Capital One.
