Standardizing the Model Development Process with Rubicon
An open source tool for capturing & storing model training & execution information in a repeatable & searchable way
By Mike McCarty, Srilatha Ranganathan, Joe Wolfe, and Ryan Soley
The model development lifecycle is a complex, iterative process. Over many iterations, it becomes tedious to keep track of the inputs and outputs of each model version. This is where Rubicon comes in. Rubicon is an open source data science tool developed by Capital One that helps data scientists and model developers track their experiments and build the best models for their business problem. Rubicon captures and stores model training and execution information, like parameters and outcomes, in a repeatable and searchable way. Its Git integration then associates these inputs and outputs directly with the model code that produced them, ensuring full auditability and reproducibility for developers and stakeholders alike. While you experiment with models, Rubicon’s dashboard makes it easy to explore, filter, visualize, and share recorded work.
How Rubicon helps with well governed model development
Originally designed for the build and train phase of model development, Rubicon helps data scientists track their experiments as they build the best model for a business problem. It provides standard, complete, and automated experiment tracking for model documentation so that training information, such as hyperparameters and metrics, is never lost. By installing Rubicon’s Python package, a model developer can integrate Rubicon directly into their Python model pipelines and easily start tracking information such as:
- Features and hyperparameters for a particular model fit.
- Metrics produced by the particular model fit.
- Any dataframes or artifacts relevant to the model fit (pickled model objects, images, complex metrics, etc.).
Rubicon encourages best practices. Its Git integration automatically associates experiments with the model code that produced them (a must-have for audit and model review purposes), and its support for tags helps you stay organized. The Rubicon dashboard makes this organization and data exploration simple by supporting tag filtering and by automatically grouping experiments under their commits over time. Additionally, Rubicon offers concurrent logging and asynchronous communication with S3, so multiple experiments can be logged in parallel and network reads and writes don’t block.
How to use Rubicon
Imagine that you are a data scientist in charge of building a model, but you are not sure which type of model is best suited to the problem. You decide to try a number of different pipelines and work towards the best one. You also know that you will need to present a full report backing up your model decisions to senior leadership for approval. This is where Rubicon can save you time and effort.
You can integrate Rubicon’s logging library into your Python model code to store and retrieve your model inputs, outputs, and analysis as you iterate through the model’s development. You are in charge of where the data lives and can decide to log locally to your filesystem or to S3.
After a series of runs across different model types with varying inputs, you can use the Rubicon dashboard to explore, search, filter, and visualize your logged data, making it easier to identify patterns and tune your model towards the optimal solution. Moving between the dashboard and further experimentation then becomes a natural, iterative process.
Once you are happy with the model, you can select a subset of logged data that can tell the whole story of your process and ultimately highlight the optimal model. Using Rubicon, you can then share this subset of data with collaborators or reviewers, giving them exactly the data they need when they need it.
Together, these different components of Rubicon support your needs throughout the entire model development and model approval process.
How to get started with Rubicon
Rubicon is available under the package name rubicon-ml and can be installed using either conda or pip.
conda: conda install rubicon-ml
pip: pip install rubicon-ml
NOTE: When using conda, make sure to set the channel to conda-forge: conda config --add channels conda-forge
Where Rubicon is headed
We look forward to sharing Rubicon with the data science and open source communities at large, and are excited for the opportunity to engage with a broader community of developers. Working together, we’re confident these efforts will lead to new enhancements and functionality that may not have otherwise been possible, and we look forward to blending these different visions into the Rubicon roadmap.
To learn more about Rubicon, including how to use or contribute to the open source project, visit the Rubicon GitHub page.