Developing Software for the Bank of the Future

A look back at Capital One's Year in Open Source


Over the past few years, Capital One has seen open source bring ingenuity to our operations. We contribute the software we want to see in the world and then get feedback and ideas for improvements. Those contributions help others, and they also help our business. As we continue our open source journey in 2022, let’s take a look back at how Capital One showed up in the open source community in 2021.

Data Profiler

Launched in February 2021, Data Profiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easier. When building data-intensive projects, every data scientist and engineer needs to ask these important questions: Is our data secure? What does the data say? How should we use the data? Data Profiler helps answer these questions by loading, formatting, and profiling data in order to identify the schema, statistics, entities (PII / NPI), and more. These data profiles can then be used in downstream applications or reports.
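
To make the idea of a data profile concrete, here is a minimal sketch that uses only the Python standard library. It is not Data Profiler's API or implementation: the function names, the sample data, and the two regular expressions are hypothetical, and real sensitive data detection needs far more than pattern matching. It only illustrates what "schema, statistics, entities" can look like for a small CSV.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical patterns for illustration only (assumption, not from Data Profiler).
SENSITIVE_PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    if not values:
        return "unknown"
    for name, cast in (("int", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text):
    """Return a per-column profile: type, null count, unique count, entity labels."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    profile = {}
    for column in columns:
        # Treat empty cells as nulls.
        values = [row[column] for row in rows if row[column]]
        labels = Counter(
            label
            for value in values
            for label, pattern in SENSITIVE_PATTERNS.items()
            if pattern.match(value)
        )
        profile[column] = {
            "type": infer_type(values),
            "null_count": len(rows) - len(values),
            "unique_count": len(set(values)),
            "entities": sorted(labels),
        }
    return profile


# Made-up sample data.
SAMPLE = "id,email,balance\n1,ana@example.com,10.5\n2,,7\n3,li@example.com,3.25\n"

if __name__ == "__main__":
    for column, stats in profile_csv(SAMPLE).items():
        print(column, stats)
```

A profile like this can be passed to a downstream report or monitoring job, which is the role the data profiles described above play.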

“The team felt very fortunate to be able to showcase the benefits of using Data Profiler at All Things Open and PyData Global,” says Jeremy Goodsitt, Lead Machine Learning Engineer. “They look forward to the continued support and feedback from users as they work on new features and integrate into other libraries.”

To date, Data Profiler has 700+ stars, 41 forks, and 13 contributors. 

Learn more about Data Profiler

rubicon-ml

Launched in February 2021, rubicon-ml is a data science tool that helps data scientists and model developers capture and store model training and execution information - like parameters and outcomes - in a repeatable and searchable way. 

According to Ryan Soley, Machine Learning Engineer, “we've been able to keep rubicon-ml lightweight by leveraging proven, open-source libraries in the PyData ecosystem. We open sourced rubicon-ml to get it closer to those tools and libraries it's built on. Since making the library public, we've gotten invaluable input and feedback from creators and maintainers of the libraries rubicon-ml relies on.”

rubicon-ml features include:

  • A Git integration that automatically associates experiments with the corresponding model code that produced them.
  • A dashboard that simplifies organization and data exploration through tag filtering and by automatically grouping experiments under their commits over time.
  • Concurrent logging and asynchronous communication with S3, so multiple experiments can be logged in parallel.
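
As a rough illustration of the first two features, the sketch below records experiments together with a commit hash and then filters them by tag and groups them by commit. It is a standard-library toy, not rubicon-ml's API: the `Experiment` and `ExperimentLog` names, their fields, and the sample values are all hypothetical, and the commit hash is passed in by hand rather than read from a Git repository.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """One training run (hypothetical structure for illustration)."""
    name: str
    commit_hash: str  # a real tool would read this from the Git repository
    parameters: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)
    tags: set = field(default_factory=set)


class ExperimentLog:
    """In-memory store that supports tag filtering and grouping by commit."""

    def __init__(self):
        self._experiments = []

    def log(self, experiment):
        self._experiments.append(experiment)
        return experiment

    def with_tags(self, *tags):
        """Return experiments that carry every requested tag."""
        wanted = set(tags)
        return [e for e in self._experiments if wanted <= e.tags]

    def by_commit(self):
        """Group experiment names under the commit that produced them."""
        groups = defaultdict(list)
        for experiment in self._experiments:
            groups[experiment.commit_hash].append(experiment.name)
        return dict(groups)


# Made-up runs for demonstration.
log = ExperimentLog()
log.log(Experiment("run-1", "abc123", {"n_estimators": 50}, {"accuracy": 0.91}, {"baseline"}))
log.log(Experiment("run-2", "abc123", {"n_estimators": 100}, {"accuracy": 0.93}, {"baseline", "tuned"}))
log.log(Experiment("run-3", "def456", {"n_estimators": 100}, {"accuracy": 0.94}, {"tuned"}))

if __name__ == "__main__":
    print([e.name for e in log.with_tags("tuned")])
    print(log.by_commit())
```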

To date, rubicon-ml has 50+ stars, 10 forks, and 7 contributors. 

Learn more about rubicon-ml

DataComPy

DataComPy - officially released to the open source community in 2018 - grew in 2021 to 190+ stars, 60+ forks, and 14 distinct contributors. A tool to compare pandas and Spark dataframes, DataComPy can be used as a replacement for SAS's PROC COMPARE or as an alternative to pandas.DataFrame.equals(pandas.DataFrame). For 2022, the DataComPy roadmap includes some refactoring of the pandas and Spark codebase to help unify the experience, as well as sourcing more user feedback to help enhance the current capabilities. The DataComPy project publishes its full roadmap.
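
The difference from an equality check is worth spelling out: `DataFrame.equals` answers with a single boolean, while a PROC COMPARE-style tool reports which rows and values differ. The sketch below shows that idea on plain lists of dictionaries using only the standard library. It is not DataComPy's API; `compare_records`, its parameters, and the sample rows are hypothetical.

```python
def compare_records(left, right, join_key, abs_tol=0.0):
    """Compare two lists of dict rows matched on join_key.

    Returns keys found on only one side and per-column value mismatches
    for keys found on both sides.
    """
    left_by_key = {row[join_key]: row for row in left}
    right_by_key = {row[join_key]: row for row in right}

    mismatches = []
    for key in sorted(left_by_key.keys() & right_by_key.keys()):
        l_row, r_row = left_by_key[key], right_by_key[key]
        for column in sorted(l_row.keys() & r_row.keys()):
            a, b = l_row[column], r_row[column]
            numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
            # Numbers may differ by up to abs_tol; everything else must be equal.
            equal = abs(a - b) <= abs_tol if numeric else a == b
            if not equal:
                mismatches.append((key, column, a, b))

    return {
        "only_in_left": sorted(left_by_key.keys() - right_by_key.keys()),
        "only_in_right": sorted(right_by_key.keys() - left_by_key.keys()),
        "mismatches": mismatches,
        "matches": not mismatches and left_by_key.keys() == right_by_key.keys(),
    }


# Made-up data for demonstration.
original = [
    {"acct_id": 1, "balance": 100.0},
    {"acct_id": 2, "balance": 250.5},
    {"acct_id": 3, "balance": 75.0},
]
new = [
    {"acct_id": 1, "balance": 100.0},
    {"acct_id": 2, "balance": 250.75},
    {"acct_id": 4, "balance": 10.0},
]

if __name__ == "__main__":
    print(compare_records(original, new, join_key="acct_id"))
```

Raising `abs_tol` lets small numeric differences pass, which is useful when the two sources round values differently.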

Learn more about DataComPy

edgetest

Launched in November 2021, edgetest is a plugin-based Python package designed to help developers test their code against new versions of their existing dependencies. Since pip introduced the dependency resolver in October 2020, maintenance cost and environment management have become a part of “running the engine”. edgetest helps reduce the maintenance cost of packages by automating bleeding-edge dependency testing. It creates a virtual environment, installs your library, upgrades the specified dependencies, and runs your test command. Afterwards, it reports whether or not it is safe to upgrade based on the test results.
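
The four steps described above can be sketched as a list of commands built with the standard library. This is an illustration of the workflow under stated assumptions, not edgetest's implementation or command-line interface: `build_plan` and `run_plan` are hypothetical names, and the sketch assumes the test runner (pytest here) is installed along with the package.

```python
import subprocess
import sys
from pathlib import Path


def build_plan(env_dir, package_dir, upgrades, test_command):
    """Return the commands for one bleeding-edge test run, in order."""
    # The environment's interpreter lives in a different folder on Windows.
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    python = str(Path(env_dir) / bin_dir / "python")
    return [
        [sys.executable, "-m", "venv", str(env_dir)],              # 1. create the environment
        [python, "-m", "pip", "install", str(package_dir)],        # 2. install your library
        [python, "-m", "pip", "install", "--upgrade", *upgrades],  # 3. upgrade chosen dependencies
        [python, "-m", *test_command],                             # 4. run the test command
    ]


def run_plan(plan, runner):
    """Run each command in order; True means every step exited with code 0."""
    for command in plan:
        if runner(command) != 0:
            return False
    return True


def subprocess_runner(command):
    """Execute a command for real and return its exit code."""
    return subprocess.run(command).returncode


if __name__ == "__main__":
    plan = build_plan(".edge-env", ".", ["pandas"], ["pytest"])
    for command in plan:
        print(" ".join(command))
    # To execute for real (needs network access):
    # print("safe to upgrade:", run_plan(plan, subprocess_runner))
```

Separating the plan from the runner keeps the steps easy to inspect and lets the sequence be exercised with a fake runner before anything is installed.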

Learn more about edgetest

Looking to the Future

Capital One has spent the last decade undertaking a massive technology transformation. During this time, we pioneered new standards, tools, and technologies and adopted an “open source first” approach to software development. As we look to the future of banking, we will remain deliberate about what we release: because we must be stress-tested to operate in a highly regulated industry, we take into account the risk burden other companies could face. We are excited to continue our open source journey in 2022 by launching new projects, investing in open source communities, and sharing our experiences along the way.


Capital One Tech

Stories and ideas on development from the people who build it at Capital One.


DISCLOSURE STATEMENT: © 2022 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.
