Democratizing machine learning
Disha Singla on her journey to leading-edge machine learning
June 20, 2023
We recently had the pleasure of sitting down with Disha Singla, senior director of machine learning engineering, to talk about her life, her experience before Capital One and what she and her team are doing to stay on the cutting edge of machine learning.
What made you want to get into technology and machine learning?
I was introduced to computers in elementary school, and by high school, I was programming in C and C++. I have always had an inclination toward the full lifecycle of data, and through data engineering, I’ve enjoyed analyzing data and telling its stories.
I earned my bachelor’s in computer science and engineering, becoming the first in my family to receive an engineering degree. I then went on to earn a master’s in computer science from San Diego State University and a second master’s in data science and engineering from the University of California, San Diego.
Having knowledge across various subject areas like business, product, engineering, data science, scalable solutions and various programming languages gave me a technical edge to establish myself in machine learning.
I was inspired by Capital One’s journey to the cloud and culture, and am excited to now be a part of shaping Capital One’s machine learning journey.
Can you talk a little bit about ML democratization at Capital One?
I’m leading a group of very talented data scientists, machine learning engineers and software engineers and we’re working on something very close to my heart—democratizing AI and ML. Our team has built reusable library components and workflows and has created a platform that allows citizen data scientists to drive meaningful insights.
As a concrete example: we’ve built reusable libraries, workflows and components for monitoring and forecasting. These cover time series anomaly detection, change point detection, root cause analysis and time series forecasting.
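To make one of those ideas concrete, here is a minimal sketch of the kind of time series anomaly detection such a component might wrap: a rolling z-score check in plain Python. The function name, window and threshold are illustrative assumptions, not Capital One’s actual library.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(series, window=5, threshold=3.0):
    """Flag indices whose value deviates from the trailing window's
    mean by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(series)):
        trailing = series[i - window:i]
        mu, sigma = mean(trailing), stdev(trailing)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# A mostly flat series with one spike at index 10.
data = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 50, 11, 10]
print(rolling_zscore_anomalies(data))  # [10]
```

Real components layer seasonality handling, adaptive thresholds and segment-level aggregation on top of ideas like this.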
Professional data scientists, of course, want to build bespoke models. Everybody wants to spin up a Jupyter Notebook, analyze data and start building the features they want to train their model on. Eventually, they deploy the model to help drive meaningful insights. These scientists typically have a formal education in ML and statistics.
On the other hand, there are what I like to call citizen data scientists. These are people, often analysts, who have worked and explored data, but they aren’t necessarily keen on building their own models. They may not have a strong ML background, but they understand the value of data and how to access it. Our democratization is in serving these scientists.
Building frameworks to analyze data more quickly
This is a major consideration in developing and democratizing tools and components for machine learning at Capital One, like our open source software solutions Data Profiler, rubicon-ml, DataComPy and Federated Model Aggregation. What we’re doing is building component frameworks for analysts who want to jump right into the data.
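As a flavor of what one of these tools does: DataComPy compares two DataFrames and reports where they differ. The sketch below is a pure-Python illustration of that idea, not the library’s actual API, which operates on pandas and Spark DataFrames and produces much richer reports.

```python
def compare_records(base, compare, key):
    """DataComPy-style comparison sketch: join two lists of dicts on
    `key`, then report rows present on only one side plus field-level
    mismatches for shared rows. (Illustrative only.)"""
    base_idx = {row[key]: row for row in base}
    comp_idx = {row[key]: row for row in compare}
    mismatches = []
    for k in set(base_idx) & set(comp_idx):
        for field in base_idx[k]:
            if base_idx[k][field] != comp_idx[k].get(field):
                mismatches.append((k, field))
    return {"only_base": sorted(set(base_idx) - set(comp_idx)),
            "only_compare": sorted(set(comp_idx) - set(base_idx)),
            "mismatches": sorted(mismatches)}

a = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
b = [{"id": 1, "amt": 10}, {"id": 3, "amt": 30}]
print(compare_records(a, b, "id"))
```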
Can you go into more detail on some of Capital One’s tooling and workflows?
One example of how our team has democratized machine learning at Capital One involves our Workplace Solutions team. Their goal is to help our associates adapt to a hybrid work environment by using our tools to analyze and forecast various metrics. These are things like how many people are returning to work, how stocked the kitchen is and so on.
On the other end of the spectrum, our internal third-party card fraud team uses our tools and workflows with fantastic results. They approached us looking for a solution to identify anomalies in fraudulent transactions and automatically implement defenses to mitigate losses and reduce customer friction. We helped them by creating a workflow, which functions as a directed acyclic graph (DAG) under the hood. They provide us with data, and we help them analyze it.
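A DAG-based workflow like the one described can be sketched with Python’s standard-library `TopologicalSorter`; the stage names and logic below are hypothetical placeholders, not the team’s actual components.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline stages; each reads from and writes to a shared context.
def ingest(ctx):     ctx["data"] = [1, 2, 50, 3]
def detect(ctx):     ctx["anomalies"] = [x for x in ctx["data"] if x > 10]
def root_cause(ctx): ctx["cause"] = "spike" if ctx["anomalies"] else None
def emit_rules(ctx): ctx["rules"] = (
    [f"block values > {min(ctx['anomalies'])}"] if ctx["anomalies"] else [])

# Edges map each step to its prerequisites, forming the DAG.
steps = {"detect": {"ingest"}, "root_cause": {"detect"},
         "emit_rules": {"root_cause"}}
funcs = {"ingest": ingest, "detect": detect,
         "root_cause": root_cause, "emit_rules": emit_rules}

ctx = {}
for name in TopologicalSorter(steps).static_order():
    funcs[name](ctx)
print(ctx["rules"])  # ['block values > 50']
```

`static_order()` guarantees each step runs only after its dependencies, which is the essential property of a workflow DAG.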
An overview of the process
When a transaction is marked as fraudulent, our solution analyzes it in batch mode. Segments with the highest anomalies are flagged by our anomaly detection algorithms. This triggers another set of algorithms that identify change points and perform root cause analysis. Afterward, our tools generate rules in real time to prevent future cases of fraud matching these change points.
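Change point detection, one step in that pipeline, can be illustrated with a brute-force search for the single split that maximizes the difference in segment means; production systems use far more sophisticated algorithms, so treat this as a toy sketch.

```python
def best_change_point(series):
    """Return the index that best splits the series into two segments
    with maximally different means (single change point, brute force)."""
    best_i, best_gap = None, 0.0
    for i in range(1, len(series)):
        left = sum(series[:i]) / i
        right = sum(series[i:]) / (len(series) - i)
        if abs(left - right) > best_gap:
            best_i, best_gap = i, abs(left - right)
    return best_i

# A fraud rate that jumps from ~1% to ~5% at index 4.
rates = [0.010, 0.012, 0.011, 0.009, 0.052, 0.049, 0.051, 0.050]
print(best_change_point(rates))  # 4
```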
Challenges in anomaly detection: Minimizing false positives
To minimize false positives, we’ve built several open source and proprietary algorithms that improve and enhance the capabilities of our workflows. This ensures a better customer experience and reduces friction, since false positives can cause inconveniences like a declined credit card during a nice dinner.
And of course, we’re always looking for more contributors to our open source projects; we’re delighted to hear from others in the space.
What does the future of these democratization efforts look like?
An essential aspect of democratizing ML at Capital One is focusing on low-to-no-code solutions. Our team has developed an internal platform that is highly scalable and built on Kubernetes, a powerful container orchestration system. This platform features sophisticated orchestration, advanced workflows and a comprehensive library of tooling.
The algorithms I mentioned earlier are a good example. These algorithms were carefully developed and packaged over time through in-depth, leading research, close collaboration and industry best practices. Low-level directed acyclic graphs (DAGs) form the foundation on which we build abstractions, simplifying the process and making it more accessible to a broader range of users.
Streamlining democratization efforts
To streamline the experience even more, we’ve built a user-friendly interface that gives our customers easy access to the data they need. By specifying the location of their data and providing some parameters to drive the analysis, they can quickly spin up new projects and start gleaning insights. They can also specify the features they consider most important, determine how often the model should be retrained and choose the desired output sink.
Customers have the flexibility to send their data to various data systems within Capital One, including S3 buckets and Snowflake. This streamlined low-code and no-code experience empowers professional ML engineers and citizen data scientists alike to extract meaningful insights from their data and make informed decisions. In this way, we create a more inclusive, efficient environment for leveraging ML across the organization.
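As an illustration, the kind of declarative project spec such a low-code interface might collect could look like the following; every field name, path and value here is hypothetical, not the platform’s actual schema.

```python
# Hypothetical project spec a low-code interface might assemble.
spec = {
    "data_location": "s3://example-bucket/transactions/",  # hypothetical path
    "features": ["amount", "merchant_category", "hour_of_day"],
    "retrain_schedule": "weekly",
    "output_sink": {"type": "snowflake", "table": "FRAUD_INSIGHTS"},
}

def validate_spec(spec):
    """Minimal sanity checks before a workflow would be generated."""
    required = {"data_location", "features", "retrain_schedule", "output_sink"}
    missing = required - spec.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if spec["output_sink"]["type"] not in {"s3", "snowflake"}:
        raise ValueError("unsupported output sink")
    return True

print(validate_spec(spec))  # True
```

Validating the spec up front is what lets the platform generate a correct workflow without the user writing any pipeline code.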
More broadly, what do you think is the next step in the evolution of ML?
In addition to greater investment in leading-edge technologies and approaches like deep learning, the future of ML necessarily involves going back to the basics of software engineering. As data scientists and machine learning engineers, we’re lulled into a false sense of security. We think that because we’ve spent a massive amount of time on a long development cycle, we’ve handled everything, the features are good and the model is done.
Over the past year, we’ve been actively working to apply the same rigor to model testing that we apply to traditional software. What this means is that when a model is in production, it should be treated as production software or a live API. Even then, because it involves complex data, it requires even more sophistication.
Machine learning and scalable models
Take scalable models, for example. A large chunk of time goes into creating a sophisticated model with high statistical efficacy. An ML practitioner will run cycle after cycle of hyperparameter tuning and feature engineering, which is important, of course. But if the model is a piece of software, it needs to go through the same rigor as any other piece of software: end-to-end integration testing, unit testing, load testing, A/B testing and so on. In short, defensive coding and exception handling are key.
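A minimal sketch of what defensive coding around model inference can look like, assuming the model is simply a callable; the wrapper and its checks are illustrative, not a specific framework’s API.

```python
def predict_defensively(model, features, expected_keys):
    """Validate inputs and wrap the model call in exception handling,
    so bad data fails loudly instead of silently corrupting output."""
    missing = expected_keys - features.keys()
    if missing:
        raise ValueError(f"missing features: {sorted(missing)}")
    for k in expected_keys:
        if not isinstance(features[k], (int, float)):
            raise TypeError(f"feature {k!r} must be numeric")
    try:
        return model(features)
    except Exception as exc:
        # In production this would log and fall back to a safe default.
        raise RuntimeError("model inference failed") from exc

toy_model = lambda f: 1 if f["amount"] > 100 else 0
print(predict_defensively(toy_model, {"amount": 250.0}, {"amount"}))  # 1
```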
Data scientists and ML engineers are closest to the code, so the onus is on them to develop defensive coding and data quality tests. This kind of preventative coding is also where we find the highest ROI in terms of dev time. Armed with this information, we determined that data quality testing is crucial for all of our ML efforts.
Some simple ways that we are incorporating data quality testing:
- Checking the data type and the values of categorical data
- Checking for cardinality shift, where a shift in the distribution of categories can cause data leakage and model drift over time
- Checking for missing data
While the last one sounds simple, it’s near and dear to my heart.
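The three checks above can be sketched in a few lines of plain Python; the schema format and shift tolerance are illustrative assumptions, not our actual tooling.

```python
from collections import Counter

def data_quality_report(rows, schema, baseline_counts=None, shift_tol=0.3):
    """Run three checks on a list-of-dicts batch: type/value checks,
    cardinality shift against a baseline distribution, and missing data."""
    issues = []
    for i, row in enumerate(rows):
        for col, (typ, allowed) in schema.items():
            val = row.get(col)
            if val is None:
                issues.append(f"row {i}: missing {col}")
            elif not isinstance(val, typ):
                issues.append(f"row {i}: {col} has wrong type")
            elif allowed is not None and val not in allowed:
                issues.append(f"row {i}: {col}={val!r} not allowed")
    if baseline_counts:
        counts = Counter(r.get("category") for r in rows)
        for cat, base_frac in baseline_counts.items():
            if abs(counts[cat] / len(rows) - base_frac) > shift_tol:
                issues.append(f"cardinality shift in {cat!r}")
    return issues

schema = {"amount": (float, None), "category": (str, {"food", "travel"})}
rows = [{"amount": 9.5, "category": "food"},
        {"amount": None, "category": "travel"},
        {"amount": 3.0, "category": "gaming"}]
print(data_quality_report(rows, schema, {"food": 0.9}))
```

Even the missing-data check, the simplest of the three, catches a whole class of silent failures before they reach a model.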
How does Capital One attract and develop top talent in machine learning?
We’re always on the lookout for exceptional ML talent. Capital One is incredibly invested in working directly with universities like UMD, MIT and UVA, especially when it comes to fostering the next generation of ML talent.
Internally, we have training programs for ML engineers and ML product managers, as well as graduate programs in data science and data analytics. We’re also proud to have a robust culture of inclusion, which helps attract top talent from the industry who care about those principles. It’s a pleasure to work with talented, intelligent and accepting people, pioneering the future of machine learning.
Explore tech careers at Capital One
If you’re passionate about machine learning and other high-tech fields, Capital One provides an inclusive, exciting and innovative environment where top talent can thrive. By joining Capital One’s leading technology teams, you’ll have the opportunity to work alongside exceptional professionals and contribute to groundbreaking projects that are shaping the future of ML.