Best Practices For Impactful Machine Learning Platforms
Advice for building enterprise platforms
September 22, 2022
Platforms are the foundation of many of the technological advances and experiences we now take for granted every day. Platforms like Windows, iOS, and AWS are ubiquitous. But there are countless unheralded technology platforms that organizations have built internally to run their business and innovate at scale.
The word “platform” can mean different things to different people and in different scenarios. To me, a platform is the following: a group of technologies that serve as a base from which to build, contribute, experiment, and scale other applications. Building and maintaining platforms is an integral component of my work as one of Capital One’s machine learning leaders.
Over time, ML models have become central to how we create real-time, intelligent experiences for our customers. In the early days of ML, companies took pride in their ability to develop new and bespoke ML solutions for different parts of the business. But building and maintaining many solutions to the same set of problems is inefficient. Companies seeking to scale ML in a well-governed, nimble, and efficient way have to account for continuous updates to data sources, ML models, features, pipelines, and many other aspects of the ML model lifecycle. Making those updates across a large number of platforms is agonizing.
The Shift to Enterprise Platforms
Innovating at the pace and scale of today's business requires companies to shift to enterprise-wide platforms and infrastructure. Capital One’s move to become the first U.S. financial institution to go all in on the cloud and our ability to re-architect our data environment have been integral to accelerating our technology platform capabilities. With that strong foundation, we have started to build enterprise-grade ML capabilities, which will bring substantial benefits: freeing up technology talent and resources, spurring innovation, and ultimately delivering new customer experiences to market.
Much of Capital One’s work in this area is already showing impactful results for the business and for our customers. For example, Capital One’s fraud decisioning platform was built from the ground up to make complex real-time decisions. By leveraging massive amounts of data and enabling model updates in days (it used to take years!), the platform helps protect millions of customers from card fraud and can be used by various stakeholders across the enterprise.
Based on my experience leading teams to deliver top-performing tech platforms, there are important lessons and best practices I’ve learned along the way.
Hard-Earned Advice for Building Enterprise Platforms
It all starts with the team: Build a cross-functional team of the best people, even if it slows you down at first. A bigger team is not always better! At a minimum, the team should have product managers, engineers, and designers. Staff these functions with people who truly understand the users of the platform. For example, if you’re building a platform that will be used primarily by data scientists, hire a product manager who used to be a data scientist or put a data scientist on your leadership team. If the team is made up of people from several organizations, make sure you have shared goals.
Work backwards from a well-defined end state: Before you start to build, take the time to align on the end-state architecture and your plan to iterate your way to that destination. Make sure your architecture is designed for self-service and contribution from the start. Better yet, design the platform assuming that you will expand it to users outside of your immediate organization or line of business. Assume that over time you will want to swap out components as technology changes.
Estimate how long you think it will take, then double it: It is important to take the time to brainstorm all of the capabilities that you need to build at the outset and then assign a T-shirt-size level-of-effort estimate to each component. Once your tech teams combine these sizes with their velocity to estimate how long each feature will take to build, add a 50% buffer. In my experience, this estimate ends up being surprisingly accurate!
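The arithmetic behind that estimate can be sketched in a few lines of Python. This is only an illustration: the size-to-points mapping, the capability names, and the velocity figure are hypothetical placeholders, and only the 50% buffer comes from the advice above.

```python
import math

# Hypothetical mapping from T-shirt size to story points; calibrate it to your own teams.
SIZE_POINTS = {"S": 3, "M": 8, "L": 20, "XL": 40}


def estimate_sprints(sizes, velocity, buffer=0.5):
    """Return (raw, buffered) sprint counts for an iterable of T-shirt sizes.

    velocity is the number of story points the team completes per sprint.
    buffer is the fraction added on top of the raw estimate (0.5 = 50%).
    """
    total_points = sum(SIZE_POINTS[size] for size in sizes)
    raw = total_points / velocity
    buffered = raw * (1 + buffer)
    # Round up: a partially used sprint still occupies the calendar.
    return math.ceil(raw), math.ceil(buffered)


# Illustrative capability list (names and sizes are made up).
capabilities = {
    "feature store": "L",
    "model registry": "M",
    "monitoring": "M",
    "onboarding docs": "S",
}

raw, buffered = estimate_sprints(capabilities.values(), velocity=13)
print(f"raw: {raw} sprints, with buffer: {buffered} sprints")  # raw: 3, with buffer: 5
```

Keeping the buffer as an explicit parameter makes it easy to compare the plan against actual delivery later and adjust the figure for the next estimate.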
Focus on business outcomes: Building great platforms can take a long time. It is important to sequence the work so that business value can be achieved along the way. This motivates the team, builds credibility, and creates a virtuous cycle.
Be radically transparent and overcommunicate: Share decisions, progress, and roadmaps with stakeholders liberally. In addition to articulating what you are working on, also articulate what you are not currently prioritizing. Invest in documentation that enables contribution as well as easy onboarding to the platform.
Start small: Even the best testing and QA environment can miss issues that surface only once something is put into production. For big changes that will have meaningful customer impact, always start with a tiny population and then ramp up once you see things working in production at a small scale. When a change impacts external customers, limit the initial population to internal associates whenever possible.
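One way to implement this kind of ramp is a deterministic percentage rollout, sketched below. The stage percentages, the `sees_change` function, and the ID format are all assumptions made for illustration, not a description of any particular platform.

```python
import hashlib

# Hypothetical ramp schedule: share of external customers exposed at each stage.
# Stage 0 exposes no external customers, so only associates see the change.
RAMP_STAGES = [0.0, 0.01, 0.05, 0.25, 1.0]


def bucket(user_id: str) -> float:
    """Map a user ID to a stable value in [0, 1).

    Hashing keeps the assignment sticky, so a user who sees the change at one
    stage keeps seeing it at every later stage.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def sees_change(user_id: str, is_associate: bool, stage: int) -> bool:
    """Decide whether this user is in the rollout population for the given stage."""
    if is_associate:
        return True  # associates are always part of the initial population
    return bucket(user_id) < RAMP_STAGES[stage]


if __name__ == "__main__":
    customers = [f"cust-{i}" for i in range(10_000)]
    for stage, share in enumerate(RAMP_STAGES):
        exposed = sum(sees_change(c, False, stage) for c in customers)
        print(f"stage {stage}: target {share:.0%}, exposed {exposed} of {len(customers)}")
```

Advancing to the next stage is then a single configuration change, and rolling back is equally cheap if production shows a problem at small scale.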
Get serious about being well managed: Platform owners should obsess about platform performance. All issues should be self-identified through controls and automated alerts. Exceptions should be addressed quickly. Root cause analysis of issues as well as changes to prevent recurrence should be prioritized. A lack of issues should be properly celebrated so that teams know it is appreciated.
If it seems too good to be true… Exception monitoring is a great way to ensure that your execution matches your intent. Often the goal is to have zero exceptions. For example, latency should never exceed 200 milliseconds. If your exception reporting NEVER shows any exceptions, it’s possible that the monitoring is broken. Always force an exception to make sure that it triggers properly. I’ve learned this one the hard way!
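A forced exception can be as simple as pushing a synthetic breach through the same alert path that real traffic uses and checking that it arrives. The sketch below assumes a hypothetical `LatencyMonitor` class and an in-memory list standing in for the alert destination; the 200-millisecond limit is the example threshold from the text.

```python
class LatencyMonitor:
    """Raises an alert whenever a recorded latency exceeds the limit."""

    def __init__(self, alert, limit_ms=200):
        self.alert = alert  # callable that delivers an alert event (hypothetical hook)
        self.limit_ms = limit_ms

    def record(self, latency_ms, synthetic=False):
        breached = latency_ms > self.limit_ms
        if breached:
            self.alert({
                "latency_ms": latency_ms,
                "limit_ms": self.limit_ms,
                "synthetic": synthetic,  # lets responders ignore test events
            })
        return breached


def forced_exception_fires(monitor, inbox):
    """Send a synthetic breach through the real alert path and check that it arrived."""
    before = len(inbox)
    monitor.record(monitor.limit_ms + 1, synthetic=True)
    return len(inbox) > before and bool(inbox[-1]["synthetic"])


# Healthy wiring: alerts land in the inbox, so the forced exception is seen.
inbox = []
healthy = LatencyMonitor(alert=inbox.append)
print(forced_exception_fires(healthy, inbox))  # True

# Broken wiring: alerts are silently dropped, so "zero exceptions" means nothing.
silent_inbox = []
broken = LatencyMonitor(alert=lambda event: None)
print(forced_exception_fires(broken, silent_inbox))  # False
```

Running a check like this on a schedule turns a suspiciously quiet dashboard into something you can trust, because silence is only meaningful once the alert path has been proven to work.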
A happy team is a productive team. Celebrate accomplishments, recognize team members when they go above and beyond, and create a psychologically safe environment. Measure team happiness (with a quick 1-5 scale) regularly and give teams the space to discuss what would make them happier and the autonomy to try things out to squash dissatisfiers.
Platforms are one of the next natural extensions of the technology revolution. When developed and managed correctly, they enable the flexibility and speed required for businesses to offer more personalized and intelligent experiences for end users and customers than ever before.