How Capital One operationalized data mesh on Snowflake
August 29, 2022
Presented at Snowflake Summit 2022 by Salim Syed, Vice President and Head of Engineering, and Patrick Barch, Senior Director of Product Management for Capital One Software.
Over the last several years, Capital One underwent a journey to reimagine our data ecosystem for the opportunities and demands of managing data in the cloud. Along the way, we developed tools to empower data owners to access and manage data while ensuring enterprise-wide adherence to central governance policies and standards. Essentially, we discovered a way to operationalize what we know today as the concept of data mesh.
This past June, we shared these experiences in a talk at Snowflake Summit. We’ll recap what we learned, including how we developed the tools, technologies and processes to distribute data ownership to our business lines while maintaining a well-governed data practice.
What is data mesh?
Let’s first define data mesh. Data mesh is an architectural concept that distributes the handling of data in an organization to individual business lines or domains. In this way, data mesh departs from the traditional approach of centralizing data responsibilities within a single data team, and it allows companies to access and analyze data at scale.
In recent years, data mesh has emerged as an important framework to help companies scale a well-managed cloud data ecosystem in a complex data environment in which the volume and sources of data are growing exponentially.
The principles of data mesh include:
- Data as a product: Data teams apply product thinking to their datasets. In this approach, an organization assigns a product owner to data and applies the rigors of product principles to data assets to provide real value to its end customers (e.g., data scientists and analysts).
- Domain ownership: Data ownership is federated among domain experts who are responsible for producing assets for analysis and business intelligence.
- Self-serve data platforms: A central platform handles the underlying infrastructure needs of processing data while providing the tools necessary for domain-specific autonomy.
- Federated computational governance: A universal set of centrally defined standards ensures adherence to data quality, security and policy compliance across data domains and owners.
Scaling data in the cloud
We saw early on the benefits of moving our data workloads to the cloud. As a data-driven business from our start, we knew the key to succeeding in a technology-driven business environment was to harness data to drive business value. In 2017, we began modernizing our data ecosystem in the cloud by building our data infrastructure, storage and processes from a cloud-first perspective. This included the adoption of Snowflake as our data warehouse. In 2020, we became the first major U.S. bank to move its entire data workload to the cloud.
The cloud provided access to data from more sources. At the same time, we encountered challenges in managing an exponential growth in data that was stored in different places and meeting the demands of users with a variety of use cases.
A complicated data ecosystem
One of the tenets of data mesh is treating data as a product. Early on, we applied product management and user-centered design principles to solve our data management challenges. In taking this approach, we learned that there were many players working together in our data ecosystem. Scaling this ecosystem effectively meant all the different groups needed to work together seamlessly.
There were different data experiences for different users including in publishing, consumption, governance and infrastructure management. We had people responsible for publishing high-quality data to a shared environment, business analysts and data scientists looking to use data to inform business decisions, and people focused on enforcing data governance policies across the company. Lastly, there were teams managing the infrastructure that made all of these use cases possible.
There was a lot of room for miscommunication between these different groups working in various modes of operation as each tried to get work done. And because point solutions addressed different pieces independently, we were left with a complicated set of user experiences requiring six or seven different tools to complete a single task such as publishing or finding data (see below for what this overly complicated ecosystem can look like). Scaling this ecosystem quickly became challenging.
We needed a way to enable an integrated experience that considered the needs of all data stakeholders throughout an organization and empowered them to work together seamlessly.
A two-pronged approach to data mesh
We knew we wanted to give greater ownership over data to our business lines without sacrificing important company standards for data. This led us to a two-pronged approach to data mesh:
- Centralizing policy through tools built into a central platform
- Enabling federated data management across our business lines
Centralized tooling and policy
Modern data management was only possible by enforcing common standards across our various business lines and data stakeholders.
- Data responsibilities within business lines: We first broke our business lines into a hierarchy of organizations and business units, each a unit of data responsibility. Every organization had the same set of roles supporting its data needs: data stewards responsible for the risk of several datasets within the organization, data stewards responsible for the risk of datasets within a business unit, and a data risk officer accountable for the organization’s data risk as a whole.
- Common data standards for metadata curation: Additionally, we defined common, enterprise-wide standards for metadata curation. Not all data is created equal. For example, data used in regulatory reports required a different standard of governance than staging tables. We made sure our metadata policies reflected these important nuances.
- Data quality standards for shared datasets: Using a sloped governance approach, we increased governance and controls around access and security for each ascending level of data. For example, production data needs to pass more rigorous data quality checks than data within private user spaces.
- Entitlement patterns based on data sensitivity: Early in our journey, each dataset was protected by its own entitlement, which meant an analyst could spend days or even weeks figuring out which permission to request, only to find 66% of the time that the data they wanted was poor quality or wasn’t what they needed. We solved this by creating “business org<>sensitivity” combinations, so that a user could access all non-sensitive data in one place and only submit a new request when escalating permissions.
- Make it easy: Finally, we had to make all of this easy for both data practitioners to follow and data governance teams to enforce. In surveying hundreds of our practitioners, we found that the vast majority wanted to do the right thing and be good corporate stewards. But confusing and opaque policies prevented them from getting there.
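The org-and-sensitivity entitlement scheme above can be sketched in a few lines. This is a hypothetical illustration, not Capital One’s actual implementation: the tier names, the role-naming convention and the org name are all assumptions made up for the example.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Ascending sensitivity tiers; higher tiers require an explicit request."""
    NON_SENSITIVE = 1
    SENSITIVE = 2
    HIGHLY_SENSITIVE = 3


def entitlement_role(business_org: str, sensitivity: Sensitivity) -> str:
    """Map a (business org, sensitivity) pair to one coarse-grained role,
    replacing one entitlement per dataset. The naming scheme is hypothetical."""
    return f"{business_org.upper()}_{sensitivity.name}"


def needs_new_request(held: Sensitivity, requested: Sensitivity) -> bool:
    """A user files a new access request only when escalating sensitivity."""
    return requested > held


# An analyst holding the org-wide non-sensitive role can read any
# non-sensitive dataset in their org without a per-dataset request.
print(entitlement_role("card_servicing", Sensitivity.NON_SENSITIVE))
# CARD_SERVICING_NON_SENSITIVE
print(needs_new_request(Sensitivity.NON_SENSITIVE, Sensitivity.SENSITIVE))
# True
```

The design choice worth noting is that the number of entitlements grows with the number of orgs times a handful of tiers, rather than with the number of datasets.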
Federated data management
The traditional approach of a central team running your data processing or data loading jobs no longer works as a model for data management at the speed and scale required in the cloud era. One of the main concepts of data mesh is the federation of ownership, or the idea that the ownership of data as a product should lie with the person who knows the data best. But for federated data management to succeed, we needed to create delightful, integrated experiences for our users.
A key step to federating data management responsibly was in providing a usability layer. This layer was oriented by the jobs that needed to be done rather than capabilities like cataloging or data quality. Jobs included publishing a new data product, protecting sensitive data or reconciling an infrastructure bill. A data user could now accomplish a task in one place through simple workflows that guided them through the process while ensuring governance requirements were followed.
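A jobs-oriented usability layer like the one described above might be modeled as an ordered workflow whose steps embed governance gates. This is a minimal sketch under assumed names; the job name and step functions are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Workflow:
    """A 'job to be done' as an ordered set of steps, with governance
    checks built into the flow rather than left to the user."""
    name: str
    steps: list[Callable[[dict], None]] = field(default_factory=list)

    def run(self, context: dict) -> dict:
        for step in self.steps:
            step(context)  # a step raises if its governance gate fails
        return context


# Hypothetical steps for a "publish a new data product" job.
def register_metadata(ctx: dict) -> None:
    ctx["metadata_registered"] = True


def run_quality_checks(ctx: dict) -> None:
    ctx["quality_passed"] = True


def apply_entitlements(ctx: dict) -> None:
    ctx["entitlements_applied"] = True


publish = Workflow("publish_data_product",
                   [register_metadata, run_quality_checks, apply_entitlements])
result = publish.run({"dataset": "card_payments"})
```

Because each step runs in order and can halt the job, a user who completes the workflow has, by construction, satisfied the governance requirements.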
We were also able to create the following self-service experiences, tailored to each type of data user.
- Data producer: For those responsible for pushing a new data product to Snowflake, we enabled a self-service data producer experience that empowers lines of business to manage their own data. This included automatically creating tables and views in the appropriate Snowflake database and assigning appropriate roles to Snowflake tables based on the user’s description of data sensitivity—all while providing the necessary governance and controls through easy-to-follow workflows.
- Data consumer: In a large organization with hundreds to thousands of tables, it can be difficult for analysts and data scientists to find, trust and get access to data. We enabled a self-service data consuming experience so the consumer could go to one place to discover, evaluate and get access to the right data. Information about a dataset such as data quality results, lineage of data, data profile and samples of the data was readily available.
- Risk manager: Those defining and enforcing data protection policies could feel confident knowing the automation of governance was in place through custom workflows and dynamic warehouse provisioning. These workflows ensure a fully auditable purge process and the ability to properly make changes to data in a production environment.
- Business data platform owner: Data mesh not only requires federation of data but the infrastructure to support these processes. For those charged with managing Snowflake infrastructure provisioning and costs, we provided self-service tools to optimize cloud costs while effectively scaling cloud data. For example, multiple dashboards show cost trends, cost spikes and their drivers. We recently brought this capability to market with the launch of Slingshot, a SaaS offering to help businesses adopt Snowflake quickly, automate governance and optimize cloud usage and costs.
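The data producer experience described above, creating tables in the right Snowflake database and granting roles from a declared sensitivity, could be sketched as a small statement generator. The database, schema and role-naming conventions here are assumptions for illustration, not Capital One’s or Snowflake’s actual internals.

```python
def provision_statements(database: str, schema: str, table: str,
                         columns: dict[str, str], sensitivity: str) -> list[str]:
    """Generate the Snowflake DDL and GRANT a self-service producer flow
    might run: create the table, then grant the role matching the declared
    sensitivity. The role-naming scheme is hypothetical."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    fq_table = f"{database}.{schema}.{table}"
    role = f"{schema.upper()}_{sensitivity.upper()}"
    return [
        f"CREATE TABLE IF NOT EXISTS {fq_table} ({cols});",
        f"GRANT SELECT ON TABLE {fq_table} TO ROLE {role};",
    ]


for stmt in provision_statements("ANALYTICS", "card", "payments",
                                 {"id": "NUMBER", "amount": "NUMBER(12,2)"},
                                 "non_sensitive"):
    print(stmt)
```

The point of the sketch is that the producer describes only the data and its sensitivity; the platform derives the placement and access grants, so governance is applied mechanically rather than by hand.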
Through these steps, we were able to create more accountability for data quality, improve discovery and build greater data trust across our organization.
How companies can enable data mesh
Data mesh is just a concept unless you can provide the self-service tools and automated workflows to operationalize federated ownership. An organization needs to provide the central policies and tooling in one place to enforce governance while supporting all the different lines of business.
When you combine central tooling and central policy with federated data ownership, you will be set up to operate at the speed required to fulfill business needs and meet expectations for faster insights in the cloud era.