How to unleash your data while managing risk and costs

Presented at Snowflake Summit 2023 by Nagender Gurram, Sr. Director, Software Engineering, and Patrick Barch, Sr. Director, Product Management, at Capital One Software.


Winning in today’s market means making the most of your data, and many companies are adopting the cloud to come out on top. Gartner forecasts $591.8 billion in spending on public cloud services in 2023, a 20.7% increase over the previous year.

Cloud computing offers many upsides for businesses looking to capitalize on data, including instantly scalable capacity, simplified infrastructure management, and built-in security and resiliency. However, more than half of business executives have not realized the cloud benefits they were after, according to PwC. 

In our journey to the cloud, we found that the key to unleashing the value of data is striking the right balance between governance and enablement. This past June at Snowflake Summit, we shared what we’ve learned about empowering teams with data while managing risk and costs. Below, we summarize those lessons.


Balancing empowerment and governance

In Capital One’s path to becoming the first U.S. financial institution to go all-in on the cloud, we found that operating in the cloud requires a different approach than managing data on premises. To realize the value of the cloud, we believe companies need a model that balances governance and enablement.

Organizations need to ensure their data meets governance requirements while controlling costs. At the same time, data analysts should be equipped with the tools and processes to easily onboard new data use cases and drive their business goals forward. But achieving this balance is challenging. If you skew too far toward governance, you limit what your teams can deliver and fail to realize the benefits of the cloud. If you skew too far toward enablement, you take on governance, security and cost risks.

Federation and self-service

Traditionally, with on-premises systems, data requests were fielded by a central team that ensured the data followed strict standards for security, quality and regulatory compliance. That model no longer works in the cloud, where data volumes grow exponentially and the business demands faster insights. A federated data management approach disperses data responsibilities along business lines to those who own the data. Its tenets are self-service and centralized tooling governed by central policies. Self-service allowed our teams to operate at their own pace on a single data usability platform.

Finding and accessing data with sloped governance

How do we set the right policies while empowering people to get their jobs done? Our approach is sloped governance based on risk. Foundational to this approach is the idea that not all data requires the same level of governance. As data grows in importance and riskiness, we apply more governance: greater controls around quality, access and security at each ascending level. Sloped governance places the right amount of governance at the right time, keeping data well-managed while encouraging innovation.

To illustrate how we can apply sloped governance, let’s walk through a business scenario with three types of data as they flow through an enterprise ecosystem. 

[Flow chart: siloed, shared and high-value data flowing through the enterprise ecosystem]

For this use case, we have a credit card transaction system with multiple databases within the application. The three types of data are:

  • Siloed data that never leaves the source system
  • Shared data that is accessed by the rest of the enterprise
  • High-value data that gets used in an important fraud detection use case

Next, we will look at how to apply sloped governance to each of these types of data to manage risk while enabling teams to succeed. We examine them through the following lenses:

  • Metadata: How you know where all your data is
  • Data quality: How you measure whether the data is good quality
  • Data access: How you manage who gets access to the data

Metadata

Metadata helps users find, evaluate and use the right data. It also enables organizations to understand the risk profile of their data ecosystem. The question is how much metadata an organization should capture. At Capital One, we used to apply the same metadata policy requirements across our datasets. But we experienced long turnaround times and lower quality metadata that made finding and using the right data challenging. 

Our solution to defining metadata requirements that are useful, but not overly burdensome, was implementing data tiers. The concept is that the more important and widely used the data, the more metadata it requires. Here are the data tiers applied to the three types of data, with a short sketch after the list of how the requirements might be enforced:

  • Siloed data: All data in the company must be registered to a minimum metadata standard, regardless of data type or risk. Example metadata captured at this tier includes data location and sensitivity.
  • Shared data: Any data intended for sharing across the organization requires a greater level of metadata completeness to support discovery and understanding by users who may be unfamiliar with the line of business. The metadata includes business context describing what the data is, where it came from and how it should be used. Examples include a description of the data and its permitted use.
  • High-value data: The highest tier of metadata applies exclusively to data that feeds risk and regulatory reporting. Several more fields are necessary to validate the authority of the source and ensure the lineage is complete and trackable. This tier may require metadata such as third-party share indicators and priority level. Although this adds overhead for teams, the number of datasets in this category should be relatively small.
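
To make the tiers concrete, here is a minimal sketch in Python of how tiered metadata requirements could be checked at registration time. The field names and tier labels are illustrative assumptions, not Capital One’s actual schema.

    # Required metadata grows with each tier; field names are hypothetical.
    REQUIRED_METADATA = {
        "siloed": {"data_location", "sensitivity"},
        "shared": {"data_location", "sensitivity", "description",
                   "permitted_use"},
        "high_value": {"data_location", "sensitivity", "description",
                       "permitted_use", "third_party_share_indicator",
                       "priority_level", "lineage"},
    }

    def missing_metadata(tier, registration):
        """Return the required fields absent from a dataset's registration."""
        return REQUIRED_METADATA[tier] - registration.keys()

    # A shared dataset registered without a permitted-use statement:
    print(missing_metadata("shared", {
        "data_location": "card_txn_db",
        "sensitivity": "internal",
        "description": "Card transaction events",
    }))  # -> {'permitted_use'}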

Using the tiered approach, we were able to apply a small number of metadata requirements to all data, a medium amount to shared data and a large amount to the high-risk data that warrants it. This method allowed us to strike a balance between enabling teams and ensuring strong governance.

Data quality

Similarly, sloped governance applied to data quality means standards escalate as data value increases. Data quality is a measure of the reliability and fitness of data for use by an organization and is vital to making informed business decisions. The more widely data is used, the higher the quality standards it must meet. (A short sketch of the escalating checks follows the list.)

  • Siloed data: Data used within its own application does not require strict data quality standards since it will not be shared. The team using the data takes responsibility for monitoring the quality daily. 
  • Shared data: Data shared outside of a team or application needs to meet basic standards of data quality, such as schema conformance and completeness.
  • High-value data: This tier is reserved for the most important data that is critical to the success of the business. As data usage increases and becomes foundational to business decisions, specific data quality checks become more important. These include field validations, field dependencies and business applicability. 
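
Here is a minimal sketch of what these escalating checks could look like in Python. The schema and validation rules are illustrative assumptions, not a production rule set.

    # Hypothetical schema for card transaction rows.
    EXPECTED_COLS = {"txn_id", "account_id", "amount"}

    def shared_tier_ok(rows):
        """Shared data: basic schema conformance and completeness."""
        schema_ok = all(set(row) == EXPECTED_COLS for row in rows)
        complete = all(value not in (None, "") for row in rows
                       for value in row.values())
        return schema_ok and complete

    def high_value_ok(rows):
        """High-value data: field-level validations on top of the
        shared-tier checks, e.g. amounts must be nonzero numbers."""
        amounts_ok = all(isinstance(row["amount"], (int, float))
                         and row["amount"] != 0 for row in rows)
        return shared_tier_ok(rows) and amounts_ok

    rows = [{"txn_id": "t1", "account_id": "a9", "amount": 42.50}]
    print(shared_tier_ok(rows), high_value_ok(rows))  # True True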

Data access

For businesses to realize the value of data, users must be able to easily access the data. A well-planned and executed access model is crucial for data security and policy compliance. But not all data requires the same level of strictness when it comes to access. Let’s compare two basic models for accessing data to consider how to implement an effective access model without overburdening users.

Within a card domain, there are multiple datasets for transactions and authorizations. A strict access model that uses a unique entitlement to protect each individual dataset can be cumbersome and slow. Such a model can be incredibly challenging for users and lead to frustration, as data discovery often takes several attempts to succeed. Additionally, some access approvals can take days or weeks to complete.

Capital One devised a domain and sensitivity access model that is flexible and can be customized to fit the needs of the organization. Access is granted to all data within a domain at or below a given sensitivity level. Because all data is categorized by sensitivity, a user can have access to, for example, all non-sensitive data within a particular line of business. With this model, the user needs to re-request access only for stepped-up permissions.
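
A minimal sketch of an access decision under this model might look like the following; the domains, sensitivity levels and entitlement shape are illustrative assumptions.

    # Higher numbers mean more sensitive data.
    SENSITIVITY = {"public": 0, "internal": 1, "confidential": 2}

    # One entitlement covers every dataset in a domain up to a sensitivity
    # level, instead of one entitlement per dataset.
    entitlements = {"alice": {"card": "internal"}}

    def can_access(user, domain, dataset_sensitivity):
        granted = entitlements.get(user, {}).get(domain)
        if granted is None:
            return False
        return SENSITIVITY[dataset_sensitivity] <= SENSITIVITY[granted]

    print(can_access("alice", "card", "public"))        # True
    print(can_access("alice", "card", "confidential"))  # False: re-request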

Creating an environment for self-service and innovation

Sloped governance also plays an important role in empowering the analyst community. The right governance can help manage analyst activity and drive greater innovation. Without proper governance, however, your data environment can devolve into thousands of user tables full of duplicates and derived data with no record of lineage or source. That creates risk: the company does not know who is accessing the data or what they are doing with it, and it lacks insight into whether compute usage and costs line up with business goals.

Data sandbox for insights and collaboration

Our solution was to provide users a data sandbox with well-managed personal and collaboration spaces for driving innovation. A data sandbox is an isolated playground environment for building, testing and operationalizing new processes and models on production data. Users can reuse data and processes rather than building them from scratch. It is also a safe, controlled environment that minimizes data risk for enterprises. The data sandbox is provisioned separately so that it does not impact production spaces.

The data environment consists of a private user space, a shared collaboration space and production datasets. Governance requirements differ for each space and increase with the maturity and usefulness of the data. Governance is applied to each space in the following ways, with a sketch of the promotion gates after the list:

  • Private user space: Data analysts and data scientists can create new insights, innovate and try new things in a secure, controlled environment without affecting the rest of the business. This is the lowest tier, with minimal governance beyond policies for data scanning and retention.
  • Collaborative space: From the private user space, analysts and scientists can move their data into the collaborative environment when they judge it good enough in quality to share with others on their team. This space gives them a place to share, analyze and store data. It promotes efficiency, as analysts can quickly build business processes, analyze the available data and build new, transformed datasets. As data is shared in the sandbox, we increase governance and automate where we can, capturing a minimum set of metadata such as the data owner, schema and frequency of use.
  • Production datasets: Users then deploy their data to production datasets, where it resides in the live operational environment for use by the business. In production, data must be available, accurate and accessible. This is where we start capturing metadata such as data lineage, connect users to existing data quality tools and enforce code control of the SQL process through GitHub. We make sure everything is in place to use production data and connect it to downstream applications.
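
As a rough illustration, promotion between these spaces can be modeled as a gate that checks for the metadata the target space requires. The space names and required fields below are illustrative assumptions, not our platform’s actual policy.

    # Governance requirements grow as data moves toward production.
    SPACE_REQUIREMENTS = {
        "private": set(),  # innovate freely, subject to scanning/retention
        "collaborative": {"owner", "schema", "frequency_of_use"},
        "production": {"owner", "schema", "frequency_of_use", "lineage",
                       "quality_checks", "source_control_repo"},
    }

    def promote(dataset, target_space):
        """Allow promotion only when the dataset carries the metadata
        the target space's governance tier demands."""
        missing = SPACE_REQUIREMENTS[target_space] - dataset["metadata"].keys()
        if missing:
            raise ValueError(f"cannot promote to {target_space}; missing {missing}")
        dataset["space"] = target_space
        return dataset

    ds = {"space": "private",
          "metadata": {"owner": "alice", "schema": "v1",
                       "frequency_of_use": "daily"}}
    promote(ds, "collaborative")  # succeeds; "production" would raise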

Read more: How to tackle the problem of an unmanaged data sandbox

Well-managed costs

Now that we had a data sandbox where analysts could innovate and move their data into shared environments, we needed to manage the costs of all that activity. Costs can escalate quickly in the cloud for a number of reasons, including poorly written queries, overprovisioned compute resources, excessive data storage and optimizing for speed without regard to cost. We recommend managing costs and improving efficiency with the following, illustrated by a brief monitoring sketch after the list:

  • Proactive monitoring and alerting
  • Infrastructure governance policies
  • Built-in retention enforcements
  • Custom, granular chargeback reports
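
As a simple illustration of proactive monitoring and alerting, the sketch below flags teams approaching or exceeding a monthly budget. The budgets, team names and usage numbers are made up; real figures would come from Snowflake usage data or a tool such as Slingshot.

    # Hypothetical monthly budgets and month-to-date spend, in dollars.
    BUDGETS = {"fraud-team": 5000.0, "card-analytics": 2000.0}
    usage = [("fraud-team", 4100.0), ("card-analytics", 2300.0)]

    for team, spend in usage:
        budget = BUDGETS[team]
        if spend > 0.8 * budget:  # alert before the budget is blown
            pct = 100 * spend / budget
            print(f"ALERT: {team} at {pct:.0f}% of ${budget:,.0f} budget")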

Tying it all together

At Capital One, we have experienced incredible benefits by striking a balance between the right governance for the right data and empowering our teams. We saw significant year-over-year increases in the number of Snowflake use cases in production, in Snowflake queries run and in engineer hours freed.

To summarize, we believe you can unleash your data while managing risk and costs by adhering to the following best practices:

  • Apply data tiers based on importance and value
  • Right-size governance policies for each data tier
  • Apply these policies to an environment built for innovation

We took what we learned about striking the right balance between governance and enabling our people and built Slingshot, a data management SaaS solution that helps businesses on Snowflake maximize their investments by empowering teams to spend wisely, forecast confidently and make smarter decisions. Learn more about how Capital One Software can help unleash the power of your data.
