What today’s tech leaders are solving for in Data and AI

Four data trends we are paying attention to.

As data leaders across industries push to scale AI initiatives, they are confronting a familiar set of challenges: protecting sensitive data, managing cloud data costs, maintaining data quality and ensuring data governance without slowing teams down.

At Capital One, we’ve been building tools and frameworks to address these very problems at scale. Along the way, we’ve learned from our own experience and gathered insights from our customers and partners. While every organization’s data journey is different, a few themes have surfaced that may resonate with data leaders navigating similar terrain.

Here are four trends that are worth paying attention to:

1. Data tokenization is a powerful addition to traditional security methods

As cyber threats grow and data volumes rise, leaders are looking for ways to better protect data without compromising usability, especially for AI use cases. That’s where tokenization brings value: organizations are increasingly adopting it to complement encryption and masking because it replaces sensitive values with non-sensitive tokens that analytics and AI workloads can still work with.

At Capital One, we embraced tokenization as a robust data security solution to protect sensitive data in the cloud, at the speed and scale our business requires. As adoption expanded, we gained critical insights into balancing security with functionality. Inspired by that journey, we introduced Capital One Databolt to bring our expertise to market.

Databolt is an enterprise-grade, vaultless tokenization engine capable of generating up to 4 million tokens per second and seamlessly integrating with platforms like Snowflake and Databricks. Its tokens are format-preserving and deterministic, so customers can innovate with AI and BI tools without needing access to the raw values of sensitive data. As tokenization continues to gain traction, it’s clear this trend is becoming foundational to modern data protection at scale.
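To make those two properties concrete, here is a minimal, illustrative sketch of deterministic, format-preserving tokenization in Python. It is a toy example under assumed inputs (a 16-digit card number and a hypothetical key), not Databolt’s algorithm and not production-grade cryptography; it simply shows why such tokens stay useful downstream: lengths and formats are preserved, and equal inputs map to equal tokens, so joins and group-bys still work.

```python
import hmac
import hashlib

SECRET_KEY = b"example-key"  # hypothetical key; real systems use managed key material

def tokenize_pan(pan: str) -> str:
    """Toy tokenizer: map a digit string to another digit string of the same length."""
    digest = hmac.new(SECRET_KEY, pan.encode(), hashlib.sha256).digest()
    # Derive one replacement digit per input position from the keyed digest.
    return "".join(str(digest[i % len(digest)] % 10) for i in range(len(pan)))

token = tokenize_pan("4111111111111111")
assert token == tokenize_pan("4111111111111111")  # deterministic: stable across calls
assert len(token) == 16 and token.isdigit()       # format-preserving: still looks like a PAN
```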

2. Metadata maturity is increasingly core to scaling AI responsibly

To maximize the benefits of AI and machine learning, more companies recognize the need for clear visibility into their data. Without insight into where data comes from, how it’s transformed and where it’s used, teams risk introducing bias, drift or errors into models that can be difficult to detect.

More mature metadata systems allow organizations to:

  • Understand dependencies between pipelines and models

  • Identify stale or duplicative data before it’s used in training

  • Establish a stronger foundation for access controls and data sharing

Investing in metadata also gives organizations better control over data protection and operations. With visibility into the data inventory, administrators can define and enforce appropriate access controls and guardrails, which in turn makes it easier to ensure proper usage and keep operational costs in check.
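As a hedged illustration of that idea, the sketch below shows metadata-driven guardrails in Python. The catalog schema, field names and thresholds are hypothetical (this is not Slingshot’s implementation); the point is that dataset-level metadata can drive both an access check and a freshness check before data is used for training.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical, minimal metadata catalog; real systems back this with a metadata store.
CATALOG = {
    "customer_transactions": {
        "owner": "payments-data",
        "classification": "sensitive",
        "last_refreshed": datetime(2024, 5, 1, tzinfo=timezone.utc),
        "allowed_roles": {"fraud-ml", "payments-data"},
    },
}

def can_use_for_training(dataset: str, role: str, max_staleness_days: int = 7) -> bool:
    meta = CATALOG[dataset]
    if role not in meta["allowed_roles"]:
        return False                                   # access-control guardrail
    age = datetime.now(timezone.utc) - meta["last_refreshed"]
    return age <= timedelta(days=max_staleness_days)   # staleness guardrail

print(can_use_for_training("customer_transactions", "fraud-ml"))
```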

As we scaled our own data in the cloud, we needed deep visibility into data cloud usage to understand, forecast and optimize spend across the organization. That need led to the development of Capital One Slingshot, a data platform management solution. Slingshot leverages metadata to analyze Snowflake and Databricks environments and provide insights and recommendations for both cost and performance optimization.

3. Data cloud optimization is more important than ever

Data platforms, especially in the cloud, are delivering unprecedented flexibility. But that flexibility often comes with rising, and sometimes opaque, costs. That’s why we’ve seen growing interest in FinOps-inspired governance models that provide visibility into usage patterns, enforce budget thresholds and promote shared accountability between central data teams and business units.
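To ground the budget-threshold idea, here is a simplified, FinOps-style guardrail in Python. The budget numbers are hypothetical and this is not how Slingshot works internally; the script queries Snowflake’s documented ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY view for the last 30 days of credit usage and flags warehouses that exceed their assumed budgets.

```python
import snowflake.connector

MONTHLY_CREDIT_BUDGETS = {"ANALYTICS_WH": 500, "ETL_WH": 1200}  # example thresholds

QUERY = """
    SELECT warehouse_name, SUM(credits_used) AS credits
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
"""

conn = snowflake.connector.connect()  # assumes credentials come from environment/config
try:
    for warehouse, credits in conn.cursor().execute(QUERY):
        budget = MONTHLY_CREDIT_BUDGETS.get(warehouse)
        if budget is not None and credits > budget:
            print(f"{warehouse}: {credits:.1f} credits used, over budget of {budget}")
finally:
    conn.close()
```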

In fact, Capital Rx, an enterprise healthcare technology company, successfully applied the FinOps framework, using Slingshot to optimize compute spend and workload performance, gain full visibility into data cloud usage across the organization and reduce user overhead. Slingshot enabled Capital Rx to build a more scalable, automated and efficient system to handle growing data and user demands, resulting in an immediate 59% decrease in average monthly Snowflake spend, despite growing data volumes and platform use.

4. Automated testing is integral for better code in complex environments

Testing Snowflake objects, especially stored procedures, is challenging due to data dependencies, lack of repeatability and manual processes. Automating test setup and execution within a CI pipeline allows for faster, more thorough testing, leading to higher-quality code and reduced regression risk. Automated testing is also crucial for data pipelines: as organizations adopt software development practices, it helps ensure quality without hindering delivery, even when pipelines span numerous data sources and dependencies.

At Capital One, we built a reusable testing framework to address these challenges and support Slingshot development across Snowflake environments, covering everything from upstream source validation and stored procedures to Snowpark-based transformations.

Our testing framework integrates production data nuances into testing environments using YAML configuration files to define resources and constraints. The process involves the following steps (a simplified sketch follows the list):

  1. Defining YAML files: These files declaratively specify Snowflake objects (tables, views, streams, tasks, etc.) and their constraints, shaping the desired test data.

  2. Exposing classes in test code: Test code classes consume YAML files, allowing for additional data constraints.

  3. Creating a data generator: The generator uses definitions to create data and can add constraints like row count to make variations.

  4. Creating and destroying ephemeral data: Temporary data is generated, meeting defined constraints, and automatically deleted by the framework post-test.
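Below is a condensed sketch of that flow. The YAML schema, class and fixture names are hypothetical, not the actual framework; it shows a declarative table definition, a generator that honors its constraints (including a row-count override), and a pytest fixture that creates and tears down ephemeral data around a test.

```python
import random
import yaml    # pip install pyyaml
import pytest

# Hypothetical YAML definition of a Snowflake table and its test-data constraints.
TABLE_DEF = yaml.safe_load("""
table: ORDERS_TEST
columns:
  - {name: ORDER_ID, type: int, min: 1, max: 99999}
  - {name: STATUS, type: choice, values: [NEW, SHIPPED, CANCELLED]}
row_count: 50
""")

def generate_rows(definition: dict, row_count: int | None = None) -> list[dict]:
    """Create synthetic rows that satisfy the declared constraints."""
    rows = []
    for _ in range(row_count or definition["row_count"]):
        row = {}
        for col in definition["columns"]:
            if col["type"] == "int":
                row[col["name"]] = random.randint(col["min"], col["max"])
            else:  # "choice"
                row[col["name"]] = random.choice(col["values"])
        rows.append(row)
    return rows

@pytest.fixture
def ephemeral_orders():
    rows = generate_rows(TABLE_DEF)  # create temporary test data before the test
    yield rows
    rows.clear()                     # stand-in for dropping the ephemeral table afterwards

def test_generated_statuses_are_valid(ephemeral_orders):
    # A real test would invoke the stored procedure under test and validate its output;
    # here we simply assert the generated data respects its declared constraints.
    assert all(row["STATUS"] in {"NEW", "SHIPPED", "CANCELLED"} for row in ephemeral_orders)
```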

By implementing an automated, reusable testing framework and integrating it into a CI pipeline, teams can achieve more rapid and comprehensive testing. This approach ultimately leads to enhanced developer productivity, higher quality code and successful deployments.

A closing thought

While these trends reflect some of the problems we’re solving internally and with our customers, they are ultimately part of a broader shift: from one-off tooling to systems-level thinking. The most effective data organizations aren’t just focused on speed or security or cost; they’re building integrated ecosystems that support all three at once.

If these challenges resonate with your experience, we invite you to explore how Slingshot or Databolt might support your organization.