How Capital One uses tokenization to protect data
Capital One's journey to greater data security using tokenization for analytics and AI.
Data security teams today must keep pace with a constantly evolving and complex data landscape. In an environment in which data breaches are on the rise and the volume of data is growing at unprecedented speeds, teams must protect their sensitive data while readying their data for growth opportunities. In order to support our daily operations at scale, Capital One adopted tokenization. In this blog, we share Capital One’s journey to greater data security and how tokenization can be a powerful tool to help ensure the security and agility of data for analytics and AI initiatives.
Securing sensitive data in a complex data environment
In recent years, data breaches have increased and evolved, particularly with new malicious uses of technologies like AI. In 2024, the number of data compromises in the U.S. totaled 3,158, a near-record high just short of the all-time high from the previous year.
At the same time, a complex data environment makes safeguarding sensitive data even more challenging. The complexity arises in part from the sheer volume of data and the velocity at which it is generated. Additionally, with data originating from many different sources, its diversity of formats, from structured to unstructured, adds to the difficulty of protecting sensitive data in all its forms. The need to secure data across on-premises and cloud environments also adds complexity when accounting for data movement and visibility gaps across multiple providers.
Additionally, companies are eager to incorporate AI into daily business operations and drive innovations, leading to a push for AI-ready data. But the struggle to maintain the privacy and integrity of data used in model training and analysis can severely limit AI adoption in enterprises.
Demand for access to sensitive data is growing even as data compromises rise. These factors have converged to create substantial challenges for today’s data security teams.
Why tokenization?
There is a spectrum of approaches to meeting today’s evolving data security needs.
Deletion involves the complete removal of sensitive information from the organization’s systems, which reduces the ability to use data in important initiatives like creating meaningful machine learning models. Masking permanently converts sensitive data into a string of random characters while preserving the data format for testing or analytics. Redaction is the process of replacing sensitive data with placeholders, such as asterisks, which also alters the data format. These three approaches are irreversible since data is no longer retrievable in its original form.
Reversible approaches to data security include encryption and tokenization. Encryption converts sensitive data into an unreadable format called ciphertext using a key, which is required to decrypt the text and access the original data. Tokenization replaces the sensitive data with a nonsensitive, randomized substitute called a token while preserving the data format. Tokens reduce security risks for organizations since, even if stolen, they have no inherent value and retain no relationship to the original data. In vaulted tokenization, the original data is stored separately and securely in a vault. In vaultless tokenization, there is no need for a centralized vault to store the original data. Instead, tokens are generated on the fly by cryptographic algorithms, without requiring storage of the plaintext data at each endpoint.
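To make the distinction concrete, here is a minimal Python sketch contrasting the irreversible approaches (masking, redaction) with a deterministic, vaultless-style token derived from a keyed HMAC. It is purely illustrative: the key handling is simplified and this is not Databolt’s algorithm, which would rely on vetted cryptography and proper key management.

```python
import hashlib
import hmac
import secrets
import string

SECRET_KEY = b"example-tokenization-key"  # hypothetical key; store in a KMS in practice


def mask(value: str) -> str:
    """Irreversible: replace each digit with a random digit (format kept)."""
    return "".join(secrets.choice(string.digits) if c.isdigit() else c for c in value)


def redact(value: str) -> str:
    """Irreversible: replace sensitive characters with a placeholder."""
    return "".join("*" if c.isdigit() else c for c in value)


def tokenize(value: str) -> str:
    """Deterministic token: the same input and key always yield the same token,
    and digits stay digits so the original shape (e.g. NNN-NN-NNNN) is kept."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).digest()
    digit_stream = iter(str(int.from_bytes(digest, "big")))
    return "".join(next(digit_stream) if c.isdigit() else c for c in value)


ssn = "123-45-6789"
print(mask(ssn))      # random digits, different on every run
print(redact(ssn))    # ***-**-****
print(tokenize(ssn))  # same token on every run for this input and key
```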
In many cases, tokenization will be the better option for data security teams, particularly when it comes to AI and analytics. Let’s dig deeper into how tokenization benefits enterprises and how it can address several of today’s data security challenges.
Benefits of tokenization over traditional methods
Tokenization is a powerful tool for teams looking to secure their sensitive data while maintaining the agility necessary to use data effectively for business growth.
It holds several advantages over traditional methods of securing data:
- Key management and security: While encryption relies on secure key management for both encryption and decryption, tokenization offers simpler key management because there are fewer keys to manage. Brute-force key compromise is also practically infeasible, as it would require massive amounts of compute power and cost.
- Flexible design: The token value can carry embedded metadata with useful, nonsensitive information, such as for bad token detection and rotation support. Tokenization can also maintain the deterministic integrity of data so it can continue to be used in operations like joins and queries.
- Format preservation: Because tokenization preserves the format of the original data, raw values in a database can be replaced by tokens without modifying the schema or making any other major changes to the database. For example, a tokenized Social Security number can keep the format of nine digits separated by dashes.
- Usefulness in analytics: Tokenization is particularly useful for analytics since it maintains referential integrity throughout systems. The same input string tokenizes to the same output string, and that behavior persists across multiple nodes and complex organizations. As a result, tokens can be used to train machine learning models with a much higher degree of confidence than other security approaches allow (see the sketch after this list).
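The snippet below illustrates that referential-integrity point. It is a hedged sketch, not Databolt’s implementation: two datasets are tokenized independently with the same key, the raw identifier is dropped, and a join plus aggregation still works on the tokens alone.

```python
import hashlib
import hmac

KEY = b"example-tokenization-key"  # hypothetical key


def tokenize(value: str) -> str:
    """Deterministic token: the same input always maps to the same token."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:16]


accounts = [
    {"ssn": "123-45-6789", "segment": "premium"},
    {"ssn": "987-65-4321", "segment": "standard"},
]
transactions = [
    {"ssn": "123-45-6789", "amount": 42.50},
    {"ssn": "123-45-6789", "amount": 17.25},
    {"ssn": "987-65-4321", "amount": 99.00},
]

# Tokenize each table independently and drop the raw identifier.
segment_by_token = {tokenize(a["ssn"]): a["segment"] for a in accounts}
tokenized_txns = [(tokenize(t["ssn"]), t["amount"]) for t in transactions]

# The join and the downstream aggregation still work on tokens alone.
spend_by_segment: dict[str, float] = {}
for token, amount in tokenized_txns:
    segment = segment_by_token[token]
    spend_by_segment[segment] = spend_by_segment.get(segment, 0.0) + amount

print(spend_by_segment)  # {'premium': 59.75, 'standard': 99.0}
```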
Capital One’s data security journey
These benefits factored into Capital One’s decision to adopt tokenization to protect our most sensitive data, a necessity in our business.
As part of Capital One’s journey to transform our data ecosystem, we rearchitected our entire data infrastructure in the cloud. We began building new cloud data management platforms and data security tools that enabled business teams to tokenize and manage sensitive data. In 2018, we built a tokenization engine to operate at the speed and scale our business required. We wanted it to be highly secure and as frictionless as possible to integrate with the various systems in our data ecosystem. Today, we have billions of tokenized records across hundreds of applications.
In 2022, we built Capital One Software to commercialize our internal innovations. We began with the introduction of Slingshot, a data management solution that helps companies maximize their cloud investments. This year, we introduced Databolt, a tokenization solution developed from our in-house tokenization engine and inspired by our journey to secure our data without compromising on performance.
Databolt: Tokenizing sensitive data at scale
In bringing our expertise to market with Databolt, we provided capabilities for securing sensitive data at scale with minimal latency. Databolt is a vaultless tokenization solution that replaces sensitive data with randomized tokens at the source.
- Vaultless design: A vaultless approach uses secure cryptography for greater efficiency and safety.
- Advanced security model: Data security teams can create custom, complex access infrastructure that is flexible enough to accommodate enterprises big and small.
- Cloud-native architecture: Databolt was built with cloud-native tenets in mind. The tokenization engine is fully containerized and designed to scale both horizontally and vertically on cloud infrastructure.
- Lightning-fast performance: With throughput of up to 4 million tokens per second, Databolt delivers high performance and low latency, and its flexible deployment model fits into each customer’s unique infrastructure.
Architecture of a tokenization solution
With Databolt, we wanted to enable federated computing in data planes that exist wherever the data lives while maintaining centralized onboarding controls and monitoring. A federated data plane means an organization’s tokenized data never leaves its environment.
Configuration and logging are centralized through the control plane, which is owned and hosted by Capital One and accessible via a customer portal. Through a self-service UI, customers can configure access rules, such as who can tokenize and detokenize which data, across the organization and in multiple applications. Customers also manage keys from the configuration store.
The actual tokenization and sensitive data stays within the customer’s environment. Customers can deploy data planes with the tokenization engine running right next to the data. Rather than sending sensitive information across the wire, companies can tokenize and detokenize without crossing trust boundaries.
Configurations defined in the control plane are sent to the data plane on a continuous basis, so that any changes, such as in access policies, can sync as quickly as possible. Importantly, the data plane continues to work independently even if the connection is broken.
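The sketch below shows the shape of that control-plane/data-plane split. The class, fields and endpoint are hypothetical illustrations rather than Databolt’s actual API; the point is that the data plane periodically pulls access rules, caches them locally, and keeps enforcing the cached copy if the connection to the control plane breaks.

```python
import json
import time
import urllib.error
import urllib.request

CONTROL_PLANE_URL = "https://control-plane.example.com/policies"  # hypothetical endpoint


class DataPlane:
    """Runs next to the data; sensitive values never leave this environment."""

    def __init__(self) -> None:
        self.policies: dict = {}   # last successfully synced access rules
        self.last_sync: float = 0.0

    def sync_policies(self) -> None:
        """Pull the latest access rules; on failure, keep the cached copy."""
        try:
            with urllib.request.urlopen(CONTROL_PLANE_URL, timeout=5) as resp:
                self.policies = json.load(resp)
                self.last_sync = time.time()
        except (urllib.error.URLError, TimeoutError):
            # Connection to the control plane is broken: continue operating
            # with the previously synced policies.
            pass

    def may_detokenize(self, principal: str, field: str) -> bool:
        """Enforce the cached rules locally, right next to the data."""
        allowed = self.policies.get(field, {}).get("detokenize", [])
        return principal in allowed
```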
Optimizing tokenization for Databricks and Snowflake
After bringing our tokenization solution to market, we saw that Databricks and Snowflake provided opportunities to embed it directly into our customers' data environments. Many companies had massive amounts of information and were performing large-scale analyses, but they were hesitant to bring sensitive data into their data warehouses, or to expand confidential data usage there, due to security risks.
We saw several opportunities to uniquely solve these challenges with the following:
- Purpose-built technology: Databolt has seamless integrations with both Databricks and Snowflake, leveraging unique features of each platform.
- Use cases at scale: Since Databolt was built for speed and scale, it was a perfect fit for both Databricks and Snowflake, two platforms that allow customers to store and manage data at massive scale.
- Deep experience: Capital One’s extensive knowledge and experience using both Databricks and Snowflake across large and varied workloads helped us deeply understand data users’ experiences and goals and build a unique experience with Databolt.
We took steps to further optimize Databolt to deliver the best possible performance and experience for each ecosystem. To learn more about these integrations, check out our blogs on Databolt for Databricks and Databolt for Snowflake.
New data use cases
Optimizing tokenization for both Databricks and Snowflake environments opens up new use cases for customers.
- Field-level security: Databolt protects individual data fields by making the data practically useless to threat actors. This protection at the field level decreases risk in the event of a data breach.
- Data sharing: Keeping all data and processing within the Databricks and Snowflake ecosystems opens up new opportunities around data shares. Deploying the tokenization engine makes it possible to grant tokenization abilities when sharing data with select individuals.
- AI model training: Since tokens are deterministic, they can be used in applications such as large language model (LLM) training and business analytics, as illustrated in the sketch after this list.
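As a brief illustration of why determinism matters for training and analytics (again a hypothetical sketch, not Databolt’s API), tokenizing a column preserves its value distribution: counts, cardinality and group structure computed on tokens match those computed on the raw values, so features built on tokenized data behave the same way.

```python
import hashlib
import hmac
from collections import Counter

KEY = b"example-tokenization-key"  # hypothetical key


def tokenize(value: str) -> str:
    """Deterministic token for a categorical value."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:12]


merchants = ["coffee_shop", "grocery", "coffee_shop", "airline", "grocery", "coffee_shop"]
tokens = [tokenize(m) for m in merchants]

# Frequency-based features are identical whether computed on raw values or tokens.
print(sorted(Counter(merchants).values()))  # [1, 2, 3]
print(sorted(Counter(tokens).values()))     # [1, 2, 3]
```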
What's next
Companies today manage unprecedented volumes and varieties of data in an environment in which data breaches are increasingly difficult to avoid. Maintaining the confidentiality and integrity of data is crucial to moving forward with high-priority, data-hungry initiatives like AI adoption. With a tokenization solution like Databolt, organizations can transfer and process sensitive data securely across the organization and its environments without hindering performance, supporting business growth into areas like large-scale analytics and AI.
Databolt integrations for both Databricks and Snowflake are now live in each platform. Reach out to our team to learn about how they can help secure sensitive data.