How to manage secure data and cost visibility for AI
How to manage data security and cost visibility to successfully implement and utilize AI.
High-quality data powers companies' ability to reap the business benefits of AI and machine learning, including solving challenging problems for their customers. As more companies race to implement AI, the need for high-quality, secure data has grown more urgent. From model building to better decision making, AI and analytics initiatives are only as strong as their data. But managing data quality while maintaining security and cost controls is a significant challenge in today's increasingly complex data environment. In this blog, we share tips on managing high-quality data while securing sensitive data and bringing visibility to hidden costs to drive successful use of AI.
Challenges for AI adoption
Companies want to adopt AI, but research shows that data and security concerns remain barriers. Although 84% of enterprise CIOs believe AI will be as significant for their businesses as the internet, only 11% believe they have fully implemented it, according to Salesforce. CIOs identified several issues that need resolution before AI adoption, including data accessibility, privacy threats and outdated infrastructure.
In the context of data, there are three common barriers to AI adoption:
- Trusted data: Data is everywhere and growing rapidly, but it needs to be accurate and trustworthy before companies can confidently feed it into an AI model.
- Data security: Companies are entrusted with their own sensitive data as well as their customers' sensitive data, and they need to make sure their models and chatbots will not leak any of that information.
- Cost: Building and maintaining models can be incredibly expensive. Companies must work to reduce the cost barrier to get more value out of their data.
At Capital One, we know that world-class AI starts with world-class data. An organization’s analytics, machine learning and AI are only as good as the data that gets fed into them.
Having strong fundamentals and a commitment to a strong data foundation is the baseline for AI readiness. Below, we share principles and best practices around three pillars: managing high-quality data, securing data and keeping data cost effective.
Managing AI-ready data
Trusted data, another way of describing high-quality data, is the first pillar of ensuring data is ready for AI use cases.
Teams first need high-quality metadata: data that provides information about other data. Metadata answers questions around:
- Inventory: Where is the data? Data may live in a data warehouse, on shared drives, on laptops or in S3 buckets. Companies need a solid inventory of everywhere they store data across the enterprise before feeding the data into models.
- Classification: What is the data? Once the location of the data is known, more detail is helpful, including the database, schema, table and column.
- Governance: Are there any policies for this data? The data may be protected health information or personally identifiable information, and there may be regulations around using it. Metadata describes whether any policies govern the data's usage and storage. A minimal sketch of such a catalog entry follows this list.
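To make this concrete, here is a minimal sketch of what a catalog entry capturing inventory, classification and governance metadata could look like. The field names and values are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field

# Hypothetical catalog entry covering the three metadata questions:
# inventory (where), classification (what) and governance (which policies).
@dataclass
class MetadataRecord:
    # Inventory: where the data physically lives
    system: str                 # e.g., "snowflake", "s3", "shared-drive"
    location: str               # e.g., "s3://bucket/path" or "db.schema.table"
    # Classification: what the data is
    database: str = ""
    schema: str = ""
    table: str = ""
    column: str = ""
    sensitivity: str = "unclassified"   # e.g., "PII", "PHI", "public"
    # Governance: policies that apply to the data's usage and storage
    policies: list = field(default_factory=list)

record = MetadataRecord(
    system="snowflake",
    location="analytics.customer.accounts",
    database="analytics", schema="customer", table="accounts", column="ssn",
    sensitivity="PII",
    policies=["tokenize-at-rest", "restrict-to-authorized-roles"],
)
print(record.sensitivity, record.policies)
```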
Once we know where the data is, we want to know how good it is. Data quality assessments can be achieved in different ways. Some options include scanning the data, either manually or automatically, data profiling and data quality tools. Sometimes data comes in a trusted format, such as in the cases of data sets that are purchased or data coming from a master data management system.
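As a simple illustration of an automated check, the sketch below profiles two columns for null and malformed values with pandas. The column names, pattern and thresholds are assumptions for illustration, not a prescribed standard.

```python
from typing import Optional
import pandas as pd

# Toy data quality profile: flag columns whose null rate or invalid-format
# rate exceeds a threshold before the data is fed into a model.
df = pd.DataFrame({
    "account_id": ["A1", "A2", None, "A4"],
    "zip_code": ["23220", "2322", "94105", None],
})

def profile_column(series: pd.Series, pattern: Optional[str] = None) -> dict:
    null_rate = series.isna().mean()
    invalid_rate = 0.0
    if pattern is not None:
        matches = series.dropna().astype(str).str.fullmatch(pattern)
        invalid_rate = 1.0 - matches.mean() if len(matches) else 0.0
    return {"null_rate": null_rate, "invalid_rate": invalid_rate}

checks = {
    "account_id": profile_column(df["account_id"]),
    "zip_code": profile_column(df["zip_code"], pattern=r"\d{5}"),
}
for column, stats in checks.items():
    if stats["null_rate"] > 0.1 or stats["invalid_rate"] > 0.1:
        print(f"Quality issue in {column}: {stats}")
```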
Understanding data at every step of its life cycle, known as data lineage, is critical to understanding the data being fed into a model. Data rarely exits a system in the same form it entered: a team member may ingest the data, transformations then change it, and data quality checks may resolve issues along the way. A data user may choose to pick up the data at many points along that path, and understanding the lineage allows the user to pick the right data for a particular use case.
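As a small sketch of what lineage tracking could record, the example below logs each step's inputs and outputs so a consumer can see how a dataset was produced. The step names and dataset identifiers are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical lineage log: each event records what was read, what was
# produced and when, so a consumer can choose the right point in the pipeline.
@dataclass
class LineageEvent:
    step: str        # e.g., "ingest", "transform", "quality_check"
    inputs: list
    outputs: list
    recorded_at: datetime

lineage = [
    LineageEvent("ingest", ["s3://raw/transactions.csv"],
                 ["staging.transactions"], datetime.now(timezone.utc)),
    LineageEvent("transform", ["staging.transactions"],
                 ["analytics.transactions_clean"], datetime.now(timezone.utc)),
    LineageEvent("quality_check", ["analytics.transactions_clean"],
                 ["analytics.transactions_clean"], datetime.now(timezone.utc)),
]
for event in lineage:
    print(f"{event.step}: {event.inputs} -> {event.outputs}")
```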
Securing data for AI
Training AI models requires vast amounts of data, and without proper security and governance, that data is at risk of exposure during training or inference. Capital One tokenizes sensitive data where technically possible, including LLM training data and RAG datasets, to reduce the risk of exposure and guard against potential data loss.
There are many different methods of protecting data. The first group of approaches permanently alters the data. With deletion, data that is no longer needed is removed from the system; although a viable option for data protection, deletion rules out any further use of that data. Masking permanently converts the data into an unusable format, such as using x's in place of a Social Security number (SSN), so masked data is largely not usable for AI. Lastly, redaction removes part or all of a piece of data. Redaction is common for unstructured data, while the previous two methods typically apply to structured data.
There are also data protection options that are reversible to the original form. Many people are familiar with encryption, which converts sensitive data from cleartext to ciphertext using an encryption algorithm; a key can decrypt the ciphertext back into the original data. Tokenization also converts data into a meaningless form using an algorithm, and it can preserve the data's format.
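As a minimal sketch of reversible protection, the example below encrypts a cleartext value with the `cryptography` library's Fernet recipe and decrypts it with the same key. It illustrates symmetric encryption in general and is not a description of any particular production setup.

```python
from cryptography.fernet import Fernet

# Symmetric encryption: the same key turns cleartext into ciphertext and back.
key = Fernet.generate_key()        # in practice, held in a key management service
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"123-45-6789")   # protected form, safe to store
cleartext = cipher.decrypt(ciphertext)        # reversible only with the key

assert cleartext == b"123-45-6789"
```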
Advantages of tokenization
At Capital One, we leverage tokenization to protect certain sensitive data for a few key reasons. First, tokenization requires less key management than encryption, and it also protects against brute-force key compromise. Tokens can carry embedded metadata that allows for bad token detection and helps with token rotation. Tokens can also preserve data formats, meaning data can be tokenized without modifying the underlying schema or the applications that use the data. Importantly, tokenization also allows the data to be utilized after it has been tokenized, letting organizations gain the fullest use of tokenized data for analytics and AI use cases.
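To make the format-preserving idea concrete, here is a toy sketch that replaces the leading digits of a card number with random digits while keeping the length and the last four digits, with an in-memory lookup table standing in for a real token store. It is purely illustrative and is not how Databolt or any production tokenizer generates tokens.

```python
import secrets

# Toy format-preserving tokenization: digits become random digits of the same
# length, the last four are kept for usability, and a mapping allows reversal.
# Real systems (vaulted or vaultless) use far stronger constructions.
_token_map = {}

def tokenize(card_number: str) -> str:
    if card_number in _token_map:
        return _token_map[card_number]
    body = "".join(secrets.choice("0123456789") for _ in card_number[:-4])
    token = body + card_number[-4:]
    _token_map[card_number] = token
    return token

def detokenize(token: str) -> str:
    originals = {v: k for k, v in _token_map.items()}
    return originals[token]

token = tokenize("4111111111111111")
print(token)                                    # same length and format as the original
assert detokenize(token) == "4111111111111111"
```

Because the token keeps the original format, downstream schemas, queries and model features can continue to work on tokenized values.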
We took our data security knowledge and expertise and launched Capital One Databolt, a vaultless tokenization solution designed to give businesses high security without compromising performance. With Databolt, sensitive information does not leave the customer’s environment and its cloud-native architecture fits flexibly in an organization’s infrastructure. We also built our tokenization solution to scale with enterprises in performance, exhibiting a throughput of up to 4 million tokens per second.
Best practices for securing data
Through Capital One’s data protection journey, we learned best practices for securing data at scale across the organization and throughout multiple use cases.
- Scan and inventory: First, identify sensitive data in your environment and how it is used. Use scanning tools to scan the data, prioritizing human-readable data; a minimal pattern-matching sketch follows this list. Having high-quality metadata is key, because protecting data at scale requires knowing what data exists and where it is located, and tokenizing data at scale requires an inventory of that data.
- Determine policies: Understand the policies that apply to the data. Our industry is highly regulated, so there are many regulations to comply with. Determine which data protection methods should be used. Is there data that should be permanently deleted? Should the data be masked or redacted? What data needs to remain usable?
- Protect data: For data that is still needed, decide how to protect it. One size does not fit all, so require the business to implement the highest level of protection possible. Tokenization, which replaces clearly defined data elements such as a credit card number, is the preferred method for structured data. For unstructured data, encryption becomes the preferred method, since it can protect large volumes of data effectively, such as at the file level.
- Data access control: Leverage native cloud features for robust data access controls, ensuring access to data is restricted to authorized users. Use identity and access management, which allows permissions to be defined granularly at the user, group or role level. Pairing the native functionality of Amazon Web Services' access control features with custom access policies in the tokenization solution makes operations more scalable, performant and secure.
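The sketch below shows the kind of lightweight pattern scan that can seed an inventory of sensitive data, using regular expressions for SSN-like and card-like values. The patterns, directory and file types are assumptions; production scanners use much richer detection than simple regexes.

```python
import re
from pathlib import Path

# Simple regex-based scan for SSN-like and card-like values in human-readable files.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_number": re.compile(r"\b\d{16}\b"),
}

def scan_file(path: Path) -> list:
    findings = []
    text = path.read_text(errors="ignore")
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append({"file": str(path), "type": label, "value": match.group()})
    return findings

# Walk a (hypothetical) directory of exports and build an inventory of hits.
inventory = []
for path in Path("exports").glob("**/*.csv"):
    inventory.extend(scan_file(path))
print(f"Found {len(inventory)} potential sensitive values")
```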
We learned that data protection is never finished. It is a continuous process. At Capital One, data protection involves a continuous commitment to improving our scanning process, optimizing the efficiency of our tools, scanning newly acquired data, and refining our data ingestion and tokenization processes.
Maintaining cost efficiency
To overcome the cost barriers of AI implementations, organizations must also manage the costs of data, building in the right controls and visibility in a fluid and constantly evolving data landscape.
Businesses can benefit from first understanding the total cost of ownership of today's data platform and ecosystem, which spans several categories; a rough roll-up sketch follows the list below.
- Data costs: The costs of data include any tooling in the data management infrastructure and the data preparation and engineering effort that goes into building a high-quality data ecosystem. They also include the storage costs of persisting data in a data platform.
- Compute costs: Most of the cost a data platform incurs is compute. This includes cloud infrastructure costs from running analytics workloads and, as AI workloads pick up, increasingly the AI technology infrastructure that companies use to build AI use cases.
- Model build costs: Depending on the organization's AI strategy, costs accumulate from model training, choosing the right AI models and fine-tuning.
- Model operations costs: Once a model is built, deploying and serving it to customers also incurs costs. This includes inference costs, the price of processing tokens if using a cloud data platform, and making sure the model performs as it should with no drift or bias. Monitoring the model is a continuous process.
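To make the roll-up concrete, here is a minimal sketch that totals hypothetical monthly figures across the four categories above. The numbers and category names are assumptions for illustration, not benchmarks.

```python
# Hypothetical monthly cost roll-up across the four categories above (USD).
monthly_costs = {
    "data": {"tooling": 12_000, "engineering": 40_000, "storage": 8_000},
    "compute": {"analytics_workloads": 55_000, "ai_infrastructure": 30_000},
    "model_build": {"training": 20_000, "fine_tuning": 10_000},
    "model_operations": {"inference": 25_000, "monitoring": 5_000},
}

category_totals = {name: sum(items.values()) for name, items in monthly_costs.items()}
total = sum(category_totals.values())

for name, subtotal in category_totals.items():
    print(f"{name:>16}: ${subtotal:>9,} ({subtotal / total:.0%} of total)")
print(f"{'total':>16}: ${total:>9,}")
```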
Next, there are important ways to maximize the investment. They include governance of the data environment, making certain that controls and boundaries are defined to democratize the use of data; optimized compute, so that computing resources are used with cost efficiency in mind; and monitoring and observability over the entire data ecosystem, to eliminate waste and maintain high data quality.
A well-governed data environment
Governance ensures the right set of controls is in place as an organization scales its data usage and data environment. At Capital One, we scaled the usage of our data platform using a federated ownership model: we want each line of business to own its data while operating within the right controls and guardrails. We also define budgets early on to ensure costs do not balloon or run over budget, and we enforce cost controls through a workflow and chain of authority for approving changes, which is defined and built into the tooling we create.
Hierarchical governance in this manner can enable independence. In our example, a data engineering business organization and a business intelligence business organization each have multiple teams operating in their data environments. Each team owns its compute infrastructure and is responsible for defining and managing budgets. Above the teams in the hierarchy is the organization administrator who acts as the first-level approver and makes sure use cases align and cost levels are correct.
Finally, the data platform team has a view across the organization. The team is responsible for the overall management of the data platform and budgets at the platform level, acting as approvers of any requests that might increase costs or cause the organization to approach its budget.
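A minimal sketch of the kind of budget guardrail this hierarchy implies is below: a requested compute change routes to the organization administrator when it pushes projected spend near the team's budget, and to the platform team when it would exceed it. The roles and thresholds are illustrative assumptions, not Capital One's actual workflow.

```python
# Toy approval routing for a compute change request, based on how much of the
# team's budget the projected monthly spend would consume. Thresholds are illustrative.
def required_approvals(current_spend: float, requested_increase: float, budget: float) -> list:
    projected = current_spend + requested_increase
    approvals = ["team_owner"]
    if projected > 0.8 * budget:
        approvals.append("org_administrator")    # first-level approver
    if projected > budget:
        approvals.append("data_platform_team")   # platform-level budget owner
    return approvals

print(required_approvals(current_spend=70_000, requested_increase=15_000, budget=100_000))
# ['team_owner', 'org_administrator']
```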
At Capital One, we invested early into building this tooling. We took the approval workflow and built it into our tools for any sort of data compute or data infrastructure provisioning. A similar concept is available in Capital One Slingshot, Capital One Software’s data management solution for maximizing a company’s investment in its cloud data platform. Slingshot enables businesses to create workflows and templates for a well-governed data environment. For example, a technical administrator can define templates with guardrails for each of their business units. The templates enable business users to function autonomously without depending on the technical team.
Optimizing compute infrastructure
Compute usage is a primary cost driver in today’s data ecosystem. For businesses that want to optimize their compute infrastructure, making sure the right-sized resources are available at the right time is critical for workloads to run efficiently. There are multiple ways to optimize infrastructure and achieve maximum value.
Data warehouses provide various compute options, or compute-specific configurations, based on the nature of the workload.
- SQL Warehouse Compute is the better option for SQL analytics workloads, such as running queries.
- GPU Compute is better suited for training or fine-tuning AI/ML models.
- Container Services is the better option for running containerized applications.
Data users need to monitor computing resources continuously and choose the right setting for the infrastructure to gain the best value.
Knowing which setting to choose for each workload to gain optimal performance and cost efficiency can be quite challenging. We built Slingshot to continuously monitor workloads and suggest the right settings for an organization's infrastructure. We also introduced scheduling for automatic vertical scaling of warehouses: Slingshot adjusts warehouse parameters based on a user-defined schedule, such as stepping the warehouse down to a smaller size in the evening hours when fewer workloads are running. Slingshot also provides visibility into performance, such as historical performance and the impact a given schedule will have on running queries.
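To illustrate the scheduling concept only, here is a hypothetical time-of-day schedule that resolves to a warehouse size; the windows and size names are assumptions, and this is not the Slingshot API.

```python
from datetime import time

# Hypothetical vertical-scaling schedule: pick a warehouse size by time of day.
SCHEDULE = [
    {"start": time(7, 0),  "end": time(19, 0),  "size": "LARGE"},    # business hours
    {"start": time(19, 0), "end": time(23, 59), "size": "SMALL"},    # evening
    {"start": time(0, 0),  "end": time(7, 0),   "size": "X-SMALL"},  # overnight
]

def target_size(now: time) -> str:
    for window in SCHEDULE:
        if window["start"] <= now < window["end"]:
            return window["size"]
    return "SMALL"   # default when no window matches

print(target_size(time(21, 30)))   # SMALL -> fewer credits burned off-hours
```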
Monitoring and observability
Lastly, a strong data foundation for AI requires continuous tracking of a data ecosystem's health and performance. An overarching monitoring and observability practice ensures strong data quality and keeps workloads running within their defined scope and performance levels.
First, monitoring data pipelines ensures the data in target tables is fresh. A pipeline failure can lead to disruptions such as models receiving incorrect information, which can negatively impact the business. Next, defining data quality checks ensures AI models always receive accurate, timely and high-quality data. Cost visibility by workload allows businesses to calculate the investment in a particular table, workload or business process. Building that visibility across the entire organization is key to ensuring the data platform performs as it should at all times.
With these various monitors in place, proactive alerts notify teams as soon as something starts to go wrong, allowing the appropriate person to correct the issue before it affects business outcomes.
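As a minimal sketch of this kind of monitor, the check below flags a table whose latest load is older than its allowed freshness window and raises an alert. The table names, thresholds and alert hook are assumptions; in practice, the timestamps would come from pipeline metadata.

```python
from datetime import datetime, timedelta, timezone

# Toy freshness monitor: alert when a target table's latest load is older
# than its allowed staleness window.
FRESHNESS_SLA = {
    "analytics.transactions_clean": timedelta(hours=2),
    "analytics.customer_features": timedelta(hours=24),
}

def alert(message: str) -> None:
    # Stand-in for a pager, chat or ticketing integration.
    print(f"ALERT: {message}")

def check_freshness(table: str, last_loaded_at: datetime) -> None:
    age = datetime.now(timezone.utc) - last_loaded_at
    if age > FRESHNESS_SLA[table]:
        alert(f"{table} is stale: last loaded {age} ago (SLA {FRESHNESS_SLA[table]})")

check_freshness("analytics.transactions_clean",
                datetime.now(timezone.utc) - timedelta(hours=5))
```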
Catching critical data pipeline failures and quickly debugging and resolving them will strengthen confidence in an organization’s model outputs.
Key takeaways
Data fuels an enterprise’s AI systems, allowing them to learn, train, grow and improve. Taking steps to manage high-quality, secure data that is cost efficient will build the foundation for trustworthy and effective AI initiatives. There are three key takeaways for building robust data management capabilities that prepare businesses for further innovations:
- Trusted data is the key to good AI. An organization's AI models are only as reliable as the data on which they are trained and built.
- Data security is a continuous process. Protecting sensitive data requires an ongoing dedication to continuously monitor data and adapt to changes in security threats and regulatory requirements.
- Cost governance is not an afterthought. Without guardrails in place and visibility into cost drivers, costs can snowball unpredictably. Establishing governance upfront can help organizations scale confidently into areas like machine learning and AI while staying within the bounds of budgets and policies.