Capital One’s Generative AI models are developed by customizing open-source Large Language Models (LLMs) with a combination of synthetic data and publicly available open-source data. The datasets are either owned by Capital One or publicly available, licensed for use as open-source datasets, and downloadable from public repositories. Capital One began collecting the datasets and using them for development in 2024, and dataset collection is ongoing. Capital One may release future models and model updates with newly collected data.
The datasets contain over 20 trillion tokens of text, including multi-turn conversations, instruction-tuning data for mathematical reasoning, code, web content, and over 7,100 unique synthetically generated conversations, representing over 495,000 individual task examples. The datasets were curated to: aid in generalizing conversational abilities; ensure flexibility in responding to different user styles and needs; prevent overspecialization; promote adaptability to new, unforeseen tasks; prevent overfitting to redundant web content; help instill instruction-following capabilities early; help convey the nuances of human dialogue; boost mathematical and quantitative reasoning skills; improve the handling of “small talk” and off-topic queries; support user trust and alignment; and handle complex business rules by contrasting preferred and incorrect responses.
The datasets include both data from the public domain and data subject to intellectual property rights. The datasets were either synthetically generated by Capital One or licensed for use from a developer or curator. Some of the datasets used may include personal information and aggregate consumer information, as defined in California Civil Code Section 1798.140. Capital One leverages a variety of data processing techniques to improve model performance.
Capital One is committed to your privacy. Visit capitalone.com/privacy/ to learn more. Visit capitalone.com/tech/ai/ to learn more about AI at Capital One.