PostgreSQL partitioning: 3 strategies Capital One leverages

by Mark Bentivegna, Lead Data Engineer, Consumer Identity

Leveraging different temporal partitioning techniques for PostgreSQL

Making the common case fast is one of the guiding principles of computer architecture. This idea guides us at Capital One as we strive to change banking for good and provide the best possible user experience for our millions of customers.

I am a lead data engineer responsible for the data ingestion processes covering all forms of our customers’ data within Capital One’s Consumer Identity platform. I deal with a very large volume of data on a daily basis, so it is critical that our database tables are fine-tuned for optimal performance. We rely heavily on Postgres because its partitioning and indexing capabilities enable faster query response times than non-relational database products.

At Capital One we use a temporal partitioning strategy with an emphasis on fast response times for our customers’ most recent data. Within our data platform, our partitioning strategies are designed to return recent customer data within milliseconds (99.9% of the time). We partition our relational databases across time intervals, with an emphasis on the most recently stored data, to make the common case fast. In this article, we will discuss three strategies we use to achieve this goal.

Partitioning splits a large database table into logically smaller groupings for faster lookups. Aggregating data within a single partition is much faster than scanning an entire table for index values. By splitting a large table into smaller temporal groups, we can run queries against only the relevant subsets of data, which helps us deliver customer data within milliseconds instead of querying entire tables. Indexing by our customers’ unique identifiers lets us check several months’ worth of recent activity while scanning only the partitions for days on which a customer was active and skipping those where they were not.
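As a sketch of how this plays out in practice, here is a hypothetical daily-partitioned events table (all table and column names are illustrative, not Capital One’s actual schema). Because the query bounds the timestamp column, the planner prunes every partition outside the range:

```sql
-- Hypothetical schema: daily range partitions on an event timestamp.
CREATE TABLE customer_events (
    customer_id bigint      NOT NULL,
    event_type  text        NOT NULL,
    created_at  timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE customer_events_2024_01_01 PARTITION OF customer_events
    FOR VALUES FROM ('2024-01-01 00:00+00') TO ('2024-01-02 00:00+00');

-- Index on the customer identifier; created on the parent, it is
-- cascaded to each partition.
CREATE INDEX ON customer_events (customer_id);

-- The WHERE clause bounds created_at, so partition pruning limits the
-- scan to the single matching partition (visible via EXPLAIN).
EXPLAIN SELECT count(*)
FROM customer_events
WHERE customer_id = 42
  AND created_at >= '2024-01-01 00:00+00'
  AND created_at <  '2024-01-02 00:00+00';
```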

1. Timestamptz column partitioning

Our primary strategy is the simplest: partition by timestamp with a specified time zone. Partitioning by timestamp enables faster lookups by date, which is especially useful when aggregating recent events.

Differing time zones introduce complexities when recording events, so we always use Postgres’ “timestamptz” column type. When using a timestamp as a partition key or index, it is never advisable to omit time zone information or to mix time zones. For example, the state of Arizona is in Mountain Time but does not observe Daylight Saving Time. We record everything in UTC; if we did not, customers with activity based in Arizona would find their data logged incorrectly twice annually, when the rest of Mountain Time shifts between standard and daylight time. We have encountered this problem ourselves, and it is quite common. The clear solution is to carry time zone information on key columns so that there is no confusion or overlap across dates and times.
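A quick illustration of why the column type matters (values are illustrative):

```sql
-- timestamptz normalizes every input to a single instant in time,
-- conventionally displayed in UTC here.
SET timezone = 'UTC';

-- Arizona stays on MST (UTC-7) year-round, so this instant is stored
-- unambiguously as 08:30 UTC:
SELECT '2024-11-03 01:30:00 America/Phoenix'::timestamptz;
-- → 2024-11-03 08:30:00+00

-- A plain "timestamp" column would store the bare 01:30 with no zone,
-- which is ambiguous on the morning clocks fall back elsewhere.
```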

Partitioning by timestamp has another benefit: roughly equally sized partitions. Our customers are much more likely to be active around the holidays, making deposits and buying presents for friends and family, so we apply a hash function to timestamps to keep the volume of customer data nearly identical across groupings in our database and avoid skewed performance on certain days. As a result, query execution times for our relational databases do not vary significantly throughout the year.
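The article does not show the exact scheme, but a minimal sketch of hash partitioning in Postgres (table and partition names are illustrative) looks like this:

```sql
-- Hash-partitioned table: rows are spread across four buckets by the
-- hash of the timestamp, which evens out per-partition data volume.
CREATE TABLE event_log (
    customer_id bigint      NOT NULL,
    created_at  timestamptz NOT NULL
) PARTITION BY HASH (created_at);

CREATE TABLE event_log_p0 PARTITION OF event_log
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE event_log_p1 PARTITION OF event_log
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE event_log_p2 PARTITION OF event_log
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE event_log_p3 PARTITION OF event_log
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);
```

Note that hashing alone trades away date-range pruning, so in practice it is typically combined with range partitioning at another level; the article does not specify the exact combination used.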

2. Detaching partitions when deleting old data

Customers rarely access data more than 90 days old. As a result, Capital One’s data consumption team generally archives data older than this from our optimized lookup tables. Every day we need to archive several terabytes’ worth of data, which is a very costly operation. To streamline the removal of old data without hindering performance, we detach partitions rather than running costly deletes on live tables. Bulk deletes leave behind dead tuples, and reclaiming that space with a full vacuum takes an exclusive lock on the table, preventing reads and writes. The resulting downtime on our databases would negatively impact the customer experience and must be avoided.

Seldom-accessed data can be migrated out of performance-critical databases, and once this migration takes place, we can detach the corresponding partitions. On smaller tables, the performance difference between vacuuming and detaching partitions is negligible, but our data platform operates at a very large scale, so this improvement is not only noticeable but critical for providing a positive customer experience.

Within Capital One’s data consumption team we used to delete our partitions’ rows after a specified time period, but deleting in place introduced the performance problems described above. Detaching a partition instead simply removes it from the active table, making it a much less costly operation.
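A minimal sketch of the detach-then-archive pattern (table and partition names are illustrative):

```sql
-- Detach the expired daily partition from the live table. With
-- CONCURRENTLY (Postgres 14+), this avoids holding a long exclusive
-- lock on the parent table while the detach completes.
ALTER TABLE customer_events
    DETACH PARTITION customer_events_2024_01_01 CONCURRENTLY;

-- The detached partition is now a standalone table: archive it
-- (e.g. dump it to cold storage), then drop it.
DROP TABLE customer_events_2024_01_01;
```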

3. NOT vacuuming partitions

Vacuuming is a very effective technique for cleaning up dead tuples in a database, reclaiming space and improving performance. When rows are overwritten, the old tuples still exist in the table; they are simply invalidated. A table subject to frequent updates can bloat in size, which degrades query performance.

It might come as a surprise, then, to learn that we make a point of avoiding this practice altogether within my team’s data platform. Because our consumers most often require recent data, we prioritize the most recent 90 days’ worth of data for fast lookups and detach partitions for archival. A full vacuum is very costly: it places an exclusive lock on the table, so reads and writes can’t go through while it runs. Ultimately, removing the bloat caused by dead tuples is so expensive that we simply leave them in; since whole partitions are detached and dropped within 90 days anyway, the bloat never accumulates for long. Doing otherwise would hinder our ability to provide a seamless customer experience.
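The article does not show how this is configured, but in Postgres, autovacuum can be switched off per table with a storage parameter (table name illustrative):

```sql
-- Disable autovacuum on a short-lived partition: the partition will be
-- detached and dropped before bloat matters, so vacuum work is wasted.
ALTER TABLE customer_events_2024_01_01
    SET (autovacuum_enabled = false);
```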

Providing the best customer experience with intuitive partitioning techniques

In summary, Capital One leverages temporal partitioning techniques to provide the best possible customer experience and make the common case fast. Some of these techniques are fairly intuitive, like partitioning by timestamp with time zone, and some sound counterintuitive, like not vacuuming our tables. Overall, temporal partitioning and making the common case fast are fundamental strategies for ensuring that Capital One customers can access our products as seamlessly as possible.

Mark Bentivegna, Lead Data Engineer

Mark Bentivegna is a Lead Data Engineer at Capital One. He completed his bachelor's degree in computer engineering at Penn State University and master's degree in data science at New Jersey Institute of Technology. He is currently pursuing his PhD in computer engineering at George Washington University and is performing research in the fields of big data and graph theory.
