
Enhancing cloud performance with automated PostgreSQL partitioning

One of Capital One’s core principles is to be strategically bold: we challenge the status quo and constantly push for improvement and innovation. One of the company’s key initiatives is to leverage the cloud to offload infrastructure cost and maintenance and to enhance performance. We also operate under stringent service-level agreements (SLAs), which we must meet to achieve our business objectives and stay proactive in preventing fraud.

We are an enterprise team at Capital One, and our goal is to provide the functionality that APIs and models need, within tight SLAs, so that key decisions can be made to prevent fraud. Our team recently moved our implementation to Aurora PostgreSQL to meet these objectives, and we are actively experimenting with the pg_partman extension. Partitioning is not a new concept in the database world; what makes this extension so powerful is that it automates some of the otherwise tedious processes that come with partitioning.

How partitioning improves query performance

Partitioning allows you to improve query performance by sectioning data by a specific range or interval so the query doesn’t have to read the entire table. You can also archive or delete old data on an ongoing basis to reduce the cost and volume of data.

Improving performance with partition pruning

Partition pruning can provide major performance improvements to data processing, especially for larger tables. Query performance improves dramatically when the heavily accessed data all resides within a single partition. Furthermore, if a large percentage of a particular partition is accessed, a sequential scan can be used instead of an index, which would require random-access reads across the table. Partitioning can also make purging or archiving data much simpler, because you can archive and purge the data within a partition as a unit. You can also move seldom-used data to cheaper or slower storage media to save costs.
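
To see pruning in action, compare query plans with and without a predicate on the partition key. Below is a minimal sketch using a hypothetical events table (the names are illustrative, not from our implementation):

    CREATE TABLE events (
        event_id   INT NOT NULL,
        event_date DATE NOT NULL
    ) PARTITION BY RANGE (event_date);

    CREATE TABLE events_2023_01 PARTITION OF events
        FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
    CREATE TABLE events_2023_02 PARTITION OF events
        FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');

    -- With a predicate on the partition key, the plan scans only
    -- events_2023_02; events_2023_01 is pruned away.
    EXPLAIN SELECT * FROM events WHERE event_date >= '2023-02-10';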

Partitioning, the old way

Prior to Postgres 10, and in many other database environments, partitioning was a manual process built on table inheritance and check conditions. Typically, you had to create a function that created each child partition and bound it to the parent, create indexes on the child partitions, use check constraints to ensure that data ended up in the right partitions, and finally create a trigger to invoke the routing function on insert. Deleting or archiving a partition required another function, and potentially another trigger if you were inserting rows for prior dates that you didn’t want to maintain.
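
For comparison, here is a condensed sketch of that older pattern (table and trigger names are illustrative; note that the pre-Postgres 10 syntax uses EXECUTE PROCEDURE):

    -- Parent table; children inherit from it
    CREATE TABLE measurements (
        m_id   INT NOT NULL,
        m_date DATE NOT NULL
    );

    -- Child partition bound to the parent with a CHECK constraint
    CREATE TABLE measurements_2023_01 (
        CHECK (m_date >= '2023-01-01' AND m_date < '2023-02-01')
    ) INHERITS (measurements);
    CREATE INDEX ON measurements_2023_01 (m_date);

    -- Trigger function that routes inserts to the right child
    CREATE OR REPLACE FUNCTION measurements_insert_trigger()
    RETURNS TRIGGER AS $$
    BEGIN
        IF NEW.m_date >= '2023-01-01' AND NEW.m_date < '2023-02-01' THEN
            INSERT INTO measurements_2023_01 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'No partition exists for date %', NEW.m_date;
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER insert_measurement
        BEFORE INSERT ON measurements
        FOR EACH ROW EXECUTE PROCEDURE measurements_insert_trigger();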

How to automate partitioning

The pg_partman extension automates this entire process with minimal code and effort. It doesn’t use triggers at all, which by itself improves performance. Unlike manual partitioning, it creates the partitions for you automatically, and you can pre-make additional partitions with one line of code. pg_partman provides functions for creating, maintaining, and destroying partitions, along with scripts for tasks such as moving data from the default partition to the appropriate partition.

These functions handle creating a partition on the fly if one doesn’t exist, deleting and undoing partitions, and running vacuum and other maintenance tasks. You no longer have to write a script to identify old data to purge and then a separate retention script. With pg_partman, it’s as easy as updating a flag in one of its catalog tables to either detach or drop partitions older than a given retention period. Since all of this complexity is abstracted away, the extension is very user-friendly. For our use case, we utilized parent partitioning and child subpartitioning. Below are the steps to enable the extension and start using it within minutes.

Add pg_partman to shared_preload_libraries (a setup sketch follows the steps below)

  1. CREATE EXTENSION pg_partman SCHEMA partman;

  2. CREATE partitioned table

  3. CREATE partitions

  4. CREATE subpartitions

  5. Delete/detach old partitions and specify retention period

  6. CREATE new subpartitions 
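
Here is a minimal sketch of the prerequisite setup. On Aurora, shared_preload_libraries is changed through the DB cluster parameter group rather than ALTER SYSTEM; pg_partman_bgw is the extension’s optional background worker:

    -- Self-managed PostgreSQL only; Aurora uses a cluster parameter group
    ALTER SYSTEM SET shared_preload_libraries = 'pg_partman_bgw';
    -- ...restart the server, then:
    CREATE SCHEMA IF NOT EXISTS partman;
    CREATE EXTENSION pg_partman SCHEMA partman;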

Sample test code:

    CREATE TABLE IF NOT EXISTS table1
    (
        x_id             INT NOT NULL,
        start_timestamp  timestamptz NOT NULL,
        sub_partition_id INT NOT NULL,
        CONSTRAINT PKEY_startts_subpartid PRIMARY KEY (start_timestamp, sub_partition_id)
    ) PARTITION BY RANGE (start_timestamp);

    CREATE INDEX ON table1 USING btree (x_id);

Note that there are limitations on what primary key or unique index you can use with partitioning: the columns you partition on must be part of the primary key.
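
For example, a definition like the following would be rejected, because the partition key column is missing from the primary key (a hypothetical sketch; the error text is paraphrased):

    CREATE TABLE bad_table (
        x_id            INT PRIMARY KEY,
        start_timestamp timestamptz NOT NULL
    ) PARTITION BY RANGE (start_timestamp);
    -- ERROR (paraphrased): the PRIMARY KEY constraint lacks column
    -- "start_timestamp", which is part of the partition key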

    SELECT partman.create_parent(
        p_parent_table    => 'table1',
        p_control         => 'start_timestamp',
        p_type            => 'native',
        p_interval        => 'daily',
        p_premake         => 1,
        p_start_partition => ((current_date - (100 || ' DAY')::INTERVAL)::date)::text
    );

1. The code above creates daily partitions on the start_timestamp column, starting from the p_start_partition date all the way to the current date. In addition, p_premake creates partitions ahead of time. The call also creates a default partition. Note that rows that land in the default partition are not automatically relocated to the proper partition; they stay there until you move them out (see step 4).
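
To verify what was created, you can list the child partitions straight from the system catalogs (a quick sketch; pg_partman’s show_partitions function provides similar output):

    -- List the child partitions of table1
    SELECT inhrelid::regclass AS partition_name
    FROM pg_inherits
    WHERE inhparent = 'table1'::regclass
    ORDER BY 1;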

    SELECT partman.create_sub_parent('table1', 'sub_partition_id', 'native', '1',
        p_native_check := 'yes', p_premake := 10);

2. This select statement creates subpartitions within each parent partition. In this case, it uses the sub_partition_id column to create subpartitions within each daily start_timestamp partition. We made this an interval-based partition set, so it creates subpartitions for IDs 1, 2, 3 … [max_sub_partition_id]. As with the parent, it premakes 10 subpartitions within each parent partition, and each subpartition level also gets its own default partition.
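
On PostgreSQL 12 and later, the whole two-level layout can be inspected with pg_partition_tree, which makes it easy to confirm that the subpartitions exist (a quick verification sketch):

    -- Show the parent, subparents, and leaf subpartitions in one view
    SELECT relid, parentrelid, isleaf, level
    FROM pg_partition_tree('table1');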

3. Now that the partitions and subpartitions are created, note that the indexes are automatically propagated to the child partitions. Since data is not automatically moved out of the default partition, inserting rows for a future date can cause an error later: the rows will sit in the default partition, and the partition covering that date cannot hold them at the same time. To rectify this, pg_partman provides a series of functions that you can schedule to detect and fix the situation.

The check_default function can be used to check whether there are records in the default partitions.

    SELECT * FROM partman.check_default(p_exact_count := true);

4. The partition_data_proc procedure creates a new partition if needed and moves the data from the default partition into the appropriate partition.

    CALL partman.partition_data_proc(p_parent_table => 'table1');

5. Updating the following catalog table sets the retention period to 93 days. Any data older than 93 days will be deleted, since retention_keep_table is set to false. If you want to detach old partitions instead of dropping them, set retention_keep_table to true.

    UPDATE partman.part_config
    SET infinite_time_partitions = true,
        retention = '93 days',
        retention_keep_table = false
    WHERE parent_table = 'table1';

6. The following call creates new partitions and enforces the retention rules specified in the partman.part_config table. run_maintenance can be scheduled to execute at a regular interval using a cron job.

    SELECT partman.run_maintenance('table1', p_jobmon := true, p_analyze := true);
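
If the pg_cron extension is available (it is supported on Aurora PostgreSQL), the maintenance call can be scheduled inside the database itself. A sketch, assuming pg_cron is installed; the job name is arbitrary:

    -- Run pg_partman maintenance at the top of every hour
    SELECT cron.schedule('partman-maintenance', '0 * * * *',
        $$SELECT partman.run_maintenance()$$);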

Fast deployment, reduced maintenance and automation are all possible with pg_partman

In summary, pg_partman is a very powerful extension that lets you create partitions quickly and with minimal coding. Since maintenance tasks and partition creation are abstracted away from the user, upkeep becomes easier, less time-consuming, and less error-prone. It enables faster deployment and helps us meet business SLAs in a timely manner.

pg_partman provides a broad range of other functions that automate tedious tasks as well, such as deleting partitions or filling data gaps within subpartitions. For a complete list, please refer to the pg_partman documentation on GitHub.


Vishi Cline, Senior Data Engineer

Vishi completed her bachelor’s degree in computer science from North Carolina State University and her master’s degree in data science from Southern Methodist University. She is an AWS architecture certified professional with 16+ years of experience in database development, analytics, and building ETL pipelines. At Capital One, she is responsible for leading table design and architecture, providing optimizations, and building data pipelines that allow businesses to meet their SLAs and mitigate fraud.
