Moving to DynamoDB to increase application resiliency
How moving from PostgreSQL to DynamoDB allowed my development team to automate regional failovers
February 10, 2021
Resiliency is the goal for every application -- no one wants their application to fail -- but actually achieving this goal is a challenge. Building in the cloud provides the capability to increase resilience, but it doesn't happen automatically. Resiliency has to be planned and built into every level of the stack, and one of the most challenging layers to address is the database. Traditional SQL databases run with a single primary instance where all the updates occur. Failure of that single database instance renders most applications inoperable.
Earlier this year, my team started looking at options to increase our application resiliency. Our stack runs in an active-active configuration in both East and West AWS regions, and we've configured our services to fail over automatically. The failover time for most of the stack was nearly instantaneous, but our PostgreSQL relational database required several minutes for the primary instance to switch regions.
Evaluating AWS database platforms - DynamoDB, Aurora Global Database, & Aurora Multi-Master
Our configuration is typical for multi-region SQL database failover: the primary database is in one region with a read replica in another. In the event of a regional failover, the read replica is promoted to primary and database traffic is rerouted to the new primary. It takes ten to fifteen minutes to promote the read replica to primary. Historically this was acceptable for a regional failover, but modern applications are expected to have no downtime.
Our team went down the path of looking for alternatives to our current configuration. We wanted to stay with managed database solutions in AWS to avoid installing and maintaining the database on our own -- that way we could spend most of our time working on new features for our applications, not applying the latest database updates. This means we did not consider non-managed databases that might have met our needs, such as MongoDB or Apache Cassandra, or features not supported by AWS RDS, such as MySQL Cluster.
We started our search within RDS and identified several possibilities. We identified the pros and cons for our use case as follows:
Database Solution Pros and Cons for DynamoDB, Aurora Global Database, & Aurora Multi-Master
Aurora Global Database
The first option we looked at was Aurora Global Database, which was introduced in November 2018. This version of Aurora adds features that reduce replication latency across regions and enable quick recovery from regional outages. According to AWS, it "...provides your application with an effective Recovery Point Objective (RPO) of 1 second and a Recovery Time Objective (RTO) of less than 1 minute, providing a strong foundation for a global business continuity plan."
Aurora Global Database addressed our primary concern, the database failover time. Aurora Global Database can failover between regions in a minute or less. This was an order of magnitude better than the failover times we were dealing with previously. While we were hoping for an active-active solution, this was an early favorite for the team and we even did a proof of concept to test the failover with excellent results.
However, Aurora Global Database had the downside of only being MySQL-compatible at the time, which would require a bit of adjustment in our application, but not a significant rewrite.
NOTE - Since our original investigation, Aurora Global Database has added PostgreSQL support.
Aurora Multi-Master

Our team was excited when we saw the news that Amazon had released a multi-master database for general use in August of 2019, just as we were starting our research on increasing our database resiliency. Amazon had been promising an active-active Aurora database solution for a couple of years, and because of the timing we were able to include Aurora Multi-Master in our research on database alternatives.
Aurora multi-master clusters give the active-active support we were looking for, with multiple database instances running in parallel, all of which can accept writes. This capability increases the resilience of the database, preventing a single instance failure from rendering the application inoperable. However, at the time it only supported active-active within a single region, and multi-region failover is a requirement for our applications.
DynamoDB

The final option we looked at was DynamoDB. While not part of RDS, it is a managed database solution. DynamoDB had the critical feature we were looking for: active-active cross-region support. However, we were concerned about moving to a NoSQL database and the level of effort needed to update the data persistence layer of our application.
DynamoDB has many features making it a compelling choice. Amazon initially developed DynamoDB for internal use by its retail business at a time when the company was growing at an exceptional rate and its traditional relational database was unable to scale with the increase in demand.
According to the AWS home page for DynamoDB, "DynamoDB can handle more than 10 trillion requests per day and can support peaks of more than 20 million requests per second." DynamoDB scales to even the most demanding applications. It has too many features to delve into for purposes of this article but here are some highlights:
- Fully managed NoSQL database
- Infinitely Scalable (for a reasonable definition of infinity)
- High Availability and Durability
- Multi-region active-active support
- Real time data processing
- Integrated Caching
- Point in time backups
- Cost scaling
NOTE - The AWS DynamoDB features page provides a good introduction to the many features available on the platform.
Choosing a database
Our team had many discussions about which database was the best solution, with strong opinions favoring both Aurora Global Database and DynamoDB. While Aurora Multi-Master would be a good choice for some applications, the lack of multi-region support eliminated it as a candidate for our application. Aurora Global Database would drastically reduce the regional failover time and required a lower level of effort to migrate on the application side, but DynamoDB was the best solution for us, as it was the only database with multi-region active-active support.
DynamoDB table design and the multi-region active-active use case
Since multi-region active-active support is the key capability that drove our decision, it's worth discussing in a bit more depth how this works. The feature that enables this functionality is called Global Tables. In the AWS console you enable Global Tables and select a region for replication, which creates a copy of the table in the selected region. The tables are synchronized using the streaming feature of DynamoDB, automatically copying changes from one region to another. The data is eventually consistent, and the time needed to copy the information is known as replication latency. According to the AWS Global Tables documentation, replication latency is around one second. We found this estimate accurate during regular operation, but under heavy load the latency increased. Our focus was on a dual-region configuration, but Global Tables work across multiple regions, keeping them all synchronized.
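Enabling replication can also be done programmatically rather than through the console. As a hedged sketch, here is roughly what adding a replica region looks like with the DynamoDB `UpdateTable` API (Global Tables version 2019.11.21); the table name and regions are illustrative, and the actual boto3 call is left commented out since it requires AWS credentials:

```python
def build_replica_update(table_name: str, new_region: str) -> dict:
    """Build UpdateTable parameters that add a replica of an existing
    table in another region, turning it into a Global Table."""
    return {
        "TableName": table_name,
        "ReplicaUpdates": [
            {"Create": {"RegionName": new_region}},
        ],
    }

# Illustrative table name and region -- not from the article.
params = build_replica_update("orders", "us-west-2")

# In a real setup you would pass these to boto3:
# client = boto3.client("dynamodb", region_name="us-east-1")
# client.update_table(**params)
```

Once the replica exists, writes in either region are streamed to the other automatically; the application never issues cross-region calls itself.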
NoSQL data modeling is very different from the familiar relational designs used by many developers. For example, in NoSQL designs an application typically has a single table, and the practice of normalizing your data into multiple tables is tossed out the window. The goal of NoSQL design is to base the table structure on the access patterns rather than inherit the domain structure. That doesn't mean you can dismiss your domain model; you still need to understand the elements of your data and how they relate. However, all the data for an entity is usually contained in a single entry. This breaks many of the tenets of relational design, duplicating data and storing multiple related entities inside the same record.

Because you are storing disparate entity types within a single table, fixed prefixes for entity types are common, and a secondary index may contain different types of data depending on the entity type. For instance, if we are storing triangles, squares, and polygons in our database, the primary index may use a prefix for the shape type and id, such as tri#id, squ#id, or poly#id, and the secondary index may be the type for triangles, the area for squares, and the number of sides for polygons. Modeling the data from an access point of view can take a while to grow accustomed to, but the access performance is more efficient than SQL queries.
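The shapes example above can be sketched in a few lines. This is my own illustration of the prefixed-key pattern, not the article's actual schema; the function names and attribute names are invented:

```python
# Map each entity type to its fixed key prefix, as in tri#id / squ#id / poly#id.
PREFIXES = {"triangle": "tri", "square": "squ", "polygon": "poly"}

def make_pk(shape_type: str, shape_id: str) -> str:
    """Build the prefixed partition key for a shape."""
    return f"{PREFIXES[shape_type]}#{shape_id}"

def make_item(shape_type: str, shape_id: str, **attrs) -> dict:
    """All data for an entity lives in one item; the secondary-index
    attribute differs by entity type (type, area, or number of sides)."""
    return {"pk": make_pk(shape_type, shape_id), **attrs}

square = make_item("square", "42", area=9.0)
polygon = make_item("polygon", "7", sides=6)
```

Here `square["pk"]` is `"squ#42"`, so a single table holds every shape while the prefix keeps the entity types distinguishable in queries.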
Despite the numerous benefits of DynamoDB, it is not the best solution for all situations. NoSQL databases are suitable for processing large numbers of transactions but are usually a poor choice for reporting databases. The data is structured around transaction processing, leading to poor performance for reporting, especially ad-hoc queries. If you need high-performance processing and reporting, you will likely need to move the data into a traditional relational database or data warehousing solution. Common approaches to data migration are using the streaming feature to move data into another database in real-time or scheduled data transfers if real-time reporting is not needed.
Applying DynamoDB to our application
The data layer replacement for the application wasn't as extensive as predicted. However, we encountered challenges during the transition. The first issue we ran into is that DynamoDB doesn't support nulls or blank values. There are a couple of approaches to dealing with this issue: you can exclude the properties from your data, or you can use a structure to indicate a null value. We went with the null structure approach. We also had an issue with dates, because there is no date or time datatype in DynamoDB. The dates were converted to an ISO UTC format (i.e., 2020-09-18T23:35:46Z) and stored as strings. This format is human-readable and is both sortable and searchable, which facilitates the date range queries used by our services.
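A minimal sketch of both conversions might look like the following. The `NULL_MARKER` structure and the function name are hypothetical stand-ins for whatever convention a team actually adopts:

```python
from datetime import datetime, timezone

# Hypothetical marker structure standing in for a null value.
NULL_MARKER = {"is_null": True}

def to_storable(value):
    """Prepare a Python value for storage: substitute the marker for
    None, and render datetimes as sortable ISO 8601 UTC strings."""
    if value is None:
        return NULL_MARKER
    if isinstance(value, datetime):
        return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return value

ts = datetime(2020, 9, 18, 23, 35, 46, tzinfo=timezone.utc)
to_storable(ts)    # "2020-09-18T23:35:46Z"
to_storable(None)  # {"is_null": True}
```

Because the ISO strings sort lexicographically in chronological order, they work directly as a sort key for date range conditions.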
Even after addressing these limitations, we had to adjust our table structure a couple of times when testing revealed our original design wasn't as efficient as expected. In one of our tables we needed to find all entries that matched two data fields over a time range, so we created a composite primary key combining both of the data fields, with a timestamp as the sort key. We also made some missteps due to a lack of familiarity with the platform; for instance, one of our services used a scan with a filter instead of a query, which is much more efficient. After a few iterations of testing and revisions to our queries and table structure, we minimized the database reads, which both increases performance and reduces costs. One of the nice features of DynamoDB is that you can set the ReturnConsumedCapacity property in a query to get back the number of read capacity units consumed. This is incredibly helpful when you are working to optimize your queries.
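Putting those pieces together, the time-range query described above might be built like this. The table and attribute names (`pk`, `sk`, the `fieldA#fieldB` composite) are illustrative, not our real schema; the parameter dict matches the shape expected by the low-level DynamoDB `Query` API:

```python
def build_range_query(table: str, field_a: str, field_b: str,
                      start_iso: str, end_iso: str) -> dict:
    """Build Query parameters: composite partition key from two data
    fields, ISO timestamp sort key, consumed capacity reported back."""
    return {
        "TableName": table,
        "KeyConditionExpression": "pk = :pk AND sk BETWEEN :start AND :end",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"{field_a}#{field_b}"},      # composite key
            ":start": {"S": start_iso},
            ":end": {"S": end_iso},
        },
        "ReturnConsumedCapacity": "TOTAL",  # read capacity units in response
    }

q = build_range_query("events", "svcA", "typeB",
                      "2020-09-01T00:00:00Z", "2020-09-30T23:59:59Z")
```

Because the condition is expressed on the key, DynamoDB can answer it as a query rather than a scan, and the `ConsumedCapacity` field in the response shows exactly what each iteration of the design costs.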
To reduce the risk of making such a significant change to our application architecture, we created an implementation using both databases to test DynamoDB without fully switching. In dual database mode, the application reads and writes to both DynamoDB and PostgreSQL, but only returns the PostgreSQL data to the user. This configuration reduces risk because the original database is still in place while testing the new database implementation in production. After several weeks of running in production with no errors, we turned off PostgreSQL, switching the application entirely to DynamoDB.
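The dual-database mode can be sketched with a small wrapper. This is a minimal illustration using in-memory dicts as stand-ins for the PostgreSQL and DynamoDB persistence layers; the class and method names are my own, not the application's actual code:

```python
class DualWriteRepository:
    """Write to both stores, but only ever return the primary's data."""

    def __init__(self, primary: dict, shadow: dict):
        self.primary = primary  # existing PostgreSQL-backed store
        self.shadow = shadow    # new DynamoDB-backed store under test

    def save(self, key, value):
        self.primary[key] = value
        try:
            self.shadow[key] = value
        except Exception:
            pass  # a shadow failure must never break the user request

    def load(self, key):
        _ = self.shadow.get(key)       # exercise the new read path
        return self.primary.get(key)   # but serve the trusted data

repo = DualWriteRepository(primary={}, shadow={})
repo.save("user:1", {"name": "Ada"})
repo.load("user:1")  # returns the primary (PostgreSQL) copy
```

Comparing the shadow store's contents against the primary over time is what builds the confidence to cut over entirely.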
There are many database options available in AWS. There are nearly unlimited choices if you want to implement the database yourself, but managed solutions are a popular choice, especially among development teams who want low-maintenance solutions. Going with a managed database reduces the number of databases you have to choose from, but AWS is continuously innovating on its offerings, increasing the resiliency and features of the managed solutions it provides.
Our team narrowed our selection process down to three AWS database solutions: DynamoDB, Aurora Global Database, & Aurora Multi-Master. Aurora Multi-Master provides a highly resilient database for applications that want SQL support and don't need cross-region write capability. Aurora Global Database is good for applications that need to support cross-region reads with low-latency updates and the ability to quickly fail over between regions. DynamoDB provides cross-region active-active capabilities with high performance, but you lose some of the data access flexibility that comes with SQL-based databases.
Migrating to DynamoDB reduced our application failover time by 99%. The process and scripting for regional failovers on the data layer were eliminated. There are no failovers for the database because the application writes to the database in its own region and AWS replicates the data, keeping the regions in sync. We maintained the same or better query performance, though some improvements were due to a lack of optimizations in the original database design. Despite a few hiccups along the way, the migration was well worth the effort, increasing the speed and reducing the complexity of failovers.