Architecting for resiliency during cloud migration

Do we need it vs. how much of it do we need?

Steven Dang

September 17, 2018

If you’ve worked with software in the last five years, you’ve undoubtedly heard of, or worked with, one or more cloud services like Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, IBM Bluemix, or Alibaba Cloud. For many projects, migrating to the cloud gives teams a chance to revamp their architecture and streamline their codebase (maybe avoid having to import seven different date-time libraries this time!). However, the number of considerations to take into account when migrating to the cloud can get overwhelming, and some are more significant than others.

High on your team’s list of discussion topics should be resiliency. The first question to ask isn’t “Do we need it?” but “How much of it do we need?”

Whether your app reports data to 50 people or streams data to 125 million subscribers, availability deserves consideration in projects of any size.

High-level considerations

Building an app that is available all the time isn’t as easy as making scrambled eggs; it’s more of a quiche. You’ll need more than just the basics, but you’re rewarded for the extra work put into it. Architecting for high resiliency can be:

Complex
Expensive
Time-consuming
Imperative

Every project has unique use cases and requirements, so each will need its own level of availability. The three primary levels we’ll discuss here, at a high level, are:

Active
Active-passive
Active-active

Active

Anyone who knows me knows that I love cooking and that I love food. Korean barbecue, tiramisu, SPAM musubi, bolognese, and garlic bread! Often when I cook at home, I’m just cooking for two people. I probably have two stove burners blazing and my oven cranked up in my cramped 100-square-foot kitchen. It’s small, and if something goes down (or is covered in dirty dishes), I have to stop cooking and deal with the problem, but I wouldn’t have it any other way.

Active availability, like the kitchen in my apartment, is the bare minimum for deploying your stack. With one running instance of each piece of your stack, you can have basic functionality. However, if any server in your stack happens to go down due to connectivity issues, region failure, etc., the rest of your stack follows.
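
To put a rough number on that, here is a minimal sketch of how availability compounds when every piece of the stack must be up at once. The component names and the 99.9% figures are hypothetical, chosen only for illustration, and the calculation assumes components fail independently of one another.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a stack in which every component must be up."""
    return prod(component_availabilities)


# Hypothetical per-component availabilities, for illustration only.
stack = {"web": 0.999, "api": 0.999, "database": 0.999}

overall = serial_availability(stack.values())
hours_down_per_year = (1 - overall) * 24 * 365

print(f"overall availability: {overall:.4%}")
print(f"expected downtime: {hours_down_per_year:.1f} hours/year")
```

Under these assumptions, the stack as a whole is less available than any single component, because a failure anywhere takes everything down.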

This bare-bones approach to deploying your app is ideal for projects with minimal impact on vital business operations. Service outages do not need to be remediated immediately, and the cost of hosting the application is kept to a minimum. Because the implementation pattern is simple, the app is less complex to build, which reduces the time engineers need to build it.

Active-passive

My grandma has always had two freezers. When I asked her why, she told me that when the family gathers for Thanksgiving and Christmas, she needs to store a lot more food. Other than that, the second freezer often sits empty and unplugged the rest of the year.

This is the essence of active-passive availability: having already-configured server instances ready at your disposal when you need them. With active-passive, health checks ensure that both the passive server and the active server are alive and working. When the primary server goes down, traffic is routed to the passive server (which becomes the “active” one). In this design pattern, only the active server instance accepts traffic. This type of failover is common with read replicas in databases.

Active-passive example architecture.

In the first state, all services in Region A are up and running. When a failure happens in Region A, we move to the second state: our services are now running in Region B. Once Region A has returned to normal, we move to the third state.

In this example, we add load balancers and failover policies to the mix, which increases complexity. When instances running in Region A fail, traffic is automatically routed to Region B. Running backup instances in different geographic regions adds a layer of protection against region-wide failures such as weather-related outages.
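
The failover behavior described above can be sketched in a few lines. This is a toy model, not a real load balancer: the Instance and ActivePassiveRouter names are hypothetical, and a production setup would rely on your cloud provider's health checks and routing policies rather than an in-process flag.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    healthy: bool = True


class ActivePassiveRouter:
    """Sends all traffic to one active instance; promotes the standby on failure."""

    def __init__(self, primary, standby):
        self.active = primary
        self.passive = standby

    def health_check(self):
        # Fail over only if the active instance is down and the standby is up.
        if not self.active.healthy and self.passive.healthy:
            self.active, self.passive = self.passive, self.active

    def route(self, request):
        self.health_check()
        if not self.active.healthy:
            raise RuntimeError("no healthy instance available")
        return f"{self.active.name} handled {request}"


region_a = Instance("region-a")
region_b = Instance("region-b")
router = ActivePassiveRouter(primary=region_a, standby=region_b)

first = router.route("GET /orders")    # served by region-a
region_a.healthy = False               # simulate a failure in Region A
second = router.route("GET /orders")   # served by region-b after failover
```

Note that this sketch does not switch back on its own when Region A recovers. Whether to fail back automatically or manually is a design decision the sketch leaves open.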

Having additional instances on standby can easily multiply the cost and effort of maintaining your servers. With failover, there is potential for data loss when services fail and traffic needs to be rerouted. Depending on how your passive instances are configured, there is also the risk of operational downtime.

With active-passive resiliency, business-critical applications can continue running as soon as your load balancer detects an unhealthy instance and reroutes traffic. By having passive instances running your application, you can drastically reduce your recovery time objective (RTO) and recovery point objective (RPO) without introducing too many complex design decisions in your services.
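
As a back-of-the-envelope illustration, recovery time under failover can be estimated from how quickly the health check notices a problem plus how long the switch takes. The function and every number below are assumptions made for the sake of the example, not measurements or recommendations.

```python
def estimate_recovery_seconds(check_interval_s, failures_before_unhealthy, switchover_s):
    """Rough upper bound on the time from failure until the standby serves traffic."""
    # The health check must fail several times in a row before failover starts.
    detection_s = check_interval_s * failures_before_unhealthy
    return detection_s + switchover_s


# Hypothetical settings: check every 10 s, fail over after 3 misses,
# and allow 30 s for traffic to shift to the standby.
recovery_s = estimate_recovery_seconds(
    check_interval_s=10,
    failures_before_unhealthy=3,
    switchover_s=30,
)
print(f"estimated recovery time: {recovery_s} seconds")
```

A similar estimate for data loss depends on how far the passive instance lags behind the active one, which varies with your replication setup.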

Active-active

Thanksgiving at our household is always a treat. Five or six people cook for a hungry army of 20–30 family members every Thanksgiving morning: deviled eggs boiling, the turkey roasting, and my favorite, mashed potatoes mashing. Everyone usually brings a bunch of sides for the day’s grandiose meal, with my grandma acting as the de facto maestro in coordinating everyone’s contributions.

Sometimes those sides aren’t the most appetizing; sometimes a side gets made multiple times; and other times the dog knocks a side dish off the table. But grandma always ensures that every side gets made at least once.

Building upon our active-passive example, in an active-active architecture we now have multiple servers handling traffic and sharing the load at the same time. With a load balancer checking the health of each instance, you can ensure high availability because you no longer have a single point of failure. Now, we can have multiple instances running within a region and multiple live regions! By using top-level domains, we can configure our load balancer to route traffic appropriately to handle high loads. We no longer have to worry about failover as we have multiple server instances already running.
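
Here is a minimal sketch of that idea: a round-robin balancer that spreads requests across every instance and simply skips any that are unhealthy. The ActiveActiveBalancer class and the instance names are hypothetical, and real load balancers use richer routing strategies than plain round-robin.

```python
from itertools import cycle


class ActiveActiveBalancer:
    """Round-robin over every instance, skipping unhealthy ones."""

    def __init__(self, health):
        self.health = health              # instance name -> is it healthy?
        self._ring = cycle(list(health))  # fixed rotation order

    def route(self):
        # Try each instance at most once per request.
        for _ in range(len(self.health)):
            name = next(self._ring)
            if self.health[name]:
                return name
        raise RuntimeError("no healthy instance available")


health = {"region-a-1": True, "region-a-2": True, "region-b-1": True}
balancer = ActiveActiveBalancer(health)

before = [balancer.route() for _ in range(3)]  # all three instances share the load
health["region-a-2"] = False                   # simulate one instance failing
after = [balancer.route() for _ in range(4)]   # the remaining two keep serving
```

There is no promotion step here. The surviving instances were already taking traffic, so losing one only shrinks the pool.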

In certain patterns, it is common for the same data to be sent to multiple servers. In these cases, de-duplication mechanisms are needed when reading from and writing to your database.
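
One simple de-duplication approach is to make writes idempotent by keying them on a unique ID. The sketch below assumes every record carries such an ID, which the pattern above does not guarantee. DedupingStore is a hypothetical in-memory stand-in for a database, where a unique constraint on the ID column could play the same role.

```python
class DedupingStore:
    """In-memory stand-in for a database that ignores repeated record IDs."""

    def __init__(self):
        self.records = {}

    def write(self, record_id, payload):
        """Store the record and return True, or return False for a duplicate."""
        if record_id in self.records:
            return False
        self.records[record_id] = payload
        return True


store = DedupingStore()

# The same order arrives twice because two servers both received it.
results = [
    store.write("order-42", {"item": "quiche"}),
    store.write("order-42", {"item": "quiche"}),
    store.write("order-43", {"item": "tiramisu"}),
]
print(results, len(store.records))
```

The second write of order-42 is dropped, so the store ends up with one copy of each order no matter how many servers forwarded it.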

Architecting for active-active resiliency requires a careful selection of tools and resources, which can be a very limiting factor in some cases. For example, if you’re using a queue service, it is important to be aware of its delivery model (e.g., at-least-once or exactly-once). In certain use cases, a pub-sub model such as Kafka may be more appropriate. Introducing distributed patterns, such as those built on Kafka, adds the complexity of asynchrony and eventual consistency.

Always-on servers are ideal in theory, but the tradeoffs of implementing them should be weighed first. At this level, we have to deal with much higher costs for maintaining servers. Including all these additional mechanisms in your next project is a nontrivial effort and should be considered before architecting for high resiliency.

The verdict

Depending on your business case, it isn’t always necessary to architect for high resiliency. High resiliency requires greater operational overhead and a nonnegligible amount of extra work in designing, architecting, and implementing a solution.

Understanding these tradeoffs and weighing the business value when designing for availability is of utmost importance early in the planning process. Think of it as your mise en place. Do it early and do it first to save yourself some headache later on.

Steven Dang, Senior Software Engineer

Steven Dang is a senior software engineer at Capital One.