Advice for building resilient systems in AWS
The importance of being critical about dependencies on control planes
January 13, 2023
In this blog, I’ll talk about the distinction between control and data planes, and emphasize the importance of being critical about hard dependencies on control planes (especially during emergency situations).
Control vs. data planes
The concept of a “control plane” and a “data plane” comes from networking. The data plane is the fabric of routers, switches, load balancers and so forth that are responsible for shuffling packets around in real time. The control plane includes the systems that configure resources running in the data plane - updating route tables, adding or removing endpoints, and so forth.
The reason we separate these is resilience. Data planes are architected for simplicity and resilience, and a failure in the control plane should NOT take out the data plane. For instance, if the UI for reconfiguring a load balancer is down, that should not stop the load balancer from serving requests based on its current configuration.
AWS applies this principle in the design of all of their services. For instance, the EC2 control plane is responsible for allocating, changing state and reconfiguring instances. It can fail or be shut off, and it will not affect any of the currently running instances (which are the data plane for the EC2 service). You can read about it in this excellent whitepaper from AWS, or watch this video from re:Invent.
The punchline is that every AWS service has a control plane and a data plane. The control plane allocates the resources and the data plane runs them. Since control planes change the system, they are inherently more complex and therefore more prone to breaking. Every service is designed so its data plane keeps working if the control plane fails.
How do you differentiate between the control plane and the data plane?
The difference between control and data planes
Although it’s not officially documented, you can generally tell whether you’re using the control or data plane based on the service and the API action. A few general tips follow:
API CRUD operations almost always use the control plane. Actions that create, read, update or delete service resources are almost always in the control plane. This includes resources like EC2 (instances, VPCs, subnets, security groups, load balancers, etc.), RDS (database clusters), Route53 (DNS records and routing policies), DynamoDB (tables), Lambda (functions), and so forth.
Interacting with service resources is typically the data plane. Sometimes this happens via industry-standard protocols - for example, logging into your instance using SSH, making an HTTP call through a load balancer, or looking up a DNS entry. Other times it’s done through a dedicated API action - for example, Lambda Invoke and InvokeAsync, or DynamoDB Scan, Query, GetItem or PutItem operations.
Most health checks are tied to the data plane. Health checks - from load balancer targets to DNS routing policies to container clusters - are almost always in the data plane, and they are optimized for the kind of resilience typical of data plane operations.
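As a rough illustration of these tips, here is a toy classifier. The prefix list, action table, and classify() helper are hypothetical heuristics, not any official AWS mapping:

```python
# Toy classifier for the tips above. The prefix list and action table
# are hypothetical heuristics, not an official AWS mapping.
CONTROL_PLANE_PREFIXES = ("Create", "Delete", "Update", "Modify")  # CRUD-style

# Known data plane actions called out in the text. Note that DynamoDB's
# PutItem is data plane even though it sounds like a CRUD "Put".
DATA_PLANE_ACTIONS = {
    ("lambda", "Invoke"),
    ("lambda", "InvokeAsync"),
    ("dynamodb", "GetItem"),
    ("dynamodb", "PutItem"),
    ("dynamodb", "Query"),
    ("dynamodb", "Scan"),
}

def classify(service: str, action: str) -> str:
    """Rough guess at which plane an API action belongs to."""
    if (service, action) in DATA_PLANE_ACTIONS:
        return "data"
    if action.startswith(CONTROL_PLANE_PREFIXES):
        return "control"
    return "unknown"

print(classify("ec2", "CreateSecurityGroup"))  # control
print(classify("dynamodb", "PutItem"))         # data
```

The point of the "unknown" bucket is honest: the heuristic is a starting place for auditing your dependencies, not a substitute for checking each API you call.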
Once you can tell them apart, how should you use that knowledge?
Best practices for dependencies
As mentioned before, control planes are slightly less stable than data planes. Most AWS services will have a “failure mode” in which the control plane is unavailable, but the data plane keeps working. Once you understand the APIs and actions you depend on, you should assess the risk of those dependencies.
Consider critical path dependencies of your application. Are you relying on a control plane API? This is unusual - for instance, it’s far more likely that your API would write to an existing RDS cluster or DynamoDB table, rather than creating a new table or cluster on each request. But if you do spot any of these dependencies, think hard about how you might remove them.
Examine “BAU” operational dependencies. One example here is autoscaling - if the EC2 control plane goes offline, you may not be able to resize an EC2 instance or launch new ones. How much headroom does your current capacity give you before this becomes an issue? Should you pre-allocate extra capacity to tolerate spikes?
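A back-of-the-envelope headroom check can make this concrete. The function names and numbers below are made up for illustration:

```python
def headroom_ratio(provisioned: int, peak_in_use: int) -> float:
    """Fraction of provisioned capacity still free at peak load."""
    if provisioned <= 0:
        raise ValueError("provisioned must be positive")
    return (provisioned - peak_in_use) / provisioned

def survives_spike(provisioned: int, peak_in_use: int, spike_factor: float) -> bool:
    """Can pre-allocated capacity absorb a spike without scaling out
    (i.e. without any control plane call)?"""
    return peak_in_use * spike_factor <= provisioned

# e.g. 20 instances provisioned, 12 busy at peak, planning for a 1.5x spike
print(headroom_ratio(20, 12))       # 0.4
print(survives_spike(20, 12, 1.5))  # True
```

If survives_spike comes back False for your realistic worst case, that is a signal to pre-allocate more, since scaling out mid-incident depends on the control plane being healthy.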
Consider exceptional circumstances. For instance, do you depend on control plane actions to execute a regional failover? If so, would they depend on the region you are evacuating, or can you execute them from the healthy region? Better yet, is there a corresponding data plane action you could use to eliminate the control plane dependency altogether?
Sometimes there is no way to engineer around the risk of a control plane dependency - there is only one way to do things, and that’s it. But in many cases there are simple workarounds to reduce or eliminate the risk. In the next section we’ll talk about a few examples.
Examples of dependencies
As mentioned before, EC2’s control plane can go offline. If this happens, you won’t be able to launch or resize instances, or create load balancers, subnets or the like. Pre-allocating capacity is the best way to limit the impact of an EC2 control plane outage.
EC2 is also a transitive dependency of many other AWS services. For instance, you’re not likely to be able to allocate a new RDS cluster if EC2’s control plane is offline, since that’s where RDS gets servers from. Lambda tends to pre-allocate a huge buffer of reserved capacity, so there’s unlikely to be a cascading failure there. Fargate might be a little tighter - we have seen failures in allocating new tasks when EC2’s control plane is down.
Use your performance metrics to understand your load, and make a capacity plan that accommodates predictable surges or spikes. This might follow a daily, weekly or seasonal rhythm (e.g. allocate new capacity during the week, and turn it down on the weekends). The Static Stability Architecture Pattern has additional details.
DNS is an absolutely critical tool for cross-region failover. It can be tempting to use the Route53 API to update your routing policy to shift traffic during a failover. But this leaves you with a critical control-plane dependency on the us-east-1 region, where the Route53 control plane runs. If that’s the region you’re evacuating, you can wind up with a problem.
Investigate health check options within Route53 routing policy. Health checks are run on a distributed global fleet of resolvers, so they are not subject to the same regional limitation. If you can base your health check on a public HTTPS endpoint, or a simple CloudWatch metric, you will be better off from a control vs data plane perspective.
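As a sketch of what that looks like, here is a failover routing record wired to a health check, expressed as the Python dict you would pass to Route53’s ChangeResourceRecordSets. The domain, IP, and health check ID are placeholders - the point is that this is configured ahead of time, so no control plane call is needed during the actual failover:

```python
# Sketch of a failover routing record tied to a health check, set up in
# advance. Name, IP, and health check ID below are placeholders. Once in
# place, Route53's distributed health checkers (data plane) shift traffic
# on their own - no control plane call is needed during the failover.
primary_record = {
    "Action": "UPSERT",
    "ResourceRecordSet": {
        "Name": "api.example.com.",
        "Type": "A",
        "SetIdentifier": "primary-us-west-2",
        "Failover": "PRIMARY",   # a matching SECONDARY record backs this up
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.10"}],
        "HealthCheckId": "11111111-2222-3333-4444-555555555555",
    },
}

change_batch = {
    "Comment": "Failover routing, configured ahead of time",
    "Changes": [primary_record],
}
print(change_batch["Changes"][0]["ResourceRecordSet"]["Failover"])  # PRIMARY
```

You would submit this once, during normal operations, along with a corresponding SECONDARY record pointing at the other region.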
RDS includes APIs which let you execute a managed failover from one region to another. These are control plane APIs, BUT they are meant to be executed from the healthy region (the one you’re failing towards).
This means that, in the event you want to evacuate a region, you don’t need to use that region to do it. Plan and execute your regional isolation tests by invoking the failover APIs from the healthy region, so you’re familiar with how to do it in an emergency.
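A sketch of that idea in Python, using boto3’s failover_global_cluster API for Aurora Global Database. The region and cluster identifiers are placeholders, and you should verify the parameters against your own setup; the client factory is injected so the sketch is easy to test without AWS:

```python
# Sketch: invoke the failover from the *healthy* region's endpoint.
# failover_global_cluster is the Aurora Global Database failover API;
# the identifiers below are placeholders - check them against your setup.
def fail_over_to(healthy_region: str, global_cluster_id: str,
                 target_cluster_arn: str, client_factory=None):
    """Promote the cluster in the healthy region, calling only that region."""
    if client_factory is None:
        import boto3  # imported lazily so the sketch is easy to stub in tests
        client_factory = lambda region: boto3.client("rds", region_name=region)
    rds = client_factory(healthy_region)  # endpoint pinned to the healthy region
    return rds.failover_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        TargetDbClusterIdentifier=target_cluster_arn,
    )
```

The detail that matters is the region pinning: every call goes to the healthy region’s endpoint, so nothing in the evacuated region is on your critical path.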
Batch jobs are weird from a resilience perspective. Their trigger tends to be tied to a specific region, whether it’s the arrival of a file into S3, or an EventBridge scheduled event, or something else.
A common pattern is to reconfigure these triggers during a regional failover - for example, disabling a Lambda trigger in the evacuated region and enabling the corresponding trigger in the healthy region. This is almost always a control plane action, and a complex regional outage could mean you can’t shut off the trigger in the unhealthy region.
A slightly better approach could be to provision a DynamoDB table with cross-region replication that contains a “primary/secondary” switch. Keep the triggers running in both regions, but have their first action be to check this table to ask “am I the primary?” before executing anything. You can toggle this value from the healthy region and it should affect both (data replication is a data plane action).
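A minimal sketch of that guard, with the DynamoDB read abstracted behind a fetch_flag callable so the control flow is clear (and testable without AWS). The table design and names are hypothetical:

```python
# Sketch of the "am I the primary?" guard. The replicated-table design is
# hypothetical; fetch_flag stands in for a DynamoDB read on the global
# table so the control flow is easy to see (and to test without AWS).
def should_run(my_region: str, fetch_flag) -> bool:
    """First action of every trigger: check the replicated switch."""
    return fetch_flag() == my_region

# In production, fetch_flag would be a data plane read on a global table,
# something like (placeholder table and key names):
#   boto3.resource("dynamodb").Table("failover-switch") \
#       .get_item(Key={"pk": "primary-region"})["Item"]["region"]

print(should_run("us-east-1", lambda: "us-west-2"))  # False - stand down
```

Because the read and the cross-region replication are both data plane operations, this switch keeps working even when one region’s control plane is struggling.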
The Lambda example above deserves special mention. A common pattern is to set the reserved concurrency for your function to 0, which causes every attempt to invoke it to fail. In large shared accounts, this can be dangerous for commonly invoked triggers.
Within each account and region, the Lambda service uses a shared queue for asynchronous invocations. If your trigger is invoked a lot and those requests fail, the retries can fill up this queue, delaying async invocations for other applications in the account.
To properly disable an async Lambda function, make sure you are using Dead Letter Queues. There is excellent information in the AWS documentation about how to do this for Lambda functions.
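As a sketch, attaching an SQS dead-letter queue to a function via UpdateFunctionConfiguration might look like the following. The function name and queue ARN are placeholders, and the client is injected so the sketch can be exercised without AWS:

```python
# Hypothetical sketch: point a function's failed async invocations at an
# SQS dead-letter queue, so they drain out instead of piling up as
# retries. The function name and queue ARN below are placeholders.
def attach_dlq(function_name: str, dlq_arn: str, client=None):
    if client is None:
        import boto3  # imported lazily so the sketch is easy to stub in tests
        client = boto3.client("lambda")
    # DeadLetterConfig is the UpdateFunctionConfiguration parameter for DLQs
    return client.update_function_configuration(
        FunctionName=function_name,
        DeadLetterConfig={"TargetArn": dlq_arn},
    )
```

With the DLQ in place, events that exhaust their retries land in the queue for later inspection or replay rather than competing with other applications’ invocations.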
Takeaways for building resilient systems in AWS
It’s very difficult to make risk-based decisions in the abstract, so I can’t generally tell you to ALWAYS do X, or NEVER do Y. You have to use your judgment in the context of your application and use case.
But hopefully this article provided good food for thought. Carefully consider your dependencies on AWS APIs - whether they are control or data plane, what potential failover scenarios are, and how you might limit the resulting risks.