Experience with a serverless-first strategy at Capital One

It’s been nearly a decade since Capital One made the decision to transition to the cloud. In 2014, we chose Amazon Web Services (AWS) as our provider and began planning our ascent. Just two years later, we completed the deployment of our testing and development environments to AWS, laying the foundations that would guide our journey into the cloud. 

In 2021, we closed our last remaining data center and became the first bank to go all-in on the public cloud. We’re also the only Fortune 150 company other than Netflix that’s accomplished this feat. 

Capital One now has more than 2,000 applications in AWS. We’re leveraging cloud-native services to innovate with the proficiency of a Silicon Valley tech company, rapidly and consistently delivering exceptional experiences to our customers.

We can’t take all the credit, of course. Much of it goes to technologies like the AWS Serverless Application Model (SAM), which lets our talented developers do in minutes what previously took months. For this reason, we’re sharing some of the lessons we’ve learned in the cloud, starting with our serverless-first strategy.

What is the AWS Serverless Application Model?

AWS SAM is an open-source framework for rapidly developing and deploying code without worrying about the underlying infrastructure. Because it integrates seamlessly with Lambda, the event-driven serverless platform at the center of AWS, SAM lets engineers focus on business value, writing code and software rather than managing hardware.

With eight data centers to shutter, we knew serverless would be central to our cloud strategy. AWS Serverless enables our developers to quickly and easily deploy new code without the overhead of infrastructure management or operations planning.

Incidentally, it also reduces cloud resource consumption. Each Lambda function or API entry point uses the minimum amount of resources necessary to run. And depending on the function, those resources are only spun up while the code is in use.

Amazon’s SAM comes in two parts:

  • Infrastructure as code: The serverless model defines a shorthand syntax for expressing functions, APIs, databases, and event source mappings for applications. The syntax, written into SAM templates, is transformed into AWS CloudFormation syntax during deployment.
  • A robust CLI: The SAM CLI (command-line interface) provides a Lambda-like execution environment that lets you build, test, debug, and deploy applications defined by SAM templates or through the AWS Cloud Development Kit (CDK). It also provides tooling for local development and for packaging deployments.

With a few lines of YAML (short for “YAML Ain’t Markup Language”), AWS SAM lets you define and model your application’s functions and resources, which you can then deploy to AWS using the CLI.

To spell it out, the term “serverless” is a bit of a misnomer. The servers are still there, sitting in Amazon’s data centers; SAM simply takes the infrastructure concerns off your plate.

How do I use AWS SAM?

Assuming you have an AWS account and the appropriate configuration, the first step to using SAM is downloading and installing the CLI. You can download the appropriate version from the releases page of the GitHub repo.

Here are the basic installation instructions for each platform: 

  • Linux x86_64: Download the zip file and install it via the command line
  • Linux ARM: Install with pip via pip install aws-sam-cli
  • macOS: Install with Homebrew via brew tap aws/tap and brew install aws-sam-cli
  • Windows 64-bit: Download and install the MSI

Windows users will also need to install Git and enable long paths. For more detailed installation instructions, see Amazon’s SAM documentation.

Once you’ve installed the SAM CLI, you can start using it right away. You can run sam from the terminal to see the available commands, but the ones you’ll use most often are listed below, followed by a sample run:

  • sam init to scaffold a new project (add --name project-name to set the project name up front)
  • sam build for generating deployment artifacts
  • sam package for packaging and staging artifacts into an S3 bucket
  • sam deploy for deploying or updating serverless applications

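For a first project, a typical run of these commands looks something like the following (the project name is illustrative, and sam init will prompt you for a runtime and starter template):

    # scaffold a new project; sam init walks you through runtime and template choices
    sam init --name my-first-app
    cd my-first-app

    # build deployment artifacts into .aws-sam/build
    sam build

    # deploy, with guided prompts for the stack name, region, and artifact S3 bucket
    sam deploy --guided
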
Central to working with serverless applications is the template.yaml that’s generated when you initialize a new project. This file is how you’ll define the resources for your application. Here’s a sample template that defines a serverless API along with the corresponding Lambda function:

    Api:                                   # the API resource referenced below as ${Api}
      Type: AWS::Serverless::Api
      Properties:
        AccessLogSetting:
          DestinationArn: !GetAtt ApiLogGroup.Arn
          Format: '{ "requestId":"$context.requestId", "ip": "$context.identity.sourceIp", "requestTime":"$context.requestTime", "httpMethod":"$context.httpMethod","routeKey":"$context.routeKey", "status":"$context.status","protocol":"$context.protocol"}'
        StageName: !Ref pApiStage
        DefinitionBody:
          Fn::Transform:
            Name: AWS::Include
            Parameters:
              Location: openapi.yaml

    ApiLogGroup:
      Type: AWS::Logs::LogGroup
      Properties:
        LogGroupName: !Sub "/aws/apigateway/spring-${Api}"
        RetentionInDays: 7

    Fn:                                    # the Lambda function behind the API (logical ID shown for readability)
      Type: AWS::Serverless::Function
      Properties:
        CodeUri: ../target/spring-0.0.1-SNAPSHOT.jar
        Handler: cloud.heeki.spring.SpringHandler::handleRequest
        Role: !GetAtt FnRole.Arn
        AutoPublishAlias: live
        DeploymentPreference:
          Type: AllAtOnce
  

SDK and SAM: Credential provider chains

Assuming you’ve set up the AWS CLI, you likely have a credential provider already configured. When using an AWS SDK, the SDK searches for these credentials to authenticate when establishing a connection to an AWS service. When the SDK finds valid credentials, it stops searching. The specific chain the SDK follows is called the default credential provider chain.

How the SDK searches for credentials

When you initialize a new service without providing credential arguments, the SDK searches the default credential provider chain. And while the default provider chains are generally consistent across SDKs, they can and do vary. For example, if you’re using SnapStart and the AWS SDK for Java, the EnvironmentVariableCredentialsProvider is removed from the chain.

Credential provider chains

Despite differences between each SDK, the search order is the same. The default provider chain order follows:

  1. System properties
  2. Environment variables
  3. Web identity token from AWS STS
  4. Default credential profiles
  5. Amazon ECS container credentials
  6. Amazon EC2 instance profile credentials
  7. Custom credential provider

There are several ways to assign credentials, but any credentials supplied explicitly in code take precedence, regardless of the default provider chain. Moreover, specifying a credential provider explicitly skips the search, reducing both discovery and initialization time, so it’s good practice.

Secrets management

At Capital One, we use a tool called Chamber of Secrets to manage and store our credentials.  This tool provides a standardized way to access secret information from all types of applications, while providing our required safeguards.

As our Lambda adoption grew, we quickly realized that Lambda can put massive parallel load on our secrets store. Because of this, we developed a Lambda extension that retrieves and caches secrets at initialization. Secrets are then accessed via a local HTTP server that’s only available to the Lambda execution environment. The secrets can be refreshed on demand or by a set time to live (TTL). You can learn more about this design in our AWS re:Invent 2022 talk. AWS has since released an AWS Parameters and Secrets Lambda Extension of its own that works similarly.
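
If you go the AWS-provided route, that extension ships as a Lambda layer you attach in your SAM template. A minimal sketch, with a placeholder layer ARN (it varies by region and version) and an illustrative cache TTL; the function’s execution role still needs permission to read the secret:

    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src
      Handler: app.handler
      Layers:
        # region- and version-specific ARN; look it up in the AWS documentation
        - arn:aws:lambda:<region>:<account>:layer:AWS-Parameters-and-Secrets-Lambda-Extension:<version>
      Environment:
        Variables:
          SECRETS_MANAGER_TTL: "300"   # seconds to cache a secret before refreshing it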

API integration patterns

Capital One’s API architecture requires us to build highly resilient, multi-region architectures that can take application traffic in either region at any time.

We use Amazon Route 53 to dynamically route traffic across two regions using any weighting we require. Application traffic is directed to a public endpoint that routes to the domain of our private API (the CNAME resolves to a private load balancer), which terminates TLS with the ACM certificate for the domain. The load balancer then reinitiates the connection with a TLS listener that directs the traffic to our backends, typically Lambda or Fargate.

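The Route 53 piece of this pattern boils down to weighted records pointing at each region’s endpoint. A minimal sketch in CloudFormation, with hypothetical domain names and an even split:

    ApiEastRecord:
      Type: AWS::Route53::RecordSet
      Properties:
        HostedZoneName: example.com.
        Name: api.example.com.
        Type: CNAME
        TTL: "60"
        SetIdentifier: us-east-1
        Weight: 50                       # adjust the weights to shift traffic between regions
        ResourceRecords:
          - api-us-east-1.example.com

    ApiWestRecord:
      Type: AWS::Route53::RecordSet
      Properties:
        HostedZoneName: example.com.
        Name: api.example.com.
        Type: CNAME
        TTL: "60"
        SetIdentifier: us-west-2
        Weight: 50
        ResourceRecords:
          - api-us-west-2.example.com
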
This pattern builds on private backend integrations using Amazon VPC technology. This integration allows us to provide secure API access to resources in our VPC without worrying about private network configurations.

Capital One’s full API architecture includes:

  • Active dual-region resiliency
  • Automatic failover
  • Route 53 CNAME configurations
  • Application Load Balancer

We’ve also implemented phased rollouts with rollback capabilities using CodeDeploy. A template for this kind of rollout looks like this:

    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src
      Handler: fn.handler
      Role: !GetAtt FnRole.Arn
      AutoPublishAlias: live
      DeploymentPreference:
        Type: Linear10PercentEvery1Minute
        Hooks:
          PreTraffic: !Ref PreTrafficFn

Deployment and traffic shifting

Before AWS, a code release meant we had to manually reassign all the traffic to the new application. Naturally, this required considerable operations planning to ensure a flawless migration. Even then, code errors could slip by unnoticed until it was too late. 

With AWS, traffic shifting is built into SAM. Using CodeDeploy, we can now define and automate a gradual rollout strategy that includes pre- and post-traffic testing and rollback capabilities in the event errors are discovered.

A template for a rollout strategy looks like the CodeDeploy example shown earlier in the API integration section: the DeploymentPreference block defines the traffic-shifting type, and the Hooks section attaches a pre-traffic test function.

At Capital One, we’ve standardized on rollout strategies that teams choose from. One example is a custom linear rollout strategy that shifts 2% of the traffic every minute, taking ~45 minutes to complete. This allows us to deploy a new Lambda version and slowly move traffic from the previous version.
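
Under the hood, a custom strategy like that is just a CodeDeploy deployment configuration that the function’s DeploymentPreference points at. A rough sketch, with hypothetical resource and configuration names:

    Linear2PercentEvery1Minute:
      Type: AWS::CodeDeploy::DeploymentConfig
      Properties:
        DeploymentConfigName: Linear2PercentEvery1Minute
        ComputePlatform: Lambda
        TrafficRoutingConfig:
          Type: TimeBasedLinear
          TimeBasedLinearConfig:
            LinearInterval: 1       # minutes between traffic shifts
            LinearPercentage: 2     # percent of traffic shifted each interval

    Fn:
      Type: AWS::Serverless::Function
      Properties:
        CodeUri: src
        Handler: fn.handler
        AutoPublishAlias: live
        DeploymentPreference:
          Type: Linear2PercentEvery1Minute   # name of the custom deployment configuration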

Burst CPU during initialization

Lambda only allows you to control the amount of memory your functions use: between 128MB and 10GB of RAM, to be precise. But core processing speed and network capacity scale in proportion to the memory setting. When you double a function’s memory, you double its processing power and network capacity.

For simpler functions, the default memory setting of 128MB is sufficient. But if you have a function that’s CPU bound, raising the memory and, consequently, the core processing power can have a big impact on performance. In some cases, it may even be affordable to do so, especially if it’s a function with provisioned concurrency. 
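
In a SAM template, memory is a single property per function; the value below is only an example and is worth profiling before you settle on it:

    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src
      Handler: fn.handler
      MemorySize: 1024   # MB of RAM; CPU and network capacity scale with this value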

There’s one exception to buying CPU with memory, however.

Lambda gives on-demand functions a free “cold start” initialization window: up to 10 seconds of unthrottled CPU. When you can, design your on-demand functions to perform heavy computation during this window and you’ll save on resource utilization. Keep in mind, however, that if initialization doesn’t complete within the window, Lambda terminates the process, reruns the initialization as part of the invocation, and charges you for those resources.

Errors during init for provisioned concurrency

Even with CPU bursts, spinning up a heavy function with numerous dependencies can take some time. If you need to update or scale up your function, Lambda has to create new execution environments, which leads to higher latencies for the new instances. 

Allocating provisioned concurrency ensures all new invocations run with consistent — and low — latencies. To be sure, there are plenty of situations where this is a must. Our custom deployment rollout balances provisioned concurrency between old and new versions as it progresses, for example.
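
In SAM, provisioned concurrency is configured against the alias that AutoPublishAlias creates; the count below is illustrative:

    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src
      Handler: fn.handler
      AutoPublishAlias: live
      ProvisionedConcurrencyConfig:
        ProvisionedConcurrentExecutions: 25   # environments kept initialized for the live alias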

But be warned: provisioned concurrency is fickle. Invalid handlers, runtime errors, and just about any other bug in your code often result in a FUNCTION_ERROR_INIT_FAILURE. When this happens, Lambda won’t try to reassign provisioned concurrency, and all requests for the function are served on demand. In most cases, you’ll need to release a new version of your function to resolve the issue.

Things to avoid with Lambda

  • Maxed-out Lambda configurations: While Lambda functions have a maximum timeout of 15 minutes, avoid using it. Synchronous functions should run as quickly as possible. Asynchronous code can take longer, but if you need 900 seconds to run a single function, consider splitting the code into smaller functions.
  • Using reserved concurrency in place of provisioned concurrency: Reserved concurrency guarantees (and caps) the number of concurrent executions available to a Lambda function, but it doesn’t pre-initialize execution environments, so it has no effect on cold or warm starts.
  • Over-allocating provisioned concurrency to peak traffic: Allocating provisioned concurrency based on peak traffic patterns sounds reasonable, but when traffic dips you’re paying for resources you aren’t using. Instead, auto scale your provisioned concurrency with Application Auto Scaling, either on a predetermined schedule or based on utilization metrics (see the sketch after this list).

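A sketch of the scheduled approach, using Application Auto Scaling against the alias created by AutoPublishAlias (the schedules, capacities, and logical IDs here are illustrative and assume a function resource named Fn):

    FnScalableTarget:
      Type: AWS::ApplicationAutoScaling::ScalableTarget
      DependsOn: FnAliaslive                     # the alias SAM creates for AutoPublishAlias: live
      Properties:
        ServiceNamespace: lambda
        ScalableDimension: lambda:function:ProvisionedConcurrency
        ResourceId: !Sub function:${Fn}:live
        MinCapacity: 5
        MaxCapacity: 100
        ScheduledActions:
          - ScheduledActionName: business-hours
            Schedule: cron(0 12 ? * MON-FRI *)   # scale up before weekday traffic (UTC)
            ScalableTargetAction:
              MinCapacity: 50
          - ScheduledActionName: overnight
            Schedule: cron(0 22 ? * MON-FRI *)   # scale back down in the evening
            ScalableTargetAction:
              MinCapacity: 5
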
Best practices when working with Lambda

  • Set a CloudWatch log retention policy: Lambda writes logs to CloudWatch, and by default they’re retained indefinitely, so you pile up logs and expenses forever. Set retention policies and be aggressive in lower environments (see the template sketch after this list).
  • Optimize your memory configuration: Take time to profile your functions using AWS Lambda Power Tuning and use the optimal memory configuration. Lambda bills you based on the memory you configure, not the amount actually used, so it’s important to configure memory properly.
  • Optimize your CPU architecture: If you can migrate any of your functions from x86 to arm64, you’ll benefit from higher performance at lower cost. Keep in mind that some libraries need their arm64 builds to run correctly and perform well.
  • Estimate your serverless costs: One of the most significant benefits of going serverless is that you only pay for what you use. Make a habit of estimating your costs with the AWS Pricing Calculator so you understand how your application’s configuration affects your bill.
  • Minimize function deployment size and dependencies: Streamline your Lambda functions as much as possible by optimizing your code and trimming unnecessary dependencies. This will save on resource costs and improve performance. Don’t include source code, documentation, or other unnecessary items in your deployment package.

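Several of these practices come down to a few lines of template. A sketch with illustrative values, assuming a function whose logical ID is Fn:

    Globals:
      Function:
        Timeout: 30              # keep timeouts well under the 15-minute maximum
        Architectures:
          - arm64                # better price/performance when your dependencies support it

    FnLogGroup:
      Type: AWS::Logs::LogGroup
      Properties:
        LogGroupName: !Sub "/aws/lambda/${Fn}"   # the log group Lambda writes to for this function
        RetentionInDays: 14                      # don't keep logs forever
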
Continue your AWS serverless learning

We’ve covered a lot of ground, but there’s plenty more to learn about AWS and the serverless model. Whether you’re a developer who’s exploring the cloud through a different lens or a DevOps engineer with an eye on code as infrastructure, the possibilities for personal and professional development are staggering. 

As you continue learning about AWS Serverless solutions, you’ll discover many more tools and technologies to explore and absorb. Hopefully, you now have a solid foundation for diving into AWS and creating your first serverless applications. But if you’re thirsting for more SAM, the AWS re:Invent presentation with Capital One’s own George Mao is worth your time.

Learn at your own pace

The new Serverless Learning Path and Certification is a great place to explore the fundamentals and best practices of Lambda and the AWS serverless model. This course is developed to put you in a serverless mindset through the process of building and running applications without thinking about infrastructure. You’ll also be tested on your knowledge and have the opportunity to earn a Serverless digital badge!

Increase your serverless knowledge

The Serverless Ramp-Up Guide is another great resource for augmenting your knowledge. And of course, there’s no better way to demonstrate your expertise than by getting AWS Certified.

Even if your expertise lies in a different domain, there are numerous AWS certifications spanning everything from cloud essentials to specialties in cloud security and governance. 

Shape the future with Capital One

Back in 2014, when we made the decision to go all-in with cloud technology, there were still plenty of skeptics. But a decade on, Capital One is defining technological innovation in the world of finance. While other banks are developing chatbots, we're among the handful of companies that have developed true digital assistants powered by artificial intelligence.

And Eno, our digital assistant, is just one example of how we're using technology to reimagine banking. From leveraging machine learning to prevent fraud to developing data models to keep our customers' sensitive information safe, we see endless opportunities to make finances and technology easier and more accessible to everyone. Put simply, we're a bank that's built like a tech company.   

If you’re looking for a career in technology and you're interested in shaping the future of banking with Capital One, take a look at some of the positions we're hiring for. 


George Mao, Sr. Distinguished Engineer, Bank Architecture

George is a Senior Distinguished Engineer at Capital One. He is the Lead DE for Capital One's serverless strategy and leads the effort to transform the company into a serverless-first organization. He leads the Serverless Center of Excellence and is responsible for setting enterprise patterns, best practices, and the developer experience.
