At Capital One we are leveraging microservices architecture to increase our speed of delivery without compromising quality. Part of that microservices journey, our transformation from monolithic services, has involved moving to automated CICD pipelines. But how do you design a good enterprise CICD pipeline for your new microservices teams? In my work at Capital One, I lead several dev tooling teams. We’ve learned some lessons about how to transform automated CICD pipelines specifically for microservices, I’d like to share some of what we’ve learned, told from the perspective of my teams and their work.
As stated, our goal at Capital One is to increase the speed of delivery without compromising quality. In order to attain that, we decided that we needed to leverage automated deployments that are compliant with our general quality standards. Capital One maintains very rigorous and explicit guidelines around software delivery and changes in production. This segregation of duties has been directly translated into our pipeline automation via five imperatives that act as compliance gates.
- Source Control Mechanisms - This was the first and most basic step and should not come as a surprise to anyone here. We implemented peer reviews, oversight to make sure the reviews happened, and a recording process for auditing purposes.
- Secure Storage of Application Binary - We implemented a process to ensure that the versioned immutable binaries were promoted/published to a storage or environment only through automated build processes.
- Access Controlled Application Environment - This imperative ensures that the deployment process itself cannot cause issues to the prod environment, and thus any change to the build script has to be tested in pre-prod. Also, the access to each of the environments has to be controlled in such a way that no one can accidentally access from pre-prod to prod and vice versa.
- Quality Checks - This ensures that we have the right amount of code coverage and our code passes functional tests and meets the SLO under load.
- Security Checks - Many of our teams use and contribute heavily to the open source community. This requires us to make sure the open source code is meeting all of our security checklists and licensing requirements.
These five imperatives became the immutable stages in our pipeline execution; each and every application team had to meet them before they could automatically send their code to production. Now these are all very good on paper, but how do you execute on them and how do you make sure that every application team meets these standards?
To answer that we came up with a certification system for their pipelines and deployment strategy. We went through an evolution before making it to the right side of the diagram below. Let me walk you through how we started and how we got to where we are today.
Our first step was defining the immutable stage gates for each pipeline, in our case we chose 16 stage gates. Application teams were expected to collect evidence that they were meeting these compliance requirements before manually submitting them for review. Once submitted, a process engineer would manually validate the evidence, often after several back and forths to clarify questions. After the evidence was validated, the application team was certified for automated deployments
How long did this take? A long time, which was not ideal when more and more teams started doing automated deployments. We needed to revise the process so it could scale.
Semi Manual Certification
To scale our process we first consolidated and reduced the number of stage gates. Additionally, we brought most of the validations into the pipelines themselves so they could collect all the evidence and automatically export it to a database where a process engineer could manually validate it one final time.
This was far more efficient than Manual Certification but still there were some manual processes involved; the bottleneck had just shifted to the process engineer. Also, while we reduced certification time substantially, this process still took time. When you’re talking about microservices, we were creating hundreds of them every day and going through this process for each and every one could potentially bring things to a standstill. This process was not going to let us attain the speed we wanted.
However, looking at the architecture of these microservices, they all followed a similar pattern. Was there an opportunity to improve certification times for microservices and applications belonging to a single product family? We thought there was.
What if we certified a pipeline template that would make sure anyone using it was automatically meeting the requirements for certification? That way you could remove the process engineer bottleneck and automatically certify applications.
This became our target goal and we started focusing more and more on how to attain this stage of certification. Instead of coming up with another process - because we all know how much the developer community loves process - we focused on designing a good pipeline with reusable building blocks, flexible orchestration, and the ability to generate templated pipelines.
Reusable - For example, an immutable stage baked in with automated performance testing tools or infrastructure provisioning tools. This could be reused by any application since it’s so foundational for anything being deployed on a pipeline.
Flexible - We wanted our pipeline to provide many different forms of orchestration without the application teams needing to create their own pipeline. Let me give you an example. Let’s say you have a microservice that has to be deployed in a Kubernetes cluster. Your pipeline won’t need any infrastructure provisioning since you’re deploying to a cluster that already exists. Simple blue/green deployment may not be the right choice for this microservice and you may want to go with a canary deployment instead. However, depending on the complexity and business needs of their application, another team may want to do both. In that case, they should be able to choose, and the template should be flexible enough to orchestrate their pipeline stages.
Templated -The idea here is to provide templates for the teams to instantiate and create their own pipelines. All the stages that are defined in the template will be provisioned in the pipeline itself so that no application team has to reinvent the wheel. This is often overlooked and it wastes a lot of time when you are creating many similar microservices all the time. The cost of creating incrementally one HAS to be small, for the sake of “independent deployability.” This cost is further increased when you look into compliance impact.
Declarative Pipeline Model
Based on what we wanted to achieve, we chose a declarative pipeline model, also known as managed template model. Before we started, we set some goals for ourselves:
- We wanted certified templates with immutable stages to show our compliance and auditing teams that anyone going through this certified template automatically meets compliance gates.
- Once the template is created, we wanted to lock down the stages but not the implementations beneath. Let’s say I have an application written in Java and another written in Node. When they use this template and create their pipeline, they cannot eliminate the build stage. But they can choose their own build modules / scripts to be added to this stage.The Java team should be able to bring in their Java build script and the Node team their Node build script but the stage itself should remain the same. That way we ensure flexibility for that pipeline template, making sure it can be used by multiple teams with different tech stacks.
- We also wanted to provide many templates, not just one, and enable inner sourcing. For example, infrastructure code goes through different orchestration and stages if you are deploying your own code to production. If you are using third party code, open source code, or licensed code they will all go through different stages as well and we wanted to provide different templates for them to create a template marketplace. That way teams could bring in their own templates and implementations that we could help them certify with the compliance teams.
When we started moving towards this goal of increasing speed without sacrificing quality, we needed some metrics to measure our progress and ultimately our success. It’s very easy to say, “I deployed my application and didn’t cause any problems in production, and that’s what success looks like, right?” Which isn’t wrong, but doesn’t cover the whole picture. Instead we started looking at some metrics. For those teams that adopted the compliant and automated deployments the average number of deployments increased, and the issues that they caused to production and the mean time to resolve incidents both fell.