One of the biggest challenges in any transformation to an Agile environment is communicating the difference between “iterative value” and “speed.” Given the name, people typically assume that Agile = Faster. This is true to a certain extent, but by “faster” we really mean “faster to get value.” This is not the same as “faster to deliver the full value.”
Contract and Resiliency-First Mindset
As a team with both a lot of dependencies and a lot of intent, we have always tried to balance delivering iterative value and final value for the APIs driving Capital One’s mobile experiences. We have long practiced a “contract and resiliency-first” mindset in delivering our services. What does this mean?
Contract-first means that the first thing we define and deploy is the contract of the API. We consider this a contract in the strictest sense. Rather than making assumptions about what REST implies, we make sure to be clear with our UI engineers about both what they can expect to come back and what we expect to be sent. By setting this stage first, we are able to align on an implementation quickly and deploy the working contract into production as a first-stage deployment. This happens very rapidly: typically we can implement it within the first 48 hours of alignment between teams.
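As an illustration only, a first-stage contract deployment could look something like the Python sketch below. The endpoint, the field names, and the placeholder values are all hypothetical, and a real service would more likely express the contract in a schema format and serve it through a web framework. The point is simply that the request and response shapes are real and enforced, while the data behind them is not there yet.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class AccountSummaryRequest:
    """What the API expects to be sent (hypothetical fields)."""
    account_id: str


@dataclass(frozen=True)
class AccountSummaryResponse:
    """What the UI can expect to come back (hypothetical fields)."""
    account_id: str
    display_name: str
    balance_cents: int
    currency: str


def parse_request(payload: Dict[str, Any]) -> AccountSummaryRequest:
    """Reject anything that does not match the agreed request shape."""
    unexpected = set(payload) - {"account_id"}
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    account_id = payload.get("account_id")
    if not isinstance(account_id, str) or not account_id:
        raise ValueError("account_id must be a non-empty string")
    return AccountSummaryRequest(account_id=account_id)


def handle_account_summary(payload: Dict[str, Any]) -> Dict[str, Any]:
    """First-stage handler: the contract is enforced, the data is a placeholder."""
    request = parse_request(payload)
    response = AccountSummaryResponse(
        account_id=request.account_id,
        display_name="Placeholder Account",  # real lookup comes in a later iteration
        balance_cents=0,
        currency="USD",
    )
    return asdict(response)
```

With something like this in production, UI engineers can build against `handle_account_summary` right away, and later iterations replace the placeholder values without changing the shape they depend on.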
Resiliency-first is all about creating the practice of, and trust in, building resiliency into our designs and code. This goes beyond graceful failure and strong error handling to include patterns like “circuit breaking” and failover, and understanding the impact of our dependencies on our customers’ experience with our product. We look to protect not only our system, but also our customers and our infrastructure.
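For readers who have not run into “circuit breaking” before, here is a minimal sketch of the general pattern in Python. It is a generic illustration, not the implementation used by any particular team or library; the class name, the thresholds, and the injectable `clock` parameter are assumptions made for the example. After a set number of consecutive failures the breaker opens and skips calls to the dependency, then allows a single trial call once a timeout has elapsed.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the dependency call is skipped."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of piling more load on a struggling dependency.
                raise CircuitOpenError("dependency call skipped; circuit is open")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(lambda: client.get_offers())` where `client` is a hypothetical dependency client, and treat `CircuitOpenError` as a signal to fail over or serve a fallback.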
Reliability is the practice of making sure our systems are resilient in the face of their dependencies. This goes beyond things like Active/Active or multi-region deployments. It is about how we fundamentally build user experiences, how our APIs respond, and how we prevent a single downstream dependency failure from making its way upstream to users.
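One simple way to keep a downstream failure from reaching users is to decide, per dependency, whether the response can still be useful without it. The Python sketch below assumes a hypothetical home screen made of a core `accounts` call and an optional `offers` call; the function and field names are invented for illustration. A failure in the optional dependency degrades the response instead of failing it.

```python
from typing import Any, Callable, Dict, List


def build_home_screen(
    fetch_accounts: Callable[[], List[Dict[str, Any]]],
    fetch_offers: Callable[[], List[Dict[str, Any]]],
) -> Dict[str, Any]:
    # Core data: if this fails, the error propagates, because the
    # screen has nothing meaningful to show without it.
    accounts = fetch_accounts()

    # Optional data: a failure here is contained and reported as a flag,
    # so the UI can hide the section rather than show an error page.
    try:
        offers = fetch_offers()
        offers_available = True
    except Exception:
        offers = []
        offers_available = False

    return {
        "accounts": accounts,
        "offers": offers,
        "offers_available": offers_available,
    }
```

Including an explicit flag such as `offers_available` in the contract lets the UI distinguish “no offers” from “offers could not be loaded,” which is a design choice worth agreeing on up front.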
Why Deploy in This Manner?
The question often comes up: why deploy something before people are ready to use it? What’s the point of being iterative in our delivery? People raise concerns about risks to stability as well as security. My answer: “Well, I hate having code sitting.”
There is intrinsic value in going through a deployment and validating in production. We minimize risk to our environments by automating our pipeline and our testing. By leveraging our Edge design patterns, we limit security risk. We are also able to validate the deployment and baseline for our service very early. This gives us a higher degree of confidence as to the impact of every iterative change we then make on top of it. By reducing the potential impact of changes, you are reducing the risk to the infrastructure and making this kind of iteration possible.
A traditional waterfall approach and a change-averse culture want to wait until everything is done and signed off, to reduce the number of times we attempt to change our production environment. Under this mindset, fewer changes implies less risk and less potential for an outage. Waiting until everything is ready also implies that all the business value is there and ready to go. The problem with this approach is that you end up pushing through very large-scoped changes that put a large amount of risk on the production environment. If they fail, there are too many pieces to dig through in order to triage, so you roll the whole thing back. That means zero business value makes it in, and the change-averse culture is reinforced.
Now you have a self-fulfilling prophecy that encourages fewer releases, because each release is a large risk to the company. That, in turn, creates a sense of distrust of your engineers.
Changing the Mindset
As engineers, we need to change that mindset. The act of deploying code should not be considered risky. The act of deploying is just business as usual and it should be as natural as committing code to your source control every day.
This does not mean slinging code. It means using strong automation processes to rapidly iterate through features and capabilities. It also means building trust in your engineers and their pipeline, and understanding that their sense of ownership is just as strong as that of those who manage infrastructure. By driving home that sense of ownership, you are able to create a strong culture of resiliency in everything you build.
As API builders, we want to get great features into the hands of our customers as quickly as possible. As “contract-first” practitioners, we want to get a defined model into the hands of our UI devs quickly, allowing them to work through the user experience design without having to wait for all of the logic to be worked through. As “resiliency-first” practitioners, we build confidence that what we have built will self-heal, be available, and meet the expectations of our customers.
Remember: Faster is not always better, but we always strive to get better faster.