This is the second in a multi-part interview series with Capital One’s VP of cloud strategy Bernard Golden as he engages with the most important topics in cloud computing. For this second piece, Golden offers advice on preventing “day two” problems as enterprises take cloud-native applications from deployment to seamless functionality. You can read Golden’s first interview here.
Putting a new application into production is a big step. Ahead of its deployment, we focus intently on such important matters as whether the code is airtight and whether we have tested the entire application for vulnerabilities.
But just as important is the work we do after the application’s launch. This is the concept of “day two,” which refers to the problems an organization may face in the days and weeks after an application goes live. It’s not a new idea, of course. But in a world of cloud-native applications with application architectures made up of free-floating microservices, managing all that complexity requires new approaches.
Here, Bernard Golden, vice president of cloud strategy at Capital One, offers a checklist of “day two” issues to prepare for as you work to get new applications in your enterprise from deployment to full, seamless functionality.
Q. What does the term “day two” refer to?
It’s a term that addresses what comes after you launch an application, whether it’s brand new or a functional improvement to an existing application. You've defined what the functionality is going to be. You’ve created the code. You’ve done the build and you’ve tested it, and now you've actually deployed the application and it’s up and running.
The question now is how does it operate in the real world? Does it run satisfactorily, or is it buggy? Can it handle the loads placed on it? Is the functionality comprehensible to the end user?
Q. Why is this stage so important?
The functionality of your application is really difficult to determine until you actually expose it to “real life” traffic. When that happens, you get answers to some fundamental questions: Is the functionality comprehensible for the user? Does the application stay up in the face of significant user load? Is it resilient in the face of resource failure?
Day two is when users decide whether or not your product delivers value to them. Everything beforehand is just preparation. If you don't get day two right, it doesn't matter how well you did with all the other stuff. You can think of it as getting the play ready for the opening night. When the show starts and the curtains rise, the audience members decide if they like the play, and that’s what tells you if it’s a hit or a flop!
Q. What do you need to focus on to do day two well?
First of all, you need to put together an incremental rollout strategy to determine the performance of a product before you expose it fully to your customers. There’s commonly a technical practice around that. You can do it with load balancing, or you can do it with feature flags — a software development technique that turns certain functionalities on or off, without requiring you to deploy new code.
Let’s say you have a pool of 10 executables, and you turn on feature flags for them, and then you send, say, 20 percent of your traffic to those executables. That way you get a sense of how well that traffic performs versus the remaining 80 percent. By looking at the statistics you can start to tell if the new functionality is operating properly and delivering value. If you’re directing 20 percent of your traffic to one spot, and there’s a huge drop-off in the use of that feature, clearly there's something not right with the functionality. Using feature flags also lets you do A/B or canary testing to evaluate the success of your deployment.
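To make the 20 percent idea concrete, here is a minimal Python sketch of a percentage-based feature flag. It is an illustration only, not a description of how Capital One does it: the function name `in_rollout`, the flag name `new-checkout`, and the choice to bucket individual users by hashing their IDs (rather than routing traffic to a flagged pool of executables through a load balancer) are all assumptions made for the example.

```python
import hashlib


def in_rollout(user_id: str, flag_name: str, percent: int) -> bool:
    """Return True if this user falls inside the rollout percentage.

    Hashing the flag name together with the user ID gives each user a
    stable bucket from 0 to 99, so the same user always sees the same
    variant and no new code has to be deployed to change the percentage.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < percent


def rollout_share(user_ids, flag_name: str, percent: int) -> float:
    """Fraction of the given users who would see the flagged feature."""
    user_ids = list(user_ids)
    if not user_ids:
        return 0.0
    enabled = sum(in_rollout(u, flag_name, percent) for u in user_ids)
    return enabled / len(user_ids)


if __name__ == "__main__":
    # Hypothetical user IDs and flag name, for illustration only.
    users = [f"user-{n}" for n in range(10_000)]
    share = rollout_share(users, "new-checkout", percent=20)
    print(f"share of users seeing the new feature: {share:.1%}")
```

Raising `percent` from 20 toward 100 widens the rollout step by step, and setting it to 0 switches the feature off for everyone, which is the "on or off without a new deployment" property described above.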
When functionality fails you need a rollback capability. Perhaps you thought a feature would work, but when it gets out there it’s actually broken. A user clicks on a link and gets a “404 error” message, for example. Or the application doesn't do what you want it to do. As a result, you’re getting much lower click-through rates, or nobody's buying what you’re selling. The amount of time people are spending on the site is dropping precipitously. If these things happen, you want to be able to roll back from the new product iteration to the previous version, which had either operated properly, or had better success.
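The rollback decision described here can be sketched as a simple comparison between the traffic that sees the new version and the traffic that does not. The Python below is a rough illustration under stated assumptions: the `CohortStats` type, the `should_roll_back` function, and every threshold value are hypothetical, and a real system would pick its own metrics and limits.

```python
from dataclasses import dataclass


@dataclass
class CohortStats:
    """Counts collected for one slice of traffic (hypothetical shape)."""
    requests: int
    errors: int        # for example, responses such as a 404
    conversions: int   # for example, click-throughs or purchases

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0

    @property
    def conversion_rate(self) -> float:
        return self.conversions / self.requests if self.requests else 0.0


def should_roll_back(
    baseline: CohortStats,
    canary: CohortStats,
    max_error_increase: float = 0.02,   # assumed limit: +2 percentage points
    max_conversion_drop: float = 0.25,  # assumed limit: 25% relative drop
    min_requests: int = 500,            # assumed minimum sample size
) -> bool:
    """Decide whether the new version looks worse than the previous one."""
    # Too little traffic to judge yet; keep collecting data.
    if canary.requests < min_requests:
        return False
    # The new version is visibly broken: many more errors than before.
    if canary.error_rate - baseline.error_rate > max_error_increase:
        return True
    # The new version works, but users respond to it much less.
    if baseline.conversion_rate > 0:
        drop = (baseline.conversion_rate - canary.conversion_rate) / baseline.conversion_rate
        if drop > max_conversion_drop:
            return True
    return False


if __name__ == "__main__":
    previous = CohortStats(requests=8000, errors=40, conversions=400)
    new = CohortStats(requests=2000, errors=100, conversions=100)
    print("roll back?", should_roll_back(previous, new))
```

The two checks mirror the two failure modes in the answer above: a feature that is outright broken, and a feature that runs but performs worse than the version it replaced.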
Another element you really need to have in place is observability, which means you need to know exactly what’s going on inside the application as it's operating. You typically need logging capabilities, but you also have to have telemetry so that you can crunch statistics on the kinds of errors you're seeing, or faulty operations.
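As a small illustration of the difference between logging and telemetry, the Python sketch below writes one structured log line per request and also keeps counters that can be turned into an error rate. It uses only the standard library; the `Telemetry` class, the logger name, and the rule that any status of 400 or above counts as an error are assumptions for the example, not a reference to any particular observability product.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("app.telemetry")  # hypothetical logger name


class Telemetry:
    """Keeps per-route counters alongside structured log lines."""

    def __init__(self) -> None:
        self.outcomes = Counter()   # (route, status) -> number of requests
        self.latencies_ms = []      # raw latencies for later statistics

    def record(self, route: str, status: int, latency_ms: float) -> None:
        """Log one request and update the counters used for statistics."""
        self.outcomes[(route, status)] += 1
        self.latencies_ms.append(latency_ms)
        logger.info(json.dumps(
            {"route": route, "status": status, "latency_ms": latency_ms}
        ))

    def error_rate(self, route: str) -> float:
        """Share of requests to a route that ended in a 4xx or 5xx status."""
        total = sum(n for (r, _), n in self.outcomes.items() if r == route)
        errors = sum(
            n for (r, status), n in self.outcomes.items()
            if r == route and status >= 400
        )
        return errors / total if total else 0.0


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    telemetry = Telemetry()
    for status in (200, 200, 200, 404):
        telemetry.record("/checkout", status, latency_ms=12.0)
    print("error rate for /checkout:", telemetry.error_rate("/checkout"))
```

The log lines answer "what happened on this request", while the counters answer "how often is this going wrong", which is the kind of statistic the answer above refers to.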
You’ll need an appropriate architecture to support all of this day two functionality. Typically, that means you are going to have horizontal scalability and redundancy. You will want architecture that supports feature flag testing, giving you the opportunity to roll these things out incrementally. And of course, you will need the ability to update your applications as they’re operating or roll them back. The right architecture is quite important in that regard.
SRE — or site reliability engineering — capability is also important. This used to mean that a bunch of engineers would watch systems to make sure an application is executing according to the right technical metrics. Now this capability is becoming a software discipline, and people are writing software to track telemetry and automatically respond to different kinds of errors. SRE functionalities are now the bedrock of cloud-native companies. Capital One is moving in that direction, too, because that's what you need to have in place to be good at managing day two. If you're not doing all these things, you could be in trouble because the disciplines have been prevalent among cloud-native companies for years. They are important tools for companies that strive to be cloud-native enterprises.
Q. What happens to companies that don’t focus on mastering day two issues?
The likelihood is they’re going to have more failures in their applications or, perhaps even more dangerous, poorer operation. Your application could be up and running, but running slowly, or it could be throwing out a lot of errors. Both scenarios are likely to annoy customers and make them contact your call center, which in turn could drive up operating expenses. And a subsequent decline in customer satisfaction, an important metric for your business, could mean you generate less revenue.
There’s starting to be a greater appreciation of day two problems and how to manage them. It’s not that yesterday’s practices were bad, it’s just that new applications that follow cloud-native patterns need new capabilities to meet these day two requirements.