What does an ambitious, large-scale cloud migration look like? This past fall, the teams under Andy Rosenbaum and me finished a major redesign and migration of the systems powering Capital One’s mobile servicing platform. With an active mobile user base of tens of millions of customers, migrating one of our largest and most-used customer-facing applications was certainly no small task.
The Capital One Mobile project was much more than a ‘lift and shift’ or forklift effort. The project represented a full re-architecture of our orchestration systems, including many of the support tools around them, to allow us to take full advantage of the cloud and its offerings.
With this migration behind us, our team figured it would be a good opportunity to spend a moment sharing some of the lessons we learned along the way.
Great Engineers Already Have A Cloud-Ready Mindset
While advanced cloud offerings like AWS are amazing enablers (we’ll get to all that), most modern-day development practices seen in the cloud can be applied to many other infrastructure models as well. Building cloud-ready software creates healthy building blocks towards a full-scale move to the cloud. At the same time, it also introduces some pretty smart safeguards and features that can make managing your systems a little more bearable.
In the case of our mobile back-end, there were several aspects of our app that were ready for the cloud long before the cloud was ready for us. Here are the main enablers that prepared us and paved the way for our transition to the cloud:
- Microexperience App Design
The various experiences in our app were already being built as separately deployable, self-contained functional apps. The same constructs that exist today (our Edge layer [Netflix OSS Zuul] and Microexperiences) were all present prior to the cloud.
- Deployment Automation and Channel Switching
Prior to the cloud, our application was already leveraging mature software automation techniques such as fully automated app deployments and blue/green channel switching. While the ‘how’ behind these has changed since moving to the cloud, our capabilities, the CI/CD stack, and our use of Ansible as a configuration tool remain the same.
- Up and Downstream Failure Tolerance
One of the major features of our orchestration layer is its ability to intervene in the event of either upstream or downstream issues. Examples of upstream failure tolerance include feature toggling and throttling mechanisms. For downstream issues, our orchestration team leverages the Netflix OSS project Hystrix. This ensures all back-end services have a fallback, or degraded experience option, that can be triggered by either an automated or a human-initiated failover.
- Global Active/Active Resiliency
(For the doomsday prepper in all of us)
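To make the downstream fallback idea from the list above concrete, here is a minimal Python sketch of a circuit breaker with a degraded-experience fallback. It is not the Hystrix API (Hystrix is a Java library), and it is not our production code; the class name `CircuitBreaker`, the `forced_open` switch, and every threshold below are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical, simplified breaker; an illustration, not the Hystrix API."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # assumed value
        self.reset_after_s = reset_after_s          # assumed value
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None
        self.forced_open = False  # stands in for a human-initiated failover

    def is_open(self) -> bool:
        if self.forced_open:
            return True
        if self.opened_at is None:
            return False
        # Once the reset window has passed, allow a trial call through.
        return self.clock() - self.opened_at < self.reset_after_s

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While open, skip the back-end entirely and serve the fallback.
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each back-end request as `breaker.call(fetch_live_data, serve_degraded_view)`, where both function names are placeholders. Setting `forced_open = True` models an operator pushing everyone to the degraded experience by hand.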
First Things First
As I outlined above, our application was fairly cloud-ready from the start. This meant redesigning it for the cloud was also fairly straightforward. From an infrastructure standpoint, we wanted to explode out our microexperience design to allow for:
- Each component to be deployed separately.
- New versions to be deployed alongside their elder counterparts.
- Unique individual Zuul clusters to handle the filtering of disparate inbound applications.
- Each of these to deploy to its own separate infrastructure (a single set of auto-scaled instances per service).
For example, above is Capital One’s Mobile environment post-AWS redesign (simplified). Prior to this effort, components such as the EDGE Routers and the supporting microexperiences existed; however, they were all contained within the same infrastructure. Post-AWS migration, each of these now resides in its own separately scaled/managed/tuned infrastructure.
Digging in Deeper
Making the above design a reality was a ton of fun for our teams. While we hit our fair share of obstacles, the majority of visible work was easily completed early in the effort. While implementing this phase we:
- Replaced our corporate GTM solution with Route53.
- Created a blue/green deployment approach leveraging ASGs.
- Built a pattern for deploying single services rather than our full stack.
- Created 2–3 custom AMIs that baked in all our server-level dependencies.
- Moved our data stores to RDS.
(and probably did some other really important/impressive things that the team is going to get mad at me after this gets published for leaving out.)
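As a rough illustration of the per-service blue/green pattern listed above, the following Python sketch models two instance groups and a cutover that only happens once the new group is fully healthy. It is a toy model under stated assumptions, not our Ansible or AWS tooling: `InstanceGroup`, `BlueGreenService`, and their fields are hypothetical names, and a real ASG-based deployment would query actual instance health rather than a counter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceGroup:
    """Stand-in for one auto-scaled group running a single app version."""
    version: str
    desired: int
    healthy: int = 0

    def ready(self) -> bool:
        return self.desired > 0 and self.healthy >= self.desired


@dataclass
class BlueGreenService:
    """One service with a live group and an optional standby group."""
    live: InstanceGroup
    standby: Optional[InstanceGroup] = None

    def stage(self, version: str) -> InstanceGroup:
        # Bring up the new version next to the live one, sized to match it.
        self.standby = InstanceGroup(version=version, desired=self.live.desired)
        return self.standby

    def switch(self) -> bool:
        # Refuse to cut over until every standby instance reports healthy.
        if self.standby is None or not self.standby.ready():
            return False
        self.live, self.standby = self.standby, self.live
        return True

    def rollback(self) -> bool:
        # The previous group is kept around, so rolling back is another swap.
        return self.switch()
```

Because each service owns its own pair of groups, one microexperience can be staged, switched, and rolled back without touching the rest of the stack.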
A Little Bit Farther Down the Rabbit Hole…
Like most projects, the last 10% of work for our migration contained a few underestimated challenges.
The first was performance testing. It’s not that we didn’t have extensive performance data, test scripts, mocks and such (we definitely did), but we’d never had this level of control over our infrastructure sizing/tuning in the past. As a result, some pretty amazing engineers were tasked for a few weeks with tuning and testing activities: setting new Hystrix thresholds, tuning Tomcat/HTTP/Apache, tuning ASG scale up/down events (avoiding loops, factoring in cooldowns, etc.), and finding/fixing small pre-existing performance flaws in our application (that we would have NEVER found on-prem due to our over-provisioned hardware).
In fact, halfway through we discovered our corporate mocking software couldn’t handle the sheer amount of performance testing we were running as part of this effort (we completely crushed some pretty industrial enterprise software in the process). As a result, we made the call to move the entire program over to a Mountebank OSS-based solution with a custom provision to give us the ability to expand/shrink our mocking needs on demand.
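The ASG tuning mentioned above (avoiding scale up/down loops and factoring in cooldowns) can be sketched with a small simulation. This is only a simplified model, assuming a single average-CPU signal, one-instance steps, and a shared cooldown; the names `ScalingPolicy` and `Scaler` and all of the numbers are hypothetical, not the thresholds we landed on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingPolicy:
    # All values are illustrative assumptions.
    scale_up_cpu: float = 70.0    # add an instance above this average CPU %
    scale_down_cpu: float = 30.0  # remove an instance below this average CPU %
    cooldown_s: float = 300.0     # quiet period after any size change
    min_size: int = 2
    max_size: int = 10


class Scaler:
    def __init__(self, policy: ScalingPolicy, size: int) -> None:
        self.policy = policy
        self.size = size
        self.last_change_at: Optional[float] = None

    def observe(self, now: float, avg_cpu: float) -> int:
        """Feed one metric sample; return the group size after any action."""
        p = self.policy
        in_cooldown = (self.last_change_at is not None
                       and now - self.last_change_at < p.cooldown_s)
        if in_cooldown:
            # Ignore metrics while the last change is still settling,
            # which is what prevents scale-up/scale-down flapping.
            return self.size
        target = self.size
        if avg_cpu > p.scale_up_cpu:
            target = min(self.size + 1, p.max_size)
        elif avg_cpu < p.scale_down_cpu:
            target = max(self.size - 1, p.min_size)
        if target != self.size:
            self.size = target
            self.last_change_at = now
        return self.size
```

The gap between the two CPU thresholds and the cooldown window work together: a brief dip right after a scale-up does not immediately undo it.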
In addition to performance work, moving to the world of AutoScaling and treating servers like ‘Cattle’ created some unique challenges. When we first started, we knew Ansible (which is agentless and push-based, if you’re not familiar with the tool) would need a bit of supplementation. We went with Ansible Tower to fill the gap. Through Tower, we were able to handle the post-provisioning needs of any new server created by initial provisioning or an AutoScale event (items like deploying the correct app, registering any monitoring agents, etc.). In addition to Tower, towards the end of our effort we saw a need for app-level service discovery (our very first example was getting new services wrapped in Hystrix to register with Turbine, another part of the Netflix OSS stack). This led us to incorporate Netflix’s Eureka service discovery engine, solving one of the final pieces of our cloud migration.
One small note: not all of this was 100% wrapped up before we started introducing user traffic. For us, it was imperative that we get live users over to the new design while our optimization work and final touches were still under way. This way, we could weed out any new issues not caught through our traditional tests. In order to migrate in a phased approach, we took the following steps:
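For readers new to app-level service discovery, the core idea can be sketched as a registry where instances announce themselves, renew with heartbeats, and drop out when they stop renewing. The Python below is a minimal in-memory illustration only. It is not the Eureka client or server API, and the `ServiceRegistry` name, its methods, the lease length, and the addresses are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List, Tuple


class ServiceRegistry:
    """Toy registry: instances stay listed only while they keep renewing."""

    def __init__(self, lease_s: float = 90.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.lease_s = lease_s  # assumed lease length, for illustration
        self.clock = clock
        self._last_seen: Dict[Tuple[str, str], float] = {}

    def register(self, app: str, address: str) -> None:
        # Called by a freshly provisioned instance once its app is up.
        self._last_seen[(app, address)] = self.clock()

    def heartbeat(self, app: str, address: str) -> None:
        # Renewing a lease is the same bookkeeping as registering.
        self.register(app, address)

    def deregister(self, app: str, address: str) -> None:
        # Called on a clean shutdown, such as a scale-down event.
        self._last_seen.pop((app, address), None)

    def instances(self, app: str) -> List[str]:
        # Only instances with an unexpired lease are returned to callers.
        now = self.clock()
        return sorted(addr for (name, addr), seen in self._last_seen.items()
                      if name == app and now - seen < self.lease_s)
```

With auto-scaled ‘Cattle’, the lease expiry matters as much as registration: an instance that is terminated without a clean shutdown simply ages out of the list.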
1. Replace our enterprise GTM solution with Route53.
(Note, this was done well in advance of any other migration activities and gave our project team instant control over user traffic for all future activities.)
2. Focus on single availability zone deployment, beginning with a large-scale internal pilot to identify production kinks.
3. Quickly move 1% of our production traffic to the aforementioned single AWS region deployment.
(Honestly, this probably represented the biggest milestone of our project. In addition to feeling like a solid win, it also helped us discover several load-driven nuances with our production infrastructure.)
4. Slowly convert our application’s traffic until we were spanning 50% in our data center and 50% in a single AWS region, setting engineering goals for every major traffic push up to 50%.
5. Engineer second region deployment and repeat Steps 3 and 4.
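The phased traffic ramp in the steps above (1% first, then gradual pushes up to 50%) can be illustrated with a small percentage-based router. This is a hash-based stand-in written for clarity, not a description of how Route53 splits traffic or of our actual cutover mechanics; the function names, the bucket count, and the `"aws"`/`"datacenter"` labels are hypothetical.

```python
import hashlib

BUCKETS = 10_000  # 0.01% granularity; an arbitrary choice for this sketch


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket so routing does not flip between calls."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def route(user_id: str, cloud_percent: float) -> str:
    """Send roughly `cloud_percent` percent of users to the new environment."""
    threshold = round(cloud_percent * BUCKETS / 100)
    # Users below the threshold go to the cloud; raising the percentage
    # only adds users, so nobody bounces back during a ramp-up.
    return "aws" if bucket(user_id) < threshold else "datacenter"
```

One property worth noting in this model is that the set of users routed to the cloud at 1% is a subset of the set at 50%, so each traffic push only adds exposure. Dropping the percentage back to 0 is the rollback.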
All in all, migrating our mobile servicing platform to AWS has been a tremendous experience. To reflect back on my earlier point about great engineers and the cloud mindset: to the folks who did this work, the move to AWS was a way of removing the constraints preventing them from developing their ideal application. The move was neither an edict nor a product goal; it was a way to build the best application using our best skills and tools.
Moving forward, Capital One Mobile (as well as the other consumer-facing products owned by our team) will have infrastructure that moves at a speed that keeps up with the needs of our product. In the months to come, we look forward to further exploring the advantages the cloud continues to bring, as well as redefining our own team definition of what it means to “ship early and ship often.”
Links to External / OSS Projects Highlighted in This Blog:
- Andy Rosenbaum’s overview of the Capital One Mobile Edge: https://medium.com/capital-one-developers/mobile-orchestration-innovation-on-the-edge-9835e4cbd69e#.1akzjvb51
- Zuul: https://github.com/Netflix/zuul
- Hystrix: https://github.com/Netflix/hystrix
- Eureka: https://github.com/Netflix/eureka
- Ansible: https://www.ansible.com/
- Tutorial on Managing ASGs with Ansible Tower: https://www.ansible.com/blog/autoscaling-infrastructures
- Mountebank: http://www.mbtest.org/