Jan 15, 2019

4 Lessons From Scaling iOS CI/CD

From a Snowflake Build Machine to Empowering Hundreds of Mobile Engineers to Deliver Delightful Mobile Experiences

My iOS CI/CD journey started in 2013 when I was an engineer with Capital One Labs. While building a native iOS experience for redeeming rewards at the point of sale, our Labs engineering team needed a quick way to iteratively deliver our iOS application to our internal partners and leadership. We needed a shared build environment and automation tools to do this consistently. To address this need, we began experimenting with mobile CI, pulling together a single Mac Mini build machine and a Jenkins server leveraging the Xcode Build plug-in. Our snowflake build machine lived on the office scrum table, and we were proud to show off our setup to anyone who visited the Lab. This humble mobile CI beginning improved our ability to deliver our MVP iOS application.

Over time, our tooling stopped meeting the needs of the enterprise. We grew from a few co-located mobile engineers in 2013 to a few hundred distributed mobile engineers in 2016, but our tooling hadn’t scaled with us. It was still optimized for a small co-located team who could reach over and hit the power button on the build machine, not the large distributed engineering team we had become.

As a former iOS developer, I felt compelled to help take on the challenge of creating a mobile CI environment suitable for the scale of mobile engineering at Capital One. This journey led us to clarify some shared principles, which helped us simplify our direction and resolve disputes.

Looking back, these were the three largest challenges we faced, and working through them helped distill our shared mobile CI principles.

Challenge #1 — Environment Inconsistency

The first challenge we faced was environment inconsistency across our available hardware. To overcome this hurdle, we decided to leverage Mac VMs that run on our available Mac hardware. This allowed us to follow a “baking” process to add new versions of key iOS development tools (Xcode, SwiftLint, RVM, …) and then record that state (the Golden Master, our GM VM template). This stable known state (GM templates created, managed and supported by a central team) could then be deployed using automation workflow tools (Terraform, Ansible, etc.) across our available hardware.

The centralized nature of the execution environment was our foundational building block, allowing us to guarantee a consistent, validated build environment for all users and ensuring that changes were introduced safely and reliably.
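One way to picture a "stable known state" is as a manifest of pinned tool versions that every build VM can be checked against. The following is a minimal sketch under that assumption, not a description of the actual GM tooling: the manifest contents, the version numbers, and the `find_drift` helper are all hypothetical.

```python
# Hypothetical manifest for a golden master (GM) template: tool -> pinned version.
GOLDEN_MASTER = {"xcode": "10.1", "swiftlint": "0.29.2", "ruby": "2.5.3"}


def find_drift(expected, actual):
    """Return {tool: (expected_version, actual_version)} for every mismatch.

    A tool missing from the VM is reported with an actual version of None.
    """
    drift = {}
    for tool, version in expected.items():
        found = actual.get(tool)
        if found != version:
            drift[tool] = (version, found)
    return drift


if __name__ == "__main__":
    # Versions reported by a hypothetical build VM.
    vm_report = {"xcode": "10.1", "swiftlint": "0.27.0"}
    for tool, (want, got) in sorted(find_drift(GOLDEN_MASTER, vm_report).items()):
        print(f"{tool}: expected {want}, found {got}")
```

A check like this could run before a VM is admitted to the build pool, so that drift is caught by the central team rather than discovered through a failed build.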

Challenge #2 — Limited Capacity

The second challenge was tackling the growing pains our teams faced. The native iOS engineers we support internally require Xcode to compile and test their code, and Xcode currently requires Mac hardware. At that time, we only had three Mac build machines. This was extremely limited capacity given the large number of internal mobile engineers.

After a thorough search of our options, from fully hosted Mobile CI SaaS solutions to raw infrastructure, we decided to add iOS build/test capacity that could be tied to our existing internal version control and CI tooling.

To achieve this we ended up moving forward with two solutions in parallel:

  1. Purchase self-hosted Mac hardware capacity: Purchasing additional Mac hardware allowed us to move more quickly than our preferred IaaS solution.
  2. Onboard an IaaS provider (MacStadium): Onboarding a Mac IaaS provider gave us scalable underlying Mac hardware. As a large, security-focused company with lots of data compliance controls, it took us a few months to fully onboard MacStadium. The other public cloud providers that we explored didn’t offer Mac IaaS capacity.

By implementing both approaches in parallel, we began to fix the immediate capacity problem and made progress on our long-term need for scalable cloud-based iOS CI/CD infrastructure.

Challenge #3 — Inflexible CI/CD and Fragile Configuration

The third challenge largely stemmed from our tight coupling with Jenkins, which led to three key issues:

  1. Unversioned Build Scripts: We relied on hundreds of lines of unversioned shell scripts that were fragile and difficult to share effectively.

  2. Jenkins Plugins: Global plugin updates and complex manual configurations made us vulnerable to breakages and inconsistencies. This limited the speed and independence that teams could achieve since they had to wait on other teams to upgrade the shared Jenkins plugin and manually configure each job.

  3. Lack of cross-platform tools: Internal partners are responsible for delivering applications to the iOS and Android channels. This delivery requires interfacing with evolving external interfaces like Google Play, the Apple Developer console, and Apple App Store Connect.

Jenkins plugins were often slow to adapt to new features and fixes, so other tools were required to build and deliver our applications. They also provide little flexibility in testing and versioning across teams moving at different cadences. To address these concerns, we turned to Fastlane.

Fastlane describes itself on its GitHub repo as “The easiest way to automate building and releasing your iOS and Android apps.”

Fastlane is a very active open source project with 24k+ stars on GitHub. It is currently sponsored by Google, which employs much of the core dev team, and was previously sponsored by Twitter. The team is quick to respond to issues opened by the community and outside contributors.

Fastlane allowed our developers to:

  • Modify and version their CI process definitions and configuration within the codebase itself
  • Share processes and actions with other teams
  • Evolve and update tools and patterns at different times
  • Treat Jenkins as an executor and log collector while the interesting CI/CD patterns are defined in Fastlane
  • Interact with external dependencies reliably (e.g., App Store Connect, Google Play)
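The split between executor and pattern definition in the list above can be sketched as follows: the CI job does nothing more than run a lane that lives in the repository. This is a sketch under stated assumptions, not an actual job configuration. The `fastlane_command` and `run_lane` helpers and the `test` lane name are hypothetical, and `bundle exec fastlane <platform> <lane>` assumes Fastlane is installed through Bundler so that each repository pins its own Fastlane version in its Gemfile.

```python
import subprocess


def fastlane_command(platform, lane, use_bundler=True):
    """Build the command line a CI job would run for one Fastlane lane."""
    command = ["fastlane", platform, lane]
    if use_bundler:
        # Bundler resolves the Fastlane version pinned in the repo's Gemfile.
        command = ["bundle", "exec"] + command
    return command


def run_lane(repo_dir, platform, lane):
    """Run the lane inside a checked-out repository and return its exit code.

    The executor only runs the command and reports the result; the build,
    test, and delivery logic lives in the repository's Fastfile.
    """
    completed = subprocess.run(fastlane_command(platform, lane), cwd=repo_dir)
    return completed.returncode


if __name__ == "__main__":
    # "test" is a hypothetical lane name defined by the app team.
    print(" ".join(fastlane_command("ios", "test")))
```

Because the job only knows a platform and a lane name, a team can change what the lane does, or upgrade its pinned Fastlane version, without touching the shared CI server.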

Lessons Learned

During our journey, as we worked to overcome many challenges, including those listed above, some overarching principles emerged (listed below). I encourage you to evaluate these principles to see if you can apply them as you look to scale your own iOS CI/CD capability. Working with partners and engineers is paramount to ensure that they understand the guiding tenets of your CI/CD platform. Having these shared principles makes it possible to resolve disputes that arise during your CI/CD scaling journey.

  • Quality is more important than quantity or speed
  • Focus on building blocks, not turn-key solutions
  • Customers are the center of everything
  • Proper abstraction and clear ownership are critical

Let’s break these down one by one.

Quality is more important than quantity or speed

Nothing destroys an engineer’s confidence and trust in automated CI gates faster than unstable tests that prevent that engineer from delivering dependably.

Before we focused on test quality, we heard comments like these from frustrated engineers:

“I didn’t change that part of the code, but CI rejected my change.”

“I just re-ran and it passed the test that had just failed.”

This led us to encourage partners to institute a process of “testing the tests” and to adopt a production mindset for those tests. This required running each UI test 100 times to ensure consistent execution before the test was introduced to the larger development team as a gate.
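The idea of “testing the tests” can be sketched as a small stability check. This is an illustration only, not the harness that was actually used: the `is_stable` helper is hypothetical, and a real UI test would be launched through a test runner rather than called as a Python function.

```python
def is_stable(test, runs=100):
    """Return True only if `test` passes on every one of `runs` executions.

    `test` is any zero-argument callable that returns True on a pass.
    An exception counts as a failure, since it would also fail the gate.
    """
    for _ in range(runs):
        try:
            if not test():
                return False
        except Exception:
            return False
    return True


if __name__ == "__main__":
    import itertools

    # Hypothetical flaky test: fails on every 25th execution.
    counter = itertools.count(1)

    def flaky():
        return next(counter) % 25 != 0

    print(is_stable(lambda: True))  # a test that always passes
    print(is_stable(flaky))         # a test that fails intermittently
```

Stopping at the first failure keeps the check cheap for obviously flaky tests, while a test that is promoted to a gate has passed every one of its runs.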

Additionally, engineers often fall into CI/CD’s premature optimization trap, becoming too focused on the percentage of test coverage or the speed at which the tests execute. These optimizations are all useless if any of the tests are unstable. It would be the equivalent of adding floors or faster elevators to a skyscraper whose foundation was showing signs of crumbling.

After all, “premature optimization is the root of all evil.” (Donald Knuth)

Focus on building blocks, not turn-key solutions

In short, loosely coupled tools enable flexibility and innovation by allowing teams to compose different solutions to meet their needs. Too often we come across solutions that are intended to solve all the problems teams face, only to be adopted solely by the team that built them. This inflexibility becomes a massive issue if you are attempting to remove the bottleneck faced by your teams as they look to evolve at different speeds and in different directions.

Tight coupling, as mentioned earlier in the discussion of the challenges with Jenkins plugins, creates a complicated upgrade process where a consensus must be reached by all involved teams. Otherwise, distinct execution environments become a necessity. Instead, by using Fastlane and keeping CI definitions in its own codebase, each team has the flexibility to move at its own speed.

Additionally, teams are able to create functionality that doesn’t currently exist and contribute it back to the wider community using Fastlane plugins, which are Ruby gems that add specific functionality for use in Fastlane.

Customers are the center of everything

At the end of the day, why does any of this matter? We aim to provide the best possible digital tools to our internal customers (mobile engineers) so they can, in turn, ship the highest quality applications to our external customers. In this mobile-first world, happier customers are often the direct result of improvements in the quality of the mobile experience provided by the teams we support.

Customer service is key. We look to win over our partners by providing compelling capabilities, listening when we miss the mark, and helping when our partners are struggling.

Proper abstraction and clear ownership are critical

Enterprise DRY (“Don’t Repeat Yourself”) ensures teams can leverage common solutions and achieve higher overall output by not doing the same work over and over again in isolation. This enables teams to focus on delivering value to customers and not re-inventing a mobile CI/CD wheel.

Clear ownership of the underlying tools and infrastructure is best achieved by a small central team. This prevents duplication across isolated pockets in the enterprise, but still allows the app teams to own their CI/CD pattern definitions. This ensures they can move at the speed they need, and that those who best understand the needs of the individual application are empowered to make changes and support their app team’s needs.

Overall Results

Today, we run 3,000 to 4,000 daily mobile builds for Capital One’s domestic and international mobile engineering teams. These builds run on our internally racked Mac hardware, dedicated MacStadium IaaS (Mac hardware), and AWS (non-Mac infrastructure). We support the mobile engineers responsible for shipping a variety of iOS and Android applications to millions of Capital One’s customers. All of these mobile engineers are leveraging the same shared CI/CD tooling (but different versions of Fastlane, Xcode, etc.). This allows us to do our part in improving the experience millions of Capital One customers have with our award-winning mobile applications.

Hopefully, you will find this content useful as you work to deliver mobile applications to your own customers. Additionally, keep an eye out for a follow-up article that will be published in the next couple of weeks. It will go deeper into technical configuration best practices for teams that choose to use Fastlane with an open source command executor like Jenkins.

Alex Niderberg
Mobile Reliability Engineer, Capital One

DISCLOSURE STATEMENT: © 2019 Capital One. Opinions are those of the individual author. Unless noted otherwise in this post, Capital One is not affiliated with, nor endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are property of their respective owners.