Balancing CI/CD: Managing product development data
Achieving CI/CD balance while navigating the impact on the data product development process
June 21, 2018
The phrase “CI/CD Pipeline” has recently begun to feel like a magic word. The base concept practically oozes value — an automated pipeline to production enables the trifecta of getting an MVP to market faster, fixing bugs without organizational overhead, and iteratively adding new features. It is hard not to let the imagination run wild.
Mine certainly did. As a member of a three-person development team working on a data product, I’ve experienced CI/CD as a way of life. Faced with the choice between building a pipeline or devoting 33% of our development capacity to production deployments, my team decided to automate.
Unveiling effective CI/CD strategies for data products
While reaping the benefits of a mature pipeline, I’ve devoted a lot of mental space to figuring out the best way to think about CI/CD as it relates to data products, and I’ve landed on two distinct mental approaches. Only one of them leads in the right direction. The other, not so much.
First, let’s establish what the right direction is. To quote legendary 20th-century architect Louis Sullivan, we want to get to a place where form follows function.
Navigating the shift from centralized to nuanced architectures
(You know what’s better than alliteration? Alliteration from historical figures.)
For years, the most popular approach to data has been a centralized static warehouse that handles user queries and automated reports. With the rise of Apache Kafka, Amazon S3, Data as a Service models, and others, the number of workable data access patterns has exploded.
As a result, designing data platforms has become a nuanced art — the location, form, and cadence of the data all play a role in determining the architecture and functionality of the platform. This is phenomenally exciting, because it means that common failures, data-related or otherwise, can be iteratively ironed out of automated processes at a level below the queries. Jobs can behave differently if they fail based on a data load delay vs. an unexpected null value, instead of simply exiting.
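To make that last point concrete, here is a minimal sketch of a job runner that reacts differently to different failure modes. The exception classes and function names are hypothetical, invented for illustration — the idea is simply that a transient condition (data hasn’t landed yet) gets a retry, while a data-quality problem fails fast instead of the job simply exiting:

```python
import time

# Hypothetical error types for illustration -- not from any specific library.
class DataNotYetLoadedError(Exception):
    """Upstream data hasn't landed yet; the condition is transient."""

class DataQualityError(Exception):
    """The data arrived but is malformed (e.g. an unexpected null)."""

def run_job(job, retries=3, delay_seconds=60):
    """Run a pipeline job, handling each failure mode below the query level."""
    for attempt in range(1, retries + 1):
        try:
            return job()
        except DataNotYetLoadedError:
            # A delayed data load is transient: wait and retry.
            if attempt == retries:
                raise
            time.sleep(delay_seconds)
        except DataQualityError:
            # Bad data won't fix itself: fail fast so someone is alerted,
            # rather than retrying or silently exiting.
            raise
```

A real platform would route these branches to alerting or backfill tooling; the point is that the failure handling lives in the automated process, not in the queries themselves.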
Let me put it another way. Imagine two identical twins, Bob and Bill, who play different sports. Both extraordinarily athletic, they become professionals in their respective sports; football for Bob and tennis for Bill.
Playing such different sports will affect the way that they train. Because football emphasizes generating forward momentum in short bursts, Bob will likely have very strong legs. Bill, on the other hand, will likely focus on developing the strength of his grip, and as a result his playing forearm will become large, even compared to his other arm. Although their identical genes mean they have started in the same place, the demands of each twin’s profession will cause them to diverge.
What if Bob and Bill like each other so much that they decide to play on a basketball team together? As a result, 50% of each twin’s time is now spent training for basketball season, and their physical attributes naturally converge.
The hypothesis I mean to posit is that the second arrangement will result in Bob regressing as a football player, and Bill regressing as a tennis player. This is the setup we can run into when all data producers and consumers run through the same platform.
How this relates to CI/CD: Uncovering challenges and drawbacks
Any application, when stretched across too big of a range of needs, will become overwrought and seemingly erratic, like a pile of cords going from neatly-arranged to horrendously-tangled over time. As such, there’s a philosophical fork in the road when it comes to regarding CI/CD.
Down one path (which, for context, is the spooky one with storm clouds and evil-sounding wolves (as opposed to friendly-sounding wolves)), is a situation in which the relative ease of deployments stretches platforms beyond their limits.
Remember the monolithic data warehouse I mentioned earlier? This road leads right back to it.
Building new platforms is always going to be hard. If adding to an existing platform is always, by comparison, easy, you can see how the existing platforms would become overburdened. Think about the promise of CI/CD: Faster delivery, faster value. Given that central ideal, the development overhead to build a new platform for a new business need is a tough sell.
If CI/CD is making things so much faster, why will it take six months to get it up and running? This team over here has a platform that uses most of the same data…why can’t we just add to theirs?
These are the natural quandaries that arise when speed is the lone selling point, and over the long run they have the potential to cause monoliths to form.
Positive impact of CI/CD implementation: Enhancing efficiency and benefits
Down the other rainbows-and-sunshine-laden path, we have a data environment utilized by platforms whose underlying structures reflect the data needs they were built to address. Every build becomes an iterative process. Platforms and data coalesce around each other, and every build reaches its full potential. Monoliths are a thing of the past, replaced by a multitude of resilient, specialized products that interact behind the scenes.
CI/CD pipelines are integral to achieving this goal. One of the most damaging aspects of monoliths is that they require an elaborate support infrastructure — developers are constantly being pulled off of forward-moving projects to assist with production implementations and fixes. Automated pipelines to production allow developers to look forward, fixing things before they break rather than scrambling after they do. To have a truly healthy DevOps mindset, this needs to be championed alongside speed.
Monoliths have a way of appearing to solve problems despite actively causing them, but they are too big to contort; they simply can’t offer rapid iteration as a benefit. Finding the benefits of DevOps that are antithetical to a dangerously consolidated environment is a key facet of keeping the forward momentum going.
Making CI/CD more effective for data-driven product development
Most people have a cursory knowledge of the symbolism of the yin and yang, but one piece of its imagery I find compelling is the inverted-color dot on each side. The black dot on the white side symbolizes an old yang becoming a young yin, and vice versa.
DevOps can be framed as a manifestation of the concept of self-sufficiency in small groups. Its opposite force, then, is centralization. Like our two athletic twins, self-sufficient teams can be swept up in a desire to pursue multiple directions, ultimately consolidating them to no one’s benefit. As DevOps practices mature within your organization, pay attention to the logical extremes. Really productive teams will make it to the horizon, and it needs to be the right one.