No Testing Strategy, No DevOps

This article is intended to help software and test engineers develop a “right-sized” testing strategy that meets the rapid pace of delivery and aligns with core principles of DevOps culture.

In DevOps circles, there’s a propensity to narrowly focus on tools that help automate; but right now, there is a looming crisis around how QA and testing fit into DevOps product lifecycles. How do teams obtain a sufficient view of potential risks that small, incremental changes introduce without holding up the flow of work? The answer lies in continuous improvement, and specifically for testing, a practice of tailoring testing strategy to shared goals across teams.

What’s In a Testing Strategy?

Simply put, a testing strategy is a plan for how you will accomplish quality goals through a set of activities. A testing strategy certainly includes the technologies used to perform testing, but if the whole answer is “we use Selenium and Jenkins”, there is a fatal blind spot: the strategy never says “why” or “for whom”.

In DevOps, the goal is to quickly deliver value to customers, and supporting activities are by default automated. But what’s missing is the “plan”. Without a plan, how do you know that your automated testing is helping to accomplish your goal?

A truly useful DevOps testing strategy needs to align to “quick” and “value” while inheriting a propensity for all things automated. Consider the following components of a test strategy:

  • Required Inputs: what code, binaries, docs, and priorities are required for testing?
  • Outcomes: what is a “successful” result of testing, and what information allows you to proceed with delivery vs. stop and fix something?
  • Risks: how will testing expose the risks that the team has prioritized visibility on?
  • Schedule: when is the feedback from various testing activities needed to support flow?
  • Human resources: who creates and supervises the automated testing, who owns the pass/fail results, and how is testing time factored into the team’s delivery cadence?
  • Technology resources: with what tools and on what infrastructure will the automated testing be performed; do these resources ensure that teams will have the right information to make a good decision?
  • Entry criteria: what triggers testing to begin, and when should teams hold off on starting?
  • Exit criteria: what triggers testing to end successfully or “fail fast”?
  • Impact / ROI: how does this testing reduce prioritized risk and technical debt while increasing customer/end-user experience?

These components help you tailor test plans to customer-focused work. The result is a strategy that sounds like:

“To deliver this valuable thing with [impact], we need [inputs/resources/entry criteria] to produce [outcomes] along the [delivery schedule] that ensures visibility on [risks].”

By answering these questions up front, teams greatly improve the odds that the automated testing activities they perform will efficiently help achieve the goal and avoid testing bottlenecks.
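As a concrete illustration, these components can be captured as a lightweight, versionable artifact that lives next to the code and gets reviewed like anything else. The sketch below is only that, a sketch: Python is used as a convenient notation, and the field names and the checkout example are invented, not prescribed by this article.

```python
from dataclasses import dataclass, field

@dataclass
class TestStrategy:
    """A lightweight, reviewable record of a team's testing strategy."""
    goal: str                                               # the valuable thing and its intended impact
    required_inputs: list = field(default_factory=list)    # code, binaries, docs, priorities
    outcomes: list = field(default_factory=list)            # what "successful" testing produces
    risks: list = field(default_factory=list)               # risks the team wants visibility on
    schedule: str = ""                                      # when feedback is needed to support flow
    owners: dict = field(default_factory=dict)              # who builds the tests, who owns pass/fail
    entry_criteria: list = field(default_factory=list)      # what triggers testing to begin
    exit_criteria: list = field(default_factory=list)       # what ends testing or "fails fast"

# A purely hypothetical example for a checkout feature.
checkout_strategy = TestStrategy(
    goal="Reduce cart abandonment by shipping one-click checkout",
    required_inputs=["feature branch build", "test accounts", "payment sandbox keys"],
    outcomes=["API contract tests green", "p95 latency under 300 ms at 2x peak load"],
    risks=["payment provider timeouts", "regression in existing checkout flow"],
    schedule="unit/API feedback per commit; load test results before the release candidate",
    owners={"unit": "feature devs", "load": "performance engineer", "release gate": "whole team"},
    entry_criteria=["CI build green", "test data refreshed"],
    exit_criteria=["zero open critical defects", "latency budget not exceeded"],
)

print(checkout_strategy.risks)
```

Writing the strategy down in a reviewable form, whatever the format, is the point; it turns “why” and “for whom” into something the whole team can see and challenge.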

Manual Testing Just Doesn’t Fit in DevOps

I’m just going to say it, I’m going to put it out there: I don’t see how traditional manual testing can survive as it is now with the rise of DevOps. Things that are manual cost the same amount each time they’re performed, are often error-prone, and lack auditability. In DevOps, no-scale, high-risk operations like manual testing are out.

However, the goal is to build things for other people: apps, APIs, tools…they all eventually get used by real users one way or the other. As such, engineering teams perform “validation”, in essence asking “did we build the right thing?”…a question that (for now) automated metrics alone can’t answer.

Verification, on the other hand, asks the question “did we build it right?” These tasks can and indeed must be automated as much as possible, which is entirely doable with the right process, skills, and technology in place. These tests are wildly inefficient to perform manually, and the compounding efficiency of automation in regression testing further relegates manual testing to an exception rather than a run-rate activity.

Sure, there are temporary exceptions like mobile fingerprint sign-in, checking for screen flicker or video tearing, visual validation…but these are either soon to be automated (often by contributors to open source) or easy-to-compartmentalize activities through some workaround.

The market for manual testing is shrinking. However, the demand for skilled test engineers continues to grow, because a great user experience is a more critical differentiator now than ever. Speed and quality are inextricably tied, as works like The Goal by Eliyahu Goldratt teach us; only by applying a continuous improvement “Kaizen” mindset to their work can teams expect to turn the challenges of software delivery into wins for the whole team.

Testing Anti-patterns In “DevOps Theater”

For the sake of argument, let’s walk through a typical feature release lifecycle in high-velocity teams aspiring to “do DevOps”. A few developers accept a user story from their kanban board and get to work pair programming. After hours of prototyping, designing, coding, and running it locally, they commit changes. At the same time (maybe), various automated tests are written to allow the work to be both verified and validated in a continuous delivery pipeline. Then they demo to the internal team (maybe) and schedule for release.

Ideally, teams write unit and integration tests to verify that their feature works as designed. Code coverage metrics from running these tests help the team identify potentially high-risk areas of code, which becomes especially important as the team accelerates its flow of work. A few intelligently chosen functional and performance tests run on an end-to-end system (could be UAT, staging, or even production) help to articulate how work should be validated in automated pipelines. These E2E tests also get consumed by monitoring tools to provide real-time synthetic feedback about the feature as demand and infrastructure change down the line.
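To make the split between fast verification and slower end-to-end validation concrete, here is a minimal sketch assuming pytest and the requests library; `calculate_discount`, the `e2e` marker, and the staging URL are hypothetical stand-ins rather than prescriptions:

```python
import os

import pytest
import requests

def calculate_discount(subtotal: float, promo_code: str) -> float:
    """Hypothetical unit under test: apply a flat 10% discount for SAVE10."""
    return round(subtotal * 0.9, 2) if promo_code == "SAVE10" else subtotal

def test_discount_applied():
    # Fast, isolated verification ("did we build it right?"); runs on every commit.
    assert calculate_discount(100.0, "SAVE10") == 90.0

def test_no_discount_for_unknown_code():
    assert calculate_discount(100.0, "BOGUS") == 100.0

@pytest.mark.e2e  # register this marker in pytest.ini to avoid warnings
def test_checkout_service_is_healthy():
    # Slower end-to-end check against a deployed environment, selected only in the
    # pipeline stage that has a real system available (URL is a placeholder).
    base_url = os.environ.get("E2E_BASE_URL", "https://staging.example.com")
    response = requests.get(f"{base_url}/health", timeout=5)
    assert response.status_code == 200
```

Running `pytest -m "not e2e"` on every commit and `pytest -m e2e` after a deployment keeps the commit-time feedback loop fast, and the same marked tests can later double as synthetic monitors.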

Now DevOps test strategy nirvana has been reached, visualizing important metrics from code all the way to running software, with optimal autonomy and minimal re-work. Sounds really good, right? Not to me. Says Dev: “I don’t want to think ahead, I want to stay autonomous and let my code flow in the wind as I move on to the next task firing incremental releases into the sunset.”

Automated Testing Without a Strategy Falls Apart

Even when done well, automated testing is a mountain of work, and that’s with people who really know what they’re doing! The role of QA in DevOps transforms to become part consultative, part programmer, part business analyst, and part hacker. Really good testing and release engineers continuously upgrade their skills, drive alignment, increase group learning, and deliver fast, useful feedback to their product teams.

“The mantra my team follows is ‘Test Early, Test Automatically, and Test Often’”, says Aashish Kapoor, Sr. Testing Specialist at Capital One. “In order to successfully follow that mantra, we identified patterns in our work and wrote automated regression code in such a way as to enable the entire team to set up test automation suites just by writing the configuration file. Overall time to produce suites is reduced significantly. We have also been following a practice of writing test suites even before starting actual application development, in order for tests to be available as you develop.”
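The quote doesn’t include Capital One’s actual implementation, so the following is only a plausible sketch of what “suites just by writing the configuration file” can look like: a declarative file of cases that a parameterized test expands into a regression suite. The file name, fields, and endpoints are all assumptions.

```python
# regression_suite.py -- a sketch of config-driven regression testing, not Capital One's
# actual implementation. Each entry in a (hypothetical) suite_config.yaml becomes one test:
#
#   - name: get-accounts
#     url: https://api.example.com/accounts
#     expected_status: 200
#   - name: get-rates
#     url: https://api.example.com/rates
#     expected_status: 200

import pytest
import requests
import yaml  # pip install pyyaml

def load_cases(path="suite_config.yaml"):
    with open(path) as f:
        return yaml.safe_load(f)

@pytest.mark.parametrize("case", load_cases(), ids=lambda c: c["name"])
def test_regression_case(case):
    # One generated regression check per configured endpoint.
    response = requests.get(case["url"], timeout=10)
    assert response.status_code == case["expected_status"]
```

The payoff of this style is that adding coverage becomes an edit to a config file that anyone on the team can make, rather than new test code that only a few people know how to write.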

In these types of environments, automation is an imperative that the entire team takes part in building together. Testing is a group effort that benefits everyone involved.

The Real Costs of Testing Anti-patterns

Back to the fictitious “nirvana” pipeline. Unfortunately, since those automated end-to-end (E2E) tests take a few minutes rather than a few seconds to perform, they often get pushed to a “hardening” stage, forcing the team either to reduce the amount of testing and thereby increase potential risk, or to slow delivery to production until testing can occur. Adding insult to injury, the already slow E2E suite often throws false positives until tests get marked as flaky or simply turned off to get through a release. Since there’s no time to go back and fix them, there are now quality gaps that let defects leak through to users.
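A lower-cost alternative to switching flaky tests off entirely is to quarantine them behind an explicit marker, so the coverage gap stays visible and scheduled for repair instead of silently forgotten. A minimal sketch, again assuming pytest; the test, the `search` stand-in, and the ticket reference are hypothetical:

```python
import pytest

# A custom "quarantine" marker keeps a flaky test in the codebase and on the radar,
# but out of the blocking release gate. Register it in pytest.ini, for example:
#   [pytest]
#   markers =
#       quarantine: flaky test excluded from the release gate, tracked for repair

def search(query):
    # Stand-in for the real system under test so this sketch is self-contained.
    return ["placeholder result for " + query]

@pytest.mark.quarantine  # hypothetical ticket QA-123 tracks the intermittent timeout
def test_search_returns_results():
    assert len(search("running shoes")) > 0
```

The blocking release gate runs `pytest -m "not quarantine"`, while a separate, non-blocking job runs the quarantined tests and reports their trend, so the team at least measures the gap it has accepted.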

In terms of performance and scalability, migrating to the cloud was sold as a way to stop worrying about performance. Yet the past few heroics on production incidents were related to unexpected drops in API performance, infrastructure boundaries impacting end-to-end latency, and suboptimal data queries. This stuff isn’t as easy as typical story points, and often requires subject-matter expertise across architecture, development, database, infrastructure, network management, and business analytics. So the ball gets punted again, just like the flaky tests that were disabled to get the last release out “on time” (a few days late anyway).
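One way to pull this class of problem forward in the pipeline is a small latency-budget check that runs long before a full load test. This is only a sketch, not a substitute for purpose-built load testing; the endpoint, sample size, and 300 ms budget are invented for illustration.

```python
import time

import requests

# A minimal latency-budget smoke check. The endpoint, sample count, and p95 budget
# below are placeholders; tune them to your own service-level objectives.
ENDPOINT = "https://api.example.com/v1/accounts"
P95_BUDGET_SECONDS = 0.300
SAMPLES = 20

def measure_latency(url: str, samples: int) -> list:
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        requests.get(url, timeout=5)
        timings.append(time.perf_counter() - start)
    return timings

def test_api_latency_within_budget():
    timings = sorted(measure_latency(ENDPOINT, SAMPLES))
    p95 = timings[int(len(timings) * 0.95) - 1]  # rough 95th percentile of the samples
    assert p95 <= P95_BUDGET_SECONDS, (
        f"p95 latency {p95:.3f}s exceeds {P95_BUDGET_SECONDS:.3f}s budget"
    )
```

Even a crude check like this establishes a baseline, so a sudden drop in API performance shows up as a failed pipeline stage rather than a production incident.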

When there are no load or security tests, and when UX metrics aren’t available as baselines, software teams accept future pain. When people say “we’ll cross that bridge when we come to it, otherwise it’s not lean”, they’re often deferring critical decisions, not simply deferring commitment as the lean management gods dictate. And though there’s data showing that faster teams deal with performance issues up front, being too buried under feature requests and production firefights to make time for “extra” quality-related improvements kills velocity.

The result is that teams spend more time on re-work, fighting fires in production, and playing bumper cars with emergent system behaviors, and less time on new features and improvement work.

QA in DevOps: Align Testing Activity with Business Priority

One-size-fits-all testing never works. No matter how many unit tests you write and run, they won’t definitively answer the question “why isn’t it working on my user’s phone?” Take the case where mobile devices account for over half of a web app’s usage, and analytics drop-offs correlate with buggy page components and slow response times.

To figure out what really matters, quality must be framed in terms that are valuable to the business model. Root-cause analysis of lost users and revenue is a good start, but it comes too late. A well-tailored testing strategy provides positive ROI in terms of saved re-work (i.e. the cost of defects), visibility over trends in quality, and shared understanding about the aggregate, latent, and emergent properties of systems.

Different types of tests ask different questions, incrementally incorporating reality into the results. Unit tests help a development team know when they’ve broken something fundamental, but functional testing on real platforms and performance testing are the only ways to know, before release, that new features and fixes actually accomplish their goal.

And what is the goal again? Oh yeah, in DevOps that’s: “quickly deliver value to customers”. Only when teams are aligned to the priorities of the business can they accomplish any part of this goal. The testing strategy must be tailored in order to address not only the development team, but to business stakeholders as well.

“On the DevExchange Gateway scrum team, we honor a WSJF-prioritized backlog that includes features from the business and tech”, says Jeff Michel, Sr. Manager, Software Engineering at Capital One. “We strive in planning and execution to deliver the features in priority order while also seeking to honor date constraints. Feature implementations are developed on contributor-forked repos and are submitted as pull requests (PRs). Our code review process requires passing integration tests from the PR author, included as part of the PR, as a prerequisite to merging into the GitHub repo from which we deliver. We integration and performance test a feature as part of our pre-production CI/CD pipeline and our Blue-Green production deployment process, leveraging the tests that were provided with the PR for the feature.”
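The pattern in that quote is worth sketching, though the code below is not the team’s actual implementation: integration tests submitted with a PR are parameterized by a target base URL, so the same checks can be re-run against pre-production and against the idle stack in a Blue-Green deployment before traffic is switched. The environment variable and route here are hypothetical.

```python
import os

import requests

# The same integration checks submitted with a PR can be pointed at whichever
# environment is being validated: a local stack, pre-production, or the idle stack
# in a Blue-Green deployment before cutover. Variable name and route are placeholders.
BASE_URL = os.environ.get("TARGET_BASE_URL", "http://localhost:8080")

def test_exchange_rates_contract():
    response = requests.get(f"{BASE_URL}/api/v1/exchange/rates", timeout=10)
    assert response.status_code == 200
    assert "rates" in response.json()  # the contract this (hypothetical) feature promises to keep
```

In a pipeline, the only thing that changes between stages is the target (for example, `TARGET_BASE_URL=https://green.internal.example pytest`), which is what lets one set of tests serve the PR gate, pre-production validation, and the production cutover check.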

From Testing Activity to Strategy: Frame Your Flow

Your next steps depend on what your goals are, where you are on that journey, and what’s already in place. This is why it’s important to map out your testing strategy using a flow of work: maybe your pipeline, your code promotion trigger events, high-risk areas of your business model, or even your customer journey.

A pipeline view of testing activities might look something like this:

[Figure: a linear flowchart mapping testing activities across the stages of a delivery pipeline]

Now that I’ve hopefully convinced you of the importance of a testing strategy, I must add that it’s not enough to have a “testing strategy” unto itself. Testing is really just a set of activities that provide useful feedback to development, operations, and product management. The real goal of testing is to improve each “work center” with that useful feedback, internalizing “quality” at each stage and catching issues as early as possible to reduce cost and risk. The way to successfully drive this outcome is to embed testing activities into each stage of an automated pipeline.

Some people work better with tables and lists. From a release management perspective, a testing coverage map across trigger events in continuous integration might look more like this:

[Table: testing coverage mapped across trigger events in continuous integration]
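Since the original table isn’t reproduced here, a hedged illustration of the idea is a simple mapping from trigger events to the testing each one kicks off; the rows below are examples to tailor, not a recommendation:

```python
# An illustrative mapping of CI trigger events to testing activities -- example rows
# to tailor to your own pipeline, not the article's original table.
coverage_map = {
    "pull request opened":   ["static analysis", "unit tests", "code coverage report"],
    "merge to main":         ["unit tests", "API/integration tests", "contract tests"],
    "deploy to staging":     ["end-to-end functional tests", "smoke tests", "baseline load test"],
    "release candidate cut": ["full regression suite", "performance and soak tests", "security scan"],
    "deploy to production":  ["blue-green smoke tests", "synthetic monitoring enabled"],
}

for trigger, activities in coverage_map.items():
    print(f"{trigger}: {', '.join(activities)}")
```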

Keep in mind that tailoring and a Kaizen mindset produce different outcomes in every situation. The important part is for your whole team to see and actively engage in the evolution of your testing strategy. A few ways to collaborate on this are:

  • Dedicate a wall near your story/kanban boards to visualizing your testing strategy, highlighting where test activities are critical parts of various feedback loops
  • Ask contributors where they need better feedback, more useful vs. more often; work with testing SMEs to apply “program thinking” to these improvement projects
  • Take an active part in team retrospectives and improvement sessions; come to the meeting with insights and recommendations
  • Highlight gaps in the testing strategy: areas of low coverage and resource constraints where you have evidence that they contribute to defect escape
  • Identify bottlenecks in flow around work centers: feature teams, long-cycle testing activities, Dev and Ops MTTR on fixes

Conclusion: Arrive Together, Stay Together

The best and most fulfilling engineering teams I’ve worked with have these two things in common: they make decisions transparently and hold themselves accountable to execute efficiently.

I can’t overstate how important it is to ‘bring people with you’ as decisions are made. No one likes to be dictated to about how they should work. An important part of getting everyone rowing in the same direction is to involve people in the decision process early. A testing strategy that includes thoughtfulness about quality as part of the planning process does exactly that; it gets people’s brains engaged about what we’re really building and who we’re building it for before anything is actually built.

Each team member must hold themselves accountable to what was agreed upon upfront (i.e. what does ‘Done’ mean for each task, which feature plans should include shallow vs. deep performance testing, what to do when new code breaks SLAs, etc.). If expectations are too high on a particular aspect of the team’s definition of done, provide the feedback in a retro that shows the impact on your flow of work.

‘Right fit’ in engineering is always a work in progress. When teams openly discuss improvements to their testing strategy and align their testing activities to the priorities of the business, the byproduct is a ‘right fit’ strategy, and learning becomes a new cultural norm.

About the Author

Paul Bruce is a DevOps Advisor, helping to transform enterprise software teams and delivery practices. He currently works with the Neotys team as a Sr. Performance Engineer and is a Founder at Growgistics. His research wheelhouse includes cloud management, API design and experience, continuous testing at scale, and organizational learning frameworks. He writes, listens, and teaches about software delivery patterns in enterprises and key industries around the world. You can learn more at: http://paulsbruce.io

