The value of A/B testing in the era of machine learning
May 14, 2020
Direct experimentation is how we understand causes
An A/B test is a simple enough thing to understand. Show your current experience to half your visitors and offer an alternative experience to the other half; observe differences in performance, then either continue with the old one or switch all traffic to the new one. There are many best practices and subtleties between the lines here, but the process is intuitive.
What is often lost is the reason why we do this. If we’ve limited our changes to as few variables as possible, we can learn what actually causes changes in behavior. The ‘why’ is more challenging, but the ‘what’ becomes clear. In the complex, multivariate world of machine learning, finding causes is not the primary concern. Optimizing an objective function is. Therefore, for humans to learn and to create new ideas and build models that reflect the ideal world, A/B testing fills a valuable and lasting role.
Let’s look at A/B testing and machine learning, and explore some real-world applications of each, individually and in combination.
Better living through chemistry (medicine) is a result of rigorous A/B testing
The simple A/B test, or randomized controlled trial (RCT), is a mainstay of the product development process and can be thoughtfully explored through the example of developing new medicines.
As mentioned earlier, an RCT helps us understand opportunity and effect size accurately (and therefore ROI), and it can also illuminate causality, an area where machine learning has not yet matured. Establishing causality lets us put to rest the argument of ‘correlation vs. causation’ and understand whether our new medicine works as intended. Let’s look at how a drug trial is run, at its most basic.
In a placebo-controlled study, subjects are randomly assigned to one of two groups: they receive either the drug or a placebo. Both groups take the pill or other delivery vehicle as per instructions. The key here is that the groups are randomly assigned, so that any differences between them, other than the treatment, arise only by chance. The larger the number of subjects in each group, the lower the chance that an observed difference is due to random variation alone.
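To make the point about random assignment and group size concrete, here is a minimal sketch in Python. The function names, the seed, and the 30% response rate are assumptions chosen for illustration, not details of any real trial. It splits subjects into two groups at random and shows how the standard error of an observed rate shrinks as the groups grow.

```python
import math
import random


def randomize(subject_ids, seed=0):
    """Shuffle subjects and split them into two equally sized groups."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def standard_error(rate, n):
    """Standard error of an observed proportion measured on n subjects."""
    return math.sqrt(rate * (1 - rate) / n)


# Hypothetical trial: 1,000 subjects, split into treatment and placebo groups.
treatment, placebo = randomize(range(1000))
print(len(treatment), len(placebo))

# Hypothetical 30% response rate: the uncertainty falls as group size grows.
for n in (100, 1_000, 10_000):
    print(n, round(standard_error(0.30, n), 4))
```

Because the standard error scales with one over the square root of the group size, a group 100 times larger gives an estimate roughly 10 times more precise.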
If there are statistically significant differences between the test and control groups, and subjects were randomly assigned, we can conclude that there is a causal relationship between the treatment received and the observed difference. By simplifying our view to a single variable, we can have confidence that this is well beyond correlation. The observed effect does not need to validate our hypothesis to be a useful finding.
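One common way to check whether a difference between two groups is larger than chance alone would explain is a two-proportion z-test. The sketch below is one possible implementation using only the standard library. The function name and the counts in the example are hypothetical, and a real trial would follow a pre-registered analysis plan rather than this simplified check.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("rates are all 0 or all 1; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 200 of 1,000 placebo subjects improve,
# versus 260 of 1,000 subjects who received the drug.
z, p = two_proportion_z_test(200, 1000, 260, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value says the difference is unlikely to be noise. It is the random assignment that lets us read that difference as causal.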
How does machine learning enhance medicinal product development?
In the emerging field of personalized medicine, software is used to match people with treatments that fit their unique symptoms and genetic markers. In this case, the original A/B (RCT) tests are incredibly valuable because of the matches they established between different types of patients (women vs. men, adults vs. children) and the drugs that have been developed. We may have learned, both during initial trials and during product rollout at scale, that a drug has increased potency for a specific type of patient and interacts positively under specific circumstances.
At this point, we are mining hundreds of variables to develop models that allow us to tailor medicine specifically for you (or people like you). Now is the time of the data scientist, analyzing causal relationships to develop patterns that match real life as often as possible.
A/B testing helps data scientists find valuable levers
A/B testing can teach data scientists valuable lessons that enhance their understanding of audiences and underlying data sets, and it helps them focus on core use cases through methodical design of experiments. Let’s take a look at both.
Learning valuable lessons
One of the primary goals of data science is to closely model, through software, what happens in nature. And by nature here, we mean the human mind. Reviewing large data sets alone will not allow us to mimic nature without clear observations. Those observations, when made with a single variable under analysis, allow us to decompose complex problems into digestible, model-capable concepts.
By getting closer to discrete audiences and analyzing patterns of behavior, we can develop feature-rich models using an array of techniques that best match the natural world. This also has the ancillary benefit of connecting data scientists to real-life problems and people, which spurs creativity in answering human problems. Academic challenges can also be useful, but for those of us working to help solve stressful and often vital issues between people and money, a grounding in reality is pivotal.
Methodical design of experiments
I try not to underestimate the value of good experimental design. Thinking in advance about what you’d like to learn and having the underlying observations in data at your disposal is a wonderful primer for the data scientist. Exploring the areas of highest leverage through past observations and planning for rapid experimentation is the key to maximizing the number of causes you can identify. A/B tests do not have to be complex, lengthy or expensive to enhance your machine learning optimization frameworks.
Here’s a made-up but common example of the thought process in action:
An ecommerce site observes web traffic data showing that an outsized number of prospects drop off on its seasonal product page. Applications, as a result, are declining. Top-of-funnel engagement has not changed. Mid-funnel bounce rate has increased and time on site has declined by 22%. No UX changes have been made that would account for the difference.
They hypothesize that tastes are changing globally and that seasonal products no longer meet customer needs. A review of recent competitor changes shows an uptick in the frequency of social proof messaging, specifically on seasonal products.
They run an A/B test with an increased presence of social proof for 50% of the seasonal products segment and business as usual (BAU) for the other 50%.
They observe a statistically significant improvement in application rate and conversions and a reduction in bounce rate, and time on site returns to prior levels. The learning is captured and the UX is rolled out to 100% of traffic.
How does this help the data scientist? By having a cause and effect, teams can use data slices from the experiment to better model behaviors for micro-cohorts or individuals. Users can simulate outcomes based on improvements to primary KPIs (please think about end column metrics here). Without A/B testing, data scientists are at a severe disadvantage as the modeling will lack a stimulus-response system and teams can neither scope the opportunity size accurately nor observe the types of treatments that might have a net benefit.
Why not use both A/B Testing and Machine Learning? Use cases for a great destination state
Great, data-driven companies run A/B tests that measure customer engagement (conversions) across a variety of types of experiences: everything from copy changes to new imagery or distinct changes in the user experience, or even testing different styles of audience segmentation.
When a new message outperforms an old one in an experiment, we replace the old message with the new one. This is a “winner take all” approach, because more customers have converted on the new content. Therefore, we determine new content to show customers based on experiment results, even though the new content may not appeal to all customers.
Let’s explore a made-up, but illustrative example that you might encounter in the real world of A/B Testing. A company tested a new creative (Variation 2, roller coaster image) by comparing it with the existing creative (Variation 1, people swimming). Variation 1 had a 2.5% conversion rate, and Variation 2 had a 3.5% conversion rate. Therefore, the experimenter decided to replace the old message (Variation 1) with the new message (Variation 2).
However, a deeper analysis indicates that Variation 1 has disproportionately more engagement from visitors who came from Google, and Variation 2 has disproportionately more engagement from visitors from Yahoo. Instead of choosing a winning variation, it would be beneficial to use both variations to obtain higher conversion rates from both populations.
One approach could be to target Google referrals with Variation 1 and Yahoo referrals with Variation 2. However, closer examination indicates that although Variation 1 has a greater conversion rate among Google visitors overall, Variation 2 actually has a greater conversion rate among Google visitors that represent med-high spend. Likewise, although our roller coaster image (Variation 2) has a greater conversion rate among Yahoo visitors overall, people swimming (Variation 1) actually has a greater conversion rate among Yahoo visitors in the low-spend category.
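The way a winner can flip as the data is sliced more finely can be sketched with a small segment breakdown. Every count below is invented for illustration. The overall rates are chosen to match the 2.5% and 3.5% in the example, and the per-segment split is just one hypothetical pattern in which the overall winner loses inside some segments.

```python
from collections import defaultdict

# (referrer, spend) -> {variation: (visitors, conversions)}
# All counts are hypothetical and exist only to illustrate the pattern.
RESULTS = {
    ("google", "low"): {"v1": (3000, 120), "v2": (3000, 60)},
    ("google", "med-high"): {"v1": (2000, 40), "v2": (2000, 60)},
    ("yahoo", "low"): {"v1": (3000, 60), "v2": (3000, 45)},
    ("yahoo", "med-high"): {"v1": (2000, 30), "v2": (2000, 185)},
}


def winners_by(results, key_fn):
    """Group segments with key_fn and return the best variation per group."""
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for segment, by_variation in results.items():
        group = key_fn(segment)
        for variation, (visitors, conversions) in by_variation.items():
            totals[group][variation][0] += visitors
            totals[group][variation][1] += conversions
    return {
        group: max(counts, key=lambda v: counts[v][1] / counts[v][0])
        for group, counts in totals.items()
    }


overall = winners_by(RESULTS, lambda seg: "all")       # {'all': 'v2'}
by_referrer = winners_by(RESULTS, lambda seg: seg[0])  # google: v1, yahoo: v2
by_segment = winners_by(RESULTS, lambda seg: seg)      # winner per fine segment

print(overall)
print(by_referrer)
print(by_segment)
```

With these invented counts, Variation 2 wins overall, Variation 1 wins among Google referrals, and the winner changes again once spend level is taken into account.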
With this knowledge, we pivot our approach to leverage machine learning, launch both variations, and let the model determine which customers should see people swimming and which should see roller coasters. A machine learning optimization engine can determine which variation to show by measuring how similar the customer is to other customers who have converted from Variation 1 or Variation 2 (collaborative filtering).
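As a rough illustration of the similarity idea, the sketch below picks a variation by majority vote among the past converters most similar to a new visitor. It is a simple nearest-neighbor stand-in, not a production collaborative filtering system or any particular vendor's engine. The feature encoding, the sample data, and the function names are all assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def choose_variation(visitor, converters, k=3):
    """Vote among the k past converters most similar to this visitor."""
    ranked = sorted(
        converters,
        key=lambda c: cosine_similarity(visitor, c[0]),
        reverse=True,
    )
    votes = Counter(variation for _, variation in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical features: [is_google, is_yahoo, is_low_spend, is_med_high_spend]
CONVERTERS = [
    ([1, 0, 1, 0], "v1"),
    ([1, 0, 1, 0], "v1"),
    ([0, 1, 1, 0], "v1"),
    ([1, 0, 0, 1], "v2"),
    ([0, 1, 0, 1], "v2"),
    ([0, 1, 0, 1], "v2"),
]

print(choose_variation([1, 0, 1, 0], CONVERTERS))  # low-spend Google visitor
print(choose_variation([0, 1, 0, 1], CONVERTERS))  # med-high-spend Yahoo visitor
```

A real engine would learn from far richer behavioral data and would keep exploring both variations, but the core idea is the same: show each visitor what worked for people like them.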
This example is a very simple use case. Message variations may appeal to other sub-groups of customers and generate more complex relationships as we slice the data into finer segments.
Takeaways and key advice
There you have it. You can have your A/B testing and machine learning, too.
In fact, it would be wise to use them both effectively for their respective purposes. How you improve outcomes for your available audience to achieve maximum value is up to you, but the principles shown here can help you avoid analysis and modeling issues down the road. As always, observe, test and optimize for the win.
Finally, a big thank you to Dan Pick and Scott Golder for your expertise on this.