Using Experimentation for Sound Decision Making

A business analyst's playbook with examples

By James Montgomery, Principal Associate, Data Science, Capital One, and Mack Sweeney, Manager, Data Science, Capital One

Intro

Making good business decisions requires reliable information, and reliable information can only be generated from well-designed and properly analyzed experiments. Biased or incorrectly interpreted experiments can cause millions of dollars in losses and missed opportunities. In 2015, Optimizely estimated that the standard A/B testing user regularly makes critical mistakes resulting in error rates as high as 30%! Clearly, there is a need to establish guidelines for running experiments whose results can reliably inform sound business decisions. What we need is an Experimentation Playbook!

We began our work applying reinforcement learning (RL) to digital marketing at Capital One in 2017. Over the last three years we’ve had countless conversations with product owners about how digital marketing experiments should be run. Our work has touched a number of domains in experimentation, including traditional A/B/n tests in email marketing, near-real-time optimization of website content, and RL for search engine marketing (SEM) auction bidding. We’ve distilled the lessons we’ve learned applying experimentation in a practical setting at Capital One, along with our knowledge from academia, into this Experimentation Playbook. We hope it helps you solve your own problems!

Scope of This Experimentation Playbook

Experimentation can be broken down into two fundamental problems: design and analysis. Many decisions make up an experimental design. What key performance indicators (KPIs) should we measure? How do we seek to improve them (hypothesis generation)? Should we segment our users in some way? How many variants do we need to test this hypothesis? How many users should we allocate to each variant? How should we analyze the experiment to get the guarantees we want in the most efficient way? Analysis is no less complicated, but it typically breaks down into two simple questions. When do we stop our experiment? Which variant is the winner?

In this post we focus on the problems of experimental design from a business perspective, distilling complex ideas into an easy-to-understand playbook. We provide a step-by-step guide to designing your own experiments, answering common questions and concerns along the way. Walking through a simple marketing example, we’ll relate the concepts discussed here to real-world problems. We also include references to more technical sources for readers who want to dive deeper into the content. This playbook will give you the tools to transform a business problem into a targeted experimental design that will generate reliable information for effective decision making.

Example Set Up

Let’s begin by setting up an illustrative, practical example. Suppose you are a business analyst in the marketing department of a large corporation. Your team is responsible for the email marketing channel and you’ve been tasked with improving an email campaign through experimentation to drive more visitors to the site. Each email includes a subject line, a banner image, and ad copy (text). Within the ad copy is at least one call to action (CTA), which is a link to the site.

How should we create this campaign (design this experiment)? First, we need to define success. We need a key performance indicator (KPI) that captures our goal of driving more visitors to the site. There are often several good choices for a KPI, but for this campaign, Click Rate (CR) most closely describes our goal. CR is defined as the number of people who click the CTA divided by the number of people who were sent the email. Having chosen our KPI, we now arrive at the fundamental task in experimentation: we need to come up with a hypothesis. This will drive the rest of our design decisions.
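To make the KPI concrete, here’s a minimal sketch of the CR calculation in Python. The counts are illustrative placeholders, not real campaign numbers:

```python
# Click Rate (CR) = people who clicked the CTA / people who were sent the email
emails_sent = 50_000  # illustrative: recipients of the campaign email
cta_clicks = 1_250    # illustrative: recipients who clicked the call to action

click_rate = cta_clicks / emails_sent
print(f"CR = {click_rate:.2%}")  # -> CR = 2.50%
```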

Example Framing and Hypothesis Generation

At a high level, we have three specific parts of the email we can change and we have many potential users with many common characteristics. We can form hypotheses about changes we believe will make the CTA more appealing to everyone, or we can decide to use some of the known user characteristics to segment recipients into multiple audiences and create CTAs that appeal to specific sub-populations. Concretely, our email server provides us the first and last name, age, gender, and zip code of all the people in our email database. Let’s suppose we ideate for a while and come up with the following options:

  1. Using the recipient’s first name in the subject line increases CR for everyone
  2. Informal ad copy increases CR for our younger audience but decreases it for our older audience
  3. The majority of the people who don’t click the CTA fail to do so because they have short attention spans

Which of these are good hypotheses? For most experimenters, a good hypothesis follows CATS (coined here):

  • Constrained: the experiment will seek to improve exactly one KPI and contain a small number of changes
  • Actionable: it should suggest a specific change to be made to something you can change
  • Testable: you must be able to measure and record the KPI, audience characteristics, and variant changes
  • Specific: the KPI, changes, and audiences should be defined clearly enough that everyone knows what they mean without further explanation

With CATS in mind, let’s review our list of hypotheses (the numbers below refer to the numbering of hypotheses above):

  1. Good hypothesis! There is one well defined question, a clear action, and a measurable response: change the subject line to include the recipient’s first name to drive clicks. We can test this because we can record the CR of recipients who receive our emails and we have their first names in our database. It is also specific: we’ve defined the KPI (CR), the changes (add first name to subject line messaging), and the audiences (everyone).
  2. Not specific. While there is one question posed and it seems to be testable, the concept of “informal ad copy” might mean different things for different people. Also, “younger” and “older” are not clearly defined segments. So there is not a specific change or specific segmentation suggested by this hypothesis that will answer the question posed.
  3. Not testable. The main problem is that we don’t have any way to measure “short attention spans.” So there is no way to even know if we’re improving CR for people with short attention spans. As a consequence, this is also not actionable, and because we didn’t specify any change to be made, it is also not specific.

Hypothesis Refinement

Let’s try refining the second hypothesis. The second hypothesis lacked clarity on two points:

  1. What is the definition of young and old?
  2. What is “informal ad copy”?

What counts as informal depends on the age group we’re considering, so let’s start by defining “young” vs. “old.” We will define young as anyone in the Millennial generation (born between 1981 and 1996) or after, and old as anyone born before 1981. We now need to define an informal message. Let’s say that an informal message A) refers to customers by their first name, B) uses contractions, and C) uses phrases sourced from a young user group (as defined above).
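In code, the segmentation boils down to a single cutoff on birth year. Here’s a minimal sketch, assuming we can derive a birth year for each recipient from the age field in our database (the function and constant names are just illustrative):

```python
MILLENNIAL_START = 1981  # Millennials and later generations count as "young"

def age_segment(birth_year: int) -> str:
    """Return 'young' for anyone born in 1981 or later, otherwise 'old'."""
    return "young" if birth_year >= MILLENNIAL_START else "old"

print(age_segment(1992))  # -> young
print(age_segment(1975))  # -> old
```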

Our new hypothesis is: “Ad copy that refers to customers by their first name, uses contractions, and uses phrases sourced from our young user group will increase CR when shown to users born after 1981 and decrease CR for users born before 1981.” We now have clearly defined variants and segmentation to test our hypothesis.

To give ourselves some more practice, let’s now try refining the third hypothesis. This one is currently untestable because we don’t have any way to measure “short attention spans.” There might not be a practical way to resolve this, short of working with the email providers of our users to get them to instrument dwell-time measurements. So an easier fix here is to come up with a similar hypothesis in terms of things we can control.

One option is to assume that users with short attention spans will respond less frequently to ad copy that is longer. We still can’t test this, but we can allow it to motivate our actual hypothesis. One option is: “Shortening ad copy will increase CR.” However, this doesn’t get at our notion of segmentation in the population into users with short and long attention spans. Another option that more clearly reflects this segmentation is: “For those users who don’t respond to the full-length email, sending out a follow-up email with ad copy shortened to half the length will increase CR.”

Let’s now take our refined second hypothesis and run with it.

Choice of Analytical Procedure

So what’s left to figure out at this point? Let’s take a moment to return to the questions we set out to address:

  1. What key performance indicators (KPIs) should we measure?
  2. How do we seek to improve it (hypothesis generation)?
  3. Should we segment our users in some way?
  4. How many variants do we need to test this hypothesis?
  5. How should we analyze the experiment to get the guarantees we want in the most efficient way?
  6. How many users should we allocate to each variant?

We’ve now addressed 1-3: we’re measuring CR, we have our hypothesis, and that hypothesis requires we segment our users into young and old, as we defined above. In this section, we’ll address question 4, and in the next, we’ll address questions 5 and 6.

Our hypothesis dictates that our variants will involve ad copy with varying degrees of formality, but we can still decide how many degrees we want to test. One option is to have a “formal” variant that uses no first names, no contractions, and none of the phrasing gathered from our young user group, and an “informal” variant that uses all three. Alternatively, we could define intermediate variants that incorporate only some characteristics of our “informal” definition, e.g. just contractions and first names. Let’s go with this latter option, calling the third variant “moderately informal”, which gives us three variants. So our test design now looks like this:

Variant / Segment    | Young (born in 1981 or later) | Old (born before 1981)
Formal               | Test cell 1                   | Test cell 2
Moderately Informal  | Test cell 3                   | Test cell 4
Informal             | Test cell 5                   | Test cell 6
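To see how recipients would flow into these six cells, here’s a minimal randomization sketch: within each age segment, each recipient is assigned to one of the three ad-copy variants with equal probability (the field names, variant labels, and seed are illustrative):

```python
import random

VARIANTS = ["formal", "moderately informal", "informal"]

def assign_test_cell(birth_year: int, rng: random.Random) -> tuple[str, str]:
    """Return (segment, variant) for a single recipient."""
    segment = "young" if birth_year >= 1981 else "old"
    variant = rng.choice(VARIANTS)  # equal allocation across the three variants
    return segment, variant

rng = random.Random(42)  # fixed seed so the assignment is reproducible
print(assign_test_cell(1992, rng))  # e.g. ('young', 'informal')
```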


Now the only question remaining is how we want to analyze our experiment.

Experimentation vs. Optimization

The first question to ask yourself when deciding which analytical procedure to use is: do I want to experiment or optimize? What’s the difference?

  • Experimentation: focuses on learning
    • Maximizing what you learn (how much information you gain), e.g. by ensuring guarantees on quality of findings
  • Optimization: focuses on improving
    • Minimizing opportunity cost through how you allocate samples between the available options

How you plan to use the test results should determine your choice. For our hypothesis about informal ad copy, we could plan to:

  • Generalize our findings for all future campaigns -- if informal ad copy does appeal more to younger audiences, then always use informal ad copy in the future. In this case we’d want to experiment.
  • Treat it as a useful insight but don’t make a hard decision -- in future tests, include variants with informal ad copy for the younger audience, but also continue to try out variants with other degrees of formality. In this case, we’d want to optimize.

The choice will also depend on your constraints -- to generalize findings from experiments often requires larger sample sizes, so if you don’t have enough samples, you’re stuck with optimizing. Of course, there are also analytical procedures that do a bit of both, but ultimately, each will be constructed to provide specific guarantees that align more closely to one or the other.

For this example, we’ll leave the choice to you! Below we list common procedures you can use for each:

  • Experimentation:
    • Null hypothesis significance testing (NHST)
    • Sequential analysis
  • Optimization:
    • Reinforcement learning algorithms (RL)
    • Bandit algorithms (a simple, practical type of RL; see the sketch below)
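To give a flavor of the optimization side, below is a minimal Beta-Bernoulli Thompson sampling sketch, one of the simplest bandit algorithms. The simulated click rates are illustrative assumptions, not results from any real campaign:

```python
import random

variants = ["formal", "moderately informal", "informal"]
true_cr = {"formal": 0.020, "moderately informal": 0.025, "informal": 0.030}  # assumed

# Beta(1, 1) priors: one pseudo-count of clicks and of non-clicks per variant
clicks = {v: 1 for v in variants}
no_clicks = {v: 1 for v in variants}

rng = random.Random(0)
for _ in range(10_000):  # each iteration simulates sending one email
    # Sample a plausible CR for each variant from its posterior and pick the best
    sampled = {v: rng.betavariate(clicks[v], no_clicks[v]) for v in variants}
    chosen = max(sampled, key=sampled.get)
    # Simulate whether this recipient clicks, then update the chosen variant's counts
    if rng.random() < true_cr[chosen]:
        clicks[chosen] += 1
    else:
        no_clicks[chosen] += 1

for v in variants:
    sends = clicks[v] + no_clicks[v] - 2  # subtract the prior pseudo-counts
    print(f"{v}: {sends} sends, observed CR = {(clicks[v] - 1) / max(sends, 1):.2%}")
```

Notice how the bandit keeps shifting traffic toward whichever variant currently looks best instead of fixing the allocation up front -- exactly the “minimize opportunity cost” behavior described above.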

Now the only question left is: “How many users should we allocate to each variant?” The answer is really driven by the choice of analytical procedure. For example, for NHST you can use a sample size calculator to determine up front how many users to allocate to each variant, but for bandit algorithms there is no predetermined sample size.
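If you go the NHST route, here’s a minimal sample-size sketch using statsmodels. The baseline CR and the lift we hope to detect are assumptions you would replace with numbers from your own campaign:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_cr = 0.020  # assumed current click rate
target_cr = 0.025    # assumed click rate we hope the winning variant achieves

# Cohen's h effect size for comparing two proportions
effect_size = proportion_effectsize(target_cr, baseline_cr)

# Users needed per variant for a two-sided test at alpha = 0.05 and 80% power
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative="two-sided",
)
print(f"~{n_per_variant:,.0f} users per variant")
```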

That concludes this post, in which we’ve provided a practical playbook for experimental design. For more details on how to actually analyze or optimize experiments using the techniques mentioned above, check out the appendix!

Appendix I: More Details on Analysis Methodologies

Experimentation:

Optimization:

