Conversion optimization is a hot and much-discussed topic today, and it has made A/B testing popular as the only objective way to learn the truth about how well a given technology or solution improves the economic efficiency of an online business.
Behind this popularity hides an almost complete lack of culture in organizing, conducting, and analyzing experiments. At Retail Rocket we have accumulated a lot of expertise in evaluating the cost-effectiveness of personalization systems in e-commerce. Over two years we have built up an ideal process for conducting A/B tests, which we want to share in this article.
A few words on the principles of A/B testing
In theory, it’s incredibly simple:
- Hypothesize that some change (e.g., homepage personalization) will increase the conversion rate of your online store.
- Create an alternate version of the site "B" – a copy of the original version "A" with the changes that we expect to increase the effectiveness of the site.
- Randomly divide all visitors to the site into two equal groups: one group is shown the original version, the other the alternative.
- Simultaneously measure the conversion rate for both versions of the site.
- Determine which version wins with statistical significance.
The beauty of this approach is that any hypothesis can be tested with numbers. There is no need to argue or rely on the opinion of pseudo-experts. Run the test, measure the result, move on to the next test.
An example in numbers
For example, let’s imagine we made some change to the site, ran the A/B test and got the following data:
Conversion is not a static value. Depending on the number of "trials" and "successes" (for an online store, site visits and orders placed, respectively), the conversion rate falls within a certain range with a calculable probability.
For the table above, this means that if we bring another 1000 users to version "A" of the site with unchanged external conditions, there is a 99% probability that these users will make between 45 and 75 orders (that is, convert into customers at a rate of 4.53% to 7.47%).
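A range like this can be approximated in a few lines. The sketch below uses the normal (Wald) approximation, which is only one of several ways to build such an interval; the function name `conversion_interval` and the figures of 120 orders from 2,000 visits are hypothetical and are not taken from the table.

```python
from statistics import NormalDist

def conversion_interval(orders, visits, confidence=0.95):
    """Normal-approximation (Wald) interval for a conversion rate."""
    if visits <= 0 or not 0 <= orders <= visits:
        raise ValueError("need visits > 0 and 0 <= orders <= visits")
    rate = orders / visits
    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * (rate * (1 - rate) / visits) ** 0.5
    return max(0.0, rate - margin), min(1.0, rate + margin)

# Hypothetical figures, not data from the article.
low, high = conversion_interval(orders=120, visits=2_000, confidence=0.95)
print(f"conversion is between {low:.2%} and {high:.2%}")
```

The more visits a segment collects, the narrower the interval becomes, which is why short tests on small audiences rarely produce a clear winner.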
On its own, such an interval is not too valuable, but an A/B test gives us two of them, one per version. Comparing these so-called "confidence intervals" for the two segments of users, and checking whether they overlap, lets us decide whether one of the tested versions is statistically significantly superior to the other. Graphically, this can be represented as:
Why are 99% of your A/B tests done incorrectly?
Most people are already familiar with the concept of experimentation described above; they talk about it at industry events and write articles about it. At Retail Rocket we run 10-20 A/B tests at a time, and over the past 3 years we’ve encountered a huge number of nuances that often go unnoticed.
There is a huge risk in this: if the A/B test is done incorrectly, the business is guaranteed to make the wrong decision and suffer a hidden loss. Moreover, if you’ve done A/B tests before, it’s likely that they weren’t done correctly.
Why? Let’s break down the most common mistakes we’ve encountered in the many post-test analyses of experiments conducted when implementing Retail Rocket in our customers’ online stores.
Share of audience in the test
Perhaps the most common mistake is not checking that the entire audience of the site participates in the test. Here is a fairly common real-life example (screenshot from Google Analytics):
The screenshot shows that a little less than 6% of the audience took part in the test in total. It is extremely important that the entire site audience belongs to one of the test segments, otherwise it is impossible to assess the impact of the change on the business as a whole.
Uniformity of audience distribution among tested variations
It’s not enough to distribute the entire site audience across the test segments. It’s also important to do it evenly across all slices. Let’s take the example of one of our clients:
Here the site audience is divided unevenly between the test segments, even though a 50/50 division of traffic was set in the testing tool. This picture is a clear sign that the traffic distribution tool does not work as expected.
In addition, pay attention to the last column: the second segment gets more returning, and therefore more loyal, visitors. Such people place more orders and skew the test results. This is another sign that the testing tool is not working correctly.
To rule out such errors, check the uniformity of the traffic division across all available slices (at least by city, browser, and platform) a few days after the test starts.
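One way to make this check less subjective is a chi-square goodness-of-fit test on the session counts of each slice. The sketch below is an illustration under our own assumptions: the session counts are invented, and the 0.001 threshold for raising a flag is a convention we picked, not a rule from any testing tool.

```python
import math

def split_p_value(visits_a, visits_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for a two-way split."""
    total = visits_a + visits_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visits_a - expected_a) ** 2 / expected_a
            + (visits_b - expected_b) ** 2 / expected_b)
    # For 1 degree of freedom the upper tail equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical session counts per slice: (segment A, segment B).
slices = {
    "desktop": (50_400, 49_600),
    "mobile": (30_100, 29_900),
    "returning visitors": (9_000, 11_000),
}
for name, (a, b) in slices.items():
    p = split_p_value(a, b)
    flag = "CHECK" if p < 0.001 else "ok"
    print(f"{name:20s} A={a:6d} B={b:6d} p={p:.4f} {flag}")
```

A very small p-value in any slice means the observed split is hard to explain by chance alone, so the traffic distribution deserves a closer look before the results are trusted.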
Filtering employees of the online store
The next common problem involves online store employees who, having landed in one test segment, use the site to place orders that come in by phone. These employees generate additional sales in one segment, while the callers themselves are spread across all segments. Such abnormal behavior inevitably skews the final results.
Call center operators can be identified through the network report in Google Analytics:
The screenshot is an example from our experience: a visitor came to the site 14 times from the network named "Shopping center electronics on Presnya" and placed 35 orders. This is clearly a store employee who, for some reason, placed orders through the cart on the site rather than through the store’s admin panel.
In any case, you can always export orders from Google Analytics and mark each one as "processed by operator" or "processed by non-operator". Then build a crosstab like the one in the screenshot, which shows another situation we encounter quite often: if we compare the revenue of the RR and Not RR segments ("site with Retail Rocket" and "without", respectively), "site with Retail Rocket" brings in less money than "without". But if we separate out the orders made by call center operators, it turns out that Retail Rocket gives a 10% increase in revenue.
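The crosstab described above takes only a few lines once the orders are exported. In the sketch below the order list, the tuple layout and the function name `revenue_crosstab` are all hypothetical; the numbers are invented so that they reproduce the same pattern, not the real figures from the screenshot.

```python
from collections import defaultdict

# Hypothetical exported orders: (segment, processed_by_operator, revenue).
orders = [
    ("RR", False, 120.0),
    ("RR", False, 100.0),
    ("RR", True, 40.0),
    ("Not RR", False, 200.0),
    ("Not RR", True, 150.0),
    ("Not RR", True, 30.0),
]

def revenue_crosstab(rows):
    """Sum revenue per (segment, operator flag) pair."""
    table = defaultdict(float)
    for segment, by_operator, revenue in rows:
        table[(segment, by_operator)] += revenue
    return dict(table)

table = revenue_crosstab(orders)
for segment in ("RR", "Not RR"):
    web = table.get((segment, False), 0.0)
    operator = table.get((segment, True), 0.0)
    print(f"{segment:7s} non-operator={web:8.2f} operator={operator:8.2f} "
          f"total={web + operator:8.2f}")
```

With this invented data the Not RR segment leads on total revenue, yet the RR segment leads once operator orders are set aside, which is exactly the kind of reversal the crosstab is meant to reveal.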
What metrics should you pay attention to in the final evaluation of the results?
An A/B test was conducted last year and the results were as follows:
- +8% to conversion in the segment "site with Retail Rocket".
- The average check was virtually unchanged (+0.4% – at the level of error).
- Revenue growth of +9% in the segment "site with Retail Rocket".
After reporting the results, we received this letter from a customer:
The manager of the online store insisted that if the average check has not changed, then the service has no effect. This completely ignores the overall increase in revenue driven by the recommendation system.
So what is the most important metric to focus on? Of course, the most important thing for a business is money. If the A/B test divided traffic evenly among the visitor segments, then the right metric to compare is revenue for each segment.
In practice, no traffic-splitting tool produces exactly equal segments; there is always a difference of a fraction of a percent. So normalize revenue by the number of sessions and use the revenue per visit metric.
This is an internationally recognized KPI, which we recommend that you use as a reference for your A/B tests.
It’s important to remember that revenue from orders placed on the site and "fulfilled" revenue (revenue from orders actually paid for) are completely different things.
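To make the difference concrete, here is a minimal sketch that computes revenue per visit twice for one segment: once from all placed orders and once from paid orders only. The order list, the `was_paid` flag and the session count are hypothetical.

```python
def revenue_per_visit(order_amounts, sessions):
    """Total revenue divided by the number of sessions in the segment."""
    if sessions <= 0:
        raise ValueError("sessions must be positive")
    return sum(order_amounts) / sessions

# Hypothetical orders for one segment: (amount, was_paid).
orders = [(50.0, True), (80.0, False), (120.0, True), (30.0, True)]
sessions = 1_000

placed = revenue_per_visit([amount for amount, _ in orders], sessions)
fulfilled = revenue_per_visit(
    [amount for amount, was_paid in orders if was_paid], sessions
)
print(f"placed RPV={placed:.3f}, fulfilled RPV={fulfilled:.3f}")
```

Comparing segments on the fulfilled figure keeps cancelled or unpaid orders from inflating either side.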
Here’s an example of an A/B test in which the Retail Rocket system was compared to another recommendation system:
The "non-Retail Rocket" segment wins on all parameters. However, the next phase of the post-test analysis excluded call center orders as well as cancelled orders. The results:
Post-test analysis of results is a must for A/B testing!
Working with different data slices is an extremely important part of post-test analysis.
Here is one more case of Retail Rocket testing, at one of the largest online stores in Russia:
At first glance, we got a great result: revenue growth of +16.7%. But if we add the additional data slice "Device Type" to the report, we see the following picture:
- Revenue from desktop traffic is up almost 72%!
- On tablets in the Retail Rocket segment, there is a slump.
As it turned out after testing, the Retail Rocket recommendation blocks were not displayed correctly on tablets.
It is very important as part of the post-test analysis to build reports at least by city, browser, and user platform, so as not to miss such problems and maximize results.
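A slice report of this kind can be reduced to one number per slice: the relative change in revenue per visit between the two segments. The sketch below assumes per-slice totals are already available; the figures and the `rpv_lift` helper are hypothetical and only imitate the pattern from the case above.

```python
# Hypothetical per-slice totals: slice -> {segment: (revenue, sessions)}.
report = {
    "desktop": {"RR": (17_200.0, 10_000), "Not RR": (10_000.0, 10_000)},
    "mobile": {"RR": (5_100.0, 8_000), "Not RR": (5_000.0, 8_000)},
    "tablet": {"RR": (1_500.0, 2_000), "Not RR": (2_400.0, 2_000)},
}

def rpv_lift(test, control):
    """Relative change in revenue per visit of the test segment vs. control."""
    test_rpv = test[0] / test[1]
    control_rpv = control[0] / control[1]
    return test_rpv / control_rpv - 1

lifts = {name: rpv_lift(s["RR"], s["Not RR"]) for name, s in report.items()}
for name, lift in lifts.items():
    note = "  <- investigate" if lift < 0 else ""
    print(f"{name:8s} {lift:+.1%}{note}")
```

Any slice with a negative lift is a candidate for a rendering or tracking problem rather than a genuine loss, so it is worth inspecting before drawing conclusions.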
The next topic to address is statistical significance. Decide to implement a change on the site only after the winning version’s superiority has reached statistical significance.
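For conversion, one common way to check this is a pooled two-proportion z-test. The sketch below is a generic textbook version, not the method of any particular calculator; the function name `conversion_p_value` and the order and visit counts are hypothetical.

```python
from statistics import NormalDist

def conversion_p_value(orders_a, visits_a, orders_b, visits_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    rate_a = orders_a / visits_a
    rate_b = orders_b / visits_b
    pooled = (orders_a + orders_b) / (visits_a + visits_b)
    std_err = (pooled * (1 - pooled) * (1 / visits_a + 1 / visits_b)) ** 0.5
    if std_err == 0:
        return 1.0  # no orders at all, or every visit converted, in both segments
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts, not data from the article.
p = conversion_p_value(orders_a=600, visits_a=10_000,
                       orders_b=690, visits_b=10_000)
print(f"p-value = {p:.4f}")
```

A small p-value (0.05 and 0.01 are common thresholds) suggests the difference in conversion is unlikely to be a coincidence.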
There are many online tools to calculate the statistical significance of conversion differences, such as htraffic.ru/calc/:
But conversion is not the only indicator that determines the cost-effectiveness of a site. The problem with most A/B tests today is that only the statistical significance of the conversion is tested, which is insufficient.
An online store’s revenue is determined by the conversion rate (the percentage of visitors who buy) and the average check (the size of a purchase). It is harder to calculate the statistical significance of changes in the average check, but you cannot do without it; otherwise you will inevitably draw the wrong conclusions.
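One general-purpose way to test a difference in average check is a permutation test, sketched below. To be clear, this is a technique we are swapping in for illustration and not the method described later in this article; the order amounts, the number of rounds and the function names are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(checks_a, checks_b, rounds=2_000, seed=0):
    """Two-sided permutation test for a difference in mean order amount."""
    rng = random.Random(seed)
    observed = abs(mean(checks_b) - mean(checks_a))
    pooled = list(checks_a) + list(checks_b)
    size_a = len(checks_a)
    extreme = 0
    for _ in range(rounds):
        # Reassign orders to segments at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[size_a:]) - mean(pooled[:size_a]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible zero.
    return (extreme + 1) / (rounds + 1)

# Hypothetical order amounts; segment B contains one very large order.
checks_a = [40, 55, 60, 45, 70, 52, 48, 66]
checks_b = [42, 58, 61, 47, 68, 50, 49, 1000]
p = permutation_p_value(checks_a, checks_b)
print(f"p-value = {p:.3f}")
```

In this toy data segment B has a much higher average check, but the whole gap comes from a single order, and the test does not treat it as a reliable difference.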
The screenshot shows another Retail Rocket A/B test, in which one of the segments contains an order of more than one million rubles:
This order accounts for almost 10% of all revenue during the test period. In that case, once you reach statistical significance on conversion, can the results on revenue be considered reliable? Of course not.
Such huge orders significantly skew the results. We have two approaches to post-test analysis of the average check:
- Complicated. "Bayesian statistics", which we will discuss in the following articles. At Retail Rocket we use it to evaluate the significance of average check changes in internal tests that optimize recommendation algorithms.
- Simple. Cutting off a few percentiles of orders (usually 3-5%) from the top and bottom of a list sorted by order amount in descending order.
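The simple approach can be sketched as follows. We assume here that the chosen percentage is cut from each end of the list; the function name `trim_orders` and the order amounts are hypothetical.

```python
def trim_orders(amounts, percent=3):
    """Drop `percent` percent of orders from each end of the list sorted by amount."""
    if not 0 <= percent < 50:
        raise ValueError("percent must be in the range [0, 50)")
    ordered = sorted(amounts, reverse=True)
    # Integer arithmetic avoids floating-point surprises in the cut size.
    cut = len(ordered) * percent // 100
    return ordered[cut:len(ordered) - cut]

# Hypothetical order amounts: 99 ordinary orders and one huge one.
amounts = [1_000_000] + [1_000 + 10 * i for i in range(99)]
kept = trim_orders(amounts, percent=3)

print(len(amounts), "orders ->", len(kept), "orders")
print("mean before:", sum(amounts) / len(amounts))
print("mean after: ", sum(kept) / len(kept))
```

Apply the same cut to every segment, otherwise the trimming itself introduces a bias between them.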
Lastly, always pay attention to when you run the test and how long it lasts. Try not to run the test a few days before major gender holidays or on the holidays themselves. Seasonality also shows up around paydays: they usually drive sales of expensive items, particularly electronics.
In addition, there is a proven correlation between the average check in a store and the time it takes to make a decision before purchase. Simply put, the more expensive the items, the longer it takes to select them. The screenshot shows an example of a store where 7% of users take 1 to 2 weeks to decide before purchasing:
If you run an A/B test on such a store for less than a week, about 10% of the audience will not be captured, and the impact of the change on the business cannot be assessed unambiguously.
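A rough way to estimate this share for your own store is to take the number of days between first visit and purchase for each order and count how many exceed the planned test length. The data and the `share_missed` helper below are hypothetical, and the rule "decision period longer than the test" is a simplification.

```python
# Hypothetical days between first visit and purchase, one value per order.
days_to_purchase = [0, 0, 1, 0, 3, 9, 0, 2, 12, 0,
                    1, 0, 5, 0, 0, 20, 1, 0, 8, 0]

def share_missed(days, test_length_days):
    """Share of buyers whose decision period is longer than the test."""
    missed = sum(1 for d in days if d > test_length_days)
    return missed / len(days)

for length in (7, 14, 21):
    share = share_missed(days_to_purchase, length)
    print(f"{length:2d}-day test misses {share:.0%} of buyers")
```

If the missed share is noticeable, extend the test until it covers the typical decision period.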
In lieu of a conclusion. How to conduct a perfect A/B test?
So, in order to eliminate all of the problems described above and conduct a proper A/B test, there are 3 steps to follow:
1. Split traffic 50/50
Complicated: with a traffic balancer.
Simple: use the open source library Retail Rocket Segmentator, which is maintained by the Retail Rocket team. Over several years of testing, we haven’t been able to solve the problems described above with tools like Optimizely or Visual Website Optimizer.
Goal of the first step:
- Get an even distribution of the audience across all available slices (browsers, cities, traffic sources, etc.).
- 100% of the audience should take part in the test.
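For illustration only, here is one generic way to assign every visitor to a segment: hash a stable visitor identifier and use the result to pick a side. This is not how Retail Rocket Segmentator or any of the tools named above works; the function `assign_segment`, the experiment name and the visitor IDs are hypothetical.

```python
import hashlib

def assign_segment(visitor_id, experiment="homepage-personalization"):
    """Deterministically map a visitor to segment "A" or "B"."""
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer; its parity picks the segment.
    bucket = int.from_bytes(digest[:8], "big") % 2
    return "A" if bucket == 0 else "B"

# Every visitor gets a segment, and the same visitor always gets the same one.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_segment(f"visitor-{i}")] += 1
print(counts)
```

Because the assignment depends only on the identifier, no visitor is left outside the test, and returning visitors stay in their segment as long as the identifier is stable.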
2. Conduct an A/A test
Without changing anything on the site, pass different user segment identifiers (in the case of Google Analytics – Custom var / Custom dimension) to Google Analytics (or any other web analytics system you like).
Goal of the second step: no winner. Two segments that see the same version of the site should show no difference in key indicators.
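It helps to know how often a "winner" shows up purely by chance when the split is fair. The simulation below is a sketch under our own assumptions: a 5% true conversion rate in both segments, 2,000 visits per segment, 300 simulated A/A tests, and a pooled two-proportion z-test at the 0.05 level. All of these values and function names are hypothetical.

```python
import random
from statistics import NormalDist

def conversion_p_value(orders_a, visits_a, orders_b, visits_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (orders_a + orders_b) / (visits_a + visits_b)
    std_err = (pooled * (1 - pooled) * (1 / visits_a + 1 / visits_b)) ** 0.5
    if std_err == 0:
        return 1.0
    z = (orders_b / visits_b - orders_a / visits_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_orders(rng, visits, rate):
    """Count visits that convert when each converts with probability `rate`."""
    return sum(rng.random() < rate for _ in range(visits))

def false_winner_rate(rate=0.05, visits=2_000, runs=300, alpha=0.05, seed=1):
    """Share of simulated A/A tests that report a significant difference."""
    rng = random.Random(seed)
    winners = 0
    for _ in range(runs):
        a = simulate_orders(rng, visits, rate)
        b = simulate_orders(rng, visits, rate)
        if conversion_p_value(a, visits, b, visits) < alpha:
            winners += 1
    return winners / runs

share = false_winner_rate()
print(f"A/A runs with a false winner: {share:.1%}")
```

With a correct split the share should stay close to the chosen significance level. If your real A/A test keeps showing a clear winner, suspect the traffic split or the tracking before testing any change.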
3. Conduct a post-test analysis
- Exclude company employees.
- Cut off extreme values.
- Check the significance of the conversion difference and use data on order fulfillment and cancellation; in short, consider all the cases described above.
Goal of the last step: to make the right decision.
Share your A/B test cases in the comments!