This probably isn’t the first time you’ve read about A/B testing. Articles about A/B test results are shared far and wide. Heck, you might already A/B test your email subject lines or your social media posts.
Despite the fact that there’s been plenty said about A/B testing, a lot of marketers still get it wrong. The result? People making major business decisions based on inaccurate results from an improper test.
The problem is that A/B testing is often over-simplified, especially in content written for store owners.
The solution? Here’s everything you need to know to get started with ecommerce A/B testing, explained as plainly as humanly possible.
Table of Contents
- What Is A/B Testing?
- How A/B Testing Works
- What’s A/B/n Testing?
- How Long Should A/B Tests Run?
- Why Should You A/B Test?
- Prioritizing A/B Test Ideas
- Crash Course in A/B Testing Statistics
- How to Set Up an A/B Test
- How to Analyze A/B Test Results
- How to Archive Past A/B Tests
- A/B Testing Processes of the Pros
What is A/B testing?
A/B testing, sometimes referred to as split testing, is the process of comparing two versions of the same webpage to determine which one performs better.
This process allows you to answer important business questions, helps you generate more revenue from the traffic you already have, and sets the foundation for a data-informed marketing strategy.
How A/B testing works
You’ll show 50% of visitors version A (let’s call this the “Control”) and you’ll show 50% of visitors version B (let’s call this the “Variant”).
The version of the webpage that results in the highest conversion rate wins. For example, let’s say the variant (version B) yielded the highest conversion rate. You would then declare it the winner and push 100% of visitors to the variant.
Then, the variant becomes the new control and you must design a new variant.
It’s worth mentioning that conversion rate is an imperfect measure of success. Why? You can increase your conversion rate instantly by making everything in your store free. Of course, that’s a terrible business decision.
That’s why you should track the value of a conversion all the way through to the sound of a ringing cash register.
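To see why conversion rate alone can mislead, here’s a minimal Python sketch that compares two versions on both conversion rate and revenue per visitor. The numbers and the `summarize` helper are made up for illustration; they don’t come from a real store.

```python
def summarize(visitors, orders, revenue):
    """Return conversion rate and revenue per visitor for one version of the page."""
    return {
        "conversion_rate": orders / visitors,
        "revenue_per_visitor": revenue / visitors,
    }

# Hypothetical numbers: the variant converts more visitors, but at lower prices.
control = summarize(visitors=1000, orders=50, revenue=2500.0)
variant = summarize(visitors=1000, orders=70, revenue=2100.0)

print(control)
print(variant)
```

In this made-up case the variant "wins" on conversion rate and loses on revenue per visitor, which is the number closer to the cash register.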
What’s A/B/n testing?
With A/B/n testing, you can test more than one variant against the control. So, instead of showing 50% of visitors the control and 50% of visitors the variant, you might show 25% of visitors the control, 25% the first variant, 25% the second variant, and 25% the third variant.
Note: This is different from multivariate testing, which also involves multiple variants. When running multivariate tests, you’re not only testing multiple variants, you’re testing multiple elements as well. The goal is to figure out which combination performs best.
You’ll need a lot of traffic to run multivariate tests, so you can ignore those for now.
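Your testing tool handles the traffic split for you, but it can help to see roughly how an even split might work. The sketch below is one common approach (hashing a visitor ID into equal buckets), not a description of how any particular tool does it. The function name, visitor IDs, and experiment name are all hypothetical.

```python
import hashlib

def assign_variation(visitor_id, experiment, variations):
    """Deterministically map a visitor to one of the variations with equal odds."""
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variations)
    return variations[bucket]

ab = ["control", "variant"]                                # 50/50 split
abn = ["control", "variant_1", "variant_2", "variant_3"]   # 25% each

print(assign_variation("visitor-123", "product-page-test", ab))
print(assign_variation("visitor-123", "product-page-test", abn))
```

Because the assignment depends only on the visitor ID and the experiment name, the same visitor keeps seeing the same version for as long as the ID stays the same.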
How long should A/B tests run?
Run your A/B test for at least one, but ideally two, full business cycles. Don’t stop your test just because you’ve reached significance. You’ll also need to meet your predetermined sample size. Finally, don’t forget to run all tests in full week increments.
Why two full business cycles? For starters:
- You can account for “I need to think about it” buyers.
- You can account for all of the different traffic sources (Facebook, email newsletter, organic search, etc.)
- You can account for anomalies. For example, your Friday email newsletter.
If you’ve used any sort of A/B testing tool, you’re likely familiar with the little green “Statistically Significant” icon.
For many, unfortunately, that’s the universal sign for “the test is cooked, call it”. As you’ll learn more about in the statistics crash course, statistical significance is not a stopping rule. Just because it’s been reached does not mean you should stop the test.
And your predetermined sample size? It’s not as intimidating as it seems. Open up a sample size calculator, like this one from Evan Miller.
This calculation is saying that if your current conversion rate is 5% and you want to be able to detect a 15% relative effect (a lift from 5% to 5.75%), you need a sample of 13,533 per variation. So, in total, over 25,000 visitors are needed if it’s a standard A/B test.
Watch what happens if you want to detect a smaller effect:
All that’s changed is the minimum detectable effect (MDE). It’s decreased from 15% to 8%. In this case, you need a sample of 47,127 per variation. So, in total, nearly 100,000 visitors are needed if it’s a standard A/B test.
Your sample size should be calculated upfront, before your test starts. Your test can’t stop, even if it reaches significance, until the predetermined sample size is reached. If it does, the test isn’t valid.
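If you’re curious what a sample size calculator is doing, here’s a rough Python sketch using a standard normal-approximation formula for comparing two conversion rates. It assumes a two-sided test at 5% significance and 80% power, and treats the minimum detectable effect as relative to the baseline. Those settings are my assumptions, and the calculator itself may differ in the details.

```python
from math import sqrt
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation for a two-sided two-proportion test."""
    delta = baseline * relative_mde              # absolute lift we want to detect
    target = baseline + delta                    # conversion rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    sd_null = sqrt(2 * baseline * (1 - baseline))
    sd_alt = sqrt(baseline * (1 - baseline) + target * (1 - target))
    return round(((z_alpha * sd_null + z_beta * sd_alt) / delta) ** 2)

print(sample_size_per_variation(0.05, 0.15))  # 5% baseline, 15% relative MDE
print(sample_size_per_variation(0.05, 0.08))  # 5% baseline, 8% relative MDE
```

Under those assumptions, the results land within a few visitors of the 13,533 and 47,127 figures quoted above. Notice how quickly the required sample grows as the effect you want to detect shrinks.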
This is why you can’t aimlessly follow best practices, like “stop after 100 conversions”.
It’s also important to run tests for full week increments. Your traffic can change based on the day of the week and the time of day, so you’ll want to be sure to include every day of the week.
Why should you A/B test?
Let’s say you spend $100 on Facebook ads to send ten people to your site. Your average order value is $25. Eight of those visitors leave without buying anything and the other two spend $25 each. The result? You lost $50.
Now let’s say you spend $100 on Facebook ads to send ten people to your site. Your average order value is still $25. This time, though, only five of those visitors leave without buying anything and the other five spend $25 each. The result? You made $25.
This is a simplified example, of course. But by increasing the conversion rate on-site, you made the same traffic more valuable.
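The arithmetic from the two scenarios above can be written out as a tiny Python sketch. The `campaign_profit` helper is hypothetical and, like the example, ignores product costs and everything else except ad spend.

```python
def campaign_profit(ad_spend, visitors, buyers, average_order_value):
    """Return (conversion rate, profit) for a single paid campaign."""
    revenue = buyers * average_order_value
    return buyers / visitors, revenue - ad_spend

print(campaign_profit(100, 10, 2, 25))  # (0.2, -50)
print(campaign_profit(100, 10, 5, 25))  # (0.5, 25)
```

Same ad spend, same traffic, same average order value. Only the conversion rate changed.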
A/B testing also helps you uncover insights, whether your test wins or loses. This value is very transferable. For example, a copywriting insight from a product description A/B test could help inform your value proposition and other product descriptions.
You also can’t ignore the inherent value of focusing on continuously improving the effectiveness of your store.
Should you be A/B testing?
Not necessarily. If you’re a low-traffic site, A/B testing is probably not the best optimization effort for you. You will likely see a higher return on investment (ROI) from conducting user testing or talking to your customers, for example.
Despite popular belief, conversion rate optimization does not begin and end with testing.
Consider the numbers from the sample size calculator above. 47,127 visitors per variation to detect an 8% effect if your baseline conversion rate is 5%. Let’s say you want to test a product page. Do you have a product page that receives nearly 100,000 visitors in two to four weeks?
Hold up. Why two to four weeks?! Remember, we want to run tests for at least two full business cycles. Usually, that works out to two to four weeks. Now maybe you’re thinking, “No problem, Shanelle, I’ll run the test for longer than two to four weeks to reach the required sample size.” That won’t work, either.
You see, the longer a test is running, the more susceptible it is to external validity threats and sample pollution. For example, visitors might delete their cookies and end up re-entered into the A/B test as a new visitor. Or someone could switch from their mobile phone to desktop and see an alternate variation.
Essentially, letting your test run for too long isn’t an option, either.
TL;DR: Testing is worth the investment for stores that can meet the required sample size in two to four weeks. Stores that can’t should consider other forms of optimization until their traffic increases.
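A quick way to check whether your store clears that bar is to divide the total sample you need by your weekly traffic and round up to full weeks. Here’s a minimal sketch, assuming traffic to the tested page is steady and split evenly across variations. The function names and traffic numbers are hypothetical.

```python
from math import ceil

def weeks_to_reach_sample(sample_per_variation, variations, weekly_visitors):
    """Full weeks needed for every variation to reach its sample size."""
    total_needed = sample_per_variation * variations
    return ceil(total_needed / weekly_visitors)

def is_testable(sample_per_variation, variations, weekly_visitors, max_weeks=4):
    """True if the required sample can be reached within max_weeks full weeks."""
    weeks = weeks_to_reach_sample(sample_per_variation, variations, weekly_visitors)
    return weeks <= max_weeks

print(weeks_to_reach_sample(47127, 2, 30000))  # 4
print(is_testable(47127, 2, 10000))            # False
```

If the answer comes back as one week, you’d still want to run the test for at least two, for the business cycle reasons covered earlier.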
Julia Starostenko, Data Scientist at Shopify, agrees, explaining:
"Experimenting is fun! But it is important to make sure that the results are accurate.
Ask yourself: is your audience large enough? Have you collected enough data? In order to achieve true statistical significance (within a reasonable timeframe) the audience size needs to be large enough."
What should you A/B test?
I can’t tell you what you should A/B test. I know, I know. It would certainly make your life easier if I could give you a list of 99 things to test right now. There’s no shortage of marketers willing to do that for the clicks.
Truth is, the only tests worth running are tests based on your own data. I don’t have access to your data, your customers, etc. and neither does anyone curating those huge lists of A/B test ideas. None of us can meaningfully tell you what to test.
Instead, I encourage you to answer this question for yourself through qualitative and quantitative analysis. That might mean:
- Technical Analysis: Does your store load properly and quickly on every browser? On every device? You might have a shiny new iPhone X, but someone somewhere is still rocking a Motorola Razr from 2005. If your site doesn’t work properly and quickly, it definitely doesn’t convert as well as it could.
- On-Site Surveys: These pop up as your store’s visitors browse around. For example, an on-site survey might ask visitors who have been on the same page for a while if there’s anything holding them back from making a purchase today. If so, what is it? You can use this qualitative data to improve your copy and conversion rate.
- Customer Interviews: Nothing can replace getting on the phone and talking to your customers. Why did they choose your store over competitive stores? What problem were they trying to solve when they arrived on your site? There are a million questions you could ask to get to the heart of who your customers are and why they really buy from you.
- Customer Surveys: Customer surveys are full-length surveys that go out to people who have already made a purchase instead of visitors. When designing a survey, you want to focus on: defining your customers, defining their problems, defining hesitations they had prior to purchasing, and identifying words and phrases they use to describe your store.
- Analytics Analysis: Are your analytics tools tracking and reporting your data correctly? That might sound silly, but you’d be surprised by how many analytics tools are configured incorrectly. Analytics analysis is all about diving into your analytics and analyzing how your visitors behave. For example, you might focus on the funnel. Where is it leaking? In other words, where are most people dropping out of your funnel? That’s a good place to start testing.
- User Testing: This is where you watch real people try to perform tasks on your site. For example, you might ask them to find a video game in the $40-60 range and add it to their cart. While they’re performing the tasks, they’ll be narrating their thoughts and actions out loud.
- Session Replays: Session replays are similar to user testing, but now you’re dealing with real people with real money and real intent to buy. You’ll watch as your actual visitors navigate your site. What do they have trouble finding? Where do they get frustrated? Where do they seem confused?
There are additional types of research as well, but these seven methods are a good starting point. If you run through some of them, you will have a huge laundry list of data-informed ideas worth testing. I guarantee your list will bring you more value than any “99 things to test right now” article ever could.
Prioritizing A/B test ideas
A huge list of A/B test ideas is exciting, but not exactly helpful for deciding what to test. Where do you start? That’s where prioritization comes in.
There are a few common prioritization frameworks you can use:
- ICE: ICE stands for Impact, Confidence and Ease. Each of those factors receives a 1-10 ranking. For example, if you could easily run the test by yourself without help from a developer or designer, you might give Ease an eight. You’re using your judgement here and if you have more than one person running tests, rankings may become too subjective. It helps to have a set of guidelines to guide everyone towards objectivity.
- PIE: PIE stands for Potential, Importance and Ease. Again, each factor receives a 1-10 ranking. For example, if the test will reach 90% of your traffic, you might give Importance an eight. PIE is as subjective as ICE, so guidelines can be helpful for this framework as well.
- PXL: PXL is the prioritization framework from CXL. It’s a little bit different and more customizable, forcing more objective decisions. Instead of three factors, you’ll find yes / no questions and an ease of implementation question. For example, the framework might ask: “Is the test designed to increase motivation?” If yes, it gets a 1. If no, it gets a 0. You can learn more about this framework and download the spreadsheet here.
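To make the scoring concrete, here’s a small Python sketch that ranks a backlog using ICE. The frameworks above don’t dictate how to combine the three ratings, so averaging them is my assumption (some teams multiply instead). The ideas and ratings are made up.

```python
def ice_score(impact, confidence, ease):
    """Average three 1-10 ratings into one score (averaging is an assumption)."""
    for rating in (impact, confidence, ease):
        if not 1 <= rating <= 10:
            raise ValueError("ratings must be between 1 and 10")
    return (impact + confidence + ease) / 3

# Hypothetical backlog: idea -> (impact, confidence, ease)
ideas = {
    "Rewrite product description": (7, 6, 8),
    "Redesign checkout flow": (9, 5, 2),
    "Add trust badges to cart": (4, 5, 9),
}

ranked = sorted(ideas, key=lambda name: ice_score(*ideas[name]), reverse=True)
for name in ranked:
    print(name, round(ice_score(*ideas[name]), 1))
```

The same structure works for PIE if you swap in Potential, Importance, and Ease.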
Now you have an idea of where to start, but it can also help to categorize your ideas. For example, during some conversion research I did recently, I used three categories: implement, investigate, and test.
- Implement: Just do it. It’s broken or obvious.
- Investigate: Requires extra thought to define the problem or narrow in on a solution.
- Test: The idea is sound and data-informed. Test it!
Between this categorization and prioritization, you’re set.
Crash course in A/B testing statistics
Before you run a test, it’s important to dig into statistics. I know, statistics usually aren’t a fan favorite, but think of this as the required course you begrudgingly take to graduate.
Statistics is a big part of A/B testing. Fortunately, testing tools have made the job of an optimizer easier, but a basic understanding of what’s happening behind the scenes is crucial for analyzing your test results later on.
Alex Birkett, Growth Marketing Manager at HubSpot, explains:
"Statistics isn't a magic number of conversions or a binary 'Success! 🤩' or 'Failure 😞' thing. It's a process used to make decisions under uncertainty and to reduce risk by trying to reduce the fogginess on what the outcome of a given decision will be.
With that in mind, I think it's most necessary to know the basics: what's a mean, variance, sampling, standard deviation, regression to the mean, and what constitutes a ‘representative’ sample. In addition, it helps when you're starting out with A/B testing to set up some specific guard rails to mitigate as much human error as possible."
What is mean?
Mean is the average. Your goal is to find a mean that is representative of the whole.
For example, let’s say you’re trying to find the average price of video games. You’re not going to add the price of every video game in the world and divide it by the number of all the video games in the world. Instead, you’ll isolate a small sample that is representative of all of the video games in the world.
You might end up finding the average price of a couple hundred video games. If you’ve selected a representative sample, the mean price of those two hundred video games should be representative of all the video games in the world.
What is variance?
Variance measures how spread out the individual data points are around the mean. Essentially, the higher the variability, the less accurate the mean will be in predicting an individual data point.
So, how close is the mean to the actual price of each individual video game?
What is sampling?
Sampling means measuring a subset of the whole instead of everything. The larger the sample size, the less the sample mean will bounce around from one sample to the next, which means the mean is more likely to be accurate.
So, if you increased your sample from two hundred video games to two thousand video games, you’d have a more precise mean.
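Here’s a minimal Python sketch of the video game example. The prices are randomly generated between $10 and $70, which is purely an assumption for illustration. It prints the sample size, the mean, and the standard error (a rough measure of how far the sample mean is likely to sit from the true mean).

```python
import random
from math import sqrt
from statistics import mean, variance

random.seed(42)  # fixed seed so the sketch is repeatable

def draw_prices(n):
    """Draw n hypothetical video game prices between $10 and $70."""
    return [random.uniform(10, 70) for _ in range(n)]

def standard_error(sample):
    """Estimate how far the sample mean is likely to sit from the true mean."""
    return sqrt(variance(sample) / len(sample))

small, large = draw_prices(200), draw_prices(2000)
for sample in (small, large):
    print(len(sample), round(mean(sample), 2), round(standard_error(sample), 2))
```

The spread of individual prices stays about the same in both samples. What shrinks with the larger sample is the uncertainty around the mean.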
What is statistical significance?
Assuming there’s no difference between A and B, how often will you see the effect just by chance?
The lower the statistical significance level, the bigger the chance that your winning variation is not a winner at all.
Simply put, a low significance level means that there is a big chance that your ‘winner’ is not a real winner (this is known as a false positive).
Be aware that most A/B testing tools call statistical significance without waiting for a predetermined sample size or point in time to be reached. That’s why you might notice your test flipping back and forth between statistically significant and statistically insignificant.
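For a sense of what sits behind that icon, here’s a rough Python sketch of a two-proportion z-test, one common way to get a p-value for the difference between two conversion rates. Testing tools use a range of methods, so treat this as an illustration rather than what your tool does. The visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: roughly 5.0% vs 5.75% conversion on 13,533 visitors each.
p = two_proportion_p_value(677, 13533, 778, 13533)
print(round(p, 4))
```

A small p-value says a difference this large would be rare if A and B were truly the same. It says nothing about whether you’ve collected enough data or run the test long enough.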
Peep Laja, Founder of CXL Institute, wishes more people really understood statistical significance and why it’s important:
"Statistical significance does not equal validity, it's not a stopping rule. When you reach 95% statistical significance or higher, that means very little before two other, more important, conditions have been met:
1. There's enough sample size, which you figure out using sample size calculators. Meaning, enough people have been part of the experiment so that we can conclude anything at all.
2. The test has run long enough so the sample is representative (and not too long to avoid sample pollution). In most cases you'll want to run your tests 2, 3 or 4 weeks depending on how fast can you get the needed sample."
What is regression to the mean?
You might notice extreme fluctuations at the beginning of your A/B test.
Regression to the mean is the phenomenon that says if something is extreme on its first measurement, it will likely be closer to the average on its second measurement.
If the only reason you’re calling a test is because it’s reached statistical significance, you could be seeing a false positive. Your winning variation will likely regress to the mean over time.
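You can watch this happen in a simulation. The Python sketch below runs a batch of A/A tests, where both versions share the same true conversion rate, and counts how many get flagged as significant at any of ten peeks versus only at the planned end. The run count, visitor count, and 5% rate are arbitrary assumptions chosen to keep it fast.

```python
import random
from math import sqrt
from statistics import NormalDist

def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided z-test on two arms of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):
        return False
    z = (conv_b - conv_a) / n / sqrt(pooled * (1 - pooled) * 2 / n)
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

def simulate_aa_tests(runs=300, visitors=2000, rate=0.05, checks=10, seed=1):
    """Share of A/A tests flagged at any peek versus only at the planned end."""
    rng = random.Random(seed)
    step = visitors // checks
    flagged_on_peek = flagged_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        ever_flagged = False
        for n in range(1, visitors + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % step == 0 and is_significant(conv_a, conv_b, n):
                ever_flagged = True
        flagged_on_peek += ever_flagged
        flagged_at_end += is_significant(conv_a, conv_b, visitors)
    return flagged_on_peek / runs, flagged_at_end / runs

print(simulate_aa_tests())
```

Since there’s no real difference between the two arms, every flag is a false positive. Stopping at the first significant peek can only produce as many or more of them than waiting for the planned end.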
What is statistical power?
Assuming there’s a difference between A and B, how often will you see the effect?
The lower the power level, the bigger the chance that a winner will go unrecognized. The higher the power level, the lower the chance that a winner will go unrecognized. Really, all you’ll need to know is that 80% statistical power is standard for most A/B testing tools.
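To get a feel for power, here’s a rough Python sketch that estimates it for a given sample size using a normal approximation. It assumes a two-sided test at 5% significance, a 5% baseline conversion rate, and a 15% relative lift, all of which are illustrative choices rather than settings from any particular tool.

```python
from math import sqrt
from statistics import NormalDist

def approximate_power(baseline, relative_lift, n_per_variation, alpha=0.05):
    """Approximate chance of detecting a real lift with a two-sided z-test."""
    target = baseline * (1 + relative_lift)
    delta = target - baseline
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    sd_null = sqrt(2 * baseline * (1 - baseline))
    sd_alt = sqrt(baseline * (1 - baseline) + target * (1 - target))
    z = (delta * sqrt(n_per_variation) - z_alpha * sd_null) / sd_alt
    return NormalDist().cdf(z)

print(round(approximate_power(0.05, 0.15, 13533), 2))  # 0.8
print(round(approximate_power(0.05, 0.15, 2000), 2))
```

With a much smaller sample, the chance of spotting the same real lift drops well below a coin flip. That’s the false negative problem in a nutshell.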
Ton Wesseling, Founder of Online Dialogue, wishes more people knew about statistical power, though:
"Lots of people worry about false positives. We worry way more about false negatives. Why run experiments where the chances of finding proof that your positive change has an impact is really low…?"
What are external validity threats?
There are external factors that threaten the validity of your tests. For example:
- Black Friday Cyber Monday (BFCM) sales.
- A positive or negative press mention.
- A major paid campaign launch.
- The day of the week.
- The changing seasons.
Let’s say, for example, you were to run a test during December. Major shopping holidays would mean an increase in traffic for your store during that month. You might find in January that your December winner is no longer performing well.
Why? Because of an external validity threat: the holidays.
The data you based your test decision on was an anomaly. When things settle down in January, you might be surprised to find your winner losing.
You can’t eliminate external validity threats, but you can mitigate them by running tests for full weeks (e.g. don’t start a test on a Monday and end it on a Friday), including different types of traffic (e.g. don’t test paid traffic exclusively and then roll out the results to every traffic source) and being mindful of potential threats.
If you happen to be running a test during a busy shopping season, like BFCM, or through a major external validity threat, you will find this article helpful.
How to set up an A/B test
Before you test anything, you need to have a solid hypothesis. (Great, we just finished math class and now we’re on to science.)
Don’t worry, it’s not complicated. Basically, you need to test a hypothesis, not an idea. A hypothesis is measurable, aspires to solve a specific conversion problem and focuses on insights instead of wins.
Whenever I’m writing a hypothesis, I use a formula borrowed from Craig Sullivan’s Hypothesis Kit:
- Because I saw [insert data / feedback from research]
- I expect that [change you’re testing] will cause [impact you anticipate]
- I’ll measure this using [data metric]
Easy, right? All you have to do is fill in the blanks and your test idea has transformed into a hypothesis.
Choosing an A/B testing tool
The three tools below are all good, safe options.
- Google Optimize: Free, save for some multivariate limitations, which shouldn’t really impact you if you’re just getting started. Works closely with Google Analytics, which is a plus.
- Optimizely: Easy to get minor tests up and running, even without technical skills. Stats Engine makes it easier to analyze test results. Typically, Optimizely is the most expensive option of the three.
- VWO: VWO has SmartStats to make analysis easier. Plus, they have a great WYSIWYG editor for beginners. Every VWO plan comes with heatmaps, on-site surveys, form analytics, etc. as well.
We also have some testing tools in the Shopify App Store that you might find helpful.
Once you’ve selected a tool, simply sign up and follow the instructions provided. The process varies from tool to tool. Typically, though, you’ll be asked to install a snippet on your site and set goals.
How to analyze A/B test results
Remember when I said writing a hypothesis shifts the focus from wins to insights? Krista Seiden, Analytics Advocate and Product Manager at Google, explains what that means:
"The most overlooked aspect of A/B testing is learning from your losers. In fact, in the optimization programs I've run, I make a habit of publishing a 'failures report' where I call out some of the biggest losers of the quarter and what we learned from them.
One of my all time favorites was from a campaign that was months in the making. We were able to sneak in an A/B test of the new campaign landing page just before it was set to go live, and it's a good thing we did, because it failed miserably. Had we actually launched the page as it was, we would have taken a significant hit to the bottom line. Not only did we end up saving the business a ton of money, but we were able to dig in and make some assumptions (that we later tested) about why the new page had performed so poorly, and that made us better marketers and more successful in future campaigns."
If you craft your hypothesis correctly, even a loser is a winner because you’ll gain insights you can use for future tests and in other areas of your business. So, when you’re analyzing your test results, you need to focus on the insights, not whether the test won or lost. There’s always something to learn, always something to analyze. Don’t dismiss the losers!
The most important thing to note here is the need for segmentation. A test might be a loser overall, but chances are it performed well with at least one segment. What do I mean by segment?
- New visitors.
- Returning visitors.
- iOS visitors.
- Android visitors.
- Chrome visitors.
- Safari visitors.
- Desktop visitors.
- Tablet visitors.
- Organic search visitors.
- Paid visitors.
- Social media visitors.
- Logged in buyers.
You get the idea, right?
When you’re looking at the results in your testing tool, you’re looking at the whole pack of Smarties. What you need to do is separate the Smarties by color so you can eat the red ones last. I mean, so you can uncover deeper, segmented insights.
Odds are that the hypothesis was proven right among certain segments. That tells you something as well.
TL;DR: Analysis is about so much more than whether the test was a winner or a loser. Focus on the insights and segment your data to find hidden insights below the surface.
A/B testing tools won’t do the analysis for you, so this is an important skill to develop over time.
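As a starting point, here’s a minimal Python sketch of what segmenting results can look like: group visits by segment and variation, then compute a conversion rate for each group. The record format is hypothetical; a real export from your testing or analytics tool will look different.

```python
from collections import defaultdict

def conversion_by_segment(visits):
    """Group visit records by (segment, variation) and compute conversion rates."""
    totals = defaultdict(lambda: [0, 0])  # key -> [visitors, conversions]
    for visit in visits:
        key = (visit["segment"], visit["variation"])
        totals[key][0] += 1
        totals[key][1] += visit["converted"]
    return {key: conv / seen for key, (seen, conv) in totals.items()}

# Hypothetical export: one record per visitor.
visits = [
    {"segment": "new", "variation": "control", "converted": 0},
    {"segment": "new", "variation": "variant", "converted": 1},
    {"segment": "returning", "variation": "control", "converted": 1},
    {"segment": "returning", "variation": "variant", "converted": 0},
    {"segment": "new", "variation": "variant", "converted": 1},
    {"segment": "new", "variation": "control", "converted": 1},
]

for key, rate in sorted(conversion_by_segment(visits).items()):
    print(key, rate)
```

Keep in mind that each segment is a smaller sample than the test as a whole, so the sample size rules from earlier apply to every segment you draw conclusions from.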
How to archive past A/B tests
Let’s say you run your first test tomorrow. Two years from tomorrow, will you remember the details of tomorrow’s test? Not likely.
That’s why archiving your A/B test results is important. Without a well-maintained archive, all those insights you’re gaining will be lost. Plus, I kid you not, it’s very easy to test the same thing twice if you’re not archiving.
There’s no “right” way to do this, though. You could use a tool like Projects or Effective Experiments, or you could use Excel. It’s really up to you, especially when you’re just getting started. Just make sure you’re keeping track of:
- The hypothesis.
- Screenshots of the control and variation.
- Whether it won or lost.
- Insights gained through analysis.
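If you’d rather keep the archive in a structured file than a spreadsheet, one entry could look something like this Python sketch. The `ArchivedTest` class, its field names, the file paths, and the example test are all hypothetical. The hypothesis follows the fill-in-the-blanks formula from earlier.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ArchivedTest:
    """One archive entry; the field names are illustrative, not a standard."""
    name: str
    hypothesis: str
    control_screenshot: str
    variation_screenshot: str
    won: bool
    insights: list = field(default_factory=list)

entry = ArchivedTest(
    name="Product page size guide",  # hypothetical test
    hypothesis=(
        "Because I saw survey responses about unclear sizing, I expect that "
        "adding a size guide will cause more add-to-carts. "
        "I'll measure this using add-to-cart rate."
    ),
    control_screenshot="screenshots/control.png",
    variation_screenshot="screenshots/variation.png",
    won=False,
    insights=["Returning visitors responded better than new visitors."],
)

print(json.dumps(asdict(entry), indent=2))
```

Anything that captures the four items above will do. The format matters far less than the habit.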
As you grow, you’ll thank yourself for keeping this archive. Not only will it help you, but it will help new hires and advisors / stakeholders as well.
A/B testing processes of the pros
Now that you’ve been through a standard A/B testing process, let’s take a look at the exact processes of pros from companies like Google and HubSpot.
Krista Seiden, Google
My step by step process for A/B testing starts with analysis—in my opinion, this is the core of any good testing program. In the analysis stage, the goal is to analyze your analytics data, survey or UX data, or any other sources of customer insight you might have in order to understand where your opportunities for optimization are.
Once you have a good pipeline of ideas from the analysis stage, you can move on to hypothesize what might be going wrong and how you could potentially fix or improve these areas of optimization.
Next, it's time to build and run your tests. Be sure to run them for a reasonable amount of time (I default to two weeks to ensure I'm accounting for week over week changes or anomalies), and when you have enough data, analyze your results to determine your winner.
It's also important to take some time in this stage to analyze the losers as well—what can you learn from these variations?
Finally, and you may only reach this stage once you've spent time laying the groundwork for a solid optimization program, it's time to look into personalization. This doesn't necessarily require a fancy toolset, but rather can come out of the data you have about your users.
Personalization can be as easy as targeting the right content to the right locations, or as complex as targeting based on individual user actions. Don't jump in all at once on the personalization bit though, be sure you spend enough time to get the basics right first.
Alex Birkett, HubSpot
At a high level, I try to follow this process:
- Collect data and make sure analytics implementations are accurate.
- Analyze data and find insights.
- Turn insights into hypotheses.
- Prioritize based on impact and ease, and maximize my allocation of resources (especially technical resources).
- Run test (following statistics best practices to the best of my knowledge and ability).
- Analyze results & implement or not according to results.
- Iterate based on findings and repeat.
Put more simply: research, test, analyze, repeat.
While this process can deviate or change based on what the context is (Am I testing a business critical product feature? A blog post CTA? What's the risk profile and balance of innovation vs. risk mitigation?), it's pretty applicable to any size or type of company.
The point is this process is agile, but it also collects enough data, both qualitative customer feedback and quantitative analytics, to be able to come up with better test ideas and better prioritize them so as to not waste traffic.
Ton Wesseling, Online Dialogue
The first question we always answer when we want to optimize a customer journey is: where does this product or service fit on the ROAR model that we created at Online Dialogue? Are you still in the risk phase where we could do lots of research, but can't validate our findings through online experiments (below 1,000 conversions per month) or are you in the optimization phase? Or even above?
- Risk Phase: lots of research, which will be translated into anything from a business model pivot to a whole new design and value proposition.
- Optimization Phase: large experiments that will optimize the value proposition and the business model.
- Optimization Phase: small experiments to validate user behavior hypotheses, which will build up knowledge for larger design changes.
- Automation: you still have experimentation power (visitors) left, meaning your full test potential is not needed to validate your user journey. What's left should be used to exploit, to grow faster now (without focus on long-term learnings). This could be automated by running bandits / using algorithms.
- Re-think: you stop adding lots of research, unless it's a pivot to something new.
So A/B testing is only a big thing in the optimization phase of ROAR and beyond (until re-think).
Our approach to running experiments is the FACT & ACT model.
The research we do is based on our 5V Model.
We gather all these insights to come up with a main research-backed hypothesis, which will lead to sub-hypotheses that will be prioritized based on the data gathered. The higher the chance of the hypothesis being true, the higher it will be ranked.
Once we learn if our hypothesis is true or false, we can start combining learnings and take bigger steps by redesigning / realigning larger parts of the customer journey. However, at some point, all winning implementations will lead to a local maximum. Then you need to take a bigger step to be able to reach a potential global maximum.
And of course the main learnings will be spread throughout the company, which leads to all sorts of broader optimization and innovation based on your validated first-party insights.
Julia Starostenko, Shopify
The purpose of an experiment is to validate that making changes to an existing webpage will have a positive impact to the business.
Before getting started, it’s important to determine if running an experiment is truly necessary. Consider the following scenario: there is a button with an extremely low click rate. It would be near impossible to decrease the performance of this button. Validating the effectiveness of a proposed change to the button (i.e. running an experiment) is therefore not necessary.
Similarly, if the proposed change to the button is small, it probably isn't worth spending the time setting up, executing and tearing down an experiment. In this case, the changes should just be rolled out to everyone and performance of the button can be monitored.
If it is determined that running an experiment would in fact be beneficial, the next step is to define the business metric(s) that should be improved (e.g. increase the conversion rate of a button). Then we ensure that proper data collection is in place.
Once this is complete, the audience is randomly split into two groups; one group is shown the existing version of the button while the other group gets the new version. The conversion rate of each audience is monitored, and once statistical significance is reached, the results of the experiment are determined.
Peep Laja, CXL Institute
A/B testing is a part of a bigger conversion optimization picture. In my opinion it's 80% about the research and only 20% about testing. Conversion research will help you determine what to test to begin with.
My process typically looks like this (a simplified summary):
- Conduct conversion research using a framework like ResearchXL to identify issues on your site.
- Pick a high priority issue (one that affects a large portion of users and is a severe issue) and brainstorm as many solutions to this problem as you can. Inform your ideation process with your conversion research insights. Determine which device you want to run the test on (you need to run separate tests on desktop and mobile).
- Determine how many variations you can test (based on your traffic / transaction level), and then pick your best 1-2 ideas for a solution to test against control.
- Wireframe the exact treatments (write the copy, make the design changes etc.) Depending on the scope of changes you might also need to include a designer to design new elements.
- Have your front-end developer implement the treatments in your testing tool. Set up necessary integrations (Google Analytics), set appropriate goals.
- Conduct QA on the test (broken tests are by far the biggest A/B test killer) to make sure it works with every browser/device combo.
- Launch the test!
- Once the test is done, conduct post-test analysis.
- Depending on the outcome either implement the winner, or iterate on the treatments or go and test something else.
You have the process, you have the power! So, get out there and start testing your store. Before you know it, those insights will add up to more money in The Bank of You.
If you have any questions about what you’ve read or if you run into trouble along the way, just leave a comment below.