How A/B test scores are calculated
The main challenge of A/B testing is ensuring that tests are reliable and that their results are significant for your business.
A/B testing is grounded in statistical inference. Algolia uses well‑established, reproducible methods to determine whether differences between the control and variants are likely due to the change rather than chance.
The method and math at a glance
- Randomness: the assignment of any one user to scenario A or B is purely random.
- Confidence: measures the certainty of a test's outcome and is quantified with the p-value. For classic two-variant tests, a result is confident (statistically significant) when the p-value is below 0.05 (95% confidence). If you compare more than two variants, the p-value threshold is stricter so the test results stay reliable. To learn more, see Multi-variant testing.
- Mathematical formula: the same two-tailed test is applied to every comparison between the control variant and each test variant. Two-tailed tests detect differences in either direction (better or worse) without assuming which variant performs better (see the sketch after this list).
- Relevance improvement: in practice, you add variants because you hypothesize that one of them performs better than the control. Multi-variant tests let you try several ideas at once and show which, if any, delivers better impact.
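
The exact formula isn't spelled out here, but the kind of comparison described above can be illustrated with a two-proportion, two-tailed z-test checked against a 0.05 threshold. The function below is a minimal sketch with invented numbers, not Algolia's implementation:

```python
from math import erf, sqrt

def two_tailed_p_value(conversions_a, users_a, conversions_b, users_b):
    """Two-proportion z-test comparing a control (A) against a variant (B).

    Returns the two-tailed p-value: the probability of observing a difference
    at least this large, in either direction, if A and B truly perform the same.
    """
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    # Pooled rate under the null hypothesis that A and B are identical.
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_error
    # Standard normal CDF at |z|, then doubled to cover both tails.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return 2 * (1 - cdf)

# Swapping control and variant gives the same p-value:
# the test makes no assumption about which side performs better.
print(two_tailed_p_value(1000, 10_000, 1100, 10_000))  # ≈ 0.021
print(two_tailed_p_value(1100, 10_000, 1000, 10_000))  # ≈ 0.021
```

With these made-up numbers, a 10% relative difference on 10,000 users per variant gives a p-value below 0.05, so it would count as confident at the 95% level.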
Statistical significance or chance
When you run your tests, you may get results that show a 4% increase in one of the measured metrics. Statistical significance is concerned with whether that 4% increase is due to chance or is real. The statistical concern is whether your sample group truly represents the larger population: does the 4% hold only for that sample group, or does it reasonably predict the behavior of the larger population?
If the sample doesn’t represent the larger population, then your results are due to chance. Statistical significance (the confidence indicator) distinguishes chance from a real change. When you reach confidence, the observed differences between the control and the variants are likely not due to chance but something you can expect (or predict) for the larger population as well.
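
To make the 4% example concrete, the sketch above can be rerun with the same observed lift (a 10% conversion rate in the control versus 10.4% in the variant, figures invented) at different sample sizes:

```python
# Same observed lift (10% -> 10.4% conversion rate), different sample sizes.
# Reuses the two_tailed_p_value sketch defined earlier.
for users in (1_000, 10_000, 100_000, 1_000_000):
    p = two_tailed_p_value(conversions_a=int(users * 0.100), users_a=users,
                           conversions_b=int(users * 0.104), users_b=users)
    print(f"{users:>9} users per variant: p = {p:.4f}  confident: {p < 0.05}")
```

With these invented figures, the same observed difference is indistinguishable from chance at a few thousand users per variant but becomes confident once the sample is large enough, which is the point of the next section.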
Large, distributed samples
Large data samples are necessary to reach confidence. When flipping a coin 1,000 times, you can expect a close to 50:50 ratio of heads and tails. If you flip it just a few times, the ratio can be heavily skewed (it is completely possible to flip heads three times in a row, but very unlikely to do so 1,000 times).
Increasing the sample size stabilizes the results and increases confidence in them. Each new search event clarifies the underlying pattern and generally leads towards a reliable outcome.
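
The coin-flip intuition is easy to check with a short simulation (purely illustrative, unrelated to how Algolia processes search events): small runs swing widely, while large runs settle near 50%.

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

for flips in (10, 100, 1_000, 100_000):
    heads = sum(random.random() < 0.5 for _ in range(flips))
    print(f"{flips:>7} flips: {heads / flips:.1%} heads")
```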
Sample diversity
Be careful when you schedule tests. Testing during a sales campaign, a major holiday, or some other exceptional event can undermine the reliability of your results.