Create and run an A/B test
Some of the following only matter if you’re A/B testing two indices. If you’re A/B testing one index, you can skip those parts.
You need to provide four pieces of information for an A/B test:
- Test name: Choose a descriptive name. If you’re comparing the “most-viewed” and “most-sold” sort orders, call the test “Which is better: most-viewed or most-sold?”.
- Variants: The control variant is the index receiving searches and associated events. Variants B, C, D, … are the versions you want to test (limited to 5 variants). Give each variant a short description so you can remember what it tests at a glance.
- Traffic split: Start with an even split. If you’re uncertain which variant will perform best, allocate 50% of traffic to the control variant and evenly distribute traffic to the others. Make sure the percentages total 100%.
- Duration: On low-traffic sites, run tests for longer to collect sufficient data for statistical confidence. A more even split can shorten the time needed to achieve reliable results.
When testing with replica indices, always set the main index (the one that receives search traffic) as the control variant. Use the replica index for a test variant. If you set the replica as the control, it won’t receive any traffic, because Algolia assigns variants during the search request.
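For reference, the sketch below shows roughly how these four pieces of information map onto a test definition when a test is created programmatically through the A/B Testing API (the dashboard collects the same details). The index names, descriptions, and end date are placeholder assumptions.

```python
# A minimal sketch of an A/B test definition: the same four pieces of
# information the dashboard asks for, expressed as an API payload.
# Index names, descriptions, and the end date are placeholders.
ab_test = {
    # Test name: descriptive enough to recognize later
    "name": "Which is better: most-viewed or most-sold?",
    # Variants: the control (main index) first, then the variant to test
    "variants": [
        {
            "index": "products",            # main index = control variant
            "trafficPercentage": 50,
            "description": "Control: most-viewed sort order",
        },
        {
            "index": "products_most_sold",  # replica index = variant B
            "trafficPercentage": 50,        # percentages must total 100
            "description": "Variant B: most-sold sort order",
        },
    ],
    # Duration: the API expresses it as an end date and time (ISO 8601)
    "endAt": "2030-01-31T00:00:00Z",
}
```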
Estimate sample size and test duration
Accurate sample size determination is essential for effective A/B testing and reliable results. The sample size estimation tool simplifies this process by estimating the number of searches needed to confidently detect meaningful differences between experiment variations.
Sample size estimator
The sample size estimator uses your historical data to establish a baseline for performance metrics, such as conversion or click-through rates. Algolia uses these metrics to calculate the optimal sample size required for statistically significant changes. With fixed statistical parameters—80% power and a 5% significance level (adjusted in multi-variant testing)—the tool ensures robust and reliable results. Based on your traffic patterns, Algolia will recommend a test duration that corresponds to the required sample size, rounding up to the nearest week to account for variability in business cycles.
Key features:
- Historical baseline rate. The tool uses your existing data to establish the current value of the performance metric, such as conversion or click-through rate, against which changes are measured.
- Fixed statistical parameters. A statistical power of 80% and a significance level of 5% (adjusted in multi-variant testing) provide a balance between detecting real effects and minimizing false positives.
- Customizable metrics and effect size. Select the metric you want to measure, such as conversions or clicks, and define the minimum detectable effect size that aligns with your objectives.
Choose an appropriate effect size
The effect size is the smallest relative change in a metric you consider significant enough to act on. Choosing an appropriate effect size is essential for accurate and efficient A/B tests.
For example, if you would adopt a change only if it increases your conversion rate by 5%, set your effect size to 5%. With a baseline conversion rate of 10% for the control variant, a test variant would need to show a conversion rate of 10.5% (a 5% relative gain) to count as an improvement. The sample-size estimator shows how many searches are required to detect that improvement with 80% power at a 5% significance level (adjusted in multi-variant testing).
When choosing an effect size, consider:
- Impact. Consider the smallest change that would make a meaningful improvement in your goals. For example, a 2% relative increase in conversion rate might be significant for you, while another organization might aim for a 5% relative change.
- Historical data. Review past experiments to understand typical variations and set an effect size that’s realistic and achievable based on historical performance. For example, if other changes you have tested result in a 3% relative increase in conversion rate, you might set your effect size to 3%.
- Balance between sensitivity and practicality. Smaller effect sizes, such as 1% to 2%, require larger sample sizes but let you detect subtle changes. Larger effect sizes (for example, 5% to 10%) require smaller samples and are easier to detect but may overlook smaller, yet important, changes. The sketch after this list illustrates the trade-off.
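For illustration, here's a minimal sketch of the textbook two-proportion sample size calculation at 80% power and a 5% significance level. It isn't Algolia's exact estimator, which works from your historical data and adjusts the significance level for multi-variant tests, but it shows why smaller effect sizes demand much more traffic.

```python
from statistics import NormalDist

def required_searches_per_variant(
    baseline_rate: float,    # e.g., 0.10 = 10% conversion rate for the control
    relative_effect: float,  # e.g., 0.05 = detect a 5% relative improvement
    alpha: float = 0.05,     # significance level (two-sided)
    power: float = 0.80,     # statistical power
) -> int:
    """Classic two-proportion sample size formula (sample size per variant)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_effect)  # 10% -> 10.5% for a 5% relative gain
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
        + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5
    ) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

# Smaller effect sizes need far larger samples:
print(required_searches_per_variant(0.10, 0.05))  # ~57,800 searches per variant
print(required_searches_per_variant(0.10, 0.02))  # ~356,000 searches per variant
```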
Start the A/B test
Create new tests from the Algolia dashboard: press Create to start your A/B test. You can stop the test before its planned duration or wait until it finishes. However, you can't stop and then restart a test, because doing so would undermine the accuracy of the data. Complete, uninterrupted testing is essential for an accurate A/B test.
View the results
Overview page
The overview page lists every A/B test in your application with its ID, name, start date, planned duration, and current status (see Test statuses).
A/B tests are listed from most recent to oldest, with active tests always at the top. Click the chevron next to Start date to change the sort order.
Select a test to open its details page.
A/B test details page
At the top, you see a card for each variant (Control, B, C, …) showing its traffic share, tracked users, and searches.
The Metrics breakdown section shows a table for each metric, comparing the tested variants using the following columns:
- Variant. The tested variant.
- Tracked searches (or Searches for metrics that are calculated regardless of `clickAnalytics`). The number of searches served by this variant.
- Metric value. The observed value of the metric (for example, CTR or ATCR).
- Difference. Relative change with respect to the control, expressed as a percentage difference.
- Confidence interval. The statistical range within which the true metric value is expected to lie.
- Confidence rating. Shows how reliable the observed difference is (for example, Trending confident or Unconfident). For details, see Confidence. Hover over any rating to see the exact p-value.
You can also open the full Analytics dashboard for the test’s variants by clicking View analytics in the top right corner.
In the Additional insights section, you can inspect the Confidence over time chart, which lets you select a variant and metric to see how its estimate and confidence interval evolve during the test.
If possible, wait until the A/B test has finished before interpreting the results.
A/B tests show results in the dashboard within the first hour after you create them, but metric comparisons won't show until at least one week, or 20% of the test duration, has elapsed. This mitigates the risk of drawing inaccurate conclusions from insufficient data. Test results are updated every day.
Test statuses
The potential test statuses are:
- In progress - Early: the test has started, but there is insufficient data to draw reliable conclusions. Metric comparisons don't show until at least one week, or 20% of the test duration, has elapsed. Hover over the badges to see data, but avoid drawing conclusions during this stage.
- In progress: the test has been running for a while, and metrics are being collected and compared.
- Failed to start: the test couldn't be created, usually because of a problem with your index or with how the test was set up through the Search API. Try recreating the test or contact the Algolia support team.
- Stopped: the test was stopped manually. Your app is back to normal: index A performs as usual, receiving 100% of search requests. When you stop a test, all associated metadata and metrics are stored, and the test remains visible in the Algolia dashboard. However, the results may be inconclusive.
- Completed: the test has finished. Your app is back to normal: index A performs as usual, receiving 100% of search requests.
Deleting a test will permanently delete its metadata and metrics and remove it from the Algolia dashboard.
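If you monitor tests outside the dashboard, the A/B Testing REST API exposes the same status information. The sketch below lists your tests and prints their statuses; the application ID, API key, and default analytics host are placeholders, and the field names follow the public A/B Testing API.

```python
import requests

APP_ID = "YourApplicationID"     # placeholder
API_KEY = "YourAnalyticsAPIKey"  # a key with the analytics ACL

# Default A/B Testing API host; adjust if your analytics data lives in a specific region.
BASE_URL = "https://analytics.algolia.com/2/abtests"
HEADERS = {
    "X-Algolia-Application-Id": APP_ID,
    "X-Algolia-API-Key": API_KEY,
}

# List tests and print each one's ID, name, and status.
response = requests.get(BASE_URL, headers=HEADERS, params={"offset": 0, "limit": 10})
response.raise_for_status()
for test in response.json().get("abtests") or []:
    print(test["abTestID"], test["name"], test["status"])
```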
How to interpret results
What you consider good or bad is entirely dependent on your needs. Compare the effort and cost of an improvement with its benefits. A 4% improvement in click-through or conversion rate might not be convincing or profitable enough to warrant a change to the structure of your records.
Given the typically low effort required to adjust settings or modify data, it’s generally advisable to implement any potential improvement.
For more information, see How A/B test scores are calculated.
Confidence
The confidence for a test is based on the test status and the p-value. A p-value is the probability of observing results at least as extreme as the ones you saw, assuming there is actually no difference between variants. It is not ‘confidence’ by itself, the false-positive rate, or the probability of replicating the result. A small p-value suggests the observed difference is unlikely to be due to chance alone. A high p-value suggests there isn’t enough evidence to conclude there is a real difference.
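To make this concrete, here's a minimal sketch of a standard two-sided, two-proportion z-test on hypothetical conversion counts. It isn't Algolia's exact methodology, just the textbook calculation the p-value idea comes from.

```python
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, searches_a: int,
                           conv_b: int, searches_b: int) -> float:
    """Two-sided z-test for the difference between two conversion rates."""
    rate_a = conv_a / searches_a
    rate_b = conv_b / searches_b
    pooled = (conv_a + conv_b) / (searches_a + searches_b)
    std_err = (pooled * (1 - pooled) * (1 / searches_a + 1 / searches_b)) ** 0.5
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: control converts at 10.0%, variant B at 10.8%.
p = two_proportion_p_value(conv_a=2000, searches_a=20000,
                           conv_b=2160, searches_b=20000)
print(f"p-value: {p:.3f}")  # ~0.009: unlikely to be due to chance alone

# The dashboard's "Difference" column is the relative change vs. the control:
relative_difference = (2160 / 20000 - 2000 / 20000) / (2000 / 20000)
print(f"relative difference: {relative_difference:+.1%}")  # +8.0%
```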
The different confidence levels are:
- Too early to interpret: the test has started, but there is insufficient data to draw reliable conclusions about the performance of the variants. Metric comparisons don't show until at least one week, or 20% of the test duration, has elapsed. Hover over the badges to see data, but avoid drawing conclusions during this stage.
- No data: the test has been running for a while, but there is no data to compare. This typically occurs when no events are tracked.
- Unconfident: the test has been running for a while, but it’s impossible to tell whether the observed change is representative of the true impact. This could change as more data is collected. Be careful when interpreting these results.
- Trending confident: the test has been running for a while, and it currently looks like the observed change reflects the true impact. This could change as more data is collected. Be careful when you interpret these results.
- Inconclusive: the test has finished, but the confidence is too low to determine whether the observed change is due to chance. This might be because of insufficient data or high similarity between the variants.
- Confident: the test is complete, and the observed change probably reflects the true impact.
Confident or Trending confident doesn’t mean that the change is good. It just means that the observed change isn’t due to chance. For example:
- A Confident result with a large observed decrease in conversion rate suggests that the change will harm conversions. If your goal is to increase conversion rates, avoid this change.
- A Confident result with a large observed increase in conversion rate suggests that the change will improve conversions. If your goal is to increase conversion rates, do implement the change.
- A Trending confident result with a large observed increase in conversion rate suggests that, based on current data, the change will improve conversions. You could implement the change, but the confidence might change later.
- An Inconclusive test means that the impact is uncertain. Ignore the results or interpret them with discretion. Try re-launching the test for a longer duration to collect more data. This increases, but doesn’t guarantee, the likelihood that the results will reach a confident state.
Minimum number of searches
You can stop your test at any point and analyze the results with as little or as much data as you want. However, acting on conclusions drawn from insufficient data or low-confidence results might lower your overall search performance and lead to unexpected outcomes.
The confidence rating helps you gauge the reliability of test results and protects you from jumping to conclusions too early based on skewed or insufficient data.
Recommendations
- Test before going live. Be wary of breaking anything: ensure both of your test indices work with your UI. Small changes can break your interface or strongly affect the user experience. For example:
  - Making a change that affects facets can cause the facet's UI logic to fail.
  - Changing a simple ranking on index B can make the search results so bad that users of this index get a terrible experience. This isn't the purpose of A/B testing: index B should theoretically be better than, or at least as good as, index A.
- Don't change your A or B indices during a test. Adjusting settings or data during testing pollutes your results, making them unreliable, and can break your search experience. If you must update your data, do so synchronously for both indices and, preferably, restart your test.
- Don't use the same index for several A/B tests. You can't use the same index in more than one test at the same time. You'll get an error.
- Make only small changes. The more features you test simultaneously, the harder it is to determine causality.
API clients
In most cases, use the Algolia dashboard to manage A/B tests. However, testing with an Algolia API client is sometimes beneficial. For example:
- You want to run the same test across many indices, for example, if you have several websites using the same kind of indices but with different data. The API simplifies the creation of multiple tests.
- You want your backend to trigger tests based on data changes or create feedback loops based on your analytics (as used in machine learning). This is an advanced method of managing product-line changes or industry trends and should be done carefully.
To use the API, your API key ACL must include:
- A/B test creation and deletion: `setSettings` on all indices
- A/B test analytics: `analytics` on all indices
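For example, here's a minimal sketch that creates the same test across several storefront indices by calling the A/B Testing REST API directly (Algolia's official API clients offer equivalent methods). The host, index names, end date, and credentials are placeholder assumptions; the API key needs the setSettings ACL.

```python
import requests

APP_ID = "YourApplicationID"  # placeholder
API_KEY = "YourScopedAPIKey"  # a key with the setSettings ACL

BASE_URL = "https://analytics.algolia.com/2/abtests"  # default A/B Testing API host
HEADERS = {
    "X-Algolia-Application-Id": APP_ID,
    "X-Algolia-API-Key": API_KEY,
}

# Hypothetical setup: several storefronts, each with a main index
# and a "most-sold" replica to test against it.
storefronts = ["shop_fr", "shop_de", "shop_us"]

for index in storefronts:
    payload = {
        "name": f"{index}: most-viewed vs. most-sold",
        "variants": [
            {"index": index, "trafficPercentage": 50,
             "description": "Control: most-viewed sort order"},
            {"index": f"{index}_most_sold", "trafficPercentage": 50,
             "description": "Variant B: most-sold sort order"},
        ],
        "endAt": "2030-01-31T00:00:00Z",  # test end date (ISO 8601)
    }
    response = requests.post(BASE_URL, headers=HEADERS, json=payload)
    response.raise_for_status()
    print(index, "->", response.json().get("abTestID"))  # ID of the new test
```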