Sample Size Traps in A/B Tests: 3 Fixes Omatic Uses to Get Clean Results

A/B testing is one of the most trusted tools for making data-informed decisions, but it is surprisingly easy to get wrong. The most common culprit? Sample size. Teams often run tests with too few observations, stop experiments prematurely, or ignore the statistical assumptions that make sample size calculations valid. These sample size traps can lead to false positives, wasted resources, and misguided product changes. At Omatic, we have developed a set of practical fixes to avoid these pitfalls and get clean results. This guide explains the traps and shows you how to apply three key fixes in your own testing program.

Where the Traps Show Up in Real Work

Sample size problems are not limited to one type of test or industry. They appear in marketing campaigns, product feature rollouts, pricing experiments, and even operational changes. A common scenario is a product team that wants to test a new checkout flow. They set up an A/B test with a 50/50 split, run it for one week, and see a 5% lift in conversion with a p-value of 0.04. The team celebrates and ships the new flow. But the sample size was only a few hundred users per variant — far too small to detect a 5% effect reliably. The result was a false positive, and the next month conversion actually dropped.

Another frequent trap is the “peeking” problem. Teams check results daily and stop the test as soon as significance is reached. This inflates the false positive rate dramatically. Even with a correct initial sample size, stopping early can produce misleading conclusions. We have seen teams run tests for only a few days because they hit significance, only to have the effect reverse when they let the test run to completion.
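The inflation from peeking can be seen in a small simulation. The sketch below is illustrative only. It assumes both arms share the same 10% conversion rate (so every significant result is a false positive), 100 users per arm per day, ten daily looks, and a pooled two-proportion z-test. The function names and parameter values are hypothetical choices for this example, not part of any particular testing tool.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_peeking(n_sims=300, days=10, users_per_day=100,
                     rate=0.10, alpha=0.05, seed=1):
    """Return (false positive rate with daily peeking, rate at fixed horizon)."""
    rng = random.Random(seed)
    peek_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        crossed = False
        p = 1.0
        for _ in range(days):
            # Both arms draw from the SAME rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < rate for _ in range(users_per_day))
            n += users_per_day
            p = two_sided_p(conv_a, n, conv_b, n)
            crossed = crossed or p < alpha  # "stop as soon as significant"
        peek_hits += crossed
        fixed_hits += p < alpha  # only the final look counts
    return peek_hits / n_sims, fixed_hits / n_sims


if __name__ == "__main__":
    peeking, fixed = simulate_peeking()
    print(f"false positives with daily peeking: {peeking:.1%}")
    print(f"false positives at fixed horizon:   {fixed:.1%}")
```

Because the final look is one of the daily looks, the peeking rate can never be lower than the fixed-horizon rate. With ten looks it usually lands well above the nominal 5%.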

The third trap is ignoring the minimum detectable effect (MDE). Often, teams choose an effect size that is too small to be practically meaningful, or they do not calculate sample size at all. They assume that any statistically significant result is actionable, but tiny effects may not justify the cost of implementation. At Omatic, we require teams to specify the smallest effect that would be worth acting on before the test starts and size the experiment accordingly.

These traps are not rare. In a survey of practitioners, many reported that they had run tests with insufficient sample size at least once. The consequences include wasted engineering time, delayed launches, and loss of trust in the testing process. Understanding where these traps show up is the first step to avoiding them.

Foundations Readers Confuse About Sample Size

One of the most persistent confusions is the relationship between sample size and statistical significance. Many people think that a larger sample size always leads to more reliable results. While it is true that larger samples reduce variance, they also make it easier to detect trivial effects. A test with a million users might find a 0.1% lift that is statistically significant but practically irrelevant. The key is to size the experiment for the effect you care about, not just to maximize power.
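To make this concrete, here is a minimal sketch of a pooled two-proportion z-test applied to the same small difference at two sample sizes. The numbers are assumptions chosen for illustration: a 10.0% versus 10.1% conversion rate (a 0.1 percentage point lift), first with one million users per variant and then with ten thousand. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same observed rates (10.0% vs 10.1%), very different sample sizes.
p_large = two_proportion_p_value(100_000, 1_000_000, 101_000, 1_000_000)
p_small = two_proportion_p_value(1_000, 10_000, 1_010, 10_000)

print(f"1,000,000 per variant: p = {p_large:.3f}")
print(f"   10,000 per variant: p = {p_small:.3f}")
```

The large test clears the 0.05 threshold while the small one does not, even though the observed lift is identical. Whether that lift is worth shipping is a business question the p-value cannot answer.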

Another common misunderstanding is that sample size calculations are only needed for small experiments. In reality, even large experiments benefit from proper sizing. Without it, you risk either wasting resources on an oversized test or missing a real effect because the test is underpowered. The calculation depends on several parameters: baseline conversion rate, minimum detectable effect, significance level (alpha), and statistical power (1 – beta). Changing any of these affects the required sample size.
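As a rough illustration of how these four parameters interact, the sketch below uses the standard normal-approximation formula for a two-sided, two-proportion z-test with equal allocation. It assumes the MDE is expressed as a relative lift over the baseline; the function name and the example inputs are hypothetical, and a dedicated calculator may use a slightly different formula.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed PER VARIANT (not in total)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # rate under the hoped-for lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Changing any one parameter changes the answer.
print(sample_size_per_variant(0.10, 0.05))               # reference case
print(sample_size_per_variant(0.10, 0.10))               # larger MDE
print(sample_size_per_variant(0.05, 0.05))               # lower baseline
print(sample_size_per_variant(0.10, 0.05, power=0.90))   # more power
```

Doubling the relative MDE cuts the requirement to roughly a quarter, which is why the MDE is usually the parameter with the most leverage.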

Many teams also confuse the sample size per variant with the total sample size. If you have two variants and you need 10,000 users per variant, you need 20,000 total. This seems obvious, but we have seen plans that treat the per-variant figure as the total and then split it evenly, leaving each variant with half the users it needs.

A less obvious confusion is the assumption that sample size calculations guarantee valid results. They do not. They only tell you how many observations you need if the test is run correctly. Violating assumptions like independence of observations, random assignment, or no carryover effects can invalidate the calculation. For example, if you run a test on the same users repeatedly without accounting for repeated measures, your effective sample size is smaller than the raw count.
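One common way to approximate the shrinkage from repeated measures is the design effect, 1 + (m - 1) * ICC, where m is the number of observations per user and ICC is the intraclass correlation between observations from the same user. The sketch below is an assumption-laden illustration, not the method described in this article: it assumes every user contributes the same number of observations, and the function name and inputs are hypothetical.

```python
def effective_sample_size(total_observations, obs_per_user, icc):
    """Effective sample size under a design effect of 1 + (m - 1) * icc."""
    if obs_per_user < 1:
        raise ValueError("obs_per_user must be at least 1")
    if not 0 <= icc <= 1:
        raise ValueError("icc must be between 0 and 1")
    design_effect = 1 + (obs_per_user - 1) * icc
    return total_observations / design_effect


# 10,000 sessions from users who each appear 5 times, moderately correlated.
print(round(effective_sample_size(10_000, obs_per_user=5, icc=0.3)))
```

When observations from the same user are perfectly correlated (ICC of 1), the effective sample size collapses to the number of distinct users.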

Finally, some teams treat sample size as a one-time decision. But sample size can change if the baseline rate shifts or if you decide to test multiple variants. Sequential testing methods, like the ones used by some modern testing platforms, adjust sample size on the fly, but they require special statistical corrections. Without those corrections, sequential testing inflates error rates.

Why These Confusions Persist

Part of the problem is that introductory statistics courses often focus on hypothesis testing for fixed sample sizes, not on the practical decisions that go into planning an experiment. Many product managers and engineers have not been trained to think about power analysis or MDE. Additionally, default settings in A/B testing tools often hide these parameters, encouraging users to launch tests without thinking about sample size. The result is a culture of “launch and see” rather than deliberate experiment design.

Another reason is that sample size calculations feel abstract. Numbers like “3,700 users per variant” do not connect intuitively to the business question. Teams need a concrete process that ties the calculation to the decision they are making. At Omatic, we use a simple spreadsheet tool that asks for baseline rate, MDE, and desired confidence level, then outputs required sample size. This makes the process tangible and forces teams to think about effect size upfront.

What Happens When You Ignore These Foundations

Ignoring these foundations leads to the traps we described earlier: false positives, false negatives, and wasted resources. It also undermines the credibility of the testing program. When a test produces a result that cannot be replicated, stakeholders lose trust. Over time, the team may abandon testing altogether or rely on gut feel. Getting the foundations right is essential for building a sustainable experimentation culture.

Patterns That Usually Work for Clean Results

After identifying the traps, we have developed three core fixes that consistently produce clean results. These are not theoretical — they are the patterns we apply daily at Omatic.

Fix 1: Pre-specify the Minimum Detectable Effect

Before any test starts, we require the team to answer: “What is the smallest improvement that would be worth implementing?” This becomes the MDE. If the test cannot detect that effect with 80% power and 95% confidence, we do not launch until we can get enough users. This single step eliminates tests that are too small to be useful. It also forces teams to think about the business value of the change, not just statistical significance.

We have found that teams often overestimate how small an effect they can realistically detect. A common mistake is to set the MDE to 1% or 2% without checking whether the sample size is feasible. When we run the numbers, they realize they would need millions of users. At that point, they either adjust the MDE upward or decide the test is not worth running. Both outcomes are better than running an underpowered test.

Implementation tip: Use a sample size calculator that includes MDE as a parameter. Many free tools are available. Enter your baseline conversion rate, MDE, alpha (0.05), and power (0.80). If the required sample size is larger than your available traffic, you have three options: increase the MDE, extend the test duration, or accept lower power. Treat lower power as the last resort; we recommend not going below 80%.
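The feasibility check described above can be sketched as a small table: for each candidate MDE, compute the required users per variant and compare it with the traffic you can actually get. The formula is the usual normal approximation for a two-sided two-proportion test, the MDE is assumed to be a relative lift, and the 10% baseline and 100,000 available users per variant are made-up inputs. All function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per variant (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def feasibility_table(baseline, available_per_variant, relative_mdes):
    """Return (mde, users needed per variant, feasible?) for each MDE."""
    rows = []
    for mde in relative_mdes:
        needed = sample_size_per_variant(baseline, mde)
        rows.append((mde, needed, needed <= available_per_variant))
    return rows


for mde, needed, ok in feasibility_table(0.10, 100_000, [0.01, 0.02, 0.05, 0.10]):
    verdict = "feasible" if ok else "not feasible"
    print(f"MDE {mde:.0%}: {needed:,} per variant -> {verdict}")
```

With these assumed inputs, a 1% relative MDE needs well over a million users per variant, which is the kind of result that prompts a team to raise the MDE or drop the test.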

Fix 2: Set a Fixed Horizon and Avoid Peeking

Once the sample size is determined, we calculate how long the test needs to run to reach that number. This is the fixed horizon. We do not check results until the horizon is reached. This prevents peeking bias. If we must check early (e.g., for safety monitoring), we use a sequential testing method with appropriate stopping boundaries, such as the O'Brien-Fleming spending function. But for most tests, the fixed horizon approach is simpler and works well.

We also account for daily and weekly seasonality. For example, if you run a test for one week, you capture both weekend and weekday behavior. But if you stop after three days, you might miss weekend effects. We recommend running tests for at least one full business cycle — typically one or two weeks for most e-commerce sites, but longer for low-traffic sites.
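Turning a required sample size into a fixed horizon is simple arithmetic, sketched below. The function assumes that `daily_users` is the number of new users entering the experiment each day across all variants and that the business cycle is seven days; both the name and the inputs are hypothetical.

```python
from math import ceil


def fixed_horizon_days(users_per_variant, n_variants, daily_users, cycle_days=7):
    """Days to run the test, rounded up to whole business cycles."""
    if daily_users <= 0:
        raise ValueError("daily_users must be positive")
    total_needed = users_per_variant * n_variants
    raw_days = ceil(total_needed / daily_users)
    # Round up so every day of the week is represented equally often.
    return ceil(raw_days / cycle_days) * cycle_days


print(fixed_horizon_days(10_000, 2, daily_users=3_000))  # 7
print(fixed_horizon_days(10_000, 2, daily_users=2_000))  # 14
```

Rounding up to a whole cycle means the test sometimes collects more users than strictly required, which is harmless as long as the stopping date is fixed in advance.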

Common mistake: Teams often set a fixed horizon based on calendar days but then stop early because they hit significance. This is the same as peeking. The rule is: do not look at the results until the horizon is reached. If you cannot resist, use a tool that hides the results until the end.

Fix 3: Validate Assumptions Post-Test

Even after the test is complete, we run diagnostic checks. We verify that the randomization was balanced (e.g., similar sample sizes per variant, similar pre-test metrics). We check for outliers that could have skewed results. We also test for interactions between variants and segments (e.g., does the treatment work differently for new vs. returning users?). These checks catch problems that sample size calculations cannot address.
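The balance check on sample sizes per variant is often run as a chi-square goodness-of-fit test, sometimes called a sample ratio mismatch check. The sketch below is one possible version for two variants, using only the standard library; the function name, the example counts, and the strict 0.001 threshold are assumptions for illustration rather than a description of our internal tooling.

```python
from math import erfc, sqrt


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """p-value for observed variant counts against the planned split."""
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((n_a - expected_a) ** 2 / expected_a
            + (n_b - expected_b) ** 2 / expected_b)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


for counts in [(10_000, 10_100), (10_000, 10_500)]:
    p = srm_p_value(*counts)
    flag = "investigate assignment" if p < 0.001 else "looks balanced"
    print(f"{counts}: p = {p:.4f} -> {flag}")
```

A very small p-value here says nothing about the treatment effect. It suggests the assignment or logging is broken, so the test result should not be trusted until the cause is found.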

One common post-test issue is that the baseline conversion rate differs from the estimate used in the sample size calculation. If the actual baseline is lower, the test may be underpowered. We recompute power using the observed baseline and the pre-specified MDE (not the observed effect, which would make the power figure circular). If power is below 50%, we flag the result as inconclusive.
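A minimal sketch of that power recomputation follows. It uses the normal approximation for a two-sided two-proportion test, treats the MDE as a relative lift, and ignores the negligible chance of rejecting in the wrong direction. The planned 10% baseline, the observed 8% baseline, and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(baseline, relative_mde, n_per_variant, alpha=0.05):
    """Approximate power to detect the pre-specified MDE at a given baseline."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_variant)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    return NormalDist().cdf((abs(p2 - p1) - z_alpha * se_null) / se_alt)


n = 57_763  # sized for a 10% baseline, 5% relative MDE, 80% power
planned = power_two_proportions(0.10, 0.05, n)
actual = power_two_proportions(0.08, 0.05, n)  # baseline came in lower

print(f"planned power: {planned:.0%}")
print(f"power at observed baseline: {actual:.0%}")
```

The MDE stays fixed at the value chosen before the test; only the baseline is updated.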

Another validation is to run an “A/A” test, where both groups receive the same experience, to check the false positive rate. If A/A tests come up significant more often than your alpha allows (about 1 in 20 at alpha = 0.05), something is wrong with your testing setup (e.g., assignment or measurement errors). We do this quarterly as a sanity check.

Anti-Patterns and Why Teams Revert

Even with good patterns, teams often revert to bad habits. Understanding these anti-patterns helps you avoid them.

Anti-pattern 1: Chasing Significance

The most common anti-pattern is continuing a test until it reaches significance, regardless of sample size. This is called “optional stopping” and it inflates false positives. Teams do this because they are eager for a win. The fix is to set a fixed horizon, but even then, teams may extend the test if the p-value is close. This is a form of data-dependent stopping. The correct approach is to decide the stopping rule before the test and stick to it.

Anti-pattern 2: Using the Wrong Metric

Teams often pick a metric that is easy to measure but not directly tied to the business goal. For example, they test a new feature and measure clicks on a button, when the real goal is revenue. Sample size calculations for clicks may be fine, but the effect on revenue may be too small to detect. Always choose the primary metric that reflects the decision you want to make. If you care about revenue, size the test for revenue, even if it takes longer.

Anti-pattern 3: Ignoring Multiple Testing Corrections

When you test multiple metrics or multiple variants, the chance of a false positive increases. Many teams do not adjust for this. They run a test with five metrics and report the one that shows significance. This is p-hacking. We apply a Bonferroni correction or use a false discovery rate (FDR) method. Alternatively, we pre-specify a primary metric and treat others as exploratory.
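Both corrections are short enough to sketch directly. The example below applies Bonferroni and the Benjamini-Hochberg FDR procedure to five made-up p-values, one per metric; the function names and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


metric_p_values = [0.04, 0.30, 0.008, 0.62, 0.015]
print(bonferroni(metric_p_values))          # [False, False, True, False, False]
print(benjamini_hochberg(metric_p_values))  # [False, False, True, False, True]
```

Note that the metric with p = 0.04 would count as a win if reported alone, yet it survives neither correction. Bonferroni is the stricter of the two; the FDR procedure trades some of that protection for more power.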

Why do teams revert to these anti-patterns? Pressure to deliver results, lack of training, and the belief that “something is better than nothing.” In our experience, the opposite holds: no result is better than a misleading one. A clean null result is more valuable than a false positive that leads to a bad decision.

Maintenance, Drift, and Long-Term Costs

Sample size pitfalls do not end after one test. They accumulate over time. If your testing program consistently uses underpowered tests, you will have a high rate of false positives and false negatives. This erodes trust and leads to a culture of skepticism. Teams may start ignoring test results altogether, falling back on opinions.

Another long-term cost is that you may miss real opportunities. If your tests are underpowered, you fail to detect effects that are real but small. Over time, you lose the cumulative benefit of many small improvements. A well-sized test program can detect a 2% lift if you have enough traffic, but if you consistently use MDEs of 5%, you miss those gains.

Maintenance also involves updating your sample size assumptions as your business changes. For example, if your baseline conversion rate improves, the required sample size for a given MDE changes. We recommend revisiting sample size parameters every quarter or whenever there is a major change to the product or traffic.

Drift can occur in the randomization itself. Over time, the assignment algorithm may become biased due to caching issues, cookie deletion, or other technical problems. We monitor balance metrics (e.g., pre-test conversion rates) weekly to detect drift. If we see a significant imbalance, we investigate before running new tests.

Finally, the cost of poor sample size practices extends to the organization’s ability to learn. A testing program that produces unreliable results cannot support a culture of experimentation. Teams become disillusioned and stop proposing tests. The long-term fix is to invest in training, tools, and processes that enforce good sample size discipline.

When Not to Use This Approach

The three fixes we describe are not universal. There are situations where they may not apply or where a different approach is needed.

Low-traffic scenarios: If your site has very few visitors, you may never reach the required sample size for a reasonable MDE. In that case, running a traditional A/B test is futile. Alternatives include Bayesian methods that can incorporate prior information, or running a time-series analysis instead of a controlled experiment. You might also consider qualitative research or usability testing to guide decisions.

Rapid iteration: In some contexts, such as early-stage product development, you may need to make decisions quickly without waiting for a full sample. Here, a “launch and monitor” approach with guardrail metrics may be more appropriate. But be aware that this increases the risk of false positives. We recommend reserving this approach for low-risk changes and being transparent about the uncertainty.

Complex interventions: Some changes have network effects or spillover effects that violate the independence assumption. For example, a social feature may affect not just the treated users but their friends. In such cases, cluster-randomized trials are needed, and sample size calculations are more complex. The fixes we described are still relevant but must be adapted.

Multiple treatments: When testing many variants simultaneously, the sample size per variant shrinks. You may need to use a multi-armed bandit algorithm or a factorial design. Our pre-specify MDE fix still applies, but you need to adjust for multiple comparisons.

In all these cases, the key is to be explicit about your assumptions and limitations. Do not pretend that a small, underpowered test gives you clean results. Instead, acknowledge the uncertainty and make decisions accordingly.

Open Questions and FAQ

Q: What if I cannot reach the required sample size even after adjusting MDE?
A: You have three options: (1) accept a lower power (e.g., 60% instead of 80%) — but be aware that this increases the chance of missing a real effect; (2) run a longer test to accumulate more users; (3) abandon the test and use other methods like user research or expert judgment. We recommend option 3 for high-risk decisions.

Q: Is it okay to run a test with a sample size that gives 50% power?
A: Generally no. With 50% power, you have only a coin-flip chance of detecting a real effect of the size you care about. This is rarely acceptable for business decisions. If you must run it, treat the result as suggestive, not conclusive.

Q: Can I use Bayesian methods to avoid sample size calculations?
A: Bayesian methods still require a sample size decision, though the calculation is different. They can incorporate prior information and may require fewer observations if the prior is informative. But they are not a free pass. You still need to plan how many observations to collect.

Q: How do I handle multiple metrics without inflating false positives?
A: Pre-specify one primary metric. For secondary metrics, use a correction like Bonferroni or control the false discovery rate. Or treat them as exploratory and do not base decisions on them alone.

Q: What is the minimum test duration?
A: At least one full business cycle (e.g., one week for most e-commerce sites) to capture day-of-week effects. For low-traffic sites, it may be longer. Never stop a test before the planned duration, even if significance is reached.

Summary and Next Experiments

Sample size traps are common but avoidable. The three fixes we use at Omatic are: pre-specify the minimum detectable effect, set a fixed horizon and avoid peeking, and validate assumptions post-test. These patterns, combined with awareness of anti-patterns, will help you get clean results from your A/B tests.

To put this into practice, here are your next steps:

  • Audit your last five tests: Did you pre-specify MDE? Did you stop early? Did you check assumptions? Note any issues.
  • Create a sample size calculator template for your team. Include fields for baseline rate, MDE, alpha, and power. Make it a mandatory step before any test launch.
  • Set a policy: No peeking until the planned horizon. If you must monitor for safety, use a sequential testing method.
  • Run a quarterly A/A test to check for system-level false positive rates. If you see a significant result, investigate the root cause.
  • Train your team on the foundations: power, MDE, and the pitfalls of optional stopping. A one-hour workshop can make a big difference.

By adopting these practices, you will not only improve the reliability of your tests but also build a stronger experimentation culture. The goal is not to run the most tests, but to run the right tests and learn the truth.
