Introduction: The Hidden Cost of Running Experiments in the Dark
We have all been there. You launch an A/B test on a Tuesday morning, check the dashboard on Wednesday afternoon, and see a 15% lift with a p-value of 0.03. Your instinct screams, "Ship it!" But here is the uncomfortable truth: if you did not determine your sample size before the test began, that p-value is almost meaningless. The mathematical machinery behind hypothesis testing assumes a fixed sample size, a fixed stopping rule, and a pre-specified effect size. When you peek at results early, you inflate your false positive rate dramatically. In fact, practitioners often report that running tests without proper sample sizing can lead to false positive rates as high as 30% or more, depending on how often you check.
This guide is written for anyone who runs or interprets A/B tests—growth marketers, product managers, data scientists, and UX researchers. Our goal is to explain why sample size is the single most overlooked lever for experiment quality, how it silently sabotages your results, and what you can do about it. We will walk through the core statistical concepts in plain language, identify the most common mistakes teams make, and provide a practical, step-by-step approach to getting sample size right. Then, we will introduce Omatic, a tool designed to close the blind spot that plagues so many testing programs. By the end of this guide, you will have a concrete framework for planning experiments that produce trustworthy outcomes.
This overview reflects widely shared professional practices as of May 2026. The principles described here are general guidance for non-medical, non-legal, non-investment decision contexts. For experiments involving human subjects, regulated industries, or high-stakes decisions, consult a qualified statistician or relevant regulatory guidance.
Why Sample Size Matters More Than You Think
At its core, an A/B test is a statistical hypothesis test. You are making an assumption—the null hypothesis—that there is no difference between your control and variation. The p-value tells you how likely the observed data would be if that null hypothesis were true. But here is the catch: the p-value is critically dependent on sample size. With a small sample, even a large observed effect can be due to random noise. With a very large sample, even a tiny, practically insignificant effect can produce a p-value below 0.05. This is why sample size planning is not optional; it is the foundation of trustworthy inference.
The Mathematical Relationship: Power, Effect Size, and Significance
Three key parameters interact in any sample size calculation. First, the effect size—the minimum difference you want to detect. A smaller effect requires a larger sample. Second, statistical power—the probability of detecting a true effect if it exists. Most teams target 80% power, meaning a 20% chance of missing a real effect (a false negative). Third, the significance level (alpha), usually set at 0.05, representing the accepted risk of a false positive. Together with the sample size itself, these parameters are mathematically linked: fix any three, and the fourth is determined. Teams often mistakenly fix only alpha and ignore power, leading to underpowered experiments that waste time and money.
Consider a typical scenario: a team running a test on a landing page button color change. They expect a 5% conversion rate on the control and hope to detect a 10% relative lift (to 5.5%). With 80% power and alpha 0.05, the required sample size per variant is roughly 31,000 visitors. If they run the test for only one week and get 5,000 visitors per variant, the test is severely underpowered. They might see a 12% lift that is not statistically significant, or worse, a significant result that is actually a false positive due to peeking. The consequences are wasted development effort, missed revenue opportunities, and eroded trust in the testing process itself.
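To make the arithmetic concrete, here is a minimal sketch of that calculation in Python, assuming the statsmodels library and a standard two-sided z-test on two proportions. The inputs mirror the scenario above; exact outputs will vary slightly depending on the method your calculator uses.

```python
# Required sample size to detect a 5.0% -> 5.5% conversion lift
# (two-sided test, alpha = 0.05, power = 0.80).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05      # control conversion rate
variant = 0.055      # 10% relative lift over the baseline

effect = proportion_effectsize(variant, baseline)   # Cohen's h for two proportions
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)
print(f"Required visitors per variant: {n_per_variant:,.0f}")   # roughly 31,000
```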
Another common mistake is treating sample size as a fixed number without accounting for audience segments. If you plan to analyze results by device type, traffic source, or user segment, each subgroup effectively becomes its own mini-experiment. The required sample size for each subgroup multiplies quickly. Many teams run a test, find no overall significant effect, but then drill down into mobile users and find a "significant" result. This is data dredging, and it inflates the false positive rate far beyond the nominal 5%.
The lesson is clear: sample size is not a bureaucratic checkbox. It is a critical design parameter that directly controls the reliability of your conclusions. Skipping this step is akin to building a bridge without calculating load capacity—you might get lucky, but the risk of catastrophic failure is unacceptable for any serious engineering effort.
Common Mistakes That Sabotage Your A/B Tests
Even experienced teams fall into predictable traps when it comes to sample size. These mistakes are not born from ignorance but from the pressure to move fast, the seduction of early results, and the complexity of real-world experimentation. By naming these errors explicitly, we can build guardrails to avoid them.
Mistake #1: The Peeking Problem
This is the most pervasive error in modern A/B testing. It occurs when experimenters check results repeatedly as data accumulates and stop the test as soon as the p-value crosses the 0.05 threshold. Statistically, this is equivalent to performing multiple hypothesis tests on the same data. Each peek increases the chance of a false positive. Researchers have shown that if you peek after every 100 observations and stop at the first significant result, the actual false positive rate can exceed 20% instead of the nominal 5%. The solution is not to stop peeking entirely—we all want to monitor for technical errors—but to pre-commit to a stopping rule. Use a fixed horizon design or a sequential testing method that adjusts for multiple looks. Omatic, as we will discuss later, provides a sequential testing framework that allows you to monitor without inflating error rates.
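To see the inflation for yourself, here is a short Monte Carlo sketch (plain numpy and scipy, with a hypothetical 5% conversion rate and 20 peeks of 100 observations each) that runs repeated A/A tests, where any "significant" result is by definition a false positive:

```python
# Simulate A/A tests (no true difference), peek after every batch of visitors,
# and stop at the first p < 0.05. Count how often that happens.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, batch, max_n, p_true = 2_000, 100, 2_000, 0.05
false_positives = 0

for _ in range(n_sims):
    a = rng.random(max_n) < p_true          # control conversions
    b = rng.random(max_n) < p_true          # "variant" with the identical rate
    for n in range(batch, max_n + 1, batch):
        ca, cb = a[:n].sum(), b[:n].sum()
        pooled = (ca + cb) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se == 0:
            continue                        # no conversions yet in either arm
        z = (cb / n - ca / n) / se
        if 2 * stats.norm.sf(abs(z)) < 0.05:
            false_positives += 1            # "significant" at some peek
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.1%}")
# Typically far above the nominal 5%; a single look at n = 2,000 stays near 5%.
```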
A related version of this mistake is "peeking with intent." You start the test, see a promising trend at 40% of the planned sample, decide to run a second variant, and then analyze the combined data as if it came from a single experiment. This invalidates the statistical assumptions and introduces selection bias. The correct approach is to plan your variants and sample size in advance, or use a multi-armed bandit algorithm that handles adaptive allocation properly. But even bandit algorithms require upfront design choices about minimum detectable effects and traffic allocation.
Mistake #2: Underpowered Experiments
An underpowered experiment is one that has too few observations to reliably detect the effect you care about. This is often driven by unrealistic expectations about effect size. Teams want to detect a 1% lift in conversion rate, but they only have the traffic to detect a 5% lift. They run the test anyway, hoping for the best. The result is almost always inconclusive: a non-significant p-value that could mean no effect, or an effect too small to detect. This ambiguity is worse than useless—it leads to debate, delay, and eventually, the temptation to make decisions based on trend direction rather than statistical significance.
To avoid underpowered experiments, you must perform a sample size calculation before the test. This requires estimating the baseline conversion rate and the minimum effect size that would be practically meaningful—not just statistically detectable. If the required sample size exceeds your available traffic within a reasonable time frame, you have three options: increase your traffic (e.g., reduce the number of concurrent tests), accept a larger minimum detectable effect, or extend the test duration. None of these options is ideal, but all are better than running an underpowered test.
Another version of this mistake is the "novelty effect" trap. A new feature often sees an initial spike in engagement that decays over time. If your sample size is too small to capture the decay, you might conclude the feature is successful when it is not. Planning for a longer test period with adequate sample size helps separate genuine effects from transient novelty. This is especially important in user experience tests where initial curiosity can inflate metrics.
Mistake #3: Ignoring Practical Significance
Statistical significance is not the same as practical significance. With a massive sample size—think millions of users—even a 0.01% lift can be statistically significant. But that tiny effect may not be worth the engineering cost, the user experience change, or the maintenance burden. Conversely, a large but non-significant effect from a small sample might be worth further investigation. The key is to interpret results in context. Set your minimum detectable effect based on business impact, not on a statistical rule of thumb.
This mistake often manifests when teams chase p-values without considering confidence intervals. A narrow confidence interval that barely excludes zero might indicate a small effect that is not actionable. A wide confidence interval that includes zero but is mostly positive might suggest the test was underpowered for the true effect size. Look at the full distribution of possible effect sizes, not just the binary significant/not-significant decision. Omatic helps by providing not just sample size recommendations but also post-experiment diagnostics that flag when a result is statistically significant but practically trivial.
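As a concrete illustration, the sketch below (hypothetical counts and a simple Wald interval, not Omatic's diagnostic logic) reports the confidence interval alongside a pre-declared practical-significance threshold instead of a bare p-value:

```python
# Compare the 95% CI for the lift against the smallest lift worth shipping.
import numpy as np
from scipy import stats

conv_a, n_a = 1_530, 30_000      # control: 5.10% conversion
conv_b, n_b = 1_680, 30_000      # variant: 5.60% conversion
practical_min_lift = 0.002       # smallest absolute lift worth the change

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = stats.norm.ppf(0.975)
lo, hi = diff - z * se, diff + z * se

print(f"Observed lift: {diff:.4f} (95% CI {lo:.4f} to {hi:.4f})")
print("Statistically significant:", lo > 0 or hi < 0)
print("Entire CI above the practical threshold:", lo >= practical_min_lift)
```

In this made-up example the result is statistically significant, yet the lower end of the interval still sits below the practical threshold—exactly the nuance a bare p-value hides.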
How to Calculate Sample Size: A Step-by-Step Guide
Calculating sample size is not as daunting as it sounds. The process follows a logical sequence of decisions that any practitioner can learn. Below, we outline a clear, actionable workflow that you can apply to your next experiment. This guide assumes you are using a frequentist hypothesis testing framework, which is the most common approach in industry.
Step 1: Define Your Baseline and Minimum Detectable Effect
Start by identifying your primary metric—the key performance indicator (KPI) that the experiment is designed to move. For a conversion rate test, this is straightforward. For metrics like revenue per user or session duration, you need to understand the distribution (e.g., standard deviation) of that metric. Historical data is your best friend here. Pull data from the past 4-8 weeks (excluding any periods with other tests running) to estimate the baseline mean and variability. Next, decide the smallest effect that would be practically meaningful. This is a business judgment, not a statistical one. For example, a 2% relative lift in conversion might be worth deploying if the change is simple; a 0.5% lift might not be worth the development effort. Be honest about what matters.
Once you have the baseline and the minimum detectable effect (MDE), you can translate them into the inputs your calculator expects. For proportion metrics (conversion rates), most calculators take the baseline rate and the absolute difference between the two proportions directly; some use a standardized effect size such as Cohen's h. For continuous metrics, the standardized effect size is the expected difference divided by the pooled standard deviation (Cohen's d). The quality of your sample size estimate depends heavily on the quality of your baseline estimate. If you use a baseline from a low-traffic period or an anomalous week, your calculation will be off. Use a representative, stable window of data.
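For a continuous metric, the same workflow looks like this (a sketch assuming statsmodels; `historical_revenue` is a stand-in for your own per-user data, simulated here just so the snippet runs):

```python
# Estimate baseline mean and SD from historical data, pick an MDE in business
# units, convert it to Cohen's d, and solve for the per-variant sample size.
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(7)
historical_revenue = rng.gamma(shape=2.0, scale=12.0, size=50_000)  # stand-in data

baseline_mean = historical_revenue.mean()
baseline_sd = historical_revenue.std(ddof=1)
mde_absolute = 0.05 * baseline_mean            # a 5% relative lift, in dollars

cohens_d = mde_absolute / baseline_sd
n_per_variant = TTestIndPower().solve_power(
    effect_size=cohens_d, alpha=0.05, power=0.80, alternative="two-sided",
)
print(f"Cohen's d: {cohens_d:.3f}; required users per variant: {n_per_variant:,.0f}")
```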
Step 2: Choose Your Statistical Parameters
Next, set your significance level (alpha) and statistical power (1-beta). The industry standard is alpha 0.05 and power 0.80, but these are not universal truths. In contexts where the cost of a false positive is very high (e.g., changing a core pricing page), you might use alpha 0.01. In contexts where you are screening many ideas and can afford more false positives in exchange for faster iteration, you might use alpha 0.10. Similarly, if the cost of missing a real effect is very high (e.g., a safety-critical feature), you might target 90% or 95% power. These choices directly affect the required sample size. For example, moving from 80% to 90% power increases sample size by roughly 30%. Moving from alpha 0.05 to 0.01 increases sample size by about 50%. Know your trade-offs.
Many teams default to 80% power without considering why. This is a legacy from social science conventions, not a mathematically optimal value. In a fast-moving product environment, 70% power might be acceptable for exploratory tests, while 95% power might be required for confirmatory tests that will drive major investment. The key is to make this choice deliberately, not by default. Omatic's interface prompts you to justify your power and alpha choices, forcing a thoughtful decision rather than accepting defaults.
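If you want to verify those trade-offs for your own numbers, a short loop makes the sensitivity explicit (same hypothetical 5.0% to 5.5% scenario as earlier; a sketch, not a recommendation of any particular setting):

```python
# How the per-variant sample size scales with different alpha/power choices.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.055, 0.05)
solver = NormalIndPower()

for alpha, power in [(0.05, 0.80), (0.05, 0.90), (0.01, 0.80), (0.10, 0.70)]:
    n = solver.solve_power(effect_size=effect, alpha=alpha, power=power,
                           ratio=1.0, alternative="two-sided")
    print(f"alpha={alpha:.2f}, power={power:.0%}: {n:,.0f} per variant")
```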
Step 3: Run the Calculation and Plan the Timeline
With your inputs ready, use a sample size calculator. There are many free online tools, or you can use the built-in calculator in Omatic. For a two-sided test comparing two proportions, the formula involves the baseline proportion, the MDE, alpha, and power. The output is the required sample size per variant. Multiply by the number of variants to get the total sample. Then, divide by your expected daily traffic to the experiment to estimate the minimum duration. Always add a buffer: account for potential traffic fluctuations, holidays, or technical issues. A good rule of thumb is to plan for 1.5 to 2 times the minimum duration. This prevents you from stopping too early and gives you room to handle unexpected events.
For example, suppose your calculator says you need 20,000 visitors per variant and each variant receives 2,000 visitors per day. The minimum duration is 10 days. Plan for 15-20 days. If your traffic is highly variable (e.g., weekdays vs. weekends), run the test for at least two full weeks to capture both patterns. Document your plan publicly within your team. This pre-registration step is a powerful safeguard against post-hoc rationalization. Omatic offers a pre-registration feature that time-stamps your parameters and prevents you from changing them after the test starts, eliminating the temptation to tweak the MDE mid-experiment.
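The timeline arithmetic itself is trivial, but writing it down (here as a tiny sketch with the example's hypothetical numbers) makes the buffer explicit and easy to pre-register:

```python
# Minimum duration plus a 1.5-2x planning buffer.
import math

n_per_variant = 20_000
daily_visitors_per_variant = 2_000

min_days = math.ceil(n_per_variant / daily_visitors_per_variant)    # 10 days
planned_range = (math.ceil(min_days * 1.5), min_days * 2)           # 15-20 days

print(f"Minimum duration: {min_days} days; plan for "
      f"{planned_range[0]}-{planned_range[1]} days")
```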
Omatic: Closing the Sample Size Blind Spot
Despite knowing the importance of sample size, most teams struggle to consistently apply best practices. The reasons are practical: manual calculations are error-prone, spreadsheets get lost, and the pressure to launch tests quickly leads to shortcuts. This is where Omatic comes in. Omatic is designed as an end-to-end experiment planning and monitoring platform that automates sample size determination, enforces pre-registration, and provides real-time diagnostics without inflating false positive rates. It is not a silver bullet—no tool can replace statistical literacy—but it closes the blind spot that plagues even mature testing programs.
How Omatic Works: From Planning to Analysis
Omatic integrates directly with your experimentation platform (e.g., Optimizely, VWO, or a custom in-house system). When you create a new experiment, Omatic prompts you to fill in the parameters: baseline metric, minimum detectable effect, expected traffic, alpha, and power. It then calculates the required sample size and duration, and suggests a pre-registration schedule. If you try to start the experiment with an inadequate sample size, Omatic issues a warning and blocks the launch unless you explicitly override with a justification. This friction is intentional—it forces you to confront the trade-off between speed and reliability.
During the experiment, Omatic monitors the accumulating data using a sequential testing framework. Unlike traditional frequentist methods that assume a fixed stopping point, sequential testing allows you to check results periodically without inflating the false positive rate. Omatic uses a spending function approach that controls the Type I error across multiple looks. It also calculates the conditional power—the probability that the test will reach significance if it continues to the planned sample size. If the conditional power drops below a threshold (e.g., 10%), Omatic suggests stopping early for futility, saving you time and traffic. This is far more principled than the common practice of stopping when the p-value dips below 0.05.
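Omatic's exact spending function is internal to the product, but the general idea of conditional power can be sketched in a few lines (a textbook Brownian-motion approximation under the "current trend" assumption; the interim values below are hypothetical):

```python
# Probability of ending significant if the current trend continues.
from scipy import stats

def conditional_power(z_interim: float, info_fraction: float, alpha: float = 0.05) -> float:
    """Conditional power for a two-sided fixed-horizon test, tracking the
    observed direction, assuming the interim trend persists."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    t = info_fraction
    b_t = z_interim * t ** 0.5            # Brownian-scale statistic at "time" t
    drift = z_interim / t ** 0.5          # estimated drift if the trend continues
    remaining_mean = b_t + drift * (1 - t)
    remaining_sd = (1 - t) ** 0.5
    return float(stats.norm.sf((z_crit - remaining_mean) / remaining_sd))

# Example: 40% of the planned sample collected, interim z = 0.3 (a weak signal)
print(f"Conditional power: {conditional_power(0.3, 0.40):.1%}")  # low -> consider futility
```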
After the experiment concludes, Omatic provides a diagnostic report that goes beyond the binary significant/not-significant classification. It shows the confidence interval, the observed effect size, the achieved power, and a comparison of the planned vs. actual sample size. It flags results that are statistically significant but practically trivial, and it highlights any segment analyses that were not pre-registered. This post-hoc transparency is crucial for building a culture of honest experimentation. Teams that use Omatic report a significant reduction in false positive decisions and increased confidence in their deployment choices.
Comparing Omatic with Three Alternative Approaches
To understand the value Omatic provides, it helps to compare it with other common approaches to sample size management. Each alternative has strengths and weaknesses, and the best choice depends on your team's maturity, resources, and risk tolerance. The table below summarizes the key differences, followed by a detailed discussion of each approach.
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Manual calculation + spreadsheet | No cost, full control, teaches the math | Error-prone, hard to enforce, no monitoring | Solo practitioners, very low-volume tests |
| Free online calculator + fixed horizon | Simple, widely available, good for one-off tests | No integration, no sequential monitoring, no pre-registration | Small teams with occasional tests |
| Bayesian analysis (e.g., with PyMC) | Handles prior information, intuitive interpretation | Steep learning curve, computational cost, still requires sample size planning | Advanced teams with statisticians |
| Omatic | Automated, integrated, sequential monitoring, pre-registration, diagnostics | Subscription cost, requires platform integration | Growth teams, product teams running many concurrent tests |
Alternative 1: Manual Calculation with Spreadsheets
This is the do-it-yourself approach. You look up the formula for a two-proportion z-test, plug in your numbers in Excel or Google Sheets, and write down the result. The main advantage is that you learn the math deeply—you cannot hide from the formula. This is a good exercise for newcomers. However, the disadvantages are significant. Spreadsheets are prone to cell errors and version-control problems, and they provide no mechanism for enforcement. There is nothing stopping a team member from launching a test without doing the calculation, or from changing the MDE mid-experiment without updating the plan. Without monitoring, you cannot detect when a test is running too long or when conditional power is low. This approach scales poorly beyond a handful of tests per month.
Alternative 2: Free Online Calculator with Fixed Horizon Design
Many websites offer free sample size calculators (e.g., Evan Miller's sample size calculator, or the one from Optimizely). These are straightforward: you enter your baseline conversion rate, MDE, alpha, and power, and you get a number. You then commit to running the test until that number is reached, without peeking. This approach is better than manual calculation because it reduces arithmetic errors. But it still lacks integration with your testing platform. You must manually check whether the sample has been reached. There is no automatic enforcement, no sequential monitoring, and no pre-registration. The fixed horizon design is statistically valid, but it is inefficient—you cannot stop early for effectiveness or futility. In practice, many teams using this approach still peek at results, undermining the validity.
Alternative 3: Bayesian Analysis with Custom Scripts
Bayesian methods offer an alternative framework where uncertainty is expressed through posterior distributions rather than p-values. These methods can incorporate prior information (e.g., from previous experiments) and provide more intuitive interpretations (e.g., "there is a 95% probability that the lift is between 1% and 4%"). However, Bayesian analysis still requires sample size planning—typically through simulation-based approaches that are computationally intensive. The learning curve is steep; you need to understand prior distributions, MCMC sampling, and model checking. For most product teams, the overhead is prohibitive unless they have a dedicated statistician. Moreover, Bayesian methods do not inherently solve the peeking problem; you still need to decide when to stop, and sequential Bayesian analysis is complex. Omatic uses frequentist sequential testing, which is more accessible and widely supported by the experimentation community.
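For readers curious what the Bayesian interpretation looks like in practice, here is a deliberately simple conjugate Beta-Binomial sketch (no PyMC or MCMC required; the counts and the flat Beta(1, 1) priors are hypothetical):

```python
# Posterior for each variant's conversion rate, then the distribution of the lift.
import numpy as np

rng = np.random.default_rng(0)
conv_a, n_a = 260, 5_000     # control
conv_b, n_b = 300, 5_000     # variant

post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)
lift = post_b - post_a

print(f"P(variant beats control): {(lift > 0).mean():.1%}")
print(f"95% credible interval for the absolute lift: "
      f"({np.percentile(lift, 2.5):.4f}, {np.percentile(lift, 97.5):.4f})")
```

Even here, note that nothing in the snippet tells you when to stop collecting data; that decision still has to be planned up front.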
In summary, Omatic occupies a sweet spot: it automates the calculation, enforces the plan, and provides principled sequential monitoring, all within a workflow that integrates with existing tools. For teams running more than a handful of tests per month, the investment in Omatic often pays for itself through reduced false positives and faster, more confident decisions.
Real-World Scenarios: Before and After Omatic
The best way to understand the impact of proper sample size management is to walk through anonymized composite scenarios that reflect common situations. These examples are drawn from patterns observed across many teams, not from any single organization.
Scenario 1: The False Positive That Cost a Quarter
A product team at a mid-size e-commerce company was testing a new checkout flow. They launched the experiment on a Monday, checked the data on Wednesday, and saw a p-value of 0.04 on the conversion rate. Excited, they deployed the new flow to 100% of traffic. Over the next three weeks, conversion rates actually dropped by 2%. The team had fallen victim to the peeking problem—they stopped at the first significance signal, which was a false positive. The cost was not just the lost revenue but also the engineering time spent reverting the change and the loss of credibility for the testing program. After adopting Omatic, the team now pre-registers each experiment with a planned sample size of 25,000 visitors per variant. Omatic's sequential monitoring flagged that the early significant result at 5,000 visitors was within the expected variability, and the test continued to the full sample. The final result was non-significant, saving the team from a costly false positive.
Scenario 2: The Underpowered Test That Wasted a Month
A SaaS company wanted to test a new onboarding email sequence. They expected a 10% lift in activation rate, but their traffic to the experiment was only 500 users per week. A sample size calculation showed they would need 12 weeks to achieve 80% power for a 10% lift. Instead, the team ran the test for four weeks, saw a non-significant 6% lift, and debated whether to implement it. The discussion consumed two weeks of meetings. Eventually, they decided to run a second test, which also ended inconclusively. All told, more than ten weeks passed with no actionable result. With Omatic, the team would have been forced to confront the insufficient sample size before launching. They could have chosen to extend the test to 12 weeks, increase traffic by reducing the number of concurrent tests, or accept a larger minimum detectable effect. The upfront planning would have saved weeks of wasted effort.
Scenario 3: The Surprising Benefit of Sequential Testing
A media company was testing a new article layout aimed at increasing time on page. They planned for a sample size of 50,000 visitors per variant. After 20,000 visitors per variant, Omatic's sequential test showed a conditional power of 1%—essentially impossible to reach significance even at the full sample. Omatic suggested stopping for futility. The team saved the remaining 30,000 visitors per variant for other tests. This is a scenario where a fixed horizon design would have wasted traffic. Without Omatic, the team might have continued the test to the bitter end, or worse, they might have stopped early due to a temporary fluctuation in the wrong direction. The sequential framework gave them an early, principled exit.
These scenarios illustrate that the value of proper sample size management is not just theoretical—it translates directly into better decisions, faster iteration, and more efficient use of traffic. Omatic provides the infrastructure to make this disciplined approach easy to follow across the entire team.
Frequently Asked Questions About Sample Size and Omatic
In our experience working with teams, certain questions arise repeatedly. Below, we address the most common concerns with clear, practical answers.
What if I cannot achieve the required sample size due to limited traffic?
This is a common constraint. If the required sample size is too large for your traffic, you have three options. First, increase the minimum detectable effect—detect only larger changes. This is often acceptable for exploratory tests. Second, run the test longer. Be realistic about the timeline, but remember that a longer test also controls for novelty effects and weekly cycles. Third, consider using a Bayesian approach with informative priors from previous experiments, which can reduce required sample sizes. Omatic can help by simulating different scenarios and showing you the trade-off between MDE, duration, and power.
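One practical way to confront this constraint is to invert the question: given the traffic you actually have, what is the smallest lift you can detect with reasonable power? Here is a sketch (statsmodels again; the baseline rate and traffic figure are hypothetical):

```python
# Achieved power for a range of relative lifts at a fixed, limited sample size.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.20                  # activation rate
available_per_variant = 2_000    # users you can realistically collect per variant
solver = NormalIndPower()

for relative_lift in (0.05, 0.10, 0.15, 0.20, 0.25):
    effect = proportion_effectsize(baseline * (1 + relative_lift), baseline)
    power = solver.power(effect_size=effect, nobs1=available_per_variant,
                         alpha=0.05, ratio=1.0, alternative="two-sided")
    print(f"{relative_lift:>4.0%} relative lift -> power {power:.0%}")
# The smallest lift that reaches ~80% power is the realistic MDE for this test.
```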
Can I use Omatic for multivariate tests (MVT) or A/B/n tests?
Yes, Omatic supports experiments with multiple variations. The sample size calculation adjusts for the number of variants and the type of comparison (e.g., each variant vs. control, or all pairwise comparisons). The sequential monitoring framework also extends to multiple comparisons, though the spending function becomes more conservative. For MVT, Omatic can help you plan factorial designs, but we recommend starting with simple A/B tests until your team is comfortable with the workflow.
Does Omatic replace the need for a statistician?
No tool can replace deep statistical expertise, but Omatic is designed to make best practices accessible to practitioners without a statistics background. It automates the calculations, enforces pre-registration, and provides clear diagnostics. For complex experimental designs (e.g., cluster-randomized trials, crossover designs, or experiments with repeated measures), you should still consult a statistician. Omatic is optimized for the most common use cases in product and marketing experimentation: two-arm or multi-arm tests with a single primary metric.
How does Omatic handle multiple metrics and the multiple testing problem?
By default, Omatic focuses on your primary metric for sample size planning. You can also specify secondary metrics, but Omatic will warn you about the multiple comparisons problem. It does not automatically adjust for multiple metrics during sequential monitoring, as this is an area of active research. Instead, it recommends that you pre-register a small number of primary and secondary metrics (no more than 5) and consider using a Bonferroni correction or false discovery rate control for exploratory analyses. The diagnostic report highlights any post-hoc metrics that were not pre-registered, so you know which results are confirmatory and which are exploratory.
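For the exploratory analyses mentioned above, applying a correction takes only a few lines (a sketch using statsmodels' multipletests; the metric names and p-values are hypothetical):

```python
# Bonferroni adjustment across a handful of pre-registered metrics.
from statsmodels.stats.multitest import multipletests

metrics = ["conversion", "revenue_per_user", "add_to_cart", "bounce_rate"]
p_values = [0.012, 0.048, 0.20, 0.81]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
for name, raw, adj, sig in zip(metrics, p_values, p_adjusted, reject):
    print(f"{name:>18}: raw p={raw:.3f}, adjusted p={adj:.3f}, significant={sig}")
# Swap method="fdr_bh" to trade strict family-wise control for more power.
```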
What is the cost of using Omatic, and is there a free tier?
Omatic offers a tiered pricing model. There is a free tier for individual practitioners with a limit of 5 experiments per month. For teams, there are paid plans that include integrations, unlimited experiments, and priority support. We do not provide specific pricing here because it changes over time, but you can visit the Omatic website for current details. For most teams, the cost is easily justified by the reduction in wasted traffic, avoided false positives, and faster decision cycles.
Conclusion: Stop Sabotaging Your Tests, Start Getting Answers You Can Trust
Sample size is not a boring technical detail—it is the foundation upon which every valid A/B test rests. Ignoring it is like building a house on sand. The mistakes we have covered—peeking, underpowered experiments, ignoring practical significance—are not just academic concerns. They cause real harm: wasted development time, misguided product decisions, and erosion of organizational trust in experimentation. The good news is that these mistakes are entirely preventable with the right knowledge and the right tools.
We have walked through the core concepts of sample size determination, the most common pitfalls, and a step-by-step process for getting it right. We have compared several approaches and shown how Omatic automates the discipline, providing pre-registration, sequential monitoring, and post-experiment diagnostics. The composite scenarios illustrate that the benefits are tangible and immediate. Whether you are running your first A/B test or your hundredth, the principles here apply. Commit to planning your sample size before every test. Use a tool like Omatic to enforce that commitment. And when you see the results, you will know they reflect a genuine effect, not statistical noise.
Your experiments deserve better than guesswork. Your team deserves better than false signals. And your users deserve products that are built on evidence, not hunches. Start fixing your sample size blind spot today.