
The 3 Hidden Sample Size Mistakes That Invalidate Your Experiment Results (Omatic's Problem-Solution Playbook)

Experimentation is the backbone of data-driven decision-making, yet many teams unknowingly sabotage their results through subtle sample size errors. This comprehensive guide, part of Omatic's Problem-Solution Playbook, reveals the three hidden mistakes—from premature peeking and ignoring baseline conversion rates to misunderstanding minimum detectable effect sizes—that can render your A/B tests, surveys, and product experiments invalid. We explain the statistical mechanisms behind each error, provide actionable solutions you can implement today, and compare common approaches to sample size planning.

Introduction: Why Your Experiment Results Might Be Misleading You

Imagine spending weeks designing and running an A/B test, only to discover later that the results were statistically meaningless—or, worse, directionally wrong. This is not a rare occurrence; it is a common pitfall that plagues teams across industries, from SaaS startups to enterprise marketing departments. The root cause often lies not in the experiment design itself, but in three hidden sample size mistakes that creep in unnoticed. These mistakes can invalidate your conclusions, waste resources, and lead to bad business decisions. This guide, part of Omatic's Problem-Solution Playbook, will walk you through each mistake, explain why it undermines your results, and provide actionable solutions you can implement today. We focus on the practical, statistical, and operational aspects of sample size planning, drawing from anonymized team experiences and widely accepted methodological standards. Whether you are a product manager, data analyst, or growth marketer, understanding these errors will help you build experiments that produce reliable, reproducible insights. This overview reflects widely shared professional practices as of May 2026; verify critical details against current official guidance where applicable.

Mistake #1: Peeking at Results Too Early (The Data Snooping Trap)

One of the most frequent errors teams commit is repeatedly checking experiment results before the planned sample size is reached. This practice, known as data snooping or peeking, dramatically inflates the probability of observing a statistically significant result purely by chance. The mechanism is straightforward: each time you test the data, you increase the number of comparisons, which raises the familywise error rate. Without correction, a 5% significance level can balloon to 20% or higher after just a few peeks, leading to false positives that feel convincing but are not reproducible.
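To see the mechanism directly, the short simulation below (a hedged illustration, not taken from any particular tool) runs many A/A tests—where there is no true difference—and "peeks" with a naive two-proportion z-test every 200 users. The single-look version rejects about 5% of the time; the peeking version rejects far more often.

```python
# Monte Carlo sketch: how repeated peeking inflates the false-positive rate.
# Illustrative assumptions: an A/A test (no true difference), a 5% conversion
# rate, peeks every 200 users per variant, and a naive two-proportion z-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def false_positive_rate(n_total=2000, peek_every=200, p=0.05, n_sims=2000, alpha=0.05):
    hits = 0
    for _ in range(n_sims):
        a = rng.random(n_total) < p          # control conversions (True/False)
        b = rng.random(n_total) < p          # treatment conversions, same true rate
        for n in range(peek_every, n_total + 1, peek_every):
            pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            z = abs(b[:n].mean() - a[:n].mean()) / se if se > 0 else 0.0
            if stats.norm.sf(z) * 2 < alpha:
                hits += 1                    # declared "significant" at some peek
                break
    return hits / n_sims

print("Single look at the end:", false_positive_rate(peek_every=2000))
print("Peeking every 200 users:", false_positive_rate(peek_every=200))
```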

Why Peeking Feels So Tempting

Teams often peek because of pressure to make quick decisions, curiosity about early trends, or a belief that a large early effect is meaningful. In a typical scenario, a product team launches a pricing experiment and checks results after only 200 users per variant. The variant shows a 10% lift with a p-value of 0.04—seemingly significant. They declare victory and roll out the change, only to see no improvement in the following weeks. The early signal was random noise, not a real effect. The team wasted development time and eroded trust in experimentation.

How to Avoid the Peeking Mistake

The solution is to pre-register your sample size and adhere to a fixed stopping rule. Use a power analysis before the experiment begins to determine the minimum number of observations needed, and do not analyze results until that number is reached. If you must monitor for safety or early failure, implement a sequential testing method with alpha spending functions, such as the O'Brien-Fleming boundary. Tools like Omatic's sample size calculator (conceptual example) can help you design a fixed-horizon experiment that respects error rates. Additionally, educate your team about the dangers of peeking and enforce a culture of patience. A simple rule: no dashboard access until the end date, or use blinded results that hide significance flags until the sample is complete.

Avoiding this mistake is not just about discipline—it is about building a trustworthy process. When you pre-specify your stopping criteria, you protect the validity of your conclusions and ensure that your team can act on results with confidence. Many industry surveys suggest that over 40% of practitioners have admitted to peeking at results, and the consequences are well-documented in methodological literature. By committing to a fixed sample size plan, you eliminate the largest source of inflated false positives.

Mistake #2: Ignoring Your Baseline Conversion Rate (The Scale Trap)

The second hidden mistake involves neglecting your baseline conversion rate when calculating sample size. Many teams assume that the required sample size depends only on the effect size they want to detect, but the baseline rate dramatically influences the variance of the estimate. For binary outcomes like conversion, click-through, or purchase, the variance is highest when the baseline is around 50% and decreases as it approaches 0% or 100%. If you ignore this, you may drastically underestimate or overestimate the sample size needed, leading either to underpowered experiments or to traffic wasted on overpowered ones.

A Concrete Example of the Scale Trap

Consider a team testing a new onboarding flow for a free trial conversion. They assume a baseline conversion rate of 20% and want to detect a 20% relative lift (from 20% to 24%). Using a standard power calculation (alpha 0.05, power 0.80), they estimate a sample size of roughly 1,700 users per variant. However, if the actual baseline is 10%, the same relative lift corresponds to a smaller absolute difference, and the required sample size jumps to roughly 3,850 per variant—more than double. Conversely, if the baseline is 40%, the requirement drops to around 600. Running the experiment with the wrong baseline means you either waste traffic or fail to detect a real effect. In a composite scenario from a fintech company, a team sized a test for an absolute lift using a stale baseline from a year earlier that was 15% lower than the current rate; because the variance at the true, higher rate was larger than assumed, the experiment was underpowered and missed a 10% improvement that later showed up in a larger follow-up test, costing months of lost growth.

How to Correctly Incorporate Baseline Rates

The fix is to use recent, reliable baseline data from a representative period—ideally the past 30–90 days, excluding any major campaigns or anomalies. Use this baseline to calculate the required sample size with a standard formula or tool. For binomial outcomes, the formula for sample size per variant is roughly: n = (z_(alpha/2) + z_beta)^2 * [p1(1 - p1) + p2(1 - p2)] / (p2 - p1)^2, where p1 is the baseline and p2 is the hypothesized treatment rate. Omatic's approach recommends using a rolling baseline average to account for seasonality, and always performing a sensitivity analysis: compute sample size for a range of plausible baselines (e.g., ±20% of your estimate) to see how your required sample changes. If the range is large, plan for the worst-case scenario or use a more robust design like continuous monitoring with Bayesian updating.
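As a sketch of how that formula and the sensitivity sweep might look in code (plain Python with scipy; the baseline, lift, and defaults below are illustrative):

```python
# Sketch of the two-proportion sample size formula from the text, with a
# sensitivity sweep over plausible baselines. Assumes a two-sided test.
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Users per variant to detect `relative_lift` on top of `baseline`."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return z ** 2 * variance / (p2 - p1) ** 2

# Sensitivity analysis: +/-20% around an estimated 20% baseline, 20% relative MDE.
for baseline in (0.16, 0.20, 0.24):
    n = sample_size_per_variant(baseline, relative_lift=0.20)
    print(f"baseline {baseline:.0%} -> about {n:,.0f} users per variant")
```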

This mistake often goes unnoticed because teams rely on generic sample size tables that assume a 50% baseline. By tailoring your calculation to your specific metric, you gain efficiency and power. In practice, the baseline rate is one of the most influential inputs, yet it is the most frequently ignored. Addressing it is a simple, high-impact improvement to your experimentation workflow.

Mistake #3: Misunderstanding Minimum Detectable Effect (The Ambition Trap)

The third hidden mistake is setting the minimum detectable effect (MDE) too small or too large without considering business feasibility. The MDE is the smallest practical effect you want to detect with your experiment, and it directly determines sample size—smaller effects require disproportionately more data, since halving the effect roughly quadruples the required sample. Many teams choose an MDE based on what they hope to see (e.g., a 5% lift) rather than what is realistic or cost-effective. This leads to either impossibly large sample sizes that never get collected or underpowered studies that fail to detect anything.

The Cost of Over-Optimism in MDE

Imagine a team wants to detect a 0.5 percentage-point improvement in a metric with a baseline of 10%—a relative lift of just 5%. With standard power of 80% and alpha of 0.05, the required sample size is roughly 58,000 users per variant. For many products, that is more eligible traffic than a single experiment can claim in a reasonable window. The team might run the experiment for two weeks, reach only 20,000 users per variant, and obtain a null result that is actually inconclusive. They conclude the change has no effect, when in reality it might have a small but meaningful impact that they lacked the power to detect. In an anonymized retail case, a team tested a checkout simplification with an MDE of 2% absolute lift on a 15% baseline. They collected 8,000 users per variant and saw no significant result. Later, a meta-analysis of similar changes across the industry suggested a 0.5% absolute lift was common—but they would have needed roughly 80,000 users per variant to detect it. Their experiment was doomed from the start.

How to Set a Realistic MDE

Start by defining the minimum business-significant effect—the smallest change that would justify the cost of implementation. This is a decision you make with stakeholders, not a statistical input alone. Then, calculate the required sample size for that effect. If it exceeds your available traffic, you have three options: accept a larger MDE (and only launch changes with bigger impact), extend the experiment duration, or use a more sensitive design like a within-subjects test or continuous metric (e.g., revenue instead of conversion). Omatic's playbook recommends a tiered approach: for high-traffic experiments, use a small MDE (1-2% relative); for medium traffic, use a moderate MDE (5-10%); for low traffic, focus on large effects (20%+). Also, consider using Bayesian methods that can incorporate prior information to reduce sample size requirements, though this adds complexity. Never set an MDE based on what you hope to find; set it based on what you can realistically detect with your resources.
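One practical way to apply "set the MDE by what you can detect" is to invert the calculation: fix the traffic you can realistically commit and solve for the smallest relative lift you would have adequate power to see. The sketch below does this with a simple search; the binary metric, equal split, and example numbers are assumptions for illustration.

```python
# Sketch: given available users per variant, find the smallest relative lift
# detectable with the desired power (binary metric, two-sided test).
from scipy.stats import norm

def required_n(baseline, relative_lift, alpha=0.05, power=0.80):
    p1, p2 = baseline, baseline * (1 + relative_lift)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2

def smallest_detectable_lift(baseline, available_n, alpha=0.05, power=0.80):
    lift = 0.001
    while required_n(baseline, lift, alpha, power) > available_n:
        lift += 0.001                     # coarse search is fine for planning
    return lift

# Example: 10% baseline, 8,000 users per variant available over the test window.
print(f"{smallest_detectable_lift(0.10, 8_000):.1%} relative lift")
```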

This mistake is especially pernicious because it is invisible: you cannot tell from the output alone whether you had enough power to detect a meaningful effect. The only safeguard is transparent pre-registration of your MDE and sample size, along with a sensitivity analysis that reports the smallest effect your actual sample could have detected, so that null results are interpreted as inconclusive rather than negative. By being honest about your detection limits, you avoid over-interpreting null results and wasting effort.

Method Comparison: Three Approaches to Sample Size Planning

Choosing the right approach to sample size planning depends on your traffic volume, risk tolerance, and analytical sophistication. Below, we compare three common methods: Classic Power Analysis, Sequential Testing, and Bayesian Sample Size Estimation. Each has strengths, weaknesses, and ideal use cases. This comparison will help you decide which method fits your team's workflow.

Classic Power Analysis

This is the most widely taught method, requiring you to specify alpha (usually 0.05), power (usually 0.80), baseline rate, and MDE. It yields a fixed sample size per variant. Pros: simple to explain, widely supported by tools (e.g., R's pwr package, online calculators), and provides a clear stopping rule. Cons: assumes no peeking, can be wasteful or underpowered if the baseline or MDE assumptions are off, and does not allow early stopping when the effect is large. Best for: teams with stable traffic that can commit to a fixed duration, especially for high-stakes decisions. In practice, we recommend this as the default for most teams because it is robust and easy to audit.
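For teams working in Python rather than R, the same fixed-horizon calculation is available in statsmodels; a minimal sketch using the 20% → 24% example from earlier (values are illustrative):

```python
# Classic fixed-horizon power analysis using statsmodels (arcsine effect size).
# Illustrative values: 20% baseline, 24% target, alpha 0.05, power 0.80.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.24, 0.20)      # Cohen's h for the two proportions
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
print(f"about {n_per_variant:,.0f} users per variant")
```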

Sequential Testing

Sequential methods allow you to monitor results continuously while controlling error rates using alpha spending functions (e.g., O'Brien-Fleming, Pocock). Pros: reduces average sample size needed by up to 30% when effects are large, enables early stopping for success or futility, and aligns with agile decision-making. Cons: more complex to implement, requires specialized software, and can still suffer from peeking if boundaries are not correctly applied. Best for: high-traffic environments where speed matters, such as ad platforms or e-commerce sites running many experiments. Tools like Optimizely's Sequential Testing module or R's gsDesign package support this approach. Note: the trade-off is a slight increase in maximum sample size for small effects, so it is not a panacea.
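The boundary mathematics is best left to vetted packages, but a toy simulation can show why the approach controls error rates. The sketch below uses the commonly tabulated O'Brien-Fleming critical values for four equally spaced looks and checks that the overall false-positive rate under the null stays near 5%; treat it as an illustration, not a design tool.

```python
# Toy check that O'Brien-Fleming-style boundaries keep the overall alpha near 5%.
# The critical values are the commonly tabulated ones for 4 equally spaced looks
# at two-sided alpha = 0.05; real designs should come from a vetted package
# (e.g., gsDesign in R) rather than hard-coded numbers.
import numpy as np

rng = np.random.default_rng(7)
BOUNDARIES = [4.05, 2.86, 2.34, 2.02]   # |z| thresholds at looks 1..4
N_PER_LOOK = 500                        # users per variant added between looks

def false_positive_rate(n_sims=10_000, p=0.10):
    rejected = 0
    for _ in range(n_sims):
        a = rng.random(4 * N_PER_LOOK) < p   # control, no true difference
        b = rng.random(4 * N_PER_LOOK) < p   # treatment, same true rate
        for look, bound in enumerate(BOUNDARIES, start=1):
            n = look * N_PER_LOOK
            pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(b[:n].mean() - a[:n].mean()) / se > bound:
                rejected += 1
                break
    return rejected / n_sims

print(f"overall false-positive rate with 4 looks: {false_positive_rate():.3f}")
```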

Bayesian Sample Size Estimation

Bayesian methods incorporate prior information (e.g., from previous experiments or domain knowledge) to reduce sample size requirements. You specify a prior distribution for the effect size and update it as data arrives, stopping when the posterior probability of a meaningful effect exceeds a threshold. Pros: can reduce sample size by 20-50% when priors are informative, naturally handles multiple comparisons, and provides intuitive probability statements (e.g., 95% chance that lift > 0). Cons: requires careful prior specification (vague priors may inflate sample size), is less familiar to many stakeholders, and can be computationally intensive. Best for: teams with strong prior data, experienced statisticians, and scenarios where sample size is a critical constraint (e.g., low-traffic sites). Omatic recommends using Bayesian methods only after establishing a track record with frequentist approaches, as the interpretability gains come with added complexity.
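As a flavor of the Bayesian framing, the minimal Beta-Binomial sketch below computes the posterior probability that a variant beats control; the counts and the prior are hypothetical and would in practice come from your own historical data.

```python
# Minimal Bayesian A/B sketch: Beta-Binomial posteriors and P(variant > control).
# Counts and the Beta(2, 20) prior are hypothetical; real priors should come
# from historical experiments and be agreed on before the test starts.
import numpy as np

rng = np.random.default_rng(0)

control = dict(conversions=180, users=2000)
variant = dict(conversions=214, users=2000)
prior_a, prior_b = 2, 20                      # weakly informative prior near 9%

post_control = rng.beta(prior_a + control["conversions"],
                        prior_b + control["users"] - control["conversions"],
                        size=200_000)
post_variant = rng.beta(prior_a + variant["conversions"],
                        prior_b + variant["users"] - variant["conversions"],
                        size=200_000)

print(f"P(variant beats control) = {(post_variant > post_control).mean():.3f}")
print(f"P(relative lift > 5%)    = {(post_variant > 1.05 * post_control).mean():.3f}")
```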

Comparison Table

Method | Pros | Cons | Best For
--- | --- | --- | ---
Classic Power Analysis | Simple, widely supported, clear stopping rule | No peeking allowed, can be conservative | Most teams, high-stakes decisions
Sequential Testing | Early stopping, lower average sample size | Complex, requires specialized tools | High-traffic, speed-critical experiments
Bayesian Estimation | Incorporates prior data, intuitive results | Requires careful priors, computationally heavy | Low traffic, experienced teams

Each method can be effective when applied correctly. The key is to match the method to your team's maturity and constraints. If you are just starting out, classic power analysis is the safest bet. As you gain experience, consider incorporating sequential or Bayesian elements to improve efficiency.

Step-by-Step Guide: Building a Sample Size Plan That Works

This step-by-step guide provides a repeatable process for creating a sample size plan that avoids all three hidden mistakes. Follow these steps before launching any experiment, and document your plan in a pre-registration template. This ensures transparency and reproducibility.

Step 1: Define Your Primary Metric and Baseline

Select a single primary metric that directly measures the behavior you want to change. Avoid metrics that are too volatile or too rare. Gather baseline data from the past 30–90 days, excluding known anomalies. Calculate the average and check for seasonality. For binary metrics, record the proportion. For continuous metrics (e.g., revenue per user), calculate the mean and standard deviation. Use this baseline as your anchor for all subsequent calculations. If you have multiple baselines (e.g., different segments), consider a stratified design or use the most conservative baseline.
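A small sketch of this step, using hypothetical daily counts in place of your analytics export, might look like this:

```python
# Sketch: derive the baseline from recent daily data and flag obvious drift.
# `daily_visitors` / `daily_conversions` are hypothetical stand-ins for whatever
# your analytics export provides for the last 60 days.
import numpy as np

daily_visitors = np.random.default_rng(1).integers(900, 1100, size=60)
daily_conversions = np.random.default_rng(2).binomial(daily_visitors, 0.12)

recent = slice(-30, None)          # last 30 days
prior = slice(-60, -30)            # the 30 days before that

baseline = daily_conversions[recent].sum() / daily_visitors[recent].sum()
previous = daily_conversions[prior].sum() / daily_visitors[prior].sum()

print(f"30-day baseline: {baseline:.2%}")
if abs(baseline - previous) / previous > 0.10:
    print("Baseline moved >10% vs the prior month - re-check before sizing.")
```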

Step 2: Determine the Minimum Business-Significant Effect

Meet with stakeholders to agree on the smallest effect that would justify implementing the change. This is not a statistical decision; it is a business one. For example, a 1% lift in conversion might be worth $100,000 annually, while a 0.5% lift is not. Set your MDE to this value. If no one can articulate a business threshold, use a heuristic: for typical A/B tests, a 5-10% relative lift is a common starting point for conversion metrics. Document this threshold and include it in your pre-registration.

Step 3: Choose an Alpha, Power, and Method

Standard choices are alpha = 0.05 and power = 0.80. For high-risk decisions (e.g., pricing changes), consider alpha = 0.01. For low-risk changes, alpha = 0.10 may be acceptable. Select your method from the comparison above. If you are new, use classic power analysis. If you have high traffic and need speed, consider sequential testing. Use a reliable tool (e.g., Omatic's sample size calculator, R, or Python libraries) to compute the required sample size per variant.

Step 4: Check Feasibility Against Your Traffic

Divide your daily or weekly traffic equally among variants. Calculate how long it will take to reach your target sample size. If the duration exceeds 4 weeks, consider: (a) increasing your MDE, (b) using a less noisy metric, (c) reducing the number of variants, or (d) using a Bayesian approach with informative priors. If the duration is less than 1 week, consider whether you can afford to run longer to reduce sensitivity to novelty effects. Always include a buffer of 10-20% for data quality issues (e.g., invalid users, bot traffic).
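The feasibility arithmetic is simple enough to keep next to your plan; a sketch with hypothetical planning numbers:

```python
# Sketch: translate a required sample size into an expected duration, with a
# buffer for invalid traffic. All inputs are hypothetical planning numbers.
required_per_variant = 3_850          # e.g. from the power calculation
variants = 2
weekly_eligible_users = 6_000         # users entering the experiment per week
buffer = 0.15                         # 15% allowance for bots / invalid sessions

total_needed = required_per_variant * variants * (1 + buffer)
weeks = total_needed / weekly_eligible_users
print(f"Need ~{total_needed:,.0f} users -> about {weeks:.1f} weeks")
if weeks > 4:
    print("Over 4 weeks: consider a larger MDE, fewer variants, or a different metric.")
```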

Step 5: Pre-Register Your Plan

Write down your baseline, MDE, alpha, power, sample size, and stopping rule in a shared document or a dedicated tool (e.g., Omatic's experiment planner). Share it with your team before the experiment starts. This prevents post-hoc rationalization and makes your analysis more credible. If you must peek, plan for sequential testing in advance. Finally, set a calendar reminder to analyze results only after the end date, and resist the urge to check early. This step alone eliminates Mistake #1.
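A lightweight way to make the pre-registration concrete is to store the plan as a structured record; the field names below are illustrative, not a specific tool's schema.

```python
# Sketch of a pre-registration record kept alongside the experiment.
from dataclasses import dataclass, asdict
from datetime import date
import json

@dataclass
class ExperimentPlan:
    name: str
    primary_metric: str
    baseline: float
    mde_relative: float
    alpha: float
    power: float
    n_per_variant: int
    stopping_rule: str
    analysis_date: str

plan = ExperimentPlan(
    name="onboarding-flow-v2",
    primary_metric="trial_to_paid_conversion",
    baseline=0.20,
    mde_relative=0.20,
    alpha=0.05,
    power=0.80,
    n_per_variant=1_700,
    stopping_rule="fixed horizon; no interim analysis",
    analysis_date=str(date(2026, 6, 15)),
)
print(json.dumps(asdict(plan), indent=2))
```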

By following these five steps, you can launch experiments with confidence, knowing that your sample size is adequate and your results will be interpretable. This process is not just about statistics—it is about building a culture of rigorous experimentation.

Anonymized Real-World Scenarios: Learning from Others' Mistakes

To illustrate how these mistakes play out in practice, we present three anonymized scenarios based on common patterns observed across multiple teams. These are composites, not specific cases, but they capture the core challenges.

Scenario 1: The Premature Victory (Peeking)

A mid-size SaaS company tested a new pricing page. After just three days, the product manager noticed a 12% lift in sign-ups with a p-value of 0.03. Excited, they announced the change to the sales team and pushed it live. Over the next two weeks, sign-ups returned to baseline, and the team was left with a confused customer base and wasted development effort. The mistake: they peeked after only 300 users per variant, ignoring the pre-planned sample size of 2,000. The early result was a statistical fluke. The fix: they now use a blinded dashboard that hides results until the sample size is reached, and they enforce a mandatory 2-week waiting period before any analysis.

Scenario 2: The Hidden Baseline Shift (Scale Trap)

An e-commerce team ran a test on a new product recommendation widget. They used a baseline conversion rate of 5% from six months earlier, but in the meantime the site had redesigned its checkout, pushing the baseline to 8%. Their calculated sample size of 10,000 per variant was based on the old rate; at the higher baseline, the same relative lift corresponds to a larger absolute difference, so the experiment was overpowered and wasteful. They spent two extra weeks collecting data they did not need. Worse, the team was so focused on hitting their original target that they ignored a strong early signal that a properly sized (or sequential) design would have confirmed sooner. The fix: they now update baseline data monthly and use a rolling 30-day average to account for trends. They also run a sensitivity check: if the baseline has changed by more than 10% since the last calculation, they re-run the power analysis.

Scenario 3: The Unrealistic MDE (Ambition Trap)

A mobile app company wanted to test a new onboarding flow that they believed would increase retention. They set the MDE to a 0.5 percentage-point absolute improvement on a 10% baseline, because they thought any smaller effect was not worth implementing. However, only a few hundred new users per day entered onboarding to split across two variants. The required sample size was roughly 58,000 per variant—months of traffic at that rate. They ran the experiment for 30 days anyway, getting about 5,000 users per variant. The result was not statistically significant, and they concluded the flow was ineffective. Six months later, a larger competitor published a similar test showing a 0.3 percentage-point lift. The team realized they had missed a real effect because they had tried to detect a finer effect than their traffic could support. The fix: they now use a tiered system—if the required sample size is more than 5x their available traffic, they either increase the MDE or run a longer test with a pilot phase. They also perform a sensitivity analysis to interpret null results correctly.

These scenarios highlight that sample size mistakes are not abstract—they have real consequences for decision-making, resource allocation, and product outcomes. By recognizing the patterns, you can avoid repeating them.

Common Questions About Sample Size and Experiment Validity

Below we address frequent questions that arise when teams try to apply sample size principles. These answers are based on standard methodological practice and general guidance; for specific regulatory or high-stakes decisions, consult a qualified statistician.

Q1: What if I cannot reach the required sample size due to traffic constraints?

If you cannot reach the required sample size, you have three main options. First, increase your MDE to a larger effect that you can detect with fewer observations. Second, use a more sensitive metric—for example, an intermediate funnel or engagement metric closer to the change—which can carry more statistical power than a rare downstream conversion, though high-variance metrics like revenue per user may not help. Third, consider a within-subjects design (e.g., pre-post comparison) if the timing confounds can be controlled, though this introduces its own biases. In low-traffic situations, Bayesian methods with informative priors can sometimes reduce sample size requirements, but this requires careful implementation. Ultimately, it is better to acknowledge that your experiment is underpowered and interpret null results as inconclusive rather than negative.

Q2: Can I use post-hoc power analysis after a null result?

Yes, but with caution. Observed power (power computed from the observed effect size) is redundant because it is directly related to the p-value. Instead, compute the minimum effect size you could have detected with your actual sample size and your chosen alpha and power. If this detectable effect is larger than your business-significant threshold, then your experiment was underpowered to detect a meaningful effect, and the null result is inconclusive. This practice, often called sensitivity power analysis, helps you avoid falsely claiming no effect exists. Many teams use this post-hoc check as a routine quality gate.

Q3: How should I handle multiple metrics (e.g., conversion and revenue)?

Multiple metrics require multiple sample size calculations, one for each primary metric. However, you must correct for multiple comparisons to avoid inflating false positives. Common approaches include Bonferroni correction (divide alpha by the number of metrics) or using a composite metric (e.g., a weighted score). Alternatively, designate one metric as primary for sample size planning and treat others as exploratory, accepting that any significant findings on secondary metrics are hypothesis-generating, not confirmatory. This is the approach recommended by most official guidelines on clinical trials and has been adapted for product experimentation.

Q4: Does sample size change if I use a cluster-randomized design?

Yes, cluster-randomized designs (e.g., randomizing by region or team) require larger sample sizes because observations within a cluster are correlated. The design effect is approximately 1 + (m - 1) * ICC, where m is the average cluster size and ICC is the intraclass correlation coefficient. This can double or triple your required sample size. If you are using a cluster design, consult a statistical reference or use a dedicated power calculator that accounts for clustering. Many experiments that ignore this effect are severely underpowered.
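Applying the design effect is a one-line adjustment on top of the individual-level sample size; a sketch with hypothetical inputs:

```python
# Sketch: inflate a per-individual sample size for a cluster-randomized design.
def cluster_adjusted_n(n_individual, avg_cluster_size, icc):
    design_effect = 1 + (avg_cluster_size - 1) * icc
    return n_individual * design_effect

# Example: 3,850 users needed per variant, clusters of 50 users, ICC of 0.02.
print(f"{cluster_adjusted_n(3_850, 50, 0.02):,.0f} users per variant after adjustment")
```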

These questions represent the most common points of confusion we encounter. If you have a unique situation, we recommend consulting with a statistician or using a validated power analysis tool that handles your specific design.

Conclusion: Building a Culture of Valid Experimentation

The three hidden sample size mistakes—peeking, ignoring baselines, and mis-setting the MDE—are not just technical errors; they reflect deeper issues in how teams approach experimentation. By addressing them systematically, you can transform your experiment pipeline into a reliable engine for learning. The solutions are straightforward: pre-register your sample size, use accurate baseline data, set realistic MDEs, and choose a method that fits your traffic and risk profile. Each fix builds trust in your results and reduces wasted effort.

We encourage you to audit your current experiments against these three mistakes. Check the last few A/B tests you ran: did you peek? Did you use a current baseline? Was your MDE achievable? You may find that several of your past decisions were based on flawed data. This is normal, and the goal is not to assign blame but to improve going forward. Start with one experiment: apply the step-by-step guide from this article, and document your plan. Over time, these practices become automatic, and your team will develop a culture of rigorous experimentation. This guide is part of Omatic's Problem-Solution Playbook, and we will continue to update it as practices evolve. Remember, experimentation is a journey, not a destination. Each valid result—whether positive or null—adds to your collective understanding. Avoid these mistakes, and your experiments will serve you well.

About the Author

This article was prepared by the editorial team for this publication. We focus on practical explanations and update articles when major practices change.

Last reviewed: May 2026
