The 3 Hidden Sample Size Mistakes That Invalidate Your Experiment Results (Omatic's Problem-Solution Playbook)

You ran an experiment. The p-value looked great. You launched the change. And nothing happened. Or worse, performance dropped. The culprit is often not your hypothesis but your sample size—specifically, three hidden mistakes that invalidate results before the first visitor sees the variant. This guide from Omatic walks through each mistake and gives you a problem-solution playbook to fix them.

Who Needs This and What Goes Wrong Without It

If you design experiments—A/B tests, survey analyses, controlled studies—you need enough data to detect a real effect. Without proper sample size planning, you risk two costly errors: false positives (thinking a change works when it doesn't) and false negatives (missing a real improvement). This guide is for product managers, data scientists, marketers, and anyone who runs experiments and wants to avoid wasting time on inconclusive or misleading results.

Consider a typical scenario: a product team wants to test a new checkout flow. They set up the experiment, run it for two weeks, and see a 5% lift in conversion with a p-value of 0.04. Confident, they roll out the change. But the lift never materializes in production. Why? The sample size was too small. In an underpowered test, a result that crosses the significance threshold is more likely to be random noise or an exaggerated estimate of a much smaller effect. The team didn't calculate the required sample size beforehand, and they fell into the first hidden mistake: ignoring the baseline conversion rate.

Without a clear sample size target, teams often stop experiments too early (peeking) or run them too long (costing time and resources). It is commonly reported that a large share of A/B tests are underpowered, meaning they can't reliably detect the effect size the team cares about. This isn't just a statistical nuance; it's a business problem. False positives lead to bad product decisions. False negatives kill good ideas. This playbook addresses three specific mistakes: not accounting for baseline rates, misunderstanding minimum detectable effect (MDE), and neglecting sample size for segments. Each mistake has a straightforward solution once you know where to look.

Who Should Read This

Anyone who designs or analyzes experiments, from junior analysts to senior product leaders. If you've ever looked at an experiment result and felt unsure whether to trust it, this guide is for you.

What You Will Learn

By the end, you'll be able to identify the three hidden mistakes, calculate required sample sizes using a simple workflow, and apply variations for common constraints like limited traffic or multiple metrics. You'll also get a checklist to debug experiments that went wrong.

Prerequisites and Context

Before diving into the mistakes, let's settle a few foundational concepts. You don't need a statistics degree, but you should understand a few key terms: baseline conversion rate, minimum detectable effect (MDE), significance level (α), and statistical power (1−β). The baseline conversion rate is the current value of the metric you're trying to improve, say, 10% of visitors who complete a purchase. The MDE is the smallest change you want to detect, like a lift of 1 percentage point (absolute). Significance level (usually 0.05) is the probability of a false positive you're willing to accept when there is no real effect. Power (usually 0.80) is the probability of detecting a true effect of at least the MDE if it exists.

Sample size calculations depend on these four inputs. The formula is straightforward, but the mistakes happen when we choose these inputs carelessly or ignore practical constraints. Another prerequisite: you need a rough estimate of your expected traffic or sample size per day. If you're running an online experiment, you can usually get this from your analytics platform.
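
For reference, one standard normal-approximation form of the formula, for a two-sided test comparing two proportions with equal allocation, is sketched below. The symbols are notation introduced here for illustration: p_1 is the baseline rate, p_2 is the baseline plus the absolute MDE, and n is the number of visitors per variant. Individual calculators may use slightly different variants (pooled variance, continuity corrections), so treat this as a guide rather than the formula any particular tool uses.

```latex
n \;\approx\; \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\bigl[\,p_1(1-p_1) + p_2(1-p_2)\,\bigr]}{(p_2 - p_1)^{2}}
```

With α=0.05 and power=0.80, the two z-values are approximately 1.96 and 0.84.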

One more context: experiments are not just for websites. They are used in product launches, pricing tests, email campaigns, and even offline studies. The principles are the same, but the constraints differ. For example, an email experiment might have a fixed list size, while a website experiment can often run longer to gather more data. Understanding your constraints helps you choose the right approach.

Common Misconceptions

Many teams think sample size is only about the total number of visitors. But it's also about the distribution of those visitors across variants and over time. Uneven splits or early stopping can invalidate results even with a large sample. Also, some believe that any statistically significant result is trustworthy. Not if the test was underpowered—small samples can produce significant results by chance, especially if you peek repeatedly.

Core Workflow: Calculating Sample Size Correctly

Here's a step-by-step workflow to calculate sample size and avoid the three hidden mistakes. We'll use a typical online experiment as an example.

Step 1: Estimate Baseline Conversion Rate

Look at your historical data for the metric you want to improve. Use a stable period (at least a few weeks) and calculate the average rate. For example, if your current checkout conversion is 10%, that's your baseline. Mistake #1 is using a rough guess or ignoring seasonal fluctuations. Solution: use a confidence interval around your baseline to account for uncertainty. If your baseline fluctuates between 9% and 11%, run the calculation at both ends and plan for the larger result. Which end is the worst case depends on how you state the MDE: for a fixed relative MDE, the lower baseline (9%) requires the larger sample, while for a fixed absolute MDE and rates below 50%, the higher baseline (11%) requires more.
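
As a rough illustration of putting an interval around a baseline, here is a minimal sketch using the Wilson score interval. The function name and the example counts (400 conversions out of 4,000 visitors) are made up for illustration. Note that this interval only reflects sampling uncertainty; it does not capture seasonal swings, which you still need to check by comparing periods.

```python
from statistics import NormalDist


def baseline_interval(conversions: int, visitors: int, confidence: float = 0.95):
    """Wilson score interval for an observed conversion rate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = conversions / visitors
    denom = 1 + z**2 / visitors
    center = (p_hat + z**2 / (2 * visitors)) / denom
    half_width = (
        z * (p_hat * (1 - p_hat) / visitors + z**2 / (4 * visitors**2)) ** 0.5
    ) / denom
    return center - half_width, center + half_width


# Hypothetical historical data: 400 purchases from 4,000 visitors (10%).
low, high = baseline_interval(conversions=400, visitors=4000)
print(f"observed 10.0%, 95% interval: {low:.2%} to {high:.2%}")
```

Running the sample size calculation at both ends of the interval shows how sensitive the plan is to the baseline estimate.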

Step 2: Choose Minimum Detectable Effect (MDE)

Decide the smallest practical improvement you care about. This is business-driven: a 0.1% lift might not be worth the engineering effort, but a 1% lift is. Mistake #2 is choosing an MDE that's too small (making sample size impossibly large) or too large (missing meaningful effects). Solution: align with stakeholders on the minimum effect that would change a decision. For most experiments, a relative MDE of 5–10% is reasonable. In our example, an absolute lift of 1 percentage point (from 10% to 11%) is a 10% relative lift.
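
Because absolute and relative MDE are easy to mix up, a tiny conversion helper can make the distinction explicit. The function names below are hypothetical, chosen for this sketch.

```python
def absolute_to_relative(baseline: float, absolute_mde: float) -> float:
    """Convert an absolute lift (in rate units) to a relative lift."""
    return absolute_mde / baseline


def relative_to_absolute(baseline: float, relative_mde: float) -> float:
    """Convert a relative lift to an absolute lift (in rate units)."""
    return baseline * relative_mde


baseline = 0.10
# A lift from 10% to 11% expressed as a relative change.
print(f"{absolute_to_relative(baseline, 0.01):.1%} relative")
# A 5% relative lift on a 10% baseline, expressed in percentage points.
print(f"{relative_to_absolute(baseline, 0.05) * 100:.2f} percentage points")
```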

Step 3: Set Significance Level and Power

Standard values are α=0.05 and power=0.80. Adjust if needed: for high-risk decisions, use α=0.01; for exploratory tests, α=0.10 is sometimes acceptable. Power of 0.80 means you have an 80% chance of detecting the MDE if it's real. Increasing power to 0.90 requires a larger sample but reduces false negatives.
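
To see how much these choices cost, the sketch below computes the factor (z for α plus z for power, squared) that scales the required sample size in the usual normal-approximation formula. The helper name `z_multiplier` is an assumption made for this example, and a two-sided test is assumed.

```python
from statistics import NormalDist


def z_multiplier(alpha: float, power: float) -> float:
    """(z_{1-alpha/2} + z_{power})^2 for a two-sided test."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2


default = z_multiplier(0.05, 0.80)
for alpha, power in [(0.05, 0.80), (0.05, 0.90), (0.01, 0.80), (0.10, 0.80)]:
    m = z_multiplier(alpha, power)
    print(f"alpha={alpha:.2f} power={power:.2f}: {m / default:.2f}x the default sample size")
```

Under these assumptions, raising power from 0.80 to 0.90 costs roughly a third more traffic, and tightening α to 0.01 costs roughly half again as much.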

Step 4: Use a Sample Size Calculator

Plug your numbers into a calculator (many free online tools exist). For a two-tailed test with baseline=10%, MDE=1 percentage point (absolute), α=0.05, power=0.80, you'll need about 15,000 visitors per variant, or about 30,000 in total for two variants. If your daily traffic is 1,000 visitors split 50/50, each variant receives 500 visitors per day, so the experiment takes about 30 days. Mistake #3 is forgetting that you need this many visitors per segment if you plan to analyze subgroups. For example, if you want to see effects on mobile users separately, you need 15,000 mobile visitors per variant, which might take much longer.
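
If you want to check a calculator's output yourself, the calculation can be sketched with the Python standard library. This uses the unpooled normal approximation for two proportions; the function name is hypothetical, and online tools may differ by a few percent depending on the exact variant they implement.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, absolute_mde, alpha=0.05, power=0.80):
    """Visitors per variant for a two-sided, two-proportion test (equal split)."""
    z = NormalDist().inv_cdf
    p1, p2 = baseline, baseline + absolute_mde
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z(1 - alpha / 2) + z(power)) ** 2 * variance_sum / absolute_mde**2
    return math.ceil(n)


n = sample_size_per_variant(baseline=0.10, absolute_mde=0.01)
daily_visitors = 1000  # total traffic, split 50/50 across two variants
days = math.ceil(2 * n / daily_visitors)
print(f"{n} visitors per variant, about {days} days at {daily_visitors} visitors/day")
```

With these inputs the sketch returns a figure a little under 15,000 per variant, consistent with the rounded number above.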

Step 5: Plan Duration and Check Assumptions

Run the experiment for the full calculated duration. Don't peek at results early (or if you must, use a sequential testing method). Also, ensure that your traffic is consistent—avoid running experiments during holidays or major events that change behavior.

Tools, Setup, and Environment Realities

You don't need expensive software to calculate sample sizes. Free online calculators like Evan Miller's sample size calculator or built-in functions in R (pwr package) and Python (statsmodels) work well. For continuous metrics (like revenue per visitor), use a different formula based on standard deviation. Most calculators also handle one-tailed vs. two-tailed tests; use two-tailed by default unless you have a strong directional hypothesis.

Your experiment platform (e.g., Optimizely, Google Optimize, or custom in-house) likely has a sample size estimator. But be careful: some platforms default to a 50/50 split and assume infinite traffic. You need to input your own parameters. Also, consider the environment: if you're running multiple experiments simultaneously, ensure they don't overlap on the same pages or metrics, as this can contaminate results.

For offline or survey experiments, the same principles apply but with different tools. Use software like G*Power for general sample size calculations. For survey analysis, consider design effects (like clustering) that inflate required sample size. A common mistake is using simple random sample formulas when your sample is stratified or clustered.
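
For clustered samples, a common approximation is the design effect 1 + (m − 1) × ICC, where m is the average cluster size and ICC is the intraclass correlation. The sketch below applies it to a simple-random-sample size; the function names and the example values (clusters of 21, ICC of 0.05) are illustrative assumptions, not figures from any real survey.

```python
import math


def design_effect(cluster_size: float, icc: float) -> float:
    """Approximate variance inflation from cluster sampling."""
    return 1 + (cluster_size - 1) * icc


def clustered_sample_size(srs_sample_size: int, cluster_size: float, icc: float) -> int:
    """Inflate a simple-random-sample size by the design effect."""
    return math.ceil(srs_sample_size * design_effect(cluster_size, icc))


# Hypothetical survey: 1,000 respondents needed under simple random sampling.
print(design_effect(cluster_size=21, icc=0.05))
print(clustered_sample_size(1000, cluster_size=21, icc=0.05))
```

Even a small intraclass correlation can double the requirement when clusters are large.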

Common Tool Pitfalls

Many online calculators assume equal allocation between variants. If you use an unequal split (e.g., 90/10), the total required sample size increases substantially, because precision is limited by the smaller variant; a 90/10 split needs roughly 2.8 times the total traffic of a 50/50 split for the same power. Also, some calculators require the effect size in Cohen's d for continuous outcomes. Know the difference between absolute and relative MDE and which your calculator expects.
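
The cost of an unequal split can be estimated with a short calculation. Assuming both variants have roughly the same variance, the total sample needed relative to a 50/50 split is 1 / (4 × r × (1 − r)), where r is the share of traffic sent to one variant. The function name is hypothetical.

```python
def allocation_inflation(variant_share: float) -> float:
    """Total sample needed relative to a 50/50 split, for the same power."""
    other_share = 1 - variant_share
    return 1 / (4 * variant_share * other_share)


for share in (0.5, 0.3, 0.1):
    print(f"{share:.0%}/{1 - share:.0%} split: {allocation_inflation(share):.2f}x total traffic")
```

Moderate imbalance is cheap, but the penalty grows quickly as the split approaches 90/10 and beyond.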

Variations for Different Constraints

Not every experiment has the luxury of unlimited traffic or time. Here are variations for common constraints.

Limited Traffic

If you can't reach the required sample size, consider increasing the MDE (detect only larger effects), using a one-tailed test (if justified), or accepting lower power (say, 0.60). Alternatively, run a longer experiment—but beware of time-based confounds. Another option is to use a Bayesian approach, which can incorporate prior information and may require less data, though it comes with its own assumptions.
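
When traffic is fixed, it can help to turn the calculation around and ask what effect the available sample can detect. The sketch below does that under a simplifying assumption that both variants have the baseline variance; the function name and the figure of 5,000 visitors per variant are illustrative.

```python
from statistics import NormalDist


def detectable_mde(baseline, n_per_variant, alpha=0.05, power=0.80, two_sided=True):
    """Approximate absolute MDE detectable with a fixed sample per variant."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    # Simplification: use the baseline variance p(1-p) for both variants.
    standard_error = (2 * baseline * (1 - baseline) / n_per_variant) ** 0.5
    return (z_alpha + z(power)) * standard_error


n = 5000  # hypothetical: all the traffic available per variant
print(f"default:       {detectable_mde(0.10, n):.2%}")
print(f"power 0.60:    {detectable_mde(0.10, n, power=0.60):.2%}")
print(f"one-sided:     {detectable_mde(0.10, n, two_sided=False):.2%}")
```

Comparing the three lines shows how much each concession actually buys before you commit to it.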

Multiple Metrics

If you care about several metrics (e.g., conversion rate, average order value, bounce rate), you need to adjust for multiple comparisons. The simplest fix is a Bonferroni correction (divide α by the number of metrics), which increases sample size. A better approach is to pre-specify a primary metric and treat others as exploratory. Or use a composite metric like a desirability index.
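
To get a feel for what a Bonferroni correction costs, the sketch below compares the per-variant sample size at α=0.05 with the size at α divided by three metrics. It reuses the unpooled two-proportion approximation; the function name and the choice of three metrics are assumptions for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, absolute_mde, alpha=0.05, power=0.80):
    """Visitors per variant for a two-sided, two-proportion test (equal split)."""
    z = NormalDist().inv_cdf
    p1, p2 = baseline, baseline + absolute_mde
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil(
        (z(1 - alpha / 2) + z(power)) ** 2 * variance_sum / absolute_mde**2
    )


n_metrics = 3
uncorrected = sample_size_per_variant(0.10, 0.01)
corrected = sample_size_per_variant(0.10, 0.01, alpha=0.05 / n_metrics)
print(f"uncorrected: {uncorrected}, Bonferroni for {n_metrics} metrics: {corrected}")
print(f"increase: {corrected / uncorrected - 1:.0%}")
```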

Segments and Subgroups

When you plan to analyze segments (e.g., new vs. returning users), power the experiment for the smallest segment you care about. If returning users are only 20% of traffic, you'll need five times the per-segment requirement in total traffic to collect enough returning users. Alternatively, use a stratified allocation to oversample the small segment, though this complicates analysis.
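
The segment arithmetic is simple but worth writing down, because the resulting duration is often a surprise. The sketch assumes the segment's share of traffic is stable and uses a hypothetical function name; 15,000 per variant and 1,000 visitors per day are the example figures used earlier in this guide.

```python
import math


def visitors_for_segment(n_per_variant_in_segment: int, segment_share: float) -> int:
    """Total visitors per variant needed so the segment alone reaches its target."""
    return math.ceil(n_per_variant_in_segment / segment_share)


per_variant = visitors_for_segment(15000, segment_share=0.20)
days = math.ceil(2 * per_variant / 1000)  # two variants, 1,000 visitors per day
print(f"{per_variant} visitors per variant, about {days} days")
```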

Continuous Metrics

For metrics like revenue per visitor, you need an estimate of the standard deviation (from historical data). The formula is similar but uses effect size in standard deviation units. A common rule of thumb for α=0.05 and power=0.80: you need about 16 * (σ / MDE)^2 per variant, where σ is the standard deviation and the MDE is in the same units as the metric. If the metric is highly variable (like revenue), sample size can be huge. Consider using a transformation or a ratio metric instead.
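
For a continuous metric, the normal-approximation formula for comparing two means with equal variance can be sketched as follows. The function name and the example values (a standard deviation of 30 and an MDE of 1, in the same currency units) are assumptions for illustration.

```python
import math
from statistics import NormalDist


def continuous_sample_size(std_dev, absolute_mde, alpha=0.05, power=0.80):
    """Visitors per variant for a two-sided comparison of two means."""
    z = NormalDist().inv_cdf
    multiplier = 2 * (z(1 - alpha / 2) + z(power)) ** 2
    return math.ceil(multiplier * (std_dev / absolute_mde) ** 2)


# Hypothetical revenue-per-visitor metric: sigma = 30, MDE = 1.
print(continuous_sample_size(std_dev=30, absolute_mde=1))
```

Because the required sample grows with the square of σ / MDE, halving the metric's variability (or doubling the MDE) cuts the requirement to a quarter.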

Pitfalls, Debugging, and What to Check When It Fails

Even with careful planning, experiments can go wrong. Here are common pitfalls and how to debug them.

Pitfall 1: Peeking and Early Stopping

The most common mistake is checking results daily and stopping as soon as p<0.05. This inflates false positive rates dramatically. Solution: pre-register your sample size and stick to it. If you must monitor, use a sequential testing method like the always-valid p-value or a Bayesian stopping rule.
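
A small A/A simulation illustrates how repeated looks inflate false positives. The sketch below is a simplification: it models the running z-statistic directly as a sum of independent standard normal increments (one per look) rather than simulating individual visitors, and the function name, seed, and number of looks are arbitrary choices for this example.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_looks, n_sims=2000, alpha=0.05, seed=0):
    """Share of A/A runs in which any of n_looks interim z-tests rejects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # Each batch of data adds an independent N(0, 1) increment to the
            # standardized difference; there is no true effect in an A/A test.
            cumulative += rng.gauss(0.0, 1.0)
            z_stat = cumulative / look**0.5
            if abs(z_stat) > z_crit:
                rejections += 1
                break  # the team stops at the first "significant" look
    return rejections / n_sims


print(f"one look at the end: {false_positive_rate(1):.1%}")
print(f"14 daily looks:      {false_positive_rate(14):.1%}")
```

In runs of this sketch, the single-look rate stays near the nominal 5%, while daily checking over two weeks pushes it well above that.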

Pitfall 2: Ignoring Practical Significance

A statistically significant result might be practically meaningless. For example, a 0.1% lift with p=0.04 might not be worth the implementation cost. Always interpret results in business context, not just p-values.

Pitfall 3: Confounding Changes

If you run experiments during a marketing campaign or site redesign, your results may be confounded. Check for external events and consider using a holdout group or a time-series analysis to control for trends.

Debugging Checklist

When an experiment returns surprising results (or no results), check these:

  • Did you calculate sample size correctly? Re-run the calculation with actual baseline and MDE.
  • Was the experiment duration too short? Compare actual sample size to required.
  • Were there any technical issues (e.g., tracking errors, overlapping experiments)?
  • Did you check for Simpson's paradox? Aggregate results can differ from segment results.
  • Did you account for multiple comparisons? If you tested many metrics, some significant results are expected by chance.
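
For the last item in the checklist, a quick calculation shows how fast chance findings accumulate. The sketch assumes the metrics are independent, which real metrics rarely are, so treat the output as a rough guide; the function name is hypothetical.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20):
    print(f"{k:>2} metrics: {familywise_error_rate(k):.0%} chance of at least one false positive")
```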

FAQ and Checklist in Prose

Here are answers to frequent questions and a concise checklist to use before every experiment.

FAQ

How do I choose the right MDE? Start with the minimum effect that would change a business decision. If a 1% lift is worth the effort, use that. If you're unsure, run a small pilot to estimate variability.

What if I can't reach the required sample size? Consider increasing MDE, using a one-tailed test, or accepting lower power. Alternatively, use a Bayesian approach or run a qualitative study first.

Can I use historical data as a control? It's risky due to time-based confounds. A concurrent A/B test is more reliable. If you must use historical data, use a method like difference-in-differences.

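How do I handle multiple variants? For more than two variants (e.g., A/B/C), correct for multiple comparisons (for example with Bonferroni), or consider a multi-armed bandit if the goal is to allocate traffic rather than to estimate each effect. Total sample size grows with the number of variants, and the stricter α raises the per-variant requirement further.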

What's the difference between absolute and relative MDE? Absolute MDE is the change in percentage points (e.g., from 10% to 11%). Relative MDE is the percentage change (e.g., a 10% increase). Calculators differ in which one they expect, and some let you switch between the two, so check before entering a number. Be consistent.

Pre-Experiment Checklist

Before launching your next experiment, run through this checklist:

  • Define a primary metric with a clear baseline from historical data.
  • Choose an MDE that is both detectable and practically meaningful.
  • Set α=0.05 and power=0.80 (or adjust based on risk).
  • Calculate required sample size per variant using a reliable calculator.
  • Plan experiment duration: at least the time needed to reach that sample size, plus a buffer for weekends or anomalies.
  • Pre-register your sample size and analysis plan (optional but recommended).
  • Ensure no other experiments overlap on the same metric.
  • After the experiment, check for technical issues and external confounds.
  • Interpret results with both statistical and practical significance in mind.

By following this playbook, you'll avoid the three hidden sample size mistakes and produce experiment results you can trust. Next time you plan an experiment, start with the sample size calculation—not as an afterthought, but as the foundation of your analysis.
