Why Your Sample Size Is Sabotaging Your A/B Tests (And How Omatic Can Fix the Blind Spot)

You ran an A/B test for two weeks, saw a 12% lift in conversions, and launched the winner. Three months later, the metric is flat. What happened? Most likely, your sample size was too small to detect a real effect, and the apparent winner was just noise. This is the most common—and most expensive—pitfall in experimentation. In this guide, we'll show you how sample size silently undermines your tests, and how Omatic can help you catch the problem before you act on false positives.

Who Needs This and What Goes Wrong Without It

If you run A/B tests on a website, app, or email campaign, you need to care about sample size. The problem isn't limited to big enterprises with massive traffic. In fact, smaller teams often suffer more because they're tempted to end tests early or run too many variations with limited visitors.

The core mistake is simple: you need enough observations to distinguish a real effect from random variation. Without the right sample size, your test is underpowered—meaning it has a low chance of detecting a true difference even if one exists. Industry surveys suggest that a majority of published A/B tests in marketing blogs are underpowered, leading to unreliable conclusions.

What Goes Wrong in Practice

When sample size is too small, two things happen. First, you get false positives: you see a lift that's actually just noise, and you make a bad decision. Second, you get false negatives: you miss a real improvement because the test couldn't detect it. Both waste time and money.

For example, imagine you run a test with 500 visitors per variation. A conversion rate of 5% vs. 6% (25 vs. 30 conversions) looks like a promising 20% relative lift, but the 95% confidence interval for the difference runs from roughly -1.8 to +3.8 percentage points, and the p-value is about 0.49. The result is nowhere near statistically significant. Many teams would call it a winner anyway because the lift looks large. That's a recipe for disappointment.
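To see how wide the uncertainty is at 500 visitors per variation, here is a minimal sketch of a two-proportion z-test using only the Python standard library. The function name is our own, and the normal approximation is an assumption that holds reasonably well at these counts; it is not a description of how any particular tool computes its numbers.

```python
from math import sqrt, erf

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates, with a 95% CI."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error for the hypothesis test (assumes no true difference).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    # Unpooled standard error for the confidence interval.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    return diff, p_value, ci

# 5% vs. 6% with 500 visitors per variation: 25 vs. 30 conversions.
diff, p_value, ci = two_proportion_test(25, 500, 30, 500)
print(f"lift: {diff:.3f}, p-value: {p_value:.2f}, 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval comfortably includes zero, so the data are consistent with no effect at all, or even with a small loss.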

Another common scenario is peeking: checking results daily and stopping as soon as p < 0.05. This inflates the false positive rate dramatically—sometimes to 30% or more. Without a fixed sample size plan, you're essentially data-dredging.
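A small simulation makes the peeking problem concrete. The sketch below runs A/A tests (both arms share the same true conversion rate, so every "winner" is a false positive) and compares two stopping rules on the same data. The traffic numbers, the 10% rate, and the 14 daily looks are illustrative assumptions, and the exact rates you get will vary with them.

```python
import random
from math import sqrt

def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic; returns 0.0 if it is undefined."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se

def simulate_aa_tests(n_sims=500, days=14, daily_per_arm=100, rate=0.10, seed=42):
    """Simulate A/A tests (no true effect) and compare two stopping rules."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        significant_at_any_look = False
        for day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(daily_per_arm))
            conv_b += sum(rng.random() < rate for _ in range(daily_per_arm))
            n += daily_per_arm
            # Peeking rule: a team that checks daily would stop at the
            # first look where |z| > 1.96, so any such look counts as a hit.
            if abs(z_stat(conv_a, n, conv_b, n)) > 1.96:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look
        # Fixed-horizon rule: look only once, after the final day.
        fixed_hits += abs(z_stat(conv_a, n, conv_b, n)) > 1.96
    return peeking_hits / n_sims, fixed_hits / n_sims

peeking_rate, fixed_rate = simulate_aa_tests()
print(f"false positive rate with daily peeking: {peeking_rate:.1%}")
print(f"false positive rate with one final look: {fixed_rate:.1%}")
```

Because the final look is also one of the daily looks, the peeking rate can never be lower than the fixed-horizon rate, and more looks push it higher.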

Omatic's dashboard includes a sample size calculator and a peeking alert that warns you when you're looking at results too early. It's not a magic bullet, but it forces you to think before you peek.

Who Specifically Benefits

This matters for anyone who:

  • Runs tests with less than 10,000 visitors per variation per week
  • Tests multiple variations simultaneously (A/B/n tests)
  • Uses tools that don't enforce sample size planning (most do not)
  • Has stakeholders who demand quick results

If any of these describe your situation, read on—the fix is straightforward once you know what to look for.

Prerequisites and Context You Should Settle First

Before you calculate sample size, you need to make a few decisions. These aren't technical—they're business and statistical choices that affect the numbers.

Define Your Minimum Detectable Effect (MDE)

The MDE is the smallest improvement you care about. If a 1% lift in conversion rate is worth $10,000 a month, you want to detect it. If it's only $100, you might wait for a 5% lift. The smaller the effect you want to detect, the larger the sample size you need. There's no free lunch: doubling precision requires roughly quadrupling the sample.
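The quadrupling rule follows from the usual normal-approximation formula for comparing two conversion rates. As a sketch, with p_1 the baseline rate, p_2 the rate you want to be able to detect, and n the visitors needed per variation (the symbols are ours, not from any specific calculator):

```latex
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}
          \left[\,p_1(1-p_1) + p_2(1-p_2)\,\right]}
         {(p_2 - p_1)^{2}}
```

The difference p_2 - p_1 is squared in the denominator, so halving the effect you want to detect multiplies n by roughly four, while the numerator barely changes.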

Most teams set MDE too small. They think, "I want to detect a 0.5% lift," not realizing they'd need millions of visitors. A better approach: estimate the business impact of different effect sizes and pick the smallest one that still justifies the test's cost.

Choose Your Significance Level and Power

Standard practice is α = 0.05 (5% false positive risk) and β = 0.20 (80% power). These are conventions, not laws. If a false positive would be catastrophic (e.g., a medical device), you'd use α = 0.01. If you can tolerate more risk, you might use α = 0.10. Similarly, 80% power means you have a 20% chance of missing a real effect. For high-stakes tests, you might want 90% power.

Omatic's sample size tool lets you adjust these parameters and see how they affect the required sample. It's a good way to understand the trade-off.
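If you want to explore the trade-off outside a calculator, the sketch below applies the standard two-sided, two-proportion formula with Python's standard library. The function name and parameters are our own, and we are assuming a relative MDE and an even traffic split; a given tool may use a slightly different formula and return slightly different numbers.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variation for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Baseline 5%, 10% relative lift, under different risk tolerances.
for alpha, power in [(0.10, 0.80), (0.05, 0.80), (0.05, 0.90), (0.01, 0.90)]:
    n = sample_size_per_variation(0.05, 0.10, alpha, power)
    print(f"alpha={alpha:.2f}, power={power:.2f}: {n:,} per variation")
```

Stricter significance levels and higher power both raise the requirement, which is why these choices should be made before you look at your traffic numbers, not after.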

Know Your Baseline Conversion Rate

You need an estimate of the current conversion rate. Use historical data—at least a few weeks—to get a stable number. If your baseline is very low (say 1%), you'll need more visitors than if it's 10%, because the relative variability is higher.

A common mistake is using a baseline from a short period that includes a holiday or promotion. That skews the baseline estimate and leads to an inaccurate sample size calculation. Use a typical period, and if your traffic is seasonal, plan for the season you'll test in.

Check Your Traffic Volume and Duration

Once you know the required sample size, compare it to your daily traffic. If you need 100,000 visitors per variation and each variation receives 5,000 a day, the test will take 20 days. That's fine if you can wait. But if you need 1 million visitors per variation at the same traffic, that's 200 days, which is probably too long. At that point, you might increase MDE, lower power, or accept a higher significance level.

Omatic's duration estimator takes your daily traffic and calculates how many days each variation needs. It also accounts for weekends and traffic dips. This prevents you from starting a test that will never finish.

Core Workflow: How to Determine the Right Sample Size

Here's a step-by-step process you can follow for every A/B test. We'll use Omatic's features as examples, but the logic applies to any tool.

Step 1: Set Your Parameters

Open Omatic's sample size calculator. Enter:

  • Baseline conversion rate (e.g., 5%)
  • Minimum detectable effect (e.g., 10% relative lift, meaning from 5% to 5.5%)
  • Significance level (default 0.05)
  • Statistical power (default 0.80)

The calculator outputs the required sample size per variation. For our example, a standard two-sided, two-proportion calculation gives roughly 31,000 visitors per variation.

Step 2: Check Feasibility

Compare the required sample to your daily traffic. If you get 1,500 visitors per day and split them 50/50, each variation gets 750 per day. That means about 42 days to reach 31,000 per variation. If that's acceptable, proceed. If not, adjust parameters.

Omatic shows a feasibility gauge: green for tests that can finish in a reasonable time (say under 30 days), yellow for borderline, red for too long.
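The feasibility check is simple arithmetic, sketched below. The function names and the 60-day cutoff between yellow and red are our own assumptions for illustration; only the "under 30 days" green zone comes from the description above. The sketch also assumes flat daily traffic and ignores weekday and weekend swings.

```python
from math import ceil

def days_to_finish(required_per_variation, daily_visitors, n_variations=2):
    """Days needed when traffic is split evenly across all variations."""
    per_variation_per_day = daily_visitors / n_variations
    return ceil(required_per_variation / per_variation_per_day)

def feasibility(days, green_max=30, yellow_max=60):
    """Traffic-light label; the 60-day yellow cutoff is an assumed value."""
    if days <= green_max:
        return "green"
    return "yellow" if days <= yellow_max else "red"

# Hypothetical sample size requirements at 1,500 visitors per day.
for required in (10_000, 30_000, 120_000):
    days = days_to_finish(required, daily_visitors=1_500)
    print(f"{required:,} per variation: {days} days ({feasibility(days)})")
```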

Step 3: Run the Test Without Peeking

Once the test starts, resist the urge to check results. Omatic has a peeking guard that hides p-values until the sample size is reached. You can still see the raw data, but no significance labels. This reduces false positive rates dramatically.

If you must check early (e.g., for safety monitoring), use Omatic's sequential testing option, which adjusts thresholds so you can peek without inflating error rates. But this requires a different sample size calculation—don't mix methods.

Step 4: Analyze After Reaching the Target

When the sample size is met, Omatic performs the analysis. It shows the p-value, confidence interval, and a Bayesian probability of being best. But don't just look at p < 0.05. Check the confidence interval: if it's very wide, the estimate is imprecise, and you might need more data. Also consider the practical significance: is the effect large enough to matter?
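A "probability of being best" can be approximated with a few lines of Monte Carlo. The sketch below assumes independent uniform Beta(1, 1) priors on each conversion rate, which is a common default but an assumption on our part; we don't know how Omatic computes its figure, and the counts in the example are hypothetical.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical result: 5.0% vs. 5.5% with 30,000 visitors per variation.
p_best = prob_b_beats_a(1_500, 30_000, 1_650, 30_000)
print(f"P(B beats A) is about {p_best:.3f}")
```

A high probability of being best still says nothing about the size of the win, so read it alongside the confidence interval and your practical significance threshold.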

Omatic's recommendation engine gives a decision: "Launch variation B" or "Inconclusive, consider extending." It's not a substitute for judgment, but it helps avoid common mistakes.

Tools, Setup, and Environment Realities

Omatic is designed to integrate with your existing analytics and testing platforms. Here's what you need to set up.

Integration Options

Omatic offers:

  • A JavaScript snippet for websites (works with Google Optimize, Optimizely, VWO, or custom tests)
  • A REST API for server-side experiments
  • Pre-built connectors for Google Analytics 4 and Mixpanel

You don't need to change your testing tool. Omatic sits on top and monitors sample size, duration, and peeking.

Setup Steps

  1. Create an account on Omatic.top.
  2. Add your website or app as a project.
  3. Install the snippet or configure the API.
  4. Define your metrics (conversion events) in Omatic.
  5. Start a test: enter your parameters, and Omatic will track progress.

The setup takes about 30 minutes for a basic site. For complex setups with multiple metrics or segments, allow a few hours.

Limitations to Be Aware Of

Omatic is not a replacement for a proper statistical background. It automates calculations, but you still need to understand the assumptions. For example:

  • It assumes independent observations. If users interact repeatedly (e.g., membership site), you need to account for correlation.
  • It uses frequentist statistics by default. If you prefer Bayesian, you can switch, but the sample size calculation changes.
  • It works best with binary outcomes (conversion yes/no). For continuous metrics (revenue per user), you need to estimate variance separately.

If your test involves multiple variants, Omatic automatically adjusts for multiple comparisons using the Bonferroni or Benjamini-Hochberg correction. But you must enter the number of variants correctly.

Environment Considerations

Traffic patterns matter. If your site has weekly cycles (e.g., more visitors on weekends), Omatic's duration estimator uses historical data to predict. But if you run a test during a promotion or a holiday, the baseline may shift. In those cases, you should either run the test outside the event or account for it in the analysis by segmenting.

Also, be careful with novelty effects: users might react differently to a change simply because it's new. Omatic doesn't automatically handle this, but you can extend the test to see if the effect persists. A good rule is to run the test for at least two full weeks so that it covers complete weekly business cycles.

Variations for Different Constraints

Not every test can follow the ideal workflow. Here are common variations and how to adapt.

Low Traffic Scenario

If you get fewer than 1,000 visitors per day, you'll struggle to detect small effects. Options:

  • Increase MDE: Only test changes that you expect to have a large impact (e.g., redesign a checkout page, not change a button color).
  • Use Bayesian methods: Omatic's Bayesian mode uses prior information to reduce sample size requirements. This is controversial but can be useful when you have strong prior evidence.
  • Run longer tests: Accept that a test may take 60 days. Use Omatic's duration alert to know when it's done.
  • Consider bandit algorithms: Instead of A/B testing, use a multi-armed bandit that allocates more traffic to winners over time. Omatic supports a bandit mode for low-traffic situations.

Multiple Variants (A/B/n)

Testing 5 variations instead of 2 increases the required sample size because you need to compare each pair. The Bonferroni correction divides α by the number of comparisons. For 5 variants, that's 10 comparisons, so α per comparison = 0.005. This inflates sample size significantly. Omatic's calculator handles this automatically if you enter the number of variants.
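The Bonferroni arithmetic above can be checked with a short sketch. The helper names are ours, and the inflation estimate assumes the same normal-approximation formula used for two-proportion tests with 80% power. The second call shows the common alternative of comparing each variant only against the control, which needs fewer comparisons.

```python
from math import comb
from statistics import NormalDist

def bonferroni(n_variants, alpha=0.05, all_pairs=True):
    """Number of comparisons and the Bonferroni-corrected alpha for each."""
    comparisons = comb(n_variants, 2) if all_pairs else n_variants - 1
    return comparisons, alpha / comparisons

def inflation_factor(alpha_corrected, alpha=0.05, power=0.80):
    """Approximate growth in per-variation sample size from a stricter alpha."""
    z = NormalDist().inv_cdf
    corrected = z(1 - alpha_corrected / 2) + z(power)
    uncorrected = z(1 - alpha / 2) + z(power)
    return (corrected / uncorrected) ** 2

comparisons, alpha_each = bonferroni(5)
print(f"all pairs: {comparisons} comparisons, alpha = {alpha_each:.4f} each")
print(f"sample size per variation grows about {inflation_factor(alpha_each):.2f}x")

comparisons, alpha_each = bonferroni(5, all_pairs=False)
print(f"vs. control only: {comparisons} comparisons, alpha = {alpha_each:.4f} each")
```

Remember that this growth is per variation, and you now have five variations to fill instead of two.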

Alternative: use a sequential testing approach that stops individual arms once they're clearly inferior. This can save sample size, but it's more complex.

Continuous Metrics (Revenue, Time on Site)

For non-binary outcomes, you need an estimate of the standard deviation. Omatic asks for this in the calculator. If you don't know, use a pilot study or historical data. The sample size for continuous metrics is often larger than for binary because of higher variance.

One common mistake: using average revenue per user without considering that a few high spenders can skew results. Consider using robust metrics like median or winsorized mean.
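Here is a minimal sketch of how a single big spender distorts the mean, and how winsorizing limits the damage. The revenue figures are made up, and capping only the upper tail at the 95th percentile is one reasonable choice among several.

```python
from statistics import mean, median

def winsorized_mean(values, upper_pct=95):
    """Mean after capping values above the given percentile (upper tail only)."""
    ordered = sorted(values)
    cap_index = min(len(ordered) - 1, upper_pct * len(ordered) // 100)
    cap = ordered[cap_index]
    return mean(min(v, cap) for v in values)

# Hypothetical revenue per user: most spend nothing, one spends 2,000.
revenue = [0] * 90 + [20] * 9 + [2000]

print(f"mean: {mean(revenue)}")                        # 21.8
print(f"median: {median(revenue)}")
print(f"winsorized mean: {winsorized_mean(revenue)}")  # 2
```

One user out of 100 moves the plain mean by a factor of ten here. If that user lands in one variation by chance, the test result reflects them, not your change.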

Mobile vs. Desktop

If you test across devices, you should either segment or use a pooled analysis with device as a covariate. Omatic allows you to define segments and calculate sample size per segment. If you want to detect effects separately, multiply the required sample by the number of segments.

Pitfalls, Debugging, and What to Check When It Fails

Even with good planning, tests can go wrong. Here are common pitfalls and how to diagnose them.

Pitfall 1: Sample Size Not Reached

You planned for 10,000 visitors per variation, but after 30 days you only have 8,000. This happens when traffic drops unexpectedly. Omatic sends an alert if the estimated completion date extends beyond your original deadline. At that point, you have three options:

  • Continue until you reach the target (may take longer).
  • Stop and call it inconclusive (better than acting on noisy data).
  • Reduce the required sample by increasing MDE or lowering power, but you must decide before looking at results.

Never stop early because the result looks significant—that's peeking.

Pitfall 2: Baseline Shift Mid-Test

If you run a promotion or a bug fix during the test, the baseline changes. This invalidates the sample size calculation. Omatic's change detection monitors the control group for unusual shifts. If it detects a change, it flags the test as potentially compromised.

Solution: segment the test by the period before and after the event, and analyze separately. Or restart the test after the disruption.

Pitfall 3: Multiple Endpoints

You test one change but track 20 metrics. With 20 independent metrics at α = 0.05, there is about a 64% chance that at least one of them shows significance purely by chance. This is the multiple comparisons problem in another form. Omatic's analysis defaults to the primary metric you specified. If you want to explore secondary metrics, use the false discovery rate correction to avoid cherry-picking.
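Both ideas fit in a few lines. The sketch below computes the chance of at least one false positive across independent metrics, then applies the Benjamini-Hochberg procedure to a set of hypothetical p-values. The function names and p-values are ours, and the independence assumption is a simplification, since real metrics are usually correlated.

```python
def familywise_error_rate(n_metrics, alpha=0.05):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** n_metrics

def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of p-values declared significant at the given FDR."""
    m = len(p_values)
    ranked = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff = 0
    for rank, idx in enumerate(ranked, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff = rank
    # Everything ranked at or below the cutoff is declared significant.
    return sorted(ranked[:cutoff])

print(f"20 metrics: {familywise_error_rate(20):.0%} chance of a false positive")

# Hypothetical p-values for six secondary metrics.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.60]
print(benjamini_hochberg(p_values))  # [0, 1]
```

Four of the six p-values fall below 0.05, but only the first two survive the correction.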

Pitfall 4: Simpson's Paradox

Your overall result is positive, but every segment shows a negative effect. This happens when the control group has a different composition. For example, if the control had more mobile users who convert at lower rates, the overall lift might be due to the treatment getting more desktop users. Omatic's segmentation analysis checks for this automatically.

If you see a reversal, segment by the confounding variable and re-analyze. Never trust an overall result without checking segments.
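A worked example with made-up counts shows how the reversal arises. In this sketch the treatment loses on both desktop and mobile, yet wins overall, purely because it received far more high-converting desktop traffic.

```python
# Hypothetical counts: (visitors, conversions) per arm and device.
data = {
    "control":   {"desktop": (200, 20), "mobile": (800, 40)},
    "treatment": {"desktop": (800, 72), "mobile": (200, 8)},
}

def rate(visitors, conversions):
    """Conversion rate for one cell."""
    return conversions / visitors

def overall_rate(arm):
    """Conversion rate for an arm with all devices pooled together."""
    visitors = sum(v for v, _ in data[arm].values())
    conversions = sum(c for _, c in data[arm].values())
    return conversions / visitors

for device in ("desktop", "mobile"):
    c = rate(*data["control"][device])
    t = rate(*data["treatment"][device])
    print(f"{device}: control {c:.1%}, treatment {t:.1%}")

print(f"overall: control {overall_rate('control'):.1%}, "
      f"treatment {overall_rate('treatment'):.1%}")
# desktop: control 10.0%, treatment 9.0%
# mobile: control 5.0%, treatment 4.0%
# overall: control 6.0%, treatment 8.0%
```

With proper randomization the device mix should be nearly identical in both arms, so a split this lopsided usually points to a problem with traffic assignment as well.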

Pitfall 5: Assuming Normal Distribution

Many sample size formulas assume the sample mean of the metric is approximately normally distributed. For binary conversions, that's fine with large samples. For revenue, the underlying data is often heavily skewed, so the approximation needs more data to hold. Omatic's calculator relies on the central limit theorem, which works for large samples, but if your sample is small, you might need non-parametric methods or bootstrapping.

Check the distribution of your metric before the test. If it's highly skewed, consider using a log transformation or a different test type.
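For skewed metrics, a percentile bootstrap is one way to get a confidence interval without leaning on a normality assumption. The sketch below uses simulated log-normal revenue as a stand-in for real data. The function name, seeds, and distribution parameters are all illustrative assumptions.

```python
import random

def bootstrap_diff_ci(control, treatment, n_boot=2_000, seed=1):
    """95% percentile bootstrap CI for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each arm with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    # Take the 2.5th and 97.5th percentiles of the resampled differences.
    return diffs[n_boot * 25 // 1000], diffs[n_boot * 975 // 1000 - 1]

# Simulated, heavily skewed revenue per user (not real data).
data_rng = random.Random(7)
control = [data_rng.lognormvariate(3.0, 1.0) for _ in range(300)]
treatment = [data_rng.lognormvariate(3.1, 1.0) for _ in range(300)]

low, high = bootstrap_diff_ci(control, treatment)
print(f"95% bootstrap CI for the revenue difference: ({low:.2f}, {high:.2f})")
```

If the interval is wide or straddles zero, that is a sign the sample is too small for this metric, whatever the point estimate says.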

What to Do When a Test Fails

If your test ends inconclusive, don't just shrug. Ask:

  • Was the effect size realistic? Maybe the change truly had no impact.
  • Was the test long enough? You might need to run it again with more power.
  • Was the metric appropriate? Maybe you measured the wrong thing.

Document the failure and learn from it. Omatic's test history stores all parameters and results, so you can review patterns over time. Teams that analyze their failures often improve faster than those who only celebrate wins.

Finally, remember that sample size is just one piece of the puzzle. Even with a perfect sample size, you can still get false positives: at α = 0.05, about 5% of tests with no true effect will look significant by design. Use Omatic's diagnostic tools to check for other issues, and always validate important results with a follow-up test or a holdout group.