Introduction: The Confidence Trap in A/B Testing
You've just wrapped up an A/B test. The p-value is below 0.05, the confidence interval for the conversion rate lift is entirely positive, and your team is ready to roll out the winning variant. But what if that interval is lying? It sounds alarming, but it's a well-documented problem: standard confidence intervals computed for A/B tests often misrepresent the true uncertainty, leading to decisions that don't replicate. The culprit isn't malice—it's a mismatch between the assumptions baked into classical statistics and the messy reality of running experiments online. This guide, last reviewed in May 2026, explains why your confidence intervals may be overconfident and how Omatic's approach fixes the math, giving you intervals you can actually trust.
When you calculate a 95% confidence interval using a standard t-test or z-test formula, you're making several strong assumptions: that your data is normally distributed, that you're only testing one hypothesis, that you haven't peeked at the results, and that your sample size was fixed in advance. In practice, almost no A/B test meets these conditions. You check results daily, run multiple variants, and often stop tests early once the p-value hits significance. Each of these actions distorts the confidence interval, making it narrower than it should be. The consequence? You declare winners that are actually false positives, wasting development resources and making product decisions on shaky ground. Understanding this gap is the first step toward fixing it.
In the following sections, we'll dissect the specific mathematical failures that plague traditional confidence intervals, then show how Omatic's inference engine—built on Bayesian statistics and sequential testing—produces honest, interpretable intervals. Whether you're a product manager, a data scientist, or a marketer running your own tests, this guide will equip you to spot the lies and demand better math. Let's start by looking at the most common assumption that gets broken: the normality of your data.
The Normality Myth: Why Your Data Isn't Gaussian
One of the first things you learn in statistics class is the Central Limit Theorem: with a large enough sample, the sampling distribution of the mean is approximately normal. That theorem is what gives us confidence in z-tests and t-tests for A/B testing. But there's a catch—the theorem says the distribution of the sample mean approaches normality, not that your raw data is normal. In practice, many metrics we test—conversion rates, click-through rates, revenue per user—are far from normal. They're binary (0 or 1), heavily skewed, or zero-inflated. When you compute a confidence interval for the difference in conversion rates using a normal approximation, you're fitting a symmetric bell curve to a distribution that is fundamentally asymmetric. This mismatch can cause the interval to be too narrow on one side and too wide on the other, misleading you about the true effect size.
A Concrete Example with Binary Data
Imagine you're testing a new checkout button. Your control converts at 5%, your variant at 6%. With 10,000 visitors per arm, a standard Wald confidence interval for the difference comes out around [0.4%, 1.6%]. At this scale the approximation holds up, but the Wald interval is known to perform poorly for proportions near 0 or 1 and for modest sample sizes: its coverage probability, the chance that the interval actually contains the true difference, can fall well below 95%. A more accurate method, such as the Agresti-Caffo adjustment (the two-sample analogue of the Agresti-Coull interval), yields wider, more honest intervals exactly where the Wald formula is at its worst. The problem is that most A/B testing tools default to the simpler formula because it's faster to compute, even though it's less accurate. Over many tests, this bias accumulates, producing a higher false positive rate than the advertised 5%.
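If you want to see the gap yourself, here's a quick sketch in Python. The Agresti-Caffo adjustment (add one success and one failure to each arm) is the standard two-sample analogue of Agresti-Coull; the function names are ours, and the counts mirror the example above:

```python
import numpy as np
from scipy import stats

def wald_diff_ci(x1, n1, x2, n2, level=0.95):
    """Wald (normal-approximation) CI for the difference p2 - p1."""
    p1, p2 = x1 / n1, x2 / n2
    z = stats.norm.ppf(1 - (1 - level) / 2)
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p2 - p1) - z * se, (p2 - p1) + z * se

def agresti_caffo_diff_ci(x1, n1, x2, n2, level=0.95):
    """Same formula after adding one success and one failure per arm,
    which repairs the Wald interval's poor small-sample coverage."""
    return wald_diff_ci(x1 + 1, n1 + 2, x2 + 1, n2 + 2, level)

# Large samples: the two intervals nearly coincide
print(wald_diff_ci(500, 10_000, 600, 10_000))
print(agresti_caffo_diff_ci(500, 10_000, 600, 10_000))

# Same rates at 500 visitors per arm: the adjustment widens the interval
print(wald_diff_ci(25, 500, 30, 500))
print(agresti_caffo_diff_ci(25, 500, 30, 500))
```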
How Omatic Handles Non-Normal Data
Omatic addresses this by using a Bayesian beta-binomial model for binary outcomes. Instead of assuming normality, it models the conversion rate as a random variable with a prior distribution, updated by the observed data. The resulting posterior distribution is exact for binomial data, with no normal approximation anywhere in the pipeline. The interval it reports (technically a credible interval rather than a confidence interval) is read directly off the posterior, so the stated probability is exact under the model and prior; note this is a statement about the posterior, not a frequentist coverage guarantee. For continuous metrics like revenue, Omatic uses a robust Bayesian t-test that models the data as coming from a t-distribution, whose heavier tails make it far less sensitive to outliers than the normal. This means the intervals are wider when data is noisy, giving you a more faithful representation of uncertainty.
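Omatic's internal code isn't shown here, but the conjugate update it relies on fits in a few lines with scipy; this sketch assumes the uniform Beta(1,1) prior and a binary metric:

```python
from scipy import stats

def beta_binomial_credible_interval(successes, n, prior_a=1.0,
                                    prior_b=1.0, level=0.95):
    """Exact posterior credible interval for a conversion rate.

    With a Beta(a, b) prior and x successes in n trials, the posterior
    is Beta(a + x, b + n - x); no normal approximation is involved.
    """
    posterior = stats.beta(prior_a + successes, prior_b + n - successes)
    tail = (1 - level) / 2
    return tuple(posterior.ppf([tail, 1 - tail]))

# 600 conversions from 10,000 visitors under a uniform prior
print(beta_binomial_credible_interval(600, 10_000))
```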
If you're still using normal-based intervals for binary metrics, you're almost certainly overstating your precision. Switching to an exact method like the beta-binomial model, as Omatic does, eliminates this source of error. But even with correct distributions, there's another killer assumption that often gets violated: multiple testing.
The Multiple Testing Trap: Inflating Your Confidence
When you run an A/B test, you might think you're testing one hypothesis: does the new button convert better? But if you're measuring conversion rate, click-through rate, bounce rate, and revenue per visitor—and checking them all for significance—you're actually running multiple simultaneous tests. The more metrics you evaluate, the higher your chance of finding a false positive among them. This is the multiple comparisons problem, and it's pervasive in A/B testing. Standard confidence intervals for each metric are computed independently, without adjusting for the fact that you're looking at several intervals at once. As a result, the familywise error rate—the probability of at least one false positive across all metrics—can be much higher than 5%. For example, if you test 10 independent metrics, each at the 95% confidence level, the chance of finding at least one significant result by chance alone is about 40%.
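That "about 40%" figure is one line of arithmetic:

```python
# Familywise error rate for m independent tests, each at alpha = 0.05:
# P(at least one false positive) = 1 - (1 - alpha)**m
for m in (1, 3, 5, 10, 20):
    print(m, round(1 - 0.95 ** m, 3))
# m=10 prints 0.401, the roughly 40% quoted above
```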
How Practitioners Fall into the Trap
I've seen teams proudly announce a 3% lift in an obscure secondary metric, only to find it disappears on replication. The problem is that they didn't correct for multiplicity. A common strategy is to pick one primary metric and treat others as exploratory, but in practice, it's tempting to call a winner based on any significant metric. Some tools attempt to mitigate this with the Bonferroni correction, which divides the alpha level by the number of tests. But Bonferroni is overly conservative when metrics are correlated, as they often are. For instance, conversion rate and revenue per visitor are positively correlated; Bonferroni would sacrifice too much power. Newer approaches such as false discovery rate (FDR) control are better suited, but they require careful implementation and are rarely built into A/B testing platforms.
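If you're working with per-metric p-values, statsmodels implements both corrections; the p-values below are invented for illustration:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values for ten metrics from a single experiment
pvals = [0.001, 0.004, 0.008, 0.012, 0.030, 0.200, 0.330, 0.480, 0.600, 0.910]

for method in ("bonferroni", "fdr_bh"):
    reject, adjusted, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, "->", int(reject.sum()), "metrics survive correction")
# Bonferroni keeps 2 of the small p-values; Benjamini-Hochberg keeps 4
```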
Omatic's Approach: Bayesian Hierarchical Modeling
Omatic sidesteps the multiple testing problem by using a Bayesian hierarchical model that jointly analyzes all metrics. Instead of computing separate intervals for each metric, it models the metrics as correlated through a multivariate prior. This naturally accounts for multiplicity: as you add more metrics, the model becomes more conservative because it "knows" that extreme values are more likely due to chance. The credible intervals produced by the hierarchical model automatically adjust for the number of tests, giving you honest coverage across all metrics simultaneously. For example, if you have 10 metrics, Omatic's intervals will be wider than if you only had one, reflecting the increased uncertainty from multiple comparisons. This approach avoids both the over-conservatism of Bonferroni and the over-confidence of unadjusted intervals.
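Omatic's multivariate model isn't public, but the shrinkage mechanic at its heart can be sketched with a simple empirical-Bayes estimator; the lifts and standard errors below are hypothetical:

```python
import numpy as np

def shrink_toward_grand_mean(estimates, standard_errors):
    """Empirical-Bayes shrinkage: noisy per-metric lifts are pulled toward
    the mean across metrics, damping the extremes that drive false wins."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(standard_errors, dtype=float) ** 2
    grand_mean = estimates.mean()
    # Between-metric variance via method of moments, floored at zero
    tau2 = max(estimates.var(ddof=1) - variances.mean(), 0.0)
    weights = tau2 / (tau2 + variances)  # 0 = trust the pool, 1 = trust the data
    return grand_mean + weights * (estimates - grand_mean)

# Three observed lifts with equal noise: the outliers get pulled inward
print(shrink_toward_grand_mean([0.031, 0.004, -0.012], [0.01, 0.01, 0.01]))
```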
If you're comparing multiple variants against a single control, the problem compounds. Each pairwise comparison adds another test. Omatic handles this by using a Bayesian ANOVA-like model that shrinks estimates toward a global mean, reducing the chance of declaring a spurious winner. The result is a set of credible intervals that are honest about the uncertainty introduced by multiple comparisons. But even if you fix normality and multiplicity, there's another practice that quietly undermines your confidence intervals: peeking.
The Peeking Problem: Why Early Stopping Lies
It's incredibly tempting to check your A/B test results before the planned sample size is reached. You see a significant p-value after a few days and want to stop early to capture the win. But peeking at your data and stopping based on significance invalidates the standard confidence interval formula. The traditional interval assumes a fixed sample size; if you stop because the data looks promising, you're biasing the estimate upward. This is known as optional stopping, and it can inflate the false positive rate dramatically. Simulations show that if you check after every 100 visitors and stop as soon as p < 0.05, the false positive rate climbs to several times the nominal 5%; with unlimited peeking, a false positive under the null eventually becomes all but guaranteed.
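Here is a minimal A/A simulation of that effect, assuming a 5% baseline conversion rate in both arms and a peek every 100 visitors per arm; since there is no true difference, every declared winner is a false positive:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_max=10_000, peek_every=100,
                                n_sims=2_000, alpha=0.05):
    """A/A test: both arms convert at 5%, so any 'win' is a fluke."""
    false_positives = 0
    for _ in range(n_sims):
        ca = rng.binomial(1, 0.05, n_max).cumsum()  # cumulative conversions, arm A
        cb = rng.binomial(1, 0.05, n_max).cumsum()  # cumulative conversions, arm B
        for n in range(peek_every, n_max + 1, peek_every):
            pooled = (ca[n - 1] + cb[n - 1]) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se == 0:
                continue  # no conversions yet; nothing to test
            z = (cb[n - 1] - ca[n - 1]) / (n * se)
            if 2 * stats.norm.sf(abs(z)) < alpha:
                false_positives += 1  # stopped early on noise
                break
    return false_positives / n_sims

print(peeking_false_positive_rate())  # lands well above the nominal 0.05
```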
A Walkthrough of Peeking's Impact
Suppose you plan a test with 10,000 visitors per arm. After 2,000 visitors, the p-value dips to 0.03. You stop, declare victory, and compute a confidence interval for the lift. That interval might be [1%, 4%]. But if you had continued to the full sample, the true difference might have been 0.5% with a wide interval that includes zero. The early stop selected a favorable moment, making the effect look larger than it actually is. This is why many A/B testing platforms now warn against peeking and recommend using sequential testing methods that allow valid inference at multiple looks. Unfortunately, most tools still default to fixed-horizon confidence intervals, which become invalid the moment you peek.
How Omatic Enables Valid Peeking: Sequential Bayesian Testing
Omatic was designed to solve this problem from the ground up. It uses sequential Bayesian testing, where the posterior distribution is updated continuously as new data arrives. Instead of a fixed sample size, you define a stopping rule based on a decision threshold, like the probability that the variant is better than control exceeding 95%. Because the Bayesian posterior is always valid, you can check it at any time without inflating error rates. The credible interval at any point is an honest representation of uncertainty given the data seen so far. Importantly, Omatic's algorithm accounts for the fact that you might stop early, so the interval doesn't become overly optimistic. It does this by using a prior that is slightly more conservative, and by emphasizing that the interval is conditional on the observed data, not on a hypothetical repeated sampling framework. This means you can peek daily, weekly, or even hourly, and the intervals remain trustworthy.
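The stopping rule described above is easy to sketch. The helper below is hypothetical (not Omatic's API) and assumes Beta(1,1) priors on both arms; it estimates the probability that the variant beats the control by sampling from the two posteriors:

```python
import numpy as np

rng = np.random.default_rng(42)

def prob_variant_beats_control(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1,1) priors."""
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)  # posterior, control
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)  # posterior, variant
    return (post_b > post_a).mean()

# Interim check: the rule above says stop and ship once this exceeds 0.95
print(prob_variant_beats_control(500, 10_000, 560, 10_000))
```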
If your team wants to stop tests early to ship features faster, you need a method that doesn't break when you peek. Omatic's sequential Bayesian approach is the solution. It gives you the flexibility to check results without lying to yourself. Next, we'll look at another hidden assumption: the fixed sample size.
Sample Size Assumptions: Why Fixed Horizons Fail
Traditional confidence intervals are built on the premise that you decide the sample size in advance and collect data until you reach it. But in practice, sample sizes often change mid-experiment. You might extend the test because traffic is lower than expected, or you might stop early due to budget constraints. Each of these changes invalidates the fixed-sample assumption. The confidence interval formula no longer holds, and the coverage probability becomes unknown. In extreme cases, if you repeatedly extend the test until significance appears, you're essentially conducting a biased search that will eventually find a false positive. This is related to the peeking problem but with a different mechanism: instead of looking at the data frequently, you're adjusting the total sample based on interim results.
The Danger of Data-Dependent Stopping
Imagine you plan a test with 5,000 visitors per arm, but after 5,000 visitors, the result is not significant. You decide to extend to 10,000. You're now effectively running a new test conditional on the first result, but the confidence interval you compute at the end treats it as one fixed-sample test. This can lead to overconfidence. In fact, if you extend only when the result is not yet significant, you're creating a selection bias: you're more likely to stop early when the effect is large, and extend when it's small. The final analysis then overestimates the effect size. The confidence interval doesn't account for this adaptive process, so it's narrower than it should be. This is a subtle but common mistake, especially in organizations where tests are managed by hand.
Omatic's Adaptive Sample Size Handling
Omatic's Bayesian framework naturally accommodates data-dependent stopping. Because the posterior distribution is valid for any sample size, you can update it after any number of observations without needing to correct for the stopping rule. The credible interval remains correct as long as the stopping rule is based on the posterior itself (e.g., stop when the probability of a positive lift exceeds 95%). Omatic also provides functionality to set a maximum sample size as a safety net, but doesn't require you to commit to it in advance. This means you can let the data guide you without distorting your inference. The system uses a sequential design that maintains error control; specifically, it approximates a Bayesian group sequential design where the posterior is evaluated at predefined checkpoints, and the decision boundaries are calibrated to maintain a desired false positive rate. This gives you the best of both worlds: flexibility and reliability.
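A toy version of that checkpoint design, using a Beta(1,1) prior and an illustrative 0.99 decision boundary in place of Omatic's calibrated one:

```python
import numpy as np

rng = np.random.default_rng(7)

def run_with_checkpoints(rate_a, rate_b,
                         checkpoints=(2_000, 4_000, 6_000, 8_000, 10_000),
                         threshold=0.99, draws=50_000):
    """Evaluate the posterior only at predefined checkpoints; stop the
    test the first time P(variant beats control) crosses the boundary."""
    a = rng.binomial(1, rate_a, checkpoints[-1])
    b = rng.binomial(1, rate_b, checkpoints[-1])
    for n in checkpoints:
        post_a = rng.beta(1 + a[:n].sum(), 1 + n - a[:n].sum(), draws)
        post_b = rng.beta(1 + b[:n].sum(), 1 + n - b[:n].sum(), draws)
        p_better = (post_b > post_a).mean()
        if p_better > threshold:
            return n, p_better            # stopped early at this checkpoint
    return checkpoints[-1], p_better      # ran to the maximum sample size

print(run_with_checkpoints(0.05, 0.06))
```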
If your testing process involves any form of sample size adjustment—and most do—you need a method that doesn't rely on a fixed horizon. Omatic's adaptive approach ensures your confidence intervals are honest regardless of how the sample size evolves. Now, let's turn to a common data quality issue that can completely invalidate your intervals: sample ratio mismatch.
Sample Ratio Mismatch: A Silent Confidence Killer
You set up your A/B test expecting a 50/50 split between control and variant. But due to caching, bot traffic, or implementation bugs, the actual split is 48/52. This is called sample ratio mismatch (SRM), and it's a red flag that your randomization is broken. When SRM occurs, any confidence interval computed from the data is suspect. The problem is not just that the sample sizes are unequal—the standard t-test can handle unequal sizes. The deeper issue is that the mismatch indicates systematic bias: some visitors were more likely to be assigned to one variant, and those visitors may have different characteristics. For example, if mobile users are disproportionately assigned to the variant due to a cookie bug, then any observed difference could be due to user demographics rather than the treatment effect. Your confidence interval might look clean, but it's lying because the underlying data is contaminated.
Detecting SRM and Its Consequences
SRM is usually detected with a chi-squared test comparing observed and expected assignment counts. Many testing platforms flag this automatically, but not all do. If you ignore SRM and proceed with standard analysis, your confidence intervals will have incorrect coverage. In one documented case, a team found a 5% lift in conversion with a tight confidence interval, but the assignment ratio was 45/55. Upon investigation, they discovered that the variant loaded faster on Chrome, causing a disproportionate number of Chrome users to see it. Since Chrome users had higher conversion rates overall, the lift was an artifact. The confidence interval, computed assuming random assignment, was far too optimistic. The lesson: always check for SRM before trusting your intervals.
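The check itself is cheap. Here's a sketch using scipy's chi-squared test, with counts matching the 45/55 anecdote above (check_srm is a hypothetical helper, not a platform API):

```python
from scipy import stats

def check_srm(n_control, n_variant, expected_split=(0.5, 0.5)):
    """Chi-squared test of observed assignment counts against the planned split."""
    total = n_control + n_variant
    expected = [total * expected_split[0], total * expected_split[1]]
    _, p_value = stats.chisquare([n_control, n_variant], f_exp=expected)
    return p_value

# A 45/55 split over 20,000 visitors
print(check_srm(9_000, 11_000))  # far below the usual 0.001 SRM threshold
```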
How Omatic Flags and Handles SRM
Omatic automatically performs a chi-squared test for SRM on every experiment. If the test indicates a significant mismatch (typically p < 0.001), it flags the experiment and warns you not to act on the resulting intervals until the assignment mechanism has been investigated, since no statistical adjustment can fully repair a broken randomization.
SRM is one of the most common and overlooked sources of error in A/B testing. By catching it early, Omatic prevents you from making decisions based on flawed intervals. Next, we'll compare the frequentist approach used by most tools with Omatic's Bayesian alternative.
Frequentist vs. Bayesian: The Core Math Difference
The foundation of most A/B testing tools is frequentist statistics: null hypothesis significance testing, p-values, and confidence intervals that are interpreted in terms of long-run frequency. If you repeat the experiment many times, 95% of the confidence intervals will contain the true effect. This interpretation is subtle and often misunderstood. The interval itself, for a single experiment, either contains the true value or it doesn't—you don't know which. In contrast, Bayesian statistics treats the effect as a random variable and produces a credible interval: given the observed data, there is a 95% probability that the true effect lies within this interval. This interpretation is more intuitive and aligns with how people naturally think about uncertainty. But the differences go beyond interpretation—they affect the intervals themselves.
Comparison Table: Frequentist vs. Bayesian Confidence Intervals
| Aspect | Frequentist | Bayesian (Omatic) |
|---|---|---|
| Interpretation | 95% of intervals from repeated experiments contain the true value | 95% probability that true value lies in this interval given the data |
| Prior | None (inference relies on the data alone) | Requires a prior, which can be weakly informative |
| Handling of multiple testing | Requires correction (Bonferroni, FDR) | Handled naturally through hierarchical models |
| Sequential analysis | Requires special methods (e.g., alpha spending) | Valid at any stopping time |
| Small sample performance | Often poor (normal approximation fails) | Good (exact posterior for many models) |
| Computational cost | Low | Higher, but feasible with modern hardware |
As the table shows, Bayesian intervals have several practical advantages for A/B testing, particularly in handling multiple testing and sequential monitoring. However, they require specifying a prior, which introduces subjectivity. Omatic uses weakly informative priors that have minimal influence on the posterior but stabilize estimates when data is sparse. For example, for conversion rates, the default prior is a Beta(1,1) distribution, which is uniform and non-informative. For revenue, it uses a half-Cauchy prior on the standard deviation to prevent overfitting. These choices are documented and can be customized by advanced users.
When Frequentist Intervals Are Acceptable
Frequentist intervals are not always wrong. If your test is a simple two-arm comparison with a single primary metric, large sample sizes, no peeking, and no adjustments, the standard interval is usually fine. But in the messy world of real product development, those conditions rarely hold. Omatic's Bayesian approach is designed for the common case where you have multiple metrics, multiple variants, sequential monitoring, and potential data issues. By defaulting to robust Bayesian methods, it provides intervals that are more reliable in practice, even if they are sometimes wider. The trade-off is slightly more conservative decisions, which is preferable to overconfident ones that lead to failed replications.
If you want to understand the math behind your intervals, ask whether your tool uses a normal approximation or an exact method. Most tools settle for the approximation; Omatic computes exact posteriors, and that makes all the difference. Now, let's walk through a step-by-step guide to setting up a test with Omatic to get honest intervals.
Step-by-Step: Setting Up an Honest A/B Test with Omatic
Getting trustworthy confidence intervals starts with proper experiment design. Omatic provides a guided workflow that enforces best practices. Here's a step-by-step guide to setting up a test that produces intervals you can believe in.
Step 1: Define Your Primary Metric and Prior
Start by selecting the primary metric that matters most to your business. Omatic supports binary, count, and continuous metrics. For binary metrics, you'll choose a prior distribution. Unless you have strong historical data, use the default Beta(1,1) prior. If you have past experiments, you can use the posterior from a previous test as the prior for a new one; this is called Bayesian updating, and it accelerates learning. For example, if your baseline conversion rate is around 5%, you might use a Beta(10, 190) prior, which is equivalent to having already seen 200 visitors. This prior shrinks the estimate toward 5% while data is sparse, damping the wild swings a handful of early conversions would otherwise cause.
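To see how that prior behaves, here's a sketch of the conjugate update with made-up interim data:

```python
from scipy import stats

# Beta(10, 190) prior: centered near 5%, worth about 200 prior "visitors"
prior_a, prior_b = 10, 190

# Sparse early data: 8 conversions from 100 visitors (12% raw rate)
conversions, visitors = 8, 100

posterior = stats.beta(prior_a + conversions,
                       prior_b + visitors - conversions)
print(posterior.mean())               # pulled toward 5%, not the noisy 12%
print(posterior.ppf([0.025, 0.975]))  # interval still honest about sparse data
```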