Analyzing Statistical Significance of A/B Test Results
Statistical significance quantifies how unlikely the observed difference in conversion between variants would be if there were no real effect. It is not proof, but without proper statistical analysis, decisions may be driven by noise in the data rather than by a real signal.
Key Concepts
P-value — the probability of observing an effect at least this large, assuming the null hypothesis (no real difference) is true. A result is conventionally called significant when p < 0.05.
Confidence Level — 1 − alpha. A 95% confidence level means accepting a 5% chance of a false positive.
Statistical Power — the probability of detecting a real effect of a given size (typically set to 80%).
MDE (Minimum Detectable Effect) — the smallest effect the test can reliably detect at a given sample size.
Z-Test for Proportions
```python
import numpy as np
# Note: proportions_ztest lives in statsmodels, not scipy.stats
from statsmodels.stats.proportion import proportions_ztest

def analyze_test(control_n, control_conv, variant_n, variant_conv, alpha=0.05):
    cr_control = control_conv / control_n
    cr_variant = variant_conv / variant_n
    relative_lift = (cr_variant - cr_control) / cr_control * 100

    # Z-test (valid when each group has enough observations —
    # roughly 10+ conversions and 10+ non-conversions per group)
    counts = np.array([variant_conv, control_conv])
    nobs = np.array([variant_n, control_n])
    z_stat, p_value = proportions_ztest(counts, nobs, alternative='two-sided')

    # Confidence interval for the difference (unpooled standard error)
    se = np.sqrt(
        cr_control * (1 - cr_control) / control_n +
        cr_variant * (1 - cr_variant) / variant_n
    )
    diff = cr_variant - cr_control
    z_crit = 1.96  # for a 95% CI
    ci_low = diff - z_crit * se
    ci_high = diff + z_crit * se

    print(f"Control: {cr_control:.3%} ({control_conv}/{control_n})")
    print(f"Variant: {cr_variant:.3%} ({variant_conv}/{variant_n})")
    print(f"Lift: {relative_lift:+.1f}%")
    print(f"95% CI: [{ci_low:.3%}, {ci_high:.3%}]")
    print(f"P-value: {p_value:.4f}")
    print(f"Significant: {'YES ✓' if p_value < alpha else 'NO ✗'}")
    return p_value < alpha

analyze_test(
    control_n=3842, control_conv=115,
    variant_n=3891, variant_conv=148
)
```
Chi-Square Test (Alternative to Z-Test)
```python
import numpy as np
from scipy.stats import chi2_contingency

control_n, control_conv = 3842, 115
variant_n, variant_conv = 3891, 148

contingency = np.array([
    [control_conv, control_n - control_conv],  # Control: converted, not converted
    [variant_conv, variant_n - variant_conv]   # Variant: converted, not converted
])
# correction=False disables Yates' continuity correction so the result
# matches the z-test exactly
chi2, p_value, dof, expected = chi2_contingency(contingency, correction=False)
print(f"Chi2: {chi2:.4f}, p={p_value:.4f}")
```
For two groups (a 2×2 table), the chi-square statistic without continuity correction equals the squared z statistic, so both tests yield the same p-value. Note that `chi2_contingency` applies Yates' continuity correction by default for 2×2 tables, which makes its result slightly more conservative than the z-test.
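The equivalence can be checked numerically — a small sketch using the example's counts (with the continuity correction disabled, as discussed above):

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

control_n, control_conv = 3842, 115
variant_n, variant_conv = 3891, 148

z, p_z = proportions_ztest([variant_conv, control_conv],
                           [variant_n, control_n])
table = [[control_conv, control_n - control_conv],
         [variant_conv, variant_n - variant_conv]]
chi2, p_chi, _, _ = chi2_contingency(table, correction=False)

# z^2 and chi2 coincide, as do the p-values
print(f"z^2 = {z**2:.4f}, chi2 = {chi2:.4f}")
print(f"p (z-test) = {p_z:.4f}, p (chi2) = {p_chi:.4f}")
```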
Interpretation Errors
Peeking — stopping the test as soon as p < 0.05, before the required sample size is reached. With frequent checks this can inflate the Type I error rate from the nominal 5% to roughly 25% or more.
```python
# Wrong: check daily and stop as soon as p < 0.05
# Right: calculate the sample size first, stop only after reaching it
import math
from scipy import stats

def required_sample_size(baseline_cr, mde, alpha=0.05, power=0.8):
    """Sample size per variant; mde is the relative lift (0.15 = +15%)."""
    p1, p2 = baseline_cr, baseline_cr * (1 + mde)
    p_avg = (p1 + p2) / 2
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    n = ((z_a * math.sqrt(2 * p_avg * (1 - p_avg)) +
          z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1)) ** 2
    return math.ceil(n)

n = required_sample_size(baseline_cr=0.03, mde=0.15)
print(f"Run the test until {n} users per variant are reached")
```
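To see why peeking matters, here is a small A/A simulation sketch (parameters are illustrative, and there is no real effect in either arm): stopping at the first significant look pushes the false positive rate well above the nominal 5%.

```python
# A/A simulation: both arms share the same true conversion rate,
# yet "stop at the first p < 0.05" fires far more than 5% of the time
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
p_true, batch, n_batches, sims = 0.03, 500, 20, 2000
false_positives = 0
for _ in range(sims):
    a = rng.binomial(batch, p_true, n_batches).cumsum()  # cumulative conversions
    b = rng.binomial(batch, p_true, n_batches).cumsum()
    n = batch * np.arange(1, n_batches + 1)              # cumulative sample size
    pooled = (a + b) / (2 * n)
    with np.errstate(divide='ignore', invalid='ignore'):
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        z = np.where(se > 0, (b / n - a / n) / se, 0.0)
    p = 2 * norm.sf(np.abs(z))
    if (p < 0.05).any():          # stop at the first "significant" peek
        false_positives += 1
print(f"False positive rate with peeking: {false_positives / sims:.1%}")
```

With 20 looks the observed rate lands far above 5%, which is exactly the inflation the fixed-sample-size discipline prevents.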
Multiple comparisons — testing many variants and picking the best one without correction inflates false positives:
```python
# Bonferroni correction for multiple comparisons
n_comparisons = 4  # 4 variants vs control
corrected_alpha = 0.05 / n_comparisons  # = 0.0125

# Or FDR (Benjamini-Hochberg), which is less conservative
from statsmodels.stats.multitest import multipletests

p_values = [0.03, 0.07, 0.01, 0.04]
reject, corrected_p, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print(reject)  # which hypotheses survive the correction
```
Bayesian A/B Analysis
An alternative to the frequentist approach — instead of a p-value, it gives the probability that the variant is better:
```python
import numpy as np

def bayesian_ab_test(control_conv, control_n, variant_conv, variant_n,
                     samples=100_000):
    """Posterior distributions via the Beta-Binomial conjugate model."""
    # Prior: Beta(1, 1) = uniform distribution
    control_posterior = np.random.beta(
        control_conv + 1,
        control_n - control_conv + 1,
        samples
    )
    variant_posterior = np.random.beta(
        variant_conv + 1,
        variant_n - variant_conv + 1,
        samples
    )
    prob_variant_better = (variant_posterior > control_posterior).mean()
    diff = variant_posterior - control_posterior
    expected_lift = diff.mean() / control_posterior.mean() * 100
    print(f"Probability variant is better: {prob_variant_better:.1%}")
    print(f"Expected lift: {expected_lift:+.1f}%")
    print(f"95% credible interval: [{np.percentile(diff, 2.5):.3%}, "
          f"{np.percentile(diff, 97.5):.3%}]")

bayesian_ab_test(115, 3842, 148, 3891)
```
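A common companion metric in Bayesian testing is the expected loss (risk) of shipping the variant: how much conversion you give up on average if the variant turns out to be worse. A sketch under the same Beta(1, 1) prior and example counts (the stopping-threshold idea in the closing comment is a common convention, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 100_000
control = rng.beta(115 + 1, 3842 - 115 + 1, samples)
variant = rng.beta(148 + 1, 3891 - 148 + 1, samples)

# Expected loss of shipping the variant: the average shortfall in the
# posterior scenarios where control is actually better
loss_variant = np.maximum(control - variant, 0).mean()
print(f"Expected loss of shipping variant: {loss_variant:.4%}")
# A common stopping rule: ship once the expected loss falls below a
# small "threshold of caring", e.g. 0.01 percentage points
```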
Decision Guide
| Situation | Decision |
|---|---|
| p < 0.05, lift > 0 | Roll out variant |
| p > 0.05, low traffic | Continue test |
| p > 0.05, reached sample size | No significant effect, close test |
| p < 0.05, negative lift | Keep control |
| One segment significant, other not | Analyze interactions, segmented rollout |
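The first four rows of the table can be encoded as a small helper (names and the example inputs are illustrative; the segment case needs human judgment):

```python
# Hypothetical helper encoding the decision table above
def decide(p_value, lift, reached_sample_size, alpha=0.05):
    if p_value < alpha:
        return "roll out variant" if lift > 0 else "keep control"
    if reached_sample_size:
        return "no significant effect, close test"
    return "continue test"

print(decide(p_value=0.03, lift=2.1, reached_sample_size=True))
# → roll out variant
```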
Delivery Time
Setting up a significance-analysis process with automatic sample size calculation and Bayesian/frequentist method selection — 1–2 business days.