Statistical significance analysis of A/B test results

Statistical significance is a formal check that the observed difference in conversion between variants is unlikely to be due to chance alone. Without proper statistical analysis, decisions may be driven by random noise in the data.

Key Concepts

P-value — the probability of observing an effect this large or larger if there is no real difference (i.e. the null hypothesis is true). A result is conventionally considered significant when p < 0.05.

Confidence Level — 1 − alpha. A 95% confidence level means accepting a false positive in 5% of cases.

Statistical Power — the probability of detecting a real effect of a given size (typically set to 80%).

MDE (Minimum Detectable Effect) — the smallest effect the test can reliably detect at a given sample size, significance level, and power.
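
To make alpha concrete, a quick simulation (a sketch; the sample size and 3% conversion rate are illustrative assumptions) of A/A tests — both groups drawn from the same true conversion rate — shows that roughly 5% of them come out "significant" purely by chance:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, runs = 5000, 0.03, 2000  # users per group, true conversion rate, number of A/A tests

false_positives = 0
for _ in range(runs):
    a = rng.binomial(n, p)  # conversions in "control"
    b = rng.binomial(n, p)  # conversions in "variant" (same true rate, so any difference is noise)
    p_pool = (a + b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (a / n - b / n) / se
    if abs(z) > 1.96:  # |z| beyond the two-sided 95% critical value
        false_positives += 1

print(f"False positive rate: {false_positives / runs:.1%}")
```

The observed rate hovers near the chosen alpha of 5%, which is exactly the error rate the significance threshold is designed to cap.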

Z-Test for Proportions

from statsmodels.stats.proportion import proportions_ztest
import numpy as np

def analyze_test(control_n, control_conv, variant_n, variant_conv, alpha=0.05):
    cr_control = control_conv / control_n
    cr_variant = variant_conv / variant_n
    relative_lift = (cr_variant - cr_control) / cr_control * 100

    # Z-test (valid when expected successes and failures in each group are large, e.g. >= 10)
    counts = np.array([variant_conv, control_conv])
    nobs = np.array([variant_n, control_n])
    z_stat, p_value = proportions_ztest(counts, nobs, alternative='two-sided')

    # Confidence interval for difference
    se = np.sqrt(
        cr_control * (1 - cr_control) / control_n +
        cr_variant * (1 - cr_variant) / variant_n
    )
    diff = cr_variant - cr_control
    z_crit = 1.96  # normal quantile for a 95% CI (matches the default alpha=0.05)
    ci_low = diff - z_crit * se
    ci_high = diff + z_crit * se

    print(f"Control: {cr_control:.3%} ({control_conv}/{control_n})")
    print(f"Variant: {cr_variant:.3%} ({variant_conv}/{variant_n})")
    print(f"Lift: {relative_lift:+.1f}%")
    print(f"95% CI: [{ci_low:.3%}, {ci_high:.3%}]")
    print(f"P-value: {p_value:.4f}")
    print(f"Significant: {'YES ✓' if p_value < alpha else 'NO ✗'}")

    return p_value < alpha

analyze_test(
    control_n=3842, control_conv=115,
    variant_n=3891, variant_conv=148
)

Chi-Square Test (Alternative to Z-Test)

from scipy.stats import chi2_contingency
import numpy as np

control_n, control_conv = 3842, 115
variant_n, variant_conv = 3891, 148

contingency = np.array([
    [control_conv, control_n - control_conv],     # Control: converted, did not convert
    [variant_conv, variant_n - variant_conv]      # Variant: converted, did not convert
])

# correction=False disables the Yates continuity correction, so the result matches the z-test
chi2, p_value, dof, expected = chi2_contingency(contingency, correction=False)
print(f"Chi2: {chi2:.4f}, p={p_value:.4f}")

For two groups, the chi-square test (without continuity correction) and the two-sided z-test are equivalent: the chi-square statistic equals z squared, and the p-values coincide.
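
This equivalence is easy to verify numerically with the same example counts (assuming statsmodels is available for the z-test):

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

control_n, control_conv = 3842, 115
variant_n, variant_conv = 3891, 148

z, p_z = proportions_ztest([variant_conv, control_conv], [variant_n, control_n])
table = [[control_conv, control_n - control_conv],
         [variant_conv, variant_n - variant_conv]]
chi2, p_chi, _, _ = chi2_contingency(table, correction=False)

print(f"z^2 = {z**2:.4f}, chi2 = {chi2:.4f}")        # the two statistics match
print(f"p (z) = {p_z:.4f}, p (chi2) = {p_chi:.4f}")  # so do the p-values
```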

Interpretation Errors

Peeking — stopping the test as soon as p < 0.05, before the planned sample size is reached. With repeated daily checks this can inflate the Type I error rate from 5% to around 26% at alpha=0.05.

# Wrong: check daily and stop when p < 0.05
# Right: calculate sample size first, stop only after reaching it

from scipy.stats import norm
import math

def required_sample_size(baseline_cr, mde, alpha=0.05, power=0.8):
    """Sample size per variant for a two-proportion test (mde is relative)."""
    p1, p2 = baseline_cr, baseline_cr * (1 + mde)
    p_avg = (p1 + p2) / 2
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    n = ((z_a * math.sqrt(2 * p_avg * (1 - p_avg)) +
          z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1)) ** 2
    return math.ceil(n)

n = required_sample_size(baseline_cr=0.03, mde=0.15)
print(f"Run test until {n} users per variant reached")

Multiple comparisons — testing several variants against the control and picking the best one without correcting alpha inflates the false positive rate:

# Bonferroni correction for multiple comparisons
n_comparisons = 4  # 4 variants vs control
corrected_alpha = 0.05 / n_comparisons  # = 0.0125

# Or FDR (Benjamini-Hochberg)
from statsmodels.stats.multitest import multipletests
p_values = [0.03, 0.07, 0.01, 0.04]
reject, corrected_p, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
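
Applied to four illustrative raw p-values (the same values as above), the Benjamini-Hochberg procedure leaves only the smallest one significant at alpha=0.05:

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.03, 0.07, 0.01, 0.04]  # raw p-values for 4 variants vs control
reject, corrected_p, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
for p, p_adj, r in zip(p_values, corrected_p, reject):
    print(f"p={p:.2f} -> adjusted p={p_adj:.4f} -> {'significant' if r else 'not significant'}")
```

Note that p = 0.03 and p = 0.04, nominally "significant" on their own, no longer pass after correction.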

Bayesian A/B Analysis

An alternative to the frequentist approach: instead of a p-value, compute the probability that the variant beats the control:

import numpy as np

def bayesian_ab_test(control_conv, control_n, variant_conv, variant_n, samples=100000):
    """Posterior distribution via Beta distribution"""
    # Prior: Beta(1,1) = uniform distribution
    control_posterior = np.random.beta(
        control_conv + 1,
        control_n - control_conv + 1,
        samples
    )
    variant_posterior = np.random.beta(
        variant_conv + 1,
        variant_n - variant_conv + 1,
        samples
    )

    prob_variant_better = (variant_posterior > control_posterior).mean()
    expected_lift = (variant_posterior - control_posterior).mean() / control_posterior.mean() * 100

    print(f"Probability variant is better: {prob_variant_better:.1%}")
    print(f"Expected lift: {expected_lift:+.1f}%")
    print(f"Credible interval: [{np.percentile(variant_posterior - control_posterior, 2.5):.3%}, "
          f"{np.percentile(variant_posterior - control_posterior, 97.5):.3%}]")

bayesian_ab_test(115, 3842, 148, 3891)
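
A common complement to "probability variant is better" is expected loss: the average conversion rate given up if the chosen variant is actually worse. A typical rule (the threshold here is a convention, not from the text) is to ship when the expected loss falls below a small tolerance, e.g. 0.01 percentage points:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 100_000

# Posterior draws with a Beta(1, 1) prior, same counts as the example above
control = rng.beta(115 + 1, 3842 - 115 + 1, samples)
variant = rng.beta(148 + 1, 3891 - 148 + 1, samples)

# Expected loss of each decision, in absolute conversion-rate terms
loss_choose_variant = np.maximum(control - variant, 0).mean()
loss_choose_control = np.maximum(variant - control, 0).mean()

print(f"Expected loss if shipping variant: {loss_choose_variant:.5f}")
print(f"Expected loss if keeping control:  {loss_choose_control:.5f}")
```

With these counts, the expected loss from shipping the variant is tiny compared to the loss from keeping the control, which supports the roll-out decision.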

Decision Guide

Situation                               | Decision
p < 0.05, lift > 0                      | Roll out variant
p > 0.05, low traffic                   | Continue test
p > 0.05, sample size reached           | No significant effect, close test
p < 0.05, negative lift                 | Keep control
One segment significant, the other not | Analyze interactions, consider segmented rollout
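
The guide can be encoded as a small helper (an illustrative sketch, not from the original text; the per-segment case is left out since it needs human judgment):

```python
def decide(p_value, lift, reached_sample_size, alpha=0.05):
    """Map test status to a decision, following the guide above."""
    if p_value < alpha:
        return "Roll out variant" if lift > 0 else "Keep control"
    return "No significant effect, close test" if reached_sample_size else "Continue test"

print(decide(p_value=0.002, lift=28.6, reached_sample_size=True))
print(decide(p_value=0.21, lift=4.0, reached_sample_size=False))
```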
