Statistical significance analysis of A/B test results

Statistical significance is a formal check that the observed difference in conversion between variants is unlikely to be due to chance alone. Without proper statistical analysis, decisions may be driven by random noise in the data.

Key Concepts

P-value — the probability of observing an effect this large or larger if there is no real difference (i.e. the null hypothesis is true). A result is conventionally considered significant when p < 0.05.

Confidence Level — 1 − alpha. A 95% confidence level means accepting a false positive in 5% of cases.

Statistical Power — the probability of detecting a real effect of a given size (typically set to 80%).

MDE (Minimum Detectable Effect) — the smallest effect the test can reliably detect at a given sample size, significance level, and power.
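
To make alpha concrete, a quick simulation (a sketch; the sample size and 3% conversion rate are illustrative assumptions) of A/A tests — both groups drawn from the same true conversion rate — shows that roughly 5% of them come out "significant" purely by chance:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, runs = 5000, 0.03, 2000  # users per group, true conversion rate, number of A/A tests

false_positives = 0
for _ in range(runs):
    a = rng.binomial(n, p)  # conversions in "control"
    b = rng.binomial(n, p)  # conversions in "variant" (same true rate, so any difference is noise)
    p_pool = (a + b) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (a / n - b / n) / se
    if abs(z) > 1.96:  # |z| beyond the two-sided 95% critical value
        false_positives += 1

print(f"False positive rate: {false_positives / runs:.1%}")
```

The observed rate hovers near the chosen alpha of 5%, which is exactly the error rate the significance threshold is designed to cap.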

Z-Test for Proportions

from statsmodels.stats.proportion import proportions_ztest
import numpy as np

def analyze_test(control_n, control_conv, variant_n, variant_conv, alpha=0.05):
    cr_control = control_conv / control_n
    cr_variant = variant_conv / variant_n
    relative_lift = (cr_variant - cr_control) / cr_control * 100

    # Z-test (valid when expected successes and failures in each group are large, e.g. >= 10)
    counts = np.array([variant_conv, control_conv])
    nobs = np.array([variant_n, control_n])
    z_stat, p_value = proportions_ztest(counts, nobs, alternative='two-sided')

    # Confidence interval for difference
    se = np.sqrt(
        cr_control * (1 - cr_control) / control_n +
        cr_variant * (1 - cr_variant) / variant_n
    )
    diff = cr_variant - cr_control
    z_crit = 1.96  # normal quantile for a 95% CI (matches the default alpha=0.05)
    ci_low = diff - z_crit * se
    ci_high = diff + z_crit * se

    print(f"Control: {cr_control:.3%} ({control_conv}/{control_n})")
    print(f"Variant: {cr_variant:.3%} ({variant_conv}/{variant_n})")
    print(f"Lift: {relative_lift:+.1f}%")
    print(f"95% CI: [{ci_low:.3%}, {ci_high:.3%}]")
    print(f"P-value: {p_value:.4f}")
    print(f"Significant: {'YES ✓' if p_value < alpha else 'NO ✗'}")

    return p_value < alpha

analyze_test(
    control_n=3842, control_conv=115,
    variant_n=3891, variant_conv=148
)

Chi-Square Test (Alternative to Z-Test)

from scipy.stats import chi2_contingency
import numpy as np

control_n, control_conv = 3842, 115
variant_n, variant_conv = 3891, 148

contingency = np.array([
    [control_conv, control_n - control_conv],     # Control: converted, did not convert
    [variant_conv, variant_n - variant_conv]      # Variant: converted, did not convert
])

# correction=False disables the Yates continuity correction, so the result matches the z-test
chi2, p_value, dof, expected = chi2_contingency(contingency, correction=False)
print(f"Chi2: {chi2:.4f}, p={p_value:.4f}")

For two groups, the chi-square test (without continuity correction) and the two-sided z-test are equivalent: the chi-square statistic equals z squared, and the p-values coincide.
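
This equivalence is easy to verify numerically with the same example counts (assuming statsmodels is available for the z-test):

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

control_n, control_conv = 3842, 115
variant_n, variant_conv = 3891, 148

z, p_z = proportions_ztest([variant_conv, control_conv], [variant_n, control_n])
table = [[control_conv, control_n - control_conv],
         [variant_conv, variant_n - variant_conv]]
chi2, p_chi, _, _ = chi2_contingency(table, correction=False)

print(f"z^2 = {z**2:.4f}, chi2 = {chi2:.4f}")        # the two statistics match
print(f"p (z) = {p_z:.4f}, p (chi2) = {p_chi:.4f}")  # so do the p-values
```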

Interpretation Errors

Peeking — stopping the test as soon as p < 0.05, before the planned sample size is reached. With repeated daily checks this can inflate the Type I error rate from 5% to around 26% at alpha=0.05.

# Wrong: check daily and stop when p < 0.05
# Right: calculate sample size first, stop only after reaching it

from scipy.stats import norm
import math

def required_sample_size(baseline_cr, mde, alpha=0.05, power=0.8):
    """Sample size per variant for a two-proportion test (mde is relative)."""
    p1, p2 = baseline_cr, baseline_cr * (1 + mde)
    p_avg = (p1 + p2) / 2
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    n = ((z_a * math.sqrt(2 * p_avg * (1 - p_avg)) +
          z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1)) ** 2
    return math.ceil(n)

n = required_sample_size(baseline_cr=0.03, mde=0.15)
print(f"Run test until {n} users per variant reached")

Multiple comparisons — testing several variants against the control and picking the best one without correcting alpha inflates the false positive rate:

# Bonferroni correction for multiple comparisons
n_comparisons = 4  # 4 variants vs control
corrected_alpha = 0.05 / n_comparisons  # = 0.0125

# Or FDR (Benjamini-Hochberg)
from statsmodels.stats.multitest import multipletests
p_values = [0.03, 0.07, 0.01, 0.04]
reject, corrected_p, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
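
Applied to four illustrative raw p-values (the same values as above), the Benjamini-Hochberg procedure leaves only the smallest one significant at alpha=0.05:

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.03, 0.07, 0.01, 0.04]  # raw p-values for 4 variants vs control
reject, corrected_p, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
for p, p_adj, r in zip(p_values, corrected_p, reject):
    print(f"p={p:.2f} -> adjusted p={p_adj:.4f} -> {'significant' if r else 'not significant'}")
```

Note that p = 0.03 and p = 0.04, nominally "significant" on their own, no longer pass after correction.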

Bayesian A/B Analysis

An alternative to the frequentist approach: instead of a p-value, compute the probability that the variant beats the control:

import numpy as np

def bayesian_ab_test(control_conv, control_n, variant_conv, variant_n, samples=100000):
    """Posterior distribution via Beta distribution"""
    # Prior: Beta(1,1) = uniform distribution
    control_posterior = np.random.beta(
        control_conv + 1,
        control_n - control_conv + 1,
        samples
    )
    variant_posterior = np.random.beta(
        variant_conv + 1,
        variant_n - variant_conv + 1,
        samples
    )

    prob_variant_better = (variant_posterior > control_posterior).mean()
    expected_lift = (variant_posterior - control_posterior).mean() / control_posterior.mean() * 100

    print(f"Probability variant is better: {prob_variant_better:.1%}")
    print(f"Expected lift: {expected_lift:+.1f}%")
    print(f"Credible interval: [{np.percentile(variant_posterior - control_posterior, 2.5):.3%}, "
          f"{np.percentile(variant_posterior - control_posterior, 97.5):.3%}]")

bayesian_ab_test(115, 3842, 148, 3891)
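
A common complement to "probability variant is better" is expected loss: the average conversion rate given up if the chosen variant is actually worse. A typical rule (the threshold here is a convention, not from the text) is to ship when the expected loss falls below a small tolerance, e.g. 0.01 percentage points:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 100_000

# Posterior draws with a Beta(1, 1) prior, same counts as the example above
control = rng.beta(115 + 1, 3842 - 115 + 1, samples)
variant = rng.beta(148 + 1, 3891 - 148 + 1, samples)

# Expected loss of each decision, in absolute conversion-rate terms
loss_choose_variant = np.maximum(control - variant, 0).mean()
loss_choose_control = np.maximum(variant - control, 0).mean()

print(f"Expected loss if shipping variant: {loss_choose_variant:.5f}")
print(f"Expected loss if keeping control:  {loss_choose_control:.5f}")
```

With these counts, the expected loss from shipping the variant is tiny compared to the loss from keeping the control, which supports the roll-out decision.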

Decision Guide

Situation                               | Decision
p < 0.05, lift > 0                      | Roll out variant
p > 0.05, low traffic                   | Continue test
p > 0.05, sample size reached           | No significant effect, close test
p < 0.05, negative lift                 | Keep control
One segment significant, the other not | Analyze interactions, consider segmented rollout
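
The guide can be encoded as a small helper (an illustrative sketch, not from the original text; the per-segment case is left out since it needs human judgment):

```python
def decide(p_value, lift, reached_sample_size, alpha=0.05):
    """Map test status to a decision, following the guide above."""
    if p_value < alpha:
        return "Roll out variant" if lift > 0 else "Keep control"
    return "No significant effect, close test" if reached_sample_size else "Continue test"

print(decide(p_value=0.002, lift=28.6, reached_sample_size=True))
print(decide(p_value=0.21, lift=4.0, reached_sample_size=False))
```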
