AI Prompt A/B Testing Setup


Setting up A/B testing of prompts

Prompt A/B testing is a comparison of two variants of a system prompt on real traffic to measure the impact of changes on the quality, latency, and cost of responses.

Managing Prompt Versions

import hashlib

PROMPT_REGISTRY = {
    "customer_support_v1": """You are a customer support assistant for the company.
Answer briefly, professionally, and to the point.
If you do not know the answer, say so honestly.""",

    "customer_support_v2": """You are an experienced customer support specialist.
Style: warm, professional, specific.
Always suggest a next step. If the situation is complex, escalate.""",
}

class PromptABTest:
    def __init__(self, control: str, treatment: str, traffic_split: float = 0.5):
        # traffic_split is the share of sessions (0.0 to 1.0) routed to the treatment prompt
        self.variants = {"control": control, "treatment": treatment}
        self.traffic_split = traffic_split

    def get_prompt(self, session_id: str) -> tuple[str, str]:
        # Consistent assignment by session_id: the same session always gets the same variant
        bucket = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % 100
        variant = "treatment" if bucket < self.traffic_split * 100 else "control"
        return self.variants[variant], variant
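
The bucketing logic above can be checked for two properties: a session always receives the same variant, and the share of sessions in treatment stays close to traffic_split. The sketch below re-implements the assignment as a standalone function; assign_variant and the synthetic session IDs are illustrative names, not part of the original code.

```python
import hashlib


def assign_variant(session_id: str, traffic_split: float = 0.5) -> str:
    """Map a session ID to "treatment" or "control" deterministically."""
    # MD5 serves only as a stable hash here, not as a security measure.
    bucket = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < traffic_split * 100 else "control"


# Synthetic session IDs, for illustration only
sessions = [f"session-{i}" for i in range(10_000)]
assignments = [assign_variant(s, traffic_split=0.2) for s in sessions]

# Repeated calls return the same variant for the same session
stable = all(assign_variant(s, 0.2) == a for s, a in zip(sessions, assignments))

# The observed treatment share should be close to the requested 20%
treatment_share = assignments.count("treatment") / len(assignments)
print(stable, round(treatment_share, 3))
```

A cryptographic digest is used instead of Python's built-in hash() because string hashes from hash() are salted per process by default, which would reshuffle users between variants on every restart.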

Metrics for evaluating prompts

For each variant, we track user satisfaction (thumbs up/down, rating), task completion rate, escalation rate, response length, latency, and cost.

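import numpy as np

# PromptResponse and VariantMetrics are assumed to be simple record types
# (for example, dataclasses) defined elsewhere in the application.
def evaluate_prompt_variant(variant_responses: list[PromptResponse]) -> VariantMetrics:
    # Average only over rated responses; "is not None" keeps legitimate zero scores
    rated = [r.satisfaction_score for r in variant_responses if r.satisfaction_score is not None]
    return VariantMetrics(
        n=len(variant_responses),
        avg_satisfaction=np.mean(rated) if rated else float("nan"),
        completion_rate=np.mean([r.task_completed for r in variant_responses]),
        escalation_rate=np.mean([r.escalated for r in variant_responses]),
        avg_response_tokens=np.mean([r.completion_tokens for r in variant_responses]),
        avg_cost_usd=np.mean([r.cost_usd for r in variant_responses]),
    )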
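
The function above relies on PromptResponse and VariantMetrics, which the text does not define. One possible shape, inferred from the attribute names used in the function, is sketched below. The latency_ms field and the stdlib-only summarize helper are assumptions added for illustration, since latency is listed among the tracked metrics but does not appear in the aggregation code.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class PromptResponse:
    satisfaction_score: Optional[float]  # None when the user left no rating
    task_completed: bool
    escalated: bool
    completion_tokens: int
    cost_usd: float
    latency_ms: float  # assumed field, not present in the original code


@dataclass
class VariantMetrics:
    n: int
    avg_satisfaction: Optional[float]
    completion_rate: float
    escalation_rate: float
    avg_response_tokens: float
    avg_cost_usd: float
    avg_latency_ms: float  # assumed field


def summarize(responses: list[PromptResponse]) -> VariantMetrics:
    """Aggregate one variant's responses; expects a non-empty list."""
    rated = [r.satisfaction_score for r in responses if r.satisfaction_score is not None]
    return VariantMetrics(
        n=len(responses),
        avg_satisfaction=mean(rated) if rated else None,
        completion_rate=mean([float(r.task_completed) for r in responses]),
        escalation_rate=mean([float(r.escalated) for r in responses]),
        avg_response_tokens=mean([r.completion_tokens for r in responses]),
        avg_cost_usd=mean([r.cost_usd for r in responses]),
        avg_latency_ms=mean([r.latency_ms for r in responses]),
    )


# Two made-up responses: one rated and completed, one unrated and escalated
sample = [
    PromptResponse(5.0, True, False, 120, 0.002, 850.0),
    PromptResponse(None, False, True, 200, 0.004, 1150.0),
]
metrics = summarize(sample)
print(metrics)
```

Keeping unrated responses in the denominator for completion and escalation, while excluding them only from the satisfaction average, avoids biasing the rates toward users who bother to leave feedback.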

Integration with Langfuse

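from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the environment
langfuse = Langfuse()

# Load the evaluation dataset (create it once beforehand with
# langfuse.create_dataset(name="customer_support_eval") and add items to it)
dataset = langfuse.get_dataset("customer_support_eval")

# Run the experiment: every dataset item goes through both prompt variants.
# CONTROL_PROMPT, TREATMENT_PROMPT, llm and llm_judge are application-level
# placeholders defined elsewhere. The dataset-run calls below assume the
# Langfuse Python SDK v2; other SDK versions expose a different interface.
for sample in dataset.items:
    for variant, prompt in [("control", CONTROL_PROMPT), ("treatment", TREATMENT_PROMPT)]:
        # observe() creates a trace linked to this dataset run and yields its id
        with sample.observe(run_name=f"prompt_ab_{variant}") as trace_id:
            response = llm.generate(messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": sample.input}
            ])
            # Score each response with an LLM judge
            langfuse.score(trace_id=trace_id, name="quality",
                           value=llm_judge.evaluate(sample.input, response, sample.expected_output))

# Send any buffered events before the script exits
langfuse.flush()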
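
The loop above calls llm_judge.evaluate without showing it. A minimal sketch of such a judge follows, assuming the judge model is asked for a 1 to 5 rating that is then normalized to the 0..1 range. The LLMJudge class, the prompt wording, and the injected call_model callable are all hypothetical; in practice call_model would wrap a real model client.

```python
import re
from typing import Callable

# Hypothetical judge prompt; the wording is illustrative only
JUDGE_TEMPLATE = (
    "Rate the answer from 1 (poor) to 5 (excellent) against the reference.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reference: {expected}\n"
    "Reply with a single digit."
)


class LLMJudge:
    def __init__(self, call_model: Callable[[str], str]):
        # call_model takes a prompt string and returns the model's text reply
        self.call_model = call_model

    def evaluate(self, question: str, answer: str, expected: str) -> float:
        """Return a quality score in [0, 1] parsed from the judge's reply."""
        prompt = JUDGE_TEMPLATE.format(question=question, answer=answer, expected=expected)
        reply = self.call_model(prompt)
        match = re.search(r"[1-5]", reply)
        if match is None:
            raise ValueError(f"Judge reply contains no rating: {reply!r}")
        # Map 1..5 onto 0.0..1.0
        return (int(match.group()) - 1) / 4


# Stub model that always answers "4", so the sketch runs without network access
llm_judge = LLMJudge(call_model=lambda prompt: "4")
score = llm_judge.evaluate(
    "How do I reset my password?",
    "Use the 'Forgot password' link on the login page.",
    "Point the user to the password reset link.",
)
print(score)
```

Raising on an unparseable reply, rather than silently returning a default, keeps malformed judge outputs from leaking into the averages as fake scores.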

Statistical significance

The minimum sample size depends on the expected effect size. To detect a 5-percentage-point improvement in satisfaction from a 70% baseline (70% to 75%), about 1,250 samples per variant are needed for a two-sided two-proportion test (alpha=0.05, power=0.8); a smaller expected lift requires substantially more. We use statsmodels.stats.power or specialized A/B calculators.
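
As a rough cross-check of such numbers, the standard normal-approximation formula for comparing two proportions can be computed with the Python standard library alone. The sketch below assumes a two-sided test and an absolute lift from 70% to 75%, which gives roughly 1,250 samples per variant; a one-sided test or a different reading of "5% improvement" produces a different figure. The function names and the example counts are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(p_control: float, p_treatment: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Normal-approximation sample size for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_treatment - p_control) ** 2)


def two_proportion_p_value(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Planning: samples needed per variant to detect 70% -> 75% satisfaction
n_required = sample_size_per_variant(0.70, 0.75)

# Analysis: hypothetical counts of satisfied sessions (70% control vs 76% treatment)
p_value = two_proportion_p_value(875, 1250, 950, 1250)
print(n_required, p_value < 0.05)
```

The sample size should be fixed before the test starts; checking the p-value repeatedly and stopping at the first significant result inflates the false-positive rate.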