Setting up A/B testing of prompts
Prompt A/B testing compares two variants of a system prompt on real traffic to measure how a change affects the quality, latency, and cost of responses.
Managing prompt versions
import hashlib

# Versioned system prompts, keyed by name and version
PROMPT_REGISTRY = {
    "customer_support_v1": """You are a support assistant for the company.
Answer briefly, professionally, and to the point.
If you do not know the answer, say so honestly.""",
    "customer_support_v2": """You are an experienced support specialist.
Style: warm, professional, specific.
Always suggest a next step. If the situation is complex, escalate.""",
}


class PromptABTest:
    def __init__(self, control: str, treatment: str, traffic_split: float = 0.5):
        self.variants = {"control": control, "treatment": treatment}
        self.traffic_split = traffic_split  # share of traffic sent to the treatment

    def get_prompt(self, session_id: str) -> tuple[str, str]:
        # Consistent assignment by session_id: the same session always gets the same variant
        bucket = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % 100
        variant = "treatment" if bucket < self.traffic_split * 100 else "control"
        return self.variants[variant], variant
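As a quick check of the assignment logic above, the following sketch re-implements the same hashing rule as a standalone function and verifies that assignment is stable per session and roughly follows the configured split. The function name `assign_variant` and the simulated session IDs are illustrative assumptions, not part of the original code.

```python
import hashlib
from collections import Counter


def assign_variant(session_id: str, traffic_split: float = 0.5) -> str:
    """Map a session ID to 'treatment' or 'control' deterministically."""
    # MD5 serves only as a stable hash here, not as a security measure.
    bucket = int(hashlib.md5(session_id.encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < traffic_split * 100 else "control"


# Hypothetical session IDs, used only to simulate traffic.
sessions = [f"session-{i}" for i in range(10_000)]
counts = Counter(assign_variant(s, traffic_split=0.2) for s in sessions)
treatment_share = counts["treatment"] / len(sessions)

# Calling the function twice for the same session must give the same variant.
stable = all(assign_variant(s, 0.2) == assign_variant(s, 0.2) for s in sessions[:100])
print(f"treatment share: {treatment_share:.3f}, stable: {stable}")
```

Hashing the session ID rather than drawing a random number keeps a user on one variant for the whole session without storing any state. If several experiments run at once, prefixing the ID with an experiment name before hashing is one way to keep their assignments from being correlated.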
Metrics for evaluating prompts
For each variant, we track user satisfaction (thumbs up/down or a rating), task completion rate, escalation rate, response length, latency, and cost.
import numpy as np

# PromptResponse and VariantMetrics are the application's own record types, defined elsewhere
def evaluate_prompt_variant(variant_responses: list[PromptResponse]) -> VariantMetrics:
    # Keep a score of 0; skip only the responses that have no rating at all
    rated = [r.satisfaction_score for r in variant_responses if r.satisfaction_score is not None]
    return VariantMetrics(
        n=len(variant_responses),
        avg_satisfaction=np.mean(rated),
        completion_rate=np.mean([r.task_completed for r in variant_responses]),
        escalation_rate=np.mean([r.escalated for r in variant_responses]),
        avg_response_tokens=np.mean([r.completion_tokens for r in variant_responses]),
        avg_cost_usd=np.mean([r.cost_usd for r in variant_responses]),
    )
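The function above depends on `PromptResponse` and `VariantMetrics`, which the text does not define. Below is a minimal sketch of what they might look like, using only the standard library. The field names are inferred from the attributes the function reads; the types, the defaults, and the helper name `summarize` are assumptions.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Optional


@dataclass
class PromptResponse:
    # Field names follow the attributes used in the text; types are assumed.
    task_completed: bool
    escalated: bool
    completion_tokens: int
    cost_usd: float
    satisfaction_score: Optional[float] = None  # None when the user gave no rating


@dataclass
class VariantMetrics:
    n: int
    avg_satisfaction: Optional[float]
    completion_rate: float
    escalation_rate: float
    avg_response_tokens: float
    avg_cost_usd: float


def summarize(responses: list[PromptResponse]) -> VariantMetrics:
    """Aggregate one variant's logged responses (expects a non-empty list)."""
    # A rating of 0 is a valid score, so filter on None rather than truthiness.
    rated = [r.satisfaction_score for r in responses if r.satisfaction_score is not None]
    return VariantMetrics(
        n=len(responses),
        avg_satisfaction=fmean(rated) if rated else None,
        completion_rate=fmean(r.task_completed for r in responses),
        escalation_rate=fmean(r.escalated for r in responses),
        avg_response_tokens=fmean(r.completion_tokens for r in responses),
        avg_cost_usd=fmean(r.cost_usd for r in responses),
    )


# Made-up records, for illustration only.
responses = [
    PromptResponse(True, False, 120, 0.002, 5.0),
    PromptResponse(False, True, 200, 0.004, None),
    PromptResponse(True, False, 100, 0.003, 0.0),
    PromptResponse(True, False, 180, 0.003, 4.0),
]
metrics = summarize(responses)
print(metrics)
```

Returning `None` for `avg_satisfaction` when nobody rated a response avoids averaging an empty list, which would otherwise produce `nan` or an error depending on the library.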
Integration with Langfuse
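from langfuse import Langfuse

# Written against the Langfuse Python SDK v2; later major versions rename some of these calls.
# `llm` and `llm_judge` are the application's own model client and LLM-as-judge helper.
langfuse = Langfuse()
CONTROL_PROMPT = PROMPT_REGISTRY["customer_support_v1"]
TREATMENT_PROMPT = PROMPT_REGISTRY["customer_support_v2"]

# Create the evaluation dataset (items are added separately), then load it with its items
langfuse.create_dataset(name="customer_support_eval")
dataset = langfuse.get_dataset("customer_support_eval")

# Run the experiment: every dataset item is answered by both variants
for sample in dataset.items:
    for variant, prompt in [("control", CONTROL_PROMPT), ("treatment", TREATMENT_PROMPT)]:
        response = llm.generate(messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": sample.input},
        ])
        # Record the response as a trace, link it to the dataset run, and score it
        trace = langfuse.trace(name="prompt_ab", input=sample.input, output=response)
        sample.link(trace, run_name=f"prompt_ab_{variant}")
        langfuse.score(trace_id=trace.id, name="quality",
                       value=llm_judge.evaluate(sample.input, response, sample.expected_output))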
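Because each dataset item is answered by both variants, the judge scores form natural pairs. The sketch below shows one way such pairs could be compared offline, assuming the scores have already been exported into two dictionaries keyed by item ID. The dictionaries, IDs, score values, and the function name `paired_summary` are made up for illustration and do not come from Langfuse.

```python
from statistics import fmean

# Hypothetical judge scores in [0, 1], keyed by dataset item ID.
control_scores = {"item-1": 0.6, "item-2": 0.8, "item-3": 0.5, "item-4": 0.7}
treatment_scores = {"item-1": 0.7, "item-2": 0.8, "item-3": 0.9, "item-4": 0.6}


def paired_summary(control: dict[str, float], treatment: dict[str, float]) -> dict[str, float]:
    """Compare two variants on the items that both of them answered."""
    shared = sorted(control.keys() & treatment.keys())
    if not shared:
        raise ValueError("the two variants share no scored items")
    # Positive difference means the treatment scored higher on that item.
    diffs = [treatment[k] - control[k] for k in shared]
    return {
        "n_pairs": len(shared),
        "mean_diff": fmean(diffs),
        "treatment_win_rate": sum(d > 0 for d in diffs) / len(diffs),
        "tie_rate": sum(d == 0 for d in diffs) / len(diffs),
    }


summary = paired_summary(control_scores, treatment_scores)
print(summary)
```

Looking at per-item differences rather than two separate averages removes the variation caused by some questions simply being harder than others, which usually makes a small dataset more informative.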
Statistical significance
The minimum sample size depends on the expected effect size. To detect a 5-percentage-point improvement in satisfaction over a 70% baseline (alpha=0.05, power=0.8), a two-sided two-proportion test needs roughly 1,250 samples per variant, and smaller effects need considerably more. We use statsmodels.stats.power or specialized A/B calculators.
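The sample-size reasoning can be sketched with the standard library alone. The example assumes satisfaction is measured as a thumbs-up rate, that "5%" means an absolute change from 70% to 75%, and that the test is a two-sided two-proportion z-test under the normal approximation; under those assumptions the result is roughly 1,250 per variant, and a one-sided test or a different metric would give a different figure. The function names and the observed counts are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p_control: float, p_treatment: float,
                                alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-variant sample size for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_treatment - p_control) ** 2)


def two_proportion_p_value(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (success_b / n_b - success_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Planning: how many sessions per variant to detect 70% -> 75%?
n_per_variant = sample_size_two_proportions(0.70, 0.75)
print(f"samples per variant: {n_per_variant}")

# Analysis: made-up thumbs-up counts for control and treatment.
p_value = two_proportion_p_value(875, 1250, 940, 1250)
print(f"p-value: {p_value:.4f}")
```

Fixing the sample size before the experiment starts, and testing only once it is reached, avoids the inflated false-positive rate that comes from checking the p-value repeatedly as data arrives.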







