Implementing A/B Testing for Chatbot Scenarios in Mobile Apps
The product team wants to test which greeting variant converts better: "Hi, how can I help?" or "I'll show you products for your request right away." An A/B test at the UI level is a straightforward task. But a bot is not just text: it is a dialog graph, a set of intents, and logic for escalating to human operators. A/B testing bot scenarios requires its own infrastructure.
What We Test in a Bot
Bot scenarios differ from UI elements: a variant is not a button color but an entire dialog graph. A user may take 7 steps in variant A and 3 in variant B to reach the same result. The metric is not a click but completion of a target action (a purchase, a lead, a resolved question). This complicates measurement and requires event tracking at every step of the dialog.
Typical A/B hypotheses for bots:
- Different greetings and tone of voice
- Quick replies vs text input on first step
- Escalation timing (immediately vs after 2 failed intents)
- Different CTA formulations within dialog
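Each of these hypotheses maps naturally onto remote config parameters. A sketch of one possible shape (the `BotExperimentConfig` type and its field names are illustrative, not a fixed schema):

```swift
import Foundation

// One way to make the hypotheses above remotely tunable: collect every
// knob into a single config object delivered per experimental group.
struct BotExperimentConfig: Codable {
    let scenarioVariant: String            // e.g. "control" / "treatment"
    let greetingText: String               // greeting wording under test
    let useQuickReplies: Bool              // quick replies vs free text input
    let escalationAfterFailedIntents: Int  // 0 = escalate immediately
}

let json = """
{"scenarioVariant": "treatment",
 "greetingText": "I'll show you products for your request right away.",
 "useQuickReplies": true,
 "escalationAfterFailedIntents": 2}
""".data(using: .utf8)!

let config = try JSONDecoder().decode(BotExperimentConfig.self, from: json)
```

Keeping all knobs in one object means a variant is a single JSON value in Remote Config rather than a scatter of loose parameters that can drift out of sync.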
Technical Implementation
Firebase Remote Config is the standard choice for mobile A/B tests. Bot configuration parameters (scenario ID, prompt version, escalation threshold) are read on app startup:
let remoteConfig = RemoteConfig.remoteConfig()
remoteConfig.fetch(withExpirationDuration: 3600) { [weak self] status, error in
    guard status == .success else { return }
    remoteConfig.activate { _, _ in
        // Fall back to the control scenario if the parameter is missing.
        let botVariant = remoteConfig["bot_scenario_variant"].stringValue ?? "control"
        self?.chatViewModel.loadScenario(variant: botVariant)
    }
}
Firebase automatically splits the audience into groups and supports targeting conditions (country, app version, user properties). Analytics come built in via Firebase Analytics: conversion events are logged with the standard logEvent call.
GrowthBook and Statsig are alternatives with a more powerful statistical model. GrowthBook is open source and can be self-hosted. Statsig offers solid iOS/Android SDKs with low latency (feature flags are cached locally).
Server-side vs client-side A/B. If the bot is implemented with a server-side dialog engine (Rasa, Dialogflow CX, or a custom one), it is better to manage variants on the server: the client sends userId + sessionId, and the server assigns the experimental group, selects the scenario, and returns responses for the corresponding variant. This prevents client-side tampering and simplifies analytics.
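When assignment lives on the server, it can be made deterministic with a stable hash, so the same user always lands in the same group without storing assignments anywhere. A minimal sketch (the function names and the choice of FNV-1a are illustrative, not taken from any specific engine):

```swift
import Foundation

// FNV-1a: a simple, stable hash. Swift's built-in hashValue is randomly
// seeded per process, so it must NOT be used for bucketing.
func fnv1a(_ s: String) -> UInt64 {
    var hash: UInt64 = 0xcbf29ce484222325
    for byte in s.utf8 {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3
    }
    return hash
}

/// Assigns a user to one of `variants` for a given experiment.
/// Mixing the experiment key into the hash de-correlates experiments,
/// so the same users do not always end up in the same group together.
func assignVariant(userId: String, experimentKey: String, variants: [String]) -> String {
    let bucket = fnv1a("\(experimentKey):\(userId)") % UInt64(variants.count)
    return variants[Int(bucket)]
}

let variant = assignVariant(userId: "user-42",
                            experimentKey: "bot_scenario_variant",
                            variants: ["control", "treatment"])
```

The same function run on any server instance, or on the client, returns the same group for a given user, which also makes experiments reproducible in offline analysis.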
Dialog Event Tracking
Without detailed per-step tracking it is impossible to see where users drop off. The minimum event set:
- bot_session_start: {variant, userId, sessionId}
- bot_message_sent: {variant, stepId, messageType}
- bot_message_received: {variant, stepId, intentId, confidence}
- bot_intent_failed: {variant, stepId, userInput}, fired when NLU did not recognize the intent
- bot_escalated: {variant, stepId, reason}
- bot_goal_completed: {variant, goalType}, the conversion event
Every event carries variant and sessionId, which makes it possible to reconstruct the full user path in either variant.
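The event set above can be wrapped in a small helper that guarantees variant and sessionId on every event. A sketch (the BotEventLogger type is an assumption; in the app the returned pair would be forwarded to Firebase's Analytics.logEvent):

```swift
import Foundation

// Typed wrapper over the bot event set: every event automatically carries
// the experiment variant and session id, so no call site can forget them.
struct BotEventLogger {
    let variant: String
    let sessionId: String

    /// Builds the (name, parameters) pair for an event. In the app this
    /// pair would be passed to Analytics.logEvent(name, parameters:).
    func event(_ name: String, _ extra: [String: Any] = [:]) -> (String, [String: Any]) {
        var params: [String: Any] = ["variant": variant, "session_id": sessionId]
        for (key, value) in extra {
            params[key] = value
        }
        return (name, params)
    }
}

let logger = BotEventLogger(variant: "treatment", sessionId: "s-123")
let (name, params) = logger.event("bot_intent_failed",
                                  ["step_id": "greeting", "user_input": "???"])
```

Centralizing the payload also keeps key names consistent between events, which matters once paths are reconstructed by joining on session_id.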
Statistical Significance
A common error is stopping the test at the first promising numbers. The minimum sample size must be calculated in advance (power analysis): for a desired 5% effect at a 15% baseline conversion and 80% test power, you need at least roughly 2,800 users per group. Firebase A/B Testing evaluates the significance of results automatically.
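As a sanity check, the classic two-proportion sample size formula can be computed directly. A sketch (the function name and the 15% to 18% example are illustrative; note that the required n depends heavily on whether the target effect is absolute or relative):

```swift
import Foundation

// Sample size per group for comparing two conversion rates,
// normal approximation: n = (z_a + z_b)^2 * (p1*q1 + p2*q2) / (p1 - p2)^2.
// Defaults: alpha = 0.05 two-sided (z = 1.96), power = 0.80 (z = 0.84).
func sampleSizePerGroup(baseline p1: Double, expected p2: Double,
                        zAlpha: Double = 1.96, zBeta: Double = 0.84) -> Int {
    let variance = p1 * (1 - p1) + p2 * (1 - p2)
    let delta = p2 - p1
    return Int((pow(zAlpha + zBeta, 2) * variance / (delta * delta)).rounded(.up))
}

// Example: baseline conversion 15%, hoping to lift it to 18%
// (a 3 percentage-point absolute effect) gives roughly 2,400 per group.
let n = sampleSizePerGroup(baseline: 0.15, expected: 0.18)
```

Halving the detectable effect roughly quadruples the required sample, which is why tiny expected lifts are often impractical to test on small audiences.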
Timeline Estimates
Implementing A/B testing for two scenario variants with Firebase Remote Config and event tracking takes 3–5 days. Integration with a server-side dialog engine and more complex audience-splitting logic takes up to 2 weeks.