Voice Assistant Implementation in Mobile Applications
A voice assistant in a mobile app isn't just a microphone button. It's a pipeline of multiple components: VAD (Voice Activity Detection), STT, NLU/Intent Recognition, business logic processing, TTS. Each component adds latency. The goal is total latency from end of user speech to start of response ≤1.5 seconds. This is a technical constraint, not a marketing target.
Pipeline Architecture
Microphone → VAD → STT → NLU → Logic → TTS → Speaker
                    ↕       ↕
                Streaming  Intent DB
VAD — voice activity detection prevents sending silence to STT. WebRTCVAD (native library) or SileroVAD (ONNX/TFLite, ~1 MB). VAD reduces false positives and saves API calls.
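The gating principle behind VAD can be shown with a toy energy-threshold detector. This is illustrative only — production apps should use WebRTCVAD or SileroVAD as the text recommends, and the threshold and hangover values here are arbitrary:

```python
import struct

def frame_energy(frame: bytes) -> float:
    """Mean absolute amplitude of a 16-bit mono PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

class EnergyVAD:
    """Toy energy gate: forwards a frame to STT only when it looks like speech."""
    def __init__(self, threshold: float = 500.0, hangover_frames: int = 10):
        self.threshold = threshold        # amplitude below this counts as silence
        self.hangover = hangover_frames   # keep streaming briefly after speech stops
        self._silence_run = 0

    def is_speech(self, frame: bytes) -> bool:
        if frame_energy(frame) >= self.threshold:
            self._silence_run = 0
            return True
        self._silence_run += 1
        # The hangover avoids clipping word endings and short pauses.
        return self._silence_run <= self.hangover
```

Only frames for which `is_speech` returns True are pushed into the STT stream; everything else is dropped, which is what saves the API calls.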
STT — speech-to-text conversion. Options: native SFSpeechRecognizer / Android STT for simple cases; OpenAI Whisper API or Yandex SpeechKit for Russian with high accuracy.
NLU — intent and entity extraction from text. Example: "add milk to shopping list" → intent: ADD_TO_LIST, entity: {item: milk, list: shopping}. Solutions:
- Rasa NLU — open source, self-hosted, trains on your data. Suitable for complex domains with many intents.
- Dialogflow ES/CX — Google cloud NLU, quick to start, good Russian language support. Paid at high volume.
- LLM-based classification — ChatGPT / Claude API with structured output (function calling). Flexible, no training data annotation needed, more expensive at high traffic.
- On-device BERT — MobileBERT TFLite, ~50 MB, classifies intents from a fixed set. Works offline.
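The intent/entity structure from the example above can be sketched with a trivial rule-based matcher. The patterns and intent names here are hypothetical; an on-device model such as MobileBERT would replace the regexes with learned classification, but the output shape is the same:

```python
import re

# Hypothetical fixed intent set; a real on-device classifier would
# replace these regexes with a learned model over the same label set.
INTENT_PATTERNS = {
    "ADD_TO_LIST": re.compile(r"add (?P<item>.+) to (?P<list>\w+) list"),
    "NAVIGATE_TO": re.compile(r"(?:navigate|take me) to (?P<destination>.+)"),
}

def classify(text: str) -> dict:
    """Return {"intent": ..., "entities": {...}} or a fallback intent."""
    for intent, pattern in INTENT_PATTERNS.items():
        m = pattern.search(text.lower())
        if m:
            return {"intent": intent, "entities": m.groupdict()}
    return {"intent": "UNKNOWN", "entities": {}}
```

So "add milk to shopping list" maps to intent ADD_TO_LIST with entities {item: milk, list: shopping}, matching the example in the text.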
Intent Recognition: What Actually Works
For applications with a limited domain (smart home, online banking, navigation) — Rasa NLU or Dialogflow with explicit intents. 50–200 training examples per intent are usually sufficient.
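Those training examples are plain annotated utterances. A fragment in Rasa's NLU training data format might look like this (intent and entity names are illustrative — check the current Rasa documentation for the exact schema of your version):

```yaml
version: "3.1"
nlu:
- intent: add_to_list
  examples: |
    - add [milk](item) to my [shopping](list_name) list
    - put [bread](item) on the [shopping](list_name) list
    - add [batteries](item) to the [hardware](list_name) list
```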
For open domain — LLM with system prompt describing available actions. LLM returns JSON via function calling:
{
  "intent": "navigate_to",
  "destination": "Pushkin restaurant",
  "time": null
}
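To make the model return exactly that JSON, the available actions are declared as a tool/function schema in the request. A sketch in the OpenAI-style tools format (the exact wire format varies by provider and version — verify against the provider's API reference; Claude's tools declaration differs slightly):

```python
import json

# OpenAI-style tool declaration for the navigate_to intent from the example.
NAVIGATE_TOOL = {
    "type": "function",
    "function": {
        "name": "navigate_to",
        "description": "Start navigation to a destination.",
        "parameters": {
            "type": "object",
            "properties": {
                "destination": {"type": "string"},
                "time": {"type": ["string", "null"],
                         "description": "Departure time, or null for now"},
            },
            "required": ["destination"],
        },
    },
}

def parse_tool_call(arguments_json: str) -> dict:
    """The model returns function arguments as a JSON string; validate minimally."""
    args = json.loads(arguments_json)
    if "destination" not in args:
        raise ValueError("model omitted required field 'destination'")
    return {"intent": "navigate_to", **args}
```

Validating the parsed arguments matters: even with function calling, the model can occasionally drop a required field, and a voice UI needs a graceful re-prompt rather than a crash.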
LLM request latency: 400–800 ms for gpt-4o-mini, 200–400 ms for Claude Haiku. Add STT (800–1500 ms cloud) and TTS (~300 ms to first audio), and the total comes to roughly 1.3–2.6 seconds — right at the edge of the comfortable range.
Optimization: start LLM request in parallel during last 200 ms of STT (before final result), cache frequent intents locally.
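The overlap trick can be sketched with asyncio: fire the NLU request speculatively on the interim transcript, then reconcile once STT finalizes. All function names and timings below are illustrative stand-ins, not a real STT/LLM client:

```python
import asyncio

async def stt_final(interim: str) -> str:
    # Stand-in for the last ~200 ms of streaming STT finalization.
    await asyncio.sleep(0.2)
    return interim  # assume here the final transcript matches the interim one

async def nlu(text: str) -> dict:
    # Stand-in for an LLM / NLU round trip (~400 ms).
    await asyncio.sleep(0.4)
    return {"intent": "navigate_to", "query": text}

async def recognize(interim: str) -> dict:
    # Start NLU speculatively while STT is still finalizing.
    nlu_task = asyncio.create_task(nlu(interim))
    final = await stt_final(interim)
    if final == interim:
        return await nlu_task   # speculation paid off: ~400 ms instead of ~600 ms
    nlu_task.cancel()           # transcript changed: redo NLU on the final text
    return await nlu(final)
```

When the interim transcript survives finalization (the common case), the 200 ms of STT tail latency disappears entirely from the critical path.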
Conversation Context
A voice assistant without context memory breaks on the second question: "Who is Gazprom's CEO?" gets an answer, but "What about his wife?" is unanswerable without context — whose wife? Context is an array of the last N messages, passed with each LLM request or Dialogflow session.
Mobile context management: ConversationStore — singleton with @Published message list. Maximum 10–15 recent messages (~2000 tokens context is sufficient for most dialogues).
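The truncation logic of such a store is simple; a Python sketch of what the Swift `ConversationStore` would do (in Swift the message list would be an `@Published` property, and the cap of 10–15 messages comes straight from the text):

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str   # "user" or "assistant"
    text: str

@dataclass
class ConversationStore:
    """Keeps a sliding window of recent messages for the next LLM request."""
    max_messages: int = 15
    messages: list = field(default_factory=list)

    def append(self, role: str, text: str) -> None:
        self.messages.append(Message(role, text))
        # Drop the oldest turns beyond the window to stay within the token budget.
        del self.messages[:-self.max_messages]

    def as_prompt(self) -> list:
        """Shape the window for a chat-style LLM request."""
        return [{"role": m.role, "content": m.text} for m in self.messages]
```

A message-count cap is a coarse proxy for the ~2000-token budget; a stricter variant would trim by counted tokens instead of message count.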
Wake Word (Optional)
"Hey, [AppName]" without button press — works via PorcupineManager from Picovoice. On-device, custom wake word, ~500 KB model. Battery consumption — ~1.5% per hour on modern devices. On iOS requires background audio session, which Apple checks during review.
Case Study
Corporate assistant for field employees: voice task and CRM request creation without unlocking phone. Stack: SileroVAD on-device → Yandex SpeechKit streaming → Rasa NLU (self-hosted, 23 intents) → CRM REST API → Yandex SpeechKit TTS. Latency from end of speech to start of response: median 1.1 seconds, p95 2.3 seconds. Rasa NLU on own server provided full data control.
Timeline
Pipeline with STT + NLU with fixed intent set + TTS — 2–3 weeks. With wake word, conversation context, and business logic integration — 4–6 weeks. Cost is calculated individually.







