Voice Bot Implementation in Mobile Applications
A voice bot is a pipeline of three links: Speech-to-Text → NLP/LLM → Text-to-Speech. Each link adds latency. Total latency under 1.5 seconds is the boundary of acceptable conversational UX; beyond that, the user assumes the bot has hung.
Latency Optimization: Where Time is Lost
Typical latency breakdown:
| Stage | Cloud variant | Optimized |
|---|---|---|
| STT (transcription) | 400–800ms | 200–400ms (streaming) |
| NLP / LLM response | 500–2000ms | 150–400ms (streaming + cache) |
| TTS (synthesis) | 300–600ms | 100–200ms (streaming) |
| Network (2x) | 100–300ms | — |
| Total | 1.3–3.7s | ~1s with streaming |
The key to low latency: wherever possible, don't wait for the previous step to fully complete.
- STT with `shouldReportPartialResults = true`: start processing before the phrase is finished
- LLM streaming: start synthesis as soon as the first tokens arrive
- TTS streaming: start playback while the rest of the phrase is still being synthesized
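The overlap between the LLM and TTS stages can be sketched as a function that flushes the token stream to synthesis at sentence boundaries, so the first sentence starts playing while the rest is still being generated. This is a minimal sketch; the `speak` closure stands in for the real TTS client.

```swift
import Foundation

// Sketch of stage overlap: the LLM token stream is flushed to TTS at
// sentence boundaries, so synthesis of the first sentence overlaps
// generation of the rest. `speak` stands in for the real TTS client.
func streamToSentences(tokens: [String], speak: (String) -> Void) {
    var buffer = ""
    for token in tokens {
        buffer += token
        if let last = buffer.last, ".?!".contains(last) {
            speak(buffer.trimmingCharacters(in: .whitespaces))
            buffer = ""
        }
    }
    if !buffer.isEmpty { speak(buffer.trimmingCharacters(in: .whitespaces)) }
}

// Example: the first sentence goes to TTS before the second one has
// finished "generating".
var spoken: [String] = []
streamToSentences(tokens: ["Hello", " there.", " How", " can I help?"],
                  speak: { spoken.append($0) })
// spoken == ["Hello there.", "How can I help?"]
```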
Speech-to-Text: Choosing an Engine
Native APIs. iOS has SFSpeechRecognizer, Android has SpeechRecognizer. Free, with (limited) offline support. Accuracy for Russian is acceptable for short commands, worse for full sentences.
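A minimal sketch of native iOS recognition with partial results (assumes the microphone and `NSSpeechRecognitionUsageDescription` permissions are granted; the surrounding audio capture is omitted):

```swift
import Speech

// Native iOS recognition with partial results (a sketch; audio buffers
// from an AVAudioEngine tap are fed in via request.append(buffer)).
final class NativeSTT {
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "ru-RU"))
    private let request = SFSpeechAudioBufferRecognitionRequest()
    private var task: SFSpeechRecognitionTask?

    func start(onPartial: @escaping (String) -> Void) {
        request.shouldReportPartialResults = true  // stream hypotheses as they form
        task = recognizer?.recognitionTask(with: request) { result, _ in
            if let result {
                onPartial(result.bestTranscription.formattedString)
            }
        }
    }

    func stop() {
        request.endAudio()
        task?.cancel()
    }
}
```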
Whisper API (OpenAI). Best transcription quality for Russian, especially with professional terminology. Latency is 200–500ms for a 5–15 second recording. Use the whisper-1 model with language: "ru".
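A sketch of the Whisper request: a multipart upload of the recorded file with the model and language fields. The endpoint and field names are the documented OpenAI API; error handling is omitted for brevity.

```swift
import Foundation

// Sketch: send a recorded file to the Whisper transcription endpoint.
func transcribe(fileURL: URL, apiKey: String) async throws -> String {
    let boundary = UUID().uuidString
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/audio/transcriptions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("multipart/form-data; boundary=\(boundary)", forHTTPHeaderField: "Content-Type")

    var body = Data()
    func field(_ name: String, _ value: String) {
        body.append("--\(boundary)\r\nContent-Disposition: form-data; name=\"\(name)\"\r\n\r\n\(value)\r\n".data(using: .utf8)!)
    }
    field("model", "whisper-1")
    field("language", "ru")
    body.append("--\(boundary)\r\nContent-Disposition: form-data; name=\"file\"; filename=\"audio.m4a\"\r\nContent-Type: audio/m4a\r\n\r\n".data(using: .utf8)!)
    body.append(try Data(contentsOf: fileURL))
    body.append("\r\n--\(boundary)--\r\n".data(using: .utf8)!)
    request.httpBody = body

    let (data, _) = try await URLSession.shared.data(for: request)
    struct Response: Decodable { let text: String }
    return try JSONDecoder().decode(Response.self, from: data).text
}
```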
Google Cloud Speech-to-Text. The streaming API delivers partial results in real time. StreamingRecognizeRequest over gRPC gives the lowest latency among the cloud options.
Yandex SpeechKit. The strongest results for Russian among all the options (trained on a Russian corpus). Streaming recognition via gRPC. If the bot serves only Russian-speaking users, it is the first choice.
```swift
import AVFoundation

// iOS: AVAudioEngine → Yandex SpeechKit streaming.
// `speechKitClient`, `streamConfig` and `RecognitionStream` stand for the
// app's wrapper around the SpeechKit gRPC API.
class VoiceBotRecorder {
    private let audioEngine = AVAudioEngine()
    private var recognitionStream: RecognitionStream?

    func startRecording() throws {
        let inputNode = audioEngine.inputNode
        // The tap must use the node's native format (a mismatched format
        // fails at runtime); convert explicitly to the 16 kHz mono Int16
        // PCM that SpeechKit expects.
        let inputFormat = inputNode.outputFormat(forBus: 0)
        let targetFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                         sampleRate: 16_000,
                                         channels: 1,
                                         interleaved: true)!
        let converter = AVAudioConverter(from: inputFormat, to: targetFormat)!

        recognitionStream = speechKitClient.createStream(config: streamConfig)

        inputNode.installTap(onBus: 0, bufferSize: 4096, format: inputFormat) { [weak self] buffer, _ in
            guard let self else { return }
            let ratio = targetFormat.sampleRate / inputFormat.sampleRate
            let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio) + 1
            guard let converted = AVAudioPCMBuffer(pcmFormat: targetFormat,
                                                   frameCapacity: capacity) else { return }
            var error: NSError?
            converter.convert(to: converted, error: &error) { _, status in
                status.pointee = .haveData
                return buffer
            }
            guard error == nil, let pcm = converted.int16ChannelData?[0] else { return }
            let bytes = Data(bytes: pcm, count: Int(converted.frameLength) * 2)  // 2 bytes per Int16
            try? self.recognitionStream?.send(audio: bytes)
        }
        audioEngine.prepare()
        try audioEngine.start()
    }
}
```
Text-to-Speech: Speech Synthesis
ElevenLabs. Best speech quality; supports Russian. Streaming API via WebSocket or chunked HTTP responses. Voice cloning is available if the bot should speak with a specific brand voice.
OpenAI TTS. tts-1 (fast) and tts-1-hd (higher quality). Streams audio via chunked HTTP responses. Voices include alloy, echo, nova and others; for Russian, nova sounds the most natural.
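A sketch of the OpenAI TTS request (the documented /v1/audio/speech endpoint; the response body is the audio itself, mp3 by default):

```swift
import Foundation

// Sketch: request synthesized speech and receive the audio bytes.
func synthesize(_ text: String, apiKey: String) async throws -> Data {
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/audio/speech")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: [
        "model": "tts-1",   // or "tts-1-hd" for higher quality
        "voice": "nova",
        "input": text
    ])
    let (data, _) = try await URLSession.shared.data(for: request)
    return data
}
```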
Yandex SpeechKit TTS. For Russian, one of the best options for naturalness. Voices: alena, filipp, jane. Streaming via gRPC.
Native synthesis. iOS has AVSpeechSynthesizer, Android has TextToSpeech. Free and works offline, but the quality is noticeably below the cloud options: a robotic sound.
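The native iOS option fits in a few lines, which is why it works well as a fallback when the network is unavailable:

```swift
import AVFoundation

// Native iOS synthesis: free and offline, but noticeably more robotic
// than cloud voices.
let synthesizer = AVSpeechSynthesizer()

func speak(_ text: String) {
    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: "ru-RU")
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate
    synthesizer.speak(utterance)
}
```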
Audio Management on Mobile
iOS. The AVAudioSession category should be .playAndRecord with the .defaultToSpeaker option. While TTS is playing, stop feeding the microphone to STT (for example, remove the AVAudioEngine input tap before playback and reinstall it after); otherwise the bot hears and transcribes itself (echo).
```swift
// Route audio to the speaker and allow simultaneous record + playback.
try AVAudioSession.sharedInstance().setCategory(
    .playAndRecord,
    options: [.defaultToSpeaker, .allowBluetooth]
)
```
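Muting the microphone around playback can be sketched as follows; `audioEngine` is assumed to be the engine from the recorder above, and resuming happens in the player's delegate callback:

```swift
import AVFoundation

// Sketch: silence the microphone while the bot speaks so it does not
// transcribe its own TTS output.
func playResponse(_ player: AVAudioPlayer, audioEngine: AVAudioEngine) {
    audioEngine.inputNode.removeTap(onBus: 0)  // stop feeding STT
    player.play()
    // When playback finishes (AVAudioPlayerDelegate's
    // audioPlayerDidFinishPlaying), reinstall the tap and resume STT.
}
```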
Android. Call AudioManager.requestAudioFocus() before playback and abandonAudioFocus() afterwards (on API 26+, the AudioFocusRequest variants). Bluetooth headsets require separate handling via the BluetoothHeadset profile.
Interruption handling. The user may start speaking while the bot is still responding (barge-in). The sequence: detect the start of user speech → stop TTS playback → start recording. For VAD (Voice Activity Detection), either simple amplitude thresholding on the raw AudioRecord buffer, or the more accurate WebRTC VAD.
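The simple amplitude-based approach reduces to an RMS threshold per audio frame. A minimal sketch (the threshold value is an assumption to be tuned per device; WebRTC VAD remains the more robust choice in noise):

```swift
import Foundation

// Minimal energy-based VAD: a frame counts as speech when its RMS
// exceeds a threshold.
func isSpeech(frame: [Int16], threshold: Double = 500) -> Bool {
    guard !frame.isEmpty else { return false }
    let sumSquares = frame.reduce(0.0) { $0 + Double($1) * Double($1) }
    let rms = (sumSquares / Double(frame.count)).squareRoot()
    return rms > threshold
}

// Example: near-silence vs. a loud frame (10ms at 16 kHz = 160 samples).
isSpeech(frame: [Int16](repeating: 10, count: 160))    // false
isSpeech(frame: [Int16](repeating: 2000, count: 160))  // true
```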
Wake Word and Hands-Free Mode
For hands-free scenarios (navigation while driving, smart devices), use wake word detection: "Hi, assistant" activates the bot without a tap. Solutions: Porcupine (Picovoice) with custom wake word support, or OpenWakeWord (open source). Both run completely on-device, with no network requests.
Implementation Process
1. Choose STT/TTS engines based on requirements: language, accuracy, latency, budget.
2. Build the audio pipeline: capture, encoding, streaming.
3. Implement the NLP/LLM logic for understanding voice commands.
4. Optimize latency: streaming at all stages, caching of frequent responses.
5. UI: state visualization (listening / thinking / speaking), sound wave animation.
6. Test in background noise, with different accents and speech rates.
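Caching frequent responses (step 4) can be as simple as keying synthesized audio by a normalized form of the response text, so repeated phrases skip the TTS round-trip. A sketch with a hypothetical `TTSCache`:

```swift
import Foundation

// Sketch: cache TTS audio by normalized response text so repeated
// phrases skip the synthesis round-trip.
final class TTSCache {
    private var storage: [String: Data] = [:]

    private func key(_ text: String) -> String {
        text.lowercased().trimmingCharacters(in: .whitespacesAndNewlines)
    }

    func audio(for text: String, synthesize: (String) -> Data) -> Data {
        let k = key(text)
        if let cached = storage[k] { return cached }
        let fresh = synthesize(text)
        storage[k] = fresh
        return fresh
    }
}

// Example: the second request for the same (differently cased) phrase
// is served from the cache without calling `synthesize`.
var calls = 0
let cache = TTSCache()
_ = cache.audio(for: "Hello!") { _ in calls += 1; return Data() }
_ = cache.audio(for: "  hello! ") { _ in calls += 1; return Data() }
// calls == 1
```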
Timeline Estimates
A voice bot with native STT/TTS: about 1 week. With cloud engines (Yandex SpeechKit / ElevenLabs), streaming and latency optimization: 3–5 weeks. Adding wake word and hands-free mode: another 1–2 weeks.