Voice AI Assistant with Dialog Mode in Mobile Applications
A voice assistant in dialog mode is not just STT + GPT + TTS chained sequentially. It also means managing conversation state, the context window, interruption, background mode, and an audio session that competes with system apps. That is where an "almost ready" integration usually breaks.
What Makes Up a Dialog Assistant
Minimal stack:
- Wake word / Push-to-Talk — the trigger that starts a phrase
- STT — transcription (Whisper, Deepgram, Google STT)
- LLM — a response in the dialog context (GPT-4o, Claude, Gemini)
- TTS — the voiced response (ElevenLabs, OpenAI TTS, system voices)
- State machine — state management: idle → listening → processing → speaking → idle
Without an explicit state machine, the code degenerates into flags like isListening, isProcessing, and isSpeaking, which desynchronize on network errors. This is the classic source of "the assistant hung and doesn't respond" bugs.
State Machine: The Only Right Approach
enum AssistantState {
    case idle
    case listening
    case transcribing
    case thinking(history: [Message])
    case speaking(text: String)
    case error(Error)
}

class AssistantViewModel: ObservableObject {
    @Published private(set) var state: AssistantState = .idle
    private var conversationHistory: [Message] = []

    func startListening() {
        guard case .idle = state else { return }
        state = .listening
        audioCapture.start { [weak self] audioData in
            self?.handleAudioChunk(audioData)
        }
    }

    func onSilenceDetected() {
        guard case .listening = state else { return }
        state = .transcribing
        audioCapture.stop()
        Task { await transcribeAndRespond() }
    }

    private func transcribeAndRespond() async {
        do {
            let text = try await stt.transcribe(audioCapture.buffer)
            state = .thinking(history: conversationHistory)
            let response = try await llm.chat(messages: conversationHistory + [.user(text)])
            conversationHistory.append(.user(text))
            conversationHistory.append(.assistant(response))
            state = .speaking(text: response)
            await tts.speak(response)
            state = .idle
        } catch {
            state = .error(error)
        }
    }
}
The key point: transition to the next state only from the expected previous one (guard case). This prevents races between parallel events.
Interruption (Barge-in)
The user speaks over the assistant's response. You must: stop TTS, cancel the current LLM request, and start listening again.
On iOS:
func handleBargeIn() {
    tts.stopSpeaking(at: .immediate)   // cut off TTS output right away
    currentLLMTask?.cancel()           // abort the in-flight LLM request
    audioCapture.reset()
    state = .listening
    audioCapture.start { ... }
}
VAD must run in parallel during playback. With AVAudioSession in .playAndRecord mode, the microphone is available simultaneously with the speaker. The VAD threshold must be raised while the assistant is speaking, otherwise echo from the speaker triggers barge-in.
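A minimal energy-based VAD sketch illustrating the raised threshold during playback. The class name, threshold values, and the `isAssistantSpeaking` flag are assumptions for illustration, not part of any SDK:

```swift
import AVFoundation

final class BargeInDetector {
    // Baseline threshold for normal listening; a higher one is used while
    // TTS plays, so residual speaker echo does not trigger barge-in.
    // Both values are illustrative and need tuning per device.
    private let idleThreshold: Float = 0.02
    private let speakingThreshold: Float = 0.08
    var isAssistantSpeaking = false
    var onVoiceDetected: (() -> Void)?

    func process(buffer: AVAudioPCMBuffer) {
        guard let samples = buffer.floatChannelData?[0] else { return }
        let n = Int(buffer.frameLength)
        guard n > 0 else { return }
        // RMS energy of the frame
        var sum: Float = 0
        for i in 0..<n { sum += samples[i] * samples[i] }
        let rms = (sum / Float(n)).squareRoot()
        let threshold = isAssistantSpeaking ? speakingThreshold : idleThreshold
        if rms > threshold { onVoiceDetected?() }
    }
}
```

Setting the session mode to .voiceChat additionally enables the system's echo cancellation, which makes the raised threshold less critical.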
Context Window Management
GPT-4o supports 128K tokens, but sending the full conversation history with each request inflates both cost and latency. Strategies:
- Rolling window: keep the last N messages (usually 10–20)
- Summarization: after N messages, request a summary via a separate call and add it as a system message
- Relevance filtering: for narrow-domain assistants, use embedding similarity to select relevant history fragments
For most mobile assistants, a rolling window of 15–20 messages is sufficient.
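A rolling-window sketch combining the first two strategies. The `Message` enum is an assumption matching the cases used in the state-machine example above; the `summary` slot is filled by a separate summarization call, which is omitted here:

```swift
// Assumed message type, mirroring .user / .assistant usage in the article.
enum Message {
    case system(String), user(String), assistant(String)
}

struct ContextWindow {
    let maxMessages: Int
    private(set) var messages: [Message] = []
    var summary: String?  // produced by a separate summarization request

    mutating func append(_ message: Message) {
        messages.append(message)
        // Drop the oldest messages beyond the window. The system prompt
        // and the running summary live outside the window and are never trimmed.
        if messages.count > maxMessages {
            messages.removeFirst(messages.count - maxMessages)
        }
    }

    // Builds the payload for the next LLM request.
    func payload(systemPrompt: String) -> [Message] {
        var result: [Message] = [.system(systemPrompt)]
        if let summary {
            result.append(.system("Earlier in the conversation: \(summary)"))
        }
        return result + messages
    }
}
```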
TTS: Voice Choice and Caching
Streaming TTS is key to low latency. OpenAI TTS supports streaming: the response arrives in audio/mpeg chunks, and the client starts playback before the full audio is received.
// Streaming TTS with OpenAI
func streamSpeak(text: String) async throws {
    let request = TTSRequest(model: "tts-1", input: text, voice: "nova", responseFormat: "mp3")
    let (bytes, _) = try await urlSession.bytes(for: ttsURLRequest(request))

    var chunk = Data()
    for try await byte in bytes {
        chunk.append(byte)
        // Flush every 8 KB to the player — playback starts as soon as
        // the first chunk is enqueued, before the full audio arrives.
        if chunk.count >= 8192 {
            try audioPlayer.enqueueChunk(chunk)
            chunk = Data()
        }
    }
    if !chunk.isEmpty { try audioPlayer.enqueueChunk(chunk) }  // trailing bytes
}
For frequently repeated phrases ("I'm listening", "Wait", "Didn't understand") — cache pre-synthesized audio locally. This removes latency entirely on typical replies.
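A minimal on-disk phrase cache sketch. The class name and hashing scheme are illustrative; any stable key derivation works:

```swift
import Foundation

final class PhraseAudioCache {
    private let directory: URL
    init(directory: URL) { self.directory = directory }

    // Deterministic filename per phrase; djb2 hash keeps names short
    // and avoids filesystem-unsafe characters.
    private func fileURL(for phrase: String) -> URL {
        var hash: UInt64 = 5381
        for byte in phrase.utf8 { hash = hash &* 33 &+ UInt64(byte) }
        return directory.appendingPathComponent("tts-\(hash).mp3")
    }

    func cachedAudio(for phrase: String) -> Data? {
        try? Data(contentsOf: fileURL(for: phrase))
    }

    func store(_ audio: Data, for phrase: String) {
        try? audio.write(to: fileURL(for: phrase))
    }
}
```

Check the cache before calling the TTS API; on a miss, synthesize once and store the result.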
Push-to-Talk vs Wake Word
Push-to-Talk is simpler: no false positives and less battery drain. A good fit for professional tools.
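A hold-to-talk button can be sketched in SwiftUI in a few lines, driving the AssistantViewModel from the state-machine section (the view name and layout are illustrative):

```swift
import SwiftUI

// Starts capture on press, stops on release.
struct PushToTalkButton: View {
    @ObservedObject var viewModel: AssistantViewModel
    @GestureState private var isPressed = false

    var body: some View {
        Image(systemName: isPressed ? "mic.fill" : "mic")
            .font(.system(size: 44))
            .padding(24)
            .background(Circle().fill(isPressed ? Color.red : Color.blue))
            .gesture(
                // minimumDistance: 0 makes the drag gesture fire on touch-down,
                // which is how press-and-hold is usually modeled in SwiftUI.
                DragGesture(minimumDistance: 0)
                    .updating($isPressed) { _, state, _ in state = true }
                    .onEnded { _ in viewModel.onSilenceDetected() }  // release = end of phrase
            )
            .onChange(of: isPressed) { pressed in
                if pressed { viewModel.startListening() }
            }
    }
}
```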
Wake word via Picovoice Porcupine — always active, on-device (< 1% CPU), supports custom wake words. Integration goes through PorcupineManager on iOS/Android.
// Android: Porcupine wake word
porcupineManager = PorcupineManager.Builder()
    .setAccessKey(accessKey)
    .setKeyword(Porcupine.BuiltInKeyword.HEY_GOOGLE)  // or a custom .ppn file
    .build(context) { keywordIndex ->
        // The callback fires on a background thread — hop to the main thread for UI
        runOnUiThread { viewModel.onWakeWordDetected() }
    }
porcupineManager.start()
Wake word detection in the background on Android requires a ForegroundService with a notification. Without it, the system kills the process.
Background Mode on iOS
A voice assistant falls under the voip or audio background mode in Apple's classification. Active listening requires the audio capability in Entitlements plus an active AVAudioSession. Apple can reject the app during review if background audio is not justified — explain the use case in the App Review notes.
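The session setup for this scenario can be sketched as follows (the function name is illustrative; the AVAudioSession calls are standard API):

```swift
import AVFoundation

// Info.plist must also declare the background mode:
// <key>UIBackgroundModes</key> <array><string>audio</string></array>
func configureBackgroundAudioSession() throws {
    let session = AVAudioSession.sharedInstance()
    // .playAndRecord keeps the mic and speaker available simultaneously;
    // .voiceChat enables the system's echo cancellation, which also
    // helps barge-in detection while TTS is playing.
    try session.setCategory(.playAndRecord,
                            mode: .voiceChat,
                            options: [.allowBluetooth, .defaultToSpeaker])
    try session.setActive(true)
}
```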
Timeline
An MVP with Push-to-Talk, Whisper STT, GPT-4o, and OpenAI TTS takes 2–3 weeks on one platform. A full assistant with wake word, barge-in, streaming TTS, context management, and background mode — 6–10 weeks.