Voice AI Assistant with Dialog Mode for Mobile App

NOVASOLUTIONS.TECHNOLOGY develops, supports, and maintains iOS, Android, and PWA mobile applications. We have extensive experience and expertise in publishing mobile applications to popular markets such as Google Play, the App Store, Amazon, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.


Voice AI Assistant with Dialog Mode in Mobile Applications

A voice assistant in dialog mode is not just STT + GPT + TTS run sequentially. It means managing conversation state, the context window, interruption, background mode, and an audio session that competes with system apps. That is where an "almost ready" integration usually breaks.

What Makes up a Dialog Assistant

Minimal stack:

  • Wake word / Push-to-Talk — the trigger that starts a phrase
  • STT — transcription (Whisper, Deepgram, Google STT)
  • LLM — answer in dialog context (GPT-4o, Claude, Gemini)
  • TTS — voice answer (ElevenLabs, OpenAI TTS, system)
  • State machine — state management: idle → listening → processing → speaking → idle

Without an explicit state machine, the code degenerates into flags like isListening, isProcessing, and isSpeaking, which desynchronize on network errors. This is the classic source of "the assistant hung and doesn't respond" bugs.

State Machine: The Only Right Approach

enum AssistantState {
    case idle
    case listening
    case transcribing
    case thinking(history: [Message])
    case speaking(text: String)
    case error(Error)
}

class AssistantViewModel: ObservableObject {
    @Published private(set) var state: AssistantState = .idle

    private var conversationHistory: [Message] = []
    private var currentLLMTask: Task<Void, Never>?

    func startListening() {
        guard case .idle = state else { return }
        state = .listening
        audioCapture.start { [weak self] audioData in
            self?.handleAudioChunk(audioData)
        }
    }

    func onSilenceDetected() {
        guard case .listening = state else { return }
        state = .transcribing
        audioCapture.stop()
        // Keep a handle to the task so barge-in can cancel it
        currentLLMTask = Task { await transcribeAndRespond() }
    }

    private func transcribeAndRespond() async {
        do {
            let text = try await stt.transcribe(audioCapture.buffer)
            state = .thinking(history: conversationHistory)
            let response = try await llm.chat(messages: conversationHistory + [.user(text)])
            try Task.checkCancellation() // barge-in may have cancelled us mid-request
            conversationHistory.append(.user(text))
            conversationHistory.append(.assistant(response))
            state = .speaking(text: response)
            await tts.speak(response)
            state = .idle
        } catch is CancellationError {
            // Barge-in: handleBargeIn() has already moved state to .listening
        } catch {
            state = .error(error)
        }
    }
}

The key point: transition to the next state only from the expected previous state (guard case). This prevents races between parallel events.

Interruption (Barge-in)

The user speaks over the assistant's response. The app must stop TTS, cancel the current LLM request, and start listening again.

On iOS:

func handleBargeIn() {
    tts.stopSpeaking(at: .immediate) // cut playback immediately
    currentLLMTask?.cancel()         // drop the in-flight LLM request
    audioCapture.reset()             // clear the stale audio buffer
    state = .listening
    audioCapture.start { ... }
}

VAD must run in parallel during playback. With AVAudioSession in .playAndRecord mode, the microphone is available simultaneously with the speaker. The VAD threshold must be raised while the assistant is speaking, otherwise speaker echo triggers barge-in.
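
The threshold adjustment above can be sketched as a simple energy-based gate. This is an illustration, not a specific library's API: VADGate, baseThreshold, and speakingMultiplier are hypothetical names, and the numbers are starting points to tune.

```swift
import Foundation

// Energy-based VAD gate with a raised threshold during TTS playback.
// All names and constants here are illustrative assumptions.
struct VADGate {
    // Baseline RMS threshold while the assistant is silent.
    var baseThreshold: Float = 0.02
    // Multiplier applied while TTS is playing, so speaker echo
    // does not trigger barge-in.
    var speakingMultiplier: Float = 3.0

    // Root-mean-square energy of a PCM frame with samples in [-1, 1].
    func rms(_ samples: [Float]) -> Float {
        guard !samples.isEmpty else { return 0 }
        let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
        return (sumOfSquares / Float(samples.count)).squareRoot()
    }

    // True if the frame counts as user speech, given whether
    // the assistant is currently speaking.
    func isSpeech(_ samples: [Float], assistantSpeaking: Bool) -> Bool {
        let threshold = assistantSpeaking
            ? baseThreshold * speakingMultiplier
            : baseThreshold
        return rms(samples) > threshold
    }
}
```

In production, echo cancellation from the .voiceChat audio session mode does most of the work; the raised threshold is a second line of defense.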

Context Window Management

GPT-4o supports a 128K-token context, but sending the full conversation history with each request drives up both cost and latency. Strategies:

  1. Rolling window: keep last N messages (usually 10–20)
  2. Summarization: after N messages, request summary via separate call, add as system message
  3. Relevance filtering: for narrow assistants — embedding similarity to select relevant history fragments

For most mobile assistants, a rolling window of 15–20 messages is sufficient.
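
The rolling-window strategy can be sketched as a pure trimming function. The Message enum here is a hypothetical stand-in mirroring the one used in the view model above; a production version would budget by tokens rather than by message count.

```swift
import Foundation

// Hypothetical message type mirroring the one used in the view model.
enum Message: Equatable {
    case system(String)
    case user(String)
    case assistant(String)
}

// Rolling-window trimming: keep the leading system prompt(s), if any,
// plus the last `maxMessages` turns of the conversation.
func trimHistory(_ history: [Message], maxMessages: Int = 20) -> [Message] {
    let system = history.prefix { message in
        if case .system = message { return true } else { return false }
    }
    let rest = history.dropFirst(system.count)
    return Array(system) + rest.suffix(maxMessages)
}
```

Call this right before each llm.chat request so the system prompt always survives trimming.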

TTS: Voice Choice and Caching

Streaming TTS is key to low latency. OpenAI TTS supports streaming: the response arrives in audio/mpeg chunks, and the client starts playback before the full audio is received.

// Streaming TTS with OpenAI
func streamSpeak(text: String) async throws {
    let request = TTSRequest(model: "tts-1", input: text, voice: "nova", responseFormat: "mp3")
    let (bytes, _) = try await urlSession.bytes(for: ttsURLRequest(request))

    var audioData = Data()
    for try await byte in bytes {
        audioData.append(byte)
        if audioData.count > 8192 { // start playback after the first 8 KB
            try audioPlayer.enqueueChunk(audioData)
            audioData = Data()
        }
    }
    if !audioData.isEmpty { // flush the tail, otherwise the last chunk is lost
        try audioPlayer.enqueueChunk(audioData)
    }
}

For frequently repeated phrases ("I'm listening", "Wait", "Didn't understand"), cache pre-synthesized audio locally. This removes latency entirely on typical replies.
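
A minimal sketch of such a phrase cache, with an in-memory layer over files on disk. PhraseAudioCache and its file-naming scheme are assumptions for illustration, not an SDK feature; a real implementation would derive filenames from a stable hash such as SHA-256.

```swift
import Foundation

// Local cache for pre-synthesized TTS audio of fixed phrases.
// The class name and file layout are illustrative assumptions.
final class PhraseAudioCache {
    private let directory: URL
    private var memory: [String: Data] = [:]

    init(directory: URL) { self.directory = directory }

    private func fileURL(for phrase: String) -> URL {
        // Deterministic filename per phrase; a production version
        // would use a SHA-256 digest instead of raw scalar values.
        let name = phrase.unicodeScalars.map { String($0.value) }.joined(separator: "-")
        return directory.appendingPathComponent(name).appendingPathExtension("mp3")
    }

    // Returns cached audio, checking memory first, then disk.
    func audio(for phrase: String) -> Data? {
        if let cached = memory[phrase] { return cached }
        guard let data = try? Data(contentsOf: fileURL(for: phrase)) else { return nil }
        memory[phrase] = data
        return data
    }

    // Stores synthesized audio in memory and on disk.
    func store(_ data: Data, for phrase: String) throws {
        memory[phrase] = data
        try data.write(to: fileURL(for: phrase))
    }
}
```

On a cache hit, play the local file instead of issuing a TTS request; on a miss, synthesize once and store the result.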

Push-to-Talk vs Wake Word

Push-to-Talk is simpler: no false positives and less battery drain. A good fit for professional tools.

Wake word via Picovoice Porcupine — always active, on-device (< 1% CPU), supports custom keywords. Integrated via PorcupineManager on iOS/Android.

// Android: Porcupine wake word
// PorcupineManager owns the audio pipeline, so the standalone
// Porcupine object is unnecessary — one builder is enough.
porcupineManager = PorcupineManager.Builder()
    .setAccessKey(accessKey)
    .setKeyword(Porcupine.BuiltInKeyword.HEY_GOOGLE) // or a custom .ppn file
    .build(context) { keywordIndex ->
        runOnUiThread { viewModel.onWakeWordDetected() }
    }
porcupineManager.start()

Wake word detection in the background on Android requires a ForegroundService with a notification. Without it, the system kills the process.

Background Mode on iOS

A voice assistant falls under the voip or audio background mode in Apple's framework. Active listening requires the audio capability in Entitlements plus an active AVAudioSession. Apple can reject the app during review if background audio is not justified; explain the use case in the App Review notes.
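
The session setup this implies can be sketched as follows. These are real AVAudioSession calls, but the specific option set is a reasonable starting point rather than the only valid combination.

```swift
import AVFoundation

// Configure the audio session for simultaneous capture and playback.
// Call before starting the audio engine.
func configureAssistantAudioSession() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(
        .playAndRecord,                // mic and speaker at the same time
        mode: .voiceChat,              // enables system echo cancellation
        options: [.allowBluetooth, .defaultToSpeaker]
    )
    try session.setActive(true)
}
```

The .voiceChat mode matters for barge-in: it turns on the system's acoustic echo cancellation, so the assistant's own voice is largely removed from the mic signal before VAD sees it.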

Timeline

An MVP with Push-to-Talk, Whisper STT, GPT-4o, and OpenAI TTS takes 2–3 weeks on one platform. A full assistant with wake word, barge-in, streaming TTS, context management, and background mode takes 6–10 weeks.