Voice Bot Implementation in Mobile App

NOVASOLUTIONS.TECHNOLOGY is engaged in the development, support, and maintenance of iOS, Android, and PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular markets such as Google Play, the App Store, Amazon, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.

Complexity: complex · Timeline: from 1 week to 3 months
Latest works: mobile applications for FEEDME, XOOMER, RHL, ZIPPY, Affhome, and FLAVORS.

Voice Bot Implementation in Mobile Applications

A voice bot is a pipeline of three links: Speech-to-Text → NLP/LLM → Text-to-Speech. Each link adds latency. Total latency under 1.5 seconds is the threshold of acceptable conversational UX; above it, the user assumes the bot has hung.

Latency Optimization: Where Time is Lost

Typical latency breakdown:

Stage                 Cloud variant    Optimized
STT (transcription)   400–800 ms       200–400 ms (streaming)
NLP / LLM response    500–2000 ms      150–400 ms (streaming + cache)
TTS (synthesis)       300–600 ms       100–200 ms (streaming)
Network (×2)          100–300 ms       (unchanged)
Total                 1.3–3.7 s        ~1 s with streaming

The key to low latency: wherever possible, don't wait for the previous step to finish completely:

  • STT with shouldReportPartialResults = true — start processing before the phrase is complete
  • LLM streaming — start synthesis as soon as the first tokens arrive
  • TTS streaming — start playback while the rest of the phrase is still being synthesized
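As an illustration of this principle, here is a minimal sketch (not tied to any particular SDK; the token source is a stub standing in for a streamed LLM reply) of how the LLM and TTS stages can overlap using Swift's AsyncStream:

```swift
import Foundation

// Stub for a streamed LLM reply: real code would wrap the SDK's
// token callback in the same AsyncStream shape.
func llmTokens(for prompt: String) -> AsyncStream<String> {
    AsyncStream { continuation in
        for token in ["Turning", " navigation", " on."] {
            continuation.yield(token)
        }
        continuation.finish()
    }
}

// Flush to TTS at phrase boundaries, so playback starts long
// before the LLM has finished generating the full answer.
func runPipeline(finalTranscript: String) async {
    var pendingPhrase = ""
    for await token in llmTokens(for: finalTranscript) {
        pendingPhrase += token
        if pendingPhrase.hasSuffix(".") {
            print("TTS ▶ \(pendingPhrase)")  // stub: enqueue for synthesis
            pendingPhrase = ""
        }
    }
}
```

The same pattern applies one stage earlier: partial STT transcripts feed the LLM before the user has finished the phrase.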

Speech-to-Text: Choosing an Engine

Native APIs. SFSpeechRecognizer on iOS, SpeechRecognizer on Android. Free, with (limited) offline support. Accuracy for Russian is acceptable for short commands, worse for full sentences.
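A minimal sketch of the native iOS recognizer with partial results enabled (assumes speech and microphone permissions are already granted, and that audio buffers are appended to the request from an AVAudioEngine tap):

```swift
import Speech

// Native iOS STT with partial results: hypotheses arrive while the
// user is still speaking, so downstream NLP can start early.
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "ru-RU"))!
let request = SFSpeechAudioBufferRecognitionRequest()
request.shouldReportPartialResults = true  // deliver hypotheses live

let task = recognizer.recognitionTask(with: request) { result, error in
    guard let result = result else { return }
    let text = result.bestTranscription.formattedString  // partial transcript
    if result.isFinal {
        // hand the final transcript to the NLP/LLM stage
        print("final: \(text)")
    }
}
// Feed PCM buffers from an AVAudioEngine tap:
//   request.append(buffer)
// and call request.endAudio() when the user stops speaking.
```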

Whisper API (OpenAI). Best transcription quality for Russian, especially with professional terminology. Latency of 200–500 ms for a 5–15-second recording; note the endpoint is request/response rather than streaming. Model whisper-1 with language: "ru".

Google Cloud Speech-to-Text. The streaming API returns partial results in real time. StreamingRecognizeRequest over gRPC gives the minimum latency among the cloud options.

Yandex SpeechKit. The best results for Russian among all the options (trained on a Russian corpus). Streaming recognition via gRPC. If the bot works only with Russian-speaking users, this is the first choice.

// iOS: AVAudioEngine → Yandex SpeechKit streaming
// (speechKitClient, streamConfig and RecognitionStream stand in for SpeechKit SDK types)
class VoiceBotRecorder {
    private let audioEngine = AVAudioEngine()
    private var recognitionStream: RecognitionStream?

    func startRecording() throws {
        let inputNode = audioEngine.inputNode
        // The tap must use the node's native format; tapping with a
        // mismatched format throws at runtime on iOS.
        let inputFormat = inputNode.outputFormat(forBus: 0)
        // SpeechKit expects 16 kHz mono 16-bit PCM, so convert on the fly.
        let targetFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                         sampleRate: 16000,
                                         channels: 1,
                                         interleaved: false)!
        let converter = AVAudioConverter(from: inputFormat, to: targetFormat)!

        recognitionStream = speechKitClient.createStream(config: streamConfig)

        inputNode.installTap(onBus: 0, bufferSize: 4096, format: inputFormat) { [weak self] buffer, _ in
            let ratio = targetFormat.sampleRate / inputFormat.sampleRate
            guard let converted = AVAudioPCMBuffer(
                pcmFormat: targetFormat,
                frameCapacity: AVAudioFrameCount(Double(buffer.frameLength) * ratio)) else { return }

            var error: NSError?
            var consumed = false
            let status = converter.convert(to: converted, error: &error) { _, inputStatus in
                if consumed { inputStatus.pointee = .noDataNow; return nil }
                consumed = true
                inputStatus.pointee = .haveData
                return buffer
            }
            guard status != .error, let pcmData = converted.int16ChannelData?[0] else { return }
            // 2 bytes per Int16 sample, mono
            let bytes = Data(bytes: pcmData, count: Int(converted.frameLength) * 2)
            try? self?.recognitionStream?.send(audio: bytes)
        }

        audioEngine.prepare()
        try audioEngine.start()
    }
}

Text-to-Speech: Speech Synthesis

ElevenLabs. Best speech quality, supports Russian. Streaming API via WebSocket or chunked HTTP response. Voice cloning is available if the bot should speak with a specific brand voice.

OpenAI TTS. tts-1 (fast) and tts-1-hd (higher quality). Streaming via chunked HTTP responses. Voices alloy, echo, nova and others; for Russian, nova sounds the most natural.

Yandex SpeechKit TTS. For Russian, one of the best options for naturalness. Voices alena, filipp, jane. Streaming via gRPC.

Native synthesis. AVSpeechSynthesizer on iOS, TextToSpeech on Android. Free and works offline, but the quality is noticeably lower than the cloud options: a robotic sound.
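For the native route, a minimal AVSpeechSynthesizer example (the phrase text is illustrative):

```swift
import AVFoundation

// Native offline synthesis: free and network-independent,
// but noticeably more robotic than the cloud engines.
// Keep a strong reference to the synthesizer for the duration of playback.
let synthesizer = AVSpeechSynthesizer()
let utterance = AVSpeechUtterance(string: "Здравствуйте! Чем могу помочь?")
utterance.voice = AVSpeechSynthesisVoice(language: "ru-RU")
utterance.rate = AVSpeechUtteranceDefaultSpeechRate
synthesizer.speak(utterance)
```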

Audio Management on Mobile

iOS. The AVAudioSession category should be .playAndRecord with the .defaultToSpeaker option. While TTS is playing, the microphone input must be muted (stop the engine tap or pause recording), otherwise the bot hears itself (echo). Alternatively, enable the system echo cancellation with inputNode.setVoiceProcessingEnabled(true) (iOS 13+).

// Configure the shared session once, before starting the audio engine
try AVAudioSession.sharedInstance().setCategory(
    .playAndRecord,
    mode: .voiceChat,  // voice-optimized processing for two-way audio
    options: [.defaultToSpeaker, .allowBluetooth]
)

Android. Request audio focus before playback and release it afterwards: AudioManager.requestAudioFocus() / abandonAudioFocus() (on API 26+, the AudioFocusRequest variants). Bluetooth headsets require separate handling via the BluetoothHeadset profile.

Interruption handling. The user starts speaking while the bot is still responding (barge-in). You need to: detect the start of user speech → stop TTS playback → start recording. VAD (Voice Activity Detection): either a simple amplitude threshold over the raw PCM stream (on Android, MediaRecorder.getMaxAmplitude() gives a cheap approximation), or, more accurately, WebRTC VAD.
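A naive amplitude-based VAD can be sketched as a pure function over a PCM frame (the threshold here is an assumed starting value, to be tuned per device and environment):

```swift
// Energy-based VAD: flags speech when the RMS of a PCM frame exceeds
// a threshold. Good enough for barge-in detection in quiet rooms;
// use WebRTC VAD for noisy environments.
func isSpeech(_ samples: [Int16], threshold: Double = 1000) -> Bool {
    guard !samples.isEmpty else { return false }
    let sumSquares = samples.reduce(0.0) { $0 + Double($1) * Double($1) }
    let rms = (sumSquares / Double(samples.count)).squareRoot()
    return rms > threshold
}
```

Called on each tap buffer while TTS is playing, a `true` result triggers the stop-playback → start-recording sequence.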

Wake Word and Hands-Free Mode

For hands-free scenarios (navigation while driving, smart devices), use wake word detection: "Hi, assistant" activates the bot without a tap. Solutions: Porcupine (Picovoice) with custom wake word support, openWakeWord (open source). Both work completely on-device, with no network requests.

Implementation Process

1. Choose STT/TTS engines based on requirements (language, accuracy, latency, budget).
2. Develop the audio pipeline: capture, encoding, streaming.
3. Implement the NLP/LLM logic for understanding voice commands.
4. Optimize latency: streaming at all stages, caching of frequent responses.
5. Build the UI: visualization of states (listening / thinking / speaking), sound-wave animation.
6. Test in background-noise conditions, with different accents and speech rates.
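The UI step can start from a small state model; here is a sketch (the names are assumptions, not from any SDK) with barge-in modeled as the speaking → listening transition:

```swift
// UI state for the voice bot, driving the state visualization
// (idle wave, pulsing dots, speech animation).
enum VoiceBotState { case idle, listening, thinking, speaking }

// Allowed transitions; anything can be cancelled back to idle.
func canTransition(from: VoiceBotState, to: VoiceBotState) -> Bool {
    switch (from, to) {
    case (.idle, .listening),
         (.listening, .thinking),
         (.thinking, .speaking),
         (.speaking, .listening),  // barge-in
         (_, .idle):
        return true
    default:
        return false
    }
}
```

Keeping the transitions explicit makes it easy to assert that, say, TTS playback can never start while the microphone is still recording.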

Timeline Estimates

A voice bot with native STT/TTS takes about 1 week. With cloud engines (Yandex SpeechKit / ElevenLabs), streaming, and latency optimization: 3–5 weeks. Wake word and hands-free mode add another 1–2 weeks.