Mobile App Voice Assistant Implementation

NOVASOLUTIONS.TECHNOLOGY develops, supports, and maintains iOS, Android, and PWA mobile applications. We have extensive experience publishing mobile applications in popular markets such as Google Play, the App Store, the Amazon Appstore, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with; each may have its own specific features and functionality, tailored to the client's needs and goals.


Voice Assistant Implementation in Mobile Applications

A voice assistant in a mobile app isn't just a microphone button. It's a pipeline of several components: VAD (Voice Activity Detection), STT (speech-to-text), NLU/intent recognition, business-logic processing, and TTS (text-to-speech). Each component adds latency. The goal is a total latency of ≤1.5 seconds from the end of user speech to the start of the response. This is a technical constraint, not a marketing target.

Pipeline Architecture

Microphone → VAD → STT → NLU → Logic → TTS → Speaker
               ↕         ↕
           Streaming   Intent DB
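The budget arithmetic behind the ≤1.5-second target can be sketched as follows. The per-stage numbers are illustrative assumptions drawn from the ranges discussed below, not measurements:

```python
# Hypothetical latency budget for the pipeline above. Per-stage values
# are illustrative assumptions for a fully cloud-based pipeline.
PIPELINE_BUDGET_MS = 1500  # target: <=1.5 s from end of speech to response

stage_latency_ms = {
    "vad": 30,         # on-device, near-instant
    "stt_final": 800,  # cloud STT, lower bound of the typical range
    "nlu": 400,        # LLM/NLU request
    "logic": 100,      # business logic / backend API call
    "tts_start": 300,  # time to first TTS audio chunk
}

total = sum(stage_latency_ms.values())
print(f"total: {total} ms, within budget: {total <= PIPELINE_BUDGET_MS}")
# → total: 1630 ms, within budget: False
```

Even optimistic per-stage numbers overshoot the budget, which is why the streaming and parallelization tricks described below matter.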

VAD — voice activity detection prevents sending silence to STT. WebRTCVAD (native library) or SileroVAD (ONNX/TFLite, ~1 MB). VAD reduces false positives and saves API calls.
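The idea behind VAD can be shown with a minimal energy-threshold sketch. This is a simplified stand-in for WebRTC VAD or Silero VAD, which use far more robust models; the threshold value is an arbitrary assumption:

```python
# Minimal energy-threshold VAD sketch -- a simplified stand-in for
# WebRTC VAD / Silero VAD. Frames are lists of normalized samples.
def is_speech(frame: list[float], threshold: float = 0.01) -> bool:
    """Classify one audio frame by mean signal power."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold

def filter_silence(frames: list[list[float]]) -> list[list[float]]:
    """Drop silent frames so only speech is sent on to STT."""
    return [f for f in frames if is_speech(f)]

silence = [0.0] * 160        # 10 ms of silence at 16 kHz
speech = [0.3, -0.4] * 80    # crude loud frame
print(len(filter_silence([silence, speech, silence])))  # → 1
```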

STT — speech-to-text conversion. Options: the native SFSpeechRecognizer (iOS) or SpeechRecognizer (Android) for simple cases; OpenAI Whisper API or Yandex SpeechKit for Russian with high accuracy.

NLU — intent and entity extraction from text. Example: "add milk to shopping list" → intent: ADD_TO_LIST, entity: {item: milk, list: shopping}. Solutions:

  • Rasa NLU — open source, self-hosted, trains on your data. Suitable for complex domains with many intents.
  • Dialogflow ES/CX — Google cloud NLU, quick to start, good Russian language support. Paid at high volume.
  • LLM-based classification — ChatGPT / Claude API with structured output (function calling). Flexible, no training data annotation needed, more expensive at high traffic.
  • On-device BERT — MobileBERT TFLite, ~50 MB, classifies intents from fixed set. Works offline.
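The intent/entity output shape from the "add milk to shopping list" example can be illustrated with a toy pattern-based parser. The pattern and intent name here are assumptions matching the example above; a production system would use one of the NLU options just listed:

```python
import re

# Toy pattern-based NLU -- illustrates the intent/entity output shape.
# Real systems use Rasa, Dialogflow, or an LLM instead of regexes.
PATTERNS = {
    "ADD_TO_LIST": re.compile(r"add (?P<item>.+) to (?P<list>\w+) list"),
}

def parse_intent(text: str) -> dict:
    for intent, pattern in PATTERNS.items():
        m = pattern.search(text.lower())
        if m:
            return {"intent": intent, "entities": m.groupdict()}
    return {"intent": "UNKNOWN", "entities": {}}

print(parse_intent("Add milk to shopping list"))
# → {'intent': 'ADD_TO_LIST', 'entities': {'item': 'milk', 'list': 'shopping'}}
```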

Intent Recognition: What Actually Works

For applications with a limited domain (smart home, online banking, navigation) — Rasa NLU or Dialogflow with explicit intents; 50–200 training examples per intent are usually sufficient.

For open domain — LLM with system prompt describing available actions. LLM returns JSON via function calling:

{
  "intent": "navigate_to",
  "destination": "Pushkin restaurant",
  "time": null
}

LLM request latency: 400–800 ms for gpt-4o-mini, 200–400 ms for Claude Haiku. Add STT (800–1500 ms in the cloud) and TTS (~300 ms). Total: roughly 1.3–2.6 seconds. On the edge of comfortable.

Optimization: start the LLM request in parallel during the last 200 ms of STT (before the final result arrives), and cache frequent intents locally.

Conversation Context

A voice assistant without context memory breaks on the second question: "Who is Gazprom's CEO?" — answer. "What about his wife?" — without context, unclear whose wife. Context is an array of last N messages, passed to each LLM request or Dialogflow session.

Mobile-side context management: a ConversationStore singleton (on iOS, with an @Published message list). Keep a maximum of 10–15 recent messages; ~2000 tokens of context is sufficient for most dialogues.
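The trimming logic is the same in any language; a minimal language-agnostic sketch of keeping only the last N messages (the class and field names are illustrative, not the store described above):

```python
# Context window sketch: keep only the last N messages, mirroring the
# 10-15 message limit suggested above. A real implementation might trim
# by token count instead of message count.
MAX_MESSAGES = 15

class ConversationContext:
    def __init__(self):
        self.messages: list[dict] = []

    def add(self, role: str, text: str) -> None:
        self.messages.append({"role": role, "text": text})
        if len(self.messages) > MAX_MESSAGES:
            # drop the oldest messages beyond the window
            del self.messages[: len(self.messages) - MAX_MESSAGES]

ctx = ConversationContext()
for i in range(20):
    ctx.add("user", f"message {i}")
print(len(ctx.messages), ctx.messages[0]["text"])  # → 15 message 5
```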

Wake Word (Optional)

"Hey, [AppName]" without button press — works via PorcupineManager from Picovoice. On-device, custom wake word, ~500 KB model. Battery consumption — ~1.5% per hour on modern devices. On iOS requires background audio session, which Apple checks during review.
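Regardless of the detector vendor, the app-side logic is a small gating state machine: after the wake word fires, open a listening window for a few seconds, then return to idle. A vendor-agnostic sketch (the detector itself is abstracted away; the window length and injectable clock are assumptions for illustration):

```python
import time

# Generic wake-word gating state machine. The actual detector
# (e.g. Porcupine) is assumed to invoke on_wake_word() via a callback.
class WakeWordGate:
    def __init__(self, window_s: float = 5.0, clock=time.monotonic):
        self.window_s = window_s
        self._clock = clock       # injectable for testing
        self._open_until = 0.0

    def on_wake_word(self) -> None:
        """Open the listening window when the detector fires."""
        self._open_until = self._clock() + self.window_s

    def is_listening(self) -> bool:
        """While True, microphone audio is forwarded to the STT pipeline."""
        return self._clock() < self._open_until

# Demo with a fake clock to make the timing deterministic.
t = [0.0]
gate = WakeWordGate(window_s=5.0, clock=lambda: t[0])
gate.on_wake_word()
t[0] = 3.0
print(gate.is_listening())  # → True
t[0] = 6.0
print(gate.is_listening())  # → False
```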

Case Study

Corporate assistant for field employees: voice task and CRM request creation without unlocking phone. Stack: SileroVAD on-device → Yandex SpeechKit streaming → Rasa NLU (self-hosted, 23 intents) → CRM REST API → Yandex SpeechKit TTS. Latency from end of speech to start of response: median 1.1 seconds, p95 2.3 seconds. Rasa NLU on own server provided full data control.

Timeline

A pipeline with STT, NLU over a fixed intent set, and TTS takes 2–3 weeks. With a wake word, conversation context, and business-logic integration — 4–6 weeks. Cost is estimated individually.