Mobile App Speech-to-Text Implementation

NOVASOLUTIONS.TECHNOLOGY develops, supports, and maintains iOS, Android, and PWA mobile applications. We have extensive experience publishing mobile applications to popular stores such as Google Play, App Store, Amazon, AppGallery, and others.
Development and support of all types of mobile applications:

  • Information and entertainment: news apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators.
  • E-commerce: online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
  • Business process management: CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems.
  • Electronic services: classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms.

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.

Speech-to-Text Implementation in Mobile Applications

Speech-to-Text on mobile devices falls into two scenarios: online cloud-based (better accuracy, requires internet) and on-device (works offline, sufficient for most production tasks). This choice determines not only the architecture but also operational costs.

Native APIs: SFSpeechRecognizer and Android STT

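iOS: SFSpeechRecognizer is the built-in speech recognition API, available since iOS 10. Since iOS 13 it supports on-device mode (requiresOnDeviceRecognition = true) for 11 languages; whether a given locale qualifies on a given device can be checked at runtime via supportsOnDeviceRecognition. For Russian, only cloud mode is available (requests go to Apple servers). Cloud-mode limits: about 1 minute of audio per request and approximately 1000 requests per day per device without a paid agreement. Apple does not publish exact quotas, so treat the request figure as an estimate.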

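import Speech

// The initializer returns nil if the locale is not supported.
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "ru-RU"))
let request = SFSpeechAudioBufferRecognitionRequest()
request.requiresOnDeviceRecognition = false // Russian uses cloud
request.shouldReportPartialResults = true // Deliver interim text while the user speaks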

Setting shouldReportPartialResults = true is critical for UX: users see text as they speak rather than waiting for the recording to end.
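
Rather than hard-coding the mode per language, the on-device decision can be made at runtime. The sketch below is one possible way to do it, not a definitive implementation: the helper name makeRecognitionRequest is hypothetical, and it assumes the app is happy to fall back to Apple's servers whenever the recognizer reports no on-device support.

```swift
import Speech

/// Builds a streaming request that uses on-device recognition when the
/// recognizer's locale supports it and falls back to the cloud otherwise.
/// (Hypothetical helper, shown for illustration only.)
func makeRecognitionRequest(for recognizer: SFSpeechRecognizer) -> SFSpeechAudioBufferRecognitionRequest {
    let request = SFSpeechAudioBufferRecognitionRequest()
    // Partial results let the UI show text while the user is still speaking.
    request.shouldReportPartialResults = true
    if #available(iOS 13.0, macOS 10.15, *) {
        // supportsOnDeviceRecognition depends on the locale and the device,
        // so it is read at runtime instead of being assumed per language.
        request.requiresOnDeviceRecognition = recognizer.supportsOnDeviceRecognition
    }
    return request
}
```

With this approach a locale that gains on-device support in a later OS release switches to offline mode without an app update.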

Android: SpeechRecognizer uses Google Speech Services. Offline mode is requested by adding the RecognizerIntent.EXTRA_PREFER_OFFLINE extra, set to true, to the recognition intent (API 23+), but language packs must be downloaded separately and take 80–200 MB each. A Russian offline pack is available, but users may not have it installed, so the app needs a check and a fallback to online recognition.

Cloud Alternatives

When native APIs fall short (custom dictionaries, specialized terminology, high accuracy requirements):

  • OpenAI Whisper: the best accuracy among the options listed here, multilingual. The API is straightforward: a multipart POST with the audio file. Server-side latency is 1–3 seconds. For on-device use there is whisper.cpp, a community C/C++ port that builds with CMake; models take ~50–200 MB depending on model size.
  • Google Cloud Speech-to-Text v2: supports streaming over bidirectional gRPC and custom vocabulary through model adaptation (phrase sets and custom classes).
  • Yandex SpeechKit: optimized for Russian, streaming via WebSocket.
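
As an illustration of the "POST with audio file" call mentioned for Whisper, here is a minimal sketch that builds the multipart request with Foundation only. It assumes OpenAI's transcription endpoint and the whisper-1 model name; the function name makeTranscriptionRequest, its parameters, and the default MIME type are hypothetical choices for this example. In a production app the API key would normally stay on your own backend rather than in the mobile client.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking // URLRequest lives here on Linux
#endif

/// Builds a multipart/form-data request for a Whisper transcription.
/// (Hypothetical helper; endpoint and model name are assumptions.)
func makeTranscriptionRequest(audio: Data,
                              fileName: String,
                              language: String,
                              apiKey: String,
                              mimeType: String = "audio/wav",
                              boundary: String = UUID().uuidString) -> URLRequest {
    let url = URL(string: "https://api.openai.com/v1/audio/transcriptions")!
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("multipart/form-data; boundary=\(boundary)",
                     forHTTPHeaderField: "Content-Type")

    var body = Data()
    // Appends one plain text form field.
    func appendField(_ name: String, _ value: String) {
        body.append(Data("--\(boundary)\r\n".utf8))
        body.append(Data("Content-Disposition: form-data; name=\"\(name)\"\r\n\r\n".utf8))
        body.append(Data("\(value)\r\n".utf8))
    }
    appendField("model", "whisper-1")
    appendField("language", language) // ISO 639-1 code, for example "ru"

    // The audio itself goes in a file part named "file".
    body.append(Data("--\(boundary)\r\n".utf8))
    body.append(Data("Content-Disposition: form-data; name=\"file\"; filename=\"\(fileName)\"\r\n".utf8))
    body.append(Data("Content-Type: \(mimeType)\r\n\r\n".utf8))
    body.append(audio)
    body.append(Data("\r\n--\(boundary)--\r\n".utf8))

    request.httpBody = body
    return request
}
```

The resulting request can be sent with URLSession; the response is JSON containing the recognized text.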

Streaming vs Batch

Batch (record → stop → recognize) is simpler to implement but provides worse UX. Streaming displays text as users speak.

Streaming on iOS uses SFSpeechAudioBufferRecognitionRequest: a tap on the AVAudioEngine input node passes each captured buffer to append(_:):

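let inputNode = audioEngine.inputNode // audioEngine is an AVAudioEngine instance
let format = inputNode.outputFormat(forBus: 0)
// Forward every captured buffer to the recognition request.
inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
    request.append(buffer)
}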
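
For context, the tap is only one part of a streaming session: the audio session has to be configured, a recognition task started, and everything torn down at the end. The following is a minimal sketch of how those pieces might fit together. The class name DictationSession and the onText callback are hypothetical, and permissions are assumed to have been granted already.

```swift
import AVFoundation
import Speech

/// Minimal streaming dictation session (hypothetical wrapper for illustration).
final class DictationSession {
    private let audioEngine = AVAudioEngine()
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    /// Starts streaming microphone audio to the recognizer.
    /// onText receives each transcript and a flag telling whether it is final.
    func start(with recognizer: SFSpeechRecognizer,
               onText: @escaping (String, Bool) -> Void) throws {
        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.record, mode: .measurement, options: .duckOthers)
        try session.setActive(true, options: .notifyOthersOnDeactivation)

        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        self.request = request

        // Feed microphone buffers into the request as they arrive.
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }

        task = recognizer.recognitionTask(with: request) { [weak self] result, error in
            if let result = result {
                onText(result.bestTranscription.formattedString, result.isFinal)
            }
            // Tear down on error or once the final transcript has arrived.
            if error != nil || result?.isFinal == true {
                self?.stop()
            }
        }

        audioEngine.prepare()
        try audioEngine.start()
    }

    /// Stops capturing and tells the recognizer that no more audio is coming.
    func stop() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
        request = nil
        task = nil
    }
}
```

Calling stop() from a "Done" button ends the audio; the recognizer then delivers the final transcript through the same callback.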

Yandex SpeechKit streaming works over a WebSocket that carries chunked audio as 16 kHz, 16-bit PCM. Chunks of 200–400 ms balance latency and overhead; for mono audio in this format that is 6,400–12,800 bytes per chunk.
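
The chunk sizes follow from simple arithmetic: sample rate × bytes per sample × duration. A small sketch of that calculation and of splitting a recorded buffer into chunks is shown below. The function names chunkSize and splitPCM are hypothetical, and mono audio is assumed.

```swift
import Foundation

/// Number of bytes needed for `milliseconds` of mono PCM audio.
func chunkSize(milliseconds: Int, sampleRate: Int = 16_000, bytesPerSample: Int = 2) -> Int {
    return sampleRate * bytesPerSample * milliseconds / 1000
}

/// Splits raw PCM into chunks of the given duration.
/// The last chunk may be shorter; a non-positive duration yields no chunks.
func splitPCM(_ pcm: Data, chunkMilliseconds: Int) -> [Data] {
    let size = chunkSize(milliseconds: chunkMilliseconds)
    guard size > 0 else { return [] }
    var chunks: [Data] = []
    var offset = pcm.startIndex
    while offset < pcm.endIndex {
        let end = min(offset + size, pcm.endIndex)
        chunks.append(pcm.subdata(in: offset..<end))
        offset = end
    }
    return chunks
}
```

In a live stream the same chunk size would be used to decide when enough captured bytes have accumulated to send the next message.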

Case study: a warehouse voice form-filling app in which the operator dictates data hands-free. The first version used native Android STT with the offline Russian language pack. Problem: specific SKU codes (for example, 'item 7788-ABV') were recognized poorly. Solution: Yandex SpeechKit with a custom dictionary via PhraseSuggestions, loaded with 3000 SKUs. Recognition accuracy for product codes improved from 61% to 89%.

Permissions and UX

On iOS, NSMicrophoneUsageDescription and NSSpeechRecognitionUsageDescription are required in Info.plist; if either key is missing, the system terminates the app when it requests access. Request permissions before first use via AVAudioSession.sharedInstance().requestRecordPermission(_:) and SFSpeechRecognizer.requestAuthorization(_:). Without successful authorization, startRecording() will fail without an informative error message.
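
The two permission requests can be chained so the rest of the app gets a single yes/no answer. This is a minimal sketch under a few assumptions: the function name requestDictationPermissions is hypothetical, both Info.plist keys are present, and the AVAudioSession call is used even though newer iOS versions mark it deprecated in favor of AVAudioApplication.

```swift
import AVFoundation
import Speech

/// Requests speech recognition and microphone access in sequence and reports
/// on the main queue whether both were granted.
/// (Hypothetical helper, shown for illustration only.)
func requestDictationPermissions(completion: @escaping (Bool) -> Void) {
    SFSpeechRecognizer.requestAuthorization { status in
        // The callback may arrive on a background queue.
        guard status == .authorized else {
            DispatchQueue.main.async { completion(false) }
            return
        }
        AVAudioSession.sharedInstance().requestRecordPermission { granted in
            DispatchQueue.main.async { completion(granted) }
        }
    }
}
```

A typical call site enables the microphone button only when the completion handler receives true, and otherwise shows a message pointing the user to Settings.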

Always include a recording indicator in UI. An animated waveform or pulsing indicator lets users know they're being listened to.
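
One common way to drive such an indicator is to compute the signal level of each captured buffer and map it to a 0...1 value for the animation. The sketch below shows the math only; the function names rmsLevel and meterValue and the -50 dB floor are assumptions for this example. The samples would typically come from the floatChannelData of the AVAudioPCMBuffer delivered to the input tap.

```swift
import Foundation

/// Root-mean-square level of one buffer of samples in the range -1...1.
func rmsLevel(of samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Maps an RMS level to 0...1 on a decibel scale for a level meter.
/// Anything at or below `floorDecibels` (assumed -50 dB) maps to 0.
func meterValue(forRMS rms: Float, floorDecibels: Float = -50) -> Float {
    guard rms > 0 else { return 0 }
    let decibels = 20 * log10(rms)
    let clamped = max(floorDecibels, min(0, decibels))
    return 1 - clamped / floorDecibels
}
```

Using a decibel scale rather than the raw RMS value makes quiet speech visible on the meter, which matters for reassuring users that the microphone is picking them up.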

Timeline

Native STT (iOS/Android) with UI: 3–5 days. Streaming with a cloud API and custom dictionary: 1–2 weeks. Cost is calculated individually.