AI Real-Time Speech Translation for Mobile App

NOVASOLUTIONS.TECHNOLOGY is engaged in the development, support, and maintenance of iOS, Android, and PWA mobile applications. We have extensive experience and expertise in publishing mobile applications on popular markets such as Google Play, the App Store, Amazon, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.


Real-Time AI Speech Translation in Mobile Applications

Real-time speech translation is a pipeline of three processing stages fed by audio capture: transcription, translation, and voice synthesis, each with its own latency. In a good implementation, the end-to-end latency from the end of a phrase to the sound of the translated speech is 1.5–3 seconds. In a poor one it is 8–15 seconds, which is unusable for live conversation.
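The end-to-end figure can be sanity-checked by summing per-stage latencies. A minimal sketch; the stage values below are illustrative assumptions, not measurements:

```python
# Rough end-to-end latency budget for the serial pipeline.
# All stage values are assumptions for illustration.
stages_ms = {
    "vad_endpoint": 300,     # silence needed to declare end of utterance
    "stt_final": 400,        # streaming STT final result after the endpoint
    "translation": 500,      # translation API round trip
    "tts_first_audio": 600,  # time to the first synthesized audio chunk
}

total_ms = sum(stages_ms.values())
print(total_ms)  # 1800 ms, inside the 1.5-3 s "good implementation" window
```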

Pipeline Architecture

Microphone → VAD → 2-3 sec buffer → STT API → source text
                                             ↓
                                   Translation API → translated text
                                             ↓
                                        TTS API → audio → speaker

The stages can be parallelized: while TTS synthesizes the first sentence, STT is already processing the next fragment. This pipeline parallelism roughly halves end-to-end latency.
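A toy model of why overlapping stages helps, with assumed per-sentence stage times: serial processing pays every stage for every sentence, while a pipelined schedule is bounded by the slowest stage after the first sentence fills the pipeline.

```python
# Illustrative per-sentence stage times (assumptions, not measurements).
stt, mt, tts = 0.4, 0.5, 0.6  # seconds
n = 4                          # sentences in the conversation

# Serial: every sentence waits for all three stages.
serial = n * (stt + mt + tts)

# Pipelined: the first sentence pays full latency, each following
# sentence adds only the bottleneck stage (the classic pipeline bound).
pipelined = (stt + mt + tts) + (n - 1) * max(stt, mt, tts)

print(round(serial, 2), round(pipelined, 2))
```

With these numbers the pipelined schedule finishes in roughly half the serial time, matching the claim above.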

Choosing STT for Streaming

Whisper: no. Deepgram Nova-2 or Google Speech-to-Text v2 with interim_results: yes. Speech translation needs streaming STT; otherwise you wait for the full pause before anything happens.

Deepgram with interim_results=true and utterance_end_ms=1200 returns the final text within 300–500 ms after the phrase ends. That is a workable window for starting translation.
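A sketch of how those parameters translate into a connection URL for Deepgram's live-streaming WebSocket endpoint. The endpoint and query parameter names follow Deepgram's documented live API, but verify them against the current docs; the language and audio settings here are assumptions:

```python
# Building the Deepgram live-streaming URL with the parameters discussed
# above. Check parameter names against current Deepgram documentation.
from urllib.parse import urlencode

params = {
    "model": "nova-2",
    "interim_results": "true",   # partial transcripts while the user speaks
    "utterance_end_ms": "1200",  # silence threshold for end of utterance
    "language": "ru",            # assumed source language (ru -> en scenario)
    "encoding": "linear16",      # assumed raw PCM from the microphone
    "sample_rate": "16000",
}
url = "wss://api.deepgram.com/v1/listen?" + urlencode(params)
print(url)
```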

iOS Implementation

class SpeechTranslationPipeline {
    private let deepgramStreamer: DeepgramStreamer   // custom streaming STT client (not shown)
    private let translator: TranslationService       // wrapper around the translation API
    private let tts = AVSpeechSynthesizer()

    // Assumed to be wired up by the hosting view controller
    weak var sourceLabel: UILabel?
    weak var targetLabel: UILabel?

    func handleFinalTranscript(_ text: String, sourceLang: String, targetLang: String) async {
        // Start translation immediately after getting the final utterance
        async let translated = translator.translate(text, from: sourceLang, to: targetLang)

        // Simultaneously show the source text in the UI
        await MainActor.run { [weak self] in self?.sourceLabel?.text = text }

        guard let result = try? await translated else { return }

        await MainActor.run { [weak self] in self?.targetLabel?.text = result }

        // Speak the translation with the system synthesizer
        let utterance = AVSpeechUtterance(string: result)
        utterance.voice = AVSpeechSynthesisVoice(language: targetLang)
        utterance.rate = 0.52
        tts.speak(utterance)
    }
}

AVSpeechSynthesizer is the system TTS on iOS. Russian voice quality is acceptable, but noticeably worse than ElevenLabs or OpenAI TTS. For a natural voice, replace the TTS block with a cloud service and cache the responses.
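The caching layer in front of a cloud TTS can be as simple as keying synthesized audio by voice and text, so recurring phrases are billed and synthesized only once. A minimal sketch; `synthesize` is a placeholder standing in for the real provider call:

```python
# Cache in front of a cloud TTS provider: repeated phrases ("Hello",
# "How much does it cost?") hit the cache instead of the paid API.
import hashlib

_cache: dict[str, bytes] = {}

def synthesize(text: str, voice: str) -> bytes:
    # Placeholder: a real implementation calls the TTS provider here.
    return f"audio:{voice}:{text}".encode()

def cached_tts(text: str, voice: str) -> bytes:
    key = hashlib.sha256(f"{voice}|{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = synthesize(text, voice)
    return _cache[key]

first = cached_tts("Hello", "alloy")
second = cached_tts("Hello", "alloy")
print(first is second)  # True: the second call is served from the cache
```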

Audio Session Management

Capturing the microphone while playing the translation causes an AVAudioSession conflict. Use the .playAndRecord category with the .defaultToSpeaker option:

try AVAudioSession.sharedInstance().setCategory(
    .playAndRecord,
    mode: .voiceChat,
    options: [.defaultToSpeaker, .allowBluetooth]
)

The .voiceChat mode enables echo cancellation. Without it, the translation played through the speaker feeds back into the microphone and goes through transcription a second time.

Android Implementation

class SpeechTranslationPipeline @Inject constructor(
    private val deepgramStreamer: DeepgramStreamer,     // custom streaming STT client (not shown)
    private val translationRepo: TranslationRepository,
    private val tts: TextToSpeech
) {
    // In a real app, tie this scope to the screen's lifecycle
    private val coroutineScope = CoroutineScope(SupervisorJob() + Dispatchers.IO)

    // Assumed to be bound by the hosting Activity or Fragment
    var sourceTextView: TextView? = null
    var targetTextView: TextView? = null

    fun start(sourceLang: String, targetLang: String) {
        deepgramStreamer.onFinalTranscript = { text ->
            coroutineScope.launch {
                val translated = translationRepo.translate(text, targetLang)
                withContext(Dispatchers.Main) {
                    sourceTextView?.text = text
                    targetTextView?.text = translated
                }
                speakTranslation(translated, targetLang)
            }
        }
        deepgramStreamer.start()
    }

    private fun speakTranslation(text: String, lang: String) {
        tts.language = Locale.forLanguageTag(lang)
        tts.speak(text, TextToSpeech.QUEUE_FLUSH, null, null)
    }
}

Use AudioManager.MODE_IN_COMMUNICATION plus AudioRecord with the VOICE_COMMUNICATION source for correct acoustic echo cancellation (AEC) on Android. Otherwise, devices without hardware AEC produce echo.

Phrase Overlap Problem

While TTS pronounces the translation, the user may already be speaking the next phrase. If VAD doesn't account for this, the microphone picks up the voice from the speaker. Solutions:

  • Pause VAD during TTS playback
  • Or additional filtering: ignore interim results during audio playback

The second option is more reliable in practice and doesn't create awkward pauses.
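The second option can be sketched as a small gate in front of the transcript handler. Letting final results through even during playback (so a user interrupting the translation isn't silently dropped) is a design assumption here, not part of any SDK:

```python
# Gate that drops interim STT results while translated audio is playing,
# instead of pausing VAD outright.
class TranscriptGate:
    def __init__(self) -> None:
        self.tts_playing = False

    def on_tts_started(self) -> None:
        self.tts_playing = True

    def on_tts_finished(self) -> None:
        self.tts_playing = False

    def accept(self, transcript: str, is_final: bool) -> bool:
        # Interim results during playback are most likely the app's own
        # speaker output leaking into the microphone; drop them.
        if self.tts_playing and not is_final:
            return False
        return True

gate = TranscriptGate()
gate.on_tts_started()
print(gate.accept("hello", is_final=False))  # False: interim during playback
print(gate.accept("hello", is_final=True))   # True: final still accepted
```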

Providers by Language Pair

Direction | STT             | Translation      | TTS
ru → en   | Deepgram Nova-2 | DeepL            | OpenAI TTS (alloy)
en → ru   | Deepgram Nova-2 | DeepL / Google   | Yandex SpeechKit
zh → en   | Google STT      | Google Translate | Google TTS
ar → en   | AssemblyAI      | GPT-4o           | ElevenLabs

For Russian synthesis, Yandex SpeechKit is noticeably more natural than Google TTS and OpenAI TTS. That is not an opinion; it is verifiable on a test set of 50 phrases.
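The table above can be turned into a simple routing function so the pipeline picks a provider stack per language pair. The fallback stack for unlisted pairs is an assumption for illustration:

```python
# Route a language pair to the provider stack from the table above.
ROUTES = {
    ("ru", "en"): {"stt": "Deepgram Nova-2", "mt": "DeepL", "tts": "OpenAI TTS (alloy)"},
    ("en", "ru"): {"stt": "Deepgram Nova-2", "mt": "DeepL", "tts": "Yandex SpeechKit"},
    ("zh", "en"): {"stt": "Google STT", "mt": "Google Translate", "tts": "Google TTS"},
    ("ar", "en"): {"stt": "AssemblyAI", "mt": "GPT-4o", "tts": "ElevenLabs"},
}
# Assumed default for pairs not in the table.
DEFAULT = {"stt": "Google STT", "mt": "Google Translate", "tts": "Google TTS"}

def route(source: str, target: str) -> dict:
    return ROUTES.get((source, target), DEFAULT)

print(route("ru", "en")["tts"])  # OpenAI TTS (alloy)
```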

Offline Variant

For devices without a stable connection: on-device Whisper (whisper.cpp via Core ML on iOS, ONNX on Android) + ML Kit Translate + the system TTS. Latency is 3–6 seconds instead of 1.5, but it works offline.

Whisper tiny/base on an iPhone 13 via Core ML takes about 2 seconds for a 5-second fragment. Acceptable for the travel scenario.

Timeline

Streaming speech translation with cloud services on one platform takes 2–4 weeks. This includes STT, translation, and TTS integration, audio session management, network interruption handling, and a basic UI. Cross-platform on Flutter with native audio bridges: 3–5 weeks.