Real-Time AI Speech Translation in Mobile Applications
Real-time speech translation is a pipeline that stacks three independent latencies: audio capture → transcription → translation → voice synthesis. End-to-end latency, from the end of a phrase to hearing the translated speech, is 1.5–3 seconds in a good implementation. In a poor one it is 8–15 seconds, which is unusable for live conversation.
Pipeline Architecture
Microphone → VAD → 2-3 sec buffer → STT API → source text
↓
Translation API → translated text
↓
TTS API → audio → speaker
Each stage can run in parallel: while TTS synthesizes the first sentence, STT is already processing the next fragment. This pipeline parallelism roughly halves end-to-end latency.
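A minimal sketch of that stage overlap, using Kotlin coroutines and channels. The translate and synthesizeAndPlay functions are hypothetical stand-ins for the real service calls:

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.launch

// Hypothetical stage functions; real implementations wrap the translation and TTS clients.
suspend fun translate(text: String): String = TODO()
suspend fun synthesizeAndPlay(text: String) { TODO() }

fun CoroutineScope.startPipeline(finalTranscripts: Channel<String>) {
    val translations = Channel<String>(capacity = 4)
    // Stage 2: translate each final transcript as soon as it arrives
    launch {
        for (text in finalTranscripts) translations.send(translate(text))
    }
    // Stage 3: play audio while stage 2 is already working on the next fragment
    launch {
        for (text in translations) synthesizeAndPlay(text)
    }
}
```

Because the stages communicate through bounded channels, a slow TTS playback applies back-pressure instead of piling up unbounded work.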
Choosing STT for Streaming
Whisper: no. Deepgram Nova-2 or Google Speech-to-Text v2 with interim results: yes. Speech translation needs streaming STT; otherwise you wait for a full pause before anything happens.
Deepgram with interim_results=true and utterance_end_ms=1200 delivers text within 300–500 ms of the end of a phrase. That is the working window for starting translation.
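The streaming connection is configured through query parameters on Deepgram's live WebSocket endpoint. A sketch of building the URL; the encoding and sample rate are illustrative assumptions for 16 kHz PCM capture:

```kotlin
// Sketch: Deepgram live-streaming URL. Parameter names follow Deepgram's
// live API; utterance_end_ms requires interim_results to be enabled.
fun deepgramStreamUrl(language: String): String =
    "wss://api.deepgram.com/v1/listen" +
        "?model=nova-2" +
        "&language=$language" +
        "&interim_results=true" +
        "&utterance_end_ms=1200" +  // end-of-utterance signal after 1.2 s of silence
        "&encoding=linear16" +      // assumption: raw 16-bit PCM from the mic
        "&sample_rate=16000"
```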
iOS Implementation
class SpeechTranslationPipeline {
    private let deepgramStreamer: DeepgramStreamer
    private let translator: TranslationService
    private let tts: AVSpeechSynthesizer

    func handleFinalTranscript(_ text: String, sourceLang: String, targetLang: String) async {
        // Start translation immediately after receiving the final utterance
        async let translated = translator.translate(text, from: sourceLang, to: targetLang)

        // Show the source text in the UI while translation is in flight
        await MainActor.run { sourceLabel.text = text }

        guard let result = try? await translated else { return }
        await MainActor.run { targetLabel.text = result }

        // Speak the translation with the system TTS
        let utterance = AVSpeechUtterance(string: result)
        utterance.voice = AVSpeechSynthesisVoice(language: targetLang)
        utterance.rate = 0.52
        tts.speak(utterance)
    }
}
AVSpeechSynthesizer is the system TTS on iOS. Russian voice quality is acceptable, but noticeably worse than ElevenLabs or OpenAI TTS. For a natural voice, replace the TTS block with a cloud service and cache the responses.
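A sketch of the caching idea: key synthesized audio by a hash of voice plus text and keep it on disk, so repeated phrases cost zero network round-trips. fetchTtsAudio is a hypothetical wrapper around whichever cloud TTS provider is used (Kotlin shown; the same shape works in Swift):

```kotlin
import java.io.File
import java.security.MessageDigest

// Sketch of a disk cache for cloud TTS audio, keyed by (voice, text).
class TtsCache(
    private val dir: File,
    private val fetchTtsAudio: suspend (text: String, voice: String) -> ByteArray
) {
    suspend fun audioFor(text: String, voice: String): ByteArray {
        val key = MessageDigest.getInstance("SHA-256")
            .digest("$voice:$text".toByteArray())
            .joinToString("") { "%02x".format(it) }
        val file = File(dir, "$key.mp3")
        if (file.exists()) return file.readBytes()  // cache hit: no network latency
        return fetchTtsAudio(text, voice).also { file.writeBytes(it) }
    }
}
```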
Audio Session Management
Capturing from the microphone while playing the translation causes an AVAudioSession conflict. Use the .playAndRecord category with the .defaultToSpeaker option:
try AVAudioSession.sharedInstance().setCategory(
    .playAndRecord,
    mode: .voiceChat,
    options: [.defaultToSpeaker, .allowBluetooth]
)
The .voiceChat mode enables echo cancellation. Without it, the translation played through the speaker feeds back into the microphone and takes a second trip through transcription.
Android Implementation
class SpeechTranslationPipeline @Inject constructor(
    private val deepgramStreamer: DeepgramStreamer,
    private val translationRepo: TranslationRepository,
    private val tts: TextToSpeech,
    private val coroutineScope: CoroutineScope  // injected, e.g. a lifecycle-aware scope
) {
    fun start(sourceLang: String, targetLang: String) {
        deepgramStreamer.onFinalTranscript = { text ->
            coroutineScope.launch {
                // Translate off the main thread, then publish both texts to the UI
                val translated = translationRepo.translate(text, targetLang)
                withContext(Dispatchers.Main) {
                    sourceTextView.text = text
                    targetTextView.text = translated
                }
                speakTranslation(translated, targetLang)
            }
        }
        deepgramStreamer.start()
    }

    private fun speakTranslation(text: String, lang: String) {
        tts.language = Locale.forLanguageTag(lang)
        tts.speak(text, TextToSpeech.QUEUE_FLUSH, null, null)
    }
}
Set AudioManager.MODE_IN_COMMUNICATION and create AudioRecord with the VOICE_COMMUNICATION source for correct AEC (acoustic echo cancellation) on Android. Otherwise, devices without hardware AEC produce echo.
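A sketch of that capture configuration. It requires the RECORD_AUDIO permission; the 16 kHz mono PCM format is an assumption matching typical STT input:

```kotlin
import android.content.Context
import android.media.AudioFormat
import android.media.AudioManager
import android.media.AudioRecord
import android.media.MediaRecorder
import android.media.audiofx.AcousticEchoCanceler

// Sketch: capture configuration that lets the platform AEC do its job.
fun createCaptureRecord(context: Context): AudioRecord {
    val audioManager = context.getSystemService(Context.AUDIO_SERVICE) as AudioManager
    audioManager.mode = AudioManager.MODE_IN_COMMUNICATION

    val sampleRate = 16_000
    val minBuf = AudioRecord.getMinBufferSize(
        sampleRate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT
    )
    val record = AudioRecord(
        MediaRecorder.AudioSource.VOICE_COMMUNICATION,  // routes capture through the AEC path
        sampleRate, AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT, minBuf * 2
    )
    // Attach the software AEC where the device exposes one
    if (AcousticEchoCanceler.isAvailable()) {
        AcousticEchoCanceler.create(record.audioSessionId)?.setEnabled(true)
    }
    return record
}
```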
Phrase Overlap Problem
While TTS is speaking the translation, the user may already be saying the next phrase. If VAD does not account for this, the microphone picks up the voice from the speaker. Two options:
- Pause VAD during TTS playback
- Or additional filtering: ignore interim results during audio playback
The second option is more reliable in practice and does not create awkward pauses.
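A minimal sketch of the second option as a playback gate; the class and method names are illustrative, not from any SDK. On Android the gate would be toggled from UtteranceProgressListener's onStart/onDone callbacks, on iOS from the AVSpeechSynthesizerDelegate equivalents:

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Sketch: drop interim transcripts while TTS audio is audible.
class PlaybackGate {
    private val ttsPlaying = AtomicBoolean(false)

    fun onTtsStart() = ttsPlaying.set(true)
    fun onTtsDone() = ttsPlaying.set(false)

    // Interim results during playback are almost certainly echo: ignore them.
    // Final transcripts still pass, so a genuine interruption is not lost.
    fun shouldProcess(isFinal: Boolean): Boolean = isFinal || !ttsPlaying.get()
}
```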
Providers by Language Pair
| Direction | STT | Translation | TTS |
|---|---|---|---|
| ru → en | Deepgram Nova-2 | DeepL | OpenAI TTS (alloy) |
| en → ru | Deepgram Nova-2 | DeepL/Google | Yandex SpeechKit |
| zh → en | Google STT | Google Translate | Google TTS |
| ar → en | AssemblyAI | GPT-4o | ElevenLabs |
For Russian synthesis, Yandex SpeechKit is notably more natural than Google TTS and OpenAI TTS. This is not just opinion: it is verifiable on a test set of 50 phrases.
Offline Variant
For devices without a stable connection: on-device Whisper (whisper.cpp via CoreML on iOS, ONNX on Android) + ML Kit Translate + system TTS. Latency is 3–6 seconds instead of 1.5–3, but it works offline.
Whisper tiny/base on an iPhone 13 via CoreML takes about 2 seconds per 5-second fragment. Acceptable for a travel scenario.
Timeline
Streaming speech translation with cloud services on one platform takes 2–4 weeks. That includes STT, translation, and TTS integration, audio session management, network interruption handling, and a basic UI. Cross-platform on Flutter with native audio bridges: 3–5 weeks.







