Text-to-Speech Implementation in Mobile Applications
Text-to-Speech is one of the few mobile AI features where the native APIs deliver acceptable quality out of the box, with no external dependencies. iOS's AVSpeechSynthesizer and Android's TextToSpeech run on-device, support Russian, and don't require an internet connection. The real work is proper integration, queue management, and voice selection.
AVSpeechSynthesizer on iOS
The basic case is three lines of code; a production integration is more involved.
let synthesizer = AVSpeechSynthesizer()
let utterance = AVSpeechUtterance(string: text)
utterance.voice = AVSpeechSynthesisVoice(language: "ru-RU")
utterance.rate = 0.5 // 0.0–1.0, default = 0.5
synthesizer.speak(utterance)
iOS voices come in two tiers: "compact" (built in, ~50 MB) and "enhanced" (higher quality, ~300 MB download). Enhanced voices use neural synthesis. If the device hasn't downloaded the enhanced voice, AVSpeechSynthesisVoice(identifier: "com.apple.voice.enhanced.ru-RU.Milena") returns nil, so check for nil and fall back to the compact voice:
let enhanced = AVSpeechSynthesisVoice(identifier: "com.apple.voice.enhanced.ru-RU.Milena")
utterance.voice = enhanced ?? AVSpeechSynthesisVoice(language: "ru-RU")
Managing AVAudioSession is mandatory: TTS must keep working even if the app has reconfigured the session for microphone recording or music playback. Use the .playback category with the .mixWithOthers or .duckOthers option, depending on requirements.
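A minimal sketch of session setup before speaking, assuming ducking is the desired behavior (swap .duckOthers for .mixWithOthers to keep other audio at full volume):

```swift
import AVFoundation

func configureAudioSessionForTTS() throws {
    let session = AVAudioSession.sharedInstance()
    // .playback keeps speech audible even with the silent switch on;
    // .duckOthers temporarily lowers other audio (music, podcasts) while speaking.
    try session.setCategory(.playback, options: [.duckOthers])
    try session.setActive(true)
}
```

When speech finishes, deactivate the session with setActive(false, options: [.notifyOthersOnDeactivation]) so ducked audio returns to full volume.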
Android TextToSpeech: Initialization and Queue Management
TextToSpeech requires asynchronous initialization. A common mistake is calling speak() before onInit(status) reports SUCCESS.
// A val can't reference itself inside its own initializer lambda,
// so declare the variable first and assign afterwards.
lateinit var tts: TextToSpeech
tts = TextToSpeech(context) { status ->
    if (status == TextToSpeech.SUCCESS) {
        val result = tts.setLanguage(Locale("ru", "RU"))
        if (result != TextToSpeech.LANG_MISSING_DATA &&
            result != TextToSpeech.LANG_NOT_SUPPORTED) {
            // only now is it safe to call speak()
        }
    }
}
QUEUE_FLUSH interrupts the current utterance, clears the queue, and starts the new one. QUEUE_ADD appends to the queue. For sequential announcements (e.g., turn-by-turn navigation), use QUEUE_ADD; for assistant responses, use QUEUE_FLUSH so the queue doesn't build up on rapid input.
UtteranceProgressListener tracks when each utterance starts and finishes:
// Note: these callbacks arrive on a background thread;
// post UI updates to the main thread.
tts.setOnUtteranceProgressListener(object : UtteranceProgressListener() {
    override fun onStart(utteranceId: String) { /* show speaking indicator */ }
    override fun onDone(utteranceId: String) { /* hide indicator */ }
    override fun onError(utteranceId: String) { /* handle error */ }
})
Each speak() call must be given a unique utteranceId, otherwise the callbacks won't fire reliably.
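Putting the queue modes and unique IDs together, a sketch (the helper function names are illustrative):

```kotlin
import android.speech.tts.TextToSpeech
import java.util.UUID

// Assistant reply: flush whatever is still queued so responses don't pile up.
fun speakReply(tts: TextToSpeech, text: String) {
    val id = UUID.randomUUID().toString() // unique ID so progress callbacks fire
    tts.speak(text, TextToSpeech.QUEUE_FLUSH, null, id)
}

// Navigation prompt: append, so sequential instructions play in order.
fun speakInstruction(tts: TextToSpeech, text: String) {
    tts.speak(text, TextToSpeech.QUEUE_ADD, null, UUID.randomUUID().toString())
}
```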
Managing Speed and Pauses
SSML (Speech Synthesis Markup Language) is supported on iOS starting with iOS 16:
let ssml = "<speak><prosody rate='slow'>Attention</prosody>, <break time='500ms'/>next stop.</speak>"
// The initializer is failable: it returns nil for invalid SSML.
if let utterance = AVSpeechUtterance(ssmlRepresentation: ssml) {
    synthesizer.speak(utterance)
}
On Android, SSML support depends on the engine (Google TTS supports it; Samsung TTS only partially). For critical cases, split the text into multiple speak() calls and insert pauses with playSilentUtterance.
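A sketch of the splitting approach, inserting an explicit 500 ms pause instead of relying on engine-specific SSML break support:

```kotlin
import android.speech.tts.TextToSpeech
import java.util.UUID

fun speakWithPause(tts: TextToSpeech, first: String, second: String) {
    // Queue the fragments and the silence sequentially;
    // each entry needs its own utteranceId.
    tts.speak(first, TextToSpeech.QUEUE_ADD, null, UUID.randomUUID().toString())
    tts.playSilentUtterance(500, TextToSpeech.QUEUE_ADD, UUID.randomUUID().toString())
    tts.speak(second, TextToSpeech.QUEUE_ADD, null, UUID.randomUUID().toString())
}
```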
Speed adjustment matters for accessibility: expose a rate control in the app settings. Older users often prefer 0.35–0.4 instead of the default 0.5.
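One way to wire that up on iOS, assuming a hypothetical UserDefaults key for the stored preference:

```swift
import AVFoundation

let preferredRateKey = "tts.preferredRate" // hypothetical settings key

func makeUtterance(for text: String) -> AVSpeechUtterance {
    let utterance = AVSpeechUtterance(string: text)
    let stored = UserDefaults.standard.float(forKey: preferredRateKey)
    // Fall back to the system default rate (0.5) when no preference is stored.
    utterance.rate = stored > 0 ? stored : AVSpeechUtteranceDefaultSpeechRate
    return utterance
}
```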
Timeline
Basic TTS integration with queue management and voice handling takes 2–3 working days. Cost is calculated individually.