AI Text-to-Speech with Voice Selection in Mobile Applications
System TTS on iOS and Android — AVSpeechSynthesizer and TextToSpeech — handles the basic task but sounds robotic. AI TTS from ElevenLabs, OpenAI, or Yandex SpeechKit produces voices hard to distinguish from a live speaker. Integration requires smart caching and streaming playback, otherwise the 2–4 second latency before the first word kills the UX.
Providers and Characteristics
OpenAI TTS — 6 voices (alloy, echo, fable, onyx, nova, shimmer), models tts-1 (fast) and tts-1-hd (quality). Supports streaming. Russian — good. Cost: $15/1M chars for tts-1, $30/1M for tts-1-hd.
ElevenLabs — large voice library, voice cloning, multilingual v2. Streaming via WebSocket. Best quality among all providers.
Yandex SpeechKit — best Russian voices including alena, filipp. REST or gRPC. SSML for intonation, pauses, stress control.
System TTS — free, offline, zero latency, robotic. Good as fallback.
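The fallback idea can be sketched as a simple provider chain. The types and names below are illustrative, not from any SDK: each provider returns audio bytes or null, and the caller drops to system TTS when the whole chain fails.

```kotlin
// Each provider takes text and returns synthesized audio bytes,
// or null on failure (network error, quota exceeded, timeout).
typealias TtsProvider = (text: String) -> ByteArray?

// Try providers in priority order; null means "all failed" and the
// caller should hand the text to the system TTS engine instead.
fun synthesizeWithFallback(providers: List<TtsProvider>, text: String): ByteArray? {
    for (provider in providers) {
        provider(text)?.let { return it }
    }
    return null
}
```

On Android the terminal fallback would be android.speech.tts.TextToSpeech, which speaks directly instead of returning bytes, so it sits after this chain rather than inside it.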
Streaming Playback
Most important — don't wait for the full response. 500 characters on tts-1-hd takes ~2 seconds to synthesize. With streaming the user hears the first words in 300–500 ms.
iOS: Streaming via AVPlayer
import AVFoundation

class StreamingTTSPlayer {
    private var player: AVPlayer?
    private var playerItem: AVPlayerItem?
    private let apiKey: String

    init(apiKey: String) {
        self.apiKey = apiKey
    }

    func speak(text: String, voice: String = "nova") async throws {
        var request = URLRequest(url: URL(string: "https://api.openai.com/v1/audio/speech")!)
        request.httpMethod = "POST"
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        let body = ["model": "tts-1", "input": text, "voice": voice, "response_format": "mp3"]
        request.httpBody = try JSONEncoder().encode(body)
        // AVPlayer can stream the HTTP response via resourceLoader.
        // StreamingAudioAsset is a custom asset whose
        // AVAssetResourceLoaderDelegate feeds response chunks as they arrive.
        let asset = StreamingAudioAsset(request: request)
        playerItem = AVPlayerItem(asset: asset)
        player = AVPlayer(playerItem: playerItem)
        player?.play()
    }
}
For fully streaming playback you need an AVAssetResourceLoaderDelegate that feeds audio chunks as they arrive — roughly 100 lines of code, but it's the only way on iOS to start playback before the full file has downloaded.
An alternative — the AudioStreamer library, or AVPlayer with a data URI via a pipe. Simpler in practice: AVAudioPlayerNode + AVAudioEngine with manual feeding of decoded PCM buffers.
Android: ExoPlayer with Streaming
class StreamingTTSPlayer(
    private val context: Context,
    private val apiKey: String,
) {
    private val exoPlayer = ExoPlayer.Builder(context).build()

    fun speak(text: String, voice: String = "nova") {
        // Endpoint: https://api.openai.com/v1/audio/speech
        // ExoPlayer supports streaming natively via MediaSource.
        val dataSourceFactory = DefaultHttpDataSource.Factory().apply {
            setDefaultRequestProperties(mapOf(
                "Authorization" to "Bearer $apiKey",
                "Content-Type" to "application/json"
            ))
        }
        // The endpoint requires POST, so a custom DataSource is needed;
        // buildCachedUri() stands for a local cache/proxy URI that the
        // default GET-based factory can play.
        val mediaItem = MediaItem.fromUri(buildCachedUri(text, voice))
        exoPlayer.setMediaItem(mediaItem)
        exoPlayer.prepare()
        exoPlayer.play()
    }
}
ExoPlayer natively supports progressive MP3/AAC streaming. For POST requests you need a custom DataSource that performs the POST and exposes the response InputStream — ExoPlayer buffers on its own and starts playback after the first few seconds of audio arrive.
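A hedged sketch of such a DataSource against the media3 API. The class name and error handling are mine; production code also needs TransferListener notifications and proper HttpDataSourceException mapping.

```kotlin
import android.net.Uri
import androidx.media3.common.C
import androidx.media3.datasource.DataSource
import androidx.media3.datasource.DataSpec
import androidx.media3.datasource.TransferListener
import java.io.InputStream
import java.net.HttpURLConnection
import java.net.URL

// Sketch of a POST-capable DataSource for ExoPlayer (media3).
class PostDataSource(
    private val apiKey: String,
    private val jsonBody: String,
) : DataSource {
    private var connection: HttpURLConnection? = null
    private var stream: InputStream? = null
    private var uri: Uri? = null

    override fun addTransferListener(transferListener: TransferListener) { /* omitted */ }

    override fun open(dataSpec: DataSpec): Long {
        uri = dataSpec.uri
        val conn = (URL(dataSpec.uri.toString()).openConnection() as HttpURLConnection).apply {
            requestMethod = "POST"
            doOutput = true
            setRequestProperty("Authorization", "Bearer $apiKey")
            setRequestProperty("Content-Type", "application/json")
        }
        conn.outputStream.use { it.write(jsonBody.toByteArray()) }
        connection = conn
        stream = conn.inputStream
        return C.LENGTH_UNSET.toLong()   // length unknown: progressive stream
    }

    // ExoPlayer pulls chunks through read() while already playing.
    override fun read(buffer: ByteArray, offset: Int, length: Int): Int {
        val n = stream?.read(buffer, offset, length) ?: -1
        return if (n == -1) C.RESULT_END_OF_INPUT else n
    }

    override fun getUri(): Uri? = uri

    override fun close() {
        stream?.close()
        connection?.disconnect()
        stream = null
        connection = null
    }
}
```

Wire it in with `ProgressiveMediaSource.Factory { PostDataSource(apiKey, requestBody) }` instead of `MediaItem.fromUri`.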
Synthesized Audio Caching
TTS is expensive, and the same phrase with the same settings should never be synthesized twice.
// Android: disk cache keyed by SHA-256 of text + voice
class TTSCache(private val cacheDir: File) {
    fun getKey(text: String, voice: String): String =
        MessageDigest.getInstance("SHA-256")
            .digest("$text|$voice".toByteArray())
            .joinToString("") { "%02x".format(it) }

    fun get(key: String): File? {
        val file = File(cacheDir, "$key.mp3")
        return if (file.exists()) file else null
    }

    fun put(key: String, data: ByteArray) {
        File(cacheDir, "$key.mp3").writeBytes(data)
    }
}
Cache TTL — 30 days for static content (UI phrases, training text), no TTL for user content. Size limit — 50–100 MB, LRU eviction.
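The LRU eviction above can be sketched with plain JVM file operations (function name is mine). It uses lastModified() as the recency signal, so cache reads should touch the file to count as "use":

```kotlin
import java.io.File

// Evict least-recently-used cache files until total size fits the limit.
// Call file.setLastModified(System.currentTimeMillis()) on every cache hit.
fun evictLru(cacheDir: File, maxBytes: Long) {
    val files = cacheDir.listFiles { f -> f.isFile } ?: return
    var total = files.sumOf { it.length() }
    // Oldest first: delete until we are under the limit.
    for (file in files.sortedBy { it.lastModified() }) {
        if (total <= maxBytes) break
        total -= file.length()
        file.delete()
    }
}
```

Call it after every put(); with a 50–100 MB limit the scan over a few thousand files is cheap enough to run synchronously.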
Voice Selection UI
The user must hear a voice before choosing it. The pattern:
- Voice list with name and short description
- "Listen" button — plays 5-second example (cache pre-recorded samples, don't synthesize on fly)
- Selected voice saved in UserDefaults / SharedPreferences
For ElevenLabs — /v1/voices returns the available voices with metadata, including a preview_url for each. Don't synthesize — just play the ready-made sample.
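A sketch of fetching and parsing that voice list (function names are mine; org.json is bundled on Android). The network call is a plain GET with the xi-api-key header; parsing is split out so it runs without a network:

```kotlin
import java.net.HttpURLConnection
import java.net.URL
import org.json.JSONObject

// Parse a /v1/voices response body into (name, preview_url) pairs.
fun parseVoicePreviews(body: String): List<Pair<String, String>> {
    val voices = JSONObject(body).getJSONArray("voices")
    return (0 until voices.length()).map { i ->
        val v = voices.getJSONObject(i)
        v.getString("name") to v.getString("preview_url")
    }
}

// Fetch the voice list; the API key goes in the xi-api-key header.
fun fetchVoicePreviews(apiKey: String): List<Pair<String, String>> {
    val conn = URL("https://api.elevenlabs.io/v1/voices")
        .openConnection() as HttpURLConnection
    conn.setRequestProperty("xi-api-key", apiKey)
    return conn.inputStream.bufferedReader().use { reader ->
        parseVoicePreviews(reader.readText())
    }
}
```

Each preview_url points at a ready MP3 that ExoPlayer can play directly — no synthesis call, no cost.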
SSML for Fine-Tuning
Yandex SpeechKit and Google TTS support SSML:
<speak>
    Welcome to <emphasis level="strong">our service</emphasis>.
    <break time="500ms"/>
    Your order <say-as interpret-as="cardinal">12345</say-as> is ready.
</speak>
<break>, <prosody rate="slow">, and <say-as> for numbers and dates are what separate natural output from robotic. OpenAI TTS doesn't support SSML — control pacing there with punctuation and phrasing in the input text instead.
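For the providers that do accept SSML, a tiny builder avoids hand-concatenating tags and forgetting to escape user text. A minimal sketch (function names are mine):

```kotlin
// Escape XML special characters so user text can't break the markup.
fun escapeXml(s: String): String = s
    .replace("&", "&amp;")
    .replace("<", "&lt;")
    .replace(">", "&gt;")

// Assemble pre-built fragments into a <speak> document.
fun ssml(vararg parts: String): String =
    parts.joinToString("", prefix = "<speak>", postfix = "</speak>")

fun text(s: String) = escapeXml(s)
fun pause(ms: Int) = """<break time="${ms}ms"/>"""
fun emphasis(s: String) = """<emphasis level="strong">${escapeXml(s)}</emphasis>"""
```

Usage: `ssml(text("Welcome to "), emphasis("our service"), text("."), pause(500))`.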
Timeline
Basic integration of one provider with voice selection UI — 4–6 days. Streaming playback + disk cache + system TTS fallback — another 5–7 days.