AI Text-to-Speech with Voice Selection for Mobile App

NOVASOLUTIONS.TECHNOLOGY is engaged in the development, support, and maintenance of iOS, Android, and PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular markets like Google Play, App Store, Amazon, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.


AI Text-to-Speech with Voice Selection in Mobile Applications

The system TTS engines on iOS and Android (AVSpeechSynthesizer and TextToSpeech) handle the basic task but sound robotic. AI TTS from ElevenLabs, OpenAI, or Yandex SpeechKit produces voices that are hard to distinguish from a live speaker. Integration, however, requires smart caching and streaming playback; otherwise a 2–4 second delay before the first word kills the UX.

Providers and Characteristics

OpenAI TTS offers six voices (alloy, echo, fable, onyx, nova, shimmer) and two models: tts-1 (fast) and tts-1-hd (higher quality). Supports streaming. Russian quality is good. Cost: $15 per 1M characters for tts-1, $30 per 1M for tts-1-hd.

ElevenLabs has a large voice library, voice cloning, and a multilingual v2 model. Streaming is available via WebSocket. The best quality among the listed providers.

Yandex SpeechKit has the best Russian voices, including alena and filipp. API access via REST or gRPC. SSML support gives control over intonation, pauses, and stress.

System TTS is free, offline, zero-latency, and robotic. Good as a fallback.
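A common pattern is to try the cloud providers in order of quality and fall back to system TTS when none is reachable. A minimal sketch of that routing logic, assuming a hypothetical availability check (the enum values and the `isAvailable` callback are illustrative, not any real SDK):

```kotlin
// Illustrative fallback chain: try cloud providers in preference order,
// fall back to the free, offline system TTS engine if all of them fail.
enum class TTSProvider { ELEVENLABS, OPENAI, YANDEX, SYSTEM }

class TTSRouter(
    private val preferred: List<TTSProvider>,
    // Stand-in for a real health/quota/network check
    private val isAvailable: (TTSProvider) -> Boolean
) {
    // SYSTEM is always appended as the last resort
    fun pick(): TTSProvider =
        (preferred + TTSProvider.SYSTEM).firstOrNull(isAvailable)
            ?: TTSProvider.SYSTEM
}
```

The router is deliberately synchronous and stateless; a production version would also remember recent failures to avoid re-probing a dead provider on every utterance.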

Streaming Playback

The most important point: don't wait for the full response. 500 characters on tts-1-hd take ~2 seconds to synthesize; with streaming the user hears the first words in 300–500 ms.

iOS: Streaming via AVPlayer

import AVFoundation

class StreamingTTSPlayer {
    private var player: AVPlayer?
    private var playerItem: AVPlayerItem?
    private let apiKey: String

    init(apiKey: String) {
        self.apiKey = apiKey
    }

    func speak(text: String, voice: String = "nova") async throws {
        var request = URLRequest(url: URL(string: "https://api.openai.com/v1/audio/speech")!)
        request.httpMethod = "POST"
        request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")

        let body = ["model": "tts-1", "input": text, "voice": voice, "response_format": "mp3"]
        request.httpBody = try JSONEncoder().encode(body)

        // AVPlayer can stream an HTTP response via resourceLoader;
        // StreamingAudioAsset is a custom wrapper around an
        // AVAssetResourceLoaderDelegate that feeds chunks as they arrive
        let asset = StreamingAudioAsset(request: request)
        playerItem = AVPlayerItem(asset: asset)
        player = AVPlayer(playerItem: playerItem)
        player?.play()
    }
}

For fully streamed playback you need an AVAssetResourceLoaderDelegate that feeds audio chunks to the player as they are received. That is roughly 100 lines of code, but it is the only way on iOS to start playback before the full file has arrived.

An alternative is the AudioStreamer library, or AVPlayer fed through a local pipe. Simpler in practice: AVAudioPlayerNode plus AVAudioEngine with manual feeding of decoded PCM buffers.

Android: ExoPlayer with Streaming

class StreamingTTSPlayer(
    private val context: Context,
    private val apiKey: String
) {
    private val exoPlayer = ExoPlayer.Builder(context).build()

    fun speak(text: String, voice: String = "nova") {
        // ExoPlayer supports progressive streaming natively via MediaSource,
        // but MediaItem.fromUri only issues GET requests
        val dataSourceFactory = DefaultHttpDataSource.Factory().apply {
            setDefaultRequestProperties(mapOf(
                "Authorization" to "Bearer $apiKey",
                "Content-Type" to "application/json"
            ))
        }
        // For POST requests use a custom DataSource; buildCachedUri is a
        // local helper resolving to a cached file or proxy URI
        val mediaItem = MediaItem.fromUri(buildCachedUri(text, voice))
        exoPlayer.setMediaItem(mediaItem)
        exoPlayer.prepare()
        exoPlayer.play()
    }
}

ExoPlayer natively supports progressive MP3/AAC streaming. For POST requests you need a custom DataSource that performs the POST and exposes the response as an InputStream; ExoPlayer handles buffering itself and starts playback after the first few seconds of audio.

Synthesized Audio Caching

TTS is expensive; the same phrase with the same settings should never be synthesized twice.

// Android: disk cache keyed by SHA-256 of text + voice
import java.io.File
import java.security.MessageDigest

class TTSCache(private val cacheDir: File) {
    fun getKey(text: String, voice: String): String =
        MessageDigest.getInstance("SHA-256")
            .digest("$text|$voice".toByteArray())
            .joinToString("") { "%02x".format(it) }

    fun get(key: String): File? {
        val file = File(cacheDir, "$key.mp3")
        return if (file.exists()) file else null
    }

    fun put(key: String, data: ByteArray) {
        File(cacheDir, "$key.mp3").writeBytes(data)
    }
}

Cache TTL: 30 days for static content (UI phrases, training texts), no TTL for user content. Size limit: 50–100 MB with LRU eviction.
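The size limit with LRU eviction can be sketched on top of the same disk cache. This sketch uses lastModified() as the recency signal (an assumption; a real app might track access times in a database, and get() should call setLastModified() on hit to keep the signal fresh):

```kotlin
import java.io.File

// Delete least-recently-used .mp3 files until the cache fits the byte limit.
fun evictLru(cacheDir: File, maxBytes: Long) {
    val files = cacheDir.listFiles { f -> f.extension == "mp3" } ?: return
    var total = files.sumOf { it.length() }
    // Oldest first; stop as soon as we are under the limit
    for (file in files.sortedBy { it.lastModified() }) {
        if (total <= maxBytes) break
        total -= file.length()
        file.delete()
    }
}
```

Run it after every put(), or on a background schedule, so a burst of long syntheses cannot blow past the 50–100 MB budget for long.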

Voice Selection UI

The user must hear a voice before choosing it. The pattern:

  1. Voice list with name and short description
  2. "Listen" button that plays a 5-second sample (cache pre-recorded samples, don't synthesize on the fly)
  3. The selected voice is saved in UserDefaults / SharedPreferences

For ElevenLabs, /v1/voices returns the available voices with metadata, including a preview_url. Don't synthesize; just play the ready-made sample.
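The three steps above map onto a small model plus a persistence interface. A minimal sketch (the field names mirror typical provider metadata but are illustrative; the Android implementation of VoiceStore would wrap SharedPreferences):

```kotlin
// Minimal voice model for the selection screen. previewUrl points at a
// pre-recorded sample so no synthesis happens while browsing.
data class Voice(
    val id: String,          // provider voice id, e.g. "nova"
    val displayName: String,
    val description: String,
    val previewUrl: String?
)

// Persisting the choice behind an interface keeps UI code platform-agnostic.
interface VoiceStore {
    fun save(voiceId: String)
    fun load(): String?
}

// In-memory implementation, useful for tests and previews.
class InMemoryVoiceStore : VoiceStore {
    private var selected: String? = null
    override fun save(voiceId: String) { selected = voiceId }
    override fun load(): String? = selected
}
```

On first launch, load() returning null is the signal to show the voice picker before the first synthesis.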

SSML for Fine-Tuning

Yandex SpeechKit and Google TTS support SSML:

<speak>
  Welcome to <emphasis level="strong">our service</emphasis>.
  <break time="500ms"/>
  Your order <say-as interpret-as="cardinal">12345</say-as> is ready.
</speak>

<break>, <prosody rate="slow">, and <say-as> for numbers and dates are what separate natural-sounding speech from robotic. OpenAI TTS does not support SSML; pauses and pacing have to be approximated with punctuation and phrasing of the input text.
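For the providers that do accept SSML, building the markup programmatically avoids escaping mistakes in user-supplied text. A minimal builder covering the elements from the example above (a sketch, not a provider SDK):

```kotlin
// Tiny SSML builder: text is XML-escaped, breaks and emphasis are emitted
// as elements, and build() wraps everything in <speak>.
class SsmlBuilder {
    private val sb = StringBuilder()

    private fun escape(s: String) = s
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

    fun text(s: String) = apply { sb.append(escape(s)) }
    fun breakMs(ms: Int) = apply { sb.append("<break time=\"${ms}ms\"/>") }
    fun emphasis(s: String, level: String = "strong") =
        apply { sb.append("<emphasis level=\"$level\">${escape(s)}</emphasis>") }

    fun build() = "<speak>$sb</speak>"
}
```

Usage: SsmlBuilder().text("Welcome to ").emphasis("our service").breakMs(500).build(). Note that escaping matters: an unescaped "&" in user text produces invalid XML and a provider-side error.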

Timeline

Basic integration of one provider with a voice selection UI takes 4–6 days. Streaming playback, disk caching, and a system TTS fallback add another 5–7 days.