OpenAI TTS Speech Generation Integration for Mobile App

NOVASOLUTIONS.TECHNOLOGY develops, supports, and maintains iOS, Android, and PWA mobile applications. We have extensive experience publishing mobile applications on popular marketplaces such as Google Play, the App Store, Amazon Appstore, and AppGallery.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.


OpenAI TTS Integration for Speech Generation in Mobile Applications

OpenAI TTS is the simplest TTS provider to integrate, with good quality: one endpoint, six voices, two request modes (plain REST and streaming), and support for 57 languages. The main nuance is organizing caching and streaming playback correctly; otherwise the 1–3 second latency before the first word annoys users.

API and Parameters

POST https://api.openai.com/v1/audio/speech
Authorization: Bearer {api_key}
Content-Type: application/json

{
  "model": "tts-1-hd",
  "input": "Your text here",
  "voice": "nova",
  "response_format": "mp3",
  "speed": 1.0
}

Models:

  • tts-1 — faster, slightly lower quality, cheaper ($15 per 1M characters)
  • tts-1-hd — higher quality, ~30% slower, more expensive ($30 per 1M characters)
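The pricing above turns into a simple back-of-envelope estimate. A minimal sketch; the helper name is ours, and the per-character prices are assumed to be the ones listed:

```swift
// Rough cost estimate from the prices above
// (USD per 1M characters: tts-1 = $15, tts-1-hd = $30)
func ttsCostUSD(characters: Int, hd: Bool) -> Double {
    let perMillionChars = hd ? 30.0 : 15.0
    return Double(characters) / 1_000_000 * perMillionChars
}
```

For example, a 10,000-character article costs about $0.15 on tts-1 and $0.30 on tts-1-hd, so for most apps the API bill is dominated by repeated synthesis of the same phrases, which is exactly what caching (below) eliminates.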

Voices: alloy (neutral), echo (soft male), fable (British), onyx (deep male), nova (lively female), shimmer (calm female).

For Russian, nova and shimmer sound the most natural.

speed: 0.25–4.0, default 1.0. Values above ~1.3 start to break prosody.
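Since the API rejects values outside that range, clamping locally before sending is a cheap safeguard. A one-line sketch; the helper name is ours:

```swift
// Keep the requested speed inside the API's accepted 0.25–4.0 range
func clampSpeed(_ requested: Double) -> Double {
    min(max(requested, 0.25), 4.0)
}
```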

Non-Streaming Implementation (Short Texts)

// iOS: load and play (assumes `apiKey` and a retained `audioPlayer`
// property on the enclosing class; requires AVFoundation)
struct TTSSpeechRequest: Encodable {
    let model: String
    let input: String
    let voice: String
    let responseFormat: String

    enum CodingKeys: String, CodingKey {
        case model, input, voice
        case responseFormat = "response_format"
    }
}

func speak(text: String, voice: String = "nova") async throws {
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/audio/speech")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")

    let body = TTSSpeechRequest(model: "tts-1", input: text, voice: voice, responseFormat: "mp3")
    request.httpBody = try JSONEncoder().encode(body)

    let (data, _) = try await URLSession.shared.data(for: request)
    audioPlayer = try AVAudioPlayer(data: data)   // keep a strong reference, or playback stops
    audioPlayer?.play()
}

For short phrases (up to ~100 characters) on tts-1, latency is ~300–500 ms, which is acceptable without streaming. For long texts, streaming is needed.

Streaming Playback on Android

class OpenAITTSStreamer(private val apiKey: String, private val context: Context) {
    private val exoPlayer = ExoPlayer.Builder(context).build()
    private val client = OkHttpClient()

    fun speak(text: String, voice: String = "nova") {
        val requestBody = JSONObject().apply {
            put("model", "tts-1")
            put("input", text)
            put("voice", voice)
            put("response_format", "mp3")
        }.toString().toRequestBody("application/json".toMediaType())

        val call = client.newCall(
            Request.Builder()
                .url("https://api.openai.com/v1/audio/speech")
                .header("Authorization", "Bearer $apiKey")
                .post(requestBody)
                .build()
        )

        call.enqueue(object : Callback {
            override fun onResponse(call: Call, response: Response) {
                if (!response.isSuccessful) { /* handle HTTP error */ return }
                // Write the stream to a temp file and start playback while it downloads
                val tempFile = File(context.cacheDir, "tts_${System.currentTimeMillis()}.mp3")
                response.body!!.byteStream().use { input ->
                    tempFile.outputStream().use { output ->
                        val buffer = ByteArray(8192)
                        var bytes: Int
                        var playbackStarted = false
                        while (input.read(buffer).also { bytes = it } != -1) {
                            output.write(buffer, 0, bytes)
                            output.flush()
                            if (!playbackStarted && tempFile.length() > 32768) {
                                playbackStarted = true
                                // Start playback once the first 32 KB have arrived
                                Handler(Looper.getMainLooper()).post {
                                    exoPlayer.setMediaItem(MediaItem.fromUri(tempFile.toUri()))
                                    exoPlayer.prepare()
                                    exoPlayer.play()
                                }
                            }
                        }
                    }
                }
            }
            override fun onFailure(call: Call, e: IOException) { /* handle network error */ }
        })
    }
}

ExoPlayer can play from a file that is still being written: ProgressiveMediaSource reads data as it arrives. Latency to first sound is 400–700 ms.

Cache

// iOS: cache synthesized audio (requires CryptoKit for SHA-256)
import CryptoKit

class TTSCache {
    private let cacheURL: URL

    init() {
        cacheURL = FileManager.default.urls(for: .cachesDirectory, in: .userDomainMask)[0]
            .appendingPathComponent("tts_cache")
        try? FileManager.default.createDirectory(at: cacheURL, withIntermediateDirectories: true)
    }

    func key(text: String, voice: String) -> String {
        let digest = SHA256.hash(data: Data("\(text)|\(voice)".utf8))
        return digest.map { String(format: "%02x", $0) }.joined()
    }

    func get(_ key: String) -> Data? {
        let url = cacheURL.appendingPathComponent(key + ".mp3")
        return try? Data(contentsOf: url)
    }

    func set(_ key: String, data: Data) {
        let url = cacheURL.appendingPathComponent(key + ".mp3")
        try? data.write(to: url)
    }
}

Check the cache before every TTS request; a cache hit means instant playback. For fixed app UI phrases (greetings, hints), pre-generate the audio and cache it indefinitely.
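The cache-first flow can be sketched as follows. This is a simplified, runnable illustration: the in-memory dictionary stands in for the file-based TTSCache above, and fetchAudio is a hypothetical stand-in for the /v1/audio/speech request:

```swift
import Foundation

// Simplified stand-ins: a dictionary instead of the file-based TTSCache,
// and a fake fetch instead of the real network call.
var audioCache: [String: Data] = [:]
var networkRequests = 0

func fetchAudio(text: String, voice: String) -> Data {
    networkRequests += 1                       // simulate hitting the API
    return Data("\(text)|\(voice)".utf8)       // pretend this is MP3 bytes
}

func cachedAudio(text: String, voice: String = "nova") -> Data {
    let key = "\(text)|\(voice)"               // real code: SHA-256 of this string
    if let hit = audioCache[key] { return hit } // cache hit: no network request
    let data = fetchAudio(text: text, voice: voice)
    audioCache[key] = data
    return data
}
```

The key point: repeated requests for the same text and voice never touch the network, so the second playback of any phrase is effectively instant.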

Handling Long Texts

OpenAI TTS accepts up to 4096 characters per request. Split longer texts by sentences:

func splitBySentences(_ text: String, maxLength: Int = 1000) -> [String] {
    var chunks: [String] = []
    var current = ""
    // Note: splitting on the delimiter set drops the original punctuation and
    // rejoins with ". "; a single sentence longer than maxLength still becomes
    // its own (oversized) chunk.
    for sentence in text.components(separatedBy: CharacterSet(charactersIn: ".!?\n")) {
        let trimmed = sentence.trimmingCharacters(in: .whitespaces)
        if trimmed.isEmpty { continue }
        if current.count + trimmed.count > maxLength {
            if !current.isEmpty { chunks.append(current) }
            current = trimmed
        } else {
            current += (current.isEmpty ? "" : ". ") + trimmed
        }
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}

Synthesize chunks in parallel via TaskGroup and play them sequentially; total latency is lower than with fully sequential processing.
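The parallel-synthesis, sequential-playback pattern can be sketched with a TaskGroup. Here `synthesize` is a hypothetical stand-in for the real per-chunk network call; the point is that results arrive in completion order and are re-indexed so playback order matches the text:

```swift
import Foundation

// Hypothetical stand-in for the real per-chunk /v1/audio/speech request
func synthesize(_ chunk: String) async -> Data {
    Data(chunk.utf8)
}

// Synthesize all chunks concurrently, but return them in original order
// so they can be played back sequentially.
func synthesizeAll(_ chunks: [String]) async -> [Data] {
    await withTaskGroup(of: (Int, Data).self) { group in
        for (index, chunk) in chunks.enumerated() {
            group.addTask { (index, await synthesize(chunk)) }
        }
        var ordered = [Data?](repeating: nil, count: chunks.count)
        for await (index, data) in group {   // arrives in completion order
            ordered[index] = data
        }
        return ordered.compactMap { $0 }
    }
}
```

In a real player you would start playing chunk 0 as soon as it is ready rather than waiting for the whole array, but re-indexing by position is the essential step either way.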

Timeline

REST integration with caching on one platform takes 3–4 days. Streaming playback, long-text splitting, and a voice-selection UI take 7–10 days.