OpenAI TTS Speech Generation Integration for Mobile App

NOVASOLUTIONS.TECHNOLOGY is engaged in the development, support and maintenance of iOS, Android, PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular markets like Google Play, App Store, Amazon, AppGallery and others.

OpenAI TTS Integration for Speech Generation in Mobile Applications

OpenAI TTS is the simplest provider to integrate that still delivers good quality: one endpoint, six voices, two ways to consume the response (a complete file or a chunked stream from the same REST endpoint), and support for 57 languages. The main thing to get right is caching and streaming playback; otherwise users wait 1–3 seconds before the first word, which quickly becomes annoying.

API and Parameters

POST https://api.openai.com/v1/audio/speech
Authorization: Bearer {api_key}
Content-Type: application/json

{
  "model": "tts-1-hd",
  "input": "Your text here",
  "voice": "nova",
  "response_format": "mp3",
  "speed": 1.0
}

Models:

tts-1: faster, slightly lower quality, cheaper ($15 per 1M characters)
tts-1-hd: higher quality, about 30% slower, more expensive ($30 per 1M characters)
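To make the price difference concrete, here is a minimal sketch of a cost estimate based on the two prices listed above. The `TTSModel` enum and `estimatedCostUSD` helper are hypothetical names introduced for illustration, and the sketch assumes billing follows Swift's `Character` count, which may differ slightly from how the API counts some Unicode text.

```swift
import Foundation

// Prices from the model list above, in USD per 1M characters.
enum TTSModel: String {
    case standard = "tts-1"
    case hd = "tts-1-hd"

    var usdPerMillionChars: Double {
        switch self {
        case .standard: return 15.0
        case .hd: return 30.0
        }
    }
}

// Hypothetical helper: estimated cost in USD of synthesizing `text` once.
func estimatedCostUSD(for text: String, model: TTSModel) -> Double {
    Double(text.count) * model.usdPerMillionChars / 1_000_000
}

// A 10,000-character article:
let article = String(repeating: "a", count: 10_000)
print(estimatedCostUSD(for: article, model: .standard)) // 0.15
print(estimatedCostUSD(for: article, model: .hd))       // 0.3
```

At these rates a cache hit saves the full per-character price of the phrase, so caching repeated phrases pays off quickly.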

Voices: alloy (neutral), echo (soft male), fable (British), onyx (deep male), nova (lively female), shimmer (calm female).

For Russian, nova and shimmer sound the most natural.

speed: accepts 0.25–4.0, default 1.0. Values above 1.3 start to break prosody.
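The parameters above can be validated on the client before the request is sent. The following is a sketch only: `SpeechRequest`, `SpeechRequestError`, `knownVoices`, and `makeSpeechRequest` are hypothetical names, and clamping the speed (rather than rejecting out-of-range values) is an assumed design choice.

```swift
import Foundation

// Request body for the speech endpoint; CodingKeys map to the API's snake_case field.
struct SpeechRequest: Encodable {
    let model: String
    let input: String
    let voice: String
    let responseFormat: String
    let speed: Double

    enum CodingKeys: String, CodingKey {
        case model, input, voice, speed
        case responseFormat = "response_format"
    }
}

enum SpeechRequestError: Error, Equatable {
    case emptyInput
    case unknownVoice(String)
}

// The six voices listed above.
let knownVoices: Set<String> = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]

// Hypothetical helper: builds a request, rejecting bad input and clamping speed.
func makeSpeechRequest(
    input: String,
    voice: String = "nova",
    model: String = "tts-1",
    speed: Double = 1.0
) throws -> SpeechRequest {
    guard !input.isEmpty else { throw SpeechRequestError.emptyInput }
    guard knownVoices.contains(voice) else { throw SpeechRequestError.unknownVoice(voice) }
    // Clamp to the accepted 0.25–4.0 range instead of letting the API reject the call.
    let clampedSpeed = min(max(speed, 0.25), 4.0)
    return SpeechRequest(model: model, input: input, voice: voice,
                         responseFormat: "mp3", speed: clampedSpeed)
}

// Usage: build the body and encode it for URLRequest.httpBody.
let request = try makeSpeechRequest(input: "Your text here", speed: 1.1)
let body = try JSONEncoder().encode(request)
print(String(decoding: body, as: UTF8.self))
```

Failing early on an unknown voice or empty text avoids a network round trip that would only return an error.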

Non-Streaming Implementation (Short Texts)

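import AVFoundation

// Request body; CodingKeys map responseFormat to the API's "response_format" field.
struct TTSSpeechRequest: Encodable {
    let model, input, voice, responseFormat: String
    enum CodingKeys: String, CodingKey {
        case model, input, voice
        case responseFormat = "response_format"
    }
}

// iOS: load and play. Assumes `apiKey` and `audioPlayer` (AVAudioPlayer?) are properties of the enclosing type.
func speak(text: String, voice: String = "nova") async throws {
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/audio/speech")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")

    let body = TTSSpeechRequest(model: "tts-1", input: text, voice: voice, responseFormat: "mp3")
    request.httpBody = try JSONEncoder().encode(body)

    let (data, response) = try await URLSession.shared.data(for: request)
    // On errors the body is JSON, not audio, so fail before handing it to the player.
    guard (response as? HTTPURLResponse)?.statusCode == 200 else { throw URLError(.badServerResponse) }
    audioPlayer = try AVAudioPlayer(data: data) // keep a strong reference or playback stops
    audioPlayer?.play()
}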

For short phrases (up to about 100 characters) on tts-1, latency is roughly 300–500 ms, which is acceptable without streaming. Longer texts need streaming.

Streaming Playback on Android

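import android.content.Context
import android.os.Handler
import android.os.Looper
import androidx.core.net.toUri
import androidx.media3.common.MediaItem
import androidx.media3.exoplayer.ExoPlayer
import okhttp3.Call
import okhttp3.Callback
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import okhttp3.Response
import org.json.JSONObject
import java.io.File
import java.io.IOException

// Imports assume androidx.media3 and OkHttp 4.x; adjust the packages for legacy ExoPlayer 2.
class OpenAITTSStreamer(private val apiKey: String, private val context: Context) {
    private val exoPlayer = ExoPlayer.Builder(context).build()
    private val httpClient = OkHttpClient() // reuse one client so connections are pooled
    private val mainHandler = Handler(Looper.getMainLooper())

    fun speak(text: String, voice: String = "nova") {
        val requestBody = JSONObject().apply {
            put("model", "tts-1")
            put("input", text)
            put("voice", voice)
            put("response_format", "mp3")
        }.toString().toRequestBody("application/json".toMediaType())

        val request = Request.Builder()
            .url("https://api.openai.com/v1/audio/speech")
            .header("Authorization", "Bearer $apiKey")
            .post(requestBody)
            .build()

        httpClient.newCall(request).enqueue(object : Callback {
            override fun onResponse(call: Call, response: Response) {
                response.use { resp ->
                    val body = resp.body
                    if (!resp.isSuccessful || body == null) return // error body is JSON, not audio
                    // Write the stream to a temp file and start playback while it is still downloading.
                    // A custom DataSource fed directly from OkHttp is an alternative to the temp file.
                    val tempFile = File(context.cacheDir, "tts_${System.currentTimeMillis()}.mp3")
                    var written = 0L
                    var started = false
                    body.byteStream().use { input ->
                        tempFile.outputStream().use { output ->
                            val buffer = ByteArray(8192)
                            while (true) {
                                val bytes = input.read(buffer)
                                if (bytes == -1) break
                                output.write(buffer, 0, bytes)
                                written += bytes
                                // Start playback after the first 32 KB
                                if (!started && written > 32_768) {
                                    started = true
                                    startPlayback(tempFile)
                                }
                            }
                        }
                    }
                    // Short clips never reach 32 KB, so start them once the download completes
                    if (!started) startPlayback(tempFile)
                }
            }
            override fun onFailure(call: Call, e: IOException) { /* handle error */ }
        })
    }

    // ExoPlayer must be driven from the thread it was created on (the main thread here)
    private fun startPlayback(file: File) {
        mainHandler.post {
            exoPlayer.setMediaItem(MediaItem.fromUri(file.toUri()))
            exoPlayer.prepare()
            exoPlayer.play()
        }
    }
}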

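ExoPlayer can start playing a file that is still being written, because ProgressiveMediaSource reads data incrementally; latency to first sound is 400–700 ms. One caveat to verify on your ExoPlayer version: the stock file data source determines the file length when it opens the file, so playback may stop at the bytes that existed at that moment. If that happens, a custom DataSource that reads directly from the OkHttp response is the more robust option.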

Cache

import CryptoKit
import Foundation

// iOS: cache synthesized audio on disk
class TTSCache {
    private let cacheURL: URL

    init() {
        cacheURL = FileManager.default.urls(for: .cachesDirectory, in: .userDomainMask)[0]
            .appendingPathComponent("tts_cache")
        try? FileManager.default.createDirectory(at: cacheURL, withIntermediateDirectories: true)
    }

    // Key = hex-encoded SHA-256 of text and voice, which is safe to use as a file name.
    // Add model and speed to the hashed string if they can vary in your app.
    func key(text: String, voice: String) -> String {
        let input = "\(text)|\(voice)"
        return SHA256.hash(data: Data(input.utf8)).map { String(format: "%02x", $0) }.joined()
    }

    func get(_ key: String) -> Data? {
        let url = cacheURL.appendingPathComponent(key + ".mp3")
        return try? Data(contentsOf: url)
    }

    func set(_ key: String, data: Data) {
        let url = cacheURL.appendingPathComponent(key + ".mp3")
        try? data.write(to: url)
    }
}

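Check the cache before every TTS request: a cache hit means instant playback. Pre-generate fixed UI phrases (greetings, hints) and cache them permanently. Note that iOS may purge the Caches directory under storage pressure, so phrases that must always be available belong in Application Support or the app bundle.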
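The check-then-fetch flow can be sketched as a small wrapper. Everything here is illustrative: `InMemoryAudioCache` is a hypothetical stand-in with the same `get`/`set` shape as the file-based cache so the sketch is self-contained, and `synthesize` is an assumed closure that would wrap the real network call.

```swift
import Foundation

// Hypothetical in-memory stand-in for the file-based cache.
// Not thread-safe: intended to be used from a single task or actor.
final class InMemoryAudioCache {
    private var storage: [String: Data] = [:]
    func get(_ key: String) -> Data? { storage[key] }
    func set(_ key: String, data: Data) { storage[key] = data }
}

// Returns cached audio when present; otherwise synthesizes, stores, and returns it.
func audio(
    forKey key: String,
    cache: InMemoryAudioCache,
    synthesize: () async throws -> Data
) async throws -> Data {
    if let cached = cache.get(key) {
        return cached // cache hit: no network request
    }
    let fresh = try await synthesize()
    cache.set(key, data: fresh)
    return fresh
}
```

Because a failed synthesis throws before `set` is called, errors are never cached and the next call retries the request.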

Handling Long Texts

OpenAI TTS accepts up to 4096 characters per request, so longer texts must be split, ideally on sentence boundaries so that intonation stays natural:

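import Foundation

// Splits text into chunks of at most maxLength characters, breaking on sentence boundaries.
// Original punctuation is kept because it drives TTS intonation.
// A single sentence longer than maxLength is kept whole, so keep maxLength well below 4096.
func splitBySentences(_ text: String, maxLength: Int = 1000) -> [String] {
    var chunks: [String] = []
    var current = ""
    text.enumerateSubstrings(in: text.startIndex..<text.endIndex, options: .bySentences) { sentence, _, _, _ in
        guard let trimmed = sentence?.trimmingCharacters(in: .whitespacesAndNewlines),
              !trimmed.isEmpty else { return }
        if !current.isEmpty && current.count + 1 + trimmed.count > maxLength {
            chunks.append(current)
            current = ""
        }
        current += (current.isEmpty ? "" : " ") + trimmed
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}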

Synthesize the chunks in parallel with a TaskGroup and play them sequentially; total latency is lower than synthesizing one chunk after another.
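One way to sketch the parallel step is shown below. `synthesizeInOrder` is a hypothetical helper, and `synthesize` is an assumed closure wrapping the actual TTS request. Child tasks finish in arbitrary order, so each result is placed by its chunk index.

```swift
import Foundation

// Synthesizes all chunks concurrently and returns the audio in the original chunk order.
func synthesizeInOrder(
    _ chunks: [String],
    synthesize: @escaping @Sendable (String) async throws -> Data
) async throws -> [Data] {
    try await withThrowingTaskGroup(of: (Int, Data).self) { group in
        for (index, chunk) in chunks.enumerated() {
            group.addTask {
                let data = try await synthesize(chunk)
                return (index, data)
            }
        }
        // Results arrive in completion order; store each one at its original index.
        var results = [Data?](repeating: nil, count: chunks.count)
        for try await (index, data) in group {
            results[index] = data
        }
        // Every slot is filled here, because any failure would have thrown above.
        return results.compactMap { $0 }
    }
}
```

This version waits for every chunk before returning. A variation could begin playing chunk 0 as soon as it arrives while the rest are still in flight. Parallel requests also count against the account's rate limits, so capping concurrency for very long texts may be worthwhile.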

Timeline

REST integration with caching on one platform: 3–4 days. Streaming playback, long-text splitting, and a voice selection UI: 7–10 days.