OpenAI TTS Integration for Speech Generation in Mobile Applications
OpenAI TTS is the simplest provider to integrate that still delivers good quality: one endpoint, six voices, two delivery modes (a plain REST response and streaming), and support for 57 languages. The main nuance is organizing caching and streaming playback correctly; otherwise a 1–3 second delay before the first word annoys users.
API and Parameters
POST https://api.openai.com/v1/audio/speech
Authorization: Bearer {api_key}
Content-Type: application/json

{
  "model": "tts-1-hd",
  "input": "Your text here",
  "voice": "nova",
  "response_format": "mp3",
  "speed": 1.0
}
Models:
- tts-1: faster, slightly lower quality, cheaper ($15/1M chars)
- tts-1-hd: higher quality, about 30% slower, more expensive ($30/1M chars)
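To make the price difference concrete, here is a minimal sketch of a cost estimate. The prices are copied from the list above and may be out of date; `pricePerMillionChars` and `estimatedCostUSD` are illustrative names, not part of any SDK.

```swift
import Foundation

// Prices per 1M input characters, taken from the model list above (verify before relying on them)
let pricePerMillionChars: [String: Double] = ["tts-1": 15.0, "tts-1-hd": 30.0]

/// Estimated cost in USD, or nil for a model that is not in the table.
func estimatedCostUSD(characters: Int, model: String) -> Double? {
    guard let price = pricePerMillionChars[model] else { return nil }
    return Double(characters) / 1_000_000 * price
}

// A 2,000-character article on each model
let fast = estimatedCostUSD(characters: 2_000, model: "tts-1")
let hd = estimatedCostUSD(characters: 2_000, model: "tts-1-hd")
```

For a 2,000-character text this comes to roughly $0.03 on tts-1 and $0.06 on tts-1-hd, which is why caching repeated phrases pays off quickly.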
Voices: alloy (neutral), echo (soft male), fable (British), onyx (deep male), nova (lively female), shimmer (calm female).
For Russian, nova and shimmer sound the most natural.
speed: accepts 0.25–4.0, default 1.0. Values above 1.3 start to break prosody.
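Since the voice and speed usually come from user settings, it can help to sanitize them before building the request. The sketch below is one possible approach: `sanitizedParameters` and `SpeechParameters` are hypothetical helpers, and the 1.3 ceiling is simply the prosody threshold mentioned above, not an API limit.

```swift
import Foundation

// Voice names and the speed range come from the parameter description above
let supportedVoices: Set<String> = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]
let apiSpeedRange = 0.25...4.0
let naturalSpeedCeiling = 1.3   // above this, prosody starts to break

struct SpeechParameters {
    let voice: String
    let speed: Double
}

/// Hypothetical helper: falls back to "nova" for unknown voices and clamps the speed.
/// With keepNatural == true the upper bound is 1.3 instead of the API maximum of 4.0.
func sanitizedParameters(voice: String, speed: Double, keepNatural: Bool = true) -> SpeechParameters {
    let safeVoice = supportedVoices.contains(voice) ? voice : "nova"
    let upper = keepNatural ? naturalSpeedCeiling : apiSpeedRange.upperBound
    let safeSpeed = min(max(speed, apiSpeedRange.lowerBound), upper)
    return SpeechParameters(voice: safeVoice, speed: safeSpeed)
}
```

Passing `keepNatural: false` keeps the full API range for cases where intelligibility matters more than natural intonation, such as accessibility settings.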
Non-Streaming Implementation (Short Texts)
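// iOS: load and play
import AVFoundation

struct TTSSpeechRequest: Encodable {
    let model, input, voice, responseFormat: String
    enum CodingKeys: String, CodingKey {
        case model, input, voice
        case responseFormat = "response_format" // the API expects snake_case
    }
}

// apiKey and audioPlayer are properties of the enclosing class
func speak(text: String, voice: String = "nova") async throws {
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/audio/speech")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body = TTSSpeechRequest(model: "tts-1", input: text, voice: voice, responseFormat: "mp3")
    request.httpBody = try JSONEncoder().encode(body)
    let (data, response) = try await URLSession.shared.data(for: request)
    guard (response as? HTTPURLResponse)?.statusCode == 200 else { throw URLError(.badServerResponse) }
    audioPlayer = try AVAudioPlayer(data: data) // keep a strong reference while playing
    audioPlayer?.play()
}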
For short phrases (up to 100 characters) on tts-1, latency is about 300–500 ms, which is acceptable without streaming. Longer texts need streaming.
Streaming Playback on Android
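// Assumes Media3; with ExoPlayer 2.x the packages are com.google.android.exoplayer2.*
import android.content.Context
import android.os.Handler
import android.os.Looper
import androidx.core.net.toUri
import androidx.media3.common.MediaItem
import androidx.media3.exoplayer.ExoPlayer
import okhttp3.Call
import okhttp3.Callback
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import okhttp3.Response
import org.json.JSONObject
import java.io.File
import java.io.IOException

class OpenAITTSStreamer(private val apiKey: String, private val context: Context) {
    private val exoPlayer = ExoPlayer.Builder(context).build()
    private val httpClient = OkHttpClient() // reuse one client across requests
    private val mainHandler = Handler(Looper.getMainLooper())

    fun speak(text: String, voice: String = "nova") {
        val requestBody = JSONObject().apply {
            put("model", "tts-1")
            put("input", text)
            put("voice", voice)
            put("response_format", "mp3")
        }.toString().toRequestBody("application/json".toMediaType())

        val call = httpClient.newCall(
            Request.Builder()
                .url("https://api.openai.com/v1/audio/speech")
                .header("Authorization", "Bearer $apiKey")
                .post(requestBody)
                .build()
        )
        call.enqueue(object : Callback {
            override fun onResponse(call: Call, response: Response) {
                response.use { res ->
                    val body = res.body
                    if (!res.isSuccessful || body == null) return // handle HTTP error
                    // Write the stream to a temp file and start playback once enough is buffered
                    val tempFile = File(context.cacheDir, "tts_${System.currentTimeMillis()}.mp3")
                    var written = 0L
                    var started = false
                    body.byteStream().use { input ->
                        tempFile.outputStream().use { output ->
                            val buffer = ByteArray(8192)
                            var bytes: Int
                            while (input.read(buffer).also { bytes = it } != -1) {
                                output.write(buffer, 0, bytes)
                                written += bytes
                                if (!started && written > 32 * 1024) {
                                    started = true
                                    startPlayback(tempFile) // first 32 KB are on disk
                                }
                            }
                        }
                    }
                    // Short clips never reach 32 KB: start them once fully written
                    if (!started) startPlayback(tempFile)
                }
            }

            override fun onFailure(call: Call, e: IOException) { /* handle error */ }
        })
    }

    // ExoPlayer must be used from the thread it was created on (assumed here to be main)
    private fun startPlayback(file: File) {
        mainHandler.post {
            exoPlayer.setMediaItem(MediaItem.fromUri(file.toUri()))
            exoPlayer.prepare()
            exoPlayer.play()
        }
    }
}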
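ExoPlayer's ProgressiveMediaSource loads media incrementally, which is what makes it possible to start playback from a file that is still being written; latency to first sound is 400–700 ms. Test this on target devices: the stock file data source determines the file length when it opens the file, so if playback stops before the end of the audio, feed the HTTP stream to the player through a custom DataSource instead.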
Cache
// iOS: cache synthesized audio
import CryptoKit
import Foundation

class TTSCache {
    private let cacheURL: URL

    init() {
        cacheURL = FileManager.default.urls(for: .cachesDirectory, in: .userDomainMask)[0]
            .appendingPathComponent("tts_cache")
        try? FileManager.default.createDirectory(at: cacheURL, withIntermediateDirectories: true)
    }

    // Add model and speed to the key as well if they can vary between requests
    func key(text: String, voice: String) -> String {
        let input = "\(text)|\(voice)"
        // SHA256.hash returns a digest of bytes; render it as lowercase hex
        return SHA256.hash(data: Data(input.utf8)).map { String(format: "%02x", $0) }.joined()
    }

    func get(_ key: String) -> Data? {
        let url = cacheURL.appendingPathComponent(key + ".mp3")
        return try? Data(contentsOf: url)
    }

    func set(_ key: String, data: Data) {
        let url = cacheURL.appendingPathComponent(key + ".mp3")
        try? data.write(to: url)
    }
}
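Check the cache before each TTS request: a cache hit means instant playback. For fixed app UI phrases (greetings, hints), pre-generate the audio and keep it permanently. Because the system may purge the Caches directory, store such permanent audio elsewhere, for example bundled with the app.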
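The check-then-synthesize flow can be sketched as follows. This is a simplified, synchronous illustration: `AudioStore`, `MemoryStore`, and `cachedAudio` are hypothetical names, and the `synthesize` closure stands in for the real network call, which in the app would be async.

```swift
import Foundation

/// Minimal storage abstraction so the flow can be exercised without disk or network.
protocol AudioStore {
    func get(_ key: String) -> Data?
    func set(_ key: String, data: Data)
}

/// In-memory store used only for illustration and tests.
final class MemoryStore: AudioStore {
    private var items: [String: Data] = [:]
    func get(_ key: String) -> Data? { items[key] }
    func set(_ key: String, data: Data) { items[key] = data }
}

/// Returns cached audio when present; otherwise calls `synthesize` once and stores the result.
func cachedAudio(key: String, store: AudioStore, synthesize: () throws -> Data) rethrows -> Data {
    if let hit = store.get(key) { return hit }   // cache hit: no request is made
    let fresh = try synthesize()
    store.set(key, data: fresh)
    return fresh
}

// Usage: the second call with the same key never reaches the synthesize closure
let store = MemoryStore()
let first = cachedAudio(key: "greeting|nova", store: store) { Data("audio".utf8) }
let second = cachedAudio(key: "greeting|nova", store: store) { Data("other".utf8) }
```

A file-backed cache with the same `get` and `set` signatures, such as the TTSCache class above, could adopt the protocol with an empty extension.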
Handling Long Texts
OpenAI TTS accepts up to 4096 characters per request. Split longer texts at sentence boundaries:
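import Foundation

// Splits text into chunks of at most maxLength characters, breaking only at sentence ends.
// Terminators are kept so intonation survives. A single sentence longer than maxLength
// is emitted whole, so keep maxLength well below the 4096-character API limit.
func splitBySentences(_ text: String, maxLength: Int = 1000) -> [String] {
    var chunks: [String] = []
    var current = ""
    var sentence = ""

    // Moves the finished sentence into the current chunk, or starts a new chunk if it does not fit
    func flush() {
        let trimmed = sentence.trimmingCharacters(in: .whitespaces)
        sentence = ""
        if trimmed.isEmpty { return }
        if !current.isEmpty && current.count + 1 + trimmed.count > maxLength {
            chunks.append(current)
            current = trimmed
        } else {
            current += (current.isEmpty ? "" : " ") + trimmed
        }
    }

    let chars = Array(text)
    for (i, ch) in chars.enumerated() {
        if ch.isNewline { flush(); continue }
        sentence.append(ch)
        // A terminator ends a sentence only before whitespace or end of text ("3.14" stays intact)
        let atBoundary = i + 1 == chars.count || chars[i + 1].isWhitespace
        if ".!?".contains(ch) && atBoundary { flush() }
    }
    flush()
    if !current.isEmpty { chunks.append(current) }
    return chunks
}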
Synthesize the chunks in parallel with a TaskGroup and play them sequentially in their original order; the total latency is lower than when chunks are synthesized one after another.
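A minimal sketch of the parallel step, assuming a `synthesize` closure that wraps the real request (the closure and the `synthesizeAll` name are illustrative). Child tasks finish in arbitrary order, so each result carries its chunk index and is slotted back into place before playback.

```swift
import Foundation

/// Runs `synthesize` for all chunks concurrently and returns the audio in the original chunk order.
/// `synthesize` is a placeholder for the real network call.
func synthesizeAll(
    _ chunks: [String],
    synthesize: @escaping @Sendable (String) async throws -> Data
) async throws -> [Data] {
    try await withThrowingTaskGroup(of: (Int, Data).self) { group in
        for (index, chunk) in chunks.enumerated() {
            group.addTask {
                let data = try await synthesize(chunk)
                return (index, data)
            }
        }
        // Results arrive in completion order; store each one at its chunk index
        var results = [Data?](repeating: nil, count: chunks.count)
        for try await (index, data) in group {
            results[index] = data
        }
        // Every slot is filled here, because any failed task would have thrown above
        return results.compactMap { $0 }
    }
}
```

This version waits for every chunk before returning. To cut time to first sound further, playback of the first chunk could start as soon as its slot is filled, and for very long texts it may be worth limiting how many requests run at once to stay within API rate limits.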
Timeline
REST integration with a cache on one platform: 3–4 days. Streaming playback, long-text splitting, and a voice selection UI: 7–10 days.







