ElevenLabs Integration for Speech Generation in Mobile Applications
ElevenLabs is one of only two providers with truly natural-sounding multilingual speech (the other being OpenAI TTS). For Russian, ElevenLabs with the eleven_multilingual_v2 model produces results people regularly mistake for live speech. Integration is nontrivial: the API has nuances around output formats, streaming, and character-quota management.
Basic Integration: REST
Minimal synthesis request:
POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}
xi-api-key: YOUR_KEY
Content-Type: application/json
{
"text": "Hello, this is test text",
"model_id": "eleven_multilingual_v2",
"voice_settings": {
"stability": 0.5,
"similarity_boost": 0.75,
"style": 0.0,
"use_speaker_boost": true
}
}
Response — binary audio file. Default mp3_44100_128, changeable via output_format query parameter: pcm_16000, pcm_22050, pcm_24000, pcm_44100, mp3_22050_32, mp3_44100_64, mp3_44100_128, mp3_44100_192.
For mobile playback — mp3_44100_128. For on-the-fly playback without saving — pcm_16000 fed directly to AudioTrack / AVAudioPlayerNode.
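For the Android side, the request above can be sketched with OkHttp. The function names, the jsonEscape helper, and the error handling are my assumptions; the URL, header, and body fields come from the request shown above.

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import java.io.File

// Minimal escaping so quotes/newlines in the text don't break the JSON body.
fun jsonEscape(s: String) = s
    .replace("\\", "\\\\")
    .replace("\"", "\\\"")
    .replace("\n", "\\n")

// Sketch: POST the synthesis request and save the returned MP3 to a file.
fun synthesizeToFile(apiKey: String, voiceId: String, text: String, outFile: File) {
    val body = """{"text":"${jsonEscape(text)}","model_id":"eleven_multilingual_v2",""" +
        """"voice_settings":{"stability":0.5,"similarity_boost":0.75}}"""
    val request = Request.Builder()
        .url("https://api.elevenlabs.io/v1/text-to-speech/$voiceId?output_format=mp3_44100_128")
        .header("xi-api-key", apiKey)
        .post(body.toRequestBody("application/json".toMediaType()))
        .build()
    OkHttpClient().newCall(request).execute().use { response ->
        check(response.isSuccessful) { "TTS failed: HTTP ${response.code}" }
        outFile.writeBytes(response.body!!.bytes())
    }
}
```

In a real app the OkHttpClient should be a shared singleton and the call should run off the main thread (e.g. in a coroutine on Dispatchers.IO).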
Streaming: WebSocket API
ElevenLabs supports two streaming modes: HTTP streaming (/v1/text-to-speech/{voice_id}/stream) and WebSocket (/v1/text-to-speech/{voice_id}/stream-input).
WebSocket suits dialog apps where the text is generated incrementally as the LLM responds:
// iOS: streaming via URLSessionWebSocketTask
import AVFoundation
import Foundation

struct AudioChunk: Decodable {
    let audio: String?   // base64-encoded PCM chunk
}

final class ElevenLabsStreamPlayer {
    private let apiKey: String
    private var webSocket: URLSessionWebSocketTask?
    private let audioEngine = AVAudioEngine()
    private let playerNode = AVAudioPlayerNode()
    // Float format matching the incoming 16 kHz mono stream;
    // the engine resamples to the hardware rate on the way to the mixer.
    private let format = AVAudioFormat(standardFormatWithSampleRate: 16000, channels: 1)!

    init(apiKey: String) { self.apiKey = apiKey }

    func connect(voiceId: String) {
        let url = URL(string: "wss://api.elevenlabs.io/v1/text-to-speech/\(voiceId)/stream-input?model_id=eleven_multilingual_v2&output_format=pcm_16000")!
        var request = URLRequest(url: url)
        request.setValue(apiKey, forHTTPHeaderField: "xi-api-key")
        webSocket = URLSession.shared.webSocketTask(with: request)
        webSocket?.resume()
        // Initialize the stream: first message is a single space plus voice settings
        let initMsg = #"{"text":" ","voice_settings":{"stability":0.5,"similarity_boost":0.75}}"#
        webSocket?.send(.string(initMsg)) { _ in }
        audioEngine.attach(playerNode)
        audioEngine.connect(playerNode, to: audioEngine.mainMixerNode, format: format)
        try? audioEngine.start()
        receiveAudio()
    }

    func sendText(_ chunk: String) {
        // Serialize via JSONSerialization so quotes/newlines in the chunk are escaped
        guard let data = try? JSONSerialization.data(withJSONObject: ["text": chunk]),
              let msg = String(data: data, encoding: .utf8) else { return }
        webSocket?.send(.string(msg)) { _ in }
    }

    func flush() {
        // An empty text message tells the server to flush and finish generation
        webSocket?.send(.string(#"{"text":""}"#)) { _ in }
    }

    private func receiveAudio() {
        webSocket?.receive { [weak self] result in
            guard let self = self, case .success(let message) = result else { return }
            if case .string(let text) = message,
               let data = text.data(using: .utf8),
               let json = try? JSONDecoder().decode(AudioChunk.self, from: data),
               let audioB64 = json.audio,
               let audioData = Data(base64Encoded: audioB64) {
                self.enqueueAudio(audioData)
            }
            self.receiveAudio()   // keep listening until the socket closes
        }
    }

    private func enqueueAudio(_ data: Data) {
        // Chunks are PCM 16 kHz mono, little-endian Int16.
        // AVAudioPlayerNode accepts only float buffers, so convert Int16 → Float32.
        let frameCount = AVAudioFrameCount(data.count / 2)
        guard let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: frameCount) else { return }
        buffer.frameLength = frameCount
        data.withUnsafeBytes { ptr in
            let samples = ptr.bindMemory(to: Int16.self)
            for i in 0..<Int(frameCount) {
                buffer.floatChannelData?[0][i] = Float(Int16(littleEndian: samples[i])) / 32768.0
            }
        }
        playerNode.scheduleBuffer(buffer, completionHandler: nil)
        if !playerNode.isPlaying { playerNode.play() }
    }
}
Usage pattern in a dialog assistant: as tokens arrive from GPT, call sendText(token); when the response ends, call flush(). Latency to first sound is 200–400 ms.
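An Android counterpart of the same protocol can be sketched with OkHttp's WebSocket client. The class and callback names are illustrative; the endpoint and message shapes mirror the Swift code above, and a real app should parse the server JSON with a proper parser instead of the regex shortcut used here.

```kotlin
import java.util.Base64
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.WebSocket
import okhttp3.WebSocketListener

// Builds the {"text": ...} frame; an empty chunk doubles as the flush signal.
fun textFrame(chunk: String): String {
    val escaped = chunk.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n")
    return "{\"text\":\"$escaped\"}"
}

class ElevenLabsStream(
    private val apiKey: String,
    private val onPcmChunk: (ByteArray) -> Unit   // feed this to AudioTrack
) : WebSocketListener() {
    private var socket: WebSocket? = null

    fun connect(voiceId: String) {
        val request = Request.Builder()
            .url("wss://api.elevenlabs.io/v1/text-to-speech/$voiceId/stream-input" +
                 "?model_id=eleven_multilingual_v2&output_format=pcm_16000")
            .header("xi-api-key", apiKey)
            .build()
        socket = OkHttpClient().newWebSocket(request, this)
        // First frame: a single space plus voice settings, as in the iOS example
        socket?.send("{\"text\":\" \",\"voice_settings\":{\"stability\":0.5,\"similarity_boost\":0.75}}")
    }

    fun sendText(chunk: String) { socket?.send(textFrame(chunk)) }
    fun flush() { socket?.send(textFrame("")) }

    override fun onMessage(webSocket: WebSocket, text: String) {
        // Server frames are JSON with a base64 "audio" field;
        // regex extraction is a simplification for the sketch.
        val audioB64 = Regex("\"audio\"\\s*:\\s*\"([^\"]+)\"")
            .find(text)?.groupValues?.get(1) ?: return
        onPcmChunk(Base64.getDecoder().decode(audioB64))
    }
}
```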
Voice Settings
stability (0–1): higher = more monotone. 0.3–0.5 for live speech, 0.8–1.0 for narrator reading.
similarity_boost (0–1): how closely the timbre of a cloned voice is reproduced. Too high (>0.9) can introduce artifacts.
style (0–1): only for eleven_multilingual_v2 and eleven_turbo_v2_5. Boosts emotionality. 0 for neutral.
use_speaker_boost: true — improves clarity for synthesized voices (not clones). Enable by default.
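The guidance above can be grouped into presets. This is a sketch: the preset names and the exact values are one possible reading of the recommended ranges.

```kotlin
// Voice settings grouped into presets derived from the recommended ranges.
data class VoiceSettings(
    val stability: Double,
    val similarityBoost: Double,
    val style: Double = 0.0,
    val useSpeakerBoost: Boolean = true
) {
    companion object {
        // Lively conversational speech: lower stability, neutral style
        val DIALOG = VoiceSettings(stability = 0.4, similarityBoost = 0.75)
        // Even narrator reading: high stability, no style exaggeration
        val NARRATOR = VoiceSettings(stability = 0.9, similarityBoost = 0.75)
    }
}
```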
Quota Monitoring
ElevenLabs bills by character. GET /v1/user/subscription returns character_count and character_limit. Add a check before each request: if the remaining quota is smaller than the text length, show an error or offer an upgrade.
@Serializable
data class SubscriptionInfo(
    @SerialName("character_count") val characterCount: Int,
    @SerialName("character_limit") val characterLimit: Int
)

suspend fun checkQuota(textLength: Int): Boolean {
    val response = httpClient.get("https://api.elevenlabs.io/v1/user/subscription") {
        header("xi-api-key", apiKey)
    }.body<SubscriptionInfo>()
    return (response.characterLimit - response.characterCount) >= textLength
}
Caching
Same phrase with same settings should synthesize once. Cache key: sha256(text + voice_id + stability + similarity_boost + model_id). Store files in app internal storage with 30-day TTL, LRU eviction at 100 MB limit.
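A sketch of that key scheme (the function name and the field separator are illustrative):

```kotlin
import java.security.MessageDigest

// Deterministic cache key: same text + voice + settings → same file name.
fun cacheKey(
    text: String, voiceId: String,
    stability: Double, similarityBoost: Double, modelId: String
): String {
    val input = "$text|$voiceId|$stability|$similarityBoost|$modelId"
    return MessageDigest.getInstance("SHA-256")
        .digest(input.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }
}
```

The hex digest doubles as a safe file name for the cached audio in internal storage.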
Timeline
Basic REST + playback — 2–3 days. WebSocket streaming feeding tokens from LLM — 5–7 days. Full voice selection UI + cache + quota monitoring — 10–14 days.