ElevenLabs Speech Generation Integration for Mobile App

NOVASOLUTIONS.TECHNOLOGY develops, supports, and maintains iOS, Android, and PWA mobile applications. We have extensive experience publishing mobile applications in popular markets such as Google Play, the App Store, Amazon, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with; each may have its own features and functionality, tailored to the client's specific needs and goals.


ElevenLabs Integration for Speech Generation in Mobile Applications

ElevenLabs is one of only two providers with truly natural-sounding multilingual speech, the other being OpenAI TTS. For Russian, ElevenLabs with the eleven_multilingual_v2 model produces results that listeners regularly mistake for live speech. Integration is nontrivial, though: the API has nuances around audio formats, streaming, and character-quota management.

Basic Integration: REST

Minimal synthesis request:

POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}
xi-api-key: YOUR_KEY
Content-Type: application/json

{
  "text": "Hello, this is test text",
  "model_id": "eleven_multilingual_v2",
  "voice_settings": {
    "stability": 0.5,
    "similarity_boost": 0.75,
    "style": 0.0,
    "use_speaker_boost": true
  }
}

The response is a binary audio file. The default format is mp3_44100_128; it can be changed via the output_format query parameter: pcm_16000, pcm_22050, pcm_24000, pcm_44100, mp3_22050_32, mp3_44100_64, mp3_44100_128, mp3_44100_192.

For mobile playback, mp3_44100_128 is usually enough. For on-the-fly playback without saving to disk, request pcm_16000 and feed it directly to AudioTrack (Android) or AVAudioPlayerNode (iOS).
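When pcm_16000 is chosen, the body is raw 16-bit little-endian signed PCM. Before writing it to AudioTrack on Android, the bytes must be decoded into samples; a minimal sketch (the function name is ours, and it assumes mono s16le audio, which is what the pcm_* formats return):

```kotlin
// Decode raw little-endian 16-bit PCM into signed samples
// ready for AudioTrack.write(). Assumes mono s16le audio.
fun pcmBytesToShorts(bytes: ByteArray): ShortArray {
    val samples = ShortArray(bytes.size / 2)
    for (i in samples.indices) {
        val lo = bytes[2 * i].toInt() and 0xFF // low byte, treated as unsigned
        val hi = bytes[2 * i + 1].toInt()      // high byte, keeps the sign
        samples[i] = ((hi shl 8) or lo).toShort()
    }
    return samples
}
```

The resulting samples go to an AudioTrack configured with ENCODING_PCM_16BIT at a 16 000 Hz sample rate.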

Streaming: WebSocket API

ElevenLabs supports two streaming types: streaming HTTP (/v1/text-to-speech/{voice_id}/stream) and WebSocket (/v1/text-to-speech/{voice_id}/stream-input).

WebSocket suits dialog apps where the text is generated as the LLM responds:

// iOS: streaming via URLSessionWebSocketTask
import Foundation
import AVFoundation

// Server messages arrive as JSON with base64-encoded audio
struct AudioChunk: Decodable {
    let audio: String?
}

class ElevenLabsStreamPlayer {
    private var webSocket: URLSessionWebSocketTask?
    private let audioEngine = AVAudioEngine()
    private let playerNode = AVAudioPlayerNode()
    private let apiKey: String
    // Matches output_format=pcm_16000: 16 kHz mono, signed 16-bit
    private let pcmFormat = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: 16000, channels: 1, interleaved: false)!

    init(apiKey: String) {
        self.apiKey = apiKey
    }

    func connect(voiceId: String) {
        let url = URL(string: "wss://api.elevenlabs.io/v1/text-to-speech/\(voiceId)/stream-input?model_id=eleven_multilingual_v2&output_format=pcm_16000")!
        var request = URLRequest(url: url)
        request.setValue(apiKey, forHTTPHeaderField: "xi-api-key")
        webSocket = URLSession.shared.webSocketTask(with: request)
        webSocket?.resume()

        // Initialize the stream: the protocol expects a first message whose text is a single space
        let initMsg = #"{"text":" ","voice_settings":{"stability":0.5,"similarity_boost":0.75}}"#
        webSocket?.send(.string(initMsg)) { _ in }

        audioEngine.attach(playerNode)
        // Connect with the explicit PCM format so scheduled int16 buffers match the node;
        // on some audio routes an AVAudioConverter to float32 may still be required
        audioEngine.connect(playerNode, to: audioEngine.mainMixerNode, format: pcmFormat)
        try? audioEngine.start()

        receiveAudio()
    }

    func sendText(_ chunk: String) {
        // Serialize instead of interpolating so quotes and newlines are escaped
        guard let payload = try? JSONSerialization.data(withJSONObject: ["text": chunk]),
              let msg = String(data: payload, encoding: .utf8) else { return }
        webSocket?.send(.string(msg)) { _ in }
    }

    func flush() {
        // An empty text field signals end of input; the server finishes and closes
        webSocket?.send(.string(#"{"text":""}"#)) { _ in }
    }

    private func receiveAudio() {
        webSocket?.receive { [weak self] result in
            if case .success(.string(let text)) = result,
               let data = text.data(using: .utf8),
               let json = try? JSONDecoder().decode(AudioChunk.self, from: data),
               let audioB64 = json.audio,
               let audioData = Data(base64Encoded: audioB64) {
                self?.enqueueAudio(audioData)
            }
            self?.receiveAudio()
        }
    }

    private func enqueueAudio(_ data: Data) {
        // Raw little-endian int16 PCM → AVAudioPCMBuffer
        let frameCount = AVAudioFrameCount(data.count / 2)
        guard let buffer = AVAudioPCMBuffer(pcmFormat: pcmFormat, frameCapacity: frameCount) else { return }
        buffer.frameLength = frameCount
        data.withUnsafeBytes { ptr in
            buffer.int16ChannelData?[0].update(from: ptr.bindMemory(to: Int16.self).baseAddress!, count: Int(frameCount))
        }
        playerNode.scheduleBuffer(buffer, completionHandler: nil)
        if !playerNode.isPlaying { playerNode.play() }
    }
}

Usage pattern in a dialog assistant: as tokens arrive from GPT, call sendText(token); when the response ends, call flush(). Latency to first sound is 200–400 ms.
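On Android the same stream-input protocol can be driven over any WebSocket client (OkHttp, Ktor). The message framing is simple enough to build by hand; a sketch of the three message types, with a minimal JSON escaper (helper names are ours):

```kotlin
// Minimal JSON string escaping for text chunks (quotes, backslashes, control whitespace)
fun jsonEscape(s: String): String = buildString {
    for (c in s) when (c) {
        '\\' -> append("\\\\")
        '"'  -> append("\\\"")
        '\n' -> append("\\n")
        '\r' -> append("\\r")
        '\t' -> append("\\t")
        else -> append(c)
    }
}

// First message: a single-space text plus voice settings opens the stream
fun initMessage(stability: Double, similarity: Double): String =
    """{"text":" ","voice_settings":{"stability":$stability,"similarity_boost":$similarity}}"""

// Each LLM token (or sentence) goes out as its own text message
fun textMessage(chunk: String): String = """{"text":"${jsonEscape(chunk)}"}"""

// An empty text field signals end of input
fun flushMessage(): String = """{"text":""}"""
```

These strings are then passed to the client's send method, e.g. OkHttp's `WebSocket.send(String)`.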

Voice Settings

stability (0–1): higher values give flatter, more monotone delivery. Use 0.3–0.5 for lively conversational speech, 0.8–1.0 for narrator-style reading.

similarity_boost (0–1): how closely the output reproduces the cloned voice's timbre. Values that are too high (>0.9) can introduce artifacts.

style (0–1): only for eleven_multilingual_v2 and eleven_turbo_v2_5; increases expressiveness. Keep it at 0 for neutral delivery.

use_speaker_boost: true improves clarity for synthesized voices (not clones). Enable it by default.
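The settings above can be bundled into presets for the two common regimes; a sketch (the preset names and exact values are our choices within the ranges above):

```kotlin
data class VoiceSettings(
    val stability: Double,
    val similarityBoost: Double,
    val style: Double = 0.0,          // neutral by default
    val useSpeakerBoost: Boolean = true,
)

// Lively, conversational delivery for dialog assistants
val CONVERSATIONAL = VoiceSettings(stability = 0.4, similarityBoost = 0.75)

// Even, monotone delivery for long-form narration
val NARRATOR = VoiceSettings(stability = 0.9, similarityBoost = 0.75)
```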

Quota Monitoring

ElevenLabs bills by character. GET /v1/user/subscription returns character_count and character_limit. Add a check before each request: if the remaining quota is smaller than the text length, show an error or offer an upgrade.

@Serializable // Ktor HttpClient + kotlinx.serialization assumed
data class SubscriptionInfo(
    @SerialName("character_count") val characterCount: Int,
    @SerialName("character_limit") val characterLimit: Int
)

suspend fun checkQuota(textLength: Int): Boolean {
    val response = httpClient.get("https://api.elevenlabs.io/v1/user/subscription") {
        header("xi-api-key", apiKey)
    }.body<SubscriptionInfo>()
    return (response.characterLimit - response.characterCount) >= textLength
}

Caching

The same phrase with the same settings should be synthesized only once. Cache key: sha256(text + voice_id + stability + similarity_boost + model_id). Store the files in the app's internal storage with a 30-day TTL and LRU eviction at a 100 MB limit.
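The cache key above, as a sketch (the function name and the "|" delimiter are ours; the delimiter avoids ambiguous concatenation such as "ab"+"c" vs "a"+"bc"):

```kotlin
import java.security.MessageDigest

// Deterministic cache key: same text + voice + settings + model → same file name.
fun ttsCacheKey(
    text: String,
    voiceId: String,
    stability: Double,
    similarityBoost: Double,
    modelId: String,
): String {
    val input = "$text|$voiceId|$stability|$similarityBoost|$modelId"
    val digest = MessageDigest.getInstance("SHA-256").digest(input.toByteArray())
    return digest.joinToString("") { "%02x".format(it.toInt() and 0xFF) }
}
```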

Timeline

Basic REST integration with playback: 2–3 days. WebSocket streaming fed by LLM tokens: 5–7 days. Full voice-selection UI with caching and quota monitoring: 10–14 days.