Multimodal AI Input (Text + Audio) for Mobile App

NOVASOLUTIONS.TECHNOLOGY is engaged in the development, support and maintenance of iOS, Android, PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular markets like Google Play, App Store, Amazon, AppGallery and others.


Implementing Multimodal AI Input (Text + Audio) in a Mobile Application

Voice messages in messengers are familiar. But when a user wants not just a transcription but a meaningful response to what was said, you need a chain: audio capture → transcription (or native audio input to the model) → LLM with context. That is three separate technical layers, each with its own pitfalls.

Two Architectural Paths

Path 1: STT → LLM. The Whisper API or an equivalent converts audio to text, and the text goes into messages[]. This works with any LLM and is cheap and predictable. The problem is double latency: you wait for the transcription (1–3 s for a 30-second fragment), then for the model's response. The user can end up staring at the screen for 5–10 seconds.
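
The STT → LLM chain can be sketched in a few lines of Kotlin. This is a minimal illustration, not a real SDK integration: Transcriber, ChatModel, Message, and answerVoiceMessage are hypothetical names introduced here, and the network calls are hidden behind the two interfaces.

```kotlin
// Hypothetical abstractions: neither interface is a real SDK type.
fun interface Transcriber { fun transcribe(audio: ByteArray): String }
fun interface ChatModel { fun complete(messages: List<Message>): String }

data class Message(val role: String, val content: String)

// Path 1: audio -> text -> messages[] -> LLM reply.
// Returns the updated history together with the reply text.
fun answerVoiceMessage(
    audio: ByteArray,
    history: List<Message>,
    stt: Transcriber,
    llm: ChatModel
): Pair<List<Message>, String> {
    val transcript = stt.transcribe(audio).trim()
    require(transcript.isNotEmpty()) { "Empty transcript: nothing to send to the model" }
    val messages = history + Message("user", transcript)
    val reply = llm.complete(messages)
    return messages + Message("assistant", reply) to reply
}
```

The two calls run strictly one after the other, which is where the double latency comes from.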

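Path 2: Native audio input. GPT-4o Audio Preview accepts an input_audio part directly in content[]; Gemini 1.5 Pro accepts audio as an inline data part. Latency is lower, and the model "hears" intonation, pauses, and accent. The limitation is format: OpenAI's input_audio accepts WAV or MP3, while Gemini accepts FLAC, MP3, WAV, and OGG, among others. The device usually has to convert the recording.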
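
One cheap on-device conversion is wrapping raw PCM16 samples in a WAV container, which is among the formats listed above. The sketch below assumes little-endian 16-bit PCM as input; pcm16ToWav is a name introduced for this example, and the 16 kHz mono defaults are an assumption.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Wraps raw little-endian PCM16 samples in a canonical 44-byte WAV (RIFF) header.
fun pcm16ToWav(pcm: ByteArray, sampleRate: Int = 16_000, channels: Int = 1): ByteArray {
    val bitsPerSample = 16
    val blockAlign = channels * bitsPerSample / 8   // bytes per sample frame
    val byteRate = sampleRate * blockAlign          // bytes per second
    val out = ByteBuffer.allocate(44 + pcm.size).order(ByteOrder.LITTLE_ENDIAN)
    out.put("RIFF".toByteArray(Charsets.US_ASCII))
    out.putInt(36 + pcm.size)                       // size of everything after this field
    out.put("WAVE".toByteArray(Charsets.US_ASCII))
    out.put("fmt ".toByteArray(Charsets.US_ASCII))
    out.putInt(16)                                  // fmt chunk size for PCM
    out.putShort(1.toShort())                       // audio format 1 = uncompressed PCM
    out.putShort(channels.toShort())
    out.putInt(sampleRate)
    out.putInt(byteRate)
    out.putShort(blockAlign.toShort())
    out.putShort(bitsPerSample.toShort())
    out.put("data".toByteArray(Charsets.US_ASCII))
    out.putInt(pcm.size)                            // length of the sample data
    out.put(pcm)
    return out.array()
}
```

WAV is uncompressed, so this trades upload size for zero encoding cost; compressed formats need a real encoder.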

The choice depends on the task. For a conversational voice assistant, use path 2. For meeting transcription with later analysis, use path 1 with batch processing.

Audio Recording: Where Bugs Come From

On Android, capture via MediaRecorder is simple, but AudioRecord is needed when real-time PCM is required (for example, streaming to Whisper via WebSocket). MediaRecorder saves to a file, which is convenient for short voice messages and inconvenient for a live stream. A typical crash is IllegalStateException: start called in invalid state, caused by calling start() before prepare() or calling start() again without reset(). Remember to release the recorder in onPause(); otherwise other apps cannot use the microphone.
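
One way to avoid the invalid-state crash is to track the recorder's lifecycle explicitly. The sketch below is plain Kotlin and does not touch Android classes: Recorder, RecState, and SafeRecorder are hypothetical names, and in a real app the Recorder implementation would delegate to MediaRecorder, with prepare() assumed to re-apply the source, format, and output file settings.

```kotlin
// Hypothetical minimal surface of a recorder; on Android it would delegate to MediaRecorder.
interface Recorder {
    fun prepare()
    fun start()
    fun stop()
    fun reset()
    fun release()
}

enum class RecState { IDLE, PREPARED, RECORDING, RELEASED }

// Tracks state so start() is never called before prepare() or twice in a row.
class SafeRecorder(private val delegate: Recorder) {
    var state = RecState.IDLE
        private set

    // Returns false instead of throwing when starting is not legal.
    fun start(): Boolean {
        if (state == RecState.RECORDING || state == RecState.RELEASED) return false
        if (state == RecState.IDLE) { delegate.prepare(); state = RecState.PREPARED }
        delegate.start()
        state = RecState.RECORDING
        return true
    }

    fun stop() {
        if (state != RecState.RECORDING) return
        delegate.stop()
        delegate.reset()   // back to a state where prepare() is legal again
        state = RecState.IDLE
    }

    // Intended to be called from onPause(): stops if needed and frees the microphone.
    fun release() {
        if (state == RecState.RELEASED) return
        stop()
        delegate.release()
        state = RecState.RELEASED
    }
}
```

A repeated start() becomes a no-op that returns false, so the UI can ignore a double tap instead of crashing.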

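On iOS, use AVAudioEngine for PCM streaming and AVAudioRecorder for files. The problem everyone hits is AVAudioSession configuration. If the session category is not set to .record (or .playAndRecord) and the session is not activated before starting, recording fails or produces silence. NSMicrophoneUsageDescription must also be present in Info.plist (it has been required since iOS 10); without it the app is terminated on first microphone access, including on the simulator.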

AVAudioRecorder is often configured to write CAF, which Whisper does not accept. Either convert afterwards via AVAssetExportSession (asynchronous, adds latency) or configure AVAudioRecorder to write M4A or FLAC from the start.

Implementing Streaming STT

For live transcription (the user speaks and text appears on screen), use a WebSocket connection to a streaming STT service such as Whisper Streaming or Deepgram. On Android:

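import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder

val sampleRate = 16000 // 16 kHz: the rate Whisper works with internally
val bufferSize = AudioRecord.getMinBufferSize(
    sampleRate, AudioFormat.CHANNEL_IN_MONO, AudioFormat.ENCODING_PCM_16BIT
)
// Requires the RECORD_AUDIO permission to be granted at runtime.
val audioRecord = AudioRecord(
    MediaRecorder.AudioSource.MIC,
    sampleRate,
    AudioFormat.CHANNEL_IN_MONO,
    AudioFormat.ENCODING_PCM_16BIT,
    bufferSize
)
// Read ~100 ms chunks → WebSocket → partial transcripts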

A 16 kHz sample rate is sufficient for speech and produces roughly 36% of the data of 44.1 kHz. The iOS equivalent is AVAudioEngine with installTap(onBus:bufferSize:format:block:).
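
The chunk size follows directly from the sample rate. As a rough sketch, assuming mono 16-bit PCM and 100 ms chunks: 16 kHz gives 3,200 bytes per chunk, 44.1 kHz gives 8,820. The helper names pcmChunkBytes and splitIntoChunks are introduced here for illustration.

```kotlin
// Bytes of PCM produced per chunk: rate * channels * bytesPerSample * ms / 1000.
fun pcmChunkBytes(sampleRate: Int, chunkMs: Int, channels: Int = 1, bytesPerSample: Int = 2): Int =
    sampleRate * channels * bytesPerSample * chunkMs / 1000

// Splits a captured buffer into fixed-size frames; the last frame may be shorter.
fun splitIntoChunks(pcm: ByteArray, chunkBytes: Int): List<ByteArray> {
    require(chunkBytes > 0) { "chunkBytes must be positive" }
    return (pcm.indices step chunkBytes).map { start ->
        pcm.copyOfRange(start, minOf(start + chunkBytes, pcm.size))
    }
}
```

In a streaming loop each frame would be sent as one binary WebSocket message. The chunk length is a tuning choice: shorter chunks lower latency but add per-message overhead.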

Important: the WebSocket must be reopened after network loss. OkHttp's WebSocketListener on Android has an onFailure callback: implement exponential backoff there with a maximum of 3 retries, and show the connection state in the UI; otherwise the user will not realize the connection dropped.
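
The retry schedule itself is a few lines of arithmetic. The sketch below computes only the delays; the 500 ms base and 8 s cap are assumed values, reconnectDelayMs and retryPlan are hypothetical names, and scheduling the actual reconnect from onFailure is left to the caller.

```kotlin
// Delay before retry `attempt` (1-based): base * 2^(attempt - 1), capped at capMs.
// Returns null once the retries are exhausted.
fun reconnectDelayMs(attempt: Int, maxRetries: Int = 3, baseMs: Long = 500, capMs: Long = 8_000): Long? {
    require(attempt >= 1) { "attempt is 1-based" }
    if (attempt > maxRetries) return null
    return minOf(baseMs shl (attempt - 1), capMs)
}

// Full schedule for inspection or logging.
fun retryPlan(maxRetries: Int = 3): List<Long> =
    (1..maxRetries).mapNotNull { reconnectDelayMs(it, maxRetries) }
```

With the defaults the schedule is 500 ms, 1 s, 2 s. A null result is the signal to stop retrying and show a "connection lost" state. Production code would typically add random jitter, which is omitted here.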

Sending Audio File to Multimodal Model

// iOS: send audio to GPT-4o Audio
import Foundation

// recordingURL (a file URL) and userText (a String) are assumed to be defined by the caller.
let audioData = try Data(contentsOf: recordingURL)
let b64 = audioData.base64EncodedString()

let payload: [String: Any] = [
    "model": "gpt-4o-audio-preview",
    "messages": [[
        "role": "user",
        "content": [
            ["type": "text", "text": userText],
            ["type": "input_audio", "input_audio": [
                "data": b64,
                "format": "mp3" // must match the actual encoding of the file
            ]]
        ]
    ]]
]

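OpenAI's file-size limit for audio uploads is 25 MB (the documented limit for the transcription endpoint). A 30-minute MP3 at 128 kbps is about 28 MB, so it does not fit. For long content, split the recording into 10–15 minute pieces or pre-process it with Whisper.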
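
The size estimate and the chunk plan can be checked with simple arithmetic. The sketch below assumes constant-bitrate audio and splitting by time; cbrBytes, base64Length, and planChunks are names introduced here. Actually cutting the file at those offsets needs an audio tool or re-encoding, which is not shown.

```kotlin
// Size of constant-bitrate audio: kbps * 1000 / 8 bytes per second.
fun cbrBytes(durationSec: Int, bitrateKbps: Int): Long =
    durationSec.toLong() * bitrateKbps * 1000 / 8

// Base64 emits 4 characters per 3 input bytes, rounded up to a full group.
fun base64Length(rawBytes: Long): Long = (rawBytes + 2) / 3 * 4

// Splits a recording into near-equal pieces no longer than maxChunkSec.
// Returns (startSec, endSec) pairs covering the whole duration.
fun planChunks(durationSec: Int, maxChunkSec: Int): List<Pair<Int, Int>> {
    require(durationSec > 0 && maxChunkSec > 0) { "durations must be positive" }
    val count = (durationSec + maxChunkSec - 1) / maxChunkSec   // ceil(duration / max)
    val len = (durationSec + count - 1) / count                 // ceil(duration / count)
    return (0 until count).map { i -> i * len to minOf((i + 1) * len, durationSec) }
}
```

For 30 minutes at 128 kbps, cbrBytes gives 28,800,000 bytes, and base64 encoding raises that to 38,400,000 characters, so the base64 overhead of about one third matters when audio is embedded in a JSON payload. planChunks(1800, 900) yields two 15-minute pieces of roughly 14.4 MB each.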

Stages and Timeline

Audit requirements (streaming vs file, provider, target platforms) → architecture choice → implement capture and conversion → integrate STT/multimodal API → streaming UI → test on real devices (different mics, background noise, headphones) → release.

An MVP with recording and Whisper takes 1–2 weeks. A full implementation with streaming, native audio input, and long-recording handling takes 3–5 weeks.