AI Audio/Video File Transcription in Mobile Applications
A user uploads a 40-minute meeting recording and waits for the text. If the raw 300 MB MP4 goes to the server and the JSON comes back 90 seconds later, the UX is broken before the first word appears. The task: build a pipeline in which the mobile client doesn't just upload the file but participates in preparation (slicing, conversion, chunking) and receives results gradually while the model is still working.
Where Naive Implementation Breaks
The most common mistake is sending the file whole via a standard URLSession.dataTask or OkHttp call with default timeouts. On large files this causes NSURLErrorTimedOut (-1001), an HTTP 413 from the server, or an OOM on Android from buffering the whole file in memory.
The second pitfall is format. Whisper and most cloud providers accept audio/wav, audio/mp3, audio/mp4, and audio/ogg, but not every codec inside those containers: a .mov with a pcm_s16le audio track passes, while a .mov with ac3 returns 400 Bad Request with no explanation. On iOS, extracting the audio track via AVAssetExportSession with the AVAssetExportPresetAppleM4A preset solves 95% of cases. On Android: MediaExtractor plus MediaCodec to decode to PCM, then MediaMuxer to repack as AAC.
The third problem is chunking by time without accounting for silence. Slice exactly every 60 seconds and a word that lands on the boundary gets split across two transcriptions. Use VAD (Voice Activity Detection) to cut at pauses instead. iOS offers AVAudioSession plus AVAudioEngine for signal analysis; in Flutter there is the voice_activity_detector package built on WebRTC VAD.
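The pause-aware cutting idea can be sketched platform-agnostically. A minimal Python version, assuming per-frame RMS energies as input; the frame size and thresholds here are illustrative, not tuned values:

```python
# Sketch: pick chunk boundaries at pauses instead of hard 60-second cuts.
# `frames` is a list of per-frame RMS energies (e.g. 30 ms frames).
FRAME_MS = 30          # frame duration, illustrative
SILENCE_RMS = 0.01     # below this a frame counts as silence, illustrative
TARGET_S = 60          # desired chunk length
MAX_DRIFT_S = 5        # how far a cut may move to find a pause

def chunk_boundaries(frames, frame_ms=FRAME_MS, target_s=TARGET_S,
                     max_drift_s=MAX_DRIFT_S, silence_rms=SILENCE_RMS):
    """Return frame indices where chunks should be cut, preferring silence."""
    per_s = 1000 // frame_ms
    boundaries = []
    cut = target_s * per_s
    while cut < len(frames):
        window = range(max(0, cut - max_drift_s * per_s),
                       min(len(frames), cut + max_drift_s * per_s))
        # choose the quietest frame near the nominal cut point
        best = min(window, key=lambda i: frames[i])
        if frames[best] < silence_rms:
            cut = best          # snap the cut to the pause
        boundaries.append(cut)  # otherwise keep the nominal cut
        cut += target_s * per_s
    return boundaries
```

A real implementation would use WebRTC VAD or AVAudioEngine tap buffers instead of raw RMS, but the snap-to-nearest-pause logic stays the same.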
How Pipeline Works in Practice
File Preparation on Device
// iOS: extract the audio track from a video file into an M4A (AAC) container
let asset = AVURLAsset(url: videoURL)
// The initializer is failable: the preset may be unsupported for this asset
guard let exportSession = AVAssetExportSession(
    asset: asset,
    presetName: AVAssetExportPresetAppleM4A
) else { return }
exportSession.outputFileType = .m4a
exportSession.outputURL = tempAudioURL
await exportSession.export() // async variant, iOS 16+
// Check exportSession.status / exportSession.error before using the file
After export, split into chunks under 25 MB (the Whisper API limit) while respecting VAD boundaries. Upload each chunk via URLSession.uploadTask(with:fromFile:) on a background configuration (URLSessionConfiguration.background), so the upload continues if the app is minimized.
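The size-constrained splitting can be sketched as a greedy cutter: take the latest detected pause that keeps the current chunk under the limit. A minimal Python sketch, assuming a roughly constant encoded bitrate (`bytes_per_s`) and a precomputed list of pause timestamps; both names are illustrative:

```python
# Sketch: choose cut times so each chunk stays under the 25 MB Whisper limit,
# cutting at pauses where possible.
MAX_CHUNK_BYTES = 25 * 1024 * 1024

def cut_times(duration_s, pauses, bytes_per_s, max_bytes=MAX_CHUNK_BYTES):
    """Greedy: pick the latest pause that keeps the chunk under max_bytes."""
    max_s = max_bytes / bytes_per_s  # max chunk duration at this bitrate
    cuts, start = [], 0.0
    while duration_s - start > max_s:
        candidates = [p for p in pauses if start < p <= start + max_s]
        if candidates:
            cut = max(candidates)
        else:
            # no pause in range: hard cut (a word may be split here)
            cut = start + max_s
        cuts.append(cut)
        start = cut
    return cuts
```

With variable-bitrate audio you would cut by actual byte offsets rather than estimated duration, but the greedy structure is the same.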
Android works similarly: WorkManager with a CoroutineWorker for background execution, and OkHttp with a RequestBody built from a File rather than a ByteArray, which is critical for RAM savings on 2 GB devices.
Streaming Results
Instead of polling every N seconds, use a WebSocket or SSE. With your own backend in front of Whisper, the server streams partial_transcript events as chunks finish processing. On the client, URLSessionWebSocketTask (iOS) or an OkHttp WebSocket (Android) appends lines to a StateFlow / @Published property, and the UI updates in real time.
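The client-side handling of streamed messages reduces to a small state machine. A Python sketch; the message shape ({"chunk": int, "text": str, "final": bool}) is an assumption about a custom backend protocol, not a Whisper API format:

```python
# Sketch: reducer for streamed partial-transcript messages.
def apply_message(state, msg):
    """Keep the latest partial per chunk; finals overwrite and stick."""
    chunk = msg["chunk"]
    if state.get(chunk, {}).get("final"):
        return state  # ignore late partials arriving after a final
    new = dict(state)
    new[chunk] = {"text": msg["text"], "final": msg["final"]}
    return new

def render(state):
    """Concatenate chunks in index order for display."""
    return " ".join(state[i]["text"] for i in sorted(state))
```

The immutable-update style maps directly onto a StateFlow emission on Android or a @Published assignment on iOS.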
For direct Whisper API integration there is no streaming: the API is synchronous. Split the audio into independent requests and merge on the client by chunk index, not by response order (the network doesn't guarantee ordering).
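When merging, segment timestamps also need to be shifted by each chunk's offset so they line up on the original recording's timeline. A Python sketch, assuming `responses` maps chunk index to (chunk start time, segment list); the structure is illustrative:

```python
# Sketch: merge per-chunk responses by chunk index, not arrival order,
# shifting segment timestamps onto the original recording's timeline.
def merge_chunks(responses):
    merged = []
    for idx in sorted(responses):          # index order, not arrival order
        offset, segments = responses[idx]  # offset = chunk start in seconds
        for seg in segments:
            merged.append({
                "start": seg["start"] + offset,
                "end": seg["end"] + offset,
                "text": seg["text"],
            })
    return merged
```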
Storage and Post-processing
A raw Whisper transcript returns segments with timestamps, which are more valuable than the text alone. Store the JSON with start, end, and text per segment: this enables "tap a word → rewind the audio".
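The "tap a word → rewind" feature is a lookup over those stored segments. A Python sketch of the two directions (tap to seek time, and playback time back to segment for highlighting), assuming segments sorted by start time:

```python
import bisect

# Segment shape (start, end, text) follows Whisper's segment-level output.
def seek_time(segments, tapped_index):
    """Tap on a word in segment `tapped_index`: seek audio to segment start."""
    return segments[tapped_index]["start"]

def segment_at(segments, t):
    """Find the segment containing playback time t, for live highlighting."""
    starts = [s["start"] for s in segments]
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and t < segments[i]["end"]:
        return i
    return None  # t falls in a gap or past the end
```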
Post-processing covers punctuation and diarization (who spoke when). Whisper doesn't separate speakers, so a separate step is needed: pyannote.audio via an API, or AssemblyAI with speaker_labels: true. The client merges the two JSONs by timestamps.
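The merge by timestamps can be done with a simple maximal-overlap heuristic. A Python sketch; the speaker-entry shape ({"start", "end", "speaker"}) mimics typical diarization output, and the heuristic itself is an assumption, not a provider API:

```python
# Sketch: merge transcript segments with a diarization track by time overlap.
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_segments(segments, speakers):
    """Assign each transcript segment the speaker with maximal time overlap."""
    labeled = []
    for seg in segments:
        best = max(speakers,
                   key=lambda sp: overlap(seg["start"], seg["end"],
                                          sp["start"], sp["end"]),
                   default=None)
        labeled.append({**seg, "speaker": best["speaker"] if best else None})
    return labeled
```

Maximal overlap handles the common case where diarization turn boundaries don't line up exactly with Whisper segment boundaries.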
Choosing Provider by Task
| Provider | Accuracy (RU) | Streaming | Diarization | File Limit |
|---|---|---|---|---|
| OpenAI Whisper | High | No | No | 25 MB |
| AssemblyAI | Medium | Yes | Yes | 5 GB |
| Deepgram Nova-2 | High | Yes | Yes | No limit |
| Google Speech-to-Text v2 | Medium | Yes | Yes | 1 GB |
| On-device (iOS Core ML) | Medium | No | No | RAM limited |
For Russian, Whisper large-v3 notably wins on informal speech and technical jargon. Deepgram Nova-2 with language: ru is a good choice if real-time is needed.
Development Process
Start with an audit: file formats, average size, languages, whether diarization is needed, offline requirements. Then choose a provider and pipeline architecture.
Develop in stages: first basic upload plus transcription without optimizations, then add chunking, background upload, UI streaming, and post-processing. Each stage is a separate branch with a functional test on real devices (not the simulator: Core ML and MediaCodec behave differently on real hardware).
Timeline from a basic Whisper API integration to a full pipeline with diarization and background upload: 2 to 6 weeks, depending on platform and requirements.