AI Audio/Video File Transcription for Mobile App

NOVASOLUTIONS.TECHNOLOGY develops, supports, and maintains iOS, Android, and PWA mobile applications. We have extensive experience publishing mobile applications in popular marketplaces such as Google Play, the App Store, Amazon Appstore, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.

Complexity: Medium. Estimated timeline: ~3-5 business days.
Latest works:
  • Development of a mobile application for FEEDME
  • Development of a mobile application for XOOMER
  • Development of a mobile application for RHL
  • Development of a mobile application for ZIPPY
  • Development of a mobile application for Affhome
  • Development of a mobile application for the FLAVORS company

AI Audio/Video File Transcription in Mobile Applications

A user uploads a 40-minute meeting recording and waits for the text. If the raw 300 MB MP4 goes to the server and the JSON comes back 90 seconds later, the UX is broken before the first word appears. The task is to build a pipeline in which the mobile client doesn't just upload the file but participates in preparation (slicing, conversion, chunking) and receives results gradually while the model is still working.

Where Naive Implementation Breaks

The most common mistake is sending the file whole via a standard URLSession.dataTask or OkHttp call with default timeouts. On large files this leads to NSURLErrorTimedOut (-1001), an HTTP 413 from the server, or an out-of-memory crash on Android caused by buffering the whole file in memory.

The second pitfall is the format. Whisper and most cloud providers accept audio/wav, audio/mp3, audio/mp4, and audio/ogg, but not every codec inside those containers: a .mov with a pcm_s16le audio track passes, while a .mov with ac3 returns 400 Bad Request with no explanation. On iOS, extracting the audio track via AVAssetExportSession with the AVAssetExportPresetAppleM4A preset solves 95% of cases. On Android, use MediaExtractor plus MediaCodec to decode to PCM, then MediaMuxer to repackage the audio as AAC.

The third problem is chunking by time without accounting for silence. Slice every 60 seconds exactly, and sooner or later a word lands on the boundary and the transcription splits it. Instead, use VAD (Voice Activity Detection) to cut at pauses: on iOS, AVAudioSession plus AVAudioEngine for signal analysis; in Flutter, the voice_activity_detector package on top of WebRTC VAD.

How Pipeline Works in Practice

File Preparation on Device

// iOS: extract the audio track from a video into an M4A file
let asset = AVURLAsset(url: videoURL)
// AVAssetExportSession's initializer is failable; avoid force-unwrapping in production
guard let exportSession = AVAssetExportSession(asset: asset,
                                               presetName: AVAssetExportPresetAppleM4A) else { return }
exportSession.outputFileType = .m4a
exportSession.outputURL = tempAudioURL
await exportSession.export()

After export, split the file into chunks under 25 MB (the Whisper API limit) while respecting VAD boundaries. Upload each chunk via URLSession.uploadTask(with:fromFile:) with a background configuration (URLSessionConfiguration.background), so the upload continues even if the app is minimized.
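One way to combine the byte budget with VAD boundaries is to cut at the latest detected pause that still fits in the budget, falling back to a hard cut when no pause is available. A minimal sketch; the helper name is hypothetical, and it assumes a roughly constant encoded bitrate:

```swift
import Foundation

/// Pick chunk cut points (in seconds) so each chunk stays under a byte budget
/// and cuts land on detected pauses rather than on an exact interval.
/// `pauses` are silence midpoints reported by VAD, in seconds.
func chunkBoundaries(duration: Double,
                     pauses: [Double],
                     bytesPerSecond: Double,
                     maxChunkBytes: Double) -> [Double] {
    let maxSpan = maxChunkBytes / bytesPerSecond  // longest allowed chunk, seconds
    var cuts: [Double] = []
    var start = 0.0
    while duration - start > maxSpan {
        let limit = start + maxSpan
        // Latest pause that still fits the budget; hard cut if none exists.
        let cut = pauses.filter { $0 > start && $0 <= limit }.max() ?? limit
        cuts.append(cut)
        start = cut
    }
    return cuts
}
```

Cutting at the latest fitting pause keeps chunks as large as possible, which minimizes the number of API requests.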

Android is similar: WorkManager with a CoroutineWorker for background work, and an OkHttp RequestBody created from a File rather than a ByteArray, which is critical for keeping RAM usage low on 2 GB devices.

Streaming Results

Instead of polling every N seconds, use a WebSocket or SSE. If you run your own backend on top of Whisper, the server can stream partial_transcript events as chunks finish processing. On the client, URLSessionWebSocketTask (iOS) or an OkHttp WebSocket (Android) appends incoming lines to a StateFlow / @Published property, and the UI updates in real time.
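The iOS receive loop can look like the sketch below. The JSON frame shape (index plus partial_transcript) is an assumption about a custom backend, not a Whisper API; note that URLSessionWebSocketTask delivers one message per receive call, so the handler must re-arm itself:

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession types on Linux
#endif

// Assumed frame shape from a hypothetical custom backend:
// {"index": 3, "partial_transcript": "..."}
struct PartialTranscript: Decodable {
    let index: Int
    let partial_transcript: String
}

/// Decode one text frame; returns nil on malformed input instead of throwing.
func decodeFrame(_ text: String) -> PartialTranscript? {
    text.data(using: .utf8).flatMap {
        try? JSONDecoder().decode(PartialTranscript.self, from: $0)
    }
}

/// Receive loop: process a frame, then re-arm for the next message.
func listen(on task: URLSessionWebSocketTask,
            onSegment: @escaping (PartialTranscript) -> Void) {
    task.receive { result in
        if case .success(.string(let text)) = result,
           let segment = decodeFrame(text) {
            onSegment(segment)
        }
        listen(on: task, onSegment: onSegment)
    }
}
```

In a real app, the onSegment callback would publish to a @Published property on the main actor so SwiftUI views update as text arrives.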

Direct Whisper API integration offers no streaming; the API is synchronous. Split the audio into independent requests and merge the results on the client by chunk index, not by response arrival order, since the network doesn't guarantee ordering.
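The merge itself is a few lines once each upload carries the index it was assigned at chunking time. A minimal sketch with illustrative type names:

```swift
import Foundation

// The index is assigned when the audio is chunked, before upload,
// so it is stable regardless of which response arrives first.
struct ChunkResult {
    let index: Int
    let text: String
}

/// Reassemble the transcript in chunk order, not arrival order.
func mergeTranscript(_ results: [ChunkResult]) -> String {
    results.sorted { $0.index < $1.index }
           .map(\.text)
           .joined(separator: " ")
}
```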

Storage and Post-processing

A raw Whisper transcript returns segments with timestamps, which are more valuable than the text alone. Store the JSON with start, end, and text for each segment: this enables features like "tap a word to rewind the audio to that point".
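A sketch of the stored model and the lookup behind tap-to-seek; the Segment field names mirror Whisper's per-segment output, and the helper is hypothetical:

```swift
import Foundation

/// One transcript segment with its time range, as stored from the API response.
struct Segment: Codable {
    let start: Double  // seconds
    let end: Double    // seconds
    let text: String
}

/// Find the segment covering a playback position; nil in gaps between segments.
func segment(at time: Double, in segments: [Segment]) -> Segment? {
    segments.first { time >= $0.start && time < $0.end }
}
```

The inverse direction (tap a segment of text, seek the player to its start) falls out of the same model: AVPlayer's seek takes the segment's start directly.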

Post-processing usually means punctuation restoration and diarization (who said what). Whisper doesn't separate speakers, so a separate step is needed: pyannote.audio via an API, or AssemblyAI with speaker_labels: true. The client then merges the two JSON outputs by timestamps.
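One simple merge strategy is to label each transcript segment with the diarization turn that overlaps it the most. A sketch under that assumption; the type and function names are illustrative:

```swift
import Foundation

/// One speaker turn from the diarization output (times in seconds).
struct SpeakerTurn {
    let speaker: String
    let start: Double
    let end: Double
}

/// Speaker for a transcript segment [s, e): the turn with the largest
/// time overlap, or nil if no turn overlaps the segment at all.
func speaker(forSegmentStart s: Double, end e: Double,
             turns: [SpeakerTurn]) -> String? {
    func overlap(_ t: SpeakerTurn) -> Double {
        max(0, min(e, t.end) - max(s, t.start))
    }
    guard let best = turns.max(by: { overlap($0) < overlap($1) }),
          overlap(best) > 0 else { return nil }
    return best.speaker
}
```

Largest-overlap wins handles segments that straddle a speaker change; a stricter variant could split such segments at the turn boundary instead.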

Choosing Provider by Task

Provider                    Accuracy (RU)   Streaming   Diarization   File limit
OpenAI Whisper              High            No          No            25 MB
AssemblyAI                  Medium          Yes         Yes           5 GB
Deepgram Nova-2             High            Yes         Yes           No limit
Google Speech-to-Text v2    Medium          Yes         Yes           1 GB
On-device (iOS CoreML)      Medium          No          No            RAM-limited

For Russian, Whisper large-v3 is notably stronger on informal speech and technical jargon. Deepgram Nova-2 with language: ru is a good choice when real-time transcription is needed.

Development Process

Start with an audit: file formats, average file size, languages, whether diarization is needed, and offline requirements. Then choose the provider and the pipeline architecture.

Develop in stages: first basic upload plus transcription without optimizations, then add chunking, background upload, streaming UI updates, and post-processing. Each stage lives in a separate branch with functional tests on real devices, not the simulator: CoreML and MediaCodec behave differently on real hardware.

The timeline from a basic Whisper API integration to a full pipeline with diarization and background upload is 2 to 6 weeks, depending on the platforms and requirements.