AI Voice Message Transcription for Mobile App

NOVASOLUTIONS.TECHNOLOGY develops, supports, and maintains iOS, Android, and PWA mobile applications. We have extensive experience publishing mobile applications in popular marketplaces such as Google Play, the App Store, Amazon Appstore, and AppGallery.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.


AI Voice Message Transcription in Mobile Applications

Voice messages in messengers are a plague of corporate communication: a 3-minute recording where two lines of text would do. Transcription solves this: tap a button, get text you can read without sound, copy, and search.

Capturing Audio from Voice Message

In practice, a mobile app deals with two scenarios:

Recording within the app. The user records directly in your app: native capture, full control over the format.

Importing an external file. A WAV/MP3/OGG file arrives from a messenger via the share sheet. On iOS, use UTType.audio with UIDocumentPickerViewController; on Android, ACTION_GET_CONTENT with the "audio/*" MIME type.

File format matters. OGG Opus (Telegram's format) is understood by Whisper natively; AMR (older Android messengers) needs conversion. On the server, ffmpeg handles any format:

import subprocess

def convert_to_mp3(input_path: str, output_path: str) -> None:
    """Convert any ffmpeg-supported audio to a speech-optimized MP3."""
    subprocess.run([
        "ffmpeg", "-y",      # overwrite output if it exists
        "-i", input_path,
        "-ar", "16000",      # 16 kHz is sufficient for speech
        "-ac", "1",          # mono
        "-b:a", "32k",       # 32 kbps is enough for speech
        output_path
    ], check=True)

16 kHz mono MP3 at 32 kbps is the sweet spot for Whisper: transcription quality doesn't drop, and the file size stays minimal.
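Before invoking ffmpeg at all, it's worth checking whether conversion is even needed. A minimal sketch, assuming the file extension reflects the actual format (a heuristic, not a guarantee); the format list follows OpenAI's documented set of accepted inputs, and the helper name is our own:

```python
from pathlib import Path

# Formats the Whisper API accepts directly (per OpenAI's docs at the time of writing)
WHISPER_NATIVE = {".flac", ".m4a", ".mp3", ".mp4", ".mpeg", ".mpga", ".ogg", ".wav", ".webm"}

def needs_conversion(file_path: str) -> bool:
    """True if the file should go through ffmpeg before upload (e.g. AMR)."""
    return Path(file_path).suffix.lower() not in WHISPER_NATIVE
```

So an OGG from Telegram is uploaded as-is, while `needs_conversion("note.amr")` routes the file through convert_to_mp3 first.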

Whisper API: Integration Details

import openai

client = openai.OpenAI()

def transcribe_audio(file_path: str, language: str = "ru") -> dict:
    with open(file_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            language=language,            # explicit language improves quality
            response_format="verbose_json",  # includes timestamps and segments
            timestamp_granularities=["word"]  # timestamp per word
        )
    return transcript

verbose_json combined with timestamp_granularities=["word"] gives a timestamp for every word. On mobile this enables "read and listen": tap a word in the transcript and the audio jumps to that moment.

The language parameter is critical for mixed recordings. Without it, Whisper spends the first seconds detecting the language, which adds latency. If the app knows the user's language, always pass it.

Latency: Real Numbers and Optimization

Whisper API: a 10-second message is processed in 0.5–1.5 s, a 1-minute one in 3–8 s. That includes processing on OpenAI's servers plus network overhead. Acceptable for the user as long as you show progress.
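Those numbers are enough to drive a determinate progress bar instead of an indefinite spinner. A rough heuristic fitted to the figures above (the constants and the function name are our own, not an API):

```python
def estimated_transcription_seconds(audio_seconds: float) -> float:
    """Rough ETA for a Whisper API call, fitted to the observed numbers:
    ~0.5-1.5 s for 10 s of audio, ~3-8 s for a minute.
    0.5 s network floor + ~9% of audio duration (midpoint estimate)."""
    return 0.5 + audio_seconds * 0.09
```

The client animates the bar toward this estimate and snaps to 100% when the response actually arrives; overshooting slightly is fine, a frozen bar is not.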

For lower latency:

Deepgram Nova-2: real-time streaming transcription with latency under 300 ms. More expensive than Whisper, but faster.

Local Whisper (self-hosted). faster-whisper on a GPU (RTX 3090) processes 1 minute of audio in 2–4 seconds; on a CPU, 15–30 seconds. If data can't go to the cloud, this is the only option.

Client-side transcription on iOS. SFSpeechRecognizer is Apple's native framework: it runs on-device (iOS 16+), is free, and doesn't send data anywhere. The trade-offs: a limited set of supported languages, lower quality than Whisper, and a limit of about 1 minute per request.

// iOS — local transcription via SFSpeechRecognizer
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "ru-RU"))
let request = SFSpeechURLRecognitionRequest(url: audioURL)
request.shouldReportPartialResults = true

recognizer?.recognitionTask(with: request) { result, error in
    guard let result else { return }
    DispatchQueue.main.async {
        self.transcriptText = result.bestTranscription.formattedString
    }
}

For short personal notes, SFSpeechRecognizer is a good option with no server costs. For corporate meeting recordings, use Whisper or Deepgram.

Displaying Transcript on Mobile

A simple transcription is just text. A good mobile transcription includes:

  • Interactive text with timestamps: tap a word and the audio jumps to that moment
  • Punctuation (Whisper restores it well but not perfectly; sometimes needs post-processing)
  • Paragraph breaks at pauses (use Whisper's segments for splitting)
  • A "copy all" button
  • Search within the transcript

For a messenger, stream the transcript: don't wait for completion, show segments as they become ready.
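The paragraph-splitting idea from the list above can be sketched as a pure function over Whisper's segments: start a new paragraph whenever the silence gap between consecutive segments exceeds a threshold. The 1.5 s threshold is an assumption to tune per use case, and the function name is our own:

```python
def segments_to_paragraphs(segments: list[dict], pause_s: float = 1.5) -> list[str]:
    """Group Whisper segments into paragraphs, breaking wherever the
    silence between two segments exceeds pause_s."""
    paragraphs: list[list[str]] = []
    prev_end = None
    for seg in segments:
        if prev_end is None or seg["start"] - prev_end > pause_s:
            paragraphs.append([])          # long pause -> new paragraph
        paragraphs[-1].append(seg["text"].strip())
        prev_end = seg["end"]
    return [" ".join(p) for p in paragraphs]
```

Because it works on segments, the same function can run incrementally as segments stream in, appending to the last paragraph or opening a new one.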

Post-processing Transcript

Whisper sometimes inserts notations like [Music] or [Applause] and transcribes background noise. Filter them out:

import re

def clean_transcript(text: str) -> str:
    # Remove Whisper notations like [Music], [Noise]
    text = re.sub(r'\[.*?\]', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

For business scenarios, LLM post-processing is useful: fixing proper names and domain terms, and adding punctuation where Whisper erred.
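When a full LLM pass is overkill, a deterministic glossary of known mis-hearings covers many of the proper-name fixes. A lightweight sketch under that assumption; the glossary contents and helper name are illustrative:

```python
import re

def fix_terms(text: str, glossary: dict[str, str]) -> str:
    """Replace common Whisper mis-hearings of proper names and domain
    terms using a per-project glossary (case-insensitive, whole words)."""
    for wrong, right in glossary.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text
```

A glossary maintained per client catches recurring errors cheaply and predictably; the LLM pass remains the option for free-form punctuation repair.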

Implementation Timeline

Audio capture and file import → server transcription (Whisper/Deepgram) with progress → formatting and post-processing → mobile UI with interactive transcript → optional: streaming and local SFSpeechRecognizer for iOS.
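The server-side portion of that pipeline can be sketched as one orchestrating function. The transcription backend is injected as a callable so the same flow works with Whisper, Deepgram, or a local model; all names here are illustrative, not an established API:

```python
from typing import Callable

def process_voice_message(
    file_path: str,
    transcribe_fn: Callable[[str], dict],      # e.g. a Whisper API wrapper
    clean_fn: Callable[[str], str] = lambda t: t,  # e.g. notation filtering
) -> dict:
    """Orchestrate: transcribe -> post-process -> package for the client."""
    raw = transcribe_fn(file_path)
    text = clean_fn(raw.get("text", ""))
    return {"text": text, "segments": raw.get("segments", [])}
```

Injecting the backend also makes the flow trivially testable with a stub transcriber, which matters once formatting and post-processing logic starts accumulating.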

Basic transcription via Whisper with simple text display: 1–2 weeks. A full tool with interactive text, timestamps, and post-processing: 3–4 weeks.