AI Voice Message Transcription in Mobile Applications
Voice messages in messengers are a plague of corporate communication: a 3-minute voice message instead of two lines of text. Transcription solves this: hit a button, get text, read it without sound, copy it, search it.
Capturing Audio from Voice Message
In practice, a mobile app deals with two scenarios:
Recording within the app. The user records directly in your app: native capture, full control over the format.
Importing an external file. A WAV/MP3/OGG arrives from a messenger via the share sheet. On iOS: UTType.audio in UIDocumentPickerViewController. On Android: ACTION_GET_CONTENT with "audio/*".
File format matters. OGG Opus (Telegram's format) Whisper understands natively. AMR (older Android messengers) needs conversion. On the server, ffmpeg handles any format:
import subprocess

def convert_to_mp3(input_path: str, output_path: str) -> None:
    subprocess.run([
        "ffmpeg", "-i", input_path,
        "-ar", "16000",  # 16 kHz is sufficient for speech
        "-ac", "1",      # mono
        "-b:a", "32k",   # 32 kbps is enough for speech
        output_path,
    ], check=True)
16 kHz mono MP3 at 32 kbps is the optimum for Whisper: quality doesn't drop, and the file size is minimal.
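Whether conversion is needed at all can be decided from the file extension. A minimal sketch; the format list reflects what the Whisper API documents as accepted and is worth re-checking against the current OpenAI docs:

```python
import os

# Formats the Whisper API accepts directly (per OpenAI docs; verify for your API version)
WHISPER_FORMATS = {".flac", ".m4a", ".mp3", ".mp4", ".mpeg", ".mpga", ".ogg", ".wav", ".webm"}

def needs_conversion(path: str) -> bool:
    """True if the file must go through ffmpeg before upload."""
    return os.path.splitext(path)[1].lower() not in WHISPER_FORMATS
```

This lets you skip the ffmpeg round-trip for OGG from Telegram and convert only AMR and other unsupported formats.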
Whisper API: Integration Details
import openai

client = openai.OpenAI()

def transcribe_audio(file_path: str, language: str = "ru") -> dict:
    with open(file_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            language=language,  # explicit language improves quality
            response_format="verbose_json",  # includes timestamps and segments
            timestamp_granularities=["word"],  # timestamp per word
        )
    return transcript
verbose_json with timestamp_granularities=["word"] gives a timestamp for every word. On mobile this enables "read and listen": tap a word in the transcript and jump to that moment in the audio.
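The tap-to-seek mapping can be sketched in a few lines, assuming each word from verbose_json carries start/end times in seconds (field names as in the Whisper verbose_json response):

```python
import bisect

def seek_position(words: list[dict], tapped_index: int) -> float:
    """Audio position (seconds) to jump to when a word is tapped."""
    return words[tapped_index]["start"]

def word_index_at(words: list[dict], position: float) -> int:
    """Index of the word being spoken at a playback position (for live highlighting)."""
    starts = [w["start"] for w in words]
    return max(bisect.bisect_right(starts, position) - 1, 0)
```

The same index works in both directions: tap drives the player, playback position drives the highlighted word.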
The language parameter is critical for mixed recordings. Without it, Whisper spends the first seconds detecting the language, which adds latency. If the app knows the user's language, always pass it.
Latency: Real Numbers and Optimization
Whisper API: a 10-second message is processed in 0.5–1.5 s, a 1-minute one in 3–8 s. That is processing time on OpenAI's servers plus the network. Acceptable to the user if you show progress.
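Those numbers can drive a progress bar. A rough ETA heuristic; the 0.1x factor and 0.4 s overhead are fitted by eye to the figures above, not an API guarantee:

```python
def estimate_eta_seconds(audio_seconds: float) -> float:
    """Rough Whisper API ETA: ~0.1x real time plus fixed network overhead (assumed)."""
    return 0.4 + 0.1 * audio_seconds

# In the UI: progress = min(elapsed / estimate_eta_seconds(duration), 0.95),
# then snap to 100% when the response lands.
```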
For lower latency:
Deepgram Nova-2. Real-time streaming transcription, latency under 300 ms. More expensive than Whisper, but faster.
Local Whisper (self-hosted). faster-whisper on a GPU (RTX 3090) processes 1 minute in 2–4 seconds; on CPU, 15–30 seconds. If data can't go to the cloud, this is the only option.
Client-side transcription on iOS. SFSpeechRecognizer is Apple's native framework: runs on-device (iOS 16+), free, sends no data anywhere. But it supports only a limited set of languages, quality is below Whisper, and there's a 1-minute limit per request.
// iOS — local transcription via SFSpeechRecognizer
import Speech

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "ru-RU"))
let request = SFSpeechURLRecognitionRequest(url: audioURL)
request.shouldReportPartialResults = true
recognizer?.recognitionTask(with: request) { result, error in
    guard let result else { return }
    DispatchQueue.main.async {
        self.transcriptText = result.bestTranscription.formattedString
    }
}
For short personal notes, SFSpeechRecognizer is a good option with no server costs. For corporate meeting recordings, use Whisper or Deepgram.
Displaying Transcript on Mobile
Simple transcription is just text. Good transcription on mobile adds:
- Interactive text with timestamps: tap a word, the audio jumps to that moment
- Punctuation (Whisper restores it well, though not perfectly; sometimes needs post-processing)
- Paragraphs split by pauses (use Whisper's segments for splitting)
- A copy-all button
- Search within the transcript
For a messenger: show the transcript streaming. Don't wait for completion; render segments as they become ready.
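The streaming display can be sketched as a generator that yields the accumulated text after each segment (segment shape as in Whisper's verbose_json; with the non-streaming Whisper API all segments arrive at once, so this mainly fits Deepgram-style streaming or chunked server responses):

```python
from typing import Iterable, Iterator

def stream_transcript(segments: Iterable[dict]) -> Iterator[str]:
    """Yield the growing transcript after each segment, so the UI can render immediately."""
    text = ""
    for seg in segments:
        text += seg["text"]
        yield text.strip()
```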
Post-processing Transcript
Whisper sometimes emits notations like [Music] or [Applause] and transcribes background noise. Filter these out:
import re

def clean_transcript(text: str) -> str:
    # Remove Whisper notations like [Music], [Noise]
    text = re.sub(r'\[.*?\]', '', text)
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
For business scenarios, LLM post-processing is useful: fixing proper names and terms, adding punctuation where Whisper erred.
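A minimal sketch of that LLM pass. The prompt wording, the `glossary` of known proper names, and the model name are all illustrative; `client` is the same openai.OpenAI() instance used above:

```python
def build_fixup_prompt(transcript: str, glossary: list[str]) -> str:
    """Prompt asking an LLM to fix names, terms, and punctuation without rewriting content."""
    terms = ", ".join(glossary)
    return (
        "Fix transcription errors in the text below: correct punctuation and "
        f"misheard proper names (known terms: {terms}). "
        "Do not change the meaning or add content.\n\n" + transcript
    )

def postprocess_with_llm(client, transcript: str, glossary: list[str]) -> str:
    # client: an openai.OpenAI() instance; the model name is an example choice
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": build_fixup_prompt(transcript, glossary)}],
    )
    return resp.choices[0].message.content
```

Passing a per-company glossary of product and people names is what makes this pass pay off; without it the LLM has no more context than Whisper did.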
Implementation Timeline
Audio capture and file import → server transcription (Whisper/Deepgram) with progress → formatting and post-processing → mobile UI with interactive transcript → optional: streaming and local SFSpeechRecognizer for iOS.
Basic transcription via Whisper with a simple text display: 1–2 weeks. A full tool with interactive text, timestamps, and post-processing: 3–4 weeks.