AI Camera Translation (AR Translation) in Mobile Apps
Google Translate's instant camera translation is AR translation in action: the camera sees text, and a translation appears in real time, overlaid on the image as if printed there natively. Implementing it independently is harder than it looks: it takes OCR, translation, inpainting the background beneath the erased source text, and rendering the new text at a matching font and size.
AR Translation Pipeline Architecture
Each camera frame passes through multiple stages:
Frame → Text Detection → OCR → Translation → Inpainting → Text Overlay → Render
Text Detection. Find text bounding boxes in the frame. On iOS: VNRecognizeTextRequest (Vision framework) with recognitionLevel: .fast for real-time use. On Android: ML Kit Text Recognition v2. Both work on-device, no network required. Vision returns VNRecognizedTextObservation with a bounding box in normalized coordinates — convert to screen coordinates, accounting for buffer orientation.
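The coordinate conversion is easy to get wrong: Vision's normalized rectangles have their origin at the bottom-left, while screen coordinates start at the top-left, so the y-axis must be flipped. A platform-neutral sketch of the math (Python for illustration; in the app the same arithmetic runs in Swift or Kotlin):

```python
def vision_rect_to_pixels(norm_rect, image_width, image_height):
    """Convert a Vision-style normalized bounding box (origin at the
    bottom-left, values in 0..1) to top-left-origin pixel coordinates."""
    x, y, w, h = norm_rect
    px = x * image_width
    # Flip the y-axis: Vision's origin is bottom-left, the screen's is top-left.
    py = (1.0 - y - h) * image_height
    return (px, py, w * image_width, h * image_height)
```

On iOS, `VNImageRectForNormalizedRect` does this scaling for you, but the y-flip for UIKit coordinates still has to be applied by hand.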
OCR. VNRecognizeTextRequest with recognitionLevel: .accurate is too slow for every frame. Strategy: use .fast for detection, .accurate only when text stabilizes (user tap or stationary phone). Stable frame detection: compare bounding boxes between frames — if deviation < 5px → text is stable → run accurate OCR.
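The stability check described above can be sketched platform-neutrally (Python for illustration; the 5 px tolerance is the value suggested in the text and should be tuned per device):

```python
def boxes_stable(prev_boxes, curr_boxes, tolerance_px=5.0):
    """Return True when every bounding box moved less than `tolerance_px`
    on every coordinate since the previous frame -- i.e. the text is
    stable enough to justify the slower accurate-OCR pass."""
    if len(prev_boxes) != len(curr_boxes):
        return False  # text appeared or disappeared: not stable
    for (px, py, pw, ph), (cx, cy, cw, ch) in zip(prev_boxes, curr_boxes):
        if max(abs(px - cx), abs(py - cy),
               abs(pw - cw), abs(ph - ch)) >= tolerance_px:
            return False
    return True
```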
Translation. Two options:
| | On-device (ML Kit Translate) | Cloud API (DeepL, Google Cloud) |
|---|---|---|
| Latency | 10–50 ms | 200–800 ms |
| Quality | Adequate | High (DeepL especially) |
| Offline | Yes (~30 MB model) | No |
| Cost | Free | Per request |
For a live camera stream, on-device translation is the only workable option. For a "photograph → translate" mode, a cloud API (DeepL in particular) gives better quality.
Inpainting and Text Overlay — Most Complex Part
Simple approach: draw a background-colored rectangle over the source text and write the translation on top. The result is a crude white rectangle that doesn't blend into the image. The correct approach:
Background Color Detection. Sample pixels around bounding box, compute median color — fill rectangle with it. Works for uniform backgrounds (white wall, paper sheet).
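A minimal sketch of the border-sampling idea (Python for illustration; `pixels` as a plain list of RGB rows is a stand-in for a real frame buffer):

```python
from statistics import median

def border_median_color(pixels, box, margin=3):
    """Estimate the background color behind a text box by taking the
    per-channel median of the pixels in a thin ring around it.
    `pixels` is a row-major list of rows of (r, g, b) tuples;
    `box` is (x, y, w, h) in pixel coordinates."""
    x, y, w, h = box
    samples = []
    for row in range(max(0, y - margin), min(len(pixels), y + h + margin)):
        for col in range(max(0, x - margin), min(len(pixels[0]), x + w + margin)):
            inside = x <= col < x + w and y <= row < y + h
            if not inside:  # sample only the ring, never the text itself
                samples.append(pixels[row][col])
    return tuple(int(median(ch)) for ch in zip(*samples))
```

The median (rather than the mean) keeps stray dark pixels from the text's anti-aliased edges from tinting the fill color.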
Texture Inpainting for Complex Backgrounds. A custom Core Image kernel or a lightweight inpainting model (e.g. via Core ML) can fill the region with surrounding texture. Too slow for real time — use it only in static photo mode.
Font Matching. Determine source text size from bounding box, select UIFont / TextPaint with similar size. Identifying specific font from OCR result — unsolved for most cases. Use system sans-serif.
Right-to-Left (RTL) Languages. Arabic and Hebrew flow right-to-left: set semanticContentAttribute = .forceRightToLeft on UILabel (on Android, textDirection = View.TEXT_DIRECTION_RTL on TextView). When drawing onto the image, set NSMutableParagraphStyle.baseWritingDirection = .rightToLeft.
Stabilization and Performance
Running the full pipeline on every frame at 30 FPS is not feasible. Throttle each stage:
- Text detection: every 3–5 frames
- OCR: only on stabilization or tap
- Translation: debounce 500 ms on text change
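The debounce step can be sketched as a small state machine (Python for illustration; the 500 ms delay is the value suggested above, and the explicit `now` parameter exists only to make the logic testable):

```python
import time

class Debouncer:
    """Emit a text exactly once, after it has stopped changing for
    `delay` seconds -- e.g. re-translate only once OCR output settles."""

    def __init__(self, delay=0.5):
        self.delay = delay
        self._last_text = None
        self._last_change = 0.0
        self._fired = False

    def feed(self, text, now=None):
        """Call on every OCR result. Returns the text once per stable
        value, or None while the text is still changing or already sent."""
        now = time.monotonic() if now is None else now
        if text != self._last_text:
            self._last_text = text
            self._last_change = now
            self._fired = False
            return None
        if not self._fired and now - self._last_change >= self.delay:
            self._fired = True
            return text
        return None
```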
On iPhone 12 and later, the Vision pipeline runs hardware-accelerated (GPU / Neural Engine) automatically. On Android, TensorFlow Lite's GPU delegate can accelerate custom models; ML Kit's bundled models manage acceleration internally.
Cache results by OCR text hash: don't translate same text twice in session.
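A session cache keyed by a hash of the recognized text might look like this (Python sketch; `translate_fn` is a placeholder for whichever backend you wire in, and the strip/lowercase normalization is an assumption that helps dedupe jittery OCR output):

```python
import hashlib

class TranslationCache:
    """Memoize translations by a hash of the recognized text, so the
    same sign is never sent to the translator twice in one session."""

    def __init__(self, translate_fn):
        self._translate = translate_fn  # placeholder for the real backend
        self._cache = {}

    def get(self, text):
        # Normalize before hashing so minor OCR jitter hits the same key.
        key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._translate(text)
        return self._cache[key]
```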
What's Included
- Architecture selection: on-device vs cloud, live camera vs photo mode
- OCR + translation pipeline implementation
- UI for language selection (with source language auto-detection)
- Translation text overlay on image
- Offline mode with downloadable language models (ML Kit)
Timeline: basic AR translation for static photos takes 3–5 weeks; real-time live-camera translation with on-device ML and offline mode takes 6–10 weeks. Cost is estimated per project.







