OCR and Text Recognition Implementation in Mobile Applications
OCR is one of the most mature mobile ML tasks, with solid ready-made tools. Native solutions (Vision on iOS, ML Kit on Android) cover most cases. Complexity starts where the text is non-standard: handwriting, faded receipts, reflections, perspective distortion.
Tool Selection
iOS Vision Framework — VNRecognizeTextRequest. Fully on-device, supports 18+ languages including Cyrillic. recognitionLevel = .accurate gives the best quality; recognitionLevel = .fast is 2–3x faster. On an iPhone 12 at .accurate: 180–350 ms for a photo of an A4 page.
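A minimal Vision sketch of this request (error handling trimmed; the function name and language list are illustrative, not from a specific project):

```swift
import Vision

// Sketch: recognize text lines in a CGImage, fully on-device.
func recognizeText(in image: CGImage, completion: @escaping ([String]) -> Void) {
    let request = VNRecognizeTextRequest { request, _ in
        let observations = request.results as? [VNRecognizedTextObservation] ?? []
        // Take the top candidate string for each detected text line.
        let lines = observations.compactMap { $0.topCandidates(1).first?.string }
        completion(lines)
    }
    request.recognitionLevel = .accurate          // or .fast for a 2-3x speedup
    request.recognitionLanguages = ["en-US", "ru-RU"]
    request.usesLanguageCorrection = true

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try? handler.perform([request])
}
```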
ML Kit Text Recognition v2 — cross-platform (iOS + Android), on-device. Supports Latin, Cyrillic, Devanagari, CJK characters. Android via TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS).
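On iOS the same recognizer is invoked through the MLKitTextRecognition pod; roughly like this (a sketch, so check module names against the current ML Kit docs):

```swift
import UIKit
import MLKitVision
import MLKitTextRecognition

// Sketch: run ML Kit Text Recognition v2 on a UIImage, on-device.
func recognize(_ uiImage: UIImage, completion: @escaping (String) -> Void) {
    let visionImage = VisionImage(image: uiImage)
    visionImage.orientation = uiImage.imageOrientation

    let recognizer = TextRecognizer.textRecognizer(options: TextRecognizerOptions())
    recognizer.process(visionImage) { result, error in
        guard let result = result else { return }
        // result.blocks -> lines -> elements, each carrying text and a frame.
        completion(result.text)
    }
}
```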
Tesseract via SwiftyTesseract (iOS) or tess-two (Android)—for when you need custom training on a specific font or language. 3–5x slower than the native APIs, but more flexible.
For standard tasks (documents, business cards, price tags), Vision / ML Kit are sufficient. For specialized ones (medical forms with non-standard fonts), use Tesseract with a fine-tuned model.
Preprocessing: Critical for 40% of Accuracy
VNRecognizeTextRequest and ML Kit accept CGImage / InputImage—but input image quality is critical.
Typical preprocessing pipeline:
- Grayscale conversion—reduces noise from JPEG color artifacts
- Brightness/contrast correction via CIColorControls (iOS) or ColorMatrix (Android)
- Binarization (Otsu threshold)—helps with uneven lighting
- Deskew—perspective and rotation correction
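The binarization step can be sketched without any platform dependency. Otsu's method picks the threshold that maximizes between-class variance of the grayscale histogram (the helper name here is my own):

```swift
// Otsu's method: given a 256-bin grayscale histogram, return the
// threshold t that maximizes between-class variance. Pixels <= t
// are treated as background, the rest as foreground.
func otsuThreshold(histogram: [Int]) -> Int {
    let total = histogram.reduce(0, +)
    guard total > 0 else { return 0 }
    let sumAll = histogram.enumerated().reduce(0.0) { $0 + Double($1.offset * $1.element) }

    var sumB = 0.0          // weighted sum of the background class
    var wB = 0              // background pixel count
    var best = 0
    var maxVariance = -1.0

    for t in 0..<histogram.count {
        wB += histogram[t]
        if wB == 0 { continue }
        let wF = total - wB
        if wF == 0 { break }
        sumB += Double(t * histogram[t])
        let meanB = sumB / Double(wB)
        let meanF = (sumAll - sumB) / Double(wF)
        let between = Double(wB) * Double(wF) * (meanB - meanF) * (meanB - meanF)
        if between > maxVariance {
            maxVariance = between
            best = t
        }
    }
    return best
}
```

Apply the returned threshold per pixel to get a clean black-and-white image before handing it to the recognizer.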
Perspective correction (document shot at angle): iOS VNDetectRectanglesRequest finds document contour, CIPerspectiveCorrection straightens. Android—similar via Bitmap + Matrix.setPolyToPoly.
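The iOS half of that pipeline fits in one function; a sketch (Vision returns normalized coordinates, and both Vision and Core Image use a bottom-left origin, so a plain scale is enough):

```swift
import Vision
import CoreImage

// Sketch: find the document rectangle and straighten it.
func deskewDocument(_ ciImage: CIImage) -> CIImage? {
    let request = VNDetectRectanglesRequest()
    request.minimumConfidence = 0.8

    let handler = VNImageRequestHandler(ciImage: ciImage, options: [:])
    try? handler.perform([request])
    guard let rect = request.results?.first else { return nil }

    // Scale Vision's normalized corners to pixel coordinates.
    let size = ciImage.extent.size
    func corner(_ p: CGPoint) -> CIVector {
        CIVector(x: p.x * size.width, y: p.y * size.height)
    }

    return ciImage.applyingFilter("CIPerspectiveCorrection", parameters: [
        "inputTopLeft": corner(rect.topLeft),
        "inputTopRight": corner(rect.topRight),
        "inputBottomLeft": corner(rect.bottomLeft),
        "inputBottomRight": corner(rect.bottomRight),
    ])
}
```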
Case: shipping invoice scanning app. ML Kit v2 without preprocessing gave 78% accuracy in field conditions (warehouse lighting, creased paper). After Otsu binarization + perspective correction—94%. It especially helped with invoice numbers printed in a dot-matrix font.
Real-Time vs Photo Recognition
For real-time recognition (point the camera and text is recognized on the fly, like Google Lens), adapt the pipeline:
- Lower resolution to 720p or less
- iOS: run VNRecognizeTextRequest in a VNSequenceRequestHandler every 3–5 frames, not on every frame
- Buffer results: show the previous result while the new frame is being inferred
- Stabilize text between frames: compare bounding-box IoU; if it is above 0.7, treat it as the same text
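The IoU check itself is a few lines; a dependency-free sketch:

```swift
import Foundation

// Intersection-over-Union of two bounding boxes. A value above ~0.7
// across consecutive frames suggests the same piece of text.
func iou(_ a: CGRect, _ b: CGRect) -> CGFloat {
    let inter = a.intersection(b)
    if inter.isNull || inter.isEmpty { return 0 }
    let interArea = inter.width * inter.height
    let unionArea = a.width * a.height + b.width * b.height - interArea
    return unionArea > 0 ? interArea / unionArea : 0
}
```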
On Android, ML Kit in STREAM_MODE throttles frame processing itself, so it doesn't overload the pipeline.
Post-Processing: Text ≠ Data
Recognizing text and extracting useful data are different tasks.
For phone numbers, email, dates—use NSDataDetector (iOS) or Patterns (Android) on recognized text. For structured documents (tax IDs, passport numbers)—regex with checksum verification.
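As a concrete example of checksum verification, the Luhn algorithm (used by payment card numbers) is a cheap post-OCR sanity check: single-digit misreads such as 8 → 0 fail it. A sketch with an illustrative function name:

```swift
// Luhn checksum: doubling every second digit from the right,
// subtracting 9 from results above 9, the total must be divisible by 10.
func passesLuhn(_ digits: String) -> Bool {
    let nums = digits.compactMap { $0.wholeNumberValue }
    guard !nums.isEmpty, nums.count == digits.count else { return false }
    let sum = nums.reversed().enumerated().reduce(0) { acc, pair in
        let (index, digit) = pair
        let value = index % 2 == 1 ? digit * 2 : digit
        return acc + (value > 9 ? value - 9 : value)
    }
    return sum % 10 == 0
}
```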
For tables and forms: ML Kit v2 returns TextBlock → TextLine → TextElement with coordinates of each. Group by line Y-coordinate (±5px) to reconstruct table structure.
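That grouping step can be sketched as follows; OCRElement here is a stand-in for ML Kit's TextElement with the center of its frame (names and the tolerance default are my own):

```swift
// Stand-in for a recognized element: its text and frame-center coordinates.
struct OCRElement {
    let text: String
    let x: Double
    let y: Double
}

// Group elements into table rows: an element joins the previous row if its
// Y center is within `tolerance` of that row's anchor element.
func groupIntoRows(_ elements: [OCRElement], tolerance: Double = 5) -> [[OCRElement]] {
    var rows: [[OCRElement]] = []
    for element in elements.sorted(by: { $0.y < $1.y }) {
        if let anchor = rows.last?.first, abs(anchor.y - element.y) <= tolerance {
            rows[rows.count - 1].append(element)
        } else {
            rows.append([element])
        }
    }
    // Within each row, sort left-to-right to restore column order.
    return rows.map { $0.sorted { $0.x < $1.x } }
}
```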
Timeline
OCR for photos with preprocessing and data post-processing: 3–5 business days. Full document scanner with real-time mode, perspective correction, and export: 1–2 weeks. Cost calculated individually.