Implementing OCR (Text Recognition) via Camera in Mobile Applications
The user points the camera at a price tag, receipt, contract, or sign, and the app instantly recognizes the text. The gap between "works in a demo" and "works in production" is enormous here: real-world conditions mean poor lighting, tilted text, handwritten elements, and several languages in one frame.
Native OCR Frameworks Without External Dependencies
iOS: Vision + VNRecognizeTextRequest
Since iOS 13, the Vision framework can recognize text offline. VNRecognizeTextRequest supports two modes: .fast (approximate, instant) and .accurate (slower but significantly more accurate for complex fonts).
func recognizeText(in image: UIImage) {
    guard let cgImage = image.cgImage else { return }

    let request = VNRecognizeTextRequest { [weak self] request, error in
        guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
        let text = observations
            .compactMap { $0.topCandidates(1).first?.string }
            .joined(separator: "\n")
        DispatchQueue.main.async { self?.handleRecognized(text: text) }
    }
    request.recognitionLevel = .accurate
    request.usesLanguageCorrection = true
    request.recognitionLanguages = ["ru-RU", "en-US"] // order = priority

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}
usesLanguageCorrection helps with typos, but it sometimes "corrects" abbreviations and SKU codes; for technical documents it is better disabled.
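For such documents the configuration is the opposite of the defaults shown above; a minimal sketch (result handling omitted):

```swift
import Vision

// Configuration sketch for technical documents (invoices, part lists),
// where a literal transcription matters more than dictionary words.
let request = VNRecognizeTextRequest { request, error in
    // handle results as in the example above
}
request.recognitionLevel = .accurate
request.usesLanguageCorrection = false // keep SKUs and codes verbatim
```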
Android: ML Kit Text Recognition v2
com.google.mlkit:text-recognition recognizes Latin script; Chinese, Devanagari, Japanese, and Korean come via separate modules (text-recognition-chinese and so on). The model downloads on first use (~5 MB for Latin). Note there is no dedicated on-device Cyrillic module, so for Russian text a cloud API may be required.
val recognizer = TextRecognition.getClient(
    TextRecognizerOptions.DEFAULT_OPTIONS // or ChineseTextRecognizerOptions, KoreanTextRecognizerOptions, etc.
)

val image = InputImage.fromBitmap(bitmap, 0)
recognizer.process(image)
    .addOnSuccessListener { visionText ->
        val fullText = visionText.textBlocks
            .joinToString("\n") { block -> block.text }
        handleRecognized(fullText)
    }
    .addOnFailureListener { e -> handleError(e) }
ML Kit also returns a bounding box for each text block, which is useful for highlighting recognized areas in the UI.
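On iOS, Vision exposes the same geometry through each observation's boundingBox, normalized to 0...1 with the origin at the bottom-left, so it has to be flipped before drawing over a UIKit view. A sketch:

```swift
import Vision
import UIKit

// Convert a Vision observation's normalized bounding box (origin at
// bottom-left) into a CGRect in UIKit coordinates (origin at top-left).
func uiRect(for observation: VNRecognizedTextObservation,
            in viewSize: CGSize) -> CGRect {
    let box = observation.boundingBox // normalized 0...1
    return CGRect(
        x: box.origin.x * viewSize.width,
        y: (1 - box.origin.y - box.height) * viewSize.height,
        width: box.width * viewSize.width,
        height: box.height * viewSize.height
    )
}
```

Vision also ships VNImageRectForNormalizedRect for the scaling part, but the vertical flip is still your responsibility.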
Live Mode: Text in Real-Time From Video Stream
For live overlay (text highlighted directly in video stream), on iOS use AVCaptureSession + CMSampleBuffer:
// AVCaptureVideoDataOutput delegate method
func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

    // Don't start a new request if the previous one hasn't completed
    guard !isProcessing else { return }
    isProcessing = true

    let request = VNRecognizeTextRequest { [weak self] request, _ in
        defer { self?.isProcessing = false }
        // process results...
    }
    request.recognitionLevel = .fast // for live recognition, speed matters

    try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:]).perform([request])
}
The isProcessing flag is mandatory: without it, at 30 FPS requests queue up faster than they complete, and memory grows until the app crashes.
On Android, use CameraX with an ImageAnalysis.Analyzer. ML Kit is optimized to consume the ImageProxy frame directly, without an intermediate Bitmap conversion.
Post-processing: From Raw Text to Structured Data
The raw OCR result is a stream of lines. Most tasks need structuring on top:
- Receipts: extract price lines via regex, parse the total amount
- Business cards: NSDataDetector (iOS) or android.util.Patterns (Android) for phones, emails, addresses
- Passports/documents: read the MRZ zone per the ICAO 9303 standard; ready-made parsers exist
- License plates: a separate task, better handled by a specialized model (OpenALPR, PlateRecognizer API)
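As an illustration of the receipt case, a minimal Swift sketch; the line format and regex here are assumptions, since real receipts vary by locale and retailer:

```swift
import Foundation

// Extract (name, price) pairs from OCR lines that end with a price,
// e.g. "Milk 3.2% 89.99". Format assumptions: the price is the last
// token, with '.' or ',' as the decimal separator.
func parseReceiptItems(_ lines: [String]) -> [(name: String, price: Double)] {
    let pattern = #"^(.+?)\s+(\d+[.,]\d{2})\s*$"#
    let regex = try! NSRegularExpression(pattern: pattern)
    return lines.compactMap { line in
        let range = NSRange(line.startIndex..., in: line)
        guard let match = regex.firstMatch(in: line, range: range),
              let nameRange = Range(match.range(at: 1), in: line),
              let priceRange = Range(match.range(at: 2), in: line),
              let price = Double(line[priceRange].replacingOccurrences(of: ",", with: "."))
        else { return nil }
        return (String(line[nameRange]), price)
    }
}
```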
For low-quality Cyrillic text, image preprocessing sometimes helps: a contrast stretch via vImageContrastStretch, grayscale conversion, or a sharpen CIFilter applied before the frame is passed to OCR.
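A Core Image preprocessing step might look like the sketch below; the filter parameters are starting points to tune against your own samples, not recommendations:

```swift
import CoreImage

// Grayscale + sharpen an image before OCR. CIColorControls with
// saturation 0 removes color; CISharpenLuminance emphasizes edges.
func preprocessForOCR(_ input: CIImage, context: CIContext) -> CGImage? {
    let gray = input.applyingFilter("CIColorControls", parameters: [
        kCIInputSaturationKey: 0.0,
        kCIInputContrastKey: 1.2 // mild contrast boost; tune per source
    ])
    let sharpened = gray.applyingFilter("CISharpenLuminance", parameters: [
        kCIInputSharpnessKey: 0.5
    ])
    return context.createCGImage(sharpened, from: sharpened.extent)
}
```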
Workflow
1. Define use cases: document types, languages, and whether live mode is needed or static photos are enough.
2. Implement image capture (camera + gallery) and preprocessing.
3. Integrate OCR: native Vision/ML Kit, or a cloud service (Google Vision API, AWS Textract) if higher accuracy on complex documents is needed.
4. Post-process for the specific task: data structuring, regex, NER.
5. Test on real samples under different lighting conditions.
Timeline Guidelines
Basic static text recognition via the native framework takes 2–3 days. Live mode with an overlay plus data structuring for a specific document type takes 1–2 weeks.