Implementing Document Scanning via Camera in Mobile Applications
Scanning a document with a smartphone camera looks simple but hides a dozen technical nuances. Perspective distortion, finger shadows, reflective surfaces, hand tremor during capture — all of it must be handled before the result reaches the user.
Detecting Document Boundaries
First step — find the four corners of the document in the frame. On iOS this is done with the Vision framework (iOS 11+) via VNDetectRectanglesRequest:
let request = VNDetectRectanglesRequest { request, error in
    guard let results = request.results as? [VNRectangleObservation],
          let rect = results.first else { return }
    // rect.topLeft, .topRight, .bottomLeft, .bottomRight in normalized coordinates [0, 1]
    DispatchQueue.main.async {
        self.overlayView.drawQuadrilateral(observation: rect,
                                           imageSize: self.previewLayer.frame.size)
    }
}
request.minimumConfidence = 0.8
request.minimumAspectRatio = 0.5  // filter out narrow rectangles
request.quadratureTolerance = 30  // allowed deviation from 90° corners, in degrees
// Run the request against a camera frame:
try? VNImageRequestHandler(cvPixelBuffer: pixelBuffer).perform([request])
Since iOS 13, VisionKit provides VNDocumentCameraViewController — a ready-made interface from Apple with automatic capture, perspective correction and multi-page scanning. For most tasks, this is the optimal choice.
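A minimal sketch of the VisionKit flow (the host class name and the post-processing handoff are illustrative):

```swift
import UIKit
import VisionKit

final class ScannerHostViewController: UIViewController, VNDocumentCameraViewControllerDelegate {

    // Present Apple's ready-made scanner UI (iOS 13+).
    func startScanning() {
        guard VNDocumentCameraViewController.isSupported else { return }
        let scanner = VNDocumentCameraViewController()
        scanner.delegate = self
        present(scanner, animated: true)
    }

    // Called when the user taps "Save": the scan already has perspective
    // correction applied, one image per page.
    func documentCameraViewController(_ controller: VNDocumentCameraViewController,
                                      didFinishWith scan: VNDocumentCameraScan) {
        var pages: [UIImage] = []
        for index in 0..<scan.pageCount {
            pages.append(scan.imageOfPage(at: index))
        }
        controller.dismiss(animated: true)
        // hand `pages` off to post-processing / PDF assembly
    }

    func documentCameraViewControllerDidCancel(_ controller: VNDocumentCameraViewController) {
        controller.dismiss(animated: true)
    }
}
```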
On Android — ML Kit Document Scanner API (beta, available via Google Play Services) or OpenCV via NDK for custom solutions.
Perspective Correction
After detecting the four corners, apply a homography (perspective transform) — align the tilted document to a rectangle, "as if shot from above". On iOS this is CIPerspectiveCorrection from Core Image:
import CoreImage
import CoreImage.CIFilterBuiltins

func correctPerspective(image: CIImage, observation: VNRectangleObservation) -> CIImage {
    let imageSize = image.extent.size
    // Convert normalized Vision coordinates to CIImage pixel coordinates.
    // The CIFilterBuiltins perspectiveCorrection API takes CGPoint, not CIVector.
    func toPixel(_ point: CGPoint) -> CGPoint {
        return CGPoint(x: point.x * imageSize.width,
                       y: point.y * imageSize.height)
    }
    let filter = CIFilter.perspectiveCorrection()
    filter.inputImage = image
    filter.topLeft = toPixel(observation.topLeft)
    filter.topRight = toPixel(observation.topRight)
    filter.bottomLeft = toPixel(observation.bottomLeft)
    filter.bottomRight = toPixel(observation.bottomRight)
    return filter.outputImage ?? image
}
Important: Vision and Core Image both use a lower-left origin, so the conversion above works directly — but UIKit uses a top-left origin. When drawing observation corners onto a UIView you must flip the Y axis, otherwise the overlay appears mirrored vertically. This is a classic mistake in first implementations.
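The Y-flip can be isolated in a small helper (a sketch; `viewSize` is assumed to match the preview layer's bounds — with AVCaptureVideoPreviewLayer you can also use its own layerPointConverted conversion methods):

```swift
import CoreGraphics

// Map a Vision normalized point (origin bottom-left, [0, 1]) into
// UIKit view coordinates (origin top-left, points).
func visionToUIKit(_ point: CGPoint, viewSize: CGSize) -> CGPoint {
    return CGPoint(x: point.x * viewSize.width,
                   y: (1 - point.y) * viewSize.height)
}
```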
Image Post-processing
Scanned document after geometric correction usually needs enhancement:
Grayscale + contrast boost — for text recognition, archive documents:
let grayscaleFilter = CIFilter.colorControls()
grayscaleFilter.inputImage = correctedImage
grayscaleFilter.saturation = 0    // drop color
grayscaleFilter.contrast = 1.3    // boost contrast
let enhanced = grayscaleFilter.outputImage
Adaptive thresholding — the "black and white" effect as in Adobe Scan. Core Image lacks a built-in adaptive threshold, so use a custom CIKernel or Metal compute shader that thresholds each pixel against its local neighborhood (e.g. a 15×15 block).
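One common approximation, sketched below: estimate the local mean with a Gaussian blur, then mark a pixel as ink if it is darker than that mean. The CIKernel-language string API is older (Apple now recommends Metal-based kernels) but still works; the radius 8 and 0.06 offset are assumed starting values to tune, and the kernel assumes a grayscale input (it reads only the red channel).

```swift
import CoreImage
import CoreImage.CIFilterBuiltins

// Approximate adaptive thresholding: compare each pixel with a
// Gaussian-blurred copy of the image (the blur acts as the local mean).
func adaptiveThreshold(_ input: CIImage) -> CIImage {
    let blur = CIFilter.gaussianBlur()
    blur.inputImage = input.clampedToExtent()  // avoid dark halo at edges
    blur.radius = 8
    guard let localMean = blur.outputImage?.cropped(to: input.extent) else { return input }

    // Older CIKernel-language color kernel; in production prefer a
    // Metal-based CIColorKernel.
    guard let kernel = CIColorKernel(source:
        """
        kernel vec4 thresh(__sample s, __sample m) {
            float v = s.r > m.r - 0.06 ? 1.0 : 0.0;
            return vec4(v, v, v, 1.0);
        }
        """) else { return input }
    return kernel.apply(extent: input.extent,
                        arguments: [input, localMean]) ?? input
}
```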
Document enhancement — on iOS 17+, VNGeneratePersonInstanceMaskRequest produces per-person segmentation masks, which can help mask out a hand or fingers overlapping the page; it does not remove shadows by itself. For shadow removal and highlight recovery — GPUImage3 or a custom Metal shader, on earlier versions as well.
Multi-page Scanning and PDF
The user scans multiple pages, which are combined into one document. On iOS — PDFDocument + PDFPage from PDFKit:
func createPDF(from images: [UIImage]) -> Data? {
    let pdfDocument = PDFDocument()
    for (index, image) in images.enumerated() {
        guard let page = PDFPage(image: image) else { continue }
        pdfDocument.insert(page, at: index)
    }
    return pdfDocument.dataRepresentation()
}
PDF size matters: an A4 page at 300 DPI is ~2480×3508 px. For storage and transfer, compress the JPEG inside the PDF with quality 0.7–0.85; for OCR tasks, keep the original resolution.
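PDFPage(image:) gives no control over compression, so one option is to draw re-encoded images with UIGraphicsPDFRenderer instead. A sketch, assuming A4 page geometry and quality 0.8 (both values to adjust; the exact embedded size also depends on how the PDF context encodes the drawn image):

```swift
import UIKit

// Build a PDF where each page draws a JPEG-recompressed image.
func createCompressedPDF(from images: [UIImage],
                         jpegQuality: CGFloat = 0.8) -> Data {
    let a4 = CGRect(x: 0, y: 0, width: 595, height: 842) // A4 in points (72 DPI)
    let renderer = UIGraphicsPDFRenderer(bounds: a4)
    return renderer.pdfData { context in
        for image in images {
            context.beginPage()
            // JPEG round-trip to shrink the pixel data before drawing.
            guard let data = image.jpegData(compressionQuality: jpegQuality),
                  let compressed = UIImage(data: data) else { continue }
            compressed.draw(in: a4)
        }
    }
}
```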
Automatic vs Manual Capture
An auto trigger on document detection is good UX but requires stabilization: the document should stay in frame with sufficient confidence for at least ~1.5 seconds before auto-capture. A trigger that is too aggressive annoys users — the phone is still being aligned when the app has already taken the shot.
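The debounce logic itself is a few lines of state. A sketch (class name and thresholds are illustrative): feed it one confidence value per frame, and it fires only after the detector has stayed confident continuously for the hold duration.

```swift
import Foundation

// Auto-capture gate: returns true only after the rectangle detector has
// reported confidence >= minConfidence continuously for holdDuration seconds.
final class AutoCaptureGate {
    private let holdDuration: TimeInterval
    private let minConfidence: Float
    private var confidentSince: TimeInterval?

    init(holdDuration: TimeInterval = 1.5, minConfidence: Float = 0.8) {
        self.holdDuration = holdDuration
        self.minConfidence = minConfidence
    }

    // Call once per frame; returns true when it is time to capture.
    func shouldCapture(confidence: Float, at time: TimeInterval) -> Bool {
        guard confidence >= minConfidence else {
            confidentSince = nil  // lost the document — restart the timer
            return false
        }
        let start = confidentSince ?? time
        confidentSince = start
        return time - start >= holdDuration
    }
}
```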
Workflow
1. Define the scenario: document types (passport, receipt, contract, multi-page materials), whether PDF export is needed, OCR integration.
2. Implement the boundary detector and a live preview that highlights the found document.
3. Add perspective correction and image post-processing.
4. Add multi-page mode and export to PDF or JPEG.
5. Test in real conditions: different lighting, various paper types (glossy, matte, old documents).
Timeline Guidelines
Basic scanner with VNDocumentCameraViewController (iOS only) — 1–2 days. Custom implementation with manual perspective correction, post-processing and multi-page PDF — 3–5 days per platform.