Image Recognition Bot Implementation in Mobile Applications
The user takes a photo, the bot responds. It sounds simple, but between "attach photo" and "get a useful answer" lie model choice, request size management, and handling cases where the image doesn't contain the expected content.
Vision API: What to Use
GPT-4o Vision (OpenAI). Send the image as base64 or a URL in the request, receive a text response. Understands complex scenes, documents, handwriting, and diagrams. Cost depends on image size (tile-based pricing); detailed high-resolution analysis costs more.
Claude 3.5 Sonnet / Haiku. Similar capabilities via the Anthropic Messages API. Claude works well with documents and tables and shows results comparable to GPT-4o on most tasks.
Google Cloud Vision API. Specialized functions: OCR (TEXT_DETECTION), object recognition (OBJECT_LOCALIZATION), faces (FACE_DETECTION), logos (LOGO_DETECTION), content safety (SAFE_SEARCH_DETECTION). Cheaper than an LLM for homogeneous tasks, but no free-form text response.
ML Kit (Google), on-device. Runs completely on the device: text recognition, barcodes, faces, objects. No network latency, no per-request cost. Accuracy is lower than cloud LLMs on complex scenes, but for structured tasks (QR codes, barcodes, document text) it is sufficient.
CoreML + Vision (iOS). MobileNetV3 or EfficientNet for on-device image classification; VNRecognizeTextRequest for OCR; VNDetectBarcodesRequest for QR and barcodes.
Choice depends on task:
| Task | Recommended Solution |
|---|---|
| Free-form question about photo | GPT-4o Vision / Claude |
| Document OCR | Google Vision API / ML Kit |
| Barcodes and QR codes | ML Kit / CoreML (on-device) |
| Product classification | Custom CoreML / TFLite model |
| Content moderation | Google Vision SAFE_SEARCH |
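The tile-based pricing mentioned for GPT-4o can be estimated up front. The sketch below follows OpenAI's published formula at the time of writing (fit into a 2048×2048 square, shrink the shortest side to at most 768 px, then count 512-px tiles at 170 tokens each plus 85 base tokens); these constants may change, so verify against the current docs.

```kotlin
import kotlin.math.ceil
import kotlin.math.roundToInt

// Rough per-image token estimate for GPT-4o in "high" detail mode.
// Constants follow OpenAI's published formula at the time of writing;
// treat them as assumptions and re-check against current documentation.
fun estimateImageTokens(width: Int, height: Int): Int {
    var w = width
    var h = height
    // Step 1: fit within a 2048x2048 square.
    if (maxOf(w, h) > 2048) {
        val f = 2048.0 / maxOf(w, h)
        w = (w * f).roundToInt()
        h = (h * f).roundToInt()
    }
    // Step 2: scale so the shortest side is at most 768 px.
    if (minOf(w, h) > 768) {
        val f = 768.0 / minOf(w, h)
        w = (w * f).roundToInt()
        h = (h * f).roundToInt()
    }
    // Step 3: 512-px tiles, 170 tokens each, plus 85 base tokens.
    val tiles = ceil(w / 512.0).toInt() * ceil(h / 512.0).toInt()
    return 85 + 170 * tiles
}
```

Under this formula a 4032×3024 original and a 1024×768 compressed copy cost the same number of tokens, so compressing on the device saves upload bandwidth without changing the bill.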
Sending Images from Mobile Application
Images are not sent to a Vision API directly from the mobile client: the API key cannot be stored in the app.
Data flow:
Mobile Client → Resize/Compress → Upload to S3/GCS → URL → Your Server → Vision API
The image is compressed on the device to an appropriate size before upload. GPT-4o with detail: "auto" chooses the needed resolution itself, but sending an uncompressed 12-megapixel photo is wasteful and expensive.
// Android: compress the image before upload.
// MediaStore.Images.Media.getBitmap is deprecated; ImageDecoder replaces it on API 28+.
fun compressForBot(resolver: ContentResolver, uri: Uri, maxSizePx: Int = 1024): ByteArray {
    val bitmap = if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.P) {
        ImageDecoder.decodeBitmap(ImageDecoder.createSource(resolver, uri)) { decoder, _, _ ->
            // Software allocation: hardware bitmaps cannot be compressed.
            decoder.allocator = ImageDecoder.ALLOCATOR_SOFTWARE
        }
    } else {
        @Suppress("DEPRECATION")
        MediaStore.Images.Media.getBitmap(resolver, uri)
    }
    val scale = maxSizePx.toFloat() / maxOf(bitmap.width, bitmap.height)
    val scaled = if (scale < 1f) {
        Bitmap.createScaledBitmap(
            bitmap,
            (bitmap.width * scale).toInt(),
            (bitmap.height * scale).toInt(),
            true // bilinear filtering
        )
    } else bitmap
    val output = ByteArrayOutputStream()
    scaled.compress(Bitmap.CompressFormat.JPEG, 85, output)
    return output.toByteArray()
}
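On the server, the Vision API request is then assembled. A minimal sketch of the request body for GPT-4o Vision, assuming the image is already uploaded and reachable by URL; field names follow the OpenAI Chat Completions API, while the HTTP POST itself (with the Authorization header) is omitted, as is proper JSON escaping of the prompt.

```kotlin
// Sketch: build the Chat Completions body for a vision request.
// Assumes prompt and URL contain no characters that need JSON escaping;
// a real implementation should use a JSON library.
fun visionRequestBody(imageUrl: String, prompt: String, detail: String = "auto"): String = """
{
  "model": "gpt-4o",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "$prompt"},
      {"type": "image_url", "image_url": {"url": "$imageUrl", "detail": "$detail"}}
    ]
  }],
  "max_tokens": 500
}
""".trimIndent()
```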
Use Case Scenarios
Retail bots. The user photographs a product; the bot finds it in the catalog and shows price and availability. Visual embedding search (CLIP + Qdrant) is more accurate than text search over OCR output.
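The embedding search behind this reduces to nearest-neighbour ranking over vectors. A minimal sketch of the core idea, assuming CLIP embeddings are produced elsewhere (a model server) and leaving out the vector database that would handle this at catalog scale:

```kotlin
import kotlin.math.sqrt

// Rank catalog items by cosine similarity between the query embedding
// (from the user's photo) and precomputed catalog embeddings.
fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return dot / (sqrt(na) * sqrt(nb))
}

// Top-k most similar catalog entries; in production Qdrant does this at scale.
fun topMatches(query: FloatArray, catalog: Map<String, FloatArray>, k: Int = 3): List<String> =
    catalog.entries.sortedByDescending { cosine(query, it.value) }.take(k).map { it.key }
```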
Medical bots. A photo of a symptom, prescription, or lab result; the bot explains but does not diagnose. The system prompt should explicitly limit the answer scope and include a disclaimer.
Document bots. A photo of an invoice, receipt, or passport; the task is to extract structured data. GPT-4o Vision with structured output via JSON Schema gives high accuracy on typical documents.
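For the structured-output approach, a response schema is supplied alongside the request. An illustrative sketch for receipt extraction; the field names are assumptions, and the wrapper shape (`name`, `schema`) follows OpenAI's `response_format` convention, which should be verified against current docs:

```json
{
  "name": "receipt_extraction",
  "schema": {
    "type": "object",
    "properties": {
      "vendor": { "type": "string" },
      "date": { "type": "string", "description": "ISO 8601 date" },
      "total": { "type": "number" },
      "currency": { "type": "string" },
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": { "type": "string" },
            "amount": { "type": "number" }
          },
          "required": ["description", "amount"]
        }
      }
    },
    "required": ["vendor", "total"]
  }
}
```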
Inspection bots. A builder photographs a defect; the bot classifies the defect type and creates a task in the management system.
Handling "Bad" Photos
Mandatory test cases:
- Blurry image
- Poor lighting
- Off-topic photo (the user sent a cat instead of a receipt)
- Image with prohibited content
For the last case, run moderation before sending to the main model: the OpenAI Moderation API or Google SafeSearch as a first filter.
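Blurry input can be caught cheaply on the device before spending money on an API call. One common heuristic (a sketch, not tied to any particular library) is the variance of the Laplacian over the luminance channel: few sharp edges means low variance. The threshold below is an assumption to tune on real photos.

```kotlin
// Variance of the 4-neighbour Laplacian over a row-major luminance array.
// Low variance = few sharp edges = likely blurry.
fun laplacianVariance(lum: IntArray, width: Int, height: Int): Double {
    val responses = ArrayList<Double>()
    for (y in 1 until height - 1) {
        for (x in 1 until width - 1) {
            val c = lum[y * width + x]
            // Laplacian kernel: up + down + left + right - 4 * center
            val lap = (lum[(y - 1) * width + x] + lum[(y + 1) * width + x] +
                       lum[y * width + x - 1] + lum[y * width + x + 1] - 4 * c).toDouble()
            responses.add(lap)
        }
    }
    val mean = responses.average()
    return responses.sumOf { (it - mean) * (it - mean) } / responses.size
}

// Threshold is empirical: calibrate on sharp vs. blurry photos from the field.
fun looksBlurry(lum: IntArray, width: Int, height: Int, threshold: Double = 100.0): Boolean =
    laplacianVariance(lum, width, height) < threshold
```

When the check fires, ask the user to retake the photo instead of forwarding it; that is cheaper than letting the model guess.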
Implementation Process
1. Define the use case scenarios for images: exactly what needs to be recognized.
2. Choose a Vision API that fits the task and budget.
3. Backend: image upload, the Vision API call, response formatting.
4. Mobile UI: gallery selection, camera, preview before sending.
5. Test in real field conditions: poor lighting, odd angles, partial visibility.
Timeline Estimates
A bot with a basic Vision API (Google Vision or GPT-4o) takes 3–5 days. With a custom classification model, on-device inference, and complex scenarios, expect 3–6 weeks.