Object Detection Implementation in Mobile Applications
Object detection on mobile isn't just "find and draw a box." It also means frame-to-frame tracking, correctly projecting bounding boxes onto the preview layer, handling overlapping detections, and sustaining performance on 30 FPS video. The last point often becomes the bottleneck.
Model Selection: YOLO vs SSD vs NanoDet
Three main families work on mobile:
- MobileNet SSD — the classic, with excellent TFLite Task Library and ML Kit support. On a Pixel 7: 18–25 ms at 320×320 input. COCO mAP: ~23–27.
- YOLOv8n/YOLOv5n — the best accuracy/speed balance as of 2024. After TFLite or Core ML conversion: 22–40 ms depending on input size. COCO mAP: 37+.
- NanoDet — for genuinely low-end devices; under 10 ms on a Snapdragon 665.
For real-time video on modern Android flagships, use YOLOv8n with GPU delegate. For offline photos across a wide device range, use MobileNet SSD v2.
Bounding Box: Projection to Camera
The most common visual bug: the bounding box doesn't line up with the object in the preview. The cause: the model receives a resized input (e.g., 320×320), while the camera preview is 1920×1080 displayed with AspectFill or AspectFit. Recalculate coordinates accounting for both the scale factor and the crop/letterbox offsets.
On iOS with AVCaptureVideoPreviewLayer:
let converted = previewLayer.layerRectConverted(fromMetadataOutputRect: normalizedRect)
VNDetectedObjectObservation returns boundingBox in normalized coordinates (0..1, with the origin at the bottom-left). Before projecting to UIKit coordinates, flip the Y-axis: CGRect(x: box.minX, y: 1 - box.maxY, width: box.width, height: box.height).
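The flip is plain arithmetic; a minimal language-agnostic sketch (Python here, and the helper name flip_y_normalized is made up for illustration):

```python
def flip_y_normalized(x, y, w, h):
    """Convert a Vision-style normalized rect (origin bottom-left,
    Y pointing up) to a UIKit-style normalized rect (origin top-left,
    Y pointing down). All values are in 0..1."""
    return (x, 1.0 - (y + h), w, h)

# A box whose bottom-left corner is at (0.25, 0.25), 0.5 wide, 0.25 tall:
print(flip_y_normalized(0.25, 0.25, 0.5, 0.25))  # (0.25, 0.5, 0.5, 0.25)
```

Note the function is its own inverse: applying it twice returns the original rect, which is a convenient sanity check in unit tests.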
On Android with CameraX + ImageAnalysis: detection results are in input-image coordinates, while the preview lives in PreviewView coordinates. Use the coordinate-mapping helpers from the ML Kit quickstart samples, or compute the transformation manually with a Matrix.
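The manual transformation is the same math on both platforms: scale by the max (fill) or min (fit) of the two axis ratios, then center. A sketch in Python (the function name image_to_view_rect is hypothetical):

```python
def image_to_view_rect(box, img_w, img_h, view_w, view_h, mode="fill"):
    """Map a rect (x, y, w, h) from image pixel coordinates into view
    coordinates, mimicking AspectFill ("fill": image scaled up and
    cropped) or AspectFit ("fit": image letterboxed with bars)."""
    sx, sy = view_w / img_w, view_h / img_h
    s = max(sx, sy) if mode == "fill" else min(sx, sy)
    # Offsets center the scaled image in the view: negative for fill
    # (the image overflows the view), positive for fit (empty bars).
    dx = (view_w - img_w * s) / 2
    dy = (view_h - img_h * s) / 2
    x, y, w, h = box
    return (x * s + dx, y * s + dy, w * s, h * s)

# A square 100×100 frame shown in a 200×100 view:
print(image_to_view_rect((0, 0, 100, 100), 100, 100, 200, 100, "fit"))
# → (50.0, 0.0, 100.0, 100.0): centered with 50 px bars on each side
print(image_to_view_rect((0, 0, 100, 100), 100, 100, 200, 100, "fill"))
# → (0.0, -50.0, 200.0, 200.0): scaled to width, cropped vertically
```

Boxes landing partially outside the view under "fill" is expected; clamp them to the view bounds before drawing.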
Frame-to-Frame Tracking
Detecting on every frame is expensive. The correct approach: run the detector every N frames (typically every 5–10) and track in between, using SORT or ByteTrack, or iOS's built-in VNTrackObjectRequest seeded with a VNDetectedObjectObservation.
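The every-N-frames scheme needs an association step to keep IDs stable across detector runs. A minimal greedy-IoU sketch in Python (the class name IouTracker is made up; this omits the Kalman motion model that full SORT/ByteTrack add on top):

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (aw * ah + bw * bh - inter) if inter else 0.0

class IouTracker:
    """Greedy IoU association: each new detection inherits the ID of
    the best-overlapping existing track, or gets a fresh ID."""
    def __init__(self, iou_thresh=0.3):
        self.tracks = {}      # track id -> last known box
        self.next_id = 0
        self.iou_thresh = iou_thresh

    def update(self, detections):
        new_tracks, used = {}, set()
        for det in detections:
            best_id, best_iou = None, self.iou_thresh
            for tid, box in self.tracks.items():
                if tid in used:
                    continue
                overlap = iou(det, box)
                if overlap > best_iou:
                    best_id, best_iou = tid, overlap
            if best_id is None:           # no match: new object enters
                best_id = self.next_id
                self.next_id += 1
            used.add(best_id)
            new_tracks[best_id] = det
        self.tracks = new_tracks          # unmatched old tracks expire
        return new_tracks
```

Between detector runs you would carry the last known boxes forward (optionally advanced by optical flow); a production tracker also keeps unmatched tracks alive for a few frames before dropping them, to survive brief occlusions.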
ML Kit Object Detection & Tracking supports tracking out of the box (in STREAM_MODE) via .enableMultipleObjects() and .enableClassification(). Each tracked object gets a stable trackingId, which lets you display per-object info without flicker when an object is briefly lost and reappears.
NMS (Non-Maximum Suppression) matters. The default iouThreshold is 0.5. If distinct objects overlap heavily in the frame (e.g., tightly packed goods on a conveyor), raise it to 0.6–0.7; with a low threshold, NMS suppresses the lower-scoring of two overlapping boxes and "glues" adjacent objects into a single detection. The trade-off of a higher threshold is more duplicate boxes per object, so tune it on real footage.
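Greedy NMS itself is a few lines; the Python sketch below shows the threshold's direction of effect on two adjacent objects (boxes are (x, y, w, h)):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: walk boxes by descending score, keep a box only if
    its IoU with every already-kept box is <= iou_threshold.
    Returns indices of kept boxes."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2 = min(a[0] + a[2], b[0] + b[2])
        y2 = min(a[1] + a[3], b[1] + b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        return inter / (a[2] * a[3] + b[2] * b[3] - inter) if inter else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Box 1 is a distinct neighbor (IoU ~0.43 with box 0);
# box 2 is a duplicate of box 0 (IoU ~0.82).
boxes = [(0, 0, 10, 10), (4, 0, 10, 10), (0.5, 0.5, 10, 10)]
scores = [0.9, 0.85, 0.8]
print(nms(boxes, scores, 0.5))  # [0, 1]: duplicate dropped, neighbor kept
print(nms(boxes, scores, 0.3))  # [0]: neighbor suppressed too — objects merged
```

This is why a threshold that is too low loses overlapping objects: any neighbor whose IoU exceeds it is treated as a duplicate.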
Real Case Study
A queue people-counting app using a static camera (tablet on a stand). YOLOv8n, TFLite, GPU delegate, Android 11+. Problem: in dense queues (>8 people) the detector missed people in the middle of the group, where box overlap exceeded 60%. Solution: relaxed NMS (raised the IoU threshold from the default 0.5 so boxes of overlapping people were no longer suppressed as duplicates) and lowered minDetectionConfidence from 0.5 to 0.4. Missed detections dropped from 31% to 9%. Additionally, the model was fine-tuned on frames with heavy overlap using a Roboflow-labeled dataset.
Timeline
Integrating a detection model with preview projection and NMS tuning: 1–2 weeks. Fine-tuning on custom classes plus integration: 2–3 weeks. Cost is estimated individually.