AI Object Counting from Camera Frame in Mobile Apps
Counting objects via camera seems simple but hides several non-trivial issues: overlapping objects, objects at different scales in one frame, and the main trap—double counting when the camera moves. An industrial warehouse, a herd of animals, coins on a table—each scenario has its own characteristics.
Two Approaches: Detection vs Density Estimation
Detection-based counting — YOLOv8 or RT-DETR detects each object; count = number of detections. Works with low density (up to 50–100 objects per frame) when objects don't overlap heavily.
Density map estimation — a CNN predicts a density map; count = integral of the map. Used for high density: crowds, grain in a bin, cells under a microscope. CSRNet, DM-Count, and Bayesian Loss (BL) are common architectures.
// iOS: method selection based on expected density
enum CountingStrategy {
    case detection(model: VNCoreMLModel)   // < 100 objects per frame
    case densityMap(model: VNCoreMLModel)  // > 100 objects per frame
    case hybrid                            // adaptive selection
}

class AdaptiveObjectCounter {
    func selectStrategy(for objectClass: CountableObject) -> CountingStrategy {
        switch objectClass {
        case .vehicle, .personSparse:
            return .detection(model: vehicleDetector)
        case .crowd, .grain, .cell:
            return .densityMap(model: densityEstimator)
        case .productShelf:
            return .hybrid
        }
    }
}
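The hybrid case can also be resolved at runtime rather than per object class: run a fast detection pass first, and fall back to density estimation once the detection count approaches the regime where boxes overlap too heavily. A language-agnostic sketch in Python (the threshold of 100 and the two estimator callables are assumptions for illustration, not part of any SDK):

```python
def hybrid_count(frame, detect, estimate_density, switch_threshold=100):
    """Count objects, preferring detection but falling back to a density map.

    detect(frame)           -> list of (box, confidence) detections
    estimate_density(frame) -> float, integral of the predicted density map
    """
    detections = detect(frame)
    # Detection is trustworthy only while objects are sparse enough
    # that bounding boxes stay mostly disjoint.
    if len(detections) < switch_threshold:
        return len(detections)
    # Dense scene: individual boxes start merging, so trust the
    # density-map integral instead.
    return round(estimate_density(frame))
```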
Detection-Based: Implementation with Deduplication
class DetectionCounter {
    // detectionModel: VNCoreMLModel loaded once at init

    func count(in sampleBuffer: CMSampleBuffer,
               targetClass: String) async throws -> CountResult {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else {
            throw CounterError.invalidFrame
        }
        let request = VNCoreMLRequest(model: detectionModel)
        request.imageCropAndScaleOption = .scaleFill
        try VNImageRequestHandler(cvPixelBuffer: pixelBuffer).perform([request])
        let observations = (request.results as? [VNRecognizedObjectObservation]) ?? []

        // Filter by class and confidence
        let targetObjects = observations.filter { obs in
            obs.labels.first?.identifier == targetClass &&
            obs.confidence >= 0.4
        }

        // NMS to eliminate duplicate bounding boxes
        let deduplicated = applyNMS(targetObjects, iouThreshold: 0.45)

        return CountResult(
            count: deduplicated.count,
            detections: deduplicated,
            confidence: deduplicated.map { $0.confidence }.average() // average(): custom Array extension
        )
    }

    private func applyNMS(_ observations: [VNRecognizedObjectObservation],
                          iouThreshold: Float) -> [VNRecognizedObjectObservation] {
        // Sort by confidence (descending) so the strongest detection wins
        let sorted = observations.sorted { $0.confidence > $1.confidence }
        var kept: [VNRecognizedObjectObservation] = []
        for obs in sorted {
            // iou(): standard intersection-over-union helper
            let overlapping = kept.contains { existingObs in
                iou(obs.boundingBox, existingObs.boundingBox) > iouThreshold
            }
            if !overlapping { kept.append(obs) }
        }
        return kept
    }
}
With a plain VNCoreMLRequest, Vision does not run NMS for you unless the Core ML model itself ends in a non-maximum-suppression layer (models exported through coremltools detection pipelines often do). If your model outputs raw boxes, you must apply NMS manually, otherwise objects near crop or tile boundaries are counted twice.
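The iou helper the Swift code relies on is plain intersection-over-union, and the NMS loop is the standard greedy variant. A minimal, framework-free sketch in Python (the (x, y, w, h) box format is an assumption for illustration):

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept  # indices of surviving boxes
```

Two boxes covering the same object typically overlap at IoU well above 0.45, so only the more confident one survives; spatially separate objects are unaffected.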
Density Map for High Density
// Android: density map estimation via TFLite
class DensityMapCounter(context: Context) {
    private val interpreter: Interpreter by lazy {
        val model = FileUtil.loadMappedFile(context, "csrnet_lite.tflite")
        Interpreter(model, Interpreter.Options().apply {
            addDelegate(GpuDelegate())
            numThreads = 4
        })
    }

    fun estimate(bitmap: Bitmap): Int {
        // Model input size is fixed at export time—typically 512×512 or a multiple of 16
        val resized = Bitmap.createScaledBitmap(bitmap, 512, 512, true)
        val inputBuffer = TensorImage.fromBitmap(resized).buffer

        // Output tensor: the density map. The shape must match your model's export;
        // full input resolution is assumed here, but many architectures
        // (CSRNet included) emit a downsampled map, e.g. 1/8 of the input size
        val outputBuffer = TensorBuffer.createFixedSize(
            intArrayOf(1, 512, 512, 1), DataType.FLOAT32
        )
        interpreter.run(inputBuffer, outputBuffer.buffer)

        // Integral of the density map = estimated count:
        // each object contributes a blob whose values sum to ~1
        val densitySum = outputBuffer.floatArray.sum()
        return densitySum.roundToInt()
    }
}
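The count-by-integration step at the end is just a sum over the predicted map. A tiny Python sketch of what the Kotlin code does with outputBuffer (the map values below are synthetic):

```python
def count_from_density_map(density_map):
    """Estimated object count = integral (sum) of the density map.

    Training targets are built so each object contributes a Gaussian blob
    whose values sum to ~1.0, so summing the whole map recovers the count.
    """
    return round(sum(sum(row) for row in density_map))
```

This is why density models need no detection threshold: partial blobs from occluded or cut-off objects still contribute their fractional mass to the total.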
Counting with Camera Motion: Tracking
If the user smoothly pans the camera (warehouse, auditorium), tracking is needed to avoid counting the same objects twice:
class TrackingObjectCounter {
    private var tracker = ByteTracker()    // BYTE association algorithm (ByteTrack)
    private var countedIds: Set<Int> = []  // unique track IDs seen this session

    func processFrame(_ detections: [Detection]) -> TrackingCountResult {
        let tracks = tracker.update(detections: detections)

        // IDs we haven't seen yet = new objects entering the frame
        let newIds = tracks.map { $0.trackId }.filter { !countedIds.contains($0) }
        countedIds.formUnion(newIds)

        return TrackingCountResult(
            currentFrameCount: tracks.count,    // visible in this frame
            totalUniqueCount: countedIds.count  // total unique in session
        )
    }
}
ByteTrack is a strong fit here: it associates low-confidence detections in a second matching pass, which keeps track IDs stable through brief occlusions, so a momentarily hidden object is not recounted when it reappears.
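The deduplication logic itself reduces to set arithmetic over track IDs, independent of which tracker produces them. A minimal Python sketch of the accumulator (the tracker is assumed; only the counting is shown):

```python
class UniqueCountAccumulator:
    """Accumulates unique track IDs across frames, so a panning camera
    never counts the same tracked object twice."""

    def __init__(self):
        self.counted_ids = set()

    def update(self, track_ids):
        """track_ids: IDs the tracker assigned in the current frame.
        Returns (objects visible this frame, unique objects so far)."""
        self.counted_ids.update(track_ids)
        return len(track_ids), len(self.counted_ids)
```

Note that an ID stays counted even after its object leaves the frame, which is exactly the behavior a "pan across the warehouse" session needs.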
Timeline Estimates
Detection-based counting with a ready model (a single object class) and counter UI takes 3–5 days. An adaptive system with detection plus density maps, tracking for camera motion, multiple object classes, and iOS + Android support requires 1–2 weeks.