AI Object Tracking in Video Streams for Mobile Apps
Object tracking is a separate task from detection. A detector reports "there's a car here" on each frame independently. A tracker reports "this is the same car #7 that was on the left in the previous frame." Losing object identity is the typical failure of naive approaches: an object leaves the frame, returns, and the tracker assigns it a new ID.
Classification of Tracking Tasks
SOT (Single Object Tracking) — tracking one selected object. User taps an object → app follows it. Applications: sports broadcasts, tracking a specific person in frame. Algorithms: SiamFC, OSTrack, STARK.
MOT (Multi-Object Tracking) — simultaneous tracking of all objects of a target class. Applications: visitor counting, traffic control, production conveyors. Algorithms: SORT, ByteTrack, StrongSORT, OC-SORT.
MOT: Detector + Tracker Pipeline
Standard pipeline for mobile:
// iOS: YOLOv8 detection + SORT tracking
// YOLOv8Detector, SORTTracker, Detection and TrackedObject are app-defined types (not shown).
import CoreVideo

class MultiObjectTracker {
    private let detector: YOLOv8Detector
    private let tracker: SORTTracker

    // SORT parameters: tune these for your task
    init(targetClass: String,
         maxAge: Int = 10,            // frames without detection before track removal
         minHits: Int = 3,            // frames of detection to confirm track
         iouThreshold: Float = 0.3) { // minimum IoU to match a detection to a track
        self.detector = YOLOv8Detector(targetClass: targetClass)
        self.tracker = SORTTracker(maxAge: maxAge,
                                   minHits: minHits,
                                   iouThreshold: iouThreshold)
    }

    func processFrame(_ pixelBuffer: CVPixelBuffer) async -> [TrackedObject] {
        // 1. Detection on current frame
        let detections = await detector.detect(pixelBuffer)

        // 2. Tracker update
        let tracks = tracker.update(detections: detections.map { det in
            Detection(bbox: det.boundingBox, confidence: det.confidence)
        })

        // 3. Convert to TrackedObject
        return tracks.map { track in
            TrackedObject(
                id: track.trackId,
                boundingBox: track.bbox,
                isConfirmed: track.hitStreak >= tracker.minHits,
                velocity: track.kalmanFilter.velocity // from Kalman state
            )
        }
    }
}
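maxAge = 10 means a track survives 10 frames without a matching detection (for example, while the object is behind an obstacle). At 30 FPS this is about 333 ms, which covers most brief occlusions. If the pipeline processes fewer frames per second, the same maxAge spans proportionally more time.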
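How maxAge and minHits interact can be sketched as a small state machine. This is a simplified illustration, not the code of any SORT library: TrackState, TrackLifecycle, and maxOcclusionMillis are hypothetical names, and the sketch assumes a track stays confirmed once it has been confirmed.

```kotlin
// Hypothetical names for illustration; not taken from a SORT implementation.
enum class TrackState { TENTATIVE, CONFIRMED, REMOVED }

class TrackLifecycle(private val maxAge: Int = 10, private val minHits: Int = 3) {
    var hitStreak = 0          // consecutive frames with a matched detection
        private set
    var framesSinceUpdate = 0  // consecutive frames without a match
        private set
    var state = TrackState.TENTATIVE
        private set

    /** Call once per processed frame; `matched` tells whether a detection was associated. */
    fun onFrame(matched: Boolean): TrackState {
        if (state == TrackState.REMOVED) return state
        if (matched) {
            hitStreak += 1
            framesSinceUpdate = 0
            if (hitStreak >= minHits) state = TrackState.CONFIRMED
        } else {
            hitStreak = 0
            framesSinceUpdate += 1
            // The track survives maxAge missed frames and is dropped on the next miss.
            if (framesSinceUpdate > maxAge) state = TrackState.REMOVED
        }
        return state
    }
}

/** Longest gap, in milliseconds, that a track survives at a given processing rate. */
fun maxOcclusionMillis(maxAge: Int, fps: Double): Double = maxAge * 1000.0 / fps
```

With these definitions, maxOcclusionMillis(10, 30.0) gives roughly 333 ms, while the same maxAge at 15 processed frames per second gives roughly 667 ms.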
ByteTrack: Better Than SORT for Occlusions
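SORT associates only high-confidence detections and discards the rest. ByteTrack keeps the low-confidence detections too and uses them in a second association pass against the tracks left unmatched by the first pass. An occluded object often still produces a weak detection, so this noticeably reduces track loss during occlusions: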
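// Android: ByteTrack association (simplified)
// STrack, Detection and linearAssignment(...) are app-defined (not shown);
// linearAssignment is assumed to return (matches, unmatchedTracks, unmatchedDetections).
class ByteTracker(
    private val trackThresh: Float = 0.5f,    // at or above: high-confidence detection
    private val lowThresh: Float = 0.1f,      // below: discarded as noise
    private val newTrackThresh: Float = 0.6f, // minimum confidence to start a new track
    private val matchThresh: Float = 0.8f     // matching threshold for the first pass
) {
    private val trackedStracks = mutableListOf<STrack>()
    private val lostStracks = mutableListOf<STrack>()

    fun update(detections: List<Detection>): List<STrack> {
        // Split detections into high/low confidence
        val highDetections = detections.filter { it.confidence >= trackThresh }
        val lowDetections = detections.filter {
            it.confidence >= lowThresh && it.confidence < trackThresh
        }

        // 1. Associate high-confidence detections with active and recently lost tracks
        val (matches1, unmatchedTracks1, unmatchedDets1) =
            linearAssignment(trackedStracks + lostStracks, highDetections, matchThresh)

        // 2. Associate low-confidence detections with tracks left unmatched in step 1
        val (matches2, unmatchedTracks2, _) =
            linearAssignment(unmatchedTracks1, lowDetections, 0.5f)

        // 3. Start new tracks from unmatched high-confidence detections
        val newTracks = unmatchedDets1
            .filter { it.confidence >= newTrackThresh }
            .map { STrack(it) }

        // 4. Matched and new tracks stay active; still-unmatched tracks are kept as lost.
        //    Kalman prediction, per-track state update and age-based removal are omitted here.
        val active = (matches1 + matches2).map { it.track } + newTracks
        trackedStracks.clear(); trackedStracks.addAll(active)
        lostStracks.clear(); lostStracks.addAll(unmatchedTracks2)
        return active
    }
}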
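The linearAssignment helper used above is not defined in the snippet. As a rough idea of what the association step does, here is a minimal greedy matcher based on IoU. It is a sketch under stated assumptions: Box, Assignment, and greedyAssignment are hypothetical names, the threshold is a minimum IoU, and a production tracker would more likely solve the assignment optimally (for example with the Hungarian algorithm) over a cost matrix.

```kotlin
// Illustrative types; names are hypothetical and not taken from a ByteTrack library.
data class Box(val left: Float, val top: Float, val right: Float, val bottom: Float) {
    val area: Float get() = maxOf(0f, right - left) * maxOf(0f, bottom - top)
}

/** Intersection over union of two axis-aligned boxes, in [0, 1]. */
fun iou(a: Box, b: Box): Float {
    val w = minOf(a.right, b.right) - maxOf(a.left, b.left)
    val h = minOf(a.bottom, b.bottom) - maxOf(a.top, b.top)
    if (w <= 0f || h <= 0f) return 0f
    val inter = w * h
    val union = a.area + b.area - inter
    return if (union > 0f) inter / union else 0f
}

data class Assignment(
    val matches: List<Pair<Int, Int>>,   // (track index, detection index)
    val unmatchedTracks: List<Int>,
    val unmatchedDetections: List<Int>
)

/**
 * Greedy matching: repeatedly take the highest-IoU pair whose track and
 * detection are both still free. Pairs with IoU below minIou never match.
 */
fun greedyAssignment(tracks: List<Box>, detections: List<Box>, minIou: Float): Assignment {
    val candidates = mutableListOf<Triple<Int, Int, Float>>()
    for (t in tracks.indices) for (d in detections.indices) {
        val score = iou(tracks[t], detections[d])
        if (score >= minIou) candidates.add(Triple(t, d, score))
    }
    candidates.sortByDescending { it.third }

    val usedTracks = HashSet<Int>()
    val usedDets = HashSet<Int>()
    val matches = mutableListOf<Pair<Int, Int>>()
    for ((t, d, _) in candidates) {
        if (t in usedTracks || d in usedDets) continue
        usedTracks.add(t)
        usedDets.add(d)
        matches.add(t to d)
    }
    return Assignment(
        matches,
        tracks.indices.filter { it !in usedTracks },
        detections.indices.filter { it !in usedDets }
    )
}
```

Check which convention your threshold follows. If a value such as 0.8 is a maximum cost on 1 - IoU rather than a minimum IoU, the equivalent call here would use minIou = 0.2.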
SOT: Tap-to-Track
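// iOS: user selects object with tap, app follows
import Vision
import CoreGraphics

// Hypothetical delegate protocol; the tracker reports results through it.
protocol SingleObjectTrackerDelegate: AnyObject {
    func didUpdateTracking(boundingBox: CGRect, confidence: Float)
}

class SingleObjectTracker {
    weak var delegate: SingleObjectTrackerDelegate?
    // Use Vision VNTrackObjectRequest
    private var trackingRequest: VNTrackObjectRequest?
    // Tracking requests keep state across frames, so they run on a sequence handler
    private let sequenceHandler = VNSequenceRequestHandler()

    // point: tap location in Vision's normalized coordinates (origin at bottom-left)
    func initializeTracking(at point: CGPoint, in frame: CVPixelBuffer) {
        let side: CGFloat = 0.1 // initial box size as a fraction of the frame
        let observation = VNDetectedObjectObservation(
            boundingBox: CGRect(x: point.x - side / 2, y: point.y - side / 2,
                                width: side, height: side)
        )
        let request = VNTrackObjectRequest(
            detectedObjectObservation: observation
        ) { [weak self] request, _ in
            guard let obs = request.results?.first as? VNDetectedObjectObservation else { return }
            // Feed the latest observation back so the next frame continues from it
            self?.trackingRequest?.inputObservation = obs
            self?.delegate?.didUpdateTracking(boundingBox: obs.boundingBox,
                                              confidence: obs.confidence)
        }
        request.trackingLevel = .accurate // vs .fast
        trackingRequest = request
    }

    func trackInFrame(_ pixelBuffer: CVPixelBuffer) {
        guard let request = trackingRequest else { return }
        try? sequenceHandler.perform([request], on: pixelBuffer)
    }
}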
trackingLevel = .accurate uses a heavier tracker than .fast (correlation-based rather than optical flow). The trade-off: .fast runs at 50+ FPS but loses the track during fast motion; .accurate runs at 20–30 FPS and is more robust to fast-moving objects. These figures vary by device, so measure on your target hardware before choosing.
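Vision expects the initial bounding box in normalized coordinates with the origin at the bottom-left, while touch locations are reported with a top-left origin. The conversion is plain arithmetic. It is sketched here in Kotlin to match the other added examples, and the same steps translate directly to Swift. The sketch assumes the video fills the view with no cropping or rotation. NormalizedRect and tapToNormalizedBox are hypothetical names.

```kotlin
// Hypothetical helper types; the arithmetic is platform-independent.
data class NormalizedRect(val x: Double, val y: Double, val width: Double, val height: Double)

/**
 * Converts a tap (top-left origin, in view units) into a square box in
 * normalized coordinates with a bottom-left origin, kept inside [0, 1].
 */
fun tapToNormalizedBox(
    tapX: Double, tapY: Double,
    viewWidth: Double, viewHeight: Double,
    side: Double = 0.1                 // box side as a fraction of the frame
): NormalizedRect {
    require(viewWidth > 0 && viewHeight > 0) { "view size must be positive" }
    require(side > 0 && side <= 1) { "side must be in (0, 1]" }

    val centerX = tapX / viewWidth
    val centerY = 1.0 - tapY / viewHeight   // flip the y axis

    // Shift the box inward when the tap is close to an edge.
    val x = (centerX - side / 2).coerceIn(0.0, 1.0 - side)
    val y = (centerY - side / 2).coerceIn(0.0, 1.0 - side)
    return NormalizedRect(x, y, side, side)
}
```

If the preview is aspect-filled or rotated relative to the camera buffer, the tap has to be mapped into buffer coordinates first; that step depends on the preview configuration and is not covered here.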
Track Rendering
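// Android: Jetpack Compose overlay
// TrackedObject, generateTrackColors, toScreenRect and toOffset are app-defined helpers (not shown).
import androidx.compose.foundation.Canvas
import androidx.compose.runtime.Composable
import androidx.compose.runtime.remember
import androidx.compose.ui.Modifier
import androidx.compose.ui.geometry.Size
import androidx.compose.ui.graphics.drawscope.Stroke
import androidx.compose.ui.graphics.drawscope.drawIntoCanvas
import androidx.compose.ui.graphics.nativeCanvas
import androidx.compose.ui.graphics.toArgb

@Composable
fun TrackingOverlay(
    tracks: List<TrackedObject>,
    imageSize: Size,
    modifier: Modifier = Modifier
) {
    val colors = remember { generateTrackColors(maxTracks = 100) }
    // android.graphics.Paint (not Compose's Paint), created once rather than per track per frame
    val textPaint = remember {
        android.graphics.Paint().apply { textSize = 32f; isAntiAlias = true }
    }

    Canvas(modifier = modifier) {
        tracks.forEach { track ->
            val color = colors[track.id % colors.size]
            val rect = track.boundingBox.toScreenRect(imageSize, size)

            // Bounding box
            drawRect(color = color, topLeft = rect.topLeft,
                     size = rect.size, style = Stroke(width = 3f))

            // ID badge
            textPaint.color = color.toArgb()
            drawIntoCanvas { canvas ->
                canvas.nativeCanvas.drawText(
                    "ID: ${track.id}",
                    rect.left + 4f,
                    rect.top + 20f,
                    textPaint
                )
            }

            // Velocity vector (optional)
            track.velocity?.let { velocity ->
                drawLine(
                    color = color.copy(alpha = 0.6f),
                    start = rect.center,
                    end = rect.center + velocity.toOffset(scale = 20f),
                    strokeWidth = 2f
                )
            }
        }
    }
}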
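The toScreenRect helper used in the overlay is not shown. One possible version is sketched below, assuming the bounding box is in image pixel coordinates and the image is drawn aspect-fit and centered in the canvas. SizeF and RectF here are plain stand-in types defined for the example, not the Compose or Android classes.

```kotlin
// Stand-in types so the sketch is self-contained; swap in Compose's Size/Rect in an app.
data class SizeF(val width: Float, val height: Float)
data class RectF(val left: Float, val top: Float, val right: Float, val bottom: Float)

/**
 * Maps a box from image pixel coordinates to canvas coordinates,
 * assuming the image is scaled uniformly to fit and centered (letterboxed).
 */
fun RectF.toScreenRect(imageSize: SizeF, canvasSize: SizeF): RectF {
    require(imageSize.width > 0f && imageSize.height > 0f) { "image size must be positive" }

    // Uniform scale that fits the whole image inside the canvas.
    val scale = minOf(canvasSize.width / imageSize.width, canvasSize.height / imageSize.height)

    // Padding left over on each side after scaling.
    val offsetX = (canvasSize.width - imageSize.width * scale) / 2f
    val offsetY = (canvasSize.height - imageSize.height * scale) / 2f

    return RectF(
        left * scale + offsetX,
        top * scale + offsetY,
        right * scale + offsetX,
        bottom * scale + offsetY
    )
}
```

For a center-crop (fill) preview, replacing minOf with maxOf gives the matching transform; the offsets then become negative, which is expected because part of the image lies outside the canvas.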
Timeline Estimates
SOT (Vision VNTrackObjectRequest) with tap for object selection takes 2–3 days. MOT with YOLOv8 + ByteTrack, track rendering, multiple object classes, and iOS + Android support requires 1–2 weeks.







