Implementing Hand Tracking in AR Applications
Hand tracking is markerless, camera-based tracking of the hand and its fingers. It enables controller-free AR interfaces, virtual musical instruments, educational apps for surgery or mechanics, and AR games where the hands themselves are the controller. Technically it is a hard problem: 21 joints per hand, fast movements, fingers occluding each other, tracking loss in poor lighting.
Platform Situation
iOS: ARKit on iPhone does not expose a public hand tracking API. HandAnchor with HandSkeleton (27 joints per hand) is available through ARKit's HandTrackingProvider starting with visionOS 1.0 and works with RealityKit.
On iPhone, the closest native option is the Vision framework's VNDetectHumanHandPoseRequest (iOS 14+, 21 joints, 2D only); beyond that, third-party ML solutions.
Android: ARCore has no hand tracking. MediaPipe Hands is the de facto standard on Android (and on iOS when a cross-platform solution is needed).
ARKit Hand Tracking (visionOS)
// visionOS: ARKit hand tracking with RealityKit
let session = ARKitSession()
let handTracking = HandTrackingProvider()

Task {
    try await session.run([handTracking])
    for await update in handTracking.anchorUpdates {
        let handAnchor = update.anchor
        guard handAnchor.isTracked,
              let skeleton = handAnchor.handSkeleton else { continue }
        // World-space transform of the index finger tip
        let indexTip = skeleton.joint(.indexFingerTip)
        let worldTransform = handAnchor.originFromAnchorTransform
                           * indexTip.anchorFromJointTransform
        // Attach an entity to the finger tip here
    }
}
HandTrackingProvider tracks both hands simultaneously, and on Vision Pro nothing has to be held: the hands stay free for interaction.
MediaPipe Hands: Cross-Platform Solution
MediaPipe's Hand Landmarker task: 21 joints per hand, up to two hands simultaneously, iOS + Android, free.
// Android: MediaPipe Tasks Vision
val handLandmarker = HandLandmarker.createFromOptions(
    context,
    HandLandmarker.HandLandmarkerOptions.builder()
        .setBaseOptions(BaseOptions.builder().setModelAssetPath("hand_landmarker.task").build())
        .setNumHands(2)
        .setMinHandDetectionConfidence(0.5f)
        .setMinTrackingConfidence(0.5f)
        .build()
)

val result = handLandmarker.detect(mpImage)
// result.landmarks(): List<List<NormalizedLandmark>>,
// 21 points per hand in normalized image coordinates [0..1]
The 21 MediaPipe joints: WRIST; THUMB_CMC through THUMB_TIP; INDEX_FINGER_MCP through INDEX_FINGER_TIP; and the same four joints for each of the remaining three fingers.
For attaching AR content in 3D, the normalized 2D coordinates must be unprojected using the camera intrinsics plus a depth value (LiDAR, a platform depth API, or monocular depth estimation).
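As a sketch, the unprojection step under a simple pinhole model. The intrinsics fx, fy, cx, cy come from the platform camera API (on ARCore, Camera.getImageIntrinsics()); the depth value is assumed to be supplied by LiDAR, a depth API, or a monocular estimator. All names here are illustrative.

```kotlin
// Lift a normalized MediaPipe landmark into 3D camera space
// with the pinhole camera model.
data class Vec3(val x: Float, val y: Float, val z: Float)

fun unprojectLandmark(
    normX: Float, normY: Float,         // landmark coordinates in [0..1]
    imageWidth: Int, imageHeight: Int,  // camera image size in pixels
    fx: Float, fy: Float,               // focal lengths in pixels
    cx: Float, cy: Float,               // principal point in pixels
    depthMeters: Float                  // depth at that pixel
): Vec3 {
    val u = normX * imageWidth          // normalized -> pixel coordinates
    val v = normY * imageHeight
    // Inverse pinhole projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy
    return Vec3(
        (u - cx) * depthMeters / fx,
        (v - cy) * depthMeters / fy,
        depthMeters
    )
}
```

A landmark at the principal point maps to (0, 0, depth); points away from the image center spread out proportionally to depth.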
Gesture Recognition
Basic gestures need no ML and follow from joint geometry (the rules below assume a Y-up coordinate space; raw normalized image coordinates have Y growing downward, which flips the comparisons):
Pinch: distance between THUMB_TIP and INDEX_FINGER_TIP < threshold (usually 2–3 cm in real coordinates).
Open palm: all _TIP joints higher than corresponding _MCP joints on Y axis.
Fist: all _TIP joints lower than _MCP on Y axis.
Victory (V-gesture): index and middle _TIP higher than _MCP, rest — lower.
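The four rules above can be sketched directly over the 21 landmarks. Indices follow the MediaPipe hand model (0 = WRIST, 4 = THUMB_TIP, 5 = INDEX_FINGER_MCP, 8 = INDEX_FINGER_TIP, and so on); the snippet assumes landmarks already converted to metric, Y-up coordinates.

```kotlin
import kotlin.math.sqrt

data class Point3(val x: Float, val y: Float, val z: Float)

fun distance(a: Point3, b: Point3): Float {
    val dx = a.x - b.x; val dy = a.y - b.y; val dz = a.z - b.z
    return sqrt(dx * dx + dy * dy + dz * dz)
}

// TIP/MCP index pairs for index, middle, ring, pinky
val tipMcp = listOf(8 to 5, 12 to 9, 16 to 13, 20 to 17)

// Pinch: thumb tip and index tip closer than ~2.5 cm
fun isPinch(lm: List<Point3>, thresholdMeters: Float = 0.025f): Boolean =
    distance(lm[4], lm[8]) < thresholdMeters

// Open palm: every fingertip above its MCP joint on the Y axis
fun isOpenPalm(lm: List<Point3>): Boolean =
    tipMcp.all { (tip, mcp) -> lm[tip].y > lm[mcp].y }

// Fist: every fingertip below its MCP joint
fun isFist(lm: List<Point3>): Boolean =
    tipMcp.all { (tip, mcp) -> lm[tip].y < lm[mcp].y }

// Victory: index and middle extended, ring and pinky curled
fun isVictory(lm: List<Point3>): Boolean =
    lm[8].y > lm[5].y && lm[12].y > lm[9].y &&
    lm[16].y < lm[13].y && lm[20].y < lm[17].y
```

The pinch threshold is a tunable starting point, not a constant from either SDK; in practice it is worth calibrating per device and per hand size.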
Complex gestures (the ASL alphabet, custom combinations) require a trained model: Create ML's hand pose classifier or a custom TensorFlow Lite model, with 500–1000 training samples per gesture.
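Whichever classifier is used, raw pixel coordinates make poor training features. A common preprocessing step, sketched here with illustrative names, is to re-center the landmarks on the wrist and divide by a hand-size reference so the features become translation- and scale-invariant:

```kotlin
import kotlin.math.sqrt

data class Landmark(val x: Float, val y: Float, val z: Float)

// 63-element feature vector for a gesture classifier: every landmark
// expressed relative to the wrist (index 0) and scaled by the
// wrist -> middle-finger-MCP distance (index 9), so hand size and
// camera distance cancel out.
fun toFeatureVector(lm: List<Landmark>): FloatArray {
    val w = lm[0]
    val dx = lm[9].x - w.x; val dy = lm[9].y - w.y; val dz = lm[9].z - w.z
    val scale = sqrt(dx * dx + dy * dy + dz * dz).coerceAtLeast(1e-6f)
    val out = FloatArray(lm.size * 3)
    for ((i, p) in lm.withIndex()) {
        out[i * 3] = (p.x - w.x) / scale
        out[i * 3 + 1] = (p.y - w.y) / scale
        out[i * 3 + 2] = (p.z - w.z) / scale
    }
    return out
}
```

The same normalization must run at inference time as during training, otherwise the classifier sees a different feature distribution.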
Hand Interaction with AR Objects
Picking objects: ray from palm/finger → intersection with AR objects. Pinch gesture = "grab", release = "drop".
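A minimal version of this picking test, assuming the object is approximated by a bounding sphere (all names illustrative): cast a ray from the index MCP through the tip and check how close it passes to the sphere center. Combined with the pinch rule, "ray hits and pinch active" means grab, "pinch released" means drop.

```kotlin
import kotlin.math.sqrt

data class V3(val x: Float, val y: Float, val z: Float)

fun sub(a: V3, b: V3) = V3(a.x - b.x, a.y - b.y, a.z - b.z)
fun dot(a: V3, b: V3) = a.x * b.x + a.y * b.y + a.z * b.z
fun norm(a: V3): V3 { val l = sqrt(dot(a, a)); return V3(a.x / l, a.y / l, a.z / l) }

// Ray along the index finger: from the MCP joint through the tip
fun fingerRay(indexMcp: V3, indexTip: V3): Pair<V3, V3> =
    indexMcp to norm(sub(indexTip, indexMcp))

// True if the ray passes within `radius` of the object's center
fun raySphereHit(origin: V3, dir: V3, center: V3, radius: Float): Boolean {
    val oc = sub(center, origin)
    val t = maxOf(0f, dot(oc, dir))  // closest-approach parameter, clamped so
                                     // objects behind the hand never register
    val closest = V3(origin.x + dir.x * t, origin.y + dir.y * t, origin.z + dir.z * t)
    val d = sub(center, closest)
    return dot(d, d) <= radius * radius
}
```

A ray from the palm center through the middle-finger MCP works the same way and is steadier, since MCP joints jitter less than fingertips.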
Deforming AR object with hands: two hands simultaneously → scaling (distance between palms), rotation (orientation of vector between palms).
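Per frame, this two-hand manipulation reduces to two scalars, sketched here with illustrative names: a scale factor from the ratio of current to initial palm distance, and a yaw delta from the heading of the left-to-right palm vector projected onto the ground (XZ) plane.

```kotlin
import kotlin.math.atan2
import kotlin.math.sqrt

data class Palm(val x: Float, val y: Float, val z: Float)

fun palmDistance(l: Palm, r: Palm): Float {
    val dx = r.x - l.x; val dy = r.y - l.y; val dz = r.z - l.z
    return sqrt(dx * dx + dy * dy + dz * dz)
}

// Uniform scale relative to the moment the two-hand gesture started
fun scaleFactor(startDistance: Float, currentDistance: Float): Float =
    currentDistance / startDistance

// Heading (yaw, radians) of the left->right palm vector on the XZ plane;
// apply palmYaw(now) - palmYaw(atGestureStart) as the rotation delta
fun palmYaw(l: Palm, r: Palm): Float = atan2(r.z - l.z, r.x - l.x)
```

Capturing the distance and yaw once at gesture start and applying only the deltas keeps the object from jumping when the second hand enters tracking.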
Surgical simulation: finger tip interacts with virtual organs — collision detection between joint position and AR mesh. CollisionComponent + PhysicsBodyComponent in RealityKit for physically correct interaction.
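For the collision side, a cheap first pass (a sketch, not the RealityKit implementation) treats the fingertip as a small sphere and tests it against the object's axis-aligned bounding box before any per-triangle work:

```kotlin
// Fingertip (~8 mm radius sphere) vs. axis-aligned bounding box.
// c = sphere center, r = radius, min/max = box corners, all in meters.
fun sphereAabbCollides(c: FloatArray, r: Float, min: FloatArray, max: FloatArray): Boolean {
    var d2 = 0f
    for (i in 0..2) {
        val clamped = c[i].coerceIn(min[i], max[i])  // closest box point per axis
        val diff = c[i] - clamped
        d2 += diff * diff
    }
    return d2 <= r * r  // collides if the closest box point lies inside the sphere
}
```

Only when this broad-phase test passes is it worth running the precise mesh collision, which matters at 30+ fps with ten fingertips in play.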
Limitations in Real Conditions
Finger tracking degrades under occlusion (one finger behind another is a common situation). Both MediaPipe and ARKit use an essentially 2.5D approach, so the self-occlusion problem is not fully solved.
Dark backgrounds and darker skin tones reduce contrast, and detection confidence falls with it. Minimum lighting for stable tracking is roughly 200 lux; show a UI indicator whenever confidence drops below 0.5.
Latency: MediaPipe on a mid-range Android device (Snapdragon 720G) runs at 35–45 ms per frame; Apple's native hand tracking on an iPhone 15 at 15–20 ms. For virtual musical instruments the difference is noticeable.
Timeline
Basic hand tracking with pinch/open-palm gesture recognition on native Apple APIs: 1–2 weeks. A cross-platform MediaPipe solution: 2–3 weeks. A custom gesture classifier with training: plus 2–3 weeks. Interactive hand manipulation of AR objects (picking, deformation): plus 2–4 weeks. Cost is estimated individually.