Gesture Recognition Implementation in Mobile Applications
Camera-based gesture recognition provides a touch-free interface. It is used in fitness apps (rep counting), accessibility tools, AR interactions, and presentation modes. The key tool: MediaPipe Hand Landmarker.
MediaPipe Hand Landmarker: How It Works
MediaPipe Hand Landmarker detects up to 2 hands per frame, returning 21 landmarks per hand: the wrist plus four joints per finger (MCP, PIP, DIP, tip). Coordinates are normalized (0..1) in image space, plus a z coordinate (depth relative to the wrist).
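The landmarks are addressed by fixed indices (0 = wrist, 4 = thumb tip, 8 = index fingertip, and so on per the model's landmark map). A minimal sketch of working with them in normalized image space; the pinch threshold of 0.05 is purely illustrative:

```python
import math

# Subset of the 21 hand landmark indices (values match MediaPipe's map).
WRIST = 0
THUMB_TIP = 4
INDEX_FINGER_TIP = 8

def norm_distance(a, b):
    """Euclidean distance between two (x, y) landmarks in normalized image space."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Example: detect a pinch as thumb tip being close to the index fingertip.
landmarks = {
    WRIST: (0.50, 0.80),
    THUMB_TIP: (0.42, 0.45),
    INDEX_FINGER_TIP: (0.44, 0.47),
}
is_pinch = norm_distance(landmarks[THUMB_TIP], landmarks[INDEX_FINGER_TIP]) < 0.05
```

Because the coordinates are normalized, such fixed thresholds depend on how large the hand appears in the frame (see the note on distance to camera below).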
Latency on modern devices:
- Pixel 7 (GPU): ~12 ms
- iPhone 14 (CPU): ~18 ms
- Snapdragon 665 mid-range (CPU): ~45 ms
For a real-time UI at 30 FPS, the per-frame budget is 33 ms. On weak devices, either drop inference to 15 FPS or track a single hand (numHands = 1).
Gesture Recognition on Top of Landmarks
MediaPipe Gesture Recognizer—wrapper over Hand Landmarker—recognizes 7 built-in gestures: Open_Palm, Closed_Fist, Pointing_Up, Thumb_Up, Thumb_Down, Victory, ILoveYou. Enough for basic control (next slide, like/dislike, confirm).
For custom gestures, two approaches:
Geometric. Compute angles between phalanges, distances between keypoints, determine "is finger bent" via dot product. Fast (no extra inference), easy to debug. Works well for static gestures (specific hand shape).
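A minimal sketch of the dot-product check for "is finger bent" (pure Python; the function name and the 0.5 cosine threshold are our choices):

```python
import math

def is_finger_bent(mcp, pip, tip, threshold=0.5):
    """Geometric check: a finger is bent when the MCP->PIP and PIP->TIP
    direction vectors diverge, i.e. the cosine of the joint angle (their
    normalized dot product) drops below the threshold.
    Landmarks are (x, y) tuples in normalized image coordinates."""
    v1 = (pip[0] - mcp[0], pip[1] - mcp[1])
    v2 = (tip[0] - pip[0], tip[1] - pip[1])
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return False  # degenerate landmarks, treat as not bent
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return cos_angle < threshold

# Straight finger: both segments point the same way.
straight = is_finger_bent((0.5, 0.8), (0.5, 0.6), (0.5, 0.4))   # False
# Curled finger: the tip folds back toward the palm.
bent = is_finger_bent((0.5, 0.8), (0.5, 0.6), (0.5, 0.75))      # True
```

Running the same check per finger against the relevant landmark triplets gives a bitmask of bent fingers, which is enough to classify most static hand shapes.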
ML Classifier. Record landmark vectors for each gesture, train small MLP or custom GestureRecognizer via MediaPipe Model Maker. Better for similar-shaped gestures and dynamic gestures (hand movement).
Case: presentation control via gestures during standup. Gestures: "next slide" (swipe right), "previous" (swipe left), "stop" (open palm). The static gesture goes through the built-in recognizer (Open_Palm from MediaPipe Gesture Recognizer). Dynamic swipes go through wrist position tracking (the WRIST landmark) over 10 frames: if delta_x exceeds 0.3 normalized units within 333 ms, it counts as a swipe. The threshold was tuned experimentally on 15 test users.
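The wrist-tracking swipe logic can be sketched as a sliding window over the last N wrist x-positions. Window size and threshold follow the case above (10 frames, 0.3 normalized units at 30 FPS ≈ 333 ms); the class and method names are our own:

```python
from collections import deque

class SwipeDetector:
    """Fire a swipe event when the horizontal travel of the wrist across
    the last `window` frames exceeds `threshold` (normalized 0..1 units)."""

    def __init__(self, window=10, threshold=0.3):
        self.xs = deque(maxlen=window)
        self.threshold = threshold

    def update(self, wrist_x):
        """Feed one wrist x-coordinate per frame; returns an event or None."""
        self.xs.append(wrist_x)
        if len(self.xs) < self.xs.maxlen:
            return None  # window not full yet
        delta = self.xs[-1] - self.xs[0]
        if delta > self.threshold:
            self.xs.clear()  # reset so one motion fires exactly once
            return "swipe_right"
        if delta < -self.threshold:
            self.xs.clear()
            return "swipe_left"
        return None
```

Clearing the window after a detection is the simplest way to stop one continuous motion from producing multiple events.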
Binding Gestures to Actions
A recognized gesture is an event with a confidence score and a timestamp. Not every event should be processed: debouncing matters, because "thumbs up" shouldn't retrigger while the hand stays in frame. Maintain state (lastGestureTime, lastGestureType); a new gesture counts only if more than 500 ms have passed or the gesture type has changed.
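The debouncing rule above can be sketched as a tiny state machine (class name and millisecond timestamps are our choices):

```python
class GestureDebouncer:
    """Accept a recognized gesture as an action only if more than
    `cooldown_ms` passed since the last accepted one, or the type changed."""

    def __init__(self, cooldown_ms=500):
        self.cooldown_ms = cooldown_ms
        self.last_gesture_type = None
        self.last_gesture_time = None

    def accept(self, gesture_type, timestamp_ms):
        """Return True if this event should trigger an action."""
        if (self.last_gesture_type == gesture_type
                and self.last_gesture_time is not None
                and timestamp_ms - self.last_gesture_time <= self.cooldown_ms):
            return False  # same gesture still within the cooldown window
        self.last_gesture_type = gesture_type
        self.last_gesture_time = timestamp_ms
        return True
```

This mirrors the rule as stated; a stricter variant would also refresh last_gesture_time on rejected events, so a gesture held indefinitely never retriggers.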
On iOS: add MediaPipeTasksVision via Swift Package Manager and wrap GestureRecognizer in an actor. Publish results via @Published properties on an ObservableObject.
On Android: MediaPipe Tasks Android SDK (com.google.mediapipe:tasks-vision). GestureRecognizerOptions with runningMode = RunningMode.LIVE_STREAM for video stream, callbacks via ResultListener.
Common Implementation Mistakes
Ignoring handedness. MediaPipe reports LEFT/RIGHT from the model's perspective, and a front-camera feed is mirrored: the user's right hand is reported as Left. If your logic depends on a specific hand, apply mirror correction.
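A sketch of the mirror correction for front-camera frames (function name is ours; label strings follow MediaPipe's "Left"/"Right" convention):

```python
def correct_handedness(reported_label, is_front_camera=True):
    """Flip the reported handedness label for mirrored front-camera frames,
    so the result matches the user's actual hand. Rear-camera frames are
    not mirrored and pass through unchanged."""
    if not is_front_camera:
        return reported_label
    return "Right" if reported_label == "Left" else "Left"
```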
Ignoring distance to the camera. Normalized coordinates carry no distance information, so geometric thresholds tuned at 50 cm break at 150 cm.
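One common fix is to make thresholds relative to the hand's apparent size instead of absolute normalized units. A sketch using the wrist-to-middle-finger-MCP distance as a size proxy (function names and the 0.25 ratio are our choices):

```python
import math

def hand_scale(wrist, middle_mcp):
    """Wrist -> middle-finger-MCP distance as a proxy for hand size in the
    frame; it shrinks as the hand moves away from the camera."""
    return math.hypot(middle_mcp[0] - wrist[0], middle_mcp[1] - wrist[1])

def is_pinch(thumb_tip, index_tip, wrist, middle_mcp, ratio=0.25):
    """Scale-invariant check: compare the fingertip gap to hand size
    rather than to a fixed normalized-coordinate threshold."""
    gap = math.hypot(index_tip[0] - thumb_tip[0], index_tip[1] - thumb_tip[1])
    return gap < ratio * hand_scale(wrist, middle_mcp)
```

The same ratio then fires at comparable real-world fingertip gaps whether the hand is at 50 cm or 150 cm.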
Timeline
Basic gestures via MediaPipe Gesture Recognizer + action binding: 1 week. Custom geometric or ML gestures + dynamic swipes: 2 weeks. Cost calculated individually.