Image Segmentation Implementation in Mobile Applications
Segmentation is the most computationally expensive computer vision task on mobile. Detection returns a rectangle; segmentation returns a per-pixel mask. On a 512×512 image that is 262,144 pixels, each assigned a class, and at 30 FPS all of them must be processed and rendered within a 33 ms frame budget.
Semantic vs Instance Segmentation: Which to Choose
Semantic Segmentation assigns each pixel one class (sky, person, road). All people in frame = one "person" class. Models: DeepLabV3+, MobileNetV3 Segmentation. TFLite DeepLabV3+ with 257×257 input runs in 22–35 ms on modern Android.
Instance Segmentation gives each object instance a separate mask. Three people = three masks. Models: Mask R-CNN, YOLOv8-seg. Significantly heavier: YOLOv8n-seg on TFLite takes 80–120 ms on mobile; genuine real-time is possible only on flagships with the GPU delegate.
For most consumer cases (background removal, background blur on photos), semantic segmentation of "person" or "background" classes suffices. This is covered by ML Kit Selfie Segmentation—on-device, 30 FPS, a dedicated network trained for this exact use case.
Real-Time Mask Overlay
A segmentation mask is a ByteArray or FloatArray of class indices. Overlaying it on a video stream in 33 ms is a GPU job.
On iOS, use Metal for blending: convert the mask to a CIImage (via a CIFilter chain or a custom Metal kernel), then composite it over the original frame with CIBlendWithMask. Keep the whole pipeline in Metal and avoid data copies by backing textures with an MTLBuffer in shared storage mode.
On Android, use RenderScript (deprecated since API 31) or Vulkan/OpenGL ES via SurfaceView. For new projects: AGSL (Android Graphics Shading Language) from Android 13 onward, or Canvas.drawBitmap with Paint.xfermode = PorterDuffXfermode(PorterDuff.Mode.DST_IN) for simple cases.
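For the simple Canvas path, it helps to see what DST_IN actually computes: the destination (frame) pixel survives only where the source (mask) has alpha. A minimal pure-Kotlin sketch of the per-pixel math on premultiplied ARGB ints (the dstIn helper is hypothetical; on Android the real work is done by Paint.xfermode on the GPU-backed Canvas):

```kotlin
// DST_IN compositing: result = dst scaled by the mask's alpha.
// Pixels are packed premultiplied ARGB ints, as in Bitmap.getPixels().
fun dstIn(dst: Int, maskAlpha: Int): Int {
    val a = (dst ushr 24) * maskAlpha / 255
    val r = ((dst shr 16) and 0xFF) * maskAlpha / 255
    val g = ((dst shr 8) and 0xFF) * maskAlpha / 255
    val b = (dst and 0xFF) * maskAlpha / 255
    return (a shl 24) or (r shl 16) or (g shl 8) or b
}
```

With maskAlpha = 255 the frame pixel passes through unchanged; with 0 it becomes fully transparent, which is exactly the cut-out behavior background removal needs.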
Common mistake: generating a Bitmap from the mask on the CPU in a loop for every frame. On a Pixel 6 this costs ~18 ms just for allocation + copy, which alone kills the 33 ms budget. Correct: use Bitmap.copyPixelsFromBuffer with a pre-allocated ByteBuffer, or pass the mask directly to a shader.
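A sketch of the allocation-free pattern's CPU side, assuming the mask arrives as per-pixel confidences in 0..1 (the maskToArgb helper is hypothetical; on-device you would write into one reused IntArray or ByteBuffer per frame and push it to a single pre-allocated Bitmap via setPixels or copyPixelsFromBuffer):

```kotlin
// Writes a confidence mask (0f..1f per pixel) into a pre-allocated ARGB
// pixel array: black pixels whose alpha is the foreground confidence.
// No per-frame allocation happens here; both arrays are reused.
fun maskToArgb(confidence: FloatArray, out: IntArray) {
    require(confidence.size == out.size)
    for (i in confidence.indices) {
        val alpha = (confidence[i].coerceIn(0f, 1f) * 255f).toInt()
        out[i] = alpha shl 24
    }
}
```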
ML Kit Selfie Segmentation in Practice
Fastest path for "background blur" in video calls or photo editors:
val segmenter = Segmentation.getClient(
    SelfieSegmenterOptions.Builder()
        .setDetectorMode(SelfieSegmenterOptions.STREAM_MODE)
        .enableRawSizeMask()
        .build()
)
STREAM_MODE is optimized for video: it reuses state between frames. enableRawSizeMask() returns the mask at the model's raw output resolution instead of rescaling it to the input image size; use it when you upscale and edge-smooth the mask yourself on the GPU.
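The mask itself is a buffer of per-pixel foreground confidences. A minimal sketch of thresholding it into a binary person/background mask, assuming the floats have been copied out of the result buffer (the binarize helper and the 0.5 cutoff are assumptions, not part of the ML Kit API):

```kotlin
// Thresholds per-pixel foreground confidences into a binary mask:
// 1 = person, 0 = background.
fun binarize(confidences: FloatArray, threshold: Float = 0.5f): ByteArray =
    ByteArray(confidences.size) { i ->
        if (confidences[i] >= threshold) 1.toByte() else 0.toByte()
    }
```

In practice you often skip the hard threshold and feed the raw confidences to the blend shader instead, since soft alpha gives smoother edges around hair and fingers.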
Case: corporate video presentation app, virtual background on iOS. Core Image CIBlendWithMask + ML Kit Selfie Segmentation (iOS SDK): 28 ms on iPhone 13 mini at 720p. On iPhone SE 2nd gen: 41 ms, causing dropped frames every 2–3 seconds at 30 FPS. Solution: lower the processing resolution to 540p and upscale the mask via bilinear interpolation: 24 ms, and the dropped frames disappeared.
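The upscaling step from this case can be sketched in pure Kotlin as bilinear interpolation over a row-major single-channel mask (upscaleBilinear is a hypothetical helper; in production this would run in a shader or via a filtered texture sample):

```kotlin
// Bilinear upscale of a single-channel mask stored row-major.
// Each destination pixel samples the four nearest source pixels
// weighted by distance.
fun upscaleBilinear(
    src: FloatArray, srcW: Int, srcH: Int,
    dstW: Int, dstH: Int
): FloatArray {
    require(srcW >= 2 && srcH >= 2 && src.size == srcW * srcH)
    val dst = FloatArray(dstW * dstH)
    for (y in 0 until dstH) {
        // Map the destination coordinate back into source space.
        val fy = y.toFloat() * (srcH - 1) / (dstH - 1).coerceAtLeast(1)
        val y0 = fy.toInt().coerceAtMost(srcH - 2)
        val ty = fy - y0
        for (x in 0 until dstW) {
            val fx = x.toFloat() * (srcW - 1) / (dstW - 1).coerceAtLeast(1)
            val x0 = fx.toInt().coerceAtMost(srcW - 2)
            val tx = fx - x0
            val top = src[y0 * srcW + x0] * (1 - tx) + src[y0 * srcW + x0 + 1] * tx
            val bot = src[(y0 + 1) * srcW + x0] * (1 - tx) + src[(y0 + 1) * srcW + x0 + 1] * tx
            dst[y * dstW + x] = top * (1 - ty) + bot * ty
        }
    }
    return dst
}
```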
Timeline
Integrating ML Kit Selfie Segmentation with effect overlay: 5–7 days. Custom segmentation model with Metal/OpenGL rendering on real video stream: 2–3 weeks. Cost calculated individually.