Real-Time AI Video Segmentation in Mobile Apps
Real-time video segmentation on mobile is when the app understands everything in the frame—person, background, car, road—and does it on every frame at 15–30 FPS. Making it "work in a demo" is simple. Making it "doesn't overheat, doesn't lag, works on iPhone XR" requires serious optimization work.
Types of Segmentation and Applications
Semantic segmentation — each pixel belongs to a class (background, person, car). Applications: background replacement for video calls, AR effects, traffic scene analysis.
Instance segmentation — separate mask for each object of the same class (three cars—three masks). Applications: object counting, tracking.
Panoptic — combination of both. More computationally expensive; rarely used on mobile.
Model Selection for Real-Time Performance
Speed is critical. Here are real numbers on iPhone 14 Pro (Neural Engine):
| Model | Resolution | FPS (CoreML) | Quality |
|---|---|---|---|
| MobileNetV3-DeepLabV3 | 513×513 | 22–28 | Moderate |
| EfficientPS-lite | 640×360 | 18–24 | Good |
| YOLOv8n-seg | 640×640 | 20–30 | Good |
| Segment Anything (SAM-mobile) | 1024×1024 | 3–5 | Excellent |
SAM is for interactive segmentation (tap object → mask). For real-time without user input, YOLOv8n-seg or DeepLabV3+ are recommended.
iOS: CoreML Pipeline for Real-Time
class RealtimeSegmentationProcessor {
private let model: VNCoreMLModel
private let processQueue = DispatchQueue(label: "segmentation.process", qos: .userInteractive)
// Frame skipping: process every N-th frame
private var frameCounter = 0
private let processEveryNFrames = 2 // 30fps camera → 15fps processing
func captureOutput(_ output: AVCaptureOutput,
didOutput sampleBuffer: CMSampleBuffer,
from connection: AVCaptureConnection) {
frameCounter += 1
guard frameCounter % processEveryNFrames == 0 else { return }
guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
processQueue.async { [weak self] in
self?.runSegmentation(on: pixelBuffer)
}
}
private func runSegmentation(on pixelBuffer: CVPixelBuffer) {
let request = VNCoreMLRequest(model: model) { [weak self] req, _ in
guard let observation = req.results?.first as? VNCoreMLFeatureValueObservation,
let maskArray = observation.featureValue.multiArrayValue else { return }
let mask = self?.processMask(maskArray)
DispatchQueue.main.async {
self?.delegate?.didUpdateSegmentationMask(mask)
}
}
// Important: pixelBuffer must be in correct format
request.imageCropAndScaleOption = .scaleFill
let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
orientation: .right) // landscape orientation
try? handler.perform([request])
}
private func processMask(_ array: MLMultiArray) -> SegmentationMask {
// Convert MLMultiArray → CVPixelBuffer for rendering
// Shape: [numClasses, height, width]
let numClasses = array.shape[0].intValue
let height = array.shape[1].intValue
let width = array.shape[2].intValue
// Argmax over classes for each pixel → label map
var labelMap = [UInt8](repeating: 0, count: height * width)
for y in 0..<height {
for x in 0..<width {
var maxClass = 0
var maxVal: Float = -Float.infinity
for c in 0..<numClasses {
let val = array[[c, y, x] as [NSNumber]].floatValue
if val > maxVal { maxVal = val; maxClass = c }
}
labelMap[y * width + x] = UInt8(maxClass)
}
}
return SegmentationMask(labels: labelMap, width: width, height: height,
classColors: Self.classColorMap)
}
}
Rendering Mask Over Video Stream
A naive approach of drawing the mask in a CPU loop yields 3–5 FPS. The correct approach uses Metal / OpenGL ES:
// Metal shader for mask overlay on video
// Inputs: videoTexture (YCbCr), maskTexture (label map), colorLUT (class→color)
fragment float4 segmentationOverlay(
VertexOut in [[stage_in]],
texture2d<float> videoTexture [[texture(0)]],
texture2d<uint> maskTexture [[texture(1)]],
texture1d<float> colorLUT [[texture(2)]],
constant OverlayParams& params [[buffer(0)]]
) {
float2 uv = in.texCoords;
float4 videoColor = videoTexture.sample(sampler, uv);
uint classLabel = maskTexture.sample(nearestSampler, uv).r;
if (classLabel == 0) { return videoColor; } // background—unchanged
float4 maskColor = colorLUT.sample(sampler, float(classLabel) / float(params.numClasses));
return mix(videoColor, maskColor, params.overlayAlpha); // blending
}
This Metal pipeline renders the mask on the GPU without CPU involvement—stable 30 FPS even on iPhone 11.
Background Replacement—A Special Case
For video calls, binary segmentation (person / background) is popular. MediaPipe Selfie Segmentation is a ready solution optimized for this:
// Android: MediaPipe Selfie Segmentation
val options = ImageSegmenterOptions.builder()
.setBaseOptions(BaseOptions.builder()
.setModelAssetPath("selfie_segmentation.tflite")
.setDelegate(Delegate.GPU)
.build())
.setRunningMode(RunningMode.LIVE_STREAM)
.setResultListener { result, _ ->
val confidenceMask = result.confidenceMasks?.get(0)
updateBackground(confidenceMask)
}
.build()
val segmenter = ImageSegmenter.createFromOptions(context, options)
Delegate.GPU is critical: the same MediaPipe on CPU yields 8–12 FPS; on GPU—25–30 FPS.
Timeline Estimates
Basic single-class segmentation (e.g., person) with a ready model and simple rendering takes 1 week. Multi-class segmentation with Metal/GPU rendering, custom model for a specific task, performance optimization, and iOS + Android support requires 2–4 weeks.







