AI real-time video stream segmentation in mobile app

NOVASOLUTIONS.TECHNOLOGY is engaged in the development, support and maintenance of iOS, Android, PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular markets like Google Play, App Store, Amazon, AppGallery and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.


Real-Time AI Video Segmentation in Mobile Apps

Real-time video segmentation on mobile means the app classifies everything in the frame (person, background, car, road) and does it for every frame at 15–30 FPS. Making it "work in a demo" is simple. Making it work without overheating or lag, including on an iPhone XR, requires serious optimization work.
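
Those frame rates translate into a hard latency budget: at 30 FPS a frame has about 33 ms for capture, inference, and rendering combined. A quick back-of-the-envelope check (a sketch; the helper names and latency numbers are illustrative):

```python
import math

def max_fps(inference_ms: float, render_ms: float) -> float:
    """Upper bound on end-to-end FPS if stages run sequentially."""
    return 1000.0 / (inference_ms + render_ms)

def frames_to_skip(camera_fps: int, inference_ms: float) -> int:
    """Process every N-th frame so inference keeps up with the camera."""
    frame_budget_ms = 1000.0 / camera_fps
    return max(1, math.ceil(inference_ms / frame_budget_ms))

# 40 ms inference on a 30 fps camera → process every 2nd frame (15 fps output)
print(frames_to_skip(30, 40))        # → 2
print(round(max_fps(40, 5), 1))      # → 22.2
```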

Types of Segmentation and Applications

Semantic segmentation — each pixel belongs to a class (background, person, car). Applications: background replacement for video calls, AR effects, traffic scene analysis.

Instance segmentation — separate mask for each object of the same class (three cars—three masks). Applications: object counting, tracking.

Panoptic — combination of both. More computationally expensive; rarely used on mobile.
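
The practical difference between the first two shows up in the output format: a semantic model emits one label map, an instance model emits one mask per object. A minimal NumPy sketch (shapes and class count are illustrative):

```python
import numpy as np

# Semantic segmentation: per-class logits [num_classes, H, W] → one label map
logits = np.random.randn(3, 4, 4).astype(np.float32)
label_map = logits.argmax(axis=0)   # shape (4, 4), each pixel gets one class id

# Instance segmentation: a separate binary mask per object, tagged with a class;
# two objects of the same class keep distinct masks (three cars → three masks)
car = 2
instances = [
    (car, np.array(label_map == car)),       # illustrative; real models emit
    (car, np.zeros((4, 4), dtype=bool)),     # per-instance masks directly
]
assert label_map.shape == (4, 4)
assert len(instances) == 2
```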

Model Selection for Real-Time Performance

Speed is critical. Here are real numbers on iPhone 14 Pro (Neural Engine):

| Model | Resolution | FPS (CoreML) | Quality |
|-------|-----------|--------------|---------|
| MobileNetV3-DeepLabV3 | 513×513 | 22–28 | Moderate |
| EfficientPS-lite | 640×360 | 18–24 | Good |
| YOLOv8n-seg | 640×640 | 20–30 | Good |
| Segment Anything (SAM-mobile) | 1024×1024 | 3–5 | Excellent |

SAM is for interactive segmentation (tap object → mask). For real-time without user input, YOLOv8n-seg or DeepLabV3+ are recommended.

iOS: CoreML Pipeline for Real-Time

import AVFoundation
import Vision

struct SegmentationMask {
    let labels: [UInt8]       // per-pixel class index
    let width: Int
    let height: Int
    let classColors: [UInt8]  // flat RGBA palette, 4 bytes per class
}

protocol RealtimeSegmentationDelegate: AnyObject {
    func didUpdateSegmentationMask(_ mask: SegmentationMask?)
}

final class RealtimeSegmentationProcessor: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {

    private let model: VNCoreMLModel
    private let processQueue = DispatchQueue(label: "segmentation.process", qos: .userInteractive)
    weak var delegate: RealtimeSegmentationDelegate?

    static let classColorMap: [UInt8] = []  // fill with the palette for your model's classes

    init(model: VNCoreMLModel) {
        self.model = model
    }

    // Frame skipping: process every N-th frame
    private var frameCounter = 0
    private let processEveryNFrames = 2  // 30fps camera → 15fps processing

    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        frameCounter += 1
        guard frameCounter % processEveryNFrames == 0 else { return }
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

        processQueue.async { [weak self] in
            self?.runSegmentation(on: pixelBuffer)
        }
    }

    private func runSegmentation(on pixelBuffer: CVPixelBuffer) {
        let request = VNCoreMLRequest(model: model) { [weak self] req, _ in
            guard let self = self,
                  let observation = req.results?.first as? VNCoreMLFeatureValueObservation,
                  let maskArray = observation.featureValue.multiArrayValue else { return }

            let mask = self.processMask(maskArray)
            DispatchQueue.main.async {
                self.delegate?.didUpdateSegmentationMask(mask)
            }
        }

        // Let Vision resize the buffer to the model's input size
        request.imageCropAndScaleOption = .scaleFill
        // .right: camera buffers arrive rotated; this maps them to portrait
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                            orientation: .right)
        try? handler.perform([request])
    }

    private func processMask(_ array: MLMultiArray) -> SegmentationMask {
        // Model output shape: [numClasses, height, width]
        let numClasses = array.shape[0].intValue
        let height = array.shape[1].intValue
        let width = array.shape[2].intValue

        // Argmax over classes for each pixel → label map
        var labelMap = [UInt8](repeating: 0, count: height * width)
        for y in 0..<height {
            for x in 0..<width {
                var maxClass = 0
                var maxVal = -Float.infinity
                for c in 0..<numClasses {
                    let val = array[[c, y, x] as [NSNumber]].floatValue
                    if val > maxVal { maxVal = val; maxClass = c }
                }
                labelMap[y * width + x] = UInt8(maxClass)
            }
        }
        return SegmentationMask(labels: labelMap, width: width, height: height,
                                classColors: Self.classColorMap)
    }
}
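
The triple loop in processMask is correct but slow to iterate on: MLMultiArray subscripting boxes every index in NSNumber. When tuning a model offline it helps to verify the same argmax in NumPy first (a sketch, assuming the same [numClasses, height, width] layout):

```python
import numpy as np

def label_map_from_logits(logits: np.ndarray) -> np.ndarray:
    """logits: float array of shape [num_classes, height, width].
    Returns a uint8 label map of shape [height, width]."""
    assert logits.ndim == 3 and logits.shape[0] < 256
    return logits.argmax(axis=0).astype(np.uint8)

# Tiny check: class 1 wins at pixel (0, 0), class 0 everywhere else
logits = np.zeros((2, 2, 2), dtype=np.float32)
logits[1, 0, 0] = 5.0
print(label_map_from_logits(logits))  # label 1 at (0, 0), label 0 elsewhere
```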

Rendering Mask Over Video Stream

A naive approach of drawing the mask in a CPU loop yields 3–5 FPS. The correct approach uses Metal / OpenGL ES:

// Metal shader for mask overlay on video
// Inputs: videoTexture (frame, YCbCr already converted to RGBA upstream),
//         maskTexture (label map), colorLUT (class → color)
#include <metal_stdlib>
using namespace metal;

struct VertexOut {
    float4 position [[position]];
    float2 texCoords;
};

struct OverlayParams {
    float overlayAlpha;
    uint  numClasses;
};

constexpr sampler linearSampler(filter::linear, address::clamp_to_edge);
constexpr sampler nearestSampler(filter::nearest, address::clamp_to_edge);

fragment float4 segmentationOverlay(
    VertexOut in [[stage_in]],
    texture2d<float> videoTexture [[texture(0)]],
    texture2d<uint> maskTexture [[texture(1)]],
    texture1d<float> colorLUT [[texture(2)]],
    constant OverlayParams& params [[buffer(0)]]
) {
    float2 uv = in.texCoords;
    float4 videoColor = videoTexture.sample(linearSampler, uv);
    uint classLabel = maskTexture.sample(nearestSampler, uv).r;

    if (classLabel == 0) { return videoColor; }  // background: unchanged

    // Sample at the texel center of the LUT entry for this class
    float lutCoord = (float(classLabel) + 0.5) / float(params.numClasses);
    float4 maskColor = colorLUT.sample(linearSampler, lutCoord);
    return mix(videoColor, maskColor, params.overlayAlpha);  // alpha blend
}

This Metal pipeline renders the mask on the GPU without CPU involvement—stable 30 FPS even on iPhone 11.
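
The mix() call at the end of the shader is plain linear interpolation. The same per-pixel math, sketched in Python for clarity:

```python
def blend(video_rgb, mask_rgb, alpha):
    """out = video*(1-alpha) + mask*alpha per channel — what Metal's mix() computes."""
    return tuple(v * (1.0 - alpha) + m * alpha for v, m in zip(video_rgb, mask_rgb))

# 50% overlay of pure red on a gray pixel
print(blend((0.5, 0.5, 0.5), (1.0, 0.0, 0.0), 0.5))  # → (0.75, 0.25, 0.25)
```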

Background Replacement—A Special Case

For video calls, binary segmentation (person / background) is popular. MediaPipe Selfie Segmentation is a ready solution optimized for this:

// Android: MediaPipe Tasks Selfie Segmentation
import com.google.mediapipe.tasks.core.BaseOptions
import com.google.mediapipe.tasks.core.Delegate
import com.google.mediapipe.tasks.vision.core.RunningMode
import com.google.mediapipe.tasks.vision.imagesegmenter.ImageSegmenter
import com.google.mediapipe.tasks.vision.imagesegmenter.ImageSegmenter.ImageSegmenterOptions

val options = ImageSegmenterOptions.builder()
    .setBaseOptions(BaseOptions.builder()
        .setModelAssetPath("selfie_segmentation.tflite")
        .setDelegate(Delegate.GPU)
        .build())
    .setRunningMode(RunningMode.LIVE_STREAM)
    .setOutputConfidenceMasks(true)
    .setResultListener { result, _ ->
        // confidenceMasks() returns Optional<List<MPImage>>
        val confidenceMask = result.confidenceMasks().orElse(null)?.firstOrNull()
        confidenceMask?.let { updateBackground(it) }
    }
    .build()

val segmenter = ImageSegmenter.createFromOptions(context, options)

Delegate.GPU is critical: the same MediaPipe on CPU yields 8–12 FPS; on GPU—25–30 FPS.
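
On the app side, the confidence mask is a per-pixel probability that the pixel belongs to the person, so background replacement is a soft blend rather than a hard cutout. A sketch of what a callback like updateBackground might do internally (NumPy used for clarity; the function name and array shapes are illustrative):

```python
import numpy as np

def replace_background(frame: np.ndarray, background: np.ndarray,
                       confidence: np.ndarray) -> np.ndarray:
    """frame, background: [H, W, 3] floats; confidence: [H, W] in [0, 1],
    the per-pixel probability of 'person'. Soft-blends the two images."""
    alpha = confidence[..., np.newaxis]           # broadcast over channels
    return frame * alpha + background * (1.0 - alpha)

frame = np.ones((2, 2, 3))        # all-white "person" frame
background = np.zeros((2, 2, 3))  # black virtual background
conf = np.array([[1.0, 0.5], [0.0, 1.0]])
out = replace_background(frame, background, conf)
print(out[0, 1])  # → [0.5 0.5 0.5]  (soft edge pixel)
```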

Timeline Estimates

Basic single-class segmentation (e.g., person) with a ready model and simple rendering takes 1 week. Multi-class segmentation with Metal/GPU rendering, custom model for a specific task, performance optimization, and iOS + Android support requires 2–4 weeks.