AI emotion analysis during video call in mobile app

NOVASOLUTIONS.TECHNOLOGY is engaged in the development, support and maintenance of iOS, Android, PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular markets like Google Play, App Store, Amazon, AppGallery and others.


AI Emotion Analysis During Video Calls in Mobile Apps

Real-time emotion analysis from a camera feed is technically feasible, but it demands careful attention to ethics and UX. This article covers the technical side without hiding the limitations: emotion recognition models are among the AI tools most often criticized for unreliability.

An Important Limitation You Should Not Ignore

A widely cited research review (Barrett et al., 2019) and practical experience both show that facial expressions do not map unambiguously to emotions. The same pattern of facial muscle movement can mean different things for different people, contexts, and cultures. Therefore:

  • Calling the output "emotion" is inaccurate; "affective state" or "facial expression" is more accurate
  • Such systems must never be used for hiring or legal decisions
  • Users must explicitly consent to facial analysis

This is not just an ethical note—it is an architectural requirement.
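
One way to treat consent as an architectural requirement is to make the analysis path unreachable without it. The following is a minimal sketch in plain Swift; `FacialAnalysisConsent`, `ConsentStore`, and `ConsentGate` are hypothetical names, and the store is kept in memory for brevity (a real app would persist it and would likely version the consent text).

```swift
import Foundation

/// Hypothetical consent record; not part of any SDK.
struct FacialAnalysisConsent: Equatable {
    let grantedAt: Date
    let policyVersion: Int
}

/// In-memory consent store (assumption: persistence is handled elsewhere).
final class ConsentStore {
    var currentPolicyVersion: Int
    private(set) var consent: FacialAnalysisConsent?

    init(currentPolicyVersion: Int) {
        self.currentPolicyVersion = currentPolicyVersion
    }

    func grant(at date: Date = Date()) {
        consent = FacialAnalysisConsent(grantedAt: date,
                                        policyVersion: currentPolicyVersion)
    }

    func revoke() { consent = nil }

    /// Consent counts only if it was given for the current policy version.
    var isAnalysisAllowed: Bool {
        consent?.policyVersion == currentPolicyVersion
    }
}

/// Wraps any per-frame analysis closure so it cannot run without consent.
struct ConsentGate<Input, Output> {
    let store: ConsentStore
    let analyze: (Input) -> Output?

    func run(_ input: Input) -> Output? {
        guard store.isAnalysisAllowed else { return nil }
        return analyze(input)
    }
}
```

Routing every frame through a gate like this means that revoking consent, or bumping the policy version, stops analysis immediately without touching the capture code.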

Technical Stack

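Face detection: MediaPipe Face Detection (iOS/Android) or Vision VNDetectFaceRectanglesRequest (iOS).

Expression recognition, several options:

  • Apple on-device APIs: built-in, no cloud, per-muscle expression coefficients. The code below uses VNDetectFaceExpressionsRequest (iOS 17+) as named in this article; that request could not be confirmed in Apple's public Vision documentation, so verify it against your SDK. ARKit face tracking (ARFaceAnchor.blendShapes) is a documented on-device source of comparable coefficients
  • Microsoft Azure Face API: cloud-based. Microsoft announced in June 2022 that it was retiring the emotion attribute, so treat it as unavailable for new projects
  • AWS Rekognition (DetectFaces): cloud-based, returns a fixed set of emotion labels with confidence scores
  • FER+ model (TFLite/CoreML): open source, 8 classes, on-device

For video calls, on-device processing is mandatory: you cannot send a person's face to the cloud without their explicit consent.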

Implementation on iOS with Vision (On-Device)

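// iOS 17+: facial expression analysis via Vision.
// NOTE: VNDetectFaceExpressionsRequest / VNFaceExpressionObservation are kept as this
// article names them; they could not be confirmed in the public Vision SDK, so verify
// them before use. ARKit's ARFaceAnchor.blendShapes is a documented alternative.
import CoreMedia
import Vision

// Per-frame coefficients (assumed 0...1); left/right pairs are averaged.
struct ExpressionResult {
    let faceBox: CGRect
    let browLower, browRaise, eyesClosed: Float
    let mouthSmile, mouthFrown, mouthOpen, jawOpen: Float
}

final class FaceExpressionAnalyzer {

    func analyze(sampleBuffer: CMSampleBuffer) async throws -> ExpressionResult? {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return nil }
        return try await analyze(buffer: pixelBuffer)
    }

    // Overload for callers that already hold a CVPixelBuffer (e.g. a video SDK callback)
    func analyze(buffer pixelBuffer: CVPixelBuffer) async throws -> ExpressionResult? {
        let faceRequest = VNDetectFaceLandmarksRequest()
        let expressionRequest = VNDetectFaceExpressionsRequest()

        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer)
        try handler.perform([faceRequest, expressionRequest])

        guard let faceObs = faceRequest.results?.first,
              let exprObs = expressionRequest.results?.first as? VNFaceExpressionObservation else {
            return nil
        }

        return ExpressionResult(
            faceBox: faceObs.boundingBox,
            browLower: exprObs.browLowerQuirk,
            browRaise: (exprObs.browRaiseLeft + exprObs.browRaiseRight) / 2,
            eyesClosed: (exprObs.eyeBlinkLeft + exprObs.eyeBlinkRight) / 2,
            mouthSmile: (exprObs.mouthSmileLeft + exprObs.mouthSmileRight) / 2,
            mouthFrown: (exprObs.mouthFrownLeft + exprObs.mouthFrownRight) / 2,
            mouthOpen: exprObs.mouthOpen,
            jawOpen: exprObs.jawOpen
        )
    }
}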

The expression request outputs Action Unit-style coefficients: basic facial muscle movements as described by FACS (Facial Action Coding System). This is more defensible than "smile = happiness": the output names a specific muscle action and attaches no interpretation.

Time Aggregation

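A single frame is noisy. Aggregate over a sliding window:

struct AggregatedExpression {
    let averageSmile: Float
    let averageBrowRaise: Float
    let averageJawOpen: Float
    let smileTrend: Float  // > 0: smile increasing across the window
}

final class ExpressionAggregator {
    private var history: [ExpressionResult] = []
    private let windowSize = 15  // ~0.5 s at 30 fps

    func update(_ result: ExpressionResult) -> AggregatedExpression {
        history.append(result)
        if history.count > windowSize { history.removeFirst() }
        let smiles = history.map { $0.mouthSmile }
        return AggregatedExpression(
            averageSmile: mean(smiles),
            averageBrowRaise: mean(history.map { $0.browRaise }),
            averageJawOpen: mean(history.map { $0.jawOpen }),
            smileTrend: trend(smiles)
        )
    }

    private func mean(_ values: [Float]) -> Float {
        values.isEmpty ? 0 : values.reduce(0, +) / Float(values.count)
    }

    // Trend: mean of the newer half of the window minus mean of the older half
    private func trend(_ values: [Float]) -> Float {
        guard values.count >= 2 else { return 0 }
        let mid = values.count / 2
        return mean(Array(values[mid...])) - mean(Array(values[..<mid]))
    }
}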
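
A fixed frame count only equals half a second while the camera really delivers 30 fps, and during a call the frame rate can drop. A window keyed on timestamps avoids that assumption. The sketch below is one possible variant, not the aggregator above; `TimedWindow` is a hypothetical name and the timestamps are assumed to be in seconds from a monotonic clock.

```swift
import Foundation

/// Keeps (time, value) samples that are no older than `duration` seconds.
struct TimedWindow {
    private(set) var samples: [(time: TimeInterval, value: Float)] = []
    let duration: TimeInterval

    init(duration: TimeInterval) {
        self.duration = duration
    }

    mutating func add(_ value: Float, at time: TimeInterval) {
        samples.append((time: time, value: value))
        // Drop samples that have fallen out of the window.
        samples.removeAll { time - $0.time > duration }
    }

    /// Mean of the values currently in the window; nil when it is empty.
    var mean: Float? {
        guard !samples.isEmpty else { return nil }
        return samples.reduce(0) { $0 + $1.value } / Float(samples.count)
    }
}
```

With this shape, one window per coefficient (smile, brow raise, jaw open) covers the same half second of wall-clock time regardless of how many frames arrived in it.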

Integration in Video Calls

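Analysis should run on the local video stream from your own camera, not on the peer's stream. Decoded frames of the remote stream are technically reachable on the receiving device, but analyzing them means processing another person's face outside their control, so each device analyzes only its own camera. Two approaches:

SDK with frame-processing hooks: the Agora Video SDK lets you register a delegate that receives local frames before they are sent: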

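// Agora: inspect the local video frame before it is sent.
// Register the processor with the engine's setVideoFrameDelegate(_:).
import AgoraRtcKit

// Hypothetical callback protocol for the UI layer
protocol EmotionDelegate: AnyObject {
    func didUpdateExpression(_ result: ExpressionResult)
}

final class EmotionVideoProcessor: NSObject, AgoraVideoFrameDelegate {
    private let expressionAnalyzer = FaceExpressionAnalyzer()
    weak var emotionDelegate: EmotionDelegate?

    func onCapture(_ videoFrame: AgoraOutputVideoFrame,
                   sourceType: AgoraVideoSourceType) -> Bool {
        // Analyze your own frame; the peer's frames never pass through this callback
        if let pixelBuffer = videoFrame.pixelBuffer {
            Task { [weak self] in
                guard let self,
                      let result = try? await self.expressionAnalyzer.analyze(buffer: pixelBuffer)
                else { return }
                // result describes your own expression, not the peer's
                await MainActor.run {
                    self.emotionDelegate?.didUpdateExpression(result)
                }
            }
        }
        return true  // pass the frame on to the stream unmodified
    }
}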
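
The capture callback fires for every frame, and expression coefficients do not need to be refreshed 30 times per second. A small throttle in front of the analyzer keeps CPU and battery cost down. This is a minimal sketch; `FrameThrottle` is a hypothetical helper and the rate you choose is a tuning assumption, not a recommendation from any SDK.

```swift
import Foundation

/// Lets at most one frame through per `minInterval` seconds.
struct FrameThrottle {
    let minInterval: TimeInterval
    private var lastAccepted: TimeInterval?

    init(maxRatePerSecond: Double) {
        self.minInterval = 1.0 / maxRatePerSecond
    }

    /// Returns true if the frame captured at `timestamp` should be analyzed.
    mutating func shouldProcess(at timestamp: TimeInterval) -> Bool {
        if let last = lastAccepted, timestamp - last < minInterval {
            return false
        }
        lastAccepted = timestamp
        return true
    }
}
```

Because the struct is mutated on every call, it should be owned by whichever queue delivers the frames rather than shared across threads.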

Peer-to-peer exchange: both participants analyze their own expressions and transmit the results (not video) over a WebRTC data channel. Small JSON packets add minimal overhead.

// Send expression data via WebRTC DataChannel
import Foundation
import WebRTC

struct EmotionDataPacket: Codable {
    let timestamp: Double
    let smile: Float
    let browRaise: Float
    let jawOpen: Float
    // DON'T send images, only numbers
}

func sendEmotionData(_ expression: AggregatedExpression) {
    // Skip sending while the channel is still connecting or already closed
    guard dataChannel.readyState == .open else { return }

    let packet = EmotionDataPacket(
        timestamp: Date().timeIntervalSince1970,
        smile: expression.averageSmile,
        browRaise: expression.averageBrowRaise,
        jawOpen: expression.averageJawOpen
    )
    // Encoding throws on non-finite floats; drop the packet rather than crash
    guard let data = try? JSONEncoder().encode(packet) else { return }
    _ = dataChannel.sendData(RTCDataBuffer(data: data, isBinary: false))
}

Each participant analyzes only themselves but sees aggregated data from the peer. Private and technically clean.
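
On the receiving side, packets from the peer are untrusted input and are worth validating before they reach the UI. The sketch below decodes only the fields it needs (JSONDecoder ignores extra keys), rejects packets whose timestamp is too far from the local clock, and clamps values into an expected range. `PeerExpressionPacket`, `PacketError`, `parsePeerPacket`, the 2-second tolerance, and the 0...1 range are all assumptions for illustration; the tolerance also presumes both devices have roughly synchronized clocks.

```swift
import Foundation

/// Hypothetical receiver-side view of the packet; extra JSON keys are ignored.
struct PeerExpressionPacket: Codable, Equatable {
    let timestamp: Double
    let smile: Float
    let browRaise: Float
}

enum PacketError: Error, Equatable {
    case malformed
    case stale
}

/// Decodes, checks freshness, and clamps values into `validRange`.
func parsePeerPacket(_ data: Data,
                     now: TimeInterval,
                     maxAge: TimeInterval = 2.0,
                     validRange: ClosedRange<Float> = 0...1)
    -> Result<PeerExpressionPacket, PacketError> {
    guard let raw = try? JSONDecoder().decode(PeerExpressionPacket.self, from: data) else {
        return .failure(.malformed)
    }
    // abs() tolerates small clock differences in either direction.
    guard abs(now - raw.timestamp) <= maxAge else {
        return .failure(.stale)
    }
    func clamp(_ value: Float) -> Float {
        min(max(value, validRange.lowerBound), validRange.upperBound)
    }
    return .success(PeerExpressionPacket(timestamp: raw.timestamp,
                                         smile: clamp(raw.smile),
                                         browRaise: clamp(raw.browRaise)))
}
```

Returning a `Result` keeps the data-channel callback simple: successes feed the indicator, failures can be counted or logged without interrupting the call.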

UX: How to Show Results

Showing "angry / sad / happy" is inaccurate and potentially offensive. Better options:

  • Engagement indicator: "peer is actively participating" (based on browRaise + eyeBlink rhythm)
  • Attention level: a neutral engagement indicator without emotion interpretation
  • Conversation mood: aggregation of both participants into a single "temperature" metric
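
On Android, a minimal Jetpack Compose indicator:

import androidx.compose.foundation.background
import androidx.compose.foundation.layout.Box
import androidx.compose.foundation.layout.size
import androidx.compose.foundation.shape.CircleShape
import androidx.compose.runtime.Composable
import androidx.compose.ui.Modifier
import androidx.compose.ui.draw.clip
import androidx.compose.ui.graphics.Color
import androidx.compose.ui.unit.dp

@Composable
fun EngagementIndicator(score: Float) {
    Box(
        modifier = Modifier
            .size(12.dp)
            .clip(CircleShape)
            .background(
                when {
                    score > 0.7f -> Color(0xFF4CAF50)   // engaged
                    score > 0.4f -> Color(0xFFFFC107)   // neutral
                    else -> Color(0xFF9E9E9E)            // passive
                }
            )
    )
}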

No emotion faces, no verbal labels—only a neutral color indicator.
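
The indicator expects a score between 0 and 1, but the text names only its inputs (brow raise and blink rhythm), not a formula. The sketch below shows one way such a score could be assembled. Everything in it is an assumption to be tuned on real data: the names, the equal weighting, and the 8 to 30 blinks-per-minute range treated as "typical". The tier thresholds mirror the 0.7 and 0.4 cut-offs used by the color indicator.

```swift
import Foundation

/// Windowed inputs for one participant. All names here are hypothetical.
struct EngagementInputs {
    var averageBrowRaise: Float  // expected 0...1
    var blinksPerMinute: Float   // blink rate over the window
}

enum EngagementTier { case engaged, neutral, passive }

/// Weighted blend of brow activity and blink-rate plausibility, in 0...1.
/// `browWeight` and `blinkRange` are illustrative defaults, not validated values.
func engagementScore(_ inputs: EngagementInputs,
                     browWeight: Float = 0.5,
                     blinkRange: ClosedRange<Float> = 8...30) -> Float {
    let brow = min(max(inputs.averageBrowRaise, 0), 1)
    let span = blinkRange.upperBound - blinkRange.lowerBound
    let rate = inputs.blinksPerMinute

    // Distance of the blink rate from the assumed typical range (0 inside it).
    let distance: Float
    if rate < blinkRange.lowerBound {
        distance = blinkRange.lowerBound - rate
    } else if rate > blinkRange.upperBound {
        distance = rate - blinkRange.upperBound
    } else {
        distance = 0
    }
    // 1 inside the range, falling linearly to 0 one full span outside it.
    let blink: Float = span > 0 ? max(0, 1 - distance / span) : (distance == 0 ? 1 : 0)

    return browWeight * brow + (1 - browWeight) * blink
}

/// Same thresholds as the color indicator: > 0.7 engaged, > 0.4 neutral.
func tier(for score: Float) -> EngagementTier {
    if score > 0.7 { return .engaged }
    if score > 0.4 { return .neutral }
    return .passive
}

/// "Conversation mood" as the plain mean of both participants' scores.
func conversationMood(local: Float, peer: Float) -> Float {
    (local + peer) / 2
}
```

A score like this describes visible activity only. It should be presented as such and never relabeled as an emotion.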

Timeline Estimates

Adding on-device expression analysis via Vision plus a basic engagement indicator to an existing video call app takes 1–2 weeks. A full system with peer-to-peer data transmission over a data channel, aggregation, conversation analytics, a consent screen, and iOS + Android support requires 2–4 weeks.