AI Emotion Analysis During Video Calls in Mobile Apps
Real-time emotion analysis via camera is technically feasible, but it requires special attention to ethics and UX. This article covers the technical side without hiding the limitations: emotion analysis models are among the AI tools most criticized for unreliability.
Important Limitation to Not Ignore
Academic consensus (Lisa Feldman Barrett, 2019) and practice show that facial expressions do not map unambiguously to emotions. The same pattern of facial muscle movement can mean different things for different people and in different cultures. Therefore:
- Calling the output "emotion" is incorrect—"affective state" or "facial expression" is more accurate
- Systems must never be used for hiring or legal decisions
- Users must explicitly consent to facial analysis
This is not just an ethical note—it is an architectural requirement.
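One way to make consent architectural rather than a policy note is to route every frame through a gate that does no work unless consent is on record. The following is a minimal sketch under that assumption; `ConsentStore` and `ConsentGate` are hypothetical names, not part of any SDK, and a real app would persist the consent state and back it with a consent screen.

```swift
import Foundation

/// Hypothetical in-memory consent record (a real app would persist this).
final class ConsentStore {
    private(set) var facialAnalysisGranted = false
    private(set) var grantedAt: Date?

    func grant(at date: Date = Date()) {
        facialAnalysisGranted = true
        grantedAt = date
    }

    func revoke() {
        facialAnalysisGranted = false
        grantedAt = nil
    }
}

/// Wraps any analysis closure so that it never runs without consent.
struct ConsentGate<Input, Output> {
    let store: ConsentStore
    let analyze: (Input) -> Output

    /// Returns nil, without calling `analyze`, while consent is missing or revoked.
    func run(_ input: Input) -> Output? {
        guard store.facialAnalysisGranted else { return nil }
        return analyze(input)
    }
}
```

Because the gate owns the only path to the analyzer, revoking consent stops analysis immediately, without relying on every call site to check a flag.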
Technical Stack
Face detection — MediaPipe Face Detection (iOS/Android), Vision VNDetectFaceRectanglesRequest (iOS).
Expression recognition — several options:
- Apple Vision VNDetectFaceExpressionsRequest (iOS 17+): built-in, no cloud, 7 basic Action Units
- Microsoft Azure Face API: cloud-based, detailed, includes Action Units
- AWS Rekognition (DetectFaces): cloud-based, 7 basic emotions
- FER+ model (TFLite/CoreML): open source, 8 classes, on-device
For video calls, on-device is mandatory: you cannot stream a peer's face to the cloud without explicit consent.
Implementation on iOS with Vision (On-Device)
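import Vision
import CoreMedia

// iOS 17+: facial expression analysis via Vision
// Check the expression request and its property names against the Vision headers of the SDK you build with.
struct ExpressionResult {
    let faceBox: CGRect
    let browLower, browRaise, eyesClosed: Float
    let mouthSmile, mouthFrown, mouthOpen, jawOpen: Float
}

class FaceExpressionAnalyzer {
    func analyze(sampleBuffer: CMSampleBuffer) async throws -> ExpressionResult? {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return nil }
        return try await analyze(buffer: pixelBuffer)
    }

    // Entry point for callers that already hold a CVPixelBuffer (for example, a video SDK callback)
    func analyze(buffer pixelBuffer: CVPixelBuffer) async throws -> ExpressionResult? {
        let faceRequest = VNDetectFaceLandmarksRequest()
        // iOS 17: expression analysis (brow action units, etc.)
        let expressionRequest = VNDetectFaceExpressionsRequest()
        let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer)
        try handler.perform([faceRequest, expressionRequest])
        guard let faceObs = faceRequest.results?.first,
              let exprObs = expressionRequest.results?.first as? VNFaceExpressionObservation else {
            return nil
        }
        // Left/right pairs are averaged (not summed) so every value stays in 0...1
        return ExpressionResult(
            faceBox: faceObs.boundingBox,
            browLower: exprObs.browLowerQuirk,
            browRaise: (exprObs.browRaiseRight + exprObs.browRaiseLeft) / 2,
            eyesClosed: (exprObs.eyeBlinkLeft + exprObs.eyeBlinkRight) / 2,
            mouthSmile: (exprObs.mouthSmileLeft + exprObs.mouthSmileRight) / 2,
            mouthFrown: (exprObs.mouthFrownLeft + exprObs.mouthFrownRight) / 2,
            mouthOpen: exprObs.mouthOpen,
            jawOpen: exprObs.jawOpen
        )
    }
}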
VNDetectFaceExpressionsRequest works with Action Units, the basic facial muscle movements defined by FACS (Facial Action Coding System). Reporting them is more defensible than "smile = happiness": the output describes a specific muscle action and makes no claim about what the person feels.
Time Aggregation
One frame is noise. Use aggregation over a sliding window:
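struct AggregatedExpression {
    let averageSmile, averageBrowRaise, averageJawOpen: Float
    let smileTrend: Float // > 0: smile increasing, < 0: decreasing
}
extension Array where Element == Float {
    func average() -> Float { isEmpty ? 0 : reduce(0, +) / Float(count) } // mean; 0 if empty
}

class ExpressionAggregator {
    private var history: [ExpressionResult] = []
    private let windowSize = 15 // ~0.5 sec at 30fps
    func update(_ result: ExpressionResult) -> AggregatedExpression {
        history.append(result)
        if history.count > windowSize { history.removeFirst() }
        return AggregatedExpression(
            averageSmile: history.map { $0.mouthSmile }.average(),
            averageBrowRaise: history.map { $0.browRaise }.average(),
            averageJawOpen: history.map { $0.jawOpen }.average(),
            // Trend: smile increasing or decreasing over last N frames
            smileTrend: computeTrend(history.map { $0.mouthSmile })
        )
    }

    // Mean of the newer half of the window minus mean of the older half
    private func computeTrend(_ values: [Float]) -> Float {
        guard values.count >= 2 else { return 0 }
        let mid = values.count / 2
        return Array(values[mid...]).average() - Array(values[..<mid]).average()
    }
}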
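If keeping a frame history is undesirable, an exponential moving average is a common alternative to a sliding window, since it stores a single value per signal. Note that this is a swapped-in technique rather than the windowed aggregation described above; `ExponentialSmoother` and the choice of `alpha` are illustrative assumptions.

```swift
/// Exponential moving average: smoothed = alpha * sample + (1 - alpha) * previous.
struct ExponentialSmoother {
    /// Weight of the newest sample, in 0...1. Smaller values smooth more.
    let alpha: Float
    private(set) var value: Float?

    init(alpha: Float) {
        precondition((0...1).contains(alpha), "alpha must be in 0...1")
        self.alpha = alpha
    }

    /// Feeds one per-frame measurement and returns the smoothed value.
    mutating func update(_ sample: Float) -> Float {
        let next: Float
        if let previous = value {
            next = alpha * sample + (1 - alpha) * previous
        } else {
            next = sample // the first sample initializes the average
        }
        value = next
        return next
    }
}
```

A common rule of thumb is alpha = 2 / (N + 1) to approximate an N-frame window, which gives 0.125 for the 15-frame window used here; treat that as a starting point for tuning, not a derived constant.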
Integration in Video Calls
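Analysis runs on the local video stream from your own camera, not on the peer's stream. Decoded remote frames are technically available on the receiving device, but analyzing them would process the peer's face outside their control, so each participant's frames are analyzed only on their own device. Two approaches:
SDK with analysis support: the Agora Video SDK lets you register a processor for local video frames: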
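// Receiver of analysis results (implemented by the app's UI layer)
protocol EmotionDelegate: AnyObject {
    func didUpdateExpression(_ result: ExpressionResult?)
}

// Agora: process local video before sending
class EmotionVideoProcessor: NSObject, AgoraVideoFrameDelegate {
    let expressionAnalyzer = FaceExpressionAnalyzer()
    weak var emotionDelegate: EmotionDelegate?

    func onCapture(_ videoFrame: AgoraOutputVideoFrame,
                   sourceType: AgoraVideoSourceType) -> Bool {
        // Analyze your own frame before sending
        if let pixelBuffer = videoFrame.pixelBuffer {
            Task {
                let result = try? await expressionAnalyzer.analyze(buffer: pixelBuffer)
                // result describes your own expressions, not the peer's
                await MainActor.run {
                    emotionDelegate?.didUpdateExpression(result)
                }
            }
        }
        return true // pass frame to stream unmodified
    }
}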
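Running the analyzer on every captured frame is rarely necessary, since the results are averaged over time anyway. Below is a minimal sketch of a throttle that admits at most a fixed number of frames per second; `FrameThrottle` is a hypothetical helper, and a rate such as 10 analyses per second is an assumption to tune per device.

```swift
import Foundation

/// Decides which frames to analyze so that analysis runs at most `maxPerSecond` times per second.
struct FrameThrottle {
    private let minInterval: TimeInterval
    private var lastAccepted: TimeInterval?

    init(maxPerSecond: Double) {
        precondition(maxPerSecond > 0, "maxPerSecond must be positive")
        minInterval = 1.0 / maxPerSecond
    }

    /// `timestamp` is the frame's capture time in seconds on any monotonic clock.
    mutating func shouldProcess(at timestamp: TimeInterval) -> Bool {
        if let last = lastAccepted, timestamp - last < minInterval {
            return false // too soon after the last analyzed frame
        }
        lastAccepted = timestamp
        return true
    }
}
```

In a capture callback such as `onCapture`, the check would run before the analysis task is started. The struct is not thread-safe, so it should be touched only from the thread or queue that delivers the frames.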
Peer-to-peer analysis: both participants analyze their own expressions and transmit the results (not video) over a data channel. A WebRTC data channel carrying small JSON packets adds minimal overhead.
// Send expression data via WebRTC DataChannel
struct EmotionDataPacket: Codable {
    let timestamp: Double
    let smile: Float
    let browRaise: Float
    let jawOpen: Float
    // DON'T send images, only numbers
}

func sendEmotionData(_ expression: AggregatedExpression) {
    let packet = EmotionDataPacket(
        timestamp: Date().timeIntervalSince1970,
        smile: expression.averageSmile,
        browRaise: expression.averageBrowRaise,
        jawOpen: expression.averageJawOpen // field renamed to match the value actually sent
    )
    // Skip this update if encoding fails instead of crashing mid-call
    guard let data = try? JSONEncoder().encode(packet) else { return }
    dataChannel.sendData(RTCDataBuffer(data: data, isBinary: false))
}
Each participant analyzes only themselves but sees aggregated data from the peer. Private and technically clean.
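On the receiving side, data from the peer is best treated as untrusted input. The sketch below decodes a packet, drops duplicates and out-of-order packets by comparing the sender's own timestamps (so no clock synchronization between devices is assumed), and clamps values to 0...1. `PeerExpressionPacket` and `PeerPacketValidator` are hypothetical names, and the sketch reads only the `timestamp`, `smile`, and `browRaise` fields.

```swift
import Foundation

/// Fields this sketch reads from a peer's packet; extra JSON keys are ignored by JSONDecoder.
struct PeerExpressionPacket: Codable {
    let timestamp: Double
    let smile: Float
    let browRaise: Float
}

/// Validates packets from one peer.
struct PeerPacketValidator {
    private var lastTimestamp = -Double.infinity

    /// Returns a sanitized packet, or nil if the data is malformed, duplicated, or out of order.
    mutating func accept(_ data: Data) -> PeerExpressionPacket? {
        guard let packet = try? JSONDecoder().decode(PeerExpressionPacket.self, from: data) else {
            return nil // malformed JSON or missing fields
        }
        guard packet.timestamp > lastTimestamp else {
            return nil // duplicate or older than a packet already shown
        }
        lastTimestamp = packet.timestamp
        return PeerExpressionPacket(timestamp: packet.timestamp,
                                    smile: Self.clamp(packet.smile),
                                    browRaise: Self.clamp(packet.browRaise))
    }

    /// Limits a value to the 0...1 range expected by the UI.
    private static func clamp(_ value: Float) -> Float {
        min(max(value, 0), 1)
    }
}
```

One validator instance per peer keeps the ordering check independent for each participant.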
UX: How to Show Results
Showing labels such as "angry / sad / happy" is incorrect and potentially offensive. Better options:
- Engagement indicator: "peer is actively participating" (based on browRaise + eyeBlink rhythm)
- Attention level: neutral engagement indicator without emotion interpretation
- Conversation mood: aggregation of both participants into single "thermal" metric
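import androidx.compose.foundation.background
import androidx.compose.foundation.layout.Box
import androidx.compose.foundation.layout.size
import androidx.compose.foundation.shape.CircleShape
import androidx.compose.runtime.Composable
import androidx.compose.ui.Modifier
import androidx.compose.ui.draw.clip
import androidx.compose.ui.graphics.Color
import androidx.compose.ui.unit.dp

// Small colored dot: the only visible output of the analysis
@Composable
fun EngagementIndicator(score: Float) {
    Box(
        modifier = Modifier
            .size(12.dp)
            .clip(CircleShape)
            .background(
                when {
                    score > 0.7f -> Color(0xFF4CAF50) // engaged
                    score > 0.4f -> Color(0xFFFFC107) // neutral
                    else -> Color(0xFF9E9E9E) // passive
                }
            )
    )
}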
No emotion faces, no verbal labels—only a neutral color indicator.
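The `score` passed to the indicator has to be derived from the measured signals. One possible derivation from brow movement and blink rhythm is sketched below in Swift, to match the analysis code. Everything here is an assumption to be tuned and validated on real calls: the function names, the hysteresis thresholds, the equal weights, and the "typical" blink rate of 15 per minute. Eye-closure inputs are assumed to be per-frame values in 0...1.

```swift
/// Counts blinks in a series of per-frame eye-closure values (0 = open, 1 = closed).
/// Two thresholds (hysteresis) prevent jitter around a single threshold from being double-counted.
func countBlinks(_ eyesClosed: [Float],
                 closeAbove: Float = 0.6,
                 openBelow: Float = 0.3) -> Int {
    var closed = false
    var blinks = 0
    for value in eyesClosed {
        if !closed && value > closeAbove {
            closed = true
            blinks += 1 // count the blink at the moment the eye closes
        } else if closed && value < openBelow {
            closed = false // the eye has clearly reopened
        }
    }
    return blinks
}

/// Maps window statistics to a 0...1 score. The weighting is purely illustrative.
func engagementScore(meanBrowRaise: Float, blinksPerMinute: Float) -> Float {
    // Assumption: a blink rate near `typicalRate` is "normal"; deviation lowers the score.
    let typicalRate: Float = 15
    let blinkTerm = max(0, 1 - abs(blinksPerMinute - typicalRate) / typicalRate)
    let browTerm = min(max(meanBrowRaise, 0), 1)
    return min(max(0.5 * browTerm + 0.5 * blinkTerm, 0), 1)
}
```

The result is a plain number with no emotion label attached, which keeps it compatible with the neutral indicator.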
Timeline Estimates
Adding on-device expression analysis via Vision plus a basic engagement indicator to an existing video call app takes 1–2 weeks. A full system with peer-to-peer transmission over a data channel, aggregation, conversation analytics, a consent screen, and iOS + Android support requires 2–4 weeks.







