AI Photo Animation in Mobile Apps
"Bringing a static photo to life" means synthesizing motion where none exists: eyes that blink, a head that turns slightly, hair swaying in the wind. This is a generative-model task, and as of 2024 implementing it fully on-device remains non-trivial.
Two Architectural Approaches
Server inference: the model lives on the backend. The app uploads a photo and receives a video. This is simpler to deploy, has no model-size constraints, and can use SadTalker, LivePortrait, or AnimateDiff. The downsides are the need for an internet connection, 3–15 seconds of latency, and the cost of GPU time.
On-device: lighter, specialized models. Options are face reenactment via landmark-based warping (a mobile version of the First Order Motion Model) or simple animation via optical flow. This works offline, but quality is lower.
A common compromise is a hybrid: a quick, low-quality preview on-device and the final result from the server.
On-Device: Facial Animation via Keypoints
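A lightweight approach without a generative network: use MediaPipe face landmarks (the Face Mesh topology has 468 points; the newer FaceLandmarker task returns 478, including iris points) to build a mesh, then deform the source image along a given motion trajectory.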
// MediaPipe FaceLandmarker on iOS
import MediaPipeTasksVision
import UIKit

let options = FaceLandmarkerOptions()
// face_landmarker.task must be bundled with the app; the force-unwrap crashes if it is missing
options.baseOptions.modelAssetPath = Bundle.main.path(forResource: "face_landmarker", ofType: "task")!
options.numFaces = 1
options.minFaceDetectionConfidence = 0.5
let faceLandmarker = try FaceLandmarker(options: options)
// sourcePhoto is the UIImage to animate; MPImage(uiImage:) can throw
let result = try faceLandmarker.detect(image: MPImage(uiImage: sourcePhoto))
// result.faceLandmarks.first: [NormalizedLandmark] for the detected face (x, y normalized to 0...1)
// Build deformation via TPS (Thin Plate Spline) or affine warp
The animation follows either a pre-recorded head-motion trajectory (mocap data) or a synthetic one: sinusoidal oscillations of the keypoints with different amplitudes. The deformed image is rendered on the GPU with Metal (for example as a textured mesh), which takes a few milliseconds per frame.
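The synthetic option can be sketched in plain Swift. This is a minimal illustration, not a tuned implementation: the `Point2D` and `Oscillation` types, the function name, and the choice of one sine wave per keypoint are all assumptions made for the example. In practice the rest positions would come from the detected landmarks.

```swift
import Foundation

/// A 2D point in normalized image coordinates (0...1). Hypothetical type for this sketch.
struct Point2D: Equatable {
    var x: Double
    var y: Double
}

/// Hypothetical per-keypoint motion parameters:
/// amplitudes in normalized units, frequency in Hz, phase in radians.
struct Oscillation {
    var amplitudeX: Double
    var amplitudeY: Double
    var frequency: Double
    var phase: Double
}

/// Returns one displaced keypoint set per frame.
/// Each keypoint oscillates around its rest position with its own sine wave.
func syntheticTrajectory(rest: [Point2D],
                         motion: [Oscillation],
                         frameCount: Int,
                         fps: Double) -> [[Point2D]] {
    precondition(rest.count == motion.count, "one oscillation per keypoint")
    return (0..<frameCount).map { frame -> [Point2D] in
        let t = Double(frame) / fps  // time of this frame in seconds
        return zip(rest, motion).map { (point, m) -> Point2D in
            let offset = sin(2.0 * Double.pi * m.frequency * t + m.phase)
            return Point2D(x: point.x + m.amplitudeX * offset,
                           y: point.y + m.amplitudeY * offset)
        }
    }
}
```

With a phase of zero, frame 0 equals the rest pose, so the clip starts from the unmodified photo. Giving neighboring keypoints slightly different phases and amplitudes avoids the whole face moving as one rigid block.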
The result is a 3–5 second animation exported to .mp4 via AVAssetWriter. Quality is sufficient for a "living portrait", but without a full generative model (GAN), artifacts at the face edges and in the background are hard to avoid.
First Order Motion Model (FOMM): Mobile Version
FOMM transfers motion from a driving video (the donor) to a source image. On mobile it runs via TFLite or ONNX Runtime, but even after optimization the model is 40–80 MB. On iPhone 12 and newer, inference for one 256×256 frame takes roughly 200–400 ms, so a 30-frame animation (1 second at 30 fps) needs 6–12 seconds of processing. This is one-time generation, not real-time.
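The arithmetic behind that estimate is frames multiplied by per-frame latency. A small sketch, assuming a hypothetical helper name and using the per-frame figures quoted above, makes it easy to plug in other clip lengths:

```swift
import Foundation

/// Estimated total processing time in seconds for an offline-generated clip.
/// The per-frame latency range is an input; 200...400 ms is the figure quoted in the text.
func estimatedProcessingSeconds(durationSeconds: Double,
                                fps: Double,
                                perFrameMilliseconds: ClosedRange<Double>) -> ClosedRange<Double> {
    // Round up so a partial frame still counts as one inference call.
    let frames = (durationSeconds * fps).rounded(.up)
    let fastest = frames * perFrameMilliseconds.lowerBound / 1000
    let slowest = frames * perFrameMilliseconds.upperBound / 1000
    return fastest...slowest
}

// One second at 30 fps: 30 frames, 6 to 12 seconds of processing.
let oneSecondClip = estimatedProcessingSeconds(durationSeconds: 1,
                                               fps: 30,
                                               perFrameMilliseconds: 200...400)
```

By the same formula a 4-second clip at 30 fps is 120 frames, or 24 to 48 seconds, which is why lowering the frame rate or interpolating between generated frames is worth considering for longer animations.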
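// Android: ONNX Runtime with FOMM
import ai.onnxruntime.OnnxTensor
import ai.onnxruntime.OrtEnvironment
import java.nio.FloatBuffer

val env = OrtEnvironment.getEnvironment()
// createSession takes a file-system path (or a ByteArray read from assets)
val session = env.createSession("fomm_optimized.onnx")
// Model inputs: source frame (1, 3, 256, 256) + driving frame (1, 3, 256, 256);
// input names and any extra keypoint inputs depend on how the model was exported
val shape = longArrayOf(1, 3, 256, 256)
val sourceInput = OnnxTensor.createTensor(env, FloatBuffer.wrap(sourceArray), shape)   // sourceArray: FloatArray
val drivingInput = OnnxTensor.createTensor(env, FloatBuffer.wrap(drivingArray), shape) // drivingArray: FloatArray
val result = session.run(mapOf("source" to sourceInput, "driving" to drivingInput))
// Result: deformed source with applied motion; close the tensors and result after use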
Loop over the driving frames (a pre-recorded motion clip) to get a sequence of output frames, then assemble them into a video.
Server Option: SadTalker and LivePortrait
For quality facial animation with audio (a talking head), use SadTalker: it takes a photo plus an audio track and generates a video in which the face speaks in sync with the speech. On a server with an A100, expect 30–60 seconds of processing per minute of video. The app uploads the photo and audio and gets back an mp4.
LivePortrait (2024) is a faster, higher-quality option for motion-driven portraits; its authors report about 12.8 ms per frame on an RTX 4090. Wrap it in an API with FastAPI, or use Replicate.
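// Upload photo to server for animation
// createMultipartBody and AnimationResponse (Decodable, with videoURL: URL) are defined elsewhere
func uploadPhotoForAnimation(image: UIImage, audio: URL?) async throws -> URL {
    var request = URLRequest(url: URL(string: "https://api.example.com/animate")!)
    request.httpMethod = "POST"
    // multipart/form-data: image + optional audio
    let boundary = UUID().uuidString
    // The server needs the boundary from the Content-Type header to parse the body
    request.setValue("multipart/form-data; boundary=\(boundary)", forHTTPHeaderField: "Content-Type")
    request.httpBody = createMultipartBody(image: image, audio: audio, boundary: boundary)
    let (data, response) = try await URLSession.shared.data(for: request)
    // Fail early on non-2xx responses instead of decoding an error payload
    guard let http = response as? HTTPURLResponse, (200..<300).contains(http.statusCode) else {
        throw URLError(.badServerResponse)
    }
    return try JSONDecoder().decode(AnimationResponse.self, from: data).videoURL
}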
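The upload function above relies on a `createMultipartBody` helper that is not shown. One possible shape for it is sketched below, simplified to take already-encoded `Data` so it has no UIKit dependency. The name `makeMultipartBody`, the field names `image` and `audio`, the file names, and the MIME types are all assumptions; they must match whatever the backend actually expects.

```swift
import Foundation

/// Hypothetical helper: builds a multipart/form-data body with an image part
/// and an optional audio part. Field names and MIME types are assumptions.
func makeMultipartBody(imageJPEG: Data, audio: Data?, boundary: String) -> Data {
    var body = Data()

    // Appends one file part: boundary line, part headers, blank line, payload, CRLF.
    func appendPart(name: String, filename: String, mimeType: String, payload: Data) {
        body.append(Data("--\(boundary)\r\n".utf8))
        body.append(Data("Content-Disposition: form-data; name=\"\(name)\"; filename=\"\(filename)\"\r\n".utf8))
        body.append(Data("Content-Type: \(mimeType)\r\n\r\n".utf8))
        body.append(payload)
        body.append(Data("\r\n".utf8))
    }

    appendPart(name: "image", filename: "photo.jpg", mimeType: "image/jpeg", payload: imageJPEG)
    if let audio = audio {
        appendPart(name: "audio", filename: "speech.m4a", mimeType: "audio/mp4", payload: audio)
    }

    // The closing delimiter carries two trailing hyphens.
    body.append(Data("--\(boundary)--\r\n".utf8))
    return body
}
```

A `UIImage`-based wrapper could call `image.jpegData(compressionQuality:)` and read the audio file into `Data` before delegating to this function. For large audio files, streaming the body from a temporary file with an upload task is likely a better fit than building it in memory.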
Readiness is signaled either by polling the task status or by a WebSocket notification; which one fits depends on the generation time.
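For the polling variant, a minimal sketch with a bounded number of attempts might look like the following. The `JobStatus` and `PollingError` types and the `waitForResult` function are hypothetical; the status check is injected as a closure so the loop does not assume any particular endpoint or response format.

```swift
import Foundation

/// Hypothetical job states reported by the backend.
enum JobStatus: Equatable {
    case pending
    case ready(URL)
    case failed(String)
}

enum PollingError: Error, Equatable {
    case timedOut
    case jobFailed(String)
}

/// Calls `checkStatus` until the job is ready, fails, or the attempt budget runs out.
/// Sleeps `intervalNanoseconds` between attempts, but not after the last one.
func waitForResult(maxAttempts: Int,
                   intervalNanoseconds: UInt64,
                   checkStatus: () async throws -> JobStatus) async throws -> URL {
    for attempt in 0..<maxAttempts {
        switch try await checkStatus() {
        case .ready(let url):
            return url
        case .failed(let reason):
            throw PollingError.jobFailed(reason)
        case .pending:
            if attempt < maxAttempts - 1 {
                try await Task.sleep(nanoseconds: intervalNanoseconds)
            }
        }
    }
    throw PollingError.timedOut
}
```

In an app, the closure would typically issue a GET request to a status endpoint and map the response to `JobStatus`. The `timedOut` case is the natural place to hook in a fallback such as showing the on-device preview.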
Export and Playback
The animation result is an .mp4 (H.264 or H.265). On iOS it plays via AVPlayer and is saved to Photos via PHPhotoLibrary. For a looped animation, convert it to .gif via CGImageDestination, or to a Live Photo (PHLivePhoto).
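Apple Live Photo: you need both a video file (.mov) and a photo file (.jpg) that share the same asset identifier. In the photo it is stored under key "17" of kCGImagePropertyMakerAppleDictionary; in the video it is stored as the QuickTime metadata item com.apple.quicktime.content.identifier. Without this pairing, the system Photos app does not recognize the files as a Live Photo.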
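The photo half of that pairing can be sketched with ImageIO. This is an illustration under assumptions: the function name is hypothetical, it covers only the JPEG side, and the .mov still has to be tagged with the same identifier in its QuickTime metadata (for example via AVAssetWriter) before both files are saved together.

```swift
import Foundation
import ImageIO

/// Hypothetical helper: copies a JPEG and stores `assetIdentifier` under key "17"
/// of the Apple maker-note dictionary. Returns false if reading or writing fails.
func writePhoto(withAssetIdentifier assetIdentifier: String,
                from sourceURL: URL,
                to destinationURL: URL) -> Bool {
    guard let source = CGImageSourceCreateWithURL(sourceURL as CFURL, nil),
          CGImageSourceGetCount(source) > 0,
          let destination = CGImageDestinationCreateWithURL(destinationURL as CFURL,
                                                            "public.jpeg" as CFString,
                                                            1, nil) else {
        return false
    }
    // Start from the existing metadata so EXIF and orientation are preserved.
    var properties = (CGImageSourceCopyPropertiesAtIndex(source, 0, nil) as? [CFString: Any]) ?? [:]
    properties[kCGImagePropertyMakerAppleDictionary] = ["17": assetIdentifier]
    CGImageDestinationAddImageFromSource(destination, source, 0, properties as CFDictionary)
    return CGImageDestinationFinalize(destination)
}
```

A fresh `UUID().uuidString` per Live Photo is a reasonable identifier. The tagged pair is then typically saved with `PHAssetCreationRequest`, adding the image as a `.photo` resource and the video as a `.pairedVideo` resource.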
Process
Choose the architecture (on-device vs server), prepare the model or API integration, implement the UI for choosing an animation "style", then add export and sharing. The server variant also needs a task queue, readiness status, and a timeout fallback.
Timeline Estimates
On-device landmark-based animation for one platform takes 3–4 weeks. Server integration with SadTalker/LivePortrait on both platforms requires 4–7 weeks.