Implementing AI Video Editing in a Mobile App
AI video editing — broad term breaking into separate functions: automatic pause/filler removal, background replacement, color correction by reference, smart cropping, auto-highlight cuts. Each requires different technical solution.
Auto-cut pauses and filler words
Perhaps most demanded feature for content creators. User records conversational video, AI removes pauses >0.5 sec and filler words like "um", "like", "you know".
Pipeline:
- Transcribe audio via Whisper API (OpenAI) or Deepgram
- Get JSON with timestamps per word
- Find pauses and filler-list words
- Generate FFmpeg cut-list and assemble final video
# Backend: generate FFmpeg filter from Whisper transcript
def build_cut_filter(transcript_words, pause_threshold=0.5, filler_words=None):
filler_words = filler_words or {"um", "like", "you know", "basically"}
segments_to_keep = []
prev_end = 0.0
for i, word in enumerate(transcript_words):
gap = word["start"] - prev_end
if gap > pause_threshold:
# Pause — skip
pass
if word["word"].lower().strip(".,!?") in filler_words:
continue
segments_to_keep.append((word["start"], word["end"]))
prev_end = word["end"]
# Convert to FFmpeg select/aselect filter
filter_parts = "+".join(
f"between(t,{s},{e})" for s, e in merge_segments(segments_to_keep, gap=0.05)
)
return f"select='{filter_parts}',setpts=N/FRAME_RATE/TB"
Whisper word_timestamps=True gives ±20 ms accuracy — sufficient for smooth cuts. On mobile: upload video, run backend task, get processed file. Play processed via AVPlayer/ExoPlayer.
Background replacement in video
Trickier than it sounds. Static camera — portrait segmentation via MediaPipe Selfie Segmentation (30 fps real-time on modern devices). Moving camera with objects — SAM 2 (Segment Anything Model 2) via API.
MediaPipe on Android:
val options = ImageSegmenterOptions.builder()
.setBaseOptions(BaseOptions.builder().useGpu().build())
.setOutputCategoryMask(false)
.setOutputConfidenceMasks(true)
.build()
val segmenter = ImageSegmenter.createFromOptions(context, options)
Result — confidence mask 0..1. Apply to each frame via Metal/Vulkan, replace background with static image or video. At 1080p 30fps — requires GPU rendering, CPU can't handle.
Server processing (SAM 2): best quality for complex scenes, but 2–5 minutes per video minute even on A100.
Smart crop (Auto Reframe)
Converting 16:9 to 9:16 with smart crop — object tracking task. Adobe Premiere calls it Auto Reframe. On mobile implement via:
- Face detection per keyframe (every 0.5 sec) —
VNDetectFaceRectanglesRequest - Build subject motion trajectory
- Smooth pan with EasingCurve (no harsh jump cuts)
- FFmpeg
cropfilter with dynamic params
# FFmpeg: crop with motion (x changes 0 to 540 over 10 sec)
ffmpeg -i input.mp4 \
-vf "crop=608:1080:'min(max(cx-304,0),672)':0" \
-c:v libx264 output_9x16.mp4
Where cx — subject x-coordinate from tracking data. Implementation: Python script on backend generates FFmpeg command with params, executes, returns result.
AI color correction
By reference photo — CinematicLUT via Core Image on iOS (100 ms per frame). By text description ("make it like golden hour morning") — call backend for LUT-generation via Stable Diffusion + ControlNet Color Pipeline.
Timeline
Pause/filler removal + playback UI — 5–7 days. Multi-function (reframe, background, color correction) with FFmpeg backend integration — 3–4 weeks. Performant real-time background removal — additional 2 weeks.







