Implementing AI Image Generation (Stable Diffusion) in a Mobile App
Stable Diffusion offers more control than DALL-E: negative prompts, ControlNet, LoRA, step tuning, CFG scale, SDXL vs SD 1.5. But this adds complexity: you must choose a provider (or self-host), understand the parameters that directly impact quality, and properly structure an async pipeline, since generation takes 10–30 seconds.
Integration options
Replicate — cloud inference via REST API. Supports SDXL, SD 1.5, many LoRA. Async model: POST → get prediction_id → polling or webhook for result.
FAL.ai — lower latency than Replicate, offers both sync and async modes, supports SDXL, SD3, Flux.
Stability AI API — official provider, reliable but pricier.
Self-hosting — ComfyUI or AUTOMATIC1111 on GPU server. Maximum control, no vendor lock-in, economical at scale.
For a mobile app with moderate load — Replicate or FAL, without infrastructure costs.
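Since the provider choice may change as pricing and models evolve, it can pay to hide it behind a small abstraction. A hypothetical sketch (the protocol name and shape are illustrative, not part of any SDK):

```swift
import Foundation

/// Illustrative provider-agnostic interface: swap Replicate, FAL, or a
/// self-hosted backend without touching UI code.
protocol ImageGenerationService {
    func generate(prompt: String, negativePrompt: String) async throws -> URL
}
```

Each concrete service (Replicate, FAL, self-hosted) then conforms to this protocol, and the UI layer depends only on the abstraction.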
Replicate integration (SDXL)
Replicate uses async model. First create a prediction, then poll for status:
struct Prediction: Decodable {
    let id: String
}

struct PredictionStatus: Decodable {
    let status: String
    let output: [String]?
    let error: String?
}

enum SDError: Error {
    case generationFailed(String)
    case invalidOutput
    case timeout
}

class ReplicateSDXLService {
    private let baseURL = "https://api.replicate.com/v1"
    private let modelVersion = "7762fd07cf82c948538e41f63f77d685e02b063e0ccecb39397596b78813f88f" // SDXL
    private let apiKey: String

    init(apiKey: String) {
        self.apiKey = apiKey
    }

    func generate(prompt: String, negativePrompt: String = "", steps: Int = 30) async throws -> URL {
        // 1. Create prediction
        let createBody: [String: Any] = [
            "version": modelVersion,
            "input": [
                "prompt": prompt,
                "negative_prompt": negativePrompt,
                "num_inference_steps": steps,
                "guidance_scale": 7.5,
                "width": 1024,
                "height": 1024
            ]
        ]

        var createRequest = URLRequest(url: URL(string: "\(baseURL)/predictions")!)
        createRequest.httpMethod = "POST"
        createRequest.setValue("Token \(apiKey)", forHTTPHeaderField: "Authorization")
        createRequest.setValue("application/json", forHTTPHeaderField: "Content-Type")
        createRequest.httpBody = try JSONSerialization.data(withJSONObject: createBody)

        let (createData, _) = try await URLSession.shared.data(for: createRequest)
        let prediction = try JSONDecoder().decode(Prediction.self, from: createData)

        // 2. Poll until complete
        return try await pollUntilComplete(predictionId: prediction.id)
    }

    private func pollUntilComplete(predictionId: String) async throws -> URL {
        var attempts = 0
        while attempts < 60 {
            try await Task.sleep(nanoseconds: 2_000_000_000) // 2 seconds
            let statusURL = URL(string: "\(baseURL)/predictions/\(predictionId)")!
            var request = URLRequest(url: statusURL)
            request.setValue("Token \(apiKey)", forHTTPHeaderField: "Authorization")

            let (data, _) = try await URLSession.shared.data(for: request)
            let status = try JSONDecoder().decode(PredictionStatus.self, from: data)

            switch status.status {
            case "succeeded":
                guard let first = status.output?.first, let url = URL(string: first) else {
                    throw SDError.invalidOutput
                }
                return url
            case "failed", "canceled":
                throw SDError.generationFailed(status.error ?? "Unknown error")
            default:
                attempts += 1
            }
        }
        throw SDError.timeout
    }
}
Instead of polling, you can pass a webhook in the create request ("webhook": "https://your-backend.com/webhook"), but for a mobile client, polling at 2-second intervals is simpler.
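For completeness, a sketch of a create-request body using a webhook instead of polling (field names follow Replicate's prediction API; the backend URL is a placeholder, and modelVersion/prompt come from the surrounding code):

```swift
// Replicate calls your backend when the prediction finishes,
// so the client doesn't need to poll for status.
let createBody: [String: Any] = [
    "version": modelVersion,
    "input": ["prompt": prompt],
    "webhook": "https://your-backend.com/webhook",
    "webhook_events_filter": ["completed"] // only the final event
]
```

Your backend then pushes the result to the device (e.g. via a push notification or a socket), which adds infrastructure but saves polling traffic.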
Parameters that actually impact results
num_inference_steps — number of diffusion steps. 20–30 for production (speed/quality balance). 50+ shows no noticeable improvement, just slower.
guidance_scale (CFG scale) — how strictly to follow the prompt. 7–8 for realistic images, 10–12 for stylized. >15 produces artifacts.
negative_prompt — what to exclude. Standard set: "blurry, low quality, distorted, deformed, ugly, duplicate, watermark". Not magic, but works.
For portraits: "((best quality)), detailed face, sharp focus" in positive + "bad anatomy, distorted face, extra fingers, mutation" in negative.
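These settings can be bundled into presets so the UI never deals with raw parameters. A minimal sketch; the type and preset names are illustrative, and the values follow the guidelines above:

```swift
import Foundation

/// Illustrative generation presets matching the parameter guidance above.
struct GenerationPreset {
    let steps: Int
    let guidanceScale: Double
    let positiveSuffix: String   // appended to the user's prompt
    let negativePrompt: String

    static let realistic = GenerationPreset(
        steps: 30,
        guidanceScale: 7.5,
        positiveSuffix: "",
        negativePrompt: "blurry, low quality, distorted, deformed, ugly, duplicate, watermark"
    )

    static let portrait = GenerationPreset(
        steps: 30,
        guidanceScale: 7.5,
        positiveSuffix: ", ((best quality)), detailed face, sharp focus",
        negativePrompt: "bad anatomy, distorted face, extra fingers, mutation"
    )
}
```

The service layer can then accept a preset instead of loose parameters, keeping quality tuning in one place.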
ControlNet for structure/pose-guided generation
ControlNet lets you specify image structure: body pose (OpenPose), edges (Canny), depth. This is a key difference from DALL-E:
let controlNetBody: [String: Any] = [
    "version": "...", // ControlNet SDXL version
    "input": [
        "prompt": prompt,
        "image": base64EncodedPoseImage, // OpenPose skeleton
        "controlnet_conditioning_scale": 0.8,
        "control_mode": "balanced"
    ]
]
User takes a photo or selects a pose → send it as control image → model generates character in that pose. Popular in fashion, fitness, avatar apps.
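To pass the control image in the request body, one common approach is a base64 data URI, which Replicate accepts for file inputs. A sketch, assuming the control image is available as a UIImage:

```swift
import UIKit

/// Encode a pose/control image as a data URI for the "image" input field.
/// Returns nil if PNG encoding fails.
func controlImageDataURI(from image: UIImage) -> String? {
    guard let pngData = image.pngData() else { return nil }
    return "data:image/png;base64," + pngData.base64EncodedString()
}
```

For large images, uploading to storage and passing an HTTPS URL instead keeps request bodies small.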
On-device: Core ML and ONNX
Distilled variants (SDXL-Turbo, LCM-based models) that need only 4–8 steps run on an iPhone 15 Pro via Core ML in roughly 10–15 seconds. Apple publishes converted Core ML SD models on Hugging Face.
// Core ML SD via Apple's ml-stable-diffusion Swift package
let pipeline = try StableDiffusionPipeline(
    resourcesAt: modelURL,
    controlNet: [],
    configuration: config
)
try pipeline.loadResources()

let images = try pipeline.generateImages(
    prompt: prompt,
    imageCount: 1,
    stepCount: 4, // SDXL-Turbo: 4 steps are sufficient
    seed: 42
)
Android — ONNX Runtime with mobile-optimized SD models (~400 MB). Expect 20–40 seconds on an average 2024 device; realistic only for offline scenarios.
Timeline
Replicate SDXL integration with basic UI (prompt + result) — 3–5 days. ControlNet, LoRA selection, parameters (CFG, steps), generation history, sharing — 2–3 weeks.