# Building an AI Assistant with Local On-Device Models in a Mobile Application
Running an LLM on-device is not just "works without internet." It is an architectural decision: user data never leaves the device. For medical diaries, personal notes, or corporate documents on employee devices, this is not a feature but a requirement. It is technically achievable on modern flagship hardware, but it demands careful selection of model and runtime for the target devices.
## What's Actually Runnable on a Phone in 2024–2025
The boundary of what is possible has shifted with new chips. The iPhone 15 Pro (A17 Pro, 8 GB RAM, Neural Engine rated at 35 TOPS) and the Samsung Galaxy S24 (Snapdragon 8 Gen 3, 12 GB RAM) run 3B models in INT4 quantization at 15–30 tokens/sec, which is sufficient for an assistant with streaming output.
| Device | RAM | Recommended Model | Tokens/sec |
|---|---|---|---|
| iPhone 15 Pro / 16 | 8 GB | Llama 3.2 3B Q4_K_M | 20–30 |
| iPad Pro M4 | 16 GB | Llama 3.1 8B Q4_K_M | 15–25 |
| Samsung S24 Ultra | 12 GB | Phi-3 Mini Q4 | 25–35 |
| Budget Android | 4–6 GB | Phi-3 Mini Q2 / Gemma 2B | 5–15 |
| Older devices | 3 GB | Not recommended | — |
Attempting to run a 7B+ model on a phone with 4 GB RAM is a guaranteed OOM crash.
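The RAM budget behind that warning can be sketched as weights plus KV cache plus runtime overhead. The sketch below is back-of-envelope only; the per-token KV size, layer counts, and overhead constant are illustrative assumptions, not figures from any specific runtime.

```kotlin
// Rough RAM estimate for a quantized model: weights + KV cache + runtime overhead.
// All constants are illustrative assumptions to show the shape of the calculation.
fun estimateRamGb(
    paramsBillions: Double,
    bitsPerWeight: Int,                   // 4 for INT4/Q4, 2 for Q2
    contextTokens: Int = 2048,
    kvBytesPerTokenPerLayer: Int = 4096,  // assumption: depends on hidden size and KV dtype
    layers: Int = 28,                     // assumption: ~28 for a 3B-class model
    overheadGb: Double = 0.5              // tokenizer, buffers, runtime
): Double {
    val weightsGb = paramsBillions * bitsPerWeight / 8.0  // billions of params × bytes per param
    val kvCacheGb = contextTokens.toLong() * layers * kvBytesPerTokenPerLayer / 1e9
    return weightsGb + kvCacheGb + overheadGb
}

fun main() {
    // Llama 3.2 3B at Q4: ~1.5 GB of weights plus cache and overhead, fits in 8 GB with OS headroom
    println("3B Q4: %.1f GB".format(estimateRamGb(3.0, 4)))
    // 7B at Q4 needs ~3.5 GB for weights alone, so a 4 GB phone has no room left for the OS
    println("7B Q4: %.1f GB".format(estimateRamGb(7.0, 4, layers = 32)))
}
```

The point of the exercise: weights dominate, and the KV cache grows linearly with context length, which is why long-context chats are the first thing to OOM on marginal devices.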
## iOS: Core ML and Apple MLX
Apple offers two first-party paths. Core ML is the stable one: it supports iOS 16+ and automatically schedules work onto the Neural Engine. The model is converted from PyTorch via coremltools. After converting Llama 3.2 3B to INT4, the size is about 1.8 GB.
```swift
import CoreML

class OnDeviceLLM {
    private let model: MLModel

    init() throws {
        let config = MLModelConfiguration()
        config.computeUnits = .all // CPU + GPU + Neural Engine
        // LlamaModel is the wrapper class coremltools generates for the converted model
        model = try LlamaModel(configuration: config).model
    }

    func generate(prompt: String) -> AsyncStream<String> {
        AsyncStream { continuation in
            Task.detached(priority: .userInitiated) {
                // tokenize → autoregressive decode → yield tokens
                // tokenize/detokenize/predictNextToken/eosTokenId are the app's own
                // helpers around the model's inputs and outputs
                var tokens = self.tokenize(prompt)
                for _ in 0..<512 {
                    let nextToken = self.predictNextToken(tokens)
                    tokens.append(nextToken) // feed the new token back in
                    continuation.yield(self.detokenize(nextToken))
                    if nextToken == self.eosTokenId { break }
                }
                continuation.finish()
            }
        }
    }
}
```
Apple MLX (a Swift framework, iOS 16+) offers a more convenient API and performs best on devices with unified memory. Officially converted models from Apple are available on Hugging Face.
A third path is llama.cpp: the largest selection of models in GGUF format, actively maintained by the community. Integration via a C++ bridging header is more involved than Core ML, but it gives access to any GGUF model.
## Android: TFLite, MediaPipe, ExecuTorch
Android offers more options and less of a unified standard.
MediaPipe LLM Inference API (Google) is the most mature solution for Android. It supports Gemma 2B/7B, Phi-2, and Llama 2, with models exported to TFLite format. Integration goes through the LlmInference class:
```kotlin
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/gemma-2b-it-gpu-int4.bin")
    .setMaxTokens(1024)
    .setResultListener { partialResult, done ->
        runOnUiThread { appendText(partialResult) }
    }
    .build()

val llmInference = LlmInference.createFromOptions(context, options)
llmInference.generateResponseAsync(prompt)
```
TFLite with a custom LLM runner is more flexible but requires more integration work.
ExecuTorch (Meta) is the official runtime for Llama on Android and supports Llama 3.x directly, without conversion. It is built with buck2, which is nontrivial to wire into a Gradle project.
## Model Loading and Updates
The model cannot be bundled into the app package: 1.5–3 GB leads straight to App Store rejection and scares users off with a huge APK. The correct approach: on first launch, offer to download the model, run a background download with a progress bar, then verify the checksum.
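The checksum step of that flow can be sketched as a streaming SHA-256 over the downloaded file, since a multi-gigabyte model should never be read into memory at once. This is a minimal sketch using the JDK's MessageDigest; the function names are the sketch's own.

```kotlin
import java.io.File
import java.security.MessageDigest

// Stream the downloaded model through SHA-256 in chunks and compare with
// the checksum published alongside it. Chunked reading keeps memory flat
// even for multi-gigabyte files.
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(1 shl 16) // 64 KB chunks
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            digest.update(buffer, 0, read)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

fun verifyDownload(file: File, expectedSha256: String): Boolean =
    file.exists() && sha256Hex(file).equals(expectedSha256, ignoreCase = true)
```

On a mismatch, delete the file and re-queue the download; never hand a partially written or corrupted model to the inference runtime, since the failure mode there is a crash deep inside native code rather than a clean error.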
On iOS: URLSession.downloadTask with a background configuration, so the download continues when the app is backgrounded. Store the file in Application Support, not Caches, which the system may purge when space runs low.
On Android: DownloadManager or WorkManager with a NetworkType.UNMETERED constraint, so the user doesn't burn mobile data.
Model updates: the server publishes a manifest with the current version and its SHA-256 sum. On app launch, check the manifest; if the version has changed, offer the update.
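The launch-time decision reduces to a small pure function. A hedged sketch, assuming a manifest shape of version number, download URL, and checksum; the field names and types are this sketch's invention, not a published schema.

```kotlin
// Hypothetical mirror of the server manifest, e.g.
// { "version": 3, "url": "...", "sha256": "..." } — field names are assumptions.
data class ModelManifest(val version: Int, val url: String, val sha256: String)

sealed class UpdateAction {
    object UpToDate : UpdateAction()
    data class Download(val manifest: ModelManifest) : UpdateAction()
}

// installedVersion == null means first launch: no model on disk yet.
fun decideUpdate(installedVersion: Int?, remote: ModelManifest): UpdateAction =
    when {
        installedVersion == null -> UpdateAction.Download(remote)
        remote.version > installedVersion -> UpdateAction.Download(remote)
        else -> UpdateAction.UpToDate
    }
```

Keeping the old model on disk until the new one passes its checksum lets the assistant keep working if the update download fails midway.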
## Heat and Battery
Inference on the Neural Engine/GPU heats the device. For long generations (>200 tokens), monitor thermal state via ProcessInfo.thermalState on iOS: at .serious, reduce n_threads or temporarily switch to CPU-only inference. On Android, use PowerManager.getCurrentThermalStatus() or a thermal status listener.
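That throttling policy can live in one platform-independent function. A sketch, assuming four levels that mirror iOS ProcessInfo.ThermalState and Android's THERMAL_STATUS_* constants; the specific thread counts and the GPU cutoff point are tuning assumptions, not recommendations from either platform.

```kotlin
// Platform-neutral thermal levels; map ProcessInfo.ThermalState (iOS) or
// PowerManager.THERMAL_STATUS_* (Android) onto these at the call site.
enum class ThermalLevel { NOMINAL, FAIR, SERIOUS, CRITICAL }

data class InferenceConfig(val nThreads: Int, val useGpu: Boolean, val maxNewTokens: Int)

// Assumed policy: back off GPU and threads as the device heats up,
// and stop generating entirely at the critical level.
fun configFor(level: ThermalLevel, baseThreads: Int = 6): InferenceConfig = when (level) {
    ThermalLevel.NOMINAL  -> InferenceConfig(baseThreads, useGpu = true, maxNewTokens = 512)
    ThermalLevel.FAIR     -> InferenceConfig(baseThreads, useGpu = true, maxNewTokens = 256)
    ThermalLevel.SERIOUS  -> InferenceConfig(baseThreads / 2, useGpu = false, maxNewTokens = 128)
    ThermalLevel.CRITICAL -> InferenceConfig(1, useGpu = false, maxNewTokens = 0) // pause generation
}
```

Re-query the thermal level between requests rather than per token: the state changes on the order of seconds, and checking it inside the decode loop adds overhead for nothing.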
Battery consumption: a 3B model on iPhone 15 Pro uses roughly 5–8% of the battery per 100 requests. There is no need to warn users; that is less than video streaming.
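To see why no warning is needed, the 5–8% figure can be turned into requests per full charge with one line of arithmetic:

```kotlin
// Convert "X% of battery per 100 requests" into requests per full charge.
fun requestsPerCharge(percentPer100Requests: Double): Int =
    (100.0 / (percentPer100Requests / 100.0)).toInt()

fun main() {
    println(requestsPerCharge(5.0)) // optimistic end of the range → 2000
    println(requestsPerCharge(8.0)) // pessimistic end → 1250
}
```

Even the pessimistic end gives well over a thousand assistant requests on a single charge, far beyond any realistic daily usage.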
## Timeline Estimates
A basic on-device assistant (one platform, Core ML or MediaPipe): 4–5 weeks. Cross-platform with download management, updates, and thermal monitoring: 8–12 weeks.