On-Device LLM Integration (Llama.cpp) for Offline AI Assistant in Mobile App
Running a language model directly on a smartphone without an internet connection is now a reality. Llama.cpp provides working CPU inference with optional Metal/Vulkan acceleration. The main question isn't "can we?" but "which model and quantization should we choose so the device doesn't overheat after 5 minutes?"
Models and Their Real Requirements
Llama.cpp works with models in GGUF format. Popular options for mobile (speed is in tokens per second, t/s):
| Model | Quantization | File size | RAM at runtime | Speed (iPhone 14) |
|---|---|---|---|---|
| Llama-3.2-1B | Q4_K_M | 0.8 GB | ~1.2 GB | 25–35 t/s |
| Llama-3.2-3B | Q4_K_M | 2.0 GB | ~2.5 GB | 10–15 t/s |
| Phi-3-mini-4k | Q4_K_M | 2.2 GB | ~2.8 GB | 8–12 t/s |
| Gemma-2-2B | Q4_K_M | 1.6 GB | ~2.0 GB | 12–18 t/s |
| Qwen2.5-1.5B | Q4_K_M | 1.0 GB | ~1.4 GB | 20–28 t/s |
On an iPhone SE (2nd generation) with 3 GB of RAM, Llama-3.2-3B Q4 runs at the limit: an out-of-memory (OOM) kill is possible with long contexts. For broad device coverage, the safe choice is a model file of up to 1.5–2 GB.
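The RAM figures above can be sanity-checked with simple arithmetic: the weights take roughly the file size, and the KV cache adds 2 (K and V) × layers × context length × KV heads × head dimension × bytes per element. The sketch below is a rough lower bound only; the function names are hypothetical, it ignores compute buffers and allocator overhead, and the hyperparameters in the usage note are assumptions to check against the model's own metadata.

```c
#include <stdint.h>

/* Bytes needed for the KV cache: one K and one V tensor per layer,
   each holding n_ctx * n_kv_heads * head_dim elements. */
uint64_t kv_cache_bytes(uint32_t n_layers, uint32_t n_ctx,
                        uint32_t n_kv_heads, uint32_t head_dim,
                        uint32_t bytes_per_elem) {
    return 2ull * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem;
}

/* Rough lower bound: weights (about the GGUF file size) plus KV cache. */
uint64_t estimate_ram_bytes(uint64_t model_file_bytes, uint64_t kv_bytes) {
    return model_file_bytes + kv_bytes;
}

/* Returns 1 if the estimate fits in the budget, 0 otherwise.
   The budget is an app-level choice (assumption), e.g. half of physical RAM. */
int fits_in_budget(uint64_t estimate_bytes, uint64_t budget_bytes) {
    return estimate_bytes <= budget_bytes;
}
```

Assuming 16 layers, 8 KV heads and a head dimension of 64 for Llama-3.2-1B, kv_cache_bytes(16, 2048, 8, 64, 2) gives 67,108,864 bytes, so a 16-bit cache at n_ctx = 2048 costs about 64 MiB on top of the weights.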
Building Llama.cpp for iOS
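# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Configure with CMake for iOS. A comment cannot follow a trailing backslash,
# so the flags are explained here instead of inline:
#   CMAKE_TOOLCHAIN_FILE  iOS CMake toolchain file; PLATFORM is an option it defines
#   PLATFORM=OS64         arm64 only
#   LLAMA_METAL=ON        Metal GPU acceleration (newer llama.cpp releases name this GGML_METAL)
#   LLAMA_STATIC=ON       static build
cmake -B build-ios \
  -DCMAKE_TOOLCHAIN_FILE=ios.toolchain.cmake \
  -DPLATFORM=OS64 \
  -DLLAMA_METAL=ON \
  -DLLAMA_STATIC=ON
cmake --build build-ios --config Release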
The result is the static library libllama.a. Wrap it in a Swift package with a C bridging header:
// llama_bridge.h
#include "llama.h"
// Wrappers for Swift-friendly API
void* llama_create_context(const char* model_path, int n_ctx, int n_gpu_layers);
const char* llama_generate_token(void* ctx, const char* prompt);
void llama_free_context(void* ctx);
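n_gpu_layers is the number of layers offloaded to the Metal GPU; here -1 means all layers on the GPU. Whether the llama.cpp C API itself treats a negative value that way depends on the version, so the wrapper can map -1 to a value at least as large as the model's layer count. On an iPhone 14 with 6 GB of unified memory, offload everything. On 3 GB devices, experiment: too many layers on the GPU trigger OOM.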
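For the 3 GB case, a starting point for that experiment can be computed instead of guessed. The sketch below is a hypothetical heuristic, not part of llama.cpp: choose_gpu_layers and the idea of a "GPU budget" are assumptions, and it treats all layers as equal in size, which is only approximately true.

```c
#include <stdint.h>

/* Hypothetical heuristic: offload as many layers as fit into the share of
   memory the app is willing to spend on GPU-resident weights.
   Assumes every layer takes about model_bytes / n_layers. */
int32_t choose_gpu_layers(uint64_t model_bytes, int32_t n_layers,
                          uint64_t gpu_budget_bytes) {
    if (n_layers <= 0 || model_bytes == 0) return 0;
    if (model_bytes <= gpu_budget_bytes) return n_layers;  /* everything fits */
    uint64_t per_layer = model_bytes / (uint64_t)n_layers;
    if (per_layer == 0) return n_layers;
    uint64_t fit = gpu_budget_bytes / per_layer;
    return fit > (uint64_t)n_layers ? n_layers : (int32_t)fit;
}
```

The returned count is only a first value to try; the real limit still has to be found by testing on the device, because the KV cache and compute buffers share the same memory.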
Swift Wrapper for Token Streaming
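import Foundation
// Assumes llama.h is visible to Swift through the bridging header, at a llama.cpp
// version that still ships llama_sample_token_greedy. llama_batch_add is the small
// Swift helper from llama.cpp's llama.swiftui example, not a llama.h function.
enum LlamaError: Error { case modelLoadFailed, contextInitFailed, notLoaded, promptTooLong, decodeFailed }

actor LlamaSession {
    private var context: OpaquePointer?
    private var model: OpaquePointer?
    private let batchSize: Int32 = 512

    func load(modelPath: String, contextSize: Int32 = 2048, gpuLayers: Int32 = -1) throws {
        var params = llama_model_default_params()
        // Negative means "all layers"; 999 exceeds the layer count of the models above
        params.n_gpu_layers = gpuLayers < 0 ? 999 : gpuLayers
        model = llama_load_model_from_file(modelPath, params)
        guard model != nil else { throw LlamaError.modelLoadFailed }
        var ctxParams = llama_context_default_params()
        ctxParams.n_ctx = UInt32(contextSize)
        ctxParams.n_batch = UInt32(batchSize)
        context = llama_new_context_with_model(model, ctxParams)
        guard context != nil else { throw LlamaError.contextInitFailed }
    }

    func generate(prompt: String, maxTokens: Int32 = 256) -> AsyncThrowingStream<String, Error> {
        // Copy the pointers out of actor-isolated state before leaving the actor
        let (model, context, capacity) = (self.model, self.context, self.batchSize)
        return AsyncThrowingStream { continuation in
            let task = Task.detached(priority: .userInitiated) {
                guard let model, let context else { continuation.finish(throwing: LlamaError.notLoaded); return }
                // Tokenization; a non-positive count means the prompt did not fit into one batch
                var tokens = [llama_token](repeating: 0, count: Int(capacity))
                let nTokens = llama_tokenize(model, prompt, Int32(prompt.utf8.count),
                                             &tokens, capacity, true, false)
                guard nTokens > 0 else { continuation.finish(throwing: LlamaError.promptTooLong); return }
                // The prompt is decoded as one batch; only its last token needs logits
                var batch = llama_batch_init(capacity, 0, 1)
                defer { llama_batch_free(batch) }
                for i in 0..<nTokens {
                    llama_batch_add(&batch, tokens[Int(i)], llama_pos(i), [0], i == nTokens - 1)
                }
                var position = nTokens
                let limit = min(nTokens + maxTokens, Int32(llama_n_ctx(context)))
                // Inference: one token at a time
                while position < limit && !Task.isCancelled {
                    guard llama_decode(context, batch) == 0 else { continuation.finish(throwing: LlamaError.decodeFailed); return }
                    // Greedy sampling over the logits of the last decoded token
                    let logits = llama_get_logits_ith(context, batch.n_tokens - 1)!
                    var candidates = (0..<llama_n_vocab(model)).map { llama_token_data(id: $0, logit: logits[Int($0)], p: 0) }
                    let nextToken = candidates.withUnsafeMutableBufferPointer { buffer -> llama_token in
                        var array = llama_token_data_array(data: buffer.baseAddress, size: buffer.count, sorted: false)
                        return llama_sample_token_greedy(context, &array)
                    }
                    if nextToken == llama_token_eos(model) { break }
                    // Convert token to string; the piece is not null-terminated, so use the returned length
                    var buf = [CChar](repeating: 0, count: 64)
                    let n = llama_token_to_piece(model, nextToken, &buf, 64, 0, true)
                    if n > 0 { continuation.yield(String(decoding: buf.prefix(Int(n)).map { UInt8(bitPattern: $0) }, as: UTF8.self)) }
                    // The next step decodes only the token that was just sampled
                    batch.n_tokens = 0
                    llama_batch_add(&batch, nextToken, position, [0], true)
                    position += 1
                }
                continuation.finish()
            }
            continuation.onTermination = { _ in task.cancel() }
        }
    }
}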
Streaming tokens through AsyncThrowingStream lets users see text as it is generated instead of waiting for the complete response. This is critical for UX: even 10 tokens per second feels natural when text appears progressively.
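One detail matters when streaming piece by piece: a single token can end in the middle of a multi-byte UTF-8 character, so converting each piece to a string on its own can produce broken characters for emoji or non-Latin text. A minimal sketch of one way to handle this, assuming the wrapper accumulates raw bytes and only emits the complete part; utf8_complete_prefix is a hypothetical helper, not a llama.cpp function.

```c
#include <stddef.h>

/* Returns how many leading bytes of buf form complete UTF-8 sequences.
   Bytes of an unfinished trailing character are excluded so the caller can
   keep them and prepend them to the next token piece.
   Only sequence length is checked, not full UTF-8 validity. */
size_t utf8_complete_prefix(const unsigned char *buf, size_t len) {
    size_t i = len;
    size_t back = 0;
    /* Step back over at most 3 continuation bytes (10xxxxxx). */
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80) { i--; back++; }
    if (i == 0) return len;            /* no lead byte found: pass through */
    unsigned char lead = buf[i - 1];
    size_t need;
    if      ((lead & 0x80) == 0x00) need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return len;                   /* invalid lead byte: pass through */
    size_t have = len - (i - 1);       /* bytes available from the lead byte on */
    return have >= need ? len : i - 1;
}
```

The caller would emit the first utf8_complete_prefix(buf, len) bytes to the UI and carry the remainder over to the next iteration.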
Android: Llama.cpp via NDK
# CMakeLists.txt in jni/
add_library(llama_jni SHARED llama_jni.cpp)
target_link_libraries(llama_jni llama ggml)
// Kotlin side
class LlamaEngine {
init { System.loadLibrary("llama_jni") }
external fun loadModel(modelPath: String, nGpuLayers: Int): Long // returns handle
external fun generateNext(handle: Long, tokens: IntArray): String
external fun freeModel(handle: Long)
}
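On Android, the Vulkan backend replaces Metal: enable LLAMA_VULKAN=ON in CMakeLists (newer llama.cpp releases name the option GGML_VULKAN). It is supported on devices with Vulkan 1.1 or later, which covers essentially all devices running Android 10+.
The Android challenge: native allocations are not capped by a fixed per-process pool, so an oversized model does not fail with a catchable error. When RAM runs short, the system may simply kill the app (SIGKILL) without warning. ComponentCallbacks2.onTrimMemory(TRIM_MEMORY_RUNNING_CRITICAL) is the last chance to free the context before the process is terminated.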
Model Download: Progress and Verification
GGUF files weigh 1–4 GB. Download via URLSession (iOS) or WorkManager with DownloadManager (Android):
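import Foundation
import CryptoKit

// iOS: background URLSession so the download continues while the app is suspended
let config = URLSessionConfiguration.background(withIdentifier: "model-download")
config.isDiscretionary = false
let session = URLSession(configuration: config, delegate: self, delegateQueue: nil)
let task = session.downloadTask(with: modelURL)
task.resume()

// SHA256 verification after download, hashed in 1 MB chunks so that a
// multi-gigabyte file is never held in memory at once
func verify(fileURL: URL, expectedHash: String) -> Bool {
    guard let handle = try? FileHandle(forReadingFrom: fileURL) else { return false }
    defer { try? handle.close() }
    var hasher = SHA256()
    while let chunk = try? handle.read(upToCount: 1 << 20), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    let digest = hasher.finalize().map { String(format: "%02x", $0) }.joined()
    return digest == expectedHash.lowercased()
}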
Model SHA256 hashes are published in Hugging Face repositories; verify the file before loading it into memory. A corrupt GGUF causes crashes during header parsing or later during inference, so catch the problem at verification.
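A cheap complement to the hash check is to look at the first bytes of the file before handing it to llama.cpp: a GGUF file starts with the 4-byte magic "GGUF" followed by a little-endian 32-bit version. This catches truncated downloads and saved error pages early, but it does not replace SHA256 verification. The sketch below is an illustration; gguf_header_ok and its max_version parameter are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if buf starts with a plausible GGUF header, 0 otherwise.
   Checks the 4-byte magic "GGUF" and the little-endian uint32 version that
   follows it; max_version is the newest version the app accepts. */
int gguf_header_ok(const unsigned char *buf, size_t len, uint32_t max_version) {
    if (buf == NULL || len < 8) return 0;
    if (memcmp(buf, "GGUF", 4) != 0) return 0;
    uint32_t version = (uint32_t)buf[4] | (uint32_t)buf[5] << 8 |
                       (uint32_t)buf[6] << 16 | (uint32_t)buf[7] << 24;
    return version >= 1 && version <= max_version;
}
```

In practice the first 8 bytes would be read from the downloaded file and passed in, with max_version set to whatever the linked llama.cpp build supports.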
Thermal Constraints
Extended generation with llama.cpp heats up an iPhone. When the device overheats, iOS throttles the clock frequency and generation speed drops from 25 t/s to 8–10 t/s. This is expected system behavior.
Practical solution: limit the maximum context (n_ctx) to 1024–2048 for short sessions, pause between requests, and monitor ProcessInfo.processInfo.thermalState on iOS:
NotificationCenter.default.addObserver(forName: ProcessInfo.thermalStateDidChangeNotification, ...) { _ in
let state = ProcessInfo.processInfo.thermalState
if state == .critical || state == .serious {
// Pause generation, notify user
}
}
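The decision behind that observer can be kept in one small, testable policy function shared by both platforms. The sketch below mirrors the four thermal levels iOS reports and pauses on serious and critical, as above; the extra delay on the fair level and the context caps are illustrative assumptions to tune during thermal stress testing, and all names are hypothetical.

```c
/* Mirrors the four levels of ProcessInfo.ThermalState; the numeric values
   are this sketch's own, not Apple's. */
typedef enum {
    THERMAL_NOMINAL,
    THERMAL_FAIR,
    THERMAL_SERIOUS,
    THERMAL_CRITICAL
} thermal_state;

/* Delay to insert between tokens, in milliseconds; -1 means pause generation
   and notify the user. The 20 ms value for FAIR is an assumption. */
int token_delay_ms(thermal_state state) {
    switch (state) {
        case THERMAL_NOMINAL:  return 0;
        case THERMAL_FAIR:     return 20;
        case THERMAL_SERIOUS:  return -1;
        case THERMAL_CRITICAL: return -1;
    }
    return -1;  /* unknown state: be conservative */
}

/* Context length to request for the next session: capped at 2048 normally
   and at 1024 once the device is no longer at the nominal level. */
int next_n_ctx(thermal_state state, int preferred_n_ctx) {
    int cap = (state == THERMAL_NOMINAL) ? 2048 : 1024;
    return preferred_n_ctx < cap ? preferred_n_ctx : cap;
}
```

The generation loop would call token_delay_ms before each token and stop when it returns -1, while next_n_ctx is consulted only when a new context is created.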
Process
Model selection for target devices → llama.cpp build for platforms → Swift/Kotlin wrapper with async streaming → download and verification → chat UI → thermal stress testing.
Timeline Estimates
Single platform, basic chat UI with selected model — 3–5 weeks. Both platforms, multiple model choices, background download, context management — 7–12 weeks.







