On-Device LLM Integration (MLC LLM) for Offline AI Assistant in Mobile App
MLC LLM (Machine Learning Compilation LLM) is a project from the TVM team that compiles language models directly for specific hardware targets. Unlike llama.cpp, which runs models through a general-purpose C++ backend, MLC generates GPU code at compile time: Metal for iPhone, and OpenCL for Android in the project's standard mobile builds. This delivers noticeable speed gains, especially on Apple Silicon.
How MLC Differs from Llama.cpp
Llama.cpp evaluates the GGUF model's compute graph at runtime and reaches Metal through generic, model-independent kernels. MLC LLM uses AOT (ahead-of-time) compilation: its Python tooling generates GPU kernels (Metal, OpenCL, or Vulkan, depending on the target) specific to each model and device. The cost is a longer preparation step; the payoff is more efficient kernels.
On iPhone 14 Pro with Llama-3.2-3B Q4: llama.cpp reaches 10–14 t/s, MLC LLM reaches 16–22 t/s. At the midpoints of the two ranges (12 and 19 t/s), that is about 1.6 times faster.
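To make those figures concrete, the sketch below converts decode throughput into wait time for one reply. The 200-token reply length is an assumption chosen for illustration, and the helper name is hypothetical; the rates are the low ends of the ranges quoted above.

```swift
/// Seconds needed to generate `tokens` tokens at a steady decode rate.
/// Ignores prompt processing (prefill), which adds a delay before the first token.
func secondsToGenerate(tokens: Int, tokensPerSecond: Double) -> Double {
    precondition(tokensPerSecond > 0, "rate must be positive")
    return Double(tokens) / tokensPerSecond
}

// Assumed reply length, for illustration only.
let replyTokens = 200

// Low end of each range quoted above.
let llamaCppSeconds = secondsToGenerate(tokens: replyTokens, tokensPerSecond: 10)
let mlcSeconds = secondsToGenerate(tokens: replyTokens, tokensPerSecond: 16)

print("llama.cpp: \(llamaCppSeconds) s, MLC LLM: \(mlcSeconds) s")
```

Under these assumptions the same reply takes 20 seconds with llama.cpp and 12.5 seconds with MLC LLM, a gap a user will feel in a chat UI.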
Compiling Models for iOS
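# Install mlc-llm (the official docs install prebuilt wheels from the
# project's own wheel index; use that route if this command fails)
pip install mlc-llm

# Convert and quantize the weights
mlc_llm convert_weight \
  ./Llama-3.2-3B-Instruct/ \
  --quantization q4f16_1 \
  --output mlc-llm-weights/

# Generate the chat config. The runtime expects mlc-chat-config.json next to
# the weight shards, so merge the two output directories before shipping.
mlc_llm gen_config \
  ./Llama-3.2-3B-Instruct/ \
  --quantization q4f16_1 \
  --conv-template llama-3 \
  --output mlc-llm-config/

# Compile the model library for iPhone (Metal)
mlc_llm compile \
  mlc-llm-config/mlc-chat-config.json \
  --device iphone \
  --output dist/libs/Llama-3.2-3B-Instruct-q4f16_1-iphone.tar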
The result is an archive containing the compiled model library and its Metal kernels. Add it to the Xcode project.
For Android, use the same process with --device android:
mlc_llm compile \
  mlc-llm-config/mlc-chat-config.json \
  --device android \
  --output dist/libs/Llama-3.2-3B-Instruct-q4f16_1-android.tar
iOS SDK: Swift Integration
MLC LLM provides an official Swift package, MLCSwift:
import MLCSwift

// Note: method signatures differ between MLCSwift releases; check this
// against the version you pin.

// Initialize engine
let engine = MLCEngine()

// Load model (asynchronously)
try await engine.reload(
    modelPath: Bundle.main.path(forResource: "Llama-3.2-3B", ofType: nil)!,
    modelLib: "Llama-3.2-3B-Instruct-q4f16_1-iphone" // model library name, without extension
)

// Stream the reply via async/await
let messages: [ChatCompletionMessage] = [
    .init(role: .system, content: "You are a helpful assistant."),
    .init(role: .user, content: "Explain what RAG is in machine learning")
]
let request = ChatCompletionRequest(messages: messages, stream: true)

for await chunk in try await engine.chat.completions.create(request) {
    if let delta = chunk.choices.first?.delta.content {
        // Append the delta to the UI in real time
        await MainActor.run { self.responseText += delta }
    }
}
The API closely mirrors the OpenAI Chat Completions API, which simplifies code reuse between server and on-device implementations.
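One way to exploit that similarity is to keep the conversation in a single app-level type and feed it to either path. The sketch below is an illustration under assumptions: `ChatMessage` and `ChatRequestBody` are hypothetical app types (not part of MLCSwift), their field names follow the OpenAI Chat Completions JSON shape, and "example-model" is a placeholder identifier.

```swift
import Foundation

// Hypothetical app-level message type shared by both code paths.
struct ChatMessage: Codable, Equatable {
    enum Role: String, Codable { case system, user, assistant }
    let role: Role
    let content: String
}

// Hypothetical request body for the server path.
struct ChatRequestBody: Codable, Equatable {
    let model: String
    let messages: [ChatMessage]
    let stream: Bool
}

let conversation = [
    ChatMessage(role: .system, content: "You are a helpful assistant."),
    ChatMessage(role: .user, content: "Explain what RAG is in machine learning")
]

// Server path: serialize the conversation into a JSON request body.
let body = ChatRequestBody(model: "example-model", messages: conversation, stream: true)
let encoder = JSONEncoder()
encoder.outputFormatting = [.sortedKeys] // stable key order, convenient for logs and tests
let json = String(data: try! encoder.encode(body), encoding: .utf8)!
print(json)

// On-device path: the same `conversation` array would be mapped onto the
// engine's own message type before the call (mapping not shown here).
```

Keeping the shared type independent of both SDKs means the chat UI and history storage do not change when a request is routed to the server instead of the local engine.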
Android SDK: Kotlin Integration
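import android.app.Application
import androidx.lifecycle.AndroidViewModel
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn
import ai.mlc.mlcllm.MLCEngine
// ChatCompletionMessage, MessageRole, and ChatCompletionRequest also come from the
// MLC Android library; their import paths depend on the release you pin.

class LLMViewModel(application: Application) : AndroidViewModel(application) {
    private val engine = MLCEngine()

    suspend fun loadModel(modelPath: String, modelLib: String) {
        engine.reload(modelPath, modelLib)
    }

    // Emits the reply piece by piece as the engine streams it
    fun chat(userMessage: String): Flow<String> = flow {
        val messages = listOf(
            ChatCompletionMessage(role = MessageRole.user, content = userMessage)
        )
        val request = ChatCompletionRequest(messages = messages, stream = true)
        engine.chat.completions.create(request).collect { chunk ->
            chunk.choices.firstOrNull()?.delta?.content?.let { delta ->
                emit(delta)
            }
        }
    }.flowOn(Dispatchers.IO) // run inference off the main thread
}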
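flowOn(Dispatchers.IO) moves inference off the main thread, which must never block. The UI subscribes to the Flow via collectAsState() in Compose, or collects it inside repeatOnLifecycle in a Fragment (launchWhenResumed is deprecated in current AndroidX Lifecycle releases).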
Model Memory Management
Keep only one model in memory at a time; on mobile this is a hard rule. To unload:
await engine.unload()
// Explicit unloading releases Metal buffers and GPU memory
// After this, another model can be loaded
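iPhones use a unified memory architecture, so Metal buffers are allocated from the same physical RAM as the rest of the system and compete with other apps for it. If the user switches to a heavy app (game, camera), iOS may reclaim that memory, typically by terminating the backgrounded app, and the model must be reloaded on return.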
// Reload after the model's Metal resources have been lost
NotificationCenter.default.addObserver(
    forName: .MLCEngineModelUnloaded, // illustrative name; substitute your own notification or detection mechanism
    object: nil, queue: .main
) { [weak self] _ in
    Task { try await self?.engine.reload(...) }
}
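The one-model rule follows from simple arithmetic. The sketch below estimates the footprint of quantized weights plus the key/value cache. Every number in it is an assumption for illustration (3 billion parameters, about 4.5 bits per weight for 4-bit quantization with per-group scales, 28 layers, 8 KV heads of dimension 128, a 4096-token context, an fp16 cache); read the real values from the model's config before relying on the result.

```swift
import Foundation

/// Approximate bytes occupied by quantized weights.
func weightBytes(parameters: Double, bitsPerWeight: Double) -> Double {
    parameters * bitsPerWeight / 8.0
}

/// Bytes for the key/value cache: one K and one V vector
/// per layer, per KV head, per token in the context window.
func kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int,
                  contextTokens: Int, bytesPerElement: Int) -> Int {
    2 * layers * kvHeads * headDim * contextTokens * bytesPerElement
}

// Assumed values, for illustration only.
let weights = weightBytes(parameters: 3.0e9, bitsPerWeight: 4.5)
let cache = kvCacheBytes(layers: 28, kvHeads: 8, headDim: 128,
                         contextTokens: 4096, bytesPerElement: 2)

let totalGiB = (weights + Double(cache)) / Double(1 << 30)
print(String(format: "weights + KV cache: about %.2f GiB", totalGiB))
```

Under these assumptions a single model already needs roughly 2 GiB before activations and the app's own memory are counted, which leaves no room for a second resident model on most phones.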
Model Download and Management
Model weights are not embedded in the app bundle: the App Store caps app size at 4 GB, and the weights alone can take 2–4 GB. Download them on first launch or on demand:
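import Foundation

// Download via URLSession (async; for transfers that must survive app
// suspension, use a background URLSessionConfiguration instead)
func downloadModel(from url: URL, modelName: String) async throws {
    let destinationURL = Self.modelsDirectory.appendingPathComponent(modelName)
    guard !FileManager.default.fileExists(atPath: destinationURL.path) else { return }

    // Create the models directory on first use; moveItem fails if it is missing
    try FileManager.default.createDirectory(
        at: Self.modelsDirectory, withIntermediateDirectories: true)

    let (tempURL, _) = try await URLSession.shared.download(from: url)
    try FileManager.default.moveItem(at: tempURL, to: destinationURL)
}

static var modelsDirectory: URL {
    FileManager.default.urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]
        .appendingPathComponent("MLCModels")
}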
applicationSupportDirectory is the proper location for large app data. Documents is the wrong place, because its contents can be exposed to users in Files.app.
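With multi-gigabyte files it is worth checking a download before moving it into place, so that a truncated file or an error page never ends up being loaded as weights. The sketch below is a minimal illustration, not part of any SDK: `installDownload` and `ModelInstallError` are hypothetical names, and `expectedBytes` is assumed to come from your own manifest or the response's Content-Length header.

```swift
import Foundation

enum ModelInstallError: Error {
    case sizeMismatch(expected: Int64, actual: Int64)
}

/// Moves a finished download into place only if its size matches the expected value.
func installDownload(from tempURL: URL, to destinationURL: URL, expectedBytes: Int64) throws {
    let fm = FileManager.default
    let attributes = try fm.attributesOfItem(atPath: tempURL.path)
    let actual = (attributes[.size] as? NSNumber)?.int64Value ?? -1
    guard actual == expectedBytes else {
        throw ModelInstallError.sizeMismatch(expected: expectedBytes, actual: actual)
    }
    // The destination directory may not exist on first launch.
    try fm.createDirectory(at: destinationURL.deletingLastPathComponent(),
                           withIntermediateDirectories: true)
    // Replace any stale copy left by an earlier attempt.
    if fm.fileExists(atPath: destinationURL.path) {
        try fm.removeItem(at: destinationURL)
    }
    try fm.moveItem(at: tempURL, to: destinationURL)
}

// Demo with a small stand-in file in the temporary directory.
let tmp = FileManager.default.temporaryDirectory
let source = tmp.appendingPathComponent(UUID().uuidString)
let target = tmp.appendingPathComponent(UUID().uuidString)
    .appendingPathComponent("weights.bin")
try! Data([1, 2, 3, 4, 5]).write(to: source)
try! installDownload(from: source, to: target, expectedBytes: 5)
print(FileManager.default.fileExists(atPath: target.path))
```

A size check is the cheapest guard; a cryptographic hash compared against a published value would be stronger, at the cost of reading the whole file once more.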
When to Use MLC vs Llama.cpp
MLC LLM is preferred when: maximum speed on specific devices matters; the target devices are known in advance (so you can compile for their architectures); the models are official Hugging Face releases (Llama, Phi, Gemma, Mistral).
Llama.cpp is preferred when: quantization flexibility is needed; models arrive in GGUF from partners; legacy device support is required; custom sampling is needed (beam search, specific temperature parameters).
Process
Model selection for task and device range → compilation for iOS/Android targets → SDK integration → chat UI implementation with async streaming → weight download pipeline → thermal constraint testing.
Timeline Estimates
Single platform, single model, basic chat — 3–5 weeks. Both platforms, multiple model switching, weight management system — 7–11 weeks.







