NOVASOLUTIONS.TECHNOLOGY is engaged in the development, support and maintenance of iOS, Android, PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular markets like Google Play, App Store, Amazon, AppGallery and others.

On-Device LLM Integration (MLC LLM) for Offline AI Assistant in Mobile App

MLC LLM (Machine Learning Compilation LLM) is a project from the TVM team that compiles language models for specific hardware targets. Unlike llama.cpp, which runs models through a general-purpose C/C++ runtime, MLC generates optimized GPU kernels at compile time: Metal for iPhone and, with the default Android target, OpenCL. This can deliver noticeable speed gains, especially on recent Apple chips.

How MLC Differs from Llama.cpp

llama.cpp loads a GGUF file and builds its compute graph at runtime, dispatching to prewritten generic kernels (Metal on iOS). MLC LLM uses ahead-of-time (AOT) compilation: a Python toolchain generates GPU kernels specialized for each model and target. Preparation takes longer, but the resulting kernels are more efficient.

On an iPhone 14 Pro with Llama-3.2-3B at 4-bit quantization, llama.cpp reaches 10–14 tokens/s and MLC LLM reaches 16–22 tokens/s, roughly 1.6x faster at the midpoints of those ranges.
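To make the throughput figures concrete, the sketch below converts them into waiting time for a single reply. The 200-token reply length is a hypothetical value chosen only for illustration, and prompt processing time is ignored.

```swift
import Foundation

/// Seconds needed to generate `tokens` output tokens at a steady decode rate.
/// Ignores prompt processing (prefill) time, which adds to the total.
func generationSeconds(tokens: Int, tokensPerSecond: Double) -> Double {
    precondition(tokensPerSecond > 0, "throughput must be positive")
    return Double(tokens) / tokensPerSecond
}

// Throughput ranges quoted above, in tokens per second.
let llamaCpp = (low: 10.0, high: 14.0)
let mlc = (low: 16.0, high: 22.0)

// Hypothetical reply length, chosen only for illustration.
let replyTokens = 200

let llamaCppWorst = generationSeconds(tokens: replyTokens, tokensPerSecond: llamaCpp.low)
let mlcWorst = generationSeconds(tokens: replyTokens, tokensPerSecond: mlc.low)
print(String(format: "llama.cpp worst case: %.1f s", llamaCppWorst))
print(String(format: "MLC LLM worst case:   %.1f s", mlcWorst))
```

For a chat UI, the difference shows up as how long the user watches a reply stream in, so it is worth measuring on the actual target devices rather than relying on quoted ranges.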

Compiling Models for iOS

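# Install mlc-llm (the project also distributes prebuilt wheels from its own index;
# if this plain install fails, follow the official install guide)
pip install mlc-llm

# Convert weights, generate the chat config, then compile the model library for iPhone (Metal).
# At runtime the model directory must contain both the converted weights and mlc-chat-config.json.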
mlc_llm convert_weight \
    ./Llama-3.2-3B-Instruct/ \
    --quantization q4f16_1 \
    --output mlc-llm-weights/

mlc_llm gen_config \
    ./Llama-3.2-3B-Instruct/ \
    --quantization q4f16_1 \
    --conv-template llama-3 \
    --output mlc-llm-config/

mlc_llm compile \
    mlc-llm-config/mlc-chat-config.json \
    --device iphone \
    --output dist/libs/Llama-3.2-3B-Instruct-q4f16_1-iphone.tar

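The result is a .tar archive containing the compiled model library with its Metal kernels. On iOS it is linked into the Xcode project statically rather than loaded as a .dylib.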

For Android, use the same process with --device android:

mlc_llm compile \
    mlc-llm-config/mlc-chat-config.json \
    --device android \
    --output dist/libs/Llama-3.2-3B-Instruct-q4f16_1-android.tar
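The quantization mode chosen above largely determines how big the converted weights are. The following is a rough back-of-the-envelope estimate, not a measurement. It assumes about 3.21 billion parameters for Llama-3.2-3B and about 4.5 bits per weight for q4f16_1 (4-bit values plus one 16-bit scale per group of 32 weights). Both figures are assumptions, and tensors that are stored unquantized would push the real size higher.

```swift
import Foundation

/// Approximate size in bytes of quantized weights.
/// `bitsPerWeight` should include the per-group scale overhead.
func estimatedWeightBytes(parameterCount: Double, bitsPerWeight: Double) -> Double {
    parameterCount * bitsPerWeight / 8.0
}

// Assumptions, not values from the article:
// about 3.21 billion parameters, and 4-bit values plus one
// 16-bit scale shared by each group of 32 weights.
let parameters = 3.21e9
let bitsPerWeight = 4.0 + 16.0 / 32.0

let bytes = estimatedWeightBytes(parameterCount: parameters, bitsPerWeight: bitsPerWeight)
print(String(format: "Estimated weights: %.2f GB", bytes / 1e9))
```

An estimate like this is useful early on for deciding which devices can realistically hold the model, before any compilation is done.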

iOS SDK: Swift Integration

MLC LLM provides an official Swift package, MLCSwift:

import MLCSwift

// Initialize engine
let engine = MLCEngine()

// Load model (asynchronously)
try await engine.reload(
    modelPath: Bundle.main.path(forResource: "Llama-3.2-3B", ofType: nil)!,
    modelLib: "Llama-3.2-3B-Instruct-q4f16_1-iphone"  // compiled model library name, without extension
)

// Stream via async/await
let messages: [ChatCompletionMessage] = [
    .init(role: .system, content: "You are a helpful assistant."),
    .init(role: .user, content: "Explain what RAG is in machine learning")
]

let request = ChatCompletionRequest(messages: messages, stream: true)

for await chunk in try await engine.chat.completions.create(request) {
    if let delta = chunk.choices.first?.delta.content {
        // Add delta to UI in real-time
        await MainActor.run { self.responseText += delta }
    }
}

The API closely mirrors the OpenAI Chat Completions API, which simplifies code reuse between server and on-device implementations.
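One way to exploit that similarity is to hide both implementations behind a small protocol so the UI code never knows which one it is talking to. The sketch below is an illustration only: `ChatTurn`, `ChatBackend`, `EchoBackend`, and `collectReply` are hypothetical names, not part of MLCSwift or any OpenAI SDK, and the echo backend exists only so the example runs without a model or a network.

```swift
import Foundation

/// Hypothetical message type shared by all backends.
struct ChatTurn {
    enum Role: String { case system, user, assistant }
    let role: Role
    let content: String
}

/// Hypothetical abstraction: anything that turns chat messages into a stream of text deltas.
protocol ChatBackend {
    func streamReply(to messages: [ChatTurn]) -> AsyncStream<String>
}

/// Stand-in backend that echoes the last message word by word.
/// A real app would add one conforming type wrapping the on-device engine
/// and another wrapping the server API.
struct EchoBackend: ChatBackend {
    func streamReply(to messages: [ChatTurn]) -> AsyncStream<String> {
        let words = (messages.last?.content ?? "").split(separator: " ").map(String.init)
        return AsyncStream { continuation in
            for word in words { continuation.yield(word + " ") }
            continuation.finish()
        }
    }
}

/// UI-facing code depends only on the protocol, so it is identical for every backend.
func collectReply<B: ChatBackend>(from backend: B, messages: [ChatTurn]) async -> String {
    var reply = ""
    for await delta in backend.streamReply(to: messages) {
        reply += delta
    }
    return reply
}
```

With this shape, switching between cloud and offline inference becomes a matter of choosing which conforming type to construct, for example based on connectivity or a user setting.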

Android SDK: Kotlin Integration

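import ai.mlc.mlcllm.MLCEngine
import android.app.Application
import androidx.lifecycle.AndroidViewModel
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn
// ChatCompletionMessage, ChatCompletionRequest and MessageRole also come from the
// MLC Android library; their package path depends on the library version.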

class LLMViewModel(application: Application) : AndroidViewModel(application) {
    private val engine = MLCEngine()

    suspend fun loadModel(modelPath: String, modelLib: String) {
        engine.reload(modelPath, modelLib)
    }

    fun chat(userMessage: String): Flow<String> = flow {
        val messages = listOf(
            ChatCompletionMessage(role = MessageRole.user, content = userMessage)
        )
        val request = ChatCompletionRequest(messages = messages, stream = true)

        engine.chat.completions.create(request).collect { chunk ->
            chunk.choices.firstOrNull()?.delta?.content?.let { delta ->
                emit(delta)
            }
        }
    }.flowOn(Dispatchers.IO)
}

flowOn(Dispatchers.IO) moves inference off the main thread, which must never be blocked. The UI collects the Flow via collectAsState() in Compose, or inside repeatOnLifecycle in a Fragment (launchWhenResumed is deprecated).

Model Memory Management

Keep only one model in memory at a time; on mobile this is a firm rule. To unload:

await engine.unload()
// Explicit unloading releases Metal buffers and GPU memory
// After this, another model can be loaded

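iOS devices use unified memory: Metal buffers live in the same physical RAM as everything else and count toward the app's memory limit. If the user switches to a heavy app (a game, the camera), the system may terminate the backgrounded app to reclaim that memory, so the model must be reloaded when the user returns.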

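// Free the model under memory pressure (requires import UIKit).
// MLCSwift is not known to post an "unloaded" notification, so this sketch
// uses the system memory warning instead; call engine.reload(...) again
// before the next request.
NotificationCenter.default.addObserver(
    forName: UIApplication.didReceiveMemoryWarningNotification,
    object: nil, queue: .main
) { [weak self] _ in
    Task { await self?.engine.unload() }
}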
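Before loading or reloading a model, it can help to check whether the process still has enough headroom. The sketch below is one possible approach, assuming the weight size is known in advance. `modelFits` and `availableMemoryBytes` are hypothetical helper names, and the 512 MiB margin is a placeholder for KV cache and ordinary app memory, not a recommended value. `os_proc_available_memory()` is a real system call but is only available on iOS-family platforms, hence the conditional compilation.

```swift
import Foundation
#if os(iOS)
import os  // provides os_proc_available_memory()
#endif

/// Returns true when the weights plus a safety margin fit in the available memory.
/// The default margin is a hypothetical placeholder for KV cache and app memory.
func modelFits(weightBytes: Int, availableBytes: Int,
               marginBytes: Int = 512 * 1024 * 1024) -> Bool {
    weightBytes + marginBytes <= availableBytes
}

/// Memory the process may still allocate before hitting its limit,
/// or nil when the platform does not report it.
func availableMemoryBytes() -> Int? {
    #if os(iOS)
    return os_proc_available_memory()
    #else
    return nil
    #endif
}

let weights = 2 * 1024 * 1024 * 1024  // hypothetical 2 GiB of weights
if let available = availableMemoryBytes() {
    let verdict = modelFits(weightBytes: weights, availableBytes: available)
    print(verdict ? "safe to load" : "not enough headroom")
} else {
    print("available memory not reported on this platform")
}
```

If the check fails, the app can fall back to a smaller model or a server-side path instead of risking termination by the system.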

Model Download and Management

Model weights should not be embedded in the app bundle: the App Store limits an app package to 4 GB, and weights alone can be 2–4 GB. Download them on first launch or on demand:

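// Download via URLSession (for multi-GB files, consider a background
// session configuration so the transfer survives app suspension)
func downloadModel(from url: URL, modelName: String) async throws {
    let destinationURL = Self.modelsDirectory.appendingPathComponent(modelName)
    guard !FileManager.default.fileExists(atPath: destinationURL.path) else { return }

    // Application Support may not exist yet; create the models folder first
    try FileManager.default.createDirectory(at: Self.modelsDirectory, withIntermediateDirectories: true)

    let (tempURL, response) = try await URLSession.shared.download(from: url)
    // Reject error pages so they are not saved as model weights
    guard (response as? HTTPURLResponse)?.statusCode == 200 else { throw URLError(.badServerResponse) }
    try FileManager.default.moveItem(at: tempURL, to: destinationURL)
}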

static var modelsDirectory: URL {
    FileManager.default.urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]
        .appendingPathComponent("MLCModels")
}

applicationSupportDirectory is the proper location for large app data. Unlike Documents, it cannot be exposed to users in Files.app.
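Files in Application Support are included in device backups by default. Since model weights can be downloaded again, it is usually reasonable to exclude them. The sketch below shows one way to do that with the standard `isExcludedFromBackup` resource value. `excludeFromBackup` is a hypothetical helper name, and the temporary directory stands in for the real models folder.

```swift
import Foundation

/// Marks a file or directory so it is skipped by iCloud and local device backups.
func excludeFromBackup(_ url: URL) throws {
    var target = url  // setResourceValues is mutating, so work on a local copy
    var values = URLResourceValues()
    values.isExcludedFromBackup = true
    try target.setResourceValues(values)
}

// Hypothetical usage with a throwaway directory standing in for the models folder.
let demoDirectory = FileManager.default.temporaryDirectory
    .appendingPathComponent("MLCModels-demo", isDirectory: true)
try FileManager.default.createDirectory(at: demoDirectory, withIntermediateDirectories: true)
try excludeFromBackup(demoDirectory)
```

In an app, the call would typically be made once, right after the models directory is created and before the first download lands in it.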

When to Use MLC vs Llama.cpp

MLC LLM is preferable when maximum speed on specific devices matters, the target devices are known in advance (so you can compile for their architectures), and you are using mainstream Hugging Face models (Llama, Phi, Gemma, Mistral).

llama.cpp is preferable when you need quantization flexibility, models arrive from partners in GGUF format, legacy devices must be supported, or custom sampling is required (beam search, specific temperature parameters).

Process

Model selection for task and device range → compilation for iOS/Android targets → SDK integration → chat UI implementation with async streaming → weight download pipeline → thermal constraint testing.
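For the thermal testing step, one possible pattern is to map the system's reported thermal state to a generation policy and re-evaluate it whenever the state changes. `ProcessInfo.thermalState` and its change notification are standard Foundation APIs on Apple platforms. `GenerationPolicy`, `policy(for:)`, and the 256-token cap are hypothetical and serve only as an illustration.

```swift
import Foundation

/// Hypothetical generation policy derived from the device's thermal state.
enum GenerationPolicy: Equatable {
    case full                      // no restrictions
    case reduced(maxTokens: Int)   // shorter replies to let the device cool
    case paused                    // refuse new generations until the state improves
}

/// Maps a thermal state to a policy. The token cap is an illustrative placeholder.
func policy(for state: ProcessInfo.ThermalState) -> GenerationPolicy {
    switch state {
    case .nominal, .fair: return .full
    case .serious: return .reduced(maxTokens: 256)
    case .critical: return .paused
    @unknown default: return .reduced(maxTokens: 256)
    }
}

print(policy(for: ProcessInfo.processInfo.thermalState))

// Re-evaluate whenever the system reports a thermal state change.
let thermalObserver = NotificationCenter.default.addObserver(
    forName: ProcessInfo.thermalStateDidChangeNotification,
    object: nil, queue: .main
) { _ in
    print(policy(for: ProcessInfo.processInfo.thermalState))
}
```

During testing, running long generations back to back while logging the policy transitions gives a sense of how quickly a given device throttles.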

Timeline Estimates

Single platform, single model, basic chat: 3–5 weeks. Both platforms, switching between multiple models, weight management system: 7–11 weeks.