On-device LLM Llama.cpp for offline AI assistant in mobile app

On-Device LLM Integration (Llama.cpp) for Offline AI Assistant in Mobile App

Running a language model directly on a smartphone without internet access is now a reality. Llama.cpp provides working CPU inference with optional Metal/Vulkan acceleration. The main question isn't "can we?" but "which model and quantization should we choose so the device doesn't overheat after 5 minutes?"

Models and Their Real Requirements

Llama.cpp works with models in GGUF format. Popular options for mobile:

Model           Quantization   File size   RAM (approx.)   Speed on iPhone 14
Llama-3.2-1B    Q4_K_M         0.8 GB      ~1.2 GB         25–35 t/s
Llama-3.2-3B    Q4_K_M         2.0 GB      ~2.5 GB         10–15 t/s
Phi-3-mini-4k   Q4_K_M         2.2 GB      ~2.8 GB         8–12 t/s
Gemma-2-2B      Q4_K_M         1.6 GB      ~2.0 GB         12–18 t/s
Qwen2.5-1.5B    Q4_K_M         1.0 GB      ~1.4 GB         20–28 t/s

On an iPhone SE (2nd gen) with 3 GB of RAM, Llama-3.2-3B Q4 runs at the limit: out-of-memory (OOM) termination is possible with long contexts. The safe choice for broad device coverage is a model file of no more than 1.5–2 GB.
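The selection rule above can be sketched as a small helper that picks the largest model whose RAM estimate fits a memory budget. This is only an illustration: the RAM figures are copied from the table, while `ModelOption`, `pickModel` and the 50% budget fraction are assumptions made for the example, not measured limits. The memory iOS actually grants an app varies by device and OS version.

```swift
import Foundation

// Hypothetical description of one downloadable model.
struct ModelOption {
    let name: String
    let ramGB: Double   // approximate RAM use, taken from the table
}

let options = [
    ModelOption(name: "Llama-3.2-1B Q4_K_M", ramGB: 1.2),
    ModelOption(name: "Llama-3.2-3B Q4_K_M", ramGB: 2.5),
    ModelOption(name: "Phi-3-mini-4k Q4_K_M", ramGB: 2.8),
    ModelOption(name: "Gemma-2-2B Q4_K_M", ramGB: 2.0),
    ModelOption(name: "Qwen2.5-1.5B Q4_K_M", ramGB: 1.4),
]

/// Returns the largest model whose RAM estimate fits into `budgetFraction`
/// of physical memory, or nil if none fits. The fraction is an assumption.
func pickModel(physicalMemoryBytes: UInt64, budgetFraction: Double = 0.5) -> ModelOption? {
    let budgetGB = Double(physicalMemoryBytes) / 1_073_741_824 * budgetFraction
    return options
        .filter { $0.ramGB <= budgetGB }
        .max { $0.ramGB < $1.ramGB }
}

// On a real device the input would come from ProcessInfo.
let choice = pickModel(physicalMemoryBytes: ProcessInfo.processInfo.physicalMemory)
print(choice?.name ?? "no model fits the budget")
```

With this budget a 3 GB device gets Qwen2.5-1.5B and never the 3B model, which matches the caution about the iPhone SE above. "Largest that fits" is a simplification; a real app may prefer a specific model family.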

Building Llama.cpp for iOS

# Clone repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

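# Build with CMake for iOS.
# PLATFORM=OS64 selects arm64 only; it is an option of the ios-cmake toolchain file
# (ios.toolchain.cmake), which is not shipped with llama.cpp.
# LLAMA_METAL=ON enables Metal GPU acceleration (newer llama.cpp versions name it GGML_METAL).
cmake -B build-ios \
    -DCMAKE_TOOLCHAIN_FILE=ios.toolchain.cmake \
    -DPLATFORM=OS64 \
    -DLLAMA_METAL=ON \
    -DLLAMA_STATIC=ON
cmake --build build-ios --config Release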

The result is the static library libllama.a. Next, create a Swift Package with a C bridging header:

// llama_bridge.h
#include "llama.h"
// Wrappers for Swift-friendly API
void* llama_create_context(const char* model_path, int n_ctx, int n_gpu_layers);
const char* llama_generate_token(void* ctx, const char* prompt);
void llama_free_context(void* ctx);

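n_gpu_layers is the number of layers offloaded to the Metal GPU. Here -1 is used to mean "all layers on GPU"; how the C API treats negative values depends on the llama.cpp version, so check yours, or pass a value at least as large as the model's layer count (for example 99). On an iPhone 14 with 6 GB of unified memory, offload all layers. On 3 GB devices, experiment: too many layers on the GPU can trigger OOM.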

Swift Wrapper for Token Streaming

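import Foundation
// Assumes the llama.h C API as of mid-2024 (before the llama_sampler refactor),
// exposed to Swift through the bridging header.

enum LlamaError: Error {
    case modelLoadFailed, contextCreationFailed, notLoaded, tokenizationFailed, decodeFailed
}

actor LlamaSession {
    private var context: OpaquePointer?
    private var model: OpaquePointer?

    func load(modelPath: String, contextSize: Int32 = 2048, gpuLayers: Int32 = -1) throws {
        var params = llama_model_default_params()
        params.n_gpu_layers = gpuLayers

        model = llama_load_model_from_file(modelPath, params)
        guard model != nil else { throw LlamaError.modelLoadFailed }

        var ctxParams = llama_context_default_params()
        ctxParams.n_ctx = UInt32(contextSize)
        ctxParams.n_batch = 512

        context = llama_new_context_with_model(model, ctxParams)
        guard context != nil else { throw LlamaError.contextCreationFailed }
    }

    func generate(prompt: String, maxTokens: Int = 256) -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream { continuation in
            Task.detached(priority: .userInitiated) {
                do {
                    // Inference runs on the actor, so model and context are never shared
                    try await self.run(prompt: prompt, maxTokens: maxTokens, continuation: continuation)
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
        }
    }

    private func run(prompt: String, maxTokens: Int,
                     continuation: AsyncThrowingStream<String, Error>.Continuation) throws {
        guard let model, let context else { throw LlamaError.notLoaded }
        llama_kv_cache_clear(context)  // start each request from an empty context

        // Tokenization; the buffer matches n_batch so the prompt fits one decode call
        var tokens = [llama_token](repeating: 0, count: 512)
        let nTokens = Int(llama_tokenize(model, prompt, Int32(prompt.utf8.count),
                                         &tokens, Int32(tokens.count), true, false))
        guard nTokens > 0 else { throw LlamaError.tokenizationFailed }  // negative: prompt too long

        var batch = llama_batch_init(512, 0, 1)
        defer { llama_batch_free(batch) }
        for i in 0..<nTokens {
            // Only the last prompt token needs logits
            add(&batch, tokens[i], llama_pos(i), logits: i == nTokens - 1)
        }

        let nVocab = Int(llama_n_vocab(model))
        var pos = llama_pos(nTokens)

        // Inference: one token per iteration, bounded by maxTokens
        for _ in 0..<maxTokens {
            guard llama_decode(context, batch) == 0,
                  let logits = llama_get_logits_ith(context, batch.n_tokens - 1) else {
                throw LlamaError.decodeFailed
            }

            // Greedy sampling over the logits of the last decoded token
            var candidates = (0..<nVocab).map {
                llama_token_data(id: llama_token($0), logit: logits[$0], p: 0)
            }
            let nextToken = candidates.withUnsafeMutableBufferPointer { buf -> llama_token in
                var array = llama_token_data_array(data: buf.baseAddress, size: buf.count, sorted: false)
                return llama_sample_token_greedy(context, &array)
            }
            // Stop on any end-of-generation token (EOS or end-of-turn)
            if llama_token_is_eog(model, nextToken) { break }

            // Convert token to string; the return value is a byte count, the buffer is not NUL-terminated
            var buf = [CChar](repeating: 0, count: 64)
            let n = llama_token_to_piece(model, nextToken, &buf, Int32(buf.count), 0, true)
            if n > 0 {
                let bytes = buf.prefix(Int(n)).map { UInt8(bitPattern: $0) }
                continuation.yield(String(decoding: bytes, as: UTF8.self))
            }

            // Feed the sampled token back as the next input
            batch.n_tokens = 0
            add(&batch, nextToken, pos, logits: true)
            pos += 1
        }
    }

    // Appends one token to the batch (sequence 0)
    private func add(_ batch: inout llama_batch, _ token: llama_token, _ pos: llama_pos, logits: Bool) {
        let i = Int(batch.n_tokens)
        batch.token[i] = token
        batch.pos[i] = pos
        batch.n_seq_id[i] = 1
        batch.seq_id[i]![0] = 0
        batch.logits[i] = logits ? 1 : 0
        batch.n_tokens += 1
    }
}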
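One detail of the token-to-string step deserves care: byte-level tokenizers can split a multi-byte UTF-8 character across two tokens, so converting each piece to a String on its own may produce garbled characters. Below is a minimal sketch of a buffer that holds back an incomplete trailing sequence until the next piece arrives. `UTF8PieceBuffer` and `incompleteTailLength` are hypothetical helpers written for this example, not part of llama.cpp.

```swift
import Foundation

/// Length of an incomplete UTF-8 sequence at the end of `bytes` (0 if none).
func incompleteTailLength(_ bytes: [UInt8]) -> Int {
    // Walk back over at most 3 continuation bytes (10xxxxxx) to the lead byte.
    var i = bytes.count - 1
    var back = 0
    while i >= 0 && back < 3 && (bytes[i] & 0xC0) == 0x80 {
        i -= 1
        back += 1
    }
    guard i >= 0 else { return 0 }

    let lead = bytes[i]
    let expected: Int
    if (lead & 0x80) == 0 { expected = 1 }
    else if (lead & 0xE0) == 0xC0 { expected = 2 }
    else if (lead & 0xF0) == 0xE0 { expected = 3 }
    else if (lead & 0xF8) == 0xF0 { expected = 4 }
    else { return 0 }  // invalid lead byte: let String(decoding:) substitute U+FFFD

    let have = bytes.count - i
    return have < expected ? have : 0
}

struct UTF8PieceBuffer {
    private var pending: [UInt8] = []

    /// Appends the raw bytes of one token piece and returns the text that is complete so far.
    mutating func append(_ bytes: [UInt8]) -> String {
        pending.append(contentsOf: bytes)
        let cut = pending.count - incompleteTailLength(pending)
        let ready = String(decoding: pending[..<cut], as: UTF8.self)
        pending.removeFirst(cut)
        return ready
    }
}
```

In a generation loop, the bytes written by llama_token_to_piece would go through `append`, and only non-empty results would be yielded to the stream.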

Token streaming via AsyncThrowingStream lets users see text as it is generated instead of waiting for the complete response. This is critical for UX: even 10 tokens per second feels natural when text appears progressively.
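On the consuming side, the stream can be read with `for try await`. The sketch below uses a hypothetical `fakeTokenStream` as a stand-in for LlamaSession.generate(prompt:) so that it runs without a model. `streamReply` and its `onUpdate` callback are likewise assumptions made for the example.

```swift
import Foundation

// Hypothetical stand-in for LlamaSession.generate(prompt:): yields fixed pieces with a small delay.
func fakeTokenStream(_ pieces: [String]) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        Task {
            for piece in pieces {
                try? await Task.sleep(nanoseconds: 1_000_000)
                continuation.yield(piece)
            }
            continuation.finish()
        }
    }
}

/// Accumulates pieces into the reply text and reports every intermediate state,
/// for example to update a chat bubble.
func streamReply(from stream: AsyncThrowingStream<String, Error>,
                 onUpdate: (String) -> Void) async throws -> String {
    var text = ""
    for try await piece in stream {
        text += piece
        onUpdate(text)
    }
    return text
}
```

In an app, `onUpdate` would assign to UI state on the main actor. Cancelling the consuming task stops the loop, but the generation itself needs its own cancellation check to stop early.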

Android: Llama.cpp via NDK

# CMakeLists.txt in jni/
add_library(llama_jni SHARED llama_jni.cpp)
target_link_libraries(llama_jni llama ggml)

// Kotlin side
class LlamaEngine {
    init { System.loadLibrary("llama_jni") }

    external fun loadModel(modelPath: String, nGpuLayers: Int): Long  // returns handle
    external fun generateNext(handle: Long, tokens: IntArray): String
    external fun freeModel(handle: Long)
}

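On Android, use the Vulkan backend instead of Metal: enable LLAMA_VULKAN=ON in CMakeLists (newer llama.cpp versions name this option GGML_VULKAN). It is supported on devices with Vulkan 1.1+, which covers essentially all devices running Android 10 or later.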

The Android challenge is memory: a process has no fixed memory budget it can rely on, and when RAM runs short the system may kill the app (SIGKILL) without warning. ComponentCallbacks2.onTrimMemory(TRIM_MEMORY_RUNNING_CRITICAL) is the last chance to free the context before the process is terminated.

Model Download: Progress and Verification

GGUF files for mobile-sized models are roughly 1–4 GB. Download them via URLSession (iOS) or WorkManager with DownloadManager (Android):

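import Foundation
import CryptoKit

// iOS: background URLSession, so the download continues while the app is suspended.
// This code lives inside a class that conforms to URLSessionDownloadDelegate (hence `self`).
let config = URLSessionConfiguration.background(withIdentifier: "model-download")
config.isDiscretionary = false
let session = URLSession(configuration: config, delegate: self, delegateQueue: nil)

let task = session.downloadTask(with: modelURL)
task.resume()

// SHA256 verification after download. The file is hashed in 1 MB chunks
// so a multi-gigabyte model is never loaded into memory at once.
func verify(fileURL: URL, expectedHash: String) -> Bool {
    guard let handle = try? FileHandle(forReadingFrom: fileURL) else { return false }
    defer { try? handle.close() }
    var hasher = SHA256()
    while let chunk = try? handle.read(upToCount: 1 << 20), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    let hex = hasher.finalize().map { String(format: "%02x", $0) }.joined()
    return hex == expectedHash.lowercased()
}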

SHA256 hashes of model files are published in Hugging Face repositories; verify the file before loading it into memory. A corrupt GGUF file causes a crash during header parsing or later during inference, so catch the problem at the verification step.
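Progress reporting and saving the finished file both happen in the session delegate. A minimal sketch follows, assuming a hypothetical `ModelDownloadDelegate` class with `onProgress` and `onFinish` callbacks; the delegate method signatures are the standard URLSessionDownloadDelegate ones. The temporary file must be moved inside `didFinishDownloadingTo`, because the system removes it once that method returns.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

/// Fraction in 0...1, or nil when the server did not report a total size (expected <= 0).
func progressFraction(written: Int64, expected: Int64) -> Double? {
    guard expected > 0 else { return nil }
    return Double(written) / Double(expected)
}

/// Moves the temporary download to its final place, replacing any older file.
func installDownloadedFile(from location: URL, to destination: URL) throws {
    try? FileManager.default.removeItem(at: destination)
    try FileManager.default.moveItem(at: location, to: destination)
}

// Hypothetical delegate; callback names are illustrative.
final class ModelDownloadDelegate: NSObject, URLSessionDownloadDelegate {
    let destination: URL
    var onProgress: (Double) -> Void = { _ in }
    var onFinish: (Result<URL, Error>) -> Void = { _ in }

    init(destination: URL) { self.destination = destination }

    func urlSession(_ session: URLSession, downloadTask: URLSessionDownloadTask,
                    didWriteData bytesWritten: Int64, totalBytesWritten: Int64,
                    totalBytesExpectedToWrite: Int64) {
        if let fraction = progressFraction(written: totalBytesWritten,
                                           expected: totalBytesExpectedToWrite) {
            onProgress(fraction)
        }
    }

    func urlSession(_ session: URLSession, downloadTask: URLSessionDownloadTask,
                    didFinishDownloadingTo location: URL) {
        do {
            try installDownloadedFile(from: location, to: destination)
            onFinish(.success(destination))
        } catch {
            onFinish(.failure(error))
        }
    }

    func urlSession(_ session: URLSession, task: URLSessionTask,
                    didCompleteWithError error: Error?) {
        if let error { onFinish(.failure(error)) }
    }
}
```

Hash verification would then run on the installed file before the model path is handed to the loader.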

Thermal Constraints

Extended generation with Llama.cpp heats up an iPhone. When the device overheats, iOS throttles it by reducing the clock frequency, and generation speed drops from 25 t/s to 8–10 t/s. This is expected system behavior.

A practical solution: limit the maximum context (n_ctx) to 1024–2048 for short sessions and pause between requests. Monitor ProcessInfo.processInfo.thermalState on iOS:

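// Keep the returned token; pass it to removeObserver(_:) to stop observing
let thermalObserver = NotificationCenter.default.addObserver(
    forName: ProcessInfo.thermalStateDidChangeNotification,
    object: nil,
    queue: .main
) { _ in
    let state = ProcessInfo.processInfo.thermalState
    if state == .critical || state == .serious {
        // Pause generation, notify user
    }
}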
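What "pause generation" means in practice can be captured in a small policy table. The sketch below maps each thermal state to generation limits. `GenerationPolicy`, `policy(for:)` and all numeric values are assumptions chosen for illustration and should be tuned against thermal stress tests on real devices.

```swift
import Foundation

// Hypothetical limits applied to the next request.
struct GenerationPolicy: Equatable {
    let allowGeneration: Bool
    let maxNewTokens: Int
    let cooldownSeconds: Double   // pause before the next request
}

func policy(for state: ProcessInfo.ThermalState) -> GenerationPolicy {
    switch state {
    case .nominal:
        return GenerationPolicy(allowGeneration: true, maxNewTokens: 512, cooldownSeconds: 0)
    case .fair:
        // Device is warming up: shorter answers and a short pause between requests.
        return GenerationPolicy(allowGeneration: true, maxNewTokens: 256, cooldownSeconds: 2)
    case .serious, .critical:
        // Matches the rule above: stop generating and let the device cool down.
        return GenerationPolicy(allowGeneration: false, maxNewTokens: 0, cooldownSeconds: 30)
    @unknown default:
        // Unknown future states are treated conservatively.
        return GenerationPolicy(allowGeneration: false, maxNewTokens: 0, cooldownSeconds: 30)
    }
}
```

The notification handler would call `policy(for: ProcessInfo.processInfo.thermalState)` and store the result, and the chat layer would consult it before starting each request.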

Process

Model selection for target devices → llama.cpp build for platforms → Swift/Kotlin wrapper with async streaming → download and verification → chat UI → thermal stress testing.

Timeline Estimates

Single platform, basic chat UI with selected model — 3–5 weeks. Both platforms, multiple model choices, background download, context management — 7–12 weeks.