On-Device AI Assistant with Local Model for Mobile App

NOVASOLUTIONS.TECHNOLOGY develops, supports and maintains iOS, Android and PWA mobile applications. We have extensive experience publishing mobile applications in popular markets such as Google Play, the App Store, Amazon, AppGallery and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.

Latest works
  • Development of a mobile application for FEEDME
  • Development of a mobile application for XOOMER
  • Development of a mobile application for RHL
  • Development of a mobile application for ZIPPY
  • Development of a mobile application for Affhome
  • Development of a mobile application for the FLAVORS company

Building an AI Assistant with Local On-Device Models in a Mobile Application

An on-device LLM is not just "works without internet." It is an architectural decision: user data never leaves the device. For medical diaries, personal notes, or corporate documents on employee devices, this is not a feature but a requirement. It is technically feasible on modern flagship devices, but it demands careful model and runtime selection for the specific hardware.

What's Actually Runnable on a Phone in 2024–2025

The boundary of what is possible shifted with the new chips. The iPhone 15 Pro (A17 Pro, 8 GB RAM, Neural Engine at 35 TOPS) and the Samsung Galaxy S24 (Snapdragon 8 Gen 3, 12 GB RAM) run 3B models in INT4 quantization at 15–30 tokens/sec—enough for an assistant with streaming output.

Device               RAM      Recommended Model            Tokens/sec
iPhone 15 Pro / 16   8 GB     Llama 3.2 3B Q4_K_M          20–30
iPad Pro M4          16 GB    Llama 3.1 8B Q4_K_M          15–25
Samsung S24 Ultra    12 GB    Phi-3 Mini Q4                25–35
Budget Android       4–6 GB   Phi-3 Mini Q2 / Gemma 2B     5–15
Older devices        3 GB     Not recommended              —

Attempting to run a 7B+ model on a phone with 4 GB of RAM is a guaranteed OOM crash.
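The table above can double as a runtime guard in the app. A minimal sketch in Kotlin—the tier thresholds and model names mirror the table, but `ModelChoice` and `selectModel` are illustrative helpers, not a library API:

```kotlin
// Pick a model tier from the device's total RAM, mirroring the table above.
// On Android, total RAM comes from ActivityManager.MemoryInfo.totalMem.
data class ModelChoice(val name: String, val quant: String)

fun selectModel(totalRamGb: Double): ModelChoice? = when {
    totalRamGb >= 16.0 -> ModelChoice("Llama 3.1 8B", "Q4_K_M")  // iPad Pro M4 class
    totalRamGb >= 8.0  -> ModelChoice("Llama 3.2 3B", "Q4_K_M")  // flagship phones
    totalRamGb >= 4.0  -> ModelChoice("Phi-3 Mini", "Q2_K")      // budget Android
    else               -> null  // below 4 GB: on-device LLM not recommended
}
```

Returning `null` for sub-4 GB devices lets the app fall back to a server-side model or disable the feature instead of crashing with an OOM.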

iOS: Core ML and Apple MLX

Apple offers two paths. Core ML is the stable one: it supports iOS 16+ and automatically uses the Neural Engine. The model is converted from PyTorch via coremltools. After converting Llama 3.2 3B to INT4, the file is about 1.8 GB.

import CoreML

// LlamaModel is the Swift class Xcode generates from the converted .mlpackage;
// tokenize/detokenize/predictNextToken/eosTokenId are app-side helpers around
// the tokenizer and the model's prediction call.
class OnDeviceLLM {
    private let model: MLModel

    init() throws {
        let config = MLModelConfiguration()
        config.computeUnits = .all  // CPU + GPU + Neural Engine
        model = try LlamaModel(configuration: config).model
    }

    func generate(prompt: String) -> AsyncStream<String> {
        AsyncStream { continuation in
            Task.detached(priority: .userInitiated) {
                // tokenize → autoregressive decode → yield tokens
                var tokens = self.tokenize(prompt)
                for _ in 0..<512 {
                    let nextToken = self.predictNextToken(tokens)
                    if nextToken == self.eosTokenId { break }
                    tokens.append(nextToken)  // feed the new token back in
                    continuation.yield(self.detokenize(nextToken))
                }
                continuation.finish()
            }
        }
    }
}

Apple MLX (a Swift framework) offers a more convenient API, but it also requires iOS 16+ and performs best on devices with unified memory. Officially converted models from Apple are available on Hugging Face.

llama.cpp offers the largest selection of models in GGUF format and is actively maintained by the community. Integration via a C++ bridging header is more complex than Core ML, but it provides access to any GGUF model.

Android: TFLite, MediaPipe, ExecuTorch

Android has more options and no single unified standard.

MediaPipe LLM Inference API (Google) is the most mature solution for Android. It supports Gemma 2B/7B, Phi-2, and Llama 2 models exported to TFLite format. Integration goes through the LlmInference class:

import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Stream partial results to the UI as tokens are generated.
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/gemma-2b-it-gpu-int4.bin")
    .setMaxTokens(1024)
    .setResultListener { partialResult, done ->
        runOnUiThread { appendText(partialResult) }  // appendText: app-side UI helper
    }
    .build()

val llmInference = LlmInference.createFromOptions(context, options)
llmInference.generateResponseAsync(prompt)

TFLite with a custom LLM runner is more flexible but requires more integration work.

ExecuTorch (Meta) is the official runtime for Llama on Android and supports Llama 3.x directly, without conversion. It is compiled via buck2, which is nontrivial to set up in a Gradle project.

Model Loading and Updates

The model cannot be bundled into the app package: 1.5–3 GB leads straight to App Store rejection and scares users away with a huge APK. The correct approach: on first launch, offer to download the model, download it in the background with a progress bar, then verify the checksum.
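The "verify the checksum" step can be sketched with the JDK's MessageDigest; the expected hash would come from the server's manifest (the function names here are illustrative):

```kotlin
import java.io.File
import java.security.MessageDigest

// Stream the downloaded model file through SHA-256 and compare with the
// manifest value; re-download on mismatch instead of loading a corrupt model.
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(1 shl 16)  // 64 KB chunks: the model is gigabytes
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            digest.update(buffer, 0, read)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

fun verifyDownload(file: File, expectedSha256: String): Boolean =
    sha256Hex(file).equals(expectedSha256, ignoreCase = true)
```

Streaming in chunks matters here: reading a 2 GB file into memory just to hash it would defeat the whole point of careful RAM budgeting.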

On iOS: URLSession.downloadTask with a background configuration, so the download continues when the app goes to the background. Store the file in Application Support, not Caches—the system may purge Caches when space is low.

On Android: DownloadManager, or WorkManager with a NetworkType.UNMETERED constraint, so the user doesn't burn mobile data.
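A hedged sketch of the WorkManager route—the constraints and policies are real WorkManager API, but ModelDownloadWorker is a hypothetical worker class that would hold the actual download and checksum logic:

```kotlin
import android.content.Context
import androidx.work.*
import java.time.Duration

// ModelDownloadWorker (hypothetical) performs the HTTP download and
// checksum verification inside doWork().
fun scheduleModelDownload(context: Context) {
    val constraints = Constraints.Builder()
        .setRequiredNetworkType(NetworkType.UNMETERED)  // Wi-Fi only: no mobile data spent
        .setRequiresStorageNotLow(true)                 // the file is gigabytes
        .build()

    val request = OneTimeWorkRequestBuilder<ModelDownloadWorker>()
        .setConstraints(constraints)
        .setBackoffCriteria(BackoffPolicy.EXPONENTIAL, Duration.ofMinutes(1))
        .build()

    // KEEP: don't restart a download that is already queued or running
    WorkManager.getInstance(context)
        .enqueueUniqueWork("model-download", ExistingWorkPolicy.KEEP, request)
}
```

Unique work with ExistingWorkPolicy.KEEP is the design choice that matters: a multi-gigabyte download must survive process death without ever being enqueued twice.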

Model updates: the server publishes a manifest with the current version and a SHA-256 checksum. On launch, the app checks the manifest; if the version has changed, it offers the update.
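The manifest check itself is a few lines. The field names below (version, sha256, url) are an assumed manifest schema, not a standard format:

```kotlin
// Compare the locally stored manifest against the one fetched from the server.
data class ModelManifest(val version: Int, val sha256: String, val url: String)

fun updateNeeded(local: ModelManifest?, remote: ModelManifest): Boolean =
    local == null ||                                       // first launch: nothing stored yet
        remote.version > local.version ||                  // newer model published
        !remote.sha256.equals(local.sha256, ignoreCase = true)  // same version, changed file
```

Comparing the checksum as well as the version catches the case where a model file was re-uploaded under the same version number.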

Heat and Battery

Inference on the Neural Engine/GPU heats the device. For long generations (>200 tokens), monitor the thermal state via ProcessInfo.thermalState on iOS—at .serious, reduce n_threads or temporarily switch to CPU-only inference. On Android, use PowerManager.getCurrentThermalStatus() or a thermal status listener.
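The throttling policy can be isolated in one small function. The severity scale below loosely mirrors Android's PowerManager.THERMAL_STATUS_* constants (NONE=0 … SEVERE=3 and above); the thread counts are illustrative tuning values, not fixed rules:

```kotlin
// Map platform thermal severity to an inference thread budget.
// 0–1: none/light, 2: moderate, 3+: severe/critical/emergency.
fun threadsForThermalStatus(status: Int, maxThreads: Int = 6): Int = when {
    status >= 3 -> 1                // severe and above: minimal CPU load, let it cool
    status == 2 -> maxThreads / 2   // moderate: halve the thread pool
    else        -> maxThreads       // none/light: full speed
}
```

Re-evaluating this on every generation request (rather than once at startup) lets the assistant recover full speed automatically once the device cools down.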

Battery consumption: a 3B model on an iPhone 15 Pro uses roughly 5–8% of the battery per 100 requests. There is no need to warn users—that is less than video streaming.

Timeline Estimates

A basic on-device assistant (one platform, Core ML or MediaPipe) takes 4–5 weeks. A cross-platform build with download management, updates, and thermal monitoring takes 8–12 weeks.