On-Device ML Model Integration (TensorFlow Lite) for Offline AI in Android Apps

TensorFlow Lite is the de facto standard for running ML models on Android. But "add a .tflite file to assets" is only the beginning. Real integration includes choosing an acceleration delegate, managing buffer memory, handling device incompatibilities, and testing numerical accuracy.

Converting Model to TFLite

From PyTorch via ONNX:

# PyTorch → ONNX
python -c "
import torch, onnx
# MyModel is a placeholder: import your own nn.Module class before this line
model = MyModel(); model.eval()
torch.onnx.export(model, torch.zeros(1,3,224,224), 'model.onnx',
    opset_version=17, input_names=['input'], output_names=['output'])
"

# ONNX → TFLite via onnx-tf
pip install onnx-tf tensorflow
onnx-tf convert -i model.onnx -o model_tf
tflite_convert --saved_model_dir=model_tf --output_file=model.tflite

Or directly from TensorFlow SavedModel:

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("model_tf")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
converter.target_spec.supported_types = [tf.float16]  # store weights as FP16 (suits the GPU delegate)
tflite_model = converter.convert()
with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_model)

Acceleration Delegates: Which to Choose

Delegate        Requirements             Speedup vs CPU   Constraints
GPU Delegate    OpenGL ES 3.1 / OpenCL   3–7×             Not all ops supported; FP32/FP16 only
NNAPI           Android 8.1+, NPU/DSP    2–10×            Chip-dependent; unstable on old ROMs
Hexagon (QC)    Snapdragon with DSP      3–8×             Qualcomm only
CPU (XNNPACK)   Always available         baseline         -

// GPU delegate: the most widely supported accelerator
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.CompatibilityList
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.support.common.FileUtil

val compatList = CompatibilityList()
val options = Interpreter.Options()

if (compatList.isDelegateSupportedOnThisDevice) {
    val delegateOptions = compatList.bestOptionsForThisDevice
    options.addDelegate(GpuDelegate(delegateOptions))
} else {
    // Fallback: NNAPI or CPU with XNNPACK
    options.setUseNNAPI(true)
    options.setUseXNNPACK(true)
}
options.setNumThreads(4)

val interpreter = Interpreter(
    FileUtil.loadMappedFile(context, "model_fp16.tflite"),
    options
)

In practice NNAPI is unstable: on some devices it gives a 5× speedup, on others it crashes with "NNAPIDelegate: Failed to invoke the model due to incompatible operations". Always wrap it in try/catch with a CPU fallback:

try {
    options.setUseNNAPI(true)
    interpreter = Interpreter(modelBuffer, options)
    // Test run to verify
    interpreter.run(testInput, testOutput)
} catch (e: Exception) {
    Log.w("ML", "NNAPI failed, falling back to CPU: ${e.message}")
    options.setUseNNAPI(false)
    interpreter = Interpreter(modelBuffer, options)
}

Buffer Management: ByteBuffer vs TensorBuffer

Managing a direct ByteBuffer by hand is faster but verbose. TensorBuffer from org.tensorflow.lite.support is more convenient:

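// Via the TFLite Support Library (recommended)
import org.tensorflow.lite.DataType
import org.tensorflow.lite.support.common.ops.NormalizeOp
import org.tensorflow.lite.support.image.ImageProcessor
import org.tensorflow.lite.support.image.TensorImage
import org.tensorflow.lite.support.image.ops.ResizeOp
import org.tensorflow.lite.support.tensorbuffer.TensorBuffer
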
val imageProcessor = ImageProcessor.Builder()
    .add(ResizeOp(224, 224, ResizeOp.ResizeMethod.BILINEAR))
    .add(NormalizeOp(127.5f, 127.5f))  // normalization [-1, 1]
    .build()

val tensorImage = TensorImage(DataType.FLOAT32)
tensorImage.load(bitmap)
val processedImage = imageProcessor.process(tensorImage)

// Run
val outputBuffer = TensorBuffer.createFixedSize(intArrayOf(1, 1000), DataType.FLOAT32)
interpreter.run(processedImage.buffer, outputBuffer.buffer)

// Result
val probabilities = outputBuffer.floatArray
val topIndex = probabilities.indices.maxByOrNull { probabilities[it] } ?: -1
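
To make the NormalizeOp(127.5f, 127.5f) step concrete: the op computes (x - mean) / stddev for each value, which maps 8-bit pixel values 0..255 onto [-1, 1]. A minimal Python sketch of that arithmetic follows; the function name normalize is illustrative and not part of the Support Library.

```python
def normalize(pixel: float, mean: float = 127.5, stddev: float = 127.5) -> float:
    """Same arithmetic as NormalizeOp(mean, stddev): (x - mean) / stddev."""
    return (pixel - mean) / stddev


# Darkest, middle, and brightest 8-bit values
for value in (0, 127.5, 255):
    print(value, "->", normalize(value))
```

If the model was trained with a different preprocessing (for example scaling to [0, 1]), the mean and stddev passed to NormalizeOp have to match that instead.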

ResizeOp on the CPU is surprisingly slow for large images (Full HD → 224×224 takes 20–40 ms). Alternatives: pre-resize with Bitmap.createScaledBitmap(), or request a smaller output size from Camera2 directly. RenderScript would also work, but it is deprecated.

CameraX Integration

val imageAnalyzer = ImageAnalysis.Builder()
    .setTargetResolution(Size(640, 480))
    .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)  // don't queue
    .build()
    .also {
        it.setAnalyzer(cameraExecutor) { imageProxy ->
            try {
                val bitmap = imageProxy.toBitmap()
                runInference(bitmap)
            } finally {
                imageProxy.close()  // CRITICAL: otherwise CameraX stops delivering frames
            }
        }
    }

Calling imageProxy.close() in a finally block is not optional. If a frame is not closed, ImageAnalysis stops delivering frames within seconds. This is a typical bug that surfaces only in long-running tests.

Numerical Accuracy After Conversion

After conversion and quantization, always check model accuracy on a test set. FP16 quantization usually costs less than 1% accuracy, INT8 1–3%. If the loss is larger, the quantization calibration dataset may be too small, or the model may be sensitive to quantization in certain layers.
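
Why a calibration set that is too small hurts INT8 accuracy can be shown with a toy quantizer. The sketch below is a simplified per-tensor affine scheme written for illustration only (TFLite's actual INT8 scheme differs in details, and the names quant_params and fake_quantize are made up here). If calibration only ever saw values in [-0.5, 0.5], a real activation of 0.9 gets clipped; with a representative range of [-1, 1] the error stays within half a quantization step.

```python
def quant_params(min_val: float, max_val: float, num_bits: int = 8):
    """Return (scale, zero_point) for a simple asymmetric affine quantizer."""
    qmax = 2 ** num_bits - 1
    # The range must contain 0.0 so that zero is exactly representable.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    if max_val == min_val:
        return 1.0, 0  # degenerate range: avoid division by zero
    scale = (max_val - min_val) / qmax
    zero_point = round(-min_val / scale)
    return scale, zero_point


def fake_quantize(x: float, scale: float, zero_point: int, num_bits: int = 8) -> float:
    """Quantize x to an integer code, clamp it, and map it back to a float."""
    q = round(x / scale) + zero_point
    q = max(0, min(2 ** num_bits - 1, q))  # values outside the range are clipped
    return (q - zero_point) * scale


wide = quant_params(-1.0, 1.0)     # range seen with a representative calibration set
narrow = quant_params(-0.5, 0.5)   # range seen with a calibration set that is too small
x = 0.9
print("error with wide range:  ", abs(x - fake_quantize(x, *wide)))
print("error with narrow range:", abs(x - fake_quantize(x, *narrow)))
```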

To verify, compare the outputs of the original PyTorch model and the TFLite model on the same inputs:

# Check that outputs match
import numpy as np
original_out = pytorch_model(test_input).detach().numpy()
# run_tflite is your own helper: set the input tensor, invoke, return the output array
tflite_out = run_tflite(interpreter, test_input)
print(f"Max difference: {np.max(np.abs(original_out - tflite_out))}")
# Normal: < 0.01 for FP16, < 0.05 for INT8
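
The comparison above calls run_tflite without defining it. One possible shape for that helper is sketched below. It assumes a single-input, single-output model, an interpreter that behaves like tf.lite.Interpreter with allocate_tensors() already called, and an input that is already a NumPy array with the dtype and shape the model expects. Those assumptions are mine, not the article's.

```python
def run_tflite(interpreter, input_array):
    """Run one inference and return the first output tensor.

    Assumes `interpreter` exposes the tf.lite.Interpreter methods used below
    and that allocate_tensors() has already been called on it.
    """
    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]
    interpreter.set_tensor(input_index, input_array)
    interpreter.invoke()
    return interpreter.get_tensor(output_index)


# Typical use (requires TensorFlow, shown as comments only):
#   interpreter = tf.lite.Interpreter(model_path="model_fp16.tflite")
#   interpreter.allocate_tensors()
#   tflite_out = run_tflite(interpreter, test_input.numpy())
```

A PyTorch tensor has to be converted with .numpy() before it is passed in, and a model exported in a different tensor layout may need a transpose first.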

Model Placement

Ship the .tflite file in assets/. On first run either copy it to filesDir, or use a MappedByteBuffer directly from assets for zero-copy loading (openFd() only works if the asset is stored uncompressed in the APK):

fun loadModelFile(context: Context, filename: String): MappedByteBuffer {
    val fileDescriptor = context.assets.openFd(filename)
    val inputStream = FileInputStream(fileDescriptor.fileDescriptor)
    return inputStream.channel.map(
        FileChannel.MapMode.READ_ONLY,
        fileDescriptor.startOffset,
        fileDescriptor.declaredLength
    )
}

With a MappedByteBuffer the OS does not copy the file into RAM on load; it maps the file directly. For large models (50–200 MB) the difference is significant.
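
The same idea can be tried on a desktop with Python's standard mmap module, as a rough analogy to MappedByteBuffer rather than Android code. The helper name map_file_read_only and the placeholder file contents are illustrative.

```python
import mmap
import os
import tempfile


def map_file_read_only(path: str) -> mmap.mmap:
    """Map a whole file read-only; the OS pages data in lazily on access."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file. The mapping stays valid after f is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


# Placeholder bytes standing in for a model file (not a real .tflite).
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"demo" + bytes(1024))
    path = tmp.name

mapped = map_file_read_only(path)
print(len(mapped), mapped[:4])  # slicing reads only the pages it touches
mapped.close()
os.remove(path)
```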

Process

Convert from source format → evaluate delegates on target devices → integrate with fallback logic → test numerical accuracy → profile via Android Profiler + TFLite Model Benchmark Tool.
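
Before profiling on devices, a quick desktop latency check of the converted model can catch gross regressions. The helper below is a generic sketch (the name benchmark and its defaults are assumptions): it times any zero-argument callable after a warm-up phase. It does not replace on-device measurements with Android Profiler or the TFLite Model Benchmark Tool.

```python
import statistics
import time


def benchmark(run_once, warmup: int = 5, runs: int = 50) -> dict:
    """Time a zero-argument callable; return latency stats in milliseconds.

    `runs` must be at least 1. Warm-up calls are executed but not measured.
    """
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


# Stand-in workload; with a real model pass e.g. `interpreter.invoke` instead.
print(benchmark(lambda: sum(range(10_000))))
```

Reporting the median and a high percentile rather than the mean keeps one slow outlier run from dominating the result.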

Timeline Estimates

Basic TFLite model integration on Android takes 1–2 weeks. Adding multi-delegate logic, a CameraX pipeline, and testing on a device fleet brings it to 3–5 weeks.