On-Device ML Model Integration (TensorFlow Lite) for Offline AI in Android Apps
TensorFlow Lite is the de facto standard for running ML models on Android. But "add the .tflite file to assets" is just the beginning. Real integration includes choosing an acceleration delegate, managing buffer memory, handling device incompatibilities, and testing numerical accuracy.
Converting Model to TFLite
From PyTorch via ONNX:
# PyTorch → ONNX
python -c "
import torch, onnx
model = MyModel(); model.eval()
torch.onnx.export(model, torch.zeros(1,3,224,224), 'model.onnx',
                  opset_version=17, input_names=['input'], output_names=['output'])
"
# ONNX → TFLite via onnx-tf
pip install onnx-tf tensorflow
onnx-tf convert -i model.onnx -o model_tf
tflite_convert --saved_model_dir=model_tf --output_file=model.tflite
Or directly from a TensorFlow SavedModel:
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("model_tf")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable post-training quantization
converter.target_spec.supported_types = [tf.float16]  # store weights as FP16 (suits the GPU delegate)
tflite_model = converter.convert()
with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_model)
Acceleration Delegates: Which to Choose
| Delegate | Requirements | Speedup vs CPU | Constraints |
|---|---|---|---|
| GPU Delegate | OpenGL ES 3.1 / OpenCL | 3–7× | Not all ops supported; FP32/FP16 only |
| NNAPI | Android 8.1+, NPU/DSP | 2–10× | Chip-dependent, unstable on older ROMs |
| Hexagon (Qualcomm) | Snapdragon with DSP | 3–8× | Qualcomm only |
| CPU (XNNPACK) | Always | baseline | — |
// GPU delegate: the most widely available accelerator
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.CompatibilityList
import org.tensorflow.lite.gpu.GpuDelegate
import org.tensorflow.lite.support.common.FileUtil

val compatList = CompatibilityList()
val options = Interpreter.Options()
if (compatList.isDelegateSupportedOnThisDevice) {
    // Use the delegate options recommended for this device
    val delegateOptions = compatList.bestOptionsForThisDevice
    options.addDelegate(GpuDelegate(delegateOptions))
} else {
    // Fallback: NNAPI where available, otherwise CPU with XNNPACK
    options.setUseNNAPI(true)
    options.setUseXNNPACK(true)
}
options.setNumThreads(4)
val interpreter = Interpreter(
    FileUtil.loadMappedFile(context, "model_fp16.tflite"),
    options
)
NNAPI is unstable in practice: on some devices it gives a 5× speedup, on others it fails with "NNAPIDelegate: Failed to invoke the model due to incompatible operations". Always wrap it in try/catch with a CPU fallback:
try {
    options.setUseNNAPI(true)
    interpreter = Interpreter(modelBuffer, options)
    // Test run: some NNAPI failures only show up when the model is invoked
    interpreter.run(testInput, testOutput)
} catch (e: Exception) {
    Log.w("ML", "NNAPI failed, falling back to CPU: ${e.message}")
    // Rebuild the interpreter without NNAPI
    options.setUseNNAPI(false)
    interpreter = Interpreter(modelBuffer, options)
}
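The same try-then-fall-back idea extends to a chain of more than two backends. The following is a minimal, language-agnostic sketch written in Python to match the conversion scripts above. The function first_working, the builder functions, and the backend names are all hypothetical; they stand in for whatever code constructs an interpreter with a given delegate and runs one test inference.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_working(
    candidates: Sequence[Tuple[str, Callable[[], T]]],
    smoke_test: Callable[[T], None],
) -> Tuple[str, T]:
    """Return (name, instance) for the first candidate that builds and passes smoke_test.

    Candidates are tried in order, so list the fastest backend first and the
    always-available CPU path last.
    """
    errors: List[str] = []
    for name, build in candidates:
        try:
            instance = build()      # may raise: delegate not supported
            smoke_test(instance)    # may raise: failure only visible at invoke time
            return name, instance
        except Exception as exc:    # broad on purpose: any failure means "try the next one"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no backend worked: " + "; ".join(errors))


# Hypothetical stand-ins for building an interpreter with different delegates.
def build_gpu():
    raise RuntimeError("GPU delegate not supported")


def build_nnapi():
    return {"backend": "nnapi", "runs": False}


def build_cpu():
    return {"backend": "cpu", "runs": True}


def run_once(interpreter):
    # Stand-in for a test inference on a known input.
    if not interpreter["runs"]:
        raise RuntimeError("failed to invoke the model")


chosen, interpreter = first_working(
    [("gpu", build_gpu), ("nnapi", build_nnapi), ("cpu", build_cpu)],
    run_once,
)
print(chosen)  # prints "cpu"
```

Collecting the error messages rather than discarding them makes it possible to log which delegates failed on which device, which is useful when evaluating a device fleet.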
Buffer Management: ByteBuffer vs TensorBuffer
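Direct ByteBuffer management is faster but verbose. TensorBuffer from org.tensorflow.lite.support is more convenient:
// Via TFLite Support Library (recommended)
import org.tensorflow.lite.DataType
import org.tensorflow.lite.support.common.ops.NormalizeOp
import org.tensorflow.lite.support.image.ImageProcessor
import org.tensorflow.lite.support.image.TensorImage
import org.tensorflow.lite.support.image.ops.ResizeOp
import org.tensorflow.lite.support.tensorbuffer.TensorBuffer

val imageProcessor = ImageProcessor.Builder()
    .add(ResizeOp(224, 224, ResizeOp.ResizeMethod.BILINEAR))
    .add(NormalizeOp(127.5f, 127.5f)) // (x - 127.5) / 127.5 maps [0, 255] to [-1, 1]
    .build()
val tensorImage = TensorImage(DataType.FLOAT32)
tensorImage.load(bitmap)
val processedImage = imageProcessor.process(tensorImage)
// Run inference into a fixed-size output buffer (1 × 1000 class scores)
val outputBuffer = TensorBuffer.createFixedSize(intArrayOf(1, 1000), DataType.FLOAT32)
interpreter.run(processedImage.buffer, outputBuffer.buffer)
// Result: index of the highest-scoring class, or -1 if the output is empty
val probabilities = outputBuffer.floatArray
val topIndex = probabilities.indices.maxByOrNull { probabilities[it] } ?: -1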
ResizeOp on the CPU is surprisingly slow for large images (Full HD → 224×224 takes 20–40 ms). Alternatives: pre-resize with Bitmap.createScaledBitmap(), use RenderScript (deprecated), or request a smaller output size from Camera2.
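The arithmetic behind the normalization step and the top-index lookup in the snippet above is small enough to check by hand. The sketch below reproduces it in plain Python; the helper names normalize and top_k are illustrative and not part of any TFLite API.

```python
def normalize(pixel: float, mean: float = 127.5, std: float = 127.5) -> float:
    """Same arithmetic as a normalize op with the given mean and std: (x - mean) / std."""
    return (pixel - mean) / std


def top_k(scores, k=3):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


print([normalize(p) for p in (0, 127.5, 255)])   # [-1.0, 0.0, 1.0]
print(top_k([0.05, 0.70, 0.10, 0.15], k=2))      # [1, 3]
```

The mean and standard deviation must match whatever preprocessing the model was trained with; 127.5 and 127.5 are correct only for models that expect inputs in [-1, 1].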
CameraX Integration
val imageAnalyzer = ImageAnalysis.Builder()
    .setTargetResolution(Size(640, 480))
    .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST) // don't queue frames
    .build()
    .also {
        it.setAnalyzer(cameraExecutor) { imageProxy ->
            try {
                val bitmap = imageProxy.toBitmap()
                runInference(bitmap)
            } finally {
                imageProxy.close() // CRITICAL: otherwise CameraX stops delivering frames
            }
        }
    }
Calling imageProxy.close() in a finally block is not optional. If a frame is not closed, ImageAnalysis stops delivering new frames within seconds. This is a typical bug that surfaces only in long-running tests.
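To see what the keep-only-latest strategy means for a slow model, here is a deliberately simplified simulation in Python. It is a model of the policy, not of CameraX internals: it assumes frames arrive at a fixed interval and that the analyzer, once free, always takes the newest frame that has already arrived. The function name and parameters are illustrative.

```python
def frames_analyzed(num_frames: int, frame_interval_ms: int, analysis_ms: int):
    """Indices of frames that reach the analyzer under a keep-only-latest policy.

    Simplified model: frame i arrives at i * frame_interval_ms, and whenever the
    analyzer becomes free it takes the newest frame that has already arrived.
    """
    analyzed = []
    free_at = 0          # time at which the analyzer finishes its current frame
    next_frame = 0       # oldest frame not yet analyzed or dropped
    while next_frame < num_frames:
        start = max(next_frame * frame_interval_ms, free_at)
        newest = min(num_frames - 1, start // frame_interval_ms)
        analyzed.append(newest)          # frames next_frame..newest-1 are dropped
        free_at = start + analysis_ms    # the frame is released at this point
        next_frame = newest + 1
    return analyzed


print(frames_analyzed(10, 33, 100))   # [0, 3, 6, 9]
print(frames_analyzed(5, 33, 10))     # [0, 1, 2, 3, 4]
```

With a 100 ms model on a roughly 30 fps stream, about two of every three frames are dropped, but the analyzer always works on fresh data instead of falling further and further behind a queue.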
Numerical Accuracy After Conversion
After conversion and quantization, always check model accuracy on a test set. FP16 quantization usually loses <1% accuracy, INT8 loses 1–3%. If the losses are larger, the quantization calibration dataset may be too small, or some layers of the model may be particularly sensitive to quantization.
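Why a small calibration set hurts can be seen from the affine INT8 scheme itself, where a real value is approximated as scale × (q − zero_point). The sketch below is a plain-Python illustration of that formula, not the converter's actual implementation; the function names are made up for this example. It calibrates on the range [-1, 1] and then quantizes a value the calibration data never covered.

```python
def quant_params(x_min: float, x_max: float, qmin: int = -128, qmax: int = 127):
    """Scale and zero point for affine INT8 quantization of the range [x_min, x_max]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # the range must contain 0
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))  # values outside the calibrated range are clipped


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


def round_trip_error(x, scale, zero_point):
    """Absolute error introduced by quantizing x and converting it back."""
    return abs(dequantize(quantize(x, scale, zero_point), scale, zero_point) - x)


# Calibration only ever saw activations in [-1, 1].
scale, zp = quant_params(-1.0, 1.0)
for x in (0.3, -0.5, 3.0):
    print(x, round_trip_error(x, scale, zp))
```

Values inside the calibrated range come back with an error of at most half a quantization step, while 3.0 is clipped to roughly 1.0 and loses about 2.0. A calibration set that misses realistic activation ranges produces exactly this kind of clipping.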
To verify, compare the outputs of the original PyTorch model and the TFLite model on the same inputs:
# Check that the outputs match
import numpy as np

# run_tflite is a helper (not shown) that feeds the input to the TFLite interpreter and returns its output
original_out = pytorch_model(test_input).detach().numpy()
tflite_out = run_tflite(interpreter, test_input)
print(f"Max difference: {np.max(np.abs(original_out - tflite_out))}")
# Typical: < 0.01 for FP16, < 0.05 for INT8
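A maximum difference alone does not say whether predictions changed, so it can help to track top-1 agreement alongside it. The following is a small sketch in plain Python, assuming each model output is a list of class scores; the function names and the sample numbers are invented for illustration and are not measurements.

```python
def max_abs_diff(ref, test):
    """Largest element-wise absolute difference between two score vectors."""
    return max(abs(r - t) for r, t in zip(ref, test))


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def compare_batches(ref_batch, test_batch):
    """Return (worst max-abs-diff, top-1 agreement rate) over a batch of output vectors."""
    worst = max(max_abs_diff(r, t) for r, t in zip(ref_batch, test_batch))
    agree = sum(argmax(r) == argmax(t) for r, t in zip(ref_batch, test_batch))
    return worst, agree / len(ref_batch)


# Made-up outputs: the second sample is a close call between class 0 and class 1.
reference = [[0.10, 0.80, 0.10], [0.45, 0.40, 0.15]]
converted = [[0.11, 0.79, 0.10], [0.42, 0.44, 0.14]]
worst, agreement = compare_batches(reference, converted)
print(f"worst diff={worst:.3f}, top-1 agreement={agreement:.0%}")
```

In this made-up batch the worst difference is about 0.04, yet the predicted class flips on the second sample, which is why both numbers are worth reporting.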
Model Placement
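Ship the .tflite file in assets/. Either copy it to filesDir on first run, or map it straight from assets as a MappedByteBuffer for zero-copy loading:
import android.content.Context
import java.io.FileInputStream
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

fun loadModelFile(context: Context, filename: String): MappedByteBuffer =
    // use { } closes the descriptor and stream; the mapping itself stays valid
    context.assets.openFd(filename).use { fd ->
        FileInputStream(fd.fileDescriptor).use { stream ->
            stream.channel.map(
                FileChannel.MapMode.READ_ONLY,
                fd.startOffset,
                fd.declaredLength
            )
        }
    }
With a MappedByteBuffer the OS does not copy the file into RAM on load; it maps the file into the process address space and pages it in on demand. For large models (50–200 MB) the difference is significant. Note that openFd() works only if the asset is stored uncompressed in the APK.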
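The startOffset and declaredLength arguments exist because the model is a region inside a larger file (the APK), not a file of its own. The sketch below shows the same idea with Python's standard mmap module; map_region and the fake file contents are illustrative only. Unlike FileChannel.map, Python's mmap requires the offset to be aligned to the allocation granularity, so the helper aligns it and slices off the extra leading bytes.

```python
import mmap
import os
import tempfile


def map_region(path: str, offset: int, length: int) -> memoryview:
    """Map bytes [offset, offset + length) of a file read-only without copying them."""
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (offset // gran) * gran      # mmap offsets must be granularity-aligned
    delta = offset - aligned
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), delta + length, access=mmap.ACCESS_READ, offset=aligned)
    # The mapping stays valid after the file object is closed.
    return memoryview(mm)[delta:delta + length]


# Stand-in for an APK: some leading bytes, then the "model" payload.
header = b"APK-HEADER-BYTES"
payload = b"fake-tflite-model-bytes"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(header + payload)

model_view = map_region(tmp.name, offset=len(header), length=len(payload))
data_ok = bytes(model_view) == payload
print(data_ok)  # True

# Clean up: release the view, close the mapping, delete the temp file.
mapping = model_view.obj
model_view.release()
mapping.close()
os.remove(tmp.name)
```

The view is backed by the page cache, so bytes are read from disk only when touched; calling bytes() here copies them purely to check the contents.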
Process
Convert from source format → evaluate delegates on target devices → integrate with fallback logic → test numerical accuracy → profile via Android Profiler + TFLite Model Benchmark Tool.
Timeline Estimates
Basic TFLite model integration on Android takes 1–2 weeks. With multi-delegate logic, a CameraX pipeline, and testing on a device fleet, it takes 3–5 weeks.







