ML Model Conversion to TensorFlow Lite Format for Android
TFLite conversion isn't just weight conversion. It's quantization format selection, graph optimization, choosing operation sets compatible with target Android versions, and verifying that numerical results match the original model. Each step has specific pitfalls.
Conversion Paths
From TensorFlow SavedModel — direct path. From PyTorch — via ONNX intermediate format. From JAX — via TensorFlow export.
# Path 1: TF SavedModel → TFLite (most reliable)
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir/")
tflite_model = converter.convert()
# Path 2: Keras model → TFLite
converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
tflite_model = converter.convert()
# Path 3: PyTorch → ONNX → TF SavedModel → TFLite
# First export from PyTorch: torch.onnx.export(model, example_input, "model.onnx", opset_version=17)
# Then convert ONNX → TF SavedModel with onnx-tf
# (note: tf2onnx goes the other direction, TF → ONNX)
import subprocess
subprocess.run(["onnx-tf", "convert",
                "-i", "model.onnx",
                "-o", "model_tf"], check=True)
converter = tf.lite.TFLiteConverter.from_saved_model("model_tf/")
The ONNX path introduces additional potential incompatibilities; use it only when no direct path is available.
Quantization During Conversion
# FP16 — minimal degradation, 2x smaller model, GPU delegate acceleration
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()
# Dynamic INT8 — weights int8, activations float32. No calibration dataset needed.
converter2 = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir/")
converter2.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_dynamic_int8 = converter2.convert()
# Full INT8 — both weights and activations. Requires calibration dataset. Needed for Hexagon DSP.
def representative_dataset():
    dataset = load_calibration_data()  # 100-500 representative examples
    for sample in dataset:
        yield [sample[np.newaxis, :].astype(np.float32)]
converter3 = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir/")
converter3.optimizations = [tf.lite.Optimize.DEFAULT]
converter3.representative_dataset = representative_dataset
converter3.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter3.inference_input_type = tf.uint8 # or tf.int8
converter3.inference_output_type = tf.uint8
tflite_full_int8 = converter3.convert()
With full INT8 and inference_input_type = tf.uint8, input data is passed as raw uint8 (0–255), with no normalization to float32 on the Java/Kotlin side. This removes a preprocessing step, but requires careful alignment with the model's quantization parameters.
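Aligning quantization parameters means applying q = round(x / scale) + zero_point with the scale and zero_point the interpreter reports in input_details[0]['quantization']. A minimal sketch; the scale/zero_point values below are illustrative, assuming inputs normalized to [0, 1]:

```python
import numpy as np

# scale and zero_point normally come from
# interpreter.get_input_details()[0]['quantization']; the values here
# are illustrative for a model trained on inputs in [0, 1]
scale, zero_point = 1.0 / 255.0, 0

def quantize_input(x_float):
    # float32 → uint8, mirroring q = round(x / scale) + zero_point
    q = np.round(x_float / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize_output(q_out, out_scale, out_zero_point):
    # inverse mapping for a quantized output tensor
    return (q_out.astype(np.float32) - out_zero_point) * out_scale
```

The output tensor gets the inverse treatment with its own scale/zero_point from output_details.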
Unsupported Operations
Not all TF/PyTorch operations have TFLite builtin equivalents. If conversion fails, the error message lists the unsupported ops; a fallback can be enabled:
# Allow fallback to full TF ops for operations missing from TFLite builtins
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS  # fallback to TensorFlow operations
]
SELECT_TF_OPS pulls in a subset of TF operations, which grows the TFLite runtime binary (~5 MB) and slows some ops. Where possible, rewrite the model to avoid SELECT_TF_OPS: staying within builtins keeps NNAPI and Hexagon delegates usable.
Custom operation via C++:
// Custom operator registration (C++, NDK side)
TfLiteRegistration* GetMyCustomOpRegistration() {
  static TfLiteRegistration reg = {};
  reg.prepare = [](TfLiteContext* ctx, TfLiteNode* node) -> TfLiteStatus {
    // Shape validation, output tensor allocation
    return kTfLiteOk;
  };
  reg.invoke = [](TfLiteContext* ctx, TfLiteNode* node) -> TfLiteStatus {
    // Inference implementation
    return kTfLiteOk;
  };
  reg.custom_name = "MyCustomOp";
  return &reg;
}
// Register with the op resolver before building the interpreter:
tflite::ops::builtin::BuiltinOpResolver resolver;
resolver.AddCustom("MyCustomOp", GetMyCustomOpRegistration());
The model is then invoked from Kotlin via JNI. Non-trivial, but sometimes the only path.
Numerical Accuracy Verification
# Compare PyTorch/TF and TFLite outputs on identical inputs
import numpy as np
# TF original
tf_output = tf_model(test_input).numpy()
# TFLite
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
interpreter.set_tensor(input_details[0]['index'], test_input)
interpreter.invoke()
tflite_output = interpreter.get_tensor(output_details[0]['index'])
print(f"Max abs diff: {np.max(np.abs(tf_output - tflite_output))}")
print(f"MSE: {np.mean((tf_output - tflite_output)**2)}")
# FP32: < 1e-5, FP16: < 1e-2, INT8: < 0.05
If the difference exceeds these thresholds, look for issues in input normalization, incorrect quantization parameters, or an operation implemented with a different algorithm in TFLite.
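These checks can be wrapped in a small helper (a hypothetical utility, not part of TFLite) that applies the per-precision tolerances above:

```python
import numpy as np

# Illustrative tolerance table matching the rules of thumb above
TOLERANCES = {"fp32": 1e-5, "fp16": 1e-2, "int8": 0.05}

def outputs_match(reference, converted, precision="fp32"):
    # Summarize divergence between original-framework and TFLite outputs
    diff = np.abs(np.asarray(reference) - np.asarray(converted))
    return {
        "max_abs_diff": float(diff.max()),
        "mse": float(np.mean(diff ** 2)),
        "ok": bool(diff.max() < TOLERANCES[precision]),
    }
```

Run it on several inputs, including edge cases (all-zero input, saturated pixels), not just one sample.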
Object Detection Specifics
YOLO, SSD, and EfficientDet contain NMS (Non-Maximum Suppression) postprocessing. TFLite has no general builtin NMS op (unlike Core ML's Detection Output; SSD export pipelines rely on the custom TFLite_Detection_PostProcess op). Options:
- Remove NMS from model, implement in Java/Kotlin post-inference
- Use TFLite Task Library, which contains a ready ObjectDetection API with NMS
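For the first option, the NMS step removed from the graph is reimplemented post-inference. A minimal greedy NMS sketch in NumPy (boxes as [x1, y1, x2, y2]), which a Kotlin port would mirror:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # Greedy non-maximum suppression. boxes: (N, 4) as [x1, y1, x2, y2].
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # Drop boxes that overlap the kept box too strongly
        order = rest[iou <= iou_threshold]
    return keep
```

Apply it per class after score thresholding; class-agnostic NMS over all boxes at once changes results for overlapping objects of different classes.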
// TFLite Task Library: ObjectDetector (includes NMS)
val options = ObjectDetector.ObjectDetectorOptions.builder()
    .setScoreThreshold(0.5f)
    .setMaxResults(20)
    .build()
val detector = ObjectDetector.createFromFileAndOptions(context, "detector.tflite", options)
val image = TensorImage.fromBitmap(inputBitmap)
val results: List<Detection> = detector.detect(image)
for (detection in results) {
    val box = detection.boundingBox  // RectF
    val label = detection.categories.first().label
    val score = detection.categories.first().score
}
Task Library supports only specific model signatures — model must follow TFLite Model Metadata format.
Model Metadata for Interpreter and Task Library
from tflite_support.metadata_writers import image_classifier
from tflite_support.metadata_writers import writer_utils

# Create classifier metadata
writer = image_classifier.MetadataWriter.create_for_inference(
    writer_utils.load_file("model.tflite"),
    input_norm_mean=[0.0],
    input_norm_std=[255.0],
    label_file_paths=["labels.txt"]
)
tflite_with_metadata = writer.populate()
writer_utils.save_file(tflite_with_metadata, "model_with_metadata.tflite")
Without metadata, the TFLite Task Library has no auto-normalization and no output-to-label mapping, and generally cannot load the model at all. With metadata, both are handled automatically.
Benchmark on Devices
# ADB: run the TFLite Benchmark Tool directly on device
# (benchmark_model is a prebuilt binary from the TFLite repo; push it alongside the model)
adb push benchmark_model /data/local/tmp/
adb shell chmod +x /data/local/tmp/benchmark_model
adb push model.tflite /data/local/tmp/
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/model.tflite \
  --use_gpu=true \
  --num_threads=4 \
  --num_runs=50
# Output: average latency, min/max, warmup time
The TFLite Benchmark Tool is an official Google tool; it gives realistic numbers free of JVM overhead and Android UI interference. Use it for delegate comparison (GPU vs NNAPI vs CPU).
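The tool's summary line can be scraped in CI scripts. A parsing sketch; the log format below matches recent benchmark_model builds, but treat the exact wording as an assumption:

```python
import re

def parse_avg_inference_us(log_text):
    # benchmark_model prints a summary line like:
    # "Inference timings in us: Init: 4200, First inference: 31000,
    #  Warmup (avg): 25000, Inference (avg): 19500.4"
    m = re.search(r"Inference \(avg\): ([\d.]+)", log_text)
    return float(m.group(1)) if m else None
```

Track the parsed value per device and delegate to catch latency regressions between model revisions.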
Process
Conversion path selection → conversion with needed quantization level → accuracy verification → metadata addition → testing on device range via Benchmark Tool → Android app integration.
Timeline Estimates
Direct TF/Keras model conversion with verification — 3–7 days. Conversion via ONNX, custom operations, metadata addition, full testing — 2–4 weeks.