Machine Learning Development (TensorFlow Lite) in Mobile Applications
TensorFlow Lite is one of the main on-device ML runtimes for mobile, alongside Core ML and ONNX Runtime. Its strength lies in control: you choose the delegate, the quantization level, and the model loading method. Its weakness is that same flexibility: the wrong delegate or an unoptimized model yields worse performance than a cloud API.
Delegates: Where Most Mistakes Happen
TfLiteGpuDelegateV2 on Android gives real gains only with batch inference or heavy convolutional models (EfficientDet, MobileNet SSD). On light models (MobileNetV2 with 224×224 input), the GPU delegate is slower than the CPU due to memory-to-GPU transfer overhead. We profiled this on a Xiaomi Redmi Note 11: CPU 78 ms, GPU 112 ms. Takeaway: always measure on target devices, not flagships.
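Numbers like the Redmi Note 11 comparison come from exactly this kind of measurement. A minimal, delegate-agnostic timing harness might look like the sketch below (standard library only; `run_inference` stands in for your `interpreter.invoke()` call on the device):

```python
import statistics
import time

def benchmark(run_inference, warmup=5, iterations=50):
    """Time an inference callable; report median and ~p95 latency in ms."""
    for _ in range(warmup):          # warm caches and delegate initialization
        run_inference()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[int(len(samples) * 0.95) - 1],
    }

# Stand-in workload for illustration; swap in the real interpreter call.
stats = benchmark(lambda: sum(range(10_000)))
print(f"median={stats['median_ms']:.3f} ms  p95={stats['p95_ms']:.3f} ms")
```

Run the same harness once per delegate configuration (CPU, GPU, NNAPI) on each target device; comparing medians, not single runs, is what makes the CPU-vs-GPU verdict trustworthy.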
The NNAPI delegate (NnApiDelegate) can in theory use hardware accelerators (DSP, NPU), but operation support is very uneven. If the model contains non-standard ops (e.g., a custom squeeze-excitation block), NNAPI silently falls back to CPU. Verify delegation explicitly: the TFLite benchmark tool (benchmark_model with --use_nnapi=true) logs how many graph nodes were actually delegated, and the interpreter logs node replacement in logcat at startup. Interpreter.getSignatureInputs() only returns signature input names; it says nothing about where ops run.
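One way to audit op coverage before deployment is TensorFlow's model analyzer (an experimental API, assuming TF 2.8+; the tiny Keras model here is a stand-in for your real converted model). It lists every op in the graph and, with gpu_compatibility=True, flags ops the GPU delegate cannot run; NNAPI coverage still has to be confirmed on-device:

```python
import tensorflow as tf

# Stand-in model; replace with your real converted model bytes.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(4, activation="relu"),
])
tflite_bytes = tf.lite.TFLiteConverter.from_keras_model(model).convert()

# Prints each op in the graph; gpu_compatibility=True additionally marks
# ops the GPU delegate cannot handle, so fallbacks surface before release.
tf.lite.experimental.Analyzer.analyze(
    model_content=tflite_bytes, gpu_compatibility=True)
```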
On iOS, TFLite offers CoreMLDelegate, a wrapper over Core ML. With a deployment target of iOS 12 or later, CoreMLDelegate automatically leverages the Neural Engine for supported layers; unsupported layers fall back to TFLite's CPU interpreter. Mixed execution works, but latency is unpredictable without profiling.
Model Optimization Before Deployment
Quantization is mandatory for mobile. Three options:
- Post-training dynamic range quantization — simplest, weights compressed to INT8, activations remain float. Model size shrinks ~4x, CPU speed improves 20–40%.
- Post-training integer quantization — both weights and activations in INT8, requires calibration dataset. Needed for NNAPI and Edge TPU.
- Quantization-aware training (QAT) — best INT8 accuracy but requires model retraining.
tf.lite.TFLiteConverter with optimizations = [tf.lite.Optimize.DEFAULT] covers the first two: adding a representative_dataset switches it from dynamic range to full integer quantization. QAT is configured via tfmot.quantization.keras.quantize_model from the TensorFlow Model Optimization Toolkit.
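A sketch of both post-training paths, assuming a trained Keras model and a representative_dataset generator you supply (the toy Dense model and random calibration data below are placeholders):

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([        # stand-in; use your trained model
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(4),
])

# 1) Post-training dynamic range quantization: weights -> INT8,
#    activations stay float.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_model = converter.convert()

# 2) Full integer quantization: requires a calibration dataset that
#    matches the real input distribution (random data is illustrative only).
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 8).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # needed for NNAPI / Edge TPU
converter.inference_output_type = tf.int8
int8_model = converter.convert()
```

The calibration loop is what determines the activation scales, so a hundred representative samples from production data matter more than any converter flag.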
Real example: a plant recognition app (EfficientNetB0 classifier, 29 MB float32). After full integer quantization: 7.4 MB; inference with NNAPI on a Pixel 7: 18 ms versus 95 ms for float32 on CPU. On a Snapdragon 778G, NNAPI fell back to CPU for the unsupported LEAKY_RELU op; we kept that partition on CPU and enabled FP16 execution for the delegated rest via NnApiDelegate.Options().setAllowFp16(true).
App Integration
On Android use org.tensorflow:tensorflow-lite + org.tensorflow:tensorflow-lite-gpu via Gradle. For Task Library support (ImageClassifier, ObjectDetector), add org.tensorflow:tensorflow-lite-task-vision. Task Library handles image preprocessing (resize, normalization), eliminating significant boilerplate.
On iOS, install via CocoaPods (pod 'TensorFlowLiteSwift') or Swift Package Manager (TFLite 2.13+). Wrap inference in an actor for thread safety:
import TensorFlowLite

actor TFLiteInferenceService {
    private let interpreter: Interpreter
    init(modelPath: String) throws {
        interpreter = try Interpreter(modelPath: modelPath)
        try interpreter.allocateTensors()   // size tensors once, up front
    }
    func classify(pixelBuffer: CVPixelBuffer) throws -> [Float] { ... }
}
Load models from the app bundle, or from a URL with SHA-256 verification. For OTA updates, store the model URL in Firebase Remote Config and download via URLSession.downloadTask in the background.
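The verification step itself is platform-neutral; here is a minimal sketch of the pinned-hash check in Python (on iOS the equivalent would use CryptoKit's SHA256):

```python
import hashlib
import hmac

def verify_model(model_bytes: bytes, expected_sha256: str) -> bool:
    """Reject a downloaded model unless its digest matches the pinned hash."""
    digest = hashlib.sha256(model_bytes).hexdigest()
    # Constant-time comparison avoids leaking how many leading hex chars match.
    return hmac.compare_digest(digest, expected_sha256.lower())

fake_model = b"tflite-model-bytes"              # stand-in for downloaded file
pinned = hashlib.sha256(fake_model).hexdigest() # ship this hash with the app
print(verify_model(fake_model, pinned))         # True
print(verify_model(fake_model, "0" * 64))       # False: corrupted/tampered
```

Ship the expected hash through a channel you control (app binary or signed config), not alongside the model file itself, or the check verifies nothing.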
Timeline
Integrating a ready-made TFLite model, including delegate selection and basic optimization: about 1 week. A full cycle with conversion, quantization, testing on target devices, and OTA updates: 2–3 weeks. Cost is calculated individually after reviewing the model and requirements.