ML Model Optimization (Quantization) for Mobile Device
Quantization converts model weights from float32 to lower-precision formats: float16, int8, int4. ResNet-50 weighs about 98 MB in FP32; after int8 quantization it shrinks to roughly 25 MB. Inference on a mobile CPU speeds up 2–4x thanks to the smaller memory footprint and ARM NEON/SVE integer instructions.
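The size figures follow directly from parameter count times bytes per weight; a quick sanity check (assuming the commonly cited ~25.6 M parameters for ResNet-50):

```python
# Rough on-disk weight size for ResNet-50 at different precisions.
PARAMS = 25_600_000  # ~25.6 M parameters (approximate)

for name, bytes_per_weight in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {PARAMS * bytes_per_weight / 2**20:.1f} MB")
# fp32 lands near the 98 MB figure, int8 near 25 MB
```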
But naive quantization often degrades accuracy more than acceptable. Proper quantization means selecting method, analyzing sensitive layers, and verifying degradation.
Quantization Types and When to Apply Them
Post-Training Quantization (PTQ) — quantize already-trained models without retraining. Two approaches:
- Dynamic quantization — weights in int8, activations computed in float32 at runtime. Simple, requires no calibration data. Works well for RNN/Transformer (BERT, LLM). Less gain for CNN.
- Static quantization — both weights and activations in int8. Requires calibration dataset (100–500 representative examples). Faster than dynamic, but needs calibration.
Quantization-Aware Training (QAT) — retrain model with simulated quantization. Weights adapt to reduced precision. Best quality, but requires training data and GPU time.
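Of the three options, dynamic PTQ is the cheapest to try: in PyTorch it is essentially one call. A minimal sketch on a toy model (`quantize_dynamic` swaps the listed module types for dynamically quantized versions):

```python
import torch
import torch.nn as nn

# Toy stand-in for a Transformer/RNN-style model dominated by Linear layers.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Weights stored as int8; activations quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = quantized(torch.randn(2, 128))  # inference API is unchanged
```

No calibration data is needed, which is exactly why dynamic PTQ is the usual first experiment before investing in static PTQ or QAT.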
```python
# PyTorch: static PTQ via torch.quantization
import torch
from torch.quantization import get_default_qconfig

model.eval()
model.qconfig = get_default_qconfig('fbgemm')  # x86; for ARM use 'qnnpack'
torch.quantization.prepare(model, inplace=True)

# Calibration: run the calibration dataset through the prepared model
with torch.no_grad():
    for batch in calibration_loader:
        model(batch)

torch.quantization.convert(model, inplace=True)
# Now model contains quantized layers
```
For mobile Android (ARM), use the 'qnnpack' qconfig, not 'fbgemm'. This selects observers and quantization parameters that match the QNNPACK backend, whose int8 kernels use ARM NEON instructions.
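A quantized PyTorch model still has to be packaged for the mobile runtime. A minimal sketch of the usual PyTorch Mobile path (TorchScript plus the lite interpreter; the tiny stand-in model and the file name `model.ptl` are placeholders):

```python
import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

# Stand-in for the quantized model produced above.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).eval()

scripted = torch.jit.script(model)              # freeze the graph as TorchScript
mobile = optimize_for_mobile(scripted)          # fuse ops, drop training-only paths
mobile._save_for_lite_interpreter("model.ptl")  # loadable from the Android/iOS runtime
```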
TFLite Quantization: Full Integer
```python
import numpy as np
import tensorflow as tf

# Convert with full int8 (activations + weights)
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Calibration generator is critical for static quantization accuracy
def representative_dataset():
    for sample in calibration_data[:500]:
        yield [sample.astype(np.float32)]

converter.representative_dataset = representative_dataset
tflite_model = converter.convert()
```
Full int8 models run on NNAPI and the Hexagon DSP, where FP16 support is often missing. On a Snapdragon 778G via Hexagon, a properly quantized INT8 model runs 5–8x faster than on the CPU.
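One consequence of inference_input_type = tf.int8: the caller must quantize inputs with the scale and zero point reported by the interpreter's get_input_details(). The affine mapping itself is simple; a numpy sketch with made-up quantization parameters:

```python
import numpy as np

def quantize(x, scale, zero_point):
    # Affine quantization: q = round(x / scale) + zero_point, clamped to int8 range.
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Example values; real ones come from interpreter.get_input_details().
scale, zero_point = 0.0078125, 0

x = np.array([0.5, -1.0, 0.996], dtype=np.float32)
q = quantize(x, scale, zero_point)
roundtrip = dequantize(q, scale, zero_point)  # approximates x to within one step
```

The same formula (with the output tensor's scale/zero point) applies in reverse when reading int8 outputs back as floats.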
Core ML Quantization on iOS
```python
import numpy as np
import coremltools as ct
from coremltools.optimize.coreml import (
    OptimizationConfig,
    OpLinearQuantizerConfig,
    linear_quantize_weights,
)

# Load an already-converted Core ML model
mlmodel = ct.models.MLModel("model_fp32.mlpackage")

# Configuration: 8-bit linear weight quantization
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric",
        dtype=np.int8,
        granularity="per_channel",  # per_channel is more accurate than per_tensor for CNNs
    )
)

compressed_model = linear_quantize_weights(mlmodel, config)
compressed_model.save("model_int8.mlpackage")
```
per_channel quantization assigns a separate scale factor to each output channel of a convolution layer. It is significantly more accurate than per_tensor (a single scale for the whole layer) and only slightly slower; for CNNs it is usually worth it.
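The gap between the two granularities is easy to demonstrate: when channels have very different weight ranges, a single per-tensor scale wastes most of the int8 grid on the widest channel. A numpy sketch (synthetic weights, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two "channels" with very different dynamic ranges, as often happens in conv layers.
w = np.stack([rng.normal(0, 0.01, 1000), rng.normal(0, 1.0, 1000)])

def quant_error(weights, scales):
    # Symmetric int8 quantization with the given scale(s), then mean abs error.
    q = np.clip(np.round(weights / scales), -127, 127)
    return np.abs(weights - q * scales).mean()

per_tensor = np.abs(w).max() / 127                        # one scale for the whole tensor
per_channel = np.abs(w).max(axis=1, keepdims=True) / 127  # one scale per channel

print(quant_error(w, per_tensor), quant_error(w, per_channel))
# the small-range channel is nearly destroyed by the per-tensor scale
```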
Sensitive Layer Analysis
Not all layers tolerate quantization equally. First and last layers, plus attention layers in transformers — often most sensitive. Tool: per-layer sensitivity analysis.
```python
# Check accuracy degradation when quantizing each layer individually.
# evaluate(), get_all_quantizable_layers() and quantize_single_layer() are
# project-specific helpers, not library functions.
baseline_accuracy = evaluate(float_model, test_loader)

results = []
for layer_name in get_all_quantizable_layers(float_model):
    # Quantize only this layer, keep everything else in FP32
    single_layer_model = quantize_single_layer(float_model, layer_name)
    layer_accuracy = evaluate(single_layer_model, test_loader)
    results.append((layer_name, baseline_accuracy - layer_accuracy))

# Most sensitive layers first
for layer_name, sensitivity in sorted(results, key=lambda r: -r[1]):
    print(f"{layer_name}: sensitivity={sensitivity:.4f}")
```
Keep high-sensitivity layers in FP32: this is mixed precision quantization. Quantize the rest to INT8. Typically 5–10% of the "heavy" layers stay FP32, so most of the ~75% size reduction of pure INT8 is retained while accuracy is preserved.
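In PyTorch eager-mode quantization, keeping a specific layer in FP32 amounts to clearing its qconfig before prepare. A minimal sketch on a toy model (which layer to exempt comes from your sensitivity analysis):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 8, 3)).eval()
model.qconfig = torch.quantization.get_default_qconfig('qnnpack')

# Suppose sensitivity analysis flagged the first conv: leave it in FP32.
model[0].qconfig = None

torch.quantization.prepare(model, inplace=True)
with torch.no_grad():                      # calibration pass
    model(torch.randn(4, 3, 32, 32))
torch.quantization.convert(model, inplace=True)

print(model)  # first conv remains a float nn.Conv2d, the last conv is quantized
```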
Verification: What and How to Check
After quantization, mandatory checks:
- Accuracy on a test dataset — compare top-1/top-5 accuracy with the original. Acceptable degradation: FP16 under 0.5%, INT8 under 2%. If higher, move to QAT or mixed precision.
- Numerical error — compare outputs on identical inputs between the float and quantized models. MSE < 0.01 is typically acceptable.
- Speed on real devices, not simulators — Xcode Instruments → Core ML profiling on iOS; the TFLite benchmark tool (benchmark_model, pushed and run via adb) on Android.
- Crash test — varied inputs and edge cases (all-black image, very bright image, unusual aspect ratios). INT8 models sometimes overflow on extreme inputs.
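The numerical-error check above boils down to an element-wise comparison of outputs; a small sketch (`float_out` and `quant_out` stand for outputs of the two models on the same input):

```python
import numpy as np

def output_drift(float_out, quant_out):
    # MSE plus worst-case deviation between float and quantized outputs.
    float_out = np.asarray(float_out, dtype=np.float32)
    quant_out = np.asarray(quant_out, dtype=np.float32)
    mse = float(np.mean((float_out - quant_out) ** 2))
    max_abs = float(np.max(np.abs(float_out - quant_out)))
    return mse, max_abs

mse, max_abs = output_drift([0.10, 0.85, 0.05], [0.11, 0.83, 0.06])
print(f"MSE={mse:.5f}  max|diff|={max_abs:.3f}")  # compare MSE against the 0.01 threshold
```

MSE alone can hide a large error on a single logit, so tracking the max absolute difference alongside it is cheap insurance.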
Practical Case Study
YOLOv8n object detection in FP32 — 6.3 MB, 45 ms on iPhone 13. After Core ML INT8 quantization — 1.8 MB, 12 ms. mAP dropped from 37.3 to 36.1 — within acceptable for most tasks. On Snapdragon 8 Gen 1 via TFLite INT8 + NNAPI — 8 ms.
Process
Source model audit → method selection (PTQ/QAT, INT8/FP16) → calibration → sensitive layer analysis → mixed precision if needed → accuracy verification → speed measurements on target devices.
Timeline Estimates
PTQ for single model with verification — 1–2 weeks. QAT with full retraining and testing cycle — 3–6 weeks depending on dataset size.