ML model quantization optimization for mobile device

NOVASOLUTIONS.TECHNOLOGY develops, supports, and maintains iOS, Android, and PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular stores such as Google Play, App Store, Amazon, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.

Complexity: Complex · Estimated time: ~3–5 business days
Latest works
  • Development of a mobile application for FEEDME
  • Development of a mobile application for XOOMER
  • Development of a mobile application for RHL
  • Development of a mobile application for ZIPPY
  • Development of a mobile application for Affhome
  • Development of a mobile application for the FLAVORS company

ML Model Optimization (Quantization) for Mobile Device

Quantization converts model weights from float32 to lower-precision formats: float16, int8, int4. ResNet-50 weighs 98 MB in FP32; after int8 quantization, about 25 MB. Inference speed on a mobile CPU increases 2–4x due to the reduced data volume and ARM NEON/SVE integer instructions.
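The size arithmetic follows directly from bytes per weight; a quick sketch (the ~25.6M parameter count for ResNet-50 is the commonly cited figure):

```python
def model_size_mb(num_params: int, bytes_per_weight: int) -> float:
    """Approximate weight storage in MB (1 MB = 2**20 bytes)."""
    return num_params * bytes_per_weight / 2**20

RESNET50_PARAMS = 25_600_000  # ~25.6M parameters

fp32_mb = model_size_mb(RESNET50_PARAMS, 4)  # float32: 4 bytes per weight
int8_mb = model_size_mb(RESNET50_PARAMS, 1)  # int8: 1 byte per weight
print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB")
```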

But naive quantization often degrades accuracy beyond what is acceptable. Proper quantization means choosing the right method, analyzing sensitive layers, and verifying the degradation.

Quantization Types and When to Apply Them

Post-Training Quantization (PTQ) — quantize already-trained models without retraining. Two approaches:

  • Dynamic quantization — weights in int8, activations computed in float32 at runtime. Simple, requires no calibration data. Works well for RNN/Transformer (BERT, LLM). Less gain for CNN.
  • Static quantization — both weights and activations in int8. Requires calibration dataset (100–500 representative examples). Faster than dynamic, but needs calibration.
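The dynamic variant is essentially a one-liner in PyTorch; a minimal sketch on a toy Linear-heavy model (the real target would be a BERT- or LSTM-style network):

```python
import torch
import torch.nn as nn

# Toy stand-in for an RNN/Transformer-style model dominated by Linear layers
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Dynamic PTQ: nn.Linear weights become int8; activations are quantized
# on the fly at inference time, so no calibration data is needed
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = qmodel(torch.randn(1, 128))
```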

Quantization-Aware Training (QAT) — retrain model with simulated quantization. Weights adapt to reduced precision. Best quality, but requires training data and GPU time.
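A QAT sketch in PyTorch eager mode on a toy model; `QuantWrapper` adds the quant/dequant stubs, and the short loop stands in for real fine-tuning on your dataset:

```python
import torch
import torch.nn as nn

torch.backends.quantized.engine = 'qnnpack'  # ARM-oriented int8 kernels

# Toy model; real QAT starts from the pretrained FP32 network.
# QuantWrapper adds quant/dequant stubs around the float module.
model = torch.quantization.QuantWrapper(
    nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
)

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('qnnpack')
torch.quantization.prepare_qat(model, inplace=True)  # insert fake-quant ops

opt = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(10):  # stand-in for the real fine-tuning loop
    x, y = torch.randn(8, 16), torch.randn(8, 4)
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

model.eval()
qmodel = torch.quantization.convert(model)  # fake-quant -> real int8 weights
out = qmodel(torch.randn(1, 16))
```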

# PyTorch: static PTQ in eager mode via torch.quantization
import torch
from torch.quantization import get_default_qconfig

model.eval()  # observers and module fusion require eval mode
model.qconfig = get_default_qconfig('fbgemm')  # x86; for ARM use 'qnnpack'
torch.quantization.prepare(model, inplace=True)  # insert observers

# Calibration: run representative data so observers record activation ranges
with torch.no_grad():
    for batch in calibration_loader:
        model(batch)

torch.quantization.convert(model, inplace=True)
# model now contains quantized (int8) layers

For mobile Android (ARM), use the 'qnnpack' qconfig rather than 'fbgemm'. The qconfig selects observers and quantization schemes compatible with the QNNPACK backend, whose int8 kernels use ARM NEON instructions.
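The qconfig must also be paired with the matching runtime engine; a minimal sketch using the engine names PyTorch ships:

```python
import torch

# Pick the quantized-kernel backend to match the qconfig:
# 'qnnpack' for ARM (Android/iOS), 'fbgemm'/'x86' for desktop CPUs
if 'qnnpack' in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = 'qnnpack'

# Build the qconfig for whichever engine is active
qconfig = torch.quantization.get_default_qconfig(torch.backends.quantized.engine)
print(torch.backends.quantized.engine)
```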

TFLite Quantization: Full Integer

# Convert with full int8 (activations + weights)
import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Calibration generator: critical for static quantization accuracy
def representative_dataset():
    for sample in calibration_data[:500]:
        yield [sample.astype(np.float32)]

converter.representative_dataset = representative_dataset
tflite_model = converter.convert()

Full int8 models run on NNAPI and the Hexagon DSP, where FP16 often lacks support. On a Snapdragon 778G via Hexagon, proper INT8 quantization runs 5–8x faster than the CPU.
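Whether conversion actually produced a full-integer model is easy to verify from the interpreter's tensor details; a sketch on a toy one-layer Keras model (in practice, inspect your own converted model the same way):

```python
import numpy as np
import tensorflow as tf

# Toy stand-in; in practice this is your trained network
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(2)])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
converter.representative_dataset = lambda: (
    [np.random.rand(1, 4).astype(np.float32)] for _ in range(100)
)
tflite_model = converter.convert()

# A full-integer model exposes int8 I/O plus (scale, zero_point) per tensor
interp = tf.lite.Interpreter(model_content=tflite_model)
interp.allocate_tensors()
inp = interp.get_input_details()[0]
scale, zero_point = inp['quantization']  # real = scale * (q - zero_point)
print(inp['dtype'], scale, zero_point)
```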

Core ML Quantization on iOS

import numpy as np
import coremltools as ct
from coremltools.optimize.coreml import (
    OptimizationConfig,
    OpLinearQuantizerConfig,
    linearly_quantize_weights
)

# Load already-converted Core ML model
mlmodel = ct.models.MLModel("model_fp32.mlpackage")

# Configuration: 8-bit linear weight quantization
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(
        mode="linear_symmetric",
        dtype=np.int8,
        granularity="per_channel"  # per_channel more accurate than per_tensor for CNN
    )
)

compressed_model = linearly_quantize_weights(mlmodel, config)
compressed_model.save("model_int8.mlpackage")

per_channel quantization — separate scale factor per output channel in convolution layers. Significantly more accurate than per_tensor (single scale per layer), slightly slower. Usually justified for CNN.
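The difference is easy to see numerically; a hedged numpy sketch with a deliberately imbalanced weight tensor (one output channel much larger than the other, exactly the case where per-tensor hurts):

```python
import numpy as np

rng = np.random.default_rng(0)
# Conv weights: (out_channels, in_channels, kH, kW); channel 0 gets much
# larger magnitudes than channel 1
w = rng.normal(size=(2, 3, 3, 3))
w[0] *= 10.0

def quantize(w, scale):
    """Symmetric int8 round-trip: quantize, then dequantize back to float."""
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

# Per-tensor: a single symmetric scale for the whole tensor
s_tensor = np.abs(w).max() / 127
err_tensor = np.abs(w - quantize(w, s_tensor)).mean()

# Per-channel: one scale per output channel, broadcast over the rest
s_channel = np.abs(w).max(axis=(1, 2, 3), keepdims=True) / 127
err_channel = np.abs(w - quantize(w, s_channel)).mean()

print(err_tensor, err_channel)  # per-channel reconstruction error is lower
```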

Sensitive Layer Analysis

Not all layers tolerate quantization equally. The first and last layers, plus attention layers in transformers, are often the most sensitive. The tool for finding them is per-layer sensitivity analysis.

# Check accuracy degradation when quantizing each layer individually.
# get_all_quantizable_layers / quantize_single_layer are project helpers:
# quantize_single_layer wraps the target layer in quant/dequant stubs,
# calibrates, and converts only that layer to int8.
baseline_accuracy = evaluate(float_model, test_loader)

for layer_name in get_all_quantizable_layers(float_model):
    # Quantize only this layer, leaving the rest in FP32
    single_layer_model = quantize_single_layer(float_model, layer_name)
    layer_accuracy = evaluate(single_layer_model, test_loader)
    sensitivity = baseline_accuracy - layer_accuracy
    print(f"{layer_name}: sensitivity={sensitivity:.4f}")

Keep high-sensitivity layers in FP32: this is mixed-precision quantization. Quantize the rest to INT8. If the 5–10% most sensitive layers stay FP32, the size reduction is smaller than the full 75% (how much smaller depends on how large those layers are), but accuracy is preserved.
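In PyTorch eager mode, mixed precision comes down to leaving `qconfig` unset on the sensitive modules and placing explicit quant/dequant boundaries; a sketch on a toy two-layer net where the head is the (hypothetical) sensitive layer:

```python
import torch
import torch.nn as nn
from torch.quantization import QuantStub, DeQuantStub, get_default_qconfig

torch.backends.quantized.engine = 'qnnpack'  # ARM-oriented int8 kernels

class MixedPrecisionNet(nn.Module):
    """int8 body + FP32 head, with explicit quantization boundaries."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()       # float -> int8 at the boundary
        self.body = nn.Linear(16, 32)  # tolerant layer: quantize
        self.dequant = DeQuantStub()   # int8 -> float before the FP32 part
        self.head = nn.Linear(32, 4)   # sensitive layer: keep FP32

    def forward(self, x):
        return self.head(self.dequant(self.body(self.quant(x))))

model = MixedPrecisionNet().eval()
model.qconfig = get_default_qconfig('qnnpack')
model.head.qconfig = None  # exclude the sensitive layer from quantization

torch.quantization.prepare(model, inplace=True)
with torch.no_grad():
    model(torch.randn(4, 16))  # calibration stand-in
torch.quantization.convert(model, inplace=True)

out = model(torch.randn(1, 16))  # body runs int8, head stays float
```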

Verification: What and How to Check

After quantization, mandatory checks:

  1. Accuracy on test dataset — compare top-1/top-5 accuracy with original. Acceptable degradation: FP16 — <0.5%, INT8 — <2%. If higher, move to QAT or mixed precision.

  2. Numerical error — compare outputs on identical inputs between float and quantized. MSE < 0.01 typically acceptable.

  3. Speed on real devices, not simulators. Xcode Instruments (Core ML template) for iOS; the TFLite benchmark_model binary run via adb for Android.

  4. Crash test — various inputs, edge cases (black image, very bright, unusual aspect ratio). INT8 models sometimes overflow on extreme inputs.
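The numerical-error check (item 2) is a direct output comparison; a sketch using a toy dynamically quantized model as the stand-in pair:

```python
import torch
import torch.nn as nn

def output_mse(float_model, quant_model, inputs):
    """Mean squared error between float and quantized outputs on the same inputs."""
    with torch.no_grad():
        return nn.functional.mse_loss(quant_model(inputs), float_model(inputs)).item()

# Toy demonstration pair: the original float model vs its dynamic-int8 version
fmodel = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
qmodel = torch.quantization.quantize_dynamic(fmodel, {nn.Linear}, dtype=torch.qint8)

mse = output_mse(fmodel, qmodel, torch.randn(32, 64))
print(f"MSE: {mse:.6f}")  # compare against the ~0.01 threshold from the checklist
```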

Practical Case Study

YOLOv8n object detection in FP32 — 6.3 MB, 45 ms on iPhone 13. After Core ML INT8 quantization — 1.8 MB, 12 ms. mAP dropped from 37.3 to 36.1 — within acceptable for most tasks. On Snapdragon 8 Gen 1 via TFLite INT8 + NNAPI — 8 ms.

Process

Source model audit → method selection (PTQ/QAT, INT8/FP16) → calibration → sensitive layer analysis → mixed precision if needed → accuracy verification → speed measurements on target devices.

Timeline Estimates

PTQ for single model with verification — 1–2 weeks. QAT with full retraining and testing cycle — 3–6 weeks depending on dataset size.