ML Model Optimization (Pruning) for Mobile Devices
Pruning removes weights or neurons from a model. The logic: in networks trained on real data, a significant fraction of weights are close to zero and barely affect the output. They can be zeroed or deleted without major accuracy loss, saving size and, potentially, compute.
Sounds appealing. In practice, pruning is harder than quantization: it requires retraining after sparsification and doesn't always deliver the expected speedup on mobile due to implementation details.
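A quick way to check this claim on your own model is to measure the fraction of near-zero weights; the threshold `tau` and the toy network below are illustrative:

```python
import torch
import torch.nn as nn

def near_zero_fraction(model: nn.Module, tau: float = 1e-2) -> float:
    """Fraction of parameters whose magnitude is below tau."""
    total = small = 0
    for p in model.parameters():
        total += p.numel()
        small += int((p.detach().abs() < tau).sum())
    return small / total

# On a trained network this fraction is often substantial; a freshly
# initialized conv stack here just demonstrates the measurement itself
net = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Conv2d(16, 16, 3))
print(f"near-zero fraction: {near_zero_fraction(net):.2%}")
```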
Two Pruning Types
Unstructured pruning zeroes individual weights, producing sparse matrices. A matrix with 90% zeros looks like a 10x saving, but mobile GPUs/NPUs operate on dense matrices, so sparse computation doesn't accelerate there. The practical gain is reduced model size after compression (zeros compress well), not inference speed on typical devices.
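A minimal sketch of this trade-off with PyTorch's built-in `l1_unstructured`: 90% of entries become zero, but the tensor keeps its dense shape, so the compute cost does not shrink:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)
prune.l1_unstructured(layer, name="weight", amount=0.9)  # zero the 90% smallest weights
prune.remove(layer, "weight")                            # bake the zeros in

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2f}")   # ~0.90, great for compressed storage
print(layer.weight.shape)            # still a dense 512x512 tensor: same matmul cost
```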
Structured pruning — remove entire filters (channels) in convolution layers or heads in attention. Result — physically smaller graph, truly faster on any hardware. This is what mobile needs.
Structured Pruning: PyTorch Practice
import torch
import torch.nn.utils.prune as prune

# L1-based structured pruning: remove 30% of filters from Conv2d layers
# by the minimum-L1-norm criterion (least important filters)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.ln_structured(
            module,
            name='weight',
            amount=0.3,  # 30% of channels
            n=1,         # L1 norm
            dim=0,       # dim=0: output filters
        )

# After pruning, make the zeros permanent (remove the mask)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.remove(module, 'weight')
After this, the model contains zeroed filters, but they remain in the graph. The next step is to actually remove the zero channels:
import torch
import torch.nn as nn

# Custom function to remove zero filters
def remove_zero_filters(conv_layer, next_layer=None):
    """Remove filters with all-zero weights and sync the next layer."""
    weight = conv_layer.weight.data
    # Mask of filters that still have nonzero weights
    nonzero_mask = weight.abs().sum(dim=(1, 2, 3)) > 1e-6
    conv_layer.weight = nn.Parameter(weight[nonzero_mask])
    if conv_layer.bias is not None:
        conv_layer.bias = nn.Parameter(conv_layer.bias.data[nonzero_mask])
    conv_layer.out_channels = nonzero_mask.sum().item()
    # Sync the next layer (its input channels)
    if next_layer is not None and isinstance(next_layer, nn.Conv2d):
        next_layer.weight = nn.Parameter(next_layer.weight.data[:, nonzero_mask])
        next_layer.in_channels = nonzero_mask.sum().item()
Do this carefully: BatchNorm layers after a Conv also hold per-channel parameters that require synchronization.
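A sketch of the matching BatchNorm synchronization, assuming the same boolean mask that `remove_zero_filters` computes (the helper name and signature are my own, not a library API):

```python
import torch
import torch.nn as nn

def sync_batchnorm(bn: nn.BatchNorm2d, keep_mask: torch.Tensor) -> nn.BatchNorm2d:
    """Rebuild a BatchNorm2d, keeping only channels where keep_mask is True."""
    kept = int(keep_mask.sum())
    new_bn = nn.BatchNorm2d(kept, eps=bn.eps, momentum=bn.momentum,
                            affine=bn.affine,
                            track_running_stats=bn.track_running_stats)
    if bn.affine:
        # Per-channel scale and shift must follow the surviving filters
        new_bn.weight = nn.Parameter(bn.weight.data[keep_mask].clone())
        new_bn.bias = nn.Parameter(bn.bias.data[keep_mask].clone())
    if bn.track_running_stats:
        # Running statistics are per-channel too
        new_bn.running_mean = bn.running_mean[keep_mask].clone()
        new_bn.running_var = bn.running_var[keep_mask].clone()
    return new_bn
```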
Fine-tuning After Pruning
After removing 20–40% of filters, accuracy drops. Fine-tuning is mandatory. Rule of thumb: the more aggressive the pruning, the longer the fine-tuning.
# Fine-tuning after pruning — typically 10–20% of the original training epochs
optimizer = torch.optim.Adam(pruned_model.parameters(), lr=1e-4)  # lower LR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

for epoch in range(20):
    train_one_epoch(pruned_model, train_loader, optimizer)
    val_acc = evaluate(pruned_model, val_loader)
    scheduler.step()
    print(f"Epoch {epoch}: val_acc={val_acc:.4f}")
Iterative pruning — a cycle of pruning → fine-tuning → pruning — yields better results than removing a large number of filters in a single shot.
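The cycle can be sketched with PyTorch's pruning API: repeated `ln_structured` calls stack into a `PruningContainer`, so masks stay active during fine-tuning and only become permanent at the end. `finetune_fn` is an assumed user-supplied recovery routine:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_prune(model, finetune_fn, step=0.15, rounds=3):
    """Sketch of a prune -> fine-tune cycle (helper names are my own)."""
    for _ in range(rounds):
        for m in model.modules():
            if isinstance(m, nn.Conv2d):
                # Each call prunes `step` of the still-alive channels;
                # masks accumulate in a PruningContainer
                prune.ln_structured(m, name="weight", amount=step, n=1, dim=0)
        finetune_fn(model)  # masks stay applied while fine-tuning
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            prune.remove(m, "weight")  # make the final sparsity permanent
    return model

# Demo with a no-op fine-tune step; a real run trains between rounds
net = nn.Sequential(nn.Conv2d(3, 20, 3), nn.ReLU(), nn.Conv2d(20, 20, 3))
iterative_prune(net, finetune_fn=lambda m: None, step=0.15, rounds=2)
```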
Lottery Ticket Hypothesis: Deeper
For accuracy-critical tasks, the Lottery Ticket Hypothesis is worth knowing: train the full network, then find "winning tickets" — sparse subnetworks that can be retrained from scratch to comparable accuracy. In day-to-day practice, structured magnitude pruning with automatic dependency tracking via the torch_pruning library is the more production-ready route:
import torch
import torch_pruning as tp

example_inputs = torch.zeros(1, 3, 224, 224)

# Analyze layer dependencies: pruning one layer forces pruning of the
# coupled ones (following Conv, BatchNorm, residual branches)
DG = tp.DependencyGraph()
DG.build_dependency(model, example_inputs=example_inputs)

pruner = tp.pruner.MagnitudePruner(
    model,
    example_inputs,
    importance=tp.importance.MagnitudeImportance(p=1),  # L1 criterion
    pruning_ratio=0.5,      # remove 50% of channels
    global_pruning=False,
    iterative_steps=5,      # spread removal across 5 steps
)

for _ in range(5):
    pruner.step()  # prune one step; fine-tune between steps
Why Pruning Doesn't Always Accelerate
MobileNetV3 is already optimized: depthwise-separable convolutions with few channels. Remove 30% of filters from a 16-channel layer and you get 11 channels; the speed difference is minimal, because per-operation overhead remains.
Pruning works on large models: ResNet-50, EfficientNet-B4, BERT. On compact MobileNet/EfficientNet-Lite the effect is smaller. It is often better to start from a lighter base architecture than to prune a heavy one.
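A quick CPU micro-benchmark illustrates the point; the layer shapes are illustrative, absolute numbers are machine-dependent, and small layers are dominated by dispatch overhead rather than FLOPs:

```python
import time
import torch
import torch.nn as nn

def avg_latency(conv, x, iters=100):
    """Average forward latency in seconds over `iters` runs."""
    with torch.no_grad():
        for _ in range(10):          # warm-up
            conv(x)
        t0 = time.perf_counter()
        for _ in range(iters):
            conv(x)
        return (time.perf_counter() - t0) / iters

x = torch.randn(1, 16, 56, 56)
full = nn.Conv2d(16, 16, 3, padding=1)
pruned = nn.Conv2d(16, 11, 3, padding=1)   # 30% of output filters removed

t_full, t_pruned = avg_latency(full, x), avg_latency(pruned, x)
# The gap is typically far smaller than the 31% FLOP reduction suggests
print(f"16 ch: {t_full * 1e3:.3f} ms, 11 ch: {t_pruned * 1e3:.3f} ms")
```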
Combining with Quantization
Pruning + quantization — standard two-step optimization:
- Structured pruning 30–40% → fine-tune → reduce graph
- INT8 quantization of compressed graph → final model
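A sketch of the second step using PyTorch eager-mode post-training static quantization. The model class, layer sizes, and calibration data below are illustrative stand-ins for the pruned, fine-tuned network; on-device you would typically select the qnnpack backend instead of fbgemm:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (QuantStub, DeQuantStub,
                                   get_default_qconfig, prepare, convert)

class PrunedNet(nn.Module):
    """Stand-in for the pruned float model (11 = 16 channels after ~30% pruning)."""
    def __init__(self, channels=11):
        super().__init__()
        self.quant = QuantStub()      # marks the float -> int8 boundary
        self.conv = nn.Conv2d(3, channels, 3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, 10)
        self.dequant = DeQuantStub()  # int8 -> float on the way out

    def forward(self, x):
        x = self.quant(x)
        x = self.pool(self.relu(self.conv(x)))
        x = self.fc(x.flatten(1))
        return self.dequant(x)

model = PrunedNet().eval()
torch.backends.quantized.engine = "fbgemm"   # use "qnnpack" for ARM targets
model.qconfig = get_default_qconfig("fbgemm")
prepared = prepare(model)                    # insert observers
with torch.no_grad():                        # calibration pass (real data in practice)
    prepared(torch.randn(8, 3, 32, 32))
quantized = convert(prepared)                # INT8 weights + activations
```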
Example result: EfficientNet-B0 (20 MB FP32, 80 ms Android) → 35% pruning + INT8 → 4 MB, 18 ms. Top-1 accuracy dropped from 77.1% to 75.8%.
Process
Model pruning-suitability analysis → criterion and sparsification degree selection → iterative pruning + fine-tuning → accuracy verification → optional: quantization → measurements on target devices.
Timeline Estimates
Structured pruning with fine-tuning on ready dataset — 2–4 weeks. Iterative pruning with full experimental cycle — 4–8 weeks.