ML Model Optimization (Pruning) for Mobile Device

Pruning removes weights or neurons from a model. The logic: in networks trained on real data, a significant fraction of weights are close to zero and barely affect the output. They can be zeroed or deleted without major accuracy loss, buying speed and size.

It sounds appealing. In practice, pruning is harder than quantization: it requires retraining after sparsification, and it doesn't always deliver the expected speedup on mobile because of implementation details.

Two Pruning Types

Unstructured pruning zeroes individual weights, producing sparse matrices. A matrix with 90% zeros looks like a 10x saving, but mobile GPUs/NPUs execute dense kernels, so sparse computation doesn't run faster there. The practical gain is reduced model size after compression (zeros compress well), not inference speed on typical devices.
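A minimal sketch of unstructured magnitude pruning with torch.nn.utils.prune, on a toy nn.Linear layer (the layer size is arbitrary). Note the tensor keeps its dense shape; only the values become zeros:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Hypothetical toy layer standing in for a real model's weight matrix
layer = nn.Linear(256, 256)

# Zero the 90% of weights with the smallest absolute value
prune.l1_unstructured(layer, name='weight', amount=0.9)
prune.remove(layer, 'weight')  # bake the mask into the tensor

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2f}")  # ~0.90, but the tensor shape is unchanged
```

The file compresses well (runs of zeros), yet a dense matmul on this tensor performs exactly the same number of operations as before.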

Structured pruning removes entire filters (channels) in convolution layers or heads in attention. The result is a physically smaller graph that is genuinely faster on any hardware. This is what mobile needs.

Structured Pruning: PyTorch Practice

import torch
import torch.nn.utils.prune as prune

# L1-based structured pruning: remove 30% of filters from Conv2d layers
# by minimum L1-norm criterion (least important filters)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.ln_structured(
            module,
            name='weight',
            amount=0.3,  # 30% of channels
            n=1,         # L1 norm
            dim=0        # dim=0 — output filters
        )

# After pruning — make weights permanent (remove mask)
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Conv2d):
        prune.remove(module, 'weight')

After this, the model contains zeroed filters, but they remain in the graph. The next step is to actually remove the zero channels:

import torch.nn as nn

# Custom function to remove zero filters
def remove_zero_filters(conv_layer, next_layer=None):
    """Remove filters with zero weights and sync next layer"""
    weight = conv_layer.weight.data
    # Mask: filters with nonzero weights
    nonzero_mask = weight.abs().sum(dim=(1,2,3)) > 1e-6

    conv_layer.weight = nn.Parameter(weight[nonzero_mask])
    if conv_layer.bias is not None:
        conv_layer.bias = nn.Parameter(conv_layer.bias.data[nonzero_mask])
    conv_layer.out_channels = nonzero_mask.sum().item()

    # Sync next layer (input channels)
    if next_layer is not None and isinstance(next_layer, nn.Conv2d):
        next_layer.weight = nn.Parameter(next_layer.weight.data[:, nonzero_mask])
        next_layer.in_channels = nonzero_mask.sum().item()

Do this carefully: BatchNorm layers after a Conv also hold per-channel parameters (weight, bias, running_mean, running_var) that must be synchronized with the same mask.
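The BatchNorm synchronization can be sketched like this. `sync_batchnorm` is a hypothetical helper; `keep_mask` is the same boolean mask over output filters that `remove_zero_filters` computes above:

```python
import torch
import torch.nn as nn

def sync_batchnorm(bn: nn.BatchNorm2d, keep_mask: torch.Tensor) -> nn.BatchNorm2d:
    """Rebuild a BatchNorm2d, keeping only the channels selected by keep_mask."""
    new_bn = nn.BatchNorm2d(int(keep_mask.sum()))
    new_bn.weight = nn.Parameter(bn.weight.data[keep_mask])
    new_bn.bias = nn.Parameter(bn.bias.data[keep_mask])
    new_bn.running_mean = bn.running_mean[keep_mask].clone()  # buffers, not params
    new_bn.running_var = bn.running_var[keep_mask].clone()
    return new_bn

# Usage sketch: a conv with 16 filters, 5 of them pruned to zero
conv = nn.Conv2d(3, 16, 3)
conv.weight.data[:5].zero_()
mask = conv.weight.abs().sum(dim=(1, 2, 3)) > 1e-6
bn = sync_batchnorm(nn.BatchNorm2d(16), mask)
print(bn.num_features)  # 11
```

The running statistics are buffers rather than parameters, which is why they are sliced and cloned instead of wrapped in nn.Parameter.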

Fine-tuning After Pruning

After removing 20–40% of filters, accuracy drops, so fine-tuning is mandatory. Rule of thumb: the more aggressive the pruning, the longer the fine-tuning.

# Fine-tuning after pruning — typically 10-20% of original epochs
optimizer = torch.optim.Adam(pruned_model.parameters(), lr=1e-4)  # lower LR
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

for epoch in range(20):
    train_one_epoch(pruned_model, train_loader, optimizer)
    val_acc = evaluate(pruned_model, val_loader)
    scheduler.step()
    print(f"Epoch {epoch}: val_acc={val_acc:.4f}")

Iterative pruning, cycling prune → fine-tune → prune, yields better results than removing a large fraction of filters in a single shot.
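The cycle can be sketched with torch.nn.utils.prune on a toy model (sizes and ratios are illustrative). Each pass removes 10% of the remaining filters; repeated calls accumulate in a PruningContainer, and the fine-tuning epochs go between the cuts:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_step(model, amount=0.1):
    """One pruning pass: zero `amount` of the remaining conv filters (L1 criterion)."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.ln_structured(module, name='weight', amount=amount, n=1, dim=0)

# Toy stand-in for the real network: four 10% cuts instead of one 40% cut
model = nn.Sequential(nn.Conv2d(3, 32, 3), nn.ReLU(), nn.Conv2d(32, 32, 3))
for step in range(4):
    prune_step(model, amount=0.1)
    # ... fine-tune for a few epochs here before the next cut ...

# Bake the accumulated masks into the weights
for module in model.modules():
    if isinstance(module, nn.Conv2d):
        prune.remove(module, 'weight')

zeroed = (model[2].weight.abs().sum(dim=(1, 2, 3)) < 1e-6).sum().item()
print(f"{zeroed} of {model[2].weight.shape[0]} filters zeroed")  # roughly a third
```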

Going Deeper: the Lottery Ticket Hypothesis

For accuracy-critical tasks, the Lottery Ticket approach is relevant: train the full network, then find "winning tickets", i.e. sparse subnetworks that, when retrained from the original initialization, reach comparable accuracy. In practice, the structured-pruning side of this workflow is implemented with the torch_pruning library, which tracks layer dependencies:

import torch_pruning as tp

# The pruner builds a dependency graph internally: pruning one layer's
# output channels forces pruning of the layers that consume them
example_inputs = torch.zeros(1, 3, 224, 224)

pruner = tp.pruner.MagnitudePruner(
    model,
    example_inputs,
    importance=tp.importance.MagnitudeImportance(p=1),  # L1 criterion
    pruning_ratio=0.5,   # remove 50% of channels overall
    global_pruning=False,
    iterative_steps=5    # spread removal across 5 steps
)

for step in range(5):
    pruner.step()        # physically shrinks the graph at each step
    # fine-tune here between steps

Why Pruning Doesn't Always Accelerate

MobileNetV3 is already optimized: depthwise separable convolutions with few channels. Remove 30% of the filters from a 16-channel layer and you get 11 channels; the speed difference is minimal because per-operation overhead dominates.

Pruning pays off on large models: ResNet-50, EfficientNet-B4, BERT. On compact MobileNet/EfficientNet-lite the effect is smaller; it is usually better to start from a lighter base architecture than to prune a heavy one.
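Whether a given layer is worth pruning can be checked directly by benchmarking the dense layer against a physically shrunk copy. A CPU sketch with illustrative sizes; real numbers should come from the target device:

```python
import time
import torch
import torch.nn as nn

def bench(layer, x, iters=50):
    """Median wall-clock time of one forward pass, in milliseconds."""
    with torch.no_grad():
        for _ in range(5):  # warmup
            layer(x)
        times = []
        for _ in range(iters):
            t0 = time.perf_counter()
            layer(x)
            times.append((time.perf_counter() - t0) * 1e3)
    return sorted(times)[len(times) // 2]

x = torch.randn(1, 16, 56, 56)
full = bench(nn.Conv2d(16, 16, 3, padding=1), x)
cut = bench(nn.Conv2d(16, 11, 3, padding=1), x)  # 30% of filters removed
print(f"16ch: {full:.3f} ms, 11ch: {cut:.3f} ms")  # often nearly identical
```

On small layers the two timings tend to be close: fixed kernel-launch and memory overhead swamps the saved arithmetic.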

Combining with Quantization

Pruning + quantization is the standard two-step optimization:

  1. Structured pruning 30–40% → fine-tune → reduce graph
  2. INT8 quantization of compressed graph → final model

Example result: EfficientNet-B0 (20 MB FP32, 80 ms on Android) → 35% pruning + INT8 → 4 MB, 18 ms, with Top-1 accuracy dropping from 77.1% to 75.8%.
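Step 2 of the pipeline can be sketched with PyTorch FX graph-mode quantization. This is a sketch on a hypothetical toy model; in practice the input is the fine-tuned pruned network and calibration runs on real data, not random tensors:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

torch.backends.quantized.engine = 'qnnpack'  # mobile ARM backend

# Toy stand-in for the already-pruned, fine-tuned network
pruned_model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 8, 3)
).eval()
example = torch.randn(1, 3, 32, 32)

qconfig_mapping = get_default_qconfig_mapping('qnnpack')
prepared = prepare_fx(pruned_model, qconfig_mapping, example_inputs=example)

with torch.no_grad():
    for _ in range(16):  # calibration passes (real data in practice)
        prepared(torch.randn(1, 3, 32, 32))

int8_model = convert_fx(prepared)  # INT8 weights + quantized conv kernels
out = int8_model(example)
print(out.shape)
```

Because pruning already shrank the graph, the observers and quantized kernels operate on the smaller channel counts, which is where the combined size and latency win comes from.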

Process

Analysis of the model's suitability for pruning → selection of the criterion and degree of sparsification → iterative pruning + fine-tuning → accuracy verification → optional quantization → measurements on target devices.

Timeline Estimates

Structured pruning with fine-tuning on a ready dataset: 2–4 weeks. Iterative pruning with a full experimental cycle: 4–8 weeks.