ML model distillation optimization for mobile device

NOVASOLUTIONS.TECHNOLOGY is engaged in the development, support, and maintenance of iOS, Android, and PWA mobile applications. We have extensive experience and expertise in publishing mobile applications to popular markets such as Google Play, the App Store, Amazon, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with, and each of them may have its own specific features and functionality, tailored to the specific needs and goals of the client.


ML Model Optimization (Distillation) for Mobile Device

Knowledge distillation trains a small model (the "student") to reproduce the behavior of a large model (the "teacher"). The student does not just copy the teacher's correct answers: it absorbs the teacher's "soft" probabilities — full class distributions that encode information about how similar different concepts are.

Distillation fundamentally differs from pruning and quantization: you get a new, smaller architecture with its own weights, and you choose the student's size yourself. It is the more powerful technique, but it requires resources: a dataset, a GPU, and training time.

Why Soft Labels Work Better

In normal training, the correct class gets probability 1.0 and every other class gets 0.0 — these are hard labels.

On a cat image, a teacher might output: cat 0.85, lynx 0.08, tiger 0.04, dog 0.02. These soft labels carry the information that a lynx resembles a cat more than an airplane does. A student trained on such labels learns the structure of the feature space, not just a binary classification.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, temperature=4.0, alpha=0.7):
    """
    alpha — weight of the distillation loss relative to the hard-label loss
    temperature — smooths the teacher distribution before the KL term
    """
    # Soft targets loss (KL divergence between student and teacher)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    distill_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (temperature ** 2)

    # Hard label loss (standard cross-entropy)
    hard_loss = F.cross_entropy(student_logits, true_labels)

    return alpha * distill_loss + (1 - alpha) * hard_loss

The temperature ** 2 factor compensates for the gradient scale at high temperature. Without it, distill_loss and hard_loss would sit at different scales.
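A quick sketch with made-up logits shows what the temperature does to the teacher distribution (the class names and values are illustrative, not from a real model):

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one image: [cat, lynx, tiger, dog]
logits = torch.tensor([4.0, 1.5, 0.8, 0.1])

p_sharp = F.softmax(logits / 1.0, dim=-1)  # T=1: the winner dominates
p_soft = F.softmax(logits / 4.0, dim=-1)   # T=4: class similarities become visible

print(p_sharp)  # cat probability close to 1, the rest near zero
print(p_soft)   # distribution flattens, but the relative order is preserved
```

Raising the temperature flattens the distribution without reordering the classes, which is exactly the "dark knowledge" the student learns from.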

Choosing Student Architecture

Student should be smaller than teacher, but not arbitrary. Good base architectures for mobile:

  • MobileNetV3-Small — 2.5 MB, designed for mobile from scratch, depthwise separable convolutions
  • EfficientNet-Lite0/1 — good accuracy/speed balance
  • MobileViT-XXS — hybrid CNN+Transformer, 1.3 MB
  • DistilBERT (for NLP) — already distilled from BERT, 66 MB vs 440 MB

For mobile object detection, a typical pair is a YOLOv8n student (8 MB) distilled from a YOLOv8l teacher (87 MB).

Distillation Process: Classification Example

# Assume: teacher — ResNet-50, student — MobileNetV3-Small
import torchvision

teacher = torchvision.models.resnet50(pretrained=True).eval()
student = torchvision.models.mobilenet_v3_small(pretrained=False)

# Freeze teacher
for param in teacher.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    student.train()
    for images, labels in train_loader:
        with torch.no_grad():
            teacher_logits = teacher(images)

        student_logits = student(images)

        loss = distillation_loss(student_logits, teacher_logits, labels,
                                  temperature=4.0, alpha=0.7)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    scheduler.step()
    val_acc = evaluate(student, val_loader)  # evaluate() — your validation accuracy helper
    print(f"Epoch {epoch}: student_acc={val_acc:.4f}")

Typical results: MobileNetV3-Small trained normally reaches 67–68% top-1 on ImageNet; after distillation from ResNet-50 it reaches 71–72%. That is a 3–4 percentage point gain purely from knowledge transfer.

Intermediate Layer Distillation

Distilling only the logits is the basic variant. A stronger approach is to also match intermediate feature maps.

# FitNets / PKT: student learns teacher feature maps
class DistillationHook:
    """Hook to capture intermediate activations"""
    def __init__(self):
        self.output = None

    def __call__(self, module, input, output):
        self.output = output

teacher_hook = DistillationHook()
student_hook = DistillationHook()

# Register on corresponding layers
teacher.layer3.register_forward_hook(teacher_hook)
student.features[9].register_forward_hook(student_hook)  # Analogous layer

# In training loop add feature distillation loss
with torch.no_grad():
    teacher(images)
teacher_features = teacher_hook.output

student(images)  # with grad
student_features = student_hook.output

# If dimensions differ — need adapter (1x1 Conv)
if teacher_features.shape[1] != student_features.shape[1]:
    student_features = adapter_conv(student_features)  # adapter_conv — a trainable 1x1 Conv, defined separately

feature_loss = F.mse_loss(student_features, teacher_features.detach())

This approach requires aligning feature map dimensions via a 1×1 conv adapter. The adapter adds only a few parameters, so the student remains small.
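A minimal sketch of such an adapter (the channel counts and spatial sizes here are assumed for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """1x1 conv projecting student feature maps to the teacher's channel count."""
    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, x):
        return self.proj(x)

# Assumed shapes: a student layer gives 48 channels, the teacher layer gives 1024
adapter = FeatureAdapter(48, 1024)
student_features = torch.randn(2, 48, 14, 14)
teacher_features = torch.randn(2, 1024, 14, 14)

aligned = adapter(student_features)  # now (2, 1024, 14, 14)
feature_loss = F.mse_loss(aligned, teacher_features.detach())

# The adapter trains jointly with the student:
# optimizer = torch.optim.AdamW(
#     list(student.parameters()) + list(adapter.parameters()), lr=1e-3)
```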

Data-Free Distillation

Sometimes the source dataset is unavailable (IP restrictions, privacy). Data-free distillation generates synthetic data that maximizes teacher activations:

# DAFL (Data-Free Learning): a generator creates samples for distillation
# Generator — an image generator network (e.g. DCGAN-style); definition omitted
generator = Generator(latent_dim=256, img_channels=3)
optimizer_G = torch.optim.Adam(generator.parameters(), lr=1e-4)

batch_size = 128  # assumed
for step in range(1000):
    z = torch.randn(batch_size, 256)
    fake_images = generator(z)

    # Losses: maximize teacher confidence + minimize BatchNorm statistics mismatch
    teacher_out = teacher(fake_images)
    activation_loss = -teacher_out.max(dim=1)[0].mean()  # teacher should be confident

    # BN statistics matching (compute_bn_statistics_loss — helper omitted)
    bn_loss = compute_bn_statistics_loss(teacher, fake_images)

    total_loss = activation_loss + 0.1 * bn_loss
    optimizer_G.zero_grad()
    total_loss.backward()
    optimizer_G.step()

Data-free distillation gives lower quality than the full-data variant, but it is sometimes the only option.

Distillation for Mobile NLP Tasks

For mobile NLP (review classification, intent detection, summarization), a common recipe is to distill GPT-4 / Claude API responses into a small BERT/DistilBERT.

# Collect soft labels from teacher (GPT-4 API)
# For each training example request class probabilities
# Save as training labels for student
# Student — DistilBERT fine-tuned on these soft labels

DistilBERT (66 MB; ONNX int8 — 18 MB) runs on device in 30–80 ms on iOS/Android. GPT-4 in the cloud takes hundreds of milliseconds and incurs a cost per request.
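A sketch of the student-side training step, assuming the teacher's probabilities have already been collected via the API and stored as tensors (the 3-class intent labels here are made up):

```python
import torch
import torch.nn.functional as F

# Soft labels previously collected from the teacher API (hypothetical 3 intents)
teacher_probs = torch.tensor([
    [0.90, 0.07, 0.03],
    [0.10, 0.80, 0.10],
])

# Student logits for the same two texts, e.g. from a DistilBERT classification head
student_logits = torch.randn(2, 3, requires_grad=True)

# KL divergence between the student distribution and the stored teacher distribution
loss = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                reduction='batchmean')
loss.backward()  # gradients flow into the student
```

The same distillation_loss from above applies; here alpha is effectively 1.0 because the API teacher provides only soft labels.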

Process

Define student architecture per resource budget → configure distillation (temperature, alpha, intermediate layers) → train on GPU → verify accuracy vs teacher → convert to Core ML / TFLite → final device measurements.

Timeline Estimates

Basic logit distillation for classification takes 2–4 weeks (GPU time plus hyperparameter tuning). Full distillation with intermediate layers, non-standard architectures, and data augmentation takes 5–10 weeks.