ML Model Optimization (Distillation) for Mobile Devices
Knowledge distillation trains a small model (the "student") to reproduce the behavior of a large model (the "teacher"). The student does not just copy the correct answers — it absorbs the teacher's "soft" probabilities: full class distributions that encode information about how similar concepts are to one another.
Distillation differs fundamentally from pruning and quantization: you get a new, smaller architecture with its own weights, and you choose the student's size yourself. It is more powerful, but it requires resources: a dataset, GPUs, and training time.
Why Soft Labels Work Better
In normal training, the correct class gets probability 1.0 and every other class gets 0.0 — hard labels.
A teacher shown a cat image might output: cat 0.85, lynx 0.08, tiger 0.04, dog 0.02. These soft labels carry the information that a lynx resembles a cat more than an airplane does. A student trained on such labels learns the structure of the feature space, not just a binary right/wrong signal.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, temperature=4.0, alpha=0.7):
    """
    alpha — weight of the distillation loss vs the hard-label loss
    temperature — smooths the teacher distribution
    """
    # Soft-target loss (KL divergence between student and teacher)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    distill_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (temperature ** 2)
    # Hard-label loss (standard cross-entropy)
    hard_loss = F.cross_entropy(student_logits, true_labels)
    return alpha * distill_loss + (1 - alpha) * hard_loss
The temperature ** 2 factor is a normalization that compensates for the gradient scale at high temperature: dividing the logits by T shrinks the soft-target gradients by roughly 1/T². Without it, distill_loss and hard_loss would sit at very different scales.
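A quick sanity check of distillation_loss on random logits also shows what the temperature does to the teacher distribution. All shapes and values here are illustrative; the function is repeated so the snippet runs on its own:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels,
                      temperature=4.0, alpha=0.7):
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    distill_loss = F.kl_div(soft_student, soft_teacher,
                            reduction='batchmean') * (temperature ** 2)
    hard_loss = F.cross_entropy(student_logits, true_labels)
    return alpha * distill_loss + (1 - alpha) * hard_loss

torch.manual_seed(0)
student_logits = torch.randn(8, 10)   # batch of 8, 10 classes
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))

loss = distillation_loss(student_logits, teacher_logits, labels)
print(loss.item())  # positive scalar, differentiable w.r.t. student_logits

# Higher temperature flattens the teacher distribution:
probs_t1 = F.softmax(teacher_logits / 1.0, dim=-1)
probs_t4 = F.softmax(teacher_logits / 4.0, dim=-1)
```

With T=4 the peak probabilities shrink and the "dark knowledge" in the small classes becomes visible to the student.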
Choosing Student Architecture
The student should be smaller than the teacher, but not an arbitrary architecture. Good base architectures for mobile:
- MobileNetV3-Small — 2.5 MB, designed for mobile from scratch, depthwise separable convolutions
- EfficientNet-Lite0/1 — good accuracy/speed balance
- MobileViT-XXS — hybrid CNN+Transformer, 1.3 MB
- DistilBERT (for NLP) — already distilled from BERT, 66 MB vs 440 MB
For mobile object detection, a common pairing is a student based on YOLOv8n (8 MB) distilled from YOLOv8l (87 MB).
Distillation Process: Classification Example
import torch
import torchvision

# Assume: teacher — ResNet-50, student — MobileNetV3-Small
teacher = torchvision.models.resnet50(pretrained=True).eval()
student = torchvision.models.mobilenet_v3_small(pretrained=False)

# Freeze the teacher
for param in teacher.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    student.train()
    for images, labels in train_loader:
        with torch.no_grad():
            teacher_logits = teacher(images)
        student_logits = student(images)
        loss = distillation_loss(student_logits, teacher_logits, labels,
                                 temperature=4.0, alpha=0.7)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
    val_acc = evaluate(student, val_loader)
    print(f"Epoch {epoch}: student_acc={val_acc:.4f}")
Typical results: MobileNetV3-Small trained normally reaches 67–68% top-1 on ImageNet; after distillation from ResNet-50, 71–72%. That is a 3–4 percentage point gain purely from knowledge transfer.
Intermediate Layer Distillation
Distilling only the logits is the basic variant. A stronger approach is to also match intermediate feature maps.
# FitNets / PKT: the student learns the teacher's feature maps
class DistillationHook:
    """Hook to capture intermediate activations"""
    def __init__(self):
        self.output = None

    def __call__(self, module, input, output):
        self.output = output

teacher_hook = DistillationHook()
student_hook = DistillationHook()

# Register on corresponding layers
teacher.layer3.register_forward_hook(teacher_hook)
student.features[9].register_forward_hook(student_hook)  # analogous layer

# In the training loop, add a feature-distillation loss
with torch.no_grad():
    teacher(images)
teacher_features = teacher_hook.output

student(images)  # with grad
student_features = student_hook.output

# If dimensions differ, an adapter (1x1 conv) is needed
if teacher_features.shape[1] != student_features.shape[1]:
    student_features = adapter_conv(student_features)  # the adapter is trained too

feature_loss = F.mse_loss(student_features, teacher_features.detach())
This approach requires aligning feature-map dimensions via a 1×1 conv adapter. The adapter adds only a few parameters, so the student stays small.
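A minimal sketch of such an adapter on dummy tensors. The channel counts (student 48 → teacher 1024) are illustrative, and adapter_conv here stands in for the undefined name in the snippet above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 1x1 conv projecting student channels into the teacher's channel space
adapter_conv = nn.Conv2d(48, 1024, kernel_size=1)

student_features = torch.randn(2, 48, 14, 14)     # e.g. MobileNet block output
teacher_features = torch.randn(2, 1024, 14, 14)   # e.g. ResNet layer3 output

aligned = adapter_conv(student_features)          # -> (2, 1024, 14, 14)
feature_loss = F.mse_loss(aligned, teacher_features.detach())

# The adapter's parameters are optimized jointly with the student, e.g.:
# optimizer = torch.optim.AdamW(
#     list(student.parameters()) + list(adapter_conv.parameters()), lr=1e-3)
```

Spatial sizes must also match; if they differ, an interpolation or strided conv goes into the adapter as well.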
Data-Free Distillation
Sometimes the source dataset is unavailable (IP restrictions, privacy). Data-free distillation generates synthetic data that maximizes teacher activations:
# DAFL (Data-Free Learning): a generator creates samples for distillation
generator = Generator(latent_dim=256, img_channels=3)
optimizer_G = torch.optim.Adam(generator.parameters(), lr=1e-4)

for step in range(1000):
    z = torch.randn(batch_size, 256)
    fake_images = generator(z)
    # Losses: maximize teacher confidence + minimize BatchNorm statistics mismatch
    teacher_out = teacher(fake_images)
    activation_loss = -teacher_out.max(dim=1)[0].mean()  # teacher should be confident
    # BN statistics matching
    bn_loss = compute_bn_statistics_loss(teacher, fake_images)
    total_loss = activation_loss + 0.1 * bn_loss
    optimizer_G.zero_grad()
    total_loss.backward()
    optimizer_G.step()
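compute_bn_statistics_loss above is left undefined. One possible sketch (the name and L2 distance are assumptions, following the BN-matching idea used in DAFL/DeepInversion-style methods) compares the batch statistics induced at each BatchNorm layer with that layer's stored running statistics:

```python
import torch
import torch.nn as nn

def compute_bn_statistics_loss(model, images):
    """Sum over BN layers of the distance between the batch statistics
    produced by `images` and the layer's running statistics."""
    losses = []
    hooks = []

    def make_hook(bn):
        def hook(module, inputs, output):
            x = inputs[0]
            # Per-channel mean/var over batch and spatial dims
            mean = x.mean(dim=(0, 2, 3))
            var = x.var(dim=(0, 2, 3), unbiased=False)
            losses.append(torch.norm(mean - bn.running_mean, 2) +
                          torch.norm(var - bn.running_var, 2))
        return hook

    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            hooks.append(module.register_forward_hook(make_hook(module)))
    model(images)
    for h in hooks:
        h.remove()
    return torch.stack(losses).sum()

# Toy usage with a tiny conv+BN "teacher"
teacher = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)).eval()
bn_loss = compute_bn_statistics_loss(teacher, torch.randn(4, 3, 16, 16))
```

Because the loss is differentiable with respect to the images, its gradient flows back into the generator.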
Data-free distillation yields lower quality than the full-data variant, but sometimes it is the only option.
Distillation for Mobile NLP Tasks
For mobile NLP tasks (review classification, intent detection, summarization), a common pattern is to distill from GPT-4 / Claude API responses into a small BERT/DistilBERT student.
# Collect soft labels from the teacher (GPT-4 API):
# for each training example, request class probabilities
# and save them as training labels for the student.
# Student — DistilBERT fine-tuned on these soft labels.
DistilBERT (66 MB; 18 MB as int8 ONNX) runs on-device in 30–80 ms on iOS/Android. GPT-4 in the cloud takes hundreds of milliseconds and costs money per request.
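The core of that fine-tuning step is a cross-entropy against the cached teacher distributions instead of one-hot labels. A minimal sketch, where the tensors and the linear "student head" are stand-ins for DistilBERT embeddings and its classifier:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes = 3

# Cached teacher soft labels, e.g. parsed from API responses
teacher_probs = torch.tensor([[0.80, 0.15, 0.05],
                              [0.10, 0.70, 0.20]])

# Stand-in for DistilBERT: any module producing class logits
student_head = torch.nn.Linear(16, num_classes)
features = torch.randn(2, 16)   # would be DistilBERT [CLS] embeddings

logits = student_head(features)
log_probs = F.log_softmax(logits, dim=-1)

# Soft-label cross-entropy: -sum_c p_teacher(c) * log p_student(c)
loss = -(teacher_probs * log_probs).sum(dim=-1).mean()
loss.backward()   # gradients flow into the student
```

The same loss can be mixed with a hard-label term and a temperature, exactly as in the classification example above.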
Process
Define the student architecture from the resource budget → configure distillation (temperature, alpha, intermediate layers) → train on GPU → verify accuracy against the teacher → convert to Core ML / TFLite → take final on-device measurements.
Timeline Estimates
Basic logit distillation for classification takes 2–4 weeks (GPU time plus hyperparameter tuning). Full distillation with intermediate layers, non-standard architectures, and data augmentation: 5–10 weeks.