ML Model Optimization (Distillation) for Mobile Devices
Knowledge distillation trains a small model (the "student") to reproduce the behavior of a large model (the "teacher"). The student does not just copy the correct answers — it absorbs the teacher's "soft" probabilities: full class distributions that encode information about how similar concepts are to one another.
Distillation differs fundamentally from pruning and quantization: you get a new, smaller architecture with its own weights, and you choose the student's size yourself. It is more powerful, but it requires resources: a dataset, GPUs, and training time.
Why Soft Labels Work Better
In normal training, the correct class gets probability 1.0 and every other class gets 0.0 — hard labels.
A teacher shown a cat image might output: cat 0.85, lynx 0.08, tiger 0.04, dog 0.02. These soft labels carry the information that a lynx resembles a cat more than an airplane does. A student trained on such labels learns the structure of the feature space, not just a binary right/wrong signal.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, temperature=4.0, alpha=0.7):
    """
    alpha — weight of the distillation loss vs the hard-label loss
    temperature — smooths the teacher distribution
    """
    # Soft-target loss (KL divergence between student and teacher)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    distill_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean') * (temperature ** 2)
    # Hard-label loss (standard cross-entropy)
    hard_loss = F.cross_entropy(student_logits, true_labels)
    return alpha * distill_loss + (1 - alpha) * hard_loss
The temperature ** 2 factor is a normalization that compensates for the gradient scale at high temperature: dividing the logits by T shrinks the soft-target gradients by roughly 1/T². Without it, distill_loss and hard_loss would sit at very different scales.
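A quick sanity check of distillation_loss on random logits also shows what the temperature does to the teacher distribution. All shapes and values here are illustrative; the function is repeated so the snippet runs on its own:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels,
                      temperature=4.0, alpha=0.7):
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    distill_loss = F.kl_div(soft_student, soft_teacher,
                            reduction='batchmean') * (temperature ** 2)
    hard_loss = F.cross_entropy(student_logits, true_labels)
    return alpha * distill_loss + (1 - alpha) * hard_loss

torch.manual_seed(0)
student_logits = torch.randn(8, 10)   # batch of 8, 10 classes
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))

loss = distillation_loss(student_logits, teacher_logits, labels)
print(loss.item())  # positive scalar, differentiable w.r.t. student_logits

# Higher temperature flattens the teacher distribution:
probs_t1 = F.softmax(teacher_logits / 1.0, dim=-1)
probs_t4 = F.softmax(teacher_logits / 4.0, dim=-1)
```

With T=4 the peak probabilities shrink and the "dark knowledge" in the small classes becomes visible to the student.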
Choosing Student Architecture
The student should be smaller than the teacher, but not an arbitrary architecture. Good base architectures for mobile:
- MobileNetV3-Small — 2.5 MB, designed for mobile from scratch, depthwise separable convolutions
- EfficientNet-Lite0/1 — good accuracy/speed balance
- MobileViT-XXS — hybrid CNN+Transformer, 1.3 MB
- DistilBERT (for NLP) — already distilled from BERT, 66 MB vs 440 MB
For mobile object detection, a common pairing is a student based on YOLOv8n (8 MB) distilled from YOLOv8l (87 MB).
Distillation Process: Classification Example
import torch
import torchvision

# Assume: teacher — ResNet-50, student — MobileNetV3-Small
teacher = torchvision.models.resnet50(pretrained=True).eval()
student = torchvision.models.mobilenet_v3_small(pretrained=False)

# Freeze the teacher
for param in teacher.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    student.train()
    for images, labels in train_loader:
        with torch.no_grad():
            teacher_logits = teacher(images)
        student_logits = student(images)
        loss = distillation_loss(student_logits, teacher_logits, labels,
                                 temperature=4.0, alpha=0.7)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
    val_acc = evaluate(student, val_loader)
    print(f"Epoch {epoch}: student_acc={val_acc:.4f}")
Typical results: MobileNetV3-Small trained normally reaches 67–68% top-1 on ImageNet; after distillation from ResNet-50, 71–72%. That is a 3–4 percentage point gain purely from knowledge transfer.
Intermediate Layer Distillation
Distilling only the logits is the basic variant. A stronger approach is to also match intermediate feature maps.
# FitNets / PKT: the student learns the teacher's feature maps
class DistillationHook:
    """Hook to capture intermediate activations"""
    def __init__(self):
        self.output = None

    def __call__(self, module, input, output):
        self.output = output

teacher_hook = DistillationHook()
student_hook = DistillationHook()

# Register on corresponding layers
teacher.layer3.register_forward_hook(teacher_hook)
student.features[9].register_forward_hook(student_hook)  # analogous layer

# In the training loop, add a feature-distillation loss
with torch.no_grad():
    teacher(images)
teacher_features = teacher_hook.output

student(images)  # with grad
student_features = student_hook.output

# If dimensions differ, an adapter (1x1 conv) is needed
if teacher_features.shape[1] != student_features.shape[1]:
    student_features = adapter_conv(student_features)  # the adapter is trained too

feature_loss = F.mse_loss(student_features, teacher_features.detach())
This approach requires aligning feature-map dimensions via a 1×1 conv adapter. The adapter adds only a few parameters, so the student stays small.
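A minimal sketch of such an adapter on dummy tensors. The channel counts (student 48 → teacher 1024) are illustrative, and adapter_conv here stands in for the undefined name in the snippet above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 1x1 conv projecting student channels into the teacher's channel space
adapter_conv = nn.Conv2d(48, 1024, kernel_size=1)

student_features = torch.randn(2, 48, 14, 14)     # e.g. MobileNet block output
teacher_features = torch.randn(2, 1024, 14, 14)   # e.g. ResNet layer3 output

aligned = adapter_conv(student_features)          # -> (2, 1024, 14, 14)
feature_loss = F.mse_loss(aligned, teacher_features.detach())

# The adapter's parameters are optimized jointly with the student, e.g.:
# optimizer = torch.optim.AdamW(
#     list(student.parameters()) + list(adapter_conv.parameters()), lr=1e-3)
```

Spatial sizes must also match; if they differ, an interpolation or strided conv goes into the adapter as well.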
Data-Free Distillation
Sometimes the source dataset is unavailable (IP restrictions, privacy). Data-free distillation generates synthetic data that maximizes teacher activations:
# DAFL (Data-Free Learning): a generator creates samples for distillation
generator = Generator(latent_dim=256, img_channels=3)
optimizer_G = torch.optim.Adam(generator.parameters(), lr=1e-4)

for step in range(1000):
    z = torch.randn(batch_size, 256)
    fake_images = generator(z)
    # Losses: maximize teacher confidence + minimize BatchNorm statistics mismatch
    teacher_out = teacher(fake_images)
    activation_loss = -teacher_out.max(dim=1)[0].mean()  # teacher should be confident
    # BN statistics matching
    bn_loss = compute_bn_statistics_loss(teacher, fake_images)
    total_loss = activation_loss + 0.1 * bn_loss
    optimizer_G.zero_grad()
    total_loss.backward()
    optimizer_G.step()
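compute_bn_statistics_loss above is left undefined. One possible sketch (the name and L2 distance are assumptions, following the BN-matching idea used in DAFL/DeepInversion-style methods) compares the batch statistics induced at each BatchNorm layer with that layer's stored running statistics:

```python
import torch
import torch.nn as nn

def compute_bn_statistics_loss(model, images):
    """Sum over BN layers of the distance between the batch statistics
    produced by `images` and the layer's running statistics."""
    losses = []
    hooks = []

    def make_hook(bn):
        def hook(module, inputs, output):
            x = inputs[0]
            # Per-channel mean/var over batch and spatial dims
            mean = x.mean(dim=(0, 2, 3))
            var = x.var(dim=(0, 2, 3), unbiased=False)
            losses.append(torch.norm(mean - bn.running_mean, 2) +
                          torch.norm(var - bn.running_var, 2))
        return hook

    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            hooks.append(module.register_forward_hook(make_hook(module)))
    model(images)
    for h in hooks:
        h.remove()
    return torch.stack(losses).sum()

# Toy usage with a tiny conv+BN "teacher"
teacher = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)).eval()
bn_loss = compute_bn_statistics_loss(teacher, torch.randn(4, 3, 16, 16))
```

Because the loss is differentiable with respect to the images, its gradient flows back into the generator.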
Data-free distillation yields lower quality than the full-data variant, but sometimes it is the only option.
Distillation for Mobile NLP Tasks
For mobile NLP tasks (review classification, intent detection, summarization), a common pattern is to distill from GPT-4 / Claude API responses into a small BERT/DistilBERT student.
# Collect soft labels from the teacher (GPT-4 API):
# for each training example, request class probabilities
# and save them as training labels for the student.
# Student — DistilBERT fine-tuned on these soft labels.
DistilBERT (66 MB; 18 MB as int8 ONNX) runs on-device in 30–80 ms on iOS/Android. GPT-4 in the cloud takes hundreds of milliseconds and costs money per request.
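The core of that fine-tuning step is a cross-entropy against the cached teacher distributions instead of one-hot labels. A minimal sketch, where the tensors and the linear "student head" are stand-ins for DistilBERT embeddings and its classifier:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes = 3

# Cached teacher soft labels, e.g. parsed from API responses
teacher_probs = torch.tensor([[0.80, 0.15, 0.05],
                              [0.10, 0.70, 0.20]])

# Stand-in for DistilBERT: any module producing class logits
student_head = torch.nn.Linear(16, num_classes)
features = torch.randn(2, 16)   # would be DistilBERT [CLS] embeddings

logits = student_head(features)
log_probs = F.log_softmax(logits, dim=-1)

# Soft-label cross-entropy: -sum_c p_teacher(c) * log p_student(c)
loss = -(teacher_probs * log_probs).sum(dim=-1).mean()
loss.backward()   # gradients flow into the student
```

The same loss can be mixed with a hard-label term and a temperature, exactly as in the classification example above.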
Process
Define the student architecture from the resource budget → configure distillation (temperature, alpha, intermediate layers) → train on GPU → verify accuracy against the teacher → convert to Core ML / TFLite → take final on-device measurements.
Timeline Estimates
Basic logit distillation for classification takes 2–4 weeks (GPU time plus hyperparameter tuning). Full distillation with intermediate layers, non-standard architectures, and data augmentation: 5–10 weeks.