Implementing LoRA Adaptation for LLM in Mobile App
Full fine-tuning Llama 3 8B requires 80 GB GPU memory and days of training. LoRA (Low-Rank Adaptation) achieves comparable quality by freezing original weights and training only small adapter matrices. In practice — A100 40GB instead of clusters, hours instead of days, and 50–300 MB adapter instead of 16 GB checkpoint.
How LoRA Works Technically
Original weight matrix W sized d × k unchanged. Instead, train two matrices: A sized d × r and B sized r × k, where r — adaptation rank (hyperparameter, usually 8–64). At inference: W_new = W + α * (A × B), where α — scaling coefficient.
Key hyperparameters:
-
r(rank) — higher means more parameters trained, costlier adaptation.r=16— reasonable start -
lora_alpha— usually2rorr. Controls adaptation "strength" when merging weights -
target_modules— which layers to adapt. For transformers:q_proj, v_proj, k_proj, o_projand optionallygate_proj, up_proj, down_proj -
lora_dropout— regularization, 0.05–0.1 for small datasets
Training: Unsloth + Hugging Face PEFT
Unsloth accelerates LoRA training 2–5x vs pure PEFT via custom CUDA kernels:
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
max_seq_length=2048,
dtype=torch.float16,
load_in_4bit=True # QLoRA: 4-bit quantization + LoRA
)
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=32,
lora_dropout=0.05,
bias="none",
use_gradient_checkpointing="unsloth"
)
QLoRA — LoRA atop 4-bit quantization of base model. Llama 3 8B in 4-bit uses ~5 GB VRAM vs 16 GB in fp16. Minimum GPU for QLoRA training — RTX 3090 (24 GB) or rented A100 on RunPod/Lambda Labs.
Adapter Deployment: Server vs On-device
After training, adapter saved separately from base model. Two integration paths with mobile app:
Server deployment via vLLM or Ollama. Base model on server, adapter applied at init or runtime. Mobile app works via API endpoint — no model burden on device.
# vLLM with LoRA adapter
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--lora-modules my-adapter=/path/to/lora/adapter
On-device via llama.cpp / Core ML. Only possible for small models with weight merging (merge + GGUF). For mobile realistic: Llama 3.2 3B or Phi-3.5-mini 3.8B with LoRA adapter merged into GGUF Q4_K_M. Final model size — 2–3 GB, fits iPhone 14+ and Galaxy S23+.
# Weight merging before GGUF export
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
# Then: llama.cpp convert + quantize → .gguf file
On iOS, such GGUF runs via llama.swift or MLModel (convert to Core ML via coremltools). On Android — llama.cpp via JNI or MediaPipe LLM Inference API for Gemma models.
Common LoRA Adaptation Mistakes
Wrong target_modules. Adapting only q_proj, v_proj while skipping gate_proj and up_proj in MLP blocks — weak effect. For instruction-following tasks, adapt all projection layers.
Too small dataset. LoRA with 50–100 examples overfits faster than improves. For domain adaptation need minimum 300–500 diverse examples.
Base not frozen at merge. After merge_and_unload(), verify original weights unchanged vs base model — signals proper LoRA operation.
Timeline Estimates
Training dataset preparation — 1–2 weeks. Environment setup (RunPod + Unsloth) and training launch — 1–2 days. Adapter conversion and testing — 2–3 days. Server API integration in mobile app — 2–4 days. Full cycle — 2 to 4 weeks.







