Fine-Tuning LLMs with LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning method in which the original model weights are frozen and small low-rank matrices are trained alongside them. The method was proposed in 2021 (Hu et al., Microsoft Research) and has become the de facto standard for LLM fine-tuning. LoRA makes it possible to fine-tune a 7B model on a single A100 40GB GPU instead of several, with minimal quality loss compared to full fine-tuning for most tasks.
LoRA Mathematics
For a weight matrix W ∈ R^(d×k), LoRA adds the product of two trainable matrices:
W' = W + ΔW = W + BA
where B ∈ R^(d×r), A ∈ R^(r×k), r << min(d, k)
Rank r is the key hyperparameter. At r=16 and d=k=4096 (typical attention projection sizes in a 7B model), the trainable parameters in one layer number 16×4096 + 4096×16 = 131,072 instead of 4096×4096 = 16,777,216, a 128× reduction.
At initialization, A is random Gaussian and B is zero. This ensures ΔW = 0 at the start, so the model begins with its original behavior.
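The parameter counting and initialization above can be checked in a few lines of numpy (a toy stand-in: W here is random rather than a pretrained weight):

```python
import numpy as np

d, k, r = 4096, 4096, 16

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))            # frozen pretrained weight (stand-in)

# LoRA initialization: A ~ Gaussian, B = 0
A = rng.normal(scale=0.01, size=(r, k))
B = np.zeros((d, r))

# At initialization ΔW = BA = 0, so W' equals W exactly
delta_W = B @ A
assert np.allclose(W + delta_W, W)

# Trainable parameters vs. full fine-tuning of this layer
lora_params = B.size + A.size          # 16*4096 + 4096*16
full_params = W.size                   # 4096*4096
print(lora_params, full_params, full_params // lora_params)  # 131072 16777216 128
```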
LoRA Configuration: Key Hyperparameters
```python
from peft import LoraConfig

config = LoraConfig(
    r=16,                          # Rank: 4, 8, 16, 32, 64, 128
    lora_alpha=32,                 # Scale: usually 2*r
    target_modules=[               # Which layers to adapt
        "q_proj", "v_proj",        # Minimum
        "k_proj", "o_proj",        # Extended variant
        "gate_proj", "up_proj", "down_proj",  # Including MLP
    ],
    lora_dropout=0.05,             # Adapter regularization
    bias="none",                   # "none", "all", "lora_only"
    task_type="CAUSAL_LM",
    modules_to_save=["embed_tokens", "lm_head"],  # Trained fully
)
```
Choosing r: the harder the task and the greater the domain's divergence from the pretraining data, the higher r should be. For classification and formatting: r=4–8. For generation in a specific style: r=16–32. For complex domain adaptation: r=64–128.
lora_alpha: controls the adapter's scale. The LoRA update is multiplied by alpha/r, so the effective adapter learning rate is lr × (alpha/r). Standard practice: alpha = 2r.
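The scaling can be made concrete with a toy forward pass (random matrices and small dimensions, purely illustrative):

```python
import numpy as np

d, k, r = 64, 64, 16
alpha = 32
scaling = alpha / r                     # = 2.0, the alpha = 2*r convention

rng = np.random.default_rng(1)
W = rng.normal(size=(d, k))             # frozen base weight
A = rng.normal(scale=0.01, size=(r, k))
B = rng.normal(scale=0.01, size=(d, r))
x = rng.normal(size=(k,))

# LoRA forward pass: base output plus the scaled low-rank update
y = W @ x + scaling * (B @ (A @ x))

# Equivalent to multiplying by the merged weight W + (alpha/r) * BA
assert np.allclose(y, (W + scaling * (B @ A)) @ x)
```

Note that doubling r while keeping alpha fixed halves the scaling factor; keeping alpha = 2r holds the effective scale constant across ranks.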
DoRA: LoRA Improvement
DoRA (Weight-Decomposed Low-Rank Adaptation) separates the weight update into magnitude and direction components:
```python
config = LoraConfig(
    r=16,
    use_dora=True,  # Enables DoRA instead of standard LoRA
    ...
)
```
DoRA typically improves quality by 1–3% over standard LoRA without increasing inference costs.
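A simplified numpy sketch of the decomposition: the direction is the column-normalized LoRA-updated weight, and a learnable magnitude vector m (initialized from the column norms of W) rescales each column. Dimensions here are toy values, and training of m, A, B is omitted:

```python
import numpy as np

d, k, r = 32, 32, 4
rng = np.random.default_rng(2)

W = rng.normal(size=(d, k))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, k)) # low-rank factors (trainable)
B = rng.normal(scale=0.01, size=(d, r))

# Learnable magnitude, initialized to the column norms of W
m = np.linalg.norm(W, axis=0)           # shape (k,)

# Direction: the LoRA-updated weight, normalized column-wise
V = W + B @ A
direction = V / np.linalg.norm(V, axis=0)

# DoRA weight: per-column magnitude times unit direction
W_dora = direction * m

# Each column of W_dora now has exactly norm m[j]
assert np.allclose(np.linalg.norm(W_dora, axis=0), m)
```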
Practical Case: LoRA for NER Classification
Task: extract named entities from medical records (4 classes: MEDICATION, DOSAGE, CONDITION, PROCEDURE).
Base model: Llama 3.1 8B Instruct.
Configuration: r=16, alpha=32, target_modules=["q_proj","v_proj"], 3 epochs.
Dataset: 2200 examples, A100 40GB, QLoRA 4-bit, training time 2.5 hours.
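The setup above can be sketched in code. The 4-bit quantization settings (nf4, bfloat16 compute) are typical QLoRA defaults assumed here, not stated in the case description:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# QLoRA: base model weights quantized to 4-bit (assumed nf4/bf16 settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA configuration as given in the case study
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# Pass bnb_config as quantization_config= to
# AutoModelForCausalLM.from_pretrained, then wrap with get_peft_model.
```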
| Metric | Base Model (5-shot) | LoRA r=8 | LoRA r=16 | LoRA r=32 |
|---|---|---|---|---|
| F1 MEDICATION | 0.71 | 0.88 | 0.91 | 0.92 |
| F1 DOSAGE | 0.64 | 0.83 | 0.87 | 0.88 |
| F1 CONDITION | 0.79 | 0.91 | 0.94 | 0.94 |
| F1 PROCEDURE | 0.68 | 0.85 | 0.89 | 0.90 |
The gap between r=16 and r=32 is negligible, so r=16 is the optimal choice here.
Merging Adapter for Deployment
A LoRA adapter can be merged into the base model to simplify inference:
```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct"
)
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Merge: the result is a regular model without PEFT overhead
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
```
After merging, the model's inference speed is identical to that of a fully fine-tuned model: the LoRA overhead at inference disappears.
Timeline
- Data preparation: 2–4 weeks
- Training (7B, LoRA, A100 40GB): 2–8 hours
- Hyperparameter iterations: 3–5 days
- Total: 3–6 weeks