LLM LoRA (Low-Rank Adaptation) Fine-Tuning

We design and deploy artificial intelligence systems, from prototype to production-ready solutions. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work not just in the lab but in real business.

LLM Fine-Tuning via LoRA Method (Low-Rank Adaptation)

LoRA is a parameter-efficient fine-tuning method in which the original model weights are frozen and small low-rank matrices are trained alongside them. The method was proposed in 2021 (Hu et al., Microsoft Research) and has become the de facto standard for LLM fine-tuning. LoRA makes it possible to fine-tune a 7B model on a single A100 40GB GPU instead of several, with minimal quality loss compared to full fine-tuning on most tasks.

LoRA Mathematics

For weight matrix W ∈ R^(d×k) LoRA adds product of two matrices:

W' = W + ΔW = W + BA
where B ∈ R^(d×r), A ∈ R^(r×k), r << min(d, k)

Rank r is the key hyperparameter. At r=16 and d=k=4096 (typical attention projection sizes in 7B model) trainable parameters in one layer: 16×4096 + 4096×16 = 131,072 instead of 4096×4096 = 16,777,216. This is 128× compression.
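The arithmetic above can be checked with a minimal sketch; `lora_params` is a hypothetical helper, not part of any library:

```python
# Trainable-parameter count for a LoRA adapter on one d x k weight matrix.
def lora_params(d: int, k: int, r: int) -> int:
    # B is d x r, A is r x k
    return d * r + r * k

d = k = 4096
full = d * k                       # full fine-tuning: 16,777,216 params
lora = lora_params(d, k, r=16)     # LoRA at r=16: 131,072 params
print(full, lora, full // lora)    # 16777216 131072 128
```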

During initialization, A is drawn from a random Gaussian and B is set to zero. This guarantees ΔW = 0 at the start, so the model begins with its original behavior.
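A minimal numpy sketch of this initialization scheme (the dimensions and scale here are illustrative, not taken from any particular model):

```python
import numpy as np

# Standard LoRA initialization: A ~ Gaussian, B = 0,
# so the update BA is exactly zero at step 0.
rng = np.random.default_rng(0)
d, k, r = 64, 64, 8
A = rng.normal(scale=0.02, size=(r, k))  # random Gaussian
B = np.zeros((d, r))                     # zeros
delta_W = B @ A
print(np.allclose(delta_W, 0))           # True: original behavior preserved
```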

LoRA Configuration: Key Hyperparameters

from peft import LoraConfig

config = LoraConfig(
    r=16,               # Rank: 4, 8, 16, 32, 64, 128
    lora_alpha=32,      # Scale: usually = 2*r
    target_modules=[    # Which layers to adapt
        "q_proj", "v_proj",          # Minimum
        "k_proj", "o_proj",          # Extended variant
        "gate_proj", "up_proj", "down_proj"  # MLP inclusive
    ],
    lora_dropout=0.05,  # Adapter regularization
    bias="none",        # "none", "all", "lora_only"
    task_type="CAUSAL_LM",
    modules_to_save=["embed_tokens", "lm_head"],  # Train fully
)

Choosing r: the harder the task and the larger the domain's divergence from the pretraining data, the higher r should be. For classification and formatting: r=4–8. For generation in a specific style: r=16–32. For complex domain adaptation: r=64–128.

lora_alpha: controls the adapter scale. The update BA is multiplied by alpha/r before being added to W, so raising alpha (or lowering r) amplifies the adapter's effective contribution. Standard practice: alpha = 2r.

DoRA: LoRA Improvement

DoRA (Weight-Decomposed Low-Rank Adaptation) separates weight update into magnitude and direction components:

config = LoraConfig(
    r=16,
    use_dora=True,  # Enables DoRA instead of standard LoRA
    ...
)

DoRA typically improves quality by 1–3% over standard LoRA without increasing inference costs, since the decomposition can be merged back into the weights after training.
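A sketch of the decomposition as described in the DoRA paper (this mirrors the idea, not the peft internals): the combined weight W₀ + BA is normalized column-wise into a direction, then rescaled by a learned magnitude vector m.

```python
import numpy as np

# DoRA weight decomposition: magnitude m times unit-norm direction.
rng = np.random.default_rng(2)
d, k, r = 16, 16, 4
W0 = rng.normal(size=(d, k))
A = rng.normal(scale=0.02, size=(r, k))
B = np.zeros((d, r))                   # LoRA init: B = 0

m = np.linalg.norm(W0, axis=0)         # magnitude, initialized from W0
V = W0 + B @ A                         # direction component
W_dora = m * V / np.linalg.norm(V, axis=0)
print(np.allclose(W_dora, W0))         # True at init, since B = 0
```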

Practical Case: LoRA for NER Classification

Task: extract named entities from medical records (4 classes: MEDICATION, DOSAGE, CONDITION, PROCEDURE).

Base model: Llama 3.1 8B Instruct.

Configuration: r=16, alpha=32, target_modules=["q_proj","v_proj"], 3 epochs.

Dataset: 2200 examples, A100 40GB, QLoRA 4-bit, training time 2.5 hours.

Metric         Base Model (5-shot)   LoRA r=8   LoRA r=16   LoRA r=32
F1 MEDICATION  0.71                  0.88       0.91        0.92
F1 DOSAGE      0.64                  0.83       0.87        0.88
F1 CONDITION   0.79                  0.91       0.94        0.94
F1 PROCEDURE   0.68                  0.85       0.89        0.90

The gap between r=16 and r=32 is negligible, so r=16 is optimal.

Merging Adapter for Deployment

LoRA adapter can be merged with base model to simplify inference:

from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base_model, "./lora-adapter")

# Merge: result is regular model without PEFT overhead
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")

After merging, the model is identical in inference speed to a fully fine-tuned one; the LoRA overhead at inference disappears.
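Why merging preserves outputs exactly can be seen in a small numpy sketch: folding the scaled update into the base weights gives W + (alpha/r)·BA, and a single matmul with the merged matrix reproduces the adapter forward pass.

```python
import numpy as np

# Merging the LoRA update into the base weights is output-equivalent.
rng = np.random.default_rng(3)
d, k, r, alpha = 32, 32, 8, 16
W = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))
B = rng.normal(size=(d, r))
x = rng.normal(size=(4, k))

scaling = alpha / r
W_merged = W + scaling * B @ A                    # one-time merge
h_adapter = x @ W.T + scaling * (x @ A.T) @ B.T   # base + adapter path
h_merged = x @ W_merged.T                         # single matmul, no overhead
print(np.allclose(h_adapter, h_merged))           # True
```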

Timeline

  • Data preparation: 2–4 weeks
  • Training (7B, LoRA, A100 40GB): 2–8 hours
  • Hyperparameter iterations: 3–5 days
  • Total: 3–6 weeks