DeepSeek Language Model Fine-Tuning

We design and deploy artificial intelligence systems, from prototype to production-ready solution. Our team combines expertise in machine learning, data engineering, and MLOps to make AI work not just in the lab but in real business.

Fine-Tuning DeepSeek Language Models

DeepSeek is a family of open-source language models from the Chinese company DeepSeek AI, released under the MIT license. DeepSeek-V3 and DeepSeek-R1 are the current flagship models, competing with GPT-4o and Claude 3.5 Sonnet on most benchmarks at a significantly lower inference cost. Open weights and high quality make DeepSeek attractive for enterprise fine-tuning scenarios.

DeepSeek Family: Model Navigation

Model                         | Parameters              | Architecture           | Application
DeepSeek-V3                   | 671B (MoE, ~37B active) | MoE                    | Flagship, general purpose
DeepSeek-R1                   | 671B (MoE)              | MoE + Chain-of-Thought | Reasoning, mathematics
DeepSeek-R1-Distill-Llama-70B | 70B                     | Dense                  | Reasoning, more accessible
DeepSeek-R1-Distill-Llama-8B  | 8B                      | Dense                  | Lightweight reasoning
DeepSeek-R1-Distill-Qwen-32B  | 32B                     | Dense                  | Quality/resource balance
DeepSeek-Coder-V2             | 236B (MoE)              | MoE                    | Code generation

For practical fine-tuning, distilled versions (8B, 32B, 70B) are more commonly used — they train on regular GPU clusters and deliver good results for specialized tasks.

Architectural Feature: Multi-head Latent Attention (MLA)

DeepSeek-V3 uses MLA, an attention mechanism that compresses the KV-cache. Compared to GQA (Grouped Query Attention, used in Llama), MLA reduces the KV-cache by 5–13× at comparable quality. This is critical for long-context inference: DeepSeek supports 128K tokens with reasonable memory requirements.

When fine-tuning, MLA layers are handled by peft as usual, but target_modules must match DeepSeek's naming: in DeepSeek-V3 the attention projections are called q_proj, kv_a_proj_with_mqa, kv_b_proj, and o_proj.
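A small lookup table makes the naming difference concrete (the V3 names are those listed above; the distill models follow standard Llama/Qwen conventions). This is an illustrative helper, not a library function; its output is what you would pass to LoraConfig(target_modules=...):

```python
# Attention projection names differ between the MLA architecture (DeepSeek-V3)
# and the GQA-based distill checkpoints (Llama/Qwen naming).
ATTN_TARGET_MODULES = {
    "deepseek-v3": ["q_proj", "kv_a_proj_with_mqa", "kv_b_proj", "o_proj"],
    "r1-distill": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

def attn_targets(model_name: str) -> list:
    """Pick attention target_modules for LoraConfig from the checkpoint name."""
    key = "deepseek-v3" if "V3" in model_name else "r1-distill"
    return ATTN_TARGET_MODULES[key]

print(attn_targets("deepseek-ai/DeepSeek-V3"))
print(attn_targets("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"))
```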

LoRA Fine-Tuning DeepSeek-R1-Distill-Qwen-32B

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # required for DeepSeek
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 160,432,128 || all params: 32,783,822,848 || trainable%: 0.49

Working with Chain-of-Thought When Fine-Tuning R1

DeepSeek-R1 generates its reasoning inside special <think>...</think> tags before the final answer. When fine-tuning on specialized data, it is important to:

  1. Preserve the think-blocks in training examples; this maintains the model's reasoning ability
  2. Do not suppress the <think> token in prompts during training
  3. Strip the reasoning at inference if it is not needed: use <think>\n\n</think> as a prefix or adjust the stopping criteria
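Point 3 can also be handled post hoc. A minimal sketch that strips the reasoning block from generated text (a regex-based assumption; exact tag handling may vary between model versions):

```python
import re

# Matches a <think>...</think> block plus any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text: str) -> str:
    """Remove <think>...</think> reasoning blocks from model output."""
    return THINK_RE.sub("", text).strip()

print(strip_think("<think>\nstep-by-step reasoning\n</think>\n\nFinal answer"))
# → Final answer
```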

An example training record for R1:

{
  "messages": [
    {"role": "user", "content": "Is there a contradiction between section 3.2 and section 7.1 of the contract?"},
    {"role": "assistant", "content": "<think>\nSection 3.2 sets payment deadline — 30 days from delivery. Section 7.1 grants supplier the right to suspend deliveries if payment is overdue by more than 15 days. No contradiction — section 7.1 creates a right that arises before the general payment deadline expires.\n</think>\n\nNo direct contradiction between sections..."}
  ]
}
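When assembling a dataset of such records, it is worth verifying point 1 mechanically. A small illustrative helper (not part of any library) that checks every assistant turn opens with a think-block:

```python
def has_think_block(record: dict) -> bool:
    """True if every assistant message in the record starts with <think>...</think>."""
    for msg in record["messages"]:
        if msg["role"] == "assistant":
            content = msg["content"].lstrip()
            if not (content.startswith("<think>") and "</think>" in content):
                return False
    return True
```

Running this over the whole dataset before training catches records where the reasoning was accidentally dropped during preprocessing.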

Practical Case: Reasoning Model for Financial Audit

Task: fine-tune DeepSeek-R1-Distill-Qwen-32B for detecting violations in financial documents — inconsistencies between balance sheet items, double-entry violations, anomalous transactions.

Dataset: 2,100 examples, each consisting of a financial document fragment, a think-block with step-by-step auditor reasoning, and a final conclusion. The data was prepared together with practicing auditors.

Training: QLoRA (r=32), 3 epochs, 4×A100 40GB, 18 hours.

Results:

  • Violation detection precision: 0.61 → 0.89
  • Recall (doesn't miss violations): 0.54 → 0.84
  • F1: 0.57 → 0.87
  • Reasoning quality (auditor evaluation, 1–5): 2.8 → 4.3
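As a sanity check, F1 is the harmonic mean of precision and recall; computed from the rounded precision/recall figures above, it agrees with the reported values to within rounding:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f"before: {f1(0.61, 0.54):.2f}")  # 0.57
print(f"after:  {f1(0.89, 0.84):.2f}")  # 0.86, vs. reported 0.87 (inputs are rounded)
```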

Inference via vLLM with MoE Support

Full-size DeepSeek-V3/R1 requires a dedicated vLLM configuration:

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",
    tensor_parallel_size=8,   # 8×H100 for full model
    trust_remote_code=True,
    max_model_len=65536,
    dtype="bfloat16",
)

For the distilled models (8B, 32B), 1–4 GPUs are sufficient.
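The GPU counts follow from a back-of-the-envelope weight-memory estimate (illustrative assumptions: bf16 weights at 2 bytes per parameter, 80 GB cards, 20% headroom; KV-cache and activations come on top, so treat this as a lower bound):

```python
import math

def min_gpus(params_b: float, gpu_mem_gb: float = 80,
             bytes_per_param: float = 2, overhead: float = 1.2) -> int:
    """Lower bound on GPU count from weight memory alone (bf16, +20% headroom)."""
    weights_gb = params_b * bytes_per_param  # N billion params * bytes each ≈ GB
    return math.ceil(weights_gb * overhead / gpu_mem_gb)

print(min_gpus(8))   # 1: R1-Distill-Llama-8B fits on a single 80 GB GPU
print(min_gpus(32))  # 1: the 32B distill also fits, with little headroom
print(min_gpus(70))  # 3: the 70B distill needs multi-GPU in bf16
```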

Project Timeline

  • Dataset preparation with think-blocks: 3–8 weeks (significantly more complex than standard SFT)
  • Training (32B, 4×A100): 12–24 hours
  • Reasoning quality evaluation: 2 weeks (requires expert evaluation)
  • Deployment and monitoring: 1–2 weeks
  • Total: 7–14 weeks