Fine-tuning LLMs with Instruction Tuning
Instruction Tuning is a fine-tuning method that trains a language model on a set of "instruction → response" pairs, teaching it to follow natural-language instructions. This step turns a base LLM (pretrained for next-token prediction) into an Instruct-model capable of completing tasks. Most public Instruct-models (Llama Instruct, Mistral Instruct, Qwen Instruct) are created using this method.
Base vs Instruct: fundamental difference
Base LLM (Llama 3.1 8B): continues text. Given the beginning of a sentence, it will continue it, but it won't answer questions the way an assistant does.
Instruct LLM (Llama 3.1 8B Instruct): follows instructions. Can answer questions, complete tasks, refuse harmful content.
When fine-tuning a corporate model, we usually start from an existing Instruct-version and adapt it to the domain. Sometimes, though, full Instruction Tuning from scratch is needed, for example when working with a base model or when the default behavior has to be redefined.
Instruction Tuning formats
Alpaca format (simple):
{
"instruction": "Translate the text from English to Russian",
"input": "The contract must be signed before the deadline",
"output": "Договор должен быть подписан до истечения срока"
}
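Alpaca records usually have to be turned into chat messages before a chat template can be applied. The sketch below shows one way to do that. The function name `alpaca_to_messages` and the choice to join `instruction` and `input` with a blank line are illustrative assumptions, not part of the Alpaca format itself.

```python
from typing import Optional


def alpaca_to_messages(record: dict, system_prompt: Optional[str] = None) -> list:
    """Convert one Alpaca-style record into a list of chat messages."""
    instruction = record["instruction"].strip()
    # "input" is optional in Alpaca data; append it only when it is non-empty.
    extra = (record.get("input") or "").strip()
    user_content = f"{instruction}\n\n{extra}" if extra else instruction

    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_content})
    messages.append({"role": "assistant", "content": record["output"].strip()})
    return messages


record = {
    "instruction": "Translate the text from English to Russian",
    "input": "The contract must be signed before the deadline",
    "output": "Договор должен быть подписан до истечения срока",
}
messages = alpaca_to_messages(record)
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list has the same `role`/`content` shape that chat templates expect, so single-turn Alpaca data and multi-turn data can share one training pipeline.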
ShareGPT format (multi-turn dialog):
{
"conversations": [
{"from": "human", "value": "Analyze the company balance sheet"},
{"from": "gpt", "value": "To analyze the balance sheet, we need the following indicators..."},
{"from": "human", "value": "How to interpret the asset ratio?"},
{"from": "gpt", "value": "The ratio of current to long-term assets shows..."}
]
}
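ShareGPT records differ from the `role`/`content` message format mainly in key names and speaker labels, so conversion is a simple mapping. The following is a minimal sketch. The `ROLE_MAP` table and the function name are assumptions for illustration, and real ShareGPT dumps may contain other speaker labels that would need to be added.

```python
# Assumed mapping from ShareGPT speaker labels to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record: dict) -> list:
    """Convert a ShareGPT-style record into role/content chat messages."""
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            # Fail loudly instead of silently dropping an unknown speaker.
            raise ValueError(f"Unknown speaker label: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return messages


record = {
    "conversations": [
        {"from": "human", "value": "Analyze the company balance sheet"},
        {"from": "gpt", "value": "To analyze the balance sheet, we need the following indicators..."},
        {"from": "human", "value": "How to interpret the asset ratio?"},
        {"from": "gpt", "value": "The ratio of current to long-term assets shows..."},
    ]
}
messages = sharegpt_to_messages(record)
print([m["role"] for m in messages])
```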
Chat Template format (modern standard):
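# Example for the Llama 3 chat template
from transformers import AutoTokenizer

# Assumption: the model ID is illustrative; any tokenizer that ships a chat template works
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a financial analysis assistant"},
    {"role": "user", "content": "Calculate ROE"},
    {"role": "assistant", "content": "ROE = Net Income / Equity × 100%..."},
]
# Apply the chat template to build one training string
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    # False for training data: the conversation already ends with the assistant reply.
    # Use True only at inference time, when the last message comes from the user.
    add_generation_prompt=False,
)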
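To make it clearer what a chat template does, here is a hand-written approximation of the Llama 3 layout. It is a sketch for illustration only: the function `render_llama3_style` is hypothetical, the authoritative template is the one stored with the tokenizer, and real templates (Llama 3.1 in particular) may add extra system text that this version omits.

```python
def render_llama3_style(messages: list, add_generation_prompt: bool = False) -> str:
    """Approximate the Llama 3 chat layout with plain string formatting."""
    text = "<|begin_of_text|>"
    for m in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn token.
        text += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so that the model writes the reply.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


messages = [
    {"role": "system", "content": "You are a financial analysis assistant"},
    {"role": "user", "content": "Calculate ROE"},
    {"role": "assistant", "content": "ROE = Net Income / Equity"},
]

# Training string: the full conversation, including the assistant reply.
training_text = render_llama3_style(messages)
# Inference prompt: everything up to the point where the assistant should start writing.
inference_prompt = render_llama3_style(messages[:-1], add_generation_prompt=True)
print(training_text)
```

In this sketch the inference prompt is an exact prefix of the training string, which is the property that makes it possible to separate prompt tokens from response tokens later.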
Data volume for Instruction Tuning
The LIMA study (Less Is More for Alignment, 2023) showed that a model fine-tuned on 1,000 carefully curated examples can be competitive with models trained on much larger sets, such as the 52,000 examples in Alpaca. Quality matters more than quantity.
Guidelines for specialized Instruction Tuning:
| Task | Minimum volume (examples) | Optimal volume (examples) |
|---|---|---|
| Style specialization | 100–300 | 500–1000 |
| New domain (moderate complexity) | 500–1000 | 2000–5000 |
| Complex technical domain | 1000–2000 | 5000–15000 |
| Base behavior change | 2000–5000 | 10000–50000 |
Instruction Tuning for corporate assistant
Task: fine-tune Llama 3.1 8B on corporate communication standards of an IT company — official tone, use of accepted abbreviations and terminology, response structure using templates.
Dataset: 1800 examples — real internal correspondence, converted to instruction/response pairs.
Notable feature: the dataset includes negative examples, that is, requests the model should learn to decline (competitor requests, employee personal data).
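from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

# Assumes `model` (the loaded Llama 3.1 8B) and `formatted_dataset`
# (a dataset with a "text" column of chat-formatted examples) are defined earlier.
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./corporate-instruct",
        num_train_epochs=4,
        learning_rate=2e-4,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        max_seq_length=2048,  # some newer TRL releases name this option max_length
        bf16=True,
        # Column that holds the formatted text. This option does not mask the prompt;
        # completion-only loss has to be configured separately.
        dataset_text_field="text",
    ),
    train_dataset=formatted_dataset,
    peft_config=LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    ),
)
trainer.train()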
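For a sense of scale, the batch settings in the configuration above imply a fairly small number of optimizer steps on an 1800-example dataset. The arithmetic below is a rough sketch that assumes a single GPU, no sequence packing, and that the last partial batch is kept. The exact count reported by the trainer can differ slightly depending on library version and rounding.

```python
import math

num_examples = 1800        # dataset size from the case described above
num_epochs = 4
per_device_batch_size = 4
grad_accum_steps = 4
num_devices = 1            # assumption: single-GPU training

# Number of examples that contribute to one optimizer update.
effective_batch = per_device_batch_size * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
```

With only a few hundred updates in total, warmup length and evaluation frequency are usually worth setting as a number of steps that is small relative to this total.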
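Important: standard practice in Instruction Tuning is to mask the instruction portion during loss calculation, so that loss is computed only on response tokens. In TRL this has been handled by DataCollatorForCompletionOnlyLM; newer TRL releases deprecate that collator in favor of SFTConfig options such as completion_only_loss, so check the documentation for the version you have installed.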
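The idea behind prompt masking can be shown without any library. In the sketch below, prompt positions receive the label -100, which is the default `ignore_index` of PyTorch's cross-entropy loss, so only response tokens contribute to the loss. The token IDs and the helper `build_labels` are made up for illustration; this is not TRL's implementation.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids: list, completion_ids: list) -> tuple:
    """Return (input_ids, labels) with every prompt position masked out."""
    input_ids = list(prompt_ids) + list(completion_ids)
    # Prompt tokens are ignored by the loss; completion tokens are kept as targets.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


# Hypothetical token IDs: 4 prompt tokens followed by 3 response tokens.
prompt_ids = [101, 7, 8, 9]
completion_ids = [42, 43, 2]

input_ids, labels = build_labels(prompt_ids, completion_ids)
trained_positions = sum(1 for label in labels if label != IGNORE_INDEX)

print(input_ids)
print(labels)
print("positions contributing to loss:", trained_positions)
```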
Results:
- Adherence to corporate tone (LLM-judge, 1–5): 2.9 → 4.4
- Correct use of domain terminology: 61% → 87%
- Appropriate refusals: 34% → 89%
- Unwanted refusals (false rejections): 8% → 2%
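The source does not state how the two refusal metrics were computed. One plausible definition, sketched below, measures appropriate refusals over the requests that should be declined and false rejections over the requests that should be answered. The field names `should_refuse` and `refused` and the toy evaluation data are assumptions.

```python
def refusal_metrics(results: list) -> dict:
    """Compute refusal rates from labeled evaluation results.

    Each item is a dict with two booleans:
      should_refuse: the request belongs to a category that must be declined
      refused:       the model actually declined
    """
    must_decline = [r for r in results if r["should_refuse"]]
    must_answer = [r for r in results if not r["should_refuse"]]

    appropriate = sum(r["refused"] for r in must_decline) / len(must_decline) if must_decline else 0.0
    false_rejections = sum(r["refused"] for r in must_answer) / len(must_answer) if must_answer else 0.0
    return {
        "appropriate_refusal_rate": appropriate,
        "false_rejection_rate": false_rejections,
    }


# Toy data: 4 requests that should be declined, 4 that should be answered.
toy_results = (
    [{"should_refuse": True, "refused": True}] * 3
    + [{"should_refuse": True, "refused": False}]
    + [{"should_refuse": False, "refused": False}] * 3
    + [{"should_refuse": False, "refused": True}]
)
metrics = refusal_metrics(toy_results)
print(metrics)
```

Tracking both numbers together matters because training on refusal examples can raise the first rate while also making the model decline legitimate requests.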
Building instruction dataset from unlabeled documents
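import json

# Pipeline: document → instruction examples
def document_to_instructions(doc_text: str, llm_client) -> list:
    """Converts a corporate document into training examples."""
    response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            # Only the first 3000 characters of the document are sent
            "content": f"""Create 10 training examples for an LLM from the following document.
Each example: {{"instruction": "task", "output": "correct answer based on document"}}.
Vary task types: questions, summarization, analysis, comparison.
Document:
{doc_text[:3000]}
Return a JSON array of examples."""
        }],
    )
    raw = response.choices[0].message.content
    # The reply may contain extra text around the JSON, so extract the array first
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("Model response does not contain a JSON array")
    return json.loads(raw[start:end + 1])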
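Generated examples generally need automatic filtering before manual verification. The sketch below drops malformed items, very short answers, and duplicate instructions. The function name `clean_examples` and the 20-character threshold are illustrative assumptions, and this kind of filter complements human review rather than replacing it.

```python
def clean_examples(raw_examples: list, min_output_chars: int = 20) -> list:
    """Keep well-formed, non-duplicate instruction/output pairs."""
    seen = set()
    cleaned = []
    for example in raw_examples:
        if not isinstance(example, dict):
            continue  # skip items that are not JSON objects
        instruction = str(example.get("instruction", "")).strip()
        output = str(example.get("output", "")).strip()
        if not instruction or len(output) < min_output_chars:
            continue  # skip empty instructions and answers that are too short
        # Normalize case and whitespace so near-identical instructions collapse.
        key = " ".join(instruction.lower().split())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"instruction": instruction, "output": output})
    return cleaned


raw = [
    {"instruction": "Summarize the vacation policy", "output": "Employees receive paid leave that must be requested in advance."},
    {"instruction": "summarize  the vacation policy", "output": "A duplicate of the first example with different spacing."},
    {"instruction": "What is the approval process?", "output": "Too short"},
    {"instruction": "", "output": "An answer without an instruction is not usable for training."},
    "not a dict",
]
cleaned = clean_examples(raw)
print(len(cleaned), "of", len(raw), "examples kept")
```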
Timeline
- Dataset design and source collection: 2–3 weeks
- Example generation and verification: 2–4 weeks
- Training and iterations: 1–2 weeks
- Total: 5–9 weeks







