Preparing Datasets for LLM Fine-tuning
Dataset quality is the single biggest success factor in fine-tuning. "Garbage in, garbage out" applies doubly to LLMs: poorly structured or irrelevant examples don't just fail to help, they actively degrade the model. A thousand high-quality examples beat a hundred thousand poor ones.
Dataset Formats for Fine-tuning
Instruction following (Alpaca format):
{"instruction": "Translate to English", "input": "Bonjour le monde", "output": "Hello world"}
{"instruction": "Write SQL query", "input": "Select all users older than 30", "output": "SELECT * FROM users WHERE age > 30;"}
Chat format (ShareGPT/ChatML):
{
  "conversations": [
    {"from": "system", "value": "You are an SQL assistant"},
    {"from": "human", "value": "How do I select unique values?"},
    {"from": "gpt", "value": "Use SELECT DISTINCT: `SELECT DISTINCT column FROM table;`"}
  ]
}
Completion format (simple):
{"text": "### Question: What is RLHF?\n### Answer: RLHF (Reinforcement Learning from Human Feedback) is a method..."}
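These formats are largely interchangeable, and it is common to normalize everything into one of them before training. A minimal sketch of lifting an Alpaca-style record into the ShareGPT layout shown above (the default system prompt is an assumption):

```python
def alpaca_to_chat(record: dict, system: str = "You are a helpful assistant.") -> dict:
    """Convert an Alpaca record into a ShareGPT-style conversation."""
    user_text = record["instruction"]
    if record.get("input"):
        # Alpaca keeps task and context separate; chat formats merge them
        user_text += "\n\n" + record["input"]
    return {
        "conversations": [
            {"from": "system", "value": system},
            {"from": "human", "value": user_text},
            {"from": "gpt", "value": record["output"]},
        ]
    }

example = {"instruction": "Write SQL query",
           "input": "Select all users older than 30",
           "output": "SELECT * FROM users WHERE age > 30;"}
chat = alpaca_to_chat(example)
```

The reverse direction is lossier: a multi-turn conversation does not map cleanly back onto a single instruction/input/output triple.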
Dataset Volume Requirements
| Task | Minimum Examples | Optimal |
|---|---|---|
| Tone/style transfer | 500-1000 | 2000-5000 |
| Domain adaptation | 1000-3000 | 5000-15000 |
| Task-specific (Q&A) | 500-2000 | 3000-10000 |
| Code generation | 2000-5000 | 10000-50000 |
| Multi-turn dialogue | 1000-3000 | 5000-20000 |
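The minimums above are easy to enforce mechanically before launching a run. A sketch (the numbers come from the table; the task keys are illustrative, not a standard taxonomy):

```python
# Minimum example counts per task type, taken from the table above.
MIN_EXAMPLES = {
    "style_transfer": 500,
    "domain_adaptation": 1000,
    "task_specific_qa": 500,
    "code_generation": 2000,
    "multi_turn_dialogue": 1000,
}

def check_volume(task: str, n_examples: int) -> str:
    """Warn if the dataset is below the minimum for its task type."""
    minimum = MIN_EXAMPLES[task]
    if n_examples < minimum:
        return f"Need at least {minimum} examples for {task}, have {n_examples}"
    return "OK"
```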
Good Example Structure
from dataclasses import dataclass

@dataclass
class FineTuningExample:
    instruction: str  # Clear task without ambiguity
    input: str        # Specific context/data (optional)
    output: str       # Ideal model response

    def validate(self) -> list[str]:
        issues = []
        if len(self.output) < 10:
            issues.append("Output too short")
        if len(self.output) > 2000:
            issues.append("Output may be too long for this task")
        if self.output in ["I don't know", "N/A", ""]:
            issues.append("Uninformative output")
        # Check for prompt leakage in response
        if self.instruction.lower()[:20] in self.output.lower():
            issues.append("Output contains instruction text")
        return issues
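Applied across a whole dataset, the same checks become a filter that separates clean examples from rejects with reasons attached. A sketch over plain dicts whose keys mirror the dataclass fields (only two of the checks are reproduced here):

```python
def filter_valid(examples: list[dict]) -> tuple[list[dict], list[tuple[dict, list[str]]]]:
    """Split examples into clean ones and rejected ones with reasons."""
    clean, rejected = [], []
    for ex in examples:
        issues = []
        output = ex.get("output", "")
        if len(output) < 10:
            issues.append("Output too short")
        if output in ("I don't know", "N/A", ""):
            issues.append("Uninformative output")
        if issues:
            rejected.append((ex, issues))
        else:
            clean.append(ex)
    return clean, rejected
```

Keeping the rejects with their reasons, rather than silently dropping them, makes it possible to fix and recycle examples instead of losing them.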
Train/Eval Split
from sklearn.model_selection import train_test_split

def split_dataset(examples: list, eval_ratio: float = 0.1) -> tuple:
    # Stratified split by output length so eval mirrors the length distribution
    short = [e for e in examples if len(e.output) < 200]
    medium = [e for e in examples if 200 <= len(e.output) < 500]
    long = [e for e in examples if len(e.output) >= 500]
    train, eval_set = [], []
    for group in [short, medium, long]:
        if len(group) > 1:
            tr, ev = train_test_split(group, test_size=eval_ratio, random_state=42)
            train.extend(tr)
            eval_set.extend(ev)
        elif group:
            # A single example can't be split; keep it in train
            train.extend(group)
    return train, eval_set
Deduplication
import hashlib
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate_exact(examples: list) -> list:
    """Exact deduplication by hash"""
    seen = set()
    unique = []
    for ex in examples:
        h = hashlib.md5(f"{ex.instruction}{ex.input}".encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(ex)
    return unique

def deduplicate_semantic(examples: list, threshold: float = 0.95) -> list:
    """Semantic deduplication (removes near-duplicates)"""
    model = SentenceTransformer('all-MiniLM-L6-v2')
    texts = [f"{e.instruction} {e.input}" for e in examples]
    embeddings = model.encode(texts, batch_size=512, show_progress_bar=True)
    # Compute the full similarity matrix once instead of one pair at a time
    sims = cosine_similarity(embeddings)
    keep = [True] * len(examples)
    for i in range(len(examples)):
        if not keep[i]:
            continue
        for j in range(i + 1, len(examples)):
            if keep[j] and sims[i][j] > threshold:
                keep[j] = False
    return [ex for ex, k in zip(examples, keep) if k]
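When an embedding model is too heavy for the job, character n-gram Jaccard similarity is a cheap near-duplicate proxy that catches trivial paraphrases. A sketch (the 0.8 threshold is an assumption to tune per dataset):

```python
def ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two strings' character trigram sets."""
    ga, gb = ngrams(a), ngrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def deduplicate_jaccard(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Greedily keep each text only if it is unlike everything kept so far."""
    keep = []
    for t in texts:
        if all(jaccard(t, k) < threshold for k in keep):
            keep.append(t)
    return keep
```

This is quadratic like the embedding version but needs no model or GPU; it misses semantic duplicates with different wording, which is exactly what the embedding approach is for.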
Final Checklist Before Training
- No duplicates (exact and near-duplicate)
- No PII in dataset (names, emails, phones)
- Output does not contain date/version references ("as of 2023")
- Even distribution across task types
- Eval set does not overlap with train
- Tokenized examples do not exceed model max_length (without truncation)
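The PII item is the easiest to automate as a first pass. A regex scan sketch (the patterns are illustrative and far from exhaustive; production pipelines usually add an NER model on top):

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_pii(text: str) -> dict[str, list[str]]:
    """Return matches per PII category (empty dict means none found)."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits
```

Running the scan over every instruction, input, and output before training flags examples for manual review rather than deleting them automatically, since regexes produce false positives.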