Preparing Datasets for LLM Fine-tuning

Dataset quality is the main success factor in fine-tuning. "Garbage in, garbage out" applies doubly to LLMs: poorly structured or irrelevant examples do not merely fail to help, they actively degrade the model. 1,000 high-quality examples beat 100,000 poor ones.

Dataset Formats for Fine-tuning

Instruction following (Alpaca format):

{"instruction": "Translate to English", "input": "Bonjour le monde", "output": "Hello world"}
{"instruction": "Write SQL query", "input": "Select all users older than 30", "output": "SELECT * FROM users WHERE age > 30;"}

Chat format (ShareGPT; training frameworks typically render it into ChatML or another chat template):

{
  "conversations": [
    {"from": "system", "value": "You are an SQL assistant"},
    {"from": "human", "value": "How do I select unique values?"},
    {"from": "gpt", "value": "Use SELECT DISTINCT: `SELECT DISTINCT column FROM table;`"}
  ]
}

Completion format (simple):

{"text": "### Question: What is RLHF?\n### Answer: RLHF (Reinforcement Learning from Human Feedback) is a method..."}
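The three formats are interchangeable up to a prompt template. As a sketch, an Alpaca-format record can be rendered into the completion format roughly like this (the template mirrors the original Stanford Alpaca prompt; your training framework may expect a different one):

```python
def alpaca_to_text(record: dict) -> str:
    """Render an Alpaca-format record as a single completion string.

    The template follows the original Stanford Alpaca prompt; adjust it
    to whatever template your training framework expects.
    """
    if record.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an "
            "input that provides further context. Write a response that "
            "appropriately completes the request.\n\n"
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            f"### Response:\n{record['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['output']}"
    )
```

Whichever direction you convert, pick one template before collecting data and keep it fixed: mixing templates inside one dataset is itself a form of noise.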

Dataset Volume Requirements

Task                  Minimum examples   Optimal
Tone/style transfer   500-1000           2000-5000
Domain adaptation     1000-3000          5000-15000
Task-specific (Q&A)   500-2000           3000-10000
Code generation       2000-5000          10000-50000
Multi-turn dialogue   1000-3000          5000-20000

Good Example Structure

from dataclasses import dataclass

@dataclass
class FineTuningExample:
    instruction: str    # Clear task without ambiguity
    input: str          # Specific context/data (optional)
    output: str         # Ideal model response

    def validate(self) -> list[str]:
        issues = []
        if len(self.output) < 10:
            issues.append("Output too short")
        if len(self.output) > 2000:
            issues.append("Output may be too long for this task")
        if self.output in ["I don't know", "N/A", ""]:
            issues.append("Uninformative output")
        # Check for prompt leakage in response
        if self.instruction.lower()[:20] in self.output.lower():
            issues.append("Output contains instruction text")
        return issues
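A validator like this pays off when run over the whole dataset with issue counts reported, so that systematic problems stand out from one-off noise. A minimal sketch (the `audit` helper is illustrative, not part of any library; it takes the validation logic as a callable so it works with the dataclass above or any other record type):

```python
from collections import Counter

def audit(examples, validate) -> tuple[list, Counter]:
    """Split examples into clean ones and an issue-frequency report.

    `validate` is any callable returning a list of issue strings,
    e.g. `lambda ex: ex.validate()` for the dataclass above.
    """
    clean, issue_counts = [], Counter()
    for ex in examples:
        issues = validate(ex)
        if issues:
            issue_counts.update(issues)  # count each issue type
        else:
            clean.append(ex)
    return clean, issue_counts
```

Typical usage: `clean, report = audit(dataset, lambda ex: ex.validate())`, then inspect `report.most_common()` before deciding whether to fix or drop the flagged examples.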

Train/Eval Split

from sklearn.model_selection import train_test_split

def split_dataset(examples: list, eval_ratio: float = 0.1,
                  seed: int = 42) -> tuple:
    """Stratified train/eval split by output length."""
    short = [e for e in examples if len(e.output) < 200]
    medium = [e for e in examples if 200 <= len(e.output) < 500]
    long = [e for e in examples if len(e.output) >= 500]

    train, eval_set = [], []
    for group in (short, medium, long):
        if len(group) > 1:
            tr, ev = train_test_split(group, test_size=eval_ratio,
                                      random_state=seed)
            train.extend(tr)
            eval_set.extend(ev)
        else:
            train.extend(group)  # a single example cannot be split

    return train, eval_set

Deduplication

import hashlib
from sentence_transformers import SentenceTransformer

def deduplicate_exact(examples: list) -> list:
    """Exact deduplication by prompt hash"""
    seen = set()
    unique = []
    for ex in examples:
        h = hashlib.md5(f"{ex.instruction}{ex.input}".encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(ex)
    return unique

def deduplicate_semantic(examples: list, threshold: float = 0.95) -> list:
    """Semantic deduplication (removes near-duplicates).

    Quadratic in the number of examples: fine up to tens of thousands,
    consider MinHash/LSH beyond that.
    """
    model = SentenceTransformer('all-MiniLM-L6-v2')
    texts = [f"{e.instruction} {e.input}" for e in examples]
    embeddings = model.encode(texts, batch_size=512, show_progress_bar=True,
                              normalize_embeddings=True)

    # With normalized embeddings the dot product is cosine similarity,
    # so one matrix multiply replaces n^2 per-pair similarity calls.
    sims = embeddings @ embeddings.T

    keep = [True] * len(examples)
    for i in range(len(examples)):
        if not keep[i]:
            continue
        for j in range(i + 1, len(examples)):
            if keep[j] and sims[i, j] > threshold:
                keep[j] = False

    return [ex for ex, k in zip(examples, keep) if k]

Final Checklist Before Training

  • No duplicates (exact and near-duplicate)
  • No PII in dataset (names, emails, phones)
  • Output does not contain date/version references ("as of 2023")
  • Even distribution across task types
  • Eval set does not overlap with train
  • Tokenized examples fit within the model's max_length (no silent truncation)
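The last item is easy to automate. A minimal sketch of a length check (the `over_length` helper is illustrative; in practice pass your model's real tokenizer rather than the whitespace proxy used in the example below):

```python
def over_length(texts, tokenize, max_length: int) -> list[int]:
    """Return indices of examples whose token count exceeds max_length.

    `tokenize` is any callable mapping text -> list of tokens; with a
    Hugging Face tokenizer, something like
    `lambda t: tok(t)["input_ids"]` would work.
    """
    return [i for i, t in enumerate(texts)
            if len(tokenize(t)) > max_length]
```

Run it once over the fully rendered prompts (template included, not just the raw fields), and either shorten or drop the flagged examples before training.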