Synthetic Data Generation for LLM Fine-Tuning

Synthetic data generation creates training examples with a stronger LLM (the teacher model). The Self-Instruct approach and its successor Evol-Instruct (introduced with WizardLM) can produce thousands of usable training examples from a small seed dataset (100-200 examples).

Self-Instruct Methodology

from anthropic import Anthropic
import json

client = Anthropic()

SEED_EXAMPLES = [
    {"instruction": "Explain ML term", "output": "..."},
    {"instruction": "Write SQL query for...", "output": "..."},
    # 20-200 seed examples
]

def generate_new_instructions(seed_examples: list, n: int = 20) -> list[str]:
    """Generate new instructions based on seed examples"""
    examples_str = "\n".join([f"- {ex['instruction']}" for ex in seed_examples[:10]])

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"""Here are some example instructions for an AI assistant:
{examples_str}

Generate {n} NEW diverse instructions in the same domain.
Requirements:
- Each instruction should be unique and not repeat the examples
- Vary complexity: some simple, some multi-step
- Include different formats: questions, commands, completions
- Return as JSON array of strings"""
        }]
    )
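    # The model may wrap the array in prose or a code fence, so parse only the bracketed part
    text = response.content[0].text
    return json.loads(text[text.index("["):text.rindex("]") + 1])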

def generate_response(instruction: str, context: str = None) -> str:
    """Generate ideal response for instruction"""
    prompt = f"Instruction: {instruction}"
    if context:
        prompt = f"Context: {context}\n\n{prompt}"

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        system="You are an expert assistant. Provide accurate, helpful, and complete responses.",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
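
Generated instructions often repeat the seeds, or each other, in slightly different wording. The original Self-Instruct pipeline handles this by discarding candidates whose ROUGE-L overlap with an instruction already in the pool is too high. The sketch below uses `difflib.SequenceMatcher` from the standard library as a simpler stand-in for ROUGE-L; the function name `filter_near_duplicates` and the 0.7 default threshold are illustrative assumptions, not part of the code above.

```python
from difflib import SequenceMatcher


def filter_near_duplicates(candidates: list[str], existing: list[str],
                           threshold: float = 0.7) -> list[str]:
    """Keep candidates whose similarity to every pooled instruction is below threshold."""
    pool = [text.lower().strip() for text in existing]
    kept = []
    for candidate in candidates:
        normalized = candidate.lower().strip()
        is_duplicate = any(
            SequenceMatcher(None, normalized, seen).ratio() >= threshold
            for seen in pool
        )
        if not is_duplicate:
            kept.append(candidate)
            pool.append(normalized)  # later candidates are compared against this one too
    return kept
```

A typical call would pass the output of `generate_new_instructions` as `candidates` and the seed instructions as `existing`. Character-level matching is cheap but crude, so the threshold may need tuning for a given domain.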

Evol-Instruct: Instruction Complexification

import random  # needed for random.choice below

EVOLUTION_METHODS = [
    "Add constraints: add a specific constraint or requirement to the instruction",
    "Deepening: ask for more depth or detail in the response",
    "Concretizing: replace general concepts with specific examples",
    "Increased reasoning steps: require multi-step reasoning",
    "Complicate input: add more complex or ambiguous input",
]

def evolve_instruction(original: str) -> str:
    """Complexify instruction using one of the methods"""
    method = random.choice(EVOLUTION_METHODS)

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Rewrite this instruction using this method: {method}

Original instruction: {original}

Return only the rewritten instruction, nothing else."""
        }]
    )
    return response.content[0].text.strip()
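
In the WizardLM paper, evolution is repeated over several rounds, each round rewriting the output of the previous one, and failed rewrites are discarded. A minimal sketch of that loop follows, assuming the rewrite function is passed in as `evolve` (for example `evolve_instruction` from above). The name `evolve_in_rounds`, the default of three rounds, and the very simple rejection rule are assumptions made for illustration.

```python
from typing import Callable


def evolve_in_rounds(instructions: list[str], evolve: Callable[[str], str],
                     rounds: int = 3) -> list[str]:
    """Apply `evolve` repeatedly; return the originals plus every accepted rewrite."""
    pool = list(instructions)
    current = list(instructions)
    for _ in range(rounds):
        next_generation = []
        for instruction in current:
            evolved = evolve(instruction).strip()
            # Reject empty output and rewrites that did not change anything
            if evolved and evolved.lower() != instruction.lower():
                next_generation.append(evolved)
        pool.extend(next_generation)
        current = next_generation
    return pool
```

Keeping every generation in the pool, rather than only the last one, gives the dataset a spread of difficulty levels. A production filter would likely also reject rewrites that drift off topic or that the teacher model cannot answer.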

Domain-Specific Data Generation

def generate_domain_dataset(domain: str, n_examples: int,
                             output_path: str):
    """Generate dataset for specific domain"""
    examples = []

    for i in range(n_examples):
        # Step 1: Generate diverse instruction
        instruction = generate_instruction_for_domain(domain)

        # Step 2: Generate response
        response = generate_response(instruction)

        # Step 3: Quality filter (LLM-judge)
        quality_score = judge_quality(instruction, response)

        if quality_score >= 0.7:
            examples.append({
                "instruction": instruction,
                "output": response,
                "quality_score": quality_score,
                "generated_by": "claude-3-5-sonnet-20241022"
            })

        if (i + 1) % 100 == 0:
            print(f"Generated {i+1}/{n_examples}, kept {len(examples)}")

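    # ensure_ascii=False writes non-ASCII characters as-is, so fix the file encoding explicitly
    with open(output_path, 'w', encoding='utf-8') as f: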
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + '\n')
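
The loop above calls `generate_instruction_for_domain` and `judge_quality`, neither of which is defined in this article. The following is one possible sketch of the judge only, under stated assumptions: the prompt wording, the 0-10 rating scale, and the helper `parse_judge_score` are all hypothetical, and the client is passed in explicitly (unlike the two-argument call above) so the function can be exercised with a stub.

```python
import re

# Hypothetical judge prompt; the wording and the 0-10 scale are assumptions
JUDGE_PROMPT = """Rate the response to the instruction on a scale from 0 to 10
for accuracy, completeness, and relevance.

Instruction: {instruction}

Response: {response}

Reply with the number only."""


def parse_judge_score(text: str) -> float:
    """Extract the first number from the judge's reply and scale it to 0.0-1.0."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return 0.0  # unparseable reply: treat as a rejection
    return min(float(match.group()), 10.0) / 10.0


def judge_quality(instruction: str, response: str, client,
                  model: str = "claude-3-5-sonnet-20241022") -> float:
    """Ask the judge model for a 0-10 rating and return it as a 0.0-1.0 score."""
    reply = client.messages.create(
        model=model,
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(instruction=instruction, response=response),
        }],
    )
    return parse_judge_score(reply.content[0].text)
```

To use it from the loop above, pass the module-level `client` as the third argument. Using the same model as both generator and judge tends to favor its own outputs, so a different judge model may be worth considering.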

Evaluating Synthetic Data Before Training

Synthetic data requires additional validation:

  • Hallucination check: teacher model responses may contain factual errors, so a domain expert should review a sample.
  • Style bias: every teacher model has a characteristic style (GPT-4 is a well-known example), and the student model can learn the teacher's style instead of the target style.
  • Diversity check: look for thematic clusters that are overrepresented.
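
As a rough first pass at the diversity check, the sketch below counts how often each opening word appears across the instructions and reports the ones that dominate. The opening word is only a crude proxy for a thematic cluster (embedding-based clustering would be more faithful), and the function name `overrepresented_openers` and the 20% default limit are assumptions for illustration.

```python
from collections import Counter


def overrepresented_openers(instructions: list[str],
                            max_share: float = 0.2) -> dict[str, float]:
    """Return first words whose share of the dataset exceeds max_share."""
    openers = [text.split()[0].lower() for text in instructions if text.strip()]
    if not openers:
        return {}
    counts = Counter(openers)
    return {
        word: count / len(openers)
        for word, count in counts.items()
        if count / len(openers) > max_share
    }
```

If most instructions start with the same verb, the generation prompt is probably steering the teacher toward one task type.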

For datasets of 5000+ synthetic examples, have humans review a 5-10% sample and calculate the approval rate. If the approval rate is below 80%, improve the generation prompts before training.
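
The review step can be sketched as two small helpers: one draws a reproducible random sample for the reviewers, the other computes the approval rate from their verdicts. The function names, the fixed random seed, and the representation of verdicts as a list of booleans are assumptions, not something the article prescribes.

```python
import random


def draw_review_sample(examples: list[dict], fraction: float = 0.05,
                       seed: int = 0) -> list[dict]:
    """Pick a reproducible random sample of examples for human review."""
    sample_size = min(len(examples), max(1, round(len(examples) * fraction)))
    return random.Random(seed).sample(examples, sample_size)


def approval_rate(verdicts: list[bool]) -> float:
    """Share of reviewed examples that the reviewers approved."""
    if not verdicts:
        raise ValueError("no reviewed examples")
    return sum(verdicts) / len(verdicts)
```

With 5000 examples and the default fraction, reviewers would receive 250 items; a result such as `approval_rate(verdicts) < 0.8` would then signal that the prompts need another iteration.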