Data Augmentation for LLM Fine-Tuning


Data augmentation for LLMs differs from augmentation for CV or classical NLP: you cannot simply rotate text or change brightness. You need methods that create semantically equivalent but lexically diverse examples — without breaking meaning or degrading quality.

Augmentation Methods

Backtranslation — translate to an intermediate language and back. Creates paraphrases while preserving meaning:

from deep_translator import GoogleTranslator

def backtranslate(text: str, pivot_language: str = 'de') -> str:
    """English → German → English to create paraphrase"""
    intermediate = GoogleTranslator(source='en', target=pivot_language).translate(text)
    back = GoogleTranslator(source=pivot_language, target='en').translate(intermediate)
    return back

# Apply to instructions (not output!)
original = "How do I cancel my subscription?"
augmented = backtranslate(original)  # "How can I terminate my subscription?"
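
Translation APIs sometimes return the input essentially unchanged, so a round trip yields a duplicate rather than a paraphrase. A minimal filter can drop such no-op results before they enter the dataset (the case-insensitive comparison rule here is an assumption, not part of deep_translator):

```python
def keep_real_paraphrases(pairs: list[tuple[str, str]]) -> list[str]:
    """Keep only backtranslations that actually differ from the original.

    Comparison ignores case and surrounding whitespace, since a change
    in capitalization alone is not a useful paraphrase.
    """
    kept = []
    for original, candidate in pairs:
        if candidate.strip().lower() != original.strip().lower():
            kept.append(candidate)
    return kept

pairs = [
    ("How do I cancel my subscription?", "How can I terminate my subscription?"),
    ("How do I cancel my subscription?", "how do i cancel my subscription?"),  # no-op
]
kept = keep_real_paraphrases(pairs)  # only the first candidate survives
```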

LLM-generated paraphrases — the highest-quality method:

import json

from anthropic import Anthropic

client = Anthropic()

def generate_paraphrases(instruction: str, n: int = 5) -> list[str]:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Generate {n} diverse paraphrases of this instruction.
Keep the same meaning but vary the wording, formality level, and sentence structure.

Instruction: {instruction}

Return as JSON array of strings."""
        }]
    )
    return json.loads(response.content[0].text)
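
In practice, models often wrap the array in a Markdown code fence, which makes the raw `json.loads` call fail. A small post-processing step makes the parsing more forgiving (the fence-stripping logic is an assumption about typical response shapes, not an Anthropic API feature):

```python
import json

def parse_paraphrase_response(text: str) -> list[str]:
    """Extract a JSON array of strings from raw model output.

    Strips an optional ```json ... ``` fence before parsing and
    validates that the result is a list of strings.
    """
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line, then the trailing ``` marker.
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0]
    result = json.loads(cleaned)
    if not isinstance(result, list) or not all(isinstance(s, str) for s in result):
        raise ValueError("expected a JSON array of strings")
    return result

raw = '```json\n["Cancel my plan?", "End my subscription?"]\n```'
paraphrases = parse_paraphrase_response(raw)
```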

Instruction diversity expansion — vary how the same task is phrased:

def expand_instruction_types(task_description: str,
                               example_output: str) -> list[dict]:
    """Create different instruction formats for same task"""
    variations = [
        f"Please {task_description.lower()}",
        f"Can you {task_description.lower()}?",
        f"I need you to {task_description.lower()}",
        f"{task_description}:",
        task_description.upper()  # all-caps variant
    ]
    return [{"instruction": var, "output": example_output}
            for var in variations]

Negation augmentation — add refusal examples:

# Each edge case is assumed to carry the refusal reason and a safe alternative
refusal_examples = []
for ex in harmful_edge_cases:
    refusal_examples.append({
        "instruction": ex["instruction"],
        "output": f"I can't help with that request as it {ex['reason']}. "
                  f"I'd be happy to help with {ex['alternative']} instead."
    })

Output Augmentation

def augment_long_outputs(output: str, model_client) -> list[str]:
    """Create output variations of different lengths and structures.

    model_client is assumed to be a thin wrapper exposing summarize()
    and restructure_with_bullets() helpers over an LLM API.
    """
    augmented = []

    # Brief version
    brief = model_client.summarize(output, max_words=50)
    augmented.append(brief)

    # Structured (with bullet points)
    structured = model_client.restructure_with_bullets(output)
    augmented.append(structured)

    return augmented
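
When an LLM call is too expensive for bulk output augmentation, a deterministic restructuring covers the bulleted variant. The sentence splitting below is a naive assumption (periods only, no abbreviation handling), intended as a cheap fallback rather than a replacement for the model-based version:

```python
def to_bullets(output: str) -> str:
    """Rewrite a prose answer as a bullet list, one sentence per bullet."""
    sentences = [s.strip() for s in output.split(".") if s.strip()]
    return "\n".join(f"- {s}." for s in sentences)

text = "Open account settings. Select the Billing tab. Click Cancel subscription."
bulleted = to_bullets(text)
```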

Augmentation Quality Control

from sentence_transformers import SentenceTransformer
import numpy as np

# Load the embedding model once, not on every call
model = SentenceTransformer('all-MiniLM-L6-v2')

def measure_augmentation_quality(original: str, augmented: str) -> dict:
    orig_emb = model.encode(original)
    aug_emb = model.encode(augmented)

    similarity = float(np.dot(orig_emb, aug_emb) /
                       (np.linalg.norm(orig_emb) * np.linalg.norm(aug_emb)))

    return {
        'semantic_similarity': similarity,
        'is_valid': 0.7 < similarity < 0.98,  # Too similar = not augmentation; too different = different meaning
        'length_ratio': len(augmented) / len(original),
        'unique_words': len(set(augmented.split()) - set(original.split()))
    }

The goal of augmentation is to increase the diversity of formulations while preserving meaning. The optimal semantic-similarity range is 0.75-0.95: above 0.98 the pair is almost a duplicate, and below 0.7 the meaning has likely changed.

Augmentation typically grows a dataset 2-3x, with an optimal ratio of originals to augmented examples of about 1:2. Too high a proportion of augmented examples (above 70%) can reduce diversity and cause overfitting on specific paraphrase patterns.
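
The 1:2 ratio and the 70% cap can be enforced mechanically when assembling the final dataset. The function below is a sketch of that policy (the example names and the fixed seed are assumptions for illustration):

```python
import random

def mix_dataset(originals: list[dict], augmented: list[dict],
                max_aug_ratio: float = 2.0, seed: int = 42) -> list[dict]:
    """Combine originals with augmented examples, capping augmented
    examples at max_aug_ratio per original (1:2 by default)."""
    rng = random.Random(seed)
    cap = int(len(originals) * max_aug_ratio)
    sampled = rng.sample(augmented, min(cap, len(augmented)))
    mixed = originals + sampled
    rng.shuffle(mixed)
    return mixed

originals = [{"id": i} for i in range(100)]
augmented = [{"id": 1000 + i} for i in range(500)]
mixed = mix_dataset(originals, augmented)
# 100 originals + 200 sampled augmented = 300 examples, augmented share ~67%
```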