Data Augmentation for LLM Fine-tuning
Data augmentation for LLMs differs from augmentation for CV or classical NLP: you cannot simply rotate text or change brightness. You need methods that create semantically equivalent but lexically diverse examples — without breaking meaning or degrading quality.
Augmentation Methods
Backtranslation — translate to an intermediate language and back. Creates paraphrases while preserving meaning:
```python
from deep_translator import GoogleTranslator

def backtranslate(text: str, pivot_language: str = 'de') -> str:
    """English → German → English to create a paraphrase"""
    intermediate = GoogleTranslator(source='en', target=pivot_language).translate(text)
    back = GoogleTranslator(source=pivot_language, target='en').translate(intermediate)
    return back

# Apply to instructions (not outputs!)
original = "How do I cancel my subscription?"
augmented = backtranslate(original)  # e.g. "How can I terminate my subscription?"
```
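Chaining several pivot languages yields multiple distinct paraphrases per source text. A minimal sketch, with the translation call injected as a parameter so the logic can be exercised without network access (in practice you would pass a wrapper around `GoogleTranslator`; `backtranslate_multi` is an illustrative name, not a library function):

```python
def backtranslate_multi(text, pivots, translate):
    """Return one paraphrase per pivot language, dropping exact duplicates.

    translate(text, source, target) -> str is any translation callable.
    """
    paraphrases = []
    seen = {text.strip().lower()}
    for pivot in pivots:
        intermediate = translate(text, "en", pivot)
        back = translate(intermediate, pivot, "en")
        key = back.strip().lower()
        if key not in seen:  # keep only genuinely new formulations
            seen.add(key)
            paraphrases.append(back)
    return paraphrases
```

Different pivots (e.g. 'de', 'fr', 'ja') produce different paraphrase styles; more distant languages tend to paraphrase more aggressively.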
LLM-generated paraphrases — often the highest-quality method:
```python
import json

from anthropic import Anthropic

client = Anthropic()

def generate_paraphrases(instruction: str, n: int = 5) -> list[str]:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Generate {n} diverse paraphrases of this instruction.
Keep the same meaning but vary the wording, formality level, and sentence structure.

Instruction: {instruction}

Return as a JSON array of strings."""
        }]
    )
    return json.loads(response.content[0].text)
```
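The bare `json.loads` above fails when the model wraps its answer in prose or markdown code fences, which happens in practice. A hedged helper (not part of any SDK; the name is illustrative) that parses defensively:

```python
import json
import re

def parse_paraphrase_list(raw: str) -> list[str]:
    """Extract a JSON array of strings from a model response, tolerating fences."""
    # Strip markdown code fences if present
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the first [...] span in the text
        match = re.search(r"\[.*\]", cleaned, re.DOTALL)
        if not match:
            return []
        data = json.loads(match.group(0))
    # Keep only string entries
    return [item for item in data if isinstance(item, str)]
```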
Instruction diversity expansion — expand instruction types:
```python
def expand_instruction_types(task_description: str,
                             example_output: str) -> list[dict]:
    """Create different instruction formats for the same task"""
    variations = [
        f"Please {task_description.lower()}",
        f"Can you {task_description.lower()}?",
        f"I need you to {task_description.lower()}",
        f"{task_description}:",
        task_description.upper()  # terse, all-caps variant
    ]
    return [{"instruction": var, "output": example_output}
            for var in variations]
```
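Template expansion easily produces near-duplicates (e.g. when the task description already ends with punctuation), so it is worth deduplicating by normalized instruction text before adding examples to the dataset. A small sketch (`dedupe_examples` is an illustrative helper, not from any library):

```python
def dedupe_examples(examples: list[dict]) -> list[dict]:
    """Keep the first example per normalized instruction text."""
    seen: set[str] = set()
    unique = []
    for ex in examples:
        # Collapse case and whitespace so "Do  X" and "do x" count as one
        key = " ".join(ex["instruction"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```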
Negation augmentation — add refusal examples:
```python
# Assumes each edge case carries a refusal reason and a safe alternative
# (the field names here are illustrative)
refusal_examples = []
for ex in harmful_edge_cases:
    refusal_examples.append({
        "instruction": ex.instruction,
        "output": f"I can't help with that request as it {ex.reason}. "
                  f"I'd be happy to help with {ex.alternative} instead."
    })
```
Output Augmentation
```python
def augment_long_outputs(output: str, model_client) -> list[str]:
    """Create output variations of different lengths and structures.

    `model_client` is any wrapper exposing summarize/restructure helpers.
    """
    augmented = []

    # Brief version
    brief = model_client.summarize(output, max_words=50)
    augmented.append(brief)

    # Structured version (with bullet points)
    structured = model_client.restructure_with_bullets(output)
    augmented.append(structured)

    return augmented
```
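Not every structural variation needs a model call. As a cheap, deterministic baseline next to the LLM-generated variants above, a sketch that splits an output into sentences and renders them as a bullet list (assumes simple sentence-final punctuation):

```python
def to_bullets(output: str) -> str:
    """Render each sentence of `output` as a markdown bullet."""
    # Normalize sentence-final punctuation to periods, then split
    sentences = [s.strip()
                 for s in output.replace("!", ".").replace("?", ".").split(".")]
    return "\n".join(f"- {s}" for s in sentences if s)
```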
Augmentation Quality Control
```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the embedding model once, not on every call
model = SentenceTransformer('all-MiniLM-L6-v2')

def measure_augmentation_quality(original: str, augmented: str) -> dict:
    orig_emb = model.encode(original)
    aug_emb = model.encode(augmented)
    similarity = float(np.dot(orig_emb, aug_emb) /
                       (np.linalg.norm(orig_emb) * np.linalg.norm(aug_emb)))
    return {
        'semantic_similarity': similarity,
        # Too similar = not augmentation; too different = changed meaning
        'is_valid': 0.7 < similarity < 0.98,
        'length_ratio': len(augmented) / len(original),
        'unique_words': len(set(augmented.split()) - set(original.split()))
    }
```
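The quality check above naturally extends to a filtering pass over candidate pairs. A sketch with the similarity function injected as a parameter, so the gating logic is testable without loading a sentence-transformer (in practice, pass a cosine-similarity wrapper around `model.encode`):

```python
def filter_augmented(pairs, similarity, lo=0.7, hi=0.98):
    """Keep (original, augmented) pairs whose similarity falls in (lo, hi)."""
    kept = []
    for original, augmented in pairs:
        score = similarity(original, augmented)
        if lo < score < hi:  # too similar = duplicate; too different = drift
            kept.append((original, augmented))
    return kept
```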
The goal of augmentation is to increase formulation diversity while preserving meaning. A useful semantic-similarity range is roughly 0.75-0.95: above 0.98 the pair is nearly a duplicate, below 0.7 the meaning has likely changed.
Augmentation typically grows a dataset 2-3x, with a practical ratio of originals to augmented examples of about 1:2. If augmented examples dominate the set (above roughly 70%), diversity drops and the model can overfit to the augmentation patterns.
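The 1:2 ratio above can be enforced when assembling the final training set. A hedged sketch (`mix_dataset` is an illustrative helper) that caps augmented examples at twice the number of originals, so they never exceed about two-thirds of the mix:

```python
import random

def mix_dataset(originals, augmented, max_ratio=2.0, seed=42):
    """Combine originals with at most `max_ratio` augmented examples per original."""
    rng = random.Random(seed)  # fixed seed for reproducible datasets
    cap = int(len(originals) * max_ratio)
    # Downsample augmented examples if they exceed the cap
    sampled = augmented if len(augmented) <= cap else rng.sample(augmented, cap)
    combined = originals + sampled
    rng.shuffle(combined)
    return combined
```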