# Data Annotation for LLM Fine-tuning
Data annotation for LLM fine-tuning is fundamentally different from classical ML annotation: rather than simply assigning a label, you create an ideal model response. Annotation quality directly determines the quality of the trained model.
## Annotation Types

- **Direct annotation** — the annotator creates an (instruction, ideal response) pair from scratch. Highest quality, highest cost.
- **Edit-based annotation** — the annotator improves a response from a baseline model. Two to three times faster than writing from scratch. Risk: the annotator rubber-stamps a poor response with only minor edits.
- **Ranking/preference annotation** — the annotator ranks multiple model responses (used for RLHF and DPO). Simpler than creating responses from scratch, but requires a solid understanding of the quality criteria.
- **AI-assisted annotation** — a strong model (e.g. GPT-4) generates responses, and humans review and correct them. The optimal balance of quality and speed for most tasks.
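For edit-based and AI-assisted annotation, the rubber-stamping risk above can be caught automatically. A minimal sketch, using `difflib` string similarity (the function name and the 5% threshold are illustrative assumptions, not a standard):

```python
from difflib import SequenceMatcher

def flag_low_effort_edits(draft: str, final: str, min_change: float = 0.05) -> bool:
    """Flag annotations where the human changed almost nothing of the model draft.

    Returns True when the edit looks suspiciously small and should be spot-checked.
    """
    similarity = SequenceMatcher(None, draft, final).ratio()
    return (1.0 - similarity) < min_change
```

Flagged items go to a second reviewer; the threshold should be tuned on a manually audited sample, since some drafts legitimately need no edits.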
## Annotation Guidelines

Guidelines are the key document; without them, annotators produce inconsistent results. An example skeleton:
```markdown
## Annotation Guidelines: Customer Support Assistant

### Quality Criteria for a Good Response
1. **Accuracy:** complies with company policies and is factually correct
2. **Completeness:** solves the user's problem without leaving open questions
3. **Tone:** professional, empathetic, no apologizing for non-existent issues
4. **Length:** sufficient but not excessive (100-300 words is optimal)
5. **Structure:** paragraphs; avoid lists for simple responses

### What Should NOT Appear in a Response
- "I'm just a language model..."
- "I cannot..."
- Repeating the user's question
- Inappropriate apologies
- Outdated product information

### Examples of a GOOD Response
[examples]

### Examples of a BAD Response
[examples with explanation]
```
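Some of these guidelines are mechanically checkable before a human reviews the annotation. A minimal sketch (the function name, patterns, and word-count bounds below mirror the example guidelines and are illustrative, not a standard tool):

```python
import re

# Phrases taken from the "What Should NOT be in Response" section above
FORBIDDEN_PATTERNS = [
    r"i'm just a language model",
    r"\bi cannot\b",
]

def check_response(response: str, min_words: int = 100, max_words: int = 300) -> list[str]:
    """Return a list of guideline violations; an empty list means the response passes."""
    issues = []
    lowered = response.lower()
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, lowered):
            issues.append(f"forbidden phrase: {pattern}")
    n_words = len(response.split())
    if not (min_words <= n_words <= max_words):
        issues.append(f"length {n_words} words outside {min_words}-{max_words}")
    return issues
```

Running such checks at submission time catches obvious violations cheaply and keeps human review focused on accuracy and tone.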
## Annotation Platforms
**Label Studio** (open-source):

```python
from label_studio_sdk import Client

ls = Client(url='http://localhost:8080', api_key='...')

# Create a project for LLM annotation: annotators write an ideal
# response and rate it on a 5-star scale
project = ls.start_project(
    title='Customer Support Fine-tuning',
    label_config='''
    <View>
      <Text name="instruction" value="$instruction"/>
      <TextArea name="response" toName="instruction"
                placeholder="Write ideal response..."
                rows="10" maxSubmissions="1"/>
      <Rating name="quality" toName="instruction"
              maxRating="5" icon="star" size="medium"/>
    </View>
    '''
)

# Load tasks to annotate
tasks = [{"instruction": ex.instruction, "input": ex.input}
         for ex in unannotated_examples]
project.import_tasks(tasks)
```
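After annotation, the Label Studio JSON export can be turned into training pairs. A sketch of a converter, assuming the label config above (a `TextArea` named `response` and a `Rating` named `quality`) and the standard export shape of tasks with nested `annotations`/`result` entries; the function name and rating cutoff are illustrative:

```python
def export_to_pairs(tasks: list[dict], min_rating: int = 4) -> list[dict]:
    """Convert Label Studio exported tasks into (instruction, response) pairs,
    keeping only annotations whose quality rating meets min_rating."""
    pairs = []
    for task in tasks:
        for ann in task.get("annotations", []):
            response, rating = None, None
            for item in ann.get("result", []):
                if item["from_name"] == "response":
                    response = item["value"]["text"][0]
                elif item["from_name"] == "quality":
                    rating = item["value"]["rating"]
            if response and (rating or 0) >= min_rating:
                pairs.append({"instruction": task["data"]["instruction"],
                              "response": response})
    return pairs
```

Filtering by rating at export time keeps low-confidence annotations out of the fine-tuning set without deleting them from the project.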
**Scale AI / Appen** — managed services for large volumes with professional annotators. Significantly more expensive, but quality control is included.
## Inter-annotator Agreement

For quality control, overlap is essential: 10-20% of tasks are annotated independently by two annotators:
```python
from sklearn.metrics import cohen_kappa_score

def compute_iaa(annotations_a: list, annotations_b: list) -> float:
    """Cohen's kappa for agreement between two annotators.

    For ordinal labels such as 1-5 ratings, quadratic weights penalize
    large disagreements more heavily than off-by-one ones.
    """
    kappa = cohen_kappa_score(annotations_a, annotations_b, weights="quadratic")
    print(f"Cohen's kappa: {kappa:.3f}")
    # < 0.4:   low agreement -- review the guidelines
    # 0.4-0.6: moderate agreement
    # 0.6-0.8: good agreement
    # > 0.8:   excellent agreement
    return kappa
```
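The 10-20% overlap itself has to be planned when tasks are assigned. A minimal sketch of a round-robin assignment with a random overlap sample (the function name and 15% default are illustrative assumptions):

```python
import random

def assign_overlap(task_ids: list[int], annotators: list[str],
                   overlap_frac: float = 0.15, seed: int = 42) -> dict[int, list[str]]:
    """Assign each task to one annotator, plus a second annotator on a random
    overlap_frac share of tasks for inter-annotator agreement checks."""
    rng = random.Random(seed)
    n_overlap = int(len(task_ids) * overlap_frac)
    overlap_ids = set(rng.sample(task_ids, n_overlap))
    assignments = {}
    for i, task_id in enumerate(task_ids):
        assignments[task_id] = [annotators[i % len(annotators)]]
        if task_id in overlap_ids:
            # Pick a different annotator for the second pass
            assignments[task_id].append(annotators[(i + 1) % len(annotators)])
    return assignments
```

Fixing the random seed makes the assignment reproducible, so the overlap sample can be audited later.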
## Calibration Session

Before full annotation begins, the whole team jointly annotates 20-50 examples, discusses discrepancies, and refines the guidelines. This sharply reduces variance between annotators: without calibration, even experienced annotators achieve kappa < 0.5 on complex tasks.
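During the calibration session itself, annotators whose ratings diverge from the group are the ones to follow up with. A sketch that compares each annotator against the per-example median (function name and deviation threshold are illustrative assumptions):

```python
from statistics import median

def calibration_outliers(ratings: dict[str, list[int]], max_dev: float = 1.0) -> list[str]:
    """Find annotators whose ratings deviate from the per-example median
    by more than max_dev on average; they need a follow-up discussion."""
    n = len(next(iter(ratings.values())))
    medians = [median(r[i] for r in ratings.values()) for i in range(n)]
    outliers = []
    for name, scores in ratings.items():
        mean_dev = sum(abs(s - m) for s, m in zip(scores, medians)) / n
        if mean_dev > max_dev:
            outliers.append(name)
    return outliers
```

The median is preferable to the mean here because a single extreme annotator would otherwise drag the reference point toward themselves.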