Data Annotation for LLM Fine-Tuning

Data annotation for LLM fine-tuning is fundamentally different from classical ML annotation: rather than simply assigning a label, you create an ideal model response. Annotation quality directly determines the quality of the trained model.

Annotation Types

Direct annotation — the annotator writes a pair (instruction, ideal response) from scratch. Highest quality, highest cost.

Edit-based annotation — the annotator improves a baseline model's response. 2-3x faster than writing from scratch. Risk: the annotator accepts a poor response after only minor edits.

Ranking/Preference annotation — the annotator ranks multiple model responses (used for RLHF and DPO). Simpler than writing responses from scratch, but requires a clear understanding of the quality criteria.

AI-assisted annotation — a strong model (e.g., GPT-4) generates responses, and humans review and correct them. The optimal balance of quality and speed for most tasks.
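Ranked annotations can be expanded into pairwise preferences for DPO-style training: every pair of responses where one outranks the other becomes a (chosen, rejected) example. A minimal sketch (function and field names are illustrative, not from any particular library):

```python
from itertools import combinations

def ranking_to_preference_pairs(instruction, ranked_responses):
    """ranked_responses: candidate responses ordered best-to-worst.
    Returns a (chosen, rejected) pair for every ordered combination."""
    pairs = []
    for better, worse in combinations(ranked_responses, 2):
        pairs.append({
            "prompt": instruction,
            "chosen": better,    # the higher-ranked response
            "rejected": worse,   # the lower-ranked response
        })
    return pairs

# 3 ranked responses yield C(3,2) = 3 preference pairs
pairs = ranking_to_preference_pairs(
    "How do I reset my password?",
    ["Go to Settings -> Security -> Reset password.",    # rank 1
     "You can probably reset it somewhere in settings.",  # rank 2
     "I cannot help with that."],                         # rank 3
)
```

This is why ranking annotation is cheap relative to direct annotation: one ranking of n responses produces n·(n-1)/2 training pairs.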

Annotation Guidelines

A key document without which annotators will produce inconsistent results:

## Annotation Guidelines: Customer Support Assistant

### Quality Criteria for Good Response:
1. **Accuracy:** Response complies with company policies and is factually correct
2. **Completeness:** Solves user's problem without leaving open questions
3. **Tone:** Professional, empathetic, without apologizing for non-existent issues
4. **Length:** Sufficient but not excessive (100-300 words optimal)
5. **Structure:** Paragraphs, no lists for simple responses

### What Should NOT be in Response:
- "I'm just a language model..."
- "I cannot..."
- Repeating user's question
- Inappropriate apologies
- Outdated product information

### Examples of GOOD Response: [examples]
### Examples of BAD Response: [examples with explanation]
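Parts of such guidelines can be checked automatically before a response ever reaches human review. A hedged sketch that lints a draft response against the forbidden phrases and length bound above (patterns and names are illustrative):

```python
import re

# Assumed patterns derived from the "should NOT be in Response" list
FORBIDDEN_PATTERNS = [
    r"(?i)i'?m just a language model",
    r"(?i)\bi cannot\b",
]
WORD_RANGE = (100, 300)  # "optimal length" from the guidelines

def lint_response(response: str) -> list[str]:
    """Return a list of guideline violations (empty list = no issues found)."""
    issues = []
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, response):
            issues.append(f"forbidden phrase matched: {pattern}")
    n_words = len(response.split())
    if not (WORD_RANGE[0] <= n_words <= WORD_RANGE[1]):
        issues.append(f"length {n_words} words outside {WORD_RANGE}")
    return issues
```

An automated lint like this only catches surface violations; accuracy, completeness, and tone still require a human (or AI-assisted) reviewer.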

Annotation Platforms

Label Studio (open-source):

from label_studio_sdk import Client

ls = Client(url='http://localhost:8080', api_key='...')

# Create project for LLM annotation
project = ls.start_project(
    title='Customer Support Fine-tuning',
    label_config='''
    <View>
        <Text name="instruction" value="$instruction"/>
        <TextArea name="response" toName="instruction"
                  placeholder="Write ideal response..."
                  rows="10" maxSubmissions="1"/>
        <Rating name="quality" toName="instruction"
                maxRating="5" icon="star" size="medium"/>
    </View>
    '''
)

# Load tasks (unannotated_examples: your collection of raw examples,
# each with .instruction and .input attributes)
tasks = [{"instruction": ex.instruction, "input": ex.input}
         for ex in unannotated_examples]
project.import_tasks(tasks)
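After annotation, the exported tasks need to be flattened into a fine-tuning dataset. A sketch assuming the standard Label Studio JSON export structure, where the `from_name` fields mirror the `label_config` above:

```python
import json, os, tempfile

def labelstudio_to_jsonl(tasks, out_path):
    """Flatten exported Label Studio tasks into instruction/response JSONL."""
    with open(out_path, "w") as f:
        for task in tasks:
            for ann in task.get("annotations", []):
                response = None
                for item in ann.get("result", []):
                    # pick up the TextArea named "response" in the config
                    if item.get("from_name") == "response":
                        response = item["value"]["text"][0]
                if response:
                    f.write(json.dumps({
                        "instruction": task["data"]["instruction"],
                        "response": response,
                    }) + "\n")

# Tiny demo on a hand-built export (structure assumed, not a live export):
sample = [{
    "data": {"instruction": "How do I reset my password?"},
    "annotations": [{"result": [
        {"from_name": "response", "value": {"text": ["Go to Settings -> Security."]}},
        {"from_name": "quality", "value": {"rating": 5}},
    ]}],
}]
out = os.path.join(tempfile.mkdtemp(), "sft.jsonl")
labelstudio_to_jsonl(sample, out)
rows = [json.loads(line) for line in open(out)]
```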

Scale AI / Appen — for large volumes with professional annotators. Significantly more expensive, but quality control is included.

Inter-annotator Agreement

For quality control, overlap is important: 10-20% of tasks are annotated by two annotators independently:

from sklearn.metrics import cohen_kappa_score

def compute_iaa(annotations_a: list, annotations_b: list) -> float:
    """Cohen's Kappa for inter-annotator agreement"""
    # Treats labels as categorical; for ordinal 1-5 ratings,
    # weighted kappa (weights="quadratic") is often preferable
    kappa = cohen_kappa_score(annotations_a, annotations_b)
    print(f"Cohen's Kappa: {kappa:.3f}")
    # < 0.4: low agreement, review the guidelines
    # 0.4-0.6: moderate agreement
    # 0.6-0.8: good agreement
    # > 0.8: excellent agreement
    return kappa
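The 10-20% overlap has to be planned when distributing tasks. A simple illustrative assignment scheme (round-robin primaries plus a random overlap sample; names and defaults are hypothetical):

```python
import random

def assign_with_overlap(task_ids, annotators, overlap_frac=0.15, seed=0):
    """Assign each task to one annotator; a random overlap_frac of tasks
    additionally goes to a second annotator for IAA measurement."""
    rng = random.Random(seed)
    overlap_ids = set(rng.sample(task_ids, int(len(task_ids) * overlap_frac)))
    assignments = {}
    for i, tid in enumerate(task_ids):
        primary = annotators[i % len(annotators)]  # round-robin
        assigned = [primary]
        if tid in overlap_ids:
            # second, independent annotator for agreement measurement
            assigned.append(rng.choice([a for a in annotators if a != primary]))
        assignments[tid] = assigned
    return assignments
```

The doubly-annotated subset is then fed to `compute_iaa` above; a fixed seed keeps the assignment reproducible across runs.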

Calibration Session

Before full-scale annotation, the whole team jointly annotates 20-50 examples, discusses discrepancies, and clarifies the guidelines. This critically reduces variance between annotators: without calibration, even experienced annotators often achieve kappa < 0.5 on complex tasks.