TTS Model Fine-Tuning on Client Voice

Complexity: Complex. Timeline: from 2 weeks to 3 months.

Fine-tuning a TTS model on a specific person's voice produces more stable, higher-quality results than zero-shot cloning. It is the right choice for a scalable brand voice with predictable quality.

### Data Preparation

The minimum volume for a noticeable improvement is 30–60 minutes of high-quality recordings.
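Before validation, each clip usually needs to be mono, at the training sample rate, and peak-normalized. A minimal sketch (the `preprocess_audio` helper is an illustrative name; the 22050 Hz target matches the sample-rate check used in the validation script below):

```python
import numpy as np
from scipy.signal import resample_poly

def preprocess_audio(audio: np.ndarray, sr: int,
                     target_sr: int = 22050) -> np.ndarray:
    """Mono-mix, resample and peak-normalize one clip before saving to wavs/."""
    if audio.ndim == 2:                  # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    if sr != target_sr:                  # polyphase resampling avoids aliasing
        g = np.gcd(sr, target_sr)
        audio = resample_poly(audio, target_sr // g, sr // g)
    peak = np.abs(audio).max()
    if peak > 0:                         # normalize, leaving a little headroom
        audio = 0.95 * audio / peak
    return audio.astype(np.float32)
```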

Dataset structure:

```
dataset/
  wavs/
    001.wav   # 5–15 seconds each
    002.wav
    ...
  metadata.csv  # filename|text
```
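Each row of `metadata.csv` pairs a clip name with its transcript. A tiny helper to emit the file in that format (`write_metadata` is an illustrative name, not part of any library):

```python
from pathlib import Path

def write_metadata(dataset_dir: str, entries: dict) -> None:
    """Write {clip_id: transcript} pairs as filename|text lines."""
    path = Path(dataset_dir) / "metadata.csv"
    with open(path, "w", encoding="utf-8") as f:
        for clip_id, text in entries.items():
            f.write(f"{clip_id}|{text}\n")
```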
```python
import pandas as pd
from pathlib import Path
import soundfile as sf
import numpy as np

def validate_dataset(dataset_dir: str) -> dict:
    """Проверяем датасет перед обучением"""
    metadata = pd.read_csv(f"{dataset_dir}/metadata.csv",
                           sep="|", names=["file", "text"])
    stats = {
        "total_files": len(metadata),
        "total_duration": 0,
        "errors": []
    }

    for _, row in metadata.iterrows():
        wav_path = f"{dataset_dir}/wavs/{row['file']}.wav"
        if not Path(wav_path).exists():
            stats["errors"].append(f"Missing: {wav_path}")
            continue

        audio, sr = sf.read(wav_path)
        duration = len(audio) / sr
        stats["total_duration"] += duration

        if sr != 22050:
            stats["errors"].append(f"Wrong SR {sr}: {wav_path}")
        if duration < 1.0 or duration > 15.0:
            stats["errors"].append(f"Bad duration {duration:.1f}s: {wav_path}")

    stats["total_duration_min"] = stats["total_duration"] / 60
    return stats
```

### Fine-tuning XTTS v2

```python
from trainer import Trainer, TrainerArgs
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("base_xtts_config.json")

# Fine-tuning parameters
config.audio.output_sample_rate = 24000
config.batch_size = 4
config.eval_batch_size = 2
config.num_loader_workers = 4

# Fine-tune only the decoder (faster, needs less data)
config.trainer_args = {
    "epochs": 100,
    "save_step": 1000,
    "print_step": 50,
    "eval_split_size": 0.1
}
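
# Launching the run (sketch): the Trainer imported above drives the loop.
# Wiring up the model and dataset samples is model-specific and omitted here.
# trainer = Trainer(TrainerArgs(), config, output_path="runs/xtts_ft",
#                   model=model, train_samples=train_samples,
#                   eval_samples=eval_samples)
# trainer.fit()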
```

### MOS (Mean Opinion Score) Quality Assessment

MOS is the main metric:

- Baseline XTTS zero-shot: MOS ~3.8
- After fine-tuning on 30 min of data: MOS ~4.1–4.3
- After fine-tuning on 60+ min of data: MOS ~4.3–4.5

Objective metrics:

- **UTMOS**: automatic naturalness estimate
- **SECS** (Speaker Embedding Cosine Similarity): similarity to the original speaker
- **WER on re-recognition**: intelligibility of the synthesized speech

### Training Infrastructure
| Configuration | Training time (30 min of data) | Cost |
|---------------|--------------------------------|------|
| 1x A100 80GB  | ~3–4 hours  | ~$15 (RunPod) |
| 1x A10G       | ~6–8 hours  | ~$8 |
| 1x RTX 4090   | ~8–12 hours | ~$5 (local) |

### Project Timeline

- Dataset collection and cleaning: 1–2 weeks
- Training and evaluation: 3–5 days
- Integration and A/B test: 3–5 days
- Total: 3–4 weeks
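The objective metrics from the evaluation section are easy to compute once a speaker encoder and an ASR model are in place. A minimal sketch of SECS and word-level WER (the embeddings and the transcript are assumed to come from external models, e.g. a speaker-verification encoder and any ASR):

```python
import numpy as np

def secs(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Cosine similarity between speaker embeddings of the original
    and the synthesized voice (embeddings come from a speaker encoder)."""
    return float(np.dot(emb_ref, emb_gen)
                 / (np.linalg.norm(emb_ref) * np.linalg.norm(emb_gen)))

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate between the input text and the ASR transcript of
    the synthesized audio (standard Levenshtein distance over words)."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(ref), len(hyp)] / max(len(ref), 1)
```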