French · German · Zero-Shot TTS · CosyVoice2

Europeanizing
CosyVoice2

Data-Efficient Adaptation for French and German Zero-Shot TTS

A rigorous, reproducible ablation study fine-tuning key components of CosyVoice2 across data budgets and language regimes — setting a new standard for multilingual TTS research with backbone-agnostic LoRA adaptation.

2× Languages: French & German (FR & DE)
3+ Components ablated
0-shot Voice cloning
Research Overview

Component-level adaptation

Systematic fine-tuning of CosyVoice2 for European languages — transparent, reproducible, and backbone-agnostic.

Challenge

While generative TTS excels in English and Chinese, high-quality, expressive synthesis for European languages like French and German is underexplored in open-source systems.

Methodology

A rigorous, reproducible ablation grid fine-tuning key components of CosyVoice2 across data budgets and language regimes. A backbone-agnostic refactor unlocks plug-and-play LLM backbones and efficient LoRA adaptation.
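The LoRA adaptation mentioned above can be sketched as follows. This is a minimal NumPy illustration of the general low-rank update idea, not the project's implementation; the layer dimensions, rank, and alpha below are arbitrary assumptions.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """y = x @ W + scale * (x @ A) @ B: frozen base weight plus a low-rank update.

    W: (d_in, d_out) frozen pretrained weight.
    A: (d_in, r), B: (r, d_out) small trainable matrices. B starts at zero,
    so the adapted layer initially matches the base layer exactly.
    """
    r = A.shape[1]
    scale = alpha / r
    return x @ W + scale * (x @ A) @ B

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8
W = rng.standard_normal((d_in, d_out))      # frozen
A = rng.standard_normal((d_in, r)) * 0.01   # trainable
B = np.zeros((r, d_out))                    # trainable, zero init
x = rng.standard_normal((2, d_in))

base = x @ W
adapted = lora_forward(x, W, A, B)
# Trainable params per layer drop from d_in*d_out to r*(d_in + d_out).
```

Because B is initialized to zero, `adapted` equals `base` before any training step, which is what makes LoRA a safe drop-in on top of a pretrained backbone.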

Innovation

One of the first systematic open, component-level benchmarks of adapting a modern generative TTS to European languages — enabling transparent reproduction, extension & evaluation of cross-lingual voice + prosody cloning.

Architecture

CosyVoice2 pipeline

CosyVoice2 follows a modular three-stage TTS pipeline. Each component can be fine-tuned independently or in combination to adapt the model to European languages. Click on the components below to explore their role and impact on speech quality.

[Architecture diagram] CosyVoice2 EU Adaptation Architecture. Three-layer layout: the middle row is the generative pipeline, Input Text (FR/DE) → Text-Speech LM (text → semantic tokens) → Flow Matching CFM Decoder (semantic → mel) → HiFi-GAN Vocoder (mel → waveform, frozen) → Speech Output. The top row provides speaker-embedding conditioning via CAM++ (frozen). The bottom row is the prompt path: Reference Audio (prompt speaker), S3 Speech Tokenizer (semantic IDs, frozen), and Reference Mel (acoustic prompt). Legend: fine-tuned module · frozen module · conditioning / prompt.

Select a Component

Upper row: generative stages. Lower row: frozen prompt / conditioning path (semantic tokenizer, speaker embedding, reference mel). LM & Flow are primary adaptation targets; HiFi-GAN stays frozen.
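The three-stage pipeline can be sketched as plain function composition. All names and stub bodies below are illustrative, following the diagram rather than the real CosyVoice2 code; only the stage interfaces and the frozen/fine-tuned split matter.

```python
def text_speech_lm(text):
    # Stage 1 (fine-tuned): text → discrete semantic token IDs.
    return [ord(c) % 4096 for c in text]  # placeholder tokens

def flow_matching_decoder(semantic_tokens, ref_mel, speaker_emb):
    # Stage 2 (fine-tuned): semantic tokens + acoustic/speaker prompts → mel.
    return [[float(t)] * 80 for t in semantic_tokens]  # 80-bin mel frames

def hifigan_vocoder(mel, hop=256):
    # Stage 3 (frozen): mel spectrogram → waveform (hop samples per frame).
    return [0.0] * (len(mel) * hop)  # placeholder audio

def synthesize(text, ref_mel=None, speaker_emb=None):
    """Compose the stages; each can be swapped or fine-tuned independently."""
    tokens = text_speech_lm(text)
    mel = flow_matching_decoder(tokens, ref_mel, speaker_emb)
    return hifigan_vocoder(mel)

wave = synthesize("Bonjour")
```

The point of the modular interface is that adapting one stage (say, the LM) never requires touching the others' weights or code.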

Interactive Demo

Fine-tuning configuration

Select components to hear progressive improvements. Compare the baseline against your chosen fine-tuning configuration.

Sample Text

"Bonjour, je m'appelle Luka et je travaille dans une entreprise de technologie à Paris. Aujourd'hui, nous allons explorer les capacités de synthèse vocale en français avec CosyVoice 2."
("Hello, my name is Luka and I work at a technology company in Paris. Today, we will explore French speech-synthesis capabilities with CosyVoice 2.")
French reference voice

Fine-tuning Configuration

Text-Speech LM: improves semantic understanding and pronunciation patterns

Flow Matching: enhances prosody and natural speech rhythm

HiFi-GAN: improves audio quality and reduces artifacts

Baseline (No Fine-tuning)

Strong English accent · Unnatural prosody

Original CosyVoice2 model without any language-specific training.

Current Configuration

Select components to hear improvements

Select fine-tuning components above to hear the progressive improvements in speech quality.

* Note on HiFi-GAN Training: The "original" HiFi-GAN here uses a partially trained vocoder, while "fine-tuned" represents the official CosyVoice2 HiFi-GAN model. By doing so, we aim to showcase the effect of the vocoder on speech quality, as typically no additional HiFi-GAN training is needed for cross-lingual adaptation.

See how these configurations compare quantitatively

Rows map to your selections above (Text-Speech LM, Flow, HiFi-GAN). We highlight the matching row automatically.

Legend: original (unchanged) · fine-tuned · partially trained

Metrics: WER↓ (intelligibility), SECS↑ (speaker similarity), MCD↓ (distortion). Column emphasis follows the selected language.

Evaluation Results

Quantitative benchmarks

Evaluation across intelligibility (WER↓), speaker similarity (SECS↑), spectral distortion (MCD↓), pitch correlation (F0 Corr↑), and voicing error (V/UV↓). Most WER gains stem from LM fine-tuning; Flow refines prosody; the HiFi-GAN vocoder remains robust even when kept frozen.

Metric Directions: WER / MCD / V/UV ↓  ·  SECS / F0 Corr ↑
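WER, the intelligibility metric above, is the word-level edit distance between a reference transcript and a hypothesis (typically an ASR transcript of the synthesized audio), normalized by reference length. A minimal sketch, not the study's evaluation code, which would also apply text normalization first:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why lower is better rather than the metric being bounded by 100%.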

Component Comparison

Learning Curve

Mix vs Mono Training

Baseline vs Best Model (anchored by min WER_norm)

Key Findings

Most Impactful Component

The text-speech language model is the key driver of quality, capturing linguistic rhythm and phrasing essential for intelligibility and natural prosody. Flow fine-tuning further improves continuity and expressiveness.

Data Scaling & Efficiency

Most gains arise within roughly 100–500 monolingual hours. WER plateaus early at the current data scale, while SECS continues a modest upward trend, reflecting ongoing prosody refinement.

Component Synergy & Practical Recipe

Fine-tune the LM for immediate gains; add flow adaptation for enhanced prosody if resources allow. Vocoder and tokenizer can remain frozen — streamlining adaptation for new languages.
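The recipe above amounts to a gradient-masking policy: update only LM and Flow parameters and freeze everything else. A framework-agnostic sketch with illustrative parameter names (a stand-in for, e.g., toggling `requires_grad` on a PyTorch module tree):

```python
class Param:
    """Stand-in for a framework parameter (e.g. torch.nn.Parameter)."""
    def __init__(self, name: str):
        self.name = name
        self.requires_grad = True

def apply_recipe(params, tuned_prefixes=("lm.", "flow.")):
    """Enable gradients only for the fine-tuned components; freeze the rest."""
    for p in params:
        p.requires_grad = any(p.name.startswith(pre) for pre in tuned_prefixes)
    return sorted(p.name for p in params if p.requires_grad)

# Hypothetical parameter names, not CosyVoice2's actual module paths.
model = [Param(n) for n in (
    "lm.embed", "lm.layers.0.attn", "flow.decoder.0",
    "vocoder.upsample.0", "tokenizer.codebook", "speaker_emb.proj",
)]
trainable = apply_recipe(model)
```

Keeping the vocoder, tokenizer, and speaker encoder out of the optimizer is what makes the adaptation data-efficient: only the two generative stages consume the new-language hours.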

Cross-lingual Training Effects

Bilingual training consistently boosts SECS; WER benefits appear at very low and very high data scales but are neutral at mid-scale, indicating cross-lingual style regularization.

Limitations & Future Directions

Limitations: the speech tokenizer remains frozen, and the training data scale is moderate. Future directions: tokenizer adaptation, larger corpora, backbone swaps, and instruction-tuned prosody prompts.