Europeanizing CosyVoice2:
Data-Efficient Adaptation for French and German Zero-Shot TTS

ICASSP 2026 Under Review

Research Overview

Challenge

While generative TTS excels in English and Chinese, high-quality, expressive synthesis for European languages like French and German is an underexplored area in open-source systems. We address this gap with a systematic, component-level adaptation of CosyVoice2, a state-of-the-art TTS model.

Methodology

We introduce a rigorous, reproducible ablation grid that fine-tunes key components of CosyVoice2 across data budgets and language regimes. Our backbone-agnostic refactor enables plug-and-play LLM backbones and efficient LoRA adaptation, providing a reusable setup for multilingual TTS research.
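As an illustration of what the backbone-agnostic refactor enables, here is a minimal sketch of attaching LoRA adapters to a causal-LM backbone with Hugging Face peft. The checkpoint name, target modules, and hyperparameters below are assumptions for illustration, not our exact configuration.

```python
# Minimal sketch: attaching LoRA adapters to a swappable LM backbone.
# Checkpoint, target modules, and hyperparameters are illustrative only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

backbone = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")  # hypothetical backbone choice

lora_cfg = LoraConfig(
    r=16,                      # adapter rank
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

lm = get_peft_model(backbone, lora_cfg)
lm.print_trainable_parameters()  # only the low-rank adapters are trainable
```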

Innovation

To our knowledge, this is one of the first systematic, open, component-level benchmarks of adapting a modern generative TTS model (CosyVoice2) to European languages, enabling transparent reproduction, extension, and evaluation of cross-lingual voice and prosody cloning.

CosyVoice2 Architecture

CosyVoice2 follows a modular three-stage TTS pipeline. Each component can be fine-tuned independently or in combination to adapt the model to European languages. Click on the components below to explore their role and impact on the final speech quality.

Figure: CosyVoice2 EU adaptation architecture. Three-layer layout: the middle generative pipeline runs Input Text (FR/DE) → Text-Speech LM (text → semantic tokens) → Flow Matching CFM decoder (semantic tokens → mel) → HiFi-GAN vocoder (mel → waveform, frozen) → speech output. The top path injects CAM++ speaker embeddings (frozen). The bottom path feeds the reference audio through the frozen S3 speech tokenizer (semantic IDs) and supplies the reference mel as an acoustic prompt to the LM and flow decoder. Legend: fine-tuned module, frozen module, conditioning/prompt.

Upper row: generative stages. Lower row: frozen prompt/conditioning path (semantic tokenizer, speaker embedding, reference mel). The LM and flow decoder are the primary adaptation targets; HiFi-GAN stays frozen.
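For orientation, here is a minimal pseudocode-style sketch of the three-stage pipeline. The module interfaces (lm.generate, flow.decode, and so on) are hypothetical stand-ins that only indicate which stages consume the prompts and which stay frozen.

```python
# Hypothetical sketch of the three-stage pipeline; module interfaces are illustrative.
import torch
import torchaudio

@torch.no_grad()
def synthesize(text, ref_audio, sample_rate, lm, flow, vocoder, tokenizer, spk_encoder):
    # Frozen prompt path: semantic IDs and a speaker embedding from the reference audio.
    prompt_tokens = tokenizer.encode(ref_audio)      # S3 speech tokenizer (frozen)
    spk_emb = spk_encoder(ref_audio)                 # CAM++ speaker embedding (frozen)
    ref_mel = torchaudio.transforms.MelSpectrogram(sample_rate)(ref_audio)  # acoustic prompt

    # Stage 1: the text-speech LM maps text plus the prompt tokens to semantic speech tokens.
    speech_tokens = lm.generate(text, prompt_tokens)

    # Stage 2: conditional flow matching decodes semantic tokens into a mel-spectrogram,
    # conditioned on the speaker embedding and the reference mel.
    mel = flow.decode(speech_tokens, spk_emb, ref_mel)

    # Stage 3: the frozen HiFi-GAN vocoder renders the waveform.
    return vocoder(mel)
```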

Interactive Fine-tuning Demo

Sample Text

"Bonjour, je m'appelle Luka et je travaille dans une entreprise de technologie à Paris. Aujourd'hui, nous allons explorer les capacités de synthèse vocale en français avec CosyVoice 2."
French reference voice

Fine-tuning Configuration

Text-Speech LM: improves semantic understanding and pronunciation patterns.

Flow Matching: enhances prosody and natural speech rhythm.

HiFi-GAN: improves audio quality and reduces artifacts.

Baseline (No Fine-tuning)

Issues: strong English accent, unnatural prosody.

Original CosyVoice2 model without any language-specific training.

Current Configuration

Select fine-tuning components above to hear the progressive improvements in speech quality.

* Note on HiFi-GAN training: the "original" HiFi-GAN here is a partially trained vocoder, while "fine-tuned" is the official CosyVoice2 HiFi-GAN model. This contrast is meant to showcase the vocoder's effect on speech quality; in practice, no additional HiFi-GAN training is typically needed for cross-lingual adaptation.

See how these configurations compare quantitatively on our test data

Rows map to your selections above (Text‑Speech LM, Flow, HiFi‑GAN). We highlight the matching row automatically.

Legend: original (unchanged) · fine-tuned · partially trained

Metrics: WER↓ (intelligibility), SECS↑ (speaker similarity), MCD↓ (distortion). Column emphasis follows the selected language.

Evaluation Results

Evaluation covers intelligibility (WER↓), speaker similarity (SECS↑), spectral distortion (MCD↓), pitch correlation (F0 Corr↑), and voicing error (V/UV↓). Most WER gains stem from LM fine-tuning; flow adaptation refines prosody; the frozen HiFi-GAN remains robust.

Metric Directions: WER / MCD / V/UV ↓; SECS / F0 Corr ↑.
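As a hedged sketch of how two of these metrics can be computed, assuming ASR transcripts of the synthesized audio and speaker embeddings from some encoder are already available (the specific ASR model and embedding extractor used in our evaluation are not shown here):

```python
# Minimal metric sketch: WER from ASR transcripts, SECS as cosine similarity
# of speaker embeddings. The ASR model and speaker encoder are assumed external.
import jiwer
import torch
import torch.nn.functional as F

def word_error_rate(reference_texts, hypothesis_texts):
    # jiwer handles the alignment; lower is better.
    return jiwer.wer(reference_texts, hypothesis_texts)

def secs(ref_embedding: torch.Tensor, syn_embedding: torch.Tensor) -> float:
    # Speaker-embedding cosine similarity between reference and synthesized speech; higher is better.
    return F.cosine_similarity(ref_embedding.unsqueeze(0),
                               syn_embedding.unsqueeze(0)).item()
```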

Component Comparison

Learning Curve

Mix vs Mono Training (across hours)

Baseline vs Best Model (anchored by min WER_norm)

Key Findings

Most Impactful Component

The text-speech language model is the key driver of quality, capturing linguistic rhythm and phrasing essential for intelligibility and natural prosody. Flow fine-tuning further improves continuity and expressiveness, while the vocoder is robust and rarely a bottleneck for cross-lingual adaptation.

Data Scaling & Efficiency

Most gains arise within roughly 100-500 monolingual hours (about 200-1000 total hours in the bilingual setting). WER plateaus early at the current data scale, while SECS continues a modest upward trend, reflecting prosody refinement.

Component Synergy & Practical Recipe

Fine-tune the LM for immediate gains; add flow adaptation for enhanced prosody if resources allow. Vocoder and tokenizer can remain frozen. This modular strategy streamlines adaptation for new languages and domains.
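In code, this recipe amounts to toggling which parameter groups receive gradients. A minimal sketch is shown below; the attribute names (llm, flow, hift) are assumptions about the model object made for illustration, not a documented API.

```python
# Sketch of the practical recipe: train the LM (and optionally the flow decoder),
# keep the vocoder, speech tokenizer, and speaker encoder frozen.
# Attribute names are assumptions about the model object.

def configure_trainable(model, train_lm=True, train_flow=False):
    for p in model.parameters():
        p.requires_grad = False                  # freeze everything by default

    if train_lm:
        for p in model.llm.parameters():         # text-speech LM
            p.requires_grad = True
    if train_flow:
        for p in model.flow.parameters():        # flow-matching decoder
            p.requires_grad = True
    # model.hift (vocoder), the speech tokenizer, and the CAM++ encoder stay frozen.

    return [p for p in model.parameters() if p.requires_grad]

# Usage sketch:
# trainable = configure_trainable(cosyvoice_model, train_lm=True, train_flow=True)
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```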

Cross-lingual Training Effects

Bilingual training consistently boosts SECS. WER benefits appear at very low and very high data scales and are neutral at mid-scale, suggesting a cross-lingual style-regularization effect.
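As a sketch of how the bilingual "mix" regime can be realized (the dataset objects passed in are placeholders; only the mixing idea is shown):

```python
# Sketch of the bilingual "mix" regime: one loader drawing from both language subsets,
# so every batch can contain French and German utterances.
from torch.utils.data import ConcatDataset, DataLoader

def make_mix_loader(fr_dataset, de_dataset, batch_size=16):
    mixed = ConcatDataset([fr_dataset, de_dataset])
    return DataLoader(mixed, batch_size=batch_size, shuffle=True)

# The "mono" regime simply builds the loader from a single-language subset instead.
```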

Limitations & Future Directions

Limitations include the frozen speech tokenizer and the moderate data scale. Future directions: tokenizer adaptation, larger corpora, backbone swaps, and instruction-tuned prosody prompts.