Research Overview
Challenge
While generative TTS excels in English and Chinese, high-quality, expressive synthesis for European languages like French and German is an underexplored area in open-source systems. We address this gap with a systematic, component-level adaptation of CosyVoice2, a state-of-the-art TTS model.
Methodology
We introduce a rigorous, reproducible ablation grid that fine-tunes key components of CosyVoice2 across data budgets and language regimes. Our backbone-agnostic refactor enables plug-and-play LLM backbones and efficient LoRA adaptation, setting a new standard for multilingual TTS research.
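As a concrete illustration, the sketch below shows how LoRA adapters might be attached to a text-speech LM backbone with Hugging Face peft. The backbone identifier, target module names, and rank settings are assumptions for exposition, not the project's actual configuration.

```python
# Minimal LoRA sketch (assumptions: the text-speech LM loads as a Hugging Face
# causal LM and its attention projections are named q_proj / v_proj).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

backbone = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # illustrative backbone choice

lora_cfg = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projection names
    task_type="CAUSAL_LM",
)

model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```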
Innovation
One of the first open, systematic component-level benchmarks for adapting a modern generative TTS model (CosyVoice2) to European languages, enabling transparent reproduction, extension, and evaluation of cross-lingual voice and prosody cloning.
CosyVoice2 Architecture
CosyVoice2 follows a modular three-stage TTS pipeline. Each component can be fine-tuned independently or in combination to adapt the model to European languages. Click on the components below to explore their role and impact on the final speech quality.
Select a Component
Upper row: generative stages. Lower row: frozen prompt / conditioning path (semantic tokenizer, speaker embedding, reference mel). LM & Flow are primary adaptation targets; HiFi-GAN stays frozen.
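To make the data flow through these stages explicit, here is a hedged sketch of the pipeline in Python-style pseudocode. Every class and function name is an illustrative placeholder, not the actual CosyVoice2 API.

```python
# Illustrative sketch of a CosyVoice2-style three-stage pipeline.
# All names below are placeholders for exposition, not the real API.
import torch

def synthesize(text: str, prompt_wav: torch.Tensor,
               lm, flow, vocoder,
               tokenizer, spk_encoder, mel_extractor) -> torch.Tensor:
    # Frozen prompt / conditioning path: features derived from the reference audio.
    prompt_tokens = tokenizer(prompt_wav)      # semantic speech tokens of the prompt
    spk_emb = spk_encoder(prompt_wav)          # global speaker embedding
    prompt_mel = mel_extractor(prompt_wav)     # reference mel spectrogram

    # Stage 1: the text-speech LM autoregressively predicts semantic speech tokens.
    speech_tokens = lm.generate(text, prompt_tokens, spk_emb)

    # Stage 2: the flow-matching model decodes the tokens into a mel spectrogram,
    # conditioned on the speaker embedding and the reference mel.
    mel = flow(speech_tokens, spk_emb, prompt_mel)

    # Stage 3: the (frozen) HiFi-GAN vocoder converts the mel to a waveform.
    return vocoder(mel)
```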
Interactive Fine-tuning Demo
Sample Text
Fine-tuning Configuration
Improves semantic understanding and pronunciation patterns
Enhances prosody and natural speech rhythm
Improves audio quality and reduces artifacts
Baseline (No Fine-tuning)
Original CosyVoice2 model without any language-specific training.
Current Configuration
Select fine-tuning components above to hear the progressive improvements in speech quality.
See how these configurations compare quantitatively on our test data
Rows map to your selections above (Text‑Speech LM, Flow, HiFi‑GAN). We highlight the matching row automatically.
Metrics: WER↓ (intelligibility), SECS↑ (speaker similarity), MCD↓ (distortion). Column emphasis follows the selected language.
Evaluation Results
Evaluation across intelligibility (WER↓), speaker similarity (SECS↑), spectral distortion (MCD↓), pitch correlation (F0 Corr↑), and voicing error (V/UV↓). Most WER gains stem from LM fine-tuning; Flow refines prosody; HiFi-GAN remains robust when kept frozen.
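For reference, a minimal sketch of how two of these metrics can be computed. The choice of speaker-embedding model, cepstral extraction settings, and frame alignment (e.g., DTW) are assumptions not specified here.

```python
# Hedged sketch of SECS and MCD; inputs are assumed to be precomputed
# speaker embeddings and time-aligned mel-cepstral frames.
import numpy as np

def secs(ref_emb: np.ndarray, syn_emb: np.ndarray) -> float:
    """Speaker Embedding Cosine Similarity between reference and synthesis."""
    return float(np.dot(ref_emb, syn_emb) /
                 (np.linalg.norm(ref_emb) * np.linalg.norm(syn_emb)))

def mcd(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """Mel Cepstral Distortion in dB over aligned frames of shape
    (frames, coeffs); the 0th (energy) coefficient is excluded."""
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```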
Component Comparison
Learning Curve
Mixed vs. Monolingual Training (across data hours)
Baseline vs. Best Model (best model selected by minimum WER_norm)
Key Findings
Most Impactful Component
The text-speech language model is the key driver of quality, capturing linguistic rhythm and phrasing essential for intelligibility and natural prosody. Flow fine-tuning further improves continuity and expressiveness, while the vocoder is robust and rarely a bottleneck for cross-lingual adaptation.
Data Scaling & Efficiency
Most gains arise within roughly 100–500 monolingual hours (about 200–1000 total hours in the bilingual setting). WER plateaus early at the current data scale, while SECS continues a modest upward trend, reflecting prosody refinement.
Component Synergy & Practical Recipe
Fine-tune the LM for immediate gains; add flow adaptation for enhanced prosody if resources allow. The vocoder and tokenizer can remain frozen. This modular strategy streamlines adaptation for new languages and domains.
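This recipe can be expressed as a small freeze/unfreeze configuration. The component names and the apply_recipe helper below are hypothetical, assuming the model exposes submodules under those attribute names.

```python
# Hypothetical freeze/unfreeze recipes; keys mirror the components discussed
# above and are not actual CosyVoice2 attribute names.
RECIPES = {
    "baseline": {"text_speech_lm": False, "flow": False, "hifigan": False},  # no adaptation
    "minimal":  {"text_speech_lm": True,  "flow": False, "hifigan": False},  # largest WER gains
    "extended": {"text_speech_lm": True,  "flow": True,  "hifigan": False},  # adds prosody refinement
}

def apply_recipe(model, recipe: str) -> None:
    """Freeze all parameters, then unfreeze the components selected by the recipe."""
    for p in model.parameters():
        p.requires_grad = False
    for name, trainable in RECIPES[recipe].items():
        if trainable:
            for p in getattr(model, name).parameters():
                p.requires_grad = True
```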
Cross-lingual Training Effects
Bilingual training consistently boosts SECS. WER benefits appear at very low and very high data scales and are neutral at mid-scale, indicating a cross-lingual style regularization effect.
Limitations & Future Directions
Limitations: the speech tokenizer remains frozen and the data scale is moderate. Future directions: tokenizer adaptation, larger corpora, backbone swaps, and instruction-tuned prosody prompts.