Reconstructing French Speech from SSL Representations
A neural vocoder trained on 238h of cleaned French corpora, capable of reconstructing high-quality audio from frozen WavLM-Base+ representations — the foundational stage for continuous voice conversion in WavLM latent space.
From raw French audio to reconstructed waveform — a frozen SSL encoder, a trainable fusion and adapter module, and a HiFi-GAN generator with optional adversarial supervision.
12-layer transformer encoder (dim 768) pre-trained on Libri-Light, GigaSpeech, VoxPopuli. All weights are frozen — no gradient flows through the SSL backbone.
Learned weighted fusion αᵢ (softmax-normalized) combines the last N encoder layers. A 4-block Conv1D adapter projects 768 → 256 with progressively widening temporal context.
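The fusion step can be sketched numerically. This is a minimal NumPy sketch of the softmax-weighted sum only (the actual module is a trainable PyTorch layer; `fuse_layers` and its argument names are hypothetical):

```python
import numpy as np

def fuse_layers(hidden_states, alpha_logits):
    """Softmax-normalized weighted sum over the last N encoder layers.

    hidden_states: shape (N, T, 768), one entry per selected WavLM layer.
    alpha_logits:  learnable logits of shape (N,); softmax yields the weights alpha_i.
    """
    w = np.exp(alpha_logits - alpha_logits.max())
    w /= w.sum()                                   # softmax: alpha_i sum to 1
    return np.tensordot(w, hidden_states, axes=1)  # -> (T, 768)

# Toy check: 3 layers, 4 frames; equal logits reduce to a simple mean
h = np.random.randn(3, 4, 768)
fused = fuse_layers(h, np.zeros(3))
assert fused.shape == (4, 768)
assert np.allclose(fused, h.mean(axis=0))
```

In training, the logits receive gradients like any other parameter, which is how one layer (e.g. layer 6) can come to dominate the mixture.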
Progressive upsampling ×(8×5×4×2) = ×320. Multi-Receptive-Field (MRF) residual blocks with kernels (3, 7, 11). Optional MPD + MSD discriminators with hinge loss and feature matching.
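A quick arithmetic check of the upsampling schedule (the per-stage rates are taken from the ×(8×5×4×2) factorization above):

```python
# The per-stage upsample rates must multiply to the 320-sample hop
# separating consecutive SSL frames at 16 kHz.
upsample_rates = [8, 5, 4, 2]
total = 1
for r in upsample_rates:
    total *= r
assert total == 320            # total x320 upsampling
# WavLM emits ~50 frames per second on 16 kHz input: 16000 / 320 = 50
assert 16000 // total == 50
```

This is why the generator can consume SSL features directly: one 768-dim frame expands to exactly 320 audio samples.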
Evaluated on 15 stratified held-out utterances (unseen speakers, avg 2.45 s). Spectral metrics computed frame-by-frame; perceptual metrics at utterance level.
Upper WavLM layers carry most of the phonetic and prosodic information — sufficient as vocoder input.
Spectral losses alone fail to capture multi-scale speech structure. The multi-period discriminator (MPD) captures pitch periodicity effectively.
Layer 6 (absolute index) emerges as dominant when using the last 7 layers — learned automatically via αᵢ.
Validating this decoder opens the path to continuous voice conversion via diffusion or flow matching in WavLM space.
Two male French speakers from the Common Voice test set — never seen during training. Audio reconstructed with checkpoint_step180000.pt (+GAN, N=9 layers).
Three complementary public corpora covering studio quality, audiobooks, and crowd-sourced conditions — cleaned through a 5-step reproducible pipeline.
Studio-quality read speech · 5 speakers · University of Edinburgh · train/dev · high SNR anchor.
Large-scale French audiobooks · 2 speakers · primary training volume · diverse phonetic coverage.
Crowd-sourced French · diverse unseen speakers · used as test set · real voice conversion conditions.
Mono conversion, resampling to 16 kHz, peak amplitude normalization to [−0.95, 0.95].
Segments shorter than 1 s or longer than 20 s are discarded to avoid silences and segmentation errors.
Segments with more than 50% silence (adaptive energy threshold) are automatically removed.
Clipping detection and SNR estimation reject degraded recordings; 100 samples per corpus were validated by manual listening.
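The normalization, duration, and silence filters above can be sketched as follows. This is a simplified NumPy illustration: the frame/hop sizes and the max-energy-relative threshold are assumptions for the example, not the pipeline's exact settings:

```python
import numpy as np

def peak_normalize(wav, peak=0.95):
    """Peak amplitude normalization to [-0.95, 0.95]."""
    m = np.abs(wav).max()
    return wav if m == 0 else wav * (peak / m)

def keep_segment(wav, sr=16000, min_s=1.0, max_s=20.0,
                 frame=400, hop=160, rel_thresh=0.1, max_silence=0.5):
    """Duration filter (1-20 s) plus silence-ratio filter (<= 50% silent frames)."""
    dur = len(wav) / sr
    if not (min_s <= dur <= max_s):
        return False                               # duration filter
    n = 1 + max(0, len(wav) - frame) // hop
    rms = np.array([np.sqrt(np.mean(wav[i*hop:i*hop+frame] ** 2))
                    for i in range(n)])
    # Adaptive threshold: a fraction of the loudest frame's energy
    # (a simplification; the pipeline's exact criterion is not specified here).
    thresh = rel_thresh * rms.max()
    return (rms < thresh).mean() <= max_silence    # silence-ratio filter

sr = 16000
voiced = np.sin(np.linspace(0, 1000, 2 * sr))      # 2 s tone: passes both filters
assert keep_segment(peak_normalize(voiced))
mostly_silent = np.concatenate([voiced[:sr // 4], np.zeros(2 * sr)])
assert not keep_segment(mostly_silent)             # > 50% silence: rejected
```

Mono conversion and 16 kHz resampling are omitted here; in practice they would be done with an audio library such as torchaudio or sox before these filters run.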