Neural Vocoder · French Speech · WavLM-Base+

WavLM2Audio

Reconstructing French Speech from SSL Representations

A neural vocoder trained on 238h of cleaned French corpora, capable of reconstructing high-quality audio from frozen WavLM-Base+ representations — the foundational stage for continuous voice conversion in WavLM latent space.

238h
French training data
23.8%
F0 RMSE reduction
0.96
F0 correlation
×320
Upsampling factor
Architecture

Three-stage pipeline

From raw French audio to reconstructed waveform — a frozen SSL encoder, a trainable fusion and adapter module, and a HiFi-GAN generator with optional adversarial supervision.

Figure 1 — WavLM2Audio neural vocoder architecture (B = batch size, T′ = T / 320). Stage 1 (frozen): WavLM-Base+ (1 conv + 12 transformer layers, dim 768) maps ground-truth audio (B, T) at 16 kHz to hidden states [h₀, …, h₁₂] of shape (B, T′, 768); all weights are frozen. Stage 2 (trainable): the N last layers, e.g. [h₉, h₁₀, h₁₁, h₁₂], are combined by weighted fusion z = Σᵢ αᵢ hᵢ, then a conv adapter (4 × Conv1D + LayerNorm + GELU) projects 768 → 256, giving fused features (B, T′, 256). Stage 3: a HiFi-GAN generator upsamples ×8, ×5, ×4, ×2 (total ×320) through ResBlocks + Conv1D with MRF kernels (3, 7, 11) to reconstruct audio (B, T) at 16 kHz. Supervision: MPD — Multi-Period discriminator (periods 2, 3, 5, 7, 11) and MSD — Multi-Scale discriminator (scales ×1, ×2, ×4), trained with ℒ_adv (hinge) + ℒ_FM (feature matching).
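The shape bookkeeping in Figure 1 can be traced with a short sketch (plain Python, no model weights; `pipeline_shapes` is an illustrative helper, not project code):

```python
def pipeline_shapes(B, T):
    """Trace tensor shapes through the three stages of Figure 1.

    B: batch size; T: waveform length in samples at 16 kHz.
    Assumes T is a multiple of the 320-sample WavLM hop.
    """
    Tp = T // 320  # frame count T' after the frozen encoder
    return {
        "audio_in": (B, T),                # ground-truth waveform
        "wavlm_states": (13, B, Tp, 768),  # h_0 .. h_12 (conv + 12 transformer)
        "fused": (B, Tp, 768),             # z = sum_i alpha_i * h_i
        "adapted": (B, Tp, 256),           # after the 4-block conv adapter
        "audio_out": (B, Tp * 320),        # HiFi-GAN output, x320 upsampling
    }
```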
S1

WavLM-Base+ Extraction

12-layer transformer encoder (dim 768) pre-trained on Libri-Light, GigaSpeech, VoxPopuli. All weights are frozen — no gradient flows through the SSL backbone.

12 layers · dim 768 · frozen · 16 kHz input
S2

Layer Fusion + Convolutional Adapter

Learned, softmax-normalized weights αᵢ fuse the N last encoder layers. A 4-block Conv1D adapter then projects 768 → 256 with progressively wider temporal context.

learned αᵢ · 4 conv blocks · 768 → 256 · trainable
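The fusion step can be sketched numerically. This is a NumPy stand-in for illustration only — the actual module operates on PyTorch tensors, and the conv adapter is omitted:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_layers(hidden_states, alpha_logits):
    """z = sum_i alpha_i * h_i over the N selected last layers.

    hidden_states: (N, T', 768) stacked WavLM layer outputs
    alpha_logits:  (N,) learned parameters, softmax-normalized
    """
    alpha = softmax(alpha_logits)                      # (N,)
    return np.tensordot(alpha, hidden_states, axes=1)  # (T', 768)

# Toy example: N = 4 layers, T' = 5 frames
h = np.random.randn(4, 5, 768)
z = fuse_layers(h, np.zeros(4))  # equal logits -> plain layer mean
```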
S3

HiFi-GAN Generator + Discriminators

Progressive upsampling ×(8×5×4×2)=×320. MRF residual blocks with kernels (3,7,11). Optional MPD + MSD discriminators with hinge loss and feature matching.

×320 upsample · MPD + MSD · feature matching · 16 kHz output
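The upsampling arithmetic is worth making explicit (plain Python; the stage factors are taken from the figure and text):

```python
# HiFi-GAN upsampling stages from the text: x8, x5, x4, x2
stages = [8, 5, 4, 2]
factor = 1
for s in stages:
    factor *= s           # total x320

# Frame/sample bookkeeping for a 10 s clip at 16 kHz:
T = 10 * 16_000           # 160 000 samples in
T_prime = T // factor     # 500 WavLM frames (20 ms hop)
T_out = T_prime * factor  # 160 000 samples out
```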
Training configuration
Optimizer: AdamW (β₁ = 0.8, β₂ = 0.99)
Learning rate: 2×10⁻⁴ (G and D)
LR scheduler: ExponentialLR, γ = 0.999
Batch size: 16/GPU × 4 GPUs · DDP + AMP
Training: 50 epochs + early stopping
GAN warmup: 10 000 spectral-only steps
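The hyper-parameters above can be collected into a config sketch (key names are illustrative, not the project's actual schema; the per-step vs. per-epoch granularity of the LR decay is an assumption):

```python
# Hedged sketch of the training configuration listed above.
cfg = {
    "optimizer": "AdamW", "betas": (0.8, 0.99),
    "lr": 2e-4,                               # same for generator G and discriminator D
    "lr_gamma": 0.999,                        # ExponentialLR decay factor
    "batch_size_per_gpu": 16, "num_gpus": 4,  # DDP + AMP
    "epochs": 50,                             # plus early stopping
    "gan_warmup_steps": 10_000,               # spectral-only before enabling GAN losses
}

def lr_after(steps, lr0=cfg["lr"], gamma=cfg["lr_gamma"]):
    """Learning rate after `steps` ExponentialLR decay steps."""
    return lr0 * gamma ** steps
```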
Loss function ℒ_G
L₁ temporal: λ₁ · ℒ_L1
Log-mel L₁: λ₂ · ℒ_mel (80 bins)
Multi-res STFT: λ₃ · ℒ_STFT (FFT 512/1024/2048)
Adversarial: λ₄ · ℒ_adv (hinge)
Feature matching: λ₅ · ℒ_FM
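The total generator loss is ℒ_G = λ₁ℒ_L1 + λ₂ℒ_mel + λ₃ℒ_STFT + λ₄ℒ_adv + λ₅ℒ_FM. As one concrete piece, here is a common formulation of the multi-resolution STFT term (spectral convergence plus log-magnitude L1). The project's exact variant is not specified; hop sizes of n_fft/4 are an assumption:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via a Hann-windowed DFT at one resolution."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multires_stft_loss(y_hat, y,
                       resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Spectral convergence + log-magnitude L1, averaged over resolutions."""
    total = 0.0
    for n_fft, hop in resolutions:
        S_hat, S = stft_mag(y_hat, n_fft, hop), stft_mag(y, n_fft, hop)
        sc = np.linalg.norm(S - S_hat) / (np.linalg.norm(S) + 1e-8)
        log_l1 = np.abs(np.log(S + 1e-7) - np.log(S_hat + 1e-7)).mean()
        total += sc + log_l1
    return total / len(resolutions)
```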
Inference protocol
Chunk duration: 10 s
Overlap: 25%
Windowing: Hann + overlap-add
Output normalization: peak in [−0.95, 0.95]
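The chunked inference protocol can be sketched with NumPy. The `vocoder` callable stands in for the full WavLM → fusion → HiFi-GAN pipeline, and the exact windowing details of the project may differ:

```python
import numpy as np

def chunked_infer(x, vocoder, sr=16_000, chunk_s=10.0, overlap=0.25):
    """Chunked inference with Hann-windowed overlap-add, then peak normalization.

    `vocoder` is any function mapping a waveform chunk to a same-length waveform.
    """
    chunk = int(chunk_s * sr)
    hop = int(chunk * (1 - overlap))
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    win = np.hanning(chunk)
    for start in range(0, len(x), hop):
        seg = x[start:start + chunk]
        w = win[: len(seg)]
        out[start:start + len(seg)] += vocoder(seg) * w   # windowed overlap-add
        norm[start:start + len(seg)] += w
    out /= np.maximum(norm, 1e-8)                         # undo window overlap
    peak = np.abs(out).max()
    return out * (0.95 / peak) if peak > 0 else out       # peak in [-0.95, 0.95]
```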
Results

Impact of adversarial supervision

Evaluated on 15 stratified held-out utterances (unseen speakers, average duration 2.45 s). Spectral metrics are computed frame-by-frame; perceptual metrics at utterance level. Each metric reads spectral-only → +GAN.

MCD ↓
Mel-cepstral distortion · 24-dim MCEP · WORLD
9.72 → 8.43 dB
−13.3%
F0 RMSE ↓
Fundamental frequency error · CREPE
10.1 → 7.7 Hz
−23.8%
Mel-L1 ↓
Log-mel spectrogram L1 distance
1.55 → 1.17
−24.5%
PESQ ↑
ITU-T P.862 perceptual quality score (1–4.5)
1.11 → 1.28
+15.3%
STOI ↑
Short-time objective intelligibility (0–1)
0.74 → 0.86
+16.2%
F0 Corr ↑ · V/UV F1 ↑
Pitch correlation + voiced/unvoiced F1
F0 corr: 0.83 → 0.96
V/UV F1: 0.878 → 0.932
Key findings
LAYERS 7–12

Upper WavLM layers carry most of the phonetic and prosodic information — sufficient as vocoder input.

GAN IS INDISPENSABLE

Spectral losses alone fail to capture multi-scale speech structure. The MPD captures pitch periodicity effectively.

LEARNED FUSION

With the last 7 layers fused, layer 6 (absolute index) emerges as dominant — a weighting learned automatically via αᵢ.

STAGE 1 → STAGE 2

Validating this decoder opens the path to continuous voice conversion via diffusion or flow matching in WavLM space.

Audio Demo

Listen to the reconstruction

Two male French speakers from the Common Voice test set — never seen during training. Audio reconstructed using checkpoint checkpoint_step180000.pt (+GAN, N=9 layers).

nassimaODL/wavlm-vocoder-french checkpoint_step180000.pt · +MPD/MSD+FM · N=9 layers
Common Voice FR · male speaker · unseen · +GAN · N=9 · short French utterance · read speech · checkpoint_step180000.pt
Original input
Ground truth · 16 kHz
Reconstructed
WavLM2Audio · +GAN
MCD
8.43 dB
PESQ
1.28
STOI
0.86
F0 Corr
0.96
F0 RMSE
7.7 Hz
Training Data

238h of cleaned French speech

Three complementary public corpora covering studio quality, audiobooks, and crowd-sourced conditions — cleaned through a 5-step reproducible pipeline.

SIWIS

10.9h

Studio-quality read speech · 5 speakers · University of Edinburgh · train/dev · high SNR anchor.

M-AILABS French

160.7h

Large-scale French audiobooks · 2 speakers · primary training volume · diverse phonetic coverage.

Common Voice FR

66.7h

Crowd-sourced French · diverse unseen speakers · used as test set · real voice conversion conditions.

STEP 01 — NORMALIZATION

Audio normalization

Mono conversion, resampling to 16 kHz, peak amplitude normalization to [−0.95, 0.95].
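A minimal NumPy sketch of the mono downmix and peak normalization (resampling to 16 kHz is assumed to happen upstream with a dedicated resampling library and is not shown):

```python
import numpy as np

def normalize_audio(wav, peak=0.95):
    """Step 01: mono downmix + peak normalization to [-0.95, 0.95].

    wav: (channels, T) or (T,) float array.
    """
    if wav.ndim == 2:          # average channels down to mono
        wav = wav.mean(axis=0)
    m = np.abs(wav).max()
    return wav * (peak / m) if m > 0 else wav
```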

STEP 02 — DURATION

Duration filtering

Segments shorter than 1 s or longer than 20 s are discarded to avoid silences and segmentation errors.

STEP 03 — SILENCE

Silence detection

Segments with more than 50% silence (adaptive energy threshold) are automatically removed.
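A minimal sketch of the duration and silence filters. The project's exact adaptive threshold is not specified; a fixed fraction of the mean frame energy is used here as an assumption:

```python
import numpy as np

def silence_ratio(wav, frame=320, thresh_scale=0.1):
    """Fraction of frames below an adaptive energy threshold
    (illustrative rule: 10% of the mean frame energy)."""
    n = len(wav) // frame
    energy = (wav[: n * frame].reshape(n, frame) ** 2).mean(axis=1)
    return float((energy < thresh_scale * energy.mean()).mean())

def keep_segment(wav, sr=16_000):
    """Steps 02-03: duration in [1 s, 20 s] and at most 50% silence."""
    dur = len(wav) / sr
    return 1.0 <= dur <= 20.0 and silence_ratio(wav) <= 0.5
```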

STEP 04–05 — QC + MANUAL

Acoustic QC + manual check

Clipping detection and SNR estimation reject degraded recordings; 100 samples per corpus are validated by manual listening.