Reconstructing French Speech from SSL Representations
A neural vocoder trained on 238h of cleaned French corpora, capable of reconstructing high-quality audio from frozen WavLM-Base+ representations — the foundational stage for continuous voice conversion in WavLM latent space.
From raw French audio to reconstructed waveform — a frozen SSL encoder, a trainable fusion and adapter module, and a HiFi-GAN generator with optional adversarial supervision.
12-layer transformer encoder (dim 768) pre-trained on Libri-Light, GigaSpeech, VoxPopuli. All weights are frozen — no gradient flows through the SSL backbone.
Learned weighted fusion αᵢ (softmax-normalized) combines the last N encoder layers. A 4-block Conv1D adapter projects 768 → 256 with progressively widening temporal context.
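The fusion step can be sketched numerically. This is a minimal NumPy sketch of the softmax-weighted sum only (the actual module is a trainable PyTorch layer; `fuse_layers` and its argument names are hypothetical):

```python
import numpy as np

def fuse_layers(hidden_states, alpha_logits):
    """Softmax-normalized weighted sum over the last N encoder layers.

    hidden_states: shape (N, T, 768), one entry per selected WavLM layer.
    alpha_logits:  learnable logits of shape (N,); softmax yields the weights alpha_i.
    """
    w = np.exp(alpha_logits - alpha_logits.max())
    w /= w.sum()                                   # softmax: alpha_i sum to 1
    return np.tensordot(w, hidden_states, axes=1)  # -> (T, 768)

# Toy check: 3 layers, 4 frames; equal logits reduce to a simple mean
h = np.random.randn(3, 4, 768)
fused = fuse_layers(h, np.zeros(3))
assert fused.shape == (4, 768)
assert np.allclose(fused, h.mean(axis=0))
```

In training, the logits receive gradients like any other parameter, which is how one layer (e.g. layer 6) can come to dominate the mixture.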
Progressive upsampling ×(8×5×4×2) = ×320. Multi-Receptive-Field (MRF) residual blocks with kernels (3, 7, 11). Optional MPD + MSD discriminators with hinge loss and feature matching.
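A quick arithmetic check of the upsampling schedule (the per-stage rates are taken from the ×(8×5×4×2) factorization above):

```python
# The per-stage upsample rates must multiply to the 320-sample hop
# separating consecutive SSL frames at 16 kHz.
upsample_rates = [8, 5, 4, 2]
total = 1
for r in upsample_rates:
    total *= r
assert total == 320            # total x320 upsampling
# WavLM emits ~50 frames per second on 16 kHz input: 16000 / 320 = 50
assert 16000 // total == 50
```

This is why the generator can consume SSL features directly: one 768-dim frame expands to exactly 320 audio samples.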
Evaluated on 15 stratified held-out utterances (unseen speakers, avg 2.45 s). Spectral metrics computed frame-by-frame; perceptual metrics at utterance level.
Upper WavLM layers carry most of the phonetic and prosodic information — sufficient as vocoder input.
Spectral losses alone fail to capture multi-scale speech structure. The multi-period discriminator (MPD) captures pitch periodicity effectively.
Layer 6 (absolute index) emerges as dominant when using the last 7 layers — learned automatically via αᵢ.
Validating this decoder opens the path to continuous voice conversion via diffusion or flow matching in WavLM space.
Two male French speakers from the Common Voice test set — never seen during training. Audio reconstructed with checkpoint_step180000.pt (+GAN, N=9 layers).
Three complementary public corpora covering studio quality, audiobooks, and crowd-sourced conditions — cleaned through a 5-step reproducible pipeline.
Studio-quality read speech · 5 speakers · University of Edinburgh · train/dev · high SNR anchor.
Large-scale French audiobooks · 2 speakers · primary training volume · diverse phonetic coverage.
Crowd-sourced French · diverse unseen speakers · used as test set · real voice conversion conditions.
Mono conversion, resampling to 16 kHz, peak amplitude normalization to [−0.95, 0.95].
Segments shorter than 1 s or longer than 20 s are discarded to avoid silences and segmentation errors.
Segments with more than 50% silence (adaptive energy threshold) are automatically removed.
Clipping detection and SNR estimation reject degraded recordings; 100 samples per corpus were validated by manual listening.
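The normalization, duration, and silence filters above can be sketched as follows. This is a simplified NumPy illustration: the frame/hop sizes and the max-energy-relative threshold are assumptions for the example, not the pipeline's exact settings:

```python
import numpy as np

def peak_normalize(wav, peak=0.95):
    """Peak amplitude normalization to [-0.95, 0.95]."""
    m = np.abs(wav).max()
    return wav if m == 0 else wav * (peak / m)

def keep_segment(wav, sr=16000, min_s=1.0, max_s=20.0,
                 frame=400, hop=160, rel_thresh=0.1, max_silence=0.5):
    """Duration filter (1-20 s) plus silence-ratio filter (<= 50% silent frames)."""
    dur = len(wav) / sr
    if not (min_s <= dur <= max_s):
        return False                               # duration filter
    n = 1 + max(0, len(wav) - frame) // hop
    rms = np.array([np.sqrt(np.mean(wav[i*hop:i*hop+frame] ** 2))
                    for i in range(n)])
    # Adaptive threshold: a fraction of the loudest frame's energy
    # (a simplification; the pipeline's exact criterion is not specified here).
    thresh = rel_thresh * rms.max()
    return (rms < thresh).mean() <= max_silence    # silence-ratio filter

sr = 16000
voiced = np.sin(np.linspace(0, 1000, 2 * sr))      # 2 s tone: passes both filters
assert keep_segment(peak_normalize(voiced))
mostly_silent = np.concatenate([voiced[:sr // 4], np.zeros(2 * sr)])
assert not keep_segment(mostly_silent)             # > 50% silence: rejected
```

Mono conversion and 16 kHz resampling are omitted here; in practice they would be done with an audio library such as torchaudio or sox before these filters run.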