Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS
0. Contents
- Abstract
- Demos on Spon data (TTS)
- Demos on speaker adaptation (TTS)
- Demos on Hifi data (TTS)
- Demos on unseen speakers (copy synthesis)
- Demos on seen speakers (copy synthesis)
1. Abstract
In the current two-stage neural text-to-speech (TTS) paradigm, it is ideal to have a universal neural vocoder that, once trained, is robust to the imperfect mel-spectrograms predicted by the acoustic model. To this end, we propose the Robust MelGAN vocoder, which solves the original multi-band MelGAN's metallic sound problem and increases its generalization ability. Specifically, we introduce a fine-grained network dropout strategy to the generator. With a specifically designed over-smooth handler, which separates the speech signal into periodic and aperiodic components, we apply network dropout only to the aperiodic components, which alleviates the metallic sound and maintains good speaker similarity. To further improve generalization ability, we introduce several data augmentation methods to augment fake data in the discriminator, including harmonic shift, harmonic noise and phase noise. Experiments show that Robust MelGAN can be used as a universal vocoder, significantly improving sound quality in TTS systems built on various types of data.
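As a rough illustration of the phase-noise idea named in the abstract, the sketch below perturbs the phase of a waveform's spectrum while leaving its magnitude untouched. This page does not specify how the augmentation is actually computed, so everything here is an assumption made for illustration: the helper name `add_phase_noise`, the uniform noise model, the `max_shift` default, and the use of a single full-signal FFT rather than a short-time transform. In a setup like the one described above, such a function would be applied to generated (fake) audio before it reaches the discriminator.

```python
import numpy as np


def add_phase_noise(wave, max_shift=0.1 * np.pi, seed=None):
    """Add a random phase offset to every frequency bin of `wave`.

    Hypothetical helper: the uniform noise model and the `max_shift`
    default are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(wave)
    noise = rng.uniform(-max_shift, max_shift, size=spectrum.shape)
    # The DC bin (and the Nyquist bin for even-length input) must stay real
    # for a real-valued signal, so their phase is left unchanged.
    noise[0] = 0.0
    if len(wave) % 2 == 0:
        noise[-1] = 0.0
    perturbed = spectrum * np.exp(1j * noise)
    return np.fft.irfft(perturbed, n=len(wave))


# Toy stand-in for a generated waveform: one second of a 220 Hz tone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
fake = 0.5 * np.sin(2 * np.pi * 220 * t)
augmented = add_phase_noise(fake, seed=0)
```

Because only the phase is changed, the magnitude spectrum of `augmented` matches that of `fake`, so the augmented signal keeps the same spectral envelope while its waveform shape differs.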
2. Demos on Spon data (TTS).
Spon data is a studio-recorded conversation dataset which comprises a male speaker and a female speaker, each with 8,000 utterances. The dataset contains expressive conversational speech with fast talking, different emotions, breaths and even smiles.
Speakers:
male | female |
---|---|
Demos:
speaker | multi-band MelGAN | Robust MelGAN | - over-smooth handler | - data augmentation |
---|---|---|---|---|
male | | | | |
male | | | | |
male | | | | |
male | | | | |
male | | | | |
female | | | | |
female | | | | |
female | | | | |
female | | | | |
female | | | | |
3. Demos on speaker adaptation (TTS).
We use 3 male and 2 female speakers, each with 50 utterances recorded in a typical office room, to fine-tune the base model for low-resource speaker adaptation.
Speakers:
spk1 | spk2 | spk3 | spk4 | spk5 |
---|---|---|---|---|
Demos:
speaker | multi-band MelGAN | Robust MelGAN | - over-smooth handler | - data augmentation |
---|---|---|---|---|
spk1 | | | | |
spk1 | | | | |
spk2 | | | | |
spk2 | | | | |
spk3 | | | | |
spk3 | | | | |
spk4 | | | | |
spk4 | | | | |
spk5 | | | | |
spk5 | | | | |
4. Demos on Hifi data (TTS).
Hifi data is a high-fidelity dataset consisting of a male speaker and a female speaker, each with 5,000 utterances recorded in a typical reading style.
Speakers:
male | female |
---|---|
Demos:
speaker | multi-band MelGAN | Robust MelGAN | - over-smooth handler | - data augmentation |
---|---|---|---|---|
male | | | | |
male | | | | |
male | | | | |
male | | | | |
male | | | | |
female | | | | |
female | | | | |
female | | | | |
female | | | | |
female | | | | |
5. Demos on unseen speakers (copy synthesis).
We use the multi-speaker Mandarin dataset AISHELL-3 as the source of unseen speakers, from which we randomly selected 50 utterances of 10 speakers for the copy synthesis test.
Recording | multi-band MelGAN | Robust MelGAN | - over-smooth handler | - data augmentation |
---|---|---|---|---|
6. Demos on seen speakers (copy synthesis).
We reserve 50 utterances of 10 speakers, randomly selected from the training data, as the seen speakers.
Recording | multi-band MelGAN | Robust MelGAN | - over-smooth handler | - data augmentation |
---|---|---|---|---|