Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS

0. Contents

Abstract
Demos on Spon data (TTS)
Demos on speaker adaptation (TTS)
Demos on Hifi data (TTS)
Demos on unseen speakers (copy synthesis)
Demos on seen speakers (copy synthesis)

1. Abstract

In the current two-stage neural text-to-speech (TTS) paradigm, it is ideal to have a universal neural vocoder that, once trained, is robust to the imperfect mel-spectrograms predicted by the acoustic model. To this end, we propose the Robust MelGAN vocoder, which solves the metallic sound problem of the original multi-band MelGAN and increases its generalization ability. Specifically, we introduce a fine-grained network dropout strategy to the generator. With a specifically designed over-smooth handler, which separates the speech signal into periodic and aperiodic components, we apply network dropout only to the aperiodic components, which alleviates the metallic sound while maintaining good speaker similarity. To further improve generalization ability, we introduce several data augmentation methods to augment fake data in the discriminator, including harmonic shift, harmonic noise and phase noise. Experiments show that Robust MelGAN can be used as a universal vocoder, significantly improving sound quality in TTS systems built on various types of data.
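To give a rough intuition for one of the augmentations named above, phase noise, the following is a minimal sketch and not the implementation used in this work. It assumes that phase noise means randomly perturbing the spectral phase of a waveform while keeping its magnitude spectrum intact. The function name `add_phase_noise`, the parameter `noise_std`, and the use of a single whole-signal FFT (rather than a frame-wise STFT) are all assumptions made for illustration.

```python
import numpy as np


def add_phase_noise(wave, noise_std=0.3, seed=0):
    """Hypothetical phase-noise augmentation (illustrative assumption only).

    Adds Gaussian noise to the phase of each frequency bin and resynthesizes
    the waveform, leaving the magnitude spectrum unchanged.
    """
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(wave)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    noise = rng.normal(0.0, noise_std, size=phase.shape)
    # The DC bin (and the Nyquist bin for even lengths) must stay real for a
    # real-valued signal, so their phases are left untouched.
    noise[0] = 0.0
    if wave.shape[0] % 2 == 0:
        noise[-1] = 0.0

    noisy_spectrum = magnitude * np.exp(1j * (phase + noise))
    return np.fft.irfft(noisy_spectrum, n=wave.shape[0])


# Toy usage: a two-harmonic test tone standing in for generated ("fake") audio.
t = np.arange(1024) / 16000.0
wave = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
augmented = add_phase_noise(wave)
```

In this sketch the augmented signal has the same magnitude spectrum as the input but a different waveform, which is the kind of variation a discriminator could be exposed to during training.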

2. Demos on Spon data (TTS)

Spon data is a studio-recorded conversation dataset which comprises a male speaker and a female speaker, each with 8,000 utterances. The dataset contains expressive conversational speech with fast talking, different emotions, breaths and even smiles.

Speakers:

male	female

Demos:

speaker	multi-band MelGAN	Robust MelGAN	- over-smooth handler	- data augmentation
male
male
male
male
male
female
female
female
female
female

3. Demos on speaker adaptation (TTS)

We use 3 male and 2 female speakers, each with 50 utterances recorded in a typical office room, to fine-tune the base model for low-resource speaker adaptation.

Speakers:

spk1	spk2	spk3	spk4	spk5

Demos:

speaker	multi-band MelGAN	Robust MelGAN	- over-smooth handler	- data augmentation
spk1
spk1
spk2
spk2
spk3
spk3
spk4
spk4
spk5
spk5

4. Demos on Hifi data (TTS)

Hifi data is a high-fidelity dataset consisting of a male speaker and a female speaker with typical reading style, each with 5,000 utterances.

Speakers:

male	female

Demos:

speaker	multi-band MelGAN	Robust MelGAN	- over-smooth handler	- data augmentation
male
male
male
male
male
female
female
female
female
female

5. Demos on unseen speakers (copy synthesis)

We use the multi-speaker Mandarin dataset AISHELL-3 as the source of unseen speakers, from which we randomly selected 50 utterances from 10 speakers for the copy-synthesis test.