Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS

0. Contents

  1. Abstract
  2. Demos on Spon data (TTS)
  3. Demos on speaker adaptation (TTS)
  4. Demos on Hifi data (TTS)
  5. Demos on unseen speakers (copy synthesis)
  6. Demos on seen speakers (copy synthesis)


1. Abstract

In the current two-stage neural text-to-speech (TTS) paradigm, it is ideal to have a universal neural vocoder that, once trained, is robust to the imperfect mel-spectrograms predicted by the acoustic model. To this end, we propose the Robust MelGAN vocoder, which solves the original multi-band MelGAN's metallic sound problem and increases its generalization ability. Specifically, we introduce a fine-grained network dropout strategy to the generator. With a specifically designed over-smooth handler which separates the speech signal into periodic and aperiodic components, we perform network dropout only on the aperiodic components, which alleviates the metallic sound and maintains good speaker similarity. To further improve generalization ability, we introduce several data augmentation methods to augment fake data in the discriminator, including harmonic shift, harmonic noise and phase noise. Experiments show that Robust MelGAN can be used as a universal vocoder, significantly improving sound quality in TTS systems built on various types of data.
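The idea of applying dropout to the aperiodic components only can be illustrated with a minimal sketch. This is not the Robust MelGAN generator: the abstract does not describe how the over-smooth handler splits the signal, so the two input lists below are hypothetical stand-ins for the periodic and aperiodic branches. The function names, the dropout rate, and the `training` flag are likewise assumptions made for illustration.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`,
    and rescale the kept values by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def combine_branches(periodic, aperiodic, rate=0.5, training=True, seed=None):
    """Apply dropout to the aperiodic branch only, then sum both branches.

    `periodic` and `aperiodic` are hypothetical feature lists of equal length;
    the periodic branch is always passed through unchanged.
    """
    if len(periodic) != len(aperiodic):
        raise ValueError("branches must have the same length")
    rng = random.Random(seed)
    if training:  # assumption: dropout is active only when this flag is set
        aperiodic = dropout(aperiodic, rate, rng)
    return [p + a for p, a in zip(periodic, aperiodic)]


if __name__ == "__main__":
    periodic = [1.0, 2.0, 3.0, 4.0]
    aperiodic = [0.5, 0.5, 0.5, 0.5]
    print(combine_branches(periodic, aperiodic, rate=0.5, seed=0))
```

Because the periodic branch bypasses dropout, the harmonic structure that carries speaker identity is left untouched in this sketch, while the aperiodic part is randomly perturbed.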



2. Demos on Spon data (TTS)

Spon data is a studio-recorded conversation dataset which comprises a male speaker and a female speaker, each with 8,000 utterances. The dataset contains expressive conversational speech with fast talking, different emotions, breaths and even smiles.

Speakers:

male female

Demos:

speaker | multi-band MelGAN | Robust MelGAN | - over-smooth handler | - data augmentation

male

male

male

male

male

female

female

female

female

female

3. Demos on speaker adaptation (TTS)

We use 3 male and 2 female speakers, each with 50 utterances recorded in a typical office room, to fine-tune the base model for low-resource speaker adaptation.

Speakers:

spk1 spk2 spk3 spk4 spk5

Demos:

speaker | multi-band MelGAN | Robust MelGAN | - over-smooth handler | - data augmentation

spk1

spk1

spk2

spk2

spk3

spk3

spk4

spk4

spk5

spk5

4. Demos on Hifi data (TTS)

Hifi data is a high-fidelity dataset consisting of a male speaker and a female speaker with a typical reading style, each with 5,000 utterances.

Speakers:

male female

Demos:

speaker | multi-band MelGAN | Robust MelGAN | - over-smooth handler | - data augmentation

male

male

male

male

male

female

female

female

female

female

5. Demos on unseen speakers (copy synthesis)

We use the multi-speaker Mandarin dataset AISHELL-3 as the source of unseen speakers, from which we randomly select 50 utterances from 10 speakers for the copy-synthesis test.

Recording | multi-band MelGAN | Robust MelGAN | - over-smooth handler | - data augmentation

6. Demos on seen speakers (copy synthesis)

We reserve 50 utterances from 10 speakers, randomly selected from the training data, as the seen speakers.

Recording | multi-band MelGAN | Robust MelGAN | - over-smooth handler | - data augmentation