ParakeetRebeccaRosario/examples/deepvoice3/README.md

4.9 KiB

Deep Voice 3

PaddlePaddle dynamic graph implementation of Deep Voice 3, a convolutional network based text-to-speech generative model. The implementation is based on Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning.

We implement Deep Voice 3 using Paddle Fluid with dynamic graph, which is convenient for building flexible network architectures.

Dataset

We experiment with the LJSpeech dataset. Download and unzip LJSpeech.

wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2

Model Architecture

Deep Voice 3 model architecture

The model consists of an encoder, a decoder and a converter (and a speaker embedding for multispeaker models). The encoder and the decoder together form the seq2seq part of the model, and the converter forms the postnet part.

Project Structure

├── config/
├── synthesize.py
├── data.py
├── preprocess.py
├── clip.py
├── train.py
└── vocoder.py

Preprocess

Preprocess to dataset with preprocess.py.

usage: preprocess.py [-h] --config CONFIG --input INPUT --output OUTPUT

preprocess ljspeech dataset and save it.

optional arguments:
  -h, --help       show this help message and exit
  --config CONFIG  config file
  --input INPUT    data path of the original data
  --output OUTPUT  path to save the preprocessed dataset

example code:

python preprocess.py --config=configs/ljspeech.yaml --input=LJSpeech-1.1/ --output=data/ljspeech

Train

Train the model using train.py, follow the usage displayed by python train.py --help.

usage: train.py [-h] --config CONFIG --input INPUT

train a Deep Voice 3 model with LJSpeech

optional arguments:
  -h, --help       show this help message and exit
  --config CONFIG  config file
  --input INPUT    data path of the original data

example code:

CUDA_VISIBLE_DEVICES=0 python train.py --config=configs/ljspeech.yaml --input=data/ljspeech

It would create a runs folder, outputs for each run is saved in a seperate folder in runs, whose name is the time joined with hostname. Inside this filder, tensorboard log, parameters and optimizer states are saved. Parameters(*.pdparams) and optimizer states(*.pdopt) are named by the step when they are saved.

runs/Jul07_09-39-34_instance-mqcyj27y-4/
├── checkpoint
├── events.out.tfevents.1594085974.instance-mqcyj27y-4
├── step-1000000.pdopt
├── step-1000000.pdparams
├── step-100000.pdopt
├── step-100000.pdparams
...

Since we use waveflow to synthesize audio while training, so download the trained waveflow model and extract it in current directory before training.

wget https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_1.0.zip
unzip waveflow_res128_ljspeech_ckpt_1.0.zip

Visualization

You can visualize training losses, check the attention and listen to the synthesized audio when training with teacher forcing.

example code:

tensorboard --logdir=runs/ --host=$HOSTNAME --port=8000

Synthesis

usage: synthesize from a checkpoint [-h] --config CONFIG --input INPUT
                                    --output OUTPUT --checkpoint CHECKPOINT
                                    --monotonic_layers MONOTONIC_LAYERS
                                    [--vocoder {griffin-lim,waveflow}]

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file
  --input INPUT         text file to synthesize
  --output OUTPUT       path to save audio
  --checkpoint CHECKPOINT
                        data path of the checkpoint
  --monotonic_layers MONOTONIC_LAYERS
                        monotonic decoder layers' indices(start from 1)
  --vocoder {griffin-lim,waveflow}
                        vocoder to use

synthesize.py is used to synthesize several sentences in a text file. --monotonic_layers is the index of the decoders layer that manifest monotonic diagonal attention. You can get monotonic layers by inspecting tensorboard logs. Mind that the index starts from 1. The layers that manifest monotonic diagonal attention are stable for a model during training and synthesizing, but differ among different runs. So once you get the indices of monotonic layers by inspecting tensorboard log, you can use them at synthesizing. Note that only decoder layers that show strong diagonal attention should be considerd. --vocoder is the vocoder to use. Current supported values are "waveflow" and "griffin-lim". Default value is "waveflow".

example code:

CUDA_VISIBLE_DEVICES=2 python synthesize.py \
    --config configs/ljspeech.yaml \
    --input sentences.txt \
    --output outputs/ \
    --checkpoint runs/Jul07_09-39-34_instance-mqcyj27y-4/step-1320000 \
    --monotonic_layers "5,6" \
    --vocoder waveflow