TransformerTTS

PaddlePaddle dynamic graph implementation of TransformerTTS, a neural TTS model built on the Transformer. The implementation is based on the paper Neural Speech Synthesis with Transformer Network.

Dataset

We experiment with the LJSpeech dataset. Download and unzip LJSpeech.

wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
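
The archive unpacks to LJSpeech-1.1/, which contains a wavs/ directory and a metadata.csv file whose lines hold three pipe-separated fields: clip ID, raw transcription, and normalized transcription. The sketch below shows one way to turn such a line into a (wav path, text) pair. The helper name parse_metadata_line and the sample line are made up for illustration; data.py may parse the file differently.

```python
from pathlib import Path


def parse_metadata_line(line, root="LJSpeech-1.1"):
    """Split one metadata.csv line into (wav_path, normalized_text).

    Hypothetical helper for illustration only, not taken from data.py.
    """
    clip_id, _raw_text, normalized_text = line.rstrip("\n").split("|")
    wav_path = Path(root) / "wavs" / f"{clip_id}.wav"
    return wav_path, normalized_text


# Made-up line in the metadata.csv layout (not a real dataset entry).
sample = "LJ000-0000|Dr. Smith paid $5.|Doctor Smith paid five dollars."
wav_path, text = parse_metadata_line(sample)
print(wav_path.as_posix(), "->", text)
```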

Model Architecture


TransformerTTS model architecture

The model replaces the RNN structures and the original attention mechanism of Tacotron2 with multi-head attention. It consists of two main parts, an encoder and a decoder. We also implement the CBHG model of Tacotron as the vocoder and convert the spectrogram into a raw waveform using the Griffin-Lim algorithm.
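
To make the attention mechanism concrete, here is a minimal NumPy sketch of scaled dot-product multi-head self-attention. It is not the code used in this repository: it leaves out the learned query/key/value/output projections, masking, and dropout, and the function names are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, num_heads):
    """Self-attention over x of shape (time, d_model), without learned projections."""
    time, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    # Split the feature axis into heads: (num_heads, time, d_head).
    heads = x.reshape(time, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product scores between every pair of time steps, per head.
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)  # each row sums to 1

    # Weighted sum of values, then merge heads back to (time, d_model).
    context = weights @ heads
    return context.transpose(1, 0, 2).reshape(time, d_model), weights


x = np.random.default_rng(0).normal(size=(5, 8))
out, attn = multi_head_self_attention(x, num_heads=2)
print(out.shape, attn.shape)  # (5, 8) (2, 5, 5)
```

Each head attends over all time steps at once, which is what lets the model drop the step-by-step recurrence of an RNN.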

Project Structure

├── configs                # yaml configuration files
├── data.py                # dataset and dataloader settings for LJSpeech
├── synthesis.py           # script to synthesize waveform from text
├── train_transformer.py   # script for transformer model training
└── train_vocoder.py       # script for vocoder model training

Saving & Loading

train_transformer.py and train_vocoder.py have 3 arguments in common: --checkpoint, --iteration and --output.

  1. --output is the directory for saving results. During training, checkpoints are saved in ${output}/checkpoints and TensorBoard logs are saved in ${output}/log. During synthesis, results are saved in ${output}/samples and the TensorBoard log is saved in ${output}/log.

  2. --checkpoint is the path of a checkpoint and --iteration is the target step. They are used to load checkpoints in the following way.

    • If --checkpoint is provided, the checkpoint specified by --checkpoint is loaded.

    • If --checkpoint is not provided but --iteration is, we try to load the checkpoint of the target step from the ${output}/checkpoints/ directory. For example, given --iteration 120000, the checkpoint ${output}/checkpoints/step-120000.* will be loaded.

    • If neither --checkpoint nor --iteration is provided, we try to load the latest checkpoint from the ${output}/checkpoints/ directory.
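
The three loading rules above can be sketched as a small resolution function. This is an illustration rather than the repository's actual loading code: the name resolve_checkpoint, the choice to return None when nothing is found, and the .pdparams suffix used in the demo are all assumptions.

```python
import re
import tempfile
from pathlib import Path


def resolve_checkpoint(output, checkpoint=None, iteration=None):
    """Return the checkpoint path prefix to load, or None if there is none.

    Hypothetical helper mirroring the three rules above.
    """
    if checkpoint is not None:      # rule 1: an explicit --checkpoint wins
        return str(checkpoint)
    ckpt_dir = Path(output) / "checkpoints"
    if iteration is not None:       # rule 2: a specific step under ${output}/checkpoints
        return str(ckpt_dir / f"step-{iteration}")
    # rule 3: the latest step found under ${output}/checkpoints, if any
    steps = []
    if ckpt_dir.is_dir():
        for entry in ckpt_dir.iterdir():
            match = re.match(r"step-(\d+)", entry.name)
            if match:
                steps.append(int(match.group(1)))
    if not steps:
        return None                 # assumption: nothing to resume from
    return str(ckpt_dir / f"step-{max(steps)}")


# Demo with empty placeholder files standing in for saved checkpoints.
with tempfile.TemporaryDirectory() as output:
    ckpt_dir = Path(output) / "checkpoints"
    ckpt_dir.mkdir()
    for step in (1000, 20000, 3000):
        (ckpt_dir / f"step-{step}.pdparams").touch()
    latest = resolve_checkpoint(output)
    print(Path(latest).name)  # step-20000
```

Comparing step numbers as integers rather than as strings matters here: a plain string sort would rank step-3000 above step-20000.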

Train Transformer

The TransformerTTS model can be trained by running train_transformer.py.

python train_transformer.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml'

Or you can run the script file directly.

sh train_transformer.sh

If you want to train on multiple GPUs, you must start training in the following way.

export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train_transformer.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml'

If you wish to resume from an existing model, see Saving & Loading for details of checkpoint loading.

Note: For good training results, we recommend multi-GPU training to enlarge the batch size, with at least 16 samples per batch on each GPU.
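
As a quick check of what this recommendation implies, the sketch below computes the effective batch size. It assumes plain data-parallel training in which every GPU processes its own batch at each step; the function name is illustrative.

```python
def global_batch_size(per_gpu_batch_size, num_gpus):
    """Effective batch size under data-parallel training (one batch per GPU)."""
    return per_gpu_batch_size * num_gpus


# Four GPUs, as in the launch command above, at the recommended minimum of 16 per GPU.
print(global_batch_size(16, 4))  # 64
```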

For more help on arguments, run:

python train_transformer.py --help

Train Vocoder

The vocoder model can be trained by running train_vocoder.py.

python train_vocoder.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./vocoder' \
--config='configs/ljspeech.yaml'

Or you can run the script file directly.

sh train_vocoder.sh

If you want to train on multiple GPUs, you must start training in the following way.

export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train_vocoder.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./vocoder' \
--config='configs/ljspeech.yaml'

If you wish to resume from an existing model, see Saving & Loading for details of checkpoint loading.

For more help on arguments, run:

python train_vocoder.py --help

Synthesis

After training the TransformerTTS and vocoder models, audio can be synthesized by running synthesis.py.

python synthesis.py \
--max_len=300 \
--use_gpu=1 \
--output='./synthesis' \
--config='configs/ljspeech.yaml' \
--checkpoint_transformer='./checkpoint/transformer/step-120000' \
--checkpoint_vocoder='./checkpoint/vocoder/step-100000'

Or you can run the script file directly.

sh synthesis.sh

For more help on arguments, run:

python synthesis.py --help