
FastSpeech

PaddlePaddle dynamic graph implementation of FastSpeech, a feed-forward network based on the Transformer. The implementation follows FastSpeech: Fast, Robust and Controllable Text to Speech.

Dataset

We experiment with the LJSpeech dataset. Download and unzip LJSpeech.

wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
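
After extraction you should have an LJSpeech-1.1 directory containing a wavs/ folder and a metadata.csv transcript file. As an optional sanity check (plain Python; the path assumes the default extraction directory):

import csv
from pathlib import Path

root = Path("LJSpeech-1.1")                      # directory created by tar above
wav_files = sorted((root / "wavs").glob("*.wav"))
with open(root / "metadata.csv", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE))
# LJSpeech-1.1 ships 13100 clips, one transcript row per clip
print(len(wav_files), "wav files,", len(rows), "transcript rows")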

Model Architecture

FastSpeech model architecture

FastSpeech is a feed-forward structure based on the Transformer, instead of the encoder-attention-decoder architecture used by most sequence-to-sequence TTS models. It extracts attention alignments from an encoder-decoder based teacher model for phoneme duration prediction; a length regulator then uses these durations to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence, enabling parallel mel-spectrogram generation. We use TransformerTTS as the teacher model. The model consists of three parts: an encoder, a length regulator and a decoder.
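
To make the length regulator concrete, here is a minimal NumPy sketch of the expansion step (illustrative only, not the Parakeet implementation). The same duration scaling factor is what synthesis.py exposes as --alpha to control the speed of the generated speech:

import numpy as np

def length_regulate(phoneme_hidden, durations, alpha=1.0):
    # phoneme_hidden: (num_phonemes, hidden_size) encoder outputs
    # durations: (num_phonemes,) number of mel frames per phoneme
    # alpha < 1 speeds speech up, alpha > 1 slows it down
    frames = np.round(np.asarray(durations, dtype=float) * alpha).astype(int)
    # repeat each phoneme's hidden state by its (scaled) duration
    return np.repeat(phoneme_hidden, frames, axis=0)

hidden = np.random.randn(3, 4)                   # 3 phonemes, hidden size 4
expanded = length_regulate(hidden, [2, 3, 1])
print(expanded.shape)                            # (6, 4): matches the mel-spectrogram length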

Project Structure

├── alignments             # script to extract phoneme durations from a TransformerTTS checkpoint
├── configs                # yaml configuration files
├── data.py                # dataset loading and batching
├── synthesis.py           # script to synthesize waveform from text
├── synthesis.sh           # convenience wrapper around synthesis.py
├── train.py               # script for model training
└── train.sh               # convenience wrapper around train.py

Saving & Loading

train.py and synthesis.py have 3 arguments in common, --checkpoint, --iteration and --output.

  1. --output is the directory for saving results. During training, checkpoints are saved in ${output}/checkpoints and tensorboard logs are saved in ${output}/log. During synthesis, results are saved in ${output}/samples and tensorboard logs are saved in ${output}/log.

  2. --checkpoint is the path of a checkpoint and --iteration is the target step. They are used to load checkpoints in the following way (see the sketch after this list).

    • If --checkpoint is provided, the checkpoint specified by --checkpoint is loaded.

    • If --checkpoint is not provided, we try to load the checkpoint of the target step specified by --iteration from the ${output}/checkpoints/ directory, e.g. if --iteration 120000 is given, the checkpoint ${output}/checkpoints/step-120000.* will be loaded.

    • If neither --checkpoint nor --iteration is provided, we try to load the latest checkpoint from the ${output}/checkpoints/ directory.
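
The selection logic above can be summarized in a short sketch. This is illustrative only: the helper name is hypothetical, and the step-* layout simply follows the description above rather than the actual Parakeet code.

import glob
import os
import re

def resolve_checkpoint(checkpoint, iteration, output):
    """Hypothetical helper mirroring the three loading rules above."""
    if checkpoint:                                   # rule 1: an explicit path wins
        return checkpoint
    ckpt_dir = os.path.join(output, "checkpoints")
    if iteration is not None:                        # rule 2: load the requested step
        return os.path.join(ckpt_dir, "step-{}".format(iteration))
    steps = []                                       # rule 3: fall back to the latest step-*
    for path in glob.glob(os.path.join(ckpt_dir, "step-*")):
        match = re.search(r"step-(\d+)", os.path.basename(path))
        if match:
            steps.append(int(match.group(1)))
    return os.path.join(ckpt_dir, "step-{}".format(max(steps))) if steps else None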

Compute Phoneme Duration

A ground truth duration of each phoneme (number of frames in the spectrogram that correspond to that phoneme) should be provided when training a FastSpeech model.

We compute the ground truth duration of each phoneme in the following way: we extract the encoder-decoder attention alignment from a trained TransformerTTS model, assign each mel-spectrogram frame to the phoneme that receives the most attention in that frame, and count the frames assigned to each phoneme.
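
In code, this frame-assignment rule looks roughly like the following (a minimal NumPy sketch on dummy data; the actual extraction is performed by alignments/get_alignments.py on real model attentions):

import numpy as np

def durations_from_attention(attention):
    # attention: (num_frames, num_phonemes) encoder-decoder attention weights
    # each mel frame is assigned to the phoneme it attends to most
    best_phoneme = attention.argmax(axis=1)
    # the duration of a phoneme is the number of frames assigned to it
    return np.bincount(best_phoneme, minlength=attention.shape[1])

attention = np.random.rand(200, 50)              # 200 mel frames, 50 phonemes (dummy data)
durations = durations_from_attention(attention)
assert durations.sum() == 200                    # every frame is assigned to some phoneme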

You can run alignments/get_alignments.py to compute the durations.

cd alignments
python get_alignments.py \
--use_gpu=1 \
--output='./alignments' \
--data=${DATAPATH} \
--config=${CONFIG} \
--checkpoint_transformer=${CHECKPOINT}

where ${DATAPATH} is the path of the unzipped LJSpeech dataset, ${CHECKPOINT} is the path of a pre-trained TransformerTTS model, and ${CONFIG} is the yaml config file of that TransformerTTS checkpoint. You need to prepare a pre-trained TransformerTTS checkpoint in advance.

For more help on arguments:

python get_alignments.py --help

Alternatively, you can use your own phoneme durations; you just need to save them in the following format.

{'fname1': alignment1,
'fname2': alignment2,
...}
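
For example, such a mapping could be built and saved like this (the use of pickle and the file name here are assumptions for illustration; store the durations in whatever form train.py's --alignments_path expects):

import pickle

import numpy as np

# hypothetical example: one integer frame count per phoneme, keyed by utterance name
alignments = {
    'LJ001-0001': np.array([3, 5, 2, 7]),
    'LJ001-0002': np.array([4, 4, 6]),
}
with open('alignments.pkl', 'wb') as f:          # file name is illustrative
    pickle.dump(alignments, f)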

Train FastSpeech

The FastSpeech model can be trained by running train.py.

python train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml'

Or you can run the script file directly.

sh train.sh

If you want to train on multiple GPUs, start training in the following way.

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml'

If you wish to resume from an existing checkpoint, see Saving & Loading for details of checkpoint loading.

For more help on arguments:

python train.py --help

Synthesis

After training FastSpeech, audio can be synthesized by running synthesis.py.

python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint='./checkpoint/fastspeech/step-120000' \
--config='configs/ljspeech.yaml' \
--config_clarinet='../clarinet/configs/config.yaml' \
--checkpoint_clarinet='../clarinet/checkpoint/step-500000' \
--output='./synthesis'

We use ClariNet to synthesize the waveform, so you need to prepare a pre-trained ClariNet checkpoint.

Or you can run the script file directly.

sh synthesis.sh

For more help on arguments:

python synthesis.py --help