
FastSpeech

PaddlePaddle dynamic graph implementation of FastSpeech, a feed-forward network based on Transformer. The implementation is based on FastSpeech: Fast, Robust and Controllable Text to Speech.

Dataset

We experiment with the LJSpeech dataset. Download and unzip LJSpeech.

wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2

Model Architecture

FastSpeech model architecture

FastSpeech is a feed-forward structure based on Transformer, instead of the encoder-attention-decoder architecture used by most sequence-to-sequence TTS models. It extracts attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, and a length regulator uses the predicted durations to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence, enabling parallel mel-spectrogram generation. We use TransformerTTS as the teacher model. The model consists of three parts: an encoder, a decoder and a length regulator.
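
To make the length regulator concrete, below is a minimal NumPy sketch of the idea (not the code used in this repo): each phoneme's hidden state is repeated according to its duration, and a factor alpha scales the durations to control voice speed. The function name length_regulate and the array shapes are illustrative assumptions.

import numpy as np

def length_regulate(phoneme_hiddens, durations, alpha=1.0):
    """Illustrative length regulator (not the repo's implementation).

    phoneme_hiddens: float array, shape [num_phonemes, hidden_dim]
    durations:       int array, shape [num_phonemes], frames per phoneme
    alpha:           scales the durations; in this sketch a larger alpha means
                     more frames per phoneme, i.e. slower speech
    """
    scaled = np.round(np.asarray(durations) * alpha).astype(int)
    scaled = np.maximum(scaled, 0)
    # Repeat each phoneme's hidden state `scaled[i]` times along the time axis.
    return np.repeat(phoneme_hiddens, scaled, axis=0)

# Example: 3 phonemes with durations [2, 3, 1] expand to 6 frames.
hiddens = np.random.randn(3, 8).astype("float32")
print(length_regulate(hiddens, [2, 3, 1]).shape)  # (6, 8)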

Project Structure

├── configs                # yaml configuration files
├── synthesis.py           # script to synthesize waveform from text
├── train.py               # script for model training

Saving & Loading

train.py and synthesis.py have 3 arguments in common: --checkpoint, --iteration and --output.

  1. --output is the directory for saving results. During training, checkpoints are saved in ${output}/checkpoints and tensorboard logs are saved in ${output}/log. During synthesis, results are saved in ${output}/samples and tensorboard logs are saved in ${output}/log.

  2. --checkpoint is the path of a checkpoint and --iteration is the target step. They are used to load checkpoints in the following way (see the sketch after this list).

    • If --checkpoint is provided, the checkpoint specified by --checkpoint is loaded.

    • If --checkpoint is not provided, we try to load the checkpoint of the target step specified by --iteration from the ${output}/checkpoints/ directory, e.g. if given --iteration 120000, the checkpoint ${output}/checkpoints/step-120000.* will be loaded.

    • If neither --checkpoint nor --iteration is provided, we try to load the latest checkpoint from the ${output}/checkpoints/ directory.
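
The rules above can be summarized by a small helper. This is only an illustrative sketch, not the loader actually used by the scripts; the step-* file naming is inferred from the step-120000.* example above.

import glob
import os

def resolve_checkpoint(output_dir, checkpoint=None, iteration=None):
    """Illustrative resolution of the three loading rules listed above."""
    ckpt_dir = os.path.join(output_dir, "checkpoints")
    if checkpoint is not None:
        # Rule 1: an explicit --checkpoint path wins.
        return checkpoint
    if iteration is not None:
        # Rule 2: load the checkpoint of the requested step.
        return os.path.join(ckpt_dir, "step-{}".format(iteration))
    # Rule 3: otherwise fall back to the checkpoint with the largest step number.
    candidates = glob.glob(os.path.join(ckpt_dir, "step-*"))
    steps = sorted(set(int(os.path.basename(p).split("-")[1].split(".")[0])
                       for p in candidates))
    return os.path.join(ckpt_dir, "step-{}".format(steps[-1])) if steps else None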

Compute Phoneme Duration

The ground truth duration of each phoneme (the number of spectrogram frames that correspond to that phoneme) should be provided when training a FastSpeech model.

We compute the ground truth duration of each phoneme in the following way: we extract the encoder-decoder attention alignment from a trained TransformerTTS model, and assign each mel-spectrogram frame to the phoneme that receives the most attention in that frame.
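
Conceptually, this amounts to an argmax over the attention matrix followed by a count per phoneme. The snippet below only sketches that idea (the provided script handles it for you); the [num_frames, num_phonemes] shape convention is an assumption.

import numpy as np

def durations_from_attention(attention):
    """Turn an encoder-decoder attention matrix into per-phoneme durations.

    attention: float array, shape [num_frames, num_phonemes]
               (one attention distribution per mel frame).
    Returns an int array of shape [num_phonemes]: frames assigned to each phoneme.
    """
    # Assign each frame to the phoneme it attends to most strongly...
    frame_to_phoneme = attention.argmax(axis=-1)
    # ...and count how many frames each phoneme received.
    return np.bincount(frame_to_phoneme, minlength=attention.shape[-1])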

You can run alignments/get_alignments.py to generate the alignments.

cd alignments
python get_alignments.py \
--use_gpu=1 \
--output='./alignments' \
--data=${DATAPATH} \
--config=${CONFIG} \
--checkpoint_transformer=${CHECKPOINT}

where ${DATAPATH} is the path to the LJSpeech dataset, ${CHECKPOINT} is the path to a pretrained TransformerTTS model, and ${CONFIG} is the yaml config file used for that TransformerTTS checkpoint. You need to prepare a pretrained TransformerTTS checkpoint beforehand.

For more help on arguments

python get_alignments.py --help.

Alternatively, you can use your own phoneme durations; you just need to process the data into the following format (a toy example follows).

{'fname1': alignment1,
'fname2': alignment2,
...}
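
For instance, you might build and save such a dict as shown below. This is only a sketch: the filenames and durations are made up, and the assumption that the dict is serialized as a pickle file should be checked against how data.py reads --alignments_path.

import pickle
import numpy as np

# Hypothetical per-utterance durations: one integer per phoneme, summing to the
# number of mel frames of that utterance.
alignments = {
    "LJ001-0001": np.array([3, 7, 5, 12, 4]),
    "LJ001-0002": np.array([6, 2, 9, 8]),
}

# Assumption: the dict is stored with pickle; verify the expected file format
# in data.py before training with your own durations.
with open("my_alignments.pkl", "wb") as f:
    pickle.dump(alignments, f)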

Train FastSpeech

The FastSpeech model can be trained by running train.py.

python train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--output=${OUTPUTPATH} \
--config='configs/ljspeech.yaml'

Or you can run the script file directly.

sh train.sh

If you want to train on multiple GPUs, start training in the following way.

export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--output=${OUTPUTPATH} \
--config='configs/ljspeech.yaml'

If you wish to resume from an existing checkpoint, see Saving & Loading for details of checkpoint loading.

For more help on arguments

python train.py --help.

Synthesis

After training FastSpeech, audio can be synthesized by running synthesis.py.

python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint=${CHECKPOINTPATH} \
--config='configs/ljspeech.yaml' \
--output=${OUTPUTPATH} \
--vocoder='griffinlim'

We currently support two vocoders, griffinlim and waveflow. You can set --vocoder to choose one of them. If you want to use waveflow as your vocoder, you also need to set --config_vocoder and --checkpoint_vocoder, which are the paths of the vocoder's config file and checkpoint. You can download a pretrained waveflow model from here.

Or you can run the script file directly.

sh synthesis.sh

For more help on arguments

python synthesis.py --help.

Then you can find the synthesized audio files in ${OUTPUTPATH}/samples.
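
As a rough illustration of what the griffinlim vocoder option does conceptually, the snippet below reconstructs a waveform from a mel spectrogram with librosa's Griffin-Lim based mel inverter. It is not what synthesis.py does internally, and the FFT/hop parameters are common LJSpeech-style defaults rather than the values from configs/ljspeech.yaml.

import librosa

def griffin_lim_vocode(mel, sr=22050, n_fft=1024, hop_length=256, n_iter=60):
    """Reconstruct audio from a (power-scale) mel spectrogram via Griffin-Lim.

    mel: float array, shape [n_mels, num_frames].
    """
    # Invert the mel filterbank, then run Griffin-Lim phase reconstruction.
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length, n_iter=n_iter)

# Usage: wav = griffin_lim_vocode(mel); save it e.g. with
# soundfile.write("sample.wav", wav, 22050)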