Parakeet/examples/fastspeech
lifuchen f7ec215b9a add docstring for transformer_tts and fastspeech 2020-03-09 07:16:02 +00:00

FastSpeech

A PaddlePaddle dynamic graph implementation of FastSpeech, a feed-forward network based on Transformer. The implementation is based on the paper FastSpeech: Fast, Robust and Controllable Text to Speech.

Dataset

We experiment with the LJSpeech dataset. Download and unzip LJSpeech.

wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2

Model Architecture

FastSpeech model architecture

FastSpeech is a feed-forward network based on Transformer; it does not use the encoder-attention-decoder architecture. The model extracts attention alignments from an encoder-decoder teacher model for phoneme duration prediction. A length regulator uses the predicted durations to expand the source phoneme sequence to the length of the target mel-spectrogram sequence, so that the mel-spectrogram can be generated in parallel. We use TransformerTTS as the teacher model. The model has three parts: an encoder, a length regulator, and a decoder.
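
The duration extraction and length regulation described above can be sketched in plain Python. This is a minimal illustration based on the description in the FastSpeech paper, not the code used in this repository, which works on batched tensors. The function names `durations_from_alignment` and `length_regulate` are hypothetical, and `alpha` is assumed to scale durations in the same way as the `--alpha` option of synthesis.py.

```python
from typing import List, Sequence


def durations_from_alignment(attention: Sequence[Sequence[float]],
                             num_phonemes: int) -> List[int]:
    """Count, for each phoneme, the mel frames that attend to it most strongly.

    attention[t][i] is the teacher's attention weight from mel frame t
    to phoneme i (hypothetical layout chosen for this sketch).
    """
    durations = [0] * num_phonemes
    for frame_weights in attention:
        # The phoneme with the largest weight "owns" this mel frame.
        best = max(range(num_phonemes), key=lambda i: frame_weights[i])
        durations[best] += 1
    return durations


def length_regulate(phonemes: Sequence, durations: Sequence[int],
                    alpha: float = 1.0) -> list:
    """Repeat each phoneme-level item round(alpha * duration) times."""
    expanded = []
    for item, duration in zip(phonemes, durations):
        repeats = max(0, int(round(alpha * duration)))
        expanded.extend([item] * repeats)
    return expanded


# Toy alignment: 5 mel frames attending over 3 phonemes.
attention = [
    [0.9, 0.1, 0.0],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.0, 0.1, 0.9],
]
durations = durations_from_alignment(attention, num_phonemes=3)  # [2, 1, 2]
frames = length_regulate(["a", "b", "c"], durations)  # ['a', 'a', 'b', 'c', 'c']
slow_frames = length_regulate(["a", "b", "c"], durations, alpha=2.0)
```

In this sketch the durations sum to the number of mel frames, so the expanded sequence has the same length as the target mel-spectrogram when alpha is 1.0. A larger alpha lengthens the output (slower speech) and a smaller alpha shortens it.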

Project Structure

├── config                 # yaml configuration files
├── synthesis.py           # script to synthesize waveform from text
└── train.py               # script for model training

Train FastSpeech

The FastSpeech model can be trained with train.py.

python train.py \
--use_gpu=1 \
--use_data_parallel=0 \
--data_path=${DATAPATH} \
--transtts_path='../transformer_tts/checkpoint' \
--transformer_step=160000 \
--config_path='config/fastspeech.yaml'

Or you can run the script file directly.

sh train.sh

To train on multiple GPUs, set --use_data_parallel=1 and start training as follows:

export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py \
--use_gpu=1 \
--use_data_parallel=1 \
--data_path=${DATAPATH} \
--transtts_path='../transformer_tts/checkpoint' \
--transformer_step=160000 \
--config_path='config/fastspeech.yaml'

To resume training from an existing model, set --checkpoint_path and --fastspeech_step.

For more help on arguments: python train.py --help.

Synthesis

After training FastSpeech, audio can be synthesized with synthesis.py.

python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint_path='checkpoint/' \
--fastspeech_step=112000

Or you can run the script file directly.

sh synthesis.sh

For more help on arguments: python synthesis.py --help.