FastSpeech
PaddlePaddle dynamic graph implementation of FastSpeech, a feed-forward network based on the Transformer. The implementation is based on FastSpeech: Fast, Robust and Controllable Text to Speech.
Dataset
We experiment with the LJSpeech dataset. Download and unzip LJSpeech.
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
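As a quick sanity check after extraction (this snippet is not part of the project scripts), you can count the utterances listed in metadata.csv, which is pipe-separated:
# Illustrative sanity check of the extracted dataset.
# LJSpeech-1.1/metadata.csv is pipe-separated: file name | transcript | normalized transcript
with open("LJSpeech-1.1/metadata.csv", encoding="utf-8") as f:
    rows = [line.strip().split("|") for line in f if line.strip()]

print("number of utterances:", len(rows))   # LJSpeech-1.1 contains 13100 utterances
print("first entry:", rows[0][0], "->", rows[0][-1])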
Model Architecture
FastSpeech is a feed-forward structure based on the Transformer, instead of the encoder-attention-decoder architecture. The model extracts attention alignments from an encoder-decoder based teacher model for phoneme duration prediction; a length regulator then uses the predicted durations to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence, enabling parallel mel-spectrogram generation. We use TransformerTTS as the teacher model. The model consists of three parts: encoder, decoder and length regulator.
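As an illustrative sketch (not code from this project), the length regulator simply repeats each phoneme's encoder output according to its predicted duration; scaling the durations by a factor alpha controls the speed of the generated speech (the same alpha that appears in the Synthesis command below):
import numpy as np

def length_regulate(phoneme_hiddens, durations, alpha=1.0):
    # Illustrative sketch: expand phoneme-level hidden states to frame level.
    # phoneme_hiddens: [num_phonemes, hidden_dim]
    # durations: frames per phoneme; scaled by alpha for speed control
    scaled = np.maximum(1, np.round(np.asarray(durations) * alpha)).astype(int)
    # Repeat each phoneme hidden state `duration` times along the time axis.
    return np.repeat(phoneme_hiddens, scaled, axis=0)

# Example: 3 phonemes with durations 2, 3 and 1 expand to 6 frames.
h = np.random.randn(3, 8)
frames = length_regulate(h, [2, 3, 1])
print(frames.shape)  # (6, 8)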
Project Structure
├── config # yaml configuration files
├── synthesis.py # script to synthesize waveform from text
├── train.py # script for model training
Saving & Loading
train.py and synthesis.py have 3 arguments in common: --checkpoint, --iteration and --output.
- --output is the directory for saving results. During training, checkpoints are saved in checkpoints/ under output and the tensorboard log is saved in log/ under output. During synthesis, results are saved in samples/ under output and the tensorboard log is saved in log/ under output.
- --checkpoint and --iteration are for loading from an existing checkpoint. Loading an existing checkpoint follows these rules (see the sketch below): if --checkpoint is provided, the checkpoint specified by --checkpoint is loaded; if --checkpoint is not provided, we try to load the checkpoint specified by --iteration from the checkpoint directory; if --iteration is not provided either, we try to load the latest checkpoint from the checkpoint directory.
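The loading rule can be summarized roughly as follows (an illustrative sketch only; the step-<iteration> naming mirrors the checkpoint paths shown in the Synthesis section, and the actual loading logic lives in the project's scripts):
import os

def resolve_checkpoint(output_dir, checkpoint=None, iteration=None):
    # Illustrative sketch of the checkpoint selection rule described above.
    checkpoint_dir = os.path.join(output_dir, "checkpoints")
    if checkpoint is not None:
        # --checkpoint takes precedence: load exactly the given path.
        return checkpoint
    if iteration is not None:
        # Otherwise fall back to --iteration inside the checkpoint directory.
        return os.path.join(checkpoint_dir, "step-{}".format(iteration))
    # Neither is given: pick the latest "step-*" checkpoint, if any exists.
    steps = []
    for name in os.listdir(checkpoint_dir):
        base = os.path.splitext(name)[0]  # drop any file extension
        if base.startswith("step-"):
            steps.append(int(base.split("-")[-1]))
    return os.path.join(checkpoint_dir, "step-{}".format(max(steps))) if steps else None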
Compute Alignment
Before training the FastSpeech model, you need alignment information for phoneme duration prediction. We use the attention alignments obtained from a trained TransformerTTS model; you can run alignments/get_alignments.py to get them.
cd alignments
python get_alignments.py \
--use_gpu=1 \
--output='./alignments' \
--data=${DATAPATH} \
--config=${CONFIG} \
--checkpoint_transformer=${CHECKPOINT}
where ${DATAPATH} is the path of the saved LJSpeech data, ${CHECKPOINT} is the path of the pre-trained TransformerTTS model, and ${CONFIG} is the yaml config file of the TransformerTTS checkpoint. You need to prepare a pre-trained TransformerTTS checkpoint beforehand.
For more help on arguments:
python get_alignments.py --help
Or you can use your own alignment information; in that case, you should process the data into the following format:
{'fname1': alignment1,
'fname2': alignment2,
...}
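If you build the dictionary yourself, a minimal sketch of writing it out might look like the following (the pickle format and output file name here are assumptions; check what --alignments_path in train.py actually expects):
import pickle
import numpy as np

# Illustrative sketch only: each value stands for the alignment of one utterance
# (placeholder arrays here); the file name and pickle format are assumptions.
alignments = {
    "LJ001-0001": np.array([3, 5, 2, 7], dtype=np.int64),
    "LJ001-0002": np.array([4, 4, 6], dtype=np.int64),
}

with open("alignments/alignments.pkl", "wb") as f:
    pickle.dump(alignments, f)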
Train FastSpeech
The FastSpeech model can be trained with train.py.
python train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml'
Or you can run the script file directly.
sh train.sh
If you want to train on multiple GPUs, start training as follows:
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml'
If you wish to resume from an existing checkpoint, see Saving & Loading for details of checkpoint loading.
For more help on arguments:
python train.py --help
Synthesis
After training the FastSpeech model, audio can be synthesized with synthesis.py.
python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint='./checkpoint/fastspeech/step-120000' \
--config='configs/ljspeech.yaml' \
--config_clarinet='../clarinet/configs/config.yaml' \
--checkpoint_clarinet='../clarinet/checkpoint/step-500000' \
--output='./synthesis'
We use Clarinet to synthesize the waveform, so you need to prepare a pre-trained Clarinet checkpoint.
Or you can run the script file directly.
sh synthesis.sh
For more help on arguments:
python synthesis.py --help