# FastSpeech
PaddlePaddle dynamic graph implementation of FastSpeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Model Architecture
![FastSpeech model architecture](./images/model_architecture.png)
FastSpeech is a feed-forward network based on Transformer, instead of the encoder-attention-decoder based architecture. The model extracts attention alignments from an encoder-decoder based teacher model for phoneme duration prediction; a length regulator then uses these durations to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence, which enables parallel mel-spectrogram generation. We use TransformerTTS as the teacher model. The model consists of three parts: an encoder, a decoder and a length regulator.
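
As an illustration, below is a minimal NumPy sketch of the length regulation step (the function and variable names are ours, not this repo's code): each phoneme's hidden state is repeated according to its duration, and a factor `alpha` rescales the durations to control voice speed.

```python
import numpy as np

def length_regulate(phoneme_hiddens, durations, alpha=1.0):
    """Expand phoneme-level hidden states to frame level.

    phoneme_hiddens: (num_phonemes, hidden_dim) encoder outputs.
    durations: (num_phonemes,) number of mel frames per phoneme.
    alpha: speed control; alpha < 1.0 shortens durations (faster speech).
    """
    repeats = np.maximum(np.round(durations * alpha).astype(int), 0)
    # Repeat each phoneme's hidden state `repeats[i]` times along the time axis.
    return np.repeat(phoneme_hiddens, repeats, axis=0)

# Example: 3 phonemes with durations [2, 3, 1] expand to 6 mel frames.
hiddens = np.random.randn(3, 256)
frames = length_regulate(hiddens, np.array([2, 3, 1]))
print(frames.shape)  # (6, 256)
```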
## Project Structure
```text
├── config # yaml configuration files
├── synthesis.py # script to synthesize waveform from text
├── train.py # script for model training
```
## Saving & Loading
`train.py` and `synthesis.py` have 3 arguments in common: `--checkpoint`, `--iteration` and `--output`.
1. `--output` is the directory for saving results.
During training, checkpoints are saved in `${output}/checkpoints` and tensorboard logs are saved in `${output}/log`.
During synthesis, results are saved in `${output}/samples` and the tensorboard log is saved in `${output}/log`.
2. `--checkpoint` is the path of a checkpoint and `--iteration` is the target step. They are used to load checkpoints in the following way (a sketch of this lookup order follows the list).
- If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
- If `--checkpoint` is not provided, we try to load the checkpoint of the target step specified by `--iteration` from the `${output}/checkpoints/` directory, e.g. if `--iteration 120000` is given, the checkpoint `${output}/checkpoints/step-120000.*` will be loaded.
- If neither `--checkpoint` nor `--iteration` is provided, we try to load the latest checkpoint from the `${output}/checkpoints/` directory.
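
For reference, here is a minimal sketch of that lookup order, assuming checkpoints are saved under `${output}/checkpoints/` with names like `step-120000.*` as described above (the helper name and the use of `glob` are ours, not this repo's code):

```python
import glob
import os
import re

def resolve_checkpoint(output, checkpoint=None, iteration=None):
    """Return the checkpoint prefix to load, following the priority above."""
    if checkpoint is not None:            # 1. an explicit path always wins
        return checkpoint
    ckpt_dir = os.path.join(output, "checkpoints")
    if iteration is not None:             # 2. the requested step, if given
        return os.path.join(ckpt_dir, "step-{}".format(iteration))
    # 3. otherwise, the checkpoint with the highest step number, if any
    steps = [int(m.group(1))
             for p in glob.glob(os.path.join(ckpt_dir, "step-*"))
             for m in [re.search(r"step-(\d+)", p)] if m]
    if not steps:
        return None                       # nothing to resume from
    return os.path.join(ckpt_dir, "step-{}".format(max(steps)))
```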
## Compute Phoneme Duration
The ground truth duration of each phoneme (the number of spectrogram frames that correspond to that phoneme) must be provided when training a FastSpeech model.
We compute the ground truth durations in the following way: we extract the encoder-decoder attention alignments from a trained TransformerTTS model, and each frame is considered to correspond to the phoneme that receives the most attention in that frame.
You can run `alignments/get_alignments.py` to compute them.
```bash
cd alignments
python get_alignments.py \
--use_gpu=1 \
--output='./alignments' \
--data=${DATAPATH} \
--config=${CONFIG} \
--checkpoint_transformer=${CHECKPOINT}
```
where `${DATAPATH}` is the path to the saved LJSpeech data, `${CHECKPOINT}` is the path of the pre-trained TransformerTTS model, and `${CONFIG}` is the yaml config file of the TransformerTTS checkpoint. It is necessary for you to prepare a pre-trained TransformerTTS checkpoint first.
For more help on arguments, run ``python get_alignments.py --help``.
Alternatively, you can use your own phoneme durations; you just need to process the data into the following format.
```text
{'fname1': alignment1,
'fname2': alignment2,
...}
```
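
For reference, here is a minimal sketch of the argmax rule and of producing a dict in this format (the file names and the use of `pickle` are illustrative assumptions, not necessarily what `get_alignments.py` does):

```python
import pickle
import numpy as np

def durations_from_attention(attn):
    """attn: (num_frames, num_phonemes) teacher attention weights.

    Each frame is assigned to the phoneme it attends to most; a phoneme's
    duration is the number of frames assigned to it.
    """
    frame_to_phoneme = attn.argmax(axis=1)   # most-attended phoneme per frame
    return np.bincount(frame_to_phoneme, minlength=attn.shape[1])

# Build the {'fname': alignment} dict and save it (names are illustrative).
attn = np.random.rand(100, 20)               # stand-in for a real attention matrix
alignments = {"LJ001-0001": durations_from_attention(attn)}
with open("alignments.pkl", "wb") as f:
    pickle.dump(alignments, f)
```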
## Train FastSpeech
The FastSpeech model can be trained by running ``train.py``.
```bash
python train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml'
```
Or you can run the script file directly.
```bash
sh train.sh
```
If you want to train on multiple GPUs, start training in the following way.
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml'
```
If you wish to resume from an existing checkpoint, see [Saving & Loading](#saving--loading) for details of checkpoint loading.
For more help on arguments, run ``python train.py --help``.
## Synthesis
After training FastSpeech, audio can be synthesized by running ``synthesis.py``.
```bash
python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint='./checkpoint/fastspeech/step-120000' \
--config='configs/ljspeech.yaml' \
--config_clarinet='../clarinet/configs/config.yaml' \
--checkpoint_clarinet='../clarinet/checkpoint/step-500000' \
--output='./synthesis'
```
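
Here `--alpha` is the speed-control factor from the FastSpeech paper: the length regulator scales the predicted phoneme durations by `alpha` (see the sketch in the Model Architecture section), so values below 1.0 should produce faster speech and values above 1.0 slower speech.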
We use ClariNet to synthesize the waveform, so it is necessary for you to prepare a pre-trained [ClariNet checkpoint](https://paddlespeech.bj.bcebos.com/Parakeet/clarinet_ljspeech_ckpt_1.0.zip).
Or you can run the script file directly.
```bash
sh synthesis.sh
```
For more help on arguments, run ``python synthesis.py --help``.