# FastSpeech
Paddle fluid implementation of FastSpeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).

## Dataset

We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```

## Model Architecture

![FastSpeech model architecture](./images/model_architecture.png)

FastSpeech is a feed-forward network based on Transformer, rather than an encoder-attention-decoder architecture. It extracts attention alignments from an encoder-decoder teacher model to train a phoneme duration predictor. The predicted durations are used by a length
regulator to expand the source phoneme sequence to match the length of the target
mel-spectrogram sequence, so that the mel-spectrogram can be generated in parallel. We use TransformerTTS as the teacher model.
The model consists of three parts: an encoder, a length regulator, and a decoder.
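
The duration extraction and length regulation described above can be illustrated with a minimal sketch in plain Python. It is not the code used in this repository: the function names `durations_from_alignment` and `length_regulate`, the `(mel frames × phonemes)` layout of the attention map, and the toy values are all assumptions made for illustration.

```python
def durations_from_alignment(attention):
    """Count, for each phoneme, the mel frames that attend to it most strongly.

    `attention` is a list of rows, one per mel frame; each row holds one
    weight per phoneme (hypothetical layout of a teacher attention map).
    """
    num_phonemes = len(attention[0])
    durations = [0] * num_phonemes
    for frame_weights in attention:
        # Index of the phoneme with the largest weight for this frame.
        best = max(range(num_phonemes), key=frame_weights.__getitem__)
        durations[best] += 1
    return durations


def length_regulate(phoneme_hidden, durations):
    """Repeat each phoneme's hidden vector durations[i] times along time."""
    expanded = []
    for vector, repeat in zip(phoneme_hidden, durations):
        expanded.extend([vector] * repeat)
    return expanded


# Toy example: 3 phonemes aligned to 6 mel frames.
attention = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
    [0.0, 0.2, 0.8],
    [0.0, 0.1, 0.9],
]
phoneme_hidden = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]

durations = durations_from_alignment(attention)
expanded = length_regulate(phoneme_hidden, durations)
print(durations)       # one frame count per phoneme
print(len(expanded))   # equals the number of mel frames
```

Because the expanded sequence already has the target length, the decoder can process every frame at once instead of generating frames one at a time.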

## Project Structure
```text
├── config                 # yaml configuration files
├── synthesis.py           # script to synthesize waveform from text
├── train.py               # script for model training
```

## Train FastSpeech

The FastSpeech model can be trained with ``train.py``.
```bash
python train.py \
--use_gpu=1 \
--use_data_parallel=0 \
--data_path=${DATAPATH} \
--transtts_path='../transformer_tts/checkpoint' \
--transformer_step=160000 \
--config_path='config/fastspeech.yaml'
```
Alternatively, you can run the script file directly.
```bash
sh train.sh
```
If you want to train on multiple GPUs, you must set ``--use_data_parallel=1`` and then start training as follows:

```bash
# Export the variable so that the launched Python processes can see it.
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py \
--use_gpu=1 \
--use_data_parallel=1 \
--data_path=${DATAPATH} \
--transtts_path='../transformer_tts/checkpoint' \
--transformer_step=160000 \
--config_path='config/fastspeech.yaml'
```

If you wish to resume training from an existing model, set ``--checkpoint_path`` and ``--fastspeech_step``.

For more help on arguments: 
``python train.py --help``.

## Synthesis
After training FastSpeech, audio can be synthesized with ``synthesis.py``.
```bash
python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint_path='checkpoint/' \
--fastspeech_step=112000
```

Alternatively, you can run the script file directly.
```bash
sh synthesis.sh
```

For more help on arguments: 
``python synthesis.py --help``.
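
In the FastSpeech paper, a factor α scales the predicted phoneme durations to control speaking rate: values above 1 slow speech down and values below 1 speed it up. Assuming the ``--alpha`` argument of ``synthesis.py`` plays this role, the effect can be sketched as follows. The function name `scale_durations` and the rounding rule are assumptions for illustration, not this repository's code.

```python
def scale_durations(durations, alpha=1.0):
    """Scale per-phoneme frame counts by alpha and round to whole frames.

    The rounding scheme here is an assumption; an actual implementation
    may round differently.
    """
    return [max(0, int(round(d * alpha))) for d in durations]


durations = [4, 2, 6]                        # frames per phoneme (toy values)
normal = scale_durations(durations, 1.0)     # unchanged
slower = scale_durations(durations, 1.3)     # more frames, slower speech
faster = scale_durations(durations, 0.5)     # fewer frames, faster speech
print(sum(normal), sum(slower), sum(faster))
```

The total number of frames, and therefore the length of the synthesized audio, grows or shrinks roughly in proportion to the factor.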