# Deep Voice 3
PaddlePaddle dynamic graph implementation of Deep Voice 3, a convolutional-network-based text-to-speech generative model. The implementation is based on [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654).

We implement Deep Voice 3 using Paddle Fluid in dynamic graph mode, which is convenient for building flexible network architectures.
## Dataset
We experiment with the LJSpeech dataset. Download and extract [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
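
As a quick sanity check before preprocessing, the extracted data can be inspected with a few lines of Python. This is a minimal sketch that assumes the standard LJSpeech layout: a `metadata.csv` file whose records are pipe-separated (`id|transcription|normalized transcription`) and a `wavs/` folder holding `<id>.wav`. The function name `read_ljspeech_metadata` is hypothetical and is not part of this example's scripts.

```python
import csv
from pathlib import Path


def read_ljspeech_metadata(root):
    """Return (wav_path, normalized_text) pairs read from <root>/metadata.csv.

    Assumes the standard LJSpeech layout; this helper is illustrative only.
    """
    root = Path(root)
    records = []
    with open(root / "metadata.csv", encoding="utf-8", newline="") as f:
        # Fields are separated by "|"; quote characters are literal text.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row:  # skip blank lines
                continue
            utterance_id, normalized_text = row[0], row[-1]
            wav_path = root / "wavs" / f"{utterance_id}.wav"
            records.append((wav_path, normalized_text))
    return records
```

Calling `read_ljspeech_metadata("LJSpeech-1.1")` after extraction should return one pair per utterance.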
## Model Architecture
![Deep Voice 3 model architecture](./images/model_architecture.png)

The model consists of an encoder, a decoder and a converter (and a speaker embedding for multispeaker models). The encoder and the decoder together form the seq2seq part of the model, and the converter forms the postnet part.
## Project Structure
```text
├── config/
├── synthesize.py
├── data.py
├── preprocess.py
├── clip.py
├── train.py
└── vocoder.py
```
## Preprocess

Preprocess the dataset with `preprocess.py`.
```text
usage: preprocess.py [-h] --config CONFIG --input INPUT --output OUTPUT

preprocess ljspeech dataset and save it.

optional arguments:
  -h, --help       show this help message and exit
  --config CONFIG  config file
  --input INPUT    data path of the original data
  --output OUTPUT  path to save the preprocessed dataset
```
Example:
```bash
python preprocess.py --config=configs/ljspeech.yaml --input=LJSpeech-1.1/ --output=data/ljspeech
```
## Train
Train the model using `train.py`, following the usage displayed by `python train.py --help`.
```text
usage: train.py [-h] --config CONFIG --input INPUT

train a Deep Voice 3 model with LJSpeech

optional arguments:
  -h, --help       show this help message and exit
  --config CONFIG  config file
  --input INPUT    data path of the original data
```
Example:
```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config=configs/ljspeech.yaml --input=data/ljspeech
```
Training creates a `runs` folder. The outputs of each run are saved in a separate subfolder of `runs`, whose name is the start time joined with the hostname. Inside this folder, the TensorBoard log, parameters and optimizer states are saved. Parameters (`*.pdparams`) and optimizer states (`*.pdopt`) are named after the step at which they were saved.
```text
runs/Jul07_09-39-34_instance-mqcyj27y-4/
├── checkpoint
├── events.out.tfevents.1594085974.instance-mqcyj27y-4
├── step-1000000.pdopt
├── step-1000000.pdparams
├── step-100000.pdopt
├── step-100000.pdparams
...
```
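
Because checkpoints are named by step, a directory listing does not necessarily order them numerically (in the listing above, `step-1000000` appears before `step-100000`). The following sketch picks the most recent checkpoint by comparing step numbers as integers. The helper name `latest_checkpoint` is hypothetical; it only relies on the `step-<N>.pdparams` naming described above and returns the path without the extension.

```python
import re
from pathlib import Path

# Matches parameter files such as "step-100000.pdparams".
STEP_PATTERN = re.compile(r"^step-(\d+)\.pdparams$")


def latest_checkpoint(run_dir):
    """Return the checkpoint prefix (".../step-<N>") with the largest step.

    Returns None if the folder holds no `step-<N>.pdparams` file.
    """
    best_step, best_prefix = -1, None
    for path in Path(run_dir).iterdir():
        match = STEP_PATTERN.match(path.name)
        if match is None:
            continue
        step = int(match.group(1))  # compare numerically, not as strings
        if step > best_step:
            best_step = step
            best_prefix = path.with_suffix("")  # drop ".pdparams"
    return best_prefix
```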
Because WaveFlow is used to synthesize audio during training, download the trained WaveFlow model and extract it in the current directory before training.
```bash
wget https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_1.0.zip
unzip waveflow_res128_ljspeech_ckpt_1.0.zip
```
## Visualization
You can visualize training losses, check the attention and listen to the synthesized audio when training with teacher forcing.
Example:
```bash
tensorboard --logdir=runs/ --host=$HOSTNAME --port=8000
```
## Synthesis
```text
usage: synthesize from a checkpoint [-h] --config CONFIG --input INPUT
                                    --output OUTPUT --checkpoint CHECKPOINT
                                    --monotonic_layers MONOTONIC_LAYERS
                                    [--vocoder {griffin-lim,waveflow}]

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file
  --input INPUT         text file to synthesize
  --output OUTPUT       path to save audio
  --checkpoint CHECKPOINT
                        data path of the checkpoint
  --monotonic_layers MONOTONIC_LAYERS
                        monotonic decoder layers' indices(start from 1)
  --vocoder {griffin-lim,waveflow}
                        vocoder to use
```
`synthesize.py` is used to synthesize several sentences in a text file.

`--monotonic_layers` gives the indices of the decoder layers that manifest monotonic diagonal attention. Note that the indices start from 1. You can identify these layers by inspecting the TensorBoard logs. For a given model, the layers that manifest monotonic diagonal attention are stable across training and synthesis, but they differ between runs. So once you have found the indices of the monotonic layers in the TensorBoard logs, you can use them for synthesis. Only decoder layers that show strong diagonal attention should be considered.

`--vocoder` is the vocoder to use. Currently supported values are "waveflow" and "griffin-lim". The default value is "waveflow".
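
To make the required and optional flags concrete, here is a minimal `argparse` sketch that mirrors the options documented above. It is an illustration only: the function name `build_parser` is hypothetical, and the actual definitions in `synthesize.py` may differ.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (a sketch, not the real script)."""
    parser = argparse.ArgumentParser(description="synthesize from a checkpoint")
    parser.add_argument("--config", required=True, help="config file")
    parser.add_argument("--input", required=True, help="text file to synthesize")
    parser.add_argument("--output", required=True, help="path to save audio")
    parser.add_argument("--checkpoint", required=True,
                        help="data path of the checkpoint")
    parser.add_argument("--monotonic_layers", required=True,
                        help="monotonic decoder layers' indices (start from 1)")
    # Only the two documented vocoders are accepted; "waveflow" is the default.
    parser.add_argument("--vocoder", choices=["griffin-lim", "waveflow"],
                        default="waveflow", help="vocoder to use")
    return parser
```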
Example:
```bash
CUDA_VISIBLE_DEVICES=2 python synthesize.py \
    --config configs/ljspeech.yaml \
    --input sentences.txt \
    --output outputs/ \
    --checkpoint runs/Jul07_09-39-34_instance-mqcyj27y-4/step-1320000 \
    --monotonic_layers "5,6" \
    --vocoder waveflow
```
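
The value passed to `--monotonic_layers` (here `"5,6"`) uses indices that start from 1, while Python sequences are indexed from 0. The sketch below shows one way such a string could be converted into zero-based layer indices. The helper `parse_monotonic_layers` is hypothetical; how `synthesize.py` parses this value internally is not shown here.

```python
def parse_monotonic_layers(spec):
    """Convert a string such as "5,6" into sorted, zero-based layer indices."""
    indices = set()
    for item in spec.split(","):
        layer = int(item.strip())
        if layer < 1:
            raise ValueError(f"layer indices start from 1, got {layer}")
        indices.add(layer - 1)  # shift from one-based to zero-based
    return sorted(indices)
```

With this helper, `"5,6"` refers to the fifth and sixth decoder layers, which sit at positions 4 and 5 of a zero-based list.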