Merge pull request #73 from PaddlePaddle/develop
add README for transformer_tts, waveflow and wavenet
commit df627d6a2e
@@ -0,0 +1,77 @@
# Tacotron2

PaddlePaddle dynamic graph implementation of Tacotron2, a neural network architecture for speech synthesis directly from text. The implementation is based on [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884).

## Project Structure

```text
├── config.py      # default configuration file
├── ljspeech.py    # dataset and dataloader settings for LJSpeech
├── preprocess.py  # script to preprocess LJSpeech dataset
├── synthesis.py   # script to synthesize spectrogram from text
├── train.py       # script for tacotron2 model training
```

## Dataset

We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```

Then preprocess the data by running ``preprocess.py``; the preprocessed data will be placed in the ``--output`` directory.

```bash
python preprocess.py \
    --input=${DATAPATH} \
    --output=${PREPROCESSEDDATAPATH} \
    -v
```

For more help on arguments, run ``python preprocess.py --help``.

## Train the model

The Tacotron2 model can be trained by running ``train.py``.

```bash
python train.py \
    --data=${PREPROCESSEDDATAPATH} \
    --output=${OUTPUTPATH} \
    --device=gpu
```

If you want to train on CPU, just set ``--device=cpu``.
If you want to train on multiple GPUs, set ``--nprocs`` to the number of GPUs.
By default, training resumes from the latest checkpoint in ``--output``; if you want to start a new training run, use a new ``${OUTPUTPATH}`` with no checkpoint in it. If you want to resume from another existing model, set ``--checkpoint_path`` to the checkpoint path you want to load.

**Note: The checkpoint path cannot contain the file extension.**
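
A sketch of a command that resumes training from an existing checkpoint on 4 GPUs might look like the following; the flags mirror the ones described above, and ``${CHECKPOINTPATH}`` is assumed to point at a saved checkpoint with its file extension omitted.

```bash
# Hypothetical example: resume from an existing checkpoint and train on 4 GPUs.
# ${CHECKPOINTPATH} must omit the extension, e.g. ".../step-10000" rather than ".../step-10000.pdparams".
python train.py \
    --data=${PREPROCESSEDDATAPATH} \
    --output=${OUTPUTPATH} \
    --checkpoint_path=${CHECKPOINTPATH} \
    --device=gpu \
    --nprocs=4
```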

For more help on arguments, run ``python train.py --help``.

## Synthesis

After training Tacotron2, spectrograms can be synthesized by running ``synthesis.py``.

```bash
python synthesis.py \
    --config=${CONFIGPATH} \
    --checkpoint_path=${CHECKPOINTPATH} \
    --input=${TEXTPATH} \
    --output=${OUTPUTPATH} \
    --device=gpu
```

The ``${CONFIGPATH}`` must match ``${CHECKPOINTPATH}``.

For more help on arguments, run ``python synthesis.py --help``.

Then you can find the spectrogram files in ``${OUTPUTPATH}``; they can be used as input to a vocoder such as [waveflow](../waveflow/README.md#Synthesis) to get audio files.
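
For instance, a sketch of handing the synthesized spectrograms to the WaveFlow vocoder could look like this; the checkpoint name is only a placeholder for a trained WaveFlow model, and the script itself is described in the WaveFlow README.

```bash
# Hypothetical example: vocode the synthesized spectrograms with WaveFlow.
# 'step-2000000' is a placeholder checkpoint name (extension omitted).
python ../waveflow/synthesize.py \
    --input=${OUTPUTPATH} \
    --output=wavs/ \
    --checkpoint_path='step-2000000' \
    --device="gpu" \
    --verbose
```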

@@ -0,0 +1,48 @@
# TransformerTTS with LJSpeech

## Dataset

### Download the dataset.

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
```

### Extract the dataset.

```bash
tar xjvf LJSpeech-1.1.tar.bz2
```

### Preprocess the dataset.

Assume the path to save the preprocessed dataset is `ljspeech_transformer_tts`. Run the command below to preprocess the dataset.

```bash
python preprocess.py --input=LJSpeech-1.1/ --output=ljspeech_transformer_tts
```

## Train the model

The training script requires 4 command line arguments.
`--data` is the path of the training dataset, and `--output` is the path of the output directory (we recommend using a subdirectory of `runs` to manage different experiments).

`--device` should be "cpu" or "gpu", and `--nprocs` is the number of processes used to train the model in parallel.

```bash
python train.py --data=ljspeech_transformer_tts/ --output=runs/test --device="gpu" --nprocs=1
```

If you want distributed training, set a larger `--nprocs` (e.g. 4). Note that distributed training on CPU is not supported yet.
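
For example, a 4-GPU run only changes the `--nprocs` flag; this is a sketch assuming 4 GPUs are available on the machine.

```bash
# Hypothetical example: distributed training on 4 GPUs.
python train.py --data=ljspeech_transformer_tts/ --output=runs/test --device="gpu" --nprocs=4
```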

## Synthesize

Synthesize mel spectrograms from text. We assume `--input` is a text file with one sentence per line, and `--output` is a directory in which the synthesized mel spectrograms (log magnitude) are saved in `.npy` format. The mel spectrograms can be used with `WaveFlow` to generate waveforms.

`--checkpoint_path` should be the path of the parameter file (`.pdparams`) to load. Note that the extension `.pdparams` is not included here.

`--device` specifies the device to run synthesis on.
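
As a minimal sketch, an input file could be prepared like this; the sentences are placeholders, and the only assumption is one sentence per line.

```bash
# Create a hypothetical input file with one sentence per line.
cat > sentence.txt <<EOF
This is a test sentence for speech synthesis.
The quick brown fox jumps over the lazy dog.
EOF
```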

```bash
python synthesize.py --input=sentence.txt --output=mels/ --checkpoint_path='step-310000' --device="gpu" --verbose
```

@@ -0,0 +1,48 @@
# WaveFlow with LJSpeech

## Dataset

### Download the dataset.

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
```

### Extract the dataset.

```bash
tar xjvf LJSpeech-1.1.tar.bz2
```

### Preprocess the dataset.

Assume the path to save the preprocessed dataset is `ljspeech_waveflow`. Run the command below to preprocess the dataset.

```bash
python preprocess.py --input=LJSpeech-1.1/ --output=ljspeech_waveflow
```

## Train the model

The training script requires 4 command line arguments.
`--data` is the path of the training dataset, and `--output` is the path of the output directory (we recommend using a subdirectory of `runs` to manage different experiments).

`--device` should be "cpu" or "gpu", and `--nprocs` is the number of processes used to train the model in parallel.

```bash
python train.py --data=ljspeech_waveflow/ --output=runs/test --device="gpu" --nprocs=1
```

If you want distributed training, set a larger `--nprocs` (e.g. 4). Note that distributed training on CPU is not supported yet.

## Synthesize

Synthesize waveforms. We assume `--input` is a directory containing mel spectrograms (log magnitude) in `.npy` format. The output is saved in the `--output` directory as `.wav` files, each with the same name as the corresponding mel spectrogram.

`--checkpoint_path` should be the path of the parameter file (`.pdparams`) to load. Note that the extension `.pdparams` is not included here.

`--device` specifies the device to run synthesis on.

```bash
python synthesize.py --input=mels/ --output=wavs/ --checkpoint_path='step-2000000' --device="gpu" --verbose
```

@@ -0,0 +1,48 @@
# WaveNet with LJSpeech

## Dataset

### Download the dataset.

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
```

### Extract the dataset.

```bash
tar xjvf LJSpeech-1.1.tar.bz2
```

### Preprocess the dataset.

Assume the path to save the preprocessed dataset is `ljspeech_wavenet`. Run the command below to preprocess the dataset.

```bash
python preprocess.py --input=LJSpeech-1.1/ --output=ljspeech_wavenet
```

## Train the model

The training script requires 4 command line arguments.
`--data` is the path of the training dataset, and `--output` is the path of the output directory (we recommend using a subdirectory of `runs` to manage different experiments).

`--device` should be "cpu" or "gpu", and `--nprocs` is the number of processes used to train the model in parallel.

```bash
python train.py --data=ljspeech_wavenet/ --output=runs/test --device="gpu" --nprocs=1
```

If you want distributed training, set a larger `--nprocs` (e.g. 4). Note that distributed training on CPU is not supported yet.

## Synthesize

Synthesize waveforms. We assume `--input` is a directory containing mel spectrograms (normalized to the range [0, 1)) in `.npy` format. The output is saved in the `--output` directory as `.wav` files, each with the same name as the corresponding mel spectrogram.

`--checkpoint_path` should be the path of the parameter file (`.pdparams`) to load. Note that the extension `.pdparams` is not included here.

`--device` specifies the device to run synthesis on. Because WaveNet is autoregressive, synthesis on CPU may be faster.

```bash
python synthesize.py --input=mels/ --output=wavs/ --checkpoint_path='step-2450000' --device="cpu" --verbose
```