Merge branch 'refine_doc' into 'master'
Refine doc part2

See merge request !31
commit 612de1a25c
@@ -6,8 +6,8 @@ Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-spee
<img src="images/logo.png" width=450 /> <br>
</div>

In particular, it features the latest [WaveFlow] (https://arxiv.org/abs/1912.01219) model proposed by Baidu Research.

- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on a Nvidia V100 GPU without engineered inference kernels, which is faster than WaveGlow and serveral orders of magnitude faster than WaveNet.

In particular, it features the latest [WaveFlow](https://arxiv.org/abs/1912.01219) model proposed by Baidu Research.

- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M) and comparable to WaveNet (4.6M).
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.
@@ -42,10 +42,10 @@ optional arguments:
  --wavenet WAVENET  wavenet checkpoint to use.
```

1. `--config` is the configuration file to use. The provided configurations can be used directly. And you can change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
3. `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before trainig.
4. `--output` is the directory to save results, all result are saved in this directory. The structure of the output directory is shown below.
- `--config` is the configuration file to use. The provided configurations can be used directly, and you can change some values in the configuration file to train the model with a different config.
- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
- `--resume` is the path of the checkpoint. If it is provided, the model will load the checkpoint before training.
- `--output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.

```text
├── checkpoints      # checkpoint
@@ -53,8 +53,8 @@ optional arguments:
└── log              # tensorboard log
```

5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
6. `--wavenet` is the path of the wavenet checkpoint to load. If you do not specify `--resume`, then this must be provided.
- `--device` is the device (gpu id) to use for training. `-1` means CPU.
- `--wavenet` is the path of the wavenet checkpoint to load. If you do not specify `--resume`, then this must be provided.
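Putting these arguments together, a ClariNet training run might look like the sketch below; the config filename and the teacher WaveNet checkpoint path are placeholders rather than files shipped with this example.

```bash
# Hypothetical invocation; replace the config and checkpoint paths with your own.
python train.py \
    --config=./configs/clarinet_ljspeech.yaml \
    --data=./LJSpeech-1.1/ \
    --output=experiment \
    --device=0 \
    --wavenet=wavenet_checkpoint/step-1000000
```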
Before you start training a ClariNet model, you should have trained a WaveNet model with a single Gaussian output distribution. Make sure the config of the teacher model matches that of the trained model.
@@ -90,11 +90,11 @@ optional arguments:
  --data DATA        path of LJspeech dataset.
```

1. `--config` is the configuration file to use. You should use the same configuration with which you train you model.
2. `--data` is the path of the LJspeech dataset. A dataset is not needed for synthesis, but since the input is mel spectrogram, we need to get mel spectrogram from audio files.
3. `checkpoint` is the checkpoint to load.
4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
- `--data` is the path of the LJspeech dataset. A dataset is not needed for synthesis, but since the input is a mel spectrogram, we need to get mel spectrograms from audio files.
- `checkpoint` is the checkpoint to load.
- `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
- `--device` is the device (gpu id) to use. `-1` means CPU.
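In the spirit of the example scripts elsewhere in these READMEs, a synthesis call built from these arguments might look like the following sketch; the script name `synthesis.py`, the config filename, and the checkpoint path are assumptions for illustration.

```bash
# Hypothetical invocation; `checkpoint` and the output directory are the
# positional arguments described above.
python synthesis.py \
    --config=./configs/clarinet_ljspeech.yaml \
    --data=./LJSpeech-1.1/ \
    --device=0 \
    experiment/checkpoints/step_500000 \
    generated
```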

Example script:
@@ -52,10 +52,10 @@ optional arguments:
device to use
```

1. `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly. And you can change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
3. `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before trainig.
4. `--output` is the directory to save results, all results are saved in this directory. The structure of the output directory is shown below.
- `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly, and you can change some values in the configuration file to train the model with a different config.
- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
- `--resume` is the path of the checkpoint. If it is provided, the model will load the checkpoint before training.
- `--output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.

```text
├── checkpoints      # checkpoint
@@ -67,7 +67,7 @@ optional arguments:
└── waveform         # waveform (.wav files)
```

5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
- `--device` is the device (gpu id) to use for training. `-1` means CPU.
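Assuming the training entry point is a `train.py` script, as in the other examples in this repository, a run using these arguments could look like the sketch below; the data and output paths are placeholders.

```bash
# Hypothetical invocation; `ljspeech.yaml` is the provided config, the rest
# are placeholders for your local setup.
python train.py \
    --config=configs/ljspeech.yaml \
    --data=./LJSpeech-1.1/ \
    --output=experiment \
    --device=0
```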

Example script:
@@ -101,11 +101,11 @@ optional arguments:
device to use
```

1. `--config` is the configuration file to use. You should use the same configuration with which you train you model.
2. `checkpoint` is the checkpoint to load.
3. `text`is the text file to synthesize.
4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`) and attention plots (*.png) for each sentence.
5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
- `checkpoint` is the checkpoint to load.
- `text` is the text file to synthesize.
- `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`) and attention plots (`*.png`) for each sentence.
- `--device` is the device (gpu id) to use. `-1` means CPU.
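A sketch of a synthesis call built from these arguments follows; the script name, checkpoint path, and text file are illustrative assumptions, and the positional arguments come in the order `checkpoint`, `text`, `output_path` described above.

```bash
# Hypothetical invocation with placeholder paths.
python synthesis.py \
    --config=configs/ljspeech.yaml \
    --device=0 \
    experiment/checkpoints/step_100000 \
    sentences.txt \
    generated
```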

Example script:
@@ -1,5 +1,5 @@
# FastSpeech
Paddle fluid implementation of Fastspeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).
PaddlePaddle dynamic graph implementation of FastSpeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).

## Dataset
@@ -14,7 +14,7 @@ tar xjvf LJSpeech-1.1.tar.bz2

![FastSpeech model architecture](./images/model_architecture.png)

FastSpeech is a feed-forward structure based on Transformer, instead of using the encoder-attention-decoder based architecture. This model extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length
FastSpeech is a feed-forward structure based on Transformer, instead of using the encoder-attention-decoder based architecture. This model extracts attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length
regulator to expand the source phoneme sequence to match the length of the target
mel-spectrogram sequence for parallel mel-spectrogram generation. We use TransformerTTS as the teacher model.
The model consists of three parts: encoder, decoder and length regulator.
@@ -28,7 +28,7 @@ The model consists of encoder, decoder and length regulator three parts.

## Train Transformer

FastSpeech model can train with ``train.py``.
FastSpeech model can be trained with ``train.py``.
```bash
python train.py \
--use_gpu=1 \
@@ -38,11 +38,11 @@ python train.py \
--transformer_step=160000 \
--config_path='config/fastspeech.yaml' \
```
or you can run the script file directly.
Or you can run the script file directly.
```bash
sh train.sh
```
If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follow:
If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follows:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3
@@ -55,7 +55,7 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
--config_path='config/fastspeech.yaml' \
```

if you wish to resume from an exists model, please set ``--checkpoint_path`` and ``--fastspeech_step``
If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--fastspeech_step``.
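For example, resuming could look like appending the following flags to the ``train.py`` command above; the checkpoint directory and step number here are placeholders.

```bash
# Hypothetical resume flags; adjust the path and step to your checkpoint.
--checkpoint_path='./checkpoint' \
--fastspeech_step=112000 \
```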

For more help on arguments:
``python train.py --help``.
@@ -70,7 +70,7 @@ python synthesis.py \
--fastspeech_step=112000 \
```

or you can run the script file directly.
Or you can run the script file directly.
```bash
sh synthesis.sh
```
@@ -1,5 +1,5 @@
# TransformerTTS
Paddle fluid implementation of TransformerTTS, a neural TTS with Transformer. The implementation is based on [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895).
PaddlePaddle dynamic graph implementation of TransformerTTS, a neural TTS with Transformer. The implementation is based on [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895).

## Dataset
@@ -12,7 +12,7 @@ tar xjvf LJSpeech-1.1.tar.bz2
## Model Architecture

![TransformerTTS model architecture](./images/model_architecture.jpg)
The model adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in [Tacotron2](https://arxiv.org/abs/1712.05884). The model consists of two main parts, encoder and decoder. We also implemented CBHG model of tacotron as a vocoder part and converted the spectrogram into raw wave using griffin-lim algorithm.
The model adopts the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in [Tacotron2](https://arxiv.org/abs/1712.05884). The model consists of two main parts: encoder and decoder. We also implement the CBHG model of Tacotron as the vocoder part and convert the spectrogram into a raw waveform using the Griffin-Lim algorithm.

## Project Structure
```text
@@ -25,7 +25,7 @@ The model adapt the multi-head attention structures

## Train Transformer

TransformerTTS model can train with ``train_transformer.py``.
TransformerTTS model can be trained with ``train_transformer.py``.
```bash
python train_transformer.py \
--use_gpu=1 \
|
|||
--data_path=${DATAPATH} \
|
||||
--config_path='config/train_transformer.yaml' \
|
||||
```
|
||||
or you can run the script file directly.
|
||||
Or you can run the script file directly.
|
||||
```bash
|
||||
sh train_transformer.sh
|
||||
```
|
||||
If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follow:
|
||||
If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follows:
|
||||
|
||||
```bash
|
||||
CUDA_VISIBLE_DEVICES=0,1,2,3
|
||||
|
@ -48,13 +48,13 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
|
|||
--config_path='config/train_transformer.yaml' \
|
||||
```
|
||||
|
||||
if you wish to resume from an exists model, please set ``--checkpoint_path`` and ``--transformer_step``
|
||||
If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--transformer_step``.

For more help on arguments:
``python train_transformer.py --help``.

## Train Vocoder
Vocoder model can train with ``train_vocoder.py``.
Vocoder model can be trained with ``train_vocoder.py``.
```bash
python train_vocoder.py \
--use_gpu=1 \
@@ -62,11 +62,11 @@ python train_vocoder.py \
--data_path=${DATAPATH} \
--config_path='config/train_vocoder.yaml' \
```
or you can run the script file directly.
Or you can run the script file directly.
```bash
sh train_vocoder.sh
```
If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follow:
If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follows:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3
@@ -76,13 +76,13 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
--data_path=${DATAPATH} \
--config_path='config/train_vocoder.yaml' \
```
if you wish to resume from an exists model, please set ``--checkpoint_path`` and ``--vocoder_step``
If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--vocoder_step``.

For more help on arguments:
``python train_vocoder.py --help``.

## Synthesis
After training the transformerTTS and vocoder model, audio can be synthesized with ``synthesis.py``.
After training the TransformerTTS and vocoder model, audio can be synthesized with ``synthesis.py``.
```bash
python synthesis.py \
--max_len=50 \
@@ -94,7 +94,7 @@ python synthesis.py \
--config_path='config/synthesis.yaml' \
```

or you can run the script file directly.
Or you can run the script file directly.
```bash
sh synthesis.sh
```
@@ -1,6 +1,10 @@
# WaveFlow with Paddle Fluid
# WaveFlow

Paddle fluid implementation of [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219).
PaddlePaddle dynamic graph implementation of [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219).

- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M) and comparable to WaveNet (4.6M).
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.

## Project Structure
```text
@@ -72,7 +76,7 @@ Use `export CUDA_VISIBLE_DEVICES=0,1,2,3` to set the GPUs that you want to use t

### Monitor with TensorBoard

By default, the logs are saved in `./runs/waveflow/${ModelName}/logs/`. You can monitor logs by tensorboard.
By default, the logs are saved in `./runs/waveflow/${ModelName}/logs/`. You can monitor logs using TensorBoard.

```bash
tensorboard --logdir=${log_dir} --port=8888
@@ -112,7 +116,7 @@ python -u benchmark.py \

### Low-precision inference

This model supports the float16 low-precsion inference. By appending the argument
This model supports float16 low-precision inference. By appending the argument

```bash
--use_fp16=true
@@ -1,6 +1,6 @@
# Wavenet
# WaveNet

Paddle implementation of wavenet in dynamic graph, a convolutional network based vocoder. Wavenet is proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499), but in thie experiment, the implementation follows the teacher model in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](arxiv.org/abs/1807.07281).
PaddlePaddle dynamic graph implementation of WaveNet, a convolutional network based vocoder. WaveNet was originally proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499). However, in this experiment, the implementation follows the teacher model in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).


## Dataset
|
|||
|
||||
## Train
|
||||
|
||||
Train the model using train.py, follow the usage displayed by `python train.py --help`.
|
||||
Train the model using train.py. For help on usage, try `python train.py --help`.
|
||||
|
||||
```text
|
||||
usage: train.py [-h] [--data DATA] [--config CONFIG] [--output OUTPUT]
|
||||
[--device DEVICE] [--resume RESUME]
|
||||
|
||||
Train a wavenet model with LJSpeech.
|
||||
Train a WaveNet model with LJSpeech.
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
|
@@ -41,25 +41,25 @@ optional arguments:
  --resume RESUME    checkpoint to resume from.
```

1. `--config` is the configuration file to use. The provided configurations can be used directly. And you can change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
3. `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before trainig.
4. `--output` is the directory to save results, all result are saved in this directory. The structure of the output directory is shown below.
- `--config` is the configuration file to use. The provided configurations can be used directly, and you can change some values in the configuration file to train the model with a different config.
- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
- `--resume` is the path of the checkpoint. If it is provided, the model will load the checkpoint before training.
- `--output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.

```text
├── checkpoints      # checkpoint
└── log              # tensorboard log
```

5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
- `--device` is the device (gpu id) to use for training. `-1` means CPU.

example script:
Example script:

```bash
python train.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0
```

You can monitor training log via tensorboard, using the script below.
You can monitor the training log via TensorBoard, using the script below.

```bash
cd experiment/log
@@ -71,7 +71,7 @@ tensorboard --logdir=.

usage: synthesis.py [-h] [--data DATA] [--config CONFIG] [--device DEVICE]
                    checkpoint output

Synthesize valid data from LJspeech with a wavenet model.
Synthesize valid data from LJspeech with a WaveNet model.

positional arguments:
  checkpoint         checkpoint to load.
@@ -84,13 +84,13 @@ optional arguments:
  --device DEVICE    device to use.
```

1. `--config` is the configuration file to use. You should use the same configuration with which you train you model.
2. `--data` is the path of the LJspeech dataset. A dataset is not needed for synthesis, but since the input is mel spectrogram, we need to get mel spectrogram from audio files.
3. `checkpoint` is the checkpoint to load.
4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
- `--data` is the path of the LJspeech dataset. A dataset is not needed for synthesis, but since the input is a mel spectrogram, we need to get mel spectrograms from audio files.
- `checkpoint` is the checkpoint to load.
- `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
- `--device` is the device (gpu id) to use. `-1` means CPU.

example script:
Example script:

```bash
python synthesis.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ --device=0 experiment/checkpoints/step_500000 generated