diff --git a/README.md b/README.md
index b3fdb2c..aef1963 100644
--- a/README.md
+++ b/README.md
@@ -6,8 +6,8 @@ Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-spee
-In particular, it features the latest [WaveFlow] (https://arxiv.org/abs/1912.01219) model proposed by Baidu Research.
-- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on a Nvidia V100 GPU without engineered inference kernels, which is faster than WaveGlow and serveral orders of magnitude faster than WaveNet.
+In particular, it features the latest [WaveFlow](https://arxiv.org/abs/1912.01219) model proposed by Baidu Research.
+- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an NVIDIA V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
 - WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smalller than WaveGlow (87.9M) and comparable to WaveNet (4.6M).
 - WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.
diff --git a/examples/clarinet/README.md b/examples/clarinet/README.md
index 58bca99..459e2f5 100644
--- a/examples/clarinet/README.md
+++ b/examples/clarinet/README.md
@@ -42,10 +42,10 @@ optional arguments:
  --wavenet WAVENET wavenet checkpoint to use.
 ```
-1. `--config` is the configuration file to use. The provided configurations can be used directly. And you can change some values in the configuration file and train the model with a different config.
-2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
-3. `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before trainig.
-4. `--output` is the directory to save results, all result are saved in this directory. The structure of the output directory is shown below.
+- `--config` is the configuration file to use. The provided configurations can be used directly. You can also change some values in the configuration file and train the model with a different config.
+- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
+- `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before training.
+- `--output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.
 ```text
 ├── checkpoints # checkpoint
@@ -53,8 +53,8 @@ optional arguments:
 └── log # tensorboard log
 ```
-5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
-6. `--wavenet` is the path of the wavenet checkpoint to load. If you do not specify `--resume`, then this must be provided.
+- `--device` is the device (gpu id) to use for training. `-1` means CPU.
+- `--wavenet` is the path of the wavenet checkpoint to load. If you do not specify `--resume`, then this must be provided.
 Before you start training a ClariNet model, you should have trained a WaveNet model with single Gaussian output distribution. Make sure the config of the teacher model matches that of the trained model.
@@ -90,11 +90,11 @@ optional arguments:
  --data DATA path of LJspeech dataset.
 ```
-1. `--config` is the configuration file to use. You should use the same configuration with which you train you model.
-2. `--data` is the path of the LJspeech dataset. A dataset is not needed for synthesis, but since the input is mel spectrogram, we need to get mel spectrogram from audio files.
-3. `checkpoint` is the checkpoint to load.
-4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
-5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
+- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
+- `--data` is the path of the LJSpeech dataset. A dataset is not needed for synthesis, but since the input is a mel spectrogram, we need to compute mel spectrograms from audio files.
+- `checkpoint` is the checkpoint to load.
+- `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
+- `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
 Example script:
diff --git a/examples/deepvoice3/README.md b/examples/deepvoice3/README.md
index f5db3e9..fa7a5e4 100644
--- a/examples/deepvoice3/README.md
+++ b/examples/deepvoice3/README.md
@@ -52,10 +52,10 @@ optional arguments:
  device to use
 ```
-1. `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly. And you can change some values in the configuration file and train the model with a different config.
-2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
-3. `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before trainig.
-4. `--output` is the directory to save results, all results are saved in this directory. The structure of the output directory is shown below.
+- `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly. You can also change some values in the configuration file and train the model with a different config.
+- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
+- `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before training.
+- `--output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.
 ```text
 ├── checkpoints # checkpoint
@@ -67,7 +67,7 @@ optional arguments:
 └── waveform # waveform (.wav files)
 ```
-5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
+- `--device` is the device (gpu id) to use for training. `-1` means CPU.
 Example script:
@@ -101,11 +101,11 @@ optional arguments:
  device to use
 ```
-1. `--config` is the configuration file to use. You should use the same configuration with which you train you model.
-2. `checkpoint` is the checkpoint to load.
-3. `text`is the text file to synthesize.
-4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`) and attention plots (*.png) for each sentence.
-5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
+- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
+- `checkpoint` is the checkpoint to load.
+- `text` is the text file to synthesize.
+- `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`) and attention plots (`*.png`) for each sentence.
+- `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
 Example script:
diff --git a/examples/fastspeech/README.md b/examples/fastspeech/README.md
index 1199b8b..cc0a3ef 100644
--- a/examples/fastspeech/README.md
+++ b/examples/fastspeech/README.md
@@ -1,5 +1,5 @@
 # Fastspeech
-Paddle fluid implementation of Fastspeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).
+PaddlePaddle dynamic graph implementation of FastSpeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).
 ## Dataset
@@ -14,7 +14,7 @@ tar xjvf LJSpeech-1.1.tar.bz2
 ![FastSpeech model architecture](./images/model_architecture.png)
-FastSpeech is a feed-forward structure based on Transformer, instead of using the encoder-attention-decoder based architecture. This model extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length
+FastSpeech is a feed-forward structure based on Transformer, instead of using the encoder-attention-decoder based architecture. This model extracts attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length
 regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation. We use the TransformerTTS as teacher model.
 The model consists of encoder, decoder and length regulator three parts.
@@ -28,7 +28,7 @@ The model consists of encoder, decoder and length regulator three parts.
 ## Train Transformer
-FastSpeech model can train with ``train.py``.
+FastSpeech model can be trained with ``train.py``.
 ```bash
 python train.py \
 --use_gpu=1 \
@@ -38,11 +38,11 @@ python train.py \
 --transformer_step=160000 \
 --config_path='config/fastspeech.yaml' \
 ```
-or you can run the script file directly.
+Or you can run the script file directly.
 ```bash
 sh train.sh
 ```
-If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follow:
+If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follows:
 ```bash
 CUDA_VISIBLE_DEVICES=0,1,2,3
@@ -55,7 +55,7 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
 --config_path='config/fastspeech.yaml' \
 ```
-if you wish to resume from an exists model, please set ``--checkpoint_path`` and ``--fastspeech_step``
+If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--fastspeech_step``.
 For more help on arguments:
 ``python train.py --help``.
@@ -70,7 +70,7 @@ python synthesis.py \
 --fastspeech_step=112000 \
 ```
-or you can run the script file directly.
+Or you can run the script file directly.
 ```bash
 sh synthesis.sh
 ```
diff --git a/examples/transformer_tts/README.md b/examples/transformer_tts/README.md
index 6fda6d1..1f1922c 100644
--- a/examples/transformer_tts/README.md
+++ b/examples/transformer_tts/README.md
@@ -1,5 +1,5 @@
 # TransformerTTS
-Paddle fluid implementation of TransformerTTS, a neural TTS with Transformer. The implementation is based on [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895).
+PaddlePaddle dynamic graph implementation of TransformerTTS, a neural TTS with Transformer. The implementation is based on [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895).
 ## Dataset
@@ -12,7 +12,7 @@ tar xjvf LJSpeech-1.1.tar.bz2
 ## Model Architecture
 ![TransformerTTS model architecture](./images/model_architecture.jpg)
-The model adapt the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in [Tacotron2](https://arxiv.org/abs/1712.05884). The model consists of two main parts, encoder and decoder. We also implemented CBHG model of tacotron as a vocoder part and converted the spectrogram into raw wave using griffin-lim algorithm.
+The model adopts the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in [Tacotron2](https://arxiv.org/abs/1712.05884). The model consists of two main parts, encoder and decoder. We also implement the CBHG model of Tacotron as the vocoder part and convert the spectrogram into raw wave using the Griffin-Lim algorithm.
 ## Project Structure
 ```text
@@ -25,7 +25,7 @@ The model adapt the multi-head attention mechanism to replace the RNN structures
 ## Train Transformer
-TransformerTTS model can train with ``train_transformer.py``.
+TransformerTTS model can be trained with ``train_transformer.py``.
 ```bash
 python train_trasformer.py \
 --use_gpu=1 \
@@ -33,11 +33,11 @@ python train_trasformer.py \
 --data_path=${DATAPATH} \
 --config_path='config/train_transformer.yaml' \
 ```
-or you can run the script file directly.
+Or you can run the script file directly.
 ```bash
 sh train_transformer.sh
 ```
-If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follow:
+If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follows:
 ```bash
 CUDA_VISIBLE_DEVICES=0,1,2,3
@@ -48,13 +48,13 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
 --config_path='config/train_transformer.yaml' \
 ```
-if you wish to resume from an exists model, please set ``--checkpoint_path`` and ``--transformer_step``
+If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--transformer_step``.
 For more help on arguments:
 ``python train_transformer.py --help``.
 ## Train Vocoder
-Vocoder model can train with ``train_vocoder.py``.
+Vocoder model can be trained with ``train_vocoder.py``.
 ```bash
 python train_vocoder.py \
 --use_gpu=1 \
@@ -62,11 +62,11 @@ python train_vocoder.py \
 --data_path=${DATAPATH} \
 --config_path='config/train_vocoder.yaml' \
 ```
-or you can run the script file directly.
+Or you can run the script file directly.
 ```bash
 sh train_vocoder.sh
 ```
-If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follow:
+If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follows:
 ```bash
 CUDA_VISIBLE_DEVICES=0,1,2,3
@@ -76,13 +76,13 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
 --data_path=${DATAPATH} \
 --config_path='config/train_vocoder.yaml' \
 ```
-if you wish to resume from an exists model, please set ``--checkpoint_path`` and ``--vocoder_step``
+If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--vocoder_step``.
 For more help on arguments:
 ``python train_vocoder.py --help``.
 ## Synthesis
-After training the transformerTTS and vocoder model, audio can be synthesized with ``synthesis.py``.
+After training the TransformerTTS and vocoder model, audio can be synthesized with ``synthesis.py``.
 ```bash
 python synthesis.py \
 --max_len=50 \
@@ -94,7 +94,7 @@ python synthesis.py \
 --config_path='config/synthesis.yaml' \
 ```
-or you can run the script file directly.
+Or you can run the script file directly.
 ```bash
 sh synthesis.sh
 ```
diff --git a/examples/waveflow/README.md b/examples/waveflow/README.md
index 050bb17..e21039a 100644
--- a/examples/waveflow/README.md
+++ b/examples/waveflow/README.md
@@ -1,6 +1,10 @@
-# WaveFlow with Paddle Fluid
+# WaveFlow
-Paddle fluid implementation of [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219).
+PaddlePaddle dynamic graph implementation of [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219).
+
+- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an NVIDIA V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
+- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M) and comparable to WaveNet (4.6M).
+- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.
 ## Project Structure
 ```text
@@ -72,7 +76,7 @@ Use `export CUDA_VISIBLE_DEVICES=0,1,2,3` to set the GPUs that you want to use t
 ### Monitor with Tensorboard
-By default, the logs are saved in `./runs/waveflow/${ModelName}/logs/`. You can monitor logs by tensorboard.
+By default, the logs are saved in `./runs/waveflow/${ModelName}/logs/`. You can monitor logs using TensorBoard.
 ```bash
 tensorboard --logdir=${log_dir} --port=8888
@@ -112,7 +116,7 @@ python -u benchmark.py \
 ### Low-precision inference
-This model supports the float16 low-precsion inference. By appending the argument
+This model supports float16 low-precision inference. By appending the argument
 ```bash
 --use_fp16=true
diff --git a/examples/wavenet/README.md b/examples/wavenet/README.md
index caed5d9..5114182 100644
--- a/examples/wavenet/README.md
+++ b/examples/wavenet/README.md
@@ -1,6 +1,6 @@
-# Wavenet
+# WaveNet
-Paddle implementation of wavenet in dynamic graph, a convolutional network based vocoder. Wavenet is proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499), but in thie experiment, the implementation follows the teacher model in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](arxiv.org/abs/1807.07281).
+PaddlePaddle dynamic graph implementation of WaveNet, a convolutional network based vocoder. WaveNet was originally proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499). However, in this experiment, the implementation follows the teacher model in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).
 ## Dataset
@@ -24,13 +24,13 @@ tar xjvf LJSpeech-1.1.tar.bz2
 ## Train
-Train the model using train.py, follow the usage displayed by `python train.py --help`.
+Train the model using `train.py`. For help on usage, try `python train.py --help`.
 ```text
 usage: train.py [-h] [--data DATA] [--config CONFIG] [--output OUTPUT] [--device DEVICE] [--resume RESUME]
-Train a wavenet model with LJSpeech.
+Train a WaveNet model with LJSpeech.
 optional arguments:
  -h, --help show this help message and exit
@@ -41,25 +41,25 @@ optional arguments:
  --resume RESUME checkpoint to resume from.
 ```
-1. `--config` is the configuration file to use. The provided configurations can be used directly. And you can change some values in the configuration file and train the model with a different config.
-2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
-3. `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before trainig.
-4. `--output` is the directory to save results, all result are saved in this directory. The structure of the output directory is shown below.
+- `--config` is the configuration file to use. The provided configurations can be used directly. You can also change some values in the configuration file and train the model with a different config.
+- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
+- `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before training.
+- `--output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.
 ```text
 ├── checkpoints # checkpoint
 └── log # tensorboard log
 ```
-5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
+- `--device` is the device (gpu id) to use for training. `-1` means CPU.
-example script:
+Example script:
 ```bash
 python train.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0
 ```
-You can monitor training log via tensorboard, using the script below.
+You can monitor the training log via TensorBoard, using the script below.
 ```bash
 cd experiment/log
@@ -71,7 +71,7 @@ tensorboard --logdir=.
 usage: synthesis.py [-h] [--data DATA] [--config CONFIG] [--device DEVICE] checkpoint output
-Synthesize valid data from LJspeech with a wavenet model.
+Synthesize valid data from LJspeech with a WaveNet model.
 positional arguments:
  checkpoint checkpoint to load.
@@ -84,13 +84,13 @@ optional arguments:
  --device DEVICE device to use.
 ```
-1. `--config` is the configuration file to use. You should use the same configuration with which you train you model.
-2. `--data` is the path of the LJspeech dataset. A dataset is not needed for synthesis, but since the input is mel spectrogram, we need to get mel spectrogram from audio files.
-3. `checkpoint` is the checkpoint to load.
-4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
-5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
+- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
+- `--data` is the path of the LJSpeech dataset. A dataset is not needed for synthesis, but since the input is a mel spectrogram, we need to compute mel spectrograms from audio files.
+- `checkpoint` is the checkpoint to load.
+- `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`).
+- `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
-example script:
+Example script:
 ```bash
 python synthesis.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ --device=0 experiment/checkpoints/step_500000 generated