commit
4860d06dba
79
README.md
|
@ -1,5 +1,4 @@
|
|||
# Parakeet
|
||||
|
||||
Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It is built on PaddlePaddle Fluid dynamic graph and includes many influential TTS models proposed by [Baidu Research](http://research.baidu.com) and other research groups.
|
||||
|
||||
<div align="center">
|
||||
|
@ -13,20 +12,29 @@ In particular, it features the latest [WaveFlow](https://arxiv.org/abs/1912.0121
|
|||
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.
|
||||
|
||||
## Overview
|
||||
|
||||
To make it easier to use existing TTS models directly and to develop new ones, Parakeet selects typical models and provides reference implementations in PaddlePaddle. Furthermore, Parakeet abstracts the TTS pipeline and standardizes data preprocessing, sharing of common modules, model configuration, and the training and synthesis process. The supported models include vocoders and end-to-end TTS models:
|
||||
To make it easier to use existing TTS models directly and to develop new ones, Parakeet selects typical models and provides reference implementations in PaddlePaddle. Furthermore, Parakeet abstracts the TTS pipeline and standardizes data preprocessing, sharing of common modules, model configuration, and the training and synthesis process. The supported models include vocoders and end-to-end acoustic models:
|
||||
|
||||
- Vocoders
|
||||
- [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
|
||||
- [【Parallel WaveGAN】Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480)
|
||||
- [【WaveFlow】WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
|
||||
|
||||
- TTS models
|
||||
- [Neural Speech Synthesis with Transformer Network (Transformer TTS)](https://arxiv.org/abs/1809.08895)
|
||||
- [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](arxiv.org/abs/1712.05884)
|
||||
- Acoustic models
|
||||
- [【FastSpeech2】FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558)
|
||||
- [【SpeedySpeech】SpeedySpeech: Efficient Neural Speech Synthesis](https://arxiv.org/abs/2008.03802)
|
||||
- [【Transformer TTS】Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895)
|
||||
- [【Tacotron2】Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)
|
||||
|
||||
- Voice Conversion
|
||||
- [【GE2E】Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467)
|
||||
|
||||
## Updates
|
||||
|
||||
May-07-2021, Add an example for voice cloning in Chinese. Check [examples/tacotron2_aishell3](./examples/tacotron2_aishell3).
|
||||
|
||||
- Aug-31-2021, Add an example for Chinese Text Frontend. Check [examples/text_frontend](./examples/text_frontend)
|
||||
- Aug-23-2021, Add an example for FastSpeech2 with AISHELL-3. Check [fastspeech2/aishell3](./fastspeech2/aishell3)
|
||||
- Aug-3-2021, Add an example for FastSpeech2 with CSMSC. Check [fastspeech2/baker](./fastspeech2/baker)
|
||||
- Jul-19-2021, Add an example for SpeedySpeech with CSMSC. Check [speedyspeech/baker](./speedyspeech/baker)
|
||||
- Jul-01-2021, Add an example for Parallel WaveGAN with CSMSC. Check [parallelwave_gan/baker](./parallelwave_gan/baker)
|
||||
- Jul-01-2021, Add an example for usage of Montreal-Forced-Aligner. Check [examples/use_mfa](./examples/use_mfa).
|
||||
- May-07-2021, Add an example for voice cloning in Chinese. Check [examples/tacotron2_aishell3](./examples/tacotron2_aishell3).
|
||||
|
||||
## Setup
|
||||
Some dependencies of this repo are difficult to install on Windows, so we recommend that you **DO NOT** use Windows; please use `Linux` instead.
|
||||
|
@ -36,9 +44,7 @@ Make sure the library `libsndfile1` is installed, e.g., on Ubuntu.
|
|||
```bash
|
||||
sudo apt-get install libsndfile1
|
||||
```
|
||||
|
||||
### Install PaddlePaddle
|
||||
|
||||
See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. This repo requires PaddlePaddle **2.1.2** or above.
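As a quick sanity check after installation, the version requirement above can be verified from Python. This is a minimal sketch: `parse_version` and `meets_minimum` are hypothetical helpers written for illustration, not part of Parakeet or PaddlePaddle, and the only library attribute assumed is `paddle.__version__`.

```python
import re

MIN_PADDLE = (2, 1, 2)  # minimum version stated in this README


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # A missing patch component (e.g. "2.3") is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def meets_minimum(version_text, minimum=MIN_PADDLE):
    """Return True if version_text is at least the minimum version."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    try:
        import paddle
    except ImportError:
        print("PaddlePaddle is not installed")
    else:
        status = "OK" if meets_minimum(paddle.__version__) else "too old"
        print(f"PaddlePaddle {paddle.__version__}: {status}")
```

Only the numeric prefix is compared, so pre-release suffixes such as `rc0` are ignored in this sketch.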
|
||||
|
||||
### Install Parakeet
|
||||
|
@ -62,41 +68,50 @@ sudo apt install -y python3.6-dev
|
|||
See [install](https://paddle-parakeet.readthedocs.io/en/latest/install.html) for more details.
|
||||
|
||||
## Examples
|
||||
|
||||
Entry points to the introduction, training, and synthesis instructions for each example model:
|
||||
|
||||
- [>>> WaveFlow](./examples/waveflow)
|
||||
- [>>> Transformer TTS](./examples/transformer_tts)
|
||||
- [>>> Tacotron2](./examples/tacotron2)
|
||||
- [>>> Chinese Text Frontend](./examples/text_frontend)
|
||||
- [>>> FastSpeech2](./examples/fastspeech2)
|
||||
- [>>> Montreal-Forced-Aligner](./examples/use_mfa)
|
||||
- [>>> Parallel WaveGAN](./parallelwave_gan)
|
||||
- [>>> SpeedySpeech](./examples/speedyspeech)
|
||||
- [>>> Tacotron2_AISHELL3](./examples/tacotron2_aishell3)
|
||||
- [>>> GE2E](./examples/ge2e)
|
||||
|
||||
- [>>> WaveFlow](./examples/waveflow)
|
||||
- [>>> TransformerTTS](./examples/transformer_tts)
|
||||
- [>>> Tacotron2](./examples/tacotron2)
|
||||
|
||||
## Audio samples
|
||||
|
||||
### TTS models (Acoustic Model + Neural Vocoder)
|
||||
|
||||
Check our [website](https://paddle-parakeet.readthedocs.io/en/latest/demo.html) for audio samples.
|
||||
|
||||
|
||||
## Checkpoints
|
||||
### FastSpeech2
|
||||
1. [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip)
|
||||
2. [fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_aishell3_ckpt_0.4.zip)
|
||||
|
||||
### Parallel WaveGAN
|
||||
1. [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip)
|
||||
|
||||
### SpeedySpeech
|
||||
1. [speedyspeech_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/speedyspeech_baker_ckpt_0.4.zip)
|
||||
|
||||
### Tacotron2_AISHELL3
|
||||
1. [tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_aishell3_ckpt_0.3.zip)
|
||||
|
||||
### GE2E
|
||||
1. [ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip)
|
||||
|
||||
### WaveFlow
|
||||
1. [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip)
|
||||
|
||||
### TransformerTTS
|
||||
1. [transformer_tts_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/transformer_tts_ljspeech_ckpt_0.3.zip)
|
||||
|
||||
### Tacotron2
|
||||
1. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3.zip)
|
||||
2. [tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3_alternative.zip)
|
||||
|
||||
### Tacotron2_AISHELL3
|
||||
1. [tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_aishell3_ckpt_0.3.zip)
|
||||
|
||||
### TransformerTTS
|
||||
1. [transformer_tts_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/transformer_tts_ljspeech_ckpt_0.3.zip)
|
||||
|
||||
### WaveFlow
|
||||
1. [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip)
|
||||
|
||||
### GE2E
|
||||
1. [ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip)
|
||||
|
||||
## Copyright and License
|
||||
|
||||
Parakeet is provided under the [Apache-2.0 license](LICENSE).
|
||||
|
|
|
@ -1,13 +1,13 @@
|
|||
|
||||
# FastSpeech2 with AISHELL-3
|
||||
This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3).
|
||||
|
||||
## Introduction
|
||||
[AISHELL-3](http://www.aishelltech.com/aishell_3) is a large-scale and high-fidelity multi-speaker Mandarin speech corpus which could be used to train multi-speaker Text-to-Speech (TTS) systems.
|
||||
AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus which could be used to train multi-speaker Text-to-Speech (TTS) systems.
|
||||
We use AISHELL-3 to train a multi-speaker fastspeech2 model here.
|
||||
|
||||
## Dataset
|
||||
|
||||
### Download and Extract the dataset.
|
||||
### Download and Extract the dataset
|
||||
Download AISHELL-3.
|
||||
```bash
|
||||
wget https://www.openslr.org/resources/93/data_aishell3.tgz
|
||||
|
@ -18,13 +18,11 @@ mkdir data_aishell3
|
|||
tar zxvf data_aishell3.tgz -C data_aishell3
|
||||
```
|
||||
|
||||
### Get MFA result of BZNSYP and Extract it.
|
||||
|
||||
### Get MFA result of BZNSYP and Extract it
|
||||
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
|
||||
You can download the alignments from [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) in our repo (which currently uses MFA1.x).
|
||||
|
||||
### Preprocess the dataset.
|
||||
|
||||
### Preprocess the dataset
|
||||
Assume the path to the dataset is `~/datasets/data_aishell3`.
|
||||
Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.
|
||||
Run the command below to preprocess the dataset.
|
||||
|
@ -32,11 +30,13 @@ Run the command below to preprocess the dataset.
|
|||
```bash
|
||||
./preprocess.sh
|
||||
```
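Before running `./preprocess.sh`, it can help to confirm that both assumed directories exist. The following is a small sketch, not part of the repository: `missing_inputs` is a hypothetical helper, and the two paths are the ones assumed in this section.

```python
from pathlib import Path


def missing_inputs(dataset_dir, alignment_dir):
    """Return the assumed input directories that do not exist."""
    candidates = [Path(dataset_dir).expanduser(), Path(alignment_dir).expanduser()]
    return [path for path in candidates if not path.is_dir()]


if __name__ == "__main__":
    # Paths taken from the assumptions stated in this section.
    missing = missing_inputs("~/datasets/data_aishell3", "./aishell3_alignment_tone")
    for path in missing:
        print(f"missing directory: {path}")
    if not missing:
        print("both input directories found")
```

If your data lives elsewhere, pass your own paths and adjust the preprocessing script accordingly.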
|
||||
|
||||
## Train the model
|
||||
```bash
|
||||
./run.sh
|
||||
```
|
||||
To train fastspeech2 on CPU, add the `--device=cpu` argument to the `python3 train.py` command in `run.sh`.
|
||||
|
||||
## Synthesize
|
||||
We use [parallel wavegan](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/parallelwave_gan/baker) as the neural vocoder.
|
||||
Download the pretrained parallel wavegan model (trained with baker) from [fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_aishell3_ckpt_0.4.zip) and unzip it.
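The download-and-unzip step can also be scripted with the Python standard library. This is a sketch under stated assumptions: `fetch` and `unzip` are hypothetical helper names, and `CKPT_URL` is simply the archive linked above.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL of the archive linked above.
CKPT_URL = "https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_aishell3_ckpt_0.4.zip"


def fetch(url, dest_dir="."):
    """Download url into dest_dir unless the file is already there; return its path."""
    target = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def unzip(archive, dest_dir="."):
    """Extract archive into dest_dir and return the sorted member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
        return sorted(zf.namelist())
```

Calling `unzip(fetch(CKPT_URL))` would download the archive into the current directory and extract it there; `wget` followed by `unzip` on the command line achieves the same result.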
|
||||
|
@ -75,5 +75,6 @@ python3 synthesize_e2e.py \
|
|||
--speaker-dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt
|
||||
|
||||
```
|
||||
|
||||
## Future work
|
||||
A multi-speaker vocoder is needed.
|
||||
|
|
|
@ -1,16 +1,16 @@
|
|||
# FastSpeech2 with BZNSYP
|
||||
# FastSpeech2 with the Baker dataset
|
||||
This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
|
||||
|
||||
## Dataset
|
||||
|
||||
### Download and Extract the dataset.
|
||||
Download BZNSYP from its [Official Website](https://test.data-baker.com/data/index/source).
|
||||
### Get MFA result of BZNSYP and Extract it.
|
||||
### Download and Extract the dataset
|
||||
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/source).
|
||||
|
||||
### Get MFA result of CSMSC and Extract it
|
||||
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
|
||||
You can download the alignments from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) in our repo.
|
||||
|
||||
### Preprocess the dataset.
|
||||
|
||||
### Preprocess the dataset
|
||||
Assume the path to the dataset is `~/datasets/BZNSYP`.
|
||||
Assume the path to the MFA result of BZNSYP is `./baker_alignment_tone`.
|
||||
Run the command below to preprocess the dataset.
|
||||
|
@ -18,11 +18,13 @@ Run the command below to preprocess the dataset.
|
|||
```bash
|
||||
./preprocess.sh
|
||||
```
|
||||
|
||||
## Train the model
|
||||
```bash
|
||||
./run.sh
|
||||
```
|
||||
To train fastspeech2 on CPU, add the `--device=cpu` argument to the `python3 train.py` command in `run.sh`.
|
||||
|
||||
## Synthesize
|
||||
We use [parallel wavegan](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/parallelwave_gan/baker) as the neural vocoder.
|
||||
Download the pretrained parallel wavegan model from [parallel_wavegan_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/parallel_wavegan_baker_ckpt_0.4.zip) and unzip it.
|
||||
|
|