# FastSpeech2 with the Baker dataset
This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html) (CSMSC).
## Dataset
### Download and Extract the dataset
Download CSMSC from its [official website](https://test.data-baker.com/data/index/source).
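As a sketch, assuming the downloaded archive is named `BZNSYP.tar.bz2` (the actual file name and format depend on what the website serves after registration; adjust accordingly), extraction could look like:

```bash
# Create the dataset directory assumed later in this README
mkdir -p ~/datasets
# Extract the corpus so it lands at ~/datasets/BZNSYP
# (archive name and format are assumptions; adjust to the file you downloaded)
tar xjf BZNSYP.tar.bz2 -C ~/datasets
```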
### Get the MFA result of CSMSC and extract it
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phoneme durations for FastSpeech2.
You can download it from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) in our repo.
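After downloading, extract the archive next to this example, assuming the tarball unpacks into a `baker_alignment_tone` directory (matching the `./baker_alignment_tone` path used in the preprocessing step below):

```bash
tar xzf baker_alignment_tone.tar.gz
```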
### Preprocess the dataset
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of BZNSYP is `./baker_alignment_tone`.
Run the command below to preprocess the dataset.
```bash
./preprocess.sh
```
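Preprocessing typically writes per-split `metadata.jsonl` files under a dump directory; the exact layout is defined by `preprocess.sh`, so the path below is an assumption. A quick sanity check on one record:

```bash
# Pretty-print the first record of the training metadata
# (the dump path is an assumption; adjust to your actual layout)
head -n 1 dump/train/metadata.jsonl | python3 -m json.tool
```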
## Train the model
```bash
./run.sh
```
If you want to train FastSpeech2 on CPU, add the `--device=cpu` argument to the `python3 train.py` call in `run.sh`.
## Synthesize
We use [Parallel WaveGAN](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/parallelwave_gan/baker) as the neural vocoder.
Download the pretrained Parallel WaveGAN model from [parallel_wavegan_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/parallel_wavegan_baker_ckpt_0.4.zip) and unzip it.
```bash
unzip parallel_wavegan_baker_ckpt_0.4.zip
```
`synthesize.sh` synthesizes waveforms from `metadata.jsonl`.
`synthesize_e2e.sh` synthesizes waveforms from a text file, one sentence per line.
```bash
./synthesize.sh
```
or
```bash
./synthesize_e2e.sh
```
See the bash files for more details on the input parameters.
## Pretrained Model
A pretrained model trained on audios with no silence at the edges can be downloaded here: [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip).
Then you can use the following script to synthesize `../sentences.txt` using the pretrained FastSpeech2 model.
```bash
python3 synthesize_e2e.py \
--fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
--fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
--fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
--pwg-config=parallel_wavegan_baker_ckpt_0.4/pwg_default.yaml \
--pwg-params=parallel_wavegan_baker_ckpt_0.4/pwg_generator.pdparams \
--pwg-stat=parallel_wavegan_baker_ckpt_0.4/pwg_stats.npy \
--text=../sentences.txt \
--output-dir=exp/default/test_e2e \
--device="gpu" \
--phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
```