# FastSpeech2 with the Baker dataset
This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html) (CSMSC).

## Dataset
### Download and Extract the dataset
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/source).

### Get MFA result of CSMSC and Extract it
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) (Montreal Forced Aligner) to get the phoneme durations for FastSpeech2.
You can download our alignment result, [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by following the [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) in our repo.

### Preprocess the dataset
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of BZNSYP is `./baker_alignment_tone`.
Run the command below to preprocess the dataset.
```bash
./preprocess.sh
```
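Before running the script, it can help to confirm that both assumed directories exist. The snippet below is a minimal sketch: `check_inputs` is a hypothetical helper, not part of this repo, and it only tests that the directories are present, not that their contents are complete.

```bash
# Hypothetical helper (not part of this repo): verify that the dataset and
# MFA alignment directories assumed above exist before preprocessing.
# Returns 0 if both exist, 1 otherwise, and reports each missing one on stderr.
check_inputs() {
    local dataset_dir="$1"
    local alignment_dir="$2"
    local dir
    local status=0
    for dir in "$dataset_dir" "$alignment_dir"; do
        if [ ! -d "$dir" ]; then
            echo "missing directory: $dir" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_inputs ~/datasets/BZNSYP ./baker_alignment_tone && ./preprocess.sh` would start preprocessing only if both directories are found.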
## Train the model
```bash
./run.sh
```
To train FastSpeech2 on CPU, add the `--device=cpu` argument to the `python3 train.py` command in `run.sh`.
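
If the same `run.sh` is used on machines with and without a GPU, the device could be chosen at run time. The following is a rough sketch under stated assumptions: `pick_device` is a hypothetical helper, not part of this repo; it treats a working `nvidia-smi` as a sign that a GPU is usable, which does not guarantee that a GPU build of PaddlePaddle is installed; and passing `gpu` to `train.py` is assumed by analogy with the `--device="gpu"` flag used for `synthesize_e2e.py` in this README.

```bash
# Hypothetical helper (not part of this repo): print "gpu" if nvidia-smi is
# installed and exits successfully, otherwise print "cpu".
pick_device() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
        echo "gpu"
    else
        echo "cpu"
    fi
}

# Example: print the flag that would be added to the train.py command.
echo "--device=$(pick_device)"
```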
## Synthesize
We use [Parallel WaveGAN](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/parallelwave_gan/baker) as the neural vocoder.
Download the pretrained Parallel WaveGAN model from [parallel_wavegan_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/parallel_wavegan_baker_ckpt_0.4.zip) and unzip it.
```bash
unzip parallel_wavegan_baker_ckpt_0.4.zip
```
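The archive has to be downloaded before it can be unzipped. As a minimal sketch, assuming `wget` is installed, it could be fetched with a small wrapper that skips the download when the file is already present; `download_if_missing` is a hypothetical helper, not part of this repo.

```bash
# Hypothetical helper (not part of this repo): download a file with wget
# unless a file with the same name already exists in the current directory.
download_if_missing() {
    local url="$1"
    # Strip everything up to the last "/" to get the file name.
    local filename="${url##*/}"
    if [ -f "$filename" ]; then
        echo "skip: $filename already exists"
    else
        wget "$url"
    fi
}

# Usage (URL taken from the link above):
# download_if_missing https://paddlespeech.bj.bcebos.com/Parakeet/parallel_wavegan_baker_ckpt_0.4.zip
```

The same helper would work for the other archives linked in this README, since it derives the file name from the URL.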
`synthesize.sh` synthesizes waveforms from `metadata.jsonl`.
`synthesize_e2e.sh` synthesizes waveforms from a list of texts.
```bash
./synthesize.sh
```
or
```bash
./synthesize_e2e.sh
```
See the bash scripts for more details on the input parameters.

## Pretrained Model
A pretrained model trained on audio with the silence at both ends removed can be downloaded here: [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip).

After unzipping it, you can use the following command to synthesize the sentences in `../sentences.txt` with the pretrained FastSpeech2 model.
```bash
python3 synthesize_e2e.py \
--fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
--fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
--fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
--pwg-config=parallel_wavegan_baker_ckpt_0.4/pwg_default.yaml \
--pwg-params=parallel_wavegan_baker_ckpt_0.4/pwg_generator.pdparams \
--pwg-stat=parallel_wavegan_baker_ckpt_0.4/pwg_stats.npy \
--text=../sentences.txt \
--output-dir=exp/default/test_e2e \
--device="gpu" \
--phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
```
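The command above expects several files from the two unzipped archives. As a minimal sketch, the hypothetical helper below (not part of this repo) prints any of those files that are missing. The file names are copied from the command; the base directory argument is an assumption about where the archives were unzipped.

```bash
# Hypothetical helper (not part of this repo): print every file referenced by
# the synthesize_e2e.py command above that is missing under the given base
# directory (defaults to the current directory).
list_missing_files() {
    local base_dir="${1:-.}"
    local rel
    for rel in \
        fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
        fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
        fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
        fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt \
        parallel_wavegan_baker_ckpt_0.4/pwg_default.yaml \
        parallel_wavegan_baker_ckpt_0.4/pwg_generator.pdparams \
        parallel_wavegan_baker_ckpt_0.4/pwg_stats.npy
    do
        [ -f "$base_dir/$rel" ] || echo "$rel"
    done
}
```

Running `list_missing_files .` from the directory that contains both checkpoint folders prints nothing when every file is in place.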