ParakeetRebeccaRosario/examples/fastspeech2/aishell3
TianYuan 065fa32a37 fix log format of fastspeech2 speedyspeech and pwg 2021-09-03 11:21:52 +00:00
conf add aishell3 example 2021-08-25 09:37:16 +00:00
README.md fix readme of aishell3 2021-08-30 03:48:11 +00:00
batch_fn.py add aishell3 example 2021-08-25 09:37:16 +00:00
config.py fix log format of fastspeech2 speedyspeech and pwg 2021-09-03 11:21:52 +00:00
fastspeech2_updater.py fix log format of fastspeech2 speedyspeech and pwg 2021-09-03 11:21:52 +00:00
preprocess.sh add aishell3 example 2021-08-25 09:37:16 +00:00
synthesize.py fix log format of fastspeech2 speedyspeech and pwg 2021-09-03 11:21:52 +00:00
synthesize.sh fix readme of aishell3 2021-08-30 03:48:11 +00:00
synthesize_e2e.py fix log format of fastspeech2 speedyspeech and pwg 2021-09-03 11:21:52 +00:00
synthesize_e2e.sh fix readme of aishell3 2021-08-30 03:48:11 +00:00
train.py fix log format of fastspeech2 speedyspeech and pwg 2021-09-03 11:21:52 +00:00

README.md

FastSpeech2 with AISHELL-3

Introduction

AISHELL-3 is a large-scale, high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems. Here we use AISHELL-3 to train a multi-speaker FastSpeech2 model.

Dataset

Download and extract the dataset.

Download AISHELL-3.

wget https://www.openslr.org/resources/93/data_aishell3.tgz

Extract AISHELL-3.

mkdir data_aishell3
tar zxvf data_aishell3.tgz -C data_aishell3
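If you prefer to script the extraction step, the same operation can be sketched in Python with the standard `tarfile` module. This is only an illustration: `extract_archive` is a hypothetical helper, not part of this example's code, and it assumes the archive has already been downloaded.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract a gzip-compressed tarball into target_dir, creating it if needed.

    Hypothetical helper mirroring `mkdir` + `tar zxvf ... -C ...`.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    # Return the top-level entries so the caller can check what was unpacked.
    return sorted(p.name for p in target.iterdir())
```

For example, `extract_archive("data_aishell3.tgz", "data_aishell3")` corresponds to the two shell commands above.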

Get the MFA result of AISHELL-3 and extract it.

We use MFA (Montreal Forced Aligner) 2.x to get the durations for aishell3_fastspeech2. You can download the alignment result, aishell3_alignment_tone.tar.gz, or train your own MFA model by following the use_mfa example in our repo (which currently uses MFA 1.x).
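To make "durations" concrete: a forced aligner gives each phone a start and end time, while the acoustic model needs a number of spectrogram frames per phone. The sketch below shows one way to do that conversion. It is not the preprocessing code of this example; `intervals_to_durations` is a hypothetical helper, and the sample rate (24000) and hop length (300) are illustrative assumptions, so check the config under conf for the real values.

```python
def intervals_to_durations(intervals, sample_rate=24000, hop_length=300):
    """Convert (phone, start_sec, end_sec) intervals to per-phone frame counts.

    sample_rate and hop_length are assumed values for illustration only.
    Each boundary is rounded to a frame index first, so for contiguous
    intervals the durations sum to the frame index of the last boundary.
    """
    frames_per_second = sample_rate / hop_length
    phones, durations = [], []
    for phone, start, end in intervals:
        start_frame = int(round(start * frames_per_second))
        end_frame = int(round(end * frames_per_second))
        phones.append(phone)
        durations.append(end_frame - start_frame)
    return phones, durations
```

Rounding the boundaries rather than the individual lengths avoids accumulating rounding drift across a long utterance.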

Preprocess the dataset.

Assume the path to the dataset is ~/datasets/data_aishell3. Assume the path to the MFA result of AISHELL-3 is ./aishell3_alignment_tone. Run the command below to preprocess the dataset.

./preprocess.sh

Train the model

./run.sh

To train fastspeech2 on CPU, add the --device=cpu argument to the python3 train.py command in run.sh.
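As a rough picture of how such a flag is usually consumed, here is a minimal `argparse` sketch. It is not the actual parser in train.py; the default value and the set of choices are assumptions based on the `cpu` and `gpu` values that appear in this README.

```python
import argparse


def build_parser():
    """Hypothetical parser with only the --device flag discussed above."""
    parser = argparse.ArgumentParser(description="Illustrative training entry point.")
    parser.add_argument(
        "--device",
        type=str,
        default="gpu",           # assumed default
        choices=["cpu", "gpu"],  # assumed set of accepted values
        help="device to run on",
    )
    return parser
```

With this sketch, `build_parser().parse_args(["--device=cpu"]).device` evaluates to `"cpu"`.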

Synthesize

We use parallel wavegan as the neural vocoder. Download the pretrained parallel wavegan model (trained with baker), parallel_wavegan_baker_ckpt_0.4.zip, and unzip it.

unzip parallel_wavegan_baker_ckpt_0.4.zip

synthesize.sh synthesizes waveforms from metadata.jsonl, and synthesize_e2e.sh synthesizes waveforms from a text list.

./synthesize.sh

or

./synthesize_e2e.sh

See the bash files for more details on the input parameters.
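The two input formats can be read as sketched below. This is an illustration, not the loading code used by the scripts. The JSON Lines reader makes no assumption about which fields metadata.jsonl contains. The text-list reader assumes each line has the form `<utt_id> <sentence>`, which this README does not state, so check ../sentences.txt before relying on it.

```python
import json


def read_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def read_text_list(path):
    """Read a text list, assuming each line is '<utt_id> <sentence>'.

    The line format is an assumption; a line without a space after the
    id raises ValueError here.
    """
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            utt_id, sentence = line.split(maxsplit=1)
            items.append((utt_id, sentence))
    return items
```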

Pretrained Model

A pretrained model, trained on audio with the silence at both edges removed, can be downloaded here: fastspeech2_nosil_aishell3_ckpt_0.4.zip

Then you can use the following command to synthesize the sentences in ../sentences.txt with the pretrained fastspeech2 model.

python3 synthesize_e2e.py \
  --fastspeech2-config=fastspeech2_nosil_aishell3_ckpt_0.4/default.yaml \
  --fastspeech2-checkpoint=fastspeech2_nosil_aishell3_ckpt_0.4/snapshot_iter_96400.pdz \
  --fastspeech2-stat=fastspeech2_nosil_aishell3_ckpt_0.4/speech_stats.npy \
  --pwg-config=parallel_wavegan_baker_ckpt_0.4/pwg_default.yaml \
  --pwg-params=parallel_wavegan_baker_ckpt_0.4/pwg_generator.pdparams \
  --pwg-stat=parallel_wavegan_baker_ckpt_0.4/pwg_stats.npy \
  --text=../sentences.txt \
  --output-dir=exp/debug/test_e2e \
  --device="gpu" \
  --phones-dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \
  --speaker-dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt
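Because the command above depends on eight files spread over two unzipped directories, it can help to check that they exist before launching it. The sketch below builds the same argument list and reports missing files. `build_synthesize_command` is a hypothetical helper, not part of this example; the flag names and file names are copied from the command above.

```python
from pathlib import Path


def build_synthesize_command(fs2_dir, pwg_dir, text="../sentences.txt",
                             output_dir="exp/debug/test_e2e", device="gpu"):
    """Return (command, missing): the argument list and any missing model files."""
    fs2, pwg = Path(fs2_dir), Path(pwg_dir)
    options = {
        "--fastspeech2-config": fs2 / "default.yaml",
        "--fastspeech2-checkpoint": fs2 / "snapshot_iter_96400.pdz",
        "--fastspeech2-stat": fs2 / "speech_stats.npy",
        "--pwg-config": pwg / "pwg_default.yaml",
        "--pwg-params": pwg / "pwg_generator.pdparams",
        "--pwg-stat": pwg / "pwg_stats.npy",
        "--text": text,
        "--output-dir": output_dir,
        "--device": device,
        "--phones-dict": fs2 / "phone_id_map.txt",
        "--speaker-dict": fs2 / "speaker_id_map.txt",
    }
    command = ["python3", "synthesize_e2e.py"]
    command += [f"{flag}={value}" for flag, value in options.items()]
    # Only the files inside the two checkpoint directories are checked.
    required = [value for value in options.values() if isinstance(value, Path)]
    missing = [str(path) for path in required if not path.exists()]
    return command, missing
```

If `missing` is empty, the returned list can be passed to `subprocess.run(command, check=True)` from the directory that contains synthesize_e2e.py.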

Future work

The parallel wavegan vocoder used here was trained with baker only, so a multi-speaker vocoder is still needed.