modified fastspeech README

lifuchen 2020-04-14 06:16:17 +00:00
parent c1b837dc17
commit 45c07fa42e
1 changed files with 11 additions and 5 deletions

@ -37,9 +37,15 @@ During synthesis, results are saved in `samples/` in `output` and tensorboard lo
If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
If `--checkpoint` is not provided, we try to load the model specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided, we try to load the latest checkpoint from the checkpoint directory.
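The lookup order described above can be sketched as follows. This is only an illustration of the precedence rule, not the code used by the scripts: the function name `resolve_checkpoint`, the default directory name and the `step-<N>` naming scheme are assumptions made for the example.

```python
import os
import re


def resolve_checkpoint(checkpoint=None, iteration=None, checkpoint_dir="checkpoint"):
    """Return the checkpoint path to load, or None if nothing is found.

    Assumption for this sketch: checkpoints in `checkpoint_dir` are
    named `step-<N>`. The real naming scheme may differ.
    """
    # 1. An explicit --checkpoint always wins.
    if checkpoint is not None:
        return checkpoint
    # 2. Otherwise use --iteration inside the checkpoint directory.
    if iteration is not None:
        return os.path.join(checkpoint_dir, "step-{}".format(iteration))
    # 3. Otherwise fall back to the latest (largest step) checkpoint.
    if not os.path.isdir(checkpoint_dir):
        return None
    steps = []
    for name in os.listdir(checkpoint_dir):
        match = re.fullmatch(r"step-(\d+)", name)
        if match:
            steps.append(int(match.group(1)))
    if not steps:
        return None
    return os.path.join(checkpoint_dir, "step-{}".format(max(steps)))
```

Steps are compared as integers rather than strings, so `step-2000` is treated as newer than `step-300`.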
## Compute Alignment
## Compute Phoneme Duration
Before train FastSpeech model, you should have diagonal information. We use the diagonal obtained from the TranformerTTS model as the diagonal, you can run alignments/get_alignments.py to get it.
The ground truth duration of each phoneme (the number of spectrogram frames that correspond to that phoneme) must be provided when training a FastSpeech model.
We compute the ground truth duration of each phoneme as follows:
We extract the encoder-decoder attention alignment from a trained Transformer TTS model;
Each frame is assigned to the phoneme that receives the most attention.
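A minimal sketch of the two steps above, assuming the alignment is already available as a single 2-D matrix with one row per spectrogram frame and one column per phoneme. The function name `durations_from_attention` and the toy matrix are made up for illustration; how to pick or combine attention heads and layers in a real Transformer TTS model is not covered here.

```python
def durations_from_attention(attention):
    """Count, for each phoneme, the frames that attend to it the most.

    attention: list of rows, one row per spectrogram frame; each row
    holds the attention weights of that frame over all phonemes.
    """
    num_phonemes = len(attention[0])
    durations = [0] * num_phonemes
    for frame_weights in attention:
        # Index of the phoneme with the largest weight for this frame.
        best = max(range(num_phonemes), key=lambda i: frame_weights[i])
        durations[best] += 1
    return durations


# Toy alignment: 5 frames, 3 phonemes.
attention = [
    [0.9, 0.1, 0.0],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.0, 0.1, 0.9],
]
durations = durations_from_attention(attention)
print(durations)  # [2, 1, 2]
```

Because every frame is assigned to exactly one phoneme, the durations always sum to the number of frames in the spectrogram.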
You can run `alignments/get_alignments.py` to compute these durations.
```bash
cd alignments
@ -50,12 +56,12 @@ python get_alignments.py \
--config=${CONFIG} \
--checkpoint_transformer=${CHECKPOINT} \
```
where `${DATAPATH}` is the path saved LJSpeech data, `${CHECKPOINT}` is the pretrain model path of TransformerTTS, `${CONFIG}` is the config yaml file of TransformerTTS checkpoint. It necessary for you to prepare a pre-trained TranformerTTS checkpoint.
where `${DATAPATH}` is the path where the LJSpeech data is saved, `${CHECKPOINT}` is the path of the pre-trained TransformerTTS model, and `${CONFIG}` is the config YAML file of the TransformerTTS checkpoint. You need to prepare a pre-trained TransformerTTS checkpoint beforehand.
For more help on arguments:
``python train.py --help``.
``python get_alignments.py --help``.
Or you can use your own diagonal information, you should process the data into the following format:
Alternatively, you can use your own phoneme durations; you just need to process the data into the following format:
```bash
{'fname1': alignment1,
'fname2': alignment2,