refine READMEs and clean code
parent 8bcbcf8a86
commit 2f644e1b8b

@ -1,10 +1,10 @@
# Speaker Encoder
This experiment trains a speaker encoder with speaker verification as its task. It is part of an experiment on transfer learning from speaker verification to multispeaker text-to-speech synthesis, which can be found at [tacotron2_aishell3](../tacotron2_shell3). The trained speaker encoder is used to extract utterance embeddings.

## Model

The model used in this experiment is the speaker encoder trained on a text-independent speaker verification task, as described in [Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/pdf/1710.10467.pdf). The GE2E-softmax loss is used.
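
To make the objective concrete, here is a rough NumPy sketch of the GE2E-softmax loss. It is an illustration, not this repository's implementation; the input shape, the L2 normalization, and the initial scale `w=10.0` and bias `b=-5.0` (the initialization suggested in the paper) are assumptions.

```python
import numpy as np

def ge2e_softmax_loss(emb, w=10.0, b=-5.0):
    """emb: [n_speakers, n_utterances, dim], assumed L2-normalized embeddings."""
    n_spk, n_utt, _ = emb.shape
    unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    centroids = unit(emb.mean(axis=1))                       # one centroid per speaker
    # For an utterance's own speaker, use a centroid that excludes the utterance
    # itself, so it is never scored against a centroid it is part of.
    excl = unit((emb.sum(axis=1, keepdims=True) - emb) / (n_utt - 1))
    sim = emb @ centroids.T                                  # [n_spk, n_utt, n_spk]
    idx = np.arange(n_spk)
    sim[idx, :, idx] = np.einsum("sud,sud->su", emb, excl)   # own-speaker similarities
    logits = w * sim + b                                     # scaled cosine similarity
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_prob[idx, :, idx].mean()                     # softmax cross-entropy
```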

## File Structure

@ -30,27 +30,27 @@ Currently supported datasets are Librispeech-other-500, VoxCeleb, VoxCeleb2, ai-
An English multispeaker dataset, [URL](https://www.openslr.org/resources/12/train-other-500.tar.gz); only the `train-other-500` subset is used.
2. VoxCeleb1
An English multispeaker dataset, [URL](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html); the audio files from Dev A to Dev D should be downloaded, combined, and extracted.
3. VoxCeleb2
An English multispeaker dataset, [URL](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html); the audio files from Dev A to Dev H should be downloaded, combined, and extracted.
4. Aidatatang-200zh
A Mandarin Chinese multispeaker dataset, [URL](https://www.openslr.org/62/).
5. magicdata
A Mandarin Chinese multispeaker dataset, [URL](https://www.openslr.org/68/).
If you want to use other datasets, you can also download and preprocess them, as long as they meet the requirements described below.
## Preprocess Datasets
Multispeaker datasets are used as training data, though the transcriptions are not used. To enlarge the amount of training data, several multispeaker datasets are combined. The preprocessed datasets are organized in the file structure described below. The mel spectrogram of each utterance is saved in `.npy` format. The dataset is 2-stratified (speaker-utterance). Since multiple datasets are combined, the dataset name is prepended to the speaker ids to avoid conflicts.
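
The snippet below is a hedged sketch of how one preprocessed utterance might be written under this layout; the helper and its naming scheme are hypothetical, not the actual preprocess script's API. The tree that follows shows the target directory structure.

```python
from pathlib import Path
import numpy as np

def save_utterance(dataset_root, dataset_name, speaker_id, utt_id, mel):
    # Prepend the dataset name to the speaker id to avoid conflicts
    # between combined datasets, e.g. "librispeech_other-1069".
    speaker_dir = Path(dataset_root) / f"{dataset_name}-{speaker_id}"
    speaker_dir.mkdir(parents=True, exist_ok=True)
    np.save(speaker_dir / f"{utt_id}.npy", mel)  # mel: [n_frames, n_mels]
```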
```text
dataset_root
@ -88,25 +88,20 @@ When preprocessing is done, run the command below to train the model.
python train.py --data=<data_path> --output=<output> --device="gpu" --nprocs=1
```
- `--data` is the path to the preprocessed dataset.
- `--output` is the directory to save results, usually a subdirectory of `runs`. It contains visualdl log files, text log files, the config file, and a `checkpoints` directory, which contains parameter files and optimizer state files. If `--output` already contains some training results, the most recent parameter file and optimizer state file are loaded before training.
- `--device` is the device type to run the training; 'cpu' and 'gpu' are supported.
- `--nprocs` is the number of replicas to run in multiprocessing-based parallel training. Currently multiprocessing-based parallel training is only enabled when using 'gpu' as the device. `CUDA_VISIBLE_DEVICES` can be used to specify visible CUDA devices.
Other options are described below.
- `--config` is a `.yaml` config file used to override the default config (which is defined in `config.py`).
- `--opts` are command-line options used to further override the config. They should be passed last, as multiple space-separated key-value pairs.
- `--checkpoint_path` specifies the checkpoint to load before training; the extension is not included. A parameter file (`.pdparams`) and an optimizer state file (`.pdopt`) with the same name are used. This option has a higher priority than auto-resuming from the `--output` directory.
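
A minimal sketch of the resume priority described above, using a hypothetical helper (this is not `train.py`'s actual code): an explicit `--checkpoint_path` wins; otherwise the newest parameter file under `<output>/checkpoints` is used if one exists.

```python
from pathlib import Path

def resolve_checkpoint(output_dir, checkpoint_path=None):
    if checkpoint_path:
        # --checkpoint_path is given without extension; the matching
        # .pdparams and .pdopt files are loaded from this prefix.
        return checkpoint_path
    candidates = sorted(Path(output_dir, "checkpoints").glob("*.pdparams"),
                        key=lambda p: p.stat().st_mtime)
    # Auto-resume: pick the most recent parameter file, if any.
    return str(candidates[-1].with_suffix("")) if candidates else None
```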
## Pretrained Model
The pretrained model is first trained for 1560k steps on Librispeech-other-500 and VoxCeleb1, then trained on aidatatang_200zh and magicdata up to 3000k steps.
Download URL [ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip).
@ -126,12 +121,12 @@ python inference.py --input=<input> --output=<output> --checkpoint_path=<checkpo
`--pattern` is the wildcard pattern used to filter audio files for inference; it defaults to `*.wav`.
`--device` and `--opts` have the same meaning as in the training script.
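
Putting the inference options together, here is a rough sketch of the described behavior; `embed_fn` is a stand-in for the actual model, and the helper is hypothetical, not this repository's `inference.py`.

```python
from pathlib import Path
import numpy as np

def extract_embeddings(input_dir, output_dir, embed_fn, pattern="*.wav"):
    # Walk --input for files matching --pattern, mirror the directory
    # structure under --output, and save one embedding per audio file.
    for wav_path in Path(input_dir).rglob(pattern):
        rel = wav_path.relative_to(input_dir).with_suffix(".npy")
        out_path = Path(output_dir) / rel
        out_path.parent.mkdir(parents=True, exist_ok=True)
        np.save(out_path, embed_fn(wav_path))  # embed_fn: wav path -> 1-D embedding
```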
## References
1. [Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/pdf/1710.10467.pdf)
2. [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf)
@ -29,21 +29,21 @@ ge2e
An English multispeaker dataset, [download link](https://www.openslr.org/resources/12/train-other-500.tar.gz); only the `train-other-500` subset is used in our experiments.
2. VoxCeleb1
An English multispeaker dataset, [download link](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html); download the four Dev A to Dev D archives under Audio Files, then combine and extract them.
3. VoxCeleb2
An English multispeaker dataset, [download link](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html); download the eight Dev A to Dev H archives under Audio Files, then combine and extract them.
4. Aidatatang-200zh
A Mandarin Chinese multispeaker dataset, [download link](https://www.openslr.org/62/).
5. magicdata
A Mandarin Chinese multispeaker dataset, [download link](https://www.openslr.org/68/).
If you want to use other datasets, you can download and process them yourself, as long as they meet the requirements described below.
@ -87,21 +87,16 @@ python preprocess.py --datasets_root=<datasets_root> --output_dir=<output_dir> -
python train.py --data=<data_path> --output=<output> --device="gpu" --nprocs=1
```
- `--data` is the path to the preprocessed dataset.
- `--output` is the directory where training results are saved, usually a subdirectory of `runs`. The saved results include visualdl log files, text logs, a backup of the run config, and a `checkpoints` directory containing parameter files and optimizer state files. If the given output path contains results from a previous run, the most recent parameter file and optimizer state file are loaded automatically before training.
- `--device` is the device to run on; currently 'cpu' and 'gpu' are supported.
- `--nprocs` is the number of processes to run. Currently multiprocess training is only supported when the device is 'gpu'. The `CUDA_VISIBLE_DEVICES` environment variable can be used to specify which GPUs are visible.
There are a few more options.
- `--config` is a `.yaml` config file used to override the default config (see `config.py` for the defaults).
- `--opts` overrides the config further from the command line. It should be the last option passed, given as multiple space-separated KEY VALUE pairs.
- `--checkpoint_path` specifies the checkpoint to resume from; the extension is not needed. The parameter file (`.pdparams`) and optimizer state file (`.pdopt`) with the same name are loaded to resume training. This option has a higher priority than auto-resuming from the `output` directory.
## Pretrained Model
@ -117,15 +112,11 @@ python train.py --data=<data_path> --output=<output> --device="gpu" --nprocs=1
python inference.py --input=<input> --output=<output> --checkpoint_path=<checkpoint_path> --device="gpu"
```
- `--input` is the path of the dataset to process.
- `--output` is where the results are saved. It mirrors the folder structure of `--input`; for each audio file under the input there is a `.npy` file with the same name, containing the utterance embedding extracted from that audio.
- `--checkpoint_path` is the path of the parameter file used for inference, without the extension.
- `--pattern` is the wildcard pattern used to filter the audio files to process; it defaults to `*.wav`.
- `--device` and `--opts` have the same meaning as in the training script.
## References
@ -7,10 +7,12 @@ from paddle.io import DataLoader
from paddle.nn.clip import ClipGradByGlobalNorm
from parakeet.models.lstm_speaker_encoder import LSTMSpeakerEncoder
from parakeet.training import ExperimentBase
from parakeet.training import default_argument_parser
from speaker_verification_dataset import MultiSpeakerMelDataset
from speaker_verification_dataset import MultiSpeakerSampler
from speaker_verification_dataset import Collate
from config import get_cfg_defaults
@ -55,7 +55,7 @@ python process_wav.py --input=<input> --output=<output> --alignment=<alignment>
### Transcription Processing
Convert the transcripts into phones and tones and store them. Note that our processing here differs from the one used for the Montreal Forced Aligner: we separate out the tones. This is one possible scheme; it is also possible to split syllables only into initials and finals.
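
As a toy illustration of separating tones from phones (not the actual processing script; the trailing-digit convention and the neutral-tone fallback are assumptions):

```python
def split_tone(syllable):
    # e.g. "zhong1" -> ("zhong", "1"); the tone is the trailing digit.
    if syllable and syllable[-1].isdigit():
        return syllable[:-1], syllable[-1]
    return syllable, "5"  # treat unmarked syllables as neutral tone

print([split_tone(s) for s in "zhong1 guo2".split()])
# [('zhong', '1'), ('guo', '2')]
```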
Run the script to process the transcripts.