Modified data.py to generate masks as model inputs
commit d08779d61e
@ -25,3 +25,11 @@
    files: \.md$
  - id: remove-tabs
    files: \.md$
- repo: local
  hooks:
  - id: copyright_checker
    name: copyright_checker
    entry: python ./tools/copyright.hook
    language: system
    files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|py)$
    exclude: (?!.*third_party)^.*$ | (?!.*book)^.*$
LICENSE (2 changed lines)
@ -13,6 +13,4 @@
limitations under the License.

Part of the code was copied or adapted from https://github.com/r9y9/deepvoice3_pytorch/
Copyright (c) 2017: Ryuichi Yamamoto, whose license applies.
README.md (45 changed lines)
@ -1,16 +1,27 @@
# Parakeet

Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It is built on Paddle Fluid dynamic graph, with the support of many influential TTS models proposed by [Baidu Research](http://research.baidu.com) and other academic institutions.
Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It is built on PaddlePaddle Fluid dynamic graph and includes many influential TTS models proposed by [Baidu Research](http://research.baidu.com) and other research groups.

<div align="center">
    <img src="images/logo.png" width=450 /> <br>
</div>

## Installation
In particular, it features the latest [WaveFlow](https://arxiv.org/abs/1912.01219) model proposed by Baidu Research.
- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on a Nvidia V100 GPU without engineered inference kernels, which is faster than WaveGlow and several orders of magnitude faster than WaveNet.
- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M) and comparable to WaveNet (4.6M).
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.

### Install Paddlepaddle
### Setup

See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. This repo requires paddlepaddle's version to be above 1.7.
Make sure the library `libsndfile1` is installed, e.g., on Ubuntu.

```bash
sudo apt-get install libsndfile1
```

### Install PaddlePaddle

See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. This repo requires paddlepaddle 1.7 or above.

### Install Parakeet

@ -20,12 +31,6 @@ cd Parakeet
pip install -e .
```

### Setup
Make sure libsndfile1 is installed:
```bash
sudo apt-get install libsndfile1
```

### Install CMUdict for nltk

CMUdict from nltk is used to transform text into phonemes.

@ -36,14 +41,24 @@ nltk.download("cmudict")
```

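Since the dictionary is just a word-to-phoneme lookup, a quick way to see what it returns (a minimal sketch; Parakeet's own text frontend wraps this differently):

```python
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict")  # one-time download
pron = cmudict.dict()     # maps lowercase words to lists of phoneme sequences
print(pron["speech"][0])  # ['S', 'P', 'IY1', 'CH']
```
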
## Supported models
## Related Research

- [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654)
- [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895)
- [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).
- [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263)
- [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
- [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499)
- [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281)

## Examples

- [Train a deepvoice 3 model with ljspeech dataset](./parakeet/examples/deepvoice3)
- [Train a transformer_tts model with ljspeech dataset](./parakeet/examples/transformer_tts)
- [Train a fastspeech model with ljspeech dataset](./parakeet/examples/fastspeech)
- [Train a DeepVoice3 model with ljspeech dataset](./examples/deepvoice3)
- [Train a TransformerTTS model with ljspeech dataset](./examples/transformer_tts)
- [Train a FastSpeech model with ljspeech dataset](./examples/fastspeech)
- [Train a WaveFlow model with ljspeech dataset](./examples/waveflow)
- [Train a WaveNet model with ljspeech dataset](./examples/wavenet)
- [Train a ClariNet model with ljspeech dataset](./examples/clarinet)

## Copyright and License

Parakeet is provided under the [Apache-2.0 license](LICENSE).

@ -0,0 +1,103 @@
# ClariNet

PaddlePaddle dynamic graph implementation of ClariNet, a convolutional network based vocoder. The implementation is based on the paper [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).

## Dataset

We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```

## Project Structure

```text
├── data.py          data processing
├── configs/         (example) configuration file
├── synthesis.py     script to synthesize waveform from mel_spectrogram
├── train.py         script to train a model
└── utils.py         utility functions
```

## Train

Train the model using train.py. Follow the usage displayed by `python train.py --help`.

```text
usage: train.py [-h] [--config CONFIG] [--device DEVICE] [--output OUTPUT]
                [--data DATA] [--resume RESUME] [--wavenet WAVENET]

train a ClariNet model with LJspeech and a trained WaveNet model.

optional arguments:
  -h, --help         show this help message and exit
  --config CONFIG    path of the config file.
  --device DEVICE    device to use.
  --output OUTPUT    path to save student.
  --data DATA        path of LJspeech dataset.
  --resume RESUME    checkpoint to load from.
  --wavenet WAVENET  wavenet checkpoint to use.
```

1. `--config` is the configuration file to use. The provided configurations can be used directly. You can also change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
3. `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before training.
4. `--output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.

```text
├── checkpoints      # checkpoint
├── states           # audio files generated at validation
└── log              # tensorboard log
```

5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
6. `--wavenet` is the path of the wavenet checkpoint to load. If you do not specify `--resume`, then this must be provided.

Before you start training a ClariNet model, you should have trained a WaveNet model with a single Gaussian output distribution. Make sure the config of the teacher matches that of the trained WaveNet model.

Example script:

```bash
python train.py --config=./configs/clarinet_ljspeech.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0 --wavenet=wavenet_checkpoint/teacher
```

You can monitor the training log via tensorboard, using the script below.

```bash
cd experiment/log
tensorboard --logdir=.
```

## Synthesis
```text
usage: synthesis.py [-h] [--config CONFIG] [--device DEVICE] [--data DATA]
                    checkpoint output

synthesize audio files from mel spectrogram in the validation set.

positional arguments:
  checkpoint       checkpoint to load from.
  output           path to save student.

optional arguments:
  -h, --help       show this help message and exit
  --config CONFIG  path of the config file.
  --device DEVICE  device to use.
  --data DATA      path of LJspeech dataset.
```

1. `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
2. `--data` is the path of the LJspeech dataset. A dataset is not strictly needed for synthesis, but since the input is mel spectrogram, the dataset's audio files are used to compute mel spectrograms.
3. `checkpoint` is the checkpoint to load.
4. `output` is the directory to save results. The output path contains the generated audio files (`*.wav`).
5. `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.

Example script:

```bash
python synthesis.py --config=./configs/clarinet_ljspeech.yaml --data=./LJSpeech-1.1/ --device=0 experiment/checkpoints/step_500000 generated
```

@ -0,0 +1,52 @@
data:
  batch_size: 8
  train_clip_seconds: 0.5
  sample_rate: 22050
  hop_length: 256
  win_length: 1024
  n_fft: 2048

  n_mels: 80
  valid_size: 16

conditioner:
  upsampling_factors: [16, 16]

teacher:
  n_loop: 10
  n_layer: 3
  filter_size: 2
  residual_channels: 128
  loss_type: "mog"
  output_dim: 3
  log_scale_min: -9

student:
  n_loops: [10, 10, 10, 10, 10, 10]
  n_layers: [1, 1, 1, 1, 1, 1]
  filter_size: 3
  residual_channels: 64
  log_scale_min: -7

stft:
  n_fft: 2048
  win_length: 1024
  hop_length: 256

loss:
  lmd: 4

train:
  learning_rate: 0.0005
  anneal_rate: 0.5
  anneal_interval: 200000
  gradient_max_norm: 100.0

  checkpoint_interval: 1000
  eval_interval: 1000

  max_iterations: 2000000

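One cross-field constraint worth noting: each mel frame covers `hop_length` samples, so the product of the conditioner's `upsampling_factors` (16 × 16 = 256) must equal `hop_length`. A quick check (a sketch assuming the config is loaded with ruamel.yaml as in train.py; the path is hypothetical):

```python
import ruamel.yaml

with open("configs/clarinet_ljspeech.yaml", "rt") as f:
    config = ruamel.yaml.safe_load(f)

product = 1
for factor in config["conditioner"]["upsampling_factors"]:
    product *= factor
# upsampled mel frames must align 1:1 with audio samples
assert product == config["data"]["hop_length"]
```
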
@ -0,0 +1,151 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys
import argparse
import ruamel.yaml
import random
from tqdm import tqdm
import pickle
import numpy as np
from tensorboardX import SummaryWriter

import paddle.fluid.dygraph as dg
from paddle import fluid

from parakeet.models.wavenet import WaveNet, UpsampleNet
from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet
from parakeet.data import TransformDataset, SliceDataset, RandomSampler, SequentialSampler, DataCargo
from parakeet.utils.layer_tools import summary, freeze

from utils import valid_model, eval_model, save_checkpoint, load_checkpoint, load_model
sys.path.append("../wavenet")
from data import LJSpeechMetaData, Transform, DataCollector

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="synthesize audio files from mel spectrogram in the validation set."
    )
    parser.add_argument("--config", type=str, help="path of the config file.")
    parser.add_argument(
        "--device", type=int, default=-1, help="device to use.")
    parser.add_argument("--data", type=str, help="path of LJspeech dataset.")
    parser.add_argument(
        "checkpoint", type=str, help="checkpoint to load from.")
    parser.add_argument(
        "output", type=str, default="experiment", help="path to save student.")

    args = parser.parse_args()
    with open(args.config, 'rt') as f:
        config = ruamel.yaml.safe_load(f)

    ljspeech_meta = LJSpeechMetaData(args.data)

    data_config = config["data"]
    sample_rate = data_config["sample_rate"]
    n_fft = data_config["n_fft"]
    win_length = data_config["win_length"]
    hop_length = data_config["hop_length"]
    n_mels = data_config["n_mels"]
    train_clip_seconds = data_config["train_clip_seconds"]
    transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
    ljspeech = TransformDataset(ljspeech_meta, transform)

    valid_size = data_config["valid_size"]
    ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
    ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))

    teacher_config = config["teacher"]
    n_loop = teacher_config["n_loop"]
    n_layer = teacher_config["n_layer"]
    filter_size = teacher_config["filter_size"]
    context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
    print("context size is {} samples".format(context_size))
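    # with the provided config (n_loop=10, n_layer=3, filter_size=2) this is
    # 1 + 3 * (2**0 + 2**1 + ... + 2**9) = 1 + 3 * 1023 = 3070 samples: the
    # teacher WaveNet's receptive field, i.e. the leading audio context each
    # clip must carry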
    train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
                                   train_clip_seconds)
    valid_batch_fn = DataCollector(
        context_size, sample_rate, hop_length, train_clip_seconds, valid=True)

    batch_size = data_config["batch_size"]
    train_cargo = DataCargo(
        ljspeech_train,
        train_batch_fn,
        batch_size,
        sampler=RandomSampler(ljspeech_train))

    # only batch=1 for validation is enabled
    valid_cargo = DataCargo(
        ljspeech_valid,
        valid_batch_fn,
        batch_size=1,
        sampler=SequentialSampler(ljspeech_valid))

    if args.device == -1:
        place = fluid.CPUPlace()
    else:
        place = fluid.CUDAPlace(args.device)

    with dg.guard(place):
        # conditioner(upsampling net)
        conditioner_config = config["conditioner"]
        upsampling_factors = conditioner_config["upsampling_factors"]
        upsample_net = UpsampleNet(upscale_factors=upsampling_factors)
        freeze(upsample_net)

        residual_channels = teacher_config["residual_channels"]
        loss_type = teacher_config["loss_type"]
        output_dim = teacher_config["output_dim"]
        log_scale_min = teacher_config["log_scale_min"]
        assert loss_type == "mog" and output_dim == 3, \
            "the teacher wavenet should be a wavenet with single gaussian output"

        teacher = WaveNet(n_loop, n_layer, residual_channels, output_dim,
                          n_mels, filter_size, loss_type, log_scale_min)
        # load & freeze upsample_net & teacher
        freeze(teacher)

        student_config = config["student"]
        n_loops = student_config["n_loops"]
        n_layers = student_config["n_layers"]
        student_residual_channels = student_config["residual_channels"]
        student_filter_size = student_config["filter_size"]
        student_log_scale_min = student_config["log_scale_min"]
        student = ParallelWaveNet(n_loops, n_layers, student_residual_channels,
                                  n_mels, student_filter_size)

        stft_config = config["stft"]
        stft = STFT(
            n_fft=stft_config["n_fft"],
            hop_length=stft_config["hop_length"],
            win_length=stft_config["win_length"])

        lmd = config["loss"]["lmd"]
        model = Clarinet(upsample_net, teacher, student, stft,
                         student_log_scale_min, lmd)
        summary(model)
        load_model(model, args.checkpoint)

        # loader
        train_loader = fluid.io.DataLoader.from_generator(
            capacity=10, return_list=True)
        train_loader.set_batch_generator(train_cargo, place)

        valid_loader = fluid.io.DataLoader.from_generator(
            capacity=10, return_list=True)
        valid_loader.set_batch_generator(valid_cargo, place)

        if not os.path.exists(args.output):
            os.makedirs(args.output)
        eval_model(model, valid_loader, args.output, sample_rate)

@ -0,0 +1,220 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import sys
import argparse
import ruamel.yaml
import random
from tqdm import tqdm
import pickle
import numpy as np
from tensorboardX import SummaryWriter

import paddle.fluid.dygraph as dg
from paddle import fluid

from parakeet.models.wavenet import WaveNet, UpsampleNet
from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet
from parakeet.data import TransformDataset, SliceDataset, RandomSampler, SequentialSampler, DataCargo
from parakeet.utils.layer_tools import summary, freeze

from utils import make_output_tree, valid_model, save_checkpoint, load_checkpoint, load_wavenet
sys.path.append("../wavenet")
from data import LJSpeechMetaData, Transform, DataCollector

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="train a clarinet model with LJspeech and a trained wavenet model."
    )
    parser.add_argument("--config", type=str, help="path of the config file.")
    parser.add_argument(
        "--device", type=int, default=-1, help="device to use.")
    parser.add_argument(
        "--output",
        type=str,
        default="experiment",
        help="path to save student.")
    parser.add_argument("--data", type=str, help="path of LJspeech dataset.")
    parser.add_argument("--resume", type=str, help="checkpoint to load from.")
    parser.add_argument(
        "--wavenet", type=str, help="wavenet checkpoint to use.")
    args = parser.parse_args()
    with open(args.config, 'rt') as f:
        config = ruamel.yaml.safe_load(f)

    ljspeech_meta = LJSpeechMetaData(args.data)

    data_config = config["data"]
    sample_rate = data_config["sample_rate"]
    n_fft = data_config["n_fft"]
    win_length = data_config["win_length"]
    hop_length = data_config["hop_length"]
    n_mels = data_config["n_mels"]
    train_clip_seconds = data_config["train_clip_seconds"]
    transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
    ljspeech = TransformDataset(ljspeech_meta, transform)

    valid_size = data_config["valid_size"]
    ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
    ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))

    teacher_config = config["teacher"]
    n_loop = teacher_config["n_loop"]
    n_layer = teacher_config["n_layer"]
    filter_size = teacher_config["filter_size"]
    context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
    print("context size is {} samples".format(context_size))
    train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
                                   train_clip_seconds)
    valid_batch_fn = DataCollector(
        context_size, sample_rate, hop_length, train_clip_seconds, valid=True)

    batch_size = data_config["batch_size"]
    train_cargo = DataCargo(
        ljspeech_train,
        train_batch_fn,
        batch_size,
        sampler=RandomSampler(ljspeech_train))

    # only batch=1 for validation is enabled
    valid_cargo = DataCargo(
        ljspeech_valid,
        valid_batch_fn,
        batch_size=1,
        sampler=SequentialSampler(ljspeech_valid))

    make_output_tree(args.output)

    if args.device == -1:
        place = fluid.CPUPlace()
    else:
        place = fluid.CUDAPlace(args.device)

    with dg.guard(place):
        # conditioner(upsampling net)
        conditioner_config = config["conditioner"]
        upsampling_factors = conditioner_config["upsampling_factors"]
        upsample_net = UpsampleNet(upscale_factors=upsampling_factors)
        freeze(upsample_net)

        residual_channels = teacher_config["residual_channels"]
        loss_type = teacher_config["loss_type"]
        output_dim = teacher_config["output_dim"]
        log_scale_min = teacher_config["log_scale_min"]
        assert loss_type == "mog" and output_dim == 3, \
            "the teacher wavenet should be a wavenet with single gaussian output"

        teacher = WaveNet(n_loop, n_layer, residual_channels, output_dim,
                          n_mels, filter_size, loss_type, log_scale_min)
        freeze(teacher)

        student_config = config["student"]
        n_loops = student_config["n_loops"]
        n_layers = student_config["n_layers"]
        student_residual_channels = student_config["residual_channels"]
        student_filter_size = student_config["filter_size"]
        student_log_scale_min = student_config["log_scale_min"]
        student = ParallelWaveNet(n_loops, n_layers, student_residual_channels,
                                  n_mels, student_filter_size)

        stft_config = config["stft"]
        stft = STFT(
            n_fft=stft_config["n_fft"],
            hop_length=stft_config["hop_length"],
            win_length=stft_config["win_length"])

        lmd = config["loss"]["lmd"]
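        # the distillation objective combines a KL term (student vs. frozen
        # teacher) with an STFT-magnitude term weighted by lmd (assumed: the
        # exact formulation lives in parakeet.models.clarinet)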
        model = Clarinet(upsample_net, teacher, student, stft,
                         student_log_scale_min, lmd)
        summary(model)

        # optim
        train_config = config["train"]
        learning_rate = train_config["learning_rate"]
        anneal_rate = train_config["anneal_rate"]
        anneal_interval = train_config["anneal_interval"]
        lr_scheduler = dg.ExponentialDecay(
            learning_rate, anneal_interval, anneal_rate, staircase=True)
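        # staircase decay: lr = learning_rate * anneal_rate ** (step // anneal_interval),
        # i.e. 0.0005 halved every 200k steps with the provided config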
        optim = fluid.optimizer.Adam(
            lr_scheduler, parameter_list=model.parameters())
        gradiant_max_norm = train_config["gradient_max_norm"]
        clipper = fluid.dygraph_grad_clip.GradClipByGlobalNorm(
            gradiant_max_norm)
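        # global-norm clipping: if the joint L2 norm of all gradients exceeds
        # gradient_max_norm (100.0 in the provided config), they are rescaled
        # proportionally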
|
        assert args.wavenet or args.resume, "you should load from a trained wavenet or resume training; training without a trained wavenet is not recommended."
        if args.wavenet:
            load_wavenet(model, args.wavenet)

        if args.resume:
            load_checkpoint(model, optim, args.resume)

        # loader
        train_loader = fluid.io.DataLoader.from_generator(
            capacity=10, return_list=True)
        train_loader.set_batch_generator(train_cargo, place)

        valid_loader = fluid.io.DataLoader.from_generator(
            capacity=10, return_list=True)
        valid_loader.set_batch_generator(valid_cargo, place)

        # train
        max_iterations = train_config["max_iterations"]
        checkpoint_interval = train_config["checkpoint_interval"]
        eval_interval = train_config["eval_interval"]
        checkpoint_dir = os.path.join(args.output, "checkpoints")
        state_dir = os.path.join(args.output, "states")
        log_dir = os.path.join(args.output, "log")
        writer = SummaryWriter(log_dir)

        # training loop
        global_step = 1
        global_epoch = 1
        while global_step < max_iterations:
            epoch_loss = 0.
            for j, batch in tqdm(enumerate(train_loader), desc="[train]"):
                audios, mels, audio_starts = batch
                model.train()
                loss_dict = model(
                    audios, mels, audio_starts, clip_kl=global_step > 500)

                writer.add_scalar("learning_rate",
                                  optim._learning_rate.step().numpy()[0],
                                  global_step)
                for k, v in loss_dict.items():
                    writer.add_scalar("loss/{}".format(k),
                                      v.numpy()[0], global_step)

                l = loss_dict["loss"]
                step_loss = l.numpy()[0]
                print("[train] loss: {:<8.6f}".format(step_loss))
                epoch_loss += step_loss

                l.backward()
                optim.minimize(l, grad_clip=clipper)
                optim.clear_gradients()

                if global_step % eval_interval == 0:
                    # evaluate on valid dataset
                    valid_model(model, valid_loader, state_dir, global_step,
                                sample_rate)
                if global_step % checkpoint_interval == 0:
                    save_checkpoint(model, optim, checkpoint_dir, global_step)

                global_step += 1

            # epoch loss (j is the last batch index, so j + 1 batches were seen)
            average_loss = epoch_loss / (j + 1)
            writer.add_scalar("average_loss", average_loss, global_epoch)
            global_epoch += 1

@ -0,0 +1,96 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import soundfile as sf
from tensorboardX import SummaryWriter
from collections import OrderedDict

from paddle import fluid
import paddle.fluid.dygraph as dg


def make_output_tree(output_dir):
    checkpoint_dir = os.path.join(output_dir, "checkpoints")
    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)

    state_dir = os.path.join(output_dir, "states")
    if not os.path.exists(state_dir):
        os.makedirs(state_dir)


def valid_model(model, valid_loader, output_dir, global_step, sample_rate):
    model.eval()
    for i, batch in enumerate(valid_loader):
        # print("sentence {}".format(i))
        path = os.path.join(output_dir,
                            "step_{}_sentence_{}.wav".format(global_step, i))
        audio_clips, mel_specs, audio_starts = batch
        wav_var = model.synthesis(mel_specs)
        wav_np = wav_var.numpy()[0]
        sf.write(path, wav_np, samplerate=sample_rate)
        print("generated {}".format(path))


def eval_model(model, valid_loader, output_dir, sample_rate):
    model.eval()
    for i, batch in enumerate(valid_loader):
        # print("sentence {}".format(i))
        path = os.path.join(output_dir, "sentence_{}.wav".format(i))
        audio_clips, mel_specs, audio_starts = batch
        wav_var = model.synthesis(mel_specs)
        wav_np = wav_var.numpy()[0]
        sf.write(path, wav_np, samplerate=sample_rate)
        print("generated {}".format(path))


def save_checkpoint(model, optim, checkpoint_dir, global_step):
    path = os.path.join(checkpoint_dir, "step_{}".format(global_step))
    dg.save_dygraph(model.state_dict(), path)
    print("saving model to {}".format(path + ".pdparams"))
    if optim:
        dg.save_dygraph(optim.state_dict(), path)
        print("saving optimizer to {}".format(path + ".pdopt"))


def load_model(model, path):
    model_dict, _ = dg.load_dygraph(path)
    model.set_dict(model_dict)
    print("loaded model from {}.pdparams".format(path))


def load_checkpoint(model, optim, path):
    model_dict, optim_dict = dg.load_dygraph(path)
    model.set_dict(model_dict)
    print("loaded model from {}.pdparams".format(path))
    if optim_dict:
        optim.set_dict(optim_dict)
        print("loaded optimizer from {}.pdopt".format(path))


def load_wavenet(model, path):
    wavenet_dict, _ = dg.load_dygraph(path)
    encoder_dict = OrderedDict()
    teacher_dict = OrderedDict()
    for k, v in wavenet_dict.items():
        if k.startswith("encoder."):
            encoder_dict[k.split('.', 1)[1]] = v
        else:
            # k starts with "decoder."
            teacher_dict[k.split('.', 1)[1]] = v

    model.encoder.set_dict(encoder_dict)
    model.teacher.set_dict(teacher_dict)
    print("loaded the encoder part and teacher part from wavenet model.")
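These helpers pair up in train.py as sketched below (the paths are hypothetical; `model` and `optim` are the objects built inside the `dg.guard` block):

```python
# save a checkpoint every checkpoint_interval steps...
save_checkpoint(model, optim, "experiment/checkpoints", global_step=1000)
# ...and resume later by passing the saved prefix to --resume
load_checkpoint(model, optim, "experiment/checkpoints/step_1000")
```
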
@ -1,8 +1,8 @@
# Deepvoice 3
# Deep Voice 3

Paddle implementation of deepvoice 3 in dynamic graph, a convolutional network based text-to-speech synthesis model. The implementation is based on [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654).
PaddlePaddle dynamic graph implementation of Deep Voice 3, a convolutional network based text-to-speech generative model. The implementation is based on [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654).

We implement Deepvoice 3 in paddle fluid with dynamic graph, which is convenient for flexible network architectures.
We implement Deep Voice 3 using Paddle Fluid with dynamic graph, which is convenient for building flexible network architectures.

## Dataset

@ -15,15 +15,15 @@ tar xjvf LJSpeech-1.1.tar.bz2

## Model Architecture

![DeepVoice3 model architecture](./images/model_architecture.png)
![Deep Voice 3 model architecture](./images/model_architecture.png)

The model consists of an encoder, a decoder and a converter (and a speaker embedding for multispeaker models). The encoder, together with the decoder forms the seq2seq part of the model, and the converter forms the postnet part.
The model consists of an encoder, a decoder and a converter (and a speaker embedding for multispeaker models). The encoder and the decoder together form the seq2seq part of the model, and the converter forms the postnet part.

## Project Structure

```text
├── data.py          data processing
├── ljspeech.yaml    (example) configuration file
├── data.py          data processing
├── configs/         (example) configuration files
├── sentences.txt    sample sentences
├── synthesis.py     script to synthesize waveform from text
├── train.py         script to train a model

@ -37,7 +37,7 @@ Train the model using train.py, follow the usage displayed by `python train.py --help`.

```text
usage: train.py [-h] [-c CONFIG] [-s DATA] [-r RESUME] [-o OUTPUT] [-g DEVICE]

Train a deepvoice 3 model with LJSpeech dataset.
Train a Deep Voice 3 model with LJSpeech dataset.

optional arguments:
  -h, --help            show this help message and exit

@ -50,18 +50,18 @@ optional arguments:
                        The directory to save results.
  -g DEVICE, --device DEVICE
                        device to use
```

1. `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly. You can also change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
3. `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before training.
4. `--output` is the directory to save results, all result are saved in this directory. The structure of the output directory is shown below.
4. `--output` is the directory to save results, all results are saved in this directory. The structure of the output directory is shown below.

```text
├── checkpoints      # checkpoint
├── log              # tensorboard log
└── states           # train and evaluation results
    ├── alignments   # attention
    ├── lin_spec     # linear spectrogram
    ├── mel_spec     # mel spectrogram
    └── waveform     # waveform (.wav files)

@ -69,10 +69,10 @@ optional arguments:

5. `--device` is the device (gpu id) to use for training. `-1` means CPU.

example script:
Example script:

```bash
python train.py --config=./ljspeech.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0
python train.py --config=configs/ljspeech.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0
```

You can monitor the training log via tensorboard, using the script below.

@ -86,7 +86,7 @@ tensorboard --logdir=.

```text
usage: synthesis.py [-h] [-c CONFIG] [-g DEVICE] checkpoint text output_path

Synthesize waveform with a checkpoint.
Synthesize waveform from a checkpoint.

positional arguments:
  checkpoint            checkpoint to load.

@ -107,9 +107,8 @@ optional arguments:
4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`) and attention plots (`*.png`) for each sentence.
5. `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.

example script:
Example script:

```bash
python synthesis.py --config=./ljspeech.yaml --device=0 experiment/checkpoints/model_step_005000000 sentences.txt generated
python synthesis.py --config=configs/ljspeech.yaml --device=0 experiment/checkpoints/model_step_005000000 sentences.txt generated
```

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import csv
from pathlib import Path

@ -79,10 +93,11 @@ class Transform(object):
        y = signal.lfilter([1., -self.preemphasis], [1.], wav)

        # STFT
        D = librosa.stft(y=y,
                         n_fft=self.n_fft,
                         win_length=self.win_length,
                         hop_length=self.hop_length)
        D = librosa.stft(
            y=y,
            n_fft=self.n_fft,
            win_length=self.win_length,
            hop_length=self.hop_length)
        S = np.abs(D)

        # to db and normalize to 0-1

@ -96,11 +111,8 @@ class Transform(object):

        # mel scale and to db and normalize to 0-1,
        # CAUTION: pass linear scale S, not dbscaled S
        S_mel = librosa.feature.melspectrogram(S=S,
                                               n_mels=self.n_mels,
                                               fmin=self.fmin,
                                               fmax=self.fmax,
                                               power=1.)
        S_mel = librosa.feature.melspectrogram(
            S=S, n_mels=self.n_mels, fmin=self.fmin, fmax=self.fmax, power=1.)
        S_mel = 20 * np.log10(np.maximum(amplitude_min,
                                         S_mel)) - self.ref_level_db
        S_mel_norm = (S_mel - self.min_level_db) / (-self.min_level_db)

@ -148,20 +160,18 @@ class DataCollector(object):
            (mix_grapheme_phonemes, text_length, speaker_id, S_norm,
             S_mel_norm, num_frames) = example
            text_sequences.append(
                np.pad(mix_grapheme_phonemes,
                       (0, max_text_length - text_length)))
                np.pad(mix_grapheme_phonemes, (0, max_text_length - text_length
                                               )))
            lin_specs.append(
                np.pad(S_norm,
                       ((0, 0), (self._pad_begin,
                                 max_frames - self._pad_begin - num_frames))))
                np.pad(S_norm, ((0, 0), (self._pad_begin, max_frames -
                                         self._pad_begin - num_frames))))
            mel_specs.append(
                np.pad(S_mel_norm,
                       ((0, 0), (self._pad_begin,
                                 max_frames - self._pad_begin - num_frames))))
                np.pad(S_mel_norm, ((0, 0), (self._pad_begin, max_frames -
                                             self._pad_begin - num_frames))))
            done_flags.append(
                np.pad(np.zeros((int(np.ceil(num_frames // self._factor)), )),
                       (0, max_decoder_length -
                        int(np.ceil(num_frames // self._factor))),
                       (0, max_decoder_length - int(
                           np.ceil(num_frames // self._factor))),
                       constant_values=1))
        text_sequences = np.array(text_sequences).astype(np.int64)
        lin_specs = np.transpose(np.array(lin_specs),

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import argparse
import ruamel.yaml

@ -22,11 +36,8 @@ if __name__ == "__main__":
    parser.add_argument("checkpoint", type=str, help="checkpoint to load.")
    parser.add_argument("text", type=str, help="text file to synthesize")
    parser.add_argument("output_path", type=str, help="path to save results")
    parser.add_argument("-g",
                        "--device",
                        type=int,
                        default=-1,
                        help="device to use")
    parser.add_argument(
        "-g", "--device", type=int, default=-1, help="device to use")

    args = parser.parse_args()
    with open(args.config, 'rt') as f:

@ -76,15 +87,14 @@ if __name__ == "__main__":
    window_ahead = model_config["window_ahead"]
    key_projection = model_config["key_projection"]
    value_projection = model_config["value_projection"]
    dv3 = make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
                     padding_idx, embedding_std, max_positions, n_vocab,
                     freeze_embedding, filter_size, encoder_channels,
                     n_mels, decoder_channels, r,
                     trainable_positional_encodings, use_memory_mask,
                     query_position_rate, key_position_rate,
                     window_backward, window_ahead, key_projection,
                     value_projection, downsample_factor, linear_dim,
                     use_decoder_states, converter_channels, dropout)
    dv3 = make_model(
        n_speakers, speaker_dim, speaker_embed_std, embed_dim, padding_idx,
        embedding_std, max_positions, n_vocab, freeze_embedding,
        filter_size, encoder_channels, n_mels, decoder_channels, r,
        trainable_positional_encodings, use_memory_mask,
        query_position_rate, key_position_rate, window_backward,
        window_ahead, key_projection, value_projection, downsample_factor,
        linear_dim, use_decoder_states, converter_channels, dropout)

    summary(dv3)
    state, _ = dg.load_dygraph(args.checkpoint)

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import argparse
import ruamel.yaml

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import numpy as np
from matplotlib import cm

@ -28,8 +42,9 @@ def make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
               converter_channels, dropout):
    """just a simple function to create a deepvoice 3 model"""
    if n_speakers > 1:
        spe = dg.Embedding((n_speakers, speaker_dim),
                           param_attr=I.Normal(scale=speaker_embed_std))
        spe = dg.Embedding(
            (n_speakers, speaker_dim),
            param_attr=I.Normal(scale=speaker_embed_std))
    else:
        spe = None

@ -45,17 +60,17 @@ def make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
        ConvSpec(h, k, 9),
        ConvSpec(h, k, 27),
        ConvSpec(h, k, 1),
        ConvSpec(h, k, 3),
    )
    enc = Encoder(n_vocab,
                  embed_dim,
                  n_speakers,
                  speaker_dim,
                  padding_idx=None,
                  embedding_weight_std=embedding_std,
                  convolutions=encoder_convolutions,
                  max_positions=max_positions,
                  dropout=dropout)
        ConvSpec(h, k, 3), )
    enc = Encoder(
        n_vocab,
        embed_dim,
        n_speakers,
        speaker_dim,
        padding_idx=None,
        embedding_weight_std=embedding_std,
        convolutions=encoder_convolutions,
        max_positions=max_positions,
        dropout=dropout)
    if freeze_embedding:
        freeze(enc.embed)

@ -66,28 +81,28 @@ def make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
        ConvSpec(h, k, 3),
        ConvSpec(h, k, 9),
        ConvSpec(h, k, 27),
        ConvSpec(h, k, 1),
    )
        ConvSpec(h, k, 1), )
    attention = [True, False, False, False, True]
    force_monotonic_attention = [True, False, False, False, True]
    dec = Decoder(n_speakers,
                  speaker_dim,
                  embed_dim,
                  mel_dim,
                  r=r,
                  max_positions=max_positions,
                  padding_idx=padding_idx,
                  preattention=prenet_convolutions,
                  convolutions=attentive_convolutions,
                  attention=attention,
                  dropout=dropout,
                  use_memory_mask=use_memory_mask,
                  force_monotonic_attention=force_monotonic_attention,
                  query_position_rate=query_position_rate,
                  key_position_rate=key_position_rate,
                  window_range=WindowRange(window_behind, window_ahead),
                  key_projection=key_projection,
                  value_projection=value_projection)
    dec = Decoder(
        n_speakers,
        speaker_dim,
        embed_dim,
        mel_dim,
        r=r,
        max_positions=max_positions,
        padding_idx=padding_idx,
        preattention=prenet_convolutions,
        convolutions=attentive_convolutions,
        attention=attention,
        dropout=dropout,
        use_memory_mask=use_memory_mask,
        force_monotonic_attention=force_monotonic_attention,
        query_position_rate=query_position_rate,
        key_position_rate=key_position_rate,
        window_range=WindowRange(window_behind, window_ahead),
        key_projection=key_projection,
        value_projection=value_projection)
    if not trainable_positional_encodings:
        freeze(dec.embed_keys_positions)
        freeze(dec.embed_query_positions)

@ -97,15 +112,15 @@ def make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
        ConvSpec(h, k, 1),
        ConvSpec(h, k, 3),
        ConvSpec(2 * h, k, 1),
        ConvSpec(2 * h, k, 3),
    )
    cvt = Converter(n_speakers,
                    speaker_dim,
                    dec.state_dim if use_decoder_states else mel_dim,
                    linear_dim,
                    time_upsampling=downsample_factor,
                    convolutions=postnet_convolutions,
                    dropout=dropout)
        ConvSpec(2 * h, k, 3), )
    cvt = Converter(
        n_speakers,
        speaker_dim,
        dec.state_dim if use_decoder_states else mel_dim,
        linear_dim,
        time_upsampling=downsample_factor,
        convolutions=postnet_convolutions,
        dropout=dropout)
    dv3 = DeepVoice3(enc, dec, cvt, spe, use_decoder_states)
    return dv3

@ -115,8 +130,10 @@ def eval_model(model, text, replace_pronounciation_prob, min_level_db,
               ref_level_db, power, n_iter, win_length, hop_length,
               preemphasis):
    """generate waveform from text using a deepvoice 3 model"""
    text = np.array(en.text_to_sequence(text, p=replace_pronounciation_prob),
                    dtype=np.int64)
    text = np.array(
        en.text_to_sequence(
            text, p=replace_pronounciation_prob),
        dtype=np.int64)
    length = len(text)
    print("text sequence's length: {}".format(length))
    text_positions = np.arange(1, 1 + length)

@ -145,10 +162,11 @@ def spec_to_waveform(spec, min_level_db, ref_level_db, power, n_iter,
    """
    denoramlized = np.clip(spec, 0, 1) * (-min_level_db) + min_level_db
    lin_scaled = np.exp((denoramlized + ref_level_db) / 20 * np.log(10))
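    # dB -> linear amplitude: exp(x / 20 * ln 10) == 10 ** (x / 20)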
    wav = librosa.griffinlim(lin_scaled**power,
                             n_iter=n_iter,
                             hop_length=hop_length,
                             win_length=win_length)
    wav = librosa.griffinlim(
        lin_scaled**power,
        n_iter=n_iter,
        hop_length=hop_length,
        win_length=win_length)
    if preemphasis > 0:
        wav = signal.lfilter([1.], [1., -preemphasis], wav)
    return wav

@ -225,28 +243,30 @@ def save_state(save_dir,
        plt.colorbar()
        plt.title("mel_input")
        plt.savefig(
            os.path.join(path,
                         "target_mel_spec_step{:09d}.png".format(global_step)))
            os.path.join(path, "target_mel_spec_step{:09d}.png".format(
                global_step)))
        plt.close()

        writer.add_image("target/mel_spec",
                         cm.viridis(mel_input),
                         global_step,
                         dataformats="HWC")
        writer.add_image(
            "target/mel_spec",
            cm.viridis(mel_input),
            global_step,
            dataformats="HWC")

        plt.figure(figsize=(10, 3))
        display.specshow(mel_output)
        plt.colorbar()
        plt.title("mel_output")
        plt.savefig(
            os.path.join(
                path, "predicted_mel_spec_step{:09d}.png".format(global_step)))
            os.path.join(path, "predicted_mel_spec_step{:09d}.png".format(
                global_step)))
        plt.close()

        writer.add_image("predicted/mel_spec",
                         cm.viridis(mel_output),
                         global_step,
                         dataformats="HWC")
        writer.add_image(
            "predicted/mel_spec",
            cm.viridis(mel_output),
            global_step,
            dataformats="HWC")

    if lin_input is not None and lin_output is not None:
        lin_input = lin_input[0].numpy().T

@ -258,28 +278,30 @@ def save_state(save_dir,
        plt.colorbar()
        plt.title("lin_input")
        plt.savefig(
            os.path.join(path,
                         "target_lin_spec_step{:09d}.png".format(global_step)))
            os.path.join(path, "target_lin_spec_step{:09d}.png".format(
                global_step)))
        plt.close()

        writer.add_image("target/lin_spec",
                         cm.viridis(lin_input),
                         global_step,
                         dataformats="HWC")
        writer.add_image(
            "target/lin_spec",
            cm.viridis(lin_input),
            global_step,
            dataformats="HWC")

        plt.figure(figsize=(10, 3))
        display.specshow(lin_output)
        plt.colorbar()
        plt.title("lin_output")
        plt.savefig(
            os.path.join(
                path, "predicted_lin_spec_step{:09d}.png".format(global_step)))
            os.path.join(path, "predicted_lin_spec_step{:09d}.png".format(
                global_step)))
        plt.close()

        writer.add_image("predicted/lin_spec",
                         cm.viridis(lin_output),
                         global_step,
                         dataformats="HWC")
        writer.add_image(
            "predicted/lin_spec",
            cm.viridis(lin_output),
            global_step,
            dataformats="HWC")

    if alignments is not None and len(alignments.shape) == 4:
        path = os.path.join(save_dir, "alignments")

@ -290,10 +312,11 @@ def save_state(save_dir,
                "train_attn_layer_{}_step_{}.png".format(idx, global_step))
            plot_alignment(attn_layer, save_path)

            writer.add_image("train_attn/layer_{}".format(idx),
                             cm.viridis(attn_layer),
                             global_step,
                             dataformats="HWC")
            writer.add_image(
                "train_attn/layer_{}".format(idx),
                cm.viridis(attn_layer),
                global_step,
                dataformats="HWC")

    if lin_output is not None:
        wav = spec_to_waveform(lin_output, min_level_db, ref_level_db, power,

@ -302,7 +325,5 @@ def save_state(save_dir,
        save_path = os.path.join(
            path, "train_sample_step_{:09d}.wav".format(global_step))
        sf.write(save_path, wav, sample_rate)
        writer.add_audio("train_sample",
                         wav,
                         global_step,
                         sample_rate=sample_rate)
        writer.add_audio(
            "train_sample", wav, global_step, sample_rate=sample_rate)

@ -57,7 +57,7 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py

If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--fastspeech_step``

For more help on arguments:
``python train.py --help``.

## Synthesis

@ -75,5 +75,5 @@ or you can run the script file directly.
sh synthesis.sh
```

For more help on arguments:
``python synthesis.py --help``.

@ -1,39 +1,96 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse


def add_config_options_to_parser(parser):
    parser.add_argument('--config_path', type=str, default='config/fastspeech.yaml',
    parser.add_argument(
        '--config_path',
        type=str,
        default='config/fastspeech.yaml',
        help="the yaml config file path.")
    parser.add_argument('--batch_size', type=int, default=32,
                        help="batch size for training.")
    parser.add_argument('--epochs', type=int, default=10000,
    parser.add_argument(
        '--batch_size', type=int, default=32, help="batch size for training.")
    parser.add_argument(
        '--epochs',
        type=int,
        default=10000,
        help="the number of epochs for training.")
    parser.add_argument('--lr', type=float, default=0.001,
    parser.add_argument(
        '--lr',
        type=float,
        default=0.001,
        help="the learning rate for training.")
    parser.add_argument('--save_step', type=int, default=500,
    parser.add_argument(
        '--save_step',
        type=int,
        default=500,
        help="checkpointing interval during training.")
    parser.add_argument('--fastspeech_step', type=int, default=70000,
    parser.add_argument(
        '--fastspeech_step',
        type=int,
        default=70000,
        help="Global step to restore checkpoint of fastspeech.")
    parser.add_argument('--use_gpu', type=int, default=1,
    parser.add_argument(
        '--use_gpu',
        type=int,
        default=1,
        help="use gpu or not during training.")
    parser.add_argument('--use_data_parallel', type=int, default=0,
    parser.add_argument(
        '--use_data_parallel',
        type=int,
        default=0,
        help="use data parallel or not during training.")
    parser.add_argument('--alpha', type=float, default=1.0,
    parser.add_argument(
        '--alpha',
        type=float,
        default=1.0,
        help="The hyperparameter to determine the length of the expanded sequence \
            mel, thereby controlling the voice speed.")

    parser.add_argument('--data_path', type=str, default='./dataset/LJSpeech-1.1',
    parser.add_argument(
        '--data_path',
        type=str,
        default='./dataset/LJSpeech-1.1',
        help="the path of dataset.")
    parser.add_argument('--checkpoint_path', type=str, default=None,
    parser.add_argument(
        '--checkpoint_path',
        type=str,
        default=None,
        help="the path to load checkpoint or pretrain model.")
    parser.add_argument('--save_path', type=str, default='./checkpoint',
    parser.add_argument(
        '--save_path',
        type=str,
        default='./checkpoint',
        help="the path to save checkpoint.")
    parser.add_argument('--log_dir', type=str, default='./log',
    parser.add_argument(
        '--log_dir',
        type=str,
        default='./log',
        help="the directory to save tensorboard log.")
    parser.add_argument('--sample_path', type=str, default='./sample',
    parser.add_argument(
        '--sample_path',
        type=str,
        default='./sample',
        help="the directory to save audio sample in synthesis.")
    parser.add_argument('--transtts_path', type=str, default='./log',
    parser.add_argument(
        '--transtts_path',
        type=str,
        default='./log',
        help="the directory to load pretrain transformerTTS model.")
    parser.add_argument('--transformer_step', type=int, default=160000,
    parser.add_argument(
        '--transformer_step',
        type=int,
        default=160000,
        help="the step to load transformerTTS model.")

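For example, to slow down the synthesized speech by stretching the predicted mel sequence (a hypothetical invocation; the flags are taken from the parser above):

```bash
# alpha > 1.0 expands the mel sequence (slower speech); alpha < 1.0 compresses it
python synthesis.py --config_path=config/fastspeech.yaml --alpha=1.2 \
    --checkpoint_path=./checkpoint --fastspeech_step=70000
```
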
@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from tensorboardX import SummaryWriter
from collections import OrderedDict

@@ -13,6 +26,7 @@ from parakeet import audio
from parakeet.models.fastspeech.fastspeech import FastSpeech
from parakeet.models.transformer_tts.utils import *


def load_checkpoint(step, model_path):
    model_dict, _ = fluid.dygraph.load_dygraph(os.path.join(model_path, step))
    new_state_dict = OrderedDict()

@@ -23,13 +37,14 @@ def load_checkpoint(step, model_path):
        new_state_dict[param] = model_dict[param]
    return new_state_dict


def synthesis(text_input, args):
    place = (fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace())

    # tensorboard
    if not os.path.exists(args.log_dir):
        os.mkdir(args.log_dir)
    path = os.path.join(args.log_dir, 'synthesis')

    with open(args.config_path) as f:
        cfg = yaml.load(f, Loader=yaml.Loader)

@@ -38,35 +53,42 @@ def synthesis(text_input, args):

    with dg.guard(place):
        model = FastSpeech(cfg)
        model.set_dict(
            load_checkpoint(
                str(args.fastspeech_step),
                os.path.join(args.checkpoint_path, "fastspeech")))
        model.eval()

        text = np.asarray(text_to_sequence(text_input))
        text = np.expand_dims(text, axis=0)
        pos_text = np.arange(1, text.shape[1] + 1)
        pos_text = np.expand_dims(pos_text, axis=0)
        enc_non_pad_mask = get_non_pad_mask(pos_text).astype(np.float32)
        enc_slf_attn_mask = get_attn_key_pad_mask(pos_text,
                                                  text).astype(np.float32)

        text = dg.to_variable(text)
        pos_text = dg.to_variable(pos_text)
        enc_non_pad_mask = dg.to_variable(enc_non_pad_mask)
        enc_slf_attn_mask = dg.to_variable(enc_slf_attn_mask)

        mel_output, mel_output_postnet = model(
            text,
            pos_text,
            alpha=args.alpha,
            enc_non_pad_mask=enc_non_pad_mask,
            enc_slf_attn_mask=enc_slf_attn_mask,
            dec_non_pad_mask=None,
            dec_slf_attn_mask=None)

        _ljspeech_processor = audio.AudioProcessor(
            sample_rate=cfg['audio']['sr'],
            num_mels=cfg['audio']['num_mels'],
            min_level_db=cfg['audio']['min_level_db'],
            ref_level_db=cfg['audio']['ref_level_db'],
            n_fft=cfg['audio']['n_fft'],
            win_length=cfg['audio']['win_length'],
            hop_length=cfg['audio']['hop_length'],
            power=cfg['audio']['power'],
            preemphasis=cfg['audio']['preemphasis'],
            signal_norm=True,

@@ -79,14 +101,17 @@ def synthesis(text_input, args):
            do_trim_silence=False,
            sound_norm=False)

        mel_output_postnet = fluid.layers.transpose(
            fluid.layers.squeeze(mel_output_postnet, [0]), [1, 0])
        wav = _ljspeech_processor.inv_melspectrogram(mel_output_postnet.numpy())
        writer.add_audio(text_input, wav, 0, cfg['audio']['sr'])
        print("Synthesis completed !!!")
        writer.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Train Fastspeech model")
    add_config_options_to_parser(parser)
    args = parser.parse_args()
    synthesis("Transformer model is so fast!", args)

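A possible invocation of the synthesis script above, using the flags defined in the FastSpeech argument parser (step number and paths here are illustrative, not values from the repo):

```bash
# Hypothetical example: synthesize with a FastSpeech checkpoint at step 70000,
# using --alpha to control the voice speed (1.0 = normal speed).
python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint_path=./checkpoint \
    --fastspeech_step=70000
```
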
@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import argparse
import os

@@ -21,8 +34,10 @@ import sys
sys.path.append("../transformer_tts")
from data import LJSpeechLoader


def load_checkpoint(step, model_path):
    model_dict, opti_dict = fluid.dygraph.load_dygraph(
        os.path.join(model_path, step))
    new_state_dict = OrderedDict()
    for param in model_dict:
        if param.startswith('_layers.'):

@@ -31,6 +46,7 @@ def load_checkpoint(step, model_path):
        new_state_dict[param] = model_dict[param]
    return new_state_dict, opti_dict


def main(args):
    local_rank = dg.parallel.Env().local_rank if args.use_data_parallel else 0
    nranks = dg.parallel.Env().nranks if args.use_data_parallel else 1

@@ -44,26 +60,33 @@ def main(args):
             if args.use_gpu else fluid.CPUPlace())

    if not os.path.exists(args.log_dir):
        os.mkdir(args.log_dir)
    path = os.path.join(args.log_dir, 'fastspeech')

    writer = SummaryWriter(path) if local_rank == 0 else None

    with dg.guard(place):
        with fluid.unique_name.guard():
            transformerTTS = TransformerTTS(cfg)
            model_dict, _ = load_checkpoint(
                str(args.transformer_step),
                os.path.join(args.transtts_path, "transformer"))
            transformerTTS.set_dict(model_dict)
            transformerTTS.eval()

        model = FastSpeech(cfg)
        model.train()
        optimizer = fluid.optimizer.AdamOptimizer(
            learning_rate=dg.NoamDecay(1 / (
                cfg['warm_up_step'] * (args.lr**2)), cfg['warm_up_step']),
            parameter_list=model.parameters())
        reader = LJSpeechLoader(
            cfg, args, nranks, local_rank, shuffle=True).reader()

        if args.checkpoint_path is not None:
            model_dict, opti_dict = load_checkpoint(
                str(args.fastspeech_step),
                os.path.join(args.checkpoint_path, "fastspeech"))
            model.set_dict(model_dict)
            optimizer.set_dict(opti_dict)
            global_step = args.fastspeech_step

@@ -77,45 +100,66 @@ def main(args):
            pbar = tqdm(reader)

            for i, data in enumerate(pbar):
                pbar.set_description('Processing at epoch %d' % epoch)
                (character, mel, mel_input, pos_text, pos_mel, text_length,
                 mel_lens, enc_slf_mask, enc_query_mask, dec_slf_mask,
                 enc_dec_mask, dec_query_slf_mask, dec_query_mask) = data

                _, _, attn_probs, _, _, _ = transformerTTS(
                    character,
                    mel_input,
                    pos_text,
                    pos_mel,
                    dec_slf_mask=dec_slf_mask,
                    enc_slf_mask=enc_slf_mask,
                    enc_query_mask=enc_query_mask,
                    enc_dec_mask=enc_dec_mask,
                    dec_query_slf_mask=dec_query_slf_mask,
                    dec_query_mask=dec_query_mask)
                alignment, max_attn = get_alignment(attn_probs, mel_lens,
                                                    cfg['transformer_head'])
                alignment = dg.to_variable(alignment).astype(np.float32)

                if local_rank == 0 and global_step % 5 == 1:
                    x = np.uint8(
                        cm.viridis(max_attn[8, :mel_lens.numpy()[8]]) * 255)
                    writer.add_image(
                        'Attention_%d_0' % global_step,
                        x,
                        0,
                        dataformats="HWC")

                global_step += 1

                # Forward
                result = model(
                    character,
                    pos_text,
                    mel_pos=pos_mel,
                    length_target=alignment,
                    enc_non_pad_mask=enc_query_mask,
                    enc_slf_attn_mask=enc_slf_mask,
                    dec_non_pad_mask=dec_query_slf_mask,
                    dec_slf_attn_mask=dec_slf_mask)
                mel_output, mel_output_postnet, duration_predictor_output, _, _ = result
                mel_loss = layers.mse_loss(mel_output, mel)
                mel_postnet_loss = layers.mse_loss(mel_output_postnet, mel)
                duration_loss = layers.mean(
                    layers.abs(
                        layers.elementwise_sub(duration_predictor_output,
                                               alignment)))
                total_loss = mel_loss + mel_postnet_loss + duration_loss

                if local_rank == 0:
                    writer.add_scalar('mel_loss', mel_loss.numpy(), global_step)
                    writer.add_scalar('post_mel_loss',
                                      mel_postnet_loss.numpy(), global_step)
                    writer.add_scalar('duration_loss',
                                      duration_loss.numpy(), global_step)
                    writer.add_scalar('learning_rate',
                                      optimizer._learning_rate.step().numpy(),
                                      global_step)

                if args.use_data_parallel:
                    total_loss = model.scale_loss(total_loss)

@@ -123,21 +167,25 @@ def main(args):
                    model.apply_collective_grads()
                else:
                    total_loss.backward()
                optimizer.minimize(
                    total_loss,
                    grad_clip=fluid.dygraph_grad_clip.GradClipByGlobalNorm(cfg[
                        'grad_clip_thresh']))
                model.clear_gradients()

                # save checkpoint
                if local_rank == 0 and global_step % args.save_step == 0:
                    if not os.path.exists(args.save_path):
                        os.mkdir(args.save_path)
                    save_path = os.path.join(args.save_path,
                                             'fastspeech/%d' % global_step)
                    dg.save_dygraph(model.state_dict(), save_path)
                    dg.save_dygraph(optimizer.state_dict(), save_path)
    if local_rank == 0:
        writer.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Train Fastspeech model")
    add_config_options_to_parser(parser)
    args = parser.parse_args()

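A quick check on the NoamDecay arguments used above (a sketch, assuming Paddle's documented formula lr = d_model**-0.5 * min(step**-0.5, step * warmup**-1.5); the warm-up value 4000 below is illustrative, not the repo's config): choosing d_model = 1 / (warm_up_step * lr**2) makes the schedule peak at exactly args.lr after warm_up_step steps.

```python
# Sketch: why NoamDecay(1 / (warmup * lr**2), warmup) peaks at the target lr.
def noam_lr(step, warmup=4000, lr=0.001):
    d_model = 1 / (warmup * lr**2)
    return d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)

# At step == warmup both branches of min() meet, and
# (d_model * warmup)**-0.5 == lr, i.e. the peak learning rate is args.lr.
assert abs(noam_lr(4000) - 0.001) < 1e-9
```
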
@@ -50,7 +50,7 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr

If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--transformer_step``.

For more help on arguments:
``python train_transformer.py --help``.

## Train Vocoder

@@ -78,7 +78,7 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
```
If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--vocoder_step``.

For more help on arguments:
``python train_vocoder.py --help``.

## Synthesis

@@ -101,5 +101,5 @@ sh synthesis.sh

And the audio file will be saved in ``--sample_path``.

For more help on arguments:
``python synthesis.py --help``.

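A concrete resume invocation might look like this (the step number and paths are illustrative; the flags themselves are the ones defined in the argument parser):

```bash
# Hypothetical example: resume TransformerTTS training from step 120000.
python train_transformer.py \
    --use_gpu=1 \
    --data_path=./dataset/LJSpeech-1.1 \
    --checkpoint_path=./checkpoint \
    --transformer_step=120000
```
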
@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
import numpy as np
import pandas as pd

@@ -13,8 +26,15 @@ from parakeet.data.batch import TextIDBatcher, SpecBatcher
from parakeet.data.dataset import DatasetMixin, TransformDataset, CacheDataset
from parakeet.models.transformer_tts.utils import *


class LJSpeechLoader:
    def __init__(self,
                 config,
                 args,
                 nranks,
                 rank,
                 is_vocoder=False,
                 shuffle=True):
        place = fluid.CUDAPlace(rank) if args.use_gpu else fluid.CPUPlace()

        LJSPEECH_ROOT = Path(args.data_path)

@@ -23,15 +43,28 @@ class LJSpeechLoader:
        dataset = TransformDataset(metadata, transformer)
        dataset = CacheDataset(dataset)

        sampler = DistributedSampler(
            len(metadata), nranks, rank, shuffle=shuffle)

        assert args.batch_size % nranks == 0
        each_bs = args.batch_size // nranks
        if is_vocoder:
            dataloader = DataCargo(
                dataset,
                sampler=sampler,
                batch_size=each_bs,
                shuffle=shuffle,
                batch_fn=batch_examples_vocoder,
                drop_last=True)
        else:
            dataloader = DataCargo(
                dataset,
                sampler=sampler,
                batch_size=each_bs,
                shuffle=shuffle,
                batch_fn=batch_examples,
                drop_last=True)

        self.reader = fluid.io.DataLoader.from_generator(
            capacity=32,
            iterable=True,

@@ -66,13 +99,13 @@ class LJSpeech(object):
        super(LJSpeech, self).__init__()
        self.config = config
        self._ljspeech_processor = audio.AudioProcessor(
            sample_rate=config['audio']['sr'],
            num_mels=config['audio']['num_mels'],
            min_level_db=config['audio']['min_level_db'],
            ref_level_db=config['audio']['ref_level_db'],
            n_fft=config['audio']['n_fft'],
            win_length=config['audio']['win_length'],
            hop_length=config['audio']['hop_length'],
            power=config['audio']['power'],
            preemphasis=config['audio']['preemphasis'],
            signal_norm=True,

@@ -84,7 +117,7 @@ class LJSpeech(object):
            griffin_lim_iters=60,
            do_trim_silence=False,
            sound_norm=False)

    def __call__(self, metadatum):
        """All the code for generating an Example from a metadatum. If you want a
        different preprocessing pipeline, you can override this method.

@@ -93,13 +126,15 @@ class LJSpeech(object):
        method.
        """
        fname, raw_text, normalized_text = metadatum

        # load -> trim -> preemphasis -> stft -> magnitude -> mel_scale -> logscale -> normalize
        wav = self._ljspeech_processor.load_wav(str(fname))
        mag = self._ljspeech_processor.spectrogram(wav).astype(np.float32)
        mel = self._ljspeech_processor.melspectrogram(wav).astype(np.float32)
        phonemes = np.array(
            g2p.en.text_to_sequence(normalized_text), dtype=np.int64)
        return (mag, mel, phonemes
                )  # maybe we need to implement it as a map in the future


def batch_examples(batch):

@@ -112,52 +147,81 @@ def batch_examples(batch):
    pos_mels = []
    for data in batch:
        _, mel, text = data
        mel_inputs.append(
            np.concatenate(
                [np.zeros([mel.shape[0], 1], np.float32), mel[:, :-1]],
                axis=-1))
        mel_lens.append(mel.shape[1])
        text_lens.append(len(text))
        pos_texts.append(np.arange(1, len(text) + 1))
        pos_mels.append(np.arange(1, mel.shape[1] + 1))
        mels.append(mel)
        texts.append(text)

    # Sort by text_len in descending order
    texts = [
        i
        for i, _ in sorted(
            zip(texts, text_lens), key=lambda x: x[1], reverse=True)
    ]
    mels = [
        i
        for i, _ in sorted(
            zip(mels, text_lens), key=lambda x: x[1], reverse=True)
    ]
    mel_inputs = [
        i
        for i, _ in sorted(
            zip(mel_inputs, text_lens), key=lambda x: x[1], reverse=True)
    ]
    mel_lens = [
        i
        for i, _ in sorted(
            zip(mel_lens, text_lens), key=lambda x: x[1], reverse=True)
    ]
    pos_texts = [
        i
        for i, _ in sorted(
            zip(pos_texts, text_lens), key=lambda x: x[1], reverse=True)
    ]
    pos_mels = [
        i
        for i, _ in sorted(
            zip(pos_mels, text_lens), key=lambda x: x[1], reverse=True)
    ]
    text_lens = sorted(text_lens, reverse=True)

    # Pad sequence with largest len of the batch
    texts = TextIDBatcher(pad_id=0)(texts)  #(B, T)
    pos_texts = TextIDBatcher(pad_id=0)(pos_texts)  #(B,T)
    pos_mels = TextIDBatcher(pad_id=0)(pos_mels)  #(B,T)
    mels = np.transpose(
        SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1))  #(B,T,num_mels)
    mel_inputs = np.transpose(
        SpecBatcher(pad_value=0.)(mel_inputs), axes=(0, 2, 1))  #(B,T,num_mels)
    enc_slf_mask = get_attn_key_pad_mask(pos_texts, texts).astype(np.float32)
    enc_query_mask = get_non_pad_mask(pos_texts).astype(np.float32)
    dec_slf_mask = get_dec_attn_key_pad_mask(pos_mels,
                                             mel_inputs).astype(np.float32)
    enc_dec_mask = get_attn_key_pad_mask(enc_query_mask[:, :, 0],
                                         mel_inputs).astype(np.float32)
    dec_query_slf_mask = get_non_pad_mask(pos_mels).astype(np.float32)
    dec_query_mask = get_non_pad_mask(pos_mels).astype(np.float32)

    return (texts, mels, mel_inputs, pos_texts, pos_mels, np.array(text_lens),
            np.array(mel_lens), enc_slf_mask, enc_query_mask, dec_slf_mask,
            enc_dec_mask, dec_query_slf_mask, dec_query_mask)


def batch_examples_vocoder(batch):
    mels = []
    mags = []
    for data in batch:
        mag, mel, _ = data
        mels.append(mel)
        mags.append(mag)

    mels = np.transpose(SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1))
    mags = np.transpose(SpecBatcher(pad_value=0.)(mags), axes=(0, 2, 1))

    return (mels, mags)

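The two mask families built in `batch_examples` above have simple shapes; here is a minimal NumPy illustration of the convention (a sketch only — position index 0 marks padding — not the actual implementation in parakeet.models.transformer_tts.utils):

```python
import numpy as np

pos = np.array([[1, 2, 3, 0, 0]])  # padded position ids, 0 == padding

# non-pad ("query") mask: shape (B, T, 1); zeros out outputs at padded steps
non_pad_mask = (pos != 0).astype(np.float32)[:, :, None]

# attention key-pad mask: shape (B, T_q, T_k); marks padded *keys* for every query
T = pos.shape[1]
attn_key_pad_mask = np.repeat((pos == 0)[:, None, :], T, axis=1).astype(np.float32)

print(non_pad_mask.shape, attn_key_pad_mask.shape)  # (1, 5, 1) (1, 5, 5)
```
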
@@ -1,38 +1,100 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse


def add_config_options_to_parser(parser):
    parser.add_argument(
        '--config_path',
        type=str,
        default='config/train_transformer.yaml',
        help="the yaml config file path.")
    parser.add_argument(
        '--batch_size', type=int, default=32, help="batch size for training.")
    parser.add_argument(
        '--epochs',
        type=int,
        default=10000,
        help="the number of epoch for training.")
    parser.add_argument(
        '--lr',
        type=float,
        default=0.001,
        help="the learning rate for training.")
    parser.add_argument(
        '--save_step',
        type=int,
        default=500,
        help="checkpointing interval during training.")
    parser.add_argument(
        '--image_step',
        type=int,
        default=2000,
        help="attention image interval during training.")
    parser.add_argument(
        '--max_len',
        type=int,
        default=400,
        help="The max length of audio when synthesis.")
    parser.add_argument(
        '--transformer_step',
        type=int,
        default=160000,
        help="Global step to restore checkpoint of transformer.")
    parser.add_argument(
        '--vocoder_step',
        type=int,
        default=90000,
        help="Global step to restore checkpoint of postnet.")
    parser.add_argument(
        '--use_gpu',
        type=int,
        default=1,
        help="use gpu or not during training.")
    parser.add_argument(
        '--use_data_parallel',
        type=int,
        default=0,
        help="use data parallel or not during training.")
    parser.add_argument(
        '--stop_token',
        type=int,
        default=0,
        help="use stop token loss in network or not.")

    parser.add_argument(
        '--data_path',
        type=str,
        default='./dataset/LJSpeech-1.1',
        help="the path of dataset.")
    parser.add_argument(
        '--checkpoint_path',
        type=str,
        default=None,
        help="the path to load checkpoint or pretrain model.")
    parser.add_argument(
        '--save_path',
        type=str,
        default='./checkpoint',
        help="the path to save checkpoint.")
    parser.add_argument(
        '--log_dir',
        type=str,
        default='./log',
        help="the directory to save tensorboard log.")
    parser.add_argument(
        '--sample_path',
        type=str,
        default='./sample',
        help="the directory to save audio sample in synthesis.")

@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from scipy.io.wavfile import write
from parakeet.g2p.en import text_to_sequence

@@ -18,6 +31,7 @@ from parakeet import audio
from parakeet.models.transformer_tts.vocoder import Vocoder
from parakeet.models.transformer_tts.transformer_tts import TransformerTTS


def load_checkpoint(step, model_path):
    model_dict, _ = fluid.dygraph.load_dygraph(os.path.join(model_path, step))
    new_state_dict = OrderedDict()

@@ -28,6 +42,7 @@ def load_checkpoint(step, model_path):
        new_state_dict[param] = model_dict[param]
    return new_state_dict


def synthesis(text_input, args):
    place = (fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace())

@@ -36,48 +51,57 @@ def synthesis(text_input, args):

    # tensorboard
    if not os.path.exists(args.log_dir):
        os.mkdir(args.log_dir)
    path = os.path.join(args.log_dir, 'synthesis')

    writer = SummaryWriter(path)

    with dg.guard(place):
        with fluid.unique_name.guard():
            model = TransformerTTS(cfg)
            model.set_dict(
                load_checkpoint(
                    str(args.transformer_step),
                    os.path.join(args.checkpoint_path, "transformer")))
            model.eval()

        with fluid.unique_name.guard():
            model_vocoder = Vocoder(cfg, args.batch_size)
            model_vocoder.set_dict(
                load_checkpoint(
                    str(args.vocoder_step),
                    os.path.join(args.checkpoint_path, "vocoder")))
            model_vocoder.eval()
        # init input
        text = np.asarray(text_to_sequence(text_input))
        text = fluid.layers.unsqueeze(dg.to_variable(text), [0])
        mel_input = dg.to_variable(np.zeros([1, 1, 80])).astype(np.float32)
        pos_text = np.arange(1, text.shape[1] + 1)
        pos_text = fluid.layers.unsqueeze(dg.to_variable(pos_text), [0])

        pbar = tqdm(range(args.max_len))
        for i in pbar:
            dec_slf_mask = get_triu_tensor(
                mel_input.numpy(), mel_input.numpy()).astype(np.float32)
            dec_slf_mask = fluid.layers.cast(
                dg.to_variable(dec_slf_mask == 0), np.float32)
            pos_mel = np.arange(1, mel_input.shape[1] + 1)
            pos_mel = fluid.layers.unsqueeze(dg.to_variable(pos_mel), [0])
            mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
                text, mel_input, pos_text, pos_mel, dec_slf_mask)
            mel_input = fluid.layers.concat(
                [mel_input, postnet_pred[:, -1:, :]], axis=1)

        mag_pred = model_vocoder(postnet_pred)

        _ljspeech_processor = audio.AudioProcessor(
            sample_rate=cfg['audio']['sr'],
            num_mels=cfg['audio']['num_mels'],
            min_level_db=cfg['audio']['min_level_db'],
            ref_level_db=cfg['audio']['ref_level_db'],
            n_fft=cfg['audio']['n_fft'],
            win_length=cfg['audio']['win_length'],
            hop_length=cfg['audio']['hop_length'],
            power=cfg['audio']['power'],
            preemphasis=cfg['audio']['preemphasis'],
            signal_norm=True,

@@ -90,30 +114,49 @@ def synthesis(text_input, args):
            do_trim_silence=False,
            sound_norm=False)

        wav = _ljspeech_processor.inv_spectrogram(
            fluid.layers.transpose(
                fluid.layers.squeeze(mag_pred, [0]), [1, 0]).numpy())
        global_step = 0
        for i, prob in enumerate(attn_probs):
            for j in range(4):
                x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
                writer.add_image(
                    'Attention_%d_0' % global_step,
                    x,
                    i * 4 + j,
                    dataformats="HWC")

        for i, prob in enumerate(attn_enc):
            for j in range(4):
                x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
                writer.add_image(
                    'Attention_enc_%d_0' % global_step,
                    x,
                    i * 4 + j,
                    dataformats="HWC")

        for i, prob in enumerate(attn_dec):
            for j in range(4):
                x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
                writer.add_image(
                    'Attention_dec_%d_0' % global_step,
                    x,
                    i * 4 + j,
                    dataformats="HWC")
        writer.add_audio(text_input, wav, 0, cfg['audio']['sr'])
        if not os.path.exists(args.sample_path):
            os.mkdir(args.sample_path)
        write(
            os.path.join(args.sample_path, 'test.wav'), cfg['audio']['sr'],
            wav)
        writer.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Synthesis model")
    add_config_options_to_parser(parser)
    args = parser.parse_args()
    synthesis(
        "They emphasized the necessity that the information now being furnished be handled with judgment and care.",
        args)

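The decoding loop above rebuilds `dec_slf_mask` at every step. Assuming `get_triu_tensor` returns the strictly upper triangle over the current decoded length (an assumption, not verified against the utils module), the resulting causal mask can be sketched as:

```python
import numpy as np

T = 4                                          # frames decoded so far
triu = np.triu(np.ones([T, T]), k=1)           # 1 above the diagonal == future frames
dec_slf_mask = (triu == 0).astype(np.float32)  # 1 == this key may be attended
print(dec_slf_mask)                            # lower-triangular matrix of ones
```
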
@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from tqdm import tqdm
from tensorboardX import SummaryWriter

@@ -16,8 +29,10 @@ from parakeet.models.transformer_tts.utils import cross_entropy
from data import LJSpeechLoader
from parakeet.models.transformer_tts.transformer_tts import TransformerTTS


def load_checkpoint(step, model_path):
    model_dict, opti_dict = fluid.dygraph.load_dygraph(
        os.path.join(model_path, step))
    new_state_dict = OrderedDict()
    for param in model_dict:
        if param.startswith('_layers.'):

@@ -40,22 +55,27 @@ def main(args):
             if args.use_gpu else fluid.CPUPlace())

    if not os.path.exists(args.log_dir):
        os.mkdir(args.log_dir)
    path = os.path.join(args.log_dir, 'transformer')

    writer = SummaryWriter(path) if local_rank == 0 else None

    with dg.guard(place):
        model = TransformerTTS(cfg)

        model.train()
        optimizer = fluid.optimizer.AdamOptimizer(
            learning_rate=dg.NoamDecay(1 / (
                cfg['warm_up_step'] * (args.lr**2)), cfg['warm_up_step']),
            parameter_list=model.parameters())

        reader = LJSpeechLoader(
            cfg, args, nranks, local_rank, shuffle=True).reader()

        if args.checkpoint_path is not None:
            model_dict, opti_dict = load_checkpoint(
                str(args.transformer_step),
                os.path.join(args.checkpoint_path, "transformer"))
            model.set_dict(model_dict)
            optimizer.set_dict(opti_dict)
            global_step = args.transformer_step

@@ -64,93 +84,122 @@ def main(args):
        if args.use_data_parallel:
            strategy = dg.parallel.prepare_context()
            model = fluid.dygraph.parallel.DataParallel(model, strategy)

        for epoch in range(args.epochs):
            pbar = tqdm(reader)
            for i, data in enumerate(pbar):
                pbar.set_description('Processing at epoch %d' % epoch)
                character, mel, mel_input, pos_text, pos_mel, text_length, _, enc_slf_mask, enc_query_mask, dec_slf_mask, enc_dec_mask, dec_query_slf_mask, dec_query_mask = data

                global_step += 1

                mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
                    character,
                    mel_input,
                    pos_text,
                    pos_mel,
                    dec_slf_mask=dec_slf_mask,
                    enc_slf_mask=enc_slf_mask,
                    enc_query_mask=enc_query_mask,
                    enc_dec_mask=enc_dec_mask,
                    dec_query_slf_mask=dec_query_slf_mask,
                    dec_query_mask=dec_query_mask)

                mel_loss = layers.mean(
                    layers.abs(layers.elementwise_sub(mel_pred, mel)))
                post_mel_loss = layers.mean(
                    layers.abs(layers.elementwise_sub(postnet_pred, mel)))
                loss = mel_loss + post_mel_loss

                # Note: when the stop token loss was used, training did not work.
                if args.stop_token:
                    label = (pos_mel == 0).astype(np.float32)
                    stop_loss = cross_entropy(stop_preds, label)
                    loss = loss + stop_loss

                if local_rank == 0:
                    writer.add_scalars('training_loss', {
                        'mel_loss': mel_loss.numpy(),
                        'post_mel_loss': post_mel_loss.numpy()
                    }, global_step)

                    if args.stop_token:
                        writer.add_scalar('stop_loss',
                                          stop_loss.numpy(), global_step)

                    if args.use_data_parallel:
                        writer.add_scalars('alphas', {
                            'encoder_alpha':
                            model._layers.encoder.alpha.numpy(),
                            'decoder_alpha':
                            model._layers.decoder.alpha.numpy(),
                        }, global_step)
                    else:
                        writer.add_scalars('alphas', {
                            'encoder_alpha': model.encoder.alpha.numpy(),
                            'decoder_alpha': model.decoder.alpha.numpy(),
                        }, global_step)

                    writer.add_scalar('learning_rate',
                                      optimizer._learning_rate.step().numpy(),
                                      global_step)

                    if global_step % args.image_step == 1:
                        for i, prob in enumerate(attn_probs):
                            for j in range(4):
                                x = np.uint8(
                                    cm.viridis(prob.numpy()[j * 16]) * 255)
                                writer.add_image(
                                    'Attention_%d_0' % global_step,
                                    x,
                                    i * 4 + j,
                                    dataformats="HWC")

                        for i, prob in enumerate(attn_enc):
                            for j in range(4):
                                x = np.uint8(
                                    cm.viridis(prob.numpy()[j * 16]) * 255)
                                writer.add_image(
                                    'Attention_enc_%d_0' % global_step,
                                    x,
                                    i * 4 + j,
                                    dataformats="HWC")

                        for i, prob in enumerate(attn_dec):
                            for j in range(4):
                                x = np.uint8(
                                    cm.viridis(prob.numpy()[j * 16]) * 255)
                                writer.add_image(
                                    'Attention_dec_%d_0' % global_step,
                                    x,
                                    i * 4 + j,
                                    dataformats="HWC")

                if args.use_data_parallel:
                    loss = model.scale_loss(loss)
                    loss.backward()
                    model.apply_collective_grads()
                else:
                    loss.backward()
                optimizer.minimize(
                    loss,
                    grad_clip=fluid.dygraph_grad_clip.GradClipByGlobalNorm(cfg[
                        'grad_clip_thresh']))
                model.clear_gradients()

                # save checkpoint
                if local_rank == 0 and global_step % args.save_step == 0:
                    if not os.path.exists(args.save_path):
                        os.mkdir(args.save_path)
                    save_path = os.path.join(args.save_path,
                                             'transformer/%d' % global_step)
                    dg.save_dygraph(model.state_dict(), save_path)
                    dg.save_dygraph(optimizer.state_dict(), save_path)
    if local_rank == 0:
        writer.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Train TransformerTTS model")
    add_config_options_to_parser(parser)

@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from tensorboardX import SummaryWriter
import os
from tqdm import tqdm

@@ -13,6 +26,7 @@ import paddle.fluid.layers as layers
from data import LJSpeechLoader
from parakeet.models.transformer_tts.vocoder import Vocoder


def load_checkpoint(step, model_path):
    model_dict, opti_dict = dg.load_dygraph(os.path.join(model_path, step))
    new_state_dict = OrderedDict()

@@ -23,8 +37,9 @@ def load_checkpoint(step, model_path):
        new_state_dict[param] = model_dict[param]
    return new_state_dict, opti_dict


def main(args):
    local_rank = dg.parallel.Env().local_rank if args.use_data_parallel else 0
    nranks = dg.parallel.Env().nranks if args.use_data_parallel else 1

@@ -35,23 +50,26 @@ def main(args):
    place = (fluid.CUDAPlace(dg.parallel.Env().dev_id)
             if args.use_data_parallel else fluid.CUDAPlace(0)
             if args.use_gpu else fluid.CPUPlace())

    if not os.path.exists(args.log_dir):
        os.mkdir(args.log_dir)
    path = os.path.join(args.log_dir, 'vocoder')

    writer = SummaryWriter(path) if local_rank == 0 else None

    with dg.guard(place):
        model = Vocoder(cfg, args.batch_size)

        model.train()
        optimizer = fluid.optimizer.AdamOptimizer(
            learning_rate=dg.NoamDecay(1 / (
                cfg['warm_up_step'] * (args.lr**2)), cfg['warm_up_step']),
            parameter_list=model.parameters())

        if args.checkpoint_path is not None:
            model_dict, opti_dict = load_checkpoint(
                str(args.vocoder_step),
                os.path.join(args.checkpoint_path, "vocoder"))
            model.set_dict(model_dict)
            optimizer.set_dict(opti_dict)
            global_step = args.vocoder_step

@@ -61,48 +79,55 @@ def main(args):
            strategy = dg.parallel.prepare_context()
            model = fluid.dygraph.parallel.DataParallel(model, strategy)

        reader = LJSpeechLoader(
            cfg, args, nranks, local_rank, is_vocoder=True).reader()

        for epoch in range(args.epochs):
            pbar = tqdm(reader)
            for i, data in enumerate(pbar):
                pbar.set_description('Processing at epoch %d' % epoch)
                mel, mag = data
                mag = dg.to_variable(mag.numpy())
                mel = dg.to_variable(mel.numpy())
                global_step += 1

                mag_pred = model(mel)
                loss = layers.mean(
                    layers.abs(layers.elementwise_sub(mag_pred, mag)))

                if args.use_data_parallel:
                    loss = model.scale_loss(loss)
                    loss.backward()
                    model.apply_collective_grads()
                else:
                    loss.backward()
                optimizer.minimize(
                    loss,
                    grad_clip=fluid.dygraph_grad_clip.GradClipByGlobalNorm(cfg[
                        'grad_clip_thresh']))
                model.clear_gradients()

                if local_rank == 0:
                    writer.add_scalars('training_loss', {
                        'loss': loss.numpy(),
                    }, global_step)

                    if global_step % args.save_step == 0:
                        if not os.path.exists(args.save_path):
                            os.mkdir(args.save_path)
                        save_path = os.path.join(args.save_path,
                                                 'vocoder/%d' % global_step)
                        dg.save_dygraph(model.state_dict(), save_path)
                        dg.save_dygraph(optimizer.state_dict(), save_path)

    if local_rank == 0:
        writer.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Train vocoder model")
    add_config_options_to_parser(parser)
    args = parser.parse_args()
    # Print the whole config setting.
    pprint(args)
    main(args)

@@ -109,3 +109,13 @@ python -u benchmark.py \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true
```

### Low-precision inference

This model supports float16 low-precision inference. By appending the argument

```bash
--use_fp16=true
```

to the synthesis and benchmarking commands, one can use the much faster low-precision inference path.

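For example, the benchmarking command from the section above becomes (same flags as shown there, plus the new one):

```bash
python -u benchmark.py \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true \
    --use_fp16=true
```
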
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import random
from pprint import pprint

@@ -24,9 +38,14 @@ def add_options_to_parser(parser):

    parser.add_argument(
        '--use_gpu',
        type=utils.str2bool,
        default=True,
        help="option to use gpu training")
    parser.add_argument(
        '--use_fp16',
        type=utils.str2bool,
        default=True,
        help="option to use fp16 for inference")

    parser.add_argument(
        '--iteration',

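The switch from `type=bool` to `utils.str2bool` matters because argparse passes the raw command-line string to the type callable, and `bool("false")` is truthy. A typical helper looks like this (a sketch of the common pattern; the repo's actual utils.str2bool may differ):

```python
import argparse

def str2bool(v):
    # argparse hands over the raw CLI string; bool("false") would be True,
    # so recognize common spellings explicitly.
    if isinstance(v, bool):
        return v
    if v.lower() in ("yes", "true", "t", "1"):
        return True
    if v.lower() in ("no", "false", "f", "0"):
        return False
    raise argparse.ArgumentTypeError("boolean value expected, got %r" % v)
```
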
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import random
from pprint import pprint

@@ -24,9 +38,14 @@ def add_options_to_parser(parser):

    parser.add_argument(
        '--use_gpu',
        type=utils.str2bool,
        default=True,
        help="option to use gpu training")
    parser.add_argument(
        '--use_fp16',
        type=utils.str2bool,
        default=True,
        help="option to use fp16 for inference")

    parser.add_argument(
        '--iteration',

@@ -74,7 +93,6 @@ def synthesize(config):
    # Build model.
    model = WaveFlow(config, checkpoint_dir)
    model.build(training=False)

    # Obtain the current iteration.
    if config.checkpoint is None:
        if config.iteration is None:

@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import random
import subprocess

@@ -127,4 +141,6 @@ if __name__ == "__main__":
    # the preceding update will be overwritten by the following one.
    config = parser.parse_args()
    config = utils.add_yaml_config(config)
    # Force to use fp32 in model training
    vars(config)["use_fp16"] = False
    train(config)

@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import itertools
import os
import time

@@ -126,7 +140,8 @@ def load_parameters(checkpoint_dir,
                    model,
                    optimizer=None,
                    iteration=None,
                    file_path=None,
                    dtype="float32"):
    if file_path is None:
        if iteration is None:
            iteration = load_latest_checkpoint(checkpoint_dir, rank)

@@ -135,6 +150,12 @@ def load_parameters(checkpoint_dir,
        file_path = "{}/step-{}".format(checkpoint_dir, iteration)

    model_dict, optimizer_dict = dg.load_dygraph(file_path)
    if dtype == "float16":
        for k, v in model_dict.items():
            if "conv2d_transpose" in k:
                model_dict[k] = v.astype("float32")
            else:
                model_dict[k] = v.astype(dtype)
    model.set_dict(model_dict)
    print("[checkpoint] Rank {}: loaded model from {}".format(rank, file_path))
    if optimizer and optimizer_dict:

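A possible call site for the new `dtype` argument (illustrative only; the exact positional signature of load_parameters beyond what the hunk shows is an assumption):

```python
# Hypothetical usage: load weights cast to float16 for inference. Note that the
# loader above deliberately keeps conv2d_transpose weights in float32.
load_parameters(checkpoint_dir, rank, model, dtype="float16")
```
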
@@ -0,0 +1,97 @@
# WaveNet

Paddle implementation of WaveNet in dynamic graph, a convolutional network based vocoder. WaveNet is proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499), but in this experiment the implementation follows the teacher model in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).


## Dataset

We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```

## Project Structure

```text
├── data.py          data processing
├── configs/         (example) configuration files
├── synthesis.py     script to synthesize waveform from mel_spectrogram
├── train.py         script to train a model
└── utils.py         utility functions
```

## Train

Train the model using train.py; follow the usage displayed by `python train.py --help`.

```text
usage: train.py [-h] [--data DATA] [--config CONFIG] [--output OUTPUT]
                [--device DEVICE] [--resume RESUME]

Train a wavenet model with LJSpeech.

optional arguments:
  -h, --help       show this help message and exit
  --data DATA      path of the LJspeech dataset.
  --config CONFIG  path of the config file.
  --output OUTPUT  path to save results.
  --device DEVICE  device to use.
  --resume RESUME  checkpoint to resume from.
```

1. `--config` is the configuration file to use. The provided configurations can be used directly, and you can also change some values in a configuration file to train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
3. `--resume` is the path of a checkpoint. If it is provided, the model loads the checkpoint before training.
4. `--output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.

```text
├── checkpoints      # checkpoint
└── log              # tensorboard log
```

5. `--device` is the device (gpu id) to use for training. `-1` means CPU.

Example script:

```bash
python train.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0
```

You can monitor the training log via tensorboard, using the script below.

```bash
cd experiment/log
tensorboard --logdir=.
```

## Synthesis
```text
usage: synthesis.py [-h] [--data DATA] [--config CONFIG] [--device DEVICE]
                    checkpoint output

Synthesize valid data from LJspeech with a wavenet model.

positional arguments:
  checkpoint       checkpoint to load.
  output           path to save results.

optional arguments:
  -h, --help       show this help message and exit
  --data DATA      path of the LJspeech dataset.
  --config CONFIG  path of the config file.
  --device DEVICE  device to use.
```

1. `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
2. `--data` is the path of the LJSpeech dataset. A dataset is not needed for synthesis, but since the input is mel spectrogram, we need to get mel spectrograms from audio files.
3. `checkpoint` is the checkpoint to load.
4. `output` is the directory to save results. The output path contains the generated audio files (`*.wav`).
5. `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.

Example script:

```bash
python synthesis.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ --device=0 experiment/checkpoints/step_500000 generated
```

@@ -0,0 +1,36 @@
data:
  batch_size: 16
  train_clip_seconds: 0.5
  sample_rate: 22050
  hop_length: 256
  win_length: 1024
  n_fft: 2048
  n_mels: 80
  valid_size: 16

model:
  upsampling_factors: [16, 16]
  n_loop: 10
  n_layer: 3
  filter_size: 2
  residual_channels: 128
  loss_type: "mog"
  output_dim: 30
  log_scale_min: -9

train:
  learning_rate: 0.001
  anneal_rate: 0.5
  anneal_interval: 200000
  gradient_max_norm: 100.0

  checkpoint_interval: 10000
  snap_interval: 10000
  eval_interval: 10000

  max_iterations: 2000000
@@ -0,0 +1,36 @@
data:
  batch_size: 16
  train_clip_seconds: 0.5
  sample_rate: 22050
  hop_length: 256
  win_length: 1024
  n_fft: 2048
  n_mels: 80
  valid_size: 16

model:
  upsampling_factors: [16, 16]
  n_loop: 10
  n_layer: 3
  filter_size: 2
  residual_channels: 128
  loss_type: "mog"
  output_dim: 3
  log_scale_min: -9

train:
  learning_rate: 0.001
  anneal_rate: 0.5
  anneal_interval: 200000
  gradient_max_norm: 100.0

  checkpoint_interval: 10000
  snap_interval: 10000
  eval_interval: 10000

  max_iterations: 2000000
@@ -0,0 +1,36 @@
data:
  batch_size: 16
  train_clip_seconds: 0.5
  sample_rate: 22050
  hop_length: 256
  win_length: 1024
  n_fft: 2048
  n_mels: 80
  valid_size: 16

model:
  upsampling_factors: [16, 16]
  n_loop: 10
  n_layer: 3
  filter_size: 2
  residual_channels: 128
  loss_type: "softmax"
  output_dim: 2048
  log_scale_min: -9

train:
  learning_rate: 0.001
  anneal_rate: 0.5
  anneal_interval: 200000
  gradient_max_norm: 100.0

  checkpoint_interval: 10000
  snap_interval: 10000
  eval_interval: 10000

  max_iterations: 2000000
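These three configs differ only in the output distribution; presumably they are the mixture-of-Gaussians, single-Gaussian (the `wavenet_single_gaussian.yaml` the README points at), and softmax variants. A hedged reading of `output_dim` — my interpretation, not stated in the configs themselves — is that `mog` uses 3 parameters (weight logit, mean, log-scale) per Gaussian component, while `softmax` uses one logit per quantization level:

```python
# Hedged sketch: relation between loss_type and output_dim, inferred from
# the three configs above (10-component MoG, single Gaussian, 2048-way softmax).
def num_components(loss_type, output_dim):
    if loss_type == "mog":
        assert output_dim % 3 == 0, "weight, mean, log-scale per component"
        return output_dim // 3
    elif loss_type == "softmax":
        return output_dim  # quantization levels, not mixture components
    raise ValueError(loss_type)


print(num_components("mog", 30))       # 10 components
print(num_components("mog", 3))        # 1 component (single Gaussian)
print(num_components("softmax", 2048)) # 2048 quantization levels
```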
@@ -0,0 +1,163 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import csv
import numpy as np
import librosa
from pathlib import Path
import pandas as pd

from parakeet.data import batch_spec, batch_wav
from parakeet.data import DatasetMixin


class LJSpeechMetaData(DatasetMixin):
    def __init__(self, root):
        self.root = Path(root)
        self._wav_dir = self.root.joinpath("wavs")
        csv_path = self.root.joinpath("metadata.csv")
        self._table = pd.read_csv(
            csv_path,
            sep="|",
            header=None,
            quoting=csv.QUOTE_NONE,
            names=["fname", "raw_text", "normalized_text"])

    def get_example(self, i):
        fname, raw_text, normalized_text = self._table.iloc[i]
        fname = str(self._wav_dir.joinpath(fname + ".wav"))
        return fname, raw_text, normalized_text

    def __len__(self):
        return len(self._table)


class Transform(object):
    def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels):
        self.sample_rate = sample_rate
        self.n_fft = n_fft
        self.win_length = win_length
        self.hop_length = hop_length
        self.n_mels = n_mels

    def __call__(self, example):
        wav_path, _, _ = example

        sr = self.sample_rate
        n_fft = self.n_fft
        win_length = self.win_length
        hop_length = self.hop_length
        n_mels = self.n_mels

        wav, loaded_sr = librosa.load(wav_path, sr=None)
        assert loaded_sr == sr, "sample rate does not match"

        # Pad audio to the right size.
        frames = int(np.ceil(float(wav.size) / hop_length))
        fft_padding = (n_fft - hop_length) // 2
        desired_length = frames * hop_length + fft_padding * 2
        pad_amount = (desired_length - wav.size) // 2

        if wav.size % 2 == 0:
            wav = np.pad(wav, (pad_amount, pad_amount), mode='reflect')
        else:
            wav = np.pad(wav, (pad_amount, pad_amount + 1), mode='reflect')

        # Normalize audio.
        wav = wav / np.abs(wav).max() * 0.999

        # Compute the spectrogram.
        # Set center to False to prevent librosa's internal padding.
        spectrogram = librosa.core.stft(
            wav,
            hop_length=hop_length,
            win_length=win_length,
            n_fft=n_fft,
            center=False)
        spectrogram_magnitude = np.abs(spectrogram)

        # Compute the mel spectrogram.
        mel_filter_bank = librosa.filters.mel(sr=sr,
                                              n_fft=n_fft,
                                              n_mels=n_mels)
        mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)

        # Rescale the mel spectrogram.
        min_level, ref_level = 1e-5, 20  # hard-coded
        mel_spectrogram = 20 * np.log10(np.maximum(min_level, mel_spectrogram))
        mel_spectrogram = mel_spectrogram - ref_level
        mel_spectrogram = np.clip((mel_spectrogram + 100) / 100, 0, 1)

        # Extract the center of audio that corresponds to mel spectrograms.
        audio = wav[fft_padding:-fft_padding]
        assert mel_spectrogram.shape[1] * hop_length == audio.size

        # there is no clipping here
        return audio, mel_spectrogram


class DataCollector(object):
    def __init__(self,
                 context_size,
                 sample_rate,
                 hop_length,
                 train_clip_seconds,
                 valid=False):
        frames_per_second = sample_rate // hop_length
        train_clip_frames = int(
            np.ceil(train_clip_seconds * frames_per_second))
        context_frames = context_size // hop_length
        self.num_frames = train_clip_frames + context_frames

        self.sample_rate = sample_rate
        self.hop_length = hop_length
        self.valid = valid

    def random_crop(self, sample):
        audio, mel_spectrogram = sample
        audio_frames = int(audio.size) // self.hop_length
        max_start_frame = audio_frames - self.num_frames
        assert max_start_frame >= 0, "audio is too short to be cropped"

        frame_start = np.random.randint(0, max_start_frame)
        # frame_start = 0  # no random
        frame_end = frame_start + self.num_frames

        audio_start = frame_start * self.hop_length
        audio_end = frame_end * self.hop_length

        audio = audio[audio_start:audio_end]
        return audio, mel_spectrogram, audio_start

    def __call__(self, samples):
        # transform them first
        if self.valid:
            samples = [(audio, mel_spectrogram, 0)
                       for audio, mel_spectrogram in samples]
        else:
            samples = [self.random_crop(sample) for sample in samples]
        # batch them
        audios = [sample[0] for sample in samples]
        audio_starts = [sample[2] for sample in samples]
        mels = [sample[1] for sample in samples]

        mels = batch_spec(mels)

        if self.valid:
            audios = batch_wav(audios, dtype=np.float32)
        else:
            audios = np.array(audios, dtype=np.float32)
        audio_starts = np.array(audio_starts, dtype=np.int64)
        return audios, mels, audio_starts
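The padding arithmetic in `Transform.__call__` is worth a worked example. With the example config (n_fft 2048, hop 256), 896 reflect-padded samples on each side are exactly what `stft(center=False)` consumes, so every mel frame lines up with `hop_length` audio samples:

```python
import numpy as np

n_fft, hop_length = 2048, 256
wav_size = 22050  # a hypothetical 1-second clip at 22.05 kHz

frames = int(np.ceil(wav_size / hop_length))  # 87 frames
fft_padding = (n_fft - hop_length) // 2       # 896 samples per side
desired_length = frames * hop_length + 2 * fft_padding

# After reflect-padding to desired_length, stft(center=False) yields
# exactly `frames` columns, and the center slice wav[896:-896] has
# frames * hop_length samples -- the alignment data.py asserts.
assert desired_length - 2 * fft_padding == frames * hop_length
```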
@@ -0,0 +1,124 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import ruamel.yaml
import argparse
from tqdm import tqdm
from tensorboardX import SummaryWriter
from paddle import fluid
import paddle.fluid.dygraph as dg

from parakeet.data import SliceDataset, TransformDataset, DataCargo, SequentialSampler, RandomSampler
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWavenet
from parakeet.utils.layer_tools import summary

from data import LJSpeechMetaData, Transform, DataCollector
from utils import make_output_tree, valid_model, eval_model, save_checkpoint

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Synthesize valid data from LJspeech with a wavenet model.")
    parser.add_argument(
        "--data", type=str, help="path of the LJspeech dataset.")
    parser.add_argument("--config", type=str, help="path of the config file.")
    parser.add_argument(
        "--device", type=int, default=-1, help="device to use.")

    parser.add_argument("checkpoint", type=str, help="checkpoint to load.")
    parser.add_argument(
        "output", type=str, default="experiment", help="path to save results.")

    args = parser.parse_args()
    with open(args.config, 'rt') as f:
        config = ruamel.yaml.safe_load(f)

    ljspeech_meta = LJSpeechMetaData(args.data)

    data_config = config["data"]
    sample_rate = data_config["sample_rate"]
    n_fft = data_config["n_fft"]
    win_length = data_config["win_length"]
    hop_length = data_config["hop_length"]
    n_mels = data_config["n_mels"]
    train_clip_seconds = data_config["train_clip_seconds"]
    transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
    ljspeech = TransformDataset(ljspeech_meta, transform)

    valid_size = data_config["valid_size"]
    ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
    ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))

    model_config = config["model"]
    n_loop = model_config["n_loop"]
    n_layer = model_config["n_layer"]
    filter_size = model_config["filter_size"]
    context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
    print("context size is {} samples".format(context_size))
    train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
                                   train_clip_seconds)
    valid_batch_fn = DataCollector(
        context_size, sample_rate, hop_length, train_clip_seconds, valid=True)

    batch_size = data_config["batch_size"]
    train_cargo = DataCargo(
        ljspeech_train,
        train_batch_fn,
        batch_size,
        sampler=RandomSampler(ljspeech_train))

    # only batch=1 for validation is enabled
    valid_cargo = DataCargo(
        ljspeech_valid,
        valid_batch_fn,
        batch_size=1,
        sampler=SequentialSampler(ljspeech_valid))

    make_output_tree(args.output)

    if args.device == -1:
        place = fluid.CPUPlace()
    else:
        place = fluid.CUDAPlace(args.device)

    with dg.guard(place):
        model_config = config["model"]
        upsampling_factors = model_config["upsampling_factors"]
        encoder = UpsampleNet(upsampling_factors)

        n_loop = model_config["n_loop"]
        n_layer = model_config["n_layer"]
        residual_channels = model_config["residual_channels"]
        output_dim = model_config["output_dim"]
        loss_type = model_config["loss_type"]
        log_scale_min = model_config["log_scale_min"]
        decoder = WaveNet(n_loop, n_layer, residual_channels, output_dim,
                          n_mels, filter_size, loss_type, log_scale_min)

        model = ConditionalWavenet(encoder, decoder)
        summary(model)

        model_dict, _ = dg.load_dygraph(args.checkpoint)
        print("Loading from {}.pdparams".format(args.checkpoint))
        model.set_dict(model_dict)

        train_loader = fluid.io.DataLoader.from_generator(
            capacity=10, return_list=True)
        train_loader.set_batch_generator(train_cargo, place)

        valid_loader = fluid.io.DataLoader.from_generator(
            capacity=10, return_list=True)
        valid_loader.set_batch_generator(valid_cargo, place)

        eval_model(model, valid_loader, args.output, sample_rate)
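The `context_size` printed by both scripts is the receptive field of the dilated stack, `1 + n_layer * sum(filter_size**i for i in range(n_loop))`. With the example configs it works out as follows:

```python
n_loop, n_layer, filter_size = 10, 3, 2  # values from the example configs
context_size = 1 + n_layer * sum(filter_size**i for i in range(n_loop))
print(context_size)          # 3070 samples
print(context_size / 22050)  # ~0.139 seconds of context
print(context_size // 256)   # ~11 extra mel frames prepended to each clip
```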
@@ -0,0 +1,181 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import ruamel.yaml
import argparse
from tqdm import tqdm
from tensorboardX import SummaryWriter
from paddle import fluid
import paddle.fluid.dygraph as dg

from parakeet.data import SliceDataset, TransformDataset, DataCargo, SequentialSampler, RandomSampler
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWavenet
from parakeet.utils.layer_tools import summary

from data import LJSpeechMetaData, Transform, DataCollector
from utils import make_output_tree, valid_model, save_checkpoint

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Train a wavenet model with LJSpeech.")
    parser.add_argument(
        "--data", type=str, help="path of the LJspeech dataset.")
    parser.add_argument("--config", type=str, help="path of the config file.")
    parser.add_argument(
        "--output",
        type=str,
        default="experiment",
        help="path to save results.")
    parser.add_argument(
        "--device", type=int, default=-1, help="device to use.")
    parser.add_argument(
        "--resume", type=str, help="checkpoint to resume from.")

    args = parser.parse_args()
    with open(args.config, 'rt') as f:
        config = ruamel.yaml.safe_load(f)

    ljspeech_meta = LJSpeechMetaData(args.data)

    data_config = config["data"]
    sample_rate = data_config["sample_rate"]
    n_fft = data_config["n_fft"]
    win_length = data_config["win_length"]
    hop_length = data_config["hop_length"]
    n_mels = data_config["n_mels"]
    train_clip_seconds = data_config["train_clip_seconds"]
    transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
    ljspeech = TransformDataset(ljspeech_meta, transform)

    valid_size = data_config["valid_size"]
    ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
    ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))

    model_config = config["model"]
    n_loop = model_config["n_loop"]
    n_layer = model_config["n_layer"]
    filter_size = model_config["filter_size"]
    context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
    print("context size is {} samples".format(context_size))
    train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
                                   train_clip_seconds)
    valid_batch_fn = DataCollector(
        context_size, sample_rate, hop_length, train_clip_seconds, valid=True)

    batch_size = data_config["batch_size"]
    train_cargo = DataCargo(
        ljspeech_train,
        train_batch_fn,
        batch_size,
        sampler=RandomSampler(ljspeech_train))

    # only batch=1 for validation is enabled
    valid_cargo = DataCargo(
        ljspeech_valid,
        valid_batch_fn,
        batch_size=1,
        sampler=SequentialSampler(ljspeech_valid))

    make_output_tree(args.output)

    if args.device == -1:
        place = fluid.CPUPlace()
    else:
        place = fluid.CUDAPlace(args.device)

    with dg.guard(place):
        model_config = config["model"]
        upsampling_factors = model_config["upsampling_factors"]
        encoder = UpsampleNet(upsampling_factors)

        n_loop = model_config["n_loop"]
        n_layer = model_config["n_layer"]
        residual_channels = model_config["residual_channels"]
        output_dim = model_config["output_dim"]
        loss_type = model_config["loss_type"]
        log_scale_min = model_config["log_scale_min"]
        decoder = WaveNet(n_loop, n_layer, residual_channels, output_dim,
                          n_mels, filter_size, loss_type, log_scale_min)

        model = ConditionalWavenet(encoder, decoder)
        summary(model)

        train_config = config["train"]
        learning_rate = train_config["learning_rate"]
        anneal_rate = train_config["anneal_rate"]
        anneal_interval = train_config["anneal_interval"]
        lr_scheduler = dg.ExponentialDecay(
            learning_rate, anneal_interval, anneal_rate, staircase=True)
        optim = fluid.optimizer.Adam(
            lr_scheduler, parameter_list=model.parameters())

        gradient_max_norm = train_config["gradient_max_norm"]
        clipper = fluid.dygraph_grad_clip.GradClipByGlobalNorm(
            gradient_max_norm)

        if args.resume:
            model_dict, optim_dict = dg.load_dygraph(args.resume)
            print("Loading from {}.pdparams".format(args.resume))
            model.set_dict(model_dict)
            if optim_dict:
                optim.set_dict(optim_dict)
                print("Loading from {}.pdopt".format(args.resume))

        train_loader = fluid.io.DataLoader.from_generator(
            capacity=10, return_list=True)
        train_loader.set_batch_generator(train_cargo, place)

        valid_loader = fluid.io.DataLoader.from_generator(
            capacity=10, return_list=True)
        valid_loader.set_batch_generator(valid_cargo, place)

        max_iterations = train_config["max_iterations"]
        checkpoint_interval = train_config["checkpoint_interval"]
        snap_interval = train_config["snap_interval"]
        eval_interval = train_config["eval_interval"]
        checkpoint_dir = os.path.join(args.output, "checkpoints")
        log_dir = os.path.join(args.output, "log")
        writer = SummaryWriter(log_dir)

        global_step = 1
        while global_step <= max_iterations:
            epoch_loss = 0.
            for i, batch in tqdm(enumerate(train_loader)):
                audio_clips, mel_specs, audio_starts = batch

                model.train()
                y_var = model(audio_clips, mel_specs, audio_starts)
                loss_var = model.loss(y_var, audio_clips)
                loss_var.backward()
                loss_np = loss_var.numpy()

                epoch_loss += loss_np[0]

                writer.add_scalar("loss", loss_np[0], global_step)
                writer.add_scalar("learning_rate",
                                  optim._learning_rate.step().numpy()[0],
                                  global_step)
                optim.minimize(loss_var, grad_clip=clipper)
                optim.clear_gradients()
                print("loss: {:<8.6f}".format(loss_np[0]))

                if global_step % snap_interval == 0:
                    valid_model(model, valid_loader, writer, global_step,
                                sample_rate)

                if global_step % checkpoint_interval == 0:
                    save_checkpoint(model, optim, checkpoint_dir, global_step)

                global_step += 1
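With `staircase=True`, `ExponentialDecay` halves the learning rate every `anneal_interval` steps: lr(step) = learning_rate * anneal_rate ** (step // anneal_interval). A quick check with the config values:

```python
learning_rate, anneal_rate, anneal_interval = 1e-3, 0.5, 200000


def lr_at(step):
    # staircase exponential decay, as configured above
    return learning_rate * anneal_rate ** (step // anneal_interval)


print(lr_at(0))        # 0.001
print(lr_at(200000))   # 0.0005
print(lr_at(1000000))  # 0.001 * 0.5**5 = 3.125e-05
```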
@@ -0,0 +1,67 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import numpy as np
import soundfile as sf
import paddle.fluid.dygraph as dg


def make_output_tree(output_dir):
    checkpoint_dir = os.path.join(output_dir, "checkpoints")
    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)

    state_dir = os.path.join(output_dir, "states")
    if not os.path.exists(state_dir):
        os.makedirs(state_dir)


def valid_model(model, valid_loader, writer, global_step, sample_rate):
    loss = []
    wavs = []
    model.eval()
    for i, batch in enumerate(valid_loader):
        # print("sentence {}".format(i))
        audio_clips, mel_specs, audio_starts = batch
        y_var = model(audio_clips, mel_specs, audio_starts)
        wav_var = model.sample(y_var)
        loss_var = model.loss(y_var, audio_clips)
        loss.append(loss_var.numpy()[0])
        wavs.append(wav_var.numpy()[0])

    average_loss = np.mean(loss)
    writer.add_scalar("valid_loss", average_loss, global_step)
    for i, wav in enumerate(wavs):
        writer.add_audio("valid/sample_{}".format(i), wav, global_step,
                         sample_rate)


def eval_model(model, valid_loader, output_dir, sample_rate):
    model.eval()
    for i, batch in enumerate(valid_loader):
        # print("sentence {}".format(i))
        path = os.path.join(output_dir, "sentence_{}.wav".format(i))
        audio_clips, mel_specs, audio_starts = batch
        wav_var = model.synthesis(mel_specs)
        wav_np = wav_var.numpy()[0]
        sf.write(path, wav_np, samplerate=sample_rate)
        print("generated {}".format(path))


def save_checkpoint(model, optim, checkpoint_dir, global_step):
    checkpoint_path = os.path.join(checkpoint_dir,
                                   "step_{:09d}".format(global_step))
    dg.save_dygraph(model.state_dict(), checkpoint_path)
    dg.save_dygraph(optim.state_dict(), checkpoint_path)
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

__version__ = "0.0.0"

from . import data, g2p, models, modules
@@ -1 +1,15 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .audio import AudioProcessor
@@ -1,30 +1,46 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import librosa
import soundfile as sf
import numpy as np
import scipy.io
import scipy.signal


class AudioProcessor(object):
    def __init__(
            self,
            sample_rate=None,  # int: sampling rate
            num_mels=None,  # int: bands of mel spectrogram
            min_level_db=None,  # float: minimum level db
            ref_level_db=None,  # float: reference level db
            n_fft=None,  # int: number of samples in a frame for stft
            win_length=None,  # int: the same meaning as n_fft
            hop_length=None,  # int: number of samples between neighboring frames
            power=None,  # float: power to raise before griffin-lim
            preemphasis=None,  # float: preemphasis coefficient
            signal_norm=None,
            symmetric_norm=False,  # bool: apply clip norm in [-max_norm, max_norm]
            max_norm=None,  # float: max norm
            mel_fmin=None,  # int: mel spectrogram's minimum frequency
            mel_fmax=None,  # int: mel spectrogram's maximum frequency
            clip_norm=True,  # bool: clip spectrogram's norm
            griffin_lim_iters=None,  # int
            do_trim_silence=False,  # bool: trim silence
            sound_norm=False,
            **kwargs):
        self.sample_rate = sample_rate
        self.num_mels = num_mels
        self.min_level_db = min_level_db

@@ -34,8 +50,8 @@ class AudioProcessor(object):
        self.n_fft = n_fft
        self.win_length = win_length or n_fft
        # hop length defaults to 1/4 window length
        self.hop_length = hop_length or 0.25 * self.win_length

        self.power = power
        self.preemphasis = float(preemphasis)

@@ -52,7 +68,8 @@ class AudioProcessor(object):
        self.do_trim_silence = do_trim_silence

        self.sound_norm = sound_norm
        self.num_freq, self.frame_length_ms, self.frame_shift_ms = self._stft_parameters()

    def _stft_parameters(self):
        """compute frame length and hop length in ms"""

@@ -65,44 +82,54 @@ class AudioProcessor(object):
        """object repr"""
        cls_name_str = self.__class__.__name__
        members = vars(self)
        dict_str = "\n".join(
            ["  {}: {},".format(k, v) for k, v in members.items()])
        repr_str = "{}(\n{})\n".format(cls_name_str, dict_str)
        return repr_str

    def save_wav(self, path, wav):
        """save audio with scipy.io.wavfile as 16-bit integers"""
        wav_norm = wav * (32767 / max(0.01, np.max(np.abs(wav))))
        scipy.io.wavfile.write(path, self.sample_rate,
                               wav_norm.astype(np.int16))

    def load_wav(self, path, sr=None):
        """load wav -> trim_silence -> rescale"""

        x, sr = librosa.load(path, sr=None)
        assert self.sample_rate == sr, "audio sample rate: {}Hz != processor sample rate: {}Hz".format(
            sr, self.sample_rate)
        if self.do_trim_silence:
            try:
                x = self.trim_silence(x)
            except ValueError:
                print(" [!] File cannot be trimmed for silence - {}".format(
                    path))
        if self.sound_norm:
            x = x / x.max() * 0.9  # why 0.9 ?
        return x

    def trim_silence(self, wav):
        """Trim silent parts with a threshold and a 0.01 s margin"""
        margin = int(self.sample_rate * 0.01)
        wav = wav[margin:-margin]
        trimed_wav = librosa.effects.trim(
            wav,
            top_db=60,
            frame_length=self.win_length,
            hop_length=self.hop_length)[0]
        return trimed_wav

    def apply_preemphasis(self, x):
        if self.preemphasis == 0.:
            raise RuntimeError(
                " !! Preemphasis coefficient should be positive. ")
        return scipy.signal.lfilter([1., -self.preemphasis], [1.], x)

    def apply_inv_preemphasis(self, x):
        if self.preemphasis == 0.:
            raise RuntimeError(
                " !! Preemphasis coefficient should be positive. ")
        return scipy.signal.lfilter([1.], [1., -self.preemphasis], x)

    def _amplitude_to_db(self, x):

@@ -125,12 +152,11 @@ class AudioProcessor(object):
        """return mel basis for mel scale"""
        if self.mel_fmax is not None:
            assert self.mel_fmax <= self.sample_rate // 2
        return librosa.filters.mel(self.sample_rate,
                                   self.n_fft,
                                   n_mels=self.num_mels,
                                   fmin=self.mel_fmin,
                                   fmax=self.mel_fmax)

    def _normalize(self, S):
        """put values in [0, self.max_norm] or [-self.max_norm, self.max_norm]"""

@@ -156,25 +182,29 @@ class AudioProcessor(object):
        if self.symmetric_norm:
            if self.clip_norm:
                S_denorm = np.clip(S_denorm, -self.max_norm, self.max_norm)
            S_denorm = (S_denorm + self.max_norm) * (
                -self.min_level_db) / (2 * self.max_norm) + self.min_level_db
            return S_denorm
        else:
            if self.clip_norm:
                S_denorm = np.clip(S_denorm, 0, self.max_norm)
            S_denorm = S_denorm * (
                -self.min_level_db) / self.max_norm + self.min_level_db
            return S_denorm
        else:
            return S

    def _stft(self, y):
        return librosa.stft(
            y=y,
            n_fft=self.n_fft,
            win_length=self.win_length,
            hop_length=self.hop_length)

    def _istft(self, S):
        return librosa.istft(
            S, hop_length=self.hop_length, win_length=self.win_length)

    def spectrogram(self, y):
        """compute linear spectrogram (amplitude)

@@ -195,7 +225,8 @@ class AudioProcessor(object):
            D = self._stft(self.apply_preemphasis(y))
        else:
            D = self._stft(y)
        S = self._amplitude_to_db(self._linear_to_mel(np.abs(
            D))) - self.ref_level_db
        return self._normalize(S)

    def inv_spectrogram(self, spectrogram):

@@ -203,16 +234,16 @@ class AudioProcessor(object):
        S = self._denormalize(spectrogram)
        S = self._db_to_amplitude(S + self.ref_level_db)
        if self.preemphasis:
            return self.apply_inv_preemphasis(self._griffin_lim(S**self.power))
        return self._griffin_lim(S**self.power)

    def inv_melspectrogram(self, mel_spectrogram):
        S = self._denormalize(mel_spectrogram)
        S = self._db_to_amplitude(S + self.ref_level_db)
        S = self._mel_to_linear(np.abs(S))
        if self.preemphasis:
            return self.apply_inv_preemphasis(self._griffin_lim(S**self.power))
        return self._griffin_lim(S**self.power)

    def out_linear_to_mel(self, linear_spec):
        """convert output linear spec to mel spec"""

@@ -222,7 +253,7 @@ class AudioProcessor(object):
        S = self._amplitude_to_db(S) - self.ref_level_db
        mel = self._normalize(S)
        return mel

    def _griffin_lim(self, S):
        angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
        S_complex = np.abs(S).astype(np.complex)
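The hunk cuts off before the iteration body. For reference, Griffin-Lim alternates between the target magnitude and the phase of the resynthesized signal; a minimal reconstruction follows (my sketch, not necessarily the commit's exact body), assuming the `_stft`/`_istft` helpers above and `griffin_lim_iters` from the constructor:

```python
# Hedged reconstruction of the truncated Griffin-Lim loop.
def _griffin_lim(self, S):
    angles = np.exp(2j * np.pi * np.random.rand(*S.shape))  # random phases
    S_complex = np.abs(S).astype(np.complex)
    y = self._istft(S_complex * angles)
    for _ in range(self.griffin_lim_iters):
        # keep the target magnitude, adopt the phase of the resynthesis
        angles = np.exp(1j * np.angle(self._stft(y)))
        y = self._istft(S_complex * angles)
    return y
```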
@@ -234,18 +265,18 @@ class AudioProcessor(object):

    @staticmethod
    def mulaw_encode(wav, qc):
        mu = 2**qc - 1
        # wav_abs = np.minimum(np.abs(wav), 1.0)
        signal = np.sign(wav) * np.log(1 + mu * np.abs(wav)) / np.log(1. + mu)
        # Quantize signal to the specified number of levels.
        signal = (signal + 1) / 2 * mu + 0.5
        return np.floor(signal)

    @staticmethod
    def mulaw_decode(wav, qc):
        """Recovers waveform from quantized values."""
        mu = 2**qc - 1
        x = np.sign(wav) / mu * ((1 + mu)**np.abs(wav) - 1)
        return x

    @staticmethod
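A quick round-trip check of the μ-law pair above. `mulaw_encode` returns integer levels in `[0, mu]`, while `mulaw_decode` expects companded values in `[-1, 1]`, so the caller rescales in between. The sketch below mirrors the two static methods standalone so it runs on its own:

```python
import numpy as np


# Standalone mirror of AudioProcessor.mulaw_encode / mulaw_decode above.
def mulaw_encode(wav, qc):
    mu = 2**qc - 1
    signal = np.sign(wav) * np.log(1 + mu * np.abs(wav)) / np.log(1. + mu)
    return np.floor((signal + 1) / 2 * mu + 0.5)


def mulaw_decode(wav, qc):
    mu = 2**qc - 1
    return np.sign(wav) / mu * ((1 + mu)**np.abs(wav) - 1)


qc = 8  # 8-bit quantization, mu = 255
mu = 2**qc - 1
x = np.linspace(-0.9, 0.9, 5)
levels = mulaw_encode(x, qc)                       # integers in [0, mu]
recovered = mulaw_decode(2 * levels / mu - 1, qc)  # rescale back to [-1, 1]
print(np.max(np.abs(recovered - x)))               # small quantization error
```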
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .dataset import *
from .datacargo import *
from .sampler import *
@@ -1,18 +1,34 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Functions to make batches from arrays which satisfy some conditions.
"""
import numpy as np


class TextIDBatcher(object):
    """A wrapper class for a function to build a functor, which holds the configs to pass to the function."""

    def __init__(self, pad_id=0, dtype=np.int64):
        self.pad_id = pad_id
        self.dtype = dtype

    def __call__(self, minibatch):
        out = batch_text_id(minibatch, pad_id=self.pad_id, dtype=self.dtype)
        return out


def batch_text_id(minibatch, pad_id=0, dtype=np.int64):
    """
    minibatch: List[Example]

@@ -20,26 +36,32 @@ def batch_text_id(minibatch, pad_id=0, dtype=np.int64):
    """
    peek_example = minibatch[0]
    assert len(peek_example.shape) == 1, "text example is a 1D tensor"

    lengths = [example.shape[0]
               for example in minibatch]  # assume (channel, n_samples) or (n_samples, )
    max_len = np.max(lengths)

    batch = []
    for example in minibatch:
        pad_len = max_len - example.shape[0]
        batch.append(
            np.pad(example, [(0, pad_len)],
                   mode='constant',
                   constant_values=pad_id))

    return np.array(batch, dtype=dtype)


class WavBatcher(object):
    def __init__(self, pad_value=0., dtype=np.float32):
        self.pad_value = pad_value
        self.dtype = dtype

    def __call__(self, minibatch):
        out = batch_wav(minibatch, pad_value=self.pad_value, dtype=self.dtype)
        return out


def batch_wav(minibatch, pad_value=0., dtype=np.float32):
    """
    minibatch: List[Example]

@@ -51,18 +73,25 @@ def batch_wav(minibatch, pad_value=0., dtype=np.float32):
        mono_channel = True
    elif len(peek_example.shape) == 2:
        mono_channel = False

    lengths = [example.shape[-1]
               for example in minibatch]  # assume (channel, n_samples) or (n_samples, )
    max_len = np.max(lengths)

    batch = []
    for example in minibatch:
        pad_len = max_len - example.shape[-1]
        if mono_channel:
            batch.append(
                np.pad(example, [(0, pad_len)],
                       mode='constant',
                       constant_values=pad_value))
        else:
            batch.append(
                np.pad(example, [(0, 0), (0, pad_len)],
                       mode='constant',
                       constant_values=pad_value))  # what about PCM, no

    return np.array(batch, dtype=dtype)


@@ -75,6 +104,7 @@ class SpecBatcher(object):
        out = batch_spec(minibatch, pad_value=self.pad_value, dtype=self.dtype)
        return out


def batch_spec(minibatch, pad_value=0., dtype=np.float32):
    """
    minibatch: List[Example]

@@ -86,16 +116,23 @@ def batch_spec(minibatch, pad_value=0., dtype=np.float32):
        mono_channel = True
    elif len(peek_example.shape) == 3:
        mono_channel = False

    lengths = [example.shape[-1]
               for example in minibatch]  # assume (channel, F, n_frame) or (F, n_frame)
    max_len = np.max(lengths)

    batch = []
    for example in minibatch:
        pad_len = max_len - example.shape[-1]
        if mono_channel:
            batch.append(
                np.pad(example, [(0, 0), (0, pad_len)],
                       mode='constant',
                       constant_values=pad_value))
        else:
            batch.append(
                np.pad(example, [(0, 0), (0, 0), (0, pad_len)],
                       mode='constant',
                       constant_values=pad_value))  # what about PCM, no

    return np.array(batch, dtype=dtype)
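All three batchers pad each example along its last axis to the longest example in the minibatch. A small usage sketch:

```python
import numpy as np
from parakeet.data.batch import batch_text_id, batch_spec  # module shown above

texts = [np.array([5, 3, 9]), np.array([7, 1])]
print(batch_text_id(texts, pad_id=0).shape)  # (2, 3); second row right-padded

specs = [np.random.rand(80, 50), np.random.rand(80, 64)]
print(batch_spec(specs).shape)               # (2, 80, 64); padded on last axis
```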
@@ -1,3 +1,18 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import six
from .sampler import SequentialSampler, RandomSampler, BatchSampler


@@ -84,7 +99,11 @@ class DataIterator(object):
        return minibatch

    def _next_index(self):
        return next(self._sampler_iter)
        if six.PY3:
            return next(self._sampler_iter)
        else:
            # six.PY2
            return self._sampler_iter.next()

    def __len__(self):
        return len(self._index_sampler)
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import six
import numpy as np
from tqdm import tqdm

@@ -10,8 +24,7 @@ class DatasetMixin(object):
        if isinstance(index, slice):
            start, stop, step = index.indices(len(self))
            return [
                self.get_example(i) for i in six.moves.range(start, stop, step)
            ]
        elif isinstance(index, (list, np.ndarray)):
            return [self.get_example(i) for i in index]

@@ -46,6 +59,7 @@ class TransformDataset(DatasetMixin):
        in_data = self._dataset[i]
        return self._transform(in_data)


class CacheDataset(DatasetMixin):
    def __init__(self, dataset):
        self._dataset = dataset

@@ -58,6 +72,7 @@ class CacheDataset(DatasetMixin):
    def get_example(self, i):
        return self._cache[i]


class TupleDataset(object):
    def __init__(self, *datasets):
        if not datasets:

@@ -133,7 +148,7 @@ class SliceDataset(DatasetMixin):
                format(len(order), len(dataset)))
        self._order = order

    def __len__(self):
        return self._size

    def get_example(self, i):

@@ -192,8 +207,7 @@ class ChainDataset(DatasetMixin):

    def get_example(self, i):
        if i < 0:
            raise IndexError("ChainDataset does not support negative indexing.")

        for dataset in self._datasets:
            if i < len(dataset):
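Because `DatasetMixin.__getitem__` handles ints, slices, and index arrays, a subclass only implements `get_example` and `__len__`. A toy example:

```python
from parakeet.data import DatasetMixin


class SquareDataset(DatasetMixin):
    """Toy dataset: get_example plus __len__ is all a subclass needs."""

    def __len__(self):
        return 10

    def get_example(self, i):
        return i * i


d = SquareDataset()
print(d[3])       # 9
print(d[2:8:2])   # [4, 16, 36]
print(d[[0, 9]])  # [0, 81]
```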
@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
In most cases we have a non-stream dataset, which means we can randomly access it with __getitem__ and get the length of the dataset with __len__.

@@ -6,10 +19,10 @@ This suffices for a sampler. We implement samplers as iterables of valid indices.
So the sampler is only responsible for generating valid indices.
"""

import numpy as np
import random


class Sampler(object):
    def __init__(self, data_source):
        pass

@@ -23,7 +36,7 @@ class Sampler(object):
class SequentialSampler(Sampler):
    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))

@@ -42,12 +55,14 @@ class RandomSampler(Sampler):
                             "replacement={}".format(self.replacement))

        if self._num_samples is not None and not replacement:
            raise ValueError(
                "With replacement=False, num_samples should not be specified, "
                "since a random permutation will be performed.")

        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(
                                 self.num_samples))

    @property
    def num_samples(self):

@@ -59,7 +74,9 @@ class RandomSampler(Sampler):
    def __iter__(self):
        n = len(self.data_source)
        if self.replacement:
            return iter(
                np.random.randint(
                    0, n, size=(self.num_samples, ), dtype=np.int64).tolist())
        return iter(np.random.permutation(n).tolist())

    def __len__(self):

@@ -76,7 +93,8 @@ class SubsetRandomSampler(Sampler):
        self.indices = indices

    def __iter__(self):
        return (self.indices[i]
                for i in np.random.permutation(len(self.indices)))

    def __len__(self):
        return len(self.indices)

@@ -89,9 +107,14 @@ class PartialyRandomizedSimilarTimeLengthSampler(Sampler):
    3. Permutate mini-batches
    """

    def __init__(self,
                 lengths,
                 batch_size=4,
                 batch_group_size=None,
                 permutate=True):
        _lengths = np.array(
            lengths,
            dtype=np.int64)  # maybe better to implement length as a sort key
        self.lengths = np.sort(_lengths)
        self.sorted_indices = np.argsort(_lengths)

@@ -112,20 +135,21 @@ class PartialyRandomizedSimilarTimeLengthSampler(Sampler):
        for i in range(len(indices) // batch_group_size):
            s = i * batch_group_size
            e = s + batch_group_size
            random.shuffle(indices[s:e])  # inplace

        # Permutate batches
        if self.permutate:
            perm = np.arange(len(indices[:e]) // self.batch_size)
            random.shuffle(perm)
            indices[:e] = indices[:e].reshape(
                -1, self.batch_size)[perm, :].reshape(-1)

        # Handle last elements
        s += batch_group_size
        # print(indices)
        if s < len(indices):
            random.shuffle(indices[s:])

        return iter(indices)

    def __len__(self):

@@ -150,14 +174,19 @@ class WeightedRandomSampler(Sampler):
    def __init__(self, weights, num_samples, replacement):
        if not isinstance(num_samples, int) or num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(
                                 num_samples))
        self.weights = np.array(weights, dtype=np.float64)
        self.num_samples = num_samples
        self.replacement = replacement

    def __iter__(self):
        return iter(
            np.random.choice(
                len(self.weights),
                size=(self.num_samples, ),
                replace=self.replacement,
                p=self.weights).tolist())

    def __len__(self):
        return self.num_samples

@@ -184,7 +213,7 @@ class DistributedSampler(Sampler):

        # Subset samples for each trainer.
        indices = indices[self.rank:self.total_size:self.num_trainers]
        assert len(indices) == self.num_samples

        return iter(indices)

@@ -209,8 +238,7 @@ class BatchSampler(Sampler):
    def __init__(self, sampler, batch_size, drop_last):
        if not isinstance(sampler, Sampler):
            raise ValueError("sampler should be an instance of "
                             "Sampler, but got sampler={}".format(sampler))
        if not isinstance(batch_size, int) or batch_size <= 0:
            raise ValueError("batch_size should be a positive integer value, "
                             "but got batch_size={}".format(batch_size))
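The strided slice in `DistributedSampler` deals each trainer an interleaved subset of the (padded) index list. With 8 samples and 2 trainers:

```python
indices = list(range(8))  # already padded to total_size
num_trainers = 2
for rank in range(num_trainers):
    print(rank, indices[rank:8:num_trainers])
# 0 [0, 2, 4, 6]
# 1 [1, 3, 5, 7]
```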
@@ -14,9 +14,4 @@ One of the reasons we choose to load data lazily (only load metadata beforehand)

For deep learning practice, we typically batch examples. So the dataset should come with a method to batch examples. Assuming a record is implemented as a tuple with several items, batching is trivial when an item is a fixed-size array: `np.stack` suffices. But arrays with dynamic sizes need padding. We decided to implement a batching method for each item; batching a record can then be composed from these methods. For a dataset, a `_batch_examples` method should be implemented, but in most cases you can choose one from `batching.py`.

That is it!
@@ -0,0 +1,13 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from pathlib import Path
import numpy as np
import pandas as pd
@ -1,3 +1,17 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from pathlib import Path
|
||||
import pandas as pd
|
||||
from ruamel.yaml import YAML
|
||||
|
@@ -11,23 +25,25 @@ from parakeet.data.dataset import Dataset
from parakeet.data.datacargo import DataCargo
from parakeet.data.batch import TextIDBatcher, WavBatcher


class VCTK(Dataset):
    def __init__(self, root):
        assert isinstance(root, (str, Path)), "root should be a string or Path object"
        assert isinstance(root, (
            str, Path)), "root should be a string or Path object"
        self.root = root if isinstance(root, Path) else Path(root)
        self.text_root = self.root.joinpath("txt")
        self.wav_root = self.root.joinpath("wav48")

        if not (self.root.joinpath("metadata.csv").exists() and
        if not (self.root.joinpath("metadata.csv").exists() and
                self.root.joinpath("speaker_indices.yaml").exists()):
            self._prepare_metadata()
        self.speaker_indices, self.metadata = self._load_metadata()

    def _load_metadata(self):
        yaml=YAML(typ='safe')
        yaml = YAML(typ='safe')
        speaker_indices = yaml.load(self.root.joinpath("speaker_indices.yaml"))
        metadata = pd.read_csv(self.root.joinpath("metadata.csv"),
                               sep="|", quoting=3, header=1)
        metadata = pd.read_csv(
            self.root.joinpath("metadata.csv"), sep="|", quoting=3, header=1)
        return speaker_indices, metadata

    def _prepare_metadata(self):

@@ -41,15 +57,19 @@ class VCTK(Dataset):
                with io.open(str(text_file)) as f:
                    transcription = f.read().strip()
                wav_file = text_file.with_suffix(".wav")
                metadata.append((wav_file.name, speaker_folder.name, transcription))
        metadata = pd.DataFrame.from_records(metadata,
                                             columns=["wave_file", "speaker", "text"])

                metadata.append(
                    (wav_file.name, speaker_folder.name, transcription))
        metadata = pd.DataFrame.from_records(
            metadata, columns=["wave_file", "speaker", "text"])

        # save them
        yaml=YAML(typ='safe')
        yaml = YAML(typ='safe')
        yaml.dump(speaker_to_index, self.root.joinpath("speaker_indices.yaml"))
        metadata.to_csv(self.root.joinpath("metadata.csv"),
                        sep="|", quoting=3, index=False)
        metadata.to_csv(
            self.root.joinpath("metadata.csv"),
            sep="|",
            quoting=3,
            index=False)

    def _get_example(self, metadatum):
        wave_file, speaker, text = metadatum

@@ -77,5 +97,3 @@ class VCTK(Dataset):
        speaker_batch = np.array(speaker_batch)
        phoneme_batch = TextIDBatcher(pad_id=0)(phoneme_batch)
        return wav_batch, speaker_batch, phoneme_batch
@@ -1,5 +1,4 @@
# coding: utf-8

"""Text processing frontend

All frontend modules should have the following functions:
@@ -32,6 +32,3 @@ def text_to_sequence(text, p=0.0):
    from ..text import text_to_sequence
    text = text_to_sequence(text, ["english_cleaners"])
    return text


@@ -12,6 +12,3 @@ def text_to_sequence(text, p=0.0):
    from ..text import text_to_sequence
    text = text_to_sequence(text, ["basic_cleaners"])
    return text


@@ -1,6 +1,5 @@
# coding: utf-8


import MeCab
import jaconv
from random import random

@@ -30,9 +29,9 @@ def _yomi(mecab_result):

def _mix_pronunciation(tokens, yomis, p):
    return "".join(
        yomis[idx] if yomis[idx] is not None and random() < p else tokens[idx]
        for idx in range(len(tokens)))
    return "".join(yomis[idx]
                   if yomis[idx] is not None and random() < p else tokens[idx]
                   for idx in range(len(tokens)))


def mix_pronunciation(text, p):

@@ -59,8 +58,7 @@ def normalize_delimitor(text):

def text_to_sequence(text, p=0.0):
    for c in [" ", " ", "「", "」", "『", "』", "・", "【", "】",
              "(", ")", "(", ")"]:
    for c in [" ", " ", "「", "」", "『", "』", "・", "【", "】", "(", ")", "(", ")"]:
        text = text.replace(c, "")
    text = text.replace("!", "!")
    text = text.replace("?", "?")

@@ -1,6 +1,5 @@
# coding: utf-8


from random import random

n_vocab = 0xffff

@@ -13,5 +12,6 @@ _tagger = None
def text_to_sequence(text, p=0.0):
    return [ord(c) for c in text] + [_eos]  # EOS


def sequence_to_text(seq):
    return "".join(chr(n) for n in seq)
@@ -1,8 +1,21 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import re
from . import cleaners
from .symbols import symbols


# Mappings from symbol to numeric ID and vice versa:
_symbol_to_id = {s: i for i, s in enumerate(symbols)}
_id_to_symbol = {i: s for i, s in enumerate(symbols)}

@@ -32,7 +45,8 @@ def text_to_sequence(text, cleaner_names):
        if not m:
            sequence += _symbols_to_sequence(_clean_text(text, cleaner_names))
            break
        sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names))
        sequence += _symbols_to_sequence(
            _clean_text(m.group(1), cleaner_names))
        sequence += _arpabet_to_sequence(m.group(2))
        text = m.group(3)
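The `text_to_sequence` hunk above repeatedly peels `plain text {ARPAbet} remainder` off the input. The sketch below illustrates the three groups such a curly-brace regex yields; the exact pattern is an assumption here, following the common Tacotron-style frontend:

```python
import re

_curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)')  # assumed pattern

m = _curly_re.match("Turn left on {HH AW1 S S T AH0 N} Street.")
print(m.group(1))  # "Turn left on "        -> cleaned, then mapped to symbols
print(m.group(2))  # "HH AW1 S S T AH0 N"   -> fed to _arpabet_to_sequence
print(m.group(3))  # " Street."             -> becomes `text` on the next pass
```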
@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Cleaners are transformations that run over the input text at both training and eval time.

@@ -14,31 +27,31 @@ import re
from unidecode import unidecode
from .numbers import normalize_numbers


# Regular expression matching whitespace:
_whitespace_re = re.compile(r'\s+')

# List of (regular expression, replacement) pairs for abbreviations:
_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [
    ('mrs', 'misess'),
    ('mr', 'mister'),
    ('dr', 'doctor'),
    ('st', 'saint'),
    ('co', 'company'),
    ('jr', 'junior'),
    ('maj', 'major'),
    ('gen', 'general'),
    ('drs', 'doctors'),
    ('rev', 'reverend'),
    ('lt', 'lieutenant'),
    ('hon', 'honorable'),
    ('sgt', 'sergeant'),
    ('capt', 'captain'),
    ('esq', 'esquire'),
    ('ltd', 'limited'),
    ('col', 'colonel'),
    ('ft', 'fort'),
]]
_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1])
                  for x in [
                      ('mrs', 'misess'),
                      ('mr', 'mister'),
                      ('dr', 'doctor'),
                      ('st', 'saint'),
                      ('co', 'company'),
                      ('jr', 'junior'),
                      ('maj', 'major'),
                      ('gen', 'general'),
                      ('drs', 'doctors'),
                      ('rev', 'reverend'),
                      ('lt', 'lieutenant'),
                      ('hon', 'honorable'),
                      ('sgt', 'sergeant'),
                      ('capt', 'captain'),
                      ('esq', 'esquire'),
                      ('ltd', 'limited'),
                      ('col', 'colonel'),
                      ('ft', 'fort'),
                  ]]


def expand_abbreviations(text):
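Each entry in the abbreviation table compiles to a case-insensitive, word-boundary-anchored pattern with a trailing period. A self-contained check with two entries (the `expand_abbreviations` loop body shown here is assumed, matching the usual cleaner implementation):

```python
import re

_abbr = [(re.compile('\\b%s\\.' % k, re.IGNORECASE), v)
         for k, v in [('mrs', 'misess'), ('dr', 'doctor')]]

def expand_abbreviations(text):
    for regex, replacement in _abbr:
        text = re.sub(regex, replacement, text)
    return text

print(expand_abbreviations("Dr. Smith met Mrs. Jones."))
# -> "doctor Smith met misess Jones."
```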
@@ -1,14 +1,28 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import re


valid_symbols = [
    'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1', 'AH2',
    'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0', 'AY1', 'AY2',
    'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0', 'ER1', 'ER2', 'EY',
    'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0', 'IH1', 'IH2', 'IY', 'IY0', 'IY1',
    'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OW0', 'OW1', 'OW2', 'OY', 'OY0',
    'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW',
    'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH'
    'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1',
    'AH2', 'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0',
    'AY1', 'AY2', 'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0',
    'ER1', 'ER2', 'EY', 'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0',
    'IH1', 'IH2', 'IY', 'IY0', 'IY1', 'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG',
    'OW', 'OW0', 'OW1', 'OW2', 'OY', 'OY0', 'OY1', 'OY2', 'P', 'R', 'S', 'SH',
    'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W',
    'Y', 'Z', 'ZH'
]

_valid_symbol_set = set(valid_symbols)

@@ -24,7 +38,10 @@ class CMUDict:
        else:
            entries = _parse_cmudict(file_or_path)
        if not keep_ambiguous:
            entries = {word: pron for word, pron in entries.items() if len(pron) == 1}
            entries = {
                word: pron
                for word, pron in entries.items() if len(pron) == 1
            }
        self._entries = entries

    def __len__(self):

@@ -3,7 +3,6 @@
import inflect
import re


_inflect = inflect.engine()
_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')

@@ -56,7 +55,8 @@ def _expand_number(m):
        elif num % 100 == 0:
            return _inflect.number_to_words(num // 100) + ' hundred'
        else:
            return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
            return _inflect.number_to_words(
                num, andword='', zero='oh', group=2).replace(', ', ' ')
    else:
        return _inflect.number_to_words(num, andword='')
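For intuition about the two branches of `_expand_number` above, a quick illustrative check of the `inflect` calls (the commented outputs are what these modes typically produce for year-like numbers):

```python
import inflect

_inflect = inflect.engine()

# num % 100 == 0 branch: 1500 -> "fifteen hundred"
print(_inflect.number_to_words(1500 // 100) + ' hundred')

# group=2 branch: digits read in pairs, zeros spoken as "oh",
# so 1906 comes out roughly as "nineteen oh six"
print(_inflect.number_to_words(1906, andword='', zero='oh', group=2)
      .replace(', ', ' '))
```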
@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Defines the set of symbols used in text input to the model.

@@ -0,0 +1,13 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@@ -0,0 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .net import *
from .parallel_wavenet import *
@@ -0,0 +1,169 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import itertools
import numpy as np
from scipy import signal
from tqdm import trange

import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
import paddle.fluid.initializer as I
import paddle.fluid.layers.distributions as D

from parakeet.modules.weight_norm import Conv2DTranspose
from parakeet.models.wavenet import crop, WaveNet, UpsampleNet
from parakeet.models.clarinet.parallel_wavenet import ParallelWaveNet
from parakeet.models.clarinet.utils import conv2d


# Gaussian IAF model
class Clarinet(dg.Layer):
    def __init__(self,
                 encoder,
                 teacher,
                 student,
                 stft,
                 min_log_scale=-6.0,
                 lmd=4.0):
        super(Clarinet, self).__init__()
        self.lmd = lmd
        self.encoder = encoder
        self.teacher = teacher
        self.student = student

        self.min_log_scale = min_log_scale
        self.stft = stft

    def forward(self, audio, mel, audio_start, clip_kl=True):
        """Compute loss for the distilled model.

        Arguments:
            audio {Variable} -- shape(batch_size, time_steps), target waveform.
            mel {Variable} -- shape(batch_size, condition_dim, time_steps // hop_length), original mel spectrogram, not upsampled yet.
            audio_start {Variable} -- shape(batch_size, ), the index of the start sample.
            clip_kl (bool) -- whether to clip kl divergence if it is greater than 10.0.

        Returns:
            Variable -- shape(1,), loss
        """

        batch_size, audio_length = audio.shape  # audio clip's length

        z = F.gaussian_random(audio.shape)
        condition = self.encoder(mel)  # (B, C, T)
        condition_slice = crop(condition, audio_start, audio_length)

        x, s_means, s_scales = self.student(z, condition_slice)  # all [0: T]
        s_means = s_means[:, 1:]  # (B, T-1), time steps [1: T]
        s_scales = s_scales[:, 1:]  # (B, T-1), time steps [1: T]
        s_clipped_scales = F.clip(s_scales, self.min_log_scale, 100.)

        # teacher outputs single gaussian
        y = self.teacher(x[:, :-1], condition_slice[:, :, 1:])
        _, t_means, t_scales = F.split(y, 3, -1)  # time steps [1: T]
        t_means = F.squeeze(t_means, [-1])  # (B, T-1), time steps [1: T]
        t_scales = F.squeeze(t_scales, [-1])  # (B, T-1), time steps [1: T]
        t_clipped_scales = F.clip(t_scales, self.min_log_scale, 100.)

        s_distribution = D.Normal(s_means, F.exp(s_clipped_scales))
        t_distribution = D.Normal(t_means, F.exp(t_clipped_scales))

        # kl divergence loss, so we only need to sample once? no MC
        kl = s_distribution.kl_divergence(t_distribution)
        if clip_kl:
            kl = F.clip(kl, -100., 10.)
        # context size dropped
        kl = F.reduce_mean(kl[:, self.teacher.context_size:])
        # major diff here
        regularization = F.mse_loss(t_scales[:, self.teacher.context_size:],
                                    s_scales[:, self.teacher.context_size:])

        # introduce information from real target
        spectrogram_frame_loss = F.mse_loss(
            self.stft.magnitude(audio), self.stft.magnitude(x))
        loss = kl + self.lmd * regularization + spectrogram_frame_loss
        loss_dict = {
            "loss": loss,
            "kl_divergence": kl,
            "regularization": regularization,
            "stft_loss": spectrogram_frame_loss
        }
        return loss_dict

    @dg.no_grad
    def synthesis(self, mel):
        """Synthesize waveform conditioned on the mel spectrogram.

        Arguments:
            mel {Variable} -- shape(batch_size, frequency_bands, frames)

        Returns:
            Variable -- shape(batch_size, frames * upsample_factor)
        """
        condition = self.encoder(mel)
        samples_shape = (condition.shape[0], condition.shape[-1])
        z = F.gaussian_random(samples_shape)
        x, s_means, s_scales = self.student(z, condition)
        return x


class STFT(dg.Layer):
    def __init__(self, n_fft, hop_length, win_length, window="hanning"):
        super(STFT, self).__init__()
        self.hop_length = hop_length
        self.n_bin = 1 + n_fft // 2
        self.n_fft = n_fft

        # calculate window
        window = signal.get_window(window, win_length)
        if n_fft != win_length:
            pad = (n_fft - win_length) // 2
            window = np.pad(window, ((pad, pad), ), 'constant')

        # calculate weights
        r = np.arange(0, n_fft)
        M = np.expand_dims(r, -1) * np.expand_dims(r, 0)
        w_real = np.reshape(window *
                            np.cos(2 * np.pi * M / n_fft)[:self.n_bin],
                            (self.n_bin, 1, 1, self.n_fft)).astype("float32")
        w_imag = np.reshape(window *
                            np.sin(-2 * np.pi * M / n_fft)[:self.n_bin],
                            (self.n_bin, 1, 1, self.n_fft)).astype("float32")

        w = np.concatenate([w_real, w_imag], axis=0)
        self.weight = dg.to_variable(w)

    def forward(self, x):
        # x(batch_size, time_steps)
        # pad it first with reflect mode
        pad_start = F.reverse(x[:, 1:1 + self.n_fft // 2], axis=1)
        pad_stop = F.reverse(x[:, -(1 + self.n_fft // 2):-1], axis=1)
        x = F.concat([pad_start, x, pad_stop], axis=-1)

        # to BC1T, C=1
        x = F.unsqueeze(x, axes=[1, 2])
        out = conv2d(x, self.weight, stride=(1, self.hop_length))
        real, imag = F.split(out, 2, dim=1)  # BC1T
        return real, imag

    def power(self, x):
        real, imag = self(x)
        power = real**2 + imag**2
        return power

    def magnitude(self, x):
        power = self.power(x)
        magnitude = F.sqrt(power)
        return magnitude
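The STFT layer above implements a short-time Fourier transform as a strided convolution whose kernels are windowed cosine and sine rows. A small NumPy check of that weight construction (self-contained; `np.hanning` stands in for the layer's `scipy` window):

```python
import numpy as np

n_fft = 8
window = np.hanning(n_fft)  # assumed window, only for this check
r = np.arange(n_fft)
M = np.outer(r, r)
n_bin = 1 + n_fft // 2
w_real = window * np.cos(2 * np.pi * M / n_fft)[:n_bin]   # (n_bin, n_fft)
w_imag = window * np.sin(-2 * np.pi * M / n_fft)[:n_bin]  # (n_bin, n_fft)

frame = np.random.randn(n_fft)
ref = np.fft.rfft(frame * window)
# correlating with the cos/sin rows reproduces one FFT frame exactly
assert np.allclose(w_real @ frame, ref.real)
assert np.allclose(w_imag @ frame, ref.imag)
```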
@@ -0,0 +1,69 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import math
import time
import itertools
import numpy as np

import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
import paddle.fluid.initializer as I
import paddle.fluid.layers.distributions as D

from parakeet.modules.weight_norm import Linear, Conv1D, Conv1DCell, Conv2DTranspose
from parakeet.models.wavenet import WaveNet


class ParallelWaveNet(dg.Layer):
    def __init__(self, n_loops, n_layers, residual_channels, condition_dim,
                 filter_size):
        super(ParallelWaveNet, self).__init__()
        self.flows = dg.LayerList()
        for n_loop, n_layer in zip(n_loops, n_layers):
            # teacher's log_scale_min does not matter here, -100 is a dummy value
            self.flows.append(
                WaveNet(n_loop, n_layer, residual_channels, 3, condition_dim,
                        filter_size, "mog", -100.0))

    def forward(self, z, condition=None):
        """Inverse Autoregressive Flow, implemented as several stacked WaveNets.

        Arguments:
            z {Variable} -- shape(batch_size, time_steps), hidden variable, sampled from a standard normal distribution.

        Keyword Arguments:
            condition {Variable} -- shape(batch_size, condition_dim, time_steps), condition, basically upsampled mel spectrogram. (default: {None})

        Returns:
            Variable -- shape(batch_size, time_steps), transformed z.
            Variable -- shape(batch_size, time_steps), output distribution's mu.
            Variable -- shape(batch_size, time_steps), output distribution's log_std.
        """

        for i, flow in enumerate(self.flows):
            theta = flow(z, condition)  # w, mu, log_std [0: T]
            w, mu, log_std = F.split(theta, 3, dim=-1)  # (B, T, 1) for each
            mu = F.squeeze(mu, [-1])  # [0: T]
            log_std = F.squeeze(log_std, [-1])  # [0: T]
            z = z * F.exp(log_std) + mu  # [0: T]

            if i == 0:
                out_mu = mu
                out_log_std = log_std
            else:
                out_mu = out_mu * F.exp(log_std) + mu
                out_log_std += log_std

        return z, out_mu, out_log_std
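The loop above composes one affine transform `z -> z * exp(log_std) + mu` per flow, then folds them into a single output mean and log-scale. A NumPy sanity sketch of that recursion (illustrative, no Paddle required):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=5)
m1, s1 = rng.normal(size=5), rng.normal(size=5)  # flow 1 (mu, log_std)
m2, s2 = rng.normal(size=5), rng.normal(size=5)  # flow 2 (mu, log_std)

step1 = z * np.exp(s1) + m1
step2 = step1 * np.exp(s2) + m2

out_mu = m1 * np.exp(s2) + m2  # the `out_mu = out_mu * F.exp(log_std) + mu` update
out_log_std = s1 + s2          # the `out_log_std += log_std` update
# two chained affine flows equal one affine flow with the folded parameters
assert np.allclose(step2, z * np.exp(out_log_std) + out_mu)
```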
@@ -0,0 +1,48 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from paddle import fluid
from paddle.fluid.core import ops


@fluid.framework.dygraph_only
def conv2d(input,
           weight,
           stride=(1, 1),
           padding=((0, 0), (0, 0)),
           dilation=(1, 1),
           groups=1,
           use_cudnn=True,
           data_format="NCHW"):
    padding = tuple(pad for pad_dim in padding for pad in pad_dim)

    inputs = {
        'Input': [input],
        'Filter': [weight],
    }
    attrs = {
        'strides': stride,
        'paddings': padding,
        'dilations': dilation,
        'groups': groups,
        'use_cudnn': use_cudnn,
        'use_mkldnn': False,
        'fuse_relu_before_depthwise_conv': False,
        "padding_algorithm": "EXPLICIT",
        "data_format": data_format,
    }

    outputs = ops.conv2d(inputs, attrs)
    out = outputs["Output"][0]
    return out
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from parakeet.models.deepvoice3.encoder import Encoder, ConvSpec
from parakeet.models.deepvoice3.decoder import Decoder, WindowRange
from parakeet.models.deepvoice3.converter import Converter
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np
from collections import namedtuple
from paddle import fluid

@@ -19,23 +33,19 @@ class Attention(dg.Layer):
                 value_projection=True):
        super(Attention, self).__init__()
        std = np.sqrt(1 / query_dim)
        self.query_proj = Linear(query_dim,
                                 embed_dim,
                                 param_attr=I.Normal(scale=std))
        self.query_proj = Linear(
            query_dim, embed_dim, param_attr=I.Normal(scale=std))
        if key_projection:
            std = np.sqrt(1 / embed_dim)
            self.key_proj = Linear(embed_dim,
                                   embed_dim,
                                   param_attr=I.Normal(scale=std))
            self.key_proj = Linear(
                embed_dim, embed_dim, param_attr=I.Normal(scale=std))
        if value_projection:
            std = np.sqrt(1 / embed_dim)
            self.value_proj = Linear(embed_dim,
                                     embed_dim,
                                     param_attr=I.Normal(scale=std))
            self.value_proj = Linear(
                embed_dim, embed_dim, param_attr=I.Normal(scale=std))
        std = np.sqrt(1 / embed_dim)
        self.out_proj = Linear(embed_dim,
                               query_dim,
                               param_attr=I.Normal(scale=std))
        self.out_proj = Linear(
            embed_dim, query_dim, param_attr=I.Normal(scale=std))

        self.key_projection = key_projection
        self.value_projection = value_projection

@@ -102,9 +112,8 @@ class Attention(dg.Layer):
        x = F.softmax(x)
        attn_scores = x
        x = F.dropout(x,
                      self.dropout,
                      dropout_implementation="upscale_in_train")
        x = F.dropout(
            x, self.dropout, dropout_implementation="upscale_in_train")
        x = F.matmul(x, values)
        encoder_length = keys.shape[1]
        # CAUTION: is it wrong? let it be now
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np

from paddle import fluid

@@ -15,6 +29,7 @@ class Conv1DGLU(dg.Layer):
    has residual connection from the input x, and scale the output by
    np.sqrt(0.5).
    """

    def __init__(self,
                 n_speakers,
                 speaker_dim,

@@ -50,20 +65,20 @@ class Conv1DGLU(dg.Layer):
            ), "this block uses residual connection, "\
                "the input_channels should equal num_filters"
        std = np.sqrt(std_mul * (1 - dropout) / (filter_size * in_channels))
        self.conv = Conv1DCell(in_channels,
                               2 * num_filters,
                               filter_size,
                               dilation,
                               causal,
                               param_attr=I.Normal(scale=std))
        self.conv = Conv1DCell(
            in_channels,
            2 * num_filters,
            filter_size,
            dilation,
            causal,
            param_attr=I.Normal(scale=std))

        if n_speakers > 1:
            assert (speaker_dim is not None
                    ), "speaker embed should not be null in multi-speaker case"
            std = np.sqrt(1 / speaker_dim)
            self.fc = Linear(speaker_dim,
                             num_filters,
                             param_attr=I.Normal(scale=std))
            self.fc = Linear(
                speaker_dim, num_filters, param_attr=I.Normal(scale=std))

    def forward(self, x, speaker_embed=None):
        """

@@ -82,9 +97,8 @@ class Conv1DGLU(dg.Layer):
            C_out means the output channels of Conv1DGLU.
        """
        residual = x
        x = F.dropout(x,
                      self.dropout,
                      dropout_implementation="upscale_in_train")
        x = F.dropout(
            x, self.dropout, dropout_implementation="upscale_in_train")
        x = self.conv(x)
        content, gate = F.split(x, num_or_sections=2, dim=1)

@@ -118,9 +132,8 @@ class Conv1DGLU(dg.Layer):
            C_out means the output channels of Conv1DGLU.
        """
        residual = x_t
        x_t = F.dropout(x_t,
                        self.dropout,
                        dropout_implementation="upscale_in_train")
        x_t = F.dropout(
            x_t, self.dropout, dropout_implementation="upscale_in_train")
        x_t = self.conv.add_input(x_t)
        content_t, gate_t = F.split(x_t, num_or_sections=2, dim=1)
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np
from itertools import chain

@@ -19,44 +33,45 @@ def upsampling_4x_blocks(n_speakers, speaker_dim, target_channels, dropout):
            2,
            stride=2,
            param_attr=I.Normal(scale=np.sqrt(1 / (2 * target_channels)))),
        Conv1DGLU(n_speakers,
                  speaker_dim,
                  target_channels,
                  target_channels,
                  3,
                  dilation=1,
                  std_mul=1.,
                  dropout=dropout),
        Conv1DGLU(n_speakers,
                  speaker_dim,
                  target_channels,
                  target_channels,
                  3,
                  dilation=3,
                  std_mul=4.,
                  dropout=dropout),
        Conv1DTranspose(
        Conv1DGLU(
            n_speakers,
            speaker_dim,
            target_channels,
            target_channels,
            2,
            stride=2,
            param_attr=I.Normal(scale=np.sqrt(4. / (2 * target_channels)))),
        Conv1DGLU(n_speakers,
                  speaker_dim,
                  target_channels,
                  target_channels,
                  3,
                  dilation=1,
                  std_mul=1.,
                  dropout=dropout),
        Conv1DGLU(n_speakers,
                  speaker_dim,
                  target_channels,
                  target_channels,
                  3,
                  dilation=3,
                  std_mul=4.,
                  dropout=dropout)
            3,
            dilation=1,
            std_mul=1.,
            dropout=dropout), Conv1DGLU(
                n_speakers,
                speaker_dim,
                target_channels,
                target_channels,
                3,
                dilation=3,
                std_mul=4.,
                dropout=dropout), Conv1DTranspose(
                    target_channels,
                    target_channels,
                    2,
                    stride=2,
                    param_attr=I.Normal(scale=np.sqrt(
                        4. / (2 * target_channels)))), Conv1DGLU(
                        n_speakers,
                        speaker_dim,
                        target_channels,
                        target_channels,
                        3,
                        dilation=1,
                        std_mul=1.,
                        dropout=dropout), Conv1DGLU(
                            n_speakers,
                            speaker_dim,
                            target_channels,
                            target_channels,
                            3,
                            dilation=3,
                            std_mul=4.,
                            dropout=dropout)
    ]
    return upsampling_convolutions
@@ -69,36 +84,38 @@ def upsampling_2x_blocks(n_speakers, speaker_dim, target_channels, dropout):
            2,
            stride=2,
            param_attr=I.Normal(scale=np.sqrt(1. / (2 * target_channels)))),
        Conv1DGLU(n_speakers,
                  speaker_dim,
                  target_channels,
                  target_channels,
                  3,
                  dilation=1,
                  std_mul=1.,
                  dropout=dropout),
        Conv1DGLU(n_speakers,
                  speaker_dim,
                  target_channels,
                  target_channels,
                  3,
                  dilation=3,
                  std_mul=4.,
                  dropout=dropout)
        Conv1DGLU(
            n_speakers,
            speaker_dim,
            target_channels,
            target_channels,
            3,
            dilation=1,
            std_mul=1.,
            dropout=dropout), Conv1DGLU(
                n_speakers,
                speaker_dim,
                target_channels,
                target_channels,
                3,
                dilation=3,
                std_mul=4.,
                dropout=dropout)
    ]
    return upsampling_convolutions


def upsampling_1x_blocks(n_speakers, speaker_dim, target_channels, dropout):
    upsampling_convolutions = [
        Conv1DGLU(n_speakers,
                  speaker_dim,
                  target_channels,
                  target_channels,
                  3,
                  dilation=3,
                  std_mul=4.,
                  dropout=dropout)
        Conv1DGLU(
            n_speakers,
            speaker_dim,
            target_channels,
            target_channels,
            3,
            dilation=3,
            std_mul=4.,
            dropout=dropout)
    ]
    return upsampling_convolutions
@@ -108,6 +125,7 @@ class Converter(dg.Layer):
    Vocoder that transforms mel spectrogram (or decoder hidden states)
    to waveform.
    """

    def __init__(self,
                 n_speakers,
                 speaker_dim,
@@ -161,33 +179,36 @@ class Converter(dg.Layer):
                std = np.sqrt(std_mul / in_channels)
                # CAUTION: relu
                self.convolutions.append(
                    Conv1D(in_channels,
                           out_channels,
                           1,
                           act="relu",
                           param_attr=I.Normal(scale=std)))
                    Conv1D(
                        in_channels,
                        out_channels,
                        1,
                        act="relu",
                        param_attr=I.Normal(scale=std)))
                in_channels = out_channels
                std_mul = 2.0
            self.convolutions.append(
                Conv1DGLU(n_speakers,
                          speaker_dim,
                          in_channels,
                          out_channels,
                          filter_size,
                          dilation=dilation,
                          std_mul=std_mul,
                          dropout=dropout))
                Conv1DGLU(
                    n_speakers,
                    speaker_dim,
                    in_channels,
                    out_channels,
                    filter_size,
                    dilation=dilation,
                    std_mul=std_mul,
                    dropout=dropout))
            in_channels = out_channels
            std_mul = 4.0

        # final conv proj, channel transformed to linear dim
        std = np.sqrt(std_mul * (1 - dropout) / in_channels)
        # CAUTION: sigmoid
        self.last_conv_proj = Conv1D(in_channels,
                                     linear_dim,
                                     1,
                                     act="sigmoid",
                                     param_attr=I.Normal(scale=std))
        self.last_conv_proj = Conv1D(
            in_channels,
            linear_dim,
            1,
            act="sigmoid",
            param_attr=I.Normal(scale=std))

    def forward(self, x, speaker_embed=None):
        """

@@ -229,4 +250,4 @@ class Converter(dg.Layer):
        out = self.last_conv_proj(x)
        out = F.transpose(out, [0, 2, 1])
        return out
        return out
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np
import paddle.fluid.layers as F
import paddle.fluid.initializer as I

@@ -80,25 +94,25 @@ def unfold_adjacent_frames(folded_frames, r):

class Decoder(dg.Layer):
    def __init__(
            self,
            n_speakers,
            speaker_dim,
            embed_dim,
            mel_dim,
            r=1,
            max_positions=512,
            padding_idx=None,  # remove it!
            preattention=(ConvSpec(128, 5, 1), ) * 4,
            convolutions=(ConvSpec(128, 5, 1), ) * 4,
            attention=True,
            dropout=0.0,
            use_memory_mask=False,
            force_monotonic_attention=False,
            query_position_rate=1.0,
            key_position_rate=1.0,
            window_range=WindowRange(-1, 3),
            key_projection=True,
            value_projection=True):
            self,
            n_speakers,
            speaker_dim,
            embed_dim,
            mel_dim,
            r=1,
            max_positions=512,
            padding_idx=None,  # remove it!
            preattention=(ConvSpec(128, 5, 1), ) * 4,
            convolutions=(ConvSpec(128, 5, 1), ) * 4,
            attention=True,
            dropout=0.0,
            use_memory_mask=False,
            force_monotonic_attention=False,
            query_position_rate=1.0,
            key_position_rate=1.0,
            window_range=WindowRange(-1, 3),
            key_projection=True,
            value_projection=True):
        super(Decoder, self).__init__()

        self.dropout = dropout
@@ -111,23 +125,17 @@ class Decoder(dg.Layer):
        conv_channels = convolutions[0].out_channels
        # only when padding idx is 0 can we easily handle it
        self.embed_keys_positions = PositionEmbedding(max_positions,
                                                      embed_dim,
                                                      padding_idx=0)
        self.embed_query_positions = PositionEmbedding(max_positions,
                                                       conv_channels,
                                                       padding_idx=0)
        self.embed_keys_positions = PositionEmbedding(
            max_positions, embed_dim, padding_idx=0)
        self.embed_query_positions = PositionEmbedding(
            max_positions, conv_channels, padding_idx=0)

        if n_speakers > 1:
            std = np.sqrt((1 - dropout) / speaker_dim)
            self.speaker_proj1 = Linear(speaker_dim,
                                        1,
                                        act="sigmoid",
                                        param_attr=I.Normal(scale=std))
            self.speaker_proj2 = Linear(speaker_dim,
                                        1,
                                        act="sigmoid",
                                        param_attr=I.Normal(scale=std))
            self.speaker_proj1 = Linear(
                speaker_dim, 1, act="sigmoid", param_attr=I.Normal(scale=std))
            self.speaker_proj2 = Linear(
                speaker_dim, 1, act="sigmoid", param_attr=I.Normal(scale=std))

        # prenet
        self.prenet = dg.LayerList()
@@ -138,24 +146,26 @@ class Decoder(dg.Layer):
            # conv1d & relu
            std = np.sqrt(std_mul / in_channels)
            self.prenet.append(
                Conv1D(in_channels,
                       out_channels,
                       1,
                       act="relu",
                       param_attr=I.Normal(scale=std)))
                Conv1D(
                    in_channels,
                    out_channels,
                    1,
                    act="relu",
                    param_attr=I.Normal(scale=std)))
            in_channels = out_channels
            std_mul = 2.0
        self.prenet.append(
            Conv1DGLU(n_speakers,
                      speaker_dim,
                      in_channels,
                      out_channels,
                      filter_size,
                      dilation,
                      std_mul,
                      dropout,
                      causal=True,
                      residual=True))
            Conv1DGLU(
                n_speakers,
                speaker_dim,
                in_channels,
                out_channels,
                filter_size,
                dilation,
                std_mul,
                dropout,
                causal=True,
                residual=True))
        in_channels = out_channels
        std_mul = 4.0

@@ -184,16 +194,17 @@ class Decoder(dg.Layer):
            assert (
                in_channels == out_channels
            ), "the stack of convolution & attention does not change channels"
            conv_layer = Conv1DGLU(n_speakers,
                                   speaker_dim,
                                   in_channels,
                                   out_channels,
                                   filter_size,
                                   dilation,
                                   std_mul,
                                   dropout,
                                   causal=True,
                                   residual=False)
            conv_layer = Conv1DGLU(
                n_speakers,
                speaker_dim,
                in_channels,
                out_channels,
                filter_size,
                dilation,
                std_mul,
                dropout,
                causal=True,
                residual=False)
            attn_layer = Attention(
                out_channels,
                embed_dim,

@@ -211,10 +222,8 @@ class Decoder(dg.Layer):
        # 1 * 1 conv to transform channels
        std = np.sqrt(std_mul * (1 - dropout) / in_channels)
        self.last_conv = Conv1D(in_channels,
                                mel_dim * r,
                                1,
                                param_attr=I.Normal(scale=std))
        self.last_conv = Conv1D(
            in_channels, mel_dim * r, 1, param_attr=I.Normal(scale=std))

        # mel (before sigmoid) to done hat
        std = np.sqrt(1 / in_channels)

@@ -308,9 +317,8 @@ class Decoder(dg.Layer):
        # (B, C, T)
        frames = F.transpose(frames, [0, 2, 1])
        x = frames
        x = F.dropout(x,
                      self.dropout,
                      dropout_implementation="upscale_in_train")
        x = F.dropout(
            x, self.dropout, dropout_implementation="upscale_in_train")
        # Prenet
        for layer in self.prenet:
            if isinstance(layer, Conv1DGLU):

@@ -408,14 +416,13 @@ class Decoder(dg.Layer):
            test_inputs = fold_adjacent_frames(test_inputs, self.r)
            test_inputs = F.transpose(test_inputs, [0, 2, 1])

        initial_input = F.zeros((batch_size, self.mel_dim * self.r, 1),
                                dtype=keys.dtype)
        initial_input = F.zeros(
            (batch_size, self.mel_dim * self.r, 1), dtype=keys.dtype)

        t = 0  # decoder time step
        while True:
            frame_pos = F.fill_constant((batch_size, 1),
                                        value=t + 1,
                                        dtype="int64")
            frame_pos = F.fill_constant(
                (batch_size, 1), value=t + 1, dtype="int64")
            w = self.query_position_rate
            if self.n_speakers > 1:
                w = w * F.squeeze(self.speaker_proj2(speaker_embed), [-1])

@@ -433,9 +440,8 @@ class Decoder(dg.Layer):
                current_input = initial_input

            x_t = current_input
            x_t = F.dropout(x_t,
                            self.dropout,
                            dropout_implementation="upscale_in_train")
            x_t = F.dropout(
                x_t, self.dropout, dropout_implementation="upscale_in_train")

            # Prenet
            for layer in self.prenet:

@@ -453,15 +459,15 @@ class Decoder(dg.Layer):
                    x_t = F.transpose(x_t, [0, 2, 1])
                    if frame_pos_embed is not None:
                        x_t += frame_pos_embed
                    x_t, attn_scores = attn(
                        x_t, (keys, values), mask,
                        last_attended[i] if test_inputs is None else None)
                    x_t, attn_scores = attn(x_t, (keys, values), mask,
                                            last_attended[i]
                                            if test_inputs is None else None)
                    x_t = F.transpose(x_t, [0, 2, 1])
                    step_attn_scores.append(attn_scores)  # (B, T_dec=1, T_enc)
                    # update last attended when necessary
                    if self.force_monotonic_attention[i]:
                        last_attended[i] = np.argmax(attn_scores.numpy(),
                                                     axis=-1)[0][0]
                        last_attended[i] = np.argmax(
                            attn_scores.numpy(), axis=-1)[0][0]
                x_t = F.scale(residual + x_t, np.sqrt(0.5))
            if len(step_attn_scores):
                # (B, 1, T_enc) again

@@ -485,8 +491,8 @@ class Decoder(dg.Layer):
            t += 1

            if test_inputs is None:
                if F.reduce_min(done_t).numpy(
                )[0] > 0.5 and t > self.min_decoder_steps:
                if F.reduce_min(done_t).numpy()[
                        0] > 0.5 and t > self.min_decoder_steps:
                    break
            elif t > self.max_decoder_steps:
                break
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np
from collections import namedtuple

@@ -33,14 +47,16 @@ class Encoder(dg.Layer):
        self.dropout = dropout
        if n_speakers > 1:
            std = np.sqrt((1 - dropout) / speaker_dim)
            self.sp_proj1 = Linear(speaker_dim,
                                   embed_dim,
                                   act="softsign",
                                   param_attr=I.Normal(scale=std))
            self.sp_proj2 = Linear(speaker_dim,
                                   embed_dim,
                                   act="softsign",
                                   param_attr=I.Normal(scale=std))
            self.sp_proj1 = Linear(
                speaker_dim,
                embed_dim,
                act="softsign",
                param_attr=I.Normal(scale=std))
            self.sp_proj2 = Linear(
                speaker_dim,
                embed_dim,
                act="softsign",
                param_attr=I.Normal(scale=std))
        self.n_speakers = n_speakers

        self.convolutions = dg.LayerList()

@@ -51,31 +67,34 @@ class Encoder(dg.Layer):
            if in_channels != out_channels:
                std = np.sqrt(std_mul / in_channels)
                self.convolutions.append(
                    Conv1D(in_channels,
                           out_channels,
                           1,
                           act="relu",
                           param_attr=I.Normal(scale=std)))
                    Conv1D(
                        in_channels,
                        out_channels,
                        1,
                        act="relu",
                        param_attr=I.Normal(scale=std)))
                in_channels = out_channels
                std_mul = 2.0

            self.convolutions.append(
                Conv1DGLU(n_speakers,
                          speaker_dim,
                          in_channels,
                          out_channels,
                          filter_size,
                          dilation,
                          std_mul,
                          dropout,
                          causal=False,
                          residual=True))
                Conv1DGLU(
                    n_speakers,
                    speaker_dim,
                    in_channels,
                    out_channels,
                    filter_size,
                    dilation,
                    std_mul,
                    dropout,
                    causal=False,
                    residual=True))
            in_channels = out_channels
            std_mul = 4.0

        std = np.sqrt(std_mul * (1 - dropout) / in_channels)
        self.convolutions.append(
            Conv1D(in_channels, embed_dim, 1, param_attr=I.Normal(scale=std)))
            Conv1D(
                in_channels, embed_dim, 1, param_attr=I.Normal(scale=std)))

    def forward(self, x, speaker_embed=None):
        """

@@ -96,9 +115,8 @@ class Encoder(dg.Layer):
            representation for values.
        """
        x = self.embed(x)
        x = F.dropout(x,
                      self.dropout,
                      dropout_implementation="upscale_in_train")
        x = F.dropout(
            x, self.dropout, dropout_implementation="upscale_in_train")
        x = F.transpose(x, [0, 2, 1])

        if self.n_speakers > 1 and speaker_embed is not None:
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np
from numba import jit

@@ -31,9 +45,7 @@ def guided_attention(N, max_N, T, max_T, g):
    return W


def guided_attentions(encoder_lengths,
                      decoder_lengths,
                      max_decoder_len,
def guided_attentions(encoder_lengths, decoder_lengths, max_decoder_len,
                      g=0.2):
    B = len(encoder_lengths)
    max_input_len = encoder_lengths.max()
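For reference, the matrix `guided_attention` builds is the standard soft-diagonal penalty used for guided-attention training. A NumPy sketch of the formula (axis order and edge handling are assumptions here; the numba implementation above is authoritative):

```python
import numpy as np

def guided_attention_sketch(N, max_N, T, max_T, g=0.2):
    # Penalty is ~0 near the diagonal n/N == t/T and approaches 1 away from it.
    W = np.zeros((max_N, max_T), dtype=np.float32)
    n = np.arange(N)[:, None] / N  # normalized encoder positions
    t = np.arange(T)[None, :] / T  # normalized decoder positions
    W[:N, :T] = 1.0 - np.exp(-(n - t)**2 / (2 * g * g))
    return W
```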
@@ -93,9 +105,8 @@ class TTSLoss(object):
    def binary_divergence(self, prediction, target, mask):
        flattened_prediction = F.reshape(prediction, [-1, 1])
        flattened_target = F.reshape(target, [-1, 1])
        flattened_loss = F.log_loss(flattened_prediction,
                                    flattened_target,
                                    epsilon=1e-8)
        flattened_loss = F.log_loss(
            flattened_prediction, flattened_target, epsilon=1e-8)
        bin_div = fluid.layers.reshape(flattened_loss, prediction.shape)

        w = self.masked_weight

@@ -163,23 +174,20 @@ class TTSLoss(object):
        max_mel_steps = max_frames // self.downsample_factor
        max_decoder_steps = max_mel_steps // self.r

        decoder_mask = F.sequence_mask(n_frames // self.downsample_factor //
                                       self.r,
                                       max_decoder_steps,
                                       dtype="float32")
        mel_mask = F.sequence_mask(n_frames // self.downsample_factor,
                                   max_mel_steps,
                                   dtype="float32")
        decoder_mask = F.sequence_mask(
            n_frames // self.downsample_factor // self.r,
            max_decoder_steps,
            dtype="float32")
        mel_mask = F.sequence_mask(
            n_frames // self.downsample_factor, max_mel_steps, dtype="float32")
        lin_mask = F.sequence_mask(n_frames, max_frames, dtype="float32")

        if compute_lin_loss:
            lin_hyp = lin_hyp[:, :-self.time_shift, :]
            lin_ref = lin_ref[:, self.time_shift:, :]
            lin_mask = lin_mask[:, self.time_shift:, :]
            lin_l1_loss = self.l1_loss(lin_hyp,
                                       lin_ref,
                                       lin_mask,
                                       priority_bin=self.priority_bin)
            lin_l1_loss = self.l1_loss(
                lin_hyp, lin_ref, lin_mask, priority_bin=self.priority_bin)
            lin_bce_loss = self.binary_divergence(lin_hyp, lin_ref, lin_mask)
            lin_loss = self.binary_divergence_weight * lin_bce_loss \
                + (1 - self.binary_divergence_weight) * lin_l1_loss

@@ -197,9 +205,10 @@ class TTSLoss(object):
            total_loss += mel_loss

        if compute_attn_loss:
            attn_loss = self.attention_loss(
                attn_hyp, input_lengths.numpy(),
                n_frames.numpy() // (self.downsample_factor * self.r))
            attn_loss = self.attention_loss(attn_hyp,
                                            input_lengths.numpy(),
                                            n_frames.numpy() //
                                            (self.downsample_factor * self.r))
            total_loss += attn_loss

        if compute_done_loss:
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np

import paddle.fluid.layers as F

@@ -29,9 +43,9 @@ class DeepVoice3(dg.Layer):
        mel_outputs, alignments, done, decoder_states = self.decoder(
            (keys, values), valid_lengths, mel_inputs, text_positions,
            frame_positions, speaker_embed)
        linear_outputs = self.converter(
            decoder_states if self.use_decoder_states else mel_outputs,
            speaker_embed)
        linear_outputs = self.converter(decoder_states
                                        if self.use_decoder_states else
                                        mel_outputs, speaker_embed)
        return mel_outputs, linear_outputs, alignments, done

    def transduce(self, text_sequences, text_positions, speaker_indices=None):

@@ -43,7 +57,7 @@ class DeepVoice3(dg.Layer):
        keys, values = self.encoder(text_sequences, speaker_embed)
        mel_outputs, alignments, done, decoder_states = self.decoder.decode(
            (keys, values), text_positions, speaker_embed)
        linear_outputs = self.converter(
            decoder_states if self.use_decoder_states else mel_outputs,
            speaker_embed)
        linear_outputs = self.converter(decoder_states
                                        if self.use_decoder_states else
                                        mel_outputs, speaker_embed)
        return mel_outputs, linear_outputs, alignments, done
@@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np
from paddle import fluid
import paddle.fluid.layers as F

@@ -95,10 +109,11 @@ class PositionEmbedding(dg.Layer):
                                speaker_position_rate)  # (B, V, C)
            # make indices for gather_nd
            batch_id = F.expand(
                F.unsqueeze(F.range(0, batch_size, 1, dtype="int64"), [1]),
                [1, time_steps])
                F.unsqueeze(
                    F.range(
                        0, batch_size, 1, dtype="int64"), [1]), [1, time_steps])
            # (B, T, 2)
            gather_nd_id = F.stack([batch_id, indices], -1)

            out = F.gather_nd(weight, gather_nd_id)
            return out
            return out
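In NumPy terms, the `gather_nd` above is plain advanced indexing: pair a batch-id grid with per-step position indices and gather one embedding row per (batch, time) pair. An illustrative equivalent with toy shapes:

```python
import numpy as np

weight = np.random.randn(2, 5, 4)           # (B, V, C) per-batch position table
indices = np.array([[0, 1, 2], [0, 2, 4]])  # (B, T) positions per time step
batch_id = np.repeat(np.arange(2)[:, None], 3, axis=1)  # (B, T) row ids

out = weight[batch_id, indices]             # (B, T, C)
# stacking (batch_id, indices) on the last axis and calling gather_nd
# returns exactly this array
```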
@ -0,0 +1,13 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
|
@@ -1,8 +1,22 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
from parakeet.models.transformer_tts.utils import *
from parakeet.models.fastspeech.fft_block import FFTBlock


class Decoder(dg.Layer):
    def __init__(self,
                 len_max_seq,

@@ -19,16 +33,29 @@ class Decoder(dg.Layer):

        n_position = len_max_seq + 1
        self.n_head = n_head
        self.pos_inp = get_sinusoid_encoding_table(n_position, d_model, padding_idx=0)
        self.position_enc = dg.Embedding(size=[n_position, d_model],
                                         padding_idx=0,
                                         param_attr=fluid.ParamAttr(
                                             initializer=fluid.initializer.NumpyArrayInitializer(self.pos_inp),
                                             trainable=False))
        self.layer_stack = [FFTBlock(d_model, d_inner, n_head, d_k, d_v, fft_conv1d_kernel, fft_conv1d_padding, dropout=dropout) for _ in range(n_layers)]
        self.pos_inp = get_sinusoid_encoding_table(
            n_position, d_model, padding_idx=0)
        self.position_enc = dg.Embedding(
            size=[n_position, d_model],
            padding_idx=0,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.NumpyArrayInitializer(
                    self.pos_inp),
                trainable=False))
        self.layer_stack = [
            FFTBlock(
                d_model,
                d_inner,
                n_head,
                d_k,
                d_v,
                fft_conv1d_kernel,
                fft_conv1d_padding,
                dropout=dropout) for _ in range(n_layers)
        ]
        for i, layer in enumerate(self.layer_stack):
            self.add_sublayer('fft_{}'.format(i), layer)

    def forward(self, enc_seq, enc_pos, non_pad_mask, slf_attn_mask=None):
        """
        Decoder layer of FastSpeech.

@@ -55,4 +82,4 @@ class Decoder(dg.Layer):
                slf_attn_mask=slf_attn_mask)
            dec_slf_attn_list += [dec_slf_attn]

        return dec_output, dec_slf_attn_list
        return dec_output, dec_slf_attn_list

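Both the FastSpeech encoder and decoder initialize `position_enc` from `get_sinusoid_encoding_table`, which is imported from `parakeet.models.transformer_tts.utils` and not part of this diff. A hedged sketch of the standard Transformer sinusoid table it is expected to return:

```python
import numpy as np

def sinusoid_table(n_position, d_model, padding_idx=None):
    # pos / 10000^(2i/d_model); sin on even channels, cos on odd channels.
    pos = np.arange(n_position)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, 2 * (i // 2) / d_model)
    table = np.zeros((n_position, d_model), dtype="float32")
    table[:, 0::2] = np.sin(angle[:, 0::2])
    table[:, 1::2] = np.cos(angle[:, 1::2])
    if padding_idx is not None:
        table[padding_idx] = 0.0  # row reserved for padding stays zero
    return table
```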
@@ -1,8 +1,22 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
from parakeet.models.transformer_tts.utils import *
from parakeet.models.fastspeech.fft_block import FFTBlock


class Encoder(dg.Layer):
    def __init__(self,
                 n_src_vocab,

@@ -20,14 +34,30 @@ class Encoder(dg.Layer):
        n_position = len_max_seq + 1
        self.n_head = n_head

        self.src_word_emb = dg.Embedding(size=[n_src_vocab, d_model], padding_idx=0,
                                         param_attr=fluid.initializer.Normal(loc=0.0, scale=1.0))
        self.pos_inp = get_sinusoid_encoding_table(n_position, d_model, padding_idx=0)
        self.position_enc = dg.Embedding(size=[n_position, d_model],
                                         param_attr=fluid.ParamAttr(
                                             initializer=fluid.initializer.NumpyArrayInitializer(self.pos_inp),
                                             trainable=False))
        self.layer_stack = [FFTBlock(d_model, d_inner, n_head, d_k, d_v, fft_conv1d_kernel, fft_conv1d_padding, dropout=dropout) for _ in range(n_layers)]
        self.src_word_emb = dg.Embedding(
            size=[n_src_vocab, d_model],
            padding_idx=0,
            param_attr=fluid.initializer.Normal(
                loc=0.0, scale=1.0))
        self.pos_inp = get_sinusoid_encoding_table(
            n_position, d_model, padding_idx=0)
        self.position_enc = dg.Embedding(
            size=[n_position, d_model],
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.NumpyArrayInitializer(
                    self.pos_inp),
                trainable=False))
        self.layer_stack = [
            FFTBlock(
                d_model,
                d_inner,
                n_head,
                d_k,
                d_v,
                fft_conv1d_kernel,
                fft_conv1d_padding,
                dropout=dropout) for _ in range(n_layers)
        ]
        for i, layer in enumerate(self.layer_stack):
            self.add_sublayer('fft_{}'.format(i), layer)

@@ -50,7 +80,8 @@ class Encoder(dg.Layer):
            slf_attn_mask = layers.expand(slf_attn_mask, [self.n_head, 1, 1])

        # -- Forward
        enc_output = self.src_word_emb(character) + self.position_enc(text_pos) #(N, T, C)
        enc_output = self.src_word_emb(character) + self.position_enc(
            text_pos)  #(N, T, C)

        for enc_layer in self.layer_stack:
            enc_output, enc_slf_attn = enc_layer(

@@ -58,5 +89,5 @@ class Encoder(dg.Layer):
                non_pad_mask=non_pad_mask,
                slf_attn_mask=slf_attn_mask)
            enc_slf_attn_list += [enc_slf_attn]

        return enc_output, enc_slf_attn_list

        return enc_output, enc_slf_attn_list

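The encoder's `forward` consumes `non_pad_mask` and `slf_attn_mask`, which this diff only expands across attention heads. Under the usual convention that position id 0 marks padding (an assumption; the mask builders are not shown here), they could be constructed like this:

```python
import numpy as np

def build_masks(text_pos):
    # text_pos: (B, T) int array, 0 marks padding positions.
    non_pad = (text_pos != 0).astype("float32")[:, :, None]   # (B, T, 1)
    # slf_attn_mask is nonzero where a query must NOT attend (key is padding).
    pad_key = (text_pos == 0)[:, None, :]                      # (B, 1, T)
    slf_attn = np.repeat(pad_key, text_pos.shape[1], axis=1)   # (B, T, T)
    return non_pad, slf_attn.astype("float32")
```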
@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import numpy as np
import paddle.fluid.dygraph as dg

@@ -9,56 +22,71 @@ from parakeet.models.fastspeech.length_regulator import LengthRegulator
from parakeet.models.fastspeech.encoder import Encoder
from parakeet.models.fastspeech.decoder import Decoder


class FastSpeech(dg.Layer):
    def __init__(self, cfg):
        " FastSpeech"
        super(FastSpeech, self).__init__()

        self.encoder = Encoder(n_src_vocab=len(symbols)+1,
                               len_max_seq=cfg['max_seq_len'],
                               n_layers=cfg['encoder_n_layer'],
                               n_head=cfg['encoder_head'],
                               d_k=cfg['fs_hidden_size'] // cfg['encoder_head'],
                               d_v=cfg['fs_hidden_size'] // cfg['encoder_head'],
                               d_model=cfg['fs_hidden_size'],
                               d_inner=cfg['encoder_conv1d_filter_size'],
                               fft_conv1d_kernel=cfg['fft_conv1d_filter'],
                               fft_conv1d_padding=cfg['fft_conv1d_padding'],
                               dropout=0.1)
        self.length_regulator = LengthRegulator(input_size=cfg['fs_hidden_size'],
                                                out_channels=cfg['duration_predictor_output_size'],
                                                filter_size=cfg['duration_predictor_filter_size'],
                                                dropout=cfg['dropout'])
        self.decoder = Decoder(len_max_seq=cfg['max_seq_len'],
                               n_layers=cfg['decoder_n_layer'],
                               n_head=cfg['decoder_head'],
                               d_k=cfg['fs_hidden_size'] // cfg['decoder_head'],
                               d_v=cfg['fs_hidden_size'] // cfg['decoder_head'],
                               d_model=cfg['fs_hidden_size'],
                               d_inner=cfg['decoder_conv1d_filter_size'],
                               fft_conv1d_kernel=cfg['fft_conv1d_filter'],
                               fft_conv1d_padding=cfg['fft_conv1d_padding'],
                               dropout=0.1)
        self.weight = fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer())
        self.encoder = Encoder(
            n_src_vocab=len(symbols) + 1,
            len_max_seq=cfg['max_seq_len'],
            n_layers=cfg['encoder_n_layer'],
            n_head=cfg['encoder_head'],
            d_k=cfg['fs_hidden_size'] // cfg['encoder_head'],
            d_v=cfg['fs_hidden_size'] // cfg['encoder_head'],
            d_model=cfg['fs_hidden_size'],
            d_inner=cfg['encoder_conv1d_filter_size'],
            fft_conv1d_kernel=cfg['fft_conv1d_filter'],
            fft_conv1d_padding=cfg['fft_conv1d_padding'],
            dropout=0.1)
        self.length_regulator = LengthRegulator(
            input_size=cfg['fs_hidden_size'],
            out_channels=cfg['duration_predictor_output_size'],
            filter_size=cfg['duration_predictor_filter_size'],
            dropout=cfg['dropout'])
        self.decoder = Decoder(
            len_max_seq=cfg['max_seq_len'],
            n_layers=cfg['decoder_n_layer'],
            n_head=cfg['decoder_head'],
            d_k=cfg['fs_hidden_size'] // cfg['decoder_head'],
            d_v=cfg['fs_hidden_size'] // cfg['decoder_head'],
            d_model=cfg['fs_hidden_size'],
            d_inner=cfg['decoder_conv1d_filter_size'],
            fft_conv1d_kernel=cfg['fft_conv1d_filter'],
            fft_conv1d_padding=cfg['fft_conv1d_padding'],
            dropout=0.1)
        self.weight = fluid.ParamAttr(
            initializer=fluid.initializer.XavierInitializer())
        k = math.sqrt(1 / cfg['fs_hidden_size'])
        self.bias = fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k))
        self.mel_linear = dg.Linear(cfg['fs_hidden_size'],
                                    cfg['audio']['num_mels']* cfg['audio']['outputs_per_step'],
                                    param_attr = self.weight,
                                    bias_attr = self.bias,)
        self.postnet = PostConvNet(n_mels=cfg['audio']['num_mels'],
                                   num_hidden=512,
                                   filter_size=5,
                                   padding=int(5 / 2),
                                   num_conv=5,
                                   outputs_per_step=cfg['audio']['outputs_per_step'],
                                   use_cudnn=True,
                                   dropout=0.1,
                                   batchnorm_last=True)
        self.bias = fluid.ParamAttr(initializer=fluid.initializer.Uniform(
            low=-k, high=k))
        self.mel_linear = dg.Linear(
            cfg['fs_hidden_size'],
            cfg['audio']['num_mels'] * cfg['audio']['outputs_per_step'],
            param_attr=self.weight,
            bias_attr=self.bias, )
        self.postnet = PostConvNet(
            n_mels=cfg['audio']['num_mels'],
            num_hidden=512,
            filter_size=5,
            padding=int(5 / 2),
            num_conv=5,
            outputs_per_step=cfg['audio']['outputs_per_step'],
            use_cudnn=True,
            dropout=0.1,
            batchnorm_last=True)

    def forward(self, character, text_pos, enc_non_pad_mask, dec_non_pad_mask,
                enc_slf_attn_mask=None, dec_slf_attn_mask=None,
                mel_pos=None, length_target=None, alpha=1.0):
    def forward(self,
                character,
                text_pos,
                enc_non_pad_mask,
                dec_non_pad_mask,
                enc_slf_attn_mask=None,
                dec_slf_attn_mask=None,
                mel_pos=None,
                length_target=None,
                alpha=1.0):
        """
        FastSpeech model.

@@ -84,29 +112,41 @@ class FastSpeech(dg.Layer):
            dec_slf_attn_list (Variable), Shape(B, mel_T, mel_T), the decoder self attention list.
        """

        encoder_output, enc_slf_attn_list = self.encoder(character, text_pos, enc_non_pad_mask, slf_attn_mask=enc_slf_attn_mask)
        encoder_output, enc_slf_attn_list = self.encoder(
            character,
            text_pos,
            enc_non_pad_mask,
            slf_attn_mask=enc_slf_attn_mask)
        if fluid.framework._dygraph_tracer()._train_mode:

            length_regulator_output, duration_predictor_output = self.length_regulator(encoder_output,
                                                                                       target=length_target,
                                                                                       alpha=alpha)
            decoder_output, dec_slf_attn_list = self.decoder(length_regulator_output, mel_pos,
                                                             dec_non_pad_mask,
                                                             slf_attn_mask=dec_slf_attn_mask)

            length_regulator_output, duration_predictor_output = self.length_regulator(
                encoder_output, target=length_target, alpha=alpha)
            decoder_output, dec_slf_attn_list = self.decoder(
                length_regulator_output,
                mel_pos,
                dec_non_pad_mask,
                slf_attn_mask=dec_slf_attn_mask)

            mel_output = self.mel_linear(decoder_output)
            mel_output_postnet = self.postnet(mel_output) + mel_output

            return mel_output, mel_output_postnet, duration_predictor_output, enc_slf_attn_list, dec_slf_attn_list
        else:
            length_regulator_output, decoder_pos = self.length_regulator(encoder_output, alpha=alpha)
            slf_attn_mask = get_triu_tensor(decoder_pos.numpy(), decoder_pos.numpy()).astype(np.float32)
            slf_attn_mask = fluid.layers.cast(dg.to_variable(slf_attn_mask == 0), np.float32)
            length_regulator_output, decoder_pos = self.length_regulator(
                encoder_output, alpha=alpha)
            slf_attn_mask = get_triu_tensor(
                decoder_pos.numpy(), decoder_pos.numpy()).astype(np.float32)
            slf_attn_mask = fluid.layers.cast(
                dg.to_variable(slf_attn_mask == 0), np.float32)
            slf_attn_mask = dg.to_variable(slf_attn_mask)
            dec_non_pad_mask = fluid.layers.unsqueeze((decoder_pos != 0).astype(np.float32), [-1])
            decoder_output, _ = self.decoder(length_regulator_output, decoder_pos, dec_non_pad_mask,
                                             slf_attn_mask=slf_attn_mask)
            dec_non_pad_mask = fluid.layers.unsqueeze(
                (decoder_pos != 0).astype(np.float32), [-1])
            decoder_output, _ = self.decoder(
                length_regulator_output,
                decoder_pos,
                dec_non_pad_mask,
                slf_attn_mask=slf_attn_mask)
            mel_output = self.mel_linear(decoder_output)
            mel_output_postnet = self.postnet(mel_output) + mel_output

            return mel_output, mel_output_postnet
            return mel_output, mel_output_postnet

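`FastSpeech.forward` branches on `_train_mode`: training consumes ground-truth `length_target` and `mel_pos`, while inference lets the (rounded) duration predictor drive the length regulator. A minimal control-flow sketch of that split, not the real API, with batch and attention details omitted:

```python
import numpy as np

def fastspeech_step(encoder_out, durations_gt=None, alpha=1.0):
    """Hedged sketch of the train/inference split in FastSpeech.forward."""
    if durations_gt is not None:                      # training branch
        reps = durations_gt                           # ground-truth durations
    else:                                             # inference branch
        pred = np.maximum(encoder_out.sum(-1), 0)     # stand-in duration predictor
        reps = np.round(pred * alpha)                 # alpha rescales speech rate
    # length regulation: repeat each encoder frame by its duration
    return np.repeat(encoder_out, np.maximum(reps, 0).astype(int), axis=0)

enc = np.random.randn(4, 8)                           # (T_text, C)
out = fastspeech_step(enc, durations_gt=np.array([1, 2, 0, 3]))
print(out.shape)                                      # (6, 8)
```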
@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import math
import paddle.fluid.dygraph as dg

@@ -6,11 +19,32 @@ import paddle.fluid as fluid
from parakeet.modules.multihead_attention import MultiheadAttention
from parakeet.modules.ffn import PositionwiseFeedForward


class FFTBlock(dg.Layer):
    def __init__(self, d_model, d_inner, n_head, d_k, d_v, filter_size, padding, dropout=0.2):
    def __init__(self,
                 d_model,
                 d_inner,
                 n_head,
                 d_k,
                 d_v,
                 filter_size,
                 padding,
                 dropout=0.2):
        super(FFTBlock, self).__init__()
        self.slf_attn = MultiheadAttention(d_model, d_k, d_v, num_head=n_head, is_bias=True, dropout=dropout, is_concat=False)
        self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, filter_size =filter_size, padding =padding, dropout=dropout)
        self.slf_attn = MultiheadAttention(
            d_model,
            d_k,
            d_v,
            num_head=n_head,
            is_bias=True,
            dropout=dropout,
            is_concat=False)
        self.pos_ffn = PositionwiseFeedForward(
            d_model,
            d_inner,
            filter_size=filter_size,
            padding=padding,
            dropout=dropout)

    def forward(self, enc_input, non_pad_mask, slf_attn_mask=None):
        """

@@ -27,11 +61,12 @@ class FFTBlock(dg.Layer):
            output (Variable), Shape(B, T, C), the output after self-attention & ffn.
            slf_attn (Variable), Shape(B * n_head, T, T), the self attention.
        """
        output, slf_attn = self.slf_attn(enc_input, enc_input, enc_input, mask=slf_attn_mask)
        output, slf_attn = self.slf_attn(
            enc_input, enc_input, enc_input, mask=slf_attn_mask)

        output *= non_pad_mask

        output = self.pos_ffn(output)
        output *= non_pad_mask

        return output, slf_attn
        return output, slf_attn

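An FFT block is self-attention followed by a position-wise conv feed-forward network, with the non-padding mask re-applied after each sub-layer so padded frames stay zero. A hedged NumPy sketch of that computation order (identity stand-ins for the real attention and FFN layers):

```python
import numpy as np

def fft_block(x, non_pad_mask, self_attn, ffn):
    # x: (B, T, C); non_pad_mask: (B, T, 1), 1 for real frames, 0 for padding.
    out = self_attn(x)
    out = out * non_pad_mask      # zero padded frames after attention
    out = ffn(out)
    out = out * non_pad_mask      # and again after the feed-forward sub-layer
    return out

x = np.random.randn(2, 5, 8)
mask = np.ones((2, 5, 1)); mask[:, 4:] = 0            # last frame is padding
y = fft_block(x, mask, lambda t: t, lambda t: t * 2)  # identity stand-ins
assert np.all(y[:, 4:] == 0)
```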
@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import math
import parakeet.models.fastspeech.utils

@@ -6,47 +19,50 @@ import paddle.fluid.layers as layers
import paddle.fluid as fluid
from parakeet.modules.customized import Conv1D


class LengthRegulator(dg.Layer):
    def __init__(self, input_size, out_channels, filter_size, dropout=0.1):
        super(LengthRegulator, self).__init__()
        self.duration_predictor = DurationPredictor(input_size=input_size,
                                                    out_channels=out_channels,
                                                    filter_size=filter_size,
                                                    dropout=dropout)
        self.duration_predictor = DurationPredictor(
            input_size=input_size,
            out_channels=out_channels,
            filter_size=filter_size,
            dropout=dropout)

    def LR(self, x, duration_predictor_output, alpha=1.0):
        output = []
        batch_size = x.shape[0]
        for i in range(batch_size):
            output.append(self.expand(x[i:i+1], duration_predictor_output[i:i+1], alpha))
            output.append(
                self.expand(x[i:i + 1], duration_predictor_output[i:i + 1],
                            alpha))
        output = self.pad(output)
        return output

    def pad(self, input_ele):
        max_len = max([input_ele[i].shape[0] for i in range(len(input_ele))])
        out_list = []
        for i in range(len(input_ele)):
            pad_len = max_len - input_ele[i].shape[0]
            one_batch_padded = layers.pad(
                input_ele[i], [0, pad_len, 0, 0], pad_value=0.0)
            one_batch_padded = layers.pad(input_ele[i], [0, pad_len, 0, 0],
                                          pad_value=0.0)
            out_list.append(one_batch_padded)
        out_padded = layers.stack(out_list)
        return out_padded

    def expand(self, batch, predicted, alpha):
        out = []
        time_steps = batch.shape[1]
        fertilities = predicted.numpy()
        batch = layers.squeeze(batch,[0])

        batch = layers.squeeze(batch, [0])

        for i in range(time_steps):
            if fertilities[0,i]==0:
            if fertilities[0, i] == 0:
                continue
            out.append(layers.expand(batch[i: i + 1, :], [int(fertilities[0,i]), 1]))
            out.append(
                layers.expand(batch[i:i + 1, :], [int(fertilities[0, i]), 1]))
        out = layers.concat(out, axis=0)
        return out

    def forward(self, x, alpha=1.0, target=None):
        """

@@ -70,10 +86,11 @@ class LengthRegulator(dg.Layer):
        else:
            duration_predictor_output = layers.round(duration_predictor_output)
            output = self.LR(x, duration_predictor_output, alpha)
            mel_pos = dg.to_variable(np.arange(1, output.shape[1]+1))
            mel_pos = dg.to_variable(np.arange(1, output.shape[1] + 1))
            mel_pos = layers.unsqueeze(mel_pos, [0])
            return output, mel_pos


class DurationPredictor(dg.Layer):
    def __init__(self, input_size, out_channels, filter_size, dropout=0.1):
        super(DurationPredictor, self).__init__()

@@ -83,30 +100,38 @@ class DurationPredictor(dg.Layer):
        self.dropout = dropout

        k = math.sqrt(1 / self.input_size)
        self.conv1 = Conv1D(num_channels = self.input_size,
                            num_filters = self.out_channels,
                            filter_size = self.filter_size,
                            padding=1,
                            param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
                            bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)))
                            #data_format='NTC')
        self.conv1 = Conv1D(
            num_channels=self.input_size,
            num_filters=self.out_channels,
            filter_size=self.filter_size,
            padding=1,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))
        #data_format='NTC')
        k = math.sqrt(1 / self.out_channels)
        self.conv2 = Conv1D(num_channels = self.out_channels,
                            num_filters = self.out_channels,
                            filter_size = self.filter_size,
                            padding=1,
                            param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
                            bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)))
                            #data_format='NTC')
        self.conv2 = Conv1D(
            num_channels=self.out_channels,
            num_filters=self.out_channels,
            filter_size=self.filter_size,
            padding=1,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))
        #data_format='NTC')
        self.layer_norm1 = dg.LayerNorm(self.out_channels)
        self.layer_norm2 = dg.LayerNorm(self.out_channels)

        self.weight = fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer())
        self.weight = fluid.ParamAttr(
            initializer=fluid.initializer.XavierInitializer())
        k = math.sqrt(1 / self.out_channels)
        self.bias = fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k))
        self.bias = fluid.ParamAttr(initializer=fluid.initializer.Uniform(
            low=-k, high=k))

        self.linear =dg.Linear(self.out_channels, 1, param_attr = self.weight,
                               bias_attr = self.bias)
        self.linear = dg.Linear(
            self.out_channels, 1, param_attr=self.weight, bias_attr=self.bias)

    def forward(self, encoder_output):
        """

@@ -118,18 +143,21 @@ class DurationPredictor(dg.Layer):
            out (Variable), Shape(B, T, C), the output of duration predictor.
        """
        # encoder_output.shape(N, T, C)
        out = layers.transpose(encoder_output, [0,2,1])
        out = layers.transpose(encoder_output, [0, 2, 1])
        out = self.conv1(out)
        out = layers.transpose(out, [0,2,1])
        out = layers.dropout(layers.relu(self.layer_norm1(out)), self.dropout, dropout_implementation='upscale_in_train')
        out = layers.transpose(out, [0,2,1])
        out = layers.transpose(out, [0, 2, 1])
        out = layers.dropout(
            layers.relu(self.layer_norm1(out)),
            self.dropout,
            dropout_implementation='upscale_in_train')
        out = layers.transpose(out, [0, 2, 1])
        out = self.conv2(out)
        out = layers.transpose(out, [0,2,1])
        out = layers.dropout(layers.relu(self.layer_norm2(out)), self.dropout, dropout_implementation='upscale_in_train')
        out = layers.transpose(out, [0, 2, 1])
        out = layers.dropout(
            layers.relu(self.layer_norm2(out)),
            self.dropout,
            dropout_implementation='upscale_in_train')
        out = layers.relu(self.linear(out))
        out = layers.squeeze(out, axes=[-1])

        return out

        return out

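`LengthRegulator.expand` repeats the i-th encoder frame `fertilities[0, i]` times and skips zero-duration frames. NumPy's `repeat` performs the same expansion in one call, which makes the semantics easy to check:

```python
import numpy as np

def length_regulate(batch, fertilities):
    # batch: (T, C) encoder frames; fertilities: (T,) integer durations.
    return np.repeat(batch, fertilities.astype(int), axis=0)

frames = np.arange(8, dtype="float32").reshape(4, 2)   # 4 frames, 2 channels
print(length_regulate(frames, np.array([2, 0, 1, 3])))
# frame 0 twice, frame 1 skipped, frame 2 once, frame 3 three times -> (6, 2)
```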
@@ -1,34 +1,47 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np


def get_alignment(attn_probs, mel_lens, n_head):
    max_F = 0
    assert attn_probs[0].shape[0] % n_head == 0
    batch_size = int(attn_probs[0].shape[0] // n_head)
    #max_attn = attn_probs[0].numpy()[0,batch_size]
    for i in range(len(attn_probs)):
        multi_attn = attn_probs[i].numpy()
        for j in range(n_head):
            attn = multi_attn[j*batch_size:(j+1)*batch_size]
            attn = multi_attn[j * batch_size:(j + 1) * batch_size]
            F = score_F(attn)
            if max_F < F:
                max_F = F
                max_attn = attn
    alignment = compute_duration(max_attn, mel_lens)
    return alignment, max_attn


def score_F(attn):
    max = np.max(attn, axis=-1)
    mean = np.mean(max)
    return mean


def compute_duration(attn, mel_lens):
    alignment = np.zeros([attn.shape[0],attn.shape[2]])
    alignment = np.zeros([attn.shape[0], attn.shape[2]])
    mel_lens = mel_lens.numpy()
    for i in range(attn.shape[0]):
        for j in range(mel_lens[i]):
            max_index = np.argmax(attn[i,j])
            alignment[i,max_index] += 1
            max_index = np.argmax(attn[i, j])
            alignment[i, max_index] += 1

    return alignment

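`compute_duration` turns an attention map into integer durations by counting, for every decoder frame, which encoder position it attends to most strongly; the counts over a full utterance then sum to the mel length. A single-utterance NumPy check:

```python
import numpy as np

def durations_from_attention(attn, mel_len):
    # attn: (T_mel, T_text) attention probabilities for one utterance.
    counts = np.zeros(attn.shape[1])
    for j in range(mel_len):
        counts[np.argmax(attn[j])] += 1   # hardest-attended text position
    return counts                          # sums to mel_len

attn = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]])
print(durations_from_attention(attn, mel_len=3))   # [2. 1.]
```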
@@ -0,0 +1,13 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from parakeet.g2p.text.symbols import symbols
import paddle.fluid.dygraph as dg

@@ -7,9 +20,16 @@ from parakeet.modules.customized import Pool1D, Conv1D
from parakeet.modules.dynamic_gru import DynamicGRU
import numpy as np


class CBHG(dg.Layer):
    def __init__(self, hidden_size, batch_size, K=16, projection_size = 256, num_gru_layers=2,
                 max_pool_kernel_size=2, is_post=False):
    def __init__(self,
                 hidden_size,
                 batch_size,
                 K=16,
                 projection_size=256,
                 num_gru_layers=2,
                 max_pool_kernel_size=2,
                 is_post=False):
        super(CBHG, self).__init__()
        """
        :param hidden_size: dimension of hidden unit

@@ -24,28 +44,39 @@ class CBHG(dg.Layer):
        self.projection_size = projection_size
        self.conv_list = []
        k = math.sqrt(1 / projection_size)
        self.conv_list.append(Conv1D(num_channels = projection_size,
                                     num_filters = hidden_size,
                                     filter_size = 1,
                                     padding = int(np.floor(1/2)),
                                     param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
                                     bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k))))
        self.conv_list.append(
            Conv1D(
                num_channels=projection_size,
                num_filters=hidden_size,
                filter_size=1,
                padding=int(np.floor(1 / 2)),
                param_attr=fluid.ParamAttr(
                    initializer=fluid.initializer.XavierInitializer()),
                bias_attr=fluid.ParamAttr(
                    initializer=fluid.initializer.Uniform(
                        low=-k, high=k))))
        k = math.sqrt(1 / hidden_size)
        for i in range(2,K+1):
            self.conv_list.append(Conv1D(num_channels = hidden_size,
                                         num_filters = hidden_size,
                                         filter_size = i,
                                         padding = int(np.floor(i/2)),
                                         param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
                                         bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k))))
        for i in range(2, K + 1):
            self.conv_list.append(
                Conv1D(
                    num_channels=hidden_size,
                    num_filters=hidden_size,
                    filter_size=i,
                    padding=int(np.floor(i / 2)),
                    param_attr=fluid.ParamAttr(
                        initializer=fluid.initializer.XavierInitializer()),
                    bias_attr=fluid.ParamAttr(
                        initializer=fluid.initializer.Uniform(
                            low=-k, high=k))))

        for i, layer in enumerate(self.conv_list):
            self.add_sublayer("conv_list_{}".format(i), layer)

        self.batchnorm_list = []
        for i in range(K):
            self.batchnorm_list.append(dg.BatchNorm(hidden_size,
                                                    data_layout='NCHW'))
            self.batchnorm_list.append(
                dg.BatchNorm(
                    hidden_size, data_layout='NCHW'))

        for i, layer in enumerate(self.batchnorm_list):
            self.add_sublayer("batchnorm_list_{}".format(i), layer)

@@ -53,92 +84,120 @@ class CBHG(dg.Layer):
        conv_outdim = hidden_size * K

        k = math.sqrt(1 / conv_outdim)
        self.conv_projection_1 = Conv1D(num_channels = conv_outdim,
                                        num_filters = hidden_size,
                                        filter_size = 3,
                                        padding = int(np.floor(3/2)),
                                        param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
                                        bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)))
        self.conv_projection_1 = Conv1D(
            num_channels=conv_outdim,
            num_filters=hidden_size,
            filter_size=3,
            padding=int(np.floor(3 / 2)),
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))

        k = math.sqrt(1 / hidden_size)
        self.conv_projection_2 = Conv1D(num_channels = hidden_size,
                                        num_filters = projection_size,
                                        filter_size = 3,
                                        padding = int(np.floor(3/2)),
                                        param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
                                        bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)))
        self.conv_projection_2 = Conv1D(
            num_channels=hidden_size,
            num_filters=projection_size,
            filter_size=3,
            padding=int(np.floor(3 / 2)),
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))

        self.batchnorm_proj_1 = dg.BatchNorm(hidden_size,
                                             data_layout='NCHW')
        self.batchnorm_proj_2 = dg.BatchNorm(projection_size,
                                             data_layout='NCHW')
        self.max_pool = Pool1D(pool_size = max_pool_kernel_size,
                               pool_type='max',
                               pool_stride=1,
                               pool_padding=1,
                               data_format = "NCT")
        self.batchnorm_proj_1 = dg.BatchNorm(hidden_size, data_layout='NCHW')
        self.batchnorm_proj_2 = dg.BatchNorm(
            projection_size, data_layout='NCHW')
        self.max_pool = Pool1D(
            pool_size=max_pool_kernel_size,
            pool_type='max',
            pool_stride=1,
            pool_padding=1,
            data_format="NCT")
        self.highway = Highwaynet(self.projection_size)

        h_0 = np.zeros((batch_size, hidden_size // 2), dtype="float32")
        h_0 = dg.to_variable(h_0)
        k = math.sqrt(1 / hidden_size)
        self.fc_forward1 = dg.Linear(hidden_size, hidden_size // 2 * 3,
                                     param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
                                     bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
        self.fc_reverse1 = dg.Linear(hidden_size, hidden_size // 2 * 3,
                                     param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
                                     bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
        self.gru_forward1 = DynamicGRU(size = self.hidden_size // 2,
                                       is_reverse = False,
                                       origin_mode = True,
                                       h_0 = h_0)
        self.gru_reverse1 = DynamicGRU(size = self.hidden_size // 2,
                                       is_reverse=True,
                                       origin_mode=True,
                                       h_0 = h_0)
        self.fc_forward1 = dg.Linear(
            hidden_size,
            hidden_size // 2 * 3,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))
        self.fc_reverse1 = dg.Linear(
            hidden_size,
            hidden_size // 2 * 3,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))
        self.gru_forward1 = DynamicGRU(
            size=self.hidden_size // 2,
            is_reverse=False,
            origin_mode=True,
            h_0=h_0)
        self.gru_reverse1 = DynamicGRU(
            size=self.hidden_size // 2,
            is_reverse=True,
            origin_mode=True,
            h_0=h_0)

        self.fc_forward2 = dg.Linear(hidden_size, hidden_size // 2 * 3,
                                     param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
                                     bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
        self.fc_reverse2 = dg.Linear(hidden_size, hidden_size // 2 * 3,
                                     param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
                                     bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
        self.gru_forward2 = DynamicGRU(size = self.hidden_size // 2,
                                       is_reverse = False,
                                       origin_mode = True,
                                       h_0 = h_0)
        self.gru_reverse2 = DynamicGRU(size = self.hidden_size // 2,
                                       is_reverse=True,
                                       origin_mode=True,
                                       h_0 = h_0)
        self.fc_forward2 = dg.Linear(
            hidden_size,
            hidden_size // 2 * 3,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))
        self.fc_reverse2 = dg.Linear(
            hidden_size,
            hidden_size // 2 * 3,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))
        self.gru_forward2 = DynamicGRU(
            size=self.hidden_size // 2,
            is_reverse=False,
            origin_mode=True,
            h_0=h_0)
        self.gru_reverse2 = DynamicGRU(
            size=self.hidden_size // 2,
            is_reverse=True,
            origin_mode=True,
            h_0=h_0)

    def _conv_fit_dim(self, x, filter_size=3):
        if filter_size % 2 == 0:
            return x[:,:,:-1]
            return x[:, :, :-1]
        else:
            return x
            return x

    def forward(self, input_):
        # input_.shape = [N, C, T]

        conv_list = []
        conv_input = input_

        for i, (conv, batchnorm) in enumerate(zip(self.conv_list, self.batchnorm_list)):
            conv_input = self._conv_fit_dim(conv(conv_input), i+1)

        for i, (conv, batchnorm
                ) in enumerate(zip(self.conv_list, self.batchnorm_list)):
            conv_input = self._conv_fit_dim(conv(conv_input), i + 1)
            conv_input = layers.relu(batchnorm(conv_input))
            conv_list.append(conv_input)

        conv_cat = layers.concat(conv_list, axis=1)
        conv_pool = self.max_pool(conv_cat)[:,:,:-1]

        conv_proj = layers.relu(self.batchnorm_proj_1(self._conv_fit_dim(self.conv_projection_1(conv_pool))))
        conv_proj = self.batchnorm_proj_2(self._conv_fit_dim(self.conv_projection_2(conv_proj))) + input_

        conv_pool = self.max_pool(conv_cat)[:, :, :-1]

        conv_proj = layers.relu(
            self.batchnorm_proj_1(
                self._conv_fit_dim(self.conv_projection_1(conv_pool))))
        conv_proj = self.batchnorm_proj_2(
            self._conv_fit_dim(self.conv_projection_2(conv_proj))) + input_

        # conv_proj.shape = [N, C, T]
        highway = layers.transpose(conv_proj, [0,2,1])
        highway = layers.transpose(conv_proj, [0, 2, 1])
        highway = self.highway(highway)

        # highway.shape = [N, T, C]

@@ -152,9 +211,10 @@ class CBHG(dg.Layer):
        out_forward = self.gru_forward2(fc_forward)
        out_reverse = self.gru_reverse2(fc_reverse)
        out = layers.concat([out_forward, out_reverse], axis=-1)
        out = layers.transpose(out, [0,2,1])
        out = layers.transpose(out, [0, 2, 1])
        return out


class Highwaynet(dg.Layer):
    def __init__(self, num_units, num_layers=4):
        super(Highwaynet, self).__init__()

@@ -165,14 +225,26 @@ class Highwaynet(dg.Layer):
        self.linears = []
        k = math.sqrt(1 / num_units)
        for i in range(num_layers):
            self.linears.append(dg.Linear(num_units, num_units,
                                          param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
                                          bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k))))
            self.gates.append(dg.Linear(num_units, num_units,
                                        param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
                                        bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k))))

        for i, (linear, gate) in enumerate(zip(self.linears,self.gates)):
            self.linears.append(
                dg.Linear(
                    num_units,
                    num_units,
                    param_attr=fluid.ParamAttr(
                        initializer=fluid.initializer.XavierInitializer()),
                    bias_attr=fluid.ParamAttr(
                        initializer=fluid.initializer.Uniform(
                            low=-k, high=k))))
            self.gates.append(
                dg.Linear(
                    num_units,
                    num_units,
                    param_attr=fluid.ParamAttr(
                        initializer=fluid.initializer.XavierInitializer()),
                    bias_attr=fluid.ParamAttr(
                        initializer=fluid.initializer.Uniform(
                            low=-k, high=k))))

        for i, (linear, gate) in enumerate(zip(self.linears, self.gates)):
            self.add_sublayer("linears_{}".format(i), linear)
            self.add_sublayer("gates_{}".format(i), gate)

@@ -184,12 +256,6 @@ class Highwaynet(dg.Layer):
            t_ = fluid.layers.sigmoid(gate(out))

            c = 1 - t_
            out = h * t_ + out * c

            out = h * t_ + out * c

        return out

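`Highwaynet.forward` computes a gated mix per layer, out = H(x)·T(x) + x·(1 − T(x)), where T is the sigmoid gate shown above. A self-contained sketch, assuming the candidate branch `h` is the relu of the linear branch (that line is outside this hunk):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    h = np.maximum(x @ W_h + b_h, 0)     # candidate: relu(linear(x))
    t = sigmoid(x @ W_t + b_t)           # transform gate
    return h * t + x * (1.0 - t)         # carry gate is 1 - t

x = np.random.randn(3, 4).astype("float32")
W = np.eye(4, dtype="float32"); b = np.zeros(4, dtype="float32")
print(highway_layer(x, W, b, W, b).shape)   # (3, 4)
```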
@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid

@@ -7,67 +20,110 @@ from parakeet.modules.ffn import PositionwiseFeedForward
from parakeet.models.transformer_tts.prenet import PreNet
from parakeet.models.transformer_tts.post_convnet import PostConvNet


class Decoder(dg.Layer):
    def __init__(self, num_hidden, config, num_head=4):
        super(Decoder, self).__init__()
        self.num_hidden = num_hidden
        self.num_head = num_head
        param = fluid.ParamAttr()
        self.alpha = self.create_parameter(shape=(1,), attr=param, dtype='float32',
                                           default_initializer = fluid.initializer.ConstantInitializer(value=1.0))
        self.pos_inp = get_sinusoid_encoding_table(1024, self.num_hidden, padding_idx=0)
        self.pos_emb = dg.Embedding(size=[1024, num_hidden],
                                    padding_idx=0,
                                    param_attr=fluid.ParamAttr(
                                        initializer=fluid.initializer.NumpyArrayInitializer(self.pos_inp),
                                        trainable=False))
        self.decoder_prenet = PreNet(input_size = config['audio']['num_mels'],
                                     hidden_size = num_hidden * 2,
                                     output_size = num_hidden,
                                     dropout_rate=0.2)
        self.alpha = self.create_parameter(
            shape=(1, ),
            attr=param,
            dtype='float32',
            default_initializer=fluid.initializer.ConstantInitializer(
                value=1.0))
        self.pos_inp = get_sinusoid_encoding_table(
            1024, self.num_hidden, padding_idx=0)
        self.pos_emb = dg.Embedding(
            size=[1024, num_hidden],
            padding_idx=0,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.NumpyArrayInitializer(
                    self.pos_inp),
                trainable=False))
        self.decoder_prenet = PreNet(
            input_size=config['audio']['num_mels'],
            hidden_size=num_hidden * 2,
            output_size=num_hidden,
            dropout_rate=0.2)
        k = math.sqrt(1 / num_hidden)
        self.linear = dg.Linear(num_hidden, num_hidden,
                                param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
                                bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
        self.linear = dg.Linear(
            num_hidden,
            num_hidden,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))

        self.selfattn_layers = [MultiheadAttention(num_hidden, num_hidden//num_head, num_hidden//num_head) for _ in range(3)]
        self.selfattn_layers = [
            MultiheadAttention(num_hidden, num_hidden // num_head,
                               num_hidden // num_head) for _ in range(3)
        ]
        for i, layer in enumerate(self.selfattn_layers):
            self.add_sublayer("self_attn_{}".format(i), layer)
        self.attn_layers = [MultiheadAttention(num_hidden, num_hidden//num_head, num_hidden//num_head) for _ in range(3)]
        self.attn_layers = [
            MultiheadAttention(num_hidden, num_hidden // num_head,
                               num_hidden // num_head) for _ in range(3)
        ]
        for i, layer in enumerate(self.attn_layers):
            self.add_sublayer("attn_{}".format(i), layer)
        self.ffns = [PositionwiseFeedForward(num_hidden, num_hidden*num_head, filter_size=1) for _ in range(3)]
        self.ffns = [
            PositionwiseFeedForward(
                num_hidden, num_hidden * num_head, filter_size=1)
            for _ in range(3)
        ]
        for i, layer in enumerate(self.ffns):
            self.add_sublayer("ffns_{}".format(i), layer)
        self.mel_linear = dg.Linear(num_hidden, config['audio']['num_mels'] * config['audio']['outputs_per_step'],
                                    param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
                                    bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
        self.stop_linear = dg.Linear(num_hidden, 1,
                                     param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
                                     bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
        self.mel_linear = dg.Linear(
            num_hidden,
            config['audio']['num_mels'] * config['audio']['outputs_per_step'],
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))
        self.stop_linear = dg.Linear(
            num_hidden,
            1,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))

        self.postconvnet = PostConvNet(config['audio']['num_mels'], config['hidden_size'],
                                       filter_size = 5, padding = 4, num_conv=5,
                                       outputs_per_step=config['audio']['outputs_per_step'],
                                       use_cudnn=True)
        self.postconvnet = PostConvNet(
            config['audio']['num_mels'],
            config['hidden_size'],
            filter_size=5,
            padding=4,
            num_conv=5,
            outputs_per_step=config['audio']['outputs_per_step'],
            use_cudnn=True)

    def forward(self, key, value, query, positional, mask, m_mask=None, m_self_mask=None, zero_mask=None):
    def forward(self,
                key,
                value,
                query,
                positional,
                mask,
                m_mask=None,
                m_self_mask=None,
                zero_mask=None):

        # get decoder mask with triangular matrix

        if fluid.framework._dygraph_tracer()._train_mode:
            m_mask = layers.expand(m_mask, [self.num_head, 1, key.shape[1]])
            m_self_mask = layers.expand(m_self_mask, [self.num_head, 1, query.shape[1]])
            m_self_mask = layers.expand(m_self_mask,
                                        [self.num_head, 1, query.shape[1]])
            mask = layers.expand(mask, [self.num_head, 1, 1])
            zero_mask = layers.expand(zero_mask, [self.num_head, 1, 1])

        else:
            m_mask, m_self_mask, zero_mask = None, None, None

        # Decoder pre-network
        # Decoder pre-network
        query = self.decoder_prenet(query)

        # Centered position
        query = self.linear(query)

@@ -76,28 +132,29 @@ class Decoder(dg.Layer):
        query = positional * self.alpha + query

        #positional dropout
        query = fluid.layers.dropout(query, 0.1, dropout_implementation='upscale_in_train')

        query = fluid.layers.dropout(
            query, 0.1, dropout_implementation='upscale_in_train')

        # Attention decoder-decoder, encoder-decoder
        selfattn_list = list()
        attn_list = list()

        for selfattn, attn, ffn in zip(self.selfattn_layers, self.attn_layers, self.ffns):
            query, attn_dec = selfattn(query, query, query, mask = mask, query_mask = m_self_mask)
            query, attn_dot = attn(key, value, query, mask = zero_mask, query_mask = m_mask)

        for selfattn, attn, ffn in zip(self.selfattn_layers, self.attn_layers,
                                       self.ffns):
            query, attn_dec = selfattn(
                query, query, query, mask=mask, query_mask=m_self_mask)
            query, attn_dot = attn(
                key, value, query, mask=zero_mask, query_mask=m_mask)
            query = ffn(query)
            selfattn_list.append(attn_dec)
            attn_list.append(attn_dot)

        # Mel linear projection
        mel_out = self.mel_linear(query)
        # Post Mel Network
        out = self.postconvnet(mel_out)
        out = mel_out + out

        # Stop tokens
        stop_tokens = self.stop_linear(query)
        stop_tokens = layers.squeeze(stop_tokens, [-1])

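At inference time this decoder (like the FastSpeech branch above) derives its self-attention mask from a triangular matrix so every frame attends only to itself and earlier frames. A NumPy sketch of the `get_triu_tensor` / `slf_attn_mask == 0` pattern:

```python
import numpy as np

T = 4
triu = np.triu(np.ones((T, T)), k=1)        # 1 strictly above the diagonal
allowed = (triu == 0).astype(np.float32)    # matches the `slf_attn_mask == 0` cast
print(allowed)
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```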
@@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
from parakeet.models.transformer_tts.utils import *

@@ -5,56 +18,69 @@ from parakeet.modules.multihead_attention import MultiheadAttention
from parakeet.modules.ffn import PositionwiseFeedForward
from parakeet.models.transformer_tts.encoderprenet import EncoderPrenet


class Encoder(dg.Layer):
    def __init__(self, embedding_size, num_hidden, num_head=4):
        super(Encoder, self).__init__()
        self.num_hidden = num_hidden
        self.num_head = num_head
        param = fluid.ParamAttr(initializer=fluid.initializer.Constant(value=1.0))
        self.alpha = self.create_parameter(shape=(1, ), attr=param, dtype='float32')
        self.pos_inp = get_sinusoid_encoding_table(1024, self.num_hidden, padding_idx=0)
        self.pos_emb = dg.Embedding(size=[1024, num_hidden],
                                    param_attr=fluid.ParamAttr(
                                        initializer=fluid.initializer.NumpyArrayInitializer(self.pos_inp),
                                        trainable=False))
        self.encoder_prenet = EncoderPrenet(embedding_size = embedding_size,
                                            num_hidden = num_hidden,
                                            use_cudnn=True)
        self.layers = [MultiheadAttention(num_hidden, num_hidden//num_head, num_hidden//num_head) for _ in range(3)]
        param = fluid.ParamAttr(initializer=fluid.initializer.Constant(
            value=1.0))
        self.alpha = self.create_parameter(
            shape=(1, ), attr=param, dtype='float32')
        self.pos_inp = get_sinusoid_encoding_table(
            1024, self.num_hidden, padding_idx=0)
        self.pos_emb = dg.Embedding(
            size=[1024, num_hidden],
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.NumpyArrayInitializer(
                    self.pos_inp),
                trainable=False))
        self.encoder_prenet = EncoderPrenet(
            embedding_size=embedding_size,
            num_hidden=num_hidden,
            use_cudnn=True)
        self.layers = [
            MultiheadAttention(num_hidden, num_hidden // num_head,
                               num_hidden // num_head) for _ in range(3)
        ]
        for i, layer in enumerate(self.layers):
            self.add_sublayer("self_attn_{}".format(i), layer)
        self.ffns = [PositionwiseFeedForward(num_hidden, num_hidden*num_head, filter_size=1, use_cudnn=True) for _ in range(3)]
        self.ffns = [
            PositionwiseFeedForward(
                num_hidden,
                num_hidden * num_head,
                filter_size=1,
                use_cudnn=True) for _ in range(3)
        ]
        for i, layer in enumerate(self.ffns):
            self.add_sublayer("ffns_{}".format(i), layer)

    def forward(self, x, positional, mask=None, query_mask=None):

        if fluid.framework._dygraph_tracer()._train_mode:
            seq_len_key = x.shape[1]
            query_mask = layers.expand(query_mask, [self.num_head, 1, seq_len_key])
            query_mask = layers.expand(query_mask,
                                       [self.num_head, 1, seq_len_key])
            mask = layers.expand(mask, [self.num_head, 1, 1])
        else:
            query_mask, mask = None, None

        # Encoder pre_network
        x = self.encoder_prenet(x) #(N,T,C)

        x = self.encoder_prenet(x)  #(N,T,C)

        # Get positional encoding
        positional = self.pos_emb(positional)

        x = positional * self.alpha + x #(N, T, C)

        positional = self.pos_emb(positional)

        x = positional * self.alpha + x  #(N, T, C)

        # Positional dropout
        x = layers.dropout(x, 0.1, dropout_implementation='upscale_in_train')

        # Self attention encoder
        attentions = list()
        for layer, ffn in zip(self.layers, self.ffns):
            x, attention = layer(x, x, x, mask = mask, query_mask = query_mask)
            x, attention = layer(x, x, x, mask=mask, query_mask=query_mask)
            x = ffn(x)
            attentions.append(attention)

        return x, attentions
        return x, attentions

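One detail worth noting in this encoder: the per-batch masks are expanded across attention heads by tiling along the batch axis, since the attention layers fold heads into the batch dimension. A NumPy equivalent of `layers.expand(query_mask, [self.num_head, 1, seq_len_key])` on a `(B, T, 1)` mask:

```python
import numpy as np

num_head, B, T = 4, 2, 5
query_mask = np.ones((B, T, 1), dtype="float32")   # (B, T, 1)
expanded = np.tile(query_mask, (num_head, 1, T))   # -> (num_head * B, T, T)
print(expanded.shape)                              # (8, 5, 5)
```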
@ -1,3 +1,16 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
import math
|
||||
from parakeet.g2p.text.symbols import symbols
|
||||
import paddle.fluid.dygraph as dg
|
||||
|
@ -13,50 +26,70 @@ class EncoderPrenet(dg.Layer):
|
|||
self.embedding_size = embedding_size
|
||||
self.num_hidden = num_hidden
|
||||
self.use_cudnn = use_cudnn
|
||||
self.embedding = dg.Embedding( size = [len(symbols), embedding_size],
|
||||
padding_idx = 0,
|
||||
param_attr=fluid.initializer.Normal(loc=0.0, scale=1.0))
|
||||
self.embedding = dg.Embedding(
|
||||
size=[len(symbols), embedding_size],
|
||||
padding_idx=0,
|
||||
param_attr=fluid.initializer.Normal(
|
||||
loc=0.0, scale=1.0))
|
||||
self.conv_list = []
|
||||
k = math.sqrt(1 / embedding_size)
|
||||
self.conv_list.append(Conv1D(num_channels = embedding_size,
|
||||
num_filters = num_hidden,
|
||||
filter_size = 5,
|
||||
padding = int(np.floor(5/2)),
|
||||
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
|
||||
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)),
|
||||
use_cudnn = use_cudnn))
|
||||
self.conv_list.append(
|
||||
Conv1D(
|
||||
num_channels=embedding_size,
|
||||
num_filters=num_hidden,
|
||||
filter_size=5,
|
||||
padding=int(np.floor(5 / 2)),
|
||||
param_attr=fluid.ParamAttr(
|
||||
initializer=fluid.initializer.XavierInitializer()),
|
||||
bias_attr=fluid.ParamAttr(
|
||||
initializer=fluid.initializer.Uniform(
|
||||
low=-k, high=k)),
|
||||
use_cudnn=use_cudnn))
|
||||
k = math.sqrt(1 / num_hidden)
|
||||
for _ in range(2):
|
||||
self.conv_list.append(Conv1D(num_channels = num_hidden,
|
||||
num_filters = num_hidden,
|
||||
filter_size = 5,
|
||||
padding = int(np.floor(5/2)),
|
||||
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
|
||||
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)),
|
||||
use_cudnn = use_cudnn))
|
||||
self.conv_list.append(
|
||||
Conv1D(
|
||||
num_channels=num_hidden,
|
||||
num_filters=num_hidden,
|
||||
filter_size=5,
|
||||
padding=int(np.floor(5 / 2)),
|
||||
param_attr=fluid.ParamAttr(
|
||||
initializer=fluid.initializer.XavierInitializer()),
|
||||
bias_attr=fluid.ParamAttr(
|
||||
initializer=fluid.initializer.Uniform(
|
||||
low=-k, high=k)),
|
||||
use_cudnn=use_cudnn))
|
||||
|
||||
        for i, layer in enumerate(self.conv_list):
            self.add_sublayer("conv_list_{}".format(i), layer)

        self.batch_norm_list = [dg.BatchNorm(num_hidden,
                                             data_layout='NCHW') for _ in range(3)]
        self.batch_norm_list = [
            dg.BatchNorm(
                num_hidden, data_layout='NCHW') for _ in range(3)
        ]

        for i, layer in enumerate(self.batch_norm_list):
            self.add_sublayer("batch_norm_list_{}".format(i), layer)

        k = math.sqrt(1 / num_hidden)
        self.projection = dg.Linear(num_hidden, num_hidden,
                                    param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
                                    bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
        self.projection = dg.Linear(
            num_hidden,
            num_hidden,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))

    def forward(self, x):

        x = self.embedding(x) #(batch_size, seq_len, embedding_size)
        x = layers.transpose(x,[0,2,1])

        x = self.embedding(x)  #(batch_size, seq_len, embedding_size)
        x = layers.transpose(x, [0, 2, 1])
        for batch_norm, conv in zip(self.batch_norm_list, self.conv_list):
            x = layers.dropout(layers.relu(batch_norm(conv(x))), 0.2,
                               dropout_implementation='upscale_in_train')
        x = layers.transpose(x,[0,2,1]) #(N,T,C)
            x = layers.dropout(
                layers.relu(batch_norm(conv(x))),
                0.2,
                dropout_implementation='upscale_in_train')
        x = layers.transpose(x, [0, 2, 1])  #(N,T,C)
        x = self.projection(x)

        return x
        return x
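A minimal usage sketch for the prenet above (constructor values, batch size, and sequence length are illustrative assumptions, not taken from this diff):

```python
import numpy as np
import paddle.fluid.dygraph as dg

with dg.guard():
    net = EncoderPrenet(embedding_size=512, num_hidden=256, use_cudnn=True)
    # Batch of 2 phoneme-id sequences of length 10 (0 is the padding index).
    text = dg.to_variable(np.random.randint(1, 100, (2, 10)).astype("int64"))
    out = net(text)  # (batch_size, seq_len, num_hidden)
```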
@ -1,11 +1,25 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from parakeet.modules.customized import Conv1D


class PostConvNet(dg.Layer):
    def __init__(self,
    def __init__(self,
                 n_mels=80,
                 num_hidden=512,
                 filter_size=5,
@ -16,49 +30,66 @@ class PostConvNet(dg.Layer):
                 dropout=0.1,
                 batchnorm_last=False):
        super(PostConvNet, self).__init__()

        self.dropout = dropout
        self.num_conv = num_conv
        self.batchnorm_last = batchnorm_last
        self.conv_list = []
        k = math.sqrt(1 / (n_mels * outputs_per_step))
        self.conv_list.append(Conv1D(num_channels = n_mels * outputs_per_step,
                                     num_filters = num_hidden,
                                     filter_size = filter_size,
                                     padding = padding,
                                     param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
                                     bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)),
                                     use_cudnn = use_cudnn))
        self.conv_list.append(
            Conv1D(
                num_channels=n_mels * outputs_per_step,
                num_filters=num_hidden,
                filter_size=filter_size,
                padding=padding,
                param_attr=fluid.ParamAttr(
                    initializer=fluid.initializer.XavierInitializer()),
                bias_attr=fluid.ParamAttr(
                    initializer=fluid.initializer.Uniform(
                        low=-k, high=k)),
                use_cudnn=use_cudnn))

        k = math.sqrt(1 / num_hidden)
        for _ in range(1, num_conv-1):
            self.conv_list.append(Conv1D(num_channels = num_hidden,
                                         num_filters = num_hidden,
                                         filter_size = filter_size,
                                         padding = padding,
                                         param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
                                         bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)),
                                         use_cudnn = use_cudnn))
        for _ in range(1, num_conv - 1):
            self.conv_list.append(
                Conv1D(
                    num_channels=num_hidden,
                    num_filters=num_hidden,
                    filter_size=filter_size,
                    padding=padding,
                    param_attr=fluid.ParamAttr(
                        initializer=fluid.initializer.XavierInitializer()),
                    bias_attr=fluid.ParamAttr(
                        initializer=fluid.initializer.Uniform(
                            low=-k, high=k)),
                    use_cudnn=use_cudnn))

        self.conv_list.append(Conv1D(num_channels = num_hidden,
                                     num_filters = n_mels * outputs_per_step,
                                     filter_size = filter_size,
                                     padding = padding,
                                     param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
                                     bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)),
                                     use_cudnn = use_cudnn))
        self.conv_list.append(
            Conv1D(
                num_channels=num_hidden,
                num_filters=n_mels * outputs_per_step,
                filter_size=filter_size,
                padding=padding,
                param_attr=fluid.ParamAttr(
                    initializer=fluid.initializer.XavierInitializer()),
                bias_attr=fluid.ParamAttr(
                    initializer=fluid.initializer.Uniform(
                        low=-k, high=k)),
                use_cudnn=use_cudnn))

        for i, layer in enumerate(self.conv_list):
            self.add_sublayer("conv_list_{}".format(i), layer)

        self.batch_norm_list = [dg.BatchNorm(num_hidden,
                                             data_layout='NCHW') for _ in range(num_conv-1)]
        self.batch_norm_list = [
            dg.BatchNorm(
                num_hidden, data_layout='NCHW') for _ in range(num_conv - 1)
        ]
        if self.batchnorm_last:
            self.batch_norm_list.append(dg.BatchNorm(n_mels * outputs_per_step,
                                                     data_layout='NCHW'))
            self.batch_norm_list.append(
                dg.BatchNorm(
                    n_mels * outputs_per_step, data_layout='NCHW'))
        for i, layer in enumerate(self.batch_norm_list):
            self.add_sublayer("batch_norm_list_{}".format(i), layer)

    def forward(self, input):
        """
@ -69,20 +100,24 @@ class PostConvNet(dg.Layer):
        Returns:
            output (Variable), Shape(B, T, C), the result after postconvnet.
        """

        input = layers.transpose(input, [0,2,1])

        input = layers.transpose(input, [0, 2, 1])
        len = input.shape[-1]
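        # NB: this local `len` shadows the Python builtin; it holds the time
        # length so the padded convolution outputs can be trimmed below.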
        for i in range(self.num_conv-1):
        for i in range(self.num_conv - 1):
            batch_norm = self.batch_norm_list[i]
            conv = self.conv_list[i]

            input = layers.dropout(layers.tanh(batch_norm(conv(input)[:,:,:len])), self.dropout,
                                   dropout_implementation='upscale_in_train')
        conv = self.conv_list[self.num_conv-1]
        input = conv(input)[:,:,:len]

            input = layers.dropout(
                layers.tanh(batch_norm(conv(input)[:, :, :len])),
                self.dropout,
                dropout_implementation='upscale_in_train')
        conv = self.conv_list[self.num_conv - 1]
        input = conv(input)[:, :, :len]
        if self.batchnorm_last:
            batch_norm = self.batch_norm_list[self.num_conv-1]
            input = layers.dropout(batch_norm(input), self.dropout,
                                   dropout_implementation='upscale_in_train')
        output = layers.transpose(input, [0,2,1])
        return output
            batch_norm = self.batch_norm_list[self.num_conv - 1]
            input = layers.dropout(
                batch_norm(input),
                self.dropout,
                dropout_implementation='upscale_in_train')
        output = layers.transpose(input, [0, 2, 1])
        return output
@ -1,8 +1,22 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
import paddle.fluid.layers as layers


class PreNet(dg.Layer):
    def __init__(self, input_size, hidden_size, output_size, dropout_rate=0.2):
        """
@ -17,13 +31,21 @@ class PreNet(dg.Layer):
        self.dropout_rate = dropout_rate

        k = math.sqrt(1 / input_size)
        self.linear1 = dg.Linear(input_size, hidden_size,
                                 param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
                                 bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
        self.linear1 = dg.Linear(
            input_size,
            hidden_size,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))
        k = math.sqrt(1 / hidden_size)
        self.linear2 = dg.Linear(hidden_size, output_size,
                                 param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
                                 bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
        self.linear2 = dg.Linear(
            hidden_size,
            output_size,
            param_attr=fluid.ParamAttr(
                initializer=fluid.initializer.XavierInitializer()),
            bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
                low=-k, high=k)))

    def forward(self, x):
        """
@ -34,6 +56,12 @@ class PreNet(dg.Layer):
        Returns:
            x (Variable), Shape(B, T, C), the result after prenet.
        """
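        # dropout_implementation='upscale_in_train' rescales kept activations
        # by 1/(1 - rate) at train time, so inference needs no extra scaling.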
        x = layers.dropout(layers.relu(self.linear1(x)), self.dropout_rate, dropout_implementation='upscale_in_train')
        x = layers.dropout(layers.relu(self.linear2(x)), self.dropout_rate, dropout_implementation='upscale_in_train')
        x = layers.dropout(
            layers.relu(self.linear1(x)),
            self.dropout_rate,
            dropout_implementation='upscale_in_train')
        x = layers.dropout(
            layers.relu(self.linear2(x)),
            self.dropout_rate,
            dropout_implementation='upscale_in_train')
        return x
@ -1,8 +1,22 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
from parakeet.models.transformer_tts.encoder import Encoder
from parakeet.models.transformer_tts.decoder import Decoder


class TransformerTTS(dg.Layer):
    def __init__(self, config):
        super(TransformerTTS, self).__init__()
@ -10,16 +24,29 @@ class TransformerTTS(dg.Layer):
        self.decoder = Decoder(config['hidden_size'], config)
        self.config = config

    def forward(self, characters, mel_input, pos_text, pos_mel, dec_slf_mask, enc_slf_mask=None, enc_query_mask=None, enc_dec_mask=None, dec_query_slf_mask=None, dec_query_mask=None):
        key, attns_enc = self.encoder(characters, pos_text, mask=enc_slf_mask, query_mask=enc_query_mask)

        mel_output, postnet_output, attn_probs, stop_preds, attns_dec = self.decoder(key, key, mel_input, pos_mel,
                                                                                     mask=dec_slf_mask, zero_mask=enc_dec_mask,
                                                                                     m_self_mask=dec_query_slf_mask, m_mask=dec_query_mask )
    def forward(self,
                characters,
                mel_input,
                pos_text,
                pos_mel,
                dec_slf_mask,
                enc_slf_mask=None,
                enc_query_mask=None,
                enc_dec_mask=None,
                dec_query_slf_mask=None,
                dec_query_mask=None):
        key, attns_enc = self.encoder(
            characters, pos_text, mask=enc_slf_mask, query_mask=enc_query_mask)

        mel_output, postnet_output, attn_probs, stop_preds, attns_dec = self.decoder(
            key,
            key,
            mel_input,
            pos_mel,
            mask=dec_slf_mask,
            zero_mask=enc_dec_mask,
            m_self_mask=dec_query_slf_mask,
            m_mask=dec_query_mask)
        return mel_output, postnet_output, attn_probs, stop_preds, attns_enc, attns_dec

        return mel_output, postnet_output, attn_probs, stop_preds, attns_enc, attns_dec
@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import librosa
import os, copy
@ -6,14 +19,15 @@ import paddle.fluid.layers as layers


def get_positional_table(d_pos_vec, n_position=1024):
    position_enc = np.array([
        [pos / np.power(10000, 2*i/d_pos_vec) for i in range(d_pos_vec)]
        if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])
    position_enc = np.array(
        [[pos / np.power(10000, 2 * i / d_pos_vec) for i in range(d_pos_vec)]
         if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])

    position_enc[1:, 0::2] = np.sin(position_enc[1:, 0::2]) # dim 2i
    position_enc[1:, 1::2] = np.cos(position_enc[1:, 1::2]) # dim 2i+1
    position_enc[1:, 0::2] = np.sin(position_enc[1:, 0::2])  # dim 2i
    position_enc[1:, 1::2] = np.cos(position_enc[1:, 1::2])  # dim 2i+1
    return position_enc

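# Standard Transformer encoding: PE[pos, 2i] = sin(pos / 10000^(2i / d_hid)),
# PE[pos, 2i + 1] = cos(pos / 10000^(2i / d_hid)).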
def get_sinusoid_encoding_table(n_position, d_hid, padding_idx=None):
    ''' Sinusoid position encoding table '''

@ -23,7 +37,8 @@ def get_sinusoid_encoding_table(n_position, d_hid, padding_idx=None):
    def get_posi_angle_vec(position):
        return [cal_angle(position, hid_j) for hid_j in range(d_hid)]

    sinusoid_table = np.array([get_posi_angle_vec(pos_i) for pos_i in range(n_position)])
    sinusoid_table = np.array(
        [get_posi_angle_vec(pos_i) for pos_i in range(n_position)])

    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])  # dim 2i
    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])  # dim 2i+1
@ -34,11 +49,13 @@ def get_sinusoid_encoding_table(n_position, d_hid, padding_idx=None):

    return sinusoid_table


def get_non_pad_mask(seq):
    mask = (seq != 0).astype(np.float32)
    mask = np.expand_dims(mask, axis=-1)
    return mask


def get_attn_key_pad_mask(seq_k, seq_q):
    ''' For masking out the padding part of the key sequence. '''

@ -46,10 +63,11 @@ def get_attn_key_pad_mask(seq_k, seq_q):
    len_q = seq_q.shape[1]
    padding_mask = (seq_k != 0).astype(np.float32)
    padding_mask = np.expand_dims(padding_mask, axis=1)
    padding_mask = padding_mask.repeat([len_q],axis=1)
    padding_mask = (padding_mask == 0).astype(np.float32) * (-2 ** 32 + 1)
    padding_mask = padding_mask.repeat([len_q], axis=1)
    padding_mask = (padding_mask == 0).astype(np.float32) * (-2**32 + 1)
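    # -2**32 + 1 is a large negative stand-in for -inf: after softmax, the
    # padded key positions receive (near) zero attention weight.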
    return padding_mask


def get_dec_attn_key_pad_mask(seq_k, seq_q):
    ''' For masking out the padding part of the key sequence. '''

@ -58,33 +76,37 @@ def get_dec_attn_key_pad_mask(seq_k, seq_q):
    padding_mask = (seq_k == 0).astype(np.float32)
    padding_mask = np.expand_dims(padding_mask, axis=1)
    triu_tensor = get_triu_tensor(seq_q, seq_q)
    padding_mask = padding_mask.repeat([len_q],axis=1) + triu_tensor
    padding_mask = (padding_mask != 0).astype(np.float32) * (-2 ** 32 + 1)
    padding_mask = padding_mask.repeat([len_q], axis=1) + triu_tensor
    padding_mask = (padding_mask != 0).astype(np.float32) * (-2**32 + 1)
    return padding_mask


def get_triu_tensor(seq_k, seq_q):
    ''' Make an upper triangular (triu) tensor. '''
    len_k = seq_k.shape[1]
    len_q = seq_q.shape[1]
    batch_size = seq_k.shape[0]
    triu_tensor = np.triu(np.ones([len_k, len_q]), 1)
    triu_tensor = np.repeat(np.expand_dims(triu_tensor, axis=0) ,batch_size, axis=0)

    triu_tensor = np.repeat(
        np.expand_dims(
            triu_tensor, axis=0), batch_size, axis=0)

    return triu_tensor

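# W[n, t] = 1 - exp(-(t/T - n/N)^2 / (2 g^2)): penalizes attention that strays
# from the diagonal (guided attention, Tachibana et al. 2017).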
def guided_attention(N, T, g=0.2):
    '''Guided attention. Refer to page 3 of the paper.'''
    W = np.zeros((N, T), dtype=np.float32)
    for n_pos in range(W.shape[0]):
        for t_pos in range(W.shape[1]):
            W[n_pos, t_pos] = 1 - np.exp(-(t_pos / float(T) - n_pos / float(N)) ** 2 / (2 * g * g))
            W[n_pos, t_pos] = 1 - np.exp(-(t_pos / float(T) - n_pos / float(N))
                                         **2 / (2 * g * g))
    return W

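# Weighted binary cross entropy: epsilon guards log(0), and position_weight
# upweights the positive frames (e.g. the stop-token target).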
def cross_entropy(input, label, position_weight=1.0, epsilon=1e-30):
    output = -1 * label * layers.log(input + epsilon) - (1-label) * layers.log(1 - input + epsilon)
    output = -1 * label * layers.log(input + epsilon) - (
        1 - label) * layers.log(1 - input + epsilon)
    output = output * (label * (position_weight - 1) + 1)

    return layers.reduce_sum(output, dim=[0, 1])
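A minimal sketch of how the mask helpers above produce the extra model inputs this commit introduces (the id values are illustrative):

```python
import numpy as np

text = np.array([[5, 7, 2, 0, 0]])  # one 0-padded id sequence
query_mask = get_non_pad_mask(text)           # (1, 5, 1), 1.0 on real tokens
slf_mask = get_attn_key_pad_mask(text, text)  # (1, 5, 5), ~-inf on padded keys
# Arrays like these are batched in data.py and passed to
# TransformerTTS.forward(..., enc_slf_mask=..., enc_query_mask=...).
```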
@ -1,27 +1,44 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
from parakeet.modules.customized import Conv1D
from parakeet.models.transformer_tts.utils import *
from parakeet.models.transformer_tts.cbhg import CBHG


class Vocoder(dg.Layer):
    """
    CBHG Network (mel -> linear)
    """

    def __init__(self, config, batch_size):
        super(Vocoder, self).__init__()
        self.pre_proj = Conv1D(num_channels = config['audio']['num_mels'],
                               num_filters = config['hidden_size'],
                               filter_size=1)
        self.pre_proj = Conv1D(
            num_channels=config['audio']['num_mels'],
            num_filters=config['hidden_size'],
            filter_size=1)
        self.cbhg = CBHG(config['hidden_size'], batch_size)
        self.post_proj = Conv1D(num_channels = config['hidden_size'],
                                num_filters = (config['audio']['n_fft'] // 2) + 1,
                                filter_size=1)
        self.post_proj = Conv1D(
            num_channels=config['hidden_size'],
            num_filters=(config['audio']['n_fft'] // 2) + 1,
            filter_size=1)
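        # (n_fft // 2) + 1 linear-frequency bins: the size of a one-sided
        # STFT magnitude spectrum.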

    def forward(self, mel):
        mel = layers.transpose(mel, [0,2,1])
        mel = layers.transpose(mel, [0, 2, 1])
        mel = self.pre_proj(mel)
        mel = self.cbhg(mel)
        mag_pred = self.post_proj(mel)
        mag_pred = layers.transpose(mag_pred, [0,2,1])
        mag_pred = layers.transpose(mag_pred, [0, 2, 1])
        return mag_pred
@ -1 +1,15 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from parakeet.models.waveflow.waveflow import WaveFlow
@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import random

import librosa
@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import itertools
import os
import time
@ -8,6 +22,7 @@ from paddle import fluid
from scipy.io.wavfile import write

import utils
from parakeet.modules import weight_norm
from .data import LJSpeech
from .waveflow_modules import WaveFlowLoss, WaveFlowModule

@ -26,6 +41,7 @@ class WaveFlow():
        self.rank = rank
        self.nranks = nranks
        self.tb_logger = tb_logger
        self.dtype = "float16" if config.use_fp16 else "float32"
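        # Single switch for mixed precision: tensors below are created in
        # fp16 when config.use_fp16 is set (audio is cast back to float32
        # before the int16 conversion in synthesis).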

    def build(self, training=True):
        config = self.config
@ -36,9 +52,9 @@ class WaveFlow():
        waveflow = WaveFlowModule(config)

        # Dry run once to create and initialize all necessary parameters.
        audio = dg.to_variable(np.random.randn(1, 16000).astype(np.float32))
        audio = dg.to_variable(np.random.randn(1, 16000).astype(self.dtype))
        mel = dg.to_variable(
            np.random.randn(1, config.mel_bands, 63).astype(np.float32))
            np.random.randn(1, config.mel_bands, 63).astype(self.dtype))
        waveflow(audio, mel)

        if training:
@ -72,9 +88,14 @@ class WaveFlow():
                self.rank,
                waveflow,
                iteration=config.iteration,
                file_path=config.checkpoint)
                file_path=config.checkpoint,
                dtype=self.dtype)
            print("Rank {}: checkpoint loaded.".format(self.rank))

            for layer in waveflow.sublayers():
                if isinstance(layer, weight_norm.WeightNormWrapper):
                    layer.remove_weight_norm()

        self.waveflow = waveflow

    def train_step(self, iteration):
@ -173,7 +194,7 @@ class WaveFlow():
                syn_time))

        # Denormalize audio from [-1, 1] to [-32768, 32768] int16 range.
        audio = audio.numpy() * 32768.0
        audio = audio.numpy().astype("float32") * 32768.0
        audio = audio.astype('int16')
        write(filename, config.sample_rate, audio)
@ -1,5 +1,18 @@
import itertools
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import itertools
import numpy as np
import paddle.fluid.dygraph as dg
from paddle import fluid
@ -49,7 +62,7 @@ class WaveFlowLoss:


class Conditioner(dg.Layer):
    def __init__(self):
    def __init__(self, dtype):
        super(Conditioner, self).__init__()
        upsample_factors = [16, 16]

@ -65,7 +78,8 @@ class Conditioner(dg.Layer):
                padding=(1, s // 2),
                stride=(1, s),
                param_attr=param_attr,
                bias_attr=bias_attr)
                bias_attr=bias_attr,
                dtype="float32")
            self.upsample_conv2d.append(conv_trans2d)

        for i, layer in enumerate(self.upsample_conv2d):
@ -74,19 +88,30 @@ class Conditioner(dg.Layer):
    def forward(self, x):
        x = fluid.layers.unsqueeze(x, 1)
        for layer in self.upsample_conv2d:
            x = fluid.layers.leaky_relu(layer(x), alpha=0.4)
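            # Under fp16 inputs, cast around the transposed conv so it runs
            # in float32 (its dtype above), then restore fp16 afterwards.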
            in_dtype = x.dtype
            if in_dtype == fluid.core.VarDesc.VarType.FP16:
                x = fluid.layers.cast(x, "float32")
            x = layer(x)
            if in_dtype == fluid.core.VarDesc.VarType.FP16:
                x = fluid.layers.cast(x, "float16")
            x = fluid.layers.leaky_relu(x, alpha=0.4)

        return fluid.layers.squeeze(x, [1])
        return fluid.layers.reshape(x, [x.shape[0], x.shape[2], x.shape[3]])

    def infer(self, x):
        x = fluid.layers.unsqueeze(x, 1)
        for layer in self.upsample_conv2d:
            in_dtype = x.dtype
            if in_dtype == fluid.core.VarDesc.VarType.FP16:
                x = fluid.layers.cast(x, "float32")
            x = layer(x)
            if in_dtype == fluid.core.VarDesc.VarType.FP16:
                x = fluid.layers.cast(x, "float16")
            # Trim conv artifacts.
            time_cutoff = layer._filter_size[1] - layer._stride[1]
            x = fluid.layers.leaky_relu(x[:, :, :, :-time_cutoff], alpha=0.4)

        return fluid.layers.squeeze(x, [1])
        return fluid.layers.reshape(x, [x.shape[0], x.shape[2], x.shape[3]])


class Flow(dg.Layer):
@ -96,6 +121,7 @@ class Flow(dg.Layer):
        self.n_channels = config.n_channels
        self.kernel_h = config.kernel_h
        self.kernel_w = config.kernel_w
        self.dtype = "float16" if config.use_fp16 else "float32"

        # Transform audio: [batch, 1, n_group, time/n_group]
        # => [batch, n_channels, n_group, time/n_group]
@ -105,7 +131,8 @@ class Flow(dg.Layer):
            num_filters=self.n_channels,
            filter_size=(1, 1),
            param_attr=param_attr,
            bias_attr=bias_attr)
            bias_attr=bias_attr,
            dtype=self.dtype)

        # Initializing last layer to 0 makes the affine coupling layers
        # do nothing at first. This helps with training stability
@ -117,7 +144,8 @@ class Flow(dg.Layer):
            num_filters=2,
            filter_size=(1, 1),
            param_attr=param_attr,
            bias_attr=bias_attr)
            bias_attr=bias_attr,
            dtype=self.dtype)

        # receptive fields: (kernel - 1) * sum(dilations) + 1 >= squeeze
        dilation_dict = {
@ -145,7 +173,8 @@ class Flow(dg.Layer):
                filter_size=(self.kernel_h, self.kernel_w),
                dilation=(dilation_h, dilation_w),
                param_attr=param_attr,
                bias_attr=bias_attr)
                bias_attr=bias_attr,
                dtype=self.dtype)
            self.in_layers.append(in_layer)

            param_attr, bias_attr = get_param_attr(
@ -155,7 +184,8 @@ class Flow(dg.Layer):
                num_filters=2 * self.n_channels,
                filter_size=(1, 1),
                param_attr=param_attr,
                bias_attr=bias_attr)
                bias_attr=bias_attr,
                dtype=self.dtype)
            self.cond_layers.append(cond_layer)

            if i < self.n_layers - 1:
@ -169,7 +199,8 @@ class Flow(dg.Layer):
                num_filters=res_skip_channels,
                filter_size=(1, 1),
                param_attr=param_attr,
                bias_attr=bias_attr)
                bias_attr=bias_attr,
                dtype=self.dtype)
            self.res_skip_layers.append(res_skip_layer)

            self.add_sublayer("in_layer_{}".format(i), in_layer)
@ -189,10 +220,10 @@ class Flow(dg.Layer):
            # Pad width dim (time): dilated non-causal convolution
            pad_top, pad_bottom = (self.kernel_h - 1) * dilation_h, 0
            pad_left = pad_right = int((self.kernel_w - 1) * dilation_w / 2)
            audio_pad = fluid.layers.pad2d(
                audio, paddings=[pad_top, pad_bottom, pad_left, pad_right])

            hidden = self.in_layers[i](audio_pad)
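            # Instead of an explicit pad2d op, write the padding onto the
            # wrapped conv directly and let it pad during the convolution.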
            self.in_layers[i].layer._padding = [
                pad_top, pad_bottom, pad_left, pad_right
            ]
            hidden = self.in_layers[i](audio)
            cond_hidden = self.cond_layers[i](mel)
            in_acts = hidden + cond_hidden
            out_acts = fluid.layers.tanh(in_acts[:, :self.n_channels, :]) * \
@ -237,14 +268,14 @@ class Flow(dg.Layer):
            pad_top, pad_bottom = 0, 0
            pad_left = int((self.kernel_w - 1) * dilation_w / 2)
            pad_right = int((self.kernel_w - 1) * dilation_w / 2)
            state = fluid.layers.pad2d(
                state, paddings=[pad_top, pad_bottom, pad_left, pad_right])

            self.in_layers[i].layer._padding = [
                pad_top, pad_bottom, pad_left, pad_right
            ]
            hidden = self.in_layers[i](state)
            cond_hidden = self.cond_layers[i](mel)
            in_acts = hidden + cond_hidden
            out_acts = fluid.layers.tanh(in_acts[:, :self.n_channels, :]) * \
                fluid.layers.sigmoid(in_acts[:, self.n_channels:, :])
                fluid.layers.sigmoid(in_acts[:, self.n_channels:, :])
            res_skip_acts = self.res_skip_layers[i](out_acts)

            if i < self.n_layers - 1:
@ -270,7 +301,8 @@ class WaveFlowModule(dg.Layer):
        assert self.n_group % 2 == 0
        assert self.n_flows % 2 == 0

        self.conditioner = Conditioner()
        self.dtype = "float16" if config.use_fp16 else "float32"
        self.conditioner = Conditioner(self.dtype)
        self.flows = []
        for i in range(self.n_flows):
            flow = Flow(config)
@ -324,17 +356,21 @@ class WaveFlowModule(dg.Layer):
            mel_slices = [mel[:, :, j, :] for j in self.perms[i]]
            mel = fluid.layers.stack(mel_slices, axis=2)

        z = fluid.layers.squeeze(audio, [1])
        z = fluid.layers.reshape(
            audio, [audio.shape[0], audio.shape[2], audio.shape[3]])
        return z, log_s_list

    def synthesize(self, mel, sigma=1.0):
        if self.dtype == "float16":
            mel = fluid.layers.cast(mel, self.dtype)
        mel = self.conditioner.infer(mel)
        # From [bs, mel_bands, time] to [bs, mel_bands, n_group, time/n_group]
        mel = fluid.layers.transpose(unfold(mel, self.n_group), [0, 1, 3, 2])

        audio = fluid.layers.gaussian_random(
            shape=[mel.shape[0], 1, mel.shape[2], mel.shape[3]], std=sigma)

        if self.dtype == "float16":
            audio = fluid.layers.cast(audio, self.dtype)
        for i in reversed(range(self.n_flows)):
            # Permute over the height dimension.
            audio_slices = [audio[:, :, j, :] for j in self.perms[i]]
@ -362,9 +398,9 @@ class WaveFlowModule(dg.Layer):
            audio = fluid.layers.concat(audio_list, axis=2)

        # audio: [bs, n_group, time/n_group]
        audio = fluid.layers.squeeze(audio, [1])
        audio = fluid.layers.reshape(
            audio, [audio.shape[0], audio.shape[2], audio.shape[3]])
        # audio: [bs, time]
        audio = fluid.layers.reshape(
            fluid.layers.transpose(audio, [0, 2, 1]), [audio.shape[0], -1])

        return audio
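A minimal synthesis sketch for `WaveFlowModule` (assumes a parsed `config` carrying the fields used above — `use_fp16`, `mel_bands`, `n_flows`, `n_group`; shapes are illustrative):

```python
import numpy as np
import paddle.fluid.dygraph as dg

with dg.guard():
    model = WaveFlowModule(config)  # `config` assumed, as noted above
    # One utterance: 80 mel bands, 63 frames, matching the dry-run shape.
    mel = dg.to_variable(np.random.randn(1, 80, 63).astype("float32"))
    audio = model.synthesize(mel, sigma=1.0)  # [batch, time] waveform
```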
@ -1,97 +0,0 @@
# WaveNet with Paddle Fluid

Paddle Fluid implementation of WaveNet, a deep generative model of raw audio waveforms.
The WaveNet model was originally proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499).
Our implementation is based on the WaveNet architecture described in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281) and can provide various output distributions, including single Gaussian, mixture of Gaussians, and softmax with linearly quantized channels.

We implement the WaveNet model in Paddle Fluid with the dynamic graph, which is convenient for flexible network architectures.

## Project Structure
```text
├── configs                # yaml configuration files of preset model hyperparameters
├── data.py                # dataset and dataloader settings for LJSpeech
├── slurm.py               # optional slurm helper functions if you use slurm to train the model
├── synthesis.py           # script for speech synthesis
├── train.py               # script for model training
├── utils.py               # helper functions, e.g., model checkpointing
├── wavenet.py             # WaveNet model high-level APIs
└── wavenet_modules.py     # WaveNet model implementation
```

## Usage

There are many hyperparameters to be tuned depending on the specification of the model and dataset you are working on. Hyperparameters that are known to work well for the LJSpeech dataset are provided as yaml files in the `./configs/` folder. Specifically, we provide `wavenet_ljspeech_single_gaussian.yaml`, `wavenet_ljspeech_mix_gaussian.yaml`, and `wavenet_ljspeech_softmax.yaml` config files for WaveNet with single Gaussian, 10-component mixture of Gaussians, and softmax (with 2048 linearly quantized channels) output distributions, respectively.

Note that `train.py` and `synthesis.py` both accept a `--config` parameter. To ensure consistency, you should use the same config yaml file for both training and synthesizing. You can also overwrite these preset hyperparameters on the command line by appending parameters after `--config`. For example, `--config=${yaml} --batch_size=8 --layers=20` overwrites the corresponding hyperparameters in the `${yaml}` config file. For more details about these hyperparameters, check `utils.add_config_options_to_parser`.

Note that you also need to specify some additional parameters for `train.py` and `synthesis.py`; the details can be found in `train.add_options_to_parser` and `synthesis.add_options_to_parser`, respectively.

### Dataset

Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```

In this example, assume that the path of the unzipped LJSpeech dataset is `./data/LJSpeech-1.1`.

### Train on single GPU

```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}/../../.."
export CUDA_VISIBLE_DEVICES=0
python -u train.py --config=${yaml} \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --batch_size=4 \
    --parallel=false --use_gpu=true
```

#### Save and Load checkpoints

Our model saves model parameters as checkpoints in `./runs/wavenet/${ModelName}/checkpoint/` every 10000 iterations by default.
Saved checkpoints have the format `step-${iteration_number}.pdparams` for model parameters and `step-${iteration_number}.pdopt` for optimizer parameters.

There are three ways to load a checkpoint and resume training (suppose you want to load a 500000-iteration checkpoint):
1. Use `--checkpoint=./runs/wavenet/${ModelName}/checkpoint/step-500000` to provide a specific path to load. Note that you only need to provide the base name of the parameter file, which is `step-500000`; no extension name `.pdparams` or `.pdopt` is needed.
2. Use `--iteration=500000`.
3. If you don't specify either `--checkpoint` or `--iteration`, the model will automatically load the latest checkpoint in `./runs/wavenet/${ModelName}/checkpoint`.

### Train on multiple GPUs

```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}/../../.."
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -u -m paddle.distributed.launch train.py \
    --config=${yaml} \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --parallel=true --use_gpu=true
```

Use `export CUDA_VISIBLE_DEVICES=0,1,2,3` to make the GPUs that you want to use visible. Then the `paddle.distributed.launch` module will use these visible GPUs to do data-parallel training in multiprocessing mode.

### Monitor with Tensorboard

By default, the logs are saved in `./runs/wavenet/${ModelName}/logs/`. You can monitor them with TensorBoard.

```bash
tensorboard --logdir=${log_dir} --port=8888
```

### Synthesize from a checkpoint

Check the [Save and Load checkpoints](#save-and-load-checkpoints) section on how to load a specific checkpoint.
The following example will automatically load the latest checkpoint:

```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}/../../.."
export CUDA_VISIBLE_DEVICES=0
python -u synthesis.py --config=${yaml} \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true \
    --output=./syn_audios \
    --sample=${SAMPLE}
```

In this example, `--output` specifies where to save the synthesized audios and `--sample` specifies which sample in the valid dataset (a split from the whole LJSpeech dataset, by default containing the first 16 audio samples) to synthesize based on the mel-spectrograms computed from the ground-truth sample audio. For example, `--sample=0` means synthesizing the first audio in the valid dataset.
@ -0,0 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .net import *
from .wavenet import *
@ -1,32 +0,0 @@
valid_size: 16
train_clip_second: 0.5
sample_rate: 22050
fft_window_shift: 256
fft_window_size: 1024
fft_size: 2048
mel_bands: 80

seed: 1
batch_size: 8
test_every: 2000
save_every: 10000
max_iterations: 2000000

layers: 30
kernel_width: 2
dilation_block: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
residual_channels: 128
skip_channels: 128
loss_type: mix-gaussian-pdf
num_mixtures: 10
log_scale_min: -9.0

conditioner:
  filter_sizes: [[32, 3], [32, 3]]
  upsample_factors: [16, 16]

learning_rate: 0.001
gradient_max_norm: 100.0
anneal:
  every: 200000
  rate: 0.5
@ -1,32 +0,0 @@
valid_size: 16
train_clip_second: 0.5
sample_rate: 22050
fft_window_shift: 256
fft_window_size: 1024
fft_size: 2048
mel_bands: 80

seed: 1
batch_size: 8
test_every: 2000
save_every: 10000
max_iterations: 2000000

layers: 30
kernel_width: 2
dilation_block: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
residual_channels: 128
skip_channels: 128
loss_type: mix-gaussian-pdf
num_mixtures: 1
log_scale_min: -9.0

conditioner:
  filter_sizes: [[32, 3], [32, 3]]
  upsample_factors: [16, 16]

learning_rate: 0.001
gradient_max_norm: 100.0
anneal:
  every: 200000
  rate: 0.5
@ -1,31 +0,0 @@
valid_size: 16
train_clip_second: 0.5
sample_rate: 22050
fft_window_shift: 256
fft_window_size: 1024
fft_size: 2048
mel_bands: 80

seed: 1
batch_size: 8
test_every: 2000
save_every: 10000
max_iterations: 2000000

layers: 30
kernel_width: 2
dilation_block: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
residual_channels: 128
skip_channels: 128
loss_type: softmax
num_channels: 2048

conditioner:
  filter_sizes: [[32, 3], [32, 3]]
  upsample_factors: [16, 16]

learning_rate: 0.001
gradient_max_norm: 100.0
anneal:
  every: 200000
  rate: 0.5
@ -1,160 +0,0 @@
import random

import librosa
import numpy as np
from paddle import fluid

import utils
from parakeet.datasets import ljspeech
from parakeet.data import dataset
from parakeet.data.sampler import DistributedSampler, BatchSampler
from parakeet.data.datacargo import DataCargo


class Dataset(ljspeech.LJSpeech):
    def __init__(self, config):
        super(Dataset, self).__init__(config.root)
        self.config = config
        self.fft_window_shift = config.fft_window_shift
        # Calculate context frames.
        frames_per_second = config.sample_rate // self.fft_window_shift
        train_clip_frames = int(np.ceil(
            config.train_clip_second * frames_per_second))
        context_frames = config.context_size // self.fft_window_shift
        self.num_frames = train_clip_frames + context_frames

    def _get_example(self, metadatum):
        fname, _, _ = metadatum
        wav_path = self.root.joinpath("wavs", fname + ".wav")

        config = self.config
        sr = config.sample_rate
        fft_window_shift = config.fft_window_shift
        fft_window_size = config.fft_window_size
        fft_size = config.fft_size

        audio, loaded_sr = librosa.load(wav_path, sr=None)
        assert loaded_sr == sr

        # Pad audio to the right size.
        frames = int(np.ceil(float(audio.size) / fft_window_shift))
        fft_padding = (fft_size - fft_window_shift) // 2
        desired_length = frames * fft_window_shift + fft_padding * 2
        pad_amount = (desired_length - audio.size) // 2

        if audio.size % 2 == 0:
            audio = np.pad(audio, (pad_amount, pad_amount), mode='reflect')
        else:
            audio = np.pad(audio, (pad_amount, pad_amount + 1), mode='reflect')

        # Normalize audio.
        audio = audio / np.abs(audio).max() * 0.999

        # Compute mel-spectrogram.
        # Turn center to False to prevent internal padding.
        spectrogram = librosa.core.stft(
            audio, hop_length=fft_window_shift,
            win_length=fft_window_size, n_fft=fft_size, center=False)
        spectrogram_magnitude = np.abs(spectrogram)

        # Compute mel-spectrograms.
        mel_filter_bank = librosa.filters.mel(sr=sr, n_fft=fft_size,
                                              n_mels=config.mel_bands)
        mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)
        mel_spectrogram = mel_spectrogram.T

        # Rescale mel_spectrogram.
        min_level, ref_level = 1e-5, 20
        mel_spectrogram = 20 * np.log10(np.maximum(min_level, mel_spectrogram))
        mel_spectrogram = mel_spectrogram - ref_level
        mel_spectrogram = np.clip((mel_spectrogram + 100) / 100, 0, 1)
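        # Maps the log-magnitude (dB) mel to [0, 1]: floor at min_level,
        # shift by ref_level, then clip((x + 100) / 100, 0, 1).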

        # Extract the center of audio that corresponds to mel spectrograms.
        audio = audio[fft_padding : -fft_padding]
        assert mel_spectrogram.shape[0] * fft_window_shift == audio.size

        return audio, mel_spectrogram


class Subset(dataset.Dataset):
    def __init__(self, dataset, indices, valid):
        self.dataset = dataset
        self.indices = indices
        self.valid = valid

    def __getitem__(self, idx):
        fft_window_shift = self.dataset.fft_window_shift
        num_frames = self.dataset.num_frames
        audio, mel = self.dataset[self.indices[idx]]

        if self.valid:
            audio_start = 0
        else:
            # Randomly crop context + train_clip_second of audio.
            audio_frames = int(audio.size) // fft_window_shift
            max_start_frame = audio_frames - num_frames
            assert max_start_frame >= 0, "audio {} is too short".format(idx)

            frame_start = random.randint(0, max_start_frame)
            frame_end = frame_start + num_frames

            audio_start = frame_start * fft_window_shift
            audio_end = frame_end * fft_window_shift

            audio = audio[audio_start : audio_end]

        return audio, mel, audio_start

    def _batch_examples(self, batch):
        audios = [sample[0] for sample in batch]
        audio_starts = [sample[2] for sample in batch]

        # mels shape [num_frames, mel_bands]
        max_frames = max(sample[1].shape[0] for sample in batch)
        mels = [utils.pad_to_size(sample[1], max_frames) for sample in batch]

        audios = np.array(audios, dtype=np.float32)
        mels = np.array(mels, dtype=np.float32)
        audio_starts = np.array(audio_starts, dtype=np.int32)

        return audios, mels, audio_starts

    def __len__(self):
        return len(self.indices)


class LJSpeech:
    def __init__(self, config, nranks, rank):
        place = fluid.CUDAPlace(rank) if config.use_gpu else fluid.CPUPlace()

        # Whole LJSpeech dataset.
        ds = Dataset(config)

        # Split into train and valid dataset.
        indices = list(range(len(ds)))
        train_indices = indices[config.valid_size:]
        valid_indices = indices[:config.valid_size]
        random.shuffle(train_indices)

        # Train dataset.
        trainset = Subset(ds, train_indices, valid=False)
        sampler = DistributedSampler(len(trainset), nranks, rank)
        total_bs = config.batch_size
        assert total_bs % nranks == 0
        train_sampler = BatchSampler(sampler, total_bs // nranks,
                                     drop_last=True)
        trainloader = DataCargo(trainset, batch_sampler=train_sampler)

        trainreader = fluid.io.PyReader(capacity=50, return_list=True)
        trainreader.decorate_batch_generator(trainloader, place)
        self.trainloader = (data for _ in iter(int, 1)
                            for data in trainreader())

        # Valid dataset.
        validset = Subset(ds, valid_indices, valid=True)
        # Currently only support batch_size = 1 for valid loader.
        validloader = DataCargo(validset, batch_size=1, shuffle=False)

        validreader = fluid.io.PyReader(capacity=20, return_list=True)
        validreader.decorate_batch_generator(validloader, place)
        self.validloader = validreader