Modified data.py to generate masks as model inputs

This commit is contained in:
lifuchen 2020-03-05 07:22:50 +00:00
commit d08779d61e
119 changed files with 5525 additions and 2745 deletions

View File

@ -25,3 +25,11 @@
files: \.md$
- id: remove-tabs
files: \.md$
- repo: local
hooks:
- id: copyright_checker
name: copyright_checker
entry: python ./tools/copyright.hook
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|py)$
exclude: (?!.*third_party)^.*$ | (?!.*book)^.*$

View File

@ -13,6 +13,4 @@
limitations under the License.
Part of the code was copied or adapted from https://github.com/r9y9/deepvoice3_pytorch/
Copyright (c) 2017: Ryuichi Yamamoto, whose license applies.

View File

@ -1,16 +1,27 @@
# Parakeet
Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It is built on Paddle Fluid dynamic graph, with the support of many influential TTS models proposed by [Baidu Research](http://research.baidu.com) and other academic institutions.
Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It is built on PaddlePaddle Fluid dynamic graph and includes many influential TTS models proposed by [Baidu Research](http://research.baidu.com) and other research groups.
<div align="center">
<img src="images/logo.png" width=450 /> <br>
</div>
## Installation
In particular, it features the latest [WaveFlow](https://arxiv.org/abs/1912.01219) model proposed by Baidu Research.
- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than WaveGlow and several orders of magnitude faster than WaveNet.
- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M) and comparable to WaveNet (4.6M).
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.
### Install Paddlepaddle
### Setup
See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. This repo requires paddlepaddle's version is above 1.7.
Make sure the library `libsndfile1` is installed, e.g., on Ubuntu.
```bash
sudo apt-get install libsndfile1
```
### Install PaddlePaddle
See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. This repo requires paddlepaddle 1.7 or above.
### Install Parakeet
@ -20,12 +31,6 @@ cd Parakeet
pip install -e .
```
### Setup
Make sure libsndfile1 installed:
```bash
sudo apt-get install libsndfile1
```
### Install CMUdict for nltk
CMUdict from nltk is used to transform text into phonemes.
@ -36,14 +41,24 @@ nltk.download("cmudict")
```
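As a quick sanity check after downloading, you can query the dictionary directly (a minimal sketch; `cmudict` is part of nltk's corpus collection):
```python
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict")   # one-time download of the pronunciation dictionary
pron = cmudict.dict()      # maps lowercase words to lists of phoneme sequences
print(pron["speech"][0])   # ['S', 'P', 'IY1', 'CH']
```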
## Supported models
## Related Research
- [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654)
- [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895)
- [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).
- [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263)
- [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
- [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499)
- [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281)
## Examples
- [Train a deepvoice 3 model with ljspeech dataset](./parakeet/examples/deepvoice3)
- [Train a transformer_tts model with ljspeech dataset](./parakeet/examples/transformer_tts)
- [Train a fastspeech model with ljspeech dataset](./parakeet/examples/fastspeech)
- [Train a DeepVoice3 model with ljspeech dataset](./examples/deepvoice3)
- [Train a TransformerTTS model with ljspeech dataset](./examples/transformer_tts)
- [Train a FastSpeech model with ljspeech dataset](./examples/fastspeech)
- [Train a WaveFlow model with ljspeech dataset](./examples/waveflow)
- [Train a WaveNet model with ljspeech dataset](./examples/wavenet)
- [Train a Clarinet model with ljspeech dataset](./examples/clarinet)
## Copyright and License
Parakeet is provided under the [Apache-2.0 license](LICENSE).

103
examples/clarinet/README.md Normal file
View File

@ -0,0 +1,103 @@
# Clarinet
PaddlePaddle dynamic graph implementation of ClariNet, a convolutional network based vocoder. The implementation is based on the paper [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Project Structure
```text
├── data.py data_processing
├── configs/ (example) configuration file
├── synthesis.py script to synthesize waveform from mel_spectrogram
├── train.py script to train a model
└── utils.py utility functions
```
## Train
Train the model using train.py; follow the usage displayed by `python train.py --help`.
```text
usage: train.py [-h] [--config CONFIG] [--device DEVICE] [--output OUTPUT]
[--data DATA] [--resume RESUME] [--wavenet WAVENET]
train a ClariNet model with LJspeech and a trained WaveNet model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG path of the config file.
--device DEVICE device to use.
--output OUTPUT path to save results.
--data DATA path of LJspeech dataset.
--resume RESUME checkpoint to load from.
--wavenet WAVENET wavenet checkpoint to use.
```
1. `--config` is the configuration file to use. The provided configurations can be used directly. You can also change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
3. `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before training.
4. `--output` is the directory to save results, all results are saved in this directory. The structure of the output directory is shown below.
```text
├── checkpoints # checkpoint
├── states # audio files generated at validation
└── log # tensorboard log
```
5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
6. `--wavenet` is the path of the wavenet checkpoint to load. If you do not specify `--resume`, then this must be provided.
Before you start training a ClariNet model, you should have trained a WaveNet model with a single Gaussian output distribution. Make sure the config of the teacher model matches that of the trained model.
Example script:
```bash
python train.py --config=./configs/clarinet_ljspeech.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0 --wavenet=wavenet_checkpoint/step_500000
```
You can monitor training log via tensorboard, using the script below.
```bash
cd experiment/log
tensorboard --logdir=.
```
## Synthesis
```text
usage: synthesis.py [-h] [--config CONFIG] [--device DEVICE] [--data DATA]
checkpoint output
synthesize audio files from mel spectrogram in the validation set.
positional arguments:
checkpoint checkpoint to load from.
output path to save synthesized audio.
optional arguments:
-h, --help show this help message and exit
--config CONFIG path of the config file.
--device DEVICE device to use.
--data DATA path of LJspeech dataset.
```
1. `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
2. `--data` is the path of the LJSpeech dataset. Strictly speaking, a dataset is not needed for synthesis, but since the input is a mel spectrogram, the audio files in the dataset are used to compute mel spectrograms.
3. `checkpoint` is the checkpoint to load.
4. `output` is the directory to save results. The output path contains the generated audio files (`*.wav`).
5. `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
Example script:
```bash
python synthesis.py --config=./configs/clarinet_ljspeech.yaml --data=./LJSpeech-1.1/ --device=0 experiment/checkpoints/step_500000 generated
```

View File

@ -0,0 +1,52 @@
data:
batch_size: 8
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
conditioner:
upsampling_factors: [16, 16]
teacher:
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
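# "mog" with output_dim 3 is a one-component mixture of Gaussians, i.e. the
# single Gaussian output that train.py asserts for the teacher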
loss_type: "mog"
output_dim: 3
log_scale_min: -9
student:
n_loops: [10, 10, 10, 10, 10, 10]
n_layers: [1, 1, 1, 1, 1, 1]
filter_size: 3
residual_channels: 64
log_scale_min: -7
stft:
n_fft: 2048
win_length: 1024
hop_length: 256
loss:
lmd: 4
train:
learning_rate: 0.0005
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 1000
eval_interval: 1000
max_iterations: 2000000

View File

@ -0,0 +1,151 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import argparse
import ruamel.yaml
import random
from tqdm import tqdm
import pickle
import numpy as np
from tensorboardX import SummaryWriter
import paddle.fluid.dygraph as dg
from paddle import fluid
from parakeet.models.wavenet import WaveNet, UpsampleNet
from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet
from parakeet.data import TransformDataset, SliceDataset, RandomSampler, SequentialSampler, DataCargo
from parakeet.utils.layer_tools import summary, freeze
from utils import valid_model, eval_model, save_checkpoint, load_checkpoint, load_model
sys.path.append("../wavenet")
from data import LJSpeechMetaData, Transform, DataCollector
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="synthesize audio files from mel spectrogram in the validation set."
)
parser.add_argument("--config", type=str, help="path of the config file.")
parser.add_argument(
"--device", type=int, default=-1, help="device to use.")
parser.add_argument("--data", type=str, help="path of LJspeech dataset.")
parser.add_argument(
"checkpoint", type=str, help="checkpoint to load from.")
parser.add_argument(
"output", type=str, default="experiment", help="path to save student.")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
ljspeech_meta = LJSpeechMetaData(args.data)
data_config = config["data"]
sample_rate = data_config["sample_rate"]
n_fft = data_config["n_fft"]
win_length = data_config["win_length"]
hop_length = data_config["hop_length"]
n_mels = data_config["n_mels"]
train_clip_seconds = data_config["train_clip_seconds"]
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
ljspeech = TransformDataset(ljspeech_meta, transform)
valid_size = data_config["valid_size"]
ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))
teacher_config = config["teacher"]
n_loop = teacher_config["n_loop"]
n_layer = teacher_config["n_layer"]
filter_size = teacher_config["filter_size"]
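# receptive field (in samples) of the teacher WaveNet's stacked dilated convolutions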
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
print("context size is {} samples".format(context_size))
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
train_clip_seconds)
valid_batch_fn = DataCollector(
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
batch_size = data_config["batch_size"]
train_cargo = DataCargo(
ljspeech_train,
train_batch_fn,
batch_size,
sampler=RandomSampler(ljspeech_train))
# only batch_size=1 is supported for validation
valid_cargo = DataCargo(
ljspeech_valid,
valid_batch_fn,
batch_size=1,
sampler=SequentialSampler(ljspeech_valid))
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
with dg.guard(place):
# conditioner(upsampling net)
conditioner_config = config["conditioner"]
upsampling_factors = conditioner_config["upsampling_factors"]
upsample_net = UpsampleNet(upscale_factors=upsampling_factors)
freeze(upsample_net)
residual_channels = teacher_config["residual_channels"]
loss_type = teacher_config["loss_type"]
output_dim = teacher_config["output_dim"]
log_scale_min = teacher_config["log_scale_min"]
assert loss_type == "mog" and output_dim == 3, \
"the teacher wavenet should be a wavenet with single gaussian output"
teacher = WaveNet(n_loop, n_layer, residual_channels, output_dim,
n_mels, filter_size, loss_type, log_scale_min)
# load & freeze upsample_net & teacher
freeze(teacher)
student_config = config["student"]
n_loops = student_config["n_loops"]
n_layers = student_config["n_layers"]
student_residual_channels = student_config["residual_channels"]
student_filter_size = student_config["filter_size"]
student_log_scale_min = student_config["log_scale_min"]
student = ParallelWaveNet(n_loops, n_layers, student_residual_channels,
n_mels, student_filter_size)
stft_config = config["stft"]
stft = STFT(
n_fft=stft_config["n_fft"],
hop_length=stft_config["hop_length"],
win_length=stft_config["win_length"])
lmd = config["loss"]["lmd"]
model = Clarinet(upsample_net, teacher, student, stft,
student_log_scale_min, lmd)
summary(model)
load_model(model, args.checkpoint)
# loader
train_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
train_loader.set_batch_generator(train_cargo, place)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
valid_loader.set_batch_generator(valid_cargo, place)
if not os.path.exists(args.output):
os.makedirs(args.output)
eval_model(model, valid_loader, args.output, sample_rate)

220
examples/clarinet/train.py Normal file
View File

@ -0,0 +1,220 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import argparse
import ruamel.yaml
import random
from tqdm import tqdm
import pickle
import numpy as np
from tensorboardX import SummaryWriter
import paddle.fluid.dygraph as dg
from paddle import fluid
from parakeet.models.wavenet import WaveNet, UpsampleNet
from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet
from parakeet.data import TransformDataset, SliceDataset, RandomSampler, SequentialSampler, DataCargo
from parakeet.utils.layer_tools import summary, freeze
from utils import make_output_tree, valid_model, save_checkpoint, load_checkpoint, load_wavenet
sys.path.append("../wavenet")
from data import LJSpeechMetaData, Transform, DataCollector
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="train a clarinet model with LJspeech and a trained wavenet model."
)
parser.add_argument("--config", type=str, help="path of the config file.")
parser.add_argument(
"--device", type=int, default=-1, help="device to use.")
parser.add_argument(
"--output",
type=str,
default="experiment",
help="path to save student.")
parser.add_argument("--data", type=str, help="path of LJspeech dataset.")
parser.add_argument("--resume", type=str, help="checkpoint to load from.")
parser.add_argument(
"--wavenet", type=str, help="wavenet checkpoint to use.")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
ljspeech_meta = LJSpeechMetaData(args.data)
data_config = config["data"]
sample_rate = data_config["sample_rate"]
n_fft = data_config["n_fft"]
win_length = data_config["win_length"]
hop_length = data_config["hop_length"]
n_mels = data_config["n_mels"]
train_clip_seconds = data_config["train_clip_seconds"]
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
ljspeech = TransformDataset(ljspeech_meta, transform)
valid_size = data_config["valid_size"]
ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))
teacher_config = config["teacher"]
n_loop = teacher_config["n_loop"]
n_layer = teacher_config["n_layer"]
filter_size = teacher_config["filter_size"]
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
print("context size is {} samples".format(context_size))
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
train_clip_seconds)
valid_batch_fn = DataCollector(
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
batch_size = data_config["batch_size"]
train_cargo = DataCargo(
ljspeech_train,
train_batch_fn,
batch_size,
sampler=RandomSampler(ljspeech_train))
# only batch_size=1 is supported for validation
valid_cargo = DataCargo(
ljspeech_valid,
valid_batch_fn,
batch_size=1,
sampler=SequentialSampler(ljspeech_valid))
make_output_tree(args.output)
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
with dg.guard(place):
# conditioner(upsampling net)
conditioner_config = config["conditioner"]
upsampling_factors = conditioner_config["upsampling_factors"]
upsample_net = UpsampleNet(upscale_factors=upsampling_factors)
freeze(upsample_net)
residual_channels = teacher_config["residual_channels"]
loss_type = teacher_config["loss_type"]
output_dim = teacher_config["output_dim"]
log_scale_min = teacher_config["log_scale_min"]
assert loss_type == "mog" and output_dim == 3, \
"the teacher wavenet should be a wavenet with single gaussian output"
teacher = WaveNet(n_loop, n_layer, residual_channels, output_dim,
n_mels, filter_size, loss_type, log_scale_min)
freeze(teacher)
student_config = config["student"]
n_loops = student_config["n_loops"]
n_layers = student_config["n_layers"]
student_residual_channels = student_config["residual_channels"]
student_filter_size = student_config["filter_size"]
student_log_scale_min = student_config["log_scale_min"]
student = ParallelWaveNet(n_loops, n_layers, student_residual_channels,
n_mels, student_filter_size)
stft_config = config["stft"]
stft = STFT(
n_fft=stft_config["n_fft"],
hop_length=stft_config["hop_length"],
win_length=stft_config["win_length"])
lmd = config["loss"]["lmd"]
model = Clarinet(upsample_net, teacher, student, stft,
student_log_scale_min, lmd)
summary(model)
# optim
train_config = config["train"]
learning_rate = train_config["learning_rate"]
anneal_rate = train_config["anneal_rate"]
anneal_interval = train_config["anneal_interval"]
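# decay the learning rate by anneal_rate once every anneal_interval steps (staircase mode)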
lr_scheduler = dg.ExponentialDecay(
learning_rate, anneal_interval, anneal_rate, staircase=True)
optim = fluid.optimizer.Adam(
lr_scheduler, parameter_list=model.parameters())
gradient_max_norm = train_config["gradient_max_norm"]
clipper = fluid.dygraph_grad_clip.GradClipByGlobalNorm(
gradient_max_norm)
assert args.wavenet or args.resume, "you should load from a trained wavenet or resume training; training without a trained wavenet is not recommended."
if args.wavenet:
load_wavenet(model, args.wavenet)
if args.resume:
load_checkpoint(model, optim, args.resume)
# loader
train_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
train_loader.set_batch_generator(train_cargo, place)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
valid_loader.set_batch_generator(valid_cargo, place)
# train
max_iterations = train_config["max_iterations"]
checkpoint_interval = train_config["checkpoint_interval"]
eval_interval = train_config["eval_interval"]
checkpoint_dir = os.path.join(args.output, "checkpoints")
state_dir = os.path.join(args.output, "states")
log_dir = os.path.join(args.output, "log")
writer = SummaryWriter(log_dir)
# training loop
global_step = 1
global_epoch = 1
while global_step < max_iterations:
epoch_loss = 0.
for j, batch in tqdm(enumerate(train_loader), desc="[train]"):
audios, mels, audio_starts = batch
model.train()
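# clip_kl: enable clipping of the KL term in the distillation loss only after 500 warm-up steps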
loss_dict = model(
audios, mels, audio_starts, clip_kl=global_step > 500)
writer.add_scalar("learning_rate",
optim._learning_rate.step().numpy()[0],
global_step)
for k, v in loss_dict.items():
writer.add_scalar("loss/{}".format(k),
v.numpy()[0], global_step)
l = loss_dict["loss"]
step_loss = l.numpy()[0]
print("[train] loss: {:<8.6f}".format(step_loss))
epoch_loss += step_loss
l.backward()
optim.minimize(l, grad_clip=clipper)
optim.clear_gradients()
if global_step % eval_interval == 0:
# evaluate on valid dataset
valid_model(model, valid_loader, state_dir, global_step,
sample_rate)
if global_step % checkpoint_interval == 0:
save_checkpoint(model, optim, checkpoint_dir, global_step)
global_step += 1
# epoch loss
average_loss = epoch_loss / (j + 1)  # j is the zero-based index of the last batch
writer.add_scalar("average_loss", average_loss, global_epoch)
global_epoch += 1

View File

@ -0,0 +1,96 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import soundfile as sf
from tensorboardX import SummaryWriter
from collections import OrderedDict
from paddle import fluid
import paddle.fluid.dygraph as dg
def make_output_tree(output_dir):
checkpoint_dir = os.path.join(output_dir, "checkpoints")
if not os.path.exists(checkpoint_dir):
os.makedirs(checkpoint_dir)
state_dir = os.path.join(output_dir, "states")
if not os.path.exists(state_dir):
os.makedirs(state_dir)
def valid_model(model, valid_loader, output_dir, global_step, sample_rate):
model.eval()
for i, batch in enumerate(valid_loader):
# print("sentence {}".format(i))
path = os.path.join(output_dir,
"step_{}_sentence_{}.wav".format(global_step, i))
audio_clips, mel_specs, audio_starts = batch
wav_var = model.synthesis(mel_specs)
wav_np = wav_var.numpy()[0]
sf.write(path, wav_np, samplerate=sample_rate)
print("generated {}".format(path))
def eval_model(model, valid_loader, output_dir, sample_rate):
model.eval()
for i, batch in enumerate(valid_loader):
# print("sentence {}".format(i))
path = os.path.join(output_dir, "sentence_{}.wav".format(i))
audio_clips, mel_specs, audio_starts = batch
wav_var = model.synthesis(mel_specs)
wav_np = wav_var.numpy()[0]
sf.write(path, wav_np, samplerate=sample_rate)
print("generated {}".format(path))
def save_checkpoint(model, optim, checkpoint_dir, global_step):
path = os.path.join(checkpoint_dir, "step_{}".format(global_step))
dg.save_dygraph(model.state_dict(), path)
print("saving model to {}".format(path + ".pdparams"))
if optim:
dg.save_dygraph(optim.state_dict(), path)
print("saving optimizer to {}".format(path + ".pdopt"))
def load_model(model, path):
model_dict, _ = dg.load_dygraph(path)
model.set_dict(model_dict)
print("loaded model from {}.pdparams".format(path))
def load_checkpoint(model, optim, path):
model_dict, optim_dict = dg.load_dygraph(path)
model.set_dict(model_dict)
print("loaded model from {}.pdparams".format(path))
if optim_dict:
optim.set_dict(optim_dict)
print("loaded optimizer from {}.pdparams".format(path))
def load_wavenet(model, path):
wavenet_dict, _ = dg.load_dygraph(path)
encoder_dict = OrderedDict()
teacher_dict = OrderedDict()
for k, v in wavenet_dict.items():
if k.startswith("encoder."):
encoder_dict[k.split('.', 1)[1]] = v
else:
# k starts with "decoder."
teacher_dict[k.split('.', 1)[1]] = v
model.encoder.set_dict(encoder_dict)
model.teacher.set_dict(teacher_dict)
print("loaded the encoder part and teacher part from wavenet model.")

View File

@ -1,8 +1,8 @@
# Deepvoice 3
# Deep Voice 3
Paddle implementation of deepvoice 3 in dynamic graph, a convolutional network based text-to-speech synthesis model. The implementation is based on [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654).
PaddlePaddle dynamic graph implementation of Deep Voice 3, a convolutional network based text-to-speech generative model. The implementation is based on [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654).
We implement Deepvoice 3 in paddle fluid with dynamic graph, which is convenient for flexible network architectures.
We implement Deep Voice 3 using Paddle Fluid with dynamic graph, which is convenient for building flexible network architectures.
## Dataset
@ -15,15 +15,15 @@ tar xjvf LJSpeech-1.1.tar.bz2
## Model Architecture
![DeepVoice3 model architecture](./images/model_architecture.png)
![Deep Voice 3 model architecture](./images/model_architecture.png)
The model consists of an encoder, a decoder and a converter (and a speaker embedding for multispeaker models). The encoder, together with the decoder forms the seq2seq part of the model, and the converter forms the postnet part.
The model consists of an encoder, a decoder and a converter (and a speaker embedding for multispeaker models). The encoder and the decoder together form the seq2seq part of the model, and the converter forms the postnet part.
## Project Structure
```text
├── data.py data_processing
├── ljspeech.yaml (example) configuration file
├── data.py data_processing
├── configs/ (example) configuration files
├── sentences.txt sample sentences
├── synthesis.py script to synthesize waveform from text
├── train.py script to train a model
@ -37,7 +37,7 @@ Train the model using train.py, follow the usage displayed by `python train.py -
```text
usage: train.py [-h] [-c CONFIG] [-s DATA] [-r RESUME] [-o OUTPUT] [-g DEVICE]
Train a deepvoice 3 model with LJSpeech dataset.
Train a Deep Voice 3 model with LJSpeech dataset.
optional arguments:
-h, --help show this help message and exit
@ -50,18 +50,18 @@ optional arguments:
The directory to save result.
-g DEVICE, --device DEVICE
device to use
```
1. `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly. You can also change some values in the configuration file and train the model with a different config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
3. `--resume` is the path of the checkpoint. If it is provided, the model would load the checkpoint before training.
4. `--output` is the directory to save results, all result are saved in this directory. The structure of the output directory is shown below.
4. `--output` is the directory to save results, all results are saved in this directory. The structure of the output directory is shown below.
```text
├── checkpoints # checkpoint
├── log # tensorboard log
└── states # train and evaluation results
├── alignments # attention
├── lin_spec # linear spectrogram
├── mel_spec # mel spectrogram
└── waveform # waveform (.wav files)
@ -69,10 +69,10 @@ optional arguments:
5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
example script:
Example script:
```bash
python train.py --config=./ljspeech.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0
python train.py --config=configs/ljspeech.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0
```
You can monitor training log via tensorboard, using the script below.
@ -86,7 +86,7 @@ tensorboard --logdir=.
```text
usage: synthesis.py [-h] [-c CONFIG] [-g DEVICE] checkpoint text output_path
Synthsize waveform with a checkpoint.
Synthesize waveform from a checkpoint.
positional arguments:
checkpoint checkpoint to load.
@ -107,9 +107,8 @@ optional arguments:
4. `output_path` is the directory to save results. The output path contains the generated audio files (`*.wav`) and attention plots (`*.png`) for each sentence.
5. `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
example script:
Example script:
```bash
python synthesis.py --config=./ljspeech.yaml --device=0 experiment/checkpoints/model_step_005000000 sentences.txt generated
python synthesis.py --config=configs/ljspeech.yaml --device=0 experiment/checkpoints/model_step_005000000 sentences.txt generated
```

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import csv
from pathlib import Path
@ -79,10 +93,11 @@ class Transform(object):
y = signal.lfilter([1., -self.preemphasis], [1.], wav)
# STFT
D = librosa.stft(y=y,
n_fft=self.n_fft,
win_length=self.win_length,
hop_length=self.hop_length)
D = librosa.stft(
y=y,
n_fft=self.n_fft,
win_length=self.win_length,
hop_length=self.hop_length)
S = np.abs(D)
# to db and normalize to 0-1
@ -96,11 +111,8 @@ class Transform(object):
# mel scale and to db and normalize to 0-1,
# CAUTION: pass linear scale S, not dbscaled S
S_mel = librosa.feature.melspectrogram(S=S,
n_mels=self.n_mels,
fmin=self.fmin,
fmax=self.fmax,
power=1.)
S_mel = librosa.feature.melspectrogram(
S=S, n_mels=self.n_mels, fmin=self.fmin, fmax=self.fmax, power=1.)
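# amplitude -> dB (20 * log10), shifted by ref_level_db, then normalized to [0, 1] below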
S_mel = 20 * np.log10(np.maximum(amplitude_min,
S_mel)) - self.ref_level_db
S_mel_norm = (S_mel - self.min_level_db) / (-self.min_level_db)
@ -148,20 +160,18 @@ class DataCollector(object):
(mix_grapheme_phonemes, text_length, speaker_id, S_norm,
S_mel_norm, num_frames) = example
text_sequences.append(
np.pad(mix_grapheme_phonemes,
(0, max_text_length - text_length)))
np.pad(mix_grapheme_phonemes, (0, max_text_length - text_length
)))
lin_specs.append(
np.pad(S_norm,
((0, 0), (self._pad_begin,
max_frames - self._pad_begin - num_frames))))
np.pad(S_norm, ((0, 0), (self._pad_begin, max_frames -
self._pad_begin - num_frames))))
mel_specs.append(
np.pad(S_mel_norm,
((0, 0), (self._pad_begin,
max_frames - self._pad_begin - num_frames))))
np.pad(S_mel_norm, ((0, 0), (self._pad_begin, max_frames -
self._pad_begin - num_frames))))
done_flags.append(
np.pad(np.zeros((int(np.ceil(num_frames // self._factor)), )),
(0, max_decoder_length -
int(np.ceil(num_frames // self._factor))),
(0, max_decoder_length - int(
np.ceil(num_frames // self._factor))),
constant_values=1))
text_sequences = np.array(text_sequences).astype(np.int64)
lin_specs = np.transpose(np.array(lin_specs),

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import ruamel.yaml
@ -22,11 +36,8 @@ if __name__ == "__main__":
parser.add_argument("checkpoint", type=str, help="checkpoint to load.")
parser.add_argument("text", type=str, help="text file to synthesize")
parser.add_argument("output_path", type=str, help="path to save results")
parser.add_argument("-g",
"--device",
type=int,
default=-1,
help="device to use")
parser.add_argument(
"-g", "--device", type=int, default=-1, help="device to use")
args = parser.parse_args()
with open(args.config, 'rt') as f:
@ -76,15 +87,14 @@ if __name__ == "__main__":
window_ahead = model_config["window_ahead"]
key_projection = model_config["key_projection"]
value_projection = model_config["value_projection"]
dv3 = make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
padding_idx, embedding_std, max_positions, n_vocab,
freeze_embedding, filter_size, encoder_channels,
n_mels, decoder_channels, r,
trainable_positional_encodings, use_memory_mask,
query_position_rate, key_position_rate,
window_backward, window_ahead, key_projection,
value_projection, downsample_factor, linear_dim,
use_decoder_states, converter_channels, dropout)
dv3 = make_model(
n_speakers, speaker_dim, speaker_embed_std, embed_dim, padding_idx,
embedding_std, max_positions, n_vocab, freeze_embedding,
filter_size, encoder_channels, n_mels, decoder_channels, r,
trainable_positional_encodings, use_memory_mask,
query_position_rate, key_position_rate, window_backward,
window_ahead, key_projection, value_projection, downsample_factor,
linear_dim, use_decoder_states, converter_channels, dropout)
summary(dv3)
state, _ = dg.load_dygraph(args.checkpoint)

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import ruamel.yaml

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import numpy as np
from matplotlib import cm
@ -28,8 +42,9 @@ def make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
converter_channels, dropout):
"""just a simple function to create a deepvoice 3 model"""
if n_speakers > 1:
spe = dg.Embedding((n_speakers, speaker_dim),
param_attr=I.Normal(scale=speaker_embed_std))
spe = dg.Embedding(
(n_speakers, speaker_dim),
param_attr=I.Normal(scale=speaker_embed_std))
else:
spe = None
@ -45,17 +60,17 @@ def make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
)
enc = Encoder(n_vocab,
embed_dim,
n_speakers,
speaker_dim,
padding_idx=None,
embedding_weight_std=embedding_std,
convolutions=encoder_convolutions,
max_positions=max_positions,
dropout=dropout)
ConvSpec(h, k, 3), )
enc = Encoder(
n_vocab,
embed_dim,
n_speakers,
speaker_dim,
padding_idx=None,
embedding_weight_std=embedding_std,
convolutions=encoder_convolutions,
max_positions=max_positions,
dropout=dropout)
if freeze_embedding:
freeze(enc.embed)
@ -66,28 +81,28 @@ def make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
ConvSpec(h, k, 3),
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1),
)
ConvSpec(h, k, 1), )
attention = [True, False, False, False, True]
force_monotonic_attention = [True, False, False, False, True]
dec = Decoder(n_speakers,
speaker_dim,
embed_dim,
mel_dim,
r=r,
max_positions=max_positions,
padding_idx=padding_idx,
preattention=prenet_convolutions,
convolutions=attentive_convolutions,
attention=attention,
dropout=dropout,
use_memory_mask=use_memory_mask,
force_monotonic_attention=force_monotonic_attention,
query_position_rate=query_position_rate,
key_position_rate=key_position_rate,
window_range=WindowRange(window_behind, window_ahead),
key_projection=key_projection,
value_projection=value_projection)
dec = Decoder(
n_speakers,
speaker_dim,
embed_dim,
mel_dim,
r=r,
max_positions=max_positions,
padding_idx=padding_idx,
preattention=prenet_convolutions,
convolutions=attentive_convolutions,
attention=attention,
dropout=dropout,
use_memory_mask=use_memory_mask,
force_monotonic_attention=force_monotonic_attention,
query_position_rate=query_position_rate,
key_position_rate=key_position_rate,
window_range=WindowRange(window_behind, window_ahead),
key_projection=key_projection,
value_projection=value_projection)
if not trainable_positional_encodings:
freeze(dec.embed_keys_positions)
freeze(dec.embed_query_positions)
@ -97,15 +112,15 @@ def make_model(n_speakers, speaker_dim, speaker_embed_std, embed_dim,
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(2 * h, k, 1),
ConvSpec(2 * h, k, 3),
)
cvt = Converter(n_speakers,
speaker_dim,
dec.state_dim if use_decoder_states else mel_dim,
linear_dim,
time_upsampling=downsample_factor,
convolutions=postnet_convolutions,
dropout=dropout)
ConvSpec(2 * h, k, 3), )
cvt = Converter(
n_speakers,
speaker_dim,
dec.state_dim if use_decoder_states else mel_dim,
linear_dim,
time_upsampling=downsample_factor,
convolutions=postnet_convolutions,
dropout=dropout)
dv3 = DeepVoice3(enc, dec, cvt, spe, use_decoder_states)
return dv3
@ -115,8 +130,10 @@ def eval_model(model, text, replace_pronounciation_prob, min_level_db,
ref_level_db, power, n_iter, win_length, hop_length,
preemphasis):
"""generate waveform from text using a deepvoice 3 model"""
text = np.array(en.text_to_sequence(text, p=replace_pronounciation_prob),
dtype=np.int64)
text = np.array(
en.text_to_sequence(
text, p=replace_pronounciation_prob),
dtype=np.int64)
length = len(text)
print("text sequence's length: {}".format(length))
text_positions = np.arange(1, 1 + length)
@ -145,10 +162,11 @@ def spec_to_waveform(spec, min_level_db, ref_level_db, power, n_iter,
"""
# undo the [0, 1] normalization back to dB, then convert dB to linear amplitude
denormalized = np.clip(spec, 0, 1) * (-min_level_db) + min_level_db
lin_scaled = np.exp((denormalized + ref_level_db) / 20 * np.log(10))
wav = librosa.griffinlim(lin_scaled**power,
n_iter=n_iter,
hop_length=hop_length,
win_length=win_length)
wav = librosa.griffinlim(
lin_scaled**power,
n_iter=n_iter,
hop_length=hop_length,
win_length=win_length)
if preemphasis > 0:
wav = signal.lfilter([1.], [1., -preemphasis], wav)
return wav
@ -225,28 +243,30 @@ def save_state(save_dir,
plt.colorbar()
plt.title("mel_input")
plt.savefig(
os.path.join(path,
"target_mel_spec_step{:09d}.png".format(global_step)))
os.path.join(path, "target_mel_spec_step{:09d}.png".format(
global_step)))
plt.close()
writer.add_image("target/mel_spec",
cm.viridis(mel_input),
global_step,
dataformats="HWC")
writer.add_image(
"target/mel_spec",
cm.viridis(mel_input),
global_step,
dataformats="HWC")
plt.figure(figsize=(10, 3))
display.specshow(mel_output)
plt.colorbar()
plt.title("mel_output")
plt.savefig(
os.path.join(
path, "predicted_mel_spec_step{:09d}.png".format(global_step)))
os.path.join(path, "predicted_mel_spec_step{:09d}.png".format(
global_step)))
plt.close()
writer.add_image("predicted/mel_spec",
cm.viridis(mel_output),
global_step,
dataformats="HWC")
writer.add_image(
"predicted/mel_spec",
cm.viridis(mel_output),
global_step,
dataformats="HWC")
if lin_input is not None and lin_output is not None:
lin_input = lin_input[0].numpy().T
@ -258,28 +278,30 @@ def save_state(save_dir,
plt.colorbar()
plt.title("mel_input")
plt.savefig(
os.path.join(path,
"target_lin_spec_step{:09d}.png".format(global_step)))
os.path.join(path, "target_lin_spec_step{:09d}.png".format(
global_step)))
plt.close()
writer.add_image("target/lin_spec",
cm.viridis(lin_input),
global_step,
dataformats="HWC")
writer.add_image(
"target/lin_spec",
cm.viridis(lin_input),
global_step,
dataformats="HWC")
plt.figure(figsize=(10, 3))
display.specshow(lin_output)
plt.colorbar()
plt.title("mel_input")
plt.savefig(
os.path.join(
path, "predicted_lin_spec_step{:09d}.png".format(global_step)))
os.path.join(path, "predicted_lin_spec_step{:09d}.png".format(
global_step)))
plt.close()
writer.add_image("predicted/lin_spec",
cm.viridis(lin_output),
global_step,
dataformats="HWC")
writer.add_image(
"predicted/lin_spec",
cm.viridis(lin_output),
global_step,
dataformats="HWC")
if alignments is not None and len(alignments.shape) == 4:
path = os.path.join(save_dir, "alignments")
@ -290,10 +312,11 @@ def save_state(save_dir,
"train_attn_layer_{}_step_{}.png".format(idx, global_step))
plot_alignment(attn_layer, save_path)
writer.add_image("train_attn/layer_{}".format(idx),
cm.viridis(attn_layer),
global_step,
dataformats="HWC")
writer.add_image(
"train_attn/layer_{}".format(idx),
cm.viridis(attn_layer),
global_step,
dataformats="HWC")
if lin_output is not None:
wav = spec_to_waveform(lin_output, min_level_db, ref_level_db, power,
@ -302,7 +325,5 @@ def save_state(save_dir,
save_path = os.path.join(
path, "train_sample_step_{:09d}.wav".format(global_step))
sf.write(save_path, wav, sample_rate)
writer.add_audio("train_sample",
wav,
global_step,
sample_rate=sample_rate)
writer.add_audio(
"train_sample", wav, global_step, sample_rate=sample_rate)

View File

@ -57,7 +57,7 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--fastspeech_step``
For more help on arguments:
``python train.py --help``.
## Synthesis
@ -75,5 +75,5 @@ or you can run the script file directly.
sh synthesis.sh
```
For more help on arguments:
``python synthesis.py --help``.

View File

@ -1,39 +1,96 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
def add_config_options_to_parser(parser):
parser.add_argument('--config_path', type=str, default='config/fastspeech.yaml',
parser.add_argument(
'--config_path',
type=str,
default='config/fastspeech.yaml',
help="the yaml config file path.")
parser.add_argument('--batch_size', type=int, default=32,
help="batch size for training.")
parser.add_argument('--epochs', type=int, default=10000,
parser.add_argument(
'--batch_size', type=int, default=32, help="batch size for training.")
parser.add_argument(
'--epochs',
type=int,
default=10000,
help="the number of epoch for training.")
parser.add_argument('--lr', type=float, default=0.001,
parser.add_argument(
'--lr',
type=float,
default=0.001,
help="the learning rate for training.")
parser.add_argument('--save_step', type=int, default=500,
parser.add_argument(
'--save_step',
type=int,
default=500,
help="checkpointing interval during training.")
parser.add_argument('--fastspeech_step', type=int, default=70000,
parser.add_argument(
'--fastspeech_step',
type=int,
default=70000,
help="Global step to restore checkpoint of fastspeech.")
parser.add_argument('--use_gpu', type=int, default=1,
parser.add_argument(
'--use_gpu',
type=int,
default=1,
help="use gpu or not during training.")
parser.add_argument('--use_data_parallel', type=int, default=0,
parser.add_argument(
'--use_data_parallel',
type=int,
default=0,
help="use data parallel or not during training.")
parser.add_argument('--alpha', type=float, default=1.0,
parser.add_argument(
'--alpha',
type=float,
default=1.0,
help="The hyperparameter to determine the length of the expanded sequence \
mel, thereby controlling the voice speed.")
parser.add_argument('--data_path', type=str, default='./dataset/LJSpeech-1.1',
parser.add_argument(
'--data_path',
type=str,
default='./dataset/LJSpeech-1.1',
help="the path of dataset.")
parser.add_argument('--checkpoint_path', type=str, default=None,
parser.add_argument(
'--checkpoint_path',
type=str,
default=None,
help="the path to load checkpoint or pretrain model.")
parser.add_argument('--save_path', type=str, default='./checkpoint',
parser.add_argument(
'--save_path',
type=str,
default='./checkpoint',
help="the path to save checkpoint.")
parser.add_argument('--log_dir', type=str, default='./log',
parser.add_argument(
'--log_dir',
type=str,
default='./log',
help="the directory to save tensorboard log.")
parser.add_argument('--sample_path', type=str, default='./sample',
parser.add_argument(
'--sample_path',
type=str,
default='./sample',
help="the directory to save audio sample in synthesis.")
parser.add_argument('--transtts_path', type=str, default='./log',
parser.add_argument(
'--transtts_path',
type=str,
default='./log',
help="the directory to load pretrain transformerTTS model.")
parser.add_argument('--transformer_step', type=int, default=160000,
parser.add_argument(
'--transformer_step',
type=int,
default=160000,
help="the step to load transformerTTS model.")

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from tensorboardX import SummaryWriter
from collections import OrderedDict
@ -13,6 +26,7 @@ from parakeet import audio
from parakeet.models.fastspeech.fastspeech import FastSpeech
from parakeet.models.transformer_tts.utils import *
def load_checkpoint(step, model_path):
model_dict, _ = fluid.dygraph.load_dygraph(os.path.join(model_path, step))
new_state_dict = OrderedDict()
@ -23,13 +37,14 @@ def load_checkpoint(step, model_path):
new_state_dict[param] = model_dict[param]
return new_state_dict
def synthesis(text_input, args):
place = (fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace())
# tensorboard
if not os.path.exists(args.log_dir):
os.mkdir(args.log_dir)
path = os.path.join(args.log_dir,'synthesis')
path = os.path.join(args.log_dir, 'synthesis')
with open(args.config_path) as f:
cfg = yaml.load(f, Loader=yaml.Loader)
@ -38,35 +53,42 @@ def synthesis(text_input, args):
with dg.guard(place):
model = FastSpeech(cfg)
model.set_dict(load_checkpoint(str(args.fastspeech_step), os.path.join(args.checkpoint_path, "fastspeech")))
model.set_dict(
load_checkpoint(
str(args.fastspeech_step),
os.path.join(args.checkpoint_path, "fastspeech")))
model.eval()
text = np.asarray(text_to_sequence(text_input))
text = np.expand_dims(text, axis=0)
pos_text = np.arange(1, text.shape[1]+1)
pos_text = np.arange(1, text.shape[1] + 1)
pos_text = np.expand_dims(pos_text, axis=0)
enc_non_pad_mask = get_non_pad_mask(pos_text).astype(np.float32)
enc_slf_attn_mask = get_attn_key_pad_mask(pos_text, text).astype(np.float32)
enc_slf_attn_mask = get_attn_key_pad_mask(pos_text,
text).astype(np.float32)
text = dg.to_variable(text)
pos_text = dg.to_variable(pos_text)
enc_non_pad_mask = dg.to_variable(enc_non_pad_mask)
enc_slf_attn_mask = dg.to_variable(enc_slf_attn_mask)
mel_output, mel_output_postnet = model(text, pos_text, alpha=args.alpha,
enc_non_pad_mask=enc_non_pad_mask,
enc_slf_attn_mask=enc_slf_attn_mask,
dec_non_pad_mask=None,
dec_slf_attn_mask=None)
mel_output, mel_output_postnet = model(
text,
pos_text,
alpha=args.alpha,
enc_non_pad_mask=enc_non_pad_mask,
enc_slf_attn_mask=enc_slf_attn_mask,
dec_non_pad_mask=None,
dec_slf_attn_mask=None)
_ljspeech_processor = audio.AudioProcessor(
sample_rate=cfg['audio']['sr'],
num_mels=cfg['audio']['num_mels'],
min_level_db=cfg['audio']['min_level_db'],
ref_level_db=cfg['audio']['ref_level_db'],
n_fft=cfg['audio']['n_fft'],
win_length= cfg['audio']['win_length'],
hop_length= cfg['audio']['hop_length'],
sample_rate=cfg['audio']['sr'],
num_mels=cfg['audio']['num_mels'],
min_level_db=cfg['audio']['min_level_db'],
ref_level_db=cfg['audio']['ref_level_db'],
n_fft=cfg['audio']['n_fft'],
win_length=cfg['audio']['win_length'],
hop_length=cfg['audio']['hop_length'],
power=cfg['audio']['power'],
preemphasis=cfg['audio']['preemphasis'],
signal_norm=True,
@ -79,14 +101,17 @@ def synthesis(text_input, args):
do_trim_silence=False,
sound_norm=False)
mel_output_postnet = fluid.layers.transpose(fluid.layers.squeeze(mel_output_postnet,[0]), [1,0])
wav = _ljspeech_processor.inv_melspectrogram(mel_output_postnet.numpy())
mel_output_postnet = fluid.layers.transpose(
fluid.layers.squeeze(mel_output_postnet, [0]), [1, 0])
wav = _ljspeech_processor.inv_melspectrogram(mel_output_postnet.numpy(
))
writer.add_audio(text_input, wav, 0, cfg['audio']['sr'])
print("Synthesis completed !!!")
writer.close()
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Train Fastspeech model")
add_config_options_to_parser(parser)
args = parser.parse_args()
synthesis("Transformer model is so fast!", args)
synthesis("Transformer model is so fast!", args)

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import argparse
import os
@ -21,8 +34,10 @@ import sys
sys.path.append("../transformer_tts")
from data import LJSpeechLoader
def load_checkpoint(step, model_path):
model_dict, opti_dict = fluid.dygraph.load_dygraph(os.path.join(model_path, step))
model_dict, opti_dict = fluid.dygraph.load_dygraph(
os.path.join(model_path, step))
new_state_dict = OrderedDict()
for param in model_dict:
if param.startswith('_layers.'):
@ -31,6 +46,7 @@ def load_checkpoint(step, model_path):
new_state_dict[param] = model_dict[param]
return new_state_dict, opti_dict
def main(args):
local_rank = dg.parallel.Env().local_rank if args.use_data_parallel else 0
nranks = dg.parallel.Env().nranks if args.use_data_parallel else 1
@ -44,26 +60,33 @@ def main(args):
if args.use_gpu else fluid.CPUPlace())
if not os.path.exists(args.log_dir):
os.mkdir(args.log_dir)
path = os.path.join(args.log_dir,'fastspeech')
path = os.path.join(args.log_dir, 'fastspeech')
writer = SummaryWriter(path) if local_rank == 0 else None
with dg.guard(place):
with fluid.unique_name.guard():
transformerTTS = TransformerTTS(cfg)
model_dict, _ = load_checkpoint(str(args.transformer_step), os.path.join(args.transtts_path, "transformer"))
model_dict, _ = load_checkpoint(
str(args.transformer_step),
os.path.join(args.transtts_path, "transformer"))
transformerTTS.set_dict(model_dict)
transformerTTS.eval()
model = FastSpeech(cfg)
model.train()
optimizer = fluid.optimizer.AdamOptimizer(learning_rate=dg.NoamDecay(1/(cfg['warm_up_step'] *( args.lr ** 2)), cfg['warm_up_step']),
parameter_list=model.parameters())
reader = LJSpeechLoader(cfg, args, nranks, local_rank, shuffle=True).reader()
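# Noam schedule: linear warm-up over warm_up_step steps, then decay proportional to step^-0.5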
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=dg.NoamDecay(1 / (
cfg['warm_up_step'] * (args.lr**2)), cfg['warm_up_step']),
parameter_list=model.parameters())
reader = LJSpeechLoader(
cfg, args, nranks, local_rank, shuffle=True).reader()
if args.checkpoint_path is not None:
model_dict, opti_dict = load_checkpoint(str(args.fastspeech_step), os.path.join(args.checkpoint_path, "fastspeech"))
model_dict, opti_dict = load_checkpoint(
str(args.fastspeech_step),
os.path.join(args.checkpoint_path, "fastspeech"))
model.set_dict(model_dict)
optimizer.set_dict(opti_dict)
global_step = args.fastspeech_step
@ -77,45 +100,66 @@ def main(args):
pbar = tqdm(reader)
for i, data in enumerate(pbar):
pbar.set_description('Processing at epoch %d'%epoch)
(character, mel, mel_input, pos_text, pos_mel, text_length, mel_lens,
enc_slf_mask, enc_query_mask, dec_slf_mask, enc_dec_mask, dec_query_slf_mask, dec_query_mask) = data
pbar.set_description('Processing at epoch %d' % epoch)
(character, mel, mel_input, pos_text, pos_mel, text_length,
mel_lens, enc_slf_mask, enc_query_mask, dec_slf_mask,
enc_dec_mask, dec_query_slf_mask, dec_query_mask) = data
_, _, attn_probs, _, _, _ = transformerTTS(character, mel_input, pos_text, pos_mel,
dec_slf_mask=dec_slf_mask,
enc_slf_mask=enc_slf_mask, enc_query_mask=enc_query_mask,
enc_dec_mask=enc_dec_mask, dec_query_slf_mask=dec_query_slf_mask,
dec_query_mask=dec_query_mask)
alignment, max_attn = get_alignment(attn_probs, mel_lens, cfg['transformer_head'])
_, _, attn_probs, _, _, _ = transformerTTS(
character,
mel_input,
pos_text,
pos_mel,
dec_slf_mask=dec_slf_mask,
enc_slf_mask=enc_slf_mask,
enc_query_mask=enc_query_mask,
enc_dec_mask=enc_dec_mask,
dec_query_slf_mask=dec_query_slf_mask,
dec_query_mask=dec_query_mask)
alignment, max_attn = get_alignment(attn_probs, mel_lens,
cfg['transformer_head'])
alignment = dg.to_variable(alignment).astype(np.float32)
if local_rank==0 and global_step % 5 == 1:
x = np.uint8(cm.viridis(max_attn[8,:mel_lens.numpy()[8]]) * 255)
writer.add_image('Attention_%d_0'%global_step, x, 0, dataformats="HWC")
if local_rank == 0 and global_step % 5 == 1:
x = np.uint8(
cm.viridis(max_attn[8, :mel_lens.numpy()[8]]) * 255)
writer.add_image(
'Attention_%d_0' % global_step,
x,
0,
dataformats="HWC")
global_step += 1
# Forward
result= model(character,
pos_text,
mel_pos=pos_mel,
length_target=alignment,
enc_non_pad_mask=enc_query_mask,
enc_slf_attn_mask=enc_slf_mask,
dec_non_pad_mask=dec_query_slf_mask,
dec_slf_attn_mask=dec_slf_mask)
result = model(
character,
pos_text,
mel_pos=pos_mel,
length_target=alignment,
enc_non_pad_mask=enc_query_mask,
enc_slf_attn_mask=enc_slf_mask,
dec_non_pad_mask=dec_query_slf_mask,
dec_slf_attn_mask=dec_slf_mask)
mel_output, mel_output_postnet, duration_predictor_output, _, _ = result
mel_loss = layers.mse_loss(mel_output, mel)
mel_postnet_loss = layers.mse_loss(mel_output_postnet, mel)
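# duration loss: mean absolute error between predicted durations and the teacher alignment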
duration_loss = layers.mean(layers.abs(layers.elementwise_sub(duration_predictor_output, alignment)))
duration_loss = layers.mean(
layers.abs(
layers.elementwise_sub(duration_predictor_output,
alignment)))
total_loss = mel_loss + mel_postnet_loss + duration_loss
if local_rank==0:
writer.add_scalar('mel_loss', mel_loss.numpy(), global_step)
writer.add_scalar('post_mel_loss', mel_postnet_loss.numpy(), global_step)
writer.add_scalar('duration_loss', duration_loss.numpy(), global_step)
writer.add_scalar('learning_rate', optimizer._learning_rate.step().numpy(), global_step)
if local_rank == 0:
writer.add_scalar('mel_loss',
mel_loss.numpy(), global_step)
writer.add_scalar('post_mel_loss',
mel_postnet_loss.numpy(), global_step)
writer.add_scalar('duration_loss',
duration_loss.numpy(), global_step)
writer.add_scalar('learning_rate',
optimizer._learning_rate.step().numpy(),
global_step)
if args.use_data_parallel:
total_loss = model.scale_loss(total_loss)
@ -123,21 +167,25 @@ def main(args):
model.apply_collective_grads()
else:
total_loss.backward()
optimizer.minimize(total_loss, grad_clip = fluid.dygraph_grad_clip.GradClipByGlobalNorm(cfg['grad_clip_thresh']))
optimizer.minimize(
total_loss,
grad_clip=fluid.dygraph_grad_clip.GradClipByGlobalNorm(cfg[
'grad_clip_thresh']))
model.clear_gradients()
# save checkpoint
if local_rank==0 and global_step % args.save_step == 0:
# save checkpoint
if local_rank == 0 and global_step % args.save_step == 0:
if not os.path.exists(args.save_path):
os.mkdir(args.save_path)
save_path = os.path.join(args.save_path,'fastspeech/%d' % global_step)
save_path = os.path.join(args.save_path,
'fastspeech/%d' % global_step)
dg.save_dygraph(model.state_dict(), save_path)
dg.save_dygraph(optimizer.state_dict(), save_path)
if local_rank==0:
if local_rank == 0:
writer.close()
if __name__ =='__main__':
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Train Fastspeech model")
add_config_options_to_parser(parser)
args = parser.parse_args()

View File

@ -50,7 +50,7 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--transformer_step``.
For more help on arguments:
For more help on arguments:
``python train_transformer.py --help``.
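For example, resuming transformer training might look like the following sketch; the step value and paths are illustrative and should match your own checkpoints:
```bash
python train_transformer.py \
    --use_gpu=1 \
    --data_path=./dataset/LJSpeech-1.1 \
    --checkpoint_path=./checkpoint \
    --transformer_step=160000
```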
## Train Vocoder
@ -78,7 +78,7 @@ python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog tr
```
If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--vocoder_step``.
For more help on arguments:
For more help on arguments:
``python train_vocoder.py --help``.
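For example, resuming vocoder training might look like this sketch (step value and paths are illustrative):
```bash
python train_vocoder.py \
    --use_gpu=1 \
    --data_path=./dataset/LJSpeech-1.1 \
    --checkpoint_path=./checkpoint \
    --vocoder_step=90000
```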
## Synthesis
@ -101,5 +101,5 @@ sh synthesis.sh
The audio file will be saved in ``--sample_path``.
For more help on arguments:
For more help on arguments:
``python synthesis.py --help``.
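As a sketch, a direct invocation could look like this, assuming trained transformer and vocoder checkpoints exist under ``--checkpoint_path`` (step values are illustrative):
```bash
python synthesis.py \
    --use_gpu=1 \
    --checkpoint_path=./checkpoint \
    --transformer_step=160000 \
    --vocoder_step=90000 \
    --max_len=400 \
    --sample_path=./sample
```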

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
import numpy as np
import pandas as pd
@ -13,8 +26,15 @@ from parakeet.data.batch import TextIDBatcher, SpecBatcher
from parakeet.data.dataset import DatasetMixin, TransformDataset, CacheDataset
from parakeet.models.transformer_tts.utils import *
class LJSpeechLoader:
def __init__(self, config, args, nranks, rank, is_vocoder=False, shuffle=True):
def __init__(self,
config,
args,
nranks,
rank,
is_vocoder=False,
shuffle=True):
place = fluid.CUDAPlace(rank) if args.use_gpu else fluid.CPUPlace()
LJSPEECH_ROOT = Path(args.data_path)
@ -23,15 +43,28 @@ class LJSpeechLoader:
dataset = TransformDataset(metadata, transformer)
dataset = CacheDataset(dataset)
sampler = DistributedSampler(len(metadata), nranks, rank, shuffle=shuffle)
sampler = DistributedSampler(
len(metadata), nranks, rank, shuffle=shuffle)
assert args.batch_size % nranks == 0
each_bs = args.batch_size // nranks
if is_vocoder:
dataloader = DataCargo(dataset, sampler=sampler, batch_size=each_bs, shuffle=shuffle, batch_fn=batch_examples_vocoder, drop_last=True)
dataloader = DataCargo(
dataset,
sampler=sampler,
batch_size=each_bs,
shuffle=shuffle,
batch_fn=batch_examples_vocoder,
drop_last=True)
else:
dataloader = DataCargo(dataset, sampler=sampler, batch_size=each_bs, shuffle=shuffle, batch_fn=batch_examples, drop_last=True)
dataloader = DataCargo(
dataset,
sampler=sampler,
batch_size=each_bs,
shuffle=shuffle,
batch_fn=batch_examples,
drop_last=True)
self.reader = fluid.io.DataLoader.from_generator(
capacity=32,
iterable=True,
@ -66,13 +99,13 @@ class LJSpeech(object):
super(LJSpeech, self).__init__()
self.config = config
self._ljspeech_processor = audio.AudioProcessor(
sample_rate=config['audio']['sr'],
num_mels=config['audio']['num_mels'],
min_level_db=config['audio']['min_level_db'],
ref_level_db=config['audio']['ref_level_db'],
n_fft=config['audio']['n_fft'],
win_length= config['audio']['win_length'],
hop_length= config['audio']['hop_length'],
sample_rate=config['audio']['sr'],
num_mels=config['audio']['num_mels'],
min_level_db=config['audio']['min_level_db'],
ref_level_db=config['audio']['ref_level_db'],
n_fft=config['audio']['n_fft'],
win_length=config['audio']['win_length'],
hop_length=config['audio']['hop_length'],
power=config['audio']['power'],
preemphasis=config['audio']['preemphasis'],
signal_norm=True,
@ -84,7 +117,7 @@ class LJSpeech(object):
griffin_lim_iters=60,
do_trim_silence=False,
sound_norm=False)
def __call__(self, metadatum):
"""All the code for generating an Example from a metadatum. If you want a
different preprocessing pipeline, you can override this method.
@ -93,13 +126,15 @@ class LJSpeech(object):
method.
"""
fname, raw_text, normalized_text = metadatum
# load -> trim -> preemphasis -> stft -> magnitude -> mel_scale -> logscale -> normalize
wav = self._ljspeech_processor.load_wav(str(fname))
mag = self._ljspeech_processor.spectrogram(wav).astype(np.float32)
mel = self._ljspeech_processor.melspectrogram(wav).astype(np.float32)
phonemes = np.array(g2p.en.text_to_sequence(normalized_text), dtype=np.int64)
return (mag, mel, phonemes) # maybe we need to implement it as a map in the future
phonemes = np.array(
g2p.en.text_to_sequence(normalized_text), dtype=np.int64)
return (mag, mel, phonemes
) # maybe we need to implement it as a map in the future
def batch_examples(batch):
@ -112,52 +147,81 @@ def batch_examples(batch):
pos_mels = []
for data in batch:
_, mel, text = data
mel_inputs.append(np.concatenate([np.zeros([mel.shape[0], 1], np.float32), mel[:,:-1]], axis=-1))
mel_inputs.append(
np.concatenate(
[np.zeros([mel.shape[0], 1], np.float32), mel[:, :-1]],
axis=-1))
mel_lens.append(mel.shape[1])
text_lens.append(len(text))
pos_texts.append(np.arange(1, len(text) + 1))
pos_mels.append(np.arange(1, mel.shape[1] + 1))
mels.append(mel)
texts.append(text)
# Sort by text_len in descending order
texts = [i for i,_ in sorted(zip(texts, text_lens), key=lambda x: x[1], reverse=True)]
mels = [i for i,_ in sorted(zip(mels, text_lens), key=lambda x: x[1], reverse=True)]
mel_inputs = [i for i,_ in sorted(zip(mel_inputs, text_lens), key=lambda x: x[1], reverse=True)]
mel_lens = [i for i,_ in sorted(zip(mel_lens, text_lens), key=lambda x: x[1], reverse=True)]
pos_texts = [i for i,_ in sorted(zip(pos_texts, text_lens), key=lambda x: x[1], reverse=True)]
pos_mels = [i for i,_ in sorted(zip(pos_mels, text_lens), key=lambda x: x[1], reverse=True)]
texts = [
i
for i, _ in sorted(
zip(texts, text_lens), key=lambda x: x[1], reverse=True)
]
mels = [
i
for i, _ in sorted(
zip(mels, text_lens), key=lambda x: x[1], reverse=True)
]
mel_inputs = [
i
for i, _ in sorted(
zip(mel_inputs, text_lens), key=lambda x: x[1], reverse=True)
]
mel_lens = [
i
for i, _ in sorted(
zip(mel_lens, text_lens), key=lambda x: x[1], reverse=True)
]
pos_texts = [
i
for i, _ in sorted(
zip(pos_texts, text_lens), key=lambda x: x[1], reverse=True)
]
pos_mels = [
i
for i, _ in sorted(
zip(pos_mels, text_lens), key=lambda x: x[1], reverse=True)
]
text_lens = sorted(text_lens, reverse=True)
# Pad sequence with largest len of the batch
texts = TextIDBatcher(pad_id=0)(texts) #(B, T)
pos_texts = TextIDBatcher(pad_id=0)(pos_texts) #(B,T)
pos_mels = TextIDBatcher(pad_id=0)(pos_mels) #(B,T)
mels = np.transpose(SpecBatcher(pad_value=0.)(mels), axes=(0,2,1)) #(B,T,num_mels)
mel_inputs = np.transpose(SpecBatcher(pad_value=0.)(mel_inputs), axes=(0,2,1))#(B,T,num_mels)
texts = TextIDBatcher(pad_id=0)(texts) #(B, T)
pos_texts = TextIDBatcher(pad_id=0)(pos_texts) #(B,T)
pos_mels = TextIDBatcher(pad_id=0)(pos_mels) #(B,T)
mels = np.transpose(
SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1)) #(B,T,num_mels)
mel_inputs = np.transpose(
SpecBatcher(pad_value=0.)(mel_inputs), axes=(0, 2, 1)) #(B,T,num_mels)
enc_slf_mask = get_attn_key_pad_mask(pos_texts, texts).astype(np.float32)
enc_query_mask = get_non_pad_mask(pos_texts).astype(np.float32)
dec_slf_mask = get_dec_attn_key_pad_mask(pos_mels,mel_inputs).astype(np.float32)
enc_dec_mask = get_attn_key_pad_mask(enc_query_mask[:,:,0], mel_inputs).astype(np.float32)
dec_slf_mask = get_dec_attn_key_pad_mask(pos_mels,
mel_inputs).astype(np.float32)
enc_dec_mask = get_attn_key_pad_mask(enc_query_mask[:, :, 0],
mel_inputs).astype(np.float32)
dec_query_slf_mask = get_non_pad_mask(pos_mels).astype(np.float32)
dec_query_mask = get_non_pad_mask(pos_mels).astype(np.float32)
return (texts, mels, mel_inputs, pos_texts, pos_mels, np.array(text_lens), np.array(mel_lens),
enc_slf_mask, enc_query_mask, dec_slf_mask, enc_dec_mask, dec_query_slf_mask, dec_query_mask)
return (texts, mels, mel_inputs, pos_texts, pos_mels, np.array(text_lens),
np.array(mel_lens), enc_slf_mask, enc_query_mask, dec_slf_mask,
enc_dec_mask, dec_query_slf_mask, dec_query_mask)
def batch_examples_vocoder(batch):
mels=[]
mags=[]
mels = []
mags = []
for data in batch:
mag, mel, _ = data
mels.append(mel)
mags.append(mag)
mels = np.transpose(SpecBatcher(pad_value=0.)(mels), axes=(0,2,1))
mags = np.transpose(SpecBatcher(pad_value=0.)(mags), axes=(0,2,1))
mels = np.transpose(SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1))
mags = np.transpose(SpecBatcher(pad_value=0.)(mags), axes=(0, 2, 1))
return (mels, mags)

View File

@ -1,38 +1,100 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
def add_config_options_to_parser(parser):
parser.add_argument('--config_path', type=str, default='config/train_transformer.yaml',
parser.add_argument(
'--config_path',
type=str,
default='config/train_transformer.yaml',
help="the yaml config file path.")
parser.add_argument('--batch_size', type=int, default=32,
help="batch size for training.")
parser.add_argument('--epochs', type=int, default=10000,
parser.add_argument(
'--batch_size', type=int, default=32, help="batch size for training.")
parser.add_argument(
'--epochs',
type=int,
default=10000,
help="the number of epoch for training.")
parser.add_argument('--lr', type=float, default=0.001,
parser.add_argument(
'--lr',
type=float,
default=0.001,
help="the learning rate for training.")
parser.add_argument('--save_step', type=int, default=500,
parser.add_argument(
'--save_step',
type=int,
default=500,
help="checkpointing interval during training.")
parser.add_argument('--image_step', type=int, default=2000,
parser.add_argument(
'--image_step',
type=int,
default=2000,
help="attention image interval during training.")
parser.add_argument('--max_len', type=int, default=400,
parser.add_argument(
'--max_len',
type=int,
default=400,
help="The max length of audio when synthsis.")
parser.add_argument('--transformer_step', type=int, default=160000,
parser.add_argument(
'--transformer_step',
type=int,
default=160000,
help="Global step to restore checkpoint of transformer.")
parser.add_argument('--vocoder_step', type=int, default=90000,
parser.add_argument(
'--vocoder_step',
type=int,
default=90000,
help="Global step to restore checkpoint of postnet.")
parser.add_argument('--use_gpu', type=int, default=1,
parser.add_argument(
'--use_gpu',
type=int,
default=1,
help="use gpu or not during training.")
parser.add_argument('--use_data_parallel', type=int, default=0,
parser.add_argument(
'--use_data_parallel',
type=int,
default=0,
help="use data parallel or not during training.")
parser.add_argument('--stop_token', type=int, default=0,
parser.add_argument(
'--stop_token',
type=int,
default=0,
help="use stop token loss in network or not.")
parser.add_argument('--data_path', type=str, default='./dataset/LJSpeech-1.1',
parser.add_argument(
'--data_path',
type=str,
default='./dataset/LJSpeech-1.1',
help="the path of dataset.")
parser.add_argument('--checkpoint_path', type=str, default=None,
parser.add_argument(
'--checkpoint_path',
type=str,
default=None,
help="the path to load checkpoint or pretrain model.")
parser.add_argument('--save_path', type=str, default='./checkpoint',
parser.add_argument(
'--save_path',
type=str,
default='./checkpoint',
help="the path to save checkpoint.")
parser.add_argument('--log_dir', type=str, default='./log',
parser.add_argument(
'--log_dir',
type=str,
default='./log',
help="the directory to save tensorboard log.")
parser.add_argument('--sample_path', type=str, default='./sample',
parser.add_argument(
'--sample_path',
type=str,
default='./sample',
help="the directory to save audio sample in synthesis.")

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from scipy.io.wavfile import write
from parakeet.g2p.en import text_to_sequence
@ -18,6 +31,7 @@ from parakeet import audio
from parakeet.models.transformer_tts.vocoder import Vocoder
from parakeet.models.transformer_tts.transformer_tts import TransformerTTS
def load_checkpoint(step, model_path):
model_dict, _ = fluid.dygraph.load_dygraph(os.path.join(model_path, step))
new_state_dict = OrderedDict()
@ -28,6 +42,7 @@ def load_checkpoint(step, model_path):
new_state_dict[param] = model_dict[param]
return new_state_dict
def synthesis(text_input, args):
place = (fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace())
@ -36,48 +51,57 @@ def synthesis(text_input, args):
# tensorboard
if not os.path.exists(args.log_dir):
os.mkdir(args.log_dir)
path = os.path.join(args.log_dir,'synthesis')
os.mkdir(args.log_dir)
path = os.path.join(args.log_dir, 'synthesis')
writer = SummaryWriter(path)
with dg.guard(place):
with fluid.unique_name.guard():
model = TransformerTTS(cfg)
model.set_dict(load_checkpoint(str(args.transformer_step), os.path.join(args.checkpoint_path, "transformer")))
model.set_dict(
load_checkpoint(
str(args.transformer_step),
os.path.join(args.checkpoint_path, "transformer")))
model.eval()
with fluid.unique_name.guard():
model_vocoder = Vocoder(cfg, args.batch_size)
model_vocoder.set_dict(load_checkpoint(str(args.vocoder_step), os.path.join(args.checkpoint_path, "vocoder")))
model_vocoder.set_dict(
load_checkpoint(
str(args.vocoder_step),
os.path.join(args.checkpoint_path, "vocoder")))
model_vocoder.eval()
# init input
text = np.asarray(text_to_sequence(text_input))
text = fluid.layers.unsqueeze(dg.to_variable(text),[0])
mel_input = dg.to_variable(np.zeros([1,1,80])).astype(np.float32)
pos_text = np.arange(1, text.shape[1]+1)
pos_text = fluid.layers.unsqueeze(dg.to_variable(pos_text),[0])
text = fluid.layers.unsqueeze(dg.to_variable(text), [0])
mel_input = dg.to_variable(np.zeros([1, 1, 80])).astype(np.float32)
pos_text = np.arange(1, text.shape[1] + 1)
pos_text = fluid.layers.unsqueeze(dg.to_variable(pos_text), [0])
pbar = tqdm(range(args.max_len))
for i in pbar:
dec_slf_mask = get_triu_tensor(mel_input.numpy(), mel_input.numpy()).astype(np.float32)
dec_slf_mask = fluid.layers.cast(dg.to_variable(dec_slf_mask == 0), np.float32)
pos_mel = np.arange(1, mel_input.shape[1]+1)
pos_mel = fluid.layers.unsqueeze(dg.to_variable(pos_mel),[0])
mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(text, mel_input, pos_text, pos_mel, dec_slf_mask)
mel_input = fluid.layers.concat([mel_input, postnet_pred[:,-1:,:]], axis=1)
dec_slf_mask = get_triu_tensor(
mel_input.numpy(), mel_input.numpy()).astype(np.float32)
dec_slf_mask = fluid.layers.cast(
dg.to_variable(dec_slf_mask == 0), np.float32)
pos_mel = np.arange(1, mel_input.shape[1] + 1)
pos_mel = fluid.layers.unsqueeze(dg.to_variable(pos_mel), [0])
mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
text, mel_input, pos_text, pos_mel, dec_slf_mask)
mel_input = fluid.layers.concat(
[mel_input, postnet_pred[:, -1:, :]], axis=1)
mag_pred = model_vocoder(postnet_pred)
_ljspeech_processor = audio.AudioProcessor(
sample_rate=cfg['audio']['sr'],
num_mels=cfg['audio']['num_mels'],
min_level_db=cfg['audio']['min_level_db'],
ref_level_db=cfg['audio']['ref_level_db'],
n_fft=cfg['audio']['n_fft'],
win_length= cfg['audio']['win_length'],
hop_length= cfg['audio']['hop_length'],
sample_rate=cfg['audio']['sr'],
num_mels=cfg['audio']['num_mels'],
min_level_db=cfg['audio']['min_level_db'],
ref_level_db=cfg['audio']['ref_level_db'],
n_fft=cfg['audio']['n_fft'],
win_length=cfg['audio']['win_length'],
hop_length=cfg['audio']['hop_length'],
power=cfg['audio']['power'],
preemphasis=cfg['audio']['preemphasis'],
signal_norm=True,
@ -90,30 +114,49 @@ def synthesis(text_input, args):
do_trim_silence=False,
sound_norm=False)
wav = _ljspeech_processor.inv_spectrogram(fluid.layers.transpose(fluid.layers.squeeze(mag_pred,[0]), [1,0]).numpy())
wav = _ljspeech_processor.inv_spectrogram(
fluid.layers.transpose(
fluid.layers.squeeze(mag_pred, [0]), [1, 0]).numpy())
global_step = 0
for i, prob in enumerate(attn_probs):
for j in range(4):
x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
writer.add_image('Attention_%d_0'%global_step, x, i*4+j, dataformats="HWC")
writer.add_image(
'Attention_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
for i, prob in enumerate(attn_enc):
for j in range(4):
x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
writer.add_image('Attention_enc_%d_0'%global_step, x, i*4+j, dataformats="HWC")
writer.add_image(
'Attention_enc_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
for i, prob in enumerate(attn_dec):
for j in range(4):
x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
writer.add_image('Attention_dec_%d_0'%global_step, x, i*4+j, dataformats="HWC")
writer.add_image(
'Attention_dec_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
writer.add_audio(text_input, wav, 0, cfg['audio']['sr'])
if not os.path.exists(args.sample_path):
os.mkdir(args.sample_path)
write(os.path.join(args.sample_path,'test.wav'), cfg['audio']['sr'], wav)
write(
os.path.join(args.sample_path, 'test.wav'), cfg['audio']['sr'],
wav)
writer.close()
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Synthesis model")
add_config_options_to_parser(parser)
args = parser.parse_args()
synthesis("They emphasized the necessity that the information now being furnished be handled with judgment and care.", args)
synthesis(
"They emphasized the necessity that the information now being furnished be handled with judgment and care.",
args)

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from tqdm import tqdm
from tensorboardX import SummaryWriter
@ -16,8 +29,10 @@ from parakeet.models.transformer_tts.utils import cross_entropy
from data import LJSpeechLoader
from parakeet.models.transformer_tts.transformer_tts import TransformerTTS
def load_checkpoint(step, model_path):
model_dict, opti_dict = fluid.dygraph.load_dygraph(os.path.join(model_path, step))
model_dict, opti_dict = fluid.dygraph.load_dygraph(
os.path.join(model_path, step))
new_state_dict = OrderedDict()
for param in model_dict:
if param.startswith('_layers.'):
@ -40,22 +55,27 @@ def main(args):
if args.use_gpu else fluid.CPUPlace())
if not os.path.exists(args.log_dir):
os.mkdir(args.log_dir)
path = os.path.join(args.log_dir,'transformer')
os.mkdir(args.log_dir)
path = os.path.join(args.log_dir, 'transformer')
writer = SummaryWriter(path) if local_rank == 0 else None
with dg.guard(place):
model = TransformerTTS(cfg)
model.train()
optimizer = fluid.optimizer.AdamOptimizer(learning_rate=dg.NoamDecay(1/(cfg['warm_up_step'] *( args.lr ** 2)), cfg['warm_up_step']),
parameter_list=model.parameters())
reader = LJSpeechLoader(cfg, args, nranks, local_rank, shuffle=True).reader()
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=dg.NoamDecay(1 / (
cfg['warm_up_step'] * (args.lr**2)), cfg['warm_up_step']),
parameter_list=model.parameters())
reader = LJSpeechLoader(
cfg, args, nranks, local_rank, shuffle=True).reader()
if args.checkpoint_path is not None:
model_dict, opti_dict = load_checkpoint(str(args.transformer_step), os.path.join(args.checkpoint_path, "transformer"))
model_dict, opti_dict = load_checkpoint(
str(args.transformer_step),
os.path.join(args.checkpoint_path, "transformer"))
model.set_dict(model_dict)
optimizer.set_dict(opti_dict)
global_step = args.transformer_step
@ -64,93 +84,122 @@ def main(args):
if args.use_data_parallel:
strategy = dg.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
for epoch in range(args.epochs):
pbar = tqdm(reader)
for i, data in enumerate(pbar):
pbar.set_description('Processing at epoch %d'%epoch)
character, mel, mel_input, pos_text, pos_mel, text_length, _, enc_slf_mask, enc_query_mask, dec_slf_mask, enc_dec_mask, dec_query_slf_mask, dec_query_mask= data
pbar.set_description('Processing at epoch %d' % epoch)
character, mel, mel_input, pos_text, pos_mel, text_length, _, enc_slf_mask, enc_query_mask, dec_slf_mask, enc_dec_mask, dec_query_slf_mask, dec_query_mask = data
global_step += 1
mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(character, mel_input, pos_text, pos_mel, dec_slf_mask=dec_slf_mask,
enc_slf_mask=enc_slf_mask, enc_query_mask=enc_query_mask,
enc_dec_mask=enc_dec_mask, dec_query_slf_mask=dec_query_slf_mask,
dec_query_mask=dec_query_mask)
mel_loss = layers.mean(layers.abs(layers.elementwise_sub(mel_pred, mel)))
post_mel_loss = layers.mean(layers.abs(layers.elementwise_sub(postnet_pred, mel)))
mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
character,
mel_input,
pos_text,
pos_mel,
dec_slf_mask=dec_slf_mask,
enc_slf_mask=enc_slf_mask,
enc_query_mask=enc_query_mask,
enc_dec_mask=enc_dec_mask,
dec_query_slf_mask=dec_query_slf_mask,
dec_query_mask=dec_query_mask)
mel_loss = layers.mean(
layers.abs(layers.elementwise_sub(mel_pred, mel)))
post_mel_loss = layers.mean(
layers.abs(layers.elementwise_sub(postnet_pred, mel)))
loss = mel_loss + post_mel_loss
                # Note: training did not converge when the stop token loss was used.
if args.stop_token:
label = (pos_mel == 0).astype(np.float32)
stop_loss = cross_entropy(stop_preds, label)
loss = loss + stop_loss
if local_rank==0:
if local_rank == 0:
writer.add_scalars('training_loss', {
'mel_loss':mel_loss.numpy(),
'post_mel_loss':post_mel_loss.numpy()
'mel_loss': mel_loss.numpy(),
'post_mel_loss': post_mel_loss.numpy()
}, global_step)
if args.stop_token:
writer.add_scalar('stop_loss', stop_loss.numpy(), global_step)
writer.add_scalar('stop_loss',
stop_loss.numpy(), global_step)
if args.use_data_parallel:
writer.add_scalars('alphas', {
'encoder_alpha':model._layers.encoder.alpha.numpy(),
'decoder_alpha':model._layers.decoder.alpha.numpy(),
'encoder_alpha':
model._layers.encoder.alpha.numpy(),
'decoder_alpha':
model._layers.decoder.alpha.numpy(),
}, global_step)
else:
writer.add_scalars('alphas', {
'encoder_alpha':model.encoder.alpha.numpy(),
'decoder_alpha':model.decoder.alpha.numpy(),
'encoder_alpha': model.encoder.alpha.numpy(),
'decoder_alpha': model.decoder.alpha.numpy(),
}, global_step)
writer.add_scalar('learning_rate', optimizer._learning_rate.step().numpy(), global_step)
writer.add_scalar('learning_rate',
optimizer._learning_rate.step().numpy(),
global_step)
if global_step % args.image_step == 1:
for i, prob in enumerate(attn_probs):
for j in range(4):
x = np.uint8(cm.viridis(prob.numpy()[j*16]) * 255)
writer.add_image('Attention_%d_0'%global_step, x, i*4+j, dataformats="HWC")
x = np.uint8(
cm.viridis(prob.numpy()[j * 16]) * 255)
writer.add_image(
'Attention_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
for i, prob in enumerate(attn_enc):
for j in range(4):
x = np.uint8(cm.viridis(prob.numpy()[j*16]) * 255)
writer.add_image('Attention_enc_%d_0'%global_step, x, i*4+j, dataformats="HWC")
x = np.uint8(
cm.viridis(prob.numpy()[j * 16]) * 255)
writer.add_image(
'Attention_enc_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
for i, prob in enumerate(attn_dec):
for j in range(4):
x = np.uint8(cm.viridis(prob.numpy()[j*16]) * 255)
writer.add_image('Attention_dec_%d_0'%global_step, x, i*4+j, dataformats="HWC")
x = np.uint8(
cm.viridis(prob.numpy()[j * 16]) * 255)
writer.add_image(
'Attention_dec_%d_0' % global_step,
x,
i * 4 + j,
dataformats="HWC")
if args.use_data_parallel:
loss = model.scale_loss(loss)
loss.backward()
model.apply_collective_grads()
else:
loss.backward()
optimizer.minimize(loss, grad_clip = fluid.dygraph_grad_clip.GradClipByGlobalNorm(cfg['grad_clip_thresh']))
optimizer.minimize(
loss,
grad_clip=fluid.dygraph_grad_clip.GradClipByGlobalNorm(cfg[
'grad_clip_thresh']))
model.clear_gradients()
# save checkpoint
if local_rank==0 and global_step % args.save_step == 0:
if local_rank == 0 and global_step % args.save_step == 0:
if not os.path.exists(args.save_path):
os.mkdir(args.save_path)
save_path = os.path.join(args.save_path,'transformer/%d' % global_step)
save_path = os.path.join(args.save_path,
'transformer/%d' % global_step)
dg.save_dygraph(model.state_dict(), save_path)
dg.save_dygraph(optimizer.state_dict(), save_path)
if local_rank==0:
if local_rank == 0:
writer.close()
if __name__ =='__main__':
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Train TransformerTTS model")
add_config_options_to_parser(parser)

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from tensorboardX import SummaryWriter
import os
from tqdm import tqdm
@ -13,6 +26,7 @@ import paddle.fluid.layers as layers
from data import LJSpeechLoader
from parakeet.models.transformer_tts.vocoder import Vocoder
def load_checkpoint(step, model_path):
model_dict, opti_dict = dg.load_dygraph(os.path.join(model_path, step))
new_state_dict = OrderedDict()
@ -23,8 +37,9 @@ def load_checkpoint(step, model_path):
new_state_dict[param] = model_dict[param]
return new_state_dict, opti_dict
def main(args):
local_rank = dg.parallel.Env().local_rank if args.use_data_parallel else 0
nranks = dg.parallel.Env().nranks if args.use_data_parallel else 1
@ -35,23 +50,26 @@ def main(args):
place = (fluid.CUDAPlace(dg.parallel.Env().dev_id)
if args.use_data_parallel else fluid.CUDAPlace(0)
if args.use_gpu else fluid.CPUPlace())
if not os.path.exists(args.log_dir):
os.mkdir(args.log_dir)
path = os.path.join(args.log_dir,'vocoder')
os.mkdir(args.log_dir)
path = os.path.join(args.log_dir, 'vocoder')
writer = SummaryWriter(path) if local_rank == 0 else None
with dg.guard(place):
with dg.guard(place):
model = Vocoder(cfg, args.batch_size)
model.train()
optimizer = fluid.optimizer.AdamOptimizer(learning_rate=dg.NoamDecay(1/(cfg['warm_up_step'] *( args.lr ** 2)), cfg['warm_up_step']),
parameter_list=model.parameters())
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=dg.NoamDecay(1 / (
cfg['warm_up_step'] * (args.lr**2)), cfg['warm_up_step']),
parameter_list=model.parameters())
if args.checkpoint_path is not None:
model_dict, opti_dict = load_checkpoint(str(args.vocoder_step), os.path.join(args.checkpoint_path, "vocoder"))
model_dict, opti_dict = load_checkpoint(
str(args.vocoder_step),
os.path.join(args.checkpoint_path, "vocoder"))
model.set_dict(model_dict)
optimizer.set_dict(opti_dict)
global_step = args.vocoder_step
@ -61,48 +79,55 @@ def main(args):
strategy = dg.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
reader = LJSpeechLoader(cfg, args, nranks, local_rank, is_vocoder=True).reader()
reader = LJSpeechLoader(
cfg, args, nranks, local_rank, is_vocoder=True).reader()
for epoch in range(args.epochs):
pbar = tqdm(reader)
for i, data in enumerate(pbar):
pbar.set_description('Processing at epoch %d'%epoch)
pbar.set_description('Processing at epoch %d' % epoch)
mel, mag = data
mag = dg.to_variable(mag.numpy())
mel = dg.to_variable(mel.numpy())
global_step += 1
mag_pred = model(mel)
loss = layers.mean(layers.abs(layers.elementwise_sub(mag_pred, mag)))
loss = layers.mean(
layers.abs(layers.elementwise_sub(mag_pred, mag)))
if args.use_data_parallel:
loss = model.scale_loss(loss)
loss.backward()
model.apply_collective_grads()
else:
loss.backward()
optimizer.minimize(loss, grad_clip = fluid.dygraph_grad_clip.GradClipByGlobalNorm(cfg['grad_clip_thresh']))
optimizer.minimize(
loss,
grad_clip=fluid.dygraph_grad_clip.GradClipByGlobalNorm(cfg[
'grad_clip_thresh']))
model.clear_gradients()
if local_rank==0:
writer.add_scalars('training_loss',{
'loss':loss.numpy(),
if local_rank == 0:
writer.add_scalars('training_loss', {
'loss': loss.numpy(),
}, global_step)
if global_step % args.save_step == 0:
if not os.path.exists(args.save_path):
os.mkdir(args.save_path)
save_path = os.path.join(args.save_path,'vocoder/%d' % global_step)
save_path = os.path.join(args.save_path,
'vocoder/%d' % global_step)
dg.save_dygraph(model.state_dict(), save_path)
dg.save_dygraph(optimizer.state_dict(), save_path)
if local_rank==0:
if local_rank == 0:
writer.close()
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Train vocoder model")
add_config_options_to_parser(parser)
args = parser.parse_args()
# Print the whole config setting.
pprint(args)
main(args)
main(args)

View File

@ -109,3 +109,13 @@ python -u benchmark.py \
--root=./data/LJSpeech-1.1 \
--name=${ModelName} --use_gpu=true
```
### Low-precision inference
This model supports float16 low-precision inference. By appending the argument
```bash
--use_fp16=true
```
to the synthesis and benchmarking commands, you can enable fast low-precision inference.
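For example, based on the benchmark command above:
```bash
python -u benchmark.py \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} \
    --use_gpu=true \
    --use_fp16=true
```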

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import random
from pprint import pprint
@ -24,9 +38,14 @@ def add_options_to_parser(parser):
parser.add_argument(
'--use_gpu',
type=bool,
type=utils.str2bool,
default=True,
help="option to use gpu training")
parser.add_argument(
'--use_fp16',
type=utils.str2bool,
default=True,
help="option to use fp16 for inference")
parser.add_argument(
'--iteration',

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import random
from pprint import pprint
@ -24,9 +38,14 @@ def add_options_to_parser(parser):
parser.add_argument(
'--use_gpu',
type=bool,
type=utils.str2bool,
default=True,
help="option to use gpu training")
parser.add_argument(
'--use_fp16',
type=utils.str2bool,
default=True,
help="option to use fp16 for inference")
parser.add_argument(
'--iteration',
@ -74,7 +93,6 @@ def synthesize(config):
# Build model.
model = WaveFlow(config, checkpoint_dir)
model.build(training=False)
# Obtain the current iteration.
if config.checkpoint is None:
if config.iteration is None:

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import random
import subprocess
@ -127,4 +141,6 @@ if __name__ == "__main__":
# the preceding update will be overwritten by the following one.
config = parser.parse_args()
config = utils.add_yaml_config(config)
# Force to use fp32 in model training
vars(config)["use_fp16"] = False
train(config)

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import itertools
import os
import time
@ -126,7 +140,8 @@ def load_parameters(checkpoint_dir,
model,
optimizer=None,
iteration=None,
file_path=None):
file_path=None,
dtype="float32"):
if file_path is None:
if iteration is None:
iteration = load_latest_checkpoint(checkpoint_dir, rank)
@ -135,6 +150,12 @@ def load_parameters(checkpoint_dir,
file_path = "{}/step-{}".format(checkpoint_dir, iteration)
model_dict, optimizer_dict = dg.load_dygraph(file_path)
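    # When loading for fp16 inference, cast weights to float16, except for
    # conv2d_transpose weights, which stay in float32 (presumably for
    # numerical stability of the upsampling layers).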
if dtype == "float16":
for k, v in model_dict.items():
if "conv2d_transpose" in k:
model_dict[k] = v.astype("float32")
else:
model_dict[k] = v.astype(dtype)
model.set_dict(model_dict)
print("[checkpoint] Rank {}: loaded model from {}".format(rank, file_path))
if optimizer and optimizer_dict:

View File

@ -0,0 +1,97 @@
# WaveNet
Paddle implementation of WaveNet in dynamic graph, a convolutional-network-based vocoder. WaveNet was proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499), but in this experiment the implementation follows the teacher model in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Project Structure
```text
├── data.py data_processing
├── configs/ (example) configuration file
├── synthesis.py script to synthesize waveform from mel_spectrogram
├── train.py script to train a model
└── utils.py utility functions
```
## Train
Train the model using train.py; follow the usage displayed by `python train.py --help`.
```text
usage: train.py [-h] [--data DATA] [--config CONFIG] [--output OUTPUT]
[--device DEVICE] [--resume RESUME]
Train a wavenet model with LJSpeech.
optional arguments:
-h, --help show this help message and exit
--data DATA path of the LJspeech dataset.
--config CONFIG path of the config file.
--output OUTPUT path to save results.
--device DEVICE device to use.
--resume RESUME checkpoint to resume from.
```
1. `--config` is the configuration file to use. The provided configurations can be used directly, or you can change some values in the configuration file and train the model with a modified config.
2. `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.csv).
3. `--resume` is the path of the checkpoint. If it is provided, the model loads the checkpoint before training (see the resume example below).
4. `--output` is the directory to save results; all results are saved in this directory. The structure of the output directory is shown below.
```text
├── checkpoints # checkpoint
└── log # tensorboard log
```
5. `--device` is the device (gpu id) to use for training. `-1` means CPU.
example script:
```bash
python train.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0
```
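To resume training from a saved checkpoint, pass the checkpoint path prefix (without the extension) via `--resume`; the step number below is illustrative:
```bash
python train.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ --output=experiment --device=0 --resume=experiment/checkpoints/step_000010000
```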
You can monitor training log via tensorboard, using the script below.
```bash
cd experiment/log
tensorboard --logdir=.
```
## Synthesis
```text
usage: synthesis.py [-h] [--data DATA] [--config CONFIG] [--device DEVICE]
checkpoint output
Synthesize valid data from LJspeech with a wavenet model.
positional arguments:
checkpoint checkpoint to load.
output path to save results.
optional arguments:
-h, --help show this help message and exit
--data DATA path of the LJspeech dataset.
--config CONFIG path of the config file.
--device DEVICE device to use.
```
1. `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
2. `--data` is the path of the LJSpeech dataset. A dataset is not strictly needed for synthesis, but since the input is a mel spectrogram, we need to extract mel spectrograms from its audio files.
3. `checkpoint` is the checkpoint to load.
4. `output` is the directory to save results. It will contain the generated audio files (`*.wav`).
5. `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
example script:
```bash
python synthesis.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ --device=0 experiment/checkpoints/step_500000 generated
```

View File

@ -0,0 +1,36 @@
data:
batch_size: 16
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
model:
upsampling_factors: [16, 16]
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
loss_type: "mog"
output_dim: 30
log_scale_min: -9
train:
learning_rate: 0.001
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 10000
snap_interval: 10000
eval_interval: 10000
max_iterations: 2000000

View File

@ -0,0 +1,36 @@
data:
batch_size: 16
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
model:
upsampling_factors: [16, 16]
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
loss_type: "mog"
output_dim: 3
log_scale_min: -9
train:
learning_rate: 0.001
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 10000
snap_interval: 10000
eval_interval: 10000
max_iterations: 2000000

View File

@ -0,0 +1,36 @@
data:
batch_size: 16
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
model:
upsampling_factors: [16, 16]
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
loss_type: "softmax"
output_dim: 2048
log_scale_min: -9
train:
learning_rate: 0.001
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 10000
snap_interval: 10000
eval_interval: 10000
max_iterations: 2000000

163
examples/wavenet/data.py Normal file
View File

@ -0,0 +1,163 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import csv
import numpy as np
import librosa
from pathlib import Path
import pandas as pd
from parakeet.data import batch_spec, batch_wav
from parakeet.data import DatasetMixin
class LJSpeechMetaData(DatasetMixin):
def __init__(self, root):
self.root = Path(root)
self._wav_dir = self.root.joinpath("wavs")
csv_path = self.root.joinpath("metadata.csv")
self._table = pd.read_csv(
csv_path,
sep="|",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
def get_example(self, i):
fname, raw_text, normalized_text = self._table.iloc[i]
fname = str(self._wav_dir.joinpath(fname + ".wav"))
return fname, raw_text, normalized_text
def __len__(self):
return len(self._table)
class Transform(object):
def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels):
self.sample_rate = sample_rate
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.n_mels = n_mels
def __call__(self, example):
wav_path, _, _ = example
sr = self.sample_rate
n_fft = self.n_fft
win_length = self.win_length
hop_length = self.hop_length
n_mels = self.n_mels
wav, loaded_sr = librosa.load(wav_path, sr=None)
        assert loaded_sr == sr, "audio sample rate does not match the configured sample rate"
# Pad audio to the right size.
frames = int(np.ceil(float(wav.size) / hop_length))
        fft_padding = (n_fft - hop_length) // 2  # extra samples on each side for the STFT windows
desired_length = frames * hop_length + fft_padding * 2
pad_amount = (desired_length - wav.size) // 2
if wav.size % 2 == 0:
wav = np.pad(wav, (pad_amount, pad_amount), mode='reflect')
else:
wav = np.pad(wav, (pad_amount, pad_amount + 1), mode='reflect')
# Normalize audio.
wav = wav / np.abs(wav).max() * 0.999
        # Compute the linear spectrogram.
# Turn center to False to prevent internal padding.
spectrogram = librosa.core.stft(
wav,
hop_length=hop_length,
win_length=win_length,
n_fft=n_fft,
center=False)
spectrogram_magnitude = np.abs(spectrogram)
        # Apply the mel filter bank to get the mel-spectrogram.
mel_filter_bank = librosa.filters.mel(sr=sr,
n_fft=n_fft,
n_mels=n_mels)
mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)
# Rescale mel_spectrogram.
        min_level, ref_level = 1e-5, 20  # hard-coded
mel_spectrogram = 20 * np.log10(np.maximum(min_level, mel_spectrogram))
mel_spectrogram = mel_spectrogram - ref_level
mel_spectrogram = np.clip((mel_spectrogram + 100) / 100, 0, 1)
# Extract the center of audio that corresponds to mel spectrograms.
audio = wav[fft_padding:-fft_padding]
assert mel_spectrogram.shape[1] * hop_length == audio.size
# there is no clipping here
return audio, mel_spectrogram
class DataCollector(object):
def __init__(self,
context_size,
sample_rate,
hop_length,
train_clip_seconds,
valid=False):
frames_per_second = sample_rate // hop_length
train_clip_frames = int(
np.ceil(train_clip_seconds * frames_per_second))
context_frames = context_size // hop_length
self.num_frames = train_clip_frames + context_frames
self.sample_rate = sample_rate
self.hop_length = hop_length
self.valid = valid
def random_crop(self, sample):
audio, mel_spectrogram = sample
audio_frames = int(audio.size) // self.hop_length
max_start_frame = audio_frames - self.num_frames
assert max_start_frame >= 0, "audio is too short to be cropped"
frame_start = np.random.randint(0, max_start_frame)
# frame_start = 0 # norandom
frame_end = frame_start + self.num_frames
audio_start = frame_start * self.hop_length
audio_end = frame_end * self.hop_length
audio = audio[audio_start:audio_end]
return audio, mel_spectrogram, audio_start
def __call__(self, samples):
# transform them first
if self.valid:
samples = [(audio, mel_spectrogram, 0)
for audio, mel_spectrogram in samples]
else:
samples = [self.random_crop(sample) for sample in samples]
# batch them
audios = [sample[0] for sample in samples]
audio_starts = [sample[2] for sample in samples]
mels = [sample[1] for sample in samples]
mels = batch_spec(mels)
if self.valid:
audios = batch_wav(audios, dtype=np.float32)
else:
audios = np.array(audios, dtype=np.float32)
audio_starts = np.array(audio_starts, dtype=np.int64)
return audios, mels, audio_starts

View File

@ -0,0 +1,124 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import ruamel.yaml
import argparse
from tqdm import tqdm
from tensorboardX import SummaryWriter
from paddle import fluid
import paddle.fluid.dygraph as dg
from parakeet.data import SliceDataset, TransformDataset, DataCargo, SequentialSampler, RandomSampler
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWavenet
from parakeet.utils.layer_tools import summary
from data import LJSpeechMetaData, Transform, DataCollector
from utils import make_output_tree, valid_model, eval_model, save_checkpoint
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Synthesize valid data from LJspeech with a wavenet model.")
parser.add_argument(
"--data", type=str, help="path of the LJspeech dataset.")
parser.add_argument("--config", type=str, help="path of the config file.")
parser.add_argument(
"--device", type=int, default=-1, help="device to use.")
parser.add_argument("checkpoint", type=str, help="checkpoint to load.")
parser.add_argument(
"output", type=str, default="experiment", help="path to save results.")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
ljspeech_meta = LJSpeechMetaData(args.data)
data_config = config["data"]
sample_rate = data_config["sample_rate"]
n_fft = data_config["n_fft"]
win_length = data_config["win_length"]
hop_length = data_config["hop_length"]
n_mels = data_config["n_mels"]
train_clip_seconds = data_config["train_clip_seconds"]
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
ljspeech = TransformDataset(ljspeech_meta, transform)
valid_size = data_config["valid_size"]
ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))
model_config = config["model"]
n_loop = model_config["n_loop"]
n_layer = model_config["n_layer"]
filter_size = model_config["filter_size"]
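    # Receptive field of the WaveNet decoder in samples: n_layer stacks of
    # n_loop dilated conv layers with dilations filter_size**0 ..
    # filter_size**(n_loop - 1), plus one for the current sample (this count
    # is exact for filter_size == 2).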
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
print("context size is {} samples".format(context_size))
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
train_clip_seconds)
valid_batch_fn = DataCollector(
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
batch_size = data_config["batch_size"]
train_cargo = DataCargo(
ljspeech_train,
train_batch_fn,
batch_size,
sampler=RandomSampler(ljspeech_train))
    # only batch_size=1 is supported for validation
valid_cargo = DataCargo(
ljspeech_valid,
valid_batch_fn,
batch_size=1,
sampler=SequentialSampler(ljspeech_valid))
make_output_tree(args.output)
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
with dg.guard(place):
model_config = config["model"]
upsampling_factors = model_config["upsampling_factors"]
encoder = UpsampleNet(upsampling_factors)
n_loop = model_config["n_loop"]
n_layer = model_config["n_layer"]
residual_channels = model_config["residual_channels"]
output_dim = model_config["output_dim"]
loss_type = model_config["loss_type"]
log_scale_min = model_config["log_scale_min"]
decoder = WaveNet(n_loop, n_layer, residual_channels, output_dim,
n_mels, filter_size, loss_type, log_scale_min)
model = ConditionalWavenet(encoder, decoder)
summary(model)
model_dict, _ = dg.load_dygraph(args.checkpoint)
print("Loading from {}.pdparams".format(args.checkpoint))
model.set_dict(model_dict)
train_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
train_loader.set_batch_generator(train_cargo, place)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
valid_loader.set_batch_generator(valid_cargo, place)
eval_model(model, valid_loader, args.output, sample_rate)

181
examples/wavenet/train.py Normal file
View File

@ -0,0 +1,181 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import ruamel.yaml
import argparse
from tqdm import tqdm
from tensorboardX import SummaryWriter
from paddle import fluid
import paddle.fluid.dygraph as dg
from parakeet.data import SliceDataset, TransformDataset, DataCargo, SequentialSampler, RandomSampler
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWavenet
from parakeet.utils.layer_tools import summary
from data import LJSpeechMetaData, Transform, DataCollector
from utils import make_output_tree, valid_model, save_checkpoint
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Train a wavenet model with LJSpeech.")
parser.add_argument(
"--data", type=str, help="path of the LJspeech dataset.")
parser.add_argument("--config", type=str, help="path of the config file.")
parser.add_argument(
"--output",
type=str,
default="experiment",
help="path to save results.")
parser.add_argument(
"--device", type=int, default=-1, help="device to use.")
parser.add_argument(
"--resume", type=str, help="checkpoint to resume from.")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
ljspeech_meta = LJSpeechMetaData(args.data)
data_config = config["data"]
sample_rate = data_config["sample_rate"]
n_fft = data_config["n_fft"]
win_length = data_config["win_length"]
hop_length = data_config["hop_length"]
n_mels = data_config["n_mels"]
train_clip_seconds = data_config["train_clip_seconds"]
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
ljspeech = TransformDataset(ljspeech_meta, transform)
valid_size = data_config["valid_size"]
ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))
model_config = config["model"]
n_loop = model_config["n_loop"]
n_layer = model_config["n_layer"]
filter_size = model_config["filter_size"]
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
print("context size is {} samples".format(context_size))
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
train_clip_seconds)
valid_batch_fn = DataCollector(
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
batch_size = data_config["batch_size"]
train_cargo = DataCargo(
ljspeech_train,
train_batch_fn,
batch_size,
sampler=RandomSampler(ljspeech_train))
    # only batch_size=1 is supported for validation
valid_cargo = DataCargo(
ljspeech_valid,
valid_batch_fn,
batch_size=1,
sampler=SequentialSampler(ljspeech_valid))
make_output_tree(args.output)
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
with dg.guard(place):
model_config = config["model"]
upsampling_factors = model_config["upsampling_factors"]
encoder = UpsampleNet(upsampling_factors)
n_loop = model_config["n_loop"]
n_layer = model_config["n_layer"]
residual_channels = model_config["residual_channels"]
output_dim = model_config["output_dim"]
loss_type = model_config["loss_type"]
log_scale_min = model_config["log_scale_min"]
decoder = WaveNet(n_loop, n_layer, residual_channels, output_dim,
n_mels, filter_size, loss_type, log_scale_min)
model = ConditionalWavenet(encoder, decoder)
summary(model)
train_config = config["train"]
learning_rate = train_config["learning_rate"]
anneal_rate = train_config["anneal_rate"]
anneal_interval = train_config["anneal_interval"]
lr_scheduler = dg.ExponentialDecay(
learning_rate, anneal_interval, anneal_rate, staircase=True)
optim = fluid.optimizer.Adam(
lr_scheduler, parameter_list=model.parameters())
        gradient_max_norm = train_config["gradient_max_norm"]
        clipper = fluid.dygraph_grad_clip.GradClipByGlobalNorm(
            gradient_max_norm)
if args.resume:
model_dict, optim_dict = dg.load_dygraph(args.resume)
print("Loading from {}.pdparams".format(args.resume))
model.set_dict(model_dict)
if optim_dict:
optim.set_dict(optim_dict)
print("Loading from {}.pdopt".format(args.resume))
train_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
train_loader.set_batch_generator(train_cargo, place)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
valid_loader.set_batch_generator(valid_cargo, place)
max_iterations = train_config["max_iterations"]
checkpoint_interval = train_config["checkpoint_interval"]
snap_interval = train_config["snap_interval"]
eval_interval = train_config["eval_interval"]
checkpoint_dir = os.path.join(args.output, "checkpoints")
log_dir = os.path.join(args.output, "log")
writer = SummaryWriter(log_dir)
global_step = 1
while global_step <= max_iterations:
epoch_loss = 0.
for i, batch in tqdm(enumerate(train_loader)):
audio_clips, mel_specs, audio_starts = batch
model.train()
y_var = model(audio_clips, mel_specs, audio_starts)
loss_var = model.loss(y_var, audio_clips)
loss_var.backward()
loss_np = loss_var.numpy()
epoch_loss += loss_np[0]
writer.add_scalar("loss", loss_np[0], global_step)
writer.add_scalar("learning_rate",
optim._learning_rate.step().numpy()[0],
global_step)
optim.minimize(loss_var, grad_clip=clipper)
optim.clear_gradients()
print("loss: {:<8.6f}".format(loss_np[0]))
if global_step % snap_interval == 0:
valid_model(model, valid_loader, writer, global_step,
sample_rate)
if global_step % checkpoint_interval == 0:
save_checkpoint(model, optim, checkpoint_dir, global_step)
global_step += 1

67
examples/wavenet/utils.py Normal file
View File

@ -0,0 +1,67 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import numpy as np
import soundfile as sf
import paddle.fluid.dygraph as dg
def make_output_tree(output_dir):
checkpoint_dir = os.path.join(output_dir, "checkpoints")
if not os.path.exists(checkpoint_dir):
os.makedirs(checkpoint_dir)
state_dir = os.path.join(output_dir, "states")
if not os.path.exists(state_dir):
os.makedirs(state_dir)
def valid_model(model, valid_loader, writer, global_step, sample_rate):
loss = []
wavs = []
model.eval()
for i, batch in enumerate(valid_loader):
# print("sentence {}".format(i))
audio_clips, mel_specs, audio_starts = batch
y_var = model(audio_clips, mel_specs, audio_starts)
wav_var = model.sample(y_var)
loss_var = model.loss(y_var, audio_clips)
loss.append(loss_var.numpy()[0])
wavs.append(wav_var.numpy()[0])
average_loss = np.mean(loss)
writer.add_scalar("valid_loss", average_loss, global_step)
for i, wav in enumerate(wavs):
writer.add_audio("valid/sample_{}".format(i), wav, global_step,
sample_rate)
def eval_model(model, valid_loader, output_dir, sample_rate):
model.eval()
for i, batch in enumerate(valid_loader):
# print("sentence {}".format(i))
path = os.path.join(output_dir, "sentence_{}.wav".format(i))
audio_clips, mel_specs, audio_starts = batch
wav_var = model.synthesis(mel_specs)
wav_np = wav_var.numpy()[0]
sf.write(path, wav_np, samplerate=sample_rate)
print("generated {}".format(path))
def save_checkpoint(model, optim, checkpoint_dir, global_step):
checkpoint_path = os.path.join(checkpoint_dir,
"step_{:09d}".format(global_step))
dg.save_dygraph(model.state_dict(), checkpoint_path)
dg.save_dygraph(optim.state_dict(), checkpoint_path)
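
Note that dg.save_dygraph infers the file suffix from the state dict it receives, so the two calls above write step_XXXXXXXXX.pdparams (model) and step_XXXXXXXXX.pdopt (optimizer) under one prefix. A minimal resume sketch using the same prefix (checkpoint_path is illustrative):

model_dict, optim_dict = dg.load_dygraph(checkpoint_path)  # prefix, no suffix
model.set_dict(model_dict)
if optim_dict is not None:
    optim.set_dict(optim_dict)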

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
__version__ = "0.0.0"
from . import data, g2p, models, modules

View File

@ -1 +1,15 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .audio import AudioProcessor

View File

@ -1,30 +1,46 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import librosa
import soundfile as sf
import numpy as np
import scipy.io
import scipy.signal
class AudioProcessor(object):
def __init__(self,
sample_rate=None, # int, sampling rate
num_mels=None, # int, bands of mel spectrogram
min_level_db=None, # float, minimum level db
ref_level_db=None, # float, reference level db
n_fft=None, # int: number of samples in a frame for stft
win_length=None, # int: window length in samples, defaults to n_fft
hop_length=None, # int: number of samples between neighboring frames
power=None, # float: power to raise before griffin-lim
preemphasis=None, # float: preemphasis coefficient
signal_norm=None, #
symmetric_norm=False, # bool, apply clip norm in [-max_norm, max_norm]
max_norm=None, # float, max norm
mel_fmin=None, # int: mel spectrogram's minimum frequency
mel_fmax=None, # int: mel spectrogram's maximum frequency
clip_norm=True, # bool: clip spectrogram's norm
griffin_lim_iters=None, # int: number of Griffin-Lim iterations
do_trim_silence=False, # bool: trim silence
sound_norm=False,
**kwargs):
def __init__(
self,
sample_rate=None, # int, sampling rate
num_mels=None, # int, bands of mel spectrogram
min_level_db=None, # float, minimum level db
ref_level_db=None, # float, reference level db
n_fft=None, # int: number of samples in a frame for stft
win_length=None, # int: window length in samples, defaults to n_fft
hop_length=None, # int: number of samples between neighboring frames
power=None, # float: power to raise before griffin-lim
preemphasis=None, # float: preemphasis coefficient
signal_norm=None, #
symmetric_norm=False, # bool, apply clip norm in [-max_norm, max_norm]
max_norm=None, # float, max norm
mel_fmin=None, # int: mel spectrogram's minimum frequency
mel_fmax=None, # int: mel spectrogram's maximum frequency
clip_norm=True, # bool: clip spectrogram's norm
griffin_lim_iters=None, # int: number of Griffin-Lim iterations
do_trim_silence=False, # bool: trim silence
sound_norm=False,
**kwargs):
self.sample_rate = sample_rate
self.num_mels = num_mels
self.min_level_db = min_level_db
@ -34,8 +50,8 @@ class AudioProcessor(object):
self.n_fft = n_fft
self.win_length = win_length or n_fft
# hop length defaults to 1/4 window_length
self.hop_length = hop_length or self.win_length // 4
self.power = power
self.preemphasis = float(preemphasis)
@ -52,7 +68,8 @@ class AudioProcessor(object):
self.do_trim_silence = do_trim_silence
self.sound_norm = sound_norm
self.num_freq, self.frame_length_ms, self.frame_shift_ms = self._stft_parameters()
self.num_freq, self.frame_length_ms, self.frame_shift_ms = self._stft_parameters(
)
def _stft_parameters(self):
"""compute frame length and hop length in ms"""
@ -65,44 +82,54 @@ class AudioProcessor(object):
"""object repr"""
cls_name_str = self.__class__.__name__
members = vars(self)
dict_str = "\n".join([" {}: {},".format(k, v) for k, v in members.items()])
dict_str = "\n".join(
[" {}: {},".format(k, v) for k, v in members.items()])
repr_str = "{}(\n{})\n".format(cls_name_str, dict_str)
return repr_str
def save_wav(self, path, wav):
"""save audio with scipy.io.wavfile in 16bit integers"""
wav_norm = wav * (32767 / max(0.01, np.max(np.abs(wav))))
scipy.io.wavfile.write(path, self.sample_rate, wav_norm.astype(np.int16))
scipy.io.wavfile.write(path, self.sample_rate,
wav_norm.astype(np.int16))
def load_wav(self, path, sr=None):
"""load wav -> trim_silence -> rescale"""
x, sr = librosa.load(path, sr=None)
assert self.sample_rate == sr, "audio sample rate: {}Hz != processor sample rate: {}Hz".format(sr, self.sample_rate)
assert self.sample_rate == sr, "audio sample rate: {}Hz != processor sample rate: {}Hz".format(
sr, self.sample_rate)
if self.do_trim_silence:
try:
x = self.trim_silence(x)
except ValueError:
print(" [!] File cannot be trimmed for silence - {}".format(path))
print(" [!] File cannot be trimmed for silence - {}".format(
path))
if self.sound_norm:
x = x / x.max() * 0.9  # why 0.9 ?
return x
def trim_silence(self, wav):
"""Trim soilent parts with a threshold and 0.01s margin"""
margin = int(self.sample_rate * 0.01)
wav = wav[margin: -margin]
trimmed_wav = librosa.effects.trim(wav, top_db=60, frame_length=self.win_length, hop_length=self.hop_length)[0]
wav = wav[margin:-margin]
trimmed_wav = librosa.effects.trim(
wav,
top_db=60,
frame_length=self.win_length,
hop_length=self.hop_length)[0]
return trimmed_wav
def apply_preemphasis(self, x):
if self.preemphasis == 0.:
raise RuntimeError(" !! Preemphasis coefficient should be positive. ")
raise RuntimeError(
" !! Preemphasis coefficient should be positive. ")
return scipy.signal.lfilter([1., -self.preemphasis], [1.], x)
def apply_inv_preemphasis(self, x):
if self.preemphasis == 0.:
raise RuntimeError(" !! Preemphasis coefficient should be positive. ")
raise RuntimeError(
" !! Preemphasis coefficient should be positive. ")
return scipy.signal.lfilter([1.], [1., -self.preemphasis], x)
def _amplitude_to_db(self, x):
@ -125,12 +152,11 @@ class AudioProcessor(object):
"""return mel basis for mel scale"""
if self.mel_fmax is not None:
assert self.mel_fmax <= self.sample_rate // 2
return librosa.filters.mel(
self.sample_rate,
self.n_fft,
n_mels=self.num_mels,
fmin=self.mel_fmin,
fmax=self.mel_fmax)
return librosa.filters.mel(self.sample_rate,
self.n_fft,
n_mels=self.num_mels,
fmin=self.mel_fmin,
fmax=self.mel_fmax)
def _normalize(self, S):
"""put values in [0, self.max_norm] or [-self.max_norm, self,max_norm]"""
@ -156,25 +182,29 @@ class AudioProcessor(object):
if self.symmetric_norm:
if self.clip_norm:
S_denorm = np.clip(S_denorm, -self.max_norm, self.max_norm)
S_denorm = (S_denorm + self.max_norm) * (-self.min_level_db) / (2 * self.max_norm) + self.min_level_db
S_denorm = (S_denorm + self.max_norm) * (
-self.min_level_db) / (2 * self.max_norm
) + self.min_level_db
return S_denorm
else:
if self.clip_norm:
S_denorm = np.clip(S_denorm, 0, self.max_norm)
S_denorm = S_denorm * (-self.min_level_db)/ self.max_norm + self.min_level_db
S_denorm = S_denorm * (-self.min_level_db
) / self.max_norm + self.min_level_db
return S_denorm
else:
return S
def _stft(self, y):
return librosa.stft(
y=y,
y=y,
n_fft=self.n_fft,
win_length=self.win_length,
hop_length=self.hop_length)
def _istft(self, S):
return librosa.istft(S, hop_length=self.hop_length, win_length=self.win_length)
return librosa.istft(
S, hop_length=self.hop_length, win_length=self.win_length)
def spectrogram(self, y):
"""compute linear spectrogram(amplitude)
@ -195,7 +225,8 @@ class AudioProcessor(object):
D = self._stft(self.apply_preemphasis(y))
else:
D = self._stft(y)
S = self._amplitude_to_db(self._linear_to_mel(np.abs(D))) - self.ref_level_db
S = self._amplitude_to_db(self._linear_to_mel(np.abs(
D))) - self.ref_level_db
return self._normalize(S)
def inv_spectrogram(self, spectrogram):
@ -203,16 +234,16 @@ class AudioProcessor(object):
S = self._denormalize(spectrogram)
S = self._db_to_amplitude(S + self.ref_level_db)
if self.preemphasis:
return self.apply_inv_preemphasis(self._griffin_lim(S ** self.power))
return self._griffin_lim(S ** self.power)
return self.apply_inv_preemphasis(self._griffin_lim(S**self.power))
return self._griffin_lim(S**self.power)
def inv_melspectrogram(self, mel_spectrogram):
S = self._denormalize(mel_spectrogram)
S = self._db_to_amplitude(S + self.ref_level_db)
S = self._mel_to_linear(np.abs(S))
if self.preemphasis:
return self.apply_inv_preemphasis(self._griffin_lim(S ** self.power))
return self._griffin_lim(S ** self.power)
return self.apply_inv_preemphasis(self._griffin_lim(S**self.power))
return self._griffin_lim(S**self.power)
def out_linear_to_mel(self, linear_spec):
"""convert output linear spec to mel spec"""
@ -222,7 +253,7 @@ class AudioProcessor(object):
S = self._amplitude_to_db(S) - self.ref_level_db
mel = self._normalize(S)
return mel
def _griffin_lim(self, S):
angles = np.exp(2j * np.pi * np.random.rand(*S.shape))
S_complex = np.abs(S).astype(np.complex)
@ -234,18 +265,18 @@ class AudioProcessor(object):
@staticmethod
def mulaw_encode(wav, qc):
mu = 2 ** qc - 1
mu = 2**qc - 1
# wav_abs = np.minimum(np.abs(wav), 1.0)
signal = np.sign(wav) * np.log(1 + mu * np.abs(wav)) / np.log(1. + mu)
# Quantize signal to the specified number of levels.
signal = (signal + 1) / 2 * mu + 0.5
return np.floor(signal,)
return np.floor(signal, )
@staticmethod
def mulaw_decode(wav, qc):
"""Recovers waveform from quantized values."""
mu = 2 ** qc - 1
x = np.sign(wav) / mu * ((1 + mu) ** np.abs(wav) - 1)
mu = 2**qc - 1
x = np.sign(wav) / mu * ((1 + mu)**np.abs(wav) - 1)
return x
@staticmethod

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .dataset import *
from .datacargo import *
from .sampler import *

View File

@ -1,18 +1,34 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Functions for batching arrays that satisfy certain conditions.
"""
import numpy as np
class TextIDBatcher(object):
"""A wrapper class for a function to build a functor, which holds the configs to pass to the function."""
def __init__(self, pad_id=0, dtype=np.int64):
self.pad_id = pad_id
self.dtype = dtype
def __call__(self, minibatch):
out = batch_text_id(minibatch, pad_id=self.pad_id, dtype=self.dtype)
return out
def batch_text_id(minibatch, pad_id=0, dtype=np.int64):
"""
minibatch: List[Example]
@ -20,26 +36,32 @@ def batch_text_id(minibatch, pad_id=0, dtype=np.int64):
"""
peek_example = minibatch[0]
assert len(peek_example.shape) == 1, "text example is a 1D tensor"
lengths = [example.shape[0] for example in minibatch] # every text example is 1D: (n_samples, )
lengths = [example.shape[0] for example in minibatch
] # every text example is 1D: (n_samples, )
max_len = np.max(lengths)
batch = []
for example in minibatch:
pad_len = max_len - example.shape[0]
batch.append(np.pad(example, [(0, pad_len)], mode='constant', constant_values=pad_id))
batch.append(
np.pad(example, [(0, pad_len)],
mode='constant',
constant_values=pad_id))
return np.array(batch, dtype=dtype)
class WavBatcher(object):
def __init__(self, pad_value=0., dtype=np.float32):
self.pad_value = pad_value
self.dtype = dtype
def __call__(self, minibatch):
out = batch_wav(minibatch, pad_value=self.pad_value, dtype=self.dtype)
return out
def batch_wav(minibatch, pad_value=0., dtype=np.float32):
"""
minibatch: List[Example]
@ -51,18 +73,25 @@ def batch_wav(minibatch, pad_value=0., dtype=np.float32):
mono_channel = True
elif len(peek_example.shape) == 2:
mono_channel = False
lengths = [example.shape[-1] for example in minibatch] # assume (channel, n_samples) or (n_samples, )
lengths = [example.shape[-1] for example in minibatch
] # assume (channel, n_samples) or (n_samples, )
max_len = np.max(lengths)
batch = []
for example in minibatch:
pad_len = max_len - example.shape[-1]
if mono_channel:
batch.append(np.pad(example, [(0, pad_len)], mode='constant', constant_values=pad_value))
batch.append(
np.pad(example, [(0, pad_len)],
mode='constant',
constant_values=pad_value))
else:
batch.append(np.pad(example, [(0, 0), (0, pad_len)], mode='constant', constant_values=pad_value)) # what about PCM, no
batch.append(
np.pad(example, [(0, 0), (0, pad_len)],
mode='constant',
constant_values=pad_value)) # what about PCM, no
return np.array(batch, dtype=dtype)
@ -75,6 +104,7 @@ class SpecBatcher(object):
out = batch_spec(minibatch, pad_value=self.pad_value, dtype=self.dtype)
return out
def batch_spec(minibatch, pad_value=0., dtype=np.float32):
"""
minibatch: List[Example]
@ -86,16 +116,23 @@ def batch_spec(minibatch, pad_value=0., dtype=np.float32):
mono_channel = True
elif len(peek_example.shape) == 3:
mono_channel = False
lengths = [example.shape[-1] for example in minibatch] # assume (channel, F, n_frame) or (F, n_frame)
max_len = np.max(lengths)
lengths = [example.shape[-1] for example in minibatch
] # assume (channel, F, n_frame) or (F, n_frame)
max_len = np.max(lengths)
batch = []
for example in minibatch:
pad_len = max_len - example.shape[-1]
if mono_channel:
batch.append(np.pad(example, [(0, 0), (0, pad_len)], mode='constant', constant_values=pad_value))
batch.append(
np.pad(example, [(0, 0), (0, pad_len)],
mode='constant',
constant_values=pad_value))
else:
batch.append(np.pad(example, [(0, 0), (0, 0), (0, pad_len)], mode='constant', constant_values=pad_value)) # what about PCM, no
return np.array(batch, dtype=dtype)
batch.append(
np.pad(example, [(0, 0), (0, 0), (0, pad_len)],
mode='constant',
constant_values=pad_value)) # what about PCM, no
return np.array(batch, dtype=dtype)
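
A quick usage sketch of the batchers above (the shapes are illustrative, and SpecBatcher is assumed to take the same pad_value argument as WavBatcher since its __init__ is elided in this hunk): each pads its minibatch along the last axis to the longest example before stacking.

import numpy as np
wavs = [np.random.randn(n).astype(np.float32) for n in (300, 500)]
wav_batch = WavBatcher(pad_value=0.)(wavs)     # shape (2, 500)
specs = [np.random.randn(80, n).astype(np.float32) for n in (40, 60)]
spec_batch = SpecBatcher(pad_value=0.)(specs)  # shape (2, 80, 60)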

View File

@ -1,3 +1,18 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import six
from .sampler import SequentialSampler, RandomSampler, BatchSampler
@ -84,7 +99,11 @@ class DataIterator(object):
return minibatch
def _next_index(self):
return next(self._sampler_iter)
if six.PY3:
return next(self._sampler_iter)
else:
# six.PY2
return self._sampler_iter.next()
def __len__(self):
return len(self._index_sampler)

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import six
import numpy as np
from tqdm import tqdm
@ -10,8 +24,7 @@ class DatasetMixin(object):
if isinstance(index, slice):
start, stop, step = index.indices(len(self))
return [
self.get_example(i)
for i in six.moves.range(start, stop, step)
self.get_example(i) for i in six.moves.range(start, stop, step)
]
elif isinstance(index, (list, np.ndarray)):
return [self.get_example(i) for i in index]
@ -46,6 +59,7 @@ class TransformDataset(DatasetMixin):
in_data = self._dataset[i]
return self._transform(in_data)
class CacheDataset(DatasetMixin):
def __init__(self, dataset):
self._dataset = dataset
@ -58,6 +72,7 @@ class CacheDataset(DatasetMixin):
def get_example(self, i):
return self._cache[i]
class TupleDataset(object):
def __init__(self, *datasets):
if not datasets:
@ -133,7 +148,7 @@ class SliceDataset(DatasetMixin):
format(len(order), len(dataset)))
self._order = order
def len(self):
def __len__(self):
return self._size
def get_example(self, i):
@ -192,8 +207,7 @@ class ChainDataset(DatasetMixin):
def get_example(self, i):
if i < 0:
raise IndexError(
"ChainDataset doesnot support negative indexing.")
raise IndexError("ChainDataset doesnot support negative indexing.")
for dataset in self._datasets:
if i < len(dataset):

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
In most cases, we have a non-stream dataset, which means we can randomly access it with __getitem__ and get its length with __len__.
@ -6,10 +19,10 @@ This suffices for a sampler. We implement samplers as iterables of valid indices.
So the sampler is only responsible for generating valid indices.
"""
import numpy as np
import random
class Sampler(object):
def __init__(self, data_source):
pass
@ -23,7 +36,7 @@ class Sampler(object):
class SequentialSampler(Sampler):
def __init__(self, data_source):
self.data_source = data_source
def __iter__(self):
return iter(range(len(self.data_source)))
@ -42,12 +55,14 @@ class RandomSampler(Sampler):
"replacement={}".format(self.replacement))
if self._num_samples is not None and not replacement:
raise ValueError("With replacement=False, num_samples should not be specified, "
"since a random permutation will be performed.")
raise ValueError(
"With replacement=False, num_samples should not be specified, "
"since a random permutation will be performed.")
if not isinstance(self.num_samples, int) or self.num_samples <= 0:
raise ValueError("num_samples should be a positive integer "
"value, but got num_samples={}".format(self.num_samples))
"value, but got num_samples={}".format(
self.num_samples))
@property
def num_samples(self):
@ -59,7 +74,9 @@ class RandomSampler(Sampler):
def __iter__(self):
n = len(self.data_source)
if self.replacement:
return iter(np.random.randint(0, n, size=(self.num_samples,), dtype=np.int64).tolist())
return iter(
np.random.randint(
0, n, size=(self.num_samples, ), dtype=np.int64).tolist())
return iter(np.random.permutation(n).tolist())
def __len__(self):
@ -76,7 +93,8 @@ class SubsetRandomSampler(Sampler):
self.indices = indices
def __iter__(self):
return (self.indices[i] for i in np.random.permutation(len(self.indices)))
return (self.indices[i]
for i in np.random.permutation(len(self.indices)))
def __len__(self):
return len(self.indices)
@ -89,9 +107,14 @@ class PartialyRandomizedSimilarTimeLengthSampler(Sampler):
3. Permute mini-batches
"""
def __init__(self, lengths, batch_size=4, batch_group_size=None,
def __init__(self,
lengths,
batch_size=4,
batch_group_size=None,
permutate=True):
_lengths = np.array(lengths, dtype=np.int64) # maybe better implement length as a sort key
_lengths = np.array(
lengths,
dtype=np.int64) # maybe better implement length as a sort key
self.lengths = np.sort(_lengths)
self.sorted_indices = np.argsort(_lengths)
@ -112,20 +135,21 @@ class PartialyRandomizedSimilarTimeLengthSampler(Sampler):
for i in range(len(indices) // batch_group_size):
s = i * batch_group_size
e = s + batch_group_size
random.shuffle(indices[s: e]) # inplace
random.shuffle(indices[s:e]) # inplace
# Permute batches
if self.permutate:
perm = np.arange(len(indices[:e]) // self.batch_size)
random.shuffle(perm)
indices[:e] = indices[:e].reshape(-1, self.batch_size)[perm, :].reshape(-1)
indices[:e] = indices[:e].reshape(
-1, self.batch_size)[perm, :].reshape(-1)
# Handle last elements
s += batch_group_size
#print(indices)
if s < len(indices):
random.shuffle(indices[s:])
return iter(indices)
def __len__(self):
@ -150,14 +174,19 @@ class WeightedRandomSampler(Sampler):
def __init__(self, weights, num_samples, replacement):
if not isinstance(num_samples, int) or num_samples <= 0:
raise ValueError("num_samples should be a positive integer "
"value, but got num_samples={}".format(num_samples))
"value, but got num_samples={}".format(
num_samples))
self.weights = np.array(weights, dtype=np.float64)
self.num_samples = num_samples
self.replacement = replacement
def __iter__(self):
return iter(np.random.choice(len(self.weights), size=(self.num_samples, ),
replace=self.replacement, p=self.weights).tolist())
return iter(
np.random.choice(
len(self.weights),
size=(self.num_samples, ),
replace=self.replacement,
p=self.weights).tolist())
def __len__(self):
return self.num_samples
@ -184,7 +213,7 @@ class DistributedSampler(Sampler):
# Subset samples for each trainer.
indices = indices[self.rank:self.total_size:self.num_trainers]
assert len(indices) == self.num_samples
return iter(indices)
@ -209,8 +238,7 @@ class BatchSampler(Sampler):
def __init__(self, sampler, batch_size, drop_last):
if not isinstance(sampler, Sampler):
raise ValueError("sampler should be an instance of "
"Sampler, but got sampler={}"
.format(sampler))
"Sampler, but got sampler={}".format(sampler))
if not isinstance(batch_size, int) or batch_size <= 0:
raise ValueError("batch_size should be a positive integer value, "
"but got batch_size={}".format(batch_size))

View File

@ -14,9 +14,4 @@ One of the reasons we choose to load data lazily (only load metadata beforehand
For deep learning practice, we typically batch examples. So the dataset should come with a method to batch examples. Assuming a record is implemented as a tuple of several items: when an item is a fixed-size array, batching is trivial, just `np.stack` suffices; but arrays with dynamic sizes need padding (see the sketch below). We decided to implement a batching method for each item, so batching a record can be built from these methods. For a dataset, a `_batch_examples` method should be implemented, but in most cases you can choose one from `batching.py`.
That is it!
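
As a sketch of the padding idea above (names are illustrative), batching variable-length 1D arrays just pads each example to the batch maximum before stacking:

import numpy as np

def batch_1d(examples, pad_value=0.):
    # pad every example to the longest length, then stack
    max_len = max(x.shape[0] for x in examples)
    return np.stack([
        np.pad(x, (0, max_len - x.shape[0]),
               mode='constant', constant_values=pad_value)
        for x in examples
    ])

batch = batch_1d([np.arange(3), np.arange(5)])  # shape (2, 5)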

View File

@ -0,0 +1,13 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
import numpy as np
import pandas as pd

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
import pandas as pd
from ruamel.yaml import YAML
@ -11,23 +25,25 @@ from parakeet.data.dataset import Dataset
from parakeet.data.datacargo import DataCargo
from parakeet.data.batch import TextIDBatcher, WavBatcher
class VCTK(Dataset):
def __init__(self, root):
assert isinstance(root, (str, Path)), "root should be a string or Path object"
assert isinstance(root, (
str, Path)), "root should be a string or Path object"
self.root = root if isinstance(root, Path) else Path(root)
self.text_root = self.root.joinpath("txt")
self.wav_root = self.root.joinpath("wav48")
if not (self.root.joinpath("metadata.csv").exists() and
self.root.joinpath("speaker_indices.yaml").exists()):
self._prepare_metadata()
self.speaker_indices, self.metadata = self._load_metadata()
def _load_metadata(self):
yaml=YAML(typ='safe')
yaml = YAML(typ='safe')
speaker_indices = yaml.load(self.root.joinpath("speaker_indices.yaml"))
metadata = pd.read_csv(self.root.joinpath("metadata.csv"),
sep="|", quoting=3, header=1)
metadata = pd.read_csv(
self.root.joinpath("metadata.csv"), sep="|", quoting=3, header=1)
return speaker_indices, metadata
def _prepare_metadata(self):
@ -41,15 +57,19 @@ class VCTK(Dataset):
with io.open(str(text_file)) as f:
transcription = f.read().strip()
wav_file = text_file.with_suffix(".wav")
metadata.append((wav_file.name, speaker_folder.name, transcription))
metadata = pd.DataFrame.from_records(metadata,
columns=["wave_file", "speaker", "text"])
metadata.append(
(wav_file.name, speaker_folder.name, transcription))
metadata = pd.DataFrame.from_records(
metadata, columns=["wave_file", "speaker", "text"])
# save them
yaml=YAML(typ='safe')
yaml = YAML(typ='safe')
yaml.dump(speaker_to_index, self.root.joinpath("speaker_indices.yaml"))
metadata.to_csv(self.root.joinpath("metadata.csv"),
sep="|", quoting=3, index=False)
metadata.to_csv(
self.root.joinpath("metadata.csv"),
sep="|",
quoting=3,
index=False)
def _get_example(self, metadatum):
wave_file, speaker, text = metadatum
@ -77,5 +97,3 @@ class VCTK(Dataset):
speaker_batch = np.array(speaker_batch)
phoneme_batch = TextIDBatcher(pad_id=0)(phoneme_batch)
return wav_batch, speaker_batch, phoneme_batch

View File

@ -1,5 +1,4 @@
# coding: utf-8
"""Text processing frontend
All frontend modules should have the following functions:

View File

@ -32,6 +32,3 @@ def text_to_sequence(text, p=0.0):
from ..text import text_to_sequence
text = text_to_sequence(text, ["english_cleaners"])
return text

View File

@ -12,6 +12,3 @@ def text_to_sequence(text, p=0.0):
from ..text import text_to_sequence
text = text_to_sequence(text, ["basic_cleaners"])
return text

View File

@ -1,6 +1,5 @@
# coding: utf-8
import MeCab
import jaconv
from random import random
@ -30,9 +29,9 @@ def _yomi(mecab_result):
def _mix_pronunciation(tokens, yomis, p):
return "".join(
yomis[idx] if yomis[idx] is not None and random() < p else tokens[idx]
for idx in range(len(tokens)))
return "".join(yomis[idx]
if yomis[idx] is not None and random() < p else tokens[idx]
for idx in range(len(tokens)))
def mix_pronunciation(text, p):
@ -59,8 +58,7 @@ def normalize_delimitor(text):
def text_to_sequence(text, p=0.0):
for c in [" ", " ", "", "", "", "", "", "", "",
"", "", "(", ")"]:
for c in [" ", " ", "", "", "", "", "", "", "", "", "", "(", ")"]:
text = text.replace(c, "")
text = text.replace("!", "")
text = text.replace("?", "")

View File

@ -1,6 +1,5 @@
# coding: utf-8
from random import random
n_vocab = 0xffff
@ -13,5 +12,6 @@ _tagger = None
def text_to_sequence(text, p=0.0):
return [ord(c) for c in text] + [_eos] # EOS
def sequence_to_text(seq):
return "".join(chr(n) for n in seq)

View File

@ -1,8 +1,21 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
from . import cleaners
from .symbols import symbols
# Mappings from symbol to numeric ID and vice versa:
_symbol_to_id = {s: i for i, s in enumerate(symbols)}
_id_to_symbol = {i: s for i, s in enumerate(symbols)}
@ -32,7 +45,8 @@ def text_to_sequence(text, cleaner_names):
if not m:
sequence += _symbols_to_sequence(_clean_text(text, cleaner_names))
break
sequence += _symbols_to_sequence(_clean_text(m.group(1), cleaner_names))
sequence += _symbols_to_sequence(
_clean_text(m.group(1), cleaner_names))
sequence += _arpabet_to_sequence(m.group(2))
text = m.group(3)

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Cleaners are transformations that run over the input text at both training and eval time.
@ -14,31 +27,31 @@ import re
from unidecode import unidecode
from .numbers import normalize_numbers
# Regular expression matching whitespace:
_whitespace_re = re.compile(r'\s+')
# List of (regular expression, replacement) pairs for abbreviations:
_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [
('mrs', 'misess'),
('mr', 'mister'),
('dr', 'doctor'),
('st', 'saint'),
('co', 'company'),
('jr', 'junior'),
('maj', 'major'),
('gen', 'general'),
('drs', 'doctors'),
('rev', 'reverend'),
('lt', 'lieutenant'),
('hon', 'honorable'),
('sgt', 'sergeant'),
('capt', 'captain'),
('esq', 'esquire'),
('ltd', 'limited'),
('col', 'colonel'),
('ft', 'fort'),
]]
_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1])
for x in [
('mrs', 'misess'),
('mr', 'mister'),
('dr', 'doctor'),
('st', 'saint'),
('co', 'company'),
('jr', 'junior'),
('maj', 'major'),
('gen', 'general'),
('drs', 'doctors'),
('rev', 'reverend'),
('lt', 'lieutenant'),
('hon', 'honorable'),
('sgt', 'sergeant'),
('capt', 'captain'),
('esq', 'esquire'),
('ltd', 'limited'),
('col', 'colonel'),
('ft', 'fort'),
]]
def expand_abbreviations(text):

View File

@ -1,14 +1,28 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
valid_symbols = [
'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1', 'AH2',
'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0', 'AY1', 'AY2',
'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0', 'ER1', 'ER2', 'EY',
'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0', 'IH1', 'IH2', 'IY', 'IY0', 'IY1',
'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OW0', 'OW1', 'OW2', 'OY', 'OY0',
'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW',
'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH'
'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1',
'AH2', 'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0',
'AY1', 'AY2', 'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0',
'ER1', 'ER2', 'EY', 'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0',
'IH1', 'IH2', 'IY', 'IY0', 'IY1', 'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG',
'OW', 'OW0', 'OW1', 'OW2', 'OY', 'OY0', 'OY1', 'OY2', 'P', 'R', 'S', 'SH',
'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W',
'Y', 'Z', 'ZH'
]
_valid_symbol_set = set(valid_symbols)
@ -24,7 +38,10 @@ class CMUDict:
else:
entries = _parse_cmudict(file_or_path)
if not keep_ambiguous:
entries = {word: pron for word, pron in entries.items() if len(pron) == 1}
entries = {
word: pron
for word, pron in entries.items() if len(pron) == 1
}
self._entries = entries
def __len__(self):

View File

@ -3,7 +3,6 @@
import inflect
import re
_inflect = inflect.engine()
_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
@ -56,7 +55,8 @@ def _expand_number(m):
elif num % 100 == 0:
return _inflect.number_to_words(num // 100) + ' hundred'
else:
return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
return _inflect.number_to_words(
num, andword='', zero='oh', group=2).replace(', ', ' ')
else:
return _inflect.number_to_words(num, andword='')

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'''
Defines the set of symbols used in text input to the model.

View File

@ -0,0 +1,13 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .net import *
from .parallel_wavenet import *

View File

@ -0,0 +1,169 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import itertools
import numpy as np
from scipy import signal
from tqdm import trange
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
import paddle.fluid.initializer as I
import paddle.fluid.layers.distributions as D
from parakeet.modules.weight_norm import Conv2DTranspose
from parakeet.models.wavenet import crop, WaveNet, UpsampleNet
from parakeet.models.clarinet.parallel_wavenet import ParallelWaveNet
from parakeet.models.clarinet.utils import conv2d
# Gaussian IAF model
class Clarinet(dg.Layer):
def __init__(self,
encoder,
teacher,
student,
stft,
min_log_scale=-6.0,
lmd=4.0):
super(Clarinet, self).__init__()
self.lmd = lmd
self.encoder = encoder
self.teacher = teacher
self.student = student
self.min_log_scale = min_log_scale
self.stft = stft
def forward(self, audio, mel, audio_start, clip_kl=True):
"""Compute loss for a distill model
Arguments:
audio {Variable} -- shape(batch_size, time_steps), target waveform.
mel {Variable} -- shape(batch_size, condition_dim, time_steps // hop_length), original mel spectrogram, not upsampled yet.
audio_starts {Variable} -- shape(batch_size, ), the index of the start sample.
clip_kl (bool) -- whether to clip kl divergence if it is greater than 10.0.
Returns:
Variable -- shape(1,), loss
"""
batch_size, audio_length = audio.shape # audio clip's length
z = F.gaussian_random(audio.shape)
condition = self.encoder(mel) # (B, C, T)
condition_slice = crop(condition, audio_start, audio_length)
x, s_means, s_scales = self.student(z, condition_slice) # all [0: T]
s_means = s_means[:, 1:] # (B, T-1), time steps [1: T]
s_scales = s_scales[:, 1:] # (B, T-1), time steps [1: T]
s_clipped_scales = F.clip(s_scales, self.min_log_scale, 100.)
# teacher outputs single gaussian
y = self.teacher(x[:, :-1], condition_slice[:, :, 1:])
_, t_means, t_scales = F.split(y, 3, -1) # time steps [1: T]
t_means = F.squeeze(t_means, [-1]) # (B, T-1), time steps [1: T]
t_scales = F.squeeze(t_scales, [-1]) # (B, T-1), time steps [1: T]
t_clipped_scales = F.clip(t_scales, self.min_log_scale, 100.)
s_distribution = D.Normal(s_means, F.exp(s_clipped_scales))
t_distribution = D.Normal(t_means, F.exp(t_clipped_scales))
# KL divergence between two Gaussians has a closed form, so no MC sampling is needed
kl = s_distribution.kl_divergence(t_distribution)
if clip_kl:
kl = F.clip(kl, -100., 10.)
# context size dropped
kl = F.reduce_mean(kl[:, self.teacher.context_size:])
# major diff here
regularization = F.mse_loss(t_scales[:, self.teacher.context_size:],
s_scales[:, self.teacher.context_size:])
# introduce information from real target
spectrogram_frame_loss = F.mse_loss(
self.stft.magnitude(audio), self.stft.magnitude(x))
loss = kl + self.lmd * regularization + spectrogram_frame_loss
loss_dict = {
"loss": loss,
"kl_divergence": kl,
"regularization": regularization,
"stft_loss": spectrogram_frame_loss
}
return loss_dict
@dg.no_grad
def synthesis(self, mel):
"""Synthesize waveform conditioned on the mel spectrogram.
Arguments:
mel {Variable} -- shape(batch_size, frequency_bands, frames)
Returns:
Variable -- shape(batch_size, frames * upsample_factor)
"""
condition = self.encoder(mel)
samples_shape = (condition.shape[0], condition.shape[-1])
z = F.gaussian_random(samples_shape)
x, s_means, s_scales = self.student(z, condition)
return x
class STFT(dg.Layer):
def __init__(self, n_fft, hop_length, win_length, window="hanning"):
super(STFT, self).__init__()
self.hop_length = hop_length
self.n_bin = 1 + n_fft // 2
self.n_fft = n_fft
# calculate window
window = signal.get_window(window, win_length)
if n_fft != win_length:
pad = (n_fft - win_length) // 2
window = np.pad(window, ((pad, pad), ), 'constant')
# calculate weights
r = np.arange(0, n_fft)
M = np.expand_dims(r, -1) * np.expand_dims(r, 0)
w_real = np.reshape(window *
np.cos(2 * np.pi * M / n_fft)[:self.n_bin],
(self.n_bin, 1, 1, self.n_fft)).astype("float32")
w_imag = np.reshape(window *
np.sin(-2 * np.pi * M / n_fft)[:self.n_bin],
(self.n_bin, 1, 1, self.n_fft)).astype("float32")
w = np.concatenate([w_real, w_imag], axis=0)
self.weight = dg.to_variable(w)
def forward(self, x):
# x(batch_size, time_steps)
# pad it first with reflect mode
pad_start = F.reverse(x[:, 1:1 + self.n_fft // 2], axis=1)
pad_stop = F.reverse(x[:, -(1 + self.n_fft // 2):-1], axis=1)
x = F.concat([pad_start, x, pad_stop], axis=-1)
# to BC1T, C=1
x = F.unsqueeze(x, axes=[1, 2])
out = conv2d(x, self.weight, stride=(1, self.hop_length))
real, imag = F.split(out, 2, dim=1) # BC1T
return real, imag
def power(self, x):
real, imag = self(x)
power = real**2 + imag**2
return power
def magnitude(self, x):
power = self.power(x)
magnitude = F.sqrt(power)
return magnitude
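
For reference, the KL term in the distill loss above has a closed form for two Gaussians, which is why no Monte Carlo sampling is needed:

$$\mathrm{KL}\big(\mathcal{N}(\mu_s,\sigma_s)\,\big\|\,\mathcal{N}(\mu_t,\sigma_t)\big) = \log\frac{\sigma_t}{\sigma_s} + \frac{\sigma_s^2 + (\mu_s-\mu_t)^2}{2\sigma_t^2} - \frac{1}{2}$$

With scales kept in the log domain as above, the first term is simply t_clipped_scales - s_clipped_scales, which is why both are clipped from below before exponentiation.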

View File

@ -0,0 +1,69 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import time
import itertools
import numpy as np
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
import paddle.fluid.initializer as I
import paddle.fluid.layers.distributions as D
from parakeet.modules.weight_norm import Linear, Conv1D, Conv1DCell, Conv2DTranspose
from parakeet.models.wavenet import WaveNet
class ParallelWaveNet(dg.Layer):
def __init__(self, n_loops, n_layers, residual_channels, condition_dim,
filter_size):
super(ParallelWaveNet, self).__init__()
self.flows = dg.LayerList()
for n_loop, n_layer in zip(n_loops, n_layers):
# teacher's log_scale_min does not matter here, -100 is a dummy value
self.flows.append(
WaveNet(n_loop, n_layer, residual_channels, 3, condition_dim,
filter_size, "mog", -100.0))
def forward(self, z, condition=None):
"""Inverse Autoregressive Flow. Several wavenets.
Arguments:
z {Variable} -- shape(batch_size, time_steps), hidden variable, sampled from a standard normal distribution.
Keyword Arguments:
condition {Variable} -- shape(batch_size, condition_dim, time_steps), condition, basically upsampled mel spectrogram. (default: {None})
Returns:
Variable -- shape(batch_size, time_steps), transformed z.
Variable -- shape(batch_size, time_steps), output distribution's mu.
Variable -- shape(batch_size, time_steps), output distribution's log_std.
"""
for i, flow in enumerate(self.flows):
theta = flow(z, condition) # w, mu, log_std [0: T]
w, mu, log_std = F.split(theta, 3, dim=-1) # (B, T, 1) for each
mu = F.squeeze(mu, [-1]) #[0: T]
log_std = F.squeeze(log_std, [-1]) #[0: T]
z = z * F.exp(log_std) + mu #[0: T]
if i == 0:
out_mu = mu
out_log_std = log_std
else:
out_mu = out_mu * F.exp(log_std) + mu
out_log_std += log_std
return z, out_mu, out_log_std
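
Writing out two successive flows makes the out_mu / out_log_std updates above explicit: composing affine transforms stays affine,

$$\big(z e^{s_1} + \mu_1\big) e^{s_2} + \mu_2 = z\, e^{s_1 + s_2} + \big(\mu_1 e^{s_2} + \mu_2\big),$$

so the composed log-scale accumulates additively (out_log_std += log_std) and the composed shift is rescaled then shifted (out_mu = out_mu * F.exp(log_std) + mu).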

View File

@ -0,0 +1,48 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddle import fluid
from paddle.fluid.core import ops
@fluid.framework.dygraph_only
def conv2d(input,
weight,
stride=(1, 1),
padding=((0, 0), (0, 0)),
dilation=(1, 1),
groups=1,
use_cudnn=True,
data_format="NCHW"):
padding = tuple(pad for pad_dim in padding for pad in pad_dim)
inputs = {
'Input': [input],
'Filter': [weight],
}
attrs = {
'strides': stride,
'paddings': padding,
'dilations': dilation,
'groups': groups,
'use_cudnn': use_cudnn,
'use_mkldnn': False,
'fuse_relu_before_depthwise_conv': False,
"padding_algorithm": "EXPLICIT",
"data_format": data_format,
}
outputs = ops.conv2d(inputs, attrs)
out = outputs["Output"][0]
return out
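
A minimal usage sketch of this raw-op wrapper (the shapes are illustrative); STFT.forward above calls it the same way, with stride=(1, hop_length):

import numpy as np
import paddle.fluid.dygraph as dg

with dg.guard():
    x = dg.to_variable(np.random.randn(4, 1, 1, 1024).astype("float32"))  # NCHW input
    w = dg.to_variable(np.random.randn(8, 1, 1, 256).astype("float32"))   # (out_c, in_c, kh, kw)
    y = conv2d(x, w, stride=(1, 64))  # -> (4, 8, 1, 13): (1024 - 256) // 64 + 1 = 13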

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from parakeet.models.deepvoice3.encoder import Encoder, ConvSpec
from parakeet.models.deepvoice3.decoder import Decoder, WindowRange
from parakeet.models.deepvoice3.converter import Converter

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from collections import namedtuple
from paddle import fluid
@ -19,23 +33,19 @@ class Attention(dg.Layer):
value_projection=True):
super(Attention, self).__init__()
std = np.sqrt(1 / query_dim)
self.query_proj = Linear(query_dim,
embed_dim,
param_attr=I.Normal(scale=std))
self.query_proj = Linear(
query_dim, embed_dim, param_attr=I.Normal(scale=std))
if key_projection:
std = np.sqrt(1 / embed_dim)
self.key_proj = Linear(embed_dim,
embed_dim,
param_attr=I.Normal(scale=std))
self.key_proj = Linear(
embed_dim, embed_dim, param_attr=I.Normal(scale=std))
if value_projection:
std = np.sqrt(1 / embed_dim)
self.value_proj = Linear(embed_dim,
embed_dim,
param_attr=I.Normal(scale=std))
self.value_proj = Linear(
embed_dim, embed_dim, param_attr=I.Normal(scale=std))
std = np.sqrt(1 / embed_dim)
self.out_proj = Linear(embed_dim,
query_dim,
param_attr=I.Normal(scale=std))
self.out_proj = Linear(
embed_dim, query_dim, param_attr=I.Normal(scale=std))
self.key_projection = key_projection
self.value_projection = value_projection
@ -102,9 +112,8 @@ class Attention(dg.Layer):
x = F.softmax(x)
attn_scores = x
x = F.dropout(x,
self.dropout,
dropout_implementation="upscale_in_train")
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
x = F.matmul(x, values)
encoder_length = keys.shape[1]
# CAUTION: is it wrong? let it be now

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from paddle import fluid
@ -15,6 +29,7 @@ class Conv1DGLU(dg.Layer):
has residual connection from the input x, and scale the output by
np.sqrt(0.5).
"""
def __init__(self,
n_speakers,
speaker_dim,
@ -50,20 +65,20 @@ class Conv1DGLU(dg.Layer):
), "this block uses residual connection"\
"the input_channes should equals num_filters"
std = np.sqrt(std_mul * (1 - dropout) / (filter_size * in_channels))
self.conv = Conv1DCell(in_channels,
2 * num_filters,
filter_size,
dilation,
causal,
param_attr=I.Normal(scale=std))
self.conv = Conv1DCell(
in_channels,
2 * num_filters,
filter_size,
dilation,
causal,
param_attr=I.Normal(scale=std))
if n_speakers > 1:
assert (speaker_dim is not None
), "speaker embed should not be null in multi-speaker case"
std = np.sqrt(1 / speaker_dim)
self.fc = Linear(speaker_dim,
num_filters,
param_attr=I.Normal(scale=std))
self.fc = Linear(
speaker_dim, num_filters, param_attr=I.Normal(scale=std))
def forward(self, x, speaker_embed=None):
"""
@ -82,9 +97,8 @@ class Conv1DGLU(dg.Layer):
C_out means the output channels of Conv1DGLU.
"""
residual = x
x = F.dropout(x,
self.dropout,
dropout_implementation="upscale_in_train")
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
x = self.conv(x)
content, gate = F.split(x, num_or_sections=2, dim=1)
@ -118,9 +132,8 @@ class Conv1DGLU(dg.Layer):
C_out means the output channels of Conv1DGLU.
"""
residual = x_t
x_t = F.dropout(x_t,
self.dropout,
dropout_implementation="upscale_in_train")
x_t = F.dropout(
x_t, self.dropout, dropout_implementation="upscale_in_train")
x_t = self.conv.add_input(x_t)
content_t, gate_t = F.split(x_t, num_or_sections=2, dim=1)

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from itertools import chain
@ -19,44 +33,45 @@ def upsampling_4x_blocks(n_speakers, speaker_dim, target_channels, dropout):
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(1 / (2 * target_channels)))),
Conv1DGLU(n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout),
Conv1DGLU(n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout),
Conv1DTranspose(
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(4. / (2 * target_channels)))),
Conv1DGLU(n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout),
Conv1DGLU(n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout)
3,
dilation=1,
std_mul=1.,
dropout=dropout), Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout), Conv1DTranspose(
target_channels,
target_channels,
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(
4. / (2 * target_channels)))), Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout), Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout)
]
return upsampling_convolutions
@ -69,36 +84,38 @@ def upsampling_2x_blocks(n_speakers, speaker_dim, target_channels, dropout):
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(1. / (2 * target_channels)))),
Conv1DGLU(n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout),
Conv1DGLU(n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout)
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout), Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout)
]
return upsampling_convolutions
def upsampling_1x_blocks(n_speakers, speaker_dim, target_channels, dropout):
upsampling_convolutions = [
Conv1DGLU(n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout)
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout)
]
return upsampling_convolutions
@ -108,6 +125,7 @@ class Converter(dg.Layer):
Vocoder that transforms mel spectrogram (or ecoder hidden states)
to waveform.
"""
def __init__(self,
n_speakers,
speaker_dim,
@ -161,33 +179,36 @@ class Converter(dg.Layer):
std = np.sqrt(std_mul / in_channels)
# CAUTION: relu
self.convolutions.append(
Conv1D(in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
Conv1D(
in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
in_channels = out_channels
std_mul = 2.0
self.convolutions.append(
Conv1DGLU(n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation=dilation,
std_mul=std_mul,
dropout=dropout))
Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation=dilation,
std_mul=std_mul,
dropout=dropout))
in_channels = out_channels
std_mul = 4.0
# final conv proj, channel transformed to linear dim
std = np.sqrt(std_mul * (1 - dropout) / in_channels)
# CAUTION: sigmoid
self.last_conv_proj = Conv1D(in_channels,
linear_dim,
1,
act="sigmoid",
param_attr=I.Normal(scale=std))
self.last_conv_proj = Conv1D(
in_channels,
linear_dim,
1,
act="sigmoid",
param_attr=I.Normal(scale=std))
def forward(self, x, speaker_embed=None):
"""
@ -229,4 +250,4 @@ class Converter(dg.Layer):
out = self.last_conv_proj(x)
out = F.transpose(out, [0, 2, 1])
return out
return out
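
The 4x variant above chains two stride-2, kernel-2 transposed convolutions, each followed by Conv1DGLU blocks, and every transposed layer doubles the time axis. A minimal numpy sketch (independent of Paddle, not the Parakeet implementation) of that length arithmetic:

```python
import numpy as np

def conv1d_transpose(x, w, stride=2):
    """x: (C_in, T), w: (C_in, C_out, K); output length is (T - 1) * stride + K."""
    c_in, t = x.shape
    _, c_out, k = w.shape
    out = np.zeros((c_out, (t - 1) * stride + k))
    for i in range(t):
        # each input step writes a K-wide patch starting at i * stride
        out[:, i * stride:i * stride + k] += np.einsum("c,cok->ok", x[:, i], w)
    return out

x = np.random.randn(8, 10)           # 8 channels, 10 mel frames
w = np.random.randn(8, 8, 2) * 0.1   # kernel size 2, stride 2 => exactly 2x
y = conv1d_transpose(conv1d_transpose(x, w), w)
print(y.shape)                       # (8, 40): two stride-2 layers give 4x in time
```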

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
@ -80,25 +94,25 @@ def unfold_adjacent_frames(folded_frames, r):
class Decoder(dg.Layer):
def __init__(
self,
n_speakers,
speaker_dim,
embed_dim,
mel_dim,
r=1,
max_positions=512,
padding_idx=None, # remove it!
preattention=(ConvSpec(128, 5, 1), ) * 4,
convolutions=(ConvSpec(128, 5, 1), ) * 4,
attention=True,
dropout=0.0,
use_memory_mask=False,
force_monotonic_attention=False,
query_position_rate=1.0,
key_position_rate=1.0,
window_range=WindowRange(-1, 3),
key_projection=True,
value_projection=True):
self,
n_speakers,
speaker_dim,
embed_dim,
mel_dim,
r=1,
max_positions=512,
padding_idx=None, # remove it!
preattention=(ConvSpec(128, 5, 1), ) * 4,
convolutions=(ConvSpec(128, 5, 1), ) * 4,
attention=True,
dropout=0.0,
use_memory_mask=False,
force_monotonic_attention=False,
query_position_rate=1.0,
key_position_rate=1.0,
window_range=WindowRange(-1, 3),
key_projection=True,
value_projection=True):
super(Decoder, self).__init__()
self.dropout = dropout
@ -111,23 +125,17 @@ class Decoder(dg.Layer):
conv_channels = convolutions[0].out_channels
# only when padding_idx is 0 can we easily handle it
self.embed_keys_positions = PositionEmbedding(max_positions,
embed_dim,
padding_idx=0)
self.embed_query_positions = PositionEmbedding(max_positions,
conv_channels,
padding_idx=0)
self.embed_keys_positions = PositionEmbedding(
max_positions, embed_dim, padding_idx=0)
self.embed_query_positions = PositionEmbedding(
max_positions, conv_channels, padding_idx=0)
if n_speakers > 1:
std = np.sqrt((1 - dropout) / speaker_dim)
self.speaker_proj1 = Linear(speaker_dim,
1,
act="sigmoid",
param_attr=I.Normal(scale=std))
self.speaker_proj2 = Linear(speaker_dim,
1,
act="sigmoid",
param_attr=I.Normal(scale=std))
self.speaker_proj1 = Linear(
speaker_dim, 1, act="sigmoid", param_attr=I.Normal(scale=std))
self.speaker_proj2 = Linear(
speaker_dim, 1, act="sigmoid", param_attr=I.Normal(scale=std))
# prenet
self.prenet = dg.LayerList()
@ -138,24 +146,26 @@ class Decoder(dg.Layer):
# conv1d & relu
std = np.sqrt(std_mul / in_channels)
self.prenet.append(
Conv1D(in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
Conv1D(
in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
in_channels = out_channels
std_mul = 2.0
self.prenet.append(
Conv1DGLU(n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=True,
residual=True))
Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=True,
residual=True))
in_channels = out_channels
std_mul = 4.0
@ -184,16 +194,17 @@ class Decoder(dg.Layer):
assert (
in_channels == out_channels
), "the stack of convolution & attention does not change channels"
conv_layer = Conv1DGLU(n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=True,
residual=False)
conv_layer = Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=True,
residual=False)
attn_layer = Attention(
out_channels,
embed_dim,
@ -211,10 +222,8 @@ class Decoder(dg.Layer):
# 1 * 1 conv to transform channels
std = np.sqrt(std_mul * (1 - dropout) / in_channels)
self.last_conv = Conv1D(in_channels,
mel_dim * r,
1,
param_attr=I.Normal(scale=std))
self.last_conv = Conv1D(
in_channels, mel_dim * r, 1, param_attr=I.Normal(scale=std))
# mel (before sigmoid) to done hat
std = np.sqrt(1 / in_channels)
@ -308,9 +317,8 @@ class Decoder(dg.Layer):
# (B, C, T)
frames = F.transpose(frames, [0, 2, 1])
x = frames
x = F.dropout(x,
self.dropout,
dropout_implementation="upscale_in_train")
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
# Prenet
for layer in self.prenet:
if isinstance(layer, Conv1DGLU):
@ -408,14 +416,13 @@ class Decoder(dg.Layer):
test_inputs = fold_adjacent_frames(test_inputs, self.r)
test_inputs = F.transpose(test_inputs, [0, 2, 1])
initial_input = F.zeros((batch_size, self.mel_dim * self.r, 1),
dtype=keys.dtype)
initial_input = F.zeros(
(batch_size, self.mel_dim * self.r, 1), dtype=keys.dtype)
t = 0 # decoder time step
while True:
frame_pos = F.fill_constant((batch_size, 1),
value=t + 1,
dtype="int64")
frame_pos = F.fill_constant(
(batch_size, 1), value=t + 1, dtype="int64")
w = self.query_position_rate
if self.n_speakers > 1:
w = w * F.squeeze(self.speaker_proj2(speaker_embed), [-1])
@ -433,9 +440,8 @@ class Decoder(dg.Layer):
current_input = initial_input
x_t = current_input
x_t = F.dropout(x_t,
self.dropout,
dropout_implementation="upscale_in_train")
x_t = F.dropout(
x_t, self.dropout, dropout_implementation="upscale_in_train")
# Prenet
for layer in self.prenet:
@ -453,15 +459,15 @@ class Decoder(dg.Layer):
x_t = F.transpose(x_t, [0, 2, 1])
if frame_pos_embed is not None:
x_t += frame_pos_embed
x_t, attn_scores = attn(
x_t, (keys, values), mask,
last_attended[i] if test_inputs is None else None)
x_t, attn_scores = attn(x_t, (keys, values), mask,
last_attended[i]
if test_inputs is None else None)
x_t = F.transpose(x_t, [0, 2, 1])
step_attn_scores.append(attn_scores) #(B, T_dec=1, T_enc)
# update last attended when necessary
if self.force_monotonic_attention[i]:
last_attended[i] = np.argmax(attn_scores.numpy(),
axis=-1)[0][0]
last_attended[i] = np.argmax(
attn_scores.numpy(), axis=-1)[0][0]
x_t = F.scale(residual + x_t, np.sqrt(0.5))
if len(step_attn_scores):
# (B, 1, T_enc) again
@ -485,8 +491,8 @@ class Decoder(dg.Layer):
t += 1
if test_inputs is None:
if F.reduce_min(done_t).numpy(
)[0] > 0.5 and t > self.min_decoder_steps:
if F.reduce_min(done_t).numpy()[
0] > 0.5 and t > self.min_decoder_steps:
break
elif t > self.max_decoder_steps:
break
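
When force_monotonic_attention is enabled, each decoding step only lets the attention look inside a small window around last_attended (WindowRange(-1, 3) in the constructor) and then moves the window to the new argmax. The masking itself happens inside the Attention layer, which this hunk does not show; the sketch below is a hedged numpy illustration of the mechanism, not Parakeet's code:

```python
import numpy as np

def window_mask(T_enc, last_attended, backward=-1, ahead=3):
    """Additive mask: 0 inside [last_attended + backward, last_attended + ahead), -inf outside."""
    mask = np.full(T_enc, -np.inf)
    lo = max(0, last_attended + backward)
    hi = min(T_enc, last_attended + ahead)
    mask[lo:hi] = 0.0
    return mask

scores = np.random.randn(12) + window_mask(12, last_attended=4)
probs = np.exp(scores - scores.max())
probs /= probs.sum()
last_attended = int(np.argmax(probs))  # updated exactly as in the loop above
print(last_attended)                   # always lands inside the window around 4
```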

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from collections import namedtuple
@ -33,14 +47,16 @@ class Encoder(dg.Layer):
self.dropout = dropout
if n_speakers > 1:
std = np.sqrt((1 - dropout) / speaker_dim)
self.sp_proj1 = Linear(speaker_dim,
embed_dim,
act="softsign",
param_attr=I.Normal(scale=std))
self.sp_proj2 = Linear(speaker_dim,
embed_dim,
act="softsign",
param_attr=I.Normal(scale=std))
self.sp_proj1 = Linear(
speaker_dim,
embed_dim,
act="softsign",
param_attr=I.Normal(scale=std))
self.sp_proj2 = Linear(
speaker_dim,
embed_dim,
act="softsign",
param_attr=I.Normal(scale=std))
self.n_speakers = n_speakers
self.convolutions = dg.LayerList()
@ -51,31 +67,34 @@ class Encoder(dg.Layer):
if in_channels != out_channels:
std = np.sqrt(std_mul / in_channels)
self.convolutions.append(
Conv1D(in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
Conv1D(
in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
in_channels = out_channels
std_mul = 2.0
self.convolutions.append(
Conv1DGLU(n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=False,
residual=True))
Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=False,
residual=True))
in_channels = out_channels
std_mul = 4.0
std = np.sqrt(std_mul * (1 - dropout) / in_channels)
self.convolutions.append(
Conv1D(in_channels, embed_dim, 1, param_attr=I.Normal(scale=std)))
Conv1D(
in_channels, embed_dim, 1, param_attr=I.Normal(scale=std)))
def forward(self, x, speaker_embed=None):
"""
@ -96,9 +115,8 @@ class Encoder(dg.Layer):
representation for values.
"""
x = self.embed(x)
x = F.dropout(x,
self.dropout,
dropout_implementation="upscale_in_train")
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
x = F.transpose(x, [0, 2, 1])
if self.n_speakers > 1 and speaker_embed is not None:
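
The hunk cuts off before the speaker conditioning is applied. In the upstream deepvoice3_pytorch code this repo adapts, the softsign-activated projections above produce a per-utterance bias that is added to the text features; the numpy sketch below illustrates that step under this assumption (softsign, W, and the broadcast are illustrative, not the exact Parakeet forward):

```python
import numpy as np

def softsign(x):
    # act="softsign" on the Linear layers above: x / (1 + |x|), bounded in (-1, 1)
    return x / (1.0 + np.abs(x))

x = np.random.randn(2, 16, 7)          # (B, C, T) text features after transpose
speaker_embed = np.random.randn(2, 4)  # (B, speaker_dim)
W = np.random.randn(4, 16) * 0.1       # stands in for sp_proj1's weight
# one bounded speaker bias per utterance, broadcast over every time step
x = x + softsign(speaker_embed @ W)[:, :, None]
print(x.shape)                         # (2, 16, 7)
```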

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from numba import jit
@ -31,9 +45,7 @@ def guided_attention(N, max_N, T, max_T, g):
return W
def guided_attentions(encoder_lengths,
decoder_lengths,
max_decoder_len,
def guided_attentions(encoder_lengths, decoder_lengths, max_decoder_len,
g=0.2):
B = len(encoder_lengths)
max_input_len = encoder_lengths.max()
@ -93,9 +105,8 @@ class TTSLoss(object):
def binary_divergence(self, prediction, target, mask):
flattened_prediction = F.reshape(prediction, [-1, 1])
flattened_target = F.reshape(target, [-1, 1])
flattened_loss = F.log_loss(flattened_prediction,
flattened_target,
epsilon=1e-8)
flattened_loss = F.log_loss(
flattened_prediction, flattened_target, epsilon=1e-8)
bin_div = fluid.layers.reshape(flattened_loss, prediction.shape)
w = self.masked_weight
@ -163,23 +174,20 @@ class TTSLoss(object):
max_mel_steps = max_frames // self.downsample_factor
max_decoder_steps = max_mel_steps // self.r
decoder_mask = F.sequence_mask(n_frames // self.downsample_factor //
self.r,
max_decoder_steps,
dtype="float32")
mel_mask = F.sequence_mask(n_frames // self.downsample_factor,
max_mel_steps,
dtype="float32")
decoder_mask = F.sequence_mask(
n_frames // self.downsample_factor // self.r,
max_decoder_steps,
dtype="float32")
mel_mask = F.sequence_mask(
n_frames // self.downsample_factor, max_mel_steps, dtype="float32")
lin_mask = F.sequence_mask(n_frames, max_frames, dtype="float32")
if compute_lin_loss:
lin_hyp = lin_hyp[:, :-self.time_shift, :]
lin_ref = lin_ref[:, self.time_shift:, :]
lin_mask = lin_mask[:, self.time_shift:, :]
lin_l1_loss = self.l1_loss(lin_hyp,
lin_ref,
lin_mask,
priority_bin=self.priority_bin)
lin_l1_loss = self.l1_loss(
lin_hyp, lin_ref, lin_mask, priority_bin=self.priority_bin)
lin_bce_loss = self.binary_divergence(lin_hyp, lin_ref, lin_mask)
lin_loss = self.binary_divergence_weight * lin_bce_loss \
+ (1 - self.binary_divergence_weight) * lin_l1_loss
@ -197,9 +205,10 @@ class TTSLoss(object):
total_loss += mel_loss
if compute_attn_loss:
attn_loss = self.attention_loss(
attn_hyp, input_lengths.numpy(),
n_frames.numpy() // (self.downsample_factor * self.r))
attn_loss = self.attention_loss(attn_hyp,
input_lengths.numpy(),
n_frames.numpy() //
(self.downsample_factor * self.r))
total_loss += attn_loss
if compute_done_loss:
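
The body of guided_attention is elided by the hunk above. Its signature and the g=0.2 default match the guided-attention penalty of Tachibana et al. (2017), where the weight is near zero on the diagonal n/N = t/T and approaches one away from it; a hedged numpy sketch of that standard form (an assumption, since the diff does not show the body):

```python
import numpy as np

def guided_attention(N, max_N, T, max_T, g=0.2):
    # penalty is ~0 on the diagonal n/N == t/T and saturates to 1 off it
    W = np.zeros((max_N, max_T), dtype=np.float32)
    for n in range(N):
        for t in range(T):
            W[n, t] = 1.0 - np.exp(-((n / N - t / T) ** 2) / (2 * g * g))
    return W

W = guided_attention(N=5, max_N=8, T=20, max_T=32)
print(W[0, 0], W[0, 19])  # ~0.0 near the diagonal, ~1.0 far from it
```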

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle.fluid.layers as F
@ -29,9 +43,9 @@ class DeepVoice3(dg.Layer):
mel_outputs, alignments, done, decoder_states = self.decoder(
(keys, values), valid_lengths, mel_inputs, text_positions,
frame_positions, speaker_embed)
linear_outputs = self.converter(
decoder_states if self.use_decoder_states else mel_outputs,
speaker_embed)
linear_outputs = self.converter(decoder_states
if self.use_decoder_states else
mel_outputs, speaker_embed)
return mel_outputs, linear_outputs, alignments, done
def transduce(self, text_sequences, text_positions, speaker_indices=None):
@ -43,7 +57,7 @@ class DeepVoice3(dg.Layer):
keys, values = self.encoder(text_sequences, speaker_embed)
mel_outputs, alignments, done, decoder_states = self.decoder.decode(
(keys, values), text_positions, speaker_embed)
linear_outputs = self.converter(
decoder_states if self.use_decoder_states else mel_outputs,
speaker_embed)
linear_outputs = self.converter(decoder_states
if self.use_decoder_states else
mel_outputs, speaker_embed)
return mel_outputs, linear_outputs, alignments, done

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from paddle import fluid
import paddle.fluid.layers as F
@ -95,10 +109,11 @@ class PositionEmbedding(dg.Layer):
speaker_position_rate) # (B, V, C)
# make indices for gather_nd
batch_id = F.expand(
F.unsqueeze(F.range(0, batch_size, 1, dtype="int64"), [1]),
[1, time_steps])
F.unsqueeze(
F.range(
0, batch_size, 1, dtype="int64"), [1]), [1, time_steps])
# (B, T, 2)
gather_nd_id = F.stack([batch_id, indices], -1)
out = F.gather_nd(weight, gather_nd_id)
return out
return out
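
As a sanity check on the indexing built above: batch_id supplies the batch coordinate and indices the position coordinate, so gather_nd picks weight[b, indices[b, t]] for every (b, t). A numpy analogue of the F.stack / F.gather_nd pair:

```python
import numpy as np

B, V, C, T = 2, 6, 4, 3
weight = np.random.randn(B, V, C)          # per-batch scaled position tables
indices = np.random.randint(0, V, (B, T))  # a position index for every time step

batch_id = np.broadcast_to(np.arange(B)[:, None], (B, T))
gather_nd_id = np.stack([batch_id, indices], axis=-1)      # (B, T, 2)

out = weight[gather_nd_id[..., 0], gather_nd_id[..., 1]]   # (B, T, C)
assert out.shape == (B, T, C)
```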

View File

@ -0,0 +1,13 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -1,8 +1,22 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
from parakeet.models.transformer_tts.utils import *
from parakeet.models.fastspeech.fft_block import FFTBlock
class Decoder(dg.Layer):
def __init__(self,
len_max_seq,
@ -19,16 +33,29 @@ class Decoder(dg.Layer):
n_position = len_max_seq + 1
self.n_head = n_head
self.pos_inp = get_sinusoid_encoding_table(n_position, d_model, padding_idx=0)
self.position_enc = dg.Embedding(size=[n_position, d_model],
padding_idx=0,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.NumpyArrayInitializer(self.pos_inp),
trainable=False))
self.layer_stack = [FFTBlock(d_model, d_inner, n_head, d_k, d_v, fft_conv1d_kernel, fft_conv1d_padding, dropout=dropout) for _ in range(n_layers)]
self.pos_inp = get_sinusoid_encoding_table(
n_position, d_model, padding_idx=0)
self.position_enc = dg.Embedding(
size=[n_position, d_model],
padding_idx=0,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.NumpyArrayInitializer(
self.pos_inp),
trainable=False))
self.layer_stack = [
FFTBlock(
d_model,
d_inner,
n_head,
d_k,
d_v,
fft_conv1d_kernel,
fft_conv1d_padding,
dropout=dropout) for _ in range(n_layers)
]
for i, layer in enumerate(self.layer_stack):
self.add_sublayer('fft_{}'.format(i), layer)
def forward(self, enc_seq, enc_pos, non_pad_mask, slf_attn_mask=None):
"""
Decoder layer of FastSpeech.
@ -55,4 +82,4 @@ class Decoder(dg.Layer):
slf_attn_mask=slf_attn_mask)
dec_slf_attn_list += [dec_slf_attn]
return dec_output, dec_slf_attn_list
return dec_output, dec_slf_attn_list
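
get_sinusoid_encoding_table lives in parakeet.models.transformer_tts.utils and is not part of this diff. Assuming it follows the standard Transformer recipe (sine on even channels, cosine on odd, a zeroed padding row), which is what the frozen Embedding above consumes, a minimal numpy sketch looks like:

```python
import numpy as np

def sinusoid_table(n_position, d_model, padding_idx=None):
    pos = np.arange(n_position)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, 2 * (i // 2) / d_model)
    table = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    if padding_idx is not None:
        table[padding_idx] = 0.0  # keep the padding position at zero
    return table.astype("float32")

print(sinusoid_table(5, 4, padding_idx=0))
```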

View File

@ -1,8 +1,22 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
from parakeet.models.transformer_tts.utils import *
from parakeet.models.fastspeech.fft_block import FFTBlock
class Encoder(dg.Layer):
def __init__(self,
n_src_vocab,
@ -20,14 +34,30 @@ class Encoder(dg.Layer):
n_position = len_max_seq + 1
self.n_head = n_head
self.src_word_emb = dg.Embedding(size=[n_src_vocab, d_model], padding_idx=0,
param_attr=fluid.initializer.Normal(loc=0.0, scale=1.0))
self.pos_inp = get_sinusoid_encoding_table(n_position, d_model, padding_idx=0)
self.position_enc = dg.Embedding(size=[n_position, d_model],
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.NumpyArrayInitializer(self.pos_inp),
trainable=False))
self.layer_stack = [FFTBlock(d_model, d_inner, n_head, d_k, d_v, fft_conv1d_kernel, fft_conv1d_padding, dropout=dropout) for _ in range(n_layers)]
self.src_word_emb = dg.Embedding(
size=[n_src_vocab, d_model],
padding_idx=0,
param_attr=fluid.initializer.Normal(
loc=0.0, scale=1.0))
self.pos_inp = get_sinusoid_encoding_table(
n_position, d_model, padding_idx=0)
self.position_enc = dg.Embedding(
size=[n_position, d_model],
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.NumpyArrayInitializer(
self.pos_inp),
trainable=False))
self.layer_stack = [
FFTBlock(
d_model,
d_inner,
n_head,
d_k,
d_v,
fft_conv1d_kernel,
fft_conv1d_padding,
dropout=dropout) for _ in range(n_layers)
]
for i, layer in enumerate(self.layer_stack):
self.add_sublayer('fft_{}'.format(i), layer)
@ -50,7 +80,8 @@ class Encoder(dg.Layer):
slf_attn_mask = layers.expand(slf_attn_mask, [self.n_head, 1, 1])
# -- Forward
enc_output = self.src_word_emb(character) + self.position_enc(text_pos) #(N, T, C)
enc_output = self.src_word_emb(character) + self.position_enc(
text_pos) #(N, T, C)
for enc_layer in self.layer_stack:
enc_output, enc_slf_attn = enc_layer(
@ -58,5 +89,5 @@ class Encoder(dg.Layer):
non_pad_mask=non_pad_mask,
slf_attn_mask=slf_attn_mask)
enc_slf_attn_list += [enc_slf_attn]
return enc_output, enc_slf_attn_list
return enc_output, enc_slf_attn_list

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import numpy as np
import paddle.fluid.dygraph as dg
@ -9,56 +22,71 @@ from parakeet.models.fastspeech.length_regulator import LengthRegulator
from parakeet.models.fastspeech.encoder import Encoder
from parakeet.models.fastspeech.decoder import Decoder
class FastSpeech(dg.Layer):
def __init__(self, cfg):
" FastSpeech"
super(FastSpeech, self).__init__()
self.encoder = Encoder(n_src_vocab=len(symbols)+1,
len_max_seq=cfg['max_seq_len'],
n_layers=cfg['encoder_n_layer'],
n_head=cfg['encoder_head'],
d_k=cfg['fs_hidden_size'] // cfg['encoder_head'],
d_v=cfg['fs_hidden_size'] // cfg['encoder_head'],
d_model=cfg['fs_hidden_size'],
d_inner=cfg['encoder_conv1d_filter_size'],
fft_conv1d_kernel=cfg['fft_conv1d_filter'],
fft_conv1d_padding=cfg['fft_conv1d_padding'],
dropout=0.1)
self.length_regulator = LengthRegulator(input_size=cfg['fs_hidden_size'],
out_channels=cfg['duration_predictor_output_size'],
filter_size=cfg['duration_predictor_filter_size'],
dropout=cfg['dropout'])
self.decoder = Decoder(len_max_seq=cfg['max_seq_len'],
n_layers=cfg['decoder_n_layer'],
n_head=cfg['decoder_head'],
d_k=cfg['fs_hidden_size'] // cfg['decoder_head'],
d_v=cfg['fs_hidden_size'] // cfg['decoder_head'],
d_model=cfg['fs_hidden_size'],
d_inner=cfg['decoder_conv1d_filter_size'],
fft_conv1d_kernel=cfg['fft_conv1d_filter'],
fft_conv1d_padding=cfg['fft_conv1d_padding'],
dropout=0.1)
self.weight = fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer())
self.encoder = Encoder(
n_src_vocab=len(symbols) + 1,
len_max_seq=cfg['max_seq_len'],
n_layers=cfg['encoder_n_layer'],
n_head=cfg['encoder_head'],
d_k=cfg['fs_hidden_size'] // cfg['encoder_head'],
d_v=cfg['fs_hidden_size'] // cfg['encoder_head'],
d_model=cfg['fs_hidden_size'],
d_inner=cfg['encoder_conv1d_filter_size'],
fft_conv1d_kernel=cfg['fft_conv1d_filter'],
fft_conv1d_padding=cfg['fft_conv1d_padding'],
dropout=0.1)
self.length_regulator = LengthRegulator(
input_size=cfg['fs_hidden_size'],
out_channels=cfg['duration_predictor_output_size'],
filter_size=cfg['duration_predictor_filter_size'],
dropout=cfg['dropout'])
self.decoder = Decoder(
len_max_seq=cfg['max_seq_len'],
n_layers=cfg['decoder_n_layer'],
n_head=cfg['decoder_head'],
d_k=cfg['fs_hidden_size'] // cfg['decoder_head'],
d_v=cfg['fs_hidden_size'] // cfg['decoder_head'],
d_model=cfg['fs_hidden_size'],
d_inner=cfg['decoder_conv1d_filter_size'],
fft_conv1d_kernel=cfg['fft_conv1d_filter'],
fft_conv1d_padding=cfg['fft_conv1d_padding'],
dropout=0.1)
self.weight = fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer())
k = math.sqrt(1 / cfg['fs_hidden_size'])
self.bias = fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k))
self.mel_linear = dg.Linear(cfg['fs_hidden_size'],
cfg['audio']['num_mels']* cfg['audio']['outputs_per_step'],
param_attr = self.weight,
bias_attr = self.bias,)
self.postnet = PostConvNet(n_mels=cfg['audio']['num_mels'],
num_hidden=512,
filter_size=5,
padding=int(5 / 2),
num_conv=5,
outputs_per_step=cfg['audio']['outputs_per_step'],
use_cudnn=True,
dropout=0.1,
batchnorm_last=True)
self.bias = fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k))
self.mel_linear = dg.Linear(
cfg['fs_hidden_size'],
cfg['audio']['num_mels'] * cfg['audio']['outputs_per_step'],
param_attr=self.weight,
bias_attr=self.bias, )
self.postnet = PostConvNet(
n_mels=cfg['audio']['num_mels'],
num_hidden=512,
filter_size=5,
padding=int(5 / 2),
num_conv=5,
outputs_per_step=cfg['audio']['outputs_per_step'],
use_cudnn=True,
dropout=0.1,
batchnorm_last=True)
def forward(self, character, text_pos, enc_non_pad_mask, dec_non_pad_mask,
enc_slf_attn_mask=None, dec_slf_attn_mask=None,
mel_pos=None, length_target=None, alpha=1.0):
def forward(self,
character,
text_pos,
enc_non_pad_mask,
dec_non_pad_mask,
enc_slf_attn_mask=None,
dec_slf_attn_mask=None,
mel_pos=None,
length_target=None,
alpha=1.0):
"""
FastSpeech model.
@ -84,29 +112,41 @@ class FastSpeech(dg.Layer):
dec_slf_attn_list (Variable), Shape(B, mel_T, mel_T), the decoder self attention list.
"""
encoder_output, enc_slf_attn_list = self.encoder(character, text_pos, enc_non_pad_mask, slf_attn_mask=enc_slf_attn_mask)
encoder_output, enc_slf_attn_list = self.encoder(
character,
text_pos,
enc_non_pad_mask,
slf_attn_mask=enc_slf_attn_mask)
if fluid.framework._dygraph_tracer()._train_mode:
length_regulator_output, duration_predictor_output = self.length_regulator(encoder_output,
target=length_target,
alpha=alpha)
decoder_output, dec_slf_attn_list = self.decoder(length_regulator_output, mel_pos,
dec_non_pad_mask,
slf_attn_mask=dec_slf_attn_mask)
length_regulator_output, duration_predictor_output = self.length_regulator(
encoder_output, target=length_target, alpha=alpha)
decoder_output, dec_slf_attn_list = self.decoder(
length_regulator_output,
mel_pos,
dec_non_pad_mask,
slf_attn_mask=dec_slf_attn_mask)
mel_output = self.mel_linear(decoder_output)
mel_output_postnet = self.postnet(mel_output) + mel_output
return mel_output, mel_output_postnet, duration_predictor_output, enc_slf_attn_list, dec_slf_attn_list
else:
length_regulator_output, decoder_pos = self.length_regulator(encoder_output, alpha=alpha)
slf_attn_mask = get_triu_tensor(decoder_pos.numpy(), decoder_pos.numpy()).astype(np.float32)
slf_attn_mask = fluid.layers.cast(dg.to_variable(slf_attn_mask == 0), np.float32)
length_regulator_output, decoder_pos = self.length_regulator(
encoder_output, alpha=alpha)
slf_attn_mask = get_triu_tensor(
decoder_pos.numpy(), decoder_pos.numpy()).astype(np.float32)
slf_attn_mask = fluid.layers.cast(
dg.to_variable(slf_attn_mask == 0), np.float32)
slf_attn_mask = dg.to_variable(slf_attn_mask)
dec_non_pad_mask = fluid.layers.unsqueeze((decoder_pos != 0).astype(np.float32), [-1])
decoder_output, _ = self.decoder(length_regulator_output, decoder_pos, dec_non_pad_mask,
slf_attn_mask=slf_attn_mask)
dec_non_pad_mask = fluid.layers.unsqueeze(
(decoder_pos != 0).astype(np.float32), [-1])
decoder_output, _ = self.decoder(
length_regulator_output,
decoder_pos,
dec_non_pad_mask,
slf_attn_mask=slf_attn_mask)
mel_output = self.mel_linear(decoder_output)
mel_output_postnet = self.postnet(mel_output) + mel_output
return mel_output, mel_output_postnet
return mel_output, mel_output_postnet
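
In the inference branch above, the decoder has no ground-truth mel positions, so it builds its own causal mask from get_triu_tensor (defined in transformer_tts.utils, not shown in this diff). Assuming that helper returns the strict upper triangle over the position grid, the `mask == 0` cast reduces to this numpy sketch:

```python
import numpy as np

T = 4                                        # predicted decoder length
triu = np.triu(np.ones((T, T)), k=1)         # 1 strictly above the diagonal
slf_attn_mask = (triu == 0).astype("float32")
print(slf_attn_mask)
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]  -- each step may attend only to itself and earlier steps
```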

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import math
import paddle.fluid.dygraph as dg
@ -6,11 +19,32 @@ import paddle.fluid as fluid
from parakeet.modules.multihead_attention import MultiheadAttention
from parakeet.modules.ffn import PositionwiseFeedForward
class FFTBlock(dg.Layer):
def __init__(self, d_model, d_inner, n_head, d_k, d_v, filter_size, padding, dropout=0.2):
def __init__(self,
d_model,
d_inner,
n_head,
d_k,
d_v,
filter_size,
padding,
dropout=0.2):
super(FFTBlock, self).__init__()
self.slf_attn = MultiheadAttention(d_model, d_k, d_v, num_head=n_head, is_bias=True, dropout=dropout, is_concat=False)
self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, filter_size =filter_size, padding =padding, dropout=dropout)
self.slf_attn = MultiheadAttention(
d_model,
d_k,
d_v,
num_head=n_head,
is_bias=True,
dropout=dropout,
is_concat=False)
self.pos_ffn = PositionwiseFeedForward(
d_model,
d_inner,
filter_size=filter_size,
padding=padding,
dropout=dropout)
def forward(self, enc_input, non_pad_mask, slf_attn_mask=None):
"""
@ -27,11 +61,12 @@ class FFTBlock(dg.Layer):
output (Variable), Shape(B, T, C), the output after self-attention & ffn.
slf_attn (Variable), Shape(B * n_head, T, T), the self attention.
"""
output, slf_attn = self.slf_attn(enc_input, enc_input, enc_input, mask=slf_attn_mask)
output, slf_attn = self.slf_attn(
enc_input, enc_input, enc_input, mask=slf_attn_mask)
output *= non_pad_mask
output = self.pos_ffn(output)
output *= non_pad_mask
return output, slf_attn
return output, slf_attn
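
The two `output *= non_pad_mask` lines are what keep padded time steps at exactly zero after attention and the position-wise FFN; the (B, T, 1) mask broadcasts over channels. A tiny numpy illustration:

```python
import numpy as np

output = np.random.randn(2, 4, 3)                     # (B, T, C)
non_pad_mask = np.array([[[1.], [1.], [0.], [0.]],
                         [[1.], [1.], [1.], [0.]]])   # (B, T, 1), 0 at padding
output *= non_pad_mask                                # padded steps become exactly 0
print(output[0, 2:])                                  # all zeros
```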

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import math
import parakeet.models.fastspeech.utils
@ -6,47 +19,50 @@ import paddle.fluid.layers as layers
import paddle.fluid as fluid
from parakeet.modules.customized import Conv1D
class LengthRegulator(dg.Layer):
def __init__(self, input_size, out_channels, filter_size, dropout=0.1):
super(LengthRegulator, self).__init__()
self.duration_predictor = DurationPredictor(input_size=input_size,
out_channels=out_channels,
filter_size=filter_size,
dropout=dropout)
self.duration_predictor = DurationPredictor(
input_size=input_size,
out_channels=out_channels,
filter_size=filter_size,
dropout=dropout)
def LR(self, x, duration_predictor_output, alpha=1.0):
output = []
batch_size = x.shape[0]
for i in range(batch_size):
output.append(self.expand(x[i:i+1], duration_predictor_output[i:i+1], alpha))
output.append(
self.expand(x[i:i + 1], duration_predictor_output[i:i + 1],
alpha))
output = self.pad(output)
return output
def pad(self, input_ele):
max_len = max([input_ele[i].shape[0] for i in range(len(input_ele))])
out_list = []
for i in range(len(input_ele)):
pad_len = max_len - input_ele[i].shape[0]
one_batch_padded = layers.pad(
input_ele[i], [0, pad_len, 0, 0], pad_value=0.0)
one_batch_padded = layers.pad(input_ele[i], [0, pad_len, 0, 0],
pad_value=0.0)
out_list.append(one_batch_padded)
out_padded = layers.stack(out_list)
return out_padded
def expand(self, batch, predicted, alpha):
out = []
time_steps = batch.shape[1]
fertilities = predicted.numpy()
batch = layers.squeeze(batch,[0])
batch = layers.squeeze(batch, [0])
for i in range(time_steps):
if fertilities[0,i]==0:
if fertilities[0, i] == 0:
continue
out.append(layers.expand(batch[i: i + 1, :], [int(fertilities[0,i]), 1]))
out.append(
layers.expand(batch[i:i + 1, :], [int(fertilities[0, i]), 1]))
out = layers.concat(out, axis=0)
return out
def forward(self, x, alpha=1.0, target=None):
"""
@ -70,10 +86,11 @@ class LengthRegulator(dg.Layer):
else:
duration_predictor_output = layers.round(duration_predictor_output)
output = self.LR(x, duration_predictor_output, alpha)
mel_pos = dg.to_variable(np.arange(1, output.shape[1]+1))
mel_pos = dg.to_variable(np.arange(1, output.shape[1] + 1))
mel_pos = layers.unsqueeze(mel_pos, [0])
return output, mel_pos
class DurationPredictor(dg.Layer):
def __init__(self, input_size, out_channels, filter_size, dropout=0.1):
super(DurationPredictor, self).__init__()
@ -83,30 +100,38 @@ class DurationPredictor(dg.Layer):
self.dropout = dropout
k = math.sqrt(1 / self.input_size)
self.conv1 = Conv1D(num_channels = self.input_size,
num_filters = self.out_channels,
filter_size = self.filter_size,
padding=1,
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)))
#data_format='NTC')
self.conv1 = Conv1D(
num_channels=self.input_size,
num_filters=self.out_channels,
filter_size=self.filter_size,
padding=1,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
#data_format='NTC')
k = math.sqrt(1 / self.out_channels)
self.conv2 = Conv1D(num_channels = self.out_channels,
num_filters = self.out_channels,
filter_size = self.filter_size,
padding=1,
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)))
#data_format='NTC')
self.conv2 = Conv1D(
num_channels=self.out_channels,
num_filters=self.out_channels,
filter_size=self.filter_size,
padding=1,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
#data_format='NTC')
self.layer_norm1 = dg.LayerNorm(self.out_channels)
self.layer_norm2 = dg.LayerNorm(self.out_channels)
self.weight = fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer())
self.weight = fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer())
k = math.sqrt(1 / self.out_channels)
self.bias = fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k))
self.bias = fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k))
self.linear =dg.Linear(self.out_channels, 1, param_attr = self.weight,
bias_attr = self.bias)
self.linear = dg.Linear(
self.out_channels, 1, param_attr=self.weight, bias_attr=self.bias)
def forward(self, encoder_output):
"""
@ -118,18 +143,21 @@ class DurationPredictor(dg.Layer):
out (Variable), Shape(B, T, C), the output of duration predictor.
"""
# encoder_output.shape(N, T, C)
out = layers.transpose(encoder_output, [0,2,1])
out = layers.transpose(encoder_output, [0, 2, 1])
out = self.conv1(out)
out = layers.transpose(out, [0,2,1])
out = layers.dropout(layers.relu(self.layer_norm1(out)), self.dropout, dropout_implementation='upscale_in_train')
out = layers.transpose(out, [0,2,1])
out = layers.transpose(out, [0, 2, 1])
out = layers.dropout(
layers.relu(self.layer_norm1(out)),
self.dropout,
dropout_implementation='upscale_in_train')
out = layers.transpose(out, [0, 2, 1])
out = self.conv2(out)
out = layers.transpose(out, [0,2,1])
out = layers.dropout(layers.relu(self.layer_norm2(out)), self.dropout, dropout_implementation='upscale_in_train')
out = layers.transpose(out, [0, 2, 1])
out = layers.dropout(
layers.relu(self.layer_norm2(out)),
self.dropout,
dropout_implementation='upscale_in_train')
out = layers.relu(self.linear(out))
out = layers.squeeze(out, axes=[-1])
return out
return out
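
Per batch item, the LR/expand pair above amounts to repeating each encoder frame by its rounded predicted duration, skipping zero-duration positions, then padding to the longest result. A compact numpy equivalent for a single item with alpha = 1:

```python
import numpy as np

def length_regulate(encoder_out, durations):
    """encoder_out: (T_text, C); durations: (T_text,) ints -> (sum(durations), C)."""
    return np.repeat(encoder_out, durations, axis=0)

h = np.arange(8, dtype="float32").reshape(4, 2)  # 4 phonemes, 2 channels
d = np.array([2, 0, 3, 1])                       # rounded predicted durations
print(length_regulate(h, d).shape)               # (6, 2); zero-duration rows vanish
```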

View File

@ -1,34 +1,47 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
def get_alignment(attn_probs, mel_lens, n_head):
max_F = 0
assert attn_probs[0].shape[0] % n_head == 0
batch_size = int(attn_probs[0].shape[0] // n_head)
#max_attn = attn_probs[0].numpy()[0,batch_size]
for i in range(len(attn_probs)):
multi_attn = attn_probs[i].numpy()
for j in range(n_head):
attn = multi_attn[j*batch_size:(j+1)*batch_size]
attn = multi_attn[j * batch_size:(j + 1) * batch_size]
F = score_F(attn)
if max_F < F:
max_F = F
max_attn = attn
alignment = compute_duration(max_attn, mel_lens)
return alignment, max_attn
def score_F(attn):
max = np.max(attn, axis=-1)
mean = np.mean(max)
return mean
def compute_duration(attn, mel_lens):
alignment = np.zeros([attn.shape[0],attn.shape[2]])
alignment = np.zeros([attn.shape[0], attn.shape[2]])
mel_lens = mel_lens.numpy()
for i in range(attn.shape[0]):
for j in range(mel_lens[i]):
max_index = np.argmax(attn[i,j])
alignment[i,max_index] += 1
max_index = np.argmax(attn[i, j])
alignment[i, max_index] += 1
return alignment
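
A toy run of compute_duration: every decoder frame up to mel_lens[i] votes for its argmax encoder position, and the per-position counts become the duration targets FastSpeech trains its predictor against.

```python
import numpy as np

# one head over 3 encoder positions, 5 decoder frames
attn = np.array([[0.9, 0.1, 0.0],
                 [0.8, 0.1, 0.1],
                 [0.1, 0.7, 0.2],
                 [0.0, 0.2, 0.8],
                 [0.1, 0.1, 0.8]])[None]  # (B=1, T_mel=5, T_text=3)

alignment = np.zeros((1, 3))
for j in range(5):                        # j < mel_lens[0]
    alignment[0, np.argmax(attn[0, j])] += 1
print(alignment)                          # [[2. 1. 2.]] mel frames per phoneme
```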

View File

@ -0,0 +1,13 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from parakeet.g2p.text.symbols import symbols
import paddle.fluid.dygraph as dg
@ -7,9 +20,16 @@ from parakeet.modules.customized import Pool1D, Conv1D
from parakeet.modules.dynamic_gru import DynamicGRU
import numpy as np
class CBHG(dg.Layer):
def __init__(self, hidden_size, batch_size, K=16, projection_size = 256, num_gru_layers=2,
max_pool_kernel_size=2, is_post=False):
def __init__(self,
hidden_size,
batch_size,
K=16,
projection_size=256,
num_gru_layers=2,
max_pool_kernel_size=2,
is_post=False):
super(CBHG, self).__init__()
"""
:param hidden_size: dimension of hidden unit
@ -24,28 +44,39 @@ class CBHG(dg.Layer):
self.projection_size = projection_size
self.conv_list = []
k = math.sqrt(1 / projection_size)
self.conv_list.append(Conv1D(num_channels = projection_size,
num_filters = hidden_size,
filter_size = 1,
padding = int(np.floor(1/2)),
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k))))
self.conv_list.append(
Conv1D(
num_channels=projection_size,
num_filters=hidden_size,
filter_size=1,
padding=int(np.floor(1 / 2)),
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-k, high=k))))
k = math.sqrt(1 / hidden_size)
for i in range(2,K+1):
self.conv_list.append(Conv1D(num_channels = hidden_size,
num_filters = hidden_size,
filter_size = i,
padding = int(np.floor(i/2)),
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k))))
for i in range(2, K + 1):
self.conv_list.append(
Conv1D(
num_channels=hidden_size,
num_filters=hidden_size,
filter_size=i,
padding=int(np.floor(i / 2)),
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-k, high=k))))
for i, layer in enumerate(self.conv_list):
self.add_sublayer("conv_list_{}".format(i), layer)
self.batchnorm_list = []
for i in range(K):
self.batchnorm_list.append(dg.BatchNorm(hidden_size,
data_layout='NCHW'))
self.batchnorm_list.append(
dg.BatchNorm(
hidden_size, data_layout='NCHW'))
for i, layer in enumerate(self.batchnorm_list):
self.add_sublayer("batchnorm_list_{}".format(i), layer)
@ -53,92 +84,120 @@ class CBHG(dg.Layer):
conv_outdim = hidden_size * K
k = math.sqrt(1 / conv_outdim)
self.conv_projection_1 = Conv1D(num_channels = conv_outdim,
num_filters = hidden_size,
filter_size = 3,
padding = int(np.floor(3/2)),
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)))
self.conv_projection_1 = Conv1D(
num_channels=conv_outdim,
num_filters=hidden_size,
filter_size=3,
padding=int(np.floor(3 / 2)),
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
k = math.sqrt(1 / hidden_size)
self.conv_projection_2 = Conv1D(num_channels = hidden_size,
num_filters = projection_size,
filter_size = 3,
padding = int(np.floor(3/2)),
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)))
self.conv_projection_2 = Conv1D(
num_channels=hidden_size,
num_filters=projection_size,
filter_size=3,
padding=int(np.floor(3 / 2)),
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
self.batchnorm_proj_1 = dg.BatchNorm(hidden_size,
data_layout='NCHW')
self.batchnorm_proj_2 = dg.BatchNorm(projection_size,
data_layout='NCHW')
self.max_pool = Pool1D(pool_size = max_pool_kernel_size,
pool_type='max',
pool_stride=1,
pool_padding=1,
data_format = "NCT")
self.batchnorm_proj_1 = dg.BatchNorm(hidden_size, data_layout='NCHW')
self.batchnorm_proj_2 = dg.BatchNorm(
projection_size, data_layout='NCHW')
self.max_pool = Pool1D(
pool_size=max_pool_kernel_size,
pool_type='max',
pool_stride=1,
pool_padding=1,
data_format="NCT")
self.highway = Highwaynet(self.projection_size)
h_0 = np.zeros((batch_size, hidden_size // 2), dtype="float32")
h_0 = dg.to_variable(h_0)
k = math.sqrt(1 / hidden_size)
self.fc_forward1 = dg.Linear(hidden_size, hidden_size // 2 * 3,
param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
self.fc_reverse1 = dg.Linear(hidden_size, hidden_size // 2 * 3,
param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
self.gru_forward1 = DynamicGRU(size = self.hidden_size // 2,
is_reverse = False,
origin_mode = True,
h_0 = h_0)
self.gru_reverse1 = DynamicGRU(size = self.hidden_size // 2,
is_reverse=True,
origin_mode=True,
h_0 = h_0)
self.fc_forward1 = dg.Linear(
hidden_size,
hidden_size // 2 * 3,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
self.fc_reverse1 = dg.Linear(
hidden_size,
hidden_size // 2 * 3,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
self.gru_forward1 = DynamicGRU(
size=self.hidden_size // 2,
is_reverse=False,
origin_mode=True,
h_0=h_0)
self.gru_reverse1 = DynamicGRU(
size=self.hidden_size // 2,
is_reverse=True,
origin_mode=True,
h_0=h_0)
self.fc_forward2 = dg.Linear(hidden_size, hidden_size // 2 * 3,
param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
self.fc_reverse2 = dg.Linear(hidden_size, hidden_size // 2 * 3,
param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
self.gru_forward2 = DynamicGRU(size = self.hidden_size // 2,
is_reverse = False,
origin_mode = True,
h_0 = h_0)
self.gru_reverse2 = DynamicGRU(size = self.hidden_size // 2,
is_reverse=True,
origin_mode=True,
h_0 = h_0)
self.fc_forward2 = dg.Linear(
hidden_size,
hidden_size // 2 * 3,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
self.fc_reverse2 = dg.Linear(
hidden_size,
hidden_size // 2 * 3,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
self.gru_forward2 = DynamicGRU(
size=self.hidden_size // 2,
is_reverse=False,
origin_mode=True,
h_0=h_0)
self.gru_reverse2 = DynamicGRU(
size=self.hidden_size // 2,
is_reverse=True,
origin_mode=True,
h_0=h_0)
def _conv_fit_dim(self, x, filter_size=3):
if filter_size % 2 == 0:
return x[:,:,:-1]
return x[:, :, :-1]
else:
return x
return x
def forward(self, input_):
# input_.shape = [N, C, T]
conv_list = []
conv_input = input_
for i, (conv, batchnorm) in enumerate(zip(self.conv_list, self.batchnorm_list)):
conv_input = self._conv_fit_dim(conv(conv_input), i+1)
for i, (conv, batchnorm
) in enumerate(zip(self.conv_list, self.batchnorm_list)):
conv_input = self._conv_fit_dim(conv(conv_input), i + 1)
conv_input = layers.relu(batchnorm(conv_input))
conv_list.append(conv_input)
conv_cat = layers.concat(conv_list, axis=1)
conv_pool = self.max_pool(conv_cat)[:,:,:-1]
conv_proj = layers.relu(self.batchnorm_proj_1(self._conv_fit_dim(self.conv_projection_1(conv_pool))))
conv_proj = self.batchnorm_proj_2(self._conv_fit_dim(self.conv_projection_2(conv_proj))) + input_
conv_pool = self.max_pool(conv_cat)[:, :, :-1]
conv_proj = layers.relu(
self.batchnorm_proj_1(
self._conv_fit_dim(self.conv_projection_1(conv_pool))))
conv_proj = self.batchnorm_proj_2(
self._conv_fit_dim(self.conv_projection_2(conv_proj))) + input_
# conv_proj.shape = [N, C, T]
highway = layers.transpose(conv_proj, [0,2,1])
highway = layers.transpose(conv_proj, [0, 2, 1])
highway = self.highway(highway)
# highway.shape = [N, T, C]
@ -152,9 +211,10 @@ class CBHG(dg.Layer):
out_forward = self.gru_forward2(fc_forward)
out_reverse = self.gru_reverse2(fc_reverse)
out = layers.concat([out_forward, out_reverse], axis=-1)
out = layers.transpose(out, [0,2,1])
out = layers.transpose(out, [0, 2, 1])
return out
class Highwaynet(dg.Layer):
def __init__(self, num_units, num_layers=4):
super(Highwaynet, self).__init__()
@ -165,14 +225,26 @@ class Highwaynet(dg.Layer):
self.linears = []
k = math.sqrt(1 / num_units)
for i in range(num_layers):
self.linears.append(dg.Linear(num_units, num_units,
param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k))))
self.gates.append(dg.Linear(num_units, num_units,
param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k))))
for i, (linear, gate) in enumerate(zip(self.linears,self.gates)):
self.linears.append(
dg.Linear(
num_units,
num_units,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-k, high=k))))
self.gates.append(
dg.Linear(
num_units,
num_units,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-k, high=k))))
for i, (linear, gate) in enumerate(zip(self.linears, self.gates)):
self.add_sublayer("linears_{}".format(i), linear)
self.add_sublayer("gates_{}".format(i), gate)
@ -184,12 +256,6 @@ class Highwaynet(dg.Layer):
t_ = fluid.layers.sigmoid(gate(out))
c = 1 - t_
out = h * t_ + out * c
out = h * t_ + out * c
return out
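
The tail of the Highwaynet loop shown above is the highway combination out = h * t_ + out * (1 - t_); the transform branch h is computed in the part of the loop this hunk elides (assumed below, as in the standard highway network, to be a relu of the linear branch). A numpy sketch:

```python
import numpy as np

def highway_layer(x, W_h, b_h, W_t, b_t):
    h = np.maximum(x @ W_h + b_h, 0.0)          # transform branch (assumed relu)
    t = 1.0 / (1.0 + np.exp(-(x @ W_t + b_t)))  # gate, as in the loop above
    return h * t + x * (1.0 - t)                # out = h * t_ + out * c

x = np.random.randn(3, 8)
y = highway_layer(x, np.eye(8), np.zeros(8), np.eye(8), -2.0 * np.ones(8))
print(y.shape)  # (3, 8); a negative gate bias keeps y close to the input x
```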

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
@ -7,67 +20,110 @@ from parakeet.modules.ffn import PositionwiseFeedForward
from parakeet.models.transformer_tts.prenet import PreNet
from parakeet.models.transformer_tts.post_convnet import PostConvNet
class Decoder(dg.Layer):
def __init__(self, num_hidden, config, num_head=4):
super(Decoder, self).__init__()
self.num_hidden = num_hidden
self.num_head = num_head
param = fluid.ParamAttr()
self.alpha = self.create_parameter(shape=(1,), attr=param, dtype='float32',
default_initializer = fluid.initializer.ConstantInitializer(value=1.0))
self.pos_inp = get_sinusoid_encoding_table(1024, self.num_hidden, padding_idx=0)
self.pos_emb = dg.Embedding(size=[1024, num_hidden],
padding_idx=0,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.NumpyArrayInitializer(self.pos_inp),
trainable=False))
self.decoder_prenet = PreNet(input_size = config['audio']['num_mels'],
hidden_size = num_hidden * 2,
output_size = num_hidden,
dropout_rate=0.2)
self.alpha = self.create_parameter(
shape=(1, ),
attr=param,
dtype='float32',
default_initializer=fluid.initializer.ConstantInitializer(
value=1.0))
self.pos_inp = get_sinusoid_encoding_table(
1024, self.num_hidden, padding_idx=0)
self.pos_emb = dg.Embedding(
size=[1024, num_hidden],
padding_idx=0,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.NumpyArrayInitializer(
self.pos_inp),
trainable=False))
self.decoder_prenet = PreNet(
input_size=config['audio']['num_mels'],
hidden_size=num_hidden * 2,
output_size=num_hidden,
dropout_rate=0.2)
k = math.sqrt(1 / num_hidden)
self.linear = dg.Linear(num_hidden, num_hidden,
param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
self.linear = dg.Linear(
num_hidden,
num_hidden,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
self.selfattn_layers = [MultiheadAttention(num_hidden, num_hidden//num_head, num_hidden//num_head) for _ in range(3)]
self.selfattn_layers = [
MultiheadAttention(num_hidden, num_hidden // num_head,
num_hidden // num_head) for _ in range(3)
]
for i, layer in enumerate(self.selfattn_layers):
self.add_sublayer("self_attn_{}".format(i), layer)
self.attn_layers = [MultiheadAttention(num_hidden, num_hidden//num_head, num_hidden//num_head) for _ in range(3)]
self.attn_layers = [
MultiheadAttention(num_hidden, num_hidden // num_head,
num_hidden // num_head) for _ in range(3)
]
for i, layer in enumerate(self.attn_layers):
self.add_sublayer("attn_{}".format(i), layer)
self.ffns = [PositionwiseFeedForward(num_hidden, num_hidden*num_head, filter_size=1) for _ in range(3)]
self.ffns = [
PositionwiseFeedForward(
num_hidden, num_hidden * num_head, filter_size=1)
for _ in range(3)
]
for i, layer in enumerate(self.ffns):
self.add_sublayer("ffns_{}".format(i), layer)
self.mel_linear = dg.Linear(num_hidden, config['audio']['num_mels'] * config['audio']['outputs_per_step'],
param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
self.stop_linear = dg.Linear(num_hidden, 1,
param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
self.mel_linear = dg.Linear(
num_hidden,
config['audio']['num_mels'] * config['audio']['outputs_per_step'],
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
self.stop_linear = dg.Linear(
num_hidden,
1,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
self.postconvnet = PostConvNet(config['audio']['num_mels'], config['hidden_size'],
filter_size = 5, padding = 4, num_conv=5,
outputs_per_step=config['audio']['outputs_per_step'],
use_cudnn=True)
self.postconvnet = PostConvNet(
config['audio']['num_mels'],
config['hidden_size'],
filter_size=5,
padding=4,
num_conv=5,
outputs_per_step=config['audio']['outputs_per_step'],
use_cudnn=True)
def forward(self, key, value, query, positional, mask, m_mask=None, m_self_mask=None, zero_mask=None):
def forward(self,
key,
value,
query,
positional,
mask,
m_mask=None,
m_self_mask=None,
zero_mask=None):
# get decoder mask with triangular matrix
if fluid.framework._dygraph_tracer()._train_mode:
m_mask = layers.expand(m_mask, [self.num_head, 1, key.shape[1]])
m_self_mask = layers.expand(m_self_mask, [self.num_head, 1, query.shape[1]])
m_self_mask = layers.expand(m_self_mask,
[self.num_head, 1, query.shape[1]])
mask = layers.expand(mask, [self.num_head, 1, 1])
zero_mask = layers.expand(zero_mask, [self.num_head, 1, 1])
else:
m_mask, m_self_mask, zero_mask = None, None, None
# Decoder pre-network
# Decoder pre-network
query = self.decoder_prenet(query)
# Centered position
query = self.linear(query)
@ -76,28 +132,29 @@ class Decoder(dg.Layer):
query = positional * self.alpha + query
#positional dropout
query = fluid.layers.dropout(query, 0.1, dropout_implementation='upscale_in_train')
query = fluid.layers.dropout(
query, 0.1, dropout_implementation='upscale_in_train')
# Attention decoder-decoder, encoder-decoder
selfattn_list = list()
attn_list = list()
for selfattn, attn, ffn in zip(self.selfattn_layers, self.attn_layers, self.ffns):
query, attn_dec = selfattn(query, query, query, mask = mask, query_mask = m_self_mask)
query, attn_dot = attn(key, value, query, mask = zero_mask, query_mask = m_mask)
for selfattn, attn, ffn in zip(self.selfattn_layers, self.attn_layers,
self.ffns):
query, attn_dec = selfattn(
query, query, query, mask=mask, query_mask=m_self_mask)
query, attn_dot = attn(
key, value, query, mask=zero_mask, query_mask=m_mask)
query = ffn(query)
selfattn_list.append(attn_dec)
attn_list.append(attn_dot)
# Mel linear projection
mel_out = self.mel_linear(query)
# Post Mel Network
out = self.postconvnet(mel_out)
out = mel_out + out
# Stop tokens
stop_tokens = self.stop_linear(query)
stop_tokens = layers.squeeze(stop_tokens, [-1])
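
The `query = positional * self.alpha + query` line above is Transformer-TTS's scaled positional encoding: a single trainable scalar rescales the sinusoid embeddings before they are added, so the model can tune how strongly position information mixes into the prenet output. In plain numpy terms:

```python
import numpy as np

alpha = 1.0                              # trainable scalar, initialized to 1.0 above
query = np.random.randn(2, 6, 8)         # (B, T_mel, C) prenet + linear output
positional = np.random.randn(2, 6, 8)    # looked-up rows of the sinusoid table
query = positional * alpha + query       # followed by 10% upscale_in_train dropout
print(query.shape)
```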

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
from parakeet.models.transformer_tts.utils import *
@ -5,56 +18,69 @@ from parakeet.modules.multihead_attention import MultiheadAttention
from parakeet.modules.ffn import PositionwiseFeedForward
from parakeet.models.transformer_tts.encoderprenet import EncoderPrenet
class Encoder(dg.Layer):
def __init__(self, embedding_size, num_hidden, num_head=4):
super(Encoder, self).__init__()
self.num_hidden = num_hidden
self.num_head = num_head
param = fluid.ParamAttr(initializer=fluid.initializer.Constant(value=1.0))
self.alpha = self.create_parameter(shape=(1, ), attr=param, dtype='float32')
self.pos_inp = get_sinusoid_encoding_table(1024, self.num_hidden, padding_idx=0)
self.pos_emb = dg.Embedding(size=[1024, num_hidden],
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.NumpyArrayInitializer(self.pos_inp),
trainable=False))
self.encoder_prenet = EncoderPrenet(embedding_size = embedding_size,
num_hidden = num_hidden,
use_cudnn=True)
self.layers = [MultiheadAttention(num_hidden, num_hidden//num_head, num_hidden//num_head) for _ in range(3)]
param = fluid.ParamAttr(initializer=fluid.initializer.Constant(
value=1.0))
self.alpha = self.create_parameter(
shape=(1, ), attr=param, dtype='float32')
self.pos_inp = get_sinusoid_encoding_table(
1024, self.num_hidden, padding_idx=0)
self.pos_emb = dg.Embedding(
size=[1024, num_hidden],
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.NumpyArrayInitializer(
self.pos_inp),
trainable=False))
self.encoder_prenet = EncoderPrenet(
embedding_size=embedding_size,
num_hidden=num_hidden,
use_cudnn=True)
self.layers = [
MultiheadAttention(num_hidden, num_hidden // num_head,
num_hidden // num_head) for _ in range(3)
]
for i, layer in enumerate(self.layers):
self.add_sublayer("self_attn_{}".format(i), layer)
self.ffns = [PositionwiseFeedForward(num_hidden, num_hidden*num_head, filter_size=1, use_cudnn=True) for _ in range(3)]
self.ffns = [
PositionwiseFeedForward(
num_hidden,
num_hidden * num_head,
filter_size=1,
use_cudnn=True) for _ in range(3)
]
for i, layer in enumerate(self.ffns):
self.add_sublayer("ffns_{}".format(i), layer)
def forward(self, x, positional, mask=None, query_mask=None):
if fluid.framework._dygraph_tracer()._train_mode:
seq_len_key = x.shape[1]
query_mask = layers.expand(query_mask, [self.num_head, 1, seq_len_key])
query_mask = layers.expand(query_mask,
[self.num_head, 1, seq_len_key])
mask = layers.expand(mask, [self.num_head, 1, 1])
else:
query_mask, mask = None, None
# Encoder pre-network
x = self.encoder_prenet(x) #(N,T,C)
x = self.encoder_prenet(x) #(N,T,C)
# Get positional encoding
positional = self.pos_emb(positional)
x = positional * self.alpha + x #(N, T, C)
positional = self.pos_emb(positional)
x = positional * self.alpha + x #(N, T, C)
# Positional dropout
x = layers.dropout(x, 0.1, dropout_implementation='upscale_in_train')
# Self attention encoder
attentions = list()
for layer, ffn in zip(self.layers, self.ffns):
x, attention = layer(x, x, x, mask = mask, query_mask = query_mask)
x, attention = layer(x, x, x, mask=mask, query_mask=query_mask)
x = ffn(x)
attentions.append(attention)
return x, attentions
return x, attentions

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from parakeet.g2p.text.symbols import symbols
import paddle.fluid.dygraph as dg
@ -13,50 +26,70 @@ class EncoderPrenet(dg.Layer):
self.embedding_size = embedding_size
self.num_hidden = num_hidden
self.use_cudnn = use_cudnn
self.embedding = dg.Embedding( size = [len(symbols), embedding_size],
padding_idx = 0,
param_attr=fluid.initializer.Normal(loc=0.0, scale=1.0))
self.embedding = dg.Embedding(
size=[len(symbols), embedding_size],
padding_idx=0,
param_attr=fluid.initializer.Normal(
loc=0.0, scale=1.0))
self.conv_list = []
k = math.sqrt(1 / embedding_size)
self.conv_list.append(Conv1D(num_channels = embedding_size,
num_filters = num_hidden,
filter_size = 5,
padding = int(np.floor(5/2)),
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)),
use_cudnn = use_cudnn))
self.conv_list.append(
Conv1D(
num_channels=embedding_size,
num_filters=num_hidden,
filter_size=5,
padding=int(np.floor(5 / 2)),
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-k, high=k)),
use_cudnn=use_cudnn))
k = math.sqrt(1 / num_hidden)
for _ in range(2):
self.conv_list.append(Conv1D(num_channels = num_hidden,
num_filters = num_hidden,
filter_size = 5,
padding = int(np.floor(5/2)),
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)),
use_cudnn = use_cudnn))
self.conv_list.append(
Conv1D(
num_channels=num_hidden,
num_filters=num_hidden,
filter_size=5,
padding=int(np.floor(5 / 2)),
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-k, high=k)),
use_cudnn=use_cudnn))
for i, layer in enumerate(self.conv_list):
self.add_sublayer("conv_list_{}".format(i), layer)
self.batch_norm_list = [dg.BatchNorm(num_hidden,
data_layout='NCHW') for _ in range(3)]
self.batch_norm_list = [
dg.BatchNorm(
num_hidden, data_layout='NCHW') for _ in range(3)
]
for i, layer in enumerate(self.batch_norm_list):
self.add_sublayer("batch_norm_list_{}".format(i), layer)
k = math.sqrt(1 / num_hidden)
self.projection = dg.Linear(num_hidden, num_hidden,
param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
self.projection = dg.Linear(
num_hidden,
num_hidden,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
def forward(self, x):
x = self.embedding(x) #(batch_size, seq_len, embedding_size)
x = layers.transpose(x,[0,2,1])
x = self.embedding(x) #(batch_size, seq_len, embedding_size)
x = layers.transpose(x, [0, 2, 1])
for batch_norm, conv in zip(self.batch_norm_list, self.conv_list):
x = layers.dropout(layers.relu(batch_norm(conv(x))), 0.2,
dropout_implementation='upscale_in_train')
x = layers.transpose(x,[0,2,1]) #(N,T,C)
x = layers.dropout(
layers.relu(batch_norm(conv(x))),
0.2,
dropout_implementation='upscale_in_train')
x = layers.transpose(x, [0, 2, 1]) #(N,T,C)
x = self.projection(x)
return x
return x

View File

@ -1,11 +1,25 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
import paddle.fluid.layers as layers
from parakeet.modules.customized import Conv1D
class PostConvNet(dg.Layer):
def __init__(self,
def __init__(self,
n_mels=80,
num_hidden=512,
filter_size=5,
@ -16,49 +30,66 @@ class PostConvNet(dg.Layer):
dropout=0.1,
batchnorm_last=False):
super(PostConvNet, self).__init__()
self.dropout = dropout
self.num_conv = num_conv
self.batchnorm_last = batchnorm_last
self.conv_list = []
k = math.sqrt(1 / (n_mels * outputs_per_step))
self.conv_list.append(Conv1D(num_channels = n_mels * outputs_per_step,
num_filters = num_hidden,
filter_size = filter_size,
padding = padding,
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)),
use_cudnn = use_cudnn))
self.conv_list.append(
Conv1D(
num_channels=n_mels * outputs_per_step,
num_filters=num_hidden,
filter_size=filter_size,
padding=padding,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-k, high=k)),
use_cudnn=use_cudnn))
k = math.sqrt(1 / num_hidden)
for _ in range(1, num_conv-1):
self.conv_list.append(Conv1D(num_channels = num_hidden,
num_filters = num_hidden,
filter_size = filter_size,
padding = padding,
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)),
use_cudnn = use_cudnn))
for _ in range(1, num_conv - 1):
self.conv_list.append(
Conv1D(
num_channels=num_hidden,
num_filters=num_hidden,
filter_size=filter_size,
padding=padding,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-k, high=k)),
use_cudnn=use_cudnn))
self.conv_list.append(Conv1D(num_channels = num_hidden,
num_filters = n_mels * outputs_per_step,
filter_size = filter_size,
padding = padding,
param_attr = fluid.ParamAttr(initializer=fluid.initializer.XavierInitializer()),
bias_attr = fluid.ParamAttr(initializer=fluid.initializer.Uniform(low=-k, high=k)),
use_cudnn = use_cudnn))
self.conv_list.append(
Conv1D(
num_channels=num_hidden,
num_filters=n_mels * outputs_per_step,
filter_size=filter_size,
padding=padding,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(
initializer=fluid.initializer.Uniform(
low=-k, high=k)),
use_cudnn=use_cudnn))
for i, layer in enumerate(self.conv_list):
self.add_sublayer("conv_list_{}".format(i), layer)
self.batch_norm_list = [dg.BatchNorm(num_hidden,
data_layout='NCHW') for _ in range(num_conv-1)]
self.batch_norm_list = [
dg.BatchNorm(
num_hidden, data_layout='NCHW') for _ in range(num_conv - 1)
]
if self.batchnorm_last:
self.batch_norm_list.append(dg.BatchNorm(n_mels * outputs_per_step,
data_layout='NCHW'))
self.batch_norm_list.append(
dg.BatchNorm(
n_mels * outputs_per_step, data_layout='NCHW'))
for i, layer in enumerate(self.batch_norm_list):
self.add_sublayer("batch_norm_list_{}".format(i), layer)
def forward(self, input):
"""
@ -69,20 +100,24 @@ class PostConvNet(dg.Layer):
Returns:
output (Variable), Shape(B, T, C), the result after postconvnet.
"""
input = layers.transpose(input, [0,2,1])
input = layers.transpose(input, [0, 2, 1])
len = input.shape[-1]
for i in range(self.num_conv-1):
for i in range(self.num_conv - 1):
batch_norm = self.batch_norm_list[i]
conv = self.conv_list[i]
input = layers.dropout(layers.tanh(batch_norm(conv(input)[:,:,:len])), self.dropout,
dropout_implementation='upscale_in_train')
conv = self.conv_list[self.num_conv-1]
input = conv(input)[:,:,:len]
input = layers.dropout(
layers.tanh(batch_norm(conv(input)[:, :, :len])),
self.dropout,
dropout_implementation='upscale_in_train')
conv = self.conv_list[self.num_conv - 1]
input = conv(input)[:, :, :len]
if self.batchnorm_last:
batch_norm = self.batch_norm_list[self.num_conv-1]
input = layers.dropout(batch_norm(input), self.dropout,
dropout_implementation='upscale_in_train')
output = layers.transpose(input, [0,2,1])
return output
batch_norm = self.batch_norm_list[self.num_conv - 1]
input = layers.dropout(
batch_norm(input),
self.dropout,
dropout_implementation='upscale_in_train')
output = layers.transpose(input, [0, 2, 1])
return output

View File

@ -1,8 +1,22 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
import paddle.fluid.layers as layers
class PreNet(dg.Layer):
def __init__(self, input_size, hidden_size, output_size, dropout_rate=0.2):
"""
@ -17,13 +31,21 @@ class PreNet(dg.Layer):
self.dropout_rate = dropout_rate
k = math.sqrt(1 / input_size)
self.linear1 = dg.Linear(input_size, hidden_size,
param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
self.linear1 = dg.Linear(
input_size,
hidden_size,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
k = math.sqrt(1 / hidden_size)
self.linear2 = dg.Linear(hidden_size, output_size,
param_attr=fluid.ParamAttr(initializer = fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer = fluid.initializer.Uniform(low=-k, high=k)))
self.linear2 = dg.Linear(
hidden_size,
output_size,
param_attr=fluid.ParamAttr(
initializer=fluid.initializer.XavierInitializer()),
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform(
low=-k, high=k)))
def forward(self, x):
"""
@ -34,6 +56,12 @@ class PreNet(dg.Layer):
Returns:
x (Variable), Shape(B, T, C), the result after prenet.
"""
x = layers.dropout(layers.relu(self.linear1(x)), self.dropout_rate, dropout_implementation='upscale_in_train')
x = layers.dropout(layers.relu(self.linear2(x)), self.dropout_rate, dropout_implementation='upscale_in_train')
x = layers.dropout(
layers.relu(self.linear1(x)),
self.dropout_rate,
dropout_implementation='upscale_in_train')
x = layers.dropout(
layers.relu(self.linear2(x)),
self.dropout_rate,
dropout_implementation='upscale_in_train')
return x

View File

@ -1,8 +1,22 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
from parakeet.models.transformer_tts.encoder import Encoder
from parakeet.models.transformer_tts.decoder import Decoder
class TransformerTTS(dg.Layer):
def __init__(self, config):
super(TransformerTTS, self).__init__()
@ -10,16 +24,29 @@ class TransformerTTS(dg.Layer):
self.decoder = Decoder(config['hidden_size'], config)
self.config = config
def forward(self, characters, mel_input, pos_text, pos_mel, dec_slf_mask, enc_slf_mask=None, enc_query_mask=None, enc_dec_mask=None, dec_query_slf_mask=None, dec_query_mask=None):
key, attns_enc = self.encoder(characters, pos_text, mask=enc_slf_mask, query_mask=enc_query_mask)
mel_output, postnet_output, attn_probs, stop_preds, attns_dec = self.decoder(key, key, mel_input, pos_mel,
mask=dec_slf_mask, zero_mask=enc_dec_mask,
m_self_mask=dec_query_slf_mask, m_mask=dec_query_mask )
def forward(self,
characters,
mel_input,
pos_text,
pos_mel,
dec_slf_mask,
enc_slf_mask=None,
enc_query_mask=None,
enc_dec_mask=None,
dec_query_slf_mask=None,
dec_query_mask=None):
key, attns_enc = self.encoder(
characters, pos_text, mask=enc_slf_mask, query_mask=enc_query_mask)
mel_output, postnet_output, attn_probs, stop_preds, attns_dec = self.decoder(
key,
key,
mel_input,
pos_mel,
mask=dec_slf_mask,
zero_mask=enc_dec_mask,
m_self_mask=dec_query_slf_mask,
m_mask=dec_query_mask)
return mel_output, postnet_output, attn_probs, stop_preds, attns_enc, attns_dec
return mel_output, postnet_output, attn_probs, stop_preds, attns_enc, attns_dec

View File

@ -1,3 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import librosa
import os, copy
@ -6,14 +19,15 @@ import paddle.fluid.layers as layers
def get_positional_table(d_pos_vec, n_position=1024):
position_enc = np.array([
[pos / np.power(10000, 2*i/d_pos_vec) for i in range(d_pos_vec)]
if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])
position_enc = np.array(
[[pos / np.power(10000, 2 * i / d_pos_vec) for i in range(d_pos_vec)]
if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])
position_enc[1:, 0::2] = np.sin(position_enc[1:, 0::2]) # dim 2i
position_enc[1:, 1::2] = np.cos(position_enc[1:, 1::2]) # dim 2i+1
position_enc[1:, 0::2] = np.sin(position_enc[1:, 0::2]) # dim 2i
position_enc[1:, 1::2] = np.cos(position_enc[1:, 1::2]) # dim 2i+1
return position_enc
def get_sinusoid_encoding_table(n_position, d_hid, padding_idx=None):
''' Sinusoid position encoding table '''
@ -23,7 +37,8 @@ def get_sinusoid_encoding_table(n_position, d_hid, padding_idx=None):
def get_posi_angle_vec(position):
return [cal_angle(position, hid_j) for hid_j in range(d_hid)]
sinusoid_table = np.array([get_posi_angle_vec(pos_i) for pos_i in range(n_position)])
sinusoid_table = np.array(
[get_posi_angle_vec(pos_i) for pos_i in range(n_position)])
sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i
sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1
@ -34,11 +49,13 @@ def get_sinusoid_encoding_table(n_position, d_hid, padding_idx=None):
return sinusoid_table
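For reference, the table built above is the standard sinusoidal positional encoding, restated here from the code:

$$\mathrm{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad \mathrm{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

with the `padding_idx` row zeroed out so padded positions contribute nothing.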
def get_non_pad_mask(seq):
mask = (seq != 0).astype(np.float32)
mask = np.expand_dims(mask, axis=-1)
return mask
def get_attn_key_pad_mask(seq_k, seq_q):
''' For masking out the padding part of key sequence. '''
@ -46,10 +63,11 @@ def get_attn_key_pad_mask(seq_k, seq_q):
len_q = seq_q.shape[1]
padding_mask = (seq_k != 0).astype(np.float32)
padding_mask = np.expand_dims(padding_mask, axis=1)
padding_mask = padding_mask.repeat([len_q],axis=1)
padding_mask = (padding_mask == 0).astype(np.float32) * (-2 ** 32 + 1)
padding_mask = padding_mask.repeat([len_q], axis=1)
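# Masked positions get a large negative value (about -2**32) so they vanish after the softmax.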
padding_mask = (padding_mask == 0).astype(np.float32) * (-2**32 + 1)
return padding_mask
def get_dec_attn_key_pad_mask(seq_k, seq_q):
''' For masking out the padding part of key sequence. '''
@ -58,33 +76,37 @@ def get_dec_attn_key_pad_mask(seq_k, seq_q):
padding_mask = (seq_k == 0).astype(np.float32)
padding_mask = np.expand_dims(padding_mask, axis=1)
triu_tensor = get_triu_tensor(seq_q, seq_q)
padding_mask = padding_mask.repeat([len_q],axis=1) + triu_tensor
padding_mask = (padding_mask != 0).astype(np.float32) * (-2 ** 32 + 1)
padding_mask = padding_mask.repeat([len_q], axis=1) + triu_tensor
padding_mask = (padding_mask != 0).astype(np.float32) * (-2**32 + 1)
return padding_mask
def get_triu_tensor(seq_k, seq_q):
''' Build an upper-triangular (triu) mask tensor. '''
len_k = seq_k.shape[1]
len_q = seq_q.shape[1]
batch_size = seq_k.shape[0]
triu_tensor = np.triu(np.ones([len_k, len_q]), 1)
triu_tensor = np.repeat(np.expand_dims(triu_tensor, axis=0) ,batch_size, axis=0)
triu_tensor = np.repeat(
np.expand_dims(
triu_tensor, axis=0), batch_size, axis=0)
return triu_tensor
def guided_attention(N, T, g=0.2):
'''Guided attention. Refer to page 3 of the paper.'''
W = np.zeros((N, T), dtype=np.float32)
for n_pos in range(W.shape[0]):
for t_pos in range(W.shape[1]):
W[n_pos, t_pos] = 1 - np.exp(-(t_pos / float(T) - n_pos / float(N)) ** 2 / (2 * g * g))
W[n_pos, t_pos] = 1 - np.exp(-(t_pos / float(T) - n_pos / float(N))
**2 / (2 * g * g))
return W
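For reference, the nested loop above fills the guided-attention matrix

$$W_{n,t} = 1 - \exp\left(-\frac{(t/T - n/N)^2}{2g^2}\right),$$

which grows toward 1 as the attention position strays from the diagonal, so using it as a loss weight encourages roughly monotonic text-to-mel alignment.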
def cross_entropy(input, label, position_weight=1.0, epsilon=1e-30):
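# Weighted binary cross entropy: the term for positive labels (label == 1) is scaled by position_weight, e.g., to up-weight the rare stop-token frames.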
output = -1 * label * layers.log(input + epsilon) - (1-label) * layers.log(1 - input + epsilon)
output = -1 * label * layers.log(input + epsilon) - (
1 - label) * layers.log(1 - input + epsilon)
output = output * (label * (position_weight - 1) + 1)
return layers.reduce_sum(output, dim=[0, 1])

View File

@ -1,27 +1,44 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid.dygraph as dg
import paddle.fluid as fluid
from parakeet.modules.customized import Conv1D
from parakeet.models.transformer_tts.utils import *
from parakeet.models.transformer_tts.cbhg import CBHG
class Vocoder(dg.Layer):
"""
CBHG Network (mel -> linear)
"""
def __init__(self, config, batch_size):
super(Vocoder, self).__init__()
self.pre_proj = Conv1D(num_channels = config['audio']['num_mels'],
num_filters = config['hidden_size'],
filter_size=1)
self.pre_proj = Conv1D(
num_channels=config['audio']['num_mels'],
num_filters=config['hidden_size'],
filter_size=1)
self.cbhg = CBHG(config['hidden_size'], batch_size)
self.post_proj = Conv1D(num_channels = config['hidden_size'],
num_filters = (config['audio']['n_fft'] // 2) + 1,
filter_size=1)
self.post_proj = Conv1D(
num_channels=config['hidden_size'],
num_filters=(config['audio']['n_fft'] // 2) + 1,
filter_size=1)
def forward(self, mel):
mel = layers.transpose(mel, [0,2,1])
mel = layers.transpose(mel, [0, 2, 1])
mel = self.pre_proj(mel)
mel = self.cbhg(mel)
mag_pred = self.post_proj(mel)
mag_pred = layers.transpose(mag_pred, [0,2,1])
mag_pred = layers.transpose(mag_pred, [0, 2, 1])
return mag_pred

View File

@ -1 +1,15 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from parakeet.models.waveflow.waveflow import WaveFlow

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
import librosa

View File

@ -1,3 +1,17 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import itertools
import os
import time
@ -8,6 +22,7 @@ from paddle import fluid
from scipy.io.wavfile import write
import utils
from parakeet.modules import weight_norm
from .data import LJSpeech
from .waveflow_modules import WaveFlowLoss, WaveFlowModule
@ -26,6 +41,7 @@ class WaveFlow():
self.rank = rank
self.nranks = nranks
self.tb_logger = tb_logger
self.dtype = "float16" if config.use_fp16 else "float32"
def build(self, training=True):
config = self.config
@ -36,9 +52,9 @@ class WaveFlow():
waveflow = WaveFlowModule(config)
# Dry run once to create and initialize all necessary parameters.
audio = dg.to_variable(np.random.randn(1, 16000).astype(np.float32))
audio = dg.to_variable(np.random.randn(1, 16000).astype(self.dtype))
mel = dg.to_variable(
np.random.randn(1, config.mel_bands, 63).astype(np.float32))
np.random.randn(1, config.mel_bands, 63).astype(self.dtype))
waveflow(audio, mel)
if training:
@ -72,9 +88,14 @@ class WaveFlow():
self.rank,
waveflow,
iteration=config.iteration,
file_path=config.checkpoint)
file_path=config.checkpoint,
dtype=self.dtype)
print("Rank {}: checkpoint loaded.".format(self.rank))
for layer in waveflow.sublayers():
if isinstance(layer, weight_norm.WeightNormWrapper):
layer.remove_weight_norm()
self.waveflow = waveflow
def train_step(self, iteration):
@ -173,7 +194,7 @@ class WaveFlow():
syn_time))
# Denormalize audio from [-1, 1] to the int16 range [-32768, 32767].
audio = audio.numpy() * 32768.0
audio = audio.numpy().astype("float32") * 32768.0
audio = audio.astype('int16')
write(filename, config.sample_rate, audio)

View File

@ -1,5 +1,18 @@
import itertools
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import itertools
import numpy as np
import paddle.fluid.dygraph as dg
from paddle import fluid
@ -49,7 +62,7 @@ class WaveFlowLoss:
class Conditioner(dg.Layer):
def __init__(self):
def __init__(self, dtype):
super(Conditioner, self).__init__()
upsample_factors = [16, 16]
@ -65,7 +78,8 @@ class Conditioner(dg.Layer):
padding=(1, s // 2),
stride=(1, s),
param_attr=param_attr,
bias_attr=bias_attr)
bias_attr=bias_attr,
dtype="float32")
self.upsample_conv2d.append(conv_trans2d)
for i, layer in enumerate(self.upsample_conv2d):
@ -74,19 +88,30 @@ class Conditioner(dg.Layer):
def forward(self, x):
x = fluid.layers.unsqueeze(x, 1)
for layer in self.upsample_conv2d:
x = fluid.layers.leaky_relu(layer(x), alpha=0.4)
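# The transposed convolutions are created with dtype="float32", so FP16 activations are cast up to FP32 around each conv and back down afterwards.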
in_dtype = x.dtype
if in_dtype == fluid.core.VarDesc.VarType.FP16:
x = fluid.layers.cast(x, "float32")
x = layer(x)
if in_dtype == fluid.core.VarDesc.VarType.FP16:
x = fluid.layers.cast(x, "float16")
x = fluid.layers.leaky_relu(x, alpha=0.4)
return fluid.layers.squeeze(x, [1])
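# The reshape below is equivalent to squeezing axis 1: [bs, 1, mel_bands, time] -> [bs, mel_bands, time].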
return fluid.layers.reshape(x, [x.shape[0], x.shape[2], x.shape[3]])
def infer(self, x):
x = fluid.layers.unsqueeze(x, 1)
for layer in self.upsample_conv2d:
in_dtype = x.dtype
if in_dtype == fluid.core.VarDesc.VarType.FP16:
x = fluid.layers.cast(x, "float32")
x = layer(x)
if in_dtype == fluid.core.VarDesc.VarType.FP16:
x = fluid.layers.cast(x, "float16")
# Trim conv artifacts.
time_cutoff = layer._filter_size[1] - layer._stride[1]
x = fluid.layers.leaky_relu(x[:, :, :, :-time_cutoff], alpha=0.4)
return fluid.layers.squeeze(x, [1])
return fluid.layers.reshape(x, [x.shape[0], x.shape[2], x.shape[3]])
class Flow(dg.Layer):
@ -96,6 +121,7 @@ class Flow(dg.Layer):
self.n_channels = config.n_channels
self.kernel_h = config.kernel_h
self.kernel_w = config.kernel_w
self.dtype = "float16" if config.use_fp16 else "float32"
# Transform audio: [batch, 1, n_group, time/n_group]
# => [batch, n_channels, n_group, time/n_group]
@ -105,7 +131,8 @@ class Flow(dg.Layer):
num_filters=self.n_channels,
filter_size=(1, 1),
param_attr=param_attr,
bias_attr=bias_attr)
bias_attr=bias_attr,
dtype=self.dtype)
# Initializing last layer to 0 makes the affine coupling layers
# do nothing at first. This helps with training stability
@ -117,7 +144,8 @@ class Flow(dg.Layer):
num_filters=2,
filter_size=(1, 1),
param_attr=param_attr,
bias_attr=bias_attr)
bias_attr=bias_attr,
dtype=self.dtype)
# receptive fields: (kernel - 1) * sum(dilations) + 1 >= squeeze
dilation_dict = {
@ -145,7 +173,8 @@ class Flow(dg.Layer):
filter_size=(self.kernel_h, self.kernel_w),
dilation=(dilation_h, dilation_w),
param_attr=param_attr,
bias_attr=bias_attr)
bias_attr=bias_attr,
dtype=self.dtype)
self.in_layers.append(in_layer)
param_attr, bias_attr = get_param_attr(
@ -155,7 +184,8 @@ class Flow(dg.Layer):
num_filters=2 * self.n_channels,
filter_size=(1, 1),
param_attr=param_attr,
bias_attr=bias_attr)
bias_attr=bias_attr,
dtype=self.dtype)
self.cond_layers.append(cond_layer)
if i < self.n_layers - 1:
@ -169,7 +199,8 @@ class Flow(dg.Layer):
num_filters=res_skip_channels,
filter_size=(1, 1),
param_attr=param_attr,
bias_attr=bias_attr)
bias_attr=bias_attr,
dtype=self.dtype)
self.res_skip_layers.append(res_skip_layer)
self.add_sublayer("in_layer_{}".format(i), in_layer)
@ -189,10 +220,10 @@ class Flow(dg.Layer):
# Pad width dim (time): dilated non-causal convolution
pad_top, pad_bottom = (self.kernel_h - 1) * dilation_h, 0
pad_left = pad_right = int((self.kernel_w - 1) * dilation_w / 2)
audio_pad = fluid.layers.pad2d(
audio, paddings=[pad_top, pad_bottom, pad_left, pad_right])
hidden = self.in_layers[i](audio_pad)
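# Instead of an explicit pad2d op, configure the wrapped conv's own padding (causal along height, centered along width).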
self.in_layers[i].layer._padding = [
pad_top, pad_bottom, pad_left, pad_right
]
hidden = self.in_layers[i](audio)
cond_hidden = self.cond_layers[i](mel)
in_acts = hidden + cond_hidden
out_acts = fluid.layers.tanh(in_acts[:, :self.n_channels, :]) * \
@ -237,14 +268,14 @@ class Flow(dg.Layer):
pad_top, pad_bottom = 0, 0
pad_left = int((self.kernel_w - 1) * dilation_w / 2)
pad_right = int((self.kernel_w - 1) * dilation_w / 2)
state = fluid.layers.pad2d(
state, paddings=[pad_top, pad_bottom, pad_left, pad_right])
self.in_layers[i].layer._padding = [
pad_top, pad_bottom, pad_left, pad_right
]
hidden = self.in_layers[i](state)
cond_hidden = self.cond_layers[i](mel)
in_acts = hidden + cond_hidden
out_acts = fluid.layers.tanh(in_acts[:, :self.n_channels, :]) * \
fluid.layers.sigmoid(in_acts[:, self.n_channels:, :])
fluid.layers.sigmoid(in_acts[:, self.n_channels:, :])
res_skip_acts = self.res_skip_layers[i](out_acts)
if i < self.n_layers - 1:
@ -270,7 +301,8 @@ class WaveFlowModule(dg.Layer):
assert self.n_group % 2 == 0
assert self.n_flows % 2 == 0
self.conditioner = Conditioner()
self.dtype = "float16" if config.use_fp16 else "float32"
self.conditioner = Conditioner(self.dtype)
self.flows = []
for i in range(self.n_flows):
flow = Flow(config)
@ -324,17 +356,21 @@ class WaveFlowModule(dg.Layer):
mel_slices = [mel[:, :, j, :] for j in self.perms[i]]
mel = fluid.layers.stack(mel_slices, axis=2)
z = fluid.layers.squeeze(audio, [1])
z = fluid.layers.reshape(
audio, [audio.shape[0], audio.shape[2], audio.shape[3]])
return z, log_s_list
def synthesize(self, mel, sigma=1.0):
if self.dtype == "float16":
mel = fluid.layers.cast(mel, self.dtype)
mel = self.conditioner.infer(mel)
# From [bs, mel_bands, time] to [bs, mel_bands, n_group, time/n_group]
mel = fluid.layers.transpose(unfold(mel, self.n_group), [0, 1, 3, 2])
audio = fluid.layers.gaussian_random(
shape=[mel.shape[0], 1, mel.shape[2], mel.shape[3]], std=sigma)
if self.dtype == "float16":
audio = fluid.layers.cast(audio, self.dtype)
for i in reversed(range(self.n_flows)):
# Permute over the height dimension.
audio_slices = [audio[:, :, j, :] for j in self.perms[i]]
@ -362,9 +398,9 @@ class WaveFlowModule(dg.Layer):
audio = fluid.layers.concat(audio_list, axis=2)
# audio: [bs, n_group, time/n_group]
audio = fluid.layers.squeeze(audio, [1])
audio = fluid.layers.reshape(
audio, [audio.shape[0], audio.shape[2], audio.shape[3]])
# audio: [bs, time]
audio = fluid.layers.reshape(
fluid.layers.transpose(audio, [0, 2, 1]), [audio.shape[0], -1])
return audio

View File

@ -1,97 +0,0 @@
# WaveNet with Paddle Fluid
A Paddle Fluid implementation of WaveNet, a deep generative model of raw audio waveforms.
The WaveNet model was originally proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499).
Our implementation is based on the WaveNet architecture described in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281) and can provide various output distributions, including a single Gaussian, a mixture of Gaussians, and softmax with linearly quantized channels.
We implement the WaveNet model with Paddle Fluid's dynamic graph, which is convenient for building flexible network architectures.
## Project Structure
```text
├── configs # yaml configuration files of preset model hyperparameters
├── data.py # dataset and dataloader settings for LJSpeech
├── slurm.py # optional slurm helper functions if you use slurm to train model
├── synthesis.py # script for speech synthesis
├── train.py # script for model training
├── utils.py # helper functions for e.g., model checkpointing
├── wavenet.py # WaveNet model high level APIs
└── wavenet_modules.py # WaveNet model implementation
```
## Usage
There are many hyperparameters to be tuned depending on the model and dataset you are working on. Hyperparameters that are known to work well for the LJSpeech dataset are provided as yaml files in the `./configs/` folder. Specifically, we provide `wavenet_ljspeech_single_gaussian.yaml`, `wavenet_ljspeech_mix_gaussian.yaml`, and `wavenet_ljspeech_softmax.yaml` config files for WaveNet with single Gaussian, 10-component mixture of Gaussians, and softmax (with 2048 linearly quantized channels) output distributions, respectively.
Note that `train.py` and `synthesis.py` both accept a `--config` parameter. To ensure consistency, you should use the same config yaml file for both training and synthesis. You can also override these preset hyperparameters from the command line by appending parameters after `--config`. For example, `--config=${yaml} --batch_size=8 --layers=20` overrides the corresponding hyperparameters in the `${yaml}` config file. For more details about these hyperparameters, check `utils.add_config_options_to_parser`.
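For instance, a minimal sketch combining a preset config with command-line overrides (the data path and model name are placeholders):

```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}/../../.."
export CUDA_VISIBLE_DEVICES=0
python -u train.py --config=./configs/wavenet_ljspeech_single_gaussian.yaml \
    --root=./data/LJSpeech-1.1 --name=${ModelName} \
    --batch_size=8 --layers=20 \
    --parallel=false --use_gpu=true
```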
Note that you also need to specify some additional parameters for `train.py` and `synthesis.py`; the details can be found in `train.add_options_to_parser` and `synthesis.add_options_to_parser`, respectively.
### Dataset
Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
In this example, assume that the path of unzipped LJSpeech dataset is `./data/LJSpeech-1.1`.
### Train on single GPU
```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}/../../.."
export CUDA_VISIBLE_DEVICES=0
python -u train.py --config=${yaml} \
--root=./data/LJSpeech-1.1 \
--name=${ModelName} --batch_size=4 \
--parallel=false --use_gpu=true
```
#### Save and Load checkpoints
Our model saves parameters as checkpoints in `./runs/wavenet/${ModelName}/checkpoint/` every 10000 iterations by default.
A saved checkpoint consists of `step-${iteration_number}.pdparams` for model parameters and `step-${iteration_number}.pdopt` for optimizer state.
There are three ways to load a checkpoint and resume training (suppose you want to load a 500000-iteration checkpoint); a sketch follows the list:
1. Use `--checkpoint=./runs/wavenet/${ModelName}/checkpoint/step-500000` to provide a specific path to load. Note that you only need to provide the base name of the parameter file, which is `step-500000`; no extension (`.pdparams` or `.pdopt`) is needed.
2. Use `--iteration=500000`.
3. If you don't specify either `--checkpoint` or `--iteration`, the model will automatically load the latest checkpoint in `./runs/wavenet/${ModelName}/checkpoint`.
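For example, a minimal sketch of option 1, resuming single-GPU training from that checkpoint (the other flags match the single-GPU training command above):

```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}/../../.."
export CUDA_VISIBLE_DEVICES=0
python -u train.py --config=${yaml} \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --batch_size=4 \
    --parallel=false --use_gpu=true \
    --checkpoint=./runs/wavenet/${ModelName}/checkpoint/step-500000
```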
### Train on multiple GPUs
```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}/../../.."
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -u -m paddle.distributed.launch train.py \
--config=${yaml} \
--root=./data/LJSpeech-1.1 \
--name=${ModelName} --parallel=true --use_gpu=true
```
Use `export CUDA_VISIBLE_DEVICES=0,1,2,3` to make the GPUs that you want to use visible. The `paddle.distributed.launch` module will then use these visible GPUs for data-parallel training in multiprocessing mode.
### Monitor with Tensorboard
By default, the logs are saved in `./runs/wavenet/${ModelName}/logs/`. You can monitor them with TensorBoard.
```bash
tensorboard --logdir=${log_dir} --port=8888
```
### Synthesize from a checkpoint
See the [Save and Load checkpoints](#save-and-load-checkpoints) section for how to load a specific checkpoint.
The following example will automatically load the latest checkpoint:
```bash
export PYTHONPATH="${PYTHONPATH}:${PWD}/../../.."
export CUDA_VISIBLE_DEVICES=0
python -u synthesis.py --config=${yaml} \
--root=./data/LJSpeech-1.1 \
--name=${ModelName} --use_gpu=true \
--output=./syn_audios \
--sample=${SAMPLE}
```
In this example, `--output` specifies where to save the synthesized audios and `--sample` specifies which sample of the valid dataset (a split of the whole LJSpeech dataset that by default contains its first 16 audio samples) to synthesize, based on the mel-spectrograms computed from the ground-truth audio; e.g., `--sample=0` synthesizes the first audio in the valid dataset.

View File

@ -0,0 +1,16 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .net import *
from .wavenet import *

View File

@ -1,32 +0,0 @@
valid_size: 16
train_clip_second: 0.5
sample_rate: 22050
fft_window_shift: 256
fft_window_size: 1024
fft_size: 2048
mel_bands: 80
seed: 1
batch_size: 8
test_every: 2000
save_every: 10000
max_iterations: 2000000
layers: 30
kernel_width: 2
dilation_block: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
residual_channels: 128
skip_channels: 128
loss_type: mix-gaussian-pdf
num_mixtures: 10
log_scale_min: -9.0
conditioner:
filter_sizes: [[32, 3], [32, 3]]
upsample_factors: [16, 16]
learning_rate: 0.001
gradient_max_norm: 100.0
anneal:
every: 200000
rate: 0.5

View File

@ -1,32 +0,0 @@
valid_size: 16
train_clip_second: 0.5
sample_rate: 22050
fft_window_shift: 256
fft_window_size: 1024
fft_size: 2048
mel_bands: 80
seed: 1
batch_size: 8
test_every: 2000
save_every: 10000
max_iterations: 2000000
layers: 30
kernel_width: 2
dilation_block: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
residual_channels: 128
skip_channels: 128
loss_type: mix-gaussian-pdf
num_mixtures: 1
log_scale_min: -9.0
conditioner:
filter_sizes: [[32, 3], [32, 3]]
upsample_factors: [16, 16]
learning_rate: 0.001
gradient_max_norm: 100.0
anneal:
every: 200000
rate: 0.5

View File

@ -1,31 +0,0 @@
valid_size: 16
train_clip_second: 0.5
sample_rate: 22050
fft_window_shift: 256
fft_window_size: 1024
fft_size: 2048
mel_bands: 80
seed: 1
batch_size: 8
test_every: 2000
save_every: 10000
max_iterations: 2000000
layers: 30
kernel_width: 2
dilation_block: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
residual_channels: 128
skip_channels: 128
loss_type: softmax
num_channels: 2048
conditioner:
filter_sizes: [[32, 3], [32, 3]]
upsample_factors: [16, 16]
learning_rate: 0.001
gradient_max_norm: 100.0
anneal:
every: 200000
rate: 0.5

View File

@ -1,160 +0,0 @@
import random
import librosa
import numpy as np
from paddle import fluid
import utils
from parakeet.datasets import ljspeech
from parakeet.data import dataset
from parakeet.data.sampler import DistributedSampler, BatchSampler
from parakeet.data.datacargo import DataCargo
class Dataset(ljspeech.LJSpeech):
def __init__(self, config):
super(Dataset, self).__init__(config.root)
self.config = config
self.fft_window_shift = config.fft_window_shift
# Calculate context frames.
frames_per_second = config.sample_rate // self.fft_window_shift
train_clip_frames = int(np.ceil(
config.train_clip_second * frames_per_second))
context_frames = config.context_size // self.fft_window_shift
self.num_frames = train_clip_frames + context_frames
def _get_example(self, metadatum):
fname, _, _ = metadatum
wav_path = self.root.joinpath("wavs", fname + ".wav")
config = self.config
sr = config.sample_rate
fft_window_shift = config.fft_window_shift
fft_window_size = config.fft_window_size
fft_size = config.fft_size
audio, loaded_sr = librosa.load(wav_path, sr=None)
assert loaded_sr == sr
# Pad audio to the right size.
frames = int(np.ceil(float(audio.size) / fft_window_shift))
fft_padding = (fft_size - fft_window_shift) // 2
desired_length = frames * fft_window_shift + fft_padding * 2
pad_amount = (desired_length - audio.size) // 2
if audio.size % 2 == 0:
audio = np.pad(audio, (pad_amount, pad_amount), mode='reflect')
else:
audio = np.pad(audio, (pad_amount, pad_amount + 1), mode='reflect')
# Normalize audio.
audio = audio / np.abs(audio).max() * 0.999
# Compute mel-spectrogram.
# Turn center to False to prevent internal padding.
spectrogram = librosa.core.stft(
audio, hop_length=fft_window_shift,
win_length=fft_window_size, n_fft=fft_size, center=False)
spectrogram_magnitude = np.abs(spectrogram)
# Compute mel-spectrograms.
mel_filter_bank = librosa.filters.mel(sr=sr, n_fft=fft_size,
n_mels=config.mel_bands)
mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)
mel_spectrogram = mel_spectrogram.T
# Rescale mel_spectrogram: convert to dB, subtract the reference level, and normalize to [0, 1].
min_level, ref_level = 1e-5, 20
mel_spectrogram = 20 * np.log10(np.maximum(min_level, mel_spectrogram))
mel_spectrogram = mel_spectrogram - ref_level
mel_spectrogram = np.clip((mel_spectrogram + 100) / 100, 0, 1)
# Extract the center of audio that corresponds to mel spectrograms.
audio = audio[fft_padding : -fft_padding]
assert mel_spectrogram.shape[0] * fft_window_shift == audio.size
return audio, mel_spectrogram
class Subset(dataset.Dataset):
def __init__(self, dataset, indices, valid):
self.dataset = dataset
self.indices = indices
self.valid = valid
def __getitem__(self, idx):
fft_window_shift = self.dataset.fft_window_shift
num_frames = self.dataset.num_frames
audio, mel = self.dataset[self.indices[idx]]
if self.valid:
audio_start = 0
else:
# Randomly crop context + train_clip_second of audio.
audio_frames = int(audio.size) // fft_window_shift
max_start_frame = audio_frames - num_frames
assert max_start_frame >= 0, "audio {} is too short".format(idx)
frame_start = random.randint(0, max_start_frame)
frame_end = frame_start + num_frames
audio_start = frame_start * fft_window_shift
audio_end = frame_end * fft_window_shift
audio = audio[audio_start : audio_end]
return audio, mel, audio_start
def _batch_examples(self, batch):
audios = [sample[0] for sample in batch]
audio_starts = [sample[2] for sample in batch]
# mels shape [num_frames, mel_bands]
max_frames = max(sample[1].shape[0] for sample in batch)
mels = [utils.pad_to_size(sample[1], max_frames) for sample in batch]
audios = np.array(audios, dtype=np.float32)
mels = np.array(mels, dtype=np.float32)
audio_starts = np.array(audio_starts, dtype=np.int32)
return audios, mels, audio_starts
def __len__(self):
return len(self.indices)
class LJSpeech:
def __init__(self, config, nranks, rank):
place = fluid.CUDAPlace(rank) if config.use_gpu else fluid.CPUPlace()
# Whole LJSpeech dataset.
ds = Dataset(config)
# Split into train and valid dataset.
indices = list(range(len(ds)))
train_indices = indices[config.valid_size:]
valid_indices = indices[:config.valid_size]
random.shuffle(train_indices)
# Train dataset.
trainset = Subset(ds, train_indices, valid=False)
sampler = DistributedSampler(len(trainset), nranks, rank)
total_bs = config.batch_size
assert total_bs % nranks == 0
train_sampler = BatchSampler(sampler, total_bs // nranks,
drop_last=True)
trainloader = DataCargo(trainset, batch_sampler=train_sampler)
trainreader = fluid.io.PyReader(capacity=50, return_list=True)
trainreader.decorate_batch_generator(trainloader, place)
self.trainloader = (data for _ in iter(int, 1)
for data in trainreader())
# Valid dataset.
validset = Subset(ds, valid_indices, valid=True)
# Currently only support batch_size = 1 for valid loader.
validloader = DataCargo(validset, batch_size=1, shuffle=False)
validreader = fluid.io.PyReader(capacity=20, return_list=True)
validreader.decorate_batch_generator(validloader, place)
self.validloader = validreader

Some files were not shown because too many files have changed in this diff.