dv3 reloaded, back to the origin

This commit is contained in:
chenfeiyu 2020-07-10 20:22:43 +08:00
parent 24eb14a718
commit 282c36c2c1
24 changed files with 1649 additions and 2995 deletions


@ -22,151 +22,118 @@ The model consists of an encoder, a decoder and a converter (and a speaker embed
## Project Structure
```text
├── data.py data processing
├── model.py function to create model, criterion and optimizer
├── configs/ (example) configuration files
├── sentences.txt sample sentences
├── synthesis.py script to synthesize waveform from text
├── train.py script to train a model
└── utils.py utility functions
├── config/ (example) configuration files
├── synthesize.py script to synthesize waveform from text
├── data.py dataset and batching for the preprocessed data
├── preprocess.py script to preprocess the LJSpeech dataset
├── clip.py gradient clipping by value and by global norm
├── train.py script to train the model
└── vocoder.py vocoder wrappers used in training and synthesis
```
## Saving & Loading
`train.py` and `synthesis.py` have 3 arguments in common: `--checkpoint`, `--iteration` and `output`.
## Preprocess
1. `output` is the directory for saving results.
During training, checkpoints are saved in `checkpoints/` inside `output`, and the tensorboard log is saved in `log/` inside `output`. Training states, including alignment plots, spectrogram plots and generated audio files, are saved in `states/` inside `output`. In addition, we periodically evaluate the model with several given sentences; the alignment plots and generated audio files are saved in `eval/` inside `output`.
During synthesis, audio files and alignment plots are saved in `synthesis/` inside `output`.
So after training and synthesizing with the same output directory, its file structure looks like this.
Preprocess the dataset with `preprocess.py`.
```text
├── checkpoints/ # checkpoint directory (including *.pdparams, *.pdopt and a text file `checkpoint` that records the latest checkpoint)
├── states/ # alignment plots, spectrogram plots and generated wavs at training
├── log/ # tensorboard log
├── eval/ # audio files and alignment plots generated at evaluation during training
└── synthesis/ # synthesized audio files and alignment plots
usage: preprocess.py [-h] --config CONFIG --input INPUT --output OUTPUT
preprocess ljspeech dataset and save it.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file
--input INPUT data path of the original data
--output OUTPUT path to save the preprocessed dataset
```
2. `--checkpoint` and `--iteration` are used to load from an existing checkpoint. Checkpoint loading follows these rules:
If `--checkpoint` is provided, the checkpoint at the path specified by `--checkpoint` is loaded.
If `--checkpoint` is not provided, we try to load the checkpoint specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided either, we try to load the latest checkpoint from the checkpoint directory.
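A minimal sketch of this rule, using `load_parameters` from `parakeet.utils.io` the same way `train.py` and `synthesis.py` call it (the helper and argument names here are illustrative):
```python
from parakeet.utils.io import load_parameters

def restore(model, optim, args, checkpoint_dir):
    # --checkpoint takes priority: load exactly that path
    if args.checkpoint is not None:
        return load_parameters(model, optim, checkpoint_path=args.checkpoint)
    # otherwise fall back to --iteration; when it is None too,
    # load_parameters picks the latest checkpoint in checkpoint_dir
    return load_parameters(model, optim,
                           checkpoint_dir=checkpoint_dir,
                           iteration=args.iteration)
```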
example code:
```bash
python preprocess.py --config=configs/ljspeech.yaml --input=LJSpeech-1.1/ --output=data/ljspeech
```
## Train
Train the model using `train.py`; follow the usage displayed by `python train.py --help`.
```text
usage: train.py [-h] [--config CONFIG] [--data DATA] [--device DEVICE]
[--checkpoint CHECKPOINT | --iteration ITERATION]
output
usage: train.py [-h] --config CONFIG --input INPUT
Train a Deep Voice 3 model with LJSpeech dataset.
positional arguments:
output path to save results
train a Deep Voice 3 model with LJSpeech
optional arguments:
-h, --help show this help message and exit
--config CONFIG experiment config
--data DATA The path of the LJSpeech dataset.
--device DEVICE device to use
--checkpoint CHECKPOINT checkpoint to resume from.
--iteration ITERATION the iteration of the checkpoint to load from output directory
-h, --help show this help message and exit
--config CONFIG config file
--input INPUT data path of the original data
```
- `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly, or you can change values in it to train the model with a different configuration.
- `--data` is the path of the LJSpeech dataset, i.e. the folder extracted from the downloaded archive (the folder which contains `metadata.csv`).
- `--device` is the device (gpu id) to use for training. `-1` means CPU.
- `--checkpoint` is the path of the checkpoint.
- `--iteration` is the iteration of the checkpoint to load from output directory.
See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
- `output` is the directory to save results, all results are saved in this directory. The structure of the output directory is shown below.
example code:
```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config=configs/ljspeech.yaml --input=data/ljspeech
```
It creates a `runs` folder; outputs for each run are saved in a separate folder in `runs`, whose name is the start time joined with the hostname. Inside this folder, the tensorboard log, parameters and optimizer states are saved. Parameters (`*.pdparams`) and optimizer states (`*.pdopt`) are named by the step at which they are saved.
```text
├── checkpoints # checkpoint
├── log # tensorboard log
└── states # train and evaluation results
├── alignments # attention
├── lin_spec # linear spectrogram
├── mel_spec # mel spectrogram
└── waveform # waveform (.wav files)
runs/Jul07_09-39-34_instance-mqcyj27y-4/
├── checkpoint
├── events.out.tfevents.1594085974.instance-mqcyj27y-4
├── step-1000000.pdopt
├── step-1000000.pdparams
├── step-100000.pdopt
├── step-100000.pdparams
...
```
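If you need to pick up the most recent checkpoint in such a run folder programmatically, here is a small sketch (the helper name is illustrative; it only relies on the `step-*.pdparams` naming shown above):
```python
import os
import re

def latest_step(run_dir):
    """Return the largest step number among the step-*.pdparams files in run_dir."""
    steps = []
    for name in os.listdir(run_dir):
        m = re.match(r"step-(\d+)\.pdparams$", name)
        if m:
            steps.append(int(m.group(1)))
    return max(steps) if steps else None

# e.g. latest_step("runs/Jul07_09-39-34_instance-mqcyj27y-4") -> 1000000
```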
Example script:
Since we use WaveFlow to synthesize audio during training, download the pretrained WaveFlow model and extract it into the current directory before training.
```bash
python train.py \
--config=configs/ljspeech.yaml \
--data=./LJSpeech-1.1/ \
--device=0 \
experiment
wget https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_1.0.zip
unzip waveflow_res128_ljspeech_ckpt_1.0.zip
```
To train the model in parallel on multiple GPUs, launch the training script with `paddle.distributed.launch`. For example, to train with GPUs `0,1,2,3`, you can use the example script below. Note that for parallel training, devices are specified with `--selected_gpus` passed to `paddle.distributed.launch`; in this case, `--device` passed to `train.py`, if specified, is ignored.
Example script:
## Visualization
You can visualize training losses, check the attention plots, and listen to audio synthesized with teacher forcing during training.
example code:
```bash
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 \
train.py \
--config=configs/ljspeech.yaml \
--data=./LJSpeech-1.1/ \
experiment
```
You can monitor the training log via tensorboard, using the script below.
```bash
cd experiment/log
tensorboard --logdir=.
tensorboard --logdir=runs/ --host=$HOSTNAME --port=8000
```
## Synthesis
```text
usage: synthesis.py [-h] [--config CONFIG] [--device DEVICE]
[--checkpoint CHECKPOINT | --iteration ITERATION]
text output
Synthesize waveform with a checkpoint.
positional arguments:
text text file to synthesize
output path to save synthesized audio
usage: synthesize from a checkpoint [-h] --config CONFIG --input INPUT
--output OUTPUT --checkpoint CHECKPOINT
--monotonic_layers MONOTONIC_LAYERS
optional arguments:
-h, --help show this help message and exit
--config CONFIG experiment config
--device DEVICE device to use
--checkpoint CHECKPOINT checkpoint to resume from
--iteration ITERATION the iteration of the checkpoint to load from output directory
-h, --help show this help message and exit
--config CONFIG config file
--input INPUT text file to synthesize
--output OUTPUT path to save audio
--checkpoint CHECKPOINT
data path of the checkpoint
--monotonic_layers MONOTONIC_LAYERS
monotonic decoder layers, index starts from 1
```
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
- `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
`synthesize.py` is used to synthesize several sentences in a text file.
`--monotonic_layers` gives the indices of the decoder layers that show monotonic diagonal attention. You can find them by inspecting the tensorboard logs; mind that the indices start from 1. For a given model, the layers with monotonic diagonal attention are stable across training and synthesis, but they differ between runs, so once you have identified them from the tensorboard log you can reuse them at synthesis time. Note that only decoder layers that show strong diagonal attention should be considered (see the short sketch after this argument list).
- `--checkpoint` is the path of the checkpoint.
- `--iteration` is the iteration of the checkpoint to load from output directory.
See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
- `text` is the text file to synthesize.
- `output` is the directory to save results. The generated audio files (`*.wav`) and attention plots (`*.png`) are saved in `synthesis/` in the output directory.
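The sketch below mirrors how `synthesize.py` (shown later in this commit) turns the 1-based `--monotonic_layers` string into the per-layer `force_monotonic_attention` flags; the function name is illustrative:
```python
def parse_monotonic_layers(monotonic_layers, decoder_layers):
    """e.g. "5,6" with 8 decoder layers -> [F, F, F, F, T, T, F, F]."""
    # command-line indices start from 1, so shift them to 0-based
    indices = [int(item.strip()) - 1 for item in monotonic_layers.split(',')]
    flags = [False] * decoder_layers
    for i in indices:
        flags[i] = True
    return flags
```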
Example script:
example code:
```bash
python synthesis.py \
--config=configs/ljspeech.yaml \
--device=0 \
--checkpoint="experiment/checkpoints/model_step_005000000" \
sentences.txt experiment
```
or
```bash
python synthesis.py \
--config=configs/ljspeech.yaml \
--device=0 \
--iteration=005000000 \
sentences.txt experiment
CUDA_VISIBLE_DEVICES=2 python synthesize.py \
--config configs/ljspeech.yaml \
--input sentences.txt \
--output outputs/ \
--checkpoint runs/Jul07_09-39-34_instance-mqcyj27y-4/step-1320000 \
--monotonic_layers "5,6"
```

examples/deepvoice3/clip.py Normal file

@ -0,0 +1,181 @@
from __future__ import print_function
import copy
import six
import warnings
import functools
from paddle.fluid import layers
from paddle.fluid import framework
from paddle.fluid import core
from paddle.fluid import name_scope
from paddle.fluid.dygraph import base as imperative_base
from paddle.fluid.clip import GradientClipBase, _correct_clip_op_role_var
class DoubleClip(GradientClipBase):
"""
Clip gradients first by value (``clip_value``) and then by global norm (``clip_norm``).
Adapted from ``paddle.fluid.clip.GradientClipByGlobalNorm``; the global-norm step works as described below.
Given a list of Tensor :math:`t\_list` , calculate the global norm for the elements of all tensors in
:math:`t\_list` , and limit it to ``clip_norm`` .
- If the global norm is greater than ``clip_norm`` , all elements of :math:`t\_list` will be compressed by a ratio.
- If the global norm is less than or equal to ``clip_norm`` , nothing will be done.
The list of Tensor :math:`t\_list` is not passed from this class, but the gradients of all parameters in ``Program`` . If ``need_clip``
is not None, then only part of gradients can be selected for gradient clipping.
Gradient clip will take effect after being set in ``optimizer`` , see the document ``optimizer``
(for example: :ref:`api_fluid_optimizer_SGDOptimizer`).
The clipping formula is:
.. math::
t\_list[i] = t\_list[i] * \\frac{clip\_norm}{\max(global\_norm, clip\_norm)}
where:
.. math::
global\_norm = \sqrt{\sum_{i=0}^{N-1}(l2norm(t\_list[i]))^2}
Args:
clip_norm (float): The maximum norm value.
group_name (str, optional): The group name for this clip. Default value is ``default_group``
need_clip (function, optional): Type: function. This function accepts a ``Parameter`` and returns ``bool``
(True: the gradient of this ``Parameter`` need to be clipped, False: not need). Default: None,
and gradients of all parameters in the network will be clipped.
Examples:
.. code-block:: python
# use for Static mode
import paddle
import paddle.fluid as fluid
import numpy as np
main_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(
main_program=main_prog, startup_program=startup_prog):
image = fluid.data(
name='x', shape=[-1, 2], dtype='float32')
predict = fluid.layers.fc(input=image, size=3, act='relu') # Trainable parameters: fc_0.w.0, fc_0.b.0
loss = fluid.layers.mean(predict)
# Clip all parameters in network:
clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)
# Clip a part of parameters in network: (e.g. fc_0.w_0)
# pass a function (filter_func) to need_clip; filter_func receives a Parameter and returns bool
# def filter_func(Parameter):
# # It can be easily filtered by Parameter.name (name can be set in fluid.ParamAttr, and the default name is fc_0.w_0, fc_0.b_0)
# return Parameter.name=="fc_0.w_0"
# clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0, need_clip=filter_func)
sgd_optimizer = fluid.optimizer.SGDOptimizer(learning_rate=0.1, grad_clip=clip)
sgd_optimizer.minimize(loss)
place = fluid.CPUPlace()
exe = fluid.Executor(place)
x = np.random.uniform(-100, 100, (10, 2)).astype('float32')
exe.run(startup_prog)
out = exe.run(main_prog, feed={'x': x}, fetch_list=loss)
# use for Dygraph mode
import paddle
import paddle.fluid as fluid
with fluid.dygraph.guard():
linear = fluid.dygraph.Linear(10, 10) # Trainable: linear_0.w.0, linear_0.b.0
inputs = fluid.layers.uniform_random([32, 10]).astype('float32')
out = linear(fluid.dygraph.to_variable(inputs))
loss = fluid.layers.reduce_mean(out)
loss.backward()
# Clip all parameters in network:
clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)
# Clip a part of parameters in network: (e.g. linear_0.w_0)
# pass a function (filter_func) to need_clip; filter_func receives a ParamBase and returns bool
# def filter_func(ParamBase):
# # It can be easily filtered by ParamBase.name (name can be set in fluid.ParamAttr, and the default name is linear_0.w_0, linear_0.b_0)
# return ParamBase.name == "linear_0.w_0"
# # Note: linear.weight and linear.bias can return the weight and bias of dygraph.Linear, respectively, and can be used to filter
# return ParamBase.name == linear.weight.name
# clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0, need_clip=filter_func)
sgd_optimizer = fluid.optimizer.SGD(
learning_rate=0.1, parameter_list=linear.parameters(), grad_clip=clip)
sgd_optimizer.minimize(loss)
"""
def __init__(self, clip_value, clip_norm, group_name="default_group", need_clip=None):
super(DoubleClip, self).__init__(need_clip)
self.clip_value = float(clip_value)
self.clip_norm = float(clip_norm)
self.group_name = group_name
def __str__(self):
return "Gradient Clip By Value and GlobalNorm, value={}, global_norm={}".format(
self.clip_value, self.clip_norm)
@imperative_base.no_grad
def _dygraph_clip(self, params_grads):
params_and_grads = []
# clip by value first
for p, g in params_grads:
if g is None:
continue
if self._need_clip_func is not None and not self._need_clip_func(p):
params_and_grads.append((p, g))
continue
new_grad = layers.clip(x=g, min=-self.clip_value, max=self.clip_value)
params_and_grads.append((p, new_grad))
params_grads = params_and_grads
# clip by global norm
params_and_grads = []
sum_square_list = []
for p, g in params_grads:
if g is None:
continue
if self._need_clip_func is not None and not self._need_clip_func(p):
continue
merge_grad = g
if g.type == core.VarDesc.VarType.SELECTED_ROWS:
merge_grad = layers.merge_selected_rows(g)
merge_grad = layers.get_tensor_from_selected_rows(merge_grad)
square = layers.square(merge_grad)
sum_square = layers.reduce_sum(square)
sum_square_list.append(sum_square)
# all parameters have been filtered out
if len(sum_square_list) == 0:
return params_grads
global_norm_var = layers.concat(sum_square_list)
global_norm_var = layers.reduce_sum(global_norm_var)
global_norm_var = layers.sqrt(global_norm_var)
max_global_norm = layers.fill_constant(
shape=[1], dtype='float32', value=self.clip_norm)
clip_var = layers.elementwise_div(
x=max_global_norm,
y=layers.elementwise_max(
x=global_norm_var, y=max_global_norm))
for p, g in params_grads:
if g is None:
continue
if self._need_clip_func is not None and not self._need_clip_func(p):
params_and_grads.append((p, g))
continue
new_grad = layers.elementwise_mul(x=g, y=clip_var)
params_and_grads.append((p, new_grad))
return params_and_grads
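For reference, `train.py` later in this commit attaches this clipper to an Adam optimizer roughly as follows (a sketch; `clip_value` and `clip_norm` come from the configuration file, e.g. 5.0 and 100.0 in `ljspeech.yaml`):
```python
from paddle import fluid
from clip import DoubleClip

def create_optimizer(model, config):
    # clip gradients element-wise by value first, then rescale by global norm
    clip = DoubleClip(config["clip_value"], config["clip_norm"])
    return fluid.optimizer.Adam(config["learning_rate"],
                                parameter_list=model.parameters(),
                                grad_clip=clip)
```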


@ -1,90 +1,45 @@
meta_data:
min_text_length: 20
# data processing
p_pronunciation: 0.99
sample_rate: 22050 # Hz
n_fft: 1024
win_length: 1024
hop_length: 256
n_mels: 80
reduction_factor: 4
transform:
# text
replace_pronunciation_prob: 0.5
# model-s2s
n_speakers: 1
speaker_dim: 16
char_dim: 256
encoder_dim: 64
kernel_size: 5
encoder_layers: 7
decoder_layers: 8
prenet_sizes: [128]
attention_dim: 128
# spectrogram
sample_rate: 22050
max_norm: 0.999
preemphasis: 0.97
n_fft: 1024
win_length: 1024
hop_length: 256
# model-postnet
postnet_layers: 5
postnet_dim: 256
# mel
fmin: 125
fmax: 7600
n_mels: 80
# position embedding
position_weight: 1.0
position_rate: 5.54
forward_step: 4
backward_step: 0
# db scale
min_level_db: -100
ref_level_db: 20
clip_norm: true
dropout: 0.05
# output-griffinlim
sharpening_factor: 1.4
loss:
masked_loss_weight: 0.5
priority_freq: 3000
priority_freq_weight: 0.0
binary_divergence_weight: 0.1
guided_attention_sigma: 0.2
# optimizer:
learning_rate: 0.001
clip_value: 5.0
clip_norm: 100.0
synthesis:
max_steps: 512
power: 1.4
n_iter: 32
model:
# speaker_embedding
n_speakers: 1
speaker_embed_dim: 16
speaker_embedding_weight_std: 0.01
max_positions: 512
dropout: 0.050000000000000044
# encoder
text_embed_dim: 256
embedding_weight_std: 0.1
freeze_embedding: false
padding_idx: 0
encoder_channels: 512
# decoder
query_position_rate: 1.0
key_position_rate: 1.29
trainable_positional_encodings: false
kernel_size: 3
decoder_channels: 256
downsample_factor: 4
outputs_per_step: 1
# attention
key_projection: true
value_projection: true
force_monotonic_attention: true
window_backward: -1
window_ahead: 3
use_memory_mask: true
# converter
use_decoder_state_for_postnet_input: true
converter_channels: 256
optimizer:
beta1: 0.5
beta2: 0.9
epsilon: 1e-6
lr_scheduler:
warmup_steps: 4000
peak_learning_rate: 5e-4
train:
batch_size: 16
max_iteration: 2000000
snap_interval: 1000
eval_interval: 10000
save_interval: 10000
# training:
batch_size: 16
report_interval: 10000
save_interval: 10000
valid_size: 5
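The new configuration is flat, so scripts read its keys directly after `yaml.safe_load`, as `preprocess.py` and `train.py` below do. A minimal sketch (the path assumes the example configuration used in the commands above):
```python
from ruamel import yaml

with open("configs/ljspeech.yaml", "rt") as f:
    config = yaml.safe_load(f)

# flat keys, e.g. the ones preprocess.py prints at startup
print(config["sample_rate"], config["n_fft"], config["reduction_factor"])
```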


@ -1,257 +1,110 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
import os
import csv
from pathlib import Path
import numpy as np
from paddle import fluid
import pandas as pd
import librosa
from scipy import signal
import paddle.fluid.dygraph as dg
import paddle
from paddle import fluid
from paddle.fluid import dygraph as dg
from paddle.fluid.dataloader import Dataset, BatchSampler
from paddle.fluid.io import DataLoader
from parakeet.g2p.en import text_to_sequence, sequence_to_text
from parakeet.data import DatasetMixin, TransformDataset, FilterDataset, CacheDataset
from parakeet.data import DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler, BucketSampler
from parakeet.data import DatasetMixin, DataCargo, PartialyRandomizedSimilarTimeLengthSampler
from parakeet.g2p import en
class LJSpeechMetaData(DatasetMixin):
class LJSpeech(DatasetMixin):
def __init__(self, root):
self.root = Path(root)
self._wav_dir = self.root.joinpath("wavs")
csv_path = self.root.joinpath("metadata.csv")
self._root = root
self._table = pd.read_csv(
csv_path,
sep="|",
encoding="utf-8",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
os.path.join(root, "metadata.csv"),
sep="|",
encoding="utf-8",
quoting=csv.QUOTE_NONE,
header=None,
names=["num_frames", "spec_name", "mel_name", "text"],
dtype={"num_frames": np.int64, "spec_name": str, "mel_name":str, "text":str})
def num_frames(self):
return self._table["num_frames"].to_list()
def get_example(self, i):
fname, raw_text, normalized_text = self._table.iloc[i]
fname = str(self._wav_dir.joinpath(fname + ".wav"))
return fname, raw_text, normalized_text
"""
spec (T_frame, C_spec)
mel (T_frame, C_mel)
"""
num_frames, spec_name, mel_name, text = self._table.iloc[i]
spec = np.load(os.path.join(self._root, spec_name))
mel = np.load(os.path.join(self._root, mel_name))
return (text, spec, mel, num_frames)
def __len__(self):
return len(self._table)
class Transform(object):
def __init__(self,
replace_pronunciation_prob=0.,
sample_rate=22050,
preemphasis=.97,
n_fft=1024,
win_length=1024,
hop_length=256,
fmin=125,
fmax=7600,
n_mels=80,
min_level_db=-100,
ref_level_db=20,
max_norm=0.999,
clip_norm=True):
self.replace_pronunciation_prob = replace_pronunciation_prob
self.sample_rate = sample_rate
self.preemphasis = preemphasis
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.fmin = fmin
self.fmax = fmax
self.n_mels = n_mels
self.min_level_db = min_level_db
self.ref_level_db = ref_level_db
self.max_norm = max_norm
self.clip_norm = clip_norm
def __call__(self, in_data):
fname, _, normalized_text = in_data
# text processing
mix_grapheme_phonemes = text_to_sequence(
normalized_text, self.replace_pronunciation_prob)
text_length = len(mix_grapheme_phonemes)
# CAUTION: positions start from 1
speaker_id = None
# wave processing
wav, _ = librosa.load(fname, sr=self.sample_rate)
# preemphasis
y = signal.lfilter([1., -self.preemphasis], [1.], wav)
# STFT
D = librosa.stft(
y=y,
n_fft=self.n_fft,
win_length=self.win_length,
hop_length=self.hop_length)
S = np.abs(D)
# to db and normalize to 0-1
amplitude_min = np.exp(self.min_level_db / 20 * np.log(10)) # 1e-5
S_norm = 20 * np.log10(np.maximum(amplitude_min,
S)) - self.ref_level_db
S_norm = (S_norm - self.min_level_db) / (-self.min_level_db)
S_norm = self.max_norm * S_norm
if self.clip_norm:
S_norm = np.clip(S_norm, 0, self.max_norm)
# mel scale and to db and normalize to 0-1,
# CAUTION: pass linear scale S, not dbscaled S
S_mel = librosa.feature.melspectrogram(
S=S, n_mels=self.n_mels, fmin=self.fmin, fmax=self.fmax, power=1.)
S_mel = 20 * np.log10(np.maximum(amplitude_min,
S_mel)) - self.ref_level_db
S_mel_norm = (S_mel - self.min_level_db) / (-self.min_level_db)
S_mel_norm = self.max_norm * S_mel_norm
if self.clip_norm:
S_mel_norm = np.clip(S_mel_norm, 0, self.max_norm)
# num_frames
n_frames = S_mel_norm.shape[-1] # CAUTION: original number of frames
return (mix_grapheme_phonemes, text_length, speaker_id, S_norm.T,
S_mel_norm.T, n_frames)
class DataCollector(object):
def __init__(self, downsample_factor=4, r=1):
self.downsample_factor = int(downsample_factor)
self.frames_per_step = int(r)
self._factor = int(downsample_factor * r)
# CAUTION: small diff here
self._pad_begin = int(downsample_factor * r)
def __init__(self, p_pronunciation):
self.p_pronunciation = p_pronunciation
def __call__(self, examples):
batch_size = len(examples)
"""
output shape and dtype
(B, T_text) int64
(B,) int64
(B, T_frame, C_spec) float32
(B, T_frame, C_mel) float32
(B,) int64
"""
text_seqs = []
specs = []
mels = []
num_frames = np.array([example[3] for example in examples], dtype=np.int64)
max_frames = np.max(num_frames)
# lengths
text_lengths = np.array([example[1]
for example in examples]).astype(np.int64)
frames = np.array([example[5]
for example in examples]).astype(np.int64)
max_text_length = int(np.max(text_lengths))
max_frames = int(np.max(frames))
if max_frames % self._factor != 0:
max_frames += (self._factor - max_frames % self._factor)
max_frames += self._pad_begin
max_decoder_length = max_frames // self._factor
# pad time sequence
text_sequences = []
lin_specs = []
mel_specs = []
done_flags = []
for example in examples:
(mix_grapheme_phonemes, text_length, speaker_id, S_norm,
S_mel_norm, num_frames) = example
text_sequences.append(
np.pad(mix_grapheme_phonemes, (0, max_text_length - text_length
),
mode="constant"))
lin_specs.append(
np.pad(S_norm, ((self._pad_begin, max_frames - self._pad_begin
- num_frames), (0, 0)),
mode="constant"))
mel_specs.append(
np.pad(S_mel_norm, ((self._pad_begin, max_frames -
self._pad_begin - num_frames), (0, 0)),
mode="constant"))
done_flags.append(
np.pad(np.zeros((int(np.ceil(num_frames // self._factor)), )),
(0, max_decoder_length - int(
np.ceil(num_frames // self._factor))),
mode="constant",
constant_values=1))
text_sequences = np.array(text_sequences).astype(np.int64)
lin_specs = np.array(lin_specs).astype(np.float32)
mel_specs = np.array(mel_specs).astype(np.float32)
text, spec, mel, _ = example
text_seqs.append(en.text_to_sequence(text, self.p_pronunciation))
# if max_frames - mel.shape[0] < 0:
# import pdb; pdb.set_trace()
specs.append(np.pad(spec, [(0, max_frames - spec.shape[0]), (0, 0)]))
mels.append(np.pad(mel, [(0, max_frames - mel.shape[0]), (0, 0)]))
# downsample here
done_flags = np.array(done_flags).astype(np.float32)
specs = np.stack(specs)
mels = np.stack(mels)
# text positions
text_mask = (np.arange(1, 1 + max_text_length) <= np.expand_dims(
text_lengths, -1)).astype(np.int64)
text_positions = np.arange(
1, 1 + max_text_length, dtype=np.int64) * text_mask
text_lengths = np.array([len(seq) for seq in text_seqs], dtype=np.int64)
max_length = np.max(text_lengths)
text_seqs = np.array([seq + [0] * (max_length - len(seq)) for seq in text_seqs], dtype=np.int64)
return text_seqs, text_lengths, specs, mels, num_frames
# decoder_positions
decoder_positions = np.tile(
np.expand_dims(
np.arange(
1, 1 + max_decoder_length, dtype=np.int64), 0),
(batch_size, 1))
if __name__ == "__main__":
import argparse
import tqdm
import time
from ruamel import yaml
return (text_sequences, text_lengths, text_positions, mel_specs,
lin_specs, frames, decoder_positions, done_flags)
parser = argparse.ArgumentParser(description="load the preprocessed ljspeech dataset")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
print("========= Command Line Arguments ========")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
print("=========== Configurations ==============")
for k in ["p_pronunciation", "batch_size"]:
print("{}: {}".format(k, config[k]))
ljspeech = LJSpeech(args.input)
collate_fn = DataCollector(config["p_pronunciation"])
def make_data_loader(data_root, config):
# construct meta data
meta = LJSpeechMetaData(data_root)
dg.enable_dygraph(fluid.CPUPlace())
sampler = PartialyRandomizedSimilarTimeLengthSampler(ljspeech.num_frames())
cargo = DataCargo(ljspeech, collate_fn,
batch_size=config["batch_size"], sampler=sampler)
loader = DataLoader\
.from_generator(capacity=5, return_list=True)\
.set_batch_generator(cargo)
# filter it!
min_text_length = config["meta_data"]["min_text_length"]
meta = FilterDataset(meta, lambda x: len(x[2]) >= min_text_length)
# transform meta data into meta data
c = config["transform"]
transform = Transform(
replace_pronunciation_prob=c["replace_pronunciation_prob"],
sample_rate=c["sample_rate"],
preemphasis=c["preemphasis"],
n_fft=c["n_fft"],
win_length=c["win_length"],
hop_length=c["hop_length"],
fmin=c["fmin"],
fmax=c["fmax"],
n_mels=c["n_mels"],
min_level_db=c["min_level_db"],
ref_level_db=c["ref_level_db"],
max_norm=c["max_norm"],
clip_norm=c["clip_norm"])
ljspeech = TransformDataset(meta, transform)
# use meta data's text length as a sort key for the sampler
batch_size = config["train"]["batch_size"]
text_lengths = [len(example[2]) for example in meta]
sampler = PartialyRandomizedSimilarTimeLengthSampler(text_lengths,
batch_size)
env = dg.parallel.ParallelEnv()
num_trainers = env.nranks
local_rank = env.local_rank
sampler = BucketSampler(
text_lengths, batch_size, num_trainers=num_trainers, rank=local_rank)
# some model hyperparameters affect how we process data
model_config = config["model"]
collector = DataCollector(
downsample_factor=model_config["downsample_factor"],
r=model_config["outputs_per_step"])
ljspeech_loader = DataCargo(
ljspeech, batch_fn=collector, batch_size=batch_size, sampler=sampler)
loader = fluid.io.DataLoader.from_generator(capacity=10, return_list=True)
loader.set_batch_generator(
ljspeech_loader, places=fluid.framework._current_expected_place())
return loader
for i, batch in tqdm.tqdm(enumerate(loader)):
continue

Binary file not shown.



@ -1,164 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddle import fluid
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from parakeet.g2p import en
from parakeet.models.deepvoice3 import Encoder, Decoder, Converter, DeepVoice3, TTSLoss, ConvSpec, WindowRange
from parakeet.utils.layer_tools import summary, freeze
def make_model(config):
c = config["model"]
# speaker embedding
n_speakers = c["n_speakers"]
speaker_dim = c["speaker_embed_dim"]
if n_speakers > 1:
speaker_embed = dg.Embedding(
(n_speakers, speaker_dim),
param_attr=I.Normal(scale=c["speaker_embedding_weight_std"]))
else:
speaker_embed = None
# encoder
h = c["encoder_channels"]
k = c["kernel_size"]
encoder_convolutions = (
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1),
ConvSpec(h, k, 3), )
encoder = Encoder(
n_vocab=en.n_vocab,
embed_dim=c["text_embed_dim"],
n_speakers=n_speakers,
speaker_dim=speaker_dim,
embedding_weight_std=c["embedding_weight_std"],
convolutions=encoder_convolutions,
dropout=c["dropout"])
if c["freeze_embedding"]:
freeze(encoder.embed)
# decoder
h = c["decoder_channels"]
k = c["kernel_size"]
prenet_convolutions = (ConvSpec(h, k, 1), ConvSpec(h, k, 3))
attentive_convolutions = (
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1), )
attention = [True, False, False, False, True]
force_monotonic_attention = [True, False, False, False, True]
window = WindowRange(c["window_backward"], c["window_ahead"])
decoder = Decoder(
n_speakers,
speaker_dim,
embed_dim=c["text_embed_dim"],
mel_dim=config["transform"]["n_mels"],
r=c["outputs_per_step"],
max_positions=c["max_positions"],
preattention=prenet_convolutions,
convolutions=attentive_convolutions,
attention=attention,
dropout=c["dropout"],
use_memory_mask=c["use_memory_mask"],
force_monotonic_attention=force_monotonic_attention,
query_position_rate=c["query_position_rate"],
key_position_rate=c["key_position_rate"],
window_range=window,
key_projection=c["key_projection"],
value_projection=c["value_projection"])
if not c["trainable_positional_encodings"]:
freeze(decoder.embed_keys_positions)
freeze(decoder.embed_query_positions)
# converter(postnet)
linear_dim = 1 + config["transform"]["n_fft"] // 2
h = c["converter_channels"]
k = c["kernel_size"]
postnet_convolutions = (
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(2 * h, k, 1),
ConvSpec(2 * h, k, 3), )
use_decoder_states = c["use_decoder_state_for_postnet_input"]
converter = Converter(
n_speakers,
speaker_dim,
in_channels=decoder.state_dim
if use_decoder_states else config["transform"]["n_mels"],
linear_dim=linear_dim,
time_upsampling=c["downsample_factor"],
convolutions=postnet_convolutions,
dropout=c["dropout"])
model = DeepVoice3(
encoder,
decoder,
converter,
speaker_embed,
use_decoder_states=use_decoder_states)
return model
def make_criterion(config):
# =========================loss=========================
loss_config = config["loss"]
transform_config = config["transform"]
model_config = config["model"]
priority_freq = loss_config["priority_freq"] # Hz
sample_rate = transform_config["sample_rate"]
linear_dim = 1 + transform_config["n_fft"] // 2
priority_bin = int(priority_freq / (0.5 * sample_rate) * linear_dim)
criterion = TTSLoss(
masked_weight=loss_config["masked_loss_weight"],
priority_bin=priority_bin,
priority_weight=loss_config["priority_freq_weight"],
binary_divergence_weight=loss_config["binary_divergence_weight"],
guided_attention_sigma=loss_config["guided_attention_sigma"],
downsample_factor=model_config["downsample_factor"],
r=model_config["outputs_per_step"])
return criterion
def make_optimizer(model, config):
# =========================lr_scheduler=========================
lr_config = config["lr_scheduler"]
warmup_steps = lr_config["warmup_steps"]
peak_learning_rate = lr_config["peak_learning_rate"]
lr_scheduler = dg.NoamDecay(1 / (warmup_steps * (peak_learning_rate)**2),
warmup_steps)
# =========================optimizer=========================
optim_config = config["optimizer"]
optim = fluid.optimizer.Adam(
lr_scheduler,
beta1=optim_config["beta1"],
beta2=optim_config["beta2"],
epsilon=optim_config["epsilon"],
parameter_list=model.parameters(),
grad_clip=fluid.clip.GradientClipByGlobalNorm(0.1))
return optim


@ -0,0 +1,122 @@
from __future__ import division
import os
import argparse
from ruamel import yaml
import tqdm
from os.path import join
import csv
import numpy as np
import pandas as pd
import librosa
import logging
from parakeet.data import DatasetMixin
class LJSpeechMetaData(DatasetMixin):
def __init__(self, root):
self.root = root
self._wav_dir = join(root, "wavs")
csv_path = join(root, "metadata.csv")
self._table = pd.read_csv(
csv_path,
sep="|",
encoding="utf-8",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
def get_example(self, i):
fname, raw_text, normalized_text = self._table.iloc[i]
abs_fname = join(self._wav_dir, fname + ".wav")
return fname, abs_fname, raw_text, normalized_text
def __len__(self):
return len(self._table)
class Transform(object):
def __init__(self, sample_rate, n_fft, hop_length, win_length, n_mels, reduction_factor):
self.sample_rate = sample_rate
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.n_mels = n_mels
self.reduction_factor = reduction_factor
def __call__(self, fname):
# wave processing
audio, _ = librosa.load(fname, sr=self.sample_rate)
# Pad the data to the right size to have a whole number of timesteps,
# accounting properly for the model reduction factor.
frames = audio.size // (self.reduction_factor * self.hop_length) + 1
# librosa's stft extracts frames of n_fft size, so we should pad n_fft // 2 on both sides
desired_length = (frames * self.reduction_factor - 1) * self.hop_length + self.n_fft
pad_amount = (desired_length - audio.size) // 2
# we pad manually to control the number of generated frames
if audio.size % 2 == 0:
audio = np.pad(audio, (pad_amount, pad_amount), mode='reflect')
else:
audio = np.pad(audio, (pad_amount, pad_amount + 1), mode='reflect')
# STFT
D = librosa.stft(audio, self.n_fft, self.hop_length, self.win_length, center=False)
S = np.abs(D)
S_mel = librosa.feature.melspectrogram(sr=self.sample_rate, S=S, n_mels=self.n_mels, fmax=8000.0)
# log magnitude
log_spectrogram = np.log(np.clip(S, a_min=1e-5, a_max=None))
log_mel_spectrogram = np.log(np.clip(S_mel, a_min=1e-5, a_max=None))
num_frames = log_spectrogram.shape[-1]
assert num_frames % self.reduction_factor == 0, "num_frames is wrong"
return (log_spectrogram.T, log_mel_spectrogram.T, num_frames)
def save(output_path, dataset, transform):
if not os.path.exists(output_path):
os.makedirs(output_path)
records = []
for example in tqdm.tqdm(dataset):
fname, abs_fname, _, normalized_text = example
log_spec, log_mel_spec, num_frames = transform(abs_fname)
records.append((num_frames,
fname + "_spec.npy",
fname + "_mel.npy",
normalized_text))
np.save(join(output_path, fname + "_spec"), log_spec)
np.save(join(output_path, fname + "_mel"), log_mel_spec)
meta_data = pd.DataFrame.from_records(records)
meta_data.to_csv(join(output_path, "metadata.csv"),
quoting=csv.QUOTE_NONE, sep="|", encoding="utf-8",
header=False, index=False)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="preprocess ljspeech dataset and save it.")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
parser.add_argument("--output", type=str, required=True, help="path to save the preprocessed dataset")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
print("========= Command Line Arguments ========")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
print("=========== Configurations ==============")
for k in ["sample_rate", "n_fft", "win_length",
"hop_length", "n_mels", "reduction_factor"]:
print("{}: {}".format(k, config[k]))
ljspeech_meta = LJSpeechMetaData(args.input)
transform = Transform(config["sample_rate"],
config["n_fft"],
config["hop_length"],
config["win_length"],
config["n_mels"],
config["reduction_factor"])
save(args.output, ljspeech_meta, transform)
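The reflect padding in `Transform.__call__` is sized so that a `center=False` STFT produces a frame count that is an exact multiple of the reduction factor. A small self-contained check of that arithmetic, using the values from `ljspeech.yaml` (hop_length 256, n_fft 1024, reduction_factor 4):
```python
hop_length, n_fft, reduction_factor = 256, 1024, 4

def padded_num_frames(num_samples):
    # same arithmetic as Transform.__call__: pad to `desired`, then a
    # center=False STFT yields 1 + (desired - n_fft) // hop_length frames
    frames = num_samples // (reduction_factor * hop_length) + 1
    desired = (frames * reduction_factor - 1) * hop_length + n_fft
    return 1 + (desired - n_fft) // hop_length

for n in (22050, 100000, 123457):  # arbitrary audio lengths
    assert padded_num_frames(n) % reduction_factor == 0
```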


@ -1,6 +1,5 @@
Scientists at the CERN laboratory say they have discovered a new particle.
There's a way to measure the acute emotional intelligence that has never gone out of style.
President Trump met with other leaders at the Group of 20 conference.
Generative adversarial network or variational auto-encoder.
Please call Stella.
Some have accepted this as a miracle without any physical explanation.
Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition
in being comparatively modern.
For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process
produced the block books, which were the immediate predecessors of the true printed book,
the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.


@ -1,91 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import os
import argparse
import ruamel.yaml
import numpy as np
import soundfile as sf
from paddle import fluid
fluid.require_version('1.8.0')
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
from tensorboardX import SummaryWriter
from parakeet.g2p import en
from parakeet.modules.weight_norm import WeightNormWrapper
from parakeet.utils.layer_tools import summary
from parakeet.utils import io
from model import make_model
from utils import make_evaluator
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Synthsize waveform with a checkpoint.")
parser.add_argument("--config", type=str, help="experiment config")
parser.add_argument("--device", type=int, default=-1, help="device to use")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument("text", type=str, help="text file to synthesize")
parser.add_argument(
"output", type=str, help="path to save synthesized audio")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
print("Command Line Args: ")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
dg.enable_dygraph(place)
model = make_model(config)
checkpoint_dir = os.path.join(args.output, "checkpoints")
if args.checkpoint is not None:
iteration = io.load_parameters(model, checkpoint_path=args.checkpoint)
else:
iteration = io.load_parameters(
model, checkpoint_dir=checkpoint_dir, iteration=args.iteration)
# WARNING: don't forget to remove weight norm to re-compute each wrapped layer's weight
# removing weight norm also speeds up computation
for layer in model.sublayers():
if isinstance(layer, WeightNormWrapper):
layer.remove_weight_norm()
synthesis_dir = os.path.join(args.output, "synthesis")
if not os.path.exists(synthesis_dir):
os.makedirs(synthesis_dir)
with open(args.text, "rt", encoding="utf-8") as f:
lines = f.readlines()
sentences = [line[:-1] for line in lines]
evaluator = make_evaluator(config, sentences, synthesis_dir)
evaluator(model, iteration)


@ -0,0 +1,80 @@
import numpy as np
from matplotlib import cm
import librosa
import os
import time
import tqdm
import argparse
from ruamel import yaml
import paddle
from paddle import fluid
from paddle.fluid import layers as F
from paddle.fluid import dygraph as dg
from paddle.fluid.io import DataLoader
from tensorboardX import SummaryWriter
import soundfile as sf
from parakeet.data import SliceDataset, DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler
from parakeet.utils.io import save_parameters, load_parameters, add_yaml_config_to_args
from parakeet.g2p import en
from vocoder import WaveflowVocoder
from train import create_model
def main(args, config):
model = create_model(config)
loaded_step = load_parameters(model, checkpoint_path=args.checkpoint)
model.eval()
vocoder = WaveflowVocoder()
vocoder.model.eval()
if not os.path.exists(args.output):
os.makedirs(args.output)
monotonic_layers = [int(item.strip()) - 1 for item in args.monotonic_layers.split(',')]
with open(args.input, 'rt') as f:
sentences = [line.strip() for line in f.readlines()]
for i, sentence in enumerate(sentences):
wav = synthesize(config, model, vocoder, sentence, monotonic_layers)
sf.write(os.path.join(args.output, "sentence{}.wav".format(i)),
wav, samplerate=config["sample_rate"])
def synthesize(config, model, vocoder, sentence, monotonic_layers):
print("[synthesize] {}".format(sentence))
text = en.text_to_sequence(sentence, p=1.0)
text = np.expand_dims(np.array(text, dtype="int64"), 0)
lengths = np.array([text.size], dtype=np.int64)
text_seqs = dg.to_variable(text)
text_lengths = dg.to_variable(lengths)
decoder_layers = config["decoder_layers"]
force_monotonic_attention = [False] * decoder_layers
for i in monotonic_layers:
force_monotonic_attention[i] = True
with dg.no_grad():
outputs = model(text_seqs, text_lengths, speakers=None,
force_monotonic_attention=force_monotonic_attention,
window=(config["backward_step"], config["forward_step"]))
decoded, refined, attentions = outputs
wav = vocoder(F.transpose(decoded, (0, 2, 1)))
wav_np = wav.numpy()[0]
return wav_np
if __name__ == "__main__":
import argparse
from ruamel import yaml
parser = argparse.ArgumentParser("synthesize from a checkpoint")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="text file to synthesize")
parser.add_argument("--output", type=str, required=True, help="path to save audio")
parser.add_argument("--checkpoint", type=str, required=True, help="data path of the checkpoint")
parser.add_argument("--monotonic_layers", type=str, required=True, help="monotonic decoder layer, index starts friom 1")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
dg.enable_dygraph(fluid.CUDAPlace(0))
main(args, config)


@ -1,172 +1,187 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import time
import numpy as np
from matplotlib import cm
import librosa
import os
import argparse
import ruamel.yaml
import time
import tqdm
from tensorboardX import SummaryWriter
import paddle
from paddle import fluid
fluid.require_version('1.8.0')
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
from parakeet.utils.io import load_parameters, save_parameters
from paddle.fluid import layers as F
from paddle.fluid import dygraph as dg
from paddle.fluid.io import DataLoader
from tensorboardX import SummaryWriter
from data import make_data_loader
from model import make_model, make_criterion, make_optimizer
from utils import make_output_tree, add_options, get_place, Evaluator, StateSaver, make_evaluator, make_state_saver
from parakeet.models.deepvoice3 import Encoder, Decoder, PostNet, SpectraNet
from parakeet.data import SliceDataset, DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler
from parakeet.utils.io import save_parameters, load_parameters
from parakeet.g2p import en
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Train a Deep Voice 3 model with LJSpeech dataset.")
add_options(parser)
args, _ = parser.parse_known_args()
from data import LJSpeech, DataCollector
from vocoder import WaveflowVocoder, GriffinLimVocoder
from clip import DoubleClip
# only use args.device when training in single process
# when training with distributed.launch, devices are provided by
# `--selected_gpus` for distributed.launch
env = dg.parallel.ParallelEnv()
device_id = env.dev_id if env.nranks > 1 else args.device
place = get_place(device_id)
# start dygraph
dg.enable_dygraph(place)
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
def create_model(config):
char_embedding = dg.Embedding((en.n_vocab, config["char_dim"]))
multi_speaker = config["n_speakers"] > 1
speaker_embedding = dg.Embedding((config["n_speakers"], config["speaker_dim"])) \
if multi_speaker else None
encoder = Encoder(config["encoder_layers"], config["char_dim"],
config["encoder_dim"], config["kernel_size"],
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
keep_prob=1.0 - config["dropout"])
decoder = Decoder(config["n_mels"], config["reduction_factor"],
list(config["prenet_sizes"]) + [config["char_dim"]],
config["decoder_layers"], config["kernel_size"],
config["attention_dim"],
position_encoding_weight=config["position_weight"],
omega=config["position_rate"],
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
keep_prob=1.0 - config["dropout"])
postnet = PostNet(config["postnet_layers"], config["char_dim"],
config["postnet_dim"], config["kernel_size"],
config["n_mels"], config["reduction_factor"],
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
keep_prob=1.0 - config["dropout"])
spectranet = SpectraNet(char_embedding, speaker_embedding, encoder, decoder, postnet)
return spectranet
print("Command Line Args: ")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
def create_data(config, data_path):
dataset = LJSpeech(data_path)
data_loader = make_data_loader(args.data, config)
model = make_model(config)
if env.nranks > 1:
strategy = dg.parallel.prepare_context()
model = dg.DataParallel(model, strategy)
criterion = make_criterion(config)
optim = make_optimizer(model, config)
train_dataset = SliceDataset(dataset, config["valid_size"], len(dataset))
train_collator = DataCollector(config["p_pronunciation"])
train_sampler = PartialyRandomizedSimilarTimeLengthSampler(
dataset.num_frames()[config["valid_size"]:])
train_cargo = DataCargo(train_dataset, train_collator,
batch_size=config["batch_size"], sampler=train_sampler)
train_loader = DataLoader\
.from_generator(capacity=10, return_list=True)\
.set_batch_generator(train_cargo)
# generation
synthesis_config = config["synthesis"]
power = synthesis_config["power"]
n_iter = synthesis_config["n_iter"]
valid_dataset = SliceDataset(dataset, 0, config["valid_size"])
valid_collector = DataCollector(1.)
valid_sampler = SequentialSampler(valid_dataset)
valid_cargo = DataCargo(valid_dataset, valid_collector,
batch_size=1, sampler=valid_sampler)
valid_loader = DataLoader\
.from_generator(capacity=2, return_list=True)\
.set_batch_generator(valid_cargo)
return train_loader, valid_loader
# tensorboard & checkpoint preparation
output_dir = args.output
ckpt_dir = os.path.join(output_dir, "checkpoints")
log_dir = os.path.join(output_dir, "log")
state_dir = os.path.join(output_dir, "states")
eval_dir = os.path.join(output_dir, "eval")
if env.local_rank == 0:
make_output_tree(output_dir)
writer = SummaryWriter(logdir=log_dir)
else:
writer = None
sentences = [
"Scientists at the CERN laboratory say they have discovered a new particle.",
"There's a way to measure the acute emotional intelligence that has never gone out of style.",
"President Trump met with other leaders at the Group of 20 conference.",
"Generative adversarial network or variational auto-encoder.",
"Please call Stella.",
"Some have accepted this as a miracle without any physical explanation.",
]
evaluator = make_evaluator(config, sentences, eval_dir, writer)
state_saver = make_state_saver(config, state_dir, writer)
def create_optimizer(model, config):
optim = fluid.optimizer.Adam(config["learning_rate"],
parameter_list=model.parameters(),
grad_clip=DoubleClip(config["clip_value"], config["clip_norm"]))
return optim
# load parameters and optimizer, and update the number of iterations done so far
if args.checkpoint is not None:
iteration = load_parameters(
model, optim, checkpoint_path=args.checkpoint)
else:
iteration = load_parameters(
model, optim, checkpoint_dir=ckpt_dir, iteration=args.iteration)
def train(args, config):
model = create_model(config)
train_loader, valid_loader = create_data(config, args.input)
optim = create_optimizer(model, config)
# =========================train=========================
train_config = config["train"]
max_iter = train_config["max_iteration"]
snap_interval = train_config["snap_interval"]
save_interval = train_config["save_interval"]
eval_interval = train_config["eval_interval"]
global_step = iteration + 1
iterator = iter(tqdm.tqdm(data_loader))
downsample_factor = config["model"]["downsample_factor"]
while global_step <= max_iter:
global global_step
max_iteration = 2000000
iterator = iter(tqdm.tqdm(train_loader))
while global_step <= max_iteration:
# get inputs
try:
batch = next(iterator)
except StopIteration as e:
iterator = iter(tqdm.tqdm(data_loader))
except StopIteration:
iterator = iter(tqdm.tqdm(train_loader))
batch = next(iterator)
# unzip it
text_seqs, text_lengths, specs, mels, num_frames = batch
# forward & backward
model.train()
(text_sequences, text_lengths, text_positions, mel_specs, lin_specs,
frames, decoder_positions, done_flags) = batch
downsampled_mel_specs = F.strided_slice(
mel_specs,
axes=[1],
starts=[0],
ends=[mel_specs.shape[1]],
strides=[downsample_factor])
outputs = model(
text_sequences,
text_positions,
text_lengths,
None,
downsampled_mel_specs,
decoder_positions, )
# mel_outputs, linear_outputs, alignments, done
inputs = (downsampled_mel_specs, lin_specs, done_flags, text_lengths,
frames)
losses = criterion(outputs, inputs)
outputs = model(text_seqs, text_lengths, speakers=None, mel=mels)
decoded, refined, attentions, final_state = outputs
l = losses["loss"]
if env.nranks > 1:
l = model.scale_loss(l)
l.backward()
model.apply_collective_grads()
else:
l.backward()
causal_mel_loss = model.spec_loss(decoded, mels, num_frames)
non_causal_mel_loss = model.spec_loss(refined, mels, num_frames)
loss = causal_mel_loss + non_causal_mel_loss
loss.backward()
# record learning rate before updating
if env.local_rank == 0:
writer.add_scalar("learning_rate",
optim._learning_rate.step().numpy(), global_step)
optim.minimize(l)
optim.clear_gradients()
# update
optim.minimize(loss)
# record step losses
step_loss = {k: v.numpy()[0] for k, v in losses.items()}
# logging
tqdm.tqdm.write("[train] step: {}\tloss: {:.6f}\tcausal:{:.6f}\tnon_causal:{:.6f}".format(
global_step,
loss.numpy()[0],
causal_mel_loss.numpy()[0],
non_causal_mel_loss.numpy()[0]))
writer.add_scalar("loss/causal_mel_loss", causal_mel_loss.numpy()[0], global_step=global_step)
writer.add_scalar("loss/non_causal_mel_loss", non_causal_mel_loss.numpy()[0], global_step=global_step)
writer.add_scalar("loss/loss", loss.numpy()[0], global_step=global_step)
if global_step % config["report_interval"] == 0:
text_length = int(text_lengths.numpy()[0])
num_frame = int(num_frames.numpy()[0])
if env.local_rank == 0:
tqdm.tqdm.write("[Train] global_step: {}\tloss: {}".format(
global_step, step_loss["loss"]))
for k, v in step_loss.items():
writer.add_scalar(k, v, global_step)
tag = "train_mel/ground-truth"
img = cm.viridis(normalize(mels.numpy()[0, :num_frame].T))
writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
# train state saving, the first sentence in the batch
if env.local_rank == 0 and global_step % snap_interval == 0:
input_specs = (mel_specs, lin_specs)
state_saver(outputs, input_specs, global_step)
tag = "train_mel/decoded"
img = cm.viridis(normalize(decoded.numpy()[0, :num_frame].T))
writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
# evaluation
if env.local_rank == 0 and global_step % eval_interval == 0:
evaluator(model, global_step)
tag = "train_mel/refined"
img = cm.viridis(normalize(refined.numpy()[0, :num_frame].T))
writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
# save checkpoint
if env.local_rank == 0 and global_step % save_interval == 0:
save_parameters(ckpt_dir, global_step, model, optim)
vocoder = WaveflowVocoder()
vocoder.model.eval()
tag = "train_audio/ground-truth-waveflow"
wav = vocoder(F.transpose(mels[0:1, :num_frame, :], (0, 2, 1)))
writer.add_audio(tag, wav.numpy()[0], global_step=global_step, sample_rate=22050)
tag = "train_audio/decoded-waveflow"
wav = vocoder(F.transpose(decoded[0:1, :num_frame, :], (0, 2, 1)))
writer.add_audio(tag, wav.numpy()[0], global_step=global_step, sample_rate=22050)
tag = "train_audio/refined-waveflow"
wav = vocoder(F.transpose(refined[0:1, :num_frame, :], (0, 2, 1)))
writer.add_audio(tag, wav.numpy()[0], global_step=global_step, sample_rate=22050)
attentions_np = attentions.numpy()
attentions_np = attentions_np[:, 0, :num_frame // 4 , :text_length]
for i, attention_layer in enumerate(np.rot90(attentions_np, axes=(1,2))):
tag = "train_attention/layer_{}".format(i)
img = cm.viridis(normalize(attention_layer))
writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
if global_step % config["save_interval"] == 0:
save_parameters(writer.logdir, global_step, model, optim)
# global step +1
global_step += 1
def normalize(arr):
return (arr - arr.min()) / (arr.max() - arr.min())
if __name__ == "__main__":
import argparse
from ruamel import yaml
parser = argparse.ArgumentParser(description="train a Deep Voice 3 model with LJSpeech")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
dg.enable_dygraph(fluid.CUDAPlace(0))
global global_step
global_step = 1
global writer
writer = SummaryWriter()
print("[Training] tensorboard log and checkpoints are save in {}".format(
writer.logdir))
train(args, config)


@ -1,374 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import os
import numpy as np
import matplotlib
matplotlib.use("agg")
from matplotlib import cm
import matplotlib.pyplot as plt
import librosa
from scipy import signal
from librosa import display
import soundfile as sf
from paddle import fluid
import paddle.fluid.dygraph as dg
from parakeet.g2p import en
def get_place(device_id):
"""get place from device_id, -1 stands for CPU"""
if device_id == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(device_id)
return place
def add_options(parser):
parser.add_argument("--config", type=str, help="experimrnt config")
parser.add_argument(
"--data",
type=str,
default="/workspace/datasets/LJSpeech-1.1/",
help="The path of the LJSpeech dataset.")
parser.add_argument("--device", type=int, default=-1, help="device to use")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from.")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument(
"output", type=str, default="experiment", help="path to save results")
def make_evaluator(config, text_sequences, output_dir, writer=None):
c = config["transform"]
p_replace = 0.0
sample_rate = c["sample_rate"]
preemphasis = c["preemphasis"]
win_length = c["win_length"]
hop_length = c["hop_length"]
min_level_db = c["min_level_db"]
ref_level_db = c["ref_level_db"]
synthesis_config = config["synthesis"]
power = synthesis_config["power"]
n_iter = synthesis_config["n_iter"]
return Evaluator(
text_sequences,
p_replace,
sample_rate,
preemphasis,
win_length,
hop_length,
min_level_db,
ref_level_db,
power,
n_iter,
output_dir=output_dir,
writer=writer)
class Evaluator(object):
def __init__(self,
text_sequences,
p_replace,
sample_rate,
preemphasis,
win_length,
hop_length,
min_level_db,
ref_level_db,
power,
n_iter,
output_dir,
writer=None):
self.text_sequences = text_sequences
self.output_dir = output_dir
self.writer = writer
self.p_replace = p_replace
self.sample_rate = sample_rate
self.preemphasis = preemphasis
self.win_length = win_length
self.hop_length = hop_length
self.min_level_db = min_level_db
self.ref_level_db = ref_level_db
self.power = power
self.n_iter = n_iter
def process_a_sentence(self, model, text):
text = np.array(
en.text_to_sequence(
text, p=self.p_replace), dtype=np.int64)
length = len(text)
text_positions = np.arange(1, 1 + length, dtype=np.int64)
text = np.expand_dims(text, 0)
text_positions = np.expand_dims(text_positions, 0)
model.eval()
if isinstance(model, dg.DataParallel):
_model = model._layers
else:
_model = model
mel_outputs, linear_outputs, alignments, done = _model.transduce(
dg.to_variable(text), dg.to_variable(text_positions))
linear_outputs_np = linear_outputs.numpy()[0].T # (C, T)
wav = spec_to_waveform(linear_outputs_np, self.min_level_db,
self.ref_level_db, self.power, self.n_iter,
self.win_length, self.hop_length,
self.preemphasis)
alignments_np = alignments.numpy()[0] # batch_size = 1
return wav, alignments_np
def __call__(self, model, iteration):
writer = self.writer
for i, seq in enumerate(self.text_sequences):
print("[Eval] synthesizing sentence {}".format(i))
wav, alignments_np = self.process_a_sentence(model, seq)
wav_path = os.path.join(
self.output_dir,
"eval_sample_{}_step_{:09d}.wav".format(i, iteration))
sf.write(wav_path, wav, self.sample_rate)
if writer is not None:
writer.add_audio(
"eval_sample_{}".format(i),
wav,
iteration,
sample_rate=self.sample_rate)
attn_path = os.path.join(
self.output_dir,
"eval_sample_{}_step_{:09d}.png".format(i, iteration))
plot_alignment(alignments_np, attn_path)
if writer is not None:
writer.add_image(
"eval_sample_attn_{}".format(i),
cm.viridis(alignments_np),
iteration,
dataformats="HWC")
def make_state_saver(config, output_dir, writer=None):
c = config["transform"]
p_replace = c["replace_pronunciation_prob"]
sample_rate = c["sample_rate"]
preemphasis = c["preemphasis"]
win_length = c["win_length"]
hop_length = c["hop_length"]
min_level_db = c["min_level_db"]
ref_level_db = c["ref_level_db"]
synthesis_config = config["synthesis"]
power = synthesis_config["power"]
n_iter = synthesis_config["n_iter"]
return StateSaver(p_replace, sample_rate, preemphasis, win_length,
hop_length, min_level_db, ref_level_db, power, n_iter,
output_dir, writer)
class StateSaver(object):
def __init__(self,
p_replace,
sample_rate,
preemphasis,
win_length,
hop_length,
min_level_db,
ref_level_db,
power,
n_iter,
output_dir,
writer=None):
self.output_dir = output_dir
self.writer = writer
self.p_replace = p_replace
self.sample_rate = sample_rate
self.preemphasis = preemphasis
self.win_length = win_length
self.hop_length = hop_length
self.min_level_db = min_level_db
self.ref_level_db = ref_level_db
self.power = power
self.n_iter = n_iter
def __call__(self, outputs, inputs, iteration):
mel_output, lin_output, alignments, done_output = outputs
mel_input, lin_input = inputs
writer = self.writer
# mel spectrogram
mel_input = mel_input[0].numpy().T
mel_output = mel_output[0].numpy().T
path = os.path.join(self.output_dir, "mel_spec")
plt.figure(figsize=(10, 3))
display.specshow(mel_input)
plt.colorbar()
plt.title("mel_input")
plt.savefig(
os.path.join(path, "target_mel_spec_step_{:09d}.png".format(
iteration)))
plt.close()
if writer is not None:
writer.add_image(
"target/mel_spec",
cm.viridis(mel_input),
iteration,
dataformats="HWC")
plt.figure(figsize=(10, 3))
display.specshow(mel_output)
plt.colorbar()
plt.title("mel_output")
plt.savefig(
os.path.join(path, "predicted_mel_spec_step_{:09d}.png".format(
iteration)))
plt.close()
if writer is not None:
writer.add_image(
"predicted/mel_spec",
cm.viridis(mel_output),
iteration,
dataformats="HWC")
# linear spectrogram
lin_input = lin_input[0].numpy().T
lin_output = lin_output[0].numpy().T
path = os.path.join(self.output_dir, "lin_spec")
plt.figure(figsize=(10, 3))
display.specshow(lin_input)
plt.colorbar()
plt.title("mel_input")
plt.savefig(
os.path.join(path, "target_lin_spec_step_{:09d}.png".format(
iteration)))
plt.close()
if writer is not None:
writer.add_image(
"target/lin_spec",
cm.viridis(lin_input),
iteration,
dataformats="HWC")
plt.figure(figsize=(10, 3))
display.specshow(lin_output)
plt.colorbar()
plt.title("mel_input")
plt.savefig(
os.path.join(path, "predicted_lin_spec_step_{:09d}.png".format(
iteration)))
plt.close()
if writer is not None:
writer.add_image(
"predicted/lin_spec",
cm.viridis(lin_output),
iteration,
dataformats="HWC")
# alignment
path = os.path.join(self.output_dir, "alignments")
alignments = alignments[:, 0, :, :].numpy()
for idx, attn_layer in enumerate(alignments):
save_path = os.path.join(
path, "train_attn_layer_{}_step_{}.png".format(idx, iteration))
plot_alignment(attn_layer, save_path)
if writer is not None:
writer.add_image(
"train_attn/layer_{}".format(idx),
cm.viridis(attn_layer),
iteration,
dataformats="HWC")
# synthesize waveform
wav = spec_to_waveform(
lin_output, self.min_level_db, self.ref_level_db, self.power,
self.n_iter, self.win_length, self.hop_length, self.preemphasis)
path = os.path.join(self.output_dir, "waveform")
save_path = os.path.join(
path, "train_sample_step_{:09d}.wav".format(iteration))
sf.write(save_path, wav, self.sample_rate)
if writer is not None:
writer.add_audio(
"train_sample", wav, iteration, sample_rate=self.sample_rate)
def spec_to_waveform(spec, min_level_db, ref_level_db, power, n_iter,
win_length, hop_length, preemphasis):
"""Convert output linear spec to waveform using griffin-lim vocoder.
Args:
spec (ndarray): the output linear spectrogram, shape(C, T), where C means n_fft, T means frames.
"""
denormalized = np.clip(spec, 0, 1) * (-min_level_db) + min_level_db
lin_scaled = np.exp((denormalized + ref_level_db) / 20 * np.log(10))
wav = librosa.griffinlim(
lin_scaled**power,
n_iter=n_iter,
hop_length=hop_length,
win_length=win_length)
if preemphasis > 0:
wav = signal.lfilter([1.], [1., -preemphasis], wav)
wav = np.clip(wav, -1.0, 1.0)
return wav
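A minimal sketch of how `spec_to_waveform` might be called on a model's linear output; `linear_outputs` and all transform hyperparameters below are illustrative assumptions, not values taken from the project config.

```python
import numpy as np
import soundfile as sf

# `linear_outputs` is assumed to be a model output of shape (B, T, C); take the
# first example and transpose to (C, T) as expected by spec_to_waveform.
lin_spec = linear_outputs.numpy()[0].T
wav = spec_to_waveform(lin_spec, min_level_db=-100, ref_level_db=20, power=1.4,
                       n_iter=32, win_length=1024, hop_length=256, preemphasis=0.97)
sf.write("sample.wav", wav, 22050)  # sample rate assumed to match the dataset config
```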
def make_output_tree(output_dir):
print("creating output tree: {}".format(output_dir))
ckpt_dir = os.path.join(output_dir, "checkpoints")
state_dir = os.path.join(output_dir, "states")
eval_dir = os.path.join(output_dir, "eval")
for x in [ckpt_dir, state_dir, eval_dir]:
if not os.path.exists(x):
os.makedirs(x)
for x in ["alignments", "waveform", "lin_spec", "mel_spec"]:
p = os.path.join(state_dir, x)
if not os.path.exists(p):
os.makedirs(p)
def plot_alignment(alignment, path):
"""
Plot an attention layer's alignment for a sentence.
alignment: shape(T_dec, T_enc).
"""
plt.figure()
plt.imshow(alignment)
plt.colorbar()
plt.xlabel('Encoder timestep')
plt.ylabel('Decoder timestep')
plt.savefig(path)
plt.close()
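A quick sketch of the remaining helpers; the output directory name and the alignment array are made up for illustration.

```python
import numpy as np

# create experiment/{checkpoints,states,eval} plus the states/ subdirectories
make_output_tree("experiment")

# plot a dummy (T_dec, T_enc) alignment into the states/alignments subdirectory
dummy_alignment = np.random.rand(50, 30)
plot_alignment(dummy_alignment, "experiment/states/alignments/demo.png")
```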

View File

@ -0,0 +1,43 @@
import argparse
from ruamel import yaml
import numpy as np
import librosa
import paddle
from paddle import fluid
from paddle.fluid import layers as F
from paddle.fluid import dygraph as dg
from parakeet.utils.io import load_parameters
from parakeet.models.waveflow.waveflow_modules import WaveFlowModule
class WaveflowVocoder(object):
def __init__(self):
config_path = "waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml"
with open(config_path, 'rt') as f:
config = yaml.safe_load(f)
ns = argparse.Namespace()
for k, v in config.items():
setattr(ns, k, v)
ns.use_fp16 = False
self.model = WaveFlowModule(ns)
checkpoint_path = "waveflow_res128_ljspeech_ckpt_1.0/step-2000000"
load_parameters(self.model, checkpoint_path=checkpoint_path)
def __call__(self, mel):
with dg.no_grad():
self.model.eval()
audio = self.model.synthesize(mel)
self.model.train()
return audio
class GriffinLimVocoder(object):
def __init__(self, sharpening_factor=1.4, win_length=1024, hop_length=256):
self.sharpening_factor = sharpening_factor
self.win_length = win_length
self.hop_length = hop_length
def __call__(self, spec):
audio = librosa.core.griffinlim(np.exp(spec * self.sharpening_factor),
win_length=self.win_length, hop_length=self.hop_length)
return audio
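As a usage sketch, `GriffinLimVocoder` only needs a log-magnitude linear spectrogram; the spectrogram shape and sample rate below are assumptions for illustration. `WaveflowVocoder` additionally expects the hard-coded `waveflow_res128_ljspeech_ckpt_1.0` config and checkpoint to be present on disk.

```python
import numpy as np
import soundfile as sf

vocoder = GriffinLimVocoder(sharpening_factor=1.4, win_length=1024, hop_length=256)
# assume a log-magnitude linear spectrogram of shape (n_fft // 2 + 1, T) with n_fft = 1024
log_spec = np.random.randn(513, 200).astype("float32") * 0.1 - 4.0
wav = vocoder(log_spec)  # griffin-lim reconstruction of exp(spec * sharpening_factor)
sf.write("griffinlim_sample.wav", wav, 22050)
```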

View File

@ -1,19 +1 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from parakeet.models.deepvoice3.encoder import Encoder, ConvSpec
from parakeet.models.deepvoice3.decoder import Decoder, WindowRange
from parakeet.models.deepvoice3.converter import Converter
from parakeet.models.deepvoice3.loss import TTSLoss
from parakeet.models.deepvoice3.model import DeepVoice3
from .model import *

View File

@ -1,122 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from collections import namedtuple
from paddle import fluid
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
from parakeet.modules.weight_norm import Linear
WindowRange = namedtuple("WindowRange", ["backward", "ahead"])
class Attention(dg.Layer):
def __init__(self,
query_dim,
embed_dim,
dropout=0.0,
window_range=WindowRange(-1, 3),
key_projection=True,
value_projection=True):
"""Attention Layer for Deep Voice 3.
Args:
query_dim (int): the dimension of query vectors. (The size of a single vector of query.)
embed_dim (int): the dimension of keys and values.
dropout (float, optional): dropout probability of attention. Defaults to 0.0.
window_range (WindowRange, optional): range of attention; this is only used at inference. Defaults to WindowRange(-1, 3).
key_projection (bool, optional): whether the `Attention` Layer has a Linear Layer for the keys to pass through before computing attention. Defaults to True.
value_projection (bool, optional): whether the `Attention` Layer has a Linear Layer for the values to pass through before computing attention. Defaults to True.
"""
super(Attention, self).__init__()
std = np.sqrt(1 / query_dim)
self.query_proj = Linear(
query_dim, embed_dim, param_attr=I.Normal(scale=std))
if key_projection:
std = np.sqrt(1 / embed_dim)
self.key_proj = Linear(
embed_dim, embed_dim, param_attr=I.Normal(scale=std))
if value_projection:
std = np.sqrt(1 / embed_dim)
self.value_proj = Linear(
embed_dim, embed_dim, param_attr=I.Normal(scale=std))
std = np.sqrt(1 / embed_dim)
self.out_proj = Linear(
embed_dim, query_dim, param_attr=I.Normal(scale=std))
self.key_projection = key_projection
self.value_projection = value_projection
self.dropout = dropout
self.window_range = window_range
def forward(self, query, encoder_out, mask=None, last_attended=None):
"""
Compute contextualized representation and alignment scores.
Args:
query (Variable): shape(B, T_dec, C_q), dtype float32, the query tensor, where C_q means the query dim.
encoder_out (keys, values):
keys (Variable): shape(B, T_enc, C_emb), dtype float32, the key representation from an encoder, where C_emb means embed dim.
values (Variable): shape(B, T_enc, C_emb), dtype float32, the value representation from an encoder, where C_emb means embed dim.
mask (Variable, optional): shape(B, T_enc), dtype float32, mask generated with valid text lengths. Pad tokens correspond to 1, and valid tokens correspond to 0.
last_attended (int, optional): The position that received the most attention at last time step. This is only used at inference.
Outputs:
x (Variable): shape(B, T_dec, C_q), dtype float32, the contextualized representation from attention mechanism.
attn_scores (Variable): shape(B, T_dec, T_enc), dtype float32, the alignment tensor, where T_dec means the number of decoder time steps and T_enc means the number of encoder time steps.
"""
keys, values = encoder_out
residual = query
if self.value_projection:
values = self.value_proj(values)
if self.key_projection:
keys = self.key_proj(keys)
x = self.query_proj(query)
x = F.matmul(x, keys, transpose_y=True)
# mask generated by sentence length
neg_inf = -1.e30
if mask is not None:
neg_inf_mask = F.scale(F.unsqueeze(mask, [1]), neg_inf)
x += neg_inf_mask
# if last_attended is provided, focus only on a window range around it
# to enforce monotonic attention.
if last_attended is not None:
locality_mask = np.ones(shape=x.shape, dtype=np.float32)
backward, ahead = self.window_range
backward = last_attended + backward
ahead = last_attended + ahead
backward = max(backward, 0)
ahead = min(ahead, x.shape[-1])
locality_mask[:, :, backward:ahead] = 0.
locality_mask = dg.to_variable(locality_mask)
neg_inf_mask = F.scale(locality_mask, neg_inf)
x += neg_inf_mask
x = F.softmax(x)
attn_scores = x
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
x = F.matmul(x, values)
encoder_length = keys.shape[1]
x = F.scale(x, encoder_length * np.sqrt(1.0 / encoder_length))
x = self.out_proj(x)
x = F.scale((x + residual), np.sqrt(0.5))
return x, attn_scores
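A minimal shape-check sketch for the `Attention` layer above, run in dygraph mode on CPU with dummy tensors; the dimensions are arbitrary.

```python
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg

with dg.guard(fluid.CPUPlace()):
    attn = Attention(query_dim=128, embed_dim=256)
    query = dg.to_variable(np.random.randn(1, 10, 128).astype("float32"))   # (B, T_dec, C_q)
    keys = dg.to_variable(np.random.randn(1, 20, 256).astype("float32"))    # (B, T_enc, C_emb)
    values = dg.to_variable(np.random.randn(1, 20, 256).astype("float32"))  # (B, T_enc, C_emb)
    x, scores = attn(query, (keys, values))  # x: (1, 10, 128), scores: (1, 10, 20)
```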

View File

@ -0,0 +1,245 @@
import numpy as np
from paddle.fluid import layers as F
from paddle.fluid.framework import Variable, in_dygraph_mode
from paddle.fluid import core, dygraph_utils
from paddle.fluid.layers import nn, utils
from paddle.fluid.data_feeder import check_variable_and_dtype
from paddle.fluid.param_attr import ParamAttr
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.dygraph import layers
from paddle.fluid.initializer import Normal
def _is_list_or_tuple(input):
return isinstance(input, (list, tuple))
def _zero_padding_in_batch_and_channel(padding, channel_last):
if channel_last:
return list(padding[0]) == [0, 0] and list(padding[-1]) == [0, 0]
else:
return list(padding[0]) == [0, 0] and list(padding[1]) == [0, 0]
def _exclude_padding_in_batch_and_channel(padding, channel_last):
padding_ = padding[1:-1] if channel_last else padding[2:]
padding_ = [elem for pad_a_dim in padding_ for elem in pad_a_dim]
return padding_
def _update_padding_nd(padding, channel_last, num_dims):
if isinstance(padding, str):
padding = padding.upper()
if padding not in ["SAME", "VALID"]:
raise ValueError(
"Unknown padding: '{}'. It can only be 'SAME' or 'VALID'.".
format(padding))
if padding == "VALID":
padding_algorithm = "VALID"
padding = [0] * num_dims
else:
padding_algorithm = "SAME"
padding = [0] * num_dims
elif _is_list_or_tuple(padding):
# for padding like
# [(pad_before, pad_after), (pad_before, pad_after), ...]
# padding for batch_dim and channel_dim included
if len(padding) == 2 + num_dims and _is_list_or_tuple(padding[0]):
if not _zero_padding_in_batch_and_channel(padding, channel_last):
raise ValueError(
"Non-zero padding({}) in the batch or channel dimensions "
"is not supported.".format(padding))
padding_algorithm = "EXPLICIT"
padding = _exclude_padding_in_batch_and_channel(padding,
channel_last)
if utils._is_symmetric_padding(padding, num_dims):
padding = padding[0::2]
# for padding like [pad_before, pad_after, pad_before, pad_after, ...]
elif len(padding) == 2 * num_dims and isinstance(padding[0], int):
padding_algorithm = "EXPLICIT"
padding = utils.convert_to_list(padding, 2 * num_dims, 'padding')
if utils._is_symmetric_padding(padding, num_dims):
padding = padding[0::2]
# for padding like [pad_d1, pad_d2, ...]
elif len(padding) == num_dims and isinstance(padding[0], int):
padding_algorithm = "EXPLICIT"
padding = utils.convert_to_list(padding, num_dims, 'padding')
else:
raise ValueError("In valid padding: {}".format(padding))
# for integer padding
else:
padding_algorithm = "EXPLICIT"
padding = utils.convert_to_list(padding, num_dims, 'padding')
return padding, padding_algorithm
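For reference, a small sketch (assuming the imports at the top of this file) of the padding forms `_update_padding_nd` accepts in the 1D case; the values are illustrative.

```python
# string form: "same" / "valid" select a padding algorithm, explicit values become zero
_update_padding_nd("same", channel_last=False, num_dims=1)   # -> ([0], "SAME")
# integer form: the same symmetric padding on the single spatial dimension
_update_padding_nd(2, channel_last=False, num_dims=1)        # -> ([2], "EXPLICIT")
# [pad_before, pad_after] form: kept as asymmetric explicit padding
_update_padding_nd([1, 2], channel_last=False, num_dims=1)   # -> ([1, 2], "EXPLICIT")
```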
def _get_default_param_initializer(num_channels, filter_size):
filter_elem_num = num_channels * np.prod(filter_size)
std = (2.0 / filter_elem_num)**0.5
return Normal(0.0, std, 0)
def conv1d(input,
weight,
bias=None,
padding=0,
stride=1,
dilation=1,
groups=1,
use_cudnn=True,
act=None,
data_format="NCT",
name=None):
# entry checks
if not isinstance(use_cudnn, bool):
raise ValueError("Attr(use_cudnn) should be True or False. "
"Received Attr(use_cudnn): {}.".format(use_cudnn))
if data_format not in ["NCT", "NTC"]:
raise ValueError("Attr(data_format) should be 'NCT' or 'NTC'. "
"Received Attr(data_format): {}.".format(data_format))
channel_last = (data_format == "NTC")
channel_dim = -1 if channel_last else 1
num_channels = input.shape[channel_dim]
num_filters = weight.shape[0]
if num_channels < 0:
raise ValueError("The channel dimmention of the input({}) "
"should be defined. Received: {}.".format(
input.shape, num_channels))
if num_channels % groups != 0:
raise ValueError(
"the channel of input must be divisible by groups,"
"received: the channel of input is {}, the shape of input is {}"
", the groups is {}".format(num_channels, input.shape, groups))
if num_filters % groups != 0:
raise ValueError(
"the number of filters must be divisible by groups,"
"received: the number of filters is {}, the shape of weight is {}"
", the groups is {}".format(num_filters, weight.shape, groups))
# update attrs
padding, padding_algorithm = _update_padding_nd(padding, channel_last, 1)
if len(padding) == 1: # symmetric padding
padding = [0,] + padding
else:
# len(padding) == 2
padding = [0, 0] + padding
stride = [1,] + utils.convert_to_list(stride, 1, 'stride')
dilation = [1,] + utils.convert_to_list(dilation, 1, 'dilation')
data_format = "NHWC" if channel_last else "NCHW"
l_type = "conv2d"
if (num_channels == groups and num_filters % num_channels == 0 and
not use_cudnn):
l_type = 'depthwise_conv2d'
weight = F.unsqueeze(weight, [2])
input = F.unsqueeze(input, [1]) if channel_last else F.unsqueeze(input, [2])
if in_dygraph_mode():
attrs = ('strides', stride, 'paddings', padding, 'dilations', dilation,
'groups', groups, 'use_cudnn', use_cudnn, 'use_mkldnn', False,
'fuse_relu_before_depthwise_conv', False, "padding_algorithm",
padding_algorithm, "data_format", data_format)
pre_bias = getattr(core.ops, l_type)(input, weight, *attrs)
if bias is not None:
pre_act = nn.elementwise_add(pre_bias, bias, axis=channel_dim)
else:
pre_act = pre_bias
out = dygraph_utils._append_activation_in_dygraph(
pre_act, act, use_cudnn=use_cudnn)
else:
inputs = {'Input': [input], 'Filter': [weight]}
attrs = {
'strides': stride,
'paddings': padding,
'dilations': dilation,
'groups': groups,
'use_cudnn': use_cudnn,
'use_mkldnn': False,
'fuse_relu_before_depthwise_conv': False,
"padding_algorithm": padding_algorithm,
"data_format": data_format
}
check_variable_and_dtype(input, 'input',
['float16', 'float32', 'float64'], 'conv2d')
helper = LayerHelper(l_type, **locals())
dtype = helper.input_dtype()
pre_bias = helper.create_variable_for_type_inference(dtype)
outputs = {"Output": [pre_bias]}
helper.append_op(
type=l_type, inputs=inputs, outputs=outputs, attrs=attrs)
if bias is not None:
pre_act = nn.elementwise_add(pre_bias, bias, axis=channel_dim)
else:
pre_act = pre_bias
out = helper.append_activation(pre_act)
out = F.squeeze(out, [1]) if channel_last else F.squeeze(out, [2])
return out
class Conv1D(layers.Layer):
def __init__(self,
num_channels,
num_filters,
filter_size,
padding=0,
stride=1,
dilation=1,
groups=1,
param_attr=None,
bias_attr=None,
use_cudnn=True,
act=None,
data_format="NCT",
dtype='float32'):
super(Conv1D, self).__init__()
assert param_attr is not False, "param_attr should not be False here."
self._num_channels = num_channels
self._num_filters = num_filters
self._groups = groups
if num_channels % groups != 0:
raise ValueError("num_channels must be divisible by groups.")
self._act = act
self._data_format = data_format
self._dtype = dtype
if not isinstance(use_cudnn, bool):
raise ValueError("use_cudnn should be True or False")
self._use_cudnn = use_cudnn
self._filter_size = utils.convert_to_list(filter_size, 1, 'filter_size')
self._stride = utils.convert_to_list(stride, 1, 'stride')
self._dilation = utils.convert_to_list(dilation, 1, 'dilation')
channel_last = (data_format == "NTC")
self._padding = padding # leave it to F.conv1d
self._param_attr = param_attr
self._bias_attr = bias_attr
num_filter_channels = num_channels // groups
filter_shape = [self._num_filters, num_filter_channels
] + self._filter_size
self.weight = self.create_parameter(
attr=self._param_attr,
shape=filter_shape,
dtype=self._dtype,
default_initializer=_get_default_param_initializer(
self._num_channels, filter_shape))
self.bias = self.create_parameter(
attr=self._bias_attr,
shape=[self._num_filters],
dtype=self._dtype,
is_bias=True)
def forward(self, input):
out = conv1d(
input,
self.weight,
bias=self.bias,
padding=self._padding,
stride=self._stride,
dilation=self._dilation,
groups=self._groups,
use_cudnn=self._use_cudnn,
act=self._act,
data_format=self._data_format)
return out
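A minimal usage sketch of the `Conv1D` wrapper on a dummy `NCT` input; the channel and length values are arbitrary, and `use_cudnn=False` is set only so the sketch also runs on CPU.

```python
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg

with dg.guard(fluid.CPUPlace()):
    conv = Conv1D(num_channels=80, num_filters=128, filter_size=3,
                  padding=1, use_cudnn=False)
    x = dg.to_variable(np.random.randn(4, 80, 100).astype("float32"))  # (N, C, T)
    y = conv(x)  # (4, 128, 100): symmetric padding 1 keeps the length with filter_size 3
```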

View File

@ -1,152 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
from parakeet.modules.weight_norm import Conv1D, Conv1DCell, Conv2D, Linear
class Conv1DGLU(dg.Layer):
"""
A Convolution 1D block with GLU activation. It applies dropout to the input x, integrates speaker embeddings through a Linear layer activated by softsign, adds a residual connection from the input x, and scales the output by np.sqrt(0.5).
"""
def __init__(self,
n_speakers,
speaker_dim,
in_channels,
num_filters,
filter_size=1,
dilation=1,
std_mul=4.0,
dropout=0.0,
causal=False,
residual=True):
"""[summary]
Args:
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding's size.
in_channels (int): channels of the input.
num_filters (int): channels of the output.
filter_size (int, optional): filter size of the internal Conv1DCell. Defaults to 1.
dilation (int, optional): dilation of the internal Conv1DCell. Defaults to 1.
std_mul (float, optional): multiplier used in computing the standard deviation for weight initialization of the internal convolution. Defaults to 4.0.
dropout (float, optional): dropout probability. Defaults to 0.0.
causal (bool, optional): padding of the Conv1DCell. It should be True if the `add_input` method of `Conv1DCell` is ever used. Defaults to False.
residual (bool, optional): whether to use a residual connection. If True, in_channels should equal num_filters. Defaults to True.
"""
super(Conv1DGLU, self).__init__()
# conv spec
self.in_channels = in_channels
self.n_speakers = n_speakers
self.speaker_dim = speaker_dim
self.num_filters = num_filters
self.filter_size = filter_size
self.dilation = dilation
# padding
self.causal = causal
# weight init and dropout
self.std_mul = std_mul
self.dropout = dropout
self.residual = residual
if residual:
assert (
in_channels == num_filters
), "this block uses residual connection"\
"the input_channes should equals num_filters"
std = np.sqrt(std_mul * (1 - dropout) / (filter_size * in_channels))
self.conv = Conv1DCell(
in_channels,
2 * num_filters,
filter_size,
dilation,
causal,
param_attr=I.Normal(scale=std))
if n_speakers > 1:
assert (speaker_dim is not None
), "speaker embed should not be null in multi-speaker case"
std = np.sqrt(1 / speaker_dim)
self.fc = Linear(
speaker_dim, num_filters, param_attr=I.Normal(scale=std))
def forward(self, x, speaker_embed=None):
"""
Args:
x (Variable): shape(B, C_in, T), dtype float32, the input of Conv1DGLU layer, where B means batch_size, C_in means the input channels, and T means input time steps.
speaker_embed (Variable): shape(B, C_sp), dtype float32, speaker embed, where C_sp means speaker embedding size.
Returns:
x (Variable): shape(B, C_out, T), the output of Conv1DGLU, where
C_out means the `num_filters`.
"""
residual = x
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
x = self.conv(x)
content, gate = F.split(x, num_or_sections=2, dim=1)
if speaker_embed is not None:
sp = F.softsign(self.fc(speaker_embed))
content = F.elementwise_add(content, sp, axis=0)
# glu
x = F.sigmoid(gate) * content
if self.residual:
x = F.scale(x + residual, np.sqrt(0.5))
return x
def start_sequence(self):
"""Prepare the Conv1DGLU to generate a new sequence. This method should be called before starting calling `add_input` multiple times.
"""
self.conv.start_sequence()
def add_input(self, x_t, speaker_embed=None):
"""
Takes a step of inputs and returns a step of outputs. It works similarly to the `forward` method, but in a `step-in-step-out` fashion.
Args:
x_t (Variable): shape(B, C_in, T=1), dtype float32, the input of Conv1DGLU layer, where B means batch_size, C_in means the input channels.
speaker_embed (Variable): Shape(B, C_sp), dtype float32, speaker embed, where C_sp means speaker embedding size.
Returns:
x (Variable): shape(B, C_out), the output of Conv1DGLU, where C_out means the `num_filter`.
"""
residual = x_t
x_t = F.dropout(
x_t, self.dropout, dropout_implementation="upscale_in_train")
x_t = self.conv.add_input(x_t)
content_t, gate_t = F.split(x_t, num_or_sections=2, dim=1)
if speaker_embed is not None:
sp = F.softsign(self.fc(speaker_embed))
content_t = F.elementwise_add(content_t, sp, axis=0)
# glu
x_t = F.sigmoid(gate_t) * content_t
if self.residual:
x_t = F.scale(x_t + residual, np.sqrt(0.5))
return x_t
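A minimal single-speaker sketch of `Conv1DGLU` with dummy shapes; `speaker_dim` is unused when `n_speakers == 1`, and the channel/length values are arbitrary.

```python
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg

with dg.guard(fluid.CPUPlace()):
    block = Conv1DGLU(n_speakers=1, speaker_dim=None, in_channels=128,
                      num_filters=128, filter_size=3, dilation=1)
    x = dg.to_variable(np.random.randn(2, 128, 50).astype("float32"))  # (B, C_in, T)
    y = block(x)  # (2, 128, 50): GLU output plus the residual, scaled by sqrt(0.5)
```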

View File

@ -1,285 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from itertools import chain
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from parakeet.modules.weight_norm import Conv1D, Conv1DTranspose, Conv2D, Conv2DTranspose, Linear
from parakeet.models.deepvoice3.conv1dglu import Conv1DGLU
from parakeet.models.deepvoice3.encoder import ConvSpec
def upsampling_4x_blocks(n_speakers, speaker_dim, target_channels, dropout):
"""Return a list of Layers that upsamples the input by 4 times in time dimension.
Args:
n_speakers (int): number of speakers of the Conv1DGLU layers used.
speaker_dim (int): speaker embedding size of the Conv1DGLU layers used.
target_channels (int): channels of the input and the output.(the list of layers does not change the number of channels.)
dropout (float): dropout probability.
Returns:
List[Layer]: upsampling layers.
"""
# upsampling convolutions
upsampling_convolutions = [
Conv1DTranspose(
target_channels,
target_channels,
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(1 / (2 * target_channels)))),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout),
Conv1DTranspose(
target_channels,
target_channels,
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(4. / (2 * target_channels)))),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout),
]
return upsampling_convolutions
def upsampling_2x_blocks(n_speakers, speaker_dim, target_channels, dropout):
"""Return a list of Layers that upsamples the input by 2 times in time dimension.
Args:
n_speakers (int): number of speakers of the Conv1DGLU layers used.
speaker_dim (int): speaker embedding size of the Conv1DGLU layers used.
target_channels (int): channels of the input and the output.(the list of layers does not change the number of channels.)
dropout (float): dropout probability.
Returns:
List[Layer]: upsampling layers.
"""
upsampling_convolutions = [
Conv1DTranspose(
target_channels,
target_channels,
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(1. / (2 * target_channels)))),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout), Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout)
]
return upsampling_convolutions
def upsampling_1x_blocks(n_speakers, speaker_dim, target_channels, dropout):
"""Return a list of Layers that upsamples the input by 1 times in time dimension.
Args:
n_speakers (int): number of speakers of the Conv1DGLU layers used.
speaker_dim (int): speaker embedding size of the Conv1DGLU layers used.
target_channels (int): channels of the input and the output.(the list of layers does not change the number of channels.)
dropout (float): dropout probability.
Returns:
List[Layer]: upsampling layers.
"""
upsampling_convolutions = [
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout)
]
return upsampling_convolutions
class Converter(dg.Layer):
def __init__(self,
n_speakers,
speaker_dim,
in_channels,
linear_dim,
convolutions=(ConvSpec(256, 5, 1), ) * 4,
time_upsampling=1,
dropout=0.0):
"""Vocoder that transforms mel spectrogram (or ecoder hidden states) to waveform.
Args:
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding size.
in_channels (int): channels of the input.
linear_dim (int): channels of the linear spectrogram.
convolutions (Iterable[ConvSpec], optional): specifications of the internal convolutional layers. ConvSpec is a namedtuple of (output_channels, filter_size, dilation) Defaults to (ConvSpec(256, 5, 1), )*4.
time_upsampling (int, optional): time upsampling factor of the converter, possible options are {1, 2, 4}. Note that this should equal the downsample factor of the mel spectrogram. Defaults to 1.
dropout (float, optional): dropout probability. Defaults to 0.0.
"""
super(Converter, self).__init__()
self.n_speakers = n_speakers
self.speaker_dim = speaker_dim
self.in_channels = in_channels
self.linear_dim = linear_dim
# CAUTION: this should equal the downsample factor of the mel spectrogram
self.time_upsampling = time_upsampling
self.dropout = dropout
target_channels = convolutions[0].out_channels
# conv proj to target channels
self.first_conv_proj = Conv1D(
in_channels,
target_channels,
1,
param_attr=I.Normal(scale=np.sqrt(1 / in_channels)))
# Idea from nyanko
if time_upsampling == 4:
self.upsampling_convolutions = dg.LayerList(
upsampling_4x_blocks(n_speakers, speaker_dim, target_channels,
dropout))
elif time_upsampling == 2:
self.upsampling_convolutions = dg.LayerList(
upsampling_2x_blocks(n_speakers, speaker_dim, target_channels,
dropout))
elif time_upsampling == 1:
self.upsampling_convolutions = dg.LayerList(
upsampling_1x_blocks(n_speakers, speaker_dim, target_channels,
dropout))
else:
raise ValueError(
"Upsampling factors other than {1, 2, 4} are Not supported.")
# post conv layers
std_mul = 4.0
in_channels = target_channels
self.convolutions = dg.LayerList()
for (out_channels, filter_size, dilation) in convolutions:
if in_channels != out_channels:
std = np.sqrt(std_mul / in_channels)
# CAUTION: relu
self.convolutions.append(
Conv1D(
in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
in_channels = out_channels
std_mul = 2.0
self.convolutions.append(
Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation=dilation,
std_mul=std_mul,
dropout=dropout))
in_channels = out_channels
std_mul = 4.0
# final conv proj, channel transformed to linear dim
std = np.sqrt(std_mul * (1 - dropout) / in_channels)
# CAUTION: sigmoid
self.last_conv_proj = Conv1D(
in_channels,
linear_dim,
1,
act="sigmoid",
param_attr=I.Normal(scale=std))
def forward(self, x, speaker_embed=None):
"""
Convert mel spectrogram or decoder hidden states to linear spectrogram.
Args:
x (Variable): Shape(B, T_mel, C_in), dtype float32, converter inputs, where C_in means the input channel for the converter. Note that it can be either C_mel (channel of mel spectrogram) or C_dec // r.
When the mel spectrogram is used as the converter input, C_in = C_mel; when decoder states are used as the input, C_in = C_dec // r.
speaker_embed (Variable, optional): shape(B, C_sp), dtype float32, speaker embedding, where C_sp means the speaker embedding size.
Returns:
out (Variable): Shape(B, T_lin, C_lin), the output linear spectrogram, where C_lin means the channels of the linear spectrogram and T_lin means its length (time steps). T_lin = time_upsampling * T_mel, which depends on the time_upsampling of the converter.
"""
x = F.transpose(x, [0, 2, 1])
x = self.first_conv_proj(x)
if speaker_embed is not None:
speaker_embed = F.dropout(
speaker_embed,
self.dropout,
dropout_implementation="upscale_in_train")
for layer in chain(self.upsampling_convolutions, self.convolutions):
if isinstance(layer, Conv1DGLU):
x = layer(x, speaker_embed)
else:
x = layer(x)
out = self.last_conv_proj(x)
out = F.transpose(out, [0, 2, 1])
return out
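A single-speaker shape sketch of the `Converter`; the channel sizes are illustrative, and `time_upsampling=4` assumes the mel input was downsampled 4x in time during preprocessing.

```python
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg

with dg.guard(fluid.CPUPlace()):
    converter = Converter(n_speakers=1, speaker_dim=None, in_channels=80,
                          linear_dim=513, time_upsampling=4)
    mel = dg.to_variable(np.random.randn(1, 50, 80).astype("float32"))  # (B, T_mel, C_in)
    lin = converter(mel)  # (1, 200, 513): the time axis is upsampled 4x
```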

View File

@ -1,526 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from parakeet.modules.weight_norm import Conv1D, Linear
from parakeet.models.deepvoice3.conv1dglu import Conv1DGLU
from parakeet.models.deepvoice3.encoder import ConvSpec
from parakeet.models.deepvoice3.attention import Attention, WindowRange
from parakeet.models.deepvoice3.position_embedding import PositionEmbedding
def gen_mask(valid_lengths, max_len, dtype="float32"):
"""
Generate a mask tensor from valid lengths. Note that it returns a *reverse*
mask. Indices within valid lengths correspond to 0, and those within
padding area correspond to 1.
Assume that valid_lengths = [2,5,7], and max_len = 7, the generated mask is
[[0, 0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 0, 0]].
Args:
valid_lengths (Variable): shape(B, ), dtype: int64. A rank-1 Tensor containing the valid lengths (timesteps) of each example, where B means batch_size.
max_len (int): The length (number of time steps) of the mask.
dtype (str, optional): A string that specifies the data type of the returned mask. Defaults to 'float32'.
Returns:
mask (Variable): shape(B, max_len), dtype float32, a mask computed from valid lengths.
"""
mask = F.sequence_mask(valid_lengths, maxlen=max_len, dtype=dtype)
mask = 1 - mask
return mask
def fold_adjacent_frames(frames, r):
"""fold multiple adjacent frames.
Args:
frames (Variable): shape(B, T, C), the spectrogram.
r (int): frames per step.
Returns:
Variable: shape(B, T // r, r * C), folded frames.
"""
if r == 1:
return frames
batch_size, time_steps, channels = frames.shape
if time_steps % r != 0:
print(
"time_steps cannot be divided by r, you would lose {} tailing frames"
.format(time_steps % r))
frames = frames[:, :time_steps - time_steps % r, :]
frames = F.reshape(frames, (batch_size, -1, channels * r))
return frames
def unfold_adjacent_frames(folded_frames, r):
"""unfold the folded frames.
Args:
folded_frames (Variable): shape(B, T, C), the folded spectrogram.
r (int): frames per step.
Returns:
Variable: shape(B, T * r, C // r), unfolded frames.
"""
if r == 1:
return folded_frames
batch_size, time_steps, channels = folded_frames.shape
folded_frames = F.reshape(folded_frames, (batch_size, -1, channels // r))
return folded_frames
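A quick shape check for the two helpers above, with dummy values; `r` is the number of frames generated per decoder step.

```python
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg

with dg.guard(fluid.CPUPlace()):
    frames = dg.to_variable(np.zeros((2, 12, 80), dtype="float32"))  # (B, T, C)
    folded = fold_adjacent_frames(frames, r=4)      # (2, 3, 320): 4 frames packed per step
    restored = unfold_adjacent_frames(folded, r=4)  # (2, 12, 80)
```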
class Decoder(dg.Layer):
def __init__(self,
n_speakers,
speaker_dim,
embed_dim,
mel_dim,
r=1,
max_positions=512,
preattention=(ConvSpec(128, 5, 1), ) * 4,
convolutions=(ConvSpec(128, 5, 1), ) * 4,
attention=True,
dropout=0.0,
use_memory_mask=False,
force_monotonic_attention=False,
query_position_rate=1.0,
key_position_rate=1.0,
window_range=WindowRange(-1, 3),
key_projection=True,
value_projection=True):
"""Decoder of the Deep Voice 3 model.
Args:
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding size.
embed_dim (int): text embedding size.
mel_dim (int): channel of mel input.(mel bands)
r (int, optional): number of frames generated per decoder step. Defaults to 1.
max_positions (int, optional): max position for text and decoder steps. Defaults to 512.
convolutions (Iterable[ConvSpec], optional): specification of causal convolutional layers inside the decoder. ConvSpec is a namedtuple of output_channels, filter_size and dilation. Defaults to (ConvSpec(128, 5, 1), )*4.
attention (bool or List[bool], optional): whether to use attention, it should have the same length with `convolutions` if it is a list of bool, indicating whether to have an Attention layer coupled with the corresponding convolutional layer. If it is a bool, it is repeated len(convolutions) times internally. Defaults to True.
dropout (float, optional): dropout probability. Defaults to 0.0.
use_memory_mask (bool, optional): whether to use memory mask at the Attention layer. It should have the same length with `attention` if it is a list of bool, indicating whether to use memory mask at the corresponding Attention layer. If it is a bool, it is repeated len(attention) times internally. Defaults to False.
force_monotonic_attention (bool, optional): whether to use monotonic_attention at the Attention layer when inferencing. It should have the same length with `attention` if it is a list of bool, indicating whether to use monotonic_attention at the corresponding Attention layer. If it is a bool, it is repeated len(attention) times internally. Defaults to False.
query_position_rate (float, optional): position_rate of the PositionEmbedding for query. Defaults to 1.0.
key_position_rate (float, optional): position_rate of the PositionEmbedding for key. Defaults to 1.0.
window_range (WindowRange, optional): window range of monotonic attention. Defaults to WindowRange(-1, 3).
key_projection (bool, optional): `key_projection` of Attention layers. Defaults to True.
value_projection (bool, optional): `value_projection` of Attention layers Defaults to True.
"""
super(Decoder, self).__init__()
self.dropout = dropout
self.mel_dim = mel_dim
self.r = r
self.query_position_rate = query_position_rate
self.key_position_rate = key_position_rate
self.window_range = window_range
self.n_speakers = n_speakers
conv_channels = convolutions[0].out_channels
# only when padding idx is 0 can we easily handle it
self.embed_keys_positions = PositionEmbedding(max_positions, embed_dim)
self.embed_query_positions = PositionEmbedding(max_positions,
conv_channels)
if n_speakers > 1:
std = np.sqrt((1 - dropout) / speaker_dim)
self.speaker_proj1 = Linear(
speaker_dim, 1, act="sigmoid", param_attr=I.Normal(scale=std))
self.speaker_proj2 = Linear(
speaker_dim, 1, act="sigmoid", param_attr=I.Normal(scale=std))
# prenet
self.prenet = dg.LayerList()
in_channels = mel_dim * r # multiframe
std_mul = 1.0
for (out_channels, filter_size, dilation) in preattention:
if in_channels != out_channels:
# conv1d & relu
std = np.sqrt(std_mul / in_channels)
self.prenet.append(
Conv1D(
in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
in_channels = out_channels
std_mul = 2.0
self.prenet.append(
Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=True,
residual=True))
in_channels = out_channels
std_mul = 4.0
# attention
self.use_memory_mask = use_memory_mask
if isinstance(attention, bool):
self.attention = [attention] * len(convolutions)
else:
self.attention = attention
if isinstance(force_monotonic_attention, bool):
self.force_monotonic_attention = [force_monotonic_attention
] * len(convolutions)
else:
self.force_monotonic_attention = force_monotonic_attention
for x, y in zip(self.force_monotonic_attention, self.attention):
if x is True and y is False:
raise ValueError("When not using attention, there is no "
"monotonic attention at all")
# causal convolution & attention
self.conv_attn = []
for use_attention, (out_channels, filter_size,
dilation) in zip(self.attention, convolutions):
assert (
in_channels == out_channels
), "the stack of convolution & attention does not change channels"
conv_layer = Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=True,
residual=False)
attn_layer = Attention(
out_channels,
embed_dim,
dropout,
window_range,
key_projection=key_projection,
value_projection=value_projection) if use_attention else None
in_channels = out_channels
std_mul = 4.0
self.conv_attn.append((conv_layer, attn_layer))
for i, (conv_layer, attn_layer) in enumerate(self.conv_attn):
self.add_sublayer("conv_{}".format(i), conv_layer)
if attn_layer is not None:
self.add_sublayer("attn_{}".format(i), attn_layer)
# 1 * 1 conv to transform channels
std = np.sqrt(std_mul * (1 - dropout) / in_channels)
self.last_conv = Conv1D(
in_channels, mel_dim * r, 1, param_attr=I.Normal(scale=std))
# mel (before sigmoid) to done hat
std = np.sqrt(1 / in_channels)
self.fc = Conv1D(mel_dim * r, 1, 1, param_attr=I.Normal(scale=std))
# decoding configs
self.max_decoder_steps = 200
self.min_decoder_steps = 10
assert convolutions[-1].out_channels % r == 0, \
"decoder_state dim must be divided by r"
self.state_dim = convolutions[-1].out_channels // self.r
def forward(self,
encoder_out,
lengths,
frames,
text_positions,
frame_positions,
speaker_embed=None):
"""
Compute decoder outputs with ground truth mel spectrogram.
Args:
encoder_out (keys, values):
keys (Variable): shape(B, T_enc, C_emb), dtype float32, the key representation from an encoder, where C_emb means text embedding size.
values (Variable): shape(B, T_enc, C_emb), dtype float32, the value representation from an encoder, where C_emb means text embedding size.
lengths (Variable): shape(batch_size,), dtype: int64, valid lengths of text inputs for each example.
frames (Variable): shape(B, T_mel, C_mel), ground truth mel spectrogram, used as the decoder input during training.
text_positions (Variable): shape(B, T_enc), dtype: int64. Positions indices for text inputs for the encoder, where T_enc means the encoder timesteps.
frame_positions (Variable): shape(B, T_mel // r), dtype: int64. Positions indices for each decoder time steps.
speaker_embed (Variable, optional): shape(batch_size, speaker_dim), speaker embedding, only used for the multispeaker model.
Returns:
outputs (Variable): shape(B, T_mel, C_mel), dtype float32, decoder outputs, where C_mel means the channels of mel-spectrogram, T_mel means the length(time steps) of mel spectrogram.
alignments (Variable): shape(N, B, T_mel // r, T_enc), dtype float32, the alignment tensor between the decoder and the encoder, where N means number of Attention Layers, T_mel means the length of mel spectrogram, r means the outputs per decoder step, T_enc means the encoder time steps.
done (Variable): shape(B, T_mel // r), dtype float32, probability that the last frame has been generated.
decoder_states (Variable): shape(B, T_mel, C_dec // r), dtype float32, decoder hidden states, where C_dec means the channels of decoder states (the output channels of the last `convolutions`). Note that C_dec should be perfectly divisible by `r`.
"""
if speaker_embed is not None:
speaker_embed = F.dropout(
speaker_embed,
self.dropout,
dropout_implementation="upscale_in_train")
keys, values = encoder_out
enc_time_steps = keys.shape[1]
if self.use_memory_mask and lengths is not None:
mask = gen_mask(lengths, enc_time_steps)
else:
mask = None
if text_positions is not None:
w = self.key_position_rate
if self.n_speakers > 1:
w = w * F.squeeze(self.speaker_proj1(speaker_embed), [-1])
text_pos_embed = self.embed_keys_positions(text_positions, w)
keys += text_pos_embed # (B, T, C)
if frame_positions is not None:
w = self.query_position_rate
if self.n_speakers > 1:
w = w * F.squeeze(self.speaker_proj2(speaker_embed), [-1])
frame_pos_embed = self.embed_query_positions(frame_positions, w)
else:
frame_pos_embed = None
# pack multiple frames if necessary
frames = fold_adjacent_frames(frames, self.r) # assume (B, T, C) input
# (B, C, T)
frames = F.transpose(frames, [0, 2, 1])
x = frames
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
# Prenet
for layer in self.prenet:
if isinstance(layer, Conv1DGLU):
x = layer(x, speaker_embed)
else:
x = layer(x)
# Convolution & Multi-hop Attention
alignments = []
for (conv, attn) in self.conv_attn:
residual = x
x = conv(x, speaker_embed)
if attn is not None:
x = F.transpose(x, [0, 2, 1]) # (B, T, C)
if frame_pos_embed is not None:
x = x + frame_pos_embed
x, attn_scores = attn(x, (keys, values), mask)
alignments.append(attn_scores)
x = F.transpose(x, [0, 2, 1]) #(B, C, T)
x = F.scale(residual + x, np.sqrt(0.5))
alignments = F.stack(alignments)
decoder_states = x
x = self.last_conv(x)
outputs = F.sigmoid(x)
done = F.sigmoid(self.fc(x))
outputs = F.transpose(outputs, [0, 2, 1])
decoder_states = F.transpose(decoder_states, [0, 2, 1])
done = F.squeeze(done, [1])
outputs = unfold_adjacent_frames(outputs, self.r)
decoder_states = unfold_adjacent_frames(decoder_states, self.r)
return outputs, alignments, done, decoder_states
@property
def receptive_field(self):
"""Whole receptive field of the causally convolutional decoder."""
r = 1
for conv in self.prenet:
r += conv.dilation[1] * (conv.filter_size[1] - 1)
for (conv, _) in self.conv_attn:
r += conv.dilation[1] * (conv.filter_size[1] - 1)
return r
def start_sequence(self):
"""Prepare the Decoder to decode. This method is called by `decode`.
"""
for layer in self.prenet:
if isinstance(layer, Conv1DGLU):
layer.start_sequence()
for conv, _ in self.conv_attn:
if isinstance(conv, Conv1DGLU):
conv.start_sequence()
def decode(self,
encoder_out,
text_positions,
speaker_embed=None,
test_inputs=None):
"""Decode from the encoder's output and other conditions.
Args:
encoder_out (keys, values):
keys (Variable): shape(B, T_enc, C_emb), dtype float32, the key representation from an encoder, where C_emb means text embedding size.
values (Variable): shape(B, T_enc, C_emb), dtype float32, the value representation from an encoder, where C_emb means text embedding size.
text_positions (Variable): shape(B, T_enc), dtype: int64. Positions indices for text inputs for the encoder, where T_enc means the encoder timesteps.
speaker_embed (Variable, optional): shape(B, C_sp), speaker embedding, only used for multispeaker model.
test_inputs (Variable, optional): shape(B, T_test, C_mel). test input, it is only used for debugging. Defaults to None.
Returns:
outputs (Variable): shape(B, T_mel, C_mel), dtype float32, decoder outputs, where C_mel means the channels of mel-spectrogram, T_mel means the length(time steps) of mel spectrogram.
alignments (Variable): shape(N, B, T_mel // r, T_enc), dtype float32, the alignment tensor between the decoder and the encoder, where N means number of Attention Layers, T_mel means the length of mel spectrogram, r means the outputs per decoder step, T_enc means the encoder time steps.
done (Variable): shape(B, T_mel // r), dtype float32, probability that the last frame has been generated. If the probability is larger than 0.5 at a step, the generation stops.
decoder_states (Variable): shape(B, T_mel, C_dec // r), dtype float32, decoder hidden states, where C_dec means the channels of decoder states (the output channels of the last `convolutions`). Note that C_dec should be perfectly divisible by `r`.
Note:
Only single instance inference is supported now, so B = 1.
"""
self.start_sequence()
keys, values = encoder_out
batch_size = keys.shape[0]
assert batch_size == 1, "now only supports single instance inference"
mask = None # no mask because we use single instance decoding
# no dropout in inference
if speaker_embed is not None:
speaker_embed = F.dropout(
speaker_embed,
self.dropout,
dropout_implementation="upscale_in_train")
# since we use single example inference, there is no text_mask
if text_positions is not None:
w = self.key_position_rate
if self.n_speakers > 1:
# shape (B, )
w = w * F.squeeze(self.speaker_proj1(speaker_embed), [-1])
text_pos_embed = self.embed_keys_positions(text_positions, w)
keys += text_pos_embed # (B, T, C)
# start decoding
decoder_states = [] # (B, C, 1) tensors
mel_outputs = [] # (B, C, 1) tensors
alignments = [] # (B, 1, T_enc) tensors
dones = [] # (B, 1, 1) tensors
last_attended = [None] * len(self.conv_attn)
for idx, monotonic_attn in enumerate(self.force_monotonic_attention):
if monotonic_attn:
last_attended[idx] = 0
if test_inputs is not None:
# pack multiple frames if necessary # assume (B, T, C) input
test_inputs = fold_adjacent_frames(test_inputs, self.r)
test_inputs = F.transpose(test_inputs, [0, 2, 1])
initial_input = F.zeros(
(batch_size, self.mel_dim * self.r, 1), dtype=keys.dtype)
t = 0 # decoder time step
while True:
frame_pos = F.fill_constant(
(batch_size, 1), value=t + 1, dtype="int64")
w = self.query_position_rate
if self.n_speakers > 1:
w = w * F.squeeze(self.speaker_proj2(speaker_embed), [-1])
# (B, T=1, C)
frame_pos_embed = self.embed_query_positions(frame_pos, w)
if test_inputs is not None:
if t >= test_inputs.shape[-1]:
break
current_input = test_inputs[:, :, t:t + 1]
else:
if t > 0:
current_input = mel_outputs[-1] # auto-regressive
else:
current_input = initial_input
x_t = current_input
x_t = F.dropout(
x_t, self.dropout, dropout_implementation="upscale_in_train")
# Prenet
for layer in self.prenet:
if isinstance(layer, Conv1DGLU):
x_t = layer.add_input(x_t, speaker_embed)
else:
x_t = layer(x_t) # (B, C, T=1)
step_attn_scores = []
# causal convolutions + multi-hop attentions
for i, (conv, attn) in enumerate(self.conv_attn):
residual = x_t #(B, C, T=1)
x_t = conv.add_input(x_t, speaker_embed)
if attn is not None:
x_t = F.transpose(x_t, [0, 2, 1])
if frame_pos_embed is not None:
x_t += frame_pos_embed
x_t, attn_scores = attn(x_t, (keys, values), mask,
last_attended[i]
if test_inputs is None else None)
x_t = F.transpose(x_t, [0, 2, 1])
step_attn_scores.append(attn_scores) #(B, T_dec=1, T_enc)
# update last attended when necessary
if self.force_monotonic_attention[i]:
last_attended[i] = np.argmax(
attn_scores.numpy(), axis=-1)[0][0]
x_t = F.scale(residual + x_t, np.sqrt(0.5))
if len(step_attn_scores):
# (B, 1, T_enc) again
average_attn_scores = F.reduce_mean(
F.stack(step_attn_scores, 0), 0)
else:
average_attn_scores = None
decoder_state_t = x_t
x_t = self.last_conv(x_t)
mel_output_t = F.sigmoid(x_t)
done_t = F.sigmoid(self.fc(x_t))
decoder_states.append(decoder_state_t)
mel_outputs.append(mel_output_t)
if average_attn_scores is not None:
alignments.append(average_attn_scores)
dones.append(done_t)
t += 1
if test_inputs is None:
if F.reduce_min(done_t).numpy()[
0] > 0.5 and t > self.min_decoder_steps:
break
elif t > self.max_decoder_steps:
break
# concat results
mel_outputs = F.concat(mel_outputs, axis=-1)
decoder_states = F.concat(decoder_states, axis=-1)
dones = F.concat(dones, axis=-1)
alignments = F.concat(alignments, axis=1)
mel_outputs = F.transpose(mel_outputs, [0, 2, 1])
decoder_states = F.transpose(decoder_states, [0, 2, 1])
dones = F.squeeze(dones, [1])
mel_outputs = unfold_adjacent_frames(mel_outputs, self.r)
decoder_states = unfold_adjacent_frames(decoder_states, self.r)
return mel_outputs, alignments, dones, decoder_states

View File

@ -1,149 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from collections import namedtuple
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from parakeet.modules.weight_norm import Conv1D, Linear
from parakeet.models.deepvoice3.conv1dglu import Conv1DGLU
ConvSpec = namedtuple("ConvSpec", ["out_channels", "filter_size", "dilation"])
class Encoder(dg.Layer):
def __init__(self,
n_vocab,
embed_dim,
n_speakers,
speaker_dim,
padding_idx=None,
embedding_weight_std=0.1,
convolutions=(ConvSpec(64, 5, 1), ) * 7,
dropout=0.):
"""Encoder of Deep Voice 3.
Args:
n_vocab (int): vocabulary size of the text embedding.
embed_dim (int): embedding size of the text embedding.
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding size.
padding_idx (int, optional): padding index of text embedding. Defaults to None.
embedding_weight_std (float, optional): standard deviation of the embedding weights when initialized. Defaults to 0.1.
convolutions (Iterable[ConvSpec], optional): specifications of the convolutional layers. ConvSpec is a namedtuple of output channels, filter_size and dilation. Defaults to (ConvSpec(64, 5, 1), )*7.
dropout (float, optional): dropout probability. Defaults to 0..
"""
super(Encoder, self).__init__()
self.embedding_weight_std = embedding_weight_std
self.embed = dg.Embedding(
(n_vocab, embed_dim),
padding_idx=padding_idx,
param_attr=I.Normal(scale=embedding_weight_std))
self.dropout = dropout
if n_speakers > 1:
std = np.sqrt((1 - dropout) / speaker_dim)
self.sp_proj1 = Linear(
speaker_dim,
embed_dim,
act="softsign",
param_attr=I.Normal(scale=std))
self.sp_proj2 = Linear(
speaker_dim,
embed_dim,
act="softsign",
param_attr=I.Normal(scale=std))
self.n_speakers = n_speakers
self.convolutions = dg.LayerList()
in_channels = embed_dim
std_mul = 1.0
for (out_channels, filter_size, dilation) in convolutions:
# 1 * 1 convolution & relu
if in_channels != out_channels:
std = np.sqrt(std_mul / in_channels)
self.convolutions.append(
Conv1D(
in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
in_channels = out_channels
std_mul = 2.0
self.convolutions.append(
Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=False,
residual=True))
in_channels = out_channels
std_mul = 4.0
std = np.sqrt(std_mul * (1 - dropout) / in_channels)
self.convolutions.append(
Conv1D(
in_channels, embed_dim, 1, param_attr=I.Normal(scale=std)))
def forward(self, x, speaker_embed=None):
"""
Encode text sequence.
Args:
x (Variable): shape(B, T_enc), dtype: int64. The input text indices. T_enc means the number of time steps of the encoder input x.
speaker_embed (Variable, optional): shape(B, C_sp), dtype float32, speaker embeddings. This arg is not None only when the model is a multispeaker model.
Returns:
keys (Variable), shape(B, T_enc, C_emb), dtype float32, the encoded representation for keys, where C_emb means the text embedding size.
values (Variable), shape(B, T_enc, C_emb), dtype float32, the encoded representation for values.
"""
x = self.embed(x)
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
x = F.transpose(x, [0, 2, 1])
if self.n_speakers > 1 and speaker_embed is not None:
speaker_embed = F.dropout(
speaker_embed,
self.dropout,
dropout_implementation="upscale_in_train")
x = F.elementwise_add(x, self.sp_proj1(speaker_embed), axis=0)
input_embed = x
for layer in self.convolutions:
if isinstance(layer, Conv1DGLU):
x = layer(x, speaker_embed)
else:
# layer is a Conv1D with (1,) filter wrapped by WeightNormWrapper
x = layer(x)
if self.n_speakers > 1 and speaker_embed is not None:
x = F.elementwise_add(x, self.sp_proj2(speaker_embed), axis=0)
keys = x # (B, C, T)
values = F.scale(input_embed + x, scale=np.sqrt(0.5))
keys = F.transpose(keys, [0, 2, 1])
values = F.transpose(values, [0, 2, 1])
return keys, values
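As a quick illustration of the encoder's output convention above (keys taken from the conv stack, values formed as a sqrt(0.5)-scaled residual sum with the input embedding), here is a small numpy sketch with made-up shapes:
```python
import numpy as np

B, T_enc, C_emb = 2, 5, 8
input_embed = np.random.randn(B, T_enc, C_emb).astype("float32")
conv_out = np.random.randn(B, T_enc, C_emb).astype("float32")  # stands in for the conv stack output

keys = conv_out                                    # used to score attention
values = np.sqrt(0.5) * (input_embed + conv_out)   # used to build context vectors
print(keys.shape, values.shape)                    # (2, 5, 8) (2, 5, 8)
```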

View File

@ -1,291 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from numba import jit
from paddle import fluid
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
def masked_mean(inputs, mask):
"""
Args:
inputs (Variable): shape(B, T, C), dtype float32, the input.
mask (Variable): shape(B, T), dtype float32, a mask.
Returns:
loss (Variable): shape(1, ), dtype float32, masked mean.
"""
channels = inputs.shape[-1]
masked_inputs = F.elementwise_mul(inputs, mask, axis=0)
loss = F.reduce_sum(masked_inputs) / (channels * F.reduce_sum(mask))
return loss
@jit(nopython=True)
def guided_attention(N, max_N, T, max_T, g):
"""Generate an diagonal attention guide.
Args:
N (int): valid length of encoder.
max_N (int): max length of encoder.
T (int): valid length of decoder.
max_T (int): max length of decoder.
g (float): sigma to adjust the degree of the diagonal guide.
Returns:
np.ndarray: shape(max_N, max_T), dtype float32, the diagonal guide.
"""
W = np.zeros((max_N, max_T), dtype=np.float32)
for n in range(N):
for t in range(T):
W[n, t] = 1 - np.exp(-(n / N - t / T)**2 / (2 * g * g))
return W
def guided_attentions(encoder_lengths, decoder_lengths, max_decoder_len,
g=0.2):
"""Generate a diagonal attention guide for a batch.
Args:
encoder_lengths (np.ndarray): shape(B, ), dtype: int64, encoder valid lengths.
decoder_lengths (np.ndarray): shape(B, ), dtype: int64, decoder valid lengths.
max_decoder_len (int): max length of decoder.
g (float, optional): sigma to adjust the degree of the diagonal guide. Defaults to 0.2.
Returns:
np.ndarray: shape(B, max_T, max_N), dtype float32, the diagonal guide. (max_N: max encoder length, max_T: max decoder length.)
"""
B = len(encoder_lengths)
max_input_len = encoder_lengths.max()
W = np.zeros((B, max_decoder_len, max_input_len), dtype=np.float32)
for b in range(B):
W[b] = guided_attention(encoder_lengths[b], max_input_len,
decoder_lengths[b], max_decoder_len, g).T
return W
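A quick numpy check of the valid N x T region of the guide defined by `guided_attention` above: entries where n/N is close to t/T stay near 0, while off-diagonal entries approach 1, so averaging `attention * guide` penalizes non-diagonal alignments.
```python
import numpy as np

def guide(N, T, g=0.2):
    n = np.arange(N)[:, None] / N
    t = np.arange(T)[None, :] / T
    return 1.0 - np.exp(-((n - t) ** 2) / (2 * g * g))

W = guide(4, 4)
print(np.round(np.diag(W), 3))     # diagonal entries are exactly 0
print(round(float(W[0, -1]), 3))   # far corner is close to 1
```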
class TTSLoss(object):
def __init__(self,
masked_weight=0.0,
priority_bin=None,
priority_weight=0.0,
binary_divergence_weight=0.0,
guided_attention_sigma=0.2,
downsample_factor=4,
r=1):
"""Compute loss for Deep Voice 3 model.
Args:
masked_weight (float, optional): the weight of masked loss. Defaults to 0.0.
priority_bin (int, optional): frequency bands (number of low-frequency bins) of the linear spectrogram loss to be prioritized. Defaults to None.
priority_weight (float, optional): weight for the prioritized frequency bands. Defaults to 0.0.
binary_divergence_weight (float, optional): weight for binary cross entropy (used for spectrogram loss). Defaults to 0.0.
guided_attention_sigma (float, optional): `sigma` for attention guide. Defaults to 0.2.
downsample_factor (int, optional): the downsample factor for mel spectrogram. Defaults to 4.
r (int, optional): frames per decoder step. Defaults to 1.
"""
self.masked_weight = masked_weight
self.priority_bin = priority_bin # only used for lin-spec loss
self.priority_weight = priority_weight # only used for lin-spec loss
self.binary_divergence_weight = binary_divergence_weight
self.guided_attention_sigma = guided_attention_sigma
self.time_shift = r
self.r = r
self.downsample_factor = downsample_factor
def l1_loss(self, prediction, target, mask, priority_bin=None):
"""L1 loss for spectrogram.
Args:
prediction (Variable): shape(B, T, C), dtype float32, predicted spectrogram.
target (Variable): shape(B, T, C), dtype float32, target spectrogram.
mask (Variable): shape(B, T), mask.
priority_bin (int, optional): frequency bands for linear spectrogram loss to be prioritized. Defaults to None.
Returns:
Variable: shape(1,), dtype float32, L1 loss (with mask and possibly priority bin applied).
"""
abs_diff = F.abs(prediction - target)
# basic mask-weighted l1 loss
w = self.masked_weight
if w > 0 and mask is not None:
base_l1_loss = w * masked_mean(abs_diff, mask) \
+ (1 - w) * F.reduce_mean(abs_diff)
else:
base_l1_loss = F.reduce_mean(abs_diff)
if self.priority_weight > 0 and priority_bin is not None:
# mask-weighted priority channels' l1-loss
priority_abs_diff = abs_diff[:, :, :priority_bin]
if w > 0 and mask is not None:
priority_loss = w * masked_mean(priority_abs_diff, mask) \
+ (1 - w) * F.reduce_mean(priority_abs_diff)
else:
priority_loss = F.reduce_mean(priority_abs_diff)
# priority weighted sum
p = self.priority_weight
loss = p * priority_loss + (1 - p) * base_l1_loss
else:
loss = base_l1_loss
return loss
def binary_divergence(self, prediction, target, mask):
"""Binary cross entropy loss for spectrogram. All the values in the spectrogram are treated as logits in a logistic regression.
Args:
prediction (Variable): shape(B, T, C), dtype float32, predicted spectrogram.
target (Variable): shape(B, T, C), dtype float32, target spectrogram.
mask (Variable): shape(B, T), mask.
Returns:
Variable: shape(1,), dtype float32, binary cross entropy loss.
"""
flattened_prediction = F.reshape(prediction, [-1, 1])
flattened_target = F.reshape(target, [-1, 1])
flattened_loss = F.log_loss(
flattened_prediction, flattened_target, epsilon=1e-8)
bin_div = fluid.layers.reshape(flattened_loss, prediction.shape)
w = self.masked_weight
if w > 0 and mask is not None:
loss = w * masked_mean(bin_div, mask) \
+ (1 - w) * F.reduce_mean(bin_div)
else:
loss = F.reduce_mean(bin_div)
return loss
@staticmethod
def done_loss(done_hat, done):
"""Compute done loss
Args:
done_hat (Variable): shape(B, T), dtype float32, predicted done probability (the probability that the final frame has been generated).
done (Variable): shape(B, T), dtype float32, ground truth done probability (the probability that the final frame has been generated).
Returns:
Variable: shape(1, ), dtype float32, done loss.
"""
flat_done_hat = F.reshape(done_hat, [-1, 1])
flat_done = F.reshape(done, [-1, 1])
loss = F.log_loss(flat_done_hat, flat_done, epsilon=1e-8)
loss = F.reduce_mean(loss)
return loss
def attention_loss(self, predicted_attention, input_lengths,
target_lengths):
"""
Given valid encoder_lengths and decoder_lengths, compute a diagonal guide, and compute loss from the predicted attention and the guide.
Args:
predicted_attention (Variable): shape(*, B, T_dec, T_enc), dtype float32, the alignment tensor, where B means batch size, T_dec means number of time steps of the decoder, T_enc means the number of time steps of the encoder, * means other possible dimensions.
input_lengths (numpy.ndarray): shape(B,), dtype:int64, valid lengths (time steps) of encoder outputs.
target_lengths (numpy.ndarray): shape(batch_size,), dtype:int64, valid lengths (time steps) of decoder outputs.
Returns:
loss (Variable): shape(1, ), dtype float32, attention loss.
"""
n_attention, batch_size, max_target_len, max_input_len = (
predicted_attention.shape)
soft_mask = guided_attentions(input_lengths, target_lengths,
max_target_len,
self.guided_attention_sigma)
soft_mask_ = dg.to_variable(soft_mask)
loss = fluid.layers.reduce_mean(predicted_attention * soft_mask_)
return loss
def __call__(self, outputs, inputs):
"""Total loss
Args:
outputs is a tuple of (mel_hyp, lin_hyp, attn_hyp, done_hyp).
mel_hyp (Variable): shape(B, T, C_mel), dtype float32, predicted mel spectrogram.
lin_hyp (Variable): shape(B, T, C_lin), dtype float32, predicted linear spectrogram.
done_hyp (Variable): shape(B, T), dtype float32, predicted done probability.
attn_hyp (Variable): shape(N, B, T_dec, T_enc), dtype float32, predicted attention.
inputs is a tuple of (mel_ref, lin_ref, done_ref, input_lengths, n_frames)
mel_ref (Variable): shape(B, T, C_mel), dtype float32, ground truth mel spectrogram.
lin_ref (Variable): shape(B, T, C_lin), dtype float32, ground truth linear spectrogram.
done_ref (Variable): shape(B, T), dtype float32, ground truth done flag.
input_lengths (Variable): shape(B, ), dtype: int, encoder valid lengths.
n_frames (Variable): shape(B, ), dtype: int, decoder valid lengths.
Returns:
Dict(str, Variable): details of loss.
"""
total_loss = 0.
mel_hyp, lin_hyp, attn_hyp, done_hyp = outputs
mel_ref, lin_ref, done_ref, input_lengths, n_frames = inputs
# n_frames # mel_lengths # decoder_lengths
max_frames = lin_hyp.shape[1]
max_mel_steps = max_frames // self.downsample_factor
# max_decoder_steps = max_mel_steps // self.r
# decoder_mask = F.sequence_mask(n_frames // self.downsample_factor //
# self.r,
# max_decoder_steps,
# dtype="float32")
mel_mask = F.sequence_mask(
n_frames // self.downsample_factor, max_mel_steps, dtype="float32")
lin_mask = F.sequence_mask(n_frames, max_frames, dtype="float32")
lin_hyp = lin_hyp[:, :-self.time_shift, :]
lin_ref = lin_ref[:, self.time_shift:, :]
lin_mask = lin_mask[:, self.time_shift:]
lin_l1_loss = self.l1_loss(
lin_hyp, lin_ref, lin_mask, priority_bin=self.priority_bin)
lin_bce_loss = self.binary_divergence(lin_hyp, lin_ref, lin_mask)
lin_loss = self.binary_divergence_weight * lin_bce_loss \
+ (1 - self.binary_divergence_weight) * lin_l1_loss
total_loss += lin_loss
mel_hyp = mel_hyp[:, :-self.time_shift, :]
mel_ref = mel_ref[:, self.time_shift:, :]
mel_mask = mel_mask[:, self.time_shift:]
mel_l1_loss = self.l1_loss(mel_hyp, mel_ref, mel_mask)
mel_bce_loss = self.binary_divergence(mel_hyp, mel_ref, mel_mask)
# print("=====>", mel_l1_loss.numpy()[0], mel_bce_loss.numpy()[0])
mel_loss = self.binary_divergence_weight * mel_bce_loss \
+ (1 - self.binary_divergence_weight) * mel_l1_loss
total_loss += mel_loss
attn_loss = self.attention_loss(attn_hyp,
input_lengths.numpy(),
n_frames.numpy() //
(self.downsample_factor * self.r))
total_loss += attn_loss
done_loss = self.done_loss(done_hyp, done_ref)
total_loss += done_loss
losses = {
"loss": total_loss,
"mel/mel_loss": mel_loss,
"mel/l1_loss": mel_l1_loss,
"mel/bce_loss": mel_bce_loss,
"lin/lin_loss": lin_loss,
"lin/l1_loss": lin_l1_loss,
"lin/bce_loss": lin_bce_loss,
"done": done_loss,
"attn": attn_loss,
}
return losses
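To make the time shift in `__call__` concrete, a minimal numpy sketch (hypothetical shapes) of how hypothesis and reference are aligned before the L1/BCE terms are computed: with r frames generated per decoder step, the prediction at frame t is compared against the reference at frame t + r.
```python
import numpy as np

r = 4                              # frames per decoder step (self.time_shift)
hyp = np.random.randn(2, 20, 80)   # (B, T, C) predicted spectrogram
ref = np.random.randn(2, 20, 80)   # (B, T, C) ground truth spectrogram

hyp_shifted = hyp[:, :-r, :]       # drop the last r predicted frames
ref_shifted = ref[:, r:, :]        # drop the first r reference frames
l1 = float(np.mean(np.abs(hyp_shifted - ref_shifted)))
print(hyp_shifted.shape, ref_shifted.shape, l1)  # (2, 16, 80) (2, 16, 80) ...
```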

View File

@ -1,106 +1,482 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
import math
import paddle
from paddle import fluid
from paddle.fluid import layers as F
from paddle.fluid import initializer as I
from paddle.fluid import dygraph as dg
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from .conv import Conv1D
from .weight_norm_hook import weight_norm, remove_weight_norm
def positional_encoding(tensor, start_index, omega):
"""
tensor: a reference tensor, used only for its shape(B, T, C); actually only T and C are needed.
start_index (int): the start position index (start and length together specify the positions).
omega (Variable): shape(B, ), speaker position rates.
class DeepVoice3(dg.Layer):
def __init__(self, encoder, decoder, converter, speaker_embedding,
use_decoder_states):
"""Deep Voice 3 TTS model.
return (B, T, C), position embedding
"""
dtype = omega.dtype
_, length, dimension = tensor.shape
index = F.range(start_index, start_index + length, 1, dtype=dtype)
channel = F.range(0, dimension, 2, dtype=dtype)
Args:
encoder (Layer): the encoder.
decoder (Layer): the decoder.
converter (Layer): the converter.
speaker_embedding (Layer): the speaker embedding (for multispeaker cases).
use_decoder_states (bool): use decoder states instead of predicted mel spectrogram as the input of the converter.
p = F.unsqueeze(omega, [1, 2]) \
* F.unsqueeze(index, [1]) \
/ (10000 ** (channel / float(dimension)))
encodings = F.concat([F.sin(p), F.cos(p)], axis=2)
return encodings
class ConvBlock(dg.Layer):
def __init__(self, in_channel, kernel_size, causal=False, has_bias=False,
bias_dim=None, keep_prob=1.):
super(ConvBlock, self).__init__()
self.causal = causal
self.keep_prob = keep_prob
self.in_channel = in_channel
self.has_bias = has_bias
std = np.sqrt(4 * keep_prob / (kernel_size * in_channel))
initializer = I.NormalInitializer(loc=0., scale=std)
padding = "valid" if causal else "same"
conv = Conv1D(in_channel, 2 * in_channel, (kernel_size, ),
padding=padding,
data_format="NTC",
param_attr=initializer)
self.conv = weight_norm(conv)
if has_bias:
self.bias_affine = dg.Linear(bias_dim, 2 * in_channel)
def forward(self, input, bias=None, padding=None):
"""
super(DeepVoice3, self).__init__()
if speaker_embedding is None:
self.n_speakers = 1
input: input feature (B, T, C)
padding: only used with causal conv; we pad manually.
"""
input_dropped = F.dropout(input, 1. - self.keep_prob,
dropout_implementation="upscale_in_train")
if self.causal:
assert padding is not None
input_dropped = F.concat([padding, input_dropped], axis=1)
hidden = self.conv(input_dropped)
if self.has_bias:
assert bias is not None
transformed_bias = F.softsign(self.bias_affine(bias))
hidden_embedded = hidden + F.unsqueeze(transformed_bias, [1])
else:
self.speaker_embedding = speaker_embedding
hidden_embedded = hidden
# glu
content, gate = F.split(hidden, num_or_sections=2, dim=-1)
content = hidden_embedded[:, :, :self.in_channel]
hidden = F.sigmoid(gate) * content
# residual
hidden = F.scale(input + hidden, math.sqrt(0.5))
return hidden
class AffineBlock1(dg.Layer):
def __init__(self, in_channel, out_channel, has_bias=False, bias_dim=0):
super(AffineBlock1, self).__init__()
std = np.sqrt(1.0 / in_channel)
initializer = I.NormalInitializer(loc=0., scale=std)
affine = dg.Linear(in_channel, out_channel, param_attr=initializer)
self.affine = weight_norm(affine, dim=-1)
if has_bias:
self.bias_affine = dg.Linear(bias_dim, out_channel)
self.has_bias = has_bias
self.bias_dim = bias_dim
def forward(self, input, bias=None):
"""
input -> (affine + weight_norm) -> hidden
bias -> (affine) -> softsign -> transformed_bias
hidden += transformed_bias
"""
hidden = self.affine(input)
if self.has_bias:
assert bias is not None
transformed_bias = F.softsign(self.bias_affine(bias))
hidden += F.unsqueeze(transformed_bias, [1])
return hidden
class AffineBlock2(dg.Layer):
def __init__(self, in_channel, out_channel,
has_bias=False, bias_dim=0, dropout=False, keep_prob=1.):
super(AffineBlock2, self).__init__()
if has_bias:
self.bias_affine = dg.Linear(bias_dim, in_channel)
std = np.sqrt(1.0 / in_channel)
initializer = I.NormalInitializer(loc=0., scale=std)
affine = dg.Linear(in_channel, out_channel, param_attr=initializer)
self.affine = weight_norm(affine, dim=-1)
self.has_bias = has_bias
self.bias_dim = bias_dim
self.dropout = dropout
self.keep_prob = keep_prob
def forward(self, input, bias=None):
"""
input -> (dropout) -> hidden
bias -> (affine) -> softsign -> transformed_bias
hidden += transformed_bias
hidden -> (affine + weight_norm) -> relu -> hidden
"""
hidden = input
if self.dropout:
hidden = F.dropout(hidden, 1. - self.keep_prob,
dropout_implementation="upscale_in_train")
if self.has_bias:
assert bias is not None
transformed_bias = F.softsign(self.bias_affine(bias))
hidden += F.unsqueeze(transformed_bias, [1])
hidden = F.relu(self.affine(hidden))
return hidden
class Encoder(dg.Layer):
def __init__(self, layers, in_channels, encoder_dim, kernel_size,
has_bias=False, bias_dim=0, keep_prob=1.):
super(Encoder, self).__init__()
self.pre_affine = AffineBlock1(in_channels, encoder_dim, has_bias, bias_dim)
self.convs = dg.LayerList([
ConvBlock(encoder_dim, kernel_size, False, has_bias, bias_dim, keep_prob) \
for _ in range(layers)])
self.post_affine = AffineBlock1(encoder_dim, in_channels, has_bias, bias_dim)
def forward(self, char_embed, speaker_embed=None):
hidden = self.pre_affine(char_embed, speaker_embed)
for layer in self.convs:
hidden = layer(hidden, speaker_embed)
hidden = self.post_affine(hidden, speaker_embed)
keys = hidden
values = F.scale(char_embed + hidden, np.sqrt(0.5))
return keys, values
class AttentionBlock(dg.Layer):
def __init__(self, attention_dim, input_dim, position_encoding_weight=1.,
position_rate=1., reduction_factor=1, has_bias=False, bias_dim=0,
keep_prob=1.):
super(AttentionBlock, self).__init__()
# positional encoding
omega_default = position_rate / reduction_factor
self.omega_default = omega_default
# multispeaker case
if has_bias:
std = np.sqrt(1.0 / bias_dim)
initializer = I.NormalInitializer(loc=0., scale=std)
self.q_pos_affine = dg.Linear(bias_dim, 1, param_attr=initializer)
self.k_pos_affine = dg.Linear(bias_dim, 1, param_attr=initializer)
self.omega_initial = self.create_parameter(shape=[1],
attr=I.ConstantInitializer(value=omega_default))
# mind the fact that q, k, v have the same feature dimension
# so we can init k_affine and q_affine's weight as the same matrix
# to get a better initial attention
init_weight = np.random.normal(size=(input_dim, attention_dim),
scale=np.sqrt(1. / input_dim))
initializer = I.NumpyArrayInitializer(init_weight.astype(np.float32))
# 3 affine transformation to project q, k, v into attention_dim
q_affine = dg.Linear(input_dim, attention_dim,
param_attr=initializer)
self.q_affine = weight_norm(q_affine, dim=-1)
k_affine = dg.Linear(input_dim, attention_dim,
param_attr=initializer)
self.k_affine = weight_norm(k_affine, dim=-1)
std = np.sqrt(1.0 / input_dim)
initializer = I.NormalInitializer(loc=0., scale=std)
v_affine = dg.Linear(input_dim, attention_dim, param_attr=initializer)
self.v_affine = weight_norm(v_affine, dim=-1)
std = np.sqrt(1.0 / attention_dim)
initializer = I.NormalInitializer(loc=0., scale=std)
out_affine = dg.Linear(attention_dim, input_dim, param_attr=initializer)
self.out_affine = weight_norm(out_affine, dim=-1)
self.keep_prob = keep_prob
self.has_bias = has_bias
self.bias_dim = bias_dim
self.attention_dim = attention_dim
self.position_encoding_weight = position_encoding_weight
def forward(self, q, k, v, lengths, speaker_embed, start_index,
force_monotonic=False, prev_coeffs=None, window=None):
# add position encoding as an inductive bias
if self.has_bias: # multi-speaker model
omega_q = 2 * F.sigmoid(
F.squeeze(self.q_pos_affine(speaker_embed), axes=[-1]))
omega_k = 2 * self.omega_initial * F.sigmoid(F.squeeze(
self.k_pos_affine(speaker_embed), axes=[-1]))
else: # single-speaker case
batch_size = q.shape[0]
omega_q = F.ones((batch_size, ), dtype="float32")
omega_k = F.ones((batch_size, ), dtype="float32") * self.omega_default
q += self.position_encoding_weight * positional_encoding(q, start_index, omega_q)
k += self.position_encoding_weight * positional_encoding(k, 0, omega_k)
q, k, v = self.q_affine(q), self.k_affine(k), self.v_affine(v)
activations = F.matmul(q, k, transpose_y=True)
activations /= np.sqrt(self.attention_dim)
if self.training:
# mask the <pad> parts from the encoder
mask = F.sequence_mask(lengths, dtype="float32")
attn_bias = F.scale(1. - mask, -1000)
activations += F.unsqueeze(attn_bias, [1])
elif force_monotonic:
assert window is not None
backward_step, forward_step = window
T_enc = k.shape[1]
batch_size, T_dec, _ = q.shape
# actually T_dec = 1 here
alpha = F.fill_constant((batch_size, T_dec), value=0, dtype="int64") \
if prev_coeffs is None \
else F.argmax(prev_coeffs, axis=-1)
backward = F.sequence_mask(alpha - backward_step, maxlen=T_enc, dtype="bool")
forward = F.sequence_mask(alpha + forward_step, maxlen=T_enc, dtype="bool")
mask = F.cast(F.logical_xor(backward, forward), "float32")
# print("mask's shape:", mask.shape)
attn_bias = F.scale(1. - mask, -1000)
activations += attn_bias
# softmax
coefficients = F.softmax(activations, axis=-1)
# context vector
coefficients = F.dropout(coefficients, 1. - self.keep_prob,
dropout_implementation='upscale_in_train')
contexts = F.matmul(coefficients, v)
# context normalization
enc_lengths = F.cast(F.unsqueeze(lengths, axes=[1, 2]), "float32")
contexts *= F.sqrt(enc_lengths)
# out affine
contexts = self.out_affine(contexts)
return contexts, coefficients
class Decoder(dg.Layer):
def __init__(self, in_channels, reduction_factor, prenet_sizes,
layers, kernel_size, attention_dim,
position_encoding_weight=1., omega=1.,
has_bias=False, bias_dim=0, keep_prob=1.):
super(Decoder, self).__init__()
# prenet - mind the difference between AffineBlock2 and AffineBlock1
c_in = in_channels
self.prenet = dg.LayerList()
for i, c_out in enumerate(prenet_sizes):
affine = AffineBlock2(c_in, c_out, has_bias, bias_dim, dropout=(i!=0), keep_prob=keep_prob)
self.prenet.append(affine)
c_in = c_out
# causal convolutions + multihop attention
decoder_dim = prenet_sizes[-1]
self.causal_convs = dg.LayerList()
self.attention_blocks = dg.LayerList()
for i in range(layers):
conv = ConvBlock(decoder_dim, kernel_size, True, has_bias, bias_dim, keep_prob)
attn = AttentionBlock(attention_dim, decoder_dim, position_encoding_weight, omega, reduction_factor, has_bias, bias_dim, keep_prob)
self.causal_convs.append(conv)
self.attention_blocks.append(attn)
# output mel spectrogram
output_dim = reduction_factor * in_channels # r * mel_dim
std = np.sqrt(1.0 / decoder_dim)
initializer = I.NormalInitializer(loc=0., scale=std)
out_affine = dg.Linear(decoder_dim, output_dim, param_attr=initializer)
self.out_affine = weight_norm(out_affine, dim=-1)
if has_bias:
self.out_sp_affine = dg.Linear(bias_dim, output_dim)
self.has_bias = has_bias
self.kernel_size = kernel_size
self.in_channels = in_channels
self.decoder_dim = decoder_dim
self.reduction_factor = reduction_factor
self.out_channels = output_dim
def forward(self, inputs, keys, values, lengths, start_index, speaker_embed=None,
state=None, force_monotonic_attention=None, coeffs=None, window=(0, 4)):
hidden = inputs
for layer in self.prenet:
hidden = layer(hidden, speaker_embed)
attentions = [] # every layer of (B, T_dec, T_enc) attention
final_state = [] # layers * (B, (k-1)d, C_dec)
batch_size = inputs.shape[0]
causal_padding_shape = (batch_size, self.kernel_size - 1, self.decoder_dim)
for i in range(len(self.causal_convs)):
if state is None:
padding = F.zeros(causal_padding_shape, dtype="float32")
else:
padding = state[i]
new_state = F.concat([padding, hidden], axis=1) # => to be used next step
# causal conv, (B, T, C)
hidden = self.causal_convs[i](hidden, speaker_embed, padding=padding)
# attn
prev_coeffs = None if coeffs is None else coeffs[i]
force_monotonic = False if force_monotonic_attention is None else force_monotonic_attention[i]
context, attention = self.attention_blocks[i](
hidden, keys, values, lengths, speaker_embed,
start_index, force_monotonic, prev_coeffs, window)
# residual connection (B, T_dec, C_dec)
hidden = F.scale(hidden + context, np.sqrt(0.5))
attentions.append(attention) # layers * (B, T_dec, T_enc)
# new state: shift a step, layers * (B, T, C)
new_state = new_state[:, -(self.kernel_size - 1):, :]
final_state.append(new_state)
# predict mel spectrogram (B, 1, T_dec, r * C_in)
decoded = self.out_affine(hidden)
if self.has_bias:
decoded *= F.sigmoid(F.unsqueeze(self.out_sp_affine(speaker_embed), [1]))
return decoded, hidden, attentions, final_state
class PostNet(dg.Layer):
def __init__(self, layers, in_channels, postnet_dim, kernel_size, out_channels, upsample_factor, has_bias=False, bias_dim=0, keep_prob=1.):
super(PostNet, self).__init__()
self.pre_affine = AffineBlock1(in_channels, postnet_dim, has_bias, bias_dim)
self.convs = dg.LayerList([
ConvBlock(postnet_dim, kernel_size, False, has_bias, bias_dim, keep_prob) for _ in range(layers)
])
std = np.sqrt(1.0 / postnet_dim)
initializer = I.NormalInitializer(loc=0., scale=std)
post_affine = dg.Linear(postnet_dim, out_channels, param_attr=initializer)
self.post_affine = weight_norm(post_affine, dim=-1)
self.upsample_factor = upsample_factor
def forward(self, hidden, speaker_embed=None):
hidden = self.pre_affine(hidden, speaker_embed)
batch_size, time_steps, channels = hidden.shape # pylint: disable=unused-variable
hidden = F.expand(hidden, [1, 1, self.upsample_factor])
hidden = F.reshape(hidden, [batch_size, -1, channels])
for layer in self.convs:
hidden = layer(hidden, speaker_embed)
spec = self.post_affine(hidden)
return spec
class SpectraNet(dg.Layer):
def __init__(self, char_embedding, speaker_embedding, encoder, decoder, postnet):
super(SpectraNet, self).__init__()
self.char_embedding = char_embedding
self.speaker_embedding = speaker_embedding
self.encoder = encoder
self.decoder = decoder
self.converter = converter
self.use_decoder_states = use_decoder_states
self.postnet = postnet
def forward(self, text, text_lengths, speakers=None, mel=None, frame_lengths=None,
force_monotonic_attention=None, window=None):
# encode
text_embed = self.char_embedding(text)# no stress embedding here
speaker_embed = F.softsign(self.speaker_embedding(speakers)) if self.speaker_embedding is not None else None
keys, values = self.encoder(text_embed, speaker_embed)
def forward(self, text_sequences, text_positions, valid_lengths,
speaker_indices, mel_inputs, frame_positions):
"""Compute predicted value in a teacher forcing training manner.
Args:
text_sequences (Variable): shape(B, T_enc), dtype: int64, text indices.
text_positions (Variable): shape(B, T_enc), dtype: int64, positions of text indices.
valid_lengths (Variable): shape(B, ), dtype: int64, valid lengths of utterances.
speaker_indices (Variable): shape(B, ), dtype: int64, speaker indices for utterances.
mel_inputs (Variable): shape(B, T_mel, C_mel), dtype: float32, ground truth mel spectrogram.
frame_positions (Variable): shape(B, T_dec), dtype: int64, positions of decoder steps.
Returns:
(mel_outputs, linear_outputs, alignments, done)
mel_outputs (Variable): shape(B, T_mel, C_mel), dtype float32, predicted mel spectrogram.
linear_outputs (Variable): shape(B, T_lin, C_lin), dtype float32, predicted linear spectrogram.
alignments (Variable): shape(N, B, T_dec, T_enc), dtype float32, predicted attention.
done (Variable): shape(B, T_dec), dtype float32, predicted done probability.
(T_mel: time steps of mel spectrogram, T_lin: time steps of linear spectrogram, T_dec: time steps of decoder, T_enc: time steps of encoder.)
"""
if hasattr(self, "speaker_embedding"):
speaker_embed = self.speaker_embedding(speaker_indices)
if mel is not None:
return self.teacher_forced_train(keys, values, text_lengths, speaker_embed, mel)
else:
speaker_embed = None
return self.inference(keys, values, text_lengths, speaker_embed, force_monotonic_attention, window)
keys, values = self.encoder(text_sequences, speaker_embed)
mel_outputs, alignments, done, decoder_states = self.decoder(
(keys, values), valid_lengths, mel_inputs, text_positions,
frame_positions, speaker_embed)
linear_outputs = self.converter(decoder_states
if self.use_decoder_states else
mel_outputs, speaker_embed)
return mel_outputs, linear_outputs, alignments, done
def teacher_forced_train(self, keys, values, text_lengths, speaker_embed, mel):
# build decoder inputs by shifting one frame and prepending an all-zero <start> frame
# the mel input is downsampled by a reduction factor
batch_size = mel.shape[0]
mel_input = F.reshape(mel, (batch_size, -1, self.decoder.reduction_factor, self.decoder.in_channels))
zero_frame = F.zeros((batch_size, 1, self.decoder.in_channels), dtype="float32")
# downsample mel input as a regularization
mel_input = F.concat([zero_frame, mel_input[:, :-1, -1, :]], axis=1)
def transduce(self, text_sequences, text_positions, speaker_indices=None):
"""Generate output without teacher forcing. Only batch_size = 1 is supported.
# decoder
decoded, hidden, attentions, final_state = self.decoder(mel_input, keys, values, text_lengths, 0, speaker_embed)
attentions = F.stack(attentions) # (N, B, T_dec, T_enc)
# unfold frames
decoded = F.reshape(decoded, (batch_size, -1, self.decoder.in_channels))
# postnet
refined = self.postnet(hidden, speaker_embed)
return decoded, refined, attentions, final_state
Args:
text_sequences (Variable): shape(B, T_enc), dtype: int64, text indices.
text_positions (Variable): shape(B, T_enc), dtype: int64, positions of text indices.
speaker_indices (Variable): shape(B, ), dtype: int64, speaker indices for utterances.
Returns:
(mel_outputs, linear_outputs, alignments, done)
mel_outputs (Variable): shape(B, T_mel, C_mel), dtype float32, predicted mel spectrogram.
linear_outputs (Variable): shape(B, T_lin, C_lin), dtype float32, predicted linear spectrogram.
alignments (Variable): shape(B, T_dec, T_enc), dtype float32, predicted average attention of all attention layers.
done (Variable): shape(B, T_dec), dtype float32, predicted done probability.
(T_mel: time steps of mel spectrogram, T_lin: time steps of linear spectrogram, T_dec: time steps of decoder, T_enc: time steps of encoder.)
"""
if hasattr(self, "speaker_embedding"):
speaker_embed = self.speaker_embedding(speaker_indices)
def spec_loss(self, decoded, input, num_frames=None):
if num_frames is None:
l1_loss = F.reduce_mean(F.abs(decoded - input))
else:
speaker_embed = None
# mask the <pad> part of the decoder
num_channels = decoded.shape[-1]
l1_loss = F.abs(decoded - input)
mask = F.sequence_mask(num_frames, dtype="float32")
l1_loss *= F.unsqueeze(mask, axes=[-1])
l1_loss = F.reduce_sum(l1_loss) / F.scale(F.reduce_sum(mask), num_channels)
return l1_loss
keys, values = self.encoder(text_sequences, speaker_embed)
mel_outputs, alignments, done, decoder_states = self.decoder.decode(
(keys, values), text_positions, speaker_embed)
linear_outputs = self.converter(decoder_states
if self.use_decoder_states else
mel_outputs, speaker_embed)
return mel_outputs, linear_outputs, alignments, done
@dg.no_grad
def inference(self, keys, values, text_lengths, speaker_embed,
force_monotonic_attention, window):
MAX_STEP = 500
# layer index of the first monotonic attention
num_monotonic_attention_layers = sum(force_monotonic_attention)
first_mono_attention_layer = 0
if num_monotonic_attention_layers > 0:
for i, item in enumerate(force_monotonic_attention):
if item:
first_mono_attention_layer = i
break
# stop condition (it would be more complicated to support minibatch autoregressive decoding,
# so we only support batch_size == 1 in inference)
def should_continue(i, mel_input, outputs, hidden, attention, state, coeffs):
T_enc = coeffs.shape[-1]
attn_peak = F.argmax(coeffs[first_mono_attention_layer, 0, 0]) \
if num_monotonic_attention_layers > 0 \
else F.fill_constant([1], "int64", value=0)
return i < MAX_STEP and F.reshape(attn_peak, [1]) < T_enc - 1
def loop_body(i, mel_input, outputs, hiddens, attentions, state=None, coeffs=None):
# state is None and coeffs is None for the first step
decoded, hidden, new_coeffs, new_state = self.decoder(
mel_input, keys, values, text_lengths, i, speaker_embed,
state, force_monotonic_attention, coeffs, window)
new_coeffs = F.stack(new_coeffs) # (N, B, T_dec=1, T_enc)
attentions.append(new_coeffs) # (N, B, T_dec=1, T_enc)
outputs.append(decoded) # (B, T_dec=1, rC_mel)
hiddens.append(hidden) # (B, T_dec=1, C_dec)
# slice the last frame out of r generated frames to be used as the input for the next step
batch_size = mel_input.shape[0]
frames = F.reshape(decoded, [batch_size, -1, self.decoder.reduction_factor, self.decoder.in_channels])
input_frame = frames[:, :, -1, :]
return (i + 1, input_frame, outputs, hiddens, attentions, new_state, new_coeffs)
i = 0
batch_size = keys.shape[0]
input_frame = F.zeros((batch_size, 1, self.decoder.in_channels), dtype="float32")
outputs = []
hiddens = []
attentions = []
loop_state = loop_body(i, input_frame, outputs, hiddens, attentions)
while should_continue(*loop_state):
loop_state = loop_body(*loop_state)
outputs, hiddens, attention = loop_state[2], loop_state[3], loop_state[4]
# concat decoder timesteps
outputs = F.concat(outputs, axis=1)
hiddens = F.concat(hiddens, axis=1)
attention = F.concat(attention, axis=2)
# unfold frames
outputs = F.reshape(outputs, (batch_size, -1, self.decoder.in_channels))
refined = self.postnet(hiddens, speaker_embed)
return outputs, refined, attention
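The reduction-factor bookkeeping in `teacher_forced_train` and `inference` can be illustrated with a small numpy sketch (toy numbers, not the model code): the mel spectrogram is grouped into chunks of r frames per decoder step, only the last frame of each chunk is fed back as the next input, and the decoder output is unfolded back to (B, T, C_mel).
```python
import numpy as np

B, T, C, r = 1, 8, 3, 4                      # T must be a multiple of r
mel = np.arange(B * T * C, dtype="float32").reshape(B, T, C)

grouped = mel.reshape(B, -1, r, C)           # (B, T//r, r, C)
zero_frame = np.zeros((B, 1, C), dtype="float32")
# <start> frame plus the last frame of every chunk except the final one
decoder_input = np.concatenate([zero_frame, grouped[:, :-1, -1, :]], axis=1)
print(decoder_input.shape)                   # (1, 2, 3): one frame per decoder step

decoded = np.random.randn(B, T // r, r * C)  # decoder output: r * C values per step
unfolded = decoded.reshape(B, -1, C)         # unfold frames
print(unfolded.shape)                        # (1, 8, 3)
```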

View File

@ -1,158 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from paddle import fluid
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
def lookup(weight, indices, padding_idx):
out = fluid.core.ops.lookup_table_v2(
weight, indices, 'is_sparse', False, 'is_distributed', False,
'remote_prefetch', False, 'padding_idx', padding_idx)
return out
def compute_position_embedding_single_speaker(radians, speaker_position_rate):
"""Compute sin/cos interleaved matrix from the radians.
Args:
radians (Variable): shape(n_vocab, embed_dim), dtype float32, the radians matrix.
speaker_position_rate (float or Variable): float or Variable of shape(1, ), speaker positioning rate.
Returns:
Variable: shape(n_vocab, embed_dim), the sin, cos interleaved matrix.
"""
_, embed_dim = radians.shape
scaled_radians = radians * speaker_position_rate
odd_mask = (np.arange(embed_dim) % 2).astype(np.float32)
odd_mask = dg.to_variable(odd_mask)
out = odd_mask * F.cos(scaled_radians) \
+ (1 - odd_mask) * F.sin(scaled_radians)
return out
def compute_position_embedding(radians, speaker_position_rate):
"""Compute sin/cos interleaved matrix from the radians.
Args:
radians (Variable): shape(n_vocab, embed_dim), dtype float32, the radians matrix.
speaker_position_rate (Variable): shape(B, ), speaker positioning rate.
Returns:
Variable: shape(B, n_vocab, embed_dim), the sin, cos interleaved matrix.
"""
_, embed_dim = radians.shape
batch_size = speaker_position_rate.shape[0]
scaled_radians = F.elementwise_mul(
F.expand(F.unsqueeze(radians, [0]), [batch_size, 1, 1]),
speaker_position_rate,
axis=0)
odd_mask = (np.arange(embed_dim) % 2).astype(np.float32)
odd_mask = dg.to_variable(odd_mask)
out = odd_mask * F.cos(scaled_radians) \
+ (1 - odd_mask) * F.sin(scaled_radians)
out = F.concat(
[F.zeros((batch_size, 1, embed_dim), radians.dtype), out[:, 1:, :]],
axis=1)
return out
def position_encoding_init(n_position,
d_pos_vec,
position_rate=1.0,
padding_idx=None):
"""Init the position encoding.
Args:
n_position (int): max position, vocab size for position embedding.
d_pos_vec (int): position embedding size.
position_rate (float, optional): position rate (this should only be used when all the utterances are from one speaker). Defaults to 1.0.
padding_idx (int, optional): padding index for the position embedding (it is set to 0 internally if not provided). Defaults to None.
Returns:
np.ndarray: shape(n_position, d_pos_vec), the radians table (sin and cos are not applied yet).
"""
# init the position encoding table
# keep idx 0 for padding token position encoding zero vector
# CAUTION: it is radians here, sin and cos are not applied
indices_range = np.expand_dims(np.arange(n_position), -1)
embed_range = 2 * (np.arange(d_pos_vec) // 2)
radians = position_rate \
* indices_range \
/ np.power(1.e4, embed_range / d_pos_vec)
if padding_idx is not None:
radians[padding_idx] = 0.
return radians
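A small numpy sketch of the same table: `position_encoding_init` stores radians only, and sin/cos are applied afterwards (sin on even channels, cos on odd channels), optionally scaled by a speaker position rate, as in `compute_position_embedding_single_speaker` above.
```python
import numpy as np

def radians_table(n_position, d_pos_vec, position_rate=1.0):
    indices = np.expand_dims(np.arange(n_position), -1)
    embed = 2 * (np.arange(d_pos_vec) // 2)
    return position_rate * indices / np.power(1.e4, embed / d_pos_vec)

rad = radians_table(6, 4)
odd_mask = (np.arange(4) % 2).astype("float32")
# sin on even channels, cos on odd channels
pos_embed = odd_mask * np.cos(rad) + (1 - odd_mask) * np.sin(rad)
print(pos_embed.shape)  # (6, 4)
```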
class PositionEmbedding(dg.Layer):
def __init__(self, n_position, d_pos_vec, position_rate=1.0):
"""Position Embedding for Deep Voice 3.
Args:
n_position (int): max position, vocab size for position embedding.
d_pos_vec (int): position embedding size.
position_rate (float, optional): position rate (this should only be used when all the utterances are from one speaker). Defaults to 1.0.
"""
super(PositionEmbedding, self).__init__()
self.weight = self.create_parameter((n_position, d_pos_vec))
self.weight.set_value(
position_encoding_init(n_position, d_pos_vec, position_rate)
.astype("float32"))
def forward(self, indices, speaker_position_rate=None):
"""
Args:
indices (Variable): shape (B, T), dtype: int64, position
indices, where B means the batch size, T means the time steps.
speaker_position_rate (Variable | float, optional): position
rate. It can be a floating point number or a Variable with
shape (1,), in which case this speaker_position_rate is used for every
example. It can also be a Variable with shape (B, ), which
contains a speaker position rate for each utterance.
Returns:
out (Variable): shape(B, T, C_pos), dtype float32, position embedding, where C_pos
means position embedding size.
"""
batch_size, time_steps = indices.shape
if isinstance(speaker_position_rate, float) or \
(isinstance(speaker_position_rate, fluid.framework.Variable)
and list(speaker_position_rate.shape) == [1]):
temp_weight = compute_position_embedding_single_speaker(
self.weight, speaker_position_rate)
out = lookup(temp_weight, indices, 0)
return out
assert len(speaker_position_rate.shape) == 1 and \
list(speaker_position_rate.shape) == [batch_size]
weight = compute_position_embedding(self.weight,
speaker_position_rate) # (B, V, C)
# make indices for gather_nd
batch_id = F.expand(
F.unsqueeze(
F.range(
0, batch_size, 1, dtype="int64"), [1]), [1, time_steps])
# (B, T, 2)
gather_nd_id = F.stack([batch_id, indices], -1)
out = F.gather_nd(weight, gather_nd_id)
return out

View File

@ -0,0 +1,148 @@
import numpy as np
import paddle
from paddle import fluid
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as F
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.data_feeder import check_variable_and_dtype
def l2_norm(x, axis, epsilon=1e-12, name=None):
if len(x.shape) == 1:
axis = 0
check_variable_and_dtype(x, "X", ("float32", "float64"), "norm")
helper = LayerHelper("l2_normalize", **locals())
out = helper.create_variable_for_type_inference(dtype=x.dtype)
norm = helper.create_variable_for_type_inference(dtype=x.dtype)
helper.append_op(
type="norm",
inputs={"X": x},
outputs={"Out": out,
"Norm": norm},
attrs={
"axis": 1 if axis is None else axis,
"epsilon": epsilon,
})
return F.squeeze(norm, axes=[axis])
def norm_except_dim(p, dim):
shape = p.shape
ndims = len(shape)
if dim is None:
return F.sqrt(F.reduce_sum(F.square(p)))
elif dim == 0:
p_matrix = F.reshape(p, (shape[0], -1))
return l2_norm(p_matrix, axis=1)
elif dim == -1 or dim == ndims - 1:
p_matrix = F.reshape(p, (-1, shape[-1]))
return l2_norm(p_matrix, axis=0)
else:
perm = list(range(ndims))
perm[0] = dim
perm[dim] = 0
p_transposed = F.transpose(p, perm)
return norm_except_dim(p_transposed, 0)
def _weight_norm(v, g, dim):
shape = v.shape
ndims = len(shape)
if dim is None:
v_normalized = v / (F.sqrt(F.reduce_sum(F.square(v))) + 1e-12)
elif dim == 0:
p_matrix = F.reshape(v, (shape[0], -1))
v_normalized = F.l2_normalize(p_matrix, axis=1)
v_normalized = F.reshape(v_normalized, shape)
elif dim == -1 or dim == ndims - 1:
p_matrix = F.reshape(v, (-1, shape[-1]))
v_normalized = F.l2_normalize(p_matrix, axis=0)
v_normalized = F.reshape(v_normalized, shape)
else:
perm = list(range(ndims))
perm[0] = dim
perm[dim] = 0
p_transposed = F.transpose(v, perm)
transposed_shape = p_transposed.shape
p_matrix = F.reshape(p_transposed, (p_transposed.shape[0], -1))
v_normalized = F.l2_normalize(p_matrix, axis=1)
v_normalized = F.reshape(v_normalized, transposed_shape)
v_normalized = F.transpose(v_normalized, perm)
weight = F.elementwise_mul(v_normalized, g, axis=dim if dim is not None else -1)
return weight
class WeightNorm(object):
def __init__(self, name, dim):
if dim is None:
dim = -1
self.name = name
self.dim = dim
def compute_weight(self, module):
g = getattr(module, self.name + '_g')
v = getattr(module, self.name + '_v')
w = _weight_norm(v, g, self.dim)
return w
@staticmethod
def apply(module: dg.Layer, name, dim):
for k, hook in module._forward_pre_hooks.items():
if isinstance(hook, WeightNorm) and hook.name == name:
raise RuntimeError("Cannot register two weight_norm hooks on "
"the same parameter {}".format(name))
if dim is None:
dim = -1
fn = WeightNorm(name, dim)
# remove w from parameter list
w = getattr(module, name)
del module._parameters[name]
# add g and v as new parameters and express w as g/||v|| * v
g_var = norm_except_dim(w, dim)
v = module.create_parameter(w.shape, dtype=w.dtype)
module.add_parameter(name + "_v", v)
g = module.create_parameter(g_var.shape, dtype=g_var.dtype)
module.add_parameter(name + "_g", g)
with dg.no_grad():
F.assign(w, v)
F.assign(g_var, g)
setattr(module, name, fn.compute_weight(module))
# recompute weight before every forward()
module.register_forward_pre_hook(fn)
return fn
def remove(self, module):
w_var = self.compute_weight(module)
delattr(module, self.name)
del module._parameters[self.name + '_g']
del module._parameters[self.name + '_v']
w = module.create_parameter(w_var.shape, dtype=w_var.dtype)
module.add_parameter(self.name, w)
with dg.no_grad():
F.assign(w_var, w)
def __call__(self, module, inputs):
setattr(module, self.name, self.compute_weight(module))
def weight_norm(module, name='weight', dim=0):
WeightNorm.apply(module, name, dim)
return module
def remove_weight_norm(module, name='weight'):
for k, hook in module._forward_pre_hooks.items():
if isinstance(hook, WeightNorm) and hook.name == name:
hook.remove(module)
del module._forward_pre_hooks[k]
return module
raise ValueError("weight_norm of '{}' not found in {}"
.format(name, module))
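Independent of paddle, the decomposition this hook implements can be sanity-checked with a few lines of numpy (here dim=0, matching `norm_except_dim` and `_weight_norm`): the weight is re-expressed as w = g * v / ||v||, with the norm taken over every axis except dim.
```python
import numpy as np

w = np.random.randn(4, 3).astype("float32")

g = np.linalg.norm(w.reshape(w.shape[0], -1), axis=1)   # norm_except_dim(w, 0)
v = w.copy()                                            # v is initialized from w
v_norm = np.linalg.norm(v.reshape(v.shape[0], -1), axis=1, keepdims=True)
w_recomputed = (v / v_norm) * g[:, None]                # _weight_norm(v, g, 0)

print(np.allclose(w, w_recomputed, atol=1e-6))          # True
```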