remove old examples

This commit is contained in:
iclementine 2020-11-19 15:47:57 +08:00
parent 0e35119453
commit c7e5aaa540
51 changed files with 0 additions and 5295 deletions

View File

@ -1,148 +0,0 @@
# Clarinet
PaddlePaddle dynamic graph implementation of ClariNet, a convolutional network based vocoder. The implementation is based on the paper [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Project Structure
```text
├── data.py          # data processing
├── configs/         # (example) configuration files
├── synthesis.py     # script to synthesize waveform from mel spectrogram
├── train.py         # script to train a model
└── utils.py         # utility functions
```
## Saving & Loading
`train.py` and `synthesis.py` have 3 arguments in common: `--checkpoint`, `--iteration` and `output`.
1. `output` is the directory for saving results.
During training, checkpoints are saved in `checkpoints/` in `output` and the tensorboard log is saved in `log/` in `output`. Other possible outputs are saved in `states/` in `output`.
During synthesizing, audio files and other possible outputs are saved in `synthesis/` in `output`.
So after training and synthesizing with the same output directory, the file structure of the output directory looks like this.
```text
├── checkpoints/ # checkpoint directory (including *.pdparams, *.pdopt and a text file `checkpoint` that records the latest checkpoint)
├── states/ # audio files generated at validation and other possible outputs
├── log/ # tensorboard log
└── synthesis/ # synthesized audio files and other possible outputs
```
2. `--checkpoint` and `--iteration` are used to load from an existing checkpoint. Checkpoint loading follows these rules (a sketch is given below):
- If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
- If `--checkpoint` is not provided, we try to load the checkpoint of the iteration specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided either, we try to load the latest checkpoint from the checkpoint directory.
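For reference, here is a minimal sketch of this rule. It mirrors the calls used in `synthesis.py` and `train.py` shown later in this commit (`parakeet.utils.io.load_parameters`); the wrapper name `load_model` is only for illustration.
```python
import os
from parakeet.utils import io

def load_model(model, args):
    if args.checkpoint is not None:
        # --checkpoint takes priority: load exactly that checkpoint.
        iteration = io.load_parameters(model, checkpoint_path=args.checkpoint)
    else:
        # Otherwise look in <output>/checkpoints, optionally pinned to --iteration;
        # when --iteration is also absent, the latest recorded checkpoint is loaded.
        checkpoint_dir = os.path.join(args.output, "checkpoints")
        iteration = io.load_parameters(
            model, checkpoint_dir=checkpoint_dir, iteration=args.iteration)
    return iteration
```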
## Train
Train the model using `train.py`. Follow the usage displayed by `python train.py --help`.
```text
usage: train.py [-h] [--config CONFIG] [--device DEVICE] [--data DATA]
[--checkpoint CHECKPOINT | --iteration ITERATION]
[--wavenet WAVENET]
output
Train a ClariNet model with LJspeech and a trained WaveNet model.
positional arguments:
output path to save experiment results
optional arguments:
-h, --help show this help message and exit
--config CONFIG path of the config file
--device DEVICE device to use
--data DATA path of LJspeech dataset
--checkpoint CHECKPOINT checkpoint to resume from
--iteration ITERATION the iteration of the checkpoint to load from output directory
--wavenet WAVENET wavenet checkpoint to use
```
- `--config` is the configuration file to use. The provided configuration can be used directly, or you can change some values in it and train the model with a different configuration.
- `--device` is the device (gpu id) to use for training. `-1` means CPU.
- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains `metadata.csv`).
- `--checkpoint` is the path of the checkpoint.
- `--iteration` is the iteration of the checkpoint to load from output directory.
- `output` is the directory to save results; all results are saved in this directory.
See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
- `--wavenet` is the path of the wavenet checkpoint to load.
When you start training a ClariNet model without loading from a ClariNet checkpoint, you should have already trained a WaveNet model with a single Gaussian output distribution. Make sure the config of the teacher model matches that of the trained WaveNet model.
Example script:
```bash
python train.py \
--config=./configs/clarinet_ljspeech.yaml \
--data=./LJSpeech-1.1/ \
--device=0 \
--wavenet="wavenet-step-2000000" \
experiment
```
You can monitor training log via tensorboard, using the script below.
```bash
cd experiment/log
tensorboard --logdir=.
```
## Synthesis
```text
usage: synthesis.py [-h] [--config CONFIG] [--device DEVICE] [--data DATA]
[--checkpoint CHECKPOINT | --iteration ITERATION]
output
Synthesize audio files from mel spectrogram in the validation set.
positional arguments:
output path to save the synthesized audio
optional arguments:
-h, --help show this help message and exit
--config CONFIG path of the config file
--device DEVICE device to use.
--data DATA path of LJspeech dataset
--checkpoint CHECKPOINT checkpoint to resume from
--iteration ITERATION the iteration of the checkpoint to load from output directory
```
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
- `--device` is the device (gpu id) to use. `-1` means CPU.
- `--data` is the path of the LJSpeech dataset. In principle, a dataset is not needed for synthesis, but since the input is mel spectrogram, we need to get the mel spectrograms from the audio files of the validation set.
- `--checkpoint` is the checkpoint to load.
- `--iteration` is the iteration of the checkpoint to load from output directory.
- `output` is the directory to save the synthesized audio. Audio files are saved in `synthesis/` in the `output` directory.
See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
Example script:
```bash
python synthesis.py \
--config=./configs/clarinet_ljspeech.yaml \
--data=./LJSpeech-1.1/ \
--device=0 \
--iteration=500000 \
experiment
```
or
```bash
python synthesis.py \
--config=./configs/clarinet_ljspeech.yaml \
--data=./LJSpeech-1.1/ \
--device=0 \
--checkpoint="experiment/checkpoints/step-500000" \
experiment
```

View File

@ -1,52 +0,0 @@
data:
batch_size: 8
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
conditioner:
upsampling_factors: [16, 16]
teacher:
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
loss_type: "mog"
output_dim: 3
log_scale_min: -9
student:
n_loops: [10, 10, 10, 10, 10, 10]
n_layers: [1, 1, 1, 1, 1, 1]
filter_size: 3
residual_channels: 64
log_scale_min: -7
stft:
n_fft: 2048
win_length: 1024
hop_length: 256
loss:
lmd: 4
train:
learning_rate: 0.0005
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 1000
eval_interval: 1000
max_iterations: 2000000

View File

@ -1,52 +0,0 @@
data:
batch_size: 8
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
conditioner:
upsampling_factors: [16, 16]
teacher:
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
loss_type: "mog"
output_dim: 3
log_scale_min: -9
student:
n_loops: [10, 10, 10, 10, 10, 10]
n_layers: [1, 1, 1, 1, 1, 1]
filter_size: 3
residual_channels: 64
log_scale_min: -7
stft:
n_fft: 2048
win_length: 1024
hop_length: 256
loss:
lmd: 4
train:
learning_rate: 0.0005
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 1000
eval_interval: 1000
max_iterations: 2000000

View File

@ -1,179 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import os
import sys
import argparse
import ruamel.yaml
import random
from tqdm import tqdm
import pickle
import numpy as np
import paddle.fluid.dygraph as dg
from paddle import fluid
fluid.require_version('1.8.0')
from parakeet.modules.weight_norm import WeightNormWrapper
from parakeet.models.wavenet import WaveNet, UpsampleNet
from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet
from parakeet.data import TransformDataset, SliceDataset, RandomSampler, SequentialSampler, DataCargo
from parakeet.utils.layer_tools import summary, freeze
from parakeet.utils import io
from utils import eval_model
sys.path.append("../wavenet")
from data import LJSpeechMetaData, Transform, DataCollector
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Synthesize audio files from mel spectrogram in the validation set."
)
parser.add_argument("--config", type=str, help="path of the config file")
parser.add_argument(
"--device", type=int, default=-1, help="device to use.")
parser.add_argument("--data", type=str, help="path of LJspeech dataset")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument(
"output",
type=str,
default="experiment",
help="path to save the synthesized audio")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
dg.enable_dygraph(place)
ljspeech_meta = LJSpeechMetaData(args.data)
data_config = config["data"]
sample_rate = data_config["sample_rate"]
n_fft = data_config["n_fft"]
win_length = data_config["win_length"]
hop_length = data_config["hop_length"]
n_mels = data_config["n_mels"]
train_clip_seconds = data_config["train_clip_seconds"]
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
ljspeech = TransformDataset(ljspeech_meta, transform)
valid_size = data_config["valid_size"]
ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))
teacher_config = config["teacher"]
n_loop = teacher_config["n_loop"]
n_layer = teacher_config["n_layer"]
filter_size = teacher_config["filter_size"]
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
print("context size is {} samples".format(context_size))
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
train_clip_seconds)
valid_batch_fn = DataCollector(
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
batch_size = data_config["batch_size"]
train_cargo = DataCargo(
ljspeech_train,
train_batch_fn,
batch_size,
sampler=RandomSampler(ljspeech_train))
# only batch=1 for validation is enabled
valid_cargo = DataCargo(
ljspeech_valid,
valid_batch_fn,
batch_size=1,
sampler=SequentialSampler(ljspeech_valid))
# conditioner(upsampling net)
conditioner_config = config["conditioner"]
upsampling_factors = conditioner_config["upsampling_factors"]
upsample_net = UpsampleNet(upscale_factors=upsampling_factors)
freeze(upsample_net)
residual_channels = teacher_config["residual_channels"]
loss_type = teacher_config["loss_type"]
output_dim = teacher_config["output_dim"]
log_scale_min = teacher_config["log_scale_min"]
assert loss_type == "mog" and output_dim == 3, \
"the teacher wavenet should be a wavenet with single gaussian output"
teacher = WaveNet(n_loop, n_layer, residual_channels, output_dim, n_mels,
filter_size, loss_type, log_scale_min)
# load & freeze upsample_net & teacher
freeze(teacher)
student_config = config["student"]
n_loops = student_config["n_loops"]
n_layers = student_config["n_layers"]
student_residual_channels = student_config["residual_channels"]
student_filter_size = student_config["filter_size"]
student_log_scale_min = student_config["log_scale_min"]
student = ParallelWaveNet(n_loops, n_layers, student_residual_channels,
n_mels, student_filter_size)
stft_config = config["stft"]
stft = STFT(
n_fft=stft_config["n_fft"],
hop_length=stft_config["hop_length"],
win_length=stft_config["win_length"])
lmd = config["loss"]["lmd"]
model = Clarinet(upsample_net, teacher, student, stft,
student_log_scale_min, lmd)
summary(model)
# load parameters
if args.checkpoint is not None:
# load from args.checkpoint
iteration = io.load_parameters(model, checkpoint_path=args.checkpoint)
else:
# load from "args.output/checkpoints"
checkpoint_dir = os.path.join(args.output, "checkpoints")
iteration = io.load_parameters(
model, checkpoint_dir=checkpoint_dir, iteration=args.iteration)
assert iteration > 0, "A trained checkpoint is needed."
# make generation fast
for sublayer in model.sublayers():
if isinstance(sublayer, WeightNormWrapper):
sublayer.remove_weight_norm()
# data loader
valid_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
valid_loader.set_batch_generator(valid_cargo, place)
# the directory to save audio files
synthesis_dir = os.path.join(args.output, "synthesis")
if not os.path.exists(synthesis_dir):
os.makedirs(synthesis_dir)
eval_model(model, valid_loader, synthesis_dir, iteration, sample_rate)

View File

@ -1,243 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import os
import sys
import argparse
import ruamel.yaml
import random
from tqdm import tqdm
import pickle
import numpy as np
from visualdl import LogWriter
import paddle.fluid.dygraph as dg
from paddle import fluid
fluid.require_version('1.8.0')
from parakeet.models.wavenet import WaveNet, UpsampleNet
from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet
from parakeet.data import TransformDataset, SliceDataset, CacheDataset, RandomSampler, SequentialSampler, DataCargo
from parakeet.utils.layer_tools import summary, freeze
from parakeet.utils import io
from utils import make_output_tree, eval_model, load_wavenet
# import dataset from wavenet
sys.path.append("../wavenet")
from data import LJSpeechMetaData, Transform, DataCollector
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Train a ClariNet model with LJspeech and a trained WaveNet model."
)
parser.add_argument("--config", type=str, help="path of the config file")
parser.add_argument("--device", type=int, default=-1, help="device to use")
parser.add_argument("--data", type=str, help="path of LJspeech dataset")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument(
"--wavenet", type=str, help="wavenet checkpoint to use")
parser.add_argument(
"output",
type=str,
default="experiment",
help="path to save experiment results")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
dg.enable_dygraph(place)
print("Command Line args: ")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
ljspeech_meta = LJSpeechMetaData(args.data)
data_config = config["data"]
sample_rate = data_config["sample_rate"]
n_fft = data_config["n_fft"]
win_length = data_config["win_length"]
hop_length = data_config["hop_length"]
n_mels = data_config["n_mels"]
train_clip_seconds = data_config["train_clip_seconds"]
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
ljspeech = TransformDataset(ljspeech_meta, transform)
valid_size = data_config["valid_size"]
ljspeech_valid = CacheDataset(SliceDataset(ljspeech, 0, valid_size))
ljspeech_train = CacheDataset(
SliceDataset(ljspeech, valid_size, len(ljspeech)))
teacher_config = config["teacher"]
n_loop = teacher_config["n_loop"]
n_layer = teacher_config["n_layer"]
filter_size = teacher_config["filter_size"]
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
print("context size is {} samples".format(context_size))
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
train_clip_seconds)
valid_batch_fn = DataCollector(
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
batch_size = data_config["batch_size"]
train_cargo = DataCargo(
ljspeech_train,
train_batch_fn,
batch_size,
sampler=RandomSampler(ljspeech_train))
# only batch=1 for validation is enabled
valid_cargo = DataCargo(
ljspeech_valid,
valid_batch_fn,
batch_size=1,
sampler=SequentialSampler(ljspeech_valid))
make_output_tree(args.output)
# conditioner(upsampling net)
conditioner_config = config["conditioner"]
upsampling_factors = conditioner_config["upsampling_factors"]
upsample_net = UpsampleNet(upscale_factors=upsampling_factors)
freeze(upsample_net)
residual_channels = teacher_config["residual_channels"]
loss_type = teacher_config["loss_type"]
output_dim = teacher_config["output_dim"]
log_scale_min = teacher_config["log_scale_min"]
assert loss_type == "mog" and output_dim == 3, \
"the teacher wavenet should be a wavenet with single gaussian output"
teacher = WaveNet(n_loop, n_layer, residual_channels, output_dim, n_mels,
filter_size, loss_type, log_scale_min)
freeze(teacher)
student_config = config["student"]
n_loops = student_config["n_loops"]
n_layers = student_config["n_layers"]
student_residual_channels = student_config["residual_channels"]
student_filter_size = student_config["filter_size"]
student_log_scale_min = student_config["log_scale_min"]
student = ParallelWaveNet(n_loops, n_layers, student_residual_channels,
n_mels, student_filter_size)
stft_config = config["stft"]
stft = STFT(
n_fft=stft_config["n_fft"],
hop_length=stft_config["hop_length"],
win_length=stft_config["win_length"])
lmd = config["loss"]["lmd"]
model = Clarinet(upsample_net, teacher, student, stft,
student_log_scale_min, lmd)
summary(model)
# optim
train_config = config["train"]
learning_rate = train_config["learning_rate"]
anneal_rate = train_config["anneal_rate"]
anneal_interval = train_config["anneal_interval"]
lr_scheduler = dg.ExponentialDecay(
learning_rate, anneal_interval, anneal_rate, staircase=True)
gradient_max_norm = train_config["gradient_max_norm"]
optim = fluid.optimizer.Adam(
lr_scheduler,
parameter_list=model.parameters(),
grad_clip=fluid.clip.ClipByGlobalNorm(gradient_max_norm))
# train
max_iterations = train_config["max_iterations"]
checkpoint_interval = train_config["checkpoint_interval"]
eval_interval = train_config["eval_interval"]
checkpoint_dir = os.path.join(args.output, "checkpoints")
state_dir = os.path.join(args.output, "states")
log_dir = os.path.join(args.output, "log")
writer = LogWriter(log_dir)
if args.checkpoint is not None:
iteration = io.load_parameters(
model, optim, checkpoint_path=args.checkpoint)
else:
iteration = io.load_parameters(
model,
optim,
checkpoint_dir=checkpoint_dir,
iteration=args.iteration)
if iteration == 0:
assert args.wavenet is not None, "When training afresh, a trained wavenet model should be provided."
load_wavenet(model, args.wavenet)
# loader
train_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
train_loader.set_batch_generator(train_cargo, place)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
valid_loader.set_batch_generator(valid_cargo, place)
# training loop
global_step = iteration + 1
iterator = iter(tqdm(train_loader))
while global_step <= max_iterations:
try:
batch = next(iterator)
except StopIteration as e:
iterator = iter(tqdm(train_loader))
batch = next(iterator)
audios, mels, audio_starts = batch
model.train()
loss_dict = model(
audios, mels, audio_starts, clip_kl=global_step > 500)
writer.add_scalar("learning_rate",
optim._learning_rate.step().numpy()[0], global_step)
for k, v in loss_dict.items():
writer.add_scalar("loss/{}".format(k), v.numpy()[0], global_step)
l = loss_dict["loss"]
step_loss = l.numpy()[0]
print("[train] global_step: {} loss: {:<8.6f}".format(global_step,
step_loss))
l.backward()
optim.minimize(l)
optim.clear_gradients()
if global_step % eval_interval == 0:
# evaluate on valid dataset
eval_model(model, valid_loader, state_dir, global_step,
sample_rate)
if global_step % checkpoint_interval == 0:
io.save_parameters(checkpoint_dir, global_step, model, optim)
global_step += 1

View File

@ -1,60 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import os
import soundfile as sf
from collections import OrderedDict
from paddle import fluid
import paddle.fluid.dygraph as dg
def make_output_tree(output_dir):
checkpoint_dir = os.path.join(output_dir, "checkpoints")
if not os.path.exists(checkpoint_dir):
os.makedirs(checkpoint_dir)
state_dir = os.path.join(output_dir, "states")
if not os.path.exists(state_dir):
os.makedirs(state_dir)
def eval_model(model, valid_loader, output_dir, iteration, sample_rate):
model.eval()
for i, batch in enumerate(valid_loader):
# print("sentence {}".format(i))
path = os.path.join(output_dir,
"sentence_{}_step_{}.wav".format(i, iteration))
audio_clips, mel_specs, audio_starts = batch
wav_var = model.synthesis(mel_specs)
wav_np = wav_var.numpy()[0]
sf.write(path, wav_np, samplerate=sample_rate)
print("generated {}".format(path))
def load_wavenet(model, path):
wavenet_dict, _ = dg.load_dygraph(path)
encoder_dict = OrderedDict()
teacher_dict = OrderedDict()
for k, v in wavenet_dict.items():
if k.startswith("encoder."):
encoder_dict[k.split('.', 1)[1]] = v
else:
# k starts with "decoder."
teacher_dict[k.split('.', 1)[1]] = v
model.encoder.set_dict(encoder_dict)
model.teacher.set_dict(teacher_dict)
print("loaded the encoder part and teacher part from wavenet model.")

View File

@ -1,144 +0,0 @@
# Deep Voice 3
PaddlePaddle dynamic graph implementation of Deep Voice 3, a convolutional network based text-to-speech generative model. The implementation is based on [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654).
We implement Deep Voice 3 using Paddle Fluid with dynamic graph, which is convenient for building flexible network architectures.
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Model Architecture
![Deep Voice 3 model architecture](./images/model_architecture.png)
The model consists of an encoder, a decoder and a converter (and a speaker embedding for multispeaker models). The encoder and the decoder together form the seq2seq part of the model, and the converter forms the postnet part.
## Project Structure
```text
├── config/
├── synthesize.py
├── data.py
├── preprocess.py
├── clip.py
├── train.py
└── vocoder.py
```
## Preprocess
Preprocess the dataset with `preprocess.py`.
```text
usage: preprocess.py [-h] --config CONFIG --input INPUT --output OUTPUT
preprocess ljspeech dataset and save it.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file
--input INPUT data path of the original data
--output OUTPUT path to save the preprocessed dataset
```
example code:
```bash
python preprocess.py --config=configs/ljspeech.yaml --input=LJSpeech-1.1/ --output=data/ljspeech
```
## Train
Train the model using `train.py`. Follow the usage displayed by `python train.py --help`.
```text
usage: train.py [-h] --config CONFIG --input INPUT
train a Deep Voice 3 model with LJSpeech
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file
--input INPUT data path of the original data
```
example code:
```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config=configs/ljspeech.yaml --input=data/ljspeech
```
Training creates a `runs` folder; outputs of each run are saved in a separate folder in `runs`, named by the start time joined with the hostname. Inside this folder, tensorboard logs, parameters and optimizer states are saved. Parameters (`*.pdparams`) and optimizer states (`*.pdopt`) are named by the step at which they are saved.
```text
runs/Jul07_09-39-34_instance-mqcyj27y-4/
├── checkpoint
├── events.out.tfevents.1594085974.instance-mqcyj27y-4
├── step-1000000.pdopt
├── step-1000000.pdparams
├── step-100000.pdopt
├── step-100000.pdparams
...
```
Since we use WaveFlow to synthesize audio while training, download the trained WaveFlow model and extract it into the current directory before training.
```bash
wget https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_1.0.zip
unzip waveflow_res128_ljspeech_ckpt_1.0.zip
```
## Visualization
You can visualize training losses, check the attention and listen to the synthesized audio when training with teacher forcing.
example code:
```bash
tensorboard --logdir=runs/ --host=$HOSTNAME --port=8000
```
## Synthesis
```text
usage: synthesize from a checkpoint [-h] --config CONFIG --input INPUT
--output OUTPUT --checkpoint CHECKPOINT
--monotonic_layers MONOTONIC_LAYERS
[--vocoder {griffin-lim,waveflow}]
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file
--input INPUT text file to synthesize
--output OUTPUT path to save audio
--checkpoint CHECKPOINT
data path of the checkpoint
--monotonic_layers MONOTONIC_LAYERS
monotonic decoder layers' indices(start from 1)
--vocoder {griffin-lim,waveflow}
vocoder to use
```
`synthesize.py` is used to synthesize several sentences in a text file.
`--monotonic_layers` are the indices of the decoder layers that manifest monotonic diagonal attention. You can find the monotonic layers by inspecting the tensorboard logs. Mind that the indices start from 1. The layers that manifest monotonic diagonal attention are stable for a model during training and synthesizing, but differ among different runs. So once you get the indices of the monotonic layers by inspecting the tensorboard log, you can use them for synthesizing. Note that only decoder layers that show strong diagonal attention should be considered.
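For reference, `synthesize.py` (shown later in this commit) turns these 1-based indices into per-layer boolean flags roughly as in the sketch below; the helper name `monotonic_flags` is only for illustration.
```python
def monotonic_flags(monotonic_layers_arg, decoder_layers):
    # "5,6" -> zero-based indices [4, 5]
    indices = [int(item.strip()) - 1 for item in monotonic_layers_arg.split(',')]
    flags = [False] * decoder_layers
    for i in indices:
        flags[i] = True
    return flags

# with the 8 decoder layers of the provided config and --monotonic_layers "5,6":
# monotonic_flags("5,6", 8) -> [False, False, False, False, True, True, False, False]
```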
`--vocoder` is the vocoder to use. Current supported values are "waveflow" and "griffin-lim". Default value is "waveflow".
example code:
```bash
CUDA_VISIBLE_DEVICES=2 python synthesize.py \
--config configs/ljspeech.yaml \
--input sentences.txt \
--output outputs/ \
--checkpoint runs/Jul07_09-39-34_instance-mqcyj27y-4/step-1320000 \
--monotonic_layers "5,6" \
--vocoder waveflow
```

View File

@ -1,84 +0,0 @@
from __future__ import print_function
import copy
import six
import warnings
import functools
from paddle.fluid import layers
from paddle.fluid import framework
from paddle.fluid import core
from paddle.fluid import name_scope
from paddle.fluid.dygraph import base as imperative_base
from paddle.fluid.clip import GradientClipBase, _correct_clip_op_role_var
class DoubleClip(GradientClipBase):
def __init__(self, clip_value, clip_norm, group_name="default_group", need_clip=None):
super(DoubleClip, self).__init__(need_clip)
self.clip_value = float(clip_value)
self.clip_norm = float(clip_norm)
self.group_name = group_name
def __str__(self):
return "Gradient Clip By Value and GlobalNorm, value={}, global_norm={}".format(
self.clip_value, self.clip_norm)
@imperative_base.no_grad
def _dygraph_clip(self, params_grads):
params_grads = self._dygraph_clip_by_value(params_grads)
params_grads = self._dygraph_clip_by_global_norm(params_grads)
return params_grads
@imperative_base.no_grad
def _dygraph_clip_by_value(self, params_grads):
params_and_grads = []
for p, g in params_grads:
if g is None:
continue
if self._need_clip_func is not None and not self._need_clip_func(p):
params_and_grads.append((p, g))
continue
new_grad = layers.clip(x=g, min=-self.clip_value, max=self.clip_value)
params_and_grads.append((p, new_grad))
return params_and_grads
@imperative_base.no_grad
def _dygraph_clip_by_global_norm(self, params_grads):
params_and_grads = []
sum_square_list = []
for p, g in params_grads:
if g is None:
continue
if self._need_clip_func is not None and not self._need_clip_func(p):
continue
merge_grad = g
if g.type == core.VarDesc.VarType.SELECTED_ROWS:
merge_grad = layers.merge_selected_rows(g)
merge_grad = layers.get_tensor_from_selected_rows(merge_grad)
square = layers.square(merge_grad)
sum_square = layers.reduce_sum(square)
sum_square_list.append(sum_square)
# all parameters have been filtered out
if len(sum_square_list) == 0:
return params_grads
global_norm_var = layers.concat(sum_square_list)
global_norm_var = layers.reduce_sum(global_norm_var)
global_norm_var = layers.sqrt(global_norm_var)
max_global_norm = layers.fill_constant(
shape=[1], dtype='float32', value=self.clip_norm)
clip_var = layers.elementwise_div(
x=max_global_norm,
y=layers.elementwise_max(
x=global_norm_var, y=max_global_norm))
for p, g in params_grads:
if g is None:
continue
if self._need_clip_func is not None and not self._need_clip_func(p):
params_and_grads.append((p, g))
continue
new_grad = layers.elementwise_mul(x=g, y=clip_var)
params_and_grads.append((p, new_grad))
return params_and_grads

View File

@ -1,46 +0,0 @@
# data processing
p_pronunciation: 0.99
sample_rate: 22050 # Hz
n_fft: 1024
win_length: 1024
hop_length: 256
n_mels: 80
reduction_factor: 4
# model-s2s
n_speakers: 1
speaker_dim: 16
char_dim: 256
encoder_dim: 64
kernel_size: 5
encoder_layers: 7
decoder_layers: 8
prenet_sizes: [128]
attention_dim: 128
# model-postnet
postnet_layers: 5
postnet_dim: 256
# position embedding
position_weight: 1.0
position_rate: 5.54
forward_step: 4
backward_step: 0
dropout: 0.05
# output-griffinlim
sharpening_factor: 1.4
# optimizer:
learning_rate: 0.001
clip_value: 5.0
clip_norm: 100.0
# training:
max_iteration: 1000000
batch_size: 16
report_interval: 10000
save_interval: 10000
valid_size: 5

View File

@ -1,108 +0,0 @@
import numpy as np
import os
import csv
import pandas as pd
import paddle
from paddle import fluid
from paddle.fluid import dygraph as dg
from paddle.fluid.dataloader import Dataset, BatchSampler
from paddle.fluid.io import DataLoader
from parakeet.data import DatasetMixin, DataCargo, PartialyRandomizedSimilarTimeLengthSampler
from parakeet.g2p import en
class LJSpeech(DatasetMixin):
def __init__(self, root):
self._root = root
self._table = pd.read_csv(
os.path.join(root, "metadata.csv"),
sep="|",
encoding="utf-8",
quoting=csv.QUOTE_NONE,
header=None,
names=["num_frames", "spec_name", "mel_name", "text"],
dtype={"num_frames": np.int64, "spec_name": str, "mel_name":str, "text":str})
def num_frames(self):
return self._table["num_frames"].to_list()
def get_example(self, i):
"""
spec (T_frame, C_spec)
mel (T_frame, C_mel)
"""
num_frames, spec_name, mel_name, text = self._table.iloc[i]
spec = np.load(os.path.join(self._root, spec_name))
mel = np.load(os.path.join(self._root, mel_name))
return (text, spec, mel, num_frames)
def __len__(self):
return len(self._table)
class DataCollector(object):
def __init__(self, p_pronunciation):
self.p_pronunciation = p_pronunciation
def __call__(self, examples):
"""
output shape and dtype
(B, T_text) int64
(B,) int64
(B, T_frame, C_spec) float32
(B, T_frame, C_mel) float32
(B,) int64
"""
text_seqs = []
specs = []
mels = []
num_frames = np.array([example[3] for example in examples], dtype=np.int64)
max_frames = np.max(num_frames)
for example in examples:
text, spec, mel, _ = example
text_seqs.append(en.text_to_sequence(text, self.p_pronunciation))
specs.append(np.pad(spec, [(0, max_frames - spec.shape[0]), (0, 0)], mode="constant"))
mels.append(np.pad(mel, [(0, max_frames - mel.shape[0]), (0, 0)], mode="constant"))
specs = np.stack(specs)
mels = np.stack(mels)
text_lengths = np.array([len(seq) for seq in text_seqs], dtype=np.int64)
max_length = np.max(text_lengths)
text_seqs = np.array([seq + [0] * (max_length - len(seq)) for seq in text_seqs], dtype=np.int64)
return text_seqs, text_lengths, specs, mels, num_frames
if __name__ == "__main__":
import argparse
import tqdm
import time
from ruamel import yaml
parser = argparse.ArgumentParser(description="load the preprocessed ljspeech dataset")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
print("========= Command Line Arguments ========")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
print("=========== Configurations ==============")
for k in ["p_pronunciation", "batch_size"]:
print("{}: {}".format(k, config[k]))
ljspeech = LJSpeech(args.input)
collate_fn = DataCollector(config["p_pronunciation"])
dg.enable_dygraph(fluid.CPUPlace())
sampler = PartialyRandomizedSimilarTimeLengthSampler(ljspeech.num_frames())
cargo = DataCargo(ljspeech, collate_fn,
batch_size=config["batch_size"], sampler=sampler)
loader = DataLoader\
.from_generator(capacity=5, return_list=True)\
.set_batch_generator(cargo)
for i, batch in tqdm.tqdm(enumerate(loader)):
continue

Binary file not shown.


View File

@ -1,122 +0,0 @@
from __future__ import division
import os
import argparse
from ruamel import yaml
import tqdm
from os.path import join
import csv
import numpy as np
import pandas as pd
import librosa
import logging
from parakeet.data import DatasetMixin
class LJSpeechMetaData(DatasetMixin):
def __init__(self, root):
self.root = root
self._wav_dir = join(root, "wavs")
csv_path = join(root, "metadata.csv")
self._table = pd.read_csv(
csv_path,
sep="|",
encoding="utf-8",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
def get_example(self, i):
fname, raw_text, normalized_text = self._table.iloc[i]
abs_fname = join(self._wav_dir, fname + ".wav")
return fname, abs_fname, raw_text, normalized_text
def __len__(self):
return len(self._table)
class Transform(object):
def __init__(self, sample_rate, n_fft, hop_length, win_length, n_mels, reduction_factor):
self.sample_rate = sample_rate
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.n_mels = n_mels
self.reduction_factor = reduction_factor
def __call__(self, fname):
# wave processing
audio, _ = librosa.load(fname, sr=self.sample_rate)
# Pad the data to the right size to have a whole number of timesteps,
# accounting properly for the model reduction factor.
frames = audio.size // (self.reduction_factor * self.hop_length) + 1
# librosa's stft extracts frames of n_fft size, so we should pad n_fft // 2 on both sides
desired_length = (frames * self.reduction_factor - 1) * self.hop_length + self.n_fft
pad_amount = (desired_length - audio.size) // 2
# we pad manually to control the number of generated frames
if audio.size % 2 == 0:
audio = np.pad(audio, (pad_amount, pad_amount), mode='reflect')
else:
audio = np.pad(audio, (pad_amount, pad_amount + 1), mode='reflect')
# STFT
D = librosa.stft(audio, self.n_fft, self.hop_length, self.win_length, center=False)
S = np.abs(D)
S_mel = librosa.feature.melspectrogram(sr=self.sample_rate, S=S, n_mels=self.n_mels, fmax=8000.0)
# log magnitude
log_spectrogram = np.log(np.clip(S, a_min=1e-5, a_max=None))
log_mel_spectrogram = np.log(np.clip(S_mel, a_min=1e-5, a_max=None))
num_frames = log_spectrogram.shape[-1]
assert num_frames % self.reduction_factor == 0, "num_frames is wrong"
return (log_spectrogram.T, log_mel_spectrogram.T, num_frames)
def save(output_path, dataset, transform):
if not os.path.exists(output_path):
os.makedirs(output_path)
records = []
for example in tqdm.tqdm(dataset):
fname, abs_fname, _, normalized_text = example
log_spec, log_mel_spec, num_frames = transform(abs_fname)
records.append((num_frames,
fname + "_spec.npy",
fname + "_mel.npy",
normalized_text))
np.save(join(output_path, fname + "_spec"), log_spec)
np.save(join(output_path, fname + "_mel"), log_mel_spec)
meta_data = pd.DataFrame.from_records(records)
meta_data.to_csv(join(output_path, "metadata.csv"),
quoting=csv.QUOTE_NONE, sep="|", encoding="utf-8",
header=False, index=False)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="preprocess ljspeech dataset and save it.")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
parser.add_argument("--output", type=str, required=True, help="path to save the preprocessed dataset")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
print("========= Command Line Arguments ========")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
print("=========== Configurations ==============")
for k in ["sample_rate", "n_fft", "win_length",
"hop_length", "n_mels", "reduction_factor"]:
print("{}: {}".format(k, config[k]))
ljspeech_meta = LJSpeechMetaData(args.input)
transform = Transform(config["sample_rate"],
config["n_fft"],
config["hop_length"],
config["win_length"],
config["n_mels"],
config["reduction_factor"])
save(args.output, ljspeech_meta, transform)

View File

@ -1,101 +0,0 @@
import numpy as np
from matplotlib import cm
import librosa
import os
import time
import tqdm
import argparse
from ruamel import yaml
import paddle
from paddle import fluid
from paddle.fluid import layers as F
from paddle.fluid import dygraph as dg
from paddle.fluid.io import DataLoader
import soundfile as sf
from parakeet.data import SliceDataset, DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler
from parakeet.utils.io import save_parameters, load_parameters, add_yaml_config_to_args
from parakeet.g2p import en
from parakeet.models.deepvoice3.weight_norm_hook import remove_weight_norm
from vocoder import WaveflowVocoder, GriffinLimVocoder
from train import create_model
def main(args, config):
model = create_model(config)
loaded_step = load_parameters(model, checkpoint_path=args.checkpoint)
for name, layer in model.named_sublayers():
try:
remove_weight_norm(layer)
except ValueError:
# this layer has no weight norm hook
pass
model.eval()
if args.vocoder == "waveflow":
vocoder = WaveflowVocoder()
vocoder.model.eval()
elif args.vocoder == "griffin-lim":
vocoder = GriffinLimVocoder(
sharpening_factor=config["sharpening_factor"],
sample_rate=config["sample_rate"],
n_fft=config["n_fft"],
win_length=config["win_length"],
hop_length=config["hop_length"])
else:
raise ValueError("Other vocoders are not supported.")
if not os.path.exists(args.output):
os.makedirs(args.output)
monotonic_layers = [int(item.strip()) - 1 for item in args.monotonic_layers.split(',')]
with open(args.input, 'rt') as f:
sentences = [line.strip() for line in f.readlines()]
for i, sentence in enumerate(sentences):
wav = synthesize(args, config, model, vocoder, sentence, monotonic_layers)
sf.write(os.path.join(args.output, "sentence{}.wav".format(i)),
wav, samplerate=config["sample_rate"])
def synthesize(args, config, model, vocoder, sentence, monotonic_layers):
print("[synthesize] {}".format(sentence))
text = en.text_to_sequence(sentence, p=1.0)
text = np.expand_dims(np.array(text, dtype="int64"), 0)
lengths = np.array([text.size], dtype=np.int64)
text_seqs = dg.to_variable(text)
text_lengths = dg.to_variable(lengths)
decoder_layers = config["decoder_layers"]
force_monotonic_attention = [False] * decoder_layers
for i in monotonic_layers:
force_monotonic_attention[i] = True
with dg.no_grad():
outputs = model(text_seqs, text_lengths, speakers=None,
force_monotonic_attention=force_monotonic_attention,
window=(config["backward_step"], config["forward_step"]))
decoded, refined, attentions = outputs
if args.vocoder == "griffin-lim":
wav_np = vocoder(refined.numpy()[0].T)
else:
wav = vocoder(F.transpose(refined, (0, 2, 1)))
wav_np = wav.numpy()[0]
return wav_np
if __name__ == "__main__":
import argparse
from ruamel import yaml
parser = argparse.ArgumentParser("synthesize from a checkpoint")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="text file to synthesize")
parser.add_argument("--output", type=str, required=True, help="path to save audio")
parser.add_argument("--checkpoint", type=str, required=True, help="data path of the checkpoint")
parser.add_argument("--monotonic_layers", type=str, required=True, help="monotonic decoder layers' indices(start from 1)")
parser.add_argument("--vocoder", type=str, default="waveflow", choices=['griffin-lim', 'waveflow'], help="vocoder to use")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
dg.enable_dygraph(fluid.CUDAPlace(0))
main(args, config)

View File

@ -1,187 +0,0 @@
import numpy as np
from matplotlib import cm
import librosa
import os
import time
import tqdm
import paddle
from paddle import fluid
from paddle.fluid import layers as F
from paddle.fluid import initializer as I
from paddle.fluid import dygraph as dg
from paddle.fluid.io import DataLoader
from visualdl import LogWriter
from parakeet.models.deepvoice3 import Encoder, Decoder, PostNet, SpectraNet
from parakeet.data import SliceDataset, DataCargo, SequentialSampler, RandomSampler
from parakeet.utils.io import save_parameters, load_parameters
from parakeet.g2p import en
from data import LJSpeech, DataCollector
from vocoder import WaveflowVocoder, GriffinLimVocoder
from clip import DoubleClip
def create_model(config):
char_embedding = dg.Embedding((en.n_vocab, config["char_dim"]), param_attr=I.Normal(scale=0.1))
multi_speaker = config["n_speakers"] > 1
speaker_embedding = dg.Embedding((config["n_speakers"], config["speaker_dim"]), param_attr=I.Normal(scale=0.1)) \
if multi_speaker else None
encoder = Encoder(config["encoder_layers"], config["char_dim"],
config["encoder_dim"], config["kernel_size"],
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
keep_prob=1.0 - config["dropout"])
decoder = Decoder(config["n_mels"], config["reduction_factor"],
list(config["prenet_sizes"]) + [config["char_dim"]],
config["decoder_layers"], config["kernel_size"],
config["attention_dim"],
position_encoding_weight=config["position_weight"],
omega=config["position_rate"],
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
keep_prob=1.0 - config["dropout"])
postnet = PostNet(config["postnet_layers"], config["char_dim"],
config["postnet_dim"], config["kernel_size"],
config["n_mels"], config["reduction_factor"],
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
keep_prob=1.0 - config["dropout"])
spectranet = SpectraNet(char_embedding, speaker_embedding, encoder, decoder, postnet)
return spectranet
def create_data(config, data_path):
dataset = LJSpeech(data_path)
train_dataset = SliceDataset(dataset, config["valid_size"], len(dataset))
train_collator = DataCollector(config["p_pronunciation"])
train_sampler = RandomSampler(train_dataset)
train_cargo = DataCargo(train_dataset, train_collator,
batch_size=config["batch_size"], sampler=train_sampler)
train_loader = DataLoader\
.from_generator(capacity=10, return_list=True)\
.set_batch_generator(train_cargo)
valid_dataset = SliceDataset(dataset, 0, config["valid_size"])
valid_collector = DataCollector(1.)
valid_sampler = SequentialSampler(valid_dataset)
valid_cargo = DataCargo(valid_dataset, valid_collector,
batch_size=1, sampler=valid_sampler)
valid_loader = DataLoader\
.from_generator(capacity=2, return_list=True)\
.set_batch_generator(valid_cargo)
return train_loader, valid_loader
def create_optimizer(model, config):
optim = fluid.optimizer.Adam(config["learning_rate"],
parameter_list=model.parameters(),
grad_clip=DoubleClip(config["clip_value"], config["clip_norm"]))
return optim
def train(args, config):
model = create_model(config)
train_loader, valid_loader = create_data(config, args.input)
optim = create_optimizer(model, config)
global global_step
max_iteration = config["max_iteration"]
iterator = iter(tqdm.tqdm(train_loader))
while global_step <= max_iteration:
# get inputs
try:
batch = next(iterator)
except StopIteration:
iterator = iter(tqdm.tqdm(train_loader))
batch = next(iterator)
# unzip it
text_seqs, text_lengths, specs, mels, num_frames = batch
# forward & backward
model.train()
outputs = model(text_seqs, text_lengths, speakers=None, mel=mels)
decoded, refined, attentions, final_state = outputs
causal_mel_loss = model.spec_loss(decoded, mels, num_frames)
non_causal_mel_loss = model.spec_loss(refined, mels, num_frames)
loss = causal_mel_loss + non_causal_mel_loss
loss.backward()
# update
optim.minimize(loss)
# logging
tqdm.tqdm.write("[train] step: {}\tloss: {:.6f}\tcausal:{:.6f}\tnon_causal:{:.6f}".format(
global_step,
loss.numpy()[0],
causal_mel_loss.numpy()[0],
non_causal_mel_loss.numpy()[0]))
writer.add_scalar("loss/causal_mel_loss", causal_mel_loss.numpy()[0], step=global_step)
writer.add_scalar("loss/non_causal_mel_loss", non_causal_mel_loss.numpy()[0], step=global_step)
writer.add_scalar("loss/loss", loss.numpy()[0], step=global_step)
if global_step % config["report_interval"] == 0:
text_length = int(text_lengths.numpy()[0])
num_frame = int(num_frames.numpy()[0])
tag = "train_mel/ground-truth"
img = cm.viridis(normalize(mels.numpy()[0, :num_frame].T))
writer.add_image(tag, img, step=global_step)
tag = "train_mel/decoded"
img = cm.viridis(normalize(decoded.numpy()[0, :num_frame].T))
writer.add_image(tag, img, step=global_step)
tag = "train_mel/refined"
img = cm.viridis(normalize(refined.numpy()[0, :num_frame].T))
writer.add_image(tag, img, step=global_step)
vocoder = WaveflowVocoder()
vocoder.model.eval()
tag = "train_audio/ground-truth-waveflow"
wav = vocoder(F.transpose(mels[0:1, :num_frame, :], (0, 2, 1)))
writer.add_audio(tag, wav.numpy()[0], step=global_step, sample_rate=22050)
tag = "train_audio/decoded-waveflow"
wav = vocoder(F.transpose(decoded[0:1, :num_frame, :], (0, 2, 1)))
writer.add_audio(tag, wav.numpy()[0], step=global_step, sample_rate=22050)
tag = "train_audio/refined-waveflow"
wav = vocoder(F.transpose(refined[0:1, :num_frame, :], (0, 2, 1)))
writer.add_audio(tag, wav.numpy()[0], step=global_step, sample_rate=22050)
attentions_np = attentions.numpy()
attentions_np = attentions_np[:, 0, :num_frame // 4 , :text_length]
for i, attention_layer in enumerate(np.rot90(attentions_np, axes=(1,2))):
tag = "train_attention/layer_{}".format(i)
img = cm.viridis(normalize(attention_layer))
writer.add_image(tag, img, step=global_step, dataformats="HWC")
if global_step % config["save_interval"] == 0:
save_parameters(writer.logdir, global_step, model, optim)
# global step +1
global_step += 1
def normalize(arr):
return (arr - arr.min()) / (arr.max() - arr.min())
if __name__ == "__main__":
import argparse
from ruamel import yaml
parser = argparse.ArgumentParser(description="train a Deep Voice 3 model with LJSpeech")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
dg.enable_dygraph(fluid.CUDAPlace(0))
global global_step
global_step = 1
global writer
writer = LogWriter()
print("[Training] tensorboard log and checkpoints are save in {}".format(
writer.logdir))
train(args, config)

View File

@ -1,51 +0,0 @@
import argparse
from ruamel import yaml
import numpy as np
import librosa
import paddle
from paddle import fluid
from paddle.fluid import layers as F
from paddle.fluid import dygraph as dg
from parakeet.utils.io import load_parameters
from parakeet.models.waveflow.waveflow_modules import WaveFlowModule
class WaveflowVocoder(object):
def __init__(self):
config_path = "waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml"
with open(config_path, 'rt') as f:
config = yaml.safe_load(f)
ns = argparse.Namespace()
for k, v in config.items():
setattr(ns, k, v)
ns.use_fp16 = False
self.model = WaveFlowModule(ns)
checkpoint_path = "waveflow_res128_ljspeech_ckpt_1.0/step-2000000"
load_parameters(self.model, checkpoint_path=checkpoint_path)
def __call__(self, mel):
with dg.no_grad():
self.model.eval()
audio = self.model.synthesize(mel)
self.model.train()
return audio
class GriffinLimVocoder(object):
def __init__(self, sharpening_factor=1.4, sample_rate=22050, n_fft=1024,
win_length=1024, hop_length=256):
self.sample_rate = sample_rate
self.n_fft = n_fft
self.sharpening_factor = sharpening_factor
self.win_length = win_length
self.hop_length = hop_length
def __call__(self, mel):
spec = librosa.feature.inverse.mel_to_stft(
np.exp(mel),
sr=self.sample_rate,
n_fft=self.n_fft,
fmin=0, fmax=8000.0, power=1.0)
audio = librosa.core.griffinlim(spec ** self.sharpening_factor,
win_length=self.win_length, hop_length=self.hop_length)
return audio

View File

@ -1,144 +0,0 @@
# FastSpeech
PaddlePaddle dynamic graph implementation of FastSpeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Model Architecture
![FastSpeech model architecture](./images/model_architecture.png)
FastSpeech is a feed-forward structure based on Transformer, instead of using the encoder-attention-decoder architecture. This model extracts attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation. We use TransformerTTS as the teacher model.
The model consists of three parts: an encoder, a decoder and a length regulator.
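The length regulation step can be illustrated with a small sketch (an illustration of the idea only, not the repository's implementation): each phoneme's encoder output is repeated according to its duration.
```python
import numpy as np

def length_regulate(phoneme_hidden, durations):
    # phoneme_hidden: (n_phonemes, hidden_size) encoder outputs
    # durations:      (n_phonemes,) number of mel frames per phoneme
    # returns:        (sum(durations), hidden_size) frame-aligned sequence
    return np.repeat(phoneme_hidden, durations, axis=0)

# e.g. 3 phonemes with durations [2, 1, 3] expand to 6 frames
# (384 is the hidden_size in the provided config)
frames = length_regulate(np.random.randn(3, 384), np.array([2, 1, 3]))
assert frames.shape == (6, 384)
```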
## Project Structure
```text
├── config           # yaml configuration files
├── synthesis.py     # script to synthesize waveform from text
└── train.py         # script for model training
```
## Saving & Loading
`train.py` and `synthesis.py` have 3 arguments in common: `--checkpoint`, `--iteration` and `--output`.
1. `--output` is the directory for saving results.
During training, checkpoints are saved in `${output}/checkpoints` and tensorboard logs are saved in `${output}/log`.
During synthesis, results are saved in `${output}/samples` and the tensorboard log is saved in `${output}/log`.
2. `--checkpoint` is the path of a checkpoint and `--iteration` is the target step. They are used to load checkpoints in the following way.
- If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
- If `--checkpoint` is not provided, we try to load the checkpoint of the target step specified by `--iteration` from the `${output}/checkpoints/` directory, e.g. if given `--iteration 120000`, the checkpoint `${output}/checkpoints/step-120000.*` will be loaded.
- If neither `--checkpoint` nor `--iteration` is provided, we try to load the latest checkpoint from the `${output}/checkpoints/` directory (a sketch of this lookup order follows).
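As an illustration of this lookup order, a hypothetical sketch (the actual loading is done inside the scripts via `parakeet.utils.io.load_parameters`; the assumption that the `checkpoint` record file stores the latest checkpoint name follows the ClariNet README above):
```python
import os

def resolve_checkpoint(output, checkpoint=None, iteration=None):
    if checkpoint is not None:
        return checkpoint  # rule 1: an explicit path wins
    ckpt_dir = os.path.join(output, "checkpoints")
    if iteration is not None:
        # rule 2: checkpoints are named step-<iteration>, e.g. step-120000
        return os.path.join(ckpt_dir, "step-{}".format(iteration))
    # rule 3: fall back to the latest checkpoint recorded by the `checkpoint` file
    # (assumed to store the checkpoint name)
    with open(os.path.join(ckpt_dir, "checkpoint")) as f:
        return os.path.join(ckpt_dir, f.read().strip())
```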
## Compute Phoneme Duration
A ground truth duration of each phoneme (number of frames in the spectrogram that correspond to that phoneme) should be provided when training a FastSpeech model.
We compute the ground truth duration of each phoneme in the following way:
We extract the encoder-decoder attention alignment from a trained Transformer TTS model;
Each frame is considered to correspond to the phoneme that receives the most attention;
You can run `alignments/get_alignments.py` to get the alignments.
```bash
cd alignments
python get_alignments.py \
--use_gpu=1 \
--output='./alignments' \
--data=${DATAPATH} \
--config=${CONFIG} \
--checkpoint_transformer=${CHECKPOINT} \
```
where `${DATAPATH}` is the path where the LJSpeech data is saved, `${CHECKPOINT}` is the path of the pre-trained TransformerTTS model, and `${CONFIG}` is the config yaml file of the TransformerTTS checkpoint. It is necessary for you to prepare a pre-trained TransformerTTS checkpoint.
For more help on arguments, run
``python get_alignments.py --help``.
Or you can use your own phoneme durations; you just need to process the data into the following format:
```python
{'fname1': alignment1,
'fname2': alignment2,
...}
```
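For illustration, the frame-to-phoneme counting rule described above can be sketched as follows (a hypothetical helper, not the repository's `get_alignment`; it assumes `attn` is an `(n_frames, n_phonemes)` attention matrix from the teacher model):
```python
import numpy as np

def durations_from_attention(attn):
    # attn: (n_frames, n_phonemes) encoder-decoder attention weights
    n_frames, n_phonemes = attn.shape
    # each frame belongs to the phoneme it attends to most
    assignment = attn.argmax(axis=1)
    # a phoneme's duration is the number of frames assigned to it
    durations = np.bincount(assignment, minlength=n_phonemes)
    return durations  # shape (n_phonemes,), sums to n_frames

# alignments = {'fname1': durations_from_attention(attn1), ...}
```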
## Train FastSpeech
FastSpeech model can be trained by running ``train.py``.
```bash
python train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--output=${OUTPUTPATH} \
--config='configs/ljspeech.yaml' \
```
Or you can run the script file directly.
```bash
sh train.sh
```
If you want to train on multiple GPUs, start training in the following way.
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--output=${OUTPUTPATH} \
--config='configs/ljspeech.yaml' \
```
If you wish to resume from an existing model, see [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
For more help on arguments, run
``python train.py --help``.
## Synthesis
After training the FastSpeech, audio can be synthesized by running ``synthesis.py``.
```bash
python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint=${CHECKPOINTPATH} \
--config='configs/ljspeech.yaml' \
--output=${OUTPUTPATH} \
--vocoder='griffin-lim' \
```
We currently support two vocoders, the Griffin-Lim algorithm and WaveFlow. You can set ``--vocoder`` to use one of them. If you want to use WaveFlow as your vocoder, you need to set ``--config_vocoder`` and ``--checkpoint_vocoder``, which are the paths of the config and the checkpoint of the vocoder. You can download the pre-trained model of WaveFlow from [here](https://github.com/PaddlePaddle/Parakeet#vocoders).
Or you can run the script file directly.
```bash
sh synthesis.sh
```
For more help on arguments, run
``python synthesis.py --help``.
Then you can find the synthesized audio files in ``${OUTPUTPATH}/samples``.

View File

@ -1,132 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from scipy.io.wavfile import write
from parakeet.g2p.en import text_to_sequence
import numpy as np
import pandas as pd
import csv
from tqdm import tqdm
from ruamel import yaml
import pickle
from pathlib import Path
import argparse
from pprint import pprint
from collections import OrderedDict
import paddle.fluid as fluid
import paddle.fluid.dygraph as dg
from parakeet.models.transformer_tts.utils import *
from parakeet.models.transformer_tts import TransformerTTS
from parakeet.models.fastspeech.utils import get_alignment
from parakeet.utils import io
def add_config_options_to_parser(parser):
parser.add_argument("--config", type=str, help="path of the config file")
parser.add_argument("--use_gpu", type=int, default=0, help="device to use")
parser.add_argument("--data", type=str, help="path of LJspeech dataset")
parser.add_argument(
"--checkpoint_transformer",
type=str,
help="transformer_tts checkpoint to synthesis")
parser.add_argument(
"--output",
type=str,
default="./alignments",
help="path to save experiment results")
def alignments(args):
local_rank = dg.parallel.Env().local_rank
place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace())
with open(args.config) as f:
cfg = yaml.load(f, Loader=yaml.Loader)
with dg.guard(place):
network_cfg = cfg['network']
model = TransformerTTS(
network_cfg['embedding_size'], network_cfg['hidden_size'],
network_cfg['encoder_num_head'], network_cfg['encoder_n_layers'],
cfg['audio']['num_mels'], network_cfg['outputs_per_step'],
network_cfg['decoder_num_head'], network_cfg['decoder_n_layers'])
# Load parameters.
global_step = io.load_parameters(
model=model, checkpoint_path=args.checkpoint_transformer)
model.eval()
# get text data
root = Path(args.data)
csv_path = root.joinpath("metadata.csv")
table = pd.read_csv(
csv_path,
sep="|",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
pbar = tqdm(range(len(table)))
alignments = OrderedDict()
for i in pbar:
fname, raw_text, normalized_text = table.iloc[i]
# init input
text = np.asarray(text_to_sequence(normalized_text))
text = fluid.layers.unsqueeze(dg.to_variable(text), [0])
pos_text = np.arange(1, text.shape[1] + 1)
pos_text = fluid.layers.unsqueeze(dg.to_variable(pos_text), [0])
# load
wav, _ = librosa.load(
str(os.path.join(args.data, 'wavs', fname + ".wav")))
spec = librosa.stft(
y=wav,
n_fft=cfg['audio']['n_fft'],
win_length=cfg['audio']['win_length'],
hop_length=cfg['audio']['hop_length'])
mag = np.abs(spec)
mel = librosa.filters.mel(sr=cfg['audio']['sr'],
n_fft=cfg['audio']['n_fft'],
n_mels=cfg['audio']['num_mels'],
fmin=cfg['audio']['fmin'],
fmax=cfg['audio']['fmax'])
mel = np.matmul(mel, mag)
mel = np.log(np.maximum(mel, 1e-5))
mel_input = np.transpose(mel, axes=(1, 0))
mel_input = fluid.layers.unsqueeze(dg.to_variable(mel_input), [0])
mel_lens = mel_input.shape[1]
pos_mel = np.arange(1, mel_input.shape[1] + 1)
pos_mel = fluid.layers.unsqueeze(dg.to_variable(pos_mel), [0])
mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
text, mel_input, pos_text, pos_mel)
mel_input = fluid.layers.concat(
[mel_input, postnet_pred[:, -1:, :]], axis=1)
alignment, _ = get_alignment(attn_probs, mel_lens,
network_cfg['decoder_num_head'])
alignments[fname] = alignment
with open(args.output + '.pkl', "wb") as f:
pickle.dump(alignments, f)
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description="Get alignments from TransformerTTS model")
add_config_options_to_parser(parser)
args = parser.parse_args()
alignments(args)

View File

@@ -1,14 +0,0 @@
CUDA_VISIBLE_DEVICES=0 \
python -u get_alignments.py \
--use_gpu=1 \
--output='./alignments' \
--data='../../../dataset/LJSpeech-1.1' \
--config='../../transformer_tts/configs/ljspeech.yaml' \
--checkpoint_transformer='../../transformer_tts/checkpoint/transformer/step-120000'
if [ $? -ne 0 ]; then
echo "Failed in training!"
exit 1
fi
exit 0

View File

@@ -1,36 +0,0 @@
audio:
num_mels: 80 #the number of mel bands when calculating mel spectrograms.
n_fft: 1024 #the number of fft components.
sr: 22050 #the sampling rate of audio data file.
hop_length: 256 #the number of samples to advance between frames.
win_length: 1024 #the length (width) of the window function.
preemphasis: 0.97
power: 1.2 #the power to raise before griffin-lim.
fmin: 0
fmax: 8000
network:
encoder_n_layer: 6 #the number of FFT Block in encoder.
encoder_head: 2 #the attention head number in encoder.
encoder_conv1d_filter_size: 1536 #the filter size of conv1d in encoder.
max_seq_len: 2048 #the max length of sequence.
decoder_n_layer: 6 #the number of FFT Block in decoder.
decoder_head: 2 #the attention head number in decoder.
decoder_conv1d_filter_size: 1536 #the filter size of conv1d in decoder.
hidden_size: 384 #the hidden size in model of fastspeech.
duration_predictor_output_size: 256 #the output size of the duration predictor.
duration_predictor_filter_size: 3 #the filter size of conv1d in duration prediction.
fft_conv1d_filter: 3 #the filter size of conv1d in fft.
fft_conv1d_padding: 1 #the padding size of conv1d in fft.
dropout: 0.1 #the dropout in network.
outputs_per_step: 1
train:
batch_size: 32
learning_rate: 0.001
warm_up_step: 4000 #the warm up step of learning rate.
grad_clip_thresh: 0.1 #the threshold of grad clip.
checkpoint_interval: 1000
max_iteration: 500000

View File

@@ -1,186 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
import numpy as np
import pandas as pd
import librosa
import csv
import pickle
from paddle import fluid
from parakeet import g2p
from parakeet import audio
from parakeet.data.sampler import *
from parakeet.data.datacargo import DataCargo
from parakeet.data.batch import TextIDBatcher, SpecBatcher
from parakeet.data.dataset import DatasetMixin, TransformDataset, CacheDataset, SliceDataset
from parakeet.models.transformer_tts.utils import *
class LJSpeechLoader:
def __init__(self,
config,
place,
data_path,
alignments_path,
batch_size,
nranks,
rank,
is_vocoder=False,
shuffle=True):
LJSPEECH_ROOT = Path(data_path)
metadata = LJSpeechMetaData(LJSPEECH_ROOT, alignments_path)
transformer = LJSpeech(config)
dataset = TransformDataset(metadata, transformer)
dataset = CacheDataset(dataset)
sampler = DistributedSampler(
len(dataset), nranks, rank, shuffle=shuffle)
assert batch_size % nranks == 0
each_bs = batch_size // nranks
dataloader = DataCargo(
dataset,
sampler=sampler,
batch_size=each_bs,
shuffle=shuffle,
batch_fn=batch_examples,
drop_last=True)
self.reader = fluid.io.DataLoader.from_generator(
capacity=32,
iterable=True,
use_double_buffer=True,
return_list=True)
self.reader.set_batch_generator(dataloader, place)
class LJSpeechMetaData(DatasetMixin):
def __init__(self, root, alignments_path):
self.root = Path(root)
self._wav_dir = self.root.joinpath("wavs")
csv_path = self.root.joinpath("metadata.csv")
self._table = pd.read_csv(
csv_path,
sep="|",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
with open(alignments_path, "rb") as f:
self._alignments = pickle.load(f)
def get_example(self, i):
fname, raw_text, normalized_text = self._table.iloc[i]
alignment = self._alignments[fname]
fname = str(self._wav_dir.joinpath(fname + ".wav"))
return fname, normalized_text, alignment
def __len__(self):
return len(self._table)
class LJSpeech(object):
def __init__(self, cfg):
super(LJSpeech, self).__init__()
self.sr = cfg['sr']
self.n_fft = cfg['n_fft']
self.num_mels = cfg['num_mels']
self.win_length = cfg['win_length']
self.hop_length = cfg['hop_length']
self.preemphasis = cfg['preemphasis']
self.fmin = cfg['fmin']
self.fmax = cfg['fmax']
def __call__(self, metadatum):
"""All the code for generating an Example from a metadatum. If you want a
different preprocessing pipeline, you can override this method.
This method may require several processor, each of which has a lot of options.
In this case, you'd better pass a composed transform and pass it to the init
method.
"""
fname, normalized_text, alignment = metadatum
wav, _ = librosa.load(str(fname))
spec = librosa.stft(
y=wav,
n_fft=self.n_fft,
win_length=self.win_length,
hop_length=self.hop_length)
mag = np.abs(spec)
mel = librosa.filters.mel(self.sr,
self.n_fft,
n_mels=self.num_mels,
fmin=self.fmin,
fmax=self.fmax)
mel = np.matmul(mel, mag)
mel = np.log(np.maximum(mel, 1e-5))
phonemes = np.array(
g2p.en.text_to_sequence(normalized_text), dtype=np.int64)
return (mel, phonemes, alignment
) # maybe we need to implement it as a map in the future
def batch_examples(batch):
texts = []
mels = []
text_lens = []
pos_texts = []
pos_mels = []
alignments = []
for data in batch:
mel, text, alignment = data
text_lens.append(len(text))
pos_texts.append(np.arange(1, len(text) + 1))
pos_mels.append(np.arange(1, mel.shape[1] + 1))
mels.append(mel)
texts.append(text)
alignments.append(alignment)
# Sort by text_len in descending order
texts = [
i
for i, _ in sorted(
zip(texts, text_lens), key=lambda x: x[1], reverse=True)
]
mels = [
i
for i, _ in sorted(
zip(mels, text_lens), key=lambda x: x[1], reverse=True)
]
pos_texts = [
i
for i, _ in sorted(
zip(pos_texts, text_lens), key=lambda x: x[1], reverse=True)
]
pos_mels = [
i
for i, _ in sorted(
zip(pos_mels, text_lens), key=lambda x: x[1], reverse=True)
]
alignments = [
i
for i, _ in sorted(
zip(alignments, text_lens), key=lambda x: x[1], reverse=True)
]
#text_lens = sorted(text_lens, reverse=True)
# Pad sequence with largest len of the batch
texts = TextIDBatcher(pad_id=0)(texts) #(B, T)
pos_texts = TextIDBatcher(pad_id=0)(pos_texts) #(B,T)
pos_mels = TextIDBatcher(pad_id=0)(pos_mels) #(B,T)
alignments = TextIDBatcher(pad_id=0)(alignments).astype(np.float32)
mels = np.transpose(
SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1)) #(B,T,num_mels)
return (texts, mels, pos_texts, pos_mels, alignments)

Binary file not shown (image removed, 513 KiB).

View File

@@ -1,170 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from visualdl import LogWriter
from scipy.io.wavfile import write
from collections import OrderedDict
import argparse
from pprint import pprint
from ruamel import yaml
from matplotlib import cm
import numpy as np
import librosa
import paddle.fluid as fluid
import paddle.fluid.dygraph as dg
from parakeet.g2p.en import text_to_sequence
from parakeet import audio
from parakeet.models.fastspeech.fastspeech import FastSpeech
from parakeet.models.transformer_tts.utils import *
from parakeet.models.wavenet import WaveNet, UpsampleNet
from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet
from parakeet.modules import weight_norm
from parakeet.models.waveflow import WaveFlowModule
from parakeet.utils.layer_tools import freeze
from parakeet.utils import io
def add_config_options_to_parser(parser):
parser.add_argument("--config", type=str, help="path of the config file")
parser.add_argument(
"--vocoder",
type=str,
default="griffin-lim",
choices=['griffin-lim', 'waveflow'],
help="vocoder method")
parser.add_argument(
"--config_vocoder", type=str, help="path of the vocoder config file")
parser.add_argument("--use_gpu", type=int, default=0, help="device to use")
parser.add_argument(
"--alpha",
type=float,
default=1,
help="determine the length of the expanded sequence mel, controlling the voice speed."
)
parser.add_argument(
"--checkpoint", type=str, help="fastspeech checkpoint for synthesis")
parser.add_argument(
"--checkpoint_vocoder",
type=str,
help="vocoder checkpoint for synthesis")
parser.add_argument(
"--output",
type=str,
default="synthesis",
help="path to save experiment results")
def synthesis(text_input, args):
local_rank = dg.parallel.Env().local_rank
place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace())
fluid.enable_dygraph(place)
with open(args.config) as f:
cfg = yaml.load(f, Loader=yaml.Loader)
    # VisualDL log writer
if not os.path.exists(args.output):
os.mkdir(args.output)
writer = LogWriter(os.path.join(args.output, 'log'))
model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels'])
# Load parameters.
global_step = io.load_parameters(
model=model, checkpoint_path=args.checkpoint)
model.eval()
text = np.asarray(text_to_sequence(text_input))
text = np.expand_dims(text, axis=0)
pos_text = np.arange(1, text.shape[1] + 1)
pos_text = np.expand_dims(pos_text, axis=0)
text = dg.to_variable(text).astype(np.int64)
pos_text = dg.to_variable(pos_text).astype(np.int64)
_, mel_output_postnet = model(text, pos_text, alpha=args.alpha)
if args.vocoder == 'griffin-lim':
#synthesis use griffin-lim
wav = synthesis_with_griffinlim(mel_output_postnet, cfg['audio'])
elif args.vocoder == 'waveflow':
wav = synthesis_with_waveflow(mel_output_postnet, args,
args.checkpoint_vocoder, place)
else:
print(
            'vocoder error, we only support griffin-lim and waveflow, but received %s.'
% args.vocoder)
writer.add_audio(text_input + '(' + args.vocoder + ')', wav, 0,
cfg['audio']['sr'])
if not os.path.exists(os.path.join(args.output, 'samples')):
os.mkdir(os.path.join(args.output, 'samples'))
write(
os.path.join(
os.path.join(args.output, 'samples'), args.vocoder + '.wav'),
cfg['audio']['sr'], wav)
print("Synthesis completed !!!")
writer.close()
def synthesis_with_griffinlim(mel_output, cfg):
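    # Approximately invert the log-mel spectrogram: undo the log, project back to a
    # linear spectrogram with the pseudo-inverse of the mel filter bank, then recover
    # the phase (and the waveform) with the Griffin-Lim algorithm.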
mel_output = fluid.layers.transpose(
fluid.layers.squeeze(mel_output, [0]), [1, 0])
mel_output = np.exp(mel_output.numpy())
basis = librosa.filters.mel(cfg['sr'],
cfg['n_fft'],
cfg['num_mels'],
fmin=cfg['fmin'],
fmax=cfg['fmax'])
inv_basis = np.linalg.pinv(basis)
spec = np.maximum(1e-10, np.dot(inv_basis, mel_output))
wav = librosa.core.griffinlim(
spec**cfg['power'],
hop_length=cfg['hop_length'],
win_length=cfg['win_length'])
return wav
def synthesis_with_waveflow(mel_output, args, checkpoint, place):
fluid.enable_dygraph(place)
args.config = args.config_vocoder
args.use_fp16 = False
config = io.add_yaml_config_to_args(args)
mel_spectrogram = fluid.layers.transpose(mel_output, [0, 2, 1])
# Build model.
waveflow = WaveFlowModule(config)
io.load_parameters(model=waveflow, checkpoint_path=checkpoint)
for layer in waveflow.sublayers():
if isinstance(layer, weight_norm.WeightNormWrapper):
layer.remove_weight_norm()
# Run model inference.
wav = waveflow.synthesize(mel_spectrogram, sigma=config.sigma)
return wav.numpy()[0]
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Synthesis model")
add_config_options_to_parser(parser)
args = parser.parse_args()
pprint(vars(args))
synthesis(
"Don't argue with the people of strong determination, because they may change the fact!",
args)

View File

@@ -1,20 +0,0 @@
# synthesize from a trained model
CUDA_VISIBLE_DEVICES=0 \
python -u synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint='./fastspeech_ljspeech_ckpt_1.0/fastspeech/step-162000' \
--config='fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
--output='./synthesis' \
--vocoder='waveflow' \
--config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
--checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000'
if [ $? -ne 0 ]; then
echo "Failed in synthesis!"
exit 1
fi
exit 0

View File

@@ -1,166 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import argparse
import os
import time
import math
from pathlib import Path
from pprint import pprint
from ruamel import yaml
from tqdm import tqdm
from matplotlib import cm
from collections import OrderedDict
from visualdl import LogWriter
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as layers
import paddle.fluid as fluid
from parakeet.models.fastspeech.fastspeech import FastSpeech
from parakeet.models.fastspeech.utils import get_alignment
from data import LJSpeechLoader
from parakeet.utils import io
def add_config_options_to_parser(parser):
parser.add_argument("--config", type=str, help="path of the config file")
parser.add_argument("--use_gpu", type=int, default=0, help="device to use")
parser.add_argument("--data", type=str, help="path of LJspeech dataset")
parser.add_argument(
"--alignments_path", type=str, help="path of alignments")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument(
"--output",
type=str,
default="experiment",
help="path to save experiment results")
def main(args):
local_rank = dg.parallel.Env().local_rank
nranks = dg.parallel.Env().nranks
parallel = nranks > 1
with open(args.config) as f:
cfg = yaml.load(f, Loader=yaml.Loader)
global_step = 0
place = fluid.CUDAPlace(dg.parallel.Env()
.dev_id) if args.use_gpu else fluid.CPUPlace()
fluid.enable_dygraph(place)
if not os.path.exists(args.output):
os.mkdir(args.output)
writer = LogWriter(os.path.join(args.output,
'log')) if local_rank == 0 else None
model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels'])
model.train()
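    # NoamDecay schedule: the learning rate ramps up for `warm_up_step` steps and then
    # decays roughly as 1/sqrt(step); the first argument is chosen so that the peak
    # learning rate is approximately cfg['train']['learning_rate'].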
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=dg.NoamDecay(1 / (cfg['train']['warm_up_step'] *
(cfg['train']['learning_rate']**2)),
cfg['train']['warm_up_step']),
parameter_list=model.parameters(),
grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][
'grad_clip_thresh']))
reader = LJSpeechLoader(
cfg['audio'],
place,
args.data,
args.alignments_path,
cfg['train']['batch_size'],
nranks,
local_rank,
shuffle=True).reader
iterator = iter(tqdm(reader))
# Load parameters.
global_step = io.load_parameters(
model=model,
optimizer=optimizer,
checkpoint_dir=os.path.join(args.output, 'checkpoints'),
iteration=args.iteration,
checkpoint_path=args.checkpoint)
print("Rank {}: checkpoint loaded.".format(local_rank))
if parallel:
strategy = dg.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
while global_step <= cfg['train']['max_iteration']:
try:
batch = next(iterator)
except StopIteration as e:
iterator = iter(tqdm(reader))
batch = next(iterator)
(character, mel, pos_text, pos_mel, alignment) = batch
global_step += 1
#Forward
result = model(
character, pos_text, mel_pos=pos_mel, length_target=alignment)
mel_output, mel_output_postnet, duration_predictor_output, _, _ = result
mel_loss = layers.mse_loss(mel_output, mel)
mel_postnet_loss = layers.mse_loss(mel_output_postnet, mel)
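        # Duration loss: L1 distance between the predicted durations and the alignment targets.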
duration_loss = layers.mean(
layers.abs(
layers.elementwise_sub(duration_predictor_output, alignment)))
total_loss = mel_loss + mel_postnet_loss + duration_loss
if local_rank == 0:
writer.add_scalar('mel_loss', mel_loss.numpy(), global_step)
writer.add_scalar('post_mel_loss',
mel_postnet_loss.numpy(), global_step)
writer.add_scalar('duration_loss',
duration_loss.numpy(), global_step)
writer.add_scalar('learning_rate',
optimizer._learning_rate.step().numpy(),
global_step)
if parallel:
total_loss = model.scale_loss(total_loss)
total_loss.backward()
model.apply_collective_grads()
else:
total_loss.backward()
optimizer.minimize(total_loss)
model.clear_gradients()
# save checkpoint
if local_rank == 0 and global_step % cfg['train'][
'checkpoint_interval'] == 0:
io.save_parameters(
os.path.join(args.output, 'checkpoints'), global_step, model,
optimizer)
if local_rank == 0:
writer.close()
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Train Fastspeech model")
add_config_options_to_parser(parser)
args = parser.parse_args()
# Print the whole config setting.
pprint(vars(args))
main(args)

View File

@@ -1,15 +0,0 @@
# train model
export CUDA_VISIBLE_DEVICES=0
python -u train.py \
--use_gpu=1 \
--data='../../dataset/LJSpeech-1.1' \
--alignments_path='./alignments/alignments.pkl' \
--output='./experiment' \
--config='configs/ljspeech.yaml' \
#--checkpoint='./checkpoint/fastspeech/step-120000' \
if [ $? -ne 0 ]; then
echo "Failed in training!"
exit 1
fi
exit 0

View File

@@ -1,112 +0,0 @@
# TransformerTTS
PaddlePaddle dynamic graph implementation of TransformerTTS, a neural TTS with Transformer. The implementation is based on [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895).
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Model Architecture
<div align="center" name="TransformerTTS model architecture">
<img src="./images/model_architecture.jpg" width=400 height=600 /> <br>
</div>
<div align="center" >
TransformerTTS model architecture
</div>
The model adopts multi-head attention to replace both the RNN structures and the original attention mechanism in [Tacotron2](https://arxiv.org/abs/1712.05884). The model consists of two main parts, an encoder and a decoder. We also implement the CBHG model of Tacotron as the vocoder and convert the spectrogram into a raw waveform using the Griffin-Lim algorithm.
## Project Structure
```text
├── config # yaml configuration files
├── data.py # dataset and dataloader settings for LJSpeech
├── synthesis.py # script to synthesize waveform from text
├── train_transformer.py # script for transformer model training
└── train_vocoder.py # script for vocoder model training
```
## Saving & Loading
`train_transformer.py` and `train_vocoder.py` have 3 arguments in common, `--checkpoint`, `--iteration` and `--output`.
1. `--output` is the directory for saving results.
During training, checkpoints are saved in `${output}/checkpoints` and tensorboard logs are saved in `${output}/log`.
During synthesis, results are saved in `${output}/samples` and the tensorboard log is saved in `${output}/log`.
2. `--checkpoint` is the path of a checkpoint and `--iteration` is the target step. They are used to load checkpoints in the following way (a usage sketch follows this list).
- If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
- If `--checkpoint` is not provided, we try to load the checkpoint of the target step specified by `--iteration` from the `${output}/checkpoints/` directory, e.g. given `--iteration 120000`, the checkpoint `${output}/checkpoints/step-120000.*` will be loaded.
- If both `--checkpoint` and `--iteration` are not provided, we try to load the latest checkpoint from `${output}/checkpoints/` directory.
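The following sketch illustrates the three loading modes; the step number and paths are placeholders, and the other flags mirror the training command shown in the next section.
```bash
# 1. Load an explicit checkpoint (base name, without the .pdparams/.pdopt extension).
python train_transformer.py --use_gpu=1 --data=${DATAPATH} --output=${OUTPUTPATH} \
    --config='configs/ljspeech.yaml' --checkpoint=${OUTPUTPATH}/checkpoints/step-120000

# 2. Load the checkpoint of a given iteration from ${OUTPUTPATH}/checkpoints/.
python train_transformer.py --use_gpu=1 --data=${DATAPATH} --output=${OUTPUTPATH} \
    --config='configs/ljspeech.yaml' --iteration=120000

# 3. Provide neither flag: the latest checkpoint in ${OUTPUTPATH}/checkpoints/ is loaded.
python train_transformer.py --use_gpu=1 --data=${DATAPATH} --output=${OUTPUTPATH} \
    --config='configs/ljspeech.yaml'
```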
## Train Transformer
TransformerTTS model can be trained by running ``train_transformer.py``.
```bash
python train_transformer.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output=${OUTPUTPATH} \
--config='configs/ljspeech.yaml' \
```
Or you can run the script file directly.
```bash
sh train_transformer.sh
```
If you want to train on multiple GPUs, you must start training in the following way.
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train_transformer.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output=${OUTPUTPATH} \
--config='configs/ljspeech.yaml' \
```
If you wish to resume from an existing model, see [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
**Note: To ensure the training effect, we recommend using multi-GPU training to enlarge the batch size, with at least 16 samples per GPU in a single batch.**
For more help on arguments, run ``python train_transformer.py --help``.
## Synthesis
After training TransformerTTS, you can synthesize audio by running ``synthesis.py``.
```bash
python synthesis.py \
--use_gpu=0 \
--output=${OUTPUTPATH} \
--config='configs/ljspeech.yaml' \
--checkpoint_transformer=${CHECKPOINTPATH} \
--vocoder='griffin-lim' \
```
We currently support two vocoders, the Griffin-Lim algorithm and WaveFlow. You can set ``--vocoder`` to use either of them. If you want to use WaveFlow as your vocoder, you also need to set ``--config_vocoder`` and ``--checkpoint_vocoder``, which are the paths of the vocoder config file and checkpoint, respectively. You can download a pre-trained WaveFlow model from [here](https://github.com/PaddlePaddle/Parakeet#vocoders).
Or you can run the script file directly.
```bash
sh synthesis.sh
```
For more help on arguments, run ``python synthesis.py --help``.
Then you can find the synthesized audio files in ``${OUTPUTPATH}/samples``.

View File

@@ -1,38 +0,0 @@
audio:
num_mels: 80
n_fft: 1024
sr: 22050
preemphasis: 0.97
hop_length: 256
win_length: 1024
power: 1.2
fmin: 0
fmax: 8000
network:
hidden_size: 256
embedding_size: 512
encoder_num_head: 4
encoder_n_layers: 3
decoder_num_head: 4
decoder_n_layers: 3
outputs_per_step: 1
stop_loss_weight: 8
vocoder:
hidden_size: 256
train:
batch_size: 32
learning_rate: 0.001
warm_up_step: 4000
grad_clip_thresh: 1.0
checkpoint_interval: 1000
image_interval: 2000
max_iteration: 500000

View File

@@ -1,219 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
import numpy as np
import pandas as pd
import librosa
import csv
from paddle import fluid
from parakeet import g2p
from parakeet.data.sampler import *
from parakeet.data.datacargo import DataCargo
from parakeet.data.batch import TextIDBatcher, SpecBatcher
from parakeet.data.dataset import DatasetMixin, TransformDataset, CacheDataset, SliceDataset
from parakeet.models.transformer_tts.utils import *
class LJSpeechLoader:
def __init__(self,
config,
place,
data_path,
batch_size,
nranks,
rank,
is_vocoder=False,
shuffle=True):
LJSPEECH_ROOT = Path(data_path)
metadata = LJSpeechMetaData(LJSPEECH_ROOT)
transformer = LJSpeech(config)
dataset = TransformDataset(metadata, transformer)
dataset = CacheDataset(dataset)
sampler = DistributedSampler(
len(dataset), nranks, rank, shuffle=shuffle)
assert batch_size % nranks == 0
each_bs = batch_size // nranks
if is_vocoder:
dataloader = DataCargo(
dataset,
sampler=sampler,
batch_size=each_bs,
shuffle=shuffle,
batch_fn=batch_examples_vocoder,
drop_last=True)
else:
dataloader = DataCargo(
dataset,
sampler=sampler,
batch_size=each_bs,
shuffle=shuffle,
batch_fn=batch_examples,
drop_last=True)
self.reader = fluid.io.DataLoader.from_generator(
capacity=32,
iterable=True,
use_double_buffer=True,
return_list=True)
self.reader.set_batch_generator(dataloader, place)
class LJSpeechMetaData(DatasetMixin):
def __init__(self, root):
self.root = Path(root)
self._wav_dir = self.root.joinpath("wavs")
csv_path = self.root.joinpath("metadata.csv")
self._table = pd.read_csv(
csv_path,
sep="|",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
def get_example(self, i):
fname, raw_text, normalized_text = self._table.iloc[i]
fname = str(self._wav_dir.joinpath(fname + ".wav"))
return fname, raw_text, normalized_text
def __len__(self):
return len(self._table)
class LJSpeech(object):
def __init__(self, config):
super(LJSpeech, self).__init__()
self.config = config
self.sr = config['sr']
self.n_mels = config['num_mels']
self.preemphasis = config['preemphasis']
self.n_fft = config['n_fft']
self.win_length = config['win_length']
self.hop_length = config['hop_length']
self.fmin = config['fmin']
self.fmax = config['fmax']
def __call__(self, metadatum):
"""All the code for generating an Example from a metadatum. If you want a
different preprocessing pipeline, you can override this method.
This method may require several processor, each of which has a lot of options.
In this case, you'd better pass a composed transform and pass it to the init
method.
"""
fname, raw_text, normalized_text = metadatum
# load
wav, _ = librosa.load(str(fname))
spec = librosa.stft(
y=wav,
n_fft=self.n_fft,
win_length=self.win_length,
hop_length=self.hop_length)
mag = np.abs(spec)
mel = librosa.filters.mel(sr=self.sr,
n_fft=self.n_fft,
n_mels=self.n_mels,
fmin=self.fmin,
fmax=self.fmax)
mel = np.matmul(mel, mag)
mel = np.log(np.maximum(mel, 1e-5))
characters = np.array(
g2p.en.text_to_sequence(normalized_text), dtype=np.int64)
return (mag, mel, characters)
def batch_examples(batch):
texts = []
mels = []
mel_inputs = []
text_lens = []
pos_texts = []
pos_mels = []
stop_tokens = []
for data in batch:
_, mel, text = data
mel_inputs.append(
np.concatenate(
[np.zeros([mel.shape[0], 1], np.float32), mel[:, :-1]],
axis=-1))
text_lens.append(len(text))
pos_texts.append(np.arange(1, len(text) + 1))
pos_mels.append(np.arange(1, mel.shape[1] + 1))
mels.append(mel)
texts.append(text)
stop_token = np.append(np.zeros([mel.shape[1] - 1], np.float32), 1.0)
stop_tokens.append(stop_token)
# Sort by text_len in descending order
texts = [
i
for i, _ in sorted(
zip(texts, text_lens), key=lambda x: x[1], reverse=True)
]
mels = [
i
for i, _ in sorted(
zip(mels, text_lens), key=lambda x: x[1], reverse=True)
]
mel_inputs = [
i
for i, _ in sorted(
zip(mel_inputs, text_lens), key=lambda x: x[1], reverse=True)
]
pos_texts = [
i
for i, _ in sorted(
zip(pos_texts, text_lens), key=lambda x: x[1], reverse=True)
]
pos_mels = [
i
for i, _ in sorted(
zip(pos_mels, text_lens), key=lambda x: x[1], reverse=True)
]
stop_tokens = [
i
for i, _ in sorted(
zip(stop_tokens, text_lens), key=lambda x: x[1], reverse=True)
]
text_lens = sorted(text_lens, reverse=True)
# Pad sequence with largest len of the batch
texts = TextIDBatcher(pad_id=0)(texts) #(B, T)
pos_texts = TextIDBatcher(pad_id=0)(pos_texts) #(B,T)
pos_mels = TextIDBatcher(pad_id=0)(pos_mels) #(B,T)
    # Pad the stop tokens with 1 so that padded frames are treated as stopped.
    stop_tokens = TextIDBatcher(pad_id=1, dtype=np.float32)(stop_tokens)
mels = np.transpose(
SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1)) #(B,T,num_mels)
mel_inputs = np.transpose(
SpecBatcher(pad_value=0.)(mel_inputs), axes=(0, 2, 1)) #(B,T,num_mels)
return (texts, mels, mel_inputs, pos_texts, pos_mels, stop_tokens)
def batch_examples_vocoder(batch):
mels = []
mags = []
for data in batch:
mag, mel, _ = data
mels.append(mel)
mags.append(mag)
mels = np.transpose(SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1))
mags = np.transpose(SpecBatcher(pad_value=0.)(mags), axes=(0, 2, 1))
return (mels, mags)

Binary file not shown (image removed, 322 KiB).

View File

@@ -1,202 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from scipy.io.wavfile import write
import numpy as np
import librosa
from tqdm import tqdm
from matplotlib import cm
from visualdl import LogWriter
from ruamel import yaml
from pathlib import Path
import argparse
from pprint import pprint
import paddle.fluid as fluid
import paddle.fluid.dygraph as dg
from parakeet.g2p.en import text_to_sequence
from parakeet.models.transformer_tts.utils import *
from parakeet.models.transformer_tts import TransformerTTS
from parakeet.models.waveflow import WaveFlowModule
from parakeet.modules.weight_norm import WeightNormWrapper
from parakeet.utils import io
def add_config_options_to_parser(parser):
parser.add_argument("--config", type=str, help="path of the config file")
parser.add_argument("--use_gpu", type=int, default=0, help="device to use")
parser.add_argument(
"--stop_threshold",
type=float,
default=0.5,
help="The threshold of stop token which indicates the time step should stop generate spectrum or not."
)
parser.add_argument(
"--max_len",
type=int,
default=1000,
help="The max length of spectrum when synthesize. If the length of synthetical spectrum is lager than max_len, spectrum will be cut off."
)
parser.add_argument(
"--checkpoint_transformer",
type=str,
help="transformer_tts checkpoint for synthesis")
parser.add_argument(
"--vocoder",
type=str,
default="griffin-lim",
choices=['griffin-lim', 'waveflow'],
help="vocoder method")
parser.add_argument(
"--config_vocoder", type=str, help="path of the vocoder config file")
parser.add_argument(
"--checkpoint_vocoder",
type=str,
help="vocoder checkpoint for synthesis")
parser.add_argument(
"--output",
type=str,
default="synthesis",
help="path to save experiment results")
def synthesis(text_input, args):
local_rank = dg.parallel.Env().local_rank
place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace())
with open(args.config) as f:
cfg = yaml.load(f, Loader=yaml.Loader)
    # VisualDL log writer
if not os.path.exists(args.output):
os.mkdir(args.output)
writer = LogWriter(os.path.join(args.output, 'log'))
fluid.enable_dygraph(place)
with fluid.unique_name.guard():
network_cfg = cfg['network']
model = TransformerTTS(
network_cfg['embedding_size'], network_cfg['hidden_size'],
network_cfg['encoder_num_head'], network_cfg['encoder_n_layers'],
cfg['audio']['num_mels'], network_cfg['outputs_per_step'],
network_cfg['decoder_num_head'], network_cfg['decoder_n_layers'])
# Load parameters.
global_step = io.load_parameters(
model=model, checkpoint_path=args.checkpoint_transformer)
model.eval()
# init input
text = np.asarray(text_to_sequence(text_input))
text = fluid.layers.unsqueeze(dg.to_variable(text).astype(np.int64), [0])
mel_input = dg.to_variable(np.zeros([1, 1, 80])).astype(np.float32)
pos_text = np.arange(1, text.shape[1] + 1)
pos_text = fluid.layers.unsqueeze(
dg.to_variable(pos_text).astype(np.int64), [0])
for i in range(args.max_len):
pos_mel = np.arange(1, mel_input.shape[1] + 1)
pos_mel = fluid.layers.unsqueeze(
dg.to_variable(pos_mel).astype(np.int64), [0])
mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
text, mel_input, pos_text, pos_mel)
if stop_preds.numpy()[0, -1] > args.stop_threshold:
break
mel_input = fluid.layers.concat(
[mel_input, postnet_pred[:, -1:, :]], axis=1)
global_step = 0
for i, prob in enumerate(attn_probs):
for j in range(4):
x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
writer.add_image(
'Attention_%d_0' % global_step,
x,
i * 4 + j)
if args.vocoder == 'griffin-lim':
#synthesis use griffin-lim
wav = synthesis_with_griffinlim(postnet_pred, cfg['audio'])
elif args.vocoder == 'waveflow':
# synthesis use waveflow
wav = synthesis_with_waveflow(postnet_pred, args,
args.checkpoint_vocoder, place)
else:
print(
            'vocoder error, we only support griffin-lim and waveflow, but received %s.'
% args.vocoder)
writer.add_audio(text_input + '(' + args.vocoder + ')', wav, 0,
cfg['audio']['sr'])
if not os.path.exists(os.path.join(args.output, 'samples')):
os.mkdir(os.path.join(args.output, 'samples'))
write(
os.path.join(
os.path.join(args.output, 'samples'), args.vocoder + '.wav'),
cfg['audio']['sr'], wav)
print("Synthesis completed !!!")
writer.close()
def synthesis_with_griffinlim(mel_output, cfg):
# synthesis with griffin-lim
mel_output = fluid.layers.transpose(
fluid.layers.squeeze(mel_output, [0]), [1, 0])
mel_output = np.exp(mel_output.numpy())
basis = librosa.filters.mel(cfg['sr'],
cfg['n_fft'],
cfg['num_mels'],
fmin=cfg['fmin'],
fmax=cfg['fmax'])
inv_basis = np.linalg.pinv(basis)
spec = np.maximum(1e-10, np.dot(inv_basis, mel_output))
wav = librosa.core.griffinlim(
spec**cfg['power'],
hop_length=cfg['hop_length'],
win_length=cfg['win_length'])
return wav
def synthesis_with_waveflow(mel_output, args, checkpoint, place):
fluid.enable_dygraph(place)
args.config = args.config_vocoder
args.use_fp16 = False
config = io.add_yaml_config_to_args(args)
mel_spectrogram = fluid.layers.transpose(
fluid.layers.squeeze(mel_output, [0]), [1, 0])
mel_spectrogram = fluid.layers.unsqueeze(mel_spectrogram, [0])
# Build model.
waveflow = WaveFlowModule(config)
io.load_parameters(model=waveflow, checkpoint_path=checkpoint)
for layer in waveflow.sublayers():
if isinstance(layer, WeightNormWrapper):
layer.remove_weight_norm()
# Run model inference.
wav = waveflow.synthesize(mel_spectrogram, sigma=config.sigma)
return wav.numpy()[0]
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Synthesis model")
add_config_options_to_parser(parser)
args = parser.parse_args()
# Print the whole config setting.
pprint(vars(args))
synthesis(
"Life was like a box of chocolates, you never know what you're gonna get.",
args)

View File

@@ -1,17 +0,0 @@
# synthesize from a trained model
CUDA_VISIBLE_DEVICES=0 \
python -u synthesis.py \
--use_gpu=0 \
--output='./synthesis' \
--config='transformer_tts_ljspeech_ckpt_1.0/ljspeech.yaml' \
--checkpoint_transformer='./transformer_tts_ljspeech_ckpt_1.0/step-120000' \
--vocoder='waveflow' \
--config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
--checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000'
if [ $? -ne 0 ]; then
echo "Failed in training!"
exit 1
fi
exit 0

View File

@@ -1,219 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from tqdm import tqdm
from visualdl import LogWriter
from collections import OrderedDict
import argparse
from pprint import pprint
from ruamel import yaml
from matplotlib import cm
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as layers
from parakeet.models.transformer_tts.utils import cross_entropy
from data import LJSpeechLoader
from parakeet.models.transformer_tts import TransformerTTS
from parakeet.utils import io
def add_config_options_to_parser(parser):
parser.add_argument("--config", type=str, help="path of the config file")
parser.add_argument("--use_gpu", type=int, default=0, help="device to use")
parser.add_argument("--data", type=str, help="path of LJspeech dataset")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument(
"--output",
type=str,
default="experiment",
help="path to save experiment results")
def main(args):
local_rank = dg.parallel.Env().local_rank
nranks = dg.parallel.Env().nranks
parallel = nranks > 1
with open(args.config) as f:
cfg = yaml.load(f, Loader=yaml.Loader)
global_step = 0
place = fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace()
if not os.path.exists(args.output):
os.mkdir(args.output)
writer = LogWriter(os.path.join(args.output,
'log')) if local_rank == 0 else None
fluid.enable_dygraph(place)
network_cfg = cfg['network']
model = TransformerTTS(
network_cfg['embedding_size'], network_cfg['hidden_size'],
network_cfg['encoder_num_head'], network_cfg['encoder_n_layers'],
cfg['audio']['num_mels'], network_cfg['outputs_per_step'],
network_cfg['decoder_num_head'], network_cfg['decoder_n_layers'])
model.train()
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=dg.NoamDecay(1 / (cfg['train']['warm_up_step'] *
(cfg['train']['learning_rate']**2)),
cfg['train']['warm_up_step']),
parameter_list=model.parameters(),
grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][
'grad_clip_thresh']))
# Load parameters.
global_step = io.load_parameters(
model=model,
optimizer=optimizer,
checkpoint_dir=os.path.join(args.output, 'checkpoints'),
iteration=args.iteration,
checkpoint_path=args.checkpoint)
print("Rank {}: checkpoint loaded.".format(local_rank))
if parallel:
strategy = dg.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
reader = LJSpeechLoader(
cfg['audio'],
place,
args.data,
cfg['train']['batch_size'],
nranks,
local_rank,
shuffle=True).reader
iterator = iter(tqdm(reader))
global_step += 1
while global_step <= cfg['train']['max_iteration']:
try:
batch = next(iterator)
except StopIteration as e:
iterator = iter(tqdm(reader))
batch = next(iterator)
character, mel, mel_input, pos_text, pos_mel, stop_tokens = batch
mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
character, mel_input, pos_text, pos_mel)
mel_loss = layers.mean(
layers.abs(layers.elementwise_sub(mel_pred, mel)))
post_mel_loss = layers.mean(
layers.abs(layers.elementwise_sub(postnet_pred, mel)))
loss = mel_loss + post_mel_loss
stop_loss = cross_entropy(
stop_preds, stop_tokens, weight=cfg['network']['stop_loss_weight'])
loss = loss + stop_loss
if local_rank == 0:
writer.add_scalar('training_loss/mel_loss',
mel_loss.numpy(),
global_step)
writer.add_scalar('training_loss/post_mel_loss',
post_mel_loss.numpy(),
global_step)
writer.add_scalar('stop_loss', stop_loss.numpy(), global_step)
if parallel:
writer.add_scalar('alphas/encoder_alpha',
model._layers.encoder.alpha.numpy(),
global_step)
writer.add_scalar('alphas/decoder_alpha',
model._layers.decoder.alpha.numpy(),
global_step)
else:
writer.add_scalar('alphas/encoder_alpha',
model.encoder.alpha.numpy(),
global_step)
writer.add_scalar('alphas/decoder_alpha',
model.decoder.alpha.numpy(),
global_step)
writer.add_scalar('learning_rate',
optimizer._learning_rate.step().numpy(),
global_step)
if global_step % cfg['train']['image_interval'] == 1:
for i, prob in enumerate(attn_probs):
for j in range(cfg['network']['decoder_num_head']):
x = np.uint8(
cm.viridis(prob.numpy()[j * cfg['train'][
'batch_size'] // nranks]) * 255)
writer.add_image(
'Attention_%d_0' % global_step,
x,
i * 4 + j)
for i, prob in enumerate(attn_enc):
for j in range(cfg['network']['encoder_num_head']):
x = np.uint8(
cm.viridis(prob.numpy()[j * cfg['train'][
'batch_size'] // nranks]) * 255)
writer.add_image(
'Attention_enc_%d_0' % global_step,
x,
i * 4 + j)
for i, prob in enumerate(attn_dec):
for j in range(cfg['network']['decoder_num_head']):
x = np.uint8(
cm.viridis(prob.numpy()[j * cfg['train'][
'batch_size'] // nranks]) * 255)
writer.add_image(
'Attention_dec_%d_0' % global_step,
x,
i * 4 + j)
if parallel:
loss = model.scale_loss(loss)
loss.backward()
model.apply_collective_grads()
else:
loss.backward()
optimizer.minimize(loss)
model.clear_gradients()
# save checkpoint
if local_rank == 0 and global_step % cfg['train'][
'checkpoint_interval'] == 0:
io.save_parameters(
os.path.join(args.output, 'checkpoints'), global_step, model,
optimizer)
global_step += 1
if local_rank == 0:
writer.close()
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Train TransformerTTS model")
add_config_options_to_parser(parser)
args = parser.parse_args()
# Print the whole config setting.
pprint(vars(args))
main(args)

View File

@@ -1,15 +0,0 @@
# train model
export CUDA_VISIBLE_DEVICES=0
python -u train_transformer.py \
--use_gpu=1 \
--data='../../dataset/LJSpeech-1.1' \
--output='./experiment' \
--config='configs/ljspeech.yaml' \
#--checkpoint='./checkpoint/transformer/step-120000' \
if [ $? -ne 0 ]; then
echo "Failed in training!"
exit 1
fi
exit 0

View File

@@ -1,144 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from visualdl import LogWriter
import os
from tqdm import tqdm
from pathlib import Path
from collections import OrderedDict
import argparse
from ruamel import yaml
from pprint import pprint
import paddle.fluid as fluid
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as layers
from data import LJSpeechLoader
from parakeet.models.transformer_tts import Vocoder
from parakeet.utils import io
def add_config_options_to_parser(parser):
parser.add_argument("--config", type=str, help="path of the config file")
parser.add_argument("--use_gpu", type=int, default=0, help="device to use")
parser.add_argument("--data", type=str, help="path of LJspeech dataset")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument(
"--output",
type=str,
default="vocoder",
help="path to save experiment results")
def main(args):
local_rank = dg.parallel.Env().local_rank
nranks = dg.parallel.Env().nranks
parallel = nranks > 1
with open(args.config) as f:
cfg = yaml.load(f, Loader=yaml.Loader)
global_step = 0
place = fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace()
if not os.path.exists(args.output):
os.mkdir(args.output)
writer = LogWriter(os.path.join(args.output,
'log')) if local_rank == 0 else None
fluid.enable_dygraph(place)
model = Vocoder(cfg['train']['batch_size'], cfg['vocoder']['hidden_size'],
cfg['audio']['num_mels'], cfg['audio']['n_fft'])
model.train()
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=dg.NoamDecay(1 / (cfg['train']['warm_up_step'] *
(cfg['train']['learning_rate']**2)),
cfg['train']['warm_up_step']),
parameter_list=model.parameters(),
grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][
'grad_clip_thresh']))
# Load parameters.
global_step = io.load_parameters(
model=model,
optimizer=optimizer,
checkpoint_dir=os.path.join(args.output, 'checkpoints'),
iteration=args.iteration,
checkpoint_path=args.checkpoint)
print("Rank {}: checkpoint loaded.".format(local_rank))
if parallel:
strategy = dg.parallel.prepare_context()
model = fluid.dygraph.parallel.DataParallel(model, strategy)
reader = LJSpeechLoader(
cfg['audio'],
place,
args.data,
cfg['train']['batch_size'],
nranks,
local_rank,
is_vocoder=True).reader()
for epoch in range(cfg['train']['max_iteration']):
pbar = tqdm(reader)
for i, data in enumerate(pbar):
pbar.set_description('Processing at epoch %d' % epoch)
mel, mag = data
mag = dg.to_variable(mag.numpy())
mel = dg.to_variable(mel.numpy())
global_step += 1
mag_pred = model(mel)
loss = layers.mean(
layers.abs(layers.elementwise_sub(mag_pred, mag)))
if parallel:
loss = model.scale_loss(loss)
loss.backward()
model.apply_collective_grads()
else:
loss.backward()
optimizer.minimize(loss)
model.clear_gradients()
if local_rank == 0:
writer.add_scalar('training_loss/loss', loss.numpy(),
global_step)
# save checkpoint
if local_rank == 0 and global_step % cfg['train'][
'checkpoint_interval'] == 0:
io.save_parameters(
os.path.join(args.output, 'checkpoints'), global_step,
model, optimizer)
if local_rank == 0:
writer.close()
if __name__ == '__main__':
parser = argparse.ArgumentParser(description="Train vocoder model")
add_config_options_to_parser(parser)
args = parser.parse_args()
# Print the whole config setting.
    pprint(vars(args))
main(args)

View File

@@ -1,16 +0,0 @@
# train model
CUDA_VISIBLE_DEVICES=0 \
python -u train_vocoder.py \
--use_gpu=1 \
--data='../../dataset/LJSpeech-1.1' \
--output='./vocoder' \
--config='configs/ljspeech.yaml' \
#--checkpoint='./checkpoint/vocoder/step-100000' \
if [ $? -ne 0 ]; then
echo "Failed in training!"
exit 1
fi
exit 0

View File

@@ -1,122 +0,0 @@
# WaveFlow
PaddlePaddle dynamic graph implementation of [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219).
- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real time on an Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M).
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.
## Project Structure
```text
├── configs # yaml configuration files of preset model hyperparameters
├── benchmark.py # benchmark code to test the speed of batched speech synthesis
├── synthesis.py # script for speech synthesis
├── train.py # script for model training
├── utils.py # helper functions for e.g., model checkpointing
├── data.py # dataset and dataloader settings for LJSpeech
├── waveflow.py # WaveFlow model high level APIs
└── parakeet/models/waveflow/waveflow_modules.py # WaveFlow model implementation
```
## Usage
There are many hyperparameters to be tuned, depending on the model specification and the dataset you are working on.
We provide `waveflow_ljspeech.yaml` as a hyperparameter set that works well on the LJSpeech dataset.
Note that we use a [convolutional queue](https://arxiv.org/abs/1611.09482) at audio synthesis time to cache the intermediate hidden states, which speeds up the autoregressive inference over the height dimension. The current implementation only supports a height dimension equal to 8 or 16, i.e., there is no dilation on the height dimension. Therefore, you can only set the value of the `n_group` key in the yaml config file to 8 or 16.
Also note that `train.py`, `synthesis.py`, and `benchmark.py` all accept a `--config` parameter. To ensure consistency, you should use the same config yaml file for training, synthesis, and benchmarking. You can also override these preset hyperparameters on the command line by adding parameters after `--config`.
For example, `--config=${yaml} --batch_size=8` overrides the corresponding hyperparameters in the `${yaml}` config file. For more details about these hyperparameters, check `utils.add_config_options_to_parser`.
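Concretely, such an override might look like the following sketch (the dataset path matches the examples below):
```bash
export CUDA_VISIBLE_DEVICES=0
# --batch_size=8 overrides the batch_size entry of the yaml config for this run only.
python -u train.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} \
    --batch_size=8 \
    --use_gpu=true
```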
Additionally, you need to specify some additional parameters for `train.py`, `synthesis.py`, and `benchmark.py`, and the details can be found in `train.add_options_to_parser`, `synthesis.add_options_to_parser`, and `benchmark.add_options_to_parser`, respectively.
### Dataset
Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
In this example, assume that the path of unzipped LJSpeech dataset is `./data/LJSpeech-1.1`.
### Train on single GPU
```bash
export CUDA_VISIBLE_DEVICES=0
python -u train.py \
--config=./configs/waveflow_ljspeech.yaml \
--root=./data/LJSpeech-1.1 \
--name=${ModelName} --batch_size=4 \
--use_gpu=true
```
#### Save and Load checkpoints
Our model saves model parameters as checkpoints in `./runs/waveflow/${ModelName}/checkpoint/` every 10000 iterations by default, where `${ModelName}` is the name of a single experiment and can be whatever you like.
The saved checkpoint will have the format of `step-${iteration_number}.pdparams` for model parameters and `step-${iteration_number}.pdopt` for optimizer parameters.
There are three ways to load a checkpoint and resume training (suppose you want to load a checkpoint from iteration 500000; a usage sketch follows this list):
1. Use `--checkpoint=./runs/waveflow/${ModelName}/checkpoint/step-500000` to provide a specific path to load. Note that you only need to provide the base name of the parameter file, which is `step-500000`; no extension `.pdparams` or `.pdopt` is needed.
2. Use `--iteration=500000`.
3. If you don't specify either `--checkpoint` or `--iteration`, the model will automatically load the latest checkpoint in `./runs/waveflow/${ModelName}/checkpoint`.
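As a sketch, resuming with an explicit checkpoint path (the first option) might look like this; the step number is a placeholder:
```bash
export CUDA_VISIBLE_DEVICES=0
python -u train.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --batch_size=4 \
    --use_gpu=true \
    --checkpoint=./runs/waveflow/${ModelName}/checkpoint/step-500000
```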
### Train on multiple GPUs
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -u -m paddle.distributed.launch train.py \
--config=./configs/waveflow_ljspeech.yaml \
--root=./data/LJSpeech-1.1 \
--name=${ModelName} --use_gpu=true
```
Use `export CUDA_VISIBLE_DEVICES=0,1,2,3` to make the GPUs you want to use visible. The `paddle.distributed.launch` module will then use these visible GPUs for data-parallel training in multiprocessing mode.
### Monitor with Tensorboard
By default, the logs are saved in `./runs/waveflow/${ModelName}/logs/`. You can monitor logs using TensorBoard.
```bash
tensorboard --logdir=${log_dir} --port=8888
```
### Synthesize from a checkpoint
Check the [Save and load checkpoint](#save-and-load-checkpoints) section on how to load a specific checkpoint.
The following example will automatically load the latest checkpoint:
```bash
export CUDA_VISIBLE_DEVICES=0
python -u synthesis.py \
--config=./configs/waveflow_ljspeech.yaml \
--root=./data/LJSpeech-1.1 \
--name=${ModelName} --use_gpu=true \
--output=./syn_audios \
--sample=${SAMPLE} \
--sigma=1.0
```
In this example, `--output` specifies where to save the synthesized audios and `--sample` (<16) specifies which sample in the valid dataset (a split from the whole LJSpeech dataset, by default contains the first 16 audio samples) to synthesize based on the mel-spectrograms computed from the ground truth sample audio, e.g., `--sample=0` means to synthesize the first audio in the valid dataset.
### Benchmarking
Use the following example to benchmark the speed of batched speech synthesis; it reports how many times faster than real time the synthesis runs:
```bash
export CUDA_VISIBLE_DEVICES=0
python -u benchmark.py \
--config=./configs/waveflow_ljspeech.yaml \
--root=./data/LJSpeech-1.1 \
--name=${ModelName} --use_gpu=true
```
### Low-precision inference
This model supports float16 low-precision inference. By appending the argument
```bash
--use_fp16=true
```
to the synthesis or benchmarking command, you can enable much faster low-precision inference.
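For example, a low-precision synthesis run might look like the following sketch, which mirrors the synthesis example above with `--use_fp16=true` appended:
```bash
export CUDA_VISIBLE_DEVICES=0
python -u synthesis.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true \
    --output=./syn_audios \
    --sample=${SAMPLE} \
    --sigma=1.0 \
    --use_fp16=true
```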

View File

@@ -1,103 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import random
from pprint import pprint
import argparse
import numpy as np
import paddle.fluid.dygraph as dg
from paddle import fluid
import utils
from parakeet.utils import io
from waveflow import WaveFlow
def add_options_to_parser(parser):
parser.add_argument(
'--model',
type=str,
default='waveflow',
help="general name of the model")
parser.add_argument(
'--name', type=str, help="specific name of the training model")
parser.add_argument(
'--root', type=str, help="root path of the LJSpeech dataset")
parser.add_argument(
'--use_gpu',
type=utils.str2bool,
default=True,
help="option to use gpu training")
parser.add_argument(
'--use_fp16',
type=utils.str2bool,
default=True,
help="option to use fp16 for inference")
parser.add_argument(
'--iteration',
type=int,
default=None,
help=("which iteration of checkpoint to load, "
"default to load the latest checkpoint"))
parser.add_argument(
'--checkpoint',
type=str,
default=None,
help="path of the checkpoint to load")
def benchmark(config):
pprint(vars(config))
# Get checkpoint directory path.
run_dir = os.path.join("runs", config.model, config.name)
checkpoint_dir = os.path.join(run_dir, "checkpoint")
# Configurate device.
place = fluid.CUDAPlace(0) if config.use_gpu else fluid.CPUPlace()
with dg.guard(place):
# Fix random seed.
seed = config.seed
random.seed(seed)
np.random.seed(seed)
fluid.default_startup_program().random_seed = seed
fluid.default_main_program().random_seed = seed
print("Random Seed: ", seed)
# Build model.
model = WaveFlow(config, checkpoint_dir)
model.build(training=False)
# Run model inference.
model.benchmark()
if __name__ == "__main__":
# Create parser.
parser = argparse.ArgumentParser(
description="Synthesize audio using WaveNet model")
add_options_to_parser(parser)
utils.add_config_options_to_parser(parser)
# Parse argument from both command line and yaml config file.
# For conflicting updates to the same field,
# the preceding update will be overwritten by the following one.
config = parser.parse_args()
config = io.add_yaml_config_to_args(config)
benchmark(config)

View File

@@ -1,24 +0,0 @@
valid_size: 16
segment_length: 16000
sample_rate: 22050
fft_window_shift: 256
fft_window_size: 1024
fft_size: 1024
mel_bands: 80
mel_fmin: 0.0
mel_fmax: 8000.0
seed: 1234
learning_rate: 0.0002
batch_size: 8
test_every: 2000
save_every: 10000
max_iterations: 3000000
sigma: 1.0
n_flows: 8
n_group: 16
n_layers: 8
n_channels: 64
kernel_h: 3
kernel_w: 3

View File

@@ -1,144 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import random
import librosa
import numpy as np
from paddle import fluid
from parakeet.datasets import ljspeech
from parakeet.data import SpecBatcher, WavBatcher
from parakeet.data import DataCargo, DatasetMixin
from parakeet.data import DistributedSampler, BatchSampler
from scipy.io.wavfile import read
class Dataset(ljspeech.LJSpeech):
def __init__(self, config):
super(Dataset, self).__init__(config.root)
self.config = config
def _get_example(self, metadatum):
fname, _, _ = metadatum
wav_path = os.path.join(self.root, "wavs", fname + ".wav")
audio, loaded_sr = librosa.load(wav_path, sr=self.config.sample_rate)
return audio
class Subset(DatasetMixin):
def __init__(self, dataset, indices, valid):
self.dataset = dataset
self.indices = indices
self.valid = valid
self.config = dataset.config
def get_mel(self, audio):
spectrogram = librosa.core.stft(
audio,
n_fft=self.config.fft_size,
hop_length=self.config.fft_window_shift,
win_length=self.config.fft_window_size)
spectrogram_magnitude = np.abs(spectrogram)
# mel_filter_bank shape: [n_mels, 1 + n_fft/2]
mel_filter_bank = librosa.filters.mel(sr=self.config.sample_rate,
n_fft=self.config.fft_size,
n_mels=self.config.mel_bands,
fmin=self.config.mel_fmin,
fmax=self.config.mel_fmax)
# mel shape: [n_mels, num_frames]
mel = np.dot(mel_filter_bank, spectrogram_magnitude)
# Normalize mel.
clip_val = 1e-5
ref_constant = 1
mel = np.log(np.clip(mel, a_min=clip_val, a_max=None) * ref_constant)
return mel
def __getitem__(self, idx):
audio = self.dataset[self.indices[idx]]
segment_length = self.config.segment_length
if self.valid:
# whole audio for valid set
pass
else:
# Randomly crop segment_length from audios in the training set.
# audio shape: [len]
if audio.shape[0] >= segment_length:
max_audio_start = audio.shape[0] - segment_length
audio_start = random.randint(0, max_audio_start)
audio = audio[audio_start:(audio_start + segment_length)]
else:
audio = np.pad(audio, (0, segment_length - audio.shape[0]),
mode='constant',
constant_values=0)
mel = self.get_mel(audio)
return audio, mel
def _batch_examples(self, batch):
audios = [sample[0] for sample in batch]
mels = [sample[1] for sample in batch]
audios = WavBatcher(pad_value=0.0)(audios)
mels = SpecBatcher(pad_value=0.0)(mels)
return audios, mels
def __len__(self):
return len(self.indices)
class LJSpeech:
def __init__(self, config, nranks, rank):
place = fluid.CUDAPlace(rank) if config.use_gpu else fluid.CPUPlace()
# Whole LJSpeech dataset.
ds = Dataset(config)
# Split into train and valid dataset.
indices = list(range(len(ds)))
train_indices = indices[config.valid_size:]
valid_indices = indices[:config.valid_size]
random.shuffle(train_indices)
# Train dataset.
trainset = Subset(ds, train_indices, valid=False)
sampler = DistributedSampler(len(trainset), nranks, rank)
total_bs = config.batch_size
assert total_bs % nranks == 0
train_sampler = BatchSampler(
sampler, total_bs // nranks, drop_last=True)
trainloader = DataCargo(trainset, batch_sampler=train_sampler)
trainreader = fluid.io.PyReader(capacity=50, return_list=True)
trainreader.decorate_batch_generator(trainloader, place)
self.trainloader = (data for _ in iter(int, 1)
for data in trainreader())
# Valid dataset.
validset = Subset(ds, valid_indices, valid=True)
# Currently only support batch_size = 1 for valid loader.
validloader = DataCargo(validset, batch_size=1, shuffle=False)
validreader = fluid.io.PyReader(capacity=20, return_list=True)
validreader.decorate_batch_generator(validloader, place)
self.validloader = validreader

View File

@@ -1,113 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import random
from pprint import pprint
import argparse
import numpy as np
import paddle.fluid.dygraph as dg
from paddle import fluid
from parakeet.utils import io
import utils
from waveflow import WaveFlow
def add_options_to_parser(parser):
parser.add_argument(
'--model',
type=str,
default='waveflow',
help="general name of the model")
parser.add_argument(
'--name', type=str, help="specific name of the training model")
parser.add_argument(
'--root', type=str, help="root path of the LJSpeech dataset")
parser.add_argument(
'--use_gpu',
type=utils.str2bool,
default=True,
help="option to use gpu training")
parser.add_argument(
'--use_fp16',
type=utils.str2bool,
default=True,
help="option to use fp16 for inference")
parser.add_argument(
'--iteration',
type=int,
default=None,
help=("which iteration of checkpoint to load, "
"default to load the latest checkpoint"))
parser.add_argument(
'--checkpoint',
type=str,
default=None,
help="path of the checkpoint to load")
parser.add_argument(
'--output',
type=str,
default="./syn_audios",
help="path to write synthesized audio files")
parser.add_argument(
'--sample',
type=int,
default=None,
help="which of the valid samples to synthesize audio")
def synthesize(config):
pprint(vars(config))
# Get checkpoint directory path.
run_dir = os.path.join("runs", config.model, config.name)
checkpoint_dir = os.path.join(run_dir, "checkpoint")
# Configure device.
place = fluid.CUDAPlace(0) if config.use_gpu else fluid.CPUPlace()
with dg.guard(place):
# Fix random seed.
seed = config.seed
random.seed(seed)
np.random.seed(seed)
fluid.default_startup_program().random_seed = seed
fluid.default_main_program().random_seed = seed
print("Random Seed: ", seed)
# Build model.
model = WaveFlow(config, checkpoint_dir)
iteration = model.build(training=False)
# Run model inference.
model.infer(iteration)
if __name__ == "__main__":
# Create parser.
parser = argparse.ArgumentParser(
description="Synthesize audio using WaveNet model")
add_options_to_parser(parser)
utils.add_config_options_to_parser(parser)
# Parse argument from both command line and yaml config file.
# For conflicting updates to the same field,
# the preceding update will be overwritten by the following one.
config = parser.parse_args()
config = io.add_yaml_config_to_args(config)
synthesize(config)

View File

@@ -1,134 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import random
import subprocess
import time
from pprint import pprint
import argparse
import numpy as np
import paddle.fluid.dygraph as dg
from paddle import fluid
from visualdl import LogWriter
import utils
from parakeet.utils import io
from waveflow import WaveFlow
def add_options_to_parser(parser):
parser.add_argument(
'--model',
type=str,
default='waveflow',
help="general name of the model")
parser.add_argument(
'--name', type=str, help="specific name of the training model")
parser.add_argument(
'--root', type=str, help="root path of the LJSpeech dataset")
parser.add_argument(
'--use_gpu',
type=utils.str2bool,
default=True,
help="option to use gpu training")
parser.add_argument(
'--iteration',
type=int,
default=None,
help=("which iteration of checkpoint to load, "
"default to load the latest checkpoint"))
parser.add_argument(
'--checkpoint',
type=str,
default=None,
help="path of the checkpoint to load")
def train(config):
use_gpu = config.use_gpu
# Get the rank of the current training process.
rank = dg.parallel.Env().local_rank
nranks = dg.parallel.Env().nranks
parallel = nranks > 1
if rank == 0:
# Print the whole config setting.
pprint(vars(config))
# Make checkpoint directory.
run_dir = os.path.join("runs", config.model, config.name)
checkpoint_dir = os.path.join(run_dir, "checkpoint")
if not os.path.exists(checkpoint_dir):
os.makedirs(checkpoint_dir)
# Create VisualDL logger.
vdl = LogWriter(os.path.join(run_dir, "logs")) \
if rank == 0 else None
# Configure device.
place = fluid.CUDAPlace(rank) if use_gpu else fluid.CPUPlace()
with dg.guard(place):
# Fix random seed.
seed = config.seed
random.seed(seed)
np.random.seed(seed)
fluid.default_startup_program().random_seed = seed
fluid.default_main_program().random_seed = seed
print("Random Seed: ", seed)
# Build model.
model = WaveFlow(config, checkpoint_dir, parallel, rank, nranks, vdl)
iteration = model.build()
while iteration < config.max_iterations:
# Run one single training step.
model.train_step(iteration)
iteration += 1
if iteration % config.test_every == 0:
# Run validation step.
model.valid_step(iteration)
if rank == 0 and iteration % config.save_every == 0:
# Save parameters.
model.save(iteration)
# Close the VisualDL log writer.
if rank == 0:
vdl.close()
if __name__ == "__main__":
# Create parser.
parser = argparse.ArgumentParser(description="Train WaveFlow model")
#formatter_class='default_argparse')
add_options_to_parser(parser)
utils.add_config_options_to_parser(parser)
# Parse argument from both command line and yaml config file.
# For conflicting updates to the same field,
# the preceding update will be overwritten by the following one.
config = parser.parse_args()
config = io.add_yaml_config_to_args(config)
# Force to use fp32 in model training
vars(config)["use_fp16"] = False
train(config)

View File

@@ -1,90 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
def str2bool(v):
return v.lower() in ("true", "t", "1")
def add_config_options_to_parser(parser):
parser.add_argument(
'--valid_size', type=int, help="size of the valid dataset")
parser.add_argument(
'--segment_length',
type=int,
help="the length of audio clip for training")
parser.add_argument(
'--sample_rate', type=int, help="sampling rate of audio data file")
parser.add_argument(
'--fft_window_shift',
type=int,
help="the shift of fft window for each frame")
parser.add_argument(
'--fft_window_size',
type=int,
help="the size of fft window for each frame")
parser.add_argument(
'--fft_size', type=int, help="the size of fft filter on each frame")
parser.add_argument(
'--mel_bands',
type=int,
help="the number of mel bands when calculating mel spectrograms")
parser.add_argument(
'--mel_fmin',
type=float,
help="lowest frequency in calculating mel spectrograms")
parser.add_argument(
'--mel_fmax',
type=float,
help="highest frequency in calculating mel spectrograms")
parser.add_argument(
'--seed', type=int, help="seed of random initialization for the model")
parser.add_argument('--learning_rate', type=float)
parser.add_argument(
'--batch_size', type=int, help="batch size for training")
parser.add_argument(
'--test_every', type=int, help="test interval during training")
parser.add_argument(
'--save_every',
type=int,
help="checkpointing interval during training")
parser.add_argument(
'--max_iterations', type=int, help="maximum training iterations")
parser.add_argument(
'--sigma',
type=float,
help="standard deviation of the latent Gaussian variable")
parser.add_argument('--n_flows', type=int, help="number of flows")
parser.add_argument(
'--n_group',
type=int,
help="number of adjacent audio samples to squeeze into one column")
parser.add_argument(
'--n_layers',
type=int,
help="number of conv2d layer in one wavenet-like flow architecture")
parser.add_argument(
'--n_channels', type=int, help="number of residual channels in flow")
parser.add_argument(
'--kernel_h',
type=int,
help="height of the kernel in the conv2d layer")
parser.add_argument(
'--kernel_w', type=int, help="width of the kernel in the conv2d layer")
parser.add_argument('--config', type=str, help="Path to the config file.")

View File

@@ -1,292 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import itertools
import os
import time
import numpy as np
import paddle.fluid.dygraph as dg
from paddle import fluid
from scipy.io.wavfile import write
from parakeet.utils import io
from parakeet.modules import weight_norm
from parakeet.models.waveflow import WaveFlowLoss, WaveFlowModule
from data import LJSpeech
import utils
class WaveFlow():
"""Wrapper class of WaveFlow model that supports multiple APIs.
This module provides APIs for model building, training, validation,
inference, benchmarking, and saving.
Args:
config (obj): config info.
checkpoint_dir (str): path for checkpointing.
parallel (bool, optional): whether use multiple GPUs for training.
Defaults to False.
rank (int, optional): the rank of the process in a multi-process
scenario. Defaults to 0.
nranks (int, optional): the total number of processes. Defaults to 1.
vdl_logger (obj, optional): logger to visualize metrics.
Defaults to None.
Returns:
WaveFlow
"""
def __init__(self,
config,
checkpoint_dir,
parallel=False,
rank=0,
nranks=1,
vdl_logger=None):
self.config = config
self.checkpoint_dir = checkpoint_dir
self.parallel = parallel
self.rank = rank
self.nranks = nranks
self.vdl_logger = vdl_logger
self.dtype = "float16" if config.use_fp16 else "float32"
def build(self, training=True):
"""Initialize the model.
Args:
training (bool, optional): Whether the model is built for training or inference.
Defaults to True.
Returns:
None
"""
config = self.config
dataset = LJSpeech(config, self.nranks, self.rank)
self.trainloader = dataset.trainloader
self.validloader = dataset.validloader
waveflow = WaveFlowModule(config)
if training:
optimizer = fluid.optimizer.AdamOptimizer(
learning_rate=config.learning_rate,
parameter_list=waveflow.parameters())
# Load parameters.
iteration = io.load_parameters(
model=waveflow,
optimizer=optimizer,
checkpoint_dir=self.checkpoint_dir,
iteration=config.iteration,
checkpoint_path=config.checkpoint)
print("Rank {}: checkpoint loaded.".format(self.rank))
# Data parallelism.
if self.parallel:
strategy = dg.parallel.prepare_context()
waveflow = dg.parallel.DataParallel(waveflow, strategy)
self.waveflow = waveflow
self.optimizer = optimizer
self.criterion = WaveFlowLoss(config.sigma)
else:
# Load parameters.
iteration = io.load_parameters(
model=waveflow,
checkpoint_dir=self.checkpoint_dir,
iteration=config.iteration,
checkpoint_path=config.checkpoint)
print("Rank {}: checkpoint loaded.".format(self.rank))
for layer in waveflow.sublayers():
if isinstance(layer, weight_norm.WeightNormWrapper):
layer.remove_weight_norm()
self.waveflow = waveflow
return iteration
def train_step(self, iteration):
"""Train the model for one step.
Args:
iteration (int): current iteration number.
Returns:
None
"""
self.waveflow.train()
start_time = time.time()
audios, mels = next(self.trainloader)
load_time = time.time()
outputs = self.waveflow(audios, mels)
loss = self.criterion(outputs)
if self.parallel:
# loss = loss / num_trainers
loss = self.waveflow.scale_loss(loss)
loss.backward()
self.waveflow.apply_collective_grads()
else:
loss.backward()
self.optimizer.minimize(
loss, parameter_list=self.waveflow.parameters())
self.waveflow.clear_gradients()
graph_time = time.time()
if self.rank == 0:
loss_val = float(loss.numpy()) * self.nranks
log = "Rank: {} Step: {:^8d} Loss: {:<8.3f} " \
"Time: {:.3f}/{:.3f}".format(
self.rank, iteration, loss_val,
load_time - start_time, graph_time - load_time)
print(log)
vdl_writer = self.vdl_logger
vdl_writer.add_scalar("Train-Loss-Rank-0", loss_val, iteration)
@dg.no_grad
def valid_step(self, iteration):
"""Run the model on the validation dataset.
Args:
iteration (int): current iteration number.
Returns:
None
"""
self.waveflow.eval()
vdl_writer = self.vdl_logger
total_loss = []
sample_audios = []
start_time = time.time()
for i, batch in enumerate(self.validloader()):
audios, mels = batch
valid_outputs = self.waveflow(audios, mels)
valid_z, valid_log_s_list = valid_outputs
# Visualize latent z and scale log_s.
if self.rank == 0 and i == 0:
vdl_writer.add_histogram("Valid-Latent_z", valid_z.numpy(),
iteration)
for j, valid_log_s in enumerate(valid_log_s_list):
hist_name = "Valid-{}th-Flow-Log_s".format(j)
vdl_writer.add_histogram(hist_name, valid_log_s.numpy(),
iteration)
valid_loss = self.criterion(valid_outputs)
total_loss.append(float(valid_loss.numpy()))
total_time = time.time() - start_time
if self.rank == 0:
loss_val = np.mean(total_loss)
log = "Test | Rank: {} AvgLoss: {:<8.3f} Time {:<8.3f}".format(
self.rank, loss_val, total_time)
print(log)
vdl_writer.add_scalar("Valid-Avg-Loss", loss_val, iteration)
@dg.no_grad
def infer(self, iteration):
"""Run the model to synthesize audios.
Args:
iteration (int): iteration number of the loaded checkpoint.
Returns:
None
"""
self.waveflow.eval()
config = self.config
sample = config.sample
output = "{}/{}/iter-{}".format(config.output, config.name, iteration)
if not os.path.exists(output):
os.makedirs(output)
mels_list = [mels for _, mels in self.validloader()]
if sample is not None:
mels_list = [mels_list[sample]]
else:
sample = 0
for idx, mel in enumerate(mels_list):
abs_idx = sample + idx
filename = "{}/valid_{}.wav".format(output, abs_idx)
print("Synthesize sample {}, save as {}".format(abs_idx, filename))
start_time = time.time()
audio = self.waveflow.synthesize(mel, sigma=self.config.sigma)
syn_time = time.time() - start_time
audio = audio[0]
audio_time = audio.shape[0] / self.config.sample_rate
print("audio time {:.4f}, synthesis time {:.4f}".format(audio_time,
syn_time))
# Denormalize audio from [-1, 1] to the [-32768, 32767] int16 range.
audio = audio.numpy().astype("float32") * 32768.0
audio = audio.astype('int16')
write(filename, config.sample_rate, audio)
@dg.no_grad
def benchmark(self):
"""Run the model to benchmark synthesis speed.
Args:
None
Returns:
None
"""
self.waveflow.eval()
mels_list = [mels for _, mels in self.validloader()]
mel = fluid.layers.concat(mels_list, axis=2)
mel = mel[:, :, :864]
batch_size = 8
mel = fluid.layers.expand(mel, [batch_size, 1, 1])
for i in range(10):
start_time = time.time()
audio = self.waveflow.synthesize(mel, sigma=self.config.sigma)
print("audio.shape = ", audio.shape)
syn_time = time.time() - start_time
audio_time = audio.shape[1] * batch_size / self.config.sample_rate
print("audio time {:.4f}, synthesis time {:.4f}".format(audio_time,
syn_time))
print("{} X real-time".format(audio_time / syn_time))
def save(self, iteration):
"""Save model checkpoint.
Args:
iteration (int): iteration number of the model to be saved.
Returns:
None
"""
io.save_parameters(self.checkpoint_dir, iteration, self.waveflow,
self.optimizer)

View File

@@ -1,144 +0,0 @@
# WaveNet
PaddlePaddle dynamic graph implementation of WaveNet, a convolutional network based vocoder. WaveNet was originally proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499). However, in this experiment, the implementation follows the teacher model in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](arxiv.org/abs/1807.07281).
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Project Structure
```text
├── data.py data_processing
├── configs/ (example) configuration file
├── synthesis.py script to synthesize waveform from mel_spectrogram
├── train.py script to train a model
└── utils.py utility functions
```
## Saving & Loading
`train.py` and `synthesis.py` have three arguments in common: `--checkpoint`, `--iteration` and `output`.
1. `output` is the directory for saving results.
During training, checkpoints are saved in `checkpoints/` inside `output`, and the tensorboard log is saved in `log/` inside `output`. Other possible outputs are saved in `states/` inside `output`.
During synthesis, audio files and other possible outputs are saved in `synthesis/` inside `output`.
So after training and synthesizing with the same output directory, the file structure of the output directory looks like this.
```text
├── checkpoints/ # checkpoint directory (including *.pdparams, *.pdopt and a text file `checkpoint` that records the latest checkpoint)
├── states/ # audio files generated at validation and other possible outputs
├── log/ # tensorboard log
└── synthesis/ # synthesized audio files and other possible outputs
```
2. `--checkpoint` and `--iteration` are used to load from an existing checkpoint. Loading an existing checkpoint follows these rules:
If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
If `--checkpoint` is not provided, we try to load the checkpoint specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided either, we try to load the latest checkpoint from the checkpoint directory.
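The three loading modes thus correspond to invocations like the sketch below (reusing the example config, dataset and checkpoint paths shown later in this README; the paths are illustrative):
```bash
# 1. Resume from an explicit checkpoint file.
python train.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ \
    --device=0 --checkpoint="experiment/checkpoints/step-1000000" experiment

# 2. Load the checkpoint of a given iteration from the output directory.
python train.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ \
    --device=0 --iteration=1000000 experiment

# 3. Load the latest checkpoint from the output directory (neither flag given).
python train.py --config=./configs/wavenet_single_gaussian.yaml --data=./LJSpeech-1.1/ \
    --device=0 experiment
```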
## Train
Train the model using train.py. For help on usage, try `python train.py --help`.
```text
usage: train.py [-h] [--data DATA] [--config CONFIG] [--device DEVICE]
[--checkpoint CHECKPOINT | --iteration ITERATION]
output
Train a WaveNet model with LJSpeech.
positional arguments:
output path to save results
optional arguments:
-h, --help show this help message and exit
--data DATA path of the LJspeech dataset
--config CONFIG path of the config file
--device DEVICE device to use
--checkpoint CHECKPOINT checkpoint to resume from
--iteration ITERATION the iteration of the checkpoint to load from output directory
```
- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
- `--config` is the configuration file to use. The provided configurations can be used directly, or you can change some values in the configuration file and train the model with a different config.
- `--device` is the device (gpu id) to use for training. `-1` means CPU.
- `--checkpoint` is the path of the checkpoint.
- `--iteration` is the iteration of the checkpoint to load from output directory.
- `output` is the directory to save results; all results are saved in this directory.
See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
Example script:
```bash
python train.py \
--config=./configs/wavenet_single_gaussian.yaml \
--data=./LJSpeech-1.1/ \
--device=0 \
experiment
```
You can monitor training log via TensorBoard, using the script below.
```bash
cd experiment/log
tensorboard --logdir=.
```
## Synthesis
```text
usage: synthesis.py [-h] [--data DATA] [--config CONFIG] [--device DEVICE]
[--checkpoint CHECKPOINT | --iteration ITERATION]
output
Synthesize valid data from LJspeech with a wavenet model.
positional arguments:
output path to save the synthesized audio
optional arguments:
-h, --help show this help message and exit
--data DATA path of the LJspeech dataset
--config CONFIG path of the config file
--device DEVICE device to use
--checkpoint CHECKPOINT checkpoint to resume from
--iteration ITERATION the iteration of the checkpoint to load from output directory
```
- `--data` is the path of the LJSpeech dataset. In principle, a dataset is not needed for synthesis, but since the input is a mel-spectrogram, we need to compute mel-spectrograms from audio files.
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
- `--device` is the device (gpu id) to use. `-1` means CPU.
- `--checkpoint` is the checkpoint to load.
- `--iteration` is the iteration of the checkpoint to load from output directory.
- `output` is the directory to save the synthesized audio. Audio files are saved in `synthesis/` inside the `output` directory.
See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
Example script:
```bash
python synthesis.py \
--config=./configs/wavenet_single_gaussian.yaml \
--data=./LJSpeech-1.1/ \
--device=0 \
--checkpoint="experiment/checkpoints/step-1000000" \
experiment
```
or
```bash
python synthesis.py \
--config=./configs/wavenet_single_gaussian.yaml \
--data=./LJSpeech-1.1/ \
--device=0 \
--iteration=1000000 \
experiment
```

View File

@@ -1,36 +0,0 @@
data:
batch_size: 16
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
model:
upsampling_factors: [16, 16]
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
loss_type: "mog"
output_dim: 30
log_scale_min: -9
train:
learning_rate: 0.001
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 10000
snap_interval: 10000
eval_interval: 10000
max_iterations: 2000000

View File

@@ -1,36 +0,0 @@
data:
batch_size: 16
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
model:
upsampling_factors: [16, 16]
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
loss_type: "mog"
output_dim: 3
log_scale_min: -9
train:
learning_rate: 0.001
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 10000
snap_interval: 10000
eval_interval: 10000
max_iterations: 2000000

View File

@@ -1,36 +0,0 @@
data:
batch_size: 16
train_clip_seconds: 0.5
sample_rate: 22050
hop_length: 256
win_length: 1024
n_fft: 2048
n_mels: 80
valid_size: 16
model:
upsampling_factors: [16, 16]
n_loop: 10
n_layer: 3
filter_size: 2
residual_channels: 128
loss_type: "softmax"
output_dim: 2048
log_scale_min: -9
train:
learning_rate: 0.001
anneal_rate: 0.5
anneal_interval: 200000
gradient_max_norm: 100.0
checkpoint_interval: 10000
snap_interval: 10000
eval_interval: 10000
max_iterations: 2000000

View File

@@ -1,164 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import csv
import numpy as np
import librosa
from pathlib import Path
import pandas as pd
from parakeet.data import batch_spec, batch_wav
from parakeet.data import DatasetMixin
class LJSpeechMetaData(DatasetMixin):
def __init__(self, root):
self.root = Path(root)
self._wav_dir = self.root.joinpath("wavs")
csv_path = self.root.joinpath("metadata.csv")
self._table = pd.read_csv(
csv_path,
sep="|",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
def get_example(self, i):
fname, raw_text, normalized_text = self._table.iloc[i]
fname = str(self._wav_dir.joinpath(fname + ".wav"))
return fname, raw_text, normalized_text
def __len__(self):
return len(self._table)
class Transform(object):
def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels):
self.sample_rate = sample_rate
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.n_mels = n_mels
def __call__(self, example):
wav_path, _, _ = example
sr = self.sample_rate
n_fft = self.n_fft
win_length = self.win_length
hop_length = self.hop_length
n_mels = self.n_mels
wav, loaded_sr = librosa.load(wav_path, sr=None)
assert loaded_sr == sr, "sample rate does not match, resampling applied"
# Pad audio to the right size.
frames = int(np.ceil(float(wav.size) / hop_length))
fft_padding = (n_fft - hop_length) // 2 # sound
desired_length = frames * hop_length + fft_padding * 2
pad_amount = (desired_length - wav.size) // 2
if wav.size % 2 == 0:
wav = np.pad(wav, (pad_amount, pad_amount), mode='reflect')
else:
wav = np.pad(wav, (pad_amount, pad_amount + 1), mode='reflect')
# Normalize audio.
wav = wav / np.abs(wav).max() * 0.999
# Compute mel-spectrogram.
# Turn center to False to prevent internal padding.
spectrogram = librosa.core.stft(
wav,
hop_length=hop_length,
win_length=win_length,
n_fft=n_fft,
center=False)
spectrogram_magnitude = np.abs(spectrogram)
# Compute mel-spectrograms.
mel_filter_bank = librosa.filters.mel(sr=sr,
n_fft=n_fft,
n_mels=n_mels)
mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)
mel_spectrogram = mel_spectrogram
# Rescale mel_spectrogram.
min_level, ref_level = 1e-5, 20 # hard code it
mel_spectrogram = 20 * np.log10(np.maximum(min_level, mel_spectrogram))
mel_spectrogram = mel_spectrogram - ref_level
mel_spectrogram = np.clip((mel_spectrogram + 100) / 100, 0, 1)
# Extract the center of audio that corresponds to mel spectrograms.
audio = wav[fft_padding:-fft_padding]
assert mel_spectrogram.shape[1] * hop_length == audio.size
# there is no clipping here
return audio, mel_spectrogram
class DataCollector(object):
def __init__(self,
context_size,
sample_rate,
hop_length,
train_clip_seconds,
valid=False):
frames_per_second = sample_rate // hop_length
train_clip_frames = int(
np.ceil(train_clip_seconds * frames_per_second))
context_frames = context_size // hop_length
self.num_frames = train_clip_frames + context_frames
self.sample_rate = sample_rate
self.hop_length = hop_length
self.valid = valid
def random_crop(self, sample):
audio, mel_spectrogram = sample
audio_frames = int(audio.size) // self.hop_length
max_start_frame = audio_frames - self.num_frames
assert max_start_frame >= 0, "audio is too short to be cropped"
frame_start = np.random.randint(0, max_start_frame)
# frame_start = 0 # norandom
frame_end = frame_start + self.num_frames
audio_start = frame_start * self.hop_length
audio_end = frame_end * self.hop_length
audio = audio[audio_start:audio_end]
return audio, mel_spectrogram, audio_start
def __call__(self, samples):
# transform them first
if self.valid:
samples = [(audio, mel_spectrogram, 0)
for audio, mel_spectrogram in samples]
else:
samples = [self.random_crop(sample) for sample in samples]
# batch them
audios = [sample[0] for sample in samples]
audio_starts = [sample[2] for sample in samples]
mels = [sample[1] for sample in samples]
mels = batch_spec(mels)
if self.valid:
audios = batch_wav(audios, dtype=np.float32)
else:
audios = np.array(audios, dtype=np.float32)
audio_starts = np.array(audio_starts, dtype=np.int64)
return audios, mels, audio_starts

View File

@@ -1,152 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import os
import ruamel.yaml
import argparse
from tqdm import tqdm
from paddle import fluid
fluid.require_version('1.8.0')
import paddle.fluid.dygraph as dg
from parakeet.modules.weight_norm import WeightNormWrapper
from parakeet.data import SliceDataset, TransformDataset, DataCargo, SequentialSampler, RandomSampler
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWavenet
from parakeet.utils.layer_tools import summary
from parakeet.utils import io
from data import LJSpeechMetaData, Transform, DataCollector
from utils import make_output_tree, valid_model, eval_model
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Synthesize valid data from LJspeech with a wavenet model.")
parser.add_argument(
"--data", type=str, help="path of the LJspeech dataset")
parser.add_argument("--config", type=str, help="path of the config file")
parser.add_argument("--device", type=int, default=-1, help="device to use")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument(
"output",
type=str,
default="experiment",
help="path to save the synthesized audio")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
dg.enable_dygraph(place)
ljspeech_meta = LJSpeechMetaData(args.data)
data_config = config["data"]
sample_rate = data_config["sample_rate"]
n_fft = data_config["n_fft"]
win_length = data_config["win_length"]
hop_length = data_config["hop_length"]
n_mels = data_config["n_mels"]
train_clip_seconds = data_config["train_clip_seconds"]
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
ljspeech = TransformDataset(ljspeech_meta, transform)
valid_size = data_config["valid_size"]
ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))
model_config = config["model"]
n_loop = model_config["n_loop"]
n_layer = model_config["n_layer"]
filter_size = model_config["filter_size"]
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
print("context size is {} samples".format(context_size))
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
train_clip_seconds)
valid_batch_fn = DataCollector(
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
batch_size = data_config["batch_size"]
train_cargo = DataCargo(
ljspeech_train,
train_batch_fn,
batch_size,
sampler=RandomSampler(ljspeech_train))
# only batch=1 for validation is enabled
valid_cargo = DataCargo(
ljspeech_valid,
valid_batch_fn,
batch_size=1,
sampler=SequentialSampler(ljspeech_valid))
if not os.path.exists(args.output):
os.makedirs(args.output)
model_config = config["model"]
upsampling_factors = model_config["upsampling_factors"]
encoder = UpsampleNet(upsampling_factors)
n_loop = model_config["n_loop"]
n_layer = model_config["n_layer"]
residual_channels = model_config["residual_channels"]
output_dim = model_config["output_dim"]
loss_type = model_config["loss_type"]
log_scale_min = model_config["log_scale_min"]
decoder = WaveNet(n_loop, n_layer, residual_channels, output_dim, n_mels,
filter_size, loss_type, log_scale_min)
model = ConditionalWavenet(encoder, decoder)
summary(model)
# load model parameters
checkpoint_dir = os.path.join(args.output, "checkpoints")
if args.checkpoint:
iteration = io.load_parameters(model, checkpoint_path=args.checkpoint)
else:
iteration = io.load_parameters(
model, checkpoint_dir=checkpoint_dir, iteration=args.iteration)
assert iteration > 0, "A trained model is needed."
# WARNING: don't forget to remove weight norm to re-compute each wrapped layer's weight
# removing weight norm also speeds up computation
for layer in model.sublayers():
if isinstance(layer, WeightNormWrapper):
layer.remove_weight_norm()
train_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
train_loader.set_batch_generator(train_cargo, place)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
valid_loader.set_batch_generator(valid_cargo, place)
synthesis_dir = os.path.join(args.output, "synthesis")
if not os.path.exists(synthesis_dir):
os.makedirs(synthesis_dir)
eval_model(model, valid_loader, synthesis_dir, iteration, sample_rate)

View File

@@ -1,201 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import os
import ruamel.yaml
import argparse
import tqdm
from visualdl import LogWriter
from paddle import fluid
fluid.require_version('1.8.0')
import paddle.fluid.dygraph as dg
from parakeet.data import SliceDataset, TransformDataset, CacheDataset, DataCargo, SequentialSampler, RandomSampler
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWavenet
from parakeet.utils.layer_tools import summary
from parakeet.utils import io
from data import LJSpeechMetaData, Transform, DataCollector
from utils import make_output_tree, valid_model
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Train a WaveNet model with LJSpeech.")
parser.add_argument(
"--data", type=str, help="path of the LJspeech dataset")
parser.add_argument("--config", type=str, help="path of the config file")
parser.add_argument("--device", type=int, default=-1, help="device to use")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument(
"output", type=str, default="experiment", help="path to save results")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
dg.enable_dygraph(place)
print("Command Line Args: ")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
ljspeech_meta = LJSpeechMetaData(args.data)
data_config = config["data"]
sample_rate = data_config["sample_rate"]
n_fft = data_config["n_fft"]
win_length = data_config["win_length"]
hop_length = data_config["hop_length"]
n_mels = data_config["n_mels"]
train_clip_seconds = data_config["train_clip_seconds"]
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
ljspeech = TransformDataset(ljspeech_meta, transform)
valid_size = data_config["valid_size"]
ljspeech_valid = CacheDataset(SliceDataset(ljspeech, 0, valid_size))
ljspeech_train = CacheDataset(
SliceDataset(ljspeech, valid_size, len(ljspeech)))
model_config = config["model"]
n_loop = model_config["n_loop"]
n_layer = model_config["n_layer"]
filter_size = model_config["filter_size"]
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
print("context size is {} samples".format(context_size))
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
train_clip_seconds)
valid_batch_fn = DataCollector(
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
batch_size = data_config["batch_size"]
train_cargo = DataCargo(
ljspeech_train,
train_batch_fn,
batch_size,
sampler=RandomSampler(ljspeech_train))
# only batch=1 for validation is enabled
valid_cargo = DataCargo(
ljspeech_valid,
valid_batch_fn,
batch_size=1,
sampler=SequentialSampler(ljspeech_valid))
make_output_tree(args.output)
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
model_config = config["model"]
upsampling_factors = model_config["upsampling_factors"]
encoder = UpsampleNet(upsampling_factors)
n_loop = model_config["n_loop"]
n_layer = model_config["n_layer"]
residual_channels = model_config["residual_channels"]
output_dim = model_config["output_dim"]
loss_type = model_config["loss_type"]
log_scale_min = model_config["log_scale_min"]
decoder = WaveNet(n_loop, n_layer, residual_channels, output_dim, n_mels,
filter_size, loss_type, log_scale_min)
model = ConditionalWavenet(encoder, decoder)
summary(model)
train_config = config["train"]
learning_rate = train_config["learning_rate"]
anneal_rate = train_config["anneal_rate"]
anneal_interval = train_config["anneal_interval"]
lr_scheduler = dg.ExponentialDecay(
learning_rate, anneal_interval, anneal_rate, staircase=True)
gradient_max_norm = train_config["gradient_max_norm"]
optim = fluid.optimizer.Adam(
lr_scheduler,
parameter_list=model.parameters(),
grad_clip=fluid.clip.ClipByGlobalNorm(gradient_max_norm))
train_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
train_loader.set_batch_generator(train_cargo, place)
valid_loader = fluid.io.DataLoader.from_generator(
capacity=10, return_list=True)
valid_loader.set_batch_generator(valid_cargo, place)
max_iterations = train_config["max_iterations"]
checkpoint_interval = train_config["checkpoint_interval"]
snap_interval = train_config["snap_interval"]
eval_interval = train_config["eval_interval"]
checkpoint_dir = os.path.join(args.output, "checkpoints")
log_dir = os.path.join(args.output, "log")
writer = LogWriter(log_dir)
# load parameters and optimizer, and update iterations done so far
if args.checkpoint is not None:
iteration = io.load_parameters(
model, optim, checkpoint_path=args.checkpoint)
else:
iteration = io.load_parameters(
model,
optim,
checkpoint_dir=checkpoint_dir,
iteration=args.iteration)
global_step = iteration + 1
iterator = iter(tqdm.tqdm(train_loader))
while global_step <= max_iterations:
try:
batch = next(iterator)
except StopIteration as e:
iterator = iter(tqdm.tqdm(train_loader))
batch = next(iterator)
audio_clips, mel_specs, audio_starts = batch
model.train()
y_var = model(audio_clips, mel_specs, audio_starts)
loss_var = model.loss(y_var, audio_clips)
loss_var.backward()
loss_np = loss_var.numpy()
writer.add_scalar("loss", loss_np[0], global_step)
writer.add_scalar("learning_rate",
optim._learning_rate.step().numpy()[0], global_step)
optim.minimize(loss_var)
optim.clear_gradients()
print("global_step: {}\tloss: {:<8.6f}".format(global_step, loss_np[
0]))
if global_step % snap_interval == 0:
valid_model(model, valid_loader, writer, global_step, sample_rate)
if global_step % checkpoint_interval == 0:
io.save_parameters(checkpoint_dir, global_step, model, optim)
global_step += 1

View File

@@ -1,62 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import os
import numpy as np
import soundfile as sf
import paddle.fluid.dygraph as dg
def make_output_tree(output_dir):
checkpoint_dir = os.path.join(output_dir, "checkpoints")
if not os.path.exists(checkpoint_dir):
os.makedirs(checkpoint_dir)
state_dir = os.path.join(output_dir, "states")
if not os.path.exists(state_dir):
os.makedirs(state_dir)
def valid_model(model, valid_loader, writer, global_step, sample_rate):
loss = []
wavs = []
model.eval()
for i, batch in enumerate(valid_loader):
# print("sentence {}".format(i))
audio_clips, mel_specs, audio_starts = batch
y_var = model(audio_clips, mel_specs, audio_starts)
wav_var = model.sample(y_var)
loss_var = model.loss(y_var, audio_clips)
loss.append(loss_var.numpy()[0])
wavs.append(wav_var.numpy()[0])
average_loss = np.mean(loss)
writer.add_scalar("valid_loss", average_loss, global_step)
for i, wav in enumerate(wavs):
writer.add_audio("valid/sample_{}".format(i), wav, global_step,
sample_rate)
def eval_model(model, valid_loader, output_dir, global_step, sample_rate):
model.eval()
for i, batch in enumerate(valid_loader):
# print("sentence {}".format(i))
path = os.path.join(output_dir,
"sentence_{}_step_{}.wav".format(i, global_step))
audio_clips, mel_specs, audio_starts = batch
wav_var = model.synthesis(mel_specs)
wav_np = wav_var.numpy()[0]
sf.write(path, wav_np, samplerate=sample_rate)
print("generated {}".format(path))