remove old examples
This commit is contained in:
parent
0e35119453
commit
c7e5aaa540
|
@ -1,148 +0,0 @@
|
|||
# Clarinet
|
||||
|
||||
PaddlePaddle dynamic graph implementation of ClariNet, a convolutional network based vocoder. The implementation is based on the paper [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).
|
||||
|
||||
|
||||
## Dataset
|
||||
|
||||
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
|
||||
|
||||
```bash
|
||||
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
|
||||
tar xjvf LJSpeech-1.1.tar.bz2
|
||||
```
|
||||
|
||||
## Project Structure
|
||||
|
||||
```text
|
||||
├── data.py data processing
|
||||
├── configs/ (example) configuration file
|
||||
├── synthesis.py script to synthesize waveform from mel_spectrogram
|
||||
├── train.py script to train a model
|
||||
└── utils.py utility functions
|
||||
```
|
||||
|
||||
## Saving & Loading
|
||||
`train.py` and `synthesis.py` have 3 arguments in common: `--checkpoint`, `--iteration` and `output`.
|
||||
|
||||
1. `output` is the directory for saving results.
|
||||
During training, checkpoints are saved in `checkpoints/` in `output` and the tensorboard log is saved in `log/` in `output`. Other possible outputs are saved in `states/` in `output`.
|
||||
During synthesis, audio files and other possible outputs are saved in `synthesis/` in `output`.
|
||||
So after training and synthesizing with the same output directory, the file structure of the output directory looks like this.
|
||||
|
||||
```text
|
||||
├── checkpoints/ # checkpoint directory (including *.pdparams, *.pdopt and a text file `checkpoint` that records the latest checkpoint)
|
||||
├── states/ # audio files generated at validation and other possible outputs
|
||||
├── log/ # tensorboard log
|
||||
└── synthesis/ # synthesized audio files and other possible outputs
|
||||
```
|
||||
|
||||
2. `--checkpoint` and `--iteration` are used to load from an existing checkpoint. Loading an existing checkpoint follows these rules:
|
||||
If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
|
||||
If `--checkpoint` is not provided, we try to load the checkpoint specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided either, we try to load the latest checkpoint from the checkpoint directory (see the sketch below).
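The rule can be summarized with the following sketch (illustrative only; the actual loading is done by `parakeet.utils.io.load_parameters`, and `find_latest_checkpoint` below is a hypothetical helper that reads the `checkpoint` record file):

```python
import os

def resolve_checkpoint(checkpoint, iteration, output):
    """Illustrative sketch of the checkpoint resolution rule."""
    if checkpoint is not None:
        # --checkpoint takes precedence
        return checkpoint
    checkpoint_dir = os.path.join(output, "checkpoints")
    if iteration is not None:
        # then --iteration, e.g. checkpoints/step-500000
        return os.path.join(checkpoint_dir, "step-{}".format(iteration))
    # otherwise fall back to the latest checkpoint recorded in the
    # `checkpoint` text file inside the checkpoint directory
    return find_latest_checkpoint(checkpoint_dir)  # hypothetical helper
```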
|
||||
|
||||
## Train
|
||||
|
||||
Train the model using `train.py`. Follow the usage displayed by `python train.py --help`.
|
||||
|
||||
```text
|
||||
usage: train.py [-h] [--config CONFIG] [--device DEVICE] [--data DATA]
|
||||
[--checkpoint CHECKPOINT | --iteration ITERATION]
|
||||
[--wavenet WAVENET]
|
||||
output
|
||||
|
||||
Train a ClariNet model with LJspeech and a trained WaveNet model.
|
||||
|
||||
positional arguments:
|
||||
output path to save experiment results
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
--config CONFIG path of the config file
|
||||
--device DEVICE device to use
|
||||
--data DATA path of LJspeech dataset
|
||||
--checkpoint CHECKPOINT checkpoint to resume from
|
||||
--iteration ITERATION the iteration of the checkpoint to load from output directory
|
||||
--wavenet WAVENET wavenet checkpoint to use
|
||||
|
||||
- `--config` is the configuration file to use. The provided configurations can be used directly, or you can change some values in the configuration file and train the model with a different config.
|
||||
- `--device` is the device (gpu id) to use for training. `-1` means CPU.
|
||||
- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains `metadata.csv`).
|
||||
|
||||
- `--checkpoint` is the path of the checkpoint.
|
||||
- `--iteration` is the iteration of the checkpoint to load from output directory.
|
||||
- `output` is the directory to save results; all results are saved in this directory.
|
||||
|
||||
See [Saving & Loading](#saving--loading) for details of checkpoint loading.
|
||||
|
||||
- `--wavenet` is the path of the wavenet checkpoint to load.
|
||||
When you start training a ClariNet model without loading from a ClariNet checkpoint, you should already have trained a WaveNet model with a single Gaussian output distribution (in the provided config this corresponds to `loss_type: "mog"` and `output_dim: 3` in the `teacher` section). Make sure the config of the teacher model matches that of the trained WaveNet model.
|
||||
|
||||
Example script:
|
||||
|
||||
```bash
|
||||
python train.py \
|
||||
--config=./configs/clarinet_ljspeech.yaml \
|
||||
--data=./LJSpeech-1.1/ \
|
||||
--device=0 \
|
||||
--wavenet="wavenet-step-2000000" \
|
||||
experiment
|
||||
```
|
||||
|
||||
You can monitor the training log via tensorboard using the script below.
|
||||
|
||||
```bash
|
||||
cd experiment/log
|
||||
tensorboard --logdir=.
|
||||
```
|
||||
|
||||
## Synthesis
|
||||
```text
|
||||
usage: synthesis.py [-h] [--config CONFIG] [--device DEVICE] [--data DATA]
|
||||
[--checkpoint CHECKPOINT | --iteration ITERATION]
|
||||
output
|
||||
|
||||
Synthesize audio files from mel spectrogram in the validation set.
|
||||
|
||||
positional arguments:
|
||||
output path to save the synthesized audio
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
--config CONFIG path of the config file
|
||||
--device DEVICE device to use.
|
||||
--data DATA path of LJspeech dataset
|
||||
--checkpoint CHECKPOINT checkpoint to resume from
|
||||
--iteration ITERATION the iteration of the checkpoint to load from output directory
|
||||
```
|
||||
|
||||
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
|
||||
- `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
|
||||
- `--data` is the path of the LJSpeech dataset. In principle, a dataset is not needed for synthesis, but since the input is a mel spectrogram, we need to extract mel spectrograms from the audio files in the validation set.
|
||||
- `--checkpoint` is the checkpoint to load.
|
||||
- `--iteration` is the iteration of the checkpoint to load from output directory.
|
||||
- `output` is the directory to save synthesized audio. Audio files are saved in `synthesis/` in the `output` directory.
|
||||
See [Saving & Loading](#saving--loading) for details of checkpoint loading.
|
||||
|
||||
|
||||
Example script:
|
||||
|
||||
```bash
|
||||
python synthesis.py \
|
||||
--config=./configs/clarinet_ljspeech.yaml \
|
||||
--data=./LJSpeech-1.1/ \
|
||||
--device=0 \
|
||||
--iteration=500000 \
|
||||
experiment
|
||||
```
|
||||
|
||||
or
|
||||
|
||||
```bash
|
||||
python synthesis.py \
|
||||
--config=./configs/clarinet_ljspeech.yaml \
|
||||
--data=./LJSpeech-1.1/ \
|
||||
--device=0 \
|
||||
--checkpoint="experiment/checkpoints/step-500000" \
|
||||
experiment
|
||||
```
|
|
@ -1,52 +0,0 @@
|
|||
data:
|
||||
batch_size: 8
|
||||
train_clip_seconds: 0.5
|
||||
sample_rate: 22050
|
||||
hop_length: 256
|
||||
win_length: 1024
|
||||
n_fft: 2048
|
||||
|
||||
n_mels: 80
|
||||
valid_size: 16
|
||||
|
||||
|
||||
conditioner:
|
||||
upsampling_factors: [16, 16]
|
||||
|
||||
teacher:
|
||||
n_loop: 10
|
||||
n_layer: 3
|
||||
filter_size: 2
|
||||
residual_channels: 128
|
||||
loss_type: "mog"
|
||||
output_dim: 3
|
||||
log_scale_min: -9
|
||||
|
||||
student:
|
||||
n_loops: [10, 10, 10, 10, 10, 10]
|
||||
n_layers: [1, 1, 1, 1, 1, 1]
|
||||
filter_size: 3
|
||||
residual_channels: 64
|
||||
log_scale_min: -7
|
||||
|
||||
stft:
|
||||
n_fft: 2048
|
||||
win_length: 1024
|
||||
hop_length: 256
|
||||
|
||||
loss:
|
||||
lmd: 4
|
||||
|
||||
train:
|
||||
learning_rate: 0.0005
|
||||
anneal_rate: 0.5
|
||||
anneal_interval: 200000
|
||||
gradient_max_norm: 100.0
|
||||
|
||||
checkpoint_interval: 1000
|
||||
eval_interval: 1000
|
||||
|
||||
max_iterations: 2000000
|
||||
|
||||
|
||||
|
|
@ -1,52 +0,0 @@
|
|||
data:
|
||||
batch_size: 8
|
||||
train_clip_seconds: 0.5
|
||||
sample_rate: 22050
|
||||
hop_length: 256
|
||||
win_length: 1024
|
||||
n_fft: 2048
|
||||
|
||||
n_mels: 80
|
||||
valid_size: 16
|
||||
|
||||
|
||||
conditioner:
|
||||
upsampling_factors: [16, 16]
|
||||
|
||||
teacher:
|
||||
n_loop: 10
|
||||
n_layer: 3
|
||||
filter_size: 2
|
||||
residual_channels: 128
|
||||
loss_type: "mog"
|
||||
output_dim: 3
|
||||
log_scale_min: -9
|
||||
|
||||
student:
|
||||
n_loops: [10, 10, 10, 10, 10, 10]
|
||||
n_layers: [1, 1, 1, 1, 1, 1]
|
||||
filter_size: 3
|
||||
residual_channels: 64
|
||||
log_scale_min: -7
|
||||
|
||||
stft:
|
||||
n_fft: 2048
|
||||
win_length: 1024
|
||||
hop_length: 256
|
||||
|
||||
loss:
|
||||
lmd: 4
|
||||
|
||||
train:
|
||||
learning_rate: 0.0005
|
||||
anneal_rate: 0.5
|
||||
anneal_interval: 200000
|
||||
gradient_max_norm: 100.0
|
||||
|
||||
checkpoint_interval: 1000
|
||||
eval_interval: 1000
|
||||
|
||||
max_iterations: 2000000
|
||||
|
||||
|
||||
|
|
@ -1,179 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from __future__ import division
|
||||
import os
|
||||
import sys
|
||||
import argparse
|
||||
import ruamel.yaml
|
||||
import random
|
||||
from tqdm import tqdm
|
||||
import pickle
|
||||
import numpy as np
|
||||
|
||||
import paddle.fluid.dygraph as dg
|
||||
from paddle import fluid
|
||||
fluid.require_version('1.8.0')
|
||||
|
||||
from parakeet.modules.weight_norm import WeightNormWrapper
|
||||
from parakeet.models.wavenet import WaveNet, UpsampleNet
|
||||
from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet
|
||||
from parakeet.data import TransformDataset, SliceDataset, RandomSampler, SequentialSampler, DataCargo
|
||||
from parakeet.utils.layer_tools import summary, freeze
|
||||
from parakeet.utils import io
|
||||
|
||||
from utils import eval_model
|
||||
sys.path.append("../wavenet")
|
||||
from data import LJSpeechMetaData, Transform, DataCollector
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Synthesize audio files from mel spectrogram in the validation set."
|
||||
)
|
||||
parser.add_argument("--config", type=str, help="path of the config file")
|
||||
parser.add_argument(
|
||||
"--device", type=int, default=-1, help="device to use.")
|
||||
parser.add_argument("--data", type=str, help="path of LJspeech dataset")
|
||||
|
||||
g = parser.add_mutually_exclusive_group()
|
||||
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
|
||||
g.add_argument(
|
||||
"--iteration",
|
||||
type=int,
|
||||
help="the iteration of the checkpoint to load from output directory")
|
||||
|
||||
parser.add_argument(
|
||||
"output",
|
||||
type=str,
|
||||
default="experiment",
|
||||
help="path to save the synthesized audio")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
with open(args.config, 'rt') as f:
|
||||
config = ruamel.yaml.safe_load(f)
|
||||
|
||||
if args.device == -1:
|
||||
place = fluid.CPUPlace()
|
||||
else:
|
||||
place = fluid.CUDAPlace(args.device)
|
||||
|
||||
dg.enable_dygraph(place)
|
||||
|
||||
ljspeech_meta = LJSpeechMetaData(args.data)
|
||||
|
||||
data_config = config["data"]
|
||||
sample_rate = data_config["sample_rate"]
|
||||
n_fft = data_config["n_fft"]
|
||||
win_length = data_config["win_length"]
|
||||
hop_length = data_config["hop_length"]
|
||||
n_mels = data_config["n_mels"]
|
||||
train_clip_seconds = data_config["train_clip_seconds"]
|
||||
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
|
||||
ljspeech = TransformDataset(ljspeech_meta, transform)
|
||||
|
||||
valid_size = data_config["valid_size"]
|
||||
ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
|
||||
ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))
|
||||
|
||||
teacher_config = config["teacher"]
|
||||
n_loop = teacher_config["n_loop"]
|
||||
n_layer = teacher_config["n_layer"]
|
||||
filter_size = teacher_config["filter_size"]
|
||||
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
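# with the example config (n_loop=10, n_layer=3, filter_size=2) this gives
# context_size = 1 + 3 * (2**10 - 1) = 3070 samples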
|
||||
print("context size is {} samples".format(context_size))
|
||||
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
|
||||
train_clip_seconds)
|
||||
valid_batch_fn = DataCollector(
|
||||
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
|
||||
|
||||
batch_size = data_config["batch_size"]
|
||||
train_cargo = DataCargo(
|
||||
ljspeech_train,
|
||||
train_batch_fn,
|
||||
batch_size,
|
||||
sampler=RandomSampler(ljspeech_train))
|
||||
|
||||
# only batch_size=1 is supported for validation
|
||||
valid_cargo = DataCargo(
|
||||
ljspeech_valid,
|
||||
valid_batch_fn,
|
||||
batch_size=1,
|
||||
sampler=SequentialSampler(ljspeech_valid))
|
||||
|
||||
# conditioner(upsampling net)
|
||||
conditioner_config = config["conditioner"]
|
||||
upsampling_factors = conditioner_config["upsampling_factors"]
|
||||
upsample_net = UpsampleNet(upscale_factors=upsampling_factors)
|
||||
freeze(upsample_net)
|
||||
|
||||
residual_channels = teacher_config["residual_channels"]
|
||||
loss_type = teacher_config["loss_type"]
|
||||
output_dim = teacher_config["output_dim"]
|
||||
log_scale_min = teacher_config["log_scale_min"]
|
||||
assert loss_type == "mog" and output_dim == 3, \
|
||||
"the teacher wavenet should be a wavenet with single gaussian output"
|
||||
|
||||
teacher = WaveNet(n_loop, n_layer, residual_channels, output_dim, n_mels,
|
||||
filter_size, loss_type, log_scale_min)
|
||||
# load & freeze upsample_net & teacher
|
||||
freeze(teacher)
|
||||
|
||||
student_config = config["student"]
|
||||
n_loops = student_config["n_loops"]
|
||||
n_layers = student_config["n_layers"]
|
||||
student_residual_channels = student_config["residual_channels"]
|
||||
student_filter_size = student_config["filter_size"]
|
||||
student_log_scale_min = student_config["log_scale_min"]
|
||||
student = ParallelWaveNet(n_loops, n_layers, student_residual_channels,
|
||||
n_mels, student_filter_size)
|
||||
|
||||
stft_config = config["stft"]
|
||||
stft = STFT(
|
||||
n_fft=stft_config["n_fft"],
|
||||
hop_length=stft_config["hop_length"],
|
||||
win_length=stft_config["win_length"])
|
||||
|
||||
lmd = config["loss"]["lmd"]
|
||||
model = Clarinet(upsample_net, teacher, student, stft,
|
||||
student_log_scale_min, lmd)
|
||||
summary(model)
|
||||
|
||||
# load parameters
|
||||
if args.checkpoint is not None:
|
||||
# load from args.checkpoint
|
||||
iteration = io.load_parameters(model, checkpoint_path=args.checkpoint)
|
||||
else:
|
||||
# load from "args.output/checkpoints"
|
||||
checkpoint_dir = os.path.join(args.output, "checkpoints")
|
||||
iteration = io.load_parameters(
|
||||
model, checkpoint_dir=checkpoint_dir, iteration=args.iteration)
|
||||
assert iteration > 0, "A trained checkpoint is needed."
|
||||
|
||||
# make generation fast
|
||||
for sublayer in model.sublayers():
|
||||
if isinstance(sublayer, WeightNormWrapper):
|
||||
sublayer.remove_weight_norm()
|
||||
|
||||
# data loader
|
||||
valid_loader = fluid.io.DataLoader.from_generator(
|
||||
capacity=10, return_list=True)
|
||||
valid_loader.set_batch_generator(valid_cargo, place)
|
||||
|
||||
# the directory to save audio files
|
||||
synthesis_dir = os.path.join(args.output, "synthesis")
|
||||
if not os.path.exists(synthesis_dir):
|
||||
os.makedirs(synthesis_dir)
|
||||
|
||||
eval_model(model, valid_loader, synthesis_dir, iteration, sample_rate)
|
|
@ -1,243 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from __future__ import division
|
||||
import os
|
||||
import sys
|
||||
import argparse
|
||||
import ruamel.yaml
|
||||
import random
|
||||
from tqdm import tqdm
|
||||
import pickle
|
||||
import numpy as np
|
||||
from visualdl import LogWriter
|
||||
|
||||
import paddle.fluid.dygraph as dg
|
||||
from paddle import fluid
|
||||
fluid.require_version('1.8.0')
|
||||
|
||||
from parakeet.models.wavenet import WaveNet, UpsampleNet
|
||||
from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet
|
||||
from parakeet.data import TransformDataset, SliceDataset, CacheDataset, RandomSampler, SequentialSampler, DataCargo
|
||||
from parakeet.utils.layer_tools import summary, freeze
|
||||
from parakeet.utils import io
|
||||
|
||||
from utils import make_output_tree, eval_model, load_wavenet
|
||||
|
||||
# import dataset from wavenet
|
||||
sys.path.append("../wavenet")
|
||||
from data import LJSpeechMetaData, Transform, DataCollector
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Train a ClariNet model with LJspeech and a trained WaveNet model."
|
||||
)
|
||||
parser.add_argument("--config", type=str, help="path of the config file")
|
||||
parser.add_argument("--device", type=int, default=-1, help="device to use")
|
||||
parser.add_argument("--data", type=str, help="path of LJspeech dataset")
|
||||
|
||||
g = parser.add_mutually_exclusive_group()
|
||||
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
|
||||
g.add_argument(
|
||||
"--iteration",
|
||||
type=int,
|
||||
help="the iteration of the checkpoint to load from output directory")
|
||||
|
||||
parser.add_argument(
|
||||
"--wavenet", type=str, help="wavenet checkpoint to use")
|
||||
|
||||
parser.add_argument(
|
||||
"output",
|
||||
type=str,
|
||||
default="experiment",
|
||||
help="path to save experiment results")
|
||||
|
||||
args = parser.parse_args()
|
||||
with open(args.config, 'rt') as f:
|
||||
config = ruamel.yaml.safe_load(f)
|
||||
|
||||
if args.device == -1:
|
||||
place = fluid.CPUPlace()
|
||||
else:
|
||||
place = fluid.CUDAPlace(args.device)
|
||||
|
||||
dg.enable_dygraph(place)
|
||||
|
||||
print("Command Line args: ")
|
||||
for k, v in vars(args).items():
|
||||
print("{}: {}".format(k, v))
|
||||
|
||||
ljspeech_meta = LJSpeechMetaData(args.data)
|
||||
|
||||
data_config = config["data"]
|
||||
sample_rate = data_config["sample_rate"]
|
||||
n_fft = data_config["n_fft"]
|
||||
win_length = data_config["win_length"]
|
||||
hop_length = data_config["hop_length"]
|
||||
n_mels = data_config["n_mels"]
|
||||
train_clip_seconds = data_config["train_clip_seconds"]
|
||||
transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
|
||||
ljspeech = TransformDataset(ljspeech_meta, transform)
|
||||
|
||||
valid_size = data_config["valid_size"]
|
||||
ljspeech_valid = CacheDataset(SliceDataset(ljspeech, 0, valid_size))
|
||||
ljspeech_train = CacheDataset(
|
||||
SliceDataset(ljspeech, valid_size, len(ljspeech)))
|
||||
|
||||
teacher_config = config["teacher"]
|
||||
n_loop = teacher_config["n_loop"]
|
||||
n_layer = teacher_config["n_layer"]
|
||||
filter_size = teacher_config["filter_size"]
|
||||
context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
|
||||
print("context size is {} samples".format(context_size))
|
||||
train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
|
||||
train_clip_seconds)
|
||||
valid_batch_fn = DataCollector(
|
||||
context_size, sample_rate, hop_length, train_clip_seconds, valid=True)
|
||||
|
||||
batch_size = data_config["batch_size"]
|
||||
train_cargo = DataCargo(
|
||||
ljspeech_train,
|
||||
train_batch_fn,
|
||||
batch_size,
|
||||
sampler=RandomSampler(ljspeech_train))
|
||||
|
||||
# only batch_size=1 is supported for validation
|
||||
valid_cargo = DataCargo(
|
||||
ljspeech_valid,
|
||||
valid_batch_fn,
|
||||
batch_size=1,
|
||||
sampler=SequentialSampler(ljspeech_valid))
|
||||
|
||||
make_output_tree(args.output)
|
||||
|
||||
# conditioner(upsampling net)
|
||||
conditioner_config = config["conditioner"]
|
||||
upsampling_factors = conditioner_config["upsampling_factors"]
|
||||
upsample_net = UpsampleNet(upscale_factors=upsampling_factors)
|
||||
freeze(upsample_net)
|
||||
|
||||
residual_channels = teacher_config["residual_channels"]
|
||||
loss_type = teacher_config["loss_type"]
|
||||
output_dim = teacher_config["output_dim"]
|
||||
log_scale_min = teacher_config["log_scale_min"]
|
||||
assert loss_type == "mog" and output_dim == 3, \
|
||||
"the teacher wavenet should be a wavenet with single gaussian output"
|
||||
|
||||
teacher = WaveNet(n_loop, n_layer, residual_channels, output_dim, n_mels,
|
||||
filter_size, loss_type, log_scale_min)
|
||||
freeze(teacher)
|
||||
|
||||
student_config = config["student"]
|
||||
n_loops = student_config["n_loops"]
|
||||
n_layers = student_config["n_layers"]
|
||||
student_residual_channels = student_config["residual_channels"]
|
||||
student_filter_size = student_config["filter_size"]
|
||||
student_log_scale_min = student_config["log_scale_min"]
|
||||
student = ParallelWaveNet(n_loops, n_layers, student_residual_channels,
|
||||
n_mels, student_filter_size)
|
||||
|
||||
stft_config = config["stft"]
|
||||
stft = STFT(
|
||||
n_fft=stft_config["n_fft"],
|
||||
hop_length=stft_config["hop_length"],
|
||||
win_length=stft_config["win_length"])
|
||||
|
||||
lmd = config["loss"]["lmd"]
|
||||
model = Clarinet(upsample_net, teacher, student, stft,
|
||||
student_log_scale_min, lmd)
|
||||
summary(model)
|
||||
|
||||
# optim
|
||||
train_config = config["train"]
|
||||
learning_rate = train_config["learning_rate"]
|
||||
anneal_rate = train_config["anneal_rate"]
|
||||
anneal_interval = train_config["anneal_interval"]
|
||||
lr_scheduler = dg.ExponentialDecay(
|
||||
learning_rate, anneal_interval, anneal_rate, staircase=True)
|
||||
gradient_max_norm = train_config["gradient_max_norm"]
|
||||
optim = fluid.optimizer.Adam(
|
||||
lr_scheduler,
|
||||
parameter_list=model.parameters(),
|
||||
grad_clip=fluid.clip.ClipByGlobalNorm(gradient_max_norm))
|
||||
|
||||
# train
|
||||
max_iterations = train_config["max_iterations"]
|
||||
checkpoint_interval = train_config["checkpoint_interval"]
|
||||
eval_interval = train_config["eval_interval"]
|
||||
checkpoint_dir = os.path.join(args.output, "checkpoints")
|
||||
state_dir = os.path.join(args.output, "states")
|
||||
log_dir = os.path.join(args.output, "log")
|
||||
writer = LogWriter(log_dir)
|
||||
|
||||
if args.checkpoint is not None:
|
||||
iteration = io.load_parameters(
|
||||
model, optim, checkpoint_path=args.checkpoint)
|
||||
else:
|
||||
iteration = io.load_parameters(
|
||||
model,
|
||||
optim,
|
||||
checkpoint_dir=checkpoint_dir,
|
||||
iteration=args.iteration)
|
||||
|
||||
if iteration == 0:
|
||||
assert args.wavenet is not None, "When training afresh, a trained wavenet model should be provided."
|
||||
load_wavenet(model, args.wavenet)
|
||||
|
||||
# loader
|
||||
train_loader = fluid.io.DataLoader.from_generator(
|
||||
capacity=10, return_list=True)
|
||||
train_loader.set_batch_generator(train_cargo, place)
|
||||
|
||||
valid_loader = fluid.io.DataLoader.from_generator(
|
||||
capacity=10, return_list=True)
|
||||
valid_loader.set_batch_generator(valid_cargo, place)
|
||||
|
||||
# training loop
|
||||
global_step = iteration + 1
|
||||
iterator = iter(tqdm(train_loader))
|
||||
while global_step <= max_iterations:
|
||||
try:
|
||||
batch = next(iterator)
|
||||
except StopIteration as e:
|
||||
iterator = iter(tqdm(train_loader))
|
||||
batch = next(iterator)
|
||||
|
||||
audios, mels, audio_starts = batch
|
||||
model.train()
|
||||
loss_dict = model(
|
||||
audios, mels, audio_starts, clip_kl=global_step > 500)
|
||||
|
||||
writer.add_scalar("learning_rate",
|
||||
optim._learning_rate.step().numpy()[0], global_step)
|
||||
for k, v in loss_dict.items():
|
||||
writer.add_scalar("loss/{}".format(k), v.numpy()[0], global_step)
|
||||
|
||||
l = loss_dict["loss"]
|
||||
step_loss = l.numpy()[0]
|
||||
print("[train] global_step: {} loss: {:<8.6f}".format(global_step,
|
||||
step_loss))
|
||||
|
||||
l.backward()
|
||||
optim.minimize(l)
|
||||
optim.clear_gradients()
|
||||
|
||||
if global_step % eval_interval == 0:
|
||||
# evaluate on valid dataset
|
||||
eval_model(model, valid_loader, state_dir, global_step,
|
||||
sample_rate)
|
||||
if global_step % checkpoint_interval == 0:
|
||||
io.save_parameters(checkpoint_dir, global_step, model, optim)
|
||||
|
||||
global_step += 1
|
|
@ -1,60 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from __future__ import division
|
||||
import os
|
||||
import soundfile as sf
|
||||
from collections import OrderedDict
|
||||
|
||||
from paddle import fluid
|
||||
import paddle.fluid.dygraph as dg
|
||||
|
||||
|
||||
def make_output_tree(output_dir):
|
||||
checkpoint_dir = os.path.join(output_dir, "checkpoints")
|
||||
if not os.path.exists(checkpoint_dir):
|
||||
os.makedirs(checkpoint_dir)
|
||||
|
||||
state_dir = os.path.join(output_dir, "states")
|
||||
if not os.path.exists(state_dir):
|
||||
os.makedirs(state_dir)
|
||||
|
||||
|
||||
def eval_model(model, valid_loader, output_dir, iteration, sample_rate):
|
||||
model.eval()
|
||||
for i, batch in enumerate(valid_loader):
|
||||
# print("sentence {}".format(i))
|
||||
path = os.path.join(output_dir,
|
||||
"sentence_{}_step_{}.wav".format(i, iteration))
|
||||
audio_clips, mel_specs, audio_starts = batch
|
||||
wav_var = model.synthesis(mel_specs)
|
||||
wav_np = wav_var.numpy()[0]
|
||||
sf.write(path, wav_np, samplerate=sample_rate)
|
||||
print("generated {}".format(path))
|
||||
|
||||
|
||||
def load_wavenet(model, path):
|
||||
wavenet_dict, _ = dg.load_dygraph(path)
|
||||
encoder_dict = OrderedDict()
|
||||
teacher_dict = OrderedDict()
|
||||
for k, v in wavenet_dict.items():
|
||||
if k.startswith("encoder."):
|
||||
encoder_dict[k.split('.', 1)[1]] = v
|
||||
else:
|
||||
# k starts with "decoder."
|
||||
teacher_dict[k.split('.', 1)[1]] = v
|
||||
|
||||
model.encoder.set_dict(encoder_dict)
|
||||
model.teacher.set_dict(teacher_dict)
|
||||
print("loaded the encoder part and teacher part from wavenet model.")
|
|
@ -1,144 +0,0 @@
|
|||
# Deep Voice 3
|
||||
|
||||
PaddlePaddle dynamic graph implementation of Deep Voice 3, a convolutional network based text-to-speech generative model. The implementation is based on [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654).
|
||||
|
||||
We implement Deep Voice 3 using Paddle Fluid with dynamic graph, which is convenient for building flexible network architectures.
|
||||
|
||||
## Dataset
|
||||
|
||||
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
|
||||
|
||||
```bash
|
||||
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
|
||||
tar xjvf LJSpeech-1.1.tar.bz2
|
||||
```
|
||||
|
||||
## Model Architecture
|
||||
|
||||

|
||||
|
||||
The model consists of an encoder, a decoder and a converter (and a speaker embedding for multispeaker models). The encoder and the decoder together form the seq2seq part of the model, and the converter forms the postnet part.
|
||||
|
||||
## Project Structure
|
||||
|
||||
```text
|
||||
├── config/
|
||||
├── synthesize.py
|
||||
├── data.py
|
||||
├── preprocess.py
|
||||
├── clip.py
|
||||
├── train.py
|
||||
└── vocoder.py
|
||||
```
|
||||
|
||||
## Preprocess
|
||||
|
||||
Preprocess the dataset with `preprocess.py`.
|
||||
|
||||
```text
|
||||
usage: preprocess.py [-h] --config CONFIG --input INPUT --output OUTPUT
|
||||
|
||||
preprocess ljspeech dataset and save it.
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
--config CONFIG config file
|
||||
--input INPUT data path of the original data
|
||||
--output OUTPUT path to save the preprocessed dataset
|
||||
```
|
||||
|
||||
example code:
|
||||
|
||||
```bash
|
||||
python preprocess.py --config=configs/ljspeech.yaml --input=LJSpeech-1.1/ --output=data/ljspeech
|
||||
```
|
||||
|
||||
## Train
|
||||
|
||||
Train the model using `train.py`. Follow the usage displayed by `python train.py --help`.
|
||||
|
||||
```text
|
||||
usage: train.py [-h] --config CONFIG --input INPUT
|
||||
|
||||
train a Deep Voice 3 model with LJSpeech
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
--config CONFIG config file
|
||||
--input INPUT data path of the original data
|
||||
```
|
||||
|
||||
example code:
|
||||
|
||||
```bash
|
||||
CUDA_VISIBLE_DEVICES=0 python train.py --config=configs/ljspeech.yaml --input=data/ljspeech
|
||||
```
|
||||
|
||||
Training creates a `runs` folder; the outputs of each run are saved in a separate folder in `runs`, named by the start time joined with the hostname. Inside this folder, the tensorboard log, parameters and optimizer states are saved. Parameters (`*.pdparams`) and optimizer states (`*.pdopt`) are named by the step at which they are saved.
|
||||
|
||||
```text
|
||||
runs/Jul07_09-39-34_instance-mqcyj27y-4/
|
||||
├── checkpoint
|
||||
├── events.out.tfevents.1594085974.instance-mqcyj27y-4
|
||||
├── step-1000000.pdopt
|
||||
├── step-1000000.pdparams
|
||||
├── step-100000.pdopt
|
||||
├── step-100000.pdparams
|
||||
...
|
||||
```
|
||||
|
||||
Since we use WaveFlow to synthesize audio during training, download the trained WaveFlow model and extract it in the current directory before training.
|
||||
|
||||
```bash
|
||||
wget https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_1.0.zip
|
||||
unzip waveflow_res128_ljspeech_ckpt_1.0.zip
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Visualization
|
||||
|
||||
You can visualize training losses, check the attention and listen to the synthesized audio when training with teacher forcing.
|
||||
|
||||
example code:
|
||||
|
||||
```bash
|
||||
tensorboard --logdir=runs/ --host=$HOSTNAME --port=8000
|
||||
```
|
||||
|
||||
## Synthesis
|
||||
|
||||
```text
|
||||
usage: synthesize from a checkpoint [-h] --config CONFIG --input INPUT
|
||||
--output OUTPUT --checkpoint CHECKPOINT
|
||||
--monotonic_layers MONOTONIC_LAYERS
|
||||
[--vocoder {griffin-lim,waveflow}]
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
--config CONFIG config file
|
||||
--input INPUT text file to synthesize
|
||||
--output OUTPUT path to save audio
|
||||
--checkpoint CHECKPOINT
|
||||
data path of the checkpoint
|
||||
--monotonic_layers MONOTONIC_LAYERS
|
||||
monotonic decoder layers' indices(start from 1)
|
||||
--vocoder {griffin-lim,waveflow}
|
||||
vocoder to use
|
||||
```
|
||||
|
||||
`synthesize.py` is used to synthesize several sentences in a text file.
|
||||
`--monotonic_layers` gives the indices of the decoder layers that manifest monotonic diagonal attention. You can find the monotonic layers by inspecting the tensorboard logs. Mind that the indices start from 1 (see the sketch below). The layers that manifest monotonic diagonal attention are stable for a model across training and synthesis, but differ among different runs. So once you get the indices of the monotonic layers by inspecting the tensorboard log, you can use them at synthesis time. Note that only decoder layers that show strong diagonal attention should be considered.
|
||||
`--vocoder` is the vocoder to use. Currently supported values are "waveflow" and "griffin-lim". The default value is "waveflow".
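To make the 1-based indexing concrete, here is a small sketch (mirroring what `synthesize.py` does internally; the literal values are only an example) of turning `--monotonic_layers "5,6"` into per-layer flags:

```python
# example: --monotonic_layers "5,6" with the 8 decoder layers of the example config
monotonic_layers = [int(s.strip()) - 1 for s in "5,6".split(",")]  # 1-based -> 0-based
decoder_layers = 8
force_monotonic_attention = [False] * decoder_layers
for i in monotonic_layers:
    force_monotonic_attention[i] = True
# -> [False, False, False, False, True, True, False, False]
```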
|
||||
|
||||
example code:
|
||||
|
||||
```bash
|
||||
CUDA_VISIBLE_DEVICES=2 python synthesize.py \
|
||||
--config configs/ljspeech.yaml \
|
||||
--input sentences.txt \
|
||||
--output outputs/ \
|
||||
--checkpoint runs/Jul07_09-39-34_instance-mqcyj27y-4/step-1320000 \
|
||||
--monotonic_layers "5,6" \
|
||||
--vocoder waveflow
|
||||
```
|
|
@ -1,84 +0,0 @@
|
|||
from __future__ import print_function
|
||||
|
||||
import copy
|
||||
import six
|
||||
import warnings
|
||||
|
||||
import functools
|
||||
from paddle.fluid import layers
|
||||
from paddle.fluid import framework
|
||||
from paddle.fluid import core
|
||||
from paddle.fluid import name_scope
|
||||
from paddle.fluid.dygraph import base as imperative_base
|
||||
from paddle.fluid.clip import GradientClipBase, _correct_clip_op_role_var
|
||||
|
||||
class DoubleClip(GradientClipBase):
|
||||
def __init__(self, clip_value, clip_norm, group_name="default_group", need_clip=None):
|
||||
super(DoubleClip, self).__init__(need_clip)
|
||||
self.clip_value = float(clip_value)
|
||||
self.clip_norm = float(clip_norm)
|
||||
self.group_name = group_name
|
||||
|
||||
def __str__(self):
|
||||
return "Gradient Clip By Value and GlobalNorm, value={}, global_norm={}".format(
|
||||
self.clip_value, self.clip_norm)
|
||||
|
||||
@imperative_base.no_grad
|
||||
def _dygraph_clip(self, params_grads):
|
||||
params_grads = self._dygraph_clip_by_value(params_grads)
|
||||
params_grads = self._dygraph_clip_by_global_norm(params_grads)
|
||||
return params_grads
|
||||
|
||||
@imperative_base.no_grad
|
||||
def _dygraph_clip_by_value(self, params_grads):
|
||||
params_and_grads = []
|
||||
for p, g in params_grads:
|
||||
if g is None:
|
||||
continue
|
||||
if self._need_clip_func is not None and not self._need_clip_func(p):
|
||||
params_and_grads.append((p, g))
|
||||
continue
|
||||
new_grad = layers.clip(x=g, min=-self.clip_value, max=self.clip_value)
|
||||
params_and_grads.append((p, new_grad))
|
||||
return params_and_grads
|
||||
|
||||
@imperative_base.no_grad
|
||||
def _dygraph_clip_by_global_norm(self, params_grads):
|
||||
params_and_grads = []
|
||||
sum_square_list = []
|
||||
for p, g in params_grads:
|
||||
if g is None:
|
||||
continue
|
||||
if self._need_clip_func is not None and not self._need_clip_func(p):
|
||||
continue
|
||||
merge_grad = g
|
||||
if g.type == core.VarDesc.VarType.SELECTED_ROWS:
|
||||
merge_grad = layers.merge_selected_rows(g)
|
||||
merge_grad = layers.get_tensor_from_selected_rows(merge_grad)
|
||||
square = layers.square(merge_grad)
|
||||
sum_square = layers.reduce_sum(square)
|
||||
sum_square_list.append(sum_square)
|
||||
|
||||
# all parameters have been filtered out
|
||||
if len(sum_square_list) == 0:
|
||||
return params_grads
|
||||
|
||||
global_norm_var = layers.concat(sum_square_list)
|
||||
global_norm_var = layers.reduce_sum(global_norm_var)
|
||||
global_norm_var = layers.sqrt(global_norm_var)
|
||||
max_global_norm = layers.fill_constant(
|
||||
shape=[1], dtype='float32', value=self.clip_norm)
|
||||
clip_var = layers.elementwise_div(
|
||||
x=max_global_norm,
|
||||
y=layers.elementwise_max(
|
||||
x=global_norm_var, y=max_global_norm))
|
||||
for p, g in params_grads:
|
||||
if g is None:
|
||||
continue
|
||||
if self._need_clip_func is not None and not self._need_clip_func(p):
|
||||
params_and_grads.append((p, g))
|
||||
continue
|
||||
new_grad = layers.elementwise_mul(x=g, y=clip_var)
|
||||
params_and_grads.append((p, new_grad))
|
||||
|
||||
return params_and_grads
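# A minimal usage sketch (illustrative, not part of the original file): DoubleClip
# plugs into an optimizer via grad_clip, mirroring create_optimizer in train.py;
# the clip_value / clip_norm values come from the example config.
#
#   optim = fluid.optimizer.Adam(
#       config["learning_rate"],
#       parameter_list=model.parameters(),
#       grad_clip=DoubleClip(config["clip_value"], config["clip_norm"]))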
|
|
@ -1,46 +0,0 @@
|
|||
# data processing
|
||||
p_pronunciation: 0.99
|
||||
sample_rate: 22050 # Hz
|
||||
n_fft: 1024
|
||||
win_length: 1024
|
||||
hop_length: 256
|
||||
n_mels: 80
|
||||
reduction_factor: 4
|
||||
|
||||
# model-s2s
|
||||
n_speakers: 1
|
||||
speaker_dim: 16
|
||||
char_dim: 256
|
||||
encoder_dim: 64
|
||||
kernel_size: 5
|
||||
encoder_layers: 7
|
||||
decoder_layers: 8
|
||||
prenet_sizes: [128]
|
||||
attention_dim: 128
|
||||
|
||||
# model-postnet
|
||||
postnet_layers: 5
|
||||
postnet_dim: 256
|
||||
|
||||
# position embedding
|
||||
position_weight: 1.0
|
||||
position_rate: 5.54
|
||||
forward_step: 4
|
||||
backward_step: 0
|
||||
|
||||
dropout: 0.05
|
||||
|
||||
# output-griffinlim
|
||||
sharpening_factor: 1.4
|
||||
|
||||
# optimizer:
|
||||
learning_rate: 0.001
|
||||
clip_value: 5.0
|
||||
clip_norm: 100.0
|
||||
|
||||
# training:
|
||||
max_iteration: 1000000
|
||||
batch_size: 16
|
||||
report_interval: 10000
|
||||
save_interval: 10000
|
||||
valid_size: 5
|
|
@ -1,108 +0,0 @@
|
|||
import numpy as np
|
||||
import os
|
||||
import csv
|
||||
import pandas as pd
|
||||
|
||||
import paddle
|
||||
from paddle import fluid
|
||||
from paddle.fluid import dygraph as dg
|
||||
from paddle.fluid.dataloader import Dataset, BatchSampler
|
||||
from paddle.fluid.io import DataLoader
|
||||
|
||||
from parakeet.data import DatasetMixin, DataCargo, PartialyRandomizedSimilarTimeLengthSampler
|
||||
from parakeet.g2p import en
|
||||
|
||||
class LJSpeech(DatasetMixin):
|
||||
def __init__(self, root):
|
||||
self._root = root
|
||||
self._table = pd.read_csv(
|
||||
os.path.join(root, "metadata.csv"),
|
||||
sep="|",
|
||||
encoding="utf-8",
|
||||
quoting=csv.QUOTE_NONE,
|
||||
header=None,
|
||||
names=["num_frames", "spec_name", "mel_name", "text"],
|
||||
dtype={"num_frames": np.int64, "spec_name": str, "mel_name":str, "text":str})
|
||||
|
||||
def num_frames(self):
|
||||
return self._table["num_frames"].to_list()
|
||||
|
||||
def get_example(self, i):
|
||||
"""
|
||||
spec (T_frame, C_spec)
|
||||
mel (T_frame, C_mel)
|
||||
"""
|
||||
num_frames, spec_name, mel_name, text = self._table.iloc[i]
|
||||
spec = np.load(os.path.join(self._root, spec_name))
|
||||
mel = np.load(os.path.join(self._root, mel_name))
|
||||
return (text, spec, mel, num_frames)
|
||||
|
||||
def __len__(self):
|
||||
return len(self._table)
|
||||
|
||||
class DataCollector(object):
|
||||
def __init__(self, p_pronunciation):
|
||||
self.p_pronunciation = p_pronunciation
|
||||
|
||||
def __call__(self, examples):
|
||||
"""
|
||||
output shape and dtype
|
||||
(B, T_text) int64
|
||||
(B,) int64
|
||||
(B, T_frame, C_spec) float32
|
||||
(B, T_frame, C_mel) float32
|
||||
(B,) int64
|
||||
"""
|
||||
text_seqs = []
|
||||
specs = []
|
||||
mels = []
|
||||
num_frames = np.array([example[3] for example in examples], dtype=np.int64)
|
||||
max_frames = np.max(num_frames)
|
||||
|
||||
for example in examples:
|
||||
text, spec, mel, _ = example
|
||||
text_seqs.append(en.text_to_sequence(text, self.p_pronunciation))
|
||||
specs.append(np.pad(spec, [(0, max_frames - spec.shape[0]), (0, 0)], mode="constant"))
|
||||
mels.append(np.pad(mel, [(0, max_frames - mel.shape[0]), (0, 0)], mode="constant"))
|
||||
|
||||
specs = np.stack(specs)
|
||||
mels = np.stack(mels)
|
||||
|
||||
text_lengths = np.array([len(seq) for seq in text_seqs], dtype=np.int64)
|
||||
max_length = np.max(text_lengths)
|
||||
text_seqs = np.array([seq + [0] * (max_length - len(seq)) for seq in text_seqs], dtype=np.int64)
|
||||
return text_seqs, text_lengths, specs, mels, num_frames
|
||||
|
||||
if __name__ == "__main__":
|
||||
import argparse
|
||||
import tqdm
|
||||
import time
|
||||
from ruamel import yaml
|
||||
|
||||
parser = argparse.ArgumentParser(description="load the preprocessed ljspeech dataset")
|
||||
parser.add_argument("--config", type=str, required=True, help="config file")
|
||||
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
|
||||
args = parser.parse_args()
|
||||
with open(args.config, 'rt') as f:
|
||||
config = yaml.safe_load(f)
|
||||
|
||||
print("========= Command Line Arguments ========")
|
||||
for k, v in vars(args).items():
|
||||
print("{}: {}".format(k, v))
|
||||
print("=========== Configurations ==============")
|
||||
for k in ["p_pronunciation", "batch_size"]:
|
||||
print("{}: {}".format(k, config[k]))
|
||||
|
||||
ljspeech = LJSpeech(args.input)
|
||||
collate_fn = DataCollector(config["p_pronunciation"])
|
||||
|
||||
dg.enable_dygraph(fluid.CPUPlace())
|
||||
sampler = PartialyRandomizedSimilarTimeLengthSampler(ljspeech.num_frames())
|
||||
cargo = DataCargo(ljspeech, collate_fn,
|
||||
batch_size=config["batch_size"], sampler=sampler)
|
||||
loader = DataLoader\
|
||||
.from_generator(capacity=5, return_list=True)\
|
||||
.set_batch_generator(cargo)
|
||||
|
||||
for i, batch in tqdm.tqdm(enumerate(loader)):
|
||||
continue
|
Binary file not shown.
|
@ -1,122 +0,0 @@
|
|||
from __future__ import division
|
||||
import os
|
||||
import argparse
|
||||
from ruamel import yaml
|
||||
import tqdm
|
||||
from os.path import join
|
||||
import csv
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
import librosa
|
||||
import logging
|
||||
|
||||
from parakeet.data import DatasetMixin
|
||||
|
||||
|
||||
class LJSpeechMetaData(DatasetMixin):
|
||||
def __init__(self, root):
|
||||
self.root = root
|
||||
self._wav_dir = join(root, "wavs")
|
||||
csv_path = join(root, "metadata.csv")
|
||||
self._table = pd.read_csv(
|
||||
csv_path,
|
||||
sep="|",
|
||||
encoding="utf-8",
|
||||
header=None,
|
||||
quoting=csv.QUOTE_NONE,
|
||||
names=["fname", "raw_text", "normalized_text"])
|
||||
|
||||
def get_example(self, i):
|
||||
fname, raw_text, normalized_text = self._table.iloc[i]
|
||||
abs_fname = join(self._wav_dir, fname + ".wav")
|
||||
return fname, abs_fname, raw_text, normalized_text
|
||||
|
||||
def __len__(self):
|
||||
return len(self._table)
|
||||
|
||||
|
||||
class Transform(object):
|
||||
def __init__(self, sample_rate, n_fft, hop_length, win_length, n_mels, reduction_factor):
|
||||
self.sample_rate = sample_rate
|
||||
self.n_fft = n_fft
|
||||
self.win_length = win_length
|
||||
self.hop_length = hop_length
|
||||
self.n_mels = n_mels
|
||||
self.reduction_factor = reduction_factor
|
||||
|
||||
def __call__(self, fname):
|
||||
# wave processing
|
||||
audio, _ = librosa.load(fname, sr=self.sample_rate)
|
||||
|
||||
# Pad the data to the right size to have a whole number of timesteps,
|
||||
# accounting properly for the model reduction factor.
|
||||
frames = audio.size // (self.reduction_factor * self.hop_length) + 1
|
||||
# librosa's stft extracts frames of n_fft size, so we should pad n_fft // 2 on both sides
|
||||
desired_length = (frames * self.reduction_factor - 1) * self.hop_length + self.n_fft
|
||||
pad_amount = (desired_length - audio.size) // 2
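# e.g. with the example config (n_fft=1024, hop_length=256, reduction_factor=4)
# and a hypothetical 100000-sample clip: frames = 100000 // 1024 + 1 = 98,
# desired_length = (98 * 4 - 1) * 256 + 1024 = 101120, pad_amount = 560, and the
# STFT below (center=False) yields 1 + (101120 - 1024) // 256 = 392 = 98 * 4 frames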
|
||||
|
||||
# we pad manually to control the number of generated frames
|
||||
if audio.size % 2 == 0:
|
||||
audio = np.pad(audio, (pad_amount, pad_amount), mode='reflect')
|
||||
else:
|
||||
audio = np.pad(audio, (pad_amount, pad_amount + 1), mode='reflect')
|
||||
|
||||
# STFT
|
||||
D = librosa.stft(audio, self.n_fft, self.hop_length, self.win_length, center=False)
|
||||
S = np.abs(D)
|
||||
S_mel = librosa.feature.melspectrogram(sr=self.sample_rate, S=S, n_mels=self.n_mels, fmax=8000.0)
|
||||
|
||||
# log magnitude
|
||||
log_spectrogram = np.log(np.clip(S, a_min=1e-5, a_max=None))
|
||||
log_mel_spectrogram = np.log(np.clip(S_mel, a_min=1e-5, a_max=None))
|
||||
num_frames = log_spectrogram.shape[-1]
|
||||
assert num_frames % self.reduction_factor == 0, "num_frames is wrong"
|
||||
return (log_spectrogram.T, log_mel_spectrogram.T, num_frames)
|
||||
|
||||
|
||||
def save(output_path, dataset, transform):
|
||||
if not os.path.exists(output_path):
|
||||
os.makedirs(output_path)
|
||||
records = []
|
||||
for example in tqdm.tqdm(dataset):
|
||||
fname, abs_fname, _, normalized_text = example
|
||||
log_spec, log_mel_spec, num_frames = transform(abs_fname)
|
||||
records.append((num_frames,
|
||||
fname + "_spec.npy",
|
||||
fname + "_mel.npy",
|
||||
normalized_text))
|
||||
np.save(join(output_path, fname + "_spec"), log_spec)
|
||||
np.save(join(output_path, fname + "_mel"), log_mel_spec)
|
||||
meta_data = pd.DataFrame.from_records(records)
|
||||
meta_data.to_csv(join(output_path, "metadata.csv"),
|
||||
quoting=csv.QUOTE_NONE, sep="|", encoding="utf-8",
|
||||
header=False, index=False)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="preprocess ljspeech dataset and save it.")
|
||||
parser.add_argument("--config", type=str, required=True, help="config file")
|
||||
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
|
||||
parser.add_argument("--output", type=str, required=True, help="path to save the preprocessed dataset")
|
||||
|
||||
args = parser.parse_args()
|
||||
with open(args.config, 'rt') as f:
|
||||
config = yaml.safe_load(f)
|
||||
|
||||
print("========= Command Line Arguments ========")
|
||||
for k, v in vars(args).items():
|
||||
print("{}: {}".format(k, v))
|
||||
print("=========== Configurations ==============")
|
||||
for k in ["sample_rate", "n_fft", "win_length",
|
||||
"hop_length", "n_mels", "reduction_factor"]:
|
||||
print("{}: {}".format(k, config[k]))
|
||||
|
||||
ljspeech_meta = LJSpeechMetaData(args.input)
|
||||
transform = Transform(config["sample_rate"],
|
||||
config["n_fft"],
|
||||
config["hop_length"],
|
||||
config["win_length"],
|
||||
config["n_mels"],
|
||||
config["reduction_factor"])
|
||||
save(args.output, ljspeech_meta, transform)
|
||||
|
|
@ -1,101 +0,0 @@
|
|||
import numpy as np
|
||||
from matplotlib import cm
|
||||
import librosa
|
||||
import os
|
||||
import time
|
||||
import tqdm
|
||||
import argparse
|
||||
from ruamel import yaml
|
||||
import paddle
|
||||
from paddle import fluid
|
||||
from paddle.fluid import layers as F
|
||||
from paddle.fluid import dygraph as dg
|
||||
from paddle.fluid.io import DataLoader
|
||||
import soundfile as sf
|
||||
|
||||
from parakeet.data import SliceDataset, DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler
|
||||
from parakeet.utils.io import save_parameters, load_parameters, add_yaml_config_to_args
|
||||
from parakeet.g2p import en
|
||||
from parakeet.models.deepvoice3.weight_norm_hook import remove_weight_norm
|
||||
from vocoder import WaveflowVocoder, GriffinLimVocoder
|
||||
from train import create_model
|
||||
|
||||
|
||||
def main(args, config):
|
||||
model = create_model(config)
|
||||
loaded_step = load_parameters(model, checkpoint_path=args.checkpoint)
|
||||
for name, layer in model.named_sublayers():
|
||||
try:
|
||||
remove_weight_norm(layer)
|
||||
except ValueError:
|
||||
# this layer has no weight norm hook
|
||||
pass
|
||||
model.eval()
|
||||
if args.vocoder == "waveflow":
|
||||
vocoder = WaveflowVocoder()
|
||||
vocoder.model.eval()
|
||||
elif args.vocoder == "griffin-lim":
|
||||
vocoder = GriffinLimVocoder(
|
||||
sharpening_factor=config["sharpening_factor"],
|
||||
sample_rate=config["sample_rate"],
|
||||
n_fft=config["n_fft"],
|
||||
win_length=config["win_length"],
|
||||
hop_length=config["hop_length"])
|
||||
else:
|
||||
raise ValueError("Other vocoders are not supported.")
|
||||
|
||||
if not os.path.exists(args.output):
|
||||
os.makedirs(args.output)
|
||||
monotonic_layers = [int(item.strip()) - 1 for item in args.monotonic_layers.split(',')]
|
||||
with open(args.input, 'rt') as f:
|
||||
sentences = [line.strip() for line in f.readlines()]
|
||||
for i, sentence in enumerate(sentences):
|
||||
wav = synthesize(args, config, model, vocoder, sentence, monotonic_layers)
|
||||
sf.write(os.path.join(args.output, "sentence{}.wav".format(i)),
|
||||
wav, samplerate=config["sample_rate"])
|
||||
|
||||
|
||||
def synthesize(args, config, model, vocoder, sentence, monotonic_layers):
|
||||
print("[synthesize] {}".format(sentence))
|
||||
text = en.text_to_sequence(sentence, p=1.0)
|
||||
text = np.expand_dims(np.array(text, dtype="int64"), 0)
|
||||
lengths = np.array([text.size], dtype=np.int64)
|
||||
text_seqs = dg.to_variable(text)
|
||||
text_lengths = dg.to_variable(lengths)
|
||||
|
||||
decoder_layers = config["decoder_layers"]
|
||||
force_monotonic_attention = [False] * decoder_layers
|
||||
for i in monotonic_layers:
|
||||
force_monotonic_attention[i] = True
|
||||
|
||||
with dg.no_grad():
|
||||
outputs = model(text_seqs, text_lengths, speakers=None,
|
||||
force_monotonic_attention=force_monotonic_attention,
|
||||
window=(config["backward_step"], config["forward_step"]))
|
||||
decoded, refined, attentions = outputs
|
||||
if args.vocoder == "griffin-lim":
|
||||
wav_np = vocoder(refined.numpy()[0].T)
|
||||
else:
|
||||
wav = vocoder(F.transpose(refined, (0, 2, 1)))
|
||||
wav_np = wav.numpy()[0]
|
||||
return wav_np
|
||||
|
||||
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import argparse
|
||||
from ruamel import yaml
|
||||
parser = argparse.ArgumentParser("synthesize from a checkpoint")
|
||||
parser.add_argument("--config", type=str, required=True, help="config file")
|
||||
parser.add_argument("--input", type=str, required=True, help="text file to synthesize")
|
||||
parser.add_argument("--output", type=str, required=True, help="path to save audio")
|
||||
parser.add_argument("--checkpoint", type=str, required=True, help="data path of the checkpoint")
|
||||
parser.add_argument("--monotonic_layers", type=str, required=True, help="monotonic decoder layers' indices(start from 1)")
|
||||
parser.add_argument("--vocoder", type=str, default="waveflow", choices=['griffin-lim', 'waveflow'], help="vocoder to use")
|
||||
args = parser.parse_args()
|
||||
with open(args.config, 'rt') as f:
|
||||
config = yaml.safe_load(f)
|
||||
|
||||
dg.enable_dygraph(fluid.CUDAPlace(0))
|
||||
main(args, config)
|
|
@ -1,187 +0,0 @@
|
|||
import numpy as np
|
||||
from matplotlib import cm
|
||||
import librosa
|
||||
import os
|
||||
import time
|
||||
import tqdm
|
||||
import paddle
|
||||
from paddle import fluid
|
||||
from paddle.fluid import layers as F
|
||||
from paddle.fluid import initializer as I
|
||||
from paddle.fluid import dygraph as dg
|
||||
from paddle.fluid.io import DataLoader
|
||||
from visualdl import LogWriter
|
||||
|
||||
from parakeet.models.deepvoice3 import Encoder, Decoder, PostNet, SpectraNet
|
||||
from parakeet.data import SliceDataset, DataCargo, SequentialSampler, RandomSampler
|
||||
from parakeet.utils.io import save_parameters, load_parameters
|
||||
from parakeet.g2p import en
|
||||
|
||||
from data import LJSpeech, DataCollector
|
||||
from vocoder import WaveflowVocoder, GriffinLimVocoder
|
||||
from clip import DoubleClip
|
||||
|
||||
|
||||
def create_model(config):
|
||||
char_embedding = dg.Embedding((en.n_vocab, config["char_dim"]), param_attr=I.Normal(scale=0.1))
|
||||
multi_speaker = config["n_speakers"] > 1
|
||||
speaker_embedding = dg.Embedding((config["n_speakers"], config["speaker_dim"]), param_attr=I.Normal(scale=0.1)) \
|
||||
if multi_speaker else None
|
||||
encoder = Encoder(config["encoder_layers"], config["char_dim"],
|
||||
config["encoder_dim"], config["kernel_size"],
|
||||
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
|
||||
keep_prob=1.0 - config["dropout"])
|
||||
decoder = Decoder(config["n_mels"], config["reduction_factor"],
|
||||
list(config["prenet_sizes"]) + [config["char_dim"]],
|
||||
config["decoder_layers"], config["kernel_size"],
|
||||
config["attention_dim"],
|
||||
position_encoding_weight=config["position_weight"],
|
||||
omega=config["position_rate"],
|
||||
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
|
||||
keep_prob=1.0 - config["dropout"])
|
||||
postnet = PostNet(config["postnet_layers"], config["char_dim"],
|
||||
config["postnet_dim"], config["kernel_size"],
|
||||
config["n_mels"], config["reduction_factor"],
|
||||
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
|
||||
keep_prob=1.0 - config["dropout"])
|
||||
spectranet = SpectraNet(char_embedding, speaker_embedding, encoder, decoder, postnet)
|
||||
return spectranet
|
||||
|
||||
def create_data(config, data_path):
|
||||
dataset = LJSpeech(data_path)
|
||||
|
||||
train_dataset = SliceDataset(dataset, config["valid_size"], len(dataset))
|
||||
train_collator = DataCollector(config["p_pronunciation"])
|
||||
train_sampler = RandomSampler(train_dataset)
|
||||
    train_cargo = DataCargo(train_dataset, train_collator,
                            batch_size=config["batch_size"], sampler=train_sampler)
    train_loader = DataLoader \
        .from_generator(capacity=10, return_list=True) \
        .set_batch_generator(train_cargo)

    valid_dataset = SliceDataset(dataset, 0, config["valid_size"])
    valid_collector = DataCollector(1.)
    valid_sampler = SequentialSampler(valid_dataset)
    valid_cargo = DataCargo(valid_dataset, valid_collector,
                            batch_size=1, sampler=valid_sampler)
    valid_loader = DataLoader \
        .from_generator(capacity=2, return_list=True) \
        .set_batch_generator(valid_cargo)

    return train_loader, valid_loader


def create_optimizer(model, config):
    optim = fluid.optimizer.Adam(
        config["learning_rate"],
        parameter_list=model.parameters(),
        grad_clip=DoubleClip(config["clip_value"], config["clip_norm"]))
    return optim


def train(args, config):
    model = create_model(config)
    train_loader, valid_loader = create_data(config, args.input)
    optim = create_optimizer(model, config)

    global global_step
    max_iteration = config["max_iteration"]

    iterator = iter(tqdm.tqdm(train_loader))
    while global_step <= max_iteration:
        # get inputs
        try:
            batch = next(iterator)
        except StopIteration:
            iterator = iter(tqdm.tqdm(train_loader))
            batch = next(iterator)

        # unpack the batch
        text_seqs, text_lengths, specs, mels, num_frames = batch

        # forward & backward
        model.train()
        outputs = model(text_seqs, text_lengths, speakers=None, mel=mels)
        decoded, refined, attentions, final_state = outputs

        causal_mel_loss = model.spec_loss(decoded, mels, num_frames)
        non_causal_mel_loss = model.spec_loss(refined, mels, num_frames)
        loss = causal_mel_loss + non_causal_mel_loss
        loss.backward()

        # update, then clear the accumulated gradients before the next step
        optim.minimize(loss)
        model.clear_gradients()

        # logging
        tqdm.tqdm.write(
            "[train] step: {}\tloss: {:.6f}\tcausal: {:.6f}\tnon_causal: {:.6f}".format(
                global_step,
                loss.numpy()[0],
                causal_mel_loss.numpy()[0],
                non_causal_mel_loss.numpy()[0]))
        writer.add_scalar("loss/causal_mel_loss", causal_mel_loss.numpy()[0], step=global_step)
        writer.add_scalar("loss/non_causal_mel_loss", non_causal_mel_loss.numpy()[0], step=global_step)
        writer.add_scalar("loss/loss", loss.numpy()[0], step=global_step)

        if global_step % config["report_interval"] == 0:
            text_length = int(text_lengths.numpy()[0])
            num_frame = int(num_frames.numpy()[0])

            tag = "train_mel/ground-truth"
            img = cm.viridis(normalize(mels.numpy()[0, :num_frame].T))
            writer.add_image(tag, img, step=global_step)

            tag = "train_mel/decoded"
            img = cm.viridis(normalize(decoded.numpy()[0, :num_frame].T))
            writer.add_image(tag, img, step=global_step)

            tag = "train_mel/refined"
            img = cm.viridis(normalize(refined.numpy()[0, :num_frame].T))
            writer.add_image(tag, img, step=global_step)

            vocoder = WaveflowVocoder()
            vocoder.model.eval()

            tag = "train_audio/ground-truth-waveflow"
            wav = vocoder(F.transpose(mels[0:1, :num_frame, :], (0, 2, 1)))
            writer.add_audio(tag, wav.numpy()[0], step=global_step, sample_rate=22050)

            tag = "train_audio/decoded-waveflow"
            wav = vocoder(F.transpose(decoded[0:1, :num_frame, :], (0, 2, 1)))
            writer.add_audio(tag, wav.numpy()[0], step=global_step, sample_rate=22050)

            tag = "train_audio/refined-waveflow"
            wav = vocoder(F.transpose(refined[0:1, :num_frame, :], (0, 2, 1)))
            writer.add_audio(tag, wav.numpy()[0], step=global_step, sample_rate=22050)

            attentions_np = attentions.numpy()
            attentions_np = attentions_np[:, 0, :num_frame // 4, :text_length]
            for i, attention_layer in enumerate(np.rot90(attentions_np, axes=(1, 2))):
                tag = "train_attention/layer_{}".format(i)
                img = cm.viridis(normalize(attention_layer))
                writer.add_image(tag, img, step=global_step, dataformats="HWC")

        if global_step % config["save_interval"] == 0:
            save_parameters(writer.logdir, global_step, model, optim)

        # advance the global step
        global_step += 1


def normalize(arr):
    return (arr - arr.min()) / (arr.max() - arr.min())


if __name__ == "__main__":
    import argparse
    from ruamel import yaml

    parser = argparse.ArgumentParser(description="train a Deep Voice 3 model with LJSpeech")
    parser.add_argument("--config", type=str, required=True, help="config file")
    parser.add_argument("--input", type=str, required=True, help="data path of the original data")

    args = parser.parse_args()
    with open(args.config, 'rt') as f:
        config = yaml.safe_load(f)

    dg.enable_dygraph(fluid.CUDAPlace(0))
    global_step = 1
    writer = LogWriter()
    print("[Training] tensorboard log and checkpoints are saved in {}".format(
        writer.logdir))
    train(args, config)
@@ -1,51 +0,0 @@
import argparse
from ruamel import yaml
import numpy as np
import librosa
import paddle
from paddle import fluid
from paddle.fluid import layers as F
from paddle.fluid import dygraph as dg
from parakeet.utils.io import load_parameters
from parakeet.models.waveflow.waveflow_modules import WaveFlowModule


class WaveflowVocoder(object):
    def __init__(self):
        config_path = "waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml"
        with open(config_path, 'rt') as f:
            config = yaml.safe_load(f)
        ns = argparse.Namespace()
        for k, v in config.items():
            setattr(ns, k, v)
        ns.use_fp16 = False

        self.model = WaveFlowModule(ns)
        checkpoint_path = "waveflow_res128_ljspeech_ckpt_1.0/step-2000000"
        load_parameters(self.model, checkpoint_path=checkpoint_path)

    def __call__(self, mel):
        with dg.no_grad():
            self.model.eval()
            audio = self.model.synthesize(mel)
        self.model.train()
        return audio


class GriffinLimVocoder(object):
    def __init__(self, sharpening_factor=1.4, sample_rate=22050, n_fft=1024,
                 win_length=1024, hop_length=256):
        self.sample_rate = sample_rate
        self.n_fft = n_fft
        self.sharpening_factor = sharpening_factor
        self.win_length = win_length
        self.hop_length = hop_length

    def __call__(self, mel):
        spec = librosa.feature.inverse.mel_to_stft(
            np.exp(mel),
            sr=self.sample_rate,
            n_fft=self.n_fft,
            fmin=0, fmax=8000.0, power=1.0)
        audio = librosa.core.griffinlim(
            spec ** self.sharpening_factor,
            win_length=self.win_length, hop_length=self.hop_length)
        return audio
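For reference, a minimal usage sketch of the `GriffinLimVocoder` class above. The module path `vocoder` is an assumption (import it from wherever the class is defined), and the input is a synthetic log-mel spectrogram standing in for real acoustic-model output; the `[num_mels, num_frames]` shape convention is inferred from the `mel_to_stft` call.

```python
import numpy as np
from scipy.io.wavfile import write

# Hypothetical module name; adjust to where the vocoder classes live.
from vocoder import GriffinLimVocoder

# A fake natural-log mel spectrogram: 80 mel bands, 200 frames.
log_mel = np.log(np.random.uniform(1e-5, 1.0, size=(80, 200)).astype(np.float32))

vocoder = GriffinLimVocoder(sharpening_factor=1.4, sample_rate=22050,
                            n_fft=1024, win_length=1024, hop_length=256)
wav = vocoder(log_mel)  # 1-D float waveform at 22050 Hz
write("griffin_lim_demo.wav", 22050, wav)
```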
@@ -1,144 +0,0 @@
# Fastspeech

PaddlePaddle dynamic graph implementation of FastSpeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).

## Dataset

We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```

## Model Architecture

![FastSpeech model architecture](./images/model_architecture.png)

FastSpeech is a feed-forward structure based on Transformer, instead of the encoder-attention-decoder architecture. It extracts attention alignments from an encoder-decoder based teacher model for phoneme duration prediction; the predicted durations are used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence, enabling parallel mel-spectrogram generation. We use TransformerTTS as the teacher model.
The model consists of three parts: an encoder, a decoder and a length regulator.
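The length regulator is the key non-standard component here. The following is a minimal NumPy sketch of the idea only, not the repository's implementation: each phoneme's encoder output is repeated along the time axis according to its duration (optionally scaled by a speed factor `alpha`), so the expanded sequence matches the mel-spectrogram length. The function and variable names are illustrative.

```python
import numpy as np

def length_regulate(encoder_out, durations, alpha=1.0):
    """Expand phoneme-level features to frame level.

    encoder_out: float array of shape [num_phonemes, hidden_size]
    durations:   int array of shape [num_phonemes], frames per phoneme
    alpha:       speed control; durations are scaled by alpha before rounding
    """
    scaled = np.maximum(1, np.round(durations * alpha)).astype(np.int64)
    # Repeat each phoneme vector along the time axis by its duration.
    return np.repeat(encoder_out, scaled, axis=0)

# 5 phonemes with hidden size 4, expanded to 2 + 3 + 1 + 4 + 2 = 12 frames.
enc = np.random.randn(5, 4).astype(np.float32)
dur = np.array([2, 3, 1, 4, 2])
print(length_regulate(enc, dur).shape)  # (12, 4)
```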
## Project Structure

```text
├── config         # yaml configuration files
├── synthesis.py   # script to synthesize waveform from text
├── train.py       # script for model training
```

## Saving & Loading

`train.py` and `synthesis.py` have 3 arguments in common, `--checkpoint`, `--iteration` and `--output`.

1. `--output` is the directory for saving results.
During training, checkpoints are saved in `${output}/checkpoints` and tensorboard logs are saved in `${output}/log`.
During synthesis, results are saved in `${output}/samples` and the tensorboard log is saved in `${output}/log`.

2. `--checkpoint` is the path of a checkpoint and `--iteration` is the target step. They are used to load checkpoints in the following way.

- If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.

- If `--checkpoint` is not provided, we try to load the checkpoint of the target step specified by `--iteration` from the `${output}/checkpoints/` directory, e.g. if given `--iteration 120000`, the checkpoint `${output}/checkpoints/step-120000.*` will be loaded.

- If neither `--checkpoint` nor `--iteration` is provided, we try to load the latest checkpoint from the `${output}/checkpoints/` directory.

## Compute Phoneme Duration

A ground truth duration for each phoneme (the number of spectrogram frames that correspond to that phoneme) should be provided when training a FastSpeech model.

We compute the ground truth duration of each phoneme in the following way:
we extract the encoder-decoder attention alignment from a trained TransformerTTS model, and each frame is assigned to the phoneme that receives the most attention.

You can run `alignments/get_alignments.py` to compute the alignments.

```bash
cd alignments
python get_alignments.py \
    --use_gpu=1 \
    --output='./alignments' \
    --data=${DATAPATH} \
    --config=${CONFIG} \
    --checkpoint_transformer=${CHECKPOINT} \
```

where `${DATAPATH}` is the path where the LJSpeech data is saved, `${CHECKPOINT}` is the path of a pre-trained TransformerTTS model, and `${CONFIG}` is the config yaml file of that TransformerTTS checkpoint. You need to prepare a pre-trained TransformerTTS checkpoint beforehand.

For more help on arguments

``python get_alignments.py --help``.

Or you can use your own phoneme durations; you just need to save them in the following format (a small sketch of building such a file follows below).

```python
{'fname1': alignment1,
 'fname2': alignment2,
 ...}
```
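As a concrete illustration of this format, the sketch below builds such a dictionary and pickles it. The utterance names follow the LJSpeech convention and the duration values are invented for illustration; in practice each value comes from the teacher model's attention (as `get_alignments.py` does), and the assumption here is that each alignment is a per-phoneme duration array whose entries sum to the number of mel frames of that utterance.

```python
import pickle
import numpy as np

# Durations are made up for this example; real ones are derived by assigning
# each mel frame to the phoneme with the largest attention weight and counting
# frames per phoneme.
alignments = {
    "LJ001-0001": np.array([3, 5, 2, 7, 4], dtype=np.int64),  # 5 phonemes -> 21 frames
    "LJ001-0002": np.array([4, 4, 6, 2], dtype=np.int64),     # 4 phonemes -> 16 frames
}

with open("alignments.pkl", "wb") as f:
    pickle.dump(alignments, f)
```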
## Train FastSpeech

The FastSpeech model can be trained by running ``train.py``.

```bash
python train.py \
    --use_gpu=1 \
    --data=${DATAPATH} \
    --alignments_path=${ALIGNMENTS_PATH} \
    --output=${OUTPUTPATH} \
    --config='configs/ljspeech.yaml' \
```

Or you can run the script file directly.

```bash
sh train.sh
```

If you want to train on multiple GPUs, start training in the following way.

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py \
    --use_gpu=1 \
    --data=${DATAPATH} \
    --alignments_path=${ALIGNMENTS_PATH} \
    --output=${OUTPUTPATH} \
    --config='configs/ljspeech.yaml' \
```

If you wish to resume from an existing model, see [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.

For more help on arguments

``python train.py --help``.

## Synthesis

After training FastSpeech, audio can be synthesized by running ``synthesis.py``.

```bash
python synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint=${CHECKPOINTPATH} \
    --config='configs/ljspeech.yaml' \
    --output=${OUTPUTPATH} \
    --vocoder='griffin-lim' \
```

We currently support two vocoders, the Griffin-Lim algorithm and WaveFlow. You can set ``--vocoder`` to use one of them. If you want to use WaveFlow as your vocoder, you need to set ``--config_vocoder`` and ``--checkpoint_vocoder``, which are the paths of the vocoder's config file and checkpoint. You can download a pre-trained WaveFlow model from [here](https://github.com/PaddlePaddle/Parakeet#vocoders).

Or you can run the script file directly.

```bash
sh synthesis.sh
```

For more help on arguments

``python synthesis.py --help``.

Then you can find the synthesized audio files in ``${OUTPUTPATH}/samples``.
@@ -1,132 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from scipy.io.wavfile import write
from parakeet.g2p.en import text_to_sequence
import numpy as np
import pandas as pd
import csv
import librosa  # used below for loading audio and computing spectrograms
from tqdm import tqdm
from ruamel import yaml
import pickle
from pathlib import Path
import argparse
from pprint import pprint
from collections import OrderedDict
import paddle.fluid as fluid
import paddle.fluid.dygraph as dg
from parakeet.models.transformer_tts.utils import *
from parakeet.models.transformer_tts import TransformerTTS
from parakeet.models.fastspeech.utils import get_alignment
from parakeet.utils import io


def add_config_options_to_parser(parser):
    parser.add_argument("--config", type=str, help="path of the config file")
    parser.add_argument("--use_gpu", type=int, default=0, help="device to use")
    parser.add_argument("--data", type=str, help="path of LJspeech dataset")

    parser.add_argument(
        "--checkpoint_transformer",
        type=str,
        help="transformer_tts checkpoint to synthesis")

    parser.add_argument(
        "--output",
        type=str,
        default="./alignments",
        help="path to save experiment results")


def alignments(args):
    local_rank = dg.parallel.Env().local_rank
    place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace())

    with open(args.config) as f:
        cfg = yaml.load(f, Loader=yaml.Loader)

    with dg.guard(place):
        network_cfg = cfg['network']
        model = TransformerTTS(
            network_cfg['embedding_size'], network_cfg['hidden_size'],
            network_cfg['encoder_num_head'], network_cfg['encoder_n_layers'],
            cfg['audio']['num_mels'], network_cfg['outputs_per_step'],
            network_cfg['decoder_num_head'], network_cfg['decoder_n_layers'])
        # Load parameters.
        global_step = io.load_parameters(
            model=model, checkpoint_path=args.checkpoint_transformer)
        model.eval()

        # get text data
        root = Path(args.data)
        csv_path = root.joinpath("metadata.csv")
        table = pd.read_csv(
            csv_path,
            sep="|",
            header=None,
            quoting=csv.QUOTE_NONE,
            names=["fname", "raw_text", "normalized_text"])

        pbar = tqdm(range(len(table)))
        alignments = OrderedDict()
        for i in pbar:
            fname, raw_text, normalized_text = table.iloc[i]
            # init input
            text = np.asarray(text_to_sequence(normalized_text))
            text = fluid.layers.unsqueeze(dg.to_variable(text), [0])
            pos_text = np.arange(1, text.shape[1] + 1)
            pos_text = fluid.layers.unsqueeze(dg.to_variable(pos_text), [0])

            # load
            wav, _ = librosa.load(
                str(os.path.join(args.data, 'wavs', fname + ".wav")))

            spec = librosa.stft(
                y=wav,
                n_fft=cfg['audio']['n_fft'],
                win_length=cfg['audio']['win_length'],
                hop_length=cfg['audio']['hop_length'])
            mag = np.abs(spec)
            mel = librosa.filters.mel(sr=cfg['audio']['sr'],
                                      n_fft=cfg['audio']['n_fft'],
                                      n_mels=cfg['audio']['num_mels'],
                                      fmin=cfg['audio']['fmin'],
                                      fmax=cfg['audio']['fmax'])
            mel = np.matmul(mel, mag)
            mel = np.log(np.maximum(mel, 1e-5))

            mel_input = np.transpose(mel, axes=(1, 0))
            mel_input = fluid.layers.unsqueeze(dg.to_variable(mel_input), [0])
            mel_lens = mel_input.shape[1]

            pos_mel = np.arange(1, mel_input.shape[1] + 1)
            pos_mel = fluid.layers.unsqueeze(dg.to_variable(pos_mel), [0])
            mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
                text, mel_input, pos_text, pos_mel)
            mel_input = fluid.layers.concat(
                [mel_input, postnet_pred[:, -1:, :]], axis=1)

            alignment, _ = get_alignment(attn_probs, mel_lens,
                                         network_cfg['decoder_num_head'])
            alignments[fname] = alignment

        # dump all alignments once, after the loop
        with open(args.output + '.pkl', "wb") as f:
            pickle.dump(alignments, f)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Get alignments from TransformerTTS model")
    add_config_options_to_parser(parser)
    args = parser.parse_args()
    alignments(args)
@@ -1,14 +0,0 @@

CUDA_VISIBLE_DEVICES=0 \
python -u get_alignments.py \
    --use_gpu=1 \
    --output='./alignments' \
    --data='../../../dataset/LJSpeech-1.1' \
    --config='../../transformer_tts/configs/ljspeech.yaml' \
    --checkpoint_transformer='../../transformer_tts/checkpoint/transformer/step-120000' \

if [ $? -ne 0 ]; then
    echo "Failed in computing alignments!"
    exit 1
fi
exit 0
@@ -1,36 +0,0 @@
audio:
  num_mels: 80                          # the number of mel bands when calculating mel spectrograms.
  n_fft: 1024                           # the number of fft components.
  sr: 22050                             # the sampling rate of the audio data.
  hop_length: 256                       # the number of samples to advance between frames.
  win_length: 1024                      # the length (width) of the window function.
  preemphasis: 0.97
  power: 1.2                            # the power to raise before griffin-lim.
  fmin: 0
  fmax: 8000

network:
  encoder_n_layer: 6                    # the number of FFT blocks in the encoder.
  encoder_head: 2                       # the number of attention heads in the encoder.
  encoder_conv1d_filter_size: 1536      # the filter size of conv1d in the encoder.
  max_seq_len: 2048                     # the max length of the sequence.
  decoder_n_layer: 6                    # the number of FFT blocks in the decoder.
  decoder_head: 2                       # the number of attention heads in the decoder.
  decoder_conv1d_filter_size: 1536      # the filter size of conv1d in the decoder.
  hidden_size: 384                      # the hidden size of fastspeech.
  duration_predictor_output_size: 256   # the output size of the duration predictor.
  duration_predictor_filter_size: 3     # the filter size of conv1d in the duration predictor.
  fft_conv1d_filter: 3                  # the filter size of conv1d in the FFT block.
  fft_conv1d_padding: 1                 # the padding size of conv1d in the FFT block.
  dropout: 0.1                          # the dropout in the network.
  outputs_per_step: 1

train:
  batch_size: 32
  learning_rate: 0.001
  warm_up_step: 4000                    # the warm up step of the learning rate.
  grad_clip_thresh: 0.1                 # the threshold of grad clip.

  checkpoint_interval: 1000
  max_iteration: 500000
@@ -1,186 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
import numpy as np
import pandas as pd
import librosa
import csv
import pickle

from paddle import fluid
from parakeet import g2p
from parakeet import audio
from parakeet.data.sampler import *
from parakeet.data.datacargo import DataCargo
from parakeet.data.batch import TextIDBatcher, SpecBatcher
from parakeet.data.dataset import DatasetMixin, TransformDataset, CacheDataset, SliceDataset
from parakeet.models.transformer_tts.utils import *


class LJSpeechLoader:
    def __init__(self,
                 config,
                 place,
                 data_path,
                 alignments_path,
                 batch_size,
                 nranks,
                 rank,
                 is_vocoder=False,
                 shuffle=True):

        LJSPEECH_ROOT = Path(data_path)
        metadata = LJSpeechMetaData(LJSPEECH_ROOT, alignments_path)
        transformer = LJSpeech(config)
        dataset = TransformDataset(metadata, transformer)
        dataset = CacheDataset(dataset)

        sampler = DistributedSampler(
            len(dataset), nranks, rank, shuffle=shuffle)

        assert batch_size % nranks == 0
        each_bs = batch_size // nranks
        dataloader = DataCargo(
            dataset,
            sampler=sampler,
            batch_size=each_bs,
            shuffle=shuffle,
            batch_fn=batch_examples,
            drop_last=True)
        self.reader = fluid.io.DataLoader.from_generator(
            capacity=32,
            iterable=True,
            use_double_buffer=True,
            return_list=True)
        self.reader.set_batch_generator(dataloader, place)


class LJSpeechMetaData(DatasetMixin):
    def __init__(self, root, alignments_path):
        self.root = Path(root)
        self._wav_dir = self.root.joinpath("wavs")
        csv_path = self.root.joinpath("metadata.csv")
        self._table = pd.read_csv(
            csv_path,
            sep="|",
            header=None,
            quoting=csv.QUOTE_NONE,
            names=["fname", "raw_text", "normalized_text"])
        with open(alignments_path, "rb") as f:
            self._alignments = pickle.load(f)

    def get_example(self, i):
        fname, raw_text, normalized_text = self._table.iloc[i]
        alignment = self._alignments[fname]
        fname = str(self._wav_dir.joinpath(fname + ".wav"))
        return fname, normalized_text, alignment

    def __len__(self):
        return len(self._table)


class LJSpeech(object):
    def __init__(self, cfg):
        super(LJSpeech, self).__init__()
        self.sr = cfg['sr']
        self.n_fft = cfg['n_fft']
        self.num_mels = cfg['num_mels']
        self.win_length = cfg['win_length']
        self.hop_length = cfg['hop_length']
        self.preemphasis = cfg['preemphasis']
        self.fmin = cfg['fmin']
        self.fmax = cfg['fmax']

    def __call__(self, metadatum):
        """All the code for generating an Example from a metadatum. If you want a
        different preprocessing pipeline, you can override this method.
        This method may require several processors, each of which has a lot of
        options. In that case, you'd better compose the transforms and pass the
        composed transform to the init method.
        """
        fname, normalized_text, alignment = metadatum

        wav, _ = librosa.load(str(fname))
        spec = librosa.stft(
            y=wav,
            n_fft=self.n_fft,
            win_length=self.win_length,
            hop_length=self.hop_length)
        mag = np.abs(spec)
        mel = librosa.filters.mel(self.sr,
                                  self.n_fft,
                                  n_mels=self.num_mels,
                                  fmin=self.fmin,
                                  fmax=self.fmax)
        mel = np.matmul(mel, mag)
        mel = np.log(np.maximum(mel, 1e-5))
        phonemes = np.array(
            g2p.en.text_to_sequence(normalized_text), dtype=np.int64)
        return (mel, phonemes, alignment
                )  # maybe we need to implement it as a map in the future


def batch_examples(batch):
    texts = []
    mels = []
    text_lens = []
    pos_texts = []
    pos_mels = []
    alignments = []
    for data in batch:
        mel, text, alignment = data
        text_lens.append(len(text))
        pos_texts.append(np.arange(1, len(text) + 1))
        pos_mels.append(np.arange(1, mel.shape[1] + 1))
        mels.append(mel)
        texts.append(text)
        alignments.append(alignment)

    # Sort by text_len in descending order
    texts = [
        i
        for i, _ in sorted(
            zip(texts, text_lens), key=lambda x: x[1], reverse=True)
    ]
    mels = [
        i
        for i, _ in sorted(
            zip(mels, text_lens), key=lambda x: x[1], reverse=True)
    ]
    pos_texts = [
        i
        for i, _ in sorted(
            zip(pos_texts, text_lens), key=lambda x: x[1], reverse=True)
    ]
    pos_mels = [
        i
        for i, _ in sorted(
            zip(pos_mels, text_lens), key=lambda x: x[1], reverse=True)
    ]
    alignments = [
        i
        for i, _ in sorted(
            zip(alignments, text_lens), key=lambda x: x[1], reverse=True)
    ]
    #text_lens = sorted(text_lens, reverse=True)

    # Pad sequences to the largest length in the batch
    texts = TextIDBatcher(pad_id=0)(texts)  # (B, T)
    pos_texts = TextIDBatcher(pad_id=0)(pos_texts)  # (B, T)
    pos_mels = TextIDBatcher(pad_id=0)(pos_mels)  # (B, T)
    alignments = TextIDBatcher(pad_id=0)(alignments).astype(np.float32)
    mels = np.transpose(
        SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1))  # (B, T, num_mels)

    return (texts, mels, pos_texts, pos_mels, alignments)
Binary file not shown (image, 513 KiB).
@@ -1,170 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from visualdl import LogWriter
from scipy.io.wavfile import write
from collections import OrderedDict
import argparse
from pprint import pprint
from ruamel import yaml
from matplotlib import cm
import numpy as np
import librosa  # used by the griffin-lim vocoder below
import paddle.fluid as fluid
import paddle.fluid.dygraph as dg
from parakeet.g2p.en import text_to_sequence
from parakeet import audio
from parakeet.models.fastspeech.fastspeech import FastSpeech
from parakeet.models.transformer_tts.utils import *
from parakeet.models.wavenet import WaveNet, UpsampleNet
from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet
from parakeet.modules import weight_norm
from parakeet.models.waveflow import WaveFlowModule
from parakeet.utils.layer_tools import freeze
from parakeet.utils import io


def add_config_options_to_parser(parser):
    parser.add_argument("--config", type=str, help="path of the config file")
    parser.add_argument(
        "--vocoder",
        type=str,
        default="griffin-lim",
        choices=['griffin-lim', 'waveflow'],
        help="vocoder method")
    parser.add_argument(
        "--config_vocoder", type=str, help="path of the vocoder config file")
    parser.add_argument("--use_gpu", type=int, default=0, help="device to use")
    parser.add_argument(
        "--alpha",
        type=float,
        default=1,
        help="determines the length of the expanded mel sequence, controlling the voice speed."
    )

    parser.add_argument(
        "--checkpoint", type=str, help="fastspeech checkpoint for synthesis")
    parser.add_argument(
        "--checkpoint_vocoder",
        type=str,
        help="vocoder checkpoint for synthesis")

    parser.add_argument(
        "--output",
        type=str,
        default="synthesis",
        help="path to save experiment results")


def synthesis(text_input, args):
    local_rank = dg.parallel.Env().local_rank
    place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace())
    fluid.enable_dygraph(place)

    with open(args.config) as f:
        cfg = yaml.load(f, Loader=yaml.Loader)

    # tensorboard
    if not os.path.exists(args.output):
        os.mkdir(args.output)

    writer = LogWriter(os.path.join(args.output, 'log'))

    model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels'])
    # Load parameters.
    global_step = io.load_parameters(
        model=model, checkpoint_path=args.checkpoint)
    model.eval()

    text = np.asarray(text_to_sequence(text_input))
    text = np.expand_dims(text, axis=0)
    pos_text = np.arange(1, text.shape[1] + 1)
    pos_text = np.expand_dims(pos_text, axis=0)

    text = dg.to_variable(text).astype(np.int64)
    pos_text = dg.to_variable(pos_text).astype(np.int64)

    _, mel_output_postnet = model(text, pos_text, alpha=args.alpha)

    if args.vocoder == 'griffin-lim':
        # synthesis with griffin-lim
        wav = synthesis_with_griffinlim(mel_output_postnet, cfg['audio'])
    elif args.vocoder == 'waveflow':
        wav = synthesis_with_waveflow(mel_output_postnet, args,
                                      args.checkpoint_vocoder, place)
    else:
        print(
            'vocoder error, we only support griffin-lim and waveflow, but received %s.'
            % args.vocoder)
        return  # nothing to write without a valid vocoder

    writer.add_audio(text_input + '(' + args.vocoder + ')', wav, 0,
                     cfg['audio']['sr'])
    if not os.path.exists(os.path.join(args.output, 'samples')):
        os.mkdir(os.path.join(args.output, 'samples'))
    write(
        os.path.join(
            os.path.join(args.output, 'samples'), args.vocoder + '.wav'),
        cfg['audio']['sr'], wav)
    print("Synthesis completed!")
    writer.close()


def synthesis_with_griffinlim(mel_output, cfg):
    mel_output = fluid.layers.transpose(
        fluid.layers.squeeze(mel_output, [0]), [1, 0])
    mel_output = np.exp(mel_output.numpy())
    basis = librosa.filters.mel(cfg['sr'],
                                cfg['n_fft'],
                                cfg['num_mels'],
                                fmin=cfg['fmin'],
                                fmax=cfg['fmax'])
    inv_basis = np.linalg.pinv(basis)
    spec = np.maximum(1e-10, np.dot(inv_basis, mel_output))

    wav = librosa.core.griffinlim(
        spec**cfg['power'],
        hop_length=cfg['hop_length'],
        win_length=cfg['win_length'])

    return wav


def synthesis_with_waveflow(mel_output, args, checkpoint, place):

    fluid.enable_dygraph(place)
    args.config = args.config_vocoder
    args.use_fp16 = False
    config = io.add_yaml_config_to_args(args)

    mel_spectrogram = fluid.layers.transpose(mel_output, [0, 2, 1])

    # Build model.
    waveflow = WaveFlowModule(config)
    io.load_parameters(model=waveflow, checkpoint_path=checkpoint)
    for layer in waveflow.sublayers():
        if isinstance(layer, weight_norm.WeightNormWrapper):
            layer.remove_weight_norm()

    # Run model inference.
    wav = waveflow.synthesize(mel_spectrogram, sigma=config.sigma)
    return wav.numpy()[0]


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Synthesis model")
    add_config_options_to_parser(parser)
    args = parser.parse_args()
    pprint(vars(args))
    synthesis(
        "Don't argue with the people of strong determination, because they may change the fact!",
        args)
@@ -1,20 +0,0 @@
# synthesize from a trained model

CUDA_VISIBLE_DEVICES=0 \
python -u synthesis.py \
    --use_gpu=1 \
    --alpha=1.0 \
    --checkpoint='./fastspeech_ljspeech_ckpt_1.0/fastspeech/step-162000' \
    --config='fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \
    --output='./synthesis' \
    --vocoder='waveflow' \
    --config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
    --checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \

if [ $? -ne 0 ]; then
    echo "Failed in synthesis!"
    exit 1
fi
exit 0
@@ -1,166 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import argparse
import os
import time
import math
from pathlib import Path
from pprint import pprint
from ruamel import yaml
from tqdm import tqdm
from matplotlib import cm
from collections import OrderedDict
from visualdl import LogWriter
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as layers
import paddle.fluid as fluid
from parakeet.models.fastspeech.fastspeech import FastSpeech
from parakeet.models.fastspeech.utils import get_alignment
from data import LJSpeechLoader
from parakeet.utils import io


def add_config_options_to_parser(parser):
    parser.add_argument("--config", type=str, help="path of the config file")
    parser.add_argument("--use_gpu", type=int, default=0, help="device to use")
    parser.add_argument("--data", type=str, help="path of LJspeech dataset")
    parser.add_argument(
        "--alignments_path", type=str, help="path of alignments")

    g = parser.add_mutually_exclusive_group()
    g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
    g.add_argument(
        "--iteration",
        type=int,
        help="the iteration of the checkpoint to load from output directory")

    parser.add_argument(
        "--output",
        type=str,
        default="experiment",
        help="path to save experiment results")


def main(args):
    local_rank = dg.parallel.Env().local_rank
    nranks = dg.parallel.Env().nranks
    parallel = nranks > 1

    with open(args.config) as f:
        cfg = yaml.load(f, Loader=yaml.Loader)

    global_step = 0
    place = fluid.CUDAPlace(dg.parallel.Env()
                            .dev_id) if args.use_gpu else fluid.CPUPlace()
    fluid.enable_dygraph(place)

    if not os.path.exists(args.output):
        os.mkdir(args.output)

    writer = LogWriter(os.path.join(args.output,
                                    'log')) if local_rank == 0 else None

    model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels'])
    model.train()
    optimizer = fluid.optimizer.AdamOptimizer(
        learning_rate=dg.NoamDecay(1 / (cfg['train']['warm_up_step'] *
                                        (cfg['train']['learning_rate']**2)),
                                   cfg['train']['warm_up_step']),
        parameter_list=model.parameters(),
        grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][
            'grad_clip_thresh']))
    reader = LJSpeechLoader(
        cfg['audio'],
        place,
        args.data,
        args.alignments_path,
        cfg['train']['batch_size'],
        nranks,
        local_rank,
        shuffle=True).reader
    iterator = iter(tqdm(reader))

    # Load parameters.
    global_step = io.load_parameters(
        model=model,
        optimizer=optimizer,
        checkpoint_dir=os.path.join(args.output, 'checkpoints'),
        iteration=args.iteration,
        checkpoint_path=args.checkpoint)
    print("Rank {}: checkpoint loaded.".format(local_rank))

    if parallel:
        strategy = dg.parallel.prepare_context()
        model = fluid.dygraph.parallel.DataParallel(model, strategy)

    while global_step <= cfg['train']['max_iteration']:
        try:
            batch = next(iterator)
        except StopIteration:
            iterator = iter(tqdm(reader))
            batch = next(iterator)

        (character, mel, pos_text, pos_mel, alignment) = batch

        global_step += 1

        # Forward
        result = model(
            character, pos_text, mel_pos=pos_mel, length_target=alignment)
        mel_output, mel_output_postnet, duration_predictor_output, _, _ = result
        mel_loss = layers.mse_loss(mel_output, mel)
        mel_postnet_loss = layers.mse_loss(mel_output_postnet, mel)
        duration_loss = layers.mean(
            layers.abs(
                layers.elementwise_sub(duration_predictor_output, alignment)))
        total_loss = mel_loss + mel_postnet_loss + duration_loss

        if local_rank == 0:
            writer.add_scalar('mel_loss', mel_loss.numpy(), global_step)
            writer.add_scalar('post_mel_loss',
                              mel_postnet_loss.numpy(), global_step)
            writer.add_scalar('duration_loss',
                              duration_loss.numpy(), global_step)
            writer.add_scalar('learning_rate',
                              optimizer._learning_rate.step().numpy(),
                              global_step)

        if parallel:
            total_loss = model.scale_loss(total_loss)
            total_loss.backward()
            model.apply_collective_grads()
        else:
            total_loss.backward()
        optimizer.minimize(total_loss)
        model.clear_gradients()

        # save checkpoint
        if local_rank == 0 and global_step % cfg['train'][
                'checkpoint_interval'] == 0:
            io.save_parameters(
                os.path.join(args.output, 'checkpoints'), global_step, model,
                optimizer)

    if local_rank == 0:
        writer.close()


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Train Fastspeech model")
    add_config_options_to_parser(parser)
    args = parser.parse_args()
    # Print the whole config setting.
    pprint(vars(args))
    main(args)
@@ -1,15 +0,0 @@
# train model
export CUDA_VISIBLE_DEVICES=0
python -u train.py \
    --use_gpu=1 \
    --data='../../dataset/LJSpeech-1.1' \
    --alignments_path='./alignments/alignments.pkl' \
    --output='./experiment' \
    --config='configs/ljspeech.yaml' \
    #--checkpoint='./checkpoint/fastspeech/step-120000' \

if [ $? -ne 0 ]; then
    echo "Failed in training!"
    exit 1
fi
exit 0
@@ -1,112 +0,0 @@
# TransformerTTS

PaddlePaddle dynamic graph implementation of TransformerTTS, a neural TTS with Transformer. The implementation is based on [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895).

## Dataset

We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```

## Model Architecture

<div align="center" name="TransformerTTS model architecture">
  <img src="./images/model_architecture.jpg" width=400 height=600 /> <br>
</div>
<div align="center" >
TransformerTTS model architecture
</div>

The model adopts multi-head attention to replace both the RNN structures and the original attention mechanism in [Tacotron2](https://arxiv.org/abs/1712.05884). The model consists of two main parts, an encoder and a decoder. We also implement the CBHG model of Tacotron as the vocoder part and convert the spectrogram into a raw wave using the Griffin-Lim algorithm.
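For intuition, here is a rough sketch of that Griffin-Lim step, written with librosa. It loosely mirrors the `synthesis_with_griffinlim` helper used elsewhere in this repository, but the function name and default parameter values below are illustrative, not the exact code: the predicted log-mel spectrogram is mapped back to a linear magnitude spectrogram with the pseudo-inverse of the mel filter bank, then Griffin-Lim estimates a phase and returns a waveform.

```python
import numpy as np
import librosa

def mel_to_wav(log_mel, sr=22050, n_fft=1024, num_mels=80,
               fmin=0, fmax=8000, power=1.2, hop_length=256, win_length=1024):
    """log_mel: array of shape [num_mels, num_frames] in natural-log scale."""
    mel = np.exp(log_mel)
    basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=num_mels,
                                fmin=fmin, fmax=fmax)
    # Approximate linear magnitude spectrogram via the pseudo-inverse filter bank.
    mag = np.maximum(1e-10, np.dot(np.linalg.pinv(basis), mel))
    # Griffin-Lim phase reconstruction; `power` sharpens the spectrogram first.
    return librosa.griffinlim(mag ** power, hop_length=hop_length,
                              win_length=win_length)
```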
## Project Structure

```text
├── config                 # yaml configuration files
├── data.py                # dataset and dataloader settings for LJSpeech
├── synthesis.py           # script to synthesize waveform from text
├── train_transformer.py   # script for transformer model training
├── train_vocoder.py       # script for vocoder model training
```

## Saving & Loading

`train_transformer.py` and `train_vocoder.py` have 3 arguments in common, `--checkpoint`, `--iteration` and `--output`.

1. `--output` is the directory for saving results.
During training, checkpoints are saved in `${output}/checkpoints` and tensorboard logs are saved in `${output}/log`.
During synthesis, results are saved in `${output}/samples` and the tensorboard log is saved in `${output}/log`.

2. `--checkpoint` is the path of a checkpoint and `--iteration` is the target step. They are used to load checkpoints in the following way.

- If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.

- If `--checkpoint` is not provided, we try to load the checkpoint of the target step specified by `--iteration` from the `${output}/checkpoints/` directory, e.g. if given `--iteration 120000`, the checkpoint `${output}/checkpoints/step-120000.*` will be loaded.

- If neither `--checkpoint` nor `--iteration` is provided, we try to load the latest checkpoint from the `${output}/checkpoints/` directory.

## Train Transformer

The TransformerTTS model can be trained by running ``train_transformer.py``.

```bash
python train_transformer.py \
    --use_gpu=1 \
    --data=${DATAPATH} \
    --output=${OUTPUTPATH} \
    --config='configs/ljspeech.yaml' \
```

Or you can run the script file directly.

```bash
sh train_transformer.sh
```

If you want to train on multiple GPUs, you must start training in the following way.

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train_transformer.py \
    --use_gpu=1 \
    --data=${DATAPATH} \
    --output=${OUTPUTPATH} \
    --config='configs/ljspeech.yaml' \
```

If you wish to resume from an existing model, see [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.

**Note: To ensure good training results, we recommend multi-GPU training to enlarge the batch size, with at least 16 samples per GPU in each batch.**

For more help on arguments

``python train_transformer.py --help``.

## Synthesis

After training the TransformerTTS, audio can be synthesized by running ``synthesis.py``.

```bash
python synthesis.py \
    --use_gpu=0 \
    --output=${OUTPUTPATH} \
    --config='configs/ljspeech.yaml' \
    --checkpoint_transformer=${CHECKPOINTPATH} \
    --vocoder='griffin-lim' \
```

We currently support two vocoders, the Griffin-Lim algorithm and WaveFlow. You can set ``--vocoder`` to use one of them. If you want to use WaveFlow as your vocoder, you need to set ``--config_vocoder`` and ``--checkpoint_vocoder``, which are the paths of the vocoder's config file and checkpoint. You can download a pre-trained WaveFlow model from [here](https://github.com/PaddlePaddle/Parakeet#vocoders).

Or you can run the script file directly.

```bash
sh synthesis.sh
```

For more help on arguments

``python synthesis.py --help``.

Then you can find the synthesized audio files in ``${OUTPUTPATH}/samples``.
@@ -1,38 +0,0 @@
audio:
  num_mels: 80
  n_fft: 1024
  sr: 22050
  preemphasis: 0.97
  hop_length: 256
  win_length: 1024
  power: 1.2
  fmin: 0
  fmax: 8000

network:
  hidden_size: 256
  embedding_size: 512
  encoder_num_head: 4
  encoder_n_layers: 3
  decoder_num_head: 4
  decoder_n_layers: 3
  outputs_per_step: 1
  stop_loss_weight: 8

vocoder:
  hidden_size: 256

train:
  batch_size: 32
  learning_rate: 0.001
  warm_up_step: 4000
  grad_clip_thresh: 1.0

  checkpoint_interval: 1000
  image_interval: 2000

  max_iteration: 500000
@@ -1,219 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path
import numpy as np
import pandas as pd
import librosa
import csv

from paddle import fluid
from parakeet import g2p
from parakeet.data.sampler import *
from parakeet.data.datacargo import DataCargo
from parakeet.data.batch import TextIDBatcher, SpecBatcher
from parakeet.data.dataset import DatasetMixin, TransformDataset, CacheDataset, SliceDataset
from parakeet.models.transformer_tts.utils import *


class LJSpeechLoader:
    def __init__(self,
                 config,
                 place,
                 data_path,
                 batch_size,
                 nranks,
                 rank,
                 is_vocoder=False,
                 shuffle=True):

        LJSPEECH_ROOT = Path(data_path)
        metadata = LJSpeechMetaData(LJSPEECH_ROOT)
        transformer = LJSpeech(config)
        dataset = TransformDataset(metadata, transformer)
        dataset = CacheDataset(dataset)

        sampler = DistributedSampler(
            len(dataset), nranks, rank, shuffle=shuffle)

        assert batch_size % nranks == 0
        each_bs = batch_size // nranks
        if is_vocoder:
            dataloader = DataCargo(
                dataset,
                sampler=sampler,
                batch_size=each_bs,
                shuffle=shuffle,
                batch_fn=batch_examples_vocoder,
                drop_last=True)
        else:
            dataloader = DataCargo(
                dataset,
                sampler=sampler,
                batch_size=each_bs,
                shuffle=shuffle,
                batch_fn=batch_examples,
                drop_last=True)
        self.reader = fluid.io.DataLoader.from_generator(
            capacity=32,
            iterable=True,
            use_double_buffer=True,
            return_list=True)
        self.reader.set_batch_generator(dataloader, place)


class LJSpeechMetaData(DatasetMixin):
    def __init__(self, root):
        self.root = Path(root)
        self._wav_dir = self.root.joinpath("wavs")
        csv_path = self.root.joinpath("metadata.csv")
        self._table = pd.read_csv(
            csv_path,
            sep="|",
            header=None,
            quoting=csv.QUOTE_NONE,
            names=["fname", "raw_text", "normalized_text"])

    def get_example(self, i):
        fname, raw_text, normalized_text = self._table.iloc[i]
        fname = str(self._wav_dir.joinpath(fname + ".wav"))
        return fname, raw_text, normalized_text

    def __len__(self):
        return len(self._table)


class LJSpeech(object):
    def __init__(self, config):
        super(LJSpeech, self).__init__()
        self.config = config
        self.sr = config['sr']
        self.n_mels = config['num_mels']
        self.preemphasis = config['preemphasis']
        self.n_fft = config['n_fft']
        self.win_length = config['win_length']
        self.hop_length = config['hop_length']
        self.fmin = config['fmin']
        self.fmax = config['fmax']

    def __call__(self, metadatum):
        """All the code for generating an Example from a metadatum. If you want a
        different preprocessing pipeline, you can override this method.
        This method may require several processors, each of which has a lot of
        options. In that case, you'd better compose the transforms and pass the
        composed transform to the init method.
        """
        fname, raw_text, normalized_text = metadatum

        # load
        wav, _ = librosa.load(str(fname))

        spec = librosa.stft(
            y=wav,
            n_fft=self.n_fft,
            win_length=self.win_length,
            hop_length=self.hop_length)
        mag = np.abs(spec)
        mel = librosa.filters.mel(sr=self.sr,
                                  n_fft=self.n_fft,
                                  n_mels=self.n_mels,
                                  fmin=self.fmin,
                                  fmax=self.fmax)
        mel = np.matmul(mel, mag)
        mel = np.log(np.maximum(mel, 1e-5))

        characters = np.array(
            g2p.en.text_to_sequence(normalized_text), dtype=np.int64)
        return (mag, mel, characters)


def batch_examples(batch):
    texts = []
    mels = []
    mel_inputs = []
    text_lens = []
    pos_texts = []
    pos_mels = []
    stop_tokens = []
    for data in batch:
        _, mel, text = data
        mel_inputs.append(
            np.concatenate(
                [np.zeros([mel.shape[0], 1], np.float32), mel[:, :-1]],
                axis=-1))
        text_lens.append(len(text))
        pos_texts.append(np.arange(1, len(text) + 1))
        pos_mels.append(np.arange(1, mel.shape[1] + 1))
        mels.append(mel)
        texts.append(text)
        stop_token = np.append(np.zeros([mel.shape[1] - 1], np.float32), 1.0)
        stop_tokens.append(stop_token)

    # Sort by text_len in descending order
    texts = [
        i
        for i, _ in sorted(
            zip(texts, text_lens), key=lambda x: x[1], reverse=True)
    ]
    mels = [
        i
        for i, _ in sorted(
            zip(mels, text_lens), key=lambda x: x[1], reverse=True)
    ]
    mel_inputs = [
        i
        for i, _ in sorted(
            zip(mel_inputs, text_lens), key=lambda x: x[1], reverse=True)
    ]
    pos_texts = [
        i
        for i, _ in sorted(
            zip(pos_texts, text_lens), key=lambda x: x[1], reverse=True)
    ]
    pos_mels = [
        i
        for i, _ in sorted(
            zip(pos_mels, text_lens), key=lambda x: x[1], reverse=True)
    ]
    stop_tokens = [
        i
        for i, _ in sorted(
            zip(stop_tokens, text_lens), key=lambda x: x[1], reverse=True)
    ]
    text_lens = sorted(text_lens, reverse=True)

    # Pad sequences to the largest length in the batch
    texts = TextIDBatcher(pad_id=0)(texts)  # (B, T)
    pos_texts = TextIDBatcher(pad_id=0)(pos_texts)  # (B, T)
    pos_mels = TextIDBatcher(pad_id=0)(pos_mels)  # (B, T)
    stop_tokens = TextIDBatcher(pad_id=1, dtype=np.float32)(stop_tokens)  # (B, T)
    mels = np.transpose(
        SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1))  # (B, T, num_mels)
    mel_inputs = np.transpose(
        SpecBatcher(pad_value=0.)(mel_inputs), axes=(0, 2, 1))  # (B, T, num_mels)

    return (texts, mels, mel_inputs, pos_texts, pos_mels, stop_tokens)


def batch_examples_vocoder(batch):
    mels = []
    mags = []
    for data in batch:
        mag, mel, _ = data
        mels.append(mel)
        mags.append(mag)

    mels = np.transpose(SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1))
    mags = np.transpose(SpecBatcher(pad_value=0.)(mags), axes=(0, 2, 1))

    return (mels, mags)
Binary file not shown (image, 322 KiB).
@@ -1,202 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from scipy.io.wavfile import write
import numpy as np
from tqdm import tqdm
from matplotlib import cm
from visualdl import LogWriter
from ruamel import yaml
from pathlib import Path
import argparse
from pprint import pprint
import librosa  # used by the griffin-lim vocoder below
import paddle.fluid as fluid
import paddle.fluid.dygraph as dg
from parakeet.g2p.en import text_to_sequence
from parakeet.models.transformer_tts.utils import *
from parakeet.models.transformer_tts import TransformerTTS
from parakeet.models.waveflow import WaveFlowModule
from parakeet.modules.weight_norm import WeightNormWrapper
from parakeet.utils import io


def add_config_options_to_parser(parser):
    parser.add_argument("--config", type=str, help="path of the config file")
    parser.add_argument("--use_gpu", type=int, default=0, help="device to use")
    parser.add_argument(
        "--stop_threshold",
        type=float,
        default=0.5,
        help="The threshold of the stop token which indicates whether the time step should stop generating the spectrum."
    )
    parser.add_argument(
        "--max_len",
        type=int,
        default=1000,
        help="The max length of the spectrum when synthesizing. If the length of the synthesized spectrum is larger than max_len, the spectrum will be cut off."
    )

    parser.add_argument(
        "--checkpoint_transformer",
        type=str,
        help="transformer_tts checkpoint for synthesis")
    parser.add_argument(
        "--vocoder",
        type=str,
        default="griffin-lim",
        choices=['griffin-lim', 'waveflow'],
        help="vocoder method")
    parser.add_argument(
        "--config_vocoder", type=str, help="path of the vocoder config file")
    parser.add_argument(
        "--checkpoint_vocoder",
        type=str,
        help="vocoder checkpoint for synthesis")

    parser.add_argument(
        "--output",
        type=str,
        default="synthesis",
        help="path to save experiment results")


def synthesis(text_input, args):
    local_rank = dg.parallel.Env().local_rank
    place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace())

    with open(args.config) as f:
        cfg = yaml.load(f, Loader=yaml.Loader)

    # tensorboard
    if not os.path.exists(args.output):
        os.mkdir(args.output)

    writer = LogWriter(os.path.join(args.output, 'log'))

    fluid.enable_dygraph(place)
    with fluid.unique_name.guard():
        network_cfg = cfg['network']
        model = TransformerTTS(
            network_cfg['embedding_size'], network_cfg['hidden_size'],
            network_cfg['encoder_num_head'], network_cfg['encoder_n_layers'],
            cfg['audio']['num_mels'], network_cfg['outputs_per_step'],
            network_cfg['decoder_num_head'], network_cfg['decoder_n_layers'])
        # Load parameters.
        global_step = io.load_parameters(
            model=model, checkpoint_path=args.checkpoint_transformer)
        model.eval()

        # init input
        text = np.asarray(text_to_sequence(text_input))
        text = fluid.layers.unsqueeze(dg.to_variable(text).astype(np.int64), [0])
        mel_input = dg.to_variable(np.zeros([1, 1, 80])).astype(np.float32)
        pos_text = np.arange(1, text.shape[1] + 1)
        pos_text = fluid.layers.unsqueeze(
            dg.to_variable(pos_text).astype(np.int64), [0])

        for i in range(args.max_len):
            pos_mel = np.arange(1, mel_input.shape[1] + 1)
            pos_mel = fluid.layers.unsqueeze(
                dg.to_variable(pos_mel).astype(np.int64), [0])
            mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
                text, mel_input, pos_text, pos_mel)
            if stop_preds.numpy()[0, -1] > args.stop_threshold:
                break
            mel_input = fluid.layers.concat(
                [mel_input, postnet_pred[:, -1:, :]], axis=1)
        global_step = 0
        for i, prob in enumerate(attn_probs):
            for j in range(4):
                x = np.uint8(cm.viridis(prob.numpy()[j]) * 255)
                writer.add_image(
                    'Attention_%d_0' % global_step,
                    x,
                    i * 4 + j)

        if args.vocoder == 'griffin-lim':
            # synthesis with griffin-lim
            wav = synthesis_with_griffinlim(postnet_pred, cfg['audio'])
        elif args.vocoder == 'waveflow':
            # synthesis with waveflow
            wav = synthesis_with_waveflow(postnet_pred, args,
                                          args.checkpoint_vocoder, place)
        else:
            print(
                'vocoder error, we only support griffin-lim and waveflow, but received %s.'
                % args.vocoder)
            return  # nothing to write without a valid vocoder

        writer.add_audio(text_input + '(' + args.vocoder + ')', wav, 0,
                         cfg['audio']['sr'])
        if not os.path.exists(os.path.join(args.output, 'samples')):
            os.mkdir(os.path.join(args.output, 'samples'))
        write(
            os.path.join(
                os.path.join(args.output, 'samples'), args.vocoder + '.wav'),
            cfg['audio']['sr'], wav)
        print("Synthesis completed!")
        writer.close()


def synthesis_with_griffinlim(mel_output, cfg):
    # synthesis with griffin-lim
    mel_output = fluid.layers.transpose(
        fluid.layers.squeeze(mel_output, [0]), [1, 0])
    mel_output = np.exp(mel_output.numpy())
    basis = librosa.filters.mel(cfg['sr'],
                                cfg['n_fft'],
                                cfg['num_mels'],
                                fmin=cfg['fmin'],
                                fmax=cfg['fmax'])
    inv_basis = np.linalg.pinv(basis)
    spec = np.maximum(1e-10, np.dot(inv_basis, mel_output))

    wav = librosa.core.griffinlim(
        spec**cfg['power'],
        hop_length=cfg['hop_length'],
        win_length=cfg['win_length'])

    return wav


def synthesis_with_waveflow(mel_output, args, checkpoint, place):
    fluid.enable_dygraph(place)
    args.config = args.config_vocoder
    args.use_fp16 = False
    config = io.add_yaml_config_to_args(args)

    mel_spectrogram = fluid.layers.transpose(
        fluid.layers.squeeze(mel_output, [0]), [1, 0])
    mel_spectrogram = fluid.layers.unsqueeze(mel_spectrogram, [0])

    # Build model.
    waveflow = WaveFlowModule(config)
    io.load_parameters(model=waveflow, checkpoint_path=checkpoint)
    for layer in waveflow.sublayers():
        if isinstance(layer, WeightNormWrapper):
            layer.remove_weight_norm()

    # Run model inference.
    wav = waveflow.synthesize(mel_spectrogram, sigma=config.sigma)
    return wav.numpy()[0]


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Synthesis model")
    add_config_options_to_parser(parser)
    args = parser.parse_args()
    # Print the whole config setting.
    pprint(vars(args))
    synthesis(
        "Life was like a box of chocolates, you never know what you're gonna get.",
        args)
@ -1,17 +0,0 @@
|
|||
|
||||
# synthesize with a trained TransformerTTS model and a WaveFlow vocoder
|
||||
CUDA_VISIBLE_DEVICES=0 \
|
||||
python -u synthesis.py \
|
||||
--use_gpu=0 \
|
||||
--output='./synthesis' \
|
||||
--config='transformer_tts_ljspeech_ckpt_1.0/ljspeech.yaml' \
|
||||
--checkpoint_transformer='./transformer_tts_ljspeech_ckpt_1.0/step-120000' \
|
||||
--vocoder='waveflow' \
|
||||
--config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \
|
||||
--checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000'
|
||||
|
||||
if [ $? -ne 0 ]; then
|
||||
echo "Failed in training!"
|
||||
exit 1
|
||||
fi
|
||||
exit 0
|
|
@ -1,219 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
import os
|
||||
from tqdm import tqdm
|
||||
from visualdl import LogWriter
|
||||
from collections import OrderedDict
|
||||
import argparse
|
||||
from pprint import pprint
|
||||
from ruamel import yaml
|
||||
from matplotlib import cm
|
||||
import numpy as np
|
||||
import paddle.fluid as fluid
|
||||
import paddle.fluid.dygraph as dg
|
||||
import paddle.fluid.layers as layers
|
||||
from parakeet.models.transformer_tts.utils import cross_entropy
|
||||
from data import LJSpeechLoader
|
||||
from parakeet.models.transformer_tts import TransformerTTS
|
||||
from parakeet.utils import io
|
||||
|
||||
|
||||
def add_config_options_to_parser(parser):
|
||||
parser.add_argument("--config", type=str, help="path of the config file")
|
||||
parser.add_argument("--use_gpu", type=int, default=0, help="device to use")
|
||||
parser.add_argument("--data", type=str, help="path of LJspeech dataset")
|
||||
|
||||
g = parser.add_mutually_exclusive_group()
|
||||
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
|
||||
g.add_argument(
|
||||
"--iteration",
|
||||
type=int,
|
||||
help="the iteration of the checkpoint to load from output directory")
|
||||
|
||||
parser.add_argument(
|
||||
"--output",
|
||||
type=str,
|
||||
default="experiment",
|
||||
help="path to save experiment results")
|
||||
|
||||
|
||||
def main(args):
|
||||
local_rank = dg.parallel.Env().local_rank
|
||||
nranks = dg.parallel.Env().nranks
|
||||
parallel = nranks > 1
|
||||
|
||||
with open(args.config) as f:
|
||||
cfg = yaml.load(f, Loader=yaml.Loader)
|
||||
|
||||
global_step = 0
|
||||
place = fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace()
|
||||
|
||||
if not os.path.exists(args.output):
|
||||
os.mkdir(args.output)
|
||||
|
||||
writer = LogWriter(os.path.join(args.output,
|
||||
'log')) if local_rank == 0 else None
|
||||
|
||||
fluid.enable_dygraph(place)
|
||||
network_cfg = cfg['network']
|
||||
model = TransformerTTS(
|
||||
network_cfg['embedding_size'], network_cfg['hidden_size'],
|
||||
network_cfg['encoder_num_head'], network_cfg['encoder_n_layers'],
|
||||
cfg['audio']['num_mels'], network_cfg['outputs_per_step'],
|
||||
network_cfg['decoder_num_head'], network_cfg['decoder_n_layers'])
|
||||
|
||||
model.train()
|
||||
optimizer = fluid.optimizer.AdamOptimizer(
|
||||
learning_rate=dg.NoamDecay(1 / (cfg['train']['warm_up_step'] *
|
||||
(cfg['train']['learning_rate']**2)),
|
||||
cfg['train']['warm_up_step']),
|
||||
parameter_list=model.parameters(),
|
||||
grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][
|
||||
'grad_clip_thresh']))
|
||||
|
||||
# Load parameters.
|
||||
global_step = io.load_parameters(
|
||||
model=model,
|
||||
optimizer=optimizer,
|
||||
checkpoint_dir=os.path.join(args.output, 'checkpoints'),
|
||||
iteration=args.iteration,
|
||||
checkpoint_path=args.checkpoint)
|
||||
print("Rank {}: checkpoint loaded.".format(local_rank))
|
||||
|
||||
if parallel:
|
||||
strategy = dg.parallel.prepare_context()
|
||||
model = fluid.dygraph.parallel.DataParallel(model, strategy)
|
||||
|
||||
reader = LJSpeechLoader(
|
||||
cfg['audio'],
|
||||
place,
|
||||
args.data,
|
||||
cfg['train']['batch_size'],
|
||||
nranks,
|
||||
local_rank,
|
||||
shuffle=True).reader
|
||||
|
||||
iterator = iter(tqdm(reader))
|
||||
|
||||
global_step += 1
|
||||
|
||||
while global_step <= cfg['train']['max_iteration']:
|
||||
try:
|
||||
batch = next(iterator)
|
||||
except StopIteration as e:
|
||||
iterator = iter(tqdm(reader))
|
||||
batch = next(iterator)
|
||||
|
||||
character, mel, mel_input, pos_text, pos_mel, stop_tokens = batch
|
||||
|
||||
mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model(
|
||||
character, mel_input, pos_text, pos_mel)
|
||||
|
||||
mel_loss = layers.mean(
|
||||
layers.abs(layers.elementwise_sub(mel_pred, mel)))
|
||||
post_mel_loss = layers.mean(
|
||||
layers.abs(layers.elementwise_sub(postnet_pred, mel)))
|
||||
loss = mel_loss + post_mel_loss
|
||||
|
||||
stop_loss = cross_entropy(
|
||||
stop_preds, stop_tokens, weight=cfg['network']['stop_loss_weight'])
|
||||
loss = loss + stop_loss
|
||||
|
||||
if local_rank == 0:
|
||||
writer.add_scalar('training_loss/mel_loss',
|
||||
mel_loss.numpy(),
|
||||
global_step)
|
||||
writer.add_scalar('training_loss/post_mel_loss',
|
||||
post_mel_loss.numpy(),
|
||||
global_step)
|
||||
writer.add_scalar('stop_loss', stop_loss.numpy(), global_step)
|
||||
|
||||
if parallel:
|
||||
writer.add_scalar('alphas/encoder_alpha',
|
||||
model._layers.encoder.alpha.numpy(),
|
||||
global_step)
|
||||
writer.add_scalar('alphas/decoder_alpha',
|
||||
model._layers.decoder.alpha.numpy(),
|
||||
global_step)
|
||||
else:
|
||||
writer.add_scalar('alphas/encoder_alpha',
|
||||
model.encoder.alpha.numpy(),
|
||||
global_step)
|
||||
writer.add_scalar('alphas/decoder_alpha',
|
||||
model.decoder.alpha.numpy(),
|
||||
global_step)
|
||||
|
||||
writer.add_scalar('learning_rate',
|
||||
optimizer._learning_rate.step().numpy(),
|
||||
global_step)
|
||||
|
||||
if global_step % cfg['train']['image_interval'] == 1:
|
||||
for i, prob in enumerate(attn_probs):
|
||||
for j in range(cfg['network']['decoder_num_head']):
|
||||
x = np.uint8(
|
||||
cm.viridis(prob.numpy()[j * cfg['train'][
|
||||
'batch_size'] // nranks]) * 255)
|
||||
writer.add_image(
|
||||
'Attention_%d_0' % global_step,
|
||||
x,
|
||||
i * 4 + j)
|
||||
|
||||
for i, prob in enumerate(attn_enc):
|
||||
for j in range(cfg['network']['encoder_num_head']):
|
||||
x = np.uint8(
|
||||
cm.viridis(prob.numpy()[j * cfg['train'][
|
||||
'batch_size'] // nranks]) * 255)
|
||||
writer.add_image(
|
||||
'Attention_enc_%d_0' % global_step,
|
||||
x,
|
||||
i * 4 + j)
|
||||
|
||||
for i, prob in enumerate(attn_dec):
|
||||
for j in range(cfg['network']['decoder_num_head']):
|
||||
x = np.uint8(
|
||||
cm.viridis(prob.numpy()[j * cfg['train'][
|
||||
'batch_size'] // nranks]) * 255)
|
||||
writer.add_image(
|
||||
'Attention_dec_%d_0' % global_step,
|
||||
x,
|
||||
i * 4 + j)
|
||||
|
||||
if parallel:
|
||||
loss = model.scale_loss(loss)
|
||||
loss.backward()
|
||||
model.apply_collective_grads()
|
||||
else:
|
||||
loss.backward()
|
||||
optimizer.minimize(loss)
|
||||
model.clear_gradients()
|
||||
|
||||
# save checkpoint
|
||||
if local_rank == 0 and global_step % cfg['train'][
|
||||
'checkpoint_interval'] == 0:
|
||||
io.save_parameters(
|
||||
os.path.join(args.output, 'checkpoints'), global_step, model,
|
||||
optimizer)
|
||||
global_step += 1
|
||||
|
||||
if local_rank == 0:
|
||||
writer.close()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
parser = argparse.ArgumentParser(description="Train TransformerTTS model")
|
||||
add_config_options_to_parser(parser)
|
||||
args = parser.parse_args()
|
||||
# Print the whole config setting.
|
||||
pprint(vars(args))
|
||||
main(args)
|
|
@ -1,15 +0,0 @@
|
|||
|
||||
# train model
|
||||
export CUDA_VISIBLE_DEVICES=0
|
||||
python -u train_transformer.py \
|
||||
--use_gpu=1 \
|
||||
--data='../../dataset/LJSpeech-1.1' \
|
||||
--output='./experiment' \
|
||||
--config='configs/ljspeech.yaml'
|
||||
#--checkpoint='./checkpoint/transformer/step-120000' \
|
||||
|
||||
if [ $? -ne 0 ]; then
|
||||
echo "Failed in training!"
|
||||
exit 1
|
||||
fi
|
||||
exit 0
|
|
@ -1,144 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
from visualdl import LogWriter
|
||||
import os
|
||||
from tqdm import tqdm
|
||||
from pathlib import Path
|
||||
from collections import OrderedDict
|
||||
import argparse
|
||||
from ruamel import yaml
|
||||
from pprint import pprint
|
||||
import paddle.fluid as fluid
|
||||
import paddle.fluid.dygraph as dg
|
||||
import paddle.fluid.layers as layers
|
||||
from data import LJSpeechLoader
|
||||
from parakeet.models.transformer_tts import Vocoder
|
||||
from parakeet.utils import io
|
||||
|
||||
|
||||
def add_config_options_to_parser(parser):
|
||||
parser.add_argument("--config", type=str, help="path of the config file")
|
||||
parser.add_argument("--use_gpu", type=int, default=0, help="device to use")
|
||||
parser.add_argument("--data", type=str, help="path of LJspeech dataset")
|
||||
|
||||
g = parser.add_mutually_exclusive_group()
|
||||
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
|
||||
g.add_argument(
|
||||
"--iteration",
|
||||
type=int,
|
||||
help="the iteration of the checkpoint to load from output directory")
|
||||
|
||||
parser.add_argument(
|
||||
"--output",
|
||||
type=str,
|
||||
default="vocoder",
|
||||
help="path to save experiment results")
|
||||
|
||||
|
||||
def main(args):
|
||||
local_rank = dg.parallel.Env().local_rank
|
||||
nranks = dg.parallel.Env().nranks
|
||||
parallel = nranks > 1
|
||||
|
||||
with open(args.config) as f:
|
||||
cfg = yaml.load(f, Loader=yaml.Loader)
|
||||
|
||||
global_step = 0
|
||||
place = fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace()
|
||||
|
||||
if not os.path.exists(args.output):
|
||||
os.mkdir(args.output)
|
||||
|
||||
writer = LogWriter(os.path.join(args.output,
|
||||
'log')) if local_rank == 0 else None
|
||||
|
||||
fluid.enable_dygraph(place)
|
||||
model = Vocoder(cfg['train']['batch_size'], cfg['vocoder']['hidden_size'],
|
||||
cfg['audio']['num_mels'], cfg['audio']['n_fft'])
|
||||
|
||||
model.train()
|
||||
optimizer = fluid.optimizer.AdamOptimizer(
|
||||
learning_rate=dg.NoamDecay(1 / (cfg['train']['warm_up_step'] *
|
||||
(cfg['train']['learning_rate']**2)),
|
||||
cfg['train']['warm_up_step']),
|
||||
parameter_list=model.parameters(),
|
||||
grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][
|
||||
'grad_clip_thresh']))
|
||||
|
||||
# Load parameters.
|
||||
global_step = io.load_parameters(
|
||||
model=model,
|
||||
optimizer=optimizer,
|
||||
checkpoint_dir=os.path.join(args.output, 'checkpoints'),
|
||||
iteration=args.iteration,
|
||||
checkpoint_path=args.checkpoint)
|
||||
print("Rank {}: checkpoint loaded.".format(local_rank))
|
||||
|
||||
if parallel:
|
||||
strategy = dg.parallel.prepare_context()
|
||||
model = fluid.dygraph.parallel.DataParallel(model, strategy)
|
||||
|
||||
reader = LJSpeechLoader(
|
||||
cfg['audio'],
|
||||
place,
|
||||
args.data,
|
||||
cfg['train']['batch_size'],
|
||||
nranks,
|
||||
local_rank,
|
||||
is_vocoder=True).reader()
|
||||
|
||||
for epoch in range(cfg['train']['max_iteration']):
|
||||
pbar = tqdm(reader)
|
||||
for i, data in enumerate(pbar):
|
||||
pbar.set_description('Processing at epoch %d' % epoch)
|
||||
mel, mag = data
|
||||
mag = dg.to_variable(mag.numpy())
|
||||
mel = dg.to_variable(mel.numpy())
|
||||
global_step += 1
|
||||
|
||||
mag_pred = model(mel)
|
||||
loss = layers.mean(
|
||||
layers.abs(layers.elementwise_sub(mag_pred, mag)))
|
||||
|
||||
if parallel:
|
||||
loss = model.scale_loss(loss)
|
||||
loss.backward()
|
||||
model.apply_collective_grads()
|
||||
else:
|
||||
loss.backward()
|
||||
optimizer.minimize(loss)
|
||||
model.clear_gradients()
|
||||
|
||||
if local_rank == 0:
|
||||
writer.add_scalar('training_loss/loss', loss.numpy(),
|
||||
global_step)
|
||||
|
||||
# save checkpoint
|
||||
if local_rank == 0 and global_step % cfg['train'][
|
||||
'checkpoint_interval'] == 0:
|
||||
io.save_parameters(
|
||||
os.path.join(args.output, 'checkpoints'), global_step,
|
||||
model, optimizer)
|
||||
|
||||
if local_rank == 0:
|
||||
writer.close()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
parser = argparse.ArgumentParser(description="Train vocoder model")
|
||||
add_config_options_to_parser(parser)
|
||||
args = parser.parse_args()
|
||||
# Print the whole config setting.
|
||||
pprint(vars(args))
|
||||
main(args)
|
|
@ -1,16 +0,0 @@
|
|||
|
||||
# train model
|
||||
CUDA_VISIBLE_DEVICES=0 \
|
||||
python -u train_vocoder.py \
|
||||
--use_gpu=1 \
|
||||
--data='../../dataset/LJSpeech-1.1' \
|
||||
--output='./vocoder' \
|
||||
--config='configs/ljspeech.yaml'
|
||||
#--checkpoint='./checkpoint/vocoder/step-100000' \
|
||||
|
||||
|
||||
if [ $? -ne 0 ]; then
|
||||
echo "Failed in training!"
|
||||
exit 1
|
||||
fi
|
||||
exit 0
|
|
@ -1,122 +0,0 @@
|
|||
# WaveFlow
|
||||
|
||||
PaddlePaddle dynamic graph implementation of [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219).
|
||||
|
||||
- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
|
||||
- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M).
|
||||
- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.
|
||||
|
||||
## Project Structure
|
||||
```text
|
||||
├── configs # yaml configuration files of preset model hyperparameters
|
||||
├── benchmark.py # benchmark code to test the speed of batched speech synthesis
|
||||
├── synthesis.py # script for speech synthesis
|
||||
├── train.py # script for model training
|
||||
├── utils.py # helper functions for e.g., model checkpointing
|
||||
├── data.py # dataset and dataloader settings for LJSpeech
|
||||
├── waveflow.py # WaveFlow model high level APIs
|
||||
└── parakeet/models/waveflow/waveflow_modules.py # WaveFlow model implementation
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
There are many hyperparameters to tune, depending on the specification of the model and the dataset you are working on.
|
||||
We provide `waveflow_ljspeech.yaml` as a hyperparameter set that works well on the LJSpeech dataset.
|
||||
Note that we use a [convolutional queue](https://arxiv.org/abs/1611.09482) at audio synthesis time to cache the intermediate hidden states, which speeds up the autoregressive inference over the height dimension. The current implementation only supports a height dimension of 8 or 16, i.e., no dilation on the height dimension. Therefore, the `n_group` key in the yaml config file can only be set to 8 or 16.
|
||||
|
||||
Also note that `train.py`, `synthesis.py`, and `benchmark.py` all accept a `--config` parameter. To ensure consistency, you should use the same config yaml file for training, synthesis, and benchmarking. You can also override these preset hyperparameters on the command line by passing parameters after `--config`.
|
||||
For example, `--config=${yaml} --batch_size=8` overrides the corresponding hyperparameter in the `${yaml}` config file. For more details about these hyperparameters, check `utils.add_config_options_to_parser`.
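For instance, the following sketch (with `${ModelName}` as a placeholder experiment name) trains on a single GPU while overriding the preset batch size:

```bash
export CUDA_VISIBLE_DEVICES=0
python -u train.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} \
    --batch_size=8 \
    --use_gpu=true
```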
|
||||
|
||||
You also need to specify some additional parameters for `train.py`, `synthesis.py`, and `benchmark.py`; the details can be found in `train.add_options_to_parser`, `synthesis.add_options_to_parser`, and `benchmark.add_options_to_parser`, respectively.
|
||||
|
||||
### Dataset
|
||||
|
||||
Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
|
||||
|
||||
```bash
|
||||
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
|
||||
tar xjvf LJSpeech-1.1.tar.bz2
|
||||
```
|
||||
|
||||
In this example, assume that the path of the unzipped LJSpeech dataset is `./data/LJSpeech-1.1`.
|
||||
|
||||
### Train on single GPU
|
||||
|
||||
```bash
|
||||
export CUDA_VISIBLE_DEVICES=0
|
||||
python -u train.py \
|
||||
--config=./configs/waveflow_ljspeech.yaml \
|
||||
--root=./data/LJSpeech-1.1 \
|
||||
--name=${ModelName} --batch_size=4 \
|
||||
--use_gpu=true
|
||||
```
|
||||
|
||||
#### Save and Load checkpoints
|
||||
|
||||
Our model saves model parameters as checkpoints in `./runs/waveflow/${ModelName}/checkpoint/` every 10000 iterations by default, where `${ModelName}` is the name of a single experiment and can be whatever you like.
|
||||
The saved checkpoint will have the format of `step-${iteration_number}.pdparams` for model parameters and `step-${iteration_number}.pdopt` for optimizer parameters.
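For instance, after checkpoints have been saved at iterations 490000 and 500000, the checkpoint directory would contain files like these (a sketch):

```text
./runs/waveflow/${ModelName}/checkpoint/
├── step-490000.pdparams
├── step-490000.pdopt
├── step-500000.pdparams
└── step-500000.pdopt
```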
|
||||
|
||||
There are three ways to load a checkpoint and resume training (for example, to load a checkpoint saved at iteration 500000; a command sketch follows this list):
|
||||
1. Use `--checkpoint=./runs/waveflow/${ModelName}/checkpoint/step-500000` to provide a specific path to load. Note that you only need to provide the base name of the parameter file, which is `step-500000`; no extension (`.pdparams` or `.pdopt`) is needed.
|
||||
2. Use `--iteration=500000`.
|
||||
3. If you don't specify either `--checkpoint` or `--iteration`, the model will automatically load the latest checkpoint in `./runs/waveflow/${ModelName}/checkpoint`.
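As a concrete sketch (the iteration number and `${ModelName}` are placeholders), either of the following resumes training from the 500000-iteration checkpoint:

```bash
export CUDA_VISIBLE_DEVICES=0
# Load an explicit checkpoint path (base name only, no extension).
python -u train.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true \
    --checkpoint=./runs/waveflow/${ModelName}/checkpoint/step-500000

# Or load by iteration number from the default checkpoint directory.
python -u train.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true \
    --iteration=500000
```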
|
||||
|
||||
### Train on multiple GPUs
|
||||
|
||||
```bash
|
||||
export CUDA_VISIBLE_DEVICES=0,1,2,3
|
||||
python -u -m paddle.distributed.launch train.py \
|
||||
--config=./configs/waveflow_ljspeech.yaml \
|
||||
--root=./data/LJSpeech-1.1 \
|
||||
--name=${ModelName} --use_gpu=true
|
||||
```
|
||||
|
||||
Use `export CUDA_VISIBLE_DEVICES=0,1,2,3` to make the GPUs you want to use visible. The `paddle.distributed.launch` module will then use these visible GPUs for data-parallel training in multiprocessing mode.
|
||||
|
||||
### Monitor with Tensorboard
|
||||
|
||||
By default, the logs are saved in `./runs/waveflow/${ModelName}/logs/`. You can monitor logs using TensorBoard.
|
||||
|
||||
```bash
|
||||
tensorboard --logdir=${log_dir} --port=8888
|
||||
```
|
||||
|
||||
### Synthesize from a checkpoint
|
||||
|
||||
Check the [Save and load checkpoint](#save-and-load-checkpoints) section on how to load a specific checkpoint.
|
||||
The following example will automatically load the latest checkpoint:
|
||||
|
||||
```bash
|
||||
export CUDA_VISIBLE_DEVICES=0
|
||||
python -u synthesis.py \
|
||||
--config=./configs/waveflow_ljspeech.yaml \
|
||||
--root=./data/LJSpeech-1.1 \
|
||||
--name=${ModelName} --use_gpu=true \
|
||||
--output=./syn_audios \
|
||||
--sample=${SAMPLE} \
|
||||
--sigma=1.0
|
||||
```
|
||||
|
||||
In this example, `--output` specifies where to save the synthesized audios, and `--sample` (<16) specifies which sample in the valid dataset (a split of the whole LJSpeech dataset that by default contains the first 16 audio samples) to synthesize, based on the mel-spectrograms computed from the ground truth sample audio. For example, `--sample=0` synthesizes the first audio in the valid dataset.
|
||||
|
||||
### Benchmarking
|
||||
|
||||
Use the following example to benchmark the speed of batched speech synthesis; it reports how many times faster than real time the synthesis runs:
|
||||
|
||||
```bash
|
||||
export CUDA_VISIBLE_DEVICES=0
|
||||
python -u benchmark.py \
|
||||
--config=./configs/waveflow_ljspeech.yaml \
|
||||
--root=./data/LJSpeech-1.1 \
|
||||
--name=${ModelName} --use_gpu=true
|
||||
```
|
||||
|
||||
### Low-precision inference
|
||||
|
||||
This model supports float16 low-precision inference. Appending the argument
|
||||
|
||||
```bash
|
||||
--use_fp16=true
|
||||
```
|
||||
|
||||
to the synthesis or benchmarking command enables the faster low-precision inference.
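For example, a sketch of the synthesis command above with float16 inference enabled:

```bash
export CUDA_VISIBLE_DEVICES=0
python -u synthesis.py \
    --config=./configs/waveflow_ljspeech.yaml \
    --root=./data/LJSpeech-1.1 \
    --name=${ModelName} --use_gpu=true \
    --output=./syn_audios \
    --sample=${SAMPLE} \
    --sigma=1.0 \
    --use_fp16=true
```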
|
|
@ -1,103 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import os
|
||||
import random
|
||||
from pprint import pprint
|
||||
|
||||
import argparse
|
||||
import numpy as np
|
||||
import paddle.fluid.dygraph as dg
|
||||
from paddle import fluid
|
||||
|
||||
import utils
|
||||
from parakeet.utils import io
|
||||
from waveflow import WaveFlow
|
||||
|
||||
|
||||
def add_options_to_parser(parser):
|
||||
parser.add_argument(
|
||||
'--model',
|
||||
type=str,
|
||||
default='waveflow',
|
||||
help="general name of the model")
|
||||
parser.add_argument(
|
||||
'--name', type=str, help="specific name of the training model")
|
||||
parser.add_argument(
|
||||
'--root', type=str, help="root path of the LJSpeech dataset")
|
||||
|
||||
parser.add_argument(
|
||||
'--use_gpu',
|
||||
type=utils.str2bool,
|
||||
default=True,
|
||||
help="option to use gpu training")
|
||||
parser.add_argument(
|
||||
'--use_fp16',
|
||||
type=utils.str2bool,
|
||||
default=True,
|
||||
help="option to use fp16 for inference")
|
||||
|
||||
parser.add_argument(
|
||||
'--iteration',
|
||||
type=int,
|
||||
default=None,
|
||||
help=("which iteration of checkpoint to load, "
|
||||
"default to load the latest checkpoint"))
|
||||
parser.add_argument(
|
||||
'--checkpoint',
|
||||
type=str,
|
||||
default=None,
|
||||
help="path of the checkpoint to load")
|
||||
|
||||
|
||||
def benchmark(config):
|
||||
pprint(vars(config))
|
||||
|
||||
# Get checkpoint directory path.
|
||||
run_dir = os.path.join("runs", config.model, config.name)
|
||||
checkpoint_dir = os.path.join(run_dir, "checkpoint")
|
||||
|
||||
# Configure the device.
|
||||
place = fluid.CUDAPlace(0) if config.use_gpu else fluid.CPUPlace()
|
||||
|
||||
with dg.guard(place):
|
||||
# Fix random seed.
|
||||
seed = config.seed
|
||||
random.seed(seed)
|
||||
np.random.seed(seed)
|
||||
fluid.default_startup_program().random_seed = seed
|
||||
fluid.default_main_program().random_seed = seed
|
||||
print("Random Seed: ", seed)
|
||||
|
||||
# Build model.
|
||||
model = WaveFlow(config, checkpoint_dir)
|
||||
model.build(training=False)
|
||||
|
||||
# Run model inference.
|
||||
model.benchmark()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Create parser.
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Synthesize audio using WaveNet model")
|
||||
add_options_to_parser(parser)
|
||||
utils.add_config_options_to_parser(parser)
|
||||
|
||||
# Parse argument from both command line and yaml config file.
|
||||
# For conflicting updates to the same field,
|
||||
# the preceding update will be overwritten by the following one.
|
||||
config = parser.parse_args()
|
||||
config = io.add_yaml_config_to_args(config)
|
||||
benchmark(config)
|
|
@ -1,24 +0,0 @@
|
|||
valid_size: 16
|
||||
segment_length: 16000
|
||||
sample_rate: 22050
|
||||
fft_window_shift: 256
|
||||
fft_window_size: 1024
|
||||
fft_size: 1024
|
||||
mel_bands: 80
|
||||
mel_fmin: 0.0
|
||||
mel_fmax: 8000.0
|
||||
|
||||
seed: 1234
|
||||
learning_rate: 0.0002
|
||||
batch_size: 8
|
||||
test_every: 2000
|
||||
save_every: 10000
|
||||
max_iterations: 3000000
|
||||
|
||||
sigma: 1.0
|
||||
n_flows: 8
|
||||
n_group: 16
|
||||
n_layers: 8
|
||||
n_channels: 64
|
||||
kernel_h: 3
|
||||
kernel_w: 3
|
|
@ -1,144 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import os
|
||||
import random
|
||||
|
||||
import librosa
|
||||
import numpy as np
|
||||
from paddle import fluid
|
||||
|
||||
from parakeet.datasets import ljspeech
|
||||
from parakeet.data import SpecBatcher, WavBatcher
|
||||
from parakeet.data import DataCargo, DatasetMixin
|
||||
from parakeet.data import DistributedSampler, BatchSampler
|
||||
from scipy.io.wavfile import read
|
||||
|
||||
|
||||
class Dataset(ljspeech.LJSpeech):
|
||||
def __init__(self, config):
|
||||
super(Dataset, self).__init__(config.root)
|
||||
self.config = config
|
||||
|
||||
def _get_example(self, metadatum):
|
||||
fname, _, _ = metadatum
|
||||
wav_path = os.path.join(self.root, "wavs", fname + ".wav")
|
||||
|
||||
audio, loaded_sr = librosa.load(wav_path, sr=self.config.sample_rate)
|
||||
|
||||
return audio
|
||||
|
||||
|
||||
class Subset(DatasetMixin):
|
||||
def __init__(self, dataset, indices, valid):
|
||||
self.dataset = dataset
|
||||
self.indices = indices
|
||||
self.valid = valid
|
||||
self.config = dataset.config
|
||||
|
||||
def get_mel(self, audio):
|
||||
spectrogram = librosa.core.stft(
|
||||
audio,
|
||||
n_fft=self.config.fft_size,
|
||||
hop_length=self.config.fft_window_shift,
|
||||
win_length=self.config.fft_window_size)
|
||||
spectrogram_magnitude = np.abs(spectrogram)
|
||||
|
||||
# mel_filter_bank shape: [n_mels, 1 + n_fft/2]
|
||||
mel_filter_bank = librosa.filters.mel(sr=self.config.sample_rate,
|
||||
n_fft=self.config.fft_size,
|
||||
n_mels=self.config.mel_bands,
|
||||
fmin=self.config.mel_fmin,
|
||||
fmax=self.config.mel_fmax)
|
||||
# mel shape: [n_mels, num_frames]
|
||||
mel = np.dot(mel_filter_bank, spectrogram_magnitude)
|
||||
|
||||
# Normalize mel.
|
||||
clip_val = 1e-5
|
||||
ref_constant = 1
|
||||
mel = np.log(np.clip(mel, a_min=clip_val, a_max=None) * ref_constant)
|
||||
|
||||
return mel
|
||||
|
||||
def __getitem__(self, idx):
|
||||
audio = self.dataset[self.indices[idx]]
|
||||
segment_length = self.config.segment_length
|
||||
|
||||
if self.valid:
|
||||
# whole audio for valid set
|
||||
pass
|
||||
else:
|
||||
# Randomly crop segment_length from audios in the training set.
|
||||
# audio shape: [len]
|
||||
if audio.shape[0] >= segment_length:
|
||||
max_audio_start = audio.shape[0] - segment_length
|
||||
audio_start = random.randint(0, max_audio_start)
|
||||
audio = audio[audio_start:(audio_start + segment_length)]
|
||||
else:
|
||||
audio = np.pad(audio, (0, segment_length - audio.shape[0]),
|
||||
mode='constant',
|
||||
constant_values=0)
|
||||
|
||||
mel = self.get_mel(audio)
|
||||
|
||||
return audio, mel
|
||||
|
||||
def _batch_examples(self, batch):
|
||||
audios = [sample[0] for sample in batch]
|
||||
mels = [sample[1] for sample in batch]
|
||||
|
||||
audios = WavBatcher(pad_value=0.0)(audios)
|
||||
mels = SpecBatcher(pad_value=0.0)(mels)
|
||||
|
||||
return audios, mels
|
||||
|
||||
def __len__(self):
|
||||
return len(self.indices)
|
||||
|
||||
|
||||
class LJSpeech:
|
||||
def __init__(self, config, nranks, rank):
|
||||
place = fluid.CUDAPlace(rank) if config.use_gpu else fluid.CPUPlace()
|
||||
|
||||
# Whole LJSpeech dataset.
|
||||
ds = Dataset(config)
|
||||
|
||||
# Split into train and valid dataset.
|
||||
indices = list(range(len(ds)))
|
||||
train_indices = indices[config.valid_size:]
|
||||
valid_indices = indices[:config.valid_size]
|
||||
random.shuffle(train_indices)
|
||||
|
||||
# Train dataset.
|
||||
trainset = Subset(ds, train_indices, valid=False)
|
||||
sampler = DistributedSampler(len(trainset), nranks, rank)
|
||||
total_bs = config.batch_size
|
||||
assert total_bs % nranks == 0
|
||||
train_sampler = BatchSampler(
|
||||
sampler, total_bs // nranks, drop_last=True)
|
||||
trainloader = DataCargo(trainset, batch_sampler=train_sampler)
|
||||
|
||||
trainreader = fluid.io.PyReader(capacity=50, return_list=True)
|
||||
trainreader.decorate_batch_generator(trainloader, place)
|
||||
# iter(int, 1) yields 0 forever, so this wraps the reader in an endless
# generator that restarts it whenever one pass over the data finishes.
self.trainloader = (data for _ in iter(int, 1)
for data in trainreader())
|
||||
|
||||
# Valid dataset.
|
||||
validset = Subset(ds, valid_indices, valid=True)
|
||||
# Currently only support batch_size = 1 for valid loader.
|
||||
validloader = DataCargo(validset, batch_size=1, shuffle=False)
|
||||
|
||||
validreader = fluid.io.PyReader(capacity=20, return_list=True)
|
||||
validreader.decorate_batch_generator(validloader, place)
|
||||
self.validloader = validreader
|
|
@ -1,113 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import os
|
||||
import random
|
||||
from pprint import pprint
|
||||
|
||||
import argparse
|
||||
import numpy as np
|
||||
import paddle.fluid.dygraph as dg
|
||||
from paddle import fluid
|
||||
|
||||
from parakeet.utils import io
|
||||
import utils
|
||||
from waveflow import WaveFlow
|
||||
|
||||
|
||||
def add_options_to_parser(parser):
|
||||
parser.add_argument(
|
||||
'--model',
|
||||
type=str,
|
||||
default='waveflow',
|
||||
help="general name of the model")
|
||||
parser.add_argument(
|
||||
'--name', type=str, help="specific name of the training model")
|
||||
parser.add_argument(
|
||||
'--root', type=str, help="root path of the LJSpeech dataset")
|
||||
|
||||
parser.add_argument(
|
||||
'--use_gpu',
|
||||
type=utils.str2bool,
|
||||
default=True,
|
||||
help="option to use gpu training")
|
||||
parser.add_argument(
|
||||
'--use_fp16',
|
||||
type=utils.str2bool,
|
||||
default=True,
|
||||
help="option to use fp16 for inference")
|
||||
|
||||
parser.add_argument(
|
||||
'--iteration',
|
||||
type=int,
|
||||
default=None,
|
||||
help=("which iteration of checkpoint to load, "
|
||||
"default to load the latest checkpoint"))
|
||||
parser.add_argument(
|
||||
'--checkpoint',
|
||||
type=str,
|
||||
default=None,
|
||||
help="path of the checkpoint to load")
|
||||
|
||||
parser.add_argument(
|
||||
'--output',
|
||||
type=str,
|
||||
default="./syn_audios",
|
||||
help="path to write synthesized audio files")
|
||||
parser.add_argument(
|
||||
'--sample',
|
||||
type=int,
|
||||
default=None,
|
||||
help="which of the valid samples to synthesize audio")
|
||||
|
||||
|
||||
def synthesize(config):
|
||||
pprint(vars(config))
|
||||
|
||||
# Get checkpoint directory path.
|
||||
run_dir = os.path.join("runs", config.model, config.name)
|
||||
checkpoint_dir = os.path.join(run_dir, "checkpoint")
|
||||
|
||||
# Configure the device.
|
||||
place = fluid.CUDAPlace(0) if config.use_gpu else fluid.CPUPlace()
|
||||
|
||||
with dg.guard(place):
|
||||
# Fix random seed.
|
||||
seed = config.seed
|
||||
random.seed(seed)
|
||||
np.random.seed(seed)
|
||||
fluid.default_startup_program().random_seed = seed
|
||||
fluid.default_main_program().random_seed = seed
|
||||
print("Random Seed: ", seed)
|
||||
|
||||
# Build model.
|
||||
model = WaveFlow(config, checkpoint_dir)
|
||||
iteration = model.build(training=False)
|
||||
# Run model inference.
|
||||
model.infer(iteration)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Create parser.
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Synthesize audio using WaveNet model")
|
||||
add_options_to_parser(parser)
|
||||
utils.add_config_options_to_parser(parser)
|
||||
|
||||
# Parse argument from both command line and yaml config file.
|
||||
# For conflicting updates to the same field,
|
||||
# the preceding update will be overwritten by the following one.
|
||||
config = parser.parse_args()
|
||||
config = io.add_yaml_config_to_args(config)
|
||||
synthesize(config)
|
|
@ -1,134 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import os
|
||||
import random
|
||||
import subprocess
|
||||
import time
|
||||
from pprint import pprint
|
||||
|
||||
import argparse
|
||||
import numpy as np
|
||||
import paddle.fluid.dygraph as dg
|
||||
from paddle import fluid
|
||||
from visualdl import LogWriter
|
||||
|
||||
|
||||
import utils
|
||||
from parakeet.utils import io
|
||||
from waveflow import WaveFlow
|
||||
|
||||
|
||||
def add_options_to_parser(parser):
|
||||
parser.add_argument(
|
||||
'--model',
|
||||
type=str,
|
||||
default='waveflow',
|
||||
help="general name of the model")
|
||||
parser.add_argument(
|
||||
'--name', type=str, help="specific name of the training model")
|
||||
parser.add_argument(
|
||||
'--root', type=str, help="root path of the LJSpeech dataset")
|
||||
|
||||
parser.add_argument(
|
||||
'--use_gpu',
|
||||
type=utils.str2bool,
|
||||
default=True,
|
||||
help="option to use gpu training")
|
||||
|
||||
parser.add_argument(
|
||||
'--iteration',
|
||||
type=int,
|
||||
default=None,
|
||||
help=("which iteration of checkpoint to load, "
|
||||
"default to load the latest checkpoint"))
|
||||
parser.add_argument(
|
||||
'--checkpoint',
|
||||
type=str,
|
||||
default=None,
|
||||
help="path of the checkpoint to load")
|
||||
|
||||
|
||||
def train(config):
|
||||
use_gpu = config.use_gpu
|
||||
|
||||
# Get the rank of the current training process.
|
||||
rank = dg.parallel.Env().local_rank
|
||||
nranks = dg.parallel.Env().nranks
|
||||
parallel = nranks > 1
|
||||
|
||||
if rank == 0:
|
||||
# Print the whole config setting.
|
||||
pprint(vars(config))
|
||||
|
||||
# Make checkpoint directory.
|
||||
run_dir = os.path.join("runs", config.model, config.name)
|
||||
checkpoint_dir = os.path.join(run_dir, "checkpoint")
|
||||
if not os.path.exists(checkpoint_dir):
|
||||
os.makedirs(checkpoint_dir)
|
||||
|
||||
# Create VisualDL logger.
|
||||
vdl = LogWriter(os.path.join(run_dir, "logs")) \
|
||||
if rank == 0 else None
|
||||
|
||||
# Configure the device.
|
||||
place = fluid.CUDAPlace(rank) if use_gpu else fluid.CPUPlace()
|
||||
|
||||
with dg.guard(place):
|
||||
# Fix random seed.
|
||||
seed = config.seed
|
||||
random.seed(seed)
|
||||
np.random.seed(seed)
|
||||
fluid.default_startup_program().random_seed = seed
|
||||
fluid.default_main_program().random_seed = seed
|
||||
print("Random Seed: ", seed)
|
||||
|
||||
# Build model.
|
||||
model = WaveFlow(config, checkpoint_dir, parallel, rank, nranks, vdl)
|
||||
iteration = model.build()
|
||||
|
||||
while iteration < config.max_iterations:
|
||||
# Run one single training step.
|
||||
model.train_step(iteration)
|
||||
|
||||
iteration += 1
|
||||
|
||||
if iteration % config.test_every == 0:
|
||||
# Run validation step.
|
||||
model.valid_step(iteration)
|
||||
|
||||
if rank == 0 and iteration % config.save_every == 0:
|
||||
# Save parameters.
|
||||
model.save(iteration)
|
||||
|
||||
# Close the VisualDL log writer.
|
||||
if rank == 0:
|
||||
vdl.close()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Create parser.
|
||||
parser = argparse.ArgumentParser(description="Train WaveFlow model")
|
||||
#formatter_class='default_argparse')
|
||||
add_options_to_parser(parser)
|
||||
utils.add_config_options_to_parser(parser)
|
||||
|
||||
# Parse argument from both command line and yaml config file.
|
||||
# For conflicting updates to the same field,
|
||||
# the preceding update will be overwritten by the following one.
|
||||
config = parser.parse_args()
|
||||
config = io.add_yaml_config_to_args(config)
|
||||
# Force to use fp32 in model training
|
||||
vars(config)["use_fp16"] = False
|
||||
train(config)
|
|
@ -1,90 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import argparse
|
||||
|
||||
|
||||
def str2bool(v):
|
||||
return v.lower() in ("true", "t", "1")
|
||||
|
||||
|
||||
def add_config_options_to_parser(parser):
|
||||
parser.add_argument(
|
||||
'--valid_size', type=int, help="size of the valid dataset")
|
||||
parser.add_argument(
|
||||
'--segment_length',
|
||||
type=int,
|
||||
help="the length of audio clip for training")
|
||||
parser.add_argument(
|
||||
'--sample_rate', type=int, help="sampling rate of audio data file")
|
||||
parser.add_argument(
|
||||
'--fft_window_shift',
|
||||
type=int,
|
||||
help="the shift of fft window for each frame")
|
||||
parser.add_argument(
|
||||
'--fft_window_size',
|
||||
type=int,
|
||||
help="the size of fft window for each frame")
|
||||
parser.add_argument(
|
||||
'--fft_size', type=int, help="the size of fft filter on each frame")
|
||||
parser.add_argument(
|
||||
'--mel_bands',
|
||||
type=int,
|
||||
help="the number of mel bands when calculating mel spectrograms")
|
||||
parser.add_argument(
|
||||
'--mel_fmin',
|
||||
type=float,
|
||||
help="lowest frequency in calculating mel spectrograms")
|
||||
parser.add_argument(
|
||||
'--mel_fmax',
|
||||
type=float,
|
||||
help="highest frequency in calculating mel spectrograms")
|
||||
|
||||
parser.add_argument(
|
||||
'--seed', type=int, help="seed of random initialization for the model")
|
||||
parser.add_argument('--learning_rate', type=float)
|
||||
parser.add_argument(
|
||||
'--batch_size', type=int, help="batch size for training")
|
||||
parser.add_argument(
|
||||
'--test_every', type=int, help="test interval during training")
|
||||
parser.add_argument(
|
||||
'--save_every',
|
||||
type=int,
|
||||
help="checkpointing interval during training")
|
||||
parser.add_argument(
|
||||
'--max_iterations', type=int, help="maximum training iterations")
|
||||
|
||||
parser.add_argument(
|
||||
'--sigma',
|
||||
type=float,
|
||||
help="standard deviation of the latent Gaussian variable")
|
||||
parser.add_argument('--n_flows', type=int, help="number of flows")
|
||||
parser.add_argument(
|
||||
'--n_group',
|
||||
type=int,
|
||||
help="number of adjacent audio samples to squeeze into one column")
|
||||
parser.add_argument(
|
||||
'--n_layers',
|
||||
type=int,
|
||||
help="number of conv2d layer in one wavenet-like flow architecture")
|
||||
parser.add_argument(
|
||||
'--n_channels', type=int, help="number of residual channels in flow")
|
||||
parser.add_argument(
|
||||
'--kernel_h',
|
||||
type=int,
|
||||
help="height of the kernel in the conv2d layer")
|
||||
parser.add_argument(
|
||||
'--kernel_w', type=int, help="width of the kernel in the conv2d layer")
|
||||
|
||||
parser.add_argument('--config', type=str, help="Path to the config file.")
|
|
@ -1,292 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import itertools
|
||||
import os
|
||||
import time
|
||||
|
||||
import numpy as np
|
||||
import paddle.fluid.dygraph as dg
|
||||
from paddle import fluid
|
||||
from scipy.io.wavfile import write
|
||||
|
||||
from parakeet.utils import io
|
||||
from parakeet.modules import weight_norm
|
||||
from parakeet.models.waveflow import WaveFlowLoss, WaveFlowModule
|
||||
from data import LJSpeech
|
||||
import utils
|
||||
|
||||
|
||||
class WaveFlow():
|
||||
"""Wrapper class of WaveFlow model that supports multiple APIs.
|
||||
|
||||
This module provides APIs for model building, training, validation,
|
||||
inference, benchmarking, and saving.
|
||||
|
||||
Args:
|
||||
config (obj): config info.
|
||||
checkpoint_dir (str): path for checkpointing.
|
||||
parallel (bool, optional): whether use multiple GPUs for training.
|
||||
Defaults to False.
|
||||
rank (int, optional): the rank of the process in a multi-process
|
||||
scenario. Defaults to 0.
|
||||
nranks (int, optional): the total number of processes. Defaults to 1.
|
||||
vdl_logger (obj, optional): logger to visualize metrics.
|
||||
Defaults to None.
|
||||
|
||||
Returns:
|
||||
WaveFlow
|
||||
"""
|
||||
|
||||
def __init__(self,
|
||||
config,
|
||||
checkpoint_dir,
|
||||
parallel=False,
|
||||
rank=0,
|
||||
nranks=1,
|
||||
vdl_logger=None):
|
||||
self.config = config
|
||||
self.checkpoint_dir = checkpoint_dir
|
||||
self.parallel = parallel
|
||||
self.rank = rank
|
||||
self.nranks = nranks
|
||||
self.vdl_logger = vdl_logger
|
||||
self.dtype = "float16" if config.use_fp16 else "float32"
|
||||
|
||||
def build(self, training=True):
|
||||
"""Initialize the model.
|
||||
|
||||
Args:
|
||||
training (bool, optional): Whether the model is built for training or inference.
|
||||
Defaults to True.
|
||||
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
config = self.config
|
||||
dataset = LJSpeech(config, self.nranks, self.rank)
|
||||
self.trainloader = dataset.trainloader
|
||||
self.validloader = dataset.validloader
|
||||
|
||||
waveflow = WaveFlowModule(config)
|
||||
|
||||
if training:
|
||||
optimizer = fluid.optimizer.AdamOptimizer(
|
||||
learning_rate=config.learning_rate,
|
||||
parameter_list=waveflow.parameters())
|
||||
|
||||
# Load parameters.
|
||||
iteration = io.load_parameters(
|
||||
model=waveflow,
|
||||
optimizer=optimizer,
|
||||
checkpoint_dir=self.checkpoint_dir,
|
||||
iteration=config.iteration,
|
||||
checkpoint_path=config.checkpoint)
|
||||
print("Rank {}: checkpoint loaded.".format(self.rank))
|
||||
|
||||
# Data parallelism.
|
||||
if self.parallel:
|
||||
strategy = dg.parallel.prepare_context()
|
||||
waveflow = dg.parallel.DataParallel(waveflow, strategy)
|
||||
|
||||
self.waveflow = waveflow
|
||||
self.optimizer = optimizer
|
||||
self.criterion = WaveFlowLoss(config.sigma)
|
||||
|
||||
else:
|
||||
# Load parameters.
|
||||
iteration = io.load_parameters(
|
||||
model=waveflow,
|
||||
checkpoint_dir=self.checkpoint_dir,
|
||||
iteration=config.iteration,
|
||||
checkpoint_path=config.checkpoint)
|
||||
print("Rank {}: checkpoint loaded.".format(self.rank))
|
||||
|
||||
for layer in waveflow.sublayers():
|
||||
if isinstance(layer, weight_norm.WeightNormWrapper):
|
||||
layer.remove_weight_norm()
|
||||
|
||||
self.waveflow = waveflow
|
||||
|
||||
return iteration
|
||||
|
||||
def train_step(self, iteration):
|
||||
"""Train the model for one step.
|
||||
|
||||
Args:
|
||||
iteration (int): current iteration number.
|
||||
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
self.waveflow.train()
|
||||
|
||||
start_time = time.time()
|
||||
audios, mels = next(self.trainloader)
|
||||
load_time = time.time()
|
||||
|
||||
outputs = self.waveflow(audios, mels)
|
||||
loss = self.criterion(outputs)
|
||||
|
||||
if self.parallel:
|
||||
# loss = loss / num_trainers
|
||||
loss = self.waveflow.scale_loss(loss)
|
||||
loss.backward()
|
||||
self.waveflow.apply_collective_grads()
|
||||
else:
|
||||
loss.backward()
|
||||
|
||||
self.optimizer.minimize(
|
||||
loss, parameter_list=self.waveflow.parameters())
|
||||
self.waveflow.clear_gradients()
|
||||
|
||||
graph_time = time.time()
|
||||
|
||||
if self.rank == 0:
|
||||
loss_val = float(loss.numpy()) * self.nranks
|
||||
log = "Rank: {} Step: {:^8d} Loss: {:<8.3f} " \
|
||||
"Time: {:.3f}/{:.3f}".format(
|
||||
self.rank, iteration, loss_val,
|
||||
load_time - start_time, graph_time - load_time)
|
||||
print(log)
|
||||
|
||||
vdl_writer = self.vdl_logger
|
||||
vdl_writer.add_scalar("Train-Loss-Rank-0", loss_val, iteration)
|
||||
|
||||
@dg.no_grad
|
||||
def valid_step(self, iteration):
|
||||
"""Run the model on the validation dataset.
|
||||
|
||||
Args:
|
||||
iteration (int): current iteration number.
|
||||
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
self.waveflow.eval()
|
||||
vdl_writer = self.vdl_logger
|
||||
|
||||
total_loss = []
|
||||
sample_audios = []
|
||||
start_time = time.time()
|
||||
|
||||
for i, batch in enumerate(self.validloader()):
|
||||
audios, mels = batch
|
||||
valid_outputs = self.waveflow(audios, mels)
|
||||
valid_z, valid_log_s_list = valid_outputs
|
||||
|
||||
# Visualize latent z and scale log_s.
|
||||
if self.rank == 0 and i == 0:
|
||||
vdl_writer.add_histogram("Valid-Latent_z", valid_z.numpy(),
|
||||
iteration)
|
||||
for j, valid_log_s in enumerate(valid_log_s_list):
|
||||
hist_name = "Valid-{}th-Flow-Log_s".format(j)
|
||||
vdl_writer.add_histogram(hist_name, valid_log_s.numpy(),
|
||||
iteration)
|
||||
|
||||
valid_loss = self.criterion(valid_outputs)
|
||||
total_loss.append(float(valid_loss.numpy()))
|
||||
|
||||
total_time = time.time() - start_time
|
||||
if self.rank == 0:
|
||||
loss_val = np.mean(total_loss)
|
||||
log = "Test | Rank: {} AvgLoss: {:<8.3f} Time {:<8.3f}".format(
|
||||
self.rank, loss_val, total_time)
|
||||
print(log)
|
||||
vdl_writer.add_scalar("Valid-Avg-Loss", loss_val, iteration)
|
||||
|
||||
@dg.no_grad
|
||||
def infer(self, iteration):
|
||||
"""Run the model to synthesize audios.
|
||||
|
||||
Args:
|
||||
iteration (int): iteration number of the loaded checkpoint.
|
||||
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
self.waveflow.eval()
|
||||
|
||||
config = self.config
|
||||
sample = config.sample
|
||||
|
||||
output = "{}/{}/iter-{}".format(config.output, config.name, iteration)
|
||||
if not os.path.exists(output):
|
||||
os.makedirs(output)
|
||||
|
||||
mels_list = [mels for _, mels in self.validloader()]
|
||||
if sample is not None:
|
||||
mels_list = [mels_list[sample]]
|
||||
else:
|
||||
sample = 0
|
||||
|
||||
for idx, mel in enumerate(mels_list):
|
||||
abs_idx = sample + idx
|
||||
filename = "{}/valid_{}.wav".format(output, abs_idx)
|
||||
print("Synthesize sample {}, save as {}".format(abs_idx, filename))
|
||||
|
||||
start_time = time.time()
|
||||
audio = self.waveflow.synthesize(mel, sigma=self.config.sigma)
|
||||
syn_time = time.time() - start_time
|
||||
|
||||
audio = audio[0]
|
||||
audio_time = audio.shape[0] / self.config.sample_rate
|
||||
print("audio time {:.4f}, synthesis time {:.4f}".format(audio_time,
|
||||
syn_time))
|
||||
|
||||
# Denormalize audio from [-1, 1] to the int16 range.
|
||||
audio = audio.numpy().astype("float32") * 32768.0
|
||||
audio = audio.astype('int16')
|
||||
write(filename, config.sample_rate, audio)
|
||||
|
||||
@dg.no_grad
|
||||
def benchmark(self):
|
||||
"""Run the model to benchmark synthesis speed.
|
||||
|
||||
Args:
|
||||
None
|
||||
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
self.waveflow.eval()
|
||||
|
||||
mels_list = [mels for _, mels in self.validloader()]
|
||||
mel = fluid.layers.concat(mels_list, axis=2)
|
||||
mel = mel[:, :, :864]
|
||||
batch_size = 8
|
||||
mel = fluid.layers.expand(mel, [batch_size, 1, 1])
|
||||
|
||||
for i in range(10):
|
||||
start_time = time.time()
|
||||
audio = self.waveflow.synthesize(mel, sigma=self.config.sigma)
|
||||
print("audio.shape = ", audio.shape)
|
||||
syn_time = time.time() - start_time
|
||||
|
||||
audio_time = audio.shape[1] * batch_size / self.config.sample_rate
|
||||
print("audio time {:.4f}, synthesis time {:.4f}".format(audio_time,
|
||||
syn_time))
|
||||
print("{} X real-time".format(audio_time / syn_time))
|
||||
|
||||
def save(self, iteration):
|
||||
"""Save model checkpoint.
|
||||
|
||||
Args:
|
||||
iteration (int): iteration number of the model to be saved.
|
||||
|
||||
Returns:
|
||||
None
|
||||
"""
|
||||
io.save_parameters(self.checkpoint_dir, iteration, self.waveflow,
|
||||
self.optimizer)
|
|
@ -1,144 +0,0 @@
|
|||
# WaveNet
|
||||
|
||||
PaddlePaddle dynamic graph implementation of WaveNet, a convolutional network based vocoder. WaveNet is originally proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499). However, in this experiment, the implementation follows the teacher model in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).
|
||||
|
||||
|
||||
## Dataset
|
||||
|
||||
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
|
||||
|
||||
```bash
|
||||
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
|
||||
tar xjvf LJSpeech-1.1.tar.bz2
|
||||
```
|
||||
|
||||
## Project Structure
|
||||
|
||||
```text
|
||||
├── data.py data_processing
|
||||
├── configs/ (example) configuration file
|
||||
├── synthesis.py script to synthesize waveform from mel_spectrogram
|
||||
├── train.py script to train a model
|
||||
└── utils.py utility functions
|
||||
```
|
||||
|
||||
## Saving & Loading
|
||||
`train.py` and `synthesis.py` have 3 arguments in common: `--checkpoint`, `--iteration` and `output`.
|
||||
|
||||
1. `output` is the directory for saving results.
|
||||
During training, checkpoints are saved in `checkpoints/` in `output` and the tensorboard log is saved in `log/` in `output`. Other possible outputs are saved in `states/` in `output`.
|
||||
During synthesis, audio files and other possible outputs are saved in `synthesis/` in `output`.
|
||||
So after training and synthesizing with the same output directory, the file structure of the output directory looks like this.
|
||||
|
||||
```text
|
||||
├── checkpoints/ # checkpoint directory (including *.pdparams, *.pdopt and a text file `checkpoint` that records the latest checkpoint)
|
||||
├── states/ # audio files generated at validation and other possible outputs
|
||||
├── log/ # tensorboard log
|
||||
└── synthesis/ # synthesized audio files and other possible outputs
|
||||
```
|
||||
|
||||
2. `--checkpoint` and `--iteration` are used to load from an existing checkpoint. Checkpoint loading follows these rules (see the example commands right after this list):
|
||||
If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
|
||||
If `--checkpoint` is not provided, we try to load the model specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided either, we try to load the latest checkpoint from the checkpoint directory.
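For example, the following sketches (the iteration number is a placeholder) resume training either from an explicit checkpoint path or from an iteration recorded in the output directory:

```bash
# Resume from an explicit checkpoint (base name, no extension).
python train.py \
    --config=./configs/wavenet_single_gaussian.yaml \
    --data=./LJSpeech-1.1/ \
    --device=0 \
    --checkpoint="experiment/checkpoints/step-500000" \
    experiment

# Or resume from a given iteration in the output directory.
python train.py \
    --config=./configs/wavenet_single_gaussian.yaml \
    --data=./LJSpeech-1.1/ \
    --device=0 \
    --iteration=500000 \
    experiment
```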
|
||||
|
||||
## Train
|
||||
|
||||
Train the model using train.py. For help on usage, try `python train.py --help`.
|
||||
|
||||
```text
|
||||
usage: train.py [-h] [--data DATA] [--config CONFIG] [--device DEVICE]
|
||||
[--checkpoint CHECKPOINT | --iteration ITERATION]
|
||||
output
|
||||
|
||||
Train a WaveNet model with LJSpeech.
|
||||
|
||||
positional arguments:
|
||||
output path to save results
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
--data DATA path of the LJspeech dataset
|
||||
--config CONFIG path of the config file
|
||||
--device DEVICE device to use
|
||||
--checkpoint CHECKPOINT checkpoint to resume from
|
||||
--iteration ITERATION the iteration of the checkpoint to load from output directory
|
||||
```
|
||||
|
||||
- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
|
||||
- `--config` is the configuration file to use. The provided configurations can be used directly, and you can also change some values in the configuration file to train the model with a different config.
|
||||
- `--device` is the device (gpu id) to use for training. `-1` means CPU.
|
||||
|
||||
- `--checkpoint` is the path of the checkpoint.
|
||||
- `--iteration` is the iteration of the checkpoint to load from output directory.
|
||||
- `output` is the directory to save results; all results are saved in this directory.
|
||||
|
||||
See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
|
||||
|
||||
|
||||
Example script:
|
||||
|
||||
```bash
|
||||
python train.py \
|
||||
--config=./configs/wavenet_single_gaussian.yaml \
|
||||
--data=./LJSpeech-1.1/ \
|
||||
--device=0 \
|
||||
experiment
|
||||
```
|
||||
|
||||
You can monitor the training log via TensorBoard using the commands below.
|
||||
|
||||
```bash
|
||||
cd experiment/log
|
||||
tensorboard --logdir=.
|
||||
```
|
||||
|
||||
## Synthesis

Synthesize audio from the validation set using `synthesis.py`. For help on usage, try `python synthesis.py --help`.

```text
usage: synthesis.py [-h] [--data DATA] [--config CONFIG] [--device DEVICE]
                    [--checkpoint CHECKPOINT | --iteration ITERATION]
                    output

Synthesize valid data from LJspeech with a wavenet model.

positional arguments:
  output                   path to save the synthesized audio

optional arguments:
  -h, --help               show this help message and exit
  --data DATA              path of the LJspeech dataset
  --config CONFIG          path of the config file
  --device DEVICE          device to use
  --checkpoint CHECKPOINT  checkpoint to resume from
  --iteration ITERATION    the iteration of the checkpoint to load from output directory
```

- `--data` is the path of the LJSpeech dataset. In principle, a dataset is not needed for synthesis, but since the input is a mel spectrogram, we need to compute mel spectrograms from the audio files.
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
- `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
- `--checkpoint` is the checkpoint to load.
- `--iteration` is the iteration of the checkpoint to load from the output directory.
- `output` is the directory to save synthesized audio. Audio files are saved in `synthesis/` inside the `output` directory.

See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.

Example script:

```bash
python synthesis.py \
    --config=./configs/wavenet_single_gaussian.yaml \
    --data=./LJSpeech-1.1/ \
    --device=0 \
    --checkpoint="experiment/checkpoints/step-1000000" \
    experiment
```

or

```bash
python synthesis.py \
    --config=./configs/wavenet_single_gaussian.yaml \
    --data=./LJSpeech-1.1/ \
    --device=0 \
    --iteration=1000000 \
    experiment
```

@@ -1,36 +0,0 @@
data:
  batch_size: 16
  train_clip_seconds: 0.5
  sample_rate: 22050
  hop_length: 256
  win_length: 1024
  n_fft: 2048
  n_mels: 80
  valid_size: 16

model:
  upsampling_factors: [16, 16]
  n_loop: 10
  n_layer: 3
  filter_size: 2
  residual_channels: 128
  loss_type: "mog"
  output_dim: 30
  log_scale_min: -9

train:
  learning_rate: 0.001
  anneal_rate: 0.5
  anneal_interval: 200000
  gradient_max_norm: 100.0

  checkpoint_interval: 10000
  snap_interval: 10000
  eval_interval: 10000

  max_iterations: 2000000

@@ -1,36 +0,0 @@
data:
  batch_size: 16
  train_clip_seconds: 0.5
  sample_rate: 22050
  hop_length: 256
  win_length: 1024
  n_fft: 2048
  n_mels: 80
  valid_size: 16

model:
  upsampling_factors: [16, 16]
  n_loop: 10
  n_layer: 3
  filter_size: 2
  residual_channels: 128
  loss_type: "mog"
  output_dim: 3
  log_scale_min: -9

train:
  learning_rate: 0.001
  anneal_rate: 0.5
  anneal_interval: 200000
  gradient_max_norm: 100.0

  checkpoint_interval: 10000
  snap_interval: 10000
  eval_interval: 10000

  max_iterations: 2000000

@@ -1,36 +0,0 @@
data:
  batch_size: 16
  train_clip_seconds: 0.5
  sample_rate: 22050
  hop_length: 256
  win_length: 1024
  n_fft: 2048
  n_mels: 80
  valid_size: 16

model:
  upsampling_factors: [16, 16]
  n_loop: 10
  n_layer: 3
  filter_size: 2
  residual_channels: 128
  loss_type: "softmax"
  output_dim: 2048
  log_scale_min: -9

train:
  learning_rate: 0.001
  anneal_rate: 0.5
  anneal_interval: 200000
  gradient_max_norm: 100.0

  checkpoint_interval: 10000
  snap_interval: 10000
  eval_interval: 10000

  max_iterations: 2000000
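
The three configurations above differ only in the output distribution. Reading them together (this interpretation is an assumption based on the config values, not stated in the files themselves): with `loss_type: "mog"`, `output_dim` is three times the number of Gaussian components, so `30` corresponds to a 10-component mixture and `3` to a single Gaussian, while `loss_type: "softmax"` predicts logits over `output_dim` (here 2048) quantized amplitude levels. A minimal sketch of that bookkeeping; the helper below is illustrative and not part of the repository:

```python
def mog_output_dim(n_components):
    # each Gaussian component contributes a mixture weight, a mean and a log scale
    return 3 * n_components

assert mog_output_dim(10) == 30   # mixture-of-Gaussians config above
assert mog_output_dim(1) == 3     # single-Gaussian config above
softmax_output_dim = 2048         # quantization channels in the softmax config above
```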

@@ -1,164 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import division
import csv
import numpy as np
import librosa
from pathlib import Path
import pandas as pd

from parakeet.data import batch_spec, batch_wav
from parakeet.data import DatasetMixin


class LJSpeechMetaData(DatasetMixin):
    def __init__(self, root):
        self.root = Path(root)
        self._wav_dir = self.root.joinpath("wavs")
        csv_path = self.root.joinpath("metadata.csv")
        self._table = pd.read_csv(
            csv_path,
            sep="|",
            header=None,
            quoting=csv.QUOTE_NONE,
            names=["fname", "raw_text", "normalized_text"])

    def get_example(self, i):
        fname, raw_text, normalized_text = self._table.iloc[i]
        fname = str(self._wav_dir.joinpath(fname + ".wav"))
        return fname, raw_text, normalized_text

    def __len__(self):
        return len(self._table)


class Transform(object):
    def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels):
        self.sample_rate = sample_rate
        self.n_fft = n_fft
        self.win_length = win_length
        self.hop_length = hop_length
        self.n_mels = n_mels

    def __call__(self, example):
        wav_path, _, _ = example

        sr = self.sample_rate
        n_fft = self.n_fft
        win_length = self.win_length
        hop_length = self.hop_length
        n_mels = self.n_mels

        wav, loaded_sr = librosa.load(wav_path, sr=None)
        assert loaded_sr == sr, "audio sample rate does not match the configured sample rate"

        # Pad audio to the right size.
        frames = int(np.ceil(float(wav.size) / hop_length))
        fft_padding = (n_fft - hop_length) // 2  # extra samples on each side for the FFT window
        desired_length = frames * hop_length + fft_padding * 2
        pad_amount = (desired_length - wav.size) // 2

        if wav.size % 2 == 0:
            wav = np.pad(wav, (pad_amount, pad_amount), mode='reflect')
        else:
            wav = np.pad(wav, (pad_amount, pad_amount + 1), mode='reflect')

        # Normalize audio.
        wav = wav / np.abs(wav).max() * 0.999

        # Compute the spectrogram.
        # Turn center to False to prevent internal padding.
        spectrogram = librosa.core.stft(
            wav,
            hop_length=hop_length,
            win_length=win_length,
            n_fft=n_fft,
            center=False)
        spectrogram_magnitude = np.abs(spectrogram)

        # Apply the mel filter bank.
        mel_filter_bank = librosa.filters.mel(sr=sr,
                                              n_fft=n_fft,
                                              n_mels=n_mels)
        mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)

        # Rescale mel_spectrogram.
        min_level, ref_level = 1e-5, 20  # hard-coded dB floor and reference level
        mel_spectrogram = 20 * np.log10(np.maximum(min_level, mel_spectrogram))
        mel_spectrogram = mel_spectrogram - ref_level
        mel_spectrogram = np.clip((mel_spectrogram + 100) / 100, 0, 1)

        # Extract the center of audio that corresponds to mel spectrograms.
        audio = wav[fft_padding:-fft_padding]
        assert mel_spectrogram.shape[1] * hop_length == audio.size

        # there is no clipping here
        return audio, mel_spectrogram


class DataCollector(object):
    def __init__(self,
                 context_size,
                 sample_rate,
                 hop_length,
                 train_clip_seconds,
                 valid=False):
        frames_per_second = sample_rate // hop_length
        train_clip_frames = int(
            np.ceil(train_clip_seconds * frames_per_second))
        context_frames = context_size // hop_length
        self.num_frames = train_clip_frames + context_frames

        self.sample_rate = sample_rate
        self.hop_length = hop_length
        self.valid = valid

    def random_crop(self, sample):
        audio, mel_spectrogram = sample
        audio_frames = int(audio.size) // self.hop_length
        max_start_frame = audio_frames - self.num_frames
        assert max_start_frame >= 0, "audio is too short to be cropped"

        frame_start = np.random.randint(0, max_start_frame)
        # frame_start = 0  # norandom
        frame_end = frame_start + self.num_frames

        audio_start = frame_start * self.hop_length
        audio_end = frame_end * self.hop_length

        audio = audio[audio_start:audio_end]
        return audio, mel_spectrogram, audio_start

    def __call__(self, samples):
        # transform them first
        if self.valid:
            samples = [(audio, mel_spectrogram, 0)
                       for audio, mel_spectrogram in samples]
        else:
            samples = [self.random_crop(sample) for sample in samples]
        # batch them
        audios = [sample[0] for sample in samples]
        audio_starts = [sample[2] for sample in samples]
        mels = [sample[1] for sample in samples]

        mels = batch_spec(mels)

        if self.valid:
            audios = batch_wav(audios, dtype=np.float32)
        else:
            audios = np.array(audios, dtype=np.float32)
        audio_starts = np.array(audio_starts, dtype=np.int64)
        return audios, mels, audio_starts
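
A rough usage sketch of the pieces defined above (hypothetical; the dataset path is a placeholder and the hyperparameters are copied from the configs): `Transform` maps a metadata record to an `(audio, mel_spectrogram)` pair, and `DataCollector` crops and batches such pairs.

```python
# hypothetical usage of LJSpeechMetaData, Transform and DataCollector
meta = LJSpeechMetaData("./LJSpeech-1.1")               # placeholder dataset path
transform = Transform(sample_rate=22050, n_fft=2048,
                      win_length=1024, hop_length=256, n_mels=80)
audio, mel = transform(meta.get_example(0))
assert mel.shape[1] * 256 == audio.size                 # frames * hop_length samples

# receptive field of the WaveNet defined by the configs above
n_loop, n_layer, filter_size = 10, 3, 2
context_size = 1 + n_layer * sum(filter_size**i for i in range(n_loop))  # 3070

collate = DataCollector(context_size, sample_rate=22050,
                        hop_length=256, train_clip_seconds=0.5)
audio_clips, mel_specs, audio_starts = collate([(audio, mel), (audio, mel)])
```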

@@ -1,152 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import division
import os
import ruamel.yaml
import argparse
from tqdm import tqdm
from paddle import fluid
fluid.require_version('1.8.0')
import paddle.fluid.dygraph as dg

from parakeet.modules.weight_norm import WeightNormWrapper
from parakeet.data import SliceDataset, TransformDataset, DataCargo, SequentialSampler, RandomSampler
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWavenet
from parakeet.utils.layer_tools import summary
from parakeet.utils import io

from data import LJSpeechMetaData, Transform, DataCollector
from utils import make_output_tree, valid_model, eval_model

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Synthesize valid data from LJspeech with a wavenet model.")
    parser.add_argument(
        "--data", type=str, help="path of the LJspeech dataset")
    parser.add_argument("--config", type=str, help="path of the config file")
    parser.add_argument("--device", type=int, default=-1, help="device to use")

    g = parser.add_mutually_exclusive_group()
    g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
    g.add_argument(
        "--iteration",
        type=int,
        help="the iteration of the checkpoint to load from output directory")

    parser.add_argument(
        "output",
        type=str,
        default="experiment",
        help="path to save the synthesized audio")

    args = parser.parse_args()
    with open(args.config, 'rt') as f:
        config = ruamel.yaml.safe_load(f)

    if args.device == -1:
        place = fluid.CPUPlace()
    else:
        place = fluid.CUDAPlace(args.device)

    dg.enable_dygraph(place)

    ljspeech_meta = LJSpeechMetaData(args.data)

    data_config = config["data"]
    sample_rate = data_config["sample_rate"]
    n_fft = data_config["n_fft"]
    win_length = data_config["win_length"]
    hop_length = data_config["hop_length"]
    n_mels = data_config["n_mels"]
    train_clip_seconds = data_config["train_clip_seconds"]
    transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
    ljspeech = TransformDataset(ljspeech_meta, transform)

    valid_size = data_config["valid_size"]
    ljspeech_valid = SliceDataset(ljspeech, 0, valid_size)
    ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech))

    model_config = config["model"]
    n_loop = model_config["n_loop"]
    n_layer = model_config["n_layer"]
    filter_size = model_config["filter_size"]
    context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
    print("context size is {} samples".format(context_size))
    train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
                                   train_clip_seconds)
    valid_batch_fn = DataCollector(
        context_size, sample_rate, hop_length, train_clip_seconds, valid=True)

    batch_size = data_config["batch_size"]
    train_cargo = DataCargo(
        ljspeech_train,
        train_batch_fn,
        batch_size,
        sampler=RandomSampler(ljspeech_train))

    # only batch=1 for validation is enabled
    valid_cargo = DataCargo(
        ljspeech_valid,
        valid_batch_fn,
        batch_size=1,
        sampler=SequentialSampler(ljspeech_valid))

    if not os.path.exists(args.output):
        os.makedirs(args.output)

    model_config = config["model"]
    upsampling_factors = model_config["upsampling_factors"]
    encoder = UpsampleNet(upsampling_factors)

    n_loop = model_config["n_loop"]
    n_layer = model_config["n_layer"]
    residual_channels = model_config["residual_channels"]
    output_dim = model_config["output_dim"]
    loss_type = model_config["loss_type"]
    log_scale_min = model_config["log_scale_min"]
    decoder = WaveNet(n_loop, n_layer, residual_channels, output_dim, n_mels,
                      filter_size, loss_type, log_scale_min)

    model = ConditionalWavenet(encoder, decoder)
    summary(model)

    # load model parameters
    checkpoint_dir = os.path.join(args.output, "checkpoints")
    if args.checkpoint:
        iteration = io.load_parameters(model, checkpoint_path=args.checkpoint)
    else:
        iteration = io.load_parameters(
            model, checkpoint_dir=checkpoint_dir, iteration=args.iteration)
    assert iteration > 0, "A trained model is needed."

    # WARNING: don't forget to remove weight norm to re-compute each wrapped layer's weight
    # removing weight norm also speeds up computation
    for layer in model.sublayers():
        if isinstance(layer, WeightNormWrapper):
            layer.remove_weight_norm()

    train_loader = fluid.io.DataLoader.from_generator(
        capacity=10, return_list=True)
    train_loader.set_batch_generator(train_cargo, place)

    valid_loader = fluid.io.DataLoader.from_generator(
        capacity=10, return_list=True)
    valid_loader.set_batch_generator(valid_cargo, place)

    synthesis_dir = os.path.join(args.output, "synthesis")
    if not os.path.exists(synthesis_dir):
        os.makedirs(synthesis_dir)

    eval_model(model, valid_loader, synthesis_dir, iteration, sample_rate)

@@ -1,201 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import division
import os
import ruamel.yaml
import argparse
import tqdm
from visualdl import LogWriter
from paddle import fluid
fluid.require_version('1.8.0')
import paddle.fluid.dygraph as dg

from parakeet.data import SliceDataset, TransformDataset, CacheDataset, DataCargo, SequentialSampler, RandomSampler
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWavenet
from parakeet.utils.layer_tools import summary
from parakeet.utils import io

from data import LJSpeechMetaData, Transform, DataCollector
from utils import make_output_tree, valid_model

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Train a WaveNet model with LJSpeech.")
    parser.add_argument(
        "--data", type=str, help="path of the LJspeech dataset")
    parser.add_argument("--config", type=str, help="path of the config file")
    parser.add_argument("--device", type=int, default=-1, help="device to use")

    g = parser.add_mutually_exclusive_group()
    g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
    g.add_argument(
        "--iteration",
        type=int,
        help="the iteration of the checkpoint to load from output directory")

    parser.add_argument(
        "output", type=str, default="experiment", help="path to save results")

    args = parser.parse_args()
    with open(args.config, 'rt') as f:
        config = ruamel.yaml.safe_load(f)

    if args.device == -1:
        place = fluid.CPUPlace()
    else:
        place = fluid.CUDAPlace(args.device)

    dg.enable_dygraph(place)

    print("Command Line Args: ")
    for k, v in vars(args).items():
        print("{}: {}".format(k, v))

    ljspeech_meta = LJSpeechMetaData(args.data)

    data_config = config["data"]
    sample_rate = data_config["sample_rate"]
    n_fft = data_config["n_fft"]
    win_length = data_config["win_length"]
    hop_length = data_config["hop_length"]
    n_mels = data_config["n_mels"]
    train_clip_seconds = data_config["train_clip_seconds"]
    transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels)
    ljspeech = TransformDataset(ljspeech_meta, transform)

    valid_size = data_config["valid_size"]
    ljspeech_valid = CacheDataset(SliceDataset(ljspeech, 0, valid_size))
    ljspeech_train = CacheDataset(
        SliceDataset(ljspeech, valid_size, len(ljspeech)))

    model_config = config["model"]
    n_loop = model_config["n_loop"]
    n_layer = model_config["n_layer"]
    filter_size = model_config["filter_size"]
    context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)])
    print("context size is {} samples".format(context_size))
    train_batch_fn = DataCollector(context_size, sample_rate, hop_length,
                                   train_clip_seconds)
    valid_batch_fn = DataCollector(
        context_size, sample_rate, hop_length, train_clip_seconds, valid=True)

    batch_size = data_config["batch_size"]
    train_cargo = DataCargo(
        ljspeech_train,
        train_batch_fn,
        batch_size,
        sampler=RandomSampler(ljspeech_train))

    # only batch=1 for validation is enabled
    valid_cargo = DataCargo(
        ljspeech_valid,
        valid_batch_fn,
        batch_size=1,
        sampler=SequentialSampler(ljspeech_valid))

    make_output_tree(args.output)

    if args.device == -1:
        place = fluid.CPUPlace()
    else:
        place = fluid.CUDAPlace(args.device)

    model_config = config["model"]
    upsampling_factors = model_config["upsampling_factors"]
    encoder = UpsampleNet(upsampling_factors)

    n_loop = model_config["n_loop"]
    n_layer = model_config["n_layer"]
    residual_channels = model_config["residual_channels"]
    output_dim = model_config["output_dim"]
    loss_type = model_config["loss_type"]
    log_scale_min = model_config["log_scale_min"]
    decoder = WaveNet(n_loop, n_layer, residual_channels, output_dim, n_mels,
                      filter_size, loss_type, log_scale_min)

    model = ConditionalWavenet(encoder, decoder)
    summary(model)

    train_config = config["train"]
    learning_rate = train_config["learning_rate"]
    anneal_rate = train_config["anneal_rate"]
    anneal_interval = train_config["anneal_interval"]
    lr_scheduler = dg.ExponentialDecay(
        learning_rate, anneal_interval, anneal_rate, staircase=True)
    gradient_max_norm = train_config["gradient_max_norm"]
    optim = fluid.optimizer.Adam(
        lr_scheduler,
        parameter_list=model.parameters(),
        grad_clip=fluid.clip.ClipByGlobalNorm(gradient_max_norm))

    train_loader = fluid.io.DataLoader.from_generator(
        capacity=10, return_list=True)
    train_loader.set_batch_generator(train_cargo, place)

    valid_loader = fluid.io.DataLoader.from_generator(
        capacity=10, return_list=True)
    valid_loader.set_batch_generator(valid_cargo, place)

    max_iterations = train_config["max_iterations"]
    checkpoint_interval = train_config["checkpoint_interval"]
    snap_interval = train_config["snap_interval"]
    eval_interval = train_config["eval_interval"]
    checkpoint_dir = os.path.join(args.output, "checkpoints")
    log_dir = os.path.join(args.output, "log")
    writer = LogWriter(log_dir)

    # load parameters and optimizer, and update iterations done so far
    if args.checkpoint is not None:
        iteration = io.load_parameters(
            model, optim, checkpoint_path=args.checkpoint)
    else:
        iteration = io.load_parameters(
            model,
            optim,
            checkpoint_dir=checkpoint_dir,
            iteration=args.iteration)

    global_step = iteration + 1
    iterator = iter(tqdm.tqdm(train_loader))
    while global_step <= max_iterations:
        try:
            batch = next(iterator)
        except StopIteration:
            iterator = iter(tqdm.tqdm(train_loader))
            batch = next(iterator)

        audio_clips, mel_specs, audio_starts = batch

        model.train()
        y_var = model(audio_clips, mel_specs, audio_starts)
        loss_var = model.loss(y_var, audio_clips)
        loss_var.backward()
        loss_np = loss_var.numpy()

        writer.add_scalar("loss", loss_np[0], global_step)
        writer.add_scalar("learning_rate",
                          optim._learning_rate.step().numpy()[0], global_step)
        optim.minimize(loss_var)
        optim.clear_gradients()
        print("global_step: {}\tloss: {:<8.6f}".format(global_step, loss_np[0]))

        if global_step % snap_interval == 0:
            valid_model(model, valid_loader, writer, global_step, sample_rate)

        if global_step % checkpoint_interval == 0:
            io.save_parameters(checkpoint_dir, global_step, model, optim)

        global_step += 1

@@ -1,62 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import division
import os
import numpy as np
import soundfile as sf
import paddle.fluid.dygraph as dg


def make_output_tree(output_dir):
    """Create the checkpoints/ and states/ subdirectories of the output directory."""
    checkpoint_dir = os.path.join(output_dir, "checkpoints")
    if not os.path.exists(checkpoint_dir):
        os.makedirs(checkpoint_dir)

    state_dir = os.path.join(output_dir, "states")
    if not os.path.exists(state_dir):
        os.makedirs(state_dir)


def valid_model(model, valid_loader, writer, global_step, sample_rate):
    """Run validation, logging the average loss and sampled audio to the writer."""
    loss = []
    wavs = []
    model.eval()
    for i, batch in enumerate(valid_loader):
        # print("sentence {}".format(i))
        audio_clips, mel_specs, audio_starts = batch
        y_var = model(audio_clips, mel_specs, audio_starts)
        wav_var = model.sample(y_var)
        loss_var = model.loss(y_var, audio_clips)
        loss.append(loss_var.numpy()[0])
        wavs.append(wav_var.numpy()[0])

    average_loss = np.mean(loss)
    writer.add_scalar("valid_loss", average_loss, global_step)
    for i, wav in enumerate(wavs):
        writer.add_audio("valid/sample_{}".format(i), wav, global_step,
                         sample_rate)


def eval_model(model, valid_loader, output_dir, global_step, sample_rate):
    """Synthesize each validation sentence from its mel spectrogram and write it to disk."""
    model.eval()
    for i, batch in enumerate(valid_loader):
        # print("sentence {}".format(i))
        path = os.path.join(output_dir,
                            "sentence_{}_step_{}.wav".format(i, global_step))
        audio_clips, mel_specs, audio_starts = batch
        wav_var = model.synthesis(mel_specs)
        wav_np = wav_var.numpy()[0]
        sf.write(path, wav_np, samplerate=sample_rate)
        print("generated {}".format(path))