Compare commits
34 Commits
develop ... release/v0
Author | SHA1 | Date |
---|---|---|
Feiyu Chan | e14ef0432c | |
chenfeiyu | b666b830a5 | |
Feiyu Chan | 66620977ba | |
chenfeiyu | 9a11dce942 | |
Feiyu Chan | 1552e34401 | |
iclementine | ee1a9a04d8 | |
iclementine | 87a76c339c | |
iclementine | ea43c3f82a | |
iclementine | a1aa401d8c | |
Feiyu Chan | ba39773b5a | |
chenfeiyu | cb3ea54118 | |
chenfeiyu | 4e876e555c | |
Feiyu Chan | bdf2f680b7 | |
Feiyu Chan | 7019dcfe97 | |
Feiyu Chan | e2bca531ef | |
Hui Zhang | 8d6b7dc4d0 | |
Hui Zhang | db63e4096c | |
Hui Zhang | 7b77c9b26c | |
Feiyu Chan | a42838c717 | |
iclementine | c3a9f6d89b | |
iclementine | 5eebbd0716 | |
Feiyu Chan | 9f256e325c | |
chenfeiyu | a9f633f489 | |
Feiyu Chan | 3f1928604a | |
chenfeiyu | bd773744ac | |
chenfeiyu | 289827c9cf | |
Feiyu Chan | 610b8f2cef | |
Feiyu Chan | f7d85e7058 | |
chenfeiyu | a5991a8f90 | |
Feiyu Chan | df627d6a2e | |
Feiyu Chan | e55ea4555e | |
chenfeiyu | 54003d5ecc | |
chenfeiyu | 80bb465d3c | |
Feiyu Chan | 94c8585140 | |
@ -59,7 +59,6 @@ See [install](https://paddle-parakeet.readthedocs.io/en/latest/install.html) for

Entries to the introduction, and the launch of training and synthesis for different example models:

- [>>> WaveFlow](./examples/waveflow)
- [>>> WaveNet](./examples/wavenet)
- [>>> Transformer TTS](./examples/transformer_tts)
- [>>> Tacotron2](./examples/tacotron2)

@ -70,6 +69,14 @@ Entries to the introduction, and the launch of training and synthesis for different example models:

Check our [website](https://paddle-parakeet.readthedocs.io/en/latest/demo.html) for audio samples.

## Pretrained models

Models pretrained on LJSpeech can be downloaded here.

[tacotron2](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.2.zip)
[transformer_tts](https://paddlespeech.bj.bcebos.com/Parakeet/transformer_tts_ljspeech_ckpt_0.2.zip)
[waveflow_res_128](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_0.2.zip)
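If it helps, the snippet below is one possible way to fetch and unpack one of the checkpoints listed above (a plain-Python sketch; the target directory name is only an example, and `wget` + `unzip` work just as well):

```python
import io
import urllib.request
import zipfile

# download one of the pretrained checkpoints listed above and unpack it
url = "https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_0.2.zip"
with urllib.request.urlopen(url) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))
archive.extractall("pretrained/waveflow_res128")  # example output directory
```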
## Copyright and License

Parakeet is provided under the [Apache-2.0 license](LICENSE).
@ -52,7 +52,7 @@ vocoder

Like the example above, after loading the pretrained ``ConditionalWaveFlow``
model, call ``model.predict(mel)`` to synthesize raw audio (in wav format).

>>> import soundfile as df
>>> import soundfile as sf
>>> from parakeet.models import ConditionalWaveFlow
>>>
>>> # load the pretrained model
|
|
@ -6,7 +6,7 @@

## Dataset

We assume a dataset is a list of examples. You can get its length with the `__len__` method and randomly access its elements with the `__getitem__` method. With the above two adjustments, we can also use `iter(dataset)` to get an iterator over the dataset. We usually create our own dataset by subclassing `paddle.io.Dataset`; implementing `__len__` and `__getitem__` for it is enough.
We assume a dataset is a list of examples. You can get its length with the `__len__` method and randomly access its elements with the `__getitem__` method. With the above two conditions, we can also use `iter(dataset)` to get an iterator over the dataset. We usually create our own dataset by subclassing `paddle.io.Dataset`; implementing `__len__` and `__getitem__` for it is enough.

Considering data processing, data loading and dataset size, several strategies can be used to control whether the dataset is preprocessed lazily, loaded lazily, kept resident in memory, and so on.
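To make the `__len__` / `__getitem__` contract above concrete, a minimal custom dataset might look like the sketch below (the `WaveFolderDataset` name and the one-`.npy`-file-per-example layout are assumptions made only for this illustration, not Parakeet code):

```python
from pathlib import Path

import numpy as np
from paddle.io import Dataset


class WaveFolderDataset(Dataset):
    """A toy dataset: each example is a waveform stored as a .npy file."""

    def __init__(self, root):
        self.file_paths = sorted(Path(root).glob("*.npy"))

    def __getitem__(self, i):
        # random access by integer index
        return np.load(self.file_paths[i])

    def __len__(self):
        return len(self.file_paths)


# with __len__ and __getitem__ in place, iteration also works:
# for example in iter(WaveFolderDataset("ljspeech_wavenet/wav")): ...
```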
|
||||
|
||||
|
@ -86,7 +86,7 @@ A Sampler is implemented as an iterable that yields integers. Suppose the dataset has `N`

When iterating over a DataLoader, the sampler first yields a number of indices, the corresponding examples are fetched by these indices, and the batch function is called to compose them into a batch. Fetching the examples can be done in parallel, but composing the batch with the batch function cannot.

Another option is to use a batch sampler, which is an iterable that yields lists of integers. With an ordinary sampler, next has to be called on its iterator several times to produce several indices, whereas with a batch sampler a single next call on its iterator yields several indices. When an ordinary sampler is used, the batch size is determined by the DataLoader's; with a batch sampler, it is the batch sampler that determines the DataLoader's batch size, so it can be used to implement special requirements such as dynamic batch sizes.
Another option is to use a batch sampler, which is an iterable that yields lists of integers. With an ordinary sampler, next has to be called on its iterator several times to produce several indices, whereas with a batch sampler a single next call on its iterator yields several indices. When an ordinary sampler is used, the batch size is determined by the DataLoader; with a batch sampler, it is the batch sampler that determines the DataLoader's batch size, so it can be used to implement special requirements such as dynamic batch sizes.
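A small sketch of this idea, with a made-up length-bucketing rule purely for illustration (it is not Parakeet code; it only relies on the contract described above, i.e. an iterable that yields lists of indices):

```python
class SortedBatchSampler(object):
    """A toy batch sampler: it, not the DataLoader, decides how many
    indices go into each batch, so the batch size could even vary."""

    def __init__(self, lengths, batch_size):
        # sort indices by example length so each batch has similar lengths
        self.order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        self.batch_size = batch_size

    def __iter__(self):
        for start in range(0, len(self.order), self.batch_size):
            yield self.order[start:start + self.batch_size]

    def __len__(self):
        return (len(self.order) + self.batch_size - 1) // self.batch_size


# it would then be passed to the DataLoader in place of a plain sampler, e.g.
# DataLoader(dataset, batch_sampler=SortedBatchSampler(lengths, 8),
#            collate_fn=batch_fn)
```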
|
||||
|
||||
## Example Code
|
||||
|
||||
|
|
|
@ -86,6 +86,7 @@ class LJSpeechCollector(object):
|
|||
for i, _ in sorted(
|
||||
zip(mel_lens, text_lens), key=lambda x: x[1], reverse=True)
|
||||
]
|
||||
mel_lens = np.array(mel_lens, dtype=np.int64)
|
||||
|
||||
stop_tokens = [
|
||||
i
|
||||
|
@ -93,7 +94,7 @@ class LJSpeechCollector(object):
|
|||
zip(stop_tokens, text_lens), key=lambda x: x[1], reverse=True)
|
||||
]
|
||||
|
||||
text_lens = sorted(text_lens, reverse=True)
|
||||
text_lens = np.array(sorted(text_lens, reverse=True), dtype=np.int64)
|
||||
|
||||
# Pad sequence with largest len of the batch
|
||||
texts = batch_text_id(texts, pad_id=self.padding_idx)
|
||||
|
|
|
@ -23,7 +23,8 @@ _C.data = CN(
|
|||
n_fft=1024, # fft frame size
|
||||
win_length=1024, # window size
|
||||
hop_length=256, # hop size between adjacent frames
|
||||
f_max=8000, # Hz, max frequency when converting to mel
|
||||
fmin=0,
|
||||
fmax=8000, # Hz, max frequency when converting to mel
|
||||
n_mels=80, # mel bands
|
||||
clip_frames=65, # mel clip frames
|
||||
))
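As a side note, the sketch below shows one way these yacs defaults could be overridden from a file or an `--opts`-style list; the miniature config here is a stand-in written only for this example, not the project's actual `config.py`:

```python
from yacs.config import CfgNode as CN

# a pared-down stand-in for the data config above
_C = CN(dict(data=CN(dict(n_mels=80, fmin=0, fmax=8000, clip_frames=65))))


def get_cfg_defaults():
    # return a clone so the module-level defaults stay untouched
    return _C.clone()


cfg = get_cfg_defaults()
# equivalent of passing `--opts data.fmax 7600 data.clip_frames 64` on the CLI
cfg.merge_from_list(["data.fmax", 7600, "data.clip_frames", 64])
cfg.freeze()
print(cfg.data.fmax)  # 7600
```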
|
||||
|
|
|
@ -30,12 +30,14 @@ from config import get_cfg_defaults
|
|||
|
||||
|
||||
class Transform(object):
|
||||
def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels):
|
||||
def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels, fmin, fmax):
|
||||
self.sample_rate = sample_rate
|
||||
self.n_fft = n_fft
|
||||
self.win_length = win_length
|
||||
self.hop_length = hop_length
|
||||
self.n_mels = n_mels
|
||||
self.fmin = fmin
|
||||
self.fmax = fmax
|
||||
|
||||
self.spec_normalizer = LogMagnitude(min=1e-5)
|
||||
|
||||
|
@ -47,6 +49,8 @@ class Transform(object):
|
|||
win_length = self.win_length
|
||||
hop_length = self.hop_length
|
||||
n_mels = self.n_mels
|
||||
fmin = self.fmin
|
||||
fmax = self.fmax
|
||||
|
||||
wav, loaded_sr = librosa.load(wav_path, sr=None)
|
||||
assert loaded_sr == sr, "sample rate does not match, resampling applied"
|
||||
|
@ -78,7 +82,9 @@ class Transform(object):
|
|||
# Compute mel-spectrograms.
|
||||
mel_filter_bank = librosa.filters.mel(sr=sr,
|
||||
n_fft=n_fft,
|
||||
n_mels=n_mels)
|
||||
n_mels=n_mels,
|
||||
fmin=fmin,
|
||||
fmax=fmax)
|
||||
mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)
|
||||
mel_spectrogram = mel_spectrogram
|
||||
|
||||
|
@ -101,7 +107,7 @@ def create_dataset(config, input_dir, output_dir, verbose=True):
|
|||
output_dir.mkdir(exist_ok=True)
|
||||
|
||||
transform = Transform(config.sample_rate, config.n_fft, config.win_length,
|
||||
config.hop_length, config.n_mels)
|
||||
config.hop_length, config.n_mels, config.fmin, config.fmax)
|
||||
file_names = []
|
||||
|
||||
for example in tqdm.tqdm(dataset):
|
||||
|
|
|
@ -1,48 +0,0 @@
|
|||
# WaveNet with LJSpeech
|
||||
|
||||
## Dataset
|
||||
|
||||
### Download the dataset.
|
||||
|
||||
```bash
|
||||
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
|
||||
```
|
||||
|
||||
### Extract the dataset.
|
||||
|
||||
```bash
|
||||
tar xjvf LJSpeech-1.1.tar.bz2
|
||||
```
|
||||
|
||||
### Preprocess the dataset.
|
||||
|
||||
Assume the path to save the preprocessed dataset is `ljspeech_wavenet`. Run the command below to preprocess the dataset.
|
||||
|
||||
```bash
|
||||
python preprocess.py --input=LJSpeech-1.1/ --output=ljspeech_wavenet
|
||||
```
|
||||
|
||||
## Train the model
|
||||
|
||||
The training script requires 4 command line arguments.
|
||||
`--data` is the path of the training dataset, and `--output` is the path of the output directory (we recommend using a subdirectory of `runs` to manage different experiments).
|
||||
|
||||
`--device` should be "cpu" or "gpu", and `--nprocs` is the number of processes used to train the model in parallel.
|
||||
|
||||
```bash
|
||||
python train.py --data=ljspeech_wavenet/ --output=runs/test --device="gpu" --nprocs=1
|
||||
```
|
||||
|
||||
If you want distributed training, set a larger `--nprocs` (e.g. 4). Note that distributed training with cpu is not supported yet.
|
||||
|
||||
## Synthesize
|
||||
|
||||
Synthesize waveforms. We assume `--input` is a directory containing several mel spectrograms (normalized into the range [0, 1)) in `.npy` format. The output is saved in the `--output` directory as `.wav` files, each with the same name as its mel spectrogram.
|
||||
|
||||
`--checkpoint_path` should be the path of the parameter file (`.pdparams`) to load. Note that the extension `.pdparams` is not included here.
|
||||
|
||||
`--device` specifies the device to run synthesis on. Due to the autoregressive nature of WaveNet, using the CPU may be faster.
|
||||
|
||||
```bash
|
||||
python synthesize.py --input=mels/ --output=wavs/ --checkpoint_path='step-2450000' --device="cpu" --verbose
|
||||
```
|
|
@ -1,58 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from yacs.config import CfgNode as CN
|
||||
|
||||
_C = CN()
|
||||
_C.data = CN(
|
||||
dict(
|
||||
batch_size=8, # batch size
|
||||
valid_size=16, # the first N examples are reserved for validation
|
||||
sample_rate=22050, # Hz, sample rate
|
||||
n_fft=2048, # fft frame size
|
||||
win_length=1024, # window size
|
||||
hop_length=256, # hop size between adjacent frames
|
||||
# f_max=8000, # Hz, max frequency when converting to mel
|
||||
n_mels=80, # mel bands
|
||||
train_clip_seconds=0.5, # audio clip length(in seconds)
|
||||
))
|
||||
|
||||
_C.model = CN(
|
||||
dict(
|
||||
upsample_factors=[16, 16],
|
||||
n_stack=3,
|
||||
n_loop=10,
|
||||
filter_size=2,
|
||||
residual_channels=128, # residual channels in each residual block
|
||||
loss_type="mog",
|
||||
output_dim=3, # single gaussian
|
||||
log_scale_min=-9.0, ))
|
||||
|
||||
_C.training = CN(
|
||||
dict(
|
||||
lr=1e-3, # learning rate
|
||||
anneal_rate=0.5, # learning rate decay rate
|
||||
anneal_interval=200000, # decrease lr by anneal_rate every anneal_interval steps
|
||||
valid_interval=1000, # validation
|
||||
save_interval=10000, # checkpoint
|
||||
max_iteration=3000000, # max iteration to train
|
||||
gradient_max_norm=100.0 # global norm of gradients
|
||||
))
|
||||
|
||||
|
||||
def get_cfg_defaults():
|
||||
"""Get a yacs CfgNode object with default values for my_project."""
|
||||
# Return a clone so that the defaults will not be altered
|
||||
# This is for the "local variable" use pattern
|
||||
return _C.clone()
|
|
@ -1,151 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import os
|
||||
from pathlib import Path
|
||||
import pickle
|
||||
import numpy as np
|
||||
import pandas
|
||||
from paddle.io import Dataset, DataLoader
|
||||
|
||||
from parakeet.data.batch import batch_spec, batch_wav
|
||||
from parakeet.data import dataset
|
||||
from parakeet.audio import AudioProcessor
|
||||
|
||||
|
||||
class LJSpeech(Dataset):
|
||||
"""A simple dataset adaptor for the processed ljspeech dataset."""
|
||||
|
||||
def __init__(self, root):
|
||||
self.root = Path(root).expanduser()
|
||||
meta_data = pandas.read_csv(
|
||||
str(self.root / "metadata.csv"),
|
||||
sep="\t",
|
||||
header=None,
|
||||
names=["fname", "frames", "samples"])
|
||||
|
||||
records = []
|
||||
for row in meta_data.itertuples():
|
||||
mel_path = str(self.root / "mel" / (row.fname + ".npy"))
|
||||
wav_path = str(self.root / "wav" / (row.fname + ".npy"))
|
||||
records.append((mel_path, wav_path))
|
||||
self.records = records
|
||||
|
||||
def __getitem__(self, i):
|
||||
mel_name, wav_name = self.records[i]
|
||||
mel = np.load(mel_name)
|
||||
wav = np.load(wav_name)
|
||||
return mel, wav
|
||||
|
||||
def __len__(self):
|
||||
return len(self.records)
|
||||
|
||||
|
||||
class LJSpeechCollector(object):
|
||||
"""A simple callable to batch LJSpeech examples."""
|
||||
|
||||
def __init__(self, padding_value=0.):
|
||||
self.padding_value = padding_value
|
||||
|
||||
def __call__(self, examples):
|
||||
batch_size = len(examples)
|
||||
mels = [example[0] for example in examples]
|
||||
wavs = [example[1] for example in examples]
|
||||
mels = batch_spec(mels, pad_value=self.padding_value)
|
||||
wavs = batch_wav(wavs, pad_value=self.padding_value)
|
||||
audio_starts = np.zeros((batch_size, ), dtype=np.int64)
|
||||
return mels, wavs, audio_starts
|
||||
|
||||
|
||||
class LJSpeechClipCollector(object):
|
||||
def __init__(self, clip_frames=65, hop_length=256):
|
||||
self.clip_frames = clip_frames
|
||||
self.hop_length = hop_length
|
||||
|
||||
def __call__(self, examples):
|
||||
mels = []
|
||||
wavs = []
|
||||
starts = []
|
||||
for example in examples:
|
||||
mel, wav_clip, start = self.clip(example)
|
||||
mels.append(mel)
|
||||
wavs.append(wav_clip)
|
||||
starts.append(start)
|
||||
mels = batch_spec(mels)
|
||||
wavs = np.stack(wavs)
|
||||
starts = np.array(starts, dtype=np.int64)
|
||||
return mels, wavs, starts
|
||||
|
||||
def clip(self, example):
|
||||
mel, wav = example
|
||||
frames = mel.shape[-1]
|
||||
start = np.random.randint(0, frames - self.clip_frames)
|
||||
wav_clip = wav[start * self.hop_length:(start + self.clip_frames) *
|
||||
self.hop_length]
|
||||
return mel, wav_clip, start
|
||||
|
||||
|
||||
class DataCollector(object):
|
||||
def __init__(self,
|
||||
context_size,
|
||||
sample_rate,
|
||||
hop_length,
|
||||
train_clip_seconds,
|
||||
valid=False):
|
||||
frames_per_second = sample_rate // hop_length
|
||||
train_clip_frames = int(
|
||||
np.ceil(train_clip_seconds * frames_per_second))
|
||||
context_frames = context_size // hop_length
|
||||
self.num_frames = train_clip_frames + context_frames
|
||||
|
||||
self.sample_rate = sample_rate
|
||||
self.hop_length = hop_length
|
||||
self.valid = valid
|
||||
|
||||
def random_crop(self, sample):
|
||||
audio, mel_spectrogram = sample
|
||||
audio_frames = int(audio.size) // self.hop_length
|
||||
max_start_frame = audio_frames - self.num_frames
|
||||
assert max_start_frame >= 0, "audio is too short to be cropped"
|
||||
|
||||
frame_start = np.random.randint(0, max_start_frame)
|
||||
# frame_start = 0 # norandom
|
||||
frame_end = frame_start + self.num_frames
|
||||
|
||||
audio_start = frame_start * self.hop_length
|
||||
audio_end = frame_end * self.hop_length
|
||||
|
||||
audio = audio[audio_start:audio_end]
|
||||
return audio, mel_spectrogram, audio_start
|
||||
|
||||
def __call__(self, samples):
|
||||
# transform them first
|
||||
if self.valid:
|
||||
samples = [(audio, mel_spectrogram, 0)
|
||||
for audio, mel_spectrogram in samples]
|
||||
else:
|
||||
samples = [self.random_crop(sample) for sample in samples]
|
||||
# batch them
|
||||
audios = [sample[0] for sample in samples]
|
||||
audio_starts = [sample[2] for sample in samples]
|
||||
mels = [sample[1] for sample in samples]
|
||||
|
||||
mels = batch_spec(mels)
|
||||
|
||||
if self.valid:
|
||||
audios = batch_wav(audios, dtype=np.float32)
|
||||
else:
|
||||
audios = np.array(audios, dtype=np.float32)
|
||||
audio_starts = np.array(audio_starts, dtype=np.int64)
|
||||
return audios, mels, audio_starts
|
|
@ -1,161 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import os
|
||||
import tqdm
|
||||
import csv
|
||||
import argparse
|
||||
import numpy as np
|
||||
import librosa
|
||||
from pathlib import Path
|
||||
import pandas as pd
|
||||
|
||||
from paddle.io import Dataset
|
||||
from parakeet.data import batch_spec, batch_wav
|
||||
from parakeet.datasets import LJSpeechMetaData
|
||||
from parakeet.audio import AudioProcessor
|
||||
from parakeet.audio.spec_normalizer import UnitMagnitude
|
||||
|
||||
from config import get_cfg_defaults
|
||||
|
||||
|
||||
class Transform(object):
|
||||
def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels):
|
||||
self.sample_rate = sample_rate
|
||||
self.n_fft = n_fft
|
||||
self.win_length = win_length
|
||||
self.hop_length = hop_length
|
||||
self.n_mels = n_mels
|
||||
|
||||
self.spec_normalizer = UnitMagnitude(min=1e-5)
|
||||
|
||||
def __call__(self, example):
|
||||
wav_path, _, _ = example
|
||||
|
||||
sr = self.sample_rate
|
||||
n_fft = self.n_fft
|
||||
win_length = self.win_length
|
||||
hop_length = self.hop_length
|
||||
n_mels = self.n_mels
|
||||
|
||||
wav, loaded_sr = librosa.load(wav_path, sr=None)
|
||||
assert loaded_sr == sr, "sample rate does not match, resampling applied"
|
||||
|
||||
# Pad audio to the right size.
|
||||
frames = int(np.ceil(float(wav.size) / hop_length))
|
||||
fft_padding = (n_fft - hop_length) // 2 # sound
|
||||
desired_length = frames * hop_length + fft_padding * 2
|
||||
pad_amount = (desired_length - wav.size) // 2
|
||||
|
||||
if wav.size % 2 == 0:
|
||||
wav = np.pad(wav, (pad_amount, pad_amount), mode='reflect')
|
||||
else:
|
||||
wav = np.pad(wav, (pad_amount, pad_amount + 1), mode='reflect')
|
||||
|
||||
# Normalize audio.
|
||||
wav = wav / np.abs(wav).max() * 0.999
|
||||
|
||||
# Compute mel-spectrogram.
|
||||
# Turn center to False to prevent internal padding.
|
||||
spectrogram = librosa.core.stft(
|
||||
wav,
|
||||
hop_length=hop_length,
|
||||
win_length=win_length,
|
||||
n_fft=n_fft,
|
||||
center=False)
|
||||
spectrogram_magnitude = np.abs(spectrogram)
|
||||
|
||||
# Compute mel-spectrograms.
|
||||
mel_filter_bank = librosa.filters.mel(sr=sr,
|
||||
n_fft=n_fft,
|
||||
n_mels=n_mels)
|
||||
mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)
|
||||
mel_spectrogram = mel_spectrogram
|
||||
|
||||
# log scale mel_spectrogram.
|
||||
mel_spectrogram = self.spec_normalizer.transform(mel_spectrogram)
|
||||
|
||||
# Extract the center of audio that corresponds to mel spectrograms.
|
||||
audio = wav[fft_padding:-fft_padding]
|
||||
assert mel_spectrogram.shape[1] * hop_length == audio.size
|
||||
|
||||
# there is no clipping here
|
||||
return audio, mel_spectrogram
|
||||
|
||||
|
||||
def create_dataset(config, input_dir, output_dir, verbose=True):
|
||||
input_dir = Path(input_dir).expanduser()
|
||||
dataset = LJSpeechMetaData(input_dir)
|
||||
|
||||
output_dir = Path(output_dir).expanduser()
|
||||
output_dir.mkdir(exist_ok=True)
|
||||
|
||||
transform = Transform(config.sample_rate, config.n_fft, config.win_length,
|
||||
config.hop_length, config.n_mels)
|
||||
file_names = []
|
||||
|
||||
for example in tqdm.tqdm(dataset):
|
||||
fname, _, _ = example
|
||||
base_name = os.path.splitext(os.path.basename(fname))[0]
|
||||
wav_dir = output_dir / "wav"
|
||||
mel_dir = output_dir / "mel"
|
||||
wav_dir.mkdir(exist_ok=True)
|
||||
mel_dir.mkdir(exist_ok=True)
|
||||
|
||||
audio, mel = transform(example)
|
||||
np.save(str(wav_dir / base_name), audio)
|
||||
np.save(str(mel_dir / base_name), mel)
|
||||
|
||||
file_names.append((base_name, mel.shape[-1], audio.shape[-1]))
|
||||
|
||||
meta_data = pd.DataFrame.from_records(file_names)
|
||||
meta_data.to_csv(
|
||||
str(output_dir / "metadata.csv"), sep="\t", index=None, header=None)
|
||||
print("saved meta data in to {}".format(
|
||||
os.path.join(output_dir, "metadata.csv")))
|
||||
|
||||
print("Done!")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="create dataset")
|
||||
parser.add_argument(
|
||||
"--config",
|
||||
type=str,
|
||||
metavar="FILE",
|
||||
help="extra config to overwrite the default config")
|
||||
parser.add_argument(
|
||||
"--input", type=str, help="path of the ljspeech dataset")
|
||||
parser.add_argument(
|
||||
"--output", type=str, help="path to save output dataset")
|
||||
parser.add_argument(
|
||||
"--opts",
|
||||
nargs=argparse.REMAINDER,
|
||||
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
|
||||
)
|
||||
parser.add_argument(
|
||||
"-v", "--verbose", action="store_true", help="print msg")
|
||||
|
||||
config = get_cfg_defaults()
|
||||
args = parser.parse_args()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
config.freeze()
|
||||
if args.verbose:
|
||||
print(config.data)
|
||||
print(args)
|
||||
|
||||
create_dataset(config.data, args.input, args.output, args.verbose)
|
|
@ -1,82 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import argparse
|
||||
import numpy as np
|
||||
import soundfile as sf
|
||||
import os
|
||||
from pathlib import Path
|
||||
import paddle
|
||||
import parakeet
|
||||
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWaveNet
|
||||
from parakeet.utils import layer_tools, checkpoint
|
||||
|
||||
from config import get_cfg_defaults
|
||||
|
||||
|
||||
def main(config, args):
|
||||
paddle.set_device(args.device)
|
||||
model = ConditionalWaveNet.from_pretrained(config, args.checkpoint_path)
|
||||
layer_tools.recursively_remove_weight_norm(model)
|
||||
model.eval()
|
||||
|
||||
mel_dir = Path(args.input).expanduser()
|
||||
output_dir = Path(args.output).expanduser()
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
for file_path in mel_dir.iterdir():
|
||||
mel = np.load(str(file_path))
|
||||
audio = model.predict(mel)
|
||||
audio_path = output_dir / (
|
||||
os.path.splitext(file_path.name)[0] + ".wav")
|
||||
sf.write(audio_path, audio, config.data.sample_rate)
|
||||
print("[synthesize] {} -> {}".format(file_path, audio_path))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
config = get_cfg_defaults()
|
||||
|
||||
parser = argparse.ArgumentParser(
|
||||
description="generate mel spectrogram with TransformerTTS.")
|
||||
parser.add_argument(
|
||||
"--config",
|
||||
type=str,
|
||||
metavar="FILE",
|
||||
help="extra config to overwrite the default config")
|
||||
parser.add_argument(
|
||||
"--checkpoint_path", type=str, help="path of the checkpoint to load.")
|
||||
parser.add_argument(
|
||||
"--input",
|
||||
type=str,
|
||||
help="path of directory containing mel spectrogram (in .npy format)")
|
||||
parser.add_argument("--output", type=str, help="path to save outputs")
|
||||
parser.add_argument(
|
||||
"--device", type=str, default="cpu", help="device type to use.")
|
||||
parser.add_argument(
|
||||
"--opts",
|
||||
nargs=argparse.REMAINDER,
|
||||
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
|
||||
)
|
||||
parser.add_argument(
|
||||
"-v", "--verbose", action="store_true", help="print msg")
|
||||
|
||||
args = parser.parse_args()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
config.freeze()
|
||||
print(config)
|
||||
print(args)
|
||||
|
||||
main(config, args)
|
|
@ -1,177 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import time
|
||||
from pathlib import Path
|
||||
import math
|
||||
import numpy as np
|
||||
import paddle
|
||||
from paddle import distributed as dist
|
||||
from paddle.io import DataLoader, DistributedBatchSampler
|
||||
from tensorboardX import SummaryWriter
|
||||
from collections import defaultdict
|
||||
|
||||
import parakeet
|
||||
from parakeet.data import dataset
|
||||
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWaveNet
|
||||
from parakeet.audio import AudioProcessor
|
||||
from parakeet.utils import scheduler, mp_tools
|
||||
from parakeet.training.cli import default_argument_parser
|
||||
from parakeet.training.experiment import ExperimentBase
|
||||
from parakeet.utils.mp_tools import rank_zero_only
|
||||
|
||||
from config import get_cfg_defaults
|
||||
from ljspeech import LJSpeech, LJSpeechClipCollector, LJSpeechCollector
|
||||
|
||||
|
||||
class Experiment(ExperimentBase):
|
||||
def setup_model(self):
|
||||
config = self.config
|
||||
model = ConditionalWaveNet(
|
||||
upsample_factors=config.model.upsample_factors,
|
||||
n_stack=config.model.n_stack,
|
||||
n_loop=config.model.n_loop,
|
||||
residual_channels=config.model.residual_channels,
|
||||
output_dim=config.model.output_dim,
|
||||
n_mels=config.data.n_mels,
|
||||
filter_size=config.model.filter_size,
|
||||
loss_type=config.model.loss_type,
|
||||
log_scale_min=config.model.log_scale_min)
|
||||
|
||||
if self.parallel:
|
||||
model = paddle.DataParallel(model)
|
||||
|
||||
lr_scheduler = paddle.optimizer.lr.StepDecay(
|
||||
config.training.lr, config.training.anneal_interval,
|
||||
config.training.anneal_rate)
|
||||
optimizer = paddle.optimizer.Adam(
|
||||
lr_scheduler,
|
||||
parameters=model.parameters(),
|
||||
grad_clip=paddle.nn.ClipGradByGlobalNorm(
|
||||
config.training.gradient_max_norm))
|
||||
|
||||
self.model = model
|
||||
self.model_core = model._layers if self.parallel else model
|
||||
self.optimizer = optimizer
|
||||
|
||||
def setup_dataloader(self):
|
||||
config = self.config
|
||||
args = self.args
|
||||
|
||||
ljspeech_dataset = LJSpeech(args.data)
|
||||
valid_set, train_set = dataset.split(ljspeech_dataset,
|
||||
config.data.valid_size)
|
||||
|
||||
# convolutional net's causal padding size
|
||||
context_size = config.model.n_stack \
|
||||
* sum([(config.model.filter_size - 1) * 2**i for i in range(config.model.n_loop)]) \
|
||||
+ 1
|
||||
context_frames = context_size // config.data.hop_length
|
||||
|
||||
# frames used to compute loss
|
||||
frames_per_second = config.data.sample_rate // config.data.hop_length
|
||||
train_clip_frames = math.ceil(config.data.train_clip_seconds *
|
||||
frames_per_second)
|
||||
|
||||
num_frames = train_clip_frames + context_frames
|
||||
batch_fn = LJSpeechClipCollector(num_frames, config.data.hop_length)
|
||||
if not self.parallel:
|
||||
train_loader = DataLoader(
|
||||
train_set,
|
||||
batch_size=config.data.batch_size,
|
||||
shuffle=True,
|
||||
drop_last=True,
|
||||
collate_fn=batch_fn)
|
||||
else:
|
||||
sampler = DistributedBatchSampler(
|
||||
train_set,
|
||||
batch_size=config.data.batch_size,
|
||||
shuffle=True,
|
||||
drop_last=True)
|
||||
train_loader = DataLoader(
|
||||
train_set, batch_sampler=sampler, collate_fn=batch_fn)
|
||||
|
||||
valid_batch_fn = LJSpeechCollector()
|
||||
valid_loader = DataLoader(
|
||||
valid_set, batch_size=1, collate_fn=valid_batch_fn)
|
||||
|
||||
self.train_loader = train_loader
|
||||
self.valid_loader = valid_loader
|
||||
|
||||
def train_batch(self):
|
||||
start = time.time()
|
||||
batch = self.read_batch()
|
||||
data_loader_time = time.time() - start
|
||||
|
||||
self.model.train()
|
||||
self.optimizer.clear_grad()
|
||||
mel, wav, audio_starts = batch
|
||||
|
||||
y = self.model(wav, mel, audio_starts)
|
||||
loss = self.model_core.loss(y, wav)
|
||||
loss.backward()
|
||||
self.optimizer.step()
|
||||
iteration_time = time.time() - start
|
||||
|
||||
loss_value = float(loss)
|
||||
msg = "Rank: {}, ".format(dist.get_rank())
|
||||
msg += "step: {}, ".format(self.iteration)
|
||||
msg += "time: {:>.3f}s/{:>.3f}s, ".format(data_loader_time,
|
||||
iteration_time)
|
||||
msg += "loss: {:>.6f}".format(loss_value)
|
||||
self.logger.info(msg)
|
||||
if dist.get_rank() == 0:
|
||||
self.visualizer.add_scalar(
|
||||
"train/loss", loss_value, global_step=self.iteration)
|
||||
|
||||
@mp_tools.rank_zero_only
|
||||
@paddle.no_grad()
|
||||
def valid(self):
|
||||
valid_iterator = iter(self.valid_loader)
|
||||
valid_losses = []
|
||||
mel, wav, audio_starts = next(valid_iterator)
|
||||
y = self.model(wav, mel, audio_starts)
|
||||
loss = self.model_core.loss(y, wav)
|
||||
valid_losses.append(float(loss))
|
||||
valid_loss = np.mean(valid_losses)
|
||||
self.visualizer.add_scalar(
|
||||
"valid/loss", valid_loss, global_step=self.iteration)
|
||||
|
||||
|
||||
def main_sp(config, args):
|
||||
exp = Experiment(config, args)
|
||||
exp.setup()
|
||||
exp.run()
|
||||
|
||||
|
||||
def main(config, args):
|
||||
if args.nprocs > 1 and args.device == "gpu":
|
||||
dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs)
|
||||
else:
|
||||
main_sp(config, args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
config = get_cfg_defaults()
|
||||
parser = default_argument_parser()
|
||||
args = parser.parse_args()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
config.freeze()
|
||||
print(config)
|
||||
print(args)
|
||||
|
||||
main(config, args)
|
|
@ -12,6 +12,6 @@
|
|||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
__version__ = "0.2.0-beta.0"
|
||||
__version__ = "0.2.0"
|
||||
|
||||
from parakeet import audio, data, datasets, frontend, models, modules, training, utils
|
||||
|
|
|
@ -14,7 +14,7 @@
|
|||
|
||||
#from parakeet.models.clarinet import *
|
||||
from parakeet.models.waveflow import *
|
||||
from parakeet.models.wavenet import *
|
||||
#from parakeet.models.wavenet import *
|
||||
|
||||
from parakeet.models.transformer_tts import *
|
||||
#from parakeet.models.deepvoice3 import *
|
||||
|
|
|
@ -1,977 +0,0 @@
|
|||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import math
|
||||
import time
|
||||
from typing import Union, Sequence, List
|
||||
from tqdm import trange
|
||||
import numpy as np
|
||||
|
||||
import paddle
|
||||
from paddle import nn
|
||||
from paddle.nn import functional as F
|
||||
import paddle.fluid.initializer as I
|
||||
import paddle.fluid.layers.distributions as D
|
||||
|
||||
from parakeet.modules.conv import Conv1dCell
|
||||
from parakeet.modules.audio import quantize, dequantize, STFT
|
||||
from parakeet.utils import checkpoint, layer_tools
|
||||
|
||||
__all__ = ["WaveNet", "ConditionalWaveNet"]
|
||||
|
||||
|
||||
def crop(x, audio_start, audio_length):
|
||||
"""Crop the upsampled condition to match audio_length.
|
||||
|
||||
The upsampled condition has the same time steps as the whole audio does.
|
||||
But since audios are sliced to 0.5 seconds randomly while conditions are
|
||||
not, upsampled conditions should also be sliced to exactly match the time
|
||||
steps of the audio slice.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
x : Tensor [shape=(B, C, T)]
|
||||
The upsampled condition.
|
||||
audio_start : Tensor [shape=(B,), dtype:int]
|
||||
The index of the starting point of the audio clips.
|
||||
audio_length : int
|
||||
The length of the audio clip (number of samples it contains).
|
||||
|
||||
Returns
|
||||
-------
|
||||
Tensor [shape=(B, C, audio_length)]
|
||||
Cropped condition.
|
||||
"""
|
||||
# crop audio
|
||||
slices = [] # for each example
|
||||
# paddle now supports Tensor of shape [1] in slice
|
||||
# starts = audio_start.numpy()
|
||||
for i in range(x.shape[0]):
|
||||
start = audio_start[i]
|
||||
end = start + audio_length
|
||||
slice = paddle.slice(x[i], axes=[1], starts=[start], ends=[end])
|
||||
slices.append(slice)
|
||||
out = paddle.stack(slices)
|
||||
return out
|
||||
|
||||
|
||||
class UpsampleNet(nn.LayerList):
|
||||
"""A network used to upsample mel spectrogram to match the time steps of
|
||||
audio.
|
||||
|
||||
It consists of several layers of Conv2DTranspose. Each Conv2DTranspose
|
||||
layer upsamples the time dimension by its `stride` times.
|
||||
|
||||
Also, each Conv2DTranspose's filter_size along the frequency dimension is 3.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
upscale_factors : List[int], optional
|
||||
Time upsampling factors for each Conv2DTranspose Layer.
|
||||
|
||||
The ``UpsampleNet`` contains ``len(upscale_factor)`` Conv2DTranspose
|
||||
Layers. Each upscale_factor is used as the ``stride`` for the
|
||||
corresponding Conv2DTranspose. Defaults to [16, 16], so the default
|
||||
upsampling factor is 256.
|
||||
|
||||
Notes
|
||||
------
|
||||
``np.prod(upscale_factors)`` should equal the ``hop_length`` of the stft
|
||||
transformation used to extract spectrogram features from audio.
|
||||
|
||||
For example, ``16 * 16 = 256``, then the spectrogram extracted with a stft
|
||||
transformation whose ``hop_length`` equals 256 is suitable.
|
||||
|
||||
See Also
|
||||
---------
|
||||
``librosa.core.stft``
|
||||
"""
|
||||
|
||||
def __init__(self, upscale_factors=[16, 16]):
|
||||
super(UpsampleNet, self).__init__()
|
||||
self.upscale_factors = list(upscale_factors)
|
||||
self.upscale_factor = 1
|
||||
for item in upscale_factors:
|
||||
self.upscale_factor *= item
|
||||
|
||||
for factor in self.upscale_factors:
|
||||
self.append(
|
||||
nn.utils.weight_norm(
|
||||
nn.Conv2DTranspose(
|
||||
1,
|
||||
1,
|
||||
kernel_size=(3, 2 * factor),
|
||||
stride=(1, factor),
|
||||
padding=(1, factor // 2))))
|
||||
|
||||
def forward(self, x):
|
||||
r"""Compute the upsampled condition.
|
||||
|
||||
Parameters
|
||||
-----------
|
||||
x : Tensor [shape=(B, F, T)]
|
||||
The condition (mel spectrogram here). ``F`` means the frequency
|
||||
bands, which is the feature size of the input.
|
||||
|
||||
In the internal Conv2DTransposes, the frequency dimension
|
||||
is treated as ``height`` dimension instead of ``in_channels``.
|
||||
|
||||
Returns:
|
||||
Tensor [shape=(B, F, T \* upscale_factor)]
|
||||
The upsampled condition.
|
||||
"""
|
||||
x = paddle.unsqueeze(x, 1)
|
||||
for sublayer in self:
|
||||
x = F.leaky_relu(sublayer(x), 0.4)
|
||||
x = paddle.squeeze(x, 1)
|
||||
return x
|
||||
|
||||
|
||||
class ResidualBlock(nn.Layer):
|
||||
"""A Residual block used in wavenet. Conv1D-gated-tanh Block.
|
||||
|
||||
It consists of a Conv1DCell and a Conv1D(kernel_size = 1) to integrate
|
||||
information of the condition.
|
||||
|
||||
Notes
|
||||
--------
|
||||
It does not have parametric residual or skip connection.
|
||||
|
||||
Parameters
|
||||
-----------
|
||||
residual_channels : int
|
||||
The feature size of the input. It is also the feature size of the
|
||||
residual output and skip output.
|
||||
|
||||
condition_dim : int
|
||||
The feature size of the condition.
|
||||
|
||||
filter_size : int
|
||||
Kernel size of the internal convolution cells.
|
||||
|
||||
dilation :int
|
||||
Dilation of the internal convolution cells.
|
||||
"""
|
||||
|
||||
def __init__(self,
|
||||
residual_channels: int,
|
||||
condition_dim: int,
|
||||
filter_size: Union[int, Sequence[int]],
|
||||
dilation: int):
|
||||
|
||||
super(ResidualBlock, self).__init__()
|
||||
dilated_channels = 2 * residual_channels
|
||||
# following clarinet's implementation, we do not have parametric residual
|
||||
# & skip connection.
|
||||
|
||||
_filter_size = filter_size[0] if isinstance(filter_size, (
|
||||
list, tuple)) else filter_size
|
||||
std = math.sqrt(1 / (_filter_size * residual_channels))
|
||||
conv = Conv1dCell(
|
||||
residual_channels,
|
||||
dilated_channels,
|
||||
filter_size,
|
||||
dilation=dilation,
|
||||
weight_attr=I.Normal(scale=std))
|
||||
self.conv = nn.utils.weight_norm(conv)
|
||||
|
||||
std = math.sqrt(1 / condition_dim)
|
||||
condition_proj = Conv1dCell(
|
||||
condition_dim,
|
||||
dilated_channels, (1, ),
|
||||
weight_attr=I.Normal(scale=std))
|
||||
self.condition_proj = nn.utils.weight_norm(condition_proj)
|
||||
|
||||
self.filter_size = filter_size
|
||||
self.dilation = dilation
|
||||
self.dilated_channels = dilated_channels
|
||||
self.residual_channels = residual_channels
|
||||
self.condition_dim = condition_dim
|
||||
|
||||
def forward(self, x, condition=None):
|
||||
"""Forward pass of the ResidualBlock.
|
||||
|
||||
Parameters
|
||||
-----------
|
||||
x : Tensor [shape=(B, C, T)]
|
||||
The input tensor.
|
||||
|
||||
condition : Tensor, optional [shape(B, C_cond, T)]
|
||||
The condition.
|
||||
|
||||
It has been upsampled in time steps, so it has the same time steps
|
||||
as the input does.(C_cond stands for the condition's channels).
|
||||
Defaults to None.
|
||||
|
||||
Returns
|
||||
-----------
|
||||
residual : Tensor [shape=(B, C, T)]
|
||||
The residual, which is used as the input to the next ResidualBlock.
|
||||
|
||||
skip_connection : Tensor [shape=(B, C, T)]
|
||||
The skip connection. This output is accumulated with that of
|
||||
other ResidualBlocks.
|
||||
"""
|
||||
h = x
|
||||
|
||||
# dilated conv
|
||||
h = self.conv(h)
|
||||
|
||||
# condition
|
||||
if condition is not None:
|
||||
h += self.condition_proj(condition)
|
||||
|
||||
# gated tanh
|
||||
content, gate = paddle.split(h, 2, axis=1)
|
||||
z = F.sigmoid(gate) * paddle.tanh(content)
|
||||
|
||||
# projection
|
||||
residual = paddle.scale(z + x, math.sqrt(.5))
|
||||
skip_connection = z
|
||||
return residual, skip_connection
|
||||
|
||||
def start_sequence(self):
|
||||
"""Prepare the ResidualBlock to generate a new sequence.
|
||||
|
||||
Warnings
|
||||
---------
|
||||
This method should be called before calling ``add_input`` multiple times.
|
||||
"""
|
||||
self.conv.start_sequence()
|
||||
self.condition_proj.start_sequence()
|
||||
|
||||
def add_input(self, x, condition=None):
|
||||
"""Take a step input and return a step output.
|
||||
|
||||
This method works similarly to ``forward`` but in a
|
||||
``step-in-step-out`` fashion.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
x : Tensor [shape=(B, C)]
|
||||
Input for a step.
|
||||
|
||||
condition : Tensor, optional [shape=(B, C_cond)]
|
||||
Condition for a step. Defaults to None.
|
||||
|
||||
Returns
|
||||
----------
|
||||
residual : Tensor [shape=(B, C)]
|
||||
The residual for a step, which is used as the input to the next
|
||||
layer of ResidualBlock.
|
||||
|
||||
skip_connection : Tensor [shape=(B, C)]
|
||||
The skip connection for a step. This output is accumulated with
|
||||
that of other ResidualBlocks.
|
||||
"""
|
||||
h = x
|
||||
|
||||
# dilated conv
|
||||
h = self.conv.add_input(h)
|
||||
|
||||
# condition
|
||||
if condition is not None:
|
||||
h += self.condition_proj.add_input(condition)
|
||||
|
||||
# gated tanh
|
||||
content, gate = paddle.split(h, 2, axis=1)
|
||||
z = F.sigmoid(gate) * paddle.tanh(content)
|
||||
|
||||
# projection
|
||||
residual = paddle.scale(z + x, math.sqrt(0.5))
|
||||
skip_connection = z
|
||||
return residual, skip_connection
|
||||
|
||||
|
||||
class ResidualNet(nn.LayerList):
|
||||
"""The residual network in wavenet.
|
||||
|
||||
It consists of ``n_stack`` stacks, each of which consists of ``n_loop``
|
||||
ResidualBlocks.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
n_stack : int
|
||||
Number of stacks in the ``ResidualNet``.
|
||||
|
||||
n_loop : int
|
||||
Number of ResidualBlocks in a stack.
|
||||
|
||||
residual_channels : int
|
||||
Input feature size of each ``ResidualBlock``'s input.
|
||||
|
||||
condition_dim : int
|
||||
Feature size of the condition.
|
||||
|
||||
filter_size : int
|
||||
Kernel size of the internal ``Conv1dCell`` of each ``ResidualBlock``.
|
||||
|
||||
"""
|
||||
|
||||
def __init__(self,
|
||||
n_stack: int,
|
||||
n_loop: int,
|
||||
residual_channels: int,
|
||||
condition_dim: int,
|
||||
filter_size: int):
|
||||
super(ResidualNet, self).__init__()
|
||||
# double the dilation at each layer in a stack
|
||||
dilations = [2**i for i in range(n_loop)] * n_stack
|
||||
self.context_size = 1 + sum(dilations)
|
||||
for dilation in dilations:
|
||||
self.append(
|
||||
ResidualBlock(residual_channels, condition_dim, filter_size,
|
||||
dilation))
|
||||
|
||||
def forward(self, x, condition=None):
|
||||
"""Forward pass of ``ResidualNet``.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
x : Tensor [shape=(B, C, T)]
|
||||
The input.
|
||||
|
||||
condition : Tensor, optional [shape=(B, C_cond, T)]
|
||||
The condition, it has been upsampled in time steps, so it has the
|
||||
same time steps as the input does. Defaults to None.
|
||||
|
||||
Returns
|
||||
--------
|
||||
Tensor [shape=(B, C, T)]
|
||||
The output.
|
||||
"""
|
||||
for i, func in enumerate(self):
|
||||
x, skip = func(x, condition)
|
||||
if i == 0:
|
||||
skip_connections = skip
|
||||
else:
|
||||
skip_connections = paddle.scale(skip_connections + skip,
|
||||
math.sqrt(0.5))
|
||||
return skip_connections
|
||||
|
||||
def start_sequence(self):
|
||||
"""Prepare the ResidualNet to generate a new sequence. This method
|
||||
should be called before starting to call ``add_input`` multiple times.
|
||||
"""
|
||||
for block in self:
|
||||
block.start_sequence()
|
||||
|
||||
def add_input(self, x, condition=None):
|
||||
"""Take a step input and return a step output.
|
||||
|
||||
This method works similarly to ``forward`` but in a
|
||||
``step-in-step-out`` fashion.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
x : Tensor [shape=(B, C)]
|
||||
Input for a step.
|
||||
|
||||
condition : Tensor, optional [shape=(B, C_cond)]
|
||||
Condition for a step. Defaults to None.
|
||||
|
||||
Returns
|
||||
----------
|
||||
Tensor [shape=(B, C)]
|
||||
The skip connection for a step. This output is accumulated with
|
||||
that of other ResidualBlocks.
|
||||
"""
|
||||
for i, func in enumerate(self):
|
||||
x, skip = func.add_input(x, condition)
|
||||
if i == 0:
|
||||
skip_connections = skip
|
||||
else:
|
||||
skip_connections = paddle.scale(skip_connections + skip,
|
||||
math.sqrt(0.5))
|
||||
return skip_connections
|
||||
|
||||
|
||||
class WaveNet(nn.Layer):
|
||||
"""Wavenet that transform upsampled mel spectrogram into waveform.
|
||||
|
||||
Parameters
|
||||
-----------
|
||||
n_stack : int
|
||||
``n_stack`` for the internal ``ResidualNet``.
|
||||
|
||||
n_loop : int
|
||||
``n_loop`` for the internal ``ResidualNet``.
|
||||
|
||||
residual_channels : int
|
||||
Feature size of the input.
|
||||
|
||||
output_dim : int
|
||||
Feature size of the output. See ``loss_type`` for details.
|
||||
|
||||
condition_dim : int
|
||||
Feature size of the condition (mel spectrogram bands).
|
||||
|
||||
filter_size : int
|
||||
Kernel size of the internal ``ResidualNet``.
|
||||
|
||||
loss_type : str, optional ["mog" or "softmax"]
|
||||
The output type and loss type of the model, by default "mog".
|
||||
|
||||
If "softmax", the model input is first quantized audio and the model
|
||||
outputs a discrete categorical distribution.
|
||||
|
||||
If "mog", the model input is audio in floating point format, and the
|
||||
model outputs parameters for a mixture of gaussian distributions.
|
||||
Namely, the weight, mean and log scale of each gaussian distribution.
|
||||
Thus, ``output_dim`` should be a multiple of 3.
|
||||
|
||||
log_scale_min : float, optional
|
||||
Minimum value of the log scale of gaussian distributions, by default
|
||||
-9.0.
|
||||
|
||||
This is only used for computing loss when ``loss_type`` is "mog". If
|
||||
the predicted log scale is less than -9.0, it is clipped at -9.0.
|
||||
"""
|
||||
|
||||
def __init__(self, n_stack, n_loop, residual_channels, output_dim,
|
||||
condition_dim, filter_size, loss_type, log_scale_min):
|
||||
|
||||
super(WaveNet, self).__init__()
|
||||
if loss_type not in ["softmax", "mog"]:
|
||||
raise ValueError("loss_type {} is not supported".format(loss_type))
|
||||
if loss_type == "softmax":
|
||||
self.embed = nn.Embedding(output_dim, residual_channels)
|
||||
else:
|
||||
if (output_dim % 3 != 0):
|
||||
raise ValueError(
|
||||
"with Mixture of Gaussians(mog) output, the output dim must be divisible by 3, but get {}".
|
||||
format(output_dim))
|
||||
self.embed = nn.utils.weight_norm(
|
||||
nn.Linear(1, residual_channels), dim=1)
|
||||
|
||||
self.resnet = ResidualNet(n_stack, n_loop, residual_channels,
|
||||
condition_dim, filter_size)
|
||||
self.context_size = self.resnet.context_size
|
||||
|
||||
skip_channels = residual_channels # assume the same channel
|
||||
self.proj1 = nn.utils.weight_norm(
|
||||
nn.Linear(skip_channels, skip_channels), dim=1)
|
||||
self.proj2 = nn.utils.weight_norm(
|
||||
nn.Linear(skip_channels, skip_channels), dim=1)
|
||||
# if loss_type is softmax, output_dim is n_vocab of waveform magnitude.
|
||||
# if loss_type is mog, output_dim is 3 * gaussian, (weight, mean and stddev)
|
||||
self.proj3 = nn.utils.weight_norm(
|
||||
nn.Linear(skip_channels, output_dim), dim=1)
|
||||
|
||||
self.loss_type = loss_type
|
||||
self.output_dim = output_dim
|
||||
self.input_dim = 1
|
||||
self.skip_channels = skip_channels
|
||||
self.log_scale_min = log_scale_min
|
||||
|
||||
def forward(self, x, condition=None):
|
||||
"""Forward pass of ``WaveNet``.
|
||||
|
||||
Parameters
|
||||
-----------
|
||||
x : Tensor [shape=(B, T)]
|
||||
The input waveform.
|
||||
condition : Tensor, optional [shape=(B, C_cond, T)]
|
||||
the upsampled condition. Defaults to None.
|
||||
|
||||
Returns
|
||||
-------
|
||||
Tensor: [shape=(B, T, C_output)]
|
||||
The parameters of the output distributions.
|
||||
"""
|
||||
|
||||
# Causal Conv
|
||||
if self.loss_type == "softmax":
|
||||
x = paddle.clip(x, min=-1., max=0.99999)
|
||||
x = quantize(x, self.output_dim)
|
||||
x = self.embed(x) # (B, T, C)
|
||||
else:
|
||||
x = paddle.unsqueeze(x, -1) # (B, T, 1)
|
||||
x = self.embed(x) # (B, T, C)
|
||||
x = paddle.transpose(x, perm=[0, 2, 1]) # (B, C, T)
|
||||
|
||||
# Residual & Skip-conenection & linears
|
||||
z = self.resnet(x, condition)
|
||||
|
||||
z = paddle.transpose(z, [0, 2, 1])
|
||||
z = F.relu(self.proj2(F.relu(self.proj1(z))))
|
||||
|
||||
y = self.proj3(z)
|
||||
return y
|
||||
|
||||
def start_sequence(self):
|
||||
"""Prepare the WaveNet to generate a new sequence. This method should
|
||||
be called before starting to call ``add_input`` multiple times.
|
||||
"""
|
||||
self.resnet.start_sequence()
|
||||
|
||||
def add_input(self, x, condition=None):
|
||||
"""Compute the output distribution (represented by its parameters) for
|
||||
a step. It works similarly to the ``forward`` method but in a
|
||||
``step-in-step-out`` fashion.
|
||||
|
||||
Parameters
|
||||
-----------
|
||||
x : Tensor [shape=(B,)]
|
||||
A step of the input waveform.
|
||||
|
||||
condition : Tensor, optional [shape=(B, C_cond)]
|
||||
A step of the upsampled condition. Defaults to None.
|
||||
|
||||
Returns
|
||||
--------
|
||||
Tensor: [shape=(B, C_output)]
|
||||
A step of the parameters of the output distributions.
|
||||
"""
|
||||
# Causal Conv
|
||||
if self.loss_type == "softmax":
|
||||
x = paddle.clip(x, min=-1., max=0.99999)
|
||||
x = quantize(x, self.output_dim)
|
||||
x = self.embed(x) # (B, C)
|
||||
else:
|
||||
x = paddle.unsqueeze(x, -1) # (B, 1)
|
||||
x = self.embed(x) # (B, C)
|
||||
|
||||
# Residual & Skip-conenection & linears
|
||||
z = self.resnet.add_input(x, condition)
|
||||
z = F.relu(self.proj2(F.relu(self.proj1(z)))) # (B, C)
|
||||
|
||||
# Output
|
||||
y = self.proj3(z)
|
||||
return y
|
||||
|
||||
def compute_softmax_loss(self, y, t):
|
||||
"""Compute the loss when output distributions are categorial
|
||||
distributions.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
y : Tensor [shape=(B, T, C_output)]
|
||||
The logits of the output distributions.
|
||||
|
||||
t : Tensor [shape=(B, T)]
|
||||
The target audio. The audio is first quantized then used as the
|
||||
target.
|
||||
|
||||
Notes
|
||||
-------
|
||||
Output distributions whose input contains padding are neglected in
loss computation, so the first ``context_size`` steps do not
contribute to the loss.
|
||||
|
||||
Returns
|
||||
--------
|
||||
Tensor: [shape=(1,)]
|
||||
The loss.
|
||||
"""
|
||||
# context size is not taken into account
|
||||
y = y[:, self.context_size:, :]
|
||||
t = t[:, self.context_size:]
|
||||
t = paddle.clip(t, min=-1.0, max=0.99999)
|
||||
quantized = quantize(t, n_bands=self.output_dim)
|
||||
label = paddle.unsqueeze(quantized, -1)
|
||||
|
||||
loss = F.softmax_with_cross_entropy(y, label)
|
||||
reduced_loss = paddle.mean(loss)
|
||||
return reduced_loss
|
||||
|
||||
def sample_from_softmax(self, y):
|
||||
"""Sample from the output distribution when the output distributions
|
||||
are categorical distributions.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
y : Tensor [shape=(B, T, C_output)]
|
||||
The logits of the output distributions.
|
||||
|
||||
Returns
|
||||
--------
|
||||
Tensor [shape=(B, T)]
|
||||
Waveform sampled from the output distribution.
|
||||
"""
|
||||
# dequantize
|
||||
batch_size, time_steps, output_dim, = y.shape
|
||||
y = paddle.reshape(y, (batch_size * time_steps, output_dim))
|
||||
prob = F.softmax(y)
|
||||
quantized = paddle.fluid.layers.sampling_id(prob)
|
||||
samples = dequantize(quantized, n_bands=self.output_dim)
|
||||
samples = paddle.reshape(samples, (batch_size, -1))
|
||||
return samples
|
||||
|
||||
def compute_mog_loss(self, y, t):
|
||||
"""Compute the loss where output distributions is a mixture of
|
||||
Gaussians distributions.
|
||||
|
||||
Parameters
|
||||
-----------
|
||||
y : Tensor [shape=(B, T, C_output)]
|
||||
The parameters of the output distribution. It is the concatenation
|
||||
of 3 parts, the logits of every distribution, the mean of each
|
||||
distribution and the log standard deviation of each distribution.
|
||||
|
||||
Each part's shape is (B, T, n_mixture), where ``n_mixture`` means
|
||||
the number of Gaussians in the mixture.
|
||||
|
||||
t : Tensor [shape=(B, T)]
|
||||
The target audio.
|
||||
|
||||
Notes
|
||||
-------
|
||||
Output distributions whose input contains padding are neglected in
loss computation, so the first ``context_size`` steps do not
contribute to the loss.
|
||||
|
||||
Returns
|
||||
--------
|
||||
Tensor: [shape=(1,)]
|
||||
The loss.
|
||||
"""
|
||||
n_mixture = self.output_dim // 3
|
||||
|
||||
# context size is not taken in to account
|
||||
y = y[:, self.context_size:, :]
|
||||
t = t[:, self.context_size:]
|
||||
|
||||
w, mu, log_std = paddle.split(y, 3, axis=2)
|
||||
# 100.0 is just a large float
|
||||
log_std = paddle.clip(log_std, min=self.log_scale_min, max=100.)
|
||||
inv_std = paddle.exp(-log_std)
|
||||
p_mixture = F.softmax(w, -1)
|
||||
|
||||
t = paddle.unsqueeze(t, -1)
|
||||
if n_mixture > 1:
|
||||
# t = F.expand_as(t, log_std)
|
||||
t = paddle.expand(t, [-1, -1, n_mixture])
|
||||
|
||||
x_std = inv_std * (t - mu)
|
||||
exponent = paddle.exp(-0.5 * x_std * x_std)
|
||||
pdf_x = 1.0 / math.sqrt(2.0 * math.pi) * inv_std * exponent
|
||||
|
||||
pdf_x = p_mixture * pdf_x
|
||||
# pdf_x: [bs, len]
|
||||
pdf_x = paddle.sum(pdf_x, -1)
|
||||
per_sample_loss = -paddle.log(pdf_x + 1e-9)
|
||||
|
||||
loss = paddle.mean(per_sample_loss)
|
||||
return loss
|
||||
|
||||
def sample_from_mog(self, y):
|
||||
"""Sample from the output distribution when the output distribution
|
||||
is a mixture of Gaussian distributions.
|
||||
|
||||
Parameters
|
||||
------------
|
||||
y : Tensor [shape=(B, T, C_output)]
|
||||
The parameters of the output distribution. It is the concatenation
|
||||
of 3 parts, the logits of every distribution, the mean of each
|
||||
distribution and the log standard deviation of each distribution.
|
||||
|
||||
Each part's shape is (B, T, n_mixture), where ``n_mixture`` means
|
||||
the number of Gaussians in the mixture.
|
||||
|
||||
Returns
|
||||
--------
|
||||
Tensor: [shape=(B, T)]
|
||||
Waveform sampled from the output distribution.
|
||||
"""
|
||||
batch_size, time_steps, output_dim = y.shape
|
||||
n_mixture = output_dim // 3
|
||||
|
||||
w, mu, log_std = paddle.split(y, 3, -1)
|
||||
|
||||
reshaped_w = paddle.reshape(w, (batch_size * time_steps, n_mixture))
|
||||
prob_ids = paddle.fluid.layers.sampling_id(F.softmax(reshaped_w))
|
||||
prob_ids = paddle.reshape(prob_ids, (batch_size, time_steps))
|
||||
prob_ids = prob_ids.numpy()
|
||||
|
||||
# do it
|
||||
index = np.array([[[b, t, prob_ids[b, t]] for t in range(time_steps)]
|
||||
for b in range(batch_size)]).astype("int32")
|
||||
index_var = paddle.to_tensor(index)
|
||||
|
||||
mu_ = paddle.gather_nd(mu, index_var)
|
||||
log_std_ = paddle.gather_nd(log_std, index_var)
|
||||
|
||||
dist = D.Normal(mu_, paddle.exp(log_std_))
|
||||
samples = dist.sample(shape=[])
|
||||
samples = paddle.clip(samples, min=-1., max=1.)
|
||||
return samples
|
||||
|
||||
def sample(self, y):
|
||||
"""Sample from the output distribution.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
y : Tensor [shape=(B, T, C_output)]
|
||||
The parameters of the output distribution.
|
||||
|
||||
Returns
|
||||
--------
|
||||
Tensor [shape=(B, T)]
|
||||
Waveform sampled from the output distribution.
|
||||
"""
|
||||
if self.loss_type == "softmax":
|
||||
return self.sample_from_softmax(y)
|
||||
else:
|
||||
return self.sample_from_mog(y)
|
||||
|
||||
def loss(self, y, t):
|
||||
"""Compute the loss given the output distribution and the target.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
y : Tensor [shape=(B, T, C_output)]
|
||||
The parameters of the output distribution.
|
||||
|
||||
t : Tensor [shape=(B, T)]
|
||||
The target audio.
|
||||
|
||||
Returns
|
||||
---------
|
||||
Tensor: [shape=(1,)]
|
||||
The loss.
|
||||
"""
|
||||
if self.loss_type == "softmax":
|
||||
return self.compute_softmax_loss(y, t)
|
||||
else:
|
||||
return self.compute_mog_loss(y, t)
|
||||
|
||||
|
||||


class ConditionalWaveNet(nn.Layer):
    r"""Conditional WaveNet. An implementation of
    `WaveNet: A Generative Model for Raw Audio <http://arxiv.org/abs/1609.03499>`_.

    It contains an UpsampleNet as the encoder and a WaveNet as the decoder.
    It is an autoregressive model that generates raw audio.

    Parameters
    ----------
    upsample_factors : List[int]
        The upsampling factors of the UpsampleNet.

    n_stack : int
        Number of convolution stacks in the WaveNet.

    n_loop : int
        Number of convolution layers in a convolution stack.

        Convolution layers in a stack have exponentially growing dilations,
        from 1 to :math:`k^{n_{loop} - 1}`, where k is the kernel size.

    residual_channels : int
        Feature size of each ResidualBlock.

    output_dim : int
        Feature size of the output. See ``loss_type`` for details.

    n_mels : int
        The number of bands of the mel spectrogram.

    filter_size : int, optional
        Convolution kernel size of each ResidualBlock, by default 2.

    loss_type : str, optional ["mog" or "softmax"]
        The output type and loss type of the model, by default "mog".

        If "softmax", the model input should be quantized audio and the model
        outputs a discrete distribution.

        If "mog", the model input is audio in floating point format, and the
        model outputs parameters for a mixture of Gaussian distributions.
        Namely, the weight, mean and log scale of each Gaussian distribution.
        Thus, ``output_dim`` should be a multiple of 3.

    log_scale_min : float, optional
        Minimum value of the log scale of Gaussian distributions, by default
        -9.0.

        This is only used for computing loss when ``loss_type`` is "mog". If
        the predicted log scale is less than -9.0, it is clipped at -9.0.
    """

    def __init__(self,
                 upsample_factors: List[int],
                 n_stack: int,
                 n_loop: int,
                 residual_channels: int,
                 output_dim: int,
                 n_mels: int,
                 filter_size: int=2,
                 loss_type: str="mog",
                 log_scale_min: float=-9.0):
        super(ConditionalWaveNet, self).__init__()
        self.encoder = UpsampleNet(upsample_factors)
        self.decoder = WaveNet(
            n_stack=n_stack,
            n_loop=n_loop,
            residual_channels=residual_channels,
            output_dim=output_dim,
            condition_dim=n_mels,
            filter_size=filter_size,
            loss_type=loss_type,
            log_scale_min=log_scale_min)
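
    # Construction sketch (the hyperparameter values below are illustrative only;
    # real values come from the experiment config):
    #
    #     >>> model = ConditionalWaveNet(
    #     ...     upsample_factors=[16, 16],   # product should match the mel hop length
    #     ...     n_stack=2,
    #     ...     n_loop=10,
    #     ...     residual_channels=128,
    #     ...     output_dim=30,               # 3 * 10 Gaussians when loss_type="mog"
    #     ...     n_mels=80,
    #     ...     filter_size=2,
    #     ...     loss_type="mog",
    #     ...     log_scale_min=-9.0)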

    def forward(self, audio, mel, audio_start):
        """Compute the output distribution given the mel spectrogram and the
        input audio (for teacher-forced training).

        Parameters
        -----------
        audio : Tensor [shape=(B, T_audio)]
            Ground truth waveform, used for teacher-forced training.

        mel : Tensor [shape=(B, F, T_mel)]
            Mel spectrogram. Note that it is the spectrogram for the whole
            utterance.

        audio_start : Tensor [shape=(B,), dtype: int]
            Audio slices' start positions for each utterance.

        Returns
        ----------
        Tensor [shape=(B, T_audio - 1, C_output)]
            Parameters of the output distribution, where ``C_output`` is the
            ``output_dim`` of the decoder.
        """
        audio_length = audio.shape[1]  # audio clip's length
        condition = self.encoder(mel)
        condition_slice = crop(condition, audio_start, audio_length)

        # shift one step so each output is conditioned only on past samples
        audio = audio[:, :-1]
        condition_slice = condition_slice[:, :, 1:]

        y = self.decoder(audio, condition_slice)
        return y

    def loss(self, y, t):
        """Compute loss with respect to the output distribution and the target
        audio.

        Parameters
        -----------
        y : Tensor [shape=(B, T - 1, C_output)]
            Parameters of the output distribution.

        t : Tensor [shape=(B, T)]
            Target waveform.

        Returns
        --------
        Tensor: [shape=(1,)]
            The loss.
        """
        t = t[:, 1:]
        loss = self.decoder.loss(y, t)
        return loss
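
    # Teacher-forced training-step sketch (tensor shapes and values are
    # illustrative assumptions):
    #
    #     >>> audio = paddle.randn([4, 8000])                  # audio clips, one per utterance
    #     >>> mel = paddle.randn([4, 80, 200])                 # whole-utterance mel spectrograms
    #     >>> audio_start = paddle.zeros([4], dtype="int64")   # where each clip starts
    #     >>> y = model(audio, mel, audio_start)               # (4, 7999, output_dim)
    #     >>> loss = model.loss(y, audio)
    #     >>> loss.backward()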

    def sample(self, y):
        """Sample from the output distribution.

        Parameters
        -----------
        y : Tensor [shape=(B, T, C_output)]
            Parameters of the output distribution.

        Returns
        --------
        Tensor [shape=(B, T)]
            Sampled waveform from the output distribution.
        """
        samples = self.decoder.sample(y)
        return samples

    @paddle.no_grad()
    def infer(self, mel):
        r"""Synthesize waveform from mel spectrogram.

        Parameters
        -----------
        mel : Tensor [shape=(B, F, T)]
            The condition (mel spectrogram here).

        Returns
        -----------
        Tensor [shape=(B, T \* upscale_factor)]
            Synthesized waveform.

            ``upscale_factor`` is the ``upscale_factor`` of the encoder
            ``UpsampleNet``.
        """
        condition = self.encoder(mel)
        batch_size, _, time_steps = condition.shape
        samples = []

        self.decoder.start_sequence()
        x_t = paddle.zeros((batch_size, ), dtype=mel.dtype)
        for i in trange(time_steps):
            c_t = condition[:, :, i]  # (B, C)
            y_t = self.decoder.add_input(x_t, c_t)  # (B, C)
            y_t = paddle.unsqueeze(y_t, 1)
            x_t = self.sample(y_t)  # (B, 1)
            x_t = paddle.squeeze(x_t, 1)  # (B,)
            samples.append(x_t)
        samples = paddle.stack(samples, -1)
        return samples

    @paddle.no_grad()
    def predict(self, mel):
        r"""Synthesize audio from mel spectrogram.

        The output and input are numpy arrays without batch.

        Parameters
        ----------
        mel : np.ndarray [shape=(C, T)]
            Mel spectrogram of an utterance.

        Returns
        -------
        np.ndarray [shape=(T \* upsample_factor,)]
            The synthesized waveform of an utterance.
        """
        mel = paddle.to_tensor(mel)
        mel = paddle.unsqueeze(mel, 0)
        audio = self.infer(mel)
        audio = audio[0].numpy()
        return audio
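
    # Synthesis sketch (``mel`` is assumed to be an (n_mels, T) float array from
    # the matching feature pipeline; the sample rate below is illustrative):
    #
    #     >>> import soundfile as sf
    #     >>> audio = model.predict(mel)
    #     >>> sf.write("synthesized.wav", audio, samplerate=22050)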

    @classmethod
    def from_pretrained(cls, config, checkpoint_path):
        """Build a ConditionalWaveNet model from a pretrained model.

        Parameters
        ----------
        config: yacs.config.CfgNode
            model configs

        checkpoint_path: Path or str
            the path of pretrained model checkpoint, without extension name

        Returns
        -------
        ConditionalWaveNet
            The model built from pretrained result.
        """
        model = cls(upsample_factors=config.model.upsample_factors,
                    n_stack=config.model.n_stack,
                    n_loop=config.model.n_loop,
                    residual_channels=config.model.residual_channels,
                    output_dim=config.model.output_dim,
                    n_mels=config.data.n_mels,
                    filter_size=config.model.filter_size,
                    loss_type=config.model.loss_type,
                    log_scale_min=config.model.log_scale_min)
        layer_tools.summary(model)
        checkpoint.load_parameters(model, checkpoint_path=checkpoint_path)
        return model
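
    # Loading sketch (file names below are hypothetical; the checkpoint path is
    # passed without its extension, as the docstring above notes):
    #
    #     >>> from yacs.config import CfgNode
    #     >>> config = CfgNode(new_allowed=True)
    #     >>> config.merge_from_file("wavenet_ljspeech.yaml")
    #     >>> model = ConditionalWaveNet.from_pretrained(config, "step-2000000")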