Compare commits

...

34 Commits

Author SHA1 Message Date
Feiyu Chan e14ef0432c
Merge pull request #105 from iclementine/release/v0.2
fix fmax for example/waveflow
2021-04-14 14:42:43 +08:00
chenfeiyu b666b830a5 fix fmax for example/waveflow 2021-04-14 14:38:31 +08:00
Feiyu Chan 66620977ba
Merge pull request #104 from iclementine/release/v0.2
update code for tacotron2
2021-04-13 16:19:09 +08:00
chenfeiyu 9a11dce942 update collate function, data loader not does not convert nested list into numpy array. 2021-04-13 16:17:46 +08:00
Feiyu Chan 1552e34401
Merge pull request #100 from iclementine/release/v0.2
add pretrained models
2021-03-15 15:16:34 +08:00
iclementine ee1a9a04d8 add pretrained models 2021-03-15 15:10:09 +08:00
iclementine 87a76c339c update to 0.2.0 2021-03-10 15:56:34 +08:00
iclementine ea43c3f82a temporarily remove wavenet, version update 2021-03-10 15:53:15 +08:00
iclementine a1aa401d8c Merge branch 'develop' into release/v0.2 2021-03-10 15:40:08 +08:00
Feiyu Chan ba39773b5a
Merge pull request #93 from iclementine/release/v0.2
update to 0.2.0-beta3
2021-02-07 22:46:43 +08:00
chenfeiyu cb3ea54118 update version string 2021-02-07 22:00:03 +08:00
chenfeiyu 4e876e555c Merge branch 'develop' into release/v0.2 2021-02-07 21:59:28 +08:00
Feiyu Chan bdf2f680b7
Merge pull request #91 from zh794390558/patch-5
Update basic.rst
2021-02-05 10:44:53 +08:00
Feiyu Chan 7019dcfe97
Merge pull request #89 from zh794390558/patch-3
Update data_cn.md
2021-02-05 10:43:54 +08:00
Feiyu Chan e2bca531ef
Merge pull request #88 from zh794390558/patch-2
Update data_cn.md
2021-02-05 10:43:13 +08:00
Hui Zhang 8d6b7dc4d0
Update basic.rst 2021-02-02 14:20:10 +08:00
Hui Zhang db63e4096c
Update data_cn.md 2021-02-01 17:19:07 +08:00
Hui Zhang 7b77c9b26c
Update data_cn.md 2021-02-01 17:11:03 +08:00
Feiyu Chan a42838c717
Merge pull request #87 from PaddlePaddle/develop
merge develop: update docs
2021-01-20 18:48:23 +08:00
iclementine c3a9f6d89b version update to 0.2.0beta2 2021-01-19 17:09:19 +08:00
iclementine 5eebbd0716 Merge branch 'develop' into release/v0.2 2021-01-19 17:08:06 +08:00
Feiyu Chan 9f256e325c
Merge pull request #84 from iclementine/release/v0.2
fix a bug when using a method other than forward with DataParallel
2021-01-11 17:28:58 +08:00
chenfeiyu a9f633f489 fix a bug when using a method other than forward with DataParallel 2021-01-11 17:25:28 +08:00
Feiyu Chan 3f1928604a
Merge pull request #81 from iclementine/release/v0.2
fix: the condition to init DataParallel
2021-01-11 17:17:49 +08:00
chenfeiyu bd773744ac fix: the condition to init DataParallel 2021-01-11 17:14:48 +08:00
chenfeiyu 289827c9cf Merge branch 'develop' into release/v0.2 2021-01-11 17:00:35 +08:00
Feiyu Chan 610b8f2cef
Merge pull request #78 from lfchener/develop
fix an encoding problem in windows
2021-01-08 11:04:23 +08:00
Feiyu Chan f7d85e7058
Merge pull request #76 from PaddlePaddle/revert-74-reborn
Revert "bug fix: apply dropout to logits before softmax"
2021-01-07 15:24:18 +08:00
chenfeiyu a5991a8f90 Merge branch 'develop' into release/v0.2 2020-12-31 16:57:36 +08:00
Feiyu Chan df627d6a2e
Merge pull request #73 from PaddlePaddle/develop
add README for transformer_tts, waveflow and wavenet
2020-12-30 15:57:43 +08:00
Feiyu Chan e55ea4555e
Merge pull request #70 from PaddlePaddle/develop
fix the behavior of dropout in eval of tacotron2
2020-12-28 16:35:06 +08:00
chenfeiyu 54003d5ecc Merge branch 'release/v0.2' of https://github.com/PaddlePaddle/Parakeet into release/v0.2 2020-12-21 17:48:08 +08:00
chenfeiyu 80bb465d3c update version str to beta 1 2020-12-21 17:46:39 +08:00
Feiyu Chan 94c8585140
Merge pull request #68 from PaddlePaddle/develop
fix positional encoding naming conflict
2020-12-21 17:44:19 +08:00
15 changed files with 26 additions and 1665 deletions

View File

@ -59,7 +59,6 @@ See [install](https://paddle-parakeet.readthedocs.io/en/latest/install.html) for
Entries to the introduction, and the launch of training and synthesis for different example models:
- [>>> WaveFlow](./examples/waveflow)
- [>>> WaveNet](./examples/wavenet)
- [>>> Transformer TTS](./examples/transformer_tts)
- [>>> Tacotron2](./examples/tacotron2)
@ -70,6 +69,14 @@ Entries to the introduction, and the launch of training and synthsis for differe
Check our [website](https://paddle-parakeet.readthedocs.io/en/latest/demo.html) for audio samples.
## Pretrained models
Models pretrained on LJSpeech can be downloaded here.
[tacotron2](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.2.zip)
[transformer_tts](https://paddlespeech.bj.bcebos.com/Parakeet/transformer_tts_ljspeech_ckpt_0.2.zip)
[waveflow_res_128](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_0.2.zip)
## Copyright and License
Parakeet is provided under the [Apache-2.0 license](LICENSE).

View File

@ -52,7 +52,7 @@ vocoder
Like the example above, after loading the pretrained ``ConditionalWaveFlow``
model, call ``model.predict(mel)`` to synthesize raw audio (in wav format).
>>> import soundfile as df
>>> import soundfile as sf
>>> from parakeet.models import ConditionalWaveFlow
>>>
>>> # load the pretrained model
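For readers skimming this diff, a fuller version of that snippet might look like the sketch below. It is an illustration, not the file's own content: the checkpoint directory, step name, and mel file are placeholders, and `ConditionalWaveFlow.from_pretrained` is assumed to follow the same `from_pretrained(config, checkpoint_path)` interface that `ConditionalWaveNet` exposes later in this diff.
```python
from pathlib import Path

import numpy as np
import soundfile as sf
import yacs.config
from parakeet.models import ConditionalWaveFlow

# Placeholder paths: an unpacked pretrained checkpoint and a mel spectrogram in .npy format.
checkpoint_dir = Path("waveflow_res128_ljspeech_ckpt_0.2")
config = yacs.config.CfgNode.load_cfg(open(checkpoint_dir / "config.yaml"))
checkpoint_path = str(checkpoint_dir / "step-2000000")

# Load the pretrained vocoder, then turn a mel spectrogram into raw audio.
model = ConditionalWaveFlow.from_pretrained(config, checkpoint_path)
mel = np.load("mels/LJ001-0001.npy")
audio = model.predict(mel)
sf.write("LJ001-0001.wav", audio, samplerate=config.data.sample_rate)
```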

View File

@ -6,7 +6,7 @@
## Dataset
We assume a dataset is a list of examples. You can get its length with the `__len__` method and randomly access its elements with the `__getitem__` method. With these two adjustments, we can also use `iter(dataset)` to obtain an iterator over the dataset. We usually create our own dataset by subclassing `paddle.io.Dataset`; implementing the `__len__` and `__getitem__` methods for it is enough.
We assume a dataset is a list of examples. You can get its length with the `__len__` method and randomly access its elements with the `__getitem__` method. With these two conditions, we can also use `iter(dataset)` to obtain an iterator over the dataset. We usually create our own dataset by subclassing `paddle.io.Dataset`; implementing the `__len__` and `__getitem__` methods for it is enough.
Considering data processing, data loading, and dataset size, several strategies can be adopted to control whether the dataset is preprocessed lazily, loaded lazily, kept resident in memory, and so on.
@ -86,7 +86,7 @@ A Sampler is implemented as an iterable object that yields integers. Suppose the dataset has `N`
When iterating over a DataLoader, the sampler first yields a number of indices; the corresponding examples are then fetched by these indices, and the batch function is called to assemble them into a batch. Fetching the examples can be parallelized, but calling the batch function to build the batch cannot.
Another option is to use a batch sampler, which is an iterable object that yields lists of integers. With an ordinary sampler, `next` has to be called on its iterator several times to produce several indices, whereas a single `next` call on a batch sampler's iterator yields several indices at once. When an ordinary sampler is used, the batch size is decided by the DataLoader; with a batch sampler, it is the batch sampler that determines the DataLoader's batch size, so it can be used to implement special requirements such as dynamic batch sizes.
Another option is to use a batch sampler, which is an iterable object that yields lists of integers. With an ordinary sampler, `next` has to be called on its iterator several times to produce several indices, whereas a single `next` call on a batch sampler's iterator yields several indices at once. When an ordinary sampler is used, the batch size is decided by the DataLoader; with a batch sampler, it is the batch sampler that determines the DataLoader's batch size, so it can be used to implement special requirements such as dynamic batch sizes.
## Example Code
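Below is a minimal sketch (not the file's own example code, which the diff does not show) of a `paddle.io.Dataset` driven by a batch sampler and a custom batch function:
```python
import numpy as np
import paddle
from paddle.io import BatchSampler, DataLoader, Dataset


class ToySpecDataset(Dataset):
    """A toy dataset: each example is a float32 array of shape (80, random_length)."""

    def __init__(self, n_examples=128, n_mels=80):
        self.examples = [
            np.random.randn(n_mels, np.random.randint(50, 100)).astype("float32")
            for _ in range(n_examples)
        ]

    def __len__(self):  # required: the dataset's length
        return len(self.examples)

    def __getitem__(self, i):  # required: random access to one example
        return self.examples[i]


def collate(examples):
    """Batch function: pad variable-length examples into one dense batch."""
    max_len = max(x.shape[1] for x in examples)
    padded = [np.pad(x, ((0, 0), (0, max_len - x.shape[1]))) for x in examples]
    return paddle.to_tensor(np.stack(padded))


dataset = ToySpecDataset()
# With a batch sampler, the sampler (not the DataLoader) decides the batch size.
sampler = BatchSampler(dataset, batch_size=8, shuffle=True, drop_last=True)
loader = DataLoader(dataset, batch_sampler=sampler, collate_fn=collate)

for batch in loader:
    print(batch.shape)  # [8, 80, max_len_in_this_batch]
    break
```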

View File

@ -86,6 +86,7 @@ class LJSpeechCollector(object):
for i, _ in sorted(
zip(mel_lens, text_lens), key=lambda x: x[1], reverse=True)
]
mel_lens = np.array(mel_lens, dtype=np.int64)
stop_tokens = [
i
@ -93,7 +94,7 @@ class LJSpeechCollector(object):
zip(stop_tokens, text_lens), key=lambda x: x[1], reverse=True)
]
text_lens = sorted(text_lens, reverse=True)
text_lens = np.array(sorted(text_lens, reverse=True), dtype=np.int64)
# Pad sequence with largest len of the batch
texts = batch_text_id(texts, pad_id=self.padding_idx)

View File

@ -23,7 +23,8 @@ _C.data = CN(
n_fft=1024, # fft frame size
win_length=1024, # window size
hop_length=256, # hop size between adjacent frames
f_max=8000, # Hz, max frequency when converting to mel
fmin=0,
fmax=8000, # Hz, max frequency when converting to mel
n_mels=80, # mel bands
clip_frames=65, # mel clip frames
))
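For reference, a sketch (not part of the diff) of overriding the new `fmin`/`fmax` entries through yacs, assuming it is run from the waveflow example directory so its `config.py` is importable:
```python
from config import get_cfg_defaults  # the waveflow example's config module

config = get_cfg_defaults()
# Override the new mel filter bounds, e.g. as the example scripts' --opts flag would.
config.merge_from_list(["data.fmin", 0, "data.fmax", 7600])
config.freeze()
print(config.data.fmin, config.data.fmax)  # 0 7600
```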

View File

@ -30,12 +30,14 @@ from config import get_cfg_defaults
class Transform(object):
def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels):
def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels, fmin, fmax):
self.sample_rate = sample_rate
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.n_mels = n_mels
self.fmin = fmin
self.fmax = fmax
self.spec_normalizer = LogMagnitude(min=1e-5)
@ -47,6 +49,8 @@ class Transform(object):
win_length = self.win_length
hop_length = self.hop_length
n_mels = self.n_mels
fmin = self.fmin
fmax = self.fmax
wav, loaded_sr = librosa.load(wav_path, sr=None)
assert loaded_sr == sr, "sample rate does not match, resampling applied"
@ -78,7 +82,9 @@ class Transform(object):
# Compute mel-spectrograms.
mel_filter_bank = librosa.filters.mel(sr=sr,
n_fft=n_fft,
n_mels=n_mels)
n_mels=n_mels,
fmin=fmin,
fmax=fmax)
mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)
mel_spectrogram = mel_spectrogram
@ -101,7 +107,7 @@ def create_dataset(config, input_dir, output_dir, verbose=True):
output_dir.mkdir(exist_ok=True)
transform = Transform(config.sample_rate, config.n_fft, config.win_length,
config.hop_length, config.n_mels)
config.hop_length, config.n_mels, config.fmin, config.fmax)
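And a quick, hypothetical sanity check of the updated transform; the wav path is a placeholder, the 3-tuple layout of `example` follows the `wav_path, _, _ = example` unpacking in `Transform.__call__`, the `(audio, mel)` return value mirrors the wavenet counterpart below, and `config.data` is assumed to also carry `sample_rate` as the sibling configs do:
```python
from config import get_cfg_defaults  # the waveflow example's config module
from preprocess import Transform     # the waveflow example's preprocessing module

c = get_cfg_defaults().data
transform = Transform(c.sample_rate, c.n_fft, c.win_length, c.hop_length,
                      c.n_mels, c.fmin, c.fmax)

# A single LJSpeech example: (wav_path, raw_text, normalized_text).
audio, mel = transform(("LJSpeech-1.1/wavs/LJ001-0001.wav", "", ""))
print(mel.shape)  # (n_mels, n_frames)
```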
file_names = []
for example in tqdm.tqdm(dataset):

View File

@ -1,48 +0,0 @@
# WaveNet with LJSpeech
## Dataset
### Download the dataset.
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
```
### Extract the dataset.
```bash
tar xjvf LJSpeech-1.1.tar.bz2
```
### Preprocess the dataset.
Assume the path to save the preprocessed dataset is `ljspeech_wavenet`. Run the command below to preprocess the dataset.
```bash
python preprocess.py --input=LJSpeech-1.1/ --output=ljspeech_wavenet
```
## Train the model
The training script requires 4 command line arguments.
`--data` is the path of the training dataset, `--output` is the path of the output directory (we recommend using a subdirectory in `runs` to manage different experiments).
`--device` should be "cpu" or "gpu", `--nprocs` is the number of processes to train the model in parallel.
```bash
python train.py --data=ljspeech_wavenet/ --output=runs/test --device="gpu" --nprocs=1
```
If you want distributed training, set a larger `--nprocs` (e.g. 4). Note that distributed training with cpu is not supported yet.
## Synthesize
Synthesize waveforms. We assume `--input` is a directory containing several mel spectrograms (normalized into the range [0, 1)) in `.npy` format. The output is saved in the `--output` directory as several `.wav` files, each with the same name as its mel spectrogram.
`--checkpoint_path` should be the path of the parameter file (`.pdparams`) to load. Note that the extension `.pdparams` is not included here.
`--device` specifies the device to run synthesis on. Due to the autoregressive nature of WaveNet, using the CPU may be faster.
```bash
python synthesize.py --input=mels/ --output=wavs/ --checkpoint_path='step-2450000' --device="cpu" --verbose
```

View File

@ -1,58 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from yacs.config import CfgNode as CN
_C = CN()
_C.data = CN(
dict(
batch_size=8, # batch size
valid_size=16, # the first N examples are reserved for validation
sample_rate=22050, # Hz, sample rate
n_fft=2048, # fft frame size
win_length=1024, # window size
hop_length=256, # hop size between adjacent frames
# f_max=8000, # Hz, max frequency when converting to mel
n_mels=80, # mel bands
train_clip_seconds=0.5, # audio clip length(in seconds)
))
_C.model = CN(
dict(
upsample_factors=[16, 16],
n_stack=3,
n_loop=10,
filter_size=2,
residual_channels=128, # residual channels in each flow
loss_type="mog",
output_dim=3, # single gaussian
log_scale_min=-9.0, ))
_C.training = CN(
dict(
lr=1e-3, # learning rate
anneal_rate=0.5, # learning rate decay rate
anneal_interval=200000, # decrease lr by anneal_rate every anneal_interval steps
valid_interval=1000, # validation
save_interval=10000, # checkpoint
max_iteration=3000000, # max iteration to train
gradient_max_norm=100.0 # global norm of gradients
))
def get_cfg_defaults():
"""Get a yacs CfgNode object with default values for my_project."""
# Return a clone so that the defaults will not be altered
# This is for the "local variable" use pattern
return _C.clone()

View File

@ -1,151 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
from pathlib import Path
import pickle
import numpy as np
import pandas
from paddle.io import Dataset, DataLoader
from parakeet.data.batch import batch_spec, batch_wav
from parakeet.data import dataset
from parakeet.audio import AudioProcessor
class LJSpeech(Dataset):
"""A simple dataset adaptor for the processed ljspeech dataset."""
def __init__(self, root):
self.root = Path(root).expanduser()
meta_data = pandas.read_csv(
str(self.root / "metadata.csv"),
sep="\t",
header=None,
names=["fname", "frames", "samples"])
records = []
for row in meta_data.itertuples():
mel_path = str(self.root / "mel" / (row.fname + ".npy"))
wav_path = str(self.root / "wav" / (row.fname + ".npy"))
records.append((mel_path, wav_path))
self.records = records
def __getitem__(self, i):
mel_name, wav_name = self.records[i]
mel = np.load(mel_name)
wav = np.load(wav_name)
return mel, wav
def __len__(self):
return len(self.records)
class LJSpeechCollector(object):
"""A simple callable to batch LJSpeech examples."""
def __init__(self, padding_value=0.):
self.padding_value = padding_value
def __call__(self, examples):
batch_size = len(examples)
mels = [example[0] for example in examples]
wavs = [example[1] for example in examples]
mels = batch_spec(mels, pad_value=self.padding_value)
wavs = batch_wav(wavs, pad_value=self.padding_value)
audio_starts = np.zeros((batch_size, ), dtype=np.int64)
return mels, wavs, audio_starts
class LJSpeechClipCollector(object):
def __init__(self, clip_frames=65, hop_length=256):
self.clip_frames = clip_frames
self.hop_length = hop_length
def __call__(self, examples):
mels = []
wavs = []
starts = []
for example in examples:
mel, wav_clip, start = self.clip(example)
mels.append(mel)
wavs.append(wav_clip)
starts.append(start)
mels = batch_spec(mels)
wavs = np.stack(wavs)
starts = np.array(starts, dtype=np.int64)
return mels, wavs, starts
def clip(self, example):
mel, wav = example
frames = mel.shape[-1]
start = np.random.randint(0, frames - self.clip_frames)
wav_clip = wav[start * self.hop_length:(start + self.clip_frames) *
self.hop_length]
return mel, wav_clip, start
class DataCollector(object):
def __init__(self,
context_size,
sample_rate,
hop_length,
train_clip_seconds,
valid=False):
frames_per_second = sample_rate // hop_length
train_clip_frames = int(
np.ceil(train_clip_seconds * frames_per_second))
context_frames = context_size // hop_length
self.num_frames = train_clip_frames + context_frames
self.sample_rate = sample_rate
self.hop_length = hop_length
self.valid = valid
def random_crop(self, sample):
audio, mel_spectrogram = sample
audio_frames = int(audio.size) // self.hop_length
max_start_frame = audio_frames - self.num_frames
assert max_start_frame >= 0, "audio is too short to be cropped"
frame_start = np.random.randint(0, max_start_frame)
# frame_start = 0 # norandom
frame_end = frame_start + self.num_frames
audio_start = frame_start * self.hop_length
audio_end = frame_end * self.hop_length
audio = audio[audio_start:audio_end]
return audio, mel_spectrogram, audio_start
def __call__(self, samples):
# transform them first
if self.valid:
samples = [(audio, mel_spectrogram, 0)
for audio, mel_spectrogram in samples]
else:
samples = [self.random_crop(sample) for sample in samples]
# batch them
audios = [sample[0] for sample in samples]
audio_starts = [sample[2] for sample in samples]
mels = [sample[1] for sample in samples]
mels = batch_spec(mels)
if self.valid:
audios = batch_wav(audios, dtype=np.float32)
else:
audios = np.array(audios, dtype=np.float32)
audio_starts = np.array(audio_starts, dtype=np.int64)
return audios, mels, audio_starts

View File

@ -1,161 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import tqdm
import csv
import argparse
import numpy as np
import librosa
from pathlib import Path
import pandas as pd
from paddle.io import Dataset
from parakeet.data import batch_spec, batch_wav
from parakeet.datasets import LJSpeechMetaData
from parakeet.audio import AudioProcessor
from parakeet.audio.spec_normalizer import UnitMagnitude
from config import get_cfg_defaults
class Transform(object):
def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels):
self.sample_rate = sample_rate
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.n_mels = n_mels
self.spec_normalizer = UnitMagnitude(min=1e-5)
def __call__(self, example):
wav_path, _, _ = example
sr = self.sample_rate
n_fft = self.n_fft
win_length = self.win_length
hop_length = self.hop_length
n_mels = self.n_mels
wav, loaded_sr = librosa.load(wav_path, sr=None)
assert loaded_sr == sr, "sample rate does not match, resampling applied"
# Pad audio to the right size.
frames = int(np.ceil(float(wav.size) / hop_length))
fft_padding = (n_fft - hop_length) // 2 # sound
desired_length = frames * hop_length + fft_padding * 2
pad_amount = (desired_length - wav.size) // 2
if wav.size % 2 == 0:
wav = np.pad(wav, (pad_amount, pad_amount), mode='reflect')
else:
wav = np.pad(wav, (pad_amount, pad_amount + 1), mode='reflect')
# Normalize audio.
wav = wav / np.abs(wav).max() * 0.999
# Compute mel-spectrogram.
# Turn center to False to prevent internal padding.
spectrogram = librosa.core.stft(
wav,
hop_length=hop_length,
win_length=win_length,
n_fft=n_fft,
center=False)
spectrogram_magnitude = np.abs(spectrogram)
# Compute mel-spectrograms.
mel_filter_bank = librosa.filters.mel(sr=sr,
n_fft=n_fft,
n_mels=n_mels)
mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)
mel_spectrogram = mel_spectrogram
# log scale mel_spectrogram.
mel_spectrogram = self.spec_normalizer.transform(mel_spectrogram)
# Extract the center of audio that corresponds to mel spectrograms.
audio = wav[fft_padding:-fft_padding]
assert mel_spectrogram.shape[1] * hop_length == audio.size
# there is no clipping here
return audio, mel_spectrogram
def create_dataset(config, input_dir, output_dir, verbose=True):
input_dir = Path(input_dir).expanduser()
dataset = LJSpeechMetaData(input_dir)
output_dir = Path(output_dir).expanduser()
output_dir.mkdir(exist_ok=True)
transform = Transform(config.sample_rate, config.n_fft, config.win_length,
config.hop_length, config.n_mels)
file_names = []
for example in tqdm.tqdm(dataset):
fname, _, _ = example
base_name = os.path.splitext(os.path.basename(fname))[0]
wav_dir = output_dir / "wav"
mel_dir = output_dir / "mel"
wav_dir.mkdir(exist_ok=True)
mel_dir.mkdir(exist_ok=True)
audio, mel = transform(example)
np.save(str(wav_dir / base_name), audio)
np.save(str(mel_dir / base_name), mel)
file_names.append((base_name, mel.shape[-1], audio.shape[-1]))
meta_data = pd.DataFrame.from_records(file_names)
meta_data.to_csv(
str(output_dir / "metadata.csv"), sep="\t", index=None, header=None)
print("saved meta data in to {}".format(
os.path.join(output_dir, "metadata.csv")))
print("Done!")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="create dataset")
parser.add_argument(
"--config",
type=str,
metavar="FILE",
help="extra config to overwrite the default config")
parser.add_argument(
"--input", type=str, help="path of the ljspeech dataset")
parser.add_argument(
"--output", type=str, help="path to save output dataset")
parser.add_argument(
"--opts",
nargs=argparse.REMAINDER,
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
)
parser.add_argument(
"-v", "--verbose", action="store_true", help="print msg")
config = get_cfg_defaults()
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
if args.verbose:
print(config.data)
print(args)
create_dataset(config.data, args.input, args.output, args.verbose)

View File

@ -1,82 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import numpy as np
import soundfile as sf
import os
from pathlib import Path
import paddle
import parakeet
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWaveNet
from parakeet.utils import layer_tools, checkpoint
from config import get_cfg_defaults
def main(config, args):
paddle.set_device(args.device)
model = ConditionalWaveNet.from_pretrained(config, args.checkpoint_path)
layer_tools.recursively_remove_weight_norm(model)
model.eval()
mel_dir = Path(args.input).expanduser()
output_dir = Path(args.output).expanduser()
output_dir.mkdir(parents=True, exist_ok=True)
for file_path in mel_dir.iterdir():
mel = np.load(str(file_path))
audio = model.predict(mel)
audio_path = output_dir / (
os.path.splitext(file_path.name)[0] + ".wav")
sf.write(audio_path, audio, config.data.sample_rate)
print("[synthesize] {} -> {}".format(file_path, audio_path))
if __name__ == "__main__":
config = get_cfg_defaults()
parser = argparse.ArgumentParser(
description="generate mel spectrogram with TransformerTTS.")
parser.add_argument(
"--config",
type=str,
metavar="FILE",
help="extra config to overwrite the default config")
parser.add_argument(
"--checkpoint_path", type=str, help="path of the checkpoint to load.")
parser.add_argument(
"--input",
type=str,
help="path of directory containing mel spectrogram (in .npy format)")
parser.add_argument("--output", type=str, help="path to save outputs")
parser.add_argument(
"--device", type=str, default="cpu", help="device type to use.")
parser.add_argument(
"--opts",
nargs=argparse.REMAINDER,
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
)
parser.add_argument(
"-v", "--verbose", action="store_true", help="print msg")
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
print(args)
main(config, args)

View File

@ -1,177 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
from pathlib import Path
import math
import numpy as np
import paddle
from paddle import distributed as dist
from paddle.io import DataLoader, DistributedBatchSampler
from tensorboardX import SummaryWriter
from collections import defaultdict
import parakeet
from parakeet.data import dataset
from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWaveNet
from parakeet.audio import AudioProcessor
from parakeet.utils import scheduler, mp_tools
from parakeet.training.cli import default_argument_parser
from parakeet.training.experiment import ExperimentBase
from parakeet.utils.mp_tools import rank_zero_only
from config import get_cfg_defaults
from ljspeech import LJSpeech, LJSpeechClipCollector, LJSpeechCollector
class Experiment(ExperimentBase):
def setup_model(self):
config = self.config
model = ConditionalWaveNet(
upsample_factors=config.model.upsample_factors,
n_stack=config.model.n_stack,
n_loop=config.model.n_loop,
residual_channels=config.model.residual_channels,
output_dim=config.model.output_dim,
n_mels=config.data.n_mels,
filter_size=config.model.filter_size,
loss_type=config.model.loss_type,
log_scale_min=config.model.log_scale_min)
if self.parallel:
model = paddle.DataParallel(model)
lr_scheduler = paddle.optimizer.lr.StepDecay(
config.training.lr, config.training.anneal_interval,
config.training.anneal_rate)
optimizer = paddle.optimizer.Adam(
lr_scheduler,
parameters=model.parameters(),
grad_clip=paddle.nn.ClipGradByGlobalNorm(
config.training.gradient_max_norm))
self.model = model
self.model_core = model._layers if self.parallel else model
self.optimizer = optimizer
def setup_dataloader(self):
config = self.config
args = self.args
ljspeech_dataset = LJSpeech(args.data)
valid_set, train_set = dataset.split(ljspeech_dataset,
config.data.valid_size)
# convolutional net's causal padding size
context_size = config.model.n_stack \
* sum([(config.model.filter_size - 1) * 2**i for i in range(config.model.n_loop)]) \
+ 1
context_frames = context_size // config.data.hop_length
# frames used to compute loss
frames_per_second = config.data.sample_rate // config.data.hop_length
train_clip_frames = math.ceil(config.data.train_clip_seconds *
frames_per_second)
num_frames = train_clip_frames + context_frames
batch_fn = LJSpeechClipCollector(num_frames, config.data.hop_length)
if not self.parallel:
train_loader = DataLoader(
train_set,
batch_size=config.data.batch_size,
shuffle=True,
drop_last=True,
collate_fn=batch_fn)
else:
sampler = DistributedBatchSampler(
train_set,
batch_size=config.data.batch_size,
shuffle=True,
drop_last=True)
train_loader = DataLoader(
train_set, batch_sampler=sampler, collate_fn=batch_fn)
valid_batch_fn = LJSpeechCollector()
valid_loader = DataLoader(
valid_set, batch_size=1, collate_fn=valid_batch_fn)
self.train_loader = train_loader
self.valid_loader = valid_loader
def train_batch(self):
start = time.time()
batch = self.read_batch()
data_loader_time = time.time() - start
self.model.train()
self.optimizer.clear_grad()
mel, wav, audio_starts = batch
y = self.model(wav, mel, audio_starts)
loss = self.model_core.loss(y, wav)
loss.backward()
self.optimizer.step()
iteration_time = time.time() - start
loss_value = float(loss)
msg = "Rank: {}, ".format(dist.get_rank())
msg += "step: {}, ".format(self.iteration)
msg += "time: {:>.3f}s/{:>.3f}s, ".format(data_loader_time,
iteration_time)
msg += "loss: {:>.6f}".format(loss_value)
self.logger.info(msg)
if dist.get_rank() == 0:
self.visualizer.add_scalar(
"train/loss", loss_value, global_step=self.iteration)
@mp_tools.rank_zero_only
@paddle.no_grad()
def valid(self):
valid_iterator = iter(self.valid_loader)
valid_losses = []
mel, wav, audio_starts = next(valid_iterator)
y = self.model(wav, mel, audio_starts)
loss = self.model_core.loss(y, wav)
valid_losses.append(float(loss))
valid_loss = np.mean(valid_losses)
self.visualizer.add_scalar(
"valid/loss", valid_loss, global_step=self.iteration)
def main_sp(config, args):
exp = Experiment(config, args)
exp.setup()
exp.run()
def main(config, args):
if args.nprocs > 1 and args.device == "gpu":
dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs)
else:
main_sp(config, args)
if __name__ == "__main__":
config = get_cfg_defaults()
parser = default_argument_parser()
args = parser.parse_args()
if args.config:
config.merge_from_file(args.config)
if args.opts:
config.merge_from_list(args.opts)
config.freeze()
print(config)
print(args)
main(config, args)

View File

@ -12,6 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
__version__ = "0.2.0-beta.0"
__version__ = "0.2.0"
from parakeet import audio, data, datasets, frontend, models, modules, training, utils

View File

@ -14,7 +14,7 @@
#from parakeet.models.clarinet import *
from parakeet.models.waveflow import *
from parakeet.models.wavenet import *
#from parakeet.models.wavenet import *
from parakeet.models.transformer_tts import *
#from parakeet.models.deepvoice3 import *

View File

@ -1,977 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import time
from typing import Union, Sequence, List
from tqdm import trange
import numpy as np
import paddle
from paddle import nn
from paddle.nn import functional as F
import paddle.fluid.initializer as I
import paddle.fluid.layers.distributions as D
from parakeet.modules.conv import Conv1dCell
from parakeet.modules.audio import quantize, dequantize, STFT
from parakeet.utils import checkpoint, layer_tools
__all__ = ["WaveNet", "ConditionalWaveNet"]
def crop(x, audio_start, audio_length):
"""Crop the upsampled condition to match audio_length.
The upsampled condition has the same time steps as the whole audio does.
But since audios are sliced to 0.5 seconds randomly while conditions are
not, upsampled conditions should also be sliced to exactly match the time
steps of the audio slice.
Parameters
----------
x : Tensor [shape=(B, C, T)]
The upsampled condition.
audio_start : Tensor [shape=(B,), dtype:int]
The index of the starting point of the audio clips.
audio_length : int
The length of the audio clip (number of samples it contains).
Returns
-------
Tensor [shape=(B, C, audio_length)]
Cropped condition.
"""
# crop audio
slices = [] # for each example
# paddle now supports Tensor of shape [1] in slice
# starts = audio_start.numpy()
for i in range(x.shape[0]):
start = audio_start[i]
end = start + audio_length
slice = paddle.slice(x[i], axes=[1], starts=[start], ends=[end])
slices.append(slice)
out = paddle.stack(slices)
return out
class UpsampleNet(nn.LayerList):
"""A network used to upsample mel spectrogram to match the time steps of
audio.
It consists of several layers of Conv2DTranspose. Each Conv2DTranspose
layer upsamples the time dimension by its `stride` times.
Also, each Conv2DTranspose's filter_size at frequency dimension is 3.
Parameters
----------
upscale_factors : List[int], optional
Time upsampling factors for each Conv2DTranspose Layer.
The ``UpsampleNet`` contains ``len(upscale_factor)`` Conv2DTranspose
Layers. Each upscale_factor is used as the ``stride`` for the
corresponding Conv2DTranspose. Defaults to [16, 16], so the default
upsampling factor is 256.
Notes
------
``np.prod(upscale_factors)`` should equal the ``hop_length`` of the stft
transformation used to extract spectrogram features from audio.
For example, ``16 * 16 = 256``, then the spectrogram extracted with a stft
transformation whose ``hop_length`` equals 256 is suitable.
See Also
---------
``librosa.core.stft``
"""
def __init__(self, upscale_factors=[16, 16]):
super(UpsampleNet, self).__init__()
self.upscale_factors = list(upscale_factors)
self.upscale_factor = 1
for item in upscale_factors:
self.upscale_factor *= item
for factor in self.upscale_factors:
self.append(
nn.utils.weight_norm(
nn.Conv2DTranspose(
1,
1,
kernel_size=(3, 2 * factor),
stride=(1, factor),
padding=(1, factor // 2))))
def forward(self, x):
r"""Compute the upsampled condition.
Parameters
-----------
x : Tensor [shape=(B, F, T)]
The condition (mel spectrogram here). ``F`` means the frequency
bands, which is the feature size of the input.
In the internal Conv2DTransposes, the frequency dimension
is treated as ``height`` dimension instead of ``in_channels``.
Returns:
Tensor [shape=(B, F, T \* upscale_factor)]
The upsampled condition.
"""
x = paddle.unsqueeze(x, 1)
for sublayer in self:
x = F.leaky_relu(sublayer(x), 0.4)
x = paddle.squeeze(x, 1)
return x
class ResidualBlock(nn.Layer):
"""A Residual block used in wavenet. Conv1D-gated-tanh Block.
It consists of a Conv1DCell and an Conv1D(kernel_size = 1) to integrate
information of the condition.
Notes
--------
It does not have parametric residual or skip connection.
Parameters
-----------
residual_channels : int
The feature size of the input. It is also the feature size of the
residual output and skip output.
condition_dim : int
The feature size of the condition.
filter_size : int
Kernel size of the internal convolution cells.
dilation :int
Dilation of the internal convolution cells.
"""
def __init__(self,
residual_channels: int,
condition_dim: int,
filter_size: Union[int, Sequence[int]],
dilation: int):
super(ResidualBlock, self).__init__()
dilated_channels = 2 * residual_channels
# following clarinet's implementation, we do not have parametric residual
# & skip connection.
_filter_size = filter_size[0] if isinstance(filter_size, (
list, tuple)) else filter_size
std = math.sqrt(1 / (_filter_size * residual_channels))
conv = Conv1dCell(
residual_channels,
dilated_channels,
filter_size,
dilation=dilation,
weight_attr=I.Normal(scale=std))
self.conv = nn.utils.weight_norm(conv)
std = math.sqrt(1 / condition_dim)
condition_proj = Conv1dCell(
condition_dim,
dilated_channels, (1, ),
weight_attr=I.Normal(scale=std))
self.condition_proj = nn.utils.weight_norm(condition_proj)
self.filter_size = filter_size
self.dilation = dilation
self.dilated_channels = dilated_channels
self.residual_channels = residual_channels
self.condition_dim = condition_dim
def forward(self, x, condition=None):
"""Forward pass of the ResidualBlock.
Parameters
-----------
x : Tensor [shape=(B, C, T)]
The input tensor.
condition : Tensor, optional [shape(B, C_cond, T)]
The condition.
It has been upsampled in time steps, so it has the same time steps
as the input does.(C_cond stands for the condition's channels).
Defaults to None.
Returns
-----------
residual : Tensor [shape=(B, C, T)]
The residual, which is used as the input to the next ResidualBlock.
skip_connection : Tensor [shape=(B, C, T)]
The skip connection. This output is accumulated with that of
other ResidualBlocks.
"""
h = x
# dilated conv
h = self.conv(h)
# condition
if condition is not None:
h += self.condition_proj(condition)
# gated tanh
content, gate = paddle.split(h, 2, axis=1)
z = F.sigmoid(gate) * paddle.tanh(content)
# projection
residual = paddle.scale(z + x, math.sqrt(.5))
skip_connection = z
return residual, skip_connection
def start_sequence(self):
"""Prepare the ResidualBlock to generate a new sequence.
Warnings
---------
This method should be called before calling ``add_input`` multiple times.
"""
self.conv.start_sequence()
self.condition_proj.start_sequence()
def add_input(self, x, condition=None):
"""Take a step input and return a step output.
This method works similarly to ``forward`` but in a
``step-in-step-out`` fashion.
Parameters
----------
x : Tensor [shape=(B, C)]
Input for a step.
condition : Tensor, optional [shape=(B, C_cond)]
Condition for a step. Defaults to None.
Returns
----------
residual : Tensor [shape=(B, C)]
The residual for a step, which is used as the input to the next
layer of ResidualBlock.
skip_connection : Tensor [shape=(B, C)]
The skip connection for a step. This output is accumulated with
that of other ResidualBlocks.
"""
h = x
# dilated conv
h = self.conv.add_input(h)
# condition
if condition is not None:
h += self.condition_proj.add_input(condition)
# gated tanh
content, gate = paddle.split(h, 2, axis=1)
z = F.sigmoid(gate) * paddle.tanh(content)
# projection
residual = paddle.scale(z + x, math.sqrt(0.5))
skip_connection = z
return residual, skip_connection
class ResidualNet(nn.LayerList):
"""The residual network in wavenet.
It consists of ``n_stack`` stacks, each of which consists of ``n_loop``
ResidualBlocks.
Parameters
----------
n_stack : int
Number of stacks in the ``ResidualNet``.
n_loop : int
Number of ResidualBlocks in a stack.
residual_channels : int
Input feature size of each ``ResidualBlock``'s input.
condition_dim : int
Feature size of the condition.
filter_size : int
Kernel size of the internal ``Conv1dCell`` of each ``ResidualBlock``.
"""
def __init__(self,
n_stack: int,
n_loop: int,
residual_channels: int,
condition_dim: int,
filter_size: int):
super(ResidualNet, self).__init__()
# double the dilation at each layer in a stack
dilations = [2**i for i in range(n_loop)] * n_stack
self.context_size = 1 + sum(dilations)
for dilation in dilations:
self.append(
ResidualBlock(residual_channels, condition_dim, filter_size,
dilation))
def forward(self, x, condition=None):
"""Forward pass of ``ResidualNet``.
Parameters
----------
x : Tensor [shape=(B, C, T)]
The input.
condition : Tensor, optional [shape=(B, C_cond, T)]
The condition, it has been upsampled in time steps, so it has the
same time steps as the input does. Defaults to None.
Returns
--------
Tensor [shape=(B, C, T)]
The output.
"""
for i, func in enumerate(self):
x, skip = func(x, condition)
if i == 0:
skip_connections = skip
else:
skip_connections = paddle.scale(skip_connections + skip,
math.sqrt(0.5))
return skip_connections
def start_sequence(self):
"""Prepare the ResidualNet to generate a new sequence. This method
should be called before starting calling ``add_input`` multiple times.
"""
for block in self:
block.start_sequence()
def add_input(self, x, condition=None):
"""Take a step input and return a step output.
This method works similarly to ``forward`` but in a
``step-in-step-out`` fashion.
Parameters
----------
x : Tensor [shape=(B, C)]
Input for a step.
condition : Tensor, optional [shape=(B, C_cond)]
Condition for a step. Defaults to None.
Returns
----------
Tensor [shape=(B, C)]
The skip connection for a step. This output is accumulated with
that of other ResidualBlocks.
"""
for i, func in enumerate(self):
x, skip = func.add_input(x, condition)
if i == 0:
skip_connections = skip
else:
skip_connections = paddle.scale(skip_connections + skip,
math.sqrt(0.5))
return skip_connections
class WaveNet(nn.Layer):
"""Wavenet that transform upsampled mel spectrogram into waveform.
Parameters
-----------
n_stack : int
``n_stack`` for the internal ``ResidualNet``.
n_loop : int
``n_loop`` for the internal ``ResidualNet``.
residual_channels : int
Feature size of the input.
output_dim : int
Feature size of the output. See ``loss_type`` for details.
condition_dim : int
Feature size of the condition (mel spectrogram bands).
filter_size : int
Kernel size of the internal ``ResidualNet``.
loss_type : str, optional ["mog" or "softmax"]
The output type and loss type of the model, by default "mog".
If "softmax", the model input is first quantized audio and the model
outputs a discrete categorical distribution.
If "mog", the model input is audio in floating point format, and the
model outputs parameters for a mixture of gaussian distributions.
Namely, the weight, mean and log scale of each gaussian distribution.
Thus, the ``output_size`` should be a multiple of 3.
log_scale_min : float, optional
Minimum value of the log scale of gaussian distributions, by default
-9.0.
This is only used for computing loss when ``loss_type`` is "mog", If
the predicted log scale is less than -9.0, it is clipped at -9.0.
"""
def __init__(self, n_stack, n_loop, residual_channels, output_dim,
condition_dim, filter_size, loss_type, log_scale_min):
super(WaveNet, self).__init__()
if loss_type not in ["softmax", "mog"]:
raise ValueError("loss_type {} is not supported".format(loss_type))
if loss_type == "softmax":
self.embed = nn.Embedding(output_dim, residual_channels)
else:
if (output_dim % 3 != 0):
raise ValueError(
"with Mixture of Gaussians(mog) output, the output dim must be divisible by 3, but get {}".
format(output_dim))
self.embed = nn.utils.weight_norm(
nn.Linear(1, residual_channels), dim=1)
self.resnet = ResidualNet(n_stack, n_loop, residual_channels,
condition_dim, filter_size)
self.context_size = self.resnet.context_size
skip_channels = residual_channels # assume the same channel
self.proj1 = nn.utils.weight_norm(
nn.Linear(skip_channels, skip_channels), dim=1)
self.proj2 = nn.utils.weight_norm(
nn.Linear(skip_channels, skip_channels), dim=1)
# if loss_type is softmax, output_dim is n_vocab of waveform magnitude.
# if loss_type is mog, output_dim is 3 * gaussian, (weight, mean and stddev)
self.proj3 = nn.utils.weight_norm(
nn.Linear(skip_channels, output_dim), dim=1)
self.loss_type = loss_type
self.output_dim = output_dim
self.input_dim = 1
self.skip_channels = skip_channels
self.log_scale_min = log_scale_min
def forward(self, x, condition=None):
"""Forward pass of ``WaveNet``.
Parameters
-----------
x : Tensor [shape=(B, T)]
The input waveform.
condition : Tensor, optional [shape=(B, C_cond, T)]
the upsampled condition. Defaults to None.
Returns
-------
Tensor: [shape=(B, T, C_output)]
The parameters of the output distributions.
"""
# Causal Conv
if self.loss_type == "softmax":
x = paddle.clip(x, min=-1., max=0.99999)
x = quantize(x, self.output_dim)
x = self.embed(x) # (B, T, C)
else:
x = paddle.unsqueeze(x, -1) # (B, T, 1)
x = self.embed(x) # (B, T, C)
x = paddle.transpose(x, perm=[0, 2, 1]) # (B, C, T)
# Residual & skip-connection & linears
z = self.resnet(x, condition)
z = paddle.transpose(z, [0, 2, 1])
z = F.relu(self.proj2(F.relu(self.proj1(z))))
y = self.proj3(z)
return y
def start_sequence(self):
"""Prepare the WaveNet to generate a new sequence. This method should
be called before starting calling ``add_input`` multiple times.
"""
self.resnet.start_sequence()
def add_input(self, x, condition=None):
"""Compute the output distribution (represented by its parameters) for
a step. It works similarily with the ``forward`` method but in a
``step-in-step-out`` fashion.
Parameters
-----------
x : Tensor [shape=(B,)]
A step of the input waveform.
condition : Tensor, optional [shape=(B, C_cond)]
A step of the upsampled condition. Defaults to None.
Returns
--------
Tensor: [shape=(B, C_output)]
A step of the parameters of the output distributions.
"""
# Causal Conv
if self.loss_type == "softmax":
x = paddle.clip(x, min=-1., max=0.99999)
x = quantize(x, self.output_dim)
x = self.embed(x) # (B, C)
else:
x = paddle.unsqueeze(x, -1) # (B, 1)
x = self.embed(x) # (B, C)
# Residual & skip-connection & linears
z = self.resnet.add_input(x, condition)
z = F.relu(self.proj2(F.relu(self.proj1(z)))) # (B, C)
# Output
y = self.proj3(z)
return y
def compute_softmax_loss(self, y, t):
"""Compute the loss when output distributions are categorial
distributions.
Parameters
----------
y : Tensor [shape=(B, T, C_output)]
The logits of the output distributions.
t : Tensor [shape=(B, T)]
The target audio. The audio is first quantized then used as the
target.
Notes
-------
Output distributions whose input contains padding are neglected in
loss computation, so the first ``context_size`` steps do not
contribute to the loss.
Returns
--------
Tensor: [shape=(1,)]
The loss.
"""
# context size is not taken into account
y = y[:, self.context_size:, :]
t = t[:, self.context_size:]
t = paddle.clip(t, min=-1.0, max=0.99999)
quantized = quantize(t, n_bands=self.output_dim)
label = paddle.unsqueeze(quantized, -1)
loss = F.softmax_with_cross_entropy(y, label)
reduced_loss = paddle.mean(loss)
return reduced_loss
def sample_from_softmax(self, y):
"""Sample from the output distribution when the output distributions
are categorical distributions.
Parameters
----------
y : Tensor [shape=(B, T, C_output)]
The logits of the output distributions.
Returns
--------
Tensor [shape=(B, T)]
Waveform sampled from the output distribution.
"""
# dequantize
batch_size, time_steps, output_dim, = y.shape
y = paddle.reshape(y, (batch_size * time_steps, output_dim))
prob = F.softmax(y)
quantized = paddle.fluid.layers.sampling_id(prob)
samples = dequantize(quantized, n_bands=self.output_dim)
samples = paddle.reshape(samples, (batch_size, -1))
return samples
def compute_mog_loss(self, y, t):
"""Compute the loss where output distributions is a mixture of
Gaussians distributions.
Parameters
-----------
y : Tensor [shape=(B, T, C_output)]
The parameters of the output distribution. It is the concatenation
of 3 parts, the logits of every distribution, the mean of each
distribution and the log standard deviation of each distribution.
Each part's shape is (B, T, n_mixture), where ``n_mixture`` means
the number of Gaussians in the mixture.
t : Tensor [shape=(B, T)]
The target audio.
Notes
-------
Output distributions whose input contains padding are neglected in
loss computation, so the first ``context_size`` steps do not
contribute to the loss.
Returns
--------
Tensor: [shape=(1,)]
The loss.
"""
n_mixture = self.output_dim // 3
# context size is not taken into account
y = y[:, self.context_size:, :]
t = t[:, self.context_size:]
w, mu, log_std = paddle.split(y, 3, axis=2)
# 100.0 is just a large float
log_std = paddle.clip(log_std, min=self.log_scale_min, max=100.)
inv_std = paddle.exp(-log_std)
p_mixture = F.softmax(w, -1)
t = paddle.unsqueeze(t, -1)
if n_mixture > 1:
# t = F.expand_as(t, log_std)
t = paddle.expand(t, [-1, -1, n_mixture])
x_std = inv_std * (t - mu)
exponent = paddle.exp(-0.5 * x_std * x_std)
pdf_x = 1.0 / math.sqrt(2.0 * math.pi) * inv_std * exponent
pdf_x = p_mixture * pdf_x
# pdf_x: [bs, len]
pdf_x = paddle.sum(pdf_x, -1)
per_sample_loss = -paddle.log(pdf_x + 1e-9)
loss = paddle.mean(per_sample_loss)
return loss
def sample_from_mog(self, y):
"""Sample from the output distribution when the output distribution
is a mixture of Gaussian distributions.
Parameters
------------
y : Tensor [shape=(B, T, C_output)]
The parameters of the output distribution. It is the concatenation
of 3 parts, the logits of every distribution, the mean of each
distribution and the log standard deviation of each distribution.
Each part's shape is (B, T, n_mixture), where ``n_mixture`` means
the number of Gaussians in the mixture.
Returns
--------
Tensor: [shape=(B, T)]
Waveform sampled from the output distribution.
"""
batch_size, time_steps, output_dim = y.shape
n_mixture = output_dim // 3
w, mu, log_std = paddle.split(y, 3, -1)
reshaped_w = paddle.reshape(w, (batch_size * time_steps, n_mixture))
prob_ids = paddle.fluid.layers.sampling_id(F.softmax(reshaped_w))
prob_ids = paddle.reshape(prob_ids, (batch_size, time_steps))
prob_ids = prob_ids.numpy()
# gather the chosen component's mean and log_std at each (batch, time) position
index = np.array([[[b, t, prob_ids[b, t]] for t in range(time_steps)]
for b in range(batch_size)]).astype("int32")
index_var = paddle.to_tensor(index)
mu_ = paddle.gather_nd(mu, index_var)
log_std_ = paddle.gather_nd(log_std, index_var)
dist = D.Normal(mu_, paddle.exp(log_std_))
samples = dist.sample(shape=[])
samples = paddle.clip(samples, min=-1., max=1.)
return samples
def sample(self, y):
"""Sample from the output distribution.
Parameters
----------
y : Tensor [shape=(B, T, C_output)]
The parameters of the output distribution.
Returns
--------
Tensor [shape=(B, T)]
Waveform sampled from the output distribution.
"""
if self.loss_type == "softmax":
return self.sample_from_softmax(y)
else:
return self.sample_from_mog(y)
def loss(self, y, t):
"""Compute the loss given the output distribution and the target.
Parameters
----------
y : Tensor [shape=(B, T, C_output)]
The parameters of the output distribution.
t : Tensor [shape=(B, T)]
The target audio.
Returns
---------
Tensor: [shape=(1,)]
The loss.
"""
if self.loss_type == "softmax":
return self.compute_softmax_loss(y, t)
else:
return self.compute_mog_loss(y, t)
class ConditionalWaveNet(nn.Layer):
r"""Conditional Wavenet. An implementation of
`WaveNet: A Generative Model for Raw Audio <http://arxiv.org/abs/1609.03499>`_.
It contains an UpsampleNet as the encoder and a WaveNet as the decoder.
It is an autoregressive model that generates raw audio.
Parameters
----------
upsample_factors : List[int]
The upsampling factors of the UpsampleNet.
n_stack : int
Number of convolution stacks in the WaveNet.
n_loop : int
Number of convolution layers in a convolution stack.
Convolution layers in a stack have exponentially growing dilations,
from 1 to :math:`2^{n_{loop} - 1}`.
residual_channels : int
Feature size of each ResidualBlocks.
output_dim : int
Feature size of the output. See ``loss_type`` for details.
n_mels : int
The number of bands of mel spectrogram.
filter_size : int, optional
Convolution kernel size of each ResidualBlock, by default 2.
loss_type : str, optional ["mog" or "softmax"]
The output type and loss type of the model, by default "mog".
If "softmax", the model input should be quantized audio and the model
outputs a discrete distribution.
If "mog", the model input is audio in floating point format, and the
model outputs parameters for a mixture of gaussian distributions.
Namely, the weight, mean and logscale of each gaussian distribution.
Thus, the ``output_size`` should be a multiple of 3.
log_scale_min : float, optional
Minimum value of the log scale of gaussian distributions, by default
-9.0.
This is only used for computing loss when ``loss_type`` is "mog", If
the predicted log scale is less than -9.0, it is clipped at -9.0.
"""
def __init__(self,
upsample_factors: List[int],
n_stack: int,
n_loop: int,
residual_channels: int,
output_dim: int,
n_mels: int,
filter_size: int=2,
loss_type: str="mog",
log_scale_min: float=-9.0):
super(ConditionalWaveNet, self).__init__()
self.encoder = UpsampleNet(upsample_factors)
self.decoder = WaveNet(
n_stack=n_stack,
n_loop=n_loop,
residual_channels=residual_channels,
output_dim=output_dim,
condition_dim=n_mels,
filter_size=filter_size,
loss_type=loss_type,
log_scale_min=log_scale_min)
def forward(self, audio, mel, audio_start):
"""Compute the output distribution given the mel spectrogram and the input(for teacher force training).
Parameters
-----------
audio : Tensor [shape=(B, T_audio)]
Ground truth waveform, used for teacher-forced training.
mel : Tensor [shape(B, F, T_mel)]
Mel spectrogram. Note that it is the spectrogram for the whole
utterance.
audio_start : Tensor [shape=(B,), dtype: int]
Audio slices' start positions for each utterance.
Returns
----------
Tensor [shape(B, T_audio - 1, C_output)]
Parameters for the output distribution, where ``C_output`` is the
``output_dim`` of the decoder.
"""
audio_length = audio.shape[1] # audio clip's length
condition = self.encoder(mel)
condition_slice = crop(condition, audio_start, audio_length)
# shifting 1 step
audio = audio[:, :-1]
condition_slice = condition_slice[:, :, 1:]
y = self.decoder(audio, condition_slice)
return y
def loss(self, y, t):
"""Compute loss with respect to the output distribution and the target
audio.
Parameters
-----------
y : Tensor [shape=(B, T - 1, C_output)]
Parameters of the output distribution.
t : Tensor [shape(B, T)]
target waveform.
Returns
--------
Tensor: [shape=(1,)]
the loss.
"""
t = t[:, 1:]
loss = self.decoder.loss(y, t)
return loss
def sample(self, y):
"""Sample from the output distribution.
Parameters
-----------
y : Tensor [shape=(B, T, C_output)]
Parameters of the output distribution.
Returns
--------
Tensor [shape=(B, T)]
Sampled waveform from the output distribution.
"""
samples = self.decoder.sample(y)
return samples
@paddle.no_grad()
def infer(self, mel):
r"""Synthesize waveform from mel spectrogram.
Parameters
-----------
mel : Tensor [shape=(B, F, T)]
The condition (mel spectrogram here).
Returns
-----------
Tensor [shape=(B, T \* upscale_factor)]
Synthesized waveform.
``upscale_factor`` is the ``upscale_factor`` of the encoder
``UpsampleNet``.
"""
condition = self.encoder(mel)
batch_size, _, time_steps = condition.shape
samples = []
self.decoder.start_sequence()
x_t = paddle.zeros((batch_size, ), dtype=mel.dtype)
for i in trange(time_steps):
c_t = condition[:, :, i] # (B, C)
y_t = self.decoder.add_input(x_t, c_t) #(B, C)
y_t = paddle.unsqueeze(y_t, 1)
x_t = self.sample(y_t) # (B, 1)
x_t = paddle.squeeze(x_t, 1) #(B,)
samples.append(x_t)
samples = paddle.stack(samples, -1)
return samples
@paddle.no_grad()
def predict(self, mel):
r"""Synthesize audio from mel spectrogram.
The output and input are numpy arrays without batch.
Parameters
----------
mel : np.ndarray [shape=(C, T)]
Mel spectrogram of an utterance.
Returns
-------
Tensor : np.ndarray [shape=(C, T \* upsample_factor)]
The synthesized waveform of an utterance.
"""
mel = paddle.to_tensor(mel)
mel = paddle.unsqueeze(mel, 0)
audio = self.infer(mel)
audio = audio[0].numpy()
return audio
@classmethod
def from_pretrained(cls, config, checkpoint_path):
"""Build a ConditionalWaveNet model from a pretrained model.
Parameters
----------
config: yacs.config.CfgNode
model configs
checkpoint_path: Path or str
the path of pretrained model checkpoint, without extension name
Returns
-------
ConditionalWaveNet
The model built from pretrained result.
"""
model = cls(upsample_factors=config.model.upsample_factors,
n_stack=config.model.n_stack,
n_loop=config.model.n_loop,
residual_channels=config.model.residual_channels,
output_dim=config.model.output_dim,
n_mels=config.data.n_mels,
filter_size=config.model.filter_size,
loss_type=config.model.loss_type,
log_scale_min=config.model.log_scale_min)
layer_tools.summary(model)
checkpoint.load_parameters(model, checkpoint_path=checkpoint_path)
return model