dv3 reloaded, back to the origin

This commit is contained in:
chenfeiyu 2020-07-10 20:22:43 +08:00
parent 24eb14a718
commit 282c36c2c1
24 changed files with 1649 additions and 2995 deletions


@ -22,151 +22,118 @@ The model consists of an encoder, a decoder and a converter (and a speaker embed
## Project Structure
```text
├── data.py data processing
├── model.py function to create model, criterion and optimizer
├── configs/ (example) configuration files
├── sentences.txt sample sentences
├── synthesis.py script to synthesize waveform from text
├── train.py script to train a model
└── utils.py utility functions
├── config/ (example) configuration files
├── synthesize.py script to synthesize waveform from text
├── data.py dataset and batching for the preprocessed data
├── preprocess.py script to preprocess the LJSpeech dataset
├── clip.py gradient clipping by value and by global norm
├── train.py script to train the model
└── vocoder.py vocoder wrappers used in training and synthesis
```
## Saving & Loading
`train.py` and `synthesis.py` have 3 arguments in common: `--checkpoint`, `--iteration` and `output`.
## Preprocess
1. `output` is the directory for saving results.
During training, checkpoints are saved in `checkpoints/` inside `output`, and the tensorboard log is saved in `log/` inside `output`. Training states, including alignment plots, spectrogram plots and generated audio files, are saved in `states/` inside `output`. In addition, we periodically evaluate the model with several given sentences; the alignment plots and generated audio files are saved in `eval/` inside `output`.
During synthesis, audio files and alignment plots are saved in `synthesis/` inside `output`.
So after training and synthesizing with the same output directory, its file structure looks like this.
Preprocess the dataset with `preprocess.py`.
```text
├── checkpoints/ # checkpoint directory (including *.pdparams, *.pdopt and a text file `checkpoint` that records the latest checkpoint)
├── states/ # alignment plots, spectrogram plots and generated wavs at training
├── log/ # tensorboard log
├── eval/ # audio files and alignment plots generated at evaluation during training
└── synthesis/ # synthesized audio files and alignment plots
usage: preprocess.py [-h] --config CONFIG --input INPUT --output OUTPUT
preprocess ljspeech dataset and save it.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file
--input INPUT data path of the original data
--output OUTPUT path to save the preprocessed dataset
```
2. `--checkpoint` and `--iteration` are used to load from an existing checkpoint. Checkpoint loading follows these rules:
If `--checkpoint` is provided, the checkpoint at the path specified by `--checkpoint` is loaded.
If `--checkpoint` is not provided, we try to load the checkpoint specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided either, we try to load the latest checkpoint from the checkpoint directory.
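A minimal sketch of this rule, using `load_parameters` from `parakeet.utils.io` the same way `train.py` and `synthesis.py` call it (the helper and argument names here are illustrative):
```python
from parakeet.utils.io import load_parameters

def restore(model, optim, args, checkpoint_dir):
    # --checkpoint takes priority: load exactly that path
    if args.checkpoint is not None:
        return load_parameters(model, optim, checkpoint_path=args.checkpoint)
    # otherwise fall back to --iteration; when it is None too,
    # load_parameters picks the latest checkpoint in checkpoint_dir
    return load_parameters(model, optim,
                           checkpoint_dir=checkpoint_dir,
                           iteration=args.iteration)
```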
example code:
```bash
python preprocess.py --config=configs/ljspeech.yaml --input=LJSpeech-1.1/ --output=data/ljspeech
```
## Train
Train the model using `train.py`; follow the usage displayed by `python train.py --help`.
```text
usage: train.py [-h] [--config CONFIG] [--data DATA] [--device DEVICE]
[--checkpoint CHECKPOINT | --iteration ITERATION]
output
usage: train.py [-h] --config CONFIG --input INPUT
Train a Deep Voice 3 model with LJSpeech dataset.
positional arguments:
output path to save results
train a Deep Voice 3 model with LJSpeech
optional arguments:
-h, --help show this help message and exit
--config CONFIG experiment config
--data DATA The path of the LJSpeech dataset.
--device DEVICE device to use
--checkpoint CHECKPOINT checkpoint to resume from.
--iteration ITERATION the iteration of the checkpoint to load from output directory
-h, --help show this help message and exit
--config CONFIG config file
--input INPUT data path of the original data
```
- `--config` is the configuration file to use. The provided `ljspeech.yaml` can be used directly, or you can change values in it to train the model with a different configuration.
- `--data` is the path of the LJSpeech dataset, i.e. the folder extracted from the downloaded archive (the folder which contains `metadata.csv`).
- `--device` is the device (gpu id) to use for training. `-1` means CPU.
- `--checkpoint` is the path of the checkpoint.
- `--iteration` is the iteration of the checkpoint to load from output directory.
See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
- `output` is the directory to save results, all results are saved in this directory. The structure of the output directory is shown below.
example code:
```bash
CUDA_VISIBLE_DEVICES=0 python train.py --config=configs/ljspeech.yaml --input=data/ljspeech
```
It creates a `runs` folder; outputs for each run are saved in a separate folder in `runs`, whose name is the start time joined with the hostname. Inside this folder, the tensorboard log, parameters and optimizer states are saved. Parameters (`*.pdparams`) and optimizer states (`*.pdopt`) are named by the step at which they are saved.
```text
├── checkpoints # checkpoint
├── log # tensorboard log
└── states # train and evaluation results
├── alignments # attention
├── lin_spec # linear spectrogram
├── mel_spec # mel spectrogram
└── waveform # waveform (.wav files)
runs/Jul07_09-39-34_instance-mqcyj27y-4/
├── checkpoint
├── events.out.tfevents.1594085974.instance-mqcyj27y-4
├── step-1000000.pdopt
├── step-1000000.pdparams
├── step-100000.pdopt
├── step-100000.pdparams
...
```
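If you need to pick up the most recent checkpoint in such a run folder programmatically, here is a small sketch (the helper name is illustrative; it only relies on the `step-*.pdparams` naming shown above):
```python
import os
import re

def latest_step(run_dir):
    """Return the largest step number among the step-*.pdparams files in run_dir."""
    steps = []
    for name in os.listdir(run_dir):
        m = re.match(r"step-(\d+)\.pdparams$", name)
        if m:
            steps.append(int(m.group(1)))
    return max(steps) if steps else None

# e.g. latest_step("runs/Jul07_09-39-34_instance-mqcyj27y-4") -> 1000000
```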
Example script:
Since we use WaveFlow to synthesize audio during training, download the pretrained WaveFlow model and extract it into the current directory before training.
```bash
python train.py \
--config=configs/ljspeech.yaml \
--data=./LJSpeech-1.1/ \
--device=0 \
experiment
wget https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_1.0.zip
unzip waveflow_res128_ljspeech_ckpt_1.0.zip
```
To train the model in parallel on multiple GPUs, launch the training script with `paddle.distributed.launch`. For example, to train with GPUs `0,1,2,3`, you can use the example script below. Note that for parallel training, devices are specified with `--selected_gpus` passed to `paddle.distributed.launch`; in this case, `--device` passed to `train.py`, if specified, is ignored.
Example script:
## Visualization
You can visualize training losses, check the attention plots, and listen to audio synthesized with teacher forcing during training.
example code:
```bash
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 \
train.py \
--config=configs/ljspeech.yaml \
--data=./LJSpeech-1.1/ \
experiment
```
You can monitor the training log via tensorboard, using the script below.
```bash
cd experiment/log
tensorboard --logdir=.
tensorboard --logdir=runs/ --host=$HOSTNAME --port=8000
```
## Synthesis
```text
usage: synthesis.py [-h] [--config CONFIG] [--device DEVICE]
[--checkpoint CHECKPOINT | --iteration ITERATION]
text output
Synthesize waveform with a checkpoint.
positional arguments:
text text file to synthesize
output path to save synthesized audio
usage: synthesize from a checkpoint [-h] --config CONFIG --input INPUT
--output OUTPUT --checkpoint CHECKPOINT
--monotonic_layers MONOTONIC_LAYERS
optional arguments:
-h, --help show this help message and exit
--config CONFIG experiment config
--device DEVICE device to use
--checkpoint CHECKPOINT checkpoint to resume from
--iteration ITERATION the iteration of the checkpoint to load from output directory
-h, --help show this help message and exit
--config CONFIG config file
--input INPUT text file to synthesize
--output OUTPUT path to save audio
--checkpoint CHECKPOINT
data path of the checkpoint
--monotonic_layers MONOTONIC_LAYERS
monotonic decoder layers, index starts from 1
```
- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
- `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
`synthesize.py` is used to synthesize several sentences in a text file.
`--monotonic_layers` gives the indices of the decoder layers that show monotonic diagonal attention. You can find them by inspecting the tensorboard logs; mind that the indices start from 1. For a given model, the layers with monotonic diagonal attention are stable across training and synthesis, but they differ between runs, so once you have identified them from the tensorboard log you can reuse them at synthesis time. Note that only decoder layers that show strong diagonal attention should be considered (see the short sketch after this argument list).
- `--checkpoint` is the path of the checkpoint.
- `--iteration` is the iteration of the checkpoint to load from output directory.
See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
- `text` is the text file to synthesize.
- `output` is the directory to save results. The generated audio files (`*.wav`) and attention plots (`*.png`) are saved in `synthesis/` in the output directory.
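The sketch below mirrors how `synthesize.py` (shown later in this commit) turns the 1-based `--monotonic_layers` string into the per-layer `force_monotonic_attention` flags; the function name is illustrative:
```python
def parse_monotonic_layers(monotonic_layers, decoder_layers):
    """e.g. "5,6" with 8 decoder layers -> [F, F, F, F, T, T, F, F]."""
    # command-line indices start from 1, so shift them to 0-based
    indices = [int(item.strip()) - 1 for item in monotonic_layers.split(',')]
    flags = [False] * decoder_layers
    for i in indices:
        flags[i] = True
    return flags
```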
Example script:
example code:
```bash
python synthesis.py \
--config=configs/ljspeech.yaml \
--device=0 \
--checkpoint="experiment/checkpoints/model_step_005000000" \
sentences.txt experiment
```
or
```bash
python synthesis.py \
--config=configs/ljspeech.yaml \
--device=0 \
--iteration=005000000 \
sentences.txt experiment
CUDA_VISIBLE_DEVICES=2 python synthesize.py \
--config configs/ljspeech.yaml \
--input sentences.txt \
--output outputs/ \
--checkpoint runs/Jul07_09-39-34_instance-mqcyj27y-4/step-1320000 \
--monotonic_layers "5,6"
```

examples/deepvoice3/clip.py Normal file

@ -0,0 +1,181 @@
from __future__ import print_function
import copy
import six
import warnings
import functools
from paddle.fluid import layers
from paddle.fluid import framework
from paddle.fluid import core
from paddle.fluid import name_scope
from paddle.fluid.dygraph import base as imperative_base
from paddle.fluid.clip import GradientClipBase, _correct_clip_op_role_var
class DoubleClip(GradientClipBase):
"""
Clip gradients first by value (``clip_value``) and then by global norm (``clip_norm``).
Adapted from ``paddle.fluid.clip.GradientClipByGlobalNorm``; the global-norm step works as described below.
Given a list of Tensor :math:`t\_list` , calculate the global norm for the elements of all tensors in
:math:`t\_list` , and limit it to ``clip_norm`` .
- If the global norm is greater than ``clip_norm`` , all elements of :math:`t\_list` will be compressed by a ratio.
- If the global norm is less than or equal to ``clip_norm`` , nothing will be done.
The list of Tensor :math:`t\_list` is not passed from this class, but the gradients of all parameters in ``Program`` . If ``need_clip``
is not None, then only part of gradients can be selected for gradient clipping.
Gradient clip will take effect after being set in ``optimizer`` , see the document ``optimizer``
(for example: :ref:`api_fluid_optimizer_SGDOptimizer`).
The clipping formula is:
.. math::
t\_list[i] = t\_list[i] * \\frac{clip\_norm}{\max(global\_norm, clip\_norm)}
where:
.. math::
global\_norm = \sqrt{\sum_{i=0}^{N-1}(l2norm(t\_list[i]))^2}
Args:
clip_norm (float): The maximum norm value.
group_name (str, optional): The group name for this clip. Default value is ``default_group``
need_clip (function, optional): Type: function. This function accepts a ``Parameter`` and returns ``bool``
(True: the gradient of this ``Parameter`` need to be clipped, False: not need). Default: None,
and gradients of all parameters in the network will be clipped.
Examples:
.. code-block:: python
# use for Static mode
import paddle
import paddle.fluid as fluid
import numpy as np
main_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(
main_program=main_prog, startup_program=startup_prog):
image = fluid.data(
name='x', shape=[-1, 2], dtype='float32')
predict = fluid.layers.fc(input=image, size=3, act='relu') # Trainable parameters: fc_0.w.0, fc_0.b.0
loss = fluid.layers.mean(predict)
# Clip all parameters in network:
clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)
# Clip a part of parameters in network: (e.g. fc_0.w_0)
# pass a function (filter_func) to need_clip; filter_func receives a Parameter and returns bool
# def filter_func(Parameter):
# # It can be easily filtered by Parameter.name (name can be set in fluid.ParamAttr, and the default name is fc_0.w_0, fc_0.b_0)
# return Parameter.name=="fc_0.w_0"
# clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0, need_clip=filter_func)
sgd_optimizer = fluid.optimizer.SGDOptimizer(learning_rate=0.1, grad_clip=clip)
sgd_optimizer.minimize(loss)
place = fluid.CPUPlace()
exe = fluid.Executor(place)
x = np.random.uniform(-100, 100, (10, 2)).astype('float32')
exe.run(startup_prog)
out = exe.run(main_prog, feed={'x': x}, fetch_list=loss)
# use for Dygraph mode
import paddle
import paddle.fluid as fluid
with fluid.dygraph.guard():
linear = fluid.dygraph.Linear(10, 10) # Trainable: linear_0.w.0, linear_0.b.0
inputs = fluid.layers.uniform_random([32, 10]).astype('float32')
out = linear(fluid.dygraph.to_variable(inputs))
loss = fluid.layers.reduce_mean(out)
loss.backward()
# Clip all parameters in network:
clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)
# Clip a part of parameters in network: (e.g. linear_0.w_0)
# pass a function (filter_func) to need_clip; filter_func receives a ParamBase and returns bool
# def filter_func(ParamBase):
# # It can be easily filtered by ParamBase.name (name can be set in fluid.ParamAttr, and the default name is linear_0.w_0, linear_0.b_0)
# return ParamBase.name == "linear_0.w_0"
# # Note: linear.weight and linear.bias can return the weight and bias of dygraph.Linear, respectively, and can be used to filter
# return ParamBase.name == linear.weight.name
# clip = fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0, need_clip=filter_func)
sgd_optimizer = fluid.optimizer.SGD(
learning_rate=0.1, parameter_list=linear.parameters(), grad_clip=clip)
sgd_optimizer.minimize(loss)
"""
def __init__(self, clip_value, clip_norm, group_name="default_group", need_clip=None):
super(DoubleClip, self).__init__(need_clip)
self.clip_value = float(clip_value)
self.clip_norm = float(clip_norm)
self.group_name = group_name
def __str__(self):
return "Gradient Clip By Value and GlobalNorm, value={}, global_norm={}".format(
self.clip_value, self.clip_norm)
@imperative_base.no_grad
def _dygraph_clip(self, params_grads):
params_and_grads = []
# clip by value first
for p, g in params_grads:
if g is None:
continue
if self._need_clip_func is not None and not self._need_clip_func(p):
params_and_grads.append((p, g))
continue
new_grad = layers.clip(x=g, min=-self.clip_value, max=self.clip_value)
params_and_grads.append((p, new_grad))
params_grads = params_and_grads
# clip by global norm
params_and_grads = []
sum_square_list = []
for p, g in params_grads:
if g is None:
continue
if self._need_clip_func is not None and not self._need_clip_func(p):
continue
merge_grad = g
if g.type == core.VarDesc.VarType.SELECTED_ROWS:
merge_grad = layers.merge_selected_rows(g)
merge_grad = layers.get_tensor_from_selected_rows(merge_grad)
square = layers.square(merge_grad)
sum_square = layers.reduce_sum(square)
sum_square_list.append(sum_square)
# all parameters have been filtered out
if len(sum_square_list) == 0:
return params_grads
global_norm_var = layers.concat(sum_square_list)
global_norm_var = layers.reduce_sum(global_norm_var)
global_norm_var = layers.sqrt(global_norm_var)
max_global_norm = layers.fill_constant(
shape=[1], dtype='float32', value=self.clip_norm)
clip_var = layers.elementwise_div(
x=max_global_norm,
y=layers.elementwise_max(
x=global_norm_var, y=max_global_norm))
for p, g in params_grads:
if g is None:
continue
if self._need_clip_func is not None and not self._need_clip_func(p):
params_and_grads.append((p, g))
continue
new_grad = layers.elementwise_mul(x=g, y=clip_var)
params_and_grads.append((p, new_grad))
return params_and_grads
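For reference, `train.py` later in this commit attaches this clipper to an Adam optimizer roughly as follows (a sketch; `clip_value` and `clip_norm` come from the configuration file, e.g. 5.0 and 100.0 in `ljspeech.yaml`):
```python
from paddle import fluid
from clip import DoubleClip

def create_optimizer(model, config):
    # clip gradients element-wise by value first, then rescale by global norm
    clip = DoubleClip(config["clip_value"], config["clip_norm"])
    return fluid.optimizer.Adam(config["learning_rate"],
                                parameter_list=model.parameters(),
                                grad_clip=clip)
```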


@ -1,90 +1,45 @@
meta_data:
min_text_length: 20
# data processing
p_pronunciation: 0.99
sample_rate: 22050 # Hz
n_fft: 1024
win_length: 1024
hop_length: 256
n_mels: 80
reduction_factor: 4
transform:
# text
replace_pronunciation_prob: 0.5
# model-s2s
n_speakers: 1
speaker_dim: 16
char_dim: 256
encoder_dim: 64
kernel_size: 5
encoder_layers: 7
decoder_layers: 8
prenet_sizes: [128]
attention_dim: 128
# spectrogram
sample_rate: 22050
max_norm: 0.999
preemphasis: 0.97
n_fft: 1024
win_length: 1024
hop_length: 256
# model-postnet
postnet_layers: 5
postnet_dim: 256
# mel
fmin: 125
fmax: 7600
n_mels: 80
# position embedding
position_weight: 1.0
position_rate: 5.54
forward_step: 4
backward_step: 0
# db scale
min_level_db: -100
ref_level_db: 20
clip_norm: true
dropout: 0.05
# output-griffinlim
sharpening_factor: 1.4
loss:
masked_loss_weight: 0.5
priority_freq: 3000
priority_freq_weight: 0.0
binary_divergence_weight: 0.1
guided_attention_sigma: 0.2
# optimizer:
learning_rate: 0.001
clip_value: 5.0
clip_norm: 100.0
synthesis:
max_steps: 512
power: 1.4
n_iter: 32
model:
# speaker_embedding
n_speakers: 1
speaker_embed_dim: 16
speaker_embedding_weight_std: 0.01
max_positions: 512
dropout: 0.050000000000000044
# encoder
text_embed_dim: 256
embedding_weight_std: 0.1
freeze_embedding: false
padding_idx: 0
encoder_channels: 512
# decoder
query_position_rate: 1.0
key_position_rate: 1.29
trainable_positional_encodings: false
kernel_size: 3
decoder_channels: 256
downsample_factor: 4
outputs_per_step: 1
# attention
key_projection: true
value_projection: true
force_monotonic_attention: true
window_backward: -1
window_ahead: 3
use_memory_mask: true
# converter
use_decoder_state_for_postnet_input: true
converter_channels: 256
optimizer:
beta1: 0.5
beta2: 0.9
epsilon: 1e-6
lr_scheduler:
warmup_steps: 4000
peak_learning_rate: 5e-4
train:
batch_size: 16
max_iteration: 2000000
snap_interval: 1000
eval_interval: 10000
save_interval: 10000
# training:
batch_size: 16
report_interval: 10000
save_interval: 10000
valid_size: 5
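The new configuration is flat, so scripts read its keys directly after `yaml.safe_load`, as `preprocess.py` and `train.py` below do. A minimal sketch (the path assumes the example configuration used in the commands above):
```python
from ruamel import yaml

with open("configs/ljspeech.yaml", "rt") as f:
    config = yaml.safe_load(f)

# flat keys, e.g. the ones preprocess.py prints at startup
print(config["sample_rate"], config["n_fft"], config["reduction_factor"])
```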


@ -1,257 +1,110 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
import os
import csv
from pathlib import Path
import numpy as np
from paddle import fluid
import pandas as pd
import librosa
from scipy import signal
import paddle.fluid.dygraph as dg
import paddle
from paddle import fluid
from paddle.fluid import dygraph as dg
from paddle.fluid.dataloader import Dataset, BatchSampler
from paddle.fluid.io import DataLoader
from parakeet.g2p.en import text_to_sequence, sequence_to_text
from parakeet.data import DatasetMixin, TransformDataset, FilterDataset, CacheDataset
from parakeet.data import DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler, BucketSampler
from parakeet.data import DatasetMixin, DataCargo, PartialyRandomizedSimilarTimeLengthSampler
from parakeet.g2p import en
class LJSpeechMetaData(DatasetMixin):
class LJSpeech(DatasetMixin):
def __init__(self, root):
self.root = Path(root)
self._wav_dir = self.root.joinpath("wavs")
csv_path = self.root.joinpath("metadata.csv")
self._root = root
self._table = pd.read_csv(
csv_path,
sep="|",
encoding="utf-8",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
os.path.join(root, "metadata.csv"),
sep="|",
encoding="utf-8",
quoting=csv.QUOTE_NONE,
header=None,
names=["num_frames", "spec_name", "mel_name", "text"],
dtype={"num_frames": np.int64, "spec_name": str, "mel_name":str, "text":str})
def num_frames(self):
return self._table["num_frames"].to_list()
def get_example(self, i):
fname, raw_text, normalized_text = self._table.iloc[i]
fname = str(self._wav_dir.joinpath(fname + ".wav"))
return fname, raw_text, normalized_text
"""
spec (T_frame, C_spec)
mel (T_frame, C_mel)
"""
num_frames, spec_name, mel_name, text = self._table.iloc[i]
spec = np.load(os.path.join(self._root, spec_name))
mel = np.load(os.path.join(self._root, mel_name))
return (text, spec, mel, num_frames)
def __len__(self):
return len(self._table)
class Transform(object):
def __init__(self,
replace_pronunciation_prob=0.,
sample_rate=22050,
preemphasis=.97,
n_fft=1024,
win_length=1024,
hop_length=256,
fmin=125,
fmax=7600,
n_mels=80,
min_level_db=-100,
ref_level_db=20,
max_norm=0.999,
clip_norm=True):
self.replace_pronunciation_prob = replace_pronunciation_prob
self.sample_rate = sample_rate
self.preemphasis = preemphasis
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.fmin = fmin
self.fmax = fmax
self.n_mels = n_mels
self.min_level_db = min_level_db
self.ref_level_db = ref_level_db
self.max_norm = max_norm
self.clip_norm = clip_norm
def __call__(self, in_data):
fname, _, normalized_text = in_data
# text processing
mix_grapheme_phonemes = text_to_sequence(
normalized_text, self.replace_pronunciation_prob)
text_length = len(mix_grapheme_phonemes)
# CAUTION: positions start from 1
speaker_id = None
# wave processing
wav, _ = librosa.load(fname, sr=self.sample_rate)
# preemphasis
y = signal.lfilter([1., -self.preemphasis], [1.], wav)
# STFT
D = librosa.stft(
y=y,
n_fft=self.n_fft,
win_length=self.win_length,
hop_length=self.hop_length)
S = np.abs(D)
# to db and normalize to 0-1
amplitude_min = np.exp(self.min_level_db / 20 * np.log(10)) # 1e-5
S_norm = 20 * np.log10(np.maximum(amplitude_min,
S)) - self.ref_level_db
S_norm = (S_norm - self.min_level_db) / (-self.min_level_db)
S_norm = self.max_norm * S_norm
if self.clip_norm:
S_norm = np.clip(S_norm, 0, self.max_norm)
# mel scale and to db and normalize to 0-1,
# CAUTION: pass linear scale S, not dbscaled S
S_mel = librosa.feature.melspectrogram(
S=S, n_mels=self.n_mels, fmin=self.fmin, fmax=self.fmax, power=1.)
S_mel = 20 * np.log10(np.maximum(amplitude_min,
S_mel)) - self.ref_level_db
S_mel_norm = (S_mel - self.min_level_db) / (-self.min_level_db)
S_mel_norm = self.max_norm * S_mel_norm
if self.clip_norm:
S_mel_norm = np.clip(S_mel_norm, 0, self.max_norm)
# num_frames
n_frames = S_mel_norm.shape[-1] # CAUTION: original number of frames
return (mix_grapheme_phonemes, text_length, speaker_id, S_norm.T,
S_mel_norm.T, n_frames)
class DataCollector(object):
def __init__(self, downsample_factor=4, r=1):
self.downsample_factor = int(downsample_factor)
self.frames_per_step = int(r)
self._factor = int(downsample_factor * r)
# CAUTION: small diff here
self._pad_begin = int(downsample_factor * r)
def __init__(self, p_pronunciation):
self.p_pronunciation = p_pronunciation
def __call__(self, examples):
batch_size = len(examples)
"""
output shape and dtype
(B, T_text) int64
(B,) int64
(B, T_frame, C_spec) float32
(B, T_frame, C_mel) float32
(B,) int64
"""
text_seqs = []
specs = []
mels = []
num_frames = np.array([example[3] for example in examples], dtype=np.int64)
max_frames = np.max(num_frames)
# lengths
text_lengths = np.array([example[1]
for example in examples]).astype(np.int64)
frames = np.array([example[5]
for example in examples]).astype(np.int64)
max_text_length = int(np.max(text_lengths))
max_frames = int(np.max(frames))
if max_frames % self._factor != 0:
max_frames += (self._factor - max_frames % self._factor)
max_frames += self._pad_begin
max_decoder_length = max_frames // self._factor
# pad time sequence
text_sequences = []
lin_specs = []
mel_specs = []
done_flags = []
for example in examples:
(mix_grapheme_phonemes, text_length, speaker_id, S_norm,
S_mel_norm, num_frames) = example
text_sequences.append(
np.pad(mix_grapheme_phonemes, (0, max_text_length - text_length
),
mode="constant"))
lin_specs.append(
np.pad(S_norm, ((self._pad_begin, max_frames - self._pad_begin
- num_frames), (0, 0)),
mode="constant"))
mel_specs.append(
np.pad(S_mel_norm, ((self._pad_begin, max_frames -
self._pad_begin - num_frames), (0, 0)),
mode="constant"))
done_flags.append(
np.pad(np.zeros((int(np.ceil(num_frames // self._factor)), )),
(0, max_decoder_length - int(
np.ceil(num_frames // self._factor))),
mode="constant",
constant_values=1))
text_sequences = np.array(text_sequences).astype(np.int64)
lin_specs = np.array(lin_specs).astype(np.float32)
mel_specs = np.array(mel_specs).astype(np.float32)
text, spec, mel, _ = example
text_seqs.append(en.text_to_sequence(text, self.p_pronunciation))
# if max_frames - mel.shape[0] < 0:
# import pdb; pdb.set_trace()
specs.append(np.pad(spec, [(0, max_frames - spec.shape[0]), (0, 0)]))
mels.append(np.pad(mel, [(0, max_frames - mel.shape[0]), (0, 0)]))
# downsample here
done_flags = np.array(done_flags).astype(np.float32)
specs = np.stack(specs)
mels = np.stack(mels)
# text positions
text_mask = (np.arange(1, 1 + max_text_length) <= np.expand_dims(
text_lengths, -1)).astype(np.int64)
text_positions = np.arange(
1, 1 + max_text_length, dtype=np.int64) * text_mask
text_lengths = np.array([len(seq) for seq in text_seqs], dtype=np.int64)
max_length = np.max(text_lengths)
text_seqs = np.array([seq + [0] * (max_length - len(seq)) for seq in text_seqs], dtype=np.int64)
return text_seqs, text_lengths, specs, mels, num_frames
# decoder_positions
decoder_positions = np.tile(
np.expand_dims(
np.arange(
1, 1 + max_decoder_length, dtype=np.int64), 0),
(batch_size, 1))
if __name__ == "__main__":
import argparse
import tqdm
import time
from ruamel import yaml
return (text_sequences, text_lengths, text_positions, mel_specs,
lin_specs, frames, decoder_positions, done_flags)
parser = argparse.ArgumentParser(description="load the preprocessed ljspeech dataset")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
print("========= Command Line Arguments ========")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
print("=========== Configurations ==============")
for k in ["p_pronunciation", "batch_size"]:
print("{}: {}".format(k, config[k]))
ljspeech = LJSpeech(args.input)
collate_fn = DataCollector(config["p_pronunciation"])
def make_data_loader(data_root, config):
# construct meta data
meta = LJSpeechMetaData(data_root)
dg.enable_dygraph(fluid.CPUPlace())
sampler = PartialyRandomizedSimilarTimeLengthSampler(ljspeech.num_frames())
cargo = DataCargo(ljspeech, collate_fn,
batch_size=config["batch_size"], sampler=sampler)
loader = DataLoader\
.from_generator(capacity=5, return_list=True)\
.set_batch_generator(cargo)
# filter it!
min_text_length = config["meta_data"]["min_text_length"]
meta = FilterDataset(meta, lambda x: len(x[2]) >= min_text_length)
# transform meta data into meta data
c = config["transform"]
transform = Transform(
replace_pronunciation_prob=c["replace_pronunciation_prob"],
sample_rate=c["sample_rate"],
preemphasis=c["preemphasis"],
n_fft=c["n_fft"],
win_length=c["win_length"],
hop_length=c["hop_length"],
fmin=c["fmin"],
fmax=c["fmax"],
n_mels=c["n_mels"],
min_level_db=c["min_level_db"],
ref_level_db=c["ref_level_db"],
max_norm=c["max_norm"],
clip_norm=c["clip_norm"])
ljspeech = TransformDataset(meta, transform)
# use meta data's text length as a sort key for the sampler
batch_size = config["train"]["batch_size"]
text_lengths = [len(example[2]) for example in meta]
sampler = PartialyRandomizedSimilarTimeLengthSampler(text_lengths,
batch_size)
env = dg.parallel.ParallelEnv()
num_trainers = env.nranks
local_rank = env.local_rank
sampler = BucketSampler(
text_lengths, batch_size, num_trainers=num_trainers, rank=local_rank)
# some model hyperparameters affect how we process data
model_config = config["model"]
collector = DataCollector(
downsample_factor=model_config["downsample_factor"],
r=model_config["outputs_per_step"])
ljspeech_loader = DataCargo(
ljspeech, batch_fn=collector, batch_size=batch_size, sampler=sampler)
loader = fluid.io.DataLoader.from_generator(capacity=10, return_list=True)
loader.set_batch_generator(
ljspeech_loader, places=fluid.framework._current_expected_place())
return loader
for i, batch in tqdm.tqdm(enumerate(loader)):
continue

Binary file not shown.



@ -1,164 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddle import fluid
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from parakeet.g2p import en
from parakeet.models.deepvoice3 import Encoder, Decoder, Converter, DeepVoice3, TTSLoss, ConvSpec, WindowRange
from parakeet.utils.layer_tools import summary, freeze
def make_model(config):
c = config["model"]
# speaker embedding
n_speakers = c["n_speakers"]
speaker_dim = c["speaker_embed_dim"]
if n_speakers > 1:
speaker_embed = dg.Embedding(
(n_speakers, speaker_dim),
param_attr=I.Normal(scale=c["speaker_embedding_weight_std"]))
else:
speaker_embed = None
# encoder
h = c["encoder_channels"]
k = c["kernel_size"]
encoder_convolutions = (
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1),
ConvSpec(h, k, 3), )
encoder = Encoder(
n_vocab=en.n_vocab,
embed_dim=c["text_embed_dim"],
n_speakers=n_speakers,
speaker_dim=speaker_dim,
embedding_weight_std=c["embedding_weight_std"],
convolutions=encoder_convolutions,
dropout=c["dropout"])
if c["freeze_embedding"]:
freeze(encoder.embed)
# decoder
h = c["decoder_channels"]
k = c["kernel_size"]
prenet_convolutions = (ConvSpec(h, k, 1), ConvSpec(h, k, 3))
attentive_convolutions = (
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(h, k, 9),
ConvSpec(h, k, 27),
ConvSpec(h, k, 1), )
attention = [True, False, False, False, True]
force_monotonic_attention = [True, False, False, False, True]
window = WindowRange(c["window_backward"], c["window_ahead"])
decoder = Decoder(
n_speakers,
speaker_dim,
embed_dim=c["text_embed_dim"],
mel_dim=config["transform"]["n_mels"],
r=c["outputs_per_step"],
max_positions=c["max_positions"],
preattention=prenet_convolutions,
convolutions=attentive_convolutions,
attention=attention,
dropout=c["dropout"],
use_memory_mask=c["use_memory_mask"],
force_monotonic_attention=force_monotonic_attention,
query_position_rate=c["query_position_rate"],
key_position_rate=c["key_position_rate"],
window_range=window,
key_projection=c["key_projection"],
value_projection=c["value_projection"])
if not c["trainable_positional_encodings"]:
freeze(decoder.embed_keys_positions)
freeze(decoder.embed_query_positions)
# converter(postnet)
linear_dim = 1 + config["transform"]["n_fft"] // 2
h = c["converter_channels"]
k = c["kernel_size"]
postnet_convolutions = (
ConvSpec(h, k, 1),
ConvSpec(h, k, 3),
ConvSpec(2 * h, k, 1),
ConvSpec(2 * h, k, 3), )
use_decoder_states = c["use_decoder_state_for_postnet_input"]
converter = Converter(
n_speakers,
speaker_dim,
in_channels=decoder.state_dim
if use_decoder_states else config["transform"]["n_mels"],
linear_dim=linear_dim,
time_upsampling=c["downsample_factor"],
convolutions=postnet_convolutions,
dropout=c["dropout"])
model = DeepVoice3(
encoder,
decoder,
converter,
speaker_embed,
use_decoder_states=use_decoder_states)
return model
def make_criterion(config):
# =========================loss=========================
loss_config = config["loss"]
transform_config = config["transform"]
model_config = config["model"]
priority_freq = loss_config["priority_freq"] # Hz
sample_rate = transform_config["sample_rate"]
linear_dim = 1 + transform_config["n_fft"] // 2
priority_bin = int(priority_freq / (0.5 * sample_rate) * linear_dim)
criterion = TTSLoss(
masked_weight=loss_config["masked_loss_weight"],
priority_bin=priority_bin,
priority_weight=loss_config["priority_freq_weight"],
binary_divergence_weight=loss_config["binary_divergence_weight"],
guided_attention_sigma=loss_config["guided_attention_sigma"],
downsample_factor=model_config["downsample_factor"],
r=model_config["outputs_per_step"])
return criterion
def make_optimizer(model, config):
# =========================lr_scheduler=========================
lr_config = config["lr_scheduler"]
warmup_steps = lr_config["warmup_steps"]
peak_learning_rate = lr_config["peak_learning_rate"]
lr_scheduler = dg.NoamDecay(1 / (warmup_steps * (peak_learning_rate)**2),
warmup_steps)
# =========================optimizer=========================
optim_config = config["optimizer"]
optim = fluid.optimizer.Adam(
lr_scheduler,
beta1=optim_config["beta1"],
beta2=optim_config["beta2"],
epsilon=optim_config["epsilon"],
parameter_list=model.parameters(),
grad_clip=fluid.clip.GradientClipByGlobalNorm(0.1))
return optim


@ -0,0 +1,122 @@
from __future__ import division
import os
import argparse
from ruamel import yaml
import tqdm
from os.path import join
import csv
import numpy as np
import pandas as pd
import librosa
import logging
from parakeet.data import DatasetMixin
class LJSpeechMetaData(DatasetMixin):
def __init__(self, root):
self.root = root
self._wav_dir = join(root, "wavs")
csv_path = join(root, "metadata.csv")
self._table = pd.read_csv(
csv_path,
sep="|",
encoding="utf-8",
header=None,
quoting=csv.QUOTE_NONE,
names=["fname", "raw_text", "normalized_text"])
def get_example(self, i):
fname, raw_text, normalized_text = self._table.iloc[i]
abs_fname = join(self._wav_dir, fname + ".wav")
return fname, abs_fname, raw_text, normalized_text
def __len__(self):
return len(self._table)
class Transform(object):
def __init__(self, sample_rate, n_fft, hop_length, win_length, n_mels, reduction_factor):
self.sample_rate = sample_rate
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
self.n_mels = n_mels
self.reduction_factor = reduction_factor
def __call__(self, fname):
# wave processing
audio, _ = librosa.load(fname, sr=self.sample_rate)
# Pad the data to the right size to have a whole number of timesteps,
# accounting properly for the model reduction factor.
frames = audio.size // (self.reduction_factor * self.hop_length) + 1
# librosa's stft extracts frames of n_fft size, so we should pad n_fft // 2 on both sides
desired_length = (frames * self.reduction_factor - 1) * self.hop_length + self.n_fft
pad_amount = (desired_length - audio.size) // 2
# we pad manually to control the number of generated frames
if audio.size % 2 == 0:
audio = np.pad(audio, (pad_amount, pad_amount), mode='reflect')
else:
audio = np.pad(audio, (pad_amount, pad_amount + 1), mode='reflect')
# STFT
D = librosa.stft(audio, self.n_fft, self.hop_length, self.win_length, center=False)
S = np.abs(D)
S_mel = librosa.feature.melspectrogram(sr=self.sample_rate, S=S, n_mels=self.n_mels, fmax=8000.0)
# log magnitude
log_spectrogram = np.log(np.clip(S, a_min=1e-5, a_max=None))
log_mel_spectrogram = np.log(np.clip(S_mel, a_min=1e-5, a_max=None))
num_frames = log_spectrogram.shape[-1]
assert num_frames % self.reduction_factor == 0, "num_frames is wrong"
return (log_spectrogram.T, log_mel_spectrogram.T, num_frames)
def save(output_path, dataset, transform):
if not os.path.exists(output_path):
os.makedirs(output_path)
records = []
for example in tqdm.tqdm(dataset):
fname, abs_fname, _, normalized_text = example
log_spec, log_mel_spec, num_frames = transform(abs_fname)
records.append((num_frames,
fname + "_spec.npy",
fname + "_mel.npy",
normalized_text))
np.save(join(output_path, fname + "_spec"), log_spec)
np.save(join(output_path, fname + "_mel"), log_mel_spec)
meta_data = pd.DataFrame.from_records(records)
meta_data.to_csv(join(output_path, "metadata.csv"),
quoting=csv.QUOTE_NONE, sep="|", encoding="utf-8",
header=False, index=False)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="preprocess ljspeech dataset and save it.")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
parser.add_argument("--output", type=str, required=True, help="path to save the preprocessed dataset")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
print("========= Command Line Arguments ========")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
print("=========== Configurations ==============")
for k in ["sample_rate", "n_fft", "win_length",
"hop_length", "n_mels", "reduction_factor"]:
print("{}: {}".format(k, config[k]))
ljspeech_meta = LJSpeechMetaData(args.input)
transform = Transform(config["sample_rate"],
config["n_fft"],
config["hop_length"],
config["win_length"],
config["n_mels"],
config["reduction_factor"])
save(args.output, ljspeech_meta, transform)
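The reflect padding in `Transform.__call__` is sized so that a `center=False` STFT produces a frame count that is an exact multiple of the reduction factor. A small self-contained check of that arithmetic, using the values from `ljspeech.yaml` (hop_length 256, n_fft 1024, reduction_factor 4):
```python
hop_length, n_fft, reduction_factor = 256, 1024, 4

def padded_num_frames(num_samples):
    # same arithmetic as Transform.__call__: pad to `desired`, then a
    # center=False STFT yields 1 + (desired - n_fft) // hop_length frames
    frames = num_samples // (reduction_factor * hop_length) + 1
    desired = (frames * reduction_factor - 1) * hop_length + n_fft
    return 1 + (desired - n_fft) // hop_length

for n in (22050, 100000, 123457):  # arbitrary audio lengths
    assert padded_num_frames(n) % reduction_factor == 0
```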


@ -1,6 +1,5 @@
Scientists at the CERN laboratory say they have discovered a new particle.
There's a way to measure the acute emotional intelligence that has never gone out of style.
President Trump met with other leaders at the Group of 20 conference.
Generative adversarial network or variational auto-encoder.
Please call Stella.
Some have accepted this as a miracle without any physical explanation.
Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition
in being comparatively modern.
For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process
produced the block books, which were the immediate predecessors of the true printed book,
the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.


@ -1,91 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import os
import argparse
import ruamel.yaml
import numpy as np
import soundfile as sf
from paddle import fluid
fluid.require_version('1.8.0')
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
from tensorboardX import SummaryWriter
from parakeet.g2p import en
from parakeet.modules.weight_norm import WeightNormWrapper
from parakeet.utils.layer_tools import summary
from parakeet.utils import io
from model import make_model
from utils import make_evaluator
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Synthsize waveform with a checkpoint.")
parser.add_argument("--config", type=str, help="experiment config")
parser.add_argument("--device", type=int, default=-1, help="device to use")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument("text", type=str, help="text file to synthesize")
parser.add_argument(
"output", type=str, help="path to save synthesized audio")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
print("Command Line Args: ")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
if args.device == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(args.device)
dg.enable_dygraph(place)
model = make_model(config)
checkpoint_dir = os.path.join(args.output, "checkpoints")
if args.checkpoint is not None:
iteration = io.load_parameters(model, checkpoint_path=args.checkpoint)
else:
iteration = io.load_parameters(
model, checkpoint_dir=checkpoint_dir, iteration=args.iteration)
# WARNING: don't forget to remove weight norm to re-compute each wrapped layer's weight
# removing weight norm also speeds up computation
for layer in model.sublayers():
if isinstance(layer, WeightNormWrapper):
layer.remove_weight_norm()
synthesis_dir = os.path.join(args.output, "synthesis")
if not os.path.exists(synthesis_dir):
os.makedirs(synthesis_dir)
with open(args.text, "rt", encoding="utf-8") as f:
lines = f.readlines()
sentences = [line[:-1] for line in lines]
evaluator = make_evaluator(config, sentences, synthesis_dir)
evaluator(model, iteration)


@ -0,0 +1,80 @@
import numpy as np
from matplotlib import cm
import librosa
import os
import time
import tqdm
import argparse
from ruamel import yaml
import paddle
from paddle import fluid
from paddle.fluid import layers as F
from paddle.fluid import dygraph as dg
from paddle.fluid.io import DataLoader
from tensorboardX import SummaryWriter
import soundfile as sf
from parakeet.data import SliceDataset, DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler
from parakeet.utils.io import save_parameters, load_parameters, add_yaml_config_to_args
from parakeet.g2p import en
from vocoder import WaveflowVocoder
from train import create_model
def main(args, config):
model = create_model(config)
loaded_step = load_parameters(model, checkpoint_path=args.checkpoint)
model.eval()
vocoder = WaveflowVocoder()
vocoder.model.eval()
if not os.path.exists(args.output):
os.makedirs(args.output)
monotonic_layers = [int(item.strip()) - 1 for item in args.monotonic_layers.split(',')]
with open(args.input, 'rt') as f:
sentences = [line.strip() for line in f.readlines()]
for i, sentence in enumerate(sentences):
wav = synthesize(config, model, vocoder, sentence, monotonic_layers)
sf.write(os.path.join(args.output, "sentence{}.wav".format(i)),
wav, samplerate=config["sample_rate"])
def synthesize(config, model, vocoder, sentence, monotonic_layers):
print("[synthesize] {}".format(sentence))
text = en.text_to_sequence(sentence, p=1.0)
text = np.expand_dims(np.array(text, dtype="int64"), 0)
lengths = np.array([text.size], dtype=np.int64)
text_seqs = dg.to_variable(text)
text_lengths = dg.to_variable(lengths)
decoder_layers = config["decoder_layers"]
force_monotonic_attention = [False] * decoder_layers
for i in monotonic_layers:
force_monotonic_attention[i] = True
with dg.no_grad():
outputs = model(text_seqs, text_lengths, speakers=None,
force_monotonic_attention=force_monotonic_attention,
window=(config["backward_step"], config["forward_step"]))
decoded, refined, attentions = outputs
wav = vocoder(F.transpose(decoded, (0, 2, 1)))
wav_np = wav.numpy()[0]
return wav_np
if __name__ == "__main__":
import argparse
from ruamel import yaml
parser = argparse.ArgumentParser("synthesize from a checkpoint")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="text file to synthesize")
parser.add_argument("--output", type=str, required=True, help="path to save audio")
parser.add_argument("--checkpoint", type=str, required=True, help="data path of the checkpoint")
parser.add_argument("--monotonic_layers", type=str, required=True, help="monotonic decoder layer, index starts friom 1")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
dg.enable_dygraph(fluid.CUDAPlace(0))
main(args, config)


@ -1,172 +1,187 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import time
import numpy as np
from matplotlib import cm
import librosa
import os
import argparse
import ruamel.yaml
import time
import tqdm
from tensorboardX import SummaryWriter
import paddle
from paddle import fluid
fluid.require_version('1.8.0')
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
from parakeet.utils.io import load_parameters, save_parameters
from paddle.fluid import layers as F
from paddle.fluid import dygraph as dg
from paddle.fluid.io import DataLoader
from tensorboardX import SummaryWriter
from data import make_data_loader
from model import make_model, make_criterion, make_optimizer
from utils import make_output_tree, add_options, get_place, Evaluator, StateSaver, make_evaluator, make_state_saver
from parakeet.models.deepvoice3 import Encoder, Decoder, PostNet, SpectraNet
from parakeet.data import SliceDataset, DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler
from parakeet.utils.io import save_parameters, load_parameters
from parakeet.g2p import en
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Train a Deep Voice 3 model with LJSpeech dataset.")
add_options(parser)
args, _ = parser.parse_known_args()
from data import LJSpeech, DataCollector
from vocoder import WaveflowVocoder, GriffinLimVocoder
from clip import DoubleClip
# only use args.device when training in single process
# when training with distributed.launch, devices are provided by
# `--selected_gpus` for distributed.launch
env = dg.parallel.ParallelEnv()
device_id = env.dev_id if env.nranks > 1 else args.device
place = get_place(device_id)
# start dygraph
dg.enable_dygraph(place)
with open(args.config, 'rt') as f:
config = ruamel.yaml.safe_load(f)
def create_model(config):
char_embedding = dg.Embedding((en.n_vocab, config["char_dim"]))
multi_speaker = config["n_speakers"] > 1
speaker_embedding = dg.Embedding((config["n_speakers"], config["speaker_dim"])) \
if multi_speaker else None
encoder = Encoder(config["encoder_layers"], config["char_dim"],
config["encoder_dim"], config["kernel_size"],
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
keep_prob=1.0 - config["dropout"])
decoder = Decoder(config["n_mels"], config["reduction_factor"],
list(config["prenet_sizes"]) + [config["char_dim"]],
config["decoder_layers"], config["kernel_size"],
config["attention_dim"],
position_encoding_weight=config["position_weight"],
omega=config["position_rate"],
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
keep_prob=1.0 - config["dropout"])
postnet = PostNet(config["postnet_layers"], config["char_dim"],
config["postnet_dim"], config["kernel_size"],
config["n_mels"], config["reduction_factor"],
has_bias=multi_speaker, bias_dim=config["speaker_dim"],
keep_prob=1.0 - config["dropout"])
spectranet = SpectraNet(char_embedding, speaker_embedding, encoder, decoder, postnet)
return spectranet
print("Command Line Args: ")
for k, v in vars(args).items():
print("{}: {}".format(k, v))
def create_data(config, data_path):
dataset = LJSpeech(data_path)
data_loader = make_data_loader(args.data, config)
model = make_model(config)
if env.nranks > 1:
strategy = dg.parallel.prepare_context()
model = dg.DataParallel(model, strategy)
criterion = make_criterion(config)
optim = make_optimizer(model, config)
train_dataset = SliceDataset(dataset, config["valid_size"], len(dataset))
train_collator = DataCollector(config["p_pronunciation"])
train_sampler = PartialyRandomizedSimilarTimeLengthSampler(
dataset.num_frames()[config["valid_size"]:])
train_cargo = DataCargo(train_dataset, train_collator,
batch_size=config["batch_size"], sampler=train_sampler)
train_loader = DataLoader\
.from_generator(capacity=10, return_list=True)\
.set_batch_generator(train_cargo)
# generation
synthesis_config = config["synthesis"]
power = synthesis_config["power"]
n_iter = synthesis_config["n_iter"]
valid_dataset = SliceDataset(dataset, 0, config["valid_size"])
valid_collector = DataCollector(1.)
valid_sampler = SequentialSampler(valid_dataset)
valid_cargo = DataCargo(valid_dataset, valid_collector,
batch_size=1, sampler=valid_sampler)
valid_loader = DataLoader\
.from_generator(capacity=2, return_list=True)\
.set_batch_generator(valid_cargo)
return train_loader, valid_loader
# tensorboard & checkpoint preparation
output_dir = args.output
ckpt_dir = os.path.join(output_dir, "checkpoints")
log_dir = os.path.join(output_dir, "log")
state_dir = os.path.join(output_dir, "states")
eval_dir = os.path.join(output_dir, "eval")
if env.local_rank == 0:
make_output_tree(output_dir)
writer = SummaryWriter(logdir=log_dir)
else:
writer = None
sentences = [
"Scientists at the CERN laboratory say they have discovered a new particle.",
"There's a way to measure the acute emotional intelligence that has never gone out of style.",
"President Trump met with other leaders at the Group of 20 conference.",
"Generative adversarial network or variational auto-encoder.",
"Please call Stella.",
"Some have accepted this as a miracle without any physical explanation.",
]
evaluator = make_evaluator(config, sentences, eval_dir, writer)
state_saver = make_state_saver(config, state_dir, writer)
def create_optimizer(model, config):
optim = fluid.optimizer.Adam(config["learning_rate"],
parameter_list=model.parameters(),
grad_clip=DoubleClip(config["clip_value"], config["clip_norm"]))
return optim
# load parameters and optimizer, and update the number of iterations done so far
if args.checkpoint is not None:
iteration = load_parameters(
model, optim, checkpoint_path=args.checkpoint)
else:
iteration = load_parameters(
model, optim, checkpoint_dir=ckpt_dir, iteration=args.iteration)
def train(args, config):
model = create_model(config)
train_loader, valid_loader = create_data(config, args.input)
optim = create_optimizer(model, config)
# =========================train=========================
train_config = config["train"]
max_iter = train_config["max_iteration"]
snap_interval = train_config["snap_interval"]
save_interval = train_config["save_interval"]
eval_interval = train_config["eval_interval"]
global_step = iteration + 1
iterator = iter(tqdm.tqdm(data_loader))
downsample_factor = config["model"]["downsample_factor"]
while global_step <= max_iter:
global global_step
max_iteration = 2000000
iterator = iter(tqdm.tqdm(train_loader))
while global_step <= max_iteration:
# get inputs
try:
batch = next(iterator)
except StopIteration as e:
iterator = iter(tqdm.tqdm(data_loader))
except StopIteration:
iterator = iter(tqdm.tqdm(train_loader))
batch = next(iterator)
# unzip it
text_seqs, text_lengths, specs, mels, num_frames = batch
# forward & backward
model.train()
(text_sequences, text_lengths, text_positions, mel_specs, lin_specs,
frames, decoder_positions, done_flags) = batch
downsampled_mel_specs = F.strided_slice(
mel_specs,
axes=[1],
starts=[0],
ends=[mel_specs.shape[1]],
strides=[downsample_factor])
outputs = model(
text_sequences,
text_positions,
text_lengths,
None,
downsampled_mel_specs,
decoder_positions, )
# mel_outputs, linear_outputs, alignments, done
inputs = (downsampled_mel_specs, lin_specs, done_flags, text_lengths,
frames)
losses = criterion(outputs, inputs)
outputs = model(text_seqs, text_lengths, speakers=None, mel=mels)
decoded, refined, attentions, final_state = outputs
l = losses["loss"]
if env.nranks > 1:
l = model.scale_loss(l)
l.backward()
model.apply_collective_grads()
else:
l.backward()
causal_mel_loss = model.spec_loss(decoded, mels, num_frames)
non_causal_mel_loss = model.spec_loss(refined, mels, num_frames)
loss = causal_mel_loss + non_causal_mel_loss
loss.backward()
# record learning rate before updating
if env.local_rank == 0:
writer.add_scalar("learning_rate",
optim._learning_rate.step().numpy(), global_step)
optim.minimize(l)
optim.clear_gradients()
# update
optim.minimize(loss)
# record step losses
step_loss = {k: v.numpy()[0] for k, v in losses.items()}
# logging
tqdm.tqdm.write("[train] step: {}\tloss: {:.6f}\tcausal:{:.6f}\tnon_causal:{:.6f}".format(
global_step,
loss.numpy()[0],
causal_mel_loss.numpy()[0],
non_causal_mel_loss.numpy()[0]))
writer.add_scalar("loss/causal_mel_loss", causal_mel_loss.numpy()[0], global_step=global_step)
writer.add_scalar("loss/non_causal_mel_loss", non_causal_mel_loss.numpy()[0], global_step=global_step)
writer.add_scalar("loss/loss", loss.numpy()[0], global_step=global_step)
if global_step % config["report_interval"] == 0:
text_length = int(text_lengths.numpy()[0])
num_frame = int(num_frames.numpy()[0])
if env.local_rank == 0:
tqdm.tqdm.write("[Train] global_step: {}\tloss: {}".format(
global_step, step_loss["loss"]))
for k, v in step_loss.items():
writer.add_scalar(k, v, global_step)
tag = "train_mel/ground-truth"
img = cm.viridis(normalize(mels.numpy()[0, :num_frame].T))
writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
# train state saving, the first sentence in the batch
if env.local_rank == 0 and global_step % snap_interval == 0:
input_specs = (mel_specs, lin_specs)
state_saver(outputs, input_specs, global_step)
tag = "train_mel/decoded"
img = cm.viridis(normalize(decoded.numpy()[0, :num_frame].T))
writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
# evaluation
if env.local_rank == 0 and global_step % eval_interval == 0:
evaluator(model, global_step)
tag = "train_mel/refined"
img = cm.viridis(normalize(refined.numpy()[0, :num_frame].T))
writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
# save checkpoint
if env.local_rank == 0 and global_step % save_interval == 0:
save_parameters(ckpt_dir, global_step, model, optim)
vocoder = WaveflowVocoder()
vocoder.model.eval()
tag = "train_audio/ground-truth-waveflow"
wav = vocoder(F.transpose(mels[0:1, :num_frame, :], (0, 2, 1)))
writer.add_audio(tag, wav.numpy()[0], global_step=global_step, sample_rate=22050)
tag = "train_audio/decoded-waveflow"
wav = vocoder(F.transpose(decoded[0:1, :num_frame, :], (0, 2, 1)))
writer.add_audio(tag, wav.numpy()[0], global_step=global_step, sample_rate=22050)
tag = "train_audio/refined-waveflow"
wav = vocoder(F.transpose(refined[0:1, :num_frame, :], (0, 2, 1)))
writer.add_audio(tag, wav.numpy()[0], global_step=global_step, sample_rate=22050)
attentions_np = attentions.numpy()
attentions_np = attentions_np[:, 0, :num_frame // 4 , :text_length]
for i, attention_layer in enumerate(np.rot90(attentions_np, axes=(1,2))):
tag = "train_attention/layer_{}".format(i)
img = cm.viridis(normalize(attention_layer))
writer.add_image(tag, img, global_step=global_step, dataformats="HWC")
if global_step % config["save_interval"] == 0:
save_parameters(writer.logdir, global_step, model, optim)
# global step +1
global_step += 1
def normalize(arr):
return (arr - arr.min()) / (arr.max() - arr.min())
if __name__ == "__main__":
import argparse
from ruamel import yaml
parser = argparse.ArgumentParser(description="train a Deep Voice 3 model with LJSpeech")
parser.add_argument("--config", type=str, required=True, help="config file")
parser.add_argument("--input", type=str, required=True, help="data path of the original data")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = yaml.safe_load(f)
dg.enable_dygraph(fluid.CUDAPlace(0))
global global_step
global_step = 1
global writer
writer = SummaryWriter()
print("[Training] tensorboard log and checkpoints are save in {}".format(
writer.logdir))
train(args, config)


@ -1,374 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import os
import numpy as np
import matplotlib
matplotlib.use("agg")
from matplotlib import cm
import matplotlib.pyplot as plt
import librosa
from scipy import signal
from librosa import display
import soundfile as sf
from paddle import fluid
import paddle.fluid.dygraph as dg
from parakeet.g2p import en
def get_place(device_id):
"""get place from device_id, -1 stands for CPU"""
if device_id == -1:
place = fluid.CPUPlace()
else:
place = fluid.CUDAPlace(device_id)
return place
def add_options(parser):
parser.add_argument("--config", type=str, help="experimrnt config")
parser.add_argument(
"--data",
type=str,
default="/workspace/datasets/LJSpeech-1.1/",
help="The path of the LJSpeech dataset.")
parser.add_argument("--device", type=int, default=-1, help="device to use")
g = parser.add_mutually_exclusive_group()
g.add_argument("--checkpoint", type=str, help="checkpoint to resume from.")
g.add_argument(
"--iteration",
type=int,
help="the iteration of the checkpoint to load from output directory")
parser.add_argument(
"output", type=str, default="experiment", help="path to save results")
def make_evaluator(config, text_sequences, output_dir, writer=None):
c = config["transform"]
p_replace = 0.0
sample_rate = c["sample_rate"]
preemphasis = c["preemphasis"]
win_length = c["win_length"]
hop_length = c["hop_length"]
min_level_db = c["min_level_db"]
ref_level_db = c["ref_level_db"]
synthesis_config = config["synthesis"]
power = synthesis_config["power"]
n_iter = synthesis_config["n_iter"]
return Evaluator(
text_sequences,
p_replace,
sample_rate,
preemphasis,
win_length,
hop_length,
min_level_db,
ref_level_db,
power,
n_iter,
output_dir=output_dir,
writer=writer)
class Evaluator(object):
def __init__(self,
text_sequences,
p_replace,
sample_rate,
preemphasis,
win_length,
hop_length,
min_level_db,
ref_level_db,
power,
n_iter,
output_dir,
writer=None):
self.text_sequences = text_sequences
self.output_dir = output_dir
self.writer = writer
self.p_replace = p_replace
self.sample_rate = sample_rate
self.preemphasis = preemphasis
self.win_length = win_length
self.hop_length = hop_length
self.min_level_db = min_level_db
self.ref_level_db = ref_level_db
self.power = power
self.n_iter = n_iter
def process_a_sentence(self, model, text):
text = np.array(
en.text_to_sequence(
text, p=self.p_replace), dtype=np.int64)
length = len(text)
text_positions = np.arange(1, 1 + length, dtype=np.int64)
text = np.expand_dims(text, 0)
text_positions = np.expand_dims(text_positions, 0)
model.eval()
if isinstance(model, dg.DataParallel):
_model = model._layers
else:
_model = model
mel_outputs, linear_outputs, alignments, done = _model.transduce(
dg.to_variable(text), dg.to_variable(text_positions))
linear_outputs_np = linear_outputs.numpy()[0].T # (C, T)
wav = spec_to_waveform(linear_outputs_np, self.min_level_db,
self.ref_level_db, self.power, self.n_iter,
self.win_length, self.hop_length,
self.preemphasis)
alignments_np = alignments.numpy()[0] # batch_size = 1
return wav, alignments_np
def __call__(self, model, iteration):
writer = self.writer
for i, seq in enumerate(self.text_sequences):
print("[Eval] synthesizing sentence {}".format(i))
wav, alignments_np = self.process_a_sentence(model, seq)
wav_path = os.path.join(
self.output_dir,
"eval_sample_{}_step_{:09d}.wav".format(i, iteration))
sf.write(wav_path, wav, self.sample_rate)
if writer is not None:
writer.add_audio(
"eval_sample_{}".format(i),
wav,
iteration,
sample_rate=self.sample_rate)
attn_path = os.path.join(
self.output_dir,
"eval_sample_{}_step_{:09d}.png".format(i, iteration))
plot_alignment(alignments_np, attn_path)
if writer is not None:
writer.add_image(
"eval_sample_attn_{}".format(i),
cm.viridis(alignments_np),
iteration,
dataformats="HWC")
def make_state_saver(config, output_dir, writer=None):
c = config["transform"]
p_replace = c["replace_pronunciation_prob"]
sample_rate = c["sample_rate"]
preemphasis = c["preemphasis"]
win_length = c["win_length"]
hop_length = c["hop_length"]
min_level_db = c["min_level_db"]
ref_level_db = c["ref_level_db"]
synthesis_config = config["synthesis"]
power = synthesis_config["power"]
n_iter = synthesis_config["n_iter"]
return StateSaver(p_replace, sample_rate, preemphasis, win_length,
hop_length, min_level_db, ref_level_db, power, n_iter,
output_dir, writer)
class StateSaver(object):
def __init__(self,
p_replace,
sample_rate,
preemphasis,
win_length,
hop_length,
min_level_db,
ref_level_db,
power,
n_iter,
output_dir,
writer=None):
self.output_dir = output_dir
self.writer = writer
self.p_replace = p_replace
self.sample_rate = sample_rate
self.preemphasis = preemphasis
self.win_length = win_length
self.hop_length = hop_length
self.min_level_db = min_level_db
self.ref_level_db = ref_level_db
self.power = power
self.n_iter = n_iter
def __call__(self, outputs, inputs, iteration):
mel_output, lin_output, alignments, done_output = outputs
mel_input, lin_input = inputs
writer = self.writer
# mel spectrogram
mel_input = mel_input[0].numpy().T
mel_output = mel_output[0].numpy().T
path = os.path.join(self.output_dir, "mel_spec")
plt.figure(figsize=(10, 3))
display.specshow(mel_input)
plt.colorbar()
plt.title("mel_input")
plt.savefig(
os.path.join(path, "target_mel_spec_step_{:09d}.png".format(
iteration)))
plt.close()
if writer is not None:
writer.add_image(
"target/mel_spec",
cm.viridis(mel_input),
iteration,
dataformats="HWC")
plt.figure(figsize=(10, 3))
display.specshow(mel_output)
plt.colorbar()
plt.title("mel_output")
plt.savefig(
os.path.join(path, "predicted_mel_spec_step_{:09d}.png".format(
iteration)))
plt.close()
if writer is not None:
writer.add_image(
"predicted/mel_spec",
cm.viridis(mel_output),
iteration,
dataformats="HWC")
# linear spectrogram
lin_input = lin_input[0].numpy().T
lin_output = lin_output[0].numpy().T
path = os.path.join(self.output_dir, "lin_spec")
plt.figure(figsize=(10, 3))
display.specshow(lin_input)
plt.colorbar()
plt.title("mel_input")
plt.savefig(
os.path.join(path, "target_lin_spec_step_{:09d}.png".format(
iteration)))
plt.close()
if writer is not None:
writer.add_image(
"target/lin_spec",
cm.viridis(lin_input),
iteration,
dataformats="HWC")
plt.figure(figsize=(10, 3))
display.specshow(lin_output)
plt.colorbar()
plt.title("mel_input")
plt.savefig(
os.path.join(path, "predicted_lin_spec_step_{:09d}.png".format(
iteration)))
plt.close()
if writer is not None:
writer.add_image(
"predicted/lin_spec",
cm.viridis(lin_output),
iteration,
dataformats="HWC")
# alignment
path = os.path.join(self.output_dir, "alignments")
alignments = alignments[:, 0, :, :].numpy()
for idx, attn_layer in enumerate(alignments):
save_path = os.path.join(
path, "train_attn_layer_{}_step_{}.png".format(idx, iteration))
plot_alignment(attn_layer, save_path)
if writer is not None:
writer.add_image(
"train_attn/layer_{}".format(idx),
cm.viridis(attn_layer),
iteration,
dataformats="HWC")
# synthesize waveform
wav = spec_to_waveform(
lin_output, self.min_level_db, self.ref_level_db, self.power,
self.n_iter, self.win_length, self.hop_length, self.preemphasis)
path = os.path.join(self.output_dir, "waveform")
save_path = os.path.join(
path, "train_sample_step_{:09d}.wav".format(iteration))
sf.write(save_path, wav, self.sample_rate)
if writer is not None:
writer.add_audio(
"train_sample", wav, iteration, sample_rate=self.sample_rate)
def spec_to_waveform(spec, min_level_db, ref_level_db, power, n_iter,
win_length, hop_length, preemphasis):
"""Convert output linear spec to waveform using griffin-lim vocoder.
Args:
spec (ndarray): the output linear spectrogram, shape(C, T), where C means n_fft, T means frames.
"""
denormalized = np.clip(spec, 0, 1) * (-min_level_db) + min_level_db
lin_scaled = np.exp((denormalized + ref_level_db) / 20 * np.log(10))
wav = librosa.griffinlim(
lin_scaled**power,
n_iter=n_iter,
hop_length=hop_length,
win_length=win_length)
if preemphasis > 0:
wav = signal.lfilter([1.], [1., -preemphasis], wav)
wav = np.clip(wav, -1.0, 1.0)
return wav
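A minimal sketch of how `spec_to_waveform` might be called on a model's linear output; `linear_outputs` and all transform hyperparameters below are illustrative assumptions, not values taken from the project config.

```python
import numpy as np
import soundfile as sf

# `linear_outputs` is assumed to be a model output of shape (B, T, C); take the
# first example and transpose to (C, T) as expected by spec_to_waveform.
lin_spec = linear_outputs.numpy()[0].T
wav = spec_to_waveform(lin_spec, min_level_db=-100, ref_level_db=20, power=1.4,
                       n_iter=32, win_length=1024, hop_length=256, preemphasis=0.97)
sf.write("sample.wav", wav, 22050)  # sample rate assumed to match the dataset config
```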
def make_output_tree(output_dir):
print("creating output tree: {}".format(output_dir))
ckpt_dir = os.path.join(output_dir, "checkpoints")
state_dir = os.path.join(output_dir, "states")
eval_dir = os.path.join(output_dir, "eval")
for x in [ckpt_dir, state_dir, eval_dir]:
if not os.path.exists(x):
os.makedirs(x)
for x in ["alignments", "waveform", "lin_spec", "mel_spec"]:
p = os.path.join(state_dir, x)
if not os.path.exists(p):
os.makedirs(p)
def plot_alignment(alignment, path):
"""
Plot an attention layer's alignment for a sentence.
alignment: shape(T_dec, T_enc).
"""
plt.figure()
plt.imshow(alignment)
plt.colorbar()
plt.xlabel('Encoder timestep')
plt.ylabel('Decoder timestep')
plt.savefig(path)
plt.close()
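A quick sketch of the remaining helpers; the output directory name and the alignment array are made up for illustration.

```python
import numpy as np

# create experiment/{checkpoints,states,eval} plus the states/ subdirectories
make_output_tree("experiment")

# plot a dummy (T_dec, T_enc) alignment into the states/alignments subdirectory
dummy_alignment = np.random.rand(50, 30)
plot_alignment(dummy_alignment, "experiment/states/alignments/demo.png")
```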

View File

@ -0,0 +1,43 @@
import argparse
from ruamel import yaml
import numpy as np
import librosa
import paddle
from paddle import fluid
from paddle.fluid import layers as F
from paddle.fluid import dygraph as dg
from parakeet.utils.io import load_parameters
from parakeet.models.waveflow.waveflow_modules import WaveFlowModule
class WaveflowVocoder(object):
def __init__(self):
config_path = "waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml"
with open(config_path, 'rt') as f:
config = yaml.safe_load(f)
ns = argparse.Namespace()
for k, v in config.items():
setattr(ns, k, v)
ns.use_fp16 = False
self.model = WaveFlowModule(ns)
checkpoint_path = "waveflow_res128_ljspeech_ckpt_1.0/step-2000000"
load_parameters(self.model, checkpoint_path=checkpoint_path)
def __call__(self, mel):
with dg.no_grad():
self.model.eval()
audio = self.model.synthesize(mel)
self.model.train()
return audio
class GriffinLimVocoder(object):
def __init__(self, sharpening_factor=1.4, win_length=1024, hop_length=256):
self.sharpening_factor = sharpening_factor
self.win_length = win_length
self.hop_length = hop_length
def __call__(self, spec):
audio = librosa.core.griffinlim(np.exp(spec * self.sharpening_factor),
win_length=self.win_length, hop_length=self.hop_length)
return audio
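As a usage sketch, `GriffinLimVocoder` only needs a log-magnitude linear spectrogram; the spectrogram shape and sample rate below are assumptions for illustration. `WaveflowVocoder` additionally expects the hard-coded `waveflow_res128_ljspeech_ckpt_1.0` config and checkpoint to be present on disk.

```python
import numpy as np
import soundfile as sf

vocoder = GriffinLimVocoder(sharpening_factor=1.4, win_length=1024, hop_length=256)
# assume a log-magnitude linear spectrogram of shape (n_fft // 2 + 1, T) with n_fft = 1024
log_spec = np.random.randn(513, 200).astype("float32") * 0.1 - 4.0
wav = vocoder(log_spec)  # griffin-lim reconstruction of exp(spec * sharpening_factor)
sf.write("griffinlim_sample.wav", wav, 22050)
```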

View File

@ -1,19 +1 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from parakeet.models.deepvoice3.encoder import Encoder, ConvSpec
from parakeet.models.deepvoice3.decoder import Decoder, WindowRange
from parakeet.models.deepvoice3.converter import Converter
from parakeet.models.deepvoice3.loss import TTSLoss
from parakeet.models.deepvoice3.model import DeepVoice3
from .model import *

View File

@ -1,122 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from collections import namedtuple
from paddle import fluid
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
from parakeet.modules.weight_norm import Linear
WindowRange = namedtuple("WindowRange", ["backward", "ahead"])
class Attention(dg.Layer):
def __init__(self,
query_dim,
embed_dim,
dropout=0.0,
window_range=WindowRange(-1, 3),
key_projection=True,
value_projection=True):
"""Attention Layer for Deep Voice 3.
Args:
query_dim (int): the dimension of query vectors. (The size of a single vector of query.)
embed_dim (int): the dimension of keys and values.
dropout (float, optional): dropout probability of attention. Defaults to 0.0.
window_range (WindowRange, optional): range of attention; this is only used at inference. Defaults to WindowRange(-1, 3).
key_projection (bool, optional): whether the `Attention` Layer has a Linear Layer for the keys to pass through before computing attention. Defaults to True.
value_projection (bool, optional): whether the `Attention` Layer has a Linear Layer for the values to pass through before computing attention. Defaults to True.
"""
super(Attention, self).__init__()
std = np.sqrt(1 / query_dim)
self.query_proj = Linear(
query_dim, embed_dim, param_attr=I.Normal(scale=std))
if key_projection:
std = np.sqrt(1 / embed_dim)
self.key_proj = Linear(
embed_dim, embed_dim, param_attr=I.Normal(scale=std))
if value_projection:
std = np.sqrt(1 / embed_dim)
self.value_proj = Linear(
embed_dim, embed_dim, param_attr=I.Normal(scale=std))
std = np.sqrt(1 / embed_dim)
self.out_proj = Linear(
embed_dim, query_dim, param_attr=I.Normal(scale=std))
self.key_projection = key_projection
self.value_projection = value_projection
self.dropout = dropout
self.window_range = window_range
def forward(self, query, encoder_out, mask=None, last_attended=None):
"""
Compute contextualized representation and alignment scores.
Args:
query (Variable): shape(B, T_dec, C_q), dtype float32, the query tensor, where C_q means the query dim.
encoder_out (keys, values):
keys (Variable): shape(B, T_enc, C_emb), dtype float32, the key representation from an encoder, where C_emb means embed dim.
values (Variable): shape(B, T_enc, C_emb), dtype float32, the value representation from an encoder, where C_emb means embed dim.
mask (Variable, optional): shape(B, T_enc), dtype float32, mask generated with valid text lengths. Pad tokens correspond to 1, and valid tokens correspond to 0.
last_attended (int, optional): The position that received the most attention at last time step. This is only used at inference.
Outputs:
x (Variable): shape(B, T_dec, C_q), dtype float32, the contextualized representation from attention mechanism.
attn_scores (Variable): shape(B, T_dec, T_enc), dtype float32, the alignment tensor, where T_dec means the number of decoder time steps and T_enc means the number of encoder time steps.
"""
keys, values = encoder_out
residual = query
if self.value_projection:
values = self.value_proj(values)
if self.key_projection:
keys = self.key_proj(keys)
x = self.query_proj(query)
x = F.matmul(x, keys, transpose_y=True)
# mask generated by sentence length
neg_inf = -1.e30
if mask is not None:
neg_inf_mask = F.scale(F.unsqueeze(mask, [1]), neg_inf)
x += neg_inf_mask
# if last_attended is provided, focus only on a window range around it
# to enforce monotonic attention.
if last_attended is not None:
locality_mask = np.ones(shape=x.shape, dtype=np.float32)
backward, ahead = self.window_range
backward = last_attended + backward
ahead = last_attended + ahead
backward = max(backward, 0)
ahead = min(ahead, x.shape[-1])
locality_mask[:, :, backward:ahead] = 0.
locality_mask = dg.to_variable(locality_mask)
neg_inf_mask = F.scale(locality_mask, neg_inf)
x += neg_inf_mask
x = F.softmax(x)
attn_scores = x
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
x = F.matmul(x, values)
encoder_length = keys.shape[1]
x = F.scale(x, encoder_length * np.sqrt(1.0 / encoder_length))
x = self.out_proj(x)
x = F.scale((x + residual), np.sqrt(0.5))
return x, attn_scores
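A minimal shape-check sketch for the `Attention` layer above, run in dygraph mode on CPU with dummy tensors; the dimensions are arbitrary.

```python
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg

with dg.guard(fluid.CPUPlace()):
    attn = Attention(query_dim=128, embed_dim=256)
    query = dg.to_variable(np.random.randn(1, 10, 128).astype("float32"))   # (B, T_dec, C_q)
    keys = dg.to_variable(np.random.randn(1, 20, 256).astype("float32"))    # (B, T_enc, C_emb)
    values = dg.to_variable(np.random.randn(1, 20, 256).astype("float32"))  # (B, T_enc, C_emb)
    x, scores = attn(query, (keys, values))  # x: (1, 10, 128), scores: (1, 10, 20)
```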

View File

@ -0,0 +1,245 @@
import numpy as np
from paddle.fluid import layers as F
from paddle.fluid.framework import Variable, in_dygraph_mode
from paddle.fluid import core, dygraph_utils
from paddle.fluid.layers import nn, utils
from paddle.fluid.data_feeder import check_variable_and_dtype
from paddle.fluid.param_attr import ParamAttr
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.dygraph import layers
from paddle.fluid.initializer import Normal
def _is_list_or_tuple(input):
return isinstance(input, (list, tuple))
def _zero_padding_in_batch_and_channel(padding, channel_last):
if channel_last:
return list(padding[0]) == [0, 0] and list(padding[-1]) == [0, 0]
else:
return list(padding[0]) == [0, 0] and list(padding[1]) == [0, 0]
def _exclude_padding_in_batch_and_channel(padding, channel_last):
padding_ = padding[1:-1] if channel_last else padding[2:]
padding_ = [elem for pad_a_dim in padding_ for elem in pad_a_dim]
return padding_
def _update_padding_nd(padding, channel_last, num_dims):
if isinstance(padding, str):
padding = padding.upper()
if padding not in ["SAME", "VALID"]:
raise ValueError(
"Unknown padding: '{}'. It can only be 'SAME' or 'VALID'.".
format(padding))
if padding == "VALID":
padding_algorithm = "VALID"
padding = [0] * num_dims
else:
padding_algorithm = "SAME"
padding = [0] * num_dims
elif _is_list_or_tuple(padding):
# for padding like
# [(pad_before, pad_after), (pad_before, pad_after), ...]
# padding for batch_dim and channel_dim included
if len(padding) == 2 + num_dims and _is_list_or_tuple(padding[0]):
if not _zero_padding_in_batch_and_channel(padding, channel_last):
raise ValueError(
"Non-zero padding({}) in the batch or channel dimensions "
"is not supported.".format(padding))
padding_algorithm = "EXPLICIT"
padding = _exclude_padding_in_batch_and_channel(padding,
channel_last)
if utils._is_symmetric_padding(padding, num_dims):
padding = padding[0::2]
# for padding like [pad_before, pad_after, pad_before, pad_after, ...]
elif len(padding) == 2 * num_dims and isinstance(padding[0], int):
padding_algorithm = "EXPLICIT"
padding = utils.convert_to_list(padding, 2 * num_dims, 'padding')
if utils._is_symmetric_padding(padding, num_dims):
padding = padding[0::2]
# for padding like [pad_d1, pad_d2, ...]
elif len(padding) == num_dims and isinstance(padding[0], int):
padding_algorithm = "EXPLICIT"
padding = utils.convert_to_list(padding, num_dims, 'padding')
else:
raise ValueError("In valid padding: {}".format(padding))
# for integer padding
else:
padding_algorithm = "EXPLICIT"
padding = utils.convert_to_list(padding, num_dims, 'padding')
return padding, padding_algorithm
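For reference, a small sketch (assuming the imports at the top of this file) of the padding forms `_update_padding_nd` accepts in the 1D case; the values are illustrative.

```python
# string form: "same" / "valid" select a padding algorithm, explicit values become zero
_update_padding_nd("same", channel_last=False, num_dims=1)   # -> ([0], "SAME")
# integer form: the same symmetric padding on the single spatial dimension
_update_padding_nd(2, channel_last=False, num_dims=1)        # -> ([2], "EXPLICIT")
# [pad_before, pad_after] form: kept as asymmetric explicit padding
_update_padding_nd([1, 2], channel_last=False, num_dims=1)   # -> ([1, 2], "EXPLICIT")
```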
def _get_default_param_initializer(num_channels, filter_size):
filter_elem_num = num_channels * np.prod(filter_size)
std = (2.0 / filter_elem_num)**0.5
return Normal(0.0, std, 0)
def conv1d(input,
weight,
bias=None,
padding=0,
stride=1,
dilation=1,
groups=1,
use_cudnn=True,
act=None,
data_format="NCT",
name=None):
# entry checks
if not isinstance(use_cudnn, bool):
raise ValueError("Attr(use_cudnn) should be True or False. "
"Received Attr(use_cudnn): {}.".format(use_cudnn))
if data_format not in ["NCT", "NTC"]:
raise ValueError("Attr(data_format) should be 'NCT' or 'NTC'. "
"Received Attr(data_format): {}.".format(data_format))
channel_last = (data_format == "NTC")
channel_dim = -1 if channel_last else 1
num_channels = input.shape[channel_dim]
num_filters = weight.shape[0]
if num_channels < 0:
raise ValueError("The channel dimmention of the input({}) "
"should be defined. Received: {}.".format(
input.shape, num_channels))
if num_channels % groups != 0:
raise ValueError(
"the channel of input must be divisible by groups,"
"received: the channel of input is {}, the shape of input is {}"
", the groups is {}".format(num_channels, input.shape, groups))
if num_filters % groups != 0:
raise ValueError(
"the number of filters must be divisible by groups,"
"received: the number of filters is {}, the shape of weight is {}"
", the groups is {}".format(num_filters, weight.shape, groups))
# update attrs
padding, padding_algorithm = _update_padding_nd(padding, channel_last, 1)
if len(padding) == 1: # symmetric padding
padding = [0,] + padding
else:
# len(padding) == 2
padding = [0, 0] + padding
stride = [1,] + utils.convert_to_list(stride, 1, 'stride')
dilation = [1,] + utils.convert_to_list(dilation, 1, 'dilation')
data_format = "NHWC" if channel_last else "NCHW"
l_type = "conv2d"
if (num_channels == groups and num_filters % num_channels == 0 and
not use_cudnn):
l_type = 'depthwise_conv2d'
weight = F.unsqueeze(weight, [2])
input = F.unsqueeze(input, [1]) if channel_last else F.unsqueeze(input, [2])
if in_dygraph_mode():
attrs = ('strides', stride, 'paddings', padding, 'dilations', dilation,
'groups', groups, 'use_cudnn', use_cudnn, 'use_mkldnn', False,
'fuse_relu_before_depthwise_conv', False, "padding_algorithm",
padding_algorithm, "data_format", data_format)
pre_bias = getattr(core.ops, l_type)(input, weight, *attrs)
if bias is not None:
pre_act = nn.elementwise_add(pre_bias, bias, axis=channel_dim)
else:
pre_act = pre_bias
out = dygraph_utils._append_activation_in_dygraph(
pre_act, act, use_cudnn=use_cudnn)
else:
inputs = {'Input': [input], 'Filter': [weight]}
attrs = {
'strides': stride,
'paddings': padding,
'dilations': dilation,
'groups': groups,
'use_cudnn': use_cudnn,
'use_mkldnn': False,
'fuse_relu_before_depthwise_conv': False,
"padding_algorithm": padding_algorithm,
"data_format": data_format
}
check_variable_and_dtype(input, 'input',
['float16', 'float32', 'float64'], 'conv2d')
helper = LayerHelper(l_type, **locals())
dtype = helper.input_dtype()
pre_bias = helper.create_variable_for_type_inference(dtype)
outputs = {"Output": [pre_bias]}
helper.append_op(
type=l_type, inputs=inputs, outputs=outputs, attrs=attrs)
if bias is not None:
pre_act = nn.elementwise_add(pre_bias, bias, axis=channel_dim)
else:
pre_act = pre_bias
out = helper.append_activation(pre_act)
out = F.squeeze(out, [1]) if channel_last else F.squeeze(out, [2])
return out
class Conv1D(layers.Layer):
def __init__(self,
num_channels,
num_filters,
filter_size,
padding=0,
stride=1,
dilation=1,
groups=1,
param_attr=None,
bias_attr=None,
use_cudnn=True,
act=None,
data_format="NCT",
dtype='float32'):
super(Conv1D, self).__init__()
assert param_attr is not False, "param_attr should not be False here."
self._num_channels = num_channels
self._num_filters = num_filters
self._groups = groups
if num_channels % groups != 0:
raise ValueError("num_channels must be divisible by groups.")
self._act = act
self._data_format = data_format
self._dtype = dtype
if not isinstance(use_cudnn, bool):
raise ValueError("use_cudnn should be True or False")
self._use_cudnn = use_cudnn
self._filter_size = utils.convert_to_list(filter_size, 1, 'filter_size')
self._stride = utils.convert_to_list(stride, 1, 'stride')
self._dilation = utils.convert_to_list(dilation, 1, 'dilation')
channel_last = (data_format == "NTC")
self._padding = padding # leave it to F.conv1d
self._param_attr = param_attr
self._bias_attr = bias_attr
num_filter_channels = num_channels // groups
filter_shape = [self._num_filters, num_filter_channels
] + self._filter_size
self.weight = self.create_parameter(
attr=self._param_attr,
shape=filter_shape,
dtype=self._dtype,
default_initializer=_get_default_param_initializer(
self._num_channels, filter_shape))
self.bias = self.create_parameter(
attr=self._bias_attr,
shape=[self._num_filters],
dtype=self._dtype,
is_bias=True)
def forward(self, input):
out = conv1d(
input,
self.weight,
bias=self.bias,
padding=self._padding,
stride=self._stride,
dilation=self._dilation,
groups=self._groups,
use_cudnn=self._use_cudnn,
act=self._act,
data_format=self._data_format)
return out
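A minimal usage sketch of the `Conv1D` wrapper on a dummy `NCT` input; the channel and length values are arbitrary, and `use_cudnn=False` is set only so the sketch also runs on CPU.

```python
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg

with dg.guard(fluid.CPUPlace()):
    conv = Conv1D(num_channels=80, num_filters=128, filter_size=3,
                  padding=1, use_cudnn=False)
    x = dg.to_variable(np.random.randn(4, 80, 100).astype("float32"))  # (N, C, T)
    y = conv(x)  # (4, 128, 100): symmetric padding 1 keeps the length with filter_size 3
```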

View File

@ -1,152 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
from parakeet.modules.weight_norm import Conv1D, Conv1DCell, Conv2D, Linear
class Conv1DGLU(dg.Layer):
"""
A Convolution 1D block with GLU activation. It applies dropout to the input x, integrates speaker embeddings through a Linear layer activated by softsign, adds a residual connection from the input x, and scales the output by np.sqrt(0.5).
"""
def __init__(self,
n_speakers,
speaker_dim,
in_channels,
num_filters,
filter_size=1,
dilation=1,
std_mul=4.0,
dropout=0.0,
causal=False,
residual=True):
"""[summary]
Args:
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding's size.
in_channels (int): channels of the input.
num_filters (int): channels of the output.
filter_size (int, optional): filter size of the internal Conv1DCell. Defaults to 1.
dilation (int, optional): dilation of the internal Conv1DCell. Defaults to 1.
std_mul (float, optional): multiplier used in computing the standard deviation for weight initialization of the internal convolution. Defaults to 4.0.
dropout (float, optional): dropout probability. Defaults to 0.0.
causal (bool, optional): padding of the Conv1DCell. It should be True if the `add_input` method of `Conv1DCell` is ever used. Defaults to False.
residual (bool, optional): whether to use a residual connection. If True, in_channels should equal num_filters. Defaults to True.
"""
super(Conv1DGLU, self).__init__()
# conv spec
self.in_channels = in_channels
self.n_speakers = n_speakers
self.speaker_dim = speaker_dim
self.num_filters = num_filters
self.filter_size = filter_size
self.dilation = dilation
# padding
self.causal = causal
# weight init and dropout
self.std_mul = std_mul
self.dropout = dropout
self.residual = residual
if residual:
assert (
in_channels == num_filters
), "this block uses residual connection"\
"the input_channes should equals num_filters"
std = np.sqrt(std_mul * (1 - dropout) / (filter_size * in_channels))
self.conv = Conv1DCell(
in_channels,
2 * num_filters,
filter_size,
dilation,
causal,
param_attr=I.Normal(scale=std))
if n_speakers > 1:
assert (speaker_dim is not None
), "speaker embed should not be null in multi-speaker case"
std = np.sqrt(1 / speaker_dim)
self.fc = Linear(
speaker_dim, num_filters, param_attr=I.Normal(scale=std))
def forward(self, x, speaker_embed=None):
"""
Args:
x (Variable): shape(B, C_in, T), dtype float32, the input of Conv1DGLU layer, where B means batch_size, C_in means the input channels, and T means input time steps.
speaker_embed (Variable): shape(B, C_sp), dtype float32, speaker embed, where C_sp means speaker embedding size.
Returns:
x (Variable): shape(B, C_out, T), the output of Conv1DGLU, where
C_out means the `num_filters`.
"""
residual = x
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
x = self.conv(x)
content, gate = F.split(x, num_or_sections=2, dim=1)
if speaker_embed is not None:
sp = F.softsign(self.fc(speaker_embed))
content = F.elementwise_add(content, sp, axis=0)
# glu
x = F.sigmoid(gate) * content
if self.residual:
x = F.scale(x + residual, np.sqrt(0.5))
return x
def start_sequence(self):
"""Prepare the Conv1DGLU to generate a new sequence. This method should be called before starting calling `add_input` multiple times.
"""
self.conv.start_sequence()
def add_input(self, x_t, speaker_embed=None):
"""
Takes a step of inputs and returns a step of outputs. It works similarly to the `forward` method, but in a `step-in-step-out` fashion.
Args:
x_t (Variable): shape(B, C_in, T=1), dtype float32, the input of Conv1DGLU layer, where B means batch_size, C_in means the input channels.
speaker_embed (Variable): Shape(B, C_sp), dtype float32, speaker embed, where C_sp means speaker embedding size.
Returns:
x (Variable): shape(B, C_out), the output of Conv1DGLU, where C_out means the `num_filter`.
"""
residual = x_t
x_t = F.dropout(
x_t, self.dropout, dropout_implementation="upscale_in_train")
x_t = self.conv.add_input(x_t)
content_t, gate_t = F.split(x_t, num_or_sections=2, dim=1)
if speaker_embed is not None:
sp = F.softsign(self.fc(speaker_embed))
content_t = F.elementwise_add(content_t, sp, axis=0)
# glu
x_t = F.sigmoid(gate_t) * content_t
if self.residual:
x_t = F.scale(x_t + residual, np.sqrt(0.5))
return x_t
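A minimal single-speaker sketch of `Conv1DGLU` with dummy shapes; `speaker_dim` is unused when `n_speakers == 1`, and the channel/length values are arbitrary.

```python
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg

with dg.guard(fluid.CPUPlace()):
    block = Conv1DGLU(n_speakers=1, speaker_dim=None, in_channels=128,
                      num_filters=128, filter_size=3, dilation=1)
    x = dg.to_variable(np.random.randn(2, 128, 50).astype("float32"))  # (B, C_in, T)
    y = block(x)  # (2, 128, 50): GLU output plus the residual, scaled by sqrt(0.5)
```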

View File

@ -1,285 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from itertools import chain
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from parakeet.modules.weight_norm import Conv1D, Conv1DTranspose, Conv2D, Conv2DTranspose, Linear
from parakeet.models.deepvoice3.conv1dglu import Conv1DGLU
from parakeet.models.deepvoice3.encoder import ConvSpec
def upsampling_4x_blocks(n_speakers, speaker_dim, target_channels, dropout):
"""Return a list of Layers that upsamples the input by 4 times in time dimension.
Args:
n_speakers (int): number of speakers of the Conv1DGLU layers used.
speaker_dim (int): speaker embedding size of the Conv1DGLU layers used.
target_channels (int): channels of the input and the output.(the list of layers does not change the number of channels.)
dropout (float): dropout probability.
Returns:
List[Layer]: upsampling layers.
"""
# upsampling convolutions
upsampling_convolutions = [
Conv1DTranspose(
target_channels,
target_channels,
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(1 / (2 * target_channels)))),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout),
Conv1DTranspose(
target_channels,
target_channels,
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(4. / (2 * target_channels)))),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout),
]
return upsampling_convolutions
def upsampling_2x_blocks(n_speakers, speaker_dim, target_channels, dropout):
"""Return a list of Layers that upsamples the input by 2 times in time dimension.
Args:
n_speakers (int): number of speakers of the Conv1DGLU layers used.
speaker_dim (int): speaker embedding size of the Conv1DGLU layers used.
target_channels (int): channels of the input and the output.(the list of layers does not change the number of channels.)
dropout (float): dropout probability.
Returns:
List[Layer]: upsampling layers.
"""
upsampling_convolutions = [
Conv1DTranspose(
target_channels,
target_channels,
2,
stride=2,
param_attr=I.Normal(scale=np.sqrt(1. / (2 * target_channels)))),
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=1,
std_mul=1.,
dropout=dropout), Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout)
]
return upsampling_convolutions
def upsampling_1x_blocks(n_speakers, speaker_dim, target_channels, dropout):
"""Return a list of Layers that upsamples the input by 1 times in time dimension.
Args:
n_speakers (int): number of speakers of the Conv1DGLU layers used.
speaker_dim (int): speaker embedding size of the Conv1DGLU layers used.
target_channels (int): channels of the input and the output.(the list of layers does not change the number of channels.)
dropout (float): dropout probability.
Returns:
List[Layer]: upsampling layers.
"""
upsampling_convolutions = [
Conv1DGLU(
n_speakers,
speaker_dim,
target_channels,
target_channels,
3,
dilation=3,
std_mul=4.,
dropout=dropout)
]
return upsampling_convolutions
class Converter(dg.Layer):
def __init__(self,
n_speakers,
speaker_dim,
in_channels,
linear_dim,
convolutions=(ConvSpec(256, 5, 1), ) * 4,
time_upsampling=1,
dropout=0.0):
"""Vocoder that transforms mel spectrogram (or ecoder hidden states) to waveform.
Args:
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding size.
in_channels (int): channels of the input.
linear_dim (int): channels of the linear spectrogram.
convolutions (Iterable[ConvSpec], optional): specifications of the internal convolutional layers. ConvSpec is a namedtuple of (output_channels, filter_size, dilation) Defaults to (ConvSpec(256, 5, 1), )*4.
time_upsampling (int, optional): time upsampling factor of the converter, possible options are {1, 2, 4}. Note that this should equal the downsample factor of the mel spectrogram. Defaults to 1.
dropout (float, optional): dropout probability. Defaults to 0.0.
"""
super(Converter, self).__init__()
self.n_speakers = n_speakers
self.speaker_dim = speaker_dim
self.in_channels = in_channels
self.linear_dim = linear_dim
# CAUTION: this should equal the downsample factor of the mel spectrogram
self.time_upsampling = time_upsampling
self.dropout = dropout
target_channels = convolutions[0].out_channels
# conv proj to target channels
self.first_conv_proj = Conv1D(
in_channels,
target_channels,
1,
param_attr=I.Normal(scale=np.sqrt(1 / in_channels)))
# Idea from nyanko
if time_upsampling == 4:
self.upsampling_convolutions = dg.LayerList(
upsampling_4x_blocks(n_speakers, speaker_dim, target_channels,
dropout))
elif time_upsampling == 2:
self.upsampling_convolutions = dg.LayerList(
upsampling_2x_blocks(n_speakers, speaker_dim, target_channels,
dropout))
elif time_upsampling == 1:
self.upsampling_convolutions = dg.LayerList(
upsampling_1x_blocks(n_speakers, speaker_dim, target_channels,
dropout))
else:
raise ValueError(
"Upsampling factors other than {1, 2, 4} are Not supported.")
# post conv layers
std_mul = 4.0
in_channels = target_channels
self.convolutions = dg.LayerList()
for (out_channels, filter_size, dilation) in convolutions:
if in_channels != out_channels:
std = np.sqrt(std_mul / in_channels)
# CAUTION: relu
self.convolutions.append(
Conv1D(
in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
in_channels = out_channels
std_mul = 2.0
self.convolutions.append(
Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation=dilation,
std_mul=std_mul,
dropout=dropout))
in_channels = out_channels
std_mul = 4.0
# final conv proj, channel transformed to linear dim
std = np.sqrt(std_mul * (1 - dropout) / in_channels)
# CAUTION: sigmoid
self.last_conv_proj = Conv1D(
in_channels,
linear_dim,
1,
act="sigmoid",
param_attr=I.Normal(scale=std))
def forward(self, x, speaker_embed=None):
"""
Convert mel spectrogram or decoder hidden states to linear spectrogram.
Args:
x (Variable): Shape(B, T_mel, C_in), dtype float32, converter inputs, where C_in means the input channel for the converter. Note that it can be either C_mel (channel of mel spectrogram) or C_dec // r.
When the mel spectrogram is used as the converter input, C_in = C_mel; when decoder states are used as the input, C_in = C_dec // r.
speaker_embed (Variable, optional): shape(B, C_sp), dtype float32, speaker embedding, where C_sp means the speaker embedding size.
Returns:
out (Variable): Shape(B, T_lin, C_lin), the output linear spectrogram, where C_lin means the channels of the linear spectrogram and T_lin means its length (time steps). T_lin = time_upsampling * T_mel, which depends on the time_upsampling of the converter.
"""
x = F.transpose(x, [0, 2, 1])
x = self.first_conv_proj(x)
if speaker_embed is not None:
speaker_embed = F.dropout(
speaker_embed,
self.dropout,
dropout_implementation="upscale_in_train")
for layer in chain(self.upsampling_convolutions, self.convolutions):
if isinstance(layer, Conv1DGLU):
x = layer(x, speaker_embed)
else:
x = layer(x)
out = self.last_conv_proj(x)
out = F.transpose(out, [0, 2, 1])
return out
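A single-speaker shape sketch of the `Converter`; the channel sizes are illustrative, and `time_upsampling=4` assumes the mel input was downsampled 4x in time during preprocessing.

```python
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg

with dg.guard(fluid.CPUPlace()):
    converter = Converter(n_speakers=1, speaker_dim=None, in_channels=80,
                          linear_dim=513, time_upsampling=4)
    mel = dg.to_variable(np.random.randn(1, 50, 80).astype("float32"))  # (B, T_mel, C_in)
    lin = converter(mel)  # (1, 200, 513): the time axis is upsampled 4x
```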

View File

@ -1,526 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from parakeet.modules.weight_norm import Conv1D, Linear
from parakeet.models.deepvoice3.conv1dglu import Conv1DGLU
from parakeet.models.deepvoice3.encoder import ConvSpec
from parakeet.models.deepvoice3.attention import Attention, WindowRange
from parakeet.models.deepvoice3.position_embedding import PositionEmbedding
def gen_mask(valid_lengths, max_len, dtype="float32"):
"""
Generate a mask tensor from valid lengths. Note that it returns a *reverse*
mask. Indices within valid lengths correspond to 0, and those within
padding area correspond to 1.
Assume that valid_lengths = [2,5,7], and max_len = 7, the generated mask is
[[0, 0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1],
[0, 0, 0, 0, 0, 0, 0]].
Args:
valid_lengths (Variable): shape(B, ), dtype: int64. A rank-1 Tensor containing the valid lengths (timesteps) of each example, where B means batch_size.
max_len (int): The length (number of time steps) of the mask.
dtype (str, optional): A string that specifies the data type of the returned mask. Defaults to 'float32'.
Returns:
mask (Variable): shape(B, max_len), dtype float32, a mask computed from valid lengths.
"""
mask = F.sequence_mask(valid_lengths, maxlen=max_len, dtype=dtype)
mask = 1 - mask
return mask
def fold_adjacent_frames(frames, r):
"""fold multiple adjacent frames.
Args:
frames (Variable): shape(B, T, C), the spectrogram.
r (int): frames per step.
Returns:
Variable: shape(B, T // r, r * C), folded frames.
"""
if r == 1:
return frames
batch_size, time_steps, channels = frames.shape
if time_steps % r != 0:
print(
"time_steps cannot be divided by r, you would lose {} tailing frames"
.format(time_steps % r))
frames = frames[:, :time_steps - time_steps % r, :]
frames = F.reshape(frames, (batch_size, -1, channels * r))
return frames
def unfold_adjacent_frames(folded_frames, r):
"""unfold the folded frames.
Args:
folded_frames (Variable): shape(B, T, C), the folded spectrogram.
r (int): frames per step.
Returns:
Variable: shape(B, T * r, C // r), unfolded frames.
"""
if r == 1:
return folded_frames
batch_size, time_steps, channels = folded_frames.shape
folded_frames = F.reshape(folded_frames, (batch_size, -1, channels // r))
return folded_frames
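A quick shape check for the two helpers above, with dummy values; `r` is the number of frames generated per decoder step.

```python
import numpy as np
from paddle import fluid
import paddle.fluid.dygraph as dg

with dg.guard(fluid.CPUPlace()):
    frames = dg.to_variable(np.zeros((2, 12, 80), dtype="float32"))  # (B, T, C)
    folded = fold_adjacent_frames(frames, r=4)      # (2, 3, 320): 4 frames packed per step
    restored = unfold_adjacent_frames(folded, r=4)  # (2, 12, 80)
```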
class Decoder(dg.Layer):
def __init__(self,
n_speakers,
speaker_dim,
embed_dim,
mel_dim,
r=1,
max_positions=512,
preattention=(ConvSpec(128, 5, 1), ) * 4,
convolutions=(ConvSpec(128, 5, 1), ) * 4,
attention=True,
dropout=0.0,
use_memory_mask=False,
force_monotonic_attention=False,
query_position_rate=1.0,
key_position_rate=1.0,
window_range=WindowRange(-1, 3),
key_projection=True,
value_projection=True):
"""Decoder of the Deep Voice 3 model.
Args:
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding size.
embed_dim (int): text embedding size.
mel_dim (int): channel of mel input.(mel bands)
r (int, optional): number of frames generated per decoder step. Defaults to 1.
max_positions (int, optional): max position for text and decoder steps. Defaults to 512.
convolutions (Iterable[ConvSpec], optional): specification of causal convolutional layers inside the decoder. ConvSpec is a namedtuple of output_channels, filter_size and dilation. Defaults to (ConvSpec(128, 5, 1), )*4.
attention (bool or List[bool], optional): whether to use attention, it should have the same length with `convolutions` if it is a list of bool, indicating whether to have an Attention layer coupled with the corresponding convolutional layer. If it is a bool, it is repeated len(convolutions) times internally. Defaults to True.
dropout (float, optional): dropout probability. Defaults to 0.0.
use_memory_mask (bool, optional): whether to use memory mask at the Attention layer. It should have the same length with `attention` if it is a list of bool, indicating whether to use memory mask at the corresponding Attention layer. If it is a bool, it is repeated len(attention) times internally. Defaults to False.
force_monotonic_attention (bool, optional): whether to use monotonic_attention at the Attention layer when inferencing. It should have the same length with `attention` if it is a list of bool, indicating whether to use monotonic_attention at the corresponding Attention layer. If it is a bool, it is repeated len(attention) times internally. Defaults to False.
query_position_rate (float, optional): position_rate of the PositionEmbedding for query. Defaults to 1.0.
key_position_rate (float, optional): position_rate of the PositionEmbedding for key. Defaults to 1.0.
window_range (WindowRange, optional): window range of monotonic attention. Defaults to WindowRange(-1, 3).
key_projection (bool, optional): `key_projection` of Attention layers. Defaults to True.
value_projection (bool, optional): `value_projection` of Attention layers Defaults to True.
"""
super(Decoder, self).__init__()
self.dropout = dropout
self.mel_dim = mel_dim
self.r = r
self.query_position_rate = query_position_rate
self.key_position_rate = key_position_rate
self.window_range = window_range
self.n_speakers = n_speakers
conv_channels = convolutions[0].out_channels
# only when padding idx is 0 can we easily handle it
self.embed_keys_positions = PositionEmbedding(max_positions, embed_dim)
self.embed_query_positions = PositionEmbedding(max_positions,
conv_channels)
if n_speakers > 1:
std = np.sqrt((1 - dropout) / speaker_dim)
self.speaker_proj1 = Linear(
speaker_dim, 1, act="sigmoid", param_attr=I.Normal(scale=std))
self.speaker_proj2 = Linear(
speaker_dim, 1, act="sigmoid", param_attr=I.Normal(scale=std))
# prenet
self.prenet = dg.LayerList()
in_channels = mel_dim * r # multiframe
std_mul = 1.0
for (out_channels, filter_size, dilation) in preattention:
if in_channels != out_channels:
# conv1d & relu
std = np.sqrt(std_mul / in_channels)
self.prenet.append(
Conv1D(
in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
in_channels = out_channels
std_mul = 2.0
self.prenet.append(
Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=True,
residual=True))
in_channels = out_channels
std_mul = 4.0
# attention
self.use_memory_mask = use_memory_mask
if isinstance(attention, bool):
self.attention = [attention] * len(convolutions)
else:
self.attention = attention
if isinstance(force_monotonic_attention, bool):
self.force_monotonic_attention = [force_monotonic_attention
] * len(convolutions)
else:
self.force_monotonic_attention = force_monotonic_attention
for x, y in zip(self.force_monotonic_attention, self.attention):
if x is True and y is False:
raise ValueError("When not using attention, there is no "
"monotonic attention at all")
# causal convolution & attention
self.conv_attn = []
for use_attention, (out_channels, filter_size,
dilation) in zip(self.attention, convolutions):
assert (
in_channels == out_channels
), "the stack of convolution & attention does not change channels"
conv_layer = Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=True,
residual=False)
attn_layer = Attention(
out_channels,
embed_dim,
dropout,
window_range,
key_projection=key_projection,
value_projection=value_projection) if use_attention else None
in_channels = out_channels
std_mul = 4.0
self.conv_attn.append((conv_layer, attn_layer))
for i, (conv_layer, attn_layer) in enumerate(self.conv_attn):
self.add_sublayer("conv_{}".format(i), conv_layer)
if attn_layer is not None:
self.add_sublayer("attn_{}".format(i), attn_layer)
# 1 * 1 conv to transform channels
std = np.sqrt(std_mul * (1 - dropout) / in_channels)
self.last_conv = Conv1D(
in_channels, mel_dim * r, 1, param_attr=I.Normal(scale=std))
# mel (before sigmoid) to done hat
std = np.sqrt(1 / in_channels)
self.fc = Conv1D(mel_dim * r, 1, 1, param_attr=I.Normal(scale=std))
# decoding configs
self.max_decoder_steps = 200
self.min_decoder_steps = 10
assert convolutions[-1].out_channels % r == 0, \
"decoder_state dim must be divided by r"
self.state_dim = convolutions[-1].out_channels // self.r
def forward(self,
encoder_out,
lengths,
frames,
text_positions,
frame_positions,
speaker_embed=None):
"""
Compute decoder outputs with ground truth mel spectrogram.
Args:
encoder_out (keys, values):
keys (Variable): shape(B, T_enc, C_emb), dtype float32, the key representation from an encoder, where C_emb means text embedding size.
values (Variable): shape(B, T_enc, C_emb), dtype float32, the value representation from an encoder, where C_emb means text embedding size.
lengths (Variable): shape(batch_size,), dtype: int64, valid lengths of text inputs for each example.
frames (Variable): shape(B, T_mel, C_mel), ground truth mel spectrogram, used as the decoder input during training.
text_positions (Variable): shape(B, T_enc), dtype: int64. Positions indices for text inputs for the encoder, where T_enc means the encoder timesteps.
frame_positions (Variable): shape(B, T_mel // r), dtype: int64. Positions indices for each decoder time steps.
speaker_embed (Variable, optional): shape(batch_size, speaker_dim), speaker embedding, only used for the multispeaker model.
Returns:
outputs (Variable): shape(B, T_mel, C_mel), dtype float32, decoder outputs, where C_mel means the channels of mel-spectrogram, T_mel means the length(time steps) of mel spectrogram.
alignments (Variable): shape(N, B, T_mel // r, T_enc), dtype float32, the alignment tensor between the decoder and the encoder, where N means number of Attention Layers, T_mel means the length of mel spectrogram, r means the outputs per decoder step, T_enc means the encoder time steps.
done (Variable): shape(B, T_mel // r), dtype float32, probability that the last frame has been generated.
decoder_states (Variable): shape(B, T_mel, C_dec // r), dtype float32, decoder hidden states, where C_dec means the channels of decoder states (the output channels of the last `convolutions`). Note that C_dec should be perfectly divisible by `r`.
"""
if speaker_embed is not None:
speaker_embed = F.dropout(
speaker_embed,
self.dropout,
dropout_implementation="upscale_in_train")
keys, values = encoder_out
enc_time_steps = keys.shape[1]
if self.use_memory_mask and lengths is not None:
mask = gen_mask(lengths, enc_time_steps)
else:
mask = None
if text_positions is not None:
w = self.key_position_rate
if self.n_speakers > 1:
w = w * F.squeeze(self.speaker_proj1(speaker_embed), [-1])
text_pos_embed = self.embed_keys_positions(text_positions, w)
keys += text_pos_embed # (B, T, C)
if frame_positions is not None:
w = self.query_position_rate
if self.n_speakers > 1:
w = w * F.squeeze(self.speaker_proj2(speaker_embed), [-1])
frame_pos_embed = self.embed_query_positions(frame_positions, w)
else:
frame_pos_embed = None
# pack multiple frames if necessary
frames = fold_adjacent_frames(frames, self.r) # assume (B, T, C) input
# (B, C, T)
frames = F.transpose(frames, [0, 2, 1])
x = frames
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
# Prenet
for layer in self.prenet:
if isinstance(layer, Conv1DGLU):
x = layer(x, speaker_embed)
else:
x = layer(x)
# Convolution & Multi-hop Attention
alignments = []
for (conv, attn) in self.conv_attn:
residual = x
x = conv(x, speaker_embed)
if attn is not None:
x = F.transpose(x, [0, 2, 1]) # (B, T, C)
if frame_pos_embed is not None:
x = x + frame_pos_embed
x, attn_scores = attn(x, (keys, values), mask)
alignments.append(attn_scores)
x = F.transpose(x, [0, 2, 1]) #(B, C, T)
x = F.scale(residual + x, np.sqrt(0.5))
alignments = F.stack(alignments)
decoder_states = x
x = self.last_conv(x)
outputs = F.sigmoid(x)
done = F.sigmoid(self.fc(x))
outputs = F.transpose(outputs, [0, 2, 1])
decoder_states = F.transpose(decoder_states, [0, 2, 1])
done = F.squeeze(done, [1])
outputs = unfold_adjacent_frames(outputs, self.r)
decoder_states = unfold_adjacent_frames(decoder_states, self.r)
return outputs, alignments, done, decoder_states
@property
def receptive_field(self):
"""Whole receptive field of the causally convolutional decoder."""
r = 1
for conv in self.prenet:
r += conv.dilation[1] * (conv.filter_size[1] - 1)
for (conv, _) in self.conv_attn:
r += conv.dilation[1] * (conv.filter_size[1] - 1)
return r
def start_sequence(self):
"""Prepare the Decoder to decode. This method is called by `decode`.
"""
for layer in self.prenet:
if isinstance(layer, Conv1DGLU):
layer.start_sequence()
for conv, _ in self.conv_attn:
if isinstance(conv, Conv1DGLU):
conv.start_sequence()
def decode(self,
encoder_out,
text_positions,
speaker_embed=None,
test_inputs=None):
"""Decode from the encoder's output and other conditions.
Args:
encoder_out (keys, values):
keys (Variable): shape(B, T_enc, C_emb), dtype float32, the key representation from an encoder, where C_emb means text embedding size.
values (Variable): shape(B, T_enc, C_emb), dtype float32, the value representation from an encoder, where C_emb means text embedding size.
text_positions (Variable): shape(B, T_enc), dtype: int64. Positions indices for text inputs for the encoder, where T_enc means the encoder timesteps.
speaker_embed (Variable, optional): shape(B, C_sp), speaker embedding, only used for multispeaker model.
test_inputs (Variable, optional): shape(B, T_test, C_mel). test input, it is only used for debugging. Defaults to None.
Returns:
outputs (Variable): shape(B, T_mel, C_mel), dtype float32, decoder outputs, where C_mel means the channels of mel-spectrogram, T_mel means the length(time steps) of mel spectrogram.
alignments (Variable): shape(N, B, T_mel // r, T_enc), dtype float32, the alignment tensor between the decoder and the encoder, where N means number of Attention Layers, T_mel means the length of mel spectrogram, r means the outputs per decoder step, T_enc means the encoder time steps.
done (Variable): shape(B, T_mel // r), dtype float32, probability that the last frame has been generated. If the probability is larger than 0.5 at a step, the generation stops.
decoder_states (Variable): shape(B, T_mel, C_dec // r), dtype float32, decoder hidden states, where C_dec means the channels of decoder states (the output channels of the last `convolutions`). Note that C_dec should be perfectly divisible by `r`.
Note:
Only single instance inference is supported now, so B = 1.
"""
self.start_sequence()
keys, values = encoder_out
batch_size = keys.shape[0]
assert batch_size == 1, "now only supports single instance inference"
mask = None # no mask because we use single instance decoding
# no dropout in inference
if speaker_embed is not None:
speaker_embed = F.dropout(
speaker_embed,
self.dropout,
dropout_implementation="upscale_in_train")
# since we use single example inference, there is no text_mask
if text_positions is not None:
w = self.key_position_rate
if self.n_speakers > 1:
# shape (B, )
w = w * F.squeeze(self.speaker_proj1(speaker_embed), [-1])
text_pos_embed = self.embed_keys_positions(text_positions, w)
keys += text_pos_embed # (B, T, C)
# start decoding
decoder_states = [] # (B, C, 1) tensors
mel_outputs = [] # (B, C, 1) tensors
alignments = [] # (B, 1, T_enc) tensors
dones = [] # (B, 1, 1) tensors
last_attended = [None] * len(self.conv_attn)
for idx, monotonic_attn in enumerate(self.force_monotonic_attention):
if monotonic_attn:
last_attended[idx] = 0
if test_inputs is not None:
# pack multiple frames if necessary # assume (B, T, C) input
test_inputs = fold_adjacent_frames(test_inputs, self.r)
test_inputs = F.transpose(test_inputs, [0, 2, 1])
initial_input = F.zeros(
(batch_size, self.mel_dim * self.r, 1), dtype=keys.dtype)
t = 0 # decoder time step
while True:
frame_pos = F.fill_constant(
(batch_size, 1), value=t + 1, dtype="int64")
w = self.query_position_rate
if self.n_speakers > 1:
w = w * F.squeeze(self.speaker_proj2(speaker_embed), [-1])
# (B, T=1, C)
frame_pos_embed = self.embed_query_positions(frame_pos, w)
if test_inputs is not None:
if t >= test_inputs.shape[-1]:
break
current_input = test_inputs[:, :, t:t + 1]
else:
if t > 0:
current_input = mel_outputs[-1] # auto-regressive
else:
current_input = initial_input
x_t = current_input
x_t = F.dropout(
x_t, self.dropout, dropout_implementation="upscale_in_train")
# Prenet
for layer in self.prenet:
if isinstance(layer, Conv1DGLU):
x_t = layer.add_input(x_t, speaker_embed)
else:
x_t = layer(x_t) # (B, C, T=1)
step_attn_scores = []
# causal convolutions + multi-hop attentions
for i, (conv, attn) in enumerate(self.conv_attn):
residual = x_t #(B, C, T=1)
x_t = conv.add_input(x_t, speaker_embed)
if attn is not None:
x_t = F.transpose(x_t, [0, 2, 1])
if frame_pos_embed is not None:
x_t += frame_pos_embed
x_t, attn_scores = attn(x_t, (keys, values), mask,
last_attended[i]
if test_inputs is None else None)
x_t = F.transpose(x_t, [0, 2, 1])
step_attn_scores.append(attn_scores) #(B, T_dec=1, T_enc)
# update last attended when necessary
if self.force_monotonic_attention[i]:
last_attended[i] = np.argmax(
attn_scores.numpy(), axis=-1)[0][0]
x_t = F.scale(residual + x_t, np.sqrt(0.5))
if len(step_attn_scores):
# (B, 1, T_enc) again
average_attn_scores = F.reduce_mean(
F.stack(step_attn_scores, 0), 0)
else:
average_attn_scores = None
decoder_state_t = x_t
x_t = self.last_conv(x_t)
mel_output_t = F.sigmoid(x_t)
done_t = F.sigmoid(self.fc(x_t))
decoder_states.append(decoder_state_t)
mel_outputs.append(mel_output_t)
if average_attn_scores is not None:
alignments.append(average_attn_scores)
dones.append(done_t)
t += 1
if test_inputs is None:
if F.reduce_min(done_t).numpy()[
0] > 0.5 and t > self.min_decoder_steps:
break
elif t > self.max_decoder_steps:
break
# concat results
mel_outputs = F.concat(mel_outputs, axis=-1)
decoder_states = F.concat(decoder_states, axis=-1)
dones = F.concat(dones, axis=-1)
alignments = F.concat(alignments, axis=1)
mel_outputs = F.transpose(mel_outputs, [0, 2, 1])
decoder_states = F.transpose(decoder_states, [0, 2, 1])
dones = F.squeeze(dones, [1])
mel_outputs = unfold_adjacent_frames(mel_outputs, self.r)
decoder_states = unfold_adjacent_frames(decoder_states, self.r)
return mel_outputs, alignments, dones, decoder_states

View File

@ -1,149 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from collections import namedtuple
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from parakeet.modules.weight_norm import Conv1D, Linear
from parakeet.models.deepvoice3.conv1dglu import Conv1DGLU
ConvSpec = namedtuple("ConvSpec", ["out_channels", "filter_size", "dilation"])
class Encoder(dg.Layer):
def __init__(self,
n_vocab,
embed_dim,
n_speakers,
speaker_dim,
padding_idx=None,
embedding_weight_std=0.1,
convolutions=(ConvSpec(64, 5, 1), ) * 7,
dropout=0.):
"""Encoder of Deep Voice 3.
Args:
n_vocab (int): vocabulary size of the text embedding.
embed_dim (int): embedding size of the text embedding.
n_speakers (int): number of speakers.
speaker_dim (int): speaker embedding size.
padding_idx (int, optional): padding index of text embedding. Defaults to None.
embedding_weight_std (float, optional): standard deviation of the embedding weights when initialized. Defaults to 0.1.
convolutions (Iterable[ConvSpec], optional): specifications of the convolutional layers. ConvSpec is a namedtuple of output channels, filter_size and dilation. Defaults to (ConvSpec(64, 5, 1), )*7.
dropout (float, optional): dropout probability. Defaults to 0..
"""
super(Encoder, self).__init__()
self.embedding_weight_std = embedding_weight_std
self.embed = dg.Embedding(
(n_vocab, embed_dim),
padding_idx=padding_idx,
param_attr=I.Normal(scale=embedding_weight_std))
self.dropout = dropout
if n_speakers > 1:
std = np.sqrt((1 - dropout) / speaker_dim)
self.sp_proj1 = Linear(
speaker_dim,
embed_dim,
act="softsign",
param_attr=I.Normal(scale=std))
self.sp_proj2 = Linear(
speaker_dim,
embed_dim,
act="softsign",
param_attr=I.Normal(scale=std))
self.n_speakers = n_speakers
self.convolutions = dg.LayerList()
in_channels = embed_dim
std_mul = 1.0
for (out_channels, filter_size, dilation) in convolutions:
# 1 * 1 convolution & relu
if in_channels != out_channels:
std = np.sqrt(std_mul / in_channels)
self.convolutions.append(
Conv1D(
in_channels,
out_channels,
1,
act="relu",
param_attr=I.Normal(scale=std)))
in_channels = out_channels
std_mul = 2.0
self.convolutions.append(
Conv1DGLU(
n_speakers,
speaker_dim,
in_channels,
out_channels,
filter_size,
dilation,
std_mul,
dropout,
causal=False,
residual=True))
in_channels = out_channels
std_mul = 4.0
std = np.sqrt(std_mul * (1 - dropout) / in_channels)
self.convolutions.append(
Conv1D(
in_channels, embed_dim, 1, param_attr=I.Normal(scale=std)))
def forward(self, x, speaker_embed=None):
"""
Encode text sequence.
Args:
x (Variable): shape(B, T_enc), dtype: int64. The input text indices. T_enc means the number of time steps of the encoder input x.
speaker_embed (Variable, optional): shape(B, C_sp), dtype float32, speaker embeddings. This arg is not None only when the model is a multispeaker model.
Returns:
keys (Variable), shape(B, T_enc, C_emb), dtype float32, the encoded representation for keys, where C_emb means the text embedding size.
values (Variable), shape(B, T_enc, C_emb), dtype float32, the encoded representation for values.
"""
x = self.embed(x)
x = F.dropout(
x, self.dropout, dropout_implementation="upscale_in_train")
x = F.transpose(x, [0, 2, 1])
if self.n_speakers > 1 and speaker_embed is not None:
speaker_embed = F.dropout(
speaker_embed,
self.dropout,
dropout_implementation="upscale_in_train")
x = F.elementwise_add(x, self.sp_proj1(speaker_embed), axis=0)
input_embed = x
for layer in self.convolutions:
if isinstance(layer, Conv1DGLU):
x = layer(x, speaker_embed)
else:
# layer is a Conv1D with (1,) filter wrapped by WeightNormWrapper
x = layer(x)
if self.n_speakers > 1 and speaker_embed is not None:
x = F.elementwise_add(x, self.sp_proj2(speaker_embed), axis=0)
keys = x # (B, C, T)
values = F.scale(input_embed + x, scale=np.sqrt(0.5))
keys = F.transpose(keys, [0, 2, 1])
values = F.transpose(values, [0, 2, 1])
return keys, values
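As a quick illustration of the encoder's output convention above (keys taken from the conv stack, values formed as a sqrt(0.5)-scaled residual sum with the input embedding), here is a small numpy sketch with made-up shapes:
```python
import numpy as np

B, T_enc, C_emb = 2, 5, 8
input_embed = np.random.randn(B, T_enc, C_emb).astype("float32")
conv_out = np.random.randn(B, T_enc, C_emb).astype("float32")  # stands in for the conv stack output

keys = conv_out                                    # used to score attention
values = np.sqrt(0.5) * (input_embed + conv_out)   # used to build context vectors
print(keys.shape, values.shape)                    # (2, 5, 8) (2, 5, 8)
```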

View File

@ -1,291 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from numba import jit
from paddle import fluid
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
def masked_mean(inputs, mask):
"""
Args:
inputs (Variable): shape(B, T, C), dtype float32, the input.
mask (Variable): shape(B, T), dtype float32, a mask.
Returns:
loss (Variable): shape(1, ), dtype float32, masked mean.
"""
channels = inputs.shape[-1]
masked_inputs = F.elementwise_mul(inputs, mask, axis=0)
loss = F.reduce_sum(masked_inputs) / (channels * F.reduce_sum(mask))
return loss
@jit(nopython=True)
def guided_attention(N, max_N, T, max_T, g):
"""Generate an diagonal attention guide.
Args:
N (int): valid length of encoder.
max_N (int): max length of encoder.
T (int): valid length of decoder.
max_T (int): max length of decoder.
g (float): sigma to adjust the degree of the diagonal guide.
Returns:
np.ndarray: shape(max_N, max_T), dtype float32, the diagonal guide.
"""
W = np.zeros((max_N, max_T), dtype=np.float32)
for n in range(N):
for t in range(T):
W[n, t] = 1 - np.exp(-(n / N - t / T)**2 / (2 * g * g))
return W
def guided_attentions(encoder_lengths, decoder_lengths, max_decoder_len,
g=0.2):
"""Generate a diagonal attention guide for a batch.
Args:
encoder_lengths (np.ndarray): shape(B, ), dtype: int64, encoder valid lengths.
decoder_lengths (np.ndarray): shape(B, ), dtype: int64, decoder valid lengths.
max_decoder_len (int): max length of decoder.
g (float, optional): sigma to adjust the degree of the diagonal guide. Defaults to 0.2.
Returns:
np.ndarray: shape(B, max_T, max_N), dtype float32, the diagonal guide. (max_N: max encoder length, max_T: max decoder length.)
"""
B = len(encoder_lengths)
max_input_len = encoder_lengths.max()
W = np.zeros((B, max_decoder_len, max_input_len), dtype=np.float32)
for b in range(B):
W[b] = guided_attention(encoder_lengths[b], max_input_len,
decoder_lengths[b], max_decoder_len, g).T
return W
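A quick numpy check of the valid N x T region of the guide defined by `guided_attention` above: entries where n/N is close to t/T stay near 0, while off-diagonal entries approach 1, so averaging `attention * guide` penalizes non-diagonal alignments.
```python
import numpy as np

def guide(N, T, g=0.2):
    n = np.arange(N)[:, None] / N
    t = np.arange(T)[None, :] / T
    return 1.0 - np.exp(-((n - t) ** 2) / (2 * g * g))

W = guide(4, 4)
print(np.round(np.diag(W), 3))     # diagonal entries are exactly 0
print(round(float(W[0, -1]), 3))   # far corner is close to 1
```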
class TTSLoss(object):
def __init__(self,
masked_weight=0.0,
priority_bin=None,
priority_weight=0.0,
binary_divergence_weight=0.0,
guided_attention_sigma=0.2,
downsample_factor=4,
r=1):
"""Compute loss for Deep Voice 3 model.
Args:
masked_weight (float, optional): the weight of masked loss. Defaults to 0.0.
priority_bin (int, optional): frequency bands (number of low-frequency bins) of the linear spectrogram loss to be prioritized. Defaults to None.
priority_weight (float, optional): weight for the prioritized frequency bands. Defaults to 0.0.
binary_divergence_weight (float, optional): weight for binary cross entropy (used for spectrogram loss). Defaults to 0.0.
guided_attention_sigma (float, optional): `sigma` for attention guide. Defaults to 0.2.
downsample_factor (int, optional): the downsample factor for mel spectrogram. Defaults to 4.
r (int, optional): frames per decoder step. Defaults to 1.
"""
self.masked_weight = masked_weight
self.priority_bin = priority_bin # only used for lin-spec loss
self.priority_weight = priority_weight # only used for lin-spec loss
self.binary_divergence_weight = binary_divergence_weight
self.guided_attention_sigma = guided_attention_sigma
self.time_shift = r
self.r = r
self.downsample_factor = downsample_factor
def l1_loss(self, prediction, target, mask, priority_bin=None):
"""L1 loss for spectrogram.
Args:
prediction (Variable): shape(B, T, C), dtype float32, predicted spectrogram.
target (Variable): shape(B, T, C), dtype float32, target spectrogram.
mask (Variable): shape(B, T), mask.
priority_bin (int, optional): frequency bands for linear spectrogram loss to be prioritized. Defaults to None.
Returns:
Variable: shape(1,), dtype float32, L1 loss (with mask and possibly priority bin applied).
"""
abs_diff = F.abs(prediction - target)
# basic mask-weighted l1 loss
w = self.masked_weight
if w > 0 and mask is not None:
base_l1_loss = w * masked_mean(abs_diff, mask) \
+ (1 - w) * F.reduce_mean(abs_diff)
else:
base_l1_loss = F.reduce_mean(abs_diff)
if self.priority_weight > 0 and priority_bin is not None:
# mask-weighted priority channels' l1-loss
priority_abs_diff = abs_diff[:, :, :priority_bin]
if w > 0 and mask is not None:
priority_loss = w * masked_mean(priority_abs_diff, mask) \
+ (1 - w) * F.reduce_mean(priority_abs_diff)
else:
priority_loss = F.reduce_mean(priority_abs_diff)
# priority weighted sum
p = self.priority_weight
loss = p * priority_loss + (1 - p) * base_l1_loss
else:
loss = base_l1_loss
return loss
def binary_divergence(self, prediction, target, mask):
"""Binary cross entropy loss for spectrogram. All the values in the spectrogram are treated as logits in a logistic regression.
Args:
prediction (Variable): shape(B, T, C), dtype float32, predicted spectrogram.
target (Variable): shape(B, T, C), dtype float32, target spectrogram.
mask (Variable): shape(B, T), mask.
Returns:
Variable: shape(1,), dtype float32, binary cross entropy loss.
"""
flattened_prediction = F.reshape(prediction, [-1, 1])
flattened_target = F.reshape(target, [-1, 1])
flattened_loss = F.log_loss(
flattened_prediction, flattened_target, epsilon=1e-8)
bin_div = fluid.layers.reshape(flattened_loss, prediction.shape)
w = self.masked_weight
if w > 0 and mask is not None:
loss = w * masked_mean(bin_div, mask) \
+ (1 - w) * F.reduce_mean(bin_div)
else:
loss = F.reduce_mean(bin_div)
return loss
@staticmethod
def done_loss(done_hat, done):
"""Compute done loss
Args:
done_hat (Variable): shape(B, T), dtype float32, predicted done probability (the probability that the final frame has been generated).
done (Variable): shape(B, T), dtype float32, ground truth done probability (the probability that the final frame has been generated).
Returns:
Variable: shape(1, ), dtype float32, done loss.
"""
flat_done_hat = F.reshape(done_hat, [-1, 1])
flat_done = F.reshape(done, [-1, 1])
loss = F.log_loss(flat_done_hat, flat_done, epsilon=1e-8)
loss = F.reduce_mean(loss)
return loss
def attention_loss(self, predicted_attention, input_lengths,
target_lengths):
"""
Given valid encoder_lengths and decoder_lengths, compute a diagonal guide, and compute loss from the predicted attention and the guide.
Args:
predicted_attention (Variable): shape(*, B, T_dec, T_enc), dtype float32, the alignment tensor, where B means batch size, T_dec means number of time steps of the decoder, T_enc means the number of time steps of the encoder, * means other possible dimensions.
input_lengths (numpy.ndarray): shape(B,), dtype:int64, valid lengths (time steps) of encoder outputs.
target_lengths (numpy.ndarray): shape(batch_size,), dtype:int64, valid lengths (time steps) of decoder outputs.
Returns:
loss (Variable): shape(1, ), dtype float32, attention loss.
"""
n_attention, batch_size, max_target_len, max_input_len = (
predicted_attention.shape)
soft_mask = guided_attentions(input_lengths, target_lengths,
max_target_len,
self.guided_attention_sigma)
soft_mask_ = dg.to_variable(soft_mask)
loss = fluid.layers.reduce_mean(predicted_attention * soft_mask_)
return loss
def __call__(self, outputs, inputs):
"""Total loss
Args:
outputs is a tuple of (mel_hyp, lin_hyp, attn_hyp, done_hyp).
mel_hyp (Variable): shape(B, T, C_mel), dtype float32, predicted mel spectrogram.
lin_hyp (Variable): shape(B, T, C_lin), dtype float32, predicted linear spectrogram.
done_hyp (Variable): shape(B, T), dtype float32, predicted done probability.
attn_hyp (Variable): shape(N, B, T_dec, T_enc), dtype float32, predicted attention.
inputs is a tuple of (mel_ref, lin_ref, done_ref, input_lengths, n_frames)
mel_ref (Variable): shape(B, T, C_mel), dtype float32, ground truth mel spectrogram.
lin_ref (Variable): shape(B, T, C_lin), dtype float32, ground truth linear spectrogram.
done_ref (Variable): shape(B, T), dtype float32, ground truth done flag.
input_lengths (Variable): shape(B, ), dtype: int, encoder valid lengths.
n_frames (Variable): shape(B, ), dtype: int, decoder valid lengths.
Returns:
Dict(str, Variable): details of loss.
"""
total_loss = 0.
mel_hyp, lin_hyp, attn_hyp, done_hyp = outputs
mel_ref, lin_ref, done_ref, input_lengths, n_frames = inputs
# n_frames # mel_lengths # decoder_lengths
max_frames = lin_hyp.shape[1]
max_mel_steps = max_frames // self.downsample_factor
# max_decoder_steps = max_mel_steps // self.r
# decoder_mask = F.sequence_mask(n_frames // self.downsample_factor //
# self.r,
# max_decoder_steps,
# dtype="float32")
mel_mask = F.sequence_mask(
n_frames // self.downsample_factor, max_mel_steps, dtype="float32")
lin_mask = F.sequence_mask(n_frames, max_frames, dtype="float32")
lin_hyp = lin_hyp[:, :-self.time_shift, :]
lin_ref = lin_ref[:, self.time_shift:, :]
lin_mask = lin_mask[:, self.time_shift:]
lin_l1_loss = self.l1_loss(
lin_hyp, lin_ref, lin_mask, priority_bin=self.priority_bin)
lin_bce_loss = self.binary_divergence(lin_hyp, lin_ref, lin_mask)
lin_loss = self.binary_divergence_weight * lin_bce_loss \
+ (1 - self.binary_divergence_weight) * lin_l1_loss
total_loss += lin_loss
mel_hyp = mel_hyp[:, :-self.time_shift, :]
mel_ref = mel_ref[:, self.time_shift:, :]
mel_mask = mel_mask[:, self.time_shift:]
mel_l1_loss = self.l1_loss(mel_hyp, mel_ref, mel_mask)
mel_bce_loss = self.binary_divergence(mel_hyp, mel_ref, mel_mask)
# print("=====>", mel_l1_loss.numpy()[0], mel_bce_loss.numpy()[0])
mel_loss = self.binary_divergence_weight * mel_bce_loss \
+ (1 - self.binary_divergence_weight) * mel_l1_loss
total_loss += mel_loss
attn_loss = self.attention_loss(attn_hyp,
input_lengths.numpy(),
n_frames.numpy() //
(self.downsample_factor * self.r))
total_loss += attn_loss
done_loss = self.done_loss(done_hyp, done_ref)
total_loss += done_loss
losses = {
"loss": total_loss,
"mel/mel_loss": mel_loss,
"mel/l1_loss": mel_l1_loss,
"mel/bce_loss": mel_bce_loss,
"lin/lin_loss": lin_loss,
"lin/l1_loss": lin_l1_loss,
"lin/bce_loss": lin_bce_loss,
"done": done_loss,
"attn": attn_loss,
}
return losses
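To make the time shift in `__call__` concrete, a minimal numpy sketch (hypothetical shapes) of how hypothesis and reference are aligned before the L1/BCE terms are computed: with r frames generated per decoder step, the prediction at frame t is compared against the reference at frame t + r.
```python
import numpy as np

r = 4                              # frames per decoder step (self.time_shift)
hyp = np.random.randn(2, 20, 80)   # (B, T, C) predicted spectrogram
ref = np.random.randn(2, 20, 80)   # (B, T, C) ground truth spectrogram

hyp_shifted = hyp[:, :-r, :]       # drop the last r predicted frames
ref_shifted = ref[:, r:, :]        # drop the first r reference frames
l1 = float(np.mean(np.abs(hyp_shifted - ref_shifted)))
print(hyp_shifted.shape, ref_shifted.shape, l1)  # (2, 16, 80) (2, 16, 80) ...
```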

View File

@ -1,106 +1,482 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
import math
import paddle
from paddle import fluid
from paddle.fluid import layers as F
from paddle.fluid import initializer as I
from paddle.fluid import dygraph as dg
import paddle.fluid.layers as F
import paddle.fluid.initializer as I
import paddle.fluid.dygraph as dg
from .conv import Conv1D
from .weight_norm_hook import weight_norm, remove_weight_norm
def positional_encoding(tensor, start_index, omega):
"""
tensor: a reference tensor, used only for its shape(B, T, C); actually only T and C are needed.
start_index (int): the start position index (start and length together specify the positions).
omega (Variable): shape(B, ), speaker position rates.
class DeepVoice3(dg.Layer):
def __init__(self, encoder, decoder, converter, speaker_embedding,
use_decoder_states):
"""Deep Voice 3 TTS model.
return (B, T, C), position embedding
"""
dtype = omega.dtype
_, length, dimension = tensor.shape
index = F.range(start_index, start_index + length, 1, dtype=dtype)
channel = F.range(0, dimension, 2, dtype=dtype)
Args:
encoder (Layer): the encoder.
decoder (Layer): the decoder.
converter (Layer): the converter.
speaker_embedding (Layer): the speaker embedding (for multispeaker cases).
use_decoder_states (bool): use decoder states instead of predicted mel spectrogram as the input of the converter.
p = F.unsqueeze(omega, [1, 2]) \
* F.unsqueeze(index, [1]) \
/ (10000 ** (channel / float(dimension)))
encodings = F.concat([F.sin(p), F.cos(p)], axis=2)
return encodings
class ConvBlock(dg.Layer):
def __init__(self, in_channel, kernel_size, causal=False, has_bias=False,
bias_dim=None, keep_prob=1.):
super(ConvBlock, self).__init__()
self.causal = causal
self.keep_prob = keep_prob
self.in_channel = in_channel
self.has_bias = has_bias
std = np.sqrt(4 * keep_prob / (kernel_size * in_channel))
initializer = I.NormalInitializer(loc=0., scale=std)
padding = "valid" if causal else "same"
conv = Conv1D(in_channel, 2 * in_channel, (kernel_size, ),
padding=padding,
data_format="NTC",
param_attr=initializer)
self.conv = weight_norm(conv)
if has_bias:
self.bias_affine = dg.Linear(bias_dim, 2 * in_channel)
def forward(self, input, bias=None, padding=None):
"""
super(DeepVoice3, self).__init__()
if speaker_embedding is None:
self.n_speakers = 1
input: input feature (B, T, C)
padding: only used with causal conv; we pad manually.
"""
input_dropped = F.dropout(input, 1. - self.keep_prob,
dropout_implementation="upscale_in_train")
if self.causal:
assert padding is not None
input_dropped = F.concat([padding, input_dropped], axis=1)
hidden = self.conv(input_dropped)
if self.has_bias:
assert bias is not None
transformed_bias = F.softsign(self.bias_affine(bias))
hidden_embedded = hidden + F.unsqueeze(transformed_bias, [1])
else:
self.speaker_embedding = speaker_embedding
hidden_embedded = hidden
# glu
content, gate = F.split(hidden, num_or_sections=2, dim=-1)
content = hidden_embedded[:, :, :self.in_channel]
hidden = F.sigmoid(gate) * content
# residual
hidden = F.scale(input + hidden, math.sqrt(0.5))
return hidden
class AffineBlock1(dg.Layer):
def __init__(self, in_channel, out_channel, has_bias=False, bias_dim=0):
super(AffineBlock1, self).__init__()
std = np.sqrt(1.0 / in_channel)
initializer = I.NormalInitializer(loc=0., scale=std)
affine = dg.Linear(in_channel, out_channel, param_attr=initializer)
self.affine = weight_norm(affine, dim=-1)
if has_bias:
self.bias_affine = dg.Linear(bias_dim, out_channel)
self.has_bias = has_bias
self.bias_dim = bias_dim
def forward(self, input, bias=None):
"""
input -> (affine + weight_norm) -> hidden
bias -> (affine) -> softsign -> transformed_bias
hidden += transformed_bias
"""
hidden = self.affine(input)
if self.has_bias:
assert bias is not None
transformed_bias = F.softsign(self.bias_affine(bias))
hidden += F.unsqueeze(transformed_bias, [1])
return hidden
class AffineBlock2(dg.Layer):
def __init__(self, in_channel, out_channel,
has_bias=False, bias_dim=0, dropout=False, keep_prob=1.):
super(AffineBlock2, self).__init__()
if has_bias:
self.bias_affine = dg.Linear(bias_dim, in_channel)
std = np.sqrt(1.0 / in_channel)
initializer = I.NormalInitializer(loc=0., scale=std)
affine = dg.Linear(in_channel, out_channel, param_attr=initializer)
self.affine = weight_norm(affine, dim=-1)
self.has_bias = has_bias
self.bias_dim = bias_dim
self.dropout = dropout
self.keep_prob = keep_prob
def forward(self, input, bias=None):
"""
input -> (dropout) -> hidden
bias -> (affine) -> softsign -> transformed_bias
hidden += transformed_bias
hidden -> (affine + weight_norm) -> relu -> hidden
"""
hidden = input
if self.dropout:
hidden = F.dropout(hidden, 1. - self.keep_prob,
dropout_implementation="upscale_in_train")
if self.has_bias:
assert bias is not None
transformed_bias = F.softsign(self.bias_affine(bias))
hidden += F.unsqueeze(transformed_bias, [1])
hidden = F.relu(self.affine(hidden))
return hidden
class Encoder(dg.Layer):
def __init__(self, layers, in_channels, encoder_dim, kernel_size,
has_bias=False, bias_dim=0, keep_prob=1.):
super(Encoder, self).__init__()
self.pre_affine = AffineBlock1(in_channels, encoder_dim, has_bias, bias_dim)
self.convs = dg.LayerList([
ConvBlock(encoder_dim, kernel_size, False, has_bias, bias_dim, keep_prob) \
for _ in range(layers)])
self.post_affine = AffineBlock1(encoder_dim, in_channels, has_bias, bias_dim)
def forward(self, char_embed, speaker_embed=None):
hidden = self.pre_affine(char_embed, speaker_embed)
for layer in self.convs:
hidden = layer(hidden, speaker_embed)
hidden = self.post_affine(hidden, speaker_embed)
keys = hidden
values = F.scale(char_embed + hidden, np.sqrt(0.5))
return keys, values
class AttentionBlock(dg.Layer):
def __init__(self, attention_dim, input_dim, position_encoding_weight=1.,
position_rate=1., reduction_factor=1, has_bias=False, bias_dim=0,
keep_prob=1.):
super(AttentionBlock, self).__init__()
# positional encoding
omega_default = position_rate / reduction_factor
self.omega_default = omega_default
# multispeaker case
if has_bias:
std = np.sqrt(1.0 / bias_dim)
initializer = I.NormalInitializer(loc=0., scale=std)
self.q_pos_affine = dg.Linear(bias_dim, 1, param_attr=initializer)
self.k_pos_affine = dg.Linear(bias_dim, 1, param_attr=initializer)
self.omega_initial = self.create_parameter(shape=[1],
attr=I.ConstantInitializer(value=omega_default))
# mind the fact that q, k, v have the same feature dimension
# so we can init k_affine and q_affine's weight as the same matrix
# to get a better initial attention
init_weight = np.random.normal(size=(input_dim, attention_dim),
scale=np.sqrt(1. / input_dim))
initializer = I.NumpyArrayInitializer(init_weight.astype(np.float32))
# 3 affine transformation to project q, k, v into attention_dim
q_affine = dg.Linear(input_dim, attention_dim,
param_attr=initializer)
self.q_affine = weight_norm(q_affine, dim=-1)
k_affine = dg.Linear(input_dim, attention_dim,
param_attr=initializer)
self.k_affine = weight_norm(k_affine, dim=-1)
std = np.sqrt(1.0 / input_dim)
initializer = I.NormalInitializer(loc=0., scale=std)
v_affine = dg.Linear(input_dim, attention_dim, param_attr=initializer)
self.v_affine = weight_norm(v_affine, dim=-1)
std = np.sqrt(1.0 / attention_dim)
initializer = I.NormalInitializer(loc=0., scale=std)
out_affine = dg.Linear(attention_dim, input_dim, param_attr=initializer)
self.out_affine = weight_norm(out_affine, dim=-1)
self.keep_prob = keep_prob
self.has_bias = has_bias
self.bias_dim = bias_dim
self.attention_dim = attention_dim
self.position_encoding_weight = position_encoding_weight
def forward(self, q, k, v, lengths, speaker_embed, start_index,
force_monotonic=False, prev_coeffs=None, window=None):
# add position encoding as an inductive bias
if self.has_bias: # multi-speaker model
omega_q = 2 * F.sigmoid(
F.squeeze(self.q_pos_affine(speaker_embed), axes=[-1]))
omega_k = 2 * self.omega_initial * F.sigmoid(F.squeeze(
self.k_pos_affine(speaker_embed), axes=[-1]))
else: # single-speaker case
batch_size = q.shape[0]
omega_q = F.ones((batch_size, ), dtype="float32")
omega_k = F.ones((batch_size, ), dtype="float32") * self.omega_default
q += self.position_encoding_weight * positional_encoding(q, start_index, omega_q)
k += self.position_encoding_weight * positional_encoding(k, 0, omega_k)
q, k, v = self.q_affine(q), self.k_affine(k), self.v_affine(v)
activations = F.matmul(q, k, transpose_y=True)
activations /= np.sqrt(self.attention_dim)
if self.training:
# mask the <pad> parts from the encoder
mask = F.sequence_mask(lengths, dtype="float32")
attn_bias = F.scale(1. - mask, -1000)
activations += F.unsqueeze(attn_bias, [1])
elif force_monotonic:
assert window is not None
backward_step, forward_step = window
T_enc = k.shape[1]
batch_size, T_dec, _ = q.shape
# actually T_dec = 1 here
alpha = F.fill_constant((batch_size, T_dec), value=0, dtype="int64") \
if prev_coeffs is None \
else F.argmax(prev_coeffs, axis=-1)
backward = F.sequence_mask(alpha - backward_step, maxlen=T_enc, dtype="bool")
forward = F.sequence_mask(alpha + forward_step, maxlen=T_enc, dtype="bool")
mask = F.cast(F.logical_xor(backward, forward), "float32")
# print("mask's shape:", mask.shape)
attn_bias = F.scale(1. - mask, -1000)
activations += attn_bias
# softmax
coefficients = F.softmax(activations, axis=-1)
# context vector
coefficients = F.dropout(coefficients, 1. - self.keep_prob,
dropout_implementation='upscale_in_train')
contexts = F.matmul(coefficients, v)
# context normalization
enc_lengths = F.cast(F.unsqueeze(lengths, axes=[1, 2]), "float32")
contexts *= F.sqrt(enc_lengths)
# out affine
contexts = self.out_affine(contexts)
return contexts, coefficients
class Decoder(dg.Layer):
def __init__(self, in_channels, reduction_factor, prenet_sizes,
layers, kernel_size, attention_dim,
position_encoding_weight=1., omega=1.,
has_bias=False, bias_dim=0, keep_prob=1.):
super(Decoder, self).__init__()
# prenet - mind the difference between AffineBlock2 and AffineBlock1
c_in = in_channels
self.prenet = dg.LayerList()
for i, c_out in enumerate(prenet_sizes):
affine = AffineBlock2(c_in, c_out, has_bias, bias_dim, dropout=(i!=0), keep_prob=keep_prob)
self.prenet.append(affine)
c_in = c_out
# causal convolutions + multihop attention
decoder_dim = prenet_sizes[-1]
self.causal_convs = dg.LayerList()
self.attention_blocks = dg.LayerList()
for i in range(layers):
conv = ConvBlock(decoder_dim, kernel_size, True, has_bias, bias_dim, keep_prob)
attn = AttentionBlock(attention_dim, decoder_dim, position_encoding_weight, omega, reduction_factor, has_bias, bias_dim, keep_prob)
self.causal_convs.append(conv)
self.attention_blocks.append(attn)
# output mel spectrogram
output_dim = reduction_factor * in_channels # r * mel_dim
std = np.sqrt(1.0 / decoder_dim)
initializer = I.NormalInitializer(loc=0., scale=std)
out_affine = dg.Linear(decoder_dim, output_dim, param_attr=initializer)
self.out_affine = weight_norm(out_affine, dim=-1)
if has_bias:
self.out_sp_affine = dg.Linear(bias_dim, output_dim)
self.has_bias = has_bias
self.kernel_size = kernel_size
self.in_channels = in_channels
self.decoder_dim = decoder_dim
self.reduction_factor = reduction_factor
self.out_channels = output_dim
def forward(self, inputs, keys, values, lengths, start_index, speaker_embed=None,
state=None, force_monotonic_attention=None, coeffs=None, window=(0, 4)):
hidden = inputs
for layer in self.prenet:
hidden = layer(hidden, speaker_embed)
attentions = [] # every layer of (B, T_dec, T_enc) attention
final_state = [] # layers * (B, (k-1)d, C_dec)
batch_size = inputs.shape[0]
causal_padding_shape = (batch_size, self.kernel_size - 1, self.decoder_dim)
for i in range(len(self.causal_convs)):
if state is None:
padding = F.zeros(causal_padding_shape, dtype="float32")
else:
padding = state[i]
new_state = F.concat([padding, hidden], axis=1) # => to be used next step
# causal conv, (B, T, C)
hidden = self.causal_convs[i](hidden, speaker_embed, padding=padding)
# attn
prev_coeffs = None if coeffs is None else coeffs[i]
force_monotonic = False if force_monotonic_attention is None else force_monotonic_attention[i]
context, attention = self.attention_blocks[i](
hidden, keys, values, lengths, speaker_embed,
start_index, force_monotonic, prev_coeffs, window)
# residual connection (B, T_dec, C_dec)
hidden = F.scale(hidden + context, np.sqrt(0.5))
attentions.append(attention) # layers * (B, T_dec, T_enc)
# new state: shift a step, layers * (B, T, C)
new_state = new_state[:, -(self.kernel_size - 1):, :]
final_state.append(new_state)
# predict mel spectrogram (B, 1, T_dec, r * C_in)
decoded = self.out_affine(hidden)
if self.has_bias:
decoded *= F.sigmoid(F.unsqueeze(self.out_sp_affine(speaker_embed), [1]))
return decoded, hidden, attentions, final_state
class PostNet(dg.Layer):
def __init__(self, layers, in_channels, postnet_dim, kernel_size, out_channels, upsample_factor, has_bias=False, bias_dim=0, keep_prob=1.):
super(PostNet, self).__init__()
self.pre_affine = AffineBlock1(in_channels, postnet_dim, has_bias, bias_dim)
self.convs = dg.LayerList([
ConvBlock(postnet_dim, kernel_size, False, has_bias, bias_dim, keep_prob) for _ in range(layers)
])
std = np.sqrt(1.0 / postnet_dim)
initializer = I.NormalInitializer(loc=0., scale=std)
post_affine = dg.Linear(postnet_dim, out_channels, param_attr=initializer)
self.post_affine = weight_norm(post_affine, dim=-1)
self.upsample_factor = upsample_factor
def forward(self, hidden, speaker_embed=None):
hidden = self.pre_affine(hidden, speaker_embed)
batch_size, time_steps, channels = hidden.shape # pylint: disable=unused-variable
hidden = F.expand(hidden, [1, 1, self.upsample_factor])
hidden = F.reshape(hidden, [batch_size, -1, channels])
for layer in self.convs:
hidden = layer(hidden, speaker_embed)
spec = self.post_affine(hidden)
return spec
class SpectraNet(dg.Layer):
def __init__(self, char_embedding, speaker_embedding, encoder, decoder, postnet):
super(SpectraNet, self).__init__()
self.char_embedding = char_embedding
self.speaker_embedding = speaker_embedding
self.encoder = encoder
self.decoder = decoder
self.converter = converter
self.use_decoder_states = use_decoder_states
self.postnet = postnet
def forward(self, text, text_lengths, speakers=None, mel=None, frame_lengths=None,
force_monotonic_attention=None, window=None):
# encode
text_embed = self.char_embedding(text)# no stress embedding here
speaker_embed = F.softsign(self.speaker_embedding(speakers)) if self.speaker_embedding is not None else None
keys, values = self.encoder(text_embed, speaker_embed)
def forward(self, text_sequences, text_positions, valid_lengths,
speaker_indices, mel_inputs, frame_positions):
"""Compute predicted value in a teacher forcing training manner.
Args:
text_sequences (Variable): shape(B, T_enc), dtype: int64, text indices.
text_positions (Variable): shape(B, T_enc), dtype: int64, positions of text indices.
valid_lengths (Variable): shape(B, ), dtype: int64, valid lengths of utterances.
speaker_indices (Variable): shape(B, ), dtype: int64, speaker indices for utterances.
mel_inputs (Variable): shape(B, T_mel, C_mel), dtype: float32, ground truth mel spectrogram.
frame_positions (Variable): shape(B, T_dec), dtype: int64, positions of decoder steps.
Returns:
(mel_outputs, linear_outputs, alignments, done)
mel_outputs (Variable): shape(B, T_mel, C_mel), dtype float32, predicted mel spectrogram.
linear_outputs (Variable): shape(B, T_lin, C_lin), dtype float32, predicted linear spectrogram.
alignments (Variable): shape(N, B, T_dec, T_enc), dtype float32, predicted attention.
done (Variable): shape(B, T_dec), dtype float32, predicted done probability.
(T_mel: time steps of mel spectrogram, T_lin: time steps of linear spectrogram, T_dec: time steps of decoder, T_enc: time steps of encoder.)
"""
if hasattr(self, "speaker_embedding"):
speaker_embed = self.speaker_embedding(speaker_indices)
if mel is not None:
return self.teacher_forced_train(keys, values, text_lengths, speaker_embed, mel)
else:
speaker_embed = None
return self.inference(keys, values, text_lengths, speaker_embed, force_monotonic_attention, window)
keys, values = self.encoder(text_sequences, speaker_embed)
mel_outputs, alignments, done, decoder_states = self.decoder(
(keys, values), valid_lengths, mel_inputs, text_positions,
frame_positions, speaker_embed)
linear_outputs = self.converter(decoder_states
if self.use_decoder_states else
mel_outputs, speaker_embed)
return mel_outputs, linear_outputs, alignments, done
def teacher_forced_train(self, keys, values, text_lengths, speaker_embed, mel):
# build decoder inputs by shifting one frame and prepending an all-zero <start> frame
# the mel input is downsampled by a reduction factor
batch_size = mel.shape[0]
mel_input = F.reshape(mel, (batch_size, -1, self.decoder.reduction_factor, self.decoder.in_channels))
zero_frame = F.zeros((batch_size, 1, self.decoder.in_channels), dtype="float32")
# downsample mel input as a regularization
mel_input = F.concat([zero_frame, mel_input[:, :-1, -1, :]], axis=1)
def transduce(self, text_sequences, text_positions, speaker_indices=None):
"""Generate output without teacher forcing. Only batch_size = 1 is supported.
# decoder
decoded, hidden, attentions, final_state = self.decoder(mel_input, keys, values, text_lengths, 0, speaker_embed)
attentions = F.stack(attentions) # (N, B, T_dec, T_enc)
# unfold frames
decoded = F.reshape(decoded, (batch_size, -1, self.decoder.in_channels))
# postnet
refined = self.postnet(hidden, speaker_embed)
return decoded, refined, attentions, final_state
Args:
text_sequences (Variable): shape(B, T_enc), dtype: int64, text indices.
text_positions (Variable): shape(B, T_enc), dtype: int64, positions of text indices.
speaker_indices (Variable): shape(B, ), dtype: int64, speaker indices for utterances.
Returns:
(mel_outputs, linear_outputs, alignments, done)
mel_outputs (Variable): shape(B, T_mel, C_mel), dtype float32, predicted mel spectrogram.
linear_outputs (Variable): shape(B, T_lin, C_lin), dtype float32, predicted linear spectrogram.
alignments (Variable): shape(B, T_dec, T_enc), dtype float32, predicted average attention of all attention layers.
done (Variable): shape(B, T_dec), dtype float32, predicted done probability.
(T_mel: time steps of mel spectrogram, T_lin: time steps of linear spectrogram, T_dec: time steps of decoder, T_enc: time steps of encoder.)
"""
if hasattr(self, "speaker_embedding"):
speaker_embed = self.speaker_embedding(speaker_indices)
def spec_loss(self, decoded, input, num_frames=None):
if num_frames is None:
l1_loss = F.reduce_mean(F.abs(decoded - input))
else:
speaker_embed = None
# mask the <pad> part of the decoder
num_channels = decoded.shape[-1]
l1_loss = F.abs(decoded - input)
mask = F.sequence_mask(num_frames, dtype="float32")
l1_loss *= F.unsqueeze(mask, axes=[-1])
l1_loss = F.reduce_sum(l1_loss) / F.scale(F.reduce_sum(mask), num_channels)
return l1_loss
keys, values = self.encoder(text_sequences, speaker_embed)
mel_outputs, alignments, done, decoder_states = self.decoder.decode(
(keys, values), text_positions, speaker_embed)
linear_outputs = self.converter(decoder_states
if self.use_decoder_states else
mel_outputs, speaker_embed)
return mel_outputs, linear_outputs, alignments, done
@dg.no_grad
def inference(self, keys, values, text_lengths, speaker_embed,
force_monotonic_attention, window):
MAX_STEP = 500
# layer index of the first monotonic attention
num_monotonic_attention_layers = sum(force_monotonic_attention)
first_mono_attention_layer = 0
if num_monotonic_attention_layers > 0:
for i, item in enumerate(force_monotonic_attention):
if item:
first_mono_attention_layer = i
break
# stop condition (it would be more complicated to support minibatch autoregressive decoding,
# so we only support batch_size == 1 in inference)
def should_continue(i, mel_input, outputs, hidden, attention, state, coeffs):
T_enc = coeffs.shape[-1]
attn_peak = F.argmax(coeffs[first_mono_attention_layer, 0, 0]) \
if num_monotonic_attention_layers > 0 \
else F.fill_constant([1], "int64", value=0)
return i < MAX_STEP and F.reshape(attn_peak, [1]) < T_enc - 1
def loop_body(i, mel_input, outputs, hiddens, attentions, state=None, coeffs=None):
# state is None and coeffs is None for the first step
decoded, hidden, new_coeffs, new_state = self.decoder(
mel_input, keys, values, text_lengths, i, speaker_embed,
state, force_monotonic_attention, coeffs, window)
new_coeffs = F.stack(new_coeffs) # (N, B, T_dec=1, T_enc)
attentions.append(new_coeffs) # (N, B, T_dec=1, T_enc)
outputs.append(decoded) # (B, T_dec=1, rC_mel)
hiddens.append(hidden) # (B, T_dec=1, C_dec)
# slice the last frame out of r generated frames to be used as the input for the next step
batch_size = mel_input.shape[0]
frames = F.reshape(decoded, [batch_size, -1, self.decoder.reduction_factor, self.decoder.in_channels])
input_frame = frames[:, :, -1, :]
return (i + 1, input_frame, outputs, hiddens, attentions, new_state, new_coeffs)
i = 0
batch_size = keys.shape[0]
input_frame = F.zeros((batch_size, 1, self.decoder.in_channels), dtype="float32")
outputs = []
hiddens = []
attentions = []
loop_state = loop_body(i, input_frame, outputs, hiddens, attentions)
while should_continue(*loop_state):
loop_state = loop_body(*loop_state)
outputs, hiddens, attention = loop_state[2], loop_state[3], loop_state[4]
# concat decoder timesteps
outputs = F.concat(outputs, axis=1)
hiddens = F.concat(hiddens, axis=1)
attention = F.concat(attention, axis=2)
# unfold frames
outputs = F.reshape(outputs, (batch_size, -1, self.decoder.in_channels))
refined = self.postnet(hiddens, speaker_embed)
return outputs, refined, attention
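The reduction-factor bookkeeping in `teacher_forced_train` and `inference` can be illustrated with a small numpy sketch (toy numbers, not the model code): the mel spectrogram is grouped into chunks of r frames per decoder step, only the last frame of each chunk is fed back as the next input, and the decoder output is unfolded back to (B, T, C_mel).
```python
import numpy as np

B, T, C, r = 1, 8, 3, 4                      # T must be a multiple of r
mel = np.arange(B * T * C, dtype="float32").reshape(B, T, C)

grouped = mel.reshape(B, -1, r, C)           # (B, T//r, r, C)
zero_frame = np.zeros((B, 1, C), dtype="float32")
# <start> frame plus the last frame of every chunk except the final one
decoder_input = np.concatenate([zero_frame, grouped[:, :-1, -1, :]], axis=1)
print(decoder_input.shape)                   # (1, 2, 3): one frame per decoder step

decoded = np.random.randn(B, T // r, r * C)  # decoder output: r * C values per step
unfolded = decoded.reshape(B, -1, C)         # unfold frames
print(unfolded.shape)                        # (1, 8, 3)
```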

View File

@ -1,158 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import division
import numpy as np
from paddle import fluid
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg
def lookup(weight, indices, padding_idx):
out = fluid.core.ops.lookup_table_v2(
weight, indices, 'is_sparse', False, 'is_distributed', False,
'remote_prefetch', False, 'padding_idx', padding_idx)
return out
def compute_position_embedding_single_speaker(radians, speaker_position_rate):
"""Compute sin/cos interleaved matrix from the radians.
Args:
radians (Variable): shape(n_vocab, embed_dim), dtype float32, the radians matrix.
speaker_position_rate (float or Variable): float or Variable of shape(1, ), speaker positioning rate.
Returns:
Variable: shape(n_vocab, embed_dim), the sin, cos interleaved matrix.
"""
_, embed_dim = radians.shape
scaled_radians = radians * speaker_position_rate
odd_mask = (np.arange(embed_dim) % 2).astype(np.float32)
odd_mask = dg.to_variable(odd_mask)
out = odd_mask * F.cos(scaled_radians) \
+ (1 - odd_mask) * F.sin(scaled_radians)
return out
def compute_position_embedding(radians, speaker_position_rate):
"""Compute sin/cos interleaved matrix from the radians.
Args:
radians (Variable): shape(n_vocab, embed_dim), dtype float32, the radians matrix.
speaker_position_rate (Variable): shape(B, ), speaker positioning rate.
Returns:
Variable: shape(B, n_vocab, embed_dim), the sin, cos interleaved matrix.
"""
_, embed_dim = radians.shape
batch_size = speaker_position_rate.shape[0]
scaled_radians = F.elementwise_mul(
F.expand(F.unsqueeze(radians, [0]), [batch_size, 1, 1]),
speaker_position_rate,
axis=0)
odd_mask = (np.arange(embed_dim) % 2).astype(np.float32)
odd_mask = dg.to_variable(odd_mask)
out = odd_mask * F.cos(scaled_radians) \
+ (1 - odd_mask) * F.sin(scaled_radians)
out = F.concat(
[F.zeros((batch_size, 1, embed_dim), radians.dtype), out[:, 1:, :]],
axis=1)
return out
def position_encoding_init(n_position,
d_pos_vec,
position_rate=1.0,
padding_idx=None):
"""Init the position encoding.
Args:
n_position (int): max position, vocab size for position embedding.
d_pos_vec (int): position embedding size.
position_rate (float, optional): position rate (this should only be used when all the utterances are from one speaker). Defaults to 1.0.
padding_idx (int, optional): padding index for the position embedding (it is set to 0 internally if not provided). Defaults to None.
Returns:
np.ndarray: shape(n_position, d_pos_vec), the radians table (sin and cos are not applied yet).
"""
# init the position encoding table
# keep idx 0 for padding token position encoding zero vector
# CAUTION: it is radians here, sin and cos are not applied
indices_range = np.expand_dims(np.arange(n_position), -1)
embed_range = 2 * (np.arange(d_pos_vec) // 2)
radians = position_rate \
* indices_range \
/ np.power(1.e4, embed_range / d_pos_vec)
if padding_idx is not None:
radians[padding_idx] = 0.
return radians
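A small numpy sketch of the same table: `position_encoding_init` stores radians only, and sin/cos are applied afterwards (sin on even channels, cos on odd channels), optionally scaled by a speaker position rate, as in `compute_position_embedding_single_speaker` above.
```python
import numpy as np

def radians_table(n_position, d_pos_vec, position_rate=1.0):
    indices = np.expand_dims(np.arange(n_position), -1)
    embed = 2 * (np.arange(d_pos_vec) // 2)
    return position_rate * indices / np.power(1.e4, embed / d_pos_vec)

rad = radians_table(6, 4)
odd_mask = (np.arange(4) % 2).astype("float32")
# sin on even channels, cos on odd channels
pos_embed = odd_mask * np.cos(rad) + (1 - odd_mask) * np.sin(rad)
print(pos_embed.shape)  # (6, 4)
```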
class PositionEmbedding(dg.Layer):
def __init__(self, n_position, d_pos_vec, position_rate=1.0):
"""Position Embedding for Deep Voice 3.
Args:
n_position (int): max position, vocab size for position embedding.
d_pos_vec (int): position embedding size.
position_rate (float, optional): position rate (this should only be used when all the utterances are from one speaker). Defaults to 1.0.
"""
super(PositionEmbedding, self).__init__()
self.weight = self.create_parameter((n_position, d_pos_vec))
self.weight.set_value(
position_encoding_init(n_position, d_pos_vec, position_rate)
.astype("float32"))
def forward(self, indices, speaker_position_rate=None):
"""
Args:
indices (Variable): shape (B, T), dtype: int64, position
indices, where B means the batch size, T means the time steps.
speaker_position_rate (Variable | float, optional): position
rate. It can be a floating point number or a Variable with
shape (1,), in which case this speaker_position_rate is used for every
example. It can also be a Variable with shape (B, ), which
contains a speaker position rate for each utterance.
Returns:
out (Variable): shape(B, T, C_pos), dtype float32, position embedding, where C_pos
means position embedding size.
"""
batch_size, time_steps = indices.shape
if isinstance(speaker_position_rate, float) or \
(isinstance(speaker_position_rate, fluid.framework.Variable)
and list(speaker_position_rate.shape) == [1]):
temp_weight = compute_position_embedding_single_speaker(
self.weight, speaker_position_rate)
out = lookup(temp_weight, indices, 0)
return out
assert len(speaker_position_rate.shape) == 1 and \
list(speaker_position_rate.shape) == [batch_size]
weight = compute_position_embedding(self.weight,
speaker_position_rate) # (B, V, C)
# make indices for gather_nd
batch_id = F.expand(
F.unsqueeze(
F.range(
0, batch_size, 1, dtype="int64"), [1]), [1, time_steps])
# (B, T, 2)
gather_nd_id = F.stack([batch_id, indices], -1)
out = F.gather_nd(weight, gather_nd_id)
return out

View File

@ -0,0 +1,148 @@
import numpy as np
import paddle
from paddle import fluid
import paddle.fluid.dygraph as dg
import paddle.fluid.layers as F
from paddle.fluid.layer_helper import LayerHelper
from paddle.fluid.data_feeder import check_variable_and_dtype
def l2_norm(x, axis, epsilon=1e-12, name=None):
if len(x.shape) == 1:
axis = 0
check_variable_and_dtype(x, "X", ("float32", "float64"), "norm")
helper = LayerHelper("l2_normalize", **locals())
out = helper.create_variable_for_type_inference(dtype=x.dtype)
norm = helper.create_variable_for_type_inference(dtype=x.dtype)
helper.append_op(
type="norm",
inputs={"X": x},
outputs={"Out": out,
"Norm": norm},
attrs={
"axis": 1 if axis is None else axis,
"epsilon": epsilon,
})
return F.squeeze(norm, axes=[axis])
def norm_except_dim(p, dim):
shape = p.shape
ndims = len(shape)
if dim is None:
return F.sqrt(F.reduce_sum(F.square(p)))
elif dim == 0:
p_matrix = F.reshape(p, (shape[0], -1))
return l2_norm(p_matrix, axis=1)
elif dim == -1 or dim == ndims - 1:
p_matrix = F.reshape(p, (-1, shape[-1]))
return l2_norm(p_matrix, axis=0)
else:
perm = list(range(ndims))
perm[0] = dim
perm[dim] = 0
p_transposed = F.transpose(p, perm)
return norm_except_dim(p_transposed, 0)
def _weight_norm(v, g, dim):
shape = v.shape
ndims = len(shape)
if dim is None:
v_normalized = v / (F.sqrt(F.reduce_sum(F.square(v))) + 1e-12)
elif dim == 0:
p_matrix = F.reshape(v, (shape[0], -1))
v_normalized = F.l2_normalize(p_matrix, axis=1)
v_normalized = F.reshape(v_normalized, shape)
elif dim == -1 or dim == ndims - 1:
p_matrix = F.reshape(v, (-1, shape[-1]))
v_normalized = F.l2_normalize(p_matrix, axis=0)
v_normalized = F.reshape(v_normalized, shape)
else:
perm = list(range(ndims))
perm[0] = dim
perm[dim] = 0
p_transposed = F.transpose(v, perm)
transposed_shape = p_transposed.shape
p_matrix = F.reshape(p_transposed, (p_transposed.shape[0], -1))
v_normalized = F.l2_normalize(p_matrix, axis=1)
v_normalized = F.reshape(v_normalized, transposed_shape)
v_normalized = F.transpose(v_normalized, perm)
weight = F.elementwise_mul(v_normalized, g, axis=dim if dim is not None else -1)
return weight
class WeightNorm(object):
def __init__(self, name, dim):
if dim is None:
dim = -1
self.name = name
self.dim = dim
def compute_weight(self, module):
g = getattr(module, self.name + '_g')
v = getattr(module, self.name + '_v')
w = _weight_norm(v, g, self.dim)
return w
@staticmethod
def apply(module: dg.Layer, name, dim):
for k, hook in module._forward_pre_hooks.items():
if isinstance(hook, WeightNorm) and hook.name == name:
raise RuntimeError("Cannot register two weight_norm hooks on "
"the same parameter {}".format(name))
if dim is None:
dim = -1
fn = WeightNorm(name, dim)
# remove w from parameter list
w = getattr(module, name)
del module._parameters[name]
# add g and v as new parameters and express w as g/||v|| * v
g_var = norm_except_dim(w, dim)
v = module.create_parameter(w.shape, dtype=w.dtype)
module.add_parameter(name + "_v", v)
g = module.create_parameter(g_var.shape, dtype=g_var.dtype)
module.add_parameter(name + "_g", g)
with dg.no_grad():
F.assign(w, v)
F.assign(g_var, g)
setattr(module, name, fn.compute_weight(module))
# recompute weight before every forward()
module.register_forward_pre_hook(fn)
return fn
def remove(self, module):
w_var = self.compute_weight(module)
delattr(module, self.name)
del module._parameters[self.name + '_g']
del module._parameters[self.name + '_v']
w = module.create_parameter(w_var.shape, dtype=w_var.dtype)
module.add_parameter(self.name, w)
with dg.no_grad():
F.assign(w_var, w)
def __call__(self, module, inputs):
setattr(module, self.name, self.compute_weight(module))
def weight_norm(module, name='weight', dim=0):
WeightNorm.apply(module, name, dim)
return module
def remove_weight_norm(module, name='weight'):
for k, hook in module._forward_pre_hooks.items():
if isinstance(hook, WeightNorm) and hook.name == name:
hook.remove(module)
del module._forward_pre_hooks[k]
return module
raise ValueError("weight_norm of '{}' not found in {}"
.format(name, module))
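Independent of paddle, the decomposition this hook implements can be sanity-checked with a few lines of numpy (here dim=0, matching `norm_except_dim` and `_weight_norm`): the weight is re-expressed as w = g * v / ||v||, with the norm taken over every axis except dim.
```python
import numpy as np

w = np.random.randn(4, 3).astype("float32")

g = np.linalg.norm(w.reshape(w.shape[0], -1), axis=1)   # norm_except_dim(w, 0)
v = w.copy()                                            # v is initialized from w
v_norm = np.linalg.norm(v.reshape(v.shape[0], -1), axis=1, keepdims=True)
w_recomputed = (v / v_norm) * g[:, None]                # _weight_norm(v, g, 0)

print(np.allclose(w, w_recomputed, atol=1e-6))          # True
```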