fix typos and refine doc
This commit is contained in:
parent f82ac720d4
commit 6a9eab4b73

@ -2,59 +2,61 @@

This short guide shows the design of `parakeet.data` and how we use it in an experiment.

The most important concepts of `parakeet.data` are `DatasetMixin`, `DataCargo`, `Sampler`, `batch function` and `DataIterator`.

## Dataset

A dataset, as we assume here, is a list of examples. You can get its length by `len(dataset)` (which means its length is known, so we have to implement the `__len__` method for it), and you can access its items randomly by `dataset[i]` (which means we have to implement the `__getitem__` method for it). Furthermore, you can iterate over it by `iter(dataset)` or `for example in dataset`, which means we have to implement the `__iter__` method for it.

### DatasetMixin

We provide a `DatasetMixin` class which provides the above methods. You can inherit from `DatasetMixin` and implement the `get_example` method to define your own dataset class. The `get_example` method is called by the `__getitem__` method automatically.
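As an illustration, here is a minimal sketch of a custom dataset built on `DatasetMixin`. The manifest layout (a csv of `text, wav_path` rows) and the use of `librosa` are assumptions made for this example; only the `__len__`/`get_example` contract comes from `DatasetMixin`.

```python
import csv

import librosa
from parakeet.data.dataset import DatasetMixin


class CsvAudioDataset(DatasetMixin):
    """A toy dataset: each example is a (text, audio) pair read from a csv manifest."""

    def __init__(self, manifest_path):
        with open(manifest_path, 'rt') as f:
            # each row is assumed to be: text, path_to_wav
            self.records = list(csv.reader(f))

    def __len__(self):
        return len(self.records)

    def get_example(self, i):
        text, wav_path = self.records[i]
        audio, _ = librosa.load(wav_path, sr=None)
        return text, audio
```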

We also define several higher-order Dataset classes, objects of which can be built from given Dataset objects.

### TupleDataset

A dataset that is a combination of several datasets of the same length. An example of a `TupleDataset` is a tuple of the examples of its constituent datasets.

### DictDataset

A dataset that is a combination of several datasets of the same length. An example of a `DictDataset` is a dict of the examples of its constituent datasets.

### SliceDataset

`SliceDataset` is a slice of the base dataset.

### SubsetDataset

`SubsetDataset` is a subset of the base dataset.

### ChainDataset

`ChainDataset` is the concatenation of several datasets with the same fields.

### TransformDataset

A `TransformDataset` is created by applying a `transform` to the base dataset. The `transform` is a callable object which takes an `example` of the base dataset as its parameter and returns an `example` of the `TransformDataset`. The transformation is lazy, which means it is applied to an example only when requested.

### FilterDataset

A `FilterDataset` is created by applying a `filter` to the base dataset. A `filter` is a predicate that takes an `example` of the base dataset as its parameter and returns a boolean. Only those examples that pass the filter are included in the `FilterDataset`.

Note that the filter is applied to all the examples in the base dataset when initializing a `FilterDataset`.

### CacheDataset

By default, we preprocess the dataset lazily in `DatasetMixin.get_example`. An example is preprocessed whenever it is requested. But `CacheDataset` caches the base dataset lazily, so each example is processed only once, when it is first requested. When preprocessing the dataset is slow, you can use `CacheDataset` to speed it up, but caching may consume a lot of RAM if the dataset is large. A small example of composing these wrappers follows.
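The snippet below is only a sketch of how the wrappers compose; it assumes the `TransformDataset(base, transform)` and `CacheDataset(base)` constructors described above, imported from `parakeet.data.dataset` (the exact module path is an assumption), and uses a trivial stand-in dataset and transform.

```python
from parakeet.data.dataset import TransformDataset, CacheDataset

base_dataset = list(range(10))       # any dataset-like list of examples


def expensive_transform(example):    # stand-in for a slow preprocessing step
    return example * 2


transformed = TransformDataset(base_dataset, expensive_transform)  # transform is applied lazily
cached = CacheDataset(transformed)   # each transformed example is computed once, then kept in RAM
```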

Finally, if preprocessing the dataset is slow and the processed dataset is too large to cache, you can write your own code to save the processed examples into files or a database, and then define a Dataset to load them. `Dataset` is flexible, so you can create your own dataset painlessly.

## DataCargo

`DataCargo`, like `Dataset`, is an iterable, but it is an iterable of batches. We need `DataCargo` because in deep learning, batching examples together exploits the computational resources of modern hardware. You can iterate over it by `iter(datacargo)` or `for batch in datacargo`. `DataCargo` is an iterable but not an iterator, in that it can be iterated over more than once.

### batch function

The concept of a `batch` is something transformed from a list of examples. Assume that an example is a structure (a tuple in Python, or a struct in C and C++) consisting of several fields, then a list of examples is an array of structures (AOS, e.g. a dataset is an AOS). A batch here is a structure of arrays (SOA). Here is an example:

The table below represents 2 examples, each of which contains 5 fields.

@ -63,7 +65,7 @@ The table below represents 2 examples, each of which contains 5 fields.
| 1.2 | 1.1 | 1.3 | 1.4 | 0.8 |
| 1.6 | 1.4 | 1.2 | 0.6 | 1.4 |

The AOS representation and the SOA representation of the table are shown below.

AOS:
```text

@ -81,15 +83,15 @@ SOA:
[0.8, 1.4])
```

For the example above, converting an AOS to an SOA is trivial: just stack every field for all the examples. But that is not always the case. When a field contains a sequence, you may have to pad all the sequences to the largest length and then stack them together. In some other cases, we may want to add a field for the batch, for example, a `valid_length` for each example. So in general, a function to transform an AOS into an SOA is needed to build a `DataCargo` from a dataset. We call this the batch function (`batch_fn`), but you can use any callable object if you need to.

Usually we need to define the batch function as a callable object which stores all the options and configurations as its members. Its `__call__` method transforms a list of examples into a batch.
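Below is a sketch of such a batch function for examples that are `(text_ids, audio)` pairs of rank-1 numpy arrays; it pads each field to the longest length in the list, stacks them, and adds a `text_lengths` field. The field layout is made up for this example.

```python
import numpy as np


class ExampleBatcher(object):
    """Transform a list of (text_ids, audio) examples (an AOS) into a batch (an SOA)."""

    def __init__(self, text_pad_id=0, audio_pad_value=0.):
        self.text_pad_id = text_pad_id
        self.audio_pad_value = audio_pad_value

    def __call__(self, examples):
        texts, audios = zip(*examples)
        text_lengths = np.array([len(t) for t in texts], dtype=np.int64)
        max_text = max(len(t) for t in texts)
        max_audio = max(len(a) for a in audios)
        batch_text = np.stack([
            np.pad(t, (0, max_text - len(t)), mode="constant",
                   constant_values=self.text_pad_id) for t in texts])
        batch_audio = np.stack([
            np.pad(a, (0, max_audio - len(a)), mode="constant",
                   constant_values=self.audio_pad_value) for a in audios])
        # the extra field records the valid length of each text
        return batch_text, batch_audio, text_lengths
```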

### Sampler

Equipped with a batch function (we now know __how to batch__), here comes the next question: __what to batch?__ We need to decide which examples to pick when creating a batch. Since a dataset is a list of examples, we only need to pick indices for the corresponding examples. A sampler object is what we use to do this.

A `Sampler` is represented as an iterable of integers. Assume the dataset has `N` examples, then an iterable of integers in the range `[0, N)` is an appropriate sampler for this dataset to build a `DataCargo`.

We provide several samplers that are ready to use, for example `SequentialSampler` and `RandomSampler`.
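Since a sampler is just an iterable of integers in `[0, N)`, writing a custom one is straightforward. The class below is a toy example of ours, not a `parakeet.data` API.

```python
class EveryOtherSampler(object):
    """Yield indices 0, 2, 4, ... for a dataset with `data_length` examples."""

    def __init__(self, data_length):
        self.data_length = data_length

    def __iter__(self):
        return iter(range(0, self.data_length, 2))
```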

@ -215,9 +217,9 @@ class Transform(object):
return audio, mel_spectrogram
```

`Transform` loads the audio files and extracts `mel_spectrogram` from them. This transformation actually needs a lot of options, namely, the sample rate of the audio files, the `n_fft`, `win_length` and `hop_length` of the `stft` transformation, and `n_mels` for transforming the spectrogram into a mel spectrogram. So we define it as a callable class. You can also use a closure, or a `partial` if you want to.

Then we define a functor to batch examples into a batch. Because the two fields (`audio` and `mel_spectrogram`) are both sequences, batching them is not trivial. Also, because the wavenet model trains on audio clips of a fixed length (0.5 seconds, for example), we have to truncate the audio when creating batches. We want to crop the audio randomly when creating batches, instead of truncating it when preprocessing each example, because this allows an audio clip to be cropped at different positions.

```python
class DataCollector(object):

@ -321,7 +323,7 @@ for batch in train_cargo:
# your training code here
```

In the code above, processing of the data and training of the model run in the same process, so the next batch starts to load only after training on the current batch has finished. There is actually a better solution: data processing and model training can run asynchronously. To accomplish this, we would use `DataLoader` from Paddle. It serves as an adapter that transforms an iterable of batches into another iterable of batches, which runs asynchronously and transforms each ndarray into a `Variable`.

```python
# connects our data cargos with corresponding DataLoader
@ -0,0 +1,87 @@
# How to build your own model and experiment?

For a general deep learning experiment, there are four parts to take care of:

1. Preprocess the dataset to meet the needs of model training and iterate over it in batches;
2. Define the model and the optimizer;
3. Write the training process (including forward-backward computation, parameter updates, logging, evaluation, etc.);
4. Configure and launch the experiment.

## Data Processing

For processing data, `parakeet.data` provides `DatasetMixin`, `DataCargo` and `DataIterator`.

A dataset is an iterable of examples. `DatasetMixin` provides the standard indexing interface, and other classes in [parakeet.data.dataset](../parakeet/data/dataset.py) provide flexible interfaces for building customized datasets.

`DataCargo` is an iterable of batches. It differs from a dataset in that it is iterated over in batches. In addition to a dataset, a `Sampler` and a `batch function` are required to build a `DataCargo`. The `Sampler` specifies which examples to pick, and the `batch function` specifies how to create a batch from them. Commonly used `Sampler`s are provided by [parakeet.data](../parakeet/data/). Users should define a `batch function` for a dataset in order to batch its examples.

`DataIterator` is an iterator class for `DataCargo`. It is created when explicitly creating an iterator of a `DataCargo` by `iter(DataCargo)`, or when iterating over a `DataCargo` with a `for` loop.

Data processing is split into two phases: sample-level processing and batching.

1. Sample-level processing. This process transforms an example into another example. It can be defined as the `get_example` method of a dataset, or as a `transform` (a callable object) used to build a `TransformDataset`.

2. Batching. It is the process of transforming a list of examples into a batch. The rationale is to transform an array of structures into a structure of arrays. We generally define a batch function (a callable object) to do this.

To connect a `DataCargo` with PaddlePaddle's asynchronous data loading mechanism, we need to create a `fluid.io.DataLoader` and connect it to the `DataCargo`, as sketched below.
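Putting the pieces together, the sketch below wires a dataset, a sampler and a batch function into a `DataCargo` and connects it to a `fluid.io.DataLoader`. The module paths, the `DataCargo` keyword arguments and the `from_generator`/`set_batch_generator` calls are assumptions based on the description above and the Paddle fluid API of that era; check the examples for the exact signatures. `dataset`, `my_transform` and `my_batch_fn` are assumed to be defined as described above.

```python
from paddle import fluid
from parakeet.data.dataset import TransformDataset
from parakeet.data.datacargo import DataCargo
from parakeet.data.sampler import RandomSampler

# sample-level processing, then batching
train_dataset = TransformDataset(dataset, my_transform)
train_cargo = DataCargo(
    train_dataset,
    batch_fn=my_batch_fn,
    batch_size=16,
    sampler=RandomSampler(len(train_dataset)))

# asynchronous loading: the DataLoader consumes the DataCargo in a background thread
place = fluid.CUDAPlace(0)
loader = fluid.io.DataLoader.from_generator(capacity=8, return_list=True)
loader.set_batch_generator(train_cargo, places=place)

for batch in loader():
    pass  # each batch is a list of Variables, ready for the model
```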

The overview of data processing in an experiment with Parakeet is:

```text
Dataset --(transform)--> Dataset --+
                         sampler --+
                        batch_fn --+-> DataCargo --> DataLoader
```

The user needs to define a customized transform and a batch function to accomplish this process. See [data](./data.md) for more details.

## Model

Parakeet provides commonly used functions, modules and models for users to define their own models. Functions contain no trainable `Parameter`s and are used in modules and models. Modules and models are subclasses of `fluid.dygraph.Layer`. The distinction is that `module`s tend to be generic, simple and highly reusable, while `model`s tend to be task-specific, complicated and not that reusable. Some models are so complicated that we extract building blocks from them as separate classes, but if these building blocks are not common and reusable enough, they are considered submodels.
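For instance, a reusable module is typically a small `fluid.dygraph.Layer` subclass like the sketch below; `PreNet` here is a made-up illustration, not an actual class from `parakeet.modules`.

```python
import paddle.fluid.layers as F
import paddle.fluid.dygraph as dg


class PreNet(dg.Layer):
    """A generic two-layer feed-forward block with dropout, small enough to be reused by many models."""

    def __init__(self, in_features, hidden_size, dropout=0.5):
        super(PreNet, self).__init__()
        self.linear1 = dg.Linear(in_features, hidden_size)
        self.linear2 = dg.Linear(hidden_size, hidden_size)
        self.dropout = dropout

    def forward(self, x):
        x = F.dropout(F.relu(self.linear1(x)), self.dropout)
        x = F.dropout(F.relu(self.linear2(x)), self.dropout)
        return x
```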

In the structure of the project, modules are placed in [parakeet.modules](../parakeet/modules/), while models are in [parakeet.models](../parakeet/models) and grouped into folders like `waveflow` and `wavenet`, each of which includes the whole model and its submodels.

When developers want to add new models to `parakeet`, they can consider the distinctions described above and put the code in the appropriate place.

## Training Process

The training process is basically running a training loop multiple times. A typical training loop consists of the procedures below:

1. Iterating over the training dataset;
2. Preprocessing mini-batches;
3. Forward/backward computations of the neural networks;
4. Updating parameters;
5. Evaluating the model on the validation dataset;
6. Logging or saving intermediate results;
7. Saving checkpoints of the model and the optimizer.

The `Data Processing` section above covers items 1 and 2.

The `Model` and the `Optimizer` cover items 3 and 4.

To keep the training loop clear, it's a good idea to define functions for saving/loading checkpoints, evaluation on the validation set, logging, and saving intermediate results, etc. For complicated models, it is also recommended to define a function that creates the model. This function can then be used in both training and inference, to ensure that the model is identical at training and inference time. A minimal sketch of such a loop follows.
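The sketch below shows such a loop in Paddle's dygraph style. Everything experiment-specific (the model's loss computation, the evaluation and checkpointing helpers, the data loader) is passed in as an argument, so the names here are placeholders rather than Parakeet APIs.

```python
def train_loop(model, optimizer, loader, max_iteration, eval_fn, save_checkpoint_fn,
               eval_interval=1000):
    """Run a typical training loop for `max_iteration` steps (a sketch, not Parakeet's API)."""
    iteration = 0
    while iteration < max_iteration:
        for batch in loader():
            loss = model(*batch)        # forward pass; the model is assumed to return its loss
            loss.backward()             # backward pass
            optimizer.minimize(loss)    # parameter update
            model.clear_gradients()

            iteration += 1
            if iteration % eval_interval == 0:
                eval_fn(model)                                   # evaluation on the validation set
                save_checkpoint_fn(model, optimizer, iteration)  # checkpointing
            if iteration >= max_iteration:
                break
```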

Code is typically organized in this way:

```text
├── configs (example configuration)
├── data.py (definition of custom Dataset, transform and batch function)
├── README.md (README for the experiment)
├── synthesis.py (code for inference)
├── train.py (code for training)
└── utils.py (all other utility functions)
```

## Configuration

Deep learning experiments have many options to configure. These configurations can be roughly grouped into different types: configurations about the path of the dataset and the path to save results, configurations about how to process data, configurations about the model, and configurations about the training process.

Some configurations tend to change from run to run, for example, the path of the data, the path to save results, and whether to load the model before training, etc. For these configurations, it's better to define them as command line arguments. We use `argparse` to handle them.

Other groups of configurations may overlap with each other. For example, data processing and the model may share some common options. The recommended way is to save them as configuration files, for example, in `yaml` or `json`. We prefer `yaml`, for it is more human-readable. A sketch of combining the two mechanisms is shown below.
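The sketch below combines the two mechanisms: run-specific options on the command line, shared options in a yaml file. The option names and the default config path are illustrative only.

```python
import argparse

import yaml


def parse_args():
    parser = argparse.ArgumentParser(description="Train a Parakeet model.")
    # options that change from run to run go on the command line
    parser.add_argument("--data", type=str, required=True, help="path of the dataset")
    parser.add_argument("--output", type=str, default="experiment", help="path to save results")
    parser.add_argument("--checkpoint", type=str, default=None, help="checkpoint to load before training")
    parser.add_argument("--config", type=str, default="configs/default.yaml", help="yaml config file")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    # options shared by data processing, the model and training live in the yaml file
    with open(args.config, 'rt') as f:
        config = yaml.safe_load(f)
    print(args, config)
```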

There are several examples in this repo; check [Parakeet/examples](../examples) for more details. `Parakeet/examples` is where we place our experiments. Though the experiments are not a part of the `parakeet` package, they are a part of the `Parakeet` repo. They are provided as examples and allow users to run our experiments out-of-the-box. Feel free to add new examples and contribute to `Parakeet`.
|
@ -34,7 +34,7 @@ def batch_text_id(minibatch, pad_id=0, dtype=np.int64):
|
|||
"""Pad sequences to text_ids to the largest length and batch them.
|
||||
|
||||
Args:
|
||||
minibatch (List[np.ndarray]): list of rank-1 arrays, shape(T,), dtype: np.int64, text_ids.
|
||||
minibatch (List[np.ndarray]): list of rank-1 arrays, shape(T,), dtype np.int64, text_ids.
|
||||
pad_id (int, optional): the id which correspond to the special pad token. Defaults to 0.
|
||||
dtype (np.dtype, optional): the data dtype of the output. Defaults to np.int64.
|
||||
|
||||
|
@ -75,7 +75,7 @@ def batch_wav(minibatch, pad_value=0., dtype=np.float32):
|
|||
"""pad audios to the largest length and batch them.
|
||||
|
||||
Args:
|
||||
minibatch (List[np.ndarray]): list of rank-1 float arrays(mono-channel audio, shape(T,)) or list of rank-2 float arrays(multi-channel audio, shape(C, T), C stands for numer of channels, T stands for length), dtype: float.
|
||||
minibatch (List[np.ndarray]): list of rank-1 float arrays(mono-channel audio, shape(T,)) or list of rank-2 float arrays(multi-channel audio, shape(C, T), C stands for number of channels, T stands for length), dtype float.
|
||||
pad_value (float, optional): the pad value. Defaults to 0..
|
||||
dtype (np.dtype, optional): the data type of the output. Defaults to np.float32.
|
||||
|
||||
|
@ -126,7 +126,7 @@ def batch_spec(minibatch, pad_value=0., dtype=np.float32):
|
|||
"""Pad spectra to the largest length and batch them.
|
||||
|
||||
Args:
|
||||
minibatch (List[np.ndarray]): list of rank-2 arrays of shape(F, T) for mono-channel spectrograms, or list of rank-3 arrays of shape(C, F, T) for multi-channel spectrograms(F stands for frequency bands.), dtype: float.
|
||||
minibatch (List[np.ndarray]): list of rank-2 arrays of shape(F, T) for mono-channel spectrograms, or list of rank-3 arrays of shape(C, F, T) for multi-channel spectrograms(F stands for frequency bands.), dtype float.
|
||||
pad_value (float, optional): the pad value. Defaults to 0..
|
||||
dtype (np.dtype, optional): data type of the output. Defaults to np.float32.
|
||||
|
||||
|
|
|
@ -60,16 +60,16 @@ class Clarinet(dg.Layer):
|
|||
"""Compute loss of Clarinet model.
|
||||
|
||||
Args:
|
||||
audio (Variable): shape(B, T_audio), dtype: float, ground truth waveform.
|
||||
mel (Variable): shape(B, F, T_mel), dtype: float, condition(mel spectrogram here).
|
||||
audio_start (Variable): shape(B, ), dtype: int64, audio starts positions.
|
||||
audio (Variable): shape(B, T_audio), dtype float32, ground truth waveform.
mel (Variable): shape(B, F, T_mel), dtype float32, condition (mel spectrogram here).
audio_start (Variable): shape(B, ), dtype int64, audio start positions.
|
||||
clip_kl (bool, optional): whether to clip kl_loss by maximum=100. Defaults to True.
|
||||
|
||||
Returns:
|
||||
Dict(str, Variable)
|
||||
loss (Variable): shape(1, ), dtype: float, total loss.
|
||||
kl (Variable): shape(1, ), dtype: float, kl divergence between the teacher's output distribution and student's output distribution.
|
||||
regularization (Variable): shape(1, ), dtype: float, a regularization term of the KL divergence.
|
||||
loss (Variable): shape(1, ), dtype float32, total loss.
kl (Variable): shape(1, ), dtype float32, kl divergence between the teacher's output distribution and student's output distribution.
regularization (Variable): shape(1, ), dtype float32, a regularization term of the KL divergence.
spectrogram_frame_loss (Variable): shape(1, ), dtype float32, stft loss, the L1-distance of the magnitudes of the spectrograms of the ground truth waveform and synthesized waveform.
|
||||
"""
|
||||
batch_size, audio_length = audio.shape # audio clip's length
|
||||
|
@ -170,12 +170,12 @@ class STFT(dg.Layer):
|
|||
"""Compute the stft transform.
|
||||
|
||||
Args:
|
||||
x (Variable): shape(B, T), dtype: float, the input waveform.
|
||||
x (Variable): shape(B, T), dtype float32, the input waveform.
|
||||
|
||||
Returns:
|
||||
(real, imag)
|
||||
real (Variable): shape(B, C, 1, T), dtype: float, the real part of the spectrogram. (C = 1 + n_fft // 2)
|
||||
imag (Variable): shape(B, C, 1, T), dtype: float, the image part of the spectrogram. (C = 1 + n_fft // 2)
|
||||
real (Variable): shape(B, C, 1, T), dtype float32, the real part of the spectrogram. (C = 1 + n_fft // 2)
imag (Variable): shape(B, C, 1, T), dtype float32, the imaginary part of the spectrogram. (C = 1 + n_fft // 2)
|
||||
"""
|
||||
# x(batch_size, time_steps)
|
||||
# pad it first with reflect mode
|
||||
|
@ -194,11 +194,11 @@ class STFT(dg.Layer):
|
|||
|
||||
Args:
|
||||
(real, imag)
|
||||
real (Variable): shape(B, C, 1, T), dtype: float, the real part of the spectrogram.
|
||||
imag (Variable): shape(B, C, 1, T), dtype: float, the image part of the spectrogram.
|
||||
real (Variable): shape(B, C, 1, T), dtype float32, the real part of the spectrogram.
imag (Variable): shape(B, C, 1, T), dtype float32, the imaginary part of the spectrogram.
|
||||
|
||||
Returns:
|
||||
Variable: shape(B, C, 1, T), dtype: float, the power spectrogram.
|
||||
Variable: shape(B, C, 1, T), dtype float32, the power spectrogram.
|
||||
"""
|
||||
real, imag = self(x)
|
||||
power = real**2 + imag**2
|
||||
|
@ -209,11 +209,11 @@ class STFT(dg.Layer):
|
|||
|
||||
Args:
|
||||
(real, imag)
|
||||
real (Variable): shape(B, C, 1, T), dtype: float, the real part of the spectrogram.
|
||||
imag (Variable): shape(B, C, 1, T), dtype: float, the image part of the spectrogram.
|
||||
real (Variable): shape(B, C, 1, T), dtype float32, the real part of the spectrogram.
imag (Variable): shape(B, C, 1, T), dtype float32, the imaginary part of the spectrogram.
|
||||
|
||||
Returns:
|
||||
Variable: shape(B, C, 1, T), dtype: float, the magnitude spectrogram. It is the square root of the power spectrogram.
|
||||
Variable: shape(B, C, 1, T), dtype float32, the magnitude spectrogram. It is the square root of the power spectrogram.
|
||||
"""
|
||||
power = self.power(x)
|
||||
magnitude = F.sqrt(power)
|
||||
|
|
|
@ -51,13 +51,13 @@ class ParallelWaveNet(dg.Layer):
|
|||
|
||||
Args:
|
||||
z (Variable): shape(B, T), random noise sampled from a standard gaussian distribution.
|
||||
condition (Variable, optional): shape(B, F, T), dtype: float, the upsampled condition. Defaults to None.
|
||||
condition (Variable, optional): shape(B, F, T), dtype float, the upsampled condition. Defaults to None.
|
||||
|
||||
Returns:
|
||||
(z, out_mu, out_log_std)
|
||||
z (Variable): shape(B, T), dtype: float, transformed noise, it is the synthesized waveform.
|
||||
out_mu (Variable): shape(B, T), dtype: float, means of the output distributions.
|
||||
out_log_std (Variable): shape(B, T), dtype: float, log standard deviations of the output distributions.
|
||||
z (Variable): shape(B, T), dtype float, transformed noise, it is the synthesized waveform.
|
||||
out_mu (Variable): shape(B, T), dtype float, means of the output distributions.
|
||||
out_log_std (Variable): shape(B, T), dtype float, log standard deviations of the output distributions.
|
||||
"""
|
||||
for i, flow in enumerate(self.flows):
|
||||
theta = flow(z, condition) # w, mu, log_std [0: T]
|
||||
|
|
|
@ -67,16 +67,16 @@ class Attention(dg.Layer):
|
|||
Compute contextualized representation and alignment scores.
|
||||
|
||||
Args:
|
||||
query (Variable): shape(B, T_dec, C_q), dtype: float, the query tensor, where C_q means the query dim.
|
||||
query (Variable): shape(B, T_dec, C_q), dtype float32, the query tensor, where C_q means the query dim.
|
||||
encoder_out (keys, values):
|
||||
keys (Variable): shape(B, T_enc, C_emb), dtype: float, the key representation from an encoder, where C_emb means embed dim.
|
||||
values (Variable): shape(B, T_enc, C_emb), dtype: float, the value representation from an encoder, where C_emb means embed dim.
|
||||
mask (Variable, optional): shape(B, T_enc), dtype: float, mask generated with valid text lengths. Pad tokens corresponds to 1, and valid tokens correspond to 0.
|
||||
keys (Variable): shape(B, T_enc, C_emb), dtype float32, the key representation from an encoder, where C_emb means embed dim.
|
||||
values (Variable): shape(B, T_enc, C_emb), dtype float32, the value representation from an encoder, where C_emb means embed dim.
|
||||
mask (Variable, optional): shape(B, T_enc), dtype float32, mask generated with valid text lengths. Pad tokens corresponds to 1, and valid tokens correspond to 0.
|
||||
last_attended (int, optional): The position that received the most attention at last time step. This is only used at inference.
|
||||
|
||||
Outputs:
|
||||
x (Variable): shape(B, T_dec, C_q), dtype: float, the contextualized representation from attention mechanism.
|
||||
attn_scores (Variable): shape(B, T_dec, T_enc), dtype: float, the alignment tensor, where T_dec means the number of decoder time steps and T_enc means number the number of decoder time steps.
|
||||
x (Variable): shape(B, T_dec, C_q), dtype float32, the contextualized representation from attention mechanism.
|
||||
attn_scores (Variable): shape(B, T_dec, T_enc), dtype float32, the alignment tensor, where T_dec means the number of decoder time steps and T_enc means the number of encoder time steps.
|
||||
"""
|
||||
keys, values = encoder_out
|
||||
residual = query
|
||||
|
|
|
@ -93,8 +93,8 @@ class Conv1DGLU(dg.Layer):
|
|||
def forward(self, x, speaker_embed=None):
|
||||
"""
|
||||
Args:
|
||||
x (Variable): shape(B, C_in, T), dtype: float, the input of Conv1DGLU layer, where B means batch_size, C_in means the input channels T means input time steps.
|
||||
speaker_embed (Variable): shape(B, C_sp), dtype: float, speaker embed, where C_sp means speaker embedding size.
|
||||
x (Variable): shape(B, C_in, T), dtype float32, the input of Conv1DGLU layer, where B means batch_size, C_in means the input channels and T means input time steps.
|
||||
speaker_embed (Variable): shape(B, C_sp), dtype float32, speaker embed, where C_sp means speaker embedding size.
|
||||
|
||||
Returns:
|
||||
x (Variable): shape(B, C_out, T), the output of Conv1DGLU, where
|
||||
|
@ -127,8 +127,8 @@ class Conv1DGLU(dg.Layer):
|
|||
Takes a step of inputs and returns a step of outputs. It works similarly to the `forward` method, but in a `step-in-step-out` fashion.
|
||||
|
||||
Args:
|
||||
x_t (Variable): shape(B, C_in, T=1), dtype: float, the input of Conv1DGLU layer, where B means batch_size, C_in means the input channels.
|
||||
speaker_embed (Variable): Shape(B, C_sp), dtype: float, speaker embed, where C_sp means speaker embedding size.
|
||||
x_t (Variable): shape(B, C_in, T=1), dtype float32, the input of Conv1DGLU layer, where B means batch_size, C_in means the input channels.
|
||||
speaker_embed (Variable): Shape(B, C_sp), dtype float32, speaker embed, where C_sp means speaker embedding size.
|
||||
|
||||
Returns:
|
||||
x (Variable): shape(B, C_out), the output of Conv1DGLU, where C_out means the `num_filter`.
|
||||
|
|
|
@ -257,9 +257,9 @@ class Converter(dg.Layer):
|
|||
Convert mel spectrogram or decoder hidden states to linear spectrogram.
|
||||
|
||||
Args:
|
||||
x (Variable): Shape(B, T_mel, C_in), dtype: float, converter inputs, where C_in means the input channel for the converter. Note that it can be either C_mel (channel of mel spectrogram) or C_dec // r.
|
||||
x (Variable): Shape(B, T_mel, C_in), dtype float32, converter inputs, where C_in means the input channel for the converter. Note that it can be either C_mel (channel of mel spectrogram) or C_dec // r.
|
||||
When using mel_spectrogram as the input of the converter, C_in = C_mel; and when using decoder states as the input of the converter, C_in = C_dec // r.
|
||||
speaker_embed (Variable, optional): shape(B, C_sp), dtype: float, speaker embedding, where C_sp means the speaker embedding size.
|
||||
speaker_embed (Variable, optional): shape(B, C_sp), dtype float32, speaker embedding, where C_sp means the speaker embedding size.
|
||||
|
||||
Returns:
|
||||
out (Variable): Shape(B, T_lin, C_lin), the output linear spectrogram, where C_lin means the channel of the linear spectrogram and T_lin means the length (time steps) of the linear spectrogram. T_lin = time_upsampling * T_mel, which depends on the time_upsampling of the converter.
|
||||
|
|
|
@ -41,7 +41,7 @@ def gen_mask(valid_lengths, max_len, dtype="float32"):
|
|||
dtype (str, optional): A string that specifies the data type of the returned mask. Defaults to 'float32'.
|
||||
|
||||
Returns:
|
||||
mask (Variable): shape(B, max_len), dtype: float, a mask computed from valid lengths.
|
||||
mask (Variable): shape(B, max_len), dtype float32, a mask computed from valid lengths.
|
||||
"""
|
||||
mask = F.sequence_mask(valid_lengths, maxlen=max_len, dtype=dtype)
|
||||
mask = 1 - mask
|
||||
|
@ -261,8 +261,8 @@ class Decoder(dg.Layer):
|
|||
|
||||
Args:
|
||||
encoder_out (keys, values):
|
||||
keys (Variable): shape(B, T_enc, C_emb), dtype: float, the key representation from an encoder, where C_emb means text embedding size.
|
||||
values (Variable): shape(B, T_enc, C_emb), dtype: float, the value representation from an encoder, where C_emb means text embedding size.
|
||||
keys (Variable): shape(B, T_enc, C_emb), dtype float32, the key representation from an encoder, where C_emb means text embedding size.
|
||||
values (Variable): shape(B, T_enc, C_emb), dtype float32, the value representation from an encoder, where C_emb means text embedding size.
|
||||
lengths (Variable): shape(batch_size,), dtype: int64, valid lengths of text inputs for each example.
|
||||
inputs (Variable): shape(B, T_mel, C_mel), ground truth mel-spectrogram, which is used as decoder inputs when training.
|
||||
text_positions (Variable): shape(B, T_enc), dtype: int64. Positions indices for text inputs for the encoder, where T_enc means the encoder timesteps.
|
||||
|
@ -270,10 +270,10 @@ class Decoder(dg.Layer):
|
|||
speaker_embed (Variable, optionals): shape(batch_size, speaker_dim), speaker embedding, only used for multispeaker model.
|
||||
|
||||
Returns:
|
||||
outputs (Variable): shape(B, T_mel, C_mel), dtype: float, decoder outputs, where C_mel means the channels of mel-spectrogram, T_mel means the length(time steps) of mel spectrogram.
|
||||
alignments (Variable): shape(N, B, T_mel // r, T_enc), dtype: float, the alignment tensor between the decoder and the encoder, where N means number of Attention Layers, T_mel means the length of mel spectrogram, r means the outputs per decoder step, T_enc means the encoder time steps.
|
||||
done (Variable): shape(B, T_mel // r), dtype: float, probability that the last frame has been generated.
|
||||
decoder_states (Variable): shape(B, T_mel, C_dec // r), ddtype: float, decoder hidden states, where C_dec means the channels of decoder states (the output channels of the last `convolutions`). Note that it should be perfectlt devided by `r`.
|
||||
outputs (Variable): shape(B, T_mel, C_mel), dtype float32, decoder outputs, where C_mel means the channels of mel-spectrogram, T_mel means the length(time steps) of mel spectrogram.
|
||||
alignments (Variable): shape(N, B, T_mel // r, T_enc), dtype float32, the alignment tensor between the decoder and the encoder, where N means number of Attention Layers, T_mel means the length of mel spectrogram, r means the outputs per decoder step, T_enc means the encoder time steps.
|
||||
done (Variable): shape(B, T_mel // r), dtype float32, probability that the last frame has been generated.
|
||||
decoder_states (Variable): shape(B, T_mel, C_dec // r), dtype float32, decoder hidden states, where C_dec means the channels of decoder states (the output channels of the last `convolutions`). Note that it should be perfectly divisible by `r`.
|
||||
"""
|
||||
if speaker_embed is not None:
|
||||
speaker_embed = F.dropout(
|
||||
|
@ -376,17 +376,17 @@ class Decoder(dg.Layer):
|
|||
|
||||
Args:
|
||||
encoder_out (keys, values):
|
||||
keys (Variable): shape(B, T_enc, C_emb), dtype: float, the key representation from an encoder, where C_emb means text embedding size.
|
||||
values (Variable): shape(B, T_enc, C_emb), dtype: float, the value representation from an encoder, where C_emb means text embedding size.
|
||||
keys (Variable): shape(B, T_enc, C_emb), dtype float32, the key representation from an encoder, where C_emb means text embedding size.
|
||||
values (Variable): shape(B, T_enc, C_emb), dtype float32, the value representation from an encoder, where C_emb means text embedding size.
|
||||
text_positions (Variable): shape(B, T_enc), dtype: int64. Positions indices for text inputs for the encoder, where T_enc means the encoder timesteps.
|
||||
speaker_embed (Variable, optional): shape(B, C_sp), speaker embedding, only used for multispeaker model.
|
||||
test_inputs (Variable, optional): shape(B, T_test, C_mel). test input, it is only used for debugging. Defaults to None.
|
||||
|
||||
Returns:
|
||||
outputs (Variable): shape(B, T_mel, C_mel), dtype: float, decoder outputs, where C_mel means the channels of mel-spectrogram, T_mel means the length(time steps) of mel spectrogram.
|
||||
alignments (Variable): shape(N, B, T_mel // r, T_enc), dtype: float, the alignment tensor between the decoder and the encoder, where N means number of Attention Layers, T_mel means the length of mel spectrogram, r means the outputs per decoder step, T_enc means the encoder time steps.
|
||||
done (Variable): shape(B, T_mel // r), dtype: float, probability that the last frame has been generated. If the probability is larger than 0.5 at a step, the generation stops.
|
||||
decoder_states (Variable): shape(B, T_mel, C_dec // r), ddtype: float, decoder hidden states, where C_dec means the channels of decoder states (the output channels of the last `convolutions`). Note that it should be perfectlt devided by `r`.
|
||||
outputs (Variable): shape(B, T_mel, C_mel), dtype float32, decoder outputs, where C_mel means the channels of mel-spectrogram, T_mel means the length(time steps) of mel spectrogram.
|
||||
alignments (Variable): shape(N, B, T_mel // r, T_enc), dtype float32, the alignment tensor between the decoder and the encoder, where N means number of Attention Layers, T_mel means the length of mel spectrogram, r means the outputs per decoder step, T_enc means the encoder time steps.
|
||||
done (Variable): shape(B, T_mel // r), dtype float32, probability that the last frame has been generated. If the probability is larger than 0.5 at a step, the generation stops.
|
||||
decoder_states (Variable): shape(B, T_mel, C_dec // r), dtype float32, decoder hidden states, where C_dec means the channels of decoder states (the output channels of the last `convolutions`). Note that it should be perfectly divisible by `r`.
|
||||
|
||||
Note:
|
||||
Only single instance inference is supported now, so B = 1.
|
||||
|
|
|
@ -112,11 +112,11 @@ class Encoder(dg.Layer):
|
|||
|
||||
Args:
|
||||
x (Variable): shape(B, T_enc), dtype: int64. Ihe input text indices. T_enc means the timesteps of decoder input x.
|
||||
speaker_embed (Variable, optional): shape(B, C_sp), dtype: float, speaker embeddings. This arg is not None only when the model is a multispeaker model.
|
||||
speaker_embed (Variable, optional): shape(B, C_sp), dtype float32, speaker embeddings. This arg is not None only when the model is a multispeaker model.
|
||||
|
||||
Returns:
|
||||
keys (Variable), Shape(B, T_enc, C_emb), dtype: float, the encoded epresentation for keys, where C_emb menas the text embedding size.
|
||||
values (Variable), Shape(B, T_enc, C_emb), dtype: float, the encoded representation for values.
|
||||
keys (Variable), Shape(B, T_enc, C_emb), dtype float32, the encoded representation for keys, where C_emb means the text embedding size.
|
||||
values (Variable), Shape(B, T_enc, C_emb), dtype float32, the encoded representation for values.
|
||||
"""
|
||||
x = self.embed(x)
|
||||
x = F.dropout(
|
||||
|
|
|
@ -23,10 +23,10 @@ import paddle.fluid.dygraph as dg
|
|||
def masked_mean(inputs, mask):
|
||||
"""
|
||||
Args:
|
||||
inputs (Variable): shape(B, T, C), dtype: float, the input.
|
||||
mask (Variable): shape(B, T), dtype: float, a mask.
|
||||
inputs (Variable): shape(B, T, C), dtype float32, the input.
|
||||
mask (Variable): shape(B, T), dtype float32, a mask.
|
||||
Returns:
|
||||
loss (Variable): shape(1, ), dtype: float, masked mean.
|
||||
loss (Variable): shape(1, ), dtype float32, masked mean.
|
||||
"""
|
||||
channels = inputs.shape[-1]
|
||||
masked_inputs = F.elementwise_mul(inputs, mask, axis=0)
|
||||
|
@ -46,7 +46,7 @@ def guided_attention(N, max_N, T, max_T, g):
|
|||
g (float): sigma to adjust the degree of diagonal guide.
|
||||
|
||||
Returns:
|
||||
np.ndarray: shape(max_N, max_T), dtype: float, the diagonal guide.
|
||||
np.ndarray: shape(max_N, max_T), dtype float32, the diagonal guide.
|
||||
"""
|
||||
W = np.zeros((max_N, max_T), dtype=np.float32)
|
||||
for n in range(N):
|
||||
|
@ -66,7 +66,7 @@ def guided_attentions(encoder_lengths, decoder_lengths, max_decoder_len,
|
|||
g (float, optional): sigma to adjust the degree of diagonal guide.. Defaults to 0.2.
|
||||
|
||||
Returns:
|
||||
np.ndarray: shape(B, max_T, max_N), dtype: float, the diagonal guide. (max_N: max encoder length, max_T: max decoder length.)
|
||||
np.ndarray: shape(B, max_T, max_N), dtype float32, the diagonal guide. (max_N: max encoder length, max_T: max decoder length.)
|
||||
"""
|
||||
B = len(encoder_lengths)
|
||||
max_input_len = encoder_lengths.max()
|
||||
|
@ -111,13 +111,13 @@ class TTSLoss(object):
|
|||
"""L1 loss for spectrogram.
|
||||
|
||||
Args:
|
||||
prediction (Variable): shape(B, T, C), dtype: float, predicted spectrogram.
|
||||
target (Variable): shape(B, T, C), dtype: float, target spectrogram.
|
||||
prediction (Variable): shape(B, T, C), dtype float32, predicted spectrogram.
|
||||
target (Variable): shape(B, T, C), dtype float32, target spectrogram.
|
||||
mask (Variable): shape(B, T), mask.
|
||||
priority_bin (int, optional): frequency bands for linear spectrogram loss to be prioritized. Defaults to None.
|
||||
|
||||
Returns:
|
||||
Variable: shape(1,), dtype: float, l1 loss(with mask and possibly priority bin applied.)
|
||||
Variable: shape(1,), dtype float32, l1 loss(with mask and possibly priority bin applied.)
|
||||
"""
|
||||
abs_diff = F.abs(prediction - target)
|
||||
|
||||
|
@ -149,12 +149,12 @@ class TTSLoss(object):
|
|||
"""Binary cross entropy loss for spectrogram. All the values in the spectrogram are treated as logits in a logistic regression.
|
||||
|
||||
Args:
|
||||
prediction (Variable): shape(B, T, C), dtype: float, predicted spectrogram.
|
||||
target (Variable): shape(B, T, C), dtype: float, target spectrogram.
|
||||
prediction (Variable): shape(B, T, C), dtype float32, predicted spectrogram.
|
||||
target (Variable): shape(B, T, C), dtype float32, target spectrogram.
|
||||
mask (Variable): shape(B, T), mask.
|
||||
|
||||
Returns:
|
||||
Variable: shape(1,), dtype: float, binary cross entropy loss.
|
||||
Variable: shape(1,), dtype float32, binary cross entropy loss.
|
||||
"""
|
||||
flattened_prediction = F.reshape(prediction, [-1, 1])
|
||||
flattened_target = F.reshape(target, [-1, 1])
|
||||
|
@ -175,11 +175,11 @@ class TTSLoss(object):
|
|||
"""Compute done loss
|
||||
|
||||
Args:
|
||||
done_hat (Variable): shape(B, T), dtype: float, predicted done probability(the probability that the final frame has been generated.)
|
||||
done (Variable): shape(B, T), dtype: float, ground truth done probability(the probability that the final frame has been generated.)
|
||||
done_hat (Variable): shape(B, T), dtype float32, predicted done probability(the probability that the final frame has been generated.)
|
||||
done (Variable): shape(B, T), dtype float32, ground truth done probability(the probability that the final frame has been generated.)
|
||||
|
||||
Returns:
|
||||
Variable: shape(1, ), dtype: float, done loss.
|
||||
Variable: shape(1, ), dtype float32, done loss.
|
||||
"""
|
||||
flat_done_hat = F.reshape(done_hat, [-1, 1])
|
||||
flat_done = F.reshape(done, [-1, 1])
|
||||
|
@ -193,12 +193,12 @@ class TTSLoss(object):
|
|||
Given valid encoder_lengths and decoder_lengths, compute a diagonal guide, and compute loss from the predicted attention and the guide.
|
||||
|
||||
Args:
|
||||
predicted_attention (Variable): shape(*, B, T_dec, T_enc), dtype: float, the alignment tensor, where B means batch size, T_dec means number of time steps of the decoder, T_enc means the number of time steps of the encoder, * means other possible dimensions.
|
||||
predicted_attention (Variable): shape(*, B, T_dec, T_enc), dtype float32, the alignment tensor, where B means batch size, T_dec means number of time steps of the decoder, T_enc means the number of time steps of the encoder, * means other possible dimensions.
|
||||
input_lengths (numpy.ndarray): shape(B,), dtype:int64, valid lengths (time steps) of encoder outputs.
|
||||
target_lengths (numpy.ndarray): shape(batch_size,), dtype:int64, valid lengths (time steps) of decoder outputs.
|
||||
|
||||
Returns:
|
||||
loss (Variable): shape(1, ), dtype: float, attention loss.
|
||||
loss (Variable): shape(1, ), dtype float32, attention loss.
|
||||
"""
|
||||
n_attention, batch_size, max_target_len, max_input_len = (
|
||||
predicted_attention.shape)
|
||||
|
@ -226,13 +226,13 @@ class TTSLoss(object):
|
|||
"""Total loss
|
||||
|
||||
Args:
|
||||
mel_hyp (Variable): shape(B, T, C_mel), dtype, float, predicted mel spectrogram.
|
||||
lin_hyp (Variable): shape(B, T, C_lin), dtype, float, predicted linear spectrogram.
|
||||
done_hyp (Variable): shape(B, T), dtype, float, predicted done probability.
|
||||
attn_hyp (Variable): shape(N, B, T_dec, T_enc), dtype: float, predicted attention.
|
||||
mel_ref (Variable): shape(B, T, C_mel), dtype, float, ground truth mel spectrogram.
|
||||
lin_ref (Variable): shape(B, T, C_lin), dtype, float, ground truth linear spectrogram.
|
||||
done_ref (Variable): shape(B, T), dtype, float, ground truth done flag.
|
||||
mel_hyp (Variable): shape(B, T, C_mel), dtype float32, predicted mel spectrogram.
|
||||
lin_hyp (Variable): shape(B, T, C_lin), dtype float32, predicted linear spectrogram.
|
||||
done_hyp (Variable): shape(B, T), dtype float32, predicted done probability.
|
||||
attn_hyp (Variable): shape(N, B, T_dec, T_enc), dtype float32, predicted attention.
|
||||
mel_ref (Variable): shape(B, T, C_mel), dtype float32, ground truth mel spectrogram.
|
||||
lin_ref (Variable): shape(B, T, C_lin), dtype float32, ground truth linear spectrogram.
|
||||
done_ref (Variable): shape(B, T), dtype float32, ground truth done flag.
|
||||
input_lengths (Variable): shape(B, ), dtype: int, encoder valid lengths.
|
||||
n_frames (Variable): shape(B, ), dtype: int, decoder valid lengths.
|
||||
compute_lin_loss (bool, optional): whether to compute linear loss. Defaults to True.
|
||||
|
|
|
@ -55,10 +55,10 @@ class DeepVoice3(dg.Layer):
|
|||
|
||||
Returns:
|
||||
(mel_outputs, linear_outputs, alignments, done)
|
||||
mel_outputs (Variable): shape(B, T_mel, C_mel), dtype: float, predicted mel spectrogram.
|
||||
mel_outputs (Variable): shape(B, T_mel, C_mel), dtype: float, predicted mel spectrogram.
|
||||
alignments (Variable): shape(N, B, T_dec, T_enc), dtype: float, predicted attention.
|
||||
done (Variable): shape(B, T_dec), dtype: float, predicted done probability.
|
||||
mel_outputs (Variable): shape(B, T_mel, C_mel), dtype float32, predicted mel spectrogram.
|
||||
linear_outputs (Variable): shape(B, T_lin, C_lin), dtype float32, predicted linear spectrogram.
|
||||
alignments (Variable): shape(N, B, T_dec, T_enc), dtype float32, predicted attention.
|
||||
done (Variable): shape(B, T_dec), dtype float32, predicted done probability.
|
||||
(T_mel: time steps of mel spectrogram, T_lin: time steps of linear spectrogram, T_dec: time steps of decoder, T_enc: time steps of encoder.)
|
||||
"""
|
||||
if hasattr(self, "speaker_embedding"):
|
||||
|
@ -85,10 +85,10 @@ class DeepVoice3(dg.Layer):
|
|||
|
||||
Returns:
|
||||
(mel_outputs, linear_outputs, alignments, done)
|
||||
mel_outputs (Variable): shape(B, T_mel, C_mel), dtype: float, predicted mel spectrogram.
|
||||
mel_outputs (Variable): shape(B, T_mel, C_mel), dtype: float, predicted mel spectrogram.
|
||||
alignments (Variable): shape(B, T_dec, T_enc), dtype: float, predicted average attention of all attention layers.
|
||||
done (Variable): shape(B, T_dec), dtype: float, predicted done probability.
|
||||
mel_outputs (Variable): shape(B, T_mel, C_mel), dtype float32, predicted mel spectrogram.
|
||||
linear_outputs (Variable): shape(B, T_lin, C_lin), dtype float32, predicted linear spectrogram.
|
||||
alignments (Variable): shape(B, T_dec, T_enc), dtype float32, predicted average attention of all attention layers.
|
||||
done (Variable): shape(B, T_dec), dtype float32, predicted done probability.
|
||||
(T_mel: time steps of mel spectrogram, T_lin: time steps of linear spectrogram, T_dec: time steps of decoder, T_enc: time steps of encoder.)
|
||||
"""
|
||||
if hasattr(self, "speaker_embedding"):
|
||||
|
|
|
@ -22,7 +22,7 @@ def compute_position_embedding(radians, speaker_position_rate):
|
|||
"""Compute sin/cos interleaved matrix from the radians.
|
||||
|
||||
Arg:
|
||||
radians (Variable): shape(n_vocab, embed_dim), dtype: float, the radians matrix.
|
||||
radians (Variable): shape(n_vocab, embed_dim), dtype float32, the radians matrix.
|
||||
speaker_position_rate (Variable): shape(B, ), speaker positioning rate.
|
||||
|
||||
Returns:
|
||||
|
@ -98,7 +98,7 @@ class PositionEmbedding(dg.Layer):
|
|||
example. It can also be a Variable with shape (B, ), which
|
||||
contains a speaker position rate for each utterance.
|
||||
Returns:
|
||||
out (Variable): shape(B, T, C_pos), dtype: float, position embedding, where C_pos
|
||||
out (Variable): shape(B, T, C_pos), dtype float32, position embedding, where C_pos
|
||||
means position embedding size.
|
||||
"""
|
||||
batch_size, time_steps = indices.shape
|
||||
|
|
|
@ -30,7 +30,7 @@ def crop(x, audio_start, audio_length):
|
|||
"""Crop the upsampled condition to match audio_length. The upsampled condition has the same time steps as the whole audio does. But since audios are sliced to 0.5 seconds randomly while conditions are not, upsampled conditions should also be sliced to extaclt match the time steps of the audio slice.
|
||||
|
||||
Args:
|
||||
x (Variable): shape(B, C, T), dtype: float, the upsample condition.
|
||||
x (Variable): shape(B, C, T), dtype float32, the upsampled condition.
audio_start (Variable): shape(B, ), dtype int64, the index of the starting point.
audio_length (int): the length of the audio (number of samples it contains).
|
||||
|
||||
|
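To make the cropping contract concrete, here is a numpy sketch of what the `crop` docstring describes: each utterance's upsampled condition is sliced so it covers exactly the samples of the randomly sliced audio clip. The helper below is illustrative only and does not mirror the fluid implementation.

```python
import numpy as np

def crop_sketch(x, audio_start, audio_length):
    # x: (B, C, T) upsampled condition, audio_start: (B, ) int64, audio_length: int
    return np.stack([xi[:, s:s + audio_length] for xi, s in zip(x, audio_start)])

x = np.random.randn(2, 80, 16000).astype("float32")
starts = np.array([1000, 2500], dtype="int64")
print(crop_sketch(x, starts, 8000).shape)   # (2, 80, 8000)
```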
@@ -79,10 +79,10 @@ class UpsampleNet(dg.Layer):
         """Compute the upsampled condition.

         Args:
-            x (Variable): shape(B, F, T), dtype: float, the condition (mel spectrogram here; F means the frequency bands). In the internal Conv2DTransposes, the frequency dimension is treated as the `height` dimension instead of `in_channels`.
+            x (Variable): shape(B, F, T), dtype float32, the condition (mel spectrogram here; F means the frequency bands). In the internal Conv2DTransposes, the frequency dimension is treated as the `height` dimension instead of `in_channels`.

         Returns:
-            Variable: shape(B, F, T * upscale_factor), dtype: float, the upsampled condition.
+            Variable: shape(B, F, T * upscale_factor), dtype float32, the upsampled condition.
         """
         x = F.unsqueeze(x, axes=[1])
         for sublayer in self.upsample_convs:
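The shape contract of `UpsampleNet.forward` is simply T to T * upscale_factor along the time axis. The sketch below illustrates that contract with a naive repeat upsampling in numpy; the real module uses stacked Conv2DTransposes, and `upscale_factor` here is a hypothetical hop size.

```python
import numpy as np

mel = np.random.randn(2, 80, 50).astype("float32")    # (B, F, T)
upscale_factor = 256                                   # hypothetical hop size
upsampled = np.repeat(mel, upscale_factor, axis=-1)    # naive repeat upsampling
print(upsampled.shape)                                 # (2, 80, 12800)
```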
@@ -108,8 +108,8 @@ class ConditionalWavenet(dg.Layer):
         """Compute the output distribution given the mel spectrogram and the input (for teacher-forced training).

         Args:
-            audio (Variable): shape(B, T_audio), dtype: float, ground truth waveform, used for teacher-forced training.
-            mel (Variable): shape(B, F, T_mel), dtype: float, mel spectrogram. Note that it is the spectrogram for the whole utterance.
+            audio (Variable): shape(B, T_audio), dtype float32, ground truth waveform, used for teacher-forced training.
+            mel (Variable): shape(B, F, T_mel), dtype float32, mel spectrogram. Note that it is the spectrogram for the whole utterance.
             audio_start (Variable): shape(B, ), dtype: int, audio slices' start positions for each utterance.

         Returns:
@@ -130,11 +130,11 @@ class ConditionalWavenet(dg.Layer):
         """Compute the loss with respect to the output distribution and the target audio.

         Args:
-            y (Variable): shape(B, T - 1, C_output), dtype: float, parameters of the output distribution.
-            t (Variable): shape(B, T), dtype: float, target waveform.
+            y (Variable): shape(B, T - 1, C_output), dtype float32, parameters of the output distribution.
+            t (Variable): shape(B, T), dtype float32, target waveform.

         Returns:
-            Variable: shape(1, ), dtype: float, the loss.
+            Variable: shape(1, ), dtype float32, the loss.
         """
         t = t[:, 1:]
         loss = self.decoder.loss(y, t)
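The `t = t[:, 1:]` line above encodes the teacher-forcing alignment: the network predicts the next sample, so the target drops its first sample and lines up with the T - 1 predictions. A tiny numpy check of that alignment, with made-up shapes:

```python
import numpy as np

T = 8
t = np.arange(T, dtype="float32")[None, :]   # (B=1, T) target waveform
y = np.random.randn(1, T - 1, 10)            # (B, T-1, C_output) predicted params
assert y.shape[1] == t[:, 1:].shape[1]       # both cover T - 1 time steps
```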
@@ -144,10 +144,10 @@ class ConditionalWavenet(dg.Layer):
         """Sample from the output distribution.

         Args:
-            y (Variable): shape(B, T, C_output), dtype: float, parameters of the output distribution.
+            y (Variable): shape(B, T, C_output), dtype float32, parameters of the output distribution.

         Returns:
-            Variable: shape(B, T), dtype: float, sampled waveform from the output distribution.
+            Variable: shape(B, T), dtype float32, sampled waveform from the output distribution.
         """
         samples = self.decoder.sample(y)
         return samples
@@ -48,7 +48,7 @@ def dequantize(quantized, n_bands):
         n_bands (int): number of bands. The input integer Tensor's value is in the range [0, n_bands).

     Returns:
-        Variable: the dequantized tensor, dtype: float32.
+        Variable: the dequantized tensor, dtype float32.
     """
     value = (F.cast(quantized, "float32") + 0.5) * (2.0 / n_bands) - 1.0
     return value
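A quick round-trip sketch of the uniform quantization that `dequantize` undoes. The dequantize line mirrors the formula shown in the hunk; the matching quantize step is inferred as its inverse and is an assumption, not necessarily Parakeet's actual `quantize` implementation.

```python
import numpy as np

n_bands = 256
x = np.linspace(-1.0, 1.0, 5).astype("float32")
# assumed inverse of the dequantize formula above
quantized = np.clip((x + 1.0) / 2.0 * n_bands, 0, n_bands - 1).astype("int64")
dequantized = (quantized.astype("float32") + 0.5) * (2.0 / n_bands) - 1.0
print(np.abs(x - dequantized).max())   # bounded by half a quantization step
```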
@@ -93,7 +93,7 @@ class ResidualBlock(dg.Layer):
         """Conv1D gated-tanh block.

         Args:
-            x (Variable): shape(B, C_res, T), the input. (B stands for batch_size, C_res stands for residual channels, T stands for time steps.) dtype: float.
+            x (Variable): shape(B, C_res, T), the input. (B stands for batch_size, C_res stands for residual channels, T stands for time steps.) dtype float32.
            condition (Variable, optional): shape(B, C_cond, T), the condition. It has been upsampled in time steps, so it has the same time steps as the input does. (C_cond stands for the condition's channels.) Defaults to None.

         Returns:
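The block is documented as a Conv1D gated-tanh block. The gating nonlinearity itself, sketched in numpy below, splits the (conditioned) convolution output into two halves along the channel axis and multiplies tanh of one by sigmoid of the other; the real block additionally produces residual and skip outputs, which are omitted here.

```python
import numpy as np

def gated_tanh(conv_out):
    # split the dilated conv output along channels into content and gate halves
    a, b = np.split(conv_out, 2, axis=1)
    return np.tanh(a) * (1.0 / (1.0 + np.exp(-b)))

z = np.random.randn(2, 2 * 64, 100).astype("float32")   # (B, 2 * C_res, T)
print(gated_tanh(z).shape)                               # (2, 64, 100)
```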
@@ -131,8 +131,8 @@ class ResidualBlock(dg.Layer):
         """Add a step input. This method works similarly to `forward` but in a `step-in-step-out` fashion.

         Args:
-            x (Variable): shape(B, C_res, T=1), input for a step, dtype: float.
-            condition (Variable, optional): shape(B, C_cond, T=1), condition for a step, dtype: float. Defaults to None.
+            x (Variable): shape(B, C_res, T=1), input for a step, dtype float32.
+            condition (Variable, optional): shape(B, C_cond, T=1), condition for a step, dtype float32. Defaults to None.

         Returns:
             (residual, skip_connection)
@@ -182,11 +182,11 @@ class ResidualNet(dg.Layer):
     def forward(self, x, condition=None):
         """
         Args:
-            x (Variable): shape(B, C_res, T), dtype: float, the input. (B stands for batch_size, C_res stands for residual channels, T stands for time steps.)
-            condition (Variable, optional): shape(B, C_cond, T), dtype: float, the condition. It has been upsampled in time steps, so it has the same time steps as the input does. (C_cond stands for the condition's channels.) Defaults to None.
+            x (Variable): shape(B, C_res, T), dtype float32, the input. (B stands for batch_size, C_res stands for residual channels, T stands for time steps.)
+            condition (Variable, optional): shape(B, C_cond, T), dtype float32, the condition. It has been upsampled in time steps, so it has the same time steps as the input does. (C_cond stands for the condition's channels.) Defaults to None.

         Returns:
-            skip_connection (Variable): shape(B, C_res, T), dtype: float, the output.
+            skip_connection (Variable): shape(B, C_res, T), dtype float32, the output.
         """
         for i, func in enumerate(self.residual_blocks):
             x, skip = func(x, condition)
@@ -207,11 +207,11 @@ class ResidualNet(dg.Layer):
         """Add a step input. This method works similarly to `forward` but in a `step-in-step-out` fashion.

         Args:
-            x (Variable): shape(B, C_res, T=1), dtype: float, input for a step.
-            condition (Variable, optional): shape(B, C_cond, T=1), dtype: float, condition for a step. Defaults to None.
+            x (Variable): shape(B, C_res, T=1), dtype float32, input for a step.
+            condition (Variable, optional): shape(B, C_cond, T=1), dtype float32, condition for a step. Defaults to None.

         Returns:
-            skip_connection (Variable): shape(B, C_res, T=1), dtype: float, the output for a step.
+            skip_connection (Variable): shape(B, C_res, T=1), dtype float32, the output for a step.
         """

         for i, func in enumerate(self.residual_blocks):
@@ -269,11 +269,11 @@ class WaveNet(dg.Layer):
         """Compute the output distribution (represented by its parameters).

         Args:
-            x (Variable): shape(B, T), dtype: float, the input waveform.
-            condition (Variable, optional): shape(B, C_cond, T), dtype: float, the upsampled condition. Defaults to None.
+            x (Variable): shape(B, T), dtype float32, the input waveform.
+            condition (Variable, optional): shape(B, C_cond, T), dtype float32, the upsampled condition. Defaults to None.

         Returns:
-            Variable: shape(B, T, C_output), dtype: float, the parameters of the output distribution.
+            Variable: shape(B, T, C_output), dtype float32, the parameters of the output distribution.
         """

         # Causal Conv
@@ -304,11 +304,11 @@ class WaveNet(dg.Layer):
         """Compute the output distribution (represented by its parameters) for a step. It works similarly to the `forward` method but in a `step-in-step-out` fashion.

         Args:
-            x (Variable): shape(B, T=1), dtype: float, a step of the input waveform.
-            condition (Variable, optional): shape(B, C_cond, T=1), dtype: float, a step of the upsampled condition. Defaults to None.
+            x (Variable): shape(B, T=1), dtype float32, a step of the input waveform.
+            condition (Variable, optional): shape(B, C_cond, T=1), dtype float32, a step of the upsampled condition. Defaults to None.

         Returns:
-            Variable: shape(B, T=1, C_output), dtype: float, the parameters of the output distribution.
+            Variable: shape(B, T=1, C_output), dtype float32, the parameters of the output distribution.
         """
         # Causal Conv
         if self.loss_type == "softmax":
@@ -332,11 +332,11 @@ class WaveNet(dg.Layer):
         """Compute the loss where the output distribution is a categorical distribution.

         Args:
-            y (Variable): shape(B, T, C_output), dtype: float, the logits of the output distribution.
-            t (Variable): shape(B, T), dtype: float, the target audio. Note that the target's corresponding time index is one step ahead of the output distribution. And output distributions whose input contains padding are neglected in loss computation.
+            y (Variable): shape(B, T, C_output), dtype float32, the logits of the output distribution.
+            t (Variable): shape(B, T), dtype float32, the target audio. Note that the target's corresponding time index is one step ahead of the output distribution. And output distributions whose input contains padding are neglected in loss computation.

         Returns:
-            Variable: shape(1, ), dtype: float, the loss.
+            Variable: shape(1, ), dtype float32, the loss.
         """
         # context size is not taken into account
         y = y[:, self.context_size:, :]
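A hedged sketch of the categorical loss described above: logits over C_output quantization bins are scored with cross entropy against integer targets, after dropping the first `context_size` outputs whose receptive field contains padding. The real `compute_softmax_loss` also shifts the target one step, quantizes the float waveform into bins, and masks padding; those steps are omitted here, and all names are illustrative.

```python
import numpy as np

def softmax_xent(logits, labels):
    # numerically stable log-softmax followed by negative log likelihood
    z = logits - logits.max(-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(-1, keepdims=True))
    return -np.take_along_axis(logp, labels[..., None], axis=-1).mean()

context_size, B, T, C = 3, 2, 12, 256
y = np.random.randn(B, T, C)                  # (B, T, C_output) logits
t = np.random.randint(0, C, size=(B, T))      # already-quantized target bins
loss = softmax_xent(y[:, context_size:, :], t[:, context_size:])
print(float(loss))
```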
@@ -371,11 +371,11 @@ class WaveNet(dg.Layer):
         """Compute the loss where the output distribution is a mixture of Gaussians.

         Args:
-            y (Variable): shape(B, T, C_output), dtype: float, the parameters of the output distribution. It is the concatenation of 3 parts: the logits of every distribution, the mean of each distribution, and the log standard deviation of each distribution. Each part's shape is (B, T, n_mixture), where `n_mixture` means the number of Gaussians in the mixture.
-            t (Variable): shape(B, T), dtype: float, the target audio. Note that the target's corresponding time index is one step ahead of the output distribution. And output distributions whose input contains padding are neglected in loss computation.
+            y (Variable): shape(B, T, C_output), dtype float32, the parameters of the output distribution. It is the concatenation of 3 parts: the logits of every distribution, the mean of each distribution, and the log standard deviation of each distribution. Each part's shape is (B, T, n_mixture), where `n_mixture` means the number of Gaussians in the mixture.
+            t (Variable): shape(B, T), dtype float32, the target audio. Note that the target's corresponding time index is one step ahead of the output distribution. And output distributions whose input contains padding are neglected in loss computation.

         Returns:
-            Variable: shape(1, ), dtype: float, the loss.
+            Variable: shape(1, ), dtype float32, the loss.
         """
         n_mixture = self.output_dim // 3
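For reference, the mixture-of-Gaussians likelihood this loss is built on can be written compactly in numpy: `y` packs [logits, means, log-stds] along the last axis, and the loss is the negative log of the mixture density at the target sample. This sketch ignores the one-step target shift, the context trimming, and any log-std clipping that the real `compute_mog_loss` may apply.

```python
import numpy as np

def log_softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mog_nll_sketch(y, t):
    """Negative log likelihood of t under the mixture described by y."""
    logits, mu, log_std = np.split(y, 3, axis=-1)    # each (B, T, n_mixture)
    log_pi = log_softmax(logits)                     # log mixture weights
    # per-component Gaussian log density of the target sample
    comp = -0.5 * np.log(2 * np.pi) - log_std \
           - 0.5 * ((t[..., None] - mu) / np.exp(log_std)) ** 2
    joint = log_pi + comp
    m = joint.max(axis=-1, keepdims=True)
    log_prob = (m + np.log(np.exp(joint - m).sum(axis=-1, keepdims=True)))[..., 0]
    return -log_prob.mean()

y = np.random.randn(2, 10, 3 * 4).astype("float32")      # n_mixture = 4
t = np.random.uniform(-1, 1, size=(2, 10)).astype("float32")
print(float(mog_nll_sketch(y, t)))
```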
@@ -408,7 +408,7 @@ class WaveNet(dg.Layer):
     def sample_from_mog(self, y):
         """Sample from the output distribution where the output distribution is a mixture of Gaussians.
         Args:
-            y (Variable): shape(B, T, C_output), dtype: float, the parameters of the output distribution. It is the concatenation of 3 parts: the logits of every distribution, the mean of each distribution, and the log standard deviation of each distribution. Each part's shape is (B, T, n_mixture), where `n_mixture` means the number of Gaussians in the mixture.
+            y (Variable): shape(B, T, C_output), dtype float32, the parameters of the output distribution. It is the concatenation of 3 parts: the logits of every distribution, the mean of each distribution, and the log standard deviation of each distribution. Each part's shape is (B, T, n_mixture), where `n_mixture` means the number of Gaussians in the mixture.

         Returns:
             Variable: shape(B, T), waveform sampled from the output distribution.
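Sampling from that mixture picks a component per time step from the softmaxed logits and then draws from the chosen Gaussian. The numpy sketch below uses inverse-CDF component selection; the names and details are illustrative, not the method's actual implementation.

```python
import numpy as np

def sample_from_mog_sketch(y, rng):
    n_mixture = y.shape[-1] // 3
    logits, mu, log_std = np.split(y, 3, axis=-1)          # each (B, T, n_mixture)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # pick one component per (batch, time) position by inverse-CDF sampling
    u = rng.random(probs.shape[:-1] + (1,))
    comp = (probs.cumsum(axis=-1) < u).sum(axis=-1, keepdims=True)
    comp = np.clip(comp, 0, n_mixture - 1)                 # guard float round-off
    mean = np.take_along_axis(mu, comp, axis=-1)[..., 0]
    std = np.exp(np.take_along_axis(log_std, comp, axis=-1)[..., 0])
    return mean + std * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
y = np.random.randn(2, 5, 3 * 4)                           # (B, T, C_output)
print(sample_from_mog_sketch(y, rng).shape)                # (2, 5)
```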
@@ -438,7 +438,7 @@ class WaveNet(dg.Layer):
     def sample(self, y):
         """Sample from the output distribution.
         Args:
-            y (Variable): shape(B, T, C_output), dtype: float, the parameters of the output distribution.
+            y (Variable): shape(B, T, C_output), dtype float32, the parameters of the output distribution.

         Returns:
             Variable: shape(B, T), waveform sampled from the output distribution.
@@ -452,11 +452,11 @@ class WaveNet(dg.Layer):
         """Compute the loss where the output distribution is a mixture of Gaussians.

         Args:
-            y (Variable): shape(B, T, C_output), dtype: float, the parameters of the output distribution.
-            t (Variable): shape(B, T), dtype: float, the target audio. Note that the target's corresponding time index is one step ahead of the output distribution. And output distributions whose input contains padding are neglected in loss computation.
+            y (Variable): shape(B, T, C_output), dtype float32, the parameters of the output distribution.
+            t (Variable): shape(B, T), dtype float32, the target audio. Note that the target's corresponding time index is one step ahead of the output distribution. And output distributions whose input contains padding are neglected in loss computation.

         Returns:
-            Variable: shape(1, ), dtype: float, the loss.
+            Variable: shape(1, ), dtype float32, the loss.
         """
         if self.loss_type == "softmax":
             return self.compute_softmax_loss(y, t)