diff --git a/.gitignore b/.gitignore
index 909b4a7..7906666 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,9 @@
+# IDEs
+*.wpr
+*.wpu
+*.udb
+*.ann
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
diff --git a/README_cn.md b/README_cn.md
new file mode 100644
index 0000000..994a4e2
--- /dev/null
+++ b/README_cn.md
@@ -0,0 +1,233 @@
+# Parakeet
+
+Parakeet 旨在为开源社区提供一个灵活，高效，先进的语音合成工具箱。Parakeet 基于 PaddlePaddle 2.0 构建，并且包含了 [百度研究院](http://research.baidu.com) 以及其他研究机构的许多有影响力的 TTS 模型。
+
+parakeet-logo
+
+其中包含了百度研究院最近提出的 [WaveFlow](https://arxiv.org/abs/1912.01219) 模型。
+
+- WaveFlow 无需专用于推理的 kernel, 就可以在 Nvidia v100 上以 40 倍实时的速度合成 22.05kHz 的高保真度的语音。这比 [WaveGlow](https://github.com/NVIDIA/waveglow) 模型更快，而且比 WaveNet 快几个数量级。
+- WaveFlow 是占用小的，基于流的用于生成原始音频的模型，只有 5.9M 个可训练参数，约为 WaveGlow (87.9M 个参数) 的 1/15.
+- WaveFlow 可以直接通过最大似然方式训练，而不需要概率密度蒸馏，或者是类似 ParallelWaveNet 和 ClariNet 中使用的辅助 loss, 这简化了训练流程，减小了开发成本。
+
+## 模型概览
+
+为了方便使用已有的 TTS 模型以及开发新的模型，Parakeet 选取了经典的模型，并且提供了基于 PaddlePaddle 的参考实现。Parakeet 进一步抽象了 TTS 任务的流程，并且将数据预处理，模块共享，模型配置以及训练和合成的流程标准化。目前已经支持的模型包括音码器 (vocoder) 和声学模型。
+
+- 音码器
+  - [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
+  - [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281)
+  - [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499)
+
+- 声学模型
+  - [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654)
+  - [Neural Speech Synthesis with Transformer Network (Transformer TTS)](https://arxiv.org/abs/1809.08895)
+  - [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263)
+
+未来将会添加更多的模型。
+
+如若需要基于 Parakeet 实现自己的模型和实验，可以参考 [如何准备自己的实验](./docs/experiment_guide_cn.md).
+
+## 安装
+
+请参考 [安装](./docs/installation_cn.md).
+
+## 实验样例
+
+Parakeet 提供了多个实验样例。这些样例使用 parakeet 中提供的模型，提供在公共数据集上进行实验的完整流程，包含数据处理，模型训练以及预测的功能，是进行实验以及二次开发的示例。
+
+- [>>> WaveFlow](./examples/waveflow)
+- [>>> Clarinet](./examples/clarinet)
+- [>>> WaveNet](./examples/wavenet)
+- [>>> Deep Voice 3](./examples/deepvoice3)
+- [>>> Transformer TTS](./examples/transformer_tts)
+- [>>> FastSpeech](./examples/fastspeech)
+
+
+## 预训练模型和音频样例
+
+Parakeet 同时提供了示例模型的训练好的参数，可从下表中获取。每一列列出了一个模型的资源，包含预训练模型的 checkpoint 下载 url, 训练该模型用的数据集，以及使用该 checkpoint 合成的语音样例。点击模型名，可以下载到一个压缩包，其中包含了训练该模型时使用的配置文件。
+
+#### 音码器
+
+我们提供了 residual channel 为 64, 96, 128 的 WaveFlow 模型 checkpoint. 另外还提供了 ClariNet 和 WaveNet 的 checkpoint.
+
+（表格：WaveFlow (res. channels 64)、WaveFlow (res. channels 96)、WaveFlow (res. channels 128)、ClariNet、WaveNet 的预训练 checkpoint 下载链接与合成语音样例，以上模型均基于 LJSpeech 数据集训练。）
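+
+下载并解压某个 checkpoint 之后，可以用 `paddle.load` 读取其中的参数文件，再加载进按压缩包内配置文件构建的模型。下面是一个极简的示意（目录名来自上文的 waveflow_res128 压缩包，参数文件名仅为假设，以解压后的实际内容为准）：
+
+```python
+import paddle
+
+# 仅为示意：参数文件名为假设值，需替换为压缩包解压后的实际文件名。
+state_dict = paddle.load("waveflow_res128_ljspeech_ckpt_1.0/step-2000000.pdparams")
+# model.set_state_dict(state_dict)  # model 为按压缩包内配置文件构建的 WaveFlow 实例
+```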
+
+**注意:** 输入的 mel 频谱是从验证集中选取的，它们不被用于训练。
+
+#### 声学模型
+
+我们也提供了几个端到端的 TTS 模型的 checkpoint, 并展示用随机选取的著名引言合成的语音。对应的转录文本展示如下。
+
+| |Text| From |
+|:-:|:-- | :--: |
+0|*Life was like a box of chocolates, you never know what you're gonna get.* | *Forrest Gump* |
+1|*With great power there must come great responsibility.* | *Spider-Man*|
+2|*To be or not to be, that’s a question.*|*Hamlet*|
+3|*Death is just a part of life, something we're all destined to do.*| *Forrest Gump*|
+4|*Don’t argue with the people of strong determination, because they may change the fact!*| *William Shakespeare* |
+
+用户可以使用不同的音码器将声学模型产生的频谱转化为原始音频。我们将展示声学模型配合 [Griffin-Lim](https://ieeexplore.ieee.org/document/1164317) 音码器以及基于神经网络的音码器的合成样例。
+
+##### 1) Griffin-Lim 音码器
+
+（表格：Transformer TTS 与 FastSpeech 配合 Griffin-Lim 音码器的合成语音样例，两个模型均基于 LJSpeech 数据集训练。）
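+
+下面给出一个用 librosa 实现 Griffin-Lim 重建波形的极简示意（并非 Parakeet 中的实现；采样率、n_fft、hop_length 等参数均为假设值，需要与声学模型的配置一致；若 mel 频谱经过对数或归一化处理，还需先还原为线性幅度谱）：
+
+```python
+import librosa
+import soundfile as sf
+
+def mel_to_wav_griffin_lim(mel, sr=22050, n_fft=1024, hop_length=256, win_length=1024, n_iter=32):
+    """把线性幅度的 mel 频谱 (shape: [n_mels, T]) 近似还原为波形，参数均为假设值。"""
+    # 先把 mel 频谱近似逆变换回线性幅度谱
+    linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft, power=1.0)
+    # 再用 Griffin-Lim 迭代估计相位并重建波形
+    return librosa.griffinlim(linear, n_iter=n_iter, hop_length=hop_length, win_length=win_length)
+
+# wav = mel_to_wav_griffin_lim(mel)   # mel 为声学模型输出的频谱
+# sf.write("demo.wav", wav, 22050)
+```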
+ +##### 2) 神经网络音码器 + +正在开发中。 + +## 版权和许可 + +Parakeet 以 [Apache-2.0 license](LICENSE) 提供。 diff --git a/docs/data.md b/docs/data.md deleted file mode 100644 index 07f2a97..0000000 --- a/docs/data.md +++ /dev/null @@ -1,341 +0,0 @@ -# parakeet.data - -This short guide shows the design of `parakeet.data` and how we use it in an experiment. - -The most important concepts of `parakeet.data` are `DatasetMixin`, `DataCargo`, `Sampler`, `batch function` and `DataIterator`. - -## Dataset - -Dataset, as we assume here, is a list of examples. You can get its length by `len(dataset)`(which means its length is known, and we have to implement `__len__()` method for it). And you can access its items randomly by `dataset[i]`(which means we have to implement `__getitem__()` method for it). Furthermore, you can iterate over it by `iter(dataset)` or `for example in dataset`, which means we have to implement `__iter__()` method for it. - -### DatasetMixin - -We provide an `DatasetMixin` object which provides the above methods. You can inherit `DatasetMixin` and implement `get_example()` method for it to define your own dataset class. The `get_example()` method is called by `__getitem__()` method automatically. - -We also define several high-order Dataset classes, the obejcts of which can be built from some given Dataset objects. - -### TupleDataset - -Dataset that is a combination of several datasets of the same length. An example of a `Tupledataset` is a tuple of examples of its constituent datasets. - -### DictDataset - -Dataset that is a combination of several datasets of the same length. An example of the `Dictdataset` is a dict of examples of its constituent datasets. - -### SliceDataset - -`SliceDataset` is a slice of the base dataset. - -### SubsetDataset - -`SubsetDataset` is a subset of the base dataset. - -### ChainDataset - -`ChainDataset` is the concatenation of several datastes with the same fields. - -### TransformDataset - -A `TransformeDataset` is created by applying a `transform` to the examples of the base dataset. The `transform` is a callable object which takes an example of the base dataset as parameter and returns an example of the `TransformDataset`. The transformation is lazy, which means it is applied to an example only when requested. - -### FilterDataset - -A `FilterDataset` is created by applying a `filter` to the base dataset. A `filter` is a predicate that takes an example of the base dataset as parameter and returns a boolean. Only those examples that pass the filter are included in the `FilterDataset`. - -Note that the filter is applied to all the examples in the base dataset when initializing a `FilterDataset`. - -### CacheDataset - -By default, we preprocess dataset lazily in `DatasetMixin.get_example()`. An example is preprocessed whenever requested. But `CacheDataset` caches the base dataset lazily, so each example is processed only once when it is first requested. When preprocessing the dataset is slow, you can use `Cachedataset` to speed it up, but caching may consume a lot of RAM if the dataset is large. - -Finally, if preprocessing the dataset is slow and the processed dataset is too large to cache, you can write your own code to save them into files or databases, and then define a Dataset to load them. `Dataset` is flexible, so you can create your own dataset painlessly. - -## DataCargo - -`DataCargo`, like `Dataset`, is an iterable object, but it is an iterable oject of batches. 
We need `Datacargo` because in deep learning, batching examples into batches exploits the computational resources of modern hardwares. You can iterate over it by `iter(datacargo)` or `for batch in datacargo`. `DataCargo` is an iterable object but not an iterator, in that in can be iterated over more than once. - -### batch function - -The concept of a `batch` is something transformed from a list of examples. Assume that an example is a structure(tuple in python, or struct in C and C++) consists of several fields, then a list of examples is an array of structures(AOS, e.g. a dataset is an AOS). Then a batch here is a structure of arrays (SOA). Here is an example: - -The table below represents 2 examples, each of which contains 5 fields. - -| weight | height | width | depth | density | -| ------ | ------ | ----- | ----- | ------- | -| 1.2 | 1.1 | 1.3 | 1.4 | 0.8 | -| 1.6 | 1.4 | 1.2 | 0.6 | 1.4 | - -The AOS representation and SOA representation of the table are shown below. - -AOS: -```text -[(1.2, 1,1, 1,3, 1,4, 0.8), - - (1.6, 1.4, 1.2, 0.6, 1.4)] -``` - -SOA: -```text -([1,2, 1.6], - [1.1, 1.4], - [1.3, 1.2], - [1.4, 0.6], - [0.8, 1.4]) -``` - -For the example above, converting an AOS to an SOA is trivial, just stacking every field for all the examples. But it is not always the case. When a field contains a sequence, you may have to pad all the sequences to the largest length then stack them together. In some other cases, we may want to add a field for the batch, for example, `valid_length` for each example. So in general, a function to transform an AOS to SOA is needed to build a `Datacargo` from a dataset. We call this the batch function (`batch_fn`), but you can use any callable object if you need to. - -Usually we need to define the batch function as an callable object which stores all the options and configurations as its members. Its `__call__()` method transforms a list of examples into a batch. - -### Sampler - -Equipped with a batch function(we have known __how to batch__), here comes the next question. __What to batch?__ We need to decide which examples to pick when creating a batch. Since a dataset is a list of examples, we only need to pick indices for the corresponding examples. A sampler object is what we use to do this. - -A `Sampler` is represented as an iterable object of integers. Assume the dataset has `N` examples, then an iterable object of intergers in the range`[0, N)` is an appropriate sampler for this dataset to build a `DataCargo`. - -We provide several samplers that are ready to use, for example, `SequentialSampler` and `RandomSampler`. - -## DataIterator - -`DataIterator` is what returned by `iter(data_cargo)`. It can only be iterated over once. - -Here's the analogy. - -```text -Dataset --> Iterable[Example] | iter(Dataset) -> Iterator[Example] -DataCargo --> Iterable[Batch] | iter(DataCargo) -> Iterator[Batch] -``` - -In order to construct an iterator of batches from an iterator of examples, we construct a DataCargo from a Dataset. - - - -## Code Example - -Here's an example of how we use `parakeet.data` to process the `LJSpeech` dataset with a wavenet model. - -First, we would like to define a class which represents the LJSpeech dataset and loads it as-is. We try not to apply any preprocessings here. 
- -```python -import csv -import numpy as np -import librosa -from pathlib import Path -import pandas as pd - -from parakeet.data import DatasetMixin -from parakeet.data import batch_spec, batch_wav - -class LJSpeechMetaData(DatasetMixin): - def __init__(self, root): - self.root = Path(root) - self._wav_dir = self.root.joinpath("wavs") - csv_path = self.root.joinpath("metadata.csv") - self._table = pd.read_csv( - csv_path, - sep="|", - header=None, - quoting=csv.QUOTE_NONE, - names=["fname", "raw_text", "normalized_text"]) - - def get_example(self, i): - fname, raw_text, normalized_text = self._table.iloc[i] - fname = str(self._wav_dir.joinpath(fname + ".wav")) - return fname, raw_text, normalized_text - - def __len__(self): - return len(self._table) -``` - -We make this dataset simple in purpose. It requires only the path of the dataset, nothing more. It only loads the `metadata.csv` in the dataset when it is initialized, which includes file names of the audio files, and the transcriptions. We do not even load the audio files at `get_example()`. - -Then we define a `Transform` object to transform an example of `LJSpeechMetaData` into an example we want for the model. - -```python -class Transform(object): - def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels): - self.sample_rate = sample_rate - self.n_fft = n_fft - self.win_length = win_length - self.hop_length = hop_length - self.n_mels = n_mels - - def __call__(self, example): - wav_path, _, _ = example - - sr = self.sample_rate - n_fft = self.n_fft - win_length = self.win_length - hop_length = self.hop_length - n_mels = self.n_mels - - wav, loaded_sr = librosa.load(wav_path, sr=None) - assert loaded_sr == sr, "sample rate does not match, resampling applied" - - # Pad audio to the right size. - frames = int(np.ceil(float(wav.size) / hop_length)) - fft_padding = (n_fft - hop_length) // 2 # sound - desired_length = frames * hop_length + fft_padding * 2 - pad_amount = (desired_length - wav.size) // 2 - - if wav.size % 2 == 0: - wav = np.pad(wav, (pad_amount, pad_amount), mode='reflect') - else: - wav = np.pad(wav, (pad_amount, pad_amount + 1), mode='reflect') - - # Normalize audio. - wav = wav / np.abs(wav).max() * 0.999 - - # Compute mel-spectrogram. - # Turn center to False to prevent internal padding. - spectrogram = librosa.core.stft( - wav, - hop_length=hop_length, - win_length=win_length, - n_fft=n_fft, - center=False) - spectrogram_magnitude = np.abs(spectrogram) - - # Compute mel-spectrograms. - mel_filter_bank = librosa.filters.mel(sr=sr, - n_fft=n_fft, - n_mels=n_mels) - mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude) - mel_spectrogram = mel_spectrogram - - # Rescale mel_spectrogram. - min_level, ref_level = 1e-5, 20 # hard code it - mel_spectrogram = 20 * np.log10(np.maximum(min_level, mel_spectrogram)) - mel_spectrogram = mel_spectrogram - ref_level - mel_spectrogram = np.clip((mel_spectrogram + 100) / 100, 0, 1) - - # Extract the center of audio that corresponds to mel spectrograms. - audio = wav[fft_padding:-fft_padding] - assert mel_spectrogram.shape[1] * hop_length == audio.size - - # there is no clipping here - return audio, mel_spectrogram -``` - -`Transform` loads the audio files, and extracts `mel_spectrogram` from them. This transformation actually needs a lot of options to specify, namely, the sample rate of the audio files, the `n_fft`, `win_length`, `hop_length` of `stft` transformation, and `n_mels` for transforming spectrogram into mel_spectrogram. 
So we define it as a callable class. You can also use a closure, or a `partial` if you want to. - -Then we defines a functor to batch examples into a batch. Because the two fields ( `audio` and `mel_spectrogram`) are both sequences, batching them is not trivial. Also, because the wavenet model trains in audio clips of a fixed length(0.5 seconds, for example), we have to truncate the audio when creating batches. We want to crop audio randomly when creating batches, instead of truncating them when preprocessing each example, because it allows for an audio to be truncated at different positions. - -```python -class DataCollector(object): - def __init__(self, - context_size, - sample_rate, - hop_length, - train_clip_seconds, - valid=False): - frames_per_second = sample_rate // hop_length - train_clip_frames = int( - np.ceil(train_clip_seconds * frames_per_second)) - context_frames = context_size // hop_length - self.num_frames = train_clip_frames + context_frames - - self.sample_rate = sample_rate - self.hop_length = hop_length - self.valid = valid - - def random_crop(self, sample): - audio, mel_spectrogram = sample - audio_frames = int(audio.size) // self.hop_length - max_start_frame = audio_frames - self.num_frames - assert max_start_frame >= 0, "audio is too short to be cropped" - - frame_start = np.random.randint(0, max_start_frame) - # frame_start = 0 # norandom - frame_end = frame_start + self.num_frames - - audio_start = frame_start * self.hop_length - audio_end = frame_end * self.hop_length - - audio = audio[audio_start:audio_end] - return audio, mel_spectrogram, audio_start - - def __call__(self, samples): - # transform them first - if self.valid: - samples = [(audio, mel_spectrogram, 0) - for audio, mel_spectrogram in samples] - else: - samples = [self.random_crop(sample) for sample in samples] - # batch them - audios = [sample[0] for sample in samples] - audio_starts = [sample[2] for sample in samples] - mels = [sample[1] for sample in samples] - - mels = batch_spec(mels) - - if self.valid: - audios = batch_wav(audios, dtype=np.float32) - else: - audios = np.array(audios, dtype=np.float32) - audio_starts = np.array(audio_starts, dtype=np.int64) - return audios, mels, audio_starts -``` - -When these 3 components are defined, we can start building our dataset with them. - -```python -# building the ljspeech dataset -ljspeech_meta = LJSpeechMetaData(root) -transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels) -ljspeech = TransformDataset(ljspeech_meta, transform) - -# split them into train and valid dataset -ljspeech_valid = SliceDataset(ljspeech, 0, valid_size) -ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech)) - -# building batch functions (they can be differnt for training and validation if you need it) -train_batch_fn = DataCollector(context_size, sample_rate, hop_length, - train_clip_seconds) -valid_batch_fn = DataCollector( - context_size, sample_rate, hop_length, train_clip_seconds, valid=True) - -# building the data cargo -train_cargo = DataCargo( - ljspeech_train, - train_batch_fn, - batch_size, - sampler=RandomSampler(ljspeech_train)) - -valid_cargo = DataCargo( - ljspeech_valid, - valid_batch_fn, - batch_size=1, # only batch=1 for validation is enabled - sampler=SequentialSampler(ljspeech_valid)) -``` - -Here comes the next question, how to bring batches into Paddle's computation. Do we need some adapter to transform numpy.ndarray into Paddle's native Variable type? Yes. 
- -First we can use `var = dg.to_variable(array)` to transform ndarray into Variable. - -```python -for batch in train_cargo: - audios, mels, audio_starts = batch - audios = dg.to_variable(audios) - mels = dg.to_variable(mels) - audio_starts = dg.to_variable(audio_starts) - - # your training code here -``` - -In the code above, processing of the data and training of the model run in the same process. So the next batch starts to load after the training of the current batch has finished. There is actually better solutions for this. Data processing and model training can be run asynchronously. To accomplish this, we would use `DataLoader` from Paddle. This serves as an adapter to transform an iterable object of batches into another iterable object of batches, which runs asynchronously and transform each ndarray into `Variable`. - -```python -# connect our data cargos with corresponding DataLoader -# now the data cargo is connected with paddle -with dg.guard(place): - train_loader = fluid.io.DataLoader.from_generator( - capacity=10,return_list=True).set_batch_generator(train_cargo, place) - valid_loader = fluid.io.DataLoader.from_generator( - capacity=10, return_list=True).set_batch_generator(valid_cargo, place) - - # iterate over the dataloader - for batch in train_loader: - audios, mels, audio_starts = batch - # your trains cript here -``` diff --git a/docs/data_cn.md b/docs/data_cn.md new file mode 100644 index 0000000..4a7aab8 --- /dev/null +++ b/docs/data_cn.md @@ -0,0 +1,216 @@ +# 数据准备 + +本节主要讲述 `parakeet.data` 子模块的设计以及如何在实验中使用它。 + +`parakeet.data` 遵循 paddle 管用的数据准备流程。Dataset, Sampler, batch function, DataLoader. + +## Dataset + +我们假设数据集是样例的列表。你可以通过 `__len__` 方法获取其长度,并且可以通过 `__getitem__` 方法随机访问其元素。有了上述两个调节,我们也可以用 `iter(dataset)` 来获得一个 dataset 的迭代器。我们一般通过继承 `paddle.io.Dataset` 来创建自己的数据集。为其实现 `__len__` 方法和 `__getitem__` 方法即可。 + +出于数据处理,数据加载和数据集大小等方面的考虑,可以采用集中策略来调控数据集是否被懒惰地预处理,是否被懒惰地被加载,是否常驻内存等。 + +1. 数据在数据集实例化的时候被全部预处理并常驻内存。对于数据预处理比较快,且整个数据集较小的情况,可以采用这样的策略。因为整个的数据集的预处理在数据集实例化时完成,因此要求预处理很快,否则将要花时间等待数据集实例化。因为被处理后的数据集常驻内存,因此要求数据集较小,否则可能不能将整个数据集加载进内存。 +2. 每个样例在被请求的时候预处理,并且把预处理的结果缓存。可以通过在数据集的 `__getitem__` 方法中调用单条样例的预处理方法来实现这个策略。这样做的条件一样是数据可以整个载入内存。但好处是不必花费很多时间等待数据集实例化。使用这个策略,则数据集被完整迭代一次之后,访问样例的时候会显著变快,因为不需要再次处理。但在首次使用的时候仍然会需要即时处理,所以如果快速评估数据迭代的数度还需要等数据集被迭代一遍。 +3. 先将数据集预处理一遍把结果保存下来。再作为另一个数据集使用,这个新的数据集的 `__getitem__` 方法则只是从存储器读取数据。一般来说数据读取的性能并不会制约模型的训练,并且这也不要求内存必须足以装下整个数据集。是一种较为灵活的方法。但是会需要一个单独的预处理脚本,并且根据处理后的数据写一个数据集。 + +以上的三种只是一种概念上的划分,实际使用时候我们可能混用以上的策略。举例如下: + +1. 对于一个样例的多个字段,有的是很小的,比如说文本,可能可能常驻内存;而对于音频,频谱或者图像,可能预先处理并存储,在访问时仅加载处理好的结果。 +2. 对于某些比较大或者预处理比较慢的数据集。我们可以仅加载一个较小的元数据,里面包含了一些可以用于对样例进行排序或者筛选的特征码,则我们可以在不加载整个样例就可以利用这些元数据对数据进行排序或者筛选。 + +一般来说,我们将一个 Dataset 的子类看作是数据集和实验的具体需求之间的适配器。 + +parakeet 还提供了若干个高阶的 Dataset 类,用于从已有的 Dataset 产生新的 Dataset. + +1. 用于字段组合的有 TupleDataset, DictDataset; +2. 用于数据集切分合并的有 SliceDataset, SubsetDataset, ChainDataset; +3. 用于缓存数据集的有 CacheDataset; +4. 用于数据集筛选的有 FilterDataset; +5. 用于变换数据集的有 TransformDataset. + +可以灵活地使用这些高阶数据集来使数据处理更加灵活。 + +## DataLoader + +`DataLoader` 类似 `Dataset` 也是可迭代对象,但是一般情况下,它是按批量来迭代的。在深度学习中我们需要 `DataLoader` 是因为把多个样例组成一个批次可以充分利用现代硬件的计算资源。可以根据一个 Dataset 构建一个 DataLoader,它可以被多次迭代。 + +构建 DataLoader 除了需要一个 Dataset 之外,还需要两个要素。 + +1. 如何组成批次。 +2. 如何选取样例来组成批次; + +下面的两个小节将分别提供这两个要素。 + +### batch function + +批次是包含多个样例的列表经过某种变换的结果。假设一个样例是一个拥有多个字段的结构(在不同的编程语言可能有不同的实现,比如在 python 中可以是 tuple, dict 等,在 C/C++ 中可能是一个 struct)。那么包含多个样例的列表就是一个结构的阵列(array of structure, AOS). 
而出于训练神经网络的需要，我们希望一个批次和一个样例一样，是拥有多个字段的一个结构。因此需要一个方法，把一个结构的阵列（array of structures）变成一个阵列的结构(structure of arrays).
+
+下面是一个简单的例子:
+
+下面的表格代表了两个样例，每个包含 5 个字段。
+
+| weight | height | width | depth | density |
+| ------ | ------ | ----- | ----- | ------- |
+| 1.2 | 1.1 | 1.3 | 1.4 | 0.8 |
+| 1.6 | 1.4 | 1.2 | 0.6 | 1.4 |
+
+以上表格的 AOS 表示形式和 SOA 表示形式如下:
+
+AOS:
+
+```text
+[(1.2, 1.1, 1.3, 1.4, 0.8),
+
+ (1.6, 1.4, 1.2, 0.6, 1.4)]
+```
+
+SOA:
+
+```text
+([1.2, 1.6],
+ [1.1, 1.4],
+ [1.3, 1.2],
+ [1.4, 0.6],
+ [0.8, 1.4])
+```
+
+对于上述的例子，将 AOS 转换为 SOA 是平凡的。只要把所有样例的各个字段 stack 起来就可以。但事情并非总是如此简单。当一个字段包含一个序列，你可能就需要先把所有的序列都补长 (pad) 到最长的序列长度，然后才能把它们 stack 起来。对于某些情形，批次可能比样例多一些字段，比如说对于包含序列的样例，在补长之后，可能需要增设一个字段来记录那些序列的有效长度。因此，一般情况下，需要一个函数来实现这个功能，而且这是和具体的数据集搭配的。当然除了函数之外，也可以使用任何的可调用对象，我们把这些称为 batch function.
+
+
+### Sampler
+
+有了 batch function(我们知道如何组成批次), 接下来是另一个问题，将什么组成批次呢？当组建一个批次的时候，我们需要决定选取哪些样例来组成它。因此我们预设数据集是可以随机访问的，我们只需要选取对应的索引即可。我们使用 sampler 来完成选取 index 的任务。
+
+Sampler 被实现为产生整数的可迭代对象。假设数据集有 `N` 个样例，那么产生 `[0, N)` 之间的整数的可迭代对象就是一个合适的 sampler. 最常用的 sampler 是 `SequentialSampler` 和 `RandomSampler`.
+
+当迭代一个 DataLoader 的时候，首先 sampler 产生多个 index, 然后根据这些 index 去取出对应的样例，并调用 batch function 把这些样例组成一个批次。当然取出样例的过程是可并行的，但调用 batch function 组成 batch 不是。
+
+另外的一种选择是使用 batch sampler, 它是产生整数列表的可迭代对象。对于一般的 sampler, 需要对其迭代器使用 next 多次才能产出多个 index, 而对于 batch sampler, 对其迭代器使用 next 一次就可以产出多个 index. 对于使用一般的 sampler 的情形，batch size 由 DataLoader 来决定。而对于 batch sampler, 则是由它决定了 DataLoader 的 batch size, 因此可以用它来实现一些特别的需求，比如说动态 batch size.
+
+## 示例代码
+
+以下是我们使用 `parakeet.data` 处理 `LJSpeech` 数据集的代码。
+
+首先，我们定义一个 class 来代表 LJspeech 数据集，它只是如其所是地加载了元数据，亦即数据集中的 `metadata.csv` 文件，其中记录了音频文件的文件名，以及转录文本。但并不加载音频，也并不做任何的预处理。我们有意让这个数据集保持简单，它仅需要数据集的路径来实例化。
+
+```python
+import csv
+import numpy as np
+import librosa
+from pathlib import Path
+from paddle.io import Dataset
+
+from parakeet.data import batch_spec, batch_wav
+
+class LJSpeechMetaData(Dataset):
+    def __init__(self, root):
+        self.root = Path(root).expanduser()
+        wav_dir = self.root / "wavs"
+        csv_path = self.root / "metadata.csv"
+        records = []
+        speaker_name = "ljspeech"
+        with open(str(csv_path), 'rt') as f:
+            for line in f:
+                filename, _, normalized_text = line.strip().split("|")
+                filename = str(wav_dir / (filename + ".wav"))
+                records.append([filename, normalized_text, speaker_name])
+        self.records = records
+
+    def __getitem__(self, i):
+        return self.records[i]
+
+    def __len__(self):
+        return len(self.records)
+```
+
+然后我们定义一个 `Transform` 类，用于处理 `LJSpeechMetaData` 中的样例，将其转换为模型所需要的数据。对于不同的模型可以定义不同的 Transform，这样就可以共用 `LJSpeechMetaData` 的代码。
+
+```python
+import numpy as np
+
+from parakeet.audio import AudioProcessor
+from parakeet.audio import LogMagnitude
+from parakeet.frontend import English
+
+class Transform(object):
+    def __init__(self):
+        self.frontend = English()
+        self.processor = AudioProcessor(
+            sample_rate=22050,
+            n_fft=1024,
+            win_length=1024,
+            hop_length=256,
+            f_max=8000)
+        self.normalizer = LogMagnitude()
+
+    def __call__(self, record):
+        fname, text, _ = record
+        # 加载音频并提取归一化的 mel 频谱
+        wav = self.processor.read_wav(fname)
+        mel = self.processor.mel_spectrogram(wav)
+        mel = self.normalizer.transform(mel)
+        # 文本转音素，再转为 id 序列
+        phonemes = self.frontend.phoneticize(text)
+        ids = self.frontend.numericalize(phonemes)
+        # 停止符目标：最后一帧标记为停止
+        stop_probs = np.ones([mel.shape[1]], dtype=np.int64)
+        stop_probs[-1] = 2
+        return (ids, mel, stop_probs)
+```
+
+`Transform` 加载音频，并且提取频谱。把 `Transform` 实现为一个可调用的类可以方便地持有许多选项，比如和傅里叶变换相关的参数。
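+
+下面是一个极简的检查示例（仅为示意：数据集路径为假设值，且假定 `Transform` 如上实现为可调用对象）：
+
+```python
+# 仅为示意：取出一条元数据，应用 Transform，检查输出的形状（路径为假设值）。
+meta = LJSpeechMetaData("~/datasets/LJSpeech-1.1")
+transform = Transform()
+ids, mel, stop_probs = transform(meta[0])
+print(len(ids), mel.shape, stop_probs.shape)
+```
+
+这里可以把一个 `LJSpeechMetaData` 对象和一个 `Transform` 对象组合起来，创建一个 `TransformDataset`.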
+ +```python +from parakeet.data import TransformDataset + +meta = LJSpeechMetaData(data_path) +transform = Transform() +ljspeech = TransformDataset(meta, transform) +``` + +当然也可以选择专门写一个转换脚本把转换后的数据集保存下来,然后再写一个适配的 Dataset 子类去加载这些保存的数据。实际这么做的效率会更高。 + +接下来我们需要写一个可调用对象将多个样例组成批次。因为其中的 ids 和 mel 频谱是序列数据,所以我们需要进行 padding. + +```python +class LJSpeechCollector(object): + """A simple callable to batch LJSpeech examples.""" + def __init__(self, padding_idx=0, padding_value=0.): + self.padding_idx = padding_idx + self.padding_value = padding_value + + def __call__(self, examples): + ids = [example[0] for example in examples] + mels = [example[1] for example in examples] + stop_probs = [example[2] for example in examples] + + ids = batch_text_id(ids, pad_id=self.padding_idx) + mels = batch_spec(mels, pad_value=self.padding_value) + stop_probs = batch_text_id(stop_probs, pad_id=self.padding_idx) + return ids, np.transpose(mels, [0, 2, 1]), stop_probs +``` + +以上的组件准备就绪后,可以准备整个数据流。 + +```python +def create_dataloader(source_path, valid_size, batch_size): + lj = LJSpeechMeta(source_path) + transform = Transform() + lj = TransformDataset(lj, transform) + + valid_set, train_set = dataset.split(lj, valid_size) + train_loader = DataLoader( + train_set, + return_list=False, + batch_size=batch_size, + shuffle=True, + drop_last=True, + collate_fn=LJSpeechCollector()) + valid_loader = DataLoader( + valid_set, + return_list=False, + batch_size=batch_size, + shuffle=False, + drop_last=False, + collate_fn=LJSpeechCollector()) + return train_loader, valid_loader +``` + +train_loader 和 valid_loader 可以被迭代。对其迭代器使用 next, 返回的是 `paddle.Tensor` 的 list, 代表一个 batch,这些就可以直接用作 `paddle.nn.Layer` 的输入了。 diff --git a/docs/experiment_guide.md b/docs/experiment_guide.md deleted file mode 100644 index 8c1eb81..0000000 --- a/docs/experiment_guide.md +++ /dev/null @@ -1,87 +0,0 @@ -# How to build your own model and experiment? - -For a general deep learning experiment, there are 4 parts to care for. - -1. Preprocess dataset to meet the needs for model training and iterate over them in batches; -2. Define the model and the optimizer; -3. Write the training process (including forward-backward computation, parameter update, logging, evaluation, etc.) -4. Configure and launch the experiment. - -## Data Processing - -For processing data, `parakeet.data` provides `DatasetMixin`, `DataCargo` and `DataIterator`. - -Dataset is an iterable object of examples. `DatasetMixin` provides the standard indexing interface, and other classes in [parakeet.data.dataset](../parakeet/data/dataset.py) provide flexible interfaces for building customized datasets. - -`DataCargo` is an iterable object of batches. It differs from a dataset in that it can be iterated over in batches. In addition to a dataset, a `Sampler` and a `batch function` are required to build a `DataCargo`. `Sampler` specifies which examples to pick, and `batch function` specifies how to create a batch from them. Commonly used `Samplers` are provided by [parakeet.data](../parakeet/data/). Users should define a `batch function` for a datasets, in order to batch its examples. - - `DataIterator` is an iterator class for `DataCargo`. It is create when explicitly creating an iterator of a `DataCargo` by `iter(DataCargo)`, or iterating over a `DataCargo` with `for` loop. - -Data processing is splited into two phases: sample-level processing and batching. - -1. Sample-level processing. This process is transforming an example into another. 
This process can be defined as `get_example()` method of a dataset, or as a `transform` (callable object) and build a `TransformDataset` with it. - -2. Batching. It is the process of transforming a list of examples into a batch. The rationale is to transform an array of structures into a structure of arrays. We generally define a batch function (or a callable object) to do this. - -To connect a `DataCargo` with Paddlepaddle's asynchronous data loading mechanism, we need to create a `fluid.io.DataLoader` and connect it to the `Datacargo`. - -The overview of data processing in an experiment with Parakeet is : - -```text -Dataset --(transform)--> Dataset --+ - sampler --+ - batch_fn --+-> DataCargo --> DataLoader -``` - -The user need to define a customized transform and a batch function to accomplish this process. See [data](./data.md) for more details. - -## Model - -Parakeet provides commonly used functions, modules and models for the users to define their own models. Functions contain no trainable `Parameter`s, and are used in modules and models. Modules and modes are subclasses of `fluid.dygraph.Layer`. The distinction is that `module`s tend to be generic, simple and highly reusable, while `model`s tend to be task-sepcific, complicated and not that reusable. Some models are so complicated that we extract building blocks from it as separate classes but if these building blocks are not common and reusable enough, they are considered as submodels. - -In the structure of the project, modules are placed in [parakeet.modules](../parakeet/modules/), while models are in [parakeet.models](../parakeet/models) and grouped into folders like `waveflow` and `wavenet`, which include the whole model and their submodels. - -When developers want to add new models to `parakeet`, they can consider the distinctions described above and put the code in an appropriate place. - - - -## Training Process - -Training process is basically running a training loop for multiple times. A typical training loop consists of the procedures below: - -1. Iterating over training dataset; -2. Prerocessing mini-batches; -3. Forward/backward computations of the neural networks; -4. Updating Parameters; -5. Evaluating the model on validation dataset; -6. Logging or saving intermediate results; -7. Saving checkpoints of the model and the optimizer. - -In section `DataProcessing` we have cover 1 and 2. - -`Model` and `Optimizer` cover 3 and 4. - -To keep the training loop clear, it's a good idea to define functions for saving/loading of checkpoints, evaluation on validation set, logging and saving of intermediate results, etc. For some complicated model, it is also recommended to define a function to create the model. This function can be used in both train and inference, to ensure that the model is identical at training and inference. - -Code is typically organized in this way: - -```text -├── configs/ (example configuration) -├── data.py (definition of custom Dataset, transform and batch function) -├── README.md (README for the experiment) -├── synthesis.py (code for inference) -├── train.py (code for training) -└── utils.py (all other utility functions) -``` - -## Configuration - -Deep learning experiments have many options to configure. These configurations can be roughly grouped into different types: configurations about path of the dataset and path to save results, configurations about how to process data, configurations about the model and configurations about the training process. 
- -Some configurations tend to change when running the code at different times, for example, path of the data and path to save results and whether to load model before training, etc. For these configurations, it's better to define them as command line arguments. We use `argparse` to handle them. - -Other groups of configurations may overlap with others. For example, data processing and model may have some common options. The recommended way is to save them as configuration files, for example, `yaml` or `json`. We prefer `yaml`, for it is more human-reabable. - - - -There are several examples in this repo, check [Parakeet/examples](../examples) for more details. `Parakeet/examples` is where we place our experiments. Though experiments are not a part of package `parakeet`, it is a part of repo `Parakeet`. They are provided as examples and allow for the users to run our experiment out-of-the-box. Feel free to add new examples and contribute to `Parakeet`. diff --git a/docs/experiment_guide_cn.md b/docs/experiment_guide_cn.md new file mode 100644 index 0000000..54f4a54 --- /dev/null +++ b/docs/experiment_guide_cn.md @@ -0,0 +1,79 @@ +# 如何准备自己的实验 + +对于一般的深度学习实验,有几个部分需要处理。 + +1. 按照模型的需要对数据进行预处理,并且按批次迭代数据集; +2. 定义模型以及优化器等组件; +3. 写出训练过程(一般包括 forward/backward 计算,参数更新,log 记录,可视化,定期评估等步骤); +4. 配置并运行实验。 + +## 数据处理 + +对于数据处理,`parakeet.data` 采用了 paddlepaddle 常用的 `Dataset -> DataLoader` 的流程。数据处理流程的概览如下: + +```text +Dataset --(transform)--> Dataset --+ + sampler --+ + batch_fn --+-> DataLoader +``` + +其中 transform 代表的是对样例的预处理。可以使用 `parakeet.data` 中的 TransformDataset 来从一个 Dataset 构建另一个 Dataset. + +得到想要的 Dataset 之后,提供 sampler 和 batch function, 即可据此构建 DataLoader. DataLoader 产生的结果可以直接用作模型的输入。 + +详细的使用方式参见 [data_cn](./data_cn.md). + +## 模型 + +为了对模型的可复用行和功能做较好的平衡,我们把模型按照其特征分为几种。 + +对于较为常用,可以作为其他更大的模型的部分的模块,我们尽可能将其实现得足够简单和通用,因为它们会被复用。对于含有可训练参数的模块,一般实现为 `paddle.nn.Layer` 的子类,但它们不是直接面向一个任务,因此不会带上处理未加工的输入和输出的功能。对于不含有可训练参数的模块,可以直接实现为一个函数,其输入输出都是 `paddle.Tensor` 或其集合。 + +针对一个特定任务的开箱模型,一般实现为 `paddle.nn.Layer` 的子类,是一个任务的核心计算单元。为了方便地处理输入和输出,一般还可以为它添加处理未加工的输入输出的功能。比如对于 NLP 任务来说,尽管神经网络接受的输出是文本的 id, 但是为了使模型能够处理未加工的输入,文本预处理的功能,以及文本转 id 的字典,也都应该视作模型的一部分。 + +当一个模型足够复杂,对其进行模块化切分是更好的选择,尽管拆分出来的小模块的功能也不一定非常通用,可能只是用于某个模型,但是当作么做有利于代码的清晰简洁时,仍然推荐这么做。 + +在 parakeet 的目录结构中,复用性较高的模块被放在 [parakeet.modules](../parakeet/modules/), 但是针对特定任务的模型则放在 [parakeet.models](../parakeet/models). + +当开发新的模型的时候,开发这需要考虑拆分模块的可行性,以及模块的通用程度,把它们分置于合适的目录。 + +## 训练流程 + +训练流程一般就是多次训练一个循环体。典型的循环体包含如下的过程: + +1. 迭代数据集; +2. 处理批次数据; +3. 神经网络的 forward/backward 计算; +4. 参数更新; +5. 符合一定条件时,在验证数据集上评估模型; +6. 写日志,可视化,保存中间结果; +7. 保存模型和优化器的状态。 + +`数据处理` 一节包含了 1 和 2, 模型和优化器包含了 3 和 4. 那么 5,6,7 是训练流程主要要完成的事情。为了使训练循环体简洁清晰,推荐将模型的保存和加载,模型评估,写日志以及可视化等功能都实现成函数,尽管很多情况下,它们可能需要访问很多局部变量。我们也正在考虑使用一个 Experiment 或者 Trainer 类来规范化这些训练循环体的写法。这样可以把一些需要被许多函数访问的变量作为类内的变量,可以使代码简洁而不至于引入太多的全局变量。 + +实验代码一般以如下的方式组织: + +```text +├── configs/ (实验配置) +├── data.py (Dataset, DataLoader 等的定义) +├── README.md (实验的帮助信息) +├── synthesis.py (用于生成的代码) +├── train.py (用于训练的代码) +└── utils.py (其他必要的辅助函数) +``` + +## 配置实验 + +深度学习实验常常有很多选项可配置。这些配置大概可以被分为几类: + +1. 数据源以及数据处理方式配置; +2. 实验结果保存路径配置; +3. 数据预处理方式配置; +4. 模型结构和超参数配置; +5. 
训练过程配置。 + +这些配置之间也可能存在某些重叠项,比如数据预处理部分的配置可能就和模型配置有关。比如说 mel 频谱的维数。 + +有部分配置是经常会发生改变的,比如数据源以及保存实验结果的路径,或者加载的 checkpoint 的路径等。对于这些配置,更好的做法是把它们实现为命令行参数。其余的不经常发生变动的参数,推荐将其写在配置文件中,我们推荐使用 `yaml` 作为配置文件,因为它允许添加注释,并且更加人类可读。 + +在这个软件源中包含了几个例子,可以在 [Parakeet/examples](../examples) 中查看。这些实验被作为样例提供给用户,可以直接运行。同时也欢迎用户添加新的模型和实验并为 `Parakeet` 贡献代码。 diff --git a/docs/installation_cn.md b/docs/installation_cn.md new file mode 100644 index 0000000..a861c86 --- /dev/null +++ b/docs/installation_cn.md @@ -0,0 +1,57 @@ +# 安装 + +[TOC] + + +## 安装 PaddlePaddle + +Parakeet 以 PaddlePaddle 作为其后端,因此依赖 PaddlePaddle,值得说明的是 Parakeet 要求 2.0 及以上版本的 PaddlePaddle。你可以通过 pip 安装。如果需要安装支持 gpu 版本的 PaddlePaddle,需要根据环境中的 cuda 和 cudnn 的版本来选择 wheel 包的版本。使用 conda 安装以及源码编译安装的方式请参考 [PaddlePaddle 快速安装](https://www.paddlepaddle.org.cn/install/quick/zh/2.0rc-linux-pip). + +**gpu 版 PaddlePaddle** + +```bash +python -m pip install paddlepaddle-gpu==2.0.0rc0.post101 -f https://paddlepaddle.org.cn/whl/stable.html +python -m pip install paddlepaddle-gpu==2.0.0rc0.post100 -f https://paddlepaddle.org.cn/whl/stable.html +``` + +**cpu 版 PaddlePaddle** + +```bash +python -m pip install paddlepaddle==2.0.0rc0 -i https://mirror.baidu.com/pypi/simple +``` + +## 安装 libsndfile + +因为 Parakeet 的实验中常常会需要用到和音频处理,以及频谱处理相关的功能,所以我们依赖 librosa 和 soundfile 进行音频处理。而 librosa 和 soundfile 依赖一个 C 的库 libsndfile, 因为这不是 python 的包,对于 windows 用户和 mac 用户,使用 pip 安装 soundfile 的时候,libsndfile 也会被安装。如果遇到问题也可以参考 [SoundFile](https://pypi.org/project/SoundFile). + +对于 linux 用户,需要使用系统的包管理器安装这个包,常见发行版上的命令参考如下。 + + +```bash +# ubuntu, debian +sudo apt-get install libsndfile1 + +# centos, fedora, +sudo yum install libsndfile + +# openSUSE +sudo zypper in libsndfile +``` + +## 安装 Parakeet + + +我们提供两种方式来使用 Parakeet. + +1. 需要运行 Parakeet 自带的实验代码,或者希望进行二次开发的用户,可以先从 github 克隆本工程,cd 仅工程目录,并进行可编辑式安装(不会被复制到 site-packages, 而且对工程的修改会立即生效,不需要重新安装),之后就可以使用了。 + + ```bash + # -e 表示可编辑式安装 + pip install -e . + ``` + +2. 仅需要使用我们提供的训练好的模型进行预测,那么也可以直接安装 pypi 上的 wheel 包的版本。 + + ```bash + pip install paddle-parakeet + ``` diff --git a/docs/overview_cn.md b/docs/overview_cn.md new file mode 100644 index 0000000..40659af --- /dev/null +++ b/docs/overview_cn.md @@ -0,0 +1,18 @@ +# Parakeet 概览 + +parakeet-logo + +Parakeet 旨在为开源社区提供一个灵活,高效,先进的语音合成工具箱。Parakeet 基于PaddlePaddle 2.0 构建,并且包含了百度研究院以及其他研究机构的许多有影响力的 TTS 模型。 + +Parakeet 为用户和开发者提供了 + +1. 可复用的模型以及常用的模块; +2. 从数据处理,模型训练到预测等一系列过程的完整实验; +3. 高质量的开箱即用模型。 + + + + + + + diff --git a/examples/clarinet/README.md b/examples/clarinet/README.md deleted file mode 100644 index cb02475..0000000 --- a/examples/clarinet/README.md +++ /dev/null @@ -1,148 +0,0 @@ -# Clarinet - -PaddlePaddle dynamic graph implementation of ClariNet, a convolutional network based vocoder. The implementation is based on the paper [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](arxiv.org/abs/1807.07281). - - -## Dataset - -We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/). - -```bash -wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 -tar xjvf LJSpeech-1.1.tar.bz2 -``` - -## Project Structure - -```text -├── data.py data_processing -├── configs/ (example) configuration file -├── synthesis.py script to synthesize waveform from mel_spectrogram -├── train.py script to train a model -└── utils.py utility functions -``` - -## Saving & Loading -`train.py` and `synthesis.py` have 3 arguments in common, `--checkpooint`, `iteration` and `output`. - -1. `output` is the directory for saving results. 
-During training, checkpoints are saved in `checkpoints/` in `output` and tensorboard log is save in `log/` in `output`. Other possible outputs are saved in `states/` in `outuput`. -During synthesizing, audio files and other possible outputs are save in `synthesis/` in `output`. -So after training and synthesizing with the same output directory, the file structure of the output directory looks like this. - -```text -├── checkpoints/ # checkpoint directory (including *.pdparams, *.pdopt and a text file `checkpoint` that records the latest checkpoint) -├── states/ # audio files generated at validation and other possible outputs -├── log/ # tensorboard log -└── synthesis/ # synthesized audio files and other possible outputs -``` - -2. `--checkpoint` and `--iteration` for loading from existing checkpoint. Loading existing checkpoiont follows the following rule: -If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded. -If `--checkpoint` is not provided, we try to load the model specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided, we try to load the latested checkpoint from checkpoint directory. - -## Train - -Train the model using train.py, follow the usage displayed by `python train.py --help`. - -```text -usage: train.py [-h] [--config CONFIG] [--device DEVICE] [--data DATA] - [--checkpoint CHECKPOINT | --iteration ITERATION] - [--wavenet WAVENET] - output - -Train a ClariNet model with LJspeech and a trained WaveNet model. - -positional arguments: - output path to save experiment results - -optional arguments: - -h, --help show this help message and exit - --config CONFIG path of the config file - --device DEVICE device to use - --data DATA path of LJspeech dataset - --checkpoint CHECKPOINT checkpoint to resume from - --iteration ITERATION the iteration of the checkpoint to load from output directory - --wavenet WAVENET wavenet checkpoint to use - -- `--config` is the configuration file to use. The provided configurations can be used directly. And you can change some values in the configuration file and train the model with a different config. -- `--device` is the device (gpu id) to use for training. `-1` means CPU. -- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains `metadata.txt`). - -- `--checkpoint` is the path of the checkpoint. -- `--iteration` is the iteration of the checkpoint to load from output directory. -- `output` is the directory to save results, all result are saved in this directory. - -See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading. - -- `--wavenet` is the path of the wavenet checkpoint to load. -When you start training a ClariNet model without loading form a ClariNet checkpoint, you should have trained a WaveNet model with single Gaussian output distribution. Make sure the config of the teacher model matches that of the trained wavenet model. - -Example script: - -```bash -python train.py - --config=./configs/clarinet_ljspeech.yaml - --data=./LJSpeech-1.1/ - --device=0 - --wavenet="wavenet-step-2000000" - experiment -``` - -You can monitor training log via tensorboard, using the script below. - -```bash -cd experiment/log -tensorboard --logdir=. -``` - -## Synthesis -```text -usage: synthesis.py [-h] [--config CONFIG] [--device DEVICE] [--data DATA] - [--checkpoint CHECKPOINT | --iteration ITERATION] - output - -Synthesize audio files from mel spectrogram in the validation set. 
- -positional arguments: - output path to save the synthesized audio - -optional arguments: - -h, --help show this help message and exit - --config CONFIG path of the config file - --device DEVICE device to use. - --data DATA path of LJspeech dataset - --checkpoint CHECKPOINT checkpoint to resume from - --iteration ITERATION the iteration of the checkpoint to load from output directory -``` - -- `--config` is the configuration file to use. You should use the same configuration with which you train you model. -- `--device` is the device (gpu id) to use for training. `-1` means CPU. -- `--data` is the path of the LJspeech dataset. In principle, a dataset is not needed for synthesis, but since the input is mel spectrogram, we need to get mel spectrogram from audio files. -- `--checkpoint` is the checkpoint to load. -- `--iteration` is the iteration of the checkpoint to load from output directory. -- `output` is the directory to save synthesized audio. Audio file is saved in `synthesis/` in `output` directory. -See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading. - - -Example script: - -```bash -python synthesis.py \ - --config=./configs/clarinet_ljspeech.yaml \ - --data=./LJSpeech-1.1/ \ - --device=0 \ - --iteration=500000 \ - experiment -``` - -or - -```bash -python synthesis.py \ - --config=./configs/clarinet_ljspeech.yaml \ - --data=./LJSpeech-1.1/ \ - --device=0 \ - --checkpoint="experiment/checkpoints/step-500000" \ - experiment -``` diff --git a/examples/clarinet/configs/clarinet_ljspeech.yaml b/examples/clarinet/configs/clarinet_ljspeech.yaml deleted file mode 100644 index 2e571e5..0000000 --- a/examples/clarinet/configs/clarinet_ljspeech.yaml +++ /dev/null @@ -1,52 +0,0 @@ -data: - batch_size: 8 - train_clip_seconds: 0.5 - sample_rate: 22050 - hop_length: 256 - win_length: 1024 - n_fft: 2048 - - n_mels: 80 - valid_size: 16 - - -conditioner: - upsampling_factors: [16, 16] - -teacher: - n_loop: 10 - n_layer: 3 - filter_size: 2 - residual_channels: 128 - loss_type: "mog" - output_dim: 3 - log_scale_min: -9 - -student: - n_loops: [10, 10, 10, 10, 10, 10] - n_layers: [1, 1, 1, 1, 1, 1] - filter_size: 3 - residual_channels: 64 - log_scale_min: -7 - -stft: - n_fft: 2048 - win_length: 1024 - hop_length: 256 - -loss: - lmd: 4 - -train: - learning_rate: 0.0005 - anneal_rate: 0.5 - anneal_interval: 200000 - gradient_max_norm: 100.0 - - checkpoint_interval: 1000 - eval_interval: 1000 - - max_iterations: 2000000 - - - diff --git a/examples/clarinet/configs/config.yaml b/examples/clarinet/configs/config.yaml deleted file mode 100644 index 2e571e5..0000000 --- a/examples/clarinet/configs/config.yaml +++ /dev/null @@ -1,52 +0,0 @@ -data: - batch_size: 8 - train_clip_seconds: 0.5 - sample_rate: 22050 - hop_length: 256 - win_length: 1024 - n_fft: 2048 - - n_mels: 80 - valid_size: 16 - - -conditioner: - upsampling_factors: [16, 16] - -teacher: - n_loop: 10 - n_layer: 3 - filter_size: 2 - residual_channels: 128 - loss_type: "mog" - output_dim: 3 - log_scale_min: -9 - -student: - n_loops: [10, 10, 10, 10, 10, 10] - n_layers: [1, 1, 1, 1, 1, 1] - filter_size: 3 - residual_channels: 64 - log_scale_min: -7 - -stft: - n_fft: 2048 - win_length: 1024 - hop_length: 256 - -loss: - lmd: 4 - -train: - learning_rate: 0.0005 - anneal_rate: 0.5 - anneal_interval: 200000 - gradient_max_norm: 100.0 - - checkpoint_interval: 1000 - eval_interval: 1000 - - max_iterations: 2000000 - - - diff --git a/examples/clarinet/synthesis.py b/examples/clarinet/synthesis.py deleted file mode 100644 index 
185a7ac..0000000 --- a/examples/clarinet/synthesis.py +++ /dev/null @@ -1,179 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import division -import os -import sys -import argparse -import ruamel.yaml -import random -from tqdm import tqdm -import pickle -import numpy as np - -import paddle.fluid.dygraph as dg -from paddle import fluid -fluid.require_version('1.8.0') - -from parakeet.modules.weight_norm import WeightNormWrapper -from parakeet.models.wavenet import WaveNet, UpsampleNet -from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet -from parakeet.data import TransformDataset, SliceDataset, RandomSampler, SequentialSampler, DataCargo -from parakeet.utils.layer_tools import summary, freeze -from parakeet.utils import io - -from utils import eval_model -sys.path.append("../wavenet") -from data import LJSpeechMetaData, Transform, DataCollector - -if __name__ == "__main__": - parser = argparse.ArgumentParser( - description="Synthesize audio files from mel spectrogram in the validation set." - ) - parser.add_argument("--config", type=str, help="path of the config file") - parser.add_argument( - "--device", type=int, default=-1, help="device to use.") - parser.add_argument("--data", type=str, help="path of LJspeech dataset") - - g = parser.add_mutually_exclusive_group() - g.add_argument("--checkpoint", type=str, help="checkpoint to resume from") - g.add_argument( - "--iteration", - type=int, - help="the iteration of the checkpoint to load from output directory") - - parser.add_argument( - "output", - type=str, - default="experiment", - help="path to save the synthesized audio") - - args = parser.parse_args() - - with open(args.config, 'rt') as f: - config = ruamel.yaml.safe_load(f) - - if args.device == -1: - place = fluid.CPUPlace() - else: - place = fluid.CUDAPlace(args.device) - - dg.enable_dygraph(place) - - ljspeech_meta = LJSpeechMetaData(args.data) - - data_config = config["data"] - sample_rate = data_config["sample_rate"] - n_fft = data_config["n_fft"] - win_length = data_config["win_length"] - hop_length = data_config["hop_length"] - n_mels = data_config["n_mels"] - train_clip_seconds = data_config["train_clip_seconds"] - transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels) - ljspeech = TransformDataset(ljspeech_meta, transform) - - valid_size = data_config["valid_size"] - ljspeech_valid = SliceDataset(ljspeech, 0, valid_size) - ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech)) - - teacher_config = config["teacher"] - n_loop = teacher_config["n_loop"] - n_layer = teacher_config["n_layer"] - filter_size = teacher_config["filter_size"] - context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)]) - print("context size is {} samples".format(context_size)) - train_batch_fn = DataCollector(context_size, sample_rate, hop_length, - train_clip_seconds) - valid_batch_fn = DataCollector( - context_size, sample_rate, hop_length, 
train_clip_seconds, valid=True) - - batch_size = data_config["batch_size"] - train_cargo = DataCargo( - ljspeech_train, - train_batch_fn, - batch_size, - sampler=RandomSampler(ljspeech_train)) - - # only batch=1 for validation is enabled - valid_cargo = DataCargo( - ljspeech_valid, - valid_batch_fn, - batch_size=1, - sampler=SequentialSampler(ljspeech_valid)) - - # conditioner(upsampling net) - conditioner_config = config["conditioner"] - upsampling_factors = conditioner_config["upsampling_factors"] - upsample_net = UpsampleNet(upscale_factors=upsampling_factors) - freeze(upsample_net) - - residual_channels = teacher_config["residual_channels"] - loss_type = teacher_config["loss_type"] - output_dim = teacher_config["output_dim"] - log_scale_min = teacher_config["log_scale_min"] - assert loss_type == "mog" and output_dim == 3, \ - "the teacher wavenet should be a wavenet with single gaussian output" - - teacher = WaveNet(n_loop, n_layer, residual_channels, output_dim, n_mels, - filter_size, loss_type, log_scale_min) - # load & freeze upsample_net & teacher - freeze(teacher) - - student_config = config["student"] - n_loops = student_config["n_loops"] - n_layers = student_config["n_layers"] - student_residual_channels = student_config["residual_channels"] - student_filter_size = student_config["filter_size"] - student_log_scale_min = student_config["log_scale_min"] - student = ParallelWaveNet(n_loops, n_layers, student_residual_channels, - n_mels, student_filter_size) - - stft_config = config["stft"] - stft = STFT( - n_fft=stft_config["n_fft"], - hop_length=stft_config["hop_length"], - win_length=stft_config["win_length"]) - - lmd = config["loss"]["lmd"] - model = Clarinet(upsample_net, teacher, student, stft, - student_log_scale_min, lmd) - summary(model) - - # load parameters - if args.checkpoint is not None: - # load from args.checkpoint - iteration = io.load_parameters(model, checkpoint_path=args.checkpoint) - else: - # load from "args.output/checkpoints" - checkpoint_dir = os.path.join(args.output, "checkpoints") - iteration = io.load_parameters( - model, checkpoint_dir=checkpoint_dir, iteration=args.iteration) - assert iteration > 0, "A trained checkpoint is needed." - - # make generation fast - for sublayer in model.sublayers(): - if isinstance(sublayer, WeightNormWrapper): - sublayer.remove_weight_norm() - - # data loader - valid_loader = fluid.io.DataLoader.from_generator( - capacity=10, return_list=True) - valid_loader.set_batch_generator(valid_cargo, place) - - # the directory to save audio files - synthesis_dir = os.path.join(args.output, "synthesis") - if not os.path.exists(synthesis_dir): - os.makedirs(synthesis_dir) - - eval_model(model, valid_loader, synthesis_dir, iteration, sample_rate) diff --git a/examples/clarinet/train.py b/examples/clarinet/train.py deleted file mode 100644 index ef7a93f..0000000 --- a/examples/clarinet/train.py +++ /dev/null @@ -1,243 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -from __future__ import division -import os -import sys -import argparse -import ruamel.yaml -import random -from tqdm import tqdm -import pickle -import numpy as np -from visualdl import LogWriter - -import paddle.fluid.dygraph as dg -from paddle import fluid -fluid.require_version('1.8.0') - -from parakeet.models.wavenet import WaveNet, UpsampleNet -from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet -from parakeet.data import TransformDataset, SliceDataset, CacheDataset, RandomSampler, SequentialSampler, DataCargo -from parakeet.utils.layer_tools import summary, freeze -from parakeet.utils import io - -from utils import make_output_tree, eval_model, load_wavenet - -# import dataset from wavenet -sys.path.append("../wavenet") -from data import LJSpeechMetaData, Transform, DataCollector - -if __name__ == "__main__": - parser = argparse.ArgumentParser( - description="Train a ClariNet model with LJspeech and a trained WaveNet model." - ) - parser.add_argument("--config", type=str, help="path of the config file") - parser.add_argument("--device", type=int, default=-1, help="device to use") - parser.add_argument("--data", type=str, help="path of LJspeech dataset") - - g = parser.add_mutually_exclusive_group() - g.add_argument("--checkpoint", type=str, help="checkpoint to resume from") - g.add_argument( - "--iteration", - type=int, - help="the iteration of the checkpoint to load from output directory") - - parser.add_argument( - "--wavenet", type=str, help="wavenet checkpoint to use") - - parser.add_argument( - "output", - type=str, - default="experiment", - help="path to save experiment results") - - args = parser.parse_args() - with open(args.config, 'rt') as f: - config = ruamel.yaml.safe_load(f) - - if args.device == -1: - place = fluid.CPUPlace() - else: - place = fluid.CUDAPlace(args.device) - - dg.enable_dygraph(place) - - print("Command Line args: ") - for k, v in vars(args).items(): - print("{}: {}".format(k, v)) - - ljspeech_meta = LJSpeechMetaData(args.data) - - data_config = config["data"] - sample_rate = data_config["sample_rate"] - n_fft = data_config["n_fft"] - win_length = data_config["win_length"] - hop_length = data_config["hop_length"] - n_mels = data_config["n_mels"] - train_clip_seconds = data_config["train_clip_seconds"] - transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels) - ljspeech = TransformDataset(ljspeech_meta, transform) - - valid_size = data_config["valid_size"] - ljspeech_valid = CacheDataset(SliceDataset(ljspeech, 0, valid_size)) - ljspeech_train = CacheDataset( - SliceDataset(ljspeech, valid_size, len(ljspeech))) - - teacher_config = config["teacher"] - n_loop = teacher_config["n_loop"] - n_layer = teacher_config["n_layer"] - filter_size = teacher_config["filter_size"] - context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)]) - print("context size is {} samples".format(context_size)) - train_batch_fn = DataCollector(context_size, sample_rate, hop_length, - train_clip_seconds) - valid_batch_fn = DataCollector( - context_size, sample_rate, hop_length, train_clip_seconds, valid=True) - - batch_size = data_config["batch_size"] - train_cargo = DataCargo( - ljspeech_train, - train_batch_fn, - batch_size, - sampler=RandomSampler(ljspeech_train)) - - # only batch=1 for validation is enabled - valid_cargo = DataCargo( - ljspeech_valid, - valid_batch_fn, - batch_size=1, - sampler=SequentialSampler(ljspeech_valid)) - - make_output_tree(args.output) - - # conditioner(upsampling net) - conditioner_config = 
config["conditioner"] - upsampling_factors = conditioner_config["upsampling_factors"] - upsample_net = UpsampleNet(upscale_factors=upsampling_factors) - freeze(upsample_net) - - residual_channels = teacher_config["residual_channels"] - loss_type = teacher_config["loss_type"] - output_dim = teacher_config["output_dim"] - log_scale_min = teacher_config["log_scale_min"] - assert loss_type == "mog" and output_dim == 3, \ - "the teacher wavenet should be a wavenet with single gaussian output" - - teacher = WaveNet(n_loop, n_layer, residual_channels, output_dim, n_mels, - filter_size, loss_type, log_scale_min) - freeze(teacher) - - student_config = config["student"] - n_loops = student_config["n_loops"] - n_layers = student_config["n_layers"] - student_residual_channels = student_config["residual_channels"] - student_filter_size = student_config["filter_size"] - student_log_scale_min = student_config["log_scale_min"] - student = ParallelWaveNet(n_loops, n_layers, student_residual_channels, - n_mels, student_filter_size) - - stft_config = config["stft"] - stft = STFT( - n_fft=stft_config["n_fft"], - hop_length=stft_config["hop_length"], - win_length=stft_config["win_length"]) - - lmd = config["loss"]["lmd"] - model = Clarinet(upsample_net, teacher, student, stft, - student_log_scale_min, lmd) - summary(model) - - # optim - train_config = config["train"] - learning_rate = train_config["learning_rate"] - anneal_rate = train_config["anneal_rate"] - anneal_interval = train_config["anneal_interval"] - lr_scheduler = dg.ExponentialDecay( - learning_rate, anneal_interval, anneal_rate, staircase=True) - gradiant_max_norm = train_config["gradient_max_norm"] - optim = fluid.optimizer.Adam( - lr_scheduler, - parameter_list=model.parameters(), - grad_clip=fluid.clip.ClipByGlobalNorm(gradiant_max_norm)) - - # train - max_iterations = train_config["max_iterations"] - checkpoint_interval = train_config["checkpoint_interval"] - eval_interval = train_config["eval_interval"] - checkpoint_dir = os.path.join(args.output, "checkpoints") - state_dir = os.path.join(args.output, "states") - log_dir = os.path.join(args.output, "log") - writer = LogWriter(log_dir) - - if args.checkpoint is not None: - iteration = io.load_parameters( - model, optim, checkpoint_path=args.checkpoint) - else: - iteration = io.load_parameters( - model, - optim, - checkpoint_dir=checkpoint_dir, - iteration=args.iteration) - - if iteration == 0: - assert args.wavenet is not None, "When training afresh, a trained wavenet model should be provided." 
- load_wavenet(model, args.wavenet) - - # loader - train_loader = fluid.io.DataLoader.from_generator( - capacity=10, return_list=True) - train_loader.set_batch_generator(train_cargo, place) - - valid_loader = fluid.io.DataLoader.from_generator( - capacity=10, return_list=True) - valid_loader.set_batch_generator(valid_cargo, place) - - # training loop - global_step = iteration + 1 - iterator = iter(tqdm(train_loader)) - while global_step <= max_iterations: - try: - batch = next(iterator) - except StopIteration as e: - iterator = iter(tqdm(train_loader)) - batch = next(iterator) - - audios, mels, audio_starts = batch - model.train() - loss_dict = model( - audios, mels, audio_starts, clip_kl=global_step > 500) - - writer.add_scalar("learning_rate", - optim._learning_rate.step().numpy()[0], global_step) - for k, v in loss_dict.items(): - writer.add_scalar("loss/{}".format(k), v.numpy()[0], global_step) - - l = loss_dict["loss"] - step_loss = l.numpy()[0] - print("[train] global_step: {} loss: {:<8.6f}".format(global_step, - step_loss)) - - l.backward() - optim.minimize(l) - optim.clear_gradients() - - if global_step % eval_interval == 0: - # evaluate on valid dataset - eval_model(model, valid_loader, state_dir, global_step, - sample_rate) - if global_step % checkpoint_interval == 0: - io.save_parameters(checkpoint_dir, global_step, model, optim) - - global_step += 1 diff --git a/examples/clarinet/utils.py b/examples/clarinet/utils.py deleted file mode 100644 index 1e1c46a..0000000 --- a/examples/clarinet/utils.py +++ /dev/null @@ -1,60 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import division -import os -import soundfile as sf -from collections import OrderedDict - -from paddle import fluid -import paddle.fluid.dygraph as dg - - -def make_output_tree(output_dir): - checkpoint_dir = os.path.join(output_dir, "checkpoints") - if not os.path.exists(checkpoint_dir): - os.makedirs(checkpoint_dir) - - state_dir = os.path.join(output_dir, "states") - if not os.path.exists(state_dir): - os.makedirs(state_dir) - - -def eval_model(model, valid_loader, output_dir, iteration, sample_rate): - model.eval() - for i, batch in enumerate(valid_loader): - # print("sentence {}".format(i)) - path = os.path.join(output_dir, - "sentence_{}_step_{}.wav".format(i, iteration)) - audio_clips, mel_specs, audio_starts = batch - wav_var = model.synthesis(mel_specs) - wav_np = wav_var.numpy()[0] - sf.write(path, wav_np, samplerate=sample_rate) - print("generated {}".format(path)) - - -def load_wavenet(model, path): - wavenet_dict, _ = dg.load_dygraph(path) - encoder_dict = OrderedDict() - teacher_dict = OrderedDict() - for k, v in wavenet_dict.items(): - if k.startswith("encoder."): - encoder_dict[k.split('.', 1)[1]] = v - else: - # k starts with "decoder." 
-            teacher_dict[k.split('.', 1)[1]] = v
-
-    model.encoder.set_dict(encoder_dict)
-    model.teacher.set_dict(teacher_dict)
-    print("loaded the encoder part and teacher part from wavenet model.")
diff --git a/examples/deepvoice3/README.md b/examples/deepvoice3/README.md
deleted file mode 100644
index 3e4b0b3..0000000
--- a/examples/deepvoice3/README.md
+++ /dev/null
@@ -1,144 +0,0 @@
-# Deep Voice 3
-
-PaddlePaddle dynamic graph implementation of Deep Voice 3, a convolutional network based text-to-speech generative model. The implementation is based on [Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning](https://arxiv.org/abs/1710.07654).
-
-We implement Deep Voice 3 using Paddle Fluid with dynamic graph, which is convenient for building flexible network architectures.
-
-## Dataset
-
-We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
-
-```bash
-wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
-tar xjvf LJSpeech-1.1.tar.bz2
-```
-
-## Model Architecture
-
-![Deep Voice 3 model architecture](./images/model_architecture.png)
-
-The model consists of an encoder, a decoder and a converter (and a speaker embedding for multispeaker models). The encoder and the decoder together form the seq2seq part of the model, and the converter forms the postnet part.
-
-## Project Structure
-
-```text
-├── config/
-├── synthesize.py
-├── data.py
-├── preprocess.py
-├── clip.py
-├── train.py
-└── vocoder.py
-```
-
-## Preprocess
-
-Preprocess the dataset with `preprocess.py`.
-
-```text
-usage: preprocess.py [-h] --config CONFIG --input INPUT --output OUTPUT
-
-preprocess ljspeech dataset and save it.
-
-optional arguments:
-  -h, --help       show this help message and exit
-  --config CONFIG  config file
-  --input INPUT    data path of the original data
-  --output OUTPUT  path to save the preprocessed dataset
-```
-
-example code:
-
-```bash
-python preprocess.py --config=configs/ljspeech.yaml --input=LJSpeech-1.1/ --output=data/ljspeech
-```
-
-## Train
-
-Train the model using `train.py`; follow the usage displayed by `python train.py --help`.
-
-```text
-usage: train.py [-h] --config CONFIG --input INPUT
-
-train a Deep Voice 3 model with LJSpeech
-
-optional arguments:
-  -h, --help       show this help message and exit
-  --config CONFIG  config file
-  --input INPUT    data path of the original data
-```
-
-example code:
-
-```bash
-CUDA_VISIBLE_DEVICES=0 python train.py --config=configs/ljspeech.yaml --input=data/ljspeech
-```
-
-Training creates a `runs` folder; outputs for each run are saved in a separate folder in `runs`, whose name is the start time joined with the hostname. Inside this folder, the tensorboard log, parameters and optimizer states are saved. Parameters (`*.pdparams`) and optimizer states (`*.pdopt`) are named by the step at which they are saved.
-
-```text
-runs/Jul07_09-39-34_instance-mqcyj27y-4/
-├── checkpoint
-├── events.out.tfevents.1594085974.instance-mqcyj27y-4
-├── step-1000000.pdopt
-├── step-1000000.pdparams
-├── step-100000.pdopt
-├── step-100000.pdparams
-...
-```
-
-Since we use WaveFlow to synthesize audio while training, download the trained WaveFlow model and extract it into the current directory before training.
-
-```bash
-wget https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_ckpt_1.0.zip
-unzip waveflow_res128_ljspeech_ckpt_1.0.zip
-```
-
-
-
-## Visualization
-
-You can visualize training losses, check the attention and listen to the synthesized audio when training with teacher forcing.
- -example code: - -```bash -tensorboard --logdir=runs/ --host=$HOSTNAME --port=8000 -``` - -## Synthesis - -```text -usage: synthesize from a checkpoint [-h] --config CONFIG --input INPUT - --output OUTPUT --checkpoint CHECKPOINT - --monotonic_layers MONOTONIC_LAYERS - [--vocoder {griffin-lim,waveflow}] - -optional arguments: - -h, --help show this help message and exit - --config CONFIG config file - --input INPUT text file to synthesize - --output OUTPUT path to save audio - --checkpoint CHECKPOINT - data path of the checkpoint - --monotonic_layers MONOTONIC_LAYERS - monotonic decoder layers' indices(start from 1) - --vocoder {griffin-lim,waveflow} - vocoder to use -``` - -`synthesize.py` is used to synthesize several sentences in a text file. -`--monotonic_layers` is the index of the decoders layer that manifest monotonic diagonal attention. You can get monotonic layers by inspecting tensorboard logs. Mind that the index starts from 1. The layers that manifest monotonic diagonal attention are stable for a model during training and synthesizing, but differ among different runs. So once you get the indices of monotonic layers by inspecting tensorboard log, you can use them at synthesizing. Note that only decoder layers that show strong diagonal attention should be considerd. -`--vocoder` is the vocoder to use. Current supported values are "waveflow" and "griffin-lim". Default value is "waveflow". - -example code: - -```bash -CUDA_VISIBLE_DEVICES=2 python synthesize.py \ - --config configs/ljspeech.yaml \ - --input sentences.txt \ - --output outputs/ \ - --checkpoint runs/Jul07_09-39-34_instance-mqcyj27y-4/step-1320000 \ - --monotonic_layers "5,6" \ - --vocoder waveflow -``` diff --git a/examples/deepvoice3/clip.py b/examples/deepvoice3/clip.py deleted file mode 100644 index 0a4f998..0000000 --- a/examples/deepvoice3/clip.py +++ /dev/null @@ -1,84 +0,0 @@ -from __future__ import print_function - -import copy -import six -import warnings - -import functools -from paddle.fluid import layers -from paddle.fluid import framework -from paddle.fluid import core -from paddle.fluid import name_scope -from paddle.fluid.dygraph import base as imperative_base -from paddle.fluid.clip import GradientClipBase, _correct_clip_op_role_var - -class DoubleClip(GradientClipBase): - def __init__(self, clip_value, clip_norm, group_name="default_group", need_clip=None): - super(DoubleClip, self).__init__(need_clip) - self.clip_value = float(clip_value) - self.clip_norm = float(clip_norm) - self.group_name = group_name - - def __str__(self): - return "Gradient Clip By Value and GlobalNorm, value={}, global_norm={}".format( - self.clip_value, self.clip_norm) - - @imperative_base.no_grad - def _dygraph_clip(self, params_grads): - params_grads = self._dygraph_clip_by_value(params_grads) - params_grads = self._dygraph_clip_by_global_norm(params_grads) - return params_grads - - @imperative_base.no_grad - def _dygraph_clip_by_value(self, params_grads): - params_and_grads = [] - for p, g in params_grads: - if g is None: - continue - if self._need_clip_func is not None and not self._need_clip_func(p): - params_and_grads.append((p, g)) - continue - new_grad = layers.clip(x=g, min=-self.clip_value, max=self.clip_value) - params_and_grads.append((p, new_grad)) - return params_and_grads - - @imperative_base.no_grad - def _dygraph_clip_by_global_norm(self, params_grads): - params_and_grads = [] - sum_square_list = [] - for p, g in params_grads: - if g is None: - continue - if self._need_clip_func is not None and not 
self._need_clip_func(p): - continue - merge_grad = g - if g.type == core.VarDesc.VarType.SELECTED_ROWS: - merge_grad = layers.merge_selected_rows(g) - merge_grad = layers.get_tensor_from_selected_rows(merge_grad) - square = layers.square(merge_grad) - sum_square = layers.reduce_sum(square) - sum_square_list.append(sum_square) - - # all parameters have been filterd out - if len(sum_square_list) == 0: - return params_grads - - global_norm_var = layers.concat(sum_square_list) - global_norm_var = layers.reduce_sum(global_norm_var) - global_norm_var = layers.sqrt(global_norm_var) - max_global_norm = layers.fill_constant( - shape=[1], dtype='float32', value=self.clip_norm) - clip_var = layers.elementwise_div( - x=max_global_norm, - y=layers.elementwise_max( - x=global_norm_var, y=max_global_norm)) - for p, g in params_grads: - if g is None: - continue - if self._need_clip_func is not None and not self._need_clip_func(p): - params_and_grads.append((p, g)) - continue - new_grad = layers.elementwise_mul(x=g, y=clip_var) - params_and_grads.append((p, new_grad)) - - return params_and_grads \ No newline at end of file diff --git a/examples/deepvoice3/configs/ljspeech.yaml b/examples/deepvoice3/configs/ljspeech.yaml deleted file mode 100644 index 1e8ec7b..0000000 --- a/examples/deepvoice3/configs/ljspeech.yaml +++ /dev/null @@ -1,46 +0,0 @@ -# data processing -p_pronunciation: 0.99 -sample_rate: 22050 # Hz -n_fft: 1024 -win_length: 1024 -hop_length: 256 -n_mels: 80 -reduction_factor: 4 - -# model-s2s -n_speakers: 1 -speaker_dim: 16 -char_dim: 256 -encoder_dim: 64 -kernel_size: 5 -encoder_layers: 7 -decoder_layers: 8 -prenet_sizes: [128] -attention_dim: 128 - -# model-postnet -postnet_layers: 5 -postnet_dim: 256 - -# position embedding -position_weight: 1.0 -position_rate: 5.54 -forward_step: 4 -backward_step: 0 - -dropout: 0.05 - -# output-griffinlim -sharpening_factor: 1.4 - -# optimizer: -learning_rate: 0.001 -clip_value: 5.0 -clip_norm: 100.0 - -# training: -max_iteration: 1000000 -batch_size: 16 -report_interval: 10000 -save_interval: 10000 -valid_size: 5 \ No newline at end of file diff --git a/examples/deepvoice3/data.py b/examples/deepvoice3/data.py deleted file mode 100644 index 984f963..0000000 --- a/examples/deepvoice3/data.py +++ /dev/null @@ -1,108 +0,0 @@ -import numpy as np -import os -import csv -import pandas as pd - -import paddle -from paddle import fluid -from paddle.fluid import dygraph as dg -from paddle.fluid.dataloader import Dataset, BatchSampler -from paddle.fluid.io import DataLoader - -from parakeet.data import DatasetMixin, DataCargo, PartialyRandomizedSimilarTimeLengthSampler -from parakeet.g2p import en - -class LJSpeech(DatasetMixin): - def __init__(self, root): - self._root = root - self._table = pd.read_csv( - os.path.join(root, "metadata.csv"), - sep="|", - encoding="utf-8", - quoting=csv.QUOTE_NONE, - header=None, - names=["num_frames", "spec_name", "mel_name", "text"], - dtype={"num_frames": np.int64, "spec_name": str, "mel_name":str, "text":str}) - - def num_frames(self): - return self._table["num_frames"].to_list() - - def get_example(self, i): - """ - spec (T_frame, C_spec) - mel (T_frame, C_mel) - """ - num_frames, spec_name, mel_name, text = self._table.iloc[i] - spec = np.load(os.path.join(self._root, spec_name)) - mel = np.load(os.path.join(self._root, mel_name)) - return (text, spec, mel, num_frames) - - def __len__(self): - return len(self._table) - -class DataCollector(object): - def __init__(self, p_pronunciation): - self.p_pronunciation = p_pronunciation 
- - def __call__(self, examples): - """ - output shape and dtype - (B, T_text) int64 - (B,) int64 - (B, T_frame, C_spec) float32 - (B, T_frame, C_mel) float32 - (B,) int64 - """ - text_seqs = [] - specs = [] - mels = [] - num_frames = np.array([example[3] for example in examples], dtype=np.int64) - max_frames = np.max(num_frames) - - for example in examples: - text, spec, mel, _ = example - text_seqs.append(en.text_to_sequence(text, self.p_pronunciation)) - specs.append(np.pad(spec, [(0, max_frames - spec.shape[0]), (0, 0)], mode="constant")) - mels.append(np.pad(mel, [(0, max_frames - mel.shape[0]), (0, 0)], mode="constant")) - - specs = np.stack(specs) - mels = np.stack(mels) - - text_lengths = np.array([len(seq) for seq in text_seqs], dtype=np.int64) - max_length = np.max(text_lengths) - text_seqs = np.array([seq + [0] * (max_length - len(seq)) for seq in text_seqs], dtype=np.int64) - return text_seqs, text_lengths, specs, mels, num_frames - -if __name__ == "__main__": - import argparse - import tqdm - import time - from ruamel import yaml - - parser = argparse.ArgumentParser(description="load the preprocessed ljspeech dataset") - parser.add_argument("--config", type=str, required=True, help="config file") - parser.add_argument("--input", type=str, required=True, help="data path of the original data") - args = parser.parse_args() - with open(args.config, 'rt') as f: - config = yaml.safe_load(f) - - print("========= Command Line Arguments ========") - for k, v in vars(args).items(): - print("{}: {}".format(k, v)) - print("=========== Configurations ==============") - for k in ["p_pronunciation", "batch_size"]: - print("{}: {}".format(k, config[k])) - - ljspeech = LJSpeech(args.input) - collate_fn = DataCollector(config["p_pronunciation"]) - - dg.enable_dygraph(fluid.CPUPlace()) - sampler = PartialyRandomizedSimilarTimeLengthSampler(ljspeech.num_frames()) - cargo = DataCargo(ljspeech, collate_fn, - batch_size=config["batch_size"], sampler=sampler) - loader = DataLoader\ - .from_generator(capacity=5, return_list=True)\ - .set_batch_generator(cargo) - - for i, batch in tqdm.tqdm(enumerate(loader)): - continue diff --git a/examples/deepvoice3/images/model_architecture.png b/examples/deepvoice3/images/model_architecture.png deleted file mode 100644 index 4668a30..0000000 Binary files a/examples/deepvoice3/images/model_architecture.png and /dev/null differ diff --git a/examples/deepvoice3/preprocess.py b/examples/deepvoice3/preprocess.py deleted file mode 100644 index d042980..0000000 --- a/examples/deepvoice3/preprocess.py +++ /dev/null @@ -1,122 +0,0 @@ -from __future__ import division -import os -import argparse -from ruamel import yaml -import tqdm -from os.path import join -import csv -import numpy as np -import pandas as pd -import librosa -import logging - -from parakeet.data import DatasetMixin - - -class LJSpeechMetaData(DatasetMixin): - def __init__(self, root): - self.root = root - self._wav_dir = join(root, "wavs") - csv_path = join(root, "metadata.csv") - self._table = pd.read_csv( - csv_path, - sep="|", - encoding="utf-8", - header=None, - quoting=csv.QUOTE_NONE, - names=["fname", "raw_text", "normalized_text"]) - - def get_example(self, i): - fname, raw_text, normalized_text = self._table.iloc[i] - abs_fname = join(self._wav_dir, fname + ".wav") - return fname, abs_fname, raw_text, normalized_text - - def __len__(self): - return len(self._table) - - -class Transform(object): - def __init__(self, sample_rate, n_fft, hop_length, win_length, n_mels, reduction_factor): - 
self.sample_rate = sample_rate - self.n_fft = n_fft - self.win_length = win_length - self.hop_length = hop_length - self.n_mels = n_mels - self.reduction_factor = reduction_factor - - def __call__(self, fname): - # wave processing - audio, _ = librosa.load(fname, sr=self.sample_rate) - - # Pad the data to the right size to have a whole number of timesteps, - # accounting properly for the model reduction factor. - frames = audio.size // (self.reduction_factor * self.hop_length) + 1 - # librosa's stft extract frame of n_fft size, so we should pad n_fft // 2 on both sidess - desired_length = (frames * self.reduction_factor - 1) * self.hop_length + self.n_fft - pad_amount = (desired_length - audio.size) // 2 - - # we pad mannually to control the number of generated frames - if audio.size % 2 == 0: - audio = np.pad(audio, (pad_amount, pad_amount), mode='reflect') - else: - audio = np.pad(audio, (pad_amount, pad_amount + 1), mode='reflect') - - # STFT - D = librosa.stft(audio, self.n_fft, self.hop_length, self.win_length, center=False) - S = np.abs(D) - S_mel = librosa.feature.melspectrogram(sr=self.sample_rate, S=S, n_mels=self.n_mels, fmax=8000.0) - - # log magnitude - log_spectrogram = np.log(np.clip(S, a_min=1e-5, a_max=None)) - log_mel_spectrogram = np.log(np.clip(S_mel, a_min=1e-5, a_max=None)) - num_frames = log_spectrogram.shape[-1] - assert num_frames % self.reduction_factor == 0, "num_frames is wrong" - return (log_spectrogram.T, log_mel_spectrogram.T, num_frames) - - -def save(output_path, dataset, transform): - if not os.path.exists(output_path): - os.makedirs(output_path) - records = [] - for example in tqdm.tqdm(dataset): - fname, abs_fname, _, normalized_text = example - log_spec, log_mel_spec, num_frames = transform(abs_fname) - records.append((num_frames, - fname + "_spec.npy", - fname + "_mel.npy", - normalized_text)) - np.save(join(output_path, fname + "_spec"), log_spec) - np.save(join(output_path, fname + "_mel"), log_mel_spec) - meta_data = pd.DataFrame.from_records(records) - meta_data.to_csv(join(output_path, "metadata.csv"), - quoting=csv.QUOTE_NONE, sep="|", encoding="utf-8", - header=False, index=False) - - -if __name__ == "__main__": - parser = argparse.ArgumentParser(description="preprocess ljspeech dataset and save it.") - parser.add_argument("--config", type=str, required=True, help="config file") - parser.add_argument("--input", type=str, required=True, help="data path of the original data") - parser.add_argument("--output", type=str, required=True, help="path to save the preprocessed dataset") - - args = parser.parse_args() - with open(args.config, 'rt') as f: - config = yaml.safe_load(f) - - print("========= Command Line Arguments ========") - for k, v in vars(args).items(): - print("{}: {}".format(k, v)) - print("=========== Configurations ==============") - for k in ["sample_rate", "n_fft", "win_length", - "hop_length", "n_mels", "reduction_factor"]: - print("{}: {}".format(k, config[k])) - - ljspeech_meta = LJSpeechMetaData(args.input) - transform = Transform(config["sample_rate"], - config["n_fft"], - config["hop_length"], - config["win_length"], - config["n_mels"], - config["reduction_factor"]) - save(args.output, ljspeech_meta, transform) - diff --git a/examples/deepvoice3/synthesize.py b/examples/deepvoice3/synthesize.py deleted file mode 100644 index 3540d6e..0000000 --- a/examples/deepvoice3/synthesize.py +++ /dev/null @@ -1,101 +0,0 @@ -import numpy as np -from matplotlib import cm -import librosa -import os -import time -import tqdm -import argparse 
-from ruamel import yaml -import paddle -from paddle import fluid -from paddle.fluid import layers as F -from paddle.fluid import dygraph as dg -from paddle.fluid.io import DataLoader -import soundfile as sf - -from parakeet.data import SliceDataset, DataCargo, PartialyRandomizedSimilarTimeLengthSampler, SequentialSampler -from parakeet.utils.io import save_parameters, load_parameters, add_yaml_config_to_args -from parakeet.g2p import en -from parakeet.models.deepvoice3.weight_norm_hook import remove_weight_norm -from vocoder import WaveflowVocoder, GriffinLimVocoder -from train import create_model - - -def main(args, config): - model = create_model(config) - loaded_step = load_parameters(model, checkpoint_path=args.checkpoint) - for name, layer in model.named_sublayers(): - try: - remove_weight_norm(layer) - except ValueError: - # this layer has not weight norm hook - pass - model.eval() - if args.vocoder == "waveflow": - vocoder = WaveflowVocoder() - vocoder.model.eval() - elif args.vocoder == "griffin-lim": - vocoder = GriffinLimVocoder( - sharpening_factor=config["sharpening_factor"], - sample_rate=config["sample_rate"], - n_fft=config["n_fft"], - win_length=config["win_length"], - hop_length=config["hop_length"]) - else: - raise ValueError("Other vocoders are not supported.") - - if not os.path.exists(args.output): - os.makedirs(args.output) - monotonic_layers = [int(item.strip()) - 1 for item in args.monotonic_layers.split(',')] - with open(args.input, 'rt') as f: - sentences = [line.strip() for line in f.readlines()] - for i, sentence in enumerate(sentences): - wav = synthesize(args, config, model, vocoder, sentence, monotonic_layers) - sf.write(os.path.join(args.output, "sentence{}.wav".format(i)), - wav, samplerate=config["sample_rate"]) - - -def synthesize(args, config, model, vocoder, sentence, monotonic_layers): - print("[synthesize] {}".format(sentence)) - text = en.text_to_sequence(sentence, p=1.0) - text = np.expand_dims(np.array(text, dtype="int64"), 0) - lengths = np.array([text.size], dtype=np.int64) - text_seqs = dg.to_variable(text) - text_lengths = dg.to_variable(lengths) - - decoder_layers = config["decoder_layers"] - force_monotonic_attention = [False] * decoder_layers - for i in monotonic_layers: - force_monotonic_attention[i] = True - - with dg.no_grad(): - outputs = model(text_seqs, text_lengths, speakers=None, - force_monotonic_attention=force_monotonic_attention, - window=(config["backward_step"], config["forward_step"])) - decoded, refined, attentions = outputs - if args.vocoder == "griffin-lim": - wav_np = vocoder(refined.numpy()[0].T) - else: - wav = vocoder(F.transpose(refined, (0, 2, 1))) - wav_np = wav.numpy()[0] - return wav_np - - - - -if __name__ == "__main__": - import argparse - from ruamel import yaml - parser = argparse.ArgumentParser("synthesize from a checkpoint") - parser.add_argument("--config", type=str, required=True, help="config file") - parser.add_argument("--input", type=str, required=True, help="text file to synthesize") - parser.add_argument("--output", type=str, required=True, help="path to save audio") - parser.add_argument("--checkpoint", type=str, required=True, help="data path of the checkpoint") - parser.add_argument("--monotonic_layers", type=str, required=True, help="monotonic decoder layers' indices(start from 1)") - parser.add_argument("--vocoder", type=str, default="waveflow", choices=['griffin-lim', 'waveflow'], help="vocoder to use") - args = parser.parse_args() - with open(args.config, 'rt') as f: - config = 
yaml.safe_load(f) - - dg.enable_dygraph(fluid.CUDAPlace(0)) - main(args, config) \ No newline at end of file diff --git a/examples/deepvoice3/train.py b/examples/deepvoice3/train.py deleted file mode 100644 index c552217..0000000 --- a/examples/deepvoice3/train.py +++ /dev/null @@ -1,187 +0,0 @@ -import numpy as np -from matplotlib import cm -import librosa -import os -import time -import tqdm -import paddle -from paddle import fluid -from paddle.fluid import layers as F -from paddle.fluid import initializer as I -from paddle.fluid import dygraph as dg -from paddle.fluid.io import DataLoader -from visualdl import LogWriter - -from parakeet.models.deepvoice3 import Encoder, Decoder, PostNet, SpectraNet -from parakeet.data import SliceDataset, DataCargo, SequentialSampler, RandomSampler -from parakeet.utils.io import save_parameters, load_parameters -from parakeet.g2p import en - -from data import LJSpeech, DataCollector -from vocoder import WaveflowVocoder, GriffinLimVocoder -from clip import DoubleClip - - -def create_model(config): - char_embedding = dg.Embedding((en.n_vocab, config["char_dim"]), param_attr=I.Normal(scale=0.1)) - multi_speaker = config["n_speakers"] > 1 - speaker_embedding = dg.Embedding((config["n_speakers"], config["speaker_dim"]), param_attr=I.Normal(scale=0.1)) \ - if multi_speaker else None - encoder = Encoder(config["encoder_layers"], config["char_dim"], - config["encoder_dim"], config["kernel_size"], - has_bias=multi_speaker, bias_dim=config["speaker_dim"], - keep_prob=1.0 - config["dropout"]) - decoder = Decoder(config["n_mels"], config["reduction_factor"], - list(config["prenet_sizes"]) + [config["char_dim"]], - config["decoder_layers"], config["kernel_size"], - config["attention_dim"], - position_encoding_weight=config["position_weight"], - omega=config["position_rate"], - has_bias=multi_speaker, bias_dim=config["speaker_dim"], - keep_prob=1.0 - config["dropout"]) - postnet = PostNet(config["postnet_layers"], config["char_dim"], - config["postnet_dim"], config["kernel_size"], - config["n_mels"], config["reduction_factor"], - has_bias=multi_speaker, bias_dim=config["speaker_dim"], - keep_prob=1.0 - config["dropout"]) - spectranet = SpectraNet(char_embedding, speaker_embedding, encoder, decoder, postnet) - return spectranet - -def create_data(config, data_path): - dataset = LJSpeech(data_path) - - train_dataset = SliceDataset(dataset, config["valid_size"], len(dataset)) - train_collator = DataCollector(config["p_pronunciation"]) - train_sampler = RandomSampler(train_dataset) - train_cargo = DataCargo(train_dataset, train_collator, - batch_size=config["batch_size"], sampler=train_sampler) - train_loader = DataLoader\ - .from_generator(capacity=10, return_list=True)\ - .set_batch_generator(train_cargo) - - valid_dataset = SliceDataset(dataset, 0, config["valid_size"]) - valid_collector = DataCollector(1.) 
- valid_sampler = SequentialSampler(valid_dataset) - valid_cargo = DataCargo(valid_dataset, valid_collector, - batch_size=1, sampler=valid_sampler) - valid_loader = DataLoader\ - .from_generator(capacity=2, return_list=True)\ - .set_batch_generator(valid_cargo) - return train_loader, valid_loader - -def create_optimizer(model, config): - optim = fluid.optimizer.Adam(config["learning_rate"], - parameter_list=model.parameters(), - grad_clip=DoubleClip(config["clip_value"], config["clip_norm"])) - return optim - -def train(args, config): - model = create_model(config) - train_loader, valid_loader = create_data(config, args.input) - optim = create_optimizer(model, config) - - global global_step - max_iteration = config["max_iteration"] - - iterator = iter(tqdm.tqdm(train_loader)) - while global_step <= max_iteration: - # get inputs - try: - batch = next(iterator) - except StopIteration: - iterator = iter(tqdm.tqdm(train_loader)) - batch = next(iterator) - - # unzip it - text_seqs, text_lengths, specs, mels, num_frames = batch - - # forward & backward - model.train() - outputs = model(text_seqs, text_lengths, speakers=None, mel=mels) - decoded, refined, attentions, final_state = outputs - - causal_mel_loss = model.spec_loss(decoded, mels, num_frames) - non_causal_mel_loss = model.spec_loss(refined, mels, num_frames) - loss = causal_mel_loss + non_causal_mel_loss - loss.backward() - - # update - optim.minimize(loss) - - # logging - tqdm.tqdm.write("[train] step: {}\tloss: {:.6f}\tcausal:{:.6f}\tnon_causal:{:.6f}".format( - global_step, - loss.numpy()[0], - causal_mel_loss.numpy()[0], - non_causal_mel_loss.numpy()[0])) - writer.add_scalar("loss/causal_mel_loss", causal_mel_loss.numpy()[0], step=global_step) - writer.add_scalar("loss/non_causal_mel_loss", non_causal_mel_loss.numpy()[0], step=global_step) - writer.add_scalar("loss/loss", loss.numpy()[0], step=global_step) - - if global_step % config["report_interval"] == 0: - text_length = int(text_lengths.numpy()[0]) - num_frame = int(num_frames.numpy()[0]) - - tag = "train_mel/ground-truth" - img = cm.viridis(normalize(mels.numpy()[0, :num_frame].T)) - writer.add_image(tag, img, step=global_step) - - tag = "train_mel/decoded" - img = cm.viridis(normalize(decoded.numpy()[0, :num_frame].T)) - writer.add_image(tag, img, step=global_step) - - tag = "train_mel/refined" - img = cm.viridis(normalize(refined.numpy()[0, :num_frame].T)) - writer.add_image(tag, img, step=global_step) - - vocoder = WaveflowVocoder() - vocoder.model.eval() - - tag = "train_audio/ground-truth-waveflow" - wav = vocoder(F.transpose(mels[0:1, :num_frame, :], (0, 2, 1))) - writer.add_audio(tag, wav.numpy()[0], step=global_step, sample_rate=22050) - - tag = "train_audio/decoded-waveflow" - wav = vocoder(F.transpose(decoded[0:1, :num_frame, :], (0, 2, 1))) - writer.add_audio(tag, wav.numpy()[0], step=global_step, sample_rate=22050) - - tag = "train_audio/refined-waveflow" - wav = vocoder(F.transpose(refined[0:1, :num_frame, :], (0, 2, 1))) - writer.add_audio(tag, wav.numpy()[0], step=global_step, sample_rate=22050) - - attentions_np = attentions.numpy() - attentions_np = attentions_np[:, 0, :num_frame // 4 , :text_length] - for i, attention_layer in enumerate(np.rot90(attentions_np, axes=(1,2))): - tag = "train_attention/layer_{}".format(i) - img = cm.viridis(normalize(attention_layer)) - writer.add_image(tag, img, step=global_step, dataformats="HWC") - - if global_step % config["save_interval"] == 0: - save_parameters(writer.logdir, global_step, model, optim) - - # global step +1 - 
global_step += 1 - -def normalize(arr): - return (arr - arr.min()) / (arr.max() - arr.min()) - -if __name__ == "__main__": - import argparse - from ruamel import yaml - - parser = argparse.ArgumentParser(description="train a Deep Voice 3 model with LJSpeech") - parser.add_argument("--config", type=str, required=True, help="config file") - parser.add_argument("--input", type=str, required=True, help="data path of the original data") - - args = parser.parse_args() - with open(args.config, 'rt') as f: - config = yaml.safe_load(f) - - dg.enable_dygraph(fluid.CUDAPlace(0)) - global global_step - global_step = 1 - global writer - writer = LogWriter() - print("[Training] tensorboard log and checkpoints are save in {}".format( - writer.logdir)) - train(args, config) \ No newline at end of file diff --git a/examples/deepvoice3/vocoder.py b/examples/deepvoice3/vocoder.py deleted file mode 100644 index 5568394..0000000 --- a/examples/deepvoice3/vocoder.py +++ /dev/null @@ -1,51 +0,0 @@ -import argparse -from ruamel import yaml -import numpy as np -import librosa -import paddle -from paddle import fluid -from paddle.fluid import layers as F -from paddle.fluid import dygraph as dg -from parakeet.utils.io import load_parameters -from parakeet.models.waveflow.waveflow_modules import WaveFlowModule - -class WaveflowVocoder(object): - def __init__(self): - config_path = "waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml" - with open(config_path, 'rt') as f: - config = yaml.safe_load(f) - ns = argparse.Namespace() - for k, v in config.items(): - setattr(ns, k, v) - ns.use_fp16 = False - - self.model = WaveFlowModule(ns) - checkpoint_path = "waveflow_res128_ljspeech_ckpt_1.0/step-2000000" - load_parameters(self.model, checkpoint_path=checkpoint_path) - - def __call__(self, mel): - with dg.no_grad(): - self.model.eval() - audio = self.model.synthesize(mel) - self.model.train() - return audio - -class GriffinLimVocoder(object): - def __init__(self, sharpening_factor=1.4, sample_rate=22050, n_fft=1024, - win_length=1024, hop_length=256): - self.sample_rate = sample_rate - self.n_fft = n_fft - self.sharpening_factor = sharpening_factor - self.win_length = win_length - self.hop_length = hop_length - - def __call__(self, mel): - spec = librosa.feature.inverse.mel_to_stft( - np.exp(mel), - sr=self.sample_rate, - n_fft=self.n_fft, - fmin=0, fmax=8000.0, power=1.0) - audio = librosa.core.griffinlim(spec ** self.sharpening_factor, - win_length=self.win_length, hop_length=self.hop_length) - return audio - diff --git a/examples/fastspeech/README.md b/examples/fastspeech/README.md deleted file mode 100644 index 08c3cfd..0000000 --- a/examples/fastspeech/README.md +++ /dev/null @@ -1,144 +0,0 @@ -# Fastspeech - -PaddlePaddle dynamic graph implementation of Fastspeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263). - -## Dataset - -We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/). - -```bash -wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 -tar xjvf LJSpeech-1.1.tar.bz2 -``` - -## Model Architecture - -![FastSpeech model architecture](./images/model_architecture.png) - -FastSpeech is a feed-forward structure based on Transformer, instead of using the encoder-attention-decoder based architecture. 
This model extracts attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length
-regulator to expand the source phoneme sequence to match the length of the target
-mel-spectrogram sequence for parallel mel-spectrogram generation. We use TransformerTTS as the teacher model.
-The model consists of three parts: an encoder, a decoder and a length regulator.
-
-## Project Structure
-
-```text
-├── config                 # yaml configuration files
-├── synthesis.py           # script to synthesize waveform from text
-├── train.py               # script for model training
-```
-
-## Saving & Loading
-
-`train_transformer.py` and `train_vocoder.py` have 3 arguments in common, `--checkpoint`, `--iteration` and `--output`.
-
-1. `--output` is the directory for saving results.
-During training, checkpoints are saved in `${output}/checkpoints` and tensorboard logs are saved in `${output}/log`.
-During synthesis, results are saved in `${output}/samples` and the tensorboard log is saved in `${output}/log`.
-
-2. `--checkpoint` is the path of a checkpoint and `--iteration` is the target step. They are used to load checkpoints in the following way.
-
-    - If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
-
-    - If `--checkpoint` is not provided, we try to load the checkpoint of the target step specified by `--iteration` from the `${output}/checkpoints/` directory, e.g. if given `--iteration 120000`, the checkpoint `${output}/checkpoints/step-120000.*` will be loaded.
-
-    - If both `--checkpoint` and `--iteration` are not provided, we try to load the latest checkpoint from the `${output}/checkpoints/` directory.
-
-## Compute Phoneme Duration
-
-A ground truth duration for each phoneme (the number of frames in the spectrogram that correspond to that phoneme) should be provided when training a FastSpeech model.
-
-We compute the ground truth duration of each phoneme in the following way.
-We extract the encoder-decoder attention alignment from a trained Transformer TTS model;
-each frame is considered to correspond to the phoneme that receives the most attention.
-
-You can run `alignments/get_alignments.py` to get it.
-
-```bash
-cd alignments
-python get_alignments.py \
---use_gpu=1 \
---output='./alignments' \
---data=${DATAPATH} \
---config=${CONFIG} \
---checkpoint_transformer=${CHECKPOINT} \
-```
-
-where `${DATAPATH}` is the path of the saved LJSpeech data, `${CHECKPOINT}` is the path of the pre-trained TransformerTTS model, and `${CONFIG}` is the config yaml file of the TransformerTTS checkpoint. It is necessary for you to prepare a pre-trained TransformerTTS checkpoint.
-
-For more help on arguments
-
-``python get_alignments.py --help``.
-
-Or you can use your own phoneme durations; you just need to process the data into the following format.
-
-```bash
-{'fname1': alignment1,
-'fname2': alignment2,
-...}
-```
-
-## Train FastSpeech
-
-The FastSpeech model can be trained by running ``train.py``.
-
-```bash
-python train.py \
---use_gpu=1 \
---data=${DATAPATH} \
---alignments_path=${ALIGNMENTS_PATH} \
---output=${OUTPUTPATH} \
---config='configs/ljspeech.yaml' \
-```
-
-Or you can run the script file directly.
-
-```bash
-sh train.sh
-```
-
-If you want to train on multiple GPUs, start training in the following way.
- -```bash -CUDA_VISIBLE_DEVICES=0,1,2,3 -python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py \ ---use_gpu=1 \ ---data=${DATAPATH} \ ---alignments_path=${ALIGNMENTS_PATH} \ ---output=${OUTPUTPATH} \ ---config='configs/ljspeech.yaml' \ -``` - -If you wish to resume from an existing model, See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading. - -For more help on arguments - -``python train.py --help``. - -## Synthesis - -After training the FastSpeech, audio can be synthesized by running ``synthesis.py``. - -```bash -python synthesis.py \ ---use_gpu=1 \ ---alpha=1.0 \ ---checkpoint=${CHECKPOINTPATH} \ ---config='configs/ljspeech.yaml' \ ---output=${OUTPUTPATH} \ ---vocoder='griffin-lim' \ -``` - -We currently support two vocoders, Griffin-Lim algorithm and WaveFlow. You can set ``--vocoder`` to use one of them. If you want to use WaveFlow as your vocoder, you need to set ``--config_vocoder`` and ``--checkpoint_vocoder`` which are the path of the config and checkpoint of vocoder. You can download the pre-trained model of WaveFlow from [here](https://github.com/PaddlePaddle/Parakeet#vocoders). - -Or you can run the script file directly. - -```bash -sh synthesis.sh -``` - -For more help on arguments - -``python synthesis.py --help``. - -Then you can find the synthesized audio files in ``${OUTPUTPATH}/samples``. diff --git a/examples/fastspeech/alignments/get_alignments.py b/examples/fastspeech/alignments/get_alignments.py deleted file mode 100644 index 8a46ff2..0000000 --- a/examples/fastspeech/alignments/get_alignments.py +++ /dev/null @@ -1,132 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import os -from scipy.io.wavfile import write -from parakeet.g2p.en import text_to_sequence -import numpy as np -import pandas as pd -import csv -from tqdm import tqdm -from ruamel import yaml -import pickle -from pathlib import Path -import argparse -from pprint import pprint -from collections import OrderedDict -import paddle.fluid as fluid -import paddle.fluid.dygraph as dg -from parakeet.models.transformer_tts.utils import * -from parakeet.models.transformer_tts import TransformerTTS -from parakeet.models.fastspeech.utils import get_alignment -from parakeet.utils import io - - -def add_config_options_to_parser(parser): - parser.add_argument("--config", type=str, help="path of the config file") - parser.add_argument("--use_gpu", type=int, default=0, help="device to use") - parser.add_argument("--data", type=str, help="path of LJspeech dataset") - - parser.add_argument( - "--checkpoint_transformer", - type=str, - help="transformer_tts checkpoint to synthesis") - - parser.add_argument( - "--output", - type=str, - default="./alignments", - help="path to save experiment results") - - -def alignments(args): - local_rank = dg.parallel.Env().local_rank - place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace()) - - with open(args.config) as f: - cfg = yaml.load(f, Loader=yaml.Loader) - - with dg.guard(place): - network_cfg = cfg['network'] - model = TransformerTTS( - network_cfg['embedding_size'], network_cfg['hidden_size'], - network_cfg['encoder_num_head'], network_cfg['encoder_n_layers'], - cfg['audio']['num_mels'], network_cfg['outputs_per_step'], - network_cfg['decoder_num_head'], network_cfg['decoder_n_layers']) - # Load parameters. - global_step = io.load_parameters( - model=model, checkpoint_path=args.checkpoint_transformer) - model.eval() - - # get text data - root = Path(args.data) - csv_path = root.joinpath("metadata.csv") - table = pd.read_csv( - csv_path, - sep="|", - header=None, - quoting=csv.QUOTE_NONE, - names=["fname", "raw_text", "normalized_text"]) - - pbar = tqdm(range(len(table))) - alignments = OrderedDict() - for i in pbar: - fname, raw_text, normalized_text = table.iloc[i] - # init input - text = np.asarray(text_to_sequence(normalized_text)) - text = fluid.layers.unsqueeze(dg.to_variable(text), [0]) - pos_text = np.arange(1, text.shape[1] + 1) - pos_text = fluid.layers.unsqueeze(dg.to_variable(pos_text), [0]) - - # load - wav, _ = librosa.load( - str(os.path.join(args.data, 'wavs', fname + ".wav"))) - - spec = librosa.stft( - y=wav, - n_fft=cfg['audio']['n_fft'], - win_length=cfg['audio']['win_length'], - hop_length=cfg['audio']['hop_length']) - mag = np.abs(spec) - mel = librosa.filters.mel(sr=cfg['audio']['sr'], - n_fft=cfg['audio']['n_fft'], - n_mels=cfg['audio']['num_mels'], - fmin=cfg['audio']['fmin'], - fmax=cfg['audio']['fmax']) - mel = np.matmul(mel, mag) - mel = np.log(np.maximum(mel, 1e-5)) - - mel_input = np.transpose(mel, axes=(1, 0)) - mel_input = fluid.layers.unsqueeze(dg.to_variable(mel_input), [0]) - mel_lens = mel_input.shape[1] - - pos_mel = np.arange(1, mel_input.shape[1] + 1) - pos_mel = fluid.layers.unsqueeze(dg.to_variable(pos_mel), [0]) - mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model( - text, mel_input, pos_text, pos_mel) - mel_input = fluid.layers.concat( - [mel_input, postnet_pred[:, -1:, :]], axis=1) - - alignment, _ = get_alignment(attn_probs, mel_lens, - network_cfg['decoder_num_head']) - alignments[fname] = alignment - with open(args.output + '.pkl', "wb") as f: - pickle.dump(alignments, f) - 
- -if __name__ == '__main__': - parser = argparse.ArgumentParser( - description="Get alignments from TransformerTTS model") - add_config_options_to_parser(parser) - args = parser.parse_args() - alignments(args) diff --git a/examples/fastspeech/alignments/get_alignments.sh b/examples/fastspeech/alignments/get_alignments.sh deleted file mode 100644 index 0fd0394..0000000 --- a/examples/fastspeech/alignments/get_alignments.sh +++ /dev/null @@ -1,14 +0,0 @@ - -CUDA_VISIBLE_DEVICES=0 \ -python -u get_alignments.py \ ---use_gpu=1 \ ---output='./alignments' \ ---data='../../../dataset/LJSpeech-1.1' \ ---config='../../transformer_tts/configs/ljspeech.yaml' \ ---checkpoint_transformer='../../transformer_tts/checkpoint/transformer/step-120000' \ - -if [ $? -ne 0 ]; then - echo "Failed in training!" - exit 1 -fi -exit 0 \ No newline at end of file diff --git a/examples/fastspeech/configs/ljspeech.yaml b/examples/fastspeech/configs/ljspeech.yaml deleted file mode 100644 index 32bdd42..0000000 --- a/examples/fastspeech/configs/ljspeech.yaml +++ /dev/null @@ -1,36 +0,0 @@ -audio: - num_mels: 80 #the number of mel bands when calculating mel spectrograms. - n_fft: 1024 #the number of fft components. - sr: 22050 #the sampling rate of audio data file. - hop_length: 256 #the number of samples to advance between frames. - win_length: 1024 #the length (width) of the window function. - preemphasis: 0.97 - power: 1.2 #the power to raise before griffin-lim. - fmin: 0 - fmax: 8000 - -network: - encoder_n_layer: 6 #the number of FFT Block in encoder. - encoder_head: 2 #the attention head number in encoder. - encoder_conv1d_filter_size: 1536 #the filter size of conv1d in encoder. - max_seq_len: 2048 #the max length of sequence. - decoder_n_layer: 6 #the number of FFT Block in decoder. - decoder_head: 2 #the attention head number in decoder. - decoder_conv1d_filter_size: 1536 #the filter size of conv1d in decoder. - hidden_size: 384 #the hidden size in model of fastspeech. - duration_predictor_output_size: 256 #the output size of duration predictior. - duration_predictor_filter_size: 3 #the filter size of conv1d in duration prediction. - fft_conv1d_filter: 3 #the filter size of conv1d in fft. - fft_conv1d_padding: 1 #the padding size of conv1d in fft. - dropout: 0.1 #the dropout in network. - outputs_per_step: 1 - -train: - batch_size: 32 - learning_rate: 0.001 - warm_up_step: 4000 #the warm up step of learning rate. - grad_clip_thresh: 0.1 #the threshold of grad clip. - - checkpoint_interval: 1000 - max_iteration: 500000 - diff --git a/examples/fastspeech/data.py b/examples/fastspeech/data.py deleted file mode 100644 index b920035..0000000 --- a/examples/fastspeech/data.py +++ /dev/null @@ -1,186 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-from pathlib import Path -import numpy as np -import pandas as pd -import librosa -import csv -import pickle - -from paddle import fluid -from parakeet import g2p -from parakeet import audio -from parakeet.data.sampler import * -from parakeet.data.datacargo import DataCargo -from parakeet.data.batch import TextIDBatcher, SpecBatcher -from parakeet.data.dataset import DatasetMixin, TransformDataset, CacheDataset, SliceDataset -from parakeet.models.transformer_tts.utils import * - - -class LJSpeechLoader: - def __init__(self, - config, - place, - data_path, - alignments_path, - batch_size, - nranks, - rank, - is_vocoder=False, - shuffle=True): - - LJSPEECH_ROOT = Path(data_path) - metadata = LJSpeechMetaData(LJSPEECH_ROOT, alignments_path) - transformer = LJSpeech(config) - dataset = TransformDataset(metadata, transformer) - dataset = CacheDataset(dataset) - - sampler = DistributedSampler( - len(dataset), nranks, rank, shuffle=shuffle) - - assert batch_size % nranks == 0 - each_bs = batch_size // nranks - dataloader = DataCargo( - dataset, - sampler=sampler, - batch_size=each_bs, - shuffle=shuffle, - batch_fn=batch_examples, - drop_last=True) - self.reader = fluid.io.DataLoader.from_generator( - capacity=32, - iterable=True, - use_double_buffer=True, - return_list=True) - self.reader.set_batch_generator(dataloader, place) - - -class LJSpeechMetaData(DatasetMixin): - def __init__(self, root, alignments_path): - self.root = Path(root) - self._wav_dir = self.root.joinpath("wavs") - csv_path = self.root.joinpath("metadata.csv") - self._table = pd.read_csv( - csv_path, - sep="|", - header=None, - quoting=csv.QUOTE_NONE, - names=["fname", "raw_text", "normalized_text"]) - with open(alignments_path, "rb") as f: - self._alignments = pickle.load(f) - - def get_example(self, i): - fname, raw_text, normalized_text = self._table.iloc[i] - alignment = self._alignments[fname] - fname = str(self._wav_dir.joinpath(fname + ".wav")) - return fname, normalized_text, alignment - - def __len__(self): - return len(self._table) - - -class LJSpeech(object): - def __init__(self, cfg): - super(LJSpeech, self).__init__() - self.sr = cfg['sr'] - self.n_fft = cfg['n_fft'] - self.num_mels = cfg['num_mels'] - self.win_length = cfg['win_length'] - self.hop_length = cfg['hop_length'] - self.preemphasis = cfg['preemphasis'] - self.fmin = cfg['fmin'] - self.fmax = cfg['fmax'] - - def __call__(self, metadatum): - """All the code for generating an Example from a metadatum. If you want a - different preprocessing pipeline, you can override this method. - This method may require several processor, each of which has a lot of options. - In this case, you'd better pass a composed transform and pass it to the init - method. 
- """ - fname, normalized_text, alignment = metadatum - - wav, _ = librosa.load(str(fname)) - spec = librosa.stft( - y=wav, - n_fft=self.n_fft, - win_length=self.win_length, - hop_length=self.hop_length) - mag = np.abs(spec) - mel = librosa.filters.mel(self.sr, - self.n_fft, - n_mels=self.num_mels, - fmin=self.fmin, - fmax=self.fmax) - mel = np.matmul(mel, mag) - mel = np.log(np.maximum(mel, 1e-5)) - phonemes = np.array( - g2p.en.text_to_sequence(normalized_text), dtype=np.int64) - return (mel, phonemes, alignment - ) # maybe we need to implement it as a map in the future - - -def batch_examples(batch): - texts = [] - mels = [] - text_lens = [] - pos_texts = [] - pos_mels = [] - alignments = [] - for data in batch: - mel, text, alignment = data - text_lens.append(len(text)) - pos_texts.append(np.arange(1, len(text) + 1)) - pos_mels.append(np.arange(1, mel.shape[1] + 1)) - mels.append(mel) - texts.append(text) - alignments.append(alignment) - - # Sort by text_len in descending order - texts = [ - i - for i, _ in sorted( - zip(texts, text_lens), key=lambda x: x[1], reverse=True) - ] - mels = [ - i - for i, _ in sorted( - zip(mels, text_lens), key=lambda x: x[1], reverse=True) - ] - pos_texts = [ - i - for i, _ in sorted( - zip(pos_texts, text_lens), key=lambda x: x[1], reverse=True) - ] - pos_mels = [ - i - for i, _ in sorted( - zip(pos_mels, text_lens), key=lambda x: x[1], reverse=True) - ] - alignments = [ - i - for i, _ in sorted( - zip(alignments, text_lens), key=lambda x: x[1], reverse=True) - ] - #text_lens = sorted(text_lens, reverse=True) - - # Pad sequence with largest len of the batch - texts = TextIDBatcher(pad_id=0)(texts) #(B, T) - pos_texts = TextIDBatcher(pad_id=0)(pos_texts) #(B,T) - pos_mels = TextIDBatcher(pad_id=0)(pos_mels) #(B,T) - alignments = TextIDBatcher(pad_id=0)(alignments).astype(np.float32) - mels = np.transpose( - SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1)) #(B,T,num_mels) - - return (texts, mels, pos_texts, pos_mels, alignments) diff --git a/examples/fastspeech/images/model_architecture.png b/examples/fastspeech/images/model_architecture.png deleted file mode 100644 index ad9fa55..0000000 Binary files a/examples/fastspeech/images/model_architecture.png and /dev/null differ diff --git a/examples/fastspeech/synthesis.py b/examples/fastspeech/synthesis.py deleted file mode 100644 index 9ff4ef7..0000000 --- a/examples/fastspeech/synthesis.py +++ /dev/null @@ -1,170 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import os -from visualdl import LogWriter -from scipy.io.wavfile import write -from collections import OrderedDict -import argparse -from pprint import pprint -from ruamel import yaml -from matplotlib import cm -import numpy as np -import paddle.fluid as fluid -import paddle.fluid.dygraph as dg -from parakeet.g2p.en import text_to_sequence -from parakeet import audio -from parakeet.models.fastspeech.fastspeech import FastSpeech -from parakeet.models.transformer_tts.utils import * -from parakeet.models.wavenet import WaveNet, UpsampleNet -from parakeet.models.clarinet import STFT, Clarinet, ParallelWaveNet -from parakeet.modules import weight_norm -from parakeet.models.waveflow import WaveFlowModule -from parakeet.utils.layer_tools import freeze -from parakeet.utils import io - - -def add_config_options_to_parser(parser): - parser.add_argument("--config", type=str, help="path of the config file") - parser.add_argument( - "--vocoder", - type=str, - default="griffin-lim", - choices=['griffin-lim', 'waveflow'], - help="vocoder method") - parser.add_argument( - "--config_vocoder", type=str, help="path of the vocoder config file") - parser.add_argument("--use_gpu", type=int, default=0, help="device to use") - parser.add_argument( - "--alpha", - type=float, - default=1, - help="determine the length of the expanded sequence mel, controlling the voice speed." - ) - - parser.add_argument( - "--checkpoint", type=str, help="fastspeech checkpoint for synthesis") - parser.add_argument( - "--checkpoint_vocoder", - type=str, - help="vocoder checkpoint for synthesis") - - parser.add_argument( - "--output", - type=str, - default="synthesis", - help="path to save experiment results") - - -def synthesis(text_input, args): - local_rank = dg.parallel.Env().local_rank - place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace()) - fluid.enable_dygraph(place) - - with open(args.config) as f: - cfg = yaml.load(f, Loader=yaml.Loader) - - # tensorboard - if not os.path.exists(args.output): - os.mkdir(args.output) - - writer = LogWriter(os.path.join(args.output, 'log')) - - model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels']) - # Load parameters. - global_step = io.load_parameters( - model=model, checkpoint_path=args.checkpoint) - model.eval() - - text = np.asarray(text_to_sequence(text_input)) - text = np.expand_dims(text, axis=0) - pos_text = np.arange(1, text.shape[1] + 1) - pos_text = np.expand_dims(pos_text, axis=0) - - text = dg.to_variable(text).astype(np.int64) - pos_text = dg.to_variable(pos_text).astype(np.int64) - - _, mel_output_postnet = model(text, pos_text, alpha=args.alpha) - - if args.vocoder == 'griffin-lim': - #synthesis use griffin-lim - wav = synthesis_with_griffinlim(mel_output_postnet, cfg['audio']) - elif args.vocoder == 'waveflow': - wav = synthesis_with_waveflow(mel_output_postnet, args, - args.checkpoint_vocoder, place) - else: - print( - 'vocoder error, we only support griffinlim and waveflow, but recevied %s.' 
- % args.vocoder) - - writer.add_audio(text_input + '(' + args.vocoder + ')', wav, 0, - cfg['audio']['sr']) - if not os.path.exists(os.path.join(args.output, 'samples')): - os.mkdir(os.path.join(args.output, 'samples')) - write( - os.path.join( - os.path.join(args.output, 'samples'), args.vocoder + '.wav'), - cfg['audio']['sr'], wav) - print("Synthesis completed !!!") - writer.close() - - -def synthesis_with_griffinlim(mel_output, cfg): - mel_output = fluid.layers.transpose( - fluid.layers.squeeze(mel_output, [0]), [1, 0]) - mel_output = np.exp(mel_output.numpy()) - basis = librosa.filters.mel(cfg['sr'], - cfg['n_fft'], - cfg['num_mels'], - fmin=cfg['fmin'], - fmax=cfg['fmax']) - inv_basis = np.linalg.pinv(basis) - spec = np.maximum(1e-10, np.dot(inv_basis, mel_output)) - - wav = librosa.core.griffinlim( - spec**cfg['power'], - hop_length=cfg['hop_length'], - win_length=cfg['win_length']) - - return wav - - -def synthesis_with_waveflow(mel_output, args, checkpoint, place): - - fluid.enable_dygraph(place) - args.config = args.config_vocoder - args.use_fp16 = False - config = io.add_yaml_config_to_args(args) - - mel_spectrogram = fluid.layers.transpose(mel_output, [0, 2, 1]) - - # Build model. - waveflow = WaveFlowModule(config) - io.load_parameters(model=waveflow, checkpoint_path=checkpoint) - for layer in waveflow.sublayers(): - if isinstance(layer, weight_norm.WeightNormWrapper): - layer.remove_weight_norm() - - # Run model inference. - wav = waveflow.synthesize(mel_spectrogram, sigma=config.sigma) - return wav.numpy()[0] - - -if __name__ == '__main__': - parser = argparse.ArgumentParser(description="Synthesis model") - add_config_options_to_parser(parser) - args = parser.parse_args() - pprint(vars(args)) - synthesis( - "Don't argue with the people of strong determination, because they may change the fact!", - args) diff --git a/examples/fastspeech/synthesis.sh b/examples/fastspeech/synthesis.sh deleted file mode 100644 index 1ebed1b..0000000 --- a/examples/fastspeech/synthesis.sh +++ /dev/null @@ -1,20 +0,0 @@ -# train model - -CUDA_VISIBLE_DEVICES=0 \ -python -u synthesis.py \ ---use_gpu=1 \ ---alpha=1.0 \ ---checkpoint='./fastspeech_ljspeech_ckpt_1.0/fastspeech/step-162000' \ ---config='fastspeech_ljspeech_ckpt_1.0/ljspeech.yaml' \ ---output='./synthesis' \ ---vocoder='waveflow' \ ---config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \ ---checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \ - - - -if [ $? -ne 0 ]; then - echo "Failed in synthesis!" - exit 1 -fi -exit 0 \ No newline at end of file diff --git a/examples/fastspeech/train.py b/examples/fastspeech/train.py deleted file mode 100644 index 389e0bf..0000000 --- a/examples/fastspeech/train.py +++ /dev/null @@ -1,166 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import numpy as np -import argparse -import os -import time -import math -from pathlib import Path -from pprint import pprint -from ruamel import yaml -from tqdm import tqdm -from matplotlib import cm -from collections import OrderedDict -from visualdl import LogWriter -import paddle.fluid.dygraph as dg -import paddle.fluid.layers as layers -import paddle.fluid as fluid -from parakeet.models.fastspeech.fastspeech import FastSpeech -from parakeet.models.fastspeech.utils import get_alignment -from data import LJSpeechLoader -from parakeet.utils import io - - -def add_config_options_to_parser(parser): - parser.add_argument("--config", type=str, help="path of the config file") - parser.add_argument("--use_gpu", type=int, default=0, help="device to use") - parser.add_argument("--data", type=str, help="path of LJspeech dataset") - parser.add_argument( - "--alignments_path", type=str, help="path of alignments") - - g = parser.add_mutually_exclusive_group() - g.add_argument("--checkpoint", type=str, help="checkpoint to resume from") - g.add_argument( - "--iteration", - type=int, - help="the iteration of the checkpoint to load from output directory") - - parser.add_argument( - "--output", - type=str, - default="experiment", - help="path to save experiment results") - - -def main(args): - local_rank = dg.parallel.Env().local_rank - nranks = dg.parallel.Env().nranks - parallel = nranks > 1 - - with open(args.config) as f: - cfg = yaml.load(f, Loader=yaml.Loader) - - global_step = 0 - place = fluid.CUDAPlace(dg.parallel.Env() - .dev_id) if args.use_gpu else fluid.CPUPlace() - fluid.enable_dygraph(place) - - if not os.path.exists(args.output): - os.mkdir(args.output) - - writer = LogWriter(os.path.join(args.output, - 'log')) if local_rank == 0 else None - - model = FastSpeech(cfg['network'], num_mels=cfg['audio']['num_mels']) - model.train() - optimizer = fluid.optimizer.AdamOptimizer( - learning_rate=dg.NoamDecay(1 / (cfg['train']['warm_up_step'] * - (cfg['train']['learning_rate']**2)), - cfg['train']['warm_up_step']), - parameter_list=model.parameters(), - grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][ - 'grad_clip_thresh'])) - reader = LJSpeechLoader( - cfg['audio'], - place, - args.data, - args.alignments_path, - cfg['train']['batch_size'], - nranks, - local_rank, - shuffle=True).reader - iterator = iter(tqdm(reader)) - - # Load parameters. 
- global_step = io.load_parameters( - model=model, - optimizer=optimizer, - checkpoint_dir=os.path.join(args.output, 'checkpoints'), - iteration=args.iteration, - checkpoint_path=args.checkpoint) - print("Rank {}: checkpoint loaded.".format(local_rank)) - - if parallel: - strategy = dg.parallel.prepare_context() - model = fluid.dygraph.parallel.DataParallel(model, strategy) - - while global_step <= cfg['train']['max_iteration']: - try: - batch = next(iterator) - except StopIteration as e: - iterator = iter(tqdm(reader)) - batch = next(iterator) - - (character, mel, pos_text, pos_mel, alignment) = batch - - global_step += 1 - - #Forward - result = model( - character, pos_text, mel_pos=pos_mel, length_target=alignment) - mel_output, mel_output_postnet, duration_predictor_output, _, _ = result - mel_loss = layers.mse_loss(mel_output, mel) - mel_postnet_loss = layers.mse_loss(mel_output_postnet, mel) - duration_loss = layers.mean( - layers.abs( - layers.elementwise_sub(duration_predictor_output, alignment))) - total_loss = mel_loss + mel_postnet_loss + duration_loss - - if local_rank == 0: - writer.add_scalar('mel_loss', mel_loss.numpy(), global_step) - writer.add_scalar('post_mel_loss', - mel_postnet_loss.numpy(), global_step) - writer.add_scalar('duration_loss', - duration_loss.numpy(), global_step) - writer.add_scalar('learning_rate', - optimizer._learning_rate.step().numpy(), - global_step) - - if parallel: - total_loss = model.scale_loss(total_loss) - total_loss.backward() - model.apply_collective_grads() - else: - total_loss.backward() - optimizer.minimize(total_loss) - model.clear_gradients() - - # save checkpoint - if local_rank == 0 and global_step % cfg['train'][ - 'checkpoint_interval'] == 0: - io.save_parameters( - os.path.join(args.output, 'checkpoints'), global_step, model, - optimizer) - - if local_rank == 0: - writer.close() - - -if __name__ == '__main__': - parser = argparse.ArgumentParser(description="Train Fastspeech model") - add_config_options_to_parser(parser) - args = parser.parse_args() - # Print the whole config setting. - pprint(vars(args)) - main(args) diff --git a/examples/fastspeech/train.sh b/examples/fastspeech/train.sh deleted file mode 100644 index 97d5516..0000000 --- a/examples/fastspeech/train.sh +++ /dev/null @@ -1,15 +0,0 @@ -# train model -export CUDA_VISIBLE_DEVICES=0 -python -u train.py \ ---use_gpu=1 \ ---data='../../dataset/LJSpeech-1.1' \ ---alignments_path='./alignments/alignments.pkl' \ ---output='./experiment' \ ---config='configs/ljspeech.yaml' \ -#--checkpoint='./checkpoint/fastspeech/step-120000' \ - -if [ $? -ne 0 ]; then - echo "Failed in training!" - exit 1 -fi -exit 0 \ No newline at end of file diff --git a/examples/transformer_tts/README.md b/examples/transformer_tts/README.md deleted file mode 100644 index f1e73fe..0000000 --- a/examples/transformer_tts/README.md +++ /dev/null @@ -1,112 +0,0 @@ -# TransformerTTS - -PaddlePaddle dynamic graph implementation of TransformerTTS, a neural TTS with Transformer. The implementation is based on [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895). - -## Dataset - -We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/). - -```bash -wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 -tar xjvf LJSpeech-1.1.tar.bz2 -``` - -## Model Architecture - -
-![TransformerTTS model architecture](./images/model_architecture.jpg)
- -The model adopts the multi-head attention mechanism to replace the RNN structures and also the original attention mechanism in [Tacotron2](https://arxiv.org/abs/1712.05884). The model consists of two main parts, encoder and decoder. We also implement the CBHG model of Tacotron as the vocoder part and convert the spectrogram into raw wave using Griffin-Lim algorithm. - -## Project Structure - -```text -├── config # yaml configuration files -├── data.py # dataset and dataloader settings for LJSpeech -├── synthesis.py # script to synthesize waveform from text -├── train_transformer.py # script for transformer model training -├── train_vocoder.py # script for vocoder model training -``` - -## Saving & Loading - -`train_transformer.py` and `train_vocoer.py` have 3 arguments in common, `--checkpoint`, `--iteration` and `--output`. - -1. `--output` is the directory for saving results. -During training, checkpoints are saved in `${output}/checkpoints` and tensorboard logs are saved in `${output}/log`. -During synthesis, results are saved in `${output}/samples` and tensorboard log is save in `${output}/log`. - -2. `--checkpoint` is the path of a checkpoint and `--iteration` is the target step. They are used to load checkpoints in the following way. - - - If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded. - - - If `--checkpoint` is not provided, we try to load the checkpoint of the target step specified by `--iteration` from the `${output}/checkpoints/` directory, e.g. if given `--iteration 120000`, the checkpoint `${output}/checkpoints/step-120000.*` will be load. - - - If both `--checkpoint` and `--iteration` are not provided, we try to load the latest checkpoint from `${output}/checkpoints/` directory. - -## Train Transformer - -TransformerTTS model can be trained by running ``train_transformer.py``. - -```bash -python train_transformer.py \ ---use_gpu=1 \ ---data=${DATAPATH} \ ---output=${OUTPUTPATH} \ ---config='configs/ljspeech.yaml' \ -``` - -Or you can run the script file directly. - -```bash -sh train_transformer.sh -``` - -If you want to train on multiple GPUs, you must start training in the following way. - -```bash -CUDA_VISIBLE_DEVICES=0,1,2,3 -python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train_transformer.py \ ---use_gpu=1 \ ---data=${DATAPATH} \ ---output=${OUTPUTPATH} \ ---config='configs/ljspeech.yaml' \ -``` - -If you wish to resume from an existing model, See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading. - -**Note: In order to ensure the training effect, we recommend using multi-GPU training to enlarge the batch size, and at least 16 samples in single batch per GPU.** - -For more help on arguments - -``python train_transformer.py --help``. - -## Synthesis - -After training the TransformerTTS, audio can be synthesized by running ``synthesis.py``. - -```bash -python synthesis.py \ ---use_gpu=0 \ ---output=${OUTPUTPATH} \ ---config='configs/ljspeech.yaml' \ ---checkpoint_transformer=${CHECKPOINTPATH} \ ---vocoder='griffin-lim' \ -``` - -We currently support two vocoders, Griffin-Lim algorithm and WaveFlow. You can set ``--vocoder`` to use one of them. If you want to use WaveFlow as your vocoder, you need to set ``--config_vocoder`` and ``--checkpoint_vocoder`` which are the path of the config and checkpoint of vocoder. You can download the pre-trained model of WaveFlow from [here](https://github.com/PaddlePaddle/Parakeet#vocoders). - -Or you can run the script file directly. 
- -```bash -sh synthesis.sh -``` -For more help on arguments - -``python synthesis.py --help``. - -Then you can find the synthesized audio files in ``${OUTPUTPATH}/samples``. diff --git a/examples/transformer_tts/configs/ljspeech.yaml b/examples/transformer_tts/configs/ljspeech.yaml deleted file mode 100644 index 963a230..0000000 --- a/examples/transformer_tts/configs/ljspeech.yaml +++ /dev/null @@ -1,38 +0,0 @@ -audio: - num_mels: 80 - n_fft: 1024 - sr: 22050 - preemphasis: 0.97 - hop_length: 256 - win_length: 1024 - power: 1.2 - fmin: 0 - fmax: 8000 - -network: - hidden_size: 256 - embedding_size: 512 - encoder_num_head: 4 - encoder_n_layers: 3 - decoder_num_head: 4 - decoder_n_layers: 3 - outputs_per_step: 1 - stop_loss_weight: 8 - -vocoder: - hidden_size: 256 - -train: - batch_size: 32 - learning_rate: 0.001 - warm_up_step: 4000 - grad_clip_thresh: 1.0 - - checkpoint_interval: 1000 - image_interval: 2000 - - max_iteration: 500000 - - - - \ No newline at end of file diff --git a/examples/transformer_tts/data.py b/examples/transformer_tts/data.py deleted file mode 100644 index acaad60..0000000 --- a/examples/transformer_tts/data.py +++ /dev/null @@ -1,219 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-from pathlib import Path -import numpy as np -import pandas as pd -import librosa -import csv - -from paddle import fluid -from parakeet import g2p -from parakeet.data.sampler import * -from parakeet.data.datacargo import DataCargo -from parakeet.data.batch import TextIDBatcher, SpecBatcher -from parakeet.data.dataset import DatasetMixin, TransformDataset, CacheDataset, SliceDataset -from parakeet.models.transformer_tts.utils import * - - -class LJSpeechLoader: - def __init__(self, - config, - place, - data_path, - batch_size, - nranks, - rank, - is_vocoder=False, - shuffle=True): - - LJSPEECH_ROOT = Path(data_path) - metadata = LJSpeechMetaData(LJSPEECH_ROOT) - transformer = LJSpeech(config) - dataset = TransformDataset(metadata, transformer) - dataset = CacheDataset(dataset) - - sampler = DistributedSampler( - len(dataset), nranks, rank, shuffle=shuffle) - - assert batch_size % nranks == 0 - each_bs = batch_size // nranks - if is_vocoder: - dataloader = DataCargo( - dataset, - sampler=sampler, - batch_size=each_bs, - shuffle=shuffle, - batch_fn=batch_examples_vocoder, - drop_last=True) - else: - dataloader = DataCargo( - dataset, - sampler=sampler, - batch_size=each_bs, - shuffle=shuffle, - batch_fn=batch_examples, - drop_last=True) - self.reader = fluid.io.DataLoader.from_generator( - capacity=32, - iterable=True, - use_double_buffer=True, - return_list=True) - self.reader.set_batch_generator(dataloader, place) - - -class LJSpeechMetaData(DatasetMixin): - def __init__(self, root): - self.root = Path(root) - self._wav_dir = self.root.joinpath("wavs") - csv_path = self.root.joinpath("metadata.csv") - self._table = pd.read_csv( - csv_path, - sep="|", - header=None, - quoting=csv.QUOTE_NONE, - names=["fname", "raw_text", "normalized_text"]) - - def get_example(self, i): - fname, raw_text, normalized_text = self._table.iloc[i] - fname = str(self._wav_dir.joinpath(fname + ".wav")) - return fname, raw_text, normalized_text - - def __len__(self): - return len(self._table) - - -class LJSpeech(object): - def __init__(self, config): - super(LJSpeech, self).__init__() - self.config = config - self.sr = config['sr'] - self.n_mels = config['num_mels'] - self.preemphasis = config['preemphasis'] - self.n_fft = config['n_fft'] - self.win_length = config['win_length'] - self.hop_length = config['hop_length'] - self.fmin = config['fmin'] - self.fmax = config['fmax'] - - def __call__(self, metadatum): - """All the code for generating an Example from a metadatum. If you want a - different preprocessing pipeline, you can override this method. - This method may require several processor, each of which has a lot of options. - In this case, you'd better pass a composed transform and pass it to the init - method. 
- """ - fname, raw_text, normalized_text = metadatum - - # load - wav, _ = librosa.load(str(fname)) - - spec = librosa.stft( - y=wav, - n_fft=self.n_fft, - win_length=self.win_length, - hop_length=self.hop_length) - mag = np.abs(spec) - mel = librosa.filters.mel(sr=self.sr, - n_fft=self.n_fft, - n_mels=self.n_mels, - fmin=self.fmin, - fmax=self.fmax) - mel = np.matmul(mel, mag) - mel = np.log(np.maximum(mel, 1e-5)) - - characters = np.array( - g2p.en.text_to_sequence(normalized_text), dtype=np.int64) - return (mag, mel, characters) - - -def batch_examples(batch): - texts = [] - mels = [] - mel_inputs = [] - text_lens = [] - pos_texts = [] - pos_mels = [] - stop_tokens = [] - for data in batch: - _, mel, text = data - mel_inputs.append( - np.concatenate( - [np.zeros([mel.shape[0], 1], np.float32), mel[:, :-1]], - axis=-1)) - text_lens.append(len(text)) - pos_texts.append(np.arange(1, len(text) + 1)) - pos_mels.append(np.arange(1, mel.shape[1] + 1)) - mels.append(mel) - texts.append(text) - stop_token = np.append(np.zeros([mel.shape[1] - 1], np.float32), 1.0) - stop_tokens.append(stop_token) - - # Sort by text_len in descending order - texts = [ - i - for i, _ in sorted( - zip(texts, text_lens), key=lambda x: x[1], reverse=True) - ] - mels = [ - i - for i, _ in sorted( - zip(mels, text_lens), key=lambda x: x[1], reverse=True) - ] - mel_inputs = [ - i - for i, _ in sorted( - zip(mel_inputs, text_lens), key=lambda x: x[1], reverse=True) - ] - pos_texts = [ - i - for i, _ in sorted( - zip(pos_texts, text_lens), key=lambda x: x[1], reverse=True) - ] - pos_mels = [ - i - for i, _ in sorted( - zip(pos_mels, text_lens), key=lambda x: x[1], reverse=True) - ] - stop_tokens = [ - i - for i, _ in sorted( - zip(stop_tokens, text_lens), key=lambda x: x[1], reverse=True) - ] - text_lens = sorted(text_lens, reverse=True) - - # Pad sequence with largest len of the batch - texts = TextIDBatcher(pad_id=0)(texts) #(B, T) - pos_texts = TextIDBatcher(pad_id=0)(pos_texts) #(B,T) - pos_mels = TextIDBatcher(pad_id=0)(pos_mels) #(B,T) - stop_tokens = TextIDBatcher(pad_id=1, dtype=np.float32)(pos_mels) - mels = np.transpose( - SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1)) #(B,T,num_mels) - mel_inputs = np.transpose( - SpecBatcher(pad_value=0.)(mel_inputs), axes=(0, 2, 1)) #(B,T,num_mels) - - return (texts, mels, mel_inputs, pos_texts, pos_mels, stop_tokens) - - -def batch_examples_vocoder(batch): - mels = [] - mags = [] - for data in batch: - mag, mel, _ = data - mels.append(mel) - mags.append(mag) - - mels = np.transpose(SpecBatcher(pad_value=0.)(mels), axes=(0, 2, 1)) - mags = np.transpose(SpecBatcher(pad_value=0.)(mags), axes=(0, 2, 1)) - - return (mels, mags) diff --git a/examples/transformer_tts/images/model_architecture.jpg b/examples/transformer_tts/images/model_architecture.jpg deleted file mode 100644 index 9c05b1c..0000000 Binary files a/examples/transformer_tts/images/model_architecture.jpg and /dev/null differ diff --git a/examples/transformer_tts/synthesis.py b/examples/transformer_tts/synthesis.py deleted file mode 100644 index 4297929..0000000 --- a/examples/transformer_tts/synthesis.py +++ /dev/null @@ -1,202 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import os -from scipy.io.wavfile import write -import numpy as np -from tqdm import tqdm -from matplotlib import cm -from visualdl import LogWriter -from ruamel import yaml -from pathlib import Path -import argparse -from pprint import pprint -import paddle.fluid as fluid -import paddle.fluid.dygraph as dg -from parakeet.g2p.en import text_to_sequence -from parakeet.models.transformer_tts.utils import * -from parakeet.models.transformer_tts import TransformerTTS -from parakeet.models.waveflow import WaveFlowModule -from parakeet.modules.weight_norm import WeightNormWrapper -from parakeet.utils import io - - -def add_config_options_to_parser(parser): - parser.add_argument("--config", type=str, help="path of the config file") - parser.add_argument("--use_gpu", type=int, default=0, help="device to use") - parser.add_argument( - "--stop_threshold", - type=float, - default=0.5, - help="The threshold of stop token which indicates the time step should stop generate spectrum or not." - ) - parser.add_argument( - "--max_len", - type=int, - default=1000, - help="The max length of spectrum when synthesize. If the length of synthetical spectrum is lager than max_len, spectrum will be cut off." - ) - - parser.add_argument( - "--checkpoint_transformer", - type=str, - help="transformer_tts checkpoint for synthesis") - parser.add_argument( - "--vocoder", - type=str, - default="griffin-lim", - choices=['griffin-lim', 'waveflow'], - help="vocoder method") - parser.add_argument( - "--config_vocoder", type=str, help="path of the vocoder config file") - parser.add_argument( - "--checkpoint_vocoder", - type=str, - help="vocoder checkpoint for synthesis") - - parser.add_argument( - "--output", - type=str, - default="synthesis", - help="path to save experiment results") - - -def synthesis(text_input, args): - local_rank = dg.parallel.Env().local_rank - place = (fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace()) - - with open(args.config) as f: - cfg = yaml.load(f, Loader=yaml.Loader) - - # tensorboard - if not os.path.exists(args.output): - os.mkdir(args.output) - - writer = LogWriter(os.path.join(args.output, 'log')) - - fluid.enable_dygraph(place) - with fluid.unique_name.guard(): - network_cfg = cfg['network'] - model = TransformerTTS( - network_cfg['embedding_size'], network_cfg['hidden_size'], - network_cfg['encoder_num_head'], network_cfg['encoder_n_layers'], - cfg['audio']['num_mels'], network_cfg['outputs_per_step'], - network_cfg['decoder_num_head'], network_cfg['decoder_n_layers']) - # Load parameters. 
- global_step = io.load_parameters( - model=model, checkpoint_path=args.checkpoint_transformer) - model.eval() - - # init input - text = np.asarray(text_to_sequence(text_input)) - text = fluid.layers.unsqueeze(dg.to_variable(text).astype(np.int64), [0]) - mel_input = dg.to_variable(np.zeros([1, 1, 80])).astype(np.float32) - pos_text = np.arange(1, text.shape[1] + 1) - pos_text = fluid.layers.unsqueeze( - dg.to_variable(pos_text).astype(np.int64), [0]) - - for i in range(args.max_len): - pos_mel = np.arange(1, mel_input.shape[1] + 1) - pos_mel = fluid.layers.unsqueeze( - dg.to_variable(pos_mel).astype(np.int64), [0]) - mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model( - text, mel_input, pos_text, pos_mel) - if stop_preds.numpy()[0, -1] > args.stop_threshold: - break - mel_input = fluid.layers.concat( - [mel_input, postnet_pred[:, -1:, :]], axis=1) - global_step = 0 - for i, prob in enumerate(attn_probs): - for j in range(4): - x = np.uint8(cm.viridis(prob.numpy()[j]) * 255) - writer.add_image( - 'Attention_%d_0' % global_step, - x, - i * 4 + j) - - if args.vocoder == 'griffin-lim': - #synthesis use griffin-lim - wav = synthesis_with_griffinlim(postnet_pred, cfg['audio']) - elif args.vocoder == 'waveflow': - # synthesis use waveflow - wav = synthesis_with_waveflow(postnet_pred, args, - args.checkpoint_vocoder, place) - else: - print( - 'vocoder error, we only support griffinlim and waveflow, but recevied %s.' - % args.vocoder) - - writer.add_audio(text_input + '(' + args.vocoder + ')', wav, 0, - cfg['audio']['sr']) - if not os.path.exists(os.path.join(args.output, 'samples')): - os.mkdir(os.path.join(args.output, 'samples')) - write( - os.path.join( - os.path.join(args.output, 'samples'), args.vocoder + '.wav'), - cfg['audio']['sr'], wav) - print("Synthesis completed !!!") - writer.close() - - -def synthesis_with_griffinlim(mel_output, cfg): - # synthesis with griffin-lim - mel_output = fluid.layers.transpose( - fluid.layers.squeeze(mel_output, [0]), [1, 0]) - mel_output = np.exp(mel_output.numpy()) - basis = librosa.filters.mel(cfg['sr'], - cfg['n_fft'], - cfg['num_mels'], - fmin=cfg['fmin'], - fmax=cfg['fmax']) - inv_basis = np.linalg.pinv(basis) - spec = np.maximum(1e-10, np.dot(inv_basis, mel_output)) - - wav = librosa.core.griffinlim( - spec**cfg['power'], - hop_length=cfg['hop_length'], - win_length=cfg['win_length']) - - return wav - - -def synthesis_with_waveflow(mel_output, args, checkpoint, place): - fluid.enable_dygraph(place) - args.config = args.config_vocoder - args.use_fp16 = False - config = io.add_yaml_config_to_args(args) - - mel_spectrogram = fluid.layers.transpose( - fluid.layers.squeeze(mel_output, [0]), [1, 0]) - mel_spectrogram = fluid.layers.unsqueeze(mel_spectrogram, [0]) - - # Build model. - waveflow = WaveFlowModule(config) - io.load_parameters(model=waveflow, checkpoint_path=checkpoint) - for layer in waveflow.sublayers(): - if isinstance(layer, WeightNormWrapper): - layer.remove_weight_norm() - - # Run model inference. - wav = waveflow.synthesize(mel_spectrogram, sigma=config.sigma) - return wav.numpy()[0] - - -if __name__ == '__main__': - parser = argparse.ArgumentParser(description="Synthesis model") - add_config_options_to_parser(parser) - args = parser.parse_args() - # Print the whole config setting. 
- pprint(vars(args)) - synthesis( - "Life was like a box of chocolates, you never know what you're gonna get.", - args) diff --git a/examples/transformer_tts/synthesis.sh b/examples/transformer_tts/synthesis.sh deleted file mode 100644 index be91cd4..0000000 --- a/examples/transformer_tts/synthesis.sh +++ /dev/null @@ -1,17 +0,0 @@ - -# train model -CUDA_VISIBLE_DEVICES=0 \ -python -u synthesis.py \ ---use_gpu=0 \ ---output='./synthesis' \ ---config='transformer_tts_ljspeech_ckpt_1.0/ljspeech.yaml' \ ---checkpoint_transformer='./transformer_tts_ljspeech_ckpt_1.0/step-120000' \ ---vocoder='waveflow' \ ---config_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/waveflow_ljspeech.yaml' \ ---checkpoint_vocoder='./waveflow_res128_ljspeech_ckpt_1.0/step-2000000' \ - -if [ $? -ne 0 ]; then - echo "Failed in training!" - exit 1 -fi -exit 0 diff --git a/examples/transformer_tts/train_transformer.py b/examples/transformer_tts/train_transformer.py deleted file mode 100644 index 3499a5f..0000000 --- a/examples/transformer_tts/train_transformer.py +++ /dev/null @@ -1,219 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import os -from tqdm import tqdm -from visualdl import LogWriter -from collections import OrderedDict -import argparse -from pprint import pprint -from ruamel import yaml -from matplotlib import cm -import numpy as np -import paddle.fluid as fluid -import paddle.fluid.dygraph as dg -import paddle.fluid.layers as layers -from parakeet.models.transformer_tts.utils import cross_entropy -from data import LJSpeechLoader -from parakeet.models.transformer_tts import TransformerTTS -from parakeet.utils import io - - -def add_config_options_to_parser(parser): - parser.add_argument("--config", type=str, help="path of the config file") - parser.add_argument("--use_gpu", type=int, default=0, help="device to use") - parser.add_argument("--data", type=str, help="path of LJspeech dataset") - - g = parser.add_mutually_exclusive_group() - g.add_argument("--checkpoint", type=str, help="checkpoint to resume from") - g.add_argument( - "--iteration", - type=int, - help="the iteration of the checkpoint to load from output directory") - - parser.add_argument( - "--output", - type=str, - default="experiment", - help="path to save experiment results") - - -def main(args): - local_rank = dg.parallel.Env().local_rank - nranks = dg.parallel.Env().nranks - parallel = nranks > 1 - - with open(args.config) as f: - cfg = yaml.load(f, Loader=yaml.Loader) - - global_step = 0 - place = fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace() - - if not os.path.exists(args.output): - os.mkdir(args.output) - - writer = LogWriter(os.path.join(args.output, - 'log')) if local_rank == 0 else None - - fluid.enable_dygraph(place) - network_cfg = cfg['network'] - model = TransformerTTS( - network_cfg['embedding_size'], network_cfg['hidden_size'], - network_cfg['encoder_num_head'], network_cfg['encoder_n_layers'], - cfg['audio']['num_mels'], 
network_cfg['outputs_per_step'], - network_cfg['decoder_num_head'], network_cfg['decoder_n_layers']) - - model.train() - optimizer = fluid.optimizer.AdamOptimizer( - learning_rate=dg.NoamDecay(1 / (cfg['train']['warm_up_step'] * - (cfg['train']['learning_rate']**2)), - cfg['train']['warm_up_step']), - parameter_list=model.parameters(), - grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][ - 'grad_clip_thresh'])) - - # Load parameters. - global_step = io.load_parameters( - model=model, - optimizer=optimizer, - checkpoint_dir=os.path.join(args.output, 'checkpoints'), - iteration=args.iteration, - checkpoint_path=args.checkpoint) - print("Rank {}: checkpoint loaded.".format(local_rank)) - - if parallel: - strategy = dg.parallel.prepare_context() - model = fluid.dygraph.parallel.DataParallel(model, strategy) - - reader = LJSpeechLoader( - cfg['audio'], - place, - args.data, - cfg['train']['batch_size'], - nranks, - local_rank, - shuffle=True).reader - - iterator = iter(tqdm(reader)) - - global_step += 1 - - while global_step <= cfg['train']['max_iteration']: - try: - batch = next(iterator) - except StopIteration as e: - iterator = iter(tqdm(reader)) - batch = next(iterator) - - character, mel, mel_input, pos_text, pos_mel, stop_tokens = batch - - mel_pred, postnet_pred, attn_probs, stop_preds, attn_enc, attn_dec = model( - character, mel_input, pos_text, pos_mel) - - mel_loss = layers.mean( - layers.abs(layers.elementwise_sub(mel_pred, mel))) - post_mel_loss = layers.mean( - layers.abs(layers.elementwise_sub(postnet_pred, mel))) - loss = mel_loss + post_mel_loss - - stop_loss = cross_entropy( - stop_preds, stop_tokens, weight=cfg['network']['stop_loss_weight']) - loss = loss + stop_loss - - if local_rank == 0: - writer.add_scalar('training_loss/mel_loss', - mel_loss.numpy(), - global_step) - writer.add_scalar('training_loss/post_mel_loss', - post_mel_loss.numpy(), - global_step) - writer.add_scalar('stop_loss', stop_loss.numpy(), global_step) - - if parallel: - writer.add_scalar('alphas/encoder_alpha', - model._layers.encoder.alpha.numpy(), - global_step) - writer.add_scalar('alphas/decoder_alpha', - model._layers.decoder.alpha.numpy(), - global_step) - else: - writer.add_scalar('alphas/encoder_alpha', - model.encoder.alpha.numpy(), - global_step) - writer.add_scalar('alphas/decoder_alpha', - model.decoder.alpha.numpy(), - global_step) - - writer.add_scalar('learning_rate', - optimizer._learning_rate.step().numpy(), - global_step) - - if global_step % cfg['train']['image_interval'] == 1: - for i, prob in enumerate(attn_probs): - for j in range(cfg['network']['decoder_num_head']): - x = np.uint8( - cm.viridis(prob.numpy()[j * cfg['train'][ - 'batch_size'] // nranks]) * 255) - writer.add_image( - 'Attention_%d_0' % global_step, - x, - i * 4 + j) - - for i, prob in enumerate(attn_enc): - for j in range(cfg['network']['encoder_num_head']): - x = np.uint8( - cm.viridis(prob.numpy()[j * cfg['train'][ - 'batch_size'] // nranks]) * 255) - writer.add_image( - 'Attention_enc_%d_0' % global_step, - x, - i * 4 + j) - - for i, prob in enumerate(attn_dec): - for j in range(cfg['network']['decoder_num_head']): - x = np.uint8( - cm.viridis(prob.numpy()[j * cfg['train'][ - 'batch_size'] // nranks]) * 255) - writer.add_image( - 'Attention_dec_%d_0' % global_step, - x, - i * 4 + j) - - if parallel: - loss = model.scale_loss(loss) - loss.backward() - model.apply_collective_grads() - else: - loss.backward() - optimizer.minimize(loss) - model.clear_gradients() - - # save checkpoint - if local_rank == 0 and 
global_step % cfg['train'][ - 'checkpoint_interval'] == 0: - io.save_parameters( - os.path.join(args.output, 'checkpoints'), global_step, model, - optimizer) - global_step += 1 - - if local_rank == 0: - writer.close() - - -if __name__ == '__main__': - parser = argparse.ArgumentParser(description="Train TransformerTTS model") - add_config_options_to_parser(parser) - args = parser.parse_args() - # Print the whole config setting. - pprint(vars(args)) - main(args) diff --git a/examples/transformer_tts/train_transformer.sh b/examples/transformer_tts/train_transformer.sh deleted file mode 100644 index 7c910c7..0000000 --- a/examples/transformer_tts/train_transformer.sh +++ /dev/null @@ -1,15 +0,0 @@ - -# train model -export CUDA_VISIBLE_DEVICES=0 -python -u train_transformer.py \ ---use_gpu=1 \ ---data='../../dataset/LJSpeech-1.1' \ ---output='./experiment' \ ---config='configs/ljspeech.yaml' \ -#--checkpoint='./checkpoint/transformer/step-120000' \ - -if [ $? -ne 0 ]; then - echo "Failed in training!" - exit 1 -fi -exit 0 diff --git a/examples/transformer_tts/train_vocoder.py b/examples/transformer_tts/train_vocoder.py deleted file mode 100644 index ccea796..0000000 --- a/examples/transformer_tts/train_vocoder.py +++ /dev/null @@ -1,144 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
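The FastSpeech and TransformerTTS training scripts above (and the vocoder trainer below) all build their optimizer with `dg.NoamDecay(1 / (warm_up_step * learning_rate ** 2), warm_up_step)`. Assuming Paddle's `NoamDecay` follows the standard Transformer schedule `d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)`, that seemingly odd first argument makes the effective rate ramp up linearly, peak at exactly the configured `learning_rate` once warm-up ends, and then decay as `1/sqrt(step)`. A minimal sketch under that assumption:

```python
# Sketch of the warm-up schedule, assuming Paddle's NoamDecay implements the
# standard Transformer formula: d_model**-0.5 * min(step**-0.5, step * warmup**-1.5).
def noam_lr(step, learning_rate=0.001, warm_up_step=4000):
    d_model = 1.0 / (warm_up_step * learning_rate ** 2)
    return d_model ** -0.5 * min(step ** -0.5, step * warm_up_step ** -1.5)

# With the ljspeech.yaml defaults (learning_rate 0.001, warm_up_step 4000),
# the rate peaks at the configured 0.001 exactly when warm-up ends.
assert abs(noam_lr(4000) - 0.001) < 1e-6
```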
-from visualdl import LogWriter -import os -from tqdm import tqdm -from pathlib import Path -from collections import OrderedDict -import argparse -from ruamel import yaml -from pprint import pprint -import paddle.fluid as fluid -import paddle.fluid.dygraph as dg -import paddle.fluid.layers as layers -from data import LJSpeechLoader -from parakeet.models.transformer_tts import Vocoder -from parakeet.utils import io - - -def add_config_options_to_parser(parser): - parser.add_argument("--config", type=str, help="path of the config file") - parser.add_argument("--use_gpu", type=int, default=0, help="device to use") - parser.add_argument("--data", type=str, help="path of LJspeech dataset") - - g = parser.add_mutually_exclusive_group() - g.add_argument("--checkpoint", type=str, help="checkpoint to resume from") - g.add_argument( - "--iteration", - type=int, - help="the iteration of the checkpoint to load from output directory") - - parser.add_argument( - "--output", - type=str, - default="vocoder", - help="path to save experiment results") - - -def main(args): - local_rank = dg.parallel.Env().local_rank - nranks = dg.parallel.Env().nranks - parallel = nranks > 1 - - with open(args.config) as f: - cfg = yaml.load(f, Loader=yaml.Loader) - - global_step = 0 - place = fluid.CUDAPlace(local_rank) if args.use_gpu else fluid.CPUPlace() - - if not os.path.exists(args.output): - os.mkdir(args.output) - - writer = LogWriter(os.path.join(args.output, - 'log')) if local_rank == 0 else None - - fluid.enable_dygraph(place) - model = Vocoder(cfg['train']['batch_size'], cfg['vocoder']['hidden_size'], - cfg['audio']['num_mels'], cfg['audio']['n_fft']) - - model.train() - optimizer = fluid.optimizer.AdamOptimizer( - learning_rate=dg.NoamDecay(1 / (cfg['train']['warm_up_step'] * - (cfg['train']['learning_rate']**2)), - cfg['train']['warm_up_step']), - parameter_list=model.parameters(), - grad_clip=fluid.clip.GradientClipByGlobalNorm(cfg['train'][ - 'grad_clip_thresh'])) - - # Load parameters. 
- global_step = io.load_parameters( - model=model, - optimizer=optimizer, - checkpoint_dir=os.path.join(args.output, 'checkpoints'), - iteration=args.iteration, - checkpoint_path=args.checkpoint) - print("Rank {}: checkpoint loaded.".format(local_rank)) - - if parallel: - strategy = dg.parallel.prepare_context() - model = fluid.dygraph.parallel.DataParallel(model, strategy) - - reader = LJSpeechLoader( - cfg['audio'], - place, - args.data, - cfg['train']['batch_size'], - nranks, - local_rank, - is_vocoder=True).reader() - - for epoch in range(cfg['train']['max_iteration']): - pbar = tqdm(reader) - for i, data in enumerate(pbar): - pbar.set_description('Processing at epoch %d' % epoch) - mel, mag = data - mag = dg.to_variable(mag.numpy()) - mel = dg.to_variable(mel.numpy()) - global_step += 1 - - mag_pred = model(mel) - loss = layers.mean( - layers.abs(layers.elementwise_sub(mag_pred, mag))) - - if parallel: - loss = model.scale_loss(loss) - loss.backward() - model.apply_collective_grads() - else: - loss.backward() - optimizer.minimize(loss) - model.clear_gradients() - - if local_rank == 0: - writer.add_scalar('training_loss/loss', loss.numpy(), - global_step) - - # save checkpoint - if local_rank == 0 and global_step % cfg['train'][ - 'checkpoint_interval'] == 0: - io.save_parameters( - os.path.join(args.output, 'checkpoints'), global_step, - model, optimizer) - - if local_rank == 0: - writer.close() - - -if __name__ == '__main__': - parser = argparse.ArgumentParser(description="Train vocoder model") - add_config_options_to_parser(parser) - args = parser.parse_args() - # Print the whole config setting. - pprint(args) - main(args) diff --git a/examples/transformer_tts/train_vocoder.sh b/examples/transformer_tts/train_vocoder.sh deleted file mode 100644 index 5d0d845..0000000 --- a/examples/transformer_tts/train_vocoder.sh +++ /dev/null @@ -1,16 +0,0 @@ - -# train model -CUDA_VISIBLE_DEVICES=0 \ -python -u train_vocoder.py \ ---use_gpu=1 \ ---data='../../dataset/LJSpeech-1.1' \ ---output='./vocoder' \ ---config='configs/ljspeech.yaml' \ -#--checkpoint='./checkpoint/vocoder/step-100000' \ - - -if [ $? -ne 0 ]; then - echo "Failed in training!" - exit 1 -fi -exit 0 diff --git a/examples/waveflow/README.md b/examples/waveflow/README.md deleted file mode 100644 index 16364f6..0000000 --- a/examples/waveflow/README.md +++ /dev/null @@ -1,122 +0,0 @@ -# WaveFlow - -PaddlePaddle dynamic graph implementation of [WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219). - -- WaveFlow can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on a Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and serveral orders of magnitude faster than WaveNet. -- WaveFlow is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smalller than WaveGlow (87.9M). -- WaveFlow is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development. 
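As a rough illustration of what "trained directly with maximum likelihood" means for a flow-based vocoder: the loss is simply the negative log-likelihood of the audio under the flow, i.e. the prior log-density of the latent plus the log-determinant of the transform's Jacobian. The sketch below is generic; the names `z`, `log_det_jacobian`, and `sigma` are illustrative, not the repo's API (though the `sigma` key in the yaml config plays the role of this prior standard deviation and is also passed to `synthesize` at inference).

```python
import numpy as np

def flow_negative_log_likelihood(z, log_det_jacobian, sigma=1.0):
    """Generic NLL for a normalizing flow with an isotropic Gaussian prior N(0, sigma^2 I).

    z: latent obtained by pushing the (mel-conditioned) audio through the flow, shape [n].
    log_det_jacobian: summed log|det J| of that transform (a scalar).
    """
    n = z.size
    prior_log_prob = (-0.5 * np.sum(z ** 2) / sigma ** 2
                      - n * np.log(sigma * np.sqrt(2.0 * np.pi)))
    return -(prior_log_prob + log_det_jacobian)
```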
- -## Project Structure -```text -├── configs # yaml configuration files of preset model hyperparameters -├── benchmark.py # benchmark code to test the speed of batched speech synthesis -├── synthesis.py # script for speech synthesis -├── train.py # script for model training -├── utils.py # helper functions for e.g., model checkpointing -├── data.py # dataset and dataloader settings for LJSpeech -├── waveflow.py # WaveFlow model high level APIs -└── parakeet/models/waveflow/waveflow_modules.py # WaveFlow model implementation -``` - -## Usage - -There are many hyperparameters to be tuned depending on the specification of model and dataset you are working on. -We provide `wavenet_ljspeech.yaml` as a hyperparameter set that works well on the LJSpeech dataset. -Note that we use [convolutional queue](https://arxiv.org/abs/1611.09482) at audio synthesis to cache the intermediate hidden states, which will speed up the autoregressive inference over the height dimension. Current implementation only supports height dimension equals 8 or 16, i.e., where there is no dilation on the height dimension. Therefore, you can only set value of `n_group` key in the yaml config file to be either 8 or 16. - -Also note that `train.py`, `synthesis.py`, and `benchmark.py` all accept a `--config` parameter. To ensure consistency, you should use the same config yaml file for both training, synthesizing and benchmarking. You can also overwrite these preset hyperparameters with command line by updating parameters after `--config`. -For example `--config=${yaml} --batch_size=8` can overwrite the corresponding hyperparameters in the `${yaml}` config file. For more details about these hyperparameters, check `utils.add_config_options_to_parser`. - -Additionally, you need to specify some additional parameters for `train.py`, `synthesis.py`, and `benchmark.py`, and the details can be found in `train.add_options_to_parser`, `synthesis.add_options_to_parser`, and `benchmark.add_options_to_parser`, respectively. - -### Dataset - -Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/). - -```bash -wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 -tar xjvf LJSpeech-1.1.tar.bz2 -``` - -In this example, assume that the path of unzipped LJSpeech dataset is `./data/LJSpeech-1.1`. - -### Train on single GPU - -```bash -export CUDA_VISIBLE_DEVICES=0 -python -u train.py \ - --config=./configs/waveflow_ljspeech.yaml \ - --root=./data/LJSpeech-1.1 \ - --name=${ModelName} --batch_size=4 \ - --use_gpu=true -``` - -#### Save and Load checkpoints - -Our model will save model parameters as checkpoints in `./runs/waveflow/${ModelName}/checkpoint/` every 10000 iterations by default, where `${ModelName}` is the model name for one single experiment and it could be whatever you like. -The saved checkpoint will have the format of `step-${iteration_number}.pdparams` for model parameters and `step-${iteration_number}.pdopt` for optimizer parameters. - -There are three ways to load a checkpoint and resume training (take an example that you want to load a 500000-iteration checkpoint): -1. Use `--checkpoint=./runs/waveflow/${ModelName}/checkpoint/step-500000` to provide a specific path to load. Note that you only need to provide the base name of the parameter file, which is `step-500000`, no extension name `.pdparams` or `.pdopt` is needed. -2. Use `--iteration=500000`. -3. 
If you don't specify either `--checkpoint` or `--iteration`, the model will automatically load the latest checkpoint in `./runs/waveflow/${ModelName}/checkpoint`. - -### Train on multiple GPUs - -```bash -export CUDA_VISIBLE_DEVICES=0,1,2,3 -python -u -m paddle.distributed.launch train.py \ - --config=./configs/waveflow_ljspeech.yaml \ - --root=./data/LJSpeech-1.1 \ - --name=${ModelName} --use_gpu=true -``` - -Use `export CUDA_VISIBLE_DEVICES=0,1,2,3` to set the GPUs that you want to use to be visible. Then the `paddle.distributed.launch` module will use these visible GPUs to do data parallel training in multiprocessing mode. - -### Monitor with Tensorboard - -By default, the logs are saved in `./runs/waveflow/${ModelName}/logs/`. You can monitor logs using TensorBoard. - -```bash -tensorboard --logdir=${log_dir} --port=8888 -``` - -### Synthesize from a checkpoint - -Check the [Save and load checkpoint](#save-and-load-checkpoints) section on how to load a specific checkpoint. -The following example will automatically load the latest checkpoint: - -```bash -export CUDA_VISIBLE_DEVICES=0 -python -u synthesis.py \ - --config=./configs/waveflow_ljspeech.yaml \ - --root=./data/LJSpeech-1.1 \ - --name=${ModelName} --use_gpu=true \ - --output=./syn_audios \ - --sample=${SAMPLE} \ - --sigma=1.0 -``` - -In this example, `--output` specifies where to save the synthesized audios and `--sample` (<16) specifies which sample in the valid dataset (a split from the whole LJSpeech dataset, by default contains the first 16 audio samples) to synthesize based on the mel-spectrograms computed from the ground truth sample audio, e.g., `--sample=0` means to synthesize the first audio in the valid dataset. - -### Benchmarking - -Use the following example to benchmark the speed of batched speech synthesis, which reports how many times faster than real-time: - -```bash -export CUDA_VISIBLE_DEVICES=0 -python -u benchmark.py \ - --config=./configs/waveflow_ljspeech.yaml \ - --root=./data/LJSpeech-1.1 \ - --name=${ModelName} --use_gpu=true -``` - -### Low-precision inference - -This model supports the float16 low-precision inference. By appending the argument - -```bash - --use_fp16=true -``` - -to the command of synthesis and benchmarking, one can experience the fast speed of low-precision inference. diff --git a/examples/waveflow/benchmark.py b/examples/waveflow/benchmark.py deleted file mode 100644 index 222e732..0000000 --- a/examples/waveflow/benchmark.py +++ /dev/null @@ -1,103 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
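The three checkpoint-loading rules described in the README above (an explicit `--checkpoint` path, an `--iteration` step number, or otherwise the latest checkpoint in the run directory) amount to a small resolution function. A minimal sketch, assuming the `step-${iteration_number}.pdparams` naming shown above; the helper name and signature are illustrative, not the repo's `io` API:

```python
import glob
import os

def resolve_checkpoint(checkpoint_dir, checkpoint=None, iteration=None):
    """Return the base path (without .pdparams/.pdopt) of the checkpoint to load."""
    if checkpoint is not None:
        # 1) an explicit --checkpoint path wins
        return checkpoint
    if iteration is not None:
        # 2) a specific --iteration inside the run's checkpoint directory
        return os.path.join(checkpoint_dir, "step-{}".format(iteration))
    # 3) otherwise pick the checkpoint with the highest step number
    candidates = glob.glob(os.path.join(checkpoint_dir, "step-*.pdparams"))
    if not candidates:
        return None
    latest = max(candidates,
                 key=lambda p: int(os.path.basename(p)[len("step-"):-len(".pdparams")]))
    return latest[:-len(".pdparams")]
```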
- -import os -import random -from pprint import pprint - -import argparse -import numpy as np -import paddle.fluid.dygraph as dg -from paddle import fluid - -import utils -from parakeet.utils import io -from waveflow import WaveFlow - - -def add_options_to_parser(parser): - parser.add_argument( - '--model', - type=str, - default='waveflow', - help="general name of the model") - parser.add_argument( - '--name', type=str, help="specific name of the training model") - parser.add_argument( - '--root', type=str, help="root path of the LJSpeech dataset") - - parser.add_argument( - '--use_gpu', - type=utils.str2bool, - default=True, - help="option to use gpu training") - parser.add_argument( - '--use_fp16', - type=utils.str2bool, - default=True, - help="option to use fp16 for inference") - - parser.add_argument( - '--iteration', - type=int, - default=None, - help=("which iteration of checkpoint to load, " - "default to load the latest checkpoint")) - parser.add_argument( - '--checkpoint', - type=str, - default=None, - help="path of the checkpoint to load") - - -def benchmark(config): - pprint(vars(config)) - - # Get checkpoint directory path. - run_dir = os.path.join("runs", config.model, config.name) - checkpoint_dir = os.path.join(run_dir, "checkpoint") - - # Configurate device. - place = fluid.CUDAPlace(0) if config.use_gpu else fluid.CPUPlace() - - with dg.guard(place): - # Fix random seed. - seed = config.seed - random.seed(seed) - np.random.seed(seed) - fluid.default_startup_program().random_seed = seed - fluid.default_main_program().random_seed = seed - print("Random Seed: ", seed) - - # Build model. - model = WaveFlow(config, checkpoint_dir) - model.build(training=False) - - # Run model inference. - model.benchmark() - - -if __name__ == "__main__": - # Create parser. - parser = argparse.ArgumentParser( - description="Synthesize audio using WaveNet model") - add_options_to_parser(parser) - utils.add_config_options_to_parser(parser) - - # Parse argument from both command line and yaml config file. - # For conflicting updates to the same field, - # the preceding update will be overwritten by the following one. - config = parser.parse_args() - config = io.add_yaml_config_to_args(config) - benchmark(config) diff --git a/examples/waveflow/configs/waveflow_ljspeech.yaml b/examples/waveflow/configs/waveflow_ljspeech.yaml deleted file mode 100644 index d3548c4..0000000 --- a/examples/waveflow/configs/waveflow_ljspeech.yaml +++ /dev/null @@ -1,24 +0,0 @@ -valid_size: 16 -segment_length: 16000 -sample_rate: 22050 -fft_window_shift: 256 -fft_window_size: 1024 -fft_size: 1024 -mel_bands: 80 -mel_fmin: 0.0 -mel_fmax: 8000.0 - -seed: 1234 -learning_rate: 0.0002 -batch_size: 8 -test_every: 2000 -save_every: 10000 -max_iterations: 3000000 - -sigma: 1.0 -n_flows: 8 -n_group: 16 -n_layers: 8 -n_channels: 64 -kernel_h: 3 -kernel_w: 3 diff --git a/examples/waveflow/data.py b/examples/waveflow/data.py deleted file mode 100644 index 75d09b7..0000000 --- a/examples/waveflow/data.py +++ /dev/null @@ -1,144 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import random - -import librosa -import numpy as np -from paddle import fluid - -from parakeet.datasets import ljspeech -from parakeet.data import SpecBatcher, WavBatcher -from parakeet.data import DataCargo, DatasetMixin -from parakeet.data import DistributedSampler, BatchSampler -from scipy.io.wavfile import read - - -class Dataset(ljspeech.LJSpeech): - def __init__(self, config): - super(Dataset, self).__init__(config.root) - self.config = config - - def _get_example(self, metadatum): - fname, _, _ = metadatum - wav_path = os.path.join(self.root, "wavs", fname + ".wav") - - audio, loaded_sr = librosa.load(wav_path, sr=self.config.sample_rate) - - return audio - - -class Subset(DatasetMixin): - def __init__(self, dataset, indices, valid): - self.dataset = dataset - self.indices = indices - self.valid = valid - self.config = dataset.config - - def get_mel(self, audio): - spectrogram = librosa.core.stft( - audio, - n_fft=self.config.fft_size, - hop_length=self.config.fft_window_shift, - win_length=self.config.fft_window_size) - spectrogram_magnitude = np.abs(spectrogram) - - # mel_filter_bank shape: [n_mels, 1 + n_fft/2] - mel_filter_bank = librosa.filters.mel(sr=self.config.sample_rate, - n_fft=self.config.fft_size, - n_mels=self.config.mel_bands, - fmin=self.config.mel_fmin, - fmax=self.config.mel_fmax) - # mel shape: [n_mels, num_frames] - mel = np.dot(mel_filter_bank, spectrogram_magnitude) - - # Normalize mel. - clip_val = 1e-5 - ref_constant = 1 - mel = np.log(np.clip(mel, a_min=clip_val, a_max=None) * ref_constant) - - return mel - - def __getitem__(self, idx): - audio = self.dataset[self.indices[idx]] - segment_length = self.config.segment_length - - if self.valid: - # whole audio for valid set - pass - else: - # Randomly crop segment_length from audios in the training set. - # audio shape: [len] - if audio.shape[0] >= segment_length: - max_audio_start = audio.shape[0] - segment_length - audio_start = random.randint(0, max_audio_start) - audio = audio[audio_start:(audio_start + segment_length)] - else: - audio = np.pad(audio, (0, segment_length - audio.shape[0]), - mode='constant', - constant_values=0) - - mel = self.get_mel(audio) - - return audio, mel - - def _batch_examples(self, batch): - audios = [sample[0] for sample in batch] - mels = [sample[1] for sample in batch] - - audios = WavBatcher(pad_value=0.0)(audios) - mels = SpecBatcher(pad_value=0.0)(mels) - - return audios, mels - - def __len__(self): - return len(self.indices) - - -class LJSpeech: - def __init__(self, config, nranks, rank): - place = fluid.CUDAPlace(rank) if config.use_gpu else fluid.CPUPlace() - - # Whole LJSpeech dataset. - ds = Dataset(config) - - # Split into train and valid dataset. - indices = list(range(len(ds))) - train_indices = indices[config.valid_size:] - valid_indices = indices[:config.valid_size] - random.shuffle(train_indices) - - # Train dataset. 
- trainset = Subset(ds, train_indices, valid=False) - sampler = DistributedSampler(len(trainset), nranks, rank) - total_bs = config.batch_size - assert total_bs % nranks == 0 - train_sampler = BatchSampler( - sampler, total_bs // nranks, drop_last=True) - trainloader = DataCargo(trainset, batch_sampler=train_sampler) - - trainreader = fluid.io.PyReader(capacity=50, return_list=True) - trainreader.decorate_batch_generator(trainloader, place) - self.trainloader = (data for _ in iter(int, 1) - for data in trainreader()) - - # Valid dataset. - validset = Subset(ds, valid_indices, valid=True) - # Currently only support batch_size = 1 for valid loader. - validloader = DataCargo(validset, batch_size=1, shuffle=False) - - validreader = fluid.io.PyReader(capacity=20, return_list=True) - validreader.decorate_batch_generator(validloader, place) - self.validloader = validreader diff --git a/examples/waveflow/synthesis.py b/examples/waveflow/synthesis.py deleted file mode 100644 index b9569bf..0000000 --- a/examples/waveflow/synthesis.py +++ /dev/null @@ -1,113 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import random -from pprint import pprint - -import argparse -import numpy as np -import paddle.fluid.dygraph as dg -from paddle import fluid - -from parakeet.utils import io -import utils -from waveflow import WaveFlow - - -def add_options_to_parser(parser): - parser.add_argument( - '--model', - type=str, - default='waveflow', - help="general name of the model") - parser.add_argument( - '--name', type=str, help="specific name of the training model") - parser.add_argument( - '--root', type=str, help="root path of the LJSpeech dataset") - - parser.add_argument( - '--use_gpu', - type=utils.str2bool, - default=True, - help="option to use gpu training") - parser.add_argument( - '--use_fp16', - type=utils.str2bool, - default=True, - help="option to use fp16 for inference") - - parser.add_argument( - '--iteration', - type=int, - default=None, - help=("which iteration of checkpoint to load, " - "default to load the latest checkpoint")) - parser.add_argument( - '--checkpoint', - type=str, - default=None, - help="path of the checkpoint to load") - - parser.add_argument( - '--output', - type=str, - default="./syn_audios", - help="path to write synthesized audio files") - parser.add_argument( - '--sample', - type=int, - default=None, - help="which of the valid samples to synthesize audio") - - -def synthesize(config): - pprint(vars(config)) - - # Get checkpoint directory path. - run_dir = os.path.join("runs", config.model, config.name) - checkpoint_dir = os.path.join(run_dir, "checkpoint") - - # Configurate device. - place = fluid.CUDAPlace(0) if config.use_gpu else fluid.CPUPlace() - - with dg.guard(place): - # Fix random seed. 
- seed = config.seed - random.seed(seed) - np.random.seed(seed) - fluid.default_startup_program().random_seed = seed - fluid.default_main_program().random_seed = seed - print("Random Seed: ", seed) - - # Build model. - model = WaveFlow(config, checkpoint_dir) - iteration = model.build(training=False) - # Run model inference. - model.infer(iteration) - - -if __name__ == "__main__": - # Create parser. - parser = argparse.ArgumentParser( - description="Synthesize audio using WaveNet model") - add_options_to_parser(parser) - utils.add_config_options_to_parser(parser) - - # Parse argument from both command line and yaml config file. - # For conflicting updates to the same field, - # the preceding update will be overwritten by the following one. - config = parser.parse_args() - config = io.add_yaml_config_to_args(config) - synthesize(config) diff --git a/examples/waveflow/train.py b/examples/waveflow/train.py deleted file mode 100644 index dd3e7b7..0000000 --- a/examples/waveflow/train.py +++ /dev/null @@ -1,134 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import random -import subprocess -import time -from pprint import pprint - -import argparse -import numpy as np -import paddle.fluid.dygraph as dg -from paddle import fluid -from visualdl import LogWriter - - -import utils -from parakeet.utils import io -from waveflow import WaveFlow - - -def add_options_to_parser(parser): - parser.add_argument( - '--model', - type=str, - default='waveflow', - help="general name of the model") - parser.add_argument( - '--name', type=str, help="specific name of the training model") - parser.add_argument( - '--root', type=str, help="root path of the LJSpeech dataset") - - parser.add_argument( - '--use_gpu', - type=utils.str2bool, - default=True, - help="option to use gpu training") - - parser.add_argument( - '--iteration', - type=int, - default=None, - help=("which iteration of checkpoint to load, " - "default to load the latest checkpoint")) - parser.add_argument( - '--checkpoint', - type=str, - default=None, - help="path of the checkpoint to load") - - -def train(config): - use_gpu = config.use_gpu - - # Get the rank of the current training process. - rank = dg.parallel.Env().local_rank - nranks = dg.parallel.Env().nranks - parallel = nranks > 1 - - if rank == 0: - # Print the whole config setting. - pprint(vars(config)) - - # Make checkpoint directory. - run_dir = os.path.join("runs", config.model, config.name) - checkpoint_dir = os.path.join(run_dir, "checkpoint") - if not os.path.exists(checkpoint_dir): - os.makedirs(checkpoint_dir) - - # Create tensorboard logger. - vdl = LogWriter(os.path.join(run_dir, "logs")) \ - if rank == 0 else None - - # Configurate device - place = fluid.CUDAPlace(rank) if use_gpu else fluid.CPUPlace() - - with dg.guard(place): - # Fix random seed. 
- seed = config.seed - random.seed(seed) - np.random.seed(seed) - fluid.default_startup_program().random_seed = seed - fluid.default_main_program().random_seed = seed - print("Random Seed: ", seed) - - # Build model. - model = WaveFlow(config, checkpoint_dir, parallel, rank, nranks, vdl) - iteration = model.build() - - while iteration < config.max_iterations: - # Run one single training step. - model.train_step(iteration) - - iteration += 1 - - if iteration % config.test_every == 0: - # Run validation step. - model.valid_step(iteration) - - if rank == 0 and iteration % config.save_every == 0: - # Save parameters. - model.save(iteration) - - # Close TensorBoard. - if rank == 0: - vdl.close() - - -if __name__ == "__main__": - # Create parser. - parser = argparse.ArgumentParser(description="Train WaveFlow model") - #formatter_class='default_argparse') - add_options_to_parser(parser) - utils.add_config_options_to_parser(parser) - - # Parse argument from both command line and yaml config file. - # For conflicting updates to the same field, - # the preceding update will be overwritten by the following one. - config = parser.parse_args() - config = io.add_yaml_config_to_args(config) - # Force to use fp32 in model training - vars(config)["use_fp16"] = False - train(config) diff --git a/examples/waveflow/utils.py b/examples/waveflow/utils.py deleted file mode 100644 index 3f934de..0000000 --- a/examples/waveflow/utils.py +++ /dev/null @@ -1,90 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -import argparse - - -def str2bool(v): - return v.lower() in ("true", "t", "1") - - -def add_config_options_to_parser(parser): - parser.add_argument( - '--valid_size', type=int, help="size of the valid dataset") - parser.add_argument( - '--segment_length', - type=int, - help="the length of audio clip for training") - parser.add_argument( - '--sample_rate', type=int, help="sampling rate of audio data file") - parser.add_argument( - '--fft_window_shift', - type=int, - help="the shift of fft window for each frame") - parser.add_argument( - '--fft_window_size', - type=int, - help="the size of fft window for each frame") - parser.add_argument( - '--fft_size', type=int, help="the size of fft filter on each frame") - parser.add_argument( - '--mel_bands', - type=int, - help="the number of mel bands when calculating mel spectrograms") - parser.add_argument( - '--mel_fmin', - type=float, - help="lowest frequency in calculating mel spectrograms") - parser.add_argument( - '--mel_fmax', - type=float, - help="highest frequency in calculating mel spectrograms") - - parser.add_argument( - '--seed', type=int, help="seed of random initialization for the model") - parser.add_argument('--learning_rate', type=float) - parser.add_argument( - '--batch_size', type=int, help="batch size for training") - parser.add_argument( - '--test_every', type=int, help="test interval during training") - parser.add_argument( - '--save_every', - type=int, - help="checkpointing interval during training") - parser.add_argument( - '--max_iterations', type=int, help="maximum training iterations") - - parser.add_argument( - '--sigma', - type=float, - help="standard deviation of the latent Gaussian variable") - parser.add_argument('--n_flows', type=int, help="number of flows") - parser.add_argument( - '--n_group', - type=int, - help="number of adjacent audio samples to squeeze into one column") - parser.add_argument( - '--n_layers', - type=int, - help="number of conv2d layer in one wavenet-like flow architecture") - parser.add_argument( - '--n_channels', type=int, help="number of residual channels in flow") - parser.add_argument( - '--kernel_h', - type=int, - help="height of the kernel in the conv2d layer") - parser.add_argument( - '--kernel_w', type=int, help="width of the kernel in the conv2d layer") - - parser.add_argument('--config', type=str, help="Path to the config file.") diff --git a/examples/waveflow/waveflow.py b/examples/waveflow/waveflow.py deleted file mode 100644 index a41a784..0000000 --- a/examples/waveflow/waveflow.py +++ /dev/null @@ -1,292 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -import itertools -import os -import time - -import numpy as np -import paddle.fluid.dygraph as dg -from paddle import fluid -from scipy.io.wavfile import write - -from parakeet.utils import io -from parakeet.modules import weight_norm -from parakeet.models.waveflow import WaveFlowLoss, WaveFlowModule -from data import LJSpeech -import utils - - -class WaveFlow(): - """Wrapper class of WaveFlow model that supports multiple APIs. - - This module provides APIs for model building, training, validation, - inference, benchmarking, and saving. - - Args: - config (obj): config info. - checkpoint_dir (str): path for checkpointing. - parallel (bool, optional): whether use multiple GPUs for training. - Defaults to False. - rank (int, optional): the rank of the process in a multi-process - scenario. Defaults to 0. - nranks (int, optional): the total number of processes. Defaults to 1. - vdl_logger (obj, optional): logger to visualize metrics. - Defaults to None. - - Returns: - WaveFlow - """ - - def __init__(self, - config, - checkpoint_dir, - parallel=False, - rank=0, - nranks=1, - vdl_logger=None): - self.config = config - self.checkpoint_dir = checkpoint_dir - self.parallel = parallel - self.rank = rank - self.nranks = nranks - self.vdl_logger = vdl_logger - self.dtype = "float16" if config.use_fp16 else "float32" - - def build(self, training=True): - """Initialize the model. - - Args: - training (bool, optional): Whether the model is built for training or inference. - Defaults to True. - - Returns: - None - """ - config = self.config - dataset = LJSpeech(config, self.nranks, self.rank) - self.trainloader = dataset.trainloader - self.validloader = dataset.validloader - - waveflow = WaveFlowModule(config) - - if training: - optimizer = fluid.optimizer.AdamOptimizer( - learning_rate=config.learning_rate, - parameter_list=waveflow.parameters()) - - # Load parameters. - iteration = io.load_parameters( - model=waveflow, - optimizer=optimizer, - checkpoint_dir=self.checkpoint_dir, - iteration=config.iteration, - checkpoint_path=config.checkpoint) - print("Rank {}: checkpoint loaded.".format(self.rank)) - - # Data parallelism. - if self.parallel: - strategy = dg.parallel.prepare_context() - waveflow = dg.parallel.DataParallel(waveflow, strategy) - - self.waveflow = waveflow - self.optimizer = optimizer - self.criterion = WaveFlowLoss(config.sigma) - - else: - # Load parameters. - iteration = io.load_parameters( - model=waveflow, - checkpoint_dir=self.checkpoint_dir, - iteration=config.iteration, - checkpoint_path=config.checkpoint) - print("Rank {}: checkpoint loaded.".format(self.rank)) - - for layer in waveflow.sublayers(): - if isinstance(layer, weight_norm.WeightNormWrapper): - layer.remove_weight_norm() - - self.waveflow = waveflow - - return iteration - - def train_step(self, iteration): - """Train the model for one step. - - Args: - iteration (int): current iteration number. 
- - Returns: - None - """ - self.waveflow.train() - - start_time = time.time() - audios, mels = next(self.trainloader) - load_time = time.time() - - outputs = self.waveflow(audios, mels) - loss = self.criterion(outputs) - - if self.parallel: - # loss = loss / num_trainers - loss = self.waveflow.scale_loss(loss) - loss.backward() - self.waveflow.apply_collective_grads() - else: - loss.backward() - - self.optimizer.minimize( - loss, parameter_list=self.waveflow.parameters()) - self.waveflow.clear_gradients() - - graph_time = time.time() - - if self.rank == 0: - loss_val = float(loss.numpy()) * self.nranks - log = "Rank: {} Step: {:^8d} Loss: {:<8.3f} " \ - "Time: {:.3f}/{:.3f}".format( - self.rank, iteration, loss_val, - load_time - start_time, graph_time - load_time) - print(log) - - vdl_writer = self.vdl_logger - vdl_writer.add_scalar("Train-Loss-Rank-0", loss_val, iteration) - - @dg.no_grad - def valid_step(self, iteration): - """Run the model on the validation dataset. - - Args: - iteration (int): current iteration number. - - Returns: - None - """ - self.waveflow.eval() - vdl_writer = self.vdl_logger - - total_loss = [] - sample_audios = [] - start_time = time.time() - - for i, batch in enumerate(self.validloader()): - audios, mels = batch - valid_outputs = self.waveflow(audios, mels) - valid_z, valid_log_s_list = valid_outputs - - # Visualize latent z and scale log_s. - if self.rank == 0 and i == 0: - vdl_writer.add_histogram("Valid-Latent_z", valid_z.numpy(), - iteration) - for j, valid_log_s in enumerate(valid_log_s_list): - hist_name = "Valid-{}th-Flow-Log_s".format(j) - vdl_writer.add_histogram(hist_name, valid_log_s.numpy(), - iteration) - - valid_loss = self.criterion(valid_outputs) - total_loss.append(float(valid_loss.numpy())) - - total_time = time.time() - start_time - if self.rank == 0: - loss_val = np.mean(total_loss) - log = "Test | Rank: {} AvgLoss: {:<8.3f} Time {:<8.3f}".format( - self.rank, loss_val, total_time) - print(log) - vdl_writer.add_scalar("Valid-Avg-Loss", loss_val, iteration) - - @dg.no_grad - def infer(self, iteration): - """Run the model to synthesize audios. - - Args: - iteration (int): iteration number of the loaded checkpoint. - - Returns: - None - """ - self.waveflow.eval() - - config = self.config - sample = config.sample - - output = "{}/{}/iter-{}".format(config.output, config.name, iteration) - if not os.path.exists(output): - os.makedirs(output) - - mels_list = [mels for _, mels in self.validloader()] - if sample is not None: - mels_list = [mels_list[sample]] - else: - sample = 0 - - for idx, mel in enumerate(mels_list): - abs_idx = sample + idx - filename = "{}/valid_{}.wav".format(output, abs_idx) - print("Synthesize sample {}, save as {}".format(abs_idx, filename)) - - start_time = time.time() - audio = self.waveflow.synthesize(mel, sigma=self.config.sigma) - syn_time = time.time() - start_time - - audio = audio[0] - audio_time = audio.shape[0] / self.config.sample_rate - print("audio time {:.4f}, synthesis time {:.4f}".format(audio_time, - syn_time)) - - # Denormalize audio from [-1, 1] to [-32768, 32768] int16 range. - audio = audio.numpy().astype("float32") * 32768.0 - audio = audio.astype('int16') - write(filename, config.sample_rate, audio) - - @dg.no_grad - def benchmark(self): - """Run the model to benchmark synthesis speed. 
-
-        Args:
-            None
-
-        Returns:
-            None
-        """
-        self.waveflow.eval()
-
-        mels_list = [mels for _, mels in self.validloader()]
-        mel = fluid.layers.concat(mels_list, axis=2)
-        mel = mel[:, :, :864]
-        batch_size = 8
-        mel = fluid.layers.expand(mel, [batch_size, 1, 1])
-
-        for i in range(10):
-            start_time = time.time()
-            audio = self.waveflow.synthesize(mel, sigma=self.config.sigma)
-            print("audio.shape = ", audio.shape)
-            syn_time = time.time() - start_time
-
-            audio_time = audio.shape[1] * batch_size / self.config.sample_rate
-            print("audio time {:.4f}, synthesis time {:.4f}".format(audio_time,
-                                                                    syn_time))
-            print("{} X real-time".format(audio_time / syn_time))
-
-    def save(self, iteration):
-        """Save model checkpoint.
-
-        Args:
-            iteration (int): iteration number of the model to be saved.
-
-        Returns:
-            None
-        """
-        io.save_parameters(self.checkpoint_dir, iteration, self.waveflow,
-                           self.optimizer)
diff --git a/examples/wavenet/README.md b/examples/wavenet/README.md
deleted file mode 100644
index 42defe7..0000000
--- a/examples/wavenet/README.md
+++ /dev/null
@@ -1,144 +0,0 @@
-# WaveNet
-
-PaddlePaddle dynamic graph implementation of WaveNet, a convolutional network based vocoder. WaveNet was originally proposed in [WaveNet: A Generative Model for Raw Audio](https://arxiv.org/abs/1609.03499). However, in this experiment, the implementation follows the teacher model in [ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech](https://arxiv.org/abs/1807.07281).
-
-
-## Dataset
-
-We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
-
-```bash
-wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
-tar xjvf LJSpeech-1.1.tar.bz2
-```
-
-## Project Structure
-
-```text
-├── data.py          data processing
-├── configs/         (example) configuration files
-├── synthesis.py     script to synthesize waveform from mel spectrogram
-├── train.py         script to train a model
-└── utils.py         utility functions
-```
-
-## Saving & Loading
-`train.py` and `synthesis.py` have 3 arguments in common, `--checkpoint`, `--iteration` and `output`.
-
-1. `output` is the directory for saving results.
-During training, checkpoints are saved in `checkpoints/` in `output` and the tensorboard log is saved in `log/` in `output`. Other possible outputs are saved in `states/` in `output`.
-During synthesis, audio files and other possible outputs are saved in `synthesis/` in `output`.
-So after training and synthesizing with the same output directory, the file structure of the output directory looks like this.
-
-```text
-├── checkpoints/      # checkpoint directory (including *.pdparams, *.pdopt and a text file `checkpoint` that records the latest checkpoint)
-├── states/           # audio files generated at validation and other possible outputs
-├── log/              # tensorboard log
-└── synthesis/        # synthesized audio files and other possible outputs
-```
-
-2. `--checkpoint` and `--iteration` are used for loading from an existing checkpoint. Loading an existing checkpoint follows these rules:
-If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
-If `--checkpoint` is not provided, we try to load the checkpoint specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided either, we try to load the latest checkpoint from the checkpoint directory.
-
-## Train
-
-Train the model using train.py. For help on usage, try `python train.py --help`.
-
-```text
-usage: train.py [-h] [--data DATA] [--config CONFIG] [--device DEVICE]
-                [--checkpoint CHECKPOINT | --iteration ITERATION]
-                output
-
-Train a WaveNet model with LJSpeech.
-
-positional arguments:
-  output                path to save results
-
-optional arguments:
-  -h, --help            show this help message and exit
-  --data DATA           path of the LJspeech dataset
-  --config CONFIG       path of the config file
-  --device DEVICE       device to use
-  --checkpoint CHECKPOINT   checkpoint to resume from
-  --iteration ITERATION     the iteration of the checkpoint to load from output directory
-```
-
-- `--data` is the path of the LJSpeech dataset, the extracted folder from the downloaded archive (the folder which contains metadata.txt).
-- `--config` is the configuration file to use. The provided configurations can be used directly; you can also change some values in the configuration file and train the model with a different config.
-- `--device` is the device (gpu id) to use for training. `-1` means CPU.
-
-- `--checkpoint` is the path of the checkpoint.
-- `--iteration` is the iteration of the checkpoint to load from the output directory.
-- `output` is the directory to save results; all results are saved in this directory.
-
-See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
-
-
-Example script:
-
-```bash
-python train.py \
-    --config=./configs/wavenet_single_gaussian.yaml \
-    --data=./LJSpeech-1.1/ \
-    --device=0 \
-    experiment
-```
-
-You can monitor the training log via TensorBoard, using the script below.
-
-```bash
-cd experiment/log
-tensorboard --logdir=.
-```
-
-## Synthesis
-```text
-usage: synthesis.py [-h] [--data DATA] [--config CONFIG] [--device DEVICE]
-                    [--checkpoint CHECKPOINT | --iteration ITERATION]
-                    output
-
-Synthesize valid data from LJspeech with a wavenet model.
-
-positional arguments:
-  output                path to save the synthesized audio
-
-optional arguments:
-  -h, --help            show this help message and exit
-  --data DATA           path of the LJspeech dataset
-  --config CONFIG       path of the config file
-  --device DEVICE       device to use
-  --checkpoint CHECKPOINT   checkpoint to resume from
-  --iteration ITERATION     the iteration of the checkpoint to load from output directory
-```
-
-- `--data` is the path of the LJspeech dataset. In principle, a dataset is not needed for synthesis, but since the input is a mel spectrogram, we need to get the mel spectrogram from audio files.
-- `--config` is the configuration file to use. You should use the same configuration with which you trained your model.
-- `--device` is the device (gpu id) to use for synthesis. `-1` means CPU.
-- `--checkpoint` is the checkpoint to load.
-- `--iteration` is the iteration of the checkpoint to load from the output directory.
-- `output` is the directory to save the synthesized audio. Audio files are saved in `synthesis/` in the `output` directory.
-See [Saving-&-Loading](#Saving-&-Loading) for details of checkpoint loading.
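The checkpoint-loading rule summarized above maps onto `parakeet.utils.io.load_parameters`, which both `train.py` and `synthesis.py` call (see those files later in this diff). A minimal sketch of the resume logic, assuming a `model`, `optim`, and the parsed `args` already exist as in those scripts:

```python
import os

from parakeet.utils import io

# Sketch of the "Saving & Loading" rule; `model`, `optim` and `args` are
# assumed to have been built as in examples/wavenet/train.py.
checkpoint_dir = os.path.join(args.output, "checkpoints")

if args.checkpoint is not None:
    # --checkpoint takes precedence: load exactly this checkpoint.
    iteration = io.load_parameters(model, optim, checkpoint_path=args.checkpoint)
else:
    # Otherwise use --iteration within the checkpoint directory; when
    # --iteration is also absent, the latest checkpoint is loaded.
    iteration = io.load_parameters(
        model, optim, checkpoint_dir=checkpoint_dir, iteration=args.iteration)
```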
- - -Example script: - -```bash -python synthesis.py \ - --config=./configs/wavenet_single_gaussian.yaml \ - --data=./LJSpeech-1.1/ \ - --device=0 \ - --checkpoint="experiment/checkpoints/step-1000000" \ - experiment -``` - -or - -```bash -python synthesis.py \ - --config=./configs/wavenet_single_gaussian.yaml \ - --data=./LJSpeech-1.1/ \ - --device=0 \ - --iteration=1000000 \ - experiment -``` diff --git a/examples/wavenet/configs/wavenet_mixture_of_gaussians.yaml b/examples/wavenet/configs/wavenet_mixture_of_gaussians.yaml deleted file mode 100644 index 68936ee..0000000 --- a/examples/wavenet/configs/wavenet_mixture_of_gaussians.yaml +++ /dev/null @@ -1,36 +0,0 @@ -data: - batch_size: 16 - train_clip_seconds: 0.5 - sample_rate: 22050 - hop_length: 256 - win_length: 1024 - n_fft: 2048 - n_mels: 80 - valid_size: 16 - - - -model: - upsampling_factors: [16, 16] - n_loop: 10 - n_layer: 3 - filter_size: 2 - residual_channels: 128 - loss_type: "mog" - output_dim: 30 - log_scale_min: -9 - -train: - learning_rate: 0.001 - anneal_rate: 0.5 - anneal_interval: 200000 - gradient_max_norm: 100.0 - - checkpoint_interval: 10000 - snap_interval: 10000 - eval_interval: 10000 - - max_iterations: 2000000 - - - diff --git a/examples/wavenet/configs/wavenet_single_gaussian.yaml b/examples/wavenet/configs/wavenet_single_gaussian.yaml deleted file mode 100644 index 484db0b..0000000 --- a/examples/wavenet/configs/wavenet_single_gaussian.yaml +++ /dev/null @@ -1,36 +0,0 @@ -data: - batch_size: 16 - train_clip_seconds: 0.5 - sample_rate: 22050 - hop_length: 256 - win_length: 1024 - n_fft: 2048 - n_mels: 80 - valid_size: 16 - - - -model: - upsampling_factors: [16, 16] - n_loop: 10 - n_layer: 3 - filter_size: 2 - residual_channels: 128 - loss_type: "mog" - output_dim: 3 - log_scale_min: -9 - -train: - learning_rate: 0.001 - anneal_rate: 0.5 - anneal_interval: 200000 - gradient_max_norm: 100.0 - - checkpoint_interval: 10000 - snap_interval: 10000 - eval_interval: 10000 - - max_iterations: 2000000 - - - diff --git a/examples/wavenet/configs/wavenet_softmax.yaml b/examples/wavenet/configs/wavenet_softmax.yaml deleted file mode 100644 index 7e9d756..0000000 --- a/examples/wavenet/configs/wavenet_softmax.yaml +++ /dev/null @@ -1,36 +0,0 @@ -data: - batch_size: 16 - train_clip_seconds: 0.5 - sample_rate: 22050 - hop_length: 256 - win_length: 1024 - n_fft: 2048 - n_mels: 80 - valid_size: 16 - - - -model: - upsampling_factors: [16, 16] - n_loop: 10 - n_layer: 3 - filter_size: 2 - residual_channels: 128 - loss_type: "softmax" - output_dim: 2048 - log_scale_min: -9 - -train: - learning_rate: 0.001 - anneal_rate: 0.5 - anneal_interval: 200000 - gradient_max_norm: 100.0 - - checkpoint_interval: 10000 - snap_interval: 10000 - eval_interval: 10000 - - max_iterations: 2000000 - - - diff --git a/examples/wavenet/data.py b/examples/wavenet/data.py deleted file mode 100644 index 24285ff..0000000 --- a/examples/wavenet/data.py +++ /dev/null @@ -1,164 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import division -import csv -import numpy as np -import librosa -from pathlib import Path -import pandas as pd - -from parakeet.data import batch_spec, batch_wav -from parakeet.data import DatasetMixin - - -class LJSpeechMetaData(DatasetMixin): - def __init__(self, root): - self.root = Path(root) - self._wav_dir = self.root.joinpath("wavs") - csv_path = self.root.joinpath("metadata.csv") - self._table = pd.read_csv( - csv_path, - sep="|", - header=None, - quoting=csv.QUOTE_NONE, - names=["fname", "raw_text", "normalized_text"]) - - def get_example(self, i): - fname, raw_text, normalized_text = self._table.iloc[i] - fname = str(self._wav_dir.joinpath(fname + ".wav")) - return fname, raw_text, normalized_text - - def __len__(self): - return len(self._table) - - -class Transform(object): - def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels): - self.sample_rate = sample_rate - self.n_fft = n_fft - self.win_length = win_length - self.hop_length = hop_length - self.n_mels = n_mels - - def __call__(self, example): - wav_path, _, _ = example - - sr = self.sample_rate - n_fft = self.n_fft - win_length = self.win_length - hop_length = self.hop_length - n_mels = self.n_mels - - wav, loaded_sr = librosa.load(wav_path, sr=None) - assert loaded_sr == sr, "sample rate does not match, resampling applied" - - # Pad audio to the right size. - frames = int(np.ceil(float(wav.size) / hop_length)) - fft_padding = (n_fft - hop_length) // 2 # sound - desired_length = frames * hop_length + fft_padding * 2 - pad_amount = (desired_length - wav.size) // 2 - - if wav.size % 2 == 0: - wav = np.pad(wav, (pad_amount, pad_amount), mode='reflect') - else: - wav = np.pad(wav, (pad_amount, pad_amount + 1), mode='reflect') - - # Normalize audio. - wav = wav / np.abs(wav).max() * 0.999 - - # Compute mel-spectrogram. - # Turn center to False to prevent internal padding. - spectrogram = librosa.core.stft( - wav, - hop_length=hop_length, - win_length=win_length, - n_fft=n_fft, - center=False) - spectrogram_magnitude = np.abs(spectrogram) - - # Compute mel-spectrograms. - mel_filter_bank = librosa.filters.mel(sr=sr, - n_fft=n_fft, - n_mels=n_mels) - mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude) - mel_spectrogram = mel_spectrogram - - # Rescale mel_spectrogram. - min_level, ref_level = 1e-5, 20 # hard code it - mel_spectrogram = 20 * np.log10(np.maximum(min_level, mel_spectrogram)) - mel_spectrogram = mel_spectrogram - ref_level - mel_spectrogram = np.clip((mel_spectrogram + 100) / 100, 0, 1) - - # Extract the center of audio that corresponds to mel spectrograms. 
- audio = wav[fft_padding:-fft_padding] - assert mel_spectrogram.shape[1] * hop_length == audio.size - - # there is no clipping here - return audio, mel_spectrogram - - -class DataCollector(object): - def __init__(self, - context_size, - sample_rate, - hop_length, - train_clip_seconds, - valid=False): - frames_per_second = sample_rate // hop_length - train_clip_frames = int( - np.ceil(train_clip_seconds * frames_per_second)) - context_frames = context_size // hop_length - self.num_frames = train_clip_frames + context_frames - - self.sample_rate = sample_rate - self.hop_length = hop_length - self.valid = valid - - def random_crop(self, sample): - audio, mel_spectrogram = sample - audio_frames = int(audio.size) // self.hop_length - max_start_frame = audio_frames - self.num_frames - assert max_start_frame >= 0, "audio is too short to be cropped" - - frame_start = np.random.randint(0, max_start_frame) - # frame_start = 0 # norandom - frame_end = frame_start + self.num_frames - - audio_start = frame_start * self.hop_length - audio_end = frame_end * self.hop_length - - audio = audio[audio_start:audio_end] - return audio, mel_spectrogram, audio_start - - def __call__(self, samples): - # transform them first - if self.valid: - samples = [(audio, mel_spectrogram, 0) - for audio, mel_spectrogram in samples] - else: - samples = [self.random_crop(sample) for sample in samples] - # batch them - audios = [sample[0] for sample in samples] - audio_starts = [sample[2] for sample in samples] - mels = [sample[1] for sample in samples] - - mels = batch_spec(mels) - - if self.valid: - audios = batch_wav(audios, dtype=np.float32) - else: - audios = np.array(audios, dtype=np.float32) - audio_starts = np.array(audio_starts, dtype=np.int64) - return audios, mels, audio_starts diff --git a/examples/wavenet/synthesis.py b/examples/wavenet/synthesis.py deleted file mode 100644 index a1d13f4..0000000 --- a/examples/wavenet/synthesis.py +++ /dev/null @@ -1,152 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
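The padding arithmetic in `Transform.__call__` of `examples/wavenet/data.py` above guarantees that the number of STFT frames times the hop length equals the length of the trimmed audio, which is exactly what the assertion at the end of that method checks. A quick numeric sanity check, assuming the values from the example configs (22050 Hz sample rate, `n_fft` 2048, `hop_length` 256):

```python
import numpy as np

# Values from the example wavenet configs (assumed for illustration).
n_fft, hop_length = 2048, 256
wav_size = 22050                                   # a 1-second clip at 22050 Hz

frames = int(np.ceil(wav_size / hop_length))       # 87
fft_padding = (n_fft - hop_length) // 2            # 896
desired_length = frames * hop_length + 2 * fft_padding   # 24064

# An STFT with center=False over the padded signal yields exactly `frames` columns,
n_frames = 1 + (desired_length - n_fft) // hop_length     # 87
# and trimming fft_padding from both ends leaves frames * hop_length samples.
assert desired_length - 2 * fft_padding == frames * hop_length
```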
- -from __future__ import division -import os -import ruamel.yaml -import argparse -from tqdm import tqdm -from paddle import fluid -fluid.require_version('1.8.0') -import paddle.fluid.dygraph as dg - -from parakeet.modules.weight_norm import WeightNormWrapper -from parakeet.data import SliceDataset, TransformDataset, DataCargo, SequentialSampler, RandomSampler -from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWavenet -from parakeet.utils.layer_tools import summary -from parakeet.utils import io - -from data import LJSpeechMetaData, Transform, DataCollector -from utils import make_output_tree, valid_model, eval_model - -if __name__ == "__main__": - parser = argparse.ArgumentParser( - description="Synthesize valid data from LJspeech with a wavenet model.") - parser.add_argument( - "--data", type=str, help="path of the LJspeech dataset") - parser.add_argument("--config", type=str, help="path of the config file") - parser.add_argument("--device", type=int, default=-1, help="device to use") - - g = parser.add_mutually_exclusive_group() - g.add_argument("--checkpoint", type=str, help="checkpoint to resume from") - g.add_argument( - "--iteration", - type=int, - help="the iteration of the checkpoint to load from output directory") - - parser.add_argument( - "output", - type=str, - default="experiment", - help="path to save the synthesized audio") - - args = parser.parse_args() - with open(args.config, 'rt') as f: - config = ruamel.yaml.safe_load(f) - - if args.device == -1: - place = fluid.CPUPlace() - else: - place = fluid.CUDAPlace(args.device) - - dg.enable_dygraph(place) - - ljspeech_meta = LJSpeechMetaData(args.data) - - data_config = config["data"] - sample_rate = data_config["sample_rate"] - n_fft = data_config["n_fft"] - win_length = data_config["win_length"] - hop_length = data_config["hop_length"] - n_mels = data_config["n_mels"] - train_clip_seconds = data_config["train_clip_seconds"] - transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels) - ljspeech = TransformDataset(ljspeech_meta, transform) - - valid_size = data_config["valid_size"] - ljspeech_valid = SliceDataset(ljspeech, 0, valid_size) - ljspeech_train = SliceDataset(ljspeech, valid_size, len(ljspeech)) - - model_config = config["model"] - n_loop = model_config["n_loop"] - n_layer = model_config["n_layer"] - filter_size = model_config["filter_size"] - context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)]) - print("context size is {} samples".format(context_size)) - train_batch_fn = DataCollector(context_size, sample_rate, hop_length, - train_clip_seconds) - valid_batch_fn = DataCollector( - context_size, sample_rate, hop_length, train_clip_seconds, valid=True) - - batch_size = data_config["batch_size"] - train_cargo = DataCargo( - ljspeech_train, - train_batch_fn, - batch_size, - sampler=RandomSampler(ljspeech_train)) - - # only batch=1 for validation is enabled - valid_cargo = DataCargo( - ljspeech_valid, - valid_batch_fn, - batch_size=1, - sampler=SequentialSampler(ljspeech_valid)) - - if not os.path.exists(args.output): - os.makedirs(args.output) - - model_config = config["model"] - upsampling_factors = model_config["upsampling_factors"] - encoder = UpsampleNet(upsampling_factors) - - n_loop = model_config["n_loop"] - n_layer = model_config["n_layer"] - residual_channels = model_config["residual_channels"] - output_dim = model_config["output_dim"] - loss_type = model_config["loss_type"] - log_scale_min = model_config["log_scale_min"] - decoder = WaveNet(n_loop, 
n_layer, residual_channels, output_dim, n_mels, - filter_size, loss_type, log_scale_min) - - model = ConditionalWavenet(encoder, decoder) - summary(model) - - # load model parameters - checkpoint_dir = os.path.join(args.output, "checkpoints") - if args.checkpoint: - iteration = io.load_parameters(model, checkpoint_path=args.checkpoint) - else: - iteration = io.load_parameters( - model, checkpoint_dir=checkpoint_dir, iteration=args.iteration) - assert iteration > 0, "A trained model is needed." - - # WARNING: don't forget to remove weight norm to re-compute each wrapped layer's weight - # removing weight norm also speeds up computation - for layer in model.sublayers(): - if isinstance(layer, WeightNormWrapper): - layer.remove_weight_norm() - - train_loader = fluid.io.DataLoader.from_generator( - capacity=10, return_list=True) - train_loader.set_batch_generator(train_cargo, place) - - valid_loader = fluid.io.DataLoader.from_generator( - capacity=10, return_list=True) - valid_loader.set_batch_generator(valid_cargo, place) - - synthesis_dir = os.path.join(args.output, "synthesis") - if not os.path.exists(synthesis_dir): - os.makedirs(synthesis_dir) - - eval_model(model, valid_loader, synthesis_dir, iteration, sample_rate) diff --git a/examples/wavenet/train.py b/examples/wavenet/train.py deleted file mode 100644 index d211b06..0000000 --- a/examples/wavenet/train.py +++ /dev/null @@ -1,201 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
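The `context_size` computed in `synthesis.py` above (and again in `train.py` below) is the number of past samples the stack of dilated convolutions conditions on; `DataCollector` uses it to enlarge each training clip. With the values from the example configs it works out as follows (a worked example, not part of the original scripts):

```python
# Example config values: n_loop=10, n_layer=3, filter_size=2.
n_loop, n_layer, filter_size = 10, 3, 2
context_size = 1 + n_layer * sum(filter_size**i for i in range(n_loop))
# sum(2**i for i in range(10)) == 1023, so context_size == 1 + 3 * 1023 == 3070
print(context_size)            # 3070 samples
print(context_size / 22050)    # ~0.14 s of extra context at 22050 Hz
```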
- -from __future__ import division -import os -import ruamel.yaml -import argparse -import tqdm -from visualdl import LogWriter -from paddle import fluid -fluid.require_version('1.8.0') -import paddle.fluid.dygraph as dg - -from parakeet.data import SliceDataset, TransformDataset, CacheDataset, DataCargo, SequentialSampler, RandomSampler -from parakeet.models.wavenet import UpsampleNet, WaveNet, ConditionalWavenet -from parakeet.utils.layer_tools import summary -from parakeet.utils import io - -from data import LJSpeechMetaData, Transform, DataCollector -from utils import make_output_tree, valid_model - -if __name__ == "__main__": - parser = argparse.ArgumentParser( - description="Train a WaveNet model with LJSpeech.") - parser.add_argument( - "--data", type=str, help="path of the LJspeech dataset") - parser.add_argument("--config", type=str, help="path of the config file") - parser.add_argument("--device", type=int, default=-1, help="device to use") - - g = parser.add_mutually_exclusive_group() - g.add_argument("--checkpoint", type=str, help="checkpoint to resume from") - g.add_argument( - "--iteration", - type=int, - help="the iteration of the checkpoint to load from output directory") - - parser.add_argument( - "output", type=str, default="experiment", help="path to save results") - - args = parser.parse_args() - with open(args.config, 'rt') as f: - config = ruamel.yaml.safe_load(f) - - if args.device == -1: - place = fluid.CPUPlace() - else: - place = fluid.CUDAPlace(args.device) - - dg.enable_dygraph(place) - - print("Command Line Args: ") - for k, v in vars(args).items(): - print("{}: {}".format(k, v)) - - ljspeech_meta = LJSpeechMetaData(args.data) - - data_config = config["data"] - sample_rate = data_config["sample_rate"] - n_fft = data_config["n_fft"] - win_length = data_config["win_length"] - hop_length = data_config["hop_length"] - n_mels = data_config["n_mels"] - train_clip_seconds = data_config["train_clip_seconds"] - transform = Transform(sample_rate, n_fft, win_length, hop_length, n_mels) - ljspeech = TransformDataset(ljspeech_meta, transform) - - valid_size = data_config["valid_size"] - ljspeech_valid = CacheDataset(SliceDataset(ljspeech, 0, valid_size)) - ljspeech_train = CacheDataset( - SliceDataset(ljspeech, valid_size, len(ljspeech))) - - model_config = config["model"] - n_loop = model_config["n_loop"] - n_layer = model_config["n_layer"] - filter_size = model_config["filter_size"] - context_size = 1 + n_layer * sum([filter_size**i for i in range(n_loop)]) - print("context size is {} samples".format(context_size)) - train_batch_fn = DataCollector(context_size, sample_rate, hop_length, - train_clip_seconds) - valid_batch_fn = DataCollector( - context_size, sample_rate, hop_length, train_clip_seconds, valid=True) - - batch_size = data_config["batch_size"] - train_cargo = DataCargo( - ljspeech_train, - train_batch_fn, - batch_size, - sampler=RandomSampler(ljspeech_train)) - - # only batch=1 for validation is enabled - valid_cargo = DataCargo( - ljspeech_valid, - valid_batch_fn, - batch_size=1, - sampler=SequentialSampler(ljspeech_valid)) - - make_output_tree(args.output) - - if args.device == -1: - place = fluid.CPUPlace() - else: - place = fluid.CUDAPlace(args.device) - - model_config = config["model"] - upsampling_factors = model_config["upsampling_factors"] - encoder = UpsampleNet(upsampling_factors) - - n_loop = model_config["n_loop"] - n_layer = model_config["n_layer"] - residual_channels = model_config["residual_channels"] - output_dim = model_config["output_dim"] - 
loss_type = model_config["loss_type"] - log_scale_min = model_config["log_scale_min"] - decoder = WaveNet(n_loop, n_layer, residual_channels, output_dim, n_mels, - filter_size, loss_type, log_scale_min) - - model = ConditionalWavenet(encoder, decoder) - summary(model) - - train_config = config["train"] - learning_rate = train_config["learning_rate"] - anneal_rate = train_config["anneal_rate"] - anneal_interval = train_config["anneal_interval"] - lr_scheduler = dg.ExponentialDecay( - learning_rate, anneal_interval, anneal_rate, staircase=True) - gradiant_max_norm = train_config["gradient_max_norm"] - optim = fluid.optimizer.Adam( - lr_scheduler, - parameter_list=model.parameters(), - grad_clip=fluid.clip.ClipByGlobalNorm(gradiant_max_norm)) - - train_loader = fluid.io.DataLoader.from_generator( - capacity=10, return_list=True) - train_loader.set_batch_generator(train_cargo, place) - - valid_loader = fluid.io.DataLoader.from_generator( - capacity=10, return_list=True) - valid_loader.set_batch_generator(valid_cargo, place) - - max_iterations = train_config["max_iterations"] - checkpoint_interval = train_config["checkpoint_interval"] - snap_interval = train_config["snap_interval"] - eval_interval = train_config["eval_interval"] - checkpoint_dir = os.path.join(args.output, "checkpoints") - log_dir = os.path.join(args.output, "log") - writer = LogWriter(log_dir) - - # load parameters and optimizer, and update iterations done so far - if args.checkpoint is not None: - iteration = io.load_parameters( - model, optim, checkpoint_path=args.checkpoint) - else: - iteration = io.load_parameters( - model, - optim, - checkpoint_dir=checkpoint_dir, - iteration=args.iteration) - - global_step = iteration + 1 - iterator = iter(tqdm.tqdm(train_loader)) - while global_step <= max_iterations: - try: - batch = next(iterator) - except StopIteration as e: - iterator = iter(tqdm.tqdm(train_loader)) - batch = next(iterator) - - audio_clips, mel_specs, audio_starts = batch - - model.train() - y_var = model(audio_clips, mel_specs, audio_starts) - loss_var = model.loss(y_var, audio_clips) - loss_var.backward() - loss_np = loss_var.numpy() - - writer.add_scalar("loss", loss_np[0], global_step) - writer.add_scalar("learning_rate", - optim._learning_rate.step().numpy()[0], global_step) - optim.minimize(loss_var) - optim.clear_gradients() - print("global_step: {}\tloss: {:<8.6f}".format(global_step, loss_np[ - 0])) - - if global_step % snap_interval == 0: - valid_model(model, valid_loader, writer, global_step, sample_rate) - - if global_step % checkpoint_interval == 0: - io.save_parameters(checkpoint_dir, global_step, model, optim) - - global_step += 1 diff --git a/examples/wavenet/utils.py b/examples/wavenet/utils.py deleted file mode 100644 index b603770..0000000 --- a/examples/wavenet/utils.py +++ /dev/null @@ -1,62 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
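The optimizer in `train.py` above pairs `dg.ExponentialDecay(..., staircase=True)` with the example config values (learning_rate 0.001, anneal_rate 0.5, anneal_interval 200000), so the effective learning rate is halved every 200k steps. A small illustration of that staircase schedule (a sketch assuming the documented staircase behavior, not a quotation of Paddle's implementation):

```python
def staircase_lr(step, base_lr=1e-3, anneal_rate=0.5, anneal_interval=200000):
    # Staircase decay: the rate drops only at whole multiples of the interval.
    return base_lr * anneal_rate ** (step // anneal_interval)

print(staircase_lr(0))         # 0.001
print(staircase_lr(200000))    # 0.0005
print(staircase_lr(400000))    # 0.00025
```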
- -from __future__ import division -import os -import numpy as np -import soundfile as sf -import paddle.fluid.dygraph as dg - - -def make_output_tree(output_dir): - checkpoint_dir = os.path.join(output_dir, "checkpoints") - if not os.path.exists(checkpoint_dir): - os.makedirs(checkpoint_dir) - - state_dir = os.path.join(output_dir, "states") - if not os.path.exists(state_dir): - os.makedirs(state_dir) - - -def valid_model(model, valid_loader, writer, global_step, sample_rate): - loss = [] - wavs = [] - model.eval() - for i, batch in enumerate(valid_loader): - # print("sentence {}".format(i)) - audio_clips, mel_specs, audio_starts = batch - y_var = model(audio_clips, mel_specs, audio_starts) - wav_var = model.sample(y_var) - loss_var = model.loss(y_var, audio_clips) - loss.append(loss_var.numpy()[0]) - wavs.append(wav_var.numpy()[0]) - - average_loss = np.mean(loss) - writer.add_scalar("valid_loss", average_loss, global_step) - for i, wav in enumerate(wavs): - writer.add_audio("valid/sample_{}".format(i), wav, global_step, - sample_rate) - - -def eval_model(model, valid_loader, output_dir, global_step, sample_rate): - model.eval() - for i, batch in enumerate(valid_loader): - # print("sentence {}".format(i)) - path = os.path.join(output_dir, - "sentence_{}_step_{}.wav".format(i, global_step)) - audio_clips, mel_specs, audio_starts = batch - wav_var = model.synthesis(mel_specs) - wav_np = wav_var.numpy()[0] - sf.write(path, wav_np, samplerate=sample_rate) - print("generated {}".format(path)) diff --git a/parakeet/__init__.py b/parakeet/__init__.py index 9be1aaf..4f26116 100644 --- a/parakeet/__init__.py +++ b/parakeet/__init__.py @@ -14,4 +14,4 @@ __version__ = "0.0.0" -from . import data, g2p, models, modules +from parakeet import data, frontend, models, modules diff --git a/parakeet/__main__.py b/parakeet/__main__.py new file mode 100644 index 0000000..e7c60be --- /dev/null +++ b/parakeet/__main__.py @@ -0,0 +1,36 @@ +import parakeet + +if __name__ == '__main__': + import argparse + import os + import shutil + from pathlib import Path + + package_path = Path(__file__).parent + print(package_path) + + parser = argparse.ArgumentParser() + subparser = parser.add_subparsers(dest="cmd") + + list_exp_parser = subparser.add_parser("list-examples") + clone = subparser.add_parser("clone-example") + clone.add_argument("experiment_name", type=str, help="experiment name") + + args = parser.parse_args() + + if args.cmd == "list-examples": + print(os.listdir(package_path / "examples")) + exit(0) + + if args.cmd == "clone-example": + source = package_path / "examples" / (args.experiment_name) + target = Path(os.getcwd()) / (args.experiment_name) + if not os.path.exists(str(source)): + raise ValueError("{} does not exist".format(str(source))) + + if os.path.exists(str(target)): + raise FileExistsError("{} already exists".format(str(target))) + + shutil.copytree(str(source), str(target)) + print("{} copied!".format(args.experiment_name)) + exit(0) diff --git a/parakeet/audio/__init__.py b/parakeet/audio/__init__.py index 253a887..7fc437c 100644 --- a/parakeet/audio/__init__.py +++ b/parakeet/audio/__init__.py @@ -12,4 +12,5 @@ # See the License for the specific language governing permissions and # limitations under the License. 
-from .audio import AudioProcessor \ No newline at end of file +from .audio import AudioProcessor +from .spec_normalizer import NormalizerBase, LogMagnitude \ No newline at end of file diff --git a/parakeet/audio/audio.py b/parakeet/audio/audio.py index 9133a47..48722da 100644 --- a/parakeet/audio/audio.py +++ b/parakeet/audio/audio.py @@ -15,278 +15,80 @@ import librosa import soundfile as sf import numpy as np -import scipy.io -import scipy.signal - class AudioProcessor(object): - def __init__( - self, - sample_rate=None, # int, sampling rate - num_mels=None, # int, bands of mel spectrogram - min_level_db=None, # float, minimum level db - ref_level_db=None, # float, reference level db - n_fft=None, # int: number of samples in a frame for stft - win_length=None, # int: the same meaning with n_fft - hop_length=None, # int: number of samples between neighboring frame - power=None, # float:power to raise before griffin-lim - preemphasis=None, # float: preemphasis coefficident - signal_norm=None, # - symmetric_norm=False, # bool, apply clip norm in [-max_norm, max_form] - max_norm=None, # float, max norm - mel_fmin=None, # int: mel spectrogram's minimum frequency - mel_fmax=None, # int: mel spectrogram's maximum frequency - clip_norm=True, # bool: clip spectrogram's norm - griffin_lim_iters=None, # int: - do_trim_silence=False, # bool: trim silence - sound_norm=False, - **kwargs): + def __init__(self, + sample_rate:int, + n_fft:int, + win_length:int, + hop_length:int, + n_mels:int=80, + f_min:int=0, + f_max:int=None, + window="hann", + center="True", + pad_mode="reflect"): + # read & write self.sample_rate = sample_rate - self.num_mels = num_mels - self.min_level_db = min_level_db - self.ref_level_db = ref_level_db - # stft related + # stft self.n_fft = n_fft - self.win_length = win_length or n_fft - # hop length defaults to 1/4 window_length - self.hop_length = hop_length or 0.25 * self.win_length + self.win_length = win_length + self.hop_length = hop_length + self.window = window + self.center = center + self.pad_mode = pad_mode + + # mel + self.n_mels = n_mels + self.f_min = f_min + self.f_max = f_max - self.power = power - self.preemphasis = float(preemphasis) + self.mel_filter = self._create_mel_filter() + self.inv_mel_filter = np.linalg.pinv(self.mel_filter) + + def _create_mel_filter(self): + mel_filter = librosa.filters.mel( + self.sample_rate, + self.n_fft, + n_mels=self.n_mels, + fmin=self.f_min, + fmax=self.f_max) + return mel_filter - self.griffin_lim_iters = griffin_lim_iters - self.signal_norm = signal_norm - self.symmetric_norm = symmetric_norm + def read_wav(self, filename): + # resampling may occur + wav, _ = librosa.load(filename, sr=self.sample_rate) + return wav - # mel transform related - self.mel_fmin = mel_fmin - self.mel_fmax = mel_fmax + def write_wav(self, path, wav): + sf.write(path, wav, samplerate=self.sample_rate) - self.max_norm = 1.0 if max_norm is None else float(max_norm) - self.clip_norm = clip_norm - self.do_trim_silence = do_trim_silence - - self.sound_norm = sound_norm - self.num_freq, self.frame_length_ms, self.frame_shift_ms = self._stft_parameters( - ) - - def _stft_parameters(self): - """compute frame length and hop length in ms""" - frame_length_ms = self.win_length * 1. / self.sample_rate - frame_shift_ms = self.hop_length * 1. 
/ self.sample_rate - num_freq = 1 + self.n_fft // 2 - return num_freq, frame_length_ms, frame_shift_ms - - def __repr__(self): - """object repr""" - cls_name_str = self.__class__.__name__ - members = vars(self) - dict_str = "\n".join( - [" {}: {},".format(k, v) for k, v in members.items()]) - repr_str = "{}(\n{})\n".format(cls_name_str, dict_str) - return repr_str - - def save_wav(self, path, wav): - """save audio with scipy.io.wavfile in 16bit integers""" - wav_norm = wav * (32767 / max(0.01, np.max(np.abs(wav)))) - scipy.io.wavfile.write(path, self.sample_rate, - wav_norm.as_type(np.int16)) - - def load_wav(self, path, sr=None): - """load wav -> trim_silence -> rescale""" - - x, sr = librosa.load(path, sr=None) - assert self.sample_rate == sr, "audio sample rate: {}Hz != processor sample rate: {}Hz".format( - sr, self.sample_rate) - if self.do_trim_silence: - try: - x = self.trim_silence(x) - except ValueError: - print(" [!] File cannot be trimmed for silence - {}".format( - path)) - if self.sound_norm: - x = x / x.max() * 0.9 # why 0.9 ? - return x - - def trim_silence(self, wav): - """Trim soilent parts with a threshold and 0.01s margin""" - margin = int(self.sample_rate * 0.01) - wav = wav[margin:-margin] - trimed_wav = librosa.effects.trim( + def stft(self, wav): + D = librosa.core.stft( wav, - top_db=60, - frame_length=self.win_length, - hop_length=self.hop_length)[0] - return trimed_wav - - def apply_preemphasis(self, x): - if self.preemphasis == 0.: - raise RuntimeError( - " !! Preemphasis coefficient should be positive. ") - return scipy.signal.lfilter([1., -self.preemphasis], [1.], x) - - def apply_inv_preemphasis(self, x): - if self.preemphasis == 0.: - raise RuntimeError( - " !! Preemphasis coefficient should be positive. ") - return scipy.signal.lfilter([1.], [1., -self.preemphasis], x) - - def _amplitude_to_db(self, x): - amplitude_min = np.exp(self.min_level_db / 20 * np.log(10)) - return 20 * np.log10(np.maximum(amplitude_min, x)) - - @staticmethod - def _db_to_amplitude(x): - return np.power(10., 0.05 * x) - - def _linear_to_mel(self, spectrogram): - _mel_basis = self._build_mel_basis() - return np.dot(_mel_basis, spectrogram) - - def _mel_to_linear(self, mel_spectrogram): - inv_mel_basis = np.linalg.pinv(self._build_mel_basis()) - return np.maximum(1e-10, np.dot(inv_mel_basis, mel_spectrogram)) - - def _build_mel_basis(self): - """return mel basis for mel scale""" - if self.mel_fmax is not None: - assert self.mel_fmax <= self.sample_rate // 2 - return librosa.filters.mel(self.sample_rate, - self.n_fft, - n_mels=self.num_mels, - fmin=self.mel_fmin, - fmax=self.mel_fmax) - - def _normalize(self, S): - """put values in [0, self.max_norm] or [-self.max_norm, self,max_norm]""" - if self.signal_norm: - S_norm = (S - self.min_level_db) / (-self.min_level_db) - if self.symmetric_norm: - S_norm = ((2 * self.max_norm) * S_norm) - self.max_norm - if self.clip_norm: - S_norm = np.clip(S_norm, -self.max_norm, self.max_norm) - return S_norm - else: - S_norm = self.max_norm * S_norm - if self.clip_norm: - S_norm = np.clip(S_norm, 0, self.max_norm) - return S_norm - else: - return S - - def _denormalize(self, S): - """denormalize values""" - S_denorm = S - if self.signal_norm: - if self.symmetric_norm: - if self.clip_norm: - S_denorm = np.clip(S_denorm, -self.max_norm, self.max_norm) - S_denorm = (S_denorm + self.max_norm) * ( - -self.min_level_db) / (2 * self.max_norm - ) + self.min_level_db - return S_denorm - else: - if self.clip_norm: - S_denorm = np.clip(S_denorm, 0, self.max_norm) 
- S_denorm = S_denorm * (-self.min_level_db - ) / self.max_norm + self.min_level_db - return S_denorm - else: - return S - - def _stft(self, y): - return librosa.stft( - y=y, - n_fft=self.n_fft, + n_fft = self.n_fft, + hop_length=self.hop_length, win_length=self.win_length, - hop_length=self.hop_length) + window=self.window, + center=self.center, + pad_mode=self.pad_mode) + return D - def _istft(self, S): - return librosa.istft( - S, hop_length=self.hop_length, win_length=self.win_length) + def istft(self, D): + wav = librosa.core.istft( + D, + hop_length=self.hop_length, + win_length=self.win_length, + window=self.window, + center=self.center) + return wav - def spectrogram(self, y): - """compute linear spectrogram(amplitude) - preemphasis -> stft -> mag -> amplitude_to_db -> minus_ref_level_db -> normalize - """ - if self.preemphasis: - D = self._stft(self.apply_preemphasis(y)) - else: - D = self._stft(y) - S = self._amplitude_to_db(np.abs(D)) - self.ref_level_db - return self._normalize(S) + def spectrogram(self, wav): + D = self.stft(wav) + return np.abs(D) - def melspectrogram(self, y): - """compute linear spectrogram(amplitude) - preemphasis -> stft -> mag -> mel_scale -> amplitude_to_db -> minus_ref_level_db -> normalize - """ - if self.preemphasis: - D = self._stft(self.apply_preemphasis(y)) - else: - D = self._stft(y) - S = self._amplitude_to_db(self._linear_to_mel(np.abs( - D))) - self.ref_level_db - return self._normalize(S) - - def inv_spectrogram(self, spectrogram): - """convert spectrogram back to waveform using griffin_lim in librosa""" - S = self._denormalize(spectrogram) - S = self._db_to_amplitude(S + self.ref_level_db) - if self.preemphasis: - return self.apply_inv_preemphasis(self._griffin_lim(S**self.power)) - return self._griffin_lim(S**self.power) - - def inv_melspectrogram(self, mel_spectrogram): - S = self._denormalize(mel_spectrogram) - S = self._db_to_amplitude(S + self.ref_level_db) - S = self._mel_to_linear(np.abs(S)) - if self.preemphasis: - return self.apply_inv_preemphasis(self._griffin_lim(S**self.power)) - return self._griffin_lim(S**self.power) - - def out_linear_to_mel(self, linear_spec): - """convert output linear spec to mel spec""" - S = self._denormalize(linear_spec) - S = self._db_to_amplitude(S + self.ref_level_db) - S = self._linear_to_mel(np.abs(S)) - S = self._amplitude_to_db(S) - self.ref_level_db - mel = self._normalize(S) + def mel_spectrogram(self, wav): + S = self.spectrogram(wav) + mel = np.dot(self.mel_filter, S) return mel - - def _griffin_lim(self, S): - angles = np.exp(2j * np.pi * np.random.rand(*S.shape)) - S_complex = np.abs(S).astype(np.complex) - y = self._istft(S_complex * angles) - for _ in range(self.griffin_lim_iters): - angles = np.exp(1j * np.angle(self._stft(y))) - y = self._istft(S_complex * angles) - return y - - @staticmethod - def mulaw_encode(wav, qc): - mu = 2**qc - 1 - # wav_abs = np.minimum(np.abs(wav), 1.0) - signal = np.sign(wav) * np.log(1 + mu * np.abs(wav)) / np.log(1. + mu) - # Quantize signal to the specified number of levels. - signal = (signal + 1) / 2 * mu + 0.5 - return np.floor(signal, ) - - @staticmethod - def mulaw_decode(wav, qc): - """Recovers waveform from quantized values.""" - mu = 2**qc - 1 - x = np.sign(wav) / mu * ((1 + mu)**np.abs(wav) - 1) - return x - - @staticmethod - def encode_16bits(x): - return np.clip(x * 2**15, -2**15, 2**15 - 1).astype(np.int16) - - @staticmethod - def quantize(x, bits): - return (x + 1.) 
* (2**bits - 1) / 2 - - @staticmethod - def dequantize(x, bits): - return 2 * x / (2**bits - 1) - 1 diff --git a/parakeet/audio/spec_normalizer.py b/parakeet/audio/spec_normalizer.py new file mode 100644 index 0000000..70341a5 --- /dev/null +++ b/parakeet/audio/spec_normalizer.py @@ -0,0 +1,56 @@ + +""" +This modules contains normalizers for spectrogram magnitude. +Normalizers are invertible transformations. They can be used to process +magnitude of spectrogram before training and can also be used to recover from +the generated spectrogram so as to be used with vocoders like griffin lim. + +The base class describe the interface. `transform` is used to perform +transformation and `inverse` is used to perform the inverse transformation. + +check issues: +https://github.com/mozilla/TTS/issues/377 +""" +import numpy as np + +class NormalizerBase(object): + def transform(self, spec): + raise NotImplementedError("transform must be implemented") + + def inverse(self, normalized): + raise NotImplementedError("inverse must be implemented") + +class LogMagnitude(NormalizerBase): + """ + This is a simple normalizer used in Waveglow, Waveflow, tacotron2... + """ + def __init__(self, min=1e-7): + self.min = min + + def transform(self, x): + x = np.maximum(x, self.min) + x = np.log(x) + return x + + def inverse(self, x): + return np.exp(x) + + +class UnitMagnitude(NormalizerBase): + # dbscale and (0, 1) normalization + """ + This is the normalizer used in the + """ + def __init__(self, min=1e-5): + self.min = min + + def transform(self, x): + db_scale = 20 * np.log10(np.maximum(self.min, x)) - 20 + normalized = (db_scale + 100) / 100 + clipped = np.clip(normalized, 0, 1) + return clipped + + def inverse(self, x): + denormalized = np.clip(x, 0, 1) * 100 - 100 + out = np.exp((denormalized + 20) / 20 * np.log(10)) + return out diff --git a/parakeet/data/.vscode/settings.json b/parakeet/data/.vscode/settings.json deleted file mode 100644 index 77b9721..0000000 --- a/parakeet/data/.vscode/settings.json +++ /dev/null @@ -1,3 +0,0 @@ -{ - "python.pythonPath": "/Users/chenfeiyu/miniconda3/envs/paddle/bin/python" -} \ No newline at end of file diff --git a/parakeet/data/__init__.py b/parakeet/data/__init__.py index be28f11..3114058 100644 --- a/parakeet/data/__init__.py +++ b/parakeet/data/__init__.py @@ -13,6 +13,5 @@ # limitations under the License. from .dataset import * -from .datacargo import * from .sampler import * from .batch import * diff --git a/parakeet/data/batch.py b/parakeet/data/batch.py index 355e570..a5be9f7 100644 --- a/parakeet/data/batch.py +++ b/parakeet/data/batch.py @@ -75,19 +75,16 @@ def batch_wav(minibatch, pad_value=0., dtype=np.float32): """pad audios to the largest length and batch them. Args: - minibatch (List[np.ndarray]): list of rank-1 float arrays(mono-channel audio, shape(T,)) or list of rank-2 float arrays(multi-channel audio, shape(C, T), C stands for numer of channels, T stands for length), dtype float. + minibatch (List[np.ndarray]): list of rank-1 float arrays(mono-channel audio, shape(T,)), dtype float. pad_value (float, optional): the pad value. Defaults to 0.. dtype (np.dtype, optional): the data type of the output. Defaults to np.float32. Returns: - np.ndarray: the output batch. It is a rank-2 float array of shape(B, T) if the minibatch is a list of mono-channel audios, or a rank-3 float array of shape(B, C, T) if the minibatch is a list of multi-channel audios. + np.ndarray: shape(B, T), the output batch. 
""" peek_example = minibatch[0] - if len(peek_example.shape) == 1: - mono_channel = True - elif len(peek_example.shape) == 2: - mono_channel = False + assert len(peek_example.shape) == 1, "we only handles mono-channel wav" # assume (channel, n_samples) or (n_samples, ) lengths = [example.shape[-1] for example in minibatch] @@ -96,33 +93,27 @@ def batch_wav(minibatch, pad_value=0., dtype=np.float32): batch = [] for example in minibatch: pad_len = max_len - example.shape[-1] - if mono_channel: - batch.append( - np.pad(example, [(0, pad_len)], - mode='constant', - constant_values=pad_value)) - else: - batch.append( - np.pad(example, [(0, 0), (0, pad_len)], - mode='constant', - constant_values=pad_value)) - + batch.append( + np.pad(example, [(0, pad_len)], + mode='constant', + constant_values=pad_value)) return np.array(batch, dtype=dtype) class SpecBatcher(object): """A wrapper class for `batch_spec`""" - def __init__(self, pad_value=0., dtype=np.float32): + def __init__(self, pad_value=0., time_major=False, dtype=np.float32): self.pad_value = pad_value self.dtype = dtype + self.time_major = time_major def __call__(self, minibatch): - out = batch_spec(minibatch, pad_value=self.pad_value, dtype=self.dtype) + out = batch_spec(minibatch, pad_value=self.pad_value, time_major=self.time_major, dtype=self.dtype) return out -def batch_spec(minibatch, pad_value=0., dtype=np.float32): +def batch_spec(minibatch, pad_value=0., time_major=False, dtype=np.float32): """Pad spectra to the largest length and batch them. Args: @@ -131,31 +122,28 @@ def batch_spec(minibatch, pad_value=0., dtype=np.float32): dtype (np.dtype, optional): data type of the output. Defaults to np.float32. Returns: - np.ndarray: a rank-3 array of shape(B, F, T) when the minibatch is a list of mono-channel spectrograms, or a rank-4 array of shape(B, C, F, T) when the minibatch is a list of multi-channel spectorgrams. + np.ndarray: a rank-3 array of shape(B, F, T) or (B, T, F). """ - # assume (F, T) or (C, F, T) + # assume (F, T) or (T, F) peek_example = minibatch[0] - if len(peek_example.shape) == 2: - mono_channel = True - elif len(peek_example.shape) == 3: - mono_channel = False + assert len(peek_example.shape) == 2, "we only handles mono channel spectrogram" - # assume (channel, F, n_frame) or (F, n_frame) - lengths = [example.shape[-1] for example in minibatch] + # assume (F, n_frame) or (n_frame, F) + time_idx = 0 if time_major else -1 + lengths = [example.shape[time_idx] for example in minibatch] max_len = np.max(lengths) batch = [] for example in minibatch: - pad_len = max_len - example.shape[-1] - if mono_channel: + pad_len = max_len - example.shape[time_idx] + if time_major: batch.append( - np.pad(example, [(0, 0), (0, pad_len)], - mode='constant', - constant_values=pad_value)) + np.pad(example, [(0, pad_len), (0, 0)], + mode='constant', + constant_values=pad_value)) else: batch.append( - np.pad(example, [(0, 0), (0, 0), (0, pad_len)], - mode='constant', - constant_values=pad_value)) - + np.pad(example, [(0, 0), (0, pad_len)], + mode='constant', + constant_values=pad_value)) return np.array(batch, dtype=dtype) diff --git a/parakeet/data/datacargo.py b/parakeet/data/datacargo.py deleted file mode 100644 index a88829c..0000000 --- a/parakeet/data/datacargo.py +++ /dev/null @@ -1,126 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import six -from .sampler import SequentialSampler, RandomSampler, BatchSampler - - -class DataCargo(object): - def __init__(self, - dataset, - batch_fn=None, - batch_size=1, - sampler=None, - shuffle=False, - batch_sampler=None, - drop_last=False): - """An Iterable object of batches. It requires a dataset, a batch function and a sampler. The sampler yields the example ids, then the corresponding examples in the dataset are collected and transformed into a batch with the batch function. - - Args: - dataset (Dataset): the dataset used to build a data cargo. - batch_fn (callable, optional): a callable that takes a list of examples of `dataset` and return a batch, it can be None if the dataset has a `_batch_examples` method which satisfy the requirement. Defaults to None. - batch_size (int, optional): number of examples in a batch. Defaults to 1. - sampler (Sampler, optional): an iterable of example ids(intergers), the example ids are used to pick examples. Defaults to None. - shuffle (bool, optional): when sampler is not provided, shuffle = True creates a RandomSampler and shuffle=False creates a SequentialSampler internally. Defaults to False. - batch_sampler (BatchSampler, optional): an iterable of lists of example ids(intergers), the list is used to pick examples, `batch_sampler` option is mutually exclusive with `batch_size`, `shuffle`, `sampler`, and `drop_last`. Defaults to None. - drop_last (bool, optional): whether to drop the last minibatch. Defaults to False. - """ - self.dataset = dataset - self.batch_fn = batch_fn or self.dataset._batch_examples - - if batch_sampler is not None: - # auto_collation with custom batch_sampler - if batch_size != 1 or shuffle or sampler is not None or drop_last: - raise ValueError('batch_sampler option is mutually exclusive ' - 'with batch_size, shuffle, sampler, and ' - 'drop_last') - batch_size = None - drop_last = False - shuffle = False - elif batch_size is None: - raise ValueError( - 'batch sampler is none. then batch size must not be none.') - elif sampler is None: - if shuffle: - sampler = RandomSampler(dataset) - else: - sampler = SequentialSampler(dataset) - batch_sampler = BatchSampler(sampler, batch_size, drop_last) - else: - batch_sampler = BatchSampler(sampler, batch_size, drop_last) - - self.batch_size = batch_size - self.drop_last = drop_last - self.sampler = sampler - - self.batch_sampler = batch_sampler - - def __iter__(self): - return DataIterator(self) - - def __call__(self): - # protocol for paddle's DataLoader - return DataIterator(self) - - @property - def _auto_collation(self): - # use auto batching - return self.batch_sampler is not None - - @property - def _index_sampler(self): - if self._auto_collation: - return self.batch_sampler - else: - return self.sampler - - def __len__(self): - return len(self._index_sampler) - - -class DataIterator(object): - def __init__(self, loader): - """Iterator object of DataCargo. - - Args: - loader (DataCargo): the data cargo to iterate. 
- """ - self.loader = loader - self._dataset = loader.dataset - - self._batch_fn = loader.batch_fn - self._index_sampler = loader._index_sampler - self._sampler_iter = iter(self._index_sampler) - - def __iter__(self): - return self - - def __next__(self): - # TODO(chenfeiyu): use dynamic batch size - index = self._next_index() - minibatch = [self._dataset[i] for i in index] - minibatch = self._batch_fn(minibatch) # list[Example] -> Batch - return minibatch - - next = __next__ # Python 2 compatibility - - def _next_index(self): - if six.PY3: - return next(self._sampler_iter) - else: - # six.PY2 - return self._sampler_iter.next() - - def __len__(self): - return len(self._index_sampler) diff --git a/parakeet/data/dataset.py b/parakeet/data/dataset.py index 87ef393..adbc37b 100644 --- a/parakeet/data/dataset.py +++ b/parakeet/data/dataset.py @@ -13,62 +13,22 @@ # limitations under the License. import six -import numpy as np -from tqdm import tqdm +import paddle +from paddle.io import Dataset -class DatasetMixin(object): - """Standard indexing interface for dataset. Inherit this class to - get the indexing interface. Since it is a mixin class which does - not have an `__init__` class, the subclass not need to call - `super().__init__()`. - """ +def split(dataset, first_size): + """A utility function to split a dataset into two datasets.""" + first = SliceDataset(dataset, 0, first_size) + second = SliceDataset(dataset, first_size, len(dataset)) + return first, second - def __getitem__(self, index): - """Standard indexing interface for dataset. - - Args: - index (slice, list[int], np.array or int): the index. if can be int, slice, list of integers, or ndarray of integers. It calls `get_example` to pick an example. - - Returns: - Example, or List[Example]: If `index` is an interger, it returns an - example. If `index` is a slice, a list of intergers or an array of intergers, - it returns a list of examples. - """ - if isinstance(index, slice): - start, stop, step = index.indices(len(self)) - return [ - self.get_example(i) for i in six.moves.range(start, stop, step) - ] - elif isinstance(index, (list, np.ndarray)): - return [self.get_example(i) for i in index] - else: - # assumes it an integer - return self.get_example(index) - - def get_example(self, i): - """Get an example from the dataset. Custom datasets should have - this method implemented. - - Args: - i (int): example index. - """ - raise NotImplementedError - - def __len__(self): - raise NotImplementedError - - def __iter__(self): - for i in range(len(self)): - yield self.get_example(i) - - -class TransformDataset(DatasetMixin): +class TransformDataset(Dataset): def __init__(self, dataset, transform): """Dataset which is transformed from another with a transform. Args: - dataset (DatasetMixin): the base dataset. + dataset (Dataset): the base dataset. transform (callable): the transform which takes an example of the base dataset as parameter and return a new example. """ self._dataset = dataset @@ -77,17 +37,17 @@ class TransformDataset(DatasetMixin): def __len__(self): return len(self._dataset) - def get_example(self, i): + def __getitem__(self, i): in_data = self._dataset[i] return self._transform(in_data) -class CacheDataset(DatasetMixin): +class CacheDataset(Dataset): def __init__(self, dataset): """A lazy cache of the base dataset. Args: - dataset (DatasetMixin): the base dataset to cache. + dataset (Dataset): the base dataset to cache. 
""" self._dataset = dataset self._cache = dict() @@ -95,24 +55,24 @@ class CacheDataset(DatasetMixin): def __len__(self): return len(self._dataset) - def get_example(self, i): + def __getitem__(self, i): if not i in self._cache: self._cache[i] = self._dataset[i] return self._cache[i] -class TupleDataset(object): +class TupleDataset(Dataset): def __init__(self, *datasets): """A compound dataset made from several datasets of the same length. An example of the `TupleDataset` is a tuple of examples from the constituent datasets. Args: - datasets: tuple[DatasetMixin], the constituent datasets. + datasets: tuple[Dataset], the constituent datasets. """ if not datasets: raise ValueError("no datasets are given") length = len(datasets[0]) for i, dataset in enumerate(datasets): - if len(datasets) != length: + if len(dataset) != length: raise ValueError( "all the datasets should have the same length." "dataset {} has a different length".format(i)) @@ -136,12 +96,20 @@ class TupleDataset(object): return self._length -class DictDataset(object): +class DictDataset(Dataset): def __init__(self, **datasets): - """A compound dataset made from several datasets of the same length. An example of the `DictDataset` is a dict of examples from the constituent datasets. + """ + A compound dataset made from several datasets of the same length. An + example of the `DictDataset` is a dict of examples from the constituent + datasets. + + WARNING: paddle does not have a good support for DictDataset, because + every batch yield from a DataLoader is a list, but it cannot be a dict. + So you have to provide a collate function because you cannot use the + default one. Args: - datasets: Dict[DatasetMixin], the constituent datasets. + datasets: Dict[Dataset], the constituent datasets. """ if not datasets: raise ValueError("no datasets are given") @@ -149,7 +117,7 @@ class DictDataset(object): for key, dataset in six.iteritems(datasets): if length is None: length = len(dataset) - elif len(datasets) != length: + elif len(dataset) != length: raise ValueError( "all the datasets should have the same length." "dataset {} has a different length".format(key)) @@ -168,14 +136,17 @@ class DictDataset(object): for i in six.moves.range(length)] else: return batches + + def __len__(self): + return self._length -class SliceDataset(DatasetMixin): +class SliceDataset(Dataset): def __init__(self, dataset, start, finish, order=None): """A Dataset which is a slice of the base dataset. Args: - dataset (DatasetMixin): the base dataset. + dataset (Dataset): the base dataset. start (int): the start of the slice. finish (int): the end of the slice, not inclusive. order (List[int], optional): the order, it is a permutation of the valid example ids of the base dataset. If `order` is provided, the slice is taken in `order`. Defaults to None. @@ -197,7 +168,7 @@ class SliceDataset(DatasetMixin): def __len__(self): return self._size - def get_example(self, i): + def __getitem__(self, i): if i >= 0: if i >= self._size: raise IndexError('dataset index out of range') @@ -212,12 +183,12 @@ class SliceDataset(DatasetMixin): return self._dataset[index] -class SubsetDataset(DatasetMixin): +class SubsetDataset(Dataset): def __init__(self, dataset, indices): """A Dataset which is a subset of the base dataset. Args: - dataset (DatasetMixin): the base dataset. + dataset (Dataset): the base dataset. indices (Iterable[int]): the indices of the examples to pick. 
""" self._dataset = dataset @@ -229,17 +200,17 @@ class SubsetDataset(DatasetMixin): def __len__(self): return self._size - def get_example(self, i): + def __getitem__(self, i): index = self._indices[i] return self._dataset[index] -class FilterDataset(DatasetMixin): +class FilterDataset(Dataset): def __init__(self, dataset, filter_fn): """A filtered dataset. Args: - dataset (DatasetMixin): the base dataset. + dataset (Dataset): the base dataset. filter_fn (callable): a callable which takes an example of the base dataset and return a boolean. """ self._dataset = dataset @@ -251,24 +222,24 @@ class FilterDataset(DatasetMixin): def __len__(self): return self._size - def get_example(self, i): + def __getitem__(self, i): index = self._indices[i] return self._dataset[index] -class ChainDataset(DatasetMixin): +class ChainDataset(Dataset): def __init__(self, *datasets): """A concatenation of the several datasets which the same structure. Args: - datasets (Iterable[DatasetMixin]): datasets to concat. + datasets (Iterable[Dataset]): datasets to concat. """ self._datasets = datasets def __len__(self): return sum(len(dataset) for dataset in self._datasets) - def get_example(self, i): + def __getitem__(self, i): if i < 0: raise IndexError("ChainDataset doesnot support negative indexing.") diff --git a/parakeet/data/sampler.py b/parakeet/data/sampler.py index df2ff7a..2b56d8d 100644 --- a/parakeet/data/sampler.py +++ b/parakeet/data/sampler.py @@ -21,95 +21,8 @@ So the sampler is only responsible for generating valid indices. import numpy as np import random - - -class Sampler(object): - def __iter__(self): - # return a iterator of indices - # or a iterator of list[int], for BatchSampler - raise NotImplementedError - - -class SequentialSampler(Sampler): - def __init__(self, data_source): - """Sequential sampler, the simplest sampler that samples indices from 0 to N - 1, where N is the dataset is length. - - Args: - data_source (DatasetMixin): the dataset. This is used to get the dataset's length. - """ - self.data_source = data_source - - def __iter__(self): - return iter(range(len(self.data_source))) - - def __len__(self): - return len(self.data_source) - - -class RandomSampler(Sampler): - def __init__(self, data_source, replacement=False, num_samples=None): - """Random sampler. - - Args: - data_source (DatasetMixin): the dataset. This is used to get the dataset's length. - replacement (bool, optional): whether replacement is enabled in sampling. When `replacement` is True, `num_samples` must be provided. Defaults to False. - num_samples (int, optional): numbers of indices to draw. This option should only be provided when replacement is True. Defaults to None. 
- """ - self.data_source = data_source - self.replacement = replacement - self._num_samples = num_samples - - if not isinstance(self.replacement, bool): - raise ValueError("replacement should be a boolean value, but got " - "replacement={}".format(self.replacement)) - - if self._num_samples is not None and not replacement: - raise ValueError( - "With replacement=False, num_samples should not be specified, " - "since a random permutation will be performed.") - - if not isinstance(self.num_samples, int) or self.num_samples <= 0: - raise ValueError("num_samples should be a positive integer " - "value, but got num_samples={}".format( - self.num_samples)) - - @property - def num_samples(self): - if self._num_samples is None: - return len(self.data_source) - return self._num_samples - - def __iter__(self): - n = len(self.data_source) - if self.replacement: - return iter( - np.random.randint( - 0, n, size=(self.num_samples, ), dtype=np.int64).tolist()) - return iter(np.random.permutation(n).tolist()) - - def __len__(self): - return self.num_samples - - -class SubsetRandomSampler(Sampler): - """Samples elements randomly from a given list of indices, without replacement. - Arguments: - indices (sequence): a sequence of indices - """ - - def __init__(self, indices): - """ - Args: - indices (List[int]): indices to sample from. - """ - self.indices = indices - - def __iter__(self): - return (self.indices[i] - for i in np.random.permutation(len(self.indices))) - - def __len__(self): - return len(self.indices) +import paddle +from paddle.io import Sampler class PartialyRandomizedSimilarTimeLengthSampler(Sampler): @@ -285,92 +198,3 @@ class WeightedRandomSampler(Sampler): def __len__(self): return self.num_samples - - -class DistributedSampler(Sampler): - def __init__(self, dataset_size, num_trainers, rank, shuffle=True): - """Sampler used for data parallel training. Indices are divided into num_trainers parts. Each trainer gets a subset and iter that subset. If the dataset has 16 examples, and there are 4 trainers. - - Trainer 0 gets [0, 4, 8, 12]; - Trainer 1 gets [1, 5, 9, 13]; - Trainer 2 gets [2, 6, 10, 14]; - trainer 3 gets [3, 7, 11, 15]. - - It ensures that trainer get different parts of the dataset. If dataset's length cannot be perfectly devidef by num_trainers, some examples appended to the dataset, to ensures that every trainer gets the same amounts of examples. - - Args: - dataset_size (int): the length of the dataset. - num_trainers (int): number of trainers(training processes). - rank (int): local rank of the trainer. - shuffle (bool, optional): whether to shuffle the indices before iteration. Defaults to True. - """ - self.dataset_size = dataset_size - self.num_trainers = num_trainers - self.rank = rank - self.num_samples = int(np.ceil(dataset_size / num_trainers)) - self.total_size = self.num_samples * num_trainers - assert self.total_size >= self.dataset_size - self.shuffle = shuffle - - def __iter__(self): - indices = list(range(self.dataset_size)) - if self.shuffle: - random.shuffle(indices) - - # Append extra samples to make it evenly distributed on all trainers. - indices += indices[:(self.total_size - self.dataset_size)] - assert len(indices) == self.total_size - - # Subset samples for each trainer. 
- indices = indices[self.rank:self.total_size:self.num_trainers] - assert len(indices) == self.num_samples - - return iter(indices) - - def __len__(self): - return self.num_samples - - -class BatchSampler(Sampler): - """Wraps another sampler to yield a mini-batch of indices.""" - - def __init__(self, sampler, batch_size, drop_last): - """ - Args: - sampler (Sampler): Base sampler. - batch_size (int): Size of mini-batch. - drop_last (bool): If True, the sampler will drop the last batch if its size is less than batch_size. - Example: - >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False)) - [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]] - >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True)) - [[0, 1, 2], [3, 4, 5], [6, 7, 8]] - """ - if not isinstance(sampler, Sampler): - raise ValueError("sampler should be an instance of " - "Sampler, but got sampler={}".format(sampler)) - if not isinstance(batch_size, int) or batch_size <= 0: - raise ValueError("batch_size should be a positive integer value, " - "but got batch_size={}".format(batch_size)) - if not isinstance(drop_last, bool): - raise ValueError("drop_last should be a boolean value, but got " - "drop_last={}".format(drop_last)) - self.sampler = sampler - self.batch_size = batch_size - self.drop_last = drop_last - - def __iter__(self): - batch = [] - for idx in self.sampler: - batch.append(idx) - if len(batch) == self.batch_size: - yield batch - batch = [] - if len(batch) > 0 and not self.drop_last: - yield batch - - def __len__(self): - if self.drop_last: - return len(self.sampler) // self.batch_size - else: - return (len(self.sampler) + self.batch_size - 1) // self.batch_size diff --git a/parakeet/datasets/README.md b/parakeet/datasets/README.md deleted file mode 100644 index cd4f8f4..0000000 --- a/parakeet/datasets/README.md +++ /dev/null @@ -1,17 +0,0 @@ -# The Design of Dataset in Parakeet - -## data & metadata -A Dataset in Parakeet is basically a list of Records (or examples, instances if you prefer this glossary.) By being a list, we mean it can be indexed by `__getitem__`, and we can get the size of the dataset by `__len__`. - -This might mean we should have load the whole dataset before hand. But in practice, we do not do this due to time, computation and memory of storage limits. We actually load some metadata instead, which gives us the size of the dataset, and metadata of each record. In this case, the metadata itself is a small dataset which helps us to load a larger dataset. We made `_load_metadata` a method for all datasets. - -In most cases, metadata is provided with the data. So we can load it trivially. But in other cases, we need to scan the whole dataset to get metadata. For example, the length of the the sentences, the vocabuary or the statistics of the dataset, etc. In these cases, we'd betetr save the metadata, so we do not need to generate them again and again. When implementing a dataset, we do these work in `_prepare_metadata`. - -In our initial cases, record is implemented as a tuple for simplicity. Actually, it can be implemented as a dict or namespace. - -## preprocessing & batching -One of the reasons we choose to load data lazily (only load metadata before hand, and load data only when needed) is computation overhead. For large dataset with complicated preprocessing, it may take several days to preprocess them. So we choose to preprocess it lazily. In practice, we implement preprocessing in `_get_example` which is called by `__getitem__`. 
This method preprocess only one record. - -For deep learning practice, we typically batch examples. So the dataset should comes with a method to batch examples. Assuming the record is implemented as a tuple with several items. When an item is represented as a fix-sized array, to batch them is trivial, just `np.stack` suffices. But for array with dynamic size, padding is needed. We decide to implement a batching method for each item. Then batching a record can be implemented by these methods. For a dataset, a `_batch_examples` should be implemented. But in most cases, you can choose one from `batching.py`. - -That is it! diff --git a/parakeet/datasets/__init__.py b/parakeet/datasets/__init__.py index abf198b..de7be70 100644 --- a/parakeet/datasets/__init__.py +++ b/parakeet/datasets/__init__.py @@ -1,13 +1,2 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. +from parakeet.datasets.common import * +from parakeet.datasets.ljspeech import * \ No newline at end of file diff --git a/parakeet/datasets/common.py b/parakeet/datasets/common.py new file mode 100644 index 0000000..f923086 --- /dev/null +++ b/parakeet/datasets/common.py @@ -0,0 +1,21 @@ +from paddle.io import Dataset +import os +import librosa + +class AudioFolderDataset(Dataset): + def __init__(self, path, sample_rate, extension="wav"): + self.root = os.path.expanduser(path) + self.sample_rate = sample_rate + self.extension = extension + self.file_names = [ + os.path.join(self.root, x) for x in os.listdir(self.root) \ + if os.path.splitext(x)[-1] == self.extension] + self.length = len(self.file_names) + + def __len__(self): + return self.length + + def __getitem__(self, i): + file_name = self.file_names[i] + y, _ = librosa.load(file_name, sr=self.sample_rate) # pylint: disable=unused-variable + return y diff --git a/parakeet/datasets/ljspeech.py b/parakeet/datasets/ljspeech.py index 3ab8ac9..7011063 100644 --- a/parakeet/datasets/ljspeech.py +++ b/parakeet/datasets/ljspeech.py @@ -1,101 +1,23 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. +from paddle.io import Dataset +from pathlib import Path -import os -import numpy as np -import pandas as pd -import librosa -from .. 
import g2p - -from ..data.sampler import SequentialSampler, RandomSampler, BatchSampler -from ..data.dataset import DatasetMixin -from ..data.datacargo import DataCargo -from ..data.batch import TextIDBatcher, SpecBatcher - - -class LJSpeech(DatasetMixin): +class LJSpeechMetaData(Dataset): def __init__(self, root): - super(LJSpeech, self).__init__() - self.root = root - self.metadata = self._prepare_metadata() + self.root = Path(root).expanduser() + wav_dir = self.root / "wavs" + csv_path = self.root / "metadata.csv" + records = [] + speaker_name = "ljspeech" + with open(str(csv_path), 'rt') as f: + for line in f: + filename, _, normalized_text = line.strip().split("|") + filename = str(wav_dir / (filename + ".wav")) + records.append([filename, normalized_text, speaker_name]) + self.records = records - def _prepare_metadata(self): - csv_path = os.path.join(self.root, "metadata.csv") - metadata = pd.read_csv( - csv_path, - sep="|", - header=None, - quoting=3, - names=["fname", "raw_text", "normalized_text"]) - return metadata - - def _get_example(self, metadatum): - """All the code for generating an Example from a metadatum. If you want a - different preprocessing pipeline, you can override this method. - This method may require several processor, each of which has a lot of options. - In this case, you'd better pass a composed transform and pass it to the init - method. - """ - - fname, raw_text, normalized_text = metadatum - wav_path = os.path.join(self.root, "wavs", fname + ".wav") - - # load -> trim -> preemphasis -> stft -> magnitude -> mel_scale -> logscale -> normalize - wav, sample_rate = librosa.load( - wav_path, - sr=None) # we would rather use functor to hold its parameters - trimed, _ = librosa.effects.trim(wav) - preemphasized = librosa.effects.preemphasis(trimed) - D = librosa.stft(preemphasized) - mag, phase = librosa.magphase(D) - mel = librosa.feature.melspectrogram(S=mag) - - mag = librosa.amplitude_to_db(S=mag) - mel = librosa.amplitude_to_db(S=mel) - - ref_db = 20 - max_db = 100 - mel = np.clip((mel - ref_db + max_db) / max_db, 1e-8, 1) - mel = np.clip((mag - ref_db + max_db) / max_db, 1e-8, 1) - - phonemes = np.array( - g2p.en.text_to_sequence(normalized_text), dtype=np.int64) - return (mag, mel, phonemes - ) # maybe we need to implement it as a map in the future - - def _batch_examples(self, minibatch): - mag_batch = [] - mel_batch = [] - phoneme_batch = [] - for example in minibatch: - mag, mel, phoneme = example - mag_batch.append(mag) - mel_batch.append(mel) - phoneme_batch.append(phoneme) - mag_batch = SpecBatcher(pad_value=0.)(mag_batch) - mel_batch = SpecBatcher(pad_value=0.)(mel_batch) - phoneme_batch = TextIDBatcher(pad_id=0)(phoneme_batch) - return (mag_batch, mel_batch, phoneme_batch) - - def __getitem__(self, index): - metadatum = self.metadata.iloc[index] - example = self._get_example(metadatum) - return example - - def __iter__(self): - for i in range(len(self)): - yield self[i] + def __getitem__(self, i): + return self.records[i] def __len__(self): - return len(self.metadata) + return len(self.records) + diff --git a/parakeet/datasets/vctk.py b/parakeet/datasets/vctk.py deleted file mode 100644 index 66e4f70..0000000 --- a/parakeet/datasets/vctk.py +++ /dev/null @@ -1,99 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
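# A small sketch of the two dataset classes added above. The paths are
# placeholders; LJSpeechMetaData expects the standard LJSpeech-1.1 layout
# (a wavs/ folder plus metadata.csv). Note that AudioFolderDataset filters
# files with os.path.splitext, which keeps the leading dot, so the extension
# argument should include it ("." + suffix) to match anything.
from parakeet.datasets import LJSpeechMetaData, AudioFolderDataset

meta = LJSpeechMetaData("~/datasets/LJSpeech-1.1")
wav_path, normalized_text, speaker = meta[0]      # each record is [path, text, speaker]

wavs = AudioFolderDataset("~/datasets/ljspeech_wavs", sample_rate=22050,
                          extension=".wav")
y = wavs[0]                                       # mono waveform resampled to 22050 Hz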
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from pathlib import Path -import pandas as pd -from ruamel.yaml import YAML -import io - -import librosa -import numpy as np - -from parakeet.g2p.en import text_to_sequence -from parakeet.data.dataset import Dataset -from parakeet.data.datacargo import DataCargo -from parakeet.data.batch import TextIDBatcher, WavBatcher - - -class VCTK(Dataset): - def __init__(self, root): - assert isinstance(root, ( - str, Path)), "root should be a string or Path object" - self.root = root if isinstance(root, Path) else Path(root) - self.text_root = self.root.joinpath("txt") - self.wav_root = self.root.joinpath("wav48") - - if not (self.root.joinpath("metadata.csv").exists() and - self.root.joinpath("speaker_indices.yaml").exists()): - self._prepare_metadata() - self.speaker_indices, self.metadata = self._load_metadata() - - def _load_metadata(self): - yaml = YAML(typ='safe') - speaker_indices = yaml.load(self.root.joinpath("speaker_indices.yaml")) - metadata = pd.read_csv( - self.root.joinpath("metadata.csv"), sep="|", quoting=3, header=1) - return speaker_indices, metadata - - def _prepare_metadata(self): - metadata = [] - speaker_to_index = {} - for i, speaker_folder in enumerate(self.text_root.iterdir()): - if speaker_folder.is_dir(): - speaker_to_index[speaker_folder.name] = i - for text_file in speaker_folder.iterdir(): - if text_file.is_file(): - with io.open(str(text_file)) as f: - transcription = f.read().strip() - wav_file = text_file.with_suffix(".wav") - metadata.append( - (wav_file.name, speaker_folder.name, transcription)) - metadata = pd.DataFrame.from_records( - metadata, columns=["wave_file", "speaker", "text"]) - - # save them - yaml = YAML(typ='safe') - yaml.dump(speaker_to_index, self.root.joinpath("speaker_indices.yaml")) - metadata.to_csv( - self.root.joinpath("metadata.csv"), - sep="|", - quoting=3, - index=False) - - def _get_example(self, metadatum): - wave_file, speaker, text = metadatum - wav_path = self.wav_root.joinpath(speaker, wave_file) - wav, sr = librosa.load(str(wav_path), sr=None) - phoneme_seq = np.array(text_to_sequence(text)) - return wav, self.speaker_indices[speaker], phoneme_seq - - def __getitem__(self, index): - metadatum = self.metadata.iloc[index] - example = self._get_example(metadatum) - return example - - def __len__(self): - return len(self.metadata) - - def _batch_examples(self, minibatch): - wav_batch, speaker_batch, phoneme_batch = [], [], [] - for example in minibatch: - wav, speaker_id, phoneme_seq = example - wav_batch.append(wav) - speaker_batch.append(speaker_id) - phoneme_batch.append(phoneme_seq) - wav_batch = WavBatcher(pad_value=0.)(wav_batch) - speaker_batch = np.array(speaker_batch) - phoneme_batch = TextIDBatcher(pad_id=0)(phoneme_batch) - return wav_batch, speaker_batch, phoneme_batch diff --git a/parakeet/frontend/__init__.py b/parakeet/frontend/__init__.py new file mode 100644 index 0000000..c49b725 --- /dev/null +++ b/parakeet/frontend/__init__.py @@ -0,0 +1,3 @@ +from parakeet.frontend.vocab import * +from parakeet.frontend.phonectic import * +from parakeet.frontend.punctuation import * diff --git 
a/parakeet/frontend/normalizer/abbrrviation.py b/parakeet/frontend/normalizer/abbrrviation.py new file mode 100644 index 0000000..e69de29 diff --git a/parakeet/frontend/normalizer/acronyms.py b/parakeet/frontend/normalizer/acronyms.py new file mode 100644 index 0000000..e69de29 diff --git a/parakeet/frontend/normalizer/normalizer.py b/parakeet/frontend/normalizer/normalizer.py new file mode 100644 index 0000000..e69de29 diff --git a/parakeet/frontend/normalizer/numbers.py b/parakeet/frontend/normalizer/numbers.py new file mode 100644 index 0000000..ef7343c --- /dev/null +++ b/parakeet/frontend/normalizer/numbers.py @@ -0,0 +1,3 @@ +# number expansion is not that easy +import num2words +import inflect \ No newline at end of file diff --git a/parakeet/frontend/normalizer/width.py b/parakeet/frontend/normalizer/width.py new file mode 100644 index 0000000..440557f --- /dev/null +++ b/parakeet/frontend/normalizer/width.py @@ -0,0 +1,24 @@ +def full2half_width(ustr): + half = [] + for u in ustr: + num = ord(u) + if num == 0x3000: # 全角空格变半角 + num = 32 + elif 0xFF01 <= num <= 0xFF5E: + num -= 0xfee0 + u = chr(num) + half.append(u) + return ''.join(half) + +def half2full_width(ustr): + full = [] + for u in ustr: + num = ord(u) + if num == 32: # 半角空格变全角 + num = 0x3000 + elif 0x21 <= num <= 0x7E: + num += 0xfee0 + u = chr(num) # to unicode + full.append(u) + + return ''.join(full) \ No newline at end of file diff --git a/parakeet/frontend/phonectic.py b/parakeet/frontend/phonectic.py new file mode 100644 index 0000000..cda0fc7 --- /dev/null +++ b/parakeet/frontend/phonectic.py @@ -0,0 +1,97 @@ +from abc import ABC, abstractmethod +from typing import Union +from g2p_en import G2p +from g2pM import G2pM +from parakeet.frontend import Vocab +from opencc import OpenCC +from parakeet.frontend.punctuation import get_punctuations + +class Phonetics(ABC): + @abstractmethod + def __call__(self, sentence): + pass + + @abstractmethod + def phoneticize(self, sentence): + pass + + @abstractmethod + def numericalize(self, phonemes): + pass + +class English(Phonetics): + def __init__(self): + self.backend = G2p() + self.phonemes = list(self.backend.phonemes) + self.punctuations = get_punctuations("en") + self.vocab = Vocab(self.phonemes + self.punctuations) + + def phoneticize(self, sentence): + start = self.vocab.start_symbol + end = self.vocab.end_symbol + phonemes = ([] if start is None else [start]) \ + + self.backend(sentence) \ + + ([] if end is None else [end]) + return phonemes + + def numericalize(self, phonemes): + ids = [self.vocab.lookup(item) for item in phonemes if item in self.vocab.stoi] + return ids + + def reverse(self, ids): + return [self.vocab.reverse(i) for i in ids] + + def __call__(self, sentence): + return self.numericalize(self.phoneticize(sentence)) + + @property + def vocab_size(self): + return len(self.vocab) + + +class Chinese(Phonetics): + def __init__(self): + self.opencc_backend = OpenCC('t2s.json') + self.backend = G2pM() + self.phonemes = self._get_all_syllables() + self.punctuations = get_punctuations("cn") + self.vocab = Vocab(self.phonemes + self.punctuations) + + def _get_all_syllables(self): + all_syllables = set([syllable for k, v in self.backend.cedict.items() for syllable in v]) + return list(all_syllables) + + def phoneticize(self, sentence): + simplified = self.opencc_backend.convert(sentence) + phonemes = self.backend(simplified) + start = self.vocab.start_symbol + end = self.vocab.end_symbol + phonemes = ([] if start is None else [start]) \ + + phonemes \ + + ([] if 
end is None else [end]) + return self._filter_symbols(phonemes) + + def _filter_symbols(self, phonemes): + cleaned_phonemes = [] + for item in phonemes: + if item in self.vocab.stoi: + cleaned_phonemes.append(item) + else: + for char in item: + if char in self.vocab.stoi: + cleaned_phonemes.append(char) + return cleaned_phonemes + + def numericalize(self, phonemes): + ids = [self.vocab.lookup(item) for item in phonemes] + return ids + + def __call__(self, sentence): + return self.numericalize(self.phoneticize(sentence)) + + @property + def vocab_size(self): + return len(self.vocab) + + def reverse(self, ids): + return [self.vocab.reverse(i) for i in ids] diff --git a/parakeet/frontend/punctuation.py b/parakeet/frontend/punctuation.py new file mode 100644 index 0000000..9984970 --- /dev/null +++ b/parakeet/frontend/punctuation.py @@ -0,0 +1,33 @@ +import abc +import string + +__all__ = ["get_punctuations"] + +EN_PUNCT = [ + " ", + "-", + "...", + ",", + ".", + "?", + "!", +] + +CN_PUNCT = [ + "、", + ",", + ";", + ":", + "。", + "?", + "!" +] + +def get_punctuations(lang): + if lang == "en": + return EN_PUNCT + elif lang == "cn": + return CN_PUNCT + else: + raise ValueError(f"language {lang} Not supported") + diff --git a/parakeet/frontend/vocab.py b/parakeet/frontend/vocab.py new file mode 100644 index 0000000..e773ac8 --- /dev/null +++ b/parakeet/frontend/vocab.py @@ -0,0 +1,78 @@ +from typing import Dict, Iterable, List +from ruamel import yaml +from collections import OrderedDict + +class Vocab(object): + def __init__(self, symbols: Iterable[str], + padding_symbol="", + unk_symbol="", + start_symbol="", + end_symbol=""): + self.special_symbols = OrderedDict() + for i, item in enumerate( + [padding_symbol, unk_symbol, start_symbol, end_symbol]): + if item: + self.special_symbols[item] = len(self.special_symbols) + + self.padding_symbol = padding_symbol + self.unk_symbol = unk_symbol + self.start_symbol = start_symbol + self.end_symbol = end_symbol + + + self.stoi = OrderedDict() + self.stoi.update(self.special_symbols) + + for i, s in enumerate(symbols): + if s not in self.stoi: + self.stoi[s] = len(self.stoi) + self.itos = {v: k for k, v in self.stoi.items()} + + def __len__(self): + return len(self.stoi) + + @property + def num_specials(self): + return len(self.special_symbols) + + # special tokens + @property + def padding_index(self): + return self.stoi.get(self.padding_symbol, -1) + + @property + def unk_index(self): + return self.stoi.get(self.unk_symbol, -1) + + @property + def start_index(self): + return self.stoi.get(self.start_symbol, -1) + + @property + def end_index(self): + return self.stoi.get(self.end_symbol, -1) + + def __repr__(self): + fmt = "Vocab(size: {},\nstoi:\n{})" + return fmt.format(len(self), self.stoi) + + def __str__(self): + return self.__repr__() + + def lookup(self, symbol): + return self.stoi[symbol] + + def reverse(self, index): + return self.itos[index] + + def add_symbol(self, symbol): + if symbol in self.stoi: + return + N = len(self.stoi) + self.stoi[symbol] = N + self.itos[N] = symbol + + def add_symbols(self, symbols): + for symbol in symbols: + self.add_symbol(symbol) + diff --git a/parakeet/g2p/__init__.py b/parakeet/g2p/__init__.py deleted file mode 100644 index 5840f33..0000000 --- a/parakeet/g2p/__init__.py +++ /dev/null @@ -1,32 +0,0 @@ -# coding: utf-8 -"""Text processing frontend - -All frontend module should have the following functions: - -- text_to_sequence(text, p) -- sequence_to_text(sequence) - -and the property: - -- n_vocab - -""" 
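# A brief sketch of the new text frontend that replaces the g2p package removed
# below. English wraps g2p_en and Chinese wraps g2pM plus OpenCC; both map a
# sentence to phonemes and then to integer ids through a shared Vocab. The
# example sentence is arbitrary.
from parakeet.frontend import English

frontend = English()
phonemes = frontend.phoneticize("Synthesizing speech with Parakeet.")
ids = frontend.numericalize(phonemes)      # ids into frontend.vocab
assert frontend.reverse(ids) == [p for p in phonemes if p in frontend.vocab.stoi]
print(frontend.vocab_size)                 # phonemes + punctuations (+ special symbols)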
-from . import en - -# optinoal Japanese frontend -try: - from . import jp -except ImportError: - jp = None - -try: - from . import ko -except ImportError: - ko = None - -# if you are going to use the frontend, you need to modify _characters in symbol.py: -# _characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'(),-.:;? ' + '¡¿ñáéíóúÁÉÍÓÚÑ' -try: - from . import es -except ImportError: - es = None diff --git a/parakeet/g2p/en/__init__.py b/parakeet/g2p/en/__init__.py deleted file mode 100644 index 01dd223..0000000 --- a/parakeet/g2p/en/__init__.py +++ /dev/null @@ -1,34 +0,0 @@ -# coding: utf-8 - -from ..text.symbols import symbols -from ..text import sequence_to_text - -import nltk -from random import random - -n_vocab = len(symbols) - -_arpabet = nltk.corpus.cmudict.dict() - - -def _maybe_get_arpabet(word, p): - try: - phonemes = _arpabet[word][0] - phonemes = " ".join(phonemes) - except KeyError: - return word - - return '{%s}' % phonemes if random() < p else word - - -def mix_pronunciation(text, p): - text = ' '.join(_maybe_get_arpabet(word, p) for word in text.split(' ')) - return text - - -def text_to_sequence(text, p=0.0): - if p >= 0: - text = mix_pronunciation(text, p) - from ..text import text_to_sequence - text = text_to_sequence(text, ["english_cleaners"]) - return text diff --git a/parakeet/g2p/es/__init__.py b/parakeet/g2p/es/__init__.py deleted file mode 100644 index 8ac385f..0000000 --- a/parakeet/g2p/es/__init__.py +++ /dev/null @@ -1,14 +0,0 @@ -# coding: utf-8 -from ..text.symbols import symbols -from ..text import sequence_to_text - -import nltk -from random import random - -n_vocab = len(symbols) - - -def text_to_sequence(text, p=0.0): - from ..text import text_to_sequence - text = text_to_sequence(text, ["basic_cleaners"]) - return text diff --git a/parakeet/g2p/jp/__init__.py b/parakeet/g2p/jp/__init__.py deleted file mode 100644 index 36c7fd8..0000000 --- a/parakeet/g2p/jp/__init__.py +++ /dev/null @@ -1,77 +0,0 @@ -# coding: utf-8 - -import MeCab -import jaconv -from random import random - -n_vocab = 0xffff - -_eos = 1 -_pad = 0 -_tagger = None - - -def _yomi(mecab_result): - tokens = [] - yomis = [] - for line in mecab_result.split("\n")[:-1]: - s = line.split("\t") - if len(s) == 1: - break - token, rest = s - rest = rest.split(",") - tokens.append(token) - yomi = rest[7] if len(rest) > 7 else None - yomi = None if yomi == "*" else yomi - yomis.append(yomi) - - return tokens, yomis - - -def _mix_pronunciation(tokens, yomis, p): - return "".join(yomis[idx] - if yomis[idx] is not None and random() < p else tokens[idx] - for idx in range(len(tokens))) - - -def mix_pronunciation(text, p): - global _tagger - if _tagger is None: - _tagger = MeCab.Tagger("") - tokens, yomis = _yomi(_tagger.parse(text)) - return _mix_pronunciation(tokens, yomis, p) - - -def add_punctuation(text): - last = text[-1] - if last not in [".", ",", "、", "。", "!", "?", "!", "?"]: - text = text + "。" - return text - - -def normalize_delimitor(text): - text = text.replace(",", "、") - text = text.replace(".", "。") - text = text.replace(",", "、") - text = text.replace(".", "。") - return text - - -def text_to_sequence(text, p=0.0): - for c in [" ", " ", "「", "」", "『", "』", "・", "【", "】", "(", ")", "(", ")"]: - text = text.replace(c, "") - text = text.replace("!", "!") - text = text.replace("?", "?") - - text = normalize_delimitor(text) - text = jaconv.normalize(text) - if p > 0: - text = mix_pronunciation(text, p) - text = jaconv.hira2kata(text) - text = add_punctuation(text) - - 
return [ord(c) for c in text] + [_eos] # EOS - - -def sequence_to_text(seq): - return "".join(chr(n) for n in seq) diff --git a/parakeet/g2p/ko/__init__.py b/parakeet/g2p/ko/__init__.py deleted file mode 100644 index ccb8b5f..0000000 --- a/parakeet/g2p/ko/__init__.py +++ /dev/null @@ -1,17 +0,0 @@ -# coding: utf-8 - -from random import random - -n_vocab = 0xffff - -_eos = 1 -_pad = 0 -_tagger = None - - -def text_to_sequence(text, p=0.0): - return [ord(c) for c in text] + [_eos] # EOS - - -def sequence_to_text(seq): - return "".join(chr(n) for n in seq) diff --git a/parakeet/g2p/text/__init__.py b/parakeet/g2p/text/__init__.py deleted file mode 100644 index 312b720..0000000 --- a/parakeet/g2p/text/__init__.py +++ /dev/null @@ -1,89 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import re -from . import cleaners -from .symbols import symbols - -# Mappings from symbol to numeric ID and vice versa: -_symbol_to_id = {s: i for i, s in enumerate(symbols)} -_id_to_symbol = {i: s for i, s in enumerate(symbols)} - -# Regular expression matching text enclosed in curly braces: -_curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)') - - -def text_to_sequence(text, cleaner_names): - '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text. - - The text can optionally have ARPAbet sequences enclosed in curly braces embedded - in it. For example, "Turn left on {HH AW1 S S T AH0 N} Street." 
- - Args: - text: string to convert to a sequence - cleaner_names: names of the cleaner functions to run the text through - - Returns: - List of integers corresponding to the symbols in the text - ''' - sequence = [] - - # Check for curly braces and treat their contents as ARPAbet: - while len(text): - m = _curly_re.match(text) - if not m: - sequence += _symbols_to_sequence(_clean_text(text, cleaner_names)) - break - sequence += _symbols_to_sequence( - _clean_text(m.group(1), cleaner_names)) - sequence += _arpabet_to_sequence(m.group(2)) - text = m.group(3) - - # Append EOS token - sequence.append(_symbol_to_id['~']) - return sequence - - -def sequence_to_text(sequence): - '''Converts a sequence of IDs back to a string''' - result = '' - for symbol_id in sequence: - if symbol_id in _id_to_symbol: - s = _id_to_symbol[symbol_id] - # Enclose ARPAbet back in curly braces: - if len(s) > 1 and s[0] == '@': - s = '{%s}' % s[1:] - result += s - return result.replace('}{', ' ') - - -def _clean_text(text, cleaner_names): - for name in cleaner_names: - cleaner = getattr(cleaners, name) - if not cleaner: - raise Exception('Unknown cleaner: %s' % name) - text = cleaner(text) - return text - - -def _symbols_to_sequence(symbols): - return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)] - - -def _arpabet_to_sequence(text): - return _symbols_to_sequence(['@' + s for s in text.split()]) - - -def _should_keep_symbol(s): - return s in _symbol_to_id and s is not '_' and s is not '~' diff --git a/parakeet/g2p/text/cleaners.py b/parakeet/g2p/text/cleaners.py deleted file mode 100644 index 58553c1..0000000 --- a/parakeet/g2p/text/cleaners.py +++ /dev/null @@ -1,110 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -''' -Cleaners are transformations that run over the input text at both training and eval time. - -Cleaners can be selected by passing a comma-delimited list of cleaner names as the "cleaners" -hyperparameter. Some cleaners are English-specific. You'll typically want to use: - 1. "english_cleaners" for English text - 2. "transliteration_cleaners" for non-English text that can be transliterated to ASCII using - the Unidecode library (https://pypi.python.org/pypi/Unidecode) - 3. "basic_cleaners" if you do not want to transliterate (in this case, you should also update - the symbols in symbols.py to match your data). -''' - -import re -from unidecode import unidecode -from .numbers import normalize_numbers - -# Regular expression matching whitespace: -_whitespace_re = re.compile(r'\s+') - -# List of (regular expression, replacement) pairs for abbreviations: -_abbreviations = [(re.compile('\\b%s\\.' 
% x[0], re.IGNORECASE), x[1]) - for x in [ - ('mrs', 'misess'), - ('mr', 'mister'), - ('dr', 'doctor'), - ('st', 'saint'), - ('co', 'company'), - ('jr', 'junior'), - ('maj', 'major'), - ('gen', 'general'), - ('drs', 'doctors'), - ('rev', 'reverend'), - ('lt', 'lieutenant'), - ('hon', 'honorable'), - ('sgt', 'sergeant'), - ('capt', 'captain'), - ('esq', 'esquire'), - ('ltd', 'limited'), - ('col', 'colonel'), - ('ft', 'fort'), - ]] - - -def expand_abbreviations(text): - for regex, replacement in _abbreviations: - text = re.sub(regex, replacement, text) - return text - - -def expand_numbers(text): - return normalize_numbers(text) - - -def lowercase(text): - return text.lower() - - -def collapse_whitespace(text): - return re.sub(_whitespace_re, ' ', text) - - -def convert_to_ascii(text): - return unidecode(text) - - -def add_punctuation(text): - if len(text) == 0: - return text - if text[-1] not in '!,.:;?': - text = text + '.' # without this decoder is confused when to output EOS - return text - - -def basic_cleaners(text): - '''Basic pipeline that lowercases and collapses whitespace without transliteration.''' - text = lowercase(text) - text = collapse_whitespace(text) - return text - - -def transliteration_cleaners(text): - '''Pipeline for non-English text that transliterates to ASCII.''' - text = convert_to_ascii(text) - text = lowercase(text) - text = collapse_whitespace(text) - return text - - -def english_cleaners(text): - '''Pipeline for English text, including number and abbreviation expansion.''' - text = convert_to_ascii(text) - #text = add_punctuation(text) - text = lowercase(text) - text = expand_numbers(text) - text = expand_abbreviations(text) - text = collapse_whitespace(text) - return text diff --git a/parakeet/g2p/text/cmudict.py b/parakeet/g2p/text/cmudict.py deleted file mode 100644 index bbe7903..0000000 --- a/parakeet/g2p/text/cmudict.py +++ /dev/null @@ -1,78 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import re - -valid_symbols = [ - 'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1', - 'AH2', 'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0', - 'AY1', 'AY2', 'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0', - 'ER1', 'ER2', 'EY', 'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0', - 'IH1', 'IH2', 'IY', 'IY0', 'IY1', 'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG', - 'OW', 'OW0', 'OW1', 'OW2', 'OY', 'OY0', 'OY1', 'OY2', 'P', 'R', 'S', 'SH', - 'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W', - 'Y', 'Z', 'ZH' -] - -_valid_symbol_set = set(valid_symbols) - - -class CMUDict: - '''Thin wrapper around CMUDict data. 
http://www.speech.cs.cmu.edu/cgi-bin/cmudict''' - - def __init__(self, file_or_path, keep_ambiguous=True): - if isinstance(file_or_path, str): - with open(file_or_path, encoding='latin-1') as f: - entries = _parse_cmudict(f) - else: - entries = _parse_cmudict(file_or_path) - if not keep_ambiguous: - entries = { - word: pron - for word, pron in entries.items() if len(pron) == 1 - } - self._entries = entries - - def __len__(self): - return len(self._entries) - - def lookup(self, word): - '''Returns list of ARPAbet pronunciations of the given word.''' - return self._entries.get(word.upper()) - - -_alt_re = re.compile(r'\([0-9]+\)') - - -def _parse_cmudict(file): - cmudict = {} - for line in file: - if len(line) and (line[0] >= 'A' and line[0] <= 'Z' or line[0] == "'"): - parts = line.split(' ') - word = re.sub(_alt_re, '', parts[0]) - pronunciation = _get_pronunciation(parts[1]) - if pronunciation: - if word in cmudict: - cmudict[word].append(pronunciation) - else: - cmudict[word] = [pronunciation] - return cmudict - - -def _get_pronunciation(s): - parts = s.strip().split(' ') - for part in parts: - if part not in _valid_symbol_set: - return None - return ' '.join(parts) diff --git a/parakeet/g2p/text/numbers.py b/parakeet/g2p/text/numbers.py deleted file mode 100644 index 24b5817..0000000 --- a/parakeet/g2p/text/numbers.py +++ /dev/null @@ -1,71 +0,0 @@ -# -*- coding: utf-8 -*- - -import inflect -import re - -_inflect = inflect.engine() -_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') -_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') -_pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') -_dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') -_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') -_number_re = re.compile(r'[0-9]+') - - -def _remove_commas(m): - return m.group(1).replace(',', '') - - -def _expand_decimal_point(m): - return m.group(1).replace('.', ' point ') - - -def _expand_dollars(m): - match = m.group(1) - parts = match.split('.') - if len(parts) > 2: - return match + ' dollars' # Unexpected format - dollars = int(parts[0]) if parts[0] else 0 - cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 - if dollars and cents: - dollar_unit = 'dollar' if dollars == 1 else 'dollars' - cent_unit = 'cent' if cents == 1 else 'cents' - return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) - elif dollars: - dollar_unit = 'dollar' if dollars == 1 else 'dollars' - return '%s %s' % (dollars, dollar_unit) - elif cents: - cent_unit = 'cent' if cents == 1 else 'cents' - return '%s %s' % (cents, cent_unit) - else: - return 'zero dollars' - - -def _expand_ordinal(m): - return _inflect.number_to_words(m.group(0)) - - -def _expand_number(m): - num = int(m.group(0)) - if num > 1000 and num < 3000: - if num == 2000: - return 'two thousand' - elif num > 2000 and num < 2010: - return 'two thousand ' + _inflect.number_to_words(num % 100) - elif num % 100 == 0: - return _inflect.number_to_words(num // 100) + ' hundred' - else: - return _inflect.number_to_words( - num, andword='', zero='oh', group=2).replace(', ', ' ') - else: - return _inflect.number_to_words(num, andword='') - - -def normalize_numbers(text): - text = re.sub(_comma_number_re, _remove_commas, text) - text = re.sub(_pounds_re, r'\1 pounds', text) - text = re.sub(_dollars_re, _expand_dollars, text) - text = re.sub(_decimal_number_re, _expand_decimal_point, text) - text = re.sub(_ordinal_re, _expand_ordinal, text) - text = re.sub(_number_re, _expand_number, text) - return text diff --git a/parakeet/g2p/text/symbols.py 
b/parakeet/g2p/text/symbols.py deleted file mode 100644 index 299ca58..0000000 --- a/parakeet/g2p/text/symbols.py +++ /dev/null @@ -1,30 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -''' -Defines the set of symbols used in text input to the model. - -The default is a set of ASCII characters that works well for English or text that has been run -through Unidecode. For other data, you can modify _characters. See TRAINING_DATA.md for details. -''' -from .cmudict import valid_symbols - -_pad = '_' -_eos = '~' -_characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!\'(),-.:;? ' - -# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters): -_arpabet = ['@' + s for s in valid_symbols] - -# Export all symbols: -symbols = [_pad, _eos] + list(_characters) + _arpabet diff --git a/parakeet/models/__init__.py b/parakeet/models/__init__.py index abf198b..d8521da 100644 --- a/parakeet/models/__init__.py +++ b/parakeet/models/__init__.py @@ -11,3 +11,11 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. + +from parakeet.models.clarinet import * +from parakeet.models.waveflow import * +from parakeet.models.wavenet import * + +from parakeet.models.transformer_tts import * +from parakeet.models.deepvoice3 import * +# from parakeet.models.fastspeech import * diff --git a/parakeet/models/clarinet.py b/parakeet/models/clarinet.py new file mode 100644 index 0000000..ba859b2 --- /dev/null +++ b/parakeet/models/clarinet.py @@ -0,0 +1,158 @@ +import paddle +from paddle import nn +from paddle.nn import functional as F +from paddle import distribution as D + +from parakeet.models.wavenet import WaveNet, UpsampleNet, crop + +__all__ = ["Clarinet"] + +class ParallelWaveNet(nn.LayerList): + def __init__(self, n_loops, n_layers, residual_channels, condition_dim, + filter_size): + """ParallelWaveNet, an inverse autoregressive flow model, it contains several flows(WaveNets). + + Args: + n_loops (List[int]): `n_loop` for each flow. + n_layers (List[int]): `n_layer` for each flow. + residual_channels (int): `residual_channels` for every flow. + condition_dim (int): `condition_dim` for every flow. + filter_size (int): `filter_size` for every flow. + """ + super(ParallelWaveNet, self).__init__() + for n_loop, n_layer in zip(n_loops, n_layers): + # teacher's log_scale_min does not matter herem, -100 is a dummy value + self.append( + WaveNet(n_loop, n_layer, residual_channels, 3, condition_dim, + filter_size, "mog", -100.0)) + + def forward(self, z, condition=None): + """Transform a random noise sampled from a standard Gaussian distribution into sample from the target distribution. And output the mean and log standard deviation of the output distribution. + + Args: + z (Variable): shape(B, T), random noise sampled from a standard gaussian disribution. 
+ condition (Variable, optional): shape(B, F, T), dtype float, the upsampled condition. Defaults to None. + + Returns: + (z, out_mu, out_log_std) + z (Variable): shape(B, T), dtype float, transformed noise, it is the synthesized waveform. + out_mu (Variable): shape(B, T), dtype float, means of the output distributions. + out_log_std (Variable): shape(B, T), dtype float, log standard deviations of the output distributions. + """ + for i, flow in enumerate(self): + theta = flow(z, condition) # w, mu, log_std [0: T] + w, mu, log_std = paddle.chunk(theta, 3, axis=-1) # (B, T, 1) for each + mu = paddle.squeeze(mu, -1) #[0: T] + log_std = paddle.squeeze(log_std, -1) #[0: T] + z = z * paddle.exp(log_std) + mu #[0: T] + + if i == 0: + out_mu = mu + out_log_std = log_std + else: + out_mu = out_mu * paddle.exp(log_std) + mu + out_log_std += log_std + + return z, out_mu, out_log_std + + +# Gaussian IAF model +class Clarinet(nn.Layer): + def __init__(self, encoder, teacher, student, stft, + min_log_scale=-6.0, lmd=4.0): + """Clarinet model. Conditional Parallel WaveNet. + + Args: + encoder (UpsampleNet): an UpsampleNet to upsample mel spectrogram. + teacher (WaveNet): a WaveNet, the teacher. + student (ParallelWaveNet): a ParallelWaveNet model, the student. + stft (STFT): a STFT model to perform differentiable stft transform. + min_log_scale (float, optional): used only for computing loss, the minimal value of log standard deviation of the output distribution of both the teacher and the student . Defaults to -6.0. + lmd (float, optional): weight for stft loss. Defaults to 4.0. + """ + super(Clarinet, self).__init__() + self.encoder = encoder + self.teacher = teacher + self.student = student + self.stft = stft + + self.lmd = lmd + self.min_log_scale = min_log_scale + + def forward(self, audio, mel, audio_start, clip_kl=True): + """Compute loss of Clarinet model. + + Args: + audio (Variable): shape(B, T_audio), dtype flaot32, ground truth waveform. + mel (Variable): shape(B, F, T_mel), dtype flaot32, condition(mel spectrogram here). + audio_start (Variable): shape(B, ), dtype int64, audio starts positions. + clip_kl (bool, optional): whether to clip kl_loss by maximum=100. Defaults to True. + + Returns: + Dict(str, Variable) + loss (Variable): shape(1, ), dtype flaot32, total loss. + kl (Variable): shape(1, ), dtype flaot32, kl divergence between the teacher's output distribution and student's output distribution. + regularization (Variable): shape(1, ), dtype flaot32, a regularization term of the KL divergence. + spectrogram_frame_loss (Variable): shape(1, ), dytpe: float, stft loss, the L1-distance of the magnitudes of the spectrograms of the ground truth waveform and synthesized waveform. + """ + batch_size, audio_length = audio.shape # audio clip's length + + z = paddle.randn(audio.shape) + condition = self.encoder(mel) # (B, C, T) + condition_slice = crop(condition, audio_start, audio_length) + + x, s_means, s_scales = self.student(z, condition_slice) # all [0: T] + s_means = s_means[:, 1:] # (B, T-1), time steps [1: T] + s_scales = s_scales[:, 1:] # (B, T-1), time steps [1: T] + s_clipped_scales = paddle.clip(s_scales, self.min_log_scale, 100.) 
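+        # Note on the KL term computed below: both the student and the teacher
+        # output factorized Gaussians, so paddle's D.Normal.kl_divergence uses
+        # the closed form per time step instead of a Monte-Carlo estimate:
+        #   KL(N(mu_s, sigma_s) || N(mu_t, sigma_t))
+        #       = log(sigma_t / sigma_s)
+        #         + (sigma_s**2 + (mu_s - mu_t)**2) / (2 * sigma_t**2) - 0.5
+        # where sigma = exp(log_scale); clipping the log scales keeps this term
+        # finite when a predicted scale collapses towards zero.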
+ + # teacher outputs single gaussian + y = self.teacher(x[:, :-1], condition_slice[:, :, 1:]) + _, t_means, t_scales = paddle.chunk(y, 3, axis=-1) # time steps [1: T] + t_means = paddle.squeeze(t_means, [-1]) # (B, T-1), time steps [1: T] + t_scales = paddle.squeeze(t_scales, [-1]) # (B, T-1), time steps [1: T] + t_clipped_scales = paddle.clip(t_scales, self.min_log_scale, 100.) + + s_distribution = D.Normal(s_means, paddle.exp(s_clipped_scales)) + t_distribution = D.Normal(t_means, paddle.exp(t_clipped_scales)) + + # kl divergence loss, so we only need to sample once? no MC + kl = s_distribution.kl_divergence(t_distribution) + if clip_kl: + kl = paddle.clip(kl, -100., 10.) + # context size dropped + kl = paddle.reduce_mean(kl[:, self.teacher.context_size:]) + # major diff here + regularization = F.mse_loss(t_scales[:, self.teacher.context_size:], + s_scales[:, self.teacher.context_size:]) + + # introduce information from real target + spectrogram_frame_loss = F.mse_loss( + self.stft.magnitude(audio), self.stft.magnitude(x)) + loss = kl + self.lmd * regularization + spectrogram_frame_loss + loss_dict = { + "loss": loss, + "kl_divergence": kl, + "regularization": regularization, + "stft_loss": spectrogram_frame_loss + } + return loss_dict + + @paddle.no_grad() + def synthesis(self, mel): + """Synthesize waveform using the encoder and the student network. + + Args: + mel (Variable): shape(B, F, T_mel), the condition(mel spectrogram here). + + Returns: + Variable: shape(B, T_audio), the synthesized waveform. (T_audio = T_mel * upscale_factor, where upscale_factor is the `upscale_factor` of the encoder.) + """ + condition = self.encoder(mel) + samples_shape = (condition.shape[0], condition.shape[-1]) + z = paddle.randn(samples_shape) + x, s_means, s_scales = self.student(z, condition) + return x + + +# TODO(chenfeiyu): ClariNetLoss \ No newline at end of file diff --git a/parakeet/models/clarinet/__init__.py b/parakeet/models/clarinet/__init__.py deleted file mode 100644 index f3148be..0000000 --- a/parakeet/models/clarinet/__init__.py +++ /dev/null @@ -1,16 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from .net import * -from .parallel_wavenet import * \ No newline at end of file diff --git a/parakeet/models/clarinet/net.py b/parakeet/models/clarinet/net.py deleted file mode 100644 index 1af0493..0000000 --- a/parakeet/models/clarinet/net.py +++ /dev/null @@ -1,221 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import division -import itertools -import numpy as np -from scipy import signal -from tqdm import trange - -import paddle.fluid.layers as F -import paddle.fluid.dygraph as dg -import paddle.fluid.initializer as I -import paddle.fluid.layers.distributions as D - -from parakeet.modules.weight_norm import Conv2DTranspose -from parakeet.models.wavenet import crop, WaveNet, UpsampleNet -from parakeet.models.clarinet.parallel_wavenet import ParallelWaveNet -from parakeet.models.clarinet.utils import conv2d - - -# Gaussian IAF model -class Clarinet(dg.Layer): - def __init__(self, - encoder, - teacher, - student, - stft, - min_log_scale=-6.0, - lmd=4.0): - """Clarinet model. - - Args: - encoder (UpsampleNet): an UpsampleNet to upsample mel spectrogram. - teacher (WaveNet): a WaveNet, the teacher. - student (ParallelWaveNet): a ParallelWaveNet model, the student. - stft (STFT): a STFT model to perform differentiable stft transform. - min_log_scale (float, optional): used only for computing loss, the minimal value of log standard deviation of the output distribution of both the teacher and the student . Defaults to -6.0. - lmd (float, optional): weight for stft loss. Defaults to 4.0. - """ - super(Clarinet, self).__init__() - self.encoder = encoder - self.teacher = teacher - self.student = student - self.stft = stft - - self.lmd = lmd - self.min_log_scale = min_log_scale - - def forward(self, audio, mel, audio_start, clip_kl=True): - """Compute loss of Clarinet model. - - Args: - audio (Variable): shape(B, T_audio), dtype flaot32, ground truth waveform. - mel (Variable): shape(B, F, T_mel), dtype flaot32, condition(mel spectrogram here). - audio_start (Variable): shape(B, ), dtype int64, audio starts positions. - clip_kl (bool, optional): whether to clip kl_loss by maximum=100. Defaults to True. - - Returns: - Dict(str, Variable) - loss (Variable): shape(1, ), dtype flaot32, total loss. - kl (Variable): shape(1, ), dtype flaot32, kl divergence between the teacher's output distribution and student's output distribution. - regularization (Variable): shape(1, ), dtype flaot32, a regularization term of the KL divergence. - spectrogram_frame_loss (Variable): shape(1, ), dytpe: float, stft loss, the L1-distance of the magnitudes of the spectrograms of the ground truth waveform and synthesized waveform. - """ - batch_size, audio_length = audio.shape # audio clip's length - - z = F.gaussian_random(audio.shape) - condition = self.encoder(mel) # (B, C, T) - condition_slice = crop(condition, audio_start, audio_length) - - x, s_means, s_scales = self.student(z, condition_slice) # all [0: T] - s_means = s_means[:, 1:] # (B, T-1), time steps [1: T] - s_scales = s_scales[:, 1:] # (B, T-1), time steps [1: T] - s_clipped_scales = F.clip(s_scales, self.min_log_scale, 100.) - - # teacher outputs single gaussian - y = self.teacher(x[:, :-1], condition_slice[:, :, 1:]) - _, t_means, t_scales = F.split(y, 3, -1) # time steps [1: T] - t_means = F.squeeze(t_means, [-1]) # (B, T-1), time steps [1: T] - t_scales = F.squeeze(t_scales, [-1]) # (B, T-1), time steps [1: T] - t_clipped_scales = F.clip(t_scales, self.min_log_scale, 100.) - - s_distribution = D.Normal(s_means, F.exp(s_clipped_scales)) - t_distribution = D.Normal(t_means, F.exp(t_clipped_scales)) - - # kl divergence loss, so we only need to sample once? 
no MC - kl = s_distribution.kl_divergence(t_distribution) - if clip_kl: - kl = F.clip(kl, -100., 10.) - # context size dropped - kl = F.reduce_mean(kl[:, self.teacher.context_size:]) - # major diff here - regularization = F.mse_loss(t_scales[:, self.teacher.context_size:], - s_scales[:, self.teacher.context_size:]) - - # introduce information from real target - spectrogram_frame_loss = F.mse_loss( - self.stft.magnitude(audio), self.stft.magnitude(x)) - loss = kl + self.lmd * regularization + spectrogram_frame_loss - loss_dict = { - "loss": loss, - "kl_divergence": kl, - "regularization": regularization, - "stft_loss": spectrogram_frame_loss - } - return loss_dict - - @dg.no_grad - def synthesis(self, mel): - """Synthesize waveform using the encoder and the student network. - - Args: - mel (Variable): shape(B, F, T_mel), the condition(mel spectrogram here). - - Returns: - Variable: shape(B, T_audio), the synthesized waveform. (T_audio = T_mel * upscale_factor, where upscale_factor is the `upscale_factor` of the encoder.) - """ - condition = self.encoder(mel) - samples_shape = (condition.shape[0], condition.shape[-1]) - z = F.gaussian_random(samples_shape) - x, s_means, s_scales = self.student(z, condition) - return x - - -class STFT(dg.Layer): - def __init__(self, n_fft, hop_length, win_length, window="hanning"): - """A module for computing differentiable stft transform. See `librosa.stft` for more details. - - Args: - n_fft (int): number of samples in a frame. - hop_length (int): number of samples shifted between adjacent frames. - win_length (int): length of the window function. - window (str, optional): name of window function, see `scipy.signal.get_window` for more details. Defaults to "hanning". - """ - super(STFT, self).__init__() - self.hop_length = hop_length - self.n_bin = 1 + n_fft // 2 - self.n_fft = n_fft - - # calculate window - window = signal.get_window(window, win_length) - if n_fft != win_length: - pad = (n_fft - win_length) // 2 - window = np.pad(window, ((pad, pad), ), 'constant') - - # calculate weights - r = np.arange(0, n_fft) - M = np.expand_dims(r, -1) * np.expand_dims(r, 0) - w_real = np.reshape(window * - np.cos(2 * np.pi * M / n_fft)[:self.n_bin], - (self.n_bin, 1, 1, self.n_fft)).astype("float32") - w_imag = np.reshape(window * - np.sin(-2 * np.pi * M / n_fft)[:self.n_bin], - (self.n_bin, 1, 1, self.n_fft)).astype("float32") - - w = np.concatenate([w_real, w_imag], axis=0) - self.weight = dg.to_variable(w) - - def forward(self, x): - """Compute the stft transform. - - Args: - x (Variable): shape(B, T), dtype flaot32, the input waveform. - - Returns: - (real, imag) - real (Variable): shape(B, C, 1, T), dtype flaot32, the real part of the spectrogram. (C = 1 + n_fft // 2) - imag (Variable): shape(B, C, 1, T), dtype flaot32, the image part of the spectrogram. (C = 1 + n_fft // 2) - """ - # x(batch_size, time_steps) - # pad it first with reflect mode - pad_start = F.reverse(x[:, 1:1 + self.n_fft // 2], axis=1) - pad_stop = F.reverse(x[:, -(1 + self.n_fft // 2):-1], axis=1) - x = F.concat([pad_start, x, pad_stop], axis=-1) - - # to BC1T, C=1 - x = F.unsqueeze(x, axes=[1, 2]) - out = conv2d(x, self.weight, stride=(1, self.hop_length)) - real, imag = F.split(out, 2, dim=1) # BC1T - return real, imag - - def power(self, x): - """Compute the power spectrogram. - - Args: - (real, imag) - real (Variable): shape(B, C, 1, T), dtype flaot32, the real part of the spectrogram. - imag (Variable): shape(B, C, 1, T), dtype flaot32, the image part of the spectrogram. 
- - Returns: - Variable: shape(B, C, 1, T), dtype flaot32, the power spectrogram. - """ - real, imag = self(x) - power = real**2 + imag**2 - return power - - def magnitude(self, x): - """Compute the magnitude spectrogram. - - Args: - (real, imag) - real (Variable): shape(B, C, 1, T), dtype flaot32, the real part of the spectrogram. - imag (Variable): shape(B, C, 1, T), dtype flaot32, the image part of the spectrogram. - - Returns: - Variable: shape(B, C, 1, T), dtype flaot32, the magnitude spectrogram. It is the square root of the power spectrogram. - """ - power = self.power(x) - magnitude = F.sqrt(power) - return magnitude diff --git a/parakeet/models/clarinet/parallel_wavenet.py b/parakeet/models/clarinet/parallel_wavenet.py deleted file mode 100644 index 9be958e..0000000 --- a/parakeet/models/clarinet/parallel_wavenet.py +++ /dev/null @@ -1,77 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import division -import math -import time -import itertools -import numpy as np - -import paddle.fluid.layers as F -import paddle.fluid.dygraph as dg -import paddle.fluid.initializer as I -import paddle.fluid.layers.distributions as D - -from parakeet.modules.weight_norm import Linear, Conv1D, Conv1DCell, Conv2DTranspose -from parakeet.models.wavenet import WaveNet - - -class ParallelWaveNet(dg.Layer): - def __init__(self, n_loops, n_layers, residual_channels, condition_dim, - filter_size): - """ParallelWaveNet, an inverse autoregressive flow model, it contains several flows(WaveNets). - - Args: - n_loops (List[int]): `n_loop` for each flow. - n_layers (List[int]): `n_layer` for each flow. - residual_channels (int): `residual_channels` for every flow. - condition_dim (int): `condition_dim` for every flow. - filter_size (int): `filter_size` for every flow. - """ - super(ParallelWaveNet, self).__init__() - self.flows = dg.LayerList() - for n_loop, n_layer in zip(n_loops, n_layers): - # teacher's log_scale_min does not matter herem, -100 is a dummy value - self.flows.append( - WaveNet(n_loop, n_layer, residual_channels, 3, condition_dim, - filter_size, "mog", -100.0)) - - def forward(self, z, condition=None): - """Transform a random noise sampled from a standard Gaussian distribution into sample from the target distribution. And output the mean and log standard deviation of the output distribution. - - Args: - z (Variable): shape(B, T), random noise sampled from a standard gaussian disribution. - condition (Variable, optional): shape(B, F, T), dtype float, the upsampled condition. Defaults to None. - - Returns: - (z, out_mu, out_log_std) - z (Variable): shape(B, T), dtype float, transformed noise, it is the synthesized waveform. - out_mu (Variable): shape(B, T), dtype float, means of the output distributions. - out_log_std (Variable): shape(B, T), dtype float, log standard deviations of the output distributions. 
- """ - for i, flow in enumerate(self.flows): - theta = flow(z, condition) # w, mu, log_std [0: T] - w, mu, log_std = F.split(theta, 3, dim=-1) # (B, T, 1) for each - mu = F.squeeze(mu, [-1]) #[0: T] - log_std = F.squeeze(log_std, [-1]) #[0: T] - z = z * F.exp(log_std) + mu #[0: T] - - if i == 0: - out_mu = mu - out_log_std = log_std - else: - out_mu = out_mu * F.exp(log_std) + mu - out_log_std += log_std - - return z, out_mu, out_log_std diff --git a/parakeet/models/clarinet/utils.py b/parakeet/models/clarinet/utils.py deleted file mode 100644 index 6a92b26..0000000 --- a/parakeet/models/clarinet/utils.py +++ /dev/null @@ -1,38 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import division - -from paddle import fluid -from paddle.fluid.core import ops - - -@fluid.framework.dygraph_only -def conv2d(input, - weight, - stride=(1, 1), - padding=((0, 0), (0, 0)), - dilation=(1, 1), - groups=1, - use_cudnn=True, - data_format="NCHW"): - padding = tuple(pad for pad_dim in padding for pad in pad_dim) - - attrs = ('strides', stride, 'paddings', padding, 'dilations', dilation, - 'groups', groups, 'use_cudnn', use_cudnn, 'use_mkldnn', False, - 'fuse_relu_before_depthwise_conv', False, "padding_algorithm", - "EXPLICIT", "data_format", data_format) - - out = ops.conv2d(input, weight, *attrs) - return out diff --git a/parakeet/models/deepvoice3/model.py b/parakeet/models/deepvoice3.py similarity index 63% rename from parakeet/models/deepvoice3/model.py rename to parakeet/models/deepvoice3.py index 1da13a9..896c119 100644 --- a/parakeet/models/deepvoice3/model.py +++ b/parakeet/models/deepvoice3.py @@ -1,35 +1,16 @@ -import numpy as np import math +import numpy as np + import paddle -from paddle import fluid -from paddle.fluid import layers as F -from paddle.fluid import initializer as I -from paddle.fluid import dygraph as dg +from paddle import nn +from paddle.nn import functional as F +from paddle.nn import initializer as I -from .conv import Conv1D -from .weight_norm_hook import weight_norm, remove_weight_norm +from parakeet.modules import positional_encoding as pe -def positional_encoding(tensor, start_index, omega): - """ - tensor: a reference tensor we use to get shape. actually only T and C are needed. Shape(B, T, C) - start_index: int, we can actually use start and length to specify them. 
- omega (B,): speaker position rates +__all__ = ["SpectraNet"] - return (B, T, C), position embedding - """ - dtype = omega.dtype - _, length, dimension = tensor.shape - index = F.range(start_index, start_index + length, 1, dtype=dtype) - channel = F.range(0, dimension, 2, dtype=dtype) - - p = F.unsqueeze(omega, [1, 2]) \ - * F.unsqueeze(index, [1]) \ - / (10000 ** (channel / float(dimension))) - - encodings = F.concat([F.sin(p), F.cos(p)], axis=2) - return encodings - -class ConvBlock(dg.Layer): +class ConvBlock(nn.Layer): def __init__(self, in_channel, kernel_size, causal=False, has_bias=False, bias_dim=None, keep_prob=1.): super(ConvBlock, self).__init__() @@ -38,55 +19,56 @@ class ConvBlock(dg.Layer): self.in_channel = in_channel self.has_bias = has_bias - std = np.sqrt(4 * keep_prob / (kernel_size * in_channel)) + std = math.sqrt(4 * keep_prob / (kernel_size * in_channel)) padding = "valid" if causal else "same" - conv = Conv1D(in_channel, 2 * in_channel, (kernel_size, ), - padding=padding, - data_format="NTC", - param_attr=I.Normal(scale=std)) - self.conv = weight_norm(conv) + conv = nn.Conv1D(in_channel, 2 * in_channel, (kernel_size, ), + padding=padding, + data_format="NLC", + weight_attr=I.Normal(scale=std)) + self.conv = nn.utils.weight_norm(conv) if has_bias: - std = np.sqrt(1 / bias_dim) - self.bias_affine = dg.Linear(bias_dim, 2 * in_channel, param_attr=I.Normal(scale=std)) + std = math.sqrt(1 / bias_dim) + self.bias_affine = nn.Linear(bias_dim, 2 * in_channel, + weight_attr=I.Normal(scale=std)) def forward(self, input, bias=None, padding=None): """ input: input feature (B, T, C) padding: only used when using causal conv, we pad mannually """ - input_dropped = F.dropout(input, 1. - self.keep_prob, - dropout_implementation="upscale_in_train") + input_dropped = F.dropout(input, 1. 
- self.keep_prob, training=self.training) if self.causal: assert padding is not None - input_dropped = F.concat([padding, input_dropped], axis=1) + input_dropped = paddle.concat([padding, input_dropped], axis=1) hidden = self.conv(input_dropped) if self.has_bias: assert bias is not None transformed_bias = F.softsign(self.bias_affine(bias)) - hidden_embedded = hidden + F.unsqueeze(transformed_bias, [1]) + hidden_embedded = hidden + paddle.unsqueeze(transformed_bias, 1) else: hidden_embedded = hidden # glu - content, gate = F.split(hidden, num_or_sections=2, dim=-1) + content, gate = paddle.chunk(hidden, 2, axis=-1) content = hidden_embedded[:, :, :self.in_channel] hidden = F.sigmoid(gate) * content # # residual - hidden = F.scale(input + hidden, math.sqrt(0.5)) + hidden = paddle.scale(input + hidden, math.sqrt(0.5)) return hidden -class AffineBlock1(dg.Layer): +class AffineBlock1(nn.Layer): def __init__(self, in_channel, out_channel, has_bias=False, bias_dim=0): super(AffineBlock1, self).__init__() - std = np.sqrt(1.0 / in_channel) - affine = dg.Linear(in_channel, out_channel, param_attr=I.Normal(scale=std)) - self.affine = weight_norm(affine, dim=-1) + std = math.sqrt(1.0 / in_channel) + affine = nn.Linear(in_channel, out_channel, weight_attr=I.Normal(scale=std)) + self.affine = nn.utils.weight_norm(affine, dim=-1) if has_bias: - std = np.sqrt(1 / bias_dim) - self.bias_affine = dg.Linear(bias_dim, out_channel, param_attr=I.Normal(scale=std)) + std = math.sqrt(1 / bias_dim) + self.bias_affine = nn.Linear(bias_dim, out_channel, + weight_attr=I.Normal(scale=std)) self.has_bias = has_bias self.bias_dim = bias_dim @@ -101,20 +83,20 @@ class AffineBlock1(dg.Layer): if self.has_bias: assert bias is not None transformed_bias = F.softsign(self.bias_affine(bias)) - hidden += F.unsqueeze(transformed_bias, [1]) + hidden += paddle.unsqueeze(transformed_bias, 1) return hidden -class AffineBlock2(dg.Layer): +class AffineBlock2(nn.Layer): def __init__(self, in_channel, out_channel, has_bias=False, bias_dim=0, dropout=False, keep_prob=1.): super(AffineBlock2, self).__init__() if has_bias: - std = np.sqrt(1 / bias_dim) - self.bias_affine = dg.Linear(bias_dim, in_channel, param_attr=I.Normal(scale=std)) - std = np.sqrt(1.0 / in_channel) - affine = dg.Linear(in_channel, out_channel, param_attr=I.Normal(scale=std)) - self.affine = weight_norm(affine, dim=-1) + std = math.sqrt(1 / bias_dim) + self.bias_affine = nn.Linear(bias_dim, in_channel, weight_attr=I.Normal(scale=std)) + std = math.sqrt(1.0 / in_channel) + affine = nn.Linear(in_channel, out_channel, weight_attr=I.Normal(scale=std)) + self.affine = nn.utils.weight_norm(affine, dim=-1) self.has_bias = has_bias self.bias_dim = bias_dim @@ -130,22 +112,21 @@ class AffineBlock2(dg.Layer): """ hidden = input if self.dropout: - hidden = F.dropout(hidden, 1. - self.keep_prob, - dropout_implementation="upscale_in_train") + hidden = F.dropout(hidden, 1. 
- self.keep_prob, training=self.training) if self.has_bias: assert bias is not None transformed_bias = F.softsign(self.bias_affine(bias)) - hidden += F.unsqueeze(transformed_bias, [1]) + hidden += paddle.unsqueeze(transformed_bias, 1) hidden = F.relu(self.affine(hidden)) return hidden -class Encoder(dg.Layer): +class Encoder(nn.Layer): def __init__(self, layers, in_channels, encoder_dim, kernel_size, has_bias=False, bias_dim=0, keep_prob=1.): super(Encoder, self).__init__() self.pre_affine = AffineBlock1(in_channels, encoder_dim, has_bias, bias_dim) - self.convs = dg.LayerList([ + self.convs = nn.LayerList([ ConvBlock(encoder_dim, kernel_size, False, has_bias, bias_dim, keep_prob) \ for _ in range(layers)]) self.post_affine = AffineBlock1(encoder_dim, in_channels, has_bias, bias_dim) @@ -156,11 +137,11 @@ class Encoder(dg.Layer): hidden = layer(hidden, speaker_embed) hidden = self.post_affine(hidden, speaker_embed) keys = hidden - values = F.scale(char_embed + hidden, np.sqrt(0.5)) + values = paddle.scale(char_embed + hidden, math.sqrt(0.5)) return keys, values -class AttentionBlock(dg.Layer): +class AttentionBlock(nn.Layer): def __init__(self, attention_dim, input_dim, position_encoding_weight=1., position_rate=1., reduction_factor=1, has_bias=False, bias_dim=0, keep_prob=1.): @@ -170,31 +151,37 @@ class AttentionBlock(dg.Layer): self.omega_default = omega_default # multispeaker case if has_bias: - std = np.sqrt(1.0 / bias_dim) - self.q_pos_affine = dg.Linear(bias_dim, 1, param_attr=I.Normal(scale=std)) - self.k_pos_affine = dg.Linear(bias_dim, 1, param_attr=I.Normal(scale=std)) + std = math.sqrt(1.0 / bias_dim) + self.q_pos_affine = nn.Linear(bias_dim, 1, weight_attr=I.Normal(scale=std)) + self.k_pos_affine = nn.Linear(bias_dim, 1, weight_attr=I.Normal(scale=std)) self.omega_initial = self.create_parameter(shape=[1], - attr=I.ConstantInitializer(value=omega_default)) + attr=I.Constant(value=omega_default)) # mind the fact that q, k, v have the same feature dimension # so we can init k_affine and q_affine's weight as the same matrix # to get a better init attention + dtype = self.omega_initial.numpy().dtype init_weight = np.random.normal(size=(input_dim, attention_dim), - scale=np.sqrt(1. / input_dim)) - initializer = I.NumpyArrayInitializer(init_weight.astype(np.float32)) + scale=np.sqrt(1. 
/ input_dim)).astype(dtype) + # TODO(chenfeiyu): to report an issue, there is no such initializer + #initializer = paddle.fluid.initializer.NumpyArrayInitializer(init_weight) # 3 affine transformation to project q, k, v into attention_dim - q_affine = dg.Linear(input_dim, attention_dim, param_attr=initializer) - self.q_affine = weight_norm(q_affine, dim=-1) - k_affine = dg.Linear(input_dim, attention_dim, param_attr=initializer) - self.k_affine = weight_norm(k_affine, dim=-1) + q_affine = nn.Linear(input_dim, attention_dim) + self.q_affine = nn.utils.weight_norm(q_affine, dim=-1) + k_affine = nn.Linear(input_dim, attention_dim) + self.k_affine = nn.utils.weight_norm(k_affine, dim=-1) + + # better to use this, since NumpyInitializer does not support float64 + self.q_affine.weight.set_value(init_weight) + self.k_affine.weight.set_value(init_weight) std = np.sqrt(1.0 / input_dim) - v_affine = dg.Linear(input_dim, attention_dim, param_attr=I.Normal(scale=std)) - self.v_affine = weight_norm(v_affine, dim=-1) + v_affine = nn.Linear(input_dim, attention_dim, weight_attr=I.Normal(scale=std)) + self.v_affine = nn.utils.weight_norm(v_affine, dim=-1) std = np.sqrt(1.0 / attention_dim) - out_affine = dg.Linear(attention_dim, input_dim, param_attr=I.Normal(scale=std)) - self.out_affine = weight_norm(out_affine, dim=-1) + out_affine = nn.Linear(attention_dim, input_dim, weight_attr=I.Normal(scale=std)) + self.out_affine = nn.utils.weight_norm(out_affine, dim=-1) self.keep_prob = keep_prob self.has_bias = has_bias @@ -204,28 +191,30 @@ class AttentionBlock(dg.Layer): def forward(self, q, k, v, lengths, speaker_embed, start_index, force_monotonic=False, prev_coeffs=None, window=None): + dtype = self.omega_initial.dtype # add position encoding as an inductive bias if self.has_bias: # multi-speaker model omega_q = 2 * F.sigmoid( - F.squeeze(self.q_pos_affine(speaker_embed), axes=[-1])) - omega_k = 2 * self.omega_initial * F.sigmoid(F.squeeze( - self.k_pos_affine(speaker_embed), axes=[-1])) + paddle.squeeze(self.q_pos_affine(speaker_embed), -1)) + omega_k = 2 * self.omega_initial * F.sigmoid(paddle.squeeze( + self.k_pos_affine(speaker_embed), -1)) else: # single-speaker case batch_size = q.shape[0] - omega_q = F.ones((batch_size, ), dtype="float32") - omega_k = F.ones((batch_size, ), dtype="float32") * self.omega_default - q += self.position_encoding_weight * positional_encoding(q, start_index, omega_q) - k += self.position_encoding_weight * positional_encoding(k, 0, omega_k) + omega_q = paddle.ones((batch_size, ), dtype=dtype) + omega_k = paddle.ones((batch_size, ), dtype=dtype) * self.omega_default + q += self.position_encoding_weight * pe.scalable_positional_encoding(start_index, q.shape[1], q.shape[-1], omega_q) + k += self.position_encoding_weight * pe.scalable_positional_encoding(0, k.shape[1], k.shape[-1], omega_k) + q, k, v = self.q_affine(q), self.k_affine(k), self.v_affine(v) - activations = F.matmul(q, k, transpose_y=True) - activations /= np.sqrt(self.attention_dim) + activations = paddle.matmul(q, k, transpose_y=True) + activations /= math.sqrt(self.attention_dim) if self.training: # mask the parts from the encoder - mask = F.sequence_mask(lengths, dtype="float32") - attn_bias = F.scale(1. - mask, -1000) - activations += F.unsqueeze(attn_bias, [1]) + mask = paddle.fluid.layers.sequence_mask(lengths, dtype=dtype) + attn_bias = paddle.scale(1. 
- mask, -1000) + activations += paddle.unsqueeze(attn_bias, 1) elif force_monotonic: assert window is not None backward_step, forward_step = window @@ -233,31 +222,30 @@ class AttentionBlock(dg.Layer): batch_size, T_dec, _ = q.shape # actually T_dec = 1 here - alpha = F.fill_constant((batch_size, T_dec), value=0, dtype="int64") \ + alpha = paddle.fill_constant((batch_size, T_dec), value=0, dtype="int64") \ if prev_coeffs is None \ - else F.argmax(prev_coeffs, axis=-1) - backward = F.sequence_mask(alpha - backward_step, maxlen=T_enc, dtype="bool") - forward = F.sequence_mask(alpha + forward_step, maxlen=T_enc, dtype="bool") - mask = F.cast(F.logical_xor(backward, forward), "float32") + else paddle.argmax(prev_coeffs, axis=-1) + backward = paddle.fluid.layers.sequence_mask(alpha - backward_step, maxlen=T_enc, dtype="bool") + forward = paddle.fluid.layers.sequence_mask(alpha + forward_step, maxlen=T_enc, dtype="bool") + mask = paddle.cast(paddle.logical_xor(backward, forward), activations.dtype) # print("mask's shape:", mask.shape) - attn_bias = F.scale(1. - mask, -1000) + attn_bias = paddle.scale(1. - mask, -1000) activations += attn_bias # softmax coefficients = F.softmax(activations, axis=-1) # context vector - coefficients = F.dropout(coefficients, 1. - self.keep_prob, - dropout_implementation='upscale_in_train') - contexts = F.matmul(coefficients, v) + coefficients = F.dropout(coefficients, 1. - self.keep_prob, training=self.training) + contexts = paddle.matmul(coefficients, v) # context normalization - enc_lengths = F.cast(F.unsqueeze(lengths, axes=[1, 2]), "float32") - contexts *= F.sqrt(enc_lengths) + enc_lengths = paddle.cast(paddle.unsqueeze(lengths, axis=[1, 2]), contexts.dtype) + contexts *= paddle.sqrt(enc_lengths) # out affine contexts = self.out_affine(contexts) return contexts, coefficients + - -class Decoder(dg.Layer): +class Decoder(nn.Layer): def __init__(self, in_channels, reduction_factor, prenet_sizes, layers, kernel_size, attention_dim, position_encoding_weight=1., omega=1., @@ -265,7 +253,7 @@ class Decoder(dg.Layer): super(Decoder, self).__init__() # prenet-mind the difference of AffineBlock2 and AffineBlock1 c_in = in_channels - self.prenet = dg.LayerList() + self.prenet = nn.LayerList() for i, c_out in enumerate(prenet_sizes): affine = AffineBlock2(c_in, c_out, has_bias, bias_dim, dropout=(i!=0), keep_prob=keep_prob) self.prenet.append(affine) @@ -273,8 +261,8 @@ class Decoder(dg.Layer): # causal convolutions + multihop attention decoder_dim = prenet_sizes[-1] - self.causal_convs = dg.LayerList() - self.attention_blocks = dg.LayerList() + self.causal_convs = nn.LayerList() + self.attention_blocks = nn.LayerList() for i in range(layers): conv = ConvBlock(decoder_dim, kernel_size, True, has_bias, bias_dim, keep_prob) attn = AttentionBlock(attention_dim, decoder_dim, position_encoding_weight, omega, reduction_factor, has_bias, bias_dim, keep_prob) @@ -283,12 +271,12 @@ class Decoder(dg.Layer): # output mel spectrogram output_dim = reduction_factor * in_channels # r * mel_dim - std = np.sqrt(1.0 / decoder_dim) - out_affine = dg.Linear(decoder_dim, output_dim, param_attr=I.Normal(scale=std)) - self.out_affine = weight_norm(out_affine, dim=-1) + std = math.sqrt(1.0 / decoder_dim) + out_affine = nn.Linear(decoder_dim, output_dim, weight_attr=I.Normal(scale=std)) + self.out_affine = nn.utils.weight_norm(out_affine, dim=-1) if has_bias: - std = np.sqrt(1 / bias_dim) - self.out_sp_affine = dg.Linear(bias_dim, output_dim, param_attr=I.Normal(scale=std)) + std = math.sqrt(1 / 
bias_dim) + self.out_sp_affine = nn.Linear(bias_dim, output_dim, weight_attr=I.Normal(scale=std)) self.has_bias = has_bias self.kernel_size = kernel_size @@ -311,10 +299,10 @@ class Decoder(dg.Layer): for i in range(len(self.causal_convs)): if state is None: - padding = F.zeros(causal_padding_shape, dtype="float32") + padding = paddle.zeros(causal_padding_shape, dtype=inputs.dtype) else: padding = state[i] - new_state = F.concat([padding, hidden], axis=1) # => to be used next step + new_state = paddle.concat([padding, hidden], axis=1) # => to be used next step # causal conv, (B, T, C) hidden = self.causal_convs[i](hidden, speaker_embed, padding=padding) # attn @@ -324,7 +312,7 @@ class Decoder(dg.Layer): hidden, keys, values, lengths, speaker_embed, start_index, force_monotonic, prev_coeffs, window) # residual connextion (B, T_dec, C_dec) - hidden = F.scale(hidden + context, np.sqrt(0.5)) + hidden = paddle.scale(hidden + context, math.sqrt(0.5)) attentions.append(attention) # layers * (B, T_dec, T_enc) # new state: shift a step, layers * (B, T, C) @@ -334,34 +322,35 @@ class Decoder(dg.Layer): # predict mel spectrogram (B, 1, T_dec, r * C_in) decoded = self.out_affine(hidden) if self.has_bias: - decoded *= F.sigmoid(F.unsqueeze(self.out_sp_affine(speaker_embed), [1])) + decoded *= F.sigmoid(paddle.unsqueeze(self.out_sp_affine(speaker_embed), 1)) return decoded, hidden, attentions, final_state -class PostNet(dg.Layer): +class PostNet(nn.Layer): def __init__(self, layers, in_channels, postnet_dim, kernel_size, out_channels, upsample_factor, has_bias=False, bias_dim=0, keep_prob=1.): super(PostNet, self).__init__() self.pre_affine = AffineBlock1(in_channels, postnet_dim, has_bias, bias_dim) - self.convs = dg.LayerList([ + self.convs = nn.LayerList([ ConvBlock(postnet_dim, kernel_size, False, has_bias, bias_dim, keep_prob) for _ in range(layers) ]) - std = np.sqrt(1.0 / postnet_dim) - post_affine = dg.Linear(postnet_dim, out_channels, param_attr=I.Normal(scale=std)) - self.post_affine = weight_norm(post_affine, dim=-1) + std = math.sqrt(1.0 / postnet_dim) + post_affine = nn.Linear(postnet_dim, out_channels, weight_attr=I.Normal(scale=std)) + self.post_affine = nn.utils.weight_norm(post_affine, dim=-1) self.upsample_factor = upsample_factor def forward(self, hidden, speaker_embed=None): hidden = self.pre_affine(hidden, speaker_embed) batch_size, time_steps, channels = hidden.shape # pylint: disable=unused-variable - hidden = F.expand(hidden, [1, 1, self.upsample_factor]) - hidden = F.reshape(hidden, [batch_size, -1, channels]) + # NOTE: paddle.expand can only expand dimension whose size is 1 + hidden = paddle.expand(paddle.unsqueeze(hidden, 2), [-1, -1, self.upsample_factor, -1]) + hidden = paddle.reshape(hidden, [batch_size, -1, channels]) for layer in self.convs: hidden = layer(hidden, speaker_embed) spec = self.post_affine(hidden) return spec -class SpectraNet(dg.Layer): +class SpectraNet(nn.Layer): def __init__(self, char_embedding, speaker_embedding, encoder, decoder, postnet): super(SpectraNet, self).__init__() self.char_embedding = char_embedding @@ -386,33 +375,33 @@ class SpectraNet(dg.Layer): # build decoder inputs by shifting over by one frame and add all zero frame # the mel input is downsampled by a reduction factor batch_size = mel.shape[0] - mel_input = F.reshape(mel, (batch_size, -1, self.decoder.reduction_factor, self.decoder.in_channels)) - zero_frame = F.zeros((batch_size, 1, self.decoder.in_channels), dtype="float32") + mel_input = paddle.reshape(mel, (batch_size, -1, 
self.decoder.reduction_factor, self.decoder.in_channels)) + zero_frame = paddle.zeros((batch_size, 1, self.decoder.in_channels), dtype=mel.dtype) # downsample mel input as a regularization - mel_input = F.concat([zero_frame, mel_input[:, :-1, -1, :]], axis=1) + mel_input = paddle.concat([zero_frame, mel_input[:, :-1, -1, :]], axis=1) # decoder decoded, hidden, attentions, final_state = self.decoder(mel_input, keys, values, text_lengths, 0, speaker_embed) - attentions = F.stack(attentions) # (N, B, T_dec, T_encs) + attentions = paddle.stack(attentions) # (N, B, T_dec, T_encs) # unfold frames - decoded = F.reshape(decoded, (batch_size, -1, self.decoder.in_channels)) + decoded = paddle.reshape(decoded, (batch_size, -1, self.decoder.in_channels)) # postnet refined = self.postnet(hidden, speaker_embed) return decoded, refined, attentions, final_state def spec_loss(self, decoded, input, num_frames=None): if num_frames is None: - l1_loss = F.reduce_mean(F.abs(decoded - input)) + l1_loss = paddle.mean(paddle.abs(decoded - input)) else: # mask the part of the decoder num_channels = decoded.shape[-1] - l1_loss = F.abs(decoded - input) - mask = F.sequence_mask(num_frames, dtype="float32") - l1_loss *= F.unsqueeze(mask, axes=[-1]) - l1_loss = F.reduce_sum(l1_loss) / F.scale(F.reduce_sum(mask), num_channels) + l1_loss = paddle.abs(decoded - input) + mask = paddle.fluid.layers.sequence_mask(num_frames, dtype=decoded.dtype) + l1_loss *= paddle.unsqueeze(mask, axis=-1) + l1_loss = paddle.sum(l1_loss) / paddle.scale(paddle.sum(mask), num_channels) return l1_loss - @dg.no_grad + @paddle.no_grad() def inference(self, keys, values, text_lengths, speaker_embed, force_monotonic_attention, window): MAX_STEP = 500 @@ -430,17 +419,17 @@ class SpectraNet(dg.Layer): # so we only supports batch_size == 0 in inference def should_continue(i, mel_input, outputs, hidden, attention, state, coeffs): T_enc = coeffs.shape[-1] - attn_peak = F.argmax(coeffs[first_mono_attention_layer, 0, 0]) \ + attn_peak = paddle.argmax(coeffs[first_mono_attention_layer, 0, 0]) \ if num_monotonic_attention_layers > 0 \ - else F.fill_constant([1], "int64", value=0) - return i < MAX_STEP and F.reshape(attn_peak, [1]) < T_enc - 1 + else paddle.fill_constant([1], "int64", value=0) + return i < MAX_STEP and paddle.reshape(attn_peak, [1]) < T_enc - 1 def loop_body(i, mel_input, outputs, hiddens, attentions, state=None, coeffs=None): # state is None coeffs is None for the first step decoded, hidden, new_coeffs, new_state = self.decoder( mel_input, keys, values, text_lengths, i, speaker_embed, state, force_monotonic_attention, coeffs, window) - new_coeffs = F.stack(new_coeffs) # (N, B, T_dec=1, T_enc) + new_coeffs = paddle.stack(new_coeffs) # (N, B, T_dec=1, T_enc) attentions.append(new_coeffs) # (N, B, T_dec=1, T_enc) outputs.append(decoded) # (B, T_dec=1, rC_mel) @@ -448,13 +437,13 @@ class SpectraNet(dg.Layer): # slice the last frame out of r generated frames to be used as the input for the next step batch_size = mel_input.shape[0] - frames = F.reshape(decoded, [batch_size, -1, self.decoder.reduction_factor, self.decoder.in_channels]) + frames = paddle.reshape(decoded, [batch_size, -1, self.decoder.reduction_factor, self.decoder.in_channels]) input_frame = frames[:, :, -1, :] return (i + 1, input_frame, outputs, hiddens, attentions, new_state, new_coeffs) i = 0 batch_size = keys.shape[0] - input_frame = F.zeros((batch_size, 1, self.decoder.in_channels), dtype="float32") + input_frame = paddle.zeros((batch_size, 1, self.decoder.in_channels), 
dtype=keys.dtype) outputs = [] hiddens = [] attentions = [] @@ -465,12 +454,12 @@ class SpectraNet(dg.Layer): outputs, hiddens, attention = loop_state[2], loop_state[3], loop_state[4] # concat decoder timesteps - outputs = F.concat(outputs, axis=1) - hiddens = F.concat(hiddens, axis=1) - attention = F.concat(attention, axis=2) + outputs = paddle.concat(outputs, axis=1) + hiddens = paddle.concat(hiddens, axis=1) + attention = paddle.concat(attention, axis=2) # unfold frames - outputs = F.reshape(outputs, (batch_size, -1, self.decoder.in_channels)) + outputs = paddle.reshape(outputs, (batch_size, -1, self.decoder.in_channels)) refined = self.postnet(hiddens, speaker_embed) return outputs, refined, attention diff --git a/parakeet/models/deepvoice3/__init__.py b/parakeet/models/deepvoice3/__init__.py deleted file mode 100644 index 71a1d85..0000000 --- a/parakeet/models/deepvoice3/__init__.py +++ /dev/null @@ -1 +0,0 @@ -from .model import * \ No newline at end of file diff --git a/parakeet/models/deepvoice3/conv.py b/parakeet/models/deepvoice3/conv.py deleted file mode 100644 index d6a5c3d..0000000 --- a/parakeet/models/deepvoice3/conv.py +++ /dev/null @@ -1,245 +0,0 @@ -import numpy as np -from paddle.fluid import layers as F -from paddle.fluid.framework import Variable, in_dygraph_mode -from paddle.fluid import core, dygraph_utils -from paddle.fluid.layers import nn, utils -from paddle.fluid.data_feeder import check_variable_and_dtype -from paddle.fluid.param_attr import ParamAttr -from paddle.fluid.layer_helper import LayerHelper -from paddle.fluid.dygraph import layers -from paddle.fluid.initializer import Normal - - -def _is_list_or_tuple(input): - return isinstance(input, (list, tuple)) - - -def _zero_padding_in_batch_and_channel(padding, channel_last): - if channel_last: - return list(padding[0]) == [0, 0] and list(padding[-1]) == [0, 0] - else: - return list(padding[0]) == [0, 0] and list(padding[1]) == [0, 0] - - -def _exclude_padding_in_batch_and_channel(padding, channel_last): - padding_ = padding[1:-1] if channel_last else padding[2:] - padding_ = [elem for pad_a_dim in padding_ for elem in pad_a_dim] - return padding_ - - -def _update_padding_nd(padding, channel_last, num_dims): - if isinstance(padding, str): - padding = padding.upper() - if padding not in ["SAME", "VALID"]: - raise ValueError( - "Unknown padding: '{}'. It can only be 'SAME' or 'VALID'.". - format(padding)) - if padding == "VALID": - padding_algorithm = "VALID" - padding = [0] * num_dims - else: - padding_algorithm = "SAME" - padding = [0] * num_dims - elif _is_list_or_tuple(padding): - # for padding like - # [(pad_before, pad_after), (pad_before, pad_after), ...] - # padding for batch_dim and channel_dim included - if len(padding) == 2 + num_dims and _is_list_or_tuple(padding[0]): - if not _zero_padding_in_batch_and_channel(padding, channel_last): - raise ValueError( - "Non-zero padding({}) in the batch or channel dimensions " - "is not supported.".format(padding)) - padding_algorithm = "EXPLICIT" - padding = _exclude_padding_in_batch_and_channel(padding, - channel_last) - if utils._is_symmetric_padding(padding, num_dims): - padding = padding[0::2] - # for padding like [pad_before, pad_after, pad_before, pad_after, ...] - elif len(padding) == 2 * num_dims and isinstance(padding[0], int): - padding_algorithm = "EXPLICIT" - padding = utils.convert_to_list(padding, 2 * num_dims, 'padding') - if utils._is_symmetric_padding(padding, num_dims): - padding = padding[0::2] - # for padding like [pad_d1, pad_d2, ...] 
- elif len(padding) == num_dims and isinstance(padding[0], int): - padding_algorithm = "EXPLICIT" - padding = utils.convert_to_list(padding, num_dims, 'padding') - else: - raise ValueError("In valid padding: {}".format(padding)) - # for integer padding - else: - padding_algorithm = "EXPLICIT" - padding = utils.convert_to_list(padding, num_dims, 'padding') - return padding, padding_algorithm - -def _get_default_param_initializer(num_channels, filter_size): - filter_elem_num = num_channels * np.prod(filter_size) - std = (2.0 / filter_elem_num)**0.5 - return Normal(0.0, std, 0) - -def conv1d(input, - weight, - bias=None, - padding=0, - stride=1, - dilation=1, - groups=1, - use_cudnn=True, - act=None, - data_format="NCT", - name=None): - # entry checks - if not isinstance(use_cudnn, bool): - raise ValueError("Attr(use_cudnn) should be True or False. " - "Received Attr(use_cudnn): {}.".format(use_cudnn)) - if data_format not in ["NCT", "NTC"]: - raise ValueError("Attr(data_format) should be 'NCT' or 'NTC'. " - "Received Attr(data_format): {}.".format(data_format)) - - channel_last = (data_format == "NTC") - channel_dim = -1 if channel_last else 1 - num_channels = input.shape[channel_dim] - num_filters = weight.shape[0] - if num_channels < 0: - raise ValueError("The channel dimmention of the input({}) " - "should be defined. Received: {}.".format( - input.shape, num_channels)) - if num_channels % groups != 0: - raise ValueError( - "the channel of input must be divisible by groups," - "received: the channel of input is {}, the shape of input is {}" - ", the groups is {}".format(num_channels, input.shape, groups)) - if num_filters % groups != 0: - raise ValueError( - "the number of filters must be divisible by groups," - "received: the number of filters is {}, the shape of weight is {}" - ", the groups is {}".format(num_filters, weight.shape, groups)) - - # update attrs - padding, padding_algorithm = _update_padding_nd(padding, channel_last, 1) - if len(padding) == 1: # synmmetric padding - padding = [0,] + padding - else: - # len(padding) == 2 - padding = [0, 0] + padding - stride = [1,] + utils.convert_to_list(stride, 1, 'stride') - dilation = [1,] + utils.convert_to_list(dilation, 1, 'dilation') - data_format = "NHWC" if channel_last else "NCHW" - - l_type = "conv2d" - - if (num_channels == groups and num_filters % num_channels == 0 and - not use_cudnn): - l_type = 'depthwise_conv2d' - weight = F.unsqueeze(weight, [2]) - input = F.unsqueeze(input, [1]) if channel_last else F.unsqueeze(input, [2]) - - if in_dygraph_mode(): - attrs = ('strides', stride, 'paddings', padding, 'dilations', dilation, - 'groups', groups, 'use_cudnn', use_cudnn, 'use_mkldnn', False, - 'fuse_relu_before_depthwise_conv', False, "padding_algorithm", - padding_algorithm, "data_format", data_format) - pre_bias = getattr(core.ops, l_type)(input, weight, *attrs) - if bias is not None: - pre_act = nn.elementwise_add(pre_bias, bias, axis=channel_dim) - else: - pre_act = pre_bias - out = dygraph_utils._append_activation_in_dygraph( - pre_act, act, use_cudnn=use_cudnn) - else: - inputs = {'Input': [input], 'Filter': [weight]} - attrs = { - 'strides': stride, - 'paddings': padding, - 'dilations': dilation, - 'groups': groups, - 'use_cudnn': use_cudnn, - 'use_mkldnn': False, - 'fuse_relu_before_depthwise_conv': False, - "padding_algorithm": padding_algorithm, - "data_format": data_format - } - check_variable_and_dtype(input, 'input', - ['float16', 'float32', 'float64'], 'conv2d') - helper = LayerHelper(l_type, **locals()) - dtype = 
helper.input_dtype() - pre_bias = helper.create_variable_for_type_inference(dtype) - outputs = {"Output": [pre_bias]} - helper.append_op( - type=l_type, inputs=inputs, outputs=outputs, attrs=attrs) - if bias is not None: - pre_act = nn.elementwise_add(pre_bias, bias, axis=channel_dim) - else: - pre_act = pre_bias - out = helper.append_activation(pre_act) - out = F.squeeze(out, [1]) if channel_last else F.squeeze(out, [2]) - return out - -class Conv1D(layers.Layer): - def __init__(self, - num_channels, - num_filters, - filter_size, - padding=0, - stride=1, - dilation=1, - groups=1, - param_attr=None, - bias_attr=None, - use_cudnn=True, - act=None, - data_format="NCT", - dtype='float32'): - super(Conv1D, self).__init__() - assert param_attr is not False, "param_attr should not be False here." - self._num_channels = num_channels - self._num_filters = num_filters - self._groups = groups - if num_channels % groups != 0: - raise ValueError("num_channels must be divisible by groups.") - self._act = act - self._data_format = data_format - self._dtype = dtype - if not isinstance(use_cudnn, bool): - raise ValueError("use_cudnn should be True or False") - self._use_cudnn = use_cudnn - - self._filter_size = utils.convert_to_list(filter_size, 1, 'filter_size') - self._stride = utils.convert_to_list(stride, 1, 'stride') - self._dilation = utils.convert_to_list(dilation, 1, 'dilation') - channel_last = (data_format == "NTC") - self._padding = padding # leave it to F.conv1d - - self._param_attr = param_attr - self._bias_attr = bias_attr - - num_filter_channels = num_channels // groups - filter_shape = [self._num_filters, num_filter_channels - ] + self._filter_size - - self.weight = self.create_parameter( - attr=self._param_attr, - shape=filter_shape, - dtype=self._dtype, - default_initializer=_get_default_param_initializer( - self._num_channels, filter_shape)) - self.bias = self.create_parameter( - attr=self._bias_attr, - shape=[self._num_filters], - dtype=self._dtype, - is_bias=True) - - def forward(self, input): - out = conv1d( - input, - self.weight, - bias=self.bias, - padding=self._padding, - stride=self._stride, - dilation=self._dilation, - groups=self._groups, - use_cudnn=self._use_cudnn, - act=self._act, - data_format=self._data_format) - return out - diff --git a/parakeet/models/deepvoice3/weight_norm_hook.py b/parakeet/models/deepvoice3/weight_norm_hook.py deleted file mode 100644 index 6a4ba5d..0000000 --- a/parakeet/models/deepvoice3/weight_norm_hook.py +++ /dev/null @@ -1,148 +0,0 @@ -import paddle -import paddle.fluid.dygraph as dg - -import numpy as np -from paddle import fluid -import paddle.fluid.dygraph as dg -import paddle.fluid.layers as F -from paddle.fluid.layer_helper import LayerHelper -from paddle.fluid.data_feeder import check_variable_and_dtype - - -def l2_norm(x, axis, epsilon=1e-12, name=None): - if len(x.shape) == 1: - axis = 0 - check_variable_and_dtype(x, "X", ("float32", "float64"), "norm") - - helper = LayerHelper("l2_normalize", **locals()) - out = helper.create_variable_for_type_inference(dtype=x.dtype) - norm = helper.create_variable_for_type_inference(dtype=x.dtype) - helper.append_op( - type="norm", - inputs={"X": x}, - outputs={"Out": out, - "Norm": norm}, - attrs={ - "axis": 1 if axis is None else axis, - "epsilon": epsilon, - }) - return F.squeeze(norm, axes=[axis]) - -def norm_except_dim(p, dim): - shape = p.shape - ndims = len(shape) - if dim is None: - return F.sqrt(F.reduce_sum(F.square(p))) - elif dim == 0: - p_matrix = F.reshape(p, (shape[0], -1)) - return 
l2_norm(p_matrix, axis=1) - elif dim == -1 or dim == ndims - 1: - p_matrix = F.reshape(p, (-1, shape[-1])) - return l2_norm(p_matrix, axis=0) - else: - perm = list(range(ndims)) - perm[0] = dim - perm[dim] = 0 - p_transposed = F.transpose(p, perm) - return norm_except_dim(p_transposed, 0) - -def _weight_norm(v, g, dim): - shape = v.shape - ndims = len(shape) - - if dim is None: - v_normalized = v / (F.sqrt(F.reduce_sum(F.square(v))) + 1e-12) - elif dim == 0: - p_matrix = F.reshape(v, (shape[0], -1)) - v_normalized = F.l2_normalize(p_matrix, axis=1) - v_normalized = F.reshape(v_normalized, shape) - elif dim == -1 or dim == ndims - 1: - p_matrix = F.reshape(v, (-1, shape[-1])) - v_normalized = F.l2_normalize(p_matrix, axis=0) - v_normalized = F.reshape(v_normalized, shape) - else: - perm = list(range(ndims)) - perm[0] = dim - perm[dim] = 0 - p_transposed = F.transpose(v, perm) - transposed_shape = p_transposed.shape - p_matrix = F.reshape(p_transposed, (p_transposed.shape[0], -1)) - v_normalized = F.l2_normalize(p_matrix, axis=1) - v_normalized = F.reshape(v_normalized, transposed_shape) - v_normalized = F.transpose(v_normalized, perm) - weight = F.elementwise_mul(v_normalized, g, axis=dim if dim is not None else -1) - return weight - - -class WeightNorm(object): - def __init__(self, name, dim): - if dim is None: - dim = -1 - self.name = name - self.dim = dim - - def compute_weight(self, module): - g = getattr(module, self.name + '_g') - v = getattr(module, self.name + '_v') - w = _weight_norm(v, g, self.dim) - return w - - @staticmethod - def apply(module: dg.Layer, name, dim): - for k, hook in module._forward_pre_hooks.items(): - if isinstance(hook, WeightNorm) and hook.name == name: - raise RuntimeError("Cannot register two weight_norm hooks on " - "the same parameter {}".format(name)) - - if dim is None: - dim = -1 - - fn = WeightNorm(name, dim) - - # remove w from parameter list - w = getattr(module, name) - del module._parameters[name] - - # add g and v as new parameters and express w as g/||v|| * v - g_var = norm_except_dim(w, dim) - v = module.create_parameter(w.shape, dtype=w.dtype) - module.add_parameter(name + "_v", v) - g = module.create_parameter(g_var.shape, dtype=g_var.dtype) - module.add_parameter(name + "_g", g) - with dg.no_grad(): - F.assign(w, v) - F.assign(g_var, g) - setattr(module, name, fn.compute_weight(module)) - - # recompute weight before every forward() - module.register_forward_pre_hook(fn) - return fn - - def remove(self, module): - w_var = self.compute_weight(module) - delattr(module, self.name) - del module._parameters[self.name + '_g'] - del module._parameters[self.name + '_v'] - w = module.create_parameter(w_var.shape, dtype=w_var.dtype) - module.add_parameter(self.name, w) - with dg.no_grad(): - F.assign(w_var, w) - - def __call__(self, module, inputs): - setattr(module, self.name, self.compute_weight(module)) - - -def weight_norm(module, name='weight', dim=0): - WeightNorm.apply(module, name, dim) - return module - - -def remove_weight_norm(module, name='weight'): - for k, hook in module._forward_pre_hooks.items(): - if isinstance(hook, WeightNorm) and hook.name == name: - hook.remove(module) - del module._forward_pre_hooks[k] - return module - - raise ValueError("weight_norm of '{}' not found in {}" - .format(name, module)) \ No newline at end of file diff --git a/parakeet/models/transformer_tts.py b/parakeet/models/transformer_tts.py new file mode 100644 index 0000000..e39404c --- /dev/null +++ b/parakeet/models/transformer_tts.py @@ -0,0 +1,536 @@ 
+import math +from tqdm import trange +import paddle +from paddle import nn +from paddle.nn import functional as F +from paddle.nn import initializer as I + +import parakeet +from parakeet.modules.attention import _split_heads, _concat_heads, drop_head, scaled_dot_product_attention +from parakeet.modules.transformer import PositionwiseFFN +from parakeet.modules import masking +from parakeet.modules.conv import Conv1dBatchNorm +from parakeet.modules import positional_encoding as pe +from parakeet.modules import losses as L + +__all__ = ["TransformerTTS", "TransformerTTSLoss"] + +# Transformer TTS's own implementation of transformer +class MultiheadAttention(nn.Layer): + """ + Multihead scaled dot product attention with drop head. See + [Scheduled DropHead: A Regularization Method for Transformer Models](https://arxiv.org/abs/2004.13342) + for details. + + Another deviation is that it concats the input query and context vector before + applying the output projection. + """ + def __init__(self, model_dim, num_heads, k_dim=None, v_dim=None, k_input_dim=None, v_input_dim=None): + """ + Args: + model_dim (int): the feature size of query. + num_heads (int): the number of attention heads. + k_dim (int, optional): feature size of the key of each scaled dot + product attention. If not provided, it is set to + model_dim / num_heads. Defaults to None. + v_dim (int, optional): feature size of the key of each scaled dot + product attention. If not provided, it is set to + model_dim / num_heads. Defaults to None. + + Raises: + ValueError: if model_dim is not divisible by num_heads + """ + super(MultiheadAttention, self).__init__() + if model_dim % num_heads !=0: + raise ValueError("model_dim must be divisible by num_heads") + depth = model_dim // num_heads + k_dim = k_dim or depth + v_dim = v_dim or depth + k_input_dim = k_input_dim or model_dim + v_input_dim = v_input_dim or model_dim + self.affine_q = nn.Linear(model_dim, num_heads * k_dim) + self.affine_k = nn.Linear(k_input_dim, num_heads * k_dim) + self.affine_v = nn.Linear(v_input_dim, num_heads * v_dim) + self.affine_o = nn.Linear(model_dim + num_heads * v_dim, model_dim) + + self.num_heads = num_heads + self.model_dim = model_dim + + def forward(self, q, k, v, mask, drop_n_heads=0): + """ + Compute context vector and attention weights. + + Args: + q (Tensor): shape(batch_size, time_steps_q, model_dim), the queries. + k (Tensor): shape(batch_size, time_steps_k, model_dim), the keys. + v (Tensor): shape(batch_size, time_steps_k, model_dim), the values. + mask (Tensor): shape(batch_size, times_steps_q, time_steps_k) or + broadcastable shape, dtype: float32 or float64, the mask. + + Returns: + out (Tensor), shape(batch_size, time_steps_q, model_dim), the context vector. + attention_weights (Tensor): shape(batch_size, times_steps_q, time_steps_k), the attention weights. 
+ """ + q_in = q + q = _split_heads(self.affine_q(q), self.num_heads) # (B, h, T, C) + k = _split_heads(self.affine_k(k), self.num_heads) + v = _split_heads(self.affine_v(v), self.num_heads) + if mask is not None: + mask = paddle.unsqueeze(mask, 1) # unsqueeze for the h dim + + context_vectors, attention_weights = scaled_dot_product_attention( + q, k, v, mask, training=self.training) + context_vectors = drop_head(context_vectors, drop_n_heads, self.training) + context_vectors = _concat_heads(context_vectors) # (B, T, h*C) + + concat_feature = paddle.concat([q_in, context_vectors], -1) + out = self.affine_o(concat_feature) + return out, attention_weights + + +class TransformerEncoderLayer(nn.Layer): + """ + Transformer encoder layer. + """ + def __init__(self, d_model, n_heads, d_ffn, dropout=0.): + """ + Args: + d_model (int): the feature size of the input, and the output. + n_heads (int): the number of heads in the internal MultiHeadAttention layer. + d_ffn (int): the hidden size of the internal PositionwiseFFN. + dropout (float, optional): the probability of the dropout in + MultiHeadAttention and PositionwiseFFN. Defaults to 0. + """ + super(TransformerEncoderLayer, self).__init__() + self.self_mha = MultiheadAttention(d_model, n_heads) + self.layer_norm1 = nn.LayerNorm([d_model], epsilon=1e-6) + + self.ffn = PositionwiseFFN(d_model, d_ffn, dropout) + self.layer_norm2 = nn.LayerNorm([d_model], epsilon=1e-6) + + self.dropout = dropout + + def _forward_mha(self, x, mask, drop_n_heads): + # PreLN scheme: Norm -> SubLayer -> Dropout -> Residual + x_in = x + x = self.layer_norm1(x) + context_vector, attn_weights = self.self_mha(x, x, x, mask, drop_n_heads) + context_vector = x_in + F.dropout(context_vector, self.dropout, training=self.training) + return context_vector, attn_weights + + def _forward_ffn(self, x): + # PreLN scheme: Norm -> SubLayer -> Dropout -> Residual + x_in = x + x = self.layer_norm2(x) + x = self.ffn(x) + out= x_in + F.dropout(x, self.dropout, training=self.training) + return out + + def forward(self, x, mask, drop_n_heads=0): + """ + Args: + x (Tensor): shape(batch_size, time_steps, d_model), the decoder input. + mask (Tensor): shape(batch_size, 1, time_steps), the padding mask. + + Returns: + x (Tensor): shape(batch_size, time_steps, d_model), the decoded. + attn_weights (Tensor), shape(batch_size, n_heads, time_steps, time_steps), self attention. + """ + x, attn_weights = self._forward_mha(x, mask, drop_n_heads) + x = self._forward_ffn(x) + return x, attn_weights + + +class TransformerDecoderLayer(nn.Layer): + """ + Transformer decoder layer. + """ + def __init__(self, d_model, n_heads, d_ffn, dropout=0., d_encoder=None): + """ + Args: + d_model (int): the feature size of the input, and the output. + n_heads (int): the number of heads in the internal MultiHeadAttention layer. + d_ffn (int): the hidden size of the internal PositionwiseFFN. + dropout (float, optional): the probability of the dropout in + MultiHeadAttention and PositionwiseFFN. Defaults to 0. 
+ """ + super(TransformerDecoderLayer, self).__init__() + self.self_mha = MultiheadAttention(d_model, n_heads) + self.layer_norm1 = nn.LayerNorm([d_model], epsilon=1e-6) + + self.cross_mha = MultiheadAttention(d_model, n_heads, k_input_dim=d_encoder, v_input_dim=d_encoder) + self.layer_norm2 = nn.LayerNorm([d_model], epsilon=1e-6) + + self.ffn = PositionwiseFFN(d_model, d_ffn, dropout) + self.layer_norm3 = nn.LayerNorm([d_model], epsilon=1e-6) + + self.dropout = dropout + + def _forward_self_mha(self, x, mask, drop_n_heads): + # PreLN scheme: Norm -> SubLayer -> Dropout -> Residual + x_in = x + x = self.layer_norm1(x) + context_vector, attn_weights = self.self_mha(x, x, x, mask, drop_n_heads) + context_vector = x_in + F.dropout(context_vector, self.dropout, training=self.training) + return context_vector, attn_weights + + def _forward_cross_mha(self, q, k, v, mask, drop_n_heads): + # PreLN scheme: Norm -> SubLayer -> Dropout -> Residual + q_in = q + q = self.layer_norm2(q) + context_vector, attn_weights = self.cross_mha(q, k, v, mask, drop_n_heads) + context_vector = q_in + F.dropout(context_vector, self.dropout, training=self.training) + return context_vector, attn_weights + + def _forward_ffn(self, x): + # PreLN scheme: Norm -> SubLayer -> Dropout -> Residual + x_in = x + x = self.layer_norm3(x) + x = self.ffn(x) + out= x_in + F.dropout(x, self.dropout, training=self.training) + return out + + def forward(self, q, k, v, encoder_mask, decoder_mask, drop_n_heads=0): + """ + Args: + q (Tensor): shape(batch_size, time_steps_q, d_model), the decoder input. + k (Tensor): shape(batch_size, time_steps_k, d_model), keys. + v (Tensor): shape(batch_size, time_steps_k, d_model), values + encoder_mask (Tensor): shape(batch_size, 1, time_steps_k) encoder padding mask. + decoder_mask (Tensor): shape(batch_size, time_steps_q, time_steps_q) or broadcastable shape, decoder padding mask. + + Returns: + q (Tensor): shape(batch_size, time_steps_q, d_model), the decoded. + self_attn_weights (Tensor), shape(batch_size, n_heads, time_steps_q, time_steps_q), decoder self attention. + cross_attn_weights (Tensor), shape(batch_size, n_heads, time_steps_q, time_steps_k), decoder-encoder cross attention. + """ + q, self_attn_weights = self._forward_self_mha(q, decoder_mask, drop_n_heads) + q, cross_attn_weights = self._forward_cross_mha(q, k, v, encoder_mask, drop_n_heads) + q = self._forward_ffn(q) + return q, self_attn_weights, cross_attn_weights + + +class TransformerEncoder(nn.LayerList): + def __init__(self, d_model, n_heads, d_ffn, n_layers, dropout=0.): + super(TransformerEncoder, self).__init__() + for _ in range(n_layers): + self.append(TransformerEncoderLayer(d_model, n_heads, d_ffn, dropout)) + + def forward(self, x, mask, drop_n_heads=0): + """ + Args: + x (Tensor): shape(batch_size, time_steps, feature_size), the input tensor. + mask (Tensor): shape(batch_size, 1, time_steps), the mask. + drop_n_heads (int, optional): how many heads to drop. Defaults to 0. + + Returns: + x (Tensor): shape(batch_size, time_steps, feature_size), the context vector. + attention_weights(list[Tensor]), each of shape + (batch_size, n_heads, time_steps, time_steps), the attention weights. 
+        """
+        attention_weights = []
+        for layer in self:
+            x, attention_weights_i = layer(x, mask, drop_n_heads)
+            attention_weights.append(attention_weights_i)
+        return x, attention_weights
+
+
+class TransformerDecoder(nn.LayerList):
+    def __init__(self, d_model, n_heads, d_ffn, n_layers, dropout=0., d_encoder=None):
+        super(TransformerDecoder, self).__init__()
+        for _ in range(n_layers):
+            self.append(TransformerDecoderLayer(d_model, n_heads, d_ffn, dropout, d_encoder=d_encoder))
+
+    def forward(self, q, k, v, encoder_mask, decoder_mask, drop_n_heads=0):
+        """Run the stacked decoder layers.
+
+        Args:
+            q (Tensor): shape(batch_size, time_steps_q, d_model), the decoder input.
+            k (Tensor): shape(batch_size, time_steps_k, d_encoder), the keys.
+            v (Tensor): shape(batch_size, time_steps_k, d_encoder), the values.
+            encoder_mask (Tensor): shape(batch_size, 1, time_steps_k), encoder padding mask.
+            decoder_mask (Tensor): shape(batch_size, time_steps_q, time_steps_q), decoder mask.
+            drop_n_heads (int, optional): how many attention heads to drop. Defaults to 0.
+
+        Returns:
+            q (Tensor): shape(batch_size, time_steps_q, d_model), the output.
+            self_attention_weights (List[Tensor]): each of shape (batch_size, num_heads, time_steps_q, time_steps_q), decoder self attention.
+            cross_attention_weights (List[Tensor]): each of shape (batch_size, num_heads, time_steps_q, time_steps_k), decoder-encoder cross attention.
+        """
+        self_attention_weights = []
+        cross_attention_weights = []
+        for layer in self:
+            q, self_attention_weights_i, cross_attention_weights_i = layer(q, k, v, encoder_mask, decoder_mask, drop_n_heads)
+            self_attention_weights.append(self_attention_weights_i)
+            cross_attention_weights.append(cross_attention_weights_i)
+        return q, self_attention_weights, cross_attention_weights
+
+
+class MLPPreNet(nn.Layer):
+    """Decoder's prenet."""
+    def __init__(self, d_input, d_hidden, d_output, dropout):
+        # (lin + relu + dropout) * n + last projection
+        super(MLPPreNet, self).__init__()
+        self.lin1 = nn.Linear(d_input, d_hidden)
+        self.lin2 = nn.Linear(d_hidden, d_hidden)
+        self.dropout = dropout
+
+    def forward(self, x, dropout):
+        # use the dropout probability passed in (e.g. decoder_prenet_dropout)
+        # rather than the one stored at construction time
+        l1 = F.dropout(F.relu(self.lin1(x)), dropout, training=self.training)
+        l2 = F.dropout(F.relu(self.lin2(l1)), dropout, training=self.training)
+        return l2
+
+# NOTE: not used in TransformerTTS, which uses MLPPreNet as the decoder prenet
+class CNNPreNet(nn.Layer):
+    def __init__(self, d_input, d_hidden, d_output, kernel_size, n_layers,
+                 dropout=0.):
+        # (conv + bn + relu + dropout) * n + last projection
+        super(CNNPreNet, self).__init__()
+        self.convs = nn.LayerList()
+        c_in = d_input
+        for _ in range(n_layers):
+            self.convs.append(
+                Conv1dBatchNorm(c_in, d_hidden, kernel_size,
+                                weight_attr=I.XavierUniform(),
+                                padding="same", data_format="NLC"))
+            c_in = d_hidden
+        self.affine_out = nn.Linear(d_hidden, d_output)
+        self.dropout = dropout
+
+    def forward(self, x):
+        for layer in self.convs:
+            x = F.dropout(F.relu(layer(x)), self.dropout, training=self.training)
+        x = self.affine_out(x)
+        return x
+
+
+class CNNPostNet(nn.Layer):
+    def __init__(self, d_input, d_hidden, d_output, kernel_size, n_layers):
+        super(CNNPostNet, self).__init__()
+        self.convs = nn.LayerList()
+        kernel_size = kernel_size if isinstance(kernel_size, (tuple, list)) else (kernel_size, )
+        padding = (kernel_size[0] - 1, 0)
+        for i in range(n_layers):
+            c_in = d_input if i == 0 else d_hidden
+            c_out = d_output if i == n_layers - 1 else d_hidden
+            self.convs.append(
+                Conv1dBatchNorm(c_in, c_out, kernel_size,
+                                weight_attr=I.XavierUniform(),
+                                padding=padding))
+        # NOTE: for a layer that ends with a normalization layer whose target
+        # output is not zero-centered, it may take a long time to train the
+        # scale and bias
+        # NOTE: it can 
also be a non-causal conv + + def forward(self, x): + x_in = x + for i, layer in enumerate(self.convs): + x = layer(x) + if i != (len(self.convs) - 1): + x = F.tanh(x) + x = x_in + x + return x + + +class TransformerTTS(nn.Layer): + def __init__(self, + frontend: parakeet.frontend.Phonetics, + d_encoder: int, + d_decoder: int, + d_mel: int, + n_heads: int, + d_ffn: int, + encoder_layers: int, + decoder_layers: int, + d_prenet: int, + d_postnet: int, + postnet_layers: int, + postnet_kernel_size: int, + max_reduction_factor: int, + decoder_prenet_dropout: float, + dropout: float): + super(TransformerTTS, self).__init__() + + # text frontend (text normalization and g2p) + self.frontend = frontend + + # encoder + self.encoder_prenet = nn.Embedding( + frontend.vocab_size, d_encoder, + padding_idx=frontend.vocab.padding_index, + weight_attr=I.Uniform(-0.05, 0.05)) + # position encoding matrix may be extended later + self.encoder_pe = pe.positional_encoding(0, 1000, d_encoder) + self.encoder_pe_scalar = self.create_parameter( + [1], attr=I.Constant(1.)) + self.encoder = TransformerEncoder( + d_encoder, n_heads, d_ffn, encoder_layers, dropout) + + # decoder + self.decoder_prenet = MLPPreNet(d_mel, d_prenet, d_decoder, dropout) + self.decoder_pe = pe.positional_encoding(0, 1000, d_decoder) + self.decoder_pe_scalar = self.create_parameter( + [1], attr=I.Constant(1.)) + self.decoder = TransformerDecoder( + d_decoder, n_heads, d_ffn, decoder_layers, dropout, + d_encoder=d_encoder) + self.final_proj = nn.Linear(d_decoder, max_reduction_factor * d_mel) + self.decoder_postnet = CNNPostNet( + d_mel, d_postnet, d_mel, postnet_kernel_size, postnet_layers) + self.stop_conditioner = nn.Linear(d_mel, 3) + + # specs + self.padding_idx = frontend.vocab.padding_index + self.d_encoder = d_encoder + self.d_decoder = d_decoder + self.d_mel = d_mel + self.max_r = max_reduction_factor + self.dropout = dropout + self.decoder_prenet_dropout = decoder_prenet_dropout + + # start and end: though it is only used in predict + # it can also be used in training + dtype = paddle.get_default_dtype() + self.start_vec = paddle.full([1, d_mel], 0.5, dtype=dtype) + self.end_vec = paddle.full([1, d_mel], -0.5, dtype=dtype) + self.stop_prob_index = 2 + + # mutables + self.r = max_reduction_factor # set it every call + self.drop_n_heads = 0 + + def forward(self, text, mel): + encoded, encoder_attention_weights, encoder_mask = self.encode(text) + mel_output, mel_intermediate, cross_attention_weights, stop_logits = self.decode(encoded, mel, encoder_mask) + outputs = { + "mel_output": mel_output, + "mel_intermediate": mel_intermediate, + "encoder_attention_weights": encoder_attention_weights, + "cross_attention_weights": cross_attention_weights, + "stop_logits": stop_logits, + } + return outputs + + def encode(self, text): + T_enc = text.shape[-1] + embed = self.encoder_prenet(text) + if embed.shape[1] > self.encoder_pe.shape[0]: + new_T = max(embed.shape[1], self.encoder_pe.shape[0] * 2) + self.encoder_pe = pe.positional_encoding(0, new_T, self.d_encoder) + pos_enc = self.encoder_pe[:T_enc, :] # (T, C) + x = embed.scale(math.sqrt(self.d_encoder)) + pos_enc * self.encoder_pe_scalar + x = F.dropout(x, self.dropout, training=self.training) + + # TODO(chenfeiyu): unsqueeze a decoder_time_steps=1 for the mask + encoder_padding_mask = paddle.unsqueeze( + masking.id_mask(text, self.padding_idx, dtype=x.dtype), 1) + x, attention_weights = self.encoder(x, encoder_padding_mask, self.drop_n_heads) + return x, attention_weights, encoder_padding_mask 
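+
+    # A rough usage sketch (not part of the original code; the hyperparameter
+    # values below are illustrative assumptions only):
+    #
+    #     model = TransformerTTS(frontend, d_encoder=512, d_decoder=512,
+    #                            d_mel=80, n_heads=4, d_ffn=1024,
+    #                            encoder_layers=4, decoder_layers=4,
+    #                            d_prenet=256, d_postnet=256, postnet_layers=5,
+    #                            postnet_kernel_size=5, max_reduction_factor=10,
+    #                            decoder_prenet_dropout=0.5, dropout=0.1)
+    #     outputs = model(text_ids, mel)  # dict with mel_output, mel_intermediate,
+    #                                     # stop_logits and attention weights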
+
+    def decode(self, encoder_output, input, encoder_padding_mask):
+        batch_size, T_dec, mel_dim = input.shape
+
+        x = self.decoder_prenet(input, self.decoder_prenet_dropout)
+        # extend the position encoding to (at least) twice its length if needed
+        if x.shape[1] * self.r > self.decoder_pe.shape[0]:
+            new_T = max(x.shape[1] * self.r, self.decoder_pe.shape[0] * 2)
+            self.decoder_pe = pe.positional_encoding(0, new_T, self.d_decoder)
+        pos_enc = self.decoder_pe[:T_dec*self.r:self.r, :]
+        x = x.scale(math.sqrt(self.d_decoder)) + pos_enc * self.decoder_pe_scalar
+        x = F.dropout(x, self.dropout, training=self.training)
+
+        no_future_mask = masking.future_mask(T_dec, dtype=input.dtype)
+        decoder_padding_mask = masking.feature_mask(input, axis=-1, dtype=input.dtype)
+        decoder_mask = masking.combine_mask(decoder_padding_mask.unsqueeze(-1), no_future_mask)
+        decoder_output, _, cross_attention_weights = self.decoder(
+            x,
+            encoder_output,
+            encoder_output,
+            encoder_padding_mask,
+            decoder_mask,
+            self.drop_n_heads)
+
+        # use only the first (r * mel_dim) channels of the projection
+        output_proj = self.final_proj(decoder_output)[:, :, : self.r * mel_dim]
+        mel_intermediate = paddle.reshape(output_proj, [batch_size, -1, mel_dim])
+        stop_logits = self.stop_conditioner(mel_intermediate)
+
+        # cnn postnet
+        mel_channel_first = paddle.transpose(mel_intermediate, [0, 2, 1])
+        mel_output = self.decoder_postnet(mel_channel_first)
+        mel_output = paddle.transpose(mel_output, [0, 2, 1])
+
+        return mel_output, mel_intermediate, cross_attention_weights, stop_logits
+
+    def predict(self, input, raw_input=True, max_length=1000, verbose=True):
+        """Predict log scale magnitude mel spectrogram from text input.
+
+        Args:
+            input (Tensor): shape (T), dtype int, the input text sequence.
+            max_length (int, optional): max decoder steps. Defaults to 1000.
+            verbose (bool, optional): display progress bar. Defaults to True.
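+            raw_input (bool, optional): whether the input is raw text to be
+                processed by the frontend first. If False, the input is
+                assumed to be a batch of pre-computed token ids of shape
+                (1, T). Defaults to True.
+
+        Returns:
+            Dict[str, Tensor]: the predicted mel spectrogram ("mel_output"),
+                the encoder self attention weights
+                ("encoder_attention_weights") and the encoder-decoder cross
+                attention weights ("cross_attention_weights").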
+ """ + if raw_input: + text_ids = paddle.to_tensor(self.frontend(input)) + text_input = paddle.unsqueeze(text_ids, 0) # (1, T) + else: + text_input = input + + decoder_input = paddle.unsqueeze(self.start_vec, 0) # (B=1, T, C) + decoder_output = paddle.unsqueeze(self.start_vec, 0) # (B=1, T, C) + + # encoder the text sequence + encoder_output, encoder_attentions, encoder_padding_mask = self.encode(text_input) + for _ in trange(int(max_length // self.r) + 1): + mel_output, _, cross_attention_weights, stop_logits = self.decode( + encoder_output, decoder_input, encoder_padding_mask) + + # extract last step and append it to decoder input + decoder_input = paddle.concat([decoder_input, mel_output[:, -1:, :]], 1) + # extract last r steps and append it to decoder output + decoder_output = paddle.concat([decoder_output, mel_output[:, -self.r:, :]], 1) + + # stop condition: (if any ouput frame of the output multiframes hits the stop condition) + if paddle.any(paddle.argmax(stop_logits[0, :, :], axis=-1) == self.stop_prob_index): + if verbose: + print("Hits stop condition.") + break + mel_output = decoder_output[:, 1:, :] + + outputs = { + "mel_output": mel_output, + "encoder_attention_weights": encoder_attentions, + "cross_attention_weights": cross_attention_weights, + } + return outputs + + def set_constants(self, reduction_factor, drop_n_heads): + self.r = reduction_factor + self.drop_n_heads = drop_n_heads + + +class TransformerTTSLoss(nn.Layer): + def __init__(self, stop_loss_scale): + super(TransformerTTSLoss, self).__init__() + self.stop_loss_scale = stop_loss_scale + + def forward(self, mel_output, mel_intermediate, mel_target, stop_logits, stop_probs): + mask = masking.feature_mask(mel_target, axis=-1, dtype=mel_target.dtype) + mask1 = paddle.unsqueeze(mask, -1) + mel_loss1 = L.masked_l1_loss(mel_output, mel_target, mask1) + mel_loss2 = L.masked_l1_loss(mel_intermediate, mel_target, mask1) + + mel_len = mask.shape[-1] + last_position = F.one_hot(mask.sum(-1).astype("int64") - 1, num_classes=mel_len) + mask2 = mask + last_position.scale(self.stop_loss_scale - 1).astype(mask.dtype) + stop_loss = L.masked_softmax_with_cross_entropy( + stop_logits, stop_probs.unsqueeze(-1), mask2.unsqueeze(-1)) + + loss = mel_loss1 + mel_loss2 + stop_loss + losses = dict( + loss=loss, # total loss + mel_loss1=mel_loss1, # ouput mel loss + mel_loss2=mel_loss2, # intermediate mel loss + stop_loss=stop_loss # stop prob loss + ) + return losses \ No newline at end of file diff --git a/parakeet/models/transformer_tts/__init__.py b/parakeet/models/transformer_tts/__init__.py deleted file mode 100644 index 6d5bfd4..0000000 --- a/parakeet/models/transformer_tts/__init__.py +++ /dev/null @@ -1,15 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-from .transformer_tts import TransformerTTS -from .vocoder import Vocoder \ No newline at end of file diff --git a/parakeet/models/transformer_tts/cbhg.py b/parakeet/models/transformer_tts/cbhg.py deleted file mode 100644 index 9a330f9..0000000 --- a/parakeet/models/transformer_tts/cbhg.py +++ /dev/null @@ -1,287 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import math -from parakeet.g2p.text.symbols import symbols -import paddle.fluid.dygraph as dg -import paddle.fluid as fluid -import paddle.fluid.layers as layers -from parakeet.modules.customized import Pool1D, Conv1D -from parakeet.modules.dynamic_gru import DynamicGRU -import numpy as np - - -class CBHG(dg.Layer): - def __init__(self, - hidden_size, - batch_size, - K=16, - projection_size=256, - num_gru_layers=2, - max_pool_kernel_size=2, - is_post=False): - """CBHG Module - - Args: - hidden_size (int): dimension of hidden unit. - batch_size (int): batch size of input. - K (int, optional): number of convolution banks. Defaults to 16. - projection_size (int, optional): dimension of projection unit. Defaults to 256. - num_gru_layers (int, optional): number of layers of GRUcell. Defaults to 2. - max_pool_kernel_size (int, optional): max pooling kernel size. Defaults to 2 - is_post (bool, optional): whether post processing or not. Defaults to False. 
- """ - super(CBHG, self).__init__() - - self.hidden_size = hidden_size - self.projection_size = projection_size - self.conv_list = [] - k = math.sqrt(1.0 / projection_size) - self.conv_list.append( - Conv1D( - num_channels=projection_size, - num_filters=hidden_size, - filter_size=1, - padding=int(np.floor(1 / 2)), - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr( - initializer=fluid.initializer.Uniform( - low=-k, high=k)))) - k = math.sqrt(1.0 / hidden_size) - for i in range(2, K + 1): - self.conv_list.append( - Conv1D( - num_channels=hidden_size, - num_filters=hidden_size, - filter_size=i, - padding=int(np.floor(i / 2)), - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr( - initializer=fluid.initializer.Uniform( - low=-k, high=k)))) - - for i, layer in enumerate(self.conv_list): - self.add_sublayer("conv_list_{}".format(i), layer) - - self.batchnorm_list = [] - for i in range(K): - self.batchnorm_list.append( - dg.BatchNorm( - hidden_size, data_layout='NCHW')) - - for i, layer in enumerate(self.batchnorm_list): - self.add_sublayer("batchnorm_list_{}".format(i), layer) - - conv_outdim = hidden_size * K - - k = math.sqrt(1.0 / conv_outdim) - self.conv_projection_1 = Conv1D( - num_channels=conv_outdim, - num_filters=hidden_size, - filter_size=3, - padding=int(np.floor(3 / 2)), - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k))) - - k = math.sqrt(1.0 / hidden_size) - self.conv_projection_2 = Conv1D( - num_channels=hidden_size, - num_filters=projection_size, - filter_size=3, - padding=int(np.floor(3 / 2)), - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k))) - - self.batchnorm_proj_1 = dg.BatchNorm(hidden_size, data_layout='NCHW') - self.batchnorm_proj_2 = dg.BatchNorm( - projection_size, data_layout='NCHW') - self.max_pool = Pool1D( - pool_size=max_pool_kernel_size, - pool_type='max', - pool_stride=1, - pool_padding=1, - data_format="NCT") - self.highway = Highwaynet(self.projection_size) - - h_0 = np.zeros((batch_size, hidden_size // 2), dtype="float32") - h_0 = dg.to_variable(h_0) - k = math.sqrt(1.0 / hidden_size) - self.fc_forward1 = dg.Linear( - hidden_size, - hidden_size // 2 * 3, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k))) - self.fc_reverse1 = dg.Linear( - hidden_size, - hidden_size // 2 * 3, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k))) - self.gru_forward1 = DynamicGRU( - size=self.hidden_size // 2, - is_reverse=False, - origin_mode=True, - h_0=h_0) - self.gru_reverse1 = DynamicGRU( - size=self.hidden_size // 2, - is_reverse=True, - origin_mode=True, - h_0=h_0) - - self.fc_forward2 = dg.Linear( - hidden_size, - hidden_size // 2 * 3, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k))) - self.fc_reverse2 = dg.Linear( - hidden_size, - hidden_size // 2 * 3, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - 
bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k))) - self.gru_forward2 = DynamicGRU( - size=self.hidden_size // 2, - is_reverse=False, - origin_mode=True, - h_0=h_0) - self.gru_reverse2 = DynamicGRU( - size=self.hidden_size // 2, - is_reverse=True, - origin_mode=True, - h_0=h_0) - - def _conv_fit_dim(self, x, filter_size=3): - if filter_size % 2 == 0: - return x[:, :, :-1] - else: - return x - - def forward(self, input_): - """ - Convert linear spectrum to Mel spectrum. - - Args: - input_ (Variable): shape(B, C, T), dtype float32, the sequentially input. - - Returns: - out (Variable): shape(B, C, T), the CBHG output. - """ - - conv_list = [] - conv_input = input_ - - for i, (conv, batchnorm - ) in enumerate(zip(self.conv_list, self.batchnorm_list)): - conv_input = self._conv_fit_dim(conv(conv_input), i + 1) - conv_input = layers.relu(batchnorm(conv_input)) - conv_list.append(conv_input) - - conv_cat = layers.concat(conv_list, axis=1) - conv_pool = self.max_pool(conv_cat)[:, :, :-1] - - conv_proj = layers.relu( - self.batchnorm_proj_1( - self._conv_fit_dim(self.conv_projection_1(conv_pool)))) - conv_proj = self.batchnorm_proj_2( - self._conv_fit_dim(self.conv_projection_2(conv_proj))) + input_ - - # conv_proj.shape = [N, C, T] - highway = layers.transpose(conv_proj, [0, 2, 1]) - highway = self.highway(highway) - - # highway.shape = [N, T, C] - fc_forward = self.fc_forward1(highway) - fc_reverse = self.fc_reverse1(highway) - out_forward = self.gru_forward1(fc_forward) - out_reverse = self.gru_reverse1(fc_reverse) - out = layers.concat([out_forward, out_reverse], axis=-1) - fc_forward = self.fc_forward2(out) - fc_reverse = self.fc_reverse2(out) - out_forward = self.gru_forward2(fc_forward) - out_reverse = self.gru_reverse2(fc_reverse) - out = layers.concat([out_forward, out_reverse], axis=-1) - out = layers.transpose(out, [0, 2, 1]) - return out - - -class Highwaynet(dg.Layer): - def __init__(self, num_units, num_layers=4): - """Highway network - - Args: - num_units (int): dimension of hidden unit. - num_layers (int, optional): number of highway layers. Defaults to 4. - """ - super(Highwaynet, self).__init__() - self.num_units = num_units - self.num_layers = num_layers - - self.gates = [] - self.linears = [] - k = math.sqrt(1.0 / num_units) - for i in range(num_layers): - self.linears.append( - dg.Linear( - num_units, - num_units, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr( - initializer=fluid.initializer.Uniform( - low=-k, high=k)))) - self.gates.append( - dg.Linear( - num_units, - num_units, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr( - initializer=fluid.initializer.Uniform( - low=-k, high=k)))) - - for i, (linear, gate) in enumerate(zip(self.linears, self.gates)): - self.add_sublayer("linears_{}".format(i), linear) - self.add_sublayer("gates_{}".format(i), gate) - - def forward(self, input_): - """ - Compute result of Highway network. - - Args: - input_(Variable): shape(B, T, C), dtype float32, the sequentially input. - - Returns: - out(Variable): the Highway output. 
- """ - out = input_ - - for linear, gate in zip(self.linears, self.gates): - h = fluid.layers.relu(linear(out)) - t_ = fluid.layers.sigmoid(gate(out)) - - c = 1 - t_ - out = h * t_ + out * c - - return out diff --git a/parakeet/models/transformer_tts/decoder.py b/parakeet/models/transformer_tts/decoder.py deleted file mode 100644 index 41e11a0..0000000 --- a/parakeet/models/transformer_tts/decoder.py +++ /dev/null @@ -1,193 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import math -import paddle.fluid.dygraph as dg -import paddle.fluid as fluid -from parakeet.models.transformer_tts.utils import * -from parakeet.modules.multihead_attention import MultiheadAttention -from parakeet.modules.ffn import PositionwiseFeedForward -from parakeet.models.transformer_tts.prenet import PreNet -from parakeet.models.transformer_tts.post_convnet import PostConvNet - - -class Decoder(dg.Layer): - def __init__(self, - num_hidden, - num_mels=80, - outputs_per_step=1, - num_head=4, - n_layers=3): - """Decoder layer of TransformerTTS. - - Args: - num_hidden (int): the number of source vocabulary. - n_mels (int, optional): the number of mel bands when calculating mel spectrograms. Defaults to 80. - outputs_per_step (int, optional): the num of output frames per step . Defaults to 1. - num_head (int, optional): the head number of multihead attention. Defaults to 4. - n_layers (int, optional): the layers number of multihead attention. Defaults to 3. 
- """ - super(Decoder, self).__init__() - self.num_hidden = num_hidden - self.num_head = num_head - param = fluid.ParamAttr() - self.alpha = self.create_parameter( - shape=(1, ), - attr=param, - dtype='float32', - default_initializer=fluid.initializer.ConstantInitializer( - value=1.0)) - self.pos_inp = get_sinusoid_encoding_table( - 1024, self.num_hidden, padding_idx=0) - self.pos_emb = dg.Embedding( - size=[1024, num_hidden], - padding_idx=0, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.NumpyArrayInitializer( - self.pos_inp), - trainable=False)) - self.decoder_prenet = PreNet( - input_size=num_mels, - hidden_size=num_hidden * 2, - output_size=num_hidden, - dropout_rate=0.2) - k = math.sqrt(1.0 / num_hidden) - self.linear = dg.Linear( - num_hidden, - num_hidden, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k))) - - self.selfattn_layers = [ - MultiheadAttention(num_hidden, num_hidden // num_head, - num_hidden // num_head) for _ in range(n_layers) - ] - for i, layer in enumerate(self.selfattn_layers): - self.add_sublayer("self_attn_{}".format(i), layer) - self.attn_layers = [ - MultiheadAttention(num_hidden, num_hidden // num_head, - num_hidden // num_head) for _ in range(n_layers) - ] - for i, layer in enumerate(self.attn_layers): - self.add_sublayer("attn_{}".format(i), layer) - self.ffns = [ - PositionwiseFeedForward( - num_hidden, num_hidden * num_head, filter_size=1) - for _ in range(n_layers) - ] - for i, layer in enumerate(self.ffns): - self.add_sublayer("ffns_{}".format(i), layer) - self.mel_linear = dg.Linear( - num_hidden, - num_mels * outputs_per_step, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k))) - self.stop_linear = dg.Linear( - num_hidden, - 1, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k))) - - self.postconvnet = PostConvNet( - num_mels, - num_hidden, - filter_size=5, - padding=4, - num_conv=5, - outputs_per_step=outputs_per_step, - use_cudnn=True) - - def forward(self, key, value, query, positional, c_mask): - """ - Compute decoder outputs. - - Args: - key (Variable): shape(B, T_text, C), dtype float32, the input key of decoder, - where T_text means the timesteps of input text, - value (Variable): shape(B, T_text, C), dtype float32, the input value of decoder. - query (Variable): shape(B, T_mel, C), dtype float32, the input query of decoder, - where T_mel means the timesteps of input spectrum, - positional (Variable): shape(B, T_mel), dtype int64, the spectrum position. - c_mask (Variable): shape(B, T_text, 1), dtype float32, query mask returned from encoder. - Returns: - mel_out (Variable): shape(B, T_mel, C), the decoder output after mel linear projection. - out (Variable): shape(B, T_mel, C), the decoder output after post mel network. - stop_tokens (Variable): shape(B, T_mel, 1), the stop tokens of output. - attn_list (list[Variable]): len(n_layers), the encoder-decoder attention list. - selfattn_list (list[Variable]): len(n_layers), the decoder self attention list. 
- """ - - # get decoder mask with triangular matrix - - if fluid.framework._dygraph_tracer()._train_mode: - mask = get_dec_attn_key_pad_mask(positional, self.num_head, - query.dtype) - m_mask = get_non_pad_mask(positional, self.num_head, query.dtype) - zero_mask = layers.cast(c_mask == 0, dtype=query.dtype) * -1e30 - zero_mask = layers.transpose(zero_mask, perm=[0, 2, 1]) - - else: - len_q = query.shape[1] - mask = layers.triu( - layers.ones( - shape=[len_q, len_q], dtype=query.dtype), - diagonal=1) - mask = layers.cast(mask != 0, dtype=query.dtype) * -1e30 - m_mask, zero_mask = None, None - - # Decoder pre-network - query = self.decoder_prenet(query) - - # Centered position - query = self.linear(query) - - # Get position embedding - positional = self.pos_emb(positional) - query = positional * self.alpha + query - - #positional dropout - query = fluid.layers.dropout( - query, 0.1, dropout_implementation='upscale_in_train') - - # Attention decoder-decoder, encoder-decoder - selfattn_list = list() - attn_list = list() - - for selfattn, attn, ffn in zip(self.selfattn_layers, self.attn_layers, - self.ffns): - query, attn_dec = selfattn( - query, query, query, mask=mask, query_mask=m_mask) - query, attn_dot = attn( - key, value, query, mask=zero_mask, query_mask=m_mask) - query = ffn(query) - selfattn_list.append(attn_dec) - attn_list.append(attn_dot) - - # Mel linear projection - mel_out = self.mel_linear(query) - # Post Mel Network - out = self.postconvnet(mel_out) - out = mel_out + out - - # Stop tokens - stop_tokens = self.stop_linear(query) - stop_tokens = layers.squeeze(stop_tokens, [-1]) - stop_tokens = layers.sigmoid(stop_tokens) - - return mel_out, out, attn_list, stop_tokens, selfattn_list diff --git a/parakeet/models/transformer_tts/encoder.py b/parakeet/models/transformer_tts/encoder.py deleted file mode 100644 index a7a0f7a..0000000 --- a/parakeet/models/transformer_tts/encoder.py +++ /dev/null @@ -1,106 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import paddle.fluid.dygraph as dg -import paddle.fluid as fluid -from parakeet.models.transformer_tts.utils import * -from parakeet.modules.multihead_attention import MultiheadAttention -from parakeet.modules.ffn import PositionwiseFeedForward -from parakeet.models.transformer_tts.encoderprenet import EncoderPrenet - - -class Encoder(dg.Layer): - def __init__(self, embedding_size, num_hidden, num_head=4, n_layers=3): - """Encoder layer of TransformerTTS. - - Args: - embedding_size (int): the size of position embedding. - num_hidden (int): the size of hidden layer in network. - num_head (int, optional): the head number of multihead attention. Defaults to 4. - n_layers (int, optional): the layers number of multihead attention. Defaults to 3. 
- """ - super(Encoder, self).__init__() - self.num_hidden = num_hidden - self.num_head = num_head - param = fluid.ParamAttr(initializer=fluid.initializer.Constant( - value=1.0)) - self.alpha = self.create_parameter( - shape=(1, ), attr=param, dtype='float32') - self.pos_inp = get_sinusoid_encoding_table( - 1024, self.num_hidden, padding_idx=0) - self.pos_emb = dg.Embedding( - size=[1024, num_hidden], - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.NumpyArrayInitializer( - self.pos_inp), - trainable=False)) - self.encoder_prenet = EncoderPrenet( - embedding_size=embedding_size, - num_hidden=num_hidden, - use_cudnn=True) - self.layers = [ - MultiheadAttention(num_hidden, num_hidden // num_head, - num_hidden // num_head) for _ in range(n_layers) - ] - for i, layer in enumerate(self.layers): - self.add_sublayer("self_attn_{}".format(i), layer) - self.ffns = [ - PositionwiseFeedForward( - num_hidden, - num_hidden * num_head, - filter_size=1, - use_cudnn=True) for _ in range(n_layers) - ] - for i, layer in enumerate(self.ffns): - self.add_sublayer("ffns_{}".format(i), layer) - - def forward(self, x, positional): - """ - Encode text sequence. - - Args: - x (Variable): shape(B, T_text), dtype float32, the input character, - where T_text means the timesteps of input text, - positional (Variable): shape(B, T_text), dtype int64, the characters position. - - Returns: - x (Variable): shape(B, T_text, C), the encoder output. - attentions (list[Variable]): len(n_layers), the encoder self attention list. - """ - - # Encoder pre_network - x = self.encoder_prenet(x) - - if fluid.framework._dygraph_tracer()._train_mode: - mask = get_attn_key_pad_mask(positional, self.num_head, x.dtype) - query_mask = get_non_pad_mask(positional, self.num_head, x.dtype) - - else: - query_mask, mask = None, None - - # Get positional encoding - positional = self.pos_emb(positional) - - x = positional * self.alpha + x - - # Positional dropout - x = layers.dropout(x, 0.1, dropout_implementation='upscale_in_train') - - # Self attention encoder - attentions = list() - for layer, ffn in zip(self.layers, self.ffns): - x, attention = layer(x, x, x, mask=mask, query_mask=query_mask) - x = ffn(x) - attentions.append(attention) - - return x, attentions, query_mask diff --git a/parakeet/models/transformer_tts/encoderprenet.py b/parakeet/models/transformer_tts/encoderprenet.py deleted file mode 100644 index a32f5a8..0000000 --- a/parakeet/models/transformer_tts/encoderprenet.py +++ /dev/null @@ -1,111 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import math -from parakeet.g2p.text.symbols import symbols -import paddle.fluid.dygraph as dg -import paddle.fluid as fluid -import paddle.fluid.layers as layers -from parakeet.modules.customized import Conv1D -import numpy as np - - -class EncoderPrenet(dg.Layer): - def __init__(self, embedding_size, num_hidden, use_cudnn=True): - """ Encoder prenet layer of TransformerTTS. 
- - Args: - embedding_size (int): the size of embedding. - num_hidden (int): the size of hidden layer in network. - use_cudnn (bool, optional): use cudnn or not. Defaults to True. - """ - super(EncoderPrenet, self).__init__() - self.embedding_size = embedding_size - self.num_hidden = num_hidden - self.use_cudnn = use_cudnn - self.embedding = dg.Embedding( - size=[len(symbols), embedding_size], - padding_idx=0, - param_attr=fluid.initializer.Normal( - loc=0.0, scale=1.0)) - self.conv_list = [] - k = math.sqrt(1.0 / embedding_size) - self.conv_list.append( - Conv1D( - num_channels=embedding_size, - num_filters=num_hidden, - filter_size=5, - padding=int(np.floor(5 / 2)), - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr( - initializer=fluid.initializer.Uniform( - low=-k, high=k)), - use_cudnn=use_cudnn)) - k = math.sqrt(1.0 / num_hidden) - for _ in range(2): - self.conv_list.append( - Conv1D( - num_channels=num_hidden, - num_filters=num_hidden, - filter_size=5, - padding=int(np.floor(5 / 2)), - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr( - initializer=fluid.initializer.Uniform( - low=-k, high=k)), - use_cudnn=use_cudnn)) - - for i, layer in enumerate(self.conv_list): - self.add_sublayer("conv_list_{}".format(i), layer) - - self.batch_norm_list = [ - dg.BatchNorm( - num_hidden, data_layout='NCHW') for _ in range(3) - ] - - for i, layer in enumerate(self.batch_norm_list): - self.add_sublayer("batch_norm_list_{}".format(i), layer) - - k = math.sqrt(1.0 / num_hidden) - self.projection = dg.Linear( - num_hidden, - num_hidden, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k))) - - def forward(self, x): - """ - Prepare encoder input. - - Args: - x (Variable): shape(B, T_text), dtype float32, the input character, where T_text means the timesteps of input text. - - Returns: - (Variable): shape(B, T_text, C), the encoder prenet output. - """ - - x = self.embedding(x) - x = layers.transpose(x, [0, 2, 1]) - for batch_norm, conv in zip(self.batch_norm_list, self.conv_list): - x = layers.dropout( - layers.relu(batch_norm(conv(x))), - 0.2, - dropout_implementation='upscale_in_train') - x = layers.transpose(x, [0, 2, 1]) #(N,T,C) - x = self.projection(x) - - return x diff --git a/parakeet/models/transformer_tts/post_convnet.py b/parakeet/models/transformer_tts/post_convnet.py deleted file mode 100644 index 6ad8e5d..0000000 --- a/parakeet/models/transformer_tts/post_convnet.py +++ /dev/null @@ -1,137 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import math -import paddle.fluid.dygraph as dg -import paddle.fluid as fluid -import paddle.fluid.layers as layers -from parakeet.modules.customized import Conv1D - - -class PostConvNet(dg.Layer): - def __init__(self, - n_mels=80, - num_hidden=512, - filter_size=5, - padding=0, - num_conv=5, - outputs_per_step=1, - use_cudnn=True, - dropout=0.1, - batchnorm_last=False): - """Decocder post conv net of TransformerTTS. - - Args: - n_mels (int, optional): the number of mel bands when calculating mel spectrograms. Defaults to 80. - num_hidden (int, optional): the size of hidden layer in network. Defaults to 512. - filter_size (int, optional): the filter size of Conv. Defaults to 5. - padding (int, optional): the padding size of Conv. Defaults to 0. - num_conv (int, optional): the num of Conv layers in network. Defaults to 5. - outputs_per_step (int, optional): the num of output frames per step . Defaults to 1. - use_cudnn (bool, optional): use cudnn in Conv or not. Defaults to True. - dropout (float, optional): dropout probability. Defaults to 0.1. - batchnorm_last (bool, optional): if batchnorm at last layer or not. Defaults to False. - """ - super(PostConvNet, self).__init__() - - self.dropout = dropout - self.num_conv = num_conv - self.batchnorm_last = batchnorm_last - self.conv_list = [] - k = math.sqrt(1.0 / (n_mels * outputs_per_step)) - self.conv_list.append( - Conv1D( - num_channels=n_mels * outputs_per_step, - num_filters=num_hidden, - filter_size=filter_size, - padding=padding, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr( - initializer=fluid.initializer.Uniform( - low=-k, high=k)), - use_cudnn=use_cudnn)) - - k = math.sqrt(1.0 / num_hidden) - for _ in range(1, num_conv - 1): - self.conv_list.append( - Conv1D( - num_channels=num_hidden, - num_filters=num_hidden, - filter_size=filter_size, - padding=padding, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr( - initializer=fluid.initializer.Uniform( - low=-k, high=k)), - use_cudnn=use_cudnn)) - - self.conv_list.append( - Conv1D( - num_channels=num_hidden, - num_filters=n_mels * outputs_per_step, - filter_size=filter_size, - padding=padding, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr( - initializer=fluid.initializer.Uniform( - low=-k, high=k)), - use_cudnn=use_cudnn)) - - for i, layer in enumerate(self.conv_list): - self.add_sublayer("conv_list_{}".format(i), layer) - - self.batch_norm_list = [ - dg.BatchNorm( - num_hidden, data_layout='NCHW') for _ in range(num_conv - 1) - ] - if self.batchnorm_last: - self.batch_norm_list.append( - dg.BatchNorm( - n_mels * outputs_per_step, data_layout='NCHW')) - for i, layer in enumerate(self.batch_norm_list): - self.add_sublayer("batch_norm_list_{}".format(i), layer) - - def forward(self, input): - """ - Compute the mel spectrum. - - Args: - input (Variable): shape(B, T, C), dtype float32, the result of mel linear projection. - - Returns: - output (Variable): shape(B, T, C), the result after postconvnet. 
- """ - - input = layers.transpose(input, [0, 2, 1]) - len = input.shape[-1] - for i in range(self.num_conv - 1): - batch_norm = self.batch_norm_list[i] - conv = self.conv_list[i] - - input = layers.dropout( - layers.tanh(batch_norm(conv(input)[:, :, :len])), - self.dropout, - dropout_implementation='upscale_in_train') - conv = self.conv_list[self.num_conv - 1] - input = conv(input)[:, :, :len] - if self.batchnorm_last: - batch_norm = self.batch_norm_list[self.num_conv - 1] - input = layers.dropout( - batch_norm(input), - self.dropout, - dropout_implementation='upscale_in_train') - output = layers.transpose(input, [0, 2, 1]) - return output diff --git a/parakeet/models/transformer_tts/prenet.py b/parakeet/models/transformer_tts/prenet.py deleted file mode 100644 index eaf4bc8..0000000 --- a/parakeet/models/transformer_tts/prenet.py +++ /dev/null @@ -1,71 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import math -import paddle.fluid.dygraph as dg -import paddle.fluid as fluid -import paddle.fluid.layers as layers - - -class PreNet(dg.Layer): - def __init__(self, input_size, hidden_size, output_size, dropout_rate=0.2): - """Prenet before passing through the network. - - Args: - input_size (int): the input channel size. - hidden_size (int): the size of hidden layer in network. - output_size (int): the output channel size. - dropout_rate (float, optional): dropout probability. Defaults to 0.2. - """ - super(PreNet, self).__init__() - self.input_size = input_size - self.hidden_size = hidden_size - self.output_size = output_size - self.dropout_rate = dropout_rate - - k = math.sqrt(1.0 / input_size) - self.linear1 = dg.Linear( - input_size, - hidden_size, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k))) - k = math.sqrt(1.0 / hidden_size) - self.linear2 = dg.Linear( - hidden_size, - output_size, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k))) - - def forward(self, x): - """ - Prepare network input. - - Args: - x (Variable): shape(B, T, C), dtype float32, the input value. - - Returns: - output (Variable): shape(B, T, C), the result after pernet. - """ - x = layers.dropout( - layers.relu(self.linear1(x)), - self.dropout_rate, - dropout_implementation='upscale_in_train') - output = layers.dropout( - layers.relu(self.linear2(x)), - self.dropout_rate, - dropout_implementation='upscale_in_train') - return output diff --git a/parakeet/models/transformer_tts/transformer_tts.py b/parakeet/models/transformer_tts/transformer_tts.py deleted file mode 100644 index e1d9418..0000000 --- a/parakeet/models/transformer_tts/transformer_tts.py +++ /dev/null @@ -1,71 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import paddle.fluid.dygraph as dg -import paddle.fluid as fluid -from parakeet.models.transformer_tts.encoder import Encoder -from parakeet.models.transformer_tts.decoder import Decoder - - -class TransformerTTS(dg.Layer): - def __init__(self, - embedding_size, - num_hidden, - encoder_num_head=4, - encoder_n_layers=3, - n_mels=80, - outputs_per_step=1, - decoder_num_head=4, - decoder_n_layers=3): - """TransformerTTS model. - - Args: - embedding_size (int): the size of position embedding. - num_hidden (int): the size of hidden layer in network. - encoder_num_head (int, optional): the head number of multihead attention in encoder. Defaults to 4. - encoder_n_layers (int, optional): the layers number of multihead attention in encoder. Defaults to 3. - n_mels (int, optional): the number of mel bands when calculating mel spectrograms. Defaults to 80. - outputs_per_step (int, optional): the num of output frames per step . Defaults to 1. - decoder_num_head (int, optional): the head number of multihead attention in decoder. Defaults to 4. - decoder_n_layers (int, optional): the layers number of multihead attention in decoder. Defaults to 3. - """ - super(TransformerTTS, self).__init__() - self.encoder = Encoder(embedding_size, num_hidden, encoder_num_head, - encoder_n_layers) - self.decoder = Decoder(num_hidden, n_mels, outputs_per_step, - decoder_num_head, decoder_n_layers) - - def forward(self, characters, mel_input, pos_text, pos_mel): - """ - TransformerTTS network. - - Args: - characters (Variable): shape(B, T_text), dtype float32, the input character, - where T_text means the timesteps of input text, - mel_input (Variable): shape(B, T_mel, C), dtype float32, the input query of decoder, - where T_mel means the timesteps of input spectrum, - pos_text (Variable): shape(B, T_text), dtype int64, the characters position. - - Returns: - mel_output (Variable): shape(B, T_mel, C), the decoder output after mel linear projection. - postnet_output (Variable): shape(B, T_mel, C), the decoder output after post mel network. - stop_preds (Variable): shape(B, T_mel, 1), the stop tokens of output. - attn_probs (list[Variable]): len(n_layers), the encoder-decoder attention list. - attns_enc (list[Variable]): len(n_layers), the encoder self attention list. - attns_dec (list[Variable]): len(n_layers), the decoder self attention list. - """ - key, attns_enc, query_mask = self.encoder(characters, pos_text) - - mel_output, postnet_output, attn_probs, stop_preds, attns_dec = self.decoder( - key, key, mel_input, pos_mel, query_mask) - return mel_output, postnet_output, attn_probs, stop_preds, attns_enc, attns_dec diff --git a/parakeet/models/transformer_tts/utils.py b/parakeet/models/transformer_tts/utils.py deleted file mode 100644 index 9482c23..0000000 --- a/parakeet/models/transformer_tts/utils.py +++ /dev/null @@ -1,101 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import numpy as np -import librosa -import os, copy -from scipy import signal -import paddle.fluid.layers as layers - - -def get_positional_table(d_pos_vec, n_position=1024): - position_enc = np.array( - [[pos / np.power(10000, 2 * i / d_pos_vec) for i in range(d_pos_vec)] - if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)]) - - position_enc[1:, 0::2] = np.sin(position_enc[1:, 0::2]) # dim 2i - position_enc[1:, 1::2] = np.cos(position_enc[1:, 1::2]) # dim 2i+1 - return position_enc - - -def get_sinusoid_encoding_table(n_position, d_hid, padding_idx=None): - ''' Sinusoid position encoding table ''' - - def cal_angle(position, hid_idx): - return position / np.power(10000, 2 * (hid_idx // 2) / d_hid) - - def get_posi_angle_vec(position): - return [cal_angle(position, hid_j) for hid_j in range(d_hid)] - - sinusoid_table = np.array( - [get_posi_angle_vec(pos_i) for pos_i in range(n_position)]) - - sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i - sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1 - - if padding_idx is not None: - # zero vector for padding dimension - sinusoid_table[padding_idx] = 0. - - return sinusoid_table - - -def get_non_pad_mask(seq, num_head, dtype): - mask = layers.cast(seq != 0, dtype=dtype) - mask = layers.unsqueeze(mask, axes=[-1]) - mask = layers.expand(mask, [num_head, 1, 1]) - return mask - - -def get_attn_key_pad_mask(seq_k, num_head, dtype): - ''' For masking out the padding part of key sequence. ''' - # Expand to fit the shape of key query attention matrix. - padding_mask = layers.cast(seq_k == 0, dtype=dtype) * -1e30 - padding_mask = layers.unsqueeze(padding_mask, axes=[1]) - padding_mask = layers.expand(padding_mask, [num_head, 1, 1]) - return padding_mask - - -def get_dec_attn_key_pad_mask(seq_k, num_head, dtype): - ''' For masking out the padding part of key sequence. ''' - - # Expand to fit the shape of key query attention matrix. - padding_mask = layers.cast(seq_k == 0, dtype=dtype) - padding_mask = layers.unsqueeze(padding_mask, axes=[1]) - len_k = seq_k.shape[1] - triu = layers.triu( - layers.ones( - shape=[len_k, len_k], dtype=dtype), diagonal=1) - padding_mask = padding_mask + triu - padding_mask = layers.cast( - padding_mask != 0, dtype=dtype) * -1e30 #* (-2**32 + 1) - padding_mask = layers.expand(padding_mask, [num_head, 1, 1]) - return padding_mask - - -def guided_attention(N, T, g=0.2): - '''Guided attention. 
Refer to page 3 on the paper.''' - W = np.zeros((N, T), dtype=np.float32) - for n_pos in range(W.shape[0]): - for t_pos in range(W.shape[1]): - W[n_pos, t_pos] = 1 - np.exp(-(t_pos / float(T) - n_pos / float(N)) - **2 / (2 * g * g)) - return W - - -def cross_entropy(input, label, weight=1.0, epsilon=1e-30): - output = -1 * label * layers.log(input + epsilon) - ( - 1 - label) * layers.log(1 - input + epsilon) - output = output * (label * (weight - 1) + 1) - - return layers.reduce_mean(output, dim=[0, 1]) diff --git a/parakeet/models/transformer_tts/vocoder.py b/parakeet/models/transformer_tts/vocoder.py deleted file mode 100644 index 4b40ebb..0000000 --- a/parakeet/models/transformer_tts/vocoder.py +++ /dev/null @@ -1,55 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import paddle.fluid.dygraph as dg -import paddle.fluid as fluid -from parakeet.modules.customized import Conv1D -from parakeet.models.transformer_tts.utils import * -from parakeet.models.transformer_tts.cbhg import CBHG - - -class Vocoder(dg.Layer): - def __init__(self, batch_size, hidden_size, num_mels=80, n_fft=2048): - """CBHG Network (mel -> linear) - - Args: - batch_size (int): the batch size of input. - hidden_size (int): the size of hidden layer in network. - n_mels (int, optional): the number of mel bands when calculating mel spectrograms. Defaults to 80. - n_fft (int, optional): length of the windowed signal after padding with zeros. Defaults to 2048. - """ - super(Vocoder, self).__init__() - self.pre_proj = Conv1D( - num_channels=num_mels, num_filters=hidden_size, filter_size=1) - self.cbhg = CBHG(hidden_size, batch_size) - self.post_proj = Conv1D( - num_channels=hidden_size, - num_filters=(n_fft // 2) + 1, - filter_size=1) - - def forward(self, mel): - """ - Compute mel spectrum to linear spectrum. - - Args: - mel (Variable): shape(B, C, T), dtype float32, the input mel spectrum. - - Returns: - mag_pred (Variable): shape(B, T, C), the linear output. - """ - mel = layers.transpose(mel, [0, 2, 1]) - mel = self.pre_proj(mel) - mel = self.cbhg(mel) - mag_pred = self.post_proj(mel) - mag_pred = layers.transpose(mag_pred, [0, 2, 1]) - return mag_pred diff --git a/parakeet/models/waveflow.py b/parakeet/models/waveflow.py new file mode 100644 index 0000000..89bdbda --- /dev/null +++ b/parakeet/models/waveflow.py @@ -0,0 +1,506 @@ +import math +import numpy as np +import paddle +from paddle import nn +from paddle.nn import functional as F +from paddle.nn import initializer as I + +from parakeet.modules import geometry as geo + +__all__ = ["UpsampleNet", "WaveFlow", "ConditionalWaveFlow", "WaveFlowLoss"] + +def fold(x, n_group): + """Fold audio or spectrogram's temporal dimension in to groups. + + Args: + x (Tensor): shape(*, time_steps), the input tensor + n_group (int): the size of a group. + + Returns: + Tensor: shape(*, time_steps // n_group, group), folded tensor. 
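+
+    Example (illustrative):
+        >>> x = paddle.randn([2, 8])
+        >>> fold(x, 4).shape
+        [2, 2, 4]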
+ """ + *spatial_shape, time_steps = x.shape + new_shape = spatial_shape + [time_steps // n_group, n_group] + return paddle.reshape(x, new_shape) + +class UpsampleNet(nn.LayerList): + """ + Layer to upsample mel spectrogram to the same temporal resolution with + the corresponding waveform. It consists of several conv2dtranspose layers + which perform de convolution on mel and time dimension. + """ + def __init__(self, upsample_factors): + super(UpsampleNet, self).__init__() + for factor in upsample_factors: + std = math.sqrt(1 / (3 * 2 * factor)) + init = I.Uniform(-std, std) + self.append( + nn.utils.weight_norm( + nn.Conv2DTranspose(1, 1, (3, 2 * factor), + padding=(1, factor // 2), + stride=(1, factor), + weight_attr=init, + bias_attr=init))) + + # upsample factors + self.upsample_factor = np.prod(upsample_factors) + self.upsample_factors = upsample_factors + + def forward(self, x, trim_conv_artifact=False): + """ + Args: + x (Tensor): shape(batch_size, input_channels, time_steps), the input + spectrogram. + trim_conv_artifact (bool, optional): trim deconvolution artifact at + each layer. Defaults to False. + + Returns: + Tensor: shape(batch_size, input_channels, time_steps * upsample_factor). + If trim_conv_artifact is True, the output time steps is less + than time_steps * upsample_factors. + """ + x = paddle.unsqueeze(x, 1) #(B, C, T) -> (B, 1, C, T) + for layer in self: + x = layer(x) + if trim_conv_artifact: + time_cutoff = layer._kernel_size[1] - layer._stride[1] + x = x[:, :, :, :-time_cutoff] + x = F.leaky_relu(x, 0.4) + x = paddle.squeeze(x, 1) # back to (B, C, T) + return x + + +class ResidualBlock(nn.Layer): + """ + ResidualBlock, the basic unit of ResidualNet. It has a conv2d layer, which + has causal padding in height dimension and same paddign in width dimension. + It also has projection for the condition and output. + """ + def __init__(self, channels, cond_channels, kernel_size, dilations): + super(ResidualBlock, self).__init__() + # input conv + std = math.sqrt(1 / channels * np.prod(kernel_size)) + init = I.Uniform(-std, std) + receptive_field = [1 + (k - 1) * d for (k, d) in zip(kernel_size, dilations)] + rh, rw = receptive_field + paddings = [rh - 1, 0, rw // 2, (rw - 1) // 2] # causal & same + conv = nn.Conv2D(channels, 2 * channels, kernel_size, + padding=paddings, + dilation=dilations, + weight_attr=init, + bias_attr=init) + self.conv = nn.utils.weight_norm(conv) + self.rh = rh + self.rw = rw + self.dilations = dilations + + # condition projection + std = math.sqrt(1 / cond_channels) + init = I.Uniform(-std, std) + condition_proj = nn.Conv2D(cond_channels, 2 * channels, (1, 1), + weight_attr=init, bias_attr=init) + self.condition_proj = nn.utils.weight_norm(condition_proj) + + # parametric residual & skip connection + std = math.sqrt(1 / channels) + init = I.Uniform(-std, std) + out_proj = nn.Conv2D(channels, 2 * channels, (1, 1), + weight_attr=init, bias_attr=init) + self.out_proj = nn.utils.weight_norm(out_proj) + + def forward(self, x, condition): + """Compute output for a whole folded sequence. + + Args: + x (Tensor): shape(batch_size, channel, height, width), the input. + condition (Tensor): shape(batch_size, condition_channel, height, width), + the local condition. + + Returns: + res (Tensor): shape(batch_size, channel, height, width), the residual output. + res (Tensor): shape(batch_size, channel, height, width), the skip output. 
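+
+        Example (an illustrative sketch; the sizes below are assumptions):
+            >>> block = ResidualBlock(channels=4, cond_channels=8,
+            ...                       kernel_size=(3, 3), dilations=(1, 1))
+            >>> x = paddle.randn([2, 4, 6, 16])
+            >>> condition = paddle.randn([2, 8, 6, 16])
+            >>> res, skip = block(x, condition)
+            >>> # res and skip both have shape [2, 4, 6, 16]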
+ """ + x_in = x + x = self.conv(x) + x += self.condition_proj(condition) + + content, gate = paddle.chunk(x, 2, axis=1) + x = paddle.tanh(content) * F.sigmoid(gate) + + x = self.out_proj(x) + res, skip = paddle.chunk(x, 2, axis=1) + return x_in + res, skip + + def start_sequence(self): + """Prepare the layer for incremental computation of causal convolution. Reset the buffer for causal convolution. + + Raises: + ValueError: If not in evaluation mode. + """ + if self.training: + raise ValueError("Only use start sequence at evaluation mode.") + self._conv_buffer = None + + def add_input(self, x_row, condition_row): + """Compute the output for a row and update the buffer. + + Args: + x_row (Tensor): shape(batch_size, channel, 1, width), a row of the input. + condition_row (Tensor): shape(batch_size, condition_channel, 1, width), a row of the input. + + Returns: + res (Tensor): shape(batch_size, channel, 1, width), the residual output. + res (Tensor): shape(batch_size, channel, 1, width), the skip output. + """ + x_row_in = x_row + if self._conv_buffer is None: + self._init_buffer(x_row) + self._update_buffer(x_row) + + rw = self.rw + # call self.conv's weight norm hook expliccitly since its __call__ + # method is not called here + for hook in self.conv._forward_pre_hooks.values(): + hook(self.conv, self._conv_buffer) + x_row = F.conv2d( + self._conv_buffer, + self.conv.weight, + self.conv.bias, + padding=[0, 0, rw // 2, (rw - 1) // 2], + dilation=self.dilations) + x_row += self.condition_proj(condition_row) + + content, gate = paddle.chunk(x_row, 2, axis=1) + x_row = paddle.tanh(content) * F.sigmoid(gate) + + x_row = self.out_proj(x_row) + res, skip = paddle.chunk(x_row, 2, axis=1) + return x_row_in + res, skip + + def _init_buffer(self, input): + batch_size, channels, _, width = input.shape + self._conv_buffer = paddle.zeros( + [batch_size, channels, self.rh, width], dtype=input.dtype) + + def _update_buffer(self, input): + self._conv_buffer = paddle.concat( + [self._conv_buffer[:, :, 1:, :], input], axis=2) + + +class ResidualNet(nn.LayerList): + """ + A stack of several ResidualBlocks. It merges condition at each layer. All + skip outputs are collected. + """ + def __init__(self, n_layer, residual_channels, condition_channels, kernel_size, dilations_h): + if len(dilations_h) != n_layer: + raise ValueError("number of dilations_h should equals num of layers") + super(ResidualNet, self).__init__() + for i in range(n_layer): + dilation = (dilations_h[i], 2 ** i) + layer = ResidualBlock(residual_channels, condition_channels, kernel_size, dilation) + self.append(layer) + + def forward(self, x, condition): + """Comput the output of given the input and the condition. + + Args: + x (Tensor): shape(batch_size, channel, height, width), the input. + condition (Tensor): shape(batch_size, condition_channel, height, width), + the local condition. + + Returns: + Tensor: shape(batch_size, channel, height, width), the output, which + is an aggregation of all the skip outputs. + """ + skip_connections = [] + for layer in self: + x, skip = layer(x, condition) + skip_connections.append(skip) + out = paddle.sum(paddle.stack(skip_connections, 0), 0) + return out + + def start_sequence(self): + """Prepare the layer for incremental computation.""" + for layer in self: + layer.start_sequence() + + def add_input(self, x_row, condition_row): + """Compute the output for a row and update the buffer. + + Args: + x_row (Tensor): shape(batch_size, channel, 1, width), a row of the input. 
+ condition_row (Tensor): shape(batch_size, condition_channel, 1, width), a row of the input. + + Returns: + Tensor: shape(batch_size, channel, 1, width), the output, which is + an aggregation of all the skip outputs. + """ + skip_connections = [] + for layer in self: + x_row, skip = layer.add_input(x_row, condition_row) + skip_connections.append(skip) + out = paddle.sum(paddle.stack(skip_connections, 0), 0) + return out + + +class Flow(nn.Layer): + """ + A bijection (Reversable layer) that transform a density of latent variables + p(Z) into a complex data distribution p(X). + + It's a auto regressive flow. The `forward` method implements the probability + density estimation. The `inverse` method implements the sampling. + """ + dilations_dict = { + 8: [1, 1, 1, 1, 1, 1, 1, 1], + 16: [1, 1, 1, 1, 1, 1, 1, 1], + 32: [1, 2, 4, 1, 2, 4, 1, 2], + 64: [1, 2, 4, 8, 16, 1, 2, 4], + 128: [1, 2, 4, 8, 16, 32, 64, 1] + } + + def __init__(self, n_layers, channels, mel_bands, kernel_size, n_group): + super(Flow, self).__init__() + # input projection + self.input_proj = nn.utils.weight_norm( + nn.Conv2D(1, channels, (1, 1), + weight_attr=I.Uniform(-1., 1.), + bias_attr=I.Uniform(-1., 1.))) + + # residual net + self.resnet = ResidualNet(n_layers, channels, mel_bands, kernel_size, + self.dilations_dict[n_group]) + + # output projection + self.output_proj = nn.Conv2D(channels, 2, (1, 1), + weight_attr=I.Constant(0.), + bias_attr=I.Constant(0.)) + + # specs + self.n_group = n_group + + def _predict_parameters(self, x, condition): + x = self.input_proj(x) + x = self.resnet(x, condition) + bijection_params = self.output_proj(x) + logs, b = paddle.chunk(bijection_params, 2, axis=1) + return logs, b + + def _transform(self, x, logs, b): + z_0 = x[:, :, :1, :] # the first row, just copy it + z_out = x[:, :, 1:, :] * paddle.exp(logs) + b + z_out = paddle.concat([z_0, z_out], axis=2) + return z_out + + def forward(self, x, condition): + """Probability density estimation. It is done by inversely transform a sample + from p(X) back into a sample from p(Z). + + Args: + x (Tensor): shape(batch, 1, height, width), a input sample of the distribution p(X). + condition (Tensor): shape(batch, condition_channel, height, width), the local condition. + + Returns: + (z, (logs, b)) + z (Tensor): shape(batch, 1, height, width), the transformed sample. + logs (Tensor): shape(batch, 1, height - 1, width), the log scale of the inverse transformation. + b (Tensor): shape(batch, 1, height - 1, width), the shift of the inverse transformation. + """ + # (B, C, H-1, W) + logs, b = self._predict_parameters( + x[:, :, :-1, :], condition[:, :, 1:, :]) + z = self._transform(x, logs, b) + return z, (logs, b) + + def _predict_row_parameters(self, x_row, condition_row): + x_row = self.input_proj(x_row) + x_row = self.resnet.add_input(x_row, condition_row) + bijection_params = self.output_proj(x_row) + logs, b = paddle.chunk(bijection_params, 2, axis=1) + return logs, b + + def _inverse_transform_row(self, z_row, logs, b): + x_row = (z_row - b) * paddle.exp(-logs) + return x_row + + def _inverse_row(self, z_row, x_row, condition_row): + logs, b = self._predict_row_parameters(x_row, condition_row) + x_next_row = self._inverse_transform_row(z_row, logs, b) + return x_next_row, (logs, b) + + def _start_sequence(self): + self.resnet.start_sequence() + + def inverse(self, z, condition): + """Sampling from the the distrition p(X). It is done by sample form p(Z) + and transform the sample. It is a auto regressive transformation. 
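+        The inversion runs autoregressively over the height (group)
+        dimension: each row of x is recovered using parameters predicted
+        from the rows recovered so far (held in the causal-convolution
+        buffers), which is why the buffers are reset before the loop starts.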
+ + Args: + z (Tensor): shape(batch, 1, height, width), a input sample of the distribution p(Z). + condition (Tensor): shape(batch, condition_channel, height, width), the local condition. + + Returns: + (x, (logs, b)) + x (Tensor): shape(batch, 1, height, width), the transformed sample. + logs (Tensor): shape(batch, 1, height - 1, width), the log scale of the inverse transformation. + b (Tensor): shape(batch, 1, height - 1, width), the shift of the inverse transformation. + """ + z_0 = z[:, :, :1, :] + x = [] + logs_list = [] + b_list = [] + x.append(z_0) + + self._start_sequence() + for i in range(1, self.n_group): + x_row = x[-1] # actuallt i-1:i + z_row = z[:, :, i:i+1, :] + condition_row = condition[:, :, i:i+1, :] + + x_next_row, (logs, b) = self._inverse_row(z_row, x_row, condition_row) + x.append(x_next_row) + logs_list.append(logs) + b_list.append(b) + + x = paddle.concat(x, 2) + logs = paddle.concat(logs_list, 2) + b = paddle.concat(b_list, 2) + return x, (logs, b) + + +class WaveFlow(nn.LayerList): + """An Deep Reversible layer that is composed of a stack of auto regressive flows.s""" + def __init__(self, n_flows, n_layers, n_group, channels, mel_bands, kernel_size): + if n_group % 2 or n_flows % 2: + raise ValueError("number of flows and number of group must be even " + "since a permutation along group among flows is used.") + super(WaveFlow, self).__init__() + for _ in range(n_flows): + self.append(Flow(n_layers, channels, mel_bands, kernel_size, n_group)) + + # permutations in h + self.perms = self._create_perm(n_group, n_flows) + + # specs + self.n_group = n_group + self.n_flows = n_flows + + def _create_perm(self, n_group, n_flows): + indices = list(range(n_group)) + half = n_group // 2 + perms = [] + for i in range(n_flows): + if i < n_flows // 2: + perms.append(indices[::-1]) + else: + perm = list(reversed(indices[:half])) + list(reversed(indices[half:])) + perms.append(perm) + return perms + + def _trim(self, x, condition): + assert condition.shape[-1] >= x.shape[-1] + pruned_len = int(x.shape[-1] // self.n_group * self.n_group) + + if x.shape[-1] > pruned_len: + x = x[:, :pruned_len] + if condition.shape[-1] > pruned_len: + condition = condition[:, :, :pruned_len] + return x, condition + + def forward(self, x, condition): + """Probability density estimation. + + Args: + x (Tensor): shape(batch_size, time_steps), the audio. + condition (Tensor): shape(batch_size, condition channel, time_steps), the local condition. + + Returns: + z: (Tensor): shape(batch_size, time_steps), the transformed sample. + log_det_jacobian: (Tensor), shape(1,), the log determinant of the jacobian of (dz/dx). + """ + # x: (B, T) + # condition: (B, C, T) upsampled condition + x, condition = self._trim(x, condition) + + # to (B, C, h, T//h) layout + x = paddle.unsqueeze(paddle.transpose(fold(x, self.n_group), [0, 2, 1]), 1) + condition = paddle.transpose(fold(condition, self.n_group), [0, 1, 3, 2]) + + # flows + logs_list = [] + for i, layer in enumerate(self): + x, (logs, b) = layer(x, condition) + logs_list.append(logs) + # permute paddle has no shuffle dim + x = geo.shuffle_dim(x, 2, perm=self.perms[i]) + condition = geo.shuffle_dim(condition, 2, perm=self.perms[i]) + + z = paddle.squeeze(x, 1) # (B, H, W) + batch_size = z.shape[0] + z = paddle.reshape(paddle.transpose(z, [0, 2, 1]), [batch_size, -1]) + + log_det_jacobian = paddle.sum(paddle.stack(logs_list)) + return z, log_det_jacobian + + def inverse(self, z, condition): + """Sampling from the the distrition p(X). 
It is done by sample form p(Z) + and transform the sample. It is a auto regressive transformation. + + Args: + z (Tensor): shape(batch, 1, time_steps), a input sample of the distribution p(Z). + condition (Tensor): shape(batch, condition_channel, time_steps), the local condition. + + Returns: + x: (Tensor): shape(batch_size, time_steps), the transformed sample. + """ + + z, condition = self._trim(z, condition) + # to (B, C, h, T//h) layout + z = paddle.unsqueeze(paddle.transpose(fold(z, self.n_group), [0, 2, 1]), 1) + condition = paddle.transpose(fold(condition, self.n_group), [0, 1, 3, 2]) + + # reverse it flow by flow + for i in reversed(range(self.n_flows)): + z = geo.shuffle_dim(z, 2, perm=self.perms[i]) + condition = geo.shuffle_dim(condition, 2, perm=self.perms[i]) + z, (logs, b) = self[i].inverse(z, condition) + + x = paddle.squeeze(z, 1) # (B, H, W) + batch_size = x.shape[0] + x = paddle.reshape(paddle.transpose(x, [0, 2, 1]), [batch_size, -1]) + return x + + +class ConditionalWaveFlow(nn.LayerList): + def __init__(self, encoder, decoder): + super(ConditionalWaveFlow, self).__init__() + self.encoder = encoder + self.decoder = decoder + + def forward(self, audio, mel): + condition = self.encoder(mel) + z, log_det_jacobian = self.decoder(audio, condition) + return z, log_det_jacobian + + @paddle.fluid.dygraph.no_grad + def synthesize(self, mel): + condition = self.encoder(mel, trim_conv_artifact=True) #(B, C, T) + batch_size, _, time_steps = condition.shape + z = paddle.randn([batch_size, time_steps], dtype=mel.dtype) + x = self.decoder.inverse(z, condition) + return x + + +class WaveFlowLoss(nn.Layer): + def __init__(self, sigma=1.0): + super(WaveFlowLoss, self).__init__() + self.sigma = sigma + self.const = 0.5 * np.log(2 * np.pi) + np.log(self.sigma) + + def forward(self, model_output): + z, log_det_jacobian = model_output + + loss = paddle.sum(z * z) / (2 * self.sigma * self.sigma) - log_det_jacobian + loss = loss / np.prod(z.shape) + return loss + self.const diff --git a/parakeet/models/waveflow/__init__.py b/parakeet/models/waveflow/__init__.py deleted file mode 100644 index b068b59..0000000 --- a/parakeet/models/waveflow/__init__.py +++ /dev/null @@ -1,15 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from parakeet.models.waveflow.waveflow_modules import WaveFlowLoss, WaveFlowModule diff --git a/parakeet/models/waveflow/waveflow_modules.py b/parakeet/models/waveflow/waveflow_modules.py deleted file mode 100644 index 96c5715..0000000 --- a/parakeet/models/waveflow/waveflow_modules.py +++ /dev/null @@ -1,443 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import itertools -import numpy as np -import paddle.fluid.dygraph as dg -from paddle import fluid -from parakeet.modules import weight_norm - - -def get_param_attr(layer_type, filter_size, c_in=1): - if layer_type == "weight_norm": - k = np.sqrt(1.0 / (c_in * np.prod(filter_size))) - weight_init = fluid.initializer.UniformInitializer(low=-k, high=k) - bias_init = fluid.initializer.UniformInitializer(low=-k, high=k) - elif layer_type == "common": - weight_init = fluid.initializer.ConstantInitializer(0.0) - bias_init = fluid.initializer.ConstantInitializer(0.0) - else: - raise TypeError("Unsupported layer type.") - - param_attr = fluid.ParamAttr(initializer=weight_init) - bias_attr = fluid.ParamAttr(initializer=bias_init) - return param_attr, bias_attr - - -def unfold(x, n_group): - length = x.shape[-1] - new_shape = x.shape[:-1] + [length // n_group, n_group] - return fluid.layers.reshape(x, new_shape) - - -class WaveFlowLoss: - def __init__(self, sigma=1.0): - self.sigma = sigma - - def __call__(self, model_output): - z, log_s_list = model_output - for i, log_s in enumerate(log_s_list): - if i == 0: - log_s_total = fluid.layers.reduce_sum(log_s) - else: - log_s_total = log_s_total + fluid.layers.reduce_sum(log_s) - - loss = fluid.layers.reduce_sum(z * z) / (2 * self.sigma * self.sigma) \ - - log_s_total - loss = loss / np.prod(z.shape) - const = 0.5 * np.log(2 * np.pi) + np.log(self.sigma) - - return loss + const - - -class Conditioner(dg.Layer): - def __init__(self, dtype, upsample_factors): - super(Conditioner, self).__init__() - - self.upsample_conv2d = [] - for s in upsample_factors: - in_channel = 1 - param_attr, bias_attr = get_param_attr( - "weight_norm", (3, 2 * s), c_in=in_channel) - conv_trans2d = weight_norm.Conv2DTranspose( - num_channels=in_channel, - num_filters=1, - filter_size=(3, 2 * s), - padding=(1, s // 2), - stride=(1, s), - param_attr=param_attr, - bias_attr=bias_attr, - dtype=dtype) - self.upsample_conv2d.append(conv_trans2d) - - for i, layer in enumerate(self.upsample_conv2d): - self.add_sublayer("conv2d_transpose_{}".format(i), layer) - - def forward(self, x): - x = fluid.layers.unsqueeze(x, 1) - for layer in self.upsample_conv2d: - x = layer(x) - x = fluid.layers.leaky_relu(x, alpha=0.4) - - return fluid.layers.squeeze(x, [1]) - - def infer(self, x): - x = fluid.layers.unsqueeze(x, 1) - for layer in self.upsample_conv2d: - x = layer(x) - # Trim conv artifacts. 
- time_cutoff = layer._filter_size[1] - layer._stride[1] - x = fluid.layers.leaky_relu(x[:, :, :, :-time_cutoff], alpha=0.4) - - return fluid.layers.squeeze(x, [1]) - - -class Flow(dg.Layer): - def __init__(self, config): - super(Flow, self).__init__() - self.n_layers = config.n_layers - self.n_channels = config.n_channels - self.kernel_h = config.kernel_h - self.kernel_w = config.kernel_w - self.dtype = "float16" if config.use_fp16 else "float32" - - # Transform audio: [batch, 1, n_group, time/n_group] - # => [batch, n_channels, n_group, time/n_group] - param_attr, bias_attr = get_param_attr("weight_norm", (1, 1), c_in=1) - self.start = weight_norm.Conv2D( - num_channels=1, - num_filters=self.n_channels, - filter_size=(1, 1), - param_attr=param_attr, - bias_attr=bias_attr, - dtype=self.dtype) - - # Initializing last layer to 0 makes the affine coupling layers - # do nothing at first. This helps with training stability - # output shape: [batch, 2, n_group, time/n_group] - param_attr, bias_attr = get_param_attr( - "common", (1, 1), c_in=self.n_channels) - self.end = dg.Conv2D( - num_channels=self.n_channels, - num_filters=2, - filter_size=(1, 1), - param_attr=param_attr, - bias_attr=bias_attr, - dtype=self.dtype) - - # receiptive fileds: (kernel - 1) * sum(dilations) + 1 >= squeeze - dilation_dict = { - 8: [1, 1, 1, 1, 1, 1, 1, 1], - 16: [1, 1, 1, 1, 1, 1, 1, 1], - 32: [1, 2, 4, 1, 2, 4, 1, 2], - 64: [1, 2, 4, 8, 16, 1, 2, 4], - 128: [1, 2, 4, 8, 16, 32, 64, 1] - } - self.dilation_h_list = dilation_dict[config.n_group] - - self.in_layers = [] - self.cond_layers = [] - self.res_skip_layers = [] - for i in range(self.n_layers): - dilation_h = self.dilation_h_list[i] - dilation_w = 2**i - - param_attr, bias_attr = get_param_attr( - "weight_norm", (self.kernel_h, self.kernel_w), - c_in=self.n_channels) - in_layer = weight_norm.Conv2D( - num_channels=self.n_channels, - num_filters=2 * self.n_channels, - filter_size=(self.kernel_h, self.kernel_w), - dilation=(dilation_h, dilation_w), - param_attr=param_attr, - bias_attr=bias_attr, - dtype=self.dtype) - self.in_layers.append(in_layer) - - param_attr, bias_attr = get_param_attr( - "weight_norm", (1, 1), c_in=config.mel_bands) - cond_layer = weight_norm.Conv2D( - num_channels=config.mel_bands, - num_filters=2 * self.n_channels, - filter_size=(1, 1), - param_attr=param_attr, - bias_attr=bias_attr, - dtype=self.dtype) - self.cond_layers.append(cond_layer) - - if i < self.n_layers - 1: - res_skip_channels = 2 * self.n_channels - else: - res_skip_channels = self.n_channels - param_attr, bias_attr = get_param_attr( - "weight_norm", (1, 1), c_in=self.n_channels) - res_skip_layer = weight_norm.Conv2D( - num_channels=self.n_channels, - num_filters=res_skip_channels, - filter_size=(1, 1), - param_attr=param_attr, - bias_attr=bias_attr, - dtype=self.dtype) - self.res_skip_layers.append(res_skip_layer) - - self.add_sublayer("in_layer_{}".format(i), in_layer) - self.add_sublayer("cond_layer_{}".format(i), cond_layer) - self.add_sublayer("res_skip_layer_{}".format(i), res_skip_layer) - - def forward(self, audio, mel): - # audio: [bs, 1, n_group, time/group] - # mel: [bs, mel_bands, n_group, time/n_group] - audio = self.start(audio) - - for i in range(self.n_layers): - dilation_h = self.dilation_h_list[i] - dilation_w = 2**i - - # Pad height dim (n_group): causal convolution - # Pad width dim (time): dialated non-causal convolution - pad_top, pad_bottom = (self.kernel_h - 1) * dilation_h, 0 - pad_left = pad_right = int((self.kernel_w - 1) * dilation_w / 2) - # 
Using pad2d is a bit faster than using padding in Conv2D directly - audio_pad = fluid.layers.pad2d( - audio, paddings=[pad_top, pad_bottom, pad_left, pad_right]) - hidden = self.in_layers[i](audio_pad) - cond_hidden = self.cond_layers[i](mel) - in_acts = hidden + cond_hidden - out_acts = fluid.layers.tanh(in_acts[:, :self.n_channels, :]) * \ - fluid.layers.sigmoid(in_acts[:, self.n_channels:, :]) - res_skip_acts = self.res_skip_layers[i](out_acts) - - if i < self.n_layers - 1: - audio += res_skip_acts[:, :self.n_channels, :, :] - skip_acts = res_skip_acts[:, self.n_channels:, :, :] - else: - skip_acts = res_skip_acts - - if i == 0: - output = skip_acts - else: - output += skip_acts - - return self.end(output) - - def infer(self, audio, mel, queues): - audio = self.start(audio) - - for i in range(self.n_layers): - dilation_h = self.dilation_h_list[i] - dilation_w = 2**i - - state_size = dilation_h * (self.kernel_h - 1) - queue = queues[i] - - if len(queue) == 0: - for j in range(state_size): - queue.append(fluid.layers.zeros_like(audio)) - - state = queue[0:state_size] - state = fluid.layers.concat(state + [audio], axis=2) - - queue.pop(0) - queue.append(audio) - - # Pad height dim (n_group): causal convolution - # Pad width dim (time): dialated non-causal convolution - pad_top, pad_bottom = 0, 0 - pad_left = int((self.kernel_w - 1) * dilation_w / 2) - pad_right = int((self.kernel_w - 1) * dilation_w / 2) - state = fluid.layers.pad2d( - state, paddings=[pad_top, pad_bottom, pad_left, pad_right]) - hidden = self.in_layers[i](state) - cond_hidden = self.cond_layers[i](mel) - in_acts = hidden + cond_hidden - out_acts = fluid.layers.tanh(in_acts[:, :self.n_channels, :]) * \ - fluid.layers.sigmoid(in_acts[:, self.n_channels:, :]) - res_skip_acts = self.res_skip_layers[i](out_acts) - - if i < self.n_layers - 1: - audio += res_skip_acts[:, :self.n_channels, :, :] - skip_acts = res_skip_acts[:, self.n_channels:, :, :] - else: - skip_acts = res_skip_acts - - if i == 0: - output = skip_acts - else: - output += skip_acts - - return self.end(output) - - -class WaveFlowModule(dg.Layer): - """WaveFlow model implementation. - - Args: - config (obj): model configuration parameters. - - Returns: - WaveFlowModule - """ - - def __init__(self, config): - super(WaveFlowModule, self).__init__() - self.n_flows = config.n_flows - self.n_group = config.n_group - self.n_layers = config.n_layers - self.upsample_factors = config.upsample_factors if hasattr( - config, "upsample_factors") else [16, 16] - assert self.n_group % 2 == 0 - assert self.n_flows % 2 == 0 - - self.dtype = "float16" if config.use_fp16 else "float32" - self.conditioner = Conditioner(self.dtype, self.upsample_factors) - self.flows = [] - for i in range(self.n_flows): - flow = Flow(config) - self.flows.append(flow) - self.add_sublayer("flow_{}".format(i), flow) - - self.perms = [] - half = self.n_group // 2 - for i in range(self.n_flows): - perm = list(range(self.n_group)) - if i < self.n_flows // 2: - perm = perm[::-1] - else: - perm[:half] = reversed(perm[:half]) - perm[half:] = reversed(perm[half:]) - self.perms.append(perm) - - def forward(self, audio, mel): - """Training forward pass. - - Use a conditioner to upsample mel spectrograms into hidden states. - These hidden states along with the audio are passed to a stack of Flow - modules to obtain the final latent variable z and a list of log scaling - variables, which are then passed to the WaveFlowLoss module to calculate - the negative log likelihood. - - Args: - audio (obj): audio samples. 
- mel (obj): mel spectrograms. - - Returns: - z (obj): latent variable. - log_s_list(list): list of log scaling variables. - """ - mel = self.conditioner(mel) - assert mel.shape[2] >= audio.shape[1] - # Prune out the tail of audio/mel so that time/n_group == 0. - pruned_len = int(audio.shape[1] // self.n_group * self.n_group) - - if audio.shape[1] > pruned_len: - audio = audio[:, :pruned_len] - if mel.shape[2] > pruned_len: - mel = mel[:, :, :pruned_len] - - # From [bs, mel_bands, time] to [bs, mel_bands, n_group, time/n_group] - mel = fluid.layers.transpose(unfold(mel, self.n_group), [0, 1, 3, 2]) - # From [bs, time] to [bs, n_group, time/n_group] - audio = fluid.layers.transpose(unfold(audio, self.n_group), [0, 2, 1]) - # [bs, 1, n_group, time/n_group] - audio = fluid.layers.unsqueeze(audio, 1) - log_s_list = [] - for i in range(self.n_flows): - inputs = audio[:, :, :-1, :] - conds = mel[:, :, 1:, :] - outputs = self.flows[i](inputs, conds) - log_s = outputs[:, :1, :, :] - b = outputs[:, 1:, :, :] - log_s_list.append(log_s) - - audio_0 = audio[:, :, :1, :] - audio_out = audio[:, :, 1:, :] * fluid.layers.exp(log_s) + b - audio = fluid.layers.concat([audio_0, audio_out], axis=2) - - # Permute over the height dim. - audio_slices = [audio[:, :, j, :] for j in self.perms[i]] - audio = fluid.layers.stack(audio_slices, axis=2) - mel_slices = [mel[:, :, j, :] for j in self.perms[i]] - mel = fluid.layers.stack(mel_slices, axis=2) - - z = fluid.layers.squeeze(audio, [1]) - return z, log_s_list - - def synthesize(self, mel, sigma=1.0): - """Use model to synthesize waveform. - - Use a conditioner to upsample mel spectrograms into hidden states. - These hidden states along with initial random gaussian latent variable - are passed to a stack of Flow modules to obtain the audio output. - - Note that we use convolutional queue (https://arxiv.org/abs/1611.09482) - to cache the intermediate hidden states, which will speed up the - autoregressive inference over the height dimension. Current - implementation only supports height dimension (self.n_group) equals - 8 or 16, i.e., where there is no dilation on the height dimension. - - Args: - mel (obj): mel spectrograms. - sigma (float, optional): standard deviation of the guassian latent - variable. Defaults to 1.0. - - Returns: - audio (obj): synthesized audio. - """ - if self.dtype == "float16": - mel = fluid.layers.cast(mel, self.dtype) - mel = self.conditioner.infer(mel) - # Prune out the tail of mel so that time/n_group == 0. - pruned_len = int(mel.shape[2] // self.n_group * self.n_group) - if mel.shape[2] > pruned_len: - mel = mel[:, :, :pruned_len] - # From [bs, mel_bands, time] to [bs, mel_bands, n_group, time/n_group] - mel = fluid.layers.transpose(unfold(mel, self.n_group), [0, 1, 3, 2]) - - audio = fluid.layers.gaussian_random( - shape=[mel.shape[0], 1, mel.shape[2], mel.shape[3]], std=sigma) - if self.dtype == "float16": - audio = fluid.layers.cast(audio, self.dtype) - for i in reversed(range(self.n_flows)): - # Permute over the height dimension. 
- audio_slices = [audio[:, :, j, :] for j in self.perms[i]] - audio = fluid.layers.stack(audio_slices, axis=2) - mel_slices = [mel[:, :, j, :] for j in self.perms[i]] - mel = fluid.layers.stack(mel_slices, axis=2) - - audio_list = [] - audio_0 = audio[:, :, 0:1, :] - audio_list.append(audio_0) - audio_h = audio_0 - queues = [[] for _ in range(self.n_layers)] - - for h in range(1, self.n_group): - inputs = audio_h - conds = mel[:, :, h:(h + 1), :] - outputs = self.flows[i].infer(inputs, conds, queues) - - log_s = outputs[:, 0:1, :, :] - b = outputs[:, 1:, :, :] - audio_h = (audio[:, :, h:(h+1), :] - b) / \ - fluid.layers.exp(log_s) - audio_list.append(audio_h) - - audio = fluid.layers.concat(audio_list, axis=2) - - # audio: [bs, n_group, time/n_group] - audio = fluid.layers.squeeze(audio, [1]) - # audio: [bs, time] - audio = fluid.layers.reshape( - fluid.layers.transpose(audio, [0, 2, 1]), [audio.shape[0], -1]) - return audio diff --git a/parakeet/models/wavenet.py b/parakeet/models/wavenet.py new file mode 100644 index 0000000..41a06be --- /dev/null +++ b/parakeet/models/wavenet.py @@ -0,0 +1,717 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from __future__ import division +import math +import time +from tqdm import trange +import numpy as np + +import paddle +from paddle import nn +from paddle.nn import functional as F +import paddle.fluid.initializer as I +import paddle.fluid.layers.distributions as D + +from parakeet.modules.conv import Conv1dCell + +__all__ = ["ConditionalWavenet"] + +def quantize(values, n_bands): + """Linearlly quantize a float Tensor in [-1, 1) to an interger Tensor in [0, n_bands). + + Args: + values (Variable): dtype: flaot32 or float64. the floating point value. + n_bands (int): the number of bands. The output integer Tensor's value is in the range [0, n_bans). + + Returns: + Variable: the quantized tensor, dtype: int64. + """ + quantized = paddle.cast((values + 1.0) / 2.0 * n_bands, "int64") + return quantized + + +def dequantize(quantized, n_bands, dtype=None): + """Linearlly dequantize an integer Tensor into a float Tensor in the range [-1, 1). + + Args: + quantized (Variable): dtype: int64. The quantized value in the range [0, n_bands). + n_bands (int): number of bands. The input integer Tensor's value is in the range [0, n_bans). + + Returns: + Variable: the dequantized tensor, dtype is specified by dtype. + """ + dtype = dtype or paddle.get_default_dtype() + value = (paddle.cast(quantized, dtype) + 0.5) * (2.0 / n_bands) - 1.0 + return value + + +def crop(x, audio_start, audio_length): + """Crop the upsampled condition to match audio_length. The upsampled condition has the same time steps as the whole audio does. But since audios are sliced to 0.5 seconds randomly while conditions are not, upsampled conditions should also be sliced to extaclt match the time steps of the audio slice. + + Args: + x (Variable): shape(B, C, T), dtype float32, the upsample condition. 
+ audio_start (Variable): shape(B, ), dtype: int64, the index the starting point. + audio_length (int): the length of the audio (number of samples it contaions). + + Returns: + Variable: shape(B, C, audio_length), cropped condition. + """ + # crop audio + slices = [] # for each example + # paddle now supports Tensor of shape [1] in slice + # starts = audio_start.numpy() + for i in range(x.shape[0]): + start = audio_start[i] + end = start + audio_length + slice = paddle.slice(x[i], axes=[1], starts=[start], ends=[end]) + slices.append(slice) + out = paddle.stack(slices) + return out + + +class ResidualBlock(nn.Layer): + def __init__(self, residual_channels, condition_dim, filter_size, + dilation): + """A Residual block in wavenet. It does not have parametric residual or skip connection. It consists of a Conv1DCell and an Conv1D(filter_size = 1) to integrate the condition. + + Args: + residual_channels (int): the channels of the input, residual and skip. + condition_dim (int): the channels of the condition. + filter_size (int): filter size of the internal convolution cell. + dilation (int): dilation of the internal convolution cell. + """ + super(ResidualBlock, self).__init__() + dilated_channels = 2 * residual_channels + # following clarinet's implementation, we do not have parametric residual + # & skip connection. + + _filter_size = filter_size[0] if isinstance(filter_size, (list, tuple)) else filter_size + std = math.sqrt(1 / (_filter_size * residual_channels)) + conv = Conv1dCell(residual_channels, + dilated_channels, + filter_size, + dilation=dilation, + weight_attr=I.Normal(scale=std)) + self.conv = nn.utils.weight_norm(conv) + + std = math.sqrt(1 / condition_dim) + condition_proj = Conv1dCell(condition_dim, dilated_channels, (1,), + weight_attr=I.Normal(scale=std)) + self.condition_proj = nn.utils.weight_norm(condition_proj) + + self.filter_size = filter_size + self.dilation = dilation + self.dilated_channels = dilated_channels + self.residual_channels = residual_channels + self.condition_dim = condition_dim + + def forward(self, x, condition=None): + """Conv1D gated-tanh Block. + + Args: + x (Tensor): shape(B, C_res, T), the input. (B stands for batch_size, + C_res stands for residual channels, T stands for time steps.) + dtype float32. + condition (Tensor, optional): shape(B, C_cond, T), the condition, + it has been upsampled in time steps, so it has the same time + steps as the input does.(C_cond stands for the condition's channels). + Defaults to None. + + Returns: + (residual, skip_connection) + residual (Tensor): shape(B, C_res, T), the residual, which is used + as the input to the next layer of ResidualBlock. + skip_connection (Tensor): shape(B, C_res, T), the skip connection. + This output is accumulated with that of other ResidualBlocks. + """ + h = x + + # dilated conv + h = self.conv(h) + + # condition + if condition is not None: + h += self.condition_proj(condition) + + # gated tanh + content, gate = paddle.split(h, 2, axis=1) + z = F.sigmoid(gate) * paddle.tanh(content) + + # projection + residual = paddle.scale(z + x, math.sqrt(.5)) + skip_connection = z + return residual, skip_connection + + def start_sequence(self): + """ + Prepare the ResidualBlock to generate a new sequence. This method + should be called before starting calling `add_input` multiple times. + """ + self.conv.start_sequence() + self.condition_proj.start_sequence() + + def add_input(self, x, condition=None): + """ + Add a step input. 
This method works similarily with `forward` but + in a `step-in-step-out` fashion. + + Args: + x (Variable): shape(B, C_res), input for a step, dtype float32. + condition (Variable, optional): shape(B, C_cond). condition for a + step, dtype float32. Defaults to None. + + Returns: + (residual, skip_connection) + residual (Variable): shape(B, C_res), the residual for a step, + which is used as the input to the next layer of ResidualBlock. + skip_connection (Variable): shape(B, C_res), the skip connection + for a step. This output is accumulated with that of other + ResidualBlocks. + """ + h = x + + # dilated conv + h = self.conv.add_input(h) + + # condition + if condition is not None: + h += self.condition_proj.add_input(condition) + + # gated tanh + content, gate = paddle.split(h, 2, axis=1) + z = F.sigmoid(gate) * paddle.tanh(content) + + # projection + residual = paddle.scale(z + x, math.sqrt(0.5)) + skip_connection = z + return residual, skip_connection + + +class ResidualNet(nn.LayerList): + def __init__(self, n_loop, n_layer, residual_channels, condition_dim, + filter_size): + """The residual network in wavenet. It consists of `n_layer` stacks, + each of which consists of `n_loop` ResidualBlocks. + + Args: + n_loop (int): number of ResidualBlocks in a stack. + n_layer (int): number of stacks in the `ResidualNet`. + residual_channels (int): channels of each `ResidualBlock`'s input. + condition_dim (int): channels of the condition. + filter_size (int): filter size of the internal Conv1DCell of each + `ResidualBlock`. + """ + super(ResidualNet, self).__init__() + # double the dilation at each layer in a loop(n_loop layers) + dilations = [2**i for i in range(n_loop)] * n_layer + self.context_size = 1 + sum(dilations) + for dilation in dilations: + self.append(ResidualBlock(residual_channels, condition_dim, filter_size, dilation)) + + def forward(self, x, condition=None): + """ + Args: + x (Tensor): shape(B, C_res, T), dtype float32, the input. + (B stands for batch_size, C_res stands for residual channels, + T stands for time steps.) + condition (Tensor, optional): shape(B, C_cond, T), dtype float32, + the condition, it has been upsampled in time steps, so it has + the same time steps as the input does.(C_cond stands for the + condition's channels) Defaults to None. + + Returns: + skip_connection (Tensor): shape(B, C_res, T), dtype float32, the output. + """ + for i, func in enumerate(self): + x, skip = func(x, condition) + if i == 0: + skip_connections = skip + else: + skip_connections = paddle.scale(skip_connections + skip, + math.sqrt(0.5)) + return skip_connections + + def start_sequence(self): + """Prepare the ResidualNet to generate a new sequence. This method + should be called before starting calling `add_input` multiple times. + """ + for block in self: + block.start_sequence() + + def add_input(self, x, condition=None): + """Add a step input. This method works similarily with `forward` but + in a `step-in-step-out` fashion. + + Args: + x (Tensor): shape(B, C_res), dtype float32, input for a step. + condition (Tensor, optional): shape(B, C_cond), dtype float32, + condition for a step. Defaults to None. + + Returns: + skip_connection (Tensor): shape(B, C_res), dtype float32, the + output for a step. 
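+
+        Note: call `start_sequence` once before the first `add_input` of a
+        new utterance; see the sampling loop in `ConditionalWavenet.synthesis`
+        for the intended step-by-step usage.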
+ """ + + for i, func in enumerate(self): + x, skip = func.add_input(x, condition) + if i == 0: + skip_connections = skip + else: + skip_connections = paddle.scale(skip_connections + skip, + math.sqrt(0.5)) + return skip_connections + + +class WaveNet(nn.Layer): + def __init__(self, n_loop, n_layer, residual_channels, output_dim, + condition_dim, filter_size, loss_type, log_scale_min): + """Wavenet that transform upsampled mel spectrogram into waveform. + + Args: + n_loop (int): n_loop for the internal ResidualNet. + n_layer (int): n_loop for the internal ResidualNet. + residual_channels (int): the channel of the input. + output_dim (int): the channel of the output distribution. + condition_dim (int): the channel of the condition. + filter_size (int): the filter size of the internal ResidualNet. + loss_type (str): loss type of the wavenet. Possible values are + 'softmax' and 'mog'. + If `loss_type` is 'softmax', the output is the logits of the + catrgotical(multinomial) distribution, `output_dim` means the + number of classes of the categorical distribution. + If `loss_type` is mog(mixture of gaussians), the output is the + parameters of a mixture of gaussians, which consists of weight + (in the form of logit) of each gaussian distribution and its + mean and log standard deviaton. So when `loss_type` is 'mog', + `output_dim` should be perfectly divided by 3. + log_scale_min (int): the minimum value of log standard deviation + of the output gaussian distributions. Note that this value is + only used for computing loss if `loss_type` is 'mog', values + less than `log_scale_min` is clipped when computing loss. + """ + super(WaveNet, self).__init__() + if loss_type not in ["softmax", "mog"]: + raise ValueError("loss_type {} is not supported".format(loss_type)) + if loss_type == "softmax": + self.embed = nn.Embedding(output_dim, residual_channels) + else: + if (output_dim % 3 != 0): + raise ValueError( + "with Mixture of Gaussians(mog) output, the output dim must be divisible by 3, but get {}".format(output_dim)) + self.embed = nn.utils.weight_norm(nn.Linear(1, residual_channels), dim=-1) + + self.resnet = ResidualNet(n_loop, n_layer, residual_channels, + condition_dim, filter_size) + self.context_size = self.resnet.context_size + + skip_channels = residual_channels # assume the same channel + self.proj1 = nn.utils.weight_norm(nn.Linear(skip_channels, skip_channels), dim=-1) + self.proj2 = nn.utils.weight_norm(nn.Linear(skip_channels, skip_channels), dim=-1) + # if loss_type is softmax, output_dim is n_vocab of waveform magnitude. + # if loss_type is mog, output_dim is 3 * gaussian, (weight, mean and stddev) + self.proj3 = nn.utils.weight_norm(nn.Linear(skip_channels, output_dim), dim=-1) + + self.loss_type = loss_type + self.output_dim = output_dim + self.input_dim = 1 + self.skip_channels = skip_channels + self.log_scale_min = log_scale_min + + def forward(self, x, condition=None): + """compute the output distribution (represented by its parameters). + + Args: + x (Tensor): shape(B, T), dtype float32, the input waveform. + condition (Tensor, optional): shape(B, C_cond, T), dtype float32, + the upsampled condition. Defaults to None. + + Returns: + Tensor: shape(B, T, C_output), dtype float32, the parameter of + the output distributions. 
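+
+        Note (illustrative sizes): with `loss_type="mog"` and `output_dim=30`
+        the last axis holds the logit-weights, means and log standard
+        deviations of 10 Gaussians; with `loss_type="softmax"` and
+        `output_dim=256` it holds the logits over 256 quantization bins.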
+ """ + + # Causal Conv + if self.loss_type == "softmax": + x = paddle.clip(x, min=-1., max=0.99999) + x = quantize(x, self.output_dim) + x = self.embed(x) # (B, T, C) + else: + x = paddle.unsqueeze(x, -1) # (B, T, 1) + x = self.embed(x) # (B, T, C) + x = paddle.transpose(x, perm=[0, 2, 1]) # (B, C, T) + + # Residual & Skip-conenection & linears + z = self.resnet(x, condition) + + z = paddle.transpose(z, [0, 2, 1]) + z = F.relu(self.proj2(F.relu(self.proj1(z)))) + + y = self.proj3(z) + return y + + def start_sequence(self): + """Prepare the WaveNet to generate a new sequence. This method should + be called before starting calling `add_input` multiple times. + """ + self.resnet.start_sequence() + + def add_input(self, x, condition=None): + """compute the output distribution (represented by its parameters) for + a step. It works similarily with the `forward` method but in a + `step-in-step-out` fashion. + + Args: + x (Tensor): shape(B,), dtype float32, a step of the input waveform. + condition (Tensor, optional): shape(B, C_cond, ), dtype float32, a + step of the upsampled condition. Defaults to None. + + Returns: + Tensor: shape(B, C_output), dtype float32, the parameter of the + output distributions. + """ + # Causal Conv + if self.loss_type == "softmax": + x = paddle.clip(x, min=-1., max=0.99999) + x = quantize(x, self.output_dim) + x = self.embed(x) # (B, C) + else: + x = paddle.unsqueeze(x, -1) # (B, 1) + x = self.embed(x) # (B, C) + + # Residual & Skip-conenection & linears + z = self.resnet.add_input(x, condition) + z = F.relu(self.proj2(F.relu(self.proj1(z)))) # (B, C) + + # Output + y = self.proj3(z) + return y + + def compute_softmax_loss(self, y, t): + """compute the loss where output distribution is a categorial distribution. + + Args: + y (Tensor): shape(B, T, C_output), dtype float32, the logits of the + output distribution. + t (Tensor): shape(B, T), dtype float32, the target audio. Note that + the target's corresponding time index is one step ahead of the + output distribution. And output distribution whose input contains + padding is neglected in loss computation. + + Returns: + Tensor: shape(1, ), dtype float32, the loss. + """ + # context size is not taken into account + y = y[:, self.context_size:, :] + t = t[:, self.context_size:] + t = paddle.clip(t, min=-1.0, max=0.99999) + quantized = quantize(t, n_bands=self.output_dim) + label = paddle.unsqueeze(quantized, -1) + + loss = F.softmax_with_cross_entropy(y, label) + reduced_loss = paddle.reduce_mean(loss) + return reduced_loss + + def sample_from_softmax(self, y): + """Sample from the output distribution where the output distribution is + a categorical distriobution. + + Args: + y (Tensor): shape(B, T, C_output), the logits of the output distribution. + + Returns: + Tensor: shape(B, T), waveform sampled from the output distribution. + """ + # dequantize + batch_size, time_steps, output_dim, = y.shape + y = paddle.reshape(y, (batch_size * time_steps, output_dim)) + prob = F.softmax(y) + quantized = paddle.fluid.layers.sampling_id(prob) + samples = dequantize(quantized, n_bands=self.output_dim) + samples = paddle.reshape(samples, (batch_size, -1)) + return samples + + def compute_mog_loss(self, y, t): + """compute the loss where output distribution is a mixture of Gaussians. + + Args: + y (Tensor): shape(B, T, C_output), dtype float32, the parameterd of + the output distribution. 
It is the concatenation of 3 parts, + the logits of every distribution, the mean of each distribution + and the log standard deviation of each distribution. Each part's + shape is (B, T, n_mixture), where `n_mixture` means the number + of Gaussians in the mixture. + t (Tensor): shape(B, T), dtype float32, the target audio. Note that + the target's corresponding time index is one step ahead of the + output distribution. And output distribution whose input contains + padding is neglected in loss computation. + + Returns: + Tensor: shape(1, ), dtype float32, the loss. + """ + n_mixture = self.output_dim // 3 + + # context size is not taken in to account + y = y[:, self.context_size:, :] + t = t[:, self.context_size:] + + w, mu, log_std = paddle.split(y, 3, axis=2) + # 100.0 is just a large float + log_std = paddle.clip(log_std, min=self.log_scale_min, max=100.) + inv_std = paddle.exp(-log_std) + p_mixture = F.softmax(w, -1) + + t = paddle.unsqueeze(t, -1) + if n_mixture > 1: + # t = F.expand_as(t, log_std) + t = paddle.expand(t, [-1, -1, n_mixture]) + + x_std = inv_std * (t - mu) + exponent = paddle.exp(-0.5 * x_std * x_std) + pdf_x = 1.0 / math.sqrt(2.0 * math.pi) * inv_std * exponent + + pdf_x = p_mixture * pdf_x + # pdf_x: [bs, len] + pdf_x = paddle.reduce_sum(pdf_x, -1) + per_sample_loss = -paddle.log(pdf_x + 1e-9) + + loss = paddle.reduce_mean(per_sample_loss) + return loss + + def sample_from_mog(self, y): + """Sample from the output distribution where the output distribution is + a mixture of Gaussians. + Args: + y (Tensor): shape(B, T, C_output), dtype float32, the parameterd of + the output distribution. It is the concatenation of 3 parts, the + logits of every distribution, the mean of each distribution and the + log standard deviation of each distribution. Each part's shape is + (B, T, n_mixture), where `n_mixture` means the number of Gaussians + in the mixture. + + Returns: + Tensor: shape(B, T), waveform sampled from the output distribution. + """ + batch_size, time_steps, output_dim = y.shape + n_mixture = output_dim // 3 + + w, mu, log_std = paddle.split(y, 3, -1) + + reshaped_w = paddle.reshape(w, (batch_size * time_steps, n_mixture)) + prob_ids = paddle.fluid.layers.sampling_id(F.softmax(reshaped_w)) + prob_ids = paddle.reshape(prob_ids, (batch_size, time_steps)) + prob_ids = prob_ids.numpy() + + # do it + index = np.array([[[b, t, prob_ids[b, t]] for t in range(time_steps)] + for b in range(batch_size)]).astype("int32") + index_var = paddle.to_tensor(index) + + mu_ = paddle.gather_nd(mu, index_var) + log_std_ = paddle.gather_nd(log_std, index_var) + + dist = D.Normal(mu_, paddle.exp(log_std_)) + samples = dist.sample(shape=[]) + samples = paddle.clip(samples, min=-1., max=1.) + return samples + + def sample(self, y): + """Sample from the output distribution. + Args: + y (Tensor): shape(B, T, C_output), dtype float32, the parameterd of + the output distribution. + + Returns: + Tensor: shape(B, T), waveform sampled from the output distribution. + """ + if self.loss_type == "softmax": + return self.sample_from_softmax(y) + else: + return self.sample_from_mog(y) + + def loss(self, y, t): + """compute the loss where output distribution is a mixture of Gaussians. + + Args: + y (Tensor): shape(B, T, C_output), dtype float32, the parameterd of + the output distribution. + t (Tensor): shape(B, T), dtype float32, the target audio. Note that + the target's corresponding time index is one step ahead of the + output distribution. 
And output distribution whose input contains + padding is neglected in loss computation. + + Returns: + Tensor: shape(1, ), dtype float32, the loss. + """ + if self.loss_type == "softmax": + return self.compute_softmax_loss(y, t) + else: + return self.compute_mog_loss(y, t) + + +class UpsampleNet(nn.LayerList): + def __init__(self, upscale_factors=[16, 16]): + """UpsamplingNet. + It consists of several layers of Conv2DTranspose. Each Conv2DTranspose + layer upsamples the time dimension by its `stride` times. And each + Conv2DTranspose's filter_size at frequency dimension is 3. + + Args: + upscale_factors (list[int], optional): time upsampling factors for + each Conv2DTranspose Layer. The `UpsampleNet` contains + len(upscale_factor) Conv2DTranspose Layers. Each upscale_factor + is used as the `stride` for the corresponding Conv2DTranspose. + Defaults to [16, 16]. + Note: + np.prod(upscale_factors) should equals the `hop_length` of the stft + transformation used to extract spectrogram features from audios. + For example, 16 * 16 = 256, then the spectram extracted using a + stft transformation whose `hop_length` is 256. See `librosa.stft` + for more details. + """ + super(UpsampleNet, self).__init__() + self.upscale_factors = list(upscale_factors) + self.upscale_factor = 1 + for item in upscale_factors: + self.upscale_factor *= item + + for factor in self.upscale_factors: + self.append( + nn.utils.weight_norm( + nn.ConvTranspose2d(1, 1, + kernel_size=(3, 2 * factor), + stride=(1, factor), + padding=(1, factor // 2)))) + + def forward(self, x): + """Compute the upsampled condition. + + Args: + x (Tensor): shape(B, F, T), dtype float32, the condition + (mel spectrogram here.) (F means the frequency bands). In the + internal Conv2DTransposes, the frequency dimension is treated + as `height` dimension instead of `in_channels`. + + Returns: + Tensor: shape(B, F, T * upscale_factor), dtype float32, the + upsampled condition. + """ + x = paddle.unsqueeze(x, 1) + for sublayer in self: + x = F.leaky_relu(sublayer(x), 0.4) + x = paddle.squeeze(x, 1) + return x + + +class ConditionalWavenet(nn.Layer): + def __init__(self, encoder, decoder): + """Conditional Wavenet, which contains an UpsampleNet as the encoder + and a WaveNet as the decoder. It is an autoregressive model. + + Args: + encoder (UpsampleNet): the UpsampleNet as the encoder. + decoder (WaveNet): the WaveNet as the decoder. + """ + super(ConditionalWavenet, self).__init__() + self.encoder = encoder + self.decoder = decoder + + def forward(self, audio, mel, audio_start): + """Compute the output distribution given the mel spectrogram and the + input(for teacher force training). + + Args: + audio (Tensor): shape(B, T_audio), dtype float32, ground truth + waveform, used for teacher force training. + mel (Tensor): shape(B, F, T_mel), dtype float32, mel spectrogram. + Note that it is the spectrogram for the whole utterance. + audio_start (Tensor): shape(B, ), dtype: int, audio slices' start + positions for each utterance. + + Returns: + Tensor: shape(B, T_audio - 1, C_putput), parameters for the output + distribution.(C_output is the `output_dim` of the decoder.) 
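+
+        Note: the input audio is shifted one step against the condition
+        (the last sample is dropped from the input and the first frame is
+        dropped from the cropped condition), so the output at step t
+        parameterizes the distribution of audio sample t + 1.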
+ """ + audio_length = audio.shape[1] # audio clip's length + condition = self.encoder(mel) + condition_slice = crop(condition, audio_start, audio_length) + + # shifting 1 step + audio = audio[:, :-1] + condition_slice = condition_slice[:, :, 1:] + + y = self.decoder(audio, condition_slice) + return y + + def loss(self, y, t): + """compute loss with respect to the output distribution and the targer + audio. + + Args: + y (Tensor): shape(B, T - 1, C_output), dtype float32, parameters of + the output distribution. + t (Tensor): shape(B, T), dtype float32, target waveform. + + Returns: + Tensor: shape(1, ), dtype float32, the loss. + """ + t = t[:, 1:] + loss = self.decoder.loss(y, t) + return loss + + def sample(self, y): + """Sample from the output distribution. + + Args: + y (Tensor): shape(B, T, C_output), dtype float32, parameters of the + output distribution. + + Returns: + Tensor: shape(B, T), dtype float32, sampled waveform from the output + distribution. + """ + samples = self.decoder.sample(y) + return samples + + @paddle.no_grad() + def synthesis(self, mel): + """Synthesize waveform from mel spectrogram. + + Args: + mel (Tensor): shape(B, F, T), condition(mel spectrogram here). + + Returns: + Tensor: shape(B, T * upsacle_factor), synthesized waveform. + (`upscale_factor` is the `upscale_factor` of the encoder + `UpsampleNet`) + """ + condition = self.encoder(mel) + batch_size, _, time_steps = condition.shape + samples = [] + + self.decoder.start_sequence() + x_t = paddle.zeros((batch_size, ), dtype=mel.dtype) + for i in trange(time_steps): + c_t = condition[:, :, i] + y_t = self.decoder.add_input(x_t, c_t) + y_t = paddle.unsqueeze(y_t, 1) + x_t = self.sample(y_t) + x_t = paddle.squeeze(x_t, 1) + samples.append(x_t) + + samples = paddle.concat(samples, -1) + return samples + + +# TODO WaveNetLoss \ No newline at end of file diff --git a/parakeet/models/wavenet/__init__.py b/parakeet/models/wavenet/__init__.py deleted file mode 100644 index 7aa10e0..0000000 --- a/parakeet/models/wavenet/__init__.py +++ /dev/null @@ -1,16 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from .net import * -from .wavenet import * \ No newline at end of file diff --git a/parakeet/models/wavenet/net.py b/parakeet/models/wavenet/net.py deleted file mode 100644 index 52762e3..0000000 --- a/parakeet/models/wavenet/net.py +++ /dev/null @@ -1,179 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import division -import itertools -import numpy as np -from scipy import signal -from tqdm import trange - -import paddle.fluid.layers as F -import paddle.fluid.dygraph as dg -import paddle.fluid.initializer as I -import paddle.fluid.layers.distributions as D - -from parakeet.modules.weight_norm import Conv2DTranspose -from parakeet.models.wavenet.wavenet import WaveNet - - -def crop(x, audio_start, audio_length): - """Crop the upsampled condition to match audio_length. The upsampled condition has the same time steps as the whole audio does. But since audios are sliced to 0.5 seconds randomly while conditions are not, upsampled conditions should also be sliced to extaclt match the time steps of the audio slice. - - Args: - x (Variable): shape(B, C, T), dtype float32, the upsample condition. - audio_start (Variable): shape(B, ), dtype: int64, the index the starting point. - audio_length (int): the length of the audio (number of samples it contaions). - - Returns: - Variable: shape(B, C, audio_length), cropped condition. - """ - # crop audio - slices = [] # for each example - starts = audio_start.numpy() - for i in range(x.shape[0]): - start = starts[i] - end = start + audio_length - slice = F.slice(x[i], axes=[1], starts=[start], ends=[end]) - slices.append(slice) - out = F.stack(slices) - return out - - -class UpsampleNet(dg.Layer): - def __init__(self, upscale_factors=[16, 16]): - """UpsamplingNet. - It consists of several layers of Conv2DTranspose. Each Conv2DTranspose layer upsamples the time dimension by its `stride` times. And each Conv2DTranspose's filter_size at frequency dimension is 3. - - Args: - upscale_factors (list[int], optional): time upsampling factors for each Conv2DTranspose Layer. The `UpsampleNet` contains len(upscale_factor) Conv2DTranspose Layers. Each upscale_factor is used as the `stride` for the corresponding Conv2DTranspose. Defaults to [16, 16]. - Note: - np.prod(upscale_factors) should equals the `hop_length` of the stft transformation used to extract spectrogram features from audios. For example, 16 * 16 = 256, then the spectram extracted using a stft transformation whose `hop_length` is 256. See `librosa.stft` for more details. - """ - super(UpsampleNet, self).__init__() - self.upscale_factors = list(upscale_factors) - self.upsample_convs = dg.LayerList() - for i, factor in enumerate(upscale_factors): - self.upsample_convs.append( - Conv2DTranspose( - 1, - 1, - filter_size=(3, 2 * factor), - stride=(1, factor), - padding=(1, factor // 2))) - - @property - def upscale_factor(self): - return np.prod(self.upscale_factors) - - def forward(self, x): - """Compute the upsampled condition. - - Args: - x (Variable): shape(B, F, T), dtype float32, the condition (mel spectrogram here.) (F means the frequency bands). In the internal Conv2DTransposes, the frequency dimension is treated as `height` dimension instead of `in_channels`. - - Returns: - Variable: shape(B, F, T * upscale_factor), dtype float32, the upsampled condition. - """ - x = F.unsqueeze(x, axes=[1]) - for sublayer in self.upsample_convs: - x = F.leaky_relu(sublayer(x), alpha=.4) - x = F.squeeze(x, [1]) - return x - - -# AutoRegressive Model -class ConditionalWavenet(dg.Layer): - def __init__(self, encoder, decoder): - """Conditional Wavenet, which contains an UpsampleNet as the encoder and a WaveNet as the decoder. It is an autoregressive model. 
- - Args: - encoder (UpsampleNet): the UpsampleNet as the encoder. - decoder (WaveNet): the WaveNet as the decoder. - """ - super(ConditionalWavenet, self).__init__() - self.encoder = encoder - self.decoder = decoder - - def forward(self, audio, mel, audio_start): - """Compute the output distribution given the mel spectrogram and the input(for teacher force training). - - Args: - audio (Variable): shape(B, T_audio), dtype float32, ground truth waveform, used for teacher force training. - mel ([Variable): shape(B, F, T_mel), dtype float32, mel spectrogram. Note that it is the spectrogram for the whole utterance. - audio_start (Variable): shape(B, ), dtype: int, audio slices' start positions for each utterance. - - Returns: - Variable: shape(B, T_audio - 1, C_putput), parameters for the output distribution.(C_output is the `output_dim` of the decoder.) - """ - audio_length = audio.shape[1] # audio clip's length - condition = self.encoder(mel) - condition_slice = crop(condition, audio_start, audio_length) - - # shifting 1 step - audio = audio[:, :-1] - condition_slice = condition_slice[:, :, 1:] - - y = self.decoder(audio, condition_slice) - return y - - def loss(self, y, t): - """compute loss with respect to the output distribution and the targer audio. - - Args: - y (Variable): shape(B, T - 1, C_output), dtype float32, parameters of the output distribution. - t (Variable): shape(B, T), dtype float32, target waveform. - - Returns: - Variable: shape(1, ), dtype float32, the loss. - """ - t = t[:, 1:] - loss = self.decoder.loss(y, t) - return loss - - def sample(self, y): - """Sample from the output distribution. - - Args: - y (Variable): shape(B, T, C_output), dtype float32, parameters of the output distribution. - - Returns: - Variable: shape(B, T), dtype float32, sampled waveform from the output distribution. - """ - samples = self.decoder.sample(y) - return samples - - @dg.no_grad - def synthesis(self, mel): - """Synthesize waveform from mel spectrogram. - - Args: - mel (Variable): shape(B, F, T), condition(mel spectrogram here). - - Returns: - Variable: shape(B, T * upsacle_factor), synthesized waveform.(`upscale_factor` is the `upscale_factor` of the encoder `UpsampleNet`) - """ - condition = self.encoder(mel) - batch_size, _, time_steps = condition.shape - samples = [] - - self.decoder.start_sequence() - x_t = F.zeros((batch_size, 1), dtype="float32") - for i in trange(time_steps): - c_t = condition[:, :, i:i + 1] - y_t = self.decoder.add_input(x_t, c_t) - x_t = self.sample(y_t) - samples.append(x_t) - - samples = F.concat(samples, axis=-1) - return samples diff --git a/parakeet/models/wavenet/wavenet.py b/parakeet/models/wavenet/wavenet.py deleted file mode 100644 index a0296e1..0000000 --- a/parakeet/models/wavenet/wavenet.py +++ /dev/null @@ -1,467 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -from __future__ import division -import math -import time -import itertools -import numpy as np - -import paddle.fluid.layers as F -import paddle.fluid.dygraph as dg -import paddle.fluid.initializer as I -import paddle.fluid.layers.distributions as D - -from parakeet.modules.weight_norm import Linear, Conv1D, Conv1DCell, Conv2DTranspose - - -# for wavenet with softmax loss -def quantize(values, n_bands): - """Linearlly quantize a float Tensor in [-1, 1) to an interger Tensor in [0, n_bands). - - Args: - values (Variable): dtype: flaot32 or float64. the floating point value. - n_bands (int): the number of bands. The output integer Tensor's value is in the range [0, n_bans). - - Returns: - Variable: the quantized tensor, dtype: int64. - """ - quantized = F.cast((values + 1.0) / 2.0 * n_bands, "int64") - return quantized - - -def dequantize(quantized, n_bands): - """Linearlly dequantize an integer Tensor into a float Tensor in the range [-1, 1). - - Args: - quantized (Variable): dtype: int64. The quantized value in the range [0, n_bands). - n_bands (int): number of bands. The input integer Tensor's value is in the range [0, n_bans). - - Returns: - Variable: the dequantized tensor, dtype float3232. - """ - value = (F.cast(quantized, "float32") + 0.5) * (2.0 / n_bands) - 1.0 - return value - - -class ResidualBlock(dg.Layer): - def __init__(self, residual_channels, condition_dim, filter_size, - dilation): - """A Residual block in wavenet. It does not have parametric residual or skip connection. It consists of a Conv1DCell and an Conv1D(filter_size = 1) to integrate the condition. - - Args: - residual_channels (int): the channels of the input, residual and skip. - condition_dim (int): the channels of the condition. - filter_size (int): filter size of the internal convolution cell. - dilation (int): dilation of the internal convolution cell. - """ - super(ResidualBlock, self).__init__() - dilated_channels = 2 * residual_channels - # following clarinet's implementation, we do not have parametric residual - # & skip connection. - - std = np.sqrt(1 / (filter_size * residual_channels)) - self.conv = Conv1DCell( - residual_channels, - dilated_channels, - filter_size, - dilation=dilation, - causal=True, - param_attr=I.Normal(scale=std)) - - std = np.sqrt(1 / condition_dim) - self.condition_proj = Conv1D( - condition_dim, dilated_channels, 1, param_attr=I.Normal(scale=std)) - - self.filter_size = filter_size - self.dilation = dilation - self.dilated_channels = dilated_channels - self.residual_channels = residual_channels - self.condition_dim = condition_dim - - def forward(self, x, condition=None): - """Conv1D gated-tanh Block. - - Args: - x (Variable): shape(B, C_res, T), the input. (B stands for batch_size, C_res stands for residual channels, T stands for time steps.) dtype float32. - condition (Variable, optional): shape(B, C_cond, T), the condition, it has been upsampled in time steps, so it has the same time steps as the input does.(C_cond stands for the condition's channels). Defaults to None. - - Returns: - (residual, skip_connection) - residual (Variable): shape(B, C_res, T), the residual, which is used as the input to the next layer of ResidualBlock. - skip_connection (Variable): shape(B, C_res, T), the skip connection. This output is accumulated with that of other ResidualBlocks. 
- """ - time_steps = x.shape[-1] - h = x - - # dilated conv - h = self.conv(h) - if h.shape[-1] != time_steps: - h = h[:, :, :time_steps] - - # condition - if condition is not None: - h += self.condition_proj(condition) - - # gated tanh - content, gate = F.split(h, 2, dim=1) - z = F.sigmoid(gate) * F.tanh(content) - - # projection - residual = F.scale(z + x, math.sqrt(.5)) - skip_connection = z - return residual, skip_connection - - def start_sequence(self): - """Prepare the ResidualBlock to generate a new sequence. This method should be called before starting calling `add_input` multiple times. - """ - self.conv.start_sequence() - - def add_input(self, x, condition=None): - """Add a step input. This method works similarily with `forward` but in a `step-in-step-out` fashion. - - Args: - x (Variable): shape(B, C_res, T=1), input for a step, dtype float32. - condition (Variable, optional): shape(B, C_cond, T=1). condition for a step, dtype float32. Defaults to None. - - Returns: - (residual, skip_connection) - residual (Variable): shape(B, C_res, T=1), the residual for a step, which is used as the input to the next layer of ResidualBlock. - skip_connection (Variable): shape(B, C_res, T=1), the skip connection for a step. This output is accumulated with that of other ResidualBlocks. - """ - h = x - - # dilated conv - h = self.conv.add_input(h) - - # condition - if condition is not None: - h += self.condition_proj(condition) - - # gated tanh - content, gate = F.split(h, 2, dim=1) - z = F.sigmoid(gate) * F.tanh(content) - - # projection - residual = F.scale(z + x, np.sqrt(0.5)) - skip_connection = z - return residual, skip_connection - - -class ResidualNet(dg.Layer): - def __init__(self, n_loop, n_layer, residual_channels, condition_dim, - filter_size): - """The residual network in wavenet. It consists of `n_layer` stacks, each of which consists of `n_loop` ResidualBlocks. - - Args: - n_loop (int): number of ResidualBlocks in a stack. - n_layer (int): number of stacks in the `ResidualNet`. - residual_channels (int): channels of each `ResidualBlock`'s input. - condition_dim (int): channels of the condition. - filter_size (int): filter size of the internal Conv1DCell of each `ResidualBlock`. - """ - super(ResidualNet, self).__init__() - # double the dilation at each layer in a loop(n_loop layers) - dilations = [2**i for i in range(n_loop)] * n_layer - self.context_size = 1 + sum(dilations) - self.residual_blocks = dg.LayerList([ - ResidualBlock(residual_channels, condition_dim, filter_size, - dilation) for dilation in dilations - ]) - - def forward(self, x, condition=None): - """ - Args: - x (Variable): shape(B, C_res, T), dtype float32, the input. (B stands for batch_size, C_res stands for residual channels, T stands for time steps.) - condition (Variable, optional): shape(B, C_cond, T), dtype float32, the condition, it has been upsampled in time steps, so it has the same time steps as the input does.(C_cond stands for the condition's channels) Defaults to None. - - Returns: - skip_connection (Variable): shape(B, C_res, T), dtype float32, the output. - """ - for i, func in enumerate(self.residual_blocks): - x, skip = func(x, condition) - if i == 0: - skip_connections = skip - else: - skip_connections = F.scale(skip_connections + skip, - np.sqrt(0.5)) - return skip_connections - - def start_sequence(self): - """Prepare the ResidualNet to generate a new sequence. This method should be called before starting calling `add_input` multiple times. 
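# Editor's sketch (illustrative, assumptions: made-up shapes): the gated-tanh unit used by
# the removed ResidualBlock above, re-expressed with the paddle 2.x API used elsewhere in
# this diff. h is the doubled-channel output of the dilated convolution, x the block input.
import math
import paddle
import paddle.nn.functional as F

x = paddle.randn([2, 4, 10])                 # (B, C_res, T)
h = paddle.randn([2, 8, 10])                 # (B, 2 * C_res, T) after the dilated conv

content, gate = paddle.chunk(h, 2, axis=1)   # split channels into content / gate halves
z = F.sigmoid(gate) * paddle.tanh(content)   # gated tanh
residual = (z + x) * math.sqrt(0.5)          # scaled residual passed to the next block
skip = z                                     # skip connection, accumulated across blocks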
- """ - for block in self.residual_blocks: - block.start_sequence() - - def add_input(self, x, condition=None): - """Add a step input. This method works similarily with `forward` but in a `step-in-step-out` fashion. - - Args: - x (Variable): shape(B, C_res, T=1), dtype float32, input for a step. - condition (Variable, optional): shape(B, C_cond, T=1), dtype float32, condition for a step. Defaults to None. - - Returns: - skip_connection (Variable): shape(B, C_res, T=1), dtype float32, the output for a step. - """ - - for i, func in enumerate(self.residual_blocks): - x, skip = func.add_input(x, condition) - if i == 0: - skip_connections = skip - else: - skip_connections = F.scale(skip_connections + skip, - np.sqrt(0.5)) - return skip_connections - - -class WaveNet(dg.Layer): - def __init__(self, n_loop, n_layer, residual_channels, output_dim, - condition_dim, filter_size, loss_type, log_scale_min): - """Wavenet that transform upsampled mel spectrogram into waveform. - - Args: - n_loop (int): n_loop for the internal ResidualNet. - n_layer (int): n_loop for the internal ResidualNet. - residual_channels (int): the channel of the input. - output_dim (int): the channel of the output distribution. - condition_dim (int): the channel of the condition. - filter_size (int): the filter size of the internal ResidualNet. - loss_type (str): loss type of the wavenet. Possible values are 'softmax' and 'mog'. If `loss_type` is 'softmax', the output is the logits of the catrgotical(multinomial) distribution, `output_dim` means the number of classes of the categorical distribution. If `loss_type` is mog(mixture of gaussians), the output is the parameters of a mixture of gaussians, which consists of weight(in the form of logit) of each gaussian distribution and its mean and log standard deviaton. So when `loss_type` is 'mog', `output_dim` should be perfectly divided by 3. - log_scale_min (int): the minimum value of log standard deviation of the output gaussian distributions. Note that this value is only used for computing loss if `loss_type` is 'mog', values less than `log_scale_min` is clipped when computing loss. - """ - super(WaveNet, self).__init__() - if loss_type not in ["softmax", "mog"]: - raise ValueError("loss_type {} is not supported".format(loss_type)) - if loss_type == "softmax": - self.embed = dg.Embedding((output_dim, residual_channels)) - else: - assert output_dim % 3 == 0, "with MoG output, the output dim must be divided by 3" - self.embed = Linear(1, residual_channels) - - self.resnet = ResidualNet(n_loop, n_layer, residual_channels, - condition_dim, filter_size) - self.context_size = self.resnet.context_size - - skip_channels = residual_channels # assume the same channel - self.proj1 = Linear(skip_channels, skip_channels) - self.proj2 = Linear(skip_channels, skip_channels) - # if loss_type is softmax, output_dim is n_vocab of waveform magnitude. - # if loss_type is mog, output_dim is 3 * gaussian, (weight, mean and stddev) - self.proj3 = Linear(skip_channels, output_dim) - - self.loss_type = loss_type - self.output_dim = output_dim - self.input_dim = 1 - self.skip_channels = skip_channels - self.log_scale_min = log_scale_min - - def forward(self, x, condition=None): - """compute the output distribution (represented by its parameters). - - Args: - x (Variable): shape(B, T), dtype float32, the input waveform. - condition (Variable, optional): shape(B, C_cond, T), dtype float32, the upsampled condition. Defaults to None. 
- - Returns: - Variable: shape(B, T, C_output), dtype float32, the parameter of the output distributions. - """ - - # Causal Conv - if self.loss_type == "softmax": - x = F.clip(x, min=-1., max=0.99999) - x = quantize(x, self.output_dim) - x = self.embed(x) # (B, T, C) - else: - x = F.unsqueeze(x, axes=[-1]) # (B, T, 1) - x = self.embed(x) # (B, T, C) - x = F.transpose(x, perm=[0, 2, 1]) # (B, C, T) - - # Residual & Skip-conenection & linears - z = self.resnet(x, condition) - - z = F.transpose(z, [0, 2, 1]) - z = F.relu(self.proj2(F.relu(self.proj1(z)))) - - y = self.proj3(z) - return y - - def start_sequence(self): - """Prepare the WaveNet to generate a new sequence. This method should be called before starting calling `add_input` multiple times. - """ - self.resnet.start_sequence() - - def add_input(self, x, condition=None): - """compute the output distribution (represented by its parameters) for a step. It works similarily with the `forward` method but in a `step-in-step-out` fashion. - - Args: - x (Variable): shape(B, T=1), dtype float32, a step of the input waveform. - condition (Variable, optional): shape(B, C_cond, T=1), dtype float32, a step of the upsampled condition. Defaults to None. - - Returns: - Variable: shape(B, T=1, C_output), dtype float32, the parameter of the output distributions. - """ - # Causal Conv - if self.loss_type == "softmax": - x = F.clip(x, min=-1., max=0.99999) - x = quantize(x, self.output_dim) - x = self.embed(x) # (B, T, C), T=1 - else: - x = F.unsqueeze(x, axes=[-1]) # (B, T, 1), T=1 - x = self.embed(x) # (B, T, C) - x = F.transpose(x, perm=[0, 2, 1]) - - # Residual & Skip-conenection & linears - z = self.resnet.add_input(x, condition) - z = F.transpose(z, [0, 2, 1]) - z = F.relu(self.proj2(F.relu(self.proj1(z)))) # (B, T, C) - - # Output - y = self.proj3(z) - return y - - def compute_softmax_loss(self, y, t): - """compute the loss where output distribution is a categorial distribution. - - Args: - y (Variable): shape(B, T, C_output), dtype float32, the logits of the output distribution. - t (Variable): shape(B, T), dtype float32, the target audio. Note that the target's corresponding time index is one step ahead of the output distribution. And output distribution whose input contains padding is neglected in loss computation. - - Returns: - Variable: shape(1, ), dtype float32, the loss. - """ - # context size is not taken into account - y = y[:, self.context_size:, :] - t = t[:, self.context_size:] - t = F.clip(t, min=-1.0, max=0.99999) - quantized = quantize(t, n_bands=self.output_dim) - label = F.unsqueeze(quantized, axes=[-1]) - - loss = F.softmax_with_cross_entropy(y, label) - reduced_loss = F.reduce_mean(loss) - return reduced_loss - - def sample_from_softmax(self, y): - """Sample from the output distribution where the output distribution is a categorical distriobution. - - Args: - y (Variable): shape(B, T, C_output), the logits of the output distribution - - Returns: - Variable: shape(B, T), waveform sampled from the output distribution. - """ - # dequantize - batch_size, time_steps, output_dim, = y.shape - y = F.reshape(y, (batch_size * time_steps, output_dim)) - prob = F.softmax(y) - quantized = F.sampling_id(prob) - samples = dequantize(quantized, n_bands=self.output_dim) - samples = F.reshape(samples, (batch_size, -1)) - return samples - - def compute_mog_loss(self, y, t): - """compute the loss where output distribution is a mixture of Gaussians. 
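# Editor's sketch (illustrative): the linear quantize/dequantize pair that the softmax
# output head above relies on, re-implemented as NumPy stand-ins (not the Parakeet
# functions themselves) to show the round trip.
import numpy as np

def quantize(values, n_bands):
    # map floats in [-1, 1) to integer bins in [0, n_bands)
    return ((values + 1.0) / 2.0 * n_bands).astype(np.int64)

def dequantize(quantized, n_bands):
    # map bin indices back to bin centers in [-1, 1)
    return (quantized.astype(np.float32) + 0.5) * (2.0 / n_bands) - 1.0

x = np.array([-1.0, -0.5, 0.0, 0.5, 0.99], dtype=np.float32)
bins = quantize(x, 256)            # e.g. -1.0 -> bin 0, 0.0 -> bin 128
x_hat = dequantize(bins, 256)      # reconstruction error is at most half a bin width
assert np.max(np.abs(x - x_hat)) <= 1.0 / 256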
- - Args: - y (Variable): shape(B, T, C_output), dtype float32, the parameterd of the output distribution. It is the concatenation of 3 parts, the logits of every distribution, the mean of each distribution and the log standard deviation of each distribution. Each part's shape is (B, T, n_mixture), where `n_mixture` means the number of Gaussians in the mixture. - t (Variable): shape(B, T), dtype float32, the target audio. Note that the target's corresponding time index is one step ahead of the output distribution. And output distribution whose input contains padding is neglected in loss computation. - - Returns: - Variable: shape(1, ), dtype float32, the loss. - """ - n_mixture = self.output_dim // 3 - - # context size is not taken in to account - y = y[:, self.context_size:, :] - t = t[:, self.context_size:] - - w, mu, log_std = F.split(y, 3, dim=2) - # 100.0 is just a large float - log_std = F.clip(log_std, min=self.log_scale_min, max=100.) - inv_std = F.exp(-log_std) - p_mixture = F.softmax(w, axis=-1) - - t = F.unsqueeze(t, axes=[-1]) - if n_mixture > 1: - # t = F.expand_as(t, log_std) - t = F.expand(t, [1, 1, n_mixture]) - - x_std = inv_std * (t - mu) - exponent = F.exp(-0.5 * x_std * x_std) - pdf_x = 1.0 / math.sqrt(2.0 * math.pi) * inv_std * exponent - - pdf_x = p_mixture * pdf_x - # pdf_x: [bs, len] - pdf_x = F.reduce_sum(pdf_x, dim=-1) - per_sample_loss = -F.log(pdf_x + 1e-9) - - loss = F.reduce_mean(per_sample_loss) - return loss - - def sample_from_mog(self, y): - """Sample from the output distribution where the output distribution is a mixture of Gaussians. - Args: - y (Variable): shape(B, T, C_output), dtype float32, the parameterd of the output distribution. It is the concatenation of 3 parts, the logits of every distribution, the mean of each distribution and the log standard deviation of each distribution. Each part's shape is (B, T, n_mixture), where `n_mixture` means the number of Gaussians in the mixture. - - Returns: - Variable: shape(B, T), waveform sampled from the output distribution. - """ - batch_size, time_steps, output_dim = y.shape - n_mixture = output_dim // 3 - - w, mu, log_std = F.split(y, 3, dim=-1) - - reshaped_w = F.reshape(w, (batch_size * time_steps, n_mixture)) - prob_ids = F.sampling_id(F.softmax(reshaped_w)) - prob_ids = F.reshape(prob_ids, (batch_size, time_steps)) - prob_ids = prob_ids.numpy() - - index = np.array([[[b, t, prob_ids[b, t]] for t in range(time_steps)] - for b in range(batch_size)]).astype("int32") - index_var = dg.to_variable(index) - - mu_ = F.gather_nd(mu, index_var) - log_std_ = F.gather_nd(log_std, index_var) - - dist = D.Normal(mu_, F.exp(log_std_)) - samples = dist.sample(shape=[]) - samples = F.clip(samples, min=-1., max=1.) - return samples - - def sample(self, y): - """Sample from the output distribution. - Args: - y (Variable): shape(B, T, C_output), dtype float32, the parameterd of the output distribution. - - Returns: - Variable: shape(B, T), waveform sampled from the output distribution. - """ - if self.loss_type == "softmax": - return self.sample_from_softmax(y) - else: - return self.sample_from_mog(y) - - def loss(self, y, t): - """compute the loss where output distribution is a mixture of Gaussians. - - Args: - y (Variable): shape(B, T, C_output), dtype float32, the parameterd of the output distribution. - t (Variable): shape(B, T), dtype float32, the target audio. Note that the target's corresponding time index is one step ahead of the output distribution. 
And output distribution whose input contains padding is neglected in loss computation. - - Returns: - Variable: shape(1, ), dtype float32, the loss. - """ - if self.loss_type == "softmax": - return self.compute_softmax_loss(y, t) - else: - return self.compute_mog_loss(y, t) diff --git a/parakeet/modules/__init__.py b/parakeet/modules/__init__.py index d964a59..9118340 100644 --- a/parakeet/modules/__init__.py +++ b/parakeet/modules/__init__.py @@ -12,5 +12,3 @@ # See the License for the specific language governing permissions and # limitations under the License. -from . import weight_norm -from .customized import * \ No newline at end of file diff --git a/parakeet/modules/attention.py b/parakeet/modules/attention.py new file mode 100644 index 0000000..d7053b4 --- /dev/null +++ b/parakeet/modules/attention.py @@ -0,0 +1,197 @@ +import math +import numpy as np +import paddle +from paddle import nn +from paddle.nn import functional as F + +def scaled_dot_product_attention(q, k, v, mask=None, dropout=0.0, training=True): + """ + scaled dot product attention with mask. Assume q, k, v all have the same + leader dimensions(denoted as * in descriptions below). Dropout is applied to + attention weights before weighted sum of values. + + Args: + q (Tensor): shape(*, T_q, d), the query tensor. + k (Tensor): shape(*, T_k, d), the key tensor. + v (Tensor): shape(*, T_k, d_v), the value tensor. + mask (Tensor, optional): shape(*, T_q, T_k) or broadcastable shape, the + mask tensor, 0 correspond to padding. Defaults to None. + + Returns: + (out, attn_weights) + out (Tensor): shape(*, T_q, d_v), the context vector. + attn_weights (Tensor): shape(*, T_q, T_k), the attention weights. + """ + d = q.shape[-1] # we only support imperative execution + qk = paddle.matmul(q, k, transpose_y=True) + scaled_logit = paddle.scale(qk, 1.0 / math.sqrt(d)) + + if mask is not None: + scaled_logit += paddle.scale((1.0 - mask), -1e9) # hard coded here + + attn_weights = F.softmax(scaled_logit, axis=-1) + attn_weights = F.dropout(attn_weights, dropout, training=training) + out = paddle.matmul(attn_weights, v) + return out, attn_weights + +def drop_head(x, drop_n_heads, training): + """ + Drop n heads from multiple context vectors. + + Args: + x (Tensor): shape(batch_size, num_heads, time_steps, channels), the input. + drop_n_heads (int): [description] + training ([type]): [description] + + Returns: + [type]: [description] + """ + if not training or (drop_n_heads == 0): + return x + + batch_size, num_heads, _, _ = x.shape + # drop all heads + if num_heads == drop_n_heads: + return paddle.zeros_like(x) + + mask = np.ones([batch_size, num_heads]) + mask[:, :drop_n_heads] = 0 + for subarray in mask: + np.random.shuffle(subarray) + scale = float(num_heads) / (num_heads - drop_n_heads) + mask = scale * np.reshape(mask, [batch_size, num_heads, 1, 1]) + out = x * paddle.to_tensor(mask) + return out + +def _split_heads(x, num_heads): + batch_size, time_steps, _ = x.shape + x = paddle.reshape(x, [batch_size, time_steps, num_heads, -1]) + x = paddle.transpose(x, [0, 2, 1, 3]) + return x + +def _concat_heads(x): + batch_size, _, time_steps, _ = x.shape + x = paddle.transpose(x, [0, 2, 1, 3]) + x = paddle.reshape(x, [batch_size, time_steps, -1]) + return x + +# Standard implementations of Monohead Attention & Multihead Attention +class MonoheadAttention(nn.Layer): + def __init__(self, model_dim, dropout=0.0, k_dim=None, v_dim=None): + """ + Monohead Attention module. + + Args: + model_dim (int): the feature size of query. 
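# Editor's sketch (illustrative usage of scaled_dot_product_attention defined above; the
# tensors and sizes are made up): a batch of 2 queries attending over 4 keys, with the
# last key marked as padding so it receives (near-)zero attention weight.
import paddle
from parakeet.modules.attention import scaled_dot_product_attention

q = paddle.randn([2, 3, 8])        # (B, T_q, d)
k = paddle.randn([2, 4, 8])        # (B, T_k, d)
v = paddle.randn([2, 4, 16])       # (B, T_k, d_v)
# mask broadcast over T_q; 0 marks padding (here: the last key/value position)
mask = paddle.concat([paddle.ones([2, 1, 3]), paddle.zeros([2, 1, 1])], axis=-1)

out, attn = scaled_dot_product_attention(q, k, v, mask)
print(out.shape)        # [2, 3, 16]
print(attn[:, :, -1])   # ~0 everywhere: the padded position is effectively ignored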
+ dropout (float, optional): dropout probability of scaled dot product + attention and final context vector. Defaults to 0.0. + k_dim (int, optional): feature size of the key of each scaled dot + product attention. If not provided, it is set to + model_dim / num_heads. Defaults to None. + v_dim (int, optional): feature size of the key of each scaled dot + product attention. If not provided, it is set to + model_dim / num_heads. Defaults to None. + """ + super(MonoheadAttention, self).__init__() + k_dim = k_dim or model_dim + v_dim = v_dim or model_dim + self.affine_q = nn.Linear(model_dim, k_dim) + self.affine_k = nn.Linear(model_dim, k_dim) + self.affine_v = nn.Linear(model_dim, v_dim) + self.affine_o = nn.Linear(v_dim, model_dim) + + self.model_dim = model_dim + self.dropout = dropout + + def forward(self, q, k, v, mask): + """ + Compute context vector and attention weights. + + Args: + q (Tensor): shape(batch_size, time_steps_q, model_dim), the queries. + k (Tensor): shape(batch_size, time_steps_k, model_dim), the keys. + v (Tensor): shape(batch_size, time_steps_k, model_dim), the values. + mask (Tensor): shape(batch_size, times_steps_q, time_steps_k) or + broadcastable shape, dtype: float32 or float64, the mask. + + Returns: + (out, attention_weights) + out (Tensor), shape(batch_size, time_steps_q, model_dim), the context vector. + attention_weights (Tensor): shape(batch_size, times_steps_q, time_steps_k), the attention weights. + """ + q = self.affine_q(q) # (B, T, C) + k = self.affine_k(k) + v = self.affine_v(v) + + context_vectors, attention_weights = scaled_dot_product_attention( + q, k, v, mask, self.dropout, self.training) + + out = self.affine_o(context_vectors) + return out, attention_weights + + +class MultiheadAttention(nn.Layer): + """ + Multihead scaled dot product attention. + """ + def __init__(self, model_dim, num_heads, dropout=0.0, k_dim=None, v_dim=None): + """ + Multihead Attention module. + + Args: + model_dim (int): the feature size of query. + num_heads (int): the number of attention heads. + dropout (float, optional): dropout probability of scaled dot product + attention and final context vector. Defaults to 0.0. + k_dim (int, optional): feature size of the key of each scaled dot + product attention. If not provided, it is set to + model_dim / num_heads. Defaults to None. + v_dim (int, optional): feature size of the key of each scaled dot + product attention. If not provided, it is set to + model_dim / num_heads. Defaults to None. + + Raises: + ValueError: if model_dim is not divisible by num_heads + """ + super(MultiheadAttention, self).__init__() + if model_dim % num_heads !=0: + raise ValueError("model_dim must be divisible by num_heads") + depth = model_dim // num_heads + k_dim = k_dim or depth + v_dim = v_dim or depth + self.affine_q = nn.Linear(model_dim, num_heads * k_dim) + self.affine_k = nn.Linear(model_dim, num_heads * k_dim) + self.affine_v = nn.Linear(model_dim, num_heads * v_dim) + self.affine_o = nn.Linear(num_heads * v_dim, model_dim) + + self.num_heads = num_heads + self.model_dim = model_dim + self.dropout = dropout + + def forward(self, q, k, v, mask): + """ + Compute context vector and attention weights. + + Args: + q (Tensor): shape(batch_size, time_steps_q, model_dim), the queries. + k (Tensor): shape(batch_size, time_steps_k, model_dim), the keys. + v (Tensor): shape(batch_size, time_steps_k, model_dim), the values. + mask (Tensor): shape(batch_size, times_steps_q, time_steps_k) or + broadcastable shape, dtype: float32 or float64, the mask. 
+ + Returns: + (out, attention_weights) + out (Tensor), shape(batch_size, time_steps_q, model_dim), the context vector. + attention_weights (Tensor): shape(batch_size, times_steps_q, time_steps_k), the attention weights. + """ + q = _split_heads(self.affine_q(q), self.num_heads) # (B, h, T, C) + k = _split_heads(self.affine_k(k), self.num_heads) + v = _split_heads(self.affine_v(v), self.num_heads) + mask = paddle.unsqueeze(mask, 1) # unsqueeze for the h dim + + context_vectors, attention_weights = scaled_dot_product_attention( + q, k, v, mask, self.dropout, self.training) + # NOTE: there is more sophisticated implementation: Scheduled DropHead + context_vectors = _concat_heads(context_vectors) # (B, T, h*C) + out = self.affine_o(context_vectors) + return out, attention_weights diff --git a/parakeet/modules/cbhg.py b/parakeet/modules/cbhg.py new file mode 100644 index 0000000..03bc108 --- /dev/null +++ b/parakeet/modules/cbhg.py @@ -0,0 +1,90 @@ +import math +import paddle +from paddle import nn +from paddle.nn import functional as F +from paddle.nn import initializer as I + +from parakeet.modules.conv import Conv1dBatchNorm + + +class Highway(nn.Layer): + def __init__(self, num_features): + super(Highway, self).__init__() + self.H = nn.Linear(num_features, num_features) + self.T = nn.Linear(num_features, num_features, + bias_attr=I.Constant(-1.)) + + self.num_features = num_features + + def forward(self, x): + H = F.relu(self.H(x)) + T = F.sigmoid(self.T(x)) # gate + return H * T + x * (1.0 - T) + + +class CBHG(nn.Layer): + def __init__(self, in_channels, out_channels_per_conv, max_kernel_size, + projection_channels, + num_highways, highway_features, + gru_features): + super(CBHG, self).__init__() + self.conv1d_banks = nn.LayerList( + [Conv1dBatchNorm(in_channels, out_channels_per_conv, (k,), + padding=((k - 1) // 2, k // 2)) + for k in range(1, 1 + max_kernel_size)]) + + self.projections = nn.LayerList() + projection_channels = list(projection_channels) + proj_in_channels = [max_kernel_size * + out_channels_per_conv] + projection_channels + proj_out_channels = projection_channels + \ + [in_channels] # ensure residual connection + for c_in, c_out in zip(proj_in_channels, proj_out_channels): + conv = nn.Conv1D(c_in, c_out, (3,), padding=(1, 1)) + self.projections.append(conv) + + if in_channels != highway_features: + self.pre_highway = nn.Linear(in_channels, highway_features) + + self.highways = nn.LayerList( + [Highway(highway_features) for _ in range(num_highways)]) + + self.gru = nn.GRU(highway_features, gru_features, + direction="bidirectional") + + self.in_channels = in_channels + self.out_channels_per_conv = out_channels_per_conv + self.max_kernel_size = max_kernel_size + self.num_projections = 1 + len(projection_channels) + self.num_highways = num_highways + self.highway_features = highway_features + self.gru_features = gru_features + + def forward(self, x): + input = x + + # conv banks + conv_outputs = [] + for conv in self.conv1d_banks: + conv_outputs.append(conv(x)) + x = F.relu(paddle.concat(conv_outputs, 1)) + + # max pool + x = F.max_pool1d(x, 2, stride=1, padding=(0, 1)) + + # conv1d projections + n_projections = len(self.projections) + for i, conv in enumerate(self.projections): + x = conv(x) + if i != n_projections: + x = F.relu(x) + x += input # residual connection + + # highway + x = paddle.transpose(x, [0, 2, 1]) + if hasattr(self, "pre_highway"): + x = self.pre_highway(x) + + # gru + x, _ = self.gru(x) + return x diff --git a/parakeet/modules/connections.py 
b/parakeet/modules/connections.py new file mode 100644 index 0000000..1186b18 --- /dev/null +++ b/parakeet/modules/connections.py @@ -0,0 +1,62 @@ +import paddle +from paddle import nn +from paddle.nn import functional as F + +def residual_connection(input, layer): + """residual connection, only used for single input-single output layer. + y = x + F(x) where F corresponds to the layer. + + Args: + x (Tensor): the input tensor. + layer (callable): a callable that preserve tensor shape. + """ + return input + layer(input) + +class ResidualWrapper(nn.Layer): + def __init__(self, layer): + super(ResidualWrapper, self).__init__() + self.layer = layer + + def forward(self, x): + return residual_connection(x, self.layer) + + +class PreLayerNormWrapper(nn.Layer): + def __init__(self, layer, d_model): + super(PreLayerNormWrapper, self).__init__() + self.layer = layer + self.layer_norm = nn.LayerNorm([d_model], epsilon=1e-6) + + def forward(self, x): + return x + self.layer(self.layer_norm(x)) + + +class PostLayerNormWrapper(nn.Layer): + def __init__(self, layer, d_model): + super(PostLayerNormWrapper, self).__init__() + self.layer = layer + self.layer_norm = nn.LayerNorm([d_model], epsilon=1e-6) + + def forward(self, x): + return self.layer_norm(x + self.layer(x)) + + +def context_gate(input, axis): + """sigmoid gate the content by gate. + + Args: + input (Tensor): shape(*, d_axis, *), the input, treated as content & gate. + axis (int): the axis to chunk content and gate. + + Raises: + ValueError: if input.shape[axis] is not even. + + Returns: + Tensor: shape(*, d_axis / 2 , *), the gated content. + """ + size = input.shape[axis] + if size % 2 != 0: + raise ValueError("the size of the {}-th dimension of input should " + "be even, but received {}".format(axis, size)) + content, gate = paddle.chunk(input, 2, axis) + return F.sigmoid(gate) * content diff --git a/parakeet/modules/conv.py b/parakeet/modules/conv.py new file mode 100644 index 0000000..698cda2 --- /dev/null +++ b/parakeet/modules/conv.py @@ -0,0 +1,101 @@ +import paddle +from paddle import nn + +class Conv1dCell(nn.Conv1D): + """ + A subclass of Conv1d layer, which can be used like an RNN cell. It can take + step input and return step output. It is done by keeping an internal buffer, + when adding a step input, we shift the buffer and return a step output. For + single step case, convolution devolves to a linear transformation. + + That it can be used as a cell depends on several restrictions: + 1. stride must be 1; + 2. padding must be an asymmetric padding (recpetive_field - 1, 0). + + As a result, these arguments are removed form the initializer. 
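# Editor's sketch (illustrative, made-up sizes): wrapping the same sublayer with the
# pre-LN and post-LN residual wrappers from connections.py above.
import paddle
from paddle import nn
from parakeet.modules.connections import PreLayerNormWrapper, PostLayerNormWrapper

d_model = 16
sublayer = nn.Linear(d_model, d_model)

pre_ln = PreLayerNormWrapper(sublayer, d_model)    # y = x + f(LayerNorm(x))
post_ln = PostLayerNormWrapper(sublayer, d_model)  # y = LayerNorm(x + f(x))

x = paddle.randn([4, 10, d_model])
print(pre_ln(x).shape, post_ln(x).shape)           # both preserve (4, 10, 16)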
+ """ + def __init__(self, + in_channels, + out_channels, + kernel_size, + dilation=1, + weight_attr=None, + bias_attr=None): + _dilation = dilation[0] if isinstance(dilation, (tuple, list)) else dilation + _kernel_size = kernel_size[0] if isinstance(kernel_size, (tuple, list)) else kernel_size + self._r = 1 + (_kernel_size - 1) * _dilation + super(Conv1dCell, self).__init__( + in_channels, + out_channels, + kernel_size, + padding=(self._r - 1, 0), + dilation=dilation, + weight_attr=weight_attr, + bias_attr=bias_attr, + data_format="NCL") + + @property + def receptive_field(self): + return self._r + + def start_sequence(self): + if self.training: + raise Exception("only use start_sequence in evaluation") + self._buffer = None + self._reshaped_weight = paddle.reshape( + self.weight, (self._out_channels, -1)) + + def initialize_buffer(self, x_t): + batch_size, _ = x_t.shape + self._buffer = paddle.zeros( + (batch_size, self._in_channels, self.receptive_field), + dtype=x_t.dtype) + + def update_buffer(self, x_t): + self._buffer = paddle.concat( + [self._buffer[:, :, 1:], paddle.unsqueeze(x_t, -1)], -1) + + def add_input(self, x_t): + """ + Arguments: + x_t (Tensor): shape (batch_size, in_channels), step input. + Rerurns: + y_t (Tensor): shape (batch_size, out_channels), step output. + """ + batch_size = x_t.shape[0] + if self.receptive_field > 1: + if self._buffer is None: + self.initialize_buffer(x_t) + + # update buffer + self.update_buffer(x_t) + if self._dilation[0] > 1: + input = self._buffer[:, :, ::self._dilation[0]] + else: + input = self._buffer + input = paddle.reshape(input, (batch_size, -1)) + else: + input = x_t + y_t = paddle.matmul(input, self._reshaped_weight, transpose_y=True) + y_t = y_t + self.bias + return y_t + + +class Conv1dBatchNorm(nn.Layer): + def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, + weight_attr=None, bias_attr=None, data_format="NCL"): + super(Conv1dBatchNorm, self).__init__() + # TODO(chenfeiyu): carefully initialize Conv1d's weight + self.conv = nn.Conv1D(in_channels, out_channels, kernel_size, stride, + padding=padding, + weight_attr=weight_attr, + bias_attr=bias_attr, + data_format=data_format) + # TODO: channel last, but BatchNorm1d does not support channel last layout + self.bn = nn.BatchNorm1D(out_channels, momentum=0.99, epsilon=1e-3, data_format=data_format) + + def forward(self, x): + x = self.conv(x) + x = self.bn(x) + return x + diff --git a/parakeet/modules/customized.py b/parakeet/modules/customized.py deleted file mode 100644 index 84ca68c..0000000 --- a/parakeet/modules/customized.py +++ /dev/null @@ -1,272 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from paddle import fluid -import paddle.fluid.layers as F -import paddle.fluid.dygraph as dg - - -class Pool1D(dg.Layer): - """ - A Pool 1D block implemented with Pool2D. 
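# Editor's sketch (illustrative, made-up shapes): Conv1dCell defined above used both as an
# ordinary causal convolution over a whole utterance and as a step-in-step-out cell, the
# way the autoregressive decoders in this repo use it. The two paths should agree.
import paddle
from parakeet.modules.conv import Conv1dCell

cell = Conv1dCell(in_channels=4, out_channels=6, kernel_size=3, dilation=2)
cell.eval()                        # start_sequence() refuses to run in training mode

x = paddle.randn([2, 4, 10])       # (B, C_in, T)
y_full = cell(x)                   # (B, C_out, T); causal padding keeps T unchanged

cell.start_sequence()
steps = []
for t in range(x.shape[-1]):
    steps.append(cell.add_input(x[:, :, t]))     # (B, C_in) in, (B, C_out) out
y_step = paddle.stack(steps, axis=-1)            # back to (B, C_out, T)

print(bool(paddle.allclose(y_full, y_step, atol=1e-5)))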
- """ - - def __init__(self, - pool_size=-1, - pool_type='max', - pool_stride=1, - pool_padding=0, - global_pooling=False, - use_cudnn=True, - ceil_mode=False, - exclusive=True, - data_format='NCT'): - super(Pool1D, self).__init__() - self.pool_size = pool_size - self.pool_type = pool_type - self.pool_stride = pool_stride - self.pool_padding = pool_padding - self.global_pooling = global_pooling - self.use_cudnn = use_cudnn - self.ceil_mode = ceil_mode - self.exclusive = exclusive - self.data_format = data_format - - self.pool2d = dg.Pool2D( - [1, pool_size], - pool_type=pool_type, - pool_stride=[1, pool_stride], - pool_padding=[0, pool_padding], - global_pooling=global_pooling, - use_cudnn=use_cudnn, - ceil_mode=ceil_mode, - exclusive=exclusive) - - def forward(self, x): - """ - Args: - x (Variable): Shape(B, C_in, 1, T), the input, where C_in means - input channels. - Returns: - x (Variable): Shape(B, C_out, 1, T), the outputs, where C_out means - output channels (num_filters). - """ - if self.data_format == 'NTC': - x = fluid.layers.transpose(x, [0, 2, 1]) - x = fluid.layers.unsqueeze(x, [2]) - x = self.pool2d(x) - x = fluid.layers.squeeze(x, [2]) - if self.data_format == 'NTC': - x = fluid.layers.transpose(x, [0, 2, 1]) - return x - - -class Conv1D(dg.Conv2D): - """A standard Conv1D layer that use (B, C, T) data layout. It inherit Conv2D and - use (B, C, 1, T) data layout to compute 1D convolution. Nothing more. - NOTE: we inherit Conv2D instead of encapsulate a Conv2D layer to make it a simple - layer, instead of a complex one. So we can easily apply weight norm to it. - """ - - def __init__(self, - num_channels, - num_filters, - filter_size, - stride=1, - padding=0, - dilation=1, - groups=1, - param_attr=None, - bias_attr=None, - use_cudnn=True, - act=None, - dtype='float32'): - super(Conv1D, self).__init__( - num_channels, - num_filters, (1, filter_size), - stride=(1, stride), - padding=(0, padding), - dilation=(1, dilation), - groups=groups, - param_attr=param_attr, - bias_attr=bias_attr, - use_cudnn=use_cudnn, - act=act, - dtype=dtype) - - def forward(self, x): - """Compute Conv1D by unsqueeze the input and squeeze the output. - - Args: - x (Variable): shape(B, C_in, T_in), dtype float32, input of Conv1D. - - Returns: - Variable: shape(B, C_out, T_out), dtype float32, output of Conv1D. - """ - x = F.unsqueeze(x, [2]) - x = super(Conv1D, self).forward(x) # maybe risky here - x = F.squeeze(x, [2]) - return x - - -class Conv1DTranspose(dg.Conv2DTranspose): - def __init__(self, - num_channels, - num_filters, - filter_size, - padding=0, - stride=1, - dilation=1, - groups=1, - param_attr=None, - bias_attr=None, - use_cudnn=True, - act=None, - dtype='float32'): - super(Conv1DTranspose, self).__init__( - num_channels, - num_filters, (1, filter_size), - output_size=None, - padding=(0, padding), - stride=(1, stride), - dilation=(1, dilation), - groups=groups, - param_attr=param_attr, - bias_attr=bias_attr, - use_cudnn=use_cudnn, - act=act, - dtype=dtype) - - def forward(self, x): - """Compute Conv1DTranspose by unsqueeze the input and squeeze the output. - - Args: - x (Variable): shape(B, C_in, T_in), dtype float32, input of Conv1DTranspose. - - Returns: - Variable: shape(B, C_out, T_out), dtype float32, output of Conv1DTranspose. - """ - x = F.unsqueeze(x, [2]) - x = super(Conv1DTranspose, self).forward(x) # maybe risky here - x = F.squeeze(x, [2]) - return x - - -class Conv1DCell(Conv1D): - """A causal convolve-1d cell. It uses causal padding, padding(receptive_field -1, 0). 
- But Conv2D in dygraph does not support asymmetric padding yet, we just pad - (receptive_field -1, receptive_field -1) and drop last receptive_field -1 steps in - the output. - - It is a cell that it acts like an RNN cell. It does not support stride > 1, and it - ensures 1-to-1 mapping from input time steps to output timesteps. - """ - - def __init__(self, - num_channels, - num_filters, - filter_size, - dilation=1, - causal=False, - groups=1, - param_attr=None, - bias_attr=None, - use_cudnn=True, - act=None, - dtype='float32'): - receptive_field = 1 + dilation * (filter_size - 1) - padding = receptive_field - 1 if causal else receptive_field // 2 - self._receptive_field = receptive_field - self.causal = causal - super(Conv1DCell, self).__init__( - num_channels, - num_filters, - filter_size, - stride=1, - padding=padding, - dilation=dilation, - groups=groups, - param_attr=param_attr, - bias_attr=bias_attr, - use_cudnn=use_cudnn, - act=act, - dtype=dtype) - - def forward(self, x): - """Compute Conv1D by unsqueeze the input and squeeze the output. - - Args: - x (Variable): shape(B, C_in, T), dtype float32, input of Conv1D. - - Returns: - Variable: shape(B, C_out, T), dtype float32, output of Conv1D. - """ - # it ensures that ouput time steps == input time steps - time_steps = x.shape[-1] - x = super(Conv1DCell, self).forward(x) - if x.shape[-1] != time_steps: - x = x[:, :, :time_steps] - return x - - @property - def receptive_field(self): - return self._receptive_field - - def start_sequence(self): - """Prepare the Conv1DCell to generate a new sequence, this method should be called before calling add_input multiple times. - - WARNING: - This method accesses `self.weight` directly. If a `Conv1DCell` object is wrapped in a `WeightNormWrapper`, make sure this method is called only after the `WeightNormWrapper`'s hook is called. - `WeightNormWrapper` removes the wrapped layer's `weight`, add has a `weight_v` and `weight_g` to re-compute the wrapped layer's weight as $weight = weight_g * weight_v / ||weight_v||$. (Recomputing the `weight` is a hook before calling the wrapped layer's `forward` method.) - Whenever a `WeightNormWrapper`'s `forward` method is called, the wrapped layer's weight is updated. But when loading from a checkpoint, `weight_v` and `weight_g` are updated but the wrapped layer's weight is not, since it is no longer a `Parameter`. You should manually call `remove_weight_norm` or `hook` to re-compute the wrapped layer's weight before calling this method if you don't call `forward` first. - So when loading a model which uses `Conv1DCell` objects wrapped in `WeightNormWrapper`s, remember to call `remove_weight_norm` for all `WeightNormWrapper`s before synthesizing. Also, removing weight norm speeds up computation. - """ - if not self.causal: - raise ValueError( - "Only causal conv1d shell should use start sequence") - if self.receptive_field == 1: - raise ValueError( - "Convolution block with receptive field = 1 does not need" - " to be implemented as a Conv1DCell. Conv1D suffices") - self._buffer = None - self._reshaped_weight = F.reshape(self.weight, (self._num_filters, -1)) - - def add_input(self, x_t): - """This method works similarily with forward but in a `step-in-step-out` fashion. - - Args: - x (Variable): shape(B, C_in, T=1), dtype float32, input of Conv1D. - - Returns: - Variable: shape(B, C_out, T=1), dtype float32, output of Conv1D. 
- """ - batch_size, c_in, _ = x_t.shape - if self._buffer is None: - self._buffer = F.zeros( - (batch_size, c_in, self.receptive_field), dtype=x_t.dtype) - self._buffer = F.concat([self._buffer[:, :, 1:], x_t], -1) - if self._dilation[1] > 1: - input = F.strided_slice( - self._buffer, - axes=[2], - starts=[0], - ends=[self.receptive_field], - strides=[self._dilation[1]]) - else: - input = self._buffer - input = F.reshape(input, (batch_size, -1)) - y_t = F.matmul(input, self._reshaped_weight, transpose_y=True) - y_t = y_t + self.bias - y_t = F.unsqueeze(y_t, [-1]) - return y_t diff --git a/parakeet/modules/dynamic_gru.py b/parakeet/modules/dynamic_gru.py deleted file mode 100644 index b944b92..0000000 --- a/parakeet/modules/dynamic_gru.py +++ /dev/null @@ -1,64 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import paddle.fluid.dygraph as dg -import paddle.fluid.layers as layers - - -class DynamicGRU(dg.Layer): - def __init__(self, - size, - param_attr=None, - bias_attr=None, - is_reverse=False, - gate_activation='sigmoid', - candidate_activation='tanh', - h_0=None, - origin_mode=False, - init_size=None): - super(DynamicGRU, self).__init__() - self.gru_unit = dg.GRUUnit( - size * 3, - param_attr=param_attr, - bias_attr=bias_attr, - activation=candidate_activation, - gate_activation=gate_activation, - origin_mode=origin_mode) - self.size = size - self.h_0 = h_0 - self.is_reverse = is_reverse - - def forward(self, inputs): - """ - Dynamic GRU block. - - Args: - input (Variable): shape(B, T, C), dtype float32, the input value. - - Returns: - output (Variable): shape(B, T, C), the result compute by GRU. - """ - hidden = self.h_0 - res = [] - for i in range(inputs.shape[1]): - if self.is_reverse: - i = inputs.shape[1] - 1 - i - input_ = inputs[:, i:i + 1, :] - input_ = layers.reshape(input_, [-1, input_.shape[2]]) - hidden, reset, gate = self.gru_unit(input_, hidden) - hidden_ = layers.reshape(hidden, [-1, 1, hidden.shape[1]]) - res.append(hidden_) - if self.is_reverse: - res = res[::-1] - res = layers.concat(res, axis=1) - return res diff --git a/parakeet/modules/ffn.py b/parakeet/modules/ffn.py deleted file mode 100644 index bf68c1c..0000000 --- a/parakeet/modules/ffn.py +++ /dev/null @@ -1,93 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
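# Editor's sketch (illustrative, NumPy only): the buffer readout that both add_input
# implementations rely on (the removed fluid Conv1DCell above and the new Conv1dCell).
# With a buffer of length receptive_field, taking every `dilation`-th sample yields
# exactly the kernel_size taps a dilated causal convolution would see at this step.
import numpy as np

kernel_size, dilation = 3, 4
receptive_field = 1 + (kernel_size - 1) * dilation   # = 9
buffer = np.arange(receptive_field)                  # last 9 input samples, oldest first
taps = buffer[::dilation]                            # -> [0, 4, 8]
assert len(taps) == kernel_size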
-import paddle.fluid.dygraph as dg -import paddle.fluid.layers as layers -import paddle.fluid as fluid -import math -from parakeet.modules.customized import Conv1D - - -class PositionwiseFeedForward(dg.Layer): - def __init__(self, - d_in, - num_hidden, - filter_size, - padding=0, - use_cudnn=True, - dropout=0.1): - """A two-feed-forward-layer module. - - Args: - d_in (int): the size of input channel. - num_hidden (int): the size of hidden layer in network. - filter_size (int): the filter size of Conv - padding (int, optional): the padding size of Conv. Defaults to 0. - use_cudnn (bool, optional): use cudnn in Conv or not. Defaults to True. - dropout (float, optional): dropout probability. Defaults to 0.1. - """ - super(PositionwiseFeedForward, self).__init__() - self.num_hidden = num_hidden - self.use_cudnn = use_cudnn - self.dropout = dropout - - k = math.sqrt(1.0 / d_in) - self.w_1 = Conv1D( - num_channels=d_in, - num_filters=num_hidden, - filter_size=filter_size, - padding=padding, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k)), - use_cudnn=use_cudnn) - k = math.sqrt(1.0 / num_hidden) - self.w_2 = Conv1D( - num_channels=num_hidden, - num_filters=d_in, - filter_size=filter_size, - padding=padding, - param_attr=fluid.ParamAttr( - initializer=fluid.initializer.XavierInitializer()), - bias_attr=fluid.ParamAttr(initializer=fluid.initializer.Uniform( - low=-k, high=k)), - use_cudnn=use_cudnn) - self.layer_norm = dg.LayerNorm(d_in) - - def forward(self, input): - """ - Compute feed forward network result. - - Args: - input (Variable): shape(B, T, C), dtype float32, the input value. - - Returns: - output (Variable): shape(B, T, C), the result after FFN. - """ - x = layers.transpose(input, [0, 2, 1]) - #FFN Networt - x = self.w_2(layers.relu(self.w_1(x))) - - # dropout - x = layers.dropout( - x, self.dropout, dropout_implementation='upscale_in_train') - - x = layers.transpose(x, [0, 2, 1]) - # residual connection - x = x + input - - #layer normalization - output = self.layer_norm(x) - - return output diff --git a/parakeet/modules/geometry.py b/parakeet/modules/geometry.py new file mode 100644 index 0000000..861aaf3 --- /dev/null +++ b/parakeet/modules/geometry.py @@ -0,0 +1,29 @@ +import numpy as np +import paddle + +def shuffle_dim(x, axis, perm=None): + """Permute input tensor along aixs given the permutation or randomly. + + Args: + x (Tensor): shape(*, d_{axis}, *), the input tensor. + axis (int): the axis to shuffle. + perm (list[int], ndarray, optional): a permutation of [0, d_{axis}), + the order to reorder the tensor along the `axis`-th dimension, if + not provided, randomly shuffle the `axis`-th dimension. Defaults to + None. + + Returns: + Tensor: the shuffled tensor, it has the same shape as x does. + """ + size = x.shape[axis] + if perm is not None and len(perm) != size: + raise ValueError("length of permutation should equals the input " + "tensor's axis-th dimension's size") + if perm is not None: + perm = np.array(perm) + else: + perm = np.random.permutation(size) + + perm = paddle.to_tensor(perm) + out = paddle.gather(x, perm, axis) + return out diff --git a/parakeet/modules/loss.py b/parakeet/modules/loss.py deleted file mode 100644 index 96bcd3b..0000000 --- a/parakeet/modules/loss.py +++ /dev/null @@ -1,158 +0,0 @@ -# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved. 
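# Editor's sketch (illustrative usage of shuffle_dim from geometry.py above), with an
# explicit permutation so the result is reproducible.
import paddle
from parakeet.modules.geometry import shuffle_dim

x = paddle.to_tensor([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
y = shuffle_dim(x, axis=1, perm=[2, 0, 1])           # reorder the columns
print(y.numpy())    # [[3. 1. 2.], [6. 4. 5.]]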
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -import numpy as np -from numba import jit - -from paddle import fluid -import paddle.fluid.dygraph as dg - - -def masked_mean(inputs, mask): - """ - Args: - inputs (Variable): Shape(B, C, 1, T), the input, where B means - batch size, C means channels of input, T means timesteps of - the input. - mask (Variable): Shape(B, T), a mask. - Returns: - loss (Variable): Shape(1, ), masked mean. - """ - channels = inputs.shape[1] - reshaped_mask = fluid.layers.reshape( - mask, shape=[mask.shape[0], 1, 1, mask.shape[-1]]) - expanded_mask = fluid.layers.expand( - reshaped_mask, expand_times=[1, channels, 1, 1]) - expanded_mask.stop_gradient = True - - valid_cnt = fluid.layers.reduce_sum(expanded_mask) - valid_cnt.stop_gradient = True - - masked_inputs = inputs * expanded_mask - loss = fluid.layers.reduce_sum(masked_inputs) / valid_cnt - return loss - - -@jit(nopython=True) -def guided_attention(N, max_N, T, max_T, g): - W = np.zeros((max_N, max_T), dtype=np.float32) - for n in range(N): - for t in range(T): - W[n, t] = 1 - np.exp(-(n / N - t / T)**2 / (2 * g * g)) - return W - - -def guided_attentions(input_lengths, target_lengths, max_target_len, g=0.2): - B = len(input_lengths) - max_input_len = input_lengths.max() - W = np.zeros((B, max_target_len, max_input_len), dtype=np.float32) - for b in range(B): - W[b] = guided_attention(input_lengths[b], max_input_len, - target_lengths[b], max_target_len, g).T - return W - - -class TTSLoss(object): - def __init__(self, - masked_weight=0.0, - priority_weight=0.0, - binary_divergence_weight=0.0, - guided_attention_sigma=0.2): - self.masked_weight = masked_weight - self.priority_weight = priority_weight - self.binary_divergence_weight = binary_divergence_weight - self.guided_attention_sigma = guided_attention_sigma - - def l1_loss(self, prediction, target, mask, priority_bin=None): - abs_diff = fluid.layers.abs(prediction - target) - - # basic mask-weighted l1 loss - w = self.masked_weight - if w > 0 and mask is not None: - base_l1_loss = w * masked_mean(abs_diff, mask) + ( - 1 - w) * fluid.layers.reduce_mean(abs_diff) - else: - base_l1_loss = fluid.layers.reduce_mean(abs_diff) - - if self.priority_weight > 0 and priority_bin is not None: - # mask-weighted priority channels' l1-loss - priority_abs_diff = fluid.layers.slice( - abs_diff, axes=[1], starts=[0], ends=[priority_bin]) - if w > 0 and mask is not None: - priority_loss = w * masked_mean(priority_abs_diff, mask) + ( - 1 - w) * fluid.layers.reduce_mean(priority_abs_diff) - else: - priority_loss = fluid.layers.reduce_mean(priority_abs_diff) - - # priority weighted sum - p = self.priority_weight - loss = p * priority_loss + (1 - p) * base_l1_loss - else: - loss = base_l1_loss - return loss - - def binary_divergence(self, prediction, target, mask): - flattened_prediction = fluid.layers.reshape(prediction, [-1, 1]) - flattened_target = 
fluid.layers.reshape(target, [-1, 1]) - flattened_loss = fluid.layers.log_loss( - flattened_prediction, flattened_target, epsilon=1e-8) - bin_div = fluid.layers.reshape(flattened_loss, prediction.shape) - - w = self.masked_weight - if w > 0 and mask is not None: - loss = w * masked_mean(bin_div, mask) + ( - 1 - w) * fluid.layers.reduce_mean(bin_div) - else: - loss = fluid.layers.reduce_mean(bin_div) - return loss - - @staticmethod - def done_loss(done_hat, done): - flat_done_hat = fluid.layers.reshape(done_hat, [-1, 1]) - flat_done = fluid.layers.reshape(done, [-1, 1]) - loss = fluid.layers.log_loss(flat_done_hat, flat_done, epsilon=1e-8) - loss = fluid.layers.reduce_mean(loss) - return loss - - def attention_loss(self, predicted_attention, input_lengths, - target_lengths): - """ - Given valid encoder_lengths and decoder_lengths, compute a diagonal - guide, and compute loss from the predicted attention and the guide. - - Args: - predicted_attention (Variable): Shape(*, B, T_dec, T_enc), the - alignment tensor, where B means batch size, T_dec means number - of time steps of the decoder, T_enc means the number of time - steps of the encoder, * means other possible dimensions. - input_lengths (numpy.ndarray): Shape(B,), dtype:int64, valid lengths - (time steps) of encoder outputs. - target_lengths (numpy.ndarray): Shape(batch_size,), dtype:int64, - valid lengths (time steps) of decoder outputs. - - Returns: - loss (Variable): Shape(1, ) attention loss. - """ - n_attention, batch_size, max_target_len, max_input_len = ( - predicted_attention.shape) - soft_mask = guided_attentions(input_lengths, target_lengths, - max_target_len, - self.guided_attention_sigma) - soft_mask_ = dg.to_variable(soft_mask) - loss = fluid.layers.reduce_mean(predicted_attention * soft_mask_) - return loss diff --git a/parakeet/modules/losses.py b/parakeet/modules/losses.py new file mode 100644 index 0000000..e7187a8 --- /dev/null +++ b/parakeet/modules/losses.py @@ -0,0 +1,24 @@ +import paddle +from paddle import nn +from paddle.nn import functional as F + +def weighted_mean(input, weight): + """weighted mean.(It can also be used as masked mean.) + + Args: + input (Tensor): input tensor, floating point dtype. + weight (Tensor): weight tensor with broadcastable shape. + + Returns: + Tensor: shape(1,), weighted mean tensor with the same dtype as input. + """ + weight = paddle.cast(weight, input.dtype) + return paddle.mean(input * weight) + +def masked_l1_loss(prediction, target, mask): + abs_error = F.l1_loss(prediction, target, reduction='none') + return weighted_mean(abs_error, mask) + +def masked_softmax_with_cross_entropy(logits, label, mask, axis=-1): + ce = F.softmax_with_cross_entropy(logits, label, axis=axis) + return weighted_mean(ce, mask) diff --git a/parakeet/modules/masking.py b/parakeet/modules/masking.py new file mode 100644 index 0000000..dc282c2 --- /dev/null +++ b/parakeet/modules/masking.py @@ -0,0 +1,32 @@ +import paddle +from paddle.fluid.layers import sequence_mask + +def id_mask(input, padding_index=0, dtype="bool"): + return paddle.cast(input != padding_index, dtype) + +def feature_mask(input, axis, dtype="bool"): + feature_sum = paddle.sum(paddle.abs(input), axis) + return paddle.cast(feature_sum != 0, dtype) + +def combine_mask(padding_mask, no_future_mask): + """ + Combine the padding mask and no future mask for transformer decoder. + Padding mask is used to mask padding positions and no future mask is used + to prevent the decoder to see future information. 
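# Editor's sketch (illustrative, made-up token ids): building the two masks discussed
# above for a transformer decoder batch. 0 is treated as the padding index, as in id_mask.
import paddle
from parakeet.modules.masking import id_mask

tokens = paddle.to_tensor([[5, 7, 2, 0], [3, 0, 0, 0]])            # (B=2, T=4), 0 = padding
padding_mask = id_mask(tokens, padding_index=0, dtype="float32")   # (B, T), 1 for real tokens

T = tokens.shape[1]
no_future = paddle.tril(paddle.ones([T, T]))                       # (T, T) lower-triangular

# combined decoder self-attention mask: entry (i, j) is 1 only if j <= i
# and token j is not padding
combined = paddle.unsqueeze(padding_mask, 1) * no_future           # (B, T, T) by broadcasting
print(combined.shape)   # [2, 4, 4]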
+ + Args: + padding_mask (Tensor): shape(batch_size, time_steps), dtype: float32 or float64, decoder padding mask. + no_future_mask (Tensor): shape(time_steps, time_steps), dtype: float32 or float64, no future mask. + + Returns: + Tensor: shape(batch_size, time_steps, time_steps), combined mask. + """ + # TODO: to support boolean mask by using logical_and? + if padding_mask.dtype == paddle.fluid.core.VarDesc.VarType.BOOL: + return paddle.logical_and(padding_mask, no_future_mask) + else: + return padding_mask * no_future_mask + +def future_mask(time_steps, dtype="bool"): + mask = paddle.tril(paddle.ones([time_steps, time_steps])) + return paddle.cast(mask, dtype) diff --git a/parakeet/modules/positional_encoding.py b/parakeet/modules/positional_encoding.py new file mode 100644 index 0000000..5d862ff --- /dev/null +++ b/parakeet/modules/positional_encoding.py @@ -0,0 +1,32 @@ +import math +import numpy as np +import paddle +from paddle.nn import functional as F + + +def positional_encoding(start_index, length, size, dtype=None): + """ + Generate standard positional encoding. + + pe(pos, 2i) = sin(pos / 10000 ** (2i / size)) + pe(pos, 2i+1) = cos(pos / 10000 ** (2i / size)) + + Args: + start_index (int): the start index. + length (int): the length of the positional encoding. + size (int): positional encoding dimension. + + Returns: + encodings (Tensor): shape(length, size), the positional encoding. + """ + if (size % 2 != 0): + raise ValueError("size should be divisible by 2") + dtype = dtype or paddle.get_default_dtype() + channel = np.arange(0, size, 2) + index = np.arange(start_index, start_index + length, 1) + p = np.expand_dims(index, -1) / (10000 ** (channel / float(size))) + encodings = np.zeros([length, size]) + encodings[:, 0::2] = np.sin(p) + encodings[:, 1::2] = np.cos(p) + encodings = paddle.to_tensor(encodings) + return encodings diff --git a/parakeet/modules/stft.py b/parakeet/modules/stft.py new file mode 100644 index 0000000..56cbec1 --- /dev/null +++ b/parakeet/modules/stft.py @@ -0,0 +1,93 @@ +import paddle +from paddle import nn +from paddle.nn import functional as F +from scipy import signal +import numpy as np + +class STFT(nn.Layer): + def __init__(self, n_fft, hop_length, win_length, window="hanning"): + """A module for computing differentiable stft transform. See `librosa.stft` for more details. + + Args: + n_fft (int): number of samples in a frame. + hop_length (int): number of samples shifted between adjacent frames. + win_length (int): length of the window function. + window (str, optional): name of window function, see `scipy.signal.get_window` for more details. Defaults to "hanning". + """ + super(STFT, self).__init__() + self.hop_length = hop_length + self.n_bin = 1 + n_fft // 2 + self.n_fft = n_fft + + # calculate window + window = signal.get_window(window, win_length) + if n_fft != win_length: + pad = (n_fft - win_length) // 2 + window = np.pad(window, ((pad, pad), ), 'constant') + + # calculate weights + r = np.arange(0, n_fft) + M = np.expand_dims(r, -1) * np.expand_dims(r, 0) + w_real = np.reshape(window * + np.cos(2 * np.pi * M / n_fft)[:self.n_bin], + (self.n_bin, 1, 1, self.n_fft)) + w_imag = np.reshape(window * + np.sin(-2 * np.pi * M / n_fft)[:self.n_bin], + (self.n_bin, 1, 1, self.n_fft)) + + w = np.concatenate([w_real, w_imag], axis=0) + self.weight = paddle.cast(paddle.to_tensor(w), paddle.get_default_dtype()) + + def forward(self, x): + """Compute the stft transform. + + Args: + x (Variable): shape(B, T), dtype flaot32, the input waveform. 
+ + Returns: + (real, imag) + real (Variable): shape(B, C, 1, T), dtype float32, the real part of the spectrogram. (C = 1 + n_fft // 2) + imag (Variable): shape(B, C, 1, T), dtype float32, the imaginary part of the spectrogram. (C = 1 + n_fft // 2) + """ + # x(batch_size, time_steps) + # pad it first with reflect mode + # TODO(chenfeiyu): report an issue on paddle.flip + pad_start = paddle.reverse(x[:, 1:1 + self.n_fft // 2], axis=[1]) + pad_stop = paddle.reverse(x[:, -(1 + self.n_fft // 2):-1], axis=[1]) + x = paddle.concat([pad_start, x, pad_stop], axis=-1) + + # to BC1T, C=1 + x = paddle.unsqueeze(x, axis=[1, 2]) + out = F.conv2d(x, self.weight, stride=(1, self.hop_length)) + real, imag = paddle.chunk(out, 2, axis=1) # BC1T + return real, imag + + def power(self, x): + """Compute the power spectrogram. + + Args: + x (Variable): shape(B, T), dtype float32, the input waveform. + + Returns: + Variable: shape(B, C, 1, T), dtype float32, the power spectrogram. + """ + real, imag = self(x) + power = real**2 + imag**2 + return power + + def magnitude(self, x): + """Compute the magnitude spectrogram. + + Args: + x (Variable): shape(B, T), dtype float32, the input waveform. + + Returns: + Variable: shape(B, C, 1, T), dtype float32, the magnitude spectrogram. It is the square root of the power spectrogram. + """ + power = self.power(x) + magnitude = paddle.sqrt(power) + return magnitude diff --git a/parakeet/modules/transformer.py b/parakeet/modules/transformer.py new file mode 100644 index 0000000..f262923 --- /dev/null +++ b/parakeet/modules/transformer.py @@ -0,0 +1,133 @@ +import math +import paddle +from paddle import nn +from paddle.nn import functional as F + +from parakeet.modules import attention as attn +from parakeet.modules.masking import combine_mask +class PositionwiseFFN(nn.Layer): + """ + A faithful implementation of the Position-wise Feed-Forward Network + in `Attention is All You Need <https://arxiv.org/abs/1706.03762>`_. + It is basically a 2-layer MLP (two linear transformations), with relu activation and dropout in between. + """ + def __init__(self, + input_size: int, + hidden_size: int, + dropout=0.0): + """ + Args: + input_size (int): the input feature size. + hidden_size (int): the hidden layer's feature size. + dropout (float, optional): probability of dropout applied to the + output of the first fully connected layer. Defaults to 0.0. + """ + super(PositionwiseFFN, self).__init__() + self.linear1 = nn.Linear(input_size, hidden_size) + self.linear2 = nn.Linear(hidden_size, input_size) + self.dropout = nn.Dropout(dropout) + + self.input_size = input_size + self.hidden_size = hidden_size + + def forward(self, x): + """Position-wise feed forward network. + + Args: + x (Tensor): shape(*, input_size), the input tensor. + + Returns: + Tensor: shape(*, input_size), the output tensor. + """ + l1 = self.dropout(F.relu(self.linear1(x))) + l2 = self.linear2(l1) + return l2 + + +class TransformerEncoderLayer(nn.Layer): + """ + Transformer encoder layer. + """ + def __init__(self, d_model, n_heads, d_ffn, dropout=0.): + """ + Args: + d_model (int): the feature size of the input, and the output. + n_heads (int): the number of heads in the internal MultiHeadAttention layer. + d_ffn (int): the hidden size of the internal PositionwiseFFN. 
+ dropout (float, optional): the probability of the dropout in + MultiHeadAttention and PositionwiseFFN. Defaults to 0. + """ + super(TransformerEncoderLayer, self).__init__() + self.self_mha = attn.MultiheadAttention(d_model, n_heads, dropout) + self.layer_norm1 = nn.LayerNorm([d_model], epsilon=1e-6) + + self.ffn = PositionwiseFFN(d_model, d_ffn, dropout) + self.layer_norm2 = nn.LayerNorm([d_model], epsilon=1e-6) + + def forward(self, x, mask): + """ + Args: + x (Tensor): shape(batch_size, time_steps, d_model), the decoder input. + mask (Tensor): shape(batch_size, time_steps), the padding mask. + + Returns: + (x, attn_weights) + x (Tensor): shape(batch_size, time_steps, d_model), the decoded. + attn_weights (Tensor), shape(batch_size, n_heads, time_steps, time_steps), self attention. + """ + context_vector, attn_weights = self.self_mha(x, x, x, paddle.unsqueeze(mask, 1)) + x = self.layer_norm1(x + context_vector) + + x = self.layer_norm2(x + self.ffn(x)) + return x, attn_weights + + +class TransformerDecoderLayer(nn.Layer): + """ + Transformer decoder layer. + """ + def __init__(self, d_model, n_heads, d_ffn, dropout=0.): + """ + Args: + d_model (int): the feature size of the input, and the output. + n_heads (int): the number of heads in the internal MultiHeadAttention layer. + d_ffn (int): the hidden size of the internal PositionwiseFFN. + dropout (float, optional): the probability of the dropout in + MultiHeadAttention and PositionwiseFFN. Defaults to 0. + """ + super(TransformerDecoderLayer, self).__init__() + self.self_mha = attn.MultiheadAttention(d_model, n_heads, dropout) + self.layer_norm1 = nn.LayerNorm([d_model], epsilon=1e-6) + + self.cross_mha = attn.MultiheadAttention(d_model, n_heads, dropout) + self.layer_norm2 = nn.LayerNorm([d_model], epsilon=1e-6) + + self.ffn = PositionwiseFFN(d_model, d_ffn, dropout) + self.layer_norm3 = nn.LayerNorm([d_model], epsilon=1e-6) + + def forward(self, q, k, v, encoder_mask, decoder_mask): + """ + Args: + q (Tensor): shape(batch_size, time_steps_q, d_model), the decoder input. + k (Tensor): shape(batch_size, time_steps_k, d_model), keys. + v (Tensor): shape(batch_size, time_steps_k, d_model), values + encoder_mask (Tensor): shape(batch_size, time_steps_k) encoder padding mask. + decoder_mask (Tensor): shape(batch_size, time_steps_q) decoder padding mask. + + Returns: + (q, self_attn_weights, cross_attn_weights) + q (Tensor): shape(batch_size, time_steps_q, d_model), the decoded. + self_attn_weights (Tensor), shape(batch_size, n_heads, time_steps_q, time_steps_q), decoder self attention. + cross_attn_weights (Tensor), shape(batch_size, n_heads, time_steps_q, time_steps_k), decoder-encoder cross attention. + """ + tq = q.shape[1] + no_future_mask = paddle.tril(paddle.ones([tq, tq])) #(tq, tq) + combined_mask = combine_mask(decoder_mask.unsqueeze(1), no_future_mask) + context_vector, self_attn_weights = self.self_mha(q, q, q, combined_mask) + q = self.layer_norm1(q + context_vector) + + context_vector, cross_attn_weights = self.cross_mha(q, k, v, paddle.unsqueeze(encoder_mask, 1)) + q = self.layer_norm2(q + context_vector) + + q = self.layer_norm3(q + self.ffn(q)) + return q, self_attn_weights, cross_attn_weights diff --git a/parakeet/modules/weight_norm.py b/parakeet/modules/weight_norm.py deleted file mode 100644 index 51732a7..0000000 --- a/parakeet/modules/weight_norm.py +++ /dev/null @@ -1,282 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
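# --- Editorial usage sketch (not part of this change set) ---
# How the transformer pieces added above are meant to fit together: add the
# sinusoidal encodings from parakeet/modules/positional_encoding.py to the input,
# build a padding mask, and feed both to a TransformerEncoderLayer. The batch
# size, lengths and feature sizes below are arbitrary assumptions for illustration;
# they mirror tests/test_transformer.py later in this diff.
import paddle
from parakeet.modules.positional_encoding import positional_encoding
from parakeet.modules.transformer import TransformerEncoderLayer

paddle.set_default_dtype("float64")
paddle.disable_static(paddle.CPUPlace())
x = paddle.randn([4, 12, 64])                         # (batch, time, d_model)
x = x + positional_encoding(0, 12, 64)                # broadcast add of (time, d_model) encodings
lengths = paddle.to_tensor([12, 10, 8, 9])
mask = paddle.fluid.layers.sequence_mask(lengths, dtype=x.dtype)  # (batch, time) padding mask
layer = TransformerEncoderLayer(d_model=64, n_heads=8, d_ffn=128, dropout=0.1)
y, attn_weights = layer(x, mask)                      # y: (4, 12, 64), attn_weights: (4, 8, 12, 12)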
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from paddle import fluid -import paddle.fluid.dygraph as dg -import paddle.fluid.layers as F - -from parakeet.modules import customized as L - - -def norm(param, dim, power): - powered = F.pow(param, power) - in_dtype = powered.dtype - if in_dtype == fluid.core.VarDesc.VarType.FP16: - powered = F.cast(powered, "float32") - powered_norm = F.reduce_sum(powered, dim=dim, keep_dim=False) - norm_ = F.pow(powered_norm, 1. / power) - if in_dtype == fluid.core.VarDesc.VarType.FP16: - norm_ = F.cast(norm_, "float16") - return norm_ - - -def norm_except(param, dim, power): - """Computes the norm over all dimensions except dim. - It differs from pytorch implementation that it does not keep dim. - This difference is related with the broadcast mechanism in paddle. - Read elementeise_mul for more. - """ - shape = param.shape - ndim = len(shape) - - if dim is None: - return norm(param, dim, power) - elif dim == 0: - param_matrix = F.reshape(param, (shape[0], -1)) - return norm(param_matrix, dim=1, power=power) - elif dim == -1 or dim == ndim - 1: - param_matrix = F.reshape(param, (-1, shape[-1])) - return norm(param_matrix, dim=0, power=power) - else: - perm = list(range(ndim)) - perm[0] = dim - perm[dim] = 0 - transposed_param = F.transpose(param, perm) - return norm_except(transposed_param, dim=0, power=power) - - -def compute_l2_normalized_weight(v, g, dim): - shape = v.shape - ndim = len(shape) - - if dim is None: - v_normalized = v / (F.sqrt(F.reduce_sum(F.square(v))) + 1e-12) - elif dim == 0: - param_matrix = F.reshape(v, (shape[0], -1)) - v_normalized = F.l2_normalize(param_matrix, axis=1) - v_normalized = F.reshape(v_normalized, shape) - elif dim == -1 or dim == ndim - 1: - param_matrix = F.reshape(v, (-1, shape[-1])) - v_normalized = F.l2_normalize(param_matrix, axis=0) - v_normalized = F.reshape(v_normalized, shape) - else: - perm = list(range(ndim)) - perm[0] = dim - perm[dim] = 0 - transposed_param = F.transpose(v, perm) - transposed_shape = transposed_param.shape - param_matrix = F.reshape(transposed_param, - (transposed_param.shape[0], -1)) - v_normalized = F.l2_normalize(param_matrix, axis=1) - v_normalized = F.reshape(v_normalized, transposed_shape) - v_normalized = F.transpose(v_normalized, perm) - weight = F.elementwise_mul(v_normalized, g, axis=dim) - return weight - - -def compute_weight(v, g, dim, power): - assert len(g.shape) == 1, "magnitude should be a vector" - if power == 2: - in_dtype = v.dtype - if in_dtype == fluid.core.VarDesc.VarType.FP16: - v = F.cast(v, "float32") - g = F.cast(g, "float32") - weight = compute_l2_normalized_weight(v, g, dim) - if in_dtype == fluid.core.VarDesc.VarType.FP16: - weight = F.cast(weight, "float16") - return weight - else: - v_normalized = F.elementwise_div( - v, (norm_except(v, dim, power) + 1e-12), axis=dim) - weight = F.elementwise_mul(v_normalized, g, axis=dim) - return weight - - -class WeightNormWrapper(dg.Layer): - def __init__(self, layer, param_name="weight", dim=0, power=2): - 
super(WeightNormWrapper, self).__init__() - - self.param_name = param_name - self.dim = dim - self.power = power - self.layer = layer - - w_v = param_name + "_v" - w_g = param_name + "_g" - - # we could also use numpy to compute this, after all, it is run only once - # at initialization. - original_weight = getattr(layer, param_name) - self.add_parameter( - w_v, - self.create_parameter( - shape=original_weight.shape, dtype=original_weight.dtype)) - with dg.no_grad(): - F.assign(original_weight, getattr(self, w_v)) - delattr(layer, param_name) - temp = norm_except(getattr(self, w_v), self.dim, self.power) - self.add_parameter( - w_g, self.create_parameter( - shape=temp.shape, dtype=temp.dtype)) - with dg.no_grad(): - F.assign(temp, getattr(self, w_g)) - - # also set this when setting up - setattr(self.layer, self.param_name, - compute_weight( - getattr(self, w_v), - getattr(self, w_g), self.dim, self.power)) - - self.weigth_norm_applied = True - - # hook to compute weight with v & g - def hook(self): - w_v = self.param_name + "_v" - w_g = self.param_name + "_g" - setattr(self.layer, self.param_name, - compute_weight( - getattr(self, w_v), - getattr(self, w_g), self.dim, self.power)) - - def remove_weight_norm(self): - self.hook() - self.weigth_norm_applied = False - - def forward(self, *args, **kwargs): - if self.weigth_norm_applied == True: - self.hook() - return self.layer(*args, **kwargs) - - def __getattr__(self, key): - """ - this is used for attr forwarding. - """ - if key in self._parameters: - return self._parameters[key] - elif key in self._sub_layers: - return self._sub_layers[key] - elif key is "layer": - return self._sub_layers["layer"] - else: - return getattr( - object.__getattribute__(self, "_sub_layers")["layer"], key) - - -def Linear(input_dim, - output_dim, - param_attr=None, - bias_attr=None, - act=None, - dtype="float32"): - # a weight norm applied linear layer. 
- lin = dg.Linear(input_dim, output_dim, param_attr, bias_attr, act, dtype) - lin = WeightNormWrapper(lin, dim=1) - return lin - - -def Conv1D(num_channels, - num_filters, - filter_size, - stride=1, - padding=0, - dilation=1, - groups=1, - param_attr=None, - bias_attr=None, - use_cudnn=True, - act=None, - dtype='float32'): - conv = L.Conv1D(num_channels, num_filters, filter_size, stride, padding, - dilation, groups, param_attr, bias_attr, use_cudnn, act, - dtype) - conv = WeightNormWrapper(conv, dim=0) - return conv - - -def Conv1DTranspose(num_channels, - num_filters, - filter_size, - padding=0, - stride=1, - dilation=1, - groups=1, - param_attr=None, - bias_attr=None, - use_cudnn=True, - act=None, - dtype='float32'): - conv = L.Conv1DTranspose(num_channels, num_filters, filter_size, padding, - stride, dilation, groups, param_attr, bias_attr, - use_cudnn, act, dtype) - conv = WeightNormWrapper(conv, dim=0) - return conv - - -def Conv1DCell(num_channels, - num_filters, - filter_size, - dilation=1, - causal=False, - groups=1, - param_attr=None, - bias_attr=None, - use_cudnn=True, - act=None, - dtype='float32'): - conv = L.Conv1DCell(num_channels, num_filters, filter_size, dilation, - causal, groups, param_attr, bias_attr, use_cudnn, act, - dtype) - conv = WeightNormWrapper(conv, dim=0) - return conv - - -def Conv2D(num_channels, - num_filters, - filter_size, - stride=1, - padding=0, - dilation=1, - groups=1, - param_attr=None, - bias_attr=None, - use_cudnn=True, - act=None, - dtype='float32'): - # a conv2d layer with weight norm wrapper - conv = dg.Conv2D(num_channels, num_filters, filter_size, stride, padding, - dilation, groups, param_attr, bias_attr, use_cudnn, act, - dtype) - conv = WeightNormWrapper(conv, dim=0) - return conv - - -def Conv2DTranspose(num_channels, - num_filters, - filter_size, - output_size=None, - padding=0, - stride=1, - dilation=1, - groups=1, - param_attr=None, - bias_attr=None, - use_cudnn=True, - act=None, - dtype='float32'): - # a conv2d transpose layer with weight norm wrapper. - conv = dg.Conv2DTranspose(num_channels, num_filters, filter_size, - output_size, padding, stride, dilation, groups, - param_attr, bias_attr, use_cudnn, act, dtype) - conv = WeightNormWrapper(conv, dim=0) - return conv diff --git a/parakeet/utils/__init__.py b/parakeet/utils/__init__.py index abf198b..9ef6d7a 100644 --- a/parakeet/utils/__init__.py +++ b/parakeet/utils/__init__.py @@ -11,3 +11,5 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. + +from . 
import io, layer_tools, scheduler, display diff --git a/parakeet/utils/conf.py b/parakeet/utils/conf.py new file mode 100644 index 0000000..3e9eb7b --- /dev/null +++ b/parakeet/utils/conf.py @@ -0,0 +1,48 @@ +import attrdict +import flatdict +import argparse +import yaml + + +class Config(attrdict.AttrDict): + def dump(self, path): + with open(path, 'wt') as f: + yaml.safe_dump(dict(self), f, default_flow_style=None) + + def dumps(self): + return yaml.safe_dump(dict(self), default_flow_style=None) + + @classmethod + def from_file(cls, path): + with open(path, 'rt') as f: + c = yaml.safe_load(f) + return cls(c) + + def merge_file(self, path): + with open(path, 'rt') as f: + other = yaml.safe_load(f) + self.update(self + other) + + def merge_args(self, args): + args_dict = vars(args) + args_dict.pop("config") # exclude config file path + args_dict = {k: v for k, v in args_dict.items() if v is not None} + nested_dict = flatdict.FlatDict(args_dict, delimiter=".").as_dict() + self.update(self + nested_dict) + + def merge(self, other): + self.update(self + other) + + def flatten(self): + flat = flatdict.FlatDict(self, delimiter='.') + return flat + + def add_options_to_parser(self, parser): + parser.add_argument( + "--config", type=str, + help="extra config file to override the default config") + flat = self.flatten() + g = parser.add_argument_group("config file options") + for k, v in flat.items(): + g.add_argument("--{}".format(k), type=type(v), + help="config file option: {}".format(k)) \ No newline at end of file diff --git a/parakeet/utils/display.py b/parakeet/utils/display.py new file mode 100644 index 0000000..2e25997 --- /dev/null +++ b/parakeet/utils/display.py @@ -0,0 +1,39 @@ +import numpy as np +import matplotlib +from matplotlib import cm, pyplot + +def pack_attention_images(attention_weights, rotate=False): + # add a box + attention_weights = np.pad(attention_weights, + [(0, 0), (1, 1), (1, 1)], + mode="constant", + constant_values=1.) + if rotate: + attention_weights = np.rot90(attention_weights, axes=(1, 2)) + n, h, w = attention_weights.shape + + ratio = h / w + if ratio < 1: + cols = max(int(np.sqrt(n / ratio)), 1) + rows = int(np.ceil(n / cols)) + else: + rows = max(int(np.sqrt(n / ratio)), 1) + cols = int(np.ceil(n / rows)) + extras = rows * cols - n + #print(rows, cols, extras) + total = np.append(attention_weights, np.zeros([extras, h, w]), axis=0) + total = np.reshape(total, [rows, cols, h, w]) + img = np.block([[total[i, j] for j in range(cols)] for i in range(rows)]) + return img + +def add_attention_plots(writer, tag, attention_weights, global_step): + attns = [attn[0].numpy() for attn in attention_weights] + for i, attn in enumerate(attns): + img = pack_attention_images(attn) + writer.add_image(f"{tag}/{i}", + cm.plasma(img), + global_step=global_step, + dataformats="HWC") + +def min_max_normalize(v): + return (v - v.min()) / (v.max() - v.min()) diff --git a/parakeet/utils/internals.py b/parakeet/utils/internals.py new file mode 100644 index 0000000..e9d56af --- /dev/null +++ b/parakeet/utils/internals.py @@ -0,0 +1,36 @@ +import numpy as np +from paddle.framework import core + +def convert_dtype_to_np_dtype_(dtype): + """ + Convert paddle's data type to corrsponding numpy data type. + + Args: + dtype(np.dtype): the data type in paddle. + + Returns: + type: the data type in numpy. 
+ + """ + if dtype is core.VarDesc.VarType.FP32: + return np.float32 + elif dtype is core.VarDesc.VarType.FP64: + return np.float64 + elif dtype is core.VarDesc.VarType.FP16: + return np.float16 + elif dtype is core.VarDesc.VarType.BOOL: + return np.bool + elif dtype is core.VarDesc.VarType.INT32: + return np.int32 + elif dtype is core.VarDesc.VarType.INT64: + return np.int64 + elif dtype is core.VarDesc.VarType.INT16: + return np.int16 + elif dtype is core.VarDesc.VarType.INT8: + return np.int8 + elif dtype is core.VarDesc.VarType.UINT8: + return np.uint8 + elif dtype is core.VarDesc.VarType.BF16: + return np.uint16 + else: + raise ValueError("Not supported dtype %s" % dtype) diff --git a/parakeet/utils/io.py b/parakeet/utils/io.py index ed78bcc..7f89593 100644 --- a/parakeet/utils/io.py +++ b/parakeet/utils/io.py @@ -132,13 +132,13 @@ def load_parameters(model, k].dtype: model_dict[k] = v.astype(state_dict[k].numpy().dtype) - model.set_dict(model_dict) + model.set_state_dict(model_dict) print("[checkpoint] Rank {}: loaded model from {}.pdparams".format( local_rank, checkpoint_path)) if optimizer and optimizer_dict: - optimizer.set_dict(optimizer_dict) + optimizer.set_state_dict(optimizer_dict) print("[checkpoint] Rank {}: loaded optimizer state from {}.pdopt". format(local_rank, checkpoint_path)) diff --git a/parakeet/utils/layer_tools.py b/parakeet/utils/layer_tools.py index 8ff0631..d056d11 100644 --- a/parakeet/utils/layer_tools.py +++ b/parakeet/utils/layer_tools.py @@ -13,10 +13,10 @@ # limitations under the License. import numpy as np -import paddle.fluid.dygraph as dg +from paddle import nn -def summary(layer): +def summary(layer: nn.Layer): num_params = num_elements = 0 print("layer summary:") for name, param in layer.state_dict().items(): @@ -26,12 +26,18 @@ def summary(layer): print("layer has {} parameters, {} elements.".format(num_params, num_elements)) +def gradient_norm(layer: nn.Layer): + grad_norm_dict = {} + for name, param in layer.state_dict().items(): + if param.trainable: + grad = param.gradient() + grad_norm_dict[name] = np.linalg.norm(grad) / grad.size + return grad_norm_dict -def freeze(layer): +def freeze(layer: nn.Layer): for param in layer.parameters(): param.trainable = False - -def unfreeze(layer): +def unfreeze(layer: nn.Layer): for param in layer.parameters(): param.trainable = True diff --git a/parakeet/utils/mp_tools.py b/parakeet/utils/mp_tools.py new file mode 100644 index 0000000..bc24726 --- /dev/null +++ b/parakeet/utils/mp_tools.py @@ -0,0 +1,18 @@ +import paddle +from paddle import distributed as dist +from functools import wraps + +def rank_zero_only(func): + local_rank = dist.get_rank() + + @wraps(func) + def wrapper(*args, **kwargs): + if local_rank != 0: + return + result = func(*args, **kwargs) + return result + + return wrapper + + + diff --git a/parakeet/utils/scheduler.py b/parakeet/utils/scheduler.py new file mode 100644 index 0000000..4e93f5c --- /dev/null +++ b/parakeet/utils/scheduler.py @@ -0,0 +1,59 @@ +import math + +class SchedulerBase(object): + def __call__(self, step): + raise NotImplementedError("You should implement the __call__ method.") + + +class Constant(SchedulerBase): + def __init__(self, value): + self.value = value + + def __call__(self, step): + return self.value + + +class PieceWise(SchedulerBase): + def __init__(self, anchors): + anchors = list(anchors) + anchors = sorted(anchors, key=lambda x: x[0]) + assert anchors[0][0] == 0, "it must start from zero" + self.xs = [item[0] for item in anchors] + self.ys = [item[1] 
for item in anchors] + self.num_anchors = len(self.xs) + + def __call__(self, step): + i = 0 + for x in self.xs: + if step >= x: + i += 1 + if i == 0: + return self.ys[0] + if i == self.num_anchors: + return self.ys[-1] + k = (self.ys[i] - self.ys[i-1]) / (self.xs[i] - self.xs[i-1]) + out = self.ys[i-1] + (step - self.xs[i-1]) * k + return out + + +class StepWise(SchedulerBase): + def __init__(self, anchors): + anchors = list(anchors) + anchors = sorted(anchors, key=lambda x: x[0]) + assert anchors[0][0] == 0, "it must start from zero" + self.xs = [item[0] for item in anchors] + self.ys = [item[1] for item in anchors] + self.num_anchors = len(self.xs) + + def __call__(self, step): + i = 0 + for x in self.xs: + if step >= x: + i += 1 + + if i == self.num_anchors: + return self.ys[-1] + if i == 0: + return self.ys[0] + return self.ys[i-1] + diff --git a/setup.py b/setup.py index 8a19307..bc1ff1d 100644 --- a/setup.py +++ b/setup.py @@ -64,6 +64,9 @@ setup_info = dict( 'sox', 'soundfile', 'llvmlite==0.31.0', + 'opencc', + 'g2p_en', + 'g2pM' ], # Package info diff --git a/tests/test_attention.py b/tests/test_attention.py new file mode 100644 index 0000000..7865b68 --- /dev/null +++ b/tests/test_attention.py @@ -0,0 +1,101 @@ +import unittest +import numpy as np +import paddle +paddle.set_default_dtype("float64") +paddle.disable_static(paddle.CPUPlace()) + +from parakeet.modules import attention as attn + +class TestScaledDotProductAttention(unittest.TestCase): + def test_without_mask(self): + x = paddle.randn([4, 16, 8]) + context_vector, attention_weights = attn.scaled_dot_product_attention(x, x, x) + assert(list(context_vector.shape) == [4, 16, 8]) + assert(list(attention_weights.shape) == [4, 16, 16]) + + def test_with_mask(self): + x = paddle.randn([4, 16, 8]) + mask = paddle.fluid.layers.sequence_mask( + paddle.to_tensor([16, 15, 13, 14]), dtype=x.dtype) + mask = mask.unsqueeze(1) # unsqueeze for the decoder time steps + context_vector, attention_weights = attn.scaled_dot_product_attention(x, x, x, mask) + assert(list(context_vector.shape) == [4, 16, 8]) + assert(list(attention_weights.shape) == [4, 16, 16]) + + def test_4d(self): + x = paddle.randn([4, 6, 16, 8]) + context_vector, attention_weights = attn.scaled_dot_product_attention(x, x, x) + assert(list(context_vector.shape) == [4, 6, 16, 8]) + assert(list(attention_weights.shape) == [4, 6, 16, 16]) + + +class TestMonoheadAttention(unittest.TestCase): + def test_io(self): + net = attn.MonoheadAttention(6, 0.1) + q = paddle.randn([4, 18, 6]) + k = paddle.randn([4, 12, 6]) + v = paddle.randn([4, 12, 6]) + mask = paddle.fluid.layers.sequence_mask( + paddle.to_tensor([12, 10, 8, 9]), dtype=q.dtype) + mask = paddle.unsqueeze(mask, 1) # unsqueeze for time_steps_q + context_vector, attn_weights = net(q, k, v, mask) + self.assertTupleEqual(context_vector.numpy().shape, (4, 18, 6)) + self.assertTupleEqual(attn_weights.numpy().shape, (4, 18, 12)) + + +class TestDropHead(unittest.TestCase): + def test_drop(self): + x = paddle.randn([4, 6, 16, 8]) + out = attn.drop_head(x, 2, training=True) + # drop 2 head from 6 at all positions + np.testing.assert_allclose(np.sum(out.numpy() == 0., axis=1), 2) + + def test_drop_all(self): + x = paddle.randn([4, 6, 16, 8]) + out = attn.drop_head(x, 6, training=True) + np.testing.assert_allclose(np.sum(out.numpy()), 0) + + def test_eval(self): + x = paddle.randn([4, 6, 16, 8]) + out = attn.drop_head(x, 6, training=False) + self.assertIs(x, out) + + +class TestMultiheadAttention(unittest.TestCase): + def 
__init__(self, methodName="test_io", same_qk=True): + super(TestMultiheadAttention, self).__init__(methodName) + self.same_qk = same_qk + + def setUp(self): + if self.same_qk: + net = attn.MultiheadAttention(64, 8, dropout=0.3) + else: + net = attn.MultiheadAttention(64, 8, k_dim=12, v_dim=6) + self.net =net + + def test_io(self): + q = paddle.randn([4, 12, 64]) + mask = paddle.fluid.layers.sequence_mask( + paddle.to_tensor([12, 10, 8, 9]), dtype=q.dtype) + mask = paddle.unsqueeze(mask, 1) # unsqueeze for time_steps_q + context_vector, attention_weights = self.net(q, q, q, mask) + self.assertTupleEqual(context_vector.numpy().shape, (4, 12, 64)) + self.assertTupleEqual(attention_weights.numpy().shape, (4, 8, 12, 12)) + + +def load_tests(loader, standard_tests, pattern): + suite = unittest.TestSuite() + suite.addTest(TestScaledDotProductAttention("test_without_mask")) + suite.addTest(TestScaledDotProductAttention("test_with_mask")) + suite.addTest(TestScaledDotProductAttention("test_4d")) + + suite.addTest(TestDropHead("test_drop")) + suite.addTest(TestDropHead("test_drop_all")) + suite.addTest(TestDropHead("test_eval")) + + suite.addTest(TestMonoheadAttention("test_io")) + + suite.addTest(TestMultiheadAttention("test_io", same_qk=True)) + suite.addTest(TestMultiheadAttention("test_io", same_qk=False)) + + return suite \ No newline at end of file diff --git a/tests/test_cbhg.py b/tests/test_cbhg.py new file mode 100644 index 0000000..08ccbcc --- /dev/null +++ b/tests/test_cbhg.py @@ -0,0 +1,34 @@ +import unittest +import paddle +paddle.set_default_dtype("float64") +paddle.disable_static(paddle.CPUPlace()) +from parakeet.modules import cbhg + + +class TestHighway(unittest.TestCase): + def test_io(self): + net = cbhg.Highway(4) + x = paddle.randn([2, 12, 4]) + y = net(x) + self.assertTupleEqual(y.numpy().shape, (2, 12, 4)) + + +class TestCBHG(unittest.TestCase): + def __init__(self, methodName="runTest", ): + super(TestCBHG, self).__init__(methodName) + + def test_io(self): + self.net = cbhg.CBHG(64, 32, 16, + projection_channels=[64, 128], + num_highways=4, highway_features=128, + gru_features=64) + x = paddle.randn([4, 64, 32]) + y = self.net(x) + self.assertTupleEqual(y.numpy().shape, (4, 32, 128)) + +def load_tests(loader, standard_tests, pattern): + suite = unittest.TestSuite() + + suite.addTest(TestHighway("test_io")) + suite.addTest(TestCBHG("test_io")) + return suite diff --git a/tests/test_clarinet.py b/tests/test_clarinet.py new file mode 100644 index 0000000..32e8bff --- /dev/null +++ b/tests/test_clarinet.py @@ -0,0 +1,43 @@ +import unittest +import numpy as np + +import paddle +paddle.set_default_dtype("float64") +paddle.disable_static(paddle.CPUPlace()) + +from parakeet.models import clarinet +from parakeet.modules import stft + +class TestParallelWaveNet(unittest.TestCase): + def test_io(self): + net = clarinet.ParallelWaveNet([8, 8, 8], [1, 1, 1], 16, 12, 2) + x = paddle.randn([4, 6073]) + condition = paddle.randn([4, 12, 6073]) + z, out_mu, out_log_std = net(x, condition) + self.assertTupleEqual(z.numpy().shape, (4, 6073)) + self.assertTupleEqual(out_mu.numpy().shape, (4, 6073)) + self.assertTupleEqual(out_log_std.numpy().shape, (4, 6073)) + + +class TestClariNet(unittest.TestCase): + def setUp(self): + encoder = clarinet.UpsampleNet([2, 2]) + teacher = clarinet.WaveNet(8, 3, 16, 3, 12, 2, "mog", -9.0) + student = clarinet.ParallelWaveNet([8, 8, 8, 8, 8, 8], [1, 1, 1, 1, 1, 1], 16, 12, 2) + stft_module = stft.STFT(16, 4, 8) + net = clarinet.Clarinet(encoder, teacher, student, 
stft_module, -6.0, lmd=4) + print("context size is: ", teacher.context_size) + self.net = net + + def test_io(self): + audio = paddle.randn([4, 1366]) + mel = paddle.randn([4, 12, 512]) # 512 * 4 =2048 + audio_start = paddle.zeros([4], dtype="int64") + loss = self.net(audio, mel, audio_start, clip_kl=True) + loss["loss"].numpy() + + def test_synthesis(self): + mel = paddle.randn([4, 12, 512]) # 64 = 246 / 4 + out = self.net.synthesis(mel) + self.assertTupleEqual(out.numpy().shape, (4, 2048)) + \ No newline at end of file diff --git a/tests/test_connections.py b/tests/test_connections.py new file mode 100644 index 0000000..be0401a --- /dev/null +++ b/tests/test_connections.py @@ -0,0 +1,33 @@ +import unittest +import paddle +from paddle import nn +paddle.disable_static(paddle.CPUPlace()) +paddle.set_default_dtype("float64") + +from parakeet.modules import connections as conn + +class TestPreLayerNormWrapper(unittest.TestCase): + def test_io(self): + net = nn.Linear(8, 8) + net = conn.PreLayerNormWrapper(net, 8) + x = paddle.randn([4, 8]) + y = net(x) + self.assertTupleEqual(x.numpy().shape, y.numpy().shape) + + +class TestPostLayerNormWrapper(unittest.TestCase): + def test_io(self): + net = nn.Linear(8, 8) + net = conn.PostLayerNormWrapper(net, 8) + x = paddle.randn([4, 8]) + y = net(x) + self.assertTupleEqual(x.numpy().shape, y.numpy().shape) + + +class TestResidualWrapper(unittest.TestCase): + def test_io(self): + net = nn.Linear(8, 8) + net = conn.ResidualWrapper(net) + x = paddle.randn([4, 8]) + y = net(x) + self.assertTupleEqual(x.numpy().shape, y.numpy().shape) \ No newline at end of file diff --git a/tests/test_conv.py b/tests/test_conv.py new file mode 100644 index 0000000..b76e719 --- /dev/null +++ b/tests/test_conv.py @@ -0,0 +1,67 @@ +import paddle +paddle.set_default_dtype("float64") +paddle.disable_static(paddle.CPUPlace()) +import unittest +import numpy as np + +from parakeet.modules import conv + +class TestConv1dCell(unittest.TestCase): + def setUp(self): + self.net = conv.Conv1dCell(4, 6, 5, dilation=2) + + def forward_incremental(self, x): + outs = [] + self.net.start_sequence() + with paddle.no_grad(): + for i in range(x.shape[-1]): + xt = x[:, :, i] + yt = self.net.add_input(xt) + outs.append(yt) + y2 = paddle.stack(outs, axis=-1) + return y2 + + def test_equality(self): + x = paddle.randn([2, 4, 16]) + y1 = self.net(x) + + self.net.eval() + y2 = self.forward_incremental(x) + + np.testing.assert_allclose(y2.numpy(), y1.numpy()) + + +class TestConv1dBatchNorm(unittest.TestCase): + def __init__(self, methodName="runTest", causal=False, channel_last=False): + super(TestConv1dBatchNorm, self).__init__(methodName) + self.causal = causal + self.channel_last = channel_last + + def setUp(self): + k = 5 + paddding = (k - 1, 0) if self.causal else ((k-1) // 2, k //2) + self.net = conv.Conv1dBatchNorm(4, 6, (k,), 1, padding=paddding, + data_format="NLC" if self.channel_last else "NCL") + + def test_input_output(self): + x = paddle.randn([4, 16, 4]) if self.channel_last else paddle.randn([4, 4, 16]) + out = self.net(x) + out_np = out.numpy() + if self.channel_last: + self.assertTupleEqual(out_np.shape, (4, 16, 6)) + else: + self.assertTupleEqual(out_np.shape, (4, 6, 16)) + + def runTest(self): + self.test_input_output() + + +def load_tests(loader, standard_tests, pattern): + suite = unittest.TestSuite() + suite.addTest(TestConv1dBatchNorm("runTest", True, True)) + suite.addTest(TestConv1dBatchNorm("runTest", False, False)) + suite.addTest(TestConv1dBatchNorm("runTest", True, False)) 
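# Editorial note (sketch, not part of the original test file): the two padding
# schemes exercised by TestConv1dBatchNorm come from its setUp above. For a
# stride-1 convolution, out_len = in_len + pad_left + pad_right - (k - 1), so both
# the causal padding (k - 1, 0) and the centred padding ((k - 1) // 2, k // 2)
# preserve the time dimension; the causal variant just never looks ahead.
k, in_len = 5, 16
assert in_len + (k - 1) + 0 - (k - 1) == in_len            # causal: pad (k - 1, 0)
assert in_len + (k - 1) // 2 + k // 2 - (k - 1) == in_len  # centred: pad ((k - 1) // 2, k // 2)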
+ suite.addTest(TestConv1dBatchNorm("runTest", False, True)) + suite.addTest(TestConv1dCell("test_equality")) + + return suite \ No newline at end of file diff --git a/tests/test_dataset.py b/tests/test_dataset.py new file mode 100644 index 0000000..eafd74a --- /dev/null +++ b/tests/test_dataset.py @@ -0,0 +1,122 @@ +import unittest +import numpy as np +import paddle +from paddle import io +from parakeet import data + +class MyDataset(io.Dataset): + def __init__(self, size): + self._data = np.random.randn(size, 6) + + def __getitem__(self, i): + return self._data[i] + + def __len__(self): + return self._data.shape[0] + + +class TestTransformDataset(unittest.TestCase): + def test(self): + dataset = MyDataset(20) + dataset = data.TransformDataset(dataset, lambda x: np.abs(x)) + dataloader = io.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=1) + print("TransformDataset") + for batch, in dataloader: + print(type(batch), batch.dtype, batch.shape) + + +class TestChainDataset(unittest.TestCase): + def test(self): + dataset1 = MyDataset(20) + dataset2 = MyDataset(40) + dataset = data.ChainDataset(dataset1, dataset2) + dataloader = io.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=1) + print("ChainDataset") + for batch, in dataloader: + print(type(batch), batch.dtype, batch.shape) + + +class TestTupleDataset(unittest.TestCase): + def test(self): + dataset1 = MyDataset(20) + dataset2 = MyDataset(20) + dataset = data.TupleDataset(dataset1, dataset2) + dataloader = io.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=1) + print("TupleDataset") + for field1, field2 in dataloader: + print(type(field1), field1.dtype, field1.shape) + print(type(field2), field2.dtype, field2.shape) + + +class TestDictDataset(unittest.TestCase): + def test(self): + dataset1 = MyDataset(20) + dataset2 = MyDataset(20) + dataset = data.DictDataset(field1=dataset1, field2=dataset2) + def collate_fn(examples): + examples_tuples = [] + for example in examples: + examples_tuples.append(example.values()) + return paddle.fluid.dataloader.dataloader_iter.default_collate_fn(examples_tuples) + + dataloader = io.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=1, collate_fn=collate_fn) + print("DictDataset") + for field1, field2 in dataloader: + print(type(field1), field1.dtype, field1.shape) + print(type(field2), field2.dtype, field2.shape) + + +class TestSliceDataset(unittest.TestCase): + def test(self): + dataset = MyDataset(40) + dataset = data.SliceDataset(dataset, 0, 20) + dataloader = io.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=1) + print("SliceDataset") + for batch, in dataloader: + print(type(batch), batch.dtype, batch.shape) + + +class TestSplit(unittest.TestCase): + def test(self): + dataset = MyDataset(40) + train, valid = data.split(dataset, 10) + dataloader1 = io.DataLoader(train, batch_size=4, shuffle=True, num_workers=1) + dataloader2 = io.DataLoader(valid, batch_size=4, shuffle=True, num_workers=1) + print("First Dataset") + for batch, in dataloader1: + print(type(batch), batch.dtype, batch.shape) + + print("Second Dataset") + for batch, in dataloader2: + print(type(batch), batch.dtype, batch.shape) + + +class TestSubsetDataset(unittest.TestCase): + def test(self): + dataset = MyDataset(40) + indices = np.random.choice(np.arange(40), [20], replace=False).tolist() + dataset = data.SubsetDataset(dataset, indices) + dataloader = io.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=1) + print("SubsetDataset") + for batch, in dataloader: + 
print(type(batch), batch.dtype, batch.shape) + + +class TestFilterDataset(unittest.TestCase): + def test(self): + dataset = MyDataset(40) + dataset = data.FilterDataset(dataset, lambda x: np.mean(x)> 0.3) + dataloader = io.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=1) + print("FilterDataset") + for batch, in dataloader: + print(type(batch), batch.dtype, batch.shape) + + +class TestCacheDataset(unittest.TestCase): + def test(self): + dataset = MyDataset(40) + dataset = data.CacheDataset(dataset) + dataloader = io.DataLoader(dataset, batch_size=4, shuffle=True, num_workers=1) + print("CacheDataset") + for batch, in dataloader: + print(type(batch), batch.dtype, batch.shape) diff --git a/tests/test_deepvoice3.py b/tests/test_deepvoice3.py new file mode 100644 index 0000000..5abe272 --- /dev/null +++ b/tests/test_deepvoice3.py @@ -0,0 +1,107 @@ +import numpy as np +import unittest +import paddle +paddle.set_default_dtype("float64") +paddle.disable_static(paddle.CPUPlace()) + +from parakeet.models import deepvoice3 as dv3 + +class TestConvBlock(unittest.TestCase): + def test_io_causal(self): + net = dv3.ConvBlock(6, 5, True, True, 8, 0.9) + x = paddle.randn([4, 32, 6]) + condition = paddle.randn([4, 8]) + # TODO(chenfeiyu): to report an issue on default data type + padding = paddle.zeros([4, 4, 6], dtype=x.dtype) + y = net.forward(x, condition, padding) + self.assertTupleEqual(y.numpy().shape, (4, 32, 6)) + + def test_io_non_causal(self): + net = dv3.ConvBlock(6, 5, False, True, 8, 0.9) + x = paddle.randn([4, 32, 6]) + condition = paddle.randn([4, 8]) + y = net.forward(x, condition) + self.assertTupleEqual(y.numpy().shape, (4, 32, 6)) + + +class TestAffineBlock1(unittest.TestCase): + def test_io(self): + net = dv3.AffineBlock1(6, 16, True, 8) + x = paddle.randn([4, 32, 6]) + condition = paddle.randn([4, 8]) + y = net(x, condition) + self.assertTupleEqual(y.numpy().shape, (4, 32, 16)) + + +class TestAffineBlock2(unittest.TestCase): + def test_io(self): + net = dv3.AffineBlock2(6, 16, True, 8) + x = paddle.randn([4, 32, 6]) + condition = paddle.randn([4, 8]) + y = net(x, condition) + self.assertTupleEqual(y.numpy().shape, (4, 32, 16)) + + +class TestEncoder(unittest.TestCase): + def test_io(self): + net = dv3.Encoder(5, 8, 16, 5, True, 6) + x = paddle.randn([4, 32, 8]) + condition = paddle.randn([4, 6]) + keys, values = net(x, condition) + self.assertTupleEqual(keys.numpy().shape, (4, 32, 8)) + self.assertTupleEqual(values.numpy().shape, (4, 32, 8)) + + +class TestAttentionBlock(unittest.TestCase): + def test_io(self): + net = dv3.AttentionBlock(16, 6, has_bias=True, bias_dim=8) + q = paddle.randn([4, 32, 6]) + k = paddle.randn([4, 24, 6]) + v = paddle.randn([4, 24, 6]) + lengths = paddle.to_tensor([24, 20, 19, 23], dtype="int64") + condition = paddle.randn([4, 8]) + context_vector, attention_weight = net(q, k, v, lengths, condition, 0) + self.assertTupleEqual(context_vector.numpy().shape, (4, 32, 6)) + self.assertTupleEqual(attention_weight.numpy().shape, (4, 32, 24)) + + def test_io_with_previous_attn(self): + net = dv3.AttentionBlock(16, 6, has_bias=True, bias_dim=8) + q = paddle.randn([4, 32, 6]) + k = paddle.randn([4, 24, 6]) + v = paddle.randn([4, 24, 6]) + lengths = paddle.to_tensor([24, 20, 19, 23], dtype="int64") + condition = paddle.randn([4, 8]) + prev_attn_weight = paddle.randn([4, 32, 16]) + + context_vector, attention_weight = net( + q, k, v, lengths, condition, 0, + force_monotonic=True, prev_coeffs=prev_attn_weight, window=(0, 4)) + 
self.assertTupleEqual(context_vector.numpy().shape, (4, 32, 6)) + self.assertTupleEqual(attention_weight.numpy().shape, (4, 32, 24)) + + +class TestDecoder(unittest.TestCase): + def test_io(self): + net = dv3.Decoder(8, 4, [4, 12], 5, 3, 16, 1.0, 1.45, True, 6) + x = paddle.randn([4, 32, 8]) + k = paddle.randn([4, 24, 12]) # prenet's last size should equals k's feature size + v = paddle.randn([4, 24, 12]) + lengths = paddle.to_tensor([24, 18, 19, 22]) + condition = paddle.randn([4, 6]) + decoded, hidden, attentions, final_state = net(x, k, v, lengths, 0, condition) + self.assertTupleEqual(decoded.numpy().shape, (4, 32, 4 * 8)) + self.assertTupleEqual(hidden.numpy().shape, (4, 32, 12)) + self.assertEqual(len(attentions), 5) + self.assertTupleEqual(attentions[0].numpy().shape, (4, 32, 24)) + self.assertEqual(len(final_state), 5) + self.assertTupleEqual(final_state[0].numpy().shape, (4, 2, 12)) + + +class TestPostNet(unittest.TestCase): + def test_io(self): + net = dv3.PostNet(3, 8, 16, 3, 12, 4, True, 6) + x = paddle.randn([4, 32, 8]) + condition = paddle.randn([4, 6]) + y = net(x, condition) + self.assertTupleEqual(y.numpy().shape, (4, 32 * 4, 12)) + diff --git a/tests/test_geometry.py b/tests/test_geometry.py new file mode 100644 index 0000000..1c0efeb --- /dev/null +++ b/tests/test_geometry.py @@ -0,0 +1,19 @@ +import unittest +import numpy as np + +import paddle +paddle.set_default_dtype("float64") +paddle.disable_static(paddle.CPUPlace()) + +from parakeet.modules import geometry as geo + +class TestShuffleDim(unittest.TestCase): + def test_perm(self): + x = paddle.randn([2, 3, 4, 6]) + y = geo.shuffle_dim(x, 2, [3, 2, 1, 0]) + np.testing.assert_allclose(x.numpy()[0, 0, :, 0], y.numpy()[0, 0, ::-1, 0]) + + def test_random_perm(self): + x = paddle.randn([2, 3, 4, 6]) + y = geo.shuffle_dim(x, 2) + np.testing.assert_allclose(x.numpy().sum(2), y.numpy().sum(2)) \ No newline at end of file diff --git a/tests/test_losses.py b/tests/test_losses.py new file mode 100644 index 0000000..fa38eee --- /dev/null +++ b/tests/test_losses.py @@ -0,0 +1,33 @@ +import unittest +import paddle +paddle.set_device("cpu") +import numpy as np + +from parakeet.modules.losses import weighted_mean, masked_l1_loss, masked_softmax_with_cross_entropy + +class TestWeightedMean(unittest.TestCase): + def test(self): + x = paddle.arange(0, 10, dtype="float64").unsqueeze(-1).broadcast_to([10, 3]) + mask = (paddle.arange(0, 10, dtype="float64") > 4).unsqueeze(-1) + loss = weighted_mean(x, mask) + self.assertAlmostEqual(loss.numpy()[0], 7) + + +class TestMaskedL1Loss(unittest.TestCase): + def test(self): + x = paddle.arange(0, 10, dtype="float64").unsqueeze(-1).broadcast_to([10, 3]) + y = paddle.zeros_like(x) + mask = (paddle.arange(0, 10, dtype="float64") > 4).unsqueeze(-1) + loss = masked_l1_loss(x, y, mask) + print(loss) + self.assertAlmostEqual(loss.numpy()[0], 7) + + +class TestMaskedCrossEntropy(unittest.TestCase): + def test(self): + x = paddle.randn([3, 30, 8], dtype="float64") + y = paddle.randint(0, 8, [3, 30], dtype="int64").unsqueeze(-1) # mind this + mask = paddle.fluid.layers.sequence_mask( + paddle.to_tensor([30, 18, 27]), dtype="int64").unsqueeze(-1) + loss = masked_softmax_with_cross_entropy(x, y, mask) + print(loss) diff --git a/tests/test_masking.py b/tests/test_masking.py new file mode 100644 index 0000000..c1a388b --- /dev/null +++ b/tests/test_masking.py @@ -0,0 +1,54 @@ +import unittest +import numpy as np +import paddle +paddle.set_default_dtype("float64") + +from parakeet.modules import masking + + 
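# Editorial usage sketch (shapes below are arbitrary assumptions): how the module
# under test is typically combined for a transformer decoder -- a (batch, time)
# padding mask is unsqueezed and merged with a (time, time) no-future mask, giving
# a (batch, time, time) mask in which step t attends only to valid, non-future steps.
_lengths = paddle.to_tensor([3, 2])
_padding = paddle.fluid.layers.sequence_mask(_lengths, maxlen=3, dtype="float64")  # (2, 3)
_no_future = masking.future_mask(3, dtype="float64")                               # (3, 3), lower triangular
_combined = masking.combine_mask(_padding.unsqueeze(1), _no_future)                # (2, 3, 3)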
+def sequence_mask(lengths, max_length=None, dtype="bool"): + max_length = max_length or np.max(lengths) + ids = np.arange(max_length) + return (ids < np.expand_dims(lengths, -1)).astype(dtype) + +def future_mask(lengths, max_length=None, dtype="bool"): + max_length = max_length or np.max(lengths) + return np.tril(np.tril(np.ones(max_length))).astype(dtype) + +class TestIDMask(unittest.TestCase): + def test(self): + ids = paddle.to_tensor( + [[1, 2, 3, 0, 0, 0], + [2, 4, 5, 6, 0, 0], + [7, 8, 9, 0, 0, 0]] + ) + mask = masking.id_mask(ids) + self.assertTupleEqual(mask.numpy().shape, ids.numpy().shape) + print(mask.numpy()) + +class TestFeatureMask(unittest.TestCase): + def test(self): + features = np.random.randn(3, 16, 8) + lengths = [16, 14, 12] + for i, length in enumerate(lengths): + features[i, length:, :] = 0 + + feature_tensor = paddle.to_tensor(features) + mask = masking.feature_mask(feature_tensor, -1) + self.assertTupleEqual(mask.numpy().shape, (3, 16, 1)) + print(mask.numpy().squeeze()) + + +class TestCombineMask(unittest.TestCase): + def test_bool_mask(self): + lengths = np.array([12, 8, 9, 10]) + padding_mask = sequence_mask(lengths, dtype="bool") + no_future_mask = future_mask(lengths, dtype="bool") + combined_mask1 = np.expand_dims(padding_mask, 1) * no_future_mask + + print(paddle.to_tensor(padding_mask).dtype) + print(paddle.to_tensor(no_future_mask).dtype) + combined_mask2 = masking.combine_mask( + paddle.to_tensor(padding_mask).unsqueeze(1), paddle.to_tensor(no_future_mask) + ) + np.testing.assert_allclose(combined_mask2.numpy(), combined_mask1) diff --git a/tests/test_position_encoding.py b/tests/test_position_encoding.py new file mode 100644 index 0000000..408c0d2 --- /dev/null +++ b/tests/test_position_encoding.py @@ -0,0 +1,64 @@ +import unittest +import numpy as np +import paddle + +from parakeet.modules import positional_encoding as pe + +def positional_encoding(start_index, length, size, dtype="float32"): + if (size % 2 != 0): + raise ValueError("size should be divisible by 2") + channel = np.arange(0, size, 2, dtype=dtype) + index = np.arange(start_index, start_index + length, 1, dtype=dtype) + p = np.expand_dims(index, -1) / (10000 ** (channel / float(size))) + encodings = np.concatenate([np.sin(p), np.cos(p)], axis=-1) + return encodings + +def scalable_positional_encoding(start_index, length, size, omega): + dtype = omega.dtype + index = np.arange(start_index, start_index + length, 1, dtype=dtype) + channel = np.arange(0, size, 2, dtype=dtype) + + p = np.reshape(omega, omega.shape + (1, 1)) \ + * np.expand_dims(index, -1) \ + / (10000 ** (channel / float(size))) + + encodings = np.concatenate([np.sin(p), np.cos(p)], axis=-1) + return encodings + +class TestPositionEncoding(unittest.TestCase): + def __init__(self, start=0, length=20, size=16, dtype="float64"): + super(TestPositionEncoding, self).__init__("runTest") + self.spec = (start, length, size, dtype) + + def test_equality(self): + start, length, size, dtype = self.spec + position_embed1 = positional_encoding(start, length, size, dtype) + position_embed2 = pe.positional_encoding(start, length, size, dtype) + np.testing.assert_allclose(position_embed2.numpy(), position_embed1) + + def runTest(self): + paddle.disable_static(paddle.CPUPlace()) + self.test_equality() + +class TestScalablePositionEncoding(unittest.TestCase): + def __init__(self, start=0, length=20, size=16, dtype="float64"): + super(TestScalablePositionEncoding, self).__init__("runTest") + self.spec = (start, length, size, dtype) + + def 
test_equality(self): + start, length, size, dtype = self.spec + omega = np.random.uniform(1, 2, size=(4,)).astype(dtype) + position_embed1 = scalable_positional_encoding(start, length, size, omega) + position_embed2 = pe.scalable_positional_encoding(start, length, size, paddle.to_tensor(omega)) + np.testing.assert_allclose(position_embed2.numpy(), position_embed1) + + def runTest(self): + paddle.disable_static(paddle.CPUPlace()) + self.test_equality() + + +def load_tests(loader, standard_tests, pattern): + suite = unittest.TestSuite() + suite.addTest(TestPositionEncoding(0, 20, 16, "float64")) + suite.addTest(TestScalablePositionEncoding(0, 20, 16)) + return suite \ No newline at end of file diff --git a/tests/test_stft.py b/tests/test_stft.py new file mode 100644 index 0000000..ac66d24 --- /dev/null +++ b/tests/test_stft.py @@ -0,0 +1,27 @@ +import unittest +import numpy as np +import librosa +import paddle +paddle.set_default_dtype("float64") +paddle.disable_static(paddle.CPUPlace()) + +from parakeet.modules import stft + +class TestSTFT(unittest.TestCase): + def test(self): + path = librosa.util.example("choice") + wav, sr = librosa.load(path, duration=5) + wav = wav.astype("float64") + + spec = librosa.stft(wav, n_fft=2048, hop_length=256, win_length=1024) + mag1 = np.abs(spec) + + wav_in_batch = paddle.unsqueeze(paddle.to_tensor(wav), 0) + mag2 = stft.STFT(2048, 256, 1024).magnitude(wav_in_batch) + mag2 = paddle.squeeze(mag2, [0, 2]).numpy() + + print("mag1", mag1) + print("mag2", mag2) + # TODO(chenfeiyu): Is there something wrong? there is some elements that + # does not match + # np.testing.assert_allclose(mag2, mag1) diff --git a/tests/test_transformer.py b/tests/test_transformer.py new file mode 100644 index 0000000..41b79bc --- /dev/null +++ b/tests/test_transformer.py @@ -0,0 +1,43 @@ +import unittest +import numpy as np +import paddle +paddle.set_default_dtype("float64") +paddle.disable_static(paddle.CPUPlace()) + +from parakeet.modules import transformer + +class TestPositionwiseFFN(unittest.TestCase): + def test_io(self): + net = transformer.PositionwiseFFN(8, 12) + x = paddle.randn([2, 3, 4, 8]) + y = net(x) + self.assertTupleEqual(y.numpy().shape, (2, 3, 4, 8)) + + +class TestTransformerEncoderLayer(unittest.TestCase): + def test_io(self): + net = transformer.TransformerEncoderLayer(64, 8, 128, 0.5) + x = paddle.randn([4, 12, 64]) + lengths = paddle.to_tensor([12, 8, 9, 10]) + mask = paddle.fluid.layers.sequence_mask(lengths, dtype=x.dtype) + y, attn_weights = net(x, mask) + + self.assertTupleEqual(y.numpy().shape, (4, 12, 64)) + self.assertTupleEqual(attn_weights.numpy().shape, (4, 8, 12, 12)) + + +class TestTransformerDecoderLayer(unittest.TestCase): + def test_io(self): + net = transformer.TransformerDecoderLayer(64, 8, 128, 0.5) + q = paddle.randn([4, 32, 64]) + k = paddle.randn([4, 24, 64]) + v = paddle.randn([4, 24, 64]) + enc_lengths = paddle.to_tensor([24, 18, 20, 22]) + dec_lengths = paddle.to_tensor([32, 28, 30, 31]) + enc_mask = paddle.fluid.layers.sequence_mask(enc_lengths, dtype=k.dtype) + dec_mask = paddle.fluid.layers.sequence_mask(dec_lengths, dtype=q.dtype) + y, self_attn_weights, cross_attn_weights = net(q, k, v, enc_mask, dec_mask) + + self.assertTupleEqual(y.numpy().shape, (4, 32, 64)) + self.assertTupleEqual(self_attn_weights.numpy().shape, (4, 8, 32, 32)) + self.assertTupleEqual(cross_attn_weights.numpy().shape, (4, 8, 32, 24)) \ No newline at end of file diff --git a/tests/test_transformer_tts.py b/tests/test_transformer_tts.py new file mode 100644 
index 0000000..a13990d --- /dev/null +++ b/tests/test_transformer_tts.py @@ -0,0 +1,121 @@ +import unittest +import numpy as np +import paddle +paddle.set_default_dtype("float64") +paddle.disable_static(paddle.CPUPlace()) + +from parakeet.models import transformer_tts as tts +from parakeet.modules import masking +from pprint import pprint + +class TestMultiheadAttention(unittest.TestCase): + def test_io_same_qk(self): + net = tts.MultiheadAttention(64, 8) + q = paddle.randn([4, 12, 64]) + mask = paddle.fluid.layers.sequence_mask( + paddle.to_tensor([12, 10, 8, 9]), dtype=q.dtype) + mask = paddle.unsqueeze(mask, 1) # unsqueeze for time_steps_q + context_vector, attention_weights = net(q, q, q, mask, drop_n_heads=2) + self.assertTupleEqual(context_vector.numpy().shape, (4, 12, 64)) + self.assertTupleEqual(attention_weights.numpy().shape, (4, 8, 12, 12)) + + def test_io(self): + net = tts.MultiheadAttention(64, 8, k_dim=12, v_dim=6) + q = paddle.randn([4, 12, 64]) + mask = paddle.fluid.layers.sequence_mask( + paddle.to_tensor([12, 10, 8, 9]), dtype=q.dtype) + mask = paddle.unsqueeze(mask, 1) # unsqueeze for time_steps_q + context_vector, attention_weights = net(q, q, q, mask, drop_n_heads=2) + self.assertTupleEqual(context_vector.numpy().shape, (4, 12, 64)) + self.assertTupleEqual(attention_weights.numpy().shape, (4, 8, 12, 12)) + + +class TestTransformerEncoderLayer(unittest.TestCase): + def test_io(self): + net = tts.TransformerEncoderLayer(64, 8, 128) + x = paddle.randn([4, 12, 64]) + mask = paddle.fluid.layers.sequence_mask( + paddle.to_tensor([12, 10, 8, 9]), dtype=x.dtype) + context_vector, attention_weights = net(x, mask) + self.assertTupleEqual(context_vector.numpy().shape, (4, 12, 64)) + self.assertTupleEqual(attention_weights.numpy().shape, (4, 8, 12, 12)) + + +class TestTransformerDecoderLayer(unittest.TestCase): + def test_io(self): + net = tts.TransformerDecoderLayer(64, 8, 128, 0.5) + q = paddle.randn([4, 32, 64]) + k = paddle.randn([4, 24, 64]) + v = paddle.randn([4, 24, 64]) + enc_lengths = paddle.to_tensor([24, 18, 20, 22]) + dec_lengths = paddle.to_tensor([32, 28, 30, 31]) + enc_mask = masking.sequence_mask(enc_lengths, dtype=k.dtype) + dec_padding_mask = masking.sequence_mask(dec_lengths, dtype=q.dtype) + no_future_mask = masking.future_mask(32, dtype=q.dtype) + dec_mask = masking.combine_mask(dec_padding_mask.unsqueeze(-1), no_future_mask) + y, self_attn_weights, cross_attn_weights = net(q, k, v, enc_mask, dec_mask) + + self.assertTupleEqual(y.numpy().shape, (4, 32, 64)) + self.assertTupleEqual(self_attn_weights.numpy().shape, (4, 8, 32, 32)) + self.assertTupleEqual(cross_attn_weights.numpy().shape, (4, 8, 32, 24)) + + +class TestTransformerTTS(unittest.TestCase): + def setUp(self): + net = tts.TransformerTTS( + 128, 0, 64, 128, 80, 4, 128, + 6, 6, 128, 128, 4, + 3, 10, 0.1) + self.net = net + + def test_encode_io(self): + net = self.net + + text = paddle.randint(0, 128, [4, 176]) + lengths = paddle.to_tensor([176, 156, 174, 168]) + mask = masking.sequence_mask(lengths, dtype=text.dtype) + text = text * mask + + encoded, attention_weights, encoder_mask = net.encode(text) + print("output shapes:") + print("encoded:", encoded.numpy().shape) + print("encoder_attentions:", [item.shape for item in attention_weights]) + print("encoder_mask:", encoder_mask.numpy().shape) + + def test_all_io(self): + net = self.net + + text = paddle.randint(0, 128, [4, 176]) + lengths = paddle.to_tensor([176, 156, 174, 168]) + mask = masking.sequence_mask(lengths, dtype=text.dtype) + text = text * 
+
+        mel = paddle.randn([4, 189, 80])
+        frames = paddle.to_tensor([189, 186, 179, 174])
+        mask = masking.sequence_mask(frames, dtype=frames.dtype)
+        mel = mel * mask.unsqueeze(-1)
+
+        encoded, encoder_attention_weights, encoder_mask = net.encode(text)
+        mel_output, mel_intermediate, cross_attention_weights, stop_logits = net.decode(encoded, mel, encoder_mask)
+
+        print("output shapes:")
+        print("encoder_output:", encoded.numpy().shape)
+        print("encoder_attentions:", [item.shape for item in encoder_attention_weights])
+        print("encoder_mask:", encoder_mask.numpy().shape)
+        print("mel_output: ", mel_output.numpy().shape)
+        print("mel_intermediate: ", mel_intermediate.numpy().shape)
+        print("decoder_attentions:", [item.shape for item in cross_attention_weights])
+        print("stop_logits:", stop_logits.numpy().shape)
+
+    def test_predict_io(self):
+        net = self.net
+        net.eval()
+        with paddle.no_grad():
+            text = paddle.randint(0, 128, [176])
+            decoder_output, encoder_attention_weights, cross_attention_weights = net.predict(text)
+
+        print("output shapes:")
+        print("mel_output: ", decoder_output.numpy().shape)
+        print("encoder_attentions:", [item.shape for item in encoder_attention_weights])
+        print("decoder_attentions:", [item.shape for item in cross_attention_weights])
+
\ No newline at end of file
diff --git a/tests/test_waveflow.py b/tests/test_waveflow.py
new file mode 100644
index 0000000..15bbc44
--- /dev/null
+++ b/tests/test_waveflow.py
@@ -0,0 +1,130 @@
+import numpy as np
+import unittest
+
+import paddle
+paddle.set_default_dtype("float64")
+paddle.disable_static(paddle.CPUPlace())
+
+from parakeet.models import waveflow
+
+class TestFold(unittest.TestCase):
+    def test_audio(self):
+        x = paddle.randn([4, 32 * 8])
+        y = waveflow.fold(x, 8)
+        self.assertTupleEqual(y.numpy().shape, (4, 32, 8))
+
+    def test_spec(self):
+        x = paddle.randn([4, 80, 32 * 8])
+        y = waveflow.fold(x, 8)
+        self.assertTupleEqual(y.numpy().shape, (4, 80, 32, 8))
+
+
+class TestUpsampleNet(unittest.TestCase):
+    def test_io(self):
+        net = waveflow.UpsampleNet([2, 2])
+        x = paddle.randn([4, 8, 6])
+        y = net(x)
+        self.assertTupleEqual(y.numpy().shape, (4, 8, 2 * 2 * 6))
+
+
+class TestResidualBlock(unittest.TestCase):
+    def test_io(self):
+        net = waveflow.ResidualBlock(4, 6, (3, 3), (2, 2))
+        x = paddle.randn([4, 4, 16, 32])
+        condition = paddle.randn([4, 6, 16, 32])
+        res, skip = net(x, condition)
+        self.assertTupleEqual(res.numpy().shape, (4, 4, 16, 32))
+        self.assertTupleEqual(skip.numpy().shape, (4, 4, 16, 32))
+
+    def test_add_input(self):
+        net = waveflow.ResidualBlock(4, 6, (3, 3), (2, 2))
+        net.eval()
+        net.start_sequence()
+
+        x_row = paddle.randn([4, 4, 1, 32])
+        condition_row = paddle.randn([4, 6, 1, 32])
+
+        res, skip = net.add_input(x_row, condition_row)
+        self.assertTupleEqual(res.numpy().shape, (4, 4, 1, 32))
+        self.assertTupleEqual(skip.numpy().shape, (4, 4, 1, 32))
+
+
+class TestResidualNet(unittest.TestCase):
+    def test_io(self):
+        net = waveflow.ResidualNet(8, 6, 8, (3, 3), [1, 1, 1, 1, 1, 1, 1, 1])
+        x = paddle.randn([4, 6, 8, 32])
+        condition = paddle.randn([4, 8, 8, 32])
+        y = net(x, condition)
+        self.assertTupleEqual(y.numpy().shape, (4, 6, 8, 32))
+
+    def test_add_input(self):
+        net = waveflow.ResidualNet(8, 6, 8, (3, 3), [1, 1, 1, 1, 1, 1, 1, 1])
+        net.eval()
+        net.start_sequence()
+
+        x_row = paddle.randn([4, 6, 1, 32])
+        condition_row = paddle.randn([4, 8, 1, 32])
+
+        y_row = net.add_input(x_row, condition_row)
+        self.assertTupleEqual(y_row.numpy().shape, (4, 6, 1, 32))
+
+
+class TestFlow(unittest.TestCase):
+    def test_io(self):
+        net = waveflow.Flow(8, 16, 7, (3, 3), 8)
+
+        x = paddle.randn([4, 1, 8, 32])
+        condition = paddle.randn([4, 7, 8, 32])
+        z, (logs, b) = net(x, condition)
+        self.assertTupleEqual(z.numpy().shape, (4, 1, 8, 32))
+        self.assertTupleEqual(logs.numpy().shape, (4, 1, 7, 32))
+        self.assertTupleEqual(b.numpy().shape, (4, 1, 7, 32))
+
+    def test_inverse_row(self):
+        net = waveflow.Flow(8, 16, 7, (3, 3), 8)
+        net.eval()
+        net._start_sequence()
+
+        x_row = paddle.randn([4, 1, 1, 32])  # last row
+        condition_row = paddle.randn([4, 7, 1, 32])
+        z_row = paddle.randn([4, 1, 1, 32])
+        x_next_row, (logs, b) = net._inverse_row(z_row, x_row, condition_row)
+
+        self.assertTupleEqual(x_next_row.numpy().shape, (4, 1, 1, 32))
+        self.assertTupleEqual(logs.numpy().shape, (4, 1, 1, 32))
+        self.assertTupleEqual(b.numpy().shape, (4, 1, 1, 32))
+
+    def test_inverse(self):
+        net = waveflow.Flow(8, 16, 7, (3, 3), 8)
+        net.eval()
+
+        z = paddle.randn([4, 1, 8, 32])
+        condition = paddle.randn([4, 7, 8, 32])
+
+        with paddle.no_grad():
+            x, (logs, b) = net.inverse(z, condition)
+        self.assertTupleEqual(x.numpy().shape, (4, 1, 8, 32))
+        self.assertTupleEqual(logs.numpy().shape, (4, 1, 7, 32))
+        self.assertTupleEqual(b.numpy().shape, (4, 1, 7, 32))
+
+
+class TestWaveFlow(unittest.TestCase):
+    def test_io(self):
+        x = paddle.randn([4, 32 * 8])
+        condition = paddle.randn([4, 7, 32 * 8])
+        net = waveflow.WaveFlow(2, 8, 8, 16, 7, (3, 3))
+        z, logs_det_jacobian = net(x, condition)
+
+        self.assertTupleEqual(z.numpy().shape, (4, 32 * 8))
+        self.assertTupleEqual(logs_det_jacobian.numpy().shape, (1,))
+
+    def test_inverse(self):
+        z = paddle.randn([4, 32 * 8])
+        condition = paddle.randn([4, 7, 32 * 8])
+
+        net = waveflow.WaveFlow(2, 8, 8, 16, 7, (3, 3))
+        net.eval()
+
+        with paddle.no_grad():
+            x = net.inverse(z, condition)
+        self.assertTupleEqual(x.numpy().shape, (4, 32 * 8))
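Note: the two test modules added above use only the standard unittest framework, so they can be run with the stock discovery runner. The sketch below is one minimal way to do so; it is not part of this patch, and it assumes it is executed from the repository root so that the parakeet package and the tests/ directory are importable.

    # run_tests.py -- hypothetical helper, not included in the diff above.
    # Discovers tests/test_*.py and runs them with the stock unittest runner.
    import unittest

    if __name__ == "__main__":
        suite = unittest.defaultTestLoader.discover("tests", pattern="test_*.py")
        unittest.TextTestRunner(verbosity=2).run(suite)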