======================
Advanced Usage
======================

This section covers how to extend parakeet by implementing your own models and
experiments. Guidelines on implementation are also elaborated.

Model
-------------

As a common practice with paddlepaddle, models are implemented as subclasses
of ``paddle.nn.Layer``. Models can be simple, like a single-layer RNN. For
complicated models, it is recommended to split the model into different
components.

For an encoder-decoder model, it is natural to split it into the encoder and
the decoder. For a model composed of several similar layers, it is natural to
extract the repeated sublayer as a separate layer.

There are two common ways to define a model which consists of several modules.

#. Define a module given the specifications. Here is an example with a
   multilayer perceptron.

   .. code-block:: python

       import paddle
       from paddle import nn

       class MLP(nn.Layer):
           def __init__(self, input_size, hidden_size, output_size):
               super().__init__()
               self.linear1 = nn.Linear(input_size, hidden_size)
               self.linear2 = nn.Linear(hidden_size, output_size)

           def forward(self, x):
               return self.linear2(paddle.tanh(self.linear1(x)))

       module = MLP(16, 32, 4)  # initialize a module

   When the module is intended to be a generic and reusable layer that can be
   integrated into a larger model, we prefer to define it in this way.

   For considerations of readability and usability, we strongly recommend
   **NOT** packing specifications into a single object. Here's an example of
   the discouraged style below.

   .. code-block:: python

       class MLP(nn.Layer):
           def __init__(self, hparams):
               super().__init__()
               self.linear1 = nn.Linear(hparams.input_size, hparams.hidden_size)
               self.linear2 = nn.Linear(hparams.hidden_size, hparams.output_size)

           def forward(self, x):
               return self.linear2(paddle.tanh(self.linear1(x)))

   For a module defined in this way, it's harder for the user to initialize an
   instance. Users have to read the code to check which attributes are used.

   Also, code in this style tends to be abused by passing a huge config object
   to initialize every module used in an experiment, though each module may
   not need the whole configuration.

   We prefer to be explicit.

#. Define a module as a combination of its components. Here is an example
   for a sequence-to-sequence model.

   .. code-block:: python

       class Seq2Seq(nn.Layer):
           def __init__(self, encoder, decoder):
               super().__init__()
               self.encoder = encoder
               self.decoder = decoder

           def forward(self, x):
               encoder_output = self.encoder(x)
               output = self.decoder(encoder_output)
               return output

       encoder = Encoder(...)
       decoder = Decoder(...)
       model = Seq2Seq(encoder, decoder)  # compose two components

   When a model is complicated and made up of several components, each of
   which has separate functionality and can be replaced by other components
   with the same functionality, we prefer to define it in this way.

Data
-------------

Another critical component for a deep learning project is data. As a common
practice, we use the dataset and dataloader abstractions.

Dataset
^^^^^^^^^^

A dataset is a representation of the set of examples used by a project. In
most cases, a dataset is a collection of examples. It is an object with the
following methods:

#. ``__len__``, to get the size of the dataset;
#. ``__getitem__``, to get an example by key or index.

An example is a record consisting of several fields. In practice, we usually
represent it as a namedtuple for convenience, but dicts and user-defined
objects are also supported.

We define our own dataset by subclassing ``paddle.io.Dataset``.
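
For illustration, here is a minimal sketch of a custom dataset that loads
precomputed spectrograms. The class, field names, and file format below are
hypothetical, not part of parakeet's API.

.. code-block:: python

    from collections import namedtuple

    import numpy as np
    import paddle

    # a hypothetical example record; the fields are for illustration only
    SpecExample = namedtuple("SpecExample", ["spectrogram", "num_frames"])

    class SpectrogramDataset(paddle.io.Dataset):
        """Loads precomputed spectrograms saved as .npy files."""
        def __init__(self, file_paths):
            super().__init__()
            self.file_paths = file_paths

        def __getitem__(self, index):
            # each file holds an array of shape (num_frames, num_bins)
            spec = np.load(self.file_paths[index])
            return SpecExample(spec, spec.shape[0])

        def __len__(self):
            return len(self.file_paths)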

DataLoader
^^^^^^^^^^^

In deep learning practice, models are trained with minibatches. DataLoader
meets the need for iterating over the dataset in batches. It does so by
taking a sampler and a batch function in addition to a dataset:

#. a sampler, which samples indices or keys used to get examples from the
   dataset;
#. a batch function, which transforms a list of examples into a batch.

A commonly used sampler is ``RandomSampler``, which shuffles all valid indices
and then iterates over them sequentially. ``DistributedBatchSampler`` is a
sampler used for distributed data parallel training, where the sampler handles
data sharding in a dynamic way.
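
As a rough sketch of how these pieces fit together, reusing the hypothetical
``SpectrogramDataset`` above (``batch_examples`` is the batch function sketched
in the next example):

.. code-block:: python

    from paddle.io import BatchSampler, DataLoader, RandomSampler

    dataset = SpectrogramDataset(file_paths)
    sampler = RandomSampler(dataset)  # shuffles all valid indices
    batch_sampler = BatchSampler(sampler=sampler, batch_size=32, drop_last=True)
    loader = DataLoader(dataset,
                        batch_sampler=batch_sampler,
                        collate_fn=batch_examples)

    for specs, lengths in loader:
        ...  # one minibatch per iteration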

The batch function is used to transform the selected examples into a batch.
For a simple case where an example is composed of several fields, each of
which is represented by a fixed-size array, the batch function can simply
stack each field. For cases where variable-size arrays are included in the
example, batching may involve padding before stacking. In theory, the batch
function can do more, like random slicing, etc.

For a custom dataset used with a custom model, it is required to define a
batch function for it.
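
Here is a minimal sketch of such a batch function for the hypothetical
spectrogram examples above: it pads each variable-length spectrogram to the
longest one in the batch, stacks them, and also returns the original lengths.

.. code-block:: python

    import numpy as np

    def batch_examples(examples):
        """Pad variable-length spectrograms to a common length and stack."""
        lengths = np.array([ex.num_frames for ex in examples], dtype=np.int64)
        max_len = lengths.max()
        specs = np.stack([
            np.pad(ex.spectrogram, ((0, max_len - ex.num_frames), (0, 0)))
            for ex in examples
        ])  # shape: (batch_size, max_len, num_bins)
        return specs, lengths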

Config
-------------

It's common to change the running configuration to compare results. To keep
track of running configurations, we use ``yaml`` configuration files.

Also, we want to interact with command line options. Some options that usually
change according to running environments are provided by command line
arguments.

In addition, we want to override an option in the config file without editing
it.

Taking these requirements into consideration, we use
`yacs <https://github.com/rbgirshick/yacs>`_ as a config management tool. Other
tools like `omegaconf <https://github.com/omry/omegaconf>`_ are also powerful
and have similar functions.

In each example provided, there is a ``config.py``, where the default config is
defined. To get the default config, import ``config.py`` and call
``get_cfg_defaults()``. The config can then be updated with a yaml config file
or command line arguments if needed.
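
As a rough sketch (the option names below are hypothetical, not the actual
defaults of any example), such a ``config.py`` and its typical usage look like
this:

.. code-block:: python

    # config.py -- a minimal sketch based on yacs
    from yacs.config import CfgNode

    _C = CfgNode()
    _C.data = CfgNode()
    _C.data.batch_size = 32
    _C.model = CfgNode()
    _C.model.hidden_size = 256
    _C.training = CfgNode()
    _C.training.lr = 1e-3

    def get_cfg_defaults():
        # return a clone so that callers can safely mutate their copy
        return _C.clone()

.. code-block:: python

    from config import get_cfg_defaults

    cfg = get_cfg_defaults()
    cfg.merge_from_file("experiment.yaml")      # override from a yaml file
    cfg.merge_from_list(["training.lr", 5e-4])  # override from CLI arguments
    cfg.freeze()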

For details about how to use yacs in experiments, see
`yacs <https://github.com/rbgirshick/yacs>`_.

Experiment
--------------