update to 0.2.0

big change
2019-12-03 18:47:25 +08:00 · cb07fb64df
parent 2aec6bd730
commit cb07fb64df
70 changed files with 7599 additions and 6826 deletions
--- a/.github/CODE_OF_CONDUCT.md
+++ b/.github/CODE_OF_CONDUCT.md
@ -1,13 +0,0 @@
-# Contributor Code of Conduct
-
-As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
-
-We are committed to making participation in this project a harassment-free experience for everyone, regardless of the level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, age, or religion.
-
-Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.
-
-Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.
-
-Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.
-
-This Code of Conduct is adapted from the [Contributor Covenant](http://contributor-covenant.org), version 1.0.0, available at [http://contributor-covenant.org/version/1/0/0/](http://contributor-covenant.org/version/1/0/0/)
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@ -1,13 +0,0 @@
-<!-- PULL REQUEST TEMPLATE -->
-<!-- (Update "[ ]" to "[x]" to check a box) -->
-
-**What kind of change does this PR introduce?** (check at least one)
-
- [ ] Bugfix
- [ ] Feature
- [ ] Code style update
- [ ] Refactor
- [ ] Build-related changes
- [ ] Other, please describe:
-
-**Other information:**
--- a/201
+++ b/201
@ -1,201 +0,0 @@
-                                 Apache License
-                           Version 2.0, January 2004
-                        http://www.apache.org/licenses/
-
-   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
-
-   1. Definitions.
-
-      "License" shall mean the terms and conditions for use, reproduction,
-      and distribution as defined by Sections 1 through 9 of this document.
-
-      "Licensor" shall mean the copyright owner or entity authorized by
-      the copyright owner that is granting the License.
-
-      "Legal Entity" shall mean the union of the acting entity and all
-      other entities that control, are controlled by, or are under common
-      control with that entity. For the purposes of this definition,
-      "control" means (i) the power, direct or indirect, to cause the
-      direction or management of such entity, whether by contract or
-      otherwise, or (ii) ownership of fifty percent (50%) or more of the
-      outstanding shares, or (iii) beneficial ownership of such entity.
-
-      "You" (or "Your") shall mean an individual or Legal Entity
-      exercising permissions granted by this License.
-
-      "Source" form shall mean the preferred form for making modifications,
-      including but not limited to software source code, documentation
-      source, and configuration files.
-
-      "Object" form shall mean any form resulting from mechanical
-      transformation or translation of a Source form, including but
-      not limited to compiled object code, generated documentation,
-      and conversions to other media types.
-
-      "Work" shall mean the work of authorship, whether in Source or
-      Object form, made available under the License, as indicated by a
-      copyright notice that is included in or attached to the work
-      (an example is provided in the Appendix below).
-
-      "Derivative Works" shall mean any work, whether in Source or Object
-      form, that is based on (or derived from) the Work and for which the
-      editorial revisions, annotations, elaborations, or other modifications
-      represent, as a whole, an original work of authorship. For the purposes
-      of this License, Derivative Works shall not include works that remain
-      separable from, or merely link (or bind by name) to the interfaces of,
-      the Work and Derivative Works thereof.
-
-      "Contribution" shall mean any work of authorship, including
-      the original version of the Work and any modifications or additions
-      to that Work or Derivative Works thereof, that is intentionally
-      submitted to Licensor for inclusion in the Work by the copyright owner
-      or by an individual or Legal Entity authorized to submit on behalf of
-      the copyright owner. For the purposes of this definition, "submitted"
-      means any form of electronic, verbal, or written communication sent
-      to the Licensor or its representatives, including but not limited to
-      communication on electronic mailing lists, source code control systems,
-      and issue tracking systems that are managed by, or on behalf of, the
-      Licensor for the purpose of discussing and improving the Work, but
-      excluding communication that is conspicuously marked or otherwise
-      designated in writing by the copyright owner as "Not a Contribution."
-
-      "Contributor" shall mean Licensor and any individual or Legal Entity
-      on behalf of whom a Contribution has been received by Licensor and
-      subsequently incorporated within the Work.
-
-   2. Grant of Copyright License. Subject to the terms and conditions of
-      this License, each Contributor hereby grants to You a perpetual,
-      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
-      copyright license to reproduce, prepare Derivative Works of,
-      publicly display, publicly perform, sublicense, and distribute the
-      Work and such Derivative Works in Source or Object form.
-
-   3. Grant of Patent License. Subject to the terms and conditions of
-      this License, each Contributor hereby grants to You a perpetual,
-      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
-      (except as stated in this section) patent license to make, have made,
-      use, offer to sell, sell, import, and otherwise transfer the Work,
-      where such license applies only to those patent claims licensable
-      by such Contributor that are necessarily infringed by their
-      Contribution(s) alone or by combination of their Contribution(s)
-      with the Work to which such Contribution(s) was submitted. If You
-      institute patent litigation against any entity (including a
-      cross-claim or counterclaim in a lawsuit) alleging that the Work
-      or a Contribution incorporated within the Work constitutes direct
-      or contributory patent infringement, then any patent licenses
-      granted to You under this License for that Work shall terminate
-      as of the date such litigation is filed.
-
-   4. Redistribution. You may reproduce and distribute copies of the
-      Work or Derivative Works thereof in any medium, with or without
-      modifications, and in Source or Object form, provided that You
-      meet the following conditions:
-
-      (a) You must give any other recipients of the Work or
-          Derivative Works a copy of this License; and
-
-      (b) You must cause any modified files to carry prominent notices
-          stating that You changed the files; and
-
-      (c) You must retain, in the Source form of any Derivative Works
-          that You distribute, all copyright, patent, trademark, and
-          attribution notices from the Source form of the Work,
-          excluding those notices that do not pertain to any part of
-          the Derivative Works; and
-
-      (d) If the Work includes a "NOTICE" text file as part of its
-          distribution, then any Derivative Works that You distribute must
-          include a readable copy of the attribution notices contained
-          within such NOTICE file, excluding those notices that do not
-          pertain to any part of the Derivative Works, in at least one
-          of the following places: within a NOTICE text file distributed
-          as part of the Derivative Works; within the Source form or
-          documentation, if provided along with the Derivative Works; or,
-          within a display generated by the Derivative Works, if and
-          wherever such third-party notices normally appear. The contents
-          of the NOTICE file are for informational purposes only and
-          do not modify the License. You may add Your own attribution
-          notices within Derivative Works that You distribute, alongside
-          or as an addendum to the NOTICE text from the Work, provided
-          that such additional attribution notices cannot be construed
-          as modifying the License.
-
-      You may add Your own copyright statement to Your modifications and
-      may provide additional or different license terms and conditions
-      for use, reproduction, or distribution of Your modifications, or
-      for any such Derivative Works as a whole, provided Your use,
-      reproduction, and distribution of the Work otherwise complies with
-      the conditions stated in this License.
-
-   5. Submission of Contributions. Unless You explicitly state otherwise,
-      any Contribution intentionally submitted for inclusion in the Work
-      by You to the Licensor shall be under the terms and conditions of
-      this License, without any additional terms or conditions.
-      Notwithstanding the above, nothing herein shall supersede or modify
-      the terms of any separate license agreement you may have executed
-      with Licensor regarding such Contributions.
-
-   6. Trademarks. This License does not grant permission to use the trade
-      names, trademarks, service marks, or product names of the Licensor,
-      except as required for reasonable and customary use in describing the
-      origin of the Work and reproducing the content of the NOTICE file.
-
-   7. Disclaimer of Warranty. Unless required by applicable law or
-      agreed to in writing, Licensor provides the Work (and each
-      Contributor provides its Contributions) on an "AS IS" BASIS,
-      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
-      implied, including, without limitation, any warranties or conditions
-      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
-      PARTICULAR PURPOSE. You are solely responsible for determining the
-      appropriateness of using or redistributing the Work and assume any
-      risks associated with Your exercise of permissions under this License.
-
-   8. Limitation of Liability. In no event and under no legal theory,
-      whether in tort (including negligence), contract, or otherwise,
-      unless required by applicable law (such as deliberate and grossly
-      negligent acts) or agreed to in writing, shall any Contributor be
-      liable to You for damages, including any direct, indirect, special,
-      incidental, or consequential damages of any character arising as a
-      result of this License or out of the use or inability to use the
-      Work (including but not limited to damages for loss of goodwill,
-      work stoppage, computer failure or malfunction, or any and all
-      other commercial damages or losses), even if such Contributor
-      has been advised of the possibility of such damages.
-
-   9. Accepting Warranty or Additional Liability. While redistributing
-      the Work or Derivative Works thereof, You may choose to offer,
-      and charge a fee for, acceptance of support, warranty, indemnity,
-      or other liability obligations and/or rights consistent with this
-      License. However, in accepting such obligations, You may act only
-      on Your own behalf and on Your sole responsibility, not on behalf
-      of any other Contributor, and only if You agree to indemnify,
-      defend, and hold each Contributor harmless for any liability
-      incurred by, or claims asserted against, such Contributor by reason
-      of your accepting any such warranty or additional liability.
-
-   END OF TERMS AND CONDITIONS
-
-   APPENDIX: How to apply the Apache License to your work.
-
-      To apply the Apache License to your work, attach the following
-      boilerplate notice, with the fields enclosed by brackets "[]"
-      replaced with your own identifying information. (Don't include
-      the brackets!)  The text should be enclosed in the appropriate
-      comment syntax for the file format. We also recommend that a
-      file or class name and description of purpose be included on the
-      same "printed page" as the copyright notice for easier
-      identification within third-party archives.
-
-   Copyright [yyyy] [name of copyright owner]
-
-   Licensed under the Apache License, Version 2.0 (the "License");
-   you may not use this file except in compliance with the License.
-   You may obtain a copy of the License at
-
-       http://www.apache.org/licenses/LICENSE-2.0
-
-   Unless required by applicable law or agreed to in writing, software
-   distributed under the License is distributed on an "AS IS" BASIS,
-   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-   See the License for the specific language governing permissions and
-   limitations under the License.
--- a/checkpoints/.gitkeep
+++ b/checkpoints/.gitkeep
--- a/conf/config.yaml
+++ b/conf/config.yaml
@ -0,0 +1,16 @@
+# ??? is a mandatory value.
+# you should be able to set it without open_dict
+# but if you try to read it before it's set an error will get thrown.
+
+# populated at runtime
+cwd: ???
+
+
+defaults:
+  - hydra/output: custom
+  - preprocess
+  - train
+  - embedding
+  - model: cnn
+
+
--- a/conf/embedding.yaml
+++ b/conf/embedding.yaml
@ -0,0 +1,10 @@
+# populated at runtime
+vocab_size: ???
+word_dim: 50
+pos_size: ??? # 2 * pos_limit + 2
+pos_dim: 10   # ignored when dim_strategy is sum; forced to equal word_dim
+
+dim_strategy: sum # [cat, sum]
+
+# number of relation types
+num_relations: 11
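
Review note: the comments in `conf/embedding.yaml` imply some arithmetic that is easy to get wrong when changing `pos_limit` or `dim_strategy`. The following is a minimal sketch, assuming that `cat` concatenates the word vector with the head-position and tail-position vectors and that `sum` adds three vectors of equal width. `pos_vocab_size` and `feature_dim` are hypothetical helper names, not functions from this commit.

```python
def pos_vocab_size(pos_limit: int) -> int:
    """Number of position tokens: 2 * pos_limit + 2, as stated in the config comment."""
    return 2 * pos_limit + 2


def feature_dim(word_dim: int, pos_dim: int, dim_strategy: str) -> int:
    """Width of the per-token feature vector (hypothetical helper)."""
    if dim_strategy == "cat":
        # Assumed layout: word vector + head-position vector + tail-position vector.
        return word_dim + 2 * pos_dim
    if dim_strategy == "sum":
        # All three embeddings share word_dim and are added, so pos_dim is ignored.
        return word_dim
    raise ValueError(f"unknown dim_strategy: {dim_strategy!r}")


print(pos_vocab_size(30))          # 62
print(feature_dim(50, 10, "cat"))  # 70
print(feature_dim(50, 10, "sum"))  # 50
```

With the committed values (`word_dim: 50`, `pos_dim: 10`, `dim_strategy: sum`), the downstream model would see 50 features per token under these assumptions.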
--- a/conf/hydra/output/custom.yaml
+++ b/conf/hydra/output/custom.yaml
@ -0,0 +1,11 @@
+hydra:
+
+  run:
+    # Output directory for normal runs
+    dir: ./logs/${now:%Y-%m-%d_%H-%M-%S}
+
+  sweep:
+    # Output directory for sweep runs
+    dir: ./logs/${now:%Y-%m-%d_%H-%M-%S}
+    # Output sub directory for sweep runs.
+    subdir: ${hydra.job.num}_${hydra.job.id}
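
Review note: the `${now:...}` pattern in `conf/hydra/output/custom.yaml` takes a `strftime`-style format string. As a rough illustration of the directory name it yields, here is a sketch that uses only the standard library; `run_dir` is a hypothetical helper and does not call Hydra.

```python
from datetime import datetime


def run_dir(now: datetime, pattern: str = "%Y-%m-%d_%H-%M-%S") -> str:
    """Mimic the configured run directory for a given timestamp (illustrative only)."""
    return "./logs/" + now.strftime(pattern)


print(run_dir(datetime(2019, 12, 3, 18, 47, 25)))  # ./logs/2019-12-03_18-47-25
```

Using hyphens instead of colons in the time part keeps the directory name valid on file systems that reject `:`.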
--- a/conf/model/capsule.yaml
+++ b/conf/model/capsule.yaml
@ -0,0 +1,6 @@
+num_primary_units: 8
+num_output_units: 10  # relation_type
+primary_channels: 1
+primary_unit_size: 768
+output_unit_size: 128
+num_iterations: 3
--- a/conf/model/cnn.yaml
+++ b/conf/model/cnn.yaml
@ -0,0 +1,12 @@
+model_name: cnn
+
+#in_channels: 100 # taken from the embedding output; no need to set it
+out_channels: 100
+kernel_sizes: [3, 5, 7] # must be odd so the CNN output keeps the sentence length
+activation: 'gelu'   # [relu, lrelu, prelu, selu, celu, gelu, sigmoid, tanh]
+pooling_strategy: 'max'  # [max, avg, cls]
+dropout: 0.3
+
+# pcnn
+use_pcnn: False
+intermediate: 80
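
Review note: the "must be odd" constraint on `kernel_sizes` follows from the standard output-length formula for a 1-D convolution. The sketch below assumes stride 1, dilation 1 and a padding of `k // 2`, which is what the removed `deepke/model/CNN.py` used; whether the new model code keeps that padding is not visible in this part of the diff. `conv1d_out_len` is a hypothetical helper.

```python
def conv1d_out_len(seq_len: int, kernel_size: int, padding: int, stride: int = 1) -> int:
    """Output length of a 1-D convolution with dilation 1."""
    return (seq_len + 2 * padding - kernel_size) // stride + 1


for k in (3, 5, 7, 4):
    # Odd kernels return the input length; the even kernel (4) does not.
    print(k, conv1d_out_len(20, k, padding=k // 2))
```

For a sentence of length 20, the odd kernels 3, 5 and 7 all give length 20, while the even kernel 4 gives 21, so the per-kernel outputs could no longer be concatenated position by position.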
--- a/conf/model/gcn.yaml
+++ b/conf/model/gcn.yaml
@ -0,0 +1 @@
+num_layers: 3
--- a/conf/model/lm.yaml
+++ b/conf/model/lm.yaml
@ -0,0 +1,12 @@
+model_name: lm
+
+# lm_name = 'bert-base-chinese'  # download usage
+# cache file usage
+#lm_file: 'bert_pretrained'
+# location of the pretrained model files when a pretrained language model is used
+lm_file: '/Users/leo/transformers/bert-base-chinese'
+
+
+# number of transformer layers; the original base BERT has 12
+# with a small dataset, a lower value converges faster and performs better
+num_hidden_layers: 2
--- a/conf/model/rnn.yaml
+++ b/conf/model/rnn.yaml
@ -0,0 +1,10 @@
+model_name: rnn
+
+type_rnn: 'RNN'  # [RNN, GRU, LSTM]
+
+#input_size: 100 # taken from the embedding output; no need to set it
+hidden_size: 150  # must be even
+num_layers: 2
+dropout: 0.3
+bidirectional: True
+last_layer_hn: True
--- a/conf/model/transformer.yaml
+++ b/conf/model/transformer.yaml
@ -0,0 +1,9 @@
+hidden_size: 128
+intermediate_size: 256
+num_hidden_layers: 3
+num_heads: 4
+dropout: 0.1
+layer_norm_eps: 1e-12
+hidden_act: gelu_new
+output_attentions: True
+output_hidden_states: True
--- a/conf/preprocess.yaml
+++ b/conf/preprocess.yaml
@ -0,0 +1,26 @@
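+# whether to preprocess the data
+# no need to re-run preprocessing when the preprocessing parameters are unchanged
+preprocess: True
+
+# location of the raw data
+data_path: 'data/origin'
+
+# output location for the preprocessed files
+out_path: 'data/out'
+
+# whether word segmentation is needed
+chinese_split: True
+
+# whether to replace entity words with their entity type
+replace_entity_with_type: True
+
+# whether to replace entity words with the triple's head/tail markers
+replace_entity_with_scope: True
+
+# minimum word frequency when building the vocab
+min_freq: 3
+
+# sentence length limit: bounds each word's position relative to an entity
+# e.g. [-30, 30] is shifted by +31 when embedding, giving [1, 61]
+# for 62 pos tokens in total, with 0 reserved for pad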
+pos_limit: 30
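
Review note: the `pos_limit` comment describes a shift from relative positions to embedding indices. A minimal sketch of that mapping follows, assuming out-of-range positions are clipped to the limit (the config only says "limit", so clipping is an assumption). `relative_position_ids` is a hypothetical helper, not the preprocessing code from this commit.

```python
from typing import List


def relative_position_ids(seq_len: int, entity_idx: int, pos_limit: int = 30) -> List[int]:
    """Map each token's offset from the entity to an index in [1, 2 * pos_limit + 1]."""
    ids = []
    for i in range(seq_len):
        rel = i - entity_idx
        # Assumed behaviour: clip to [-pos_limit, pos_limit].
        rel = max(-pos_limit, min(pos_limit, rel))
        # Shift by pos_limit + 1 so that index 0 stays free for padding.
        ids.append(rel + pos_limit + 1)
    return ids


print(relative_position_ids(5, entity_idx=2))  # [29, 30, 31, 32, 33]
```

With `pos_limit: 30`, the indices span 1 to 61 and 0 is the pad index, which matches the 62 position tokens mentioned in the comment.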
--- a/conf/train.yaml
+++ b/conf/train.yaml
@ -0,0 +1,21 @@
+seed: 1
+
+use_gpu: True
+gpu_id: 0
+
+epoch: 50
+batch_size: 32
+learning_rate: 3e-4
+lr_factor: 0.7 # learning rate decay factor
+lr_patience: 3 # epochs to wait before decaying the learning rate
+weight_decay: 1e-3 # L2 regularization
+
+early_stopping_patience: 6
+
+train_log: True
+log_interval: 10
+show_plot: True
+only_comparison_plot: False
+plot_utils: matplot  # [matplot, tensorboard]
+
+predict_plot: True
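
Review note: `lr_factor`, `lr_patience` and `early_stopping_patience` interact, and the effect is easier to see in a small simulation. The sketch below is pure Python and assumes plateau-style semantics: the learning rate is multiplied by `lr_factor` once the number of non-improving epochs exceeds `lr_patience`, and training stops after `early_stopping_patience` non-improving epochs. The training loop in this commit is not shown in this part of the diff, so these semantics are an assumption, and `simulate_schedule` is a hypothetical helper.

```python
def simulate_schedule(val_losses, lr=3e-4, lr_factor=0.7, lr_patience=3,
                      early_stopping_patience=6):
    """Return [(epoch, lr)] for each epoch that runs before early stopping."""
    best = float("inf")
    bad_lr = 0    # non-improving epochs since the last improvement or decay
    bad_stop = 0  # non-improving epochs since the best validation loss
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, bad_lr, bad_stop = loss, 0, 0
        else:
            bad_lr += 1
            bad_stop += 1
            if bad_lr > lr_patience:
                lr *= lr_factor  # decay after more than lr_patience bad epochs
                bad_lr = 0
        history.append((epoch, lr))
        if bad_stop >= early_stopping_patience:
            break  # assumed early-stopping rule
    return history


losses = [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95]
for epoch, lr in simulate_schedule(losses):
    print(epoch, lr)
```

Under these assumptions, with the committed values a stalled run gets one decay before early stopping triggers, so raising `early_stopping_patience` or lowering `lr_patience` would be needed to allow a second decay.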
--- a/data/.gitkeep
+++ b/data/.gitkeep
--- a/data/origin/predict.csv
+++ b/data/origin/predict.csv
@ -1,6 +0,0 @@
-sentence,head,head_type,head_offset,tail,tail_type,tail_offset
-“逆袭”系列微电影《宝贝》由优酷土豆股份有限公司于2012年出品,宝贝,影视作品,10,优酷土豆股份有限公司,企业,14
-位于伦敦东南方的格林威治，为地球经线的起始点,格林威治,景点,8,伦敦,城市,2
-崔恒源 男，1950年3月生，祖籍河南省孟县，现任孟县无缝钢管厂党委书记、厂长,崔恒源,人物,0,河南省孟县,地点,17
-帅长斌，男，1964年6月生，江西九江人,帅长斌,人物,0,江西九江,地点,15
-图为《西游记》拍摄幕后照片，猪八戒的大耳朵都掉了一只，可见当时拍摄条件实在有限，但是导演杨洁精益求精，使得这部电视剧成为经典,西游记,影视作品,3,杨洁,人物,44
--- a/data/origin/relation.csv
+++ b/data/origin/relation.csv
@ -0,0 +1,12 @@
+head_type,tail_type,relation,index
+None,None,None,0
+影视作品,人物,导演,1
+人物,国家,国籍,2
+人物,地点,祖籍,3
+电视综艺,人物,主持人,4
+人物,地点,出生地,5
+景点,城市,所在城市,6
+歌曲,音乐专辑,所属专辑,7
+网络小说,网站,连载网站,8
+影视作品,企业,出品公司,9
+人物,学校,毕业院校,10
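
Review note: the new `relation.csv` replaces the flat `relation.txt` and adds an explicit `None` class at index 0, which gives 11 rows and matches `num_relations: 11` in `conf/embedding.yaml`. A minimal sketch of reading it into a label map is shown below, using only the standard library and the first rows of the file; `load_relations` is a hypothetical helper, not the loader from this commit.

```python
import csv
import io

# First rows of data/origin/relation.csv, copied from the diff.
RELATION_CSV = """head_type,tail_type,relation,index
None,None,None,0
影视作品,人物,导演,1
人物,国家,国籍,2
"""


def load_relations(fp):
    """Build a relation -> index mapping from a relation.csv-style file object."""
    rel2idx = {}
    for row in csv.DictReader(fp):
        rel2idx[row["relation"]] = int(row["index"])
    return rel2idx


rels = load_relations(io.StringIO(RELATION_CSV))
print(rels)
```

Keying only on `relation` works for the rows in this file because each relation name appears once; a key of `(head_type, tail_type, relation)` would be safer if that ever changes.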
--- a/data/origin/relation.txt
+++ b/data/origin/relation.txt
@ -1,10 +0,0 @@
-国籍
-祖籍
-导演
-出生地
-主持人
-所在城市
-所属专辑
-连载网站
-出品公司
-毕业院校
--- a/data/origin/test.csv
+++ b/data/origin/test.csv
--- a/data/origin/train.csv
+++ b/data/origin/train.csv
--- a/data/origin/valid.csv
+++ b/data/origin/valid.csv
--- a/dataset.py
+++ b/dataset.py
@ -0,0 +1,79 @@
+import torch
+from torch.utils.data import Dataset
+from utils import load_pkl
+
+
+def collate_fn(cfg):
+    def collate_fn_intra(batch):
+        batch.sort(key=lambda data: data['seq_len'], reverse=True)
+
+        max_len = batch[0]['seq_len']
+
+        def _padding(x, max_len):
+            return x + [0] * (max_len - len(x))
+
+        x, y = dict(), []
+        word, word_len = [], []
+        head_pos, tail_pos = [], []
+        pcnn_mask = []
+        for data in batch:
+            word.append(_padding(data['token2idx'], max_len))
+            word_len.append(data['seq_len'])
+            y.append(int(data['rel2idx']))
+
+            if cfg.model_name != 'lm':
+                head_pos.append(_padding(data['head_pos'], max_len))
+                tail_pos.append(_padding(data['tail_pos'], max_len))
+                if cfg.use_pcnn:
+                    pcnn_mask.append(_padding(data['entities_pos'], max_len))
+
+        x['word'] = torch.tensor(word)
+        x['lens'] = torch.tensor(word_len)
+        y = torch.tensor(y)
+
+        if cfg.model_name != 'lm':
+            x['head_pos'] = torch.tensor(head_pos)
+            x['tail_pos'] = torch.tensor(tail_pos)
+            if cfg.model_name == 'cnn' and cfg.use_pcnn:
+                x['pcnn_mask'] = torch.tensor(pcnn_mask)
+
+        return x, y
+
+    return collate_fn_intra
+
+
+class CustomDataset(Dataset):
+    """Stores the data in a list by default."""
+    def __init__(self, fp):
+        self.file = load_pkl(fp)
+
+    def __getitem__(self, item):
+        sample = self.file[item]
+        return sample
+
+    def __len__(self):
+        return len(self.file)
+
+
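+if __name__ == '__main__':
+    from types import SimpleNamespace
+    from torch.utils.data import DataLoader
+    train_data_path = 'data/out/train.pkl'
+    vocab_path = 'data/out/vocab.pkl'
+    unk_str = 'UNK'
+    vocab = load_pkl(vocab_path)
+    train_ds = CustomDataset(train_data_path)
+    # collate_fn is a factory: call it with a config to get the real collate function
+    cfg = SimpleNamespace(model_name='cnn', use_pcnn=False)
+    train_dl = DataLoader(train_ds, batch_size=4, shuffle=True, collate_fn=collate_fn(cfg), drop_last=False)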
+
+    for batch_idx, (x, y) in enumerate(train_dl):
+        word = x['word']
+        for idx in word:
+            idx2token = ''.join([vocab.idx2word.get(i, unk_str) for i in idx.numpy()])
+            print(idx2token)
+        print(y)
+        break
+        # x, y = x.to(device), y.to(device)
+        # optimizer.zero_grad()
+        # y_pred = models(y)
+        # loss = criterion(y_pred, y)
+        # loss.backward()
+        # optimizer.step()
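
Review note: the core of the new `collate_fn_intra` is "sort by length, then right-pad with 0". The sketch below reproduces just that step in pure Python, without `torch`, so it can be checked in isolation. `pad_batch` is a hypothetical helper; the field names `token2idx` and `seq_len` are taken from the diff.

```python
from typing import Dict, List


def pad_batch(batch: List[Dict], key: str = "token2idx") -> List[List[int]]:
    """Sort samples by seq_len (longest first) and right-pad `key` with 0."""
    ordered = sorted(batch, key=lambda d: d["seq_len"], reverse=True)
    max_len = ordered[0]["seq_len"]
    return [d[key] + [0] * (max_len - len(d[key])) for d in ordered]


batch = [
    {"token2idx": [5, 6], "seq_len": 2},
    {"token2idx": [7, 8, 9, 10], "seq_len": 4},
    {"token2idx": [11], "seq_len": 1},
]
print(pad_batch(batch))  # [[7, 8, 9, 10], [5, 6, 0, 0], [11, 0, 0, 0]]
```

Note that in the diff `collate_fn(cfg)` returns the inner function, so a `DataLoader` would presumably need `collate_fn=collate_fn(cfg)` rather than the bare factory.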
--- a/deepke/init.py
+++ b/deepke/init.py
--- a/deepke/config.py
+++ b/deepke/config.py
@ -1,97 +0,0 @@
-class TrainingConfig(object):
-    seed = 1
-    use_gpu = True
-    gpu_id = 0
-    epoch = 30
-    learning_rate = 1e-3
-    decay_rate = 0.5
-    decay_patience = 3
-    batch_size = 64
-    train_log = True
-    log_interval = 10
-    show_plot = True
-    f1_norm = ['macro', 'micro']
-
-
-class ModelConfig(object):
-    word_dim = 50
-    pos_size = 102  # 2 * pos_limit + 2
-    pos_dim = 5
-    feature_dim = 60  # 50 + 5 * 2
-    hidden_dim = 100
-    dropout = 0.3
-
-
-class CNNConfig(object):
-    use_pcnn = True
-    out_channels = 100
-    kernel_size = [3, 5, 7]
-
-
-class RNNConfig(object):
-    lstm_layers = 3
-    last_hn = False
-
-
-class GCNConfig(object):
-    num_layers = 3
-
-
-class TransformerConfig(object):
-    transformer_layers = 3
-
-
-class CapsuleConfig(object):
-    num_primary_units = 8
-    num_output_units = 10  # relation_type
-    primary_channels = 1
-    primary_unit_size = 768
-    output_unit_size = 128
-    num_iterations = 3
-
-
-class LMConfig(object):
-    # lm_name = 'bert-base-chinese'  # download usage
-    # cache file usage
-    lm_file = 'bert_pretrained'
-    # number of transformer layers; the original base BERT has 12
-    # with a small dataset, a lower value converges faster and performs better
-    num_hidden_layers = 2
-
-
-class Config(object):
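-    # location of the raw data
-    data_path = 'data/origin'
-    # output location for the preprocessed files
-    out_path = 'data/out'
-
-    # whether to replace entities in the sentence with their entity type
-    replace_entity_by_type = True
-    # whether the data is Chinese
-    is_chinese = True
-    # whether word segmentation is needed
-    word_segment = True
-
-    # number of relation types
-    relation_type = 10
-
-    # minimum word frequency when building the vocab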
-    min_freq = 2
-
-    # position limit
-    pos_limit = 50  # [-50, 50]
-
-    # (CNN, RNN, GCN, Transformer, Capsule, LM)
-    model_name = 'Capsule'
-
-    training = TrainingConfig()
-    model = ModelConfig()
-    cnn = CNNConfig()
-    rnn = RNNConfig()
-    gcn = GCNConfig()
-    transformer = TransformerConfig()
-    capsule = CapsuleConfig()
-    lm = LMConfig()
-
-
-config = Config()
--- a/deepke/dataset.py
+++ b/deepke/dataset.py
@ -1,64 +0,0 @@
-import torch
-from torch.utils.data import Dataset
-from deepke.utils import load_pkl
-from deepke.config import config
-
-
-class CustomDataset(Dataset):
-    def __init__(self, fp):
-        self.file = load_pkl(fp)
-
-    def __getitem__(self, item):
-        sample = self.file[item]
-        return sample
-
-    def __len__(self):
-        return len(self.file)
-
-
-def collate_fn(batch):
-    batch.sort(key=lambda data: data['seq_len'], reverse=True)
-
-    max_len = 0
-    for data in batch:
-        if data['seq_len'] > max_len:
-            max_len = data['seq_len']
-
-    def _padding(x, max_len):
-        return x + [0] * (max_len - len(x))
-
-    if config.model_name == 'LM':
-        x, y = [], []
-        for data in batch:
-            x.append(_padding(data['lm_idx'], max_len))
-            y.append(data['target'])
-
-        return torch.tensor(x), torch.tensor(y)
-
-    else:
-        sent, head_pos, tail_pos, mask_pos = [], [], [], []
-        y = []
-        for data in batch:
-            sent.append(_padding(data['word2idx'], max_len))
-            head_pos.append(_padding(data['head_pos'], max_len))
-            tail_pos.append(_padding(data['tail_pos'], max_len))
-            mask_pos.append(_padding(data['mask_pos'], max_len))
-            y.append(data['target'])
-        return torch.tensor(sent), torch.tensor(head_pos), torch.tensor(tail_pos), torch.tensor(
-            mask_pos), torch.tensor(y)
-
-
-if __name__ == '__main__':
-    from torch.utils.data import DataLoader
-    vocab_path = '../data/out/vocab.pkl'
-    train_data_path = '../data/out/train.pkl'
-    vocab = load_pkl(vocab_path)
-
-    train_dataset = CustomDataset(train_data_path)
-    dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)
-
-    for idx, (*x, y) in enumerate(dataloader):
-        print(x)
-        print(y)
-        break
-
--- a/deepke/model/BasicModule.py
+++ b/deepke/model/BasicModule.py
@ -1,34 +0,0 @@
-import torch
-import torch.nn as nn
-import time
-from deepke.utils import ensure_dir
-
-
-class BasicModule(nn.Module):
-    '''
-    Wraps nn.Module and provides save and load methods
-    '''
-    def __init__(self):
-        super(BasicModule, self).__init__()
-        self.model_name = str(type(self))
-
-    def load(self, path):
-        '''
-        Load the model from the given path
-        '''
-        self.load_state_dict(torch.load(path))
-
-    def save(self, epoch=0, name=None):
-        '''
-        Save the model; the default file name is "model name + time"
-        '''
-        prefix = 'checkpoints/'
-        ensure_dir(prefix)
-        if name is None:
-            name = prefix + self.model_name + '_' + f'epoch{epoch}_'
-            name = time.strftime(name + '%m%d_%H:%M:%S.pth')
-        else:
-            name = prefix + name + '_' + self.model_name + '_' + f'epoch{epoch}_'
-            name = time.strftime(name + '%m%d_%H:%M:%S.pth')
-        torch.save(self.state_dict(), name)
-        return name
--- a/deepke/model/CNN.py
+++ b/deepke/model/CNN.py
@ -1,88 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from deepke.model import BasicModule, Embedding
-
-
-class CNN(BasicModule):
-    def __init__(self, vocab_size, config):
-        super(CNN, self).__init__()
-        self.model_name = 'CNN'
-        self.vocab_size = vocab_size
-        self.word_dim = config.model.word_dim
-        self.pos_size = config.model.pos_size
-        self.pos_dim = config.model.pos_dim
-        self.hidden_dim = config.model.hidden_dim
-        self.dropout = config.model.dropout
-        self.use_pcnn = config.cnn.use_pcnn
-        self.out_channels = config.cnn.out_channels
-        self.kernel_size = config.cnn.kernel_size
-        self.out_dim = config.relation_type
-
-        if isinstance(self.kernel_size, int):
-            self.kernel_size = [self.kernel_size]
-        for k in self.kernel_size:
-            assert k % 2 == 1, "kernel size has to be odd numbers."
-
-        self.embedding = Embedding(self.vocab_size, self.word_dim, self.pos_size, self.pos_dim)
-        # PCNN embedding
-        self.mask_embed = nn.Embedding(4, 3)
-        masks = torch.tensor([[0, 0, 0], [100, 0, 0], [0, 100, 0], [0, 0, 100]])
-        self.mask_embed.weight.data.copy_(masks)
-        self.mask_embed.weight.requires_grad = False
-
-        self.input_dim = self.word_dim + self.pos_dim * 2
-        self.convs = nn.ModuleList([
-            nn.Conv1d(in_channels=self.input_dim,
-                      out_channels=self.out_channels,
-                      kernel_size=k,
-                      padding=k // 2,
-                      bias=None) for k in self.kernel_size
-        ])
-        self.conv_dim = len(self.kernel_size) * self.out_channels
-        if self.use_pcnn:
-            self.conv_dim *= 3
-        self.fc1 = nn.Linear(self.conv_dim, self.hidden_dim)
-        self.fc2 = nn.Linear(self.hidden_dim, self.out_dim)
-        self.dropout = nn.Dropout(self.dropout)
-
-    def forward(self, input):
-        *x, mask = input
-        x = self.embedding(x)
-        mask_embed = self.mask_embed(mask)
-
-        # [B,L,C] -> [B,C,L]
-        x = torch.transpose(x, 1, 2)
-
-        # CNN
-        x = [F.leaky_relu(conv(x)) for conv in self.convs]
-        x = torch.cat(x, dim=1)
-
-        # mask
-        mask = mask.unsqueeze(1)  # B x 1 x L
-        x = x.masked_fill_(mask.eq(0), float('-inf'))
-
-        if self.use_pcnn:
-            # triple max_pooling
-            x = x.unsqueeze(-1).permute(0, 2, 1, 3)  # [B, L, C, 1]
-            mask_embed = mask_embed.unsqueeze(-2)  # [B, L, 1, 3]
-            x = x + mask_embed  # [B, L, C, 3]
-            x = torch.max(x, dim=1)[0] - 100  # [B, C, 3]
-            x = x.view(x.size(0), -1)  # [B, 3*C]
-
-        else:
-            # max_pooling
-            x = F.max_pool1d(x, x.size(-1)).squeeze(-1)  # [[B,C],..]
-
-        # dropout
-        x = self.dropout(x)
-
-        # linear
-        x = F.leaky_relu(self.fc1(x))
-        x = F.leaky_relu(self.fc2(x))
-
-        return x
-
-
-if __name__ == '__main__':
-    pass
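
Review note: the removed `CNN.forward` implements piecewise (PCNN) max pooling with an offset trick: a frozen mask embedding adds 100 to the positions of one segment, the maximum is taken over the whole sentence, and 100 is subtracted again. The sketch below shows the same idea on plain Python lists for a single channel. It assumes activations span a range well below 100, which the trick needs in order to be exact; `piecewise_max` is a hypothetical helper.

```python
from typing import List

OFFSET = 100.0  # same constant as the removed mask embedding


def piecewise_max(values: List[float], segments: List[int]) -> List[float]:
    """Max of `values` inside each of the segments 1, 2, 3 via the offset trick."""
    pooled = []
    for seg in (1, 2, 3):
        # Lift only this segment's positions, so the global max falls inside it.
        shifted = [v + OFFSET if s == seg else v for v, s in zip(values, segments)]
        pooled.append(max(shifted) - OFFSET)
    return pooled


values = [0.2, 0.9, -0.4, 0.7, 0.1, 0.3]
segments = [1, 1, 2, 2, 3, 3]  # segment id per token; 0 would mean padding
print(piecewise_max(values, segments))
```

Up to floating-point rounding, the result is the per-segment maxima 0.9, 0.7 and 0.3. The tensor version in the removed file does this for all channels at once by broadcasting a `[B, L, C, 1]` tensor against a `[B, L, 1, 3]` mask embedding.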
--- a/deepke/model/Capsule.py
+++ b/deepke/model/Capsule.py
@ -1,206 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from deepke.model import BasicModule, Embedding, VarLenLSTM
-
-
-class Capsule(BasicModule):
-    def __init__(self, vocab_size, config):
-        super(Capsule, self).__init__()
-        self.model_name = 'Capsule'
-        self.vocab_size = vocab_size
-        self.word_dim = config.model.word_dim
-        self.pos_size = config.model.pos_size
-        self.pos_dim = config.model.pos_dim
-        self.hidden_dim = config.model.hidden_dim
-
-        self.num_primary_units = config.capsule.num_primary_units
-        self.num_output_units = config.capsule.num_output_units
-        self.primary_channels = config.capsule.primary_channels
-        self.primary_unit_size = config.capsule.primary_unit_size
-        self.output_unit_size = config.capsule.output_unit_size
-        self.num_iterations = config.capsule.num_iterations
-
-        self.embedding = Embedding(self.vocab_size, self.word_dim, self.pos_size, self.pos_dim)
-        self.input_dim = self.word_dim + self.pos_dim * 2
-        self.lstm = VarLenLSTM(
-            self.input_dim,
-            self.hidden_dim,
-        )
-        self.capsule = CapsuleNet(self.num_primary_units, self.num_output_units, self.primary_channels,
-                                  self.primary_unit_size, self.output_unit_size, self.num_iterations)
-
-    def forward(self, input):
-        *x, mask = input
-        x = self.embedding(x)
-        x_lens = torch.sum(mask.gt(0), dim=-1)
-        _, hn = self.lstm(x, x_lens)
-        out = self.capsule(hn)
-        return out  # B, num_output_units, output_unit_size
-
-    def predict(self, output):
-        v_mag = torch.sqrt((output**2).sum(dim=2, keepdim=False))
-        pred = v_mag.argmax(1, keepdim=False)
-        return pred
-
-    def loss(self, input, target, size_average=True):
-        batch_size = input.size(0)
-
-        v_mag = torch.sqrt((input**2).sum(dim=2, keepdim=True))
-
-        max_l = torch.relu(0.9 - v_mag).view(batch_size, -1)**2
-        max_r = torch.relu(v_mag - 0.1).view(batch_size, -1)**2
-
-        loss_lambda = 0.5
-        T_c = target
-        L_c = T_c * max_l + loss_lambda * (1.0 - T_c) * max_r
-        L_c = L_c.sum(dim=1)
-
-        if size_average:
-            L_c = L_c.mean()
-
-        return L_c
-
-
-class CapsuleNet(nn.Module):
-    def __init__(self, num_primary_units, num_output_units, primary_channels, primary_unit_size, output_unit_size,
-                 num_iterations):
-        super(CapsuleNet, self).__init__()
-        self.primary = CapsuleLayer(in_units=0,
-                                    out_units=num_primary_units,
-                                    in_channels=primary_channels,
-                                    unit_size=primary_unit_size,
-                                    use_routing=False,
-                                    num_iterations=0)
-
-        self.iteration = CapsuleLayer(in_units=num_primary_units,
-                                      out_units=num_output_units,
-                                      in_channels=primary_unit_size,
-                                      unit_size=output_unit_size,
-                                      use_routing=True,
-                                      num_iterations=num_iterations)
-
-    def forward(self, input):
-        return self.iteration(self.primary(input))
-
-
-class ConvUnit(nn.Module):
-    def __init__(self, in_channels):
-        super(ConvUnit, self).__init__()
-        self.conv0 = nn.Conv1d(
-            in_channels=in_channels,
-            out_channels=8,  # fixme constant
-            kernel_size=9,  # fixme constant
-            stride=2,  # fixme constant
-            bias=True)
-
-    def forward(self, x):
-        return self.conv0(x)
-
-
-class CapsuleLayer(nn.Module):
-    def __init__(self, in_units, out_units, in_channels, unit_size, use_routing, num_iterations):
-        super(CapsuleLayer, self).__init__()
-        self.in_units = in_units
-        self.out_units = out_units
-        self.in_channels = in_channels
-        self.unit_size = unit_size
-        self.use_routing = use_routing
-
-        if self.use_routing:
-            self.W = nn.Parameter(torch.randn(1, in_channels, out_units, unit_size, in_units))
-            self.num_iterations = num_iterations
-        else:
-
-            def create_conv_unit(unit_idx):
-                unit = ConvUnit(in_channels=in_channels)
-                self.add_module("unit_" + str(unit_idx), unit)
-                return unit
-
-            self.units = [create_conv_unit(i) for i in range(self.out_units)]
-
-    @staticmethod
-    def squash(s):
-        # This is equation 1 from the paper.
-        mag_sq = torch.sum(s**2, dim=2, keepdim=True)
-        mag = torch.sqrt(mag_sq)
-        s = (mag_sq / (1.0 + mag_sq)) * (s / mag)
-        return s
-
-    def forward(self, x):
-        if self.use_routing:
-            return self.routing(x)
-        else:
-            return self.no_routing(x)
-
-    def no_routing(self, x):
-        # Each unit will be (batch, channels, feature).
-        u = [self.units[i](x) for i in range(self.out_units)]
-
-        # Stack all unit outputs (batch, unit, channels, feature).
-        u = torch.stack(u, dim=1)
-
-        # Flatten to (batch, unit, output).
-        u = u.view(x.size(0), self.out_units, -1)
-
-        # Return squashed outputs.
-        return CapsuleLayer.squash(u)
-
-    def routing(self, x):
-        batch_size = x.size(0)
-
-        # (batch, in_units, features) -> (batch, features, in_units)
-        x = x.transpose(1, 2)
-
-        # (batch, features, in_units) -> (batch, features, out_units, in_units, 1)
-        x = torch.stack([x] * self.out_units, dim=2).unsqueeze(4)
-
-        # (batch, features, out_units, unit_size, in_units)
-        W = torch.cat([self.W] * batch_size, dim=0)
-
-        # Transform inputs by weight matrix.
-        # (batch_size, features, out_units, unit_size, 1)
-        u_hat = torch.matmul(W, x)
-
-        # Initialize routing logits to zero.
-        b_ij = torch.zeros(1, self.in_channels, self.out_units, 1).to(x.device)
-
-        # Iterative routing.
-        num_iterations = self.num_iterations
-        for iteration in range(num_iterations):
-            # Convert routing logits to softmax.
-            c_ij = F.softmax(b_ij, dim=1)
-
-            # (batch, features, out_units, 1, 1)
-            c_ij = torch.cat([c_ij] * batch_size, dim=0).unsqueeze(4)
-
-            # Apply routing (c_ij) to weighted inputs (u_hat).
-            # (batch_size, 1, out_units, unit_size, 1)
-            s_j = (c_ij * u_hat).sum(dim=1, keepdim=True)
-
-            # (batch_size, 1, out_units, unit_size, 1)
-            v_j = CapsuleLayer.squash(s_j)
-
-            # (batch_size, features, out_units, unit_size, 1)
-            v_j1 = torch.cat([v_j] * self.in_channels, dim=1)
-
-            # (1, features, out_units, 1)
-            u_vj1 = torch.matmul(u_hat.transpose(3, 4), v_j1).squeeze(4).mean(dim=0, keepdim=True)
-
-            # Update b_ij (routing)
-            b_ij = u_vj1
-
-        # (batch_size, out_units, unit_size); squeeze() drops all size-1 dims
-        return v_j.squeeze()
-
-
-if __name__ == '__main__':
-    net = CapsuleNet(num_primary_units=8,
-                     num_output_units=13,
-                     primary_channels=10,
-                     primary_unit_size=8,
-                     output_unit_size=20,
-                     num_iterations=5)
-    inputs = torch.randn(4, 10, 10)
-    outs = net(inputs)
-    print(outs.shape)  # (4, 13, 20)
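Note on the removed `Capsule.loss`: it computes a margin loss over capsule lengths, T_c * max(0, 0.9 - |v_c|)^2 + 0.5 * (1 - T_c) * max(0, |v_c| - 0.1)^2, summed over classes. The following is a minimal pure-Python sketch of that formula for a single example. The function name `margin_loss`, its keyword arguments, and the list-based inputs are illustrative assumptions and do not appear in the deleted file.

```python
import math


def margin_loss(capsules, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss for one example (hypothetical helper, not from the repo).

    capsules: list of output capsule vectors, one list of floats per class.
    target:   one-hot list with one entry per class.
    """
    total = 0.0
    for vec, t in zip(capsules, target):
        length = math.sqrt(sum(c * c for c in vec))  # capsule magnitude |v_c|
        present = max(0.0, m_plus - length) ** 2     # penalty if the class is the target
        absent = max(0.0, length - m_minus) ** 2     # penalty if the class is not the target
        total += t * present + lam * (1.0 - t) * absent
    return total


# The target capsule has zero length and the other capsule has length 1.
print(margin_loss([[0.0, 0.0], [1.0, 0.0]], [1, 0]))
```

The deleted method does the same computation batched on tensors and then averages over the batch when `size_average` is true.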
--- a/deepke/model/Embedding.py
+++ b/deepke/model/Embedding.py
@ -1,19 +0,0 @@
-import torch
-import torch.nn as nn
-
-
-class Embedding(nn.Module):
-    def __init__(self, vocab_size: int, word_dim: int, pos_size: int, pos_dim: int):
-        super(Embedding, self).__init__()
-        self.word_embed = nn.Embedding(vocab_size, word_dim, padding_idx=0)
-        self.head_pos_embed = nn.Embedding(pos_size, pos_dim, padding_idx=0)
-        self.tail_pos_embed = nn.Embedding(pos_size, pos_dim, padding_idx=0)
-
-    def forward(self, x):
-        words, head_pos, tail_pos = x
-        word_embed = self.word_embed(words)
-        head_pos_embed = self.head_pos_embed(head_pos)
-        tail_pos_embed = self.tail_pos_embed(tail_pos)
-        feature_embed = torch.cat([word_embed, head_pos_embed, tail_pos_embed], dim=-1)
-
-        return feature_embed
--- a/deepke/model/GCN.py
+++ b/deepke/model/GCN.py
@ -1,138 +0,0 @@
-import torch
-import numpy as np
-import torch.nn as nn
-import torch.nn.functional as F
-from deepke.model import BasicModule, Embedding
-
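-# Currently buggy, mainly because no good tool for Chinese dependency parsing was found.
-# Tried HanLP and stanford_nlp; both require installing Java packages (and the old Java 6 at that), and testing turned up many bugs.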
-
-
-class GCN(BasicModule):
-    def __init__(self, vocab_size, config):
-        super(GCN, self).__init__()
-        self.model_name = 'GCN'
-        self.vocab_size = vocab_size
-        self.word_dim = config.model.word_dim
-        self.pos_size = config.model.pos_size
-        self.pos_dim = config.model.pos_dim
-        self.hidden_dim = config.model.hidden_dim
-        self.dropout = config.model.dropout
-        self.num_layers = config.gcn.num_layers
-        self.out_dim = config.relation_type
-        self.embedding = Embedding(self.vocab_size, self.word_dim, self.pos_size, self.pos_dim)
-        self.input_dim = self.word_dim + self.pos_dim * 2
-        self.fc1 = nn.Linear(self.input_dim, self.hidden_dim)
-        self.fc2 = nn.Linear(self.hidden_dim, self.hidden_dim)
-        self.fc3 = nn.Linear(self.hidden_dim, self.out_dim)
-        self.dropout = nn.Dropout(self.dropout)
-
-    def forward(self, input):
-        *x, adj, mask = input
-        x = self.embedding(x)
-        for i in range(1, self.num_layers + 1):
-            if i == 1 == self.num_layers:
-                out = self.fc1(torch.bmm(adj, x))
-            elif i == self.num_layers:
-                out = self.fc3(torch.bmm(adj, x))
-            else:
-                out = F.relu(self.fc2(torch.bmm(adj, x)))
-        return out
-
-
-class Tree(object):
-    def __init__(self):
-        self.parent = None
-        self.num_children = 0
-        self.children = list()
-
-    def add_child(self, child):
-        child.parent = self
-        self.num_children += 1
-        self.children.append(child)
-
-    def size(self):
-        s = getattr(self, '_size', -1)
-        if s != -1:
-            return self._size
-        else:
-            count = 1
-            for i in range(self.num_children):
-                count += self.children[i].size()
-            self._size = count
-            return self._size
-
-    def __iter__(self):
-        yield self
-        for c in self.children:
-            for x in c:
-                yield x
-
-    def depth(self):
-        d = getattr(self, '_depth', -1)
-        if d != -1:
-            return self._depth
-        else:
-            count = 0
-            if self.num_children > 0:
-                for i in range(self.num_children):
-                    child_depth = self.children[i].depth()
-                    if child_depth > count:
-                        count = child_depth
-                count += 1
-            self._depth = count
-            return self._depth
-
-
-def head_to_adj(head, directed=True, self_loop=False):
-    """
-    Convert a sequence of head indexes to a (numpy) adjacency matrix.
-    """
-    seq_len = len(head)
-    head = head[:seq_len]
-    root = None
-    nodes = [Tree() for _ in head]
-
-    for i in range(seq_len):
-        h = head[i]
-        setattr(nodes[i], 'idx', i)
-        if h == 0:
-            root = nodes[i]
-        else:
-            nodes[h - 1].add_child(nodes[i])
-
-    assert root is not None
-
-    ret = np.zeros((seq_len, seq_len), dtype=np.float32)
-    queue = [root]
-    idx = []
-    while len(queue) > 0:
-        t, queue = queue[0], queue[1:]
-        idx += [t.idx]
-        for c in t.children:
-            ret[t.idx, c.idx] = 1
-        queue += t.children
-
-    if not directed:
-        ret = ret + ret.T
-
-    if self_loop:
-        for i in idx:
-            ret[i, i] = 1
-
-    return ret
-
-
-if __name__ == '__main__':
-    inputs = torch.tensor([list(range(6))])
-    embedding = nn.Embedding(10, 10)
-    inputs = embedding(inputs)
-
-    head = [2, 0, 5, 3, 2, 2]
-    adj = head_to_adj(head, directed=False, self_loop=True)
-    print(adj)
-    adj = torch.tensor([adj])
-
-    model = GCN(10, 10)
-    outs = model(adj, inputs)
-    print(outs.shape)
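Note on the removed `head_to_adj`: it treats `head` as a list of 1-based parent indices, with 0 marking the root, and sets entry `[parent, child]` to 1. A minimal sketch of the same idea without the `Tree` class or numpy follows. The name `heads_to_adjacency` is an assumption for illustration. Unlike the deleted code, this version does not walk the tree from the root, so it also marks nodes that are unreachable from the root.

```python
def heads_to_adjacency(heads, directed=True, self_loop=False):
    """Build a nested-list adjacency matrix from 1-based head indices.

    heads[i] is the 1-based index of token i's parent, or 0 for the root.
    Hypothetical helper; a simplified stand-in for the deleted head_to_adj.
    """
    n = len(heads)
    adj = [[0.0] * n for _ in range(n)]
    for child, head in enumerate(heads):
        if head == 0:
            continue  # the root has no parent edge
        parent = head - 1
        adj[parent][child] = 1.0
        if not directed:
            adj[child][parent] = 1.0
    if self_loop:
        for i in range(n):
            adj[i][i] = 1.0
    return adj


# Same head sequence as in the deleted __main__ block.
for row in heads_to_adjacency([2, 0, 5, 3, 2, 2]):
    print(row)
```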
--- a/deepke/model/LM.py
+++ b/deepke/model/LM.py
@ -1,20 +0,0 @@
-import torch.nn as nn
-from deepke.model import BasicModule
-from pytorch_transformers import BertModel
-
-
-class LM(BasicModule):
-    def __init__(self, vocab_size, config):
-        super(LM, self).__init__()
-        self.model_name = 'LM'
-        self.lm_name = config.lm.lm_file
-        self.out_dim = config.relation_type
-
-        self.lm = BertModel.from_pretrained(self.lm_name, num_hidden_layers=config.lm.num_hidden_layers)
-        self.fc = nn.Linear(768, self.out_dim)
-
-    def forward(self, x):
-        x = x[0]
-        out = self.lm(x)[0][:, 0]
-        out = self.fc(out)
-        return out
--- a/deepke/model/RNN.py
+++ b/deepke/model/RNN.py
@ -1,101 +0,0 @@
-import torch
-import torch.nn as nn
-import torch.nn.functional as F
-from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
-from deepke.model import BasicModule, Embedding
-
-
-class VarLenLSTM(BasicModule):
-    def __init__(self, input_size, hidden_size, lstm_layers=1, dropout=0, last_hn=False):
-        super(VarLenLSTM, self).__init__()
-        self.model_name = 'VarLenLSTM'
-        self.lstm_layers = lstm_layers
-        self.last_hn = last_hn
-        self.lstm = nn.LSTM(
-            input_size=input_size,
-            hidden_size=hidden_size,
-            num_layers=lstm_layers,
-            dropout=dropout,
-            bidirectional=True,
-            bias=True,
-            batch_first=True,
-        )
-
-    def forward(self, x, x_len):
-        '''
-        For sentences with padding.
-        In general, out is used for sequence labeling and hn for classification tasks.
-        :param x:      [B * L * H]
-        :param x_len:  [l...]
-        :return:
-            out:  [B * seq_len * hidden]   hidden = 2 * hidden_dim
-             hn:  [B * layers  * hidden]   hidden = 2 * hidden_dim
-        '''
-        x = pack_padded_sequence(x, x_len, batch_first=True, enforce_sorted=True)
-        out, (hn, _) = self.lstm(x)
-        out, _ = pad_packed_sequence(out, batch_first=True, padding_value=0.0)
-        hn = hn.transpose(0, 1).contiguous()
-        # [B, layers, 2*hidden]
-        hn = hn.view(hn.size(0), self.lstm_layers, -1)
-        if self.last_hn:
-            hn = hn[:, -1].unsqueeze(1)
-
-        return out, hn
-
-
-class BiLSTM(BasicModule):
-    def __init__(self, vocab_size, config):
-        super(BiLSTM, self).__init__()
-        self.model_name = 'BiLSTM'
-        self.vocab_size = vocab_size
-        self.word_dim = config.model.word_dim
-        self.pos_size = config.model.pos_size
-        self.pos_dim = config.model.pos_dim
-        self.hidden_dim = config.model.hidden_dim
-        self.dropout = config.model.dropout
-        self.lstm_layers = config.rnn.lstm_layers
-        self.last_hn = config.rnn.last_hn
-        self.out_dim = config.relation_type
-
-        self.embedding = Embedding(self.vocab_size, self.word_dim, self.pos_size, self.pos_dim)
-        self.input_dim = self.word_dim + self.pos_dim * 2
-        self.lstm = VarLenLSTM(self.input_dim,
-                               self.hidden_dim,
-                               self.lstm_layers,
-                               dropout=self.dropout,
-                               last_hn=self.last_hn)
-        if self.last_hn:
-            linear_input_dim = self.hidden_dim * 2
-        else:
-            linear_input_dim = self.hidden_dim * 2 * self.lstm_layers
-        self.fc1 = nn.Linear(linear_input_dim, self.hidden_dim)
-        self.fc2 = nn.Linear(self.hidden_dim, self.out_dim)
-
-    def forward(self, input):
-        *x, mask = input
-        x = self.embedding(x)
-        x_lens = torch.sum(mask.gt(0), dim=-1)
-        _, hn = self.lstm(x, x_lens)
-        hn = hn.view(hn.size(0), -1)
-        y = F.leaky_relu(self.fc1(hn))
-        y = F.leaky_relu(self.fc2(y))
-        return y
-
-
-if __name__ == '__main__':
-    torch.manual_seed(1)
-    # nn.Embedding and pack_padded_sequence expect integer (int64) tensors
-    x = torch.LongTensor([
-        [1, 2, 3, 4, 3, 2],
-        [1, 2, 3, 0, 0, 0],
-        [2, 4, 3, 0, 0, 0],
-        [2, 3, 0, 0, 0, 0],
-    ])
-    x_len = torch.LongTensor([6, 3, 3, 2])
-    embedding = nn.Embedding(5, 10, padding_idx=0)
-    model = VarLenLSTM(input_size=10, hidden_size=30, lstm_layers=5, last_hn=False)
-
-    x = embedding(x)  # [4, 6, 10]
-    out, hn = model(x, x_len)
-    # out: [4, 6, 60]   [B, seq_len, 2 * hidden]
-    #  hn: [4, 5, 60]   [B, layers,  2 * hidden]
-    print(out.shape, hn.shape)
--- a/deepke/model/Transformer.py
+++ b/deepke/model/Transformer.py
@ -1,131 +0,0 @@
-import math
-import torch
-import torch.nn as nn
-from deepke.model import BasicModule, Embedding
-
-
-class DotAttention(nn.Module):
-    '''
-    \text{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V
-    '''
-    def __init__(self, dropout=0.0):
-        super(DotAttention, self).__init__()
-        self.drop = nn.Dropout(dropout)
-        self.softmax = nn.Softmax(dim=-1)
-
-    def forward(self, Q, K, V, mask_out=None):
-        """
-        :param Q: [batch, seq_len_q, feature_size]
-        :param K: [batch, seq_len_k, feature_size]
-        :param V: [batch, seq_len_k, feature_size]
-        :param mask_out: [batch, 1, seq_len] or [batch, seq_len_q, seq_len_k]
-        """
-        feature_size = Q.size(-1)
-        scale = math.sqrt(feature_size)
-        output = torch.matmul(Q, K.transpose(1, 2)) / scale
-        if mask_out is not None:
-            output.masked_fill_(mask_out, -1e18)
-        output = self.softmax(output)
-        output = self.drop(output)
-        return torch.matmul(output, V)
-
-
-class MultiHeadAttention(nn.Module):
-    """
-    :param feature_size: int, size of the input dimension; also the size of the output dimension.
-    :param num_head: int, number of heads.
-    :param dropout: float.
-    """
-    def __init__(self, feature_size, num_head, dropout=0.2):
-        super(MultiHeadAttention, self).__init__()
-        self.feature_size = feature_size
-        self.num_head = num_head
-        self.q_in = nn.Linear(feature_size, feature_size * num_head)
-        self.k_in = nn.Linear(feature_size, feature_size * num_head)
-        self.v_in = nn.Linear(feature_size, feature_size * num_head)
-        self.attention = DotAttention(dropout=dropout)
-        self.out = nn.Linear(feature_size * num_head, feature_size)
-
-    def forward(self, Q, K, V, att_mask_out=None):
-        """
-        :param Q: [batch, seq_len_q, feature_size]
-        :param K: [batch, seq_len_k, feature_size]
-        :param V: [batch, seq_len_k, feature_size]
-        :param att_mask_out: [batch, 1, seq_len_k] or [batch, seq_len_q, seq_len_k]
-        """
-        batch, sq, feature = Q.size()
-        sk = K.size(1)
-        n_head = self.num_head
-        # input linear
-        q = self.q_in(Q).view(batch, sq, n_head, feature)
-        k = self.k_in(K).view(batch, sk, n_head, feature)
-        v = self.v_in(V).view(batch, sk, n_head, feature)
-
-        # transpose q, k and v to do batch attention
-        # [batch, seq_len, num_head, feature] => [num_head*batch, seq_len, feature]
-        q = q.permute(2, 0, 1, 3).contiguous().view(-1, sq, feature)
-        k = k.permute(2, 0, 1, 3).contiguous().view(-1, sk, feature)
-        v = v.permute(2, 0, 1, 3).contiguous().view(-1, sk, feature)
-        if att_mask_out is not None:
-            att_mask_out = att_mask_out.repeat(n_head, 1, 1)
-        att = self.attention(q, k, v, att_mask_out).view(n_head, batch, sq, feature)
-
-        # concat all heads, do output linear
-        # [num_head, batch, seq_len, feature] => [batch, seq_len, num_head*feature]
-        att = att.permute(1, 2, 0, 3).contiguous().view(batch, sq, -1)
-        output = self.out(att)
-        return output
-
-
-class Transformer(BasicModule):
-    def __init__(self, vocab_size, config):
-        super(Transformer, self).__init__()
-        self.model_name = 'Transformer'
-        self.vocab_size = vocab_size
-        self.word_dim = config.model.word_dim
-        self.pos_size = config.model.pos_size
-        self.pos_dim = config.model.pos_dim
-        self.hidden_dim = config.model.hidden_dim
-        self.dropout = config.model.dropout
-        self.layers = config.transformer.transformer_layers
-        self.out_dim = config.relation_type
-
-        self.embedding = Embedding(self.vocab_size, self.word_dim, self.pos_size, self.pos_dim)
-        self.feature_dim = self.word_dim + self.pos_dim * 2
-        self.att = MultiHeadAttention(self.feature_dim, num_head=4)
-        self.norm1 = nn.LayerNorm(self.feature_dim)
-        self.ffn = nn.Sequential(nn.Linear(self.feature_dim, self.hidden_dim), nn.ReLU(),
-                                 nn.Linear(self.hidden_dim, self.feature_dim), nn.Dropout(self.dropout))
-        self.norm2 = nn.LayerNorm(self.feature_dim)
-        self.fc = nn.Linear(self.feature_dim, self.out_dim)
-
-    def forward(self, input):
-        *x, mask = input
-        x = self.embedding(x)
-        att_mask_out = mask.eq(0).unsqueeze(1)
-
-        for i in range(self.layers):
-            attention = self.att(x, x, x, att_mask_out)
-            norm_att = self.norm1(attention + x)
-            x = self.ffn(norm_att)
-            x = self.norm2(x + norm_att)
-        x = x[:, 0]
-        out = self.fc(x)
-        return out
-
-
-if __name__ == '__main__':
-    torch.manual_seed(1)
-
-    q = torch.randn(32, 50, 100)
-    k = torch.randn(32, 60, 100)
-    v = torch.randn(32, 60, 100)
-    mask = torch.randn(32, 60).unsqueeze(1).gt(0)
-
-    att1 = DotAttention()
-    out = att1(q, k, v, mask)
-    print(out.shape)  # [32, 50, 100]
-
-    att2 = MultiHeadAttention(feature_size=100, num_head=8)
-    out = att2(q, k, v, mask)
-    print(out.shape)  # [32, 50, 100]
--- a/deepke/model/__init__.py
+++ b/deepke/model/__init__.py
@ -1,8 +0,0 @@
-from .BasicModule import BasicModule
-from .Embedding import Embedding
-from .CNN import CNN
-from .RNN import VarLenLSTM, BiLSTM
-from .GCN import GCN
-from .Transformer import Transformer
-from .Capsule import Capsule
-from .LM import LM
--- a/deepke/preprocess.py
+++ b/deepke/preprocess.py
@ -1,211 +0,0 @@
-import os
-import jieba
-import logging
-from typing import List, Dict
-from pytorch_transformers import BertTokenizer
-# project modules
-from deepke.vocab import Vocab
-from deepke.config import config
-from deepke.utils import ensure_dir, save_pkl, load_csv
-
-jieba.setLogLevel(logging.INFO)
-
-Path = str
-
-
-def _mask_feature(entities_idx: List, sen_len: int) -> List:
-    left = [1] * (entities_idx[0] + 1)
-    middle = [2] * (entities_idx[1] - entities_idx[0] - 1)
-    right = [3] * (sen_len - entities_idx[1])
-
-    return left + middle + right
-
-
-def _pos_feature(sent_len: int, entity_idx: int, entity_len: int, pos_limit: int) -> List:
-
-    left = list(range(-entity_idx, 0))
-    middle = [0] * entity_len
-    right = list(range(1, sent_len - entity_idx - entity_len + 1))
-    pos = left + middle + right
-
-    for i, p in enumerate(pos):
-        if p > pos_limit:
-            pos[i] = pos_limit
-        if p < -pos_limit:
-            pos[i] = -pos_limit
-    pos = [p + pos_limit + 1 for p in pos]
-
-    return pos
-
-
-def _build_data(data: List[Dict], vocab: Vocab, relations: Dict) -> List[Dict]:
-    if vocab.name == 'LM':
-        for d in data:
-            d['target'] = relations[d['relation']]
-
-        return data
-
-    for d in data:
-        word2idx = [vocab.word2idx.get(w, 1) for w in d['sentence']]
-        seq_len = len(word2idx)
-        head_idx, tail_idx = int(d['head_offset']), int(d['tail_offset'])
-        if vocab.name == 'word':
-            head_len, tail_len = 1, 1
-        else:
-            head_len, tail_len = len(d['head_type']), len(d['tail_type'])
-        entities_idx = [head_idx, tail_idx] if tail_idx > head_idx else [tail_idx, head_idx]
-        head_pos = _pos_feature(seq_len, head_idx, head_len, config.pos_limit)
-        tail_pos = _pos_feature(seq_len, tail_idx, tail_len, config.pos_limit)
-        mask_pos = _mask_feature(entities_idx, seq_len)
-        target = relations[d['relation']]
-
-        d['word2idx'] = word2idx
-        d['seq_len'] = seq_len
-        d['head_pos'] = head_pos
-        d['tail_pos'] = tail_pos
-        d['mask_pos'] = mask_pos
-        d['target'] = target
-
-    return data
-
-
-def _build_vocab(data: List[Dict], out_path: Path) -> Vocab:
-    if config.word_segment:
-        vocab = Vocab('word')
-    else:
-        vocab = Vocab('char')
-
-    for d in data:
-        vocab.add_sent(d['sentence'])
-    vocab.trim(config.min_freq)
-
-    ensure_dir(out_path)
-    vocab_path = os.path.join(out_path, 'vocab.pkl')
-    vocab_txt = os.path.join(out_path, 'vocab.txt')
-    save_pkl(vocab_path, vocab, 'vocab')
-    with open(vocab_txt, 'w', encoding='utf-8') as f:
-        f.write(os.linesep.join([word for word in vocab.word2idx.keys()]))
-    return vocab
-
-
-def _split_sent(data: List[Dict], verbose: bool = True) -> List[Dict]:
-    if verbose:
-        print('need word segment, use jieba to split sentence')
-
-    jieba.add_word('HEAD')
-    jieba.add_word('TAIL')
-
-    for d in data:
-        sent = d['sentence']
-        sent = sent.replace(d['head_type'], 'HEAD', 1)
-        sent = sent.replace(d['tail_type'], 'TAIL', 1)
-        sent = jieba.lcut(sent)
-        head_idx, tail_idx = sent.index('HEAD'), sent.index('TAIL')
-        sent[head_idx], sent[tail_idx] = d['head_type'], d['tail_type']
-        d['sentence'] = sent
-        d['head_offset'] = head_idx
-        d['tail_offset'] = tail_idx
-
-    return data
-
-
-def _add_lm_data(data: List[Dict]) -> List[Dict]:
-    'Serialize the input sentences using the language model vocabulary'
-    tokenizer = BertTokenizer.from_pretrained(config.lm.lm_file)
-
-    for d in data:
-        sent = d['sentence']
-        sent += '[SEP]' + d['head'] + '[SEP]' + d['tail']
-
-        d['lm_idx'] = tokenizer.encode(sent, add_special_tokens=True)
-        d['seq_len'] = len(d['lm_idx'])
-
-    return data
-
-
-def _replace_entity_by_type(data: List[Dict]) -> List[Dict]:
-    for d in data:
-        sent = d['sentence'].strip()
-        sent = sent.replace(d['head'], d['head_type'], 1)
-        sent = sent.replace(d['tail'], d['tail_type'], 1)
-        head_offset = sent.index(d['head_type'])
-        tail_offset = sent.index(d['tail_type'])
-
-        d['sentence'] = sent
-        d['head_offset'] = head_offset
-        d['tail_offset'] = tail_offset
-
-    return data
-
-
-def _load_relations(fp: Path) -> Dict:
-    'Read the relation file and store the relations as a dict, used to serialize relations'
-
-    print(f'load {fp}')
-    relations_arr = []
-    relations_dict = {}
-
-    with open(fp, encoding='utf-8') as f:
-        for l in f:
-            relations_arr.append(l.strip())
-
-    for k, v in enumerate(relations_arr):
-        relations_dict[v] = k
-
-    return relations_dict
-
-
-def process(data_path: Path, out_path: Path) -> None:
-    print('===== start preprocess data =====')
-    train_fp = os.path.join(data_path, 'train.csv')
-    test_fp = os.path.join(data_path, 'test.csv')
-    relation_fp = os.path.join(data_path, 'relation.txt')
-
-    print('load raw files...')
-    train_raw_data = load_csv(train_fp)
-    test_raw_data = load_csv(test_fp)
-    relations = _load_relations(relation_fp)
-
-    # Replace the entities in each sentence with their entity types.
-    # This improves training results considerably.
-    if config.replace_entity_by_type:
-        train_raw_data = _replace_entity_by_type(train_raw_data)
-        test_raw_data = _replace_entity_by_type(test_raw_data)
-
-    # When using a pretrained language model
-    if config.model_name == 'LM':
-        print('\nuse pretrained language model serialize sentence...')
-        train_raw_data = _add_lm_data(train_raw_data)
-        test_raw_data = _add_lm_data(test_raw_data)
-        vocab = Vocab('LM')
-
-    else:
-        # For Chinese, decide whether word segmentation is needed; skip it if the sentences are already segmented
-        print('\nverify whether need split words...')
-        if config.is_chinese and config.word_segment:
-            train_raw_data = _split_sent(train_raw_data)
-            test_raw_data = _split_sent(test_raw_data, verbose=False)
-
-        print('build word vocabulary...')
-        vocab = _build_vocab(train_raw_data, out_path)
-
-    print('\nbuild train data...')
-    train_data = _build_data(train_raw_data, vocab, relations)
-    print('build test data...\n')
-    test_data = _build_data(test_raw_data, vocab, relations)
-
-    ensure_dir(out_path)
-    train_data_path = os.path.join(out_path, 'train.pkl')
-    test_data_path = os.path.join(out_path, 'test.pkl')
-
-    save_pkl(train_data_path, train_data, 'train data')
-    save_pkl(test_data_path, test_data, 'test data')
-
-    print('===== end preprocess data =====')
-
-
-if __name__ == "__main__":
-    data_path = '../data/origin'
-    out_path = '../data/out'
-
-    process(data_path, out_path)
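Note on the removed `_pos_feature`: it assigns each token its signed distance to an entity, clips that distance to `[-pos_limit, pos_limit]`, and shifts the result by `pos_limit + 1`, so every index is at least 1 and index 0 stays free for padding. A minimal sketch of that scheme follows. The name `relative_positions` is an assumption for illustration; it is written as a single loop rather than the three concatenated lists of the deleted code.

```python
def relative_positions(sent_len, entity_idx, entity_len, pos_limit):
    """Clipped, shifted relative-position indices (hypothetical helper)."""
    entity_end = entity_idx + entity_len - 1  # index of the entity's last token
    positions = []
    for i in range(sent_len):
        if i < entity_idx:
            offset = i - entity_idx   # negative: token is left of the entity
        elif i <= entity_end:
            offset = 0                # token is inside the entity
        else:
            offset = i - entity_end   # positive: token is right of the entity
        offset = max(-pos_limit, min(pos_limit, offset))  # clip to the limit
        positions.append(offset + pos_limit + 1)          # shift so that 0 means padding
    return positions


# Six tokens, a one-token entity at index 2, distances clipped at 2.
print(relative_positions(6, 2, 1, 2))
```

With these arguments the raw offsets are -2, -1, 0, 1, 2, 3; the last is clipped to 2, and shifting by 3 gives indices between 1 and 5.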
--- a/deepke/trainer.py
+++ b/deepke/trainer.py
@ -1,72 +0,0 @@
-import torch
-import numpy as np
-import matplotlib.pyplot as plt
-from sklearn.metrics import precision_recall_fscore_support
-from deepke.utils import to_one_hot
-
-
-def train(epoch, device, dataloader, model, optimizer, criterion, config):
-    model.train()
-    total_loss = []
-
-    for batch_idx, (*x, y) in enumerate(dataloader, 1):
-        x = [i.to(device) for i in x]
-        y = y.to(device)
-        optimizer.zero_grad()
-        y_pred = model(x)
-
-        if model.model_name == 'Capsule':
-            y = to_one_hot(y, config.relation_type)
-            loss = model.loss(y_pred, y)
-        else:
-            loss = criterion(y_pred, y)
-
-        loss.backward()
-        optimizer.step()
-        total_loss.append(loss.item())
-
-        # logging
-        data_cal = len(dataloader.dataset) if batch_idx == len(dataloader) else batch_idx * len(y)
-        if (config.training.train_log
-                and batch_idx % config.training.log_interval == 0) or batch_idx == len(dataloader):
-            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(epoch, data_cal, len(dataloader.dataset),
-                                                                           100. * batch_idx / len(dataloader),
-                                                                           loss.item()))
-
-    # plot
-    if config.training.show_plot:
-        plt.plot(total_loss)
-        plt.title('loss')
-        plt.show()
-
-
-def validate(dataloader, model, device, config):
-    model.eval()
-
-    with torch.no_grad():
-        total_y_true = np.empty(0)
-        total_y_pred = np.empty(0)
-        for batch_idx, (*x, y) in enumerate(dataloader, 1):
-            x = [i.to(device) for i in x]
-            y = y.to(device)
-            y_pred = model(x)
-
-            if model.model_name == 'Capsule':
-                y_pred = model.predict(y_pred)
-            else:
-                y_pred = y_pred.argmax(dim=-1)
-
-            try:
-                y, y_pred = y.numpy(), y_pred.numpy()
-            except:
-                y, y_pred = y.cpu().numpy(), y_pred.cpu().numpy()
-            total_y_true = np.append(total_y_true, y)
-            total_y_pred = np.append(total_y_pred, y_pred)
-
-        total_f1 = []
-        for average in config.training.f1_norm:
-            p, r, f1, _ = precision_recall_fscore_support(total_y_true, total_y_pred, average=average)
-            print(f' {average} metrics: [p: {p:.4f}, r:{r:.4f}, f1:{f1:.4f}]')
-            total_f1.append(f1)
-
-    return total_f1
--- a/deepke/utils.py
+++ b/deepke/utils.py
@ -1,232 +0,0 @@
-import os
-import csv
-import json
-import torch
-import pickle
-import random
-import warnings
-import numpy as np
-from functools import reduce
-from typing import Dict, List, Tuple, Set, Any
-
-__all__ = [
-    'to_one_hot',
-    'seq_len_to_mask',
-    'ignore_waring',
-    'make_seed',
-    'load_pkl',
-    'save_pkl',
-    'ensure_dir',
-    'load_csv',
-    'load_jsonld',
-    'jsonld2csv',
-    'csv2jsonld',
-]
-
-Path = str
-
-
-def to_one_hot(x, length):
-    batch_size = x.size(0)
-    x_one_hot = torch.zeros(batch_size, length).to(x.device)
-    for i in range(batch_size):
-        x_one_hot[i, x[i]] = 1.0
-    return x_one_hot
-
-
-def model_summary(model):
-    """
-    Get the total number of parameters in the model.
-
-    :params model: PyTorch model
-    :return tuple: total, trainable, and non-trainable parameter counts
-    """
-    train = []
-    nontrain = []
-
-    def layer_summary(module):
-        def count_size(sizes):
-            return reduce(lambda x, y: x * y, sizes)
-
-        for p in module.parameters(recurse=False):
-            if p.requires_grad:
-                train.append(count_size(p.shape))
-            else:
-                nontrain.append(count_size(p.shape))
-        for subm in module.children():
-            layer_summary(subm)
-
-    layer_summary(model)
-    total_train = sum(train)
-    total_nontrain = sum(nontrain)
-    total = total_train + total_nontrain
-    strings = []
-    strings.append('Total params: {:,}'.format(total))
-    strings.append('Trainable params: {:,}'.format(total_train))
-    strings.append('Non-trainable params: {:,}'.format(total_nontrain))
-    max_len = len(max(strings, key=len))
-    bar = '-' * (max_len + 3)
-    strings = [bar] + strings + [bar]
-    print('\n'.join(strings))
-    return total, total_train, total_nontrain
-
-
-def seq_len_to_mask(seq_len, max_len=None):
-    """
-
-    Convert a 1-d array of sequence lengths into a 2-d mask; positions beyond
-    each sequence's length are 0.
-
-    .. code-block::
-
-        >>> seq_len = torch.arange(2, 16)
-        >>> mask = seq_len_to_mask(seq_len)
-        >>> print(mask.size())
-        torch.Size([14, 15])
-        >>> seq_len = np.arange(2, 16)
-        >>> mask = seq_len_to_mask(seq_len)
-        >>> print(mask.shape)
-        (14, 15)
-        >>> seq_len = torch.arange(2, 16)
-        >>> mask = seq_len_to_mask(seq_len, max_len=100)
-        >>> print(mask.size())
-        torch.Size([14, 100])
-
-    :param np.ndarray,torch.LongTensor seq_len: shape is (B,)
-    :param int max_len: pad the mask to this length. The default (None) uses the longest length in seq_len. Under
-        nn.DataParallel, seq_len may differ across devices, so pass max_len to pad the mask to a common length.
-    :return: np.ndarray or torch.Tensor of shape (B, max_length); element type is bool or torch.uint8
-    """
-    if isinstance(seq_len, np.ndarray):
-        assert len(np.shape(seq_len)) == 1, f"seq_len can only have one dimension, got {len(np.shape(seq_len))}."
-        max_len = int(max_len) if max_len else int(seq_len.max())
-        broad_cast_seq_len = np.tile(np.arange(max_len), (len(seq_len), 1))
-        mask = broad_cast_seq_len < seq_len.reshape(-1, 1)
-
-    elif isinstance(seq_len, torch.Tensor):
-        assert seq_len.dim() == 1, f"seq_len can only have one dimension, got {seq_len.dim()}."
-        batch_size = seq_len.size(0)
-        max_len = int(max_len) if max_len else seq_len.max().long()
-        broad_cast_seq_len = torch.arange(max_len).expand(batch_size, -1).to(seq_len)
-        mask = broad_cast_seq_len.lt(seq_len.unsqueeze(1))
-    else:
-        raise TypeError("Only support 1-d numpy.ndarray or 1-d torch.Tensor.")
-
-    return mask
-
-
-def ignore_waring():
-    warnings.filterwarnings("ignore")
-
-
-def make_seed(num: int = 1) -> None:
-    random.seed(num)
-    np.random.seed(num)
-    torch.manual_seed(num)
-    torch.cuda.manual_seed(num)
-    torch.cuda.manual_seed_all(num)
-
-
-def load_pkl(fp: str, obj_name: str = 'data', verbose: bool = True) -> Any:
-    if verbose:
-        print(f'load {obj_name} in {fp}')
-    with open(fp, 'rb') as f:
-        data = pickle.load(f)
-        return data
-
-
-def save_pkl(fp: Path, obj, obj_name: str = 'data', verbose: bool = True) -> None:
-    if verbose:
-        print(f'save {obj_name} in {fp}')
-    with open(fp, 'wb') as f:
-        pickle.dump(obj, f)
-
-
-def ensure_dir(d: str, verbose: bool = True) -> None:
-    '''
-    Check whether the directory exists; create it if it does not
-    :param d: directory
-    :param verbose: whether print logging
-    :return: None
-    '''
-    if not os.path.exists(d):
-        if verbose:
-            print("Directory '{}' does not exist; creating...".format(d))
-        os.makedirs(d)
-
-
-def load_csv(fp: str) -> List:
-    print(f'load {fp}')
-
-    with open(fp, encoding='utf-8') as f:
-        reader = csv.DictReader(f)
-        return list(reader)
-
-
-def load_jsonld(fp: str) -> List:
-    print(f'load {fp}')
-    datas = []
-
-    with open(fp, encoding='utf-8') as f:
-        for l in f:
-            line = json.loads(l)
-            data = list(line.values())
-            datas.append(data)
-    return datas
-
-
-def jsonld2csv(fp: str, verbose: bool = True) -> str:
-    '''
-    Read a jsonld file and save it as a csv file with the same name in the same location
-    :param fp: path to the jsonld file
-    :param verbose: whether to print log messages
-    :return: path to the csv file
-    '''
-    data = []
-    root, ext = os.path.splitext(fp)
-    fp_new = root + '.csv'
-    if verbose:
-        print(f'read jsonld file in: {fp}')
-    with open(fp, encoding='utf-8') as f:
-        for l in f:
-            line = json.loads(l)
-            data.append(line)
-    if verbose:
-        print('saving...')
-    with open(fp_new, 'w', encoding='utf-8') as f:
-        fieldnames = data[0].keys()
-        writer = csv.DictWriter(f, fieldnames=fieldnames, dialect='excel')
-        writer.writeheader()
-        writer.writerows(data)
-    if verbose:
-        print(f'saved csv file in: {fp_new}')
-    return fp_new
-
-
-def csv2jsonld(fp: str, verbose: bool = True) -> str:
-    '''
-    Read a csv file and save it as a jsonld file with the same name in the same location
-    :param fp: path to the csv file
-    :param verbose: whether to print log messages
-    :return: path to the jsonld file
-    '''
-    data = []
-    root, ext = os.path.splitext(fp)
-    fp_new = root + '.jsonld'
-    if verbose:
-        print(f'read csv file in: {fp}')
-    with open(fp, encoding='utf-8') as f:
-        reader = csv.DictReader(f, fieldnames=None, dialect='excel')
-        for line in reader:
-            data.append(line)
-    if verbose:
-        print('saving...')
-    with open(fp_new, 'w', encoding='utf-8') as f:
-        f.write('\n'.join([json.dumps(l, ensure_ascii=False) for l in data]))  # text mode already translates '\n'
-    if verbose:
-        print(f'saved jsonld file in: {fp_new}')
-    return fp_new
-
-
-if __name__ == '__main__':
-    pass
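The `seq_len_to_mask` helper removed from `deepke/utils.py` above can be sketched without NumPy or PyTorch. This is only an illustration of the idea (row i is True for the first `seq_len[i]` positions); the name `seq_len_to_mask_py` is hypothetical and not part of the repository.

```python
from typing import List, Optional


def seq_len_to_mask_py(seq_len: List[int], max_len: Optional[int] = None) -> List[List[bool]]:
    """Pure-Python stand-in: row i is True for the first seq_len[i] positions."""
    # Default width is the longest sequence, mirroring the max_len=None case.
    width = max_len if max_len is not None else max(seq_len)
    return [[pos < n for pos in range(width)] for n in seq_len]


mask = seq_len_to_mask_py([3, 1, 2])
padded = seq_len_to_mask_py([3, 1, 2], max_len=5)
print(mask)
print(len(padded), len(padded[0]))
```

Passing `max_len` explicitly is what keeps mask widths identical across devices in the `nn.DataParallel` case described in the docstring.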
--- a/deepke/vocab.py
+++ b/deepke/vocab.py
@ -1,78 +0,0 @@
-from typing import List
-
-init_tokens = ['PAD', 'UNK']
-
-
-class Vocab(object):
-    def __init__(self, name: str, init_tokens: List[str] = init_tokens):
-        self.name = name
-        self.init_tokens = init_tokens
-        self.trimed = False
-        self.word2idx = {}
-        self.word2count = {}
-        self.idx2word = {}
-        self.count = 0
-        self._add_init_tokens()
-
-    def _add_init_tokens(self):
-        for token in self.init_tokens:
-            self._add_word(token)
-
-    def _add_word(self, word: str):
-        if word not in self.word2idx:
-            self.word2idx[word] = self.count
-            self.word2count[word] = 1
-            self.idx2word[self.count] = word
-            self.count += 1
-        else:
-            self.word2count[word] += 1
-
-    def add_sent(self, sent: str):
-        for word in sent:
-            self._add_word(word)
-
-    def trim(self, min_freq=2, verbose: bool = True):
-        '''Remove words whose frequency is below min_freq from the vocabulary
-
-        Args:
-            min_freq: minimum word frequency
-        '''
-        if self.trimed:
-            return
-        self.trimed = True
-
-        keep_words = []
-        new_words = []
-        keep_words.extend(self.init_tokens)
-        new_words.extend(self.init_tokens)
-
-        for k, v in self.word2count.items():
-            if v >= min_freq:
-                keep_words.append(k)
-                new_words.extend([k] * v)
-        if verbose:
-            print('after trim, keep words [{} / {}] = {:.2f}%'.format(len(keep_words), len(self.word2idx),
-                                                                      len(keep_words) / len(self.word2idx) * 100))
-
-        # Reinitialize dictionaries
-        self.word2idx = {}
-        self.word2count = {}
-        self.idx2word = {}
-        self.count = 0
-        for word in new_words:
-            self._add_word(word)
-
-
-if __name__ == '__main__':
-    # english
-    # from nltk import word_tokenize
-    # sent = "I'm chinese, I love China."
-    # words = word_tokenize(sent)
-    vocab = Vocab('test')
-    sent = ' 我是中国人，我   爱中国。'
-    print(sent, '\n')
-    vocab.add_sent(sent)
-    print(vocab.word2idx)
-    print(vocab.word2count)
-    vocab.trim(2)
-    print(vocab.word2idx)
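The frequency-based trimming done by `Vocab.trim` in the removed `deepke/vocab.py` can be sketched with `collections.Counter`. This is a simplified stand-in under stated assumptions: `build_vocab` is a hypothetical helper, it returns only the word-to-index map, and it always keeps the initial tokens, as the original does.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def build_vocab(tokens: Iterable[str], min_freq: int = 2,
                init_tokens: Sequence[str] = ("PAD", "UNK")) -> Dict[str, int]:
    """Map each token seen at least min_freq times to an index; init tokens come first."""
    counts = Counter(tokens)
    word2idx = {tok: i for i, tok in enumerate(init_tokens)}
    for word, freq in counts.items():  # Counter keeps first-seen order
        if freq >= min_freq and word not in word2idx:
            word2idx[word] = len(word2idx)
    return word2idx


vocab = build_vocab("abcabd")  # iterating a string yields characters, like add_sent
print(vocab)
```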
--- a/main.py
+++ b/main.py
@ -1,97 +1,138 @@
 import os
-import argparse
-import warnings
+import hydra
 import torch
+import logging
 import torch.nn as nn
-import torch.optim as optim
+from torch import optim
+from hydra import utils
+import matplotlib.pyplot as plt
 from torch.utils.data import DataLoader
-from deepke.config import config
-from deepke import model
-from deepke.utils import make_seed, load_pkl
-from deepke.trainer import train, validate
-from deepke.preprocess import process
-from deepke.dataset import CustomDataset, collate_fn
+from torch.utils.tensorboard import SummaryWriter
+# self
+import models
+from preprocess import preprocess
+from dataset import CustomDataset, collate_fn
+from trainer import train, validate
+from utils import manual_seed, load_pkl

-warnings.filterwarnings("ignore")
+logger = logging.getLogger(__name__)

-__Models__ = {
-    "CNN": model.CNN,
-    "RNN": model.BiLSTM,
-    "GCN": model.GCN,
-    "Transformer": model.Transformer,
-    "Capsule": model.Capsule,
-    "LM": model.LM,
-}

-parser = argparse.ArgumentParser(description='choose your model')
-parser.add_argument('--model_name', type=str, help='model name: [CNN, RNN, GCN, Capsule, Transformer, LM]')
-args = parser.parse_args()
-model_name = args.model_name if args.model_name else config.model_name
+@hydra.main(config_path='conf/config.yaml')
+def main(cfg):
+    cwd = utils.get_original_cwd()
+    cfg.cwd = cwd
+    cfg.pos_size = 2 * cfg.pos_limit + 2
+    logger.info(f'\n{cfg.pretty()}')

-make_seed(config.training.seed)
+    __Model__ = {
+        'cnn': models.PCNN,
+    }

-if config.training.use_gpu and torch.cuda.is_available():
-    device = torch.device('cuda', config.training.gpu_id)
-else:
-    device = torch.device('cpu')
+    # device
+    if cfg.use_gpu and torch.cuda.is_available():
+        device = torch.device('cuda', cfg.gpu_id)
+    else:
+        device = torch.device('cpu')
+    logger.info(f'device: {device}')

-# if not os.path.exists(config.out_path):
-process(config.data_path, config.out_path)
+    # Unless the preprocessing itself changes, disable this step so the data is not preprocessed again on every run
+    if cfg.preprocess:
+        preprocess(cfg)

-train_data_path = os.path.join(config.out_path, 'train.pkl')
-test_data_path = os.path.join(config.out_path, 'test.pkl')
+    train_data_path = os.path.join(cfg.cwd, cfg.out_path, 'train.pkl')
+    valid_data_path = os.path.join(cfg.cwd, cfg.out_path, 'valid.pkl')
+    test_data_path = os.path.join(cfg.cwd, cfg.out_path, 'test.pkl')
+    vocab_path = os.path.join(cfg.cwd, cfg.out_path, 'vocab.pkl')

-if model_name == 'LM':
-    vocab_size = None
-else:
-    vocab_path = os.path.join(config.out_path, 'vocab.pkl')
-    vocab = load_pkl(vocab_path)
-    vocab_size = len(vocab.word2idx)
+    if cfg.model_name == 'lm':
+        vocab_size = None
+    else:
+        vocab = load_pkl(vocab_path)
+        vocab_size = vocab.count
+    cfg.vocab_size = vocab_size

-train_dataset = CustomDataset(train_data_path)
-train_dataloader = DataLoader(train_dataset,
-                              batch_size=config.training.batch_size,
-                              shuffle=True,
-                              collate_fn=collate_fn)
-test_dataset = CustomDataset(test_data_path)
-test_dataloader = DataLoader(
-    test_dataset,
-    batch_size=config.training.batch_size,
-    shuffle=False,
-    collate_fn=collate_fn,
-)
+    train_dataset = CustomDataset(train_data_path)
+    valid_dataset = CustomDataset(valid_data_path)
+    test_dataset = CustomDataset(test_data_path)

-model = __Models__[model_name](vocab_size, config)
-model.to(device)
-# print(model)
+    train_dataloader = DataLoader(train_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))
+    valid_dataloader = DataLoader(valid_dataset, batch_size=cfg.batch_size, shuffle=False, collate_fn=collate_fn(cfg))
+    test_dataloader = DataLoader(test_dataset, batch_size=cfg.batch_size, shuffle=False, collate_fn=collate_fn(cfg))

-optimizer = optim.Adam(model.parameters(), lr=config.training.learning_rate)
-scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer,
-                                                 'max',
-                                                 factor=config.training.decay_rate,
-                                                 patience=config.training.decay_patience)
-criterion = nn.CrossEntropyLoss()
+    model = __Model__[cfg.model_name](cfg)
+    model.to(device)
+    logger.info(f'\n {model}')

-best_macro_f1, best_macro_epoch = 0, 1
-best_micro_f1, best_micro_epoch = 0, 1
-best_macro_model, best_micro_model = '', ''
-print('=' * 10, ' Start training ', '=' * 10)
+    optimizer = optim.Adam(model.parameters(), lr=cfg.learning_rate, weight_decay=cfg.weight_decay)
+    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=cfg.lr_factor, patience=cfg.lr_patience)
+    criterion = nn.CrossEntropyLoss()

-for epoch in range(1, config.training.epoch + 1):
-    train(epoch, device, train_dataloader, model, optimizer, criterion, config)
-    macro_f1, micro_f1 = validate(test_dataloader, model, device, config)
-    model_name = model.save(epoch=epoch)
-    scheduler.step(macro_f1)
+    best_f1, best_epoch = -1, 0
+    es_loss, es_f1, es_epoch, es_patience, best_es_epoch, best_es_f1, es_path, best_es_path = 1e8, -1, 0, 0, 0, -1, '', ''
+    train_losses, valid_losses = [], []

-    if macro_f1 > best_macro_f1:
-        best_macro_f1 = macro_f1
-        best_macro_epoch = epoch
-        best_macro_model = model_name
-    if micro_f1 > best_micro_f1:
-        best_micro_f1 = micro_f1
-        best_micro_epoch = epoch
-        best_micro_model = model_name
+    if cfg.show_plot and cfg.plot_utils == 'tensorboard':
+        writer = SummaryWriter('tensorboard')
+    else:
+        writer = None

-print('=' * 10, ' End training ', '=' * 10)
-print(f'best macro f1: {best_macro_f1:.4f},', f'in epoch: {best_macro_epoch}, saved in: {best_macro_model}')
-print(f'best micro f1: {best_micro_f1:.4f},', f'in epoch: {best_micro_epoch}, saved in: {best_micro_model}')
+    logger.info('=' * 10 + ' Start training ' + '=' * 10)
+
+    for epoch in range(1, cfg.epoch + 1):
+        manual_seed(cfg.seed + epoch)
+        train_loss = train(epoch, model, train_dataloader, optimizer, criterion, device, writer, cfg)
+        valid_f1, valid_loss = validate(epoch, model, valid_dataloader, criterion, device)
+        scheduler.step(valid_loss)
+        model_path = model.save(epoch, cfg)
+        # logger.info(model_path)
+
+        train_losses.append(train_loss)
+        valid_losses.append(valid_loss)
+        if best_f1 < valid_f1:
+            best_f1 = valid_f1
+            best_epoch = epoch
+        # use the validation loss as the early-stopping criterion
+        if es_loss > valid_loss:
+            es_loss = valid_loss
+            es_f1 = valid_f1
+            es_epoch = epoch
+            es_patience = 0
+            es_path = model_path
+        else:
+            es_patience += 1
+            if es_patience >= cfg.early_stopping_patience:
+                best_es_epoch = es_epoch
+                best_es_f1 = es_f1
+                best_es_path = es_path
+
+    if cfg.show_plot:
+        if cfg.plot_utils == 'matplot':
+            plt.plot(train_losses, 'x-')
+            plt.plot(valid_losses, '+-')
+            plt.legend(['train', 'valid'])
+            plt.title('train/valid comparison loss')
+            plt.show()
+
+        if cfg.plot_utils == 'tensorboard':
+            for i in range(len(train_losses)):
+                writer.add_scalars('train/valid_comparison_loss', {
+                    'train': train_losses[i],
+                    'valid': valid_losses[i]
+                }, i)
+            writer.close()
+
+    logger.info(f'best(valid loss quota) early stopping epoch: {best_es_epoch}, '
+                f'this epoch macro f1: {best_es_f1:0.4f}')
+    logger.info(f'this model save path: {best_es_path}')
+    logger.info(f'total {cfg.epoch} epochs, best(valid macro f1) epoch: {best_epoch}, '
+                f'this epoch macro f1: {best_f1:.4f}')
+
+    validate(-1, model, test_dataloader, criterion, device)
+
+
+if __name__ == '__main__':
+    main()
+    # python predict.py --help  # show help for the parameters
+    # python predict.py -c
+    # python predict.py chinese_split=0,1 replace_entity_with_type=0,1 -m
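The patience bookkeeping in the training loop of `main.py` above can be sketched in isolation. This is a minimal illustration, not the repository's code: `early_stopping_epoch` is a hypothetical helper, and unlike `main.py`, which only records the early-stopping epoch and keeps training, the sketch assumes training actually stops once patience runs out.

```python
from typing import List, Tuple


def early_stopping_epoch(valid_losses: List[float], patience: int) -> Tuple[int, float]:
    """Return (best_epoch, best_loss), stopping after `patience` epochs without improvement."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the patience counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # assumption: stop here; main.py keeps iterating instead
    return best_epoch, best_loss


best_epoch, best_loss = early_stopping_epoch([0.9, 0.7, 0.8, 0.75, 0.72, 0.6], patience=3)
print(best_epoch, best_loss)
```

With a patience of 3 the loop stops at epoch 5, so the lower loss at epoch 6 is never observed; that is the usual trade-off of early stopping.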
--- a/metrics.py
+++ b/metrics.py
@ -0,0 +1,62 @@
+import torch
+import numpy as np
+from abc import ABCMeta, abstractmethod
+from sklearn.metrics import precision_recall_fscore_support
+
+
+class Metric(metaclass=ABCMeta):
+    @abstractmethod
+    def __init__(self):
+        pass
+
+    @abstractmethod
+    def reset(self):
+        """
+        Resets the metric to its initial state.
+        This is called at the start of each epoch.
+        """
+        pass
+
+    @abstractmethod
+    def update(self, *args):
+        """
+        Updates the metric's state using the passed batch output.
+        This is called once for each batch.
+        """
+        pass
+
+    @abstractmethod
+    def compute(self):
+        """
+        Computes the metric based on its accumulated state.
+        This is called at the end of each epoch.
+        :return: the actual quantity of interest
+        """
+        pass
+
+
+class PRMetric():
+    def __init__(self):
+        """
+        For now, delegate to the sklearn implementation
+        """
+        self.y_true = np.empty(0)
+        self.y_pred = np.empty(0)
+
+    def reset(self):
+        self.y_true = np.empty(0)
+        self.y_pred = np.empty(0)
+
+    def update(self, y_true: torch.Tensor, y_pred: torch.Tensor):
+        y_true = y_true.cpu().detach().numpy()
+        y_pred = y_pred.cpu().detach().numpy()
+        y_pred = np.argmax(y_pred, axis=-1)
+
+        self.y_true = np.append(self.y_true, y_true)
+        self.y_pred = np.append(self.y_pred, y_pred)
+
+    def compute(self):
+        p, r, f1, _ = precision_recall_fscore_support(self.y_true, self.y_pred, average='macro', warn_for=tuple())
+        _, _, acc, _ = precision_recall_fscore_support(self.y_true, self.y_pred, average='micro', warn_for=tuple())
+
+        return acc, p, r, f1
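`PRMetric.compute` above takes its accuracy value from the micro-averaged F1. For single-label multi-class predictions the two coincide, which the following pure-Python sketch checks by hand. The helper `f1_scores` is hypothetical and is not how sklearn is implemented; it only reproduces the definitions.

```python
from typing import List, Tuple


def f1_scores(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_: int, fp_: int, fn_: int) -> float:
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
macro, micro = f1_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(macro, micro, accuracy)
```

Every misclassification adds exactly one false positive and one false negative, so the pooled precision, recall and F1 all equal the fraction of correct predictions.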
--- a/models/BasicModule.py
+++ b/models/BasicModule.py
@ -0,0 +1,34 @@
+import os
+import time
+import torch
+import torch.nn as nn
+
+
+class BasicModule(nn.Module):
+    '''
+    Wraps nn.Module and provides save and load methods
+    '''
+    def __init__(self):
+        super(BasicModule, self).__init__()
+
+
+    def load(self, path, device):
+        '''
+        Load the model from the given path
+        '''
+        self.load_state_dict(torch.load(path, map_location=device))
+
+
+    def save(self, epoch=0, cfg=None):
+        '''
+        Save the model; by default the file is named "<model name>_epoch<N>.pth" inside a timestamped directory
+        '''
+        time_prefix = time.strftime('%Y-%m-%d_%H-%M-%S')
+        prefix = os.path.join(cfg.cwd, 'checkpoints', time_prefix)
+        os.makedirs(prefix, exist_ok=True)
+        name = os.path.join(prefix, cfg.model_name + '_' + f'epoch{epoch}' + '.pth')
+
+        torch.save(self.state_dict(), name)
+        return name
+
+
--- a/models/BiLSTM.py
+++ b/models/BiLSTM.py
@ -0,0 +1,24 @@
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from . import BasicModule
+from module import Embedding, RNN
+from utils import seq_len_to_mask
+
+
+class BiLSTM(BasicModule):
+    def __init__(self, cfg):
+        super(BiLSTM, self).__init__()
+
+        self.use_pcnn = cfg.use_pcnn
+
+        self.embedding = Embedding(cfg)
+        self.rnn = RNN(cfg)
+        self.fc1 = nn.Linear(len(cfg.kernel_sizes) * cfg.out_channels, cfg.intermediate)
+        self.fc2 = nn.Linear(cfg.intermediate, cfg.num_relations)
+        self.dropout = nn.Dropout(cfg.dropout)
+
+    def forward(self, x):
+        word, lens, head_pos, tail_pos = x['word'], x['lens'], x['head_pos'], x['tail_pos']
+        inputs = self.embedding(word, head_pos, tail_pos)
+        out, out_pool = self.rnn(inputs, lens)
--- a/models/Capsule.py
+++ b/models/Capsule.py
@ -0,0 +1,6 @@
+# coding=utf-8
+# Version: Python 3.7.3
+# Tools: Pycharm 2019.02
+
+__date__ = '2019/12/1 12:00 上午'
+__author__ = 'Haiyang Yu'
--- a/models/PCNN.py
+++ b/models/PCNN.py
@ -0,0 +1,52 @@
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from . import BasicModule
+from module import Embedding, CNN
+from utils import seq_len_to_mask
+
+
+class PCNN(BasicModule):
+    def __init__(self, cfg):
+        super(PCNN, self).__init__()
+
+        self.use_pcnn = cfg.use_pcnn
+
+        self.embedding = Embedding(cfg)
+        self.cnn = CNN(cfg)
+        self.fc1 = nn.Linear(len(cfg.kernel_sizes) * cfg.out_channels, cfg.intermediate)
+        self.fc2 = nn.Linear(cfg.intermediate, cfg.num_relations)
+        self.dropout = nn.Dropout(cfg.dropout)
+
+        if self.use_pcnn:
+            self.fc_pcnn = nn.Linear(3 * len(cfg.kernel_sizes) * cfg.out_channels,
+                                     len(cfg.kernel_sizes) * cfg.out_channels)
+            self.pcnn_mask_embedding = nn.Embedding(4, 3)
+            masks = torch.tensor([[0, 0, 0], [100, 0, 0], [0, 100, 0], [0, 0, 100]])
+            self.pcnn_mask_embedding.weight.data.copy_(masks)
+            self.pcnn_mask_embedding.weight.requires_grad = False
+
+
+    def forward(self, x):
+        word, lens, head_pos, tail_pos = x['word'], x['lens'], x['head_pos'], x['tail_pos']
+        mask = seq_len_to_mask(lens)
+
+        inputs = self.embedding(word, head_pos, tail_pos)
+        out, out_pool = self.cnn(inputs, mask=mask)
+
+        if self.use_pcnn:
+            out = out.unsqueeze(-1)  # [B, L, Hs, 1]
+            pcnn_mask = x['pcnn_mask']
+            pcnn_mask = self.pcnn_mask_embedding(pcnn_mask).unsqueeze(-2)  # [B, L, 1, 3]
+            out = out + pcnn_mask  # [B, L, Hs, 3]
+            out = out.max(dim=1)[0] - 100  # [B, Hs, 3]
+            out_pool = out.view(out.size(0), -1)  # [B, 3 * Hs]
+            out_pool = F.leaky_relu(self.fc_pcnn(out_pool))  # [B, Hs]
+            out_pool = self.dropout(out_pool)
+
+        output = self.fc1(out_pool)
+        output = F.leaky_relu(output)
+        output = self.dropout(output)
+        output = self.fc2(output)
+
+        return output
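The piecewise pooling in `PCNN.forward` above relies on a frozen mask embedding that adds 100 to the positions of one segment, takes the max over the sequence, and subtracts 100 again. A pure-Python sketch of that trick for a single feature channel follows; `piecewise_max` is a hypothetical name, and the sketch assumes activations stay well within a range of 100, which is what makes the offset safe.

```python
from typing import List

OFFSET = 100.0  # same constant as the frozen mask embedding rows in PCNN


def piecewise_max(values: List[float], segments: List[int], num_segments: int = 3) -> List[float]:
    """Max over each segment via the add-offset / max / subtract-offset trick.

    segments[i] is 1..num_segments for real tokens and 0 for padding.
    """
    pooled = []
    for seg in range(1, num_segments + 1):
        # Lift only the positions of this segment so they win the global max.
        shifted = [v + (OFFSET if s == seg else 0.0) for v, s in zip(values, segments)]
        pooled.append(max(shifted) - OFFSET)
    return pooled


values = [2.0, 9.0, 1.0, 5.0, 3.0, 0.0]
segments = [1, 1, 2, 2, 3, 0]
print(piecewise_max(values, segments))  # [9.0, 5.0, 3.0]
```

The tensor version does the same thing for all three segments at once by broadcasting a `[B, L, 1, 3]` mask against `[B, L, Hs, 1]` activations.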
--- a/models/__init__.py
+++ b/models/__init__.py
@ -0,0 +1,2 @@
+from .BasicModule import BasicModule
+from .PCNN import PCNN
--- a/module/Attention.py
+++ b/module/Attention.py
@ -0,0 +1,138 @@
+import logging
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+logger = logging.getLogger(__name__)
+
+
+class DotAttention(nn.Module):
+    def __init__(self, dropout=0.0):
+        super(DotAttention, self).__init__()
+        self.dropout = dropout
+
+    def forward(self, Q, K, V, mask_out=None,head_mask=None):
+        """
+        For an input X, it is usually assumed that K = V = X
+
+        att_weight = softmax( score_func(q, k) )
+        att = sum( att_weight * v )
+
+        :param Q: [..., L, H]
+        :param K: [..., S, H]
+        :param V: [..., S, H]
+        :param mask_out: [..., 1, S]
+        :return:
+        """
+        H = Q.size(-1)
+
+        scale = float(H)**0.5
+        attention_weight = torch.matmul(Q, K.transpose(-1, -2)) / scale
+
+        if mask_out is not None:
+            # when DotAttention is used on its own (rare), make sure the mask has as many dimensions as Q
+            while mask_out.dim() != Q.dim():
+                mask_out = mask_out.unsqueeze(1)
+            attention_weight.masked_fill_(mask_out, -1e8)
+
+        attention_weight = F.softmax(attention_weight, dim=-1)
+
+        attention_weight = F.dropout(attention_weight, self.dropout, training=self.training)
+
+        # mask heads if we want to:
+        # only used with multi-head attention
+        if head_mask is not None:
+            attention_weight = attention_weight * head_mask
+
+        attention_out = torch.matmul(attention_weight, V)
+
+        return attention_out, attention_weight
+
+
+class MultiHeadAttention(nn.Module):
+    def __init__(self, embed_dim, num_heads, dropout=0.0, output_attentions=True):
+        """
+        :param embed_dim: input dimension; must be divisible by num_heads
+        :param num_heads: number of attention heads
+        :param dropout: float.
+        """
+        super(MultiHeadAttention, self).__init__()
+        self.num_heads = num_heads
+        self.output_attentions = output_attentions
+        self.head_dim = embed_dim // num_heads
+        self.all_head_dim = self.head_dim * num_heads
+        assert self.all_head_dim == embed_dim, \
+            f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})"
+
+        self.q_in = nn.Linear(embed_dim, self.all_head_dim)
+        self.k_in = nn.Linear(embed_dim, self.all_head_dim)
+        self.v_in = nn.Linear(embed_dim, self.all_head_dim)
+        self.attention = DotAttention(dropout=dropout)
+        self.out = nn.Linear(self.all_head_dim, embed_dim)
+
+    def forward(self, Q, K, V, key_padding_mask=None,attention_mask=None, head_mask=None):
+        """
+        :param Q: [B, L, Hs]
+        :param K: [B, S, Hs]
+        :param V: [B, S, Hs]
+        :param key_padding_mask: [B, S]                positions equal to 1/True are masked
+        :param attention_mask: [S] / [L, S]            positions equal to 1/True are masked
+        :param head_mask: [N]                          heads equal to 1/True are masked
+        """
+        B, L, Hs = Q.shape
+        S = V.size(1)
+        N,H = self.num_heads, self.head_dim
+
+        q = self.q_in(Q).view(B, L, N, H).transpose(1, 2)  # [B, N, L, H]
+        k = self.k_in(K).view(B, S, N, H).transpose(1, 2)  # [B, N, S, H]
+        v = self.v_in(V).view(B, S, N, H).transpose(1, 2)  # [B, N, S, H]
+
+        if key_padding_mask is not None:
+            key_padding_mask = key_padding_mask.ne(0)
+            key_padding_mask = key_padding_mask.unsqueeze(1).unsqueeze(1)
+
+        if attention_mask is not None:
+            attention_mask = attention_mask.ne(0)
+            if attention_mask.dim() == 1:
+                attention_mask = attention_mask.unsqueeze(0)
+            elif attention_mask.dim() == 2:
+                attention_mask = attention_mask.unsqueeze(0).unsqueeze(0).expand(B,-1,-1,-1)
+            else:
+                raise ValueError(f'attention_mask dim must be 1 or 2, can not be {attention_mask.dim()}')
+
+        if key_padding_mask is None:
+            mask_out = attention_mask if attention_mask is not None else None
+        else:
+            mask_out = (key_padding_mask + attention_mask).ne(0) if attention_mask is not None else key_padding_mask
+
+        if head_mask is not None:
+            head_mask = head_mask.eq(0)
+            head_mask = head_mask.unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
+
+        attention_out, attention_weight = self.attention(q, k, v, mask_out=mask_out,head_mask=head_mask)
+
+        attention_out = attention_out.transpose(1, 2).reshape(B, L, N * H)  # [B, N, L, H] -> [B, L, N * H]
+
+        # concat all heads, and do output linear
+        attention_out = self.out(attention_out)  # [B, L, N * H] -> [B, L, Hs]
+
+        if self.output_attentions:
+            return attention_out, attention_weight
+        else:
+            return attention_out,
+
+
+if __name__ == '__main__':
+    from utils import seq_len_to_mask
+
+    q = torch.randn(4, 6, 20)  # [B, L, H]
+    k = v = torch.randn(4, 5, 20)  # [B, S, H]
+    key_padding_mask = seq_len_to_mask([5,4,3,2], max_len=5)
+    attention_mask = torch.tensor([1, 0, 0, 1, 0])  # positions equal to 1 are masked
+    head_mask = torch.tensor([0, 1])  # heads equal to 1 are masked
+
+    m = MultiHeadAttention(embed_dim=20, num_heads=2, dropout=0.0,output_attentions=True)
+    ao, aw = m(q, k, v, key_padding_mask=key_padding_mask, attention_mask=attention_mask,head_mask=head_mask)
+    print(ao.shape, aw.shape)  # [B, L, H]  [B, N, L, S]
+    print(ao)
+    print(aw.unbind(1))
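The computation in `DotAttention.forward` above, `softmax(Q K^T / sqrt(H)) V` with masked keys pushed to a large negative score, can be traced by hand for a single unbatched example. The sketch below uses plain Python lists instead of tensors; `dot_attention` and `softmax` are hypothetical helpers, and dropout and head masking are left out.

```python
import math
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(Q: Matrix, K: Matrix, V: Matrix,
                  mask_out: Optional[List[bool]] = None) -> Tuple[Matrix, Matrix]:
    """Q: [L, H], K and V: [S, H]; mask_out[j] is True where key j must be ignored."""
    scale = math.sqrt(len(Q[0]))
    weights = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        if mask_out is not None:
            # Same idea as masked_fill_(mask_out, -1e8) in the tensor version.
            scores = [-1e8 if m else s for s, m in zip(scores, mask_out)]
        weights.append(softmax(scores))
    out = [[sum(w * v[h] for w, v in zip(row, V)) for h in range(len(V[0]))]
           for row in weights]
    return out, weights


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out, weights = dot_attention(Q, K, V, mask_out=[False, False, True])
print(weights)
print(out)
```

The third key is masked, so its weight underflows to zero and the value `[5.0, 5.0]` does not contribute to the output.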
--- a/module/CNN.py
+++ b/module/CNN.py
@ -0,0 +1,117 @@
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class GELU(nn.Module):
+    def __init__(self):
+        super(GELU, self).__init__()
+
+    def forward(self, x):
+        return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
+
+
+class CNN(nn.Module):
+    """
+    In NLP, odd kernel sizes such as [3, 5, 7, 9] are typically used so that the output length equals the input length.
+    In that case, padding = k // 2
+    stride is usually 1
+    """
+    def __init__(self, config):
+        """
+        in_channels      : usually the word-embedding dimension or the hidden size
+        out_channels     : int
+        kernel_sizes     : list; each size must be odd (3, 5, 7, ...) so that output length = input length
+        activation       : [relu, lrelu, prelu, selu, celu, gelu, sigmoid, tanh]
+        pooling_strategy : [max, avg, cls]
+        dropout          : float
+        """
+        super(CNN, self).__init__()
+
+        # self.xxx = config.xxx
+        # self.in_channels = config.in_channels
+        if config.dim_strategy == 'cat':
+            self.in_channels = config.word_dim + 2 * config.pos_dim
+        else:
+            self.in_channels = config.word_dim
+
+        self.out_channels = config.out_channels
+        self.kernel_sizes = config.kernel_sizes
+        self.activation = config.activation
+        self.pooling_strategy = config.pooling_strategy
+        self.dropout = config.dropout
+        for kernel_size in self.kernel_sizes:
+            assert kernel_size % 2 == 1, "kernel size has to be odd numbers."
+
+        # convolution
+        self.convs = nn.ModuleList([
+            nn.Conv1d(in_channels=self.in_channels,
+                      out_channels=self.out_channels,
+                      kernel_size=k,
+                      stride=1,
+                      padding=k // 2,
+                      dilation=1,
+                      groups=1,
+                      bias=False) for k in self.kernel_sizes
+        ])
+
+        # activation function
+        assert self.activation in ['relu', 'lrelu', 'prelu', 'selu', 'celu', 'gelu', 'sigmoid', 'tanh'], \
+            'activation function must choose from [relu, lrelu, prelu, selu, celu, gelu, sigmoid, tanh]'
+        self.activations = nn.ModuleDict([
+            ['relu', nn.ReLU()],
+            ['lrelu', nn.LeakyReLU()],
+            ['prelu', nn.PReLU()],
+            ['selu', nn.SELU()],
+            ['celu', nn.CELU()],
+            ['gelu', GELU()],
+            ['sigmoid', nn.Sigmoid()],
+            ['tanh', nn.Tanh()],
+        ])
+
+        # pooling
+        assert self.pooling_strategy in ['max', 'avg', 'cls'], 'pooling strategy must choose from [max, avg, cls]'
+
+        self.dropout = nn.Dropout(self.dropout)
+
+    def forward(self, x, mask=None):
+        """
+            :param x: torch.Tensor [batch_size, seq_max_length, input_size], [B, L, H], usually the embedding output
+            :param mask: [batch_size, max_len]; 0 for real tokens, 1 for padding. It does not affect the convolution; it keeps max-pooling from picking padded positions
+            :return:
+            """
+        # [B, L, H] -> [B, H, L] (the H dimension is treated as the input channel dimension)
+        x = torch.transpose(x, 1, 2)
+
+        # convolution + activation  [[B, H, L], ... ]
+        act_fn = self.activations[self.activation]
+
+        x = [act_fn(conv(x)) for conv in self.convs]
+        x = torch.cat(x, dim=1)
+
+        # mask
+        if mask is not None:
+            # [B, L] -> [B, 1, L]
+            mask = mask.unsqueeze(1)
+            x = x.masked_fill_(mask, 1e-12)
+
+        # pooling
+        # [[B, H, L], ... ] -> [[B, H], ... ]
+        if self.pooling_strategy == 'max':
+            xp = F.max_pool1d(x, kernel_size=x.size(2)).squeeze(2)
+            # equivalent to xp = torch.max(x, dim=2)[0]
+
+        elif self.pooling_strategy == 'avg':
+            x_len = mask.squeeze(1).eq(0).sum(-1).unsqueeze(-1).to(torch.float)  # [B, 1, L] -> [B, 1] real-token counts
+            xp = torch.sum(x, dim=-1) / x_len
+
+        else:
+            # self.pooling_strategy == 'cls'
+            xp = x[:, :, 0]
+
+        x = x.transpose(1, 2)
+        x = self.dropout(x)
+        xp = self.dropout(xp)
+
+        return x, xp  # [B, L, Hs], [B, Hs]
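The `CNN` docstring above states that odd kernel sizes with `padding = k // 2` and stride 1 keep the sequence length unchanged. The standard `Conv1d` output-length formula makes this easy to check; `conv1d_out_len` below is a hypothetical helper that only does the arithmetic and runs no convolution.

```python
def conv1d_out_len(length: int, kernel_size: int, stride: int = 1, dilation: int = 1) -> int:
    """Output length of a 1-d convolution that uses padding = kernel_size // 2."""
    padding = kernel_size // 2
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


for k in (3, 5, 7, 9):
    print(k, conv1d_out_len(10, k))  # odd kernels: length stays 10
print(4, conv1d_out_len(10, 4))      # even kernel: length grows to 11
```

An even kernel with the same padding rule yields one extra position, which is why the constructor asserts that every kernel size is odd before the per-kernel outputs are concatenated.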
--- a/module/Capsule.py
+++ b/module/Capsule.py
@ -0,0 +1,13 @@
+import logging
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+logger = logging.getLogger(__name__)
+
+
+class Capsule(nn.Module):
+    def __init__(self, config):
+        super(Capsule, self).__init__()
+
+        # self.xxx = config.xxx
--- a/module/Embedding.py
+++ b/module/Embedding.py
@ -0,0 +1,39 @@
+import torch
+import torch.nn as nn
+from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
+
+
+class Embedding(nn.Module):
+    def __init__(self, config):
+        """
+        word embedding: index 0 is usually padding
+        pos embedding:  index 0 is usually padding
+        dim_strategy: [cat, sum]  whether the embeddings are concatenated or summed
+        """
+        super(Embedding, self).__init__()
+
+        # self.xxx = config.xxx
+        self.vocab_size = config.vocab_size
+        self.word_dim = config.word_dim
+        self.pos_size = config.pos_size
+        self.pos_dim = config.pos_dim if config.dim_strategy == 'cat' else config.word_dim
+        self.dim_strategy = config.dim_strategy
+
+        self.wordEmbed = nn.Embedding(self.vocab_size, self.word_dim, padding_idx=0)
+        self.headPosEmbed = nn.Embedding(self.pos_size, self.pos_dim, padding_idx=0)
+        self.tailPosEmbed = nn.Embedding(self.pos_size, self.pos_dim, padding_idx=0)
+
+
+    def forward(self, *x):
+        word, head, tail = x
+        word_embedding = self.wordEmbed(word)
+        head_embedding = self.headPosEmbed(head)
+        tail_embedding = self.tailPosEmbed(tail)
+
+        if self.dim_strategy == 'cat':
+            return torch.cat((word_embedding, head_embedding, tail_embedding), -1)
+        elif self.dim_strategy == 'sum':
+            # in this case pos_dim == word_dim
+            return word_embedding + head_embedding + tail_embedding
+        else:
+            raise Exception('dim_strategy must choose from [sum, cat]')
--- a/module/RNN.py
+++ b/module/RNN.py
@ -0,0 +1,96 @@
+import torch
+import torch.nn as nn
+from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
+
+
+class RNN(nn.Module):
+    def __init__(self, config):
+        """
+        type_rnn: one of RNN, GRU, LSTM
+        """
+        super(RNN, self).__init__()
+
+        # self.xxx = config.xxx
+        self.input_size = config.input_size
+        self.hidden_size = config.hidden_size // 2 if config.bidirectional else config.hidden_size
+        self.num_layers = config.num_layers
+        self.dropout = config.dropout
+        self.bidirectional = config.bidirectional
+        self.last_layer_hn = config.last_layer_hn
+        self.type_rnn = config.type_rnn
+
+        self.h0 = self._init_h0()
+        rnn = getattr(nn, self.type_rnn)  # look up nn.RNN / nn.GRU / nn.LSTM by name without eval
+        self.rnn = rnn(input_size=self.input_size,
+                       hidden_size=self.hidden_size,
+                       num_layers=self.num_layers,
+                       dropout=self.dropout,
+                       bidirectional=self.bidirectional,
+                       bias=True,
+                       batch_first=True)
+
+    def _init_h0(self):
+        pass
+        # h0 = torch.empty(1,B,H)
+        # h0 = nn.init.orthogonal_(h0)
+
+    def forward(self, x, x_len):
+        """
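+        :param x: torch.Tensor [batch_size, seq_max_length, input_size], [B, L, H_in] usually the output of an embedding layer
+        :param x_len: torch.Tensor [B] sentence lengths, already sorted in descending order
+        :return:
+        output: torch.Tensor [B, L, H_out] per-token output, used for sequence labeling
+        hn:     torch.Tensor [B, N, H_out] / [B, H_out] final hidden state, used for classification; only the last layer is returned when last_layer_hn is set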
+        """
+        B, L, _ = x.size()
+        H, N = self.hidden_size, self.num_layers
+
+        num_directions = 2 if self.bidirectional else 1
+        # allocate the initial states on the same device as the input so the module also works on GPU
+        h0 = nn.init.orthogonal_(torch.empty(num_directions * N, B, H, device=x.device))
+        c0 = nn.init.orthogonal_(torch.empty(num_directions * N, B, H, device=x.device))
+
+        x = pack_padded_sequence(x, x_len, batch_first=True, enforce_sorted=True)
+        if self.type_rnn == 'LSTM':
+            output, hn = self.rnn(x, (h0, c0))
+        else:
+            output, hn = self.rnn(x, h0)
+
+        output, _ = pad_packed_sequence(output, batch_first=True, total_length=L)
+
+        if self.type_rnn == 'LSTM':
+            hn = hn[0]
+        if self.bidirectional:
+            hn = hn.view(N, 2, B, H).transpose(1, 2).contiguous().view(N, B, 2 * H).transpose(0, 1)
+        else:
+            hn = hn.transpose(0, 1)
+        if self.last_layer_hn:
+            hn = hn[:, -1, :]
+
+        return output, hn
+
+
+if __name__ == '__main__':
+
+    class Config(object):
+        type_rnn = 'LSTM'
+        input_size = 5
+        hidden_size = 4
+        num_layers = 3
+        dropout = 0.0
+        last_layer_hn = False
+        bidirectional = True
+
+    config = Config()
+    model = RNN(config)
+    print(model)
+
+    torch.manual_seed(1)
+    x = torch.tensor([[4, 3, 2, 1], [5, 6, 7, 0], [8, 10, 0, 0]])
+    x = torch.nn.Embedding(11, 5, padding_idx=0)(x)  # B,L,H = 3,4,5
+    x_len = torch.tensor([4, 3, 2])
+
+    o, h = model(x, x_len)
+
+    print(o.shape, h.shape, sep='\n\n')
+    print(o[-1].data, h[-1].data, sep='\n\n')
--- a/module/Transformer.py
+++ b/module/Transformer.py
@ -0,0 +1,149 @@
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from .Attention import MultiHeadAttention
+
+
+def gelu(x):
+    """ Original Implementation of the gelu activation function in Google Bert repo when initially created.
+        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
+        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
+        Also see https://arxiv.org/abs/1606.08415
+    """
+    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
+
+
+def gelu_new(x):
+    """ Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT).
+        Also see https://arxiv.org/abs/1606.08415
+    """
+    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
+
+
+def swish(x):
+    return x * torch.sigmoid(x)
+
+
+ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish, "gelu_new": gelu_new}
+
+
+class TransformerAttention(nn.Module):
+    def __init__(self, config):
+        super(TransformerAttention, self).__init__()
+
+        # self.xxx = config.xxx
+        self.hidden_size = config.hidden_size
+        self.num_heads = config.num_heads
+        self.dropout = config.dropout
+        self.output_attentions = config.output_attentions
+        self.layer_norm_eps = config.layer_norm_eps
+
+        self.multihead_attention = MultiHeadAttention(self.hidden_size, self.num_heads, self.dropout,
+                                                      self.output_attentions)
+        self.dense = nn.Linear(self.hidden_size, self.hidden_size)
+        self.dropout = nn.Dropout(self.dropout)
+        self.layerNorm = nn.LayerNorm(self.hidden_size, eps=self.layer_norm_eps)
+
+    def forward(self, x, key_padding_mask=None, attention_mask=None, head_mask=None):
+        """
+        :param x: [B, L, Hs]
+        :param key_padding_mask: [B, L] True at the padded positions at the end of each sentence, False at the real tokens
+        :param head_mask: [L] [N,L]
+        :return:
+        """
+        attention_outputs = self.multihead_attention(x, x, x, key_padding_mask, attention_mask, head_mask)
+        attention_output = attention_outputs[0]
+        attention_output = self.dense(attention_output)
+        attention_output = self.dropout(attention_output)
+        attention_output = self.layerNorm(attention_output + x)
+        outputs = (attention_output, ) + attention_outputs[1:]  # the remaining items are the attention weights
+        return outputs
+
+
+class TransformerOutput(nn.Module):
+    def __init__(self, config):
+        super(TransformerOutput, self).__init__()
+
+        # self.xxx = config.xxx
+        self.hidden_size = config.hidden_size
+        self.intermediate_size = config.intermediate_size
+        self.dropout = config.dropout
+        self.layer_norm_eps = config.layer_norm_eps
+
+        self.zoom_in = nn.Linear(self.hidden_size, self.intermediate_size)
+        self.intermediate_act_fn = ACT2FN[config.hidden_act]
+        self.zoom_out = nn.Linear(self.intermediate_size, self.hidden_size)
+        self.dropout = nn.Dropout(self.dropout)
+        self.layerNorm = nn.LayerNorm(self.hidden_size, eps=self.layer_norm_eps)
+
+    def forward(self, input_tensor):
+        hidden_states = self.zoom_in(input_tensor)
+        hidden_states = self.intermediate_act_fn(hidden_states)
+        hidden_states = self.zoom_out(hidden_states)
+        hidden_states = self.dropout(hidden_states)
+        hidden_states = self.layerNorm(hidden_states + input_tensor)
+        return hidden_states
+
+
+class TransformerLayer(nn.Module):
+    def __init__(self, config):
+        super(TransformerLayer, self).__init__()
+
+        self.attention = TransformerAttention(config)
+        self.output = TransformerOutput(config)
+
+    def forward(self, hidden_states, key_padding_mask=None, attention_mask=None, head_mask=None):
+        attention_outputs = self.attention(hidden_states, key_padding_mask, attention_mask, head_mask)
+        attention_output = attention_outputs[0]
+        layer_output = self.output(attention_output)
+        outputs = (layer_output, ) + attention_outputs[1:]
+        return outputs
+
+
+class Transformer(nn.Module):
+    def __init__(self, config):
+        super(Transformer, self).__init__()
+
+        # self.xxx = config.xxx
+        self.num_hidden_layers = config.num_hidden_layers
+        self.output_attentions = config.output_attentions
+        self.output_hidden_states = config.output_hidden_states
+
+        self.layer = nn.ModuleList([TransformerLayer(config) for _ in range(self.num_hidden_layers)])
+
+    def forward(self, hidden_states, key_padding_mask=None, attention_mask=None, head_mask=None):
+        """
+        :param hidden_states: [B, L, Hs]
+        :param key_padding_mask: [B, S]       positions set to 1/True are masked
+        :param attention_mask: [S] / [L, S]   masks the given positions; positions set to 1/True are masked
+        :param head_mask: [N] / [L, N]        masks the given heads; positions set to 1/True are masked
+        """
+        if head_mask is not None:
+            if head_mask.dim() == 1:
+                head_mask = head_mask.expand((self.num_hidden_layers, ) + head_mask.shape)
+        else:
+            head_mask = [None] * self.num_hidden_layers
+
+        all_hidden_states = ()
+        all_attentions = ()
+        for i, layer_module in enumerate(self.layer):
+            if self.output_hidden_states:
+                all_hidden_states = all_hidden_states + (hidden_states, )
+
+            layer_outputs = layer_module(hidden_states, key_padding_mask, attention_mask, head_mask[i])
+            hidden_states = layer_outputs[0]
+
+            if self.output_attentions:
+                all_attentions = all_attentions + (layer_outputs[1], )
+
+        # Add last layer
+        if self.output_hidden_states:
+            all_hidden_states = all_hidden_states + (hidden_states, )
+
+        outputs = (hidden_states, )
+        if self.output_hidden_states:
+            outputs = outputs + (all_hidden_states, )
+        if self.output_attentions:
+            outputs = outputs + (all_attentions, )
+        return outputs  # last-layer hidden state, (all hidden states), (all attentions)
--- a/module/Transformer_offical.py
+++ b/module/Transformer_offical.py
@ -0,0 +1,429 @@
+import copy
+import torch
+from torch.nn.init import xavier_uniform_
+from torch.nn import Module, ModuleList, LayerNorm, Linear, Dropout, MultiheadAttention
+import torch.nn.functional as F
+
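+# This code is taken from torch 1.3.0; it is the official transformer implementation.
+# Its interface is too rigid, though, so a separate version was reimplemented for this project.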
+class Transformer(Module):
+    r"""A transformer model. User is able to modify the attributes as needed. The architecture
+    is based on the paper "Attention Is All You Need". Ashish Vaswani, Noam Shazeer,
+    Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and
+    Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information
+    Processing Systems, pages 6000-6010. Users can build the BERT(https://arxiv.org/abs/1810.04805)
+    model with corresponding parameters.
+
+    Args:
+        d_model: the number of expected features in the encoder/decoder inputs (default=512).
+        nhead: the number of heads in the multiheadattention models (default=8).
+        num_encoder_layers: the number of sub-encoder-layers in the encoder (default=6).
+        num_decoder_layers: the number of sub-decoder-layers in the decoder (default=6).
+        dim_feedforward: the dimension of the feedforward network model (default=2048).
+        dropout: the dropout value (default=0.1).
+        activation: the activation function of encoder/decoder intermediate layer, relu or gelu (default=relu).
+        custom_encoder: custom encoder (default=None).
+        custom_decoder: custom decoder (default=None).
+
+    Examples::
+        >>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
+        >>> src = torch.rand((10, 32, 512))
+        >>> tgt = torch.rand((20, 32, 512))
+        >>> out = transformer_model(src, tgt)
+
+    Note: A full example to apply nn.Transformer module for the word language model is available in
+    https://github.com/pytorch/examples/tree/master/word_language_model
+    """
+
+    def __init__(self, d_model=512, nhead=8, num_encoder_layers=6,
+                 num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
+                 activation="relu", custom_encoder=None, custom_decoder=None):
+        super(Transformer, self).__init__()
+
+        if custom_encoder is not None:
+            self.encoder = custom_encoder
+        else:
+            encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, activation)
+            encoder_norm = LayerNorm(d_model)
+            self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
+
+        if custom_decoder is not None:
+            self.decoder = custom_decoder
+        else:
+            decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout, activation)
+            decoder_norm = LayerNorm(d_model)
+            self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm)
+
+        self._reset_parameters()
+
+        self.d_model = d_model
+        self.nhead = nhead
+
+    def forward(self, src, tgt, src_mask=None, tgt_mask=None,
+                memory_mask=None, src_key_padding_mask=None,
+                tgt_key_padding_mask=None, memory_key_padding_mask=None):
+        r"""Take in and process masked source/target sequences.
+
+        Args:
+            src: the sequence to the encoder (required).
+            tgt: the sequence to the decoder (required).
+            src_mask: the additive mask for the src sequence (optional).
+            tgt_mask: the additive mask for the tgt sequence (optional).
+            memory_mask: the additive mask for the encoder output (optional).
+            src_key_padding_mask: the ByteTensor mask for src keys per batch (optional).
+            tgt_key_padding_mask: the ByteTensor mask for tgt keys per batch (optional).
+            memory_key_padding_mask: the ByteTensor mask for memory keys per batch (optional).
+
+        Shape:
+            - src: :math:`(S, N, E)`.
+            - tgt: :math:`(T, N, E)`.
+            - src_mask: :math:`(S, S)`.
+            - tgt_mask: :math:`(T, T)`.
+            - memory_mask: :math:`(T, S)`.
+            - src_key_padding_mask: :math:`(N, S)`.
+            - tgt_key_padding_mask: :math:`(N, T)`.
+            - memory_key_padding_mask: :math:`(N, S)`.
+
+            Note: [src/tgt/memory]_mask should be filled with
+            float('-inf') for the masked positions and float(0.0) else. These masks
+            ensure that predictions for position i depend only on the unmasked positions
+            j and are applied identically for each sequence in a batch.
+            [src/tgt/memory]_key_padding_mask should be a ByteTensor where True values are positions
+            that should be masked with float('-inf') and False values will be unchanged.
+            This mask ensures that no information will be taken from position i if
+            it is masked, and has a separate mask for each sequence in a batch.
+
+            - output: :math:`(T, N, E)`.
+
+            Note: Due to the multi-head attention architecture in the transformer model,
+            the output sequence length of a transformer is same as the input sequence
+            (i.e. target) length of the decoder.
+
+            where S is the source sequence length, T is the target sequence length, N is the
+            batch size, E is the feature number
+
+        Examples:
+            >>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
+        """
+
+        if src.size(1) != tgt.size(1):
+            raise RuntimeError("the batch number of src and tgt must be equal")
+
+        if src.size(2) != self.d_model or tgt.size(2) != self.d_model:
+            raise RuntimeError("the feature number of src and tgt must be equal to d_model")
+
+        memory = self.encoder(src, mask=src_mask, src_key_padding_mask=src_key_padding_mask)
+        output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask,
+                              tgt_key_padding_mask=tgt_key_padding_mask,
+                              memory_key_padding_mask=memory_key_padding_mask)
+        return output
+
+    def generate_square_subsequent_mask(self, sz):
+        r"""Generate a square mask for the sequence. The masked positions are filled with float('-inf').
+            Unmasked positions are filled with float(0.0).
+        """
+        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
+        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
+        return mask
+
+    def _reset_parameters(self):
+        r"""Initiate parameters in the transformer model."""
+
+        for p in self.parameters():
+            if p.dim() > 1:
+                xavier_uniform_(p)
+
+
+class TransformerEncoder(Module):
+    r"""TransformerEncoder is a stack of N encoder layers
+
+    Args:
+        encoder_layer: an instance of the TransformerEncoderLayer() class (required).
+        num_layers: the number of sub-encoder-layers in the encoder (required).
+        norm: the layer normalization component (optional).
+
+    Examples::
+        >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
+        >>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
+        >>> src = torch.rand(10, 32, 512)
+        >>> out = transformer_encoder(src)
+    """
+
+    def __init__(self, encoder_layer, num_layers, norm=None):
+        super(TransformerEncoder, self).__init__()
+        self.layers = _get_clones(encoder_layer, num_layers)
+        self.num_layers = num_layers
+        self.norm = norm
+
+    def forward(self, src, mask=None, src_key_padding_mask=None):
+        r"""Pass the input through the encoder layers in turn.
+
+        Args:
+            src: the sequence to the encoder (required).
+            mask: the mask for the src sequence (optional).
+            src_key_padding_mask: the mask for the src keys per batch (optional).
+
+        Shape:
+            see the docs in Transformer class.
+        """
+        output = src
+
+        for i in range(self.num_layers):
+            output = self.layers[i](output, src_mask=mask,
+                                    src_key_padding_mask=src_key_padding_mask)
+
+        if self.norm:
+            output = self.norm(output)
+
+        return output
+
+
+class TransformerDecoder(Module):
+    r"""TransformerDecoder is a stack of N decoder layers
+
+    Args:
+        decoder_layer: an instance of the TransformerDecoderLayer() class (required).
+        num_layers: the number of sub-decoder-layers in the decoder (required).
+        norm: the layer normalization component (optional).
+
+    Examples::
+        >>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
+        >>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
+        >>> memory = torch.rand(10, 32, 512)
+        >>> tgt = torch.rand(20, 32, 512)
+        >>> out = transformer_decoder(tgt, memory)
+    """
+
+    def __init__(self, decoder_layer, num_layers, norm=None):
+        super(TransformerDecoder, self).__init__()
+        self.layers = _get_clones(decoder_layer, num_layers)
+        self.num_layers = num_layers
+        self.norm = norm
+
+    def forward(self, tgt, memory, tgt_mask=None,
+                memory_mask=None, tgt_key_padding_mask=None,
+                memory_key_padding_mask=None):
+        r"""Pass the inputs (and mask) through the decoder layer in turn.
+
+        Args:
+            tgt: the sequence to the decoder (required).
+            memory: the sequence from the last layer of the encoder (required).
+            tgt_mask: the mask for the tgt sequence (optional).
+            memory_mask: the mask for the memory sequence (optional).
+            tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
+            memory_key_padding_mask: the mask for the memory keys per batch (optional).
+
+        Shape:
+            see the docs in Transformer class.
+        """
+        output = tgt
+
+        for i in range(self.num_layers):
+            output = self.layers[i](output, memory, tgt_mask=tgt_mask,
+                                    memory_mask=memory_mask,
+                                    tgt_key_padding_mask=tgt_key_padding_mask,
+                                    memory_key_padding_mask=memory_key_padding_mask)
+
+        if self.norm:
+            output = self.norm(output)
+
+        return output
+
+class TransformerEncoderLayer(Module):
+    r"""TransformerEncoderLayer is made up of self-attn and feedforward network.
+    This standard encoder layer is based on the paper "Attention Is All You Need".
+    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
+    Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
+    Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
+    in a different way during application.
+
+    Args:
+        d_model: the number of expected features in the input (required).
+        nhead: the number of heads in the multiheadattention models (required).
+        dim_feedforward: the dimension of the feedforward network model (default=2048).
+        dropout: the dropout value (default=0.1).
+        activation: the activation function of intermediate layer, relu or gelu (default=relu).
+
+    Examples::
+        >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
+        >>> src = torch.rand(10, 32, 512)
+        >>> out = encoder_layer(src)
+    """
+
+    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu"):
+        super(TransformerEncoderLayer, self).__init__()
+        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
+        # Implementation of Feedforward model
+        self.linear1 = Linear(d_model, dim_feedforward)
+        self.dropout = Dropout(dropout)
+        self.linear2 = Linear(dim_feedforward, d_model)
+
+        self.norm1 = LayerNorm(d_model)
+        self.norm2 = LayerNorm(d_model)
+        self.dropout1 = Dropout(dropout)
+        self.dropout2 = Dropout(dropout)
+
+        self.activation = _get_activation_fn(activation)
+
+    def forward(self, src, src_mask=None, src_key_padding_mask=None):
+        r"""Pass the input through the encoder layer.
+
+        Args:
+            src: the sequence to the encoder layer (required).
+            src_mask: the mask for the src sequence (optional).
+            src_key_padding_mask: the mask for the src keys per batch (optional).
+
+        Shape:
+            see the docs in Transformer class.
+        """
+        src2 = self.self_attn(src, src, src, attn_mask=src_mask,
+                              key_padding_mask=src_key_padding_mask)[0]
+        src = src + self.dropout1(src2)
+        src = self.norm1(src)
+        if hasattr(self, "activation"):
+            src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
+        else:  # for backward compatibility
+            src2 = self.linear2(self.dropout(F.relu(self.linear1(src))))
+        src = src + self.dropout2(src2)
+        src = self.norm2(src)
+        return src
+
+
+class TransformerDecoderLayer(Module):
+    r"""TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.
+    This standard decoder layer is based on the paper "Attention Is All You Need".
+    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
+    Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
+    Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
+    in a different way during application.
+
+    Args:
+        d_model: the number of expected features in the input (required).
+        nhead: the number of heads in the multiheadattention models (required).
+        dim_feedforward: the dimension of the feedforward network model (default=2048).
+        dropout: the dropout value (default=0.1).
+        activation: the activation function of intermediate layer, relu or gelu (default=relu).
+
+    Examples::
+        >>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
+        >>> memory = torch.rand(10, 32, 512)
+        >>> tgt = torch.rand(20, 32, 512)
+        >>> out = decoder_layer(tgt, memory)
+    """
+
+    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu"):
+        super(TransformerDecoderLayer, self).__init__()
+        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
+        self.multihead_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
+        # Implementation of Feedforward model
+        self.linear1 = Linear(d_model, dim_feedforward)
+        self.dropout = Dropout(dropout)
+        self.linear2 = Linear(dim_feedforward, d_model)
+
+        self.norm1 = LayerNorm(d_model)
+        self.norm2 = LayerNorm(d_model)
+        self.norm3 = LayerNorm(d_model)
+        self.dropout1 = Dropout(dropout)
+        self.dropout2 = Dropout(dropout)
+        self.dropout3 = Dropout(dropout)
+
+        self.activation = _get_activation_fn(activation)
+
+    def forward(self, tgt, memory, tgt_mask=None, memory_mask=None,
+                tgt_key_padding_mask=None, memory_key_padding_mask=None):
+        r"""Pass the inputs (and mask) through the decoder layer.
+
+        Args:
+            tgt: the sequence to the decoder layer (required).
+            memory: the sequence from the last layer of the encoder (required).
+            tgt_mask: the mask for the tgt sequence (optional).
+            memory_mask: the mask for the memory sequence (optional).
+            tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
+            memory_key_padding_mask: the mask for the memory keys per batch (optional).
+
+        Shape:
+            see the docs in Transformer class.
+        """
+        tgt2 = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask,
+                              key_padding_mask=tgt_key_padding_mask)[0]
+        tgt = tgt + self.dropout1(tgt2)
+        tgt = self.norm1(tgt)
+        tgt2 = self.multihead_attn(tgt, memory, memory, attn_mask=memory_mask,
+                                   key_padding_mask=memory_key_padding_mask)[0]
+        tgt = tgt + self.dropout2(tgt2)
+        tgt = self.norm2(tgt)
+        if hasattr(self, "activation"):
+            tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
+        else:  # for backward compatibility
+            tgt2 = self.linear2(self.dropout(F.relu(self.linear1(tgt))))
+        tgt = tgt + self.dropout3(tgt2)
+        tgt = self.norm3(tgt)
+        return tgt
+
+
+def _get_clones(module, N):
+    return ModuleList([copy.deepcopy(module) for i in range(N)])
+
+
+def _get_activation_fn(activation):
+    if activation == "relu":
+        return F.relu
+    elif activation == "gelu":
+        return F.gelu
+    else:
+        raise RuntimeError("activation should be relu/gelu, not %s." % activation)
+
+
+
+if __name__ == '__main__':
+    import torch.nn as nn
+    torch.manual_seed(1)
+    class Config():
+        d_model = 8
+        nhead = 4
+        num_encoder_layers = 3
+        num_decoder_layers = 3
+        dim_feedforward = 64
+        dropout = 0.1
+        activation = 'gelu'
+
+    cfg = Config()
+
+    encoder_layer = nn.TransformerEncoderLayer(cfg.d_model, cfg.nhead, cfg.dim_feedforward, cfg.dropout,
+                                               cfg.activation)
+    encoder_norm = nn.LayerNorm(cfg.d_model)
+    encoder = nn.TransformerEncoder(encoder_layer, cfg.num_encoder_layers, encoder_norm)
+
+    decoder_layer = nn.TransformerDecoderLayer(cfg.d_model, cfg.nhead, cfg.dim_feedforward, cfg.dropout,
+                                               cfg.activation)
+    decoder_norm = nn.LayerNorm(cfg.d_model)
+    decoder = nn.TransformerDecoder(decoder_layer, cfg.num_decoder_layers, decoder_norm)
+
+    src = torch.randn((2, 7, 8))  # B,L,H
+    tgt = torch.randn((2, 5, 8))
+    src.transpose_(0,1)
+    tgt.transpose_(0,1)
+    src_mask = None
+    tgt_mask = None
+    memory_mask = None
+    src_key_padding_mask = None
+    tgt_key_padding_mask = None
+    memory_key_padding_mask = None
+
+    memory = encoder(src, mask=src_mask, src_key_padding_mask=src_key_padding_mask)
+    output = decoder(tgt,
+                     memory,
+                     tgt_mask=tgt_mask,
+                     memory_mask=memory_mask,
+                     tgt_key_padding_mask=tgt_key_padding_mask,
+                     memory_key_padding_mask=memory_key_padding_mask)
+    memory.transpose_(0,1)
+    output.transpose_(0,1)
+    print(memory.shape, output.shape)  # torch.Size([2, 7, 8]) torch.Size([2, 5, 8])
+
+    # call nn.Transformer directly
+    transformer = nn.Transformer(cfg.d_model, cfg.nhead, cfg.num_encoder_layers, cfg.num_decoder_layers, cfg.dim_feedforward, cfg.dropout, cfg.activation)
+    out = transformer(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask, memory_mask=memory_mask)
+    out.transpose_(0,1)
+    print(out.shape)
+
--- a/module/__init__.py
+++ b/module/__init__.py
@ -0,0 +1,6 @@
+from .Embedding import Embedding
+from .CNN import CNN
+from .RNN import RNN
+from .Attention import DotAttention, MultiHeadAttention
+from .Transformer import Transformer
+from .Capsule import Capsule
--- a/predict.py
+++ b/predict.py
@ -0,0 +1,136 @@
+import os
+import sys
+import torch
+import logging
+import hydra
+import models
+from hydra import utils
+from utils import load_pkl, load_csv
+from serializer import Serializer
+from preprocess import _serialize_sentence, _convert_tokens_into_index, _add_pos_seq, _handle_relation_data
+import matplotlib.pyplot as plt
+
+logger = logging.getLogger(__name__)
+
+
+def _preprocess_data(data, cfg):
+    vocab = load_pkl(os.path.join(cfg.cwd, cfg.out_path, 'vocab.pkl'), verbose=False)
+    relation_data = load_csv(os.path.join(cfg.cwd, cfg.data_path, 'relation.csv'), verbose=False)
+    rels = _handle_relation_data(relation_data)
+    cfg.vocab_size = vocab.count
+    serializer = Serializer(do_chinese_split=cfg.chinese_split)
+    serial = serializer.serialize
+
+    _serialize_sentence(data, serial, cfg)
+    _convert_tokens_into_index(data, vocab)
+    _add_pos_seq(data, cfg)
+    logger.info('start sentence preprocess...')
+    formats = '\nsentence: {}\nchinese_split: {}\nreplace_entity_with_type:  {}\nreplace_entity_with_scope: {}\n' \
+              'tokens:    {}\ntoken2idx: {}\nlength:    {}\nhead_idx:  {}\ntail_idx:  {}'
+    logger.info(
+        formats.format(data[0]['sentence'], cfg.chinese_split, cfg.replace_entity_with_type,
+                       cfg.replace_entity_with_scope, data[0]['tokens'], data[0]['token2idx'], data[0]['seq_len'],
+                       data[0]['head_idx'], data[0]['tail_idx']))
+    return data, rels
+
+
+def _get_predict_instance():
+    flag = input('Use the built-in example? [y/n] (type "exit" to quit): ')
+    flag = flag.strip().lower()
+    if flag == 'y' or flag == 'yes':
+        sentence = '《乡村爱情》是一部由知名导演赵本山在1985年所拍摄的农村青春偶像剧。'
+        head = '乡村爱情'
+        tail = '赵本山'
+        head_type = '电视剧'
+        tail_type = '人物'
+    elif flag == 'n' or flag == 'no':
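+        sentence = input('Enter the sentence: ')
+        head = input('Enter the head entity whose relation should be predicted: ')
+        head_type = input('Enter the head entity type: ')
+        tail = input('Enter the tail entity whose relation should be predicted: ')
+        tail_type = input('Enter the tail entity type: ')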
+    elif flag == 'exit':
+        sys.exit(0)
+    else:
+        print('please input yes or no, or exit!')
+        return _get_predict_instance()
+
+    instance = dict()
+    instance['sentence'] = sentence.strip()
+    instance['head'] = head.strip()
+    instance['head_type'] = head_type.strip()
+    instance['tail'] = tail.strip()
+    instance['tail_type'] = tail_type.strip()
+
+    return instance
+
+
+# user-defined path to the saved model checkpoint
+fp = 'xxx/checkpoints/2019-12-03_17-35-30/cnn_epoch21.pth'
+
+@hydra.main(config_path='conf/config.yaml')
+def main(cfg):
+    cwd = utils.get_original_cwd()
+    cfg.cwd = cwd
+    cfg.pos_size = 2 * cfg.pos_limit + 2
+    # print(cfg.pretty())
+
+    # get predict instance
+    instance = _get_predict_instance()
+    data = [instance]
+
+    # preprocess data
+    data, rels = _preprocess_data(data, cfg)
+
+    # model
+    __Model__ = {
+        'cnn': models.PCNN,
+    }
+
+    # prediction is best run on the CPU
+    # cfg.use_gpu = False
+    if cfg.use_gpu and torch.cuda.is_available():
+        device = torch.device('cuda', cfg.gpu_id)
+    else:
+        device = torch.device('cpu')
+    logger.info(f'device: {device}')
+
+    model = __Model__[cfg.model_name](cfg)
+    model.load(fp, device=device)
+    model.to(device)
+    model.eval()
+    logger.info(f'model name: {cfg.model_name}')
+    logger.info(f'\n {model}')
+
+    x = dict()
+    x['word'], x['lens'] = torch.tensor([data[0]['token2idx']]), torch.tensor([data[0]['seq_len']])
+    if cfg.model_name != 'lm':
+        x['head_pos'], x['tail_pos'] = torch.tensor([data[0]['head_pos']]), torch.tensor([data[0]['tail_pos']])
+        if cfg.use_pcnn:
+            x['pcnn_mask'] = torch.tensor([data[0]['entities_pos']])
+    for key in x.keys():
+        x[key] = x[key].to(device)
+
+    with torch.no_grad():
+        y_pred = model(x)
+        y_pred = torch.softmax(y_pred, dim=-1)[0]
+        prob = y_pred.max().item()
+        prob_rel = list(rels.keys())[y_pred.argmax().item()]
+        logger.info(f"\"{data[0]['head']}\" 和 \"{data[0]['tail']}\" 在句中关系为：\"{prob_rel}\"，置信度为{prob:.2f}。")
+
+    if cfg.predict_plot:
+        # matplotlib's default font cannot render Chinese characters
+        plt.rcParams["font.family"] = 'Arial Unicode MS'
+        labels = list(rels.keys())
+        height = list(y_pred.cpu().numpy())
+        plt.bar(labels, height)
+        for label, h in zip(labels, height):
+            plt.text(label, h, '%.2f' % h, ha="center", va="bottom")
+        plt.xlabel('关系')
+        plt.ylabel('置信度')
+        plt.xticks(rotation=315)
+        plt.show()
+
+
+if __name__ == '__main__':
+    main()
--- a/preprocess.py
+++ b/preprocess.py
@ -0,0 +1,175 @@
+import os
+import logging
+from collections import OrderedDict
+from typing import List, Dict
+from transformers import BertTokenizer
+from serializer import Serializer
+from vocab import Vocab
+from utils import save_pkl, load_csv
+
+logger = logging.getLogger(__name__)
+
+
+def _handle_pos_limit(pos: List[int], limit: int) -> List[int]:
+    for i, p in enumerate(pos):
+        if p > limit:
+            pos[i] = limit
+        if p < -limit:
+            pos[i] = -limit
+    return [p + limit + 1 for p in pos]
+
+
+def _add_pos_seq(train_data: List[Dict], cfg):
+    for d in train_data:
+        entities_idx = [d['head_idx'], d['tail_idx']
+                        ] if d['head_idx'] < d['tail_idx'] else [d['tail_idx'], d['head_idx']]
+
+        d['head_pos'] = [i - d['head_idx'] for i in range(d['seq_len'])]
+        d['head_pos'] = _handle_pos_limit(d['head_pos'], int(cfg.pos_limit))
+
+        d['tail_pos'] = [i - d['tail_idx'] for i in range(d['seq_len'])]
+        d['tail_pos'] = _handle_pos_limit(d['tail_pos'], int(cfg.pos_limit))
+
+        if cfg.use_pcnn:
+            # PCNN cannot be used when the sentence cannot be split into three segments,
+            # e.g. [head, ... tail] or [... head, tail, ...] cannot be masked into segments in a uniform way
+            d['entities_pos'] = [1] * (entities_idx[0] + 1) + [2] * (entities_idx[1] - entities_idx[0] - 1) +\
+                                [3] * (d['seq_len'] - entities_idx[1])
+
+
+def _convert_tokens_into_index(data: List[Dict], vocab):
+    unk_str = '[UNK]'
+    unk_idx = vocab.word2idx[unk_str]
+
+    for d in data:
+        d['token2idx'] = [vocab.word2idx.get(i, unk_idx) for i in d['tokens']]
+        d['seq_len'] = len(d['token2idx'])
+
+
+def _serialize_sentence(data: List[Dict], serial, cfg):
+    for d in data:
+        sent = d['sentence'].strip()
+        sent = sent.replace(d['head'], ' head ', 1).replace(d['tail'], ' tail ', 1)
+        d['tokens'] = serial(sent, never_split=['head', 'tail'])
+        head_idx, tail_idx = d['tokens'].index('head'), d['tokens'].index('tail')
+        d['head_idx'], d['tail_idx'] = head_idx, tail_idx
+
+        if cfg.replace_entity_with_type:
+            if cfg.replace_entity_with_scope:
+                d['tokens'][head_idx], d['tokens'][tail_idx] = 'HEAD_' + d['head_type'], 'TAIL_' + d['tail_type']
+            else:
+                d['tokens'][head_idx], d['tokens'][tail_idx] = d['head_type'], d['tail_type']
+        else:
+            if cfg.replace_entity_with_scope:
+                d['tokens'][head_idx], d['tokens'][tail_idx] = 'HEAD', 'TAIL'
+            else:
+                d['tokens'][head_idx], d['tokens'][tail_idx] = d['head'], d['tail']
+
+
+def _lm_serialize(data: List[Dict], cfg):
+    logger.info('use bert tokenizer...')
+    tokenizer = BertTokenizer.from_pretrained(cfg.lm_file)
+    for d in data:
+        sent = d['sentence'].strip()
+        sent = sent.replace(d['head'], d['head_type'], 1).replace(d['tail'], d['tail_type'], 1)
+        sent += '[SEP]' + d['head'] + '[SEP]' + d['tail']
+        d['token2idx'] = tokenizer.encode(sent, add_special_tokens=True)
+        d['seq_len'] = len(d['token2idx'])
+
+
+def _add_relation_data(rels: Dict, data: List) -> None:
+    for d in data:
+        d['rel2idx'] = rels[d['relation']]['index']
+        d['head_type'] = rels[d['relation']]['head_type']
+        d['tail_type'] = rels[d['relation']]['tail_type']
+
+
+def _handle_relation_data(relation_data: List[Dict]) -> Dict:
+    rels = OrderedDict()
+    relation_data = sorted(relation_data, key=lambda i: int(i['index']))
+    for d in relation_data:
+        rels[d['relation']] = {
+            'index': int(d['index']),
+            'head_type': d['head_type'],
+            'tail_type': d['tail_type'],
+        }
+
+    return rels
+
+
+def preprocess(cfg):
+
+    logger.info('===== start preprocess data =====')
+    train_fp = os.path.join(cfg.cwd, cfg.data_path, 'train.csv')
+    valid_fp = os.path.join(cfg.cwd, cfg.data_path, 'valid.csv')
+    test_fp = os.path.join(cfg.cwd, cfg.data_path, 'test.csv')
+    relation_fp = os.path.join(cfg.cwd, cfg.data_path, 'relation.csv')
+
+    logger.info('load raw files...')
+    train_data = load_csv(train_fp)
+    valid_data = load_csv(valid_fp)
+    test_data = load_csv(test_fp)
+    relation_data = load_csv(relation_fp)
+
+    logger.info('convert relation into index...')
+    rels = _handle_relation_data(relation_data)
+    _add_relation_data(rels, train_data)
+    _add_relation_data(rels, valid_data)
+    _add_relation_data(rels, test_data)
+
+    logger.info('check whether to use a pretrained language model...')
+    if cfg.model_name == 'lm':
+        logger.info('serialize sentences with the pretrained language model...')
+        _lm_serialize(train_data, cfg)
+        _lm_serialize(valid_data, cfg)
+        _lm_serialize(test_data, cfg)
+    else:
+        logger.info('serialize sentence into tokens...')
+        serializer = Serializer(do_chinese_split=cfg.chinese_split, do_lower_case=True)
+        serial = serializer.serialize
+        _serialize_sentence(train_data, serial, cfg)
+        _serialize_sentence(valid_data, serial, cfg)
+        _serialize_sentence(test_data, serial, cfg)
+
+        logger.info('build vocabulary...')
+        vocab = Vocab('word')
+        train_tokens = [d['tokens'] for d in train_data]
+        valid_tokens = [d['tokens'] for d in valid_data]
+        test_tokens = [d['tokens'] for d in test_data]
+        sent_tokens = [*train_tokens, *valid_tokens, *test_tokens]
+        for sent in sent_tokens:
+            vocab.add_words(sent)
+        vocab.trim(min_freq=cfg.min_freq)
+
+        logger.info('convert tokens into index...')
+        _convert_tokens_into_index(train_data, vocab)
+        _convert_tokens_into_index(valid_data, vocab)
+        _convert_tokens_into_index(test_data, vocab)
+
+        logger.info('build position sequence...')
+        _add_pos_seq(train_data, cfg)
+        _add_pos_seq(valid_data, cfg)
+        _add_pos_seq(test_data, cfg)
+
+    logger.info('save data for backup...')
+    os.makedirs(os.path.join(cfg.cwd, cfg.out_path), exist_ok=True)
+    train_save_fp = os.path.join(cfg.cwd, cfg.out_path, 'train.pkl')
+    valid_save_fp = os.path.join(cfg.cwd, cfg.out_path, 'valid.pkl')
+    test_save_fp = os.path.join(cfg.cwd, cfg.out_path, 'test.pkl')
+    save_pkl(train_data, train_save_fp)
+    save_pkl(valid_data, valid_save_fp)
+    save_pkl(test_data, test_save_fp)
+
+    if cfg.model_name != 'lm':
+        vocab_save_fp = os.path.join(cfg.cwd, cfg.out_path, 'vocab.pkl')
+        vocab_txt = os.path.join(cfg.cwd, cfg.out_path, 'vocab.txt')
+        save_pkl(vocab, vocab_save_fp)
+        logger.info('save vocab to a txt file for inspection...')
+        with open(vocab_txt, 'w', encoding='utf-8') as f:
+            f.write('\n'.join(vocab.word2idx.keys()))
+
+    logger.info('===== end preprocess data =====')
+
+
+if __name__ == '__main__':
+    pass
--- a/requirements.txt
+++ b/requirements.txt
@ -1,5 +1,7 @@
 torch>=1.0
-jieba>=0.38
-pytorch_transformers>=1.2
-matplotlib>=3.0
-scikit_learn>=0.20
+tensorboard>=2.0
+matplotlib>=3.1.0
+transformers>=2.0
+hydra-core>=0.11
+jieba>=0.39
+pyhanlp
--- a/serializer.py
+++ b/serializer.py
@ -0,0 +1,203 @@
+import re
+import unicodedata
+import jieba
+import logging
+from typing import List
+
+logger = logging.getLogger(__name__)
+jieba.setLogLevel(logging.INFO)
+
+
+class Serializer():
+    def __init__(self, never_split: List = None, do_lower_case=True, do_chinese_split=False):
+        self.never_split = never_split if never_split is not None else []
+        self.do_lower_case = do_lower_case
+        self.do_chinese_split = do_chinese_split
+
+    def serialize(self, text, never_split: List = None):
+        never_split = self.never_split + (never_split if never_split is not None else [])
+        text = self._clean_text(text)
+
+        if self.do_chinese_split:
+            output_tokens = self._use_jieba_cut(text, never_split)
+            return output_tokens
+
+        text = self._tokenize_chinese_chars(text)
+        orig_tokens = self._orig_tokenize(text)
+        split_tokens = []
+        for token in orig_tokens:
+            if self.do_lower_case and token not in never_split:
+                token = token.lower()
+                token = self._run_strip_accents(token)
+            split_tokens.extend(self._run_split_on_punc(token, never_split=never_split))
+
+        output_tokens = self._whitespace_tokenize(" ".join(split_tokens))
+
+        return output_tokens
+
+    def _clean_text(self, text):
+        """Performs invalid character removal and whitespace cleanup on text."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if cp == 0 or cp == 0xfffd or self.is_control(char):
+                continue
+            if self.is_whitespace(char):
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+
+    def _use_jieba_cut(self, text, never_split):
+        for word in never_split:
+            jieba.suggest_freq(word, True)
+        tokens = jieba.lcut(text)
+        if self.do_lower_case:
+            tokens = [i.lower() for i in tokens]
+        try:
+            while True:
+                tokens.remove(' ')
+        except ValueError:  # raised by list.remove once no ' ' token is left
+            return tokens
+
+    def _tokenize_chinese_chars(self, text):
+        """Adds whitespace around any CJK character."""
+        output = []
+        for char in text:
+            cp = ord(char)
+            if self.is_chinese_char(cp):
+                output.append(" ")
+                output.append(char)
+                output.append(" ")
+            else:
+                output.append(char)
+        return "".join(output)
+
+    def _orig_tokenize(self, text):
+        """Splits text on whitespace and some punctuations like comma or period"""
+        text = text.strip()
+        if not text:
+            return []
+        # common sentence-breaking punctuation
+        punc = """,.?!;: 、｜，。？！；：《》「」【】/<>|\\“ ”‘ ’"""
+        punc_re = '|'.join(re.escape(x) for x in punc)
+        tokens = re.sub(punc_re, lambda x: ' ' + x.group() + ' ', text)
+        tokens = tokens.split()
+        return tokens
+
+    def _whitespace_tokenize(self, text):
+        """Runs basic whitespace cleaning and splitting on a piece of text."""
+        text = text.strip()
+        if not text:
+            return []
+        tokens = text.split()
+        return tokens
+
+    def _run_strip_accents(self, text):
+        """Strips accents from a piece of text."""
+        text = unicodedata.normalize("NFD", text)
+        output = []
+        for char in text:
+            cat = unicodedata.category(char)
+            if cat == "Mn":
+                continue
+            output.append(char)
+        return "".join(output)
+
+    def _run_split_on_punc(self, text, never_split=None):
+        """Splits punctuation on a piece of text."""
+        if never_split is not None and text in never_split:
+            return [text]
+        chars = list(text)
+        i = 0
+        start_new_word = True
+        output = []
+        while i < len(chars):
+            char = chars[i]
+            if self.is_punctuation(char):
+                output.append([char])
+                start_new_word = True
+            else:
+                if start_new_word:
+                    output.append([])
+                start_new_word = False
+                output[-1].append(char)
+            i += 1
+
+        return ["".join(x) for x in output]
+
+    @staticmethod
+    def is_control(char):
+        """Checks whether `char` is a control character."""
+        # These are technically control characters but we count them as whitespace
+        # characters.
+        if char == "\t" or char == "\n" or char == "\r":
+            return False
+        cat = unicodedata.category(char)
+        if cat.startswith("C"):
+            return True
+        return False
+
+    @staticmethod
+    def is_whitespace(char):
+        """Checks whether `char` is a whitespace character."""
+        # \t, \n, and \r are technically control characters but we treat them
+        # as whitespace since they are generally considered as such.
+        if char == " " or char == "\t" or char == "\n" or char == "\r":
+            return True
+        cat = unicodedata.category(char)
+        if cat == "Zs":
+            return True
+        return False
+
+    @staticmethod
+    def is_chinese_char(cp):
+        """Checks whether CP is the codepoint of a CJK character."""
+        # This defines a "chinese character" as anything in the CJK Unicode block:
+        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
+        #
+        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
+        # despite its name. The modern Korean Hangul alphabet is a different block,
+        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
+        # space-separated words, so they are not treated specially and handled
+        # like all of the other languages.
+        if ((cp >= 0x4E00 and cp <= 0x9FFF) or  #
+            (cp >= 0x3400 and cp <= 0x4DBF) or  #
+            (cp >= 0x20000 and cp <= 0x2A6DF) or  #
+            (cp >= 0x2A700 and cp <= 0x2B73F) or  #
+            (cp >= 0x2B740 and cp <= 0x2B81F) or  #
+            (cp >= 0x2B820 and cp <= 0x2CEAF) or (cp >= 0xF900 and cp <= 0xFAFF) or  #
+            (cp >= 0x2F800 and cp <= 0x2FA1F)):  #
+            return True
+
+        return False
+
+    @staticmethod
+    def is_punctuation(char):
+        """Checks whether `chars` is a punctuation character."""
+        cp = ord(char)
+        # We treat all non-letter/number ASCII as punctuation.
+        # Characters such as "^", "$", and "`" are not in the Unicode
+        # Punctuation class but we treat them as punctuation anyways, for
+        # consistency.
+        if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96)
+                or (cp >= 123 and cp <= 126)):
+            return True
+        cat = unicodedata.category(char)
+        if cat.startswith("P"):
+            return True
+        return False
+
+
+if __name__ == '__main__':
+    text1 = "\t\n你   好呀， I\'m his pupp\'peer,\n\t"
+    text2 = '你孩子的爱情叫 Stam\'s 的打到天啊呢哦'
+
+    serializer = Serializer(do_chinese_split=False)
+    print(serializer.serialize(text1))
+    print(serializer.serialize(text2))
+
+    text3 = "good\'s  head  pupp\'er, "
+    # print: ["good's", 'pupp', "'", 'er', ',']
+    # true:  ["good's", "pupp'er", ","]
+    print(serializer.serialize(text3, never_split=["pupp\'er"]))
--- a/test/test_attention.py
+++ b/test/test_attention.py
@ -0,0 +1,38 @@
+import pytest
+import torch
+from utils import seq_len_to_mask
+from module import DotAttention, MultiHeadAttention
+
+torch.manual_seed(1)
+q = torch.randn(4, 6, 20)  # [B, L, H]
+k = v = torch.randn(4, 5, 20)  # [B, S, H]
+key_padding_mask = seq_len_to_mask([5, 4, 3, 2], max_len=5)
+attention_mask = torch.tensor([1, 0, 0, 1, 0])  # positions set to 1 are masked out
+head_mask = torch.tensor([0, 1, 0, 0])  # heads set to 1 are masked out
+
+# m = DotAttention(dropout=0.0)
+# ao,aw = m(q,k,v,key_padding_mask)
+# print(ao.shape,aw.shape)
+# print(aw)
+
+
+def test_DotAttention():
+    m = DotAttention(dropout=0.0)
+    ao, aw = m(q, k, v, mask_out=key_padding_mask)
+
+    assert ao.shape == torch.Size([4, 6, 20])
+    assert aw.shape == torch.Size([4, 6, 5])
+    assert aw[1, :, -1:].eq(0).all() and aw[2, :, -2:].eq(0).all() and aw[3, :, -3:].eq(0).all()
+
+
+def test_MultiHeadAttention():
+    m = MultiHeadAttention(embed_dim=20, num_heads=4, dropout=0.0)
+    ao, aw = m(q, k, v, key_padding_mask=key_padding_mask, attention_mask=attention_mask, head_mask=head_mask)
+
+    assert ao.shape == torch.Size([4, 6, 20])
+    assert aw.shape == torch.Size([4, 4, 6, 5])
+    assert not aw.unbind(dim=1)[1].bool().any()
+
+
+if __name__ == '__main__':
+    pytest.main()
--- a/test/test_cnn.py
+++ b/test/test_cnn.py
@ -0,0 +1,32 @@
+import pytest
+import torch
+from module import CNN
+from utils import seq_len_to_mask
+
+
+class Config(object):
+    in_channels = 100
+    out_channels = 200
+    kernel_sizes = [3, 5, 7, 9, 11]
+    activation = 'gelu'
+    pooling_strategy = 'avg'
+
+
+config = Config()
+
+
+def test_CNN():
+
+    x = torch.randn(4, 5, 100)
+    seq = torch.arange(4, 0, -1)
+    mask = seq_len_to_mask(seq, max_len=5)
+
+    cnn = CNN(config)
+    out, out_pooling = cnn(x, mask=mask)
+    out_channels = config.out_channels * len(config.kernel_sizes)
+    assert out.shape == torch.Size([4, 5, out_channels])
+    assert out_pooling.shape == torch.Size([4, out_channels])
+
+
+if __name__ == '__main__':
+    pytest.main()
--- a/test/test_embedding.py
+++ b/test/test_embedding.py
@ -0,0 +1,38 @@
+import pytest
+import torch
+from module import Embedding
+
+
+class Config(object):
+    vocab_size = 10
+    word_dim = 10
+    pos_size = 12  # 2 * pos_limit + 2
+    pos_dim = 5
+    dim_strategy = 'cat'  # [cat, sum]
+
+
+config = Config()
+
+x = torch.tensor([[1, 2, 3, 4, 5], [6, 7, 3, 5, 0], [8, 4, 3, 0, 0]])
+x_pos = torch.tensor([[1, 2, 3, 4, 5], [1, 2, 3, 4, 0], [1, 2, 3, 0, 0]])
+
+
+def test_Embedding_cat():
+    embed = Embedding(config)
+    feature = embed((x, x_pos))
+    dim = config.word_dim + config.pos_dim
+
+    assert feature.shape == torch.Size((3, 5, dim))
+
+
+def test_Embedding_sum():
+    config.dim_strategy = 'sum'
+    embed = Embedding(config)
+    feature = embed((x, x_pos))
+    dim = config.word_dim
+
+    assert feature.shape == torch.Size((3, 5, dim))
+
+
+if __name__ == '__main__':
+    pytest.main()
--- a/test/test_rnn.py
+++ b/test/test_rnn.py
@ -0,0 +1,49 @@
+import pytest
+import torch
+from module import RNN
+from utils import seq_len_to_mask
+
+
+class Config(object):
+    type_rnn = 'LSTM'
+    input_size = 5
+    hidden_size = 4
+    num_layers = 3
+    dropout = 0.0
+    last_layer_hn = False
+    bidirectional = True
+
+
+config = Config()
+
+
+def test_RNN():
+    torch.manual_seed(1)
+    x = torch.tensor([[4, 3, 2, 1], [5, 6, 7, 0], [8, 10, 0, 0]])
+    x = torch.nn.Embedding(11, 5, padding_idx=0)(x)  # B,L,H = 3,4,5
+    x_len = torch.tensor([4, 3, 2])
+
+    model = RNN(config)
+    output, hn = model(x, x_len)
+
+    B, L, _ = x.size()
+    H, N = config.hidden_size, config.num_layers
+
+    assert output.shape == torch.Size([B, L, H])
+    assert hn.shape == torch.Size([B, N, H])
+
+    config.bidirectional = False
+    model = RNN(config)
+    output, hn = model(x, x_len)
+    assert output.shape == torch.Size([B, L, H])
+    assert hn.shape == torch.Size([B, N, H])
+
+    config.last_layer_hn = True
+    model = RNN(config)
+    output, hn = model(x, x_len)
+    assert output.shape == torch.Size([B, L, H])
+    assert hn.shape == torch.Size([B, H])
+
+
+if __name__ == '__main__':
+    pytest.main()
--- a/test/test_serializer.py
+++ b/test/test_serializer.py
@ -0,0 +1,36 @@
+import pytest
+from serializer import Serializer
+
+
+def test_serializer_for_no_chinese_split():
+    text1 = "\nI\'m  his pupp\'peer, and i have a ball\t"
+    text2 = '\t叫Stam一起到nba打篮球\n'
+    text3 = '\n\n现在时刻2014-04-08\t\t'
+
+    serializer = Serializer(do_chinese_split=False)
+    serial_text1 = serializer.serialize(text1)
+    serial_text2 = serializer.serialize(text2)
+    serial_text3 = serializer.serialize(text3)
+
+    assert serial_text1 == ['i', "'", 'm', 'his', 'pupp', "'", 'peer', ',', 'and', 'i', 'have', 'a', 'ball']
+    assert serial_text2 == ['叫', 'stam', '一', '起', '到', 'nba', '打', '篮', '球']
+    assert serial_text3 == ['现', '在', '时', '刻', '2014', '-', '04', '-', '08']
+
+
+def test_serializer_for_chinese_split():
+    text1 = "\nI\'m  his pupp\'peer, and i have a basketball\t"
+    text2 = '\t叫Stam一起到nba打篮球\n'
+    text3 = '\n\n现在时刻2014-04-08\t\t'
+
+    serializer = Serializer(do_chinese_split=True)
+    serial_text1 = serializer.serialize(text1)
+    serial_text2 = serializer.serialize(text2)
+    serial_text3 = serializer.serialize(text3)
+
+    assert serial_text1 == ['i', "'", 'm', 'his', 'pupp', "'", 'peer', ',', 'and', 'i', 'have', 'a', 'basketball']
+    assert serial_text2 == ['叫', 'stam', '一起', '到', 'nba', '打篮球']
+    assert serial_text3 == ['现在', '时刻', '2014', '-', '04', '-', '08']
+
+
+if __name__ == '__main__':
+    pytest.main()
--- a/test/test_transformer.py
+++ b/test/test_transformer.py
@ -0,0 +1,40 @@
+import pytest
+import torch
+from module import Transformer
+from utils import seq_len_to_mask
+
+
+class Config():
+    hidden_size = 12
+    intermediate_size = 24
+    num_hidden_layers = 5
+    num_heads = 3
+    dropout = 0.0
+    layer_norm_eps = 1e-12
+    hidden_act = 'gelu_new'
+    output_attentions = True
+    output_hidden_states = True
+
+
+config = Config()
+
+
+def test_Transformer():
+    m = Transformer(config)
+    i = torch.randn(4, 5, 12)  # [B, L, H]
+    key_padding_mask = seq_len_to_mask([5, 4, 3, 2], max_len=5)
+    attention_mask = torch.tensor([1, 0, 0, 1, 0])  # positions set to 1 are masked out
+    head_mask = torch.tensor([0, 1, 0])  # heads set to 1 are masked out
+
+    out = m(i, key_padding_mask=key_padding_mask, attention_mask=attention_mask, head_mask=head_mask)
+    hn, h_all, att_weights = out
+    assert hn.shape == torch.Size([4, 5, 12])
+    assert torch.equal(h_all[0], i) and torch.equal(h_all[-1], hn)
+    assert len(h_all) == config.num_hidden_layers + 1
+    assert len(att_weights) == config.num_hidden_layers
+    assert att_weights[0].shape == torch.Size([4, 3, 5, 5])
+    assert not att_weights[0].unbind(dim=1)[1].bool().any()
+
+
+if __name__ == '__main__':
+    pytest.main()
--- a/test/test_vocab.py
+++ b/test/test_vocab.py
@ -0,0 +1,38 @@
+import pytest
+from serializer import Serializer
+from vocab import Vocab
+
+
+def test_vocab():
+    vocab = Vocab('test')
+    sent = ' 我是中国人，我爱中国。 I\'m Chinese, I love China'
+
+    serializer = Serializer(do_lower_case=True)
+    tokens = serializer.serialize(sent)
+    assert tokens == [
+        '我', '是', '中', '国', '人', '，', '我', '爱', '中', '国', '。', 'i', "'", 'm', 'chinese', ',', 'i', 'love', 'china'
+    ]
+
+    vocab.add_words(tokens)
+    unk_str = '[UNK]'
+    unk_idx = vocab.word2idx[unk_str]
+
+    assert vocab.count == 22
+    assert len(vocab.word2idx) == len(vocab.idx2word) == len(vocab.word2count) == 22
+
+    vocab.trim(2, verbose=False)
+
+    assert vocab.count == 11
+    assert len(vocab.word2idx) == len(vocab.idx2word) == len(vocab.word2count) == 11
+
+    token2idx = [vocab.word2idx.get(i, unk_idx) for i in tokens]
+    assert len(tokens) == len(token2idx)
+    assert token2idx == [7, 1, 8, 9, 1, 1, 7, 1, 8, 9, 1, 10, 1, 1, 1, 1, 10, 1, 1]
+
+    idx2tokens = [vocab.idx2word.get(i, unk_str) for i in token2idx]
+    assert len(idx2tokens) == len(token2idx)
+    assert ' '.join(idx2tokens) == '我 [UNK] 中 国 [UNK] [UNK] 我 [UNK] 中 国 [UNK] i [UNK] [UNK] [UNK] [UNK] i [UNK] [UNK]'
+
+
+if __name__ == '__main__':
+    pytest.main()
--- a/trainer.py
+++ b/trainer.py
@ -0,0 +1,82 @@
+import torch
+import logging
+import matplotlib.pyplot as plt
+from metrics import PRMetric
+
+logger = logging.getLogger(__name__)
+
+
+def train(epoch, model, dataloader, optimizer, criterion, device, writer, cfg):
+    model.train()
+
+    metric = PRMetric()
+    losses = []
+
+    for batch_idx, (x, y) in enumerate(dataloader, 1):
+        for key, value in x.items():
+            x[key] = value.to(device)
+        y = y.to(device)
+
+        optimizer.zero_grad()
+        y_pred = model(x)
+        loss = criterion(y_pred, y)
+
+        loss.backward()
+        optimizer.step()
+
+        metric.update(y_true=y, y_pred=y_pred)
+        losses.append(loss.item())
+
+        data_total = len(dataloader.dataset)
+        data_cal = data_total if batch_idx == len(dataloader) else batch_idx * len(y)
+        if (cfg.train_log and batch_idx % cfg.log_interval == 0) or batch_idx == len(dataloader):
+            # p, r and f1 are all macro-averaged; micro-averaged they are identical, and that value is reported as acc
+            acc, p, r, f1 = metric.compute()
+            logger.info(f'Train Epoch {epoch}: [{data_cal}/{data_total} ({100. * data_cal / data_total:.0f}%)]\t'
+                        f'Loss: {loss.item():.6f}')
+            logger.info(f'Train Epoch {epoch}: Acc: {100. * acc:.2f}%\t'
+                        f'macro metrics: [p: {p:.4f}, r:{r:.4f}, f1:{f1:.4f}]')
+
+    if cfg.show_plot and not cfg.only_comparison_plot:
+        if cfg.plot_utils == 'matplot':
+            plt.plot(losses)
+            plt.title(f'epoch {epoch} train loss')
+            plt.show()
+
+        if cfg.plot_utils == 'tensorboard':
+            for i, batch_loss in enumerate(losses):
+                writer.add_scalar(f'epoch_{epoch}_training_loss', batch_loss, i)
+
+    return losses[-1]
+
+
+def validate(epoch, model, dataloader, criterion, device):
+    model.eval()
+
+    metric = PRMetric()
+    losses = []
+
+    for batch_idx, (x, y) in enumerate(dataloader, 1):
+        for key, value in x.items():
+            x[key] = value.to(device)
+        y = y.to(device)
+
+        with torch.no_grad():
+            y_pred = model(x)
+            loss = criterion(y_pred, y)
+
+            metric.update(y_true=y, y_pred=y_pred)
+            losses.append(loss.item())
+
+    loss = sum(losses) / len(losses)
+    acc, p, r, f1 = metric.compute()
+    data_total = len(dataloader.dataset)
+
+    if epoch >= 0:
+        logger.info(f'Valid Epoch {epoch}: [{data_total}/{data_total}](100%)\t Loss: {loss:.6f}')
+        logger.info(f'Valid Epoch {epoch}: Acc: {100. * acc:.2f}%\tmacro metrics: [p: {p:.4f}, r:{r:.4f}, f1:{f1:.4f}]')
+    else:
+        logger.info(f'Test Data: [{data_total}/{data_total}](100%)\t Loss: {loss:.6f}')
+        logger.info(f'Test Data: Acc: {100. * acc:.2f}%\tmacro metrics: [p: {p:.4f}, r:{r:.4f}, f1:{f1:.4f}]')
+
+    return f1, loss
--- a/utils/__init__.py
+++ b/utils/__init__.py
@ -0,0 +1,2 @@
+from .ioUtils import *
+from .nnUtils import *
--- a/utils/ioUtils.py
+++ b/utils/ioUtils.py
@ -0,0 +1,56 @@
+import os
+import csv
+import pickle
+import logging
+from typing import NewType, List, Tuple, Dict, Any
+
+__all__ = [
+    'load_pkl',
+    'save_pkl',
+    'load_csv',
+    'save_csv',
+]
+
+logger = logging.getLogger(__name__)
+
+Path = str
+
+
+def load_pkl(fp: Path, verbose: bool = True) -> Any:
+    if verbose:
+        logger.info(f'load data from {fp}')
+
+    with open(fp, 'rb') as f:
+        data = pickle.load(f)
+        return data
+
+
+def save_pkl(data: Any, fp: Path, verbose: bool = True) -> None:
+    if verbose:
+        logger.info(f'save data in {fp}')
+
+    with open(fp, 'wb') as f:
+        pickle.dump(data, f)
+
+
+def load_csv(fp: Path, is_tsv: bool = False, verbose: bool = True) -> List:
+    if verbose:
+        logger.info(f'load csv from {fp}')
+
+    dialect = 'excel-tab' if is_tsv else 'excel'
+    with open(fp, encoding='utf-8', newline='') as f:
+        reader = csv.DictReader(f, dialect=dialect)
+        return list(reader)
+
+
+def save_csv(data: List[Dict], fp: Path, save_in_tsv: bool = False, write_head=True, verbose=True) -> None:
+    if verbose:
+        logger.info(f'save csv file in: {fp}')
+
+    with open(fp, 'w', encoding='utf-8', newline='') as f:
+        fieldnames = data[0].keys()
+        dialect = 'excel-tab' if save_in_tsv else 'excel'
+        writer = csv.DictWriter(f, fieldnames=fieldnames, dialect=dialect)
+        if write_head:
+            writer.writeheader()
+        writer.writerows(data)
--- a/utils/nnUtils.py
+++ b/utils/nnUtils.py
@ -0,0 +1,51 @@
+import torch
+import random
+import logging
+import numpy as np
+from typing import List, Tuple, Dict, Union
+
+logger = logging.getLogger(__name__)
+
+__all__ = [
+    'manual_seed',
+    'seq_len_to_mask',
+]
+
+
+def manual_seed(num: int = 1) -> None:
+    random.seed(num)
+    np.random.seed(num)
+    torch.manual_seed(num)
+    torch.cuda.manual_seed(num)
+    torch.cuda.manual_seed_all(num)
+
+
+def seq_len_to_mask(seq_len: Union[List, np.ndarray, torch.Tensor], max_len=None, mask_pos_to_true=True):
+    """
+    Convert a 1-d array of sequence lengths into a 2-d mask; by default padded positions are 1 (True).
+    That is, turn a 1-d seq_len into a 2-d mask.
+
+    :param list, np.ndarray, torch.LongTensor seq_len: sequence lengths, shape (B,)
+    :param int max_len: length to pad the mask to. Defaults (None) to the longest length in seq_len. Under
+        nn.DataParallel the seq_len on each GPU may differ, so pass max_len to pad every mask to that length.
+    :return: np.ndarray, torch.Tensor of shape (B, max_length), with element type bool or torch.uint8
+    """
+    if isinstance(seq_len, list):
+        seq_len = np.array(seq_len)
+
+    if isinstance(seq_len, np.ndarray):
+        seq_len = torch.from_numpy(seq_len)
+
+    if isinstance(seq_len, torch.Tensor):
+        assert seq_len.dim() == 1, f"seq_len can only have one dimension, got {seq_len.dim()} != 1."
+        batch_size = seq_len.size(0)
+        max_len = int(max_len) if max_len else seq_len.max().long()
+        broad_cast_seq_len = torch.arange(max_len).expand(batch_size, -1).to(seq_len.device)
+        if mask_pos_to_true:
+            mask = broad_cast_seq_len.ge(seq_len.unsqueeze(1))
+        else:
+            mask = broad_cast_seq_len.lt(seq_len.unsqueeze(1))
+    else:
+        raise TypeError("Only support 1-d list or 1-d numpy.ndarray or 1-d torch.Tensor.")
+
+    return mask
--- a/vocab.py
+++ b/vocab.py
@ -0,0 +1,103 @@
+import logging
+from collections import OrderedDict
+from typing import Sequence, Optional
+
+logger = logging.getLogger(__name__)
+
+SPECIAL_TOKENS_KEYS = [
+    "pad_token",
+    "unk_token",
+    "mask_token",
+    "cls_token",
+    "sep_token",
+    "bos_token",
+    "eos_token",
+]
+
+SPECIAL_TOKENS_VALUES = [
+    "[PAD]",
+    "[UNK]",
+    "[MASK]",
+    "[CLS]",
+    "[SEP]",
+    "[BOS]",
+    "[EOS]",
+]
+
+SPECIAL_TOKENS = OrderedDict(zip(SPECIAL_TOKENS_KEYS, SPECIAL_TOKENS_VALUES))
+
+
+class Vocab(object):
+    def __init__(self, name: str = 'basic', init_tokens: Sequence = SPECIAL_TOKENS):
+        self.name = name
+        self.init_tokens = init_tokens
+        self.trimed = False
+        self.word2idx = {}
+        self.word2count = {}
+        self.idx2word = {}
+        self.count = 0
+        self._add_init_tokens()
+
+    def _add_init_tokens(self):
+        for token in self.init_tokens.values():
+            self._add_word(token)
+
+    def _add_word(self, word: str):
+        if word not in self.word2idx:
+            self.word2idx[word] = self.count
+            self.word2count[word] = 1
+            self.idx2word[self.count] = word
+            self.count += 1
+        else:
+            self.word2count[word] += 1
+
+    def add_words(self, words: Sequence):
+        for word in words:
+            self._add_word(word)
+
+    def trim(self, min_freq=2, verbose: Optional[bool] = True):
+        '''Remove words whose frequency is below min_freq from the vocabulary.
+
+        Args:
+            min_freq: minimum word frequency to keep
+        '''
+        assert min_freq == int(min_freq), f'min_freq must be integer, can\'t be {min_freq}'
+        min_freq = int(min_freq)
+        if min_freq < 2:
+            return
+        if self.trimed:
+            return
+        self.trimed = True
+
+        keep_words = []
+        new_words = []
+
+        for k, v in self.word2count.items():
+            if v >= min_freq:
+                keep_words.append(k)
+                new_words.extend([k] * v)
+        if verbose:
+            keep_len = len(keep_words)
+            total_len = len(self.word2idx) - len(self.init_tokens)
+            logger.info('vocab after trimming, keep words [{} / {}] = {:.2f}%'.format(
+                keep_len, total_len, keep_len / total_len * 100))
+
+        # Reinitialize dictionaries
+        self.word2idx = {}
+        self.word2count = {}
+        self.idx2word = {}
+        self.count = 0
+        self._add_init_tokens()
+        self.add_words(new_words)
+
+
+if __name__ == '__main__':
+    vocab = Vocab('test')
+    sent = ' 我是中国人，我爱中国。'
+    sent = list(sent)
+    print(sent)
+
+    vocab.add_words(sent)
+    print(vocab.word2count)
+    vocab.trim(2)
+    print(vocab.word2count)