update to 0.2.0

big change
This commit is contained in:
leo 2019-12-03 18:47:25 +08:00
parent 2aec6bd730
commit cb07fb64df
70 changed files with 7599 additions and 6826 deletions

View File

@ -1,13 +0,0 @@
# Contributor Code of Conduct
As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
We are committed to making participation in this project a harassment-free experience for everyone, regardless of the level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, age, or religion.
Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.
This Code of Conduct is adapted from the [Contributor Covenant](http://contributor-covenant.org), version 1.0.0, available at [http://contributor-covenant.org/version/1/0/0/](http://contributor-covenant.org/version/1/0/0/)

View File

@ -1,13 +0,0 @@
<!-- PULL REQUEST TEMPLATE -->
<!-- (Update "[ ]" to "[x]" to check a box) -->
**What kind of change does this PR introduce?** (check at least one)
- [ ] Bugfix
- [ ] Feature
- [ ] Code style update
- [ ] Refactor
- [ ] Build-related changes
- [ ] Other, please describe:
**Other information:**w

201
LICENSE
View File

@ -1,201 +0,0 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

View File

16
conf/config.yaml Normal file
View File

@ -0,0 +1,16 @@
# ??? is a mandatory value.
# you should be able to set it without open_dict
# but if you try to read it before it's set an error will get thrown.
# populated at runtime
cwd: ???
defaults:
- hydra/output: custom
- preprocess
- train
- embedding
- model: cnn

10
conf/embedding.yaml Normal file
View File

@ -0,0 +1,10 @@
# populated at runtime
vocab_size: ???
word_dim: 50
pos_size: ??? # 2 * pos_limit + 2
pos_dim: 10 # ignored when dim_strategy is sum; forced to equal word_dim
dim_strategy: sum # [cat, sum]
# number of relation types
num_relations: 11
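
The effect of `dim_strategy` on the per-token feature width can be sketched as follows. This is a minimal illustration, assuming `cat` concatenates the word embedding with the head and tail position embeddings (as the removed `Embedding` module in this commit does) and `sum` adds them element-wise. `feature_dim` is a hypothetical helper, not a function from the repository.

```python
def feature_dim(word_dim: int, pos_dim: int, dim_strategy: str) -> int:
    """Width of one token's feature vector after embedding (illustrative only)."""
    if dim_strategy == "cat":
        # word + head-position + tail-position embeddings side by side
        return word_dim + 2 * pos_dim
    if dim_strategy == "sum":
        # element-wise sum: pos_dim is ignored and treated as word_dim
        return word_dim
    raise ValueError(f"unknown dim_strategy: {dim_strategy!r}")


print(feature_dim(50, 10, "cat"))  # 70
print(feature_dim(50, 10, "sum"))  # 50
```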

View File

@ -0,0 +1,11 @@
hydra:
  run:
    # Output directory for normal runs
    dir: ./logs/${now:%Y-%m-%d_%H-%M-%S}
  sweep:
    # Output directory for sweep runs
    dir: ./logs/${now:%Y-%m-%d_%H-%M-%S}
    # Output sub directory for sweep runs.
    subdir: ${hydra.job.num}_${hydra.job.id}

6
conf/model/capsule.yaml Normal file
View File

@ -0,0 +1,6 @@
num_primary_units: 8
num_output_units: 10 # relation_type
primary_channels: 1
primary_unit_size: 768
output_unit_size: 128
num_iterations: 3

12
conf/model/cnn.yaml Normal file
View File

@ -0,0 +1,12 @@
model_name: cnn
#in_channels: 100 # taken from the embedding output; no need to set
out_channels: 100
kernel_sizes: [3, 5, 7] # must be odd so that the CNN output keeps the sentence length
activation: 'gelu' # [relu, lrelu, prelu, selu, celu, gelu, sigmoid, tanh]
pooling_strategy: 'max' # [max, avg, cls]
dropout: 0.3
# pcnn
use_pcnn: False
intermediate: 80
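
Why the kernel sizes must be odd can be checked with the standard 1-D convolution length formula. The sketch below assumes the convolution pads with `k // 2` on each side and uses stride 1, as the removed `CNN` module in this commit does; whether the new model keeps that padding rule is an assumption. `conv1d_out_len` is a hypothetical helper.

```python
def conv1d_out_len(seq_len: int, kernel_size: int, padding: int, stride: int = 1) -> int:
    """Standard 1-D convolution output length (dilation 1)."""
    return (seq_len + 2 * padding - kernel_size) // stride + 1


# Odd kernels with padding k // 2 preserve the sentence length.
for k in (3, 5, 7):
    assert conv1d_out_len(40, k, padding=k // 2) == 40

# An even kernel under the same padding rule changes the length.
print(conv1d_out_len(40, 4, padding=4 // 2))  # 41
```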

1
conf/model/gcn.yaml Normal file
View File

@ -0,0 +1 @@
num_layers: 3

12
conf/model/lm.yaml Normal file
View File

@ -0,0 +1,12 @@
model_name: lm
# lm_name = 'bert-base-chinese' # download usage
# cache file usage
#lm_file: 'bert_pretrained'
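# path to the pretrained language model, when one is used
lm_file: '/Users/leo/transformers/bert-base-chinese'
# number of transformer layers; the original BERT base has 12,
# but with little data a lower value converges faster and performs better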
num_hidden_layers: 2

10
conf/model/rnn.yaml Normal file
View File

@ -0,0 +1,10 @@
model_name: rnn
type_rnn: 'RNN' # [RNN, GRU, LSTM]
#input_size: 100 # taken from the embedding output; no need to set
hidden_size: 150 # must be even
num_layers: 2
dropout: 0.3
bidirectional: True
last_layer_hn: True

View File

@ -0,0 +1,9 @@
hidden_size: 128
intermediate_size: 256
num_hidden_layers: 3
num_heads: 4
dropout: 0.1
layer_norm_eps: 1e-12
hidden_act: gelu_new
output_attentions: True
output_hidden_states: True

26
conf/preprocess.yaml Normal file
View File

@ -0,0 +1,26 @@
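# whether to preprocess the data
# no need to re-run when the preprocessing parameters have not changed
preprocess: True
# location of the raw data
data_path: 'data/origin'
# location of the preprocessed output files
out_path: 'data/out'
# whether to run word segmentation
chinese_split: True
# whether to replace entity words with their entity type
replace_entity_with_type: True
# whether to replace entity words with triple head/tail markers
replace_entity_with_scope: True
# minimum word frequency when building the vocab
min_freq: 3
# sentence length limit: bounds each word's position relative to an entity
# e.g. [-30, 30] is shifted by +31 at embedding time to become [1, 61]
# giving 62 position tokens in total, with 0 reserved for pad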
pos_limit: 30
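
The position arithmetic described in the comments can be sketched as below. The shift by `pos_limit + 1` and the size `2 * pos_limit + 2` come from the config comments; clipping out-of-range positions to the limit is an assumption about how the preprocessing handles long sentences. Both function names are hypothetical.

```python
def position_index(rel_pos: int, pos_limit: int = 30) -> int:
    """Map a relative position to an embedding index; index 0 stays free for padding."""
    # Assumed behaviour: positions beyond the limit are clipped to it.
    clipped = max(-pos_limit, min(pos_limit, rel_pos))
    return clipped + pos_limit + 1


def pos_size(pos_limit: int = 30) -> int:
    """Number of position tokens, including the pad token at index 0."""
    return 2 * pos_limit + 2


print(position_index(-30))  # 1
print(position_index(30))   # 61
print(pos_size())           # 62
```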

21
conf/train.yaml Normal file
View File

@ -0,0 +1,21 @@
seed: 1
use_gpu: True
gpu_id: 0
epoch: 50
batch_size: 32
learning_rate: 3e-4
lr_factor: 0.7 # learning-rate decay factor
lr_patience: 3 # epochs to wait before decaying the learning rate
weight_decay: 1e-3 # L2 regularization
early_stopping_patience: 6
train_log: True
log_interval: 10
show_plot: True
only_comparison_plot: False
plot_utils: matplot # [matplot, tensorboard]
predict_plot: True
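
How `lr_factor`, `lr_patience` and `early_stopping_patience` might interact is sketched below. The training script is not shown in this part of the diff, so this is only an assumed plateau-style schedule (decay after more than `lr_patience` epochs without improvement, stop after `early_stopping_patience` such epochs), written in plain Python. `simulate_schedule` is a hypothetical name.

```python
def simulate_schedule(val_losses, lr=3e-4, lr_factor=0.7, lr_patience=3,
                      early_stopping_patience=6):
    """Return [(epoch, lr)] for a simple plateau decay with early stopping."""
    best = float("inf")
    bad_since_decay = 0   # epochs without improvement since the last decay
    bad_total = 0         # epochs without improvement since the last best
    history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, bad_since_decay, bad_total = loss, 0, 0
        else:
            bad_since_decay += 1
            bad_total += 1
            if bad_since_decay > lr_patience:
                lr *= lr_factor
                bad_since_decay = 0
        history.append((epoch, lr))
        if bad_total >= early_stopping_patience:
            break  # early stopping
    return history


history = simulate_schedule([1.0, 0.9] + [0.95] * 10)
print(len(history))  # number of epochs actually run
```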

View File

@ -1,6 +0,0 @@
sentence,head,head_type,head_offset,tail,tail_type,tail_offset
“逆袭”系列微电影《宝贝》由优酷土豆股份有限公司于2012年出品,宝贝,影视作品,10,优酷土豆股份有限公司,企业,14
位于伦敦东南方的格林威治,为地球经线的起始点,格林威治,景点,8,伦敦,城市,2
崔恒源 男,1950年3月生,祖籍河南省孟县,现任孟县无缝钢管厂党委书记、厂长,崔恒源,人物,0,河南省孟县,地点,17
帅长斌,男,1964年6月生,江西九江人,帅长斌,人物,0,江西九江,地点,15
图为《西游记》拍摄幕后照片,猪八戒的大耳朵都掉了一只,可见当时拍摄条件实在有限,但是导演杨洁精益求精,使得这部电视剧成为经典,西游记,影视作品,3,杨洁,人物,44

12
data/origin/relation.csv Normal file
View File

@ -0,0 +1,12 @@
head_type,tail_type,relation,index
None,None,None,0
影视作品,人物,导演,1
人物,国家,国籍,2
人物,地点,祖籍,3
电视综艺,人物,主持人,4
人物,地点,出生地,5
景点,城市,所在城市,6
歌曲,音乐专辑,所属专辑,7
网络小说,网站,连载网站,8
影视作品,企业,出品公司,9
人物,学校,毕业院校,10
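
A small loader makes the structure of `relation.csv` concrete. This sketch uses only the standard library and the rows listed above; `load_relations` is a hypothetical helper, not the project's own loader. It also shows that the 11 rows (including the `None` relation at index 0) match `num_relations: 11`, and that an entity-type pair alone does not identify a relation.

```python
import csv
import io

RELATION_CSV = """head_type,tail_type,relation,index
None,None,None,0
影视作品,人物,导演,1
人物,国家,国籍,2
人物,地点,祖籍,3
电视综艺,人物,主持人,4
人物,地点,出生地,5
景点,城市,所在城市,6
歌曲,音乐专辑,所属专辑,7
网络小说,网站,连载网站,8
影视作品,企业,出品公司,9
人物,学校,毕业院校,10
"""


def load_relations(fp):
    """Return (relation -> index, (head_type, tail_type) -> [relations])."""
    rel2idx, by_types = {}, {}
    for row in csv.DictReader(fp):
        rel2idx[row["relation"]] = int(row["index"])
        by_types.setdefault((row["head_type"], row["tail_type"]), []).append(row["relation"])
    return rel2idx, by_types


rel2idx, by_types = load_relations(io.StringIO(RELATION_CSV))
print(len(rel2idx))                 # 11
print(by_types[("人物", "地点")])    # ['祖籍', '出生地']
```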

View File

@ -1,10 +0,0 @@
国籍
祖籍
导演
出生地
主持人
所在城市
所属专辑
连载网站
出品公司
毕业院校

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

1001
data/origin/valid.csv Normal file

File diff suppressed because it is too large Load Diff

79
dataset.py Normal file
View File

@ -0,0 +1,79 @@
import torch
from torch.utils.data import Dataset
from utils import load_pkl
def collate_fn(cfg):
    def collate_fn_intra(batch):
        # sort longest-first so that batch[0] gives the padding length
        batch.sort(key=lambda data: data['seq_len'], reverse=True)
        max_len = batch[0]['seq_len']

        def _padding(x, max_len):
            return x + [0] * (max_len - len(x))

        x, y = dict(), []
        word, word_len = [], []
        head_pos, tail_pos = [], []
        pcnn_mask = []
        for data in batch:
            word.append(_padding(data['token2idx'], max_len))
            word_len.append(data['seq_len'])
            y.append(int(data['rel2idx']))
            if cfg.model_name != 'lm':
                head_pos.append(_padding(data['head_pos'], max_len))
                tail_pos.append(_padding(data['tail_pos'], max_len))
                if cfg.use_pcnn:
                    pcnn_mask.append(_padding(data['entities_pos'], max_len))
        x['word'] = torch.tensor(word)
        x['lens'] = torch.tensor(word_len)
        y = torch.tensor(y)
        if cfg.model_name != 'lm':
            x['head_pos'] = torch.tensor(head_pos)
            x['tail_pos'] = torch.tensor(tail_pos)
        if cfg.model_name == 'cnn' and cfg.use_pcnn:
            x['pcnn_mask'] = torch.tensor(pcnn_mask)
        return x, y

    return collate_fn_intra


class CustomDataset(Dataset):
    """Stores the data in a list by default."""
    def __init__(self, fp):
        self.file = load_pkl(fp)

    def __getitem__(self, item):
        sample = self.file[item]
        return sample

    def __len__(self):
        return len(self.file)


if __name__ == '__main__':
    from types import SimpleNamespace
    from torch.utils.data import DataLoader
    train_data_path = 'data/out/train.pkl'
    vocab_path = 'data/out/vocab.pkl'
    unk_str = 'UNK'
    vocab = load_pkl(vocab_path)
    train_ds = CustomDataset(train_data_path)
    # collate_fn is a factory and must be called with a config first;
    # this stand-in config uses assumed values for the smoke test only.
    cfg = SimpleNamespace(model_name='cnn', use_pcnn=False)
    train_dl = DataLoader(train_ds, batch_size=4, shuffle=True, collate_fn=collate_fn(cfg), drop_last=False)
    for batch_idx, (x, y) in enumerate(train_dl):
        word = x['word']
        for idx in word:
            idx2token = ''.join([vocab.idx2word.get(i, unk_str) for i in idx.numpy()])
            print(idx2token)
        print(y)
        break
        # x, y = x.to(device), y.to(device)
        # optimizer.zero_grad()
        # y_pred = models(y)
        # loss = criterion(y_pred, y)
        # loss.backward()
        # optimizer.step()
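
The padding logic in `collate_fn` can be tried without PyTorch or the pickled data. The sketch below mirrors only the sort-then-pad step on plain lists; `pad_batch` and the toy samples are hypothetical, and tensors, position features and the PCNN mask are left out.

```python
def pad_batch(batch, pad_id=0):
    """Sort samples longest-first and right-pad token ids to the longest length."""
    batch = sorted(batch, key=lambda d: d["seq_len"], reverse=True)
    max_len = batch[0]["seq_len"]
    words = [d["token2idx"] + [pad_id] * (max_len - len(d["token2idx"])) for d in batch]
    lens = [d["seq_len"] for d in batch]
    labels = [int(d["rel2idx"]) for d in batch]
    return words, lens, labels


toy = [
    {"token2idx": [5, 6], "seq_len": 2, "rel2idx": 1},
    {"token2idx": [7, 8, 9, 4], "seq_len": 4, "rel2idx": 3},
]
words, lens, labels = pad_batch(toy)
print(words)   # [[7, 8, 9, 4], [5, 6, 0, 0]]
print(lens)    # [4, 2]
print(labels)  # [3, 1]
```

Sorting longest-first keeps the lengths in the order that packed RNN sequences traditionally expect, which is presumably why `x['lens']` is returned alongside the padded words.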

View File

@ -1,97 +0,0 @@
class TrainingConfig(object):
    seed = 1
    use_gpu = True
    gpu_id = 0
    epoch = 30
    learning_rate = 1e-3
    decay_rate = 0.5
    decay_patience = 3
    batch_size = 64
    train_log = True
    log_interval = 10
    show_plot = True
    f1_norm = ['macro', 'micro']


class ModelConfig(object):
    word_dim = 50
    pos_size = 102 # 2 * pos_limit + 2
    pos_dim = 5
    feature_dim = 60 # 50 + 5 * 2
    hidden_dim = 100
    dropout = 0.3


class CNNConfig(object):
    use_pcnn = True
    out_channels = 100
    kernel_size = [3, 5, 7]


class RNNConfig(object):
    lstm_layers = 3
    last_hn = False


class GCNConfig(object):
    num_layers = 3


class TransformerConfig(object):
    transformer_layers = 3


class CapsuleConfig(object):
    num_primary_units = 8
    num_output_units = 10 # relation_type
    primary_channels = 1
    primary_unit_size = 768
    output_unit_size = 128
    num_iterations = 3


class LMConfig(object):
    # lm_name = 'bert-base-chinese' # download usage
    # cache file usage
    lm_file = 'bert_pretrained'
    # number of transformer layers; the original BERT base has 12,
    # but with little data a lower value converges faster and performs better
    num_hidden_layers = 2


class Config(object):
    # location of the raw data
    data_path = 'data/origin'
    # location of the preprocessed output files
    out_path = 'data/out'
    # whether to replace entities in the sentence with their entity type
    replace_entity_by_type = True
    # whether the data is Chinese
    is_chinese = True
    # whether word segmentation is needed
    word_segment = True
    # number of relation types
    relation_type = 10
    # minimum word frequency when building the vocab
    min_freq = 2
    # position limit
    pos_limit = 50 # [-50, 50]
    # (CNN, RNN, GCN, Transformer, Capsule, LM)
    model_name = 'Capsule'
    training = TrainingConfig()
    model = ModelConfig()
    cnn = CNNConfig()
    rnn = RNNConfig()
    gcn = GCNConfig()
    transformer = TransformerConfig()
    capsule = CapsuleConfig()
    lm = LMConfig()


config = Config()

View File

@ -1,64 +0,0 @@
import torch
from torch.utils.data import Dataset
from deepke.utils import load_pkl
from deepke.config import config
class CustomDataset(Dataset):
    def __init__(self, fp):
        self.file = load_pkl(fp)

    def __getitem__(self, item):
        sample = self.file[item]
        return sample

    def __len__(self):
        return len(self.file)


def collate_fn(batch):
    batch.sort(key=lambda data: data['seq_len'], reverse=True)
    max_len = 0
    for data in batch:
        if data['seq_len'] > max_len:
            max_len = data['seq_len']

    def _padding(x, max_len):
        return x + [0] * (max_len - len(x))

    if config.model_name == 'LM':
        x, y = [], []
        for data in batch:
            x.append(_padding(data['lm_idx'], max_len))
            y.append(data['target'])
        return torch.tensor(x), torch.tensor(y)
    else:
        sent, head_pos, tail_pos, mask_pos = [], [], [], []
        y = []
        for data in batch:
            sent.append(_padding(data['word2idx'], max_len))
            head_pos.append(_padding(data['head_pos'], max_len))
            tail_pos.append(_padding(data['tail_pos'], max_len))
            mask_pos.append(_padding(data['mask_pos'], max_len))
            y.append(data['target'])
        return torch.tensor(sent), torch.tensor(head_pos), torch.tensor(tail_pos), torch.tensor(
            mask_pos), torch.tensor(y)


if __name__ == '__main__':
    from torch.utils.data import DataLoader
    vocab_path = '../data/out/vocab.pkl'
    train_data_path = '../data/out/train.pkl'
    vocab = load_pkl(vocab_path)
    train_dataset = CustomDataset(train_data_path)
    dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)
    for idx, (*x, y) in enumerate(dataloader):
        print(x)
        print(y)
        break

View File

@ -1,34 +0,0 @@
import torch
import torch.nn as nn
import time
from deepke.utils import ensure_dir
class BasicModule(nn.Module):
    '''
    Wraps nn.Module and provides save and load methods.
    '''
    def __init__(self):
        super(BasicModule, self).__init__()
        self.model_name = str(type(self))

    def load(self, path):
        '''
        Load the model from the given path.
        '''
        self.load_state_dict(torch.load(path))

    def save(self, epoch=0, name=None):
        '''
        Save the model; by default the file name is the model name plus a timestamp.
        '''
        prefix = 'checkpoints/'
        ensure_dir(prefix)
        if name is None:
            name = prefix + self.model_name + '_' + f'epoch{epoch}_'
            name = time.strftime(name + '%m%d_%H:%M:%S.pth')
        else:
            name = prefix + name + '_' + self.model_name + '_' + f'epoch{epoch}_'
            name = time.strftime(name + '%m%d_%H:%M:%S.pth')
        torch.save(self.state_dict(), name)
        return name
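
The file name that `save` builds can be reproduced without PyTorch. The sketch below is a hypothetical variant, not the repository's code: it keeps the same `prefix + [name_] + model_name + _epochN_ + timestamp` layout but swaps the `:` separators for `-`, since colons are not allowed in Windows file names, and formats only the timestamp rather than passing the whole name through `time.strftime`. Note also that the default `model_name` in `BasicModule` is `str(type(self))`, which yields a string of the form `<class '...'>` unless a subclass overrides it.

```python
import time


def checkpoint_name(model_name, epoch, prefix="checkpoints/", tag=None, now=None):
    """Build a checkpoint file name; `now` is an optional time.struct_time for testing."""
    stamp = time.strftime("%m%d_%H-%M-%S", now or time.localtime())
    parts = ([tag] if tag else []) + [model_name, f"epoch{epoch}", stamp]
    return prefix + "_".join(parts) + ".pth"


fixed = time.strptime("2019-12-03 18:47:25", "%Y-%m-%d %H:%M:%S")
print(checkpoint_name("CNN", 3, now=fixed))
# checkpoints/CNN_epoch3_1203_18-47-25.pth
```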

View File

@ -1,88 +0,0 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from deepke.model import BasicModule, Embedding
class CNN(BasicModule):
    def __init__(self, vocab_size, config):
        super(CNN, self).__init__()
        self.model_name = 'CNN'
        self.vocab_size = vocab_size
        self.word_dim = config.model.word_dim
        self.pos_size = config.model.pos_size
        self.pos_dim = config.model.pos_dim
        self.hidden_dim = config.model.hidden_dim
        self.dropout = config.model.dropout
        self.use_pcnn = config.cnn.use_pcnn
        self.out_channels = config.cnn.out_channels
        self.kernel_size = config.cnn.kernel_size
        self.out_dim = config.relation_type
        if isinstance(self.kernel_size, int):
            self.kernel_size = [self.kernel_size]
        for k in self.kernel_size:
            assert k % 2 == 1, "kernel size has to be odd numbers."
        self.embedding = Embedding(self.vocab_size, self.word_dim, self.pos_size, self.pos_dim)
        # PCNN embedding
        self.mask_embed = nn.Embedding(4, 3)
        masks = torch.tensor([[0, 0, 0], [100, 0, 0], [0, 100, 0], [0, 0, 100]])
        self.mask_embed.weight.data.copy_(masks)
        self.mask_embed.weight.requires_grad = False
        self.input_dim = self.word_dim + self.pos_dim * 2
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=self.input_dim,
                      out_channels=self.out_channels,
                      kernel_size=k,
                      padding=k // 2,
                      bias=None) for k in self.kernel_size
        ])
        self.conv_dim = len(self.kernel_size) * self.out_channels
        if self.use_pcnn:
            self.conv_dim *= 3
        self.fc1 = nn.Linear(self.conv_dim, self.hidden_dim)
        self.fc2 = nn.Linear(self.hidden_dim, self.out_dim)
        self.dropout = nn.Dropout(self.dropout)

    def forward(self, input):
        *x, mask = input
        x = self.embedding(x)
        mask_embed = self.mask_embed(mask)
        # [B,L,C] -> [B,C,L]
        x = torch.transpose(x, 1, 2)
        # CNN
        x = [F.leaky_relu(conv(x)) for conv in self.convs]
        x = torch.cat(x, dim=1)
        # mask
        mask = mask.unsqueeze(1) # B x 1 x L
        x = x.masked_fill_(mask.eq(0), float('-inf'))
        if self.use_pcnn:
            # triple max_pooling
            x = x.unsqueeze(-1).permute(0, 2, 1, 3) # [B, L, C, 1]
            mask_embed = mask_embed.unsqueeze(-2) # [B, L, 1, 3]
            x = x + mask_embed # [B, L, C, 3]
            x = torch.max(x, dim=1)[0] - 100 # [B, C, 3]
            x = x.view(x.size(0), -1) # [B, 3*C]
        else:
            # max_pooling
            x = F.max_pool1d(x, x.size(-1)).squeeze(-1) # [[B,C],..]
        # dropout
        x = self.dropout(x)
        # linear
        x = F.leaky_relu(self.fc1(x))
        x = F.leaky_relu(self.fc2(x))
        return x


if __name__ == '__main__':
    pass
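
The "triple max_pooling" branch relies on an offset trick: adding 100 to the positions of one segment, taking the maximum over the sentence, then subtracting 100 leaves the maximum of that segment alone. The plain-Python sketch below illustrates the idea for a single channel; it assumes segment ids 1, 2, 3 (0 for padding), mirroring the rows of `mask_embed`, and that activations differ by less than the offset. `piecewise_max` is a hypothetical name.

```python
def piecewise_max(values, segments, offset=100.0):
    """Max of `values` within each of segments 1, 2 and 3, via the add-offset trick."""
    pooled = []
    for seg in (1, 2, 3):
        # lift only the positions that belong to this segment
        shifted = [v + (offset if s == seg else 0.0) for v, s in zip(values, segments)]
        pooled.append(max(shifted) - offset)
    return pooled


values = [0.5, 2.0, -1.0, 3.0, 0.25]
segments = [1, 1, 2, 3, 3]
print(piecewise_max(values, segments))  # [2.0, -1.0, 3.0]
```

The trick avoids slicing each sentence at its entity positions, at the cost of being correct only while the spread of activations stays below the offset and every segment is non-empty.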

View File

@ -1,206 +0,0 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from deepke.model import BasicModule, Embedding, VarLenLSTM
class Capsule(BasicModule):
def __init__(self, vocab_size, config):
super(Capsule, self).__init__()
self.model_name = 'Capsule'
self.vocab_size = vocab_size
self.word_dim = config.model.word_dim
self.pos_size = config.model.pos_size
self.pos_dim = config.model.pos_dim
self.hidden_dim = config.model.hidden_dim
self.num_primary_units = config.capsule.num_primary_units
self.num_output_units = config.capsule.num_output_units
self.primary_channels = config.capsule.primary_channels
self.primary_unit_size = config.capsule.primary_unit_size
self.output_unit_size = config.capsule.output_unit_size
self.num_iterations = config.capsule.num_iterations
self.embedding = Embedding(self.vocab_size, self.word_dim, self.pos_size, self.pos_dim)
self.input_dim = self.word_dim + self.pos_dim * 2
self.lstm = VarLenLSTM(
self.input_dim,
self.hidden_dim,
)
self.capsule = CapsuleNet(self.num_primary_units, self.num_output_units, self.primary_channels,
self.primary_unit_size, self.output_unit_size, self.num_iterations)
def forward(self, input):
*x, mask = input
x = self.embedding(x)
x_lens = torch.sum(mask.gt(0), dim=-1)
_, hn = self.lstm(x, x_lens)
out = self.capsule(hn)
return out # B, num_output_units, output_unit_size
def predict(self, output):
v_mag = torch.sqrt((output**2).sum(dim=2, keepdim=False))
pred = v_mag.argmax(1, keepdim=False)
return pred
def loss(self, input, target, size_average=True):
batch_size = input.size(0)
v_mag = torch.sqrt((input**2).sum(dim=2, keepdim=True))
max_l = torch.relu(0.9 - v_mag).view(batch_size, -1)**2
max_r = torch.relu(v_mag - 0.1).view(batch_size, -1)**2
loss_lambda = 0.5
T_c = target
L_c = T_c * max_l + loss_lambda * (1.0 - T_c) * max_r
L_c = L_c.sum(dim=1)
if size_average:
L_c = L_c.mean()
return L_c
class CapsuleNet(nn.Module):
def __init__(self, num_primary_units, num_output_units, primary_channels, primary_unit_size, output_unit_size,
num_iterations):
super(CapsuleNet, self).__init__()
self.primary = CapsuleLayer(in_units=0,
out_units=num_primary_units,
in_channels=primary_channels,
unit_size=primary_unit_size,
use_routing=False,
num_iterations=0)
self.iteration = CapsuleLayer(in_units=num_primary_units,
out_units=num_output_units,
in_channels=primary_unit_size,
unit_size=output_unit_size,
use_routing=True,
num_iterations=num_iterations)
def forward(self, input):
return self.iteration(self.primary(input))
class ConvUnit(nn.Module):
def __init__(self, in_channels):
super(ConvUnit, self).__init__()
self.conv0 = nn.Conv1d(
in_channels=in_channels,
out_channels=8, # fixme constant
kernel_size=9, # fixme constant
stride=2, # fixme constant
bias=True)
def forward(self, x):
return self.conv0(x)
class CapsuleLayer(nn.Module):
def __init__(self, in_units, out_units, in_channels, unit_size, use_routing, num_iterations):
super(CapsuleLayer, self).__init__()
self.in_units = in_units
self.out_units = out_units
self.in_channels = in_channels
self.unit_size = unit_size
self.use_routing = use_routing
if self.use_routing:
self.W = nn.Parameter(torch.randn(1, in_channels, out_units, unit_size, in_units))
self.num_iterations = num_iterations
else:
def create_conv_unit(unit_idx):
unit = ConvUnit(in_channels=in_channels)
self.add_module("unit_" + str(unit_idx), unit)
return unit
self.units = [create_conv_unit(i) for i in range(self.out_units)]
@staticmethod
def squash(s):
# This is equation 1 from the paper.
mag_sq = torch.sum(s**2, dim=2, keepdim=True)
mag = torch.sqrt(mag_sq)
s = (mag_sq / (1.0 + mag_sq)) * (s / mag)
return s
def forward(self, x):
if self.use_routing:
return self.routing(x)
else:
return self.no_routing(x)
def no_routing(self, x):
# Each unit will be (batch, channels, feature).
u = [self.units[i](x) for i in range(self.out_units)]
# Stack all unit outputs (batch, unit, channels, feature).
u = torch.stack(u, dim=1)
# Flatten to (batch, unit, output).
u = u.view(x.size(0), self.out_units, -1)
# Return squashed outputs.
return CapsuleLayer.squash(u)
def routing(self, x):
batch_size = x.size(0)
# (batch, in_units, features) -> (batch, features, in_units)
x = x.transpose(1, 2)
# (batch, features, in_units) -> (batch, features, out_units, in_units, 1)
x = torch.stack([x] * self.out_units, dim=2).unsqueeze(4)
# (batch, features, out_units, unit_size, in_units)
W = torch.cat([self.W] * batch_size, dim=0)
# Transform inputs by weight matrix.
# (batch_size, features, out_units, unit_size, 1)
u_hat = torch.matmul(W, x)
# Initialize routing logits to zero.
b_ij = torch.zeros(1, self.in_channels, self.out_units, 1).to(x.device)
# Iterative routing.
num_iterations = self.num_iterations
for iteration in range(num_iterations):
# Convert routing logits to softmax.
c_ij = F.softmax(b_ij, dim=1)
# (batch, features, out_units, 1, 1)
c_ij = torch.cat([c_ij] * batch_size, dim=0).unsqueeze(4)
# Apply routing (c_ij) to weighted inputs (u_hat).
# (batch_size, 1, out_units, unit_size, 1)
s_j = (c_ij * u_hat).sum(dim=1, keepdim=True)
# (batch_size, 1, out_units, unit_size, 1)
v_j = CapsuleLayer.squash(s_j)
# (batch_size, features, out_units, unit_size, 1)
v_j1 = torch.cat([v_j] * self.in_channels, dim=1)
# (1, features, out_units, 1)
u_vj1 = torch.matmul(u_hat.transpose(3, 4), v_j1).squeeze(4).mean(dim=0, keepdim=True)
# Update b_ij (routing)
b_ij = u_vj1
# (batch_size, out_units, unit_size, 1)
return v_j.squeeze()
if __name__ == '__main__':
net = CapsuleNet(num_primary_units=8,
num_output_units=13,
primary_channels=10,
primary_unit_size=8,
output_unit_size=20,
num_iterations=5)
inputs = torch.randn(4, 10, 10)
outs = net(inputs)
print(outs.shape) # (4, 13, 20)
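
The `squash` function and the margin loss in `Capsule.loss` can be checked on plain Python lists. The sketch below follows the same formulas as the code above (squash scales a vector by `|s|^2 / (1 + |s|^2)` along its direction; the loss uses margins 0.9 and 0.1 with weight 0.5 on the negative term) but works on one vector and one sample at a time. The function names and signatures are hypothetical. Like the original, `squash` divides by zero for an all-zero vector.

```python
import math


def squash(s):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    mag_sq = sum(x * x for x in s)
    mag = math.sqrt(mag_sq)
    scale = mag_sq / (1.0 + mag_sq) / mag
    return [scale * x for x in s]


def margin_loss(lengths, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss for one sample: lengths are capsule norms, target is one-hot."""
    total = 0.0
    for v, t in zip(lengths, target):
        total += t * max(0.0, m_pos - v) ** 2
        total += lam * (1 - t) * max(0.0, v - m_neg) ** 2
    return total


v = squash([3.0, 4.0])
print(math.sqrt(sum(x * x for x in v)))      # length is 25/26 of a unit vector
print(margin_loss([0.95, 0.05], [1, 0]))     # 0.0: both capsules are within their margins
```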

View File

@ -1,19 +0,0 @@
import torch
import torch.nn as nn
class Embedding(nn.Module):
    def __init__(self, vocab_size: int, word_dim: int, pos_size: int, pos_dim: int):
        super(Embedding, self).__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.head_pos_embed = nn.Embedding(pos_size, pos_dim, padding_idx=0)
        self.tail_pos_embed = nn.Embedding(pos_size, pos_dim, padding_idx=0)

    def forward(self, x):
        words, head_pos, tail_pos = x
        word_embed = self.word_embed(words)
        head_pos_embed = self.head_pos_embed(head_pos)
        tail_pos_embed = self.tail_pos_embed(tail_pos)
        feature_embed = torch.cat([word_embed, head_pos_embed, tail_pos_embed], dim=-1)
        return feature_embed

View File

@ -1,138 +0,0 @@
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from deepke.model import BasicModule, Embedding
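# Buggy for now, mainly because no good tool for Chinese dependency parsing was found.
# Tried hanlp and stanford_nlp; both require installing Java packages (an old Java 6 at that), and testing turned up plenty of bugs.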
class GCN(BasicModule):
def __init__(self, vocab_size, config):
super(GCN, self).__init__()
self.model_name = 'GCN'
self.vocab_size = vocab_size
self.word_dim = config.model.word_dim
self.pos_size = config.model.pos_size
self.pos_dim = config.model.pos_dim
self.hidden_dim = config.model.hidden_dim
self.dropout = config.model.dropout
self.num_layers = config.gcn.num_layers
self.out_dim = config.relation_type
self.embedding = Embedding(self.vocab_size, self.word_dim, self.pos_size, self.pos_dim)
self.input_dim = self.word_dim + self.pos_dim * 2
self.fc1 = nn.Linear(self.input_dim, self.hidden_dim)
self.fc2 = nn.Linear(self.hidden_dim, self.hidden_dim)
self.fc3 = nn.Linear(self.hidden_dim, self.out_dim)
self.dropout = nn.Dropout(self.dropout)
def forward(self, input):
*x, adj, mask = input
x = self.embedding(x)
for i in range(1, self.num_layers + 1):
if i == 1 == self.num_layers:
out = self.fc1(torch.bmm(adj, x))
elif i == self.num_layers:
out = self.fc3(torch.bmm(adj, x))
else:
out = F.relu(self.fc2(torch.bmm(adj, x)))
return out
class Tree(object):
def __init__(self):
self.parent = None
self.num_children = 0
self.children = list()
def add_child(self, child):
child.parent = self
self.num_children += 1
self.children.append(child)
def size(self):
s = getattr(self, '_size', -1)
if s != -1:
return self._size
else:
count = 1
for i in range(self.num_children):
count += self.children[i].size()
self._size = count
return self._size
def __iter__(self):
yield self
for c in self.children:
for x in c:
yield x
def depth(self):
d = getattr(self, '_depth', -1)
if d != -1:
return self._depth
else:
count = 0
if self.num_children > 0:
for i in range(self.num_children):
child_depth = self.children[i].depth()
if child_depth > count:
count = child_depth
count += 1
self._depth = count
return self._depth
def head_to_adj(head, directed=True, self_loop=False):
"""
Convert a sequence of head indexes to an (numpy) adjacency matrix.
"""
seq_len = len(head)
head = head[:seq_len]
root = None
nodes = [Tree() for _ in head]
for i in range(seq_len):
h = head[i]
setattr(nodes[i], 'idx', i)
if h == 0:
root = nodes[i]
else:
nodes[h - 1].add_child(nodes[i])
assert root is not None
ret = np.zeros((seq_len, seq_len), dtype=np.float32)
queue = [root]
idx = []
while len(queue) > 0:
t, queue = queue[0], queue[1:]
idx += [t.idx]
for c in t.children:
ret[t.idx, c.idx] = 1
queue += t.children
if not directed:
ret = ret + ret.T
if self_loop:
for i in idx:
ret[i, i] = 1
return ret
if __name__ == '__main__':
inputs = torch.tensor([list(range(6))])
embedding = nn.Embedding(10, 10)
inputs = embedding(inputs)
head = [2, 0, 5, 3, 2, 2]
adj = head_to_adj(head, directed=False, self_loop=True)
print(adj)
adj = torch.tensor([adj])
model = GCN(10, 10)
outs = model(adj, inputs)
print(outs.shape)
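
A minimal pure-Python sketch of the head-to-adjacency conversion performed by `head_to_adj` above, assuming the head list describes a valid tree (1-based head indexes, 0 for the root). The name `heads_to_adjacency` is hypothetical; unlike the original it skips the `Tree` traversal and returns nested lists rather than a NumPy array.

```python
from typing import List


def heads_to_adjacency(heads: List[int], directed: bool = True,
                       self_loop: bool = False) -> List[List[int]]:
    """Build an adjacency matrix from 1-based head indexes (0 marks the root)."""
    n = len(heads)
    adj = [[0] * n for _ in range(n)]
    for child, head in enumerate(heads):
        if head == 0:
            continue  # the root has no parent
        parent = head - 1  # heads are 1-based, matrix rows are 0-based
        adj[parent][child] = 1
        if not directed:
            adj[child][parent] = 1  # mirror the edge for an undirected graph
    if self_loop:
        for i in range(n):
            adj[i][i] = 1
    return adj


# Same head list as the test block above.
adj = heads_to_adjacency([2, 0, 5, 3, 2, 2], directed=False, self_loop=True)
for row in adj:
    print(row)
```

With `directed=False` the matrix is symmetric, which is the form a GCN layer usually multiplies with the node features.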

View File

@ -1,20 +0,0 @@
import torch.nn as nn
from deepke.model import BasicModule
from pytorch_transformers import BertModel


class LM(BasicModule):
    def __init__(self, vocab_size, config):
        super(LM, self).__init__()
        self.model_name = 'LM'
        self.lm_name = config.lm.lm_file
        self.out_dim = config.relation_type
        self.lm = BertModel.from_pretrained(self.lm_name, num_hidden_layers=config.lm.num_hidden_layers)
        self.fc = nn.Linear(768, self.out_dim)

    def forward(self, x):
        x = x[0]
        out = self.lm(x)[0][:, 0]
        out = self.fc(out)
        return out

View File

@ -1,101 +0,0 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from deepke.model import BasicModule, Embedding
class VarLenLSTM(BasicModule):
def __init__(self, input_size, hidden_size, lstm_layers=1, dropout=0, last_hn=False):
super(VarLenLSTM, self).__init__()
self.model_name = 'VarLenLSTM'
self.lstm_layers = lstm_layers
self.last_hn = last_hn
self.lstm = nn.LSTM(
input_size=input_size,
hidden_size=hidden_size,
num_layers=lstm_layers,
dropout=dropout,
bidirectional=True,
bias=True,
batch_first=True,
)
def forward(self, x, x_len):
'''
For sentences that contain padding.
In general, out is used for sequence labeling and hn for classification tasks.
:param x: [B * L * H]
:param x_len: [l...]
:return:
out: [B * seq_len * hidden] hidden = 2 * hidden_dim
hn: [B * layers * hidden] hidden = 2 * hidden_dim
'''
x = pack_padded_sequence(x, x_len, batch_first=True, enforce_sorted=True)
out, (hn, _) = self.lstm(x)
out, _ = pad_packed_sequence(out, batch_first=True, padding_value=0.0)
hn = hn.transpose(0, 1).contiguous()
# [B, layers, 2*hidden]
hn = hn.view(hn.size(0), self.lstm_layers, -1)
if self.last_hn:
hn = hn[:, -1].unsqueeze(1)
return out, hn
class BiLSTM(BasicModule):
def __init__(self, vocab_size, config):
super(BiLSTM, self).__init__()
self.model_name = 'BiLSTM'
self.vocab_size = vocab_size
self.word_dim = config.model.word_dim
self.pos_size = config.model.pos_size
self.pos_dim = config.model.pos_dim
self.hidden_dim = config.model.hidden_dim
self.dropout = config.model.dropout
self.lstm_layers = config.rnn.lstm_layers
self.last_hn = config.rnn.last_hn
self.out_dim = config.relation_type
self.embedding = Embedding(self.vocab_size, self.word_dim, self.pos_size, self.pos_dim)
self.input_dim = self.word_dim + self.pos_dim * 2
self.lstm = VarLenLSTM(self.input_dim,
self.hidden_dim,
self.lstm_layers,
dropout=self.dropout,
last_hn=self.last_hn)
if self.last_hn:
linear_input_dim = self.hidden_dim * 2
else:
linear_input_dim = self.hidden_dim * 2 * self.lstm_layers
self.fc1 = nn.Linear(linear_input_dim, self.hidden_dim)
self.fc2 = nn.Linear(self.hidden_dim, self.out_dim)
def forward(self, input):
*x, mask = input
x = self.embedding(x)
x_lens = torch.sum(mask.gt(0), dim=-1)
_, hn = self.lstm(x, x_lens)
hn = hn.view(hn.size(0), -1)
y = F.leaky_relu(self.fc1(hn))
y = F.leaky_relu(self.fc2(y))
return y
if __name__ == '__main__':
torch.manual_seed(1)
# index tensors must be integer (long) for nn.Embedding, so use torch.tensor
x = torch.tensor([
[1, 2, 3, 4, 3, 2],
[1, 2, 3, 0, 0, 0],
[2, 4, 3, 0, 0, 0],
[2, 3, 0, 0, 0, 0],
])
x_len = torch.tensor([6, 3, 3, 2])
embedding = nn.Embedding(5, 10, padding_idx=0)
model = VarLenLSTM(input_size=10, hidden_size=30, lstm_layers=5, last_hn=False)
x = embedding(x) # [4, 6, 10]
out, hn = model(x, x_len)
# out: [4, 6, 60] [B, seq_len, 2 * hidden]
# hn: [4, 5, 60] [B, layers, 2 * hidden]
print(out.shape, hn.shape)
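
`VarLenLSTM.forward` calls `pack_padded_sequence` with `enforce_sorted=True`, which expects the batch ordered by decreasing length. A minimal sketch of that sorting step in plain Python follows; `sort_by_length` is a hypothetical helper, not part of this repository, and it works on lists rather than tensors.

```python
from typing import List, Sequence, Tuple


def sort_by_length(batch: Sequence, lengths: Sequence[int]) -> Tuple[List, List[int], List[int]]:
    """Sort a batch by decreasing length and return the indexes needed to undo the sort."""
    # order[new_pos] = old_pos; sorted() is stable, so ties keep their original order
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    restore = [0] * len(order)
    for new_pos, old_pos in enumerate(order):
        restore[old_pos] = new_pos  # where each original item ended up
    sorted_batch = [batch[i] for i in order]
    sorted_lengths = [lengths[i] for i in order]
    return sorted_batch, sorted_lengths, restore


batch = ['abc', 'abcdef', 'ab', 'xyz']
lengths = [len(s) for s in batch]
sorted_batch, sorted_lengths, restore = sort_by_length(batch, lengths)
# Undo the sort, e.g. after the LSTM has produced one output per sentence.
original_order = [sorted_batch[i] for i in restore]
print(sorted_lengths, original_order)
```

Keeping `restore` around lets the caller put per-sentence outputs back in the order of the labels.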

View File

@ -1,131 +0,0 @@
import math
import torch
import torch.nn as nn
from deepke.model import BasicModule, Embedding
class DotAttention(nn.Module):
'''
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
'''
def __init__(self, dropout=0.0):
super(DotAttention, self).__init__()
self.drop = nn.Dropout(dropout)
self.softmax = nn.Softmax(dim=-1)
def forward(self, Q, K, V, mask_out=None):
"""
:param Q: [batch, seq_len_q, feature_size]
:param K: [batch, seq_len_k, feature_size]
:param V: [batch, seq_len_k, feature_size]
:param mask_out: [batch, 1, seq_len_k] or [batch, seq_len_q, seq_len_k]
"""
feature_size = Q.size(-1)
scale = math.sqrt(feature_size)
output = torch.matmul(Q, K.transpose(1, 2)) / scale
if mask_out is not None:
output.masked_fill_(mask_out, -1e18)
output = self.softmax(output)
output = self.drop(output)
return torch.matmul(output, V)
class MultiHeadAttention(nn.Module):
"""
:param feature_size: int, size of the input dimension, which is also the size of the output dimension
:param num_head: int, number of heads
:param dropout: float
"""
def __init__(self, feature_size, num_head, dropout=0.2):
super(MultiHeadAttention, self).__init__()
self.feature_size = feature_size
self.num_head = num_head
self.q_in = nn.Linear(feature_size, feature_size * num_head)
self.k_in = nn.Linear(feature_size, feature_size * num_head)
self.v_in = nn.Linear(feature_size, feature_size * num_head)
self.attention = DotAttention(dropout=dropout)
self.out = nn.Linear(feature_size * num_head, feature_size)
def forward(self, Q, K, V, att_mask_out=None):
"""
:param Q: [batch, seq_len_q, feature_size]
:param K: [batch, seq_len_k, feature_size]
:param V: [batch, seq_len_k, feature_size]
:param att_mask_out: [batch, 1, seq_len_k] or [batch, seq_len_q, seq_len_k]
"""
batch, sq, feature = Q.size()
sk = K.size(1)
n_head = self.num_head
# input linear
q = self.q_in(Q).view(batch, sq, n_head, feature)
k = self.k_in(K).view(batch, sk, n_head, feature)
v = self.v_in(V).view(batch, sk, n_head, feature)
# transpose q, k and v to do batch attention
# [batch, seq_len, num_head, feature] => [num_head*batch, seq_len, feature]
q = q.permute(2, 0, 1, 3).contiguous().view(-1, sq, feature)
k = k.permute(2, 0, 1, 3).contiguous().view(-1, sk, feature)
v = v.permute(2, 0, 1, 3).contiguous().view(-1, sk, feature)
if att_mask_out is not None:
att_mask_out = att_mask_out.repeat(n_head, 1, 1)
att = self.attention(q, k, v, att_mask_out).view(n_head, batch, sq, feature)
# concat all heads, do output linear
# [num_head, batch, seq_len, feature] => [batch, seq_len, num_head*feature]
att = att.permute(1, 2, 0, 3).contiguous().view(batch, sq, -1)
output = self.out(att)
return output
class Transformer(BasicModule):
def __init__(self, vocab_size, config):
super(Transformer, self).__init__()
self.model_name = 'Transformer'
self.vocab_size = vocab_size
self.word_dim = config.model.word_dim
self.pos_size = config.model.pos_size
self.pos_dim = config.model.pos_dim
self.hidden_dim = config.model.hidden_dim
self.dropout = config.model.dropout
self.layers = config.transformer.transformer_layers
self.out_dim = config.relation_type
self.embedding = Embedding(self.vocab_size, self.word_dim, self.pos_size, self.pos_dim)
self.feature_dim = self.word_dim + self.pos_dim * 2
self.att = MultiHeadAttention(self.feature_dim, num_head=4)
self.norm1 = nn.LayerNorm(self.feature_dim)
self.ffn = nn.Sequential(nn.Linear(self.feature_dim, self.hidden_dim), nn.ReLU(),
nn.Linear(self.hidden_dim, self.feature_dim), nn.Dropout(self.dropout))
self.norm2 = nn.LayerNorm(self.feature_dim)
self.fc = nn.Linear(self.feature_dim, self.out_dim)
def forward(self, input):
*x, mask = input
x = self.embedding(x)
att_mask_out = mask.eq(0).unsqueeze(1)
for i in range(self.layers):
attention = self.att(x, x, x, att_mask_out)
norm_att = self.norm1(attention + x)
x = self.ffn(norm_att)
x = self.norm2(x + norm_att)
x = x[:, 0]
out = self.fc(x)
return out
if __name__ == '__main__':
torch.manual_seed(1)
q = torch.randn(32, 50, 100)
k = torch.randn(32, 60, 100)
v = torch.randn(32, 60, 100)
mask = torch.randn(32, 60).unsqueeze(1).gt(0)
att1 = DotAttention()
out = att1(q, k, v, mask)
print(out.shape) # [32, 50, 100]
att2 = MultiHeadAttention(feature_size=100, num_head=8)
out = att2(q, k, v, mask)
print(out.shape) # [32, 50, 100]
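
The masked scaled dot-product attention in `DotAttention` can be traced by hand on a tiny input. The sketch below is a pure-Python illustration for a single sequence without a batch dimension or dropout; `softmax` and `dot_attention` are hypothetical names, and only the `-1e18` fill value is taken from the code above.

```python
import math
from typing import List, Optional

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(Q: Matrix, K: Matrix, V: Matrix,
                  mask_out: Optional[List[bool]] = None) -> Matrix:
    """softmax(Q K^T / sqrt(d_k)) V; mask_out[j] is True for keys to ignore."""
    scale = math.sqrt(len(Q[0]))
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        if mask_out is not None:
            # masked keys get a huge negative score, so their weight becomes 0
            scores = [-1e18 if masked else s for s, masked in zip(scores, mask_out)]
        weights = softmax(scores)
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
V = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
out = dot_attention(Q, K, V, mask_out=[False, False, True])
print(out)  # the third value row (100, 100) does not contribute
```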

View File

@ -1,8 +0,0 @@
from .BasicModule import BasicModule
from .Embedding import Embedding
from .CNN import CNN
from .RNN import VarLenLSTM, BiLSTM
from .GCN import GCN
from .Transformer import Transformer
from .Capsule import Capsule
from .LM import LM

View File

@ -1,211 +0,0 @@
import os
import jieba
import logging
from typing import List, Dict
from pytorch_transformers import BertTokenizer
# self file
from deepke.vocab import Vocab
from deepke.config import config
from deepke.utils import ensure_dir, save_pkl, load_csv
jieba.setLogLevel(logging.INFO)
Path = str
def _mask_feature(entities_idx: List, sen_len: int) -> List:
left = [1] * (entities_idx[0] + 1)
middle = [2] * (entities_idx[1] - entities_idx[0] - 1)
right = [3] * (sen_len - entities_idx[1])
return left + middle + right
def _pos_feature(sent_len: int, entity_idx: int, entity_len: int, pos_limit: int) -> List:
left = list(range(-entity_idx, 0))
middle = [0] * entity_len
right = list(range(1, sent_len - entity_idx - entity_len + 1))
pos = left + middle + right
for i, p in enumerate(pos):
if p > pos_limit:
pos[i] = pos_limit
if p < -pos_limit:
pos[i] = -pos_limit
pos = [p + pos_limit + 1 for p in pos]
return pos
def _build_data(data: List[Dict], vocab: Vocab, relations: Dict) -> List[Dict]:
if vocab.name == 'LM':
for d in data:
d['target'] = relations[d['relation']]
return data
for d in data:
word2idx = [vocab.word2idx.get(w, 1) for w in d['sentence']]
seq_len = len(word2idx)
head_idx, tail_idx = int(d['head_offset']), int(d['tail_offset'])
if vocab.name == 'word':
head_len, tail_len = 1, 1
else:
head_len, tail_len = len(d['head_type']), len(d['tail_type'])
entities_idx = [head_idx, tail_idx] if tail_idx > head_idx else [tail_idx, head_idx]
head_pos = _pos_feature(seq_len, head_idx, head_len, config.pos_limit)
tail_pos = _pos_feature(seq_len, tail_idx, tail_len, config.pos_limit)
mask_pos = _mask_feature(entities_idx, seq_len)
target = relations[d['relation']]
d['word2idx'] = word2idx
d['seq_len'] = seq_len
d['head_pos'] = head_pos
d['tail_pos'] = tail_pos
d['mask_pos'] = mask_pos
d['target'] = target
return data
def _build_vocab(data: List[Dict], out_path: Path) -> Vocab:
if config.word_segment:
vocab = Vocab('word')
else:
vocab = Vocab('char')
for d in data:
vocab.add_sent(d['sentence'])
vocab.trim(config.min_freq)
ensure_dir(out_path)
vocab_path = os.path.join(out_path, 'vocab.pkl')
vocab_txt = os.path.join(out_path, 'vocab.txt')
save_pkl(vocab_path, vocab, 'vocab')
with open(vocab_txt, 'w', encoding='utf-8') as f:
f.write(os.linesep.join([word for word in vocab.word2idx.keys()]))
return vocab
def _split_sent(data: List[Dict], verbose: bool = True) -> List[Dict]:
if verbose:
print('need word segment, use jieba to split sentence')
jieba.add_word('HEAD')
jieba.add_word('TAIL')
for d in data:
sent = d['sentence']
sent = sent.replace(d['head_type'], 'HEAD', 1)
sent = sent.replace(d['tail_type'], 'TAIL', 1)
sent = jieba.lcut(sent)
head_idx, tail_idx = sent.index('HEAD'), sent.index('TAIL')
sent[head_idx], sent[tail_idx] = d['head_type'], d['tail_type']
d['sentence'] = sent
d['head_offset'] = head_idx
d['tail_offset'] = tail_idx
return data
def _add_lm_data(data: List[Dict]) -> List[Dict]:
'Serialize the input sentences using the vocabulary of the language model'
tokenizer = BertTokenizer.from_pretrained(config.lm.lm_file)
for d in data:
sent = d['sentence']
sent += '[SEP]' + d['head'] + '[SEP]' + d['tail']
d['lm_idx'] = tokenizer.encode(sent, add_special_tokens=True)
d['seq_len'] = len(d['lm_idx'])
return data
def _replace_entity_by_type(data: List[Dict]) -> List[Dict]:
for d in data:
sent = d['sentence'].strip()
sent = sent.replace(d['head'], d['head_type'], 1)
sent = sent.replace(d['tail'], d['tail_type'], 1)
head_offset = sent.index(d['head_type'])
tail_offset = sent.index(d['tail_type'])
d['sentence'] = sent
d['head_offset'] = head_offset
d['tail_offset'] = tail_offset
return data
def _load_relations(fp: Path) -> Dict:
'Read the relation file and store the relations as a dict, used to serialize relations'
print(f'load {fp}')
relations_arr = []
relations_dict = {}
with open(fp, encoding='utf-8') as f:
for l in f:
relations_arr.append(l.strip())
for k, v in enumerate(relations_arr):
relations_dict[v] = k
return relations_dict
def process(data_path: Path, out_path: Path) -> None:
print('===== start preprocess data =====')
train_fp = os.path.join(data_path, 'train.csv')
test_fp = os.path.join(data_path, 'test.csv')
relation_fp = os.path.join(data_path, 'relation.txt')
print('load raw files...')
train_raw_data = load_csv(train_fp)
test_raw_data = load_csv(test_fp)
relations = _load_relations(relation_fp)
# Replace the entities in the sentence with their entity types;
# this improves training results considerably
if config.replace_entity_by_type:
train_raw_data = _replace_entity_by_type(train_raw_data)
test_raw_data = _replace_entity_by_type(test_raw_data)
# when using a pretrained language model
if config.model_name == 'LM':
print('\nuse pretrained language model serialize sentence...')
train_raw_data = _add_lm_data(train_raw_data)
test_raw_data = _add_lm_data(test_raw_data)
vocab = Vocab('LM')
else:
# For Chinese text, decide whether word segmentation is needed; sentences that are already segmented need no further segmentation
print('\nverify whether need split words...')
if config.is_chinese and config.word_segment:
train_raw_data = _split_sent(train_raw_data)
test_raw_data = _split_sent(test_raw_data, verbose=False)
print('build word vocabulary...')
vocab = _build_vocab(train_raw_data, out_path)
print('\nbuild train data...')
train_data = _build_data(train_raw_data, vocab, relations)
print('build test data...\n')
test_data = _build_data(test_raw_data, vocab, relations)
ensure_dir(out_path)
train_data_path = os.path.join(out_path, 'train.pkl')
test_data_path = os.path.join(out_path, 'test.pkl')
save_pkl(train_data_path, train_data, 'train data')
save_pkl(test_data_path, test_data, 'test data')
print('===== end preprocess data =====')
if __name__ == "__main__":
data_path = '../data/origin'
out_path = '../data/out'
process(data_path, out_path)
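
A worked example of the relative position feature computed by `_pos_feature` above. The function `relative_positions` is a hypothetical per-token rewrite that is meant to produce the same values: distances to the entity are clipped to `[-pos_limit, pos_limit]` and shifted by `pos_limit + 1`, which leaves index 0 free for padding.

```python
from typing import List


def relative_positions(sent_len: int, entity_idx: int, entity_len: int,
                       pos_limit: int) -> List[int]:
    """Clipped, shifted distance of every token to an entity span."""
    pos = []
    for i in range(sent_len):
        if i < entity_idx:
            p = i - entity_idx                       # tokens left of the entity
        elif i < entity_idx + entity_len:
            p = 0                                    # tokens inside the entity
        else:
            p = i - (entity_idx + entity_len) + 1    # tokens right of the entity
        p = max(-pos_limit, min(pos_limit, p))       # clip to the limit
        pos.append(p + pos_limit + 1)                # shift so that 0 stays free for padding
    return pos


pos = relative_positions(sent_len=8, entity_idx=3, entity_len=2, pos_limit=2)
print(pos)  # [1, 1, 2, 3, 3, 4, 5, 5]
```

The largest value is `2 * pos_limit + 1`, so an embedding table for these features needs `2 * pos_limit + 2` rows once the padding index is counted.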

View File

@ -1,72 +0,0 @@
import torch
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_fscore_support
from deepke.utils import to_one_hot
def train(epoch, device, dataloader, model, optimizer, criterion, config):
model.train()
total_loss = []
for batch_idx, (*x, y) in enumerate(dataloader, 1):
x = [i.to(device) for i in x]
y = y.to(device)
optimizer.zero_grad()
y_pred = model(x)
if model.model_name == 'Capsule':
y = to_one_hot(y, config.relation_type)
loss = model.loss(y_pred, y)
else:
loss = criterion(y_pred, y)
loss.backward()
optimizer.step()
total_loss.append(loss.item())
# logging
data_cal = len(dataloader.dataset) if batch_idx == len(dataloader) else batch_idx * len(y)
if (config.training.train_log
and batch_idx % config.training.log_interval == 0) or batch_idx == len(dataloader):
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(epoch, data_cal, len(dataloader.dataset),
100. * batch_idx / len(dataloader),
loss.item()))
# plot
if config.training.show_plot:
plt.plot(total_loss)
plt.title('loss')
plt.show()
def validate(dataloader, model, device, config):
model.eval()
with torch.no_grad():
total_y_true = np.empty(0)
total_y_pred = np.empty(0)
for batch_idx, (*x, y) in enumerate(dataloader, 1):
x = [i.to(device) for i in x]
y = y.to(device)
y_pred = model(x)
if model.model_name == 'Capsule':
y_pred = model.predict(y_pred)
else:
y_pred = y_pred.argmax(dim=-1)
# .cpu() is a no-op for tensors that are already on the CPU
y, y_pred = y.cpu().numpy(), y_pred.cpu().numpy()
total_y_true = np.append(total_y_true, y)
total_y_pred = np.append(total_y_pred, y_pred)
total_f1 = []
for average in config.training.f1_norm:
p, r, f1, _ = precision_recall_fscore_support(total_y_true, total_y_pred, average=average)
print(f' {average} metrics: [p: {p:.4f}, r:{r:.4f}, f1:{f1:.4f}]')
total_f1.append(f1)
return total_f1

View File

@ -1,232 +0,0 @@
import os
import csv
import json
import torch
import pickle
import random
import warnings
import numpy as np
from functools import reduce
from typing import Dict, List, Tuple, Set, Any
__all__ = [
'to_one_hot',
'seq_len_to_mask',
'ignore_waring',
'make_seed',
'load_pkl',
'save_pkl',
'ensure_dir',
'load_csv',
'load_jsonld',
'jsonld2csv',
'csv2jsonld',
]
Path = str
def to_one_hot(x, length):
batch_size = x.size(0)
x_one_hot = torch.zeros(batch_size, length).to(x.device)
for i in range(batch_size):
x_one_hot[i, x[i]] = 1.0
return x_one_hot
def model_summary(model):
"""
Get the total number of parameters of the model
:params model: PyTorch model
:return tuple: total, trainable, and non-trainable parameter counts
"""
train = []
nontrain = []
def layer_summary(module):
def count_size(sizes):
return reduce(lambda x, y: x * y, sizes)
for p in module.parameters(recurse=False):
if p.requires_grad:
train.append(count_size(p.shape))
else:
nontrain.append(count_size(p.shape))
for subm in module.children():
layer_summary(subm)
layer_summary(model)
total_train = sum(train)
total_nontrain = sum(nontrain)
total = total_train + total_nontrain
strings = []
strings.append('Total params: {:,}'.format(total))
strings.append('Trainable params: {:,}'.format(total_train))
strings.append('Non-trainable params: {:,}'.format(total_nontrain))
max_len = len(max(strings, key=len))
bar = '-' * (max_len + 3)
strings = [bar] + strings + [bar]
print('\n'.join(strings))
return total, total_train, total_nontrain
def seq_len_to_mask(seq_len, max_len=None):
"""
Convert a 1-d array of sequence lengths into a 2-d mask; positions beyond each length are 0.
.. code-block::
>>> seq_len = torch.arange(2, 16)
>>> mask = seq_len_to_mask(seq_len)
>>> print(mask.size())
torch.Size([14, 15])
>>> seq_len = np.arange(2, 16)
>>> mask = seq_len_to_mask(seq_len)
>>> print(mask.shape)
(14, 15)
>>> seq_len = torch.arange(2, 16)
>>> mask = seq_len_to_mask(seq_len, max_len=100)
>>> print(mask.size())
torch.Size([14, 100])
:param np.ndarray,torch.LongTensor seq_len: shape is (B,)
:param int max_len: pad the mask to this length. By default (None) the longest length in seq_len is used. Under nn.DataParallel
the seq_len on different devices may differ, so pass max_len to make every mask padded to the same length
:return: np.ndarray, torch.Tensor of shape (B, max_length), with elements of type bool or torch.uint8
"""
if isinstance(seq_len, np.ndarray):
assert len(np.shape(seq_len)) == 1, f"seq_len can only have one dimension, got {len(np.shape(seq_len))}."
max_len = int(max_len) if max_len else int(seq_len.max())
broad_cast_seq_len = np.tile(np.arange(max_len), (len(seq_len), 1))
mask = broad_cast_seq_len < seq_len.reshape(-1, 1)
elif isinstance(seq_len, torch.Tensor):
assert seq_len.dim() == 1, f"seq_len can only have one dimension, got {seq_len.dim()}."
batch_size = seq_len.size(0)
max_len = int(max_len) if max_len else seq_len.max().long()
broad_cast_seq_len = torch.arange(max_len).expand(batch_size, -1).to(seq_len)
mask = broad_cast_seq_len.lt(seq_len.unsqueeze(1))
else:
raise TypeError("Only support 1-d numpy.ndarray or 1-d torch.Tensor.")
return mask
def ignore_waring():
warnings.filterwarnings("ignore")
def make_seed(num: int = 1) -> None:
random.seed(num)
np.random.seed(num)
torch.manual_seed(num)
torch.cuda.manual_seed(num)
torch.cuda.manual_seed_all(num)
def load_pkl(fp: str, obj_name: str = 'data', verbose: bool = True) -> Any:
if verbose:
print(f'load {obj_name} in {fp}')
with open(fp, 'rb') as f:
data = pickle.load(f)
return data
def save_pkl(fp: Path, obj, obj_name: str = 'data', verbose: bool = True) -> None:
if verbose:
print(f'save {obj_name} in {fp}')
with open(fp, 'wb') as f:
pickle.dump(obj, f)
def ensure_dir(d: str, verbose: bool = True) -> None:
'''
Check whether the directory exists; create it if it does not
:param d: directory
:param verbose: whether print logging
:return: None
'''
if not os.path.exists(d):
if verbose:
print("Directory '{}' does not exist; creating...".format(d))
os.makedirs(d)
def load_csv(fp: str) -> List:
print(f'load {fp}')
with open(fp, encoding='utf-8') as f:
reader = csv.DictReader(f)
return list(reader)
def load_jsonld(fp: str) -> List:
print(f'load {fp}')
datas = []
with open(fp, encoding='utf-8') as f:
for l in f:
line = json.loads(l)
data = list(line.values())
datas.append(data)
return datas
def jsonld2csv(fp: str, verbose: bool = True) -> str:
'''
Read a jsonld file and save it as a csv file with the same name in the same location
:param fp: path of the jsonld file
:param verbose: whether to print logging
:return: path of the csv file
'''
data = []
root, ext = os.path.splitext(fp)
fp_new = root + '.csv'
if verbose:
print(f'read jsonld file in: {fp}')
with open(fp, encoding='utf-8') as f:
for l in f:
line = json.loads(l)
data.append(line)
if verbose:
print('saving...')
with open(fp_new, 'w', encoding='utf-8') as f:
fieldnames = data[0].keys()
writer = csv.DictWriter(f, fieldnames=fieldnames, dialect='excel')
writer.writeheader()
writer.writerows(data)
if verbose:
print(f'saved csv file in: {fp_new}')
return fp_new
def csv2jsonld(fp: str, verbose: bool = True) -> str:
'''
Read a csv file and save it as a jsonld file with the same name in the same location
:param fp: path of the csv file
:param verbose: whether to print logging
:return: path of the jsonld file
'''
data = []
root, ext = os.path.splitext(fp)
fp_new = root + '.jsonld'
if verbose:
print(f'read csv file in: {fp}')
with open(fp, encoding='utf-8') as f:
reader = csv.DictReader(f, fieldnames=None, dialect='excel')
for line in reader:
data.append(line)
if verbose:
print('saving...')
with open(fp_new, 'w', encoding='utf-8') as f:
f.write(os.linesep.join([json.dumps(l, ensure_ascii=False) for l in data]))
if verbose:
print(f'saved jsonld file in: {fp_new}')
return fp_new
if __name__ == '__main__':
pass
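
The broadcasting trick in `seq_len_to_mask` above compares a row of positions against each length. A minimal list-based sketch of the same idea, with no NumPy or PyTorch dependency, is shown here; `lengths_to_mask` is a hypothetical name used only for illustration.

```python
from typing import List, Optional, Sequence


def lengths_to_mask(seq_len: Sequence[int], max_len: Optional[int] = None) -> List[List[bool]]:
    """Return a (B, max_len) mask that is True for real tokens and False for padding."""
    max_len = int(max_len) if max_len else max(seq_len)
    # position < length  <=>  the position holds a real token
    return [[pos < length for pos in range(max_len)] for length in seq_len]


mask = lengths_to_mask([3, 1, 2])
padded_mask = lengths_to_mask([3, 1, 2], max_len=5)
for row in mask:
    print(row)
```

Passing `max_len` explicitly gives every batch the same mask width, which is the case the docstring above describes for `nn.DataParallel`.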

View File

@ -1,78 +0,0 @@
from typing import List

init_tokens = ['PAD', 'UNK']


class Vocab(object):
    def __init__(self, name: str, init_tokens: List[str] = init_tokens):
        self.name = name
        self.init_tokens = init_tokens
        self.trimed = False
        self.word2idx = {}
        self.word2count = {}
        self.idx2word = {}
        self.count = 0
        self._add_init_tokens()

    def _add_init_tokens(self):
        for token in self.init_tokens:
            self._add_word(token)

    def _add_word(self, word: str):
        if word not in self.word2idx:
            self.word2idx[word] = self.count
            self.word2count[word] = 1
            self.idx2word[self.count] = word
            self.count += 1
        else:
            self.word2count[word] += 1

    def add_sent(self, sent: str):
        for word in sent:
            self._add_word(word)

    def trim(self, min_freq=2, verbose: bool = True):
        '''Remove words whose frequency is below min_freq from the vocabulary.
        Args:
            min_freq: minimum word frequency
        '''
        if self.trimed:
            return
        self.trimed = True
        keep_words = []
        new_words = []
        keep_words.extend(self.init_tokens)
        new_words.extend(self.init_tokens)
        for k, v in self.word2count.items():
            if v >= min_freq:
                keep_words.append(k)
                new_words.extend([k] * v)
        if verbose:
            print('after trim, keep words [{} / {}] = {:.2f}%'.format(len(keep_words), len(self.word2idx),
                                                                      len(keep_words) / len(self.word2idx) * 100))
        # Reinitialize dictionaries
        self.word2idx = {}
        self.word2count = {}
        self.idx2word = {}
        self.count = 0
        for word in new_words:
            self._add_word(word)


if __name__ == '__main__':
    # english
    # from nltk import word_tokenize
    # sent = "I'm chinese, I love China."
    # words = word_tokenize(sent)
    vocab = Vocab('test')
    sent = ' 我是中国人,我 爱中国。'
    print(sent, '\n')
    vocab.add_sent(sent)
    print(vocab.word2idx)
    print(vocab.word2count)
    vocab.trim(2)
    print(vocab.word2idx)
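
The frequency-based trimming in `Vocab.trim` can be sketched with `collections.Counter`. This is a simplified illustration rather than a drop-in replacement: `build_vocab` and `encode` are hypothetical names, the sentences are assumed to be already tokenized, and index 1 (`UNK`) is used for unknown tokens, as `_build_data` does with `vocab.word2idx.get(w, 1)`.

```python
from collections import Counter
from typing import Dict, List, Sequence


def build_vocab(sentences: Sequence[Sequence[str]], min_freq: int = 2,
                init_tokens: Sequence[str] = ('PAD', 'UNK')) -> Dict[str, int]:
    """Map tokens seen at least min_freq times to consecutive indexes."""
    counts = Counter(tok for sent in sentences for tok in sent)
    word2idx = {tok: i for i, tok in enumerate(init_tokens)}
    for tok, freq in counts.items():  # Counter keeps first-seen order
        if freq >= min_freq and tok not in word2idx:
            word2idx[tok] = len(word2idx)
    return word2idx


def encode(sent: Sequence[str], word2idx: Dict[str, int]) -> List[int]:
    """Replace tokens by indexes; unknown tokens fall back to UNK."""
    return [word2idx.get(tok, word2idx['UNK']) for tok in sent]


sentences = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['a', 'cat']]
word2idx = build_vocab(sentences, min_freq=2)
ids = encode(['the', 'dog', 'sat'], word2idx)
print(word2idx, ids)
```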

197
main.py
View File

@ -1,97 +1,138 @@
import os
import argparse
import warnings
import hydra
import torch
import logging
import torch.nn as nn
import torch.optim as optim
from torch import optim
from hydra import utils
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from deepke.config import config
from deepke import model
from deepke.utils import make_seed, load_pkl
from deepke.trainer import train, validate
from deepke.preprocess import process
from deepke.dataset import CustomDataset, collate_fn
from torch.utils.tensorboard import SummaryWriter
# self
import models
from preprocess import preprocess
from dataset import CustomDataset, collate_fn
from trainer import train, validate
from utils import manual_seed, load_pkl
warnings.filterwarnings("ignore")
logger = logging.getLogger(__name__)
__Models__ = {
"CNN": model.CNN,
"RNN": model.BiLSTM,
"GCN": model.GCN,
"Transformer": model.Transformer,
"Capsule": model.Capsule,
"LM": model.LM,
}
parser = argparse.ArgumentParser(description='choose your model')
parser.add_argument('--model_name', type=str, help='model name: [CNN, RNN, GCN, Capsule, Transformer, LM]')
args = parser.parse_args()
model_name = args.model_name if args.model_name else config.model_name
@hydra.main(config_path='conf/config.yaml')
def main(cfg):
cwd = utils.get_original_cwd()
cfg.cwd = cwd
cfg.pos_size = 2 * cfg.pos_limit + 2
logger.info(f'\n{cfg.pretty()}')
make_seed(config.training.seed)
__Model__ = {
'cnn': models.PCNN,
}
if config.training.use_gpu and torch.cuda.is_available():
device = torch.device('cuda', config.training.gpu_id)
else:
device = torch.device('cpu')
# device
if cfg.use_gpu and torch.cuda.is_available():
device = torch.device('cuda', cfg.gpu_id)
else:
device = torch.device('cpu')
logger.info(f'device: {device}')
# if not os.path.exists(config.out_path):
process(config.data_path, config.out_path)
# If the preprocessing procedure is unchanged, it is best to comment this step out so the data is not preprocessed on every run
if cfg.preprocess:
preprocess(cfg)
train_data_path = os.path.join(config.out_path, 'train.pkl')
test_data_path = os.path.join(config.out_path, 'test.pkl')
train_data_path = os.path.join(cfg.cwd, cfg.out_path, 'train.pkl')
valid_data_path = os.path.join(cfg.cwd, cfg.out_path, 'valid.pkl')
test_data_path = os.path.join(cfg.cwd, cfg.out_path, 'test.pkl')
vocab_path = os.path.join(cfg.cwd, cfg.out_path, 'vocab.pkl')
if model_name == 'LM':
vocab_size = None
else:
vocab_path = os.path.join(config.out_path, 'vocab.pkl')
vocab = load_pkl(vocab_path)
vocab_size = len(vocab.word2idx)
if cfg.model_name == 'lm':
vocab_size = None
else:
vocab = load_pkl(vocab_path)
vocab_size = vocab.count
cfg.vocab_size = vocab_size
train_dataset = CustomDataset(train_data_path)
train_dataloader = DataLoader(train_dataset,
batch_size=config.training.batch_size,
shuffle=True,
collate_fn=collate_fn)
test_dataset = CustomDataset(test_data_path)
test_dataloader = DataLoader(
test_dataset,
batch_size=config.training.batch_size,
shuffle=False,
collate_fn=collate_fn,
)
train_dataset = CustomDataset(train_data_path)
valid_dataset = CustomDataset(valid_data_path)
test_dataset = CustomDataset(test_data_path)
model = __Models__[model_name](vocab_size, config)
model.to(device)
# print(model)
train_dataloader = DataLoader(train_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))
valid_dataloader = DataLoader(valid_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))
test_dataloader = DataLoader(test_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))
optimizer = optim.Adam(model.parameters(), lr=config.training.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer,
'max',
factor=config.training.decay_rate,
patience=config.training.decay_patience)
criterion = nn.CrossEntropyLoss()
model = __Model__[cfg.model_name](cfg)
model.to(device)
logger.info(f'\n {model}')
best_macro_f1, best_macro_epoch = 0, 1
best_micro_f1, best_micro_epoch = 0, 1
best_macro_model, best_micro_model = '', ''
print('=' * 10, ' Start training ', '=' * 10)
optimizer = optim.Adam(model.parameters(), lr=cfg.learning_rate, weight_decay=cfg.weight_decay)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=cfg.lr_factor, patience=cfg.lr_patience)
criterion = nn.CrossEntropyLoss()
for epoch in range(1, config.training.epoch + 1):
train(epoch, device, train_dataloader, model, optimizer, criterion, config)
macro_f1, micro_f1 = validate(test_dataloader, model, device, config)
model_name = model.save(epoch=epoch)
scheduler.step(macro_f1)
best_f1, best_epoch = -1, 0
es_loss, es_f1, es_epoch, es_patience, best_es_epoch, best_es_f1, es_path, best_es_path = 1e8, -1, 0, 0, 0, -1, '', ''
train_losses, valid_losses = [], []
if macro_f1 > best_macro_f1:
best_macro_f1 = macro_f1
best_macro_epoch = epoch
best_macro_model = model_name
if micro_f1 > best_micro_f1:
best_micro_f1 = micro_f1
best_micro_epoch = epoch
best_micro_model = model_name
if cfg.show_plot and cfg.plot_utils == 'tensorboard':
writer = SummaryWriter('tensorboard')
else:
writer = None
print('=' * 10, ' End training ', '=' * 10)
print(f'best macro f1: {best_macro_f1:.4f},', f'in epoch: {best_macro_epoch}, saved in: {best_macro_model}')
print(f'best micro f1: {best_micro_f1:.4f},', f'in epoch: {best_micro_epoch}, saved in: {best_micro_model}')
logger.info('=' * 10 + ' Start training ' + '=' * 10)
for epoch in range(1, cfg.epoch + 1):
manual_seed(cfg.seed + epoch)
train_loss = train(epoch, model, train_dataloader, optimizer, criterion, device, writer, cfg)
valid_f1, valid_loss = validate(epoch, model, valid_dataloader, criterion, device)
scheduler.step(valid_loss)
model_path = model.save(epoch, cfg)
# logger.info(model_path)
train_losses.append(train_loss)
valid_losses.append(valid_loss)
if best_f1 < valid_f1:
best_f1 = valid_f1
best_epoch = epoch
# use valid loss as the criterion for early stopping
if es_loss > valid_loss:
es_loss = valid_loss
es_f1 = valid_f1
es_epoch = epoch
es_patience = 0
es_path = model_path
else:
es_patience += 1
if es_patience >= cfg.early_stopping_patience:
best_es_epoch = es_epoch
best_es_f1 = es_f1
best_es_path = es_path
if cfg.show_plot:
if cfg.plot_utils == 'matplot':
plt.plot(train_losses, 'x-')
plt.plot(valid_losses, '+-')
plt.legend(['train', 'valid'])
plt.title('train/valid comparison loss')
plt.show()
if cfg.plot_utils == 'tensorboard':
for i in range(len(train_losses)):
writer.add_scalars('train/valid_comparison_loss', {
'train': train_losses[i],
'valid': valid_losses[i]
}, i)
writer.close()
logger.info(f'best(valid loss quota) early stopping epoch: {best_es_epoch}, '
f'this epoch macro f1: {best_es_f1:0.4f}')
logger.info(f'this model save path: {best_es_path}')
logger.info(f'total {cfg.epoch} epochs, best(valid macro f1) epoch: {best_epoch}, '
f'this epoch macro f1: {best_f1:.4f}')
validate(-1, model, test_dataloader, criterion, device)
if __name__ == '__main__':
main()
# python predict.py --help # show parameter help
# python predict.py -c
# python predict.py chinese_split=0,1 replace_entity_with_type=0,1 -m
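
The validation-loss bookkeeping in the training loop above can be sketched in isolation. This helper is illustrative and not part of the commit: `early_stopping_epoch` and its arguments are hypothetical names, and it stops iterating once patience runs out, which the loop in the diff may not do.

```python
from typing import List, Tuple


def early_stopping_epoch(valid_losses: List[float], patience: int) -> Tuple[int, float]:
    """Return the (1-based) epoch with the best validation loss seen before patience ran out."""
    best_loss, best_epoch, waited = float('inf'), 0, 0
    for epoch, loss in enumerate(valid_losses, 1):
        if loss < best_loss:
            # improvement: remember this epoch and reset the patience counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_loss


losses = [0.9, 0.7, 0.65, 0.66, 0.70, 0.60]
epoch, loss = early_stopping_epoch(losses, patience=2)
print(epoch, loss)  # 3 0.65
```

With `patience=2` the later loss of 0.60 is never reached, which is the usual trade-off of early stopping: a larger patience costs more epochs but is less likely to stop on a temporary plateau.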

62
metrics.py Normal file
View File

@ -0,0 +1,62 @@
import torch
import numpy as np
from abc import ABCMeta, abstractmethod
from sklearn.metrics import precision_recall_fscore_support


class Metric(metaclass=ABCMeta):
    @abstractmethod
    def __init__(self):
        pass

    @abstractmethod
    def reset(self):
        """
        Resets the metric to its initial state.
        This is called at the start of each epoch.
        """
        pass

    @abstractmethod
    def update(self, *args):
        """
        Updates the metric's state using the passed batch output.
        This is called once for each batch.
        """
        pass

    @abstractmethod
    def compute(self):
        """
        Computes the metric based on its accumulated state.
        This is called at the end of each epoch.
        :return: the actual quantity of interest
        """
        pass


class PRMetric():
    def __init__(self):
        """
        For now this simply delegates to sklearn.
        """
        self.y_true = np.empty(0)
        self.y_pred = np.empty(0)

    def reset(self):
        self.y_true = np.empty(0)
        self.y_pred = np.empty(0)

    def update(self, y_true: torch.Tensor, y_pred: torch.Tensor):
        y_true = y_true.cpu().detach().numpy()
        y_pred = y_pred.cpu().detach().numpy()
        # y_pred holds per-class scores; keep only the predicted class index
        y_pred = np.argmax(y_pred, axis=-1)

        self.y_true = np.append(self.y_true, y_true)
        self.y_pred = np.append(self.y_pred, y_pred)

    def compute(self):
        p, r, f1, _ = precision_recall_fscore_support(self.y_true, self.y_pred, average='macro', warn_for=tuple())
        # for single-label multi-class predictions the micro-averaged F1 equals the accuracy
        _, _, acc, _ = precision_recall_fscore_support(self.y_true, self.y_pred, average='micro', warn_for=tuple())

        return acc, p, r, f1
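`PRMetric.compute` reads the accuracy off the micro-averaged F1 returned by sklearn. The sketch below is a plain-Python illustration of why that works for single-label predictions; `macro_prf` and `micro_f1` are hypothetical helpers written for this example, not functions from the commit or from sklearn.

```python
def per_class_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


def macro_prf(y_true, y_pred):
    """Average the per-class precision, recall and F1 with equal class weight."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [f1_from_counts(*per_class_counts(y_true, y_pred, c)) for c in labels]
    n = len(labels)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


def micro_f1(y_true, y_pred):
    """Pool the counts of all classes first, then compute a single F1."""
    labels = set(y_true) | set(y_pred)
    counts = [per_class_counts(y_true, y_pred, c) for c in labels]
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1_from_counts(tp, fp, fn)[2]


y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 1]
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
macro_p, macro_r, macro_f = macro_prf(y_true, y_pred)
micro_f = micro_f1(y_true, y_pred)
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the pooled precision, recall and F1 all reduce to the fraction of correct predictions.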

34
models/BasicModule.py Normal file
View File

@ -0,0 +1,34 @@
import os
import time
import torch
import torch.nn as nn


class BasicModule(nn.Module):
    '''
    Wraps nn.Module and provides save and load methods.
    '''
    def __init__(self):
        super(BasicModule, self).__init__()

    def load(self, path, device):
        '''
        Load the model weights stored at the given path.
        '''
        self.load_state_dict(torch.load(path, map_location=device))

    def save(self, epoch=0, cfg=None):
        '''
        Save the model weights to
        <cfg.cwd>/checkpoints/<timestamp>/<cfg.model_name>_epoch<epoch>.pth
        and return that path.
        '''
        time_prefix = time.strftime('%Y-%m-%d_%H-%M-%S')
        prefix = os.path.join(cfg.cwd, 'checkpoints', time_prefix)
        os.makedirs(prefix, exist_ok=True)
        name = os.path.join(prefix, cfg.model_name + '_' + f'epoch{epoch}' + '.pth')

        torch.save(self.state_dict(), name)
        return name
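The path layout used by `save` can be checked without torch. The sketch below assumes a hypothetical `checkpoint_path` helper, with the timestamp passed in so the result is reproducible; it only builds the string and writes nothing.

```python
import os
import time


def checkpoint_path(cwd, model_name, epoch, now=None):
    """Build <cwd>/checkpoints/<timestamp>/<model_name>_epoch<N>.pth (hypothetical helper)."""
    when = now if now is not None else time.localtime()
    stamp = time.strftime('%Y-%m-%d_%H-%M-%S', when)
    return os.path.join(cwd, 'checkpoints', stamp, f'{model_name}_epoch{epoch}.pth')


# a fixed struct_time keeps the example deterministic
fixed = time.strptime('2019-12-03 18:47:25', '%Y-%m-%d %H:%M:%S')
path = checkpoint_path('run', 'cnn', 3, now=fixed)
```

Because `save` takes the timestamp at call time, two calls made in different seconds write into different directories.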

24
models/BiLSTM.py Normal file
View File

@ -0,0 +1,24 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from . import BasicModule
from module import Embedding, RNN
from utils import seq_len_to_mask


class BiLSTM(BasicModule):
    def __init__(self, cfg):
        super(BiLSTM, self).__init__()

        self.use_pcnn = cfg.use_pcnn

        self.embedding = Embedding(cfg)
        # named `rnn` so that it matches the attribute used in forward()
        self.rnn = RNN(cfg)
        self.fc1 = nn.Linear(len(cfg.kernel_sizes) * cfg.out_channels, cfg.intermediate)
        self.fc2 = nn.Linear(cfg.intermediate, cfg.num_relations)
        self.dropout = nn.Dropout(cfg.dropout)

    def forward(self, x):
        word, lens, head_pos, tail_pos = x['word'], x['lens'], x['head_pos'], x['tail_pos']
        inputs = self.embedding(word, head_pos, tail_pos)
        # RNN.forward takes the padded batch together with the sentence lengths
        out, out_pool = self.rnn(inputs, lens)

6
models/Capsule.py Normal file
View File

@ -0,0 +1,6 @@
# coding=utf-8
# Version: Python 3.7.3
# Tools: Pycharm 2019.02
__date__ = '2019/12/1 12:00 AM'
__author__ = 'Haiyang Yu'

52
models/PCNN.py Normal file
View File

@ -0,0 +1,52 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from . import BasicModule
from module import Embedding, CNN
from utils import seq_len_to_mask
class PCNN(BasicModule):
def __init__(self, cfg):
super(PCNN, self).__init__()
self.use_pcnn = cfg.use_pcnn
self.embedding = Embedding(cfg)
self.cnn = CNN(cfg)
self.fc1 = nn.Linear(len(cfg.kernel_sizes) * cfg.out_channels, cfg.intermediate)
self.fc2 = nn.Linear(cfg.intermediate, cfg.num_relations)
self.dropout = nn.Dropout(cfg.dropout)
if self.use_pcnn:
self.fc_pcnn = nn.Linear(3 * len(cfg.kernel_sizes) * cfg.out_channels,
len(cfg.kernel_sizes) * cfg.out_channels)
self.pcnn_mask_embedding = nn.Embedding(4, 3)
masks = torch.tensor([[0, 0, 0], [100, 0, 0], [0, 100, 0], [0, 0, 100]])
self.pcnn_mask_embedding.weight.data.copy_(masks)
self.pcnn_mask_embedding.weight.requires_grad = False
def forward(self, x):
word, lens, head_pos, tail_pos = x['word'], x['lens'], x['head_pos'], x['tail_pos']
mask = seq_len_to_mask(lens)
inputs = self.embedding(word, head_pos, tail_pos)
out, out_pool = self.cnn(inputs, mask=mask)
if self.use_pcnn:
out = out.unsqueeze(-1) # [B, L, Hs, 1]
pcnn_mask = x['pcnn_mask']
pcnn_mask = self.pcnn_mask_embedding(pcnn_mask).unsqueeze(-2) # [B, L, 1, 3]
out = out + pcnn_mask # [B, L, Hs, 3]
out = out.max(dim=1)[0] - 100 # [B, Hs, 3]
out_pool = out.view(out.size(0), -1) # [B, 3 * Hs]
out_pool = F.leaky_relu(self.fc_pcnn(out_pool)) # [B, Hs]
out_pool = self.dropout(out_pool)
output = self.fc1(out_pool)
output = F.leaky_relu(output)
output = self.dropout(output)
output = self.fc2(output)
return output
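The piecewise pooling in `PCNN.forward` relies on an offset trick: each position gets +100 in the column of its own segment, a single max over the sequence is taken, and the 100 is subtracted again. Below is a plain-Python sketch of that idea for one channel. `piecewise_max` is a hypothetical helper, and the sketch assumes that every segment is non-empty and that the activations span less than the offset.

```python
def piecewise_max(values, segments, offset=100.0):
    """values: activations of one channel per position;
    segments: 1, 2 or 3 per position, 0 for padding."""
    pooled = []
    for seg in (1, 2, 3):
        # lift only the positions of this segment, then take one global max
        shifted = [v + (offset if s == seg else 0.0) for v, s in zip(values, segments)]
        pooled.append(max(shifted) - offset)
    return pooled


values = [0.2, -1.0, 0.7, 0.1, 0.5, -0.3]
segments = [1, 1, 2, 2, 3, 0]
result = piecewise_max(values, segments)

# reference: an explicit max inside each segment
expected = [max(v for v, s in zip(values, segments) if s == seg) for seg in (1, 2, 3)]
```

The lifted positions always win the max as long as no other activation is more than `offset` above them, which is why the constant only needs to exceed the activation range.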

2
models/__init__.py Normal file
View File

@ -0,0 +1,2 @@
from .BasicModule import BasicModule
from .PCNN import PCNN

138
module/Attention.py Normal file
View File

@ -0,0 +1,138 @@
import logging
import torch
import torch.nn as nn
import torch.nn.functional as F

logger = logging.getLogger(__name__)


class DotAttention(nn.Module):
    def __init__(self, dropout=0.0):
        super(DotAttention, self).__init__()
        self.dropout = dropout

    def forward(self, Q, K, V, mask_out=None, head_mask=None):
        """
        Typically, for an input X, K = V = X.
        att_weight = softmax( score_func(q, k) )
        att = sum( att_weight * v )
        :param Q: [..., L, H]
        :param K: [..., S, H]
        :param V: [..., S, H]
        :param mask_out: [..., 1, S]
        :return: (attention_out, attention_weight)
        """
        H = Q.size(-1)

        scale = float(H)**0.5
        attention_weight = torch.matmul(Q, K.transpose(-1, -2)) / scale

        if mask_out is not None:
            # when DotAttention is used on its own (rare), make the mask rank match Q
            while mask_out.dim() != Q.dim():
                mask_out = mask_out.unsqueeze(1)
            attention_weight.masked_fill_(mask_out, -1e8)

        attention_weight = F.softmax(attention_weight, dim=-1)

        # F.dropout defaults to training=True; pass the module flag so eval() disables it
        attention_weight = F.dropout(attention_weight, self.dropout, training=self.training)

        # mask heads if we want to:
        # only used with multi-head attention
        if head_mask is not None:
            attention_weight = attention_weight * head_mask

        attention_out = torch.matmul(attention_weight, V)

        return attention_out, attention_weight
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.0, output_attentions=True):
        """
        :param embed_dim: input dimension; must be divisible by num_heads
        :param num_heads: number of attention heads
        :param dropout: float
        """
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.output_attentions = output_attentions
        self.head_dim = int(embed_dim / num_heads)
        self.all_head_dim = self.head_dim * num_heads
        assert self.all_head_dim == embed_dim, \
            f"embed_dim {embed_dim} must be divisible by num_heads {num_heads}"

        self.q_in = nn.Linear(embed_dim, self.all_head_dim)
        self.k_in = nn.Linear(embed_dim, self.all_head_dim)
        self.v_in = nn.Linear(embed_dim, self.all_head_dim)
        self.attention = DotAttention(dropout=dropout)
        self.out = nn.Linear(self.all_head_dim, embed_dim)

    def forward(self, Q, K, V, key_padding_mask=None, attention_mask=None, head_mask=None):
        """
        :param Q: [B, L, Hs]
        :param K: [B, S, Hs]
        :param V: [B, S, Hs]
        :param key_padding_mask: [B, S], positions that are 1/True are masked
        :param attention_mask: [S] / [L, S], masks specific positions; positions that are 1/True are masked
        :param head_mask: [N], masks specific heads; heads that are 1/True are masked
        """
        B, L, Hs = Q.shape
        S = V.size(1)
        N, H = self.num_heads, self.head_dim

        q = self.q_in(Q).view(B, L, N, H).transpose(1, 2)  # [B, N, L, H]
        k = self.k_in(K).view(B, S, N, H).transpose(1, 2)  # [B, N, S, H]
        v = self.v_in(V).view(B, S, N, H).transpose(1, 2)  # [B, N, S, H]

        if key_padding_mask is not None:
            key_padding_mask = key_padding_mask.ne(0)
            key_padding_mask = key_padding_mask.unsqueeze(1).unsqueeze(1)

        if attention_mask is not None:
            attention_mask = attention_mask.ne(0)
            if attention_mask.dim() == 1:
                attention_mask = attention_mask.unsqueeze(0)
            elif attention_mask.dim() == 2:
                attention_mask = attention_mask.unsqueeze(0).unsqueeze(0).expand(B, -1, -1, -1)
            else:
                raise ValueError(f'attention_mask dim must be 1 or 2, can not be {attention_mask.dim()}')

        # a position is hidden if either mask hides it
        if key_padding_mask is None:
            mask_out = attention_mask if attention_mask is not None else None
        else:
            mask_out = (key_padding_mask + attention_mask).ne(0) if attention_mask is not None else key_padding_mask

        if head_mask is not None:
            # heads flagged with 1 are multiplied by False (0), the others by True (1)
            head_mask = head_mask.eq(0)
            head_mask = head_mask.unsqueeze(0).unsqueeze(-1).unsqueeze(-1)

        attention_out, attention_weight = self.attention(q, k, v, mask_out=mask_out, head_mask=head_mask)

        attention_out = attention_out.transpose(1, 2).reshape(B, L, N * H)  # [B, N, L, H] -> [B, L, N * H]

        # concat all heads, and do output linear
        attention_out = self.out(attention_out)  # [B, L, N * H] -> [B, L, H]

        if self.output_attentions:
            return attention_out, attention_weight
        else:
            return attention_out,
if __name__ == '__main__':
    from utils import seq_len_to_mask

    q = torch.randn(4, 6, 20)  # [B, L, H]
    k = v = torch.randn(4, 5, 20)  # [B, S, H]
    key_padding_mask = seq_len_to_mask([5, 4, 3, 2], max_len=5)
    attention_mask = torch.tensor([1, 0, 0, 1, 0])  # positions set to 1 are masked out
    head_mask = torch.tensor([0, 1])  # heads set to 1 are masked out

    m = MultiHeadAttention(embed_dim=20, num_heads=2, dropout=0.0, output_attentions=True)
    ao, aw = m(q, k, v, key_padding_mask=key_padding_mask, attention_mask=attention_mask, head_mask=head_mask)

    print(ao.shape, aw.shape)  # [B, L, H]  [B, N, L, S]
    print(ao)
    print(aw.unbind(1))
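For readers without torch at hand, the scaled dot-product step of `DotAttention` can be traced on plain lists. This is a minimal sketch for a single query vector; `dot_attention` and `softmax` are hypothetical helpers, and the `-1e8` fill value is the one used in the module above.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def dot_attention(q, keys, values, mask=None):
    """q: [H]; keys and values: [S][H]; mask: [S], True marks positions to hide."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    if mask is not None:
        scores = [-1e8 if hide else s for s, hide in zip(scores, mask)]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = dot_attention(q, keys, values, mask=[False, False, True])
```

The masked position ends up with a weight that underflows to zero, so the output is a mixture of the first two value vectors only.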

117
module/CNN.py Normal file
View File

@ -0,0 +1,117 @@
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class GELU(nn.Module):
    def __init__(self):
        super(GELU, self).__init__()

    def forward(self, x):
        return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


class CNN(nn.Module):
    """
    In NLP, odd kernel sizes such as [3, 5, 7, 9] are normally used
    so that the output length equals the input length.
    In that case padding = k // 2.
    stride is usually 1.
    """
    def __init__(self, config):
        """
        in_channels      : usually the word-embedding dimension or the hidden-size dimension
        out_channels     : int
        kernel_sizes     : list; sizes must be odd (3, 5, 7, ...) so that output length = input length
        activation       : [relu, lrelu, prelu, selu, celu, gelu, sigmoid, tanh]
        pooling_strategy : [max, avg, cls]
        dropout          : float
        """
        super(CNN, self).__init__()

        # self.xxx = config.xxx
        # self.in_channels = config.in_channels
        if config.dim_strategy == 'cat':
            self.in_channels = config.word_dim + 2 * config.pos_dim
        else:
            self.in_channels = config.word_dim

        self.out_channels = config.out_channels
        self.kernel_sizes = config.kernel_sizes
        self.activation = config.activation
        self.pooling_strategy = config.pooling_strategy
        self.dropout = config.dropout
        for kernel_size in self.kernel_sizes:
            assert kernel_size % 2 == 1, "kernel size has to be odd numbers."

        # convolution
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=self.in_channels,
                      out_channels=self.out_channels,
                      kernel_size=k,
                      stride=1,
                      padding=k // 2,
                      dilation=1,
                      groups=1,
                      bias=False) for k in self.kernel_sizes
        ])

        # activation function
        assert self.activation in ['relu', 'lrelu', 'prelu', 'selu', 'celu', 'gelu', 'sigmoid', 'tanh'], \
            'activation function must choose from [relu, lrelu, prelu, selu, celu, gelu, sigmoid, tanh]'
        self.activations = nn.ModuleDict([
            ['relu', nn.ReLU()],
            ['lrelu', nn.LeakyReLU()],
            ['prelu', nn.PReLU()],
            ['selu', nn.SELU()],
            ['celu', nn.CELU()],
            ['gelu', GELU()],
            ['sigmoid', nn.Sigmoid()],
            ['tanh', nn.Tanh()],
        ])

        # pooling
        assert self.pooling_strategy in ['max', 'avg', 'cls'], 'pooling strategy must choose from [max, avg, cls]'

        self.dropout = nn.Dropout(self.dropout)

    def forward(self, x, mask=None):
        """
        :param x: torch.Tensor [batch_size, seq_max_length, input_size], [B, L, H], usually the output of the embedding layer
        :param mask: [batch_size, max_len]; 0 over the sentence, 1 over the padding.
                     The convolution is unaffected; padded positions are overwritten with 1e-12 before pooling.
        :return: (x, xp) with shapes [B, L, Hs] and [B, Hs]
        """
        # [B, L, H] -> [B, H, L] (the H dimension is treated as the input channel dimension)
        x = torch.transpose(x, 1, 2)

        # convolution + activation  [[B, H, L], ... ]
        act_fn = self.activations[self.activation]

        x = [act_fn(conv(x)) for conv in self.convs]
        x = torch.cat(x, dim=1)

        # mask
        if mask is not None:
            # [B, L] -> [B, 1, L]
            mask = mask.unsqueeze(1)
            x = x.masked_fill_(mask, 1e-12)

        # pooling
        # [[B, H, L], ... ] -> [[B, H], ... ]
        if self.pooling_strategy == 'max':
            xp = F.max_pool1d(x, kernel_size=x.size(2)).squeeze(2)
            # equivalent to xp = torch.max(x, dim=2)[0]

        elif self.pooling_strategy == 'avg':
            # 'avg' needs the mask to count the real tokens;
            # squeeze(1) removes only the channel axis, so B == 1 or L == 1 still works
            x_len = mask.squeeze(1).eq(0).sum(-1).unsqueeze(-1).to(torch.float).to(device=mask.device)
            xp = torch.sum(x, dim=-1) / x_len

        else:
            # self.pooling_strategy == 'cls'
            xp = x[:, :, 0]

        x = x.transpose(1, 2)
        x = self.dropout(x)
        xp = self.dropout(xp)

        return x, xp  # [B, L, Hs], [B, Hs]
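The odd-kernel requirement follows from the output-length formula for stride 1: `L_out = L + 2 * padding - k + 1`, which equals `L` exactly when `padding = (k - 1) / 2`, that is `k // 2` for odd `k`. Here is a plain-Python sketch of such a same-length convolution; `conv1d_same` is a hypothetical helper covering one channel without bias.

```python
def conv1d_same(seq, kernel):
    """1-D cross-correlation, stride 1, zero padding of k // 2 on both sides."""
    k = len(kernel)
    assert k % 2 == 1, 'kernel size has to be odd'
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    # len(padded) - k + 1 == len(seq), so one output per input position
    return [sum(padded[i + j] * kernel[j] for j in range(k)) for i in range(len(seq))]


seq = [1.0, 2.0, 3.0, 4.0, 5.0]
out3 = conv1d_same(seq, [1.0, 1.0, 1.0])
out5 = conv1d_same(seq, [1.0] * 5)
```

With an even kernel the symmetric padding `k // 2` would give `L + 1` outputs, which is why the constructor asserts odd sizes.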

13
module/Capsule.py Normal file
View File

@ -0,0 +1,13 @@
import logging
import torch
import torch.nn as nn
import torch.nn.functional as F
logger = logging.getLogger(__name__)
class Capsule(nn.Module):
def __init__(self, config):
super(Capsule, self).__init__()
# self.xxx = config.xxx

39
module/Embedding.py Normal file
View File

@ -0,0 +1,39 @@
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class Embedding(nn.Module):
    def __init__(self, config):
        """
        word embedding: index 0 is normally the padding
        pos embedding:  index 0 is normally the padding
        dim_strategy: [cat, sum], whether the embeddings are concatenated or summed
        """
        super(Embedding, self).__init__()

        # self.xxx = config.xxx
        self.vocab_size = config.vocab_size
        self.word_dim = config.word_dim
        self.pos_size = config.pos_size
        self.pos_dim = config.pos_dim if config.dim_strategy == 'cat' else config.word_dim
        self.dim_strategy = config.dim_strategy

        self.wordEmbed = nn.Embedding(self.vocab_size, self.word_dim, padding_idx=0)
        self.headPosEmbed = nn.Embedding(self.pos_size, self.pos_dim, padding_idx=0)
        self.tailPosEmbed = nn.Embedding(self.pos_size, self.pos_dim, padding_idx=0)

    def forward(self, *x):
        word, head, tail = x
        word_embedding = self.wordEmbed(word)
        head_embedding = self.headPosEmbed(head)
        tail_embedding = self.tailPosEmbed(tail)

        if self.dim_strategy == 'cat':
            return torch.cat((word_embedding, head_embedding, tail_embedding), -1)
        elif self.dim_strategy == 'sum':
            # in this case pos_dim == word_dim
            return word_embedding + head_embedding + tail_embedding
        else:
            raise Exception('dim_strategy must choose from [sum, cat]')

96
module/RNN.py Normal file
View File

@ -0,0 +1,96 @@
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class RNN(nn.Module):
    def __init__(self, config):
        """
        type_rnn: one of RNN, GRU, LSTM
        """
        super(RNN, self).__init__()

        # self.xxx = config.xxx
        self.input_size = config.input_size
        self.hidden_size = config.hidden_size // 2 if config.bidirectional else config.hidden_size
        self.num_layers = config.num_layers
        self.dropout = config.dropout
        self.bidirectional = config.bidirectional
        self.last_layer_hn = config.last_layer_hn
        self.type_rnn = config.type_rnn

        self.h0 = self._init_h0()
        # look the class up on torch.nn instead of calling eval() on a config string
        rnn = getattr(nn, self.type_rnn)
        self.rnn = rnn(input_size=self.input_size,
                       hidden_size=self.hidden_size,
                       num_layers=self.num_layers,
                       dropout=self.dropout,
                       bidirectional=self.bidirectional,
                       bias=True,
                       batch_first=True)

    def _init_h0(self):
        pass
        # h0 = torch.empty(1,B,H)
        # h0 = nn.init.orthogonal_(h0)

    def forward(self, x, x_len):
        """
        :param x: torch.Tensor [batch_size, seq_max_length, input_size], [B, L, H_in], usually the output of the embedding layer
        :param x_len: torch.Tensor [B], sentence lengths, already sorted in descending order
        :return:
            output: torch.Tensor [B, L, H_out], used for sequence labelling
            hn:     torch.Tensor [B, N, H_out] / [B, H_out], used for classification;
                    only the last layer is returned when last_layer_hn is set
        """
        B, L, _ = x.size()
        H, N = self.hidden_size, self.num_layers

        # create the initial states on the same device as the input
        num_directions = 2 if self.bidirectional else 1
        h0 = torch.zeros([num_directions * N, B, H], device=x.device)
        nn.init.orthogonal_(h0)
        c0 = torch.zeros([num_directions * N, B, H], device=x.device)
        nn.init.orthogonal_(c0)

        x = pack_padded_sequence(x, x_len, batch_first=True, enforce_sorted=True)
        if self.type_rnn == 'LSTM':
            output, hn = self.rnn(x, (h0, c0))
        else:
            output, hn = self.rnn(x, h0)
        output, _ = pad_packed_sequence(output, batch_first=True, total_length=L)

        if self.type_rnn == 'LSTM':
            # for LSTM, hn is the tuple (h_n, c_n); keep the hidden state only
            hn = hn[0]
        if self.bidirectional:
            # [2 * N, B, H] -> [B, N, 2 * H]: concatenate both directions of each layer
            hn = hn.view(N, 2, B, H).transpose(1, 2).contiguous().view(N, B, 2 * H).transpose(0, 1)
        else:
            hn = hn.transpose(0, 1)
        if self.last_layer_hn:
            hn = hn[:, -1, :]

        return output, hn


if __name__ == '__main__':

    class Config(object):
        type_rnn = 'LSTM'
        input_size = 5
        hidden_size = 4
        num_layers = 3
        dropout = 0.0
        last_layer_hn = False
        bidirectional = True

    config = Config()
    model = RNN(config)
    print(model)

    torch.manual_seed(1)
    x = torch.tensor([[4, 3, 2, 1], [5, 6, 7, 0], [8, 10, 0, 0]])
    x = torch.nn.Embedding(11, 5, padding_idx=0)(x)  # B,L,H = 3,4,5
    x_len = torch.tensor([4, 3, 2])

    o, h = model(x, x_len)

    print(o.shape, h.shape, sep='\n\n')
    print(o[-1].data, h[-1].data, sep='\n\n')
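`RNN.forward` calls `pack_padded_sequence` with `enforce_sorted=True`, so the caller has to supply a batch ordered by decreasing length. The sketch below shows, on plain lists, how to compute that ordering and the inverse permutation that restores the original order afterwards; `sort_by_length` is a hypothetical helper, not part of the repo.

```python
def sort_by_length(lengths):
    """Return (order, inverse): `order` sorts the batch by decreasing length,
    `inverse` maps each original index to its position in the sorted batch."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    inverse = [0] * len(order)
    for new_pos, old_pos in enumerate(order):
        inverse[old_pos] = new_pos
    return order, inverse


lengths = [2, 4, 3]
order, inverse = sort_by_length(lengths)
sorted_lengths = [lengths[i] for i in order]       # what the RNN would receive
restored = [sorted_lengths[i] for i in inverse]    # back in the original order
```

The same two index lists can be applied to the padded inputs before the call and to `output` and `hn` after it.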

149
module/Transformer.py Normal file
View File

@ -0,0 +1,149 @@
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from .Attention import MultiHeadAttention
def gelu(x):
""" Original Implementation of the gelu activation function in Google Bert repo when initially created.
For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
Also see https://arxiv.org/abs/1606.08415
"""
return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
def gelu_new(x):
""" Implementation of the gelu activation function currently in Google Bert repo (identical to OpenAI GPT).
Also see https://arxiv.org/abs/1606.08415
"""
return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
def swish(x):
return x * torch.sigmoid(x)
ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish, "gelu_new": gelu_new}
class TransformerAttention(nn.Module):
def __init__(self, config):
super(TransformerAttention, self).__init__()
# self.xxx = config.xxx
self.hidden_size = config.hidden_size
self.num_heads = config.num_heads
self.dropout = config.dropout
self.output_attentions = config.output_attentions
self.layer_norm_eps = config.layer_norm_eps
self.multihead_attention = MultiHeadAttention(self.hidden_size, self.num_heads, self.dropout,
self.output_attentions)
self.dense = nn.Linear(self.hidden_size, self.hidden_size)
self.dropout = nn.Dropout(self.dropout)
self.layerNorm = nn.LayerNorm(self.hidden_size, eps=self.layer_norm_eps)
def forward(self, x, key_padding_mask=None, attention_mask=None, head_mask=None):
"""
:param x: [B, L, Hs]
:param attention_mask: [B, L], sentences are zero-padded at the end; padded positions are True, the rest False
:param head_mask: [L] [N,L]
:return:
"""
attention_outputs = self.multihead_attention(x, x, x, key_padding_mask, attention_mask, head_mask)
attention_output = attention_outputs[0]
attention_output = self.dense(attention_output)
attention_output = self.dropout(attention_output)
attention_output = self.layerNorm(attention_output + x)
outputs = (attention_output, ) + attention_outputs[1:] # 后面是 attention weight
return outputs
class TransformerOutput(nn.Module):
def __init__(self, config):
super(TransformerOutput, self).__init__()
# self.xxx = config.xxx
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size
self.dropout = config.dropout
self.layer_norm_eps = config.layer_norm_eps
self.zoom_in = nn.Linear(self.hidden_size, self.intermediate_size)
self.intermediate_act_fn = ACT2FN[config.hidden_act]
self.zoom_out = nn.Linear(self.intermediate_size, self.hidden_size)
self.dropout = nn.Dropout(self.dropout)
self.layerNorm = nn.LayerNorm(self.hidden_size, eps=self.layer_norm_eps)
def forward(self, input_tensor):
hidden_states = self.zoom_in(input_tensor)
hidden_states = self.intermediate_act_fn(hidden_states)
hidden_states = self.zoom_out(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.layerNorm(hidden_states + input_tensor)
return hidden_states
class TransformerLayer(nn.Module):
def __init__(self, config):
super(TransformerLayer, self).__init__()
self.attention = TransformerAttention(config)
self.output = TransformerOutput(config)
def forward(self, hidden_states, key_padding_mask=None, attention_mask=None, head_mask=None):
attention_outputs = self.attention(hidden_states, key_padding_mask, attention_mask, head_mask)
attention_output = attention_outputs[0]
layer_output = self.output(attention_output)
outputs = (layer_output, ) + attention_outputs[1:]
return outputs
class Transformer(nn.Module):
def __init__(self, config):
super(Transformer, self).__init__()
# self.xxx = config.xxx
self.num_hidden_layers = config.num_hidden_layers
self.output_attentions = config.output_attentions
self.output_hidden_states = config.output_hidden_states
self.layer = nn.ModuleList([TransformerLayer(config) for _ in range(self.num_hidden_layers)])
def forward(self, hidden_states, key_padding_mask=None, attention_mask=None, head_mask=None):
"""
:param hidden_states: [B, L, Hs]
:param key_padding_mask: [B, S], positions that are 1/True are masked
:param attention_mask: [S] / [L, S], masks specific positions; positions that are 1/True are masked
:param head_mask: [N] / [L, N], masks specific heads; heads that are 1/True are masked
"""
if head_mask is not None:
if head_mask.dim() == 1:
head_mask = head_mask.expand((self.num_hidden_layers, ) + head_mask.shape)
else:
head_mask = [None] * self.num_hidden_layers
all_hidden_states = ()
all_attentions = ()
for i, layer_module in enumerate(self.layer):
if self.output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states, )
layer_outputs = layer_module(hidden_states, key_padding_mask, attention_mask, head_mask[i])
hidden_states = layer_outputs[0]
if self.output_attentions:
all_attentions = all_attentions + (layer_outputs[1], )
# Add last layer
if self.output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states, )
outputs = (hidden_states, )
if self.output_hidden_states:
outputs = outputs + (all_hidden_states, )
if self.output_attentions:
outputs = outputs + (all_attentions, )
return outputs # last-layer hidden state, (all hidden states), (all attentions)
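The two GELU variants registered in `ACT2FN` differ only slightly. The sketch below is a scalar re-implementation using the `math` module, written for illustration rather than taken from the module above; it measures the gap on a few sample points.

```python
import math


def gelu(x):
    """Exact form based on the Gaussian error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_new(x):
    """tanh approximation (the form attributed to OpenAI GPT in the docstring above)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


xs = [-3.0, -1.0, 0.0, 0.5, 1.0, 3.0]
max_gap = max(abs(gelu(x) - gelu_new(x)) for x in xs)
```

On these points the two forms agree to well under 0.005, so they produce only slightly different results.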

View File

@ -0,0 +1,429 @@
import copy
import torch
from torch.nn.init import xavier_uniform_
from torch.nn import Module,ModuleList,LayerNorm,Linear,Dropout,MultiheadAttention
import torch.nn.functional as F
# Code taken from torch 1.3.0; this is the official transformer implementation.
# However, its interface is too rigid, so I re-implemented my own version.
class Transformer(Module):
r"""A transformer model. User is able to modify the attributes as needed. The architecture
is based on the paper "Attention Is All You Need". Ashish Vaswani, Noam Shazeer,
Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and
Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information
Processing Systems, pages 6000-6010. Users can build the BERT(https://arxiv.org/abs/1810.04805)
model with corresponding parameters.
Args:
d_model: the number of expected features in the encoder/decoder inputs (default=512).
nhead: the number of heads in the multiheadattention models (default=8).
num_encoder_layers: the number of sub-encoder-layers in the encoder (default=6).
num_decoder_layers: the number of sub-decoder-layers in the decoder (default=6).
dim_feedforward: the dimension of the feedforward network model (default=2048).
dropout: the dropout value (default=0.1).
activation: the activation function of encoder/decoder intermediate layer, relu or gelu (default=relu).
custom_encoder: custom encoder (default=None).
custom_decoder: custom decoder (default=None).
Examples::
>>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
>>> src = torch.rand((10, 32, 512))
>>> tgt = torch.rand((20, 32, 512))
>>> out = transformer_model(src, tgt)
Note: A full example to apply nn.Transformer module for the word language model is available in
https://github.com/pytorch/examples/tree/master/word_language_model
"""
def __init__(self, d_model=512, nhead=8, num_encoder_layers=6,
num_decoder_layers=6, dim_feedforward=2048, dropout=0.1,
activation="relu", custom_encoder=None, custom_decoder=None):
super(Transformer, self).__init__()
if custom_encoder is not None:
self.encoder = custom_encoder
else:
encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, activation)
encoder_norm = LayerNorm(d_model)
self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
if custom_decoder is not None:
self.decoder = custom_decoder
else:
decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout, activation)
decoder_norm = LayerNorm(d_model)
self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm)
self._reset_parameters()
self.d_model = d_model
self.nhead = nhead
def forward(self, src, tgt, src_mask=None, tgt_mask=None,
memory_mask=None, src_key_padding_mask=None,
tgt_key_padding_mask=None, memory_key_padding_mask=None):
r"""Take in and process masked source/target sequences.
Args:
src: the sequence to the encoder (required).
tgt: the sequence to the decoder (required).
src_mask: the additive mask for the src sequence (optional).
tgt_mask: the additive mask for the tgt sequence (optional).
memory_mask: the additive mask for the encoder output (optional).
src_key_padding_mask: the ByteTensor mask for src keys per batch (optional).
tgt_key_padding_mask: the ByteTensor mask for tgt keys per batch (optional).
memory_key_padding_mask: the ByteTensor mask for memory keys per batch (optional).
Shape:
- src: :math:`(S, N, E)`.
- tgt: :math:`(T, N, E)`.
- src_mask: :math:`(S, S)`.
- tgt_mask: :math:`(T, T)`.
- memory_mask: :math:`(T, S)`.
- src_key_padding_mask: :math:`(N, S)`.
- tgt_key_padding_mask: :math:`(N, T)`.
- memory_key_padding_mask: :math:`(N, S)`.
Note: [src/tgt/memory]_mask should be filled with
float('-inf') for the masked positions and float(0.0) else. These masks
ensure that predictions for position i depend only on the unmasked positions
j and are applied identically for each sequence in a batch.
[src/tgt/memory]_key_padding_mask should be a ByteTensor where True values are positions
that should be masked with float('-inf') and False values will be unchanged.
This mask ensures that no information will be taken from position i if
it is masked, and has a separate mask for each sequence in a batch.
- output: :math:`(T, N, E)`.
Note: Due to the multi-head attention architecture in the transformer model,
the output sequence length of a transformer is same as the input sequence
(i.e. target) length of the decoder.
where S is the source sequence length, T is the target sequence length, N is the
batch size, E is the feature number
Examples:
>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
"""
if src.size(1) != tgt.size(1):
raise RuntimeError("the batch number of src and tgt must be equal")
if src.size(2) != self.d_model or tgt.size(2) != self.d_model:
raise RuntimeError("the feature number of src and tgt must be equal to d_model")
memory = self.encoder(src, mask=src_mask, src_key_padding_mask=src_key_padding_mask)
output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask,
tgt_key_padding_mask=tgt_key_padding_mask,
memory_key_padding_mask=memory_key_padding_mask)
return output
def generate_square_subsequent_mask(self, sz):
r"""Generate a square mask for the sequence. The masked positions are filled with float('-inf').
Unmasked positions are filled with float(0.0).
"""
mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
return mask
def _reset_parameters(self):
r"""Initiate parameters in the transformer model."""
for p in self.parameters():
if p.dim() > 1:
xavier_uniform_(p)
class TransformerEncoder(Module):
r"""TransformerEncoder is a stack of N encoder layers
Args:
encoder_layer: an instance of the TransformerEncoderLayer() class (required).
num_layers: the number of sub-encoder-layers in the encoder (required).
norm: the layer normalization component (optional).
Examples::
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
>>> src = torch.rand(10, 32, 512)
>>> out = transformer_encoder(src)
"""
def __init__(self, encoder_layer, num_layers, norm=None):
super(TransformerEncoder, self).__init__()
self.layers = _get_clones(encoder_layer, num_layers)
self.num_layers = num_layers
self.norm = norm
def forward(self, src, mask=None, src_key_padding_mask=None):
r"""Pass the input through the encoder layers in turn.
Args:
src: the sequence to the encoder (required).
mask: the mask for the src sequence (optional).
src_key_padding_mask: the mask for the src keys per batch (optional).
Shape:
see the docs in Transformer class.
"""
output = src
for i in range(self.num_layers):
output = self.layers[i](output, src_mask=mask,
src_key_padding_mask=src_key_padding_mask)
if self.norm:
output = self.norm(output)
return output
class TransformerDecoder(Module):
r"""TransformerDecoder is a stack of N decoder layers
Args:
decoder_layer: an instance of the TransformerDecoderLayer() class (required).
num_layers: the number of sub-decoder-layers in the decoder (required).
norm: the layer normalization component (optional).
Examples::
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
>>> memory = torch.rand(10, 32, 512)
>>> tgt = torch.rand(20, 32, 512)
>>> out = transformer_decoder(tgt, memory)
"""
def __init__(self, decoder_layer, num_layers, norm=None):
super(TransformerDecoder, self).__init__()
self.layers = _get_clones(decoder_layer, num_layers)
self.num_layers = num_layers
self.norm = norm
def forward(self, tgt, memory, tgt_mask=None,
memory_mask=None, tgt_key_padding_mask=None,
memory_key_padding_mask=None):
r"""Pass the inputs (and mask) through the decoder layer in turn.
Args:
tgt: the sequence to the decoder (required).
memory: the sequence from the last layer of the encoder (required).
tgt_mask: the mask for the tgt sequence (optional).
memory_mask: the mask for the memory sequence (optional).
tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
memory_key_padding_mask: the mask for the memory keys per batch (optional).
Shape:
see the docs in Transformer class.
"""
output = tgt
for i in range(self.num_layers):
output = self.layers[i](output, memory, tgt_mask=tgt_mask,
memory_mask=memory_mask,
tgt_key_padding_mask=tgt_key_padding_mask,
memory_key_padding_mask=memory_key_padding_mask)
if self.norm:
output = self.norm(output)
return output
class TransformerEncoderLayer(Module):
r"""TransformerEncoderLayer is made up of self-attn and feedforward network.
This standard encoder layer is based on the paper "Attention Is All You Need".
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
in a different way during application.
Args:
d_model: the number of expected features in the input (required).
nhead: the number of heads in the multiheadattention models (required).
dim_feedforward: the dimension of the feedforward network model (default=2048).
dropout: the dropout value (default=0.1).
activation: the activation function of intermediate layer, relu or gelu (default=relu).
Examples::
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> src = torch.rand(10, 32, 512)
>>> out = encoder_layer(src)
"""
def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu"):
super(TransformerEncoderLayer, self).__init__()
self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
# Implementation of Feedforward model
self.linear1 = Linear(d_model, dim_feedforward)
self.dropout = Dropout(dropout)
self.linear2 = Linear(dim_feedforward, d_model)
self.norm1 = LayerNorm(d_model)
self.norm2 = LayerNorm(d_model)
self.dropout1 = Dropout(dropout)
self.dropout2 = Dropout(dropout)
self.activation = _get_activation_fn(activation)
def forward(self, src, src_mask=None, src_key_padding_mask=None):
r"""Pass the input through the encoder layer.
Args:
src: the sequence to the encoder layer (required).
src_mask: the mask for the src sequence (optional).
src_key_padding_mask: the mask for the src keys per batch (optional).
Shape:
see the docs in Transformer class.
"""
src2 = self.self_attn(src, src, src, attn_mask=src_mask,
key_padding_mask=src_key_padding_mask)[0]
src = src + self.dropout1(src2)
src = self.norm1(src)
if hasattr(self, "activation"):
src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
else: # for backward compatibility
src2 = self.linear2(self.dropout(F.relu(self.linear1(src))))
src = src + self.dropout2(src2)
src = self.norm2(src)
return src
class TransformerDecoderLayer(Module):
r"""TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.
This standard decoder layer is based on the paper "Attention Is All You Need".
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
in a different way during application.
Args:
d_model: the number of expected features in the input (required).
nhead: the number of heads in the multiheadattention models (required).
dim_feedforward: the dimension of the feedforward network model (default=2048).
dropout: the dropout value (default=0.1).
activation: the activation function of intermediate layer, relu or gelu (default=relu).
Examples::
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> memory = torch.rand(10, 32, 512)
>>> tgt = torch.rand(20, 32, 512)
>>> out = decoder_layer(tgt, memory)
"""
def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu"):
super(TransformerDecoderLayer, self).__init__()
self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
self.multihead_attn = MultiheadAttention(d_model, nhead, dropout=dropout)
# Implementation of Feedforward model
self.linear1 = Linear(d_model, dim_feedforward)
self.dropout = Dropout(dropout)
self.linear2 = Linear(dim_feedforward, d_model)
self.norm1 = LayerNorm(d_model)
self.norm2 = LayerNorm(d_model)
self.norm3 = LayerNorm(d_model)
self.dropout1 = Dropout(dropout)
self.dropout2 = Dropout(dropout)
self.dropout3 = Dropout(dropout)
self.activation = _get_activation_fn(activation)
def forward(self, tgt, memory, tgt_mask=None, memory_mask=None,
tgt_key_padding_mask=None, memory_key_padding_mask=None):
r"""Pass the inputs (and mask) through the decoder layer.
Args:
tgt: the sequence to the decoder layer (required).
memory: the sequence from the last layer of the encoder (required).
tgt_mask: the mask for the tgt sequence (optional).
memory_mask: the mask for the memory sequence (optional).
tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
memory_key_padding_mask: the mask for the memory keys per batch (optional).
Shape:
see the docs in Transformer class.
"""
tgt2 = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask,
key_padding_mask=tgt_key_padding_mask)[0]
tgt = tgt + self.dropout1(tgt2)
tgt = self.norm1(tgt)
tgt2 = self.multihead_attn(tgt, memory, memory, attn_mask=memory_mask,
key_padding_mask=memory_key_padding_mask)[0]
tgt = tgt + self.dropout2(tgt2)
tgt = self.norm2(tgt)
if hasattr(self, "activation"):
tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
else: # for backward compatibility
tgt2 = self.linear2(self.dropout(F.relu(self.linear1(tgt))))
tgt = tgt + self.dropout3(tgt2)
tgt = self.norm3(tgt)
return tgt
def _get_clones(module, N):
return ModuleList([copy.deepcopy(module) for i in range(N)])
def _get_activation_fn(activation):
if activation == "relu":
return F.relu
elif activation == "gelu":
return F.gelu
else:
raise RuntimeError("activation should be relu/gelu, not %s." % activation)
if __name__ == '__main__':
import torch.nn as nn
torch.manual_seed(1)
class Config():
d_model = 8
nhead = 4
num_encoder_layers = 3
num_decoder_layers = 3
dim_feedforward = 64
dropout = 0.1
activation = 'gelu'
cfg = Config()
encoder_layer = nn.TransformerEncoderLayer(cfg.d_model, cfg.nhead, cfg.dim_feedforward, cfg.dropout,
cfg.activation)
encoder_norm = nn.LayerNorm(cfg.d_model)
encoder = nn.TransformerEncoder(encoder_layer, cfg.num_encoder_layers, encoder_norm)
decoder_layer = nn.TransformerDecoderLayer(cfg.d_model, cfg.nhead, cfg.dim_feedforward, cfg.dropout,
cfg.activation)
decoder_norm = nn.LayerNorm(cfg.d_model)
decoder = nn.TransformerDecoder(decoder_layer, cfg.num_decoder_layers, decoder_norm)
src = torch.randn((2, 7, 8)) # B,L,H
tgt = torch.randn((2, 5, 8))
src.transpose_(0,1)
tgt.transpose_(0,1)
src_mask = None
tgt_mask = None
memory_mask = None
src_key_padding_mask = None
tgt_key_padding_mask = None
memory_key_padding_mask = None
memory = encoder(src, mask=src_mask, src_key_padding_mask=src_key_padding_mask)
output = decoder(tgt,
memory,
tgt_mask=tgt_mask,
memory_mask=memory_mask,
tgt_key_padding_mask=tgt_key_padding_mask,
memory_key_padding_mask=memory_key_padding_mask)
memory.transpose_(0,1)
output.transpose_(0,1)
print(memory.shape, output.shape) # torch.Size([2, 7, 8]) torch.Size([2, 5, 8])
# call nn.Transformer directly
transformer = nn.Transformer(cfg.d_model,cfg.nhead,cfg.num_encoder_layers,cfg.num_decoder_layers,cfg.dim_feedforward,cfg.dropout,cfg.activation)
out = transformer(src,tgt,src_mask=src_mask,tgt_mask=tgt_mask,memory_mask=memory_mask)
out.transpose_(0,1)
print(out.shape)
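The activation lookup in `_get_activation_fn` above can be illustrated without PyTorch. The following is only a sketch on plain floats: `relu`, `gelu`, and `get_activation_fn` are hypothetical stand-ins, not the `F.relu` / `F.gelu` tensor functions, and the GELU shown is the exact erf-based form.

```python
import math


def relu(x: float) -> float:
    # max(0, x) on a single float
    return max(0.0, x)


def gelu(x: float) -> float:
    # exact GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# hypothetical registry, mirroring the relu/gelu choice in the layer code
_ACTIVATIONS = {"relu": relu, "gelu": gelu}


def get_activation_fn(name: str):
    # fail loudly on unknown names, as the original helper does
    try:
        return _ACTIVATIONS[name]
    except KeyError:
        raise ValueError("activation should be relu/gelu, not %s." % name) from None


act = get_activation_fn("gelu")
print(act(1.0))
```

A dictionary lookup keeps the set of supported names in one place, so adding another activation would only need one more entry.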

6
module/__init__.py Normal file
View File

@ -0,0 +1,6 @@
from .Embedding import Embedding
from .CNN import CNN
from .RNN import RNN
from .Attention import DotAttention, MultiHeadAttention
from .Transformer import Transformer
from .Capsule import Capsule

136
predict.py Normal file
View File

@ -0,0 +1,136 @@
import os
import sys
import torch
import logging
import hydra
import models
from hydra import utils
from utils import load_pkl, load_csv
from serializer import Serializer
from preprocess import _serialize_sentence, _convert_tokens_into_index, _add_pos_seq, _handle_relation_data
import matplotlib.pyplot as plt
logger = logging.getLogger(__name__)
def _preprocess_data(data, cfg):
vocab = load_pkl(os.path.join(cfg.cwd, cfg.out_path, 'vocab.pkl'), verbose=False)
relation_data = load_csv(os.path.join(cfg.cwd, cfg.data_path, 'relation.csv'), verbose=False)
rels = _handle_relation_data(relation_data)
cfg.vocab_size = vocab.count
serializer = Serializer(do_chinese_split=cfg.chinese_split)
serial = serializer.serialize
_serialize_sentence(data, serial, cfg)
_convert_tokens_into_index(data, vocab)
_add_pos_seq(data, cfg)
logger.info('start sentence preprocess...')
formats = '\nsentence: {}\nchinese_split: {}\nreplace_entity_with_type: {}\nreplace_entity_with_scope: {}\n' \
'tokens: {}\ntoken2idx: {}\nlength: {}\nhead_idx: {}\ntail_idx: {}'
logger.info(
formats.format(data[0]['sentence'], cfg.chinese_split, cfg.replace_entity_with_type,
cfg.replace_entity_with_scope, data[0]['tokens'], data[0]['token2idx'], data[0]['seq_len'],
data[0]['head_idx'], data[0]['tail_idx']))
return data, rels
def _get_predict_instance():
flag = input('Use the built-in example? [y/n] (type "exit" to quit): ')
flag = flag.strip().lower()
if flag == 'y' or flag == 'yes':
sentence = '《乡村爱情》是一部由知名导演赵本山在1985年所拍摄的农村青春偶像剧。'
head = '乡村爱情'
tail = '赵本山'
head_type = '电视剧'
tail_type = '人物'
elif flag == 'n' or flag == 'no':
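sentence = input('Enter the sentence: ')
head = input('Enter the head entity whose relation should be predicted: ')
head_type = input('Enter the head entity type: ')
tail = input('Enter the tail entity whose relation should be predicted: ')
tail_type = input('Enter the tail entity type: ')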
elif flag == 'exit':
sys.exit(0)
else:
print('please input yes or no, or exit!')
# return the retried result; otherwise execution falls through with unbound variables
return _get_predict_instance()
instance = dict()
instance['sentence'] = sentence.strip()
instance['head'] = head.strip()
instance['head_type'] = head_type.strip()
instance['tail'] = tail.strip()
instance['tail_type'] = tail_type.strip()
return instance
# path to the saved model checkpoint (set this to your own path)
fp = 'xxx/checkpoints/2019-12-03_17-35-30/cnn_epoch21.pth'
@hydra.main(config_path='conf/config.yaml')
def main(cfg):
cwd = utils.get_original_cwd()
cfg.cwd = cwd
cfg.pos_size = 2 * cfg.pos_limit + 2
# print(cfg.pretty())
# get predict instance
instance = _get_predict_instance()
data = [instance]
# preprocess data
data, rels = _preprocess_data(data, cfg)
# model
__Model__ = {
'cnn': models.PCNN,
}
# prediction is best run on the CPU
# cfg.use_gpu = False
if cfg.use_gpu and torch.cuda.is_available():
device = torch.device('cuda', cfg.gpu_id)
else:
device = torch.device('cpu')
logger.info(f'device: {device}')
model = __Model__[cfg.model_name](cfg)
model.load(fp, device=device)
model.to(device)
model.eval()
logger.info(f'model name: {cfg.model_name}')
logger.info(f'\n {model}')
x = dict()
x['word'], x['lens'] = torch.tensor([data[0]['token2idx']]), torch.tensor([data[0]['seq_len']])
if cfg.model_name != 'lm':
x['head_pos'], x['tail_pos'] = torch.tensor([data[0]['head_pos']]), torch.tensor([data[0]['tail_pos']])
if cfg.use_pcnn:
x['pcnn_mask'] = torch.tensor([data[0]['entities_pos']])
for key in x.keys():
x[key] = x[key].to(device)
with torch.no_grad():
y_pred = model(x)
y_pred = torch.softmax(y_pred, dim=-1)[0]
prob = y_pred.max().item()
prob_rel = list(rels.keys())[y_pred.argmax().item()]
logger.info(f"Relation between \"{data[0]['head']}\" and \"{data[0]['tail']}\" in the sentence: \"{prob_rel}\", confidence: {prob:.2f}")
if cfg.predict_plot:
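# matplotlib's default font cannot render Chinese labels
plt.rcParams["font.family"] = 'Arial Unicode MS'
# use separate names so the model input dict `x` is not shadowed
labels = list(rels.keys())
height = list(y_pred.cpu().numpy())
plt.bar(labels, height)
for label, h in zip(labels, height):
plt.text(label, h, '%.2f' % h, ha="center", va="bottom")
plt.xlabel('relation')
plt.ylabel('confidence')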
plt.xticks(rotation=315)
plt.show()
if __name__ == '__main__':
main()
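The last step of `main` (softmax over the model output, then mapping the arg-max index back to a relation name) can be sketched without PyTorch. This is a minimal illustration on plain lists: `softmax`, `top_relation`, the logits, and the relation names are all hypothetical, and it assumes, as the code above does, that the ordered keys of `rels` line up with the output indices.

```python
import math
from collections import OrderedDict


def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_relation(logits, rels):
    # pick the most probable class and look up its relation name by position
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return list(rels.keys())[best], probs[best]


# hypothetical relation table, ordered by index as _handle_relation_data does
rels = OrderedDict([
    ("None", {"index": 0}),
    ("director", {"index": 1}),
    ("birthplace", {"index": 2}),
])

label, prob = top_relation([0.1, 2.3, -1.0], rels)
print(f"{label}: {prob:.2f}")
```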

175
preprocess.py Normal file
View File

@ -0,0 +1,175 @@
import os
import logging
from collections import OrderedDict
from typing import List, Dict
from transformers import BertTokenizer
from serializer import Serializer
from vocab import Vocab
from utils import save_pkl, load_csv
logger = logging.getLogger(__name__)
def _handle_pos_limit(pos: List[int], limit: int) -> List[int]:
for i, p in enumerate(pos):
if p > limit:
pos[i] = limit
if p < -limit:
pos[i] = -limit
return [p + limit + 1 for p in pos]
def _add_pos_seq(train_data: List[Dict], cfg):
for d in train_data:
entities_idx = [d['head_idx'], d['tail_idx']
] if d['head_idx'] < d['tail_idx'] else [d['tail_idx'], d['head_idx']]
d['head_pos'] = list(map(lambda i: i - d['head_idx'], list(range(d['seq_len']))))
d['head_pos'] = _handle_pos_limit(d['head_pos'], int(cfg.pos_limit))
d['tail_pos'] = list(map(lambda i: i - d['tail_idx'], list(range(d['seq_len']))))
d['tail_pos'] = _handle_pos_limit(d['tail_pos'], int(cfg.pos_limit))
if cfg.use_pcnn:
# PCNN cannot be used when the sentence cannot be split into three segments,
# e.g. [head, ... tail] or [... head, tail, ...] cannot be masked into segments in a uniform way
d['entities_pos'] = [1] * (entities_idx[0] + 1) + [2] * (entities_idx[1] - entities_idx[0] - 1) +\
[3] * (d['seq_len'] - entities_idx[1])
def _convert_tokens_into_index(data: List[Dict], vocab):
unk_str = '[UNK]'
unk_idx = vocab.word2idx[unk_str]
for d in data:
d['token2idx'] = [vocab.word2idx.get(i, unk_idx) for i in d['tokens']]
d['seq_len'] = len(d['token2idx'])
def _serialize_sentence(data: List[Dict], serial, cfg):
for d in data:
sent = d['sentence'].strip()
sent = sent.replace(d['head'], ' head ', 1).replace(d['tail'], ' tail ', 1)
d['tokens'] = serial(sent, never_split=['head', 'tail'])
head_idx, tail_idx = d['tokens'].index('head'), d['tokens'].index('tail')
d['head_idx'], d['tail_idx'] = head_idx, tail_idx
if cfg.replace_entity_with_type:
if cfg.replace_entity_with_scope:
d['tokens'][head_idx], d['tokens'][tail_idx] = 'HEAD_' + d['head_type'], 'TAIL_' + d['tail_type']
else:
d['tokens'][head_idx], d['tokens'][tail_idx] = d['head_type'], d['tail_type']
else:
if cfg.replace_entity_with_scope:
d['tokens'][head_idx], d['tokens'][tail_idx] = 'HEAD', 'TAIL'
else:
d['tokens'][head_idx], d['tokens'][tail_idx] = d['head'], d['tail']
def _lm_serialize(data: List[Dict], cfg):
logger.info('use bert tokenizer...')
tokenizer = BertTokenizer.from_pretrained(cfg.lm_file)
for d in data:
sent = d['sentence'].strip()
sent = sent.replace(d['head'], d['head_type'], 1).replace(d['tail'], d['tail_type'], 1)
sent += '[SEP]' + d['head'] + '[SEP]' + d['tail']
d['token2idx'] = tokenizer.encode(sent, add_special_tokens=True)
d['seq_len'] = len(d['token2idx'])
def _add_relation_data(rels: Dict, data: List) -> None:
for d in data:
d['rel2idx'] = rels[d['relation']]['index']
d['head_type'] = rels[d['relation']]['head_type']
d['tail_type'] = rels[d['relation']]['tail_type']
def _handle_relation_data(relation_data: List[Dict]) -> Dict:
rels = OrderedDict()
relation_data = sorted(relation_data, key=lambda i: int(i['index']))
for d in relation_data:
rels[d['relation']] = {
'index': int(d['index']),
'head_type': d['head_type'],
'tail_type': d['tail_type'],
}
return rels
def preprocess(cfg):
logger.info('===== start preprocess data =====')
train_fp = os.path.join(cfg.cwd, cfg.data_path, 'train.csv')
valid_fp = os.path.join(cfg.cwd, cfg.data_path, 'valid.csv')
test_fp = os.path.join(cfg.cwd, cfg.data_path, 'test.csv')
relation_fp = os.path.join(cfg.cwd, cfg.data_path, 'relation.csv')
logger.info('load raw files...')
train_data = load_csv(train_fp)
valid_data = load_csv(valid_fp)
test_data = load_csv(test_fp)
relation_data = load_csv(relation_fp)
logger.info('convert relation into index...')
rels = _handle_relation_data(relation_data)
_add_relation_data(rels, train_data)
_add_relation_data(rels, valid_data)
_add_relation_data(rels, test_data)
logger.info('check whether to use a pretrained language model...')
if cfg.model_name == 'lm':
logger.info('serialize sentences with the pretrained language model tokenizer...')
_lm_serialize(train_data, cfg)
_lm_serialize(valid_data, cfg)
_lm_serialize(test_data, cfg)
else:
logger.info('serialize sentence into tokens...')
serializer = Serializer(do_chinese_split=cfg.chinese_split, do_lower_case=True)
serial = serializer.serialize
_serialize_sentence(train_data, serial, cfg)
_serialize_sentence(valid_data, serial, cfg)
_serialize_sentence(test_data, serial, cfg)
logger.info('build vocabulary...')
vocab = Vocab('word')
train_tokens = [d['tokens'] for d in train_data]
valid_tokens = [d['tokens'] for d in valid_data]
test_tokens = [d['tokens'] for d in test_data]
sent_tokens = [*train_tokens, *valid_tokens, *test_tokens]
for sent in sent_tokens:
vocab.add_words(sent)
vocab.trim(min_freq=cfg.min_freq)
logger.info('convert tokens into index...')
_convert_tokens_into_index(train_data, vocab)
_convert_tokens_into_index(valid_data, vocab)
_convert_tokens_into_index(test_data, vocab)
logger.info('build position sequence...')
_add_pos_seq(train_data, cfg)
_add_pos_seq(valid_data, cfg)
_add_pos_seq(test_data, cfg)
logger.info('save data for backup...')
os.makedirs(os.path.join(cfg.cwd, cfg.out_path), exist_ok=True)
train_save_fp = os.path.join(cfg.cwd, cfg.out_path, 'train.pkl')
valid_save_fp = os.path.join(cfg.cwd, cfg.out_path, 'valid.pkl')
test_save_fp = os.path.join(cfg.cwd, cfg.out_path, 'test.pkl')
save_pkl(train_data, train_save_fp)
save_pkl(valid_data, valid_save_fp)
save_pkl(test_data, test_save_fp)
if cfg.model_name != 'lm':
vocab_save_fp = os.path.join(cfg.cwd, cfg.out_path, 'vocab.pkl')
vocab_txt = os.path.join(cfg.cwd, cfg.out_path, 'vocab.txt')
save_pkl(vocab, vocab_save_fp)
logger.info('save vocab in txt file for inspection...')
with open(vocab_txt, 'w', encoding='utf-8') as f:
# text mode already translates '\n', so os.linesep would double the '\r' on Windows
f.write('\n'.join(vocab.word2idx.keys()))
logger.info('===== end preprocess data =====')
if __name__ == '__main__':
pass
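The position features built by `_handle_pos_limit` and `_add_pos_seq` above can be checked by hand on a tiny input. The sketch below re-implements the two ideas on plain lists; `relative_positions` and `pcnn_segments` are hypothetical helper names, not functions of this repository.

```python
def relative_positions(seq_len, entity_idx, limit):
    # offset of every token from the entity, clipped to [-limit, limit],
    # then shifted by limit + 1 so that 0 stays free for padding
    out = []
    for i in range(seq_len):
        p = max(-limit, min(limit, i - entity_idx))
        out.append(p + limit + 1)
    return out


def pcnn_segments(seq_len, head_idx, tail_idx):
    # segment 1: start .. first entity (inclusive)
    # segment 2: tokens strictly between the two entities
    # segment 3: second entity .. end
    first, second = sorted((head_idx, tail_idx))
    return [1] * (first + 1) + [2] * (second - first - 1) + [3] * (seq_len - second)


print(relative_positions(6, 2, 2))  # [1, 2, 3, 4, 5, 5]
print(pcnn_segments(6, 1, 4))       # [1, 1, 2, 2, 3, 3]
```

With `limit = 2` the shifted values span 1 to 5, which is consistent with `pos_size = 2 * pos_limit + 2` once index 0 is counted. When the two entities are adjacent, the middle segment is empty, which is the case the comment in `_add_pos_seq` warns about.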

View File

@ -1,5 +1,7 @@
torch>=1.0
jieba>=0.38
pytorch_transformers>=1.2
matplotlib>=3.0
scikit_learn>=0.20
tensorboard>=2.0
matplotlib>=3.1.0
transformers>=2.0
hydra-core>=0.11
jieba>=0.39
pyhanlp

203
serializer.py Normal file
View File

@ -0,0 +1,203 @@
import re
import unicodedata
import jieba
import logging
from typing import List
logger = logging.getLogger(__name__)
jieba.setLogLevel(logging.INFO)
class Serializer():
def __init__(self, never_split: List = None, do_lower_case=True, do_chinese_split=False):
self.never_split = never_split if never_split is not None else []
self.do_lower_case = do_lower_case
self.do_chinese_split = do_chinese_split
def serialize(self, text, never_split: List = None):
never_split = self.never_split + (never_split if never_split is not None else [])
text = self._clean_text(text)
if self.do_chinese_split:
output_tokens = self._use_jieba_cut(text, never_split)
return output_tokens
text = self._tokenize_chinese_chars(text)
orig_tokens = self._orig_tokenize(text)
split_tokens = []
for token in orig_tokens:
if self.do_lower_case and token not in never_split:
token = token.lower()
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token, never_split=never_split))
output_tokens = self._whitespace_tokenize(" ".join(split_tokens))
return output_tokens
def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xfffd or self.is_control(char):
continue
if self.is_whitespace(char):
output.append(" ")
else:
output.append(char)
return "".join(output)
def _use_jieba_cut(self, text, never_split):
for word in never_split:
jieba.suggest_freq(word, True)
tokens = jieba.lcut(text)
if self.do_lower_case:
tokens = [i.lower() for i in tokens]
# drop tokens that are a single space
return [t for t in tokens if t != ' ']
def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self.is_chinese_char(cp):
output.append(" ")
output.append(char)
output.append(" ")
else:
output.append(char)
return "".join(output)
def _orig_tokenize(self, text):
"""Splits text on whitespace and some punctuations like comma or period"""
text = text.strip()
if not text:
return []
# common sentence-breaking punctuation
punc = """,.?!;: 、|,。?!;:《》「」【】/<>|\“ ”‘ """
punc_re = '|'.join(re.escape(x) for x in punc)
tokens = re.sub(punc_re, lambda x: ' ' + x.group() + ' ', text)
tokens = tokens.split()
return tokens
def _whitespace_tokenize(self, text):
"""Runs basic whitespace cleaning and splitting on a piece of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens
def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize("NFD", text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == "Mn":
continue
output.append(char)
return "".join(output)
def _run_split_on_punc(self, text, never_split=None):
"""Splits punctuation on a piece of text."""
if never_split is not None and text in never_split:
return [text]
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if self.is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1
return ["".join(x) for x in output]
@staticmethod
def is_control(char):
"""Checks whether `char` is a control character."""
# These are technically control characters but we count them as whitespace
# characters.
if char == "\t" or char == "\n" or char == "\r":
return False
cat = unicodedata.category(char)
if cat.startswith("C"):
return True
return False
@staticmethod
def is_whitespace(char):
"""Checks whether `char` is a whitespace character."""
# \t, \n, and \r are technically control characters but we treat them
# as whitespace since they are generally considered as such.
if char == " " or char == "\t" or char == "\n" or char == "\r":
return True
cat = unicodedata.category(char)
if cat == "Zs":
return True
return False
@staticmethod
def is_chinese_char(cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and handled
# like all of the other languages.
if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
(cp >= 0x3400 and cp <= 0x4DBF) or #
(cp >= 0x20000 and cp <= 0x2A6DF) or #
(cp >= 0x2A700 and cp <= 0x2B73F) or #
(cp >= 0x2B740 and cp <= 0x2B81F) or #
(cp >= 0x2B820 and cp <= 0x2CEAF) or (cp >= 0xF900 and cp <= 0xFAFF) or #
(cp >= 0x2F800 and cp <= 0x2FA1F)): #
return True
return False
@staticmethod
def is_punctuation(char):
"""Checks whether `char` is a punctuation character."""
cp = ord(char)
# We treat all non-letter/number ASCII as punctuation.
# Characters such as "^", "$", and "`" are not in the Unicode
# Punctuation class but we treat them as punctuation anyways, for
# consistency.
if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or (cp >= 91 and cp <= 96)
or (cp >= 123 and cp <= 126)):
return True
cat = unicodedata.category(char)
if cat.startswith("P"):
return True
return False
if __name__ == '__main__':
text1 = "\t\n你 好呀, I\'m his pupp\'peer,\n\t"
text2 = '你孩子的爱情叫 Stam\'s 的打到天啊呢哦'
serializer = Serializer(do_chinese_split=False)
print(serializer.serialize(text1))
print(serializer.serialize(text2))
text3 = "good\'s head pupp\'er, "
# print: ["good's", 'pupp', "'", 'er', ',']
# true: ["good's", "pupp'er", ","]
print(serializer.serialize(text3, never_split=["pupp\'er"]))
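The character-level path of `Serializer.serialize` (used when `do_chinese_split=False`) pads every CJK character with spaces and then splits on whitespace. A reduced sketch of that idea follows; `is_cjk` and `split_cjk` are hypothetical names, and only two of the Unicode ranges from `is_chinese_char` are checked to keep it short.

```python
def is_cjk(cp: int) -> bool:
    # simplified: CJK Unified Ideographs and Extension A only
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def split_cjk(text: str):
    # surround each CJK character with spaces, lowercase, split on whitespace
    spaced = "".join(f" {ch} " if is_cjk(ord(ch)) else ch for ch in text)
    return spaced.lower().split()


print(split_cjk("\t叫Stam一起到nba打篮球\n"))
```

Each Chinese character becomes its own token while Latin-script runs such as `Stam` stay whole, which is the behaviour the no-split test case in `test/test_serializer.py` expects for this sentence.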

38
test/test_attention.py Normal file
View File

@ -0,0 +1,38 @@
import pytest
import torch
from utils import seq_len_to_mask
from module import DotAttention, MultiHeadAttention
torch.manual_seed(1)
q = torch.randn(4, 6, 20) # [B, L, H]
k = v = torch.randn(4, 5, 20) # [B, S, H]
key_padding_mask = seq_len_to_mask([5, 4, 3, 2], max_len=5)
attention_mask = torch.tensor([1, 0, 0, 1, 0]) # positions set to 1 are masked out
head_mask = torch.tensor([0, 1, 0, 0]) # heads set to 1 are masked out
# m = DotAttention(dropout=0.0)
# ao,aw = m(q,k,v,key_padding_mask)
# print(ao.shape,aw.shape)
# print(aw)
def test_DotAttention():
m = DotAttention(dropout=0.0)
ao, aw = m(q, k, v, mask_out=key_padding_mask)
assert ao.shape == torch.Size([4, 6, 20])
assert aw.shape == torch.Size([4, 6, 5])
assert torch.all(aw[1, :, -1:].eq(0)) == torch.all(aw[2, :, -2:].eq(0)) == torch.all(aw[3, :, -3:].eq(0)) == True
def test_MultiHeadAttention():
m = MultiHeadAttention(embed_dim=20, num_heads=4, dropout=0.0)
ao, aw = m(q, k, v, key_padding_mask=key_padding_mask,attention_mask=attention_mask,head_mask=head_mask)
assert ao.shape == torch.Size([4, 6, 20])
assert aw.shape == torch.Size([4, 4, 6, 5])
assert aw.unbind(dim=1)[1].bool().any() == False
if __name__ == '__main__':
pytest.main()

32
test/test_cnn.py Normal file
View File

@ -0,0 +1,32 @@
import pytest
import torch
from module import CNN
from utils import seq_len_to_mask
class Config(object):
in_channels = 100
out_channels = 200
kernel_sizes = [3, 5, 7, 9, 11]
activation = 'gelu'
pooling_strategy = 'avg'
config = Config()
def test_CNN():
x = torch.randn(4, 5, 100)
seq = torch.arange(4, 0, -1)
mask = seq_len_to_mask(seq, max_len=5)
cnn = CNN(config)
out, out_pooling = cnn(x, mask=mask)
out_channels = config.out_channels * len(config.kernel_sizes)
assert out.shape == torch.Size([4, 5, out_channels])
assert out_pooling.shape == torch.Size([4, out_channels])
if __name__ == '__main__':
pytest.main()

38
test/test_embedding.py Normal file
View File

@ -0,0 +1,38 @@
import pytest
import torch
from module import Embedding
class Config(object):
vocab_size = 10
word_dim = 10
pos_size = 12 # 2 * pos_limit + 2
pos_dim = 5
dim_strategy = 'cat' # [cat, sum]
config = Config()
x = torch.tensor([[1, 2, 3, 4, 5], [6, 7, 3, 5, 0], [8, 4, 3, 0, 0]])
x_pos = torch.tensor([[1, 2, 3, 4, 5], [1, 2, 3, 4, 0], [1, 2, 3, 0, 0]])
def test_Embedding_cat():
embed = Embedding(config)
feature = embed((x, x_pos))
dim = config.word_dim + config.pos_dim
assert feature.shape == torch.Size((3, 5, dim))
def test_Embedding_sum():
config.dim_strategy = 'sum'
embed = Embedding(config)
feature = embed((x, x_pos))
dim = config.word_dim
assert feature.shape == torch.Size((3, 5, dim))
if __name__ == '__main__':
pytest.main()

49
test/test_rnn.py Normal file
View File

@ -0,0 +1,49 @@
import pytest
import torch
from module import RNN
from utils import seq_len_to_mask
class Config(object):
type_rnn = 'LSTM'
input_size = 5
hidden_size = 4
num_layers = 3
dropout = 0.0
last_layer_hn = False
bidirectional = True
config = Config()
def test_RNN():
torch.manual_seed(1)
x = torch.tensor([[4, 3, 2, 1], [5, 6, 7, 0], [8, 10, 0, 0]])
x = torch.nn.Embedding(11, 5, padding_idx=0)(x) # B,L,H = 3,4,5
x_len = torch.tensor([4, 3, 2])
model = RNN(config)
output, hn = model(x, x_len)
B, L, _ = x.size()
H, N = config.hidden_size, config.num_layers
assert output.shape == torch.Size([B, L, H])
assert hn.shape == torch.Size([B, N, H])
config.bidirectional = False
model = RNN(config)
output, hn = model(x, x_len)
assert output.shape == torch.Size([B, L, H])
assert hn.shape == torch.Size([B, N, H])
config.last_layer_hn = True
model = RNN(config)
output, hn = model(x, x_len)
assert output.shape == torch.Size([B, L, H])
assert hn.shape == torch.Size([B, H])
if __name__ == '__main__':
pytest.main()

36
test/test_serializer.py Normal file
View File

@ -0,0 +1,36 @@
import pytest
from serializer import Serializer
def test_serializer_for_no_chinese_split():
text1 = "\nI\'m his pupp\'peer, and i have a ball\t"
text2 = '\t叫Stam一起到nba打篮球\n'
text3 = '\n\n现在时刻2014-04-08\t\t'
serializer = Serializer(do_chinese_split=False)
serial_text1 = serializer.serialize(text1)
serial_text2 = serializer.serialize(text2)
serial_text3 = serializer.serialize(text3)
assert serial_text1 == ['i', "'", 'm', 'his', 'pupp', "'", 'peer', ',', 'and', 'i', 'have', 'a', 'ball']
assert serial_text2 == ['', 'stam', '', '', '', 'nba', '', '', '']
assert serial_text3 == ['', '', '', '', '2014', '-', '04', '-', '08']
def test_serializer_for_chinese_split():
text1 = "\nI\'m his pupp\'peer, and i have a basketball\t"
text2 = '\t叫Stam一起到nba打篮球\n'
text3 = '\n\n现在时刻2014-04-08\t\t'
serializer = Serializer(do_chinese_split=True)
serial_text1 = serializer.serialize(text1)
serial_text2 = serializer.serialize(text2)
serial_text3 = serializer.serialize(text3)
assert serial_text1 == ['i', "'", 'm', 'his', 'pupp', "'", 'peer', ',', 'and', 'i', 'have', 'a', 'basketball']
assert serial_text2 == ['', 'stam', '一起', '', 'nba', '打篮球']
assert serial_text3 == ['现在', '时刻', '2014', '-', '04', '-', '08']
if __name__ == '__main__':
pytest.main()

40
test/test_transformer.py Normal file
View File

@ -0,0 +1,40 @@
import pytest
import torch
from module import Transformer
from utils import seq_len_to_mask
class Config():
hidden_size = 12
intermediate_size = 24
num_hidden_layers = 5
num_heads = 3
dropout = 0.0
layer_norm_eps = 1e-12
hidden_act = 'gelu_new'
output_attentions = True
output_hidden_states = True
config = Config()
def test_Transformer():
m = Transformer(config)
i = torch.randn(4, 5, 12) # [B, L, H]
key_padding_mask = seq_len_to_mask([5, 4, 3, 2], max_len=5)
attention_mask = torch.tensor([1, 0, 0, 1, 0]) # positions set to 1 are masked out
head_mask = torch.tensor([0, 1, 0]) # heads set to 1 are masked out
out = m(i, key_padding_mask=key_padding_mask, attention_mask=attention_mask, head_mask=head_mask)
hn, h_all, att_weights = out
assert hn.shape == torch.Size([4, 5, 12])
assert torch.equal(h_all[0], i) and torch.equal(h_all[-1], hn) == True
assert len(h_all) == config.num_hidden_layers + 1
assert len(att_weights) == config.num_hidden_layers
assert att_weights[0].shape == torch.Size([4, 3, 5, 5])
assert att_weights[0].unbind(dim=1)[1].bool().any() == False
if __name__ == '__main__':
pytest.main()

38
test/test_vocab.py Normal file
View File

@ -0,0 +1,38 @@
import pytest
from serializer import Serializer
from vocab import Vocab
def test_vocab():
vocab = Vocab('test')
sent = ' 我是中国人,我爱中国。 I\'m Chinese, I love China'
serializer = Serializer(do_lower_case=True)
tokens = serializer.serialize(sent)
assert tokens == [
'', '', '', '', '', '', '', '', '', '', '', 'i', "'", 'm', 'chinese', ',', 'i', 'love', 'china'
]
vocab.add_words(tokens)
unk_str = '[UNK]'
unk_idx = vocab.word2idx[unk_str]
assert vocab.count == 22
assert len(vocab.word2idx) == len(vocab.idx2word) == len(vocab.word2idx) == 22
vocab.trim(2, verbose=False)
assert vocab.count == 11
assert len(vocab.word2idx) == len(vocab.idx2word) == len(vocab.word2idx) == 11
token2idx = [vocab.word2idx.get(i, unk_idx) for i in tokens]
assert len(tokens) == len(token2idx)
assert token2idx == [7, 1, 8, 9, 1, 1, 7, 1, 8, 9, 1, 10, 1, 1, 1, 1, 10, 1, 1]
idx2tokens = [vocab.idx2word.get(i, unk_str) for i in token2idx]
assert len(idx2tokens) == len(token2idx)
assert ' '.join(idx2tokens) == '我 [UNK] 中 国 [UNK] [UNK] 我 [UNK] 中 国 [UNK] i [UNK] [UNK] [UNK] [UNK] i [UNK] [UNK]'
if __name__ == '__main__':
pytest.main()

82
trainer.py Normal file
View File

@ -0,0 +1,82 @@
import torch
import logging
import matplotlib.pyplot as plt
from metrics import PRMetric
logger = logging.getLogger(__name__)
def train(epoch, model, dataloader, optimizer, criterion, device, writer, cfg):
model.train()
metric = PRMetric()
losses = []
for batch_idx, (x, y) in enumerate(dataloader, 1):
for key, value in x.items():
x[key] = value.to(device)
y = y.to(device)
optimizer.zero_grad()
y_pred = model(x)
loss = criterion(y_pred, y)
loss.backward()
optimizer.step()
metric.update(y_true=y, y_pred=y_pred)
losses.append(loss.item())
data_total = len(dataloader.dataset)
data_cal = data_total if batch_idx == len(dataloader) else batch_idx * len(y)
if (cfg.train_log and batch_idx % cfg.log_interval == 0) or batch_idx == len(dataloader):
# p, r and f1 are macro-averaged: under micro averaging all three coincide, and that value is reported as acc
acc, p, r, f1 = metric.compute()
logger.info(f'Train Epoch {epoch}: [{data_cal}/{data_total} ({100. * data_cal / data_total:.0f}%)]\t'
f'Loss: {loss.item():.6f}')
logger.info(f'Train Epoch {epoch}: Acc: {100. * acc:.2f}%\t'
f'macro metrics: [p: {p:.4f}, r:{r:.4f}, f1:{f1:.4f}]')
if cfg.show_plot and not cfg.only_comparison_plot:
if cfg.plot_utils == 'matplot':
plt.plot(losses)
plt.title(f'epoch {epoch} train loss')
plt.show()
if cfg.plot_utils == 'tensorboard':
for i in range(len(losses)):
writer.add_scalar(f'epoch_{epoch}_training_loss', losses[i], i)
return losses[-1]
def validate(epoch, model, dataloader, criterion, device):
model.eval()
metric = PRMetric()
losses = []
for batch_idx, (x, y) in enumerate(dataloader, 1):
for key, value in x.items():
x[key] = value.to(device)
y = y.to(device)
with torch.no_grad():
y_pred = model(x)
loss = criterion(y_pred, y)
metric.update(y_true=y, y_pred=y_pred)
losses.append(loss.item())
loss = sum(losses) / len(losses)
acc, p, r, f1 = metric.compute()
data_total = len(dataloader.dataset)
if epoch >= 0:
logger.info(f'Valid Epoch {epoch}: [{data_total}/{data_total}](100%)\t Loss: {loss:.6f}')
logger.info(f'Valid Epoch {epoch}: Acc: {100. * acc:.2f}%\tmacro metrics: [p: {p:.4f}, r:{r:.4f}, f1:{f1:.4f}]')
else:
logger.info(f'Test Data: [{data_total}/{data_total}](100%)\t Loss: {loss:.6f}')
logger.info(f'Test Data: Acc: {100. * acc:.2f}%\tmacro metrics: [p: {p:.4f}, r:{r:.4f}, f1:{f1:.4f}]')
return f1, loss
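The training and validation loops above report accuracy together with macro precision, recall and F1 from `PRMetric`, whose implementation is not part of this section. As a rough illustration of what such numbers mean, here is a pure-Python sketch; `macro_prf` is a hypothetical function, and it assumes macro F1 is the mean of the per-class F1 scores, which may differ from what `PRMetric` actually does.

```python
def macro_prf(y_true, y_pred, num_classes):
    # per-class precision / recall / F1, then an unweighted mean over classes
    ps, rs, fs = [], [], []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec)
        rs.append(rec)
        fs.append(f1)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return acc, sum(ps) / num_classes, sum(rs) / num_classes, sum(fs) / num_classes


acc, p, r, f1 = macro_prf([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
print(f"acc={acc:.4f} p={p:.4f} r={r:.4f} f1={f1:.4f}")
```

Macro averaging weights every relation equally, so a rare relation that is predicted badly pulls the score down as much as a frequent one.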

2
utils/__init__.py Normal file
View File

@ -0,0 +1,2 @@
from .ioUtils import *
from .nnUtils import *

56
utils/ioUtils.py Normal file
View File

@ -0,0 +1,56 @@
import os
import csv
import pickle
import logging
from typing import NewType, List, Tuple, Dict, Any
__all__ = [
'load_pkl',
'save_pkl',
'load_csv',
'save_csv',
]
logger = logging.getLogger(__name__)
Path = str
def load_pkl(fp: Path, verbose: bool = True) -> Any:
if verbose:
logger.info(f'load data from {fp}')
with open(fp, 'rb') as f:
data = pickle.load(f)
return data
def save_pkl(data: Any, fp: Path, verbose: bool = True) -> None:
if verbose:
logger.info(f'save data in {fp}')
with open(fp, 'wb') as f:
pickle.dump(data, f)
def load_csv(fp: Path, is_tsv: bool = False, verbose: bool = True) -> List:
if verbose:
logger.info(f'load csv from {fp}')
dialect = 'excel-tab' if is_tsv else 'excel'
with open(fp, encoding='utf-8') as f:
reader = csv.DictReader(f, dialect=dialect)
return list(reader)
def save_csv(data: List[Dict], fp: Path, save_in_tsv: bool = False, write_head=True, verbose=True) -> None:
if verbose:
logger.info(f'save csv file in: {fp}')
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(fp, 'w', encoding='utf-8', newline='') as f:
fieldnames = data[0].keys()
dialect = 'excel-tab' if save_in_tsv else 'excel'
writer = csv.DictWriter(f, fieldnames=fieldnames, dialect=dialect)
if write_head:
writer.writeheader()
writer.writerows(data)
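
As a rough illustration of what `save_csv` followed by `load_csv` does, the sketch below round-trips a few rows through a temporary file using only the standard library. The helper name `roundtrip` and the sample rows are hypothetical and are not part of this commit. One point worth noticing is that `csv.DictReader` returns every value as a string, so numeric fields do not come back as numbers.

```python
import csv
import os
import tempfile


def roundtrip(rows, is_tsv=False):
    """Write rows to a temporary CSV/TSV file and read them back (illustrative helper)."""
    dialect = 'excel-tab' if is_tsv else 'excel'
    fd, fp = tempfile.mkstemp(suffix='.tsv' if is_tsv else '.csv')
    os.close(fd)
    try:
        # newline='' is what the csv module documentation recommends for file objects
        with open(fp, 'w', encoding='utf-8', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()), dialect=dialect)
            writer.writeheader()
            writer.writerows(rows)
        with open(fp, encoding='utf-8', newline='') as f:
            return [dict(row) for row in csv.DictReader(f, dialect=dialect)]
    finally:
        os.remove(fp)


rows = [
    {'sentence': 'a b c', 'label': '1'},
    {'sentence': 'd, e', 'label': '0'},  # the comma is quoted automatically in CSV mode
]
print(roundtrip(rows) == rows)
print(roundtrip([{'n': 1}]))  # the integer comes back as the string '1'
```

If numeric columns are needed after loading, they have to be converted explicitly by the caller.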

51
utils/nnUtils.py Normal file

@ -0,0 +1,51 @@
import torch
import random
import logging
import numpy as np
from typing import List, Tuple, Dict, Union

logger = logging.getLogger(__name__)

__all__ = [
    'manual_seed',
    'seq_len_to_mask',
]


def manual_seed(num: int = 1) -> None:
    """Seed the Python, NumPy and PyTorch (CPU and CUDA) random number generators."""
    random.seed(num)
    np.random.seed(num)
    torch.manual_seed(num)
    torch.cuda.manual_seed(num)
    torch.cuda.manual_seed_all(num)


def seq_len_to_mask(seq_len: Union[List, np.ndarray, torch.Tensor], max_len=None, mask_pos_to_true=True):
    """
    Convert a 1-D array of sequence lengths into a 2-D mask. By default, padded positions are True (1).

    :param list, np.ndarray, torch.LongTensor seq_len: sequence lengths, shape (B,)
    :param int max_len: length to pad the mask to. By default (None), the longest length in seq_len is used.
        Under nn.DataParallel the seq_len on different devices may differ, so pass max_len explicitly
        to pad every mask to the same length.
    :param bool mask_pos_to_true: if True (default), padded positions are True; otherwise the valid
        (non-padded) positions are True.
    :return: torch.Tensor of shape (B, max_len) whose elements are bool or torch.uint8
    """
    if isinstance(seq_len, list):
        seq_len = np.array(seq_len)

    if isinstance(seq_len, np.ndarray):
        seq_len = torch.from_numpy(seq_len)

    if isinstance(seq_len, torch.Tensor):
        # logger.error() returns None, so it cannot serve as the assertion message
        assert seq_len.dim() == 1, f"seq_len can only have one dimension, got {seq_len.dim()} != 1."
        batch_size = seq_len.size(0)
        max_len = int(max_len) if max_len else int(seq_len.max())
        broad_cast_seq_len = torch.arange(max_len).expand(batch_size, -1).to(seq_len.device)
        if mask_pos_to_true:
            mask = broad_cast_seq_len.ge(seq_len.unsqueeze(1))
        else:
            mask = broad_cast_seq_len.lt(seq_len.unsqueeze(1))
    else:
        # raise a real exception; `raise logger.error(...)` would try to raise None
        raise TypeError("Only support 1-d list or 1-d numpy.ndarray or 1-d torch.Tensor.")

    return mask
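
To make the mask semantics concrete, here is a dependency-free sketch of the same comparison that `seq_len_to_mask` performs with `ge` / `lt`, written with plain Python lists instead of tensors. The function name `seq_len_to_mask_py` is hypothetical and only mirrors the logic for illustration; it is not a replacement for the tensor version.

```python
def seq_len_to_mask_py(seq_len, max_len=None, mask_pos_to_true=True):
    """Pure-Python illustration of the mask logic (hypothetical helper)."""
    if max_len is None:
        max_len = max(seq_len)
    if mask_pos_to_true:
        # position >= length  ->  padded position is True
        return [[pos >= n for pos in range(max_len)] for n in seq_len]
    # position < length  ->  valid position is True
    return [[pos < n for pos in range(max_len)] for n in seq_len]


# Two sequences of length 1 and 3, padded to the longest length (3).
print(seq_len_to_mask_py([1, 3]))
# Same lengths, but marking the valid tokens instead of the padding.
print(seq_len_to_mask_py([1, 3], mask_pos_to_true=False))
# An explicit max_len pads every row to the same width, as needed under nn.DataParallel.
print(seq_len_to_mask_py([1, 3], max_len=4))
```

With the default setting, row `i` is False for the first `seq_len[i]` positions and True for the remaining padded positions.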

103
vocab.py Normal file

@ -0,0 +1,103 @@
import logging
from collections import OrderedDict
from typing import Sequence, Optional

logger = logging.getLogger(__name__)

SPECIAL_TOKENS_KEYS = [
    "pad_token",
    "unk_token",
    "mask_token",
    "cls_token",
    "sep_token",
    "bos_token",
    "eos_token",
]

SPECIAL_TOKENS_VALUES = [
    "[PAD]",
    "[UNK]",
    "[MASK]",
    "[CLS]",
    "[SEP]",
    "[BOS]",
    "[EOS]",
]

SPECIAL_TOKENS = OrderedDict(zip(SPECIAL_TOKENS_KEYS, SPECIAL_TOKENS_VALUES))


class Vocab(object):
    def __init__(self, name: str = 'basic', init_tokens: dict = SPECIAL_TOKENS):
        # init_tokens maps special-token names to token strings; only its values are added
        self.name = name
        self.init_tokens = init_tokens
        self.trimed = False
        self.word2idx = {}
        self.word2count = {}
        self.idx2word = {}
        self.count = 0
        self._add_init_tokens()

    def _add_init_tokens(self):
        for token in self.init_tokens.values():
            self._add_word(token)

    def _add_word(self, word: str):
        # a new word gets the next free index; a known word only has its count incremented
        if word not in self.word2idx:
            self.word2idx[word] = self.count
            self.word2count[word] = 1
            self.idx2word[self.count] = word
            self.count += 1
        else:
            self.word2count[word] += 1

    def add_words(self, words: Sequence):
        for word in words:
            self._add_word(word)

    def trim(self, min_freq=2, verbose: Optional[bool] = True):
        '''Remove words whose frequency is below min_freq from the vocabulary.

        Args:
            min_freq: minimum word frequency
        '''
        assert min_freq == int(min_freq), f'min_freq must be integer, can\'t be {min_freq}'
        min_freq = int(min_freq)
        if min_freq < 2:
            return
        if self.trimed:
            return
        self.trimed = True

        keep_words = []
        new_words = []

        for k, v in self.word2count.items():
            if v >= min_freq:
                keep_words.append(k)
                new_words.extend([k] * v)

        if verbose:
            keep_len = len(keep_words)
            total_len = len(self.word2idx) - len(self.init_tokens)
            # guard against a vocabulary that holds only the initial tokens
            ratio = keep_len / total_len * 100 if total_len else 0.0
            logger.info('vocab after trimming, keep words [{} / {}] = {:.2f}%'.format(
                keep_len, total_len, ratio))

        # Reinitialize dictionaries
        self.word2idx = {}
        self.word2count = {}
        self.idx2word = {}
        self.count = 0
        self._add_init_tokens()
        self.add_words(new_words)


if __name__ == '__main__':
    vocab = Vocab('test')
    sent = ' 我是中国人,我爱中国。'
    sent = list(sent)
    print(sent)

    vocab.add_words(sent)
    print(vocab.word2count)

    vocab.trim(2)
    print(vocab.word2count)
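
The effect of `trim` can be summarised as: count every token, keep those seen at least `min_freq` times, and re-assign indices with the special tokens first. The sketch below reproduces that idea with `collections.Counter`. The function `build_trimmed_vocab`, the shortened `SPECIALS` list and the sample sentence are assumptions made for illustration and do not appear in this commit.

```python
from collections import Counter

SPECIALS = ['[PAD]', '[UNK]']  # shortened, hypothetical subset of the special tokens


def build_trimmed_vocab(tokens, min_freq=2, specials=SPECIALS):
    """Return a word-to-index map that keeps only tokens seen at least min_freq times."""
    counts = Counter(tokens)
    # special tokens always occupy the first indices
    word2idx = {tok: i for i, tok in enumerate(specials)}
    for tok in tokens:  # iterate in order so kept words retain first-seen order
        if counts[tok] >= min_freq and tok not in word2idx:
            word2idx[tok] = len(word2idx)
    return word2idx


tokens = 'the cat saw the dog saw a bird'.split()
print(build_trimmed_vocab(tokens))
```

Only `the` and `saw` occur twice here, so they are the only regular words that survive, placed after the special tokens.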