deepke/tutorial-notebooks/LM.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## relation extraction 实践\n",
"> Tutorial作者余海阳yuhaiyang@zju.edu.cn)\n",
"\n",
"在这个演示中,我们使用 `pretrain_language_model` 模型实现中文关系抽取。\n",
"希望在这个demo中帮助大家了解知识图谱构建过程中三元组抽取构建的原理和常用方法。\n",
"\n",
"本demo使用 `python3` 运⾏。\n",
"\n",
"### 数据集\n",
"在这个示例中,我们采样了一些中文文本,抽取其中的三元组。\n",
"\n",
"sentence|relation|head|tail\n",
":---:|:---:|:---:|:---:\n",
"孔正锡在2005年以一部温馨的爱情电影《长腿叔叔》敲开电影界大门。|导演|长腿叔叔|孔正锡\n",
"《伤心的树》是吴宗宪的音乐作品,收录在《你比从前快乐》专辑中。|所属专辑|伤心的树|你比从前快乐\n",
"2000年8月「天坛大佛」荣获「香港十大杰出工程项目」第四名。|所在城市|天坛大佛|香港\n",
"\n",
"\n",
"- train.csv: 包含6个训练三元组文件的每一⾏表示一个三元组, 按句子、关系、头实体、尾实体排序,并用`,`分隔。\n",
"- valid.csv: 包含3个验证三元组文件的每一⾏表示一个三元组, 按句子、关系、头实体、尾实体排序,并用`,`分隔。\n",
"- test.csv: 包含3个测试三元组文件的每一⾏表示一个三元组, 按句子、关系、头实体、尾实体排序,并用`,`分隔。\n",
"- relation.csv: 包含4种关系三元组文件的每一⾏表示一个三元组种类, 按头实体种类、尾实体种类、关系、序号排序,并用`,`分隔。"
]
},
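{
"cell_type": "markdown",
"metadata": {},
"source": [
"The optional cell below is a quick sanity check, assuming the four CSV files described above are already placed under `data/`: it prints the header line and the first few rows of `train.csv` so you can confirm the column layout before preprocessing. It makes no assumption about any extra columns beyond those listed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: peek at the raw training file (assumes data/train.csv exists).\n",
"with open('data/train.csv', encoding='utf-8') as f:\n",
"    for i, line in enumerate(f):\n",
"        print(line.rstrip())\n",
"        if i >= 3:  # header + first three triples\n",
"            break"
]
},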
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### BERT 原理回顾\n",
"\n",
"![BERT](img/Bert.png)\n",
"\n",
"原句经过bert编码后可以得到丰富的语义信息。所得结果输入到双向LSTM中输出的结果即可得到句子的关系信息。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 代码实践\n",
"\n",
"重要提示:\n",
"- 在使用预训练语言模型时需要加载约500m的模型数据所以更建议下载到本地后运行。此时只需要将 `lm_file` 值修改为本地文件夹的地址即可。具体预训练模型下载链接见:[transformers](https://huggingface.co/transformers/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 使用pytorch运行神经网络运行前确认是否安装\n",
"!pip install torch\n",
"!pip install matplotlib\n",
"!pip install transformers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 导入所使用模块\n",
"import os\n",
"import csv\n",
"import math\n",
"import pickle\n",
"import logging\n",
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from torch import optim\n",
"from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence\n",
"from torch.utils.data import Dataset,DataLoader\n",
"from sklearn.metrics import precision_recall_fscore_support\n",
"from typing import List, Tuple, Dict, Any, Sequence, Optional, Union\n",
"from transformers import BertTokenizer, BertModel\n",
"\n",
"logger = logging.getLogger(__name__)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 模型调参的配置文件\n",
"class Config(object):\n",
" model_name = 'lm' # ['cnn', 'gcn', 'lm']\n",
" use_pcnn = True\n",
" min_freq = 1\n",
" pos_limit = 20\n",
" out_path = 'data/out' \n",
" batch_size = 2 \n",
" word_dim = 10\n",
" pos_dim = 5\n",
" dim_strategy = 'sum' # ['sum', 'cat']\n",
" out_channels = 20\n",
" intermediate = 10\n",
" kernel_sizes = [3, 5, 7]\n",
" activation = 'gelu'\n",
" pooling_strategy = 'max'\n",
" dropout = 0.3\n",
" epoch = 10\n",
" num_relations = 4\n",
" learning_rate = 3e-4\n",
" lr_factor = 0.7 # 学习率的衰减率\n",
" lr_patience = 3 # 学习率衰减的等待epoch\n",
" weight_decay = 1e-3 # L2正则\n",
" early_stopping_patience = 6\n",
" train_log = True\n",
" log_interval = 1\n",
" show_plot = True\n",
" only_comparison_plot = False\n",
" plot_utils = 'matplot'\n",
" lm_file = 'bert-base-chinese'\n",
"# lm_file = '/Users/yuhaiyang/transformers/bert-base-chinese'\n",
" lm_num_hidden_layers = 2\n",
" rnn_layers = 2\n",
" \n",
"cfg = Config()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 预处理过程所需要使用的函数\n",
"Path = str\n",
"\n",
"def load_csv(fp: Path, is_tsv: bool = False, verbose: bool = True) -> List:\n",
" if verbose:\n",
" logger.info(f'load csv from {fp}')\n",
"\n",
" dialect = 'excel-tab' if is_tsv else 'excel'\n",
" with open(fp, encoding='utf-8') as f:\n",
" reader = csv.DictReader(f, dialect=dialect)\n",
" return list(reader)\n",
"\n",
" \n",
"def load_pkl(fp: Path, verbose: bool = True) -> Any:\n",
" if verbose:\n",
" logger.info(f'load data from {fp}')\n",
"\n",
" with open(fp, 'rb') as f:\n",
" data = pickle.load(f)\n",
" return data\n",
"\n",
"\n",
"def save_pkl(data: Any, fp: Path, verbose: bool = True) -> None:\n",
" if verbose:\n",
" logger.info(f'save data in {fp}')\n",
"\n",
" with open(fp, 'wb') as f:\n",
" pickle.dump(data, f)\n",
" \n",
" \n",
"def _handle_relation_data(relation_data: List[Dict]) -> Dict:\n",
" rels = dict()\n",
" for d in relation_data:\n",
" rels[d['relation']] = {\n",
" 'index': int(d['index']),\n",
" 'head_type': d['head_type'],\n",
" 'tail_type': d['tail_type'],\n",
" }\n",
" return rels\n",
"\n",
"\n",
"def _add_relation_data(rels: Dict,data: List) -> None:\n",
" for d in data:\n",
" d['rel2idx'] = rels[d['relation']]['index']\n",
" d['head_type'] = rels[d['relation']]['head_type']\n",
" d['tail_type'] = rels[d['relation']]['tail_type']\n",
"\n",
"\n",
"def seq_len_to_mask(seq_len: Union[List, np.ndarray, torch.Tensor], max_len=None, mask_pos_to_true=True):\n",
" \"\"\"\n",
" 将一个表示sequence length的一维数组转换为二维的mask默认pad的位置为1。\n",
" 转变 1-d seq_len到2-d mask.\n",
"\n",
" :param list, np.ndarray, torch.LongTensor seq_len: shape将是(B,)\n",
" :param int max_len: 将长度pad到这个长度。默认(None)使用的是seq_len中最长的长度。但在nn.DataParallel的场景下可能不同卡的seq_len会有\n",
" 区别所以需要传入一个max_len使得mask的长度是pad到该长度。\n",
" :return: np.ndarray, torch.Tensor 。shape将是(B, max_length) 元素类似为bool或torch.uint8\n",
" \"\"\"\n",
" if isinstance(seq_len, list):\n",
" seq_len = np.array(seq_len)\n",
"\n",
" if isinstance(seq_len, np.ndarray):\n",
" seq_len = torch.from_numpy(seq_len)\n",
"\n",
" if isinstance(seq_len, torch.Tensor):\n",
" assert seq_len.dim() == 1, logger.error(f\"seq_len can only have one dimension, got {seq_len.dim()} != 1.\")\n",
" batch_size = seq_len.size(0)\n",
" max_len = int(max_len) if max_len else seq_len.max().long()\n",
" broad_cast_seq_len = torch.arange(max_len).expand(batch_size, -1).to(seq_len.device)\n",
" if mask_pos_to_true:\n",
" mask = broad_cast_seq_len.ge(seq_len.unsqueeze(1))\n",
" else:\n",
" mask = broad_cast_seq_len.lt(seq_len.unsqueeze(1))\n",
" else:\n",
" raise logger.error(\"Only support 1-d list or 1-d numpy.ndarray or 1-d torch.Tensor.\")\n",
"\n",
" return mask\n",
"\n",
"\n",
"def _lm_serialize(data: List[Dict], cfg):\n",
" logger.info('use bert tokenizer...')\n",
" tokenizer = BertTokenizer.from_pretrained(cfg.lm_file)\n",
" for d in data:\n",
" sent = d['sentence'].strip()\n",
" sent = sent.replace(d['head'], d['head_type'], 1).replace(d['tail'], d['tail_type'], 1)\n",
" sent += '[SEP]' + d['head'] + '[SEP]' + d['tail']\n",
" d['token2idx'] = tokenizer.encode(sent, add_special_tokens=True)\n",
" d['lens'] = len(d['token2idx'])"
]
},
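{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two small, optional illustrations of the helpers above. First, `seq_len_to_mask` turns a list of lengths into a padding mask (padded positions are `True` by default). Second, we show the string that `_lm_serialize` builds before tokenization: the head and tail mentions are replaced by their entity types and the raw mentions are appended after `[SEP]` markers. The entity-type strings used below are made up purely for illustration; the real types come from `relation.csv`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 1) seq_len_to_mask: padded positions are True by default.\n",
"print(seq_len_to_mask([2, 4], max_len=5))\n",
"\n",
"# 2) The sentence rewriting done inside _lm_serialize, on one example.\n",
"#    NOTE: '影视作品' and '人物' are hypothetical entity types, only for illustration.\n",
"d = {'sentence': '孔正锡在2005年以一部温馨的爱情电影《长腿叔叔》敲开电影界大门。',\n",
"     'head': '长腿叔叔', 'tail': '孔正锡', 'head_type': '影视作品', 'tail_type': '人物'}\n",
"sent = d['sentence'].replace(d['head'], d['head_type'], 1).replace(d['tail'], d['tail_type'], 1)\n",
"sent += '[SEP]' + d['head'] + '[SEP]' + d['tail']\n",
"print(sent)"
]
},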
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 预处理过程\n",
"logger.info('load raw files...')\n",
"train_fp = os.path.join('data/train.csv')\n",
"valid_fp = os.path.join('data/valid.csv')\n",
"test_fp = os.path.join('data/test.csv')\n",
"relation_fp = os.path.join('data/relation.csv')\n",
"\n",
"train_data = load_csv(train_fp)\n",
"valid_data = load_csv(valid_fp)\n",
"test_data = load_csv(test_fp)\n",
"relation_data = load_csv(relation_fp)\n",
"\n",
"for d in train_data:\n",
" d['tokens'] = eval(d['tokens'])\n",
"for d in valid_data:\n",
" d['tokens'] = eval(d['tokens'])\n",
"for d in test_data:\n",
" d['tokens'] = eval(d['tokens'])\n",
" \n",
"logger.info('convert relation into index...')\n",
"rels = _handle_relation_data(relation_data)\n",
"_add_relation_data(rels, train_data)\n",
"_add_relation_data(rels, valid_data)\n",
"_add_relation_data(rels, test_data)\n",
"\n",
"logger.info('verify whether use pretrained language models...')\n",
"\n",
"logger.info('use pretrained language models serialize sentence...')\n",
"_lm_serialize(train_data, cfg)\n",
"_lm_serialize(valid_data, cfg)\n",
"_lm_serialize(test_data, cfg)\n",
"\n",
"logger.info('save data for backup...')\n",
"os.makedirs(cfg.out_path, exist_ok=True)\n",
"train_save_fp = os.path.join(cfg.out_path, 'train.pkl')\n",
"valid_save_fp = os.path.join(cfg.out_path, 'valid.pkl')\n",
"test_save_fp = os.path.join(cfg.out_path, 'test.pkl')\n",
"save_pkl(train_data, train_save_fp)\n",
"save_pkl(valid_data, valid_save_fp)\n",
"save_pkl(test_data, test_save_fp)"
]
},
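{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally inspect one preprocessed training sample: after `_add_relation_data` and `_lm_serialize` each record carries the relation index (`rel2idx`), the BERT token ids (`token2idx`) and their length (`lens`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Peek at the first serialized training sample.\n",
"sample = train_data[0]\n",
"print(sample['sentence'])\n",
"print('relation:', sample['relation'], '-> index', sample['rel2idx'])\n",
"print('lens:', sample['lens'])\n",
"print('token2idx (first 20 ids):', sample['token2idx'][:20])"
]
},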
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# pytorch 构建自定义 Dataset\n",
"def collate_fn(cfg):\n",
" def collate_fn_intra(batch):\n",
" batch.sort(key=lambda data: int(data['lens']), reverse=True)\n",
" max_len = int(batch[0]['lens'])\n",
" \n",
" def _padding(x, max_len):\n",
" return x + [0] * (max_len - len(x))\n",
" \n",
" def _pad_adj(adj, max_len):\n",
" adj = np.array(adj)\n",
" pad_len = max_len - adj.shape[0]\n",
" for i in range(pad_len):\n",
" adj = np.insert(adj, adj.shape[-1], 0, axis=1)\n",
" for i in range(pad_len):\n",
" adj = np.insert(adj, adj.shape[0], 0, axis=0)\n",
" return adj\n",
" \n",
" x, y = dict(), []\n",
" word, word_len = [], []\n",
" head_pos, tail_pos = [], []\n",
" pcnn_mask = []\n",
" adj_matrix = []\n",
" for data in batch:\n",
" word.append(_padding(data['token2idx'], max_len))\n",
" word_len.append(int(data['lens']))\n",
" y.append(int(data['rel2idx']))\n",
" \n",
" if cfg.model_name != 'lm':\n",
" head_pos.append(_padding(data['head_pos'], max_len))\n",
" tail_pos.append(_padding(data['tail_pos'], max_len))\n",
" if cfg.model_name == 'gcn':\n",
" head = eval(data['dependency'])\n",
" adj = head_to_adj(head, directed=True, self_loop=True)\n",
" adj_matrix.append(_pad_adj(adj, max_len))\n",
"\n",
" if cfg.use_pcnn:\n",
" pcnn_mask.append(_padding(data['entities_pos'], max_len))\n",
"\n",
" x['word'] = torch.tensor(word)\n",
" x['lens'] = torch.tensor(word_len)\n",
" y = torch.tensor(y)\n",
" \n",
" if cfg.model_name != 'lm':\n",
" x['head_pos'] = torch.tensor(head_pos)\n",
" x['tail_pos'] = torch.tensor(tail_pos)\n",
" if cfg.model_name == 'gcn':\n",
" x['adj'] = torch.tensor(adj_matrix)\n",
" if cfg.model_name == 'cnn' and cfg.use_pcnn:\n",
" x['pcnn_mask'] = torch.tensor(pcnn_mask)\n",
"\n",
" return x, y\n",
" \n",
" return collate_fn_intra\n",
"\n",
"\n",
"class CustomDataset(Dataset):\n",
" \"\"\"默认使用 List 存储数据\"\"\"\n",
" def __init__(self, fp):\n",
" self.file = load_pkl(fp)\n",
"\n",
" def __getitem__(self, item):\n",
" sample = self.file[item]\n",
" return sample\n",
"\n",
" def __len__(self):\n",
" return len(self.file)"
]
},
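{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of `collate_fn`, we can build a throw-away `DataLoader` over the pickled training data written in the previous step and look at one collated batch: `word` is padded to the longest sequence in the batch and `lens` stores the true lengths."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect one collated batch.\n",
"_demo_loader = DataLoader(CustomDataset(train_save_fp), batch_size=2, shuffle=False, collate_fn=collate_fn(cfg))\n",
"_x, _y = next(iter(_demo_loader))\n",
"print(_x['word'].shape)  # (batch_size, max_len in this batch)\n",
"print(_x['lens'])        # true lengths, sorted in descending order\n",
"print(_y)                # relation indices"
]
},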
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 预训练语言模型\n",
"class PretrainLM(nn.Module):\n",
" def __init__(self, cfg):\n",
" super(PretrainLM, self).__init__()\n",
" self.num_layers = cfg.rnn_layers\n",
" self.lm = BertModel.from_pretrained(cfg.lm_file, num_hidden_layers=cfg.lm_num_hidden_layers)\n",
" self.bilstm = nn.LSTM(768,10,batch_first=True,bidirectional=True,num_layers=cfg.rnn_layers,dropout=cfg.dropout)\n",
" self.fc = nn.Linear(20, cfg.num_relations)\n",
"\n",
" def forward(self, x):\n",
" N = self.num_layers\n",
" word, lens = x['word'], x['lens']\n",
" B = word.size(0)\n",
" output, pooler_output = self.lm(word)\n",
" output = pack_padded_sequence(output, lens, batch_first=True, enforce_sorted=True)\n",
" _, (output,_) = self.bilstm(output)\n",
" output = output.view(N, 2, B, 10).transpose(1, 2).contiguous().view(N, B, 20).transpose(0, 1)\n",
" output = output[:,-1,:]\n",
" output = self.fc(output)\n",
" \n",
" return output"
]
},
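{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reshaping in `forward` can look opaque, so below is a standalone shape check of the BiLSTM head alone (no BERT download needed): a random tensor stands in for BERT's `(B, L, 768)` output. The final hidden state `h_n` has shape `(num_layers * 2, B, 10)`; the view/transpose chain rearranges it to `(B, num_layers, 20)`, and `[:, -1, :]` keeps the last layer's concatenated forward and backward states, which feed the linear classifier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Standalone sketch of the BiLSTM reshaping used in PretrainLM.forward.\n",
"B, L, N = 2, 7, cfg.rnn_layers\n",
"fake_bert_out = torch.randn(B, L, 768)  # stands in for BERT's last hidden states\n",
"bilstm = nn.LSTM(768, 10, batch_first=True, bidirectional=True, num_layers=N, dropout=cfg.dropout)\n",
"_, (h_n, _) = bilstm(fake_bert_out)\n",
"print(h_n.shape)  # (N * 2, B, 10)\n",
"out = h_n.view(N, 2, B, 10).transpose(1, 2).contiguous().view(N, B, 20).transpose(0, 1)\n",
"print(out.shape)  # (B, N, 20); out[:, -1, :] goes into the fully connected layer"
]
},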
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# p,r,f1 指标测量\n",
"class PRMetric():\n",
" def __init__(self):\n",
" \"\"\"\n",
" 暂时调用 sklearn 的方法\n",
" \"\"\"\n",
" self.y_true = np.empty(0)\n",
" self.y_pred = np.empty(0)\n",
"\n",
" def reset(self):\n",
" self.y_true = np.empty(0)\n",
" self.y_pred = np.empty(0)\n",
"\n",
" def update(self, y_true:torch.Tensor, y_pred:torch.Tensor):\n",
" y_true = y_true.cpu().detach().numpy()\n",
" y_pred = y_pred.cpu().detach().numpy()\n",
" y_pred = np.argmax(y_pred,axis=-1)\n",
"\n",
" self.y_true = np.append(self.y_true, y_true)\n",
" self.y_pred = np.append(self.y_pred, y_pred)\n",
"\n",
" def compute(self):\n",
" p, r, f1, _ = precision_recall_fscore_support(self.y_true,self.y_pred,average='macro',warn_for=tuple())\n",
" _, _, acc, _ = precision_recall_fscore_support(self.y_true,self.y_pred,average='micro',warn_for=tuple())\n",
"\n",
" return acc,p,r,f1"
]
},
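{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tiny check of `PRMetric` on a toy batch of two samples, where only the first prediction is correct, so accuracy (the micro average) comes out as 0.5."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy example: logits over 4 relations, argmax picks class 0 for both samples.\n",
"m = PRMetric()\n",
"m.update(y_true=torch.tensor([0, 1]),\n",
"         y_pred=torch.tensor([[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0]]))\n",
"print(m.compute())  # (acc, p, r, f1); acc = 0.5 here"
]
},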
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 训练过程中的迭代\n",
"def train(epoch, model, dataloader, optimizer, criterion, cfg):\n",
" model.train()\n",
"\n",
" metric = PRMetric()\n",
" losses = []\n",
"\n",
" for batch_idx, (x, y) in enumerate(dataloader, 1):\n",
" optimizer.zero_grad()\n",
" y_pred = model(x)\n",
" loss = criterion(y_pred, y)\n",
"\n",
" loss.backward()\n",
" optimizer.step()\n",
"\n",
" metric.update(y_true=y, y_pred=y_pred)\n",
" losses.append(loss.item())\n",
"\n",
" data_total = len(dataloader.dataset)\n",
" data_cal = data_total if batch_idx == len(dataloader) else batch_idx * len(y)\n",
" if (cfg.train_log and batch_idx % cfg.log_interval == 0) or batch_idx == len(dataloader):\n",
" # p r f1 皆为 macro因为micro时三者相同定义为acc\n",
" acc,p,r,f1 = metric.compute()\n",
" print(f'Train Epoch {epoch}: [{data_cal}/{data_total} ({100. * data_cal / data_total:.0f}%)]\\t'\n",
" f'Loss: {loss.item():.6f}')\n",
" print(f'Train Epoch {epoch}: Acc: {100. * acc:.2f}%\\t'\n",
" f'macro metrics: [p: {p:.4f}, r:{r:.4f}, f1:{f1:.4f}]')\n",
"\n",
" if cfg.show_plot and not cfg.only_comparison_plot:\n",
" if cfg.plot_utils == 'matplot':\n",
" plt.plot(losses)\n",
" plt.title(f'epoch {epoch} train loss')\n",
" plt.show()\n",
"\n",
" return losses[-1]\n",
"\n",
"\n",
"# 测试过程中的迭代\n",
"def validate(epoch, model, dataloader, criterion,verbose=True):\n",
" model.eval()\n",
"\n",
" metric = PRMetric()\n",
" losses = []\n",
"\n",
" for batch_idx, (x, y) in enumerate(dataloader, 1):\n",
" with torch.no_grad():\n",
" y_pred = model(x)\n",
" loss = criterion(y_pred, y)\n",
"\n",
" metric.update(y_true=y, y_pred=y_pred)\n",
" losses.append(loss.item())\n",
"\n",
" loss = sum(losses) / len(losses)\n",
" acc,p,r,f1 = metric.compute()\n",
" data_total = len(dataloader.dataset)\n",
" if verbose:\n",
" print(f'Valid Epoch {epoch}: [{data_total}/{data_total}](100%)\\t Loss: {loss:.6f}')\n",
" print(f'Valid Epoch {epoch}: Acc: {100. * acc:.2f}%\\tmacro metrics: [p: {p:.4f}, r:{r:.4f}, f1:{f1:.4f}]\\n\\n')\n",
"\n",
" return f1,loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 加载数据集\n",
"train_dataset = CustomDataset(train_save_fp)\n",
"valid_dataset = CustomDataset(valid_save_fp)\n",
"test_dataset = CustomDataset(test_save_fp)\n",
"\n",
"train_dataloader = DataLoader(train_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))\n",
"valid_dataloader = DataLoader(valid_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))\n",
"test_dataloader = DataLoader(test_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# main 入口定义优化函数、loss函数等\n",
"# 开始epoch迭代\n",
"# 使用valid 数据集的loss做早停判断当不再下降时此时为模型泛化性最好的时刻。\n",
"model = PretrainLM(cfg)\n",
"print(model)\n",
"\n",
"optimizer = optim.Adam(model.parameters(), lr=cfg.learning_rate, weight_decay=cfg.weight_decay)\n",
"scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=cfg.lr_factor, patience=cfg.lr_patience)\n",
"criterion = nn.CrossEntropyLoss()\n",
"\n",
"best_f1, best_epoch = -1, 0\n",
"es_loss, es_f1, es_epoch, es_patience, best_es_epoch, best_es_f1, = 1000, -1, 0, 0, 0, -1\n",
"train_losses, valid_losses = [], []\n",
"\n",
"logger.info('=' * 10 + ' Start training ' + '=' * 10)\n",
"for epoch in range(1, cfg.epoch + 1):\n",
" train_loss = train(epoch, model, train_dataloader, optimizer, criterion, cfg)\n",
" valid_f1, valid_loss = validate(epoch, model, valid_dataloader, criterion)\n",
" scheduler.step(valid_loss)\n",
"\n",
" train_losses.append(train_loss)\n",
" valid_losses.append(valid_loss)\n",
" if best_f1 < valid_f1:\n",
" best_f1 = valid_f1\n",
" best_epoch = epoch\n",
" # 使用 valid loss 做 early stopping 的判断标准\n",
" if es_loss > valid_loss:\n",
" es_loss = valid_loss\n",
" es_f1 = valid_f1\n",
" best_es_f1 = valid_f1\n",
" es_epoch = epoch\n",
" best_es_epoch = epoch\n",
" es_patience = 0\n",
" else:\n",
" es_patience += 1\n",
" if es_patience >= cfg.early_stopping_patience:\n",
" best_es_epoch = es_epoch\n",
" best_es_f1 = es_f1\n",
"\n",
"if cfg.show_plot:\n",
" if cfg.plot_utils == 'matplot':\n",
" plt.plot(train_losses, 'x-')\n",
" plt.plot(valid_losses, '+-')\n",
" plt.legend(['train', 'valid'])\n",
" plt.title('train/valid comparison loss')\n",
" plt.show()\n",
"\n",
"\n",
"print(f'best(valid loss quota) early stopping epoch: {best_es_epoch}, '\n",
" f'this epoch macro f1: {best_es_f1:0.4f}')\n",
"print(f'total {cfg.epoch} epochs, best(valid macro f1) epoch: {best_epoch}, '\n",
" f'this epoch macro f1: {best_f1:.4f}')\n",
"\n",
"test_f1, _ = validate(0, model, test_dataloader, criterion,verbose=False)\n",
"print(f'after {cfg.epoch} epochs, final test data macro f1: {test_f1:.4f}')"
]
},
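{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a final optional sanity check, we can run the trained model on a single test sample and map the predicted index back to a relation name through the `rels` dictionary built during preprocessing. This is only a sketch of inference; the full DeepKE repository provides a more complete prediction workflow."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Predict the relation for the first test sample.\n",
"idx2rel = {v['index']: k for k, v in rels.items()}\n",
"sample = test_data[0]\n",
"x = {'word': torch.tensor([sample['token2idx']]), 'lens': torch.tensor([sample['lens']])}\n",
"model.eval()\n",
"with torch.no_grad():\n",
"    pred = model(x).argmax(-1).item()\n",
"print(sample['sentence'])\n",
"print('gold:', sample['relation'], '| predicted:', idx2rel[pred])"
]
},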
{
"cell_type": "markdown",
"metadata": {},
"source": [
"本demo不包括调参部分有兴趣的同学可以自行前往 [deepke](http://openkg.cn/tool/deepke) 仓库,下载使用更多的模型 :)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}