Compare commits
No commits in common. "main" and "deprecated-tensorflow" have entirely different histories.
@ -1,76 +0,0 @@
# Contributor Covenant Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at . All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
@ -1,22 +0,0 @@
# Contributing
Welcome to the DeepKE community! We're building relation extraction toolkits for research.

## Simple Internal Code
It should be easy for users to look at the code and quickly understand what's happening — many users won't be engineers. We therefore value clear, simple code over condensed ninja moves. While those are super cool, this isn't the project for that :)

## Contribution Types
We are currently looking for help implementing new features and fixing bugs.

## Bug Fixes
1. Submit a GitHub issue.
2. Fix it.
3. Submit a PR!

## New Features
1. Submit a GitHub issue.
2. We'll agree on the feature scope.
3. Submit a PR!

## Coding Styleguide
1. Check the code with flake8.
2. Use f-strings.
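To make the styleguide concrete, here is a minimal sketch (the names are invented for illustration):

```python
# Hypothetical snippet illustrating the styleguide; the names are made up.
task, f1 = "relation extraction", 0.912

# Preferred: f-strings.
print(f"{task}: F1 = {f1:.3f}")

# Avoid: older %-formatting or str.format().
print("%s: F1 = %.3f" % (task, f1))
```

Running `flake8` on the changed files before opening the PR catches style issues early.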
@ -1,28 +0,0 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: 'bug'
assignees: ''

---

**Describe the bug**
> A clear and concise description of what the bug is.

**Environment (please complete the following information):**
- OS: [e.g. macOS / Windows]
- Python Version [e.g. 3.6]

**Screenshots**
> If applicable, add screenshots to help explain your problem.

**Additional context**
> Add any other context about the problem here.
@ -1,28 +0,0 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: 'enhancement'
assignees: ''

---

**Describe the feature**
> A clear and concise description of the feature you'd like.

**Environment (please complete the following information):**
- OS: [e.g. macOS / Windows]
- Python Version [e.g. 3.6]

**Screenshots**
> If applicable, add screenshots to help explain your request.

**Additional context**
> Add any other context about the request here.
@ -1,28 +0,0 @@
---
name: Question consult
about: Ask any other question
title: ''
labels: 'question'
assignees: ''

---

**Describe the question**
> A clear and concise description of what the question is.

**Environment (please complete the following information):**
- OS: [e.g. macOS / Windows]
- Python Version [e.g. 3.6]

**Screenshots**
> If applicable, add screenshots to help explain your question.

**Additional context**
> Add any other context about the question here.
@ -1,19 +0,0 @@
.DS_Store

.idea
.vscode

__pycache__
*.pyc

test/.pytest_cache

data/out

logs
checkpoints

demo.py

otherUtils.py
module/Transformer_offical.py
CITATION.cff
@ -1,80 +0,0 @@
cff-version: "1.0.0"
message: "If you use this toolkit, please cite it using these metadata."
title: "deepke"
repository-code: "https://github.com/zjunlp/DeepKE"
authors:
  - family-names: Zhang
    given-names: Ningyu
  - family-names: Xu
    given-names: Xin
  - family-names: Tao
    given-names: Liankuan
  - family-names: Yu
    given-names: Haiyang
  - family-names: Ye
    given-names: Hongbin
  - family-names: Xie
    given-names: Xin
  - family-names: Chen
    given-names: Xiang
  - family-names: Li
    given-names: Zhoubo
  - family-names: Li
    given-names: Lei
  - family-names: Liang
    given-names: Xiaozhuan
  - family-names: Yao
    given-names: Yunzhi
  - family-names: Deng
    given-names: Shumin
  - family-names: Zhang
    given-names: Zhenru
  - family-names: Tan
    given-names: Chuanqi
  - family-names: Huang
    given-names: Fei
  - family-names: Zheng
    given-names: Guozhou
  - family-names: Chen
    given-names: Huajun
preferred-citation:
  type: article
  title: "DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population"
  authors:
    - family-names: Zhang
      given-names: Ningyu
    - family-names: Xu
      given-names: Xin
    - family-names: Tao
      given-names: Liankuan
    - family-names: Yu
      given-names: Haiyang
    - family-names: Ye
      given-names: Hongbin
    - family-names: Xie
      given-names: Xin
    - family-names: Chen
      given-names: Xiang
    - family-names: Li
      given-names: Zhoubo
    - family-names: Li
      given-names: Lei
    - family-names: Liang
      given-names: Xiaozhuan
    - family-names: Yao
      given-names: Yunzhi
    - family-names: Deng
      given-names: Shumin
    - family-names: Zhang
      given-names: Zhenru
    - family-names: Tan
      given-names: Chuanqi
    - family-names: Huang
      given-names: Fei
    - family-names: Zheng
      given-names: Guozhou
    - family-names: Chen
      given-names: Huajun
  journal: "http://arxiv.org/abs/2201.03335"
  year: 2022
LICENSE
@ -1,21 +0,0 @@
MIT License

Copyright (c) 2021 ZJUNLP

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
@ -1,441 +1,57 @@
<p align="center">
<a href="https://github.com/zjunlp/deepke"> <img src="pics/logo.png" width="400"/></a>
</p>
<p align="center">
<a href="http://deepke.zjukg.cn">
<img alt="Documentation" src="https://img.shields.io/badge/demo-website-blue">
</a>
<a href="https://pypi.org/project/deepke/#files">
<img alt="PyPI" src="https://img.shields.io/pypi/v/deepke">
</a>
<a href="https://github.com/zjunlp/DeepKE/blob/master/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/zjunlp/deepke">
</a>
<a href="http://zjunlp.github.io/DeepKE">
<img alt="Documentation" src="https://img.shields.io/badge/doc-website-red">
</a>
<a href="https://colab.research.google.com/drive/1vS8YJhJltzw3hpJczPt24O0Azcs3ZpRi?usp=sharing">
<img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg">
</a>
</p>
<p align="center">
<b> English | <a href="https://github.com/zjunlp/DeepKE/blob/main/README_CN.md">简体中文</a> </b>
</p>

<h1 align="center">
<p>A Deep Learning Based Knowledge Extraction Toolkit<br>for Knowledge Base Population</p>
</h1>

# deepke

## Data Preparation

File (source) | Sample
---- | ----
com2abbr.txt (raw) | 沙河实业股份有限公司 沙河股份
stock.sql (raw) | [stock_code, ChiName, ChiNameAbbr]<br />('000001', '平安银行股份有限公司', '平安银行');
rel_per_com.txt (raw) | 非独立董事 刘旭 湖南博云新材料股份有限公司
kg_company_management.sql (raw) | [stock_code, manager_name, manager_position…] (main fields)
per_com.txt (raw) | 董事 深圳中国农大科技股份有限公司 刘多宏
rel_per_com.txt (generated from the two files above) | 非独立董事 刘旭 湖南博云新材料股份有限公司

Run `get_initial_sample()` in `preprocess.py`. It mainly covers:

* **Initial cleanup of the raw text**</br>
  including removing unnecessary symbols, replacing numbers with NUM, etc.
* **Organizing the distant-supervision data**</br>
  collect the position-related data — all persons and companies, stored in *per_pool* and *com_pool*, plus the triples with position relations in *rel_per_com*
* **Initial sampling**</br>
  sample via distant supervision. The current setting traverses all sentences: if two entities appear in a sentence and the pair has a relation in *rel_per_com*, the sentence is labeled positive; entity pairs not in *rel_per_com* are labeled negative (a rule sketch follows this list)
* **Rule-based filtering of the data**</br>
  **Noise sources:**
  * noise in the distant-supervision source itself, e.g. a person named "智慧"
  * one person holding several positions
    * e.g. for the sentence "B为A的实际控股人" ("B is the actual controlling shareholder of A"), if *rel_per_com* contains 「A,B,董事长」 the sentence gets labeled a positive "chairman" sample
  * conflicts between the static distant-supervision source and position relations that change over time
    * for "A曾任B的董事长" ("A was formerly the chairman of B"), 「A,B,董事长」 in *rel_per_com* makes the sentence a positive "chairman" sample
    * for "任命A为B的总裁" ("A is appointed president of B"), if the only relation for 「A,B」 in *rel_per_com* is 「A,B,副总裁」 (vice president), the sentence gets labeled a positive "vice president" sample
    * for "任命A为B的总裁", if 「A,B」 appears in the sentence but *rel_per_com* holds no position information for 「A,B」 at all, the sentence gets labeled negative
  * ...
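To make the sampling rule concrete, here is a minimal sketch of the distant-supervision labeling step described above — illustrative only; the names *per_pool*, *com_pool* and *rel_per_com* follow the text, but the actual `preprocess.py` implementation may differ:

```python
# Hedged sketch of the distant-supervision sampling rule (not the actual
# preprocess.py code): a sentence is a positive sample for a position
# relation when a (person, company) pair in it is backed by rel_per_com.

rel_per_com = {("刘旭", "湖南博云新材料股份有限公司"): "非独立董事"}
per_pool = {"刘旭"}
com_pool = {"湖南博云新材料股份有限公司"}

def label_sentence(sentence: str):
    """Return (person, company, position, is_positive) tuples for a sentence."""
    samples = []
    for person in per_pool:
        for company in com_pool:
            if person in sentence and company in sentence:
                position = rel_per_com.get((person, company))
                if position is not None:
                    samples.append((person, company, position, True))   # positive
                else:
                    samples.append((person, company, None, False))      # negative
    return samples

print(label_sentence("非独立董事刘旭在湖南博云新材料股份有限公司任职。"))
```

The noise sources listed above are exactly the cases where this naive rule mislabels sentences, hence the rule-based filtering step.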
DeepKE is a knowledge extraction toolkit supporting **low-resource** and **document-level** scenarios for *entity*, *relation* and *attribute* extraction. We provide [comprehensive documents](https://zjunlp.github.io/DeepKE/), [Google Colab tutorials]() and an [online demo](http://deepke.zjukg.cn/) for beginners.

<br>

# Table of Contents

* [What's New](#whats-new)
* [Prediction Demo](#prediction-demo)
* [Model Framework](#model-framework)
* [Quick Start](#quick-start)
  * [Requirements](#requirements)
  * [Introduction of Three Functions](#introduction-of-three-functions)
    * [1. Named Entity Recognition](#1-named-entity-recognition)
    * [2. Relation Extraction](#2-relation-extraction)
    * [3. Attribute Extraction](#3-attribute-extraction)
* [Notebook Tutorial](#notebook-tutorial)
* [Tips](#tips)
* [To do](#to-do)
* [Citation](#citation)
* [Developers](#developers)

<br>

# What's New

## Jan, 2022
* We have released the paper [DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population](https://arxiv.org/abs/2201.03335)
## Dec, 2021
* We have added a `dockerfile` to create the environment automatically.
## Nov, 2021
* The demo of DeepKE, supporting real-time extraction without deployment or training, has been released.
* The documentation of DeepKE, containing details such as the source code and datasets, has been released.
## Oct, 2021
* `pip install deepke`
* The code of deepke-v2.0 has been released.
## August, 2020
* The code of deepke-v1.0 has been released.

<br>

# Prediction Demo

Here is a demonstration of prediction.<br>
<img src="pics/demo.gif" width="636" height="494" align=center>

<br>

# Model Framework

<h3 align="center">
<img src="pics/architectures.png">
</h3>

- DeepKE contains a unified framework for the three knowledge extraction functions: **named entity recognition**, **relation extraction** and **attribute extraction**.
- Each task can be run in different scenarios. For example, relation extraction works in the **standard**, **low-resource (few-shot)** and **document-level** settings.
- Each application scenario comprises three components: **Data** (Tokenizer, Preprocessor, Loader), **Model** (Module, Encoder, Forwarder) and **Core** (Training, Evaluation, Prediction).

<br>

# Quick Start

*DeepKE* supports `pip install deepke`. <br>Take fully supervised relation extraction as an example.

**Step1** Download the basic code

```bash
git clone https://github.com/zjunlp/DeepKE.git
```

**Step2** Create a virtual environment using `Anaconda` and enter it.<br>

We also provide Dockerfile source code, located in the `docker` folder, to help users build their own images.

```bash
conda create -n deepke python=3.8

conda activate deepke
```
1. Install *DeepKE* with source code

```bash
python setup.py install

python setup.py develop
```

**Positive-sample filtering:**</br>
* the relation keyword must appear in the sentence, to handle one person holding multiple positions

**Negative-sample filtering:**</br>
* a regular expression recognizes sentences of the form "A的董事长B" ("B, the chairman of A") and re-labels them as positive samples
* noise from the distant-supervision source itself: e.g. *per_pool* contains both 「周建灿」 and 「周建」, so for the sentence "金盾董事长周建灿" direct entity linking would produce 「金盾董事,周建,董事长」
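A minimal sketch of the regex re-labeling rule above; the pattern, the position list and the test sentence are hypothetical stand-ins, not the project's actual filtering code:

```python
import re

# Hedged sketch: re-label sentences of the form "A的董事长B" as positive
# samples, as the negative-sample filter describes. Real code would also
# need name-boundary handling to avoid the 周建/周建灿 linking noise above.
PATTERN = re.compile(r"(?P<company>\S+?)的(?P<position>董事长|总裁|副总裁)(?P<person>\S{2,3})")

def relabel(sentence: str):
    """Return a (person, company, position) triple if the pattern matches."""
    m = PATTERN.search(sentence)
    if m:
        return m.group("person"), m.group("company"), m.group("position")
    return None

print(relabel("金盾的董事长周建灿出席了会议"))  # ('周建灿', '金盾', '董事长')
```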
2. Install *DeepKE* with `pip`

```bash
pip install deepke
```

**Step3** Enter the task directory

```bash
cd DeepKE/example/re/standard
```

## Training
* Run `train_preprocess()` in `preprocess.py` to generate the training data
* Run `python train.py`; the model is saved under `../model`
* Parameters are configured in `config.py`, including *GPU_ID*, *learning_rate*, etc.

## Testing
* Run `predict_preprocess()` in `preprocess.py` to generate data the model can consume
* Run `python test.py`; results are saved under `../result`

## Model
Considering experimental performance, multiple binary classification models are currently used.</br>
Reference: [Lin et al. (2017)](http://www.aclweb.org/anthology/D15-1203).</br>
Input: the sentence, the two entities with their position information, and an auxiliary sequence used to handle coordinate clauses</br>
Output: whether the corresponding relation holds</br>
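A hedged PyTorch illustration of what one such per-relation binary classifier's interface might look like — a sketch assuming a simple LSTM encoder, not the deprecated branch's actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of one binary relation classifier (one per position
# relation): encode the sentence plus entity position features and output
# the probability that the relation holds.
class BinaryRelationClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # +2 for the relative positions of the two entities at each token.
        self.encoder = nn.LSTM(embed_dim + 2, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, token_ids, pos_head, pos_tail):
        x = self.embed(token_ids)                                # (B, T, E)
        pos = torch.stack([pos_head, pos_tail], dim=-1).float()  # (B, T, 2)
        _, (h, _) = self.encoder(torch.cat([x, pos], dim=-1))
        return torch.sigmoid(self.out(h[-1])).squeeze(-1)        # P(relation)

model = BinaryRelationClassifier(vocab_size=5000)
B, T = 2, 20
probs = model(torch.randint(0, 5000, (B, T)),
              torch.randint(-T, T, (B, T)),
              torch.randint(-T, T, (B, T)))
print(probs.shape)  # torch.Size([2])
```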
**Step4** Download the dataset

```bash
wget 120.27.214.45/Data/re/standard/data.tar.gz

tar -xzvf data.tar.gz
```

**Step5** Training (parameters for training can be changed in the `conf` folder)

We support visual parameter tuning via *wandb*.

```bash
python run.py
```
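For readers unfamiliar with *wandb*, the logging pattern is roughly the following generic sketch — not DeepKE's actual `run.py`; the project name and metrics are made up:

```python
import wandb

# Generic wandb sketch (hypothetical names): initialize a run, log metrics
# each epoch, then inspect the curves in the wandb web UI for tuning.
wandb.init(project="deepke-re-standard", config={"lr": 1e-3, "epochs": 50})

for epoch in range(wandb.config.epochs):
    train_loss, dev_f1 = 0.0, 0.0  # placeholders for real train/eval steps
    wandb.log({"epoch": epoch, "train_loss": train_loss, "dev_f1": dev_f1})

wandb.finish()
```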
**Step6** Prediction (parameters for prediction can be changed in the `conf` folder)

Modify the path of the trained model in `predict.yaml`.

```bash
python predict.py
```
## Requirements

> python == 3.8

- torch == 1.5
- hydra-core == 1.0.6
- tensorboard == 2.4.1
- matplotlib == 3.4.1
- transformers == 3.4.0
- jieba == 0.42.1
- scikit-learn == 0.24.1
- pytorch-transformers == 1.2.0
- seqeval == 1.2.2
- tqdm == 4.60.0
- opt-einsum == 3.3.0
- wandb == 0.12.7
- ujson
## Introduction of Three Functions

### 1. Named Entity Recognition

- Named entity recognition seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations and locations.

- The data is stored in `.txt` files. Some instances follow:

| Sentence | Person | Location | Organization |
| :---: | :---: | :---: | :---: |
| 本报北京9月4日讯记者杨涌报道:部分省区人民日报宣传发行工作座谈会9月3日在4日在京举行。 | 杨涌 | 北京 | 人民日报 |
| 《红楼梦》是中央电视台和中国电视剧制作中心根据中国古典文学名著《红楼梦》摄制于1987年的一部古装连续剧,由王扶林导演,周汝昌、王蒙、周岭等多位红学家参与制作。 | 王扶林,周汝昌,王蒙,周岭 | 中国 | 中央电视台,中国电视剧制作中心 |
| 秦始皇兵马俑位于陕西省西安市,1961年被国务院公布为第一批全国重点文物保护单位,是世界八大奇迹之一。 | 秦始皇 | 陕西省,西安市 | 国务院 |
- Read the detailed process in the task-specific README
- **[STANDARD (Fully Supervised)](https://github.com/zjunlp/DeepKE/tree/main/example/ner/standard)**

  **Step1** Enter `DeepKE/example/ner/standard`. Download the dataset.

  ```bash
  wget 120.27.214.45/Data/ner/standard/data.tar.gz

  tar -xzvf data.tar.gz
  ```

  **Step2** Training<br>

  The dataset and parameters can be customized in the `data` and `conf` folders respectively.

  ```bash
  python run.py
  ```

  **Step3** Prediction

  ```bash
  python predict.py
  ```

- **[FEW-SHOT](https://github.com/zjunlp/DeepKE/tree/main/example/ner/few-shot)**

  **Step1** Enter `DeepKE/example/ner/few-shot`. Download the dataset.

  ```bash
  wget 120.27.214.45/Data/ner/few_shot/data.tar.gz

  tar -xzvf data.tar.gz
  ```

  **Step2** Training in the low-resource setting<br>

  The directory where the model is loaded and saved, as well as the configuration parameters, can be customized in the `conf` folder.

  ```bash
  python run.py +train=few_shot
  ```

  Users can modify `load_path` in `conf/train/few_shot.yaml` to start from an existing model.<br>

  **Step3** Add `- predict` to `conf/config.yaml`, set `load_path` to the model path and `write_path` to the path where predicted results are saved in `conf/predict.yaml`, and then run

  ```bash
  python predict.py
  ```
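The `+train=few_shot` and `- predict` mechanics above come from Hydra (`hydra-core` is listed in the requirements). As a hedged illustration of the pattern — not DeepKE's actual `run.py` — a Hydra entry point looks roughly like this; the override merges `conf/train/few_shot.yaml` into the composed config:

```python
import hydra
from omegaconf import DictConfig

# Hedged Hydra sketch (not DeepKE's actual code): running
#   python run.py +train=few_shot
# composes conf/config.yaml with conf/train/few_shot.yaml under cfg.train.
@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # Keys such as cfg.train.load_path come from the YAML files in conf/.
    print(cfg)

if __name__ == "__main__":
    main()
```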
### 2. Relation Extraction

- Relation extraction is the task of extracting semantic relations between entities from an unstructured text.

- The data is stored in `.csv` files. Some instances follow (a loading sketch appears at the end of this subsection):

| Sentence | Relation | Head | Head_offset | Tail | Tail_offset |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 《岳父也是爹》是王军执导的电视剧,由马恩然、范明主演。 | 导演 | 岳父也是爹 | 1 | 王军 | 8 |
| 《九玄珠》是在纵横中文网连载的一部小说,作者是龙马。 | 连载网站 | 九玄珠 | 1 | 纵横中文网 | 7 |
| 提起杭州的美景,西湖总是第一个映入脑海的词语。 | 所在城市 | 西湖 | 8 | 杭州 | 2 |
- Read the detailed process in the task-specific README

- **[STANDARD (Fully Supervised)](https://github.com/zjunlp/DeepKE/tree/main/example/re/standard)**

  **Step1** Enter the `DeepKE/example/re/standard` folder. Download the dataset.

  ```bash
  wget 120.27.214.45/Data/re/standard/data.tar.gz

  tar -xzvf data.tar.gz
  ```

  **Step2** Training<br>

  The dataset and parameters can be customized in the `data` and `conf` folders respectively.

  ```bash
  python run.py
  ```

  **Step3** Prediction

  ```bash
  python predict.py
  ```

- **[FEW-SHOT](https://github.com/zjunlp/DeepKE/tree/main/example/re/few-shot)**

  **Step1** Enter `DeepKE/example/re/few-shot`. Download the dataset.

  ```bash
  wget 120.27.214.45/Data/re/few_shot/data.tar.gz

  tar -xzvf data.tar.gz
  ```

  **Step 2** Training<br>

  - The dataset and parameters can be customized in the `data` and `conf` folders respectively.
  - To resume from the last trained model, set `train_from_saved_model` in `conf/train.yaml` to the path where that model was saved. The directory for logs generated during training can be customized via `log_dir`.

  ```bash
  python run.py
  ```

  **Step3** Prediction

  ```bash
  python predict.py
  ```

- **[DOCUMENT](https://github.com/zjunlp/DeepKE/tree/main/example/re/document)**<br>

  **Step1** Enter `DeepKE/example/re/document`. Download the dataset.

  ```bash
  wget 120.27.214.45/Data/re/document/data.tar.gz

  tar -xzvf data.tar.gz
  ```

  **Step2** Training<br>

  - The dataset and parameters can be customized in the `data` and `conf` folders respectively.
  - To resume from the last trained model, set `train_from_saved_model` in `conf/train.yaml` to the path where that model was saved. The directory for logs generated during training can be customized via `log_dir`.

  ```bash
  python run.py
  ```

  **Step3** Prediction

  ```bash
  python predict.py
  ```
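As referenced above, here is a minimal sketch of reading the relation-extraction `.csv`. The column names come from the table; the file name is an assumption:

```python
import csv

# Hedged sketch of loading the RE data shown above. "train.csv" is assumed;
# the columns (Sentence, Relation, Head, Head_offset, Tail, Tail_offset)
# follow the table, with offsets as character positions in the sentence.
with open("train.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        sentence, head = row["Sentence"], row["Head"]
        h_off = int(row["Head_offset"])
        assert sentence[h_off:h_off + len(head)] == head
        print(row["Relation"], head, row["Tail"])
```

The attribute-extraction `.csv` in the next subsection follows the same pattern, with `Att`, `Ent`/`Ent_offset` and `Val`/`Val_offset` columns.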
### 3. Attribute Extraction

- Attribute extraction extracts attributes for entities in an unstructured text.

- The data is stored in `.csv` files. Some instances follow:

| Sentence | Att | Ent | Ent_offset | Val | Val_offset |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 张冬梅,女,汉族,1968年2月生,河南淇县人 | 民族 | 张冬梅 | 0 | 汉族 | 6 |
| 诸葛亮,字孔明,三国时期杰出的军事家、文学家、发明家。 | 朝代 | 诸葛亮 | 0 | 三国时期 | 8 |
| 2014年10月1日许鞍华执导的电影《黄金时代》上映 | 上映时间 | 黄金时代 | 19 | 2014年10月1日 | 0 |
- Read the detailed process in the task-specific README
- **[STANDARD (Fully Supervised)](https://github.com/zjunlp/DeepKE/tree/main/example/ae/standard)**

  **Step1** Enter the `DeepKE/example/ae/standard` folder. Download the dataset.

  ```bash
  wget 120.27.214.45/Data/ae/standard/data.tar.gz

  tar -xzvf data.tar.gz
  ```

  **Step2** Training<br>

  The dataset and parameters can be customized in the `data` and `conf` folders respectively.

  ```bash
  python run.py
  ```

  **Step3** Prediction

  ```bash
  python predict.py
  ```
<br>

# Notebook Tutorial

This toolkit provides many `Jupyter Notebook` and `Google Colab` tutorials. Users can study *DeepKE* with them.

- Standard Setting<br>

  [NER Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/ner/standard/standard_ner_tutorial.ipynb)

  [NER Colab](https://colab.research.google.com/drive/1h4k6-_oNEHBRxrnzpxHPczO5SFaLS9uq?usp=sharing)

  [RE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/re/standard/standard_re_pcnn_tutorial.ipynb)

  [RE Colab](https://colab.research.google.com/drive/1o6rKIxBqrGZNnA2IMXqiSsY2GWANAZLl?usp=sharing)

  [AE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/ae/standard/standard_ae_tutorial.ipynb)

  [AE Colab](https://colab.research.google.com/drive/1pgPouEtHMR7L9Z-QfG1sPYkJfrtRt8ML)

- Low-resource<br>

  [NER Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/ner/few-shot/fewshot_ner_tutorial.ipynb)

  [NER Colab](https://colab.research.google.com/drive/1Xz0sNpYQNbkjhebCG5djrwM8Mj2Crj7F?usp=sharing)

  [RE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/re/few-shot/fewshot_re_tutorial.ipynb)

  [RE Colab](https://colab.research.google.com/drive/1o1ly6ORgerkm1fCDjEQb7hsN5WKyg3JH?usp=sharing)

- Document-level<br>

  [RE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/re/document/document_re_tutorial.ipynb)

  [RE Colab](https://colab.research.google.com/drive/1RGUBbbOBHlWJ1NXQLtP_YEUktntHtROa?usp=sharing)

<br>

# Tips

1. Using the nearest mirror, e.g. [THU](https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/) in China, will speed up the installation of *Anaconda*.
2. Using the nearest mirror, e.g. [aliyun](http://mirrors.aliyun.com/pypi/simple/) in China, will speed up `pip install XXX`.
3. When encountering `ModuleNotFoundError: No module named 'past'`, run `pip install future`.
4. Installing pretrained language models online is slow. We recommend downloading them in advance and saving them in the `pretrained` folder. Read the `README.md` in each task directory to check the specific requirements for saving pretrained models.
5. The old version of *DeepKE* lives in the [deepke-v1.0](https://github.com/zjunlp/DeepKE/tree/deepke-v1.0) branch. Users can switch branches to use it. Its functionality has been fully migrated into standard relation extraction ([example/re/standard](https://github.com/zjunlp/DeepKE/blob/main/example/re/standard/README.md)).
6. We recommend installing *DeepKE* from source, because users may encounter problems installing with `pip` on Windows.

<br>

# To do
In the next version, we plan to add multi-modality knowledge extraction to the toolkit.

Meanwhile, we will offer long-term maintenance to fix bugs, solve issues and meet new requests. If you have any problems, please file issues with us.

<br>

# Citation

Please cite our paper if you use DeepKE in your work

```bibtex
@article{Zhang_DeepKE_A_Deep_2022,
    author = {Zhang, Ningyu and Xu, Xin and Tao, Liankuan and Yu, Haiyang and Ye, Hongbin and Xie, Xin and Chen, Xiang and Li, Zhoubo and Li, Lei and Liang, Xiaozhuan and Yao, Yunzhi and Deng, Shumin and Zhang, Zhenru and Tan, Chuanqi and Huang, Fei and Zheng, Guozhou and Chen, Huajun},
    journal = {http://arxiv.org/abs/2201.03335},
    title = {{DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population}},
    year = {2022}
}
```
<br>

# Developers

Zhejiang University: Ningyu Zhang, Liankuan Tao, Xin Xu, Haiyang Yu, Hongbin Ye, Xin Xie, Xiang Chen, Zhoubo Li, Lei Li, Xiaozhuan Liang, Yunzhi Yao, Shuofei Qiao, Shumin Deng, Wen Zhang, Guozhou Zheng, Huajun Chen

DAMO Academy: Zhenru Zhang, Chuanqi Tan, Fei Huang

## Results
![pcnn result](https://github.com/zjunlp/deepke/blob/dev/result/result.png)
README_CN.md
@ -1,423 +0,0 @@
<p align="center">
<a href="https://github.com/zjunlp/deepke"> <img src="pics/logo.png" width="400"/></a>
</p>
<p align="center">
<a href="http://deepke.zjukg.cn">
<img alt="Documentation" src="https://img.shields.io/badge/demo-website-blue">
</a>
<a href="https://pypi.org/project/deepke/#files">
<img alt="PyPI" src="https://img.shields.io/pypi/v/deepke">
</a>
<a href="https://github.com/zjunlp/DeepKE/blob/master/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/zjunlp/deepke">
</a>
<a href="http://zjunlp.github.io/DeepKE">
<img alt="Documentation" src="https://img.shields.io/badge/doc-website-red">
</a>
<a href="https://colab.research.google.com/drive/1vS8YJhJltzw3hpJczPt24O0Azcs3ZpRi?usp=sharing">
<img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg">
</a>
</p>

<h1 align="center">
<p>An Open-Source Deep-Learning-Based Chinese Knowledge Graph Extraction Framework</p>
</h1>

DeepKE is a knowledge extraction toolkit supporting <b>low-resource and document-level</b> scenarios. Built on <b>PyTorch</b>, it implements <b>named entity recognition</b>, <b>relation extraction</b> and <b>attribute extraction</b>.<br>It also offers detailed [documentation](https://zjunlp.github.io/DeepKE/), a [Google Colab tutorial](https://colab.research.google.com/drive/1vS8YJhJltzw3hpJczPt24O0Azcs3ZpRi?usp=sharing) and an [online demo](http://deepke.zjukg.cn/CN/index.html) for beginners.

<br>

# Table of Contents

* [What's New](#whats-new)
* [Prediction Demo](#prediction-demo)
* [Model Framework](#model-framework)
* [Quick Start](#quick-start)
  * [Requirements](#requirements)
  * [Introduction of Three Functions](#introduction-of-three-functions)
    * [1. Named Entity Recognition (NER)](#1-named-entity-recognition-ner)
    * [2. Relation Extraction (RE)](#2-relation-extraction-re)
    * [3. Attribute Extraction (AE)](#3-attribute-extraction-ae)
* [Notebook Tutorials](#notebook-tutorials)
* [Notes (FAQ)](#notes-faq)
* [Future Plans](#future-plans)
* [Citation](#citation)
* [Project Members](#project-members)

<br>

# What's New

## January 2022

- Released the paper [DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population](https://arxiv.org/abs/2201.03335)

## December 2021
- Added a `dockerfile` to create the environment automatically
## November 2021
- Released the DeepKE demo page, which supports real-time extraction without deploying or training models
- Released the DeepKE documentation, with details such as the DeepKE source code and datasets
## October 2021
- `pip install deepke`
- deepke-v2.0 released
## May 2021
- `pip install deepke`
- deepke-v1.0 released
<br>

# Prediction Demo
A demo of the prediction process:<br>
<img src="pics/demo.gif" width="636" height="494" align=center>

<br>

# Model Framework

The architecture of DeepKE is shown below

<h3 align="center">
<img src="pics/architectures.png">
</h3>

- DeepKE designs a unified framework for the three knowledge extraction functions (named entity recognition, relation extraction and attribute extraction)
- Each function can be used in different scenarios, e.g., relation extraction in the standard fully supervised, low-resource few-shot and document-level settings
- Each application scenario consists of three parts: Data (Tokenizer, Preprocessor, Loader), Model (Module, Encoder, Forwarder) and Core (Training, Evaluation, Prediction)


<br>

# Quick Start

DeepKE can be installed and used via pip. Taking standard fully supervised relation extraction as an example, a standard relation extraction model can be built in the following six steps

**Step 1**: Download the code with ```git clone https://github.com/zjunlp/DeepKE.git``` (don't forget to star and fork!)

**Step 2**: Create a virtual environment with Anaconda and enter it (Dockerfile source code is provided in the `docker` folder for building your own image)

```
conda create -n deepke python=3.8

conda activate deepke
```
1) Install via pip and use directly

```
pip install deepke
```

2) Install from source

```
python setup.py install

python setup.py develop
```

**Step 3**: Enter the task folder, taking standard relation extraction as an example

```
cd DeepKE/example/re/standard
```

**Step 4**: Download the dataset
```
wget 120.27.214.45/Data/re/standard/data.tar.gz

tar -xzvf data.tar.gz
```

**Step 5**: Train the model; training parameters can be modified in the `conf` folder

DeepKE supports visual parameter tuning via *wandb*

```
python run.py
```

**Step 6**: Predict with the model; prediction parameters can be modified in the `conf` folder

Set the path of the trained model in `conf/predict.yaml`.
```
python predict.py
```

<br>

## Requirements

> python == 3.8

- torch == 1.5
- hydra-core == 1.0.6
- tensorboard == 2.4.1
- matplotlib == 3.4.1
- transformers == 3.4.0
- jieba == 0.42.1
- scikit-learn == 0.24.1
- pytorch-transformers == 1.2.0
- seqeval == 1.2.2
- tqdm == 4.60.0
- opt-einsum == 3.3.0
- ujson

<br>

## Introduction of Three Functions

### 1. Named Entity Recognition (NER)

- Named entity recognition identifies entities and their types in unstructured text. The data are txt files; examples:

| Sentence | Person | Location | Organization |
| :---: | :---: | :---: | :---: |
| 本报北京9月4日讯记者杨涌报道:部分省区人民日报宣传发行工作座谈会9月3日在4日在京举行。 | 杨涌 | 北京 | 人民日报 |
| 《红楼梦》是中央电视台和中国电视剧制作中心根据中国古典文学名著《红楼梦》摄制于1987年的一部古装连续剧,由王扶林导演,周汝昌、王蒙、周岭等多位红学家参与制作。 | 王扶林,周汝昌,王蒙,周岭 | 中国 | 中央电视台,中国电视剧制作中心 |
| 秦始皇兵马俑位于陕西省西安市,1961年被国务院公布为第一批全国重点文物保护单位,是世界八大奇迹之一。 | 秦始皇 | 陕西省,西安市 | 国务院 |

- See the detailed process in the task-specific README
- **[STANDARD (fully supervised)](https://github.com/zjunlp/DeepKE/tree/main/example/ner/standard)**

  **Step1**: Enter `DeepKE/example/ner/standard` and download the dataset

  ```bash
  wget 120.27.214.45/Data/ner/standard/data.tar.gz

  tar -xzvf data.tar.gz
  ```

  **Step2**: Train the model<br>

  The dataset and parameters can be modified in the `data` and `conf` folders respectively

  ```
  python run.py
  ```

  **Step3**: Predict
  ```
  python predict.py
  ```

- **[FEW-SHOT](https://github.com/zjunlp/DeepKE/tree/main/example/ner/few-shot)**

  **Step1**: Enter `DeepKE/example/ner/few-shot` and download the dataset

  ```bash
  wget 120.27.214.45/Data/ner/few_shot/data.tar.gz

  tar -xzvf data.tar.gz
  ```

  **Step2**: Train the model in the low-resource setting<br>

  The locations for loading and saving the model and the parameters can be modified in the `conf` folder

  ```
  python run.py +train=few_shot
  ```

  To load a model, modify `load_path` in `few_shot.yaml`;<br>

  **Step3**: Append `- predict` to `config.yaml`, set `load_path` in `predict.yaml` to the model path and `write_path` to the path for saving predictions, then run

  ```
  python predict.py
  ```

### 2. Relation Extraction (RE)

- Relation extraction extracts relations between entities from unstructured text. Some examples follow; the data are csv files:

| Sentence | Relation | Head | Head_offset | Tail | Tail_offset |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 《岳父也是爹》是王军执导的电视剧,由马恩然、范明主演。 | 导演 | 岳父也是爹 | 1 | 王军 | 8 |
| 《九玄珠》是在纵横中文网连载的一部小说,作者是龙马。 | 连载网站 | 九玄珠 | 1 | 纵横中文网 | 7 |
| 提起杭州的美景,西湖总是第一个映入脑海的词语。 | 所在城市 | 西湖 | 8 | 杭州 | 2 |

- See the detailed process in the task-specific README; RE includes the following three sub-functions
- **[STANDARD (fully supervised)](https://github.com/zjunlp/DeepKE/tree/main/example/re/standard)**

  **Step1**: Enter `DeepKE/example/re/standard` and download the dataset

  ```bash
  wget 120.27.214.45/Data/re/standard/data.tar.gz

  tar -xzvf data.tar.gz
  ```

  **Step2**: Train the model<br>

  The dataset and parameters can be modified in the `data` and `conf` folders respectively

  ```
  python run.py
  ```

  **Step3**: Predict

  ```
  python predict.py
  ```

- **[FEW-SHOT](https://github.com/zjunlp/DeepKE/tree/main/example/re/few-shot)**

  **Step1**: Enter `DeepKE/example/re/few-shot` and download the dataset

  ```bash
  wget 120.27.214.45/Data/re/few_shot/data.tar.gz

  tar -xzvf data.tar.gz
  ```

  **Step2**: Train the model<br>

  - The dataset and parameters can be modified in the `data` and `conf` folders respectively

  - To resume training from the last saved model, set `train_from_saved_model` in `conf/train.yaml` to the path of the last saved model; training logs are saved in the root directory by default and can be configured via `log_dir`

  ```
  python run.py
  ```

  **Step3**: Predict

  ```
  python predict.py
  ```

- **[DOCUMENT](https://github.com/zjunlp/DeepKE/tree/main/example/re/document)** <br>

  **Step1**: Enter `DeepKE/example/re/document` and download the dataset

  ```bash
  wget 120.27.214.45/Data/re/document/data.tar.gz

  tar -xzvf data.tar.gz
  ```

  **Step2**: Train the model<br>

  - The dataset and parameters can be modified in the `data` and `conf` folders respectively
  - To resume training from the last saved model, set `train_from_saved_model` in `conf/train.yaml` to the path of the last saved model; training logs are saved in the root directory by default and can be configured via `log_dir`;

  ```
  python run.py
  ```
  **Step3**: Predict

  ```
  python predict.py
  ```

### 3. Attribute Extraction (AE)

- The data are csv files; examples:

| Sentence | Att | Ent | Ent_offset | Val | Val_offset |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 张冬梅,女,汉族,1968年2月生,河南淇县人 | 民族 | 张冬梅 | 0 | 汉族 | 6 |
| 诸葛亮,字孔明,三国时期杰出的军事家、文学家、发明家。 | 朝代 | 诸葛亮 | 0 | 三国时期 | 8 |
| 2014年10月1日许鞍华执导的电影《黄金时代》上映 | 上映时间 | 黄金时代 | 19 | 2014年10月1日 | 0 |

- See the detailed process in the task-specific README
- **[STANDARD (fully supervised)](https://github.com/zjunlp/DeepKE/tree/main/example/ae/standard)**

  **Step1**: Enter `DeepKE/example/ae/standard` and download the dataset

  ```bash
  wget 120.27.214.45/Data/ae/standard/data.tar.gz

  tar -xzvf data.tar.gz
  ```

  **Step2**: Train the model<br>

  The dataset and parameters can be modified in the `data` and `conf` folders respectively

  ```
  python run.py
  ```

  **Step3**: Predict

  ```
  python predict.py
  ```

<br>

# Notebook Tutorials

This toolkit provides several Notebook and Google Colab tutorials that users can work through for hands-on study.

- Standard setting:

  [NER Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/ner/standard/standard_ner_tutorial.ipynb)

  [NER Colab](https://colab.research.google.com/drive/1rFiIcDNgpC002q9BbtY_wkeBUvbqVxpg?usp=sharing)

  [RE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/re/standard/standard_re_BERT_tutorial.ipynb)

  [RE Colab](https://colab.research.google.com/drive/1o6rKIxBqrGZNnA2IMXqiSsY2GWANAZLl?usp=sharing)

  [AE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/ae/standard/standard_ae_tutorial.ipynb)

  [AE Colab](https://colab.research.google.com/drive/1pgPouEtHMR7L9Z-QfG1sPYkJfrtRt8ML?usp=sharing)

- Low-resource:

  [NER Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/ner/few-shot/fewshot_ner_tutorial.ipynb)

  [NER Colab](https://colab.research.google.com/drive/1Xz0sNpYQNbkjhebCG5djrwM8Mj2Crj7F?usp=sharing)

  [RE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/re/few-shot/fewshot_re_tutorial.ipynb)

  [RE Colab](https://colab.research.google.com/drive/1o1ly6ORgerkm1fCDjEQb7hsN5WKyg3JH?usp=sharing)

- Document-level:

  [RE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/re/document/document_re_tutorial.ipynb)

  [RE Colab](https://colab.research.google.com/drive/1RGUBbbOBHlWJ1NXQLtP_YEUktntHtROa?usp=sharing)

<br>

# Notes (FAQ)

1. When using Anaconda, we recommend adding a nearby mirror for faster downloads, e.g. the [THU mirror](https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/).
2. When using pip, we recommend a nearby mirror for faster downloads, e.g. the Aliyun mirror.
3. If `ModuleNotFoundError: No module named 'past'` appears after installation, run `pip install future` to fix it.
4. Downloading pretrained language models online is slow; we recommend downloading them in advance and placing them in the pretrained folder. See the `README.md` inside that folder for the specific file requirements.
5. The old version of DeepKE lives in the [deepke-v1.0](https://github.com/zjunlp/DeepKE/tree/deepke-v1.0) branch; users can switch branches to use it. The old version's functionality has been fully migrated into standard relation extraction ([example/re/standard](https://github.com/zjunlp/DeepKE/blob/main/example/re/standard/README.md)).

<br>

# Future Plans

- Add multi-modal knowledge extraction in the next version of DeepKE
- We provide long-term technical maintenance and support. If you have questions, please submit issues

<br>

# Citation

If you use DeepKE, please cite it as follows

```bibtex
@article{Zhang_DeepKE_A_Deep_2022,
    author = {Zhang, Ningyu and Xu, Xin and Tao, Liankuan and Yu, Haiyang and Ye, Hongbin and Xie, Xin and Chen, Xiang and Li, Zhoubo and Li, Lei and Liang, Xiaozhuan and Yao, Yunzhi and Deng, Shumin and Zhang, Zhenru and Tan, Chuanqi and Huang, Fei and Zheng, Guozhou and Chen, Huajun},
    journal = {http://arxiv.org/abs/2201.03335},
    title = {{DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population}},
    year = {2022}
}
```

<br>

# Project Members

Zhejiang University: Ningyu Zhang, Liankuan Tao, Xin Xu, Haiyang Yu, Hongbin Ye, Xin Xie, Xiang Chen, Zhoubo Li, Lei Li, Xiaozhuan Liang, Yunzhi Yao, Shuofei Qiao, Shumin Deng, Wen Zhang, Guozhou Zheng, Huajun Chen

DAMO Academy: Zhenru Zhang, Chuanqi Tan, Fei Huang
@ -0,0 +1 @@
Link: https://pan.baidu.com/s/1r7-Curgph4ffTlILh6JDJA  Password: knya
@ -1,28 +0,0 @@
FROM ubuntu:18.04
LABEL maintainer="ZJUNLP"
LABEL repository="DeepKE"
ENV PYTHON_VERSION=3.8

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    cmake \
    wget \
    git \
    curl \
    ca-certificates

RUN curl -o ~/miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-4.7.12-Linux-x86_64.sh && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b && \
    rm ~/miniconda.sh

ENV PATH=/root/miniconda3/bin:$PATH

RUN conda create -y --name deepke python=$PYTHON_VERSION

# SHELL ["/root/miniconda3/bin/conda", "run", "-n", "deepke", "/bin/bash", "-c"]
RUN conda init bash

RUN cd ~ && \
    git clone https://github.com/zjunlp/DeepKE.git
@ -1,20 +0,0 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = source
BUILDDIR      = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@ -1,35 +0,0 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.https://www.sphinx-doc.org/
	exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
@ -1,118 +0,0 @@
a,
.wy-menu-vertical header,
.wy-menu-vertical p.caption,
.wy-nav-top .fa-bars,
.wy-menu-vertical a:hover,

.rst-content code.literal, .rst-content tt.literal

{
    color: rgb(0, 63, 136) !important;
}

/* inspired by sphinx press theme */
.wy-menu.wy-menu-vertical li.toctree-l1.current > a {
    border-left: solid 8px #4122f0 !important;
    border-top: none;
    border-bottom: none;
}

.wy-menu.wy-menu-vertical li.toctree-l1.current > ul {
    border-left: solid 8px #4719ee !important;
}
/* inspired by sphinx press theme */

.wy-nav-side {
    color: unset !important;
    background: unset !important;
    border-right: solid 1px #ccc !important;
}

.wy-side-nav-search,
.wy-nav-top,
.wy-menu-vertical li,
.wy-menu-vertical li a:hover,
.wy-menu-vertical li a
{
    background: unset !important;
}

.wy-menu-vertical li.current a {
    border-right: unset !important;
}

.wy-side-nav-search div,
.wy-menu-vertical a {
    color: #404040 !important;
}

.wy-menu-vertical button.toctree-expand {
    color: #333 !important;
}

.wy-nav-content {
    max-width: unset;
}

.rst-content {
    max-width: 900px;
}

.wy-nav-content .icon-home:before {
    content: "Docs";
}

.wy-side-nav-search .icon-home:before {
    content: "";
}

dl.field-list {
    display: block !important;
}

dl.field-list > dt:after {
    content: "" !important;
}

dl.field-list > dt {
    display: table;
    padding-left: 6px !important;
    padding-right: 6px !important;
    margin-bottom: 4px !important;
    padding-bottom: 1px !important;
    background: #f6ecd852;
    border-left: solid 2px #ccc;
}


dl.py.class>dt
{
    color: rgba(17, 16, 17, 0.822) !important;
    background: rgb(226, 241, 250) !important;
    border-top: solid 2px #58b5cc !important;
}

dl.py.method>dt
{
    background: rgb(226, 241, 250) !important;
    border-left: solid 2px #bcb3be !important;
}

dl.py.attribute>dt,
dl.py.property>dt
{
    background: rgb(226, 241, 250) !important;
    border-left: solid 2px #58b5cc !important;
}

.fa-plus-square-o::before, .wy-menu-vertical li button.toctree-expand::before,
.fa-minus-square-o::before, .wy-menu-vertical li.current > a button.toctree-expand::before, .wy-menu-vertical li.on a button.toctree-expand::before
{
    content: "";
}

.rst-content .viewcode-back,
.rst-content .viewcode-link
{
    font-size: 120%;
}
@ -1,82 +0,0 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('../../src'))
import sphinx_rtd_theme
import doctest
import deepke

# -- Project information -----------------------------------------------------

project = 'DeepKE'
copyright = '2021, ZJUNLP'
author = 'tlk'

# The full version, including alpha/beta/rc tags
release = '1.0.0'


# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.autosummary',
    'sphinx.ext.doctest',
    'sphinx.ext.intersphinx',
    'sphinx.ext.mathjax',
    'sphinx.ext.napoleon',
    'sphinx.ext.viewcode',
    'sphinx.ext.githubpages',
    'sphinx.ext.todo',
    'sphinx.ext.coverage',
    'sphinx_copybutton',
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []

doctest_default_flags = doctest.NORMALIZE_WHITESPACE
autodoc_member_order = 'bysource'
intersphinx_mapping = {'python': ('https://docs.python.org/', None)}


# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_css_files = ['css/custom.css']
# html_logo = './_static/logo.png'

html_context = {
    "display_github": True,  # Integrate GitHub
    "github_user": "tlk1997",  # Username
    "github_repo": "test_doc",  # Repo name
    "github_version": "main",  # Version
    "conf_py_path": "/docs/source/",  # Path in the checkout to the docs root
}
@ -1,9 +0,0 @@
|
|||
Attribution Extraction
|
||||
======================
|
||||
|
||||
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 4
|
||||
|
||||
deepke.attribution_extraction.standard
|
|
@ -1,60 +0,0 @@
|
|||
Models
|
||||
======
|
||||
|
||||
|
||||
deepke.attribution\_extraction.standard.models.BasicModule module
|
||||
-----------------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.models.BasicModule
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
deepke.attribution\_extraction.standard.models.BiLSTM module
|
||||
------------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.models.BiLSTM
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
deepke.attribution\_extraction.standard.models.Capsule module
|
||||
-------------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.models.Capsule
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
deepke.attribution\_extraction.standard.models.GCN module
|
||||
---------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.models.GCN
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
deepke.attribution\_extraction.standard.models.LM module
|
||||
--------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.models.LM
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
deepke.attribution\_extraction.standard.models.PCNN module
|
||||
----------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.models.PCNN
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
deepke.attribution\_extraction.standard.models.Transformer module
|
||||
-----------------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.models.Transformer
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
|
@ -1,60 +0,0 @@
|
|||
Module
|
||||
======
|
||||
|
||||
|
||||
deepke.attribution\_extraction.standard.module.Attention module
|
||||
---------------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.module.Attention
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
deepke.attribution\_extraction.standard.module.CNN module
|
||||
---------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.module.CNN
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
deepke.attribution\_extraction.standard.module.Capsule module
|
||||
-------------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.module.Capsule
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
deepke.attribution\_extraction.standard.module.Embedding module
|
||||
---------------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.module.Embedding
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
deepke.attribution\_extraction.standard.module.GCN module
|
||||
---------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.module.GCN
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
deepke.attribution\_extraction.standard.module.RNN module
|
||||
---------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.module.RNN
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
||||
deepke.attribution\_extraction.standard.module.Transformer module
|
||||
-----------------------------------------------------------------
|
||||
|
||||
.. automodule:: deepke.attribution_extraction.standard.module.Transformer
|
||||
:members:
|
||||
:undoc-members:
|
||||
:show-inheritance:
|
||||
|
|
@ -1,11 +0,0 @@
|
|||
Standard
|
||||
========
|
||||
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 4
|
||||
|
||||
deepke.attribution_extraction.standard.models
|
||||
deepke.attribution_extraction.standard.module
|
||||
deepke.attribution_extraction.standard.tools
|
||||
deepke.attribution_extraction.standard.utils
|
|
@ -1,53 +0,0 @@
Tools
=====

deepke.attribution\_extraction.standard.tools.dataset module
------------------------------------------------------------

.. automodule:: deepke.attribution_extraction.standard.tools.dataset
   :members:
   :undoc-members:
   :show-inheritance:

deepke.attribution\_extraction.standard.tools.metrics module
------------------------------------------------------------

.. automodule:: deepke.attribution_extraction.standard.tools.metrics
   :members:
   :undoc-members:
   :show-inheritance:

deepke.attribution\_extraction.standard.tools.preprocess module
---------------------------------------------------------------

.. automodule:: deepke.attribution_extraction.standard.tools.preprocess
   :members:
   :undoc-members:
   :show-inheritance:

deepke.attribution\_extraction.standard.tools.serializer module
---------------------------------------------------------------

.. automodule:: deepke.attribution_extraction.standard.tools.serializer
   :members:
   :undoc-members:
   :show-inheritance:

deepke.attribution\_extraction.standard.tools.trainer module
------------------------------------------------------------

.. automodule:: deepke.attribution_extraction.standard.tools.trainer
   :members:
   :undoc-members:
   :show-inheritance:

deepke.attribution\_extraction.standard.tools.vocab module
----------------------------------------------------------

.. automodule:: deepke.attribution_extraction.standard.tools.vocab
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,20 +0,0 @@
Utils
=====

deepke.attribution\_extraction.standard.utils.ioUtils module
------------------------------------------------------------

.. automodule:: deepke.attribution_extraction.standard.utils.ioUtils
   :members:
   :undoc-members:
   :show-inheritance:

deepke.attribution\_extraction.standard.utils.nnUtils module
------------------------------------------------------------

.. automodule:: deepke.attribution_extraction.standard.utils.nnUtils
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,21 +0,0 @@
Models
======

deepke.name\_entity\_recognition.few\_shot.models.model module
--------------------------------------------------------------

.. automodule:: deepke.name_entity_recognition.few_shot.models.model
   :members:
   :undoc-members:
   :show-inheritance:

deepke.name\_entity\_recognition.few\_shot.models.modeling\_bart module
-----------------------------------------------------------------------

.. automodule:: deepke.name_entity_recognition.few_shot.models.modeling_bart
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,35 +0,0 @@
Module
======

deepke.name\_entity\_recognition.few\_shot.module.datasets module
-----------------------------------------------------------------

.. automodule:: deepke.name_entity_recognition.few_shot.module.datasets
   :members:
   :undoc-members:
   :show-inheritance:

deepke.name\_entity\_recognition.few\_shot.module.mapping\_type module
----------------------------------------------------------------------

.. automodule:: deepke.name_entity_recognition.few_shot.module.mapping_type
   :members:
   :undoc-members:
   :show-inheritance:

deepke.name\_entity\_recognition.few\_shot.module.metrics module
----------------------------------------------------------------

.. automodule:: deepke.name_entity_recognition.few_shot.module.metrics
   :members:
   :undoc-members:
   :show-inheritance:

deepke.name\_entity\_recognition.few\_shot.module.train module
--------------------------------------------------------------

.. automodule:: deepke.name_entity_recognition.few_shot.module.train
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,10 +0,0 @@
Few Shot
========

.. toctree::
   :maxdepth: 4

   deepke.name_entity_recognition.few_shot.models
   deepke.name_entity_recognition.few_shot.module
   deepke.name_entity_recognition.few_shot.utils
@ -1,11 +0,0 @@
Utils
=====

deepke.name\_entity\_recognition.few\_shot.utils.util module
------------------------------------------------------------

.. automodule:: deepke.name_entity_recognition.few_shot.utils.util
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,9 +0,0 @@
Name Entity Recognition
=======================

.. toctree::
   :maxdepth: 4

   deepke.name_entity_recognition.few_shot
   deepke.name_entity_recognition.standard
@ -1,11 +0,0 @@
Models
======

deepke.name\_entity\_recognition.standard.models.InferBert module
-----------------------------------------------------------------

.. automodule:: deepke.name_entity_recognition.standard.models.InferBert
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,9 +0,0 @@
Standard
========

.. toctree::
   :maxdepth: 4

   deepke.name_entity_recognition.standard.models
   deepke.name_entity_recognition.standard.tools
@ -1,20 +0,0 @@
Tools
=====

deepke.name\_entity\_recognition.standard.tools.dataset module
--------------------------------------------------------------

.. automodule:: deepke.name_entity_recognition.standard.tools.dataset
   :members:
   :undoc-members:
   :show-inheritance:

deepke.name\_entity\_recognition.standard.tools.preprocess module
-----------------------------------------------------------------

.. automodule:: deepke.name_entity_recognition.standard.tools.preprocess
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,51 +0,0 @@
Document
========

deepke.relation\_extraction.document.evaluation module
------------------------------------------------------

.. automodule:: deepke.relation_extraction.document.evaluation
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.document.losses module
--------------------------------------------------

.. automodule:: deepke.relation_extraction.document.losses
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.document.model module
-------------------------------------------------

.. automodule:: deepke.relation_extraction.document.model
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.document.module module
--------------------------------------------------

.. automodule:: deepke.relation_extraction.document.module
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.document.prepro module
--------------------------------------------------

.. automodule:: deepke.relation_extraction.document.prepro
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.document.utils module
-------------------------------------------------

.. automodule:: deepke.relation_extraction.document.utils
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,26 +0,0 @@
Dataset
=======

deepke.relation\_extraction.few\_shot.dataset.base\_data\_module module
-----------------------------------------------------------------------

.. automodule:: deepke.relation_extraction.few_shot.dataset.base_data_module
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.few\_shot.dataset.dialogue module
-------------------------------------------------------------

.. automodule:: deepke.relation_extraction.few_shot.dataset.dialogue
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.few\_shot.dataset.processor module
--------------------------------------------------------------

.. automodule:: deepke.relation_extraction.few_shot.dataset.processor
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,27 +0,0 @@
Lit Models
==========

deepke.relation\_extraction.few\_shot.lit\_models.base module
-------------------------------------------------------------

.. automodule:: deepke.relation_extraction.few_shot.lit_models.base
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.few\_shot.lit\_models.transformer module
--------------------------------------------------------------------

.. automodule:: deepke.relation_extraction.few_shot.lit_models.transformer
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.few\_shot.lit\_models.util module
-------------------------------------------------------------

.. automodule:: deepke.relation_extraction.few_shot.lit_models.util
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,27 +0,0 @@
Few Shot
========

.. toctree::
   :maxdepth: 4

   deepke.relation_extraction.few_shot.dataset
   deepke.relation_extraction.few_shot.lit_models

deepke.relation\_extraction.few\_shot.generate\_k\_shot module
--------------------------------------------------------------

.. automodule:: deepke.relation_extraction.few_shot.generate_k_shot
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.few\_shot.get\_label\_word module
-------------------------------------------------------------

.. automodule:: deepke.relation_extraction.few_shot.get_label_word
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,11 +0,0 @@
Relation Extraction
===================

.. toctree::
   :maxdepth: 4

   deepke.relation_extraction.document
   deepke.relation_extraction.few_shot
   deepke.relation_extraction.standard
@ -1,59 +0,0 @@
Models
======

deepke.relation\_extraction.standard.models.BasicModule module
--------------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.models.BasicModule
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.models.BiLSTM module
---------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.models.BiLSTM
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.models.Capsule module
----------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.models.Capsule
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.models.GCN module
------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.models.GCN
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.models.LM module
-----------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.models.LM
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.models.PCNN module
-------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.models.PCNN
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.models.Transformer module
--------------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.models.Transformer
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,59 +0,0 @@
Module
======

deepke.relation\_extraction.standard.module.Attention module
------------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.module.Attention
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.module.CNN module
------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.module.CNN
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.module.Capsule module
----------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.module.Capsule
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.module.Embedding module
------------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.module.Embedding
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.module.GCN module
------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.module.GCN
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.module.RNN module
------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.module.RNN
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.module.Transformer module
--------------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.module.Transformer
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,10 +0,0 @@
Standard
========

.. toctree::
   :maxdepth: 4

   deepke.relation_extraction.standard.models
   deepke.relation_extraction.standard.module
   deepke.relation_extraction.standard.tools
   deepke.relation_extraction.standard.utils
@ -1,60 +0,0 @@
Tools
=====

deepke.relation\_extraction.standard.tools.dataset module
---------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.tools.dataset
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.tools.loss module
------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.tools.loss
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.tools.metrics module
---------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.tools.metrics
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.tools.preprocess module
------------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.tools.preprocess
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.tools.serializer module
------------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.tools.serializer
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.tools.trainer module
---------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.tools.trainer
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.tools.vocab module
-------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.tools.vocab
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,19 +0,0 @@
Utils
=====

deepke.relation\_extraction.standard.utils.ioUtils module
---------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.utils.ioUtils
   :members:
   :undoc-members:
   :show-inheritance:

deepke.relation\_extraction.standard.utils.nnUtils module
---------------------------------------------------------

.. automodule:: deepke.relation_extraction.standard.utils.nnUtils
   :members:
   :undoc-members:
   :show-inheritance:
@ -1,10 +0,0 @@
DeepKE
======

.. toctree::
   :maxdepth: 4

   deepke.attribution_extraction
   deepke.name_entity_recognition
   deepke.relation_extraction
@ -1,345 +0,0 @@
Example
=======

Standard NER
------------
The standard module is implemented with the pretrained model BERT.

**Step 1**

Enter ``DeepKE/example/ner/standard``.

**Step 2**

Get data:

`wget 120.27.214.45/Data/ner/standard/data.tar.gz`

`tar -xzvf data.tar.gz`

The dataset and parameters can be customized in the ``data`` folder and the ``conf`` folder respectively.

The dataset needs to be provided as a ``TXT`` file.

The data format of the file needs to comply with the following (one token, its BIO tag, and a newline per line)::

    杭 B-LOC '\\n'
    州 I-LOC '\\n'
    真 O '\\n'
    美 O '\\n'
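
Parsing this layout is straightforward; the sketch below is illustrative only, and the file name ``train.txt`` plus blank-line sentence separation are assumptions, not guarantees of DeepKE:

.. code-block:: python

    # Illustrative sketch: parse "token TAG" lines into sentences.
    def read_bio(path):
        sentences, tokens, tags = [], [], []
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:  # assumption: a blank line ends a sentence
                    if tokens:
                        sentences.append((tokens, tags))
                        tokens, tags = [], []
                    continue
                token, tag = line.split()[:2]
                tokens.append(token)
                tags.append(tag)
        if tokens:
            sentences.append((tokens, tags))
        return sentences

    # read_bio('data/train.txt')
    # -> [(['杭', '州', '真', '美'], ['B-LOC', 'I-LOC', 'O', 'O']), ...]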
**Step 3**

Train:

`python run.py`

**Step 4**

Predict:

`python predict.py`

.. code-block:: bash

    cd example/ner/standard
    wget 120.27.214.45/Data/ner/standard/data.tar.gz
    tar -xzvf data.tar.gz
    python run.py
    python predict.py

Few-shot NER
------------
This module targets the low-resource scenario.

**Step 1**

Enter ``DeepKE/example/ner/few-shot``.

**Step 2**

Get data:

`wget 120.27.214.45/Data/ner/few_shot/data.tar.gz`

`tar -xzvf data.tar.gz`

The directory where the model is loaded and saved, as well as the configuration parameters, can be customized in the ``conf`` folder. The dataset can be customized in the ``data`` folder.

The dataset needs to be provided as a ``TXT`` file.

The data format of the file needs to comply with the following::

    EU B-ORG '\\n'
    rejects O '\\n'
    German B-MISC '\\n'
    call O '\\n'
    to O '\\n'
    boycott O '\\n'
    British B-MISC '\\n'
    lamb O '\\n'
    . O '\\n'

**Step 3**

Train with CoNLL-2003:

`python run.py`

Train in the few-shot scenario:

`python run.py +train=few_shot`. Users can modify `load_path` in ``conf/train/few_shot.yaml`` to start from an existing trained model.

**Step 4**

Predict:

Add `- predict` to ``conf/config.yaml``, set `load_path` to the model path and `write_path` to the path where the predicted results should be saved in ``conf/predict.yaml``, and then run `python predict.py`.

.. code-block:: bash

    cd example/ner/few-shot
    wget 120.27.214.45/Data/ner/few_shot/data.tar.gz
    tar -xzvf data.tar.gz
    python run.py
    python predict.py

Standard RE
-----------
The standard module is implemented with common deep learning models, including CNN, RNN, Capsule, GCN, Transformer, and a pretrained language model.

**Step 1**

Enter the ``DeepKE/example/re/standard`` folder.

**Step 2**

Get data:

`wget 120.27.214.45/Data/re/standard/data.tar.gz`

`tar -xzvf data.tar.gz`

The dataset and parameters can be customized in the ``data`` folder and the ``conf`` folder respectively.

The dataset needs to be provided as a ``CSV`` file.

The data format of the file needs to comply with the following:

+--------------------------+-----------+------------+-------------+------------+------------+
| Sentence                 | Relation  | Head       | Head_offset | Tail       | Tail_offset|
+--------------------------+-----------+------------+-------------+------------+------------+

The relation format of the file needs to comply with the following:

+------------+-----------+------------------+-------------+
| Head_type  | Tail_type | relation         | Index       |
+------------+-----------+------------------+-------------+
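
The columns above map directly onto Python's standard ``csv`` module; the following sketch is illustrative only (the file path ``data/origin/train.csv`` and character-based offsets are assumptions):

.. code-block:: python

    import csv

    # Illustrative sketch: iterate over the sentence-level CSV described above.
    with open('data/origin/train.csv', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            sentence = row['Sentence']
            relation = row['Relation']
            head, head_offset = row['Head'], int(row['Head_offset'])
            tail, tail_offset = row['Tail'], int(row['Tail_offset'])
            # assumption: offsets index characters of the sentence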
**Step 3**

Train:

`python run.py`

**Step 4**

Predict:

`python predict.py`

.. code-block:: bash

    cd example/re/standard
    wget 120.27.214.45/Data/re/standard/data.tar.gz
    tar -xzvf data.tar.gz
    python run.py
    python predict.py

Few-shot RE
-----------
This module targets the low-resource scenario.

**Step 1**

Enter ``DeepKE/example/re/few-shot``.

**Step 2**

Get data:

`wget 120.27.214.45/Data/re/few_shot/data.tar.gz`

`tar -xzvf data.tar.gz`

The dataset and parameters can be customized in the ``data`` folder and the ``conf`` folder respectively.

The dataset needs to be provided as a ``TXT`` file and a ``JSON`` file.

The data format of the file needs to comply with the following::

    {"token": ["the", "most", "common", "audits", "were", "about", "waste", "and", "recycling", "."], "h": {"name": "audits", "pos": [3, 4]}, "t": {"name": "waste", "pos": [6, 7]}, "relation": "Message-Topic(e1,e2)"}

The relation format of the file needs to comply with the following::

    {"Other": 0 , "Message-Topic(e1,e2)": 1 ... }
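
Each line of the dataset file is a self-contained JSON object, so the file can be processed line by line; a short illustrative sketch (both file names are assumptions):

.. code-block:: python

    import json

    # Illustrative sketch: read the relation mapping, then the JSON-lines samples.
    with open('data/rel2id.json', encoding='utf-8') as f:
        rel2id = json.load(f)  # e.g. {"Other": 0, "Message-Topic(e1,e2)": 1, ...}

    with open('data/train.txt', encoding='utf-8') as f:
        for line in f:
            sample = json.loads(line)
            tokens = sample['token']
            head_start, head_end = sample['h']['pos']  # e.g. [3, 4] -> tokens[3:4]
            tail_start, tail_end = sample['t']['pos']
            label = rel2id[sample['relation']]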
**Step 3**

Train:

`python run.py`

To resume from the model trained last time, set `train_from_saved_model` in ``conf/train.yaml`` to the path where that model was saved. The directory for logs generated during training can be customized via ``log_dir``.

**Step 4**

Predict:

`python predict.py`

.. code-block:: bash

    cd example/re/few-shot
    wget 120.27.214.45/Data/re/few_shot/data.tar.gz
    tar -xzvf data.tar.gz
    python run.py
    python predict.py

Document RE
-----------
This module targets the document-level scenario.

**Step 1**

Enter ``DeepKE/example/re/document``.

**Step 2**

Get data:

`wget 120.27.214.45/Data/re/document/data.tar.gz`

`tar -xzvf data.tar.gz`

The dataset and parameters can be customized in the ``data`` folder and the ``conf`` folder respectively.

The dataset needs to be provided as a ``JSON`` file.

The data format of the file needs to comply with the following::

    [{"vertexSet": [[{"name": "Lark Force", "pos": [0, 2], "sent_id": 0, "type": "ORG"},...]],
    "labels": [{"r": "P607", "h": 1, "t": 3, "evidence": [0]}, ...],
    "title": "Lark Force",
    "sents": [["Lark", "Force", "was", "an", "Australian", "Army", "formation", "established", "in", "March", "1941", "during", "World", "War", "II", "for", "service", "in", "New", "Britain", "and", "New", "Ireland", "."],...}]

The relation format of the file needs to comply with the following::

    {"P1376": 79,"P607": 27,...}
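
This is the DocRED-style layout: ``vertexSet`` groups the mentions of one entity, and each entry of ``labels`` links two entity indices with a relation id and its evidence sentences. An illustrative sketch (the file name is an assumption):

.. code-block:: python

    import json

    # Illustrative sketch: inspect the first document of a DocRED-style file.
    with open('data/train_annotated.json', encoding='utf-8') as f:
        docs = json.load(f)

    doc = docs[0]
    for label in doc['labels']:
        head = doc['vertexSet'][label['h']][0]['name']  # first mention of the head entity
        tail = doc['vertexSet'][label['t']][0]['name']
        print(head, label['r'], tail, 'evidence:', label['evidence'])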
**Step 3**

Train:

`python run.py`

To resume from the model trained last time, set `train_from_saved_model` in ``conf/train.yaml`` to the path where that model was saved. The directory for logs generated during training can be customized via ``log_dir``.

**Step 4**

Predict:

`python predict.py`

.. code-block:: bash

    cd example/re/document
    wget 120.27.214.45/Data/re/document/data.tar.gz
    tar -xzvf data.tar.gz
    python run.py
    python predict.py

Standard AE
-----------
The standard module is implemented with common deep learning models, including CNN, RNN, Capsule, GCN, Transformer, and a pretrained language model.

**Step 1**

Enter the ``DeepKE/example/ae/standard`` folder.

**Step 2**

Get data:

`wget 120.27.214.45/Data/ae/standard/data.tar.gz`

`tar -xzvf data.tar.gz`

The dataset and parameters can be customized in the ``data`` folder and the ``conf`` folder respectively.

The dataset needs to be provided as a ``CSV`` file.

The data format of the file needs to comply with the following:

+--------------------------+------------+------------+---------------+-------------------+-----------------------+
| Sentence                 | Attribute  | Entity     | Entity_offset | Attribute_value   | Attribute_value_offset|
+--------------------------+------------+------------+---------------+-------------------+-----------------------+

The attribute format of the file needs to comply with the following:

+-------------------+-------------+
| Attribute         | Index       |
+-------------------+-------------+

**Step 3**

Train:

`python run.py`

**Step 4**

Predict:

`python predict.py`

.. code-block:: bash

    cd example/ae/standard
    wget 120.27.214.45/Data/ae/standard/data.tar.gz
    tar -xzvf data.tar.gz
    python run.py
    python predict.py

For more details, you can refer to https://www.bilibili.com/video/BV1n44y1x7iW?spm_id_from=333.999.0.0 .
@ -1,13 +0,0 @@
FAQ
===

1. Using the nearest mirror will speed up the installation of Anaconda.

2. Using the nearest mirror will speed up ``pip install XXX``.

3. When encountering ``ModuleNotFoundError: No module named 'past'``, run ``pip install future``.

4. It is slow to download pretrained language models online. We recommend downloading the pretrained models in advance and saving them in the ``pretrained`` folder. Read the README.md in every task directory to check the specific requirements for saving pretrained models.

5. The old version of DeepKE is in the ``deepke-v1.0`` branch. Users can switch to that branch to use the old version. The old version has been entirely migrated to the standard relation extraction module (``example/re/standard``).
@ -1,52 +0,0 @@

DeepKE Documentation
====================

Introduction
------------

.. image:: ./_static/logo.png

DeepKE is a knowledge extraction toolkit supporting low-resource and document-level scenarios. It provides three functions based on PyTorch, including named entity recognition, relation extraction and attribute extraction.

.. image:: ./_static/demo.gif

Support Weights & Biases
------------------------

.. image:: ./_static/wandb.png

To achieve automatic hyper-parameter fine-tuning, DeepKE adopts Weights & Biases, a machine learning toolkit that helps developers build better models faster.
With this toolkit, DeepKE can visualize results and tune hyper-parameters automatically.
The example running files for all functions in the repository support the toolkit, and researchers are able to modify the metrics and hyper-parameter configuration as needed.
For the detailed usage of this toolkit, refer to the official documentation.

Support Notebook Tutorials
--------------------------

We provide Google Colab tutorials and Jupyter notebooks in the GitHub repository as example implementations of every function in different scenarios.
These tutorials can be run directly and give developers and researchers a whole picture of DeepKE's application methods.

You can go to Colab directly: https://colab.research.google.com/drive/1cM-zbLhEHkje54P0IZENrfe4HaXwZxZc?usp=sharing

.. toctree::
   :glob:
   :maxdepth: 1
   :caption: Getting Started

   start
   install
   example
   faq


.. toctree::
   :glob:
   :maxdepth: 3
   :caption: Package

   deepke
@ -1,39 +0,0 @@
Install
=======

Create environment
------------------

Create a virtual environment directly (Anaconda recommended):

.. code-block:: bash

    conda create -n deepke python=3.8
    conda activate deepke

We also provide a Dockerfile to create a docker image:

.. code-block:: bash

    cd docker
    docker build -t deepke .
    conda activate deepke

Install with pip
----------------

If you use deepke directly:

.. code-block:: bash

    pip install deepke

Install with setup.py
---------------------

If you modify the source code before usage:

.. code-block:: bash

    python setup.py install
@ -1,107 +0,0 @@
Start
=====

Model Framework
---------------

.. image:: ./_static/architectures.png

DeepKE contains three modules, one for each of the tasks of named entity recognition, relation extraction and attribute extraction.

Each module has its own submodules. For example, there are standard, document-level and few-shot submodules in the relation extraction module.

Each submodule consists of three parts: a collection of tools that serve as tokenizer, dataloader, preprocessor and the like; an encoder; and a part for training and prediction.

Dataset
-------

We use the following datasets in our experiments:

+--------------------------+-----------+------------------+----------+------------+
| Task                     | Settings  | Corpus           | Language | Model      |
+==========================+===========+==================+==========+============+
|                          |           | CoNLL-2003       | English  |            |
|                          | Standard  +------------------+----------+ BERT       |
|                          |           | People's Daily   | Chinese  |            |
|                          +-----------+------------------+----------+------------+
|                          |           | CoNLL-2003       |          |            |
|                          |           +------------------+          |            |
| Name Entity Recognition  |           | MIT Movie        |          |            |
|                          | Few-shot  +------------------+ English  | LightNER   |
|                          |           | MIT Restaurant   |          |            |
|                          |           +------------------+          |            |
|                          |           | ATIS             |          |            |
+--------------------------+-----------+------------------+----------+------------+
|                          |           |                  |          | CNN        |
|                          |           |                  |          +------------+
|                          |           |                  |          | RNN        |
|                          |           |                  |          +------------+
|                          |           |                  |          | Capsule    |
|                          | Standard  | DuIE             | Chinese  +------------+
|                          |           |                  |          | GCN        |
|                          |           |                  |          +------------+
|                          |           |                  |          | Transformer|
|                          |           |                  |          +------------+
|                          |           |                  |          | BERT       |
|                          +-----------+------------------+----------+------------+
| Relation Extraction      |           | SEMEVAL(8-shot)  |          |            |
|                          |           +------------------+          |            |
|                          |           | SEMEVAL(16-shot) |          |            |
|                          | Few-shot  +------------------+ English  | KnowPrompt |
|                          |           | SEMEVAL(32-shot) |          |            |
|                          |           +------------------+          |            |
|                          |           | SEMEVAL(Full)    |          |            |
|                          +-----------+------------------+----------+------------+
|                          |           | DocRED           |          |            |
|                          |           +------------------+          |            |
|                          | Document  | CDR              | English  | DocuNet    |
|                          |           +------------------+          |            |
|                          |           | GDA              |          |            |
+--------------------------+-----------+------------------+----------+------------+
|                          |           |                  |          | CNN        |
|                          |           |                  |          +------------+
|                          |           |                  |          | RNN        |
|                          |           |                  |          +------------+
|                          |           |Triplet Extraction|          | Capsule    |
| Attribute Extraction     | Standard  |Dataset           | Chinese  +------------+
|                          |           |                  |          | GCN        |
|                          |           |                  |          +------------+
|                          |           |                  |          | Transformer|
|                          |           |                  |          +------------+
|                          |           |                  |          | BERT       |
+--------------------------+-----------+------------------+----------+------------+

Getting Started
---------------

If you want to use our code, you can do as follows:

.. code-block:: bash

    git clone https://github.com/zjunlp/DeepKE.git
    cd DeepKE
@ -1,73 +0,0 @@
# Easy Start

<p align="left">
    <b> English | <a href="https://github.com/zjunlp/DeepKE/blob/main/example/ae/standard/README_CN.md">简体中文</a> </b>
</p>

## Requirements

> python == 3.8

- torch == 1.5
- hydra-core == 1.0.6
- tensorboard == 2.4.1
- matplotlib == 3.4.1
- scikit-learn == 0.24.1
- transformers == 3.4.0
- jieba == 0.42.1
- deepke

## Download Code

```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ae/standard
```

## Install with Pip

- Create and enter a python virtual environment.
- Install dependencies: `pip install -r requirements.txt`.

## Train and Predict

- Dataset

  - Download the dataset to this directory.

    ```bash
    wget 120.27.214.45/Data/ae/standard/data.tar.gz
    tar -xzvf data.tar.gz
    ```

  - The dataset is stored in `data/origin`:
    - `train.csv`: Training set
    - `valid.csv`: Validation set
    - `test.csv`: Test set
    - `attribute.csv`: Attribute types

- Training

  - Parameters for training are in the `conf` folder and users can modify them before training.
  - If using an LM, modify `lm_file` to use a local model.
  - Logs for training are in the `log` folder and the trained model is saved in the `checkpoints` folder.

  ```bash
  python run.py
  ```

- Prediction

  ```bash
  python predict.py
  ```

## Models

1. CNN
2. RNN
3. Capsule
4. GCN
5. Transformer
6. Pre-trained Model (BERT)
@ -1,63 +0,0 @@
## Easy Start

<p align="left">
    <b> <a href="https://github.com/zjunlp/DeepKE/blob/main/example/ae/standard/README.md">English</a> | 简体中文 </b>
</p>

### Requirements

> python == 3.8

- torch == 1.5
- hydra-core == 1.0.6
- tensorboard == 2.4.1
- matplotlib == 3.4.1
- scikit-learn == 0.24.1
- transformers == 3.4.0
- jieba == 0.42.1
- deepke

### Clone the Code
```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ae/standard
```
### Install with pip

First create a python virtual environment, then enter it.

- Install dependencies: ```pip install -r requirements.txt```

### Train and Predict with the Data

- Get the data: download it into this directory with ```wget 120.27.214.45/Data/ae/standard/data.tar.gz```

  After extraction, the training data is stored under `data/origin`:

  - `train.csv`: training set

  - `valid.csv`: validation set

  - `test.csv`: test set

  - `attribute.csv`: attribute types

- Start training: ```python run.py``` (all training parameters are in the `conf` folder and can be modified there; when using an LM, `lm_file` can be changed to a locally downloaded model)

- Logs for each training run are saved in the `logs` folder, and model results are saved in the `checkpoints` folder.

- Predict: ```python predict.py```


## Models
1. CNN

2. RNN

3. Capsule

4. GCN

5. Transformer

6. Pre-trained model (BERT)
@ -1,17 +0,0 @@
# ??? is a mandatory value.
# you should be able to set it without open_dict
# but if you try to read it before it's set an error will get thrown.

# populated at runtime
cwd: ???

defaults:
  - hydra/output: custom
  - preprocess
  - train
  - embedding
  - predict
  - model: cnn  # [cnn, rnn, transformer, capsule, gcn, lm]
@ -1,10 +0,0 @@
# populated at runtime
vocab_size: ???
word_dim: 60
pos_size: ???  # 2 * pos_limit + 2
pos_dim: 10  # ignored when dim_strategy is sum: it is forced to equal word_dim

dim_strategy: sum  # [cat, sum]

# number of attribute types
num_attributes: 7
@ -1,11 +0,0 @@
hydra:

  run:
    # Output directory for normal runs
    dir: logs/${now:%Y-%m-%d_%H-%M-%S}

  sweep:
    # Output directory for sweep runs
    dir: logs/${now:%Y-%m-%d_%H-%M-%S}
    # Output sub directory for sweep runs.
    subdir: ${hydra.job.num}_${hydra.job.id}
@ -1,20 +0,0 @@
model_name: capsule

share_weights: True
num_iterations: 5  # number of routing iterations
dropout: 0.3

input_dim_capsule: ???  # derived from the convolution layer above, usually its output hidden_size
dim_capsule: 50  # dimension of the output capsules
num_capsule: ???  # number of output capsules, equal to the number of classes, == num_attributes


# the primary capsule can be built from
# embedding / cnn / rnn
# for now we use cnn
in_channels: ???  # taken from the embedding output, no need to set
out_channels: 100  # == input_dim_capsule
kernel_sizes: [9]  # must be odd, and relatively large
activation: 'lrelu'  # [relu, lrelu, prelu, selu, celu, gelu, sigmoid, tanh]
keep_length: False  # no padding needed; it adds too much useless information
pooling_strategy: cls  # irrelevant here, never actually used
@ -1,13 +0,0 @@
model_name: cnn

in_channels: ???  # taken from the embedding output, no need to set
out_channels: 100
kernel_sizes: [3, 5, 7]  # must be odd, so that the CNN output keeps the sentence length
activation: 'gelu'  # [relu, lrelu, prelu, selu, celu, gelu, sigmoid, tanh]
pooling_strategy: 'max'  # [max, avg, cls]
keep_length: True
dropout: 0.3

# pcnn
use_pcnn: False
intermediate: 80
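
The odd-kernel constraint above comes from padding arithmetic: a 1-D convolution keeps the sequence length only when it is padded by `(kernel_size - 1) // 2` on each side, which requires an odd kernel. A quick illustrative check with plain PyTorch (not DeepKE code):

```python
import torch
import torch.nn as nn

# With odd kernel k and padding (k - 1) // 2, Conv1d preserves sequence length.
seq = torch.randn(1, 60, 40)  # (batch, embedding dim, sentence length)
for k in (3, 5, 7):
    conv = nn.Conv1d(60, 100, kernel_size=k, padding=(k - 1) // 2)
    assert conv(seq).shape[-1] == 40  # length unchanged
```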
@ -1,7 +0,0 @@
model_name: gcn

num_layers: 3

input_size: ???  # taken from the embedding output, no need to set
hidden_size: 100
dropout: 0.3
@ -1,20 +0,0 @@
model_name: lm

# where the pretrained language model is stored when one is used
# lm_name = 'bert-base-chinese'  # download usage
#lm_file: 'pretrained'
lm_file: 'bert-base-chinese'

# number of transformer layers; the base BERT has 12
# with little data, fewer layers converge faster and often work better
num_hidden_layers: 1


# parameters of the BiLSTM that follows
type_rnn: 'LSTM'  # [RNN, GRU, LSTM]
input_size: 768  # determined by BERT
hidden_size: 100  # must be even
num_layers: 1
dropout: 0.3
bidirectional: True
last_layer_hn: True
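
Truncating a pretrained BERT to `num_hidden_layers: 1`, as this config suggests, can be approximated with Hugging Face `transformers` by overriding the config at load time; this is a hedged sketch of the idea, not necessarily how DeepKE wires it internally:

```python
from transformers import AutoConfig, AutoModel

# Load bert-base-chinese but keep only the first encoder layer;
# the weights of the remaining layers are simply not loaded.
config = AutoConfig.from_pretrained('bert-base-chinese', num_hidden_layers=1)
model = AutoModel.from_pretrained('bert-base-chinese', config=config)
print(model.config.num_hidden_layers)  # 1
```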
@ -1,10 +0,0 @@
model_name: rnn

type_rnn: 'LSTM'  # [RNN, GRU, LSTM]

input_size: ???  # taken from the embedding output, no need to set
hidden_size: 150  # must be even
num_layers: 2
dropout: 0.3
bidirectional: True
last_layer_hn: True
@ -1,12 +0,0 @@
model_name: transformer

hidden_size: ???  # taken from the embedding output, no need to set
num_heads: 4  # must divide hidden_size evenly
num_hidden_layers: 3
intermediate_size: 256
dropout: 0.1
layer_norm_eps: 1e-12
hidden_act: gelu_new  # [relu, gelu, swish, gelu_new]

output_attentions: True
output_hidden_states: True
@ -1,2 +0,0 @@
# path of the saved model to load
fp: 'xxx/checkpoints/2019-12-03_17-35-30/cnn_epoch21.pth'
@ -1,20 +0,0 @@
# whether the data needs preprocessing
# when the preprocessing parameters have not changed, there is no need to preprocess again
preprocess: True

# where the raw data is stored
data_path: 'data/origin'

# where the preprocessed files are stored
out_path: 'data/out'

# whether word segmentation is needed
chinese_split: True

# minimum word frequency when building the vocab
min_freq: 3

# sentence length limit: the position of each word relative to the entity is clipped to this range
# e.g. with [-30, 30], adding 31 during embedding maps it to [1, 61],
# giving 62 position tokens in total, with 0 reserved for padding
pos_limit: 30
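
The position-limit comment above describes the usual relative-position encoding trick; the mapping it implies can be sketched as follows (illustrative, not DeepKE's exact code):

```python
# Clip a token's position relative to the entity to [-pos_limit, pos_limit],
# then shift so indices land in [1, 2 * pos_limit + 1], leaving 0 for padding.
def pos_index(relative_pos, pos_limit=30):
    clipped = max(-pos_limit, min(pos_limit, relative_pos))
    return clipped + pos_limit + 1

assert pos_index(-30) == 1 and pos_index(0) == 31 and pos_index(30) == 61
# hence pos_size = 2 * pos_limit + 2 = 62 position embeddings, with index 0 = pad
```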
@ -1,21 +0,0 @@
seed: 1

use_gpu: True
gpu_id: 0

epoch: 50
batch_size: 32
learning_rate: 3e-4
lr_factor: 0.7  # decay factor of the learning rate
lr_patience: 3  # epochs to wait before decaying the learning rate
weight_decay: 1e-3  # L2 regularization

early_stopping_patience: 6

train_log: True
log_interval: 10
show_plot: True
only_comparison_plot: False
plot_utils: matplot  # [matplot, tensorboard]

predict_plot: True
@ -1,152 +0,0 @@
import os
import sys
import torch
import logging
import hydra
from hydra import utils
from deepke.attribution_extraction.standard.tools import Serializer
from deepke.attribution_extraction.standard.tools import _serialize_sentence, _convert_tokens_into_index, _add_pos_seq, _handle_attribute_data, _lm_serialize
import matplotlib.pyplot as plt
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))
from deepke.attribution_extraction.standard.utils import load_pkl, load_csv
import deepke.attribution_extraction.standard.models as models


logger = logging.getLogger(__name__)


def _preprocess_data(data, cfg):
    attribute_data = load_csv(os.path.join(cfg.cwd, cfg.data_path, 'attribute.csv'), verbose=False)
    atts = _handle_attribute_data(attribute_data)
    if cfg.model_name != 'lm':
        vocab = load_pkl(os.path.join(cfg.cwd, cfg.out_path, 'vocab.pkl'), verbose=False)
        cfg.vocab_size = vocab.count
        serializer = Serializer(do_chinese_split=cfg.chinese_split)
        serial = serializer.serialize

        _serialize_sentence(data, serial)
        _convert_tokens_into_index(data, vocab)
        _add_pos_seq(data, cfg)
        logger.info('start sentence preprocess...')
        formats = '\nsentence: {}\nchinese_split: {}\n' \
                  'tokens: {}\ntoken2idx: {}\nlength: {}\nentity_index: {}\nattribute_value_index: {}'
        logger.info(
            formats.format(data[0]['sentence'], cfg.chinese_split,
                           data[0]['tokens'], data[0]['token2idx'], data[0]['seq_len'],
                           data[0]['entity_index'], data[0]['attribute_value_index']))
    else:
        _lm_serialize(data, cfg)
    return data, atts


def _get_predict_instance(cfg):
    flag = input('是否使用范例[y/n],退出请输入: exit .... ')
    flag = flag.strip().lower()
    if flag == 'y' or flag == 'yes':
        sentence = '张冬梅,女,汉族,1968年2月生,河南淇县人,1988年7月加入中国共产党,1989年9月参加工作,中央党校经济管理专业毕业,中央党校研究生学历'
        entity = '张冬梅'
        attribute_value = '汉族'
    elif flag == 'n' or flag == 'no':
        sentence = input('请输入句子:')
        entity = input('请输入句中需要预测的实体:')
        attribute_value = input('请输入句中需要预测的属性值:')
    elif flag == 'exit':
        sys.exit(0)
    else:
        print('please input yes or no, or exit!')
        return _get_predict_instance(cfg)

    instance = dict()
    instance['sentence'] = sentence.strip()
    instance['entity'] = entity.strip()
    instance['attribute_value'] = attribute_value.strip()
    instance['entity_offset'] = sentence.find(entity)
    instance['attribute_value_offset'] = sentence.find(attribute_value)

    return instance


@hydra.main(config_path='conf/config.yaml')
def main(cfg):
    cwd = utils.get_original_cwd()
    # cwd = cwd[0:-5]
    cfg.cwd = cwd
    cfg.pos_size = 2 * cfg.pos_limit + 2
    print(cfg.pretty())

    # get predict instance
    instance = _get_predict_instance(cfg)
    data = [instance]

    # preprocess data
    data, rels = _preprocess_data(data, cfg)

    # model
    __Model__ = {
        'cnn': models.PCNN,
        'rnn': models.BiLSTM,
        'transformer': models.Transformer,
        'gcn': models.GCN,
        'capsule': models.Capsule,
        'lm': models.LM,
    }

    # prediction is best done on the CPU
    cfg.use_gpu = False
    if cfg.use_gpu and torch.cuda.is_available():
        device = torch.device('cuda', cfg.gpu_id)
    else:
        device = torch.device('cpu')
    logger.info(f'device: {device}')

    model = __Model__[cfg.model_name](cfg)
    logger.info(f'model name: {cfg.model_name}')
    logger.info(f'\n {model}')
    model.load(cfg.fp, device=device)
    model.to(device)
    model.eval()

    x = dict()
    x['word'], x['lens'] = torch.tensor([data[0]['token2idx']]), torch.tensor([data[0]['seq_len']])

    if cfg.model_name != 'lm':
        x['entity_pos'], x['attribute_value_pos'] = torch.tensor([data[0]['entity_pos']]), torch.tensor([data[0]['attribute_value_pos']])
        if cfg.model_name == 'cnn':
            if cfg.use_pcnn:
                x['pcnn_mask'] = torch.tensor([data[0]['entities_pos']])
        if cfg.model_name == 'gcn':
            # no suitable parsing-tree tool was found, so the adjacency matrix is randomly initialized for now
            adj = torch.empty(1, data[0]['seq_len'], data[0]['seq_len']).random_(2)
            x['adj'] = adj

    for key in x.keys():
        x[key] = x[key].to(device)

    with torch.no_grad():
        y_pred = model(x)
        y_pred = torch.softmax(y_pred, dim=-1)[0]
        prob = y_pred.max().item()
        prob_att = list(rels.keys())[y_pred.argmax().item()]
        logger.info(f"\"{data[0]['entity']}\" 和 \"{data[0]['attribute_value']}\" 在句中属性为:\"{prob_att}\",置信度为{prob:.2f}。")

    if cfg.predict_plot:
        plt.rcParams["font.family"] = 'Arial Unicode MS'
        x = list(rels.keys())
        height = list(y_pred.cpu().numpy())
        plt.bar(x, height)
        for x, y in zip(x, height):
            plt.text(x, y, '%.2f' % y, ha="center", va="bottom")
        plt.xlabel('关系')
        plt.ylabel('置信度')
        plt.xticks(rotation=315)
        plt.show()


if __name__ == '__main__':
    main()
@ -1,8 +0,0 @@
torch == 1.5
hydra-core == 1.0.6
tensorboard == 2.4.1
matplotlib == 3.4.1
scikit-learn == 0.24.1
transformers == 4.5.0
jieba == 0.42.1
deepke
@ -1,167 +0,0 @@
import os
import hydra
import torch
import logging
import torch.nn as nn
from torch import optim
from hydra import utils
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
# self
import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))
import deepke.attribution_extraction.standard.models as models
from deepke.attribution_extraction.standard.tools import preprocess, CustomDataset, collate_fn, train, validate
from deepke.attribution_extraction.standard.utils import manual_seed, load_pkl

import wandb


logger = logging.getLogger(__name__)


@hydra.main(config_path="conf/config.yaml")
def main(cfg):
    cwd = utils.get_original_cwd()
    # cwd = cwd[0:-5]
    cfg.cwd = cwd
    cfg.pos_size = 2 * cfg.pos_limit + 2
    logger.info(f'\n{cfg.pretty()}')

    wandb.init(project="DeepKE_AE_Standard", name=cfg.model_name)
    wandb.watch_called = False

    __Model__ = {
        'cnn': models.PCNN,
        'rnn': models.BiLSTM,
        'transformer': models.Transformer,
        'gcn': models.GCN,
        'capsule': models.Capsule,
        'lm': models.LM,
    }

    # device
    if cfg.use_gpu and torch.cuda.is_available():
        device = torch.device('cuda', cfg.gpu_id)
    else:
        device = torch.device('cpu')
    logger.info(f'device: {device}')

    # if the preprocessing logic is unchanged, this step is best commented out
    # so the data is not preprocessed again on every run
    if cfg.preprocess:
        preprocess(cfg)

    train_data_path = os.path.join(cfg.cwd, cfg.out_path, 'train.pkl')
    valid_data_path = os.path.join(cfg.cwd, cfg.out_path, 'valid.pkl')
    test_data_path = os.path.join(cfg.cwd, cfg.out_path, 'test.pkl')
    vocab_path = os.path.join(cfg.cwd, cfg.out_path, 'vocab.pkl')

    if cfg.model_name == 'lm':
        vocab_size = None
    else:
        vocab = load_pkl(vocab_path)
        vocab_size = vocab.count
    cfg.vocab_size = vocab_size

    train_dataset = CustomDataset(train_data_path)
    valid_dataset = CustomDataset(valid_data_path)
    test_dataset = CustomDataset(test_data_path)

    train_dataloader = DataLoader(train_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))
    valid_dataloader = DataLoader(valid_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))
    test_dataloader = DataLoader(test_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))

    model = __Model__[cfg.model_name](cfg)
    model.to(device)

    wandb.watch(model, log="all")
    logger.info(f'\n {model}')

    optimizer = optim.Adam(model.parameters(), lr=cfg.learning_rate, weight_decay=cfg.weight_decay)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=cfg.lr_factor, patience=cfg.lr_patience)
    criterion = nn.CrossEntropyLoss()

    best_f1, best_epoch = -1, 0
    es_loss, es_f1, es_epoch, es_patience, best_es_epoch, best_es_f1, es_path, best_es_path = 1e8, -1, 0, 0, 0, -1, '', ''
    train_losses, valid_losses = [], []

    if cfg.show_plot and cfg.plot_utils == 'tensorboard':
        writer = SummaryWriter('tensorboard')
    else:
        writer = None

    logger.info('=' * 10 + ' Start training ' + '=' * 10)

    for epoch in range(1, cfg.epoch + 1):
        manual_seed(cfg.seed + epoch)
        train_loss = train(epoch, model, train_dataloader, optimizer, criterion, device, writer, cfg)
        valid_f1, valid_loss = validate(epoch, model, valid_dataloader, criterion, device, cfg)
        scheduler.step(valid_loss)
        model_path = model.save(epoch, cfg)
        # logger.info(model_path)

        train_losses.append(train_loss)
        valid_losses.append(valid_loss)

        wandb.log({
            "train_loss": train_loss,
            "valid_loss": valid_loss
        })

        if best_f1 < valid_f1:
            best_f1 = valid_f1
            best_epoch = epoch
        # use the valid loss as the early-stopping criterion
        if es_loss > valid_loss:
            es_loss = valid_loss
            es_f1 = valid_f1
            es_epoch = epoch
            es_patience = 0
            es_path = model_path
        else:
            es_patience += 1
            if es_patience >= cfg.early_stopping_patience:
                best_es_epoch = es_epoch
                best_es_f1 = es_f1
                best_es_path = es_path

    if cfg.show_plot:
        if cfg.plot_utils == 'matplot':
            plt.plot(train_losses, 'x-')
            plt.plot(valid_losses, '+-')
            plt.legend(['train', 'valid'])
            plt.title('train/valid comparison loss')
            plt.show()

        if cfg.plot_utils == 'tensorboard':
            for i in range(len(train_losses)):
                writer.add_scalars('train/valid_comparison_loss', {
                    'train': train_losses[i],
                    'valid': valid_losses[i]
                }, i)
            writer.close()

    logger.info(f'best(valid loss quota) early stopping epoch: {best_es_epoch}, '
                f'this epoch macro f1: {best_es_f1:0.4f}')
    logger.info(f'this model save path: {best_es_path}')
    logger.info(f'total {cfg.epoch} epochs, best(valid macro f1) epoch: {best_epoch}, '
                f'this epoch macro f1: {best_f1:.4f}')

    logger.info('=====end of training====')
    logger.info('')
    logger.info('=====start test performance====')
    _, test_loss = validate(-1, model, test_dataloader, criterion, device, cfg)

    wandb.log({
        "test_loss": test_loss,
    })

    logger.info('=====ending====')


if __name__ == '__main__':
    main()
    # python predict.py --help  # show help for the parameters
    # python predict.py -c
    # python predict.py chinese_split=0,1 replace_entity_with_type=0,1 -m
@ -1,112 +0,0 @@
# Easy Start

<p align="left">
    <b> English | <a href="https://github.com/zjunlp/DeepKE/blob/main/example/ner/few-shot/README_CN.md">简体中文</a> </b>
</p>

## Requirements

> python == 3.8

- torch == 1.5
- transformers == 3.4.0
- deepke

## Download Code

```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ner/few-shot
```

## Install with Pip

- Create and enter a python virtual environment.
- Install dependencies: `pip install -r requirements.txt`.

## Train and Predict

- Dataset

  - Download the dataset to this directory.

    ```bash
    wget 120.27.214.45/Data/ner/few-shot/data.tar.gz
    tar -xzvf data.tar.gz
    ```

  - The datasets are stored in `data`, including CoNLL-2003, MIT-movie, MIT-restaurant and ATIS.

    - **CoNLL-2003**

      - `train.txt`: Training set
      - `valid.txt`: Validation set
      - `test.txt`: Test set
      - `indomain-train.txt`: In-domain training set

    - **MIT-movie, MIT-restaurant and ATIS**
      - `k-shot-train.txt`: k = [10, 20, 50, 100, 200, 500], training set
      - `test.txt`: Test set

- Training

  - Parameters, model paths and configuration for training are in the `conf` folder and users can modify them before training.

  - Training on CoNLL-2003

    ```bash
    python run.py
    ```

  - Few-shot Training

    If a trained model needs to be loaded, modify `load_path` in `few_shot.yaml`

    ```bash
    python run.py +train=few_shot
    ```

  - Logs for training are in the `log` folder. The path of the trained model can be customized.

- Prediction

  - Add `- predict` in `config.yaml`

  - Modify `load_path` to the path of the trained model and `write_path` to the path of the predicted results in `predict.yaml`

  - ```bash
    python predict.py
    ```

## Model

[LightNER](https://arxiv.org/abs/2109.00720)

## Cite

If you use or extend our work, please cite the following paper:

```bibtex
@article{DBLP:journals/corr/abs-2109-00720,
  author     = {Xiang Chen and
                Ningyu Zhang and
                Lei Li and
                Xin Xie and
                Shumin Deng and
                Chuanqi Tan and
                Fei Huang and
                Luo Si and
                Huajun Chen},
  title      = {LightNER: {A} Lightweight Generative Framework with Prompt-guided
                Attention for Low-resource {NER}},
  journal    = {CoRR},
  volume     = {abs/2109.00720},
  year       = {2021},
  url        = {https://arxiv.org/abs/2109.00720},
  eprinttype = {arXiv},
  eprint     = {2109.00720},
  timestamp  = {Mon, 20 Sep 2021 16:29:41 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2109-00720.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
@ -1,91 +0,0 @@
## Easy Start

<p align="left">
    <b> <a href="https://github.com/zjunlp/DeepKE/blob/main/example/ner/few-shot/README.md">English</a> | 简体中文 </b>
</p>

### Requirements

> python == 3.8

- torch == 1.5
- transformers == 3.4.0
- deepke

### Download Code

```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ner/few-shot
```

### Install with Pip

First create a python virtual environment, then enter it.

- Install dependencies: `pip install -r requirements.txt`

### Train and Predict with the Data

- Store the data: download it to this directory first with `wget 120.27.214.45/Data/ner/few_shot/data.tar.gz`

  The training data is stored in the `data` folder, covering the CoNLL-2003, MIT-movie, MIT-restaurant and ATIS datasets.

  - conll2003 contains the following files:

    - `train.txt`: training set
    - `dev.txt`: validation set
    - `test.txt`: test set
    - `indomain-train.txt`: in-domain training set

  - MIT-movie, MIT-restaurant and ATIS contain the following files:

    - `k-shot-train.txt`: k = [10, 20, 50, 100, 200, 500], training set
    - `test.txt`: test set

- Start training: model loading and saving locations as well as the configuration can be changed in the `conf` folder

  - Train on conll2003: `python run.py` (all training parameters live in the `conf` folder; edit them there)

  - Few-shot training: `python run.py +train=few_shot` (to load a model, set `load_path` in `few_shot.yaml`)

- The logs of each run are saved in the `logs` folder; the model output directory can be customized.
- Prediction: add `- predict` in `config.yaml`, set `load_path` to the model path and `write_path` to the path for the predicted results in `predict.yaml`, then run `python predict.py`

### Model

[LightNER](https://arxiv.org/abs/2109.00720)

## Cite

If you use or extend our work, please cite the following paper:

```bibtex
@article{DBLP:journals/corr/abs-2109-00720,
  author     = {Xiang Chen and
                Ningyu Zhang and
                Lei Li and
                Xin Xie and
                Shumin Deng and
                Chuanqi Tan and
                Fei Huang and
                Luo Si and
                Huajun Chen},
  title      = {LightNER: {A} Lightweight Generative Framework with Prompt-guided
                Attention for Low-resource {NER}},
  journal    = {CoRR},
  volume     = {abs/2109.00720},
  year       = {2021},
  url        = {https://arxiv.org/abs/2109.00720},
  eprinttype = {arXiv},
  eprint     = {2109.00720},
  timestamp  = {Mon, 20 Sep 2021 16:29:41 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2109-00720.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
@ -1,5 +0,0 @@
cwd: ???

defaults:
  - train/conll
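A quick note on the `???` above: it is OmegaConf's mandatory-missing marker, and reading the key before it is assigned raises an error; run.py fills it in via `utils.get_original_cwd()`. A minimal sketch of that behavior:

```python
from omegaconf import OmegaConf

cfg = OmegaConf.create({"cwd": "???"})
print(OmegaConf.is_missing(cfg, "cwd"))  # True: reading cfg.cwd now would raise
cfg.cwd = "/path/to/DeepKE/example/ner/few-shot"  # run.py: cfg.cwd = utils.get_original_cwd()
print(cfg.cwd)
```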
@ -1,28 +0,0 @@
cwd: ???

seed: 1

bart_name: "facebook/bart-large"
dataset_name: conll2003
device: cuda

num_epochs: 30
batch_size: 16
learning_rate: 2e-5
warmup_ratio: 0.01
eval_begin_epoch: 16
src_seq_ratio: 0.6
tgt_max_len: 10
num_beams: 1
length_penalty: 1

use_prompt: True
prompt_len: 10
prompt_dim: 800

freeze_plm: True
learn_weights: True
notes: ''
save_path: null  # model save path
load_path: load_path  # model load path; must not be empty
write_path: "data/conll2003/predict.txt"
@ -1,25 +0,0 @@
seed: 1

bart_name: "facebook/bart-large"
dataset_name: conll2003
device: cuda

num_epochs: 30
batch_size: 16
learning_rate: 2e-5
warmup_ratio: 0.01
eval_begin_epoch: 16
src_seq_ratio: 0.6
tgt_max_len: 10
num_beams: 1
length_penalty: 1

use_prompt: True
prompt_len: 10
prompt_dim: 800

freeze_plm: True
learn_weights: True
save_path: save path  # model save path
load_path: null
notes: ''
@ -1,25 +0,0 @@
seed: 1

bart_name: "facebook/bart-large"
dataset_name: mit-movie
device: cuda

num_epochs: 30
batch_size: 3
learning_rate: 5e-5
warmup_ratio: 0.01
eval_begin_epoch: 16
src_seq_ratio: 0.8
tgt_max_len: 10
num_beams: 1
length_penalty: 1

use_prompt: True
prompt_len: 10
prompt_dim: 800

freeze_plm: True
learn_weights: True
save_path: null  # model save path
load_path: null  # model load path
notes: ''
@ -1,100 +0,0 @@
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
import logging
import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))

import hydra
from hydra import utils
from torch.utils.data import DataLoader
from deepke.name_entity_re.few_shot.models.model import PromptBartModel, PromptGeneratorModel
from deepke.name_entity_re.few_shot.module.datasets import ConllNERProcessor, ConllNERDataset
from deepke.name_entity_re.few_shot.module.train import Trainer
from deepke.name_entity_re.few_shot.utils.util import set_seed
from deepke.name_entity_re.few_shot.module.mapping_type import mit_movie_mapping, mit_restaurant_mapping, atis_mapping

import warnings
warnings.filterwarnings("ignore", category=UserWarning)
from tensorboardX import SummaryWriter
writer = SummaryWriter(log_dir='logs')

logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)


DATASET_CLASS = {
    'conll2003': ConllNERDataset,
    'mit-movie': ConllNERDataset,
    'mit-restaurant': ConllNERDataset,
    'atis': ConllNERDataset
}

DATA_PROCESS = {
    'conll2003': ConllNERProcessor,
    'mit-movie': ConllNERProcessor,
    'mit-restaurant': ConllNERProcessor,
    'atis': ConllNERProcessor
}

DATA_PATH = {
    'conll2003': {'train': 'data/conll2003/train.txt',
                  'dev': 'data/conll2003/dev.txt',
                  'test': 'data/conll2003/test.txt'},
    'mit-movie': {'train': 'data/mit-movie/20-shot-train.txt',
                  'dev': 'data/mit-movie/test.txt'},
    'mit-restaurant': {'train': 'data/mit-restaurant/10-shot-train.txt',
                       'dev': 'data/mit-restaurant/test.txt'},
    'atis': {'train': 'data/atis/20-shot-train.txt',
             'dev': 'data/atis/test.txt'}
}

MAPPING = {
    'conll2003': {'loc': '<<location>>',
                  'per': '<<person>>',
                  'org': '<<organization>>',
                  'misc': '<<others>>'},
    'mit-movie': mit_movie_mapping,
    'mit-restaurant': mit_restaurant_mapping,
    'atis': atis_mapping
}


@hydra.main(config_path="conf/config.yaml")
def main(cfg):
    cwd = utils.get_original_cwd()
    cfg.cwd = cwd
    print(cfg)

    data_path = DATA_PATH[cfg.dataset_name]
    for mode, path in data_path.items():
        data_path[mode] = os.path.join(cfg.cwd, path)
    dataset_class, data_process = DATASET_CLASS[cfg.dataset_name], DATA_PROCESS[cfg.dataset_name]
    mapping = MAPPING[cfg.dataset_name]

    set_seed(cfg.seed)  # set seed, default is 1
    if cfg.save_path is not None:  # make save_path dir
        cfg.save_path = os.path.join(cfg.save_path, cfg.dataset_name + "_" + str(cfg.batch_size) + "_" + str(cfg.learning_rate) + cfg.notes)
        if not os.path.exists(cfg.save_path):
            os.makedirs(cfg.save_path, exist_ok=True)

    process = data_process(data_path=data_path, mapping=mapping, bart_name=cfg.bart_name, learn_weights=cfg.learn_weights)
    test_dataset = dataset_class(data_processor=process, mode='test')
    test_dataloader = DataLoader(test_dataset, collate_fn=test_dataset.collate_fn, batch_size=cfg.batch_size, num_workers=4)

    label_ids = list(process.mapping2id.values())
    prompt_model = PromptBartModel(tokenizer=process.tokenizer, label_ids=label_ids, args=cfg)
    model = PromptGeneratorModel(prompt_model=prompt_model, bos_token_id=0,
                                 eos_token_id=1,
                                 max_length=cfg.tgt_max_len, max_len_a=cfg.src_seq_ratio, num_beams=cfg.num_beams, do_sample=False,
                                 repetition_penalty=1, length_penalty=cfg.length_penalty, pad_token_id=1,
                                 restricter=None)
    trainer = Trainer(train_data=None, dev_data=None, test_data=test_dataloader, model=model, process=process, args=cfg, logger=logger,
                      loss=None, metrics=None, writer=writer)
    trainer.predict()


if __name__ == "__main__":
    main()
@ -1,3 +0,0 @@
transformers==3.4.0
torch==1.7.0
tensorboardX==2.4
@ -1,111 +0,0 @@
import os

import hydra
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = '1'
import logging
import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))

from hydra import utils
from torch.utils.data import DataLoader
from deepke.name_entity_re.few_shot.models.model import PromptBartModel, PromptGeneratorModel
from deepke.name_entity_re.few_shot.module.datasets import ConllNERProcessor, ConllNERDataset
from deepke.name_entity_re.few_shot.module.train import Trainer
from deepke.name_entity_re.few_shot.module.metrics import Seq2SeqSpanMetric
from deepke.name_entity_re.few_shot.utils.util import get_loss, set_seed
from deepke.name_entity_re.few_shot.module.mapping_type import mit_movie_mapping, mit_restaurant_mapping, atis_mapping

import warnings
warnings.filterwarnings("ignore", category=UserWarning)

import wandb

writer = wandb.init(project="DeepKE_NER_Few")

logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)


DATASET_CLASS = {
    'conll2003': ConllNERDataset,
    'mit-movie': ConllNERDataset,
    'mit-restaurant': ConllNERDataset,
    'atis': ConllNERDataset
}

DATA_PROCESS = {
    'conll2003': ConllNERProcessor,
    'mit-movie': ConllNERProcessor,
    'mit-restaurant': ConllNERProcessor,
    'atis': ConllNERProcessor
}

DATA_PATH = {
    'conll2003': {'train': 'data/conll2003/train.txt',
                  'dev': 'data/conll2003/dev.txt',
                  'test': 'data/conll2003/test.txt'},
    'mit-movie': {'train': 'data/mit-movie/20-shot-train.txt',
                  'dev': 'data/mit-movie/test.txt'},
    'mit-restaurant': {'train': 'data/mit-restaurant/10-shot-train.txt',
                       'dev': 'data/mit-restaurant/test.txt'},
    'atis': {'train': 'data/atis/20-shot-train.txt',
             'dev': 'data/atis/test.txt'}
}

MAPPING = {
    'conll2003': {'loc': '<<location>>',
                  'per': '<<person>>',
                  'org': '<<organization>>',
                  'misc': '<<others>>'},
    'mit-movie': mit_movie_mapping,
    'mit-restaurant': mit_restaurant_mapping,
    'atis': atis_mapping
}


@hydra.main(config_path="conf/config.yaml")
def main(cfg):
    cwd = utils.get_original_cwd()
    cfg.cwd = cwd
    print(cfg)

    data_path = DATA_PATH[cfg.dataset_name]
    for mode, path in data_path.items():
        data_path[mode] = os.path.join(cfg.cwd, path)
    dataset_class, data_process = DATASET_CLASS[cfg.dataset_name], DATA_PROCESS[cfg.dataset_name]
    mapping = MAPPING[cfg.dataset_name]

    set_seed(cfg.seed)  # set seed, default is 1
    if cfg.save_path is not None:  # make save_path dir
        cfg.save_path = os.path.join(cfg.save_path, cfg.dataset_name + "_" + str(cfg.batch_size) + "_" + str(cfg.learning_rate) + cfg.notes)
        if not os.path.exists(cfg.save_path):
            os.makedirs(cfg.save_path, exist_ok=True)

    process = data_process(data_path=data_path, mapping=mapping, bart_name=cfg.bart_name, learn_weights=cfg.learn_weights)
    train_dataset = dataset_class(data_processor=process, mode='train')
    train_dataloader = DataLoader(train_dataset, collate_fn=train_dataset.collate_fn, batch_size=cfg.batch_size, num_workers=4)

    dev_dataset = dataset_class(data_processor=process, mode='dev')
    dev_dataloader = DataLoader(dev_dataset, collate_fn=dev_dataset.collate_fn, batch_size=cfg.batch_size, num_workers=4)

    label_ids = list(process.mapping2id.values())

    prompt_model = PromptBartModel(tokenizer=process.tokenizer, label_ids=label_ids, args=cfg)
    model = PromptGeneratorModel(prompt_model=prompt_model, bos_token_id=0,
                                 eos_token_id=1,
                                 max_length=cfg.tgt_max_len, max_len_a=cfg.src_seq_ratio, num_beams=cfg.num_beams, do_sample=False,
                                 repetition_penalty=1, length_penalty=cfg.length_penalty, pad_token_id=1,
                                 restricter=None)
    metrics = Seq2SeqSpanMetric(eos_token_id=1, num_labels=len(label_ids), target_type='word')
    loss = get_loss

    trainer = Trainer(train_data=train_dataloader, dev_data=dev_dataloader, test_data=None, model=model, process=process, args=cfg, logger=logger,
                      loss=loss, metrics=metrics, writer=writer)
    trainer.train()


if __name__ == "__main__":
    main()
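The `<<location>>`-style markers in `MAPPING` are label words added to the BART vocabulary; `process.mapping2id` then holds their token ids. The processor does this internally inside deepke; the stand-alone sketch below only illustrates the idea and is not the processor's actual code:

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
mapping = {'loc': '<<location>>', 'per': '<<person>>',
           'org': '<<organization>>', 'misc': '<<others>>'}
tokenizer.add_tokens(list(mapping.values()))  # extend the vocabulary
mapping2id = {k: tokenizer.convert_tokens_to_ids(v) for k, v in mapping.items()}
print(mapping2id)                             # ids of the new label tokens
```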
@ -1,65 +0,0 @@
# Easy Start

<p align="left">
    <b> English | <a href="https://github.com/zjunlp/DeepKE/blob/main/example/ner/standard/README_CN.md">简体中文</a> </b>
</p>

## Requirements

> python == 3.8

- pytorch-transformers == 1.2.0
- torch == 1.5.0
- hydra-core == 1.0.6
- seqeval == 1.2.2
- tqdm == 4.60.0
- matplotlib == 3.4.1
- deepke

## Download Code

```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ner/standard
```

## Install with Pip

- Create and enter the python virtual environment.
- Install dependencies: `pip install -r requirements.txt`.

## Train and Predict

- Dataset

  - Download the dataset to this directory.

    ```bash
    wget 120.27.214.45/Data/ner/standard/data.tar.gz
    tar -xzvf data.tar.gz
    ```

  - The dataset is stored in `data`:
    - `train.txt`: Training set
    - `valid.txt`: Validation set
    - `test.txt`: Test set

- Training

  - Parameters for training are in the `conf` folder and users can modify them before training.

  - Logs for training are in the `log` folder and the trained model is saved in the `checkpoints` folder.

    ```bash
    python run.py
    ```

- Prediction

  ```bash
  python predict.py
  ```

## Model

BERT
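For orientation, `train.txt`, `valid.txt` and `test.txt` follow the usual CoNLL-style layout: one token and one BIO tag per line, separated by a space, with a blank line between sentences. The tokens and tag set below are an illustration, not a dump of the shipped data:

```
西 B-LOC
安 I-LOC
很 O
美 O
```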
@ -1,57 +0,0 @@
## Easy Start

<p align="left">
    <b> <a href="https://github.com/zjunlp/DeepKE/blob/main/example/ner/standard/README.md">English</a> | 简体中文 </b>
</p>

### Requirements

> python == 3.8

- pytorch-transformers == 1.2.0
- torch == 1.5.0
- hydra-core == 1.0.6
- seqeval == 1.2.2
- tqdm == 4.60.0
- matplotlib == 3.4.1
- deepke

### Download Code

```
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ner/standard
```

### Install with Pip

First create a python virtual environment, then enter it.

- Install dependencies: `pip install -r requirements.txt`

### Train and Predict with the Data

- Store the data: download it to this directory first with `wget 120.27.214.45/Data/ner/standard/data.tar.gz`

  The data is stored in the `data` folder:

  - `train.txt`: training set
  - `valid.txt`: validation set
  - `test.txt`: test set

- Start training: `python run.py` (all training parameters live in the `conf` folder; edit them there)

- The logs of each run are saved in the `logs` folder and the model results in the `checkpoints` folder.

- Prediction: `python predict.py`

### Model

BERT
@ -1,11 +0,0 @@
# ??? is a mandatory value.
# You should be able to set it without open_dict,
# but if you try to read it before it's set, an error will be thrown.

# populated at runtime
cwd: ???

defaults:
  - hydra/output: custom
  - train
  - predict
@ -1,11 +0,0 @@
hydra:

  run:
    # Output directory for normal runs
    dir: logs/${now:%Y-%m-%d_%H-%M-%S}

  sweep:
    # Output directory for sweep runs
    dir: logs/${now:%Y-%m-%d_%H-%M-%S}
    # Output subdirectory for sweep runs.
    subdir: ${hydra.job.num}_${hydra.job.id}
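The `${now:...}` interpolation above is Hydra's timestamp resolver; it expands with `strftime`, so every run gets its own log directory. It is roughly equivalent to:

```python
from datetime import datetime

run_dir = "logs/" + datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
print(run_dir)  # e.g. logs/2021-11-04_09-30-12
```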
@ -1 +0,0 @@
# "The Terracotta Army of the First Qin Emperor is located in Xi'an, Shaanxi Province; in 1961 the State Council listed it in the first batch of Major National-Level Protected Cultural Heritage Sites; it is one of the eight wonders of the world."
text: "秦始皇兵马俑位于陕西省西安市,1961年被国务院公布为第一批全国重点文物保护单位,是世界八大奇迹之一。"
@ -1,25 +0,0 @@
adam_epsilon: 1e-8
bert_model: "bert-base-chinese"
data_dir: "data/"
do_eval: True
do_lower_case: True
do_train: True
eval_batch_size: 8
eval_on: "dev"
fp16: False
fp16_opt_level: "O1"  # apex AMP optimization level (letter O, not zero)
gpu_id: 1
gradient_accumulation_steps: 1
learning_rate: 5e-5
local_rank: -1
loss_scale: 0.0
max_grad_norm: 1.0
max_seq_length: 128
num_train_epochs: 3  # the number of training epochs
output_dir: "checkpoints"
seed: 42
task_name: "ner"
train_batch_size: 32
use_gpu: True  # use gpu or not
warmup_proportion: 0.1
weight_decay: 0.01
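One detail worth spelling out: run.py below divides `train_batch_size` by `gradient_accumulation_steps` before building the dataloader, so the configured value is the effective batch size and the micro-batch is derived from it. A worked example (the accumulation value here is illustrative, the config above uses 1):

```python
train_batch_size, grad_accum = 32, 4          # illustrative values
micro_batch = train_batch_size // grad_accum  # what each forward pass sees
effective = micro_batch * grad_accum          # gradient-equivalent batch size
print(micro_batch, effective)                 # 8 32
```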
@ -1,27 +0,0 @@
from deepke.name_entity_re.standard import *
import hydra
from hydra import utils


@hydra.main(config_path="conf", config_name='config')
def main(cfg):
    model = InferNer(utils.get_original_cwd() + '/' + "checkpoints/")
    text = cfg.text

    print("NER sentence:")
    print(text)
    print('NER result:')

    result = model.predict(text)
    for k, v in result.items():
        if v:
            print(v, end=': ')
            if k == 'PER':
                print('Person')
            elif k == 'LOC':
                print('Location')
            elif k == 'ORG':
                print('Organization')


if __name__ == "__main__":
    main()
@ -1,7 +0,0 @@
pytorch-transformers==1.2.0
torch==1.5.0
hydra-core==1.0.6
seqeval==0.0.5
tqdm==4.31.1
matplotlib==3.4.1
deepke
@ -1,235 +0,0 @@
from __future__ import absolute_import, division, print_function

import csv
import json
import logging
import os
import random
import sys
import numpy as np
import torch
import torch.nn.functional as F
from pytorch_transformers import (WEIGHTS_NAME, AdamW, BertConfig, BertForTokenClassification, BertTokenizer, WarmupLinearSchedule)
from torch import nn
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from seqeval.metrics import classification_report
import hydra
from hydra import utils
from deepke.name_entity_re.standard import *

import wandb


logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)


class TrainNer(BertForTokenClassification):

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None, valid_ids=None, attention_mask_label=None, device=None):
        sequence_output = self.bert(input_ids, token_type_ids, attention_mask, head_mask=None)[0]
        batch_size, max_len, feat_dim = sequence_output.shape
        # compact the sequence so only the first sub-token of each word (valid_ids == 1) is kept
        valid_output = torch.zeros(batch_size, max_len, feat_dim, dtype=torch.float32, device=device)
        for i in range(batch_size):
            jj = -1
            for j in range(max_len):
                if valid_ids[i][j].item() == 1:
                    jj += 1
                    valid_output[i][jj] = sequence_output[i][j]
        sequence_output = self.dropout(valid_output)
        logits = self.classifier(sequence_output)

        if labels is not None:
            loss_fct = nn.CrossEntropyLoss(ignore_index=0)
            if attention_mask_label is not None:
                active_loss = attention_mask_label.view(-1) == 1
                active_logits = logits.view(-1, self.num_labels)[active_loss]
                active_labels = labels.view(-1)[active_loss]
                loss = loss_fct(active_logits, active_labels)
            else:
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            return loss
        else:
            return logits


wandb.init(project="DeepKE_NER_Standard")
@hydra.main(config_path="conf", config_name='config')
def main(cfg):

    # Use gpu or not
    if cfg.use_gpu and torch.cuda.is_available():
        device = torch.device('cuda', cfg.gpu_id)
    else:
        device = torch.device('cpu')

    if cfg.gradient_accumulation_steps < 1:
        raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(cfg.gradient_accumulation_steps))

    cfg.train_batch_size = cfg.train_batch_size // cfg.gradient_accumulation_steps

    random.seed(cfg.seed)
    np.random.seed(cfg.seed)
    torch.manual_seed(cfg.seed)

    if not cfg.do_train and not cfg.do_eval:
        raise ValueError("At least one of `do_train` or `do_eval` must be True.")

    # Checkpoints
    if os.path.exists(utils.get_original_cwd() + '/' + cfg.output_dir) and os.listdir(utils.get_original_cwd() + '/' + cfg.output_dir) and cfg.do_train:
        raise ValueError("Output directory ({}) already exists and is not empty.".format(utils.get_original_cwd() + '/' + cfg.output_dir))
    if not os.path.exists(utils.get_original_cwd() + '/' + cfg.output_dir):
        os.makedirs(utils.get_original_cwd() + '/' + cfg.output_dir)

    # Preprocess the input dataset
    processor = NerProcessor()
    label_list = processor.get_labels()
    num_labels = len(label_list) + 1

    # Prepare the model
    tokenizer = BertTokenizer.from_pretrained(cfg.bert_model, do_lower_case=cfg.do_lower_case)

    train_examples = None
    num_train_optimization_steps = 0
    if cfg.do_train:
        train_examples = processor.get_train_examples(utils.get_original_cwd() + '/' + cfg.data_dir)
        num_train_optimization_steps = int(len(train_examples) / cfg.train_batch_size / cfg.gradient_accumulation_steps) * cfg.num_train_epochs

    config = BertConfig.from_pretrained(cfg.bert_model, num_labels=num_labels, finetuning_task=cfg.task_name)
    model = TrainNer.from_pretrained(cfg.bert_model, from_tf=False, config=config)
    model.to(device)

    param_optimizer = list(model.named_parameters())
    no_decay = ['bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': cfg.weight_decay},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
    warmup_steps = int(cfg.warmup_proportion * num_train_optimization_steps)
    optimizer = AdamW(optimizer_grouped_parameters, lr=cfg.learning_rate, eps=cfg.adam_epsilon)
    scheduler = WarmupLinearSchedule(optimizer, warmup_steps=warmup_steps, t_total=num_train_optimization_steps)
    global_step = 0
    nb_tr_steps = 0
    tr_loss = 0
    label_map = {i: label for i, label in enumerate(label_list, 1)}
    if cfg.do_train:
        train_features = convert_examples_to_features(train_examples, label_list, cfg.max_seq_length, tokenizer)
        all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
        all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
        all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
        all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
        all_valid_ids = torch.tensor([f.valid_ids for f in train_features], dtype=torch.long)
        all_lmask_ids = torch.tensor([f.label_mask for f in train_features], dtype=torch.long)
        train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, all_valid_ids, all_lmask_ids)
        train_sampler = RandomSampler(train_data)

        train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=cfg.train_batch_size)

        model.train()

        for _ in trange(int(cfg.num_train_epochs), desc="Epoch"):
            tr_loss = 0
            nb_tr_examples, nb_tr_steps = 0, 0
            for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
                batch = tuple(t.to(device) for t in batch)
                input_ids, input_mask, segment_ids, label_ids, valid_ids, l_mask = batch
                loss = model(input_ids, segment_ids, input_mask, label_ids, valid_ids, l_mask, device)
                if cfg.gradient_accumulation_steps > 1:
                    loss = loss / cfg.gradient_accumulation_steps

                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), cfg.max_grad_norm)

                tr_loss += loss.item()
                nb_tr_examples += input_ids.size(0)
                nb_tr_steps += 1
                if (step + 1) % cfg.gradient_accumulation_steps == 0:
                    optimizer.step()
                    scheduler.step()  # Update learning rate schedule
                    model.zero_grad()
                    global_step += 1
            wandb.log({
                "train_loss": tr_loss / nb_tr_steps
            })
        # Save a trained model and the associated configuration
        model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model itself
        model_to_save.save_pretrained(utils.get_original_cwd() + '/' + cfg.output_dir)
        tokenizer.save_pretrained(utils.get_original_cwd() + '/' + cfg.output_dir)
        label_map = {i: label for i, label in enumerate(label_list, 1)}
        model_config = {"bert_model": cfg.bert_model, "do_lower": cfg.do_lower_case, "max_seq_length": cfg.max_seq_length, "num_labels": len(label_list) + 1, "label_map": label_map}
        json.dump(model_config, open(os.path.join(utils.get_original_cwd() + '/' + cfg.output_dir, "model_config.json"), "w"))
    else:
        # Load a trained model and vocabulary that you have fine-tuned
        model = TrainNer.from_pretrained(utils.get_original_cwd() + '/' + cfg.output_dir)
        tokenizer = BertTokenizer.from_pretrained(utils.get_original_cwd() + '/' + cfg.output_dir, do_lower_case=cfg.do_lower_case)

    model.to(device)

    if cfg.do_eval:
        if cfg.eval_on == "dev":
            eval_examples = processor.get_dev_examples(utils.get_original_cwd() + '/' + cfg.data_dir)
        elif cfg.eval_on == "test":
            eval_examples = processor.get_test_examples(utils.get_original_cwd() + '/' + cfg.data_dir)
        else:
            raise ValueError("eval on dev or test set only")
        eval_features = convert_examples_to_features(eval_examples, label_list, cfg.max_seq_length, tokenizer)
        all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
        all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
        all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
        all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
        all_valid_ids = torch.tensor([f.valid_ids for f in eval_features], dtype=torch.long)
        all_lmask_ids = torch.tensor([f.label_mask for f in eval_features], dtype=torch.long)
        eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids, all_valid_ids, all_lmask_ids)
        # Run prediction for full data
        eval_sampler = SequentialSampler(eval_data)
        eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=cfg.eval_batch_size)
        model.eval()
        eval_loss, eval_accuracy = 0, 0
        nb_eval_steps, nb_eval_examples = 0, 0
        y_true = []
        y_pred = []
        label_map = {i: label for i, label in enumerate(label_list, 1)}
        for input_ids, input_mask, segment_ids, label_ids, valid_ids, l_mask in tqdm(eval_dataloader, desc="Evaluating"):
            input_ids = input_ids.to(device)
            input_mask = input_mask.to(device)
            segment_ids = segment_ids.to(device)
            valid_ids = valid_ids.to(device)
            label_ids = label_ids.to(device)
            l_mask = l_mask.to(device)

            with torch.no_grad():
                logits = model(input_ids, segment_ids, input_mask, valid_ids=valid_ids, attention_mask_label=l_mask, device=device)

            logits = torch.argmax(F.log_softmax(logits, dim=2), dim=2)
            logits = logits.detach().cpu().numpy()
            label_ids = label_ids.to('cpu').numpy()
            input_mask = input_mask.to('cpu').numpy()

            for i, label in enumerate(label_ids):
                temp_1 = []
                temp_2 = []
                for j, m in enumerate(label):
                    if j == 0:
                        continue
                    elif label_ids[i][j] == len(label_map):
                        y_true.append(temp_1)
                        y_pred.append(temp_2)
                        break
                    else:
                        temp_1.append(label_map[label_ids[i][j]])
                        temp_2.append(label_map[logits[i][j]])

        report = classification_report(y_true, y_pred, digits=4)
        logger.info("\n%s", report)
        output_eval_file = os.path.join(utils.get_original_cwd() + '/' + cfg.output_dir, "eval_results.txt")
        with open(output_eval_file, "w") as writer:
            logger.info("***** Eval results *****")
            logger.info("\n%s", report)
            writer.write(report)


if __name__ == '__main__':
    main()
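The least obvious part of `TrainNer.forward` above is the `valid_ids` loop: WordPiece splits words into sub-tokens, and the loop compacts the sequence so only each word's first sub-token feeds the classifier. A toy demonstration of that realignment with made-up tensors:

```python
import torch

seq = torch.arange(12, dtype=torch.float32).reshape(1, 4, 3)  # 4 sub-tokens, dim 3
valid = torch.tensor([[1, 0, 1, 1]])  # sub-token 1 continues the word started at 0
out = torch.zeros_like(seq)
jj = -1
for j in range(seq.shape[1]):
    if valid[0][j].item() == 1:
        jj += 1
        out[0][jj] = seq[0][j]
print(out[0])  # rows 0-2 hold sub-tokens 0, 2, 3; the last row stays zero padding
```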
@ -1,81 +0,0 @@
# Easy Start

<p align="left">
    <b> English | <a href="https://github.com/zjunlp/DeepKE/blob/main/example/re/document/README_CN.md">简体中文</a> </b>
</p>

## Requirements

> python == 3.8

- torch == 1.5.0
- transformers == 3.4.0
- opt-einsum == 3.3.0
- ujson
- deepke

## Download Code

```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/re/document
```

## Install with Pip

- Create and enter the python virtual environment.
- Install dependencies: `pip install -r requirements.txt`.

## Train and Predict

- Dataset

  - Download the dataset to this directory.

    ```bash
    wget 120.27.214.45/Data/re/document/data.tar.gz
    tar -xzvf data.tar.gz
    ```

  - The dataset [DocRED](https://github.com/thunlp/DocRED/tree/master/) is stored in `data`:

    - `dev.json`: Validation set
    - `rel_info.json`: Relation set
    - `rel2id.json`: Relation labels - ID
    - `test.json`: Test set
    - `train_annotated.json`: Training set annotated manually
    - `train_distant.json`: Training set generated by distant supervision

- Training

  - Parameters, model paths and configuration for training are in the `conf` folder and users can modify them before training.

  - Training on DocRED

    ```bash
    python run.py
    ```

  - The trained model is stored in the current directory by default.

  - Start to train from the last-trained model:<br>
    modify `train_from_saved_model` in `.yaml` to the path of the last-trained model.

  - Logs for training are stored in the current directory by default and the path can be configured by modifying `log_dir` in `.yaml`.

- Prediction

  ```bash
  python predict.py
  ```

  - After prediction, the generated `result.json` is stored in the current directory.

## Model

[DocuNet](https://arxiv.org/abs/2106.03618)
@ -1,65 +0,0 @@
## Easy Start

<p align="left">
    <b> <a href="https://github.com/zjunlp/DeepKE/blob/main/example/re/document/README.md">English</a> | 简体中文 </b>
</p>

### Requirements

> python == 3.8

- torch == 1.5.0
- transformers == 3.4.0
- opt-einsum == 3.3.0
- ujson
- deepke

### Download Code

```
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/re/document
```

### Install with Pip

First create a python virtual environment, then enter it.

- Install dependencies: `pip install -r requirements.txt`

### Train and Predict with the Data

- Store the data: download it to this directory first with `wget 120.27.214.45/Data/re/document/data.tar.gz`

  The training data is stored in the `data` folder. The model uses the [DocRED](https://github.com/thunlp/DocRED/tree/master/) dataset, a document-level relation extraction benchmark built from Wikipedia and Wikidata.

  - DocRED contains the following files:

    - `dev.json`: validation set
    - `rel_info.json`: relation set
    - `rel2id.json`: mapping from relation labels to IDs
    - `test.json`: test set
    - `train_annotated.json`: manually annotated training set
    - `train_distant.json`: training set produced by distant supervision

- Start training: model loading and saving locations as well as the configuration can be changed in the `.yaml` files under `conf`

  - Train on the DocRED dataset: `python run.py`

  - The trained model is saved in the current directory by default.

  - Resume training from a saved model: set `train_from_saved_model` in `.yaml` to the path of the last saved model.

  - The logs of each run are saved in the current directory by default; the path can be configured via `log_dir` in `.yaml`.

- Prediction: `python predict.py`

  - The generated `result.json` is saved in the current directory.

## Model

[DocuNet](https://arxiv.org/abs/2106.03618)
@ -1,3 +0,0 @@
defaults:
  - hydra/output: custom
  - train
@ -1,11 +0,0 @@
hydra:

  run:
    # Output directory for normal runs
    dir: logs/${now:%Y-%m-%d_%H-%M-%S}

  sweep:
    # Output directory for sweep runs
    dir: logs/${now:%Y-%m-%d_%H-%M-%S}
    # Output subdirectory for sweep runs.
    subdir: ${hydra.job.num}_${hydra.job.id}
@ -1,32 +0,0 @@
adam_epsilon: 1e-06
bert_lr: 3e-05
channel_type: 'context-based'
config_name: ''
data_dir: 'data'
dataset: 'docred'
dev_file: 'dev.json'
down_dim: 256
evaluation_steps: -1
gradient_accumulation_steps: 2
learning_rate: 0.0004
log_dir: './train_roberta.log'
max_grad_norm: 1.0
max_height: 42
max_seq_length: 1024
model_name_or_path: 'roberta-base'
num_class: 97
num_labels: 4
num_train_epochs: 30
save_path: './model_roberta.pt'
seed: 111
test_batch_size: 2
test_file: 'test.json'
tokenizer_name: ''
train_batch_size: 2
train_file: 'train_annotated.json'
train_from_saved_model: ''
transformer_type: 'roberta'
unet_in_dim: 3
unet_out_dim: 256
warmup_ratio: 0.06
load_path: './model_roberta.pt'
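How `warmup_ratio` above turns into scheduler steps in run.py below, as a worked example (the batch count is illustrative; the other numbers come from this config):

```python
num_batches = 500                                    # illustrative len(train_dataloader)
num_epochs, grad_accum, warmup_ratio = 30, 2, 0.06   # values from this config
total_steps = int(num_batches * num_epochs // grad_accum)
warmup_steps = int(total_steps * warmup_ratio)
print(total_steps, warmup_steps)                     # 7500 450
```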
@ -1,88 +0,0 @@
import os
import time
import hydra
from hydra.utils import get_original_cwd
import numpy as np
import torch

import ujson as json
from torch.utils.data import DataLoader
from transformers import AutoConfig, AutoModel, AutoTokenizer
from transformers.optimization import AdamW, get_linear_schedule_with_warmup

from deepke.relation_extraction.document import *


def report(args, model, features):
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    dataloader = DataLoader(features, batch_size=args.test_batch_size, shuffle=False, collate_fn=collate_fn, drop_last=False)
    preds = []
    for batch in dataloader:
        model.eval()

        inputs = {'input_ids': batch[0].to(device),
                  'attention_mask': batch[1].to(device),
                  'entity_pos': batch[3],
                  'hts': batch[4],
                  }

        with torch.no_grad():
            pred = model(**inputs)
            pred = pred.cpu().numpy()
            pred[np.isnan(pred)] = 0
            preds.append(pred)

    preds = np.concatenate(preds, axis=0).astype(np.float32)
    preds = to_official(args, preds, features)
    return preds


@hydra.main(config_path="conf/config.yaml")
def main(cfg):
    cwd = get_original_cwd()
    os.chdir(cwd)

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    config = AutoConfig.from_pretrained(
        cfg.config_name if cfg.config_name else cfg.model_name_or_path,
        num_labels=cfg.num_class,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        cfg.tokenizer_name if cfg.tokenizer_name else cfg.model_name_or_path,
    )

    Dataset = ReadDataset(cfg, cfg.dataset, tokenizer, cfg.max_seq_length)

    test_file = os.path.join(cfg.data_dir, cfg.test_file)

    test_features = Dataset.read(test_file)

    model = AutoModel.from_pretrained(
        cfg.model_name_or_path,
        from_tf=bool(".ckpt" in cfg.model_name_or_path),
        config=config,
    )

    config.cls_token_id = tokenizer.cls_token_id
    config.sep_token_id = tokenizer.sep_token_id
    config.transformer_type = cfg.transformer_type

    set_seed(cfg)
    model = DocREModel(config, cfg, model, num_labels=cfg.num_labels)

    model.load_state_dict(torch.load(cfg.load_path)['checkpoint'])
    model.to(device)
    T_features = test_features  # Testing on the test set
    # T_score, T_output = evaluate(cfg, model, T_features, tag="test")
    pred = report(cfg, model, T_features)
    with open("./result.json", "w") as fh:
        json.dump(pred, fh)


if __name__ == "__main__":
    main()
@ -1,5 +0,0 @@
torch==1.8.1
transformers==4.7.0
opt-einsum==3.3.0
hydra-core==1.0.6
ujson
@ -1,252 +0,0 @@
import os
import time
import hydra
from hydra.utils import get_original_cwd
import numpy as np
import torch

import ujson as json
from torch.utils.data import DataLoader
from transformers import AutoConfig, AutoModel, AutoTokenizer
from transformers.optimization import AdamW, get_linear_schedule_with_warmup

from deepke.relation_extraction.document import *

import wandb


def train(args, model, train_features, dev_features, test_features):
    def logging(s, print_=True, log_=True):
        if print_:
            print(s)
        if log_ and args.log_dir != '':
            with open(args.log_dir, 'a+') as f_log:
                f_log.write(s + '\n')

    def finetune(features, optimizer, num_epoch, num_steps, model):
        cur_model = model.module if hasattr(model, 'module') else model
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        if args.train_from_saved_model != '':
            best_score = torch.load(args.train_from_saved_model)["best_f1"]
            epoch_delta = torch.load(args.train_from_saved_model)["epoch"] + 1
        else:
            epoch_delta = 0
            best_score = -1
        train_dataloader = DataLoader(features, batch_size=args.train_batch_size, shuffle=True, collate_fn=collate_fn, drop_last=True)
        train_iterator = [epoch + epoch_delta for epoch in range(num_epoch)]
        total_steps = int(len(train_dataloader) * num_epoch // args.gradient_accumulation_steps)
        warmup_steps = int(total_steps * args.warmup_ratio)
        scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
        print("Total steps: {}".format(total_steps))
        print("Warmup steps: {}".format(warmup_steps))
        global_step = 0
        log_step = 100
        total_loss = 0

        for epoch in train_iterator:
            start_time = time.time()
            optimizer.zero_grad()

            for step, batch in enumerate(train_dataloader):
                model.train()

                inputs = {'input_ids': batch[0].to(device),
                          'attention_mask': batch[1].to(device),
                          'labels': batch[2],
                          'entity_pos': batch[3],
                          'hts': batch[4],
                          }
                outputs = model(**inputs)
                loss = outputs[0] / args.gradient_accumulation_steps
                total_loss += loss.item()

                loss.backward()

                if step % args.gradient_accumulation_steps == 0:
                    if args.max_grad_norm > 0:
                        torch.nn.utils.clip_grad_norm_(cur_model.parameters(), args.max_grad_norm)
                    optimizer.step()
                    scheduler.step()
                    optimizer.zero_grad()
                    global_step += 1
                    num_steps += 1
                    if global_step % log_step == 0:
                        cur_loss = total_loss / log_step
                        elapsed = time.time() - start_time
                        logging(
                            '| epoch {:2d} | step {:4d} | min/b {:5.2f} | lr {} | train loss {:5.3f}'.format(
                                epoch, global_step, elapsed / 60, scheduler.get_last_lr(), cur_loss * 1000))
                        total_loss = 0
                        start_time = time.time()

                        wandb.log({
                            "train_loss": cur_loss
                        })

                if (step + 1) == len(train_dataloader) - 1 or (args.evaluation_steps > 0 and num_steps % args.evaluation_steps == 0 and step % args.gradient_accumulation_steps == 0):
                    logging('-' * 89)
                    eval_start_time = time.time()
                    dev_score, dev_output = evaluate(args, model, dev_features, tag="dev")

                    logging(
                        '| epoch {:3d} | time: {:5.2f}s | dev_result:{}'.format(epoch, time.time() - eval_start_time,
                                                                                dev_output))

                    wandb.log({
                        "dev_result": dev_output
                    })

                    logging('-' * 89)
                    if dev_score > best_score:
                        best_score = dev_score
                        logging(
                            '| epoch {:3d} | best_f1:{}'.format(epoch, best_score))

                        wandb.log({
                            "best_f1": best_score
                        })

                        if args.save_path != "":
                            torch.save({
                                'epoch': epoch,
                                'checkpoint': cur_model.state_dict(),
                                'best_f1': best_score,
                                'optimizer': optimizer.state_dict()
                            }, args.save_path,
                                _use_new_zipfile_serialization=False)
                            logging(
                                '| successfully save model at: {}'.format(args.save_path))
                            logging('-' * 89)
        return num_steps

    cur_model = model.module if hasattr(model, 'module') else model
    extract_layer = ["extractor", "bilinear"]
    bert_layer = ['bert_model']
    optimizer_grouped_parameters = [
        {"params": [p for n, p in cur_model.named_parameters() if any(nd in n for nd in bert_layer)], "lr": args.bert_lr},
        {"params": [p for n, p in cur_model.named_parameters() if any(nd in n for nd in extract_layer)], "lr": 1e-4},
        {"params": [p for n, p in cur_model.named_parameters() if not any(nd in n for nd in extract_layer + bert_layer)]},
    ]

    optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
    if args.train_from_saved_model != '':
        optimizer.load_state_dict(torch.load(args.train_from_saved_model)["optimizer"])
        print("load saved optimizer from {}.".format(args.train_from_saved_model))

    num_steps = 0
    set_seed(args)
    model.zero_grad()
    finetune(train_features, optimizer, args.num_train_epochs, num_steps, model)


def evaluate(args, model, features, tag="dev"):
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    dataloader = DataLoader(features, batch_size=args.test_batch_size, shuffle=False, collate_fn=collate_fn, drop_last=False)
    preds = []
    total_loss = 0
    for i, batch in enumerate(dataloader):
        model.eval()

        inputs = {'input_ids': batch[0].to(device),
                  'attention_mask': batch[1].to(device),
                  'labels': batch[2],
                  'entity_pos': batch[3],
                  'hts': batch[4],
                  }

        with torch.no_grad():
            output = model(**inputs)
            loss = output[0]
            pred = output[1].cpu().numpy()
            pred[np.isnan(pred)] = 0
            preds.append(pred)
            total_loss += loss.item()

    average_loss = total_loss / (i + 1)
    preds = np.concatenate(preds, axis=0).astype(np.float32)
    ans = to_official(args, preds, features)
    if len(ans) > 0:
        best_f1, _, best_f1_ign, _, re_p, re_r = official_evaluate(ans, args.data_dir)
        output = {
            tag + "_F1": best_f1 * 100,
            tag + "_F1_ign": best_f1_ign * 100,
            tag + "_re_p": re_p * 100,
            tag + "_re_r": re_r * 100,
            tag + "_average_loss": average_loss
        }

    return best_f1, output


wandb.init(project="DeepKE_RE_Document")
wandb.watch_called = False
@hydra.main(config_path="conf/config.yaml")
def main(cfg):
    cwd = get_original_cwd()
    os.chdir(cwd)

    if not os.path.exists(os.path.join(cfg.data_dir, "train_distant.json")):
        raise FileNotFoundError("Sorry, the file 'train_distant.json' is too big to upload to GitHub; \
please download it manually to 'data/' from the DocRED Google Drive: https://drive.google.com/drive/folders/1c5-0YwnoJx8NS6CV2f-NoTHR__BdkNqw")

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    config = AutoConfig.from_pretrained(
        cfg.config_name if cfg.config_name else cfg.model_name_or_path,
        num_labels=cfg.num_class,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        cfg.tokenizer_name if cfg.tokenizer_name else cfg.model_name_or_path,
    )

    Dataset = ReadDataset(cfg, cfg.dataset, tokenizer, cfg.max_seq_length)

    train_file = os.path.join(cfg.data_dir, cfg.train_file)
    dev_file = os.path.join(cfg.data_dir, cfg.dev_file)
    test_file = os.path.join(cfg.data_dir, cfg.test_file)
    train_features = Dataset.read(train_file)
    dev_features = Dataset.read(dev_file)
    test_features = Dataset.read(test_file)

    model = AutoModel.from_pretrained(
        cfg.model_name_or_path,
        from_tf=bool(".ckpt" in cfg.model_name_or_path),
        config=config,
    )
    wandb.watch(model, log="all")

    config.cls_token_id = tokenizer.cls_token_id
    config.sep_token_id = tokenizer.sep_token_id
    config.transformer_type = cfg.transformer_type

    set_seed(cfg)
    model = DocREModel(config, cfg, model, num_labels=cfg.num_labels)
    if cfg.train_from_saved_model != '':
        model.load_state_dict(torch.load(cfg.train_from_saved_model)["checkpoint"])
        print("load saved model from {}.".format(cfg.train_from_saved_model))

    model.to(device)

    train(cfg, model, train_features, dev_features, test_features)


if __name__ == "__main__":
    main()
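The optimizer setup in `train()` above assigns per-module learning rates by substring-matching parameter names. In isolation, with toy modules standing in for DocREModel's real submodules, it behaves like this:

```python
import torch

model = torch.nn.ModuleDict({
    "bert_model": torch.nn.Linear(4, 4),
    "extractor": torch.nn.Linear(4, 4),
    "bilinear": torch.nn.Linear(4, 4),
    "head": torch.nn.Linear(4, 2),  # falls through to the default lr
})
bert_layer, extract_layer = ['bert_model'], ["extractor", "bilinear"]
named = list(model.named_parameters())
groups = [
    {"params": [p for n, p in named if any(nd in n for nd in bert_layer)], "lr": 3e-05},
    {"params": [p for n, p in named if any(nd in n for nd in extract_layer)], "lr": 1e-4},
    {"params": [p for n, p in named if not any(nd in n for nd in extract_layer + bert_layer)]},
]
opt = torch.optim.AdamW(groups, lr=0.0004)   # default lr applies to the last group
print([g["lr"] for g in opt.param_groups])   # [3e-05, 0.0001, 0.0004]
```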
@ -1,75 +0,0 @@
# Easy Start

<p align="left">
    <b> English | <a href="https://github.com/zjunlp/DeepKE/blob/main/example/re/few-shot/README_CN.md">简体中文</a> </b>
</p>

## Requirements

> python == 3.8

- torch == 1.5
- transformers == 3.4.0
- hydra-core == 1.0.6
- deepke

## Download Code

```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/re/few-shot
```

## Install with Pip

- Create and enter the python virtual environment.
- Install dependencies: `pip install -r requirements.txt`.

## Train and Predict

- Dataset

  - Download the dataset to this directory.

    ```bash
    wget 120.27.214.45/Data/re/few-shot/data.tar.gz
    tar -xzvf data.tar.gz
    ```

  - The dataset [SEMEVAL](https://semeval2.fbk.eu/semeval2.php?location=tasks#T11) is stored in `data`:
    - `rel2id.json`: Relation label - ID
    - `temp.txt`: Results of handled relation labels
    - `test.txt`: Test set
    - `train.txt`: Training set
    - `val.txt`: Validation set

- Training

  - Parameters, model paths and configuration for training are in the `conf` folder and users can modify them before training.

  - Few-shot training on SEMEVAL

    ```bash
    python run.py
    ```

  - The trained model is stored in the current directory by default.

  - Start to train from the last-trained model:<br>
    modify `train_from_saved_model` in `.yaml` to the path of the last-trained model.

  - Logs for training are stored in the current directory by default and the path can be configured by modifying `log_dir` in `.yaml`.

- Prediction

  ```bash
  python predict.py
  ```

## Model

[KnowPrompt](https://arxiv.org/abs/2104.07650)
@ -1,59 +0,0 @@
## Easy Start

<p align="left">
    <b> <a href="https://github.com/zjunlp/DeepKE/blob/main/example/re/few-shot/README.md">English</a> | 简体中文 </b>
</p>

### Requirements

> python == 3.8

- torch == 1.5
- transformers == 3.4.0
- hydra-core == 1.0.6
- deepke

### Download Code

```
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/re/few-shot
```

### Install with Pip

First create a python virtual environment, then enter it.

- Install dependencies: `pip install -r requirements.txt`

### Train and Predict with the Data

- Store the data: download it to this directory first with `wget 120.27.214.45/Data/re/few_shot/data.tar.gz`

  The training data is stored in the `data` folder. The model uses the [SEMEVAL](https://semeval2.fbk.eu/semeval2.php?location=tasks#T11) dataset, from Task 8 of the 2010 International Workshop on Semantic Evaluation: "Multi-Way Classification of Semantic Relations Between Pairs of Nominals".

  - SEMEVAL contains the following files:

    - `rel2id.json`: mapping from relation labels to IDs
    - `temp.txt`: processed relation labels
    - `test.txt`: test set
    - `train.txt`: training set
    - `val.txt`: validation set

- Start training: model loading and saving locations as well as the configuration can be changed in the `.yaml` files under `conf`

  - Few-shot training on the SEMEVAL dataset: `python run.py`

  - The trained model is saved in the current directory by default.

  - Resume training from a saved model: set `train_from_saved_model` in `.yaml` to the path of the last saved model.

  - The logs of each run are saved in the current directory by default; the path can be configured via `log_dir` in `.yaml`.

- Prediction: `python predict.py`

## Model

[KnowPrompt](https://arxiv.org/abs/2104.07650)
@ -1,3 +0,0 @@
defaults:
  - hydra/output: custom
  - train
@ -1,11 +0,0 @@
hydra:

  run:
    # Output directory for normal runs
    dir: logs/${now:%Y-%m-%d_%H-%M-%S}

  sweep:
    # Output directory for sweep runs
    dir: logs/${now:%Y-%m-%d_%H-%M-%S}
    # Output subdirectory for sweep runs.
    subdir: ${hydra.job.num}_${hydra.job.id}
@ -1,83 +0,0 @@
accelerator: None
accumulate_grad_batches: '1'
amp_backend: 'native'
amp_level: 'O2'
auto_lr_find: False
auto_scale_batch_size: False
auto_select_gpus: False
batch_size: 16
benchmark: False
check_val_every_n_epoch: '3'
checkpoint_callback: True
data_class: 'REDataset'
data_dir: 'data/k-shot/8-1'
default_root_dir: None
deterministic: False
devices: None
distributed_backend: None
fast_dev_run: False
flush_logs_every_n_steps: 100
gpus: None
gradient_accumulation_steps: 1
gradient_clip_algorithm: 'norm'
gradient_clip_val: 0.0
ipus: None
limit_predict_batches: 1.0
limit_test_batches: 1.0
limit_train_batches: 1.0
limit_val_batches: 1.0
litmodel_class: 'BertLitModel'
load_checkpoint: None
log_dir: './model_bert.log'
log_every_n_steps: 50
log_gpu_memory: None
logger: True
lr: 3e-05
lr_2: 3e-05
max_epochs: '30'
max_seq_length: 256
max_steps: None
max_time: None
min_epochs: None
min_steps: None
model_class: 'BertForMaskedLM'
model_name_or_path: 'bert-base-uncased'
move_metrics_to_cpu: False
multiple_trainloader_mode: 'max_size_cycle'
num_nodes: 1
num_processes: 1
num_sanity_val_steps: 2
num_train_epochs: 30
num_workers: 8
optimizer: 'AdamW'
overfit_batches: 0.0
plugins: None
precision: 32
prepare_data_per_node: True
process_position: 0
profiler: None
progress_bar_refresh_rate: None
ptune_k: 7
reload_dataloaders_every_epoch: False
reload_dataloaders_every_n_epochs: 0
replace_sampler_ddp: True
resume_from_checkpoint: None
save_path: './model_bert.pt'
seed: 666
stochastic_weight_avg: False
sync_batchnorm: False
t_lambda: 0.001
task_name: 'wiki80'
terminate_on_nan: False
tpu_cores: None
track_grad_norm: -1
train_from_saved_model: ''
truncated_bptt_steps: None
two_steps: False
use_prompt: True
val_check_interval: 1.0
wandb: False
weight_decay: 0.01
weights_save_path: None
weights_summary: 'top'
load_path: './model_bert.pt'
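One pitfall in the config above: quoted values such as `max_epochs: '30'` and `accumulate_grad_batches: '1'` load as strings, not integers, so downstream code has to cast them. A quick YAML check:

```python
import yaml

cfg = yaml.safe_load("max_epochs: '30'\nbatch_size: 16")
print(type(cfg["max_epochs"]).__name__, type(cfg["batch_size"]).__name__)  # str int
```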
@ -1,83 +0,0 @@
from logging import debug

import hydra
from hydra.utils import get_original_cwd

import numpy as np
import torch
from torch.utils.data.dataloader import DataLoader
import yaml
import time
from transformers import AutoConfig, AutoModelForMaskedLM
from transformers.optimization import get_linear_schedule_with_warmup
import os
from tqdm import tqdm

from deepke.relation_extraction.few_shot import *

os.environ["TOKENIZERS_PARALLELISM"] = "false"


# In order to ensure reproducible experiments, we must set random seeds.


def logging(log_dir, s, print_=True, log_=True):
    if print_:
        print(s)
    if log_dir != '' and log_:
        with open(log_dir, 'a+') as f_log:
            f_log.write(s + '\n')


def test(args, model, lit_model, data):
    model.eval()
    with torch.no_grad():
        test_loss = []
        for test_index, test_batch in enumerate(tqdm(data.test_dataloader())):
            loss = lit_model.test_step(test_batch, test_index)
            test_loss.append(loss)
        f1 = lit_model.test_epoch_end(test_loss)
        logging(args.log_dir,
                '| test_result: {}'.format(f1))
        logging(args.log_dir, '-' * 89)


@hydra.main(config_path="conf/config.yaml")
def main(cfg):
    cwd = get_original_cwd()
    os.chdir(cwd)
    if not os.path.exists(f"data/{cfg.model_name_or_path}.pt"):
        get_label_word(cfg)
    if not os.path.exists(cfg.data_dir):
        generate_k_shot(cfg.data_dir)

    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    data = REDataset(cfg)
    data_config = data.get_data_config()

    config = AutoConfig.from_pretrained(cfg.model_name_or_path)
    config.num_labels = data_config["num_labels"]

    model = AutoModelForMaskedLM.from_pretrained(cfg.model_name_or_path, config=config)

    model.to(device)

    lit_model = BertLitModel(args=cfg, model=model, device=device, tokenizer=data.tokenizer)
    data.setup()

    # second positional argument is strict=False: keys missing from the checkpoint are tolerated
    model.load_state_dict(torch.load(cfg.load_path)["checkpoint"], False)
    print("load trained model from {}.".format(cfg.load_path))

    test(cfg, model, lit_model, data)


if __name__ == "__main__":
    main()
@ -1,3 +0,0 @@
torch==1.5
transformers==3.4.0
hydra-core==1.0.6