Compare commits


No commits in common. "main" and "deprecated-tensorflow" have entirely different histories.

257 changed files with 2079 additions and 24935 deletions


@ -1,76 +0,0 @@
# Contributor Covenant Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to making participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies both within project spaces and in public spaces
when an individual is representing the project or its community. Examples of
representing a project or community include using an official project e-mail
address, posting via an official social media account, or acting as an appointed
representative at an online or offline event. Representation of a project may be
further defined and clarified by project maintainers.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at . All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq


@ -1,22 +0,0 @@
# Contributing
Welcome to the DeepKE community! We're building relation extraction toolkits for research.
## Simple Internal Code
It's useful for users to look at the code and understand very quickly what's happening. Many users won't be engineers. Thus we need to value clear, simple code over condensed ninja moves. While that's super cool, this isn't the project for that :)
## Contribution Types
Currently looking for help implementing new features or adding bug fixes.
## Bug Fixes:
1. Submit a GitHub issue.
2. Fix it.
3. Submit a PR!
## New Features:
1. Submit a GitHub issue.
2. We'll agree on the feature scope.
3. Submit a PR!
## Coding Styleguide
1. Test the code with flake8.
2. Use f-strings.
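As a small illustration of the second guideline, the sketch below builds a log line with an f-string. The function name and the message layout are hypothetical and are not taken from the DeepKE code base.

```python
def format_metric(name: str, value: float, epoch: int) -> str:
    """Build a log line with an f-string instead of % or str.format."""
    # Format specs work inside the braces: zero-padded epoch, 4 decimals.
    return f"epoch {epoch:03d} | {name} = {value:.4f}"


print(format_metric("f1", 0.87654, 7))
```

Running `flake8` on a file written this way should report nothing as long as the usual line-length and whitespace rules are respected.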


@ -1,28 +0,0 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: 'bug'
assignees: ''
---
**Describe the bug**
> A clear and concise description of what the bug is.
**Environment (please complete the following information):**
- OS: [e.g. macOS / Windows]
- Python version: [e.g. 3.6]
**Screenshots**
> If applicable, add screenshots to help explain your problem.
**Additional context**
> Add any other context about the problem here.


@ -1,28 +0,0 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: 'enhancement'
assignees: ''
---
**Describe the feature**
> A clear and concise description of any features you've considered.
**Environment (please complete the following information):**
- OS: [e.g. macOS / Windows]
- Python version: [e.g. 3.6]
**Screenshots**
> If applicable, add screenshots to help explain your problem.
**Additional context**
> Add any other context about the problem here.


@ -1,28 +0,0 @@
---
name: Question
about: Ask any other question
title: ''
labels: 'question'
assignees: ''
---
**Describe the question**
> A clear and concise description of what the question is.
**Environment (please complete the following information):**
- OS: [e.g. macOS / Windows]
- Python version: [e.g. 3.6]
**Screenshots**
> If applicable, add screenshots to help explain your problem.
**Additional context**
> Add any other context about the problem here.

19
.gitignore vendored

@ -1,19 +0,0 @@
.DS_Store
.idea
.vscode
__pycache__
*.pyc
test/.pytest_cache
data/out
logs
checkpoints
demo.py
otherUtils.py
module/Transformer_offical.py


@ -1,80 +0,0 @@
cff-version: "1.0.0"
message: "If you use this toolkit, please cite it using these metadata."
title: "deepke"
repository-code: "https://github.com/zjunlp/DeepKE"
authors:
- family-names: Zhang
given-names: Ningyu
- family-names: Xu
given-names: Xin
- family-names: Tao
given-names: Liankuan
- family-names: Yu
given-names: Haiyang
- family-names: Ye
given-names: Hongbin
- family-names: Xie
given-names: Xin
- family-names: Chen
given-names: Xiang
- family-names: Li
given-names: Zhoubo
- family-names: Li
given-names: Lei
- family-names: Liang
given-names: Xiaozhuan
- family-names: Yao
given-names: Yunzhi
- family-names: Deng
given-names: Shumin
- family-names: Zhang
given-names: Zhenru
- family-names: Tan
given-names: Chuanqi
- family-names: Huang
given-names: Fei
- family-names: Zheng
given-names: Guozhou
- family-names: Chen
given-names: Huajun
preferred-citation:
type: article
title: "DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population"
authors:
- family-names: Zhang
given-names: Ningyu
- family-names: Xu
given-names: Xin
- family-names: Tao
given-names: Liankuan
- family-names: Yu
given-names: Haiyang
- family-names: Ye
given-names: Hongbin
- family-names: Xie
given-names: Xin
- family-names: Chen
given-names: Xiang
- family-names: Li
given-names: Zhoubo
- family-names: Li
given-names: Lei
- family-names: Liang
given-names: Xiaozhuan
- family-names: Yao
given-names: Yunzhi
- family-names: Deng
given-names: Shumin
- family-names: Zhang
given-names: Zhenru
- family-names: Tan
given-names: Chuanqi
- family-names: Huang
given-names: Fei
- family-names: Zheng
given-names: Guozhou
- family-names: Chen
given-names: Huajun
journal: "http://arxiv.org/abs/2201.03335"
year: 2022

21
LICENSE

@ -1,21 +0,0 @@
MIT License
Copyright (c) 2021 ZJUNLP
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

476
README.md

@ -1,441 +1,57 @@
<p align="center">
<a href="https://github.com/zjunlp/deepke"> <img src="pics/logo.png" width="400"/></a>
</p>
<p align="center">
<a href="http://deepke.zjukg.cn">
<img alt="Documentation" src="https://img.shields.io/badge/demo-website-blue">
</a>
<a href="https://pypi.org/project/deepke/#files">
<img alt="PyPI" src="https://img.shields.io/pypi/v/deepke">
</a>
<a href="https://github.com/zjunlp/DeepKE/blob/master/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/zjunlp/deepke">
</a>
<a href="http://zjunlp.github.io/DeepKE">
<img alt="Documentation" src="https://img.shields.io/badge/doc-website-red">
</a>
<a href="https://colab.research.google.com/drive/1vS8YJhJltzw3hpJczPt24O0Azcs3ZpRi?usp=sharing">
<img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg">
</a>
</p>
# deepke
## Data Preparation
<p align="center">
<b> English | <a href="https://github.com/zjunlp/DeepKE/blob/main/README_CN.md">简体中文</a> </b>
</p>
File (source) | Sample
---- | ----
com2abbr.txt (raw)|沙河实业股份有限公司 沙河股份
stock.sql (raw)|[stock_code,ChiName,ChiNameAbbr]<br />('000001', '平安银行股份有限公司', '平安银行');
rel_per_com.txt (raw)|非独立董事 刘旭 湖南博云新材料股份有限公司
kg_company_management.sql (raw)|[stock_code, manager_name,manager_position…] (main fields)
per_com.txt (raw)| 董事 深圳中国农大科技股份有限公司 刘多宏
rel_per_com.txt (generated from the two files above)| 非独立董事 刘旭 湖南博云新材料股份有限公司
<h1 align="center">
<p>A Deep Learning Based Knowledge Extraction Toolkit<br>for Knowledge Base Population</p>
</h1>
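Run `get_initial_sample()` in `preprocess.py`, which mainly covers:
* **Initial cleanup of the raw text**</br>
This includes removing unnecessary symbols and replacing numbers with NUM.
* **Organizing the distant-supervision data**</br>
This yields the position-related data: all persons and companies, stored in *per_pool* and *com_pool*, and the triples that carry a position relation, stored in *rel_per_com*.
* **Initial sampling**</br>
Samples are drawn by distant supervision. The current setting iterates over all sentences: if two entities appear in a sentence and are related in *rel_per_com*, the sentence is labeled as a positive sample; entity pairs that are not in *rel_per_com* are labeled as negative.
* **Rule-based data filtering**</br>
**Sources of noise:**
* Noise in the distant-supervision source itself, such as a person named ‘智慧’ (also an ordinary word meaning "wisdom").
* One person holding several positions:
* For the sentence “B为A的实际控股人” ("B is the actual controlling shareholder of A"), *rel_per_com* contains 「A,B,董事长」, so the sentence is labeled as a positive sample for 董事长 (chairman).
* Conflicts between the static distant-supervision source and position relations that change over time:
* For the sentence “A曾任B的董事长” ("A formerly served as chairman of B"), *rel_per_com* contains 「A,B,董事长」, so the sentence is labeled as a positive sample for 董事长.
* For the sentence “任命A为B的总裁” ("A is appointed president of B"), the only relation for 「A,B」 in *rel_per_com* is 「A,B,副总裁」 (vice president), so the sentence is still labeled as a positive sample for 副总裁.
* For the sentence “任命A为B的总裁”, the sentence mentions 「A,B」, but *rel_per_com* holds no position information for 「A,B」, so it is labeled as a negative sample.
* ...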
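The initial sampling step described above can be sketched as follows. This is a minimal illustration, not the code in `preprocess.py`: the function name, the dictionary layout of `rel_per_com`, and the two example sentences are assumptions made for the sketch, while the triple itself comes from the sample table.

```python
from itertools import product


def label_sentences(sentences, per_pool, com_pool, rel_per_com):
    """Label (sentence, person, company) candidates by distant supervision.

    rel_per_com is assumed to map (person, company) to a set of positions.
    Each result is (sentence, person, company, position_or_None, label).
    """
    samples = []
    for sent in sentences:
        # Plain substring matching stands in for entity linking here.
        persons = [p for p in per_pool if p in sent]
        companies = [c for c in com_pool if c in sent]
        for per, com in product(persons, companies):
            positions = rel_per_com.get((per, com))
            if positions:
                # Pair is known to the knowledge source: positive sample.
                for pos in sorted(positions):
                    samples.append((sent, per, com, pos, 1))
            else:
                # Pair co-occurs but has no known relation: negative sample.
                samples.append((sent, per, com, None, 0))
    return samples


# Made-up sentences for illustration only.
sentences = [
    "刘旭任湖南博云新材料股份有限公司非独立董事",
    "刘旭到访沙河股份",
]
per_pool = ["刘旭"]
com_pool = ["湖南博云新材料股份有限公司", "沙河股份"]
rel_per_com = {("刘旭", "湖南博云新材料股份有限公司"): {"非独立董事"}}

samples = label_sentences(sentences, per_pool, com_pool, rel_per_com)
for sample in samples:
    print(sample)
```

Because the labels depend only on co-occurrence, every noise source listed above passes straight through this step, which is why the rule-based filtering follows it.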
DeepKE is a knowledge extraction toolkit supporting **low-resource** and **document-level** scenarios for *entity*, *relation* and *attribute* extraction. We provide [comprehensive documents](https://zjunlp.github.io/DeepKE/), [Google Colab tutorials](https://colab.research.google.com/drive/1vS8YJhJltzw3hpJczPt24O0Azcs3ZpRi?usp=sharing), and an [online demo](http://deepke.zjukg.cn/) for beginners.
<br>
# Table of Contents
* [What's New](#whats-new)
* [Prediction Demo](#prediction-demo)
* [Model Framework](#model-framework)
* [Quick Start](#quick-start)
* [Requirements](#requirements)
* [Introduction of Three Functions](#introduction-of-three-functions)
* [1. Named Entity Recognition](#1-named-entity-recognition)
* [2. Relation Extraction](#2-relation-extraction)
* [3. Attribute Extraction](#3-attribute-extraction)
* [Notebook Tutorial](#notebook-tutorial)
* [Tips](#tips)
* [To do](#to-do)
* [Citation](#citation)
* [Developers](#developers)
<br>
# What's New
## Jan, 2022
* We have released a paper [DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population](https://arxiv.org/abs/2201.03335)
## Dec, 2021
* We have added a `dockerfile` to create the environment automatically.
## Nov, 2021
* The demo of DeepKE, supporting real-time extraction without deploying and training, has been released.
* The documentation of DeepKE, containing the details of DeepKE such as source codes and datasets, has been released.
## Oct, 2021
* `pip install deepke`
* The code of deepke-v2.0 has been released.
## August, 2020
* The code of deepke-v1.0 has been released.
<br>
# Prediction Demo
There is a demonstration of prediction.<br>
<img src="pics/demo.gif" width="636" height="494" align=center>
<br>
# Model Framework
<h3 align="center">
<img src="pics/architectures.png">
</h3>
- DeepKE contains a unified framework for **named entity recognition**, **relation extraction** and **attribute extraction**, the three knowledge extraction functions.
- Each task can be implemented in different scenarios. For example, we can achieve relation extraction in **standard**, **low-resource (few-shot)** and **document-level** settings.
- Each application scenario comprises three components: **Data** including Tokenizer, Preprocessor and Loader, **Model** including Module, Encoder and Forwarder, and **Core** including Training, Evaluation and Prediction.
<br>
# Quick Start
*DeepKE* supports `pip install deepke`. <br>Take fully supervised relation extraction as an example.
**Step1** Download the basic code
```bash
git clone https://github.com/zjunlp/DeepKE.git
```
**Step2** Create a virtual environment using `Anaconda` and enter it.<br>
We also provide the dockerfile source code, located in the `docker` folder, to help users build their own images.
```bash
conda create -n deepke python=3.8
conda activate deepke
```
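1. Install *DeepKE* with source code
```bash
python setup.py install
python setup.py develop
```
**Positive-sample filtering:** </br>
* The relation keyword must appear in the sentence; this accounts for one person holding several positions.
**Negative-sample filtering:**</br>
* A regular expression recognizes sentences of the form "A的董事长B" ("B, chairman of A") and relabels them as positive samples.
* Noise in the distant supervision itself: for example, *per_pool* contains both 「周建灿」 and 「周建」, so for the sentence “金盾董事长周建灿”, naive entity linking produces 「金盾董事,周建,董事长」.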
2. Install *DeepKE* with `pip`
```bash
pip install deepke
```
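The sample-filtering rules listed above for the distant-supervision data can be sketched in a few lines. Both functions are hypothetical helpers written for this illustration. The keyword check mirrors the positive-sample rule; the longest-match linking is only one possible way to avoid the 「周建」 / 「周建灿」 confusion and is an assumption, not something the source states.

```python
def keep_positive(sentence: str, position: str) -> bool:
    """Positive-sample rule: the position keyword must occur in the sentence."""
    return position in sentence


def link_longest(sentence: str, pool):
    """Return the pool names found in the sentence, longest first.

    A name is dropped when it only occurs inside a longer name that was
    already matched (assumed mitigation for prefix collisions).
    """
    found = [name for name in pool if name in sentence]
    masked = sentence
    kept = []
    for name in sorted(found, key=len, reverse=True):
        if name in masked:
            kept.append(name)
            # Blank out the matched characters so shorter names cannot reuse them.
            masked = masked.replace(name, "\0" * len(name))
    return kept


print(keep_positive("B为A的实际控股人", "董事长"))
print(link_longest("金盾董事长周建灿", ["周建", "周建灿"]))
```

With the example from the list above, only 「周建灿」 survives linking, so the spurious triple built on 「周建」 is never generated.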
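## Training
* Run `train_preprocess()` in `preprocess.py` to generate the training data.
* Run `python train.py`; the model is saved under `../model`.
* Parameters are configured in `config.py`, including *GPU_ID*, *learning_rate*, and so on.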
**Step3** Enter the task directory
## Testing
* Run `predict_preprocess()` in `preprocess.py` to generate data that can be fed to the model.
* Run `python test.py`; the results are saved under `../result`.
```bash
cd DeepKE/example/re/standard
```
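## Model
For better experimental results, several binary classifiers are currently used.</br>
Reference: [Lin et al. (2017)](http://www.aclweb.org/anthology/D15-1203).</br>
Input: the sentence, the two entities with their position information, and an auxiliary sequence used to detect coordinate clauses.</br>
Output: whether the given relation holds.</br>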
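The model input above includes position information for the two entities. A common way to encode it in CNN-style relation extraction models is a relative-offset sequence per entity; the sketch below assumes that encoding. The function name, the clipping distance `max_dist`, and the convention that tokens inside the entity get 0 are all assumptions, since the source does not specify them.

```python
def relative_positions(length: int, start: int, end: int, max_dist: int = 30):
    """Offset of every token to an entity span [start, end).

    Tokens before the span get negative offsets, tokens inside get 0,
    tokens after get positive offsets; values are clipped to max_dist.
    """
    offsets = []
    for i in range(length):
        if i < start:
            d = i - start
        elif i >= end:
            d = i - end + 1
        else:
            d = 0
        offsets.append(max(-max_dist, min(max_dist, d)))
    return offsets


# A 6-token sentence whose entity covers tokens 2 and 3.
print(relative_positions(6, 2, 4))
```

One such sequence per entity would then typically be mapped to embeddings and concatenated with the token embeddings before the classifier.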
**Step4** Download the dataset
```bash
wget 120.27.214.45/Data/re/standard/data.tar.gz
tar -xzvf data.tar.gz
```
**Step5** Training (Parameters for training can be changed in the `conf` folder)
We support visual parameter tuning by using *wandb*.
```bash
python run.py
```
**Step6** Prediction (Parameters for prediction can be changed in the `conf` folder)
Modify the path of the trained model in `predict.yaml`.
```bash
python predict.py
```
## Requirements
> python == 3.8
- torch == 1.5
- hydra-core == 1.0.6
- tensorboard == 2.4.1
- matplotlib == 3.4.1
- transformers == 3.4.0
- jieba == 0.42.1
- scikit-learn == 0.24.1
- pytorch-transformers == 1.2.0
- seqeval == 1.2.2
- tqdm == 4.60.0
- opt-einsum==3.3.0
- wandb==0.12.7
- ujson
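To compare an existing environment against the pins above, a small check along these lines may help. It is a sketch under assumptions: the helper name is hypothetical, only a few of the listed packages are included, and versions are compared by prefix, which is a simplification (for example, `1.5` also matches `1.5.1`).

```python
from importlib import metadata

# A subset of the pins listed above; extend as needed.
PINNED = {
    "torch": "1.5",
    "hydra-core": "1.0.6",
    "transformers": "3.4.0",
    "jieba": "0.42.1",
    "scikit-learn": "0.24.1",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Prefix comparison only; not a full version-specifier check.
        if installed is None or not installed.startswith(expected):
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

The `get_version` parameter exists only so the check can be exercised without touching the real environment.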
## Introduction of Three Functions
### 1. Named Entity Recognition
- Named entity recognition seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, etc.
- The data is stored in `.txt` files. Some examples follow:
| Sentence | Person | Location | Organization |
| :----------------------------------------------------------: | :------------------------: | :------------: | :----------------------------: |
| 本报北京9月4日讯记者杨涌报道部分省区人民日报宣传发行工作座谈会9月3日在4日在京举行。 | 杨涌 | 北京 | 人民日报 |
| 《红楼梦》是中央电视台和中国电视剧制作中心根据中国古典文学名著《红楼梦》摄制于1987年的一部古装连续剧由王扶林导演周汝昌、王蒙、周岭等多位红学家参与制作。 | 王扶林,周汝昌,王蒙,周岭 | 中国 | 中央电视台,中国电视剧制作中心 |
| 秦始皇兵马俑位于陕西省西安市1961年被国务院公布为第一批全国重点文物保护单位是世界八大奇迹之一。 | 秦始皇 | 陕西省,西安市 | 国务院 |
- See the README of each setting for the detailed process
- **[STANDARD (Fully Supervised)](https://github.com/zjunlp/DeepKE/tree/main/example/ner/standard)**
**Step1** Enter `DeepKE/example/ner/standard`. Download the dataset.
```bash
wget 120.27.214.45/Data/ner/standard/data.tar.gz
tar -xzvf data.tar.gz
```
**Step2** Training<br>
The dataset and parameters can be customized in the `data` folder and `conf` folder respectively.
```bash
python run.py
```
**Step3** Prediction
```bash
python predict.py
```
- **[FEW-SHOT](https://github.com/zjunlp/DeepKE/tree/main/example/ner/few-shot)**
**Step1** Enter `DeepKE/example/ner/few-shot`. Download the dataset.
```bash
wget 120.27.214.45/Data/ner/few_shot/data.tar.gz
tar -xzvf data.tar.gz
```
**Step2** Training in the low-resource setting <br>
The directory where the model is loaded and saved and the configuration parameters can be customized in the `conf` folder.
```bash
python run.py +train=few_shot
```
Users can modify `load_path` in `conf/train/few_shot.yaml` to load an existing model.<br>
**Step3** Add `- predict` to `conf/config.yaml`, set `load_path` to the model path and `write_path` to the path where the predicted results are saved in `conf/predict.yaml`, and then run `python predict.py`
```bash
python predict.py
```
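To make the NER example table earlier in this section more concrete, the sketch below turns a sentence and its entity list into character-level BIO tags. The BIO scheme, the `PER` / `LOC` label names, and the helper itself are assumptions for illustration; the exact layout of DeepKE's `.txt` files is not shown here, so check the task README before relying on it. The sentence is a shortened form of the third table row.

```python
def bio_tags(sentence: str, entities):
    """Character-level BIO tags; entities is a list of (text, label) pairs."""
    tags = ["O"] * len(sentence)
    for text, label in entities:
        if not text:
            continue  # nothing to tag for an empty mention
        start = sentence.find(text)
        while start != -1:
            end = start + len(text)
            # Only tag spans that do not overlap an earlier entity.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = f"B-{label}"
                for i in range(start + 1, end):
                    tags[i] = f"I-{label}"
            start = sentence.find(text, end)
    return tags


sentence = "秦始皇兵马俑位于陕西省西安市"
entities = [("秦始皇", "PER"), ("陕西省", "LOC"), ("西安市", "LOC")]
tags = bio_tags(sentence, entities)
for char, tag in zip(sentence, tags):
    print(char, tag)
```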
### 2. Relation Extraction
- Relation extraction is the task of extracting semantic relations between entities from unstructured text.
- The data is stored in `.csv` files. Some examples follow:
| Sentence | Relation | Head | Head_offset | Tail | Tail_offset |
| :----------------------------------------------------: | :------: | :--------: | :---------: | :--------: | :---------: |
| 《岳父也是爹》是王军执导的电视剧,由马恩然、范明主演。 | 导演 | 岳父也是爹 | 1 | 王军 | 8 |
| 《九玄珠》是在纵横中文网连载的一部小说,作者是龙马。 | 连载网站 | 九玄珠 | 1 | 纵横中文网 | 7 |
| 提起杭州的美景,西湖总是第一个映入脑海的词语。 | 所在城市 | 西湖 | 8 | 杭州 | 2 |
- See the README of each setting for the detailed process
- **[STANDARD (Fully Supervised)](https://github.com/zjunlp/DeepKE/tree/main/example/re/standard)**
**Step1** Enter the `DeepKE/example/re/standard` folder. Download the dataset.
```bash
wget 120.27.214.45/Data/re/standard/data.tar.gz
tar -xzvf data.tar.gz
```
**Step2** Training<br>
The dataset and parameters can be customized in the `data` folder and `conf` folder respectively.
```bash
python run.py
```
**Step3** Prediction
```bash
python predict.py
```
- **[FEW-SHOT](https://github.com/zjunlp/DeepKE/tree/main/example/re/few-shot)**
**Step1** Enter `DeepKE/example/re/few-shot`. Download the dataset.
```bash
wget 120.27.214.45/Data/re/few_shot/data.tar.gz
tar -xzvf data.tar.gz
```
**Step2** Training<br>
- The dataset and parameters can be customized in the `data` folder and `conf` folder respectively.
- To resume from the previously trained model, set `train_from_saved_model` in `conf/train.yaml` to the path where that model was saved. The directory for the logs generated during training can be customized with `log_dir`.
```bash
python run.py
```
**Step3** Prediction
```bash
python predict.py
```
- **[DOCUMENT](https://github.com/zjunlp/DeepKE/tree/main/example/re/document)**<br>
**Step1** Enter `DeepKE/example/re/document`. Download the dataset.
```bash
wget 120.27.214.45/Data/re/document/data.tar.gz
tar -xzvf data.tar.gz
```
**Step2** Training<br>
- The dataset and parameters can be customized in the `data` folder and `conf` folder respectively.
- To resume from the previously trained model, set `train_from_saved_model` in `conf/train.yaml` to the path where that model was saved. The directory for the logs generated during training can be customized with `log_dir`.
```bash
python run.py
```
**Step3** Prediction
```bash
python predict.py
```
### 3. Attribute Extraction
- Attribute extraction extracts the attributes of entities from unstructured text.
- The data is stored in `.csv` files. Some examples follow:
| Sentence | Att | Ent | Ent_offset | Val | Val_offset |
| :----------------------------------------------------------: | :------: | :------: | :--------: | :-----------: | :--------: |
| 张冬梅汉族1968年2月生河南淇县人 | 民族 | 张冬梅 | 0 | 汉族 | 6 |
|诸葛亮,字孔明,三国时期杰出的军事家、文学家、发明家。| 朝代 | 诸葛亮 | 0 | 三国时期 | 8 |
| 2014年10月1日许鞍华执导的电影《黄金时代》上映 | 上映时间 | 黄金时代 | 19 | 2014年10月1日 | 0 |
- See the README of each setting for the detailed process
- **[STANDARD (Fully Supervised)](https://github.com/zjunlp/DeepKE/tree/main/example/ae/standard)**
**Step1** Enter the `DeepKE/example/ae/standard` folder. Download the dataset.
```bash
wget 120.27.214.45/Data/ae/standard/data.tar.gz
tar -xzvf data.tar.gz
```
**Step2** Training<br>
The dataset and parameters can be customized in the `data` folder and `conf` folder respectively.
```bash
python run.py
```
**Step3** Prediction
```bash
python predict.py
```
<br>
# Notebook Tutorial
This toolkit provides many `Jupyter Notebook` and `Google Colab` tutorials. Users can study *DeepKE* with them.
- Standard Setting<br>
[NER Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/ner/standard/standard_ner_tutorial.ipynb)
[NER Colab](https://colab.research.google.com/drive/1h4k6-_oNEHBRxrnzpxHPczO5SFaLS9uq?usp=sharing)
[RE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/re/standard/standard_re_pcnn_tutorial.ipynb)
[RE Colab](https://colab.research.google.com/drive/1o6rKIxBqrGZNnA2IMXqiSsY2GWANAZLl?usp=sharing)
[AE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/ae/standard/standard_ae_tutorial.ipynb)
[AE Colab](https://colab.research.google.com/drive/1pgPouEtHMR7L9Z-QfG1sPYkJfrtRt8ML)
- Low-resource<br>
[NER Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/ner/few-shot/fewshot_ner_tutorial.ipynb)
[NER Colab](https://colab.research.google.com/drive/1Xz0sNpYQNbkjhebCG5djrwM8Mj2Crj7F?usp=sharing)
[RE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/re/few-shot/fewshot_re_tutorial.ipynb)
[RE Colab](https://colab.research.google.com/drive/1o1ly6ORgerkm1fCDjEQb7hsN5WKyg3JH?usp=sharing)
- Document-level<br>
[RE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/re/document/document_re_tutorial.ipynb)
[RE Colab](https://colab.research.google.com/drive/1RGUBbbOBHlWJ1NXQLtP_YEUktntHtROa?usp=sharing)
<br>
# Tips
1. Using the nearest mirror, such as [THU](https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/) in China, speeds up the installation of *Anaconda*.
2. Using the nearest mirror, such as [aliyun](http://mirrors.aliyun.com/pypi/simple/) in China, speeds up `pip install XXX`.
3. When encountering `ModuleNotFoundError: No module named 'past'`, run `pip install future`.
4. Downloading pretrained language models online is slow. We recommend downloading them before use and saving them in the `pretrained` folder. Read `README.md` in each task directory for the specific requirements on saving pretrained models.
5. The old version of *DeepKE* is in the [deepke-v1.0](https://github.com/zjunlp/DeepKE/tree/deepke-v1.0) branch. Users can switch to that branch to use the old version. Its functionality has been fully migrated to standard relation extraction ([example/re/standard](https://github.com/zjunlp/DeepKE/blob/main/example/re/standard/README.md)).
6. We recommend installing *DeepKE* from source, because users may run into problems with `pip` on Windows.
<br>
# To do
In the next version, we plan to add multi-modal knowledge extraction to the toolkit.
Meanwhile, we will offer long-term maintenance to fix bugs, solve issues and meet new requests. If you have any problems, please open an issue.
<br>
# Citation
Please cite our paper if you use DeepKE in your work
```bibtex
@article{Zhang_DeepKE_A_Deep_2022,
author = {Zhang, Ningyu and Xu, Xin and Tao, Liankuan and Yu, Haiyang and Ye, Hongbin and Xie, Xin and Chen, Xiang and Li, Zhoubo and Li, Lei and Liang, Xiaozhuan and Yao, Yunzhi and Deng, Shumin and Zhang, Zhenru and Tan, Chuanqi and Huang, Fei and Zheng, Guozhou and Chen, Huajun},
journal = {http://arxiv.org/abs/2201.03335},
title = {{DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population}},
year = {2022}
}
```
<br>
# Developers
Zhejiang University: Ningyu Zhang, Liankuan Tao, Xin Xu, Haiyang Yu, Hongbin Ye, Xin Xie, Xiang Chen, Zhoubo Li, Lei Li, Xiaozhuan Liang, Yunzhi Yao, Shuofei Qiao, Shumin Deng, Wen Zhang, Guozhou Zheng, Huajun Chen
DAMO Academy: Zhenru Zhang, Chuanqi Tan, Fei Huang
## Results
![pcnn result](https://github.com/zjunlp/deepke/blob/dev/result/result.png)


@ -1,423 +0,0 @@
<p align="center">
<a href="https://github.com/zjunlp/deepke"> <img src="pics/logo.png" width="400"/></a>
</p>
<p align="center">
<a href="http://deepke.zjukg.cn">
<img alt="Documentation" src="https://img.shields.io/badge/demo-website-blue">
</a>
<a href="https://pypi.org/project/deepke/#files">
<img alt="PyPI" src="https://img.shields.io/pypi/v/deepke">
</a>
<a href="https://github.com/zjunlp/DeepKE/blob/master/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/zjunlp/deepke">
</a>
<a href="http://zjunlp.github.io/DeepKE">
<img alt="Documentation" src="https://img.shields.io/badge/doc-website-red">
</a>
<a href="https://colab.research.google.com/drive/1vS8YJhJltzw3hpJczPt24O0Azcs3ZpRi?usp=sharing">
<img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg">
</a>
</p>
<h1 align="center">
<p>An Open-Source Deep Learning Based Framework for Chinese Knowledge Graph Extraction</p>
</h1>
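DeepKE is a knowledge extraction toolkit that supports <b>low-resource and long-document</b> scenarios. Built on <b>PyTorch</b>, it provides <b>named entity recognition</b>, <b>relation extraction</b> and <b>attribute extraction</b>.<br>It also offers beginners detailed [documentation](https://zjunlp.github.io/DeepKE/), a [Google Colab tutorial](https://colab.research.google.com/drive/1vS8YJhJltzw3hpJczPt24O0Azcs3ZpRi?usp=sharing) and an [online demo](http://deepke.zjukg.cn/CN/index.html).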
<br>
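# Table of Contents
* [What's New](#whats-new)
* [Prediction Demo](#prediction-demo)
* [Model Framework](#model-framework)
* [Quick Start](#quick-start)
* [Requirements](#requirements)
* [Introduction of Three Functions](#introduction-of-three-functions)
* [1. Named Entity Recognition (NER)](#1-named-entity-recognition-ner)
* [2. Relation Extraction (RE)](#2-relation-extraction-re)
* [3. Attribute Extraction (AE)](#3-attribute-extraction-ae)
* [Notebook Tutorial](#notebook-tutorial)
* [Notes (FAQ)](#notes-faq)
* [To do](#to-do)
* [Citation](#citation)
* [Developers](#developers)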
<br>
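# What's New
## Jan, 2022
- Released the paper [DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population](https://arxiv.org/abs/2201.03335)
## Dec, 2021
- Added a `dockerfile` to create the environment automatically
## Nov, 2021
- Released the DeepKE demo page, which supports real-time extraction without deploying or training a model
- Released the DeepKE documentation, which covers details such as the source code and datasets
## Oct, 2021
- `pip install deepke`
- Released deepke-v2.0
## May, 2021
- `pip install deepke`
- Released deepke-v1.0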
<br>
# Prediction Demo
The demo below shows the prediction process.<br>
<img src="pics/demo.gif" width="636" height="494" align=center>
<br>
# Model Framework
The architecture of DeepKE is shown below:
<h3 align="center">
<img src="pics/architectures.png">
</h3>
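- DeepKE provides a unified framework for its three knowledge extraction functions: named entity recognition, relation extraction and attribute extraction.
- Each function can be used in different scenarios. For example, relation extraction can be performed in standard fully supervised, low-resource few-shot and document-level settings.
- Each application scenario consists of three components: Data, including Tokenizer, Preprocessor and Loader; Model, including Module, Encoder and Forwarder; and Core, including Training, Evaluation and Prediction.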
<br>
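# Quick Start
DeepKE can be installed with pip. Taking standard fully supervised relation extraction as an example, the following 6 steps produce a standard relation extraction model.
**Step 1**: Download the code ```git clone https://github.com/zjunlp/DeepKE.git``` (don't forget to star and fork!)
**Step 2**: Create a virtual environment with Anaconda and enter it. (The Dockerfile source is also provided in the `docker` folder so that you can build your own image.)
```
conda create -n deepke python=3.8
conda activate deepke
```
1. Install with pip for direct use
```
pip install deepke
```
2. Install from source
```
python setup.py install
python setup.py develop
```
**Step 3**: Enter the task folder, taking standard relation extraction as an example
```
cd DeepKE/example/re/standard
```
**Step 4**: Download the dataset
```
wget 120.27.214.45/Data/re/standard/data.tar.gz
tar -xzvf data.tar.gz
```
**Step 5**: Train the model. The training parameters can be modified in the `conf` folder.
DeepKE supports visual parameter tuning with *wandb*.
```
python run.py
```
**Step 6**: Predict. The prediction parameters can be modified in the `conf` folder.
Modify the path of the trained model in `conf/predict.yaml`.
```
python predict.py
```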
<br>
## Requirements
> python == 3.8
- torch == 1.5
- hydra-core == 1.0.6
- tensorboard == 2.4.1
- matplotlib == 3.4.1
- transformers == 3.4.0
- jieba == 0.42.1
- scikit-learn == 0.24.1
- pytorch-transformers == 1.2.0
- seqeval == 1.2.2
- tqdm == 4.60.0
- opt-einsum==3.3.0
- ujson
<br>
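## Introduction of Three Functions
### 1. Named Entity Recognition (NER)
- Named entity recognition identifies entities and their types in unstructured text. The data is stored in txt files; some examples follow: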
| Sentence | Person | Location | Organization |
| :----------------------------------------------------------: | :------------------------: | :------------: | :----------------------------: |
| 本报北京9月4日讯记者杨涌报道部分省区人民日报宣传发行工作座谈会9月3日在4日在京举行。 | 杨涌 | 北京 | 人民日报 |
| 《红楼梦》是中央电视台和中国电视剧制作中心根据中国古典文学名著《红楼梦》摄制于1987年的一部古装连续剧由王扶林导演周汝昌、王蒙、周岭等多位红学家参与制作。 | 王扶林,周汝昌,王蒙,周岭 | 中国 | 中央电视台,中国电视剧制作中心 |
| 秦始皇兵马俑位于陕西省西安市1961年被国务院公布为第一批全国重点文物保护单位是世界八大奇迹之一。 | 秦始皇 | 陕西省,西安市 | 国务院 |
- See the README of each setting for the detailed process
- **[STANDARD (Fully Supervised)](https://github.com/zjunlp/DeepKE/tree/main/example/ner/standard)**
**Step1**: Enter `DeepKE/example/ner/standard` and download the dataset
```bash
wget 120.27.214.45/Data/ner/standard/data.tar.gz
tar -xzvf data.tar.gz
```
**Step2**: Train the model<br>
The dataset and parameters can be modified in the `data` and `conf` folders respectively
```
python run.py
```
**Step3**: Predict
```
python predict.py
```
- **[FEW-SHOT](https://github.com/zjunlp/DeepKE/tree/main/example/ner/few-shot)**
**Step1**: Enter `DeepKE/example/ner/few-shot` and download the dataset
```bash
wget 120.27.214.45/Data/ner/few_shot/data.tar.gz
tar -xzvf data.tar.gz
```
**Step2**: Train the model in the low-resource setting<br>
The model loading and saving locations and the parameters can be modified in the `conf` folder
```
python run.py +train=few_shot
```
To load a model, modify `load_path` in `few_shot.yaml`<br>
**Step3**: Append `- predict` to `config.yaml`; in `predict.yaml`, set `load_path` to the model path and `write_path` to the path where the predictions are saved; then run
```
python predict.py
```
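### 2. Relation Extraction (RE)
- Relation extraction extracts the relations between entities from unstructured text. Some examples follow; the data is stored in csv files: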
| Sentence | Relation | Head | Head_offset | Tail | Tail_offset |
| :----------------------------------------------------: | :------: | :--------: | :---------: | :--------: | :---------: |
| 《岳父也是爹》是王军执导的电视剧,由马恩然、范明主演。 | 导演 | 岳父也是爹 | 1 | 王军 | 8 |
| 《九玄珠》是在纵横中文网连载的一部小说,作者是龙马。 | 连载网站 | 九玄珠 | 1 | 纵横中文网 | 7 |
| 提起杭州的美景,西湖总是第一个映入脑海的词语。 | 所在城市 | 西湖 | 8 | 杭州 | 2 |
- See the README of each setting for the detailed process. RE includes the following three sub-functions:
- **[STANDARD (Fully Supervised)](https://github.com/zjunlp/DeepKE/tree/main/example/re/standard)**
**Step1**: Enter `DeepKE/example/re/standard` and download the dataset
```bash
wget 120.27.214.45/Data/re/standard/data.tar.gz
tar -xzvf data.tar.gz
```
**Step2**: Train the model<br>
The dataset and parameters can be modified in the `data` and `conf` folders respectively
```
python run.py
```
**Step3**: Predict
```
python predict.py
```
- **[FEW-SHOT](https://github.com/zjunlp/DeepKE/tree/main/example/re/few-shot)**
**Step1**: Enter `DeepKE/example/re/few-shot` and download the dataset
```bash
wget 120.27.214.45/Data/re/few_shot/data.tar.gz
tar -xzvf data.tar.gz
```
**Step2**: Train the model<br>
- The dataset and parameters can be modified in the `data` and `conf` folders respectively
- To resume from the last trained model, set `train_from_saved_model` in `conf/train.yaml` to the path where that model was saved. Training logs are saved in the root directory by default, which can be changed with `log_dir`
```
python run.py
```
**Step3**: Predict
```
python predict.py
```
- **[DOCUMENT](https://github.com/zjunlp/DeepKE/tree/main/example/re/document)** <br>
**Step1**: Enter `DeepKE/example/re/document` and download the dataset
```bash
wget 120.27.214.45/Data/re/document/data.tar.gz
tar -xzvf data.tar.gz
```
**Step2**: Train the model<br>
- The dataset and parameters can be modified in the `data` and `conf` folders respectively
- To resume from the last trained model, set `train_from_saved_model` in `conf/train.yaml` to the path where that model was saved. Training logs are saved in the root directory by default, which can be changed with `log_dir`
```
python run.py
```
**Step3**: Predict
```
python predict.py
```
### 3. Attribute Extraction (AE)
- The data is stored in csv files; some examples follow:
| Sentence | Att | Ent | Ent_offset | Val | Val_offset |
| :----------------------------------------------------------: | :------: | :------: | :--------: | :-----------: | :--------: |
| 张冬梅汉族1968年2月生河南淇县人 | 民族 | 张冬梅 | 0 | 汉族 | 6 |
| 诸葛亮,字孔明,三国时期杰出的军事家、文学家、发明家。 | 朝代 | 诸葛亮 | 0 | 三国时期 | 8 |
| 2014年10月1日许鞍华执导的电影《黄金时代》上映 | 上映时间 | 黄金时代 | 19 | 2014年10月1日 | 0 |
- See the README of each setting for the detailed process
- **[STANDARD (Fully Supervised)](https://github.com/zjunlp/DeepKE/tree/main/example/ae/standard)**
**Step1**: Enter `DeepKE/example/ae/standard` and download the dataset
```bash
wget 120.27.214.45/Data/ae/standard/data.tar.gz
tar -xzvf data.tar.gz
```
**Step2**: Train the model<br>
The dataset and parameters can be modified in the `data` and `conf` folders respectively
```
python run.py
```
**Step3**: Predict
```
python predict.py
```
<br>
# Notebook Tutorial
This toolkit provides several Notebook and Google Colab tutorials that users can use for hands-on debugging and study.
- Standard setting:
[NER Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/ner/standard/standard_ner_tutorial.ipynb)
[NER Colab](https://colab.research.google.com/drive/1rFiIcDNgpC002q9BbtY_wkeBUvbqVxpg?usp=sharing)
[RE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/re/standard/standard_re_BERT_tutorial.ipynb)
[RE Colab](https://colab.research.google.com/drive/1o6rKIxBqrGZNnA2IMXqiSsY2GWANAZLl?usp=sharing)
[AE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/ae/standard/standard_ae_tutorial.ipynb)
[AE Colab](https://colab.research.google.com/drive/1pgPouEtHMR7L9Z-QfG1sPYkJfrtRt8ML?usp=sharing)
- Low-resource:
[NER Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/ner/few-shot/fewshot_ner_tutorial.ipynb)
[NER Colab](https://colab.research.google.com/drive/1Xz0sNpYQNbkjhebCG5djrwM8Mj2Crj7F?usp=sharing)
[RE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/re/few-shot/fewshot_re_tutorial.ipynb)
[RE Colab](https://colab.research.google.com/drive/1o1ly6ORgerkm1fCDjEQb7hsN5WKyg3JH?usp=sharing)
- Document-level:
[RE Notebook](https://github.com/zjunlp/DeepKE/blob/main/tutorial-notebooks/re/document/document_re_tutorial.ipynb)
[RE Colab](https://colab.research.google.com/drive/1RGUBbbOBHlWJ1NXQLtP_YEUktntHtROa?usp=sharing)
<br>
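# Notes (FAQ)
1. When using Anaconda, consider adding a mirror located in mainland China for faster downloads, such as this [mirror](https://mirrors.tuna.tsinghua.edu.cn/help/anaconda/).
2. When using pip, consider using a mirror located in mainland China for faster downloads, such as the Aliyun mirror.
3. If `ModuleNotFoundError: No module named 'past'` appears after installation, run `pip install future` to resolve it.
4. When using pretrained language models, downloading them online at run time is slow. It is better to download them in advance and store them in the pretrained folder. See the `README.md` in that folder for the required files.
5. The old version of DeepKE is on the [deepke-v1.0](https://github.com/zjunlp/DeepKE/tree/deepke-v1.0) branch, and users can switch to that branch to use it. All capabilities of the old version have been migrated to standard relation extraction ([example/re/standard](https://github.com/zjunlp/DeepKE/blob/main/example/re/standard/README.md)).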
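
The missing-module situation from item 3 can be detected up front. The sketch below uses only the standard library; `module_available` is a hypothetical helper name, and the check relies on the `past` module being shipped by the `future` package, as the note above implies.

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None

if not module_available("past"):
    print("Module 'past' is missing; run: pip install future")
```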
<br>
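# Future Plans
- Add multimodal knowledge extraction in the next version of DeepKE.
- We provide long-term technical maintenance and answer questions. If you have any questions, please submit an issue.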
<br>
# Citation
If you use DeepKE, please cite it as follows:
```bibtex
@article{Zhang_DeepKE_A_Deep_2022,
author = {Zhang, Ningyu and Xu, Xin and Tao, Liankuan and Yu, Haiyang and Ye, Hongbin and Xie, Xin and Chen, Xiang and Li, Zhoubo and Li, Lei and Liang, Xiaozhuan and Yao, Yunzhi and Deng, Shumin and Zhang, Zhenru and Tan, Chuanqi and Huang, Fei and Zheng, Guozhou and Chen, Huajun},
journal = {http://arxiv.org/abs/2201.03335},
title = {{DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population}},
year = {2022}
}
```
<br>
# Project Members
Zhejiang University: 张宁豫, 陶联宽, 徐欣, 余海洋, 叶宏彬, 谢辛, 陈想, 黎洲波, 李磊, 梁孝转, 姚云志, 乔硕斐, 邓淑敏, 张文, 郑国轴, 陈华钧
DAMO Academy: 张珍茹, 谭传奇, 黄非

dataset/README.md Normal file

@ -0,0 +1 @@
Link: https://pan.baidu.com/s/1r7-Curgph4ffTlILh6JDJA Password: knya


@ -1,28 +0,0 @@
FROM ubuntu:18.04
LABEL maintainer="ZJUNLP"
LABEL repository="DeepKE"
ENV PYTHON_VERSION=3.8
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cmake \
wget \
git \
curl \
ca-certificates
RUN curl -o ~/miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-4.7.12-Linux-x86_64.sh && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b && \
rm ~/miniconda.sh
ENV PATH=/root/miniconda3/bin:$PATH
RUN conda create -y --name deepke python=$PYTHON_VERSION
# SHELL ["/root/miniconda3/bin/conda", "run", "-n", "deepke", "/bin/bash", "-c"]
RUN conda init bash
RUN cd ~ && \
git clone https://github.com/zjunlp/DeepKE.git


@ -1,20 +0,0 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)


@ -1,35 +0,0 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd

Binary file not shown. Size: 104 KiB


@ -1,118 +0,0 @@
a,
.wy-menu-vertical header,
.wy-menu-vertical p.caption,
.wy-nav-top .fa-bars,
.wy-menu-vertical a:hover,
.rst-content code.literal, .rst-content tt.literal
{
color: rgb(0, 63, 136) !important;
}
/* inspired by sphinx press theme */
.wy-menu.wy-menu-vertical li.toctree-l1.current > a {
border-left: solid 8px #4122f0 !important;
border-top: none;
border-bottom: none;
}
.wy-menu.wy-menu-vertical li.toctree-l1.current > ul {
border-left: solid 8px #4719ee !important;
}
/* inspired by sphinx press theme */
.wy-nav-side {
color: unset !important;
background: unset !important;
border-right: solid 1px #ccc !important;
}
.wy-side-nav-search,
.wy-nav-top,
.wy-menu-vertical li,
.wy-menu-vertical li a:hover,
.wy-menu-vertical li a
{
background: unset !important;
}
.wy-menu-vertical li.current a {
border-right: unset !important;
}
.wy-side-nav-search div,
.wy-menu-vertical a {
color: #404040 !important;
}
.wy-menu-vertical button.toctree-expand {
color: #333 !important;
}
.wy-nav-content {
max-width: unset;
}
.rst-content {
max-width: 900px;
}
.wy-nav-content .icon-home:before {
content: "Docs";
}
.wy-side-nav-search .icon-home:before {
content: "";
}
dl.field-list {
display: block !important;
}
dl.field-list > dt:after {
content: "" !important;
}
dl.field-list > dt {
display: table;
padding-left: 6px !important;
padding-right: 6px !important;
margin-bottom: 4px !important;
padding-bottom: 1px !important;
background: #f6ecd852;
border-left: solid 2px #ccc;
}
dl.py.class>dt
{
color: rgba(17, 16, 17, 0.822) !important;
background: rgb(226, 241, 250) !important;
border-top: solid 2px #58b5cc !important;
}
dl.py.method>dt
{
background: rgb(226, 241, 250) !important;
border-left: solid 2px #bcb3be !important;
}
dl.py.attribute>dt,
dl.py.property>dt
{
background: rgb(226, 241, 250) !important;
border-left: solid 2px #58b5cc !important;
}
.fa-plus-square-o::before, .wy-menu-vertical li button.toctree-expand::before,
.fa-minus-square-o::before, .wy-menu-vertical li.current > a button.toctree-expand::before, .wy-menu-vertical li.on a button.toctree-expand::before
{
content: "";
}
.rst-content .viewcode-back,
.rst-content .viewcode-link
{
font-size: 120%;
}

Binary file not shown. Size: 419 KiB

Binary file not shown. Size: 207 KiB

Binary file not shown. Size: 6.4 KiB


@ -1,82 +0,0 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys
sys.path.insert(0, os.path.abspath('../../src'))
import sphinx_rtd_theme
import doctest
import deepke
# -- Project information -----------------------------------------------------
project = 'DeepKE'
copyright = '2021, ZJUNLP'
author = 'tlk'
# The full version, including alpha/beta/rc tags
release = '1.0.0'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.doctest',
'sphinx.ext.intersphinx',
'sphinx.ext.mathjax',
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'sphinx.ext.githubpages',
'sphinx.ext.todo',
'sphinx.ext.coverage',
'sphinx_copybutton',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []
doctest_default_flags = doctest.NORMALIZE_WHITESPACE
autodoc_member_order = 'bysource'
intersphinx_mapping = {'python': ('https://docs.python.org/', None)}
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_css_files = ['css/custom.css']
# html_logo = './_static/logo.png'
html_context = {
"display_github": True, # Integrate GitHub
"github_user": "tlk1997", # Username
"github_repo": "test_doc", # Repo name
"github_version": "main", # Version
"conf_py_path": "/docs/source/", # Path in the checkout to the docs root
}


@ -1,9 +0,0 @@
Attribution Extraction
======================
.. toctree::
:maxdepth: 4
deepke.attribution_extraction.standard


@ -1,60 +0,0 @@
Models
======
deepke.attribution\_extraction.standard.models.BasicModule module
-----------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.models.BasicModule
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.models.BiLSTM module
------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.models.BiLSTM
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.models.Capsule module
-------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.models.Capsule
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.models.GCN module
---------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.models.GCN
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.models.LM module
--------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.models.LM
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.models.PCNN module
----------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.models.PCNN
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.models.Transformer module
-----------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.models.Transformer
:members:
:undoc-members:
:show-inheritance:


@ -1,60 +0,0 @@
Module
======
deepke.attribution\_extraction.standard.module.Attention module
---------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.module.Attention
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.module.CNN module
---------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.module.CNN
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.module.Capsule module
-------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.module.Capsule
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.module.Embedding module
---------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.module.Embedding
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.module.GCN module
---------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.module.GCN
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.module.RNN module
---------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.module.RNN
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.module.Transformer module
-----------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.module.Transformer
:members:
:undoc-members:
:show-inheritance:


@ -1,11 +0,0 @@
Standard
========
.. toctree::
:maxdepth: 4
deepke.attribution_extraction.standard.models
deepke.attribution_extraction.standard.module
deepke.attribution_extraction.standard.tools
deepke.attribution_extraction.standard.utils


@ -1,53 +0,0 @@
Tools
=====
deepke.attribution\_extraction.standard.tools.dataset module
------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.tools.dataset
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.tools.metrics module
------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.tools.metrics
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.tools.preprocess module
---------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.tools.preprocess
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.tools.serializer module
---------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.tools.serializer
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.tools.trainer module
------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.tools.trainer
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.tools.vocab module
----------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.tools.vocab
:members:
:undoc-members:
:show-inheritance:


@ -1,20 +0,0 @@
Utils
=====
deepke.attribution\_extraction.standard.utils.ioUtils module
------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.utils.ioUtils
:members:
:undoc-members:
:show-inheritance:
deepke.attribution\_extraction.standard.utils.nnUtils module
------------------------------------------------------------
.. automodule:: deepke.attribution_extraction.standard.utils.nnUtils
:members:
:undoc-members:
:show-inheritance:


@ -1,21 +0,0 @@
Models
======
deepke.name\_entity\_recognition.few\_shot.models.model module
--------------------------------------------------------------
.. automodule:: deepke.name_entity_recognition.few_shot.models.model
:members:
:undoc-members:
:show-inheritance:
deepke.name\_entity\_recognition.few\_shot.models.modeling\_bart module
-----------------------------------------------------------------------
.. automodule:: deepke.name_entity_recognition.few_shot.models.modeling_bart
:members:
:undoc-members:
:show-inheritance:


@ -1,35 +0,0 @@
Module
======
deepke.name\_entity\_recognition.few\_shot.module.datasets module
-----------------------------------------------------------------
.. automodule:: deepke.name_entity_recognition.few_shot.module.datasets
:members:
:undoc-members:
:show-inheritance:
deepke.name\_entity\_recognition.few\_shot.module.mapping\_type module
----------------------------------------------------------------------
.. automodule:: deepke.name_entity_recognition.few_shot.module.mapping_type
:members:
:undoc-members:
:show-inheritance:
deepke.name\_entity\_recognition.few\_shot.module.metrics module
----------------------------------------------------------------
.. automodule:: deepke.name_entity_recognition.few_shot.module.metrics
:members:
:undoc-members:
:show-inheritance:
deepke.name\_entity\_recognition.few\_shot.module.train module
--------------------------------------------------------------
.. automodule:: deepke.name_entity_recognition.few_shot.module.train
:members:
:undoc-members:
:show-inheritance:


@ -1,10 +0,0 @@
Few Shot
========
.. toctree::
:maxdepth: 4
deepke.name_entity_recognition.few_shot.models
deepke.name_entity_recognition.few_shot.module
deepke.name_entity_recognition.few_shot.utils


@ -1,11 +0,0 @@
Utils
=====
deepke.name\_entity\_recognition.few\_shot.utils.util module
------------------------------------------------------------
.. automodule:: deepke.name_entity_recognition.few_shot.utils.util
:members:
:undoc-members:
:show-inheritance:


@ -1,9 +0,0 @@
Name Entity Recognition
=======================
.. toctree::
:maxdepth: 4
deepke.name_entity_recognition.few_shot
deepke.name_entity_recognition.standard


@ -1,11 +0,0 @@
Models
======
deepke.name\_entity\_recognition.standard.models.InferBert module
-----------------------------------------------------------------
.. automodule:: deepke.name_entity_recognition.standard.models.InferBert
:members:
:undoc-members:
:show-inheritance:


@ -1,9 +0,0 @@
Standard
========
.. toctree::
:maxdepth: 4
deepke.name_entity_recognition.standard.models
deepke.name_entity_recognition.standard.tools


@ -1,20 +0,0 @@
Tools
=====
deepke.name\_entity\_recognition.standard.tools.dataset module
--------------------------------------------------------------
.. automodule:: deepke.name_entity_recognition.standard.tools.dataset
:members:
:undoc-members:
:show-inheritance:
deepke.name\_entity\_recognition.standard.tools.preprocess module
-----------------------------------------------------------------
.. automodule:: deepke.name_entity_recognition.standard.tools.preprocess
:members:
:undoc-members:
:show-inheritance:


@ -1,51 +0,0 @@
Document
========
deepke.relation\_extraction.document.evaluation module
------------------------------------------------------
.. automodule:: deepke.relation_extraction.document.evaluation
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.document.losses module
--------------------------------------------------
.. automodule:: deepke.relation_extraction.document.losses
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.document.model module
-------------------------------------------------
.. automodule:: deepke.relation_extraction.document.model
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.document.module module
--------------------------------------------------
.. automodule:: deepke.relation_extraction.document.module
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.document.prepro module
--------------------------------------------------
.. automodule:: deepke.relation_extraction.document.prepro
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.document.utils module
-------------------------------------------------
.. automodule:: deepke.relation_extraction.document.utils
:members:
:undoc-members:
:show-inheritance:


@ -1,26 +0,0 @@
Dataset
=======
deepke.relation\_extraction.few\_shot.dataset.base\_data\_module module
-----------------------------------------------------------------------
.. automodule:: deepke.relation_extraction.few_shot.dataset.base_data_module
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.few\_shot.dataset.dialogue module
-------------------------------------------------------------
.. automodule:: deepke.relation_extraction.few_shot.dataset.dialogue
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.few\_shot.dataset.processor module
--------------------------------------------------------------
.. automodule:: deepke.relation_extraction.few_shot.dataset.processor
:members:
:undoc-members:
:show-inheritance:


@ -1,27 +0,0 @@
Lit Models
==========
deepke.relation\_extraction.few\_shot.lit\_models.base module
-------------------------------------------------------------
.. automodule:: deepke.relation_extraction.few_shot.lit_models.base
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.few\_shot.lit\_models.transformer module
--------------------------------------------------------------------
.. automodule:: deepke.relation_extraction.few_shot.lit_models.transformer
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.few\_shot.lit\_models.util module
-------------------------------------------------------------
.. automodule:: deepke.relation_extraction.few_shot.lit_models.util
:members:
:undoc-members:
:show-inheritance:


@ -1,27 +0,0 @@
Few Shot
========
.. toctree::
:maxdepth: 4
deepke.relation_extraction.few_shot.dataset
deepke.relation_extraction.few_shot.lit_models
deepke.relation\_extraction.few\_shot.generate\_k\_shot module
--------------------------------------------------------------
.. automodule:: deepke.relation_extraction.few_shot.generate_k_shot
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.few\_shot.get\_label\_word module
-------------------------------------------------------------
.. automodule:: deepke.relation_extraction.few_shot.get_label_word
:members:
:undoc-members:
:show-inheritance:


@ -1,11 +0,0 @@
Relation Extraction
===================
.. toctree::
:maxdepth: 4
deepke.relation_extraction.document
deepke.relation_extraction.few_shot
deepke.relation_extraction.standard


@ -1,59 +0,0 @@
Models
======
deepke.relation\_extraction.standard.models.BasicModule module
--------------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.models.BasicModule
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.models.BiLSTM module
---------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.models.BiLSTM
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.models.Capsule module
----------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.models.Capsule
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.models.GCN module
------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.models.GCN
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.models.LM module
-----------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.models.LM
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.models.PCNN module
-------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.models.PCNN
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.models.Transformer module
--------------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.models.Transformer
:members:
:undoc-members:
:show-inheritance:


@ -1,59 +0,0 @@
Module
======
deepke.relation\_extraction.standard.module.Attention module
------------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.module.Attention
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.module.CNN module
------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.module.CNN
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.module.Capsule module
----------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.module.Capsule
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.module.Embedding module
------------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.module.Embedding
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.module.GCN module
------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.module.GCN
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.module.RNN module
------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.module.RNN
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.module.Transformer module
--------------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.module.Transformer
:members:
:undoc-members:
:show-inheritance:


@ -1,10 +0,0 @@
Standard
========
.. toctree::
:maxdepth: 4
deepke.relation_extraction.standard.models
deepke.relation_extraction.standard.module
deepke.relation_extraction.standard.tools
deepke.relation_extraction.standard.utils


@ -1,60 +0,0 @@
Tools
=====
deepke.relation\_extraction.standard.tools.dataset module
---------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.tools.dataset
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.tools.loss module
------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.tools.loss
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.tools.metrics module
---------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.tools.metrics
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.tools.preprocess module
------------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.tools.preprocess
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.tools.serializer module
------------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.tools.serializer
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.tools.trainer module
---------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.tools.trainer
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.tools.vocab module
-------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.tools.vocab
:members:
:undoc-members:
:show-inheritance:


@ -1,19 +0,0 @@
Utils
=====
deepke.relation\_extraction.standard.utils.ioUtils module
---------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.utils.ioUtils
:members:
:undoc-members:
:show-inheritance:
deepke.relation\_extraction.standard.utils.nnUtils module
---------------------------------------------------------
.. automodule:: deepke.relation_extraction.standard.utils.nnUtils
:members:
:undoc-members:
:show-inheritance:


@ -1,10 +0,0 @@
DeepKE
======
.. toctree::
:maxdepth: 4
deepke.attribution_extraction
deepke.name_entity_recognition
deepke.relation_extraction


@ -1,345 +0,0 @@
Example
=======
Standard NER
------------
The standard module is implemented by the pretrained model BERT.
**Step 1**
Enter ``DeepKE/example/ner/standard`` .
**Step 2**
Get data:
`wget 120.27.214.45/Data/ner/standard/data.tar.gz`
`tar -xzvf data.tar.gz`
The dataset and parameters can be customized in the ``data`` and ``conf`` folders, respectively.
The dataset must be provided as a ``TXT`` file.
The file must follow this format, with one character and its label per line:
杭 B-LOC '\\n'
州 I-LOC '\\n'
真 O '\\n'
美 O '\\n'
**Step 3**
Train:
`python run.py`
**Step 4**
Predict:
`python predict.py`
.. code-block:: bash
cd example/ner/standard
wget 120.27.214.45/Data/ner/standard/data.tar.gz
tar -xzvf data.tar.gz
python run.py
python predict.py
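
The ``TXT`` layout described above (one character and its label per line) can be read with a few lines of Python. This is a minimal sketch rather than DeepKE's own loader: the function name ``read_tagged_lines`` is hypothetical, and treating a blank line as a sentence separator is an assumption not stated in this document.

```python
def read_tagged_lines(text):
    """Parse `token label` lines into a list of sentences of (token, label) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Assumption: a blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.rsplit(maxsplit=1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences

sample = "杭 B-LOC\n州 I-LOC\n真 O\n美 O\n"
print(read_tagged_lines(sample))
```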
Few-shot NER
------------
This module targets the low-resource scenario.
**Step 1**
Enter ``DeepKE/example/ner/few-shot`` .
**Step 2**
Get data:
`wget 120.27.214.45/Data/ner/few_shot/data.tar.gz`
`tar -xzvf data.tar.gz`
The directories from which the model is loaded and to which it is saved, along with the configuration parameters, can be customized in the ``conf`` folder. The dataset can be customized in the ``data`` folder.
The dataset must be provided as a ``TXT`` file.
The file must follow this format, with one token and its label per line:
EU B-ORG '\\n'
rejects O '\\n'
German B-MISC '\\n'
call O '\\n'
to O '\\n'
boycott O '\\n'
British B-MISC '\\n'
lamb O '\\n'
. O '\\n'
**Step 3**
Train with CoNLL-2003:
`python run.py`
Train in the few-shot scenario:
`python run.py +train=few_shot`. To start from an existing model, set `load_path` in ``conf/train/few_shot.yaml``.
**Step 4**
Predict:
Add `- predict` to ``conf/config.yaml``, set `load_path` to the model path and `write_path` to the path where the predicted results are saved in ``conf/predict.yaml``, and then run `python predict.py`.
.. code-block:: bash
cd example/ner/few-shot
wget 120.27.214.45/Data/ner/few_shot/data.tar.gz
tar -xzvf data.tar.gz
python run.py
python predict.py
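
To make the B-/I-/O labels in the sample above concrete, the sketch below groups them into entity spans. It is an illustration under simple assumptions, not the project's decoding code: ``bio_to_spans`` is a hypothetical helper, stray ``I-`` labels without a matching ``B-`` are dropped, and tokens are joined with spaces, which suits English but not Chinese text.

```python
def bio_to_spans(tokens, labels):
    """Group B-/I- labels into (entity_text, entity_type) spans."""
    spans, current_tokens, current_type = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity begins; flush the previous one if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- label closes the open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
labels = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
print(bio_to_spans(tokens, labels))
```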
Standard RE
-----------
The standard module is implemented by common deep learning models, including CNN, RNN, Capsule, GCN, Transformer and the pretrained model.
**Step 1**
Enter the ``DeepKE/example/re/standard`` folder.
**Step 2**
Get data:
`wget 120.27.214.45/Data/re/standard/data.tar.gz`
`tar -xzvf data.tar.gz`
The dataset and parameters can be customized in the ``data`` and ``conf`` folders, respectively.
The dataset must be provided as a ``CSV`` file.
The data file must follow this format:
+--------------------------+-----------+------------+-------------+------------+------------+
| Sentence | Relation | Head | Head_offset | Tail | Tail_offset|
+--------------------------+-----------+------------+-------------+------------+------------+
The relation file must follow this format:
+------------+-----------+------------------+-------------+
| Head_type | Tail_type | relation | Index |
+------------+-----------+------------------+-------------+
**Step 3**
Train:
`python run.py`
**Step 4**
Predict:
`python predict.py`
.. code-block:: bash
cd example/re/standard
wget 120.27.214.45/Data/re/standard/data.tar.gz
tar -xzvf data.tar.gz
python run.py
python predict.py
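
The relation file described above maps each relation to an index. A minimal sketch of loading such a file into a dictionary follows; only the column names come from the table above, while the two data rows and the function name ``load_relation_index`` are hypothetical.

```python
import csv
import io

# Hypothetical relation rows; only the column names come from the table above.
RELATION_CSV = (
    "Head_type,Tail_type,relation,Index\n"
    "None,None,None,0\n"
    "person,place,birthplace,1\n"
)

def load_relation_index(csv_text):
    """Build a {relation: index} mapping from the relation CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["relation"]: int(row["Index"]) for row in reader}

relation_index = load_relation_index(RELATION_CSV)
print(relation_index)
```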
Few-shot RE
-----------
This module targets the low-resource scenario.
**Step 1**
Enter ``DeepKE/example/re/few-shot`` .
**Step 2**
Get data:
`wget 120.27.214.45/Data/re/few_shot/data.tar.gz`
`tar -xzvf data.tar.gz`
The dataset and parameters can be customized in the ``data`` and ``conf`` folders, respectively.
The dataset must be provided as ``TXT`` and ``JSON`` files.
The data file must follow this format:
{"token": ["the", "most", "common", "audits", "were", "about", "waste", "and", "recycling", "."], "h": {"name": "audits", "pos": [3, 4]}, "t": {"name": "waste", "pos": [6, 7]}, "relation": "Message-Topic(e1,e2)"}
The relation file must follow this format:
{"Other": 0 , "Message-Topic(e1,e2)": 1 ... }
**Step 3**
Train:
`python run.py`
To resume from the previously trained model, set `train_from_saved_model` in ``conf/train.yaml`` to the path where that model was saved. The directory for training logs can be customized with ``log_dir``.
**Step 4**
Predict:
`python predict.py`
.. code-block:: bash
cd example/re/few-shot
wget 120.27.214.45/Data/re/few_shot/data.tar.gz
tar -xzvf data.tar.gz
python run.py
python predict.py
Document RE
-----------
This module targets the document-level scenario.
**Step 1**
Enter ``DeepKE/example/re/document`` .
**Step 2**
Get data:
`wget 120.27.214.45/Data/re/document/data.tar.gz`
`tar -xzvf data.tar.gz`
The dataset and parameters can be customized in the ``data`` and ``conf`` folders respectively.
The dataset must be provided as a ``JSON`` file.
The data file must follow this format:
[{"vertexSet": [[{"name": "Lark Force", "pos": [0, 2], "sent_id": 0, "type": "ORG"},...]],
"labels": [{"r": "P607", "h": 1, "t": 3, "evidence": [0]}, ...],
"title": "Lark Force",
"sents": [["Lark", "Force", "was", "an", "Australian", "Army", "formation", "established", "in", "March", "1941", "during", "World", "War", "II", "for", "service", "in", "New", "Britain", "and", "New", "Ireland", "."],...}]
The relation file must follow this format:
{"P1376": 79,"P607": 27,...}
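To make the document-level layout more concrete, here is a hedged Python sketch that turns ``labels`` into (head, relation id, tail) triples. The second entity and the label in ``DOC`` are invented for illustration, the sentence is shortened, and ``pos`` is assumed to be a half-open token span inside the sentence given by ``sent_id``.

```python
# Hypothetical, shortened document in the layout shown above.
DOC = {
    "title": "Lark Force",
    "sents": [["Lark", "Force", "was", "an", "Australian", "Army", "formation"]],
    "vertexSet": [
        [{"name": "Lark Force", "pos": [0, 2], "sent_id": 0, "type": "ORG"}],
        # Invented second entity, only so that a label can point at it.
        [{"name": "Australian Army", "pos": [4, 6], "sent_id": 0, "type": "ORG"}],
    ],
    # Invented label: "h" and "t" index into vertexSet.
    "labels": [{"r": "P607", "h": 0, "t": 1, "evidence": [0]}],
}
REL2ID = {"P1376": 79, "P607": 27}


def mention_text(doc, mention):
    """Rebuild the surface form of a mention from its sentence tokens."""
    start, end = mention["pos"]
    return " ".join(doc["sents"][mention["sent_id"]][start:end])


def iter_triples(doc, rel2id):
    """Yield (head text, relation id, tail text), using each entity's first mention."""
    for label in doc["labels"]:
        head = doc["vertexSet"][label["h"]][0]
        tail = doc["vertexSet"][label["t"]][0]
        yield mention_text(doc, head), rel2id[label["r"]], mention_text(doc, tail)


triples = list(iter_triples(DOC, REL2ID))
```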
**Step 3**
Train:
`python run.py`
To resume from a previously trained model, set `train_from_saved_model` in ``conf/train.yaml`` to the path where that model was saved. The directory for the logs generated during training can be set with ``log_dir``.
**Step 4**
Predict:
`python predict.py`
.. code-block:: bash
cd example/re/document
wget 120.27.214.45/Data/re/document/data.tar.gz
tar -xzvf data.tar.gz
python run.py
python predict.py
Standard AE
-----------
The standard module is implemented by common deep learning models, including CNN, RNN, Capsule, GCN, Transformer and the pretrained model.
**Step 1**
Enter the ``DeepKE/example/ae/standard`` folder.
**Step 2**
Get data:
`wget 120.27.214.45/Data/ae/standard/data.tar.gz`
`tar -xzvf data.tar.gz`
The dataset and parameters can be customized in the ``data`` and ``conf`` folders respectively.
The dataset must be provided as a ``CSV`` file.
The data file must use the following columns:
+--------------------------+------------+------------+---------------+-------------------+------------------------+
| Sentence                 | Attribute  | Entity     | Entity_offset | Attribute_value   | Attribute_value_offset |
+--------------------------+------------+------------+---------------+-------------------+------------------------+
The attribute file must use the following columns:
+-------------------+-------------+
| Attribute         | Index       |
+-------------------+-------------+
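The sketch below shows one way to build a row in this layout. It derives both offsets with ``str.find``, the same call the attribute extraction ``predict.py`` in this repository uses for its offsets; the lower-case column names, the ``make_row`` helper and the English sample sentence are assumptions for illustration only.

```python
import csv
import io

FIELDS = [
    "sentence", "attribute", "entity", "entity_offset",
    "attribute_value", "attribute_value_offset",
]


def make_row(sentence, attribute, entity, value):
    """Build one data row, locating the entity and the value by first occurrence."""
    entity_offset = sentence.find(entity)
    value_offset = sentence.find(value)
    if entity_offset < 0 or value_offset < 0:
        raise ValueError("entity and attribute value must both occur in the sentence")
    return {
        "sentence": sentence,
        "attribute": attribute,
        "entity": entity,
        "entity_offset": entity_offset,
        "attribute_value": value,
        "attribute_value_offset": value_offset,
    }


# Hypothetical sample; real data for this task is Chinese text.
row = make_row("Ada Lovelace was born in 1815.", "birth_year", "Ada Lovelace", "1815")

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
```

Note that ``str.find`` returns the first occurrence, so a sentence that mentions the same string twice needs the offset set by hand.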
**Step 3**
Train:
`python run.py`
**Step 4**
Predict:
`python predict.py`
.. code-block:: bash
cd example/ae/standard
wget 120.27.214.45/Data/ae/standard/data.tar.gz
tar -xzvf data.tar.gz
python run.py
python predict.py
For more details, refer to https://www.bilibili.com/video/BV1n44y1x7iW?spm_id_from=333.999.0.0 .

View File

@ -1,13 +0,0 @@
FAQ
===
1. Using the nearest mirror speeds up the installation of Anaconda.
2. Using the nearest mirror speeds up pip install XXX.
3. When encountering ModuleNotFoundError: No module named 'past', run pip install future .
4. Installing the pretrained language models online is slow. We recommend downloading the pretrained models before use and saving them in the pretrained folder. Read README.md in each task directory for the specific requirements for saving pretrained models.
5. The old version of DeepKE is in the deepke-v1.0 branch. Users can switch to that branch to use the old version. The old version has been fully transferred to standard relation extraction (example/re/standard).

View File

@ -1,52 +0,0 @@
DeepKE Documentation
====================
Introduction
------------
.. image:: ./_static/logo.png
DeepKE is a knowledge extraction toolkit supporting low-resource and document-level scenarios. Built on PyTorch, it provides three functions: Named Entity Recognition, Relation Extraction and Attribute Extraction.
.. image:: ./_static/demo.gif
Support Weights & Biases
------------------------
.. image:: ./_static/wandb.png
To tune hyper-parameters automatically, DeepKE adopts Weights & Biases, a machine learning toolkit for developers to build better models faster.
With this toolkit, DeepKE can visualize results and tune hyper-parameters automatically.
The example scripts for all functions in the repository support the toolkit, and researchers can modify the metrics and hyper-parameter configuration as needed.
For detailed usage of this toolkit, refer to its official documentation.
Support Notebook Tutorials
--------------------------
We provide Google Colab tutorials and Jupyter notebooks in the GitHub repository as example implementations of every function in different scenarios.
These tutorials can be run directly and give developers and researchers a full picture of how DeepKE is applied.
You can open the Colab tutorial directly: https://colab.research.google.com/drive/1cM-zbLhEHkje54P0IZENrfe4HaXwZxZc?usp=sharing
.. toctree::
:glob:
:maxdepth: 1
:caption: Getting Started
start
install
example
faq
.. toctree::
:glob:
:maxdepth: 3
:caption: Package
deepke

View File

@ -1,39 +0,0 @@
Install
=======
Create environment
------------------
Create a virtual environment directly (Anaconda is recommended).
.. code-block:: bash
conda create -n deepke python=3.8
conda activate deepke
We also provide a Dockerfile to build a Docker image.
.. code-block:: bash
cd docker
docker build -t deepke .
conda activate deepke
Install by pypi
---------------
To use deepke directly:
.. code-block:: bash
pip install deepke
Install by setup.py
-------------------
To modify the source code before use:
.. code-block:: bash
python setup.py install

View File

@ -1,107 +0,0 @@
Start
=====
Model Framework
---------------
.. image:: ./_static/architectures.png
DeepKE contains three modules, for named entity recognition, relation extraction and attribute extraction respectively.
Each module has its own submodules. For example, the relation extraction module has standard, document-level and few-shot submodules.
Each submodule is composed of three parts: a collection of tools, which function as tokenizer, dataloader, preprocessor and the like; an encoder; and a part for training and prediction.
Dataset
-------
We use the following datasets in our experiments:
+--------------------------+-----------+------------------+----------+------------+
| Task                     | Settings  | Corpus           | Language | Model      |
+==========================+===========+==================+==========+============+
|                          |           | CoNLL-2003       | English  |            |
|                          | Standard  +------------------+----------+ BERT       |
|                          |           | People's Daily   | Chinese  |            |
|                          +-----------+------------------+----------+------------+
|                          |           | CoNLL-2003       |          |            |
|                          |           +------------------+          |            |
| Named Entity Recognition |           | MIT Movie        |          |            |
|                          | Few-shot  +------------------+ English  | LightNER   |
|                          |           | MIT Restaurant   |          |            |
|                          |           +------------------+          |            |
|                          |           | ATIS             |          |            |
+--------------------------+-----------+------------------+----------+------------+
|                          |           |                  |          | CNN        |
|                          |           |                  |          +------------+
|                          |           |                  |          | RNN        |
|                          |           |                  |          +------------+
|                          |           |                  |          | Capsule    |
|                          | Standard  | DuIE             | Chinese  +------------+
|                          |           |                  |          | GCN        |
|                          |           |                  |          +------------+
|                          |           |                  |          | Transformer|
|                          |           |                  |          +------------+
|                          |           |                  |          | BERT       |
|                          +-----------+------------------+----------+------------+
| Relation Extraction      |           | SEMEVAL(8-shot)  |          |            |
|                          |           +------------------+          |            |
|                          |           | SEMEVAL(16-shot) |          |            |
|                          | Few-shot  +------------------+ English  | KnowPrompt |
|                          |           | SEMEVAL(32-shot) |          |            |
|                          |           +------------------+          |            |
|                          |           | SEMEVAL(Full)    |          |            |
|                          +-----------+------------------+----------+------------+
|                          |           | DocRED           |          |            |
|                          |           +------------------+          |            |
|                          | Document  | CDR              | English  | DocuNet    |
|                          |           +------------------+          |            |
|                          |           | GDA              |          |            |
+--------------------------+-----------+------------------+----------+------------+
|                          |           |                  |          | CNN        |
|                          |           |                  |          +------------+
|                          |           |                  |          | RNN        |
|                          |           |                  |          +------------+
|                          |           |Triplet Extraction|          | Capsule    |
| Attribute Extraction     | Standard  |Dataset           | Chinese  +------------+
|                          |           |                  |          | GCN        |
|                          |           |                  |          +------------+
|                          |           |                  |          | Transformer|
|                          |           |                  |          +------------+
|                          |           |                  |          | BERT       |
+--------------------------+-----------+------------------+----------+------------+
Get Started
-----------
To use our code, clone the repository as follows:
.. code-block:: bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE

View File

@ -1,73 +0,0 @@
# Easy Start
<p align="left">
<b> English | <a href="https://github.com/zjunlp/DeepKE/blob/main/example/ae/standard/README_CN.md">简体中文</a> </b>
</p>
## Requirements
> python == 3.8
- torch == 1.5
- hydra-core == 1.0.6
- tensorboard == 2.4.1
- matplotlib == 3.4.1
- scikit-learn == 0.24.1
- transformers == 3.4.0
- jieba == 0.42.1
- deepke
## Download Code
```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ae/standard
```
## Install with Pip
- Create and enter the python virtual environment.
- Install dependencies: `pip install -r requirements.txt`.
## Train and Predict
- Dataset
- Download the dataset to this directory.
```bash
wget 120.27.214.45/Data/ae/standard/data.tar.gz
tar -xzvf data.tar.gz
```
- The dataset is stored in `data/origin`:
- `train.csv`: Training set
- `valid.csv`: Validation set
- `test.csv`: Test set
- `attribute.csv`: Attribute types
- Training
- Parameters for training are in the `conf` folder and users can modify them before training.
- If using LM, modify `lm_file` to use the local model.
- Logs for training are in the `logs` folder and the trained model is saved in the `checkpoints` folder.
```bash
python run.py
```
- Prediction
```bash
python predict.py
```
## Models
1. CNN
2. RNN
3. Capsule
4. GCN
5. Transformer
6. Pre-trained Model (BERT)

View File

@ -1,63 +0,0 @@
## Quick Start
<p align="left">
<b> <a href="https://github.com/zjunlp/DeepKE/blob/main/example/ae/standard/README.md">English</a> | 简体中文 </b>
</p>
### Requirements
> python == 3.8
- torch == 1.5
- hydra-core == 1.0.6
- tensorboard == 2.4.1
- matplotlib == 3.4.1
- scikit-learn == 0.24.1
- transformers == 3.4.0
- jieba == 0.42.1
- deepke
### Clone the Code
```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ae/standard
```
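### Install with pip
First create a Python virtual environment, then enter it.
- Install dependencies: ```pip install -r requirements.txt```
### Train and Predict with the Data
- Data: first download the data into this directory with ```wget 120.27.214.45/Data/ae/standard/data.tar.gz```
After extraction, the `data/origin` folder holds the data:
- `train.csv`: training set
- `valid.csv`: validation set
- `test.csv`: test set
- `attribute.csv`: attribute types
- Training: ```python run.py``` (all training parameters can be changed in the conf folder; when using an LM, set 'lm_file' to use a model downloaded locally)
- Logs for each training run are saved in the `logs` folder, and model results are saved in the `checkpoints` folder.
- Prediction: ```python predict.py```
## Models
1. CNN
2. RNN
3. Capsule
4. GCN
5. Transformer
6. Pre-trained model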

View File

@ -1,17 +0,0 @@
# ??? is a mandatory value.
# you should be able to set it without open_dict
# but if you try to read it before it's set an error will get thrown.
# populated at runtime
cwd: ???
defaults:
- hydra/output: custom
- preprocess
- train
- embedding
- predict
- model: cnn # [cnn, rnn, transformer, capsule, gcn, lm]

View File

@ -1,10 +0,0 @@
# populated at runtime
vocab_size: ???
word_dim: 60
pos_size: ??? # 2 * pos_limit + 2
pos_dim: 10 # ignored when dim_strategy is sum; forced to equal word_dim
dim_strategy: sum # [cat, sum]
# number of attribute types
num_attributes: 7

View File

@ -1,11 +0,0 @@
hydra:
  run:
    # Output directory for normal runs
    dir: logs/${now:%Y-%m-%d_%H-%M-%S}
  sweep:
    # Output directory for sweep runs
    dir: logs/${now:%Y-%m-%d_%H-%M-%S}
    # Output sub directory for sweep runs.
    subdir: ${hydra.job.num}_${hydra.job.id}

View File

@ -1,20 +0,0 @@
model_name: capsule
share_weights: True
num_iterations: 5 # number of iterations
dropout: 0.3
input_dim_capsule: ??? # taken from the preceding convolution, usually its output hidden_size
dim_capsule: 50 # dimension of each output capsule
num_capsule: ??? # number of output capsules, same as the number of classes, == num_attributes
# composition of the primary capsule
# can be embedding / cnn / rnn
# cnn for now
in_channels: ??? # taken from the embedding output, no need to set
out_channels: 100 # == input_dim_capsule
kernel_sizes: [9] # must be odd, and fairly large
activation: 'lrelu' # [relu, lrelu, prelu, selu, celu, gelu, sigmoid, tanh]
keep_length: False # avoids padding in too much useless information
pooling_strategy: cls # irrelevant, never actually used

View File

@ -1,13 +0,0 @@
model_name: cnn
in_channels: ??? # taken from the embedding output, no need to set
out_channels: 100
kernel_sizes: [3, 5, 7] # must be odd so that the CNN output keeps the sentence length
activation: 'gelu' # [relu, lrelu, prelu, selu, celu, gelu, sigmoid, tanh]
pooling_strategy: 'max' # [max, avg, cls]
keep_length: True
dropout: 0.3
# pcnn
use_pcnn: False
intermediate: 80

View File

@ -1,7 +0,0 @@
model_name: gcn
num_layers: 3
input_size: ??? # taken from the embedding output, no need to set
hidden_size: 100
dropout: 0.3

View File

@ -1,20 +0,0 @@
model_name: lm
# location of the pretrained language model, when one is used
# lm_name = 'bert-base-chinese' # download usage
#lm_file: 'pretrained'
lm_file: 'bert-base-chinese'
# number of transformer layers; base BERT starts with 12
# with little data, lowering this converges faster and works better
num_hidden_layers: 1
# parameters of the BiLSTM that follows
type_rnn: 'LSTM' # [RNN, GRU, LSTM]
input_size: 768 # this value is determined by BERT
hidden_size: 100 # must be even
num_layers: 1
dropout: 0.3
bidirectional: True
last_layer_hn: True

View File

@ -1,10 +0,0 @@
model_name: rnn
type_rnn: 'LSTM' # [RNN, GRU, LSTM]
input_size: ??? # taken from the embedding output, no need to set
hidden_size: 150 # must be even
num_layers: 2
dropout: 0.3
bidirectional: True
last_layer_hn: True

View File

@ -1,12 +0,0 @@
model_name: transformer
hidden_size: ??? # taken from the embedding output, no need to set
num_heads: 4 # hidden_size must be divisible by num_heads
num_hidden_layers: 3
intermediate_size: 256
dropout: 0.1
layer_norm_eps: 1e-12
hidden_act: gelu_new # [relu, gelu, swish, gelu_new]
output_attentions: True
output_hidden_states: True

View File

@ -1,2 +0,0 @@
# user-defined path of the saved model
fp: 'xxx/checkpoints/2019-12-03_17-35-30/cnn_epoch21.pth'

View File

@ -1,20 +0,0 @@
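# whether the data needs to be preprocessed
# no need to preprocess again when the preprocessing parameters are unchanged
preprocess: True
# location of the raw data
data_path: 'data/origin'
# location of the preprocessed output files
out_path: 'data/out'
# whether to apply Chinese word segmentation
chinese_split: True
# minimum word frequency when building the vocab
min_freq: 3
# sentence length limit: bounds the position of each token relative to the entity
# e.g. [-30, 30]; when embedding, everything is shifted by +31 to [1, 61]
# giving 62 pos tokens in total, with 0 reserved for pad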
pos_limit: 30
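
The position arithmetic described in the comments above can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, and clipping out-of-range positions to the limit is an assumption about how tokens farther than `pos_limit` from the entity are handled.

```python
POS_LIMIT = 30  # same value as pos_limit in this config


def position_indices(seq_len, entity_index, pos_limit=POS_LIMIT):
    """Map each token to a position id in [1, 2 * pos_limit + 1]; 0 is left for padding."""
    indices = []
    for i in range(seq_len):
        rel = i - entity_index                      # position relative to the entity
        rel = max(-pos_limit, min(pos_limit, rel))  # assumed: clip to [-pos_limit, pos_limit]
        indices.append(rel + pos_limit + 1)         # shift so the smallest id is 1
    return indices


# Size of the position embedding table: 2 * pos_limit + 1 ids plus the pad id 0.
pos_size = 2 * POS_LIMIT + 2
```

With `pos_limit: 30` this gives ids 1 to 61 plus the pad id, which matches the `pos_size = 2 * pos_limit + 2` set by the scripts at runtime.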

View File

@ -1,21 +0,0 @@
seed: 1
use_gpu: True
gpu_id: 0
epoch: 50
batch_size: 32
learning_rate: 3e-4
lr_factor: 0.7 # learning-rate decay factor
lr_patience: 3 # epochs to wait before decaying the learning rate
weight_decay: 1e-3 # L2 regularization
early_stopping_patience: 6
train_log: True
log_interval: 10
show_plot: True
only_comparison_plot: False
plot_utils: matplot # [matplot, tensorboard]
predict_plot: True

View File

@ -1,152 +0,0 @@
import os
import sys
import torch
import logging
import hydra
from hydra import utils
from deepke.attribution_extraction.standard.tools import Serializer
from deepke.attribution_extraction.standard.tools import _serialize_sentence, _convert_tokens_into_index, _add_pos_seq, _handle_attribute_data , _lm_serialize
import matplotlib.pyplot as plt
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))
from deepke.attribution_extraction.standard.utils import load_pkl, load_csv
import deepke.attribution_extraction.standard.models as models
logger = logging.getLogger(__name__)
def _preprocess_data(data, cfg):
    attribute_data = load_csv(os.path.join(cfg.cwd, cfg.data_path, 'attribute.csv'), verbose=False)
    atts = _handle_attribute_data(attribute_data)
    if cfg.model_name != 'lm':
        vocab = load_pkl(os.path.join(cfg.cwd, cfg.out_path, 'vocab.pkl'), verbose=False)
        cfg.vocab_size = vocab.count
        serializer = Serializer(do_chinese_split=cfg.chinese_split)
        serial = serializer.serialize
        _serialize_sentence(data, serial)
        _convert_tokens_into_index(data, vocab)
        _add_pos_seq(data, cfg)
        logger.info('start sentence preprocess...')
        formats = '\nsentence: {}\nchinese_split: {}\n' \
                  'tokens: {}\ntoken2idx: {}\nlength: {}\nentity_index: {}\nattribute_value_index: {}'
        logger.info(
            formats.format(data[0]['sentence'], cfg.chinese_split,
                           data[0]['tokens'], data[0]['token2idx'], data[0]['seq_len'],
                           data[0]['entity_index'], data[0]['attribute_value_index']))
    else:
        _lm_serialize(data, cfg)
    return data, atts
def _get_predict_instance(cfg):
    flag = input('Use the built-in example? [y/n], or type exit to quit: ')
    flag = flag.strip().lower()
    if flag == 'y' or flag == 'yes':
        sentence = '张冬梅汉族1968年2月生河南淇县人1988年7月加入中国共产党1989年9月参加工作中央党校经济管理专业毕业中央党校研究生学历'
        entity = '张冬梅'
        attribute_value = '汉族'
    elif flag == 'n' or flag == 'no':
        sentence = input('Enter the sentence: ')
        entity = input('Enter the entity in the sentence to predict: ')
        attribute_value = input('Enter the attribute value in the sentence to predict: ')
    elif flag == 'exit':
        sys.exit(0)
    else:
        print('please input yes or no, or exit!')
        # Ask again and return that result; falling through would use unset variables.
        return _get_predict_instance(cfg)
    instance = dict()
    instance['sentence'] = sentence.strip()
    instance['entity'] = entity.strip()
    instance['attribute_value'] = attribute_value.strip()
    instance['entity_offset'] = sentence.find(entity)
    instance['attribute_value_offset'] = sentence.find(attribute_value)
    return instance
@hydra.main(config_path='conf/config.yaml')
def main(cfg):
    cwd = utils.get_original_cwd()
    # cwd = cwd[0:-5]
    cfg.cwd = cwd
    cfg.pos_size = 2 * cfg.pos_limit + 2
    print(cfg.pretty())

    # get predict instance
    instance = _get_predict_instance(cfg)
    data = [instance]

    # preprocess data
    data, rels = _preprocess_data(data, cfg)

    # model
    __Model__ = {
        'cnn': models.PCNN,
        'rnn': models.BiLSTM,
        'transformer': models.Transformer,
        'gcn': models.GCN,
        'capsule': models.Capsule,
        'lm': models.LM,
    }

    # prediction is best run on the CPU
    cfg.use_gpu = False
    if cfg.use_gpu and torch.cuda.is_available():
        device = torch.device('cuda', cfg.gpu_id)
    else:
        device = torch.device('cpu')
    logger.info(f'device: {device}')

    model = __Model__[cfg.model_name](cfg)
    logger.info(f'model name: {cfg.model_name}')
    logger.info(f'\n {model}')
    model.load(cfg.fp, device=device)
    model.to(device)
    model.eval()

    x = dict()
    x['word'], x['lens'] = torch.tensor([data[0]['token2idx']]), torch.tensor([data[0]['seq_len']])
    if cfg.model_name != 'lm':
        x['entity_pos'], x['attribute_value_pos'] = torch.tensor([data[0]['entity_pos']]), torch.tensor([data[0]['attribute_value_pos']])
        if cfg.model_name == 'cnn':
            if cfg.use_pcnn:
                x['pcnn_mask'] = torch.tensor([data[0]['entities_pos']])
        if cfg.model_name == 'gcn':
            # no suitable parsing-tree tool was found, so the adjacency matrix is randomly initialized for now
            adj = torch.empty(1, data[0]['seq_len'], data[0]['seq_len']).random_(2)
            x['adj'] = adj

    for key in x.keys():
        x[key] = x[key].to(device)

    with torch.no_grad():
        y_pred = model(x)
        y_pred = torch.softmax(y_pred, dim=-1)[0]
        prob = y_pred.max().item()
        prob_att = list(rels.keys())[y_pred.argmax().item()]
        logger.info(f"\"{data[0]['entity']}\" and \"{data[0]['attribute_value']}\" have the attribute \"{prob_att}\" in the sentence, with confidence {prob:.2f}")

        if cfg.predict_plot:
            # a font with CJK glyphs, since the attribute names are Chinese
            plt.rcParams["font.family"] = 'Arial Unicode MS'
            x = list(rels.keys())
            height = list(y_pred.cpu().numpy())
            plt.bar(x, height)
            for x, y in zip(x, height):
                plt.text(x, y, '%.2f' % y, ha="center", va="bottom")
            plt.xlabel('Attribute')
            plt.ylabel('Confidence')
            plt.xticks(rotation=315)
            plt.show()


if __name__ == '__main__':
    main()

View File

@ -1,8 +0,0 @@
torch == 1.5
hydra-core == 1.0.6
tensorboard == 2.4.1
matplotlib == 3.4.1
scikit-learn == 0.24.1
transformers == 4.5.0
jieba == 0.42.1
deepke

View File

@ -1,167 +0,0 @@
import os
import hydra
import torch
import logging
import torch.nn as nn
from torch import optim
from hydra import utils
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
# self
import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))
import deepke.attribution_extraction.standard.models as models
from deepke.attribution_extraction.standard.tools import preprocess , CustomDataset, collate_fn ,train, validate
from deepke.attribution_extraction.standard.utils import manual_seed, load_pkl
import wandb
logger = logging.getLogger(__name__)
@hydra.main(config_path="conf/config.yaml")
def main(cfg):
    cwd = utils.get_original_cwd()
    # cwd = cwd[0:-5]
    cfg.cwd = cwd
    cfg.pos_size = 2 * cfg.pos_limit + 2
    logger.info(f'\n{cfg.pretty()}')

    wandb.init(project="DeepKE_AE_Standard", name=cfg.model_name)
    wandb.watch_called = False

    __Model__ = {
        'cnn': models.PCNN,
        'rnn': models.BiLSTM,
        'transformer': models.Transformer,
        'gcn': models.GCN,
        'capsule': models.Capsule,
        'lm': models.LM,
    }

    # device
    if cfg.use_gpu and torch.cuda.is_available():
        device = torch.device('cuda', cfg.gpu_id)
    else:
        device = torch.device('cpu')
    logger.info(f'device: {device}')

    # if the preprocessing itself is unchanged, this step is best disabled so the data is not preprocessed on every run
    if cfg.preprocess:
        preprocess(cfg)

    train_data_path = os.path.join(cfg.cwd, cfg.out_path, 'train.pkl')
    valid_data_path = os.path.join(cfg.cwd, cfg.out_path, 'valid.pkl')
    test_data_path = os.path.join(cfg.cwd, cfg.out_path, 'test.pkl')
    vocab_path = os.path.join(cfg.cwd, cfg.out_path, 'vocab.pkl')

    if cfg.model_name == 'lm':
        vocab_size = None
    else:
        vocab = load_pkl(vocab_path)
        vocab_size = vocab.count
    cfg.vocab_size = vocab_size

    train_dataset = CustomDataset(train_data_path)
    valid_dataset = CustomDataset(valid_data_path)
    test_dataset = CustomDataset(test_data_path)

    train_dataloader = DataLoader(train_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))
    valid_dataloader = DataLoader(valid_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))
    test_dataloader = DataLoader(test_dataset, batch_size=cfg.batch_size, shuffle=True, collate_fn=collate_fn(cfg))

    model = __Model__[cfg.model_name](cfg)
    model.to(device)
    wandb.watch(model, log="all")
    logger.info(f'\n {model}')

    optimizer = optim.Adam(model.parameters(), lr=cfg.learning_rate, weight_decay=cfg.weight_decay)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=cfg.lr_factor, patience=cfg.lr_patience)
    criterion = nn.CrossEntropyLoss()

    best_f1, best_epoch = -1, 0
    es_loss, es_f1, es_epoch, es_patience, best_es_epoch, best_es_f1, es_path, best_es_path = 1e8, -1, 0, 0, 0, -1, '', ''
    train_losses, valid_losses = [], []

    if cfg.show_plot and cfg.plot_utils == 'tensorboard':
        writer = SummaryWriter('tensorboard')
    else:
        writer = None

    logger.info('=' * 10 + ' Start training ' + '=' * 10)

    for epoch in range(1, cfg.epoch + 1):
        manual_seed(cfg.seed + epoch)
        train_loss = train(epoch, model, train_dataloader, optimizer, criterion, device, writer, cfg)
        valid_f1, valid_loss = validate(epoch, model, valid_dataloader, criterion, device, cfg)
        scheduler.step(valid_loss)
        model_path = model.save(epoch, cfg)
        # logger.info(model_path)

        train_losses.append(train_loss)
        valid_losses.append(valid_loss)

        wandb.log({
            "train_loss": train_loss,
            "valid_loss": valid_loss
        })

        if best_f1 < valid_f1:
            best_f1 = valid_f1
            best_epoch = epoch
        # valid loss is the criterion for early stopping
        if es_loss > valid_loss:
            es_loss = valid_loss
            es_f1 = valid_f1
            es_epoch = epoch
            es_patience = 0
            es_path = model_path
        else:
            es_patience += 1
            if es_patience >= cfg.early_stopping_patience:
                best_es_epoch = es_epoch
                best_es_f1 = es_f1
                best_es_path = es_path

    if cfg.show_plot:
        if cfg.plot_utils == 'matplot':
            plt.plot(train_losses, 'x-')
            plt.plot(valid_losses, '+-')
            plt.legend(['train', 'valid'])
            plt.title('train/valid comparison loss')
            plt.show()

        if cfg.plot_utils == 'tensorboard':
            for i in range(len(train_losses)):
                writer.add_scalars('train/valid_comparison_loss', {
                    'train': train_losses[i],
                    'valid': valid_losses[i]
                }, i)
            writer.close()

    logger.info(f'best(valid loss quota) early stopping epoch: {best_es_epoch}, '
                f'this epoch macro f1: {best_es_f1:0.4f}')
    logger.info(f'this model save path: {best_es_path}')
    logger.info(f'total {cfg.epoch} epochs, best(valid macro f1) epoch: {best_epoch}, '
                f'this epoch macro f1: {best_f1:.4f}')

    logger.info('=====end of training====')
    logger.info('')
    logger.info('=====start test performance====')
    _, test_loss = validate(-1, model, test_dataloader, criterion, device, cfg)

    wandb.log({
        "test_loss": test_loss,
    })

    logger.info('=====ending====')


if __name__ == '__main__':
    main()
    # python predict.py --help # show help for the parameters
    # python predict.py -c
    # python predict.py chinese_split=0,1 replace_entity_with_type=0,1 -m

View File

@ -1,112 +0,0 @@
# Easy Start
<p align="left">
<b> English | <a href="https://github.com/zjunlp/DeepKE/blob/main/example/ner/few-shot/README_CN.md">简体中文</a> </b>
</p>
## Requirements
> python == 3.8
- torch == 1.5
- transformers == 3.4.0
- deepke
## Download Code
```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ner/few-shot
```
## Install with Pip
- Create and enter the python virtual environment.
- Install dependencies: `pip install -r requirements.txt`.
## Train and Predict
- Dataset
- Download the dataset to this directory.
```bash
wget 120.27.214.45/Data/ner/few-shot/data.tar.gz
tar -xzvf data.tar.gz
```
- The datasets are stored in `data`, including CoNLL-2003, MIT-movie, MIT-restaurant and ATIS.
- **CoNLL-2003**
- `train.txt`: Training set
- `dev.txt`: Validation set
- `test.txt`: Test set
- `indomain-train.txt`: In-domain training set
- **MIT-movie, MIT-restaurant and ATIS**
- `k-shot-train.txt`: k=[10, 20, 50, 100, 200, 500], Training set
- `test.txt`: Testing set
- Training
- Parameters, model paths and configuration for training are in the `conf` folder and users can modify them before training.
- Training on CoNLL-2003
```bash
python run.py
```
- Few-shot Training
If a saved model needs to be loaded, modify `load_path` in `few_shot.yaml`
```bash
python run.py +train=few_shot
```
- Logs for training are in the `logs` folder. The path of the trained model can be customized.
- Prediction
- Add `- predict` in `config.yaml`
- Modify `load_path` as the path of the trained model and `write_path` as the path of predicted results in `predict.yaml`
- ```bash
python predict.py
```
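
As a small illustration of the k-shot file naming above, the following sketch builds the train and dev paths for one dataset. The helper name and the `data_root` parameter are hypothetical; the layout (a `k-shot-train.txt` file for training and `test.txt` reused as the dev split) follows the `DATA_PATH` table in `predict.py`.

```python
from pathlib import PurePosixPath

# Shot counts listed for MIT-movie, MIT-restaurant and ATIS.
K_VALUES = [10, 20, 50, 100, 200, 500]


def few_shot_paths(dataset_name, k, data_root="data"):
    """Return the train/dev file paths for a k-shot run (hypothetical helper)."""
    if k not in K_VALUES:
        raise ValueError(f"k must be one of {K_VALUES}, got {k}")
    root = PurePosixPath(data_root) / dataset_name
    return {
        "train": str(root / f"{k}-shot-train.txt"),
        "dev": str(root / "test.txt"),
    }


paths = few_shot_paths("mit-movie", 20)
```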
## Model
[LightNER](https://arxiv.org/abs/2109.00720)
## Cite
If you use or extend our work, please cite the following paper:
```bibtex
@article{DBLP:journals/corr/abs-2109-00720,
author = {Xiang Chen and
Ningyu Zhang and
Lei Li and
Xin Xie and
Shumin Deng and
Chuanqi Tan and
Fei Huang and
Luo Si and
Huajun Chen},
title = {LightNER: {A} Lightweight Generative Framework with Prompt-guided
Attention for Low-resource {NER}},
journal = {CoRR},
volume = {abs/2109.00720},
year = {2021},
url = {https://arxiv.org/abs/2109.00720},
eprinttype = {arXiv},
eprint = {2109.00720},
timestamp = {Mon, 20 Sep 2021 16:29:41 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2109-00720.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

View File

@ -1,91 +0,0 @@
## Quick Start
<p align="left">
<b> <a href="https://github.com/zjunlp/DeepKE/blob/main/example/ner/few-shot/README.md">English</a> | 简体中文 </b>
</p>
### Requirements
> python == 3.8
- torch == 1.5
- transformers == 3.4.0
- deepke
### Clone the Code
```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ner/few-shot
```
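### Install with pip
First create a Python virtual environment, then enter it.
- Install dependencies: ```pip install -r requirements.txt```
### Train and Predict with the Data
- Data: first download the data into this directory with ```wget 120.27.214.45/Data/ner/few_shot/data.tar.gz```
The `data` folder holds the training data, covering the CoNLL2003, MIT-movie, MIT-restaurant and ATIS datasets.
- conll2003 contains the following files:
- `train.txt`: training set
- `dev.txt`: validation set
- `test.txt`: test set
- `indomain-train.txt`: in-domain dataset
- MIT-movie, MIT-restaurant and ATIS contain the following files:
- `k-shot-train.txt`: k=[10, 20, 50, 100, 200, 500], training set
- `test.txt`: test set
- Training: the model load and save locations, as well as the configuration, can be changed in the conf folder
- Training on conll2003: ` python run.py ` (all training parameters can be changed in the conf folder)
- Few-shot training: ` python run.py +train=few_shot ` (to load a model, modify load_path in few_shot.yaml)
- Logs for each training run are saved in the `logs` folder; the directory for saving the model can be customized.
- Prediction: add - predict to config.yaml, then in predict.yaml set load_path to the model path and write_path to the path for the prediction results, then run ` python predict.py `
### Model
[LightNER](https://arxiv.org/abs/2109.00720)
## Cite
If you use the code above, please cite the following paper: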
```bibtex
@article{DBLP:journals/corr/abs-2109-00720,
author = {Xiang Chen and
Ningyu Zhang and
Lei Li and
Xin Xie and
Shumin Deng and
Chuanqi Tan and
Fei Huang and
Luo Si and
Huajun Chen},
title = {LightNER: {A} Lightweight Generative Framework with Prompt-guided
Attention for Low-resource {NER}},
journal = {CoRR},
volume = {abs/2109.00720},
year = {2021},
url = {https://arxiv.org/abs/2109.00720},
eprinttype = {arXiv},
eprint = {2109.00720},
timestamp = {Mon, 20 Sep 2021 16:29:41 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2109-00720.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```

View File

@ -1,5 +0,0 @@
cwd: ???
defaults:
- train/conll

View File

@ -1,28 +0,0 @@
cwd: ???
seed: 1
bart_name: "facebook/bart-large"
dataset_name: conll2003
device: cuda
num_epochs: 30
batch_size: 16
learning_rate: 2e-5
warmup_ratio: 0.01
eval_begin_epoch: 16
src_seq_ratio: 0.6
tgt_max_len: 10
num_beams: 1
length_penalty: 1
use_prompt: True
prompt_len: 10
prompt_dim: 800
freeze_plm: True
learn_weights: True
notes: ''
save_path: null # path for saving the model
load_path: load_path # path for loading the model, must not be empty
write_path: "data/conll2003/predict.txt"

View File

@ -1,25 +0,0 @@
seed: 1
bart_name: "facebook/bart-large"
dataset_name: conll2003
device: cuda
num_epochs: 30
batch_size: 16
learning_rate: 2e-5
warmup_ratio: 0.01
eval_begin_epoch: 16
src_seq_ratio: 0.6
tgt_max_len: 10
num_beams: 1
length_penalty: 1
use_prompt: True
prompt_len: 10
prompt_dim: 800
freeze_plm: True
learn_weights: True
save_path: save path # path for saving the model
load_path: null
notes: ''

View File

@ -1,25 +0,0 @@
seed: 1
bart_name: "facebook/bart-large"
dataset_name: mit-movie
device: cuda
num_epochs: 30
batch_size: 3
learning_rate: 5e-5
warmup_ratio: 0.01
eval_begin_epoch: 16
src_seq_ratio: 0.8
tgt_max_len: 10
num_beams: 1
length_penalty: 1
use_prompt: True
prompt_len: 10
prompt_dim: 800
freeze_plm: True
learn_weights: True
save_path: null # path for saving the model
load_path: null # path for loading the model
notes: ''

View File

@ -1,100 +0,0 @@
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]='0'
import logging
import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))
import hydra
from hydra import utils
from torch.utils.data import DataLoader
from deepke.name_entity_re.few_shot.models.model import PromptBartModel, PromptGeneratorModel
from deepke.name_entity_re.few_shot.module.datasets import ConllNERProcessor, ConllNERDataset
from deepke.name_entity_re.few_shot.module.train import Trainer
from deepke.name_entity_re.few_shot.utils.util import set_seed
from deepke.name_entity_re.few_shot.module.mapping_type import mit_movie_mapping, mit_restaurant_mapping, atis_mapping
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
from tensorboardX import SummaryWriter
writer = SummaryWriter(log_dir='logs')
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt = '%m/%d/%Y %H:%M:%S',
level = logging.INFO)
logger = logging.getLogger(__name__)
DATASET_CLASS = {
'conll2003': ConllNERDataset,
'mit-movie': ConllNERDataset,
'mit-restaurant': ConllNERDataset,
'atis': ConllNERDataset
}
DATA_PROCESS = {
'conll2003': ConllNERProcessor,
'mit-movie': ConllNERProcessor,
'mit-restaurant': ConllNERProcessor,
'atis': ConllNERProcessor
}
DATA_PATH = {
'conll2003': {'train': 'data/conll2003/train.txt',
'dev': 'data/conll2003/dev.txt',
'test': 'data/conll2003/test.txt'},
'mit-movie': {'train': 'data/mit-movie/20-shot-train.txt',
'dev': 'data/mit-movie/test.txt'},
'mit-restaurant': {'train': 'data/mit-restaurant/10-shot-train.txt',
'dev': 'data/mit-restaurant/test.txt'},
'atis': {'train': 'data/atis/20-shot-train.txt',
'dev': 'data/atis/test.txt'}
}
MAPPING = {
'conll2003': {'loc': '<<location>>',
'per': '<<person>>',
'org': '<<organization>>',
'misc': '<<others>>'},
'mit-movie': mit_movie_mapping,
'mit-restaurant': mit_restaurant_mapping,
'atis': atis_mapping
}
@hydra.main(config_path="conf/config.yaml")
def main(cfg):
    cwd = utils.get_original_cwd()
    cfg.cwd = cwd
    print(cfg)
    data_path = DATA_PATH[cfg.dataset_name]
    for mode, path in data_path.items():
        data_path[mode] = os.path.join(cfg.cwd, path)
    dataset_class, data_process = DATASET_CLASS[cfg.dataset_name], DATA_PROCESS[cfg.dataset_name]
    mapping = MAPPING[cfg.dataset_name]
    set_seed(cfg.seed) # set seed, default is 1
    if cfg.save_path is not None:  # make save_path dir
        cfg.save_path = os.path.join(cfg.save_path, cfg.dataset_name+"_"+str(cfg.batch_size)+"_"+str(cfg.learning_rate)+cfg.notes)
        if not os.path.exists(cfg.save_path):
            os.makedirs(cfg.save_path, exist_ok=True)
    process = data_process(data_path=data_path, mapping=mapping, bart_name=cfg.bart_name, learn_weights=cfg.learn_weights)
    test_dataset = dataset_class(data_processor=process, mode='test')
    test_dataloader = DataLoader(test_dataset, collate_fn=test_dataset.collate_fn, batch_size=cfg.batch_size, num_workers=4)
    label_ids = list(process.mapping2id.values())
    prompt_model = PromptBartModel(tokenizer=process.tokenizer, label_ids=label_ids, args=cfg)
    model = PromptGeneratorModel(prompt_model=prompt_model, bos_token_id=0,
                                 eos_token_id=1,
                                 max_length=cfg.tgt_max_len, max_len_a=cfg.src_seq_ratio,num_beams=cfg.num_beams, do_sample=False,
                                 repetition_penalty=1, length_penalty=cfg.length_penalty, pad_token_id=1,
                                 restricter=None)
    trainer = Trainer(train_data=None, dev_data=None, test_data=test_dataloader, model=model, process=process, args=cfg, logger=logger,
                      loss=None, metrics=None, writer=writer)
    trainer.predict()
if __name__ == "__main__":
    main()
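
The path handling in `main` can be looked at in isolation. The following is a minimal sketch, assuming a hypothetical helper `resolve_data_paths` that is not part of DeepKE and an example launch directory `/workspace`. Unlike the loop in the script, it returns a new dict instead of rewriting the module-level `DATA_PATH` entry in place.

```python
import os

# Hypothetical helper, not part of DeepKE: join each relative split path
# onto the directory the script was launched from.
def resolve_data_paths(data_path, cwd):
    # Build a new dict so the module-level table is left untouched.
    return {mode: os.path.join(cwd, path) for mode, path in data_path.items()}

# One entry copied from the DATA_PATH table in the script.
DATA_PATH = {
    'mit-movie': {'train': 'data/mit-movie/20-shot-train.txt',
                  'dev': 'data/mit-movie/test.txt'},
}

# '/workspace' is an example launch directory, chosen for illustration.
resolved = resolve_data_paths(DATA_PATH['mit-movie'], '/workspace')
print(resolved['train'])
print(resolved['dev'])
```

Returning a copy keeps repeated lookups of the same dataset from joining the prefix twice, which matters only if the table is read more than once per process.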

View File

@ -1,3 +0,0 @@
transformers==3.4.0
torch==1.7.0
tensorboardX==2.4

View File

@ -1,111 +0,0 @@
import os
import hydra
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]='1'
import logging
import sys
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../")))
from hydra import utils
from torch.utils.data import DataLoader
from deepke.name_entity_re.few_shot.models.model import PromptBartModel, PromptGeneratorModel
from deepke.name_entity_re.few_shot.module.datasets import ConllNERProcessor, ConllNERDataset
from deepke.name_entity_re.few_shot.module.train import Trainer
from deepke.name_entity_re.few_shot.module.metrics import Seq2SeqSpanMetric
from deepke.name_entity_re.few_shot.utils.util import get_loss, set_seed
from deepke.name_entity_re.few_shot.module.mapping_type import mit_movie_mapping, mit_restaurant_mapping, atis_mapping
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
import wandb
writer = wandb.init(project="DeepKE_NER_Few")
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt = '%m/%d/%Y %H:%M:%S',
level = logging.INFO)
logger = logging.getLogger(__name__)
DATASET_CLASS = {
'conll2003': ConllNERDataset,
'mit-movie': ConllNERDataset,
'mit-restaurant': ConllNERDataset,
'atis': ConllNERDataset
}
DATA_PROCESS = {
'conll2003': ConllNERProcessor,
'mit-movie': ConllNERProcessor,
'mit-restaurant': ConllNERProcessor,
'atis': ConllNERProcessor
}
DATA_PATH = {
'conll2003': {'train': 'data/conll2003/train.txt',
'dev': 'data/conll2003/dev.txt',
'test': 'data/conll2003/test.txt'},
'mit-movie': {'train': 'data/mit-movie/20-shot-train.txt',
'dev': 'data/mit-movie/test.txt'},
'mit-restaurant': {'train': 'data/mit-restaurant/10-shot-train.txt',
'dev': 'data/mit-restaurant/test.txt'},
'atis': {'train': 'data/atis/20-shot-train.txt',
'dev': 'data/atis/test.txt'}
}
MAPPING = {
'conll2003': {'loc': '<<location>>',
'per': '<<person>>',
'org': '<<organization>>',
'misc': '<<others>>'},
'mit-movie': mit_movie_mapping,
'mit-restaurant': mit_restaurant_mapping,
'atis': atis_mapping
}
@hydra.main(config_path="conf/config.yaml")
def main(cfg):
    cwd = utils.get_original_cwd()
    cfg.cwd = cwd
    print(cfg)
    data_path = DATA_PATH[cfg.dataset_name]
    for mode, path in data_path.items():
        data_path[mode] = os.path.join(cfg.cwd, path)
    dataset_class, data_process = DATASET_CLASS[cfg.dataset_name], DATA_PROCESS[cfg.dataset_name]
    mapping = MAPPING[cfg.dataset_name]
    set_seed(cfg.seed) # set seed, default is 1
    if cfg.save_path is not None:  # make save_path dir
        cfg.save_path = os.path.join(cfg.save_path, cfg.dataset_name+"_"+str(cfg.batch_size)+"_"+str(cfg.learning_rate)+cfg.notes)
        if not os.path.exists(cfg.save_path):
            os.makedirs(cfg.save_path, exist_ok=True)
    process = data_process(data_path=data_path, mapping=mapping, bart_name=cfg.bart_name, learn_weights=cfg.learn_weights)
    train_dataset = dataset_class(data_processor=process, mode='train')
    train_dataloader = DataLoader(train_dataset, collate_fn=train_dataset.collate_fn, batch_size=cfg.batch_size, num_workers=4)
    dev_dataset = dataset_class(data_processor=process, mode='dev')
    dev_dataloader = DataLoader(dev_dataset, collate_fn=dev_dataset.collate_fn, batch_size=cfg.batch_size, num_workers=4)
    label_ids = list(process.mapping2id.values())
    prompt_model = PromptBartModel(tokenizer=process.tokenizer, label_ids=label_ids, args=cfg)
    model = PromptGeneratorModel(prompt_model=prompt_model, bos_token_id=0,
                                 eos_token_id=1,
                                 max_length=cfg.tgt_max_len, max_len_a=cfg.src_seq_ratio,num_beams=cfg.num_beams, do_sample=False,
                                 repetition_penalty=1, length_penalty=cfg.length_penalty, pad_token_id=1,
                                 restricter=None)
    metrics = Seq2SeqSpanMetric(eos_token_id=1, num_labels=len(label_ids), target_type='word')
    loss = get_loss
    trainer = Trainer(train_data=train_dataloader, dev_data=dev_dataloader, test_data=None, model=model, args=cfg, logger=logger, loss=loss,
                      metrics=metrics, writer=writer)
    trainer.train()
if __name__ == "__main__":
    main()
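
The run directory name built in `main` is the dataset name, batch size and learning rate joined by underscores, followed by the free-form `notes` string. A minimal sketch of that naming rule, assuming a hypothetical helper `build_save_dir` and illustrative argument values that do not come from the project's config:

```python
import os

# Hypothetical helper, not part of DeepKE: mirrors the string concatenation
# used for cfg.save_path in the training script.
def build_save_dir(save_path, dataset_name, batch_size, learning_rate, notes=""):
    run_name = dataset_name + "_" + str(batch_size) + "_" + str(learning_rate) + notes
    return os.path.join(save_path, run_name)

# Illustrative values only.
save_dir = build_save_dir("checkpoints", "conll2003", 16, 2e-05, "_debug")
print(save_dir)
```

Because `notes` is appended without a separator, any separator has to be part of the `notes` value itself. Note also that `os.makedirs(..., exist_ok=True)` already tolerates an existing directory, so the preceding `os.path.exists` check in the script is redundant rather than required.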

View File

@ -1,65 +0,0 @@
# Easy Start
<p align="left">
<b> English | <a href="https://github.com/zjunlp/DeepKE/blob/main/example/ner/standard/README_CN.md">简体中文</a> </b>
</p>
## Requirements
> python == 3.8
- pytorch-transformers == 1.2.0
- torch == 1.5.0
- hydra-core == 1.0.6
- seqeval == 1.2.2
- tqdm == 4.60.0
- matplotlib == 3.4.1
- deepke
## Download Code
```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ner/standard
```
## Install with Pip
- Create and activate a Python virtual environment.
- Install dependencies: `pip install -r requirements.txt`.
## Train and Predict
- Dataset
- Download the dataset to this directory.
```bash
wget 120.27.214.45/Data/ner/standard/data.tar.gz
tar -xzvf data.tar.gz
```
- The dataset is stored in `data`
- `train.txt`: Training set
- `valid.txt`: Validation set
- `test.txt`: Test set
- Training
- Parameters for training are in the `conf` folder and users can modify them before training.
- Logs for training are in the `logs` folder and the trained model is saved in the `checkpoints` folder.
```bash
python run.py
```
- Prediction
```bash
python predict.py
```
## Model
BERT

View File

@ -1,57 +0,0 @@
## Quick Start
<p align="left">
<b> <a href="https://github.com/zjunlp/DeepKE/blob/main/example/ner/standard/README.md">English</a> | 简体中文 </b>
</p>
### Requirements
> python == 3.8
- pytorch-transformers == 1.2.0
- torch == 1.5.0
- hydra-core == 1.0.6
- seqeval == 1.2.2
- tqdm == 4.60.0
- matplotlib == 3.4.1
- deepke
### Clone the Code
```
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/ner/standard
```
### Install with pip
First create a Python virtual environment and activate it.
- Install dependencies: `pip install -r requirements.txt`
### Train and Predict
- Data: first download the dataset into this directory with ```wget 120.27.214.45/Data/ner/standard/data.tar.gz```.
The data is stored in the `data` folder:
- `train.txt`: training set
- `valid.txt`: validation set
- `test.txt`: test set
- Training: ```python run.py``` (all training parameters can be changed in the `conf` folder)
- Logs for each training run are saved in the `logs` folder, and the trained model is saved in the `checkpoints` folder.
- Prediction: ```python predict.py```
### Model
BERT

View File

@ -1,11 +0,0 @@
# ??? marks a mandatory value.
# It can be set without open_dict,
# but reading it before it is set raises an error.
# Populated at runtime.
cwd: ???
defaults:
- hydra/output: custom
- train
- predict

View File

@ -1,11 +0,0 @@
hydra:
run:
# Output directory for normal runs
dir: logs/${now:%Y-%m-%d_%H-%M-%S}
sweep:
# Output directory for sweep runs
dir: logs/${now:%Y-%m-%d_%H-%M-%S}
# Output sub directory for sweep runs.
subdir: ${hydra.job.num}_${hydra.job.id}
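
The `${now:...}` pattern in the config above is a strftime-style timestamp format, so each run gets its own directory under `logs/`. A minimal sketch of the resulting directory name, using only the standard library; the helper `run_dir` and the fixed example timestamp are assumptions for illustration, and Hydra itself is not invoked here:

```python
from datetime import datetime

# Hypothetical helper: reproduces the naming pattern logs/<timestamp>
# from the config, without involving Hydra.
def run_dir(now, pattern="%Y-%m-%d_%H-%M-%S"):
    return "logs/" + now.strftime(pattern)

# A fixed timestamp chosen for illustration.
print(run_dir(datetime(2021, 9, 1, 14, 30, 5)))  # logs/2021-09-01_14-30-05
```

Two runs started within the same second would map to the same directory, which is rarely a concern for manual launches.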

View File

@ -1 +0,0 @@
text: "秦始皇兵马俑位于陕西省西安市1961年被国务院公布为第一批全国重点文物保护单位是世界八大奇迹之一。"

View File

@ -1,25 +0,0 @@
adam_epsilon: 1e-8
bert_model: "bert-base-chinese"
data_dir: "data/"
do_eval: True
do_lower_case: True
do_train: True
eval_batch_size: 8
eval_on: "dev"
fp16: False
fp16_opt_level: "O1"
gpu_id: 1
gradient_accumulation_steps: 1
learning_rate: 5e-5
local_rank: -1
loss_scale: 0.0
max_grad_norm: 1.0
max_seq_length: 128
num_train_epochs: 3 # the number of training epochs
output_dir: "checkpoints"
seed: 42
task_name: "ner"
train_batch_size: 32
use_gpu: True # use gpu or not
warmup_proportion: 0.1
weight_decay: 0.01
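
Several of these values interact: the training script divides `train_batch_size` by `gradient_accumulation_steps`, derives the number of optimizer steps from the dataset size, and spends `warmup_proportion` of those steps on warm-up. A minimal sketch of that arithmetic, assuming a hypothetical helper `schedule_lengths` and a made-up dataset size of 10000 examples:

```python
# Hypothetical helper: mirrors the step arithmetic of the training script.
def schedule_lengths(num_examples, train_batch_size, gradient_accumulation_steps,
                     num_train_epochs, warmup_proportion):
    if gradient_accumulation_steps < 1:
        raise ValueError("gradient_accumulation_steps should be >= 1")
    # Batch size actually fed to the model on each forward pass.
    per_pass_batch = train_batch_size // gradient_accumulation_steps
    # Optimizer updates over the whole run.
    total_steps = int(num_examples / per_pass_batch / gradient_accumulation_steps) * num_train_epochs
    # Updates spent linearly warming up the learning rate.
    warmup_steps = int(warmup_proportion * total_steps)
    return per_pass_batch, total_steps, warmup_steps

# Config values from above; 10000 examples is an assumption for illustration.
print(schedule_lengths(10000, 32, 1, 3, 0.1))  # (32, 936, 93)
```

Raising `gradient_accumulation_steps` to 2 halves the per-pass batch to 16 while leaving the number of optimizer updates unchanged, which is the point of accumulating gradients.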

View File

@ -1,27 +0,0 @@
from deepke.name_entity_re.standard import *
import hydra
from hydra import utils
@hydra.main(config_path="conf", config_name='config')
def main(cfg):
    # Hydra changes the working directory, so locate the checkpoints from the original one
    model = InferNer(utils.get_original_cwd()+'/'+"checkpoints/")
    text = cfg.text
    print("NER sentence:")
    print(text)
    print('NER result:')
    result = model.predict(text)
    # Print each non-empty entity group followed by a readable label name
    for k,v in result.items():
        if v:
            print(v,end=': ')
            if k=='PER':
                print('Person')
            elif k=='LOC':
                print('Location')
            elif k=='ORG':
                print('Organization')
if __name__ == "__main__":
    main()
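
The tag-to-name branching in the prediction script can also be written as a lookup table. A minimal sketch, assuming the prediction result is a dict from tag to a list of entity strings (the return shape of `InferNer.predict` is not shown in this diff) and using a hand-written stand-in instead of real model output:

```python
# Display names for the tags handled by the prediction script.
LABEL_NAMES = {'PER': 'Person', 'LOC': 'Location', 'ORG': 'Organization'}

def format_result(result):
    """Return one 'entities: label' line per non-empty tag."""
    lines = []
    for tag, entities in result.items():
        if entities:
            # Unknown tags fall back to the raw tag name.
            lines.append("{}: {}".format(entities, LABEL_NAMES.get(tag, tag)))
    return lines

# Hand-written stand-in for a prediction, not real model output.
example = {'PER': ['Alice'], 'LOC': [], 'ORG': ['Acme']}
for line in format_result(example):
    print(line)
```

The fallback to the raw tag is a deliberate difference from the script, which prints the entities of an unrecognized tag without any label or newline.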

View File

@ -1,7 +0,0 @@
pytorch-transformers==1.2.0
torch==1.5.0
hydra-core==1.0.6
seqeval==0.0.5
tqdm==4.31.1
matplotlib==3.4.1
deepke

View File

@ -1,235 +0,0 @@
from __future__ import absolute_import, division, print_function
import csv
import json
import logging
import os
import random
import sys
import numpy as np
import torch
import torch.nn.functional as F
from pytorch_transformers import (WEIGHTS_NAME, AdamW, BertConfig, BertForTokenClassification, BertTokenizer, WarmupLinearSchedule)
from torch import nn
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler, TensorDataset)
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm, trange
from seqeval.metrics import classification_report
import hydra
from hydra import utils
from deepke.name_entity_re.standard import *
import wandb
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
datefmt = '%m/%d/%Y %H:%M:%S',
level = logging.INFO)
logger = logging.getLogger(__name__)
class TrainNer(BertForTokenClassification):
    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None,valid_ids=None,attention_mask_label=None,device=None):
        sequence_output = self.bert(input_ids, token_type_ids, attention_mask,head_mask=None)[0]
        batch_size,max_len,feat_dim = sequence_output.shape
        valid_output = torch.zeros(batch_size,max_len,feat_dim,dtype=torch.float32,device=device)
        # Pack the hidden states flagged by valid_ids to the front of each row; the tail stays zero
        for i in range(batch_size):
            jj = -1
            for j in range(max_len):
                if valid_ids[i][j].item() == 1:
                    jj += 1
                    valid_output[i][jj] = sequence_output[i][j]
        sequence_output = self.dropout(valid_output)
        logits = self.classifier(sequence_output)
        if labels is not None:
            # Label id 0 is excluded from the loss
            loss_fct = nn.CrossEntropyLoss(ignore_index=0)
            if attention_mask_label is not None:
                # Only score positions where the label mask is 1
                active_loss = attention_mask_label.view(-1) == 1
                active_logits = logits.view(-1, self.num_labels)[active_loss]
                active_labels = labels.view(-1)[active_loss]
                loss = loss_fct(active_logits, active_labels)
            else:
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            return loss
        else:
            return logits
wandb.init(project="DeepKE_NER_Standard")
@hydra.main(config_path="conf", config_name='config')
def main(cfg):
# Use gpu or not
if cfg.use_gpu and torch.cuda.is_available():
device = torch.device('cuda', cfg.gpu_id)
else:
device = torch.device('cpu')
if cfg.gradient_accumulation_steps < 1:
raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(cfg.gradient_accumulation_steps))
cfg.train_batch_size = cfg.train_batch_size // cfg.gradient_accumulation_steps
random.seed(cfg.seed)
np.random.seed(cfg.seed)
torch.manual_seed(cfg.seed)
if not cfg.do_train and not cfg.do_eval:
raise ValueError("At least one of `do_train` or `do_eval` must be True.")
# Checkpoints
if os.path.exists(utils.get_original_cwd()+'/'+cfg.output_dir) and os.listdir(utils.get_original_cwd()+'/'+cfg.output_dir) and cfg.do_train:
raise ValueError("Output directory ({}) already exists and is not empty.".format(utils.get_original_cwd()+'/'+cfg.output_dir))
if not os.path.exists(utils.get_original_cwd()+'/'+cfg.output_dir):
os.makedirs(utils.get_original_cwd()+'/'+cfg.output_dir)
# Preprocess the input dataset
processor = NerProcessor()
label_list = processor.get_labels()
num_labels = len(label_list) + 1
# Prepare the model
tokenizer = BertTokenizer.from_pretrained(cfg.bert_model, do_lower_case=cfg.do_lower_case)
train_examples = None
num_train_optimization_steps = 0
if cfg.do_train:
train_examples = processor.get_train_examples(utils.get_original_cwd()+'/'+cfg.data_dir)
num_train_optimization_steps = int(len(train_examples) / cfg.train_batch_size / cfg.gradient_accumulation_steps) * cfg.num_train_epochs
config = BertConfig.from_pretrained(cfg.bert_model, num_labels=num_labels, finetuning_task=cfg.task_name)
model = TrainNer.from_pretrained(cfg.bert_model,from_tf = False,config = config)
model.to(device)
param_optimizer = list(model.named_parameters())
no_decay = ['bias','LayerNorm.weight']
optimizer_grouped_parameters = [
{'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': cfg.weight_decay},
{'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
warmup_steps = int(cfg.warmup_proportion * num_train_optimization_steps)
optimizer = AdamW(optimizer_grouped_parameters, lr=cfg.learning_rate, eps=cfg.adam_epsilon)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=warmup_steps, t_total=num_train_optimization_steps)
global_step = 0
nb_tr_steps = 0
tr_loss = 0
label_map = {i : label for i, label in enumerate(label_list,1)}
if cfg.do_train:
train_features = convert_examples_to_features(train_examples, label_list, cfg.max_seq_length, tokenizer)
all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
all_valid_ids = torch.tensor([f.valid_ids for f in train_features], dtype=torch.long)
all_lmask_ids = torch.tensor([f.label_mask for f in train_features], dtype=torch.long)
train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids,all_valid_ids,all_lmask_ids)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=cfg.train_batch_size)
model.train()
for _ in trange(int(cfg.num_train_epochs), desc="Epoch"):
tr_loss = 0
nb_tr_examples, nb_tr_steps = 0, 0
for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
batch = tuple(t.to(device) for t in batch)
input_ids, input_mask, segment_ids, label_ids, valid_ids,l_mask = batch
loss = model(input_ids, segment_ids, input_mask, label_ids,valid_ids,l_mask,device)
if cfg.gradient_accumulation_steps > 1:
loss = loss / cfg.gradient_accumulation_steps
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), cfg.max_grad_norm)
tr_loss += loss.item()
nb_tr_examples += input_ids.size(0)
nb_tr_steps += 1
if (step + 1) % cfg.gradient_accumulation_steps == 0:
optimizer.step()
scheduler.step() # Update learning rate schedule
model.zero_grad()
global_step += 1
wandb.log({
"train_loss":tr_loss/nb_tr_steps
})
# Save a trained model and the associated configuration
model_to_save = model.module if hasattr(model, 'module') else model # Only save the model it-self
model_to_save.save_pretrained(utils.get_original_cwd()+'/'+cfg.output_dir)
tokenizer.save_pretrained(utils.get_original_cwd()+'/'+cfg.output_dir)
label_map = {i : label for i, label in enumerate(label_list,1)}
model_config = {"bert_model":cfg.bert_model,"do_lower":cfg.do_lower_case,"max_seq_length":cfg.max_seq_length,"num_labels":len(label_list)+1,"label_map":label_map}
json.dump(model_config,open(os.path.join(utils.get_original_cwd()+'/'+cfg.output_dir,"model_config.json"),"w"))
# Load a trained model and config that you have fine-tuned
else:
# Load a trained model and vocabulary that you have fine-tuned
model = TrainNer.from_pretrained(utils.get_original_cwd()+'/'+cfg.output_dir)
tokenizer = BertTokenizer.from_pretrained(utils.get_original_cwd()+'/'+cfg.output_dir, do_lower_case=cfg.do_lower_case)
model.to(device)
if cfg.do_eval:
if cfg.eval_on == "dev":
eval_examples = processor.get_dev_examples(utils.get_original_cwd()+'/'+cfg.data_dir)
elif cfg.eval_on == "test":
eval_examples = processor.get_test_examples(utils.get_original_cwd()+'/'+cfg.data_dir)
else:
raise ValueError("eval on dev or test set only")
eval_features = convert_examples_to_features(eval_examples, label_list, cfg.max_seq_length, tokenizer)
all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
all_valid_ids = torch.tensor([f.valid_ids for f in eval_features], dtype=torch.long)
all_lmask_ids = torch.tensor([f.label_mask for f in eval_features], dtype=torch.long)
eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids,all_valid_ids,all_lmask_ids)
# Run prediction for full data
eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=cfg.eval_batch_size)
model.eval()
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0
y_true = []
y_pred = []
label_map = {i : label for i, label in enumerate(label_list,1)}
for input_ids, input_mask, segment_ids, label_ids,valid_ids,l_mask in tqdm(eval_dataloader, desc="Evaluating"):
input_ids = input_ids.to(device)
input_mask = input_mask.to(device)
segment_ids = segment_ids.to(device)
valid_ids = valid_ids.to(device)
label_ids = label_ids.to(device)
l_mask = l_mask.to(device)
with torch.no_grad():
logits = model(input_ids, segment_ids, input_mask,valid_ids=valid_ids,attention_mask_label=l_mask,device=device)
logits = torch.argmax(F.log_softmax(logits,dim=2),dim=2)
logits = logits.detach().cpu().numpy()
label_ids = label_ids.to('cpu').numpy()
input_mask = input_mask.to('cpu').numpy()
for i, label in enumerate(label_ids):
temp_1 = []
temp_2 = []
for j,m in enumerate(label):
if j == 0:
continue
elif label_ids[i][j] == len(label_map):
y_true.append(temp_1)
y_pred.append(temp_2)
break
else:
temp_1.append(label_map[label_ids[i][j]])
temp_2.append(label_map[logits[i][j]])
report = classification_report(y_true, y_pred,digits=4)
logger.info("\n%s", report)
output_eval_file = os.path.join(utils.get_original_cwd()+'/'+cfg.output_dir, "eval_results.txt")
with open(output_eval_file, "w") as writer:
logger.info("***** Eval results *****")
logger.info("\n%s", report)
writer.write(report)
if __name__ == '__main__':
main()
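
The nested loop in `TrainNer.forward` moves the hidden states whose `valid_ids` flag is 1 to the front of each row and leaves the rest of the row as zeros. A minimal sketch of that compaction on plain Python lists, with strings standing in for hidden-state vectors; the helper name `compact_valid`, the example tokens, and the reading of `valid_ids` as "first sub-token of each word" are assumptions, since the feature conversion code is not part of this diff:

```python
# Hypothetical helper: list-based version of the valid_ids compaction.
def compact_valid(sequence_output, valid_ids, pad=0):
    compacted = []
    for row, flags in zip(sequence_output, valid_ids):
        # Keep flagged positions in their original order.
        kept = [vec for vec, flag in zip(row, flags) if flag == 1]
        # Fill the tail so every row keeps its original length.
        kept += [pad] * (len(row) - len(kept))
        compacted.append(kept)
    return compacted

# Stand-ins for the hidden states of: [CLS] play ##ing [SEP]
hidden = [['h_cls', 'h_play', 'h_ing', 'h_sep']]
flags = [[1, 1, 0, 1]]
print(compact_valid(hidden, flags))  # [['h_cls', 'h_play', 'h_sep', 0]]
```

After compaction, position `k` of a row lines up with the `k`-th flagged token, so one label per flagged token can be compared against the classifier output directly.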

View File

@ -1,81 +0,0 @@
# Easy Start
<p align="left">
<b> English | <a href="https://github.com/zjunlp/DeepKE/blob/main/example/re/document/README_CN.md">简体中文</a> </b>
</p>
## Requirements
> python == 3.8
- torch == 1.5.0
- transformers == 3.4.0
- opt-einsum == 3.3.0
- ujson
- deepke
## Download Code
```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/re/document
```
## Install with Pip
- Create and activate a Python virtual environment.
- Install dependencies: `pip install -r requirements.txt`.
## Train and Predict
- Dataset
- Download the dataset to this directory.
```bash
wget 120.27.214.45/Data/re/document/data.tar.gz
tar -xzvf data.tar.gz
```
- The dataset [DocRED](https://github.com/thunlp/DocRED/tree/master/) is stored in `data`:
- `dev.json`: Validation set
- `rel_info.json`: Relation set
- `rel2id.json`: Mapping from relation labels to IDs
- `test.json`: Test set
- `train_annotated.json`: Manually annotated training set
- `train_distant.json`: Training set generated by distant supervision
- Training
- Parameters, model paths and configuration for training are in the `conf` folder and users can modify them before training.
- Training on DocRED
```bash
python run.py
```
- The trained model is stored in the current directory by default.
- To resume training from a previously saved model, set `train_from_saved_model` in the `.yaml` file to the path of that model.
- Training logs are stored in the current directory by default; the path can be changed via `log_dir` in the `.yaml` file.
- Prediction
```bash
python predict.py
```
- After prediction, generated `result.json` is stored in the current directory
## Model
[DocuNet](https://arxiv.org/abs/2106.03618)

View File

@ -1,65 +0,0 @@
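## Quick Start
<p align="left">
<b> <a href="https://github.com/zjunlp/DeepKE/blob/main/example/re/document/README.md">English</a> | 简体中文 </b>
</p>
### Requirements
> python == 3.8
- torch == 1.5.0
- transformers == 3.4.0
- opt-einsum == 3.3.0
- ujson
- deepke
### Clone the Code
```
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/re/d
```
### Install with pip
First create a Python virtual environment and activate it.
- Install dependencies: ```pip install -r requirements.txt```
### Train and Predict
- Data: first download the dataset into this directory with ```wget 120.27.214.45/Data/re/document/data.tar.gz```.
The training data is stored in the `data` folder. The model uses the [DocRED](https://github.com/thunlp/DocRED/tree/master/) dataset, a document-level relation extraction dataset built from Wikipedia and Wikidata.
- DocRED contains the following files:
- `dev.json`: validation set
- `rel_info.json`: relation set
- `rel2id.json`: mapping from relation labels to IDs
- `test.json`: test set
- `train_annotated.json`: manually annotated training set
- `train_distant.json`: training set generated by distant supervision
- Training: the model load and save paths and the rest of the configuration can be changed in the `.yaml` files under `conf`
- Train on DocRED: `python run.py`
- The trained model is saved in the current directory
- To resume from a previously trained model, set `train_from_saved_model` in the `.yaml` file to the path of the saved model
- Training logs are saved in the current directory by default; the path can be configured via `log_dir` in the `.yaml` file
- Prediction: `python predict.py`
- The generated `result.json` is saved in the current directory
### Model
[DocuNet](https://arxiv.org/abs/2106.03618)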

View File

@ -1,3 +0,0 @@
defaults:
- hydra/output: custom
- train

View File

@ -1,11 +0,0 @@
hydra:
run:
# Output directory for normal runs
dir: logs/${now:%Y-%m-%d_%H-%M-%S}
sweep:
# Output directory for sweep runs
dir: logs/${now:%Y-%m-%d_%H-%M-%S}
# Output sub directory for sweep runs.
subdir: ${hydra.job.num}_${hydra.job.id}

View File

@ -1,32 +0,0 @@
adam_epsilon: 1e-06
bert_lr: 3e-05
channel_type: 'context-based'
config_name: ''
data_dir: 'data'
dataset: 'docred'
dev_file: 'dev.json'
down_dim: 256
evaluation_steps: -1
gradient_accumulation_steps: 2
learning_rate: 0.0004
log_dir: './train_roberta.log'
max_grad_norm: 1.0
max_height: 42
max_seq_length: 1024
model_name_or_path: 'roberta-base'
num_class: 97
num_labels: 4
num_train_epochs: 30
save_path: './model_roberta.pt'
seed: 111
test_batch_size: 2
test_file: 'test.json'
tokenizer_name: ''
train_batch_size: 2
train_file: 'train_annotated.json'
train_from_saved_model: ''
transformer_type: 'roberta'
unet_in_dim: 3
unet_out_dim: 256
warmup_ratio: 0.06
load_path: './model_roberta.pt'
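
The config carries two learning rates because the training script splits parameters into groups by name: names containing `bert_model` use `bert_lr`, names containing `extractor` or `bilinear` use a fixed `1e-4`, and everything else falls back to `learning_rate`. A minimal sketch of that name-based grouping on plain strings, assuming a hypothetical helper `group_parameters` and made-up parameter names:

```python
# Hypothetical helper: the same three name filters as the training script,
# applied to (name, parameter) pairs.
def group_parameters(named_params, bert_lr, default_lr):
    named = list(named_params)  # iterated three times below
    bert_layer = ['bert_model']
    extract_layer = ['extractor', 'bilinear']
    return [
        {"params": [p for n, p in named if any(k in n for k in bert_layer)], "lr": bert_lr},
        {"params": [p for n, p in named if any(k in n for k in extract_layer)], "lr": 1e-4},
        # The script leaves "lr" unset here so the optimizer default applies;
        # it is spelled out in this sketch for readability.
        {"params": [p for n, p in named if not any(k in n for k in extract_layer + bert_layer)],
         "lr": default_lr},
    ]

# Made-up parameter names; the strings double as stand-ins for the tensors.
names = ["bert_model.encoder.layer.0.weight", "extractor.weight",
         "bilinear.bias", "segmentation_net.conv.weight"]
groups = group_parameters([(n, n) for n in names], bert_lr=3e-05, default_lr=0.0004)
for group in groups:
    print(group["lr"], group["params"])
```

Since the filters are independent substring checks, a name matching both the first and the second filter would land in two groups; the names used in practice need to avoid that overlap.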

View File

@ -1,88 +0,0 @@
import os
import time
import hydra
from hydra.utils import get_original_cwd
import numpy as np
import torch
import ujson as json
from torch.utils.data import DataLoader
from transformers import AutoConfig, AutoModel, AutoTokenizer
from transformers.optimization import AdamW, get_linear_schedule_with_warmup
from deepke.relation_extraction.document import *
def report(args, model, features):
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    dataloader = DataLoader(features, batch_size=args.test_batch_size, shuffle=False, collate_fn=collate_fn, drop_last=False)
    preds = []
    for batch in dataloader:
        model.eval()
        inputs = {'input_ids': batch[0].to(device),
                  'attention_mask': batch[1].to(device),
                  'entity_pos': batch[3],
                  'hts': batch[4],
                  }
        with torch.no_grad():
            pred = model(**inputs)
            pred = pred.cpu().numpy()
            pred[np.isnan(pred)] = 0
            preds.append(pred)
    preds = np.concatenate(preds, axis=0).astype(np.float32)
    preds = to_official(args, preds, features)
    return preds
@hydra.main(config_path="conf/config.yaml")
def main(cfg):
cwd = get_original_cwd()
os.chdir(cwd)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(
cfg.config_name if cfg.config_name else cfg.model_name_or_path,
num_labels=cfg.num_class,
)
tokenizer = AutoTokenizer.from_pretrained(
cfg.tokenizer_name if cfg.tokenizer_name else cfg.model_name_or_path,
)
Dataset = ReadDataset(cfg, cfg.dataset, tokenizer, cfg.max_seq_length)
test_file = os.path.join(cfg.data_dir, cfg.test_file)
test_features = Dataset.read(test_file)
model = AutoModel.from_pretrained(
cfg.model_name_or_path,
from_tf=bool(".ckpt" in cfg.model_name_or_path),
config=config,
)
config.cls_token_id = tokenizer.cls_token_id
config.sep_token_id = tokenizer.sep_token_id
config.transformer_type = cfg.transformer_type
set_seed(cfg)
model = DocREModel(config, cfg, model, num_labels=cfg.num_labels)
model.load_state_dict(torch.load(cfg.load_path)['checkpoint'])
model.to(device)
T_features = test_features # Testing on the test set
#T_score, T_output = evaluate(cfg, model, T_features, tag="test")
pred = report(cfg, model, T_features)
with open("./result.json", "w") as fh:
json.dump(pred, fh)
if __name__ == "__main__":
main()

View File

@ -1,5 +0,0 @@
torch==1.8.1
transformers==4.7.0
opt-einsum==3.3.0
hydra-core==1.0.6
ujson

View File

@ -1,252 +0,0 @@
import os
import time
import hydra
from hydra.utils import get_original_cwd
import numpy as np
import torch
import ujson as json
from torch.utils.data import DataLoader
from transformers import AutoConfig, AutoModel, AutoTokenizer
from transformers.optimization import AdamW, get_linear_schedule_with_warmup
from deepke.relation_extraction.document import *
import wandb
def train(args, model, train_features, dev_features, test_features):
def logging(s, print_=True, log_=True):
if print_:
print(s)
if log_ and args.log_dir != '':
with open(args.log_dir, 'a+') as f_log:
f_log.write(s + '\n')
def finetune(features, optimizer, num_epoch, num_steps, model):
cur_model = model.module if hasattr(model, 'module') else model
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if args.train_from_saved_model != '':
best_score = torch.load(args.train_from_saved_model)["best_f1"]
epoch_delta = torch.load(args.train_from_saved_model)["epoch"] + 1
else:
epoch_delta = 0
best_score = -1
train_dataloader = DataLoader(features, batch_size=args.train_batch_size, shuffle=True, collate_fn=collate_fn, drop_last=True)
train_iterator = [epoch + epoch_delta for epoch in range(num_epoch)]
total_steps = int(len(train_dataloader) * num_epoch // args.gradient_accumulation_steps)
warmup_steps = int(total_steps * args.warmup_ratio)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
print("Total steps: {}".format(total_steps))
print("Warmup steps: {}".format(warmup_steps))
global_step = 0
log_step = 100
total_loss = 0
#scaler = GradScaler()
for epoch in train_iterator:
start_time = time.time()
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
model.train()
inputs = {'input_ids': batch[0].to(device),
'attention_mask': batch[1].to(device),
'labels': batch[2],
'entity_pos': batch[3],
'hts': batch[4],
}
#with autocast():
outputs = model(**inputs)
loss = outputs[0] / args.gradient_accumulation_steps
total_loss += loss.item()
# scaler.scale(loss).backward()
loss.backward()
if step % args.gradient_accumulation_steps == 0:
#scaler.unscale_(optimizer)
if args.max_grad_norm > 0:
# torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), args.max_grad_norm)
torch.nn.utils.clip_grad_norm_(cur_model.parameters(), args.max_grad_norm)
#scaler.step(optimizer)
#scaler.update()
#scheduler.step()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
global_step += 1
num_steps += 1
if global_step % log_step == 0:
cur_loss = total_loss / log_step
elapsed = time.time() - start_time
logging(
'| epoch {:2d} | step {:4d} | min/b {:5.2f} | lr {} | train loss {:5.3f}'.format(
epoch, global_step, elapsed / 60, scheduler.get_last_lr(), cur_loss * 1000))
total_loss = 0
start_time = time.time()
wandb.log({
"train_loss":cur_loss
})
if (step + 1) == len(train_dataloader) - 1 or (args.evaluation_steps > 0 and num_steps % args.evaluation_steps == 0 and step % args.gradient_accumulation_steps == 0):
# if step ==0:
logging('-' * 89)
eval_start_time = time.time()
dev_score, dev_output = evaluate(args, model, dev_features, tag="dev")
logging(
'| epoch {:3d} | time: {:5.2f}s | dev_result:{}'.format(epoch, time.time() - eval_start_time,
dev_output))
wandb.log({
"dev_result":dev_output
})
logging('-' * 89)
if dev_score > best_score:
best_score = dev_score
logging(
'| epoch {:3d} | best_f1:{}'.format(epoch, best_score))
wandb.log({
"best_f1":best_score
})
if args.save_path != "":
torch.save({
'epoch': epoch,
'checkpoint': cur_model.state_dict(),
'best_f1': best_score,
'optimizer': optimizer.state_dict()
}, args.save_path
, _use_new_zipfile_serialization=False)
logging(
'| successfully save model at: {}'.format(args.save_path))
logging('-' * 89)
return num_steps
cur_model = model.module if hasattr(model, 'module') else model
extract_layer = ["extractor", "bilinear"]
bert_layer = ['bert_model']
optimizer_grouped_parameters = [
{"params": [p for n, p in cur_model.named_parameters() if any(nd in n for nd in bert_layer)], "lr": args.bert_lr},
{"params": [p for n, p in cur_model.named_parameters() if any(nd in n for nd in extract_layer)], "lr": 1e-4},
{"params": [p for n, p in cur_model.named_parameters() if not any(nd in n for nd in extract_layer + bert_layer)]},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
if args.train_from_saved_model != '':
optimizer.load_state_dict(torch.load(args.train_from_saved_model)["optimizer"])
print("load saved optimizer from {}.".format(args.train_from_saved_model))
num_steps = 0
set_seed(args)
model.zero_grad()
finetune(train_features, optimizer, args.num_train_epochs, num_steps, model)
def evaluate(args, model, features, tag="dev"):
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
dataloader = DataLoader(features, batch_size=args.test_batch_size, shuffle=False, collate_fn=collate_fn, drop_last=False)
preds = []
total_loss = 0
for i, batch in enumerate(dataloader):
model.eval()
inputs = {'input_ids': batch[0].to(device),
'attention_mask': batch[1].to(device),
'labels': batch[2],
'entity_pos': batch[3],
'hts': batch[4],
}
with torch.no_grad():
output = model(**inputs)
loss = output[0]
pred = output[1].cpu().numpy()
pred[np.isnan(pred)] = 0
preds.append(pred)
total_loss += loss.item()
average_loss = total_loss / (i + 1)
preds = np.concatenate(preds, axis=0).astype(np.float32)
ans = to_official(args, preds, features)
if len(ans) > 0:
best_f1, _, best_f1_ign, _, re_p, re_r = official_evaluate(ans, args.data_dir)
output = {
tag + "_F1": best_f1 * 100,
tag + "_F1_ign": best_f1_ign * 100,
tag + "_re_p": re_p * 100,
tag + "_re_r": re_r * 100,
tag + "_average_loss": average_loss
}
return best_f1, output
wandb.init(project="DeepKE_RE_Document")
wandb.watch_called = False
@hydra.main(config_path="conf/config.yaml")
def main(cfg):
cwd = get_original_cwd()
os.chdir(cwd)
if not os.path.exists(os.path.join(cfg.data_dir, "train_distant.json")):
raise FileNotFoundError("Sorry, the file: 'train_distant.json' is too big to upload to github, \
please manually download to 'data/' from DocRED GoogleDrive https://drive.google.com/drive/folders/1c5-0YwnoJx8NS6CV2f-NoTHR__BdkNqw")
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(
cfg.config_name if cfg.config_name else cfg.model_name_or_path,
num_labels=cfg.num_class,
)
tokenizer = AutoTokenizer.from_pretrained(
cfg.tokenizer_name if cfg.tokenizer_name else cfg.model_name_or_path,
)
Dataset = ReadDataset(cfg, cfg.dataset, tokenizer, cfg.max_seq_length)
train_file = os.path.join(cfg.data_dir, cfg.train_file)
dev_file = os.path.join(cfg.data_dir, cfg.dev_file)
test_file = os.path.join(cfg.data_dir, cfg.test_file)
train_features = Dataset.read(train_file)
dev_features = Dataset.read(dev_file)
test_features = Dataset.read(test_file)
model = AutoModel.from_pretrained(
cfg.model_name_or_path,
from_tf=bool(".ckpt" in cfg.model_name_or_path),
config=config,
)
wandb.watch(model, log="all")
config.cls_token_id = tokenizer.cls_token_id
config.sep_token_id = tokenizer.sep_token_id
config.transformer_type = cfg.transformer_type
set_seed(cfg)
model = DocREModel(config, cfg, model, num_labels=cfg.num_labels)
if cfg.train_from_saved_model != '':
model.load_state_dict(torch.load(cfg.train_from_saved_model)["checkpoint"])
print("load saved model from {}.".format(cfg.train_from_saved_model))
#if torch.cuda.device_count() > 1:
# print("Let's use", torch.cuda.device_count(), "GPUs!")
# model = torch.nn.DataParallel(model, device_ids = list(range(torch.cuda.device_count())))
model.to(device)
train(cfg, model, train_features, dev_features, test_features)
if __name__ == "__main__":
main()

View File

@ -1,75 +0,0 @@
# Easy Start
<p align="left">
<b> English | <a href="https://github.com/zjunlp/DeepKE/blob/main/example/re/few-shot/README_CN.md">简体中文</a> </b>
</p>
## Requirements
> python == 3.8
- torch == 1.5
- transformers == 3.4.0
- hydra-core == 1.0.6
- deepke
## Download Code
```bash
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/re/few-shot
```
## Install with Pip
- Create and enter the python virtual environment.
- Install dependencies: `pip install -r requirements.txt`.
## Train and Predict
- Dataset
  - Download the dataset to this directory.
    ```bash
    wget 120.27.214.45/Data/re/few-shot/data.tar.gz
    tar -xzvf data.tar.gz
    ```
  - The dataset [SEMEVAL](https://semeval2.fbk.eu/semeval2.php?location=tasks#T11) is stored in `data`:
    - `rel2id.json`: Mapping from relation labels to IDs
    - `temp.txt`: Results of processed relation labels
    - `test.txt`: Test set
    - `train.txt`: Training set
    - `val.txt`: Validation set
- Training
  - Parameters, model paths and configuration for training are in the `conf` folder; modify them before training as needed.
  - Few-shot training on SEMEVAL
    ```bash
    python run.py
    ```
  - The trained model is stored in the current directory by default.
  - To resume training from the last-trained model, set `train_from_saved_model` in the `.yaml` file to the path of that model.
  - Training logs are stored in the current directory by default; the path can be configured by modifying `log_dir` in the `.yaml` file.
- Prediction
```bash
python predict.py
```
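The `rel2id.json` file listed under Dataset maps relation labels to integer IDs. The sketch below shows one way such a mapping could be read and inverted to turn predicted IDs back into labels. It assumes the file is a flat JSON object from label string to integer; the labels shown are illustrative placeholders, and `invert_mapping` is a hypothetical helper, not part of DeepKE.

```python
import json

# Illustrative stand-in for the contents of data/rel2id.json.
# Assumption: a flat JSON object mapping label string -> integer ID.
REL2ID_JSON = '{"Other": 0, "Cause-Effect(e1,e2)": 1, "Cause-Effect(e2,e1)": 2}'


def invert_mapping(rel2id):
    """Return an ID -> label dict, rejecting duplicate IDs (hypothetical helper)."""
    id2rel = {}
    for label, idx in rel2id.items():
        if idx in id2rel:
            raise ValueError(f"duplicate id {idx}: {label!r} and {id2rel[idx]!r}")
        id2rel[idx] = label
    return id2rel


rel2id = json.loads(REL2ID_JSON)
id2rel = invert_mapping(rel2id)
print(id2rel[1])
```

With the real file, `json.loads(REL2ID_JSON)` would be replaced by `json.load` on an open handle to `data/rel2id.json`.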
## Model
[KnowPrompt](https://arxiv.org/abs/2104.07650)

View File

@ -1,59 +0,0 @@
## Easy Start
<p align="left">
<b> <a href="https://github.com/zjunlp/DeepKE/blob/main/example/re/few-shot/README.md">English</a> | 简体中文 </b>
</p>
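### Requirements
> python == 3.8
- torch == 1.5
- transformers == 3.4.0
- hydra-core == 1.0.6
- deepke
### Download Code
```
git clone https://github.com/zjunlp/DeepKE.git
cd DeepKE/example/re/few-shot
```
### Install with Pip
First create a Python virtual environment, then enter it.
- Install dependencies: ```pip install -r requirements.txt```
### Train and Predict
- Store the data: first download the data into this directory with ```wget 120.27.214.45/Data/re/few_shot/data.tar.gz```
The `data` folder holds the training data. The model uses the [SEMEVAL](https://semeval2.fbk.eu/semeval2.php?location=tasks#T11) dataset, which comes from Task 8 of the 2010 International Workshop on Semantic Evaluation, "Multi-Way Classification of Semantic Relations Between Pairs of Nominals".
- SEMEVAL contains the following files:
- `rel2id.json`: Mapping from relation labels to IDs
- `temp.txt`: Processed relation labels
- `test.txt`: Test set
- `train.txt`: Training set
- `val.txt`: Validation set
- Start training: the model loading and saving locations, as well as other configuration, can be modified in the `.yaml` files under `conf`
- Few-shot training on the SEMEVAL dataset: `python run.py`
- The trained model is saved in the current directory by default
- To resume training from the last-trained model, set `train_from_saved_model` in the `.yaml` file to the path of the last saved model
- Training logs are saved in the current directory by default; the path can be configured with `log_dir` in the `.yaml` file
- Run prediction: `python predict.py`
## Model
[KnowPrompt](https://arxiv.org/abs/2104.07650)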

View File

@ -1,3 +0,0 @@
defaults:
  - hydra/output: custom
  - train

View File

@ -1,11 +0,0 @@
hydra:
  run:
    # Output directory for normal runs
    dir: logs/${now:%Y-%m-%d_%H-%M-%S}
  sweep:
    # Output directory for sweep runs
    dir: logs/${now:%Y-%m-%d_%H-%M-%S}
    # Output sub directory for sweep runs.
    subdir: ${hydra.job.num}_${hydra.job.id}
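
The `${now:...}` interpolation in this file is filled in by Hydra with the current time, formatted with `strftime`-style directives. As a rough illustration of the directory name this pattern yields, here is a minimal Python sketch; `run_dir_for` is a hypothetical helper written for this example, not a Hydra API.

```python
from datetime import datetime

# Same directives as the pattern used in the config.
PATTERN = "%Y-%m-%d_%H-%M-%S"


def run_dir_for(start_time, root="logs"):
    """Build the run directory name for a given start time (hypothetical helper)."""
    return f"{root}/{start_time.strftime(PATTERN)}"


example = run_dir_for(datetime(2021, 9, 1, 14, 30, 5))
print(example)  # logs/2021-09-01_14-30-05
```

Because the name has one-second resolution, two runs started within the same second would presumably map to the same directory.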

View File

@ -1,83 +0,0 @@
accelerator: None
accumulate_grad_batches: '1'
amp_backend: 'native'
amp_level: 'O2'
auto_lr_find: False
auto_scale_batch_size: False
auto_select_gpus: False
batch_size: 16
benchmark: False
check_val_every_n_epoch: '3'
checkpoint_callback: True
data_class: 'REDataset'
data_dir: 'data/k-shot/8-1'
default_root_dir: None
deterministic: False
devices: None
distributed_backend: None
fast_dev_run: False
flush_logs_every_n_steps: 100
gpus: None
gradient_accumulation_steps: 1
gradient_clip_algorithm: 'norm'
gradient_clip_val: 0.0
ipus: None
limit_predict_batches: 1.0
limit_test_batches: 1.0
limit_train_batches: 1.0
limit_val_batches: 1.0
litmodel_class: 'BertLitModel'
load_checkpoint: None
log_dir: './model_bert.log'
log_every_n_steps: 50
log_gpu_memory: None
logger: True
lr: 3e-05
lr_2: 3e-05
max_epochs: '30'
max_seq_length: 256
max_steps: None
max_time: None
min_epochs: None
min_steps: None
model_class: 'BertForMaskedLM'
model_name_or_path: 'bert-base-uncased'
move_metrics_to_cpu: False
multiple_trainloader_mode: 'max_size_cycle'
num_nodes: 1
num_processes: 1
num_sanity_val_steps: 2
num_train_epochs: 30
num_workers: 8
optimizer: 'AdamW'
overfit_batches: 0.0
plugins: None
precision: 32
prepare_data_per_node: True
process_position: 0
profiler: None
progress_bar_refresh_rate: None
ptune_k: 7
reload_dataloaders_every_epoch: False
reload_dataloaders_every_n_epochs: 0
replace_sampler_ddp: True
resume_from_checkpoint: None
save_path: './model_bert.pt'
seed: 666
stochastic_weight_avg: False
sync_batchnorm: False
t_lambda: 0.001
task_name: 'wiki80'
terminate_on_nan: False
tpu_cores: None
track_grad_norm: -1
train_from_saved_model: ''
truncated_bptt_steps: None
two_steps: False
use_prompt: True
val_check_interval: 1.0
wandb: False
weight_decay: 0.01
weights_save_path: None
weights_summary: 'top'
load_path: './model_bert.pt'
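
A caveat when reading values from this file: in standard YAML a bare `None` is a plain string (null is spelled `null` or `~`), and quoted numbers such as `'30'` are strings as well. How the training code treats these values is not shown here. The sketch below is only one possible way to normalize such values after loading; `normalize` is a hypothetical helper, and the `raw` dict mimics a few keys from the config rather than loading the file.

```python
def normalize(value):
    """Map the string 'None' to None and digit-only strings to int.

    Hypothetical helper for illustration; every other value passes through.
    """
    if isinstance(value, str):
        if value == "None":
            return None
        if value.isdigit():
            return int(value)
    return value


# Values as a YAML loader would return them for a few keys of the config.
raw = {"gpus": "None", "max_epochs": "30", "batch_size": 16, "optimizer": "AdamW"}
clean = {key: normalize(val) for key, val in raw.items()}
print(clean)
```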

View File

@ -1,83 +0,0 @@
import hydra
from hydra.utils import get_original_cwd
import numpy as np
import torch
from torch.utils.data.dataloader import DataLoader
import yaml
import time
from transformers import AutoConfig, AutoModelForMaskedLM
from transformers.optimization import get_linear_schedule_with_warmup
import os
from tqdm import tqdm
from deepke.relation_extraction.few_shot import *

os.environ["TOKENIZERS_PARALLELISM"] = "false"


# Print a message and/or append it to the log file at `log_dir`.
def logging(log_dir, s, print_=True, log_=True):
    if print_:
        print(s)
    if log_dir != '' and log_:
        with open(log_dir, 'a+') as f_log:
            f_log.write(s + '\n')


def test(args, model, lit_model, data):
    # Run the test set through the model without tracking gradients.
    model.eval()
    with torch.no_grad():
        test_loss = []
        for test_index, test_batch in enumerate(tqdm(data.test_dataloader())):
            loss = lit_model.test_step(test_batch, test_index)
            test_loss.append(loss)
        f1 = lit_model.test_epoch_end(test_loss)
        logging(args.log_dir,
                '| test_result: {}'.format(f1))
        logging(args.log_dir, '-' * 89)


@hydra.main(config_path="conf/config.yaml")
def main(cfg):
    # Hydra switches to its own output directory; go back so relative data paths resolve.
    cwd = get_original_cwd()
    os.chdir(cwd)

    # Generate label words and the k-shot split if they are not on disk yet.
    if not os.path.exists(f"data/{cfg.model_name_or_path}.pt"):
        get_label_word(cfg)
    if not os.path.exists(cfg.data_dir):
        generate_k_shot(cfg.data_dir)

    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

    data = REDataset(cfg)
    data_config = data.get_data_config()

    config = AutoConfig.from_pretrained(cfg.model_name_or_path)
    config.num_labels = data_config["num_labels"]
    model = AutoModelForMaskedLM.from_pretrained(cfg.model_name_or_path, config=config)

    # if torch.cuda.device_count() > 1:
    #     print("Let's use", torch.cuda.device_count(), "GPUs!")
    #     model = torch.nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
    model.to(device)

    lit_model = BertLitModel(args=cfg, model=model, device=device, tokenizer=data.tokenizer)
    data.setup()

    # strict=False tolerates keys that do not match between checkpoint and model.
    model.load_state_dict(torch.load(cfg.load_path)["checkpoint"], strict=False)
    print("load trained model from {}.".format(cfg.load_path))

    test(cfg, model, lit_model, data)


if __name__ == "__main__":
    main()
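
Despite its name, `log_dir` in this script is opened directly as a file (the config sets it to `./model_bert.log`), and an empty string disables file logging. A minimal standalone sketch of that behavior follows; the helper is renamed `log_line` here to avoid confusion with the standard-library `logging` module, and the temporary path and messages are placeholders.

```python
import os
import tempfile


def log_line(log_path, s, print_=True, log_=True):
    """Print `s` and, if a path is given, append it to that file."""
    if print_:
        print(s)
    if log_path != '' and log_:
        with open(log_path, 'a+') as f_log:
            f_log.write(s + '\n')


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_bert.log")
    log_line(path, "| test_result: <f1>")       # placeholder message
    log_line(path, "-" * 89)
    log_line("", "printed only, not written")   # empty path skips the file
    with open(path) as f:
        lines = f.read().splitlines()

print(len(lines))
```

Since the file is opened in append mode, repeated runs keep adding to the same log unless the path is changed.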

View File

@ -1,3 +0,0 @@
torch==1.5
transformers==3.4.0
hydra-core==1.0.6
