简体中文 | English

A Deep Learning Based Knowledge Extraction Toolkit
for Knowledge Base Population

DeepKE is a knowledge extraction toolkit supporting low-resource and document-level scenarios. Built on PyTorch, it provides three functions: Named Entity Recognition, Relation Extraction and Attribute Extraction.

What's New

Dec, 2021

We have added a Dockerfile to create the environment automatically.

Nov, 2021

The demo of DeepKE, which supports real-time extraction without deployment or training, has been released.
The documentation of DeepKE, which covers details such as the source code and datasets, has been released.

Oct, 2021

pip install deepke
The code of deepke-v2.0 has been released.

May, 2021

pip install deepke
The code of deepke-v1.0 has been released.

Prediction

There is a demonstration of prediction.

Model Framework

Figure 1: The framework of DeepKE

DeepKE contains a unified framework for named entity recognition, relation extraction and attribute extraction, the three knowledge extraction functions.
Each task can be implemented in different scenarios. For example, we can achieve relation extraction in standard, low-resource (few-shot) and document-level settings.
Each application scenario comprises three components: Data (Tokenizer, Preprocessor and Loader), Model (Module, Encoder and Forwarder) and Core (Training, Evaluation and Prediction).

Quickstart

DeepKE can be installed with pip install deepke. The steps below take fully supervised relation extraction as an example.
(Please star✨ and fork 📝 !!!)

Step1 Download the basic code

git clone https://github.com/zjunlp/DeepKE.git

Step2 Create a virtual environment using Anaconda and enter it.

We also provide a Dockerfile in the docker folder, which you can use to build your own image.

conda create -n deepke python=3.8

conda activate deepke

Install DeepKE from source

python setup.py install

python setup.py develop

Install DeepKE with pip
```
pip install deepke
```

Step3 Enter the task directory

cd DeepKE/example/re/standard

Step4 Download the dataset

wget 120.27.214.45/Data/re/standard/data.tar.gz

tar -xzvf data.tar.gz
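
On systems without wget or tar (a plain Windows shell, for example), Step4 can be approximated with the Python standard library. The following is a minimal sketch, not part of DeepKE: the function names `extract_archive` and `fetch_and_extract` are hypothetical, and the `http://` scheme in front of the address from Step4 is an assumption.

```python
import tarfile
import urllib.request
from pathlib import Path
from typing import List

# Address taken from Step4; the http:// scheme is an assumption.
DATA_URL = "http://120.27.214.45/Data/re/standard/data.tar.gz"


def extract_archive(archive: Path, dest: Path) -> List[str]:
    """Unpack a .tar.gz archive into dest and return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names


def fetch_and_extract(url: str, dest: Path) -> List[str]:
    """Download the archive into dest, then unpack it (wget + tar -xzvf)."""
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / "data.tar.gz"
    urllib.request.urlretrieve(url, archive)
    return extract_archive(archive, dest)


# Usage, run from DeepKE/example/re/standard:
# fetch_and_extract(DATA_URL, Path("."))
```

Only extract archives from a source you trust, since `extractall` writes whatever paths the archive contains.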

Step5 Training (Parameters for training can be changed in the conf folder)

We support visual parameter tuning using wandb

python run.py

Step6 Prediction (Parameters for prediction can be changed in the conf folder)

python predict.py

Requirements

python == 3.8

torch == 1.5
hydra-core == 1.0.6
tensorboard == 2.4.1
matplotlib == 3.4.1
transformers == 3.4.0
jieba == 0.42.1
scikit-learn == 0.24.1
pytorch-transformers == 1.2.0
seqeval == 1.2.2
tqdm == 4.60.0
opt-einsum==3.3.0
wandb==0.12.7
ujson
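
To see which of these pins the current environment satisfies, a small check with `importlib.metadata` (available from Python 3.8) may help. This is a sketch under stated assumptions: `parse_pin` and `check_pins` are hypothetical helper names, and the version comparison is a simple prefix match that is looser than pip's own `==` rule.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Pins copied from the Requirements list above (the Python version is omitted).
PINNED = """
torch == 1.5
hydra-core == 1.0.6
tensorboard == 2.4.1
matplotlib == 3.4.1
transformers == 3.4.0
jieba == 0.42.1
scikit-learn == 0.24.1
pytorch-transformers == 1.2.0
seqeval == 1.2.2
tqdm == 4.60.0
opt-einsum==3.3.0
wandb==0.12.7
ujson
"""


def parse_pin(line: str) -> Tuple[str, Optional[str]]:
    """Split 'name == version' into (name, version); version is None if unpinned."""
    name, sep, version = line.partition("==")
    return name.strip(), (version.strip() if sep else None)


def check_pins(pins: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each missing or mismatched package to (wanted, installed)."""
    mismatches = {}
    for line in pins.strip().splitlines():
        name, wanted = parse_pin(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Prefix match: "1.5" accepts "1.5.0"; this is an approximation.
        if installed is None or (wanted and not installed.startswith(wanted)):
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in check_pins(PINNED).items():
    print(f"{name}: wanted {wanted}, installed {installed}")
```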

Introduction of Three Functions

1. Named Entity Recognition

Named entity recognition seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, locations, organizations, etc.

The data is stored in .txt files. Some example instances follow:

| Sentence | Person | Location | Organization |
| --- | --- | --- | --- |
| 本报北京9月4日讯记者杨涌报道：部分省区人民日报宣传发行工作座谈会9月3日在4日在京举行。 | 杨涌 | 北京 | 人民日报 |
| 《红楼梦》是中央电视台和中国电视剧制作中心根据中国古典文学名著《红楼梦》摄制于1987年的一部古装连续剧，由王扶林导演，周汝昌、王蒙、周岭等多位红学家参与制作。 | 王扶林，周汝昌，王蒙，周岭 | 中国 | 中央电视台，中国电视剧制作中心 |
| 秦始皇兵马俑位于陕西省西安市，1961年被国务院公布为第一批全国重点文物保护单位，是世界八大奇迹之一。 | 秦始皇 | 陕西省，西安市 | 国务院 |
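
Sequence labeling models are commonly trained on one tag per character rather than on entity lists like those above. The sketch below shows one way to derive such tags for the third instance. It assumes the BIO scheme and the label names PER, LOC and ORG, none of which is confirmed here; the exact on-disk format of the .txt files is described in the task README. The helper `bio_tags` is hypothetical.

```python
from typing import Dict, List


def bio_tags(sentence: str, entities: Dict[str, List[str]]) -> List[str]:
    """Return one BIO tag per character.

    Only the first occurrence of each mention is tagged.
    """
    tags = ["O"] * len(sentence)
    for label, mentions in entities.items():
        for mention in mentions:
            start = sentence.find(mention)
            if start == -1:
                continue  # mention does not occur in the sentence
            tags[start] = f"B-{label}"
            for i in range(start + 1, start + len(mention)):
                tags[i] = f"I-{label}"
    return tags


sentence = "秦始皇兵马俑位于陕西省西安市，1961年被国务院公布为第一批全国重点文物保护单位，是世界八大奇迹之一。"
entities = {"PER": ["秦始皇"], "LOC": ["陕西省", "西安市"], "ORG": ["国务院"]}
tags = bio_tags(sentence, entities)

# Show the first 14 characters with their tags.
for char, tag in zip(sentence[:14], tags[:14]):
    print(char, tag)
```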

Read the detailed process in the README of each setting:
- STANDARD (Fully Supervised)
  
  Step1 Enter DeepKE/example/ner/standard. Download the dataset.
```
wget 120.27.214.45/Data/ner/standard/data.tar.gz

tar -xzvf data.tar.gz
```
  Step2 Training
  
  The dataset and parameters can be customized in the data folder and conf folder respectively.
```
python run.py
```
  Step3 Prediction
```
python predict.py
```
- FEW-SHOT
  
  Step1 Enter DeepKE/example/ner/few-shot. Download the dataset.
```
wget 120.27.214.45/Data/ner/few_shot/data.tar.gz

tar -xzvf data.tar.gz
```
  Step2 Training in the low-resource setting
  
  The directory where the model is loaded and saved, as well as the configuration parameters, can be customized in the conf folder.
```
python run.py +train=few_shot
```
  Users can modify load_path in conf/train/few_shot.yaml to load an existing model.
  
  Step3 Prediction. Add - predict to conf/config.yaml, set load_path to the model path and write_path to the path where the predicted results are saved in conf/predict.yaml, and then run python predict.py
```
python predict.py
```

2. Relation Extraction

Relation extraction is the task of extracting semantic relations between entities from unstructured text.

The data is stored in .csv files. Some example instances follow:

| Sentence | Relation | Head | Head_offset | Tail | Tail_offset |
| --- | --- | --- | --- | --- | --- |
| 《岳父也是爹》是王军执导的电视剧，由马恩然、范明主演。 | 导演 | 岳父也是爹 | 1 | 王军 | 8 |
| 《九玄珠》是在纵横中文网连载的一部小说，作者是龙马。 | 连载网站 | 九玄珠 | 1 | 纵横中文网 | 7 |
| 提起杭州的美景，西湖总是第一个映入脑海的词语。 | 所在城市 | 西湖 | 8 | 杭州 | 2 |
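
The offsets in these instances are 0-based character positions of the head and tail within the sentence. When preparing a custom dataset, a quick consistency check like the sketch below can catch misaligned offsets before training. The lowercase column names and the helpers `span_matches` and `bad_rows` are assumptions for illustration; check the header of the downloaded .csv files for the actual names.

```python
import csv
import io
from typing import Dict, List

# The three instances above written as CSV text; column names are assumed.
SAMPLE = """sentence,relation,head,head_offset,tail,tail_offset
《岳父也是爹》是王军执导的电视剧，由马恩然、范明主演。,导演,岳父也是爹,1,王军,8
《九玄珠》是在纵横中文网连载的一部小说，作者是龙马。,连载网站,九玄珠,1,纵横中文网,7
提起杭州的美景，西湖总是第一个映入脑海的词语。,所在城市,西湖,8,杭州,2
"""


def span_matches(sentence: str, mention: str, offset: int) -> bool:
    """True if mention starts at the given 0-based character offset."""
    return sentence[offset:offset + len(mention)] == mention


def bad_rows(rows: List[Dict[str, str]]) -> List[int]:
    """Return indices of rows whose head or tail offset does not match."""
    bad = []
    for i, row in enumerate(rows):
        head_ok = span_matches(row["sentence"], row["head"], int(row["head_offset"]))
        tail_ok = span_matches(row["sentence"], row["tail"], int(row["tail_offset"]))
        if not (head_ok and tail_ok):
            bad.append(i)
    return bad


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(bad_rows(rows))  # prints [] because all three rows are consistent
```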

Read the detailed process in the README of each setting:
- STANDARD (Fully Supervised)
  
  Step1 Enter the DeepKE/example/re/standard folder. Download the dataset.
```
wget 120.27.214.45/Data/re/standard/data.tar.gz

tar -xzvf data.tar.gz
```
  Step2 Training
  
  The dataset and parameters can be customized in the data folder and conf folder respectively.
```
python run.py
```
  Step3 Prediction
```
python predict.py
```
- FEW-SHOT
  
  Step1 Enter DeepKE/example/re/few-shot. Download the dataset.
```
wget 120.27.214.45/Data/re/few_shot/data.tar.gz

tar -xzvf data.tar.gz
```
  Step2 Training
  - The dataset and parameters can be customized in the data folder and conf folder respectively.
  - To resume from a previously trained model, set train_from_saved_model in conf/train.yaml to the path where that model was saved. The path for logs generated during training can be customized with log_dir.
```
python run.py
```
  Step3 Prediction
```
python predict.py
```
- DOCUMENT
  
  Step1 Enter DeepKE/example/re/document. Download the dataset.
```
wget 120.27.214.45/Data/re/document/data.tar.gz

tar -xzvf data.tar.gz
```
  Step2 Training
  - The dataset and parameters can be customized in the data folder and conf folder respectively.
  - To resume from a previously trained model, set train_from_saved_model in conf/train.yaml to the path where that model was saved. The path for logs generated during training can be customized with log_dir.
```
python run.py
```
  Step3 Prediction
```
python predict.py
```

3. Attribute Extraction

Attribute extraction is the task of extracting attributes of entities from unstructured text.

The data is stored in .csv files. Some example instances follow:

| Sentence | Att | Ent | Ent_offset | Val | Val_offset |
| --- | --- | --- | --- | --- | --- |
| 张冬梅，女，汉族，1968年2月生，河南淇县人 | 民族 | 张冬梅 | 0 | 汉族 | 6 |
| 杨缨，字绵公，号钓溪，松溪县人，祖籍将乐，是北宋理学家杨时的七世孙 | 朝代 | 杨缨 | 0 | 北宋 | 22 |
| 2014年10月1日许鞍华执导的电影《黄金时代》上映 | 上映时间 | 黄金时代 | 19 | 2014年10月1日 | 0 |
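
Each instance amounts to an (entity, attribute, value) triple, which is the form a knowledge base typically stores. The sketch below groups the three instances into a nested dictionary keyed by entity. It only illustrates the data layout and is not DeepKE's API; the function name `to_knowledge_base` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (sentence, attribute, entity, entity_offset, value, value_offset)
Row = Tuple[str, str, str, int, str, int]

ROWS: List[Row] = [
    ("张冬梅，女，汉族，1968年2月生，河南淇县人", "民族", "张冬梅", 0, "汉族", 6),
    ("杨缨，字绵公，号钓溪，松溪县人，祖籍将乐，是北宋理学家杨时的七世孙",
     "朝代", "杨缨", 0, "北宋", 22),
    ("2014年10月1日许鞍华执导的电影《黄金时代》上映",
     "上映时间", "黄金时代", 19, "2014年10月1日", 0),
]


def to_knowledge_base(rows: List[Row]) -> Dict[str, Dict[str, str]]:
    """Group rows into {entity: {attribute: value}}.

    Rows whose offsets do not point at the stated spans are skipped.
    """
    kb: Dict[str, Dict[str, str]] = defaultdict(dict)
    for sentence, attribute, entity, ent_off, value, val_off in rows:
        if sentence[ent_off:ent_off + len(entity)] != entity:
            continue
        if sentence[val_off:val_off + len(value)] != value:
            continue
        kb[entity][attribute] = value
    return dict(kb)


kb = to_knowledge_base(ROWS)
for entity, attributes in kb.items():
    print(entity, attributes)
```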

Read the detailed process in the README of each setting:
- STANDARD (Fully Supervised)
  
  Step1 Enter the DeepKE/example/ae/standard folder. Download the dataset.
```
wget 120.27.214.45/Data/ae/standard/data.tar.gz

tar -xzvf data.tar.gz
```
  Step2 Training
  
  The dataset and parameters can be customized in the data folder and conf folder respectively.
```
python run.py
```
  Step3 Prediction
```
python predict.py
```

Notebook Tutorial

This toolkit provides many Jupyter Notebook and Google Colab tutorials. Users can study DeepKE with them.

Standard Setting
- NER Notebook
- NER Colab
- RE Notebook
- RE Colab
- AE Notebook
- AE Colab

Low-resource
- NER Notebook
- NER Colab
- RE Notebook

Document-level
- RE Notebook

Tips

Using the nearest mirror, such as THU in China, speeds up the installation of Anaconda.
Using the nearest mirror, such as aliyun in China, speeds up pip install XXX.
When encountering ModuleNotFoundError: No module named 'past', run pip install future.
Downloading pretrained language models online is slow. We recommend downloading them before use and saving them in the pretrained folder. Read README.md in each task directory for the specific requirements on saving pretrained models.
The old version of DeepKE is in the deepke-v1.0 branch; switch to that branch to use it. The old version has been fully transferred to standard relation extraction (example/re/standard).
We recommend installing DeepKE from source, because users may run into problems with pip on Windows.

To do

In the next version, we plan to add multi-modality knowledge extraction to the toolkit.

Meanwhile, we will offer long-term maintenance to fix bugs, solve issues and meet new requests. If you have any problems, please submit an issue.

Developers

Zhejiang University: Ningyu Zhang, Liankuan Tao, Haiyang Yu, Xiang Chen, Xin Xu, Xi Tian, Lei Li, Zhoubo Li, Shumin Deng, Yunzhi Yao, Hongbin Ye, Xin Xie, Guozhou Zheng, Huajun Chen

DAMO Academy: Chuanqi Tan, Fei Huang
