# DS-RelationExtraction

**Repository Path**: fulowl/DS-RelationExtraction

## Basic Information

- **Project Name**: DS-RelationExtraction
- **Description**: Distantly Supervised Relation Extraction
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2019-12-09
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Distantly-supervised Relation Extraction with Knowledge Bases
This repository puts together recent models and data sets for **sentence-level relation extraction** *using knowledge bases (i.e., distant supervision)*. In particular, it contains the source code for WWW'17 paper *[CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases](https://arxiv.org/pdf/1610.08763.pdf)*.

**Task**: Given a text corpus with entity mentions *detected* and *heuristically labeled* using distant supervision, the task aims to identify relation types/labels between a pair of entity mentions based on the sentence context where they co-occur.

## Quick Start
- [Blog Posts](#blog-posts)
- [Data](#data)
- [Benchmark](#benchmark)
- [Usage](#usage)
- [Customized Run](#customized-run)
- [Baselines](#baselines)
- [References](#references)
- [Contributors](#contributors)

## Blog Posts
* [08/2017] [Indirect Supervision for Relation Extraction Using Question-Answer Pairs](https://ellenmellon.github.io/ReQuest/)
* [08/2016] [Heterogeneous Supervision for Relation Extraction](https://liyuanlucasliu.github.io/ReHession/)


## Data
For evaluating on sentence-level extraction, we [processed](https://github.com/shanzhenren/StructMineDataPipeline) (using our [data pipeline](https://github.com/shanzhenren/StructMineDataPipeline)) three public datasets to our JSON format. We ran [Stanford NER](https://nlp.stanford.edu/software/CRF-NER.shtml) on training set to detect entity mentions, mapped entity names to Freebase entities using [DBpediaSpotlight](https://github.com/dbpedia-spotlight/dbpedia-spotlight), aligned Freebase facts to sentences, and assign entity types of Freebase entities to their mapped names in sentences:

   * **PubMed-BioInfer**: 100k PubMed paper abstracts as training data and 1,530 manually labeled biomedical paper abstracts from [BioInfer](http://mars.cs.utu.fi/BioInfer/) ([Pyysalo et al., 2007](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-50)) as test data. It consists of 94 relation types (protein-protein interactions) and over 2,000 entity types (from MESH ontology). ([Download](https://drive.google.com/drive/folders/0B--ZKWD8ahE4RmFBTjR6aUJjTkU?usp=sharing))
   
   * **NYT-manual**: 1.18M sentences sampled from 294K New York Times news articles which were then aligned with Freebase facts by ([Riedel et al., ECML'10](https://pdfs.semanticscholar.org/db55/0f7af299157c67d7f1874bf784dca10ce4a9.pdf)) ([link](http://iesl.cs.umass.edu/riedel/ecml/) to Riedel's data). For test set, 395 sentences are manually annotated with 24 relation types and 47 entity types ([Hoffmann et al., ACL'11](http://raphaelhoffmann.com/publications/acl2011.pdf)) ([link](http://raphaelhoffmann.com/mr/) to Hoffmann's data). ([Download](https://drive.google.com/drive/folders/0B--ZKWD8ahE4UktManVsY1REOUk?usp=sharing))
   
   * **Wiki-KBP**: the training corpus contains 1.5M sentences sampled from 780k [Wikipedia articles](https://github.com/xiaoling/figer) ([Ling & Weld, 2012](http://xiaoling.github.io/pubs/ling-aaai12.pdf)) plus ~7,000 sentences from 2013 KBP corpus. Test data consists of 14k system-labeled sentences from [2013 KBP slot filling](http://surdeanu.info/kbp2013/) assessment results. It has 7 relation types and 126 entity types after filtering of numeric value relations. ([Download](https://drive.google.com/drive/folders/0B--ZKWD8ahE4RjFLUkVQTm93WVU?usp=sharing))

Please put the data files in corresponding subdirectories under `data/source`


## Benchmark
Performance comparison with several *relation extraction* systems over KBP 2013 dataset (**sentence-level extraction**). 

Method | Precision | Recall | F1 
-------|-----------|--------|----
Mintz (our implementation, [Mintz et al., 2009](http://web.stanford.edu/~jurafsky/mintz.pdf)) | 0.296 | 0.387 | 0.335 
LINE + Dist Sup ([Tang et al., 2015](https://arxiv.org/pdf/1503.03578.pdf)) | **0.360** | 0.257 | 0.299 
MultiR ([Hoffmann et al., 2011](http://raphaelhoffmann.com/publications/acl2011.pdf)) | 0.325 | 0.278 | 0.301 
FCM + Dist Sup ([Gormley et al., 2015](http://www.aclweb.org/anthology/D15-1205)) | 0.151 | 0.498 | 0.300 
HypeNet (our implementation, [Shwartz et al., 2016](http://www.aclweb.org/anthology/P16-1226)) | 0.210 | 0.315 | 0.252
CNN (our implementation, [Zeng et at., 2014](http://www.aclweb.org/anthology/C14-1220))| 0.198 | 0.334 | 0.242
PCNN (our implementation, [Zeng et at., 2015](http://www.aclweb.org/anthology/D15-1203))| 0.220 | 0.452 | 0.295
LSTM (our implementation) | 0.274 | 0.500 | 0.350
Bi-GRU (our implementation) | 0.301 | 0.465 | 0.362
SDP-LSTM (our implementation, [Xu et at., 2015](http://www.aclweb.org/anthology/D15-1206)) | 0.300 | 0.436 | 0.356
Position-Aware LSTM ([Zhang et al., 2017](http://www.aclweb.org/anthology/D17-1004))| 0.265 | **0.598** | 0.367
CoType-RM ([Ren et al., 2017](https://arxiv.org/pdf/1610.08763v1.pdf)) | 0.303 | 0.407 | 0.347
**CoType** ([Ren et al., 2017](https://arxiv.org/pdf/1610.08763v1.pdf)) | 0.348 | 0.406 | **0.369**

**Note**: for models that trained on sentences annotated with a single label (HypeNet, CNN/PCNN, LSTM, SDP/PA-LSTMs, Bi-GRU), we form one training instance for each sentence-label pair based on their DS-annotated data.

## Usage

### Dependencies
We will take Ubuntu for example.

* python 2.7
* Python library dependencies
```
$ pip install pexpect ujson tqdm
```

* [stanford coreNLP 3.7.0](http://stanfordnlp.github.io/CoreNLP/) and its [python wrapper](https://github.com/stanfordnlp/stanza). Please put the library under `code/DataProcessor/'.

```
$ cd code/DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip
```
* [eigen 3.2.5](http://bitbucket.org/eigen/eigen/get/3.2.5.tar.bz2) (already included). 

We have included compilied binaries. If you need to re-compile `retype.cpp` under your own g++ environment
```
$ cd code/Model/retype; make
```

### Default Run
As an example, we show how to run CoType on the Wiki-KBP dataset

Start the Stanford corenlp server for the python wrapper.
```
$ java -mx4g -cp "code/DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
```

Feature extraction, embedding learning on training data, and evaluation on test data.
```
$ ./run.sh  
```

For relation classification, the "none"-labeled instances need to be first removed from train/test JSON files. The hyperparamters for embedding learning are included in the run.sh script.

### Parameters
Dataset to run on.
```
Data="KBP"
```
- Hyperparameters for *relation extraction*:
```
- KBP: -negative 3 -iters 400 -lr 0.02 -transWeight 1.0
- NYT: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0
- BioInfer: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0
```
Hyperparameters for *relation classification* are included in the run.sh script.

### Evaluation
Evaluates relation extraction performance (precision, recall, F1): produce predictions along with their confidence score; filter the predicted instances by tuning the thresholds.
```
$ python code/Evaluation/emb_test.py extract KBP retype cosine 0.0
$ python code/Evaluation/tune_threshold.py extract KBP emb retype cosine
```

### In-text Prediction
The last command in *run.sh* generates json file for predicted results, in the same format as test.json in data/source/$DATANAME, except that we only output the predicted relation mention labels. Replace the second parameter with whatever threshold you would like.
```
$ python code/Evaluation/convertPredictionToJson.py $Data 0.0
```

## Customized Run
Code for producing the JSON files from a raw corpus for running CoType and baseline models is [here](https://github.com/shanzhenren/StructMineDataPipeline).

## Baselines
You can find our implementation of some recent relation extraction models under the [Code/Model/](https://github.com/shanzhenren/DS-RelationExtraction/tree/master/code/Model) directory.

## References
* Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, Tarek F. Abdelzaher, Jiawei Han. "[CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases](https://arxiv.org/pdf/1610.08763.pdf)", WWW 2017.
* Meng Qu, Xiang Ren, Yu Zhang, Jiawei Han. “[Weakly-supervised Relation Extraction by Pattern-enhanced Embedding Learning](https://arxiv.org/abs/1711.03226)”, WWW 2018.
* Liyuan Liu*, Xiang Ren*, Qi Zhu, Shi Zhi, Huan Gui, Heng Ji, Jiawei Han. "[Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach](https://arxiv.org/abs/1707.00166)", EMNLP 2017.
* Ellen Wu, Xiang Ren, Frank Xu, Ji Li, Jiawei Han. "[Indirect Supervision for Relation Extraction using Question-Answer Pairs](https://arxiv.org/abs/1710.11169)", WSDM 2018.


## Contributors
* Ellen Wu
* Meng Qu
* Frank Xu
* Wenqi He
* Maosen Zhang
* Qinyuan Ye
* Xiang Ren