Update: We release the manually annotated financial relation extraction dataset FinRE in data/FinRE
, which contains 44 relations (bidirectional) and 18000+ instances. Feel free to download and obtain the dataset, and please cite our paper if you use the dataset in your work.
Source code for ACL 2019 paper "Chinese Relation Extraction with Multi-Grained Information and External Linguistic Knowledge". Some code in this repository is based on the excellent open-source project https://github.com/jiesutd/LatticeLSTM.
- Python 3.6
- Pytorch 0.4.1
Three datasets are used in our paper:
-
FinRE: A manual-labeled financial news RE dataset. The data cannot be made public for the time being.
-
SanWen: A Chinese literature NER-RE dataset, the source of the dataset is https://github.com/lancopku/Chinese-Literature-NER-RE-Dataset.
-
ACE 2005: A benchmark RE dataset. According to the terms of LDC, we are not allowed to share the dataset with the third party. If you have the LDC license, please obtain the dataset (LDC2006T06) and follow the data format by yourself.
In this project, train.txt
, dev.txt
and test.txt
are all from SanWen.
data/SanWen/train.txt, dev.txt, test.txt
One instance per line with 4 columns separated by tab character. The first and second columns are head and tail entities. The third column is the relation label and the last one is text:
[head] [tail] [relation] text
For example ( one line ):
湖底 卵石 Located 连湖底的卵石颜色也可分辨
data/SanWen/relation2id.txt
One relation per line with 2 columns separated by tab character. The first column is teh label while the second one is the corresponding ID:
[relation] [ID]
data/vec.txt
One character per line. For each line, the first column is the character, the rest columns is the value of the embedding of the character.
data/sense.txt
Similar to character embedding but for word senses. For example:
释放#1 0.304095 ...
释放#2 -0.175496 ...
夏天 -0.230772 ...
Here, A#n means that it is the n-th sense of word A ( A is a polysemous word ). And the word-sense embeddings could be trained by the SAT (Sememe Attention over Target) approach.
data/sense_map.txt
Recording all senses for each polysemous word, corresponding to the word sense embedding. One word per line, for each line, the first column is the word, and the rest columns are all the senses of it ( if exist ). For example:
释放 释放#1 释放#2
夏天
The sense_map
file could be obtained by HowNet.
You can download the pre-trained character embeddings vec.txt
, pre-trained word-sense embeddings sense.txt
and word-sense map sense_map.txt
from Tsinghua Cloud or Google Drive. Then put them in place following the folder structure:
MG-Lattice
|-- ...
|-- data
|
|-- sense.txt
|
|-- vec.txt
|
|-- sense_map.txt
|
|-- DATASET_NAME_1
| |
| |-- train.txt
| |-- valid.txt
| |-- test.txt
| |-- relation2id.txt
|
|-- DATASET_NAME_2
|-- ...
Arguments are set in configure.py
, the default values are for SanWen dataset. The full usage is:
-- savemodel path to save the model
-- loadmodel path to load the model
-- savedset path to load the data settings
-- public_path the parent path of the dataset (data/)
-- dataset the folder name of dataset (SanWen/)
-- train_file train dataset (train.txt)
-- dev_file developement dataset (dev.txt)
-- test_file test dataset (test.txt)
-- relation2id map relation to id (relation2id.txt)
-- char_emb_file pre-trained char embeddings (vec.txt)
-- sense_emb_file pre-trained sense embeddings (sense.txt)
-- word_sense_map record polysemous words (sense_map.txt)
-- max_length the max length of the input
-- Encoder Specify which encoder to use
-- Optimizer Specify which optimizier to use
-- lr learning rate
-- weights_mode mode to set weights for each class in loss function
With appropriate configuration and data preparation, you can run the model by:
python main.py
If you use the code, please cite the paper:
@inproceedings{li2019chinese, title={Chinese Relation Extraction with Multi-Grained Information and External Linguistic Knowledge}, author={Li, Ziran and Ding, Ning and Liu, Zhiyuan and Zheng, Hai-Tao and Shen, Ying}, booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics}, pages={4377--4386}, year={2019} }