forked from sebastianruder/NLP-progress
Added two problems: Document Dating and Open Knowledge Base Canonicalization
Showing 3 changed files with 51 additions and 0 deletions.
@@ -0,0 +1,24 @@
# Document Dating (Time Stamping)

#### Problem

Document dating is the problem of automatically predicting the date of a document from its content. The date of a document, also referred to as the Document Creation Time (DCT), is central to many tasks, including information retrieval, temporal reasoning, text summarization, event detection, and the analysis of historical text.

For example, the correct creation year of the following document is 1999. This can be inferred from the terms *1995* and *Four years after*.

*Swiss adopted that form of taxation in 1995. The concession was approved by the govt last September. Four years after, the IOC….*

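The inference in the example above can be sketched as a toy heuristic: take the latest explicit year in the text and add any "N years after" offset. This is only an illustration of the reasoning, not the method of any system listed on this page; the function name `guess_creation_year` and the two regular expressions are assumptions made for the example.

```python
import re

# Hypothetical helpers for illustration; real document dating systems are learned models.
WORD_TO_NUM = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# Explicit four-digit years between 1800 and 2099.
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")
# Relative expressions such as "Four years after" or "2 years later".
OFFSET_RE = re.compile(r"\b(\w+)\s+years?\s+(?:after|later)\b", re.IGNORECASE)


def guess_creation_year(text):
    """Return the latest explicit year plus any 'N years after' offsets, or None."""
    years = [int(y) for y in YEAR_RE.findall(text)]
    if not years:
        return None
    offset = 0
    for word in OFFSET_RE.findall(text):
        word = word.lower()
        # Accept either digits ("4") or a small set of number words ("four").
        offset += int(word) if word.isdigit() else WORD_TO_NUM.get(word, 0)
    return max(years) + offset


doc = ("Swiss adopted that form of taxation in 1995. The concession was approved "
       "by the govt last September. Four years after, the IOC….")
print(guess_creation_year(doc))  # prints 1999
```

A heuristic like this breaks as soon as a document mentions years unrelated to its creation time, which is why the task is usually treated as a classification problem over candidate years.
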
#### Datasets

| Dataset | # Docs | Start Year | End Year |
| :--------------------------------------: | :----: | :--------: | :------: |
| [APW](https://drive.google.com/file/d/1tll04ZBooB3Mohm6It-v8MBcjMCC3Y1w/view) | 675k | 1995 | 2010 |
| [NYT](https://drive.google.com/file/d/1wqQRFeA1ESAOJqrwUNakfa77n_S9cmBi/view?usp=sharing) | 647k | 1987 | 1996 |

#### Comparison on year-level granularity (accuracy, %):

| Method | APW Dataset | NYT Dataset |
| ---------------------------------------- | :---------: | :---------: |
| [BurstySimDater](https://pdfs.semanticscholar.org/87af/a0cb4f829ce861da0c721ca666d48a62c404.pdf) | 45.9 | 38.5 |
| [MaxEnt-Time+NER](https://pdfs.semanticscholar.org/87af/a0cb4f829ce861da0c721ca666d48a62c404.pdf) | 52.5 | 42.3 |
| [NeuralDater](https://github.com/malllabiisc/NeuralDater) | 64.1 | 58.9 |
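
A year-level comparison like the one above can be scored as exact-match accuracy. The sketch below assumes that the reported numbers are the percentage of documents whose predicted year equals the gold year; the function name `year_accuracy` and the sample years are hypothetical.

```python
def year_accuracy(gold_years, predicted_years):
    """Percentage of documents whose predicted year equals the gold year."""
    if len(gold_years) != len(predicted_years):
        raise ValueError("gold and predicted lists must have the same length")
    if not gold_years:
        raise ValueError("need at least one document")
    correct = sum(g == p for g, p in zip(gold_years, predicted_years))
    return 100.0 * correct / len(gold_years)


# Made-up years for four documents; one prediction is off by a year.
gold = [1999, 2003, 1996, 2010]
pred = [1999, 2004, 1996, 2010]
print(year_accuracy(gold, pred))  # prints 75.0
```

Exact match gives no credit for near misses, so a prediction that is one year off counts the same as one that is a decade off.
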
@@ -0,0 +1,25 @@
# Open Knowledge Graph Canonicalization

#### Problem

Open Information Extraction approaches lead to the creation of large knowledge bases (KBs) from the web. The problem with such methods is that their entities and relations are not canonicalized, which leads to the storage of redundant and ambiguous facts. For example, an Open KB storing *\<Barack Obama, was born in, Honolulu\>* and *\<Obama, took birth in, Honolulu\>* does not know that *Barack Obama* and *Obama* refer to the same entity, or that *took birth in* and *was born in* refer to the same relation. Open KB canonicalization is the problem of identifying groups of equivalent entities and relations in the KB.

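To make the goal concrete, the sketch below shows what a canonicalized KB looks like once the equivalence groups are known. The clusters are written by hand purely for illustration (finding them automatically is the actual task), and the names `build_canonical_map` and `canonicalize` are hypothetical.

```python
# Hand-written clusters for illustration only. By convention in this sketch,
# the first phrase in each cluster is used as the canonical form.
np_clusters = [["Barack Obama", "Obama"], ["Honolulu"]]
rel_clusters = [["was born in", "took birth in"]]


def build_canonical_map(clusters):
    """Map every phrase to the first phrase of its cluster."""
    mapping = {}
    for cluster in clusters:
        representative = cluster[0]
        for phrase in cluster:
            mapping[phrase] = representative
    return mapping


def canonicalize(triples, np_map, rel_map):
    """Rewrite each triple with canonical phrases and drop duplicates, keeping order."""
    seen = set()
    result = []
    for subj, rel, obj in triples:
        canon = (np_map.get(subj, subj), rel_map.get(rel, rel), np_map.get(obj, obj))
        if canon not in seen:
            seen.add(canon)
            result.append(canon)
    return result


triples = [
    ("Barack Obama", "was born in", "Honolulu"),
    ("Obama", "took birth in", "Honolulu"),
]
np_map = build_canonical_map(np_clusters)
rel_map = build_canonical_map(rel_clusters)
print(canonicalize(triples, np_map, rel_map))
# prints [('Barack Obama', 'was born in', 'Honolulu')]
```

The two redundant facts collapse into one once both the noun phrases and the relation phrases are mapped to a shared representative.
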
#### Datasets

| Dataset | # Gold Entities | # NPs | # Relations | # Triples |
| ---------------------------------------- | :-------------: | :---: | :---------: | :------: |
| [Base](https://suchanek.name/work/publications/cikm2014.pdf) | 150 | 290 | 3K | 9K |
| [Ambiguous](https://suchanek.name/work/publications/cikm2014.pdf) | 446 | 717 | 11K | 37K |
| [ReVerb45K](https://github.com/malllabiisc/cesi) | 7.5K | 15.5K | 22K | 45K |

#### Noun Phrase Canonicalization:

| | | Base Dataset | | | Ambiguous Dataset | | | ReVerb45K | |
| :--------------------------------------- | :-----------: | :----------: | :----: | :-----------: | :---------------: | :----: | :-----------: | :--------: | :----: |
| | **Macro F1** | **Micro F1** | **Pairwise F1** | **Macro F1** | **Micro F1** | **Pairwise F1** | **Macro F1** | **Micro F1** | **Pairwise F1** |
| Morphological Normalization | 58.3 | 88.3 | 83.5 | 49.1 | 57.2 | 70.9 | 1.4 | 77.7 | 75.1 |
| [Galárraga-StrSim](https://suchanek.name/work/publications/cikm2014.pdf) | 88.2 | 96.5 | 97.7 | 66.6 | 85.3 | 82.2 | 69.9 | 51.7 | 0.5 |
| [Galárraga-IDF](https://suchanek.name/work/publications/cikm2014.pdf) | 94.8 | 97.9 | 98.3 | 67.9 | 82.9 | 79.3 | 71.6 | 50.8 | 0.5 |
| [Galárraga-Attr](https://suchanek.name/work/publications/cikm2014.pdf) | 76.1 | 51.4 | 18.1 | 82.9 | 27.7 | 8.4 | 75.1 | 20.1 | 0.2 |
| [CESI](https://github.com/malllabiisc/cesi) | 98.2 | 99.8 | 99.9 | 66.2 | 92.4 | 91.9 | 62.7 | 84.4 | 81.9 |
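
Clustering-based canonicalization is commonly scored with macro, micro, and pairwise precision, recall, and F1 against gold clusters. The sketch below assumes the usual definitions: macro precision is the fraction of pure predicted clusters, micro precision is cluster purity, pairwise precision is the fraction of co-clustered pairs that are also together in the gold clustering, and recall is precision with the two clusterings swapped. The function names and the tiny example are hypothetical, and both clusterings are assumed to cover the same noun phrases.

```python
from itertools import combinations


def _f1(p, r):
    """Harmonic mean of precision and recall."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def macro_precision(clusters, reference):
    """Fraction of clusters fully contained in a single reference cluster."""
    pure = sum(any(c <= e for e in reference) for c in clusters)
    return pure / len(clusters)


def micro_precision(clusters, reference):
    """Purity: each cluster is credited with its largest overlap with a reference cluster."""
    total = sum(len(c) for c in clusters)
    return sum(max(len(c & e) for e in reference) for c in clusters) / total


def pairwise_precision(clusters, reference):
    """Fraction of same-cluster pairs that are also together in a reference cluster."""
    label = {x: i for i, e in enumerate(reference) for x in e}
    pairs = [(a, b) for c in clusters for a, b in combinations(sorted(c), 2)]
    if not pairs:
        return 1.0  # assumption: with no pairs, nothing can be wrong
    hits = sum(label.get(a) is not None and label.get(a) == label.get(b)
               for a, b in pairs)
    return hits / len(pairs)


def scores(predicted, gold):
    """Return {metric: (precision, recall, F1)}; recall swaps the two clusterings."""
    out = {}
    for name, fn in [("macro", macro_precision),
                     ("micro", micro_precision),
                     ("pairwise", pairwise_precision)]:
        p, r = fn(predicted, gold), fn(gold, predicted)
        out[name] = (p, r, _f1(p, r))
    return out


# Made-up example: the prediction wrongly merges "Honolulu" and "Hawaii".
predicted = [{"Barack Obama", "Obama"}, {"Honolulu", "Hawaii"}]
gold = [{"Barack Obama", "Obama"}, {"Honolulu"}, {"Hawaii"}]
for metric, (p, r, f1) in scores(predicted, gold).items():
    print(f"{metric}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

In this example the wrong merge lowers precision on all three metrics, while recall stays at 1.0 because every gold cluster is still contained in a predicted one.
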