Added two problems: Document Dating and Open Knowledge Base Canonicalization
svjan5 committed Jul 11, 2018
1 parent 7e38a35 commit 53d7497
Showing 3 changed files with 51 additions and 0 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -24,6 +24,8 @@
- [Semantic role labeling](semantic_role_labeling.md)
- [Summarization](summarization.md)
- [Text classification](text_classification.md)
- [Open KB Canonicalization](open_kb_canonicalization.md)
- [Document Dating](document_dating.md)

This document aims to track the progress in Natural Language Processing (NLP) and give an overview
of the state-of-the-art across the most common NLP tasks and their corresponding datasets.
24 changes: 24 additions & 0 deletions document_dating.md
@@ -0,0 +1,24 @@
# Document Dating (Time Stamping)

#### Problem

Document Dating is the problem of automatically predicting the date of a document based on its content. The date of a document, also referred to as the Document Creation Time (DCT), is at the core of many important tasks, such as information retrieval, temporal reasoning, text summarization, event detection, and analysis of historical text.

For example, in the following document, the correct creation year is 1999. This can be inferred from the presence of the terms *1995* and *Four years after*.

*Swiss adopted that form of taxation in 1995. The concession was approved by the govt last September. Four years after, the IOC….*
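The inference in this example (an explicit year plus a relative offset) can be sketched as a toy heuristic. This is only an illustration of the reasoning, not the method used by any of the systems listed below; the function name `guess_creation_year`, the `WORD_TO_NUM` lookup, and both regular expressions are assumptions made for this sketch.

```python
import re

# Hypothetical lookup for spelled-out numbers (assumption; extend as needed).
WORD_TO_NUM = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

# Explicit four-digit years between 1800 and 2099.
YEAR_RE = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")
# Relative expressions such as "Four years after" or "2 years later".
OFFSET_RE = re.compile(r"\b(\w+) years? (?:after|later)\b", re.IGNORECASE)


def guess_creation_year(text):
    """Toy heuristic: latest explicit year plus the largest relative offset."""
    years = [int(y) for y in YEAR_RE.findall(text)]
    if not years:
        return None  # no temporal anchor in the text
    offsets = []
    for word in OFFSET_RE.findall(text):
        word = word.lower()
        if word.isdigit():
            offsets.append(int(word))
        elif word in WORD_TO_NUM:
            offsets.append(WORD_TO_NUM[word])
    return max(years) + max(offsets, default=0)


doc = ("Swiss adopted that form of taxation in 1995. The concession was approved "
       "by the govt last September. Four years after, the IOC...")
print(guess_creation_year(doc))  # prints 1999
```

A rule like this breaks as soon as the relative expression does not refer to the latest explicit year, which is one reason learned models are used for this task.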

#### Datasets

| Datasets | # Docs | Start Year | End Year |
| :--------------------------------------: | :----: | :--------: | :------: |
| [APW](https://drive.google.com/file/d/1tll04ZBooB3Mohm6It-v8MBcjMCC3Y1w/view) | 675k | 1995 | 2010 |
| [NYT](https://drive.google.com/file/d/1wqQRFeA1ESAOJqrwUNakfa77n_S9cmBi/view?usp=sharing) | 647k | 1987 | 1996 |

#### Comparison on year-level granularity (accuracy, %):

| | APW Dataset | NYT Dataset |
| ---------------------------------------- | :---------: | :---------: |
| [BurstySimDater](https://pdfs.semanticscholar.org/87af/a0cb4f829ce861da0c721ca666d48a62c404.pdf) | 45.9 | 38.5 |
| [MaxEnt-Time+NER](https://pdfs.semanticscholar.org/87af/a0cb4f829ce861da0c721ca666d48a62c404.pdf) | 52.5 | 42.3 |
| [NeuralDater](https://github.com/malllabiisc/NeuralDater) | 64.1 | 58.9 |
25 changes: 25 additions & 0 deletions open_kg_canonicalization.md
@@ -0,0 +1,25 @@
# Open Knowledge Graph Canonicalization

#### Problem

Open Information Extraction approaches lead to the creation of large knowledge bases (KBs) from the web. The problem with such methods is that their entities and relations are not canonicalized, which leads to the storage of redundant and ambiguous facts. For example, an Open KB storing *\<Barack Obama, was born in, Honolulu\>* and *\<Obama, took birth in, Honolulu\>* doesn't know that *Barack Obama* and *Obama* refer to the same entity. Similarly, *took birth in* and *was born in* refer to the same relation. The problem of Open KB canonicalization involves identifying groups of equivalent entities and relations in the KB.
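To make the grouping step concrete, here is a minimal sketch that clusters noun phrases by token overlap. It is a simplified stand-in, not the algorithm of any system in the tables below; the function names, the Jaccard measure, and the `threshold=0.5` value are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_noun_phrases(phrases, threshold=0.5):
    """Group phrases whose token overlap reaches the (assumed) threshold."""
    tokens = {p: set(p.lower().split()) for p in phrases}
    parent = {p: p for p in phrases}  # union-find forest

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        if jaccard(tokens[a], tokens[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for p in phrases:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())


triples = [
    ("Barack Obama", "was born in", "Honolulu"),
    ("Obama", "took birth in", "Honolulu"),
]
# Collect subject and object noun phrases from the triples.
noun_phrases = sorted({t[0] for t in triples} | {t[2] for t in triples})
print(cluster_noun_phrases(noun_phrases))
# prints [['Barack Obama', 'Obama'], ['Honolulu']]
```

Surface overlap of this kind would not merge *was born in* with *took birth in*, since the two phrases share only one token. That gap illustrates why string similarity alone is not enough for the task.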

#### Datasets

| Datasets | # Gold Entities | # NPs | # Relations | # Triples |
| ---------------------------------------- | :-------------: | :---: | :---------: | :-------: |
| [Base](https://suchanek.name/work/publications/cikm2014.pdf) | 150 | 290 | 3K | 9K |
| [Ambiguous](https://suchanek.name/work/publications/cikm2014.pdf) | 446 | 717 | 11K | 37K |
| [ReVerb45K](https://github.com/malllabiisc/cesi) | 7.5K | 15.5K | 22K | 45K |

#### Noun Phrase Canonicalization:

| | | Base Dataset | | | Ambiguous Dataset | | | ReVerb45K | |
| :--------------------------------------- | :-----------: | :----------: | :----: | :-----------: | :---------------: | :----: | :-----------: | :--------: | :----: |
| | **Macro F1** | **Micro F1** | **Pairwise F1** | **Macro F1** | **Micro F1** | **Pairwise F1** | **Macro F1** | **Micro F1** | **Pairwise F1** |
| Morphological Normalization | 58.3 | 88.3 | 83.5 | 49.1 | 57.2 | 70.9 | 1.4 | 77.7 | 75.1 |
| [Galárraga-StrSim](https://suchanek.name/work/publications/cikm2014.pdf) | 88.2 | 96.5 | 97.7 | 66.6 | 85.3 | 82.2 | 69.9 | 51.7 | 0.5 |
| [Galárraga-IDF](https://suchanek.name/work/publications/cikm2014.pdf) | 94.8 | 97.9 | 98.3 | 67.9 | 82.9 | 79.3 | 71.6 | 50.8 | 0.5 |
| [Galárraga-Attr](https://suchanek.name/work/publications/cikm2014.pdf) | 76.1 | 51.4 | 18.1 | 82.9 | 27.7 | 8.4 | 75.1 | 20.1 | 0.2 |
| [CESI](https://github.com/malllabiisc/cesi) | 98.2 | 99.8 | 99.9 | 66.2 | 92.4 | 91.9 | 62.7 | 84.4 | 81.9 |
