Added two problems: Document Dating and Open Knowledge Base Canonicalization
dey-aditi · Jul 11, 2018 · 53d7497
1 parent 7e38a35
commit 53d7497
Showing 3 changed files with 51 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -24,6 +24,8 @@
 - [Semantic role labeling](semantic_role_labeling.md)
 - [Summarization](summarization.md)
 - [Text classification](text_classification.md)
+- [Open KB Canonicalization](open_kg_canonicalization.md)
+- [Document Dating](document_dating.md)
 
 This document aims to track the progress in Natural Language Processing (NLP) and give an overview
 of the state-of-the-art across the most common NLP tasks and their corresponding datasets.

diff --git a/document_dating.md b/document_dating.md
@@ -0,0 +1,24 @@
+# Document Dating (Time Stamping)
+
+#### Problem
+
+Document Dating is the problem of automatically predicting the date of a document based on its content. The date of a document, also referred to as the Document Creation Time (DCT), is at the core of many important tasks, such as information retrieval, temporal reasoning, text summarization, event detection, and analysis of historical text.
+
+For example, in the following document, the correct creation year is 1999. This can be inferred from the presence of the terms *1995* and *Four years after*.
+
+*Swiss adopted that form of taxation in 1995. The concession was approved by the govt last September. Four years after, the IOC….*
+
+#### Datasets 
+
+|                 Datasets                 | # Docs | Start Year | End Year |
+| :--------------------------------------: | :----: | :--------: | :------: |
+| [APW](https://drive.google.com/file/d/1tll04ZBooB3Mohm6It-v8MBcjMCC3Y1w/view) |  675k  |    1995    |   2010   |
+| [NYT](https://drive.google.com/file/d/1wqQRFeA1ESAOJqrwUNakfa77n_S9cmBi/view?usp=sharing) |  647k  |    1987    |   1996   |
+
+#### Comparison at year-level granularity (accuracy, %):
+
+| Model                                    | APW Dataset | NYT Dataset |
+| ---------------------------------------- | :---------: | :---------: |
+| [BurstySimDater](https://pdfs.semanticscholar.org/87af/a0cb4f829ce861da0c721ca666d48a62c404.pdf) |    45.9     |    38.5     |
+| [MaxEnt-Time+NER](https://pdfs.semanticscholar.org/87af/a0cb4f829ce861da0c721ca666d48a62c404.pdf) |    52.5     |    42.3     |
+| [NeuralDater](https://github.com/malllabiisc/NeuralDater) |    64.1     |    58.9     |
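
The inference in the Document Dating example (an explicit year, *1995*, combined with the relative expression *Four years after*) can be sketched with a toy rule-based heuristic. This is only an illustration and not how any of the listed systems work; the function name `guess_creation_year`, the `NUMBER_WORDS` table, and the rule of adding the offset to the latest explicit year are all assumptions made for this sketch.

```python
import re

# Hypothetical lookup table for small spelled-out numbers (assumption).
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def guess_creation_year(text):
    """Guess a creation year from explicit years and one relative offset.

    Heuristic (assumption): take the latest explicit year mentioned and add
    the offset from the first phrase such as "Four years after".
    Returns None when the text mentions no explicit year.
    """
    # Explicit four-digit years between 1800 and 2099.
    years = [int(y) for y in re.findall(r"\b(1[89]\d{2}|20\d{2})\b", text)]
    if not years:
        return None
    anchor = max(years)

    # Relative expressions such as "Four years after" or "2 years later".
    match = re.search(r"\b(\w+) years? (after|later)\b", text, flags=re.IGNORECASE)
    if match is None:
        return anchor
    word = match.group(1).lower()
    # Unknown number words contribute an offset of 0.
    offset = int(word) if word.isdigit() else NUMBER_WORDS.get(word, 0)
    return anchor + offset


doc = (
    "Swiss adopted that form of taxation in 1995. The concession was "
    "approved by the govt last September. Four years after, the IOC…."
)
print(guess_creation_year(doc))
```

Real documents rarely state their dates this directly, which is why the systems in the table above are evaluated on large news corpora rather than on hand-written rules.
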
diff --git a/open_kg_canonicalization.md b/open_kg_canonicalization.md
@@ -0,0 +1,25 @@
+# Open Knowledge Base Canonicalization
+
+#### Problem
+
+Open Information Extraction approaches lead to the creation of large knowledge bases (KBs) from the web. The problem with such methods is that their entities and relations are not canonicalized, which leads to the storage of redundant and ambiguous facts. For example, an Open KB storing *\<Barack Obama, was born in, Honolulu\>* and *\<Obama, took birth in, Honolulu\>* doesn't know that *Barack Obama* and *Obama* refer to the same entity. Similarly, *took birth in* and *was born in* refer to the same relation. The problem of Open KB canonicalization involves identifying groups of equivalent entities and relations in the KB.
+
+#### Datasets 
+
+| Datasets                                 | # Gold Entities | # NPs | # Relations | # Triples |
+| ---------------------------------------- | :-------------: | ----- | ---------- | -------- |
+| [Base](https://suchanek.name/work/publications/cikm2014.pdf) |       150       | 290   | 3K         | 9K       |
+| [Ambiguous](https://suchanek.name/work/publications/cikm2014.pdf) |       446       | 717   | 11K        | 37K      |
+| [ReVerb45K](https://github.com/malllabiisc/cesi) |      7.5K       | 15.5K | 22K        | 45K      |
+
+#### Noun Phrase Canonicalization:
+
+| Method                                   |               | Base dataset |                 |               | Ambiguous dataset |                 |               | ReVerb45K    |                 |
+| :--------------------------------------- | :-----------: | :----------: | :-------------: | :-----------: | :---------------: | :-------------: | :-----------: | :----------: | :-------------: |
+|                                          | **Macro F1**  | **Micro F1** | **Pairwise F1** | **Macro F1**  |   **Micro F1**    | **Pairwise F1** | **Macro F1**  | **Micro F1** | **Pairwise F1** |
+| Morphological Normalization              |     58.3      |     88.3     |  83.5  |     49.1      |       57.2        | 70.9   |      1.4      |    77.7    |  75.1  |
+| [Galárraga-StrSim](https://suchanek.name/work/publications/cikm2014.pdf) |     88.2      |     96.5     |  97.7  |     66.6      |       85.3        | 82.2   |     69.9      |    51.7    |  0.5   |
+| [Galárraga-IDF](https://suchanek.name/work/publications/cikm2014.pdf) |     94.8      |     97.9     |  98.3  |     67.9      |       82.9        | 79.3   |     71.6      |    50.8    |  0.5   |
+| [Galárraga-Attr](https://suchanek.name/work/publications/cikm2014.pdf) |     76.1      |     51.4     |  18.1  |     82.9      |       27.7        | 8.4    |     75.1      |    20.1    |  0.2   |
+| [CESI](https://github.com/malllabiisc/cesi) |     98.2      |     99.8     |  99.9  |     66.2      |       92.4        | 91.9   |     62.7      |    84.4    |  81.9  |
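
The grouping task described in the Problem section can be illustrated with a toy clustering based on token overlap. This is a minimal sketch under stated assumptions, not the method of CESI or of any other system in the table; the function names `jaccard` and `canonicalize` and the `threshold` value of 0.5 are hypothetical choices made for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-level Jaccard similarity between two phrases."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def canonicalize(phrases, threshold=0.5):
    """Group phrases whose pairwise token overlap reaches the threshold.

    Uses a small union-find so that similarity links are transitive.
    Returns a sorted list of sorted clusters.
    """
    parent = {p: p for p in phrases}

    def find(p):
        # Follow parent links to the cluster root, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        if jaccard(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for p in phrases:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())


noun_phrases = ["Barack Obama", "Obama", "Honolulu"]
relation_phrases = ["was born in", "took birth in"]
print(canonicalize(noun_phrases))
print(canonicalize(relation_phrases))
```

With this threshold the sketch groups *Barack Obama* with *Obama*, but it keeps *was born in* and *took birth in* apart because they share only the token *in*. Surface similarity alone is therefore not enough for relation phrases.
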
+