add dart dataset
zqwerty committed May 11, 2022
1 parent e265e86 commit a403317
Showing 7 changed files with 561 additions and 1 deletion.
1 change: 1 addition & 0 deletions .gitignore
@@ -17,6 +17,7 @@ __pycache__
data/unified_datasets/multiwoz21/MultiWOZ_2.1.zip
data/unified_datasets/tm1/master.zip
data/unified_datasets/dailydialog/ijcnlp_dailydialog.zip
data/unified_datasets/dart/dart-v1.1.1-*.json
data/**/train.json
data/**/val.json
data/**/test.json
4 changes: 3 additions & 1 deletion data/unified_datasets/check.py
@@ -329,9 +329,11 @@ def create_shuffled_dial_ids(dialogues, rng=random.Random(42), num_orders=10):

if args.preprocess:
    print('pre-processing')

    cur_dir = os.getcwd()
    os.chdir(name)
    preprocess = importlib.import_module(f'{name}.preprocess')
    preprocess.preprocess()
    os.chdir(cur_dir)

data_file = f'{name}/data.zip'
if not os.path.exists(data_file):
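
The hunk above switches into the dataset directory before calling `preprocess.preprocess()` and switches back afterwards. If the preprocessing step raised an exception, the final `os.chdir(cur_dir)` would be skipped. A minimal sketch of one way to guard against that, assuming a hypothetical helper named `working_directory` that is not part of this commit:

```python
import contextlib
import os
import tempfile


# Hypothetical helper, not part of check.py: it wraps the
# os.getcwd() / os.chdir() pair from the hunk in a context manager.
@contextlib.contextmanager
def working_directory(path):
    """Temporarily switch the process working directory to `path`."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the caller's cwd is restored.
        os.chdir(previous)


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than a real dataset folder.
    with tempfile.TemporaryDirectory() as tmp:
        with working_directory(tmp):
            print("inside:", os.getcwd())
        print("restored:", os.getcwd())
```

With such a helper, the body of the `if args.preprocess:` branch could run the import and the `preprocess.preprocess()` call inside `with working_directory(name):`. Python 3.11 and later also ship `contextlib.chdir`, which serves the same purpose.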
59 changes: 59 additions & 0 deletions data/unified_datasets/dart/README.md
@@ -0,0 +1,59 @@
# Dataset Card for DART

- **Repository:** https://github.com/Yale-LILY/dart
- **Paper:** https://arxiv.org/pdf/2007.02871.pdf
- **Leaderboard:** https://github.com/Yale-LILY/dart
- **Who transforms the dataset:** Qi Zhu (zhuq96 at gmail dot com)

### Dataset Summary

DART is a large, open-domain structured DAta Record to Text generation corpus with high-quality sentence annotations. It consists of 82191 examples across different domains. Each input is a set of entity-relation triples that follows a tree-structured ontology; the triple set is derived from data records in tables and from the tree ontology of the table schema, and it is annotated with a sentence description that covers all facts in the triple set.

- **How to get the transformed data from original data:**
  - Run `python preprocess.py` in the current directory.
- **Main changes of the transformation:**
  - Use the `source` annotation as `domain`.
  - Retain entity-relation triples in the `tripleset` field of each turn.
  - If an original sample has multiple source & text annotations, split it into multiple samples, one per annotation.
- **Annotations:**
  - entity-relation triples
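
The transformation listed above can be sketched as follows. This is an illustration only, not the `preprocess.py` from this commit: it assumes each original record carries a `tripleset` list and an `annotations` list whose items have `source` and `text` keys, and the function name `split_annotations`, the output key `utterance`, and the sample record are all made up for the example.

```python
def split_annotations(record):
    """Return one output sample per source & text annotation in `record`.

    Assumed input layout (an assumption, not taken from this commit):
    {"tripleset": [[subject, relation, object], ...],
     "annotations": [{"source": ..., "text": ...}, ...]}
    """
    samples = []
    for annotation in record["annotations"]:
        samples.append({
            # The `source` annotation becomes the domain.
            "domain": annotation["source"],
            # Every split sample keeps its own copy of the triples.
            "tripleset": [list(triple) for triple in record["tripleset"]],
            # Hypothetical key name for the reference sentence.
            "utterance": annotation["text"],
        })
    return samples


if __name__ == "__main__":
    # Toy record invented for illustration; it is not real DART data.
    toy_record = {
        "tripleset": [["Alice", "WORKS_AT", "Acme"]],
        "annotations": [
            {"source": "WikiSQL_lily", "text": "Alice works at Acme."},
            {"source": "WikiSQL_decl_sents", "text": "Acme employs Alice."},
        ],
    }
    for sample in split_annotations(toy_record):
        print(sample)
```

Because each annotation becomes its own sample, the number of samples equals the number of annotations, which would be consistent with the Data Splits table reporting exactly one utterance per dialogue.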

### Supported Tasks and Leaderboards

NLG

### Languages

English

### Data Splits

| split | dialogues | utterances | avg_utt | avg_tokens | avg_domains | cat slot match(state) | cat slot match(goal) | cat slot match(dialogue act) | non-cat slot span(dialogue act) |
|------------|-------------|--------------|-----------|--------------|---------------|-------------------------|------------------------|--------------------------------|-----------------------------------|
| train | 62659 | 62659 | 1 | 18.85 | 1 | - | - | - | - |
| validation | 6980 | 6980 | 1 | 21.22 | 1 | - | - | - | - |
| test | 12552 | 12552 | 1 | 20.95 | 1 | - | - | - | - |
| all | 82191 | 82191 | 1 | 19.37 | 1 | - | - | - | - |

6 domains: ['WikiTableQuestions_mturk', 'WikiSQL_decl_sents', 'WikiSQL_lily', 'WikiTableQuestions_lily', 'webnlg', 'e2e']
- **cat slot match**: how many values of categorical slots are in the possible values of ontology in percentage.
- **non-cat slot span**: how many values of non-categorical slots have span annotation in percentage.


### Citation

```
@inproceedings{nan-etal-2021-dart,
title = "{DART}: Open-Domain Structured Data Record to Text Generation",
author = "Linyong Nan and Dragomir Radev and Rui Zhang and Amrit Rau and Abhinand Sivaprasad and Chiachun Hsieh and Xiangru Tang and Aadit Vyas and Neha Verma and Pranav Krishna and Yangxiaokang Liu and Nadia Irwanto and Jessica Pan and Faiaz Rahman and Ahmad Zaidi and Murori Mutuma and Yasin Tarabar and Ankit Gupta and Tao Yu and Yi Chern Tan and Xi Victoria Lin and Caiming Xiong and Richard Socher and Nazneen Fatema Rajani",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
year = "2021",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.naacl-main.37",
}
```

### Licensing Information

MIT License
Binary file added data/unified_datasets/dart/data.zip