Code & Data for the AAAI 2020 Paper "Likelihood Ratios and Generative Classifiers For Unsupervised OOD Detection In Task-Based Dialog"
Data:
The ROSTD dataset of OOD points can be found under data/fbrelease
This TSV file contains ~4500 OOD examples. The 3rd field of each line contains the sentence; it is the only field of interest - the other fields are vestigial and can be ignored.
Note that this OOD dataset is a companion to the ID dataset released as part of the paper "Cross-lingual transfer learning for multilingual task oriented dialog" by Schuster et al. at NAACL 2019.
This ID dataset can be found in its original form here.
Alternatively, you can directly use the splits we made (with ID train, and ID-OOD mixed validation and test) as described under the "Dataset Splits" section below.
Reference:
If you find our code or data useful, please consider citing our paper:
@article{gangal2019likelihood,
title={Likelihood Ratios and Generative Classifiers for Unsupervised Out-of-Domain Detection In Task Oriented Dialog},
author={Gangal, Varun and Arora, Abhinav and Einolghozati, Arash and Gupta, Sonal},
journal={arXiv preprint arXiv:1912.12800},
year={2019}
}
Contact:
For any questions or issues, either raise an issue here or drop an email at [email protected]
Code: [Under Progress]
Refer to requirements.txt for the Python package requirements. For other specifications, refer to other_specifications.txt
Code Structure and TLDR:
code/util.py: Contains most of the argument specifications. Ignore arguments or argument groups with an "IGNORE" comment above them
code/train.py: Contains the training and inference mechanism
code/model.py: Specifies the architecture for most of the models, e.g., the Discriminative Classifier, Generative Classifier, etc.
code/oodmetrics.py: Code for computing the OOD-related metrics such as AUROC
Please ignore code/model_gan.py and code/wasserstein.py. They are not used in the paper experiments; we have retained them only to avoid breaking the imports.
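For reference, AUROC (as computed in code/oodmetrics.py) measures how well an OOD score ranks OOD examples above ID ones. The snippet below is an illustrative from-scratch implementation using the pairwise-ranking (Mann-Whitney) formulation, not the repo's actual code:

```python
def auroc(scores, labels):
    """AUROC via pairwise ranking: the probability that a randomly
    chosen OOD example (label 1) receives a higher score than a
    randomly chosen ID example (label 0). Ties count as 0.5.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A score of 1.0 means the detector ranks every OOD example above every ID example; 0.5 is chance level.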
Dataset Splits:
- For fbrelease and fbreleasecoarse
You can directly find the ready-to-use dataset splits under code/data/{dataset_name}/unsup/ for dataset_name = fbrelease / fbreleasecoarse
This already contains the plain ID train split and the ID-OOD mixed dev and test splits.
Note that only the OOD part of the fbrelease dev and test splits constitutes our own released data. The rest is formed from existing datasets.
- For atis and snips
You will need to run some scripts to do random splitting, where a fraction of classes are held out as OOD.
The code/data/{dataset_name}/preprocess_{dataset_name}.sh needs to be run for this. (Where dataset_name = atis/snips)
Shell Scripts:
train_for_fbrelease.sh - Commands for fbrelease, i.e., ROSTD with its corresponding ID training and validation sets
train_for_fbreleasecoarse.sh - Commands for fbreleasecoarse, i.e., ROSTD with its corresponding ID training and validation sets, but with labels coarsened.
train_for_atis.sh - Commands for atis
train_for_snips.sh - Commands for snips
Notes:
- In all of these scripts, you will need to set super_root to point to where the repo resides on your system. This is needed because we use torchtext to preprocess, build the vocabulary, load, and minibatch our datasets, and we could only get it to work with absolute path specifications.