CASCADE: Contextual Sarcasm Detection in Online Discussion Forums

Code for the paper CASCADE: Contextual Sarcasm Detection in Online Discussion Forums (COLING 2018, Santa Fe, New Mexico).

Description

In this paper, we propose a ContextuAl SarCasm DEtector (CASCADE), which adopts a hybrid approach of both content- and context-driven modeling for sarcasm detection in online social media discussions (Reddit).

Requirements

Clone this repo.
Python (2.7 or 3.3-3.6)
Install TensorFlow 1.4.0 in your preferred variant (CPU or GPU; from PyPI, compiled from source, etc.).
Install the rest of the requirements: pip install -r requirements.txt
Download the FastText pre-trained embeddings and extract them somewhere.
Download the comments.json dataset file [1] and place it in data/.
If you want to run the Preprocessing steps (optional), install YAJL 2, download the train-balanced.csv file, save it under data/, and continue with the Preprocessing instructions. Otherwise, download user_gcca_embeddings.npz, place it in users/user_embeddings/, and go directly to the Running CASCADE section.

Preprocessing

User Embeddings: Stylometric features

The file data/comments.json contains Reddit users and their corresponding comments. A user may have multiple comments, so we concatenate all the comments belonging to the same user, separated by the <END> tag:
```
cd users
python create_per_user_paragraph.py
```
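The concatenation step can be sketched as follows. This is a minimal illustration, not the code in create_per_user_paragraph.py: the function name, the "author" and "text" keys, and the spacing around the <END> tag are all assumptions, and the real comments.json layout may differ.

```python
from collections import defaultdict

END_TAG = " <END> "  # tag named in the text; the surrounding spaces are an assumption


def build_user_paragraphs(comments):
    """Group comment texts by author and join each group with the <END> tag.

    `comments` maps a comment id to a dict with the hypothetical keys
    "author" and "text".
    """
    per_user = defaultdict(list)
    for comment in comments.values():
        per_user[comment["author"]].append(comment["text"])
    # One long "paragraph" per user, ready for a ParagraphVector model.
    return {user: END_TAG.join(texts) for user, texts in per_user.items()}


# Toy data standing in for data/comments.json.
sample = {
    "c1": {"author": "alice", "text": "Great, another Monday."},
    "c2": {"author": "bob", "text": "I like trains."},
    "c3": {"author": "alice", "text": "Totally my favorite day."},
}
paragraphs = build_user_paragraphs(sample)
print(paragraphs["alice"])
```

Each user ends up with a single document, which is the unit the next step embeds.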
The ParagraphVector algorithm is used to generate the stylometric features. First, train the model:
```
python train_stylometric.py
```
Generate user_stylometric.csv (user stylometric features) using the trained model:
```
python generate_stylometric.py
```
User Embeddings: Personality features

Pre-train a CNN-based model to detect personality features from text. The code trains on two datasets. The second dataset [2] can be obtained by requesting it from the original authors.
```
python process_data.py [path/to/FastText_embedding]
python train_personality.py
```
Generate user_personality.csv (user personality features) using this model:
```
python generate_user_personality.py
```
To use the pre-trained model from our experiments, download the model weights and unzip them inside the users/ folder.

User Embeddings: Multi-view fusion

Merge the user_stylometric.csv and user_personality.csv files into a single merged user_view_vectors.csv file:
```
python merge_user_views.py
```
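As a rough illustration of what the merge involves, the sketch below joins two CSV views on a shared user id. It is not the code in merge_user_views.py: the function name and the assumed layout (user id in the first column, features in the remaining columns, no header row) are hypothetical.

```python
import csv
import io


def merge_user_views(stylometric_file, personality_file):
    """Join two CSV views on the user id found in the first column.

    Only users present in both views are kept. Each output row is
    [user_id, stylometric features..., personality features...].
    """
    personality = {row[0]: row[1:] for row in csv.reader(personality_file) if row}
    merged = []
    for row in csv.reader(stylometric_file):
        if row and row[0] in personality:
            merged.append(row + personality[row[0]])
    return merged


# In-memory stand-ins for user_stylometric.csv and user_personality.csv.
stylometric_csv = io.StringIO("alice,0.1,0.2\nbob,0.3,0.4\n")
personality_csv = io.StringIO("bob,0.9\nalice,0.5\ncarol,0.7\n")
rows = merge_user_views(stylometric_csv, personality_csv)
for row in rows:
    print(row)
```

Dropping users that are missing from one view is a design choice of this sketch; the fusion step needs both views for every user it embeds.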
Multi-view fusion of the user views (stylometric and personality) is performed using GCCA (~ CCA for two views). Generate fused user embeddings user_gcca_embeddings.npz using the following command:
```
python user_wgcca.py --input user_embeddings/user_view_vectors.csv --output user_embeddings/user_gcca_embeddings.npz --k 100 --no_of_views 2
```
This implementation of GCCA has been adapted from the wgcca repo.
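
To give an intuition for the two-view case, the sketch below implements plain CCA with NumPy by whitening each view and taking an SVD of the cross-covariance. It is not the weighted GCCA code in user_wgcca.py: the function name, the ridge term `reg`, and the synthetic data are assumptions made for illustration, and it expects Python 3.5+ with a recent NumPy.

```python
import numpy as np


def two_view_cca(X, Y, k, reg=1e-6):
    """Return the top-k canonical correlations and projections for two views.

    X has shape (n, dx) and Y has shape (n, dy). `reg` is a small ridge
    term added for numerical stability (an assumed value).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # Inverse square root of a symmetric positive-definite matrix.
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]     # projection for view X
    B = Wy @ Vt.T[:, :k]  # projection for view Y
    return s[:k], A, B


# Synthetic views that share a 2-dimensional latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(500, 5))
Y = latent @ rng.normal(size=(2, 4)) + 0.01 * rng.normal(size=(500, 4))
corrs, A, B = two_view_cca(X, Y, k=2)
print(corrs)
```

Because both synthetic views are noisy linear maps of the same latent vectors, the two leading canonical correlations come out close to 1. In the pipeline above, the analogous role is played by the stylometric and personality views of each user.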

Finally:
```
cd ..
```
Discourse Embeddings

As with the user stylometric features, create the discourse features for each discussion forum (subreddit):
```
cd discourse
python create_per_discourse_paragraph.py
```
The ParagraphVector algorithm is used to generate the discourse features. First, train the model:
```
python train_discourse.py
```
Generate discourse.csv (discourse features) using the trained model:
```
python generate_discourse.py
```
Finally:
```
cd ..
```

Running CASCADE

CASCADE is a hybrid CNN that combines user embeddings and discourse features with textual modeling. Process the data and train the model:

```
cd src
python process_data.py [path/to/FastText_embedding]
python train_cascade.py
```

The CNN codebase has been adapted from the cnn-text-classification-tf repo by Denny Britz.

Citation

If you use this code in your work, please cite the paper CASCADE: Contextual Sarcasm Detection in Online Discussion Forums using the following entry:

@InProceedings{C18-1156,
  author = 	"Hazarika, Devamanyu
		and Poria, Soujanya
		and Gorantla, Sruthi
		and Cambria, Erik
		and Zimmermann, Roger
		and Mihalcea, Rada",
  title = 	"CASCADE: Contextual Sarcasm Detection in Online Discussion Forums",
  booktitle = 	"Proceedings of the 27th International Conference on Computational Linguistics",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"1837--1848",
  location = 	"Santa Fe, New Mexico, USA",
  url = 	"http://aclweb.org/anthology/C18-1156"
}

References

[1]. Khodak, Mikhail, Nikunj Saunshi, and Kiran Vodrahalli. "A large self-annotated corpus for sarcasm." Proceedings of the Eleventh International Conference on Language Resources and Evaluation. 2018.

[2]. Celli, Fabio, et al. "Workshop on computational personality recognition (shared task)." Proceedings of the Workshop on Computational Personality Recognition. 2013.
