Using natural language processing to detect English idioms in documents.
Link to this repo: https://github.com/drussotto/idiom_detection
All data files are already included in this repo in the data/ folder. To regenerate the data from scratch, follow these steps:
- Run idiom_scraper.py to create the list of idioms (data/idioms.txt).
- Run idiom_example.py to retrieve example sentences for the idioms (data/idiom_example.csv).
- Run the tag_*.py scripts (brown, gutenberg, reuters, example) to generate sentences in which the idioms are tagged (e.g. rank#BEGIN and#IN file#IN).
- Run build_train_test.py to convert the sentences from the text files into pickle files containing lists of sentences, structured as [List-of [List-of (word, POStag, idiomTag)]] (see the sketch after this list).
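
For a sense of what those pickle files hold, here is a minimal sketch of loading and inspecting one. The file name data/train.pkl is an assumption; adjust it to whatever build_train_test.py actually writes:

```python
import pickle

# Hypothetical file name -- use the actual output of build_train_test.py.
with open("data/train.pkl", "rb") as f:
    train_sentences = pickle.load(f)

# Each sentence is a list of (word, POStag, idiomTag) triples, e.g.
# [("rank", "NN", "BEGIN"), ("and", "CC", "IN"), ("file", "NN", "IN"), ...]
for word, pos_tag, idiom_tag in train_sentences[0]:
    print(f"{word}\t{pos_tag}\t{idiom_tag}")
```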
The following steps must be performed manually, as they create files needed for modeling that are too large to be stored on GitHub.
- Get Google's pretrained Word2Vec file here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit. Unzip the .gz file and save it in the data/ folder (see the loading sketch after this list).
- Run create_n_grams_freq.py to create pickle files that contain the frequencies of unigrams and bigrams in the train and test data.
- "Model Testing.ipynb" contains a "grid search" over the possible combinations of Word2Vec similarity scores, PMI/PPMI similarity scores, and how many words ahead and behind to look at (a PMI/PPMI sketch appears after this list).
- hyperparameter_tuning.py performs a formal randomized grid search to find the "best" regularization terms (see the sketch after this list).
- regularization.py applies more intense regularization in an attempt to drive overly specific feature weights down to zero (in practice, the result is a model that never predicts the presence of an idiom).
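
Once downloaded and unzipped, the pretrained vectors can be loaded with gensim. A minimal sketch, assuming the unzipped file carries its standard name GoogleNews-vectors-negative300.bin:

```python
from gensim.models import KeyedVectors

# Google's pretrained 300-dimensional vectors; loading takes a few minutes
# and several GB of memory. Adjust the path if your file name differs.
w2v = KeyedVectors.load_word2vec_format(
    "data/GoogleNews-vectors-negative300.bin", binary=True
)

# Cosine similarity between two words, the basis of the Word2Vec
# similarity features.
print(w2v.similarity("rank", "file"))
```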
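The PMI/PPMI scores follow the standard definitions: PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with PPMI clipping negative values to zero. A minimal sketch over unigram and bigram counts; the toy dictionaries here are stand-ins for the frequencies pickled by create_n_grams_freq.py:

```python
import math

def pmi(w1, w2, unigrams, bigrams, total_unigrams, total_bigrams):
    """Pointwise mutual information: log2( p(w1,w2) / (p(w1) * p(w2)) )."""
    p_xy = bigrams[(w1, w2)] / total_bigrams
    p_x = unigrams[w1] / total_unigrams
    p_y = unigrams[w2] / total_unigrams
    return math.log2(p_xy / (p_x * p_y))

def ppmi(w1, w2, unigrams, bigrams, total_unigrams, total_bigrams):
    """Positive PMI: negative scores are clipped to zero."""
    return max(0.0, pmi(w1, w2, unigrams, bigrams,
                        total_unigrams, total_bigrams))

# Toy counts; the real frequencies come from the pickled n-gram files.
unigrams = {"rank": 50, "file": 40}
bigrams = {("rank", "file"): 8}
print(ppmi("rank", "file", unigrams, bigrams,
           total_unigrams=10_000, total_bigrams=9_000))
```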
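The randomized search in hyperparameter_tuning.py could be expressed with scikit-learn along these lines. This is a sketch only: the logistic regression model, the placeholder data, and the parameter range are stand-ins, not the repo's actual setup:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Placeholder data; in the repo this would be the assembled feature
# matrix and per-word idiom labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Sample the inverse regularization strength C on a log scale;
# smaller C means stronger regularization.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-4, 1e2)},
    n_iter=20,
    cv=5,
    scoring="f1",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```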