Using natural language processing to detect English idioms in documents.
Link to this repo:
All data files are already loaded into this repo in the data/ folder. To generate the data from scratch, follow these steps:
Run to create the list of idioms (data/idioms.txt)
Run to retrieve the example sentences of idioms (data/idiom_example.csv)
Run the tag_*.py scripts (brown, gutenberg, reuters, example) to generate sentences in which idioms are tagged (e.g. rank#BEGIN and#IN file#IN)
Run to convert the sentences from the text files into pickle files. Each pickle file contains a list of sentences, and each sentence is a list of (word, POStag, idiomTag) tuples: [List-of [List-of (word, POStag, idiomTag)]]
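The tagging and pickling steps above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual script: the function name `parse_tagged_sentence`, the "O" tag for words outside an idiom, and the "UNK" placeholder used when no POS tagger is supplied are all assumptions made for the example.

```python
import pickle

IDIOM_TAGS = ("BEGIN", "IN")  # tags seen in the README example


def parse_tagged_sentence(line, pos_tagger=None):
    """Turn 'rank#BEGIN and#IN file#IN' style text into
    a list of (word, POStag, idiomTag) tuples.

    pos_tagger is a hypothetical callable mapping a word to a POS tag;
    without it, the placeholder "UNK" is used.
    """
    triples = []
    for token in line.split():
        # Split on the LAST '#', so words that contain '#' are preserved.
        word, sep, tag = token.rpartition("#")
        if not sep or tag not in IDIOM_TAGS:
            # Assumption: words outside an idiom get the tag "O".
            word, tag = token, "O"
        pos = pos_tagger(word) if pos_tagger else "UNK"
        triples.append((word, pos, tag))
    return triples


# A corpus is then a list of sentences: [List-of [List-of (word, POStag, idiomTag)]]
corpus = [parse_tagged_sentence("the rank#BEGIN and#IN file#IN voted")]

# Round-trip through pickle, in memory here; a real script would write to data/.
restored = pickle.loads(pickle.dumps(corpus))
```

Passing a real tagger as `pos_tagger` would fill in the POS slot; the nested list shape stays the same either way.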
The following steps are required, because they create files needed for modeling that are too large to store on GitHub.
Get Google's pretrained Word2Vec file here. Unzip the .gz file and save it in the data/ folder.
Run to create pickle files that contain the frequencies of unigrams and bigrams in the train and test data.
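As a rough sketch of what the frequency step might look like, the snippet below counts unigrams and bigrams over sentences in the (word, POStag, idiomTag) format described earlier. The function name `count_ngrams` and the choice to lowercase words are assumptions for illustration; the repo's script may differ.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count unigrams and adjacent-word bigrams.

    sentences: list of sentences, each a list of (word, POStag, idiomTag) tuples.
    Returns (unigram_counts, bigram_counts) as Counter objects.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        # Assumption: counts are case-insensitive.
        words = [word.lower() for word, _pos, _tag in sentence]
        unigrams.update(words)
        # zip pairs each word with its successor; bigrams never cross sentences.
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


sample = [
    [("Rank", "NN", "BEGIN"), ("and", "CC", "IN"), ("file", "NN", "IN")],
    [("the", "DT", "O"), ("file", "NN", "O")],
]
unigram_counts, bigram_counts = count_ngrams(sample)
```

Both results are plain `Counter` objects, so they can be saved with `pickle.dump` like the sentence lists.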
"Model Testing.ipynb" contains a "grid search" of different possible combinations of using Word2Vec similarity scores, PMI/PPMI similarity scores, and how many words ahead and behind to look at.
- does a formal randomized grid search to see what the "best" regularization terms are.
- performs stronger regularization in an attempt to drive overly specific features down to zero (as a result, the model never predicts the presence of an idiom)
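For reference, the PMI/PPMI scores mentioned above can be computed from unigram and bigram counts along these lines. This is a sketch of the standard definitions, PMI(x, y) = log2(p(x, y) / (p(x) p(y))) and PPMI = max(0, PMI), not necessarily the exact variant used in the notebooks; the function names and the use of base-2 logarithms are assumptions.

```python
import math


def pmi(w1, w2, unigrams, bigrams):
    """Pointwise mutual information of the bigram (w1, w2), in bits.

    unigrams and bigrams are dict-like count tables.
    Returns -inf when the pair (or either word) was never observed.
    """
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    joint = bigrams.get((w1, w2), 0)
    count1 = unigrams.get(w1, 0)
    count2 = unigrams.get(w2, 0)
    if joint == 0 or count1 == 0 or count2 == 0:
        return float("-inf")
    p_xy = joint / total_bigrams
    p_x = count1 / total_unigrams
    p_y = count2 / total_unigrams
    return math.log2(p_xy / (p_x * p_y))


def ppmi(w1, w2, unigrams, bigrams):
    """Positive PMI: negative (and undefined) scores are clipped to zero."""
    return max(0.0, pmi(w1, w2, unigrams, bigrams))


toy_unigrams = {"a": 2, "b": 2}
toy_bigrams = {("a", "b"): 2}
score = ppmi("a", "b", toy_unigrams, toy_bigrams)
```

Clipping at zero keeps unseen pairs from contributing negative-infinity features, which is one common reason to prefer PPMI over raw PMI.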