This is a machine learning program that performs authorship identification on sample sentences from three horror authors: H. P. Lovecraft (HPL), Mary Wollstonecraft Shelley (MWS), and Edgar Allan Poe (EAP).
The project was created for a sample contest on Kaggle.
The classifier is based on Naive Bayes: it is trained on the labeled training data and then predicts the author of each unknown sentence.
The project includes the training and testing data files. The training data is labeled with the author of each sentence, while the test data is not labeled.
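As a rough illustration of the Naive Bayes idea described above, here is a minimal sketch using only the Python standard library. It is not the project's actual code: the function names (`tokenize`, `train`, `predict`), the add-one smoothing, and the three toy sentences are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, trim punctuation.
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]

def train(samples):
    # samples: iterable of (sentence, author) pairs.
    word_counts = defaultdict(Counter)  # author -> word frequencies
    doc_counts = Counter()              # author -> number of sentences
    for text, author in samples:
        doc_counts[author] += 1
        word_counts[author].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def predict(model, text):
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_author, best_score = None, -math.inf
    for author in doc_counts:
        # Log prior plus the sum of log likelihoods with add-one smoothing.
        score = math.log(doc_counts[author] / total_docs)
        total_words = sum(word_counts[author].values())
        for w in tokenize(text):
            if w not in vocab:
                continue  # words never seen in training are ignored
            score += math.log((word_counts[author][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_author, best_score = author, score
    return best_author

# Toy sentences invented for this example (not taken from the data files).
toy_training = [
    ("The ancient cult worshipped a nameless horror.", "HPL"),
    ("My creature wept for love and misery.", "MWS"),
    ("The raven tapped upon my chamber door.", "EAP"),
]
model = train(toy_training)
print(predict(model, "A nameless horror from the ancient cult"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.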
The following are the feature vectors that the program uses for prediction.
- bag of words (I put all of the training texts into lists labeled by author and create a bag of words from them, then read each test text and classify it with that bag of words)
- parts of speech (syntax features)
- lexical features (average number of words per sentence, sentence length variation, and lexical diversity)
- punctuation features (commas, semicolons, and colons per sentence)
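The lexical and punctuation features listed above could be computed along these lines. This is only a sketch under assumptions: the function name `stylometric_features`, the regex-based sentence and word splitting, and the use of population standard deviation for "sentence length variation" are choices made for this example, not necessarily what the project does.

```python
import re
from statistics import mean, pstdev

WORD_RE = re.compile(r"[A-Za-z']+")

def stylometric_features(text):
    # Assumed sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    lengths = [len(WORD_RE.findall(s)) for s in sentences]
    n = len(sentences)
    return {
        # Lexical features
        "avg_words_per_sentence": mean(lengths),
        "sentence_length_variation": pstdev(lengths),
        "lexical_diversity": len(set(words)) / len(words),
        # Punctuation features, normalized per sentence
        "commas_per_sentence": text.count(",") / n,
        "semicolons_per_sentence": text.count(";") / n,
        "colons_per_sentence": text.count(":") / n,
    }

# Invented sample text, used only to show the output shape.
sample = "It was dark, cold, and silent. I waited; nothing came."
for name, value in stylometric_features(sample).items():
    print(f"{name}: {value}")
```

Lexical diversity here is the ratio of distinct words to total words, so a text that never repeats a word scores 1.0.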
To run the code, make sure that you install all packages that the project uses. The project uses the following packages:
- numpy
- nltk
- scikit-learn (imported as sklearn)
To install the packages above, run the following command in your console (the PyPI name `sklearn` is deprecated, so the command uses `scikit-learn`):
python -m pip install --user numpy nltk scikit-learn
Distributed under the MIT License. See LICENSE for more information.
Shogo Akiyama - [email protected]
Project Link: https://github.com/shogo54/author-identification
The implementation is inspired by the following article:
In the future, I could apply a neural-network-based approach to this project.
The articles below might be useful: