Detecting Fake Reviews in a Set of Reviews
First, a basic Naïve Bayes approach is used to detect deception in the hotel reviews given in the training set. A unigram language model was built for each class, True and Fake.
As a first step, regular expressions are used to clean the data: punctuation marks such as the period (.), comma (,), question mark (?), exclamation mark (!) and hyphen (-) are removed. Stop words (frequently occurring words such as 'the' and 'a') are then filtered out using the stop-word list from the nltk package. Laplace (add-one) smoothing is applied to avoid zero likelihoods. The likelihood of each word given a class is calculated as its smoothed frequency in that class divided by the total number of word tokens in that class plus the vocabulary size. The prior P(c) is the fraction of reviews belonging to class c out of all reviews in both classes. To avoid underflow, all probabilities are computed as logarithms to base 10, and test words that did not occur in either the True or the Fake training set are ignored. Finally, for each review in the test set, the log probability under each class is computed, and the review is assigned to the class with the higher value, the classes being True (T) and Fake (F). The given training set is split in the ratio 80:20, and the accuracy is calculated over various such partitions; it is in the range of 48% to 57%.
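Below is a minimal sketch of this pipeline in Python. It assumes the reviews arrive as (text, label) pairs with labels "T" and "F"; the function names, the exact regular expression, and the data layout are illustrative assumptions, not the original implementation.

```python
import re
import math
from collections import Counter
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOP = set(stopwords.words("english"))

def tokenize(text):
    # Remove punctuation (?, !, ., comma, hyphen) with a regular
    # expression, lowercase, and drop stop words.
    text = re.sub(r"[?!.,\-]", " ", text.lower())
    return [w for w in text.split() if w not in STOP]

def train(reviews):
    # reviews: iterable of (text, label) pairs, label in {"T", "F"}
    counts = {"T": Counter(), "F": Counter()}
    n_docs = {"T": 0, "F": 0}
    for text, label in reviews:
        counts[label].update(tokenize(text))
        n_docs[label] += 1
    vocab = set(counts["T"]) | set(counts["F"])
    total_docs = sum(n_docs.values())
    model = {}
    for c in ("T", "F"):
        total_tokens = sum(counts[c].values())
        model[c] = {
            # log10 prior: fraction of reviews belonging to class c
            "log_prior": math.log10(n_docs[c] / total_docs),
            # Laplace-smoothed log10 likelihood for every vocabulary word
            "log_lik": {w: math.log10((counts[c][w] + 1) /
                                      (total_tokens + len(vocab)))
                        for w in vocab},
        }
    return model

def classify(model, text):
    scores = {}
    for c, params in model.items():
        score = params["log_prior"]
        for w in tokenize(text):
            if w in params["log_lik"]:  # unknown test words are ignored
                score += params["log_lik"][w]
        scores[c] = score
    return max(scores, key=scores.get)  # class with the higher log probability
```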
I tried to improve the accuracy by considering the following features:
- POS tagging the reviews. This makes use of the assumption that a fake review contains more verbs than nouns. The training set is tagged using the nltk package, and the unigram model is used to count verb- and noun-related tags in the True and Fake training sets; the resulting probability is added to the log probabilities in the Naïve Bayes algorithm (see the first sketch after this list). After this modification, the accuracy improved by 3% to 5%.
- Counting the number of sentiment words. This feature makes use of the assumption that a fake review contains more sentiment words than a true review. The sentiment lexicon of Minqing Hu and Bing Liu is used as the list of sentiment words. The log10 ratio of sentiment words to all other words is calculated for each of the True and Fake training sets (see the second sketch after this list). But this value turned out to be nearly identical for the two classes: -1.2 for the True training set and -1.19 for the Fake training set, so adding this feature did not improve the accuracy much.
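A minimal sketch of the POS-tagging feature, assuming nltk's default Penn Treebank tagset (verb tags start with "VB", noun tags with "NN"); exactly how the resulting value is weighted into the Naïve Bayes score is an assumption here.

```python
import math
import nltk  # requires nltk.download("punkt") and
             # nltk.download("averaged_perceptron_tagger")

def verb_noun_log_ratio(texts):
    """Return the log10 verb-to-noun tag ratio over a list of review
    texts (assumes at least one noun and one verb occur in the texts)."""
    verbs = nouns = 0
    for text in texts:
        for _, tag in nltk.pos_tag(nltk.word_tokenize(text)):
            if tag.startswith("VB"):
                verbs += 1
            elif tag.startswith("NN"):
                nouns += 1
    return math.log10(verbs / nouns)

# Computed once per class on the training data, the per-class value can be
# added to that class's Naive Bayes log score, boosting the class whose
# training reviews are more verb-heavy (weighting scheme assumed).
```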
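And a sketch of the sentiment-word count, using the Hu and Liu opinion lexicon as shipped with nltk; the log10 form mirrors the rest of the model and is an assumption.

```python
import math
from nltk.corpus import opinion_lexicon  # requires nltk.download("opinion_lexicon")
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

# The Hu & Liu lexicon lists roughly 6,800 positive and negative opinion words.
SENTIMENT = set(opinion_lexicon.positive()) | set(opinion_lexicon.negative())

def sentiment_log_ratio(texts):
    """Return the log10 ratio of sentiment-lexicon words to all other
    words over a list of review texts."""
    in_lexicon = other = 0
    for text in texts:
        for w in word_tokenize(text.lower()):
            if w in SENTIMENT:
                in_lexicon += 1
            else:
                other += 1
    return math.log10(in_lexicon / other)
```

Computed per class on the training data, this yields the single numbers reported above (-1.2 vs. -1.19), which makes clear why the feature carries little discriminative power.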
When finally tested on various 80:20 partitions, the accuracy was in the range of 51% to 60%.