GitHub - purugupta99/customer-review-sentiment-analysis

Data Preprocessing
- The input is a 2 GB data corpus of HTML web pages
- Using BeautifulSoup with lxml parsing, the relevant data (keywords, URLs, reviews, etc.) was extracted; the result can be found in processed.csv
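
The extraction step can be illustrated with a small sketch. To keep it dependency-free, it uses the standard library's `html.parser` as a stand-in for BeautifulSoup and lxml, so it is not the notebook's actual code. The page layout (a `keywords` meta tag, a canonical link, and `<p class="review">` elements), the class name `ReviewPageParser`, and the column names are all assumptions made for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class ReviewPageParser(HTMLParser):
    """Collects keywords, URL, and review text from one page (hypothetical layout)."""

    def __init__(self):
        super().__init__()
        self.keywords = ""
        self.url = ""
        self.reviews = []
        self._in_review = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "keywords":
            self.keywords = attrs.get("content", "")
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.url = attrs.get("href", "")
        elif tag == "p" and attrs.get("class") == "review":
            # Start a new review and collect its text in handle_data.
            self._in_review = True
            self.reviews.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_review = False

    def handle_data(self, data):
        if self._in_review:
            self.reviews[-1] += data


def extract_rows(html):
    """Return one dict per review found in the page."""
    parser = ReviewPageParser()
    parser.feed(html)
    parser.close()
    return [
        {"url": parser.url, "keywords": parser.keywords, "review": text.strip()}
        for text in parser.reviews
    ]


# Made-up sample page, not taken from the corpus.
SAMPLE_PAGE = """
<html><head>
<meta name="keywords" content="phone, battery">
<link rel="canonical" href="https://example.com/item/1">
</head><body>
<p class="review"> Great battery life. </p>
<p class="review">Screen cracked in a week.</p>
<p>Unrelated footer text</p>
</body></html>
"""

rows = extract_rows(SAMPLE_PAGE)

# Write the rows as CSV (to an in-memory buffer here instead of processed.csv).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["url", "keywords", "review"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

With BeautifulSoup installed, the same fields could be pulled with `BeautifulSoup(html, "lxml")` and `find_all` calls in place of the parser class.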
Text processing
- change case to lower
- tokenize text and remove punctuation
- remove words that contain numbers
- remove stop words
- remove empty tokens
- lemmatize text
- remove words with only one letter
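
The steps above can be sketched as a single function that applies them in the listed order. This is a minimal illustration, not the notebook's code: the stop-word list is a tiny made-up sample, tokenization is a plain whitespace split, and lemmatization is passed in as a callable (identity by default) so the sketch runs without extra downloads. A real run would likely plug in a full stop-word list and a proper lemmatizer such as NLTK's `WordNetLemmatizer().lemmatize`.

```python
import string

# Tiny illustrative stop-word list (an assumption); a real run would use a full one.
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "was", "i", "of", "to", "in", "this"}


def clean_text(text, lemmatize=lambda token: token):
    """Apply the listed text-processing steps in order and return the cleaned string."""
    # 1. change case to lower
    text = text.lower()
    # 2. tokenize on whitespace and strip punctuation from both ends of each token
    tokens = [token.strip(string.punctuation) for token in text.split()]
    # 3. remove words that contain numbers
    tokens = [t for t in tokens if not any(ch.isdigit() for ch in t)]
    # 4. remove stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 5. remove empty tokens (left behind by punctuation-only tokens)
    tokens = [t for t in tokens if t]
    # 6. lemmatize (identity unless a lemmatizer is supplied)
    tokens = [lemmatize(t) for t in tokens]
    # 7. remove words with only one letter
    tokens = [t for t in tokens if len(t) > 1]
    return " ".join(tokens)


cleaned = clean_text("The battery lasted 2 days, and I LOVED it!!! X")
# cleaned == "battery lasted days loved"
```

Passing the lemmatizer as an argument keeps the cleaning logic independent of any one NLP library.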
Feature extraction
- Apply sentiment analysis to the processed text to get positive, neutral, negative and compound scores as features
- Add the number of characters and the number of words as features for all samples
- Add document features and term frequency features for all samples
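
As a rough illustration of the count-based and term-frequency features, the sketch below computes character and word counts and a plain TF-IDF weighting using only the standard library. It assumes that the "term frequency feature" refers to a TF-IDF style weighting, which the list above does not confirm. The feature names `nb_chars` and `nb_words` and the sample texts are hypothetical, and the sentiment scores are left out because they come from the VADER model.

```python
import math
from collections import Counter


def length_features(text):
    """Number of characters and whitespace-separated words in one sample."""
    return {"nb_chars": len(text), "nb_words": len(text.split())}


def tfidf(documents):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(number of documents / document frequency)
    This is one simple TF-IDF variant; libraries such as scikit-learn add smoothing.
    """
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up, already-cleaned samples.
docs = ["good battery good", "bad battery"]
lengths = [length_features(d) for d in docs]
weights = tfidf(docs)
# "battery" appears in every document, so its weight is 0.0 in both.
```

Terms that occur in every document carry no weight under this variant, which is why a smoothed version is often preferred in practice.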
Classification
- Clustering – apply k-means clustering to group reviews into positive and negative clusters based on the new features
- VADER model – use the VADER model to score the overall sentiment of each review and predict whether it is positive or negative
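
Both approaches can be sketched in a few lines. The k-means part below is a plain-Python version of Lloyd's algorithm with a deterministic start, standing in for a library implementation such as scikit-learn's `KMeans`. The feature rows are made-up (positive, negative, compound) triples, and the `0.0` cut-off in `label_from_compound` is an assumption, since the text above does not state a threshold.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(values) / len(points) for values in zip(*points))


def kmeans(points, k=2, iterations=20):
    """Lloyd's algorithm; the first k points are used as initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def label_from_compound(compound, threshold=0.0):
    """Map a VADER-style compound score to a label (threshold is an assumption)."""
    return "positive" if compound >= threshold else "negative"


# Hypothetical (positive, negative, compound) feature rows.
points = [
    (0.60, 0.05, 0.85),
    (0.05, 0.55, -0.70),
    (0.55, 0.00, 0.80),
    (0.10, 0.60, -0.65),
]
labels, centroids = kmeans(points, k=2)

# Clusters carry no names, so call the one with the higher mean compound "positive".
positive_cluster = max(range(2), key=lambda c: centroids[c][2])
predictions = [
    "positive" if label == positive_cluster else "negative" for label in labels
]
```

Because k-means is unsupervised, the cluster indices are arbitrary; naming them by the centroid's compound score is one simple way to compare the clusters with the VADER-based labels.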

Repository files
- Feature Extraction.ipynb
- README.md
- Report.pdf
- Sentiment Analysis.ipynb
- output.csv
- processed.csv
