This repository contains a Jupyter notebook, imdb-review-classifier.ipynb, that performs sentiment analysis on IMDb movie reviews. The goal is to classify each review as positive or negative based on its text.
The dataset is the IMDb movie reviews dataset: 50,000 reviews, each labeled as positive or negative, split evenly into 25,000 positive and 25,000 negative reviews. It is stored in the imdb-reviews-data directory as IMDB Dataset.csv.
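As a rough sketch of how the CSV could be loaded and the 25,000/25,000 split checked with pandas: the column names `review` and `sentiment` are assumptions (the README does not name them), and `load_reviews` and `class_balance` are hypothetical helpers, not functions from the notebook.

```python
from pathlib import Path

import pandas as pd

# Path taken from the repository layout; adjust if the notebook runs elsewhere.
DATA_PATH = Path("imdb-reviews-data") / "IMDB Dataset.csv"


def load_reviews(path=DATA_PATH):
    """Read the reviews CSV into a DataFrame.

    Assumes two columns named 'review' and 'sentiment' (not confirmed here).
    """
    df = pd.read_csv(path)
    missing = {"review", "sentiment"} - set(df.columns)
    if missing:
        raise KeyError(f"expected columns missing: {sorted(missing)}")
    return df


def class_balance(df, label_column="sentiment"):
    """Return the number of rows per label as a plain dict."""
    return df[label_column].value_counts().to_dict()


if DATA_PATH.exists():
    reviews = load_reviews()
    print(reviews.shape)
    print(class_balance(reviews))
```

If the assumptions hold, the balance check should report the same count for both labels.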
Imdb_sentiment_analysis/
├── imdb-review-classifier.ipynb
├── imdb-reviews-data/
│ └── IMDB Dataset.csv
├── README.md
└── LICENSE
The notebook provides a step-by-step process to:
- Load and preprocess the IMDb dataset.
- Transform the text data into numerical features.
- Train a machine learning model to classify sentiments.
- Evaluate the model's performance.
- Visualize the results.
In this step, we load the IMDb dataset and preprocess the text data. Preprocessing involves:
- Removing HTML tags
- Removing punctuation
- Removing numbers
- Converting text to lowercase
- Removing stop words (common words that don't carry much meaning, like 'and', 'the', etc.)
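The cleaning steps above can be sketched with the standard `re` module. This is a minimal illustration, not the notebook's exact code: `clean_review` is a hypothetical helper, and the tiny stop-word set is a placeholder for a fuller list such as `nltk.corpus.stopwords.words("english")` (available after `nltk.download("stopwords")`).

```python
import re

# Placeholder stop-word list (an assumption for this sketch). A fuller list,
# for example NLTK's English stop words, would normally be passed in instead.
DEFAULT_STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "is", "was", "it", "this", "of", "to", "in"}
)

TAG_RE = re.compile(r"<[^>]+>")         # HTML tags such as <br />
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # punctuation, digits, other symbols


def clean_review(text, stop_words=DEFAULT_STOP_WORDS):
    """Apply the preprocessing steps to a single review string."""
    text = TAG_RE.sub(" ", text)         # 1. remove HTML tags
    text = text.lower()                  # 2. lowercase (so the next regex can be a-z only)
    text = NON_LETTER_RE.sub(" ", text)  # 3. remove punctuation and numbers
    tokens = [tok for tok in text.split() if tok not in stop_words]  # 4. drop stop words
    return " ".join(tokens)


print(clean_review("This movie was GREAT!<br /><br />Easily a 10 out of 10."))
# movie great easily out
```

Lowercasing happens before the punctuation and number removal here only so that a single `a-z` pattern covers both; the result is the same as applying the steps in the order listed.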
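We use the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer to convert the text data into numerical features. TF-IDF weights each word by how often it occurs in a review, discounted by how many reviews in the whole collection contain it, so words that appear almost everywhere contribute less than words that are distinctive for a review.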
We use a random forest classifier to train the model. A random forest is an ensemble method: it builds many decision trees during training, each on a random sample of the data and features, and for classification it predicts the class favored by the trees taken together.
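A minimal sketch of this training step, chaining `TfidfVectorizer` and `RandomForestClassifier` in a scikit-learn pipeline. The eight toy reviews, `n_estimators=100`, and `random_state=42` are assumptions chosen for illustration, not settings taken from the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training data standing in for the cleaned IMDb reviews.
train_texts = [
    "wonderful film loved every minute",
    "brilliant acting wonderful story",
    "loved the cast great fun",
    "great story brilliant ending",
    "boring plot awful acting",
    "terrible film waste of time",
    "awful script boring characters",
    "waste of money terrible ending",
]
train_labels = ["positive"] * 4 + ["negative"] * 4

# Vectorize and classify in one object so the same steps run at predict time.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),  # assumed hyperparameters
)
model.fit(train_texts, train_labels)

new_reviews = ["loved this wonderful film", "boring and awful"]
predictions = model.predict(new_reviews)
probabilities = model.predict_proba(new_reviews)  # columns follow model[-1].classes_

print(predictions)
print(model[-1].classes_)
```

Wrapping both steps in a pipeline keeps the vocabulary learned during training tied to the classifier, which avoids accidentally refitting the vectorizer on new reviews.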
The performance of the model is evaluated using metrics like accuracy, precision, recall, and F1-score. We also visualize the results using a confusion matrix to understand the number of correct and incorrect predictions.
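As an illustration of these metrics, the sketch below scores a handful of made-up labels with `sklearn.metrics`. The `y_true` and `y_pred` values are invented for the example and are not results from the notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Invented labels: 4 positive and 4 negative reviews, with 2 mistakes.
y_true = ["positive", "positive", "positive", "negative",
          "negative", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "positive", "negative",
          "negative", "positive", "negative", "positive"]

accuracy = accuracy_score(y_true, y_pred)

# Precision, recall, and F1 for the "positive" class.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label="positive", average="binary"
)

# Rows are true labels, columns are predicted labels, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])

print(accuracy)     # 0.75
print(cm.tolist())  # [[3, 1], [1, 3]]
```

Reading the matrix: 3 negative and 3 positive reviews are classified correctly, 1 negative review is predicted as positive, and 1 positive review is predicted as negative.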
Visualizations make the model's performance easier to interpret. We use the Seaborn and Matplotlib libraries to plot the confusion matrix and other metrics.
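One way such a plot could look is a Seaborn heatmap of the confusion matrix. The counts below are made up for illustration, and the color map and labels are assumptions rather than the notebook's actual styling.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Made-up confusion matrix: rows are true labels, columns are predicted labels.
cm = np.array([[3, 1],
               [1, 3]])
labels = ["negative", "positive"]

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(
    cm,
    annot=True,   # write the count inside each cell
    fmt="d",      # format counts as integers
    cmap="Blues",
    xticklabels=labels,
    yticklabels=labels,
    ax=ax,
)
ax.set_xlabel("Predicted label")
ax.set_ylabel("True label")
ax.set_title("Confusion matrix")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, `plt.show()` or `fig.savefig(...)` would be needed to see or keep it.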
The following libraries are used in this project:
- re (Python standard library)
- nltk
- pandas
- seaborn
- matplotlib
- scikit-learn (imported as sklearn)