中文 | English
This project provides a comprehensive pipeline for analyzing Twitter (X) data, including crawling tweets, retrieving comments and user information based on keywords and date ranges, and further performing keyword extraction, sentiment analysis, and statistical descriptions.
- Crawl tweets, user information, and comments based on specified keywords and date ranges.
- Extract structured data, including:
- Tweet metadata (content, likes, retweets, replies, and timestamps).
- User information (followers, location, etc.).
- Comments related to the tweets.
-
Convert JSON to Excel:
- Merge and process raw JSON files containing tweets, user information, and comments.
- Convert data into structured Excel files (
.xlsx
) with three sheets:- Main Data: Detailed tweet information.
- Comments: Associated comment data.
- Users: User profile data.
-
Preprocessing Methods:
- Remove duplicate and irrelevant records.
- Clean special characters (e.g., removing
@
mentions). - Standardize date formats and handle missing values.
- Classify tweets and comments into predefined sentiment categories using advanced models.
- Classification System Prompt:
You are an advanced language model trained to analyze the sentiment of a given text. Your task is to classify the text into one of the following categories: 1. Positive 2. Negative 3. Objective 4. Sarcastic 5. Neutral 6. Disappointment 7. Disgust 8. Contempt 9. Hate For the input text, determine the most appropriate category based on its emotional tone, intent, and context. Return only the category name as the output, with no additional explanation or commentary.
- Supported sentiment categories include:
- Positive, Negative, Neutral, Sarcastic, Hate, Disappointment, Contempt, Disgust.
- Visualize sentiment trends over time.
- Analyze language distribution in tweets and comments.
- Generate:
- Language distribution pie charts.
- Language profiles to showcase cultural trends.
- Extract keywords and topics using:
- TF-IDF: Identify frequently discussed topics.
- TextRank: Highlight key phrases using graph-based ranking methods.
- Create word clouds and bar charts for intuitive visualization of keyword information.
- Measure tweet engagement:
- Analyze distributions of likes, retweets, and comments.
- Use bar charts to display total and average engagement.
- Explore sentiment distributions and engagement trends over time.
- Build networks:
- Keyword Co-occurrence: Highlight relationships between frequently mentioned terms.
- Tweet Propagation: Analyze retweet or comment relationships in tweets.
- Export networks in
.gexf
format for visualization in tools like Gephi.
- Summarize analysis results into a
.docx
report, including:- Data insights.
- Sentiment and language trends.
- Keyword highlights and engagement statistics.
git clone https://github.com/AquariniqueMu/TwitterCrawl4Analysis.git
cd TwitterCrawl4Analysis
- Python version:
>=3.8
(Recommended: Python 3.12.5) - Install required libraries:
pip install -r requirements.txt
This project uses Jupyter Notebook files (.ipynb
) for all steps of the analysis. Follow the steps below:
- Open
twitter_crawl.ipynb
. - Configure the keywords, date range, and Twitter account information. Execute each cell to crawl tweets, comments, and user information.
- Open
json_process.ipynb
. - Execute all cells to convert raw JSON files into structured Excel files (
.xlsx
).
- For specific analysis tasks:
- Sentiment Analysis: Open and run
analysis_emotion.ipynb
. - Keyword and Trend Analysis: Open and run
analysis_keyword_and_trend.ipynb
. - Language Distribution Analysis: Open and run
analysis_language.ipynb
. - User Analysis: Open and run
analysis_user.ipynb
.
- Sentiment Analysis: Open and run
- Excel Files:
- Three sheets:
Main Data
,Comments
, andUsers
.
- Three sheets:
- Charts:
- Sentiment distribution bar charts.
- Keyword trends and word clouds.
- Language distribution pie charts.
- Engagement and sentiment trends over time.
- Data Crawling:
Selenium
,requests
,webdriver-manager
- Data Processing:
Pandas
,JSON
,OpenPyXL
- Sentiment Analysis:
Transformers
,OpenAI API
- Keyword Extraction:
SpaCy
,TF-IDF
,TextRank
- Statistical Analysis:
Matplotlib
,Seaborn
,WordCloud
- Network Analysis:
NetworkX
,Gephi
- Keywords: Modify keywords directly in the notebook files to crawl specific topics.
- Sentiment Analysis: Adjust the sentiment analysis prompt in
analysis_emotion.ipynb
for tailored analysis. - Visualizations: Modify chart styles and layouts in the analysis notebooks.
This project is open-sourced under the MIT License. See the LICENSE
file for details.
- Thanks to hanxinkong and CX330Blake for their Twitter crawling tools, as well as Stopwords ISO, which formed the foundation of this project.