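This project randomly samples 100,000 reposts of CaiXuKun's Weibo post 《再见,“任性的”千千…》, which has over one million reposts, and analyses the ratio of real to fake repost traffic as well as the user profiles of real and fake fans.
The main files are:
- CaiXuKun: Scrapy crawler code for fetching the repost data (with explanations and comments; requires MongoDB and Scrapy to be installed)
- scrapy.cfg: Scrapy configuration file
- CaiXuKun.ipynb: Jupyter notebook code that analyses the repost data
- stopwords.txt: stop word list
Data: 102,313 reposts of one CaiXuKun Weibo post that has over one million reposts
Requirements:
- Python 3.6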
- requests
- pyecharts
- pandas
- numpy
- pymongo
- scrapy
Note: for the detailed analysis, follow the WeChat official account Alfred数据室 and read the corresponding article 《用大数据扒一扒蔡徐坤的真假流量粉》
This project crawls 100,000+ reposts of a CaiXuKun Weibo post and analyses the ratio of real to fake reposts. The main files are listed below:
- CaiXuKun: Scrapy project for crawling the repost data (annotated; MongoDB and Scrapy need to be installed)
- scrapy.cfg: Scrapy configuration file
- CaiXuKun.ipynb: Jupyter notebook code for analysing the data
- stopwords.txt: stop word list
Data: 102,313 reposts of one of CaiXuKun's Weibo posts
Requirements:
- Python 3.6
- requests
- pyecharts
- pandas
- numpy
- pymongo
- scrapy
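Before running the crawler or the notebook, it can help to confirm that the packages listed above are importable. The following is a minimal sketch, not part of the project itself: the helper name `missing_packages` is hypothetical, and it assumes each package's import name matches the name in the list.

```python
from importlib.util import find_spec

# Import names of the third-party packages listed above
# (assumed to match the package names in the list).
REQUIRED = ["requests", "pyecharts", "pandas", "numpy", "pymongo", "scrapy"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

Any package reported as missing can then be installed with pip before starting the crawl. The check only covers the Python packages; a MongoDB server still has to be installed and running separately.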
Note: the detailed write-up is available by following Alfred's WeChat official account: Alfred_Lab