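@author: peng.huang @email: [email protected] @date: 2015-04-02 @update content: Import the hotel data scraped from ctrip into the local database
- The import has to deduplicate hotels by name; it uses a kdtree and the shingling algorithm to remove duplicate hotels
- Shingling alone is not enough to filter out all duplicate hotels; the next commit needs to add another algorithm to catch the remaining duplicates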
@author: peng.huang @email: [email protected] @date: 2015-04-02 @update content: detailed README.md
This is a spider project. It uses Django to manage the models and Scrapy to crawl websites.
It scrapes hotel information, hotel images, and hotel reviews, records logs, and sends an email when a crawl completes.
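
The update log above mentions shingling for detecting duplicate hotel names. As a rough illustration of the idea only (not the project's actual code), the sketch below splits each name into overlapping character shingles and compares the resulting sets with Jaccard similarity. The function names, the shingle size `k=2`, and the `0.8` threshold are all assumptions chosen for the example; the kdtree step is not shown.

```python
def shingles(text, k=2):
    """Return the set of overlapping k-character shingles of a name.

    Whitespace is removed and case is folded first, so that
    "Hilton  Shanghai" and "hilton shanghai" yield the same set.
    """
    normalized = "".join(text.lower().split())
    if len(normalized) <= k:
        # Too short to slide a window over: treat the whole string as one shingle.
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_probable_duplicate(name_a, name_b, k=2, threshold=0.8):
    """Hypothetical helper: flag two hotel names as likely duplicates.

    The threshold is an assumed value and would need tuning on real data.
    """
    return jaccard(shingles(name_a, k), shingles(name_b, k)) >= threshold


if __name__ == "__main__":
    print(is_probable_duplicate("Hilton Shanghai", "hilton  shanghai"))
    print(is_probable_duplicate("Hilton Shanghai", "Marriott Beijing"))
```

Names that differ only in case or spacing score 1.0 here, while names that share few shingles score close to 0. As the update log notes, name similarity alone misses some duplicates, so a score like this is probably best treated as one signal among several.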
