
Commit

first update
iphysresearch committed May 26, 2018
1 parent 859086c commit f49bdb4
Showing 213 changed files with 641,008 additions and 0 deletions.
20,780 changes: 20,780 additions & 0 deletions case_data/movie_comment100.json
20,780 changes: 20,780 additions & 0 deletions case_data/movie_comment120.json
20,780 changes: 20,780 additions & 0 deletions case_data/movie_comment140.json
20,780 changes: 20,780 additions & 0 deletions case_data/movie_comment160.json
20,780 changes: 20,780 additions & 0 deletions case_data/movie_comment180.json
20,780 changes: 20,780 additions & 0 deletions case_data/movie_comment20.json
20,780 changes: 20,780 additions & 0 deletions case_data/movie_comment200.json
20,980 changes: 20,980 additions & 0 deletions case_data/movie_comment225.json
20,780 changes: 20,780 additions & 0 deletions case_data/movie_comment250.json
20,780 changes: 20,780 additions & 0 deletions case_data/movie_comment40.json
20,780 changes: 20,780 additions & 0 deletions case_data/movie_comment60.json
20,780 changes: 20,780 additions & 0 deletions case_data/movie_comment80.json
246 changes: 246 additions & 0 deletions case_data/movie_item.json
5,000 changes: 5,000 additions & 0 deletions case_data/movie_people10000.json
4,587 changes: 4,587 additions & 0 deletions case_data/movie_people15000.json
5,000 changes: 5,000 additions & 0 deletions case_data/movie_people20000.json
4,152 changes: 4,152 additions & 0 deletions case_data/movie_people25000.json
4,197 changes: 4,197 additions & 0 deletions case_data/movie_people30000.json
4,072 changes: 4,072 additions & 0 deletions case_data/movie_people35000.json
2,866 changes: 2,866 additions & 0 deletions case_data/movie_people40000.json
5,000 changes: 5,000 additions & 0 deletions case_data/movie_people5000.json
7,492 changes: 7,492 additions & 0 deletions data_cleaning&feature_engineering/.ipynb_checkpoints/Filting-checkpoint.ipynb
7,492 changes: 7,492 additions & 0 deletions data_cleaning&feature_engineering/Filting.ipynb

Binary files not shown (14 files).
Binary file added data_cleaning&feature_engineering/cover.jpg
274 changes: 274 additions & 0 deletions data_cleaning&feature_engineering/model.vec
1,968 changes: 1,968 additions & 0 deletions data_cleaning&feature_engineering/train_data_supervised_fasttext.txt
1,968 changes: 1,968 additions & 0 deletions data_cleaning&feature_engineering/train_data_unsupervised_fasttext.txt

Binary files not shown (2 files).
@@ -0,0 +1 @@
GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137,298.257223563]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]]
Binary files not shown (3 files).
@@ -0,0 +1,3 @@
<?xml version="1.0"?>
<!--<!DOCTYPE metadata SYSTEM "http://www.esri.com/metadata/esriprof80.dtd">-->
<metadata xml:lang="en"><Esri><MetaID>{7BF31F4C-0ACE-4D07-A747-9F6A8485CB3A}</MetaID><CreaDate>20110324</CreaDate><CreaTime>14020800</CreaTime><SyncOnce>TRUE</SyncOnce><DataProperties><lineage><Process ToolSource="C:\Program Files\ArcGIS\ArcToolbox\Toolboxes\Data Management Tools.tbx\RepairGeometry" Date="20110324" Time="140208">RepairGeometry World_countries_shp DELETE_NULL World_countries_shp</Process></lineage></DataProperties></Esri></metadata>
Binary file not shown.
1 change: 1 addition & 0 deletions douban_movie/.floydexpt
@@ -0,0 +1 @@
{"family_id": "DFFupXVj53JYsMV72VpUnh", "name": "douban_movie_coment"}
15 changes: 15 additions & 0 deletions douban_movie/.floydignore
@@ -0,0 +1,15 @@

# Directories and files to ignore when uploading code to floyd

.git
.eggs
eggs
lib
lib64
parts
sdist
var
*.pyc
*.swp
.DS_Store
data
99 changes: 99 additions & 0 deletions douban_movie/README
@@ -0,0 +1,99 @@

README file of Scrapy project for douban_movie
------------------------------------------------------------------

Before you crawl anything, you need to make sure some packages are installed.
You can install them by typing the following in your terminal:

>> pip install scrapy faker selenium


If there is no ‘data’ directory, please create it; it will store the
JSON files you crawl from the internet:

>> mkdir data


Then, change into the ‘bin’ directory, from which the Scrapy project is run:

>> cd bin

Next, download and unzip the PhantomJS package in the ‘bin’ directory:

>> wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2

>> tar -jxvf phantomjs-2.1.1-linux-x86_64.tar.bz2

Finally, we can crawl!
You can list all the spiders by typing the command:

>> scrapy list

==============================================================

STEP 1: Crawl for movie_item:

Just run:

>> scrapy crawl douban-movie

# This spider contains my Douban account and password; I was too lazy to change them, so please keep them confidential...
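
For reference, since the project installs selenium and downloads PhantomJS,
the login step presumably drives a headless browser roughly like the minimal
sketch below. This is an assumption, not the project's actual code; the URL,
element ids, and credentials are placeholders:

from selenium import webdriver

# placeholder path: adjust to wherever phantomjs was unzipped under ./bin
driver = webdriver.PhantomJS(
    executable_path='./phantomjs-2.1.1-linux-x86_64/bin/phantomjs')
driver.get('https://accounts.douban.com/login')                 # placeholder URL
driver.find_element_by_id('email').send_keys('my_account')      # placeholder id
driver.find_element_by_id('password').send_keys('my_password')  # placeholder id
driver.find_element_by_class_name('btn-submit').click()         # placeholder class
cookies = driver.get_cookies()  # cookies can then be attached to Scrapy requests
driver.quit()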

==============================================================

STEP 2: Crawl for movie_comment

Just run the following command one by one:

>> scrapy crawl douban-comment20 -a pages=1000
>> scrapy crawl douban-comment40 -a pages=1000
>> scrapy crawl douban-comment60 -a pages=1000
>> scrapy crawl douban-comment80 -a pages=1000
>> scrapy crawl douban-comment100 -a pages=1000
>> scrapy crawl douban-comment120 -a pages=1000
>> scrapy crawl douban-comment140 -a pages=1000
>> scrapy crawl douban-comment160 -a pages=1000
>> scrapy crawl douban-comment180 -a pages=1000
>> scrapy crawl douban-comment200 -a pages=1000
>> scrapy crawl douban-comment220 -a pages=1000
>> scrapy crawl douban-comment225 -a pages=1000
>> scrapy crawl douban-comment250 -a pages=1000

Here the 250 movies have been split into 13 parts to crawl, and the number
of pages is specified as a parameter (1000 by default).
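
For reference, Scrapy hands the '-a pages=1000' argument to the spider's
__init__ as a keyword argument. A minimal sketch of how such a spider can
accept it (the class name, subject id, and URL below are placeholders, not
the project's actual code):

import scrapy

class DoubanComment20Spider(scrapy.Spider):
    name = 'douban-comment20'

    def __init__(self, pages=1000, *args, **kwargs):
        super(DoubanComment20Spider, self).__init__(*args, **kwargs)
        self.pages = int(pages)  # '-a pages=...' arrives as a string

    def start_requests(self):
        # Douban shows 20 comments per page; walk up to self.pages pages
        for start in range(0, self.pages * 20, 20):
            yield scrapy.Request(
                'https://movie.douban.com/subject/1292052/comments?start=%d'
                % start)

    def parse(self, response):
        # extract comment fields here and yield items
        pass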

(HIGH LEVEL!)
Actually, you can crawl with all the douban-comment spiders at once,
but then you would also be banned at once! Instead, crawl the douban-comment
spiders two at a time by running:

>> scrapy crawlallcomment

and editing the spider list in ./douban_movie/commands/crawlallcomment.py,
as sketched below.
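
For reference, a custom command like crawlallcomment works by pointing the
COMMANDS_MODULE setting at the commands package. The file can be sketched
roughly as follows; the exact batching in the project's version may differ:

from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = True

    def short_desc(self):
        return 'Crawl a batch of douban-comment spiders'

    def run(self, args, opts):
        # edit this list to choose which two spiders run in this batch
        for name in ['douban-comment20', 'douban-comment40']:
            self.crawler_process.crawl(name, pages='1000')
        self.crawler_process.start()

This assumes settings.py contains COMMANDS_MODULE = 'douban_movie.commands';
Scrapy derives the command name from the module's file name.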
==============================================================

STEP 3: Crawl for movie_people

Just run the following command one by one:

>> scrapy crawl douban-people5000
>> scrapy crawl douban-people10000
>> scrapy crawl douban-people15000
>> scrapy crawl douban-people20000
>> scrapy crawl douban-people25000
>> scrapy crawl douban-people30000
>> scrapy crawl douban-people35000
>> scrapy crawl douban-people40000

Here the 35,776 people have been split into 8 parts to crawl.

(HIGH LEVEL!)
Likewise, you can crawl with all the douban-people spiders at once by typing:

>> scrapy crawlallpeople

However, you would without doubt be banned! You can modify the spider list in
./douban_movie/commands/crawlallpeople.py in the same way as for
crawlallcomment.


