Kenny Yu, Ali Nahm, R.J. Aquino, Joseph Ong
There are three steps to arriving at the final dataset. First, however, let's start with dependencies. Execute the following if you're on Ubuntu; otherwise, substitute the first line with the appropriate command to install pip on your system.
sudo apt-get install python-pip
sudo pip install nltk
sudo pip install mrjob
sudo pip install praw
Additionally, the csplit program used must be the GNU version, which supports the '{*}' repeat argument. If you're using the BSD version bundled with macOS, you're out of luck for now, though I'll try to figure out a workaround soon.
-
First, you'll need to scrape the data from Reddit. To do this, use the scraper.py file located in the scraper/scraper directory. Basic usage is as follows:
python scraper.py < inputfile > outputfile
Where inputfile is a list of subreddits, one per line and surrounded by double quotes, like this:
"pics"
"politics"
"aww"
"todayilearned"
"movies"
This will start a MapReduce job. It won't work very well unless you have some instances you can spin up on a cluster, or on Amazon EC2.
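To give you an idea of what the job is doing: roughly, scraper.py is an mrjob job whose mapper takes one quoted subreddit name per input line and pulls that subreddit's text down through praw. A minimal sketch of that shape (not the actual scraper.py -- the class name and the fields emitted are illustrative, and it assumes the older praw API that only needs a user_agent) looks something like this:

from mrjob.job import MRJob
import praw


class MRSubredditScraper(MRJob):
    """Sketch: scrape one subreddit per input line and emit its text."""

    def mapper(self, _, line):
        # Each input line is a quoted subreddit name, e.g. "pics".
        subreddit_name = line.strip().strip('"')
        reddit = praw.Reddit(user_agent='subreddit-scraper-sketch')

        # Walk the hot listing and emit (subreddit, text) pairs.
        for submission in reddit.get_subreddit(subreddit_name).get_hot(limit=100):
            yield subreddit_name, submission.title
            for comment in submission.comments:
                body = getattr(comment, 'body', None)  # skip MoreComments stubs
                if body:
                    yield subreddit_name, body


if __name__ == '__main__':
    MRSubredditScraper.run()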
You can create a ~/.mrjob.conf file in your home directory that will do all the necessary work to get the script running on EC2 (for example, installing dependencies on the cluster, etc.). I've provided an example in the scraper/scraper directory, but you'll need to fill in your own Amazon credentials and then move it into your home directory via:
mv example.mrjob.conf ~/.mrjob.conf
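For reference, the EC2-related part of a conf file for mrjob of this vintage looks roughly like the block below. Treat it as a sketch: option names have shifted between mrjob releases, so defer to example.mrjob.conf for the exact keys this project expects.

runners:
  emr:
    aws_access_key_id: YOUR_ACCESS_KEY
    aws_secret_access_key: YOUR_SECRET_KEY
    aws_region: us-east-1
    ec2_instance_type: m1.large
    num_ec2_instances: 12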
You'll notice that the conf file has a configuration called "emr". You can then run the script with that configuration as follows:
python scraper.py -r emr < inputfile > outputfile
Further, you'll want to tune this script based on the number of inputs you have. In particular, you'll want the number of mappers to be equal to the number of subreddits being scraped, to minimize the scraping runtime. For example, I had 24 subreddits to scrape, so I set the number of mapper tasks to 24 as follows:
python scraper.py -r emr --jobconf mapred.map.tasks=24 < subreddits_important > sr_impt_data
You also want to make sure that the number of instances you spin up is appropriate for the number of subreddits you are scraping. In this case, I spun up 12 m1.large instances, as per .mrjob.conf, so 2 mappers ran per instance. Don't run 12 instances for 1 subreddit, or you'll just be wasting a lot of money. On the flip side, don't use 1 instance for 24 subreddits, or your job will run for an excruciatingly long time.
I've found that 2 subreddits per instance works well, and will make your scraping job last about 6 hours.
I should really handle a lot of this under the hood for you, but at the moment, I don't think mrjob has hooks to let me set these values dynamically based on input size. Sorry!
-
Next, you'll need to clean the data you get back from the MapReduce job. In the scraper/cleaner directory, you'll find a clean.py script. It's also a MapReduce script, but you can probably just run this one on your computer instead of a cluster -- it shouldn't take too long unless step #1 produced a huge file (I mean like > 5 GB).
python cleaner.py < unclean_input > cleaned_output
Where unclean_input is the output of step #1, and cleaned_output is the name of the cleaned file you want.
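If you're wondering what "clean" means here, the script is another mrjob pass that normalizes the scraped text. A rough sketch (not the actual clean.py, which may normalize differently or lean on nltk; it assumes step #1's output is tab-separated subreddit/text lines) is something like this:

import re

from mrjob.job import MRJob


class MRCommentCleaner(MRJob):
    """Sketch: normalize one line of scraped subreddit text."""

    def mapper(self, _, line):
        # Assume step #1 emitted: subreddit <TAB> raw text.
        subreddit, _, text = line.partition('\t')
        text = text.lower()
        text = re.sub(r'http\S+', ' ', text)      # drop URLs
        text = re.sub(r'[^a-z\s]', ' ', text)     # strip punctuation and digits
        text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace
        if text:
            yield subreddit.strip('"'), text


if __name__ == '__main__':
    MRCommentCleaner.run()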
-
Finally, we'll need to segment the huge chunk of data we scraped into individual subreddit files. In order to do this, look in the scraper/segmenter directory. You can then run the command:
./segment inputfile outputdir
Where inputfile is the file produced by step #2, and outputdir is the name of the directory where you want all the per-subreddit files to go. You'll have to make the directory beforehand.
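To make concrete what the segmentation amounts to: it just splits the cleaned file into one file per subreddit key. The actual ./segment script does this with GNU csplit, but conceptually (assuming the same tab-separated subreddit/text lines as in the sketches above -- the real input format may differ slightly) it's equivalent to something like:

import os
import sys


def segment(input_path, output_dir):
    """Split a subreddit <TAB> text file into one file per subreddit."""
    handles = {}
    with open(input_path) as infile:
        for line in infile:
            subreddit, _, text = line.partition('\t')
            subreddit = subreddit.strip().strip('"')
            if not subreddit:
                continue  # skip blank or malformed lines
            if subreddit not in handles:
                out_path = os.path.join(output_dir, subreddit)
                handles[subreddit] = open(out_path, 'w')
            handles[subreddit].write(text)
    for handle in handles.values():
        handle.close()


if __name__ == '__main__':
    segment(sys.argv[1], sys.argv[2])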