GitHub - nikhilgk/wiki-pat

The cathierarchy table gives a page_id and page_title, and that page's categories in category_title. Some pages are themselves categories, so following the page-to-category links yields a hierarchy of categories.
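As a rough illustration of how the links connect into a hierarchy, here is a minimal Python sketch. It assumes one (page, category) link per row and that rows with page_type 'subcat' are the pages that are themselves categories; neither is confirmed by the table description. The function names and the sample rows are made up for the example.

```python
from collections import defaultdict


def build_category_graph(rows):
    """Map each category title to the set of subcategories filed under it.

    `rows` is an iterable of (page_title, category_title, page_type) tuples.
    Assumption: page_type == "subcat" marks a page that is itself a category.
    """
    children = defaultdict(set)
    for page_title, category_title, page_type in rows:
        if page_type == "subcat":
            children[category_title].add(page_title)
    return children


def descendants(children, root):
    """Return every category reachable below `root`, guarding against cycles."""
    seen = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Made-up rows, only to show the shape of the data.
rows = [
    ("Physics", "Science", "subcat"),
    ("Optics", "Physics", "subcat"),
    ("Laser", "Optics", "page"),
]
graph = build_category_graph(rows)
print(sorted(descendants(graph, "Science")))
```

The traversal keeps a `seen` set because category links in a wiki are not guaranteed to form a tree.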

Schema for cathierarchy table/cathierarchy.csv:

Field:	Type:
page_id	int(8) unsigned
page_title	varbinary(255)
category_title	varbinary(255)
page_type	enum('page','subcat','file') -- defaults to 'page'
page_is_redirect	tinyint(1) unsigned
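
A minimal sketch of reading cathierarchy.csv into typed records follows. It assumes the file is comma-separated, that the columns appear in the order listed above, and that a header row may or may not be present; check these against the actual file. `CatRow` and `read_cathierarchy` are hypothetical names, and the sample line is invented.

```python
import csv
import io
from typing import Iterator, NamedTuple


class CatRow(NamedTuple):
    page_id: int
    page_title: str
    category_title: str
    page_type: str
    page_is_redirect: bool


def read_cathierarchy(handle) -> Iterator[CatRow]:
    """Yield one CatRow per CSV record (column order is an assumption)."""
    for record in csv.reader(handle):
        if not record or record[0] == "page_id":
            continue  # skip blank lines and an optional header row
        page_id, page_title, category_title, page_type, is_redirect = record
        yield CatRow(
            int(page_id),
            page_title,
            category_title,
            page_type or "page",  # the schema says page_type defaults to 'page'
            is_redirect == "1",
        )


# Invented sample line; the real file would be opened with open(..., newline="").
sample = io.StringIO("12,Anarchism,Political_ideologies,page,0\n")
parsed = list(read_cathierarchy(sample))
print(parsed[0].category_title)
```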

Set up the data source on AWS

The data set of interest is here: http://aws.amazon.com/datasets/2345 Documentation is here: http://wiki.freebase.com/wiki/WEX/Documentation

To set this up on EC2, log on to AWS and go to the EC2 page. Change your region to US East (N. Virginia); that is the only region where the data set volume is available. Click Volumes > Create Volume, set the size to 80 GB, and search for the snapshot ID "snap-1781757e". Accept the other defaults. Note the zone where the volume is created.
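
The same volume could also be created from a script rather than the console. The sketch below uses the third-party boto3 library, which this README does not mention; it assumes AWS credentials are already configured, and the zone name "us-east-1a" is only a placeholder for whichever zone you choose. The helper names are hypothetical.

```python
def wex_volume_request(availability_zone):
    """Build the create-volume parameters described in the console steps."""
    return {
        "AvailabilityZone": availability_zone,
        "SnapshotId": "snap-1781757e",  # snapshot ID from the instructions
        "Size": 80,  # size in GiB
    }


def create_wex_volume(availability_zone="us-east-1a"):
    """Create the volume and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without it

    ec2 = boto3.client("ec2", region_name="us-east-1")
    response = ec2.create_volume(**wex_volume_request(availability_zone))
    return response["VolumeId"]


print(wex_volume_request("us-east-1a"))
```

Calling `create_wex_volume()` is what actually talks to AWS; keep the returned volume ID and the zone for the later steps.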

Now create a new instance. Since this is just for the data, a t1.micro is sufficient. I used the AMI XLT-4.3.4-Java7u21-Ubuntu-13.04-64bit-IPv6 (ami-01061568) since it had Java and Python already on it. When launching, make sure the instance is in the same region and zone where the EBS volume was created. Also, the default user to SSH into this one is 'ubuntu', not 'root'.

Once the EC2 instance launches, go back to the Volumes tab and attach the Wikipedia volume to this instance. You can then SSH to the instance and mount the volume.
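
For reference, the attach and mount steps might look like the following when scripted. This is a sketch under several assumptions: boto3 is installed, the volume is attached as /dev/sdf and shows up on the instance as /dev/xvdf (the actual device name depends on the instance and kernel), and /mnt/wex is an arbitrary mount point chosen for the example. The function names are hypothetical.

```python
import subprocess


def mount_commands(device="/dev/xvdf", mount_point="/mnt/wex"):
    """Return the commands to run on the instance to mount the volume."""
    return [
        ["sudo", "mkdir", "-p", mount_point],
        ["sudo", "mount", device, mount_point],
    ]


def attach_wex_volume(volume_id, instance_id, device="/dev/sdf"):
    """Attach the volume to the instance (run from a machine with AWS credentials)."""
    import boto3  # third-party; imported lazily

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)


def mount_wex_volume():
    """Run the mount commands; meant to be executed on the EC2 instance itself."""
    for command in mount_commands():
        subprocess.run(command, check=True)


for command in mount_commands():
    print(" ".join(command))
```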

Here are some more details on what you could do with the data: https://github.com/utcompling/applied-nlp/wiki/Spark-AWS-Exercise2#create-ebs-volume-with-wikipedia-wex-data-and-attach-it-to-the-cluster
