# All the world's knowledge, in yer mongodb

get a crazy-ass 10Gb wikipedia xml dump straight into mongo, without thinking, without loading it into memory, and without any intermediate files, grepping, or nonsense.

this library uses xml-stream to navigate the large xml file, and wtf_wikipedia to parse the article contents into pretty JSON.

Using these tools, you can get a queryable wikipedia on a laptop in an afternoon.
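
Under the hood, the flow is: stream `<page>` elements out of the dump with xml-stream, hand each page's wikitext to wtf_wikipedia, and insert the resulting JSON into mongo. Here is a rough sketch of that pipeline, not the library's actual code: the element accessors, the wtf_wikipedia call (`.parse()` here; newer releases use `wtf(text).json()`), and the 2.x-era mongodb driver API are all assumptions you may need to adjust, and the db/collection names just mirror the mongo examples below.

    // rough sketch: xml dump -> xml-stream -> wtf_wikipedia -> mongo
    var fs = require('fs');
    var XmlStream = require('xml-stream');
    var wtf = require('wtf_wikipedia');
    var MongoClient = require('mongodb').MongoClient;

    // 2.x-era mongodb callback API (assumption; newer drivers hand back a client, not a db)
    MongoClient.connect('mongodb://localhost:27017/af_wikipedia', function (err, db) {
      if (err) throw err;
      var collection = db.collection('wikipedia');

      // stream the dump off disk, so the whole file never sits in memory
      var xml = new XmlStream(fs.createReadStream(process.argv[2]));

      // fires once per closed <page> element in the dump
      xml.on('endElement: page', function (page) {
        // how the <text> node surfaces depends on xml-stream's element shape (assumption)
        var text = page.revision && page.revision.text;
        if (text && typeof text === 'object') {
          text = text.$text;
        }
        var doc = wtf.parse(text || '') || {};   // older wtf_wikipedia API; newer releases use wtf(text).json()
        doc.title = page.title;
        collection.insert(doc, function () {});
      });

      xml.on('end', function () {
        console.log('done');
        db.close();
      });
    });

Handling one `endElement: page` at a time is what lets a multi-gigabyte dump go through without ever being loaded whole into memory.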

the node-expat dependency requires node <= v0.10.33
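
If your system node is newer than that, a version manager like nvm can give you a compatible runtime just for this project (assuming nvm is installed; the exact 0.10.x patch release shouldn't matter much):

    nvm install 0.10.33    # install a node version node-expat can build against
    nvm use 0.10.33        # switch this shell to it
    node --version         # should print v0.10.33
    npm install            # (re)build the dependencies, including node-expat, against this node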

## Flow for Afrikaans wikipedia

The Afrikaans wikipedia (only 33,556 articles) takes a few minutes to download, and about 10 minutes to load into mongo on a macbook.

    wget https://dumps.wikimedia.org/afwiki/latest/afwiki-latest-pages-articles.xml.bz2   # (38mb, couple minutes)
    bunzip2 ./afwiki-latest-pages-articles.xml.bz2                                        # (180mb, couple seconds)
    node index.js afwiki-latest-pages-articles.xml                                        # (couple minutes)

yahoo!

to view your data now,

    mongo
    use af_wikipedia

    // show a couple of pages
    db.wikipedia.find().skip(200).limit(2)

    // count the redirects (~4,000 in afrikaans)
    db.wikipedia.count({type: "redirect"})

    // find a specific page
    db.wikipedia.findOne({title: "Toronto"}).categories

the english wikipedia will work under the same process, just a little slower.
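
The commands are the same, just pointed at the enwiki dump (same dumps.wikimedia.org URL pattern; the sizes and timings here are rough guesses, so budget plenty of disk and time):

    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2   # much bigger than the afrikaans dump
    bunzip2 ./enwiki-latest-pages-articles.xml.bz2                                        # expands to many times that
    node index.js enwiki-latest-pages-articles.xml                                        # expect hours, not minutes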
