# All the world's knowledge, in your mongodb

Get a 10GB wikipedia xml dump straight into mongo, without thinking, without loading it into memory, and without any intermediate files, grepping, or nonsense.
This library uses xml-stream to navigate the large xml file, and wtf_wikipedia to parse the article contents into pretty JSON.
Using these tools, you can get a queryable wikipedia on a laptop in an afternoon.
Note: the node-expat dependency requires node <= v0.10.33
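To give a feel for the plumbing, here's a minimal sketch of that stream-and-parse loop — not the real index.js. It assumes the older synchronous wtf_wikipedia.parse() API, the mongodb 2.x driver, and a local mongod on the default port:

```js
// sketch: stream a wikipedia xml dump into mongo, one <page> at a time
var fs = require('fs');
var XmlStream = require('xml-stream');
var wtf_wikipedia = require('wtf_wikipedia');
var MongoClient = require('mongodb').MongoClient;

var file = process.argv[2]; // e.g. afwiki-latest-pages-articles.xml

MongoClient.connect('mongodb://localhost:27017/af_wikipedia', function (err, db) {
  if (err) { throw err; }
  var collection = db.collection('wikipedia');
  var xml = new XmlStream(fs.createReadStream(file));

  // fires once per <page> element, so the dump never sits in memory all at once
  xml.on('endElement: page', function (page) {
    // the wikitext lives in <revision><text>; its exact shape depends on attributes
    var wikitext = (page.revision.text && page.revision.text.$text) || page.revision.text || '';
    var doc = wtf_wikipedia.parse(wikitext); // plain JSON: sections, categories, infobox, type...
    doc.title = page.title;
    collection.insert(doc, function () {}); // fire-and-forget insert
  });

  xml.on('end', function () {
    console.log('done!');
    db.close();
  });
});
```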
The Afrikaans wikipedia (only 33,556 articles) takes a few minutes to download, and about 10 minutes to load into mongo on a macbook.
wget https://dumps.wikimedia.org/afwiki/latest/afwiki-latest-pages-articles.xml.bz2 #(38mb, couple minutes)
bunzip2 ./afwiki-latest-pages-articles.xml.bz2 #(180mb, couple seconds)
node index.js afwiki-latest-pages-articles.xml #(couple minutes)
yahoo!
To view your data:
mongo
use af_wikipedia
//show a couple of pages
db.wikipedia.find().skip(200).limit(2)
//count the redirects (~4,000 in afrikaans)
db.wikipedia.count({type:"redirect"})
//find a specific page
db.wikipedia.findOne({title:"Toronto"}).categories
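If you'd rather poke at the data from node instead of the mongo shell, the same queries look roughly like this — again assuming the mongodb 2.x driver and the af_wikipedia database from above:

```js
var MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost:27017/af_wikipedia', function (err, db) {
  if (err) { throw err; }
  var wiki = db.collection('wikipedia');

  // count the redirects
  wiki.count({ type: 'redirect' }, function (err, n) {
    console.log(n + ' redirects');

    // grab one page's categories
    wiki.findOne({ title: 'Toronto' }, function (err, page) {
      console.log(page ? page.categories : 'no such page');
      db.close();
    });
  });
});
```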
The English wikipedia will work under the same process, just a little slower.