# All the world's knowledge, in yer mongodb

Get a crazy-ass 10GB Wikipedia XML dump straight into Mongo, without thinking, without loading it into memory, and without any intermediate files, grepping, or nonsense.

This library uses xml-stream to navigate the large XML file, and wtf_wikipedia to parse the article contents into pretty JSON.
Using these tools, you can get a queryable Wikipedia on a laptop in an afternoon.
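
To give a feel for the "without loading it into memory" part, here is a rough sketch of streaming a dump page by page. It is not this library's code (which relies on xml-stream); it uses only Node's `fs` module and plain string searching. The names `createPageSplitter` and `streamPages` are made up for illustration, and the sketch assumes each article sits in a plain `<page>...</page>` element, as in MediaWiki dumps.

```javascript
var fs = require("fs");

var OPEN = "<page>";
var CLOSE = "</page>";

// Hypothetical helper (not part of this library): returns a feed(text)
// function that buffers incoming text and returns every complete
// <page>...</page> element found so far.
function createPageSplitter() {
  var buffer = "";
  return function feed(text) {
    var pages = [];
    buffer += text;
    while (true) {
      var start = buffer.indexOf(OPEN);
      if (start === -1) {
        // No page has started yet: keep only a short tail, in case
        // "<page>" itself is split across two chunks.
        buffer = buffer.slice(-(OPEN.length - 1));
        break;
      }
      var end = buffer.indexOf(CLOSE, start);
      if (end === -1) {
        // Page started but not finished: keep it and wait for more text.
        buffer = buffer.slice(start);
        break;
      }
      pages.push(buffer.slice(start, end + CLOSE.length));
      buffer = buffer.slice(end + CLOSE.length);
    }
    return pages;
  };
}

// Hypothetical driver: reads the dump chunk by chunk and hands the raw
// XML of each page to onPage, so only the current page is held in memory.
function streamPages(path, onPage) {
  var feed = createPageSplitter();
  fs.createReadStream(path, { encoding: "utf8" })
    .on("data", function (chunk) {
      feed(chunk).forEach(onPage);
    });
}
```

Something like `streamPages("afwiki-latest-pages-articles.xml", function (xml) { /* parse and insert */ })` would then see one page at a time. A real XML parser such as xml-stream is the safer choice, since it also handles entities and nested elements.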

Note: the node-expat dependency requires node <= v0.10.33.

The Afrikaans Wikipedia (only 33,556 articles) takes just a few minutes to download, and 10 minutes to load into Mongo on a MacBook:

    wget https://dumps.wikimedia.org/afwiki/latest/afwiki-latest-pages-articles.xml.bz2 # (38MB, couple minutes)
    bunzip2 ./afwiki-latest-pages-articles.xml.bz2 # (180MB, couple seconds)
    node index.js afwiki-latest-pages-articles.xml # (couple minutes)

To view your data, open the mongo shell:

    use af_wikipedia
    // shows a random page
    // count the redirects (~4,000 in Afrikaans)
    // find a specific page

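
As a rough sketch, those three lookups could be wrapped in a small mongo shell helper like the one below. Everything specific here is an assumption rather than something this README confirms: the helper name `inspectWikipedia`, the collection name `wikipedia`, the `title` field, the `type: "redirect"` marker, and the example title. Run `show collections` and look at one document first to check the real names.

```javascript
// Mongo shell helper (hypothetical, not part of this library).
// ASSUMPTIONS: each page document has a `title` field, and redirect
// pages are stored with type "redirect".
function inspectWikipedia(collection, title) {
  var total = collection.count();
  var offset = Math.floor(Math.random() * total);
  return {
    // a random page: skip a random number of documents, then take one
    randomPage: collection.find().skip(offset).limit(1).toArray()[0],
    // count the redirects
    redirects: collection.count({ type: "redirect" }),
    // find a specific page by its title
    page: collection.findOne({ title: title })
  };
}

// In the shell, after `use af_wikipedia`
// (collection name and title are assumptions):
//   inspectWikipedia(db.wikipedia, "Suid-Afrika")
```

The helper relies on the synchronous shell methods `count`, `find`, and `findOne`, so it is meant for the mongo shell, not for the asynchronous Node driver. Newer shells prefer `countDocuments` over `count`.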
The English Wikipedia works with the same process, just a little slower.