
WIP temp repository for parser scripts. About 40 MB/s parsing speed for gzipped JSON.

vegetablejuiceftw/wikidata-parsing


This Rust script can parse a 120 GB Wikidata gzip dump in 50 minutes (40 MB/s) into 8 shards of gzipped MessagePack streams that can be read back in Python in 20 seconds flat. The processing requires barely any memory (~100 MB).

The measurements were done on an AMD Ryzen 5950X CPU with a Samsung NVMe SSD.
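
For a sense of how the pieces fit together, here is a minimal, single-threaded sketch of the pipeline: stream-decompress the dump, parse one JSON entity per line, and round-robin the entities into gzipped MessagePack shards. It assumes the flate2, serde_json and rmp_serde crates and placeholder file names; the actual script is parallelised and applies the filters listed below.

    use std::fs::File;
    use std::io::{BufRead, BufReader, BufWriter};

    use flate2::read::MultiGzDecoder;
    use flate2::write::GzEncoder;
    use flate2::Compression;

    const SHARDS: usize = 8;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Stream-decompress the dump: one JSON entity per line, wrapped in "[" ... "]".
        let dump = BufReader::new(MultiGzDecoder::new(File::open("latest-all.json.gz")?));

        // One gzipped MessagePack stream per output shard.
        let mut shards: Vec<_> = (0..SHARDS)
            .map(|i| {
                let file = File::create(format!("shard-{i}.msgpack.gz")).unwrap();
                GzEncoder::new(BufWriter::new(file), Compression::fast())
            })
            .collect();

        for (n, line) in dump.lines().enumerate() {
            let line = line?;
            let line = line.trim_end_matches(',');
            if line == "[" || line == "]" || line.is_empty() {
                continue; // skip the array brackets around the dump
            }
            // Parse the entity; language / page-type / property filtering would go here.
            let entity: serde_json::Value = serde_json::from_str(line)?;
            // Round-robin the entity into one of the shards as MessagePack.
            rmp_serde::encode::write(&mut shards[n % SHARDS], &entity)?;
        }

        for shard in &mut shards {
            shard.try_finish()?; // flush the gzip trailer of every shard
        }
        Ok(())
    }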

The script currently also supports throwing out the following (see the sketch after this list):

  • unrequested languages
  • disambiguation, list and "name" pages
  • insignificant chemical compounds and astronomical objects (about 66% of all of Wikidata)
  • garbage properties (about 3000 external IDs)
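
These filters boil down to simple predicates over an entity's P31 ("instance of") values and each claim's datatype. The sketch below illustrates the idea; the QIDs and the external-id rule are illustrative and not copied from constants.rs.

    // Illustrative block list of entity classes (targets of P31 "instance of");
    // the real list lives in constants.rs and may differ.
    const SKIPPED_CLASSES: &[&str] = &[
        "Q4167410",  // Wikimedia disambiguation page
        "Q13406463", // Wikimedia list article
        "Q101352",   // family name
    ];

    /// Keep an entity only if none of its "instance of" values is on the block list.
    fn keep_entity(instance_of: &[String]) -> bool {
        !instance_of
            .iter()
            .any(|qid| SKIPPED_CLASSES.contains(&qid.as_str()))
    }

    /// Drop claims whose datatype is "external-id" (the garbage properties above).
    fn keep_claim(datatype: &str) -> bool {
        datatype != "external-id"
    }

    fn main() {
        let instance_of = vec!["Q4167410".to_string()];
        assert!(!keep_entity(&instance_of)); // disambiguation pages are dropped
        assert!(keep_claim("wikibase-item"));
        assert!(!keep_claim("external-id"));
    }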

Running

    cargo run --release

To specify the location of the Wikidata dump, edit main.rs.
To change the filtering behaviour, edit wd_read.rs or constants.rs.
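
As a rough illustration of the kind of settings involved (these values are guesses, not the repository's actual constants):

    // Illustrative only -- not the repository's actual settings.
    pub const DUMP_PATH: &str = "data/latest-all.json.gz";   // would live in main.rs
    pub const SHARD_COUNT: usize = 8;                         // number of output shards
    pub const KEEP_LANGUAGES: &[&str] = &["en", "de", "et"];  // languages to keep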

As this was a learning project, the code has not been organized much.
