This Rust script can parse a 120 GB gzipped Wikidata dump in 50 minutes (~40 MB/s) into 8 shards of gzipped MessagePack streams, which can then be read in Python in about 20 seconds flat. The processing requires barely any memory (~100 MB).
The measurements were done on an AMD Ryzen 5950X CPU with a Samsung NVMe SSD.
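For illustration, here is a minimal sketch of the kind of pipeline described above — not the actual main.rs — that streams a gzipped JSON dump line by line and writes each entity round-robin into gzipped MessagePack shard files. It assumes the flate2, serde_json and rmp-serde crates; the file names and shard count are placeholders.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

use flate2::read::GzDecoder;
use flate2::write::GzEncoder;
use flate2::Compression;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Stream-decompress the dump; nothing is held in memory beyond one line.
    let dump = BufReader::new(GzDecoder::new(File::open("latest-all.json.gz")?));

    // One gzipped MessagePack stream per shard.
    const SHARDS: usize = 8;
    let mut writers: Vec<_> = (0..SHARDS)
        .map(|i| {
            File::create(format!("shard-{i}.msgpack.gz"))
                .map(|f| GzEncoder::new(BufWriter::new(f), Compression::fast()))
        })
        .collect::<Result<_, _>>()?;

    for (n, line) in dump.lines().enumerate() {
        let line = line?;
        // The dump is one big JSON array with one entity per line; skip the
        // brackets and trim the trailing comma before parsing.
        let entity = line.trim_end_matches(',');
        if entity == "[" || entity == "]" || entity.is_empty() {
            continue;
        }
        let value: serde_json::Value = serde_json::from_str(entity)?;
        // (Filtering of unwanted entities and properties would happen here.)
        rmp_serde::encode::write(&mut writers[n % SHARDS], &value)?;
    }

    for w in writers {
        w.finish()?.flush()?;
    }
    Ok(())
}
```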
The script currently also supports throwing out (see the filtering sketch after this list):
- languages that were not requested
- disambiguation, list and "name" pages
- insignificant chemical compounds and astronomical objects (about 66% of all of Wikidata)
- garbage properties (about 3000 external IDs)
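The sketch below shows the general shape of such a filter — checking an entity's P31 ("instance of") claims against a deny set — following the layout of Wikidata's public JSON dump format. It is not the actual wd_read.rs logic, and the QIDs are examples of the classes mentioned above; verify them against Wikidata before reuse.

```rust
use std::collections::HashSet;

/// Returns true if any of the entity's P31 ("instance of") claims point at a
/// class in `deny`, navigating the dump's JSON layout
/// (claims -> P31 -> mainsnak -> datavalue -> value -> id).
fn is_denied(entity: &serde_json::Value, deny: &HashSet<&str>) -> bool {
    entity["claims"]["P31"]
        .as_array()
        .map(|statements| {
            statements.iter().any(|s| {
                s["mainsnak"]["datavalue"]["value"]["id"]
                    .as_str()
                    .map_or(false, |qid| deny.contains(qid))
            })
        })
        .unwrap_or(false)
}

fn main() {
    // Example deny set; the real script may use different classes.
    let deny: HashSet<&str> = [
        "Q4167410",  // Wikimedia disambiguation page
        "Q13406463", // Wikimedia list article
        "Q101352",   // family name
    ]
    .into_iter()
    .collect();

    // A tiny fake entity that is an instance of "family name".
    let entity = serde_json::json!({
        "id": "Q1",
        "claims": { "P31": [
            { "mainsnak": { "datavalue": { "value": { "id": "Q101352" } } } }
        ] }
    });
    assert!(is_denied(&entity, &deny));
}
```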
Run it with `cargo run --release`.
To specify the location of the Wikidata dump, edit main.rs.
To change the filtering behaviour, edit wd_read.rs or constants.rs.
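As a purely hypothetical illustration of what constants-style filtering configuration can look like (the actual names and values in constants.rs are not documented here and these are placeholders only):

```rust
/// Languages to keep in labels, descriptions and aliases (placeholder list).
pub const KEEP_LANGUAGES: &[&str] = &["en", "de"];

/// Properties to drop entirely, e.g. external identifiers (placeholder list).
pub const DROP_PROPERTIES: &[&str] = &["P213", "P214"];
```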
As this was a learning project, not much effort went into organizing the code.