wd2duckdb
is a tool transforming
Wikidata JSON dumps
into a fully indexed DuckDB database ~80% smaller than the original
dump, yet contains most of its information. The resulting database enables
high-performance queries to be executed on commodity hardware without the
need to install and configure specialized triplestore software. This project
is heavily based on wd2sql.
TBD
wd2duckdb --json <JSON_FILE> --database <DUCKDB_FILE>
Use -
as <JSON_FILE>
to read from standard input instead of from a file.
This makes it possible to build a pipeline that processes JSON data as it is
being decompressed, without having to decompress the full dump to disk:
bzcat latest-all.json.bz2 | wd2duckdb --json - --database <DUCKDB_FILE>
Without the efforts of the countless people who built Wikidata and its
contents, wd2duckdb
would be useless. It's truly impossible to praise
this amazing open data project enough.
- wd2sql is this project's main inspiration.
Copyright © 2023 Ángel Iglesias Préstamo ([email protected])
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
By contributing to this project, you agree to release your contributions under the same license.