This repository contains all code related to the WikiHist.html data release.
The dataset itself is not described here. For a description of the dataset, please refer to these resources:
- Dataset website: https://doi.org/10.5281/zenodo.3605388
- Dataset paper: https://arxiv.org/abs/2001.10256
As described on the dataset website, the HTML revision history is hosted at the Internet Archive. To facilitate access, we provide two alternative ways to get the data:
- a torrent-based solution (recommended ✳️)
- a Python script in the `downloading_scripts` directory of this repo. Note: using the script, you can download either all data or only revisions for specific Wikipedia articles.
Option 1: torrent-based download (recommended ✳️)
Pros: fast, automatic retry and resume
Cons: intended only for full downloads
This method is the recommended way to download the full dataset. If you are interested in a partial download (i.e., only some articles), please consider Option 2 below. This solution requires the command-line utility Aria2, available at https://aria2.github.io/.
Once the repository is cloned, the download requires two steps:
- Download the `aria2c` utility from its GitHub repository.
- Run the script `download.sh` in the folder `TorrentDownload`.
This script starts the download of the torrent files listed in `files_list.txt`. The parameters in `download.sh` can be adapted to your connection specifics; please refer to the Aria2 documentation (`aria2c -h` and the online manual). By default, the script uses 16 parallel connections and saves the downloaded dataset in the folder `WikiHist_html`.
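For illustration, an invocation of the kind that `download.sh` performs might look roughly as follows. This is only a hedged sketch, not the actual script: the flag values and the assumption that `files_list.txt` contains one torrent URL per line are ours, so use the provided `download.sh` for real downloads.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; the download.sh shipped in TorrentDownload/ is authoritative.
# Assumption: files_list.txt lists one torrent URL per line.
#   --input-file                 read the torrent URLs from files_list.txt
#   --dir                        save the downloaded data under WikiHist_html/
#   --max-connection-per-server / --split
#                                use 16 parallel connections (the script's default)
#   --max-tries=0                retry indefinitely on transient failures
#   --seed-time=0                stop seeding as soon as each torrent completes
aria2c --input-file=files_list.txt \
       --dir=WikiHist_html \
       --max-connection-per-server=16 \
       --split=16 \
       --max-tries=0 \
       --retry-wait=30 \
       --seed-time=0
```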
Option 2: Python download scripts
Pros: allows partial downloads, e.g., to get revisions for certain articles only
Cons: slower; the maximum number of retries is upper-bounded
This solution allows both the full download and partial downloads of the dataset, selected by article title or page ID.
The scripts require the `internetarchive` and `wget` Python packages, which you need to install first:
pip install internetarchive
pip install wget
To download the full dataset, go to the `downloading_scripts` directory and run
python download_whole_dataset.py
Caveat emptor: the dataset is 7 TB in size, so make sure you have enough disk space before starting the download (see the sketch below). Given the size, downloading the data will take a while.
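For example, a small wrapper along the following lines can be used to check the available disk space before starting the full download; the wrapper and its 7 TB threshold are our own illustration and are not part of the repository's scripts.

```bash
#!/usr/bin/env bash
# Hypothetical safety wrapper (not part of the repo): verify free disk space,
# then launch the full-dataset download script.
set -euo pipefail

required_kb=$((7 * 1024 * 1024 * 1024))            # ~7 TB expressed in KiB
available_kb=$(df -Pk . | awk 'NR==2 {print $4}')  # free KiB on the current filesystem

if [ "$available_kb" -lt "$required_kb" ]; then
    echo "Not enough disk space: ~7 TB needed, only $((available_kb / 1024 / 1024)) GiB available." >&2
    exit 1
fi

cd downloading_scripts
python download_whole_dataset.py
```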
If, rather than downloading the full dataset, you want to download revisions for specific Wikipedia articles only, proceed as follows (an example session is sketched after this list):
- Go to the `downloading_scripts` directory.
- In the file `titles_to_download.txt`, list the titles of the articles whose HTML revision history you would like to download.
- Run `python download_subset.py`.
- The script requires some metadata. If you don't have it yet, you will be asked whether the script should download it. Type `Yes`.
- When asked about the search mode, type `page_title` or `page_id`. (If you choose `page_id`, then `titles_to_download.txt` should contain page IDs rather than page titles.)
- When prompted, provide the path to `titles_to_download.txt`.
- The data will be saved in the `downloaded_data` directory.
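As an illustration, a session following the steps above could look like this; the two article titles are placeholders, and the answers to the interactive prompts (metadata download, search mode, file path) still have to be typed in as described above.

```bash
#!/usr/bin/env bash
# Example session for a partial download; the article titles are placeholders.
cd downloading_scripts

# One article title per line (or one page ID per line, if you later choose
# the page_id search mode).
cat > titles_to_download.txt <<'EOF'
Zurich
Wikipedia
EOF

# The script then asks interactively whether to download the metadata ("Yes"),
# which search mode to use ("page_title" for this file), and for the path to
# titles_to_download.txt; results end up in downloaded_data/.
python download_subset.py
```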
Most users will need only the above scripts for downloading the ready-made WikiHist.html dataset. The remainder of this README refers to the code for producing the dataset from scratch.
The scripts are divided into 7 directories:
- 1_downloading_wiki_dump
- 2_extracting_templates
- 3_create_mysql_db
- 4_docker_containers
- 5_server_scripts
- 6_dealing_with_failed
- 7_uploading_to_IA
Every directory represents a step in the process of converting wikitext to HTML, from downloading the raw wikitext dump, to extracting the templates, etc., all the way to uploading the data to the Internet Archive. In each step's directory, there is a README with details about that step.
The following libraries are required:
- Internet Archive Command-Line Tool (installation guide)
- Docker (installation guide)
To run the pipeline on a small sample (mostly for debugging), run
bash quick_run.sh
The above script processes the small input file `data/sample.xml`. Note that, even in this setting, the script needs 111 GB of free disk space, as it downloads an 11 GB MySQL database that decompresses to 100 GB.
If the processing completes successfully, a directory `data/results/sample.xml/_SUCCESS` will be created, and the resulting JSON files will be placed in the directory `data/results/sample.xml`.
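For convenience, the outcome of the quick run can be checked by testing for the success marker, e.g. with a snippet like this:

```bash
#!/usr/bin/env bash
# Check whether the sample run finished and list its JSON output.
if [ -e data/results/sample.xml/_SUCCESS ]; then
    echo "Quick run succeeded; output is in data/results/sample.xml/"
    ls data/results/sample.xml
else
    echo "Quick run has not completed successfully (no _SUCCESS marker found)." >&2
    exit 1
fi
```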
License: Attribution 3.0 Unported (CC BY 3.0)