This repository contains the code and data for the paper "An Edit-centric Approach for Wikipedia Article Quality Assessment". If you use our code or data, please consider citing our work.
- Clone this repo:

  ```bash
  git clone https://github.com/epochx/wikigen
  cd wikigen
  ```
- Create a conda environment and activate it:

  ```bash
  conda create -n <name> python=3.6
  conda activate <name>
  ```

  You can replace `<name>` with whatever name you like.
- Install the dependencies:

  ```bash
  sh ./install.sh
  ```

  This script will install all the dependencies using conda and/or pip.
- Download the data:

  ```bash
  sh ./download.sh
  ```

  By default, the data will be downloaded to `~/data/wikigen/`. If you want to change this, make sure to edit the value of `DATA_PATH` accordingly in the file `~/wikigen/wikigen/settings.py`.

  By default, the output of training a model will go to `~/results/wikigen`. If you want to change this, please modify the value of `RESULTS_PATH` in the file `~/wikigen/wikigen/settings.py`, or change the `results_path` parameter when running a model.
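  For reference, the relevant part of `settings.py` plausibly looks like the sketch below. This is an illustrative reconstruction based on the defaults just described, not the file's actual contents:

  ```python
  # Sketch of wikigen/wikigen/settings.py (illustrative only;
  # check the real file for the exact definitions).
  import os

  # Where download.sh places the data by default.
  DATA_PATH = os.path.join(os.path.expanduser("~"), "data", "wikigen")

  # Where training output goes by default; can be overridden via the
  # results_path parameter when running a model.
  RESULTS_PATH = os.path.join(os.path.expanduser("~"), "results", "wikigen")
  ```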
- To train and evaluate a classifier model, run:

  ```bash
  python train_classifier.py --config wikigen/config/classifier.yaml
  ```
- You can modify or provide parameters by changing the `classifier.yaml` file, or by using the command line. Run `python train_classifier.py --help` for additional details.
- To train and evaluate models including the auxiliary generative tasks, run:

  ```bash
  python train_seq2seq.py --config wikigen/config/seq2seq.yaml
  ```
- You can modify or provide parameters by changing the `seq2seq.yaml` file, or by using the command line. Run `python train_seq2seq.py --help` for additional details.
We are releasing the pre-processed datasets that we use in our paper, including a pre-processed version of the "Wikiclass" dataset, officially known as the "English Wikipedia Quality Assessment Dataset". However, to run our baseline you also need to download the original version.

- To obtain this dataset, please go to this link and download the file `2017_english_wikipedia_quality_dataset.tar.bz2`. There are two versions of the dataset, but in our work we use the more recent 2017 version, since the one from 2015 is maintained only for historical reasons.
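  If you prefer to unpack the archive from Python rather than the shell, something like the snippet below works. The extraction path is an assumption on our part; place the contents wherever your `DATA_PATH` setup expects them:

  ```python
  # Extract the downloaded dataset archive (the destination path is just
  # an example; adjust it to match your DATA_PATH configuration).
  import os
  import tarfile

  archive = "2017_english_wikipedia_quality_dataset.tar.bz2"
  destination = os.path.expanduser("~/data/wikigen")

  with tarfile.open(archive, "r:bz2") as tar:
      tar.extractall(destination)
  ```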
- Pre-process the Wikiclass dataset and train doc2vec using the following command:

  ```bash
  python preprocess_doc2vec_wikiclass.py
  ```

  This process should take approximately 30 minutes.
- Load the vectors obtained by doc2vec and run the classifier using the following command (a conceptual sketch of this pipeline is shown below):

  ```bash
  python train_doc2vec_wikiclass_classifier.py
  ```
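For intuition, the doc2vec baseline follows the general pattern sketched below. This is a minimal illustration using gensim and scikit-learn, not the repository's actual pipeline; the toy corpus, hyperparameters, and classifier choice are all placeholders:

```python
# Illustrative doc2vec + classifier pipeline (not the repo's actual code).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Toy corpus of (article text, quality label) pairs -- placeholders only.
corpus = [
    ("short unsourced stub with very little content", "Stub"),
    ("comprehensive well sourced article with clear structure", "FA"),
]

# 1. Train doc2vec on the article texts.
documents = [
    TaggedDocument(words=text.split(), tags=[i])
    for i, (text, _) in enumerate(corpus)
]
model = Doc2Vec(vector_size=100, min_count=1, epochs=20)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

# 2. Use the learned document vectors as classifier features.
#    (model.dv is named model.docvecs in gensim < 4.0.)
X = [model.dv[i] for i in range(len(corpus))]
y = [label for _, label in corpus]
classifier = LogisticRegression().fit(X, y)

# 3. Assess a new article by inferring its vector first.
vector = model.infer_vector("a new article to assess".split())
print(classifier.predict([vector]))
```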