Wikigen

This repository contains the code and data for the paper "An Edit-centric Approach for Wikipedia Article Quality Assessment". If you use our code or data, please consider citing our work.

Setup

  1. Clone this repo:

    git clone https://github.com/epochx/wikigen
    cd wikigen
  2. Create a conda environment and activate it:

    conda create -n <name> python=3.6
    conda activate <name>

    Replace <name> with a name of your choice.

  3. Install the dependencies:

    sh ./install.sh

    This script will install all the dependencies using conda and/or pip.

  4. Download the data:

    sh ./download.sh

    By default, the data will be downloaded to ~/data/wikigen/. If you want to change this, edit the value of DATA_PATH accordingly in the file ~/wikigen/wikigen/settings.py.
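
    As a minimal sketch, assuming DATA_PATH is defined as a plain string assignment in settings.py (check the file first, since the exact layout may differ), you could point it at another directory with a one-line edit such as:

    # assumes a simple `DATA_PATH = "..."` line in settings.py; verify before running
    sed -i 's|^DATA_PATH = .*|DATA_PATH = "/path/to/your/data"|' ~/wikigen/wikigen/settings.py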

Running

By default, the output of training a model goes to ~/results/wikigen. If you want to change this, modify the value of RESULTS_PATH in the file ~/wikigen/wikigen/settings.py, or change the results_path parameter when running a model.
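
A quick way to confirm which paths are in effect, assuming the settings module is importable as wikigen.settings from the repository root (the path above suggests it is, but this is an assumption), is:

    # hypothetical check; run from the repository root
    python -c "from wikigen.settings import DATA_PATH, RESULTS_PATH; print(DATA_PATH, RESULTS_PATH)"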

  1. To train and evaluate a classifier model, run:

    python train_classifier.py --config wikigen/config/classifier.yaml
    • You can modify or provide parameters by editing the classifier.yaml file or via the command line; run python train_classifier.py --help for details.
  2. To train and evaluate models including the auxiliary generative tasks, run:

    python train_seq2seq.py --config wikigen/config/seq2seq.yaml
    • You can modify or provide parameters by editing the seq2seq.yaml file or via the command line (see the sketch below); run python train_seq2seq.py --help for details.
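
The exact command-line flag names are defined by each script's argument parser, so confirm them with --help; assuming results_path is exposed as a --results_path flag, an override might look like:

    # --results_path is assumed from the results_path parameter mentioned above; verify with --help
    python train_classifier.py --config wikigen/config/classifier.yaml --results_path ~/my_wikigen_results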

Running the doc2vec baseline

We are releasing the pre-processed datasets used in our paper, including a pre-processed version of the "Wikiclass" dataset, officially known as the "English Wikipedia Quality Assessment Dataset". However, to run our doc2vec baseline you also need to download the original version.

To obtain this dataset, please go to this link and download the file 2017_english_wikipedia_quality_dataset.tar.bz2. There are two versions of the dataset; we use the more recent 2017 version, since the 2015 version is maintained only for historical reasons.
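
Once downloaded, the archive can be extracted with standard tools; where the baseline script expects to find it is not spelled out here, so check preprocess_doc2vec_wikiclass.py or settings.py before choosing a location. For example:

    # extracts into the current directory; adjust paths as needed
    tar -xjf 2017_english_wikipedia_quality_dataset.tar.bz2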

  1. Pre-process the Wikiclass dataset and train doc2vec using the following command:

    python preprocess_doc2vec_wikiclass.py

    This process should take approximately 30 minutes.

  2. Load the vectors obtained by doc2vec and run the classifier with the following command:

    python train_doc2vec_wikiclass_classifier.py
