This repository contains the code and data for the paper "An Edit-centric Approach for Wikipedia Article Quality Assessment". If you use our code or data, please consider citing our work.
- Clone this repo:

  ```bash
  git clone https://github.com/epochx/wikigen
  cd wikigen
  ```
- Create a conda environment and activate it:

  ```bash
  conda create -n <name> python=3.6
  conda activate <name>
  ```

  You can replace `<name>` with whatever name you like.
- Install the dependencies:

  ```bash
  sh ./install.sh
  ```

  This script will install all the dependencies using conda and/or pip.
- Download the data:

  ```bash
  sh ./download.sh
  ```

  By default, the data will be downloaded to `~/data/wikigen/`. If you want to change this, make sure to edit the value of `DATA_PATH` accordingly in the file `~/wikigen/wikigen/settings.py`.

  By default, the output of training a model will go to `~/results/wikigen`. If you want to change this, please modify the value of `RESULTS_PATH` in the file `~/wikigen/wikigen/settings.py`, or change the `results_path` parameter when running a model.
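  For reference, the relevant part of `settings.py` plausibly looks like the sketch below. This is an illustrative reconstruction based on the defaults just described, not the file's actual contents:

  ```python
  # Sketch of wikigen/wikigen/settings.py (illustrative only;
  # check the real file for the exact definitions).
  import os

  # Where download.sh places the data by default.
  DATA_PATH = os.path.join(os.path.expanduser("~"), "data", "wikigen")

  # Where training output goes by default; can be overridden via the
  # results_path parameter when running a model.
  RESULTS_PATH = os.path.join(os.path.expanduser("~"), "results", "wikigen")
  ```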
- To train and evaluate a classifier model, run:

  ```bash
  python train_classifier.py --config wikigen/config/classifier.yaml
  ```
- You can modify or provide parameters by changing the `classifier.yaml` file, or by using the command line. Run `python train_classifier.py --help` for additional details.
- To train and evaluate models including the auxiliary generative tasks, run:

  ```bash
  python train_seq2seq.py --config wikigen/config/seq2seq.yaml
  ```
- You can modify or provide parameters by changing the `seq2seq.yaml` file, or by using the command line. Run `python train_seq2seq.py --help` for additional details.
We are releasing the pre-processed datasets that we use in our paper, including a pre-processed version of the "Wikiclass" dataset, officially known as the "English Wikipedia Quality Assessment Dataset". However, to run our baseline you also need to download the original version.

- To obtain this dataset, please go to this link and download the file `2017_english_wikipedia_quality_dataset.tar.bz2`. There are two versions of the dataset, but in our work we use the more recent 2017 version, since the one from 2015 is maintained only for historical reasons.
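  If you prefer to unpack the archive from Python rather than the shell, something like the snippet below works. The extraction path is an assumption on our part; place the contents wherever your `DATA_PATH` setup expects them:

  ```python
  # Extract the downloaded dataset archive (the destination path is just
  # an example; adjust it to match your DATA_PATH configuration).
  import os
  import tarfile

  archive = "2017_english_wikipedia_quality_dataset.tar.bz2"
  destination = os.path.expanduser("~/data/wikigen")

  with tarfile.open(archive, "r:bz2") as tar:
      tar.extractall(destination)
  ```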
- Pre-process the Wikiclass dataset and train doc2vec using the following command:

  ```bash
  python preprocess_doc2vec_wikiclass.py
  ```

  This process should take approximately 30 minutes.
- Load the vectors obtained by doc2vec and run the classifier using the following command (a conceptual sketch of this pipeline is shown below):

  ```bash
  python train_doc2vec_wikiclass_classifier.py
  ```
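For intuition, the doc2vec baseline follows the general pattern sketched below. This is a minimal illustration using gensim and scikit-learn, not the repository's actual pipeline; the toy corpus, hyperparameters, and classifier choice are all placeholders:

```python
# Illustrative doc2vec + classifier pipeline (not the repo's actual code).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Toy corpus of (article text, quality label) pairs -- placeholders only.
corpus = [
    ("short unsourced stub with very little content", "Stub"),
    ("comprehensive well sourced article with clear structure", "FA"),
]

# 1. Train doc2vec on the article texts.
documents = [
    TaggedDocument(words=text.split(), tags=[i])
    for i, (text, _) in enumerate(corpus)
]
model = Doc2Vec(vector_size=100, min_count=1, epochs=20)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)

# 2. Use the learned document vectors as classifier features.
#    (model.dv is named model.docvecs in gensim < 4.0.)
X = [model.dv[i] for i in range(len(corpus))]
y = [label for _, label in corpus]
classifier = LogisticRegression().fit(X, y)

# 3. Assess a new article by inferring its vector first.
vector = model.infer_vector("a new article to assess".split())
print(classifier.predict([vector]))
```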