CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Low Resource With Contrastive Learning
This repo contains code for CoCo: Coherence-Enhanced Machine-Generated Text Detection Under Low Resource With Contrastive Learning (https://arxiv.org/abs/2212.10341, EMNLP 2023) by Xiaoming Liu*, Zhaohan Zhang*, Yichen Wang*, Hang Pu, Yu Lan, and Chao Shen. In this codebase, we provide a coherence-graph-based contrastive learning model, CoCo, to detect machine-generated texts under low-resource scenarios. CoCo's detection accuracy outperforms contemporary state-of-art detectors on multiple datasets.
(1) Install Python 3.9.12 and PyTorch 2.0.1. (slightly older or newer versions are probably also fine for both).
(2) Install other packages by conda create --name <env> --file requirements.txt
. If it does not work, pip install -r requirements.txt
is probably also fine; the pip-format requirements.txt
is in #1)
(3) Download our dataset MGTDetect_CoCo from Huggingface (https://huggingface.co/datasets/ZachW/MGTDetect_CoCo) into data/.
-
For loading datasets for other usages, we recommend to download the data and using
json.loads()
directly to avoid format errors. -
If want to try some self-made data samples, you can create a new jsonline file following the format:
{"article": <the text>, "label": <"human" or "machine">}
for each entry (1 entry per line).
A two-step preprocess for building the coherence graph for the raw dataset is needed before detection (only for CoCo but not other baseline methods).
(1) Entity extraction:
python preprocess/extract_keywords.py --raw_dir data/<dataset_name>.jsonl
.
A new jsonl file named data/<dataset_name>_kws.jsonl
will be output under the same path.
(2) Graph construction:
python preprocess/construct_graph.py --kw_file_dir data/<dataset_name>_kws.jsonl
.
The final dataset data/<dataset_name>_graph.jsonl
will be output under the same path.
For example:
python preprocess/extract_keywords.py --raw_dir data/gpt2/gpt2_500_train.jsonl
python preprocess/construct_graph.py --kw_file_dir data/gpt2/gpt2_500_train_kws.jsonl
You need to do all these steps for both training, evaluating, and testing splits if you want to finish a complete training process.
Before running the code, update the dataset path at run_detector.py#L1415
.
Example for running the detector with both training, evaluating, and testing with the suggested setting as shown:
python run_detector.py --args.dataset_name grover_1000
- Note that the suggested setting might not be optimal as the based model updating in the future. If the result is obviously low, we suggest you could try
--do_ray True
to tune the hyperparameter again.
- Specify
--model_type
and--model_name_or_path
to use other base models. - Change
--do_train
and--do_eval
to False if you have already fine-tuned a model and you want to use the path of it to load it directly. - Use
--wandb_note
to use wandb and set wandb project name. Must set up wandb at L1188.
If you find our work helpful, please cite us with the following BibTex entry:
@article{liu2022coco,
title={Coco: Coherence-enhanced machine-generated text detection under data limitation with contrastive learning},
author={Liu, Xiaoming and Zhang, Zhaohan and Wang, Yichen and Pu, Hang and Lan, Yu and Shen, Chao},
journal={arXiv preprint arXiv:2212.10341},
year={2022}
}
Link to EMNLP 2023 version paper on ACL Anthology: https://aclanthology.org/2023.emnlp-main.1005/