This is the code repository for the paper "Medical Question Summarization with Entity-driven Contrastive Learning", which has been accepted by ACM Transactions on Asian and Low-Resource Language Information Processing.
Python >= 3.6
pytorch == 1.10.1
transformers == 4.4.1
py-rouge == 1.1
stanza == 1.5.0
The original datasets can be downloaded from the following URLs.
Dataset | URLs |
---|---|
MeQSum | https://github.com/abachaa/MeQSum |
CHQ-Summ | https://github.com/shwetanlp/Yahoo-CHQ-Summ |
iCliniq | https://github.com/UCSD-AI4H/Medical-Dialogue-System |
HealthCareMagic | https://github.com/UCSD-AI4H/Medical-Dialogue-System |
After downloading the datasets, place each one in its designated directory within the project.
The iCliniq and HealthCareMagic datasets suffer from a data leakage problem. For example, the duplicate rate of data samples in the iCliniq dataset reaches 33%, which causes the training and test data to overlap and makes the evaluation results unreliable. To address this problem, we checked the data samples carefully and removed the repeated ones. The revised datasets can be downloaded from the following URLs.
Dataset | URLs |
---|---|
iCliniq-revised | https://drive.google.com/drive/u/1/folders/1FQTsgRYDJajcNlKJXG-FFPKFw4Cf4FzU |
HealthCareMagic-revised | https://drive.google.com/drive/u/1/folders/1Hq4AiYr96jfOsB8OJMlyDRRUhmr_BYvY |
python data_deduplicated.py
Note: if you downloaded the revised datasets in the previous step, you can skip this one.
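The idea behind the deduplication step can be sketched as follows. This is a minimal illustration, not the contents of `data_deduplicated.py`: the field names `question` and `summary`, the normalization rule, and the helper names are all assumptions made for the example.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(samples, seen=None):
    """Keep the first occurrence of each (question, summary) pair.

    `seen` carries keys across splits, so a test sample that already
    appeared in the training split is dropped as well.
    """
    seen = set() if seen is None else seen
    unique = []
    for sample in samples:
        # Hypothetical field names; adapt them to the real dataset schema.
        key = (normalize(sample["question"]), normalize(sample["summary"]))
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


train = [
    {"question": "What causes migraines?", "summary": "Causes of migraine"},
    {"question": "what causes  migraines?", "summary": "Causes of migraine"},
]
test = [
    {"question": "What causes migraines?", "summary": "Causes of migraine"},
    {"question": "Is ibuprofen safe daily?", "summary": "Daily ibuprofen safety"},
]

seen_keys = set()
train_clean = deduplicate(train, seen_keys)
test_clean = deduplicate(test, seen_keys)
print(len(train_clean), len(test_clean))
```

Sharing one `seen_keys` set between the splits is what removes the train/test overlap; deduplicating each split on its own would leave leaked samples in the test set.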
We provide preprocessing code for each dataset. You can run it to convert all datasets into a uniform format. The relevant code is in the "data_preprocessing" directory.
You can generate hard negative examples for each dataset with the following commands:
python med_ent_perturbation.py --dataset dataset/MeQSum/train.json --output_dir dataset/MeQSum --sample_size 128
python med_ent_perturbation.py --dataset dataset/CHQ-Summ/train.json --output_dir dataset/CHQ-Summ --sample_size 128
python med_ent_perturbation.py --dataset dataset/iCliniq/train.json --output_dir dataset/iCliniq --sample_size 256
python med_ent_perturbation.py --dataset dataset/HealthCareMagic/train.json --output_dir dataset/HealthCareMagic --sample_size 512
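For readers unfamiliar with entity perturbation, the sketch below shows one simple way hard negatives could be produced by swapping a medical entity in a summary for a different one. It is an illustrative assumption, not the logic of `med_ent_perturbation.py`: the function `perturb_entities`, its parameters, and the tiny entity pool are hypothetical, the entity mentions are assumed to be given (for example by an NER tool), and `sample_size` here is not claimed to have the same meaning as the script's `--sample_size` flag.

```python
import random


def perturb_entities(summary, entities, entity_pool, sample_size, seed=0):
    """Build negatives by swapping one entity mention for a different one.

    summary      -- reference summary text
    entities     -- entity mentions assumed to occur in `summary`
    entity_pool  -- candidate replacement entities (hypothetical pool)
    sample_size  -- maximum number of negatives to return
    """
    rng = random.Random(seed)
    candidates = []
    for entity in entities:
        if entity not in summary:
            continue  # skip mentions that do not appear verbatim
        for replacement in entity_pool:
            if replacement != entity:
                # Replace only the first occurrence of the mention.
                candidates.append(summary.replace(entity, replacement, 1))
    # Drop duplicate negatives while keeping their original order.
    candidates = list(dict.fromkeys(candidates))
    rng.shuffle(candidates)
    return candidates[:sample_size]


negatives = perturb_entities(
    summary="What are the side effects of metformin?",
    entities=["metformin"],
    entity_pool=["insulin", "aspirin", "metformin"],
    sample_size=2,
)
for text in negatives:
    print(text)
```

Each negative stays fluent and close to the reference but changes the question focus, which is what makes it "hard" for a summarization model to tell apart from the correct summary.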
You can train the model with the following commands:
python run.py --do_train_contrast --output_dir models/MeQSum --learning_rate 1e-5 --batch_size 16 --dataset MeQSum --epoch 20 --contrast_number 128
python run.py --do_train_contrast --output_dir models/chqsumm --learning_rate 1e-5 --batch_size 16 --dataset CHQSumm --epoch 20 --contrast_number 128
python run.py --do_train_contrast --output_dir models/iCliniq --learning_rate 1e-5 --batch_size 16 --dataset iCliniq --epoch 10 --contrast_number 128
python run.py --do_train_contrast --output_dir models/HealthCareMagic --learning_rate 1e-5 --batch_size 16 --dataset HealthCareMagic --epoch 10 --contrast_number 128
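As background on contrastive training, the following dependency-free sketch computes a generic InfoNCE-style loss for one anchor, one positive, and a set of negatives. It is not taken from `run.py` and may differ from the loss used in the paper; the function names, the cosine similarity, the temperature value, and the toy vectors are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: -log softmax score of the positive pair."""
    pos = cosine(anchor, positive) / temperature
    logits = [pos] + [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - pos


anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [-1.0, 0.0]   # points away from the anchor
hard_negative = [0.8, 0.3]    # nearly as close as the positive

loss_easy = info_nce(anchor, positive, [easy_negative])
loss_hard = info_nce(anchor, positive, [hard_negative])
print(loss_easy < loss_hard)
```

The hard negative yields a larger loss than the easy one, so it gives the model a stronger training signal; this is the general motivation for generating hard negatives before contrastive training.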
You can test the trained model with a command like the following:
python run.py --mod test --model_path models/test/model_name.pt --dataset MeQSum
If this work is useful in your research, please cite our paper.
@article{wei2023medical,
title={Medical Question Summarization with Entity-driven Contrastive Learning},
author={Lu, Wenpeng and Wei, Sibo and Peng, Xueping and Wang, Yi-Fei and Naseem, Usman and Wang, Shoujin},
journal={ACM Transactions on Asian and Low-Resource Language Information Processing},
volume={23},
number={4},
pages={56:1-56:19},
year={2024}
}
We have uploaded the model checkpoints to HuggingFace: https://huggingface.co/youren1999/MQS-ECL