*Teaser image generated by DALL·E 3.*
This repository contains the code for the paper "MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance" ([arXiv:2401.02906](https://arxiv.org/abs/2401.02906)).
```bash
conda create -n mllm_protector python=3.10 -y
conda activate mllm_protector
pip install -e .
```
- Obtain the base weights for llama-3B from here.
- Obtain the LoRA checkpoint for the harm detector based on open-llama-3b from here.
- Obtain the LoRA checkpoint for the harm detector based on llama2-7b from here.
- Obtain the LoRA checkpoint for the detoxifier from here.
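If the checkpoints above are hosted on the Hugging Face Hub, they can also be fetched programmatically. The snippet below is a minimal sketch using `huggingface_hub`; the repo IDs are hypothetical placeholders and should be replaced with the actual repositories linked above.

```python
# Illustrative download of the released checkpoints via huggingface_hub.
# NOTE: the repo IDs below are placeholders, not the actual repo names.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="<org>/open_llama_3b_v2", local_dir="checkpoints/open_llama_3b_v2")
snapshot_download(repo_id="<org>/harm_detector_open_llama_3b_lora", local_dir="checkpoints/harm_detector_lora")
snapshot_download(repo_id="<org>/detoxifier_lora", local_dir="checkpoints/detoxifier_lora")
```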
You may use the harm detector to check the harmfulness of responses generated by the MLLM; it also serves as a proxy for GPT-4 API calls. To use it, first merge the LoRA adapter into the base model:
```bash
python scripts/merge_peft_adapter.py --base_model_name path-to-llama_3b_v2 --adapter_model_name path-to-lora --output_name path-to-merged-model
```
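For reference, the merge step conceptually folds the LoRA weights into the base model using PEFT. The sketch below is illustrative (the paths are placeholders, matching the command above) and only mirrors what a script like `scripts/merge_peft_adapter.py` typically does; see the script itself for the authoritative implementation.

```python
# Minimal sketch of merging a LoRA adapter into its base model with PEFT.
# Paths are placeholders; use the same values as in the command above.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_name = "path-to-llama_3b_v2"
adapter_model_name = "path-to-lora"
output_name = "path-to-merged-model"

base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_model_name)
model = model.merge_and_unload()  # fold the LoRA deltas into the base weights

AutoTokenizer.from_pretrained(base_model_name).save_pretrained(output_name)
model.save_pretrained(output_name)
```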
You may obtain the augmented dataset from here.
```bash
mkdir eval_polite
```
Prepare the benchmark data from MM-SafetyBench. The expected data structure is as follows:
```
dataset/coco/
├── gpt4_generated_questions/
├── imgs/
├── processed_questions/
└── coco_task_annotation.json
```
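As an optional sanity check (not part of the official pipeline), you can verify that the expected directories and annotation file are in place:

```python
# Optional check that the MM-SafetyBench data is laid out as expected.
import os

root = "dataset/coco"
expected = [
    "gpt4_generated_questions",
    "imgs",
    "processed_questions",
    "coco_task_annotation.json",
]
for name in expected:
    path = os.path.join(root, name)
    status = "OK" if os.path.exists(path) else "MISSING"
    print(f"[{status}] {path}")
```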
```bash
bash scripts/train_harm_detector.sh
bash scripts/train_detoxifier.sh
```
```bash
bash llava/eval/eval_multi_safeguard.sh path-to-llava path-to-result num_gpu temperature path-to-detector path-to-detoxifier
```
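The positional arguments are, in order: the LLaVA checkpoint, the output path, the number of GPUs, the sampling temperature, the harm detector, and the detoxifier. For example, with hypothetical paths and settings (8 GPUs, temperature 0.2), an invocation could look like `bash llava/eval/eval_multi_safeguard.sh checkpoints/llava-v1.5-7b results/safeguard 8 0.2 checkpoints/harm_detector checkpoints/detoxifier`.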
We adopt the newly proposed MLLM jailbreak benchmark for evaluation; please follow their instructions to set up the evaluation benchmark. Thanks for the great work!
The project is built on top of the amazing multimodal large language model LLaVA. Thanks for this great work!
If you find our work useful for your research or applications, please cite using this BibTeX:
```bibtex
@misc{pi2024mllmprotector,
  title={MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance},
  author={Renjie Pi and Tianyang Han and Yueqi Xie and Rui Pan and Qing Lian and Hanze Dong and Jipeng Zhang and Tong Zhang},
  year={2024},
  eprint={2401.02906},
  archivePrefix={arXiv},
  primaryClass={cs.CR}
}
```