🔥 Must-read papers on harmful fine-tuning attacks and defenses for LLMs.
💫 Continuously updated on a weekly basis. (Last update: 2024/12/20)
🔥 Good news: 7 papers related to harmful fine-tuning have been accepted by NeurIPS2024.
💫 We have updated our survey to include a discussion of the 17 new ICLR2025 submissions.
🔥 We have added slides introducing harmful fine-tuning attacks and defenses. Check out the slides here.
- Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
- [2023/10/4] Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models arXiv [paper] [code]
- [2023/10/5] Fine-tuning aligned language models compromises safety, even when users do not intend to! ICLR2024 [paper] [code]
- [2023/10/5] On the Vulnerability of Safety Alignment in Open-Access LLMs ACL2024 (Findings) [paper]
- [2023/10/31] LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B SeT LLM Workshop@ICLR2024 [paper]
- [2023/11/9] Removing RLHF Protections in GPT-4 via Fine-Tuning NAACL2024 [paper]
- [2024/4/1] What's in your "safe" data?: Identifying benign data that breaks safety COLM2024 [paper] [code]
- [2024/6/28] Covert malicious finetuning: Challenges in safeguarding LLM adaptation ICML2024 [paper]
- [2024/7/29] Can Editing LLMs Inject Harm? NeurIPS2024 [paper] [code]
- [2024/10/21] The effect of fine-tuning on language model toxicity NeurIPS2024 Safe GenAI workshop [paper]
- [2024/10/23] Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks arXiv [paper]
- [2024/2/2] Vaccine: Perturbation-aware alignment for large language model against harmful fine-tuning NeurIPS2024 [paper] [code]
- [2024/5/23] Representation noising effectively prevents harmful fine-tuning on LLMs NeurIPS2024 [paper] [code]
- [2024/5/24] Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation ICLR2025 Submission [paper] [code] [Openreview]
- [2024/8/1] Tamper-Resistant Safeguards for Open-Weight LLMs ICLR2025 Submission [Openreview] [paper] [code]
- [2024/9/3] Booster: Tackling harmful fine-tuning for large language models via attenuating harmful perturbation ICLR2025 Submission [paper] [code] [Openreview]
- [2024/9/26] Leveraging Catastrophic Forgetting to Develop Safe Diffusion Models against Malicious Finetuning NeurIPS2024 (for diffusion models) [paper]
- [2024/10/05] Identifying and Tuning Safety Neurons in Large Language Models ICLR2025 Submission [Openreview]
- [2024/10/13] Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation arXiv [paper] [code]
- [2023/8/25] Fine-tuning can cripple your foundation model; preserving features may be the solution TMLR [paper] [code]
- [2023/9/14] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions ICLR2024 [paper] [code]
- [2024/2/3] Safety fine-tuning at (almost) no cost: A baseline for vision large language models ICML2024 [paper] [code]
- [2024/2/7] Assessing the brittleness of safety alignment via pruning and low-rank modifications ME-FoMo@ICLR2024 [paper] [code]
- [2024/2/22] Mitigating fine-tuning jailbreak attack with backdoor enhanced alignment NeurIPS2024 [paper] [code]
- [2024/2/28] Keeping LLMs aligned after fine-tuning: The crucial role of prompt templates NeurIPS2024 [paper] [code]
- [2024/5/28] Lazy safety alignment for large language models against harmful fine-tuning NeurIPS2024 [paper] [code]
- [2024/6/10] Safety alignment should be made more than just a few tokens deep ICLR2025 Submission [paper] [code] [Openreview]
- [2024/6/12] Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models ICLR2025 Submission [paper] [Openreview]
- [2024/8/27] Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models ICLR2025 Submission [Openreview] [paper]
- [2024/8/30] Safety Layers in Aligned Large Language Models: The Key to LLM Security ICLR2025 Submission [Openreview] [paper]
- [2024/10/05] SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection ICLR2025 Submission [Openreview]
- [2024/10/05] Safety Alignment Shouldn't Be Complicated ICLR2025 Submission [Openreview]
- [2024/10/05] SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation ICLR2025 Submission [Openreview]
- [2024/10/05] Towards Secure Tuning: Mitigating Security Risks Arising from Benign Instruction Fine-Tuning ICLR2025 Submission [paper] [Openreview]
- [2024/10/13] Safety-Aware Fine-Tuning of Large Language Models NeurIPS2024 Workshop on Safe Generative AI [paper]
- [2024/3/8] Defending Against Unforeseen Failure Modes with Latent Adversarial Training arXiv [paper] [code]
- [2024/5/15] A safety realignment framework via subspace-oriented model fusion for large language models arXiv [paper] [code]
- [2024/5/23] MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability NeurIPS2024 [paper] [code]
- [2024/5/27] Safe LoRA: The silver lining of reducing safety risks when fine-tuning large language models NeurIPS2024 [paper]
- [2024/8/18] Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning arXiv [paper]
- [2024/10/05] Locking Down the Finetuned LLMs Safety ICLR2025 Submission [Openreview]
- [2024/10/05] Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models ICLR2025 Submission [Openreview]
- [2024/12/15] Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models arXiv [paper]
- [2024/5/25] No two devils alike: Unveiling distinct mechanisms of fine-tuning attacks arXiv [paper]
- [2024/5/27] Navigating the safety landscape: Measuring risks in finetuning large language models NeurIPS2024 [paper]
- [2024/10/05] Your Task May Vary: A Systematic Understanding of Alignment and Safety Degradation when Fine-tuning LLMs ICLR2025 Submission [Openreview]
- [2024/10/05] On Evaluating the Durability of Safeguards for Open-Weight LLMs ICLR2025 Submission [Openreview] [Code]
- [2024/11/13] The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense arXiv [paper]
- [2024/6/15] Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models ICLR2025 Submission [paper] [Openreview]
- [2024/11/28] PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning arXiv [paper]
If you find this repository useful, please cite our paper:
@article{huang2024harmful,
  title={Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey},
  author={Huang, Tiansheng and Hu, Sihao and Ilhan, Fatih and Tekin, Selim Furkan and Liu, Ling},
  journal={arXiv preprint arXiv:2409.18169},
  year={2024}
}
If you find a relevant paper that is not yet included, please contact Tiansheng Huang ([email protected]).