LLM-attack
Universal and Transferable Adversarial Attacks on Aligned Language Models
[ICLR 2024] Official implementation of the paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models".
A curated list of awesome tools, documents, and projects about LLM security.
An unofficial implementation of the AutoDAN attack on LLMs (arXiv:2310.15140).
A curated list of safety-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights i…
Persuasive Jailbreaker: we can persuade LLMs to jailbreak them!
Official repo for GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
Fine-tuning GPT-3.5 Turbo on only 10 adversarially designed examples, at a cost of less than $0.20 via OpenAI's APIs, is enough to jailbreak its safety guardrails.
Restore safety in fine-tuned language models through task arithmetic (see the task-arithmetic sketch after this list).
Code & data for the paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations" (see the contrastive-decoding sketch after this list).
[ICLR 2024] RAIN: Your Language Models Can Align Themselves without Finetuning
Official repository for the ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
[ACL 2024] Official repo of the paper "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs"
Official implementation of the paper "DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers"
A fast, lightweight implementation of the GCG (Greedy Coordinate Gradient) algorithm in PyTorch
JAILJUDGE: A comprehensive evaluation benchmark that includes a wide range of risk scenarios with complex malicious prompts (e.g., synthetic, adversarial, in-the-wild, and multi-language scenarios…)
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs (NeurIPS 2024). Empirical tricks for LLM jailbreaking.
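The task-arithmetic idea behind the safety-restoration entry above is compact enough to sketch. The snippet below is a minimal illustration rather than that repository's implementation: it assumes three same-architecture checkpoints (a safety-aligned base, an unaligned counterpart, and a downstream fine-tune; the model ids are hypothetical placeholders) and adds the aligned-minus-unaligned weight difference, scaled by a coefficient, back into the fine-tuned model.

```python
# Minimal sketch of safety restoration via task arithmetic (illustrative only).
# Assumes three checkpoints with identical architectures; the model ids below
# are hypothetical placeholders, not names from the referenced repository.
import torch
from transformers import AutoModelForCausalLM

ALIGNED = "org/aligned-base"       # safety-aligned base model (placeholder id)
UNALIGNED = "org/unaligned-base"   # same base without safety alignment (placeholder id)
FINETUNED = "org/task-finetuned"   # downstream fine-tune whose safety degraded (placeholder id)

aligned = AutoModelForCausalLM.from_pretrained(ALIGNED, torch_dtype=torch.float32)
unaligned = AutoModelForCausalLM.from_pretrained(UNALIGNED, torch_dtype=torch.float32)
finetuned = AutoModelForCausalLM.from_pretrained(FINETUNED, torch_dtype=torch.float32)

alpha = 0.5  # scaling coefficient; tune on a held-out safety/utility trade-off

with torch.no_grad():
    for p_ft, p_al, p_un in zip(
        finetuned.parameters(), aligned.parameters(), unaligned.parameters()
    ):
        # Safety vector = aligned minus unaligned weights; adding it back
        # nudges the fine-tuned model toward the aligned model's behavior.
        p_ft.add_(alpha * (p_al - p_un))

finetuned.save_pretrained("task-finetuned-safety-restored")
```

The coefficient `alpha` trades off how much safety behavior is reintroduced against how much downstream task performance is preserved.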
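The induced-hallucination entry above follows a contrastive-decoding style idea; the sketch below is a simplified illustration under that assumption, not the paper's code. Logits from a copy of the model deliberately pushed to hallucinate (`induced_model`, a placeholder name here) are subtracted from the original model's logits at each decoding step.

```python
# Simplified contrastive-decoding sketch for hallucination mitigation (illustrative).
# `base_model` is the original LLM; `induced_model` is a copy fine-tuned or prompted
# to hallucinate. Both names are placeholders. Penalizing the induced model's logits
# down-weights tokens that the hallucination-prone model prefers.
import torch

@torch.no_grad()
def contrastive_step(base_model, induced_model, input_ids, alpha=1.0):
    """Return the next-token id chosen by contrasting the two models' logits."""
    base_logits = base_model(input_ids).logits[:, -1, :]
    induced_logits = induced_model(input_ids).logits[:, -1, :]
    # Amplify the base model's distribution and subtract the induced one.
    contrastive_logits = (1 + alpha) * base_logits - alpha * induced_logits
    return contrastive_logits.argmax(dim=-1, keepdim=True)
```

Greedy selection is used here only to keep the sketch short; any sampling strategy can be applied to the contrasted logits instead.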