LLM-attack
Universal and Transferable Adversarial Attacks on Aligned Language Models
[ICLR 2024] Official implementation of the paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models".
A curated list of awesome tools, documents, and projects about LLM security.
An unofficial implementation of the AutoDAN attack on LLMs (arXiv:2310.15140).
A curated list of safety-related papers, articles, and resources focused on Large Language Models (LLMs). This repository aims to provide researchers, practitioners, and enthusiasts with insights i…
Persuasive Jailbreaker: we can persuade LLMs to jailbreak them!
Official repo for GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
Fine-tuning GPT-3.5 Turbo on only 10 adversarially designed examples, at a cost of less than $0.20 via OpenAI's APIs, is enough to jailbreak its safety guardrails.
Restore safety in fine-tuned language models through task arithmetic (see the task-arithmetic sketch after this list).
Code & data for the paper "Alleviating Hallucinations of Large Language Models through Induced Hallucinations" (see the contrastive-decoding sketch after this list).
[ICLR 2024] RAIN: Your Language Models Can Align Themselves without Finetuning
Official repository for the ICML 2024 paper "On Prompt-Driven Safeguarding for Large Language Models"
[ACL 2024] Official repo of the paper "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs"
Official implementation of the paper "DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers"
A fast, lightweight implementation of the GCG (Greedy Coordinate Gradient) algorithm in PyTorch
JAILJUDGE: A comprehensive evaluation benchmark that includes a wide range of risk scenarios with complex malicious prompts (e.g., synthetic, adversarial, in-the-wild, and multi-language scenarios…)
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs (NeurIPS 2024). Empirical tricks for LLM jailbreaking.
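The task-arithmetic idea behind the safety-restoration entry above is compact enough to sketch. The snippet below is a minimal illustration rather than that repository's implementation: it assumes three same-architecture checkpoints (a safety-aligned base, an unaligned counterpart, and a downstream fine-tune; the model ids are hypothetical placeholders) and adds the aligned-minus-unaligned weight difference, scaled by a coefficient, back into the fine-tuned model.

```python
# Minimal sketch of safety restoration via task arithmetic (illustrative only).
# Assumes three checkpoints with identical architectures; the model ids below
# are hypothetical placeholders, not names from the referenced repository.
import torch
from transformers import AutoModelForCausalLM

ALIGNED = "org/aligned-base"       # safety-aligned base model (placeholder id)
UNALIGNED = "org/unaligned-base"   # same base without safety alignment (placeholder id)
FINETUNED = "org/task-finetuned"   # downstream fine-tune whose safety degraded (placeholder id)

aligned = AutoModelForCausalLM.from_pretrained(ALIGNED, torch_dtype=torch.float32)
unaligned = AutoModelForCausalLM.from_pretrained(UNALIGNED, torch_dtype=torch.float32)
finetuned = AutoModelForCausalLM.from_pretrained(FINETUNED, torch_dtype=torch.float32)

alpha = 0.5  # scaling coefficient; tune on a held-out safety/utility trade-off

with torch.no_grad():
    for p_ft, p_al, p_un in zip(
        finetuned.parameters(), aligned.parameters(), unaligned.parameters()
    ):
        # Safety vector = aligned minus unaligned weights; adding it back
        # nudges the fine-tuned model toward the aligned model's behavior.
        p_ft.add_(alpha * (p_al - p_un))

finetuned.save_pretrained("task-finetuned-safety-restored")
```

The coefficient `alpha` trades off how much safety behavior is reintroduced against how much downstream task performance is preserved.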
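The induced-hallucination entry above follows a contrastive-decoding style idea; the sketch below is a simplified illustration under that assumption, not the paper's code. Logits from a copy of the model deliberately pushed to hallucinate (`induced_model`, a placeholder name here) are subtracted from the original model's logits at each decoding step.

```python
# Simplified contrastive-decoding sketch for hallucination mitigation (illustrative).
# `base_model` is the original LLM; `induced_model` is a copy fine-tuned or prompted
# to hallucinate. Both names are placeholders. Penalizing the induced model's logits
# down-weights tokens that the hallucination-prone model prefers.
import torch

@torch.no_grad()
def contrastive_step(base_model, induced_model, input_ids, alpha=1.0):
    """Return the next-token id chosen by contrasting the two models' logits."""
    base_logits = base_model(input_ids).logits[:, -1, :]
    induced_logits = induced_model(input_ids).logits[:, -1, :]
    # Amplify the base model's distribution and subtract the induced one.
    contrastive_logits = (1 + alpha) * base_logits - alpha * induced_logits
    return contrastive_logits.argmax(dim=-1, keepdim=True)
```

Greedy selection is used here only to keep the sketch short; any sampling strategy can be applied to the contrasted logits instead.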