- The Chinese University of Hong Kong
- Hong Kong SAR
- https://gregxmhu.github.io/
Starred repositories
A fast + lightweight implementation of the GCG algorithm in PyTorch
Repo for NeurIPS 2024 paper "Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes"
Memory Mosaics are networks of associative memories working in concert to achieve a prediction task.
Guide: Finetune GPT2-XL (1.5 billion parameters) and GPT-NEO (2.7B) on a single GPU with Hugging Face Transformers using DeepSpeed
MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains a medical LLM, implementing continued pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, ORPO, and GRPO.
An index of algorithms for reinforcement learning from human feedback (RLHF)
Related works and background techniques for OpenAI o1
[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
A curated list of reinforcement learning with human feedback resources (continually updated)
List of papers on hallucination detection in LLMs.
Set of tools to assess and improve LLM security.
A reading list for large models safety, security, and privacy (including Awesome LLM Security, Safety, etc.).
Code for our NeurIPS2023 accepted paper: RADAR: Robust AI-Text Detection via Adversarial Learning. We tested RADAR on 8 LLMs including Vicuna and LLaMA. The results show that RADAR can attain good …
This repository contains code to quantitatively evaluate instruction-tuned models such as Alpaca and Flan-T5 on held-out tasks.
TAP: An automated jailbreaking method for black-box LLMs
Code for visualizing the loss landscape of neural nets
🎨 ML Visuals contains figures and templates which you can reuse and customize to improve your scientific writing.
[ICLR 2024] The official implementation of our ICLR2024 paper "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models".
[ICML 2021] Break-It-Fix-It: Unsupervised Learning for Program Repair
A collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca)
[CCS'24] A dataset of 15,140 ChatGPT prompts from Reddit, Discord, websites, and open-source datasets (including 1,405 jailbreak prompts).
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
A practical and feature-rich paraphrasing framework to augment human intents in text form to build robust NLU models for conversational engines. Created by Prithiviraj Damodaran. Open to pull reque…
Curation of prompts that are known to be adversarial to large language models
Prompt attack and defense, prompt injection, reverse-engineering notes and examples