A curated reading list for large language model (LLM) alignment. Take a look at our new survey "Large Language Model Alignment: A Survey" on arXiv for more details!
Feel free to open an issue/PR or e-mail [email protected] and [email protected] if you find any missing areas, papers, or datasets. We will keep updating this list and survey.
If you find our survey useful, please cite our paper:
@article{shen2023alignment,
  title={Large Language Model Alignment: A Survey},
  author={Shen, Tianhao and Jin, Renren and Huang, Yufei and Liu, Chuang and Dong, Weilong and Guo, Zishan and Wu, Xinwei and Liu, Yan and Xiong, Deyi},
  journal={arXiv preprint arXiv:2309.15025},
  year={2023}
}
- Aligning Large Language Models with Human: A Survey. Yufei Wang et al. arXiv 2023. [Paper]
- Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment. Yang Liu et al. arXiv 2023. [Paper]
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation. Patrick Fernandes et al. arXiv 2023. [Paper]
- Augmented Language Models: a Survey. Grégoire Mialon et al. arXiv 2023. [Paper]
- An Overview of Catastrophic AI Risks. Dan Hendrycks et al. arXiv 2023. [Paper]
- A Survey of Large Language Models. Wayne Xin Zhao et al. arXiv 2023. [Paper]
- A Survey on Universal Adversarial Attack. Chaoning Zhang et al. IJCAI 2021. [Paper]
- Survey of Hallucination in Natural Language Generation. Ziwei Ji et al. ACM Computing Surveys 2022. [Paper]
- Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. Liangming Pan et al. arXiv 2023. [Paper]
- Automatic Detection of Machine Generated Text: A Critical Survey. Ganesh Jawahar et al. COLING 2020. [Paper]
- Synchromesh: Reliable Code Generation from Pre-trained Language Models. Gabriel Poesia et al. ICLR 2022. [Paper]
- LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. Chan Hee Song et al. ICCV 2023. [Paper]
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. Wenlong Huang et al. ICML 2022. [Paper]
- Tool Learning with Foundation Models. Yujia Qin et al. arXiv 2023. [Paper]
- Ethical and social risks of harm from Language Models. Laura Weidinger et al. arXiv 2021. [Paper]
- Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview. Deven Shah et al. arXiv 2019. [Paper]
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Samuel Gehman et al. arXiv 2020. [Paper]
- Extracting Training Data from Large Language Models. Nicholas Carlini et al. arXiv 2020. [Paper]
- StereoSet: Measuring stereotypical bias in pretrained language models. Moin Nadeem et al. arXiv 2020. [Paper]
- CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. Nikita Nangia et al. EMNLP 2020. [Paper]
- HONEST: Measuring Hurtful Sentence Completion in Language Models. Debora Nozza et al. NAACL 2021. [Paper]
- Language Models are Few-Shot Learners. Tom Brown et al. NeurIPS 2020. [Paper]
- Persistent Anti-Muslim Bias in Large Language Models. Abubakar Abid et al. AIES 2021. [Paper]
- Gender and Representation Bias in GPT-3 Generated Stories. Li Lucy et al. WNU 2021. [Paper]
- Measuring and Improving Consistency in Pretrained Language Models. Yanai Elazar et al. TACL 2021. [Paper]
- GPT-3 Creative Fiction. Gwern. 2020. [Blog]
- GPT-3: What’s It Good for? Robert Dale. Natural Language Engineering 2020. [Paper]
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Jack W. Rae et al. arXiv 2021. [Paper]
- TruthfulQA: Measuring How Models Mimic Human Falsehoods. Stephanie Lin et al. ACL 2022. [Paper]
- Towards Tracing Knowledge in Language Models Back to the Training Data. Ekin Akyurek et al. EMNLP 2022 Findings. [Paper]
- Sparks of Artificial General Intelligence: Early experiments with GPT-4. Sébastien Bubeck et al. arXiv 2023. [Paper]
- Navigating the Grey Area: Expressions of Overconfidence and Uncertainty in Language Models. Kaitlyn Zhou et al. arXiv 2023. [Paper]
- Patient and Consumer Safety Risks When Using Conversational Assistants for Medical Information: An Observational Study of Siri, Alexa, and Google Assistant. Timothy W. Bickmore et al. JMIR 2018. [Paper]
- Will ChatGPT Replace Lawyers? Kate Rattray. 2023. [Blog]
- Constitutional AI: Harmlessness from AI Feedback. Yuntao Bai et al. arXiv 2022. [Paper]
- Truth, Lies, and Automation: How Language Models Could Change Disinformation. Ben Buchanan et al. Center for Security and Emerging Technology 2021. [Paper]
- Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. Alex Tamkin et al. arXiv 2021. [Paper]
- Deal or No Deal? End-to-End Learning for Negotiation Dialogues. Mike Lewis et al. arXiv 2017. [Paper]
- Evaluating Large Language Models Trained on Code. Mark Chen et al. arXiv 2021. [Paper]
- Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools. Jonas B. Sandbrink. arXiv 2023. [Paper]
- Sustainable AI: AI for sustainability and the sustainability of AI. Aimee van Wynsberghe. AI and Ethics 2021. [Paper]
- Unraveling the Hidden Environmental Impacts of AI Solutions for Environment. Anne-Laure Ligozat et al. arXiv 2021. [Paper]
- GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models. Tyna Eloundou et al. arXiv 2023. [Paper]
- Formalizing Convergent Instrumental Goals. Tsvi Benson-Tilsen et al. AAAI AIES Workshop 2016. [Paper]
- Model evaluation for extreme risks. Toby Shevlane et al. arXiv 2023. [Paper]
- Aligning AI Optimization to Community Well-Being. Jonathan Stray. International Journal of Community Well-Being 2020. [Paper]
- What are you optimizing for? Aligning Recommender Systems with Human Values. Jonathan Stray et al. ICML 2020. [Paper]
- Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Meta Fundamental AI Research Diplomacy Team (FAIR) et al. Science 2022. [Paper]
- Characterizing Manipulation from AI Systems. Micah Carroll et al. arXiv 2023. [Paper]
- Deceptive Alignment Monitoring. Andres Carranza et al. ICML AdvML Workshop 2023. [Paper]
- The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Nick Bostrom. Minds and Machines 2012. [Paper]
- Is Power-Seeking AI an Existential Risk? Joseph Carlsmith. arXiv 2023. [Paper]
- Optimal Policies Tend To Seek Power. Alexander Matt Turner et al. NeurIPS 2021. [Paper]
- Parametrically Retargetable Decision-Makers Tend To Seek Power. Alexander Matt Turner et al. NeurIPS 2022. [Paper]
- Power-seeking can be probable and predictive for trained agents. Victoria Krakovna et al. arXiv 2023. [Paper]
- Discovering Language Model Behaviors with Model-Written Evaluations. Ethan Perez et al. arXiv 2022. [Paper]
- Some Moral and Technical Consequences of Automation: As Machines Learn They May Develop Unforeseen Strategies at Rates That Baffle Their Programmers. Norbert Wiener. Science 1960. [Paper]
- Coherent Extrapolated Volition. Eliezer Yudkowsky. Singularity Institute for Artificial Intelligence 2004. [Paper]
- The Basic AI Drives. Stephen M. Omohundro. AGI 2008. [Paper]
- The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Nick Bostrom. Minds and Machines 2012. [Paper]
- General Purpose Intelligence: Arguing the Orthogonality Thesis. Stuart Armstrong. Analysis and Metaphysics 2013. [Paper]
- Aligning Superintelligence with Human Interests: An Annotated Bibliography. Nate Soares. MIRI Technical Report 2015. [Paper]
- Concrete Problems in AI Safety. Dario Amodei et al. arXiv 2016. [Paper]
- The Mythos of Model Interpretability. Zachary C. Lipton. arXiv 2017. [Paper]
- AI Safety Gridworlds. Jan Leike et al. arXiv 2017. [Paper]
- Overview of Current AI Alignment Approaches. Micah Carroll. 2018. [Paper]
- Risks from Learned Optimization in Advanced Machine Learning Systems. Evan Hubinger et al. arXiv 2019. [Paper]
- An Overview of 11 Proposals for Building Safe Advanced AI. Evan Hubinger. arXiv 2020. [Paper]
- Unsolved Problems in ML Safety. Dan Hendrycks et al. arXiv 2021. [Paper]
- A Mathematical Framework for Transformer Circuits. Nelson Elhage et al. Transformer Circuits Thread 2021. [Paper]
- Alignment of Language Agents. Zachary Kenton et al. arXiv 2021. [Paper]
- A General Language Assistant as a Laboratory for Alignment. Amanda Askell et al. arXiv 2021. [Paper]
- A Transparency and Interpretability Tech Tree. Evan Hubinger. 2022. [Blog]
- Understanding AI Alignment Research: A Systematic Analysis. J. Kirchner et al. arXiv 2022. [Paper]
- Softmax Linear Units. Nelson Elhage et al. Transformer Circuits Thread 2022. [Paper]
- The Alignment Problem from a Deep Learning Perspective. Richard Ngo et al. arXiv 2022. [Paper]
- Paradigms of AI Alignment: Components and Enablers. Victoria Krakovna. 2022. [Blog]
- Progress Measures for Grokking via Mechanistic Interpretability. Neel Nanda et al. arXiv 2023. [Paper]
- Agentized LLMs Will Change the Alignment Landscape. Seth Herd. 2023. [Blog]
- Language Models Can Explain Neurons in Language Models. Steven Bills et al. 2023. [Paper]
- Core Views on AI Safety: When, Why, What, and How. Anthropic. 2023. [Blog]
- Proximal Policy Optimization Algorithms. John Schulman et al. arXiv 2017. [Paper]
- Fine-Tuning Language Models from Human Preferences. Daniel M Ziegler et al. arXiv 2019. [Paper]
- Learning to Summarize with Human Feedback. Nisan Stiennon et al. NeurIPS 2020. [Paper]
- Training Language Models to Follow Instructions with Human Feedback. Long Ouyang et al. NeurIPS 2022. [Paper]
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. Yuntao Bai et al. arXiv 2022. [Paper]
- RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs. Afra Feyza Akyürek et al. arXiv 2023. [Paper]
- Improving Language Models with Advantage-Based Offline Policy Gradients. Ashutosh Baheti et al. arXiv 2023. [Paper]
- Scaling Laws for Reward Model Overoptimization. Leo Gao et al. ICML 2023. [Paper]
- Improving Alignment of Dialogue Agents via Targeted Human Judgements. Amelia Glaese et al. arXiv 2022. [Paper]
- Aligning Language Models with Preferences through F-Divergence Minimization. Dongyoung Go et al. arXiv 2023. [Paper]
- Aligning Large Language Models through Synthetic Feedback. Sungdong Kim et al. arXiv 2023. [Paper]
- RLHF. Ansh Radhakrishnan. Lesswrong 2022. [Blog]
- Guiding Large Language Models via Directional Stimulus Prompting. Zekun Li et al. arXiv 2023. [Paper]
- Aligning Generative Language Models with Human Values. Ruibo Liu et al. NAACL 2022 Findings. [Paper]
- Second Thoughts Are Best: Learning to Re-Align with Human Values from Text Edits. Ruibo Liu et al. NeurIPS 2022. [Paper]
- Secrets of RLHF in Large Language Models Part I: PPO. Rui Zheng et al. arXiv 2023. [Paper]
- Principled Reinforcement Learning with Human Feedback from Pairwise or K-Wise Comparisons. Banghua Zhu et al. arXiv 2023. [Paper]
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. Stephen Casper et al. arXiv 2023. [Paper]
- Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. Timo Schick et al. TACL 2021. [Paper]
- The Cringe Loss: Learning What Language Not to Model. Leonard Adolphs et al. arXiv 2022. [Paper]
- Leashing the Inner Demons: Self-detoxification for Language Models. Canwen Xu et al. AAAI 2022. [Paper]
- Calibrating Sequence Likelihood Improves Conditional Language Generation. Yao Zhao et al. arXiv 2022. [Paper]
- RAFT: Reward Ranked Finetuning for Generative Foundation Model Alignment. Hanze Dong et al. arXiv 2023. [Paper]
- Chain of Hindsight Aligns Language Models with Feedback. Hao Liu et al. arXiv 2023. [Paper]
- Training Socially Aligned Language Models in Simulated Human Society. Ruibo Liu et al. arXiv 2023. [Paper]
- Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Rafael Rafailov et al. arXiv 2023. [Paper]
- Training Language Models with Language Feedback at Scale. Jérémy Scheurer et al. arXiv 2023. [Paper]
- Preference Ranking Optimization for Human Alignment. Feifan Song et al. arXiv 2023. [Paper]
- RRHF: Rank Responses to Align Language Models with Human Feedback without Tears. Zheng Yuan et al. arXiv 2023. [Paper]
- SLiC-HF: Sequence Likelihood Calibration with Human Feedback. Yao Zhao et al. arXiv 2023. [Paper]
- LIMA: Less Is More for Alignment. Chunting Zhou et al. arXiv 2023. [Paper]
- Supervising Strong Learners by Amplifying Weak Experts. Paul Christiano et al. arXiv 2018. [Paper]
- Scalable Agent Alignment via Reward Modeling: A Research Direction. Jan Leike et al. arXiv 2018. [Paper]
- AI Safety Needs Social Scientists. Geoffrey Irving and Amanda Askell. Distill 2019. [Paper]
- Learning to Summarize with Human Feedback. Nisan Stiennon et al. NeurIPS 2020. [Paper]
- Task Decomposition for Scalable Oversight (AGISF Distillation). Charbel-Raphaël Segerie. 2023. [Blog]
- Measuring Progress on Scalable Oversight for Large Language Models. Samuel R Bowman et al. arXiv 2022. [Paper]
- Constitutional AI: Harmlessness from AI Feedback. Yuntao Bai et al. arXiv 2022. [Paper]
- Improving Factuality and Reasoning in Language Models through Multiagent Debate. Yilun Du et al. arXiv 2023. [Paper]
- Evaluating Superhuman Models with Consistency Checks. Lukas Fluri et al. arXiv 2023. [Paper]
- AI Safety via Debate. Geoffrey Irving et al. arXiv 2018. [Paper]
- AI Safety via Market Making. Evan Hubinger. 2020. [Blog]
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. Tian Liang et al. arXiv 2023. [Paper]
- Let's Verify Step by Step. Hunter Lightman et al. arXiv 2023. [Paper]
- Introducing Superalignment. OpenAI. 2023. [Blog]
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. Zhiqing Sun et al. arXiv 2023. [Paper]
- Risks from Learned Optimization in Advanced Machine Learning Systems. Evan Hubinger et al. arXiv 2019. [Paper]
- Goal Misgeneralization in Deep Reinforcement Learning. Lauro Langosco et al. ICML 2022. [Paper]
- Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals. Rohin Shah et al. arXiv 2022. [Paper]
- Defining capability and alignment in gradient descent. Edouard Harris. Lesswrong 2020. [Blog]
- Categorizing failures as “outer” or “inner” misalignment is often confused. Rohin Shah. Lesswrong 2023. [Blog]
- Inner Alignment Failures" Which Are Actually Outer Alignment Failures. John Wentworth. Lesswrong 2020. [Blog]
- Relaxed adversarial training for inner alignment. Evan Hubinger. Lesswrong 2019. [Blog]
- The Inner Alignment Problem. Evan Hubinger et al. Lesswrong 2019. [Blog]
- Three scenarios of pseudo-alignment. Eleni Angelou. Lesswrong 2022. [Blog]
- Deceptive Alignment. Evan Hubinger et al. Lesswrong 2019. [Blog]
- What failure looks like. Paul Christiano. AI Alignment Forum 2019. [Blog]
- Concrete experiments in inner alignment. Evan Hubinger. Lesswrong 2019. [Blog]
- A central AI alignment problem: capabilities generalization, and the sharp left turn. Nate Soares. Lesswrong 2022. [Blog]
- Clarifying the confusion around inner alignment. Rauno Arike. AI Alignment Forum 2022. [Blog]
- 2-D Robustness. Vladimir Mikulik. AI Alignment Forum 2019. [Blog]
- Monitoring for deceptive alignment. Evan Hubinger. Lesswrong 2022. [Blog]
- Notions of explainability and evaluation approaches for explainable artificial intelligence. Giulia Vilone et al. arXiv 2020. [Paper]
- A Comprehensive Mechanistic Interpretability Explainer & Glossary. Neel Nanda. 2022. [Blog]
- The Mythos of Model Interpretability. Zachary C. Lipton. arXiv 2017. [Paper]
- AI research considerations for human existential safety (ARCHES). Andrew Critch et al. arXiv 2020. [Paper]
- Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning. RT McAllister et al. IJCAI 2017. [Paper]
- In-context Learning and Induction Heads. Catherine Olsson et al. Transformer Circuits Thread 2022. [Paper]
- Transformer Feed-Forward Layers Are Key-Value Memories. Mor Geva et al. EMNLP 2021. [Paper]
- Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Mor Geva et al. EMNLP 2022. [Paper]
- Softmax Linear Units. Nelson Elhage et al. Transformer Circuits Thread 2022. [Paper]
- Toy Models of Superposition. Nelson Elhage et al. Transformer Circuits Thread 2022. [Paper]
- Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. Chris Olah. 2022. [Paper]
- Knowledge Neurons in Pretrained Transformers. Damai Dai et al. ACL 2022. [Paper]
- Locating and editing factual associations in GPT. Kevin Meng et al. NeurIPS 2022. [Paper]
- Eliciting Truthful Answers from a Language Model. Kenneth Li et al. arXiv 2023. [Paper]
- LEACE: Perfect linear concept erasure in closed form. Nora Belrose et al. arXiv 2023. [Paper]
- Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots. Gelei Deng et al. arXiv 2023. [Paper]
- Multi-step Jailbreaking Privacy Attacks on ChatGPT. Haoran Li et al. arXiv 2023. [Paper]
- Prompt Injection Attack Against LLM-integrated Applications. Yi Liu et al. arXiv 2023. [Paper]
- Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models. Shuai Zhao et al. arXiv 2023. [Paper]
- More Than You've Asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models. Kai Greshake et al. arXiv 2023. [Paper]
- Backdoor Attacks for In-Context Learning with Language Models. Nikhil Kandpal et al. arXiv 2023. [Paper]
- BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT. Jiawen Shi et al. arXiv 2023. [Paper]
- Universal and Transferable Adversarial Attacks on Aligned Language Models. Andy Zou et al. arXiv 2023. [Paper]
- Are Aligned Neural Networks Adversarially Aligned?. Nicholas Carlini et al. arXiv 2023. [Paper]
- Visual Adversarial Examples Jailbreak Large Language Models. Xiangyu Qi et al. arXiv 2023. [Paper]
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Sewon Min et al. arXiv 2023. [Paper]
- Factuality Enhanced Language Models for Open-ended Text Generation. Nayeon Lee et al. NeurIPS 2022. [Paper]
- TruthfulQA: Measuring How Models Mimic Human Falsehoods. Stephanie Lin et al. arXiv 2021. [Paper]
- SummaC: Re-visiting NLI-based Models for Inconsistency Detection in Summarization. Philippe Laban et al. TACL 2022. [Paper]
- QAFactEval: Improved QA-based Factual Consistency Evaluation for Summarization. Alexander R. Fabbri et al. arXiv 2021. [Paper]
- TRUE: Re-evaluating Factual Consistency Evaluation. Or Honovich et al. arXiv 2022. [Paper]
- AlignScore: Evaluating Factual Consistency with a Unified Alignment Function. Yuheng Zha et al. arXiv 2023. [Paper]
- Social Chemistry 101: Learning to Reason about Social and Moral Norms. Maxwell Forbes et al. arXiv 2020. [Paper]
- Aligning AI with Shared Human Values. Dan Hendrycks et al. arXiv 2020. [Paper]
- Would You Rather? A New Benchmark for Learning Machine Alignment with Cultural Values and Social Preferences. Yi Tay et al. ACL 2020. [Paper]
- Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes. Nicholas Lourie et al. AAAI 2021. [Paper]
- Detecting Offensive Language in Social Media to Protect Adolescent Online Safety. Ying Chen et al. PASSAT-SocialCom 2012. [Paper]
- Offensive Language Detection Using Multi-level Classification. Amir H. Razavi et al. Canadian AI 2010. [Paper]
- Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. Zeerak Waseem and Dirk Hovy. NAACL Student Research Workshop 2016. [Paper]
- Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis. Björn Ross et al. NLP4CMC 2016. [Paper]
- Ex Machina: Personal Attacks Seen at Scale. Ellery Wulczyn et al. WWW 2017. [Paper]
- Predicting the Type and Target of Offensive Posts in Social Media. Marcos Zampieri et al. NAACL-HLT 2019. [Paper]
- Recipes for Safety in Open-Domain Chatbots. Jing Xu et al. arXiv 2020. [Paper]
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Samuel Gehman et al. EMNLP 2020 Findings. [Paper]
- COLD: A Benchmark for Chinese Offensive Language Detection. Jiawen Deng et al. EMNLP 2022. [Paper]
- Gender Bias in Coreference Resolution. Rachel Rudinger et al. NAACL 2018. [Paper]
- Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. Jieyu Zhao et al. NAACL 2018. [Paper]
- The Winograd Schema Challenge. Hector Levesque et al. KR 2012. [Paper]
- Toward Gender-Inclusive Coreference Resolution: An Analysis of Gender and Bias Throughout the Machine Learning Lifecycle. Yang Trista Cao and Hal Daumé III. Computational Linguistics 2021. [Paper]
- Evaluating Gender Bias in Machine Translation. Gabriel Stanovsky et al. ACL 2019. [Paper]
- Investigating Failures of Automatic Translation in the Case of Unambiguous Gender. Adithya Renduchintala and Adina Williams. ACL 2022. [Paper]
- Towards Understanding Gender Bias in Relation Extraction. Andrew Gaut et al. ACL 2020. [Paper]
- Addressing Age-Related Bias in Sentiment Analysis. Mark Díaz et al. CHI 2018. [Paper]
- Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. Svetlana Kiritchenko and Saif M. Mohammad. NAACL-HLT 2018. [Paper]
- On Measuring and Mitigating Biased Inferences of Word Embeddings. Sunipa Dev et al. AAAI 2020. [Paper]
- Social Bias Frames: Reasoning About Social and Power Implications of Language. Maarten Sap et al. ACL 2020. [Paper]
- Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark. Jingyan Zhou et al. EMNLP 2022 Findings. [Paper]
- CORGI-PM: A Chinese Corpus for Gender Bias Probing and Mitigation. Ge Zhang et al. arXiv 2023. [Paper]
- StereoSet: Measuring Stereotypical Bias in Pretrained Language Models. Moin Nadeem et al. ACL 2021. [Paper]
- CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. Nikita Nangia et al. EMNLP 2020. [Paper]
- BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation. Jwala Dhamala et al. FAccT 2021. [Paper]
- “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset. Eric Michael Smith et al. EMNLP 2022. [Paper]
- Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale. Marta R. Costa-jussà et al. arXiv 2023. [Paper]
- UNQOVERing Stereotyping Biases via Underspecified Questions. Tao Li et al. EMNLP 2020 Findings. [Paper]
- BBQ: A Hand-Built Bias Benchmark for Question Answering. Alicia Parrish et al. ACL 2022 Findings. [Paper]
- CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models. Yufei Huang and Deyi Xiong. arXiv 2023. [Paper]
- Automated Hate Speech Detection and the Problem of Offensive Language. Thomas Davidson et al. AAAI 2017. [Paper]
- Deep Learning for Hate Speech Detection in Tweets. Pinkesh Badjatiya et al. WWW 2017. [Paper]
- Detecting Hate Speech on the World Wide Web. William Warner and Julia Hirschberg. NAACL-HLT 2012. [Paper]
- A Survey on Hate Speech Detection using Natural Language Processing. Anna Schmidt and Michael Wiegand. SocialNLP 2017. [Paper]
- Hate Speech Detection with Comment Embeddings. Nemanja Djuric et al. WWW 2015. [Paper]
- Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter. Zeerak Waseem. NLP+CSS@EMNLP 2016. [Paper]
- TweetBLM: A Hate Speech Dataset and Analysis of Black Lives Matter-related Microblogs on Twitter. Sumit Kumar and Raj Ratn Pranesh. arXiv 2021. [Paper]
- Hate Speech Dataset from a White Supremacy Forum. Ona de Gibert et al. ALW2 2018. [Paper]
- The Gab Hate Corpus: A Collection of 27k Posts Annotated for Hate Speech. Brendan Kennedy et al. LRE 2022. [Paper]
- Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts. Luke Breitfeller et al. EMNLP 2019. [Paper]
- Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection. Bertie Vidgen et al. ACL 2021. [Paper]
- Hate speech detection: Challenges and solutions. Sean MacAvaney et al. PloS One 2019. [Paper]
- Racial Microaggressions in Everyday Life: Implications for Clinical Practice. Derald Wing Sue et al. American Psychologist 2007. [Paper]
- The Impact of Racial Microaggressions on Mental Health: Counseling Implications for Clients of Color. Kevin L. Nadal et al. Journal of Counseling & Development 2014. [Paper]
- A Preliminary Report on the Relationship Between Microaggressions Against Black People and Racism Among White College Students. Jonathan W. Kanter et al. Race and Social Problems 2017. [Paper]
- Microaggressions and Traumatic Stress: Theory, Research, and Clinical Treatment. Kevin L. Nadal. American Psychological Association 2018. [Paper]
- Arabs as Terrorists: Effects of Stereotypes Within Violent Contexts on Attitudes, Perceptions, and Affect. Muniba Saleem and Craig A. Anderson. Psychology of Violence 2013. [Paper]
- Mean Girls? The Influence of Gender Portrayals in Teen Movies on Emerging Adults' Gender-Based Attitudes and Beliefs. Elizabeth Behm-Morawitz and Dana E. Mastro. Journalism and Mass Communication Quarterly 2008. [Paper]
- Exposure to Hate Speech Increases Prejudice Through Desensitization. Wiktor Soral, Michał Bilewicz, and Mikołaj Winiewski. Aggressive Behavior 2018. [Paper]
- Latent Hatred: A Benchmark for Understanding Implicit Hate Speech. Mai ElSherief et al. EMNLP 2021. [Paper]
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. Thomas Hartvigsen et al. ACL 2022. [Paper]
- An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models. Saghar Hosseini, Hamid Palangi, and Ahmed Hassan Awadallah. arXiv 2023. [Paper]
- TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models. Yue Huang et al. arXiv 2023. [Paper]
- Safety Assessment of Chinese Large Language Models. Hao Sun et al. arXiv 2023. [Paper]
- FLASK: Fine-grained Language Model Evaluation Based on Alignment Skill Sets. Seonghyeon Ye et al. arXiv 2023. [Paper]
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Lianmin Zheng et al. arXiv 2023. [Paper]
- Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Aarohi Srivastava et al. arXiv 2023. [Paper]
- A Critical Evaluation of Evaluations for Long-form Question Answering. Fangyuan Xu et al. arXiv 2023. [Paper]
- AlpacaEval: An Automatic Evaluator of Instruction-following Models. Xuechen Li et al. Github 2023. [Github]
- AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. Yann Dubois et al. arXiv 2023. [Paper]
- PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. Yidong Wang et al. arXiv 2023. [Paper]
- Large Language Models are not Fair Evaluators. Peiyi Wang et al. arXiv 2023. [Paper]
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. Yang Liu et al. arXiv 2023. [Paper]
- Benchmarking Foundation Models with Language-Model-as-an-Examiner. Yushi Bai et al. arXiv 2023. [Paper]
- PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations. Ruosen Li et al. arXiv 2023. [Paper]
- SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions. Yizhong Wang et al. arXiv 2023. [Paper]