
llm-alignment-survey

A curated reading list for large language model (LLM) alignment. Take a look at our new survey "Large Language Model Alignment: A Survey" on arXiv for more details!

Feel free to open an issue/PR or e-mail [email protected] and [email protected] if you find any missing areas, papers, or datasets. We will keep updating this list and survey.

If you find our survey useful, please cite our paper:

@article{shen2023alignment,
      title={Large Language Model Alignment: A Survey}, 
      author={Shen, Tianhao and Jin, Renren and Huang, Yufei and Liu, Chuang and Dong, Weilong and Guo, Zishan and Wu, Xinwei and Liu, Yan and Xiong, Deyi},
      journal={arXiv preprint arXiv:2309.15025},
      year={2023}
}

Table of Contents

  Related Surveys
  Why LLM Alignment?
  LLM-Generated Content
  Potential Risks Associated with Advanced LLMs
  What is LLM Alignment?
  Outer Alignment
  Inner Alignment
  Mechanistic Interpretability
  Attacks on Aligned Language Models
  Alignment Evaluation

Related Surveys

  1. Aligning Large Language Models with Human: A Survey. Yufei Wang et al. arXiv 2023. [Paper]
  2. Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment. Yang Liu et al. arXiv 2023. [Paper]
  3. Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation. Patrick Fernandes et al. arXiv 2023. [Paper]
  4. Augmented Language Models: a Survey. Grégoire Mialon et al. arXiv 2023. [Paper]
  5. An Overview of Catastrophic AI Risks. Dan Hendrycks et al. arXiv 2023. [Paper]
  6. A Survey of Large Language Models. Wayne Xin Zhao et al. arXiv 2023. [Paper]
  7. A Survey on Universal Adversarial Attack. Chaoning Zhang et al. IJCAI 2021. [Paper]
  8. Survey of Hallucination in Natural Language Generation. Ziwei Ji et al. ACM Computing Surveys 2022. [Paper]
  9. Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. Liangming Pan et al. arXiv 2023. [Paper]
  10. Automatic Detection of Machine Generated Text: A Critical Survey. Ganesh Jawahar et al. COLING 2020. [Paper]

Why LLM Alignment?

  1. Synchromesh: Reliable Code Generation from Pre-trained Language Models. Gabriel Poesia et al. ICLR 2022. [Paper]
  2. LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. Chan Hee Song et al. ICCV 2023. [Paper]
  3. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. Wenlong Huang et al. PMLR 2022. [Paper]
  4. Tool Learning with Foundation Models. Yujia Qin et al. arXiv 2023. [Paper]
  5. Ethical and social risks of harm from Language Models. Laura Weidinger et al. arXiv 2021. [Paper]

LLM-Generated Content

Undesirable Content

  1. Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview. Deven Shah et al. arXiv 2019. [Paper]
  2. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Samuel Gehman et al. EMNLP 2020 Findings. [Paper]
  3. Extracting Training Data from Large Language Models. Nicholas Carlini et al. arXiv 2020. [Paper]
  4. StereoSet: Measuring stereotypical bias in pretrained language models. Moin Nadeem et al. arXiv 2020. [Paper]
  5. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. Nikita Nangia et al. EMNLP 2020. [Paper]
  6. HONEST: Measuring Hurtful Sentence Completion in Language Models. Debora Nozza et al. NAACL 2021. [Paper]
  7. Language Models are Few-Shot Learners. Tom Brown et al. NeurIPS 2020. [Paper]
  8. Persistent Anti-Muslim Bias in Large Language Models. Abubakar Abid et al. AIES 2021. [Paper]
  9. Gender and Representation Bias in GPT-3 Generated Stories. Li Lucy et al. WNU 2021. [Paper]

Unfaithful Content

  1. Measuring and Improving Consistency in Pretrained Language Models. Yanai Elazar et al. TACL 2021. [Paper]
  2. GPT-3 Creative Fiction. Gwern. 2023. [Blog]
  3. GPT-3: What’s It Good for? Robert Dale. Natural Language Engineering 2020. [Paper]
  4. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Jack W. Rae et al. arXiv 2021. [Paper]
  5. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Stephanie Lin et al. ACL 2022. [Paper]
  6. Towards Tracing Knowledge in Language Models Back to the Training Data. Ekin Akyürek et al. EMNLP 2022 Findings. [Paper]
  7. Sparks of Artificial General Intelligence: Early experiments with GPT-4. Sébastien Bubeck et al. arXiv 2023. [Paper]
  8. Navigating the Grey Area: Expressions of Overconfidence and Uncertainty in Language Models. Kaitlyn Zhou et al. arXiv 2023. [Paper]
  9. Patient and Consumer Safety Risks When Using Conversational Assistants for Medical Information: An Observational Study of Siri, Alexa, and Google Assistant. Timothy W. Bickmore et al. Journal of Medical Internet Research 2018. [Paper]
  10. Will ChatGPT Replace Lawyers? Kate Rattray. 2023. [Blog]
  11. Constitutional AI: Harmlessness from AI Feedback. Yuntao Bai et al. arXiv 2022. [Paper]

Malicious Uses

  1. Truth, Lies, and Automation: How Language Models Could Change Disinformation. Ben Buchanan et al. Center for Security and Emerging Technology, 2021. [Paper]
  2. Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. Alex Tamkin et al. arXiv 2021. [Paper]
  3. Deal or No Deal? End-to-End Learning for Negotiation Dialogues. Mike Lewis et al. arXiv 2017. [Paper]
  4. Evaluating Large Language Models Trained on Code. Mark Chen et al. arXiv 2021. [Paper]
  5. Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools. Jonas B. Sandbrink. arXiv 2023. [Paper]

Negative Impacts on Society

  1. Sustainable AI: AI for sustainability and the sustainability of AI. Aimee van Wynsberghe. AI and Ethics 2021. [Paper]
  2. Unraveling the Hidden Environmental Impacts of AI Solutions for Environment. Anne-Laure Ligozat et al. arXiv 2021. [Paper]
  3. GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models. Tyna Eloundou et al. arXiv 2023. [Paper]

Potential Risks Associated with Advanced LLMs

  1. Formalizing Convergent Instrumental Goals. Tsvi Benson-Tilsen et al. AAAI AIES Workshop 2016. [Paper]
  2. Model evaluation for extreme risks. Toby Shevlane et al. arXiv 2023. [Paper]
  3. Aligning AI Optimization to Community Well-Being. Jonathan Stray. International Journal of Community Well-Being 2020. [Paper]
  4. What are you optimizing for? Aligning Recommender Systems with Human Values. Jonathan Stray et al. ICML 2020. [Paper]
  5. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Meta Fundamental AI Research Diplomacy Team (FAIR) et al. Science 2022. [Paper]
  6. Characterizing Manipulation from AI Systems. Micah Carroll et al. arXiv 2023. [Paper]
  7. Deceptive Alignment Monitoring. Andres Carranza et al. ICML AdvML Workshop 2023. [Paper]
  8. The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Nick Bostrom. Minds and Machines 2012. [Paper]
  9. Is Power-Seeking AI an Existential Risk? Joseph Carlsmith. arXiv 2023. [Paper]
  10. Optimal Policies Tend To Seek Power. Alexander Matt Turner et al. NeurIPS 2021. [Paper]
  11. Parametrically Retargetable Decision-Makers Tend To Seek Power. Alexander Matt Turner et al. NeurIPS 2022. [Paper]
  12. Power-seeking can be probable and predictive for trained agents. Victoria Krakovna et al. arXiv 2023. [Paper]
  13. Discovering Language Model Behaviors with Model-Written Evaluations. Ethan Perez et al. arXiv 2022. [Paper]

What is LLM Alignment?

  1. Some Moral and Technical Consequences of Automation: As Machines Learn They May Develop Unforeseen Strategies at Rates That Baffle Their Programmers. Norbert Wiener. Science 1960. [Paper]
  2. Coherent Extrapolated Volition. Eliezer Yudkowsky. Singularity Institute for Artificial Intelligence 2004. [Paper]
  3. The Basic AI Drives. Stephen M. Omohundro. AGI 2008. [Paper]
  4. The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Nick Bostrom. Minds and Machines 2012. [Paper]
  5. General Purpose Intelligence: Arguing the Orthogonality Thesis. Stuart Armstrong. Analysis and Metaphysics 2013. [Paper]
  6. Aligning Superintelligence with Human Interests: An Annotated Bibliography. Nate Soares. Intelligence 2015. [Paper]
  7. Concrete Problems in AI Safety. Dario Amodei et al. arXiv 2016. [Paper]
  8. The Mythos of Model Interpretability. Zachary C. Lipton. arXiv 2017. [Paper]
  9. AI Safety Gridworlds. Jan Leike et al. arXiv 2017. [Paper]
  10. Overview of Current AI Alignment Approaches. Micah Carroll. 2018. [Paper]
  11. Risks from Learned Optimization in Advanced Machine Learning Systems. Evan Hubinger et al. arXiv 2019. [Paper]
  12. An Overview of 11 Proposals for Building Safe Advanced AI. Evan Hubinger. arXiv 2020. [Paper]
  13. Unsolved Problems in ML Safety. Dan Hendrycks et al. arXiv 2021. [Paper]
  14. A Mathematical Framework for Transformer Circuits. Nelson Elhage et al. Transformer Circuits Thread 2021. [Paper]
  15. Alignment of Language Agents. Zachary Kenton et al. arXiv 2021. [Paper]
  16. A General Language Assistant as a Laboratory for Alignment. Amanda Askell et al. arXiv 2021. [Paper]
  17. A Transparency and Interpretability Tech Tree. Evan Hubinger. 2022. [Blog]
  18. Understanding AI Alignment Research: A Systematic Analysis. J. Kirchner et al. arXiv 2022. [Paper]
  19. Softmax Linear Units. Nelson Elhage et al. Transformer Circuits Thread 2022. [Paper]
  20. The Alignment Problem from a Deep Learning Perspective. Richard Ngo et al. arXiv 2022. [Paper]
  21. Paradigms of AI Alignment: Components and Enablers. Victoria Krakovna. 2022. [Blog]
  22. Progress Measures for Grokking via Mechanistic Interpretability. Neel Nanda et al. arXiv 2023. [Paper]
  23. Agentized LLMs Will Change the Alignment Landscape. Seth Herd. 2023. [Blog]
  24. Language Models Can Explain Neurons in Language Models. Steven Bills et al. 2023. [Paper]
  25. Core Views on AI Safety: When, Why, What, and How. Anthropic. 2023. [Blog]

Outer Alignment

Non-recursive Oversight

RL-based Methods

  1. Proximal Policy Optimization Algorithms. John Schulman et al. arXiv 2017. [Paper]
  2. Fine-Tuning Language Models from Human Preferences. Daniel M Ziegler et al. arXiv 2019. [Paper]
  3. Learning to Summarize with Human Feedback. Nisan Stiennon et al. NeurIPS 2020. [Paper]
  4. Training Language Models to Follow Instructions with Human Feedback. Long Ouyang et al. NeurIPS 2022. [Paper]
  5. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. Yuntao Bai et al. arXiv 2022. [Paper]
  6. RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs. Afra Feyza Akyürek et al. arXiv 2023. [Paper]
  7. Improving Language Models with Advantage-Based Offline Policy Gradients. Ashutosh Baheti et al. arXiv 2023. [Paper]
  8. Scaling Laws for Reward Model Overoptimization. Leo Gao et al. ICML 2023. [Paper]
  9. Improving Alignment of Dialogue Agents via Targeted Human Judgements. Amelia Glaese et al. arXiv 2022. [Paper]
  10. Aligning Language Models with Preferences through F-Divergence Minimization. Dongyoung Go et al. arXiv 2023. [Paper]
  11. Aligning Large Language Models through Synthetic Feedback. Sungdong Kim et al. arXiv 2023. [Paper]
  12. RLHF. Ansh Radhakrishnan. Lesswrong 2022. [Blog]
  13. Guiding Large Language Models via Directional Stimulus Prompting. Zekun Li et al. arXiv 2023. [Paper]
  14. Aligning Generative Language Models with Human Values. Ruibo Liu et al. NAACL 2022 Findings. [Paper]
  15. Second Thoughts Are Best: Learning to Re-Align with Human Values from Text Edits. Ruibo Liu et al. NeurIPS 2022. [Paper]
  16. Secrets of RLHF in Large Language Models Part I: PPO. Rui Zheng et al. arXiv 2023. [Paper]
  17. Principled Reinforcement Learning with Human Feedback from Pairwise or K-Wise Comparisons. Banghua Zhu et al. arXiv 2023. [Paper]
  18. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. Stephen Casper et al. arXiv 2023. [Paper]

SL-based Methods

  1. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. Timo Schick et al. TACL 2021. [Paper]
  2. The CRINGE Loss: Learning What Language Not to Model. Leonard Adolphs et al. arXiv 2022. [Paper]
  3. Leashing the Inner Demons: Self-detoxification for Language Models. Canwen Xu et al. AAAI 2022. [Paper]
  4. Calibrating Sequence Likelihood Improves Conditional Language Generation. Yao Zhao et al. arXiv 2022. [Paper]
  5. RAFT: Reward Ranked Finetuning for Generative Foundation Model Alignment. Hanze Dong et al. arXiv 2023. [Paper]
  6. Chain of Hindsight Aligns Language Models with Feedback. Hao Liu et al. arXiv 2023. [Paper]
  7. Training Socially Aligned Language Models in Simulated Human Society. Ruibo Liu et al. arXiv 2023. [Paper]
  8. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Rafael Rafailov et al. arXiv 2023. [Paper]
  9. Training Language Models with Language Feedback at Scale. Jérémy Scheurer et al. arXiv 2023. [Paper]
  10. Preference Ranking Optimization for Human Alignment. Feifan Song et al. arXiv 2023. [Paper]
  11. RRHF: Rank Responses to Align Language Models with Human Feedback without Tears. Zheng Yuan et al. arXiv 2023. [Paper]
  12. SLiC-HF: Sequence Likelihood Calibration with Human Feedback. Yao Zhao et al. arXiv 2023. [Paper]
  13. LIMA: Less Is More for Alignment. Chunting Zhou et al. arXiv 2023. [Paper]

Scalable Oversight

  1. Supervising Strong Learners by Amplifying Weak Experts. Paul Christiano et al. arXiv 2018. [Paper]
  2. Scalable Agent Alignment via Reward Modeling: A Research Direction. Jan Leike et al. arXiv 2018. [Paper]
  3. AI Safety Needs Social Scientists. Geoffrey Irving and Amanda Askell. Distill 2019. [Paper]
  4. Learning to Summarize with Human Feedback. Nisan Stiennon et al. NeurIPS 2020. [Paper]
  5. Task Decomposition for Scalable Oversight (AGISF Distillation). Charbel-Raphaël Segerie. 2023. [Blog]
  6. Measuring Progress on Scalable Oversight for Large Language Models. Samuel R Bowman et al. arXiv 2022. [Paper]
  7. Constitutional AI: Harmlessness from AI Feedback. Yuntao Bai et al. arXiv 2022. [Paper]
  8. Improving Factuality and Reasoning in Language Models through Multiagent Debate. Yilun Du et al. arXiv 2023. [Paper]
  9. Evaluating Superhuman Models with Consistency Checks. Lukas Fluri et al. arXiv 2023. [Paper]
  10. AI Safety via Debate. Geoffrey Irving et al. arXiv 2018. [Paper]
  11. AI Safety via Market Making. Evan Hubinger. 2020. [Blog]
  12. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. Tian Liang et al. arXiv 2023. [Paper]
  13. Let's Verify Step by Step. Hunter Lightman et al. arXiv 2023. [Paper]
  14. Introducing Superalignment. OpenAI. 2023. [Blog]
  15. Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. Zhiqing Sun et al. arXiv 2023. [Paper]

Inner Alignment

  1. Risks from Learned Optimization in Advanced Machine Learning Systems. Evan Hubinger et al. arXiv 2019. [Paper]
  2. Goal Misgeneralization in Deep Reinforcement Learning. Lauro Langosco et al. ICML 2022. [Paper]
  3. Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals. Rohin Shah et al. arXiv 2022. [Paper]
  4. Defining capability and alignment in gradient descent. Edouard Harris. Lesswrong 2020. [Blog]
  5. Categorizing failures as “outer” or “inner” misalignment is often confused. Rohin Shah. Lesswrong 2023. [Blog]
  6. "Inner Alignment Failures" Which Are Actually Outer Alignment Failures. John Wentworth. Lesswrong 2020. [Blog]
  7. Relaxed adversarial training for inner alignment. Evan Hubinger. Lesswrong 2019. [Blog]
  8. The Inner Alignment Problem. Evan Hubinger et al. Lesswrong 2019. [Blog]
  9. Three scenarios of pseudo-alignment. Eleni Angelou. Lesswrong 2022. [Blog]
  10. Deceptive Alignment. Evan Hubinger et al. Lesswrong 2019. [Blog]
  11. What failure looks like. Paul Christiano. AI Alignment Forum 2019. [Blog]
  12. Concrete experiments in inner alignment. Evan Hubinger. Lesswrong 2019. [Blog]
  13. A central AI alignment problem: capabilities generalization, and the sharp left turn. Nate Soares. Lesswrong 2022. [Blog]
  14. Clarifying the confusion around inner alignment. Rauno Arike. AI Alignment Forum 2022. [Blog]
  15. 2-D Robustness. Vladimir Mikulik. AI Alignment Forum 2019. [Blog]
  16. Monitoring for deceptive alignment. Evan Hubinger. Lesswrong 2022. [Blog]

Mechanistic Interpretability

  1. Notions of explainability and evaluation approaches for explainable artificial intelligence. Giulia Vilone et al. arXiv 2020. [Paper]
  2. A Comprehensive Mechanistic Interpretability Explainer & Glossary. Neel Nanda. 2022. [Blog]
  3. The Mythos of Model Interpretability. Zachary C. Lipton. arXiv 2017. [Paper]
  4. AI research considerations for human existential safety (ARCHES). Andrew Critch et al. arXiv 2020. [Paper]
  5. Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning. Rowan McAllister et al. IJCAI 2017. [Paper]
  6. In-context Learning and Induction Heads. Catherine Olsson et al. Transformer Circuits Thread, 2022. [Paper]
  7. Transformer Feed-Forward Layers Are Key-Value Memories. Mor Geva et al. EMNLP 2021. [Paper]
  8. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Mor Geva et al. EMNLP 2022. [Paper]
  9. Softmax Linear Units. Nelson Elhage et al. Transformer Circuits Thread 2022. [Paper]
  10. Toy Models of Superposition. Nelson Elhage et al. Transformer Circuits Thread 2022. [Paper]
  11. Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. Chris Olah. 2022. [Paper]
  12. Knowledge Neurons in Pretrained Transformers. Damai Dai et al. ACL 2022. [Paper]
  13. Locating and Editing Factual Associations in GPT. Kevin Meng et al. NeurIPS 2022. [Paper]
  14. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Kenneth Li et al. arXiv 2023. [Paper]
  15. LEACE: Perfect linear concept erasure in closed form. Nora Belrose et al. arXiv 2023. [Paper]

Attacks on Aligned Language Models

Privacy Attacks

  1. Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots. Gelei Deng et al. arXiv 2023. [Paper]
  2. Multi-step Jailbreaking Privacy Attacks on ChatGPT. Haoran Li et al. arXiv 2023. [Paper]

Backdoor Attacks

  1. Prompt Injection Attack Against LLM-integrated Applications. Yi Liu et al. arXiv 2023. [Paper]
  2. Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models. Shuai Zhao et al. arXiv 2023. [Paper]
  3. More Than You've Asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models. Kai Greshake et al. arXiv 2023. [Paper]
  4. Backdoor Attacks for In-Context Learning with Language Models. Nikhil Kandpal et al. arXiv 2023. [Paper]
  5. BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT. Jiawen Shi et al. arXiv 2023. [Paper]

Adversarial Attacks

  1. Universal and Transferable Adversarial Attacks on Aligned Language Models. Andy Zou et al. arXiv 2023. [Paper]
  2. Are Aligned Neural Networks Adversarially Aligned?. Nicholas Carlini et al. arXiv 2023. [Paper]
  3. Visual Adversarial Examples Jailbreak Large Language Models. Xiangyu Qi et al. arXiv 2023. [Paper]

Alignment Evaluation

Factuality Evaluation

  1. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Sewon Min et al. arXiv 2023. [Paper]
  2. Factuality Enhanced Language Models for Open-ended Text Generation. Nayeon Lee et al. NeurIPS 2022. [Paper]
  3. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Stephanie Lin et al. arXiv 2021. [Paper]
  4. SummaC: Re-visiting NLI-based Models for Inconsistency Detection in Summarization. Philippe Laban et al. TACL 2022. [Paper]
  5. QAFactEval: Improved QA-based Factual Consistency Evaluation for Summarization. Alexander R. Fabbri et al. arXiv 2021. [Paper]
  6. TRUE: Re-evaluating Factual Consistency Evaluation. Or Honovich et al. arXiv 2022. [Paper]
  7. AlignScore: Evaluating Factual Consistency with a Unified Alignment Function. Yuheng Zha et al. arXiv 2023. [Paper]

Ethics Evaluation

  1. Social Chemistry 101: Learning to Reason about Social and Moral Norms. Maxwell Forbes et al. arXiv 2020. [Paper]
  2. Aligning AI with Shared Human Values. Dan Hendrycks et al. arXiv 2020. [Paper]
  3. Would You Rather? A New Benchmark for Learning Machine Alignment with Cultural Values and Social Preferences. Yi Tay et al. ACL 2020. [Paper]
  4. Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes. Nicholas Lourie et al. AAAI 2021. [Paper]

Toxicity Evaluation

Task-specific Evaluation

  1. Detecting Offensive Language in Social Media to Protect Adolescent Online Safety. Ying Chen et al. PASSAT-SocialCom 2012. [Paper]
  2. Offensive Language Detection Using Multi-level Classification. Amir H. Razavi et al. Canadian AI 2010. [Paper]
  3. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. Zeerak Waseem and Dirk Hovy. NAACL Student Research Workshop 2016. [Paper]
  4. Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis. Bjorn Ross et al. NLP4CMC 2016. [Paper]
  5. Ex Machina: Personal Attacks Seen at Scale. Ellery Wulczyn et al. WWW 2017. [Paper]
  6. Predicting the Type and Target of Offensive Posts in Social Media. Marcos Zampieri et al. NAACL-HLT 2019. [Paper]

LLM-centered Evaluation

  1. Recipes for Safety in Open-Domain Chatbots. Jing Xu et al. arXiv 2020. [Paper]
  2. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Samuel Gehman et al. EMNLP 2020 Findings. [Paper]
  3. COLD: A Benchmark for Chinese Offensive Language Detection. Jiawen Deng et al. EMNLP 2022. [Paper]

Stereotype and Bias Evaluation

Task-specific Evaluation

  1. Gender Bias in Coreference Resolution. Rachel Rudinger et al. NAACL 2018. [Paper]
  2. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. Jieyu Zhao et al. NAACL 2018. [Paper]
  3. The Winograd Schema Challenge. Hector Levesque et al. KR 2012. [Paper]
  4. Toward Gender-Inclusive Coreference Resolution: An Analysis of Gender and Bias Throughout the Machine Learning Lifecycle. Yang Trista Cao and Hal Daumé III. Computational Linguistics 2021. [Paper]
  5. Evaluating Gender Bias in Machine Translation. Gabriel Stanovsky et al. ACL 2019. [Paper]
  6. Investigating Failures of Automatic Translation in the Case of Unambiguous Gender. Adithya Renduchintala and Adina Williams. ACL 2022. [Paper]
  7. Towards Understanding Gender Bias in Relation Extraction. Andrew Gaut et al. ACL 2020. [Paper]
  8. Addressing Age-Related Bias in Sentiment Analysis. Mark Díaz et al. CHI 2018. [Paper]
  9. Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. Svetlana Kiritchenko and Saif M. Mohammad. NAACL-HLT 2018. [Paper]
  10. On Measuring and Mitigating Biased Inferences of Word Embeddings. Sunipa Dev et al. AAAI 2020. [Paper]
  11. Social Bias Frames: Reasoning About Social and Power Implications of Language. Maarten Sap et al. ACL 2020. [Paper]
  12. Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark. Jingyan Zhou et al. EMNLP 2022 Findings. [Paper]
  13. CORGI-PM: A Chinese Corpus for Gender Bias Probing and Mitigation. Ge Zhang et al. arXiv 2023. [Paper]

LLM-centered Evaluation

  1. StereoSet: Measuring Stereotypical Bias in Pretrained Language Models. Moin Nadeem et al. ACL 2021. [Paper]
  2. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. Nikita Nangia et al. EMNLP 2020. [Paper]
  3. BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation. Jwala Dhamala et al. FAccT 2021. [Paper]
  4. “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset. Eric Michael Smith et al. EMNLP 2022. [Paper]
  5. Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale. Marta R. Costa-jussà et al. arXiv 2023. [Paper]
  6. UNQOVERing Stereotyping Biases via Underspecified Questions. Tao Li et al. EMNLP 2020 Findings. [Paper]
  7. BBQ: A Hand-Built Bias Benchmark for Question Answering. Alicia Parrish et al. ACL 2022 Findings. [Paper]
  8. CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models. Yufei Huang and Deyi Xiong. arXiv 2023. [Paper]

Hate Speech Detection

  1. Automated Hate Speech Detection and the Problem of Offensive Language. Thomas Davidson et al. AAAI 2017. [Paper]
  2. Deep Learning for Hate Speech Detection in Tweets. Pinkesh Badjatiya et al. WWW 2017. [Paper]
  3. Detecting Hate Speech on the World Wide Web. William Warner and Julia Hirschberg. NAACL-HLT 2012. [Paper]
  4. A Survey on Hate Speech Detection using Natural Language Processing. Anna Schmidt and Michael Wiegand. SocialNLP 2017. [Paper]
  5. Hate Speech Detection with Comment Embeddings. Nemanja Djuric et al. WWW 2015. [Paper]
  6. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter. Zeerak Waseem. NLP+CSS@EMNLP 2016. [Paper]
  7. TweetBLM: A Hate Speech Dataset and Analysis of Black Lives Matter-related Microblogs on Twitter. Sumit Kumar and Raj Ratn Pranesh. arXiv 2021. [Paper]
  8. Hate Speech Dataset from a White Supremacy Forum. Ona de Gibert et al. ALW2 2018. [Paper]
  9. The Gab Hate Corpus: A Collection of 27k Posts Annotated for Hate Speech. Brendan Kennedy et al. LRE 2022. [Paper]
  10. Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts. Luke Breitfeller et al. EMNLP 2019. [Paper]
  11. Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection. Bertie Vidgen et al. ACL 2021. [Paper]
  12. Hate speech detection: Challenges and solutions. Sean MacAvaney et al. PloS One 2019. [Paper]
  13. Racial Microaggressions in Everyday Life: Implications for Clinical Practice. Derald Wing Sue et al. American Psychologist 2007. [Paper]
  14. The Impact of Racial Microaggressions on Mental Health: Counseling Implications for Clients of Color. Kevin L. Nadal et al. Journal of Counseling & Development 2014. [Paper]
  15. A Preliminary Report on the Relationship Between Microaggressions Against Black People and Racism Among White College Students. Jonathan W. Kanter et al. Race and Social Problems 2017. [Paper]
  16. Microaggressions and Traumatic Stress: Theory, Research, and Clinical Treatment. Kevin L. Nadal. American Psychological Association 2018. [Paper]
  17. Arabs as Terrorists: Effects of Stereotypes Within Violent Contexts on Attitudes, Perceptions, and Affect. Muniba Saleem and Craig A. Anderson. Psychology of Violence 2013. [Paper]
  18. Mean Girls? The Influence of Gender Portrayals in Teen Movies on Emerging Adults' Gender-Based Attitudes and Beliefs. Elizabeth Behm-Morawitz and Dana E. Mastro. Journalism and Mass Communication Quarterly 2008. [Paper]
  19. Exposure to Hate Speech Increases Prejudice Through Desensitization. Wiktor Soral, Michał Bilewicz, and Mikołaj Winiewski. Aggressive Behavior 2018. [Paper]
  20. Latent Hatred: A Benchmark for Understanding Implicit Hate Speech. Mai ElSherief et al. EMNLP 2021. [Paper]
  21. ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. Thomas Hartvigsen et al. ACL 2022. [Paper]
  22. An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models. Saghar Hosseini, Hamid Palangi, and Ahmed Hassan Awadallah. arXiv 2023. [Paper]

General Evaluation

  1. TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models. Yue Huang et al. arXiv 2023. [Paper]
  2. Safety Assessment of Chinese Large Language Models. Hao Sun et al. arXiv 2023. [Paper]
  3. FLASK: Fine-grained Language Model Evaluation Based on Alignment Skill Sets. Seonghyeon Ye et al. arXiv 2023. [Paper]
  4. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Lianmin Zheng et al. arXiv 2023. [Paper]
  5. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Aarohi Srivastava et al. arXiv 2023. [Paper]
  6. A Critical Evaluation of Evaluations for Long-form Question Answering. Fangyuan Xu et al. arXiv 2023. [Paper]
  7. AlpacaEval: An Automatic Evaluator of Instruction-following Models. Xuechen Li et al. GitHub 2023. [GitHub]
  8. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. Yann Dubois et al. arXiv 2023. [Paper]
  9. PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. Yidong Wang et al. arXiv 2023. [Paper]
  10. Large Language Models are not Fair Evaluators. Peiyi Wang et al. arXiv 2023. [Paper]
  11. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. Yang Liu et al. arXiv 2023. [Paper]
  12. Benchmarking Foundation Models with Language-Model-as-an-Examiner. Yushi Bai et al. arXiv 2023. [Paper]
  13. PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations. Ruosen Li et al. arXiv 2023. [Paper]
  14. SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions. Yizhong Wang et al. arXiv 2023. [Paper]
