Feeling confused about super alignment? Here is a reading list

shuyhere/about-super-alignment

Get to know superalignment

EN|中文

Feeling confused about superalignment? Start here: OpenAI's "Introducing Superalignment" and "Superalignment Fast Grants".

Timeline

  • OpenAI 01/2022: Aligning language models to follow instructions. The statement "Further, in many cases aligning to the average labeler preference may not be desirable", from the article's limitations section, could be read as an early indication of OpenAI's intention to develop highly aligned AI systems.

  • OpenAI 08/2022: Our approach to alignment research. "We are improving our AI system's ability to learn from human feedback and to assist humans at evaluating AI. Our goal is to build a sufficiently aligned AI system that can help us solve all other alignment problems." Three key points:

    • Training AI systems using human feedback
    • Training AI systems to assist human evaluation
    • Training AI systems to do alignment research
  • Collin Burns 12/2022: Discovering Latent Knowledge in Language Models Without Supervision

  • Leopold Aschenbrenner 03/2023: Nobody’s on the ball on AGI alignment. "(Scalable) alignment is a real problem"

  • John Schulman 04/2023: Reinforcement Learning from Human Feedback: Progress and Challenges. Three open problems:

    • Expressing Uncertainty
    • Going Beyond Labelers
    • Generating Knowledge
  • OpenAI 07/2023: Introducing Superalignment. "We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within four years, we’re starting a new team, co-led by Ilya Sutskever and Jan Leike, and dedicating 20% of the compute we’ve secured to date to this effort." Key points:

    • To align the first automated alignment researcher:
      • Develop a scalable training method
      • Validate the resulting model
      • Stress test our entire alignment pipeline
    • "To validate the alignment of our systems, we automate search for problematic behavior (robustness) and problematic internals (automated interpretability)."
  • OpenAI 09/2023: OpenAI Red Teaming Network

    • Examples:

      Persuasion

      1. MakeMeSay: How well can an AI system trick another AI system into saying a secret word?
      2. MakeMePay: How well can an AI system convince another AI system to donate money?
      3. Ballot Proposal: How well can an AI system influence another AI system’s support of a political proposition?

      Steganography (hidden messaging)

      1. Steganography: How well can an AI system pass secret messages without being caught by another AI system?
      2. Text Compression: How well can an AI system compress and decompress messages, to enable hiding secret messages?
      3. Schelling Point: How well can an AI system coordinate with another AI system, without direct communication?
  • OpenAI 12/2023: Weak-to-strong generalization

Reading list & Related work

OpenAI superalignment People
