EN|中文
Feeling confused about super alignment? Start from here OpenAI Introducing Superalignment Superalignment Fast Grants
-
OpenAI 01/2022 :Aligning language models to follow instructions The statement "Further, in many cases aligning to the average labeler preference may not be desirable" from the limitations section of the article could be interpreted as an early indication of OpenAI's intention to develop highly aligned AI systems.
-
OpenAI 08/2022 Our approach to alignment research "We are improving our Al system's ability to learn from human feedback and to assist humans at evaluating Al. Our goal is to build a sufficiently aligned Al system that can help us solve all other alignment problems." There keynotes:
- Training AI systems using human feedback
- Training AI systems to assist human evaluation
- Training AI systems to do alignment research
-
Collin Burns 12/2022 Discovering Latent Knowledge in Language Models Without Supervision
-
Leopold Aschenbrenner 03/2023 Nobody’s on the ball on AGI alignment "(Scalable) alignment is a real problem"
-
John Schulman 04/2023 Reinforcement Learning from Human Feedback: Progress and Challenges Three open problems:
- Expressing Uncertainty
- Going Beyond Labelers
- Generating Knowledge
-
OpenAI 07/2023 Introducing Superalignment "We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within four years, we’re starting a new team, co-led by Ilya Sutskever and Jan Leike, and dedicating 20% of the compute we’ve secured to date to this effort." Keynotes:
- To align the first automated alignment researcher:
- Develop a scalable training method
- validate the resulting model
- stress test our entire alignment pipeline
- "To validate the alignment of our systems, we automate search for problematic behavior (robustness) and problematic internals (automated interpretability)."
- To align the first automated alignment researcher:
-
OpenAI 09/2023 OpenAI Red Teaming Network
-
examples:
Persuasion
- MakeMeSay: How well can an AI system trick another AI system into saying a secret word?
- MakeMePay: How well can an AI system convince another AI system to donate money?
- Ballot Proposal: How well can an AI system influence another AI system’s support of a political proposition?
Steganography (hidden messaging)
- Steganography: How well can an AI system pass secret messages without being caught by another AI system?
- Text Compression: How well can an AI system compress and decompress messages, to enable hiding secret messages?
- Schelling Point: How well can an AI system coordinate with another AI system, without direct communication?
-
-
OpenAI 12/2023 Weak-to-strong generalization
- Red Teaming Language Models with Language Models
- Language models can explain neurons in language models
- https://github.com/openai/evals/tree/main
- Let's Verify Step by Step
- An Interpretability Illusion for BERT
- Self-critiquing models for assisting human evaluators
- Discovering Latent Knowledge in Language Models Without Supervision
- Planning for AGI and beyond
- AI-written critiques help humans notice flaws
- The Coming Wave
- Adversarial Attacks on LLMs
- Weak-to-strong generalization
- LLM Powered Autonomous Agents
- Ilya Sutskever
- Jan Leike
- Harri Edwards
- Yuri Burda
- Adrien Ecoffet
- Nat McAleese: Superalignment by models helping humans help models help humans at OpenAI.
- Leopold Aschenbrenner: Nobody’s on the ball on AGI alignment
- Collin Burns: Discovering Latent Knowledge in Language Models Without Supervision
- Bowen Baker: multi-agent reinforcement learning
- Pavel Izmailov