This curriculum helps new Elicit employees build the machine learning background they need, with a focus on language models. I’ve tried to strike a balance between papers that are relevant for deploying ML in production and papers on techniques that matter for longer-term scalability.
If you don’t work at Elicit yet, we’re hiring ML and software engineers.
Recommended reading order:
- Read “Tier 1” for all topics
- Read “Tier 2” for all topics
- Then “Tier 3” and “Tier 4+” for all topics
Items marked ✨ were added after 2024/4/1.
Topics covered:
- Fundamentals
- Reasoning and runtime strategies
- Applications
- ML in practice
- Advanced topics
- The big picture
- Maintainer
Fundamentals
Introduction to machine learning
Tier 1
- A short introduction to machine learning
- But what is a neural network?
- Gradient descent, how neural networks learn
Tier 2
- ✨ An intuitive understanding of backpropagation
- What is backpropagation really doing?
- An introduction to deep reinforcement learning
Tier 3
- The spelled-out intro to neural networks and backpropagation: building micrograd
- Backpropagation calculus
Transformers
Tier 1
- ✨ But what is a GPT? Visual intro to transformers
- ✨ Attention in transformers, visually explained
- ✨ Attention? Attention!
- The Illustrated Transformer
- The Illustrated GPT-2 (Visualizing Transformer Language Models)
Tier 2
- ✨ Neural Machine Translation by Jointly Learning to Align and Translate
- The Annotated Transformer
- Attention Is All You Need
- A Practical Survey on Faster and Lighter Transformers
Tier 3
- TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
- A Mathematical Framework for Transformer Circuits
Tier 4+
Key language models
Tier 1
- Language Models are Unsupervised Multitask Learners (GPT-2)
- Language Models are Few-Shot Learners (GPT-3)
Tier 2
- ✨ LLaMA: Open and Efficient Foundation Language Models (LLaMA)
- ✨ Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Mamba)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
- Evaluating Large Language Models Trained on Code (OpenAI Codex)
- Training language models to follow instructions with human feedback (OpenAI Instruct)
Tier 3
- ✨ Mistral 7B (Mistral)
- ✨ Mixtral of Experts (Mixtral)
- ✨ Gemini: A Family of Highly Capable Multimodal Models (Gemini)
- ✨ Textbooks Are All You Need II: phi-1.5 technical report (phi 1.5)
- Scaling Instruction-Finetuned Language Models (Flan)
Tier 4+
- ✨ Consistency Models
- ✨ Model Card and Evaluations for Claude Models (Claude 2)
- ✨ OLMo: Accelerating the Science of Language Models
- ✨ PaLM 2 Technical Report (PaLM 2)
- ✨ Visual Instruction Tuning (LLaVA)
- A General Language Assistant as a Laboratory for Alignment
- Finetuned Language Models Are Zero-Shot Learners (Google Instruct)
- Galactica: A Large Language Model for Science
- LaMDA: Language Models for Dialog Applications (Google Dialog)
- OPT: Open Pre-trained Transformer Language Models (Meta GPT-3)
- PaLM: Scaling Language Modeling with Pathways (PaLM)
- Program Synthesis with Large Language Models (Google Codex)
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher (Gopher)
- Solving Quantitative Reasoning Problems with Language Models (Minerva)
- UL2: Unifying Language Learning Paradigms (UL2)
Training and fine-tuning
Tier 2
- ✨ Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
- Learning to summarize with human feedback
- Training Verifiers to Solve Math Word Problems
Tier 3
- ✨ Pretraining Language Models with Human Preferences
- ✨ Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
- LoRA: Low-Rank Adaptation of Large Language Models
- Unsupervised Neural Machine Translation with Generative Language Models Only
Tier 4+
- ✨ Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
- ✨ Improving Code Generation by Training with Natural Language Feedback
- ✨ Language Modeling Is Compression
- ✨ LIMA: Less Is More for Alignment
- ✨ Learning to Compress Prompts with Gist Tokens
- ✨ Lost in the Middle: How Language Models Use Long Contexts
- ✨ QLoRA: Efficient Finetuning of Quantized LLMs
- ✨ Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
- ✨ Reinforced Self-Training (ReST) for Language Modeling
- ✨ Solving olympiad geometry without human demonstrations
- ✨ Tell, don't show: Declarative facts influence how LLMs generalize
- ✨ Textbooks Are All You Need
- ✨ TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
- ✨ Training Language Models with Language Feedback at Scale
- ✨ Turing Complete Transformers: Two Transformers Are More Powerful Than One
- ByT5: Towards a token-free future with pre-trained byte-to-byte models
- Data Distributional Properties Drive Emergent In-Context Learning in Transformers
- Diffusion-LM Improves Controllable Text Generation
- ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation
- Efficient Training of Language Models to Fill in the Middle
- Efficiently Modeling Long Sequences with Structured State Spaces
- ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning
- Prefix-Tuning: Optimizing Continuous Prompts for Generation
- Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
- True Few-Shot Learning with Prompts -- A Real-World Perspective
Reasoning and runtime strategies
In-context reasoning
Tier 2
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Large Language Models are Zero-Shot Reasoners (Let's think step by step)
- Self-Consistency Improves Chain of Thought Reasoning in Language Models
Tier 3
- ✨ Chain-of-Thought Reasoning Without Prompting
- ✨ Why think step-by-step? Reasoning emerges from the locality of experience
Tier 4+
- ✨ Baldur: Whole-Proof Generation and Repair with Large Language Models
- ✨ Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
- ✨ Certified Reasoning with Language Models
- ✨ Hypothesis Search: Inductive Reasoning with Language Models
- ✨ LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and the Importance of Object-based Representations
- ✨ Large Language Models Cannot Self-Correct Reasoning Yet
- ✨ Stream of Search (SoS): Learning to Search in Language
- ✨ Training Chain-of-Thought via Latent-Variable Inference
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
- Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right
Task decomposition
Tier 1
Tier 2
- ✨ Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- Factored cognition
- Iterated Distillation and Amplification
- Recursively Summarizing Books with Human Feedback
- Solving math word problems with process-based and outcome-based feedback
Tier 3
- ✨ Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers
- Faithful Reasoning Using Large Language Models
- Humans consulting HCH
- Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes
- Language Model Cascades
Tier 4+
- ✨ Decontextualization: Making Sentences Stand-Alone
- ✨ Factored Cognition Primer
- ✨ Graph of Thoughts: Solving Elaborate Problems with Large Language Models
- ✨ Parsel: A Unified Natural Language Framework for Algorithmic Reasoning
- AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts
- Challenging BIG-Bench tasks and whether chain-of-thought can solve them
- Evaluating Arguments One Step at a Time
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
- Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations
- Measuring and narrowing the compositionality gap in language models
- PAL: Program-aided Language Models
- ReAct: Synergizing Reasoning and Acting in Language Models
- Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning
- Show Your Work: Scratchpads for Intermediate Computation with Language Models
- Summ^N: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents
- ThinkSum: Probabilistic reasoning over sets using large language models
Debate
Tier 2
Tier 3
- ✨ Debate Helps Supervise Unreliable Experts
- Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions
Tier 4+
Tool use and scaffolding
Tier 2
- ✨ Measuring the impact of post-training enhancements
- WebGPT: Browser-assisted question-answering with human feedback
Tier 3
- ✨ AI capabilities can be significantly improved without expensive retraining
- ✨ Automated Statistical Model Discovery with Language Models
Tier 4+
- ✨ DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
- ✨ Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
- ✨ Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation
- ✨ Voyager: An Open-Ended Embodied Agent with Large Language Models
- ReGAL: Refactoring Programs to Discover Generalizable Abstractions
Truthfulness and epistemics
Tier 2
Tier 3
- ✨ What Evidence Do Language Models Find Convincing?
- ✨ How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Tier 4+
Applications
Science
Tier 3
- ✨ Can large language models provide useful feedback on research papers? A large-scale empirical analysis
- ✨ Large Language Models Encode Clinical Knowledge
- ✨ The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4
- A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
Tier 4+
- ✨ Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
- ✨ Nougat: Neural Optical Understanding for Academic Documents
- ✨ Scim: Intelligent Skimming Support for Scientific Papers
- ✨ SynerGPT: In-Context Learning for Personalized Drug Synergy Prediction and Drug Design
- ✨ Towards Accurate Differential Diagnosis with Large Language Models
- ✨ Towards a Benchmark for Scientific Understanding in Humans and Machines
- A Search Engine for Discovery of Scientific Challenges and Directions
- A full systematic review was completed in 2 weeks using automation tools: a case study
- Fact or Fiction: Verifying Scientific Claims
- Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles
- PEER: A Collaborative Language Model
- PubMedQA: A Dataset for Biomedical Research Question Answering
- SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts
- SciTail: A Textual Entailment Dataset from Science Question Answering
Forecasting
Tier 3
- ✨ AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy
- ✨ Approaching Human-Level Forecasting with Language Models
- ✨ Are Transformers Effective for Time Series Forecasting?
- Forecasting Future World Events with Neural Networks
Search and ranking
Tier 2
- Learning Dense Representations of Phrases at Scale
- Text and Code Embeddings by Contrastive Pre-Training (OpenAI embeddings)
Tier 3
- ✨ Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
- Not All Vector Databases Are Made Equal
- REALM: Retrieval-Augmented Language Model Pre-Training
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Task-aware Retrieval with Instructions
Tier 4+
- ✨ RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!
- ✨ Some Common Mistakes In IR Evaluation, And How They Can Be Avoided
- Boosting Search Engines with Interactive Agents
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
- Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking
- UnifiedQA: Crossing Format Boundaries With a Single QA System
ML in practice
ML in production
Tier 1
- Machine Learning in Python: Main developments and technology trends in data science, machine learning, and AI
- Machine Learning: The High Interest Credit Card of Technical Debt
Tier 2
Benchmarks
Tier 2
- ✨ GPQA: A Graduate-Level Google-Proof Q&A Benchmark
- ✨ SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- TruthfulQA: Measuring How Models Mimic Human Falsehoods
Tier 3
- FLEX: Unifying Evaluation for Few-Shot NLP
- Holistic Evaluation of Language Models (HELM)
- Measuring Massive Multitask Language Understanding
- RAFT: A Real-World Few-Shot Text Classification Benchmark
- True Few-Shot Learning with Language Models
Tier 4+
- ✨ GAIA: a benchmark for General AI Assistants
- ConditionalQA: A Complex Reading Comprehension Dataset with Conditional Answers
- Measuring Mathematical Problem Solving With the MATH Dataset
- QuALITY: Question Answering with Long Input Texts, Yes!
- SCROLLS: Standardized CompaRison Over Long Language Sequences
- What Will it Take to Fix Benchmarking in Natural Language Understanding?
Datasets
Tier 2
Tier 3
- Dialog Inpainting: Turning Documents into Dialogs
- MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
- Microsoft Academic Graph
- TLDR9+: A Large Scale Resource for Extreme Summarization of Social Media Posts
Advanced topics
World models
Tier 3
- ✨ Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
- ✨ From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
- Language Models Represent Space and Time
Tier 4+
- ✨ Amortizing intractable inference in large language models
- ✨ CLADDER: Assessing Causal Reasoning in Language Models
- ✨ Causal Bayesian Optimization
- ✨ Causal Reasoning and Large Language Models: Opening a New Frontier for Causality
- ✨ Generative Agents: Interactive Simulacra of Human Behavior
- ✨ Passive learning of active causal strategies in agents and language models
Uncertainty and active learning
Tier 2
- ✨ Experts Don't Cheat: Learning What You Don't Know By Predicting Pairs
- A Simple Baseline for Bayesian Uncertainty in Deep Learning
- Plex: Towards Reliability using Pretrained Large Model Extensions
Tier 3
- ✨ Active Preference Inference using Language Models and Probabilistic Reasoning
- ✨ Eliciting Human Preferences with Language Models
- Active Learning by Acquiring Contrastive Examples
- Describing Differences between Text Distributions with Natural Language
- Teaching Models to Express Their Uncertainty in Words
Tier 4+
Interpretability
Tier 2
Tier 3
- ✨ Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
- ✨ Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
- ✨ Representation Engineering: A Top-Down Approach to AI Transparency
- ✨ Studying Large Language Model Generalization with Influence Functions
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Tier 4+
- ✨ Codebook Features: Sparse and Discrete Interpretability for Neural Networks
- ✨ Eliciting Latent Predictions from Transformers with the Tuned Lens
- ✨ How do Language Models Bind Entities in Context?
- ✨ Opening the AI black box: program synthesis via mechanistic interpretability
- ✨ Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
- ✨ Uncovering mesa-optimization algorithms in Transformers
- Fast Model Editing at Scale
- Git Re-Basin: Merging Models modulo Permutation Symmetries
- Locating and Editing Factual Associations in GPT
- Mass-Editing Memory in a Transformer
Reinforcement learning
Tier 2
- ✨ Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- ✨ Reflexion: Language Agents with Verbal Reinforcement Learning
- Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (AlphaZero)
- Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero)
Tier 3
- ✨ Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
- AlphaStar: mastering the real-time strategy game StarCraft II
- Decision Transformer: Reinforcement Learning via Sequence Modeling
- Mastering Atari Games with Limited Data (EfficientZero)
- Mastering Stratego, the classic game of imperfect information (DeepNash)
Tier 4+
- ✨ AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning
- ✨ Bayesian Reinforcement Learning with Limited Cognitive Load
- ✨ Contrastive Preference Learning: Learning from Human Feedback without RL
- ✨ Grandmaster-Level Chess Without Search
- A data-driven approach for learning to control computers
- Acquisition of Chess Knowledge in AlphaZero
- Player of Games
- Retrieval-Augmented Reinforcement Learning
Scaling laws
Tier 1
Tier 2
- AI and compute
- Scaling Laws for Transfer
- Training Compute-Optimal Large Language Models (Chinchilla)
Tier 3
- Emergent Abilities of Large Language Models
- Transcending Scaling Laws with 0.1% Extra Compute (U-PaLM)
Tier 4+
- ✨ Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
- ✨ Power Law Trends in Speedrunning and Machine Learning
- ✨ Scaling laws for single-agent reinforcement learning
- Beyond neural scaling laws: beating power law scaling via data pruning
- Scaling Scaling Laws with Board Games
The big picture
AI risk and alignment
Tier 1
- Three impacts of machine intelligence
- What failure looks like
- Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover
Tier 2
- ✨ An Overview of Catastrophic AI Risks
- Clarifying “What failure looks like” (part 1)
- Deep RL from human preferences
- The alignment problem from a deep learning perspective
Tier 3
- ✨ Scheming AIs: Will AIs fake alignment during training in order to get power?
- Measuring Progress on Scalable Oversight for Large Language Models
- Risks from Learned Optimization in Advanced Machine Learning Systems
- Scalable agent alignment via reward modelling
Tier 4+
- ✨ AI Deception: A Survey of Examples, Risks, and Potential Solutions
- ✨ Benchmarks for Detecting Measurement Tampering
- ✨ Chess as a Testing Grounds for the Oracle Approach to AI Safety
- ✨ Close the Gates to an Inhuman Future: How and why we should choose to not develop superhuman general-purpose artificial intelligence
- ✨ Model evaluation for extreme risks
- ✨ Responsible Reporting for Frontier AI Development
- ✨ Safety Cases: How to Justify the Safety of Advanced AI Systems
- ✨ Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- ✨ Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure
- ✨ Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
- ✨ Tools for Verifying Neural Models' Training Data
- ✨ Towards a Cautious Scientist AI with Convergent Safety Bounds
- Alignment of Language Agents
- Eliciting Latent Knowledge
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
- Red Teaming Language Models with Language Models
- Unsolved Problems in ML Safety
Economic and societal impacts
Tier 3
- ✨ Explosive growth from AI automation: A review of the arguments
- ✨ Language Models Can Reduce Asymmetry in Information Markets
Tier 4+
- ✨ Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero
- ✨ Foundation Models and Fair Use
- ✨ GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models
- ✨ Levels of AGI: Operationalizing Progress on the Path to AGI
- ✨ Opportunities and Risks of LLMs for Scalable Deliberation with Polis
- On the Opportunities and Risks of Foundation Models