Stars
A library for mechanistic interpretability of GPT-style language models
Locating and editing factual associations in GPT (NeurIPS 2022)
Utility for behavioral and representational analyses of Language Models
Stanford NLP Python library for Representation Finetuning (ReFT)
PAIR.withgoogle.com and friends' work on interpretability methods
Simple retrieval from LLMs at various context lengths to measure accuracy
✨ Fast Coreference Resolution in spaCy with Neural Networks
Public repo with code and dataset for the Textual Time Travel project
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
The nnsight package enables interpreting and manipulating the internals of deep learning models.
Interpretability for sequence generation models 🐛 🔍
Public repository for "Think Twice: Perspective-Taking Improves Large Language Models’ Theory-of-Mind Capabilities".
Stanford Open Information Extraction made simple!
Using sparse coding to find distributed representations used by neural networks.
[ICML 2024] Language Models Represent Beliefs of Self and Others
[ACL 2024] An Easy-to-use Knowledge Editing Framework for LLMs.
Function Vectors in Large Language Models (ICLR 2024)
A unified interface for computing surprisal (log probabilities) from language models! Supports neural, symbolic, and black-box API models.
Code for the paper "Neural Metaphor Detection in Context".
Probing and Generalization of Metaphorical Knowledge in Pre-Trained Language Models [ACL 2022]
Machine Theory of Mind reading list, built upon the EMNLP Findings 2023 paper: Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models
Inspecting and Editing Knowledge Representations in Language Models
Implements pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF) to train and fine-tune the LLaMA2 model to follow human instructions, similar to InstructGPT
A repo for distributed training of language models with Reinforcement Learning from Human Feedback (RLHF)
Evaluating the Moral Beliefs Encoded in LLMs
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP https://textattack.readthedocs.io/en/master/