chess_llm_interpretability

This repo evaluates LLMs trained on PGN-format chess games using linear probes. We can check the LLM's internal understanding of board state and its ability to estimate the skill level of the players involved. We can also perform interventions on the model's internal board state by deleting pieces from its internal world model.

This repo can train, evaluate, and visualize linear probes on LLMs that have been trained to play chess with PGN strings. For example, we can visualize where the model "thinks" the white pawns are. On the left, we have the actual white pawn locations. In the middle, we clip the probe outputs to turn the heatmap into a more binary visualization. On the right, we have the full gradient of model beliefs.
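To make the clipping step described above concrete, here is a minimal sketch in plain Python. The `clip_heatmap` helper, the clipping bounds, and the example scores are all hypothetical; the repo's own plotting code lives in probe_output_visualization.ipynb and may work differently.

```python
def clip_heatmap(scores, low=-2.0, high=2.0):
    """Clamp each probe score into [low, high], then rescale to [0, 1].

    Scores far from zero saturate at 0 or 1, so the heatmap looks
    closer to a binary "piece / no piece" picture.
    """
    span = high - low
    return [[(min(max(s, low), high) - low) / span for s in row] for row in scores]


# Hypothetical probe scores for "white pawn" on a small 2x4 slice of the board.
raw = [
    [9.5, 0.0, -7.0, 1.0],
    [-12.0, 3.0, 2.0, -1.0],
]
clipped = clip_heatmap(raw)
for row in clipped:
    print(row)
# [1.0, 0.5, 0.0, 0.75]
# [0.0, 1.0, 1.0, 0.25]
```

Narrower bounds push more squares to the extremes; wider bounds keep more of the gradient seen in the unclipped view.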

I trained linear probes to measure both the model's ability to compute board state and its ability to estimate player Elo as it predicts the next character. Here we can see a per-layer graph of board state and Elo classification accuracy across a range of LLMs.

For more information, refer to this post.
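For readers new to linear probes, the idea can be sketched with a toy example: fit a linear classifier on cached activations and check how well it predicts a label on held-out data. Everything below is an assumption made for illustration: the activations are synthetic, the label is binary, and the helper names are hypothetical. The actual probes are trained by train_test_chess.py on real model activations and are more involved.

```python
import math
import random


def sigmoid(z):
    # Clamp the logit so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(acts, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(acts[0]), len(acts)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(acts, labels):
            # err is the gradient of the cross-entropy loss w.r.t. the logit.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, acts, labels):
    hits = 0
    for x, y in zip(acts, labels):
        logit = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((logit > 0) == (y == 1))
    return hits / len(labels)


def make_data(n, direction, rng):
    """Synthetic stand-in for cached activations: Gaussian noise plus a
    shift along `direction` whose sign encodes the binary label."""
    acts, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        acts.append([rng.gauss(0.0, 1.0) + 2.0 * sign * d for d in direction])
        labels.append(y)
    return acts, labels


rng = random.Random(0)
dim = 8
raw_dir = [rng.gauss(0.0, 1.0) for _ in range(dim)]
norm = math.sqrt(sum(v * v for v in raw_dir))
direction = [v / norm for v in raw_dir]  # hidden unit-norm feature direction

train_acts, train_labels = make_data(300, direction, rng)
test_acts, test_labels = make_data(200, direction, rng)
w, b = train_probe(train_acts, train_labels)
test_acc = accuracy(w, b, test_acts, test_labels)
print(f"held-out probe accuracy: {test_acc:.2f}")
```

High held-out accuracy suggests the label is linearly decodable from the activations. Repeating this at every layer is what yields a per-layer accuracy curve.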

Setup

Create a Python environment with Python 3.10 or 3.11 (I'm using 3.11).

pip install -r requirements.txt
python model_setup.py

Then click "Run All" on lichess_data_filtering.ipynb. To visualize probe outputs, check out probe_output_visualization.ipynb.

To train a linear probe or test a saved probe on the test set, set these two variables at the bottom of train_test_chess.py:

RUN_TEST_SET = True
USE_PIECE_BOARD_STATE = True

Then run python train_test_chess.py.

Interventions

To perform board state interventions, first train a set of 8 (one per layer) board state probes using train_test_chess.py. Then run python board_state_interventions.py. It will record JSON results in intervention_logs/.
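As a rough illustration of what "deleting a piece" at the activation level can look like, the sketch below removes the component of an activation vector that lies along a probe direction, once per layer. This is projection removal, chosen here for simplicity. It is an assumption and not necessarily the update rule used in board_state_interventions.py, and the vectors and the `remove_direction` helper are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def remove_direction(activation, probe_direction, scale=1.0):
    """Subtract `scale` times the projection of `activation` onto `probe_direction`.

    With scale=1.0 the result has no component left along the probe direction.
    """
    norm_sq = dot(probe_direction, probe_direction)
    coeff = scale * dot(activation, probe_direction) / norm_sq
    return [a - coeff * d for a, d in zip(activation, probe_direction)]


# Hypothetical data: one activation vector and one probe direction per layer.
n_layers = 8
activations = [[1.0, 2.0, 3.0, 4.0] for _ in range(n_layers)]
probe_directions = [[0.0, 1.0, 0.0, 1.0] for _ in range(n_layers)]

patched = [remove_direction(a, d) for a, d in zip(activations, probe_directions)]
print(patched[0])  # [1.0, -1.0, 3.0, 1.0]
```

After the edit, a probe reading along that direction sees nothing, which is the sense in which the piece has been removed from the model's internal board.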

To perform skill interventions, you can either train a set of 8 skill probes using train_test_chess.py or generate a set of 8 contrastive activations using caa.py. Note that contrastive activations tend to work a little better. If you want to use probe-derived interventions, use this notebook to create activation files from the probes: utils/create_skill_intervention_from_skill_probe.ipynb.

Then, follow these directions to use them to perform skill interventions: https://github.com/adamkarvonen/chess_gpt_eval/tree/master/nanogpt
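For intuition, a contrastive activation is commonly computed as the difference between the mean activation on one class of inputs and the mean on another, and is then added to the model's activations with some scale. The sketch below assumes that formulation with "high skill" and "low skill" groups; the helper names and numbers are hypothetical, and caa.py may differ in its details.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def contrastive_activation(high_acts, low_acts):
    """Difference of class means: mean(high) - mean(low)."""
    high_mean = mean_vector(high_acts)
    low_mean = mean_vector(low_acts)
    return [h - l for h, l in zip(high_mean, low_mean)]


def steer(activation, steering_vector, scale):
    """Add the scaled steering vector to an activation."""
    return [a + scale * s for a, s in zip(activation, steering_vector)]


# Hypothetical activations collected at one layer for two groups of games.
high_skill = [[2.0, 0.0, 1.0], [4.0, 2.0, 1.0]]
low_skill = [[0.0, 0.0, 1.0], [2.0, -2.0, 1.0]]

vec = contrastive_activation(high_skill, low_skill)
print(vec)  # [2.0, 2.0, 0.0]

steered = steer([1.0, 1.0, 1.0], vec, scale=0.5)
print(steered)  # [2.0, 2.0, 1.0]
```

A positive scale pushes activations toward the "high skill" mean and a negative scale pushes them away. One such vector per layer gives the set of 8 mentioned above.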

Useful links

All code, models, and datasets are open source.

To play the nanoGPT model against Stockfish, please visit: https://github.com/adamkarvonen/chess_gpt_eval/tree/master/nanogpt

To train a Chess-GPT from scratch, please visit: https://github.com/adamkarvonen/nanoGPT

All pretrained models are available here: https://huggingface.co/adamkarvonen/chess_llms

All datasets are available here: https://huggingface.co/datasets/adamkarvonen/chess_games

Wandb training loss curves and model configs can be viewed here: https://api.wandb.ai/links/adam-karvonen/u783xspb

References

Much of my linear probing was developed using Neel Nanda's linear probing code as a reference. Here are the main references I used:

https://colab.research.google.com/github/neelnanda-io/TransformerLens/blob/main/demos/Othello_GPT.ipynb
https://colab.research.google.com/github/likenneth/othello_world/blob/master/Othello_GPT_Circuits.ipynb
https://www.neelnanda.io/mechanistic-interpretability/othello
https://github.com/likenneth/othello_world/tree/master/mechanistic_interpretability
