BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

BALROG is a novel benchmark evaluating agentic LLM and VLM capabilities on long-horizon interactive tasks using reinforcement learning environments. Check out how current models fare on our leaderboard. You can read more about BALROG in our paper.

Features

  • Comprehensive evaluation of agentic abilities
  • Support for both language and vision-language models
  • Integration with popular AI APIs and local deployment
  • Easy integration for custom agents, new environments and new models
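
For a sense of what a custom agent might look like, here is a minimal sketch. The class name, constructor arguments, and act() signature below are illustrative assumptions rather than BALROG's actual interface; see the agents module in the repository for the real base class.

import random

class RandomAgent:
    """Hypothetical custom agent that returns a random candidate action.

    Everything here is a sketch: real BALROG agents wrap an LLM client
    and build prompts from the episode history. This one ignores both
    and acts uniformly at random, just to show the shape of an agent.
    """

    def __init__(self, client_factory=None, prompt_builder=None):
        # A real agent would create an LLM client via client_factory;
        # this sketch only needs a seeded RNG.
        self.rng = random.Random(0)

    def act(self, obs, candidate_actions):
        # Pick one of the actions the environment currently allows.
        return self.rng.choice(candidate_actions)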

Installation

We recommend using conda for the installation:

conda create -n balrog python=3.10 -y
conda activate balrog

git clone https://github.com/balrog-ai/BALROG.git
cd BALROG
pip install -e .
balrog-post-install
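
As a quick sanity check after installation (assuming the editable install exposes an importable balrog package, which is not stated in the source), you can run:

python -c "import balrog; print(balrog.__file__)"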

Docker

We provide Docker images; please see the relevant README for details.

⚡️ Evaluate using vLLM locally

We support running LLMs/VLMs locally using vLLM. You can spin up a vLLM server and evaluate your agent on BALROG as follows:

pip install vllm numpy==1.23
vllm serve meta-llama/Llama-3.2-1B-Instruct --port 8080

python eval.py \
  agent.type=naive \
  agent.max_image_history=0 \
  agent.max_history=16 \
  eval.num_workers=32 \
  client.client_name=vllm \
  client.model_id=meta-llama/Llama-3.2-1B-Instruct \
  client.base_url=http://0.0.0.0:8080/v1

Check out the vLLM documentation for more options on serving models quickly and efficiently.
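
The command above runs in language-only mode (agent.max_image_history=0). As a hedged variant, assuming the same flag controls how many recent frames a vision-language model receives, a VLM evaluation might look like this (the model ID is illustrative, not a tested configuration):

vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --port 8080

python eval.py \
  agent.type=naive \
  agent.max_image_history=1 \
  agent.max_history=16 \
  eval.num_workers=32 \
  client.client_name=vllm \
  client.model_id=meta-llama/Llama-3.2-11B-Vision-Instruct \
  client.base_url=http://0.0.0.0:8080/v1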

🛜 Evaluate using popular APIs

We support out-of-the-box clients for the OpenAI, Anthropic, and Google Gemini APIs. First, set the API key for the provider you want to use:

export OPENAI_API_KEY=<KEY>
export ANTHROPIC_API_KEY=<KEY>
export GEMINI_API_KEY=<KEY>

Then run the evaluation with:

python eval.py \
  agent.type=naive \
  agent.max_image_history=0 \
  eval.num_workers=64 \
  client.client_name=openai \
  client.model_id=gpt-4o-mini-2024-07-18
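
Switching providers should only require changing the client settings. The client name and model ID below are illustrative assumptions (check the repository's client configuration for the exact names):

python eval.py \
  agent.type=naive \
  agent.max_image_history=0 \
  eval.num_workers=64 \
  client.client_name=claude \
  client.model_id=claude-3-5-sonnet-20241022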

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use BALROG in any of your work, please cite:

@article{paglieri2024balrog,
  title={Benchmarking Agentic LLM and VLM Reasoning On Games},
  author={Paglieri, Davide and Cupia{\l}, Bart{\l}omiej and Coward, Sam and Piterbarg, Ulyana and Wo{\l}czyk, Maciej and Khan, Akbir and Pignatelli, Eduardo and Kuci{\'n}ski, {\L}ukasz and Pinto, Lerrel and Fergus, Rob and Foerster, Jakob Nicolaus and Parker-Holder, Jack and Rockt{\"a}schel, Tim},
  journal={arXiv preprint arXiv:2411.13543},
  year={2024}
}
