Arena is a general evaluation platform and building toolkit for single/multi-agent intelligence. As a part of Arena project, this repository is to provide implementation of state-of-the-art deep single/multi-agent reinforcement learning baselines. Arena-Baselines is based on ray. Other resources can be found in Arena Home. If you use this repository to conduct research, we kindly ask that you cite the papers as references.
Arena-Baselines is built upon ray, benefiting from lots of handy features from it as well as bringing some further advantages.
10+ most popular deep single/multi-agent reinforcement learning baselines, including: independent learners, self-play, variable-shared policies (share arbitrary variables/layers), centralized critic via experiences sharing, centralized critic via observations sharing, Q-Mix, arbitrary grouping agents, centralized critic with counterfactual baselines, population-based training, and etc.
Baselines | Supported(:heavy_check_mark:) / In-progress(:heavy_minus_sign:) |
---|---|
Independent Learners | ✔️ |
Self-play | ✔️ |
Population-based Training | ✔️ |
Share Weights: between Arbitrary Agents / Teams | ✔️ |
Sharing observations: own, team_absolute, team_relative, all_absolute, all_relative | ✔️ |
Multiple sensors: first-person visual, third-person visual, vector (lidar), or any combinations | ✔️ |
Use separated observations for actor and critic (such as centralized critic and decentralized actors). | ✔️ |
30+ games built by Arena-BuildingToolkit, including Tennis, Tank, BlowBlow, FallFlat, and etc. There are also multiple variants of each game with different team setups, social structures, actions spaces, observation spaces (first-person visual, third-person visual, vector or mixed), domain randomizations, and etc. More documentation of games can be found in Arena-Benchmark
Many research-oriented utilities, such as:
Utilities | Supported(:heavy_check_mark:) / In-progress(:heavy_minus_sign:) |
---|---|
Tune (grid search parameters, schedule trails, yaml config and etc.) | ✔️ |
Interactively load and evaluate trained policies/populations | ✔️ |
Visualize performance in population | ✔️ |
We are currently open to any suggestions or pull requests from the community to make Arena better. To contribute to the project, joint us in Slack, check To Do list in Trello as well as check issues opened in the repo.
We are re-benchmarking all games, currently, all games are available in Arena-BuildingToolkit.
We are building and testing them one by one.
You can build and test it yourself if you urgently want to try some the games (we have played with the games and they are fine, it's just we have not have enough resources to train them yet).
Before you build a game and train it, you can also play with it in Arena-BuildingToolkit so that you can make sure you are satisfied with the game.
The env examples you are seeing in the arena-experiments/Arena-Benchmark.yaml
are the ones we have built and trained.
- Arena-Baselines
To install above dependencies, run: (we are using specific versions of the dependencies)
# In case you have not install these basic staff
sudo apt-get install htop python-pip ffmpeg git -y
wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
bash Anaconda3-2019.10-Linux-x86_64.sh
# My personal config for new machines, skip this is you do not understand the following command
git config --global user.email "[email protected]"
git config --global user.name "YuhangSong"
git config --global push.default "current"
git config --global pull.default "current"
git config --global credential.helper "cache --timeout=36000000000000000"
# Set conda source. Only for users behind the Great Wall of China, no need for other users
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes
# If you accidentally did above, type: vim ~/.condarc, and reset conda source by removing the first two lines. Finally, run: conda update --all
# Set pip source. Only for users behind the Great Wall of China, no need for other users.
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# If you accidentally did above, reset pip source with: pip config set global.index-url https://pypi.org/simple
# Create dir
mkdir Arena
cd Arena
# clone code
git clone https://github.com/YuhangSong/Arena-Baselines.git
cd Arena-Baselines
# Create a virtual environment
conda deactivate
conda remove --name Arena-Baselines --all -y
conda create -n Arena-Baselines python=3.6.9 -y
conda activate Arena-Baselines
# Install required packages
pip3 install -r requirements.txt
# pip may search and resolve package dependencies, be patient here...
If you are using different versions of ray or tensorflow, finish run tests to make sure it works. Open up a pull request if you find newer versions that also works fine to complete the tests.
If you run into following situations,
- You are running on a remote server without GUI (X-Server).
- Your machine have a X-Server but it does not belongs (started by) your account, so you cannot access it. (If your machine have a X-Server but it belongs (started by) your account, but you cannot use the GUI desktop all the time, see here)
If none of above is your situation, i.e., you are running things on your own desktop, skip this part and go to Usage. If you are in above situations, follow guidelines here to setup a virtual display. Or you can follow here (This is simpler and get you in place in shorter time, but could be outdated. If so, go to the above link, and consider open a pull requests to give a update of this).
First, crate a TMUX session to maintain your process (more necessary if the machine is a server you connect via SSH), and enter virtual environment
tmux new-session -s Arena
source activate Arena-Baselines
Test your installation of rllib learning environment with Pong (rllib breaks sometimes, so a test is needed before you go further).
In ./arena-experiments/Arena-Benchmark.yaml
, comment out other envs and leave PongNoFrameskip-v4
there. Then, run:
python train.py -f ./arena-experiments/Arena-Benchmark.yaml
You should see reward goes up from -21 (as shown in the following), which means you installation works fine.
Meet some problems? Open an issue.
Now, recover ./arena-experiments/Arena-Benchmark.yaml
you just modified.
Test an Arena environment (Arena could have difficulties lunching due to different reasons, so a test is needed before you go further)
python test_arena_rllib_env.py -f ./arena-experiments/Arena-Benchmark.yaml
You should see prints like following:
INFO:__main__:obs_rllib: {'agent_0': array([0.41156876, ... , 0.23122263]), 'agent_1': array([0.41156882, ... , 0.23122263])}
INFO:__main__:rewards_rllib: {'agent_0': 0.0, 'agent_1': 1.0}
INFO:__main__:dones_rllib: {'agent_0': True, 'agent_1': True, '__all__': True}
INFO:__main__:infos_rllib: {'agent_0': {'text_observation': ['', ''], 'brain_info': <mlagents.envs.brain.BrainInfo object at 0x7f179da92780>, 'shift': 1}, 'agent_1': {'text_observation': ['', ''], 'brain_info': <mlagents.envs.brain.BrainInfo object at 0x7f179da92780>, 'shift': 1}}
episode end, keep going?
Hit enter and it keeps rolling. Meet some problems? Open an issue.
Now train on Arena games (reproduce our Arena-Benchmark) with:
python train.py -f ./arena-experiments/Arena-Benchmark.yaml
The code log multiple curves (as well as figures and other formats of data) to help analysis the training process, run:
source activate Arena-Baselines && tensorboard --logdir=~/Arena-Benchmark --port=9999
and visit http://localhost:9999
for visualization with tensorboard.
If your port is blocked, use natapp to forward a port:
./natapp --authtoken 237e94b5d173a7c3
Above token is a free port used by myself, please apply a free port (or buy a faster one) yourself on natapp so that you can enjoy the best experience.
You should see the episode_len_mean
goes up from 20 (as shown in the following), which means you installation works fine.
Our established benchmark can be found in Arena-Benchmark.
Append --eval
argument.
In evaluation mode, the programe will detect the logdirs you have locally on your computer as well as detect the populations and iterations you have in each logdir.
Then a series of questions will be promoted for you to make a selection of which checkpoint(s) you want to load for each agent.
After this, you will have selected a list of checkpoint_paths
for each agent, then the programe will run each of the possible combinations, thus, you will see print like the following:
Note about agent
and policy
: agent
is the concept in environment, and policy
is the corresponding concept on the learning side. In Arena-Baselines, the mapping between the two is one-to-one, though you can share weights between policies (equals to the case where two agents is controlled by just one policy).
============================= sampling... =============================
policy_loading_status:
{ 'policy_0': { 'checkpoint_path': '...size_population=1_2020-02-07_17-01-35stfrwt58/learning_agent/p_0-i_1',
'checkpoint_path_i': 0},
'policy_1': { 'checkpoint_path': '...size_population=1_2020-02-07_17-01-35stfrwt58/learning_agent/p_0-i_2',
'checkpoint_path_i': 1}}
results:
{ 'policy_0': { 'episode_lengths_max': 24,
'episode_lengths_mean': 15.428571428571429,
'episode_lengths_min': 14,
'episode_rewards_max': 1.0,
'episode_rewards_mean': 0.71428573,
'episode_rewards_min': 0.0},
'policy_1': { 'episode_lengths_max': 24,
'episode_lengths_mean': 15.428571428571429,
'episode_lengths_min': 14,
'episode_rewards_max': 1.0,
'episode_rewards_mean': 0.2857143,
'episode_rewards_min': 0.0}}
============================= sampling... =============================
policy_loading_status:
{ 'policy_0': { 'checkpoint_path': '...size_population=1_2020-02-07_17-01-35stfrwt58/learning_agent/p_0-i_1',
'checkpoint_path_i': 0},
'policy_1': { 'checkpoint_path': '...size_population=1_2020-02-07_17-01-35stfrwt58/learning_agent/p_0-i_3',
'checkpoint_path_i': 2}}
summarization:
{ 'policy_0': { 'episode_lengths_max': 32,
'episode_lengths_mean': 17.0,
'episode_lengths_min': 14,
'episode_rewards_max': 1.0,
'episode_rewards_mean': 0.33333334,
'episode_rewards_min': 0.0},
'policy_1': { 'episode_lengths_max': 32,
'episode_lengths_mean': 17.0,
'episode_lengths_min': 14,
'episode_rewards_max': 1.0,
'episode_rewards_mean': 0.6666667,
'episode_rewards_min': 0.0}}
where policy_loading_status
tells you which checkpoint is loaded to each policy, and summarization
tells you the episode statistic of each policy when loading policies according to policy_loading_status
.
The above results will be recorded in a single matrix called result_matrix
.
result_matrix
is nested list with the shape of:
(
len(checkpoint_paths[policy_ids[0]]),
len(checkpoint_paths[policy_ids[1]]),
...,
len(policy_ids)
)
Example: value at result_matrix[a,b,c]
means
when load policy_0
with checkpoint_paths[policy_ids[0]][a]
and load policy_1
with checkpoint_paths[policy_ids[1]][b]
, the episode_rewards_mean of policy_c
.
Then, we will try to visualize result_matrix
, see vis_result_matrix
in ./arena/vis.py
.
For examle, for 2TxP games, you will get a visualization as following:
which shows how each match goes.
Specifically, policy at later iterations beat those at earlier iterations, policies at similar iterations produce a tie, and etc.
You can, of course, compare your policies against our established Arena-Benchmark.
Append --dummy
argument to the end of the command, which will try to run the experiments with minimal request of resources. This is for the convenience of debugging.
To resume/restore a training, append --resume
to the above command in Reproduce Arena-Benchmark
To register a new game, append the ID of the game to env_ids
in ./arena/__init__.py
.
Note that the ID of a game is slightly different from the file name of a game, see Common Naming Rules.
Function create_parser
in ./train.py
gives the detailed description of the configs.
Read them to understand the configs.
Note that we do not recommend passing configs via argparse, instead, use yaml file to config you experiment, as what has been done in run tests.
In this way, you can make sure your experiments are reproducable.
Sometimes, the game threads do not exit properly after you kill the python thread. Run following command to print a banch of kill commmands. Then run the printed commands to kill all the game threads.
ps aux | grep -ie Linux.x86_64 | awk '{print "kill -9 " $2}'
If you use Arena to conduct research, we ask that you cite the following paper as a reference:
@inproceedings{song2020arena,
title={Arena: A General Evaluation Platform and Building Toolkit for Multi-Agent Intelligence.},
author={Song, Yuhang and Wojcicki, Andrzej and Lukasiewicz, Thomas and Wang, Jianyi and Aryan, Abi and Xu, Zhenghua and Xu, Mai and Ding, Zihan and Wu, Lianlong},
booktitle={AAAI},
pages={7253--7260},
year={2020}
}
as well as the engine behind Arena, without which the platform would be impossible to create
@article{juliani2018unity,
title={Unity: A general platform for intelligent agents},
author={Juliani, Arthur and Berges, Vincent-Pierre and Vckay, Esh and Gao, Yuan and Henry, Hunter and Mattar, Marwan and Lange, Danny},
journal={arXiv preprint arXiv:1809.02627},
year={2018}
}
- Upgrade ml-agent and ray to recent versions
We give special thanks to the Whiteson Research Lab and Unity ML-Agents Team, with which the discussion shaped the vision of the project a lot.