MuZero

TensorFlow implementation of the MuZero algorithm, based on the pseudo-code provided in the original paper:

[1] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, D. Silver, "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model", arXiv:1911.08265.

WARNING: This code is highly experimental, badly documented and certainly buggy. Comments, questions and corrections are welcome (forks and pull requests even more so :-).

Main differences from the algorithm described in [1]

Several changes have been made to support multi-player games:

  • More flexible environment responses: after each move, all players can receive rewards (not just the player who made the move).

  • An additional head in the dynamics function predicts which player plays next, as sketched below.
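
As a rough illustration of that second point, the dynamics network gains a softmax head over players alongside the usual next-state and reward outputs. The sketch below is illustrative only; the layer sizes and names are placeholders, not this repository's actual architecture:

import tensorflow as tf
from tensorflow.keras import layers

# Illustrative only: a dynamics function with an extra "to_play" head
# predicting the next player. HIDDEN_SIZE, ACTION_SIZE and NUM_PLAYERS
# are placeholder values, not taken from this repository.
HIDDEN_SIZE, ACTION_SIZE, NUM_PLAYERS = 64, 9, 2

state_action = layers.Input(shape=(HIDDEN_SIZE + ACTION_SIZE,))
trunk = layers.Dense(HIDDEN_SIZE, activation='relu')(state_action)
next_state = layers.Dense(HIDDEN_SIZE, name='next_hidden_state')(trunk)
reward = layers.Dense(1, name='reward')(trunk)
to_play = layers.Dense(NUM_PLAYERS, activation='softmax', name='to_play')(trunk)
dynamics = tf.keras.Model(state_action, [next_state, reward, to_play])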

Additionally, a simplified UCB formula is used to reduce the number of hyperparameters.
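
One plausible reading of this simplification (the exact formula used here may differ; see the MCTS code) is a pUCT score with a single exploration constant in place of the two-constant log-schedule of [1]:

import math

def ucb_score(parent_visits, child_visits, child_value, child_prior, c=1.25):
    # Simplified pUCT: a single exploration constant c replaces the
    # (c1, c2) log-schedule from [1]. Illustrative sketch only.
    exploration = c * child_prior * math.sqrt(parent_visits) / (1 + child_visits)
    return child_value + exploration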

Training modes

Both single-thread training (synchronous mode) and multi-thread or distributed training (asynchronous mode) are supported.

Synchronous mode

In this mode a single thread alternately plays games to generate training data and uses this data to train the neural networks. This is slower but easier to set up than the asynchronous training used in the original paper. Use

python muzero.py --game GAME --synchronous --num-steps NUM_STEPS --num-games NUM_GAMES --num-eval-games NUM_EVAL_GAMES

to alternate between self-playing NUM_GAMES games and training for NUM_STEPS steps. At each checkpoint, NUM_EVAL_GAMES games are played to evaluate the network.
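
For example, a short synchronous run on tic-tac-toe might look like this (the numbers are illustrative, not tuned values):

python muzero.py --game tictactoe --synchronous --num-steps 1000 --num-games 100 --num-eval-games 20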

Asynchronous mode

In this mode self-play and training occur simultaneously, in different processes or on different nodes of a distributed network. A simple HTTP server maintains a database of self-played games and neural network weights; self-play and training agents interact with it through a simple API.

  • Use python muzero.py --game GAME --server DATA_DIR to start the server and save logs in DATA_DIR (network weights are saved in HDF5 format; self-play games are pickled). You can then go to http://localhost:5000/ to see basic server statistics.

  • Use python muzero.py --client HOST --self-play NUM_GAMES to start a self-playing agent that uses the latest network from the server HOST to generate batches of NUM_GAMES games and send them back to the server.

  • Use python muzero.py --client HOST --train NUM_EVAL_GAMES to start a training agent that queries the server at HOST for batches of training data and uses them to train the latest network. At each checkpoint, NUM_EVAL_GAMES games are played to evaluate the network.

By default, the server is only visible locally. Change api.run() to api.run(host='0.0.0.0') in storage_replay.py to make the server visible from the outside.

WARNING: the API is implemented using Flask, and deploying it as-is in production is not recommended.
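
Putting the three roles together on a single machine might look like the following; the HOST format passed to --client is assumed here to be host:port, with Flask's default port 5000:

# Terminal 1: start the server, storing weights and games under ./data
python muzero.py --game tictactoe --server data

# Terminal 2: a self-play agent generating batches of 10 games
python muzero.py --client localhost:5000 --self-play 10

# Terminal 3: a training agent, playing 20 evaluation games per checkpoint
python muzero.py --client localhost:5000 --train 20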

Currently implemented games

The following games have been implemented so far (though only preliminary experiments have been carried out with them):

  • CartPole-v0 (games/cartpole.py)
  • Tic-tac-toe (games/tictactoe.py)
  • One-arm bandit (games/onearmbandit.py)

You can run MuZero on any of these games by using the --game command-line argument with the corresponding filename, e.g. --game tictactoe.

Other features

  • TensorBoard logging: from the base directory, run tensorboard --logdir checkpoints to visualize training.

  • Easily add games: just add a file to the games directory defining MuZero's configuration for the game of your choice and implementing subclasses of the Environment, Game and Network classes. The methods you need to implement for each subclass are marked in environment.py, game.py and network.py, respectively. NOTE: it should be trivial to interface with OpenAI Gym environments; see games/cartpole.py for an example and the first sketch after this list.

  • Loss selection: you can choose between MSE (mean squared error) and CCE (categorical cross-entropy) losses for values and rewards; setting scalar_support_size in the game configuration transforms scalars into categorical representations in a manner similar to that described in [1] (see the second sketch after this list).

  • Weight and game buffer loading in asynchronous mode: you can upload network weights and self-played games directly to the server in asynchronous mode in order to resume training from a checkpoint of your choice.

  • Playing against the latest network: in asynchronous mode, a basic interface is set up through the server to play against the latest network directly in the browser.

  • Pre-built networks: basic residual and fully connected architectures are defined in network.py, and can be reused by simply defining the network parameters (see the examples in the games directory).
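
As an illustration of the subclassing flow for a Gym-backed game, here is a minimal skeleton. The method names below are illustrative guesses, not this repository's actual API; the abstract methods that really need overriding are marked in environment.py:

import gym

# Illustrative skeleton of a Gym-backed environment; in the real code this
# would subclass Environment from environment.py, whose marked abstract
# methods (not necessarily those below) must be implemented.
class CartPoleEnvironment:
    def __init__(self):
        self.env = gym.make('CartPole-v0')
        self.done = False

    def reset(self):
        return self.env.reset()

    def step(self, action):
        observation, reward, self.done, _ = self.env.step(action)
        return observation, reward

    def legal_actions(self):
        return [] if self.done else list(range(self.env.action_space.n))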
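
For the categorical loss option, here is a minimal sketch of a scalar-to-categorical encoding: a plain two-hot distribution over an integer support, in the spirit of the appendix of [1]. The repository's exact transform may differ (for instance, [1] also applies an invertible rescaling to the scalar before encoding):

import numpy as np

def scalar_to_support(x, support_size):
    # Two-hot encoding of a scalar over the integer support
    # [-support_size, support_size]. Illustrative sketch only.
    x = float(np.clip(x, -support_size, support_size))
    low = int(np.floor(x))
    prob_high = x - low
    vec = np.zeros(2 * support_size + 1)
    vec[low + support_size] += 1.0 - prob_high
    if prob_high > 0.0:
        vec[low + support_size + 1] += prob_high
    return vec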

To-do list

  • Prioritized replay
  • Optimize hyperparameters for some of the included games
  • Pre-trained networks
