Hobby Project during my visa-related forced unpaid leave in spring 2020!! Refactored in Spring 2022
- AlphaZero paper: https://arxiv.org/abs/1712.01815
Train an agent to play Connect 4 using a strategy similar to AlphaGo Zero.
Instead of creating a brute-force minimax strategy (exhaustive tree search + feature engineering), we adopt a less resource-intensive approach: truncated tree search.
graph TD
A(Evaluator) -->|powers MCTS prioritization| B(MCTS)
A -->|train next gen Evaluator| A
B -->|improves policy| C(AlphaZero Agent)
C -->|self-play| C
C -->|generate records| D(Game Records)
D -->|provides training data| A
This strategy consists of:
- an evaluator predicting, for each possible next move, the probability that it is the best action, as well as the overall probability of winning for the next player. This evaluator is based on a classic CNN architecture (a lightweight ResNet with ~350k parameters) with 2 heads: a policy head with a softmax activation and a state value head with a sigmoid activation (a sketch of such a network is shown after this list).
- a version of Monte-Carlo Tree Search that leverages this evaluator to prioritize the regions of the tree to explore further, and returns an improved policy compared to the original evaluator.
- The evaluator is iteratively trained following batches of self-played games.
- Training data is gathered for each state by using:
- the improved policy output from MCTS for the policy head and
- the final outcome of the game (-1,0,1) for the value head.
- Note: duplicate examples of the same state appearing in multiple games are consolidated by averaging the final outcomes and the improved policies.
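For illustration, a two-headed evaluator of this kind can be sketched in Keras as below. This is a hedged sketch rather than the exact model shipped in this repo: the input encoding (two 6x7 planes), the number of residual blocks and filters, and the compile losses are all assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers


def residual_block(x, filters):
    """Two 3x3 convolutions with a skip connection (minimal residual block)."""
    skip = x
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.Add()([skip, x])
    return layers.ReLU()(x)


def build_evaluator(n_blocks=4, filters=64):
    # Assumed input encoding: 6x7 board with 2 planes (current player's discs, opponent's discs).
    board = layers.Input(shape=(6, 7, 2))
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(board)
    for _ in range(n_blocks):
        x = residual_block(x, filters)

    # Policy head: a probability for each of the 7 columns (softmax).
    p = layers.Conv2D(2, 1, activation="relu")(x)
    p = layers.Flatten()(p)
    policy = layers.Dense(7, activation="softmax", name="policy")(p)

    # Value head: probability that the next player wins (sigmoid).
    v = layers.Conv2D(1, 1, activation="relu")(x)
    v = layers.Flatten()(v)
    v = layers.Dense(32, activation="relu")(v)
    value = layers.Dense(1, activation="sigmoid", name="value")(v)

    model = tf.keras.Model(inputs=board, outputs=[policy, value])
    model.compile(
        optimizer="adam",
        loss={"policy": "categorical_crossentropy", "value": "binary_crossentropy"},
    )
    return model
```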
Set up the environment:
conda env create -f environment.yaml
Install the `connectfour` package with pip: from the root folder of this project, run `pip install .`
You can play against an agent in the terminal by running the command:
> python3 play_vs_ai.py --temperature 1 --n_simulations 200
- `--temperature` controls how greedily the agent selects what it estimates to be the best move (lower is greedier).
- `--model_path` controls the version of the agent to play against (currently defaults to `models/gen9.h5`, a model trained for 9 generations).
- `--n_simulations` controls the number of MCTS simulations used to improve the policy before selecting the next move.
Personally, I haven't been able to beat `gen9` with low temperature and 400 simulations. :)
Agents with various strategies are compared by playing against each other 50 or 100 games. Results are expressed in terms of % of games won, inference time and the distribution of the number of moves before the game ends.
Findings: even a small number of MCTS simulations provides a significant competitive advantage over using the raw evaluator directly to select the next action.
We successively test agents performing different amounts of tree search by increasing the number of simulations (20 vs 100, 100 vs 200, 200 vs 400) and report the results below.
Findings: as expected, the higher the number of simulations, the stronger the agent. Marginal gains remain substantial even when going from 200 to 400 simulations.
Diversity in the moves chosen helps the agent learn from a more diverse set of situations and makes it less predictable (boring?) to play against, but this may come at the cost of optimal play. Here we make three agents compete: one plays with a greedy temperature of 0.1 across the whole game, another keeps a temperature of 1 for the whole game, and the last one adopts a hybrid schedule with more diverse play (tau=1) for the first 10 moves and greedier behavior (tau=0.1) for the latter part of the game. Results are presented below.
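For reference, the temperature $\tau$ used here follows the usual AlphaZero-style scheme: the next move is sampled from the MCTS visit counts at the root raised to the power $1/\tau$. A minimal sketch, assuming the visit counts are available as a plain array (the function name `select_move` is illustrative, not this repo's API):

```python
import numpy as np


def select_move(visit_counts, temperature=1.0, rng=None):
    """Sample the next move from MCTS visit counts raised to the power 1/temperature.

    temperature -> 0: almost always plays the most visited move (greedy).
    temperature = 1: samples proportionally to the visit counts (more diverse play).
    """
    rng = rng or np.random.default_rng()
    counts = np.asarray(visit_counts, dtype=np.float64)
    if temperature < 1e-3:
        # Greedy limit: deterministic argmax over visit counts.
        return int(np.argmax(counts))
    pi = counts ** (1.0 / temperature)
    pi /= pi.sum()
    return int(rng.choice(len(pi), p=pi))


# Example: visit counts over the 7 columns after 200 simulations.
counts = [5, 20, 120, 30, 15, 8, 2]
print(select_move(counts, temperature=0.1))  # nearly always column 2
print(select_move(counts, temperature=1.0))  # more varied choices
```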
- Lessons from implementing AlphaZero
- MCTS in AlphaGoZero

The former explains some implementation details and offers insights into good hyper-parameters to fine-tune a model for Connect Four:
- Self-play (MCTS):
  - $c_{puct} = 4$: coefficient for the $U(s,a)$ term in the PUCT version of the UCB value (see the sketch after this list).
  - $\alpha = 1$: Dirichlet distribution parameter defining the level of noise introduced into the prior probabilities of actions from the root node (encouraging more exploration).
  - $\tau = 1$: controls how the improved policy $\Pi$ is estimated. The higher $\tau$ is, the flatter the policy distribution becomes, promoting more exploration but also leading to more frequent suboptimal moves; as $\tau$ tends to zero, the policy concentrates on the most visited move (greedy selection).
- Policy Evaluator (Neural Network):
  - Core model:
    - Paper: ResNet, 20 layers x 256 filters (24M parameters)
    - Original implementation: ResNet, 5 layers x 64 filters (386k parameters)
    - Later: ResNet, 20 layers x 128 filters (6.3M parameters)
  - Value and policy heads:
    - Originally: 1 and 2 filters respectively
    - Later: 32 filters
- Generating Training Data:
  - 7168 self-play games per generation (140K-220K game positions to evaluate)
  - 40 generations to reach plateau performance
  - Training target for the value head: average of $z$ and the root node's $Q$ instead of only $z$.
  - Position averaging: de-duplicate positions frequently encountered in the training data and replace $\Pi$ and $z$ with their averages.
- Training Pipeline:
  - Sampling window: grows from the last 4 generations to the last 20 generations (after 35 epochs)
  - Sample size: 500,000
  - Epochs: 2 per generation
  - Learning rate: 1cycle learning rate schedule
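To make the roles of $c_{puct}$ and $\alpha$ above concrete, here is a hedged sketch of the standard PUCT selection rule and of Dirichlet noise mixed into the root priors. The function names, array-based interface and the noise fraction `epsilon=0.25` are illustrative assumptions, not this repo's actual code:

```python
import numpy as np


def puct_select(Q, N, P, c_puct=4.0):
    """Pick the child maximizing Q(s,a) + U(s,a), where
    U(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    Q, N, P = (np.asarray(a, dtype=np.float64) for a in (Q, N, P))
    U = c_puct * P * np.sqrt(N.sum()) / (1.0 + N)
    return int(np.argmax(Q + U))


def add_root_noise(P, alpha=1.0, epsilon=0.25, rng=None):
    """Mix Dirichlet(alpha) noise into the root prior probabilities to encourage exploration."""
    rng = rng or np.random.default_rng()
    noise = rng.dirichlet([alpha] * len(P))
    return (1.0 - epsilon) * np.asarray(P, dtype=np.float64) + epsilon * noise


# Example with 7 children (one per Connect 4 column):
priors = add_root_noise([0.05, 0.10, 0.40, 0.20, 0.15, 0.05, 0.05])
Q = [0.40, 0.45, 0.55, 0.50, 0.42, 0.30, 0.20]  # mean action values so far
N = [10, 20, 120, 30, 15, 8, 2]                 # visit counts so far
print(puct_select(Q, N, priors, c_puct=4.0))
```

With all visit counts at zero, $U$ is zero in this formulation; some implementations add a small constant under the square root so that the priors already break ties on a node's first visit.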
STEPS:
- set up an environment to:
- define state transitions
- define possible moves
- define game result
- render states
- create estimator: simple CNN with
- a policy head (softmax across 7 outputs)
- a value head to estimate the value of any state (between -1 and 1) > tanh
- training:
- self-play for n games:
- sample from potential moves, using estimator policy head distribution
- perform MCTS to search the game space given a search "budget" (N evaluations, X seconds, ...) > requires modifying the mctspy code: it currently relies on random rollouts, with no estimator / feature extractor to determine the best moves.
- then: back-propagate the final outcome of the games (RL) > $V_i$, $Q_i$
- use $V_i = Q_i + z$ and $\pi_i$ as labels to train the estimator
- augmentation: use the symmetry of the game to double the number of training samples generated by a single game (see the sketch below).
- transfer config to yaml file
- refactor selfplay.py
- refactor train.py
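The symmetry augmentation mentioned in the training step above boils down to mirroring the board left-to-right and reversing the policy over the 7 columns. A minimal sketch, assuming a 6x7 numpy board and a length-7 policy vector (this representation is an assumption, not necessarily the repo's actual format):

```python
import numpy as np


def mirror_example(board, policy, value):
    """Mirror a Connect 4 training example left-to-right.

    The board columns are flipped, the policy over the 7 columns is reversed,
    and the value (final game outcome) is unchanged.
    """
    return np.fliplr(board), np.asarray(policy)[::-1], value


# Example: one (state, improved policy, outcome) sample becomes two.
board = np.zeros((6, 7), dtype=np.int8)
board[5, 0] = 1  # a disc dropped in the leftmost column
policy = np.array([0.60, 0.10, 0.10, 0.10, 0.05, 0.03, 0.02])
samples = [(board, policy, 1.0), mirror_example(board, policy, 1.0)]
```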