ConnectFour x AlphaZero

Hobby project built during my visa-related forced unpaid leave in Spring 2020! Refactored in Spring 2022.

Goal:

Train an agent to play Connect 4 using a strategy similar to AlphaGo Zero.

Instead of building a brute-force minimax strategy (exhaustive tree search + feature engineering), we adopt a less resource-intensive approach: truncated tree search.

High Level training pipeline

graph TD
    A(Evaluator) -->|powers MCTS prioritization| B(MCTS)
    A -->|train next gen Evaluator| A
    B -->|improves policy| C(AlphaZero Agent)
    C -->|self-play| C
    C -->|generate records| D(Game Records)
    D -->|provides training data| A

This strategy consists of:

  • an evaluator predicting, for each possible next move, the probability that it is the best action, as well as the overall probability of winning for the next player. This evaluator is based on a classic CNN architecture (a lightweight ResNet with ~350k parameters) with 2 heads: a policy head with a softmax activation and a state value head with a sigmoid activation (a minimal sketch of this kind of architecture follows the list below).
  • a version of Monte-Carlo Tree Search that leverages this evaluator to prioritize the regions of the tree to explore further, and returns an improved policy compared to the original evaluator.
  • The evaluator is iteratively trained following batches of self-played games.
  • Training data is gathered for each state by using:
    • the improved policy output from MCTS for the policy head and
    • the final outcome of the game (-1,0,1) for the value head.
  • Note: duplicate examples of the same state appearing in multiple games are consolidated by averaging the final outcomes and the improved policies.
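
For illustration, a minimal Keras sketch of such a two-headed residual network is shown below. This is not the repo's actual model code: the input encoding (two binary planes over the 6x7 board), the block count and the filter sizes are assumptions, chosen only to roughly match the ~350k-parameter budget described above.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters: int):
    """Basic residual block: conv-BN-ReLU-conv-BN plus identity skip connection."""
    shortcut = x
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([shortcut, x])
    return layers.ReLU()(x)

def build_evaluator(n_blocks: int = 5, filters: int = 64) -> tf.keras.Model:
    # Assumed input: 6x7 board encoded as 2 binary planes
    # (current player's stones, opponent's stones).
    board = layers.Input(shape=(6, 7, 2))
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(board)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    for _ in range(n_blocks):
        x = residual_block(x, filters)

    # Policy head: probability of each of the 7 columns being the best move.
    p = layers.Conv2D(2, 1, padding="same")(x)
    p = layers.Flatten()(p)
    policy = layers.Dense(7, activation="softmax", name="policy")(p)

    # Value head: probability that the next player wins (sigmoid, as described above).
    v = layers.Conv2D(1, 1, padding="same")(x)
    v = layers.Flatten()(v)
    v = layers.Dense(32, activation="relu")(v)
    value = layers.Dense(1, activation="sigmoid", name="value")(v)

    return tf.keras.Model(inputs=board, outputs=[policy, value])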

Install

Set up the environment:
conda env create -f environment.yaml

Install the connectfour package with pip:
From the root folder of this project, run pip install .

Playing against the agent

You can play against an agent in the terminal by running the command:

> python3 play_vs_ai.py --temperature 1 --n_simulations 200 
  • --temperature controls how greedily the agent selects the move it estimates to be best (lower is greedier; a short sketch of this selection rule follows below)
  • --model_path controls the version of the agent to play against (currently defaults to models/gen9.h5, a model trained for 9 generations)
  • --n_simulations controls the number of MCTS simulations to improve the policy before selecting the next move.

Personally, I haven't been able to beat gen9 with low temperature and 400 simulations. :)
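
For intuition, here is a small illustrative sketch (not necessarily the repo's implementation) of how a temperature turns MCTS visit counts into a move-selection distribution: the probability of playing column a is proportional to $N(a)^{1/\tau}$, so a low temperature concentrates the choice on the most-visited move.

import numpy as np

def select_move(visit_counts, temperature: float, rng=None) -> int:
    """Pick a column from MCTS visit counts; pi(a) ~ N(a)**(1/temperature)."""
    if rng is None:
        rng = np.random.default_rng()
    counts = np.asarray(visit_counts, dtype=np.float64)
    if temperature < 1e-3:
        # near-zero temperature: play the most-visited move (greedy)
        return int(np.argmax(counts))
    probs = counts ** (1.0 / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# e.g. with visit counts [5, 10, 120, 40, 15, 7, 3] after 200 simulations:
# select_move(..., temperature=1.0)  -> exploratory sampling across columns
# select_move(..., temperature=0.1)  -> almost always picks column 2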

Insights from Evaluating Competing Strategies

Agents with different strategies are compared by playing 50 or 100 games against each other. Results are expressed in terms of the percentage of games won, inference time, and the distribution of the number of moves before the game ends.
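
As an illustration, a head-to-head evaluation loop could look like the sketch below. The agent and environment methods used here (select_move, play, is_terminal, winner) are hypothetical placeholders, not the repo's actual API.

def evaluate(agent_a, agent_b, make_env, n_games: int = 100):
    """Play n_games between two agents and tally wins, draws and game lengths."""
    results = {"a": 0, "b": 0, "draw": 0}
    game_lengths = []
    for game in range(n_games):
        env = make_env()
        # alternate the starting player to remove first-move advantage bias
        order = ("a", "b") if game % 2 == 0 else ("b", "a")
        agents = {"a": agent_a, "b": agent_b}
        moves = 0
        while not env.is_terminal():
            name = order[moves % 2]
            env.play(agents[name].select_move(env))
            moves += 1
        winner = env.winner()  # 0 = first player, 1 = second player, None = draw
        results[order[winner] if winner is not None else "draw"] += 1
        game_lengths.append(moves)
    return results, game_lengths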

Raw Evaluator vs AlphaZero (MCTS + Evaluator)

Findings: even a small number of MCTS simulations provides a significant competitive advantage over using the raw evaluator directly to select the next action.

Influence of the number of MCTS simulations

We successively test agents with different search budgets by increasing the number of MCTS simulations (20 vs 100, 100 vs 200, 200 vs 400) and report the results below. Findings: as expected, the higher the number of simulations, the stronger the agent. Marginal gains remain significant even when going from 200 to 400 simulations.

Influence of TemperatureSchedule

Diversity in the moves chosen helps the agent learn from a more diverse set of situations and makes it less predictable (boring?) to play against, but this may come at the cost of optimal play. Here we make three agents compete: one plays with a greedy temperature of 0.1 across the whole game, another keeps a temperature of 1 for the whole game, and the last one adopts a hybrid schedule with more diverse play (tau=1) for the first 10 moves and greedier behavior in the latter part of the game (tau=0.1). Results are presented below.
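
A hybrid schedule like the one above can be expressed as a simple step function of the move number. The sketch below is hypothetical; the repo's actual TemperatureSchedule class may expose a different interface.

from dataclasses import dataclass

@dataclass
class StepTemperatureSchedule:
    breakpoint: int = 10     # move number at which the temperature drops
    tau_early: float = 1.0   # exploratory opening phase
    tau_late: float = 0.1    # near-greedy mid- and endgame

    def __call__(self, move_number: int) -> float:
        return self.tau_early if move_number < self.breakpoint else self.tau_late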

Example of policy inferences for a given state

  • [policy example figure] In this example, the blue player correctly identifies that playing in the first column is absolutely necessary to avoid losing 5 turns later.
  • [policy example figure] Similarly, the agent playing as grey identifies the risk of losing within the next 4 turns if it does not prevent the opponent from forming an open-ended horizontal 3-in-a-row. It also identifies the slightly higher value of playing column 5, which benefits from its position closer to the center.

Some Very Valuable Readings

  • lessons from implementing alphazero
  • MCTS in AlphaGoZero
    The former explains some implementation details and offers insights into good hyper-parameters for fine-tuning a model for Connect Four (the selection rule these parameters plug into is written out after this list):
  • Self-play MCTS:
    • $c_{puct} = 4$: coefficient scaling the exploration term $U(s,a)$ in the PUCT version of the UCB value.
    • $\alpha = 1$: Dirichlet distribution parameter defining the level of noise introduced into the prior probabilities for actions from the root node (encouraging more exploration).
    • $\tau = 1$: modifies the way the improved policy $\Pi$ is estimated. The higher $\tau$ is, the flatter the policy distribution becomes, promoting more exploration but also leading to more frequent suboptimal moves. When $\tau$ tends to zero, the improved policy collapses onto the most-visited action (greedy selection).
  • Policy Evaluator (Neural Network):
    • Core Model:
      • Paper: ResNet 20 x 256 filters (24M params)
      • Original Implementation: ResNet 5 layers x 64 filters (386k parameters)
      • Later: ResNet 20 layers x 128 filters (6.3M parameters)
    • Value and Policy heads:
      • originally: 1 and 2 filters, respectively
      • later: 32 filters
  • Generating Training Data:
    • 7168 self-play games per generation (140K-220K game positions to evaluate)
    • 40 generations to plateau performance
    • training target for value head: average of $z$ and root node's $Q$ instead of only $z$.
    • position averaging: de-duplicate positions frequently encountered in the training data and replace $\Pi$ and $z$ with their averages.
  • Training Pipeline
    • sampling window: grows from 4 last gens to 20 last gens (after 35 epochs)
    • sample size: 500,000
    • epochs: 2 per generation
    • learning rate $lr$: 1cycle learning rate schedule
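
To make these hyper-parameters concrete, the quantities they refer to can be written out as in the AlphaGo Zero paper: the exploration bonus used during tree search, the move selected at each simulation step, and the improved policy extracted from the visit counts.

$$U(s,a) = c_{puct} \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}, \qquad a^{*} = \arg\max_a \big[ Q(s,a) + U(s,a) \big], \qquad \Pi(a \mid s) \propto N(s,a)^{1/\tau}$$

where $P(s,a)$ is the evaluator's prior probability (mixed with $\mathrm{Dir}(\alpha)$ noise at the root node), $N(s,a)$ is the visit count of edge $(s,a)$, and $Q(s,a)$ is its mean action value.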

STEPS:

  • setup an environment to:
    • define state transitions
    • define possible moves
    • define game result
  • render states
  • create estimator: simple CNN with
    • a policy head (softmax across 7 outputs)
    • a value head estimating the value of any state (between -1 and 1) > tanh
  • training:
    • self-play for n games:
      • sample from potential moves, using the estimator's policy head distribution
      • perform MCTS to search the game space given a search "budget" (N evaluations, X seconds, ...) > requires modifying the mctspy code: it currently relies on random rollouts, with no estimator / feature extractor to determine the best moves.
    • then: back-propagate the final outcome of each game (RL) > $V_i$, $Q_i$
    • use $V_i = (Q_i + z)/2$ (average of the root node's $Q$ and the final outcome $z$) and $\pi_i$ as labels to train the estimator
    • augmentation: use symmetry of the game to multiply by 2 the number of training samples generated by a single game.
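
As a concrete example of the symmetry augmentation above, each (state, policy, outcome) sample can be mirrored left/right. The array shapes assumed below (a (6, 7, channels) board and a length-7 policy vector) may differ from the repo's actual data format.

import numpy as np

def augment(state: np.ndarray, policy: np.ndarray, outcome: float):
    """Return the original training sample plus its left/right mirror image."""
    mirrored_state = state[:, ::-1, :]   # flip the 7 columns of the board planes
    mirrored_policy = policy[::-1]       # column i maps to column 6 - i
    return [(state, policy, outcome), (mirrored_state, mirrored_policy, outcome)]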

TO DO:

  • transfer config to yaml file
  • refactor selfplay.py
  • refactor train.py
