Commit 02fac4d — qibinyi committed Oct 11, 2018 (1 parent 6168168)
Showing 2 changed files with 8 additions and 0 deletions.
8 changes: 8 additions & 0 deletions — dev_notes/alphazero.md
@@ -78,6 +78,14 @@ http://tim.hibal.org/blog/alpha-zero-how-and-why-it-works/
- Suppose we have an *expert policy π* that, for a given state *s*, tells us how likely an expert-level player is to make each possible action.
- For the tic-tac-toe example, this might look like:
- ![](https://raw.githubusercontent.com/mebusy/notes/master/imgs/mcts_tictac8.png)
- where each Pᵢ = π(aᵢ|s₀) is the probability of choosing the ith action aᵢ given the root state s₀.
- If the expert policy is really good, then we can produce a strong bot by directly drawing our next action from the probabilities produced by *π*, or simply by taking the move with the highest probability.
- Unfortunately, getting an expert policy is difficult, and verifying that one's policy is optimal is difficult as well.
- :(
- Fortunately, one can improve on a policy by using a modified form of Monte Carlo tree search.
- This version will also store the probability of each node according to the policy, and this probability is used to adjust the node's score during selection.
- The probabilistic upper confidence tree (PUCT) score used by DeepMind is:
- Uᵢ = wᵢ/nᵢ + c · Pᵢ · √N / (1 + nᵢ)
- where wᵢ is the total simulation value of the ith child, nᵢ its visit count, N = Σₖ nₖ the total visit count across all children, and c a constant controlling the amount of exploration.



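The two ideas above — playing straight from the policy's probabilities, and biasing tree selection by those probabilities — can be sketched in a few lines of Python. The `Child` class layout and the names `sample_action`, `puct_score`, and `select` are illustrative assumptions, not from the original notes:

```python
import math
import random

# A minimal sketch, assuming a node stores, for each child, the prior
# p = π(aᵢ|s) assigned by the expert policy, the visit count n, and the
# accumulated simulation value w.

class Child:
    def __init__(self, prior):
        self.p = prior   # prior probability π(aᵢ|s) of this action
        self.n = 0       # number of times this child has been visited
        self.w = 0.0     # total value of simulations through this child

def sample_action(priors):
    # The "strong bot" baseline: draw the next move directly from π.
    return random.choices(range(len(priors)), weights=priors, k=1)[0]

def puct_score(child, parent_visits, c=1.0):
    # Mean value so far (0 for an unvisited child) plus the
    # prior-weighted exploration bonus from the PUCT formula.
    q = child.w / child.n if child.n > 0 else 0.0
    u = c * child.p * math.sqrt(parent_visits) / (1 + child.n)
    return q + u

def select(children, c=1.0):
    # During tree descent, follow the child with the highest PUCT score.
    total = sum(ch.n for ch in children)
    return max(children, key=lambda ch: puct_score(ch, total, c))
```

Note how the exploration term shrinks as 1/(1 + nᵢ): selection initially follows the policy's priors, then gradually shifts toward children whose simulations have returned high average values.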
Binary file added imgs/mcts_UTCscore_m.png
