actor-critic: Asynchronous advantage actor-critic (A3C)

review policy gradient

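$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum\limits_{n=1}^{N}\sum\limits_{t=1}^{T_n}\left(\sum\limits_{t'=t}^{T_n}\gamma^{t'-t}r^n_{t'}-b\right)\nabla\log p_\theta(a^n_t|s^n_t)$

The discounted return $G^n_t=\sum_{t'=t}^{T_n}\gamma^{t'-t}r^n_{t'}$ is obtained by sampling, so it is very unstable and has a large variance. How can we estimate the expected value of $G$ instead?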
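To make the sampled quantity concrete, here is a minimal sketch of how the weights $G^n_t - b$ could be computed for one episode. The function names and the constant `baseline` argument are assumptions for illustration, not part of the original notes.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = sum_{t'>=t} gamma^(t'-t) * r_{t'} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def policy_gradient_weights(rewards, gamma=0.99, baseline=0.0):
    """Weights (G_t - b) that multiply grad log p(a_t | s_t) in the estimator.

    `baseline` is a hypothetical constant b; in practice it is often learned.
    """
    return [g - baseline for g in discounted_returns(rewards, gamma)]
```

Because every $G_t$ depends on all later sampled rewards, two episodes from the same state can yield very different weights, which is the variance problem described above.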

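- state value function $V^\pi(s)$: the expected cumulative reward obtained after visiting state $s$ when following policy $\pi$
- state-action value function $Q^{\pi}(s,a)$: the expected cumulative reward obtained after taking action $a$ in state $s$ and then following $\pi$

The expected value of $G^n_t$ is exactly $Q^{\pi}(s^n_t,a^n_t)$, so a learned critic can stand in for the sampled return.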

advantage actor-critic

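Using $V^\pi(s_t^n)$ as the baseline $b$ turns the weight into the advantage $Q^{\pi}(s_t^n,a_t^n)-V^\pi(s_t^n)$. Estimating both terms directly would require two networks. Approximating $Q^{\pi}(s_t^n,a_t^n)\approx r_t^n+\gamma V^\pi(s_{t+1}^n)$ gives $r_t^n+\gamma V^\pi(s_{t+1}^n)-V^\pi(s_t^n)$, which introduces some variance through the sampled reward $r_t^n$ but requires estimating only one network, $V^\pi$.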
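The one-network advantage estimate can be sketched as follows. This is a minimal illustration that assumes the critic's predictions are already available as an array and that the episode terminates after the last step, so the value of the terminal state is taken to be 0; `td_advantages` is a hypothetical helper name.

```python
import numpy as np


def td_advantages(rewards, values, gamma=0.99):
    """A_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V(terminal) assumed to be 0.

    `values` holds the critic's estimates V(s_0), ..., V(s_{T-1}).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    # Shift the value estimates by one step and pad with the terminal value 0.
    next_values = np.append(values[1:], 0.0)
    return rewards + gamma * next_values - values
```

Only the single sampled reward $r_t^n$ enters each estimate, rather than the whole tail of the episode as in $G^n_t$.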

tips
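- The actor $\pi(s)$ and the critic $V^\pi(s)$ can share the parameters of their shallow (early) layers.
- Use the entropy of the output distribution as a regularizer for $\pi(s)$: a larger entropy is preferred, which encourages exploration.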
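The entropy tip can be sketched as a loss term. This is a minimal illustration for a discrete action space; the function names and the weight `beta` are assumptions chosen here, not values from the notes.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of action logits."""
    shifted = np.asarray(logits, dtype=float) - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def entropy(probs, eps=1e-12):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a); eps guards against log(0)."""
    probs = np.asarray(probs, dtype=float)
    return float(-np.sum(probs * np.log(probs + eps)))


def actor_loss(logits, action, advantage, beta=0.01):
    """Single-step policy-gradient loss minus an entropy bonus weighted by beta."""
    probs = softmax(logits)
    pg_loss = -advantage * np.log(probs[action])
    # Subtracting beta * H means minimizing the loss pushes the entropy up.
    return float(pg_loss - beta * entropy(probs))
```

A uniform policy has the highest entropy, so the bonus penalizes a policy that collapses onto one action too early.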

asynchronous advantage actor-critic(A3C)

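Like Naruto's shadow clones: multiple workers learn in parallel. Each worker copies the global parameters, interacts with its own environment, computes gradients, and uses them to update the global parameters.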
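The pull, compute, push loop of the workers can be sketched with threads. This is a toy under stated assumptions: the "model" is a single scalar, the gradient comes from a made-up objective $(\theta - 3)^2$ instead of an environment rollout, and a lock protects each read and write. All names (`GlobalModel`, `worker`, `train`, `TARGET`) are hypothetical.

```python
import threading

TARGET = 3.0  # hypothetical optimum of the toy objective (theta - TARGET)^2


class GlobalModel:
    """Shared parameters that every worker reads from and writes to."""

    def __init__(self, theta=0.0):
        self.theta = theta
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.theta

    def push(self, grad, lr):
        with self.lock:
            self.theta -= lr * grad


def worker(model, steps, lr):
    for _ in range(steps):
        local_theta = model.pull()            # copy the global parameters
        grad = 2.0 * (local_theta - TARGET)   # gradient computed on the local copy
        model.push(grad, lr)                  # apply it to the global parameters


def train(num_workers=4, steps=200, lr=0.05):
    model = GlobalModel()
    threads = [
        threading.Thread(target=worker, args=(model, steps, lr))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return model.theta
```

Other workers may update the global parameters between a worker's pull and push, so gradients can be slightly stale. That staleness is what makes the scheme asynchronous.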

pathwise derivative policy gradient

The critic not only scores the policy's actions but also directly tells the actor which action would be a good one to take.
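A minimal sketch of this idea: the actor's parameter is updated by following the gradient of the critic's score with respect to the action. In the actual method the critic is a learned network. Here, purely as an assumption for illustration, it is a fixed analytic function whose best action is `2 * state`, and the actor is the linear map `a = theta * s`.

```python
def critic_q(state, action):
    """Hypothetical critic Q(s, a); it is maximized at action = 2 * state."""
    return -(action - 2.0 * state) ** 2


def dq_daction(state, action):
    """Derivative of the hypothetical critic with respect to the action."""
    return -2.0 * (action - 2.0 * state)


def train_actor(states, theta=0.0, lr=0.05, epochs=200):
    """Gradient ascent on Q(s, pi_theta(s)) for the linear actor a = theta * s."""
    for _ in range(epochs):
        for s in states:
            action = theta * s
            # Chain rule: dQ/dtheta = dQ/da * da/dtheta, with da/dtheta = s.
            grad = dq_daction(s, action) * s
            theta += lr * grad
    return theta
```

With `states = [0.5, 1.0, 1.5]` the parameter moves toward 2, the slope at which the actor's output matches the action the critic rates highest.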
