Actually, you cannot use the critic loss proposed in your question. The formula for $Z(s_t,a_t)$ in the paper is somewhat oversimplified, which can be misleading. The complete and correct value distribution consistency condition must take the marginal distribution into account by integrating over $s'$ and $a'$.
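One way to write that marginalization is sketched below. The notation is an assumption made here for illustration (it is not copied from the paper): $s'=s_{t+1}$, $a'=a_{t+1}$, the reward $r_t$ is treated as fixed given $(s_t,a_t)$, and the target critic is Gaussian with mean $\mu_{\bar{\theta}}$ and variance $\sigma_{\bar{\theta}}^2$ as in the question.

$$
\mathcal{Z}_{\text{target}}(z \mid s_t, a_t)
= \mathbb{E}_{s' \sim p(\cdot \mid s_t, a_t),\; a' \sim \pi(\cdot \mid s')}
\Big[ \mathcal{N}\big(z;\; r_t + \gamma\big(\mu_{\bar{\theta}}(s', a') - \alpha \log \pi(a' \mid s')\big),\; \gamma^2 \sigma_{\bar{\theta}}^2(s', a')\big) \Big]
$$

Under these assumptions the target is a mixture of Gaussians over $(s', a')$, which is generally not itself Gaussian. The single-Gaussian expression in the question corresponds to one fixed sample of $(s', a')$ only.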
It is essential to account for the randomness of $s'$ and $a'$. A closed-form solution for the target value distribution cannot be obtained because the environment dynamics $p(s'|s,a)$ are unknown. In such cases, only sample-based update rules can be applied.
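To make the effect of that randomness concrete, here is a minimal toy sketch in plain Python. Everything in it is hypothetical: the function name `marginal_target_moments`, the enumerated `toy_branches`, and the numbers are invented for illustration. In a real environment the outcomes cannot be enumerated, since $p(s'|s,a)$ is unknown, which is exactly why a sample-based update is needed. The sketch uses the law of total variance to show that the marginal variance of the target exceeds the $\gamma^2\sigma_{\bar{\theta}}^2$ term whenever the per-sample target means differ.

```python
def marginal_target_moments(r, gamma, alpha, branches):
    """Mean and variance of the target return after marginalizing over (s', a').

    `branches` is a hypothetical, fully enumerated list of next-step outcomes:
    (probability, target critic mean, target critic std, log pi(a'|s')).
    """
    probs = [p for p, _, _, _ in branches]
    # Per-branch Gaussian target: mean r + gamma * (mu - alpha * log_pi),
    # variance (gamma * sigma)^2.
    means = [r + gamma * (mu - alpha * log_pi) for _, mu, _, log_pi in branches]
    variances = [(gamma * sigma) ** 2 for _, _, sigma, _ in branches]

    mean = sum(p * m for p, m in zip(probs, means))
    # Law of total variance: E[Var(Z | s', a')] + Var(E[Z | s', a']).
    within = sum(p * v for p, v in zip(probs, variances))
    between = sum(p * (m - mean) ** 2 for p, m in zip(probs, means))
    return mean, within + between, within


# Two equally likely next-step outcomes whose target critic means differ.
toy_branches = [(0.5, 0.0, 1.0, 0.0), (0.5, 2.0, 1.0, 0.0)]
mean, total_var, within_var = marginal_target_moments(
    r=1.0, gamma=0.5, alpha=0.0, branches=toy_branches
)
print(f"mean={mean}, marginal variance={total_var}, "
      f"gamma^2 * sigma^2 term alone={within_var}")
```

The gap between the two variances is the spread of the per-sample target means, which a loss built from a single Gaussian with variance $\gamma^2\sigma_{\bar{\theta}}^2$ would ignore.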
Thank you for your answer! It perfectly clears up my confusion!
Hello, thanks for your nice work.
I would like to ask some questions about the critic loss.
If I understand correctly, the soft action return $$Z(s_t, a_t)\sim\mathcal{N}(\mu_\theta(s_t,a_t), \sigma_\theta^2(s_t,a_t))$$, and $$Z(s_t, a_t)=r_t+\gamma(Z_{\bar{\theta}}(s_{t+1}, a_{t+1})-\alpha\log\pi(a_{t+1}|s_{t+1}))\sim\mathcal{N}(r_t+\gamma(\mu_{\bar{\theta}}(s_{t+1},a_{t+1})-\alpha\log\pi(a_{t+1}|s_{t+1})), \gamma^2\sigma_{\bar{\theta}}^2(s_{t+1},a_{t+1}))$$.
Then, can we use the following critic loss?