Some questions about the critic loss #13

HYB777 · 2025-01-08T16:58:07Z

Hello, thanks for your nice work.

I would like to ask some questions about the critic loss.

If I understand correctly, the soft action return $$Z(s_t, a_t)\sim\mathcal{N}(\mu_\theta(s_t,a_t), \sigma_\theta^2(s_t,a_t))$$, and $$Z(s_t, a_t)=r_t+\gamma(Z_{\bar{\theta}}(s_{t+1}, a_{t+1})-\alpha\log\pi(a_{t+1}|s_{t+1}))\sim\mathcal{N}(r_t+\gamma(\mu_{\bar{\theta}}(s_{t+1},a_{t+1})-\alpha\log\pi(a_{t+1}|s_{t+1})), \gamma^2\sigma_{\bar{\theta}}^2(s_{t+1},a_{t+1}))$$.

Then, can we use the following critic loss?

$$L_{critic}(\theta)=(\mu_\theta(s_t,a_t)-( r_t+ \gamma( \mu_{\bar{\theta}}(s_{t+1},a_{t+1})-\alpha\log\pi(a_{t+1}|s_{t+1}) ) ) )^2+( \sigma_\theta(s_t,a_t)-\gamma\sigma_{\bar{\theta}}(s_{t+1},a_{t+1}))^2$$

Kirikirito · 2025-01-09T13:00:11Z

Thank you for your insightful question!

Actually, you cannot use the critic loss proposed in your question. The formula for $Z(s_t,a_t)$ in the paper is somewhat oversimplified, which can be misleading. The complete and correct value distribution consistency condition must take the marginal distribution into account by integrating over the next state $s'$ and the next action $a'$.

It is essential to account for the randomness of $s'$ and $a'$. A closed-form solution for the target value distribution cannot be obtained because the environment dynamics $p(s'|s,a)$ is unknown. In such cases, only sample-based update rules can be applied.
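To make the point concrete, here is a minimal numerical sketch (not the paper's implementation). It assumes a hypothetical toy transition in which $s'$ is one of two equally likely next states, each with its own Gaussian target return, and it omits the entropy term for brevity. All names and numbers (`next_mu`, `next_sigma`, `reward`) are illustrative assumptions. By the law of total variance, the spread of the sampled targets includes the variance of the target means across $s'$, which the term $\gamma\sigma_{\bar{\theta}}(s_{t+1},a_{t+1})$ alone does not capture.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

# Hypothetical toy setup (an assumption for illustration): from (s, a) the
# environment reaches one of two next states with equal probability, and the
# target critic predicts a Gaussian return for each of them.
next_mu = np.array([0.0, 10.0])    # target means mu(s', a')
next_sigma = np.array([1.0, 1.0])  # target standard deviations sigma(s', a')
reward = 1.0

n = 200_000
idx = rng.integers(0, 2, size=n)                    # sample s' ~ p(.|s, a)
z_next = rng.normal(next_mu[idx], next_sigma[idx])  # sample Z(s', a')
target = reward + gamma * z_next                    # sampled target returns

# Empirical spread of the target return distribution.
sample_std = target.std()

# Fixed point of regressing sigma_theta onto gamma * sigma(s', a') per sample:
# the average of gamma * sigma over next states.
naive_std = gamma * next_sigma.mean()

# Law of total variance: Var[Z] = gamma^2 * (E[sigma^2] + Var[mu]).
total_std = gamma * np.sqrt((next_sigma ** 2).mean() + next_mu.var())

print(f"sampled std: {sample_std:.3f}")
print(f"naive std:   {naive_std:.3f}")
print(f"total std:   {total_std:.3f}")
```

In this toy case the sampled standard deviation agrees with the law-of-total-variance value and is several times larger than the naive one, so a loss that matches $\sigma_\theta$ to $\gamma\sigma_{\bar{\theta}}(s_{t+1},a_{t+1})$ would underestimate the return uncertainty whenever the transition or the policy is stochastic.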

HYB777 · 2025-01-09T15:03:57Z

> Thank you for your insightful question!
>
> Actually, you cannot use the critic loss as proposed in your question. The original formula for $Z(s_t,a_t)$ in the paper is somewhat oversimplified, which may lead to some misunderstanding. The complete and correct value distribution consistency condition should take into account the marginal distributions by integrating over $s'$ and $a'$.
>
> It is essential to account for the randomness of $s'$ and $a'$. A closed-form solution for the target value distribution cannot be obtained because the environment dynamics $p(s'|s,a)$ is unknown. In such cases, only sample-based update rules can be applied.

Thank you for your answer! It perfectly clears up my confusion!
