Many thanks for your great work. I am trying to reimplement your work in TensorFlow, but I am a little confused about the reward measurement. As mentioned in your paper, the policy gradient is ∇J(\theta) = [Q_+(\{x, y\}) − b(\{x, y\})] ∇\sum_t \log p(y_t | x, y_{1:t-1}).
I have looked into your code and found how [Q_+(\{x, y\}) − b(\{x, y\})] is computed, but I have no idea how the term ∇\sum_t \log p(y_t | x, y_{1:t-1}) is computed. Could you please tell me how to compute it? And given the policy gradient value, should I feed it back to the generator directly as the optimization target? I have no prior experience with Lua, so I may have misunderstood your implementation. Thanks in advance!
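For reference, here is the rough TensorFlow (1.x-style) sketch I currently have in mind for that term. The names `logits`, `sampled_tokens`, `reward`, `baseline`, and `mask` are just my placeholders, not anything from your code, so please correct me if this is not what the paper intends:

```python
import tensorflow as tf

# Sketch of a REINFORCE-style loss under my assumptions:
#   logits         [batch, time, vocab]  generator output scores per step
#   sampled_tokens [batch, time]         tokens y_t sampled from the generator
#   reward         [batch]               Q+({x, y}) from the discriminator
#   baseline       [batch]               b({x, y})
#   mask           [batch, time]         1.0 for real tokens, 0.0 for padding
def policy_gradient_loss(logits, sampled_tokens, reward, baseline, mask):
    # -log p(y_t | x, y_{1:t-1}) for each sampled token
    neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=sampled_tokens, logits=logits)                  # [batch, time]
    # sum over time steps, ignoring padding positions
    seq_neg_log_prob = tf.reduce_sum(neg_log_prob * mask, axis=1)  # [batch]
    # advantage Q+({x, y}) - b({x, y}); stop_gradient so the reward path
    # only scales the gradient and is not back-propagated through
    advantage = tf.stop_gradient(reward - baseline)                # [batch]
    # minimizing this loss yields the gradient
    #   -[Q+ - b] * grad sum_t log p(y_t | x, y_{1:t-1}),
    # i.e. gradient ascent on the objective from the paper
    return tf.reduce_mean(advantage * seq_neg_log_prob)

# train_op = tf.train.AdamOptimizer(1e-4).minimize(
#     policy_gradient_loss(logits, sampled_tokens, reward, baseline, mask))
```

In other words, my guess is that the reward term simply rescales the usual sequence log-likelihood gradient of the sampled response, and the optimizer is applied to that scaled loss rather than to a separately computed gradient value. Is that the right way to read the implementation?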