Many thanks for your great work. I am trying to reimplement your work in TensorFlow, but I am a little confused about the reward measurement. As mentioned in your paper, the policy gradient is ∇J(\theta) = [Q_+(\{x, y\}) − b(\{x, y\})] ∇\sum_t \log p(y_t | x, y_{1:t-1}).
I have looked into your code and found how [Q_+(\{x, y\}) − b(\{x, y\})] is computed, but I have no idea how the term ∇\sum_t \log p(y_t | x, y_{1:t-1}) is computed. Could you please tell me how to compute it? And given the policy gradient value, should I feed it back to the generator directly as the optimization target? I have no prior experience with Lua, so I may have misunderstood your implementation. Thanks in advance!
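For reference, here is the rough TensorFlow (1.x-style) sketch I currently have in mind for that term. The names `logits`, `sampled_tokens`, `reward`, `baseline`, and `mask` are just my placeholders, not anything from your code, so please correct me if this is not what the paper intends:

```python
import tensorflow as tf

# Sketch of a REINFORCE-style loss under my assumptions:
#   logits         [batch, time, vocab]  generator output scores per step
#   sampled_tokens [batch, time]         tokens y_t sampled from the generator
#   reward         [batch]               Q+({x, y}) from the discriminator
#   baseline       [batch]               b({x, y})
#   mask           [batch, time]         1.0 for real tokens, 0.0 for padding
def policy_gradient_loss(logits, sampled_tokens, reward, baseline, mask):
    # -log p(y_t | x, y_{1:t-1}) for each sampled token
    neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=sampled_tokens, logits=logits)                  # [batch, time]
    # sum over time steps, ignoring padding positions
    seq_neg_log_prob = tf.reduce_sum(neg_log_prob * mask, axis=1)  # [batch]
    # advantage Q+({x, y}) - b({x, y}); stop_gradient so the reward path
    # only scales the gradient and is not back-propagated through
    advantage = tf.stop_gradient(reward - baseline)                # [batch]
    # minimizing this loss yields the gradient
    #   -[Q+ - b] * grad sum_t log p(y_t | x, y_{1:t-1}),
    # i.e. gradient ascent on the objective from the paper
    return tf.reduce_mean(advantage * seq_neg_log_prob)

# train_op = tf.train.AdamOptimizer(1e-4).minimize(
#     policy_gradient_loss(logits, sampled_tokens, reward, baseline, mask))
```

In other words, my guess is that the reward term simply rescales the usual sequence log-likelihood gradient of the sampled response, and the optimizer is applied to that scaled loss rather than to a separately computed gradient value. Is that the right way to read the implementation?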