https://arxiv.org/pdf/1606.01541v4.pdf
For Q-learning applied to dialogue selection, see https://arxiv.org/pdf/1511.04636.pdf
Also, what is the reason that this paper's policy gradient method is better suited than Q-learning?
The paper states:
The parameters of the network are optimized to maximize the expected future reward using policy search, as described in Section 4.3. Policy gradient methods are more appropriate for our scenario than Q-learning (Mnih et al., 2013), because we can initialize the encoder-decoder RNN using MLE parameters that already produce plausible responses, before changing the objective and tuning towards a policy that maximizes long-term reward. Q-learning, on the other hand, directly estimates the future expected reward of each action, which can differ from the MLE objective by orders of magnitude, thus making MLE parameters inappropriate for initialization. The components (states, actions, reward, etc.) of our sequential decision problem are summarized in the following sub-sections.
How should this be understood?
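Below is a minimal Python/numpy sketch of the contrast the quoted passage draws, on a toy single-step "dialogue" with 5 candidate responses (the action space, response counts, reward values, and learning rate are all illustrative assumptions, not from the paper): logits fit by MLE already define a plausible softmax policy, so a REINFORCE-style policy gradient update can fine-tune those same parameters directly, whereas reinterpreting log-probability-scale logits as Q-value estimates would be a poor initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 5  # toy action space: 5 candidate responses (illustrative assumption)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# --- Stage 1: MLE pretraining (supervised) ---
# Hypothetical response counts: response 2 is the most frequent in the "data",
# so the MLE logits already put most probability mass on a plausible response.
counts = np.array([1.0, 1.0, 6.0, 1.0, 1.0])
theta = np.log(counts / counts.sum())  # MLE logits; softmax(theta) recovers the data distribution

# --- Stage 2: REINFORCE fine-tuning, reusing theta unchanged ---
# Hypothetical long-term reward: response 4 is actually best for the dialogue.
reward = np.array([0.0, 0.0, 0.2, 0.0, 1.0])
lr = 0.5
for _ in range(200):
    p = softmax(theta)
    a = rng.choice(VOCAB, p=p)           # sample a response from the current policy
    grad_logp = -p                       # d/d theta of log softmax(theta)[a] ...
    grad_logp[a] += 1.0                  # ... is onehot(a) - p
    theta += lr * reward[a] * grad_logp  # REINFORCE update (no baseline)

print(softmax(theta))  # probability mass has shifted toward the high-reward response
# Note the asymmetry: theta began life as log-probabilities, which is a valid
# policy parameterization, but would be a meaningless initialization for
# Q-values, since expected rewards live on a completely different scale.
```

The same asymmetry is what the paper points at for its encoder-decoder RNN: the per-token softmax trained by MLE is already a policy, so policy search can start from it, while Q-learning would have to relearn outputs of a different kind from scratch.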