https://arxiv.org/pdf/1606.01541v4.pdf
For Q-learning applied to dialogue selection, see https://arxiv.org/pdf/1511.04636.pdf
Also, what is the reason that this paper's policy gradient method is better suited than Q-learning?
The paper states:
The parameters of the network are optimized to maximize the expected future reward using policy search, as described in Section 4.3. Policy gradient methods are more appropriate for our scenario than Q-learning (Mnih et al., 2013), because we can initialize the encoder-decoder RNN using MLE parameters that already produce plausible responses, before changing the objective and tuning towards a policy that maximizes long-term reward. Q-learning, on the other hand, directly estimates the future expected reward of each action, which can differ from the MLE objective by orders of magnitude, thus making MLE parameters inappropriate for initialization. The components (states, actions, reward, etc.) of our sequential decision problem are summarized in the following sub-sections.
How should this be understood?
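Below is a minimal Python/numpy sketch of the contrast the quoted passage draws, on a toy single-step "dialogue" with 5 candidate responses (the action space, response counts, reward values, and learning rate are all illustrative assumptions, not from the paper): logits fit by MLE already define a plausible softmax policy, so a REINFORCE-style policy gradient update can fine-tune those same parameters directly, whereas reinterpreting log-probability-scale logits as Q-value estimates would be a poor initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 5  # toy action space: 5 candidate responses (illustrative assumption)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# --- Stage 1: MLE pretraining (supervised) ---
# Hypothetical response counts: response 2 is the most frequent in the "data",
# so the MLE logits already put most probability mass on a plausible response.
counts = np.array([1.0, 1.0, 6.0, 1.0, 1.0])
theta = np.log(counts / counts.sum())  # MLE logits; softmax(theta) recovers the data distribution

# --- Stage 2: REINFORCE fine-tuning, reusing theta unchanged ---
# Hypothetical long-term reward: response 4 is actually best for the dialogue.
reward = np.array([0.0, 0.0, 0.2, 0.0, 1.0])
lr = 0.5
for _ in range(200):
    p = softmax(theta)
    a = rng.choice(VOCAB, p=p)           # sample a response from the current policy
    grad_logp = -p                       # d/d theta of log softmax(theta)[a] ...
    grad_logp[a] += 1.0                  # ... is onehot(a) - p
    theta += lr * reward[a] * grad_logp  # REINFORCE update (no baseline)

print(softmax(theta))  # probability mass has shifted toward the high-reward response
# Note the asymmetry: theta began life as log-probabilities, which is a valid
# policy parameterization, but would be a meaningless initialization for
# Q-values, since expected rewards live on a completely different scale.
```

The same asymmetry is what the paper points at for its encoder-decoder RNN: the per-token softmax trained by MLE is already a policy, so policy search can start from it, while Q-learning would have to relearn outputs of a different kind from scratch.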