

INSIGHTS IN REINFORCEMENT LEARNING

Formal analysis and empirical evaluation of temporal-difference learning algorithms

Hado van Hasselt

CONTENTS

1 Introduction
1.1 The Aim of this Dissertation
1.2 Previous Work
1.3 Agents
1.4 Reinforcement Learning
1.5 Overview

2 Reinforcement Learning
2.1 Markov Decision Processes
2.2 Dynamic Programming
2.3 Model-Free Value Learning
2.4 Learning Action Values
2.5 Conclusion

3 Estimation Biases in Maximization
3.1 Introduction
3.2 Preliminaries
3.3 The Single Estimator
3.4 The Double Estimator
3.5 Comparing the Single and Double Estimator
3.6 A Comparison on Uniform Variables
3.7 The Effect of More Samples
3.8 The Effect of More Variables
3.9 Conclusion
3.10 Proofs

4 The Overestimation of Q-learning
4.1 Context and Contributions
4.2 Overestimations in Bandit Problems
4.3 Convergence Rates of Q-learning
4.4 Double Q-learning
4.5 Experiments
4.6 Conclusion
4.7 Proofs
