Reinforcement Learning
An Introductory Note
Jingye Wang
wangjy5@shanghaitech.edu.cn
Spring 2020
Contents
1 Introduction
2 Review of Basic Probability
  2.1 Interpretation of Probability
  2.2 Transformations
  2.3 Limit Theorem
  2.4 Sampling & Monte Carlo Methods
  2.5 Basic Inequalities
  2.6 Concentration Inequalities
  2.7 Conditional Expectation
3 Bandit Algorithms
  3.1 Bandit Models
  3.2 Stochastic Bandits
  3.3 Greedy Algorithms
  3.4 UCB Algorithms
  3.5 Thompson Sampling Algorithms
  3.6 Gradient Bandit Algorithms
4 Markov Chains
  4.1 Markov Model
  4.2 Basic Computations
  4.3 Classifications
  4.4 Stationary Distribution
  4.5 Reversibility
  4.6 Markov Chain Monte Carlo
5 Markov Decision Process
  5.1 Markov Reward Process
  5.2 Markov Decision Process
  5.3 Dynamic Programming
6 Model-Free Prediction
  6.1 Monte-Carlo Policy Evaluation
  6.2 Temporal-Difference Learning
7 Model-Free Control
  7.1 On-Policy Monte-Carlo Control
  7.2 On-Policy Temporal-Difference Control: Sarsa
  7.3 Off-Policy Temporal-Difference Control: Q-Learning
8 Value Function Approximation
  8.1 Semi-gradient Method
  8.2 Deep Q-Learning
9 Policy Optimization
  9.1 Policy Optimization Theorem
  9.2 REINFORCE: Monte-Carlo Policy Gradient
  9.3 Actor-Critic Policy Gradient
  9.4 Extension of Policy Gradient
1 Introduction
Course Prerequisites:
• Linear Algebra
• Probability
• A machine learning-related course (data mining, pattern recognition, etc.)
• Python and PyTorch
What is Reinforcement Learning and why do we care:
A computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex and uncertain environment [1].
Difference between Reinforcement Learning and Supervised Learning:
• Sequential data as input (not i.i.d.);
• The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them;
• Trial-and-error exploration (balance between exploration and exploitation);
• There is no supervisor, only a reward signal, which is also delayed.
Big deal: Able to Achieve Superhuman Performance
• The upper bound for Supervised Learning is human performance;
• What is the upper bound for Reinforcement Learning?
Why does Reinforcement Learning work now?
• Computation power: many GPUs to do trial-and-error rollouts;
• It can acquire a high degree of proficiency in domains governed by simple, known rules;
• End-to-end training: features and policy are jointly optimized toward the end goal.
Sequential Decision Making:
• Agent and Environment: the agent learns to interact with the environment (see the sketch after this list);
• Rewards: a scalar feedback signal that indicates how well the agent is doing;
• Policy: a function mapping states/observations to actions that models the agent's behavior;
• Value function: the expected discounted sum of future rewards under a particular policy;
• Objective of the agent: select a series of actions to maximize total future rewards;
• History: a sequence of observations, actions, and rewards;
• Full observability: the agent directly observes the environment state, formalized as a Markov decision process (MDP);
• Partial observability: the agent indirectly observes the environment, formalized as a partially observable Markov decision process (POMDP).
All goals of the agent can be described by the maximization of expected cumulative reward.
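To make these definitions concrete, here is a minimal sketch of the interaction loop in Python. The toy CoinFlipEnv environment, the RandomAgent policy, and the reset/step/act method names are hypothetical illustrations in the style of common RL libraries, chosen only to show how agent, environment, policy, reward, and history fit together.

# A minimal sketch of the agent-environment interaction loop.
# CoinFlipEnv and RandomAgent are hypothetical stand-ins for illustration only.
import random

class CoinFlipEnv:
    """Toy environment: reward +1 if the action matches a hidden coin flip."""
    def reset(self):
        self.t = 0
        return 0                              # initial observation (dummy state)

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == random.randint(0, 1) else 0.0
        done = self.t >= 10                   # episode ends after 10 steps
        return self.t, reward, done           # next observation, reward, termination flag

class RandomAgent:
    """A policy that ignores the observation and acts uniformly at random."""
    def act(self, observation):
        return random.randint(0, 1)

env, agent = CoinFlipEnv(), RandomAgent()
obs, total_reward, done = env.reset(), 0.0, False
while not done:                               # the sequential decision-making loop
    action = agent.act(obs)                   # policy: observation -> action
    obs, reward, done = env.step(action)      # environment returns feedback
    total_reward += reward                    # cumulative (undiscounted) reward
print("episode return:", total_reward)

In this fully observed toy case, the history is just the sequence of (observation, action, reward) triples generated by the loop, and the agent's objective is to choose actions that make the cumulative reward large.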
Types of Reinforcement Learning Agents based on What the Agent Learns
• Value-based agent:
– Explicit: Value function;
– Implicit: Policy (can derive a policy from value function);
• Policy-based agent:
– Explicit: policy;
– No value function;
• Actor-Critic agent:
– Explicit: policy and value function.
Types of Reinforcement Learning Agents based on Whether There Is a Model
• Model-based:
– Explicit: model;
– May or may not have policy and/or value function;
• Model-free:
– Explicit: value function and/or policy function;
– No model.
2 Review of Basic Probability
For more details on this part, one can refer to Introduction to Probability [2] and Monte Carlo Statistical Methods [3].
2.1 Interpretation of Probability
The Frequentist view: Probability represents a long-run frequency over a large number of repetitions of an experiment.
The Bayesian view: Probability represents a degree of belief about the event in question.
Many machine learning techniques are derived from these two views. As computing power and algorithms develop, however, the Bayesian view is becoming increasingly dominant.
2.2 Transformations
Change of variables: Let $X = (X_1, \ldots, X_n)$ be a continuous random vector with joint PDF $f_X$, and let $Y = g(X)$, where $g$ is an invertible function from $\mathbb{R}^n$ to $\mathbb{R}^n$. Then the joint PDF of $Y$ is
\[
f_Y(y) = f_X(x) \left| \frac{\partial x}{\partial y} \right|,
\]
where the vertical bars say “take the absolute value of the determinant of $\partial x / \partial y$” and $\partial x / \partial y$ is the Jacobian matrix
\[
\frac{\partial x}{\partial y} =
\begin{pmatrix}
\frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial x_n}{\partial y_1} & \frac{\partial x_n}{\partial y_2} & \cdots & \frac{\partial x_n}{\partial y_n}
\end{pmatrix}.
\]
This assumes that the determinant of the Jacobian matrix is never 0, and that all the partial derivatives $\partial x_i / \partial y_j$ exist and are continuous.
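As a quick sanity check, the following Python sketch applies the formula to a linear map $Y = AX$ with $X$ a standard bivariate normal, for which the density of $Y \sim \mathcal{N}(0, AA^\top)$ is known in closed form. The matrix A and the variable names are arbitrary illustrative choices; NumPy and SciPy are assumed to be available.

# Numerical check of the change-of-variables formula for Y = A X,
# where X ~ N(0, I_2). Then f_Y(y) = f_X(A^{-1} y) * |det(A^{-1})|,
# which must agree with the N(0, A A^T) density. Illustrative sketch only.
import numpy as np
from scipy.stats import multivariate_normal

A = np.array([[2.0, 1.0],
              [0.5, 1.5]])          # an arbitrary invertible 2x2 matrix
A_inv = np.linalg.inv(A)

f_X = multivariate_normal(mean=np.zeros(2), cov=np.eye(2)).pdf      # density of X
f_Y_exact = multivariate_normal(mean=np.zeros(2), cov=A @ A.T).pdf  # density of Y = A X

def f_Y_change_of_vars(y):
    x = A_inv @ y                                   # x = g^{-1}(y)
    return f_X(x) * abs(np.linalg.det(A_inv))       # |det(dx/dy)| = |det(A^{-1})|

y = np.array([1.0, -0.5])
print(f_Y_change_of_vars(y), f_Y_exact(y))          # the two values should match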
2.3 Limit Theorem
Strong Law of Large Numbers (SLLN): The sample mean $\bar{X}_n$ converges to the true mean $\mu$ pointwise as $n \to \infty$, w.p. 1 (i.e., with probability 1). In other words, the event $\bar{X}_n \to \mu$ has probability 1.
Weak Law of Large Numbers (WLLN): For all $\varepsilon > 0$, $P(|\bar{X}_n - \mu| > \varepsilon) \to 0$ as $n \to \infty$. (This form of convergence is called convergence in probability.)
The Weak Law of Large Numbers can be proved by using Chebyshev's inequality.
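For completeness, a short derivation (assuming the $X_i$ are i.i.d. with finite variance $\sigma^2$, so that $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$):
\[
P\left(|\bar{X}_n - \mu| > \varepsilon\right) \le \frac{\operatorname{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n \varepsilon^2} \xrightarrow[n \to \infty]{} 0.
\]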
Central Limit Theorem (CLT): As $n \to \infty$, $\sqrt{n}\,\dfrac{\bar{X}_n - \mu}{\sigma} \to \mathcal{N}(0, 1)$ in distribution. In words, the CDF of the left-hand side approaches the CDF of the standard Normal distribution.
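A small simulation can illustrate both statements. The sketch below uses Exponential(1) samples (true mean $\mu = 1$, true standard deviation $\sigma = 1$); the distribution and the sample sizes are arbitrary illustrative choices, and NumPy is assumed to be available.

# Monte Carlo illustration of the LLN and CLT with Exponential(1) samples
# (true mean mu = 1, true std sigma = 1). Sample sizes are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0

# LLN: the sample mean approaches mu as n grows
for n in [10, 1_000, 100_000]:
    x_bar = rng.exponential(scale=1.0, size=n).mean()
    print(f"n={n:>7d}  sample mean = {x_bar:.4f}")

# CLT: sqrt(n) * (X_bar - mu) / sigma is approximately N(0, 1) for large n
n, repeats = 1_000, 10_000
samples = rng.exponential(scale=1.0, size=(repeats, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma
print("mean of z ~ 0:", z.mean().round(3), " std of z ~ 1:", z.std().round(3))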