BGU Deep Reinforcement Learning final examination review
Lecture01: Introduction to RL
0) Terminology
- Markov property: the future state distribution depends only on the present state and action.
- γ: discount factor. Rewards we get in the future are less valuable than rewards we get right now (like money).
- Value function: how good a state is.
- Q-function: how good a state-action pair is.
- Bellman optimality equation: the optimal policy takes the optimal action at every state; the optimal action is the one specified by q_*. Solving the Bellman equation lets us discover π_*.
- Reinforcement learning is a sequential decision-making problem (the output affects the next input).
Features of RL:
- Needs access to the environment
- Jointly learning and planning from correlated samples
- The data distribution changes with the action choices
1) Model based (policy/value iteration): needs the transition matrix.
Cons: if π is deterministic, we will never visit some state-action pairs.
Solution: start episodes at random states.
2) Model free: model q(s,a) directly.
Cons: needs complete episodes; uses Monte Carlo (sampling); cannot evaluate states on their own.
3) Policy iteration: policy evaluation (test the current π, update v) -> policy improvement (find a better π, update π).
Cons: not efficient, because every iteration has to evaluate all states again.
4) Value iteration: evaluates each state only once per sweep.
Policy/value iteration are the most efficient methods in DP (and always converge to the optimal policy). A minimal value-iteration sketch is given below.
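A minimal sketch of tabular value iteration for the model-based case, assuming the transition tensor `P` and reward matrix `R` are known; the shapes and names are illustrative assumptions, not from the lecture:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration on a fully known MDP.

    P: transition tensor of shape (S, A, S), P[s, a, s2] = Pr(s2 | s, a)
    R: expected reward matrix of shape (S, A)
    Returns the converged value function V and a greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s2} P[s, a, s2] * V[s2]
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)          # each state is evaluated only once per sweep
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)         # greedy policy w.r.t. the final Q
```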
5) On-policy: attempts to evaluate/improve the policy that is used to make the decisions (cannot reach everything, but more efficient). E.g., SARSA: the Q target comes from the current policy, r + γ·Q(s', a'); it is a temporal-difference (TD) method that keeps more exploration in the target, typically with ε-greedy action selection.
Pros: takes exploration into account.
6) Off-policy: attempts to evaluate/improve a policy other than the one used to generate the data <target policy; behavior policy>. (Cons: does not test its own policy, slower. Pros: powerful, general, can learn from experts.) E.g., Q-learning: the Q target comes from the greedy (optimal) policy, r + γ·max_a Q(s', a), again using temporal difference.
Target policy: the policy we try to learn.
Behavior policy: the policy that generates the data (stochastic).
Importance sampling: the relative probability of a trajectory under the target and behavior policies.
Why importance sampling: to get an unbiased estimate of the expected rewards under the target policy while sampling from the behavior policy. The two update rules are sketched below.
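A minimal sketch contrasting the two TD targets, assuming a tabular Q stored as a NumPy array indexed `Q[state, action]` (names are illustrative):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the target uses the action a_next actually chosen by the current policy."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: the target uses the greedy action, regardless of what the behavior policy did."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```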
6.5) Monte Carlo
Pros: can learn V and Q directly from the environment; no need for a model of the environment; no need to learn about all states.
Cons: only works for episodic (finite) environments; learns from complete episodes (i.e., no bootstrapping); must wait until the end of the episode to know the return.
Solution: Temporal Difference.
7) Temporal Difference (TD) learning
Pros: can update at each step; combines ideas from both Monte Carlo and Dynamic Programming; bootstraps from the estimate v instead of the real v_π.
TD error: δ_t = R_{t+1} + γ·v(s_{t+1}) − v(s_t)
Assumption: we only have to wait one time step before updating, so v does not change much from one step to the next. A single TD(0) update is sketched below.
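A one-line TD(0) update, sketched under the assumption of a tabular value estimate `V` (e.g., a NumPy array or dict indexed by state):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Move V(s) toward the bootstrapped target R + gamma * V(s')."""
    td_error = r + gamma * V[s_next] - V[s]   # delta_t from the formula above
    V[s] += alpha * td_error
    return td_error
```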
8) ε-greedy algorithm
Exploration: try different actions to find better rewards.
Exploitation: maximize reward by choosing the best known action.
A purely greedy algorithm never explores, so it may get stuck in sub-optimal actions.
With probability (1 − ε) select the best action; with probability ε select a random action (see the sketch below).
Improvement: a good heuristic is to initialize all Q values to a very high (optimistic) value and then let Monte Carlo sampling drive them down gradually, which encourages early exploration.
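A sketch of the selection rule; the function name and `rng` handling are illustrative assumptions:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit
```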
9) Deep Q-learning
Use a function approximator for q: q(s,a;θ), trained with a loss function and gradient descent.
Cons: predictions come from sequential, highly correlated states, and the target is not stationary; the network easily forgets previous experience.
Improvements: fix the Q-target (freeze a target network); experience replay (uniform random replay, or prioritized samples with high TD error or stochastic prioritization). Together these address forgetting and correlation.
High-TD prioritization makes the sampling probability proportional to the TD error, which suits incremental approaches such as SARSA and Q-learning. A sketch of the loss with a frozen target network and a replay buffer is given below.
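A PyTorch-style sketch of the two improvements above, a frozen target network in the loss and a uniform replay buffer; `online_net`/`target_net` are assumed to be Q-networks mapping state batches to per-action values (all names are assumptions):

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    """TD loss with a frozen target network (the 'fix Q-target' trick)."""
    states, actions, rewards, next_states, dones = batch
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                  # target network is not updated here
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * max_next_q
    return F.smooth_l1_loss(q_sa, targets)

class ReplayBuffer:
    """Uniform experience replay: sampling old transitions breaks state correlation."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)
    def push(self, transition):
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```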
10) Double DQN
DQN tends to overestimate Q-values (leading to sub-optimal actions).
Method: two Q-functions that learn from different sets of experiences; decisions are made jointly.
Pros: Double Q-learning is not a full solution to the problem of inaccurate values, but it has been shown to perform well.
11) Dueling DQN
Intuition: the importance of choosing the right action is not equal across all states, so decouple the action from the state.
State value: V(s) | advantage of each action: A(s,a);
Q(s,a) = V(s) + A(s,a), with the advantages centered (e.g., by subtracting their mean) so that V and A stay identifiable.
Pros: keeps a relative ranking of the actions. A sketch of such a network head follows.
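A sketch of a dueling head in PyTorch style; the layer sizes are assumptions, and the mean-centering implements the identifiability trick mentioned above:

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Split the network output into a state value V(s) and per-action advantages A(s, a)."""
    def __init__(self, hidden_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(hidden_dim, 1)               # V(s)
        self.advantage = nn.Linear(hidden_dim, n_actions)   # A(s, a)

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)          # Q(s, a)
```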
Lecture02: Policy Gradient DQN
1) Policy gradients: model the policy directly.
The output is a parameterized policy:
• Discrete action space – a probability for each action
• Continuous action space – the mean and covariance of a Gaussian
Pros: better than ε-greedy because the action probabilities change smoothly, whereas ε-greedy can cause drastic changes in the chosen actions from small changes in values.
Policy gradient uses a learning rate to control the scale of the change.
2) REINFORCE (an improvement on 1)
Each update is based on the return from that point onward and the probability of having taken the action;
all updates are made after the trajectory is completed.
3) REINFORCE with baseline
Why: because of our use of sampling, the REINFORCE algorithm can fluctuate and be slow to converge (i.e., high variance).
Intuition: subtract a value that reduces the size of the update (the baseline does not change with the action).
State-value baseline: v̂(s, w)
Policy update signal: δ = G_t − v̂(s, w); this δ is used to back-propagate the gradient.
Typically converges faster than plain REINFORCE; a sketch of the combined loss is given below.
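A sketch of the per-trajectory losses, assuming PyTorch tensors `log_probs` (log π(a_t|s_t)), `returns` (G_t), and `values` (v̂(s_t, w)) collected over one episode (names are assumptions):

```python
import torch
import torch.nn.functional as F

def reinforce_with_baseline_loss(log_probs, returns, values):
    """Policy loss weighted by delta = G_t - v_hat(s_t, w), plus a regression loss for the baseline."""
    delta = returns - values.detach()            # baseline reduces variance, not bias
    policy_loss = -(delta * log_probs).sum()     # gradient ascent on expected return
    value_loss = F.mse_loss(values, returns)     # fit the baseline toward the observed returns
    return policy_loss + value_loss
```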
4) Summary of policy gradients:
Pros: model the probability of taking an action; balance exploration/exploitation; handle continuous state spaces; sometimes easier than learning a value function.
Cons: require plenty of sampling; slow to converge.
5) Actor-Critic
Intuition: combine an actor and a critic.
Actor-only methods rely on sampling/simulation. Cons: large variance, no accumulation of old information.
Critic-only methods rely on a value function and solve the Bellman equation. Pros: can approach optimality. Cons: no guarantee that the resulting policy is near-optimal.
The actor attempts to improve the policy:
- can be done greedily (e.g., from Q-functions), mostly applicable to small state spaces;
- can be done using policy gradients.
The critic evaluates the current policy:
- can use a different policy to evaluate Q-functions (target vs. behavior policies);
- bootstrapping methods can reduce the variance, as in REINFORCE with baseline.
E.g., one-step actor-critic uses a one-step TD error (sketched below).
Pros: can be applied to continuous or very large action spaces.
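A sketch of the one-step actor-critic losses for a single transition, in the same PyTorch style (all names are illustrative):

```python
import torch

def one_step_actor_critic_loss(log_prob, value, next_value, reward, gamma=0.99, done=False):
    """The one-step TD error drives both the actor and the critic."""
    bootstrap = 0.0 if done else gamma * next_value.detach()
    td_error = reward + bootstrap - value              # critic's TD error
    actor_loss = -td_error.detach() * log_prob         # push up log-prob of better-than-expected actions
    critic_loss = td_error.pow(2)                      # regress the value toward the TD target
    return actor_loss + critic_loss
```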
6) A3C
Intuition: an asynchronous version of the actor-critic algorithm.
Method: multiple actor-learner workers operate simultaneously and asynchronously, all updating a single shared network that produces the value function.
Pros: the use of multiple agents can increase exploration, since they can be initialized with differing policies.
Lecture03. Imitation learning
0) Terminology: regret is the difference between π* and π. Maximizing rewards = minimizing regret.
Why imitation (intuition): applying RL directly is difficult in these situations (nonlinear dynamics, few training samples, highly correlated states).
It is also sometimes difficult to define similarity between similar states.
Solution: convert the problem into prediction; maximize rewards rather than learn a value function.
Suitable for: navigation, autonomous driving, helicopter flight, question answering.
Problems: with plain supervised learning it is hard to recover from failures, and hard to respond to unseen states.
Notations:
d_t^π — the state distribution induced by following π from time 1 to t−1.
C(s,a) — the immediate cost of performing action a in state s.
1) Apprenticeship learning:
Algorithm: record expert data -> learn a dynamics model -> find a near-optimal policy by reinforcement learning -> test. (The dynamics of the environment and the transition matrix are learned together.)
Training: we can use regularized linear regression over the explored trajectories to calculate the utility, where utility is defined as the average sum of rewards.
Cons: completely greedy, only exploitation; cannot cope with unexplored states; cannot recover from errors.
Pros: suitable for cases where exploration is dangerous/costly (e.g., autonomous driving).
Compared with imitation: imitation assumes an unknown and complex world dynamics; it is slower but more general.
2) Supervised learning:
Approach: reduce the sequential problem into many decoupled supervised learning problems and use a single stationary policy throughout the trajectory.
Objective: find a policy π that minimizes the observed surrogate loss.
Intuition for why it fails: once the classifier makes a mistake, it finds itself in a previously unseen state, and from that point on every action may be a mistake. The mistakes compound, with each step potentially incurring the maximal loss, so the loss grows super-linearly (roughly O(T²ε)).
3) Forward training:
Algorithm: sample trajectories generated by the latest policy -> ask the expert to provide state-action labels -> train a new classifier to provide the policy for the next step -> use that policy to advance one step forward.
Cons: the algorithm requires iterating over all T steps; if T is large, it is impractical.
Pros: near-linear regret; because we train on the states that are actually visited, the algorithm can make good decisions there; the input from the expert enables it to recover from mistakes.
Analysis: the regret of this algorithm is bounded by O(uTε), i.e., near-linear, where u represents the divergence from the optimal policy.
If the divergence is maximal at every step, the bound degrades to O(T²ε) – like supervised learning.
4) DAgger (dataset aggregation)
DAgger is a stationary algorithm (the policy is not updated while a trajectory is running).
Algorithm: train π on D -> run π and collect the visited states as D_new -> ask the expert to label D_new -> D = D ∪ D_new -> train π on D, and repeat.
Pros: faster and more efficient; updates once per trajectory; near-linear regret (low regret bound); works well for both simple and complex problems. A minimal loop is sketched below.
Cons: there may be a huge difference between the learner and the expert.
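A minimal DAgger loop, sketched with hypothetical helpers `rollout`, `expert`, and `train_classifier` (none of these come from the lecture):

```python
def dagger(expert, train_classifier, rollout, n_iterations=10):
    """Aggregate expert labels on the states the learner's own policy actually visits."""
    dataset = []                                      # D
    policy = expert                                   # bootstrap the first rollout
    for _ in range(n_iterations):
        states = rollout(policy)                      # run current pi, collect D_new states
        dataset += [(s, expert(s)) for s in states]   # ask the expert to label them
        policy = train_classifier(dataset)            # retrain pi on the aggregated D
    return policy
```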
With coaching: balance between the expert's action and actions within the policy's capability, i.e., pick hope actions that are both good and reachable:
π_i(s) = argmax_{a∈A} ( λ_i · score_{π_i}(s,a) − L(s,a) )
where λ_i specifies how close the coach is to the expert (λ_i ≥ 0), score_{π_i}(s,a) is the likelihood of choosing action a in state s, and L(s,a) is the immediate loss.
Lecture04. Multi-armed bandits
0) general
Why: learn the different (fixed) reward distribution of each arm.
Target: maximize profit, or equivalently minimize regret, across all plays.
Expected cumulative regret (optimal minus our strategy):
E[Reg_n] = n · R* − Σ_{i=1}^{n} E[r_i]
Application: recommendation systems.
EE-problem: Exploration (the discovery of new information) vs. Exploitation (the use of information the agent already has).
Algorithms:
Context-free: ε-greedy, UCB1
Contextual: LinUCB, Thompson sampling
1) ε-greedy approach (simple, stationary):
Method: with probability ε explore, otherwise exploit (a constant step-size parameter can be used to give greater weight to more recent rewards).
Cons: not elegant — the algorithm explicitly distinguishes between exploration and exploitation; suboptimal exploration: when exploring, every arm is equally likely to be picked; less effective at dealing with context.
Solution: explore arms we are less confident about (fixes the "every arm equally likely" problem).
2) Upper Confidence Bound (UCB):
Solves: ε-greedy explores all arms equally.
Intuition: explore the arms we are less confident about.
A_t = argmax_a [ Q_t(a) + c · sqrt( ln t / N_t(a) ) ]
A_t = value term + exploration term; pull the arm that maximizes A_t (highest value plus highest uncertainty).
Cons: nonstationary problems make the method more complex; with a large state space, this kind of action selection is not practical.
Pros: obtains higher rewards; near-optimal regret. A minimal selection rule is sketched below.
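A sketch of the UCB1 selection rule, assuming per-arm pull counts and running mean reward estimates (names are illustrative):

```python
import numpy as np

def ucb1_select(counts, means, t, c=2.0):
    """Pick argmax of mean reward + exploration bonus; try each arm once first."""
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])                   # every arm must be pulled at least once
    bonus = c * np.sqrt(np.log(t) / counts)      # shrinks as N_t(a) grows
    return int(np.argmax(means + bonus))
```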
3) Contextual bandits (model the state)
Example: the same arm yields different rewards for different contexts (states).
4) LinUCB algorithm
Intuition: the expected reward of each arm is modeled as a linear function of the context.
The confidence interval used for exploration is, in effect, a standard deviation around the linear estimate.
Computational complexity: linear in the number of arms; at most cubic in the number of features.
Pros: works well for a dynamic arm set (arms come and go).
Suitable for: article recommendation.
5) Thompson sampling
Intuition: a simple, natural Bayesian heuristic (maintain a belief, i.e., a distribution, over the unknown parameters); assumes that the reward distribution of every arm is fixed, though unknown.
Play each arm according to its posterior probability of being the best.
Beta distribution: the shape of the Beta posterior is defined by a) how many times we "won" and b) how many times we "lost", which is why it suits the exploration/exploitation trade-off. As the win count grows relative to the loss count, more probability mass moves toward larger values, so the sampling distribution changes as evidence accumulates.
Algorithm (for each round):
- Sample a random variable X from each arm's Beta distribution
- Select the arm with the largest X
- Observe the result of the selected arm
- Update the Beta posterior of the selected arm
Pros: near-linear regret. A one-round sketch follows.
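A one-round sketch with Beta posteriors over Bernoulli rewards; `pull_arm` is a hypothetical environment callback returning 0 or 1, and `wins`/`losses` are assumed NumPy arrays with one entry per arm:

```python
import numpy as np

def thompson_round(wins, losses, pull_arm, rng=None):
    """Sample from each arm's Beta(1 + wins, 1 + losses) posterior and play the best sample."""
    rng = rng or np.random.default_rng()
    samples = rng.beta(1 + wins, 1 + losses)   # one draw per arm
    arm = int(np.argmax(samples))
    reward = pull_arm(arm)                     # observe the 0/1 result of the selected arm
    wins[arm] += reward                        # update the posterior of the selected arm only
    losses[arm] += 1 - reward
    return arm, reward
```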
Lecture05. Monte Carlo tree search, AlphaGo, AutoML
0) General
Selection: pick children with an Upper Confidence Bound (UCB) rule, combining exploration and exploitation -> Expansion: on reaching a leaf (end of game or an uncharted state), initialize it (e.g., with a rollout policy) -> Simulation: play to the end of the game and obtain a score (more than one simulation may be run) -> Backpropagation: update Q(s,a) along the visited path.
Exploration is usually given a larger weight at the beginning and then reduced over time.
1) MCTS
Pros: requires no domain-specific knowledge; can be interrupted at any time.
Cons: has difficulty handling sacrifices well (short-term losses for long-term gain).
Applications / improvements:
- Combine MCTS with a policy
- Combine MCTS with a value function
A sketch of the UCB selection step is given below.
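A sketch of the UCB-based selection step on a tree node; the `visits`, `total_value`, and `children` attributes are assumed node fields, not part of the lecture material:

```python
import math

def uct_select(node, c=1.4):
    """Choose the child with the best exploit + explore score (unvisited children first)."""
    def score(child):
        if child.visits == 0:
            return float("inf")
        exploit = child.total_value / child.visits
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=score)
```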
2) Applying Monte Carlo to Go
Perfect-information games: chess, checkers, Go. They have an optimal value function.
b: game breadth (number of legal moves per position)
d: game depth (length of the game)
2.2) Combining MCTS with a policy
The policy of the software is defined using a set of hand-crafted features.
Once an expanded node has had a sufficient number of simulations, the policy is used to determine whether it should remain in the tree or be "pruned off".
This approach slows the growth of the tree while focusing its "attention" on high-probability moves.
2.3) Combine MCTS with a Value Function
The 𝑄 function is defined as
𝑄_𝑅𝐿𝐺𝑂 (𝑠,𝑎)=𝜎(𝜙(𝑠,𝑎)^𝑇 𝜃)
where the combination of 𝜙 (binary features) and 𝜃
(weights) defines the value function. The function 𝜎
maps the value function to an estimated probability of
winning the game. Instead of setting a random policy for
the sampling process, the authors experimented with
three options: 𝜖-greedy policy; Greedy policy with a
noisy value function; A softmax distribution with a
temperature parameter.
3) AlphaGo:
Supervised learning (SL) policy network: recommends good actions by predicting expert moves; trained to maximize the log-likelihood of taking the 'human' action (about 3 weeks on 50 GPUs).
RL policy network (outputs an action): improves the SL network by playing against itself; replays are used to determine the gradients. (30 GPUs, a