BGU Deep Reinforcement Learning final examination review
Lecture01: Introduction to RL
0) Terminology
- Markov property: the future state distribution depends only on the present state and action.
- γ: discount factor. Rewards we get in the future are less valuable than rewards we get right now (like money).
- Value function: how good a state is.
- Q-function: how good a state-action pair is.
- Bellman optimality equation: the optimal policy takes the optimal action at every state; the optimal action is the one specified by q_*. Solving the Bellman equation lets us discover π_*.
- Reinforcement learning is a sequential decision-making problem (the output affects the next input).
Features of RL:
- Needs access to the environment
- Jointly learning and planning from correlated samples
- The data distribution changes with the action choices
1) Model based (policy/value iteration): needs the transition matrix.
Cons: if π is deterministic, we will never visit some state-action pairs.
Solution: start episodes at random states.
2) Model free: model q(s,a) directly.
Cons: needs complete episodes; uses Monte Carlo (sampling); cannot evaluate states on their own.
3) Policy iteration: policy evaluation (test the current π, update v) -> policy improvement (find a better π, update π).
Cons: not efficient, because every iteration has to evaluate all states again.
4) Value iteration: evaluates each state only once per sweep.
Policy/value iteration are the most efficient methods in DP (and always converge to the optimal policy). A minimal value-iteration sketch is given below.
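A minimal sketch of tabular value iteration for the model-based case, assuming the transition tensor `P` and reward matrix `R` are known; the shapes and names are illustrative assumptions, not from the lecture:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration on a fully known MDP.

    P: transition tensor of shape (S, A, S), P[s, a, s2] = Pr(s2 | s, a)
    R: expected reward matrix of shape (S, A)
    Returns the converged value function V and a greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s2} P[s, a, s2] * V[s2]
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)          # each state is evaluated only once per sweep
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)         # greedy policy w.r.t. the final Q
```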
5) On-policy: attempts to evaluate/improve the policy that is used to make the decisions (cannot reach everything, but more efficient). E.g., SARSA: the Q target comes from the current policy, r + γ·Q(s', a'); it is a temporal-difference (TD) method that keeps more exploration in the target, typically with ε-greedy action selection.
Pros: takes exploration into account.
6) Off-policy: attempts to evaluate/improve a policy other than the one used to generate the data <target policy; behavior policy>. (Cons: does not test its own policy, slower. Pros: powerful, general, can learn from experts.) E.g., Q-learning: the Q target comes from the greedy (optimal) policy, r + γ·max_a Q(s', a), again using temporal difference.
Target policy: the policy we try to learn.
Behavior policy: the policy that generates the data (stochastic).
Importance sampling: the relative probability of a trajectory under the target and behavior policies.
Why importance sampling: to get an unbiased estimate of the expected rewards under the target policy while sampling from the behavior policy. The two update rules are sketched below.
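A minimal sketch contrasting the two TD targets, assuming a tabular Q stored as a NumPy array indexed `Q[state, action]` (names are illustrative):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the target uses the action a_next actually chosen by the current policy."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy: the target uses the greedy action, regardless of what the behavior policy did."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```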
6.5) Monte Carlo
Pros: can learn V and Q directly from the environment; no need for a model of the environment; no need to learn about all states.
Cons: only works for episodic (finite) environments; learns from complete episodes (i.e., no bootstrapping); must wait until the end of the episode to know the return.
Solution: Temporal Difference.
7) Temporal Difference (TD) learning
Pros: can update at each step; combines ideas from both Monte Carlo and Dynamic Programming; bootstraps from the estimate v instead of the real v_π.
TD error: δ_t = R_{t+1} + γ·v(s_{t+1}) − v(s_t)
Assumption: we only have to wait one time step before updating, so v does not change much from one step to the next. A single TD(0) update is sketched below.
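A one-line TD(0) update, sketched under the assumption of a tabular value estimate `V` (e.g., a NumPy array or dict indexed by state):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Move V(s) toward the bootstrapped target R + gamma * V(s')."""
    td_error = r + gamma * V[s_next] - V[s]   # delta_t from the formula above
    V[s] += alpha * td_error
    return td_error
```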
8) ε-greedy algorithm
Exploration: try different actions to find better rewards.
Exploitation: maximize reward by choosing the best known action.
A purely greedy algorithm never explores, so it may get stuck in sub-optimal actions.
With probability (1 − ε) select the best action; with probability ε select a random action (see the sketch below).
Improvement: a good heuristic is to initialize all Q values to a very high (optimistic) value and then let Monte Carlo sampling drive them down gradually, which encourages early exploration.
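A sketch of the selection rule; the function name and `rng` handling are illustrative assumptions:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit
```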
9) Deep Q-learning
Use a function approximator for q: q(s,a;θ), trained with a loss function and gradient descent.
Cons: predictions come from sequential, highly correlated states, and the target is not stationary; the network easily forgets previous experience.
Improvements: fix the Q-target (freeze a target network); experience replay (uniform random replay, or prioritized samples with high TD error or stochastic prioritization). Together these address forgetting and correlation.
High-TD prioritization makes the sampling probability proportional to the TD error, which suits incremental approaches such as SARSA and Q-learning. A sketch of the loss with a frozen target network and a replay buffer is given below.
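A PyTorch-style sketch of the two improvements above, a frozen target network in the loss and a uniform replay buffer; `online_net`/`target_net` are assumed to be Q-networks mapping state batches to per-action values (all names are assumptions):

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    """TD loss with a frozen target network (the 'fix Q-target' trick)."""
    states, actions, rewards, next_states, dones = batch
    q_sa = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                  # target network is not updated here
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * max_next_q
    return F.smooth_l1_loss(q_sa, targets)

class ReplayBuffer:
    """Uniform experience replay: sampling old transitions breaks state correlation."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)
    def push(self, transition):
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```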
10) Double DQN
DQN tends to overestimate Q-values (leading to sub-optimal actions).
Method: two Q-functions that learn from different sets of experiences; decisions are made jointly.
Pros: Double Q-learning is not a full solution to the problem of inaccurate values, but it has been shown to perform well.
11) Dueling DQN
Intuition: the importance of choosing the right action is not equal across all states, so decouple the action from the state.
State value: V(s) | advantage of each action: A(s,a);
Q(s,a) = V(s) + A(s,a), with the advantages centered (e.g., by subtracting their mean) so that V and A stay identifiable.
Pros: keeps a relative ranking of the actions. A sketch of such a network head follows.
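A sketch of a dueling head in PyTorch style; the layer sizes are assumptions, and the mean-centering implements the identifiability trick mentioned above:

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Split the network output into a state value V(s) and per-action advantages A(s, a)."""
    def __init__(self, hidden_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(hidden_dim, 1)               # V(s)
        self.advantage = nn.Linear(hidden_dim, n_actions)   # A(s, a)

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)          # Q(s, a)
```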
Lecture02: Policy Gradient DQN
1) Policy gradients: model the policy directly.
The output is a parameterized policy:
• Discrete action space – a probability for each action
• Continuous action space – the mean and covariance of a Gaussian
Pros: better than ε-greedy because the action probabilities change smoothly, whereas ε-greedy can cause drastic changes in the chosen actions from small changes in values.
Policy gradient uses a learning rate to control the scale of the change.
2) REINFORCE (an improvement on 1)
Each update is based on the return from that point onward and the probability of having taken the action;
all updates are made after the trajectory is completed.
3) REINFORCE with baseline
Why: because of our use of sampling, the REINFORCE algorithm can fluctuate and be slow to converge (i.e., high variance).
Intuition: subtract a value that reduces the size of the update (the baseline does not change with the action).
State-value baseline: v̂(s, w)
Policy update signal: δ = G_t − v̂(s, w); this δ is used to back-propagate the gradient.
Typically converges faster than plain REINFORCE; a sketch of the combined loss is given below.
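A sketch of the per-trajectory losses, assuming PyTorch tensors `log_probs` (log π(a_t|s_t)), `returns` (G_t), and `values` (v̂(s_t, w)) collected over one episode (names are assumptions):

```python
import torch
import torch.nn.functional as F

def reinforce_with_baseline_loss(log_probs, returns, values):
    """Policy loss weighted by delta = G_t - v_hat(s_t, w), plus a regression loss for the baseline."""
    delta = returns - values.detach()            # baseline reduces variance, not bias
    policy_loss = -(delta * log_probs).sum()     # gradient ascent on expected return
    value_loss = F.mse_loss(values, returns)     # fit the baseline toward the observed returns
    return policy_loss + value_loss
```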
4) Summary of policy gradients:
Pros: model the probability of taking an action; balance exploration/exploitation; handle continuous state spaces; sometimes easier than learning a value function.
Cons: require plenty of sampling; slow to converge.
5) Actor-Critic
Intuition: combine an actor and a critic.
Actor-only methods rely on sampling/simulation. Cons: large variance, no accumulation of old information.
Critic-only methods rely on a value function and solve the Bellman equation. Pros: can approach optimality. Cons: no guarantee that the resulting policy is near-optimal.
The actor attempts to improve the policy:
- can be done greedily (e.g., from Q-functions), mostly applicable to small state spaces;
- can be done using policy gradients.
The critic evaluates the current policy:
- can use a different policy to evaluate Q-functions (target vs. behavior policies);
- bootstrapping methods can reduce the variance, as in REINFORCE with baseline.
E.g., one-step actor-critic uses a one-step TD error (sketched below).
Pros: can be applied to continuous or very large action spaces.
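A sketch of the one-step actor-critic losses for a single transition, in the same PyTorch style (all names are illustrative):

```python
import torch

def one_step_actor_critic_loss(log_prob, value, next_value, reward, gamma=0.99, done=False):
    """The one-step TD error drives both the actor and the critic."""
    bootstrap = 0.0 if done else gamma * next_value.detach()
    td_error = reward + bootstrap - value              # critic's TD error
    actor_loss = -td_error.detach() * log_prob         # push up log-prob of better-than-expected actions
    critic_loss = td_error.pow(2)                      # regress the value toward the TD target
    return actor_loss + critic_loss
```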
6) A3C
Intuition: an asynchronous version of the actor-critic algorithm.
Method: multiple actor-learner workers operate simultaneously and asynchronously, all updating a single shared network that produces the value function.
Pros: the use of multiple agents can increase exploration, since they can be initialized with differing policies.
Lecture03. Imitation learning
0) Terminology: regret is the difference between π* and π. Maximizing rewards = minimizing regret.
Why imitation (intuition): applying RL directly is difficult in these situations (nonlinear dynamics, few training samples, highly correlated states).
It is also sometimes difficult to define similarity between similar states.
Solution: convert the problem into prediction; maximize rewards rather than learn a value function.
Suitable for: navigation, autonomous driving, helicopter flight, question answering.
Problems: with plain supervised learning it is hard to recover from failures, and hard to respond to unseen states.
Notations:
d_t^π — the state distribution induced by following π from time 1 to t−1.
C(s,a) — the immediate cost of performing action a in state s.
1) Apprenticeship learning:
Algorithm: record expert data -> learn a dynamics model -> find a near-optimal policy by reinforcement learning -> test. (The dynamics of the environment and the transition matrix are learned together.)
Training: we can use regularized linear regression over the explored trajectories to calculate the utility, where utility is defined as the average sum of rewards.
Cons: completely greedy, only exploitation; cannot cope with unexplored states; cannot recover from errors.
Pros: suitable for cases where exploration is dangerous/costly (e.g., autonomous driving).
Compared with imitation: imitation assumes an unknown and complex world dynamics; it is slower but more general.
2) Supervised learning:
Approach: reduce the sequential problem into many decoupled supervised learning problems and use a single stationary policy throughout the trajectory.
Objective: find a policy π that minimizes the observed surrogate loss.
Intuition for why it fails: once the classifier makes a mistake, it finds itself in a previously unseen state, and from that point on every action may be a mistake. The mistakes compound, with each step potentially incurring the maximal loss, so the loss grows super-linearly (roughly O(T²ε)).
3) Forward training:
Algorithm: sample trajectories generated by the latest policy -> ask the expert to provide state-action labels -> train a new classifier to provide the policy for the next step -> use that policy to advance one step forward.
Cons: the algorithm requires iterating over all T steps; if T is large, it is impractical.
Pros: near-linear regret; because we train on the states that are actually visited, the algorithm can make good decisions there; the input from the expert enables it to recover from mistakes.
Analysis: the regret of this algorithm is bounded by O(uTε), i.e., near-linear, where u represents the divergence from the optimal policy.
If the divergence is maximal at every step, the bound degrades to O(T²ε) – like supervised learning.
4) DAgger (dataset aggregation)
DAgger is a stationary algorithm (the policy is not updated while a trajectory is running).
Algorithm: train π on D -> run π and collect the visited states as D_new -> ask the expert to label D_new -> D = D ∪ D_new -> train π on D, and repeat.
Pros: faster and more efficient; updates once per trajectory; near-linear regret (low regret bound); works well for both simple and complex problems. A minimal loop is sketched below.
Cons: there may be a huge difference between the learner and the expert.
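A minimal DAgger loop, sketched with hypothetical helpers `rollout`, `expert`, and `train_classifier` (none of these come from the lecture):

```python
def dagger(expert, train_classifier, rollout, n_iterations=10):
    """Aggregate expert labels on the states the learner's own policy actually visits."""
    dataset = []                                      # D
    policy = expert                                   # bootstrap the first rollout
    for _ in range(n_iterations):
        states = rollout(policy)                      # run current pi, collect D_new states
        dataset += [(s, expert(s)) for s in states]   # ask the expert to label them
        policy = train_classifier(dataset)            # retrain pi on the aggregated D
    return policy
```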
With coaching: balance between the expert's action and actions within the policy's capability, i.e., pick hope actions that are both good and reachable:
π_i(s) = argmax_{a∈A} ( λ_i · score_{π_i}(s,a) − L(s,a) )
where λ_i specifies how close the coach is to the expert (λ_i ≥ 0), score_{π_i}(s,a) is the likelihood of choosing action a in state s, and L(s,a) is the immediate loss.
Lecture04. Multi-armed bandits
0) general
Why: learn the different (fixed) reward distribution of each arm.
Target: maximize profit, or equivalently minimize regret, across all plays.
Expected cumulative regret (optimal minus our strategy):
E[Reg_n] = n · R* − Σ_{i=1}^{n} E[r_i]
Application: recommendation systems.
EE-problem: Exploration (the discovery of new information) vs. Exploitation (the use of information the agent already has).
Algorithms:
Context-free: ε-greedy, UCB1
Contextual: LinUCB, Thompson sampling
1) ε-greedy approach (simple, stationary):
Method: with probability ε explore, otherwise exploit (a constant step-size parameter can be used to give greater weight to more recent rewards).
Cons: not elegant — the algorithm explicitly distinguishes between exploration and exploitation; suboptimal exploration: when exploring, every arm is equally likely to be picked; less effective at dealing with context.
Solution: explore arms we are less confident about (fixes the "every arm equally likely" problem).
2) Upper Confidence Bound (UCB):
Solves: ε-greedy explores all arms equally.
Intuition: explore the arms we are less confident about.
A_t = argmax_a [ Q_t(a) + c · sqrt( ln t / N_t(a) ) ]
A_t = value term + exploration term; pull the arm that maximizes A_t (highest value plus highest uncertainty).
Cons: nonstationary problems make the method more complex; with a large state space, this kind of action selection is not practical.
Pros: obtains higher rewards; near-optimal regret. A minimal selection rule is sketched below.
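A sketch of the UCB1 selection rule, assuming per-arm pull counts and running mean reward estimates (names are illustrative):

```python
import numpy as np

def ucb1_select(counts, means, t, c=2.0):
    """Pick argmax of mean reward + exploration bonus; try each arm once first."""
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])                   # every arm must be pulled at least once
    bonus = c * np.sqrt(np.log(t) / counts)      # shrinks as N_t(a) grows
    return int(np.argmax(means + bonus))
```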
3) Contextual bandits (model the state)
Example: the same arm yields different rewards for different contexts (states).
4) LinUCB algorithm
Intuition: the expected reward of each arm is modeled as a linear function of the context.
The confidence interval used for exploration is, in effect, a standard deviation around the linear estimate.
Computational complexity: linear in the number of arms; at most cubic in the number of features.
Pros: works well for a dynamic arm set (arms come and go).
Suitable for: article recommendation.
5) Thompson sampling
Intuition: a simple, natural Bayesian heuristic (maintain a belief, i.e., a distribution, over the unknown parameters); assumes that the reward distribution of every arm is fixed, though unknown.
Play each arm according to its posterior probability of being the best.
Beta distribution: the shape of the Beta posterior is defined by a) how many times we "won" and b) how many times we "lost", which is why it suits the exploration/exploitation trade-off. As the win count grows relative to the loss count, more probability mass moves toward larger values, so the sampling distribution changes as evidence accumulates.
Algorithm (for each round):
- Sample a random variable X from each arm's Beta distribution
- Select the arm with the largest X
- Observe the result of the selected arm
- Update the Beta posterior of the selected arm
Pros: near-linear regret. A one-round sketch follows.
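A one-round sketch with Beta posteriors over Bernoulli rewards; `pull_arm` is a hypothetical environment callback returning 0 or 1, and `wins`/`losses` are assumed NumPy arrays with one entry per arm:

```python
import numpy as np

def thompson_round(wins, losses, pull_arm, rng=None):
    """Sample from each arm's Beta(1 + wins, 1 + losses) posterior and play the best sample."""
    rng = rng or np.random.default_rng()
    samples = rng.beta(1 + wins, 1 + losses)   # one draw per arm
    arm = int(np.argmax(samples))
    reward = pull_arm(arm)                     # observe the 0/1 result of the selected arm
    wins[arm] += reward                        # update the posterior of the selected arm only
    losses[arm] += 1 - reward
    return arm, reward
```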
Lecture05. Monte Carlo tree search, AlphaGo, AutoML
0) General
Selection: pick children with an Upper Confidence Bound (UCB) rule, combining exploration and exploitation -> Expansion: on reaching a leaf (end of game or an uncharted state), initialize it (e.g., with a rollout policy) -> Simulation: play to the end of the game and obtain a score (more than one simulation may be run) -> Backpropagation: update Q(s,a) along the visited path.
Exploration is usually given a larger weight at the beginning and then reduced over time.
1) MCTS
Pros: requires no domain-specific knowledge; can be interrupted at any time.
Cons: has difficulty handling sacrifices well (short-term losses for long-term gain).
Applications / improvements:
- Combine MCTS with a policy
- Combine MCTS with a value function
A sketch of the UCB selection step is given below.
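A sketch of the UCB-based selection step on a tree node; the `visits`, `total_value`, and `children` attributes are assumed node fields, not part of the lecture material:

```python
import math

def uct_select(node, c=1.4):
    """Choose the child with the best exploit + explore score (unvisited children first)."""
    def score(child):
        if child.visits == 0:
            return float("inf")
        exploit = child.total_value / child.visits
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore
    return max(node.children, key=score)
```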
2) Applying Monte Carlo to Go
Perfect-information games: chess, checkers, Go. They have an optimal value function.
b: game breadth (number of legal moves per position)
d: game depth (length of the game)
2.2) Combining MCTS with a policy
The policy of the software is defined using a set of hand-crafted features.
Once an expanded node has had a sufficient number of simulations, the policy is used to determine whether it should remain in the tree or be "pruned off".
This approach slows the growth of the tree while focusing its "attention" on high-probability moves.
2.3) Combine MCTS with a Value Function
The 𝑄 function is defined as
𝑄_𝑅𝐿𝐺𝑂 (𝑠,𝑎)=𝜎(𝜙(𝑠,𝑎)^𝑇 𝜃)
where the combination of 𝜙 (binary features) and 𝜃
(weights) defines the value function. The function 𝜎
maps the value function to an estimated probability of
winning the game. Instead of setting a random policy for
the sampling process, the authors experimented with
three options: 𝜖-greedy policy; Greedy policy with a
noisy value function; A softmax distribution with a
temperature parameter.
3) AlphaGo:
Supervised learning (SL) policy network: recommends good actions by predicting expert moves; trained to maximize the log-likelihood of taking the 'human' action (about 3 weeks on 50 GPUs).
RL policy network (outputs an action): improves the SL network by playing against itself; replays are used to determine the gradients. (30 GPUs, a