Reinforcement Learning Algorithms in Markov Decision Processes
AAAI-10 Tutorial
Part III: Learning to control
Csaba Szepesvári and Richard S. Sutton
University of Alberta
E-mails: {szepesva,rsutton}@ualberta.ca
Atlanta, July 11, 2010
Contributions
[Figure: probability density functions over the action space (0 to 1, with 0.7 and 0.9 marked): the behavior policy, the target policy with recognizer, and the target policy w/o recognizer.]
Off-policy learning with options and recognizers
Doina Precup, Richard S. Sutton, Cosmin Paduraru, Anna J. Koop, Satinder Singh
McGill University, University of Alberta, University of Michigan
Sections: Ideas and Motivation · Background · Recognizers · Off-policy algorithm for options · Learning w/o the Behavior Policy

Options
• A way of behaving for a period of time, consisting of:
  ! a policy
  ! a stopping condition
Models of options
• A predictive model of the outcome of following the option
• What state will you be in?
• Will you still control the ball?
• What will be the value of some feature?
• Will your teammate receive the pass?
• What will be the expected total reward along the way?
• How long can you keep control of the ball?
Options for soccer players could be: Dribble, Keepaway, Pass.
Options in a 2D world
[Figure: a 2D world with a wall and a distinguished region; an experienced trajectory is shown along with the red and blue options.]
The red and blue options are mostly executed. Surely we should be able to learn about them from this experience!
Off-policy learning
• Learning about one policy while behaving according to another
• Needed for RL w/exploration (as in Q-learning)
• Needed for learning abstract models of dynamical systems
(representing world knowledge)
• Enables efficient exploration
• Enables learning about many ways of behaving at the same time
(learning models of options)
Non-sequential example

Problem formulation w/o recognizers
• One state
• Continuous action a ∈ [0, 1]
• Outcomes z_i = a_i
• Given samples from a policy b : [0, 1] → ℝ⁺
• Would like to estimate the mean outcome for a sub-region of the action space, here a ∈ [0.7, 0.9]

The target policy π : [0, 1] → ℝ⁺ is uniform within the region of interest (see the dashed line in the figure of density functions above). The estimator is:

    \hat{m}_\pi = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i)}{b(a_i)} \, z_i .
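To make the estimator concrete, here is a minimal Python sketch of this one-state example. The specific behavior density (a Beta(2, 2) distribution) and the sample size are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi_target(a, lo=0.7, hi=0.9):
    """Target policy density: uniform on [lo, hi], zero elsewhere."""
    return np.where((a >= lo) & (a <= hi), 1.0 / (hi - lo), 0.0)

def b_density(a):
    """Behavior policy density on [0, 1]; Beta(2, 2) is assumed for this demo."""
    return 6.0 * a * (1.0 - a)

n = 500
a = rng.beta(2.0, 2.0, size=n)   # sample actions from b
z = a                            # outcomes z_i = a_i

# Importance-sampling estimate: m_hat = (1/n) * sum_i pi(a_i)/b(a_i) * z_i
m_hat = np.mean(pi_target(a) / b_density(a) * z)
print(m_hat)   # close to 0.8, the mean outcome under the uniform target policy
```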
Theorem 1. Let A = {a_1, . . . , a_k} ⊆ A be a subset of all the possible actions. Consider a fixed behavior policy b and let π_A be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer c_A is the policy with minimum-variance one-step importance sampling corrections, among those in π_A:

    \pi \text{ as given by (1)} = \arg\min_{p \in \pi_A} E_b\!\left[ \left( \frac{p(a_i)}{b(a_i)} \right)^{2} \right] \qquad (2)

Proof: Using Lagrange multipliers.
Theorem 2. Consider two binary recognizers c_1 and c_2 such that µ_1 > µ_2. Then the importance sampling corrections for c_1 have lower variance than the importance sampling corrections for c_2.
Off-policy learning

Let the importance sampling ratio at time step t be:

    \rho_t = \frac{\pi(s_t, a_t)}{b(s_t, a_t)}

The truncated n-step return, R_t^{(n)}, satisfies:

    R_t^{(n)} = \rho_t \left[ r_{t+1} + (1 - \beta_{t+1}) R_{t+1}^{(n-1)} \right].

The update to the parameter vector is proportional to:

    \Delta\theta_t = \left[ R_t^{\lambda} - y_t \right] \nabla_\theta y_t \, \rho_0 (1 - \beta_1) \cdots \rho_{t-1} (1 - \beta_t).
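As a rough illustration (not from the poster), the recursion and the correction product can be computed for a recorded trajectory as follows; taking R_t^{(0)} = y_t as the base case is an assumption of this sketch, in the spirit of TD bootstrapping.

```python
import numpy as np

def R_n(t, n, rho, r, beta, y):
    """Off-policy truncated n-step return R_t^{(n)}.

    rho[t] = pi(s_t, a_t)/b(s_t, a_t), r[t] = r_{t+1}, beta[t] = beta_{t+1},
    y[t] = current model prediction at s_t (one entry per visited state,
    t = 0..T). Base case R_t^{(0)} = y[t] is assumed.
    """
    if n == 0 or t == len(r):
        return y[t]
    return rho[t] * (r[t] + (1.0 - beta[t]) * R_n(t + 1, n - 1, rho, r, beta, y))

def correction_product(t, rho, beta):
    """Weight rho_0 (1 - beta_1) ... rho_{t-1} (1 - beta_t) applied to the update at time t."""
    return float(np.prod(rho[:t] * (1.0 - beta[:t])))
```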
Theorem 3. For every time step t ≥ 0 and any initial state s,

    E_b[\Delta\theta_t \mid s] = E_\pi[\Delta\bar\theta_t \mid s].

Proof: By induction on n we show that

    E_b\{ R_t^{(n)} \mid s \} = E_\pi\{ \bar R_t^{(n)} \mid s \},

which implies that E_b\{R_t^{\lambda} \mid s\} = E_\pi\{\bar R_t^{\lambda} \mid s\}. The rest of the proof is algebraic manipulations (see paper).
Implementation of off-policy learning for options

In order to avoid ∆θ → 0, we use a restart function g : S → [0, 1] (like in the PSD algorithm). The forward algorithm becomes:

    \Delta\theta_t = (R_t^{\lambda} - y_t) \nabla_\theta y_t \sum_{i=0}^{t} g_i \, \rho_i \cdots \rho_{t-1} (1 - \beta_{i+1}) \cdots (1 - \beta_t),

where g_t is the extent of restarting in state s_t.
The incremental learning algorithm is the following:

• Initialize κ_0 = g_0, e_0 = κ_0 ∇_θ y_0
• At every time step t:

    \delta_t = \rho_t \left( r_{t+1} + (1 - \beta_{t+1}) y_{t+1} \right) - y_t
    \theta_{t+1} = \theta_t + \alpha \delta_t e_t
    \kappa_{t+1} = \rho_t \kappa_t (1 - \beta_{t+1}) + g_{t+1}
    e_{t+1} = \lambda \rho_t (1 - \beta_{t+1}) e_t + \kappa_{t+1} \nabla_\theta y_{t+1}
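A minimal Python sketch of this incremental algorithm, assuming the linear model y_t = θᵀφ(s_t) used elsewhere in the poster, so that ∇_θ y_t = φ(s_t); the class and variable names are illustrative, not the authors' code.

```python
import numpy as np

class OffPolicyOptionModel:
    """Incremental off-policy learning of an option's reward model
    (a sketch of the update rules above, assuming y = theta^T phi)."""

    def __init__(self, n_features, alpha=0.1, lam=0.9):
        self.theta = np.zeros(n_features)
        self.alpha = alpha
        self.lam = lam
        self.kappa = None
        self.e = None

    def start(self, phi0, g0):
        # Initialize: kappa_0 = g_0,  e_0 = kappa_0 * grad y_0 = kappa_0 * phi_0
        self.kappa = g0
        self.e = g0 * phi0

    def step(self, phi_t, phi_next, r_next, rho_t, beta_next, g_next):
        y_t = self.theta @ phi_t
        y_next = self.theta @ phi_next
        # delta_t = rho_t (r_{t+1} + (1 - beta_{t+1}) y_{t+1}) - y_t
        delta = rho_t * (r_next + (1.0 - beta_next) * y_next) - y_t
        # theta_{t+1} = theta_t + alpha * delta_t * e_t
        self.theta += self.alpha * delta * self.e
        # kappa_{t+1} = rho_t * kappa_t * (1 - beta_{t+1}) + g_{t+1}
        self.kappa = rho_t * self.kappa * (1.0 - beta_next) + g_next
        # e_{t+1} = lambda * rho_t * (1 - beta_{t+1}) * e_t + kappa_{t+1} * phi_{t+1}
        self.e = self.lam * rho_t * (1.0 - beta_next) * self.e + self.kappa * phi_next
        return delta
```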
Off-policy learning is tricky
• The Bermuda triangle
! Temporal-difference learning
! Function approximation (e.g., linear)
! Off-policy
• Leads to divergence of iterative algorithms
! Q-learning diverges with linear FA
! Dynamic programming diverges with linear FA
Baird's Counterexample

[Figure: Baird's counterexample. Six states plus a terminal state; the approximate values are V_k(s) = θ(7) + 2θ(i) for the first five states and V_k(s) = 2θ(7) + θ(6) for the sixth, with transition probabilities of 99%, 1%, and 100% as shown. The accompanying plot shows the parameter values θ_k(i) (log scale, broken at ±1) over 5000 iterations: θ_k(7), θ_k(1)–θ_k(5), and θ_k(6) diverge.]
Precup, Sutton & Dasgupta (PSD) algorithm
• Uses importance sampling to convert off-policy case to on-policy case
• Convergence assured by theorem of Tsitsiklis & Van Roy (1997)
• Survives the Bermuda triangle!
BUT!
• Variance can be high, even infinite (slow learning)
• Difficult to use with continuous or large action spaces
• Requires explicit representation of behavior policy (probability distribution)
Option formalism

An option is defined as a triple o = ⟨I, π, β⟩:
• I ⊆ S is the set of states in which the option can be initiated
• π is the internal policy of the option
• β : S → [0, 1] is a stochastic termination condition

We want to compute the reward model of option o:

    E_o\{R(s)\} = E\{ r_1 + r_2 + \cdots + r_T \mid s_0 = s, \pi, \beta \}

We assume that linear function approximation is used to represent the model:

    E_o\{R(s)\} \approx \theta^{\top} \phi_s = y
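As an illustrative sketch only (the container and function names are assumptions, not from the poster), an option ⟨I, π, β⟩ and its linear reward model could be represented as:

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option o = <I, pi, beta> (illustrative container)."""
    initiation_set: Set[int]             # I ⊆ S: states where the option can start
    policy: Callable[[int], int]         # pi: internal policy, state -> action
    termination: Callable[[int], float]  # beta: S -> [0, 1]

def reward_model(theta: np.ndarray, phi: Callable[[int], np.ndarray], s: int) -> float:
    """Linear approximation of the option's reward model: E_o{R(s)} ≈ theta^T phi_s = y."""
    return float(theta @ phi(s))
```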
References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML.
Precup, D., Sutton, R. S., and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML.
Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol. 112, pp. 181–211.
Sutton, R. S., and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17.
Sutton, R. S., Rafols, E., and Koop, A. (2006). Temporal abstraction in temporal-difference networks. In Proceedings of NIPS-18.
Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, vol. 42.
Tsitsiklis, J. N., and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42.
Theorem 4. If the following assumptions hold:
• The function approximator used to represent the model is a state aggregator
• The recognizer behaves consistently with the function approximator, i.e., c(s, a) = c(p, a), ∀s ∈ p
• The recognition probability for each partition, µ̂(p), is estimated using maximum likelihood:

    \hat\mu(p) = \frac{N(p, c = 1)}{N(p)}

then there exists a policy π̂ such that the off-policy learning algorithm converges to the same model as the on-policy algorithm using π̂.

Proof: In the limit, w.p.1, µ̂ converges to \sum_s d^b(s \mid p) \sum_a c(p, a) b(s, a), where d^b(s|p) is the probability of visiting state s from partition p under the stationary distribution of b. Let π̂ be defined to be the same for all states in a partition p:

    \hat\pi(p, a) = \hat\rho(p, a) \sum_s d^b(s \mid p) \, b(s, a)

π̂ is well-defined, in the sense that \sum_a \hat\pi(s, a) = 1. Using Theorem 3, off-policy updates using importance sampling corrections ρ̂ will have the same expected value as on-policy updates using π̂.
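A small Python sketch of the maximum-likelihood estimate used in the theorem, assuming a state aggregator where the partition index p and the binary recognizer output c are observed on each step; the class name and structure are illustrative.

```python
from collections import defaultdict

class RecognitionProbEstimator:
    """Maximum-likelihood estimate mu_hat(p) = N(p, c=1) / N(p)
    for each partition p of a state aggregator (illustrative sketch)."""

    def __init__(self):
        self.n_total = defaultdict(int)
        self.n_recognized = defaultdict(int)

    def update(self, p, c):
        self.n_total[p] += 1
        self.n_recognized[p] += int(c)

    def mu_hat(self, p):
        if self.n_total[p] == 0:
            return 0.0            # undefined before any visit; a convention of this sketch
        return self.n_recognized[p] / self.n_total[p]

    def rho_hat(self, p, c):
        # Importance sampling correction rho_hat(s, a) = c(s, a) / mu_hat(s)
        return c / self.mu_hat(p)
```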
Acknowledgements

The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI.
Problem formulation with recognizers

The target policy π is induced by a recognizer function c : [0, 1] → ℝ⁺:

    \pi(a) = \frac{c(a) \, b(a)}{\sum_x c(x) \, b(x)} = \frac{c(a) \, b(a)}{\mu} \qquad (1)
(see the blue line in the same figure). The estimator is:

    \hat m_\pi = \frac{1}{n} \sum_{i=1}^{n} z_i \frac{\pi(a_i)}{b(a_i)}
               = \frac{1}{n} \sum_{i=1}^{n} z_i \frac{c(a_i) \, b(a_i)}{\mu} \frac{1}{b(a_i)}
               = \frac{1}{n} \sum_{i=1}^{n} \frac{z_i \, c(a_i)}{\mu}
[Figure: empirical variances (average of 200 sample variances) as a function of the number of sample actions, for the estimators with and without the recognizer.]
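Returning to the one-state example, here is a minimal Python sketch in the spirit of this comparison, with the same assumed Beta(2, 2) behavior density as before: the plain importance-sampling estimator uses the uniform target policy, while the recognizer-based estimator uses the recognizer-induced policy with µ estimated from the sample.

```python
import numpy as np

rng = np.random.default_rng(1)

def b_density(a):
    return 6.0 * a * (1.0 - a)          # assumed Beta(2, 2) behavior density

def recognizer(a, lo=0.7, hi=0.9):
    return ((a >= lo) & (a <= hi)).astype(float)

def both_estimates(n):
    a = rng.beta(2.0, 2.0, size=n)      # actions sampled from the behavior policy
    z = a                               # outcomes z_i = a_i
    c = recognizer(a)
    # Without recognizer: uniform target density pi(a) = 1/0.2 on [0.7, 0.9]
    m_plain = np.mean((c / 0.2) / b_density(a) * z)
    # With recognizer: m_hat = (1/n) sum_i z_i c(a_i) / mu, with mu estimated
    # by the empirical recognition probability (a maximum likelihood estimate)
    mu_hat = np.mean(c)
    m_recog = np.mean(z * c) / mu_hat if mu_hat > 0 else 0.0
    return m_plain, m_recog

# Empirical variance of each estimator over repeated batches of 100 actions
# (note the two target policies differ slightly: uniform vs. recognizer-induced)
runs = np.array([both_estimates(100) for _ in range(200)])
print("variance without recognizer:", runs[:, 0].var())
print("variance with recognizer:   ", runs[:, 1].var())
```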
The importance sampling corrections are:

    \rho(s, a) = \frac{\pi(s, a)}{b(s, a)} = \frac{c(s, a)}{\mu(s)}

where µ(s) depends on the behavior policy b. If b is unknown, instead of µ we will use a maximum likelihood estimate µ̂ : S → [0, 1], and the importance sampling corrections will be defined as:

    \hat\rho(s, a) = \frac{c(s, a)}{\hat\mu(s)}
On-policy learning

If π is used to generate behavior, then the reward model of an option can be learned using TD-learning.

The n-step truncated return is:

    \bar R_t^{(n)} = r_{t+1} + (1 - \beta_{t+1}) \bar R_{t+1}^{(n-1)}.

The λ-return is defined as usual:

    \bar R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \bar R_t^{(n)}.

The parameters of the function approximator are updated on every step proportionally to:

    \Delta\bar\theta_t = \left[ \bar R_t^{\lambda} - y_t \right] \nabla_\theta y_t \, (1 - \beta_1) \cdots (1 - \beta_t).
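To make the forward view concrete, here is a short Python sketch (an assumption-laden sketch, not the authors' code) that computes the λ-returns via the backward recursion R̄_t^λ = r_{t+1} + (1 − β_{t+1})[(1 − λ) y_{t+1} + λ R̄_{t+1}^λ], which follows from the definitions above if one takes R̄_t^{(0)} = y_t and bootstraps on the current predictions at the end of the recorded data.

```python
import numpy as np

def lambda_returns(r, beta, y, lam):
    """On-policy lambda-returns for an option's reward model (sketch).

    r[t] = r_{t+1} and beta[t] = beta_{t+1} for t = 0..T-1; y[t] is the
    model's prediction at s_t for t = 0..T (y[T] is used to bootstrap at
    the end of the recorded data, an assumption of this sketch).
    """
    T = len(r)
    R = np.zeros(T)
    next_ret = y[T]
    for t in reversed(range(T)):
        R[t] = r[t] + (1.0 - beta[t]) * ((1.0 - lam) * y[t + 1] + lam * next_ret)
        next_ret = R[t]
    return R
```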
• Recognizers reduce variance
• First off-policy learning algorithm for option models
• Off-policy learning without knowledge of the behavior
distribution
• Observations
– Options are a natural way to reduce the variance of
importance sampling algorithms (because of the termination
condition)
– Recognizers are a natural way to define options, especially
for large or continuous action spaces.