
# Fastest Convergence for Q-Learning




Adithya M. Devraj and Sean P. Meyn∗†

March 23, 2018

Abstract

The Zap Q-learning algorithm introduced in this paper is an improvement of Watkins' original algorithm and recent competitors in several respects. It is a matrix-gain algorithm designed so that its asymptotic variance is optimal. Moreover, an ODE analysis suggests that the transient behavior is a close match to a deterministic Newton-Raphson implementation. This is made possible by a two time-scale update equation for the matrix gain sequence.

The analysis suggests that the approach will lead to stable and efficient computation even for non-ideal parameterized settings. Numerical experiments confirm the quick convergence, even in such non-ideal cases. The comparison plot on this first page, taken from Fig. 9 of this paper, is an illustration of the amazing acceleration in convergence using the new algorithm.

A secondary goal of this paper is tutorial. The first half of the paper contains a survey on reinforcement learning algorithms, with a focus on minimum variance algorithms.

Keywords: Reinforcement learning, Q-learning, Stochastic optimal control

2000 AMS Subject Classification: 93E20, 93E35

[Title-page figure: Bellman error B_n versus n (×10^5) for Watkins' algorithm (gains g = 70 and g = 1500), RPJ, Speedy, Polynomial, and two Zap-Q variants (γ_n ≡ α_n and γ_n ≡ α_n^{0.85}), for discount factors β = 0.8 and β = 0.99.]

∗ Research supported by the National Science Foundation under grants CPS-0931416 and EPCN-1609131.

† A.D. and S.M. are with the Department of Electrical and Computer Engineering at the University of Florida, Gainesville. A part of the research was conducted when the authors were visitors at the Simons Institute for the Theory of Computing at the University of California, Berkeley.


arXiv:1707.03770v2 [cs.SY] 21 Mar 2018

## 1 Introduction

It is recognized that algorithms for reinforcement learning such as TD- and Q-learning can be slow to converge. The poor performance of Watkins' Q-learning algorithm was first quantified in [31], and since then many papers have appeared with proposed improvements, such as [10, 1].

An emphasis in much of the literature is computation of finite-time PAC (probably approximately correct) bounds as a metric for performance. Explicit bounds were obtained in [31] for Watkins' algorithm, and in [1] for the "speedy" Q-learning algorithm that was introduced by these authors. A general theory is presented in [21] for stochastic approximation algorithms.

In each of the models considered in prior work, the update equation for the parameter estimates can be expressed

θ_{n+1} = θ_n + α_n [f(θ_n) + Δ_{n+1}],   n ≥ 0,   (1)

in which {α_n} is a positive gain sequence, and {Δ_n} is a martingale difference sequence. This representation is critical in analysis, but unfortunately is not typical in reinforcement learning applications outside of these versions of Q-learning. For Markovian models, the usual transformation used to obtain a representation similar to (1) results in an error sequence {Δ_n} that is the sum of a martingale difference sequence and a telescoping sequence [16]. It is the telescoping sequence that prevents easy analysis of Markovian models.

This gap in the research literature carries over to the general theory of Markov chains. Examples of concentration bounds for i.i.d. sequences or martingale-difference sequences include the finite-time bounds of Hoeffding and Bennett. Extensions to Markovian models either offer very crude bounds [19], or impose restrictive assumptions [15, 11]; this remains an active area of research [23].

In contrast, asymptotic theory for stochastic approximation (as well as general state space Markov chains) is mature. Large Deviations or Central Limit Theorem (CLT) limits hold under very general assumptions [3, 14, 5].

The CLT will be a guide to algorithm design in the present paper. For a typical stochastic approximation algorithm, this takes the following form: denoting {θ̃_n := θ_n − θ* : n ≥ 0} to be the error sequence, under general conditions the scaled sequence {√n θ̃_n : n ≥ 1} converges in distribution to a Gaussian distribution, N(0, Σ_θ). Typically, the scaled covariance is also convergent:

Σ_θ = lim_{n→∞} n E[θ̃_n θ̃_n^T].   (2)

The limit is known as the asymptotic covariance.

An asymptotic bound such as (2) may not be satisfying for practitioners of stochastic optimization or reinforcement learning, given the success of finite-n performance bounds in prior research. There are however good reasons to apply this asymptotic theory in algorithm design:

(i) The asymptotic covariance Σ_θ has a simple representation as the solution to a Lyapunov equation. It is easily improved or optimized by design.

(ii) As shown in examples in this paper, the asymptotic covariance is often a good predictor of finite-time performance, since the CLT approximation is accurate for reasonable values of n.

Two approaches are known for optimizing the asymptotic covariance. First is the remarkable averaging technique of Polyak and Juditsky [24, 25] and Ruppert [27] ([12] provides an accessible treatment in a simplified setting). Second is what we will call Stochastic Newton-Raphson, based on a special choice of matrix gain for the algorithm. The second approach underlies the analysis of the averaging approach.

We are not aware of theory that distinguishes the performance of Polyak-Ruppert averaging as compared to the Stochastic Newton-Raphson method. It is noted in [21] that the averaging approach often leads to very large transients, so that the algorithm should be modified (such as through projection of parameter updates). This may explain why averaging is not very popular in practice. In our own numerical experiments it is observed that the rate of convergence in the CLT is slow in this case when compared to matrix gain methods.

In addition to accelerating the convergence rate of standard algorithms for reinforcement learning, it is hoped that this paper will lead to entirely new algorithms. In particular, there is little theory to support Q-learning in non-ideal settings in which the optimal "Q-function" does not lie in the parameterized function class. Convergence results have been obtained for a class of optimal stopping problems [37], and for deterministic models [17]. There is now intense practical interest, despite an incomplete theory. A stronger supporting theory will surely lead to more efficient algorithms.

**Contributions** A new class of algorithms is proposed, designed to more accurately mimic the classical Newton-Raphson algorithm. It is based on a two time-scale stochastic approximation algorithm, constructed so that the matrix gain tracks the gain that would be used in a deterministic Newton-Raphson method.

The application of this approach to reinforcement learning results in the new Zap Q-learning algorithms. A full analysis is presented for the special case of a complete parameterization (similar to the setting of Watkins' original algorithm). It is found that the associated ODE has a remarkable and simple representation, which implies consistency under suitable assumptions. Extensions to non-ideal parameterized settings are also proposed, and numerical experiments show dramatic variance reductions. Moreover, results obtained from finite-n experiments show close agreement with the asymptotic theory.

The potential complexity introduced by the matrix gain is not of great concern in many cases, because of the dramatic acceleration in the rate of convergence. Moreover, the main contribution of this paper is not a single algorithm but a class of algorithms, wherein the computational complexity can be dealt with separately. For example, in a parameterized setting, the basis functions can be intelligently pruned via random projection [2].

The remainder of the paper is organized as follows. Background on computing and optimizing the asymptotic covariance is contained in Section 2. Application to Q-learning, and the theory surrounding the new Zap Q-learning algorithm, is developed in Section 3. Numerical results are surveyed in Section 4, and conclusions are contained in Section 5. The proofs of the main results are contained in the Appendix; the final page contains Table 2, a list of notation.

## 2 Stochastic Newton-Raphson and TD-Learning

This first section is largely a tutorial on reinforcement learning. It is shown that the LSTD(λ) learning algorithm of [8, 7, 22] is an instance of the "SNR algorithm", in which there is only one time-scale for the parameter and matrix-gain updates. The original motivation for the LSTD(λ) algorithm had no connection with asymptotic variance. It was shown later in [13] that the LSTD(λ) algorithm is the minimum asymptotic variance version of the TD(λ) algorithm of [30].

The focus is on fixed point equations associated with an uncontrolled Markov chain, denoted X = {X_n : n = 0, 1, . . . }, on a measurable state space (X, B(X)). It is assumed to be ψ-irreducible and aperiodic [20]. In Section 3 we specialize to a finite state space.

In control applications and analysis of learning algorithms, it is necessary to construct a Markov chain Φ, of which X is a component. Other components may be an input process, or a sequence of "eligibility vectors" that arise in TD-learning. It will be assumed throughout that there is a unique stationary realization of Φ, with unique marginal distribution denoted ϖ.

### 2.1 Motivation from SA & ODE fundamentals

The goal of stochastic approximation is to compute the solution θ* of f(θ*) = 0 for a function f : R^d → R^d. If the function is easily evaluated, then successive approximation can be used, and under stronger conditions the Newton-Raphson algorithm:

θ_{n+1} = θ_n + G_n f(θ_n),   G_n^{-1} = −∂_θ f(θ_n).   (3)

Under general conditions the convergence rate of (3) is quadratic (much faster than geometric), which is not generally true of successive approximation.
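As a quick illustration of the quadratic convergence of (3) (a toy scalar example of our own, not from the paper): in one dimension the update reduces to θ_{n+1} = θ_n − f(θ_n)/f′(θ_n).

```python
import math

def newton_raphson(f, df, theta, iters=8):
    # Scalar case of (3): G_n = -1/f'(theta_n), so theta <- theta - f(theta)/f'(theta)
    for _ in range(iters):
        theta -= f(theta) / df(theta)
    return theta

# Toy root-finding problem: f(theta) = cos(theta) - theta, with root near 0.739
f = lambda t: math.cos(t) - t
df = lambda t: -math.sin(t) - 1.0
root = newton_raphson(f, df, theta=1.0)
```

Each iteration roughly doubles the number of correct digits, so a handful of steps reaches machine precision; successive approximation (θ ← θ + f(θ)) on the same problem converges only geometrically.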

Stochastic approximation is itself an approximation of successive approximation. It is assumed that f(θ) = E[f(θ, Φ)], where f : R^d × R^m → R^d and Φ is a random variable with distribution ϖ. The standard stochastic approximation algorithm is defined by

θ_{n+1} = θ_n + α_n f(θ_n, Φ_{n+1}),   n ≥ 0.   (4)

For simplicity it is assumed that Φ is the stationary realization of the Markov chain. It is always assumed that the scalar gain sequence {α_n} is non-negative, and satisfies:

Σ_n α_n = ∞,   Σ_n α_n² < ∞.   (5)
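For a concrete toy instance of (4)-(5), not taken from the paper: with f(θ, W) = W − θ we have f(θ) = E[W] − θ, so θ* = E[W], and the choice α_n = 1/n turns the recursion into the running sample mean.

```python
import random

def sa_recursion(sample, theta0=0.0, n_iter=100_000):
    # Recursion (4) with alpha_n = 1/n, which satisfies conditions (5)
    theta = theta0
    for n in range(1, n_iter + 1):
        theta += (1.0 / n) * (sample() - theta)
    return theta

random.seed(0)
theta_hat = sa_recursion(lambda: random.gauss(2.0, 1.0))  # theta* = E[W] = 2.0
```

The estimate converges at the Monte Carlo rate O(1/√n), which is the setting in which the asymptotic covariance of the next subsection becomes the natural figure of merit.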

While convergent under general conditions, the rate of convergence of (4) can often be improved dramatically through the introduction of a matrix gain. This is explained first in a simple linear setting.

### 2.2 Optimal covariance for linear stochastic approximation

In many applications of reinforcement learning we arrive at a linear recursion of the form

θ_{n+1} = θ_n + α_{n+1} [A_{n+1} θ_n − b_{n+1}]   (6)

where A_{n+1} = A(Φ_{n+1}) is a d × d matrix and b_{n+1} = b(Φ_{n+1}) is a d × 1 vector, n ≥ 0. Let A, b denote the respective steady-state means:

A = E[A(Φ)],   b = E[b(Φ)].   (7)

It is assumed throughout this section that A is Hurwitz: the real part of each eigenvalue is negative. Under this assumption, and subject to mild conditions on Φ, it is known that {θ_n} converges with probability one to θ* = A^{-1} b [3, 14, 5].

Convergence of the recursion (6) will be assumed henceforth. It is also assumed that the gain sequence is given by α_n = 1/n, n ≥ 1.

Under general conditions, the asymptotic covariance Σ_θ defined in (2) is the non-negative semi-definite solution to the Lyapunov equation:

(A + ½I) Σ_θ + Σ_θ (A + ½I)^T + Σ_Δ = 0.   (8)

A solution is guaranteed only if each eigenvalue of A has real part that is strictly less than −1/2. If there exists an eigenvalue which does not satisfy this property, then under general conditions the asymptotic covariance is infinite (see Thm. 2.1). Hence the Hurwitz assumption must be strengthened to ensure that the asymptotic covariance is finite.
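Since (8) is linear in Σ_θ, the asymptotic covariance is easy to compute numerically once A and Σ_Δ are known. A minimal sketch, using SciPy's Lyapunov solver on a made-up 2×2 instance (the matrices A and Σ_Δ below are our own illustrative numbers, not from the paper; `solve_continuous_lyapunov(a, q)` solves aX + Xa^H = q):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Toy instance: eigenvalues of A are -1 and -2, both with real part < -1/2,
# so equation (8) has a finite positive-definite solution
A = np.array([[-1.0, 0.2],
              [ 0.0, -2.0]])
Sigma_Delta = np.array([[1.0, 0.3],
                        [0.3, 2.0]])

# Rearranged form of (8): (A + I/2) Sigma + Sigma (A + I/2)^T = -Sigma_Delta
M = A + 0.5 * np.eye(2)
Sigma_theta = solve_continuous_lyapunov(M, -Sigma_Delta)
```

The solver returns the unique symmetric solution; plugging it back into (8) verifies the residual is zero to machine precision.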

The matrix Σ_Δ is obtained as follows: based on (6), the error sequence {θ̃_n = θ_n − θ*} evolves according to a deterministic linear system driven by "noise":

θ̃_{n+1} = θ̃_n + (1/(n+1)) [A θ̃_n + Δ_{n+1}]

in which Δ is the sum of three terms:

Δ_{n+1} = Ã_{n+1} θ* − b̃_{n+1} + Ã_{n+1} θ̃_n,   (9)

with Ã_{n+1} = A_{n+1} − A and b̃_{n+1} = b_{n+1} − b. The third term vanishes with probability one. The "noise covariance matrix" Σ_Δ has the following two equivalent forms:

Σ_Δ = lim_{T→∞} (1/T) E[S_T S_T^T] = Σ_{k=−∞}^{∞} R(k)   (10)

in which S_T = Σ_{n=1}^{T} Δ_n, and

R(k) = R(−k)^T = E[(Ã_k θ* − b̃_k)(Ã_0 θ* − b̃_0)^T],   k ≥ 0

where the expectation is in steady-state. It is assumed that the CLT holds for sample-averages of the noise sequence:

(1/√N) Σ_{n=1}^{N} Δ_n → N(0, Σ_Δ),   N → ∞,   (11)

where the limit is in distribution. This is a mild requirement when Φ is Markovian [20].
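In the simplest special case where the pairs (Ã_n, b̃_n) are i.i.d. (a toy model of our own, not from the paper), R(k) = 0 for k ≠ 0 and the sum in (10) collapses to Σ_Δ = R(0), which a plain sample covariance estimates directly:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])

# i.i.d. toy model: A_tilde and b_tilde have independent Gaussian entries, so
# Delta_n = A_tilde_n theta* - b_tilde_n is i.i.d. and Sigma_Delta = R(0)
N = 200_000
A_tilde = 0.1 * rng.standard_normal((N, 2, 2))
b_tilde = 0.2 * rng.standard_normal((N, 2))
deltas = A_tilde @ theta_star - b_tilde        # shape (N, 2)

Sigma_Delta_hat = np.cov(deltas.T, bias=True)
# True value for this model: (0.01 * ||theta*||^2 + 0.04) I = 0.09 I
```

For Markovian noise the lagged terms R(k), k ≠ 0 contribute as well, and batch-means or spectral estimators of the left-hand form in (10) are needed instead.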

A finite asymptotic covariance can be guaranteed by increasing the gain: choose α_n = g/n in (6), with g > 0 sufficiently large so that the eigenvalues of gA satisfy the required bound. More generally, a matrix gain can be introduced:

θ_{n+1} = θ_n + (1/(n+1)) G [A_{n+1} θ_n − b_{n+1}]   (12)

in which G is a d × d matrix. Provided the matrix GA satisfies the eigenvalue bound, the corresponding asymptotic covariance Σ_θ^G is finite and solves a modified Lyapunov equation:

(GA + ½I) Σ_θ^G + Σ_θ^G (GA + ½I)^T + G Σ_Δ G^T = 0.   (13)

The choice G* = −A^{-1} is analogous to the gain used in the Newton-Raphson algorithm (3). With this choice, the asymptotic covariance is finite and given by

Σ* := A^{-1} Σ_Δ (A^{-1})^T.   (14)

It is a remarkable fact that this choice is optimal in the strongest possible statistical sense: for any other gain G, the two asymptotic covariance matrices satisfy

Σ_θ^G ≥ Σ*.

That is, the difference Σ_θ^G − Σ* is positive semi-definite [3, 14, 5].
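This optimality can be checked numerically by solving (13) for different gains. A sketch on a toy 2×2 instance (the matrices A, Σ_Δ and the perturbed gain below are our own illustrative numbers, not from the paper): the gap Σ_θ^G − Σ* should have non-negative eigenvalues for any stabilizing G.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-1.0, 0.2],
              [ 0.0, -2.0]])                  # Hurwitz (toy instance)
Sigma_Delta = np.array([[1.0, 0.3],
                        [0.3, 2.0]])
I = np.eye(2)

def asymptotic_cov(G):
    # Equation (13): (GA + I/2) Sigma + Sigma (GA + I/2)^T = -G Sigma_Delta G^T
    return solve_continuous_lyapunov(G @ A + 0.5 * I, -G @ Sigma_Delta @ G.T)

G_star = -np.linalg.inv(A)
Sigma_star = asymptotic_cov(G_star)           # matches (14): A^{-1} Sigma_Delta A^{-T}

# A fixed perturbation of G*; eigenvalues of GA stay well below -1/2 here
G = G_star + 0.05 * np.arange(4).reshape(2, 2)
gap = asymptotic_cov(G) - Sigma_star          # positive semi-definite by optimality
```

With G = G* we have GA = −I, and (13) reduces exactly to (14); any other admissible gain pays a covariance penalty.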

The following theorem summarizes the results on the asymptotic covariance for the matrix-gain recursion (12). The proof is contained in Section A.1 of the Appendix.
