SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation

Bo Dai^1, Albert Shaw^1, Lihong Li^2, Lin Xiao^3, Niao He^4, Zhen Liu^1, Jianshu Chen^5, Le Song^1

^1 Georgia Institute of Technology, ^2 Google Inc., ^3 Microsoft Research, Redmond, ^4 University of Illinois at Urbana-Champaign, ^5 Tencent AI Lab, Bellevue

June 7, 2018
Abstract
When function approximation is used, solving the Bellman optimality equation with stability guarantees
has remained a major open problem in reinforcement learning for decades. The fundamental difficulty
is that the Bellman operator may become an expansion in general, resulting in oscillating and even
divergent behavior of popular algorithms like Q-learning. In this paper, we revisit the Bellman equation,
and reformulate it into a novel primal-dual optimization problem using Nesterov’s smoothing technique
and the Legendre-Fenchel transformation. We then develop a new algorithm, called Smoothed Bellman
Error Embedding, to solve this optimization problem where any differentiable function class may be
used. We provide what we believe to be the first convergence guarantee for general nonlinear function
approximation, and analyze the algorithm’s sample complexity. Empirically, our algorithm compares
favorably to state-of-the-art baselines in several benchmark control problems.
1 Introduction
In reinforcement learning (RL), the goal of an agent is to learn a policy that maximizes long-term returns by
sequentially interacting with an unknown environment (Sutton & Barto, 1998). The dominating framework
to model such an interaction is the Markov decision process, or MDP, in which the optimal value function is
characterized as a fixed point of the Bellman operator. A fundamental result for MDPs is that the Bellman
operator is a contraction in the value-function space, so the optimal value function is the unique fixed point.
Furthermore, starting from any initial value function, iterative applications of the Bellman operator ensure
convergence to the fixed point. Interested readers are referred to the textbook of Puterman (2014) for details.
Many of the most effective RL algorithms have their root in such a fixed-point view. The most prominent
family of algorithms is perhaps the temporal-difference algorithms, including TD($\lambda$) (Sutton, 1988), Q-
learning (Watkins, 1989), SARSA (Rummery & Niranjan, 1994; Sutton, 1996), and numerous variants such
as the empirically very successful DQN (Mnih et al., 2015) and A3C (Mnih et al., 2016) implementations.
Compared to direct policy search/gradient algorithms like REINFORCE (Williams, 1992), these fixed-point
methods make learning more efficient by bootstrapping (a sample-based version of the Bellman operator).
When the Bellman operator can be computed exactly (even on average), such as when the MDP has finite
state/actions, convergence is guaranteed thanks to the contraction property (Bertsekas & Tsitsiklis, 1996).
Unfortunately, when function approximators are used, such fixed-point methods easily become unstable or
even divergent (Boyan & Moore, 1995; Baird, 1995; Tsitsiklis & Van Roy, 1997), except in a few special cases.
For example,
• for some rather restrictive function classes, such as those with a non-expansion property, some of the finite-state MDP theory continues to apply with modifications (Gordon, 1995; Ormoneit & Sen, 2002; Antos et al., 2008);
• when linear value function approximation is used, convergence is guaranteed in certain cases: for evaluating a fixed policy from on-policy samples (Tsitsiklis & Van Roy, 1997), for evaluating the policy using a closed-form solution from off-policy samples (Boyan, 2002; Lagoudakis & Parr, 2003), or for optimizing a policy using samples collected by a stationary policy (Maei et al., 2010).
In recent years, a few authors have made important progress toward finding scalable, convergent TD algorithms,
by designing proper objective functions and using stochastic gradient descent (SGD) to optimize them (Sutton
et al., 2009; Maei, 2011). Later on, it was realized that several of these gradient-based algorithms can be
interpreted as solving a primal-dual problem (Mahadevan et al., 2014; Liu et al., 2015; Macua et al., 2015;
Dai et al., 2016). This insight has led to novel, faster, and more robust algorithms by adopting sophisticated
optimization techniques (Du et al., 2017). Unfortunately, to the best of our knowledge, all existing works
either assume linear function approximation or are designed for policy evaluation. It remains a major open
problem how to find the optimal policy reliably with general nonlinear function approximators such as neural
networks, especially in the presence of off-policy data.
Contributions
In this work, we take a substantial step towards solving this decades-long open problem, leveraging a powerful saddle-point optimization perspective to derive a new algorithm called the Smoothed Bellman Error Embedding (SBEED) algorithm. Our development hinges upon a novel view of a smoothed Bellman optimality equation, which is then transformed into the final primal-dual optimization problem. SBEED learns the optimal value function and a stochastic policy in the primal, and the Bellman error (also known as the Bellman residual) in the dual. By doing so, it avoids the non-smooth $\max$-operator in the Bellman operator, as well as the double-sample challenge that has plagued RL algorithm designs (Baird, 1995). More specifically,
• SBEED is stable for a broad class of nonlinear function approximators including neural networks, and provably converges to a solution with vanishing gradient. This holds even in the more challenging off-policy case;
• it uses bootstrapping to yield high sample efficiency, as in TD-style methods, and is also generalized to cases of multi-step bootstrapping and eligibility traces;
• it avoids the double-sample issue and directly optimizes the squared Bellman error based on sample trajectories;
• it uses stochastic gradient descent to optimize the objective, and is thus very efficient and scalable.
Furthermore, the algorithm handles both the optimal value function estimation and policy optimization in a
unified way, and readily applies to both continuous and discrete action spaces. We compare the algorithm
with state-of-the-art baselines on several continuous control benchmarks, and obtain excellent results.
2 Preliminaries
In this section, we introduce notation and technical background that is needed in the rest of the paper. We denote a Markov decision process (MDP) as $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is a (possibly infinite) state space, $\mathcal{A}$ an action space, $P(\cdot|s, a)$ the transition probability kernel defining the distribution over next states upon taking action $a$ in state $s$, $R(s, a)$ the average immediate reward from taking action $a$ in state $s$, and $\gamma \in (0, 1)$ a discount factor. Given an MDP, we wish to find a possibly stochastic policy $\pi : \mathcal{S} \to \mathcal{P}_{\mathcal{A}}$ to maximize the expected discounted cumulative reward starting from any state $s \in \mathcal{S}$: $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s, \pi\right]$, where $\mathcal{P}_{\mathcal{A}}$ denotes the set of all probability measures over $\mathcal{A}$. The set of all policies is denoted by $\mathcal{P} := (\mathcal{P}_{\mathcal{A}})^{\mathcal{S}}$.
Define $V^*(s) := \max_{\pi(\cdot|s)} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s, \pi\right]$ to be the optimal value function. It is known that $V^*$ is the unique fixed point of the Bellman operator $\mathcal{T}$, or equivalently, the unique solution to the Bellman optimality equation (Bellman equation, for short) (Puterman, 2014):

$$V(s) = (\mathcal{T}V)(s) := \max_{a} \left\{ R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V(s') \right] \right\}. \qquad (1)$$

The optimal policy $\pi^*$ is related to $V^*$ by the following:

$$\pi^*(a|s) = \operatorname*{argmax}_{a} \left\{ R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V^*(s') \right] \right\}.$$

It should be noted that, in practice, for convenience we often work with the Q-function instead of the state-value function $V^*$. In this paper, it suffices to use the simpler $V^*$ function.
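To make the fixed-point view concrete, here is a minimal sketch (not taken from the paper; the MDP below is synthetic and all names are illustrative) of value iteration, i.e., repeated application of the Bellman operator of Eqn (1) on a small finite MDP. By the contraction property, the iterates converge geometrically at rate $\gamma$ to the unique fixed point $V^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Synthetic finite MDP: P[a, s, s'] is a row-stochastic transition kernel,
# R[s, a] the mean immediate reward.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def bellman_operator(V):
    """(T V)(s) = max_a { R(s, a) + gamma * E_{s'|s,a}[V(s')] }, as in Eqn (1)."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)   # Q[s, a]
    return Q.max(axis=1)

V = np.zeros(n_states)
for _ in range(500):
    V_next = bellman_operator(V)
    if np.max(np.abs(V_next - V)) < 1e-10:
        break
    V = V_next
print("approximate fixed point V*:", np.round(V, 4))
```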
3 A Primal-Dual View of Bellman Equation
In this section, we introduce a novel view of the Bellman equation that enables the development of the new algorithm in Section 4. After reviewing the Bellman equation and the challenges in solving it, we describe the two key technical ingredients that lead to our primal-dual reformulation.
We start with another version of the Bellman equation that is equivalent to Eqn (1) (see, e.g., Puterman (2014)):

$$V(s) = \max_{\pi(\cdot|s) \in \mathcal{P}_{\mathcal{A}}} \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V(s') \right] \right]. \qquad (2)$$
Eqn (2) makes the role of a policy explicit. Naturally, one may try to jointly optimize over $V$ and $\pi$ to minimize the discrepancy between the two sides of (2). For concreteness, we focus on the squared distance in this paper, but our results can be extended to other convex loss functions. Let $\mu$ be some given state distribution such that $\mu(s) > 0$ for all $s \in \mathcal{S}$. Minimizing the squared Bellman error gives the following:

$$\min_{V} \; \mathbb{E}_{s \sim \mu}\left[ \left( \max_{\pi(\cdot|s) \in \mathcal{P}_{\mathcal{A}}} \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V(s') \right] \right] - V(s) \right)^2 \right]. \qquad (3)$$
While natural, this approach has several major difficulties when it comes to optimization, which are to be dealt with in the following subsections:

• The $\max$ operator over $\mathcal{P}_{\mathcal{A}}$ introduces non-smoothness to the objective function. A slight change in $V$ may cause large differences in the RHS of Eqn (2).
• The conditional expectation, $\mathbb{E}_{s'|s,a}[\cdot]$, composed within the square loss, requires double samples (Baird, 1995) to obtain unbiased gradients, which is impractical in all but simulated environments.
3.1 Smoothed Bellman Equation
To avoid the instability and discontinuity caused by the $\max$ operator, we use the smoothing technique of Nesterov (2005) to smooth the Bellman operator $\mathcal{T}$. Since policies are conditional distributions over $\mathcal{A}$, we choose entropy regularization, and Eqn (2) becomes:

$$V_{\lambda}(s) = \max_{\pi(\cdot|s) \in \mathcal{P}_{\mathcal{A}}} \left\{ \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V_{\lambda}(s') \right] \right] + \lambda H(\pi, s) \right\}, \qquad (4)$$
where $H(\pi, s) := -\sum_{a \in \mathcal{A}} \pi(a|s) \log \pi(a|s)$, and $\lambda > 0$ controls the degree of smoothing. Note that with $\lambda = 0$, we obtain the standard Bellman equation. Moreover, the regularization may be viewed as a shaping reward added to the reward function of an induced, equivalent MDP; see Appendix C.2 for more details.
Since negative entropy is the conjugate of the log-sum-exp function (Boyd & Vandenberghe, 2004, Example 3.25), Eqn (4) can be written equivalently as

$$V_{\lambda}(s) = (\mathcal{T}_{\lambda} V_{\lambda})(s) := \lambda \log \sum_{a \in \mathcal{A}} \exp\left( \frac{R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V_{\lambda}(s') \right]}{\lambda} \right), \qquad (5)$$

where the log-sum-exp is an effective smoothing approximation of the max-operator.
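As a quick illustration (a sketch, not from the paper; the numbers are arbitrary), the scaled log-sum-exp in Eqn (5) can be checked numerically against the hard max: it always upper-bounds the max, and the gap shrinks as $\lambda \to 0$.

```python
import numpy as np

def smoothed_max(q, lam):
    """lam * log sum_a exp(q_a / lam): the soft maximum defining T_lambda in Eqn (5)."""
    m = q.max()                                    # subtract the max for numerical stability
    return m + lam * np.log(np.exp((q - m) / lam).sum())

q = np.array([1.0, 1.5, 0.2])                      # stand-in for R(s, a) + gamma * E[V(s')]
for lam in [1.0, 0.1, 0.01]:
    print(lam, smoothed_max(q, lam) - q.max())     # gap is positive and at most lam * log(|A|)
```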
Remark. While Eqns (4) and (5) are inspired by Nesterov's smoothing technique, they can also be derived from other principles (Rawlik et al., 2012; Fox et al., 2016; Neu et al., 2017; Nachum et al., 2017; Asadi & Littman, 2017). For example, Nachum et al. (2017) use entropy regularization in the policy space to encourage exploration, but arrive at the same smoothed form; the smoothed operator $\mathcal{T}_{\lambda}$ is called "Mellowmax" by Asadi & Littman (2017), which is obtained as a particular instantiation of the quasi-arithmetic mean. In the rest of the subsection, we review the properties of $\mathcal{T}_{\lambda}$, although some of the results have appeared in the literature in slightly different forms. Proofs are deferred to Appendix A.
First, we show that $\mathcal{T}_{\lambda}$ is also a contraction, as with the standard Bellman operator (Fox et al., 2016; Asadi & Littman, 2017):

Proposition 1 (Contraction) $\mathcal{T}_{\lambda}$ is a $\gamma$-contraction. Consequently, the corresponding smoothed Bellman equation (4), or equivalently (5), has a unique solution $V^*_{\lambda}$.
Second, we show that while in general $V^* \neq V^*_{\lambda}$, their difference is controlled by $\lambda$. To do so, define $H^* := \max_{s \in \mathcal{S},\, \pi(\cdot|s) \in \mathcal{P}_{\mathcal{A}}} H(\pi, s)$. For finite action spaces, we immediately have $H^* = \log(|\mathcal{A}|)$.

Proposition 2 (Smoothing bias) Let $V^*$ and $V^*_{\lambda}$ be fixed points of (2) and (4), respectively. Then,

$$\left\| V^*(s) - V^*_{\lambda}(s) \right\|_{\infty} \le \frac{\lambda H^*}{1 - \gamma}.$$

Consequently, as $\lambda \to 0$, $V^*_{\lambda}$ converges to $V^*$ pointwise.
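As an illustrative calculation (not from the paper): with $|\mathcal{A}| = 4$, $\gamma = 0.99$, and $\lambda = 0.01$, we have $H^* = \log 4$, so the bound gives $\|V^* - V^*_{\lambda}\|_{\infty} \le 0.01 \cdot \log 4 / (1 - 0.99) = \log 4 \approx 1.386$; halving $\lambda$ halves this worst-case bias.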
Finally, the smoothed Bellman operator has the very nice property of temporal consistency (Rawlik et al., 2012; Nachum et al., 2017):

Proposition 3 (Temporal consistency) Assume $\lambda > 0$. Let $V^*_{\lambda}$ be the fixed point of (4) and $\pi^*_{\lambda}$ the corresponding policy that attains the maximum on the RHS of (4). Then, $(V^*_{\lambda}, \pi^*_{\lambda})$ is the unique $(V, \pi)$ pair that satisfies the following equality for all $(s, a) \in \mathcal{S} \times \mathcal{A}$:

$$V(s) = R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V(s') \right] - \lambda \log \pi(a|s). \qquad (6)$$

In other words, Eqn (6) provides an easy-to-check condition that characterizes the optimal value function and optimal policy at an arbitrary pair $(s, a)$, which therefore makes it easy to incorporate off-policy data. It can also be extended to the multi-step or eligibility-traces cases (Appendix C; see also Sutton & Barto (1998, Chapter 7)). Later, this condition will be one of the critical foundations for developing our new algorithm.
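The following sketch (illustrative only; it assumes SciPy is available and that the maximizer of (4) is the softmax policy $\pi^*_{\lambda}(a|s) \propto \exp\big((R(s,a) + \gamma\, \mathbb{E}_{s'|s,a}[V^*_{\lambda}(s')])/\lambda\big)$, a closed form not restated in this excerpt) solves the smoothed equation (5) by fixed-point iteration on a synthetic MDP and then checks that the resulting pair satisfies the temporal-consistency condition (6) at every $(s, a)$.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lam = 5, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, s']
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))              # R[s, a]

def smoothed_bellman(V):
    """(T_lambda V)(s) = lam * logsumexp_a((R + gamma * E[V']) / lam), as in Eqn (5)."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return lam * logsumexp(Q / lam, axis=1), Q

V = np.zeros(n_states)
for _ in range(5000):
    V_new, Q = smoothed_bellman(V)
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

# Assumed maximizer of (4): softmax policy with temperature lam.
log_pi = Q / lam - logsumexp(Q / lam, axis=1, keepdims=True)

# Temporal consistency (6): V(s) = R(s,a) + gamma * E[V(s')] - lam * log pi(a|s) for all (s, a).
residual = Q - lam * log_pi - V[:, None]
print("max |consistency residual|:", np.abs(residual).max())        # ~ numerical precision
```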
3.2 Bellman Error Embedding
A natural objective function inspired by (6) is the mean squared consistency Bellman error, given by:

$$\min_{V,\, \pi \in \mathcal{P}} \; \ell(V, \pi) := \mathbb{E}_{s,a}\left[ \left( R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V(s') \right] - \lambda \log \pi(a|s) - V(s) \right)^2 \right], \qquad (7)$$
where $\mathbb{E}_{s,a}[\cdot]$ is shorthand for $\mathbb{E}_{s \sim \mu(\cdot),\, a \sim \pi_b(\cdot|s)}[\cdot]$. Unfortunately, due to the inner conditional expectation, it would require two independent samples of $s'$ (starting from the same $(s, a)$) to obtain an unbiased estimate of the gradient of $\ell$, a problem known as the double-sample issue (Baird, 1995). In practice, however, one can rarely obtain two independent samples except in simulated environments.
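To see the double-sample issue in isolation, consider the toy check below (a sketch, not from the paper): squaring a single-sample estimate of a conditional expectation over-estimates the squared expectation by exactly the conditional variance, whereas the product of two independent samples does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# For a fixed (s, a), think of X as the random one-step target R(s, a) + gamma * V(s').
X1 = rng.normal(loc=2.0, scale=1.5, size=1_000_000)   # E[X] = 2.0, Var[X] = 2.25
X2 = rng.normal(loc=2.0, scale=1.5, size=1_000_000)   # an independent second sample

target = np.mean(X1) ** 2          # (E[X])^2 = 4.0: what the squared Bellman error needs
one_sample = np.mean(X1 ** 2)      # E[X^2] ~ 6.25: what squaring a single sample estimates
two_samples = np.mean(X1 * X2)     # E[X]^2 ~ 4.0: unbiased, but needs two independent draws

print(target, one_sample, two_samples)   # the one-sample version is biased up by Var[X]
```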
To bypass this problem, we make use of the conjugate of the square function (Boyd & Vandenberghe, 2004): $x^2 = \max_{\nu} \left( 2\nu x - \nu^2 \right)$, as well as the interchangeability principle (Shapiro et al., 2009; Dai et al., 2016), to rewrite the optimization problem (7) into an equivalent form:

$$\min_{V,\, \pi \in \mathcal{P}} \; \max_{\nu \in \mathcal{F}_{\mathcal{S} \times \mathcal{A}}} \; L(V, \pi; \nu) := 2\, \mathbb{E}_{s,a,s'}\left[ \nu(s, a) \left( R(s, a) + \gamma V(s') - \lambda \log \pi(a|s) - V(s) \right) \right] - \mathbb{E}_{s,a,s'}\left[ \nu^2(s, a) \right], \qquad (8)$$

where $\mathcal{F}_{\mathcal{S} \times \mathcal{A}}$ is the set of real-valued functions on $\mathcal{S} \times \mathcal{A}$, and $\mathbb{E}_{s,a,s'}[\cdot]$ is shorthand for $\mathbb{E}_{s \sim \mu(\cdot),\, a \sim \pi_b(\cdot|s),\, s' \sim P(\cdot|s,a)}[\cdot]$. Note that (8) is not a standard convex-concave saddle-point problem: the objective is convex in $V$ for any fixed $(\pi, \nu)$, and concave in $\nu$ for any fixed $(V, \pi)$, but not necessarily convex in $\pi \in \mathcal{P}$ for any fixed $(V, \nu)$.
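The identity $x^2 = \max_{\nu}\left(2\nu x - \nu^2\right)$ is elementary (the maximizer is $\nu = x$), which is what makes it natural for the optimal dual to track the Bellman error itself; a quick numerical sanity check (illustrative only):

```python
import numpy as np

x = 1.7
nu = np.linspace(-5.0, 5.0, 10001)
vals = 2.0 * nu * x - nu ** 2
print(vals.max(), x ** 2)          # both ~ 2.89
print(nu[vals.argmax()])           # maximizer ~ x = 1.7
```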
Remark. In contrast to our saddle-point formulation (8), Nachum et al. (2017) get around the double-sample obstacle by minimizing an upper bound of $\ell(V, \pi)$:

$$\tilde{\ell}(V, \pi) := \mathbb{E}_{s,a,s'}\left[ \left( R(s, a) + \gamma V(s') - \lambda \log \pi(a|s) - V(s) \right)^2 \right].$$

As is known (Baird, 1995), the gradient of $\tilde{\ell}$ is different from that of $\ell$, as it has a conditional variance term coming from the stochastic outcome $s'$. In problems where this variance is highly heterogeneous across different $(s, a)$ pairs, the impact of such a bias can be substantial.
Finally, substituting the dual function $\nu(s, a) = \rho(s, a) - V(s)$, the objective in the saddle-point problem becomes

$$\min_{V,\, \pi} \; \max_{\rho \in \mathcal{F}_{\mathcal{S} \times \mathcal{A}}} \; L_1(V, \pi; \rho) := \mathbb{E}_{s,a,s'}\left[ \left( \delta(s, a, s') - V(s) \right)^2 \right] - \mathbb{E}_{s,a,s'}\left[ \left( \delta(s, a, s') - \rho(s, a) \right)^2 \right], \qquad (9)$$

where $\delta(s, a, s') := R(s, a) + \gamma V(s') - \lambda \log \pi(a|s)$. Note that the first term is $\tilde{\ell}(V, \pi)$, and the second term will cancel the extra variance term (see Proposition 8 in Appendix B). The use of an auxiliary function to cancel the variance is also observed by Antos et al. (2008). On the other hand, when function approximation is used, extra bias will also be introduced. We note that such a saddle-point view of debiasing the extra variance term leads to a useful mechanism for better bias-variance trade-offs, leading to the final primal-dual formulation we aim to solve in the next section:

$$\min_{V,\, \pi \in \mathcal{P}} \; \max_{\rho \in \mathcal{F}_{\mathcal{S} \times \mathcal{A}}} \; L_{\eta}(V, \pi; \rho) := \mathbb{E}_{s,a,s'}\left[ \left( \delta(s, a, s') - V(s) \right)^2 \right] - \eta\, \mathbb{E}_{s,a,s'}\left[ \left( \delta(s, a, s') - \rho(s, a) \right)^2 \right], \qquad (10)$$

where $\eta \in [0, 1]$ is a hyper-parameter controlling the trade-off. When $\eta = 1$, this reduces to the original saddle-point formulation (8). When $\eta = 0$, this reduces to the surrogate objective considered by Nachum et al. (2017).
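To make (10) concrete, the sketch below (assumptions: PyTorch; `V`, `log_pi`, and `rho` are generic callables standing in for the parametric value function, policy log-probability, and dual function, and none of these names come from the paper) estimates $L_{\eta}$ on a batch of transitions $(s, a, r, s')$.

```python
import torch

def sbeed_objective(V, log_pi, rho, batch, lam, gamma, eta):
    """Estimate L_eta(V, pi; rho) of Eqn (10) on a batch of transitions.

    V(s)         -> value estimates,       shape [B]
    log_pi(s, a) -> log pi(a|s) estimates, shape [B]
    rho(s, a)    -> dual estimates,        shape [B]
    """
    s, a, r, s_next = batch
    # delta(s, a, s') = R(s, a) + gamma * V(s') - lam * log pi(a|s), as in Eqn (9).
    delta = r + gamma * V(s_next) - lam * log_pi(s, a)
    primal_term = (delta - V(s)).pow(2).mean()      # squared smoothed consistency error
    dual_term = (delta - rho(s, a)).pow(2).mean()   # variance-cancelling dual term
    return primal_term - eta * dual_term
```

The saddle point is then handled by ascending this objective in the dual parameters and descending it in the primal parameters, as developed in Section 4 below.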
4 Smoothed Bellman Error Embedding
In this section, we derive the Smoothed Bellman Error EmbeDding (SBEED) algorithm, based on stochastic
mirror descent (Nemirovski et al., 2009), to solve the smoothed Bellman equation. For simplicity of exposition,
we mainly discuss the one-step optimization (10), although it is possible to generalize the algorithm to the
multi-step and eligibility-traces settings; see Appendices C.2 and C.3 for details.
Due to the curse of dimensionality, the quantities $(V, \pi, \rho)$ are often represented by compact, parametric functions in practice. Denote these parameters by $w = (w_V, w_{\pi}, w_{\rho})$. Abusing notation a little bit, we now write the objective function $L_{\eta}(V, \pi; \rho)$ as $L_{\eta}(w_V, w_{\pi}; w_{\rho})$.
First, we note that the inner (dual) problem is standard least-squares regression with parameter $w_{\rho}$, so it can be solved using a variety of algorithms (Bertsekas, 2016); in the presence of special structures like convexity, global optima can be found efficiently (Boyd & Vandenberghe, 2004). The more involved part is to optimize the primal $(w_V, w_{\pi})$, whose gradients are given by the following theorem.
Theorem 4 (Primal gradient) Define $\bar{\ell}_{\eta}(w_V, w_{\pi}) := L_{\eta}(w_V, w_{\pi}; w^*_{\rho})$, where $w^*_{\rho} = \arg\max_{w_{\rho}} L_{\eta}(w_V, w_{\pi}; w_{\rho})$. Let $\delta_{s,a,s'}$ be a shorthand for $\delta(s, a, s')$, and let $\hat{\rho}$ be the dual function parameterized by $w^*_{\rho}$. Then,

$$\nabla_{w_V} \bar{\ell}_{\eta} = 2\, \mathbb{E}_{s,a,s'}\left[ \left( \delta_{s,a,s'} - V(s) \right) \left( \gamma \nabla_{w_V} V(s') - \nabla_{w_V} V(s) \right) \right] - 2\eta\gamma\, \mathbb{E}_{s,a,s'}\left[ \left( \delta_{s,a,s'} - \hat{\rho}(s, a) \right) \nabla_{w_V} V(s') \right],$$

$$\nabla_{w_{\pi}} \bar{\ell}_{\eta} = -2\lambda\, \mathbb{E}_{s,a,s'}\left[ \left( (1 - \eta)\, \delta_{s,a,s'} + \eta\, \hat{\rho}(s, a) - V(s) \right) \nabla_{w_{\pi}} \log \pi(a|s) \right].$$
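Combining the objective above with the gradients of Theorem 4, one possible alternating update reads as the sketch below (an illustrative reading, not the paper's algorithm; `sbeed_objective` is the hypothetical helper from the earlier snippet, and the optimizers and step counts are placeholders). The dual parameters are fitted first by ascending $L_{\eta}$ (equivalently, regressing $\rho(s,a)$ onto $\delta$); the primal parameters $(w_V, w_{\pi})$ then take a descent step with the dual held fixed. Differentiating $L_{\eta}$ at the current dual yields the Theorem 4 gradients, with the fitted dual standing in for the exact maximizer $w^*_{\rho}$.

```python
import torch

def sbeed_step(V, log_pi, rho, batch, lam, gamma, eta,
               primal_opt, dual_opt, n_dual_steps=5):
    # primal_opt optimizes the parameters of V and log_pi; dual_opt those of rho.

    # (i) Dual step: maximize L_eta over w_rho, i.e. regress rho(s, a) onto delta.
    for _ in range(n_dual_steps):
        dual_loss = -sbeed_objective(V, log_pi, rho, batch, lam, gamma, eta)
        dual_opt.zero_grad()
        dual_loss.backward()
        dual_opt.step()

    # (ii) Primal step: minimize L_eta over (w_V, w_pi) with the dual held fixed;
    #      the resulting gradients match Theorem 4 up to the fixed-dual approximation.
    primal_loss = sbeed_objective(V, log_pi, rho, batch, lam, gamma, eta)
    primal_opt.zero_grad()
    primal_loss.backward()
    primal_opt.step()
```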