SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation

Bo Dai^1, Albert Shaw^1, Lihong Li^2, Lin Xiao^3, Niao He^4, Zhen Liu^1, Jianshu Chen^5, Le Song^1

^1 Georgia Institute of Technology, ^2 Google Inc., ^3 Microsoft Research, Redmond, ^4 University of Illinois at Urbana-Champaign, ^5 Tencent AI Lab, Bellevue

June 7, 2018
Abstract
When function approximation is used, solving the Bellman optimality equation with stability guarantees
has remained a major open problem in reinforcement learning for decades. The fundamental difficulty
is that the Bellman operator may become an expansion in general, resulting in oscillating and even
divergent behavior of popular algorithms like Q-learning. In this paper, we revisit the Bellman equation,
and reformulate it into a novel primal-dual optimization problem using Nesterov’s smoothing technique
and the Legendre-Fenchel transformation. We then develop a new algorithm, called Smoothed Bellman
Error Embedding, to solve this optimization problem where any differentiable function class may be
used. We provide what we believe to be the first convergence guarantee for general nonlinear function
approximation, and analyze the algorithm’s sample complexity. Empirically, our algorithm compares
favorably to state-of-the-art baselines in several benchmark control problems.
1 Introduction
In reinforcement learning (RL), the goal of an agent is to learn a policy that maximizes long-term returns by
sequentially interacting with an unknown environment (Sutton & Barto, 1998). The dominating framework
to model such an interaction is the Markov decision process, or MDP, in which the optimal value function is
characterized as a fixed point of the Bellman operator. A fundamental result for MDPs is that the Bellman
operator is a contraction in the value-function space, so the optimal value function is the unique fixed point.
Furthermore, starting from any initial value function, iterative applications of the Bellman operator ensure
convergence to the fixed point. Interested readers are referred to the textbook of Puterman (2014) for details.
Many of the most effective RL algorithms have their root in such a fixed-point view. The most prominent
family of algorithms is perhaps the temporal-difference algorithms, including TD($\lambda$) (Sutton, 1988), Q-
learning (Watkins, 1989), SARSA (Rummery & Niranjan, 1994; Sutton, 1996), and numerous variants such
as the empirically very successful DQN (Mnih et al., 2015) and A3C (Mnih et al., 2016) implementations.
Compared to direct policy search/gradient algorithms like REINFORCE (Williams, 1992), these fixed-point
methods make learning more efficient by bootstrapping (a sample-based version of the Bellman operator).
When the Bellman operator can be computed exactly (even on average), such as when the MDP has finite
state/actions, convergence is guaranteed thanks to the contraction property (Bertsekas & Tsitsiklis, 1996).
Unfortunately, when function approximators are used, such fixed-point methods easily become unstable or
even divergent (Boyan & Moore, 1995; Baird, 1995; Tsitsiklis & Van Roy, 1997), except in a few special cases.
For example,
• for some rather restrictive function classes, such as those with a non-expansion property, some of the finite-state MDP theory continues to apply with modifications (Gordon, 1995; Ormoneit & Sen, 2002; Antos et al., 2008);
• when linear value function approximation is used, convergence is guaranteed in certain cases: for evaluating a fixed policy from on-policy samples (Tsitsiklis & Van Roy, 1997), for evaluating the policy using a closed-form solution from off-policy samples (Boyan, 2002; Lagoudakis & Parr, 2003), or for optimizing a policy using samples collected by a stationary policy (Maei et al., 2010).
In recent years, a few authors have made important progress toward finding scalable, convergent TD algorithms,
by designing proper objective functions and using stochastic gradient descent (SGD) to optimize them (Sutton
et al., 2009; Maei, 2011). Later on, it was realized that several of these gradient-based algorithms can be
interpreted as solving a primal-dual problem (Mahadevan et al., 2014; Liu et al., 2015; Macua et al., 2015;
Dai et al., 2016). This insight has led to novel, faster, and more robust algorithms by adopting sophisticated
optimization techniques (Du et al., 2017). Unfortunately, to the best of our knowledge, all existing works
either assume linear function approximation or are designed for policy evaluation. It remains a major open
problem how to find the optimal policy reliably with general nonlinear function approximators such as neural
networks, especially in the presence of off-policy data.
Contributions
In this work, we take a substantial step towards solving this decades-long open problem, leveraging a powerful saddle-point optimization perspective to derive a new algorithm called the Smoothed Bellman Error Embedding (SBEED) algorithm. Our development hinges upon a novel view of a smoothed Bellman optimality equation, which is then transformed into the final primal-dual optimization problem. SBEED learns the optimal value function and a stochastic policy in the primal, and the Bellman error (also known as the Bellman residual) in the dual. By doing so, it avoids the non-smooth $\max$-operator in the Bellman operator, as well as the double-sample challenge that has plagued RL algorithm designs (Baird, 1995). More specifically,
• SBEED is stable for a broad class of nonlinear function approximators including neural networks, and provably converges to a solution with vanishing gradient. This holds even in the more challenging off-policy case;
• it uses bootstrapping to yield high sample efficiency, as in TD-style methods, and is also generalized to cases of multi-step bootstrapping and eligibility traces;
• it avoids the double-sample issue and directly optimizes the squared Bellman error based on sample trajectories;
• it uses stochastic gradient descent to optimize the objective, and is thus very efficient and scalable.
Furthermore, the algorithm handles both the optimal value function estimation and policy optimization in a
unified way, and readily applies to both continuous and discrete action spaces. We compare the algorithm
with state-of-the-art baselines on several continuous control benchmarks, and obtain excellent results.
2 Preliminaries
In this section, we introduce notation and technical background that is needed in the rest of the paper. We denote a Markov decision process (MDP) as $M = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is a (possibly infinite) state space, $\mathcal{A}$ an action space, $P(\cdot|s, a)$ the transition probability kernel defining the distribution over next states upon taking action $a$ in state $s$, $R(s, a)$ the average immediate reward from taking action $a$ in state $s$, and $\gamma \in (0, 1)$ a discount factor. Given an MDP, we wish to find a possibly stochastic policy $\pi : \mathcal{S} \to \mathcal{P}_{\mathcal{A}}$ to maximize the expected discounted cumulative reward starting from any state $s \in \mathcal{S}$: $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s, \pi\right]$, where $\mathcal{P}_{\mathcal{A}}$ denotes the set of all probability measures over $\mathcal{A}$. The set of all policies is denoted by $\mathcal{P} := (\mathcal{P}_{\mathcal{A}})^{\mathcal{S}}$.
Define $V^*(s) := \max_{\pi(\cdot|s)} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s, \pi\right]$ to be the optimal value function. It is known that $V^*$ is the unique fixed point of the Bellman operator $\mathcal{T}$, or equivalently, the unique solution to the Bellman optimality equation (Bellman equation, for short) (Puterman, 2014):

$$V(s) = (\mathcal{T}V)(s) := \max_{a} \left\{ R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V(s') \right] \right\}. \qquad (1)$$

The optimal policy $\pi^*$ is related to $V^*$ by the following:

$$\pi^*(a|s) = \operatorname*{argmax}_{a} \left\{ R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V^*(s') \right] \right\}.$$

It should be noted that, in practice, for convenience we often work with the Q-function instead of the state-value function $V^*$. In this paper, it suffices to use the simpler $V^*$ function.
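To make the fixed-point view concrete, here is a minimal sketch (not taken from the paper; the MDP below is synthetic and all names are illustrative) of value iteration, i.e., repeated application of the Bellman operator of Eqn (1) on a small finite MDP. By the contraction property, the iterates converge geometrically at rate $\gamma$ to the unique fixed point $V^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Synthetic finite MDP: P[a, s, s'] is a row-stochastic transition kernel,
# R[s, a] the mean immediate reward.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def bellman_operator(V):
    """(T V)(s) = max_a { R(s, a) + gamma * E_{s'|s,a}[V(s')] }, as in Eqn (1)."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)   # Q[s, a]
    return Q.max(axis=1)

V = np.zeros(n_states)
for _ in range(500):
    V_next = bellman_operator(V)
    if np.max(np.abs(V_next - V)) < 1e-10:
        break
    V = V_next
print("approximate fixed point V*:", np.round(V, 4))
```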
3 A Primal-Dual View of Bellman Equation
In this section, we introduce a novel view of the Bellman equation that enables the development of the new algorithm in Section 4. After reviewing the Bellman equation and the challenges in solving it, we describe the two key technical ingredients that lead to our primal-dual reformulation.
We start with another version of the Bellman equation that is equivalent to Eqn (1) (see, e.g., Puterman (2014)):

$$V(s) = \max_{\pi(\cdot|s) \in \mathcal{P}_{\mathcal{A}}} \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V(s') \right] \right]. \qquad (2)$$
Eqn (2) makes the role of a policy explicit. Naturally, one may try to jointly optimize over $V$ and $\pi$ to minimize the discrepancy between the two sides of (2). For concreteness, we focus on the squared distance in this paper, but our results can be extended to other convex loss functions. Let $\mu$ be some given state distribution such that $\mu(s) > 0$ for all $s \in \mathcal{S}$. Minimizing the squared Bellman error gives the following:

$$\min_{V} \; \mathbb{E}_{s \sim \mu}\left[ \left( \max_{\pi(\cdot|s) \in \mathcal{P}_{\mathcal{A}}} \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V(s') \right] \right] - V(s) \right)^2 \right]. \qquad (3)$$
While natural, this approach has several major difficulties when it comes to optimization, which are to be dealt with in the following subsections:

• The $\max$ operator over $\mathcal{P}_{\mathcal{A}}$ introduces non-smoothness to the objective function. A slight change in $V$ may cause large differences in the RHS of Eqn (2).
• The conditional expectation, $\mathbb{E}_{s'|s,a}[\cdot]$, composed within the square loss, requires double samples (Baird, 1995) to obtain unbiased gradients, which is impractical in all but simulated environments.
3.1 Smoothed Bellman Equation
To avoid the instability and discontinuity caused by the $\max$ operator, we use the smoothing technique of Nesterov (2005) to smooth the Bellman operator $\mathcal{T}$. Since policies are conditional distributions over $\mathcal{A}$, we choose entropy regularization, and Eqn (2) becomes:

$$V_{\lambda}(s) = \max_{\pi(\cdot|s) \in \mathcal{P}_{\mathcal{A}}} \left\{ \mathbb{E}_{a \sim \pi(\cdot|s)}\left[ R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V_{\lambda}(s') \right] \right] + \lambda H(\pi, s) \right\}, \qquad (4)$$
where $H(\pi, s) := -\sum_{a \in \mathcal{A}} \pi(a|s) \log \pi(a|s)$, and $\lambda > 0$ controls the degree of smoothing. Note that with $\lambda = 0$, we obtain the standard Bellman equation. Moreover, the regularization may be viewed as a shaping reward added to the reward function of an induced, equivalent MDP; see Appendix C.2 for more details.
Since negative entropy is the conjugate of the log-sum-exp function (Boyd & Vandenberghe, 2004, Example 3.25), Eqn (4) can be written equivalently as

$$V_{\lambda}(s) = (\mathcal{T}_{\lambda} V_{\lambda})(s) := \lambda \log \sum_{a \in \mathcal{A}} \exp\left( \frac{R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V_{\lambda}(s') \right]}{\lambda} \right), \qquad (5)$$

where the log-sum-exp is an effective smoothing approximation of the max-operator.
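As a quick illustration (a sketch, not from the paper; the numbers are arbitrary), the scaled log-sum-exp in Eqn (5) can be checked numerically against the hard max: it always upper-bounds the max, and the gap shrinks as $\lambda \to 0$.

```python
import numpy as np

def smoothed_max(q, lam):
    """lam * log sum_a exp(q_a / lam): the soft maximum defining T_lambda in Eqn (5)."""
    m = q.max()                                    # subtract the max for numerical stability
    return m + lam * np.log(np.exp((q - m) / lam).sum())

q = np.array([1.0, 1.5, 0.2])                      # stand-in for R(s, a) + gamma * E[V(s')]
for lam in [1.0, 0.1, 0.01]:
    print(lam, smoothed_max(q, lam) - q.max())     # gap is positive and at most lam * log(|A|)
```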
Remark. While Eqns (4) and (5) are inspired by Nesterov's smoothing technique, they can also be derived from other principles (Rawlik et al., 2012; Fox et al., 2016; Neu et al., 2017; Nachum et al., 2017; Asadi & Littman, 2017). For example, Nachum et al. (2017) use entropy regularization in the policy space to encourage exploration, but arrive at the same smoothed form; the smoothed operator $\mathcal{T}_{\lambda}$ is called "Mellowmax" by Asadi & Littman (2017), which is obtained as a particular instantiation of the quasi-arithmetic mean. In the rest of the subsection, we review the properties of $\mathcal{T}_{\lambda}$, although some of the results have appeared in the literature in slightly different forms. Proofs are deferred to Appendix A.
First, we show that $\mathcal{T}_{\lambda}$ is also a contraction, as with the standard Bellman operator (Fox et al., 2016; Asadi & Littman, 2017):

Proposition 1 (Contraction) $\mathcal{T}_{\lambda}$ is a $\gamma$-contraction. Consequently, the corresponding smoothed Bellman equation (4), or equivalently (5), has a unique solution $V^*_{\lambda}$.
Second, we show that while in general $V^* \neq V^*_{\lambda}$, their difference is controlled by $\lambda$. To do so, define $H^* := \max_{s \in \mathcal{S},\, \pi(\cdot|s) \in \mathcal{P}_{\mathcal{A}}} H(\pi, s)$. For finite action spaces, we immediately have $H^* = \log(|\mathcal{A}|)$.

Proposition 2 (Smoothing bias) Let $V^*$ and $V^*_{\lambda}$ be fixed points of (2) and (4), respectively. Then,

$$\left\| V^*(s) - V^*_{\lambda}(s) \right\|_{\infty} \le \frac{\lambda H^*}{1 - \gamma}.$$

Consequently, as $\lambda \to 0$, $V^*_{\lambda}$ converges to $V^*$ pointwise.
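As an illustrative calculation (not from the paper): with $|\mathcal{A}| = 4$, $\gamma = 0.99$, and $\lambda = 0.01$, we have $H^* = \log 4$, so the bound gives $\|V^* - V^*_{\lambda}\|_{\infty} \le 0.01 \cdot \log 4 / (1 - 0.99) = \log 4 \approx 1.386$; halving $\lambda$ halves this worst-case bias.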
Finally, the smoothed Bellman operator has the very nice property of temporal consistency (Rawlik et al., 2012; Nachum et al., 2017):

Proposition 3 (Temporal consistency) Assume $\lambda > 0$. Let $V^*_{\lambda}$ be the fixed point of (4) and $\pi^*_{\lambda}$ the corresponding policy that attains the maximum on the RHS of (4). Then, $(V^*_{\lambda}, \pi^*_{\lambda})$ is the unique $(V, \pi)$ pair that satisfies the following equality for all $(s, a) \in \mathcal{S} \times \mathcal{A}$:

$$V(s) = R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V(s') \right] - \lambda \log \pi(a|s). \qquad (6)$$

In other words, Eqn (6) provides an easy-to-check condition that characterizes the optimal value function and optimal policy at an arbitrary pair $(s, a)$, which therefore makes it easy to incorporate off-policy data. It can also be extended to the multi-step or eligibility-traces cases (Appendix C; see also Sutton & Barto (1998, Chapter 7)). Later, this condition will be one of the critical foundations for developing our new algorithm.
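The following sketch (illustrative only; it assumes SciPy is available and that the maximizer of (4) is the softmax policy $\pi^*_{\lambda}(a|s) \propto \exp\big((R(s,a) + \gamma\, \mathbb{E}_{s'|s,a}[V^*_{\lambda}(s')])/\lambda\big)$, a closed form not restated in this excerpt) solves the smoothed equation (5) by fixed-point iteration on a synthetic MDP and then checks that the resulting pair satisfies the temporal-consistency condition (6) at every $(s, a)$.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lam = 5, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))   # P[a, s, s']
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))              # R[s, a]

def smoothed_bellman(V):
    """(T_lambda V)(s) = lam * logsumexp_a((R + gamma * E[V']) / lam), as in Eqn (5)."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return lam * logsumexp(Q / lam, axis=1), Q

V = np.zeros(n_states)
for _ in range(5000):
    V_new, Q = smoothed_bellman(V)
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

# Assumed maximizer of (4): softmax policy with temperature lam.
log_pi = Q / lam - logsumexp(Q / lam, axis=1, keepdims=True)

# Temporal consistency (6): V(s) = R(s,a) + gamma * E[V(s')] - lam * log pi(a|s) for all (s, a).
residual = Q - lam * log_pi - V[:, None]
print("max |consistency residual|:", np.abs(residual).max())        # ~ numerical precision
```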
3.2 Bellman Error Embedding
A natural objective function inspired by (6) is the mean squared consistency Bellman error, given by:

$$\min_{V,\, \pi \in \mathcal{P}} \; \ell(V, \pi) := \mathbb{E}_{s,a}\left[ \left( R(s, a) + \gamma\, \mathbb{E}_{s'|s,a}\left[ V(s') \right] - \lambda \log \pi(a|s) - V(s) \right)^2 \right], \qquad (7)$$
where $\mathbb{E}_{s,a}[\cdot]$ is shorthand for $\mathbb{E}_{s \sim \mu(\cdot),\, a \sim \pi_b(\cdot|s)}[\cdot]$. Unfortunately, due to the inner conditional expectation, it would require two independent samples of $s'$ (starting from the same $(s, a)$) to obtain an unbiased estimate of the gradient of $\ell$, a problem known as the double-sample issue (Baird, 1995). In practice, however, one can rarely obtain two independent samples except in simulated environments.
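To see the double-sample issue in isolation, consider the toy check below (a sketch, not from the paper): squaring a single-sample estimate of a conditional expectation over-estimates the squared expectation by exactly the conditional variance, whereas the product of two independent samples does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# For a fixed (s, a), think of X as the random one-step target R(s, a) + gamma * V(s').
X1 = rng.normal(loc=2.0, scale=1.5, size=1_000_000)   # E[X] = 2.0, Var[X] = 2.25
X2 = rng.normal(loc=2.0, scale=1.5, size=1_000_000)   # an independent second sample

target = np.mean(X1) ** 2          # (E[X])^2 = 4.0: what the squared Bellman error needs
one_sample = np.mean(X1 ** 2)      # E[X^2] ~ 6.25: what squaring a single sample estimates
two_samples = np.mean(X1 * X2)     # E[X]^2 ~ 4.0: unbiased, but needs two independent draws

print(target, one_sample, two_samples)   # the one-sample version is biased up by Var[X]
```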
To bypass this problem, we make use of the conjugate of the square function (Boyd & Vandenberghe, 2004): $x^2 = \max_{\nu} \left( 2\nu x - \nu^2 \right)$, as well as the interchangeability principle (Shapiro et al., 2009; Dai et al., 2016), to rewrite the optimization problem (7) into an equivalent form:

$$\min_{V,\, \pi \in \mathcal{P}} \; \max_{\nu \in \mathcal{F}_{\mathcal{S} \times \mathcal{A}}} \; L(V, \pi; \nu) := 2\, \mathbb{E}_{s,a,s'}\left[ \nu(s, a) \left( R(s, a) + \gamma V(s') - \lambda \log \pi(a|s) - V(s) \right) \right] - \mathbb{E}_{s,a,s'}\left[ \nu^2(s, a) \right], \qquad (8)$$

where $\mathcal{F}_{\mathcal{S} \times \mathcal{A}}$ is the set of real-valued functions on $\mathcal{S} \times \mathcal{A}$, and $\mathbb{E}_{s,a,s'}[\cdot]$ is shorthand for $\mathbb{E}_{s \sim \mu(\cdot),\, a \sim \pi_b(\cdot|s),\, s' \sim P(\cdot|s,a)}[\cdot]$. Note that (8) is not a standard convex-concave saddle-point problem: the objective is convex in $V$ for any fixed $(\pi, \nu)$, and concave in $\nu$ for any fixed $(V, \pi)$, but not necessarily convex in $\pi \in \mathcal{P}$ for any fixed $(V, \nu)$.
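The identity $x^2 = \max_{\nu}\left(2\nu x - \nu^2\right)$ is elementary (the maximizer is $\nu = x$), which is what makes it natural for the optimal dual to track the Bellman error itself; a quick numerical sanity check (illustrative only):

```python
import numpy as np

x = 1.7
nu = np.linspace(-5.0, 5.0, 10001)
vals = 2.0 * nu * x - nu ** 2
print(vals.max(), x ** 2)          # both ~ 2.89
print(nu[vals.argmax()])           # maximizer ~ x = 1.7
```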
Remark. In contrast to our saddle-point formulation (8), Nachum et al. (2017) get around the double-sample obstacle by minimizing an upper bound of $\ell(V, \pi)$:

$$\tilde{\ell}(V, \pi) := \mathbb{E}_{s,a,s'}\left[ \left( R(s, a) + \gamma V(s') - \lambda \log \pi(a|s) - V(s) \right)^2 \right].$$

As is known (Baird, 1995), the gradient of $\tilde{\ell}$ is different from that of $\ell$, as it has a conditional variance term coming from the stochastic outcome $s'$. In problems where this variance is highly heterogeneous across different $(s, a)$ pairs, the impact of such a bias can be substantial.
Finally, substituting the dual function $\nu(s, a) = \rho(s, a) - V(s)$, the objective in the saddle-point problem becomes

$$\min_{V,\, \pi} \; \max_{\rho \in \mathcal{F}_{\mathcal{S} \times \mathcal{A}}} \; L_1(V, \pi; \rho) := \mathbb{E}_{s,a,s'}\left[ \left( \delta(s, a, s') - V(s) \right)^2 \right] - \mathbb{E}_{s,a,s'}\left[ \left( \delta(s, a, s') - \rho(s, a) \right)^2 \right], \qquad (9)$$

where $\delta(s, a, s') := R(s, a) + \gamma V(s') - \lambda \log \pi(a|s)$. Note that the first term is $\tilde{\ell}(V, \pi)$, and the second term will cancel the extra variance term (see Proposition 8 in Appendix B). The use of an auxiliary function to cancel the variance is also observed by Antos et al. (2008). On the other hand, when function approximation is used, extra bias will also be introduced. We note that such a saddle-point view of debiasing the extra variance term leads to a useful mechanism for better bias-variance trade-offs, leading to the final primal-dual formulation we aim to solve in the next section:

$$\min_{V,\, \pi \in \mathcal{P}} \; \max_{\rho \in \mathcal{F}_{\mathcal{S} \times \mathcal{A}}} \; L_{\eta}(V, \pi; \rho) := \mathbb{E}_{s,a,s'}\left[ \left( \delta(s, a, s') - V(s) \right)^2 \right] - \eta\, \mathbb{E}_{s,a,s'}\left[ \left( \delta(s, a, s') - \rho(s, a) \right)^2 \right], \qquad (10)$$

where $\eta \in [0, 1]$ is a hyper-parameter controlling the trade-off. When $\eta = 1$, this reduces to the original saddle-point formulation (8). When $\eta = 0$, this reduces to the surrogate objective considered by Nachum et al. (2017).
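To make (10) concrete, the sketch below (assumptions: PyTorch; `V`, `log_pi`, and `rho` are generic callables standing in for the parametric value function, policy log-probability, and dual function, and none of these names come from the paper) estimates $L_{\eta}$ on a batch of transitions $(s, a, r, s')$.

```python
import torch

def sbeed_objective(V, log_pi, rho, batch, lam, gamma, eta):
    """Estimate L_eta(V, pi; rho) of Eqn (10) on a batch of transitions.

    V(s)         -> value estimates,       shape [B]
    log_pi(s, a) -> log pi(a|s) estimates, shape [B]
    rho(s, a)    -> dual estimates,        shape [B]
    """
    s, a, r, s_next = batch
    # delta(s, a, s') = R(s, a) + gamma * V(s') - lam * log pi(a|s), as in Eqn (9).
    delta = r + gamma * V(s_next) - lam * log_pi(s, a)
    primal_term = (delta - V(s)).pow(2).mean()      # squared smoothed consistency error
    dual_term = (delta - rho(s, a)).pow(2).mean()   # variance-cancelling dual term
    return primal_term - eta * dual_term
```

The saddle point is then handled by ascending this objective in the dual parameters and descending it in the primal parameters, as developed in Section 4 below.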
4 Smoothed Bellman Error Embedding
In this section, we derive the Smoothed Bellman Error EmbeDding (SBEED) algorithm, based on stochastic
mirror descent (Nemirovski et al., 2009), to solve the smoothed Bellman equation. For simplicity of exposition,
we mainly discuss the one-step optimization (10), although it is possible to generalize the algorithm to the
multi-step and eligibility-traces settings; see Appendices C.2 and C.3 for details.
Due to the curse of dimensionality, the quantities $(V, \pi, \rho)$ are often represented by compact, parametric functions in practice. Denote these parameters by $w = (w_V, w_{\pi}, w_{\rho})$. Abusing notation a little bit, we now write the objective function $L_{\eta}(V, \pi; \rho)$ as $L_{\eta}(w_V, w_{\pi}; w_{\rho})$.
First, we note that the inner (dual) problem is standard least-squares regression with parameter $w_{\rho}$, so it can be solved using a variety of algorithms (Bertsekas, 2016); in the presence of special structures like convexity, global optima can be found efficiently (Boyd & Vandenberghe, 2004). The more involved part is to optimize the primal $(w_V, w_{\pi})$, whose gradients are given by the following theorem.
Theorem 4 (Primal gradient) Define $\bar{\ell}_{\eta}(w_V, w_{\pi}) := L_{\eta}(w_V, w_{\pi}; w^*_{\rho})$, where $w^*_{\rho} = \arg\max_{w_{\rho}} L_{\eta}(w_V, w_{\pi}; w_{\rho})$. Let $\delta_{s,a,s'}$ be a shorthand for $\delta(s, a, s')$, and let $\hat{\rho}$ be the dual function parameterized by $w^*_{\rho}$. Then,

$$\nabla_{w_V} \bar{\ell}_{\eta} = 2\, \mathbb{E}_{s,a,s'}\left[ \left( \delta_{s,a,s'} - V(s) \right) \left( \gamma \nabla_{w_V} V(s') - \nabla_{w_V} V(s) \right) \right] - 2\eta\gamma\, \mathbb{E}_{s,a,s'}\left[ \left( \delta_{s,a,s'} - \hat{\rho}(s, a) \right) \nabla_{w_V} V(s') \right],$$

$$\nabla_{w_{\pi}} \bar{\ell}_{\eta} = -2\lambda\, \mathbb{E}_{s,a,s'}\left[ \left( (1 - \eta)\, \delta_{s,a,s'} + \eta\, \hat{\rho}(s, a) - V(s) \right) \nabla_{w_{\pi}} \log \pi(a|s) \right].$$
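Combining the objective above with the gradients of Theorem 4, one possible alternating update reads as the sketch below (an illustrative reading, not the paper's algorithm; `sbeed_objective` is the hypothetical helper from the earlier snippet, and the optimizers and step counts are placeholders). The dual parameters are fitted first by ascending $L_{\eta}$ (equivalently, regressing $\rho(s,a)$ onto $\delta$); the primal parameters $(w_V, w_{\pi})$ then take a descent step with the dual held fixed. Differentiating $L_{\eta}$ at the current dual yields the Theorem 4 gradients, with the fitted dual standing in for the exact maximizer $w^*_{\rho}$.

```python
import torch

def sbeed_step(V, log_pi, rho, batch, lam, gamma, eta,
               primal_opt, dual_opt, n_dual_steps=5):
    # primal_opt optimizes the parameters of V and log_pi; dual_opt those of rho.

    # (i) Dual step: maximize L_eta over w_rho, i.e. regress rho(s, a) onto delta.
    for _ in range(n_dual_steps):
        dual_loss = -sbeed_objective(V, log_pi, rho, batch, lam, gamma, eta)
        dual_opt.zero_grad()
        dual_loss.backward()
        dual_opt.step()

    # (ii) Primal step: minimize L_eta over (w_V, w_pi) with the dual held fixed;
    #      the resulting gradients match Theorem 4 up to the fixed-dual approximation.
    primal_loss = sbeed_objective(V, log_pi, rho, batch, lam, gamma, eta)
    primal_opt.zero_grad()
    primal_loss.backward()
    primal_opt.step()
```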