Chapter 15
Introduction to Stochastic Approximation Algorithms¹
Stochastic approximation algorithms are recursive update rules that can be
used, among other things, to solve optimization problems and fixed point equa-
tions (including standard linear systems) when the collected data is subject to
noise. In engineering, optimization problems are often of this type, when you
do not have a mathematical model of the system (which can be too complex)
but still would like to optimize its behavior by adjusting certain parameters.
For this purpose, you can do experiments or run simulations to evaluate the
performance of the system at given values of the parameters. Stochastic ap-
proximation algorithms have also been used in the social sciences to describe
collective dynamics: fictitious play in learning theory and consensus algorithms
can be studied using their theory. In short, it is hard to overemphasize their
usefulness. In addition, the theory of stochastic approximation algorithms, at
least when approached using the ODE method as done here, is a beautiful mix
of dynamical systems theory and probability theory. We only have time to give
you a flavor of this theory but hopefully this will motivate you to explore fur-
ther on your own. For our purpose, essentially all approximate DP algorithms
encountered in the following chapters are stochastic approximation algorithms.
We will not have time to give formal convergence proofs for all of them, but this
chapter should give you a starting point to understand the basic mechanisms
involved. Most of the material discussed here is taken from [Bor08].
15.1 Example: The Robbins-Monro Algorithm
Suppose we wish to find the root θ̄ of the function f : R → R. We can use
Newton's procedure, which generates the sequence of iterates

θ_{n+1} = θ_n − f(θ_n) / f′(θ_n).
¹ This version: October 31, 2009.
Suppose we also know a neighborhood of θ̄, where f(θ) < 0 for θ < θ̄, f(θ) > 0
for θ > θ̄, and f is nondecreasing in this neighborhood. Then if we start
at θ_0 close enough to θ̄, the following simpler (but less efficient) scheme also
converges to θ̄, and does not require the derivative of f:

θ_{n+1} = θ_n − α f(θ_n),   (15.1)
for some fixed and sufficiently small α> 0. Note that if f is itself the derivative
of a function F , these schemes correspond to Newton’s method and a fixed-
step gradient descent procedure for minimizing F , respectively (more precisely,
finding a critical point of F or root of the gradient of F ).
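To make the comparison concrete, here is a minimal sketch of both schemes on the illustrative function f(θ) = θ³ − 2 (a hypothetical example, not from the text), whose unique real root is 2^{1/3}:

```python
# A minimal sketch of the two deterministic schemes on the
# illustrative function f(theta) = theta**3 - 2 (a hypothetical
# example; its unique real root is 2**(1/3) ≈ 1.2599).

def f(theta):
    return theta**3 - 2.0

def f_prime(theta):
    return 3.0 * theta**2

# Newton's procedure: theta_{n+1} = theta_n - f(theta_n) / f'(theta_n)
theta = 1.0
for _ in range(20):
    theta -= f(theta) / f_prime(theta)
newton_root = theta

# Fixed-step scheme (15.1): theta_{n+1} = theta_n - alpha * f(theta_n)
theta = 1.0
alpha = 0.1  # fixed, sufficiently small step size
for _ in range(200):
    theta -= alpha * f(theta)
fixed_step_root = theta

print(newton_root, fixed_step_root)  # both close to 2**(1/3)
```

Note that the fixed-step scheme needs many more iterations than Newton's method, but each iteration is cheaper and no derivative is required.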
Very often in applications, we do not have access to the mathematical model
f, but we can do experiments or simulations to sample the function at particular
values of θ. These samples are typically noisy however, so that we can assume
that we have a black-box at our disposal (the simulator, the lab where we do
the experiments, etc.), which on input θ returns the value y = f(θ) + d, where
d is a noise, which will soon be assumed to be random. The point is that we
only have access to the value y, and we have no way of removing the noise from
it, i.e., of isolating the exact value of f (θ). Now suppose that we still want to
find a root of f as in the problem above, with access only to this noisy black
box.
Assume for now that we know that the noise is i.i.d. and zero-mean. A first
approach to the problem could be, for a given value of θ, to sample sufficiently
many times at the same point θ and get values y_1, …, y_N, and then form an
estimate of f(θ) using the empirical average

f(θ) ≈ (1/N) Σ_{i=1}^{N} y_i.   (15.2)
With sufficiently many samples at every iterate θ_n of (15.1), we can reasonably
hope to find approximately the root of f. The problem is that we might spend
a lot of time taking samples at points θ that are far from θ̄ and are not really
relevant, except for telling us in which direction to move next. This can be a
real issue if obtaining each sample is time-consuming or costly.
An alternative procedure, studied by Robbins and Monro [RM51]², is to
use the noisy version of f directly in a slightly modified version of
algorithm (15.1):

θ_{n+1} = θ_n − γ_n y_n,   (15.3)

where γ_n is a sequence of positive numbers converging to 0 and such that
Σ_n γ_n = ∞ (for example, γ_n = 1/(n + 1)), and y_n = f(θ_n) + d_n is the noisy
version of f(θ_n). Note that the iterates θ_n are now random variables.
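A minimal sketch of the Robbins-Monro iteration (15.3), assuming an illustrative black box y = f(θ) + d with f(θ) = θ − 2 and zero-mean Gaussian noise (these choices are for illustration, not from the text):

```python
import random

# A minimal sketch of the Robbins-Monro iteration (15.3). The black
# box returns y = f(theta) + d with f(theta) = theta - 2 and Gaussian
# zero-mean noise d; both choices are illustrative, not from the text.

random.seed(0)

def noisy_f(theta):
    return (theta - 2.0) + random.gauss(0.0, 1.0)

theta = 0.0
for n in range(100000):
    gamma_n = 1.0 / (n + 1)   # positive steps with gamma_n -> 0, sum = infinity
    theta -= gamma_n * noisy_f(theta)

print(theta)  # close to the root theta_bar = 2
```

Even though each individual observation is corrupted by noise of constant variance, the decreasing steps γ_n let the iterates settle down near the root.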
The intuition behind the decreasing step size γ_n is that it provides a sort
of averaging of the observations. For an analogy in a simpler setting, suppose
² In fact, recursive stochastic algorithms have been used in signal processing (e.g., for
smoothing radar returns) even before the work of Robbins and Monro. However, there was
apparently no general asymptotic theory.
we have i.i.d. observations ξ_1, …, ξ_N of a random variable and wish to form
their empirical average as in (15.2). A recursive alternative to (15.2), extremely
useful in settings where the samples become available progressively with time
(recall for example the Kalman filter), is to form

θ_1 = ξ_1,   θ_{n+1} = θ_n − γ_n [θ_n − ξ_{n+1}],

with γ_n = 1/(n + 1). One can immediately verify that θ_n = (Σ_{i=1}^{n} ξ_i)/n, for
all n.
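This identity is easy to check numerically; the sketch below runs the recursion on illustrative uniform samples and compares with the batch average:

```python
import random

# Sketch of the recursive-averaging identity: with gamma_n = 1/(n+1),
# theta_1 = xi_1 and theta_{n+1} = theta_n - gamma_n * (theta_n - xi_{n+1})
# reproduce the empirical average of the samples (illustrative data).

random.seed(1)
xs = [random.random() for _ in range(1000)]   # i.i.d. observations xi_i

theta = xs[0]                                 # theta_1 = xi_1
for n in range(1, len(xs)):
    gamma_n = 1.0 / (n + 1)
    theta -= gamma_n * (theta - xs[n])        # xs[n] plays the role of xi_{n+1}

empirical_average = sum(xs) / len(xs)
print(abs(theta - empirical_average))  # agrees up to floating-point error
```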
This chapter is concerned with recurrences generalizing (15.3) of the form

θ_{n+1} = θ_n + γ_n [f(θ_n) + b_n + D_{n+1}],   (15.4)

where θ_0 ∈ R^d is possibly random, f is a function R^d → R^d, b_n is a small
systematic perturbation term, such as a bias in our estimator of f(θ_n), and D_{n+1}
is a random noise with zero mean (conditioned on the past). The assumptions
and exact definitions of these terms will be made precise in section 15.3. In
applications, we are typically first interested in the asymptotic behavior of the
sequence {θ_n}.
15.2 The ODE Approach and More Application Examples
The ODE (Ordinary Differential Equation) method says roughly that if the
step sizes γ_n are appropriately chosen, the bias terms b_n decrease appropriately,
and the noise D_n is zero-mean, then the iterates (15.4) asymptotically track
the trajectories of the dynamical system³

θ̇ = f(θ).
We will give a more formal proof of this fact in the basic case in section 15.3.
Typically for the simplest proofs γ_n must be decreasing to 0 and satisfy

Σ_n γ_n = ∞,   Σ_n γ_n² < ∞.
However other choices are possible, including constant small step sizes in some
cases, and in practice the choice of step sizes requires experimentation because
it controls the convergence rate. Some theoretical results regarding convergence
rates are also available but will not be covered here. The ODE is extremely
useful in any case, even if another technique is chosen for formal convergence
proofs, in order to get a quick idea of the behavior of an algorithm. Moreover,
another big advantage of this method is that it can be used to easily create new
stochastic approximation algorithms from convergent ODEs. We now describe
a few more classes of problems where these algorithms arise.
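The heuristic can be seen at work in a small sketch: with the illustrative choice f(θ) = −θ (and b_n = 0), the ODE θ̇ = −θ is globally stable at 0, and the noisy iterates (15.4) converge there despite noise of constant variance.

```python
import random

# Sketch of the ODE heuristic on an illustrative example: with
# f(theta) = -theta the ODE theta_dot = -theta is globally stable
# at 0, and the iterates (15.4) (here with b_n = 0 and Gaussian
# zero-mean noise D_{n+1}) converge there despite the noise.

random.seed(2)

theta = 5.0
for n in range(100000):
    gamma_n = 1.0 / (n + 1)       # sum gamma_n = inf, sum gamma_n^2 < inf
    D = random.gauss(0.0, 1.0)    # zero-mean noise D_{n+1}
    theta += gamma_n * (-theta + D)

print(theta)  # close to 0, the equilibrium of the ODE
```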
³ By definition, ẋ := (d/dt) x(t).