Approximate Dynamic Programming - II:
Algorithms
Warren B. Powell
December 8, 2009
Abstract
Approximate dynamic programming is a powerful class of algorithmic strategies for solving stochastic optimization problems where optimal decisions can be characterized using Bellman's optimality equation, but where the characteristics of the problem make solving Bellman's equation computationally intractable. This brief chapter provides an introduction to the basic concepts of approximate dynamic programming while building bridges between the different communities that have contributed to this field. We cover basic approximate value iteration (temporal difference learning), policy approximation, and a brief introduction to strategies for approximating value functions. We cover Q-learning, and the use of the post-decision state for solving problems with vector-valued decisions. The approximate linear programming method is introduced, along with a discussion of stepsize selection issues. The presentation closes with a discussion of some practical issues that arise in the implementation of ADP techniques.
1 Introduction
Approximate dynamic programming represents a powerful modeling and algorithmic strategy that
can address a wide range of optimization problems that involve making decisions sequentially in
the presence of different types of uncertainty. A short list of applications, illustrating different problem classes, includes the following:
• Option pricing - An American option allows us to buy or sell an asset at any time up to a
specified time, where we make money when the price goes under or over (respectively) a set
strike price. Valuing the option requires finding an optimal policy for determining when to
exercise the option.
• Playing games - Computer algorithms have been designed to play backgammon, bridge, chess
and, recently, the Chinese game of Go.
• Controlling a device - This might be a robot or unmanned aerial vehicle, but there is a need
for autonomous devices to manage themselves for tasks ranging from vacuuming the floor to
collecting information about terrorists.
• Storage of continuous resources - Managing the cash balance for a mutual fund or the amount
of water in a reservoir used for a hydroelectric dam requires managing a continuous resource
over time in the presence of stochastic information on parameters such as prices and rainfall.
• Asset acquisition - We often have to acquire physical and financial assets over time under
different sources of uncertainty about availability, demands and prices.
• Resource allocation - Whether we are managing blood inventories, financial portfolios or fleets
of vehicles, we often have to move, transform, clean and repair resources to meet various needs
under uncertainty about demands and prices.
• R & D portfolio optimization - The Department of Energy has to determine how to allocate
government funds to advance the science of energy generation, transmission and storage. These
decisions have to be made over time in the presence of uncertain changes in technology and
commodity prices.
These problems range from relatively low dimensional applications to very high-dimensional industrial problems, but all share the property of making decisions over time under different types of
uncertainty. In Powell (2010), a modeling framework is described which breaks a problem into five
components:
• State - $S_t$ (or $x_t$ in the control theory community), capturing all the information we need at time t to make a decision and model the evolution of the system in the future.
• Action/decision/control - Depending on the community, these will be modeled as a, x or u. Decisions are made using a decision function, or policy. If we are using action a, we represent the decision function using $A^\pi(S_t)$ where $\pi \in \Pi$ is a family of policies (or functions). If our action is x, we use $X^\pi(S_t)$. We use a if our problem has a small number of discrete actions. We use x when the decision might be a vector (of discrete or continuous elements).
• Exogenous information - Lacking standard notation, we let $W_t$ be the family of random variables that represent new information that first becomes known by time t.
• Transition function - Also known as the system model (or just model), this function is denoted $S^M(\cdot)$, and is used to express the evolution of the state variable, as in $S_{t+1} = S^M(S_t, x_t, W_{t+1})$.
• Objective function - Let $C(S_t, x_t)$ be the contribution (if we are maximizing) received when in state $S_t$ if we take action $x_t$. Our objective is to find a decision function (policy) that solves
\[
\max_{\pi \in \Pi} \mathbb{E}\left\{ \sum_{t=0}^{T} \gamma^t C(S_t, X^\pi(S_t)) \right\}. \tag{1}
\]
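These five components map directly into code. Below is a minimal Python sketch, on a purely illustrative toy storage problem, of a transition function, a contribution function, a fixed policy, and a Monte Carlo estimate of the objective in equation (1); none of the names or numbers come from the chapter.

```python
import random

# Illustrative toy model of the five components: state S_t, decision x_t,
# exogenous information W_{t+1}, transition function S^M, contribution C.

def transition(state, decision, info):
    """System model S^M: next state from the current state, decision, and new information."""
    return max(0.0, state + decision + info)

def contribution(state, decision):
    """Contribution C(S_t, x_t) received when taking decision x_t in state S_t."""
    return min(state, 5.0) - 0.1 * abs(decision)

def policy(state):
    """A fixed decision function X^pi(S_t); here, a simple rule of thumb."""
    return 2.0 if state < 3.0 else -1.0

def estimate_objective(T=20, gamma=0.95, n_samples=1000, seed=0):
    """Monte Carlo estimate of the objective in equation (1) for this policy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        state, value = 0.0, 0.0
        for t in range(T + 1):
            decision = policy(state)
            value += (gamma ** t) * contribution(state, decision)
            info = rng.gauss(0.0, 1.0)      # sample of the exogenous information W_{t+1}
            state = transition(state, decision, info)
        total += value
    return total / n_samples

print(estimate_objective())
```

Finding a good policy means searching over such decision functions, which is exactly what the objective in (1) asks for.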
We encourage readers to review Powell (2010) (in this volume) or chapter 5 in Powell (2007), available
at http://www.castlelab.princeton.edu/adp.htm, before attempting to design an algorithmic strategy.
For the remainder of this chapter, we assume that the reader is familiar with the modeling behind
the objective function in equation (1), and in particular the range of policies that can be used to
provide solutions. In this chapter, we assume that we would like to find a good policy by using
Bellman’s equation as a starting point, which we may write in either of two ways:
\[
V_t(S_t) = \max_a \left( C(S_t, a) + \gamma \sum_{s'} p(s' \mid S_t, a)\, V_{t+1}(s') \right), \tag{2}
\]
\[
V_t(S_t) = \max_a \left( C(S_t, a) + \gamma\, \mathbb{E}\{ V_{t+1}(S_{t+1}) \mid S_t \} \right), \tag{3}
\]
where $V_t(S_t)$ is the value of being in state $S_t$ and following an optimal policy from t until the end of the planning horizon (which may be infinite). The control theory community replaces $V_t(S_t)$ with $J_t(S_t)$, which is referred to as the "cost-to-go" function. If we are solving a problem in steady state, $V_t(S_t)$ would be replaced with $V(S)$.
If we have a small number of discrete states and actions, we can find the value function $V_t(S_t)$ using the classical techniques of value iteration and policy iteration (see Puterman (1994)). Many practical problems, however, suffer from one or more of the three curses of dimensionality: 1) vector-valued (and possibly continuous) state variables, 2) random variables $W_t$ for which we may not be able to compute the expectation, and 3) decisions (typically denoted by $x_t$ or $u_t$) which may be discrete or continuous vectors, requiring some sort of solver (linear, nonlinear or integer programming) or specialized algorithmic strategy.
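When the state and action spaces are small, equation (2) can be solved exactly by looping backward over time and over every state. A minimal sketch on a toy finite-horizon problem (the problem data and all names are illustrative, not from the chapter):

```python
# Classical backward value iteration, applying equation (2) directly.
# The states, actions, rewards and transition probabilities are toy data.
states = [0, 1, 2]
actions = [0, 1]
T, gamma = 5, 0.9

def reward(s, a):
    return float(s) - 0.5 * a

def trans_prob(s_next, s, a):
    """p(s'|s, a): action 1 pushes the state up by one (capped), action 0 keeps it put."""
    target = min(s + 1, 2) if a == 1 else s
    return 1.0 if s_next == target else 0.0

V = {(T + 1, s): 0.0 for s in states}       # terminal values
for t in range(T, -1, -1):                  # step backward in time
    for s in states:                        # loop over every state
        V[(t, s)] = max(
            reward(s, a)
            + gamma * sum(trans_prob(sp, s, a) * V[(t + 1, sp)] for sp in states)
            for a in actions
        )

print(V[(0, 0)])    # value of starting in state 0 at time 0
```

The two loops over time and over states are exactly what becomes intractable when the state is a vector, which motivates the approximations that follow.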
The field of approximate dynamic programming has historically focused on the problem of multidimensional state variables, which prevent us from calculating $V_t(s)$ for each discrete state s. In our presentation, we make an effort to cover this literature, but we also show how we can overcome the other two curses using a device known as the post-decision state variable. However, a short chapter such as this is simply unable to present the vast range of algorithmic strategies in any detail. For this, we recommend for additional reading the following: Bertsekas and Tsitsiklis (1996), especially for students interested in obtaining thorough theoretical foundations; Sutton and Barto (1998) for a presentation from the perspective of the reinforcement learning community; Powell (2007) for a presentation that puts more emphasis on modeling, and more from the perspective of the operations research community; and chapter 6 of Bertsekas (2007), which can be downloaded from http://web.mit.edu/dimitrib/www/dpchapter.html.
2 A generic ADP algorithm
Equation (2) (or (3)) is typically solved by stepping backward in time, where it is necessary to loop
over all the potential states to compute $V_t(S_t)$ for each state $S_t$. For this reason, classical dynamic
programming is often referred to as backward dynamic programming. The requirement of looping
over all the states is the first computational step that cannot be performed when the state variable
is a vector, or even a scalar continuous variable.
Approximate dynamic programming takes a very different approach. In most ADP algorithms,
we step forward in time following a single sample path. Assume we are modeling a finite horizon
problem. We are going to iteratively simulate this problem. At iteration n, assume at time t that
we are in state $S_t^n$. Now assume that we have a policy of some sort that produces a decision $x_t^n$. In
approximate dynamic programming, we are typically solving a problem that can be written
\[
\hat{v}_t^n = \max_{x_t} \left( C(S_t^n, x_t) + \gamma\, \mathbb{E}\{ \bar{V}_{t+1}^{n-1}(S^M(S_t^n, x_t, W_{t+1})) \mid S_t \} \right). \tag{4}
\]
Here, $\bar{V}_{t+1}^{n-1}(S_{t+1})$ is an approximation of the value of being in state $S_{t+1} = S^M(S_t^n, x_t, W_{t+1})$ at time t + 1. If we are modeling a problem in steady state, we drop the subscript t everywhere, recognizing that W is a random variable. We note that $x_t$ here can be a vector, where the maximization problem might require the use of a linear, nonlinear or integer programming package. Let $x_t^n$ be the value of $x_t$ that solves this optimization problem. We can use $\hat{v}_t^n$ to update our approximation of the value function. For example, if we are using a lookup table representation, we might use
\[
\bar{V}_t^n(S_t^n) = (1 - \alpha_{n-1})\, \bar{V}_t^{n-1}(S_t^n) + \alpha_{n-1}\, \hat{v}_t^n, \tag{5}
\]
where $\alpha_{n-1}$ is a stepsize between 0 and 1 (more on this later).
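In code, the update in equation (5) is a single line of recursive smoothing. A short sketch using a dictionary as the lookup table; the harmonic stepsize rule $\alpha_{n-1} = 1/n$ is only an illustrative choice:

```python
from collections import defaultdict

V_bar = defaultdict(float)     # lookup-table approximation, one entry per (t, state)

def update_value(t, state, v_hat, n):
    """Apply equation (5): smooth the old estimate toward the new observation v_hat."""
    alpha = 1.0 / n            # illustrative harmonic stepsize alpha_{n-1}
    V_bar[(t, state)] = (1.0 - alpha) * V_bar[(t, state)] + alpha * v_hat
```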
Now we are going to use Monte Carlo methods to sample our vector of random variables which
we denote by $W_{t+1}^n$, representing the new information that would have first been learned between time t and t + 1. The next state would be given by
\[
S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}^n).
\]
The overall algorithm is given in figure 1. This is a very basic version of an ADP algorithm, one
which would generally not work in practice. But it illustrates some of the basic elements of an ADP
algorithm. First, it steps forward in time, using a randomly sampled set of outcomes, visiting one
state at a time. Second, it makes decisions using some sort of statistical approximation of a value
function, although it is unlikely that we would use a lookup table representation. Third, we use
information gathered as we step forward in time to update our value function approximation, almost
always using some sort of recursive statistics.
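Since figure 1 is not reproduced in this excerpt, the following is a minimal sketch of the forward-pass loop it describes, combining the decision problem (4), the update (5) and the transition function. The toy model, the small discrete action set, the single-sample approximation of the expectation, and the harmonic stepsize are all illustrative assumptions:

```python
import random

def transition(state, decision, info):
    """Illustrative system model S^M (same toy model as the earlier sketch)."""
    return max(0.0, state + decision + info)

def contribution(state, decision):
    """Illustrative contribution function C(S_t, x_t)."""
    return min(state, 5.0) - 0.1 * abs(decision)

def adp_forward_pass(n_iterations=100, T=20, gamma=0.95, seed=0,
                     actions=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Generic single-pass ADP: step forward in time, updating V_bar as we go."""
    rng = random.Random(seed)
    V_bar = {}                                   # lookup-table approximation of V_t(S_t)

    def v(t, s):
        return V_bar.get((t, round(s, 1)), 0.0)  # coarse discretization of the state

    for n in range(1, n_iterations + 1):
        state = 0.0                              # initial state S_0
        for t in range(T):
            # Step 1: solve the approximate problem in equation (4); here the
            # expectation over W_{t+1} is replaced by a single Monte Carlo sample.
            w = rng.gauss(0.0, 1.0)
            def value_of(x):
                return contribution(state, x) + gamma * v(t + 1, transition(state, x, w))
            x_n = max(actions, key=value_of)
            v_hat = value_of(x_n)

            # Step 2: update the value function approximation via equation (5).
            alpha = 1.0 / n
            key = (t, round(state, 1))
            V_bar[key] = (1.0 - alpha) * V_bar.get(key, 0.0) + alpha * v_hat

            # Step 3: sample W_{t+1}^n and step forward using the transition function.
            state = transition(state, x_n, rng.gauss(0.0, 1.0))
    return V_bar

V_bar = adp_forward_pass()
```

In practice, the lookup table would usually be replaced with a statistical approximation of the value function, as the text notes above.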
The algorithm in figure 1 goes under different names. It is a basic “forward pass” algorithm where
we step forward in time, updating value functions as we progress. Another variation involves simulating forward through the horizon without updating the value function. Then, after the simulation
is done, we step backward through time, using information about the entire future trajectory.
We have written the algorithm in the context of a finite horizon problem. For infinite horizon
problems, simply drop the subscript t everywhere, remembering that you have to keep track of the
“current” and “future” state.