Approximate Dynamic Programming - II:
Algorithms
Warren B. Powell
December 8, 2009
Abstract
Approximate dynamic programming is a powerful class of algorithmic strategies for solving stochastic optimization problems where optimal decisions can be characterized using Bellman's optimality equation, but where the characteristics of the problem make solving Bellman's equation computationally intractable. This brief chapter provides an introduction to the basic concepts of approximate dynamic programming while building bridges between the different communities that have contributed to this field. We cover basic approximate value iteration (temporal difference learning), policy approximation, and a brief introduction to strategies for approximating value functions. We cover Q-learning, and the use of the post-decision state for solving problems with vector-valued decisions. The approximate linear programming method is introduced, along with a discussion of stepsize selection issues. The presentation closes with a discussion of some practical issues that arise in the implementation of ADP techniques.
1 Introduction
Approximate dynamic programming represents a powerful modeling and algorithmic strategy that
can address a wide range of optimization problems that involve making decisions sequentially in
the presence of different types of uncertainty. A short list of applications, illustrating different problem classes, includes the following:
• Option pricing - An American option allows us to buy or sell an asset at any time up to a
specified time, where we make money when the price goes under or over (respectively) a set
strike price. Valuing the option requires finding an optimal policy for determining when to
exercise the option.
• Playing games - Computer algorithms have been designed to play backgammon, bridge, chess
and, recently, the Chinese game of Go.
• Controlling a device - This might be a robot or unmanned aerial vehicle, but there is a need
for autonomous devices to manage themselves for tasks ranging from vacuuming the floor to
collecting information about terrorists.
• Storage of continuous resources - Managing the cash balance for a mutual fund or the amount
of water in a reservoir used for a hydroelectric dam requires managing a continuous resource
over time in the presence of stochastic information on parameters such as prices and rainfall.
• Asset acquisition - We often have to acquire physical and financial assets over time under
different sources of uncertainty about availability, demands and prices.
• Resource allocation - Whether we are managing blood inventories, financial portfolios or fleets
of vehicles, we often have to move, transform, clean and repair resources to meet various needs
under uncertainty about demands and prices.
• R & D portfolio optimization - The Department of Energy has to determine how to allocate
government funds to advance the science of energy generation, transmission and storage. These
decisions have to be made over time in the presence of uncertain changes in technology and
commodity prices.
These problems range from relatively low dimensional applications to very high-dimensional industrial problems, but all share the property of making decisions over time under different types of
uncertainty. In Powell (2010), a modeling framework is described which breaks a problem into five
components:
• State - $S_t$ (or $x_t$ in the control theory community), capturing all the information we need at time t to make a decision and model the evolution of the system in the future.
• Action/decision/control - Depending on the community, these will be modeled as a, x or u. Decisions are made using a decision function, or policy. If we are using action a, we represent the decision function using $A^\pi(S_t)$ where $\pi \in \Pi$ is a family of policies (or functions). If our action is x, we use $X^\pi(S_t)$. We use a if our problem has a small number of discrete actions. We use x when the decision might be a vector (of discrete or continuous elements).
• Exogenous information - Lacking standard notation, we let $W_t$ be the family of random variables that represent new information that first becomes known by time t.
• Transition function - Also known as the system model (or just model), this function is denoted $S^M(\cdot)$, and is used to express the evolution of the state variable, as in $S_{t+1} = S^M(S_t, x_t, W_{t+1})$.
• Objective function - Let $C(S_t, x_t)$ be the contribution (if we are maximizing) received when in state $S_t$ if we take action $x_t$. Our objective is to find a decision function (policy) that solves
\[
\max_{\pi \in \Pi} \mathbb{E}\left\{ \sum_{t=0}^{T} \gamma^t C(S_t, X^\pi(S_t)) \right\}. \tag{1}
\]
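These five components map directly into code. Below is a minimal Python sketch, on a purely illustrative toy storage problem, of a transition function, a contribution function, a fixed policy, and a Monte Carlo estimate of the objective in equation (1); none of the names or numbers come from the chapter.

```python
import random

# Illustrative toy model of the five components: state S_t, decision x_t,
# exogenous information W_{t+1}, transition function S^M, contribution C.

def transition(state, decision, info):
    """System model S^M: next state from the current state, decision, and new information."""
    return max(0.0, state + decision + info)

def contribution(state, decision):
    """Contribution C(S_t, x_t) received when taking decision x_t in state S_t."""
    return min(state, 5.0) - 0.1 * abs(decision)

def policy(state):
    """A fixed decision function X^pi(S_t); here, a simple rule of thumb."""
    return 2.0 if state < 3.0 else -1.0

def estimate_objective(T=20, gamma=0.95, n_samples=1000, seed=0):
    """Monte Carlo estimate of the objective in equation (1) for this policy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        state, value = 0.0, 0.0
        for t in range(T + 1):
            decision = policy(state)
            value += (gamma ** t) * contribution(state, decision)
            info = rng.gauss(0.0, 1.0)      # sample of the exogenous information W_{t+1}
            state = transition(state, decision, info)
        total += value
    return total / n_samples

print(estimate_objective())
```

Finding a good policy means searching over such decision functions, which is exactly what the objective in (1) asks for.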
We encourage readers to review Powell (2010) (in this volume) or chapter 5 in Powell (2007), available
at http://www.castlelab.princeton.edu/adp.htm, before attempting to design an algorithmic strategy.
For the remainder of this chapter, we assume that the reader is familiar with the modeling behind
the objective function in equation (1), and in particular the range of policies that can be used to
provide solutions. In this chapter, we assume that we would like to find a good policy by using
Bellman’s equation as a starting point, which we may write in either of two ways:
\[
V_t(S_t) = \max_a \left( C(S_t, a) + \gamma \sum_{s'} p(s' \mid S_t, a)\, V_{t+1}(s') \right), \tag{2}
\]
\[
V_t(S_t) = \max_a \left( C(S_t, a) + \gamma\, \mathbb{E}\{ V_{t+1}(S_{t+1}) \mid S_t \} \right), \tag{3}
\]
where $V_t(S_t)$ is the value of being in state $S_t$ and following an optimal policy from t until the end of the planning horizon (which may be infinite). The control theory community replaces $V_t(S_t)$ with $J_t(S_t)$, which is referred to as the "cost-to-go" function. If we are solving a problem in steady state, $V_t(S_t)$ would be replaced with $V(S)$.
If we have a small number of discrete states and actions, we can find the value function $V_t(S_t)$ using the classical techniques of value iteration and policy iteration (see Puterman (1994)). Many practical problems, however, suffer from one or more of the three curses of dimensionality: 1) vector-valued (and possibly continuous) state variables, 2) random variables $W_t$ for which we may not be able to compute the expectation, and 3) decisions (typically denoted by $x_t$ or $u_t$) which may be discrete or continuous vectors, requiring some sort of solver (linear, nonlinear or integer programming) or specialized algorithmic strategy.
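When the state and action spaces are small, equation (2) can be solved exactly by looping backward over time and over every state. A minimal sketch on a toy finite-horizon problem (the problem data and all names are illustrative, not from the chapter):

```python
# Classical backward value iteration, applying equation (2) directly.
# The states, actions, rewards and transition probabilities are toy data.
states = [0, 1, 2]
actions = [0, 1]
T, gamma = 5, 0.9

def reward(s, a):
    return float(s) - 0.5 * a

def trans_prob(s_next, s, a):
    """p(s'|s, a): action 1 pushes the state up by one (capped), action 0 keeps it put."""
    target = min(s + 1, 2) if a == 1 else s
    return 1.0 if s_next == target else 0.0

V = {(T + 1, s): 0.0 for s in states}       # terminal values
for t in range(T, -1, -1):                  # step backward in time
    for s in states:                        # loop over every state
        V[(t, s)] = max(
            reward(s, a)
            + gamma * sum(trans_prob(sp, s, a) * V[(t + 1, sp)] for sp in states)
            for a in actions
        )

print(V[(0, 0)])    # value of starting in state 0 at time 0
```

The two loops over time and over states are exactly what becomes intractable when the state is a vector, which motivates the approximations that follow.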
The field of approximate dynamic programming has historically focused on the problem of multidimensional state variables, which prevent us from calculating $V_t(s)$ for each discrete state s. In our presentation, we make an effort to cover this literature, but we also show how we can overcome the other two curses using a device known as the post-decision state variable. However, a short chapter such as this is simply unable to present the vast range of algorithmic strategies in any detail. For this, we recommend for additional reading the following: Bertsekas and Tsitsiklis (1996), especially for students interested in obtaining thorough theoretical foundations; Sutton and Barto (1998) for a presentation from the perspective of the reinforcement learning community; Powell (2007) for a presentation that puts more emphasis on modeling, and more from the perspective of the operations research community; and chapter 6 of Bertsekas (2007), which can be downloaded from http://web.mit.edu/dimitrib/www/dpchapter.html.
2 A generic ADP algorithm
Equation (2) (or (3)) is typically solved by stepping backward in time, where it is necessary to loop
over all the potential states to compute $V_t(S_t)$ for each state $S_t$. For this reason, classical dynamic
programming is often referred to as backward dynamic programming. The requirement of looping
over all the states is the first computational step that cannot be performed when the state variable
is a vector, or even a scalar continuous variable.
Approximate dynamic programming takes a very different approach. In most ADP algorithms,
we step forward in time following a single sample path. Assume we are modeling a finite horizon
problem. We are going to iteratively simulate this problem. At iteration n, assume at time t that
we are in state $S_t^n$. Now assume that we have a policy of some sort that produces a decision $x_t^n$. In
approximate dynamic programming, we are typically solving a problem that can be written
\[
\hat{v}_t^n = \max_{x_t} \left( C(S_t^n, x_t) + \gamma\, \mathbb{E}\{ \bar{V}_{t+1}^{n-1}(S^M(S_t^n, x_t, W_{t+1})) \mid S_t \} \right). \tag{4}
\]
Here, $\bar{V}_{t+1}^{n-1}(S_{t+1})$ is an approximation of the value of being in state $S_{t+1} = S^M(S_t^n, x_t, W_{t+1})$ at time t + 1. If we are modeling a problem in steady state, we drop the subscript t everywhere, recognizing that W is a random variable. We note that $x_t$ here can be a vector, where the maximization problem might require the use of a linear, nonlinear or integer programming package. Let $x_t^n$ be the value of $x_t$ that solves this optimization problem. We can use $\hat{v}_t^n$ to update our approximation of the value function. For example, if we are using a lookup table representation, we might use
\[
\bar{V}_t^n(S_t^n) = (1 - \alpha_{n-1})\, \bar{V}_t^{n-1}(S_t^n) + \alpha_{n-1}\, \hat{v}_t^n, \tag{5}
\]
where $\alpha_{n-1}$ is a stepsize between 0 and 1 (more on this later).
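In code, the update in equation (5) is a single line of recursive smoothing. A short sketch using a dictionary as the lookup table; the harmonic stepsize rule $\alpha_{n-1} = 1/n$ is only an illustrative choice:

```python
from collections import defaultdict

V_bar = defaultdict(float)     # lookup-table approximation, one entry per (t, state)

def update_value(t, state, v_hat, n):
    """Apply equation (5): smooth the old estimate toward the new observation v_hat."""
    alpha = 1.0 / n            # illustrative harmonic stepsize alpha_{n-1}
    V_bar[(t, state)] = (1.0 - alpha) * V_bar[(t, state)] + alpha * v_hat
```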
Now we are going to use Monte Carlo methods to sample our vector of random variables which
we denote by $W_{t+1}^n$, representing the new information that would have first been learned between time t and t + 1. The next state would be given by
\[
S_{t+1}^n = S^M(S_t^n, x_t^n, W_{t+1}^n).
\]
The overall algorithm is given in figure 1. This is a very basic version of an ADP algorithm, one
which would generally not work in practice. But it illustrates some of the basic elements of an ADP
algorithm. First, it steps forward in time, using a randomly sampled set of outcomes, visiting one
state at a time. Second, it makes decisions using some sort of statistical approximation of a value
function, although it is unlikely that we would use a lookup table representation. Third, we use
information gathered as we step forward in time to update our value function approximation, almost
always using some sort of recursive statistics.
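Since figure 1 is not reproduced in this excerpt, the following is a minimal sketch of the forward-pass loop it describes, combining the decision problem (4), the update (5) and the transition function. The toy model, the small discrete action set, the single-sample approximation of the expectation, and the harmonic stepsize are all illustrative assumptions:

```python
import random

def transition(state, decision, info):
    """Illustrative system model S^M (same toy model as the earlier sketch)."""
    return max(0.0, state + decision + info)

def contribution(state, decision):
    """Illustrative contribution function C(S_t, x_t)."""
    return min(state, 5.0) - 0.1 * abs(decision)

def adp_forward_pass(n_iterations=100, T=20, gamma=0.95, seed=0,
                     actions=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Generic single-pass ADP: step forward in time, updating V_bar as we go."""
    rng = random.Random(seed)
    V_bar = {}                                   # lookup-table approximation of V_t(S_t)

    def v(t, s):
        return V_bar.get((t, round(s, 1)), 0.0)  # coarse discretization of the state

    for n in range(1, n_iterations + 1):
        state = 0.0                              # initial state S_0
        for t in range(T):
            # Step 1: solve the approximate problem in equation (4); here the
            # expectation over W_{t+1} is replaced by a single Monte Carlo sample.
            w = rng.gauss(0.0, 1.0)
            def value_of(x):
                return contribution(state, x) + gamma * v(t + 1, transition(state, x, w))
            x_n = max(actions, key=value_of)
            v_hat = value_of(x_n)

            # Step 2: update the value function approximation via equation (5).
            alpha = 1.0 / n
            key = (t, round(state, 1))
            V_bar[key] = (1.0 - alpha) * V_bar.get(key, 0.0) + alpha * v_hat

            # Step 3: sample W_{t+1}^n and step forward using the transition function.
            state = transition(state, x_n, rng.gauss(0.0, 1.0))
    return V_bar

V_bar = adp_forward_pass()
```

In practice, the lookup table would usually be replaced with a statistical approximation of the value function, as the text notes above.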
The algorithm in figure 1 goes under different names. It is a basic “forward pass” algorithm where
we step forward in time, updating value functions as we progress. Another variation involves simulating forward through the horizon without updating the value function. Then, after the simulation
is done, we step backward through time, using information about the entire future trajectory.
We have written the algorithm in the context of a finite horizon problem. For infinite horizon
problems, simply drop the subscript t everywhere, remembering that you have to keep track of the
“current” and “future” state.