dynamics (Al-Tamimi et al. 2008; Wang et al. 2012; Zhao and Zhu 2015). In the physical world, however, many problems are continuous-time (CT), which makes it difficult to apply DT ADP algorithms to such problems directly.
From a mathematical viewpoint, to find the optimal control for CT systems one can
solve the Hamilton–Jacobi–Bellman (HJB) equation (Bardi and Capuzzo-Dolcetta 2008;
Beard et al. 1997), which is a first-order, nonlinear partial differential equation (PDE). In
general, a closed-form solution is intractable, so approximation techniques have to be used to approximate the solution over a compact set. Neural networks (NNs) are among the most widely used approximators. In most cases, one network is constructed to evaluate the control performance, termed the critic, and another network approximates the policy, termed the actor.
When the system dynamics is known, the HJB equation can be reduced to a sequence of linear PDEs by policy iteration (PI) (Beard et al. 1998; Abu-Khalaf and Lewis 2005), whose approximation coefficients are computed offline. However, this method requires sampling the system dynamics, so the algorithm lacks interaction with the system. This problem can be overcome by online learning. Another advantage of online learning is that it helps the algorithm avoid training on states that are rarely visited, saving computational resources.
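For concreteness, consider the input-affine form $\dot{x}=f(x)+g(x)u$ with cost $V^{u}(x(t))=\int_{t}^{\infty}r(x(\tau),u(\tau))\,\mathrm{d}\tau$; this setting and notation are used here only as an illustrative sketch common to the cited works, not as the formulation of any single reference. The optimal value function $V^{*}$ satisfies the HJB equation
$$0=\min_{u}\Big[r(x,u)+\nabla V^{*}(x)^{\top}\big(f(x)+g(x)u\big)\Big],$$
and PI replaces it by iterating between the linear policy-evaluation PDE
$$0=r\big(x,u^{(i)}(x)\big)+\nabla V^{(i)}(x)^{\top}\big(f(x)+g(x)u^{(i)}(x)\big)$$
and the policy-improvement step $u^{(i+1)}(x)=\arg\min_{u}\big[r(x,u)+\nabla V^{(i)}(x)^{\top}\big(f(x)+g(x)u\big)\big]$.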
Murray et al. (2002) execute a given stabilizing policy on the system and evaluate its performance from observations; the policy is then updated, and iterating between these two phases yields the optimal policy. In their implementation, the state derivatives must be known. Subsequently, Vrabie and Lewis (2009) introduce integral reinforcement learning (IRL) into the PI method. Their algorithm is implemented using only partial system dynamics and online trajectories, but the input gain matrix is still needed. Motivated by that, a completely model-free method is developed by Jiang and Jiang (2014): probing noise is injected into the system so that the trajectories carry richer dynamics information, and the algorithm learns the optimal solution without any knowledge of the dynamics.
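The key identity behind IRL, sketched here in the same illustrative notation as above, is that for any reinforcement interval $T>0$ the value function satisfies the integral Bellman equation
$$V^{u}\big(x(t)\big)=\int_{t}^{t+T}r\big(x(\tau),u(\tau)\big)\,\mathrm{d}\tau+V^{u}\big(x(t+T)\big),$$
so policy evaluation can be carried out from measured trajectories without the internal dynamics $f(x)$; the dynamics then enters only through the input gain $g(x)$ in the policy-improvement step, which is why the input gain matrix is still required there.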
One common feature of the above-mentioned algorithms is that the policy evaluation phase and the policy improvement phase are conducted separately: when the critic is updated, the actor is held constant, and vice versa. To simplify the process, Vamvoudakis and Lewis (2010) propose a synchronous policy iteration (SPI) algorithm, in which the critic and the actor are updated simultaneously. They further prove that the system states and the critic/actor NN errors are uniformly ultimately bounded (UUB), which establishes the convergence of the learning. However, the full system dynamics is needed, while in many practical applications the precise dynamics is unknown. One solution is to construct identifier NNs to model the dynamics, as in Bhasin et al. (2013) and Modares et al. (2013); the critic and the actor are then updated on the basis of the identified dynamics. Since online trajectories already contain complete dynamics information, a more efficient approach is to design a direct online ADP algorithm that learns the optimal solution from online data. Vamvoudakis et al. (2011, 2014) combine their SPI algorithm with the IRL technique; their update laws for the critic and the actor use online trajectories, so the internal dynamics is no longer needed. Modares et al. (2014) further introduce the experience replay (ER) technique to accelerate the convergence rate, in which past observations are repeatedly reused to train the critic and the actor; the actuator saturation problem is also explicitly considered. However, the input gain matrix is assumed to be known in both algorithms. Inspired by the work of Jiang and Jiang (2014), we develop a completely model-free SPI algorithm for optimal tracking problems (Zhu et al. 2016b), in which the convergence rate is further improved by the ER technique.
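As a schematic illustration of the ER idea (the exact normalization and the accompanying actor law differ across the cited algorithms), let the critic be $\hat{V}(x)=\hat{W}_{c}^{\top}\phi(x)$ for some feature vector $\phi$. The integral Bellman residual on a stored interval $[t_{k}-T,t_{k}]$ is
$$e_{k}=\hat{W}_{c}^{\top}\Delta\phi_{k}+\int_{t_{k}-T}^{t_{k}}r\big(x(\tau),u(\tau)\big)\,\mathrm{d}\tau,\qquad \Delta\phi_{k}=\phi\big(x(t_{k})\big)-\phi\big(x(t_{k}-T)\big),$$
and an ER-based tuning law descends the squared residuals of the current interval together with those of the stored samples, e.g.
$$\dot{\hat{W}}_{c}=-\alpha\sum_{k}\frac{\Delta\phi_{k}}{\big(1+\Delta\phi_{k}^{\top}\Delta\phi_{k}\big)^{2}}\,e_{k},$$
so that past observations are reused at every instant and the convergence of the critic is accelerated.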
Even though online ADP algorithms have been extensively developed, systematic comparisons of these algorithms from the perspectives of methodology and experiments are rare. This paper aims to summarize state-of-the-art online ADP algorithms for the optimal control of CT systems. Their performance is compared in solving the same problem. Their dynamics