dynamics (Al-Tamimi et al. 2008; Wang et al. 2012; Zhao and Zhu 2015). In the physical world, however, many problems are continuous-time (CT), which makes it difficult to apply DT ADP algorithms to such problems directly.
From a mathematical viewpoint, to find the optimal control for CT systems one can
solve the Hamilton–Jacobi–Bellman (HJB) equation (Bardi and Capuzzo-Dolcetta 2008;
Beard et al. 1997), which is a first-order, nonlinear partial differential equation (PDE). In
general, a closed-form solution is intractable, so approximation techniques have to be used to approximate the solution over a compact set. Neural networks (NNs) are among the most widely used approximators. In most cases, one network is constructed to evaluate the control performance, termed the critic, and another network approximates the policy, termed the actor.
When the system dynamics is known, the HJB equation can be reduced to a sequence of linear PDEs by policy iteration (PI) (Beard et al. 1998; Abu-Khalaf and Lewis 2005), whose approximation coefficients are computed offline. However, this method requires sampling the system dynamics, so the algorithm lacks interaction with the system. This problem can be overcome by online learning. Another advantage of online learning is that it helps the algorithm avoid training on states that are rarely visited, saving computational resources.
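For concreteness, consider the input-affine form $\dot{x}=f(x)+g(x)u$ with cost $V^{u}(x(t))=\int_{t}^{\infty}r(x(\tau),u(\tau))\,\mathrm{d}\tau$; this setting and notation are used here only as an illustrative sketch common to the cited works, not as the formulation of any single reference. The optimal value function $V^{*}$ satisfies the HJB equation
$$0=\min_{u}\Big[r(x,u)+\nabla V^{*}(x)^{\top}\big(f(x)+g(x)u\big)\Big],$$
and PI replaces it by iterating between the linear policy-evaluation PDE
$$0=r\big(x,u^{(i)}(x)\big)+\nabla V^{(i)}(x)^{\top}\big(f(x)+g(x)u^{(i)}(x)\big)$$
and the policy-improvement step $u^{(i+1)}(x)=\arg\min_{u}\big[r(x,u)+\nabla V^{(i)}(x)^{\top}\big(f(x)+g(x)u\big)\big]$.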
Murray et al. (2002) execute a given stabilizing policy on the system and evaluate its performance from observations; the policy is then updated, and iterating between these two phases yields the optimal policy. In their implementation, the state derivatives must be known. Subsequently, Vrabie and Lewis (2009) introduce integral reinforcement learning (IRL) into the PI method. Their algorithm is implemented using only partial system dynamics and online trajectories, but the input gain matrix is still needed. Motivated by that, a completely model-free method is developed by Jiang and Jiang (2014): probing noise is injected into the system so that the trajectories carry richer dynamics information, and the algorithm learns the optimal solution without any knowledge of the dynamics.
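The key identity behind IRL, sketched here in the same illustrative notation as above, is that for any reinforcement interval $T>0$ the value function satisfies the integral Bellman equation
$$V^{u}\big(x(t)\big)=\int_{t}^{t+T}r\big(x(\tau),u(\tau)\big)\,\mathrm{d}\tau+V^{u}\big(x(t+T)\big),$$
so policy evaluation can be carried out from measured trajectories without the internal dynamics $f(x)$; the dynamics then enters only through the input gain $g(x)$ in the policy-improvement step, which is why the input gain matrix is still required there.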
One common feature of the above-mentioned algorithms is that the policy evaluation phase and the policy improvement phase are conducted separately: when the critic is updated, the actor is held constant, and vice versa. To simplify the process, Vamvoudakis and Lewis (2010) propose a synchronous policy iteration (SPI) algorithm, in which the critic and the actor are updated simultaneously. They further prove that the system states and the critic/actor NN errors are uniformly ultimately bounded (UUB), which establishes the convergence of the learning. However, the full system dynamics is needed, while in many practical applications the precise dynamics is unknown. One solution is to construct identifier NNs to model the dynamics, as in Bhasin et al. (2013) and Modares et al. (2013); the critic and the actor are then updated on the basis of the identified dynamics. Since online trajectories already contain complete dynamics information, a more efficient approach is to design a direct online ADP algorithm that learns the optimal solution from online data. Vamvoudakis et al. (2011, 2014) combine their SPI algorithm with the IRL technique; their update laws for the critic and the actor use online trajectories, so the internal dynamics is no longer needed. Modares et al. (2014) further introduce the experience replay (ER) technique to accelerate the convergence rate, in which past observations are repeatedly reused to train the critic and the actor; the actuator saturation problem is also explicitly considered. However, the input gain matrix is assumed to be known in both algorithms. Inspired by the work of Jiang and Jiang (2014), we develop a completely model-free SPI algorithm for optimal tracking problems (Zhu et al. 2016b), in which the convergence rate is further improved by the ER technique.
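As a schematic illustration of the ER idea (the exact normalization and the accompanying actor law differ across the cited algorithms), let the critic be $\hat{V}(x)=\hat{W}_{c}^{\top}\phi(x)$ for some feature vector $\phi$. The integral Bellman residual on a stored interval $[t_{k}-T,t_{k}]$ is
$$e_{k}=\hat{W}_{c}^{\top}\Delta\phi_{k}+\int_{t_{k}-T}^{t_{k}}r\big(x(\tau),u(\tau)\big)\,\mathrm{d}\tau,\qquad \Delta\phi_{k}=\phi\big(x(t_{k})\big)-\phi\big(x(t_{k}-T)\big),$$
and an ER-based tuning law descends the squared residuals of the current interval together with those of the stored samples, e.g.
$$\dot{\hat{W}}_{c}=-\alpha\sum_{k}\frac{\Delta\phi_{k}}{\big(1+\Delta\phi_{k}^{\top}\Delta\phi_{k}\big)^{2}}\,e_{k},$$
so that past observations are reused at every instant and the convergence of the critic is accelerated.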
Even though online ADP algorithms have been extensively developed, systematic comparisons of these algorithms from the perspectives of methodology and experiments are rare. This paper aims to summarize state-of-the-art online ADP algorithms for the optimal control of CT systems. Their performance is compared in solving the same problem. Their dynamics