Abstract
The stability analysis of dynamical systems, which are ubiquitous in nature, has long been an active research topic, and numerous approaches have been proposed. However, control scientists often demand optimality in addition to stability of the control system. In the 1950s and 1960s, motivated by the development of space technology and the practical use of digital computers, the theory of optimization of dynamical systems developed rapidly, forming an important branch of the discipline: optimal control.
The stability analysis of dynamical systems, which are ubiquitous in nature, has long been an active research topic, and numerous approaches have been proposed. However, control scientists often demand optimality in addition to stability of the control system. In the 1950s and 1960s, motivated by the development of space technology and the practical use of digital computers, the theory of optimization of dynamical systems developed rapidly, forming an important branch of the discipline: optimal control. It is increasingly used in many fields, such as space technology, systems engineering, economic management and decision-making, population control, and the optimization of multi-stage process equipment. In 1957, Bellman proposed an effective tool for solving optimal control problems: the dynamic programming (DP) method [1]. At the heart of this approach is Bellman’s principle of optimality, which states that an optimal policy for a multistage decision process has the property that, regardless of the initial state and initial decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the initial decision. This principle reduces to a basic recursive formula for solving multistage decision problems by starting at the end and working backward to the beginning. It applies to a wide range of discrete, continuous, linear, nonlinear, deterministic, and stochastic systems.
Adaptive dynamic programming (ADP) is a newer approach to approximately optimal control, and it is a current research topic in the international optimization community. The ADP method uses a function approximation structure to approximate the solution of the Hamilton-Jacobi-Bellman (HJB) equation, and uses offline iteration or online updating to obtain an approximately optimal control strategy for the system, which can effectively solve the optimal control problem of nonlinear systems [2,3,4,5,6,7,8,9,10,11]. Bertsekas et al. summarized neuro-dynamic programming in [12, 13], describing in detail dynamic programming, the structure of neural networks, and training algorithms; meanwhile, several effective methods for applying neuro-dynamic programming have been proposed. Si et al. summarized the development of ADP methods across disciplines and discussed the connection of DP and ADP methods with artificial intelligence, approximation theory, control theory, operations research, and statistics [14]. In [15], Powell showed how to use ADP methods to solve deterministic or stochastic optimization problems, and pointed out future directions for ADP methods. In [16], Balakrishnan et al. surveyed previous approaches to the design of feedback controllers for dynamical systems using the ADP method, in both model-based and model-free settings. In [17], ADP methods were categorized according to whether an initially stabilizing control is required.
The ADP method has a unique algorithm and structure compared to other existing optimal control methods. It overcomes the drawback that classical variational theory cannot handle optimal control problems with closed-set constraints on the control variables. Like the maximum principle, the ADP method is suitable not only for optimal control problems with open-set constraints but also for those with closed-set constraints. While the maximum principle can only provide necessary conditions for optimality, the DP method gives sufficient conditions. However, direct application of the DP method is hindered by the “curse of dimensionality” encountered in solving the HJB equation. Hence the ADP method, as an approximate solution to the DP method, overcomes this limitation and is better suited to systems with strong coupling, strong nonlinearity, and high complexity. For example, the work in [18] presented a constrained adaptive dynamic programming (CADP) algorithm that can solve general nonlinear, non-affine optimal control problems with known dynamics. Unlike previous ADP algorithms, it handles problems with state constraints directly by proposing a constrained generalized policy iteration framework that transforms the traditional policy improvement process into a constrained policy optimization problem with state constraints. To solve the problem of robust tracking control, the work in [19] designed an online adaptive learning structure to build a robust tracking controller for nonlinear uncertain systems. The work in [20] proposed a bias-policy iteration method for solving data-driven optimal control problems for unknown continuous-time linear systems; by adding a bias parameter, it further relaxes the conditions on the initial admissible controller.
The work in [21] made the first attempt at ADP control for a nonlinear Itô-type stochastic system, transforming a complex optimal tracking control problem into a stabilizing optimal control problem by reconstructing a new stochastic augmented system. The use of a critic neural network in iterative learning simplifies the actor-critic structure and reduces the computational load. The ADP approach is also widely used in a number of common practical systems. The work in [22] developed an event-triggered adaptive dynamic programming method to design formation controllers, and solved the problem of distributed formation control for multi-rotor unmanned aerial vehicles. For hybrid wind/solar energy systems, the work in [23] presented an adaptive dynamic programming method based on Bellman’s principle that enables accurate current sharing and voltage regulation; based on this approach, the optimal control variables for each energy unit’s objective can be obtained.
Optimal control of nonlinear systems has been one of the hot spots and difficulties in the field of control research. As a novel technology for solving the optimal control problem, the ADP method integrates the theories of neural networks, adaptive critic designs, reinforcement learning, and classical dynamic programming to overcome the “curse of dimensionality”, and it enables the acquisition of an approximately optimal closed-loop feedback control law. As a consequence, delving deeper into the theory of ADP and its algorithms for solving the optimal control of nonlinear systems holds great theoretical significance and practical application value. Although research on the ADP method is still at an early stage, this book aims to equip readers with a foundational understanding of the method and empower them to apply it to diverse optimization problems in fields such as medicine, science, and engineering.
1.1 Optimal Control Formulation
There are several schemes of dynamic programming [1, 13, 24]. One can consider discrete-time systems or continuous-time systems, linear systems or nonlinear systems, time-invariant systems or time-varying systems, deterministic systems or stochastic systems, etc. Discrete-time (deterministic) nonlinear (time-invariant) dynamical systems will be discussed first. Time-invariant nonlinear systems cover most of the application areas and discrete time is the basic consideration for digital implementation.
1.1.1 ADP for Discrete-Time Systems
Consider the following discrete-time nonlinear system:
\[ x_{k+1}=F(x_k,u_k),\quad k=0,1,2,\ldots , \qquad (1.1) \]
where \(x_k \in \mathbb {R}^n\) is the state vector and \(u_k \in \mathbb {R}^m\) is the control input vector. The corresponding cost function (performance index function) of the system takes the form of
\[ J(x_k,\overline{u}_k)=\sum _{i=k}^{\infty }\gamma ^{\,i-k}U(x_i,u_i), \qquad (1.2) \]
where \(\overline{u}_k=(u_k,u_{k+1},\ldots )\) is the control sequence starting at time k, \( U(x_i,u_i) \) is the utility function, and \( \gamma \) is the discount factor, satisfying \( 0<\gamma <1 \). Note that the function J depends on the initial time k and the initial state \( x_k \). Generally, it is desired to determine \(\overline{u}_0=(u_0,u_{1},\ldots )\) so that \( J(x_0, \overline{u}_0 )\) is optimized (i.e., maximized or minimized). We will use \(\overline{u}_0^*=(u_0^*,u_1^*,\ldots )\) and \( J^*(x_0 )\) to denote the optimal control sequence and the optimal cost function, respectively. The objective of the dynamic programming problem in this book is to determine a control sequence \(u_k , k = 0, 1, \ldots ,\) so that the function J (i.e., the cost) in (1.2) is minimized. The optimal cost function is defined as
\[ J^*(x_0)=\min _{\overline{u}_0}J(x_0,\overline{u}_0), \]
which is dependent upon the initial state \( x_0 \).
The control action may be determined as a function of the state. In this case, we write \( u_k = u(x_k), \forall k\). Such a relationship, or mapping \(u: \mathbb {R}^n \rightarrow \mathbb {R}^m\), is called a feedback control, control policy (or simply policy), or control law. For a given control policy \(\mu \), the cost function in (1.2) is rewritten as
\[ J_{\mu }(x_k)=\sum _{i=k}^{\infty }\gamma ^{\,i-k}U\big (x_i,\mu (x_i)\big ), \]
which is the cost function for system (1.1) starting at \( x_k \) when the policy \( u_k = \mu (x_k) \) is applied. The optimal cost for system (1.1) starting at \( x_0 \) is determined as
\[ J^{*}(x_0)=\min _{\mu }J_{\mu }(x_0)=J_{\mu ^*}(x_0), \]
where \( \mu ^*\) denotes the optimal policy.
Dynamic programming is based on Bellman’s principle of optimality [1, 13, 24]: An optimal (control) policy has the property that no matter what previous decisions have been, the remaining decisions must constitute an optimal policy with regard to the state resulting from those previous decisions.
According to Bellman, the minimum cost from any state starting at time k consists of two parts: the utility incurred at time k, and the minimum cumulative cost incurred from time \( k + 1 \) onward. In terms of equations, this means that
\[ J^*(x_k)=\min _{u_k}\left\{ U(x_k,u_k)+\gamma J^*(x_{k+1})\right\} . \]
This is known as the Bellman optimality equation, or the discrete-time Hamilton-Jacobi-Bellman (HJB) equation. One then has the optimal policy, i.e., the optimal control \(u_k^*\) at time k is the \( u_k \) that achieves this minimum:
\[ u_k^*=\arg \min _{u_k}\left\{ U(x_k,u_k)+\gamma J^*(x_{k+1})\right\} . \qquad (1.6) \]
Since one must know the optimal policy at time \( k+1 \) to use (1.6) to determine the optimal policy at time k, Bellman’s principle yields a backwards-in-time procedure for solving the optimal control problem. It is the basis for dynamic programming algorithms in extensive use in control system theory, operations research, and elsewhere.
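As a minimal numerical sketch of this backwards-in-time procedure, consider a discretized scalar example; the dynamics \(x_{k+1}=x_k+u_k\) and the quadratic utility used below are illustrative assumptions, not taken from the text.

```python
import numpy as np

# Finite-horizon backward dynamic programming on a discretized scalar
# system, illustrating Bellman's backwards-in-time procedure.
# The dynamics x_{k+1} = x_k + u_k and the utility U(x, u) = x^2 + u^2
# are illustrative choices only.

states = np.linspace(-2.0, 2.0, 41)    # state grid
controls = np.linspace(-1.0, 1.0, 21)  # control grid
N = 20                                 # horizon length
gamma = 0.95                           # discount factor

V = np.zeros_like(states)              # terminal cost V_N(x) = 0
policy = np.zeros((N, states.size))    # u_k as a function of the state

for k in range(N - 1, -1, -1):         # backward from k = N-1 to k = 0
    V_next = V.copy()                  # cost-to-go at time k+1
    for i, x in enumerate(states):
        # Q-values: U(x, u) + gamma * V_{k+1}(x + u) for each candidate u
        x_new = np.clip(x + controls, states[0], states[-1])
        q = x**2 + controls**2 + gamma * np.interp(x_new, states, V_next)
        j = np.argmin(q)
        V[i] = q[j]
        policy[k, i] = controls[j]

# From x = -2 the optimal first action pushes the state toward the origin,
# and the cost-to-go at the origin is zero.
print(policy[0, 0], V[20])
```

The loop is the recursion of Bellman's principle: the cost-to-go table for time \(k\) is computed entirely from the table for time \(k+1\), so the optimal feedback law is recovered without enumerating all control sequences.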
1.1.2 ADP for Continuous-Time Systems
For continuous-time systems, the cost function J is also the key to dynamic programming. By minimizing J, one gets the optimal cost function \( J^* \), which is often a Lyapunov function of the system. As a consequence of Bellman’s principle of optimality, \( J^* \) satisfies the Hamilton-Jacobi-Bellman (HJB) equation. But usually, one cannot obtain an analytical solution of the HJB equation. Even finding an accurate numerical solution is very difficult due to the so-called curse of dimensionality.
Consider the continuous-time nonlinear dynamical system
\[ \dot{x}(t)=F\big (x(t),u(t)\big ), \qquad (1.9) \]
where \(x \in \mathbb {R}^n\) is the state vector and \(u \in \mathbb {R}^m\) is the control input vector. The corresponding cost function of the system can be defined as
\[ J(x_0,u)=\int _{t_0}^{\infty }U\big (x(\tau ),u(\tau )\big )\,\mathrm{d}\tau , \qquad (1.10) \]
with utility function \( U(x, u) \ge 0 \), where \( x(t_0) = x_0 \). Bellman’s principle of optimality can also be applied to continuous-time systems. In this case, the optimal cost
\[ J^*(x_0)=\min _{u}J(x_0,u) \]
satisfies the HJB equation
\[ \min _{u}\left\{ U(x,u)+\left( \frac{\partial J^*}{\partial x}\right) ^{T}F(x,u)\right\} =0. \qquad (1.11) \]
The HJB equation in (1.11) can be derived from Bellman’s principle of optimality [24]. Meanwhile, the optimal control \( u^* (t) \) is the one that minimizes the cost function,
\[ u^*(t)=\arg \min _{u}\left\{ U(x,u)+\left( \frac{\partial J^*}{\partial x}\right) ^{T}F(x,u)\right\} . \qquad (1.12) \]
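The derivation from the principle of optimality can be sketched in two steps (a standard argument, assuming \(J^*\) is continuously differentiable):

```latex
% Sketch: deriving the HJB equation from the principle of optimality.
% Over a short interval [t, t+\Delta t], optimality of the tail gives
\[
J^{*}\big(x(t)\big)
  = \min_{u_{[t,\,t+\Delta t]}}
    \left\{ \int_{t}^{t+\Delta t} U\big(x(\tau),u(\tau)\big)\,\mathrm{d}\tau
            + J^{*}\big(x(t+\Delta t)\big) \right\}.
\]
% Taylor-expanding
%   J^*(x(t+\Delta t)) \approx J^*(x(t))
%     + (\partial J^*/\partial x)^{T} F(x,u)\,\Delta t,
% cancelling J^*(x(t)) on both sides and letting \Delta t \to 0 yields
\[
0 = \min_{u}\left\{ U(x,u)
    + \left(\frac{\partial J^{*}}{\partial x}\right)^{T} F(x,u) \right\}.
\]
```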
In 1994, Saridis and Wang [25] studied the nonlinear stochastic systems described by
with the cost function
where \(x \in \mathbb {R}^n, u \in \mathbb {R}^m\), and \(w \in \mathbb {R}^k\) are the state vector, the control vector, and a separable Wiener process; f, g and h are measurable system functions; and Q and \(\phi \) are nonnegative functions. A value function V is defined as
where \(I \triangleq \left[ t_0, T\right] \). The HJB equation is modified to become the following equation
where \(\mathscr {L}_u\) is the infinitesimal generator of the stochastic process specified by (1.13) and is defined by
Depending on whether \(\nabla V \le 0\) or \(\nabla V \ge 0\), an upper bound \(\bar{V}\) or a lower bound \(\underline{V}\) of the optimal cost \(J^*\) is found by solving equation (1.14), such that \(\underline{V} \le J^* \le \bar{V}\). Using \(\bar{V}\) (or \(\underline{V}\)) as an approximation to \(J^*\), one can solve for a control law. This leads to the so-called suboptimal control. It was proved that such controls are stable for the infinite-time stochastic regulator optimal control problem, where the cost function is defined as
The benefit of the suboptimal control is that the bound \(\bar{V}\) (or \(\underline{V}\)) of the optimal cost \(J^*\) can be approximated by an iterative process. Beginning from certain chosen functions \(u_0\) and \(V_0\), let
Then, by repeatedly applying (1.14) and (1.15), one obtains a sequence of functions \(V_i\). This sequence \(\left\{ V_i\right\} \) converges to the bound \(\bar{V}\) (or \(\underline{V}\)) of the cost function \(J^*\). Consequently, \(u_i\) approximates the optimal control as i tends to \(\infty \). It is important to note that the sequences \(\left\{ V_i\right\} \) and \(\left\{ u_i\right\} \) are obtainable by computation, and they approximate the optimal cost and the optimal control law, respectively.
Some further theoretical results for ADP have been obtained in [2]. These works investigated the stability and optimality of some special cases of ADP. In [2], Murray et al. studied the (deterministic) continuous-time affine nonlinear system
\[ \dot{x}(t)=f\big (x(t)\big )+g\big (x(t)\big )u(t) \qquad (1.16) \]
with the cost function
\[ J(x_0,u)=\int _{0}^{\infty }U\big (x(\tau ),u(\tau )\big )\,\mathrm{d}\tau , \qquad (1.17) \]
where \(U(x, u)=Q(x)+u^{T} R(x) u\), \(Q(x)>0\) for \(x \ne 0\) and \(Q(0)=0\), and \(R(x)>0\) for all x. Similar to [25], an iterative procedure is proposed to find the control law as follows. For the plant (1.16) and the cost function (1.17), the HJB equation leads to the following optimal control law
\[ u^*(x)=-\frac{1}{2}R^{-1}(x)g^{T}(x)\frac{\partial J^*(x)}{\partial x}. \qquad (1.18) \]
Applying (1.17) and (1.18) repeatedly, one obtains sequences of estimates of the optimal cost function \(J^*\) and the optimal control \(u^*\). Starting from an initial stabilizing control \(v_0(x)\), for \(i=0,1, \ldots \), the approximation is given by the following iterations between value functions
\[ V_{i+1}(x_0)=\int _{0}^{\infty }U\big (x(\tau ),v_i(x(\tau ))\big )\,\mathrm{d}\tau \]
and control laws
\[ v_{i+1}(x)=-\frac{1}{2}R^{-1}(x)g^{T}(x)\frac{\partial V_{i+1}(x)}{\partial x}. \]
The following results were shown in [2].
(1) The sequence of functions \(\left\{ V_i\right\} \) obtained above converges to the optimal cost function \(J^*\).
(2) Each of the control laws \(v_{i+1}\) obtained above stabilizes the plant (1.16), for all \(i=0,1, \ldots \)
(3) Each of the value functions \(V_{i+1}(x)\) is a Lyapunov function of the plant, for all \(i=0,1, \ldots \)
Abu-Khalaf and Lewis [26] also studied the system (1.16) with the following value function
\[ V(x_0)=\int _{0}^{\infty }\left( x^{T}Qx+u^{T}Ru\right) \mathrm{d}\tau , \]
where Q and R are positive-definite matrices. The successive approximation to the HJB equation starts with an initial stabilizing control law \(v_0(x)\). For \(i=0,1, \ldots \), the approximation is given by the following iterations between policy evaluation
\[ 0=x^{T}Qx+v_i^{T}(x)Rv_i(x)+\left( \nabla V_i(x)\right) ^{T}\big (f(x)+g(x)v_i(x)\big ) \]
and policy improvement
\[ v_{i+1}(x)=-\frac{1}{2}R^{-1}g^{T}(x)\nabla V_i(x), \]
where \(\nabla V_i(x)=\partial V_i(x) / \partial x\). In [26], the above iterative approach was applied to systems of the form (1.16) with saturating actuators through a modified utility function, with convergence and optimality proofs showing that \(V_i \rightarrow J^*\) and \(v_i \rightarrow u^*\) as \(i \rightarrow \infty \). For continuous-time optimal control problems, the quest for successive solutions to the HJB equation has been going on for a long time; published works date back to as early as 1967, by Leake and Liu. The brief overview presented here only serves as an entry point to many more recent results [26,27,28].
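To make the policy-evaluation/policy-improvement loop concrete, one can specialize it to the linear-quadratic case, where the evaluation step reduces to a Lyapunov equation and the iteration is known as Kleinman's algorithm. The system matrices below are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

# Policy iteration specialized to the linear-quadratic case
# (Kleinman's algorithm): policy evaluation solves a Lyapunov equation,
# policy improvement updates the feedback gain. The matrices A, B, Q, R
# are illustrative choices only.

A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)          # state weight, Q(x) = x^T Q x
R = np.array([[1.0]])  # control weight

def lyap(Acl, W):
    """Solve Acl^T P + P Acl + W = 0 via Kronecker vectorization."""
    n = Acl.shape[0]
    M = np.kron(np.eye(n), Acl.T) + np.kron(Acl.T, np.eye(n))
    return np.linalg.solve(M, -W.reshape(-1)).reshape(n, n)

K = np.zeros((1, 2))   # initial stabilizing gain (A itself is stable here)
for i in range(50):
    Acl = A - B @ K                      # closed-loop dynamics under v_i
    P = lyap(Acl, Q + K.T @ R @ K)       # policy evaluation: V_i(x) = x^T P x
    K = np.linalg.solve(R, B.T @ P)      # policy improvement: v_{i+1} = -K x

# At convergence P satisfies the algebraic Riccati equation,
# the continuous-time LQ counterpart of the HJB equation.
residual = A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q
print(np.abs(residual).max())
```

Each pass mirrors the two displayed iterations: solving the Lyapunov equation is the linear analogue of evaluating \(V_i\) under \(v_i\), and the gain update is the analogue of \(v_{i+1}=-\frac{1}{2}R^{-1}g^{T}\nabla V_i\).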
1.2 Publication Outline
The general layout of this monograph is given as follows. Adaptive dynamic programming is used to design drug dosage regulation mechanisms that provide adaptive viral treatment strategies for organisms with input constraints, and this is extended to tumour cells, immune cells, and the interplay and regulation schemes among them and the immune system. The main contents of this monograph are as follows:
- Chapter 1: introduces the research background, development and current status of ADP both domestically and internationally, as well as the idea and design framework of the underlying ADP, including discrete-time and continuous-time systems.
- Chapter 2: investigates an optimal regulation scheme between tumor and immune cells based on the ADP approach. The therapeutic goal is to inhibit the growth of tumor cells to an allowable degree of injury while maximizing the number of immune cells. A reliable controller is derived through the ADP approach to make the cell populations achieve the specified ideal states. Firstly, the main objective is to weaken the negative effects caused by chemotherapy and immunotherapy, which means that minimal doses of chemotherapeutic and immunotherapeutic drugs are used in the treatment process. Secondly, according to the nonlinear dynamical mathematical model of tumor cells, chemotherapy and immunotherapeutic drugs act as powerful regulatory measures, which constitutes a closed-loop control behavior. Finally, the states of the system and the critic weight errors are proved to be uniformly ultimately bounded with the appropriate optimization control strategy, and simulation results demonstrate the effectiveness of the cybernetic methodology.
- Chapter 3: investigates the optimal control strategy problem for nonzero-sum games of the immune system based on adaptive dynamic programming. Firstly, the main objective is to approximate a Nash equilibrium between the tumor cells and the immune cell population, which is governed through chemotherapy drugs and immunoagents guided by the mathematical growth model of the tumor cells. Secondly, a novel intelligent nonzero-sum-games-based ADP is put forward to solve the optimization control problem by reducing the growth rate of tumor cells and minimizing chemotherapy and immunotherapy drugs. Meanwhile, convergence analysis and an iterative ADP algorithm are specified to prove feasibility. Finally, simulation examples are given to demonstrate the availability and effectiveness of the research methodology.
- Chapter 4: is devoted to evolutionary-dynamics optimal control for a tumor-immune differential game system. Firstly, a mathematical model covering immune cells and tumor cells is established, considering the effects of chemotherapy drugs and immune agents. Secondly, the bounded optimal control problem is transformed into solving the HJB equation, considering the actual constraints and an infinite-horizon performance index based on minimizing the amount of medication administered. Finally, an approximate optimal control strategy is acquired through an iterative dual heuristic dynamic programming algorithm, which effectively avoids the curse of dimensionality and provides an optimal treatment scheme for clinical applications.
- Chapter 5: mainly proposes an evolutionary algorithm and its first application to develop therapeutic strategies for Ecological Evolutionary Dynamics Systems (EEDS), obtaining a balance between tumor cells and immune cells by rationally arranging chemotherapeutic and immune drugs. Firstly, an EEDS nonlinear kinetic model is constructed to describe the relationship between tumor cells, immune cells, dose, and drug concentration. Secondly, the N-Level Hierarchy Optimization (NLHO) algorithm is designed and compared with 5 algorithms on 20 benchmark functions, which proves the feasibility and effectiveness of NLHO. Finally, we apply NLHO to the EEDS to give a dynamic adaptive optimal control policy, and develop therapeutic strategies that reduce tumor cells while minimizing the harm of chemotherapy and immune drugs to the human body. The experimental results prove the validity of the research method.
- Chapter 6: investigates the optimal control strategy for the organism by using the ADP method. Firstly, a tumor model is established to formulate the interaction relationships among normal cells, tumor cells, endothelial cells and the concentrations of drugs. Then, an ADP-based method with a single-critic network architecture is proposed to approximate the coupled Hamilton-Jacobi equations (HJEs) under the medicine dosage regulation mechanism (MDRM). According to game theory, the approximate MDRM-based optimal strategy can be derived, which is of great practical significance. Owing to the proposed mechanism, the dosages of the chemotherapy and anti-angiogenic drugs can be regulated in a timely and necessary manner. Furthermore, the stability of the closed-loop system with the obtained strategy is analyzed via Lyapunov theory. Finally, a simulation experiment is conducted to verify the effectiveness of the proposed method.
- Chapter 7: investigates the constrained adaptive control strategy based on virotherapy for the organism using the MDRM. Firstly, the tumor-virus-immune interaction dynamics is established to model the relations among tumor cells (TCs), virus particles and the immune response. The ADP method is extended to approximately obtain the optimal strategy for the interaction system so as to reduce the population of TCs. Due to the consideration of asymmetric control constraints, non-quadratic functions are proposed to formulate the value function, such that the corresponding Hamilton-Jacobi-Bellman equation (HJBE) is derived, which can be deemed the cornerstone of ADP algorithms. Then, an ADP method with a single-critic network architecture integrating the MDRM is proposed to obtain approximate solutions of the HJBE and eventually derive the optimal strategy. The design of the MDRM makes it possible for the dosage of the agentia containing oncolytic virus particles to be regulated in a timely and necessary manner. Furthermore, the uniform ultimate boundedness of the system states and critic weight estimation errors is validated by Lyapunov stability analysis. Finally, simulation results are given to show the effectiveness of the derived therapeutic strategy.
References
Bellman RE (2010) Dynamic programming. Princeton University Press
Murray JJ, Cox CJ, Lendaris GG, Saeks R (2002) Adaptive dynamic programming. IEEE Trans Syst Man Cybernet Part C (Appl Rev) 32(2):140–153
Lewis FL, Vrabie D (2009) Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst Mag 9(3):32–50
Zhang HG, Zhang X, Luo YH, Yang J (2013) An overview of research on adaptive dynamic programming. Acta Automatica Sinica 39(4):303–311
Jiang ZP, Jiang Y (2013) Robust adaptive dynamic programming for linear and nonlinear systems: an overview. Eur J Control 19(5):417–425
He H, Ni Z, Fu J (2012) A three-network architecture for on-line learning and optimization based on adaptive dynamic programming. Neurocomputing 78(1):3–13
Bertsekas DP (2015) Value and policy iterations in optimal control and adaptive dynamic programming. IEEE Trans Neural Netw Learn Syst 28(3):500–509
Luo B, Liu D, Wu HN, Wang D, Lewis FL (2017) Policy gradient adaptive dynamic programming for data-based optimal control. IEEE Trans Cybernet 47(10):3341–3354
Yang Y, Vamvoudakis KG, Modares H, Yin Y, Wunsch DC (2021) Hamiltonian-driven hybrid adaptive dynamic programming. IEEE Trans Syst Man, Cybernet: Syst 51(10):6423–6434
Jiang Y, Jiang ZP (2013) Robust adaptive dynamic programming with an application to power systems. IEEE Trans Neural Netw Learn Syst 24(7):1150–1156
Jiang Y, Jiang ZP (2015) Global adaptive dynamic programming for continuous-time nonlinear systems. IEEE Trans Autom Control 60(11):2917–2929
Bertsekas DP, Tsitsiklis JN (1996) Neuro-dynamic programming. Athena Scientific
Bertsekas DP (2011) Dynamic programming and optimal control, vol II, 3rd edn. Athena Scientific, Belmont, MA
Si J, Barto AG, Powell WB, Wunsch D (2004) Handbook of learning and approximate dynamic programming. Wiley
Powell WB (2007) Approximate dynamic programming: solving the curses of dimensionality. Wiley
Balakrishnan SN, Ding J, Lewis FL (2008) Issues on stability of ADP feedback controllers for dynamical systems. IEEE Trans Syst Man Cybernet Part B: Cybernet 38(4):913–917
Wang FY, Zhang HG, Liu DR (2009) Adaptive dynamic programming: an introduction. IEEE Comput Intell Mag 4(2):39–47
Duan J, Liu Z, Li SE, Sun Q, Jia Z, Cheng B (2022) Adaptive dynamic programming for nonaffine nonlinear optimal control problem with state constraints. Neurocomputing 484:128–141
Zhao J, Na J, Gao G (2022) Robust tracking control of uncertain nonlinear systems with adaptive dynamic programming. Neurocomputing 471:21–30
Jiang H, Zhou B (2022) Bias-policy iteration based adaptive dynamic programming for unknown continuous-time linear systems. Automatica 136:110058
Ming Z, Zhang H, Li W, Luo Y (2022) Neurodynamic programming and tracking control for nonlinear stochastic systems by PI algorithm. IEEE Trans Circuits Syst II Express Briefs 69(6):2892–2896
Dou L, Cai S, Zhang X, Su X, Zhang R (2022) Event-triggered-based adaptive dynamic programming for distributed formation control of multi-UAV. J Frankl Inst 359(8):3671–3691
Wang R, Ma D, Li MJ, Sun Q, Zhang H, Wang P (2022) Accurate current sharing and voltage regulation in hybrid wind/solar systems: an adaptive dynamic programming approach. IEEE Trans Consum Electron 68(3):261–272
Lewis FL, Syrmos VL (1995) Optimal control. Wiley, New York
Saridis GN, Wang FY (1994) Suboptimal control of nonlinear stochastic systems. Control Theory Adv Technol 10(4):847–871
Abu-Khalaf M, Lewis FL (2005) Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5):779–791
Yang X, Liu D, Huang Y (2013) Neural-network-based online optimal control for uncertain non-linear continuous-time systems with control constraints. IET Control Theory Appl 7(17):2037–2047
Yang X, Liu D, Wang D (2014) Reinforcement learning for adaptive optimal control of unknown continuous-time nonlinear systems with input constraints. Int J Control 87(3):553–556
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
© 2024 The Author(s)
Sun, J., Xu, S., Liu, Y., Zhang, H. (2024). Introduction. In: Adaptive Dynamic Programming. Springer, Singapore. https://doi.org/10.1007/978-981-99-5929-7_1
Print ISBN: 978-981-99-5928-0
Online ISBN: 978-981-99-5929-7