
In this paper, a reinforcement Q-learning method based on value iteration (VI) is proposed for a class of model-free stochastic linear quadratic (SLQ) optimal tracking problems with time delay. Compared with traditional reinforcement learning methods, the Q-learning method avoids the need for an accurate system model. First, a delay operator is introduced to construct a novel augmented system composed of the original system and the command generator. Second, the SLQ optimal tracking problem is transformed into a deterministic one by a system transformation, and the corresponding Q-function for SLQ optimal tracking control is derived. On this basis, a Q-learning algorithm is proposed and its convergence is proved. Finally, a simulation example shows the effectiveness of the proposed algorithm.
Citation: Xufeng Tan, Yuan Li, Yang Liu. Stochastic linear quadratic optimal tracking control for discrete-time systems with delays based on Q-learning algorithm[J]. AIMS Mathematics, 2023, 8(5): 10249-10265. doi: 10.3934/math.2023519
It is well known that the optimal tracking control (OTC) problem plays an important role in the field of optimal control and has developed rapidly in applications [1,2,3,4]. The goal of the OTC problem is to design a controller that makes the output of the system track a reference trajectory while minimizing a cost function. Traditionally, the OTC problem has been addressed by feedback linearization [5] and plant inversion [6], which usually requires complex mathematical analysis. As for the linear quadratic tracking (LQT) problem, the traditional approach is to solve the algebraic Riccati equation (ARE) and a noncausal difference equation. However, these methods require an accurate system model [7]. In practical situations, the system parameters are partially or completely unknown, so the traditional methods cannot be applied.
The key to the OTC problem is to solve the Hamilton-Jacobi-Bellman (HJB) equation. However, the HJB equation involves solving difference or differential equations, so it is difficult to solve. Although dynamic programming has always been an effective method for solving the HJB equation, it is not feasible for high-dimensional problems because of "the curse of dimensionality". To approximate the solution of the HJB equation, adaptive dynamic programming (ADP) algorithms have been widely used and developed. In [8], a policy iteration (PI) scheme was adopted to approximate the optimal control for partly unknown continuous-time systems. In [9], B. Kiumarsi et al. solved the LQT problem online using only measured input, output, and reference trajectory data of the system. In [10], a Q-learning method was proposed to calculate the optimal control using only measured data and the command generator, without knowledge of the system dynamics.
In recent years, stochastic control theory has become a focus of optimal control theory because of its theoretical difficulty and wide application; in particular, the model-free SLQ optimal tracking problem has attracted more and more attention [11,12,13,14,15]. In [14], an ADP algorithm based on neural networks was proposed to solve the model-free SLQ optimal tracking control problem. In addition, a Q-learning algorithm was used to solve the model-free SLQ optimal tracking control problem in [15]. To the best of our knowledge, there are many research results on the model-free SLQ optimal tracking problem based on ADP algorithms, but the SLQ optimal tracking problem with delays has received little attention. Time delay [16] is an important factor that cannot be ignored. It exists in many practical systems, such as industrial processes, power grids, chemical reactions, and so on [17,18,19,20]. However, in the methods of [11,12,13,14,15], the influence of time delay on the system is neglected. If the time delay is ignored, the control performance degrades and the closed-loop system may even diverge. The method proposed in [16] takes the time delay into account but ignores the influence of stochastic disturbances on the system. As far as we know, there is no research on the optimal tracking problem of stochastic linear systems with delays. Therefore, how to use an ADP algorithm to deal with the model-free SLQ optimal tracking control problem with delays has important practical significance. This is the motivation of this paper.
The main contributions of this paper include:
(1) For stochastic linear systems, this paper proposes, for the first time, a Q-learning method to solve the model-free SLQ optimal tracking control problem with delays, which enhances the practicability of ADP algorithms in tracking problems.
(2) By introducing the delay operator, the influence of delays on the subsequent algorithm can be effectively eliminated.
(3) In this paper, the Q-learning algorithm is used to solve the model-free SLQ optimal tracking control problem with delays. Compared with other methods, which require an accurate system model to obtain the optimal control, this method makes full use of online system state information to obtain the optimal control and avoids solving the augmented stochastic algebraic equation (SAE).
The structure of this paper is organized as follows. In section 2, we give the problem formulation and conversion. In section 3, we derive the Q-learning algorithm and prove its convergence. In section 4, we give the implementation steps of Q-learning algorithm. In section 5, a simulation example is given to verify the effectiveness of the algorithm. In section 6, the conclusion is given.
Consider the following linear stochastic system with delays:
$$
\begin{aligned}
x_{k+1}&=Ax_k+A_dx_{k-d}+Bu_k+B_du_{k-d}+\left(Cx_k+C_dx_{k-d}+Du_k+D_du_{k-d}\right)\omega_k,\\
y_k&=Ex_k+E_dx_{k-d}
\end{aligned} \tag{2.1}
$$
where $x_k\in\mathbb{R}^n$ is the system state vector, $u_k\in\mathbb{R}^m$ is the control input vector, $y_k\in\mathbb{R}^q$ is the system output, and $x_{k-d}$, $u_{k-d}$, $y_{k-d}$ are the delayed variables with delay index $d\in\mathbb{N}$. $A\in\mathbb{R}^{n\times n}$, $B\in\mathbb{R}^{n\times m}$, $C\in\mathbb{R}^{n\times n}$, $D\in\mathbb{R}^{n\times m}$, $E\in\mathbb{R}^{q\times n}$ are given constant matrices, and $A_d\in\mathbb{R}^{n\times n}$, $B_d\in\mathbb{R}^{n\times m}$, $C_d\in\mathbb{R}^{n\times n}$, $D_d\in\mathbb{R}^{n\times m}$, $E_d\in\mathbb{R}^{q\times n}$ are the corresponding delay dynamics matrices. The one-dimensional stochastic disturbance sequence $\omega_k$ is defined on the given probability space $(\Omega,\mathcal{F},P,\mathcal{F}_k)$ and satisfies $E(\omega_k\mid\mathcal{F}_k)=0$, $E(\omega_k^2\mid\mathcal{F}_k)=1$. The initial state $x_0$ is independent of $\omega_k$.
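To make the setting concrete, the following sketch simulates system (2.1) under a given feedback law. It is only an illustrative rollout, not part of the proposed algorithm; the zero initialization of the delayed terms before time 0 and the Gaussian draw for $\omega_k$ are our assumptions (the paper only requires $E(\omega_k\mid\mathcal{F}_k)=0$ and $E(\omega_k^2\mid\mathcal{F}_k)=1$).

```python
import numpy as np

def simulate_delayed_system(A, Ad, B, Bd, C, Cd, D, Dd, E, Ed,
                            d, x0, controller, steps, rng):
    """Roll out system (2.1) for `steps` steps under a state-feedback `controller`.

    Delayed terms with a negative time index are taken as zero (our assumption).
    """
    n, m = A.shape[0], B.shape[1]
    xs, us, ys = {0: x0}, {}, {}
    zero_x, zero_u = np.zeros(n), np.zeros(m)
    for k in range(steps):
        x_del = xs.get(k - d, zero_x)          # x_{k-d}
        u_del = us.get(k - d, zero_u)          # u_{k-d}
        us[k] = controller(xs[k], k)
        w = rng.standard_normal()              # one realization of omega_k
        drift = A @ xs[k] + Ad @ x_del + B @ us[k] + Bd @ u_del
        diffusion = C @ xs[k] + Cd @ x_del + D @ us[k] + Dd @ u_del
        xs[k + 1] = drift + diffusion * w
        ys[k] = E @ xs[k] + Ed @ x_del
    return xs, us, ys

# Example call with a placeholder zero controller:
# rng = np.random.default_rng(0)
# simulate_delayed_system(A, Ad, B, Bd, C, Cd, D, Dd, E, Ed, 1, x0,
#                         lambda x, k: np.zeros(1), 50, rng)
```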
Assume the reference trajectory of the SLQ optimal tracking control is generated by a command generator
$$
r_{k+1}=Fr_k \tag{2.2}
$$
where $r_k\in\mathbb{R}^q$ represents the reference trajectory and $F$ is a constant matrix.
The tracking error can be expressed as
$$
e_k=y_k-r_k \tag{2.3}
$$
where rk is the reference trajectory.
The goal of the SLQ optimal tracking problem with delays is to design an optimal controller that not only ensures the output of the target system tracks the reference trajectory stably, but also minimizes the cost function. The cost function is defined as
$$
J(x_k,r_k,u_k)=E\sum_{i=k}^{\infty}U_i(x_i,x_{i-d},u_i) \tag{2.4}
$$
where $U_i(x_i,x_{i-d},u_i)=(y_i-r_i)^{T}O(y_i-r_i)+u_i^{T}Ru_i+u_{i-d}^{T}R_du_{i-d}$ is the utility function, and $O=O^{T}\in\mathbb{R}^{q\times q}\ge 0$, $R=R^{T}\in\mathbb{R}^{m\times m}\ge 0$, $R_d=R_d^{T}\in\mathbb{R}^{m\times m}\ge 0$ are constant weighting matrices.
The cost function (2.4) can be used only when $F$ is Hurwitz, that is, the reference trajectory system is required to be asymptotically stable. If the reference trajectory does not tend to zero over time, the cost function (2.4) will be unbounded. In practice, this condition is difficult to satisfy. Therefore, a discount factor $\gamma$ is introduced into the cost function (2.4) to relax this restriction. Based on (2.4), the cost function with discount factor is redefined as
$$
J(x_k,r_k,u_k)=E\sum_{i=k}^{\infty}\gamma^{\,i-k}U_i(x_i,x_{i-d},u_i)
=E\sum_{i=k}^{\infty}\gamma^{\,i-k}\left[(y_i-r_i)^{T}O(y_i-r_i)+u_i^{T}Ru_i+u_{i-d}^{T}R_du_{i-d}\right] \tag{2.5}
$$
where $0<\gamma\le 1$ is the discount factor.
Definition 1 ([21]). $u_k$ is called mean-square stabilizing at $e_0$ if there exists a linear feedback form of $u_k$ such that, for every initial state $e_0$, $\lim_{k\to\infty}E(e_k^{T}e_k)=0$. The system (2.3) with a mean-square stabilizing control $u_k$ is called mean-square stabilizable.
Definition 2 ([21]). $u_k$ is said to be admissible if $u_k$ satisfies the following: (1) $u_k$ is an $\mathcal{F}_k$-adapted and measurable stochastic process; (2) $u_k$ is mean-square stabilizing; (3) it enables the cost function to reach its minimum value.
The goal of this paper is to seek an admissible control, which not only minimizes the cost function (2.5) but also stabilizes the system (2.3) for each initial state e0. We denote the optimal cost function as follows
$$
V(e_0)=\min_{u}J(e_0,u). \tag{2.6}
$$
In order to achieve the above goal, this paper establishes an augmented system composed of system (2.1) and the reference trajectory system (2.2), and then transforms the optimal tracking problem into an optimal regulation problem.
The system (2.1) can be rewritten as the following equivalent form:
$$
\begin{aligned}
x_{k+1}&=\begin{bmatrix}A & A_d\end{bmatrix}\begin{bmatrix}x_k\\ x_{k-d}\end{bmatrix}+\begin{bmatrix}B & B_d\end{bmatrix}\begin{bmatrix}u_k\\ u_{k-d}\end{bmatrix}+\left(\begin{bmatrix}C & C_d\end{bmatrix}\begin{bmatrix}x_k\\ x_{k-d}\end{bmatrix}+\begin{bmatrix}D & D_d\end{bmatrix}\begin{bmatrix}u_k\\ u_{k-d}\end{bmatrix}\right)\omega_k,\\
y_k&=\begin{bmatrix}E & E_d\end{bmatrix}\begin{bmatrix}x_k\\ x_{k-d}\end{bmatrix}.
\end{aligned} \tag{2.7}
$$
According to [16,22,23], we define the delay operator $\nabla_d$ satisfying $\nabla_d x_k=x_{k-d}$ and $(\nabla_d x_k)^{T}=x_{k-d}^{T}$. Then, the system (2.7) can be expressed as
$$
x_{k+1}=A_{\nabla}x_k+B_{\nabla}u_k+\left(C_{\nabla}x_k+D_{\nabla}u_k\right)\omega_k,\qquad y_k=E_{\nabla}x_k \tag{2.8}
$$
where $A_{\nabla}=A+A_d\nabla_d$, $B_{\nabla}=B+B_d\nabla_d$, $C_{\nabla}=C+C_d\nabla_d$, $D_{\nabla}=D+D_d\nabla_d$, $E_{\nabla}=E+E_d\nabla_d$.
Based on the system (2.1) and the reference trajectory system (2.2), the augmented system can be defined as
$$
G_{k+1}=\begin{bmatrix}x_{k+1}\\ r_{k+1}\end{bmatrix}
=\begin{bmatrix}A_{\nabla}+C_{\nabla}\omega_k & 0\\ 0 & F\end{bmatrix}\begin{bmatrix}x_k\\ r_k\end{bmatrix}
+\begin{bmatrix}B_{\nabla}+D_{\nabla}\omega_k\\ 0\end{bmatrix}u_k
=TG_k+B_0u_k \tag{2.9}
$$
where $G_k=\begin{bmatrix}x_k\\ r_k\end{bmatrix}\in\mathbb{R}^{n+q}$, $T\in\mathbb{R}^{(n+q)\times(n+q)}$, $B_0\in\mathbb{R}^{(n+q)\times m}$.
Based on the augmented system (2.9), the cost function (2.5) can be expressed as
$$
J(G_k,u_k)=E\sum_{i=k}^{\infty}\gamma^{\,i-k}\left[G_i^{T}O_1G_i+u_i^{T}R_{\nabla}u_i\right] \tag{2.10}
$$
where $O_1=\begin{bmatrix}E & -I\end{bmatrix}^{T}O\begin{bmatrix}E & -I\end{bmatrix}\in\mathbb{R}^{(n+q)\times(n+q)}$ and $R_{\nabla}=R+R_d\nabla_d$.
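As a small sketch, the augmented weight can be assembled numerically as follows; the function and argument names below are our own illustration, not from the paper.

```python
import numpy as np

def augmented_weight(E_mat, O):
    """O_1 = [E  -I]^T O [E  -I] as in (2.10); E_mat is q x n, O is q x q."""
    q = O.shape[0]
    EI = np.hstack([E_mat, -np.eye(q)])   # [E  -I], shape q x (n+q)
    return EI.T @ O @ EI                  # (n+q) x (n+q)
```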
The state feedback linear controller is defined as
$$
u_k=KG_k,\qquad K\in\mathbb{R}^{m\times(n+q)} \tag{2.11}
$$
where K represents the control gain matrix of the system.
Substituting (2.11) into (2.10), the cost function (2.10) can be transformed into
$$
J(G_k,K)=E\sum_{i=k}^{\infty}\gamma^{\,i-k}G_i^{T}\left[O_1+K^{T}R_{\nabla}K\right]G_i. \tag{2.12}
$$
Therefore, the objective of the SLQ optimal tracking problem with delays can be further expressed as
$$
V(G_0,K)=\min_{K}J(G_0,K). \tag{2.13}
$$
Definition 3. The SLQ optimal control problem is well posed if
$$
-\infty<V(G_0,K)<+\infty.
$$
Before solving the SLQ control problem, we need to know whether it is well-posed. Therefore, we give the following lemma first.
Lemma 1. If there exists an admissible control uk=KGk, then the SLQ optimal tracking control is well-posed, and the cost function can be expressed as
$$
J(G_k,K)=E\left(G_k^{T}PG_k\right) \tag{2.14}
$$
where the matrix $P\in\mathbb{R}^{(n+q)\times(n+q)}$ satisfies the following augmented SAE
$$
P=\gamma(A_1+B_1K)^{T}P(A_1+B_1K)+\gamma(C_1+D_1K)^{T}P(C_1+D_1K)+O_1+K^{T}R_{\nabla}K \tag{2.15}
$$
where $A_1=\begin{bmatrix}A_{\nabla} & 0\\ 0 & F\end{bmatrix}\in\mathbb{R}^{(n+q)\times(n+q)}$, $B_1=\begin{bmatrix}B_{\nabla}\\ 0\end{bmatrix}\in\mathbb{R}^{(n+q)\times m}$, $C_1=\begin{bmatrix}C_{\nabla} & 0\\ 0 & 0\end{bmatrix}\in\mathbb{R}^{(n+q)\times(n+q)}$, $D_1=\begin{bmatrix}D_{\nabla}\\ 0\end{bmatrix}\in\mathbb{R}^{(n+q)\times m}$.
Proof. Assuming that the control uk is admissible and the matrix P satisfies (2.15), then
$$
\begin{aligned}
&E\sum_{i=k}^{\infty}\left[\gamma G_{i+1}^{T}PG_{i+1}-G_i^{T}PG_i\right]\\
&=E\sum_{i=k}^{\infty}\Big\{\gamma\left[(A_1+B_1K)G_i+(C_1\omega_i+D_1K\omega_i)G_i\right]^{T}P\left[(A_1+B_1K)G_i+(C_1\omega_i+D_1K\omega_i)G_i\right]-G_i^{T}PG_i\Big\}\\
&=E\sum_{i=k}^{\infty}\Big\{G_i^{T}\left[\gamma(A_1+B_1K)^{T}P(A_1+B_1K)+\gamma(C_1+D_1K)^{T}P(C_1+D_1K)-P\right]G_i\Big\}.
\end{aligned}
$$
Based on (2.12) and (2.15), we have
$$
\begin{aligned}
J(G_k,K)&=E\sum_{i=k}^{\infty}\gamma^{\,i-k}G_i^{T}\left[O_1+K^{T}R_{\nabla}K\right]G_i\\
&=E\sum_{i=k}^{\infty}\gamma^{\,i-k}G_i^{T}\left[P-\gamma(A_1+B_1K)^{T}P(A_1+B_1K)-\gamma(C_1+D_1K)^{T}P(C_1+D_1K)\right]G_i\\
&=-E\sum_{i=k}^{\infty}\gamma^{\,i-k}\left[\gamma G_{i+1}^{T}PG_{i+1}-G_i^{T}PG_i\right]\\
&=E\left(G_k^{T}PG_k\right)-\lim_{i\to\infty}\gamma^{\,i-k+1}E\left(G_i^{T}PG_i\right)\\
&=E\left(G_k^{T}PG_k\right).
\end{aligned}
$$
Since the feedback control $u_k$ is admissible, we obtain $J(G_k,K)=E(G_k^{T}PG_k)$, which establishes the well-posedness of the SLQ optimal tracking control problem.
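For later reference, the block matrices $A_1,B_1,C_1,D_1$ of Lemma 1 can be assembled as in the following sketch. How the delay operator $\nabla_d$ is evaluated on data (i.e., how numerical realizations of $A_{\nabla},\dots,D_{\nabla}$ are obtained) is left to the caller; the function and its argument names are our own illustration, not the authors' code.

```python
import numpy as np

def build_augmented_blocks(A_nab, B_nab, C_nab, D_nab, F):
    """Block matrices A1, B1, C1, D1 of Lemma 1.

    A_nab, ..., D_nab : numerical realizations of A_nabla, ..., D_nabla;
    F                 : command-generator matrix, given as a 2-D (q x q) array.
    """
    n, m, q = A_nab.shape[0], B_nab.shape[1], F.shape[0]
    A1 = np.block([[A_nab, np.zeros((n, q))], [np.zeros((q, n)), F]])
    B1 = np.vstack([B_nab, np.zeros((q, m))])
    C1 = np.block([[C_nab, np.zeros((n, q))], [np.zeros((q, n)), np.zeros((q, q))]])
    D1 = np.vstack([D_nab, np.zeros((q, m))])
    return A1, B1, C1, D1
```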
To ensure the existence of a mean-square stabilizing control, we make the following assumption.
Assumption 1. The system (2.9) is mean-square stabilizable.
At present, ADP algorithm has achieved great success in the optimal tracking control of deterministic systems [24,25,26], which inspires us to transform stochastic problems into deterministic problems through system transformation.
Let $M_k=E(G_kG_k^{T})$; then the system (2.9) can be converted to
$$
M_{k+1}=E\left(G_{k+1}G_{k+1}^{T}\right)=E\left((TG_k+B_0u_k)(TG_k+B_0u_k)^{T}\right)=(A_1+B_1K)M_k(A_1+B_1K)^{T}+(C_1+D_1K)M_k(C_1+D_1K)^{T} \tag{2.16}
$$
where $M_k\in\mathbb{R}^{(n+q)\times(n+q)}$ is the state of a deterministic system and $M_0$ is the initial state.
Therefore, the cost function (2.10) can be rewritten as
$$
J(M_k,K)=\operatorname{tr}\left\{\sum_{i=k}^{\infty}\gamma^{\,i-k}\left[(O_1+K^{T}R_{\nabla}K)M_i\right]\right\}. \tag{2.17}
$$
Remark 1. After the system transformation, the stochastic system is transformed into a deterministic one. The deterministic dynamics (2.16) and the cost (2.17) are completely free of the stochastic disturbance $\omega_k$ and depend only on the initial state $M_0$ and the control gain matrix $K$, which prepares for the derivation and application of the Q-learning algorithm.
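The transformed problem can be evaluated purely with matrix arithmetic, as in the following sketch; the truncated horizon used to approximate the infinite sum is our own choice.

```python
import numpy as np

def propagate_M(M, K, A1, B1, C1, D1):
    """One step of the deterministic second-moment dynamics (2.16)."""
    AK, CK = A1 + B1 @ K, C1 + D1 @ K
    return AK @ M @ AK.T + CK @ M @ CK.T

def cost_estimate(M0, K, A1, B1, C1, D1, O1, R_nab, gamma, horizon=500):
    """Truncated evaluation of the deterministic cost (2.17)."""
    J, M = 0.0, M0
    for i in range(horizon):
        J += gamma**i * np.trace((O1 + K.T @ R_nab @ K) @ M)
        M = propagate_M(M, K, A1, B1, C1, D1)
    return J
```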
In this paper, the Q-learning method is used to solve the SLQ optimal tracking problem, which avoids the need for an accurate system model. We first give the formula of the optimal control and the corresponding augmented SAE.
Lemma 2. Given the admissible control uk, we can get the following optimal control
$$
u_k^{*}=K^{*}G_k=-\left(R_{\nabla}+\gamma B_1^{T}PB_1+\gamma D_1^{T}PD_1\right)^{-1}\gamma\left(B_1^{T}PA_1+D_1^{T}PC_1\right)G_k \tag{3.1}
$$
and the optimal cost function
$$
V(G_k)=E\left(G_k^{T}PG_k\right)=\operatorname{tr}(PM_k) \tag{3.2}
$$
where the matrix $P$ satisfies the following augmented SAE
$$
\left\{\begin{aligned}
&P=O_1+\gamma\left(A_1^{T}PA_1+C_1^{T}PC_1\right)-\gamma\left(A_1^{T}PB_1+C_1^{T}PD_1\right)\left(R_{\nabla}+\gamma B_1^{T}PB_1+\gamma D_1^{T}PD_1\right)^{-1}\gamma\left(B_1^{T}PA_1+D_1^{T}PC_1\right),\\
&R_{\nabla}+\gamma B_1^{T}PB_1+\gamma D_1^{T}PD_1>0.
\end{aligned}\right. \tag{3.3}
$$
Proof. Suppose uk is an admissible control. According to Lemma 1 and (2.17), the cost function can be written as
$$
\begin{aligned}
J(M_k,K)&=\operatorname{tr}\left\{\sum_{i=k}^{\infty}\gamma^{\,i-k}\left[(O_1+K^{T}R_{\nabla}K)M_i\right]\right\}\\
&=\operatorname{tr}\left\{(O_1+K^{T}R_{\nabla}K)M_k\right\}+\operatorname{tr}\left\{\sum_{i=k+1}^{\infty}\gamma^{\,i-k}\left[(O_1+K^{T}R_{\nabla}K)M_i\right]\right\}\\
&=\operatorname{tr}\left\{(O_1+K^{T}R_{\nabla}K)M_k\right\}+\gamma J(M_{k+1},K).
\end{aligned} \tag{3.4}
$$
According to Bellman optimality principle, the optimal cost function satisfies
$$
V(M_k)=\min_{K}\left\{\operatorname{tr}\left\{(O_1+K^{T}R_{\nabla}K)M_k\right\}+\gamma V(M_{k+1})\right\}. \tag{3.5}
$$
The optimal control gain matrix can be obtained as follows
$$
K^{*}(M_k)=\arg\min_{K}\left\{\operatorname{tr}\left\{(O_1+K^{T}R_{\nabla}K)M_k\right\}+\gamma V(M_{k+1})\right\}. \tag{3.6}
$$
Considering the first-order necessary condition
$$
\frac{\partial\left[\operatorname{tr}\left\{(O_1+K^{T}R_{\nabla}K)M_k\right\}+\gamma V(M_{k+1})\right]}{\partial K}=0, \tag{3.7}
$$
we can obtain
$$
\left(R_{\nabla}+\gamma B_1^{T}PB_1+\gamma D_1^{T}PD_1\right)KG_k+\gamma\left(B_1^{T}PA_1+D_1^{T}PC_1\right)G_k=0 \tag{3.8}
$$
where the matrix P satisfies augmented SAE (2.15).
Supposing $R_{\nabla}+\gamma B_1^{T}PB_1+\gamma D_1^{T}PD_1>0$, we have
$$
K^{*}=-\left(R_{\nabla}+\gamma B_1^{T}PB_1+\gamma D_1^{T}PD_1\right)^{-1}\gamma\left(B_1^{T}PA_1+D_1^{T}PC_1\right). \tag{3.9}
$$
Substituting (3.9) into (2.15), we obtain
$$
P=O_1+\gamma\left(A_1^{T}PA_1+C_1^{T}PC_1\right)-\gamma\left(A_1^{T}PB_1+C_1^{T}PD_1\right)\left(R_{\nabla}+\gamma B_1^{T}PB_1+\gamma D_1^{T}PD_1\right)^{-1}\gamma\left(B_1^{T}PA_1+D_1^{T}PC_1\right). \tag{3.10}
$$
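When the system model is available, the SAE (3.3) can be solved numerically by a simple fixed-point sweep (which coincides with the recursion (3.25) of Lemma 3 below). The routine is a model-based sketch of ours, useful only as a baseline, e.g., to obtain $K^{*}$ for the comparison in the simulation section.

```python
import numpy as np

def solve_sae(A1, B1, C1, D1, O1, R_nab, gamma, iters=1000, tol=1e-10):
    """Fixed-point sweep on the augmented SAE (3.3), starting from P = 0."""
    P = np.zeros_like(O1)
    for _ in range(iters):
        S = R_nab + gamma * (B1.T @ P @ B1 + D1.T @ P @ D1)
        L = gamma * (B1.T @ P @ A1 + D1.T @ P @ C1)
        P_next = (O1 + gamma * (A1.T @ P @ A1 + C1.T @ P @ C1)
                  - L.T @ np.linalg.solve(S, L))
        if np.max(np.abs(P_next - P)) < tol:
            P = P_next
            break
        P = P_next
    S = R_nab + gamma * (B1.T @ P @ B1 + D1.T @ P @ D1)
    L = gamma * (B1.T @ P @ A1 + D1.T @ P @ C1)
    return P, -np.linalg.solve(S, L)      # P and the optimal gain K* of (3.1)
```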
From Lemma 2, the SLQ optimal tracking problem can be handled by solving the augmented SAE (3.3). However, solving the augmented SAE (3.3) requires an accurate system model, so this method is not feasible when the dynamics are unknown.
To solve the model-free SLQ optimal tracking problem with delays, we give the definition of the Q-function and the corresponding matrix $H$.
Based on (2.10) and the Bellman optimality principle, we know that the optimal cost function satisfies the Hamilton-Jacobi-Bellman (HJB) equation
$$
V(G_k)=\min_{u_k}\left\{E\left[G_k^{T}O_1G_k+u_k^{T}R_{\nabla}u_k\right]+\gamma V(G_{k+1})\right\}. \tag{3.11}
$$
The Q-function is defined as
$$
Q(G_k,u_k)=E\left[G_k^{T}O_1G_k+u_k^{T}R_{\nabla}u_k\right]+\gamma V(G_{k+1}). \tag{3.12}
$$
According to Lemma 1, V(Gk+1) can be written as
$$
\begin{aligned}
V(G_{k+1})&=E\left(G_{k+1}^{T}PG_{k+1}\right)=E\left\{(TG_k+B_0u_k)^{T}P(TG_k+B_0u_k)\right\}\\
&=E\left\{\left[(A_1G_k+C_1\omega_kG_k)+(B_1u_k+D_1\omega_ku_k)\right]^{T}P\left[(A_1G_k+C_1\omega_kG_k)+(B_1u_k+D_1\omega_ku_k)\right]\right\}.
\end{aligned} \tag{3.13}
$$
Substituting (3.13) into (3.12), we get
$$
Q(G_k,u_k)=E\left\{\begin{bmatrix}G_k\\ u_k\end{bmatrix}^{T}\begin{bmatrix}H_{GG} & H_{Gu}\\ H_{uG} & H_{uu}\end{bmatrix}\begin{bmatrix}G_k\\ u_k\end{bmatrix}\right\}=E\left\{\begin{bmatrix}G_k\\ u_k\end{bmatrix}^{T}H\begin{bmatrix}G_k\\ u_k\end{bmatrix}\right\} \tag{3.14}
$$
where $H=H^{T}\in\mathbb{R}^{(n+q+m)\times(n+q+m)}$,
$$
H=\begin{bmatrix}H_{GG} & H_{Gu}\\ H_{uG} & H_{uu}\end{bmatrix}=\begin{bmatrix}O_1+\gamma A_1^{T}PA_1+\gamma C_1^{T}PC_1 & \gamma A_1^{T}PB_1+\gamma C_1^{T}PD_1\\ \gamma B_1^{T}PA_1+\gamma D_1^{T}PC_1 & \gamma B_1^{T}PB_1+\gamma D_1^{T}PD_1+R_{\nabla}\end{bmatrix}. \tag{3.15}
$$
Letting $\frac{\partial Q(G_k,u_k)}{\partial u_k}=0$, the optimal control can be obtained as follows
$$
u_k^{*}=-H_{uu}^{-1}H_{uG}G_k. \tag{3.16}
$$
From Lemma 1 and (3.15), we obtain the relationship between the matrices $P$ and $H$:
$$
P=\begin{bmatrix}I & K^{T}\end{bmatrix}H\begin{bmatrix}I & K^{T}\end{bmatrix}^{T}. \tag{3.17}
$$
As can be seen from (3.16), the optimal control depends only on the matrix $H$ and is completely free of the system parameters. Next, we present the Q-learning iterative algorithm for estimating the matrix $H$.
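The relationships (3.15)-(3.17) translate directly into code; the helper functions below are our own illustration (the model-based form of $H$ is shown only for checking, since the algorithm below estimates $H$ from data).

```python
import numpy as np

def H_from_P(P, A1, B1, C1, D1, O1, R_nab, gamma):
    """Assemble the Q-function kernel H of (3.15) from P (model-based check)."""
    H_GG = O1 + gamma * (A1.T @ P @ A1 + C1.T @ P @ C1)
    H_Gu = gamma * (A1.T @ P @ B1 + C1.T @ P @ D1)
    H_uu = gamma * (B1.T @ P @ B1 + D1.T @ P @ D1) + R_nab
    return np.block([[H_GG, H_Gu], [H_Gu.T, H_uu]])

def gain_from_H(H, nq):
    """Optimal gain (3.16): K = -H_uu^{-1} H_uG, with nq = n + q."""
    return -np.linalg.solve(H[nq:, nq:], H[nq:, :nq])

def P_from_H(H, K, nq):
    """Recover P from H and K via (3.17): P = [I  K^T] H [I  K^T]^T."""
    IK = np.hstack([np.eye(nq), K.T])
    return IK @ H @ IK.T
```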
In this section, we propose the Q-learning iterative algorithm based on VI. The method starts with the initial value $Q_0(G_k,u_k)=0$ and an initial admissible control $u_0(G_k)$; $Q_1(G_k,u_k)$ is updated from the initial value and the initial control as follows
$$
Q_1(G_k,u_k)=E\left[G_k^{T}O_1G_k+u_0^{T}(G_k)R_{\nabla}u_0(G_k)\right]+\gamma Q_0\left(G_{k+1},u_0(G_{k+1})\right). \tag{3.18}
$$
The control is updated as follows
$$
u_1(G_k)=\arg\min_{u(G_k)}Q_1(G_k,u_k) \tag{3.19}
$$
For $i\ge 1$, the Q-learning algorithm iterates between
$$
Q_{i+1}(G_k,u_k)=E\left[G_k^{T}O_1G_k+u_i^{T}(G_k)R_{\nabla}u_i(G_k)\right]+\gamma Q_i\left(G_{k+1},u_i(G_{k+1})\right) \tag{3.20}
$$
and
$$
u_{i+1}(G_k)=\arg\min_{u_k}\left\{E\left[G_k^{T}O_1G_k+u_k^{T}R_{\nabla}u_k\right]+\gamma\min_{u_{k+1}}Q_i(G_{k+1},u_{k+1})\right\} \tag{3.21}
$$
where $i$ is the iteration index and $k$ is the time index.
According to (3.14), the Q function can be rewritten as
$$
\begin{aligned}
Q_{i+1}(G_k,u_k)&=E\left\{\begin{bmatrix}G_k^{T} & u_i^{T}(G_k)\end{bmatrix}H_{i+1}\begin{bmatrix}G_k^{T} & u_i^{T}(G_k)\end{bmatrix}^{T}\right\}\\
&=E\left\{\begin{bmatrix}G_k^{T} & u_i^{T}(G_k)\end{bmatrix}\begin{bmatrix}O_1 & 0\\ 0 & R_{\nabla}\end{bmatrix}\begin{bmatrix}G_k^{T} & u_i^{T}(G_k)\end{bmatrix}^{T}+\gamma\begin{bmatrix}G_{k+1}^{T} & u_i^{T}(G_{k+1})\end{bmatrix}H_i\begin{bmatrix}G_{k+1}^{T} & u_i^{T}(G_{k+1})\end{bmatrix}^{T}\right\}
\end{aligned} \tag{3.22}
$$
and we can obtain the optimal controller
$$
u_i(G_k)=-H_{uu,i}^{-1}H_{uG,i}G_k. \tag{3.23}
$$
According to (3.17), we can get
$$
P_i=\begin{bmatrix}I & K_i^{T}\end{bmatrix}H_i\begin{bmatrix}I & K_i^{T}\end{bmatrix}^{T}. \tag{3.24}
$$
Before proving the convergence of the Q-learning algorithm, we first give the following two lemmas.
Lemma 3. The Q-learning algorithm (3.22) and (3.23) is equivalent to
$$
P_{i+1}=O_1+\gamma\left(A_1^{T}P_iA_1+C_1^{T}P_iC_1\right)-\gamma\left(A_1^{T}P_iB_1+C_1^{T}P_iD_1\right)\left(R_{\nabla}+\gamma B_1^{T}P_iB_1+\gamma D_1^{T}P_iD_1\right)^{-1}\gamma\left(B_1^{T}P_iA_1+D_1^{T}P_iC_1\right). \tag{3.25}
$$
Proof. According to (2.11), the last term of (3.22) can be written as
$$
\begin{aligned}
&E\left\{\begin{bmatrix}G_{k+1}^{T} & u_i^{T}(G_{k+1})\end{bmatrix}H_i\begin{bmatrix}G_{k+1}^{T} & u_i^{T}(G_{k+1})\end{bmatrix}^{T}\right\}
=E\left\{G_{k+1}^{T}\begin{bmatrix}I & K_i^{T}\end{bmatrix}H_i\begin{bmatrix}I & K_i^{T}\end{bmatrix}^{T}G_{k+1}\right\}\\
&=E\left\{\left[(A_1G_k+C_1\omega_kG_k)+(B_1u_i(G_k)+D_1\omega_ku_i(G_k))\right]^{T}\begin{bmatrix}I & K_i^{T}\end{bmatrix}H_i\begin{bmatrix}I & K_i^{T}\end{bmatrix}^{T}\left[(A_1G_k+C_1\omega_kG_k)+(B_1u_i(G_k)+D_1\omega_ku_i(G_k))\right]\right\}\\
&=E\left\{\begin{bmatrix}G_k^{T} & u_i^{T}(G_k)\end{bmatrix}\begin{bmatrix}A_1 & B_1\end{bmatrix}^{T}\begin{bmatrix}I & K_i^{T}\end{bmatrix}H_i\begin{bmatrix}I & K_i^{T}\end{bmatrix}^{T}\begin{bmatrix}A_1 & B_1\end{bmatrix}\begin{bmatrix}G_k^{T} & u_i^{T}(G_k)\end{bmatrix}^{T}\right.\\
&\qquad\left.+\begin{bmatrix}G_k^{T} & u_i^{T}(G_k)\end{bmatrix}\begin{bmatrix}C_1 & D_1\end{bmatrix}^{T}\begin{bmatrix}I & K_i^{T}\end{bmatrix}H_i\begin{bmatrix}I & K_i^{T}\end{bmatrix}^{T}\begin{bmatrix}C_1 & D_1\end{bmatrix}\begin{bmatrix}G_k^{T} & u_i^{T}(G_k)\end{bmatrix}^{T}\right\}.
\end{aligned} \tag{3.26}
$$
Substituting (3.26) into (3.22) and using (3.24), we get
$$
H_{i+1}=\begin{bmatrix}O_1 & 0\\ 0 & R_{\nabla}\end{bmatrix}+\begin{bmatrix}\gamma A_1^{T}P_iA_1 & \gamma A_1^{T}P_iB_1\\ \gamma B_1^{T}P_iA_1 & \gamma B_1^{T}P_iB_1\end{bmatrix}+\begin{bmatrix}\gamma C_1^{T}P_iC_1 & \gamma C_1^{T}P_iD_1\\ \gamma D_1^{T}P_iC_1 & \gamma D_1^{T}P_iD_1\end{bmatrix}. \tag{3.27}
$$
Based on (3.24), we have
$$
P_{i+1}=\begin{bmatrix}I & K_{i+1}^{T}\end{bmatrix}H_{i+1}\begin{bmatrix}I & K_{i+1}^{T}\end{bmatrix}^{T}. \tag{3.28}
$$
Substituting (3.27) into (3.28), we get
$$
P_{i+1}=O_1+\gamma\left(A_1^{T}P_iA_1+C_1^{T}P_iC_1\right)-\gamma\left(A_1^{T}P_iB_1+C_1^{T}P_iD_1\right)\left(R_{\nabla}+\gamma B_1^{T}P_iB_1+\gamma D_1^{T}P_iD_1\right)^{-1}\gamma\left(B_1^{T}P_iA_1+D_1^{T}P_iC_1\right) \tag{3.29}
$$
where $R_{\nabla}+\gamma B_1^{T}P_iB_1+\gamma D_1^{T}P_iD_1>0$.
Lemma 4 ([27]). The value iteration algorithm that iterates between
$$
V_{i+1}(G_k)=E\left(G_k^{T}(O_1+K_i^{T}R_{\nabla}K_i)G_k\right)+\gamma V_i(G_{k+1}) \tag{3.30}
$$
and
$$
K_{i+1}=\arg\min_{K}\left\{E\left(G_k^{T}(O_1+K^{T}R_{\nabla}K)G_k\right)+\gamma V_i(G_{k+1})\right\} \tag{3.31}
$$
is convergent, and
$$
\lim_{i\to\infty}V_i(G_k)=V(G_k)=E\left(G_k^{T}PG_k\right)=\operatorname{tr}\{PM_k\},
$$
$$
\lim_{i\to\infty}K_i=K^{*}=-\left(R_{\nabla}+\gamma B_1^{T}PB_1+\gamma D_1^{T}PD_1\right)^{-1}\gamma\left(B_1^{T}PA_1+D_1^{T}PC_1\right)
$$
where the matrix $P$ satisfies the augmented SAE (3.3).
Theorem 3.1. Assume that system (2.9) is mean-square stabilizable. Then the matrix sequence $\{H_i\}$ calculated by the Q-learning algorithm (3.22) converges to the matrix $H$, and the matrix sequence $\{P_i\}$ calculated by (3.24) converges to the solution $P$ of the augmented SAE (3.3).
Proof. According to Lemma 4, (3.30) can be rewritten as
$$
\begin{aligned}
V_{i+1}(G_k)&=E\left(G_k^{T}P_{i+1}G_k\right)=E\left[G_k^{T}(O_1+K_i^{T}R_{\nabla}K_i)G_k\right]+\gamma E\left(G_{k+1}^{T}P_iG_{k+1}\right)\\
&=E\Big\{G_k^{T}(O_1+K_i^{T}R_{\nabla}K_i)G_k+\gamma\left[(A_1+B_1K_i)G_k+(C_1+D_1K_i)\omega_kG_k\right]^{T}P_i\left[(A_1+B_1K_i)G_k+(C_1+D_1K_i)\omega_kG_k\right]\Big\}\\
&=E\left(G_k^{T}\left[\gamma(A_1+B_1K_i)^{T}P_i(A_1+B_1K_i)+\gamma(C_1+D_1K_i)^{T}P_i(C_1+D_1K_i)+O_1+K_i^{T}R_{\nabla}K_i\right]G_k\right).
\end{aligned} \tag{3.32}
$$
We can update the control gain matrix by (3.31) as follows
$$
K_i=-\left(R_{\nabla}+\gamma B_1^{T}P_iB_1+\gamma D_1^{T}P_iD_1\right)^{-1}\gamma\left(B_1^{T}P_iA_1+D_1^{T}P_iC_1\right). \tag{3.33}
$$
Substituting (3.33) into (3.32), we can get
$$
P_{i+1}=O_1+\gamma\left(A_1^{T}P_iA_1+C_1^{T}P_iC_1\right)-\gamma\left(A_1^{T}P_iB_1+C_1^{T}P_iD_1\right)\left(R_{\nabla}+\gamma B_1^{T}P_iB_1+\gamma D_1^{T}P_iD_1\right)^{-1}\gamma\left(B_1^{T}P_iA_1+D_1^{T}P_iC_1\right). \tag{3.34}
$$
According to Lemmas 3 and 4, we can conclude that $\lim_{i\to\infty}P_i=P$. When $i\to\infty$, the matrix $P$ satisfies
$$
P=O_1+\gamma\left(A_1^{T}PA_1+C_1^{T}PC_1\right)-\gamma\left(A_1^{T}PB_1+C_1^{T}PD_1\right)\left(R_{\nabla}+\gamma B_1^{T}PB_1+\gamma D_1^{T}PD_1\right)^{-1}\gamma\left(B_1^{T}PA_1+D_1^{T}PC_1\right). \tag{3.35}
$$
Based on (3.27), we know that $\lim_{i\to\infty}H_i=H$, where
$$
H=\begin{bmatrix}O_1+\gamma A_1^{T}PA_1+\gamma C_1^{T}PC_1 & \gamma A_1^{T}PB_1+\gamma C_1^{T}PD_1\\ \gamma B_1^{T}PA_1+\gamma D_1^{T}PC_1 & \gamma B_1^{T}PB_1+\gamma D_1^{T}PD_1+R_{\nabla}\end{bmatrix}. \tag{3.36}
$$
So the Q-learning algorithm converges.
Due to the stochastic disturbance, the output trajectory of the system is uncertain and the cost function involves expectations, so the algorithm above cannot be implemented online directly. Therefore, it is necessary to transform the stochastic Q-learning algorithm into a deterministic Q-learning algorithm. In this section, we give the implementation steps of the deterministic Q-learning algorithm. The flow chart of the Q-learning algorithm is shown in Figure 1.
According to Eq (2.11), the left side of (3.22) can be simplified to
$$
E\left\{\begin{bmatrix}G_k^{T} & u_i^{T}(G_k)\end{bmatrix}H_{i+1}\begin{bmatrix}G_k^{T} & u_i^{T}(G_k)\end{bmatrix}^{T}\right\}
=E\left\{G_k^{T}\begin{bmatrix}I & K_i^{T}\end{bmatrix}H_{i+1}\begin{bmatrix}I & K_i^{T}\end{bmatrix}^{T}G_k\right\}
=\operatorname{tr}\left\{\begin{bmatrix}I & K_i^{T}\end{bmatrix}H_{i+1}\begin{bmatrix}I & K_i^{T}\end{bmatrix}^{T}M_k\right\}. \tag{4.1}
$$
The right side of (3.22) can be simplified as
$$
\begin{aligned}
&E\left\{G_k^{T}\begin{bmatrix}I & K_i^{T}\end{bmatrix}\begin{bmatrix}O_1 & 0\\ 0 & R_{\nabla}\end{bmatrix}\begin{bmatrix}I & K_i^{T}\end{bmatrix}^{T}G_k+\gamma G_{k+1}^{T}\begin{bmatrix}I & K_i^{T}\end{bmatrix}H_i\begin{bmatrix}I & K_i^{T}\end{bmatrix}^{T}G_{k+1}\right\}\\
&=\operatorname{tr}\left\{\begin{bmatrix}I & K_i^{T}\end{bmatrix}\begin{bmatrix}O_1 & 0\\ 0 & R_{\nabla}\end{bmatrix}\begin{bmatrix}I & K_i^{T}\end{bmatrix}^{T}M_k+\gamma\begin{bmatrix}I & K_i^{T}\end{bmatrix}H_i\begin{bmatrix}I & K_i^{T}\end{bmatrix}^{T}M_{k+1}\right\}.
\end{aligned} \tag{4.2}
$$
For simplicity, let
$$
L_i(H_i)=\begin{bmatrix}I & K_i^{T}\end{bmatrix}H_i\begin{bmatrix}I & K_i^{T}\end{bmatrix}^{T},\qquad i=1,2,3,\cdots. \tag{4.3}
$$
Then (3.22) can be simplified as
$$
\operatorname{tr}\left\{L_i(H_{i+1})M_k\right\}=\operatorname{tr}\left\{L_i\!\left(\begin{bmatrix}O_1 & 0\\ 0 & R_{\nabla}\end{bmatrix}\right)M_k+\gamma L_i(H_i)M_{k+1}\right\}. \tag{4.4}
$$
The Q-learning iterative algorithm consisting of (4.4) and (3.23) relies only on the deterministic state $M_k$ of system (2.16) and the iterated control gain matrix $K_i$, avoiding the constraints of the system parameters and the stochastic disturbance.
Remark 2. The Q-learning algorithm based on VI is performed online and solves (4.4) using least squares (LS) without knowledge of the augmented system dynamics. In fact, (4.4) is a scalar equation and $H$ is a symmetric $(n+q+m)\times(n+q+m)$ matrix with $(n+q+m)(n+q+m+1)/2$ independent elements. Therefore, at least $(n+q+m)(n+q+m+1)/2$ data tuples are required before (4.4) can be solved using LS.
Remark 3. The Q-learning algorithm based on VI requires a persistent excitation (PE) condition [28] to ensure sufficient exploration of the state space.
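Remark 2 describes solving (4.4) by least squares. The following sketch shows one concrete way to set up such a regression; instead of the deterministic moments $M_k$, it regresses directly on sampled tuples $(G_k,u_k,G_{k+1})$ gathered under an exploring input, in the spirit of the VI update (3.20)/(3.22). This sample-based variant, the data-collection callback, the helper names, and the iteration count are all our own assumptions, not the authors' exact implementation.

```python
import numpy as np

def svec(S):
    """Vectorize a symmetric matrix: diagonal entries once, off-diagonals doubled,
    so that svec(S) @ h equals tr(S H) when h parameterizes the symmetric H."""
    iu = np.triu_indices(S.shape[0])
    return np.where(iu[0] == iu[1], 1.0, 2.0) * S[iu]

def smat(h, p):
    """Rebuild the symmetric matrix H from the parameter vector used above."""
    H = np.zeros((p, p))
    H[np.triu_indices(p)] = h
    return H + H.T - np.diag(np.diag(H))

def q_learning_vi(collect_data, O1, R_nab, gamma, m, n_iters=30):
    """Sample-based VI Q-learning sketch for the update (3.20)/(3.22).

    collect_data(K): user-supplied routine that runs the augmented system under
        u_k = K G_k + exploration noise (cf. the PE condition in Remark 3) and
        returns a list of tuples (G_k, u_k, G_{k+1}).
    """
    nq = O1.shape[0]
    p = nq + m
    Qbar = np.block([[O1, np.zeros((nq, m))], [np.zeros((m, nq)), R_nab]])
    H = np.zeros((p, p))
    K = np.zeros((m, nq))
    for _ in range(n_iters):
        rows, rhs = [], []
        for G, u, G_next in collect_data(K):
            z = np.concatenate([G, u])                     # [G_k; u_k]
            z_next = np.concatenate([G_next, K @ G_next])  # [G_{k+1}; u_i(G_{k+1})]
            rows.append(svec(np.outer(z, z)))              # regressor for z^T H_{i+1} z
            rhs.append(z @ Qbar @ z + gamma * z_next @ H @ z_next)
        h, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
        H = smat(h, p)                                     # LS estimate of H_{i+1}
        K = -np.linalg.solve(H[nq:, nq:], H[nq:, :nq])     # control update (3.23)
    return H, K
```

As noted in Remark 2, the regression needs at least $(n+q+m)(n+q+m+1)/2$ informative rows, which is where the PE condition of Remark 3 enters.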
In this section, a simulation example is given to illustrate the effectiveness of Q-learning algorithm. Consider the following stochastic linear system with delays
$$
\begin{aligned}
x_{k+1}&=Ax_k+A_dx_{k-d}+Bu_k+B_du_{k-d}+\left(Cx_k+C_dx_{k-d}+Du_k+D_du_{k-d}\right)\omega_k,\\
y_k&=Ex_k+E_dx_{k-d}
\end{aligned}
$$
in which
$$
A=\begin{bmatrix}0.2 & -0.8\\ 0.5 & -0.7\end{bmatrix},\quad
A_d=\begin{bmatrix}0.2 & -0.2\\ 0.1 & 0.15\end{bmatrix},\quad
B=\begin{bmatrix}0.03\\ -0.5\end{bmatrix},\quad
B_d=\begin{bmatrix}0.3\\ -0.2\end{bmatrix},\quad
C=\begin{bmatrix}-0.04 & 0.4\\ -0.3 & 0.13\end{bmatrix},
$$
$$
C_d=\begin{bmatrix}0.2 & -0.1\\ 0.2 & 0.11\end{bmatrix},\quad
D=\begin{bmatrix}0.05\\ -0.3\end{bmatrix},\quad
D_d=\begin{bmatrix}0.1\\ 0.1\end{bmatrix},\quad
E=\begin{bmatrix}3 & 3\end{bmatrix},\quad
E_d=\begin{bmatrix}0.1 & 0.12\end{bmatrix}.
$$
Suppose the reference trajectory is as follows
$$
r_{k+1}=-r_k
$$
where $r_0=1$.
The cost function is taken as (2.5) with $R=1$, $R_d=1$, $O=10$, and delay index $d=1$. The initial state of the augmented system (2.9) is chosen as $G_0=[10\ \ -10\ \ 1]^{T}$, and the initial control gain matrix is selected as $K=[0\ \ 0\ \ 0]$. In each iteration of the algorithm, 21 samples are collected to update the control gain matrix $K$.
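For convenience, the example data can be entered as follows. This is just a transcription of the matrices above into code; the discount factor value is not stated in this section, so the one used below is a placeholder of ours.

```python
import numpy as np

# System and command-generator data of the simulation example.
A  = np.array([[0.2, -0.8], [0.5, -0.7]]);   Ad = np.array([[0.2, -0.2], [0.1, 0.15]])
B  = np.array([[0.03], [-0.5]]);             Bd = np.array([[0.3], [-0.2]])
C  = np.array([[-0.04, 0.4], [-0.3, 0.13]]); Cd = np.array([[0.2, -0.1], [0.2, 0.11]])
D  = np.array([[0.05], [-0.3]]);             Dd = np.array([[0.1], [0.1]])
E  = np.array([[3.0, 3.0]]);                 Ed = np.array([[0.1, 0.12]])
F  = np.array([[-1.0]])                      # r_{k+1} = -r_k

R, Rd, O, d = np.array([[1.0]]), np.array([[1.0]]), np.array([[10.0]]), 1
gamma = 0.95                                 # placeholder: value not given in this section
G0 = np.array([10.0, -10.0, 1.0])            # initial augmented state
K0 = np.zeros((1, 3))                        # initial control gain
M0 = np.outer(G0, G0)                        # M_0 = E(G_0 G_0^T) for the deterministic model
```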
To verify the effectiveness of the iterative Q-learning algorithm, we compare $K$ with the optimal gain $K^{*}$ obtained from (3.1) and the SAE (3.3). Figure 2 shows that the control gain matrix $K$ converges to the optimal control gain matrix $K^{*}$ as the number of iterations increases. Figure 3 shows the convergence of $H$ to its optimal value $H^{*}$, which can be calculated by (3.15). The goal of the optimal tracking problem is to track the reference signal trajectory. As shown in Figure 4, the expectation of the system output $E(y_k)$ tracks the reference trajectory $r_k$, which further demonstrates the effectiveness of the proposed Q-learning algorithm.
For the model-free SLQ optimal tracking problem with delays, a Q-learning algorithm based on VI is proposed in this paper. The method makes full use of online system information to approximate the optimal control and does not require knowledge of the system parameters. In the iterative process of the algorithm, the matrix sequence $\{H_i\}$ and the control gain sequence $\{K_i\}$ are guaranteed to approach their optimal values. Finally, the simulation results show that the system output can track the reference trajectory effectively.
The authors declare that they have no conflicts of interest.
[1] H. Modares, F. L. Lewis, Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning, Automatica, 50 (2014), 1780–1792. https://doi.org/10.1016/j.automatica.2014.05.011
[2] B. Zhao, Y. Li, Model-free adaptive dynamic programming based near-optimal decentralized tracking control of reconfigurable manipulators, Int. J. Control Autom. Syst., 16 (2018), 478–490. https://doi.org/10.1007/s12555-016-0711-5
[3] T. Huang, D. Liu, A self-learning scheme for residential energy system control and management, Neural Comput. Appl., 22 (2013), 259–269. https://doi.org/10.1007/s00521-011-0711-6
[4] M. Gluzman, J. G. Scott, A. Vladimirsky, Optimizing adaptive cancer therapy: dynamic programming and evolutionary game theory, Proc. Royal Soc. B: Biol. Sci., 287 (2020), 20192454. https://doi.org/10.1098/rspb.2019.2454
[5] I. Ha, E. Gilbert, Robust tracking in nonlinear systems, IEEE Trans. Automat. Control, 32 (1987), 763–771. https://doi.org/10.1109/TAC.1987.1104710
[6] M. A. Rami, X. Y. Zhou, Linear matrix inequalities, Riccati equations and indefinite stochastic linear quadratic controls, IEEE Trans. Automat. Control, 45 (2000), 1131–1143. https://doi.org/10.1109/9.863597
[7] R. Byers, Solving the algebraic Riccati equation with the matrix sign function, Linear Algebra Appl., 89 (1987), 267–279. https://doi.org/10.1016/0024-3795(87)90222-9
[8] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, F. L. Lewis, Adaptive optimal control for continuous-time linear systems based on policy iteration, Automatica, 45 (2009), 477–484. https://doi.org/10.1016/j.automatica.2008.08.017
[9] B. Kiumarsi, F. L. Lewis, M. B. Naghibi-Sistani, A. Karimpour, Optimal tracking control of unknown discrete-time linear systems using input-output measured data, IEEE Trans. Cybern., 45 (2015), 2770–2779. https://doi.org/10.1109/TCYB.2014.2384016
[10] B. Kiumarsi, F. L. Lewis, H. Modares, A. Karimpour, M. B. Naghibi-Sistani, Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics, Automatica, 50 (2014), 1167–1175. https://doi.org/10.1016/j.automatica.2014.02.015
[11] G. Wang, H. Zhang, Model-free value iteration algorithm for continuous-time stochastic linear quadratic optimal control problems, arXiv, 2022. https://doi.org/10.48550/arXiv.2203.06547
[12] H. Zhang, Adaptive dynamic programming-based algorithm for infinite-horizon linear quadratic stochastic optimal control problems, arXiv, 2022. https://doi.org/10.48550/arXiv.2210.04486
[13] R. Liu, Y. Li, X. Liu, Linear-quadratic optimal control for unknown mean-field stochastic discrete-time system via adaptive dynamic programming approach, Neurocomputing, 282 (2018), 16–24. https://doi.org/10.1016/j.neucom.2017.12.007
[14] X. Chen, F. Wang, Neural-network-based stochastic linear quadratic optimal tracking control scheme for unknown discrete-time systems using adaptive dynamic programming, Control Theory Technol., 19 (2021), 315–327. https://doi.org/10.1007/s11768-021-00046-y
[15] Z. Zhang, X. Zhao, Stochastic linear quadratic optimal tracking control for stochastic discrete time systems based on Q-learning, J. Nanjing Univ. Inf. Sci. Technol. (Nat. Sci.), 13 (2021), 548–555.
[16] Y. Liu, H. Zhang, Y. Luo, J. Han, ADP based optimal tracking control for a class of linear discrete-time system with multiple delays, J. Franklin Inst., 353 (2016), 2117–2136. https://doi.org/10.1016/j.jfranklin.2016.03.012
[17] B. L. Zhang, Q. L. Han, X. M. Zhang, X. Yu, Sliding mode control with mixed current and delayed states for offshore steel jacket platforms, IEEE Trans. Control Syst. Technol., 22 (2014), 1769–1783. https://doi.org/10.1109/TCST.2013.2293401
[18] M. J. Park, O. M. Kwon, J. H. Ryu, Advanced stability criteria for linear systems with time-varying delays, J. Franklin Inst., 355 (2018), 520–543. https://doi.org/10.1016/j.jfranklin.2017.11.029
[19] H. Zhang, Y. Luo, D. Liu, Neural-network-based near-optimal control for a class of discrete-time affine nonlinear systems with control constraints, IEEE Trans. Neural Networks, 20 (2009), 1490–1503. https://doi.org/10.1109/TNN.2009.2027233
[20] H. Zhang, Z. Wang, D. Liu, Global asymptotic stability of recurrent neural networks with multiple time-varying delays, IEEE Trans. Neural Networks, 19 (2008), 855–873. https://doi.org/10.1109/TNN.2007.912319
[21] T. Wang, H. Zhang, Y. Luo, Infinite-time stochastic linear quadratic optimal control for unknown discrete-time systems using adaptive dynamic programming approach, Neurocomputing, 171 (2016), 379–386. https://doi.org/10.1016/j.neucom.2015.06.053
[22] A. Garate-Garcia, L. A. Marquez-Martinez, C. H. Moog, Equivalence of linear time-delay systems, IEEE Trans. Automat. Control, 56 (2011), 666–670. https://doi.org/10.1109/TAC.2010.2095550
[23] Y. Liu, R. Yu, Model-free optimal tracking control for discrete-time system with delays using reinforcement Q-learning, Electron. Lett., 54 (2018), 750–752. https://doi.org/10.1049/el.2017.3238
[24] H. Zhang, Q. Wei, Y. Luo, A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm, IEEE Trans. Syst. Man Cybern. B, 38 (2008), 937–942. https://doi.org/10.1109/TSMCB.2008.920269
[25] J. Shi, D. Yue, X. Xie, Adaptive optimal tracking control for nonlinear continuous-time systems with time delay using value iteration algorithm, Neurocomputing, 396 (2020), 172–178. https://doi.org/10.1016/j.neucom.2018.07.098
[26] Q. Wei, D. Liu, Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification, IEEE Trans. Automat. Sci. Eng., 11 (2014), 1020–1036. https://doi.org/10.1109/TASE.2013.2284545
[27] T. Wang, H. Zhang, Y. Luo, Stochastic linear quadratic optimal control for model-free discrete-time systems based on Q-learning algorithm, Neurocomputing, 312 (2018), 1–8. https://doi.org/10.1016/j.neucom.2018.04.018
[28] F. L. Lewis, D. Vrabie, Reinforcement learning and adaptive dynamic programming for feedback control, IEEE Circuits Syst. Mag., 9 (2009), 32–50. https://doi.org/10.1109/MCAS.2009.933854