
In this paper, we consider a delayed stage-structured predator-prey model incorporating a prey refuge with Holling type II functional response. It is assumed that the prey can live in two different regions: one is the prey refuge and the other is the predatory region. Moreover, for real-world applications, a stage structure should be taken into account. The predator population is divided into two stages, mature predators and immature predators, where the immature predators have no ability to attack prey. Based on Mawhin's coincidence degree and novel estimation techniques for a priori bounds of unknown solutions to $Lu=\lambda Nu$, some sufficient conditions for the existence of a periodic solution are obtained. Finally, an example demonstrates the validity of our main results.
Citation: Weijie Lu, Yonghui Xia, Yuzhen Bai. Periodic solution of a stage-structured predator-prey model incorporating prey refuge[J]. Mathematical Biosciences and Engineering, 2020, 17(4): 3160-3174. doi: 10.3934/mbe.2020179
It is well known that the optimal tracking control (OTC) problem plays an important role in the field of optimal control and is developing rapidly in applications [1,2,3,4]. The goal of the OTC problem is to design a controller that makes the output of the system track a reference trajectory while minimizing a cost function. Traditionally, the OTC problem is solved by feedback linearization [5] or object inversion [6], which usually requires complex mathematical analysis. As for the linear quadratic tracking (LQT) problem, the traditional approach is to solve an algebraic Riccati equation (ARE) together with a noncausal difference equation. However, these methods require an accurate system model [7]. In practical situations, the system parameters are partially or completely unknown, so the traditional methods cannot be applied.
The key to the OTC problem is to solve the Hamilton-Jacobi-Bellman (HJB) equation. However, the HJB equation involves solving difference or differential equations, which is difficult in general. Although dynamic programming has long been an effective method for solving the HJB equation, it is not computationally feasible in high dimensions because of "the curse of dimensionality". To approximate the solution of the HJB equation, adaptive dynamic programming (ADP) algorithms have been widely developed and applied. In [8], a policy iteration (PI) scheme was adopted to approximate the optimal control for partially unknown continuous-time systems. In [9], B. Kiumarsi solved the LQT problem online by measuring only the input, output, and reference trajectory data of the system. In [10], a Q-learning method was proposed to calculate the optimal control, relying only on measured system data and the command generator.
In recent years, stochastic control theory has become a focus of optimal control theory because of its academic difficulty and wide applications; in particular, the model-free stochastic linear quadratic (SLQ) optimal tracking problem has attracted more and more attention [11,12,13,14,15]. In [14], an ADP algorithm based on neural networks was proposed to solve the model-free SLQ optimal tracking control problem. In addition, a Q-learning algorithm was used to solve the model-free SLQ optimal tracking control problem in [15]. To the best of our knowledge, there are many results on the model-free SLQ optimal tracking problem based on ADP, but the SLQ optimal tracking problem with delays has received little attention. Time delay [16] is an important factor that cannot be ignored; it exists in many practical systems, such as industrial processes, power grids, and chemical reactions [17,18,19,20]. However, the methods in [11,12,13,14,15] neglect the influence of time delay on the system. If the time delay is ignored, the control performance degrades and the closed-loop system may even diverge. The method proposed in [16] takes the time delay into account but ignores the influence of stochastic disturbances on the system. As far as we know, there is no research on the optimal tracking problem for stochastic linear systems with delays. Therefore, using an ADP algorithm to deal with the model-free SLQ optimal tracking control problem with delays has important practical significance. This is the motivation of this paper.
The main contributions of this paper include:
(1) For stochastic linear systems, this paper applies Q-learning to the model-free SLQ optimal tracking control problem with delays for the first time, which enhances the practicability of ADP algorithms in tracking problems.
(2) By introducing the delay operator, the influence of the delays on the subsequent algorithm can be effectively eliminated.
(3) In this paper, the Q-learning algorithm is used to solve the model-free SLQ optimal tracking control problem with delays. Compared with methods that need an accurate system model to obtain the optimal control, this method makes full use of online system state information and avoids solving the augmented stochastic algebraic equation (SAE).
The rest of this paper is organized as follows. In section 2, we give the problem formulation and conversion. In section 3, we derive the Q-learning algorithm and prove its convergence. In section 4, we give the implementation steps of the Q-learning algorithm. In section 5, a simulation example is given to verify the effectiveness of the algorithm. In section 6, the conclusion is given.
Consider the following linear stochastic system with delays
$$\begin{cases} x_{k+1}=Ax_k+A_d x_{k-d}+Bu_k+B_d u_{k-d}+\left(Cx_k+C_d x_{k-d}+Du_k+D_d u_{k-d}\right)\omega_k,\\ y_k=Ex_k+E_d x_{k-d} \end{cases}\tag{2.1}$$
where $x_k\in\mathbb{R}^n$ is the system state vector, $u_k\in\mathbb{R}^m$ is the control input vector, and $y_k\in\mathbb{R}^q$ is the system output, while $x_{k-d}$, $u_{k-d}$ and $y_{k-d}$ are the delayed variables with delay index $d\in\mathbb{N}$. $A\in\mathbb{R}^{n\times n}$, $B\in\mathbb{R}^{n\times m}$, $C\in\mathbb{R}^{n\times n}$, $D\in\mathbb{R}^{n\times m}$, $E\in\mathbb{R}^{q\times n}$ are given constant matrices, and $A_d\in\mathbb{R}^{n\times n}$, $B_d\in\mathbb{R}^{n\times m}$, $C_d\in\mathbb{R}^{n\times n}$, $D_d\in\mathbb{R}^{n\times m}$, $E_d\in\mathbb{R}^{q\times n}$ are the corresponding delay dynamics matrices. The one-dimensional stochastic disturbance sequence $\omega_k$ is defined on the given probability space $(\Omega,\mathcal{F},\mathcal{P},\mathcal{F}_k)$ and satisfies $E(\omega_k\mid\mathcal{F}_k)=0$, $E(\omega_k^2\mid\mathcal{F}_k)=1$. The initial state $x_0$ is independent of $\omega_k$.
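To make the model concrete, the following minimal Python sketch simulates one sample path of (2.1) with explicit delay buffers; the function name `simulate_delayed_system` and the zero pre-history for $k<0$ are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

def simulate_delayed_system(sys, d, u_fn, x0, steps, rng):
    """Simulate one sample path of the delayed stochastic system (2.1).
    sys = (A, Ad, B, Bd, C, Cd, D, Dd, E, Ed); states and inputs before
    k = 0 are taken as zero (an illustrative convention)."""
    A, Ad, B, Bd, C, Cd, D, Dd, E, Ed = sys
    x = {k: np.zeros_like(x0) for k in range(-d, 0)}
    x[0] = x0
    ys = []
    for k in range(steps):
        uk = u_fn(k)
        ukd = u_fn(k - d) if k - d >= 0 else np.zeros_like(uk)
        wk = rng.standard_normal()          # E(w_k) = 0, E(w_k^2) = 1
        drift = A @ x[k] + Ad @ x[k - d] + B @ uk + Bd @ ukd
        noise = C @ x[k] + Cd @ x[k - d] + D @ uk + Dd @ ukd
        x[k + 1] = drift + noise * wk       # one-step recursion of (2.1)
        ys.append(E @ x[k] + Ed @ x[k - d])
    return x, np.array(ys)
```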
Assume the reference trajectory of SLQ optimal tracking control is generated by a command generator
$$r_{k+1}=Fr_k\tag{2.2}$$
where $r_k\in\mathbb{R}^q$ represents the reference trajectory and $F$ is a constant matrix.
The tracking error can be expressed as
$$e_k=y_k-r_k\tag{2.3}$$
where $r_k$ is the reference trajectory.
The goal of the SLQ optimal tracking problem with delays is to design an optimal controller that not only ensures that the output of the target system tracks the reference trajectory stably, but also minimizes the cost function. The cost function is defined as
$$J(x_k,r_k,u_k)=E\sum_{i=k}^{\infty}U_i(x_i,x_{i-d},u_i)\tag{2.4}$$
where $U_i(x_i,x_{i-d},u_i)=(y_i-r_i)^TO(y_i-r_i)+u_i^TRu_i+u_{i-d}^TR_du_{i-d}$ is the utility function, and $O=O^T\in\mathbb{R}^{q\times q}\ge 0$, $R=R^T\in\mathbb{R}^{m\times m}\ge 0$, $R_d=R_d^T\in\mathbb{R}^{m\times m}\ge 0$ are constant weighting matrices.
The cost function (2.4) can be used only when $F$ is Hurwitz, that is, the reference trajectory system is required to be asymptotically stable. If the reference trajectory does not tend to zero with time, then the cost function (2.4) will be unbounded. In practice, this condition is difficult to satisfy. Therefore, a discount factor $\gamma$ is introduced into the cost function (2.4) to relax this restriction. Based on (2.4), the cost function with discount factor is redefined as
$$J(x_k,r_k,u_k)=E\sum_{i=k}^{\infty}\gamma^{i-k}U_i(x_i,x_{i-d},u_i)=E\sum_{i=k}^{\infty}\gamma^{i-k}\left[(y_i-r_i)^TO(y_i-r_i)+u_i^TRu_i+u_{i-d}^TR_du_{i-d}\right]\tag{2.5}$$
where $0<\gamma\le 1$ is the discount factor.
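To see why the discount factor restores boundedness, suppose for illustration that the utility terms are uniformly bounded by a constant $\bar U$; then for $0<\gamma<1$ the discounted cost is dominated by a convergent geometric series:

$$E\sum_{i=k}^{\infty}\gamma^{i-k}U_i(x_i,x_{i-d},u_i)\;\le\;\sum_{i=k}^{\infty}\gamma^{i-k}\,\bar U\;=\;\frac{\bar U}{1-\gamma}\;<\;\infty.$$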
Definition 1 ([21]). $u_k$ is called mean-square stabilizing at $e_0$ if there exists a linear feedback form of $u_k$ such that, for every initial state $e_0$, $\lim_{k\to\infty}E(e_k^Te_k)=0$. The system (2.3) with a mean-square stabilizing control $u_k$ is called mean-square stabilizable.
Definition 2 ([21]). $u_k$ is said to be admissible if $u_k$ satisfies the following: (1) $u_k$ is an $\mathcal{F}_k$-adapted and measurable stochastic process; (2) $u_k$ is mean-square stabilizing; (3) it enables the cost function to attain its minimum value.
The goal of this paper is to seek an admissible control, which not only minimizes the cost function (2.5) but also stabilizes the system (2.3) for each initial state e0. We denote the optimal cost function as follows
$$V(e_0)=\min_{u}J(e_0,u).\tag{2.6}$$
In order to achieve the above goal, this paper establishes an augmented system composed of system (2.1) and the reference trajectory system (2.2), and then transforms the optimal tracking problem into an optimal regulation problem.
The system (2.1) can be rewritten as the following equivalent form:
$$\begin{cases} x_{k+1}=\begin{bmatrix}A & A_d\end{bmatrix}\begin{bmatrix}x_k\\ x_{k-d}\end{bmatrix}+\begin{bmatrix}B & B_d\end{bmatrix}\begin{bmatrix}u_k\\ u_{k-d}\end{bmatrix}+\left(\begin{bmatrix}C & C_d\end{bmatrix}\begin{bmatrix}x_k\\ x_{k-d}\end{bmatrix}+\begin{bmatrix}D & D_d\end{bmatrix}\begin{bmatrix}u_k\\ u_{k-d}\end{bmatrix}\right)\omega_k,\\ y_k=\begin{bmatrix}E & E_d\end{bmatrix}\begin{bmatrix}x_k\\ x_{k-d}\end{bmatrix}. \end{cases}\tag{2.7}$$
According to [16,22,23], we define the delay operator $\nabla_d$ by $\nabla_d x_k=x_{k-d}$ and $(\nabla_d x_k)^T=x_{k-d}^T$. Then the system (2.7) can be expressed as
$$x_{k+1}=A_\nabla x_k+B_\nabla u_k+(C_\nabla x_k+D_\nabla u_k)\omega_k,\qquad y_k=E_\nabla x_k\tag{2.8}$$
where $A_\nabla=A+A_d\nabla_d$, $B_\nabla=B+B_d\nabla_d$, $C_\nabla=C+C_d\nabla_d$, $D_\nabla=D+D_d\nabla_d$, $E_\nabla=E+E_d\nabla_d$.
Based on the system (2.1) and the reference trajectory system (2.2), the augmented system can be defined as
$$G_{k+1}=\begin{bmatrix}x_{k+1}\\ r_{k+1}\end{bmatrix}=\begin{bmatrix}A_\nabla+C_\nabla\omega_k & 0\\ 0 & F\end{bmatrix}\begin{bmatrix}x_k\\ r_k\end{bmatrix}+\begin{bmatrix}B_\nabla+D_\nabla\omega_k\\ 0\end{bmatrix}u_k=TG_k+B_0u_k\tag{2.9}$$
where $G_k=\begin{bmatrix}x_k\\ r_k\end{bmatrix}\in\mathbb{R}^{n+q}$, $T\in\mathbb{R}^{(n+q)\times(n+q)}$, $B_0\in\mathbb{R}^{(n+q)\times m}$.
Based on the augmented system (2.9), the cost function (2.5) can be expressed as
$$J(G_k,u_k)=E\sum_{i=k}^{\infty}\gamma^{i-k}\left[G_i^TO_1G_i+u_i^TR_\nabla u_i\right]\tag{2.10}$$
where $O_1=\begin{bmatrix}E_\nabla & -I\end{bmatrix}^TO\begin{bmatrix}E_\nabla & -I\end{bmatrix}\in\mathbb{R}^{(n+q)\times(n+q)}$ (since $e_k=y_k-r_k=\begin{bmatrix}E_\nabla & -I\end{bmatrix}G_k$) and $R_\nabla=R+R_d\nabla_d$.
The state feedback linear controller is defined as
$$u_k=KG_k,\qquad K\in\mathbb{R}^{m\times(n+q)}\tag{2.11}$$
where K represents the control gain matrix of the system.
Substituting (2.11) into (2.10), the cost function (2.10) can be transformed into
$$J(G_k,K)=E\sum_{i=k}^{\infty}\gamma^{i-k}G_i^T\left[O_1+K^TR_\nabla K\right]G_i.\tag{2.12}$$
Therefore, the goal of the SLQ optimal tracking problem with delays can be further expressed as
$$V(G_0,K)=\min_{K}J(G_0,K).\tag{2.13}$$
Definition 3. The SLQ optimal control problem is well posed if
$$-\infty<V(G_0,K)<+\infty.$$
Before solving the SLQ control problem, we need to know whether it is well-posed. Therefore, we give the following lemma first.
Lemma 1. If there exists an admissible control $u_k=KG_k$, then the SLQ optimal tracking control problem is well-posed, and the cost function can be expressed as
$$J(G_k,K)=E(G_k^TPG_k)\tag{2.14}$$
where the matrix $P\in\mathbb{R}^{(n+q)\times(n+q)}$ satisfies the following augmented SAE
$$P=\gamma(A_1+B_1K)^TP(A_1+B_1K)+\gamma(C_1+D_1K)^TP(C_1+D_1K)+O_1+K^TR_\nabla K\tag{2.15}$$
where $A_1=\begin{bmatrix}A_\nabla & 0\\ 0 & F\end{bmatrix}\in\mathbb{R}^{(n+q)\times(n+q)}$, $B_1=\begin{bmatrix}B_\nabla\\ 0\end{bmatrix}\in\mathbb{R}^{(n+q)\times m}$, $C_1=\begin{bmatrix}C_\nabla & 0\\ 0 & 0\end{bmatrix}\in\mathbb{R}^{(n+q)\times(n+q)}$, $D_1=\begin{bmatrix}D_\nabla\\ 0\end{bmatrix}\in\mathbb{R}^{(n+q)\times m}$.
Proof. Assume that the control $u_k$ is admissible and the matrix $P$ satisfies (2.15). Then
$$\begin{aligned} E\sum_{i=k}^{\infty}\left[\gamma G_{i+1}^TPG_{i+1}-G_i^TPG_i\right] &=E\sum_{i=k}^{\infty}\Big\{\gamma\left[(A_1+B_1K)G_i+(C_1+D_1K)G_i\omega_i\right]^TP\left[(A_1+B_1K)G_i+(C_1+D_1K)G_i\omega_i\right]-G_i^TPG_i\Big\}\\ &=E\sum_{i=k}^{\infty}\Big\{G_i^T\left[\gamma(A_1+B_1K)^TP(A_1+B_1K)+\gamma(C_1+D_1K)^TP(C_1+D_1K)-P\right]G_i\Big\}. \end{aligned}$$
Based on (2.12) and (2.15), we have
$$\begin{aligned} J(G_k,K)&=E\sum_{i=k}^{\infty}\gamma^{i-k}G_i^T\left[O_1+K^TR_\nabla K\right]G_i\\ &=E\sum_{i=k}^{\infty}\gamma^{i-k}G_i^T\left[P-\gamma(A_1+B_1K)^TP(A_1+B_1K)-\gamma(C_1+D_1K)^TP(C_1+D_1K)\right]G_i\\ &=-E\sum_{i=k}^{\infty}\gamma^{i-k}\left[\gamma G_{i+1}^TPG_{i+1}-G_i^TPG_i\right]\\ &=E(G_k^TPG_k)-\lim_{i\to\infty}\gamma^{i-k+1}E(G_i^TPG_i)\\ &=E(G_k^TPG_k). \end{aligned}$$
Since the feedback control $u_k$ is admissible, we obtain $J(G_k,K)=E(G_k^TPG_k)$, which establishes the well-posedness of the SLQ optimal tracking control problem.
To guarantee the existence of a mean-square stabilizing control, we make the following assumption.
Assumption 1. The system (2.9) is mean-square stabilizable.
At present, ADP algorithms have achieved great success in the optimal tracking control of deterministic systems [24,25,26], which inspires us to transform the stochastic problem into a deterministic one through a system transformation.
Let $M_k=E(G_kG_k^T)$; then the system (2.9) can be converted to
$$M_{k+1}=E(G_{k+1}G_{k+1}^T)=E\left((TG_k+B_0u_k)(TG_k+B_0u_k)^T\right)=(A_1+B_1K)M_k(A_1+B_1K)^T+(C_1+D_1K)M_k(C_1+D_1K)^T\tag{2.16}$$
where $M_k\in\mathbb{R}^{(n+q)\times(n+q)}$ is the state of a deterministic system and $M_0$ is the initial state.
Therefore, the cost function (2.10) can be rewritten as
$$J(M_k,K)=\mathrm{tr}\left\{\sum_{i=k}^{\infty}\gamma^{i-k}\left[(O_1+K^TR_\nabla K)M_i\right]\right\}.\tag{2.17}$$
Remark 1. After the system transformation, the stochastic system is transformed into a deterministic one. The cost (2.17) is completely free of the stochastic disturbance $\omega_k$ and depends only on the initial state $M_0$ and the control gain matrix $K$, which prepares for the derivation and application of the Q-learning algorithm.
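The following sketch assembles the augmented matrices of Lemma 1 and propagates the deterministic second-moment state (2.16). One caveat: $A_\nabla=A+A_d\nabla_d$ is an operator, so for this numeric illustration we evaluate $\nabla_d$ on constant signals, which reduces $X_\nabla$ to $X+X_d$; this simplification, and the helper names, are our assumptions rather than the paper's general operator calculus.

```python
import numpy as np

def augmented_matrices(A, Ad, B, Bd, C, Cd, D, Dd, E, Ed, F, O, R, Rd):
    """Build A1, B1, C1, D1 of Lemma 1 and O1, R_nabla of (2.10),
    evaluating the delay operator on constant signals (X_nabla ~ X + Xd)."""
    An, Bn, Cn, Dn, En = A + Ad, B + Bd, C + Cd, D + Dd, E + Ed
    n, m, q = A.shape[0], B.shape[1], F.shape[0]
    A1 = np.block([[An, np.zeros((n, q))], [np.zeros((q, n)), F]])
    B1 = np.vstack([Bn, np.zeros((q, m))])
    C1 = np.block([[Cn, np.zeros((n, q))],
                   [np.zeros((q, n)), np.zeros((q, q))]])
    D1 = np.vstack([Dn, np.zeros((q, m))])
    EI = np.hstack([En, -np.eye(q)])   # [E_nabla  -I], so e_k = EI @ G_k
    O1 = EI.T @ O @ EI
    Rn = R + Rd                        # R_nabla on constant signals
    return A1, B1, C1, D1, O1, Rn

def m_step(M, K, A1, B1, C1, D1):
    """One step of the deterministic dynamics (2.16) under u_k = K G_k."""
    AK, CK = A1 + B1 @ K, C1 + D1 @ K
    return AK @ M @ AK.T + CK @ M @ CK.T
```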
In this paper, the Q-learning method is used to solve the SLQ optimal tracking problem, which avoids the need for an accurate system model. Thus, we first give the formula of the optimal control and the corresponding augmented SAE.
Lemma 2. Given an admissible control $u_k$, the optimal control is
$$u_k^*=K^*G_k=-(R_\nabla+\gamma B_1^TPB_1+\gamma D_1^TPD_1)^{-1}\gamma(B_1^TPA_1+D_1^TPC_1)G_k\tag{3.1}$$
and the optimal cost function
$$V(G_k)=E(G_k^TPG_k)=\mathrm{tr}(PM_k)\tag{3.2}$$
where the matrix P satisfies the following augmented SAE
$$\begin{cases} P=O_1+\gamma(A_1^TPA_1+C_1^TPC_1)-\gamma(A_1^TPB_1+C_1^TPD_1)(R_\nabla+\gamma B_1^TPB_1+\gamma D_1^TPD_1)^{-1}\gamma(B_1^TPA_1+D_1^TPC_1),\\ R_\nabla+\gamma B_1^TPB_1+\gamma D_1^TPD_1>0. \end{cases}\tag{3.3}$$
Proof. Suppose uk is an admissible control. According to Lemma 1 and (2.17), the cost function can be written as
$$J(M_k,K)=\mathrm{tr}\left\{\sum_{i=k}^{\infty}\gamma^{i-k}\left[(O_1+K^TR_\nabla K)M_i\right]\right\}=\mathrm{tr}\left\{(O_1+K^TR_\nabla K)M_k\right\}+\gamma\,\mathrm{tr}\left\{\sum_{i=k+1}^{\infty}\gamma^{i-k-1}\left[(O_1+K^TR_\nabla K)M_i\right]\right\}=\mathrm{tr}\left\{(O_1+K^TR_\nabla K)M_k\right\}+\gamma J(M_{k+1},K).\tag{3.4}$$
According to the Bellman optimality principle, the optimal cost function satisfies
$$V(M_k)=\min_{K}\left\{\mathrm{tr}\left\{(O_1+K^TR_\nabla K)M_k\right\}+\gamma V(M_{k+1})\right\}.\tag{3.5}$$
The optimal control gain matrix can be obtained as follows
$$K^*(M_k)=\arg\min_{K}\left\{\mathrm{tr}\left\{(O_1+K^TR_\nabla K)M_k\right\}+\gamma V(M_{k+1})\right\}.\tag{3.6}$$
Considering the first-order necessary condition
$$\frac{\partial\left[\mathrm{tr}\left\{(O_1+K^TR_\nabla K)M_k\right\}+\gamma V(M_{k+1})\right]}{\partial K}=0,\tag{3.7}$$
we can obtain
$$(R_\nabla+\gamma B_1^TPB_1+\gamma D_1^TPD_1)KG_k+\gamma(B_1^TPA_1+D_1^TPC_1)G_k=0\tag{3.8}$$
where the matrix P satisfies augmented SAE (2.15).
Supposing $R_\nabla+\gamma B_1^TPB_1+\gamma D_1^TPD_1>0$, we have
$$K^*=-(R_\nabla+\gamma B_1^TPB_1+\gamma D_1^TPD_1)^{-1}\gamma(B_1^TPA_1+D_1^TPC_1).\tag{3.9}$$
Substituting (3.9) into (2.15), we obtain
$$P=O_1+\gamma(A_1^TPA_1+C_1^TPC_1)-\gamma(A_1^TPB_1+C_1^TPD_1)(R_\nabla+\gamma B_1^TPB_1+\gamma D_1^TPD_1)^{-1}\gamma(B_1^TPA_1+D_1^TPC_1).\tag{3.10}$$
From Lemma 2, the SLQ optimal tracking problem can be handled by solving the augmented SAE (3.3). However, solving the augmented SAE (3.3) requires an accurate system model, so this approach is not feasible when the dynamics are unknown.
To solve the model-free SLQ optimal tracking problem with delays, we give the definition of the Q-function and the corresponding matrix H.
Based on (2.10) and the Bellman optimality principle, the optimal cost function satisfies the Hamilton-Jacobi-Bellman (HJB) equation
$$V(G_k)=\min_{u_k}\left\{E\left[G_k^TO_1G_k+u_k^TR_\nabla u_k\right]+\gamma V(G_{k+1})\right\}.\tag{3.11}$$
The Q-function is defined as
$$Q(G_k,u_k)=E\left[G_k^TO_1G_k+u_k^TR_\nabla u_k\right]+\gamma V(G_{k+1}).\tag{3.12}$$
According to Lemma 1, V(Gk+1) can be written as
$$\begin{aligned} V(G_{k+1})&=E(G_{k+1}^TPG_{k+1})=E\left\{(TG_k+B_0u_k)^TP(TG_k+B_0u_k)\right\}\\ &=E\left\{\left[(A_1G_k+C_1G_k\omega_k)+(B_1u_k+D_1u_k\omega_k)\right]^TP\left[(A_1G_k+C_1G_k\omega_k)+(B_1u_k+D_1u_k\omega_k)\right]\right\}. \end{aligned}\tag{3.13}$$
Substituting (3.13) into (3.12), we get
$$Q(G_k,u_k)=E\left\{\begin{bmatrix}G_k\\ u_k\end{bmatrix}^T\begin{bmatrix}H_{GG} & H_{Gu}\\ H_{uG} & H_{uu}\end{bmatrix}\begin{bmatrix}G_k\\ u_k\end{bmatrix}\right\}=E\left\{\begin{bmatrix}G_k\\ u_k\end{bmatrix}^TH\begin{bmatrix}G_k\\ u_k\end{bmatrix}\right\}\tag{3.14}$$
where $H=H^T\in\mathbb{R}^{(n+q+m)\times(n+q+m)}$,
$$H=\begin{bmatrix}H_{GG} & H_{Gu}\\ H_{uG} & H_{uu}\end{bmatrix}=\begin{bmatrix}O_1+\gamma A_1^TPA_1+\gamma C_1^TPC_1 & \gamma A_1^TPB_1+\gamma C_1^TPD_1\\ \gamma B_1^TPA_1+\gamma D_1^TPC_1 & \gamma B_1^TPB_1+\gamma D_1^TPD_1+R_\nabla\end{bmatrix}.\tag{3.15}$$
Setting $\partial Q(G_k,u_k)/\partial u_k=0$, the optimal control is obtained as follows
$$u_k^*=-H_{uu}^{-1}H_{uG}G_k.\tag{3.16}$$
From Lemma 1 and (3.15), the relationship between the matrix P and the matrix H is
$$P=\begin{bmatrix}I & K^T\end{bmatrix}H\begin{bmatrix}I & K^T\end{bmatrix}^T.\tag{3.17}$$
As can be seen from (3.16), the optimal control depends only on the matrix H and is completely free of the system parameters. Next, we present the Q-learning iterative algorithm for estimating the matrix H.
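As a small illustration of (3.16)-(3.17), the helper below (a hypothetical name of ours) recovers the feedback gain from a partitioned H; the later sketches reuse it.

```python
import numpy as np

def gain_from_H(H, m):
    """Feedback gain of (3.16): u*_k = -H_uu^{-1} H_uG G_k,
    where m is the input dimension and H is partitioned as in (3.14)."""
    Huu, HuG = H[-m:, -m:], H[-m:, :-m]
    return -np.linalg.solve(Huu, HuG)
```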
In this section, we propose a Q-learning iterative algorithm based on value iteration (VI). The method starts with the initial value $Q_0(G_k,u_k)=0$ and an initial admissible control $u_0(G_k)$; then $Q_1(G_k,u_k)$ is updated from the initial value and the initial control as follows
$$Q_1(G_k,u_k)=E\left[G_k^TO_1G_k+u_0^T(G_k)R_\nabla u_0(G_k)\right]+\gamma Q_0(G_{k+1},u_0(G_{k+1})).\tag{3.18}$$
The control is updated as follows
$$u_1(G_k)=\arg\min_{u(G_k)}Q_1(G_k,u_k)\tag{3.19}$$
For $i\ge 1$, the Q-learning algorithm iterates between
$$Q_{i+1}(G_k,u_k)=E\left[G_k^TO_1G_k+u_i^T(G_k)R_\nabla u_i(G_k)\right]+\gamma Q_i(G_{k+1},u_i(G_{k+1}))\tag{3.20}$$
and
$$u_{i+1}(G_k)=\arg\min_{u_k}\left\{E\left[G_k^TO_1G_k+u_k^TR_\nabla u_k\right]+\gamma\min_{u_{k+1}}Q_i(G_{k+1},u_{k+1})\right\}\tag{3.21}$$
where $i$ is the iteration index and $k$ is the time index.
According to (3.14), the Q-function can be rewritten as
$$\begin{aligned} Q_{i+1}(G_k,u_k)&=E\left\{\begin{bmatrix}G_k\\ u_i(G_k)\end{bmatrix}^TH_{i+1}\begin{bmatrix}G_k\\ u_i(G_k)\end{bmatrix}\right\}\\ &=E\left\{\begin{bmatrix}G_k\\ u_i(G_k)\end{bmatrix}^T\begin{bmatrix}O_1 & 0\\ 0 & R_\nabla\end{bmatrix}\begin{bmatrix}G_k\\ u_i(G_k)\end{bmatrix}+\gamma\begin{bmatrix}G_{k+1}\\ u_i(G_{k+1})\end{bmatrix}^TH_i\begin{bmatrix}G_{k+1}\\ u_i(G_{k+1})\end{bmatrix}\right\} \end{aligned}\tag{3.22}$$
and we can obtain the optimal controller
$$u_i(G_k)=-H_{uu,i}^{-1}H_{uG,i}G_k.\tag{3.23}$$
According to (3.17), we can get
$$P_i=\begin{bmatrix}I & K_i^T\end{bmatrix}H_i\begin{bmatrix}I & K_i^T\end{bmatrix}^T.\tag{3.24}$$
Before proving the convergence of the Q-learning algorithm, we first give the following two lemmas.
Lemma 3. The Q-learning algorithm (3.22) and (3.23) is equivalent to
$$P_{i+1}=O_1+\gamma(A_1^TP_iA_1+C_1^TP_iC_1)-\gamma(A_1^TP_iB_1+C_1^TP_iD_1)(R_\nabla+\gamma B_1^TP_iB_1+\gamma D_1^TP_iD_1)^{-1}\gamma(B_1^TP_iA_1+D_1^TP_iC_1).\tag{3.25}$$
Proof. According to (2.11), the last term of (3.22) can be written as
$$\begin{aligned} &E\left\{\begin{bmatrix}G_{k+1}\\ u_i(G_{k+1})\end{bmatrix}^TH_i\begin{bmatrix}G_{k+1}\\ u_i(G_{k+1})\end{bmatrix}\right\}=E\left\{G_{k+1}^T\begin{bmatrix}I & K_i^T\end{bmatrix}H_i\begin{bmatrix}I & K_i^T\end{bmatrix}^TG_{k+1}\right\}\\ &=E\Big\{\left[(A_1G_k+C_1G_k\omega_k)+(B_1u_i(G_k)+D_1u_i(G_k)\omega_k)\right]^T\begin{bmatrix}I & K_i^T\end{bmatrix}H_i\begin{bmatrix}I & K_i^T\end{bmatrix}^T\left[(A_1G_k+C_1G_k\omega_k)+(B_1u_i(G_k)+D_1u_i(G_k)\omega_k)\right]\Big\}\\ &=E\left\{\begin{bmatrix}G_k\\ u_i(G_k)\end{bmatrix}^T\begin{bmatrix}A_1 & B_1\end{bmatrix}^T\begin{bmatrix}I & K_i^T\end{bmatrix}H_i\begin{bmatrix}I & K_i^T\end{bmatrix}^T\begin{bmatrix}A_1 & B_1\end{bmatrix}\begin{bmatrix}G_k\\ u_i(G_k)\end{bmatrix}+\begin{bmatrix}G_k\\ u_i(G_k)\end{bmatrix}^T\begin{bmatrix}C_1 & D_1\end{bmatrix}^T\begin{bmatrix}I & K_i^T\end{bmatrix}H_i\begin{bmatrix}I & K_i^T\end{bmatrix}^T\begin{bmatrix}C_1 & D_1\end{bmatrix}\begin{bmatrix}G_k\\ u_i(G_k)\end{bmatrix}\right\}. \end{aligned}\tag{3.26}$$
Substituting (3.26) into (3.22) and using (3.24), we get
$$H_{i+1}=\begin{bmatrix}O_1 & 0\\ 0 & R_\nabla\end{bmatrix}+\begin{bmatrix}\gamma A_1^TP_iA_1 & \gamma A_1^TP_iB_1\\ \gamma B_1^TP_iA_1 & \gamma B_1^TP_iB_1\end{bmatrix}+\begin{bmatrix}\gamma C_1^TP_iC_1 & \gamma C_1^TP_iD_1\\ \gamma D_1^TP_iC_1 & \gamma D_1^TP_iD_1\end{bmatrix}.\tag{3.27}$$
Based on (3.24), we have
$$P_{i+1}=\begin{bmatrix}I & K_{i+1}^T\end{bmatrix}H_{i+1}\begin{bmatrix}I & K_{i+1}^T\end{bmatrix}^T.\tag{3.28}$$
Substituting (3.27) into (3.28), we get
$$P_{i+1}=O_1+\gamma(A_1^TP_iA_1+C_1^TP_iC_1)-\gamma(A_1^TP_iB_1+C_1^TP_iD_1)(R_\nabla+\gamma B_1^TP_iB_1+\gamma D_1^TP_iD_1)^{-1}\gamma(B_1^TP_iA_1+D_1^TP_iC_1)\tag{3.29}$$
where $R_\nabla+\gamma B_1^TPB_1+\gamma D_1^TPD_1>0$.
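For reference, Lemma 3 can be checked numerically with a model-based value iteration: build $H_{i+1}$ from $P_i$ via (3.27), extract $K_{i+1}$ via (3.23), and map back to $P_{i+1}$ via (3.24). The sketch below assumes numeric augmented matrices (e.g., from `augmented_matrices` above) and is intended only as a baseline against which the model-free algorithm can be validated.

```python
import numpy as np

def value_iteration(A1, B1, C1, D1, O1, Rn, gamma, iters=200):
    """Model-based VI of Lemma 3; returns the iterates H, K, P."""
    nq, m = A1.shape[0], B1.shape[1]
    P = np.zeros((nq, nq))
    for _ in range(iters):
        # H_{i+1} from P_i, Eq (3.27)
        H = np.block([
            [O1 + gamma * (A1.T @ P @ A1 + C1.T @ P @ C1),
             gamma * (A1.T @ P @ B1 + C1.T @ P @ D1)],
            [gamma * (B1.T @ P @ A1 + D1.T @ P @ C1),
             gamma * (B1.T @ P @ B1 + D1.T @ P @ D1) + Rn],
        ])
        K = gain_from_H(H, m)            # K_{i+1} = -H_uu^{-1} H_uG
        IK = np.vstack([np.eye(nq), K])  # [I  K^T]^T
        P = IK.T @ H @ IK                # P_{i+1}, Eq (3.24)
    return H, K, P
```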
Lemma 4 ([27]). The value iteration algorithm, which iterates between
$$V_{i+1}(G_k)=E\left(G_k^T(O_1+K_i^TR_\nabla K_i)G_k\right)+\gamma V_i(G_{k+1})\tag{3.30}$$
and
$$K_{i+1}=\arg\min_{K}\left\{E\left(G_k^T(O_1+K^TR_\nabla K)G_k\right)+\gamma V_i(G_{k+1})\right\}\tag{3.31}$$
is convergent, and
$$\lim_{i\to\infty}V_i(G_k)=V(G_k)=E(G_k^TPG_k)=\mathrm{tr}\{PM_k\},$$
$$\lim_{i\to\infty}K_i=K^*=-(R_\nabla+\gamma B_1^TPB_1+\gamma D_1^TPD_1)^{-1}\gamma(B_1^TPA_1+D_1^TPC_1)$$
where the matrix P satisfies the augmented SAE (3.3).
Theorem 3.1. Assume that the system (2.9) is mean-square stabilizable. Then the matrix sequence $\{H_i\}$ calculated by the Q-learning algorithm (3.22) converges to the matrix H, and the matrix sequence $\{P_i\}$ calculated by (3.24) converges to the solution P of the augmented SAE (3.3).
Proof. According to Lemma 4, (3.30) can be rewritten as
$$\begin{aligned} V_{i+1}(G_k)&=E(G_k^TP_{i+1}G_k)=E\left[G_k^T(O_1+K_i^TR_\nabla K_i)G_k\right]+\gamma E(G_{k+1}^TP_iG_{k+1})\\ &=E\Big\{G_k^T(O_1+K_i^TR_\nabla K_i)G_k+\gamma\left[(A_1+B_1K_i)G_k+(C_1+D_1K_i)G_k\omega_k\right]^TP_i\left[(A_1+B_1K_i)G_k+(C_1+D_1K_i)G_k\omega_k\right]\Big\}\\ &=E\left(G_k^T\left[\gamma(A_1+B_1K_i)^TP_i(A_1+B_1K_i)+\gamma(C_1+D_1K_i)^TP_i(C_1+D_1K_i)+O_1+K_i^TR_\nabla K_i\right]G_k\right). \end{aligned}\tag{3.32}$$
We can update the control gain matrix by (3.31) as follows
$$K_i=-(R_\nabla+\gamma B_1^TP_iB_1+\gamma D_1^TP_iD_1)^{-1}\gamma(B_1^TP_iA_1+D_1^TP_iC_1).\tag{3.33}$$
Substituting (3.33) into (3.32), we can get
$$P_{i+1}=O_1+\gamma(A_1^TP_iA_1+C_1^TP_iC_1)-\gamma(A_1^TP_iB_1+C_1^TP_iD_1)(R_\nabla+\gamma B_1^TP_iB_1+\gamma D_1^TP_iD_1)^{-1}\gamma(B_1^TP_iA_1+D_1^TP_iC_1).\tag{3.34}$$
According to Lemmas 3 and 4, we conclude that $\lim_{i\to\infty}P_i=P$. As $i\to\infty$, the matrix P satisfies
$$P=O_1+\gamma(A_1^TPA_1+C_1^TPC_1)-\gamma(A_1^TPB_1+C_1^TPD_1)(R_\nabla+\gamma B_1^TPB_1+\gamma D_1^TPD_1)^{-1}\gamma(B_1^TPA_1+D_1^TPC_1).\tag{3.35}$$
Based on (3.27), we know that $\lim_{i\to\infty}H_i=H$, where
$$H=\begin{bmatrix}O_1+\gamma A_1^TPA_1+\gamma C_1^TPC_1 & \gamma A_1^TPB_1+\gamma C_1^TPD_1\\ \gamma B_1^TPA_1+\gamma D_1^TPC_1 & \gamma B_1^TPB_1+\gamma D_1^TPD_1+R_\nabla\end{bmatrix}.\tag{3.36}$$
So the Q-learning algorithm converges.
Because the stochastic disturbance makes the output trajectory of the system uncertain and the cost function involves expectations, the algorithm above cannot be implemented online directly. Therefore, it is necessary to transform the stochastic Q-learning algorithm into a deterministic one. In this section, we give the implementation steps of the deterministic Q-learning algorithm. The flow chart of the Q-learning algorithm is shown in Figure 1.
According to Eq (2.11), the left side of (3.22) can be simplified to
$$E\left\{\begin{bmatrix}G_k\\ u_i(G_k)\end{bmatrix}^TH_{i+1}\begin{bmatrix}G_k\\ u_i(G_k)\end{bmatrix}\right\}=E\left\{G_k^T\begin{bmatrix}I & K_i^T\end{bmatrix}H_{i+1}\begin{bmatrix}I & K_i^T\end{bmatrix}^TG_k\right\}=\mathrm{tr}\left\{\begin{bmatrix}I & K_i^T\end{bmatrix}H_{i+1}\begin{bmatrix}I & K_i^T\end{bmatrix}^TM_k\right\}.\tag{4.1}$$
The right side of (3.22) can be simplified as
$$E\left\{G_k^T\begin{bmatrix}I & K_i^T\end{bmatrix}\begin{bmatrix}O_1 & 0\\ 0 & R_\nabla\end{bmatrix}\begin{bmatrix}I & K_i^T\end{bmatrix}^TG_k+\gamma G_{k+1}^T\begin{bmatrix}I & K_i^T\end{bmatrix}H_i\begin{bmatrix}I & K_i^T\end{bmatrix}^TG_{k+1}\right\}=\mathrm{tr}\left\{\begin{bmatrix}I & K_i^T\end{bmatrix}\begin{bmatrix}O_1 & 0\\ 0 & R_\nabla\end{bmatrix}\begin{bmatrix}I & K_i^T\end{bmatrix}^TM_k+\gamma\begin{bmatrix}I & K_i^T\end{bmatrix}H_i\begin{bmatrix}I & K_i^T\end{bmatrix}^TM_{k+1}\right\}.\tag{4.2}$$
For simplicity, let
$$L_i(X)=\begin{bmatrix}I & K_i^T\end{bmatrix}X\begin{bmatrix}I & K_i^T\end{bmatrix}^T,\qquad i=1,2,3,\cdots.\tag{4.3}$$
Then (3.22) can be simplified as
$$\mathrm{tr}\left\{L_i(H_{i+1})M_k\right\}=\mathrm{tr}\left\{L_i\!\left(\begin{bmatrix}O_1 & 0\\ 0 & R_\nabla\end{bmatrix}\right)M_k+\gamma L_i(H_i)M_{k+1}\right\}.\tag{4.4}$$
The Q-learning iterative algorithm consisting of (4.4) and (3.23) relies only on the state $M_k$ of the deterministic system (2.16) and the iteratively updated control gain matrix $K_i$, thereby avoiding any dependence on the system parameters and the stochastic disturbance.
Remark 2. The Q-learning algorithm based on VI is performed online and solves (4.4) using least squares (LS) without knowledge of the augmented system. In fact, (4.4) is a scalar equation and H is a symmetric $(n+q+m)\times(n+q+m)$ matrix with $(n+q+m)(n+q+m+1)/2$ independent elements. Therefore, at least $(n+q+m)(n+q+m+1)/2$ data tuples are required before (4.4) can be solved using LS.
Remark 3. The Q-learning algorithm based on VI requires a persistent excitation (PE) condition [28] to ensure sufficient exploration of the state space.
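A minimal sketch of one model-free iteration of (4.4) follows: each sample $M_k$ yields one scalar equation that is linear in the unknown symmetric $H_{i+1}$, so vectorizing the upper triangle (with off-diagonal doubling) turns (4.4) into an ordinary LS problem. The helper names are ours, and the trajectory $\{M_k\}$ is assumed persistently exciting in the sense of Remark 3.

```python
import numpy as np
from itertools import combinations_with_replacement

def _pairs(p):
    return list(combinations_with_replacement(range(p), 2))

def svec(S):
    """Half-vectorization with off-diagonal doubling, so that
    svec(S) . vech(H) = tr(S H) for symmetric S and H."""
    return np.array([S[i, j] if i == j else 2.0 * S[i, j]
                     for i, j in _pairs(S.shape[0])])

def unvech(h, p):
    """Rebuild a symmetric p x p matrix from its upper-triangle entries."""
    H = np.zeros((p, p))
    for val, (i, j) in zip(h, _pairs(p)):
        H[i, j] = H[j, i] = val
    return H

def q_learning_step(M_traj, K, H_prev, O1, Rn, gamma):
    """Fit H_{i+1} in (4.4) by least squares from a trajectory {M_k}
    of the deterministic system (2.16) under the current gain K."""
    nq, m = O1.shape[0], Rn.shape[0]
    Qbar = np.block([[O1, np.zeros((nq, m))], [np.zeros((m, nq)), Rn]])
    IK = np.vstack([np.eye(nq), K])                 # [I  K^T]^T
    rows, rhs = [], []
    for Mk, Mk1 in zip(M_traj[:-1], M_traj[1:]):
        Sk, Sk1 = IK @ Mk @ IK.T, IK @ Mk1 @ IK.T   # tr{L_i(X) M} = tr{X S}
        rows.append(svec(Sk))
        rhs.append(np.trace(Qbar @ Sk) + gamma * np.trace(H_prev @ Sk1))
    h, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return unvech(h, nq + m)
```

Between iterations the gain would be refreshed via (3.23), e.g., with the `gain_from_H` helper sketched earlier.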
In this section, a simulation example is given to illustrate the effectiveness of the Q-learning algorithm. Consider the following stochastic linear system with delays
$$x_{k+1}=Ax_k+A_dx_{k-d}+Bu_k+B_du_{k-d}+(Cx_k+C_dx_{k-d}+Du_k+D_du_{k-d})\omega_k,\qquad y_k=Ex_k+E_dx_{k-d}$$
in which $A=\begin{pmatrix}0.2 & -0.8\\ 0.5 & -0.7\end{pmatrix}$, $A_d=\begin{pmatrix}0.2 & -0.2\\ 0.1 & 0.15\end{pmatrix}$, $B=\begin{pmatrix}0.03\\ -0.5\end{pmatrix}$, $B_d=\begin{pmatrix}0.3\\ -0.2\end{pmatrix}$, $C=\begin{pmatrix}-0.04 & 0.4\\ -0.3 & 0.13\end{pmatrix}$, $C_d=\begin{pmatrix}0.2 & -0.1\\ 0.2 & 0.11\end{pmatrix}$, $D=\begin{pmatrix}0.05\\ -0.3\end{pmatrix}$, $D_d=\begin{pmatrix}0.1\\ 0.1\end{pmatrix}$, $E=\begin{pmatrix}3 & 3\end{pmatrix}$, $E_d=\begin{pmatrix}0.1 & 0.12\end{pmatrix}$.
Suppose the reference trajectory is as follows
$$r_{k+1}=-r_k$$
where $r_0=1$.
The cost function is taken as (2.5) with $R=1$, $R_d=1$, $O=10$, and delay index $d=1$. The initial state of the augmented system (2.9) is chosen as $G_0=[10\ \ -10\ \ 1]^T$. The initial control gain matrix is selected as $K=[0\ \ 0\ \ 0]$. In each iteration of the algorithm, 21 samples are collected to update the control gain matrix K.
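Putting the pieces together, the following usage sketch reproduces the structure of this example with the helpers defined earlier; the discount factor value below is assumed (the text does not list it), and the delay operator is resolved as in `augmented_matrices`, so the numbers serve only to illustrate the workflow, not to reproduce the paper's figures.

```python
import numpy as np

A  = np.array([[0.2, -0.8], [0.5, -0.7]]);   Ad = np.array([[0.2, -0.2], [0.1, 0.15]])
B  = np.array([[0.03], [-0.5]]);             Bd = np.array([[0.3], [-0.2]])
C  = np.array([[-0.04, 0.4], [-0.3, 0.13]]); Cd = np.array([[0.2, -0.1], [0.2, 0.11]])
D  = np.array([[0.05], [-0.3]]);             Dd = np.array([[0.1], [0.1]])
E  = np.array([[3.0, 3.0]]);                 Ed = np.array([[0.1, 0.12]])
F  = np.array([[-1.0]])                      # r_{k+1} = -r_k
O, R, Rd = 10.0 * np.eye(1), np.eye(1), np.eye(1)
gamma = 0.9                                  # assumed value; not given in the text

A1, B1, C1, D1, O1, Rn = augmented_matrices(A, Ad, B, Bd, C, Cd, D, Dd,
                                            E, Ed, F, O, R, Rd)
G0 = np.array([10.0, -10.0, 1.0])            # [x0; r0]
K, H = np.zeros((1, 3)), np.zeros((4, 4))
for i in range(50):                          # outer Q-learning iterations
    M_traj = [np.outer(G0, G0)]
    for _ in range(21):                      # 21 samples per iteration, as above
        M_traj.append(m_step(M_traj[-1], K, A1, B1, C1, D1))
    H = q_learning_step(M_traj, K, H, O1, Rn, gamma)
    K = gain_from_H(H, m=1)                  # policy update (3.23)
print("learned gain K:", K)
```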
To verify the effectiveness of the iterative Q-learning algorithm, we compare K with the optimal gain $K^*$ obtained from (3.1). Figure 2 shows that the control gain matrix K converges to the optimal control gain matrix $K^*$ as the number of iterations increases. Figure 3 shows the convergence of H to its optimal value $H^*$, which can be calculated by (3.15). The goal of the optimal tracking problem is to track the reference signal trajectory. As Figure 4 shows, the expectation of the system output $E(y)$ tracks the reference trajectory $r_k$. This further demonstrates the effectiveness of the proposed Q-learning algorithm.
For the model-free SLQ optimal tracking problem with delays, a Q-learning algorithm based on VI is proposed in this paper. The method makes full use of the system information to approximate the optimal control online and never needs the system parameter information. In the iterative process of the algorithm, the H matrix sequence and the control gain matrix K sequence are guaranteed to approach their optimal values. Finally, the simulation results show that the system output can track the reference trajectory effectively.
The authors declare that they have no conflicts of interest.
[1] | A. Lotka, Elements of Physical Biology, Williams & Wilkins Co., Baltimore, USA, 1925. |
[2] | V. Volterra, Variazioni e fluttuazioni del numero d'individui in specie animali conviventi, Mem. Acad. Lincei Roma, 2 (1926), 31-113. |
[3] | G. F. Gause, N. P. Smaragdova, A. A. Witt, Further studies of interaction between predators and prey, J. Anim. Ecol., 5 (1936), 1-18. doi: 10.2307/1087 |
[4] | G. F. Gause, The Struggle for Existence, Williams & Wilkins Co., Baltimore, USA, 1934. |
[5] | S. Magalhães, P. C. J. V. Rijn, M. Montserrat, A. Pallini, M. W. Sabelis, Population dynamics of thrips prey and their mite predators in a refuge, Oecologia, 150 (2007), 557-568. doi: 10.1007/s00442-006-0548-3 |
[6] | J. Ghosh, B. Sahoo, S. Poria, Prey-predator dynamics with prey refuge providing additional food to predator, Chaos Soliton. Fract., 96 (2017), 110-119. doi: 10.1016/j.chaos.2017.01.010 |
[7] | B. Sahoo, S. Poria, Effects of additional food in a delayed predator-prey model, Math. Biosci., 261 (2015), 62-73. doi: 10.1016/j.mbs.2014.12.002 |
[8] | B. Sahoo, S. Poria, Dynamics of predator-prey system with fading memory, Appl. Math. Comput., 347 (2019), 319-333. |
[9] | U. Ufuktepe, B. Kulahcioglu, O. Akman, Stability analysis of a prey refuge predator-prey model with Allee effects, J. Biosciences, 44 (2019), 85. doi: 10.1007/s12038-019-9911-5 |
[10] | Y. Xie, J. Lu, Z. Wang, Stability analysis of a fractional-order diffused prey-predator model with prey refuges, Physica A, 526 (2019), 120773. doi: 10.1016/j.physa.2019.04.009 |
[11] | C. S. Holling, The functional response of predators to prey density and its role in mimicry and population regulation, Mem. Entomol. Soc. Canada, 45 (1965), 1-60. |
[12] | Q. Y. Bie, Q. R. Wang, Z. A. Yao, Cross-diffusion induced instability and pattern formation for a Holling type-II predator-prey model, Appl. Math. Comput., 247 (2014), 1-12. |
[13] | L. Chen, F. Chen, L. Chen, Qualitative analysis of a predator-prey model with Holling type II functional response incorporating a constant prey refuge, Nonlinear Anal-Real., 11 (2010), 246-252. doi: 10.1016/j.nonrwa.2008.10.056 |
[14] | Z. J. Du, X. Chen, Z. S. Feng, Multiple positive periodic solutions to a predator-prey model with Leslie-Gower Holling-type II functional response and harvesting terms, Discrete Contin. Dyn. Syst., 7 (2014), 1203-1214. |
[15] | J. J. Jiao, L. S. Chen, S. H. Cai, A delayed stage-structured Holling II predator-prey model with mutual interference and impulsive perturbations on predator, Chaos Soliton. Fract., 40 (2009), 1946-1955. doi: 10.1016/j.chaos.2007.09.074 |
[16] | W. Ko, K. Ryu, Qualitative analysis of a predator-prey model with Holling type II functional response incorporating a prey refuge, J. Differ. Equations, 231 (2006), 534-550. doi: 10.1016/j.jde.2006.08.001 |
[17] | V. Krivan, J. Eisner, The effect of the Holling type II functional response on apparent competition, Theor. Popul. Biol., 70 (2006), 421-430. doi: 10.1016/j.tpb.2006.07.004 |
[18] | V. Krivan, On the Gause predator prey model with a refuge: A fresh look at the history, J. Theor. Biol., 274 (2011), 67-73. doi: 10.1016/j.jtbi.2011.01.016 |
[19] | Q. Liu, D. Q. Jiang, H. Tasawar, A. Ahmed, Dynamics of a stochastic predator-prey model with stage structure for predator and Holling type II functional response, J. Nonlinear Sci., 28 (2018), 1151-1187. doi: 10.1007/s00332-018-9444-3 |
[20] | S. P. Li, W. N. Zhang, Bifurcations of a discrete prey-predator model with Holling type II functional response, Discrete Cont. Dyn-B., 14 (2010), 159-176. |
[21] | H. Molla, S. R. Md, S. Sahabuddin, Dynamics of a predator-prey model with Holling type II functional response incorporating a prey refuge depending on both the species, Int. J. Nonlin. Sci. Num., 20 (2019), 1-16. doi: 10.1515/ijnsns-2017-0166 |
[22] | J. Song, Y. Xia, Y. Bai, Y. Cai, D. O'Regan, A non-autonomous Leslie-Gower model with Holling type IV functional response and harvesting complexity, Adv. Differ. Equ-Ny., 2019 (2019), 1-12. doi: 10.1186/s13662-018-1939-6 |
[23] | D. Ye, M. Fan, W. P. Zhang, Periodic solutions of density dependent predator-prey systems with Holling type 2 functional response and infinite delays, J. Appl. Math. Mec., 85 (2005), 213-221. |
[24] | S. W. Zhang, L. S. Chen, A Holling II functional response food chain model with impulsive perturbations, Chaos Soliton. Fract., 24 (2005), 1269-1278. doi: 10.1016/j.chaos.2004.09.051 |
[25] | J. Zhou, C. L. Mu, Coexistence states of a Holling type-II predator-prey system, J. Math. Anal. Appl., 369 (2010), 555-563. doi: 10.1016/j.jmaa.2010.04.001 |
[26] | S. Jana, M. Chakraborty, K. Chakraborty, T. K. Kar, Global stability and bifurcation of time delayed prey-predator system incorporating prey refuge, Math. Comput. Simulat., 85 (2012), 57-77. doi: 10.1016/j.matcom.2012.10.003 |
[27] | W. G. Aiello, H. I. Freedman, J. Wu, Analysis of a model representing stage-structured population growth with state-dependent time delay, SIAM J. Appl. Math., 52 (1992), 885-889. |
[28] | F. Brauer, Z. Ma, Stability of stage-structured population models, J. Math. Anal. Appl., 126 (1987), 301-315. doi: 10.1016/0022-247X(87)90041-2 |
[29] | H. I. Freedman, J. Wu, Persistence and global asymptotic stability of single species dispersal models with stage-structure, Q. Appl. Math., 49 (1991), 351-371. doi: 10.1090/qam/1106397 |
[30] | W. Wang, L. Chen, A predator-prey system with stage-structure for predator, Comput. Math. Appl., 33 (1997), 83-91. |
[31] | W. Wang, G. Mulone, F. Salemi, V. Salone, Permanence and stability of a stage-structured predator prey model, J. Math. Anal. Appl., 262 (2001), 499-528. doi: 10.1006/jmaa.2001.7543 |
[32] | Y. Chen, Multiple periodic solutions of delayed predator-prey systems with type IV functional responses, Nonlinear Anal-Hybri., 5 (2004), 45-53. doi: 10.1016/S1468-1218(03)00014-2 |
[33] | M. Fan, Q. Wang, X. F. Zou, Dynamics of a nonautonomous ratio-dependent predator-prey system, P. Roy. Soc. Lond. A Math., 133 (2003), 97-118. |
[34] | M. Fan, P. J. Y. Wong, R. P. Agarwal, Periodicity and stability in periodic n-species Lotka-Volterra competition system with feedback controls and deviating arguments, Acta Math. Sin., 19 (2003), 801-822. doi: 10.1007/s10114-003-0311-1 |
[35] | R. Gaines, J. Mawhin, Coincidence Degree and Nonlinear Differential Equations, Lecture Notes in Mathematics, Springer, Berlin, 1977. |
[36] | H. Zheng, L. Guo, Y. Z. Bai, Y. H. Xia, Periodic solutions of a non-autonomous predator-prey system with migrating prey and disease infection: via Mawhin's coincidence degree theory, J. Fix. Point Theory A., 21 (2019), 21-37. doi: 10.1007/s11784-019-0660-8 |
[37] | Y. H. Xia, Y. Shen, A nonautonomous predator-prey model with refuge effect, J. Xuzhou Inst. Tech., 34 (2019), 1-7. |
[38] | F. Chen, On a periodic multi-species ecological model, Appl. Math. Comput., 171 (2005), 492-510. |
[39] | F. Chen, Positive periodic solutions of neutral Lotka-Volterra system with feedback control, Appl. Math. Comput., 162 (2005), 1279-1302. |
[40] | F. Chen, F. Lin, X. Chen, Sufficient conditions for the existence of positive periodic solutions of a class of neutral delay models with feedback control, Appl. Math. Comput., 158 (2004), 45-68. |
[41] | L. Chen, Mathematical Models and Methods in Ecology, Science Press, Beijing, 1998 (in Chinese). |
[42] | X. Chen, Z. J. Du, Existence of positive periodic solutions for a neutral delay predator-prey model with Hassell-Varley type functional response and impulse, Qual. Theor. Dyn. Syst., 17 (2018), 67-80. doi: 10.1007/s12346-017-0223-6 |
[43] | Z. J. Du, Z. S. Feng, Periodic solutions of a neutral impulsive predator-prey model with Beddington-DeAngelis functional response with delays, J. Comput. Appl. Math., 258 (2014), 87-98. doi: 10.1016/j.cam.2013.09.008 |
[44] | S. Gao, L. Chen, Z. Teng, Hopf bifurcation and global stability for a delayed predator-prey system with stage structure for predator, Appl. Math. Comput., 202 (2008), 721-729. |
[45] | S. Kant, V. Kumar, Stability analysis of predator-prey system with migrating prey and disease infection in both species, Appl. Math. Model., 42 (2017), 509-539. doi: 10.1016/j.apm.2016.10.003 |
[46] | Y. Kuang, Delay Differential Equations: With Applications in Population Dynamics, Academic Press, San Diego, 1993. |
[47] | S. Liu, L. Chen, Z. Liu, Extinction and permanence in nonautonomous competitive system with stage structure, J. Math. Anal. Appl., 274 (2002), 667-684. doi: 10.1016/S0022-247X(02)00329-3 |
[48] | S. Lu, W. Ge, Existence of positive periodic solutions for neutral population model with multiple delays, Appl. Math. Comput., 153 (2004), 885-902. |
[49] | X. Z. Meng, S. N. Zhao, T. Feng, T. H. Zhang, Dynamics of a novel nonlinear stochastic SIS epidemic model with double epidemic hypothesis, J. Math. Anal. Appl., 433 (2016), 227-242. doi: 10.1016/j.jmaa.2015.07.056 |
[50] | J. Song, M. Hu, Y. Z. Bai, Y. Xia, Dynamic analysis of a non-autonomous ratio-dependent predator-prey model with additional food, J. Comput. Anal. Appl., 8 (2018), 1893-1909. |
[51] | Y. L. Song, H. P. Jiang, Q. X. Liu, Y. Yuan, Spatiotemporal dynamics of the diffusive Mussel-Algae model near Turing-Hopf bifurcation, SIAM J. Appl. Dyn. Syst., 16 (2017), 2030-2062. doi: 10.1137/16M1097560 |
[52] | Y. L. Song, X. S. Tang, Stability, steady-state bifurcations and Turing patterns in a predator-prey model with herd behavior and prey-taxis, Stud. Appl. Math., 139 (2017), 371-404. doi: 10.1111/sapm.12165 |
[53] | Y. L. Song, S. H. Wu, H. Wang, Spatiotemporal dynamics in the single population model with memory-based diffusion and nonlocal effect, J. Differ. Equations, 267 (2019), 6316-6351. doi: 10.1016/j.jde.2019.06.025 |
[54] | J. J. Wei, M. Y. Li, Hopf bifurcation analysis in a delayed Nicholson blowflies equation, Nonlinear Anal-Theor., 60 (2005), 1351-1367. doi: 10.1016/j.na.2003.04.002 |
[55] | Z. Wei, Y. H. Xia, T. Zhang, Stability and bifurcation analysis of an amensalism model with weak Allee effect, Qual. Theor. Dyn. Syst., 2020. |
[56] | R. Xu, Z. Ma, Stability and Hopf bifurcation in a ratio-dependent predator prey system with stage structure, Chaos Soliton. Fract., 38 (2008), 669-684. doi: 10.1016/j.chaos.2007.01.019 |
[57] | J. Y. Xu, T. H. Zhang, K. Y. Song, A stochastic model of bacterial infection associated with neutrophils, Appl. Math. Comput., 373 (2020), 125025. |
[58] | F. Xu, C. Ross, K. Vlastimil, Evolution of mobility in predator-prey systems, Discrete Cont. Dyn-B., 19 (2014), 3397-3432. |
[59] | F. Xu, M. Connell, An investigation of the combined effect of an annual mass gathering event and seasonal infectiousness on disease outbreak, Math. Biosci., 312 (2019), 50-58. doi: 10.1016/j.mbs.2019.03.006 |
[60] | J. Y. Yang, Z. Jin, F. Xu, Threshold dynamics of an age-space structured SIR model on heterogeneous environment, Appl. Math. Lett., 96 (2019), 69-74. doi: 10.1016/j.aml.2019.03.009 |
[61] | F. Q. Yi, J. J. Wei, J. P. Shi, Bifurcation and spatiotemporal patterns in a homogeneous diffusive predator-prey system, J. Differ. Equations, 246 (2009), 1944-1977. doi: 10.1016/j.jde.2008.10.024 |
[62] | F. Q. Yi, J. J. Wei, J. P. Shi, Diffusion-driven instability and bifurcation in the Lengyel-Epstein system, Nonlinear Anal-Real., 9 (2008), 1038-1051. doi: 10.1016/j.nonrwa.2007.02.005 |
[63] | T. H. Zhang, T. Q. Zhang, X. Z. Meng, Stability analysis of a chemostat model with maintenance energy, Appl. Math. Lett., 68 (2017), 1-7. doi: 10.1016/j.aml.2016.12.007 |
[64] | T. H. Zhang, Z. W. Geem, Review of harmony search with respect to algorithm structure, Swarm Evol. Comput., 48 (2019), 31-43. doi: 10.1016/j.swevo.2019.03.012 |
[65] | X. G. Zhang, C. H. Shan, Z. Jin, H. P. Zhu, Complex dynamics of epidemic models on adaptive networks, J. Differ. Equations, 266 (2019), 803-832. doi: 10.1016/j.jde.2018.07.054 |
1. | Heng Zhang, Na Li, Data‐driven policy iteration algorithm for continuous‐time stochastic linear‐quadratic optimal control problems, 2024, 26, 1561-8625, 481, 10.1002/asjc.3223 |