Research article

Learning-based DoS attack game strategy over multi-process systems

  • Received: 17 April 2024 Revised: 17 June 2024 Accepted: 06 August 2024 Published: 17 December 2024
  • In cyber-physical systems, the state information from multiple processes is sent simultaneously to remote estimators through wireless channels. However, with the introduction of open media such as wireless networks, cyber-physical systems may become vulnerable to denial-of-service attacks, which can pose significant security risks and challenges to the systems. To better understand the impact of denial-of-service attacks on cyber-physical systems and develop corresponding defense strategies, several research papers have explored this issue from various perspectives. However, most current works still face three limitations. First, they only study the optimal strategy from the perspective of one side (either the attacker or defender). Second, these works assume that the attacker possesses complete knowledge of the system's dynamic information. Finally, the power exerted by both the attacker and defender is assumed to be small and discrete. All these limitations are relatively strict and not suitable for practical applications. In this paper, we addressed these limitations by establishing a continuous power game problem of a denial-of-service attack in a multi-process cyber-physical system with asymmetric information. We also introduced the concept of the age of information to comprehensively characterize data freshness. To solve this problem, we employed the multi-agent deep deterministic policy gradient algorithm. Numerical experiments demonstrate that the algorithm is effective for solving the game problem and exhibits convergence in multi-agent environments, outperforming other algorithms.

    Citation: Zhiqiang Hang, Xiaolin Wang, Fangfei Li, Yi-ang Ren, Haitao Li. Learning-based DoS attack game strategy over multi-process systems[J]. Mathematical Modelling and Control, 2024, 4(4): 424-438. doi: 10.3934/mmc.2024034




    Cyber-physical systems (CPSs) are multidimensional complex systems that integrate computing, network, and physical environments [1]. Through the organic integration and deep collaboration of computing, communication, and control (3C) technologies, they achieve real-time perception, dynamic control, and information services for large-scale engineering systems [2]. Nowadays, they are widely applied in many fields, such as intelligent transportation [3] and health monitoring [4]. However, in recent years, a series of cyber attacks targeting CPSs have occurred, causing significant resource and economic losses. Thus, it is imperative to put emphasis on CPS security. A mainstream approach is to study attack methods targeting CPSs: by doing so, one can characterize the worst-case performance of a CPS and design corresponding defense strategies to improve system security.

    There are many different forms of cyber attacks, such as DoS attacks [2], false data injection attacks [5], and man-in-the-middle attacks [6]. Based on the mode of interference with the system, cyber attacks can be mainly categorized into three types [7]: confidentiality attacks, integrity attacks, and availability attacks. Confidentiality attacks involve eavesdropping on the system without the capability to interfere. Integrity attacks involve intercepting and tampering with transmitted data, resulting in significant damage and imposing strict requirements on the attackers. Availability attacks aim to disrupt the transmission of the CPS. Among availability attacks, denial-of-service (DoS) attacks are some of the most common and feasible, and they have been widely studied in recent literature [8,9,10].

    In earlier literature [11,12], the probability of successful channel transmission under DoS attacks was simplified as a binary variable, with attackers having only two options: 1 (attacked) or 0 (not attacked). However, in practical scenarios, this probability is influenced by many factors, such as the channel model, transmission power, and additional noise. Therefore, recent studies on DoS attacks have focused not only on identifying which channel to attack, but also on determining how much power to allocate to jam the transmission channel and reduce the probability of a successful transmission. For instance, the authors in [13] solved the optimal DoS attack allocation for multi-channel CPSs with additive Gaussian noise by solving the Bellman equation under the Markov decision process (MDP). Meanwhile, [14] considered the DoS attack allocation strategy from two different perspectives. However, these approaches are based on a strong assumption that DoS attackers possess comprehensive knowledge of the system, including its model and transmission power, which may not be realistic in real-world scenarios. In typical CPS attack scenarios, defenders, acting as system regulators, should ideally understand the complete dynamics or at least estimate the state of the system. In contrast, attackers typically gather information through eavesdropping and may lack specific knowledge about system dynamics and transmission power.

    Consequently, many studies assume attackers have no prior knowledge of system states or transmission power and propose using reinforcement learning (RL) algorithms to manage uncertainty [15,16]. For example, [17] introduced the Q-learning algorithm[18] to work out the optimal DoS attack strategy on a small scale, while [1] introduced the double deep Q network (DDQN) algorithm to calculate optimal DoS attack strategies. Building on this, the deep deterministic policy gradient (DDPG) algorithm was introduced to solve the optimal DoS attack allocation strategy for a multi-process CPS where action domains are continuous. It should be noted that the abovementioned papers predominantly focus on attack strategies from the attacker's perspective. The strategy of the defender is either fixed or considered only after calculating the optimal strategy of the attacker. There is still a research gap in addressing the game problem of asymmetric information [19], where both sides make simultaneous decisions.

    In this paper, we consider a multi-channel CPS with time delays [20] in wireless transmission that involves a DoS attacker and a defender (referred to as the smart sensor). At each step, both the attacker and the defender need to determine how much energy to allocate to each channel within their limited energy budgets. However, neither side possesses knowledge about the strategy employed by its opponent. Consequently, the problem at hand can be framed as an asymmetric-information DoS attack game. We aim to find the Bayesian Nash equilibrium strategy between the defender and the attacker. Compared with other existing DoS attack game studies [21], we not only consider the Bayesian Nash equilibrium strategy in game problems with asymmetric information, but also extend the feasible regions of both sides' actions to the continuous case, which is more applicable to actual systems with different discrete accuracies.

    To address the issue of asymmetric information between the defender and the attacker, we employ RL methods. However, the previously mentioned algorithms, such as Q-learning, DDQN, and DDPG, can exhibit significant instability in complex multi-agent environments, as any agent in the environment is affected by the other agents. In this paper, we study a deep reinforcement learning algorithm called the multi-agent deep deterministic policy gradient (MADDPG) to deal with game problems in multi-agent scenarios. By introducing feasibility layers, the algorithm can be applied to multi-process CPSs.

    In addition, we utilize a more comprehensive metric called the age of information (AoI) [22] to characterize the freshness of information and update the estimated state (see [23,24,25]); it has been widely studied in the communication field in recent years. However, there is still a lack of relevant work on the game problem of DoS attacks. Compared with the definition of the holding time in previous works [1,2], the introduction of AoI is effective in dealing with the problem of time delays in CPSs. To demonstrate the effectiveness of the MADDPG algorithm in DoS attack game problems within multi-process CPSs with time delays, a series of extensive experiments has been conducted in this paper.

    The main contributions of this work are summarized as follows:

    1) In the context of a CPS facing DoS attacks, considering that DoS attacks are often confused with transmission delays, we introduce the concept of AoI to more effectively measure the freshness of the received data. Compared to prior literature that used the holding time to describe transmission delays, this paper offers enhanced estimation accuracy. Consequently, the findings of our study carry broader applicability and practical significance in mitigating DoS attacks coupled with time delays.

    2) We establish an asymmetric sensor scheduling game problem for CPSs under DoS attacks. In contrast to the majority of prior studies, which optimized strategies separately for attackers and defenders, we address the optimization problem from both perspectives simultaneously. While some studies have explored real-time asymmetric-information DoS attack games, their action spaces were limited to discrete cases due to methodological constraints. In comparison, the MADDPG algorithm introduced in our work can handle not only discrete but also continuous action settings. It is worth mentioning that we incorporate two feasibility layers in the neural network to restrict the network's output actions, adapting it to multi-channel CPSs affected by time delays.

    3) We demonstrate the effectiveness of the MADDPG-based algorithm for this problem through several illustrative examples, calculating the Bayesian Nash equilibrium points for different weights and the overall cumulative discount rewards for the attacker and defender at each time step.

    The remainder of this paper is organized as follows: In Section 2, the system model and the DoS attack game-theoretic model are presented. The MADDPG-based algorithm to the game problem with feasibility layers is proposed in Section 3. In Sections 4 and 5, simulation results and the conclusion are provided, respectively.

    Notations: $\mathbb{N}$ and $\mathbb{R}$ denote the sets of natural and real numbers, respectively. $\mathbb{N}^+$ and $\mathbb{R}_+$ represent the set of positive natural numbers and the set of non-negative real numbers, respectively. $\mathbb{S}^n_+$ is the set of $n\times n$ positive semi-definite matrices. $\mathbb{R}^n$ is the $n$-dimensional Euclidean space. $\mathbb{R}^n_+$ is the $n$-dimensional Euclidean space of non-negative real numbers. For a matrix $X$, $X^T$, $\mathrm{Tr}(X)$, and $|X|$ stand for the transpose, trace, and determinant of $X$, respectively. $\|\cdot\|_1$ is the $l_1$ norm. The notation $\mathbb{E}[\cdot]$ stands for the expectation of a random variable. $\mathcal{N}$ denotes the Gaussian distribution. The spectral radius of a matrix is $\rho(\cdot)$. $\Pr(\cdot\,|\,\cdot)$ is the conditional probability. $\nabla$ is the vector gradient operator. $\mathrm{Dim}(\cdot)$ is an abbreviation of dimension. $\mathrm{ReLU}(x)=\max(0,x)$. $\mathrm{Sigmoid}(x)=\frac{e^x}{e^x+1}$.

    As shown in Figure 1, the CPS is composed of discrete linear time-invariant processes, smart sensors, wireless channels, a remote estimator, and a DoS attacker. The dynamics of the $i$-th process ($1\leq i\leq N$, $i\in\mathbb{N}^+$) are given by:

    $x_{k+1,i}=A_i x_{k,i}+\omega_{k,i},$ (2.1)
    $y_{k,i}=C_i x_{k,i}+\upsilon_{k,i},$ (2.2)
    Figure 1.  Multi-process system with transmission delay layout.

    where $x_{k,i}\in\mathbb{R}^n$ and $y_{k,i}\in\mathbb{R}^m$ are the state and the measurement of the $i$-th process, respectively. The matrices $A_i$ and $C_i$ are system parameters. The variables $\omega_{k,i}\in\mathbb{R}^n$ and $\upsilon_{k,i}\in\mathbb{R}^m$ denote independent and identically distributed Gaussian noises with distributions $\omega_{k,i}\sim\mathcal{N}(0,\Sigma_{\omega_i})$ and $\upsilon_{k,i}\sim\mathcal{N}(0,\Sigma_{\upsilon_i})$, respectively, where $\Sigma_{\omega_i}\geq 0$ and $\Sigma_{\upsilon_i}>0$. The initial state of the $i$-th process, $x_{0,i}\in\mathbb{R}^n$, follows the Gaussian distribution $x_{0,i}\sim\mathcal{N}(0,\Sigma_{x_{0,i}})$ with $\Sigma_{x_{0,i}}\geq 0$. Suppose that the variables $x_{0,i}$, $\omega_{k,i}$, and $\upsilon_{k,i}$ are mutually uncorrelated for $k\geq 0$. Moreover, we assume that $(A_i,C_i)$ is observable and $(A_i,\Sigma_{\omega_i})$ is controllable [26].

    Taking the $i$-th process as an example, at every time step $k$, the measurement $y_{k,i}$ is obtained by smart sensor $i$, which then employs a local Kalman filter [27] to estimate the actual state of the process, represented by $\hat{x}^s_{k,i}$. The corresponding estimation error covariance of the sensor is denoted by $P^s_{k,i}$, which is calculated by the following equation:

    $P^s_{k,i}=\mathbb{E}\big[(x_{k,i}-\hat{x}^s_{k,i})(x_{k,i}-\hat{x}^s_{k,i})^T\,\big|\,y_{1,i},\ldots,y_{k,i}\big],$ (2.3)

    where

    $\hat{x}^s_{0,i}=0 \quad \text{and} \quad P^s_{0,i}=\Sigma_{x_{0,i}}.$

    Since $P^s_{k,i}$ converges exponentially to a certain steady state $\bar{P}_i$, we assume that

    $P^s_{k,i}=\bar{P}_i$

    when $k\geq l$, $l\in\mathbb{N}^+$.
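    As a concrete illustration of how the steady state $\bar{P}_i$ arises, it can be computed numerically by iterating the standard Kalman filter recursion until convergence. The following sketch is ours rather than the authors' code; it assumes scalar processes with the parameters of process 1 in Table 1 and an arbitrary initial covariance.

```python
import numpy as np

def steady_state_error_cov(A, C, Q, R, P0, tol=1e-9, max_iter=10_000):
    """Iterate the Kalman filter recursion until the posterior error
    covariance P^s_k converges to its steady state P_bar."""
    P = P0
    for _ in range(max_iter):
        P_pred = A @ P @ A.T + Q                                  # time update
        K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + R)    # Kalman gain
        P_new = (np.eye(A.shape[0]) - K @ C) @ P_pred             # measurement update
        if np.max(np.abs(P_new - P)) < tol:
            return P_new
        P = P_new
    return P

# Scalar example with the parameters of process 1 in Table 1;
# the initial covariance Sigma_{x_0} is not reported, so 1.0 is assumed here.
A = np.array([[1.1]]); C = np.array([[1.2]])
Q = np.array([[0.8]]); R = np.array([[0.6]])
print(steady_state_error_cov(A, C, Q, R, P0=np.array([[1.0]])))
```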

    After the $i$-th sensor calculates the estimate $\hat{x}^s_{k,i}$, it selects a transmit power $p_{k,i}\in\mathbb{R}_+$ to send $\hat{x}^s_{k,i}$ to the remote estimator through a wireless transmission channel. However, due to the inherent properties and the transmission medium of the wireless channel, random time delays or packet losses often occur during the transmission. The time delay of the $i$-th channel at every step is denoted by $t_{d,i}$. A binary variable $\gamma_{k,i}$ is defined to represent whether the estimated state $\hat{x}^s_{k,i}$ is received successfully or not, as given by the following equation:

    $\gamma_{k,i}=\begin{cases}1, & \hat{x}^s_{k,i}\ \text{is received successfully},\\ 0, & \text{otherwise}.\end{cases}$ (2.4)

    The probability of a successful transmission is denoted by:

    $\Pr(\gamma_{k,i}=1\,|\,p_{k,i},\sigma_{k,i})=f(p_{k,i},\sigma_{k,i}),$ (2.5)

    where the scalar $\sigma_{k,i}\in\mathbb{R}_+$ is the noise power of the $i$-th channel, which follows a Gaussian distribution [13]. The function

    $f:\mathbb{R}_+\times\mathbb{R}_+\to[0,1]$

    is determined by the specific wireless channel model and the selected modulation approach. Naturally, we suppose that $f$ increases with $p_{k,i}$ and decreases with $\sigma_{k,i}$, according to [1,9]. That is, $\hat{x}^s_{k,i}$ is more likely to be transmitted successfully to the remote estimator with higher transmit power and less channel noise. The above assumption is intuitive and in accord with many practical communication channel models.

    If $\hat{x}^s_{k,i}$ is successfully transmitted, the remote estimator will update its estimated state of the process, $\hat{x}_{k,i}$, with $\hat{x}^s_{k,i}$. Conversely, if $\hat{x}^s_{k,i}$ is not obtained by the remote estimator, it will predict the current estimated state $\hat{x}_{k,i}$ according to the previous estimated state $\hat{x}_{k-1,i}$. The remote estimator utilizes a feedback channel to send an acknowledgment (ACK) signal back to the sensor, indicating whether $\hat{x}^s_{k,i}$ has been successfully transmitted ($\gamma_{k,i}=1$ or $0$).

    In order to better capture the impact of data timeliness, we introduce a definition, i.e., AoI, which is widely employed in the field of communication. It characterizes the information age (denoted by $\Delta_n$) of the most recent packet received by the remote estimator. It is calculated by

    $\Delta_n=n-T_n,$

    where $T_n$ is the generation time of the most recently received packet and $n$ is the current time [22]. For the $i$-th process, denote $\Delta_{k,i}$ as its AoI at time step $k$. Therefore, the updates of $\Delta_{k,i}$ and $\hat{x}_{k,i}$, as well as the estimation error covariance $P_{k,i}$ of its remote estimator, can be summarized by the following equations:

    $\Delta_{k,i}=\begin{cases}0, & \gamma_{k,i}=1,\\ \Delta_{k-1,i}+1, & \text{otherwise},\end{cases}$ (2.6)
    $\hat{x}_{k,i}=\begin{cases}\hat{x}^s_{k,i}, & \gamma_{k,i}=1,\\ A_i\hat{x}_{k-1,i}=A_i^{\Delta_{k,i}}\hat{x}^s_{k-\Delta_{k,i},i}, & \text{otherwise},\end{cases}$ (2.7)
    $P_{k,i}=\begin{cases}\bar{P}_i, & \gamma_{k,i}=1,\\ h_i(P_{k-1,i})=h_i^{\Delta_{k,i}}(\bar{P}_i), & \text{otherwise},\end{cases}$ (2.8)

    where the function $h_i:\mathbb{S}^n_+\to\mathbb{S}^n_+$ is defined as follows:

    $h_i(X)=A_i X A_i^T+\Sigma_{\omega_i}.$ (2.9)

    According to the proof in [28], we can get

    $\mathrm{Tr}(h_i(\bar{P}_i))\geq \mathrm{Tr}(\bar{P}_i).$

    In other words, a larger $\Delta_{k,i}$ results in a larger trace of the estimation error covariance, $\mathrm{Tr}(P_{k,i})$.
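    A minimal sketch of the remote estimator's bookkeeping in (2.6)–(2.9), assuming the steady-state covariance computed by the sensor and using our own function names:

```python
import numpy as np

def h(A, Sigma_w, X):
    """Operator h_i(X) = A X A^T + Sigma_w from (2.9)."""
    return A @ X @ A.T + Sigma_w

def remote_update(gamma, Delta_prev, P_bar, A, Sigma_w):
    """One step of the AoI update (2.6) and covariance update (2.8)."""
    if gamma == 1:                  # packet received: AoI resets, covariance is P_bar
        return 0, P_bar
    Delta = Delta_prev + 1          # packet lost: AoI grows by one
    P = P_bar
    for _ in range(Delta):          # P_k = h^{Delta_k}(P_bar)
        P = h(A, Sigma_w, P)
    return Delta, P
```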

    By introducing AoI, the system is able to determine, to some extent, whether the time delay in the transmission channel is caused by a DoS attacker or by the transmission channel itself. We will discuss this in detail in the subsequent sections.

    The objective of the attacker is to attack the transmission channels. At each time step $k$, the attacker exerts attack power $a_{k,i}\in\mathbb{R}_+$ as additional noise on the $i$-th channel, resulting in the following probability equation:

    $\Pr(\gamma_{k,i}=1\,|\,p_{k,i},\sigma_{k,i},a_{k,i})=f(p_{k,i},\sigma_{k,i}+a_{k,i}).$ (2.10)

    When $a_{k,i}$ is large, there is a higher probability of transmission failure for the estimated state, resulting in a relatively larger estimation error covariance $P_{k,i}$ and poor system performance. However, injecting attack power is costly and restricted, which forces the attacker to make a trade-off between attack performance and cost. In other words, the attacker should increase the estimation error covariance at a potentially low cost. Assume that the attacker can eavesdrop on the ACK signal from the feedback channels and on the estimation error covariance from the transmission channels. Without loss of generality, suppose the attack begins at time step $k=1$. Then, the attacker needs to consider the following optimization problem:

    $\max\limits_{a_1,a_2,\ldots}\ \mathbb{E}\Big(\sum\limits_{k=1}^{+\infty}\beta^{k-1}\big[\omega_A\mathrm{Tr}(P_k)-\beta_A(1-\omega_A)\|a_k\|_1\big]\Big)\quad \text{s.t.}\ a_k\in\mathbb{R}^N_+,\ \|a_k\|_1\leq\bar{a},\ \forall k\in\mathbb{N}^+,$ (2.11)

    where $P_k$ represents a diagonal matrix composed of the covariance of the estimation error for each process, given by:

    $P_k=\mathrm{diag}\{P_{k,1},P_{k,2},\ldots,P_{k,N}\}.$ (2.12)

    In addition,

    $a_k=[a_{k,1},\ldots,a_{k,i},\ldots,a_{k,N}]^T$

    represents the attack power at time step $k$ for the $N$ channels, in which $a_{k,i}$ is the attack power on the $i$-th channel, and $\omega_A\in[0,1]$ represents a weight that measures the trade-off between the attack performance and the attack cost. The positive constants $\bar{a}$ and $\beta_A$ represent the upper bound for the total attack power in a time step and the cost of the allocation per unit of attack power, respectively.

    The variable $\beta\in[0,1)$ is the discount rate. To prevent the objective function from tending to infinity regardless of the selected attack policy, similar to [29], we assume that $\mathbb{E}(\mathrm{Tr}(P_k))$ is bounded. Additionally, for each $i$, we assume that

    $\rho^2(A_i)>1/\beta.$
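    To make the structure of the attacker's objective (2.11) concrete, the discounted sum can be accumulated along a finite sample path as sketched below; the trajectory data are placeholders, not results from the paper.

```python
import numpy as np

def attacker_return(P_traj, a_traj, omega_A, beta_A, beta):
    """Discounted objective of (2.11) over a finite trajectory:
    sum_k beta^(k-1) * [omega_A * Tr(P_k) - beta_A * (1 - omega_A) * ||a_k||_1]."""
    total = 0.0
    for k, (P_k, a_k) in enumerate(zip(P_traj, a_traj), start=1):
        stage = omega_A * np.trace(P_k) - beta_A * (1 - omega_A) * np.sum(np.abs(a_k))
        total += beta ** (k - 1) * stage
    return total
```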

    In the presence of a DoS attack, each sensor needs to take defensive measures. Assuming that each sensor has the capability to inject defense power $p_{k,i}$ to make the transmission more likely to succeed, such as increasing the transmission power at an additional cost to counteract the noise, the problem becomes a game between both sides (the defender and the attacker). With this in mind, from the perspective of the defender, we focus on optimizing the defender's power, i.e.,

    $\max\limits_{p_1,p_2,\ldots}\ \mathbb{E}\Big(\sum\limits_{k=1}^{+\infty}\beta^{k-1}\big[-\omega_D\mathrm{Tr}(P_k)-\beta_D(1-\omega_D)\|p_k\|_1\big]\Big)\quad \text{s.t.}\ p_k\in\mathbb{R}^N_+,\ \|p_k\|_1\leq\bar{p},\ \forall k\in\mathbb{N}^+,$ (2.13)

    where

    $p_k=[p_{k,1},\ldots,p_{k,i},\ldots,p_{k,N}]^T$

    is the defense power at time step $k$ for the $N$ channels, in which $p_{k,i}$ is the defense power of the $i$-th channel, and $\omega_D$ represents a weight measuring the trade-off between defense performance and defense cost. The positive constants $\bar{p}$ and $\beta_D$ are the upper bound for the total defense power in a time step and the cost of the allocation per unit of defense power, respectively. Therefore, we can describe our game-theoretic model by considering the following optimization problem:

    Problem 1. Game-theoretic framework between a defender and an attacker under DoS attack.

    For the attacker,

    $\max\limits_{a_1,a_2,\ldots}\ \mathbb{E}\Big(\sum\limits_{k=1}^{+\infty}\beta^{k-1}\big[\omega_A\mathrm{Tr}(P_k)-\beta_A(1-\omega_A)\|a_k\|_1\big]\Big)\quad \text{s.t.}\ a_k\in\mathbb{R}^N_+,\ \|a_k\|_1\leq\bar{a},\ \forall k\in\mathbb{N}^+.$ (2.14)

    For the defender,

    $\max\limits_{p_1,p_2,\ldots}\ \mathbb{E}\Big(\sum\limits_{k=1}^{+\infty}\beta^{k-1}\big[-\omega_D\mathrm{Tr}(P_k)-\beta_D(1-\omega_D)\|p_k\|_1\big]\Big)\quad \text{s.t.}\ p_k\in\mathbb{R}^N_+,\ \|p_k\|_1\leq\bar{p},\ \forall k\in\mathbb{N}^+.$ (2.15)

    Remark 1. At each time step $k$, both the defender and the attacker select their power levels $p_k$ and $a_k$, respectively, to impact the transmission channel based on their previous knowledge of the ACK signal. Unlike the research in [29], which demands full knowledge of the channel model and of the sensor's transmission power on the attacker's side, both sides here only require the ACK signal and the estimation error covariance received in the previous time step.

    There are two challenges in solving this game-theoretic optimization problem. First, both sides lack information about the power allocation chosen by their opponent in the current time step, making it an incomplete-information dynamic game problem. Second, the power allocation of each side is an $N$-dimensional vector in a continuous domain, leading to an infinite number of feasible choices.

    We intend to utilize deep reinforcement learning to solve Problem 1. Before introducing our method, we will first establish an MDP framework for Problem 1.

    The MDP is defined by a tuple $(\mathcal{S},\mathcal{A},\mathcal{P},r,\beta)$, where $\mathcal{S}$ represents the state space, $\mathcal{A}$ denotes the action space,

    $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to[0,1]$

    is the transition probability function,

    $r:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to\mathbb{R}$

    is the reward function, and $\beta\in[0,1)$ is the discount rate. For Problem 1, the elements of the MDP model are as follows:

    ● State space $\mathcal{S}$: Since the defender and the attacker only have the previous state information, the state at time step $k$ is defined as

    $\Delta_{k-1}=[\Delta_{k-1,1},\Delta_{k-1,2},\ldots,\Delta_{k-1,N}]^T\in\mathbb{R}^N_+.$

    The state space includes all possible vectors, where all components $\Delta_{k-1,1},\Delta_{k-1,2},\ldots,\Delta_{k-1,N}$ are non-negative real numbers.

    Remark 2. The state here represents the information for the defender and attacker to design the power allocation, which is different from the state of the dynamic systems (2.1) and (2.2).

    ● Action space $\mathcal{A}$: The global action for the allocation of attack and defense power at time step $k$ is the set of all possible choices of $2N$-dimensional vectors which concatenate the defense power $p_k$ and the attack power $a_k$, denoted by

    $O_k=[(p_k)^T,(a_k)^T]$

    with

    $0\leq\|p_k\|_1\leq\bar{p}$

    and

    $0\leq\|a_k\|_1\leq\bar{a}.$

    ● Transition function $\mathcal{P}$: The transition function is denoted by:

    $\mathcal{P}=\Pr(\Delta_k\,|\,\Delta_{k-1},p_k,a_k)=\prod\limits_{i=1}^{N}\Pr\nolimits_i,$ (3.1)

    where

    $\Pr\nolimits_i=\begin{cases}f(p_{k,i},\sigma_{k,i}+a_{k,i}), & \Delta_{k,i}=0,\\ 1-f(p_{k,i},\sigma_{k,i}+a_{k,i}), & \Delta_{k,i}=\Delta_{k-1,i}+1,\\ 0, & \text{otherwise}.\end{cases}$ (3.2)

    For simplicity, we assume that these $N$ transmission channels are independent in the simulation. However, if this assumption does not hold and there is mutual interference between transmission channels, the corresponding function $f$ will be replaced by:

    $f\Big(p_{k,i},\ \sigma_{k,i}+a_{k,i}+\sum\limits_{j=1,j\neq i}^{N}E^D_{ij}p_{k,j}+\sum\limits_{j=1,j\neq i}^{N}E^A_{ij}a_{k,j}\Big),$ (3.3)

    where $E^D_{ij}$ and $E^A_{ij}$ are the interference correlation coefficients of the defender and the attacker, respectively. Nevertheless, our learning-based method is suitable for both situations because it does not require either the defender or the attacker to possess any knowledge of the transition function (a sketch of how one transition can be simulated is given after the reward definitions below).

    ● Reward: The one-step reward is defined as:

    (1) For the defender:

    $r_D(\Delta_{k-1},p_k,\Delta_k)=-\omega_D\mathrm{Tr}(P_k)-\beta_D(1-\omega_D)\|p_k\|_1.$ (3.4)

    (2) For the attacker:

    $r_A(\Delta_{k-1},a_k,\Delta_k)=\omega_A\mathrm{Tr}(P_k)-\beta_A(1-\omega_A)\|a_k\|_1.$ (3.5)
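    Under the independence assumption, one environment step can be simulated channel by channel and the one-step rewards evaluated directly; the sketch below follows (3.1), (3.2), (3.4), and (3.5), with the channel-model function f left abstract and all function names being ours.

```python
import numpy as np

def transition(Delta_prev, p, a, sigma, f, rng):
    """Sample Delta_k from (3.1)-(3.2): with independent channels, channel i
    resets its AoI to 0 with probability f(p_i, sigma_i + a_i), else ages by one."""
    probs = np.array([f(p[i], sigma[i] + a[i]) for i in range(len(Delta_prev))])
    success = rng.random(len(Delta_prev)) < probs
    return np.where(success, 0.0, np.asarray(Delta_prev) + 1.0)

def one_step_rewards(P_k, p_k, a_k, omega_D, omega_A, beta_D, beta_A):
    """One-step rewards (3.4)-(3.5), with P_k = diag(P_{k,1}, ..., P_{k,N})."""
    tr = np.trace(P_k)
    r_D = -omega_D * tr - beta_D * (1 - omega_D) * np.sum(np.abs(p_k))
    r_A = omega_A * tr - beta_A * (1 - omega_A) * np.sum(np.abs(a_k))
    return r_D, r_A
```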

    We denote the stationary strategies of the attacker and the defender as two functions which map the state $\Delta_{k-1}$ to the powers $a_k$ and $p_k$, as

    $\pi_A:\mathcal{S}\to\mathcal{A},\quad \pi_A\in\Pi_A$

    and

    $\pi_D:\mathcal{S}\to\mathcal{A},\quad \pi_D\in\Pi_D,$

    respectively, where $\Pi_A$ and $\Pi_D$ are the policy spaces that include all feasible stationary strategies of the attacker and the defender. Problem 1 can be converted into the consideration of the stationary strategies for the MDP model [30] as follows:

    Problem 2. For the attacker:

    $\max\limits_{\pi_A\in\Pi_A}\ \mathbb{E}\Big(\sum\limits_{k=1}^{\infty}\beta^{k-1}r_A(\Delta_{k-1},\pi_A(\Delta_{k-1}),\Delta_k)\Big).$ (3.6)

    For the defender:

    $\max\limits_{\pi_D\in\Pi_D}\ \mathbb{E}\Big(\sum\limits_{k=1}^{\infty}\beta^{k-1}r_D(\Delta_{k-1},\pi_D(\Delta_{k-1}),\Delta_k)\Big).$ (3.7)

    Remark 3. Denote $\pi_D^*\in\Pi_D$ and $\pi_A^*\in\Pi_A$ as the optimal strategies for the defender and the attacker, respectively. Based on [31], the existence of a stationary and deterministic optimal strategy can be proved. Define the Q-value functions $Q^{\pi_D}(\Delta,p)$ and $Q^{\pi_A}(\Delta,a)$ of the MDP under the strategies $\pi_D$ and $\pi_A$, respectively, as:

    $Q^{\pi_A}(\Delta,a)=\mathbb{E}\Big(r_A(\Delta_0,a_1,\Delta_1)+\sum\limits_{k=2}^{\infty}\beta^{k-1}r_A(\Delta_{k-1},\pi_A,\Delta_k)\,\Big|\,\Delta_0=\Delta,\ a_1=a\Big),$ (3.8)
    $Q^{\pi_D}(\Delta,p)=\mathbb{E}\Big(r_D(\Delta_0,p_1,\Delta_1)+\sum\limits_{k=2}^{\infty}\beta^{k-1}r_D(\Delta_{k-1},\pi_D,\Delta_k)\,\Big|\,\Delta_0=\Delta,\ p_1=p\Big),$ (3.9)

    where $\Delta_0$ is the initial state. The functions $Q^{\pi_A}(\Delta,a)$ and $Q^{\pi_D}(\Delta,p)$ represent the expected cumulative reward from $\Delta$ when the attacker and defender take actions $a$ and $p$, respectively, and follow their strategies $\pi_A$ and $\pi_D$ in the subsequent steps. Similarly, the optimal Q-value functions for the attacker and defender under their respective optimal strategies are denoted as $Q^{\pi_A^*}(\Delta,a)$ and $Q^{\pi_D^*}(\Delta,p)$. According to [31], if the state space and the action space are sufficiently small and discrete, $Q^{\pi_A^*}(\Delta,a)$ and $Q^{\pi_D^*}(\Delta,p)$ can be obtained easily by value iteration or policy iteration algorithms [32] when the tuple $(\mathcal{S},\mathcal{A},\mathcal{P},r,\beta)$ is accessible.
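    For intuition, when the model is known and the state and action spaces are small and discrete, the value-iteration route mentioned above is straightforward; a generic single-agent sketch (precisely the setting Problem 2 does not enjoy) is given below, with the tensors P and r as assumed inputs.

```python
import numpy as np

def value_iteration(P, r, beta, tol=1e-8):
    """Value iteration for a small finite MDP.
    P[s, a, s'] is the transition probability and r[s, a] the expected reward."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = r + beta * P @ V          # Q[s, a] = r[s, a] + beta * sum_s' P[s, a, s'] V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q, V_new           # optimal Q-values and state values
        V = V_new
```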

    However, in Problem 2, there still exist three problems. One is that the action space is continuous, which leads to infinitely many strategies. Another is that neither side has any knowledge of its opponent's actions. The last one is that the transition function $\mathcal{P}$ is determined by the actions $a_k$ and $p_k$ of both sides, which means each side must consider its opponent's action when selecting its own. In the following subsection, we present a MADDPG-based power allocation design algorithm that efficiently addresses these challenges.

    In order to deal with the large state space and the continuous action space, a deep learning-based algorithm is introduced, which utilizes deep neural networks (DNNs) to parameterize the Q-value function. We divide the introduction of the MADDPG algorithm into several parts, i.e., the actor-critic algorithm, the replay buffer, double DNNs, and the updating rules.

    The actor-critic algorithm [33] is composed of two DNNs, i.e., an actor and a critic. It is the foundation of the MADDPG-based power allocation design algorithm. The actor contains a policy network whose structure is given in Figure 2. The input of the network is the state $\Delta$ ($\Delta$ is a time-varying variable, denoted as $\Delta(k)=\Delta_k$, where $k$ represents the time step). Through several hidden layers, such as activation function layers and fully connected layers, it outputs the corresponding action ($p$ or $a$).

    Figure 2.  Policy network architecture of the actor.

    The critic includes an evaluation network, which is shown in Figure 3. It is responsible for estimating the Q-value of the tuple $(\Delta,O)$ ($O$ is also a time-varying variable, similar to $\Delta$) under the current strategy $\pi$ ($\pi_D$ and $\pi_A$). The inputs of the critic are composed of three components: the state $\Delta$, the action generated from the actor, and the estimation of the other agents' actions.

    Figure 3.  Evaluation network architecture of the critic.

    Since the MADDPG algorithm takes into account the actions of other agents when considering the action of one agent, it needs to integrate the Q-value function and the actions of the multiple agents. Denote the centralized action-value function of agent $i$ as

    $Q_i^{\pi}(\Delta,O)=Q_i^{\pi}(\Delta,(p,a)),$

    where the inputs are the actions of all agents and the state information.

    Remark 4. Notice that there exist only two agents (the defender and the attacker) in Problem 1, and we define the defender and the attacker as agent 1 and agent 2, respectively (naturally, we set the number of agents as $n=2$). Correspondingly, denote

    $r_1=r_D,\quad r_2=r_A,\quad r_k=[r_1,\ r_2],\quad \alpha_1=p,\quad \alpha_2=a$

    for notational simplicity.

    We denote the parameters of the policy network in the actor and of the evaluation network in the critic of the $i$-th agent as $\theta_i^{\mu}$ and $\theta_i^{Q}$, respectively. Thus, the output of the policy network is given by:

    $\alpha_i=\mu_{\theta_i^{\mu}}(\Delta)=F_a^L\circ F_c^L\circ\cdots\circ F_a^1\circ F_c^1(\Delta),$ (3.10)

    where $\circ$ is the composition operation of functions,

    $F_c^i(x)=\theta_{\omega}^i x+\theta_b^i,\quad i=1,2,\ldots,L.$

    The parameters $\theta_{\omega}^i$ and $\theta_b^i$ are the weights and biases. The form of $F_a^i$ is determined by the selection of each activation function. The variable $L$ represents the number of hidden layers. Similarly, denote the output of the evaluation network as:

    $Q_{\theta_i^{Q}}(\Delta,O)=F_a^L\circ F_c^L\circ\cdots\circ F_a^1\circ F_c^1(\Delta,O),$ (3.11)

    where the inputs are the state and the actions of all agents.
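    A minimal PyTorch sketch of the two networks is given below. It follows the compositional form (3.10)–(3.11) and the layer sizes later reported in Table 2 (two 64-unit ReLU hidden layers and a sigmoid output for the actor); scaling the sigmoid output by the power budget is our assumption, and the authors' exact implementation may differ.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network mu_theta: state Delta -> action alpha in [0, p_max]^N."""
    def __init__(self, state_dim, action_dim, p_max):
        super().__init__()
        self.p_max = p_max
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Sigmoid(),   # per-channel power in [0, 1]
        )

    def forward(self, delta):
        return self.p_max * self.net(delta)            # scale to the power budget

class Critic(nn.Module):
    """Evaluation network Q_theta(Delta, O): state plus all agents' actions -> scalar."""
    def __init__(self, state_dim, joint_action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_action_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, delta, joint_action):
        return self.net(torch.cat([delta, joint_action], dim=-1))
```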

    Before giving the updating rules, we introduce two techniques that the MADDPG algorithm utilizes to increase the stability of the algorithms, namely, the replay buffer and double DNNs:

    ● Replay buffer: In MADDPG, the tuple $(\Delta_{k-1},O_k,\Delta_k,r_k)$ obtained at time step $k$ is stored in a buffer with a fixed size $B$. If the number of stored tuples exceeds $B$, the oldest one is replaced by the newest one. In the updating process, a mini-batch of tuples from the replay buffer is sampled randomly to calculate the corresponding gradient and error. The introduction of the replay buffer solves the problem of the sequential correlation of the data, as long as the buffer size $B$ is large enough.

    ● Double DNNs: Based on DDPG, MADDPG also employs two DNNs, i.e., an online network and a target network, in both the actor and the critic. The structures of these two DNNs are the same, but the updating rules of their parameters are quite different. For the two online networks, the parameters $\theta_i^{\mu}$ and $\theta_i^{Q}$ are updated at every time step $k$. For the two target networks, the parameters $\theta_i^{\mu'}$ and $\theta_i^{Q'}$ are softly updated by the following rules:

    $\theta_i^{\mu'}\leftarrow\tau\theta_i^{\mu}+(1-\tau)\theta_i^{\mu'},\qquad \theta_i^{Q'}\leftarrow\tau\theta_i^{Q}+(1-\tau)\theta_i^{Q'},$ (3.12)

    where $0<\tau\ll 1$ is a small update rate, also known as the filter parameter. The soft updating rules are employed to enhance the stability of the learning performance, as the target network gradually synchronizes with the online network. The relationship between the online network and the target network is shown in Figure 4; a minimal sketch of both techniques is given after the figure.

    Figure 4.  DNNs in the actor and the critic.
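    A minimal sketch of the two techniques, assuming the tuple layout $(\Delta_{k-1},O_k,\Delta_k,r_k)$ described above; the names are ours.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (Delta_prev, O, Delta, r) tuples; once the capacity B
    is reached, the oldest tuple is overwritten by the newest one."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, delta_prev, joint_action, delta, rewards):
        self.buffer.append((delta_prev, joint_action, delta, rewards))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def soft_update(target_net, online_net, tau):
    """Soft update (3.12): theta' <- tau * theta + (1 - tau) * theta'."""
    for tp, op in zip(target_net.parameters(), online_net.parameters()):
        tp.data.copy_(tau * op.data + (1.0 - tau) * tp.data)
```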

    Different from the DDPG algorithm [34], the MADDPG algorithm incorporates estimated information about the other agents into the critic network of each agent. This is done to eliminate the assumption of knowing the accurate information of other agents and to address the issue that the Q-function depends on the actions of all agents. Each agent $i$ needs to maintain an approximation $\hat{\mu}_{\theta_i^j}$ (where $\theta_i^j$ represents the parameter of the approximation; for simplicity, this approximation is denoted by $\hat{\mu}_i^j$) of the actual policy $\mu_j$ of the $j$-th agent. This approximation is learned by maximizing the natural logarithm of the probability of agent $j$'s action, with an entropy regularizer, and the loss function $\mathcal{L}$ of the critic is as follows:

    $\mathcal{L}(\theta_i^j)=-\mathbb{E}_{\Delta_j,\alpha_j}\big[\log\hat{\mu}_i^j(\alpha_j|\Delta_j)+\lambda H_e(\hat{\mu}_i^j)\big],$ (3.13)

    where $H_e$ represents the entropy of the policy distribution, the symbol $\lambda$ denotes the coefficient for entropy regularization, and $\lambda H_e(\hat{\mu}_i^j)$ refers to the aforementioned entropy regularization term. Thus, the estimated Q-value $\hat{y}$ of the target network of the critic can be calculated as follows:

    $\hat{y}=r_i+\gamma Q_i^{\mu'}\big(\Delta',\hat{\mu}_i'^1(\Delta_1),\ldots,\mu_i'^i(\Delta_i),\ldots,\hat{\mu}_i'^n(\Delta_n)\big),$ (3.14)

    where $\hat{\mu}_i'^j$ is the target network of the approximate policy $\hat{\mu}_i^j$, $j\neq i$, and $\Delta'$ denotes the next state. Therefore, the whole set of updating rules is given by:

    (1) Critic update (evaluation network update): for agent $j$, $j\in\{1,2\}$,

    $\mathcal{L}(\theta_i^j)=-\mathbb{E}_{\Delta_j,\alpha_j}\big[\log\hat{\mu}_i^j(\alpha_j|\Delta_j)+\lambda H_e(\hat{\mu}_i^j)\big],\qquad \hat{y}=r_i+\gamma Q_i^{\mu'}\big(\Delta',\hat{\mu}_i'^1(\Delta_1),\ldots,\mu_i'^i(\Delta_i),\ldots,\hat{\mu}_i'^n(\Delta_n)\big).$ (3.15)

    (2) Actor update (policy network update):

    $\theta_i^{\mu}\leftarrow\theta_i^{\mu}+\nabla_{\theta_i^{\mu}}J,\qquad \nabla_{\theta_i^{\mu}}J\approx\frac{1}{S}\sum\limits_{j}\nabla_{\theta_i^{\mu}}\mu_{\theta_i^{\mu}}(\Delta_i^j)\times\nabla_{\alpha_i}Q_{\theta_i^{Q}}(\Delta^j,\alpha_1^j,\ldots,\alpha_i,\ldots,\alpha_n^j)\big|_{\alpha_i=\mu_{\theta_i^{\mu}}(\Delta_i^j)},$ (3.16)

    where $S$ is the size of the mini-batch samples, and $\nabla_{\theta_i^{\mu}}J$ denotes the update estimated through policy gradients.
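    Putting the pieces together, one critic/actor update for agent $i$ over a sampled mini-batch could look as follows. This is a schematic sketch of (3.15)–(3.16) rather than the authors' implementation: it assumes each agent object carries its networks, target networks, optimizers, and an index, it reuses the Actor/Critic sketch above, and, for brevity, it queries the other agent's actual policy where the paper uses the learned approximation $\hat{\mu}_i^j$.

```python
import torch
import torch.nn.functional as F

def maddpg_update(agent_i, agents, batch, gamma, tau):
    """One MADDPG step for agent_i: regress the centralized critic onto the
    TD target, then follow the deterministic policy gradient of (3.16)."""
    delta_prev, joint_act, delta, rew = batch   # batched tensors from the replay buffer

    # Critic update: y = r_i + gamma * Q'_i(Delta_k, mu'_1(Delta_k), ..., mu'_n(Delta_k)).
    with torch.no_grad():
        next_acts = torch.cat([ag.target_actor(delta) for ag in agents], dim=-1)
        y = rew[:, agent_i.idx:agent_i.idx + 1] + gamma * agent_i.target_critic(delta, next_acts)
    critic_loss = F.mse_loss(agent_i.critic(delta_prev, joint_act), y)
    agent_i.critic_opt.zero_grad(); critic_loss.backward(); agent_i.critic_opt.step()

    # Actor update: ascend the gradient of Q_i with respect to agent_i's own action.
    acts = [ag.actor(delta_prev) if ag is agent_i else ag.actor(delta_prev).detach()
            for ag in agents]
    actor_loss = -agent_i.critic(delta_prev, torch.cat(acts, dim=-1)).mean()
    agent_i.actor_opt.zero_grad(); actor_loss.backward(); agent_i.actor_opt.step()

    # Soft updates (3.12) of the target networks.
    for target, online in ((agent_i.target_critic, agent_i.critic),
                           (agent_i.target_actor, agent_i.actor)):
        for tp, op in zip(target.parameters(), online.parameters()):
            tp.data.copy_(tau * op.data + (1.0 - tau) * tp.data)
```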

    The module of the MADDPG algorithm based on the actor-critic network is shown in Figure 5. The MADDPG algorithm for solving Problem 2 is given in Algorithm 1.

    Figure 5.  Framework of the MADDPG-based DoS attack game problem.

    However, there still exist some other problems. For example, consider the scenario where the remote estimator receives a data packet with an AoI that is older than the one received in the previous time step, due to transmission channel time delays. In other words, when

    $\Delta_k>\Delta_{k-1}+1,\quad \Delta_{k-1}\neq 0,$

    which means that the data packet received at time step $k-1$ was generated more recently. In this case, Algorithm 1 may directly overlook the problem and utilize the older data, resulting in a degradation of performance.

    Remark 5. For Algorithm 1, it would be very easy to replace the AoI with the definition of holding time [1] without affecting the algorithm's results, as the one-step reward and generated actions at every time step are the same. However, this still does not solve the abovementioned problem of the freshness of data. In other words, for Algorithm 1, using the definition of holding time and the definition of AoI to derive the algorithm will lead to the same results. In subsequent experimental comparisons, we will consider Algorithm 1 to be representative of the algorithm derived from the holding time.

    Algorithm 1: MADDPG-based DoS attack game-theoretic power design.
    1 For the given system, set the number of agents $n$, the number of episodes $M$, and the maximum iteration number $\mathrm{Iter}_{\max}$, respectively.
    2 Set the maximum powers $\bar{p}$ and $\bar{a}$. Set the size of the replay buffer $B$ and the size of the mini-batch $S$, respectively. Set a discount rate $\gamma$ as well as the weights of the reward functions $\omega_D$ and $\omega_A$.
    3 For each agent $i$:
    4 Initialization: the filter parameter $\tau$, the noise decay rate $\alpha_\eta$, the actor network $\mu_{\theta_i^{\mu}}$, and the critic network $Q_{\theta_i^{Q}}$ with weights $\theta_i^{\mu}$ and $\theta_i^{Q}$, respectively; the target actor network $\mu_{\theta_i^{\mu'}}$ and the target critic network $Q_{\theta_i^{Q'}}$ with weights $\theta_i^{\mu'}=\theta_i^{\mu}$ and $\theta_i^{Q'}=\theta_i^{Q}$, respectively.
    5 for episode $=1$ to $M$ do
    25 end

    As mentioned in the previous section, measures should be taken to deal with the problem caused by time delays. This is solved by modifying the received state $\Delta$ with the following rule:

    $\Delta_{k,i}=\begin{cases}\Delta_{k,i}, & \Delta_{k,i}\leq\Delta_{k-1,i}+1,\\ \Delta_{k-1,i}+1, & \text{otherwise}.\end{cases}$ (3.17)

    In addition, notice that the power chosen during the process is calculated by the policy gradient with an exploration noise, which may sometimes lie beyond the feasible domain. Thus, two feasibility layers are added to our MADDPG algorithm. The layers have three functions:

    (1) Decide the freshness of the current data by AoI.

    (2) Map the infeasible actions to the feasible ones.

    (3) Make the action provided by the actor network lie, as much as possible, within the feasible domain by adding a penalty term to the reward function.

    In our new algorithm, the feasibility layers are added after the network receives the state and after the actor network outputs the actions, respectively. We choose scale mapping as the method to generate the feasible action, which is as follows:

    $\alpha_{1,i}=\begin{cases}\alpha_{1,i}, & \|p_k\|_1\leq\bar{p},\\ \bar{p}\times\dfrac{p_{k,i}}{\|p_k\|_1},\ i=1,\ldots,N, & \|p_k\|_1>\bar{p},\end{cases}$ (3.18)

    and

    $\alpha_{2,i}=\begin{cases}\alpha_{2,i}, & \|a_k\|_1\leq\bar{a},\\ \bar{a}\times\dfrac{a_{k,i}}{\|a_k\|_1},\ i=1,\ldots,N, & \|a_k\|_1>\bar{a},\end{cases}$ (3.19)

    where $\alpha_{1,i}$ and $\alpha_{2,i}$ represent the $i$-th components of the vectors $\alpha_1$ and $\alpha_2$, respectively.

    Furthermore, a penalty term $\lambda_D I_D$ or $\lambda_A I_A$ is added to the reward function to motivate the actor network to generate feasible actions, as follows: for the defender,

    $r_1(\Delta_{k-1},p_k,\Delta_k)=-\omega_D\mathrm{Tr}(P_k)-\beta_D(1-\omega_D)\|p_k\|_1-\lambda_D I_D.$ (3.20)

    For the attacker,

    $r_2(\Delta_{k-1},a_k,\Delta_k)=\omega_A\mathrm{Tr}(P_k)-\beta_A(1-\omega_A)\|a_k\|_1-\lambda_A I_A,$ (3.21)

    where

    $I_D=\max(0,\|p_k\|_1-\bar{p})$

    and

    $I_A=\max(0,\|a_k\|_1-\bar{a})$

    are the infeasible parts of the given actions of the agents. The variables $\lambda_D\in\mathbb{R}_+$ and $\lambda_A\in\mathbb{R}_+$ represent the penalty weights of $I_D$ and $I_A$, respectively. The scale mapping method described in (3.18) and (3.19) is reasonable and has little influence on Problem 2. First, a feasible action generated by the networks is not changed. Second, if the generated action exceeds the feasible domain, the reward will obviously decrease because of the large penalty term, which causes the agent to reduce its attempts to take actions beyond the feasible domain. Therefore, our MADDPG-based algorithm becomes Algorithm 2.

    Algorithm 2: Feasible MADDPG-based DoS attack game-theoretic power design.
    1 Set $n$, $M$, $\mathrm{Iter}_{\max}$, $\bar{p}$, $\bar{a}$, $\gamma$, $\omega_D$, $\omega_A$, $S$, and $B$.
    2 For each agent $i$:
    3 Initialization: $\tau$, $\alpha_\eta$, $\mu_{\theta_i^{\mu}}$, and $Q_{\theta_i^{Q}}$ with weights $\theta_i^{\mu}$ and $\theta_i^{Q}$, respectively; $\mu_{\theta_i^{\mu'}}$ and $Q_{\theta_i^{Q'}}$ with weights $\theta_i^{\mu'}=\theta_i^{\mu}$ and $\theta_i^{Q'}=\theta_i^{Q}$, respectively; the penalty weights $\lambda_A$ and $\lambda_D$.
    4 for episode $=1$ to $M$ do
    35 end
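    The feasibility layers used in Algorithm 2 can be summarized compactly: the first applies the AoI correction (3.17) to the received state, and the second scale-maps an infeasible power vector via (3.18)–(3.19) while reporting the violation used in the penalty terms (3.20)–(3.21). The sketch below is illustrative and uses our own function names.

```python
import numpy as np

def aoi_feasibility_layer(delta, delta_prev):
    """Rule (3.17): discard stale packets by capping Delta_{k,i} at Delta_{k-1,i} + 1."""
    return np.minimum(np.asarray(delta), np.asarray(delta_prev) + 1)

def power_feasibility_layer(alpha, budget):
    """Scale mapping (3.18)-(3.19): project the action onto the l1 power budget and
    return the infeasible part I = max(0, ||alpha||_1 - budget) for the penalty."""
    alpha = np.asarray(alpha, dtype=float)
    total = np.sum(np.abs(alpha))
    infeasible = max(0.0, total - budget)
    if total > budget:
        alpha = budget * alpha / total
    return alpha, infeasible
```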

    We consider a cyber-physical system that involves two processes with transmission delays. The parameters of the system are given in Table 1, where "No." is an abbreviation for "number". The transmission delay is set as

    $t_{d,i}\in\{0,1,2\},\quad i=1,2.$
    Table 1.  Multi-process system parameters.
    Processes  $A_i$  $C_i$  $\Sigma_{\omega_i}$  $\Sigma_{\upsilon_i}$  $\sigma_i$
    $i=1$  1.1  1.2  0.8  0.6  0.8
    $i=2$  1.1  1.0  0.9  0.9  1.0


    Assume the model of the wireless transmission channel is an AWGN channel [29]. Thus, the probability function $f$ in (2.5) is given by:

    $f(p^s_{i,k},p^a_{i,k})=1-2g\left(\dfrac{\delta_s p^s_{i,k}}{\delta_a p^a_{i,k}+\sigma_i^2}\right),\quad i=1,\ldots,N,$ (4.1)

    where

    $g(x)=\dfrac{1}{\sqrt{2\pi}}\int_x^{+\infty}\exp(-v^2/2)\,dv.$

    The variable $\sigma_i^2$ represents the white noise power of the $i$-th channel, and the positive parameters $\delta_s$ and $\delta_a$ characterize the performance of the specific transmission channel under defense and DoS attack, respectively. We assume that the upper bound of the state of each process is 9 to ensure at least one successful transmission within a limited number of time steps; in other words, when the AoI exceeds 9, the transmission will definitely succeed at the next time step. This assumption has a negligible influence on the problem, as proved in [21]. The maximum powers of the defender and the attacker are set as $\bar{p}=10$ and $\bar{a}=10$, respectively. The discount rate is set as $\beta=0.9$. The network parameters of the MADDPG-based game algorithm are summarized in Table 2, and the learning parameters are displayed in Table 3. All simulations were run on a computer with an Intel i7-10875H CPU and 16 GB of RAM.
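    The success probability (4.1) can be evaluated with the Gaussian tail function; the sketch below assumes the argument of $g$ is the signal-to-interference-plus-noise ratio $\delta_s p^s_{i,k}/(\delta_a p^a_{i,k}+\sigma_i^2)$, consistent with (4.1) and the monotonicity assumptions on $f$, and treats $\delta_s$ and $\delta_a$ as illustrative coefficients.

```python
import math

def g(x):
    """Gaussian tail g(x) = (1 / sqrt(2*pi)) * integral_x^inf exp(-v^2 / 2) dv."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def f_awgn(p_def, p_att, sigma2, delta_s=1.0, delta_a=1.0):
    """Success probability (4.1): f = 1 - 2 * g(delta_s * p / (delta_a * a + sigma^2)).
    Increases with the defense power and decreases with attack power and noise."""
    ratio = delta_s * p_def / (delta_a * p_att + sigma2)
    return 1.0 - 2.0 * g(ratio)
```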

    Table 2.  Network parameters.
    Parameters Policy Network Evaluation Network
    No. of inputs Dim(Δ) Dim(Δ)+Dim(O)
    No. of hidden layers 3 2
    No. of nodes in hidden layer 1 64 64
    No. of nodes in hidden layer 2 64 64
    No. of nodes in hidden layer 3 Dim(α) ×
    No. of outputs Dim(α) 1
    Activation function 1 ReLU ReLU
    Activation function 2 ReLU ReLU
    Activation function 3 Sigmoid ×

    Table 3.  Learning parameters.
    Parameters  $n$  $M$  $\mathrm{Iter}_{\max}$  $\bar{p}$  $\bar{a}$  $\gamma$  $\lambda_A$  $\lambda_D$  $\tau$  $\alpha_\eta$  $B$  $S$
    Value  2  $10^3$  200  10  10  0.95  10  10  0.01  0.999  $10^4$  10


    The performance of the attacker and the defender is measured by the cumulative discounted reward and the mutual Bayesian Nash equilibrium strategies under different states. We set the initial weights

    $\omega_A=\omega_D=0.8$

    with

    $\beta_A=0.48,\quad \beta_D=0.67,$

    and the experimental results are shown in Figure 6. After 100 episodes, it can be observed that the cumulative discounted rewards of both the attacker and the defender reach a relatively stable stage.

    Figure 6.  Attacker and defender performance with βA=0.48 and βD=0.67 in two different processes in Table 1.

    The attacker's cumulative discounted reward fluctuated between 0 and 20, while the defender's cumulative discounted reward fluctuated between -20 and 0.

    Based on the most recently updated neural networks, their Bayesian Nash equilibrium strategies can be calculated as shown in Figures 7 and 8, where each scatter point represents the optimal power allocation under its corresponding state $(\Delta_1,\Delta_2)$ (the x-axis and y-axis correspond to the states of process 1 and process 2, respectively, while the legend provides precise values). The reason that both the attacker and the defender choose to focus on process 1 instead of process 2 lies in the settings of the system parameters.

    Figure 7.  Attacker's optimal strategy with βA=0.48 and βD=0.67 in two different processes in Table 1.
    Figure 8.  Defender's optimal strategy with βA=0.48 and βD=0.67 in two different processes in Table 1.

    After conducting the simulation with

    $\beta_A=0.48 \quad \text{and} \quad \beta_D=0.67,$

    we change the power discount rates to

    $\beta_A=0.55 \quad \text{and} \quad \beta_D=0.68$

    in two identical processes, both set as process 2 in Table 1. The corresponding results are shown in Figures 9–11. From the attacker's perspective (Figure 10), it tends to allocate little power to attacking the channels when the state is (0,0). However, if the state is not (0,0), the attacker puts most of its emphasis on attacking channel 1. The optimal strategy of the defender is much more complex than the attacker's. We present it in Figure 11 and show a part of the optimal strategies in Table 4. Affected by the attacker's strategy, the defender also pays more attention to channel 1.

    Figure 9.  Attacker and defender performance with βA=0.55 and βD=0.68 in the two same processes as in Process 2 in Table 1.
    Figure 10.  Attacker's optimal strategy with βA=0.55 and βD=0.68 in the two same processes as in Process 2 in Table 1.
    Figure 11.  Defender's optimal strategy with βA=0.55 and βD=0.68 in the two same processes as in Process 2 in Table 1.
    Table 4.  Part of the optimal strategies of the defender in Figure 11.
    $\Delta_2>\Delta_1$ or $(\Delta_1,\Delta_2)=(3,2)$   $(\Delta_1,\Delta_2)=(6,0)$   $(\Delta_1,\Delta_2)=(6,1)$   $(\Delta_1,\Delta_2)=(6,2)$
    (6.18–6.26, 2.65–2.75)   (0.19, 9.52)   (0.41, 8.71)   (0.96, 7.81)
    $(\Delta_1,\Delta_2)=(6,3)$   $(\Delta_1,\Delta_2)=(6,4)$   $(\Delta_1,\Delta_2)=(6,5)$   $(\Delta_1,\Delta_2)=(6,6)$
    (2.47, 6.33)   (4.04, 4.96)   (4.86, 3.98)   (5.36, 3.18)


    When the AoI of channel 2 is bigger than that of channel 1 (except for the state (0,0)), the optimal strategy is around (6.18–6.26, 2.65–2.75). In addition, if the AoI of channel 1 is fixed, the defender's power allocation to channel 1 increases as the AoI of channel 2 increases. To show that the strategies obtained by Algorithm 2 are optimal, we compare the optimal strategy of the defender with other strategies in Figure 12, with the attacker's strategy fixed as the one derived from Algorithm 2.

    Figure 12.  Defender's cumulative discounted reward with different strategies.

    From Figure 12, it can be seen that the defender's cumulative discounted reward under the optimal strategy is clearly higher than those under the other fixed strategies. In addition, we compare the optimal strategy with the one derived by Algorithm 1 to show the effectiveness of the feasibility layers in CPSs with time delays. Although Algorithm 1 employs AoI as the state definition, it does not exploit the fact that AoI measures the freshness of the data; in essence, there is then little difference between AoI and other definitions (for example, the holding time). Consequently, the discounted cumulative rewards of the defender under Algorithm 1 are not as good as those of Algorithm 2, and the optimality of the strategy derived by Algorithm 2 is evident.

    It is worth noting that, in practical scenarios, such results are relatively rare. Due to the characteristics of different channels, both attackers and defenders tend to focus their energy allocation on one channel, which leads to optimal actions near $(\bar{p}\ \text{or}\ \bar{a},\,0)$ or $(0,\,\bar{p}\ \text{or}\ \bar{a})$. What is more, when $\omega_A$ and $\omega_D$ decrease, the weight of the cost in the reward formulas (3.20) and (3.21) of both sides correspondingly increases, leading them to reduce the energy injected into the channel rather than focus on increasing (or reducing) the estimation error covariance. Therefore, in practical applications, the setting of the weights and power discount rates may be extremely important.

    In addition, compared with other learning-based methods in the existing literature (such as DQN, DDQN, DDPG, etc.), our method can be applied to multi-agent environments with asymmetric information, whereas the other mentioned algorithms may easily fail to converge due to the complexity and variability of multi-agent environments.

    We considered a learning-based game-theoretic DoS attack problem with continuous power in a multi-process CPS with random time delays in the transmission channels. In order to deal with the challenges posed by a multi-agent environment with asymmetric information and continuous attack and defense power, we provided a MADDPG-based algorithm with two feasibility layers, which is suitable for this situation and is able to reach a Bayesian Nash equilibrium. In future work, we will introduce various types of attacks into our research on game-theoretic problems, considering not only cyber attacks targeting transmission channels but also attacks aimed at other parts of the network.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work is supported in part by the National Natural Science Foundation of China (Grants 62173142, 62303185, and 62073202), in part by the Programme of Introducing Talents of Discipline to Universities (the 111 Project) under Grant B17017, in part by the Shanghai Sailing Program under Grant 23YF1409500, and in part by the Fundamental Research Funds for the Central Universities under Grant JKM01231838.

    The authors declare that there are no conflicts of interest in this paper.



    [1] M. Huang, K. Ding, S. Dey, Y. Li, L. Shi, Learning-based DoS attack power allocation in multiprocess systems, IEEE Trans. Neural Networks Learn. Syst., 34 (2023), 8017–8030. https://doi.org/10.1109/TNNLS.2022.3148924 doi: 10.1109/TNNLS.2022.3148924
    [2] M. Huang, K. Tsang, Y. Li, L. Li, L. Shi, Strategic DoS attack in continuous space for cyber-physical systems over wireless networks, IEEE Trans. Signal Inf. Process. Networks, 8 (2022), 421–432. https://doi.org/10.1109/TSIPN.2022.3174969 doi: 10.1109/TSIPN.2022.3174969
    [3] D. Möller, H. Vakilzadian, Cyber-physical systems in smart transportation, 2016 IEEE International Conference on Electro Information Technology (EIT), 2016. https://doi.org/10.1109/EIT.2016.7535338
    [4] K. Amarasinghe, C. Wickramasinghe, D. Marino, C. Rieger, M. Manic, Framework for data driven health monitoring of cyber-physical systems, 2018 Resilience Week (RWS), 2018, 25–30. https://doi.org/10.1109/rweek.2018.8473535
    [5] G. Liang, S. R. Weller, J. Zhao, F. Luo, Z. Y. Dong, The 2015 Ukraine blackout: implications for false data injection attacks, IEEE Trans. Power Syst., 32 (2017), 3317–3318. https://doi.org/10.1109/TPWRS.2016.2631891 doi: 10.1109/TPWRS.2016.2631891
    [6] I. Stojmenovic, S. Wen, X. Huang, H. Luan, An overview of Fog computing and its security issues, Concurrency Comput. Practice Exper., 28 (2016), 2991–3005. https://doi.org/10.1002/cpe.3485 doi: 10.1002/cpe.3485
    [7] P. Griffioen, S. Weerakkody, O. Ozel, Y. Mo, B. Sinopoli, A tutorial on detecting security attacks on cyber-physical systems, 2019 18th European Control Conference (ECC), 2019,979–984. https://doi.org/10.23919/ecc.2019.8796117
    [8] J. Qin, M. Li, L. Shi, X. Yu, Optimal denial-of-service attack scheduling with energy constraint over packet-dropping networks, IEEE Trans. Autom. Control, 63 (2018), 1648–1663. https://doi.org/10.1109/TAC.2017.2756259 doi: 10.1109/TAC.2017.2756259
    [9] H. Zhang, P. Cheng, L. Shi, J. Chen, Optimal denial-of-service attack scheduling with energy constraint, IEEE Trans. Autom. Control, 60 (2015), 3023–3028. https://doi.org/10.1109/TAC.2015.2409905 doi: 10.1109/TAC.2015.2409905
    [10] S. Li, C. K. Ahn, Z. Xiang, Decentralized sampled-data control for cyber-physical systems subject to DoS attacks, IEEE Syst. J., 15 (2021), 5126–5134. https://doi.org/10.1109/JSYST.2020.3019939 doi: 10.1109/JSYST.2020.3019939
    [11] Y. Mo, B. Sinopoli, L. Shi, E. Garone, Infinite-horizon sensor scheduling for estimation over lossy networks, 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), 2012, 3317–3322. https://doi.org/10.1109/cdc.2012.6425859
    [12] R. Gan, Y. Xiao, J. Shao, J. Qin, An analysis on optimal attack schedule based on channel hopping scheme in cyber-physical systems, IEEE Trans. Cybern., 51 (2021), 994–1003. https://doi.org/10.1109/TCYB.2019.2914144 doi: 10.1109/TCYB.2019.2914144
    [13] X. Ren, J. Wu, S. Dey, L. Shi, Attack allocation on remote state estimation in multi-systems: structural results and asymptotic solution, Automatica, 87 (2018), 184–194. https://doi.org/10.1016/j.automatica.2017.09.021 doi: 10.1016/j.automatica.2017.09.021
    [14] Z. Guo, J. Wang, L. Shi, Optimal denial-of-service attack on feedback channel against acknowledgment-based sensor power schedule for remote estimation, 2017 IEEE 56th Annual Conference on Decision and Control (CDC), 2017. https://doi.org/10.1109/cdc.2017.8264567
    [15] X. Wang, J. He, S. Zhu, C. Chen, X. Guan, Learning-based attack schedule against remote state estimation in cyber-physical systems, 2019 American Control Conference (ACC), 2019, 4503–4508. https://doi.org/10.23919/acc.2019.8814961
    [16] X. Wang, C. Chen, J. He, S. Zhu, X. Guan, Learning-based online transmission path selection for secure estimation in edge computing systems, IEEE Trans. Ind. Inf., 17 (2021), 3577–3587. https://doi.org/10.1109/TII.2020.3012090 doi: 10.1109/TII.2020.3012090
    [17] P. Dai, W. Yu, H. Wang, G. Wen, Y. Lv, Distributed reinforcement learning for cyber-physical system with multiple remote state estimation under DoS attacker, IEEE Trans. Network Sci. Eng., 7 (2020), 3212–3222. https://doi.org/10.1109/TNSE.2020.3018871 doi: 10.1109/TNSE.2020.3018871
    [18] J. Clifton, E. Laber, Q-Learning: theory and applications, Ann. Rev. Stat. Appl., 7 (2020), 279–301. https://doi.org/10.1146/annurev-statistics-031219-041220 doi: 10.1146/annurev-statistics-031219-041220
    [19] L. Wang, X. Liu, T. Li, J. Zhu, Skew-symmetric games and symmetric-based decomposition of finite games, Math. Modell. Control, 2 (2022), 257–267. https://doi.org/10.3934/mmc.2022024 doi: 10.3934/mmc.2022024
    [20] Z. Yue, L. Dong, S. Wang, Stochastic persistence and global attractivity of a two-predator one-prey system with S-type distributed time delays, Math. Modell. Control, 2 (2022), 272–281. https://doi.org/10.3934/mmc.2022026 doi: 10.3934/mmc.2022026
    [21] K. Ding, X. Ren, D. Quevedo, S. Dey, L. Shi, DoS attacks on remote state estimation with asymmetric information, IEEE Trans. Control Networks, 6 (2019), 653–666. https://doi.org/10.1109/TCNS.2018.2867157 doi: 10.1109/TCNS.2018.2867157
    [22] J. Champati, M. Mamduhi, K. Johansson, J. Gross, Performance characterization using AoI in a single-loop networked control system, IEEE INFOCOM 2019-IEEE Conference on Computer Communications Workshops, 2019,197–203. https://doi.org/10.1109/infcomw.2019.8845114
    [23] Q. He, D. Yuan, A. Ephremides, Optimal link scheduling for age minimization in wireless systems, IEEE Trans. Inf. Theory, 64 (2018), 5381–5394. https://doi.org/10.1109/TIT.2017.2746751 doi: 10.1109/TIT.2017.2746751
    [24] Y. Sun, E. Uysal-Biyikoglu, R. Yates, C. Koksal, N. Shroff, Update or wait: how to keep your data fresh, IEEE Trans. Inf. Theory, 63 (2017), 7492–7508. https://doi.org/10.1109/TIT.2017.2735804 doi: 10.1109/TIT.2017.2735804
    [25] X. Wang, C. Chen, J. He, S. Zhu, X. Guan, AoI-aware control and communication co-design for industrial IoT systems, IEEE Int. Things J., 8 (2021), 8464–8473. https://doi.org/10.1109/JIOT.2020.3046742 doi: 10.1109/JIOT.2020.3046742
    [26] L. Shi, P. Cheng, J. Chen, Sensor data scheduling for optimal state estimation with communication energy constraint, Automatica, 47 (2011), 1693–1698. https://doi.org/10.1016/j.automatica.2011.02.037 doi: 10.1016/j.automatica.2011.02.037
    [27] L. Shi, L. Xie, Optimal sensor power scheduling for state estimation of Gauss-Markov systems over a packet-dropping network, IEEE Trans. Signal Process., 60 (2012), 2701–2705. https://doi.org/10.1109/TSP.2012.2184536 doi: 10.1109/TSP.2012.2184536
    [28] L. Shi, L. Xie, R. Murray, Kalman filtering over a packet-delaying network: a probabilistic approach, Automatica, 45 (2009), 2134–2140. https://doi.org/10.1016/j.automatica.2009.05.018 doi: 10.1016/j.automatica.2009.05.018
    [29] L. Peng, L. Shi, X. Cao, C. Sun, Optimal attack energy allocation against remote state estimation, IEEE Trans. Autom. Control, 63 (2018), 2199–2205. https://doi.org/10.1109/TAC.2017.2775344 doi: 10.1109/TAC.2017.2775344
    [30] R. Sutton, A. Barto, Reinforcement learning: an introduction, MIT Press, 1998.
    [31] K. Hazeghi, M. Puterman, Markov decision processes: discrete stochastic dynamic programming, J. Amer. Stat. Assoc., 90 (1995), 392. https://doi.org/10.2307/2291177 doi: 10.2307/2291177
    [32] H. Tan, Reinforcement learning with deep deterministic policy gradient, 2021 International Conference on Artificial Intelligence, Big Data and Algorithms (CAIBDA), 2021, 82–85. https://doi.org/10.1109/CAIBDA53561.2021.00025
    [33] B. Shalabh, R. Sutton, M. Ghavamzadeh, M. Lee, Natural actor-critic algorithms, Automatica, 45 (2009), 2471–2482. https://doi.org/10.1016/j.automatica.2009.07.008 doi: 10.1016/j.automatica.2009.07.008
    [34] A. Rodriguez-Ramos, C. Sampedro, H. Bavle, P. de la Puente, P. Campoy, A deep reinforcement learning strategy for UAV autonomous landing on a moving platform, J. Intell. Robot. Syst., 93 (2019), 351–366. https://doi.org/10.1007/s10846-018-0891-8 doi: 10.1007/s10846-018-0891-8
  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)