Research article

Dynamic computation offloading algorithm based on particle swarm optimization with a mutation operator in multi-access edge computing


  • Current computation offloading algorithms for the mobile cloud ignore the selection of offloading opportunities and do not consider the effects of the user's offloading success probability on the offloading frequency, resource waste, and energy efficiency. Therefore, in this study, a dynamic computation offloading algorithm based on particle swarm optimization with a mutation operator in a multi-access edge computing environment (DCO-PSOMO) is proposed. According to the CPU and memory utilization rates of the mobile terminal, this method dynamically predicts the overload time by using a robust, locally weighted regression method. After the overload time is detected, the probability of successful offloading is predicted from the mobile user's dwell time and the communication range of the edge server, and offloading is either conducted immediately or delayed. A computation offloading model was established via the response time and energy consumption of the mobile terminal. Additionally, the optimal computation offloading algorithm was designed via a particle swarm with a mutation operator. Finally, the DCO-PSOMO algorithm was compared with the JOCAP, ECOMC and ESRLR algorithms, and the experimental results demonstrate that the DCO-PSOMO offloading method effectively reduces the offloading cost and terminal energy consumption, and improves the offloading success probability and the user's QoS.

    Citation: Yanpei Liu, Wei Huang, Liping Wang, Yunjing Zhu, Ningning Chen. Dynamic computation offloading algorithm based on particle swarm optimization with a mutation operator in multi-access edge computing[J]. Mathematical Biosciences and Engineering, 2021, 18(6): 9163-9189. doi: 10.3934/mbe.2021452




    In the last two decades, the research and application of nonlinear systems have made considerable progress. Combined with Lyapunov stability theory, many results have been obtained in the control theory of nonlinear systems. Wu et al. [1] designed an adaptive quantized control method for uncertain nonlinear strict-feedback systems by combining the backstepping technique and Lyapunov stability theory. Shatyrko et al. [2] studied the stabilization of Lur'e-type nonlinear indirect control systems with time-delay argument, and obtained sufficient conditions for the absolute stability of the system by the Lyapunov method. Kai et al. [3] discussed a control method that can generate desired limit-cycle-like behavior for a two-dimensional discrete-time nonlinear control system. Wei [4] discussed the boundedness of nonlinear nabla fractional-order systems and, using the nabla Laplace transform, derived two stability criteria in the form of Lyapunov's theorem. Pola et al. [5] described incrementally globally asymptotically stable nonlinear control systems through symbolic models, which enables one to utilize techniques from supervisory control and algorithms from game theory for controller synthesis.

    At the same time, the development of optimization theory for dynamic systems and the research and application of nonlinear control systems have opened up a huge application space. Zhang et al. [6] introduced adaptive dynamic programming (ADP) to the control of synchronous generators and generator excitation in multi-machine power systems. Volckaert et al. [7] used iterative learning control (ILC) to solve the control problem of a cart-cable system with nonlinear characteristics. Gao et al. [8] used a global adaptive dynamic programming (GADP) method for cooperative adaptive cruise control (CACC) of connected vehicles with unknown nonlinear dynamics. Trélat [9] surveyed common methods for solving nonlinear optimal control problems in aerospace, such as Pontryagin's maximum principle and conjugate point theory.

    At present, many scholars combine nonlinear systems with optimal control theory to study the optimal control of nonlinear systems. The main methods for solving the optimal control problem are the classical variational method, the maximum principle, and dynamic programming [10,11,12]. One of the main difficulties of these classical optimal control theories in determining the optimal control of a nonlinear system is the need to solve the Hamilton-Jacobi-Bellman (HJB) partial differential equations (PDEs) [13]. Due to the complexity of nonlinear systems and the various constraints of real systems, nonlinear HJB equations and nonlinear two-point boundary value problems cannot be solved analytically. Therefore, it is of great significance to study numerical solution methods, or explicit approximate solutions close to the optimal solution, for the optimal control problem.

    With the development of various technologies in the field of artificial intelligence, many methods that combine artificial intelligence algorithms to solve nonlinear system problems have emerged [14,15]. Another powerful methodology for solving optimal control problems is the adaptive dynamic programming (ADP) algorithm, proposed by Werbos in 1991 [16]. Combining reinforcement learning and dynamic programming, ADP simulates human learning through environmental feedback and is considered a method very close to the intelligence of the human brain. It effectively alleviates the "curse of dimensionality" in dynamic programming [17]. In 1997, Prokhorov and Wunsch discussed the design of heuristic dynamic programming (HDP), dual heuristic programming (DHP) and globalized dual heuristic programming (GDHP), and proposed implementation methods and training steps for ADP [18]. The essence of ADP is to use function approximation structures to approximate the performance index function and control strategy in the dynamic programming equation, and to obtain the optimal control and optimal performance index function while satisfying the Bellman optimality principle [19,20,21]. An ADP structure generally contains three parts, namely the dynamic system, the critic (performance index function) and the control action; each part can be implemented by a neural network [22]. The mathematical derivation of the ADP method is based on minimizing the performance index function, or cost function, over an infinite horizon, where the performance index function is the integral or summation of the utility function. In [23], combined with Bellman's optimality principle, a discount factor was introduced into the cost function to solve the optimal control problem. This treatment mirrors the role of the discount rate in reinforcement learning; consequently, the model can better account for future returns during training [24].

    The research in the above literature is mainly aimed at the design and optimization of the network structure in ADP, proposing ADP algorithms with different structures. In recent years, many scholars have studied the stability and convergence of the closed-loop system formed by the ADP iterative algorithm. Mu et al. [25] proposed an approximate optimal control strategy that uses ADP to obtain the approximate optimal tracking control law of uncertain nonlinear systems with a predefined cost function, solving the tracking control problem of continuous-time nonlinear systems with unmatched uncertainty. Dong et al. [26] proposed a critic-only ADP algorithm to approximate the solution of the HJB equation and the corresponding optimal control strategy; it solves the tracking control problem described by Euler-Lagrange equations with uncertain system parameters, and uniformly ultimately bounded stability is guaranteed via a Lyapunov-based stability analysis. Song et al. [27] proposed a value iteration (VI) algorithm based on ADP to solve the fixed-point tracking control problem, gave a convergence proof of the VI algorithm, and proved that the iterative cost function converges exactly to the optimal value. Liang et al. [28] constructed a partial-policy iterative ADP algorithm to solve the optimal control problem of nonlinear systems with discounted total rewards; compared with the traditional policy-iteration ADP algorithm, the computation in the iterative process is reduced and the stability is improved. Fan et al. [29] proposed a novel control strategy based on robust adaptive dynamic programming (RADP) for optimal control of a class of output-constrained continuous-time unknown nonlinear systems. Yang et al. [30] proposed a self-learning robust optimal control scheme based on ADP, using indicator functions and concurrent learning techniques to handle mismatched perturbed input-affine continuous-time nonlinear systems.

    However, most ADP algorithms are either implemented offline by iterative methods or require prior knowledge of the system dynamics. These algorithms are intractable for real-time control applications, since exact knowledge of nonlinear dynamic systems is often unavailable. Many scholars use reinforcement learning methods to address these problems. Liu et al. [31] proposed a robust adaptive control algorithm based on reinforcement learning (RL) to solve the control problem of continuous-time uncertain nonlinear systems with input constraints; the algorithm uses a single critic network to obtain an approximate optimal control, which ensures the stability of the uncertain nonlinear continuous-time system. Liu et al. [32] constructed two kinds of neural networks to achieve approximate optimal control for continuous-time nonlinear systems with unknown structure and control constraints: a recurrent neural network (RNN) is used to identify the system model, and two feedforward neural networks serve as the actor network and the critic network, approximating the optimal control strategy and the optimal performance index function, respectively. Zhao et al. [33] proposed an RL-based cyclic fixed finite-time algorithm to solve the HJB equation for the optimal control of continuous nonlinear finite-time systems with uncertain system structure. Zhao et al. [34] proposed a new reinforcement learning strategy consisting of two parts, an online learning optimal control of a nominal system and a feedforward neural network (NN) compensation that handles uncertain input constraints; this strategy solves the optimal stabilization problem of unknown nonlinear systems with uncertain input constraints. Wang et al. [35] studied the approximate optimal control of continuous-time non-affine nonlinear systems using reinforcement learning, applying an effective precompensation technique for proper system transformation and combining it with global policy iteration to relax the need for exact system dynamics. J. W. Kim et al. [36] proposed a model-based RL approach that uses deep neural networks (DNNs) to iteratively learn the solutions of the HJB equation and its associated equations, which can significantly improve the performance of learned policies under uncertain initial states and state noise.

    Algorithms based on dynamic programming and iterative strategies start from the performance index function in the state space of the control system, explore the control effect of the control sequence, and obtain the optimal control law [37,38,39,40]. The above literature has achieved notable results in the design of network structures and in convergence proofs of policy algorithms. In the study of optimal control of nonlinear systems, more attention should be paid to the control details of dynamic programming algorithms; the aim is that the evolution of the state space during the whole control process meets the detailed requirements of the control system and finally achieves the control goal. This requires that the utility function in the performance index function be able to express the specific needs of users. Most of the above works use quadratic utility functions, which do not describe well the relationship between the utility function and the system state; moreover, the setting of parameters in such utility functions is greatly limited. There is little literature on how to design a utility function closely related to the actual control objective in nonlinear systems, and this is the research direction of this paper.

    This paper proposes the DRM, a detailed regulation of the control process. By designing evaluation functions for the control targets, multiple evaluation functions are combined into a reward function that replaces the utility function in the performance index function; the reward function is then introduced into deep reinforcement learning algorithms. We use three algorithms to verify the effectiveness of the DRM, based on the Deep Q-Network, policy gradient and actor-critic frameworks, respectively.

    The remainder of this paper is organized as follows. Section 2 describes the properties of nonlinear systems and introduces the limitations of utility functions in iterative algorithms. In Section 3, an alternative to the utility function is proposed by combining Lyapunov stability and Q-learning. Taking the inverted pendulum system as the experimental object, Section 4 establishes the dynamic environment, constructs the reward function mechanism of the inverted pendulum system, analyzes the deep reinforcement learning algorithm models and designs experiments to verify the effectiveness of the algorithm. Finally, Section 5 concludes this work.

    Consider a continuous-time nonlinear system given by:

    $\dot{x}(t)=f[x(t),u(t),t]$, (1)

    where $x(t)\in\mathbb{R}^n$ is the system state vector, $u(t)\in\Omega\subseteq\mathbb{R}^m$ is the control input vector, and $f(\cdot)$ is a continuously differentiable vector function. The boundary conditions are that the initial state $x(t_0)=x_0$ is fixed and the terminal state $x(t_f)$ is free. The performance index function is defined as:

    $J(x(t),t)=\varphi(x(t_f),t_f)+\int_t^{t_f} L(x(\tau),u(\tau),\tau)\,d\tau$, (2)

    where the control vector $u(t)$ is unconstrained and continuous. On the interval $[t,t_f]$, the scalar functions $\varphi(\cdot)$ and $L(\cdot)$ are continuous and twice differentiable. The function $J(x(t),t)$ is continuous and has continuous first- and second-order partial derivatives with respect to $x(t)$ and $t$. Within the admissible control domain $\Omega$, $u_{[t,t_f]}\in\Omega$, so the optimal performance index function is defined as:

    $J^*(x(t),t)=\min_{u_{[t,t_f]}\in\Omega}\left\{\varphi(x(t_f),t_f)+\int_t^{t_f} L(x(\tau),u(\tau),\tau)\,d\tau\right\}$. (3)

    Then the continuous-time HJB equation is:

    $-\dfrac{\partial J^*(x(t),t)}{\partial t}=\min_{u(t)\in\Omega}\left\{L(x(t),u(t),t)+\left(\dfrac{\partial J^*(x(t),t)}{\partial x}\right)^{T} f[x(t),u(t),t]\right\}$. (4)

    For the optimal control law $u^*(t)$, Eq (4) can be written as:

    $-\dfrac{\partial J^*(x(t),t)}{\partial t}=L(x(t),u^*(t),t)+\left(\dfrac{\partial J^*(x(t),t)}{\partial x}\right)^{T} f[x(t),u^*(t),t]$. (5)

    According to Euler discretization [41], the discrete-time form of the nonlinear control system (NCS) is:

    $x_{k+1}=F(x_k,u_k)$. (6)

    The nonlinear system is represented by the function $F(x_k,u_k)$, where $x_k\in\Omega_x\subset\mathbb{R}^n$ is the system state vector and $u_k\in\Omega_u\subset\mathbb{R}^m$ is the control input vector. $\Omega_x$ and $\Omega_u$ are defined as:

    $\Omega_x=\{x_k \mid x_k\in\mathbb{R}^n \text{ and } \|x_k\|<\infty\}$, $\Omega_u=\{u_k \mid u_k\in\mathbb{R}^m \text{ and } \|u_k\|<\infty\}$, (7)

    where $\|\cdot\|$ denotes the Euclidean norm. Let $\underline{u}_k=(u_k,u_{k+1},\ldots)$ denote the control sequence from step $k$ to $\infty$, and let $U(x_k,u_k)$ be the utility function. The performance index function is defined as:

    $J(x_k,\underline{u}_k)=\sum_{i=k}^{\infty} U(x_i,u_i)$. (8)

    For a control sequence $\underline{u}_k$, the optimal performance index function can be defined as:

    $J^*(x_k)\triangleq\min_{\underline{u}_k}\{J(x_k,\underline{u}_k)\}$. (9)

    So we have:

    $J^*(x_k)=\min_{u_k}\{U(x_k,u_k)+J^*(x_{k+1})\}$. (10)

    Then, according to Bellman's principle of optimality [42], the discrete-time HJB equation is:

    $J^*(x_k)=U(x_k,u^*(x_k))+J^*(F(x_k,u^*(x_k)))$. (11)

    The optimal single-step control law $u^*(x_k)$ can be expressed as:

    $u^*(x_k)=\arg\min_{u_k}\{U(x_k,u_k)+J^*(F(x_k,u_k))\}$. (12)

    The above is the basic principle of using dynamic programming to solve the optimal control of the two systems. The optimal control vector function and the optimal state trajectory can be solved through the HJB equation, but the process is often very complicated or difficult to carry out. In addition, when the discrete system is high-order, the "curse of dimensionality" easily arises. A common alternative method is ADP. The performance index function in ADP is usually defined in a quadratic form, similar to the following expression:

    $J(x(t),u(t),t)=\frac{1}{2}\int_t^{\infty}\left[x^{T}(t)Q(t)x(t)+u^{T}(t)R(t)u(t)\right]dt$, (13)

    where $Q$ is an $l\times l$ nonnegative definite matrix and $R$ is an $m\times m$ positive definite matrix; the values of $l$ and $m$ depend on the dimensions of $x(t)$ and $u(t)$, respectively. The utility function is generally expressed in the following form:

    $U(x_k,u_k)=x_k^{T}Qx_k+u_k^{T}Ru_k$. (14)

    In most cases, $Q$ and $R$ are set as identity matrices with suitable dimensions, but this is often not consistent with the real control system and control objectives. For example, in a specific control system, a high-frequency control law may damage electronic components, so high-frequency control should be avoided. The mechanical characteristics of some systems require the control law to be as smooth and flexible as possible. Moreover, a specific operating environment may require the state trajectory under the output control law to satisfy specific requirements. All this requires the utility function to take a more flexible, nonlinear form. As far as we know, there is very little literature on how to assign appropriate values to $Q$ and $R$ so that the utility function satisfies the requirements of the real control system. In fact, it is difficult or even impossible to determine the two matrices accurately to match the real control system.
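As a concrete reference point, the quadratic utility of Eq (14) can be sketched in a few lines. The identity-matrix defaults for $Q$ and $R$ are exactly the common but often unrealistic choice criticized above; the function name is our own.

```python
import numpy as np

def quadratic_utility(x, u, Q=None, R=None):
    """Quadratic utility U(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k (Eq (14)).

    Q and R default to identity matrices of matching dimension,
    the common but often unrealistic choice discussed in the text."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    Q = np.eye(x.size) if Q is None else np.asarray(Q, dtype=float)
    R = np.eye(u.size) if R is None else np.asarray(R, dtype=float)
    return float(x @ Q @ x + u @ R @ u)
```

With the defaults, `quadratic_utility([1.0, 2.0], [0.5])` evaluates to $1+4+0.25=5.25$; tuning the individual entries of $Q$ and $R$ to express a real control objective is the difficulty the text goes on to address.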

    In order to solve the above problems, we try to modify the utility function so that it satisfies not only the mathematical requirements of the optimal performance index but also our requirements for the control process. First, we define a new function $R(x_k,u_k)$ and use it as the counterpart of the utility function $U(x_k,u_k)$. Then, for a known NCS $f[x(t),u(t),t]$, the discretized function $F(x_k,u_k)$ is established, which constitutes the dynamic environment. Using the dynamic environment, the next state $x_{k+1}$ under the inputs $x_k$ and $u_k$ can be computed, and the $R(x_k,u_k)$ matched with $U(x_k,u_k)$ can be output.

    Thus, how to design the dynamic environment and the output function $R(x_k,u_k)$ corresponding to $U(x_k,u_k)$ becomes one of the key problems in nonlinear control systems. For mathematical convenience, we impose the following conventions and constraints on $R(x_k,u_k)$:

    1) The discrete nonlinear control sequence can be defined as a finite Markov decision process (MDP) $M=(S,A,\mathcal{R})$, where $S=\{x_0,x_1,\ldots,x_N\}$, $x_i\in\Omega_x$, $N>1$, is a finite set of states; $A=\{u_0,u_1,\ldots,u_N\}$, $u_i\in\Omega_u$, is a set of control actions; and $\mathcal{R}$ is the reward distribution, written $\mathcal{R}:S\times A\times S\to\mathbb{R}$, with $R(x_k,u_k,x_{k+1})\in[0,r_{\max}]$ being the reward received upon taking $u_k$ in state $x_k$ and transitioning to state $x_{k+1}$. The boundedness of the reward function ensures that the performance index function composed of it is bounded. The intuitive interpretation of the reward function is: the better the control effect complies with the objectives, the closer the value is to $r_{\max}$; the worse the compliance, the closer the value is to 0.

    2) For a specific state space $\kappa\subseteq\mathbb{R}^m$, which is a subspace of the state space $x$, i.e., $\kappa\subseteq x$, composed of specific components of the state space $x$, there may be specific control requirements or the designer's objective evaluation of the state. We regard $\kappa$ as an observation perspective on the state space. Therefore, we define the evaluation function $e_\kappa(x_k,u_k)\in[0,1]$ under the perspective $\kappa$; from this perspective, specific control can be applied and the control effect of $(x_k,u_k)$ can be evaluated. As a component of the reward function, the evaluation function under a specific observation perspective often helps guide subsequent control closer to the control goal. As shown in Figure 1, in a gradually stabilizing MDP control trajectory, the control vector in state $x_k$ is $u_k$, the reward value is $r_k$, and $x_k$ transitions to the next state $x_{k+1}$. If there is a better control vector $u'_k$ under the observation perspective $\kappa_k$ at this time step, the state transitions directly to $x_{k+i}$, bypassing the intermediate $i-1$ control steps, and the MDP trajectory becomes $(x_k,u'_k,r'_k)$. Such a control law is therefore suitable for increasing the reward value. According to the different observation perspectives, the evaluation function $e(x_k,u_k)$ related to the state-control pair $(x_k,u_k)$ is formulated to evaluate the control validity. When $i>0$, $e_i(\cdot)$ has a value range of $[0,1]$ and is continuously differentiable.

    Figure 1.  MDP control trajectory.

    3) $R(\cdot)$ can be defined as:

    $R(x_k,u_k)\triangleq\prod_{i=0}^{n-1} e_i(x_k,u_k),\quad n>1$, (15)

    and

    $e_0(x_k,u_k)=r_{\max}$, if $x_k\in\Omega_x$ and $u_k\in\Omega_u$; $e_0(x_k,u_k)=0$, otherwise. (16)

    Here $r_{\max}=1$, i.e., the maximum value of the reward function is 1. When $i>0$, $e_i(\cdot)$ should be continuous and twice differentiable in order to ensure that $R(\cdot)$ is also continuous and twice differentiable when $R(\cdot)>0$. The evaluation function sequence is $(e_0(x_k,u_k),\ldots,e_{n-1}(x_k,u_k))$, where $n$ is the number of evaluation functions, i.e., the effectiveness of control is evaluated from $n$ different observation perspectives.
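A minimal sketch of Eqs (15) and (16): the reward is a product of evaluation functions, with $e_0$ acting as the admissibility gate. The numeric bounds and the two extra evaluation functions (`e_state`, `e_effort`) are purely illustrative stand-ins, not the paper's actual observation perspectives.

```python
import math

R_MAX = 1.0  # r_max = 1, as stated in the text

def e0(x, u, x_bound=10.0, u_bound=5.0):
    """Gatekeeper e_0 of Eq (16): r_max inside the admissible sets, else 0.
    The bounds stand in for Omega_x and Omega_u (assumed values)."""
    in_omega = all(abs(xi) < x_bound for xi in x) and all(abs(ui) < u_bound for ui in u)
    return R_MAX if in_omega else 0.0

def e_state(x, u):
    """Hypothetical evaluation: prefer states near the equilibrium, score in [0, 1]."""
    return math.exp(-sum(xi * xi for xi in x))

def e_effort(x, u):
    """Hypothetical evaluation: prefer small control effort, score in [0, 1]."""
    return math.exp(-sum(ui * ui for ui in u))

def reward(x, u, evals=(e0, e_state, e_effort)):
    """R(x_k, u_k) as the product of n evaluation functions (Eq (15))."""
    r = 1.0
    for e in evals:
        r *= e(x, u)
    return r
```

At the equilibrium with zero control the product attains $r_{\max}=1$; leaving the admissible set drives $e_0$, and hence the whole product, to 0.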

    A dynamic environment is constructed from the above formulas, as shown in Figure 2. The input is the state-control pair $(x_k,u_k)$, and the outputs are the next state $x_{k+1}$ and $R(x_k,u_k)$. If $R(\cdot)$ is regarded as the reward value of the dynamic environment, the transformation from the utility function of the NCS to the reward is complete, i.e., the utility function is replaced by the reward value. Intuitively, the smaller the value of the utility function, the greater the reward value, with the reward ranging from 0 to $r_{\max}$.

    Figure 2.  Dynamic environment.

    In order to satisfy Eqs (8)–(11), we can express the relationship between $U(x_k,u_k)$ and $R(x_k,u_k)$ with the following equation:

    $U(x_k,u_k)=e_0(x_k,u_k)-R(x_k,u_k)$. (17)

    It is worth mentioning that the $n$ evaluation functions and their respective observation perspectives should be designed consistently with the control objectives. The design of observation perspectives can mainly focus on two points: first, monitoring the effect of the control law in specific states during the control process; second, helping the system achieve an optimal control effect. Observation perspectives should not be repeated, and the evaluation functions should not conflict in their control preferences, meaning that a high score for the control law under one observation perspective should make it easier to obtain high scores under the others. A high score under one perspective may not immediately lead to high scores under the others, but it makes the control law in the current state likely to score highly under more perspectives at subsequent time steps.
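The dynamic environment of Figure 2 can be sketched as a function mapping a state-control pair $(x_k,u_k)$ to the next state $x_{k+1}$ and the reward $R(x_k,u_k)$. The toy system `F_toy` and the one-line reward below are assumed placeholders, not the paper's inverted pendulum.

```python
def make_environment(F, reward_fn):
    """Dynamic environment of Figure 2: (x_k, u_k) -> (x_{k+1}, R(x_k, u_k)).
    F is the discretized system function of Eq (6); reward_fn plays the
    role of R(x_k, u_k) built from the evaluation functions."""
    def step(x, u):
        x_next = F(x, u)          # system transition, Eq (6)
        return x_next, reward_fn(x, u)
    return step

# Toy stable linear stand-in for F(x_k, u_k) -- not the inverted pendulum.
def F_toy(x, u):
    return [0.9 * x[0] + 0.1 * u[0]]

# Illustrative reward: closer to the origin earns more, capped at r_max = 1.
step = make_environment(F_toy, lambda x, u: 1.0 - min(1.0, abs(x[0])))
```

A rollout then just iterates `x, r = step(x, u)`; the reward rises as the state contracts toward the equilibrium.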

    According to the definition of the expected discounted return in reinforcement learning theory, let $R_k=R(x_k,u_k)$ and substitute it into the following equation:

    $G_t\triangleq\sum_{k=t+1}^{T}\gamma^{k-t-1}R_k$, (18)

    where $\gamma$ is the discount rate, $0\le\gamma\le 1$, $R_k$ is the reward at step $k$, $R_k\in\mathcal{R}\subset\mathbb{R}$, $\mathcal{R}$ is the reward set, and $T$ is the final time step.
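Eq (18) translates directly into code. This small helper assumes the rewards $R_{t+1},\ldots,R_T$ are supplied in order; the name is our own.

```python
def discounted_return(rewards, gamma):
    """G_t = sum_{k=t+1}^{T} gamma^{k-t-1} R_k  (Eq (18)).

    `rewards` lists R_{t+1}, ..., R_T in order, so the first reward
    gets weight gamma^0, the second gamma^1, and so on."""
    g = 0.0
    for i, r in enumerate(rewards):
        g += (gamma ** i) * r
    return g
```

For example, three unit rewards with $\gamma=0.5$ give $1+0.5+0.25=1.75$.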

    The advantages of using the reward function $R(x_k,u_k)$ instead of the utility function $U(x_k,u_k)$ are as follows:

    1) The reward function reflects feasible control objectives from various observation perspectives. It may be a nonlinear function, instead of the fixed quadratic expression of the utility function, and can thus better satisfy detailed control objectives.

    2) The reward function $R(x_k,u_k)$ is composed of several evaluation functions, each from its own observation perspective. Each evaluation function $e(x_k,u_k)$ is specific, simple and easy to compute, which avoids the difficulty of specifying the weight matrices $Q$ and $R$ in the utility function $U(x_k,u_k)$.

    3) For a specific real control platform, the reward function is easier to realize. According to the requirements of the control objectives and control details, each evaluation function is easy to adjust, and adjusting one evaluation function does not affect the others.

    4) With the reward function added to the dynamic environment, more sophisticated reinforcement learning algorithms can be applied to the nonlinear control problem.

    In Eq (15), the reward function takes the form of successive multiplication instead of successive addition or other combinations. The advantages of doing so are as follows:

    1) If the reward function took the additive form of the evaluation functions, then changing one evaluation function would require reconsidering the weight of every evaluation function, which makes the problem very complex and whose rationality would need careful experimental verification. In the multiplicative form, changing one evaluation function only changes that function's calculation method and possible gain, and there is no need to change the weights of the other evaluation functions. In fact, the output value of each function can be understood as its own weight.

    2) Any evaluation function can exercise a "one vote veto" under successive multiplication. For example, in an experiment of controlling a robot to walk, no matter how high the score given by the evaluation function responsible for evaluating the robot's posture, as long as the robot touches the boundary it should receive a zero score. According to Eq (15), when the evaluation function responsible for boundary detection gives a zero score, the value of the reward function is zero. The additive form of the evaluation functions cannot produce such an effect.
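The "one vote veto" effect can be seen by comparing the multiplicative combination of Eq (15) with an additive alternative (here a simple average, chosen only for illustration):

```python
def product_reward(scores):
    """Multiplicative combination of Eq (15): one zero score vetoes everything."""
    r = 1.0
    for s in scores:
        r *= s
    return r

def additive_reward(scores):
    """Additive alternative (illustrative): a zero score is diluted by the rest."""
    return sum(scores) / len(scores)
```

With scores `[0.9, 0.8, 0.0]`, say posture, smoothness, and a failed boundary check, the product is exactly 0, while the average still rewards the failed step.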

    In order to comply with Eqs (3)–(5) and (11), we modify Eq (8) as:

    $J(x_k,\underline{u}_k)=\sum_{i=k}^{\infty}U(x_i,u_i)=\sum_{i=k}^{\infty}\left(e_0(x_i,u_i)-R(x_i,u_i)\right)=\sum_{i=k}^{\infty}\left(r_{\max}-R(x_i,u_i)\right)$. (19)

    Let

    $\Psi(x_k,\underline{u}_k)=\sum_{i=k}^{\infty}R(x_i,u_i)$, (20)

    and we have:

    $J(x_k,\underline{u}_k)=\sum_{i=k}^{\infty}r_{\max}-\Psi(x_k,\underline{u}_k)$. (21)

    Therefore, as long as $\underline{u}_k$ conforms to the system of Eq (6), it satisfies the optimality equation:

    $\Psi^*(x_k)=\max_{\underline{u}_k}\{\Psi(x_k,\underline{u}_k)\}$. (22)

    Hence $\Psi^*(x_k)$ can be obtained as:

    $\Psi^*(x_k)=\max_{u_k}\{R(x_k,u_k)+\Psi^*(x_{k+1})\}$. (23)

    So the optimal performance index function has another form:

    $J^*(x_k)=\min_{\underline{u}_k}\{J(x_k,\underline{u}_k)\}=\sum_{i=k}^{\infty}r_{\max}-\Psi^*(x_k)$. (24)

    Recalling Eq (10), we get:

    $J^*(x_k)=\min_{u_k}\{U(x_k,u_k)+J^*(x_{k+1})\}=r_{\max}-R(x_k,u^*(x_k))+J^*(F(x_k,u^*(x_k)))$, (25)

    and the optimal single-step control law is:

    $u^*(x_k)=\arg\max_{u_k}\Psi^*(x_k)$. (26)

    Comparing Eqs (26) and (14), the problem of solving the optimal performance index function is transformed into the problem of maximizing the reward-based return. Furthermore, combined with Eq (18), a wider range of reinforcement learning algorithms can be applied to the optimization of nonlinear systems.

    For convenience of analysis, the stability proof is based on the following Assumptions 1 and 2.

    Assumption 1: The discrete NCS represented by Eq (6) is controllable, and the system state $x_k=0$ is an equilibrium state of the discrete NCS under the condition $u_k=0$, i.e., $F(0,0)=0$.

    Assumption 2: The feedback control $u_k=u(x_k)$ satisfies $u_k=u(x_k)=0$ for $x_k=0$; for $x_i\in\Omega_i\subset\mathbb{R}^n$, $u(x_k)$ is continuous, $u(x_k)$ stabilizes the discrete NCS on $\Omega_i$, and $J(x_i)$ is finite.

    Assumption 1 ensures that the object to which we apply the deep reinforcement learning method is a controllable NCS, and fixes the equilibrium point of the system at $x_k=0$, where the output of the system function $F(\cdot)$ is 0. Assumption 2 ensures that the control law is admissible and that the output is 0 at the equilibrium point, so as to meet the requirements of a controllable NCS; along the control sequence, the performance index function $J(x_i)$ is bounded. In this way, we have made restrictive assumptions about the system object and the control law.

    According to the definition of $e(x_k,u_k)$, $e(x_k,u_k)\ge 0$, so $R(x_k,u_k)$ is a positive definite function for any $x_k$ and $u_k$. Let $a_0(x_k)$ be an arbitrary admissible control law. According to Q-learning [43], for $i=0$, let $Q_0(x_k,u_k)$ be the initial iterative Q function constructed by $a_0(x_k)$, i.e.,

    $Q_0(x_k,a_0(x_k))=\sum_{j=0}^{\infty}R(x_{k+j},a_0(x_{k+j}))$. (27)

    Thus, the initial iterative Q function satisfies the following optimality equation:

    $Q_0(x_k,u_k)=R(x_k,u_k)+Q_0(x_{k+1},a_0(x_{k+1}))$. (28)

    Then, we get the iterative control law as follows:

    $a_1(x_k)=\arg\max_{u_k}Q_0(x_k,u_k)$. (29)

    Let $Q_i(x_k,u_k)$ be the iterative Q function constructed by $a_i(x_k)$ for $i=1,2,\ldots$; then we can get the following generalized optimality equation:

    $Q_i(x_k,u_k)=R(x_k,u_k)+Q_i(x_{k+1},a_i(x_{k+1}))$. (30)

    Therefore, the iterative control law is updated by:

    $a_{i+1}(x_k)=\arg\max_{u_k}Q_i(x_k,u_k)$. (31)

    Because the reward function $R(x_k,u_k)$ is a positive definite function of $x_k$ and $u_k$, under Assumptions 1 and 2 the iterative function $Q_i(x_k,u_k)$, $i=0,1,\ldots$, is positive definite for $x_k$ and $u_k$. According to Eq (30), substituting $a_i(x_k)$ for $u_k$ and letting $V_i(x_k)=Q_i(x_k,a_i(x_k))$ for $i=0,1,\ldots$, we get:

    $V_i(x_{k+1})-V_i(x_k)=Q_i(x_{k+1},a_i(x_{k+1}))-Q_i(x_k,a_i(x_k))=-R(x_k,a_i(x_k))<0$. (32)

    For $i=0,1,\ldots$, the function $V_i(x_k)$ is positive definite for $x_k$, so $V_i(x_k)$ is a Lyapunov function. Thus, $a_i(x_k)$ is a stable control law.

    In the above, it is difficult to give a specific expression for the control objective of the utility function in an actual control system. We can try to reconstruct the reward function $R(x_k,u_k)$ to replace the utility function with the help of the DRM. Therefore, the key step in solving the nonlinear optimization problem becomes the problem of how to construct a reinforcement learning dynamic environment and optimize the reward function $R(x_k,u_k)$. According to Eqs (15) and (16), determining the evaluation functions is one of the important tasks of the DRM. Ng et al. [44] proposed the reward shaping theory: if a potential-based function is added to the reward function, the optimal strategy remains unchanged, so the reward can be modified without affecting the Markov decision process. The reward shaping method is used in multi-agent reinforcement learning, and it is effective in robot collaborative-competitive operations and multi-objective tasks [45,46,47,48,49]. This approach gives us some inspiration, especially regarding the construction of the potential energy function and how to avoid reward conflicts.
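    The potential-based shaping of Ng et al. can be sketched in a few lines; the potential function here is a hypothetical example, and with $\gamma=1$ the shaping terms telescope over any trajectory, so the total added reward depends only on the first and last states:

```python
GAMMA = 0.99

def shaped_reward(r, s, s_next, phi, gamma=GAMMA):
    """Potential-based reward shaping: the added term
    gamma * phi(s') - phi(s) leaves the optimal policy unchanged."""
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical potential: closer to the origin is "better".
phi = lambda s: -abs(s)

# With gamma = 1 the shaping terms telescope over a trajectory.
traj = [3, 2, 4, 1, 0]
added = sum(shaped_reward(0.0, s, s2, phi, gamma=1.0)
            for s, s2 in zip(traj, traj[1:]))
print(added)  # phi(0) - phi(3) = 3.0
```

    Because the accumulated shaping reward is path-independent, the ranking of policies (and hence the optimal policy) is preserved while learning is guided toward low-potential states.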

    Using the mathematical models above, the problem of obtaining a control method for a nonlinear system can be transformed into a problem of training one or more agents' control strategies, establishing a value function and obtaining the maximum reward. In this section, choosing the inverted pendulum system as the experimental object, we analyze its characteristics and establish the dynamic environment, design the reward function according to different control objectives, and then use three deep reinforcement learning algorithm models to conduct experiments. Finally, the experimental results are discussed.

    Figure 3 is a schematic diagram of the structure of the inverted pendulum system. Our ultimate goal is to control the force exerted on the cart, making it move left or right to maintain the balance of the single pole mounted on the cart. The typical parameters of the inverted pendulum-cart system are selected as: mass of the cart ($m_c$): 1.34 kg, mass of the pendulum ($m_p$): 0.09 kg, length of the swing pole ($l$): 0.40 m, length of the cart rail ($l_r$): 0.40 m; the friction coefficients of the cart and of the pole rotation are assumed negligible. The acceleration due to gravity is g = 9.81 m/s².

    Figure 3.  Schematic diagram of the structure of the inverted pendulum system.

    In [50], the deep reinforcement learning algorithm model is used to realize the balance of the cart-pole in the inverted pendulum system, but the relevant control theory in the NCS is not involved, and the design of the reward function is not introduced in detail. The system function of the inverted pendulum system model in this paper is expressed as follows [51,52]:

    $\ddot{\theta}=\dfrac{gm\sin\theta-\cos\theta\,(F+m_pl\dot{\theta}^2\sin\theta)}{l\left(\frac{4m}{3}-m_p\cos^2\theta\right)}$, (33)
    $\ddot{\chi}=\dfrac{F+m_pl(\dot{\theta}^2\sin\theta-\ddot{\theta}\cos\theta)}{m}$, (34)

    where $m=m_c+m_p$, $\ddot{\chi}$ is the acceleration of the cart along the rail, and $\ddot{\theta}$ is the angular acceleration of the swing pole. The nonlinear equations (33) and (34) of the inverted pendulum dynamic system represent a control-input-affine system. These two nonlinear equations can be written in the following standard state-space form:

    $\dot{x}(t)=f(x(t))+g(x(t))u(t)$. (35)
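    As a concrete check of Eqs (33) and (34), the dynamics can be stepped numerically with the paper's parameter values; the explicit Euler integrator and the 0.02 s step size are our own illustrative choices, not taken from the paper:

```python
import math

M_C, M_P, L, G = 1.34, 0.09, 0.40, 9.81  # cart mass, pole mass, pole length, gravity
M = M_C + M_P

def dynamics(theta, theta_dot, F):
    # Eqs (33)-(34): angular acceleration of the pole and acceleration of the cart.
    theta_acc = (G * M * math.sin(theta)
                 - math.cos(theta) * (F + M_P * L * theta_dot**2 * math.sin(theta))
                 ) / (L * (4.0 * M / 3.0 - M_P * math.cos(theta)**2))
    chi_acc = (F + M_P * L * (theta_dot**2 * math.sin(theta)
                              - theta_acc * math.cos(theta))) / M
    return theta_acc, chi_acc

def euler_step(state, F, dt=0.02):
    # One explicit-Euler step over the state (chi, chi_dot, theta, theta_dot).
    chi, chi_dot, theta, theta_dot = state
    theta_acc, chi_acc = dynamics(theta, theta_dot, F)
    return (chi + dt * chi_dot, chi_dot + dt * chi_acc,
            theta + dt * theta_dot, theta_dot + dt * theta_acc)
```

    At the upright equilibrium both accelerations vanish, and with zero force a small positive angle produces a positive angular acceleration, i.e., the pole falls further, as expected.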

    For Eq (35), let $x(t)=(\chi,\dot{\chi},\theta,\dot{\theta})^T$, and set $\theta_{\max}$ as the maximum deviation of the swing pole from the balance angle. Let $\chi_{\max}$ be the maximum deviation of the cart from the origin of the guide rail, $\dot{\chi}_{\max}$ the maximum speed of the cart, and $\dot{\theta}_{\max}$ the maximum rotation speed of the swing pole. The system state space and evaluation function are defined as follows:

    $\Omega_x=\left\{x\;\middle|\;-\dfrac{l_r}{2}\le\chi\le\dfrac{l_r}{2},\ -\dot{\chi}_{\max}\le\dot{\chi}\le\dot{\chi}_{\max},\ -\theta_{\max}\le\theta\le\theta_{\max},\ -\dot{\theta}_{\max}\le\dot{\theta}\le\dot{\theta}_{\max}\right\}$, (36)
    $e_0(x,u)=\begin{cases}1, & x\in\Omega_x\\ 0, & x\notin\Omega_x\end{cases}$. (37)

    The range of the output value of the evaluation function $e_0$ is [0, 1]. In order to describe the expected position of the cart on the rail and guide the cart to run at the equilibrium position of the rail with greater probability, we define the following evaluation function:

    $e_1(x,u)=1-\operatorname{sgn}(\chi)\dfrac{2\chi}{l_r}$. (38)

    From the perspective of control, the cart "has to" drive the swing pole into a better state space at some positions of the rail. In this way, we can establish an evaluation function that better guides the system into the equilibrium state. If the desired balance angle in the equilibrium state is zero, then at different positions of the rail we assign position-related dynamic balance angles to guide the movement of the cart. The position-related balance angle is given by:

    $\theta_0=\dfrac{2\chi\theta_{\max}}{l_r}$. (39)

    Then the evaluation function is established as:

    $e_2(x,u)=\begin{cases}1-\dfrac{|\theta-\theta_0|}{\theta_{\max}}, & |\theta-\theta_0|\le\theta_{\max}\\ 0, & |\theta-\theta_0|>\theta_{\max}\ \text{or}\ \theta>\theta_{\max}\ \text{or}\ \theta<-\theta_{\max}\end{cases}$. (40)

    Notice that the output range of $e_2(\cdot)$ is [0, 1], and $\theta_{\max}$ is the maximum allowable deflection angle of the swing pole.
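    The evaluation functions of Eqs (36)-(40) can be sketched directly; the rail length and angle limit below follow the paper, while the two velocity limits are assumed values since they are not given here:

```python
L_R, THETA_MAX = 0.40, 0.174            # rail length and angle limit (paper values)
CHI_DOT_MAX, THETA_DOT_MAX = 2.0, 2.0   # velocity limits (assumed, not from the paper)

def e0(x):
    # Eqs (36)-(37): 1 inside the admissible region Omega_x, otherwise 0.
    chi, chi_dot, theta, theta_dot = x
    inside = (abs(chi) <= L_R / 2 and abs(chi_dot) <= CHI_DOT_MAX
              and abs(theta) <= THETA_MAX and abs(theta_dot) <= THETA_DOT_MAX)
    return 1.0 if inside else 0.0

def e1(x):
    # Eq (38): 1 - sgn(chi) * 2*chi/l_r = 1 - 2|chi|/l_r,
    # maximal when the cart sits at the midpoint of the rail.
    return 1.0 - 2.0 * abs(x[0]) / L_R

def e2(x):
    # Eqs (39)-(40): score against the position-related balance angle theta_0.
    chi, _, theta, _ = x
    theta0 = 2.0 * chi * THETA_MAX / L_R
    if abs(theta - theta0) > THETA_MAX or abs(theta) > THETA_MAX:
        return 0.0
    return 1.0 - abs(theta - theta0) / THETA_MAX
```

    Each function returns a "referee" score in [0, 1]: maximal in the balanced state and zero once the corresponding constraint is violated.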

    In the balancing process of the inverted pendulum system, the system is required not only to reach the equilibrium state in a limited time, but also to take the consumed energy into account: the energy consumed during the state transition between two adjacent time steps is expected to be minimal. Therefore, we define the following evaluation function:

    $e_3(x,u)=\eta(x,u)\varphi(x)$, (41)

    where $\eta(x,u)$ is the power consumption function of the system under the control output $u$ in state $x$, and $\varphi(x)$ is the coefficient function in state $x$. Without considering external disturbances of the system such as air resistance, the power consumption function can be expressed by the following formula:

    $\eta(x_k,u_k)=W(x_{k+1}^{u=F})+\varsigma(x_{k+1}^{u=F})-W(x_{k+1}^{u=0})-\varsigma(x_{k+1}^{u=0})$, (42)

    here, $W(x_{k+1}^{u=0})$ is defined as the energy loss after the system transitions to the next state when there is no control output in state $x_k$; in the same case of no control output, $\varsigma(x_{k+1}^{u=0})$ is the energy loss caused by system friction. $W(x_{k+1}^{u=F})$ is defined as the energy loss after the system transitions to the next state under the control output $u=F$ in state $x_k$, and $\varsigma(x_{k+1}^{u=F})$ is the energy loss caused by system friction after the system transitions to the next state under the control output $u=F$ in state $x_k$.

    In the system represented by Eqs (33) and (34), friction is not considered, so $\varsigma(x_{k+1}^{u=0})=\varsigma(x_{k+1}^{u=F})=0$. Therefore, in state $x_k$, the system energy is:

    $W(x_k)=T_c(x_k)+T_p(x_k)+V_p(x_k)$, (43)

    where $T_c(x_k)$ is the kinetic energy of the cart, $T_p(x_k)$ is the kinetic energy of the swing pole, and $V_p(x_k)$ is the potential energy of the swing pole. According to the inverted pendulum model, we can get:

    $T_c(x_k)=\dfrac{1}{2}m_c\dot{\chi}_{x_k}^2$, (44)
    $T_p(x_k)=\dfrac{1}{2}m_p\left(\left(\dfrac{d(\chi_{x_k}+\frac{l}{2}\sin\theta_{x_k})}{dt}\right)^2+\left(\dfrac{d(\frac{l}{2}\cos\theta_{x_k})}{dt}\right)^2\right)+\dfrac{1}{6}m_pl^2\dot{\theta}_{x_k}^2=\dfrac{1}{2}m_p\dot{\chi}_{x_k}^2+\dfrac{1}{2}m_pl\dot{\chi}_{x_k}\dot{\theta}_{x_k}\cos\theta_{x_k}+\dfrac{7}{24}m_pl^2\dot{\theta}_{x_k}^2$, (45)
    $V_p(x_k)=\dfrac{1}{2}m_pgl\cos\theta_{x_k}$, (46)

    where $\theta_{x_k}$ is the swing pole angle in state $x_k$. According to the above formulas, $W(x_{k+1}^{u=0})$ and $W(x_{k+1}^{u=F})$ are calculated respectively, and then $\eta(x_k,u_k)$ is obtained. The coefficient function $\varphi(x_k)$ reflects the user's specific way of allocating energy in combination with the current state. For example, the user may require the single-step energy to be minimized, or may establish a theoretical minimum energy value according to the current state and require the single-step energy to approach that value as closely as possible, while ensuring that the final evaluation value lies between 0 and 1.
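    The energy bookkeeping of Eqs (42)-(46) (with the frictionless simplification $\varsigma=0$) can be sketched as follows; the parameter values are the paper's, while the helper `eta` merely illustrates the zero-friction form of Eq (42):

```python
import math

M_C, M_P, L, G = 1.34, 0.09, 0.40, 9.81  # cart mass, pole mass, pole length, gravity

def system_energy(x):
    # Eq (43): total energy W = T_c + T_p + V_p in state x.
    chi, chi_dot, theta, theta_dot = x
    t_c = 0.5 * M_C * chi_dot ** 2                                   # Eq (44)
    t_p = (0.5 * M_P * chi_dot ** 2                                  # Eq (45)
           + 0.5 * M_P * L * chi_dot * theta_dot * math.cos(theta)
           + 7.0 / 24.0 * M_P * L ** 2 * theta_dot ** 2)
    v_p = 0.5 * M_P * G * L * math.cos(theta)                        # Eq (46)
    return t_c + t_p + v_p

def eta(x_next_with_force, x_next_no_force):
    # Eq (42) with zero friction: energy difference between acting and coasting.
    return system_energy(x_next_with_force) - system_energy(x_next_no_force)
```

    For the upright state at rest, only the pole's potential energy $\frac{1}{2}m_pgl$ remains, and any added velocity increases the total energy.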

    Note that in the control process of the system, $e_0(x,u)$ gives the maximum evaluation value 1, $e_1(x,u)$ is the evaluation function of the cart displacement $\chi$ in the state vector, $e_2(x,u)$ is the evaluation function of $\chi$ and $\theta$ in the state vector, and $e_3(x,u)$ is the evaluation function of $\theta$, $\dot{\chi}$ and $\dot{\theta}$ in the state vector. The final evaluation effect is expressed as:

    $R(x_k,u_k)=\prod_{i=0}^{3}e_i(x_k,u_k)$. (47)

    $R(x_k,u_k)$ is a nonlinear description that acts as a utility function, which differs from the description method of Eq (8). In other words, $R(x_k,u_k)$ reflects the user's specific control needs in a more subtle way. Notably, users can add to or modify the details of specific control requirements via the evaluation functions, and the modification of one evaluation function will not affect the roles of the others. In the dynamic environment, the relationship between the evaluation functions and the reward function is that the reward function is composed of multiple "referees", and these referees are the evaluation functions. Each referee gives a score ranging from 0 to 1, and the reward function then aggregates the referees' scores by continuous multiplication. Each referee scores from its own viewing angle, and no two referees share the same angle; in fact, these viewing angles are the observation perspectives shown in Figure 2. As mentioned above, no evaluation function can violate the control law of the system, which means that the referees are required to guide and encourage the "performer": the performer (i.e., the control law) follows the guidance of a referee's high score, which makes the final control effect easier to achieve. This final effect is the one closest to $r_{\max}$ recognized by every referee.

    In this paper, $R(\cdot)$ is regarded as the reward value of the dynamic environment, and the deep reinforcement learning algorithm is designed to interact with the dynamic environment from the three perspectives of state value, policy and "Actor-Critic". During the interaction, the agent's execution strategy is optimized through the reinforcement learning algorithm to maximize the reward value. The state-value-based deep reinforcement learning algorithm is exemplified by the Deep Q-Network (DQN), the policy-based algorithm by the policy-gradient REINFORCE algorithm, and the "Actor-Critic" framework by the Deep Deterministic Policy Gradient. The three deep reinforcement learning models are analyzed and described below.

    DQN is a value-based iterative reinforcement learning algorithm [53]. When the state and action space generated during the interaction between the agent and the environment is discrete and low-dimensional, the Q-Learning method can be used to establish a state-action value Q table, and the Q values in the table are updated according to the Bellman optimality equation. In the inverted pendulum system, the states of the cart and the swing pole are high-dimensional and continuous; building a table with the Q-Learning method is very difficult and often causes the "curse of dimensionality". When the states and actions are high-dimensional and continuous, the table construction can be transformed into a function-fitting problem, with the Q value generated by the fitted function instead of the table, so that similar states obtain similar output actions. For function fitting with high-dimensional input, the advantage of deep neural networks for complex feature extraction can be used to map the input to the output. Therefore, the essence of the DQN method is to combine a deep neural network with Q-Learning, map the state-action value Q through the network structure, and update the agent's action strategy according to the Bellman optimality equation. Algorithm 1 shows the DQN algorithm flow.



    In Algorithm 1, $s_t=\{\chi_t,\theta_t,\dot{\chi}_t,\dot{\theta}_t\}$ is the state of the inverted pendulum system at time $t$, where $\chi_t$ is the displacement of the cart, $\theta_t$ is the angle of the swing pole, $\dot{\chi}_t$ is the speed of the cart, and $\dot{\theta}_t$ is the angular speed of the swing pole. $r_t$ is the reward value obtained from the interaction with the dynamic environment at time $t$.

    DQN records the tuple $\{s_t,a_t,r_{t+1},s_{t+1}\}$ generated by the agent in each state and stores it in the experience replay. The deep neural network randomly selects tuples from the experience replay as training labels to update the network parameters. Randomly sampling previous experience from the replay makes the network more efficient, solving the problems of sample correlation and non-stationary distribution to a certain extent [54].

    DQN adopts two networks with the same architecture but different parameters, namely the action-value network (Q-network) and the target action-value network (target Q-network). When the action-value network alone is used to approximate the Q value, the update of the Q value is prone to oscillation, resulting in unstable learning behavior of the agent. The target action-value network reduces the correlation between the predicted Q value and the estimated Q value to a certain extent, improving the stability of the algorithm. During the network parameter update, the Q-network uses the latest parameters of each update, and the Q-network parameters are copied to the target Q-network after multiple iterations.

    At $s_t$, the agent maps the state to estimated Q values through the Q-network, uses the epsilon-greedy [55] strategy to select the Q value under the corresponding action, takes the corresponding action $a_t$ to interact with the environment, and stores the generated tuple $\{s_t,a_t,r_{t+1},s_{t+1}\}$ in the experience replay. The target Q-network maps $s_{t+1}$ as input to predicted Q values; the sum of the maximum predicted Q value and the immediate reward $r_{t+1}$ is taken as the TD target [56], and the mean square error between the TD target and the Q value serves as the loss function for updating the network parameters. The loss function between the target action-value function and the action-value function in DQN is as follows:

    $\text{Target}Q=r_{t+1}+\gamma\max_{a}\hat{Q}_{w}(s_{t+1},a)$, (48)
    $L(\theta)=\mathbb{E}_{\{s_t,a_t,r_{t+1},s_{t+1}\}\in N}\left[(\text{Target}Q-Q_w(s_t,a_t))^2\right]$. (49)
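    Eqs (48) and (49) can be sketched with dictionary lookup tables standing in for the online and target networks; the states, actions, and stored values below are hypothetical toy stand-ins, not the paper's networks:

```python
GAMMA = 0.95
ACTIONS = (0, 1)

def td_target(r_next, s_next, target_q):
    # Eq (48): immediate reward plus the discounted maximum of the
    # target network's Q values in the next state.
    return r_next + GAMMA * max(target_q[(s_next, a)] for a in ACTIONS)

def dqn_loss(batch, q, target_q):
    # Eq (49): mean squared error between the TD target and the online Q value,
    # averaged over a minibatch of (s, a, r, s') tuples from the replay.
    errors = [(td_target(r, s2, target_q) - q[(s, a)]) ** 2
              for (s, a, r, s2) in batch]
    return sum(errors) / len(errors)
```

    Only the online table `q` would be updated by this loss; the target table is refreshed separately after multiple iterations, as described above.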

    In the DQN series of reinforcement learning algorithms, we mainly approximate the value function, learn the action-value function, and then adopt a near-greedy strategy to select the agent's output action according to the estimated action-value function. But the DQN series of algorithms has the following problems: 1) Algorithms based on value functions can only handle discrete actions, not continuous actions output by the agent. 2) Algorithms based on the value function cannot solve stochastic policy problems: the optimal strategy corresponding to the DQN series of methods is a deterministic strategy, taking the action corresponding to the maximum value as the optimal strategy. Aiming at these shortcomings of value-based reinforcement learning algorithms, a policy-based reinforcement learning algorithm, the policy gradient, is introduced [57].

    The following is the gradient formula of the state value function under Monte Carlo sampling, where $\Pr(s\to x,k,\pi)$ is the probability of transitioning from state $s$ to state $x$ in $k$ steps under policy $\pi$:

    $\nabla v_\pi(s)=\nabla\left[\sum_a\pi(a|s)q_\pi(s,a)\right]$
    $=\sum_a\left[\nabla\pi(a|s)q_\pi(s,a)+\pi(a|s)\nabla q_\pi(s,a)\right]$
    $=\sum_a\left[\nabla\pi(a|s)q_\pi(s,a)+\pi(a|s)\nabla\sum_{s',r}p(s',r|s,a)(r+v_\pi(s'))\right]$
    $=\sum_a\left[\nabla\pi(a|s)q_\pi(s,a)+\pi(a|s)\sum_{s'}p(s'|s,a)\nabla v_\pi(s')\right]$
    $=\sum_{x\in\mathcal{S}}\sum_{k=0}^{\infty}\Pr(s\to x,k,\pi)\sum_a\nabla\pi(a|x)q_\pi(x,a)$. (50)

    $\nabla v_\pi(s)$ in Eq (50) is converted as follows:

    $\nabla J(w)=\nabla v_\pi(s_0)$
    $=\sum_s\sum_{k=0}^{\infty}\Pr(s_0\to s,k,\pi)\sum_a\nabla\pi(a|s)q_\pi(s,a)$
    $=\sum_s\eta(s)\sum_a\nabla\pi(a|s)q_\pi(s,a)$
    $=\sum_{s'}\eta(s')\sum_s\dfrac{\eta(s)}{\sum_{s'}\eta(s')}\sum_a\nabla\pi(a|s)q_\pi(s,a)$
    $=\sum_{s'}\eta(s')\sum_s\mu(s)\sum_a\nabla\pi(a|s)q_\pi(s,a)$
    $\propto\sum_s\mu(s)\sum_a\nabla\pi(a|s)q_\pi(s,a)$
    $=\mathbb{E}_\pi\left[\sum_a q_\pi(S_t,a)\nabla\pi(a|S_t,w)\right]$. (51)

    $\nabla J(w)$ is the gradient of the performance measure with respect to the policy parameters, $\mu(s)$ is the on-policy state distribution under the policy $\pi$, and the learning goal is to maximize the value of $J(w)$, so the policy parameters are updated by following the gradient of $J(w)$, that is:

    $w_{t+1}=w_t+\alpha\nabla J(w_t)$
    $=w_t+\alpha\mathbb{E}_\pi\left[\sum_a q_\pi(S_t,a)\nabla\pi(a|S_t,w)\right]$
    $=w_t+\alpha\sum_a\hat{q}(S_t,a,w')\nabla\pi(a|S_t,w)$. (52)

    In the above equation, $w'$ and $w$ are the parameters of the action-value network and the policy network, respectively. In the Monte Carlo policy gradient, the sample $A_t$ is used to replace $a$ under the policy $\pi$. Therefore, $\nabla J(w)$ can be defined as:

    $\nabla J(w)=\mathbb{E}_\pi\left[\sum_a q_\pi(S_t,a)\nabla\pi(a|S_t,w)\right]$
    $=\mathbb{E}_\pi\left[\sum_a\pi(a|S_t,w)q_\pi(S_t,a)\dfrac{\nabla\pi(a|S_t,w)}{\pi(a|S_t,w)}\right]$
    $=\mathbb{E}_\pi\left[q_\pi(S_t,A_t)\dfrac{\nabla\pi(A_t|S_t,w)}{\pi(A_t|S_t,w)}\right]$
    $=\mathbb{E}_\pi\left[G_t\dfrac{\nabla\pi(A_t|S_t,w)}{\pi(A_t|S_t,w)}\right]$. (53)

    $G_t$ represents the return obtained from state $S_t$; $G_t\frac{\nabla\pi(A_t|S_t,w)}{\pi(A_t|S_t,w)}$ can be calculated by sampling, and its expected value equals the gradient. Substituting this gradient into the formula $w_{t+1}=w_t+\alpha\nabla J(w_t)$, we get:

    $w_{t+1}\doteq w_t+\alpha G_t\dfrac{\nabla\pi(A_t|S_t,w_t)}{\pi(A_t|S_t,w_t)}$
    $=w_t+\alpha G_t\nabla\ln\pi(A_t|S_t,w_t)$. (54)
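    The update of Eq (54) can be sketched for a softmax policy over action preferences, for which $\nabla_{w_a}\ln\pi(A)=\mathbf{1}\{a=A\}-\pi(a)$; the state-independent preference vector and the step size are illustrative simplifications, not the paper's network:

```python
import math

def softmax(w):
    # Numerically stable softmax over action preferences.
    m = max(w)
    e = [math.exp(v - m) for v in w]
    s = sum(e)
    return [v / s for v in e]

def reinforce_update(w, action, ret, alpha=0.1):
    # Eq (54): w <- w + alpha * G_t * grad ln pi(A_t | S_t, w).
    # For a softmax policy, d/dw_a ln pi(A) = 1{a == A} - pi(a).
    pi = softmax(w)
    return [wa + alpha * ret * ((1.0 if a == action else 0.0) - pi[a])
            for a, wa in enumerate(w)]
```

    A positive return $G_t$ raises the log-probability of the sampled action, while a zero return leaves the preferences unchanged.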

    Algorithm 2 shows the process of the Monte Carlo policy gradient (REINFORCE) algorithm. In Algorithm 2, $s=\{\chi,\theta,\dot{\chi},\dot{\theta}\}$ represents the state in the inverted pendulum system, and $r$ represents the immediate reward obtained when the state is transferred, which is equivalent to $R(x_k,u_k)$.



    According to the output action characteristics of the agent in reinforcement learning, strategies can be divided into deterministic strategies and stochastic strategies. A deterministic policy means that the action output by the agent through the policy in a given state is deterministic. A stochastic strategy means that, in a given state, the agent's policy outputs a probability over multiple actions, and the agent chooses the action to execute according to that probability. The DQN series of methods uses a greedy strategy to select the action corresponding to the maximum Q value, which is a greedy deterministic strategy. When the action set output by the agent is continuous-valued, or discrete-valued with very high dimensionality, using a stochastic strategy to model the probability of each action output greatly increases the amount of computation. The Deep Deterministic Policy Gradient (DDPG) algorithm is often used to solve the action selection problem in a high-dimensional continuous state space [58]. The algorithm is built on the Actor-Critic framework, making full use of DQN and policy gradients. The DDPG algorithm uses a training network and a target network in the design of the network structure. The target network reduces the correlation between the predicted value and the estimated value to a certain extent and improves the stability of the algorithm; training the network with samples from the experience replay minimizes the correlation between samples.

    The DDPG algorithm mainly includes the following parts: ① The network includes two parts, the Actor and the Critic, each composed of a training network and a target network. ② An experience replay is introduced to store the data of the interaction between the agent and the environment, and a soft update is used to update the parameters of the target networks to ensure the stability of the training networks [59]. ③ Random noise is added to the Actor network output to ensure that the agent retains a certain exploration ability when selecting actions [60]. Algorithm 3 shows the process of the DDPG algorithm.



    In Algorithm 3, $N$ is random noise, $\mu(s|w^\mu)$ is the Actor network, $Q(s,a|w^Q)$ is the Critic network, $\mu'(s|w^{\mu'})$ is the target Actor network and $Q'(s,a|w^{Q'})$ is the target Critic network. At $s_t$ the agent obtains the execution action $a_t$ through the network $\mu$, adds noise to the action, and interacts with the environment, generating an immediate reward $r_{t+1}$; the state is updated to $s_{t+1}$. The agent stores the tuple $\{s_t,a_t,r_{t+1},s_{t+1}\}$ in the experience replay. When the experience replay holds a sufficient amount of data, batches are sampled from it to train the network $Q$. The Q-network needs to evaluate the actions made by $\mu$; therefore, the deterministic policy gradient method is used to optimize the network parameters of $\mu$, and the network parameters are copied by a soft update. The agent keeps repeating the above process until the policy network converges.
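    The soft update mentioned in ② and in the description above can be sketched in one line per parameter; the parameter lists and the coefficient tau are illustrative placeholders:

```python
def soft_update(target_params, online_params, tau=0.01):
    # DDPG soft (Polyak) update: w' <- tau * w + (1 - tau) * w',
    # applied elementwise to matching parameters of the two networks.
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_params, target_params)]
```

    With a small tau, the target network trails the training network slowly, which is what keeps the TD targets of Eq (55) stable.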

    The target action value function in the DDPG algorithm can be expressed as:

    $y_i=r_{i+1}+\gamma Q'(s_{i+1},\mu'(s_{i+1}|w^{\mu'})|w^{Q'})$. (55)

    The loss function of the Critic network is:

    $L(w^Q)=\dfrac{1}{N}\sum_i(y_i-Q(s_i,a_i|w^Q))^2$. (56)

    Updating the policy network parameters using the sampled deterministic policy gradient can be expressed as:

    $\nabla_{w^\mu}J\approx\dfrac{1}{N}\sum_i\nabla_aQ(s,a|w^Q)\big|_{s=s_i,a=\mu(s_i)}\nabla_{w^\mu}\mu(s|w^\mu)\big|_{s=s_i}$. (57)

    This section describes and analyzes the experimental results. We use three reinforcement learning algorithms to train the inverted pendulum control model, and analyze the differences between the different algorithms based on their training data and test data (Section 4.4.1). Finally, we discuss the current research deficiencies and propose future research directions (Section 4.4.2).

    This section compares the experimental results of three deep reinforcement learning algorithms (DQN, policy-gradient, DDPG) in the dynamic environment of the inverted pendulum system, including the impact of the DRM settings on the experimental accuracy, and the differences between different algorithms. The experiment is divided into two parts: ⅰ) The influence of the setting of the detail-reward function under three deep reinforcement learning algorithms on the experimental accuracy in the training and testing phase. ⅱ) Differences between three deep reinforcement learning algorithms under the same detail-reward function. In the dynamic environment of the inverted pendulum system, the initial states of the inverted pendulum system in the training and testing phase are shown in Table 1.

    Table 1.  Initial state parameters of the inverted pendulum system.
    Experimental phase | Cart position (m) | Cart speed (m/s) | Swing pole angle (rad) | Swing pole speed (rad/s)
    Training phase | (−0.2, 0.2) | (−0.05, 0.05) | (−0.174, 0.174) | (−0.05, 0.05)
    Testing phase | −0.12 | 0.2 | 0.1 | 0.2
    *Note: In the training phase of the inverted pendulum system, the initial state of the cart and swing pole is a random number within the corresponding parameter range. In the experiment, the units of the cart position and the swing pole angle are the same as those in the table.


    1) Experimental comparison of reward functions with different details under the same algorithm

    Tables 2 and 3 show the reinforcement learning parameters and network parameters of the DQN algorithm in the inverted pendulum system.

    Table 2.  Deep reinforcement learning parameters.
    Episodes | Steps | Discount factor | Epsilon (ε) | Memory
    2000 | 300 | 0.95 | 0.9 | 2000

    Table 3.  Neural network parameters.
    Network | Network structure | Learning rate | Loss function | Activation function | Batch size
    Action-value Q network | (4, 128), (128, 128), (128, 25) | 0.0002 | MSE | ReLU | 32


    In Table 3, we set the number of outputs of the action-value Q-network to 25. It can be seen from reference [61] that the richer the choice of the control vector $u(k)$, the higher the controllability; but if the action set is too large, the convergence of the network model worsens during training. Therefore, a reasonable choice of the number of action values output by the Q-network can improve the controllability of the inverted pendulum system without affecting the training effect.

    We first describe the agent training rules for the following experiments. An episode is one round of agent training, and in each episode the agent is allowed at most 300 learning steps. The admissible range of the cart position is [−0.2, 0.2], and that of the swing pole angle is [−0.174, 0.174]. During training, as long as the cart and the swing pole stay within these ranges, the agent continues to learn in the current episode: the step counter is incremented by 1 and the instant reward is added to the episode's reward sum; otherwise the episode is terminated. The sums of steps and rewards in each episode are recorded.
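    The training rules above can be sketched as a small episode loop; `env_step` and `policy` are hypothetical placeholders for the dynamic environment and the agent's strategy, and the dummy environment below exists only to exercise the loop:

```python
def run_episode(env_step, policy, init_state, max_steps=300):
    # Keep counting steps and summing rewards while the environment reports
    # that the state is inside the admissible ranges; exit the episode otherwise.
    state, steps, total_reward = init_state, 0, 0.0
    for _ in range(max_steps):
        state, reward, inside = env_step(state, policy(state))
        if not inside:
            break
        steps += 1
        total_reward += reward
    return steps, total_reward

# A dummy environment that stays admissible for exactly five transitions.
def fake_step(s, a):
    return s + 1, 1.0, s + 1 <= 5

print(run_episode(fake_step, lambda s: 0, 0))  # (5, 5.0)
```

    The per-episode pairs (steps, total reward) returned by such a loop are exactly the quantities plotted in the training curves below.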

    Figure 4 shows the training and testing process of the inverted pendulum system using the DQN algorithm in the dynamic environment. Figure 4(a) shows the number of admissible control actions output by the inverted pendulum agent in each episode during training. Figure 4(b) shows the cumulative sum of the reward values obtained by the agent in each episode during training. Figure 4(c), (d) show the experimental results in the testing phase: Figure 4(c) shows the position of the cart at each step, and Figure 4(d) shows the swing pole angle at each step. In the inverted pendulum system, the cart should be controlled to stay near the midpoint of the rail ($\chi=0$) and the swing pole angle near zero ($\theta=0$).

    Figure 4.  The training and testing results of DQN.

    As can be seen from Figure 4, the setting of the DRM affects the reinforcement learning effect of the inverted pendulum. In Figure 4(a), (b), a single-reward function (i.e., $R=e_0$) is used: when the episode value is less than 400, the number of times the rule is met is less than 50, and the reward value in each episode is less than 50. When the number of episodes is greater than 400, the reward value gradually increases, but the reward distribution curve fluctuates greatly; the generalization of the trained model is weak, and the agent cannot learn useful experience well. When the detail-reward function (i.e., $R=\prod_i e_i$) is used, the number of times the rule is met is proportional to the number of episodes. In Figure 4(b), when the episode count is greater than 200, the training process tends to converge, the reward value is close to 250, and the reward distribution curve is relatively flat.

    In Figure 4(c), (d), when steps = 15 and a single-reward function is used, the position of the cart is 0.01 and the swing pole angle is 0.33, which means that the agent control fails. After 40 steps, by using detail-reward function, the position of the cart is close to zero, the swing pole angle is close to zero degrees, and the distribution of the position and angle curves is relatively gentle.

    The above analysis of the distribution curves of steps, rewards, cart position, and swing pole angle in Figure 4 shows that when using the DQN algorithm to train an agent in a dynamic environment of inverted pendulum system, the experimental effect of detail-reward function is better than a single-reward function.

    Tables 4 and 5 show the reinforcement learning parameters and network parameters of the policy gradient algorithm in the inverted pendulum system. Figure 5 shows the training and testing process of an inverted pendulum system using the policy gradient algorithm in a dynamic environment.

    Table 4.  Deep reinforcement learning parameters.
    Episodes | Steps | Discount factor
    15000 | 300 | 0.99

    Table 5.  Neural network parameters.
    Network | Network structure | Learning rate | Loss function | Activation function
    Policy network | (4, 128), (128, 128), (128, 3) | 1 × 10−3 | Cross entropy | Adam

    Figure 5.  The training and testing results of policy gradient.

    As can be seen from Figure 5(a), (b), the training process converges faster when a single-reward function is used: when the episode count reaches 3000, the step count and the reward both reach the maximum value of 300. When the detail-reward function is used, the training process converges more slowly and the reward rises more gently; when the episode count reaches 14,000, the step count and reward reach their maximum values of 300 and 250, respectively. According to Eq (37), a single-reward function means that only the evaluation function $e_0(\cdot)$ plays a role, and its values are only 0 and 1, whereas the final value range of the detail-reward function is [0, 1]. Therefore, the maximum reward values in these two cases differ.

    In area A2 of Figure 5(c), when the detail-reward function is used, the position curve of the cart is relatively flat, with a minimum position of −0.05 and a maximum of 0.04. When a single-reward function is used, the cart position curve oscillates strongly, with minimum and maximum positions of −0.077 and 0.05, respectively. It can be seen from area A1 of Figure 5(d) that when the detail-reward function is used, the swing pole angle varies within [−0.05, 0.05]; with the single-reward function the range is [−0.07, 0.06]. The data illustrate that the swing angle range of the former is smaller than that of the latter.

    By analyzing the distribution curves of step, reward, cart position and swing pole angle in Figure 5, we can conclude that when an agent is trained with the policy-gradient algorithm, the single-reward function needs fewer training rounds than the detail-reward function, but in the test phase the experimental effect of the detail-reward function is better than that of the single-reward function.

    Tables 6 and 7 show the reinforcement learning parameters and network parameters of the DDPG algorithm in the inverted pendulum system. Figure 6 shows the training and testing process while using the DDPG algorithm in a dynamic environment.

    Table 6.  Deep reinforcement learning parameters.
    Episodes | Steps | Discount factor | Soft update coefficient | Memory
    2000 | 300 | 0.99 | 1 × 10−2 | 10,000

    Table 7.  Neural network parameters.
    Network | Network structure | Learning rate | Loss function | Activation function | Batch size
    Actor network | (4, 128), (128, 128), (128, 1) | 1 × 10−3 | $Q(s,a|w^Q)$ | Adam | 64
    Critic network | (5, 128), (128, 128), (128, 1) | 1 × 10−2 | MSE | Adam | 64

    Figure 6.  The training and testing results of DDPG.

    It can be seen from Figure 6(a), (b) that in the case of using single-reward function, when the episodes are less than 180, the number of times the rule is met is less than 100, and the reward value under each episode is less than 50. As the episodes gradually increase, the distribution of step and reward curves fluctuates greatly. In contrast, using detail-reward function, when the episodes are greater than 180, the distribution of the reward curve is relatively flat, and the reward distribution under each episode is around 210.

    In Figure 6(c), (d), when steps = 15 and the single-reward function is used, the cart is located at 0.05 in the positive direction and the swing pole angle is 0.3, which means that the agent has lost control. With the detail-reward function, after 200 steps both the cart position and the swing pole angle are close to zero, and the position and angle curves are relatively gentle.

    From the above analysis it can be concluded that, when training an agent with the DDPG algorithm, the reward obtained with the single-reward function is in most cases greater than that obtained with the detail-reward function. However, the reward curve under the single-reward function fluctuates greatly and the model stability is poor; moreover, the model trained with the single-reward function loses control during testing. The experimental effect of the detail-reward function is therefore better than that of the single-reward function.
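The DDPG training summarized above relies on soft target updates; a minimal sketch with the coefficient τ = 1 × 10−2 from Table 6 follows. Representing a network as a list of (W, b) pairs is an illustrative assumption, not the paper's code:

```python
import numpy as np

# Sketch of the DDPG soft target update with the coefficient from Table 6.
# Representing a network as a list of (W, b) pairs is an illustrative
# assumption, not the paper's code.

TAU = 1e-2  # soft update coefficient (Table 6)

def soft_update(target_params, online_params, tau=TAU):
    """Polyak averaging: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [(tau * Wo + (1 - tau) * Wt, tau * bo + (1 - tau) * bt)
            for (Wt, bt), (Wo, bo) in zip(target_params, online_params)]

# Toy check: the target network moves 1% of the way toward the online network.
target = [(np.zeros((2, 2)), np.zeros(2))]
online = [(np.ones((2, 2)), np.ones(2))]
target = soft_update(target, online)
print(target[0][0][0, 0])  # 0.01
```

The small τ makes the target networks track the online networks slowly, which is what keeps the bootstrapped Q-targets stable during training.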

    Through the analysis and comparison of the above experimental results, the following conclusions can be drawn: i) Using DRM in the inverted pendulum system under deep reinforcement learning is effective and innovative. ii) A good design of the evaluation functions is crucial for both the training and testing process: under different algorithms, the detail-reward function produces better experimental results than the single-reward function.

    2) Comparison of experimental effects of different algorithms under the same detail-reward function

    Figures 7 and 8 compare, in the testing phase, the models generated by different deep reinforcement learning algorithms trained under the same detail-reward function. Figure 7 shows the distribution curves of the cart position at each step of the inverted pendulum system when DRM is applied. After 50 steps, the model generated by the DQN algorithm converges to a point close to 0, the model generated by the policy-gradient algorithm fluctuates in the range [−0.06, 0.05], and under the model generated by the DDPG algorithm the cart position gradually converges to the vicinity of zero as the number of steps increases. The right part of Figure 7 is an enlarged view of area B. It shows that the model generated by the policy-gradient algorithm fluctuates greatly, i.e., the cart position swings over a range of more than 0.06, whereas the models generated by the DQN and DDPG algorithms have a small fluctuation range, meaning that the cart stays reasonably near the equilibrium point.

    Figure 7.  Comparison of test accuracy under different algorithms (cart position).
    Figure 8.  Comparison of test accuracy under different algorithms (swing pole angle).

    Figure 8 shows the distribution curves of the swing pole angle at each step under the models generated by the three algorithms when DRM is applied. After 50 steps, the model generated by the DQN algorithm converges to near zero during testing, the model generated by the policy-gradient algorithm fluctuates in the range [−0.05, 0.05], and the model generated by the DDPG algorithm gradually converges to zero as the number of steps increases. The right part of Figure 8 is an enlarged view of area A. After 200 steps, the models generated by DDPG and DQN guide the swing pole angle to vibrate around 0, while the fluctuation range of the swing pole angle under the policy-gradient model is [−0.06, 0.06], larger than that of the other two models.

    By analyzing the distribution curves in Figures 7 and 8, the following conclusions can be drawn: ⅰ) The system stability of the policy-gradient algorithm during testing is weaker than that of DQN and DDPG, and the cart position and swing pole angle are difficult to converge to zero. ⅱ) During testing of the DQN algorithm, the cart position and swing pole angle distributions are both close to zero. However, the output of the DQN algorithm is a discrete action; in a continuous control system, frequently output discrete actions affect the stability of the system, and on a physical platform the resulting large pulsed current output can greatly reduce the lifetime of electronic devices. ⅲ) The DDPG algorithm outputs continuous actions; compared with the DQN algorithm, its convergence rate is slower. During testing, as the number of steps increases, the cart position and swing pole angle distributions gradually converge to zero, so in an inverted pendulum system the output of the DDPG model can ensure the stability of the system. Among the three deep reinforcement learning algorithms, DDPG, as a continuous-action-output algorithm, can be applied to the inverted pendulum control system and achieves good experimental results when DRM is introduced into the algorithm.

    According to the characteristics of the inverted pendulum NCS, deep reinforcement learning with DRM can solve the optimal control problem of the system. The above experiments compared the control accuracy of the inverted pendulum system under different reward functions. This paper still has the following limitations in its experimental details: 1) DRM is introduced only to the inverted pendulum system, without considering more complex nonlinear systems. 2) A novel DRM method is given, and the evaluation functions and reward function are established by taking the inverted pendulum system as an example, but unified rules for constructing evaluation functions are not given, i.e., the construction depends on the designer's personalized design. 3) The parameters of the inverted pendulum system used have limitations; for example, the swing pole angle is restricted to the range [−0.174, 0.174], and the control problem over a broader state space is not studied. 4) When designing the dynamic environment of the inverted pendulum system, the unstable disturbance of a real inverted pendulum experimental platform is not considered, which limits the deployment of the trained agent models to real physical platforms.

    Future work will improve on the following aspects: 1) We will apply DRM to more nonlinear systems and summarize the applicability of the method in a wider experimental setting. 2) The construction of an evaluation function is a reward realization for a specific observation perspective; through experiments in more nonlinear environments, the validity of the evaluation functions and their design method will be summarized. 3) We will increase the control range of the swing pole angle in the inverted pendulum system and use deep reinforcement learning algorithms to solve the global optimal control problem. 4) We will design the target policy and the action policy separately according to importance-sampling off-policy learning theory in deep reinforcement learning: the action policy is optimized by training in the dynamic environment, while the target policy is the control policy for the real experimental platform. The action policy samples data in the dynamic environment, and these data are used to estimate the target policy on the real experimental platform, so as to realize optimal control of the real system.

    In this paper, a reward mechanism based on control details is proposed for the first time. According to the specific requirements of the control objectives, the state space is divided into different observation perspectives, and an evaluation function is designed for each; the final reward function is then constructed as the continuous product of the evaluation functions, which replaces the utility function in the HJB equation. Based on Lyapunov stability theory and Q-learning, the stability of the control law is proved. Then, taking the nonlinear inverted pendulum system as the experimental object, the detail-reward mechanism is introduced into deep reinforcement learning algorithms. After designing the reward function according to the detail-reward mechanism, the effects of different reward functions on experimental accuracy are compared under three algorithms: DQN, policy-gradient and DDPG. The effectiveness of the method is proved by experiments, and the optimal reward function and deep reinforcement learning algorithm model for the inverted pendulum system are determined. Future work includes: 1) applying the detail-reward mechanism to other deep reinforcement learning algorithms for verification, such as A3C, TD3 and PPO; 2) applying the detail-reward mechanism to more nonlinear control systems, such as multi-agent and robot motion control systems; 3) optimizing the model generated by the deep reinforcement learning algorithm to improve the control effect.
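The product-of-evaluation-functions construction described above can be sketched as follows. The Gaussian form of the evaluation functions and the scale constants are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

# Illustrative sketch of the detail-reward construction: one bounded
# evaluation function per observation perspective, multiplied into a
# single reward. The Gaussian form and the scale constants are
# assumptions, not the paper's exact evaluation functions.

def evaluation(value, scale):
    """Score in (0, 1]: equals 1 at the target value 0 and decays with |value|."""
    return float(np.exp(-(value / scale) ** 2))

def detail_reward(cart_pos, cart_vel, pole_angle, pole_vel):
    factors = [
        evaluation(cart_pos, 0.05),    # cart position perspective
        evaluation(cart_vel, 0.5),     # cart velocity perspective
        evaluation(pole_angle, 0.05),  # swing pole angle perspective
        evaluation(pole_vel, 0.5),     # swing pole angular velocity perspective
    ]
    # Product form: the reward is high only when every detail is good.
    return float(np.prod(factors))

print(detail_reward(0.0, 0.0, 0.0, 0.0))  # 1.0 at the equilibrium point
```

Because the factors are multiplied rather than summed, a single poorly satisfied detail drives the whole reward toward zero, which is what forces the agent to satisfy every observation perspective simultaneously.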

    The authors acknowledge the support of the Joint Development Research Institute of Intelligent Motion Control Technology of the Liaoning Provincial Department of Education and the National Key R & D Program of China (Grant No. 2017YFB1300700).

    The authors declare that they have no conflict of interest in this work.



    © 2021 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
