
The objective of reinforcement learning (RL) is to find an optimal strategy for solving a dynamical control problem. Evolution strategy (ES) has shown great promise in many challenging reinforcement learning tasks, where the underlying dynamical system is only accessible as a black box such that adjoint methods cannot be used. However, existing ES methods have two limitations that hinder their applicability to RL. First, most existing methods rely on Monte Carlo based gradient estimators to generate search directions. Owing to the low accuracy of Monte Carlo estimators, RL training suffers from slow convergence and requires many iterations to reach the optimal solution. Second, the landscape of the reward function can be deceptive and may contain many local maxima, causing ES algorithms to converge prematurely and preventing them from exploring other parts of the parameter space with potentially greater rewards. In this work, we employ a Directional Gaussian Smoothing Evolution Strategy (DGS-ES) to accelerate RL training; it is well suited to address these two challenges because it (i) provides gradient estimates with high accuracy, and (ii) finds nonlocal search directions that emphasize large-scale variation of the reward function and disregard local fluctuations. Through several benchmark RL tasks demonstrated herein, we show that the DGS-ES method is highly scalable, achieves superior wall-clock time, and obtains reward scores competitive with other popular policy gradient and ES approaches.
Citation: Jiaxin Zhang, Hoang Tran, Guannan Zhang. Accelerating reinforcement learning with a Directional-Gaussian-Smoothing evolution strategy[J]. Electronic Research Archive, 2021, 29(6): 4119-4135. doi: 10.3934/era.2021075
Reinforcement learning is a class of problems that aim to find, through trial and error, a feedback policy prescribing how an agent should act in an uncertain, complex environment to maximize some notion of cumulative reward [30]. RL can be viewed as a generalization of stochastic optimal control. Traditionally, RL algorithms have mainly been employed for small state and action spaces, and have suffered difficulties when scaling to high-dimensional problems. With the recent emergence of deep learning, powerful nonlinear function approximators such as deep neural networks (DNNs) can be integrated into RL, extending its capability to a variety of challenging tasks that would otherwise be infeasible, ranging from playing Atari from pixels [18] and playing expert-level Go [28] to robotic control [2]. Among the most popular deep RL algorithms today are Q-learning methods, policy gradient methods, and evolution strategies. Deep Q-learning algorithms [18] use a DNN to approximate the optimal Q-function, yielding policies that, for a given state, choose the action that maximizes the Q-value. Policy gradient methods [26] improve the policy with a gradient estimator obtained from sample trajectories in action space; examples include A3C [17], TRPO [24] and PPO [25].
This work concerns RL techniques based on evolution strategies. ES refers to a family of blackbox optimization algorithms inspired by ideas of natural evolution, often used to optimize functions when gradient information is inaccessible. This is exactly the prominent challenge in a typical RL problem, where the environment and policy are usually nonsmooth or can only be accessed via noisy sampling. It is therefore not surprising that ES has become a convincing competitor to Q-learning and policy gradient methods in deep RL in recent years. Unlike policy gradient methods, ES perturbs and performs the policy search directly in the parameter space to find an effective policy, an approach now generally considered superior to action perturbation [27]. The policy search can be guided by a surrogate gradient [23], be completely population-based and gradient-free [29], or be hybridized with other exploration strategies such as novelty search and quality diversity [7]. It has been shown in those works that ES is easy to parallelize and requires low communication overhead in a distributed setting. More importantly, while achieving competitive performance on many RL tasks, these methods are advantageous over other RL approaches in their high scalability and substantially lower wall-clock time. Given the wide availability of distributed computing resources, all environment simulations at each iteration of training can be conducted entirely in parallel. Thus, a more reasonable metric for the performance of a training algorithm is the number of non-parallelizable iterations, as opposed to the sample complexity. In this metric, ES is truly an appealing choice.
Nevertheless, several challenges need to be addressed in order to further improve the performance of ES in training complex policies. First, most ES methods cope with the non-smoothness of the objective function by considering a Gaussian-smoothed version of the expected total reward. The gradient of this function is intractable and must be estimated to provide the policy parameter update. In the pioneering work [23], a gradient estimator was proposed based on random parameter sampling. Developing efficient sampling strategies for gradient estimation has since become an active topic in ES research, and several improvements have been proposed, based on imposing structure on the parameter perturbations [5] or reusing past evaluations [6,15,16]. Yet most of these gradient estimators are of Monte Carlo type, and are therefore affected by the low accuracy of Monte Carlo methods. For faster convergence of training (i.e., reducing the number of iterations), more accurate gradient estimators are desired, particularly in RL tasks where the policy has a large number of parameters to learn. Another prominent challenge is that the landscape of the objective function is complex and possesses numerous local maxima. Any optimization algorithm risks getting trapped at some of those points and failing to explore the parameter space effectively. Gaussian smoothing, with its ability to smooth out the objective function and damp out small, insignificant fluctuations, is a strong candidate for addressing this very challenge. Specifically, with a moderately large smoothing parameter (i.e., a strong smoothing effect), we can expect the gradient of the smoothed objective function to look past unimportant variations in the adjacent area and detect the general trend of the function from a distance, thereby providing an efficient nonlocal search direction. This potential of Gaussian smoothing, however, has not been explored in reinforcement learning.
In this paper, we propose a new strategy to accelerate the time-to-solution of reinforcement learning by exploiting the Directional Gaussian Smoothing Evolution Strategy (DGS-ES) method developed in our recent work [33]. The DGS-ES method introduces a new directional Gaussian smoothing (DGS) gradient operator that smooths the original objective function only along a set of orthogonal directions, so that each directional derivative of the smoothed function reduces to a one-dimensional integral that can be approximated with high accuracy by Gauss-Hermite quadrature; with a moderately large smoothing radius, the resulting DGS gradient provides a nonlocal search direction that skips over small-scale fluctuations of the reward landscape.
The rest of the paper is organized as follows: the RL problem under consideration and a brief review of the classic Gaussian smoothing technique are given in Section 2; the DGS gradient and the corresponding DGS-ES algorithm are introduced in Section 3; extensive numerical experiments, including tests on benchmark optimization problems and on several benchmark RL problems, are provided in Section 4; and some concluding remarks are given in Section 5.
We study the continuous-time, continuous-state stochastic control problem via RL. The evolution of the state $s_t$, driven by the action $a_t$ selected according to a policy $\pi$, is described by the stochastic differential equation
$ds_t = b(s_t, a_t)\,dt + \sigma(s_t, a_t)\,dw_t,$  (1)
where $b$ and $\sigma$ denote the drift and diffusion coefficients, respectively, and $w_t$ is a standard Wiener process. The objective is to find a policy $\pi$ that maximizes the expected cumulative discounted reward
$J(\pi) := \mathbb{E}_{\pi}\!\left[\int_0^T \gamma^{t}\, r(s_t, a_t)\, dt\right],$
where $\gamma\in(0,1]$ is the discount factor, $r(s_t,a_t)$ is the instantaneous reward, and $T$ is the time horizon. In practice, the time interval $[0,T]$ is discretized into steps $\{t_n\}_{n=0}^{N}$, and the objective is approximated by
$J(\pi) := \sum_{n=0}^{N}\mathbb{E}_{(s_{t_n},\, a_{t_n})}\!\left[\gamma^{t_n}\, r(s_{t_n}, a_{t_n})\right],$  (2)
where the expectation is taken over the state-action pairs $(s_{t_n}, a_{t_n})$ visited when following the policy $\pi$.
In policy-based RL, the policy $\pi$ is parameterized by a vector $\theta\in\mathbb{R}^{d}$, e.g., the weights of a deep neural network, and is denoted by $\pi_\theta$. The RL problem in Eq. (2) then becomes the optimization problem
$\max_{\theta\in\mathbb{R}^d} J(\theta),$  (3)
where we denote by $J(\theta) := J(\pi_\theta)$ the expected total reward as a function of the policy parameters; $J(\theta)$ can only be evaluated by running the environment simulation, i.e., as a black box.
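To make the blackbox nature of Eq. (2) concrete, the following sketch shows one way to estimate $J(\theta)$ for a parameterized policy by averaging discounted returns over a few rollouts in an OpenAI Gym environment. It is an illustrative sketch only: the environment name, rollout count, discount factor, and policy interface are assumptions of this example, not settings reported in the paper.

```python
import gym
import numpy as np

def estimate_return(policy, env_name="Pendulum-v0", n_rollouts=5, gamma=0.99):
    """Monte Carlo estimate of the expected total reward J in Eq. (2).

    `policy` is any callable mapping an observation to an action; the
    environment name and rollout count are illustrative placeholders.
    Uses the classic Gym API (env.step returns obs, reward, done, info).
    """
    env = gym.make(env_name)
    returns = []
    for _ in range(n_rollouts):
        obs = env.reset()
        done, total, discount = False, 0.0, 1.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += discount * reward
            discount *= gamma
        returns.append(total)
    env.close()
    return float(np.mean(returns))
```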
We briefly recall the evolution strategy methods, e.g., [13,23], which use a multivariate Gaussian distribution to generate the population around the current parameter value $\theta$. Instead of maximizing $J(\theta)$ directly, ES considers the Gaussian-smoothed objective
$J_\sigma(\theta) := \dfrac{1}{(2\pi)^{d/2}} \displaystyle\int_{\mathbb{R}^d} J(\theta + \sigma\circ u)\, e^{-\frac{1}{2}\|u\|_2^2}\, du = \mathbb{E}_{u\sim\mathcal{N}(0,\,I_d)}\!\left[J(\theta + \sigma\circ u)\right],$  (4)
where $\sigma$ denotes the smoothing radius, $\circ$ is the element-wise product, and $I_d$ is the $d\times d$ identity matrix. ES then solves the smoothed problem
$\max_{\theta\in\mathbb{R}^d} J_\sigma(\theta),$  (5)
where the gradient of the smoothed objective $J_\sigma(\theta)$ can be written as
$\nabla J_\sigma(\theta) = \dfrac{1}{\|\sigma\|_2^{d/2}}\, \mathbb{E}_{u\sim\mathcal{N}(0,\,I_d)}\!\left[J(\theta + \sigma\circ u)\, u\right].$  (6)
The standard ES method [23] uses Monte Carlo (MC) sampling to estimate the gradient $\nabla J_\sigma(\theta)$ in Eq. (6) and updates the parameters by
$\theta_{n+1} = \theta_n + \dfrac{\lambda}{M\sigma}\sum_{m=1}^{M} J(\theta_n + \sigma\circ u_m)\, u_m,$  (7)
where $\lambda > 0$ is the learning rate, $M$ is the number of samples, and $u_m\sim\mathcal{N}(0, I_d)$, $m=1,\ldots,M$, are independent Gaussian samples.
One drawback of the ES method and its variants is the slow convergence of the training process, due to the low accuracy of the MC-based gradient estimator (see [3], and also [33], for extended discussions on the accuracy of gradient approximations using Eq. (7) and related methods). On the other hand, the evaluations of $J(\theta_n + \sigma\circ u_m)$, $m=1,\ldots,M$, are independent of each other and can be carried out entirely in parallel, which is what makes ES-type methods highly scalable.
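For reference, a minimal sketch of the Monte Carlo ES update in Eq. (7) is given below. This is not the implementation of [23]; the objective callable, sample count, smoothing radius, and learning rate are placeholder assumptions of this example.

```python
import numpy as np

def es_step(J, theta, sigma=0.1, M=40, lr=0.01, rng=np.random.default_rng(0)):
    """One Monte Carlo ES ascent step (Eq. (7)) for maximizing a blackbox objective J.

    The M evaluations of J are independent and could be farmed out to
    parallel workers; here they are computed serially for clarity.
    """
    d = theta.size
    u = rng.standard_normal((M, d))                          # u_m ~ N(0, I_d)
    rewards = np.array([J(theta + sigma * um) for um in u])  # blackbox evaluations
    grad_est = (rewards[:, None] * u).mean(axis=0) / sigma   # MC estimate of grad J_sigma
    return theta + lr * grad_est                             # gradient ascent update
```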
This section introduces our main framework. We start by introducing in Section 3.1 the DGS gradient operator and its approximation using the Gauss-Hermite quadrature rule. In Section 3.2, we describe how to incorporate the DGS gradient operator into the ES for reinforcement learning.
For a given direction $\xi\in\mathbb{R}^d$ with $\|\xi\|_2 = 1$, we define the restriction of $J(\theta)$ to the one-dimensional line that passes through $\theta$ along $\xi$ as
$G(y\,|\,\theta,\xi) = J(\theta + y\,\xi), \qquad y\in\mathbb{R},$  (8)
where $y$ is the one-dimensional coordinate along the direction $\xi$. The one-dimensional Gaussian smoothing of $G$ with smoothing radius $\sigma > 0$ is defined by
$G_\sigma(y\,|\,\theta,\xi) := \mathbb{E}_{v\sim\mathcal{N}(0,1)}\!\left[G(y + \sigma v\,|\,\theta,\xi)\right],$  (9)
which is also the Gaussian smoothing of $J$ along the direction $\xi$ in the neighborhood of $\theta$. The derivative of the smoothed function $G_\sigma$ with respect to $y$ at $y = 0$ can be expressed as
$\mathcal{D}\!\left[G_\sigma(0\,|\,\theta,\xi)\right] = \dfrac{1}{\sigma}\, \mathbb{E}_{v\sim\mathcal{N}(0,1)}\!\left[G(\sigma v\,|\,\theta,\xi)\, v\right],$  (10)
where $\mathcal{D}[\,\cdot\,]$ denotes the differential operator with respect to $y$, so that Eq. (10) involves only a one-dimensional expectation along the direction $\xi$.
We can assemble a new gradient, i.e., the DGS gradient, by putting together the derivatives in Eq. (10) along $d$ orthonormal directions $\xi_1,\ldots,\xi_d$, i.e.,
$\nabla_{\sigma,\Xi}[J](\theta) = \Xi^\top \begin{bmatrix} \mathcal{D}\!\left[G_{\sigma_1}(0\,|\,\theta,\xi_1)\right] \\ \vdots \\ \mathcal{D}\!\left[G_{\sigma_d}(0\,|\,\theta,\xi_d)\right] \end{bmatrix},$  (11)
where $\Xi := (\xi_1,\ldots,\xi_d)^\top$ is the orthonormal matrix whose rows are the smoothing directions and $\sigma = (\sigma_1,\ldots,\sigma_d)$ collects the (possibly different) smoothing radii along those directions. It is important to note that, in general,
$\nabla J_\sigma(\theta) \neq \nabla_{\sigma,\Xi}[J](\theta)$
for any finite smoothing radius, i.e., the DGS gradient does not coincide with the gradient of the Gaussian-smoothed objective in Eq. (4). Nevertheless, the two quantities agree in the limit of vanishing smoothing, namely
$\lim_{\sigma\to 0}\left|\nabla J_\sigma(\theta) - \nabla_{\sigma,\Xi}[J](\theta)\right| = 0,$  (12)
for any fixed $\theta$ and orthonormal matrix $\Xi$, so that the DGS gradient still provides a reliable ascent direction for solving Eq. (3).
Since each component of the DGS gradient in Eq. (11) involves only a one-dimensional integral, it can be approximated to high accuracy by the Gauss-Hermite (GH) quadrature rule [1,22], i.e.,
$\widetilde{\mathcal{D}}^M\!\left[G_\sigma(0\,|\,\theta,\xi)\right] := \dfrac{1}{\sqrt{\pi}\,\sigma}\sum_{m=1}^{M} w_m\, G\!\left(\sqrt{2}\,\sigma v_m\,|\,\theta,\xi\right)\sqrt{2}\, v_m,$  (13)
where $M$ is the number of quadrature points, $\{v_m\}_{m=1}^{M}$ are the roots of the Hermite polynomial $H_M(v)$ of degree $M$, and $\{w_m\}_{m=1}^{M}$ are the corresponding quadrature weights, given by
$w_m = \dfrac{2^{M+1}\, M!\, \sqrt{\pi}}{[H_M'(v_m)]^2}, \qquad m = 1,\ldots,M,$  (14)
$H_M(v) = (-1)^M e^{v^2}\,\dfrac{d^M}{dv^M}\!\left(e^{-v^2}\right),$  (15)
and the error of approximating $\mathcal{D}[G_\sigma]$ by the $M$-point GH rule in Eq. (13) can be bounded as
$\left|\widetilde{\mathcal{D}}^M[G_\sigma] - \mathcal{D}[G_\sigma]\right| \le C_0\, \dfrac{M!\,\sqrt{\pi}}{2^M\,(2M)!}\, \sigma^{2M-1},$  (16)
where $C_0 > 0$ is a constant depending on the smoothness of $G$ but independent of $\sigma$. In other words, for smooth objectives the GH quadrature error decays much faster in the number of function evaluations $M$ than the $\mathcal{O}(M^{-1/2})$ error of Monte Carlo estimators.
Applying the GH quadrature rule in Eq. (13) to each component of Eq. (11), we obtain the following estimator of the DGS gradient:
$\widetilde{\nabla}^M_{\sigma,\Xi}[J](\theta) := \Xi^\top \begin{bmatrix} \widetilde{\mathcal{D}}^M\!\left[G_{\sigma_1}(0\,|\,\theta,\xi_1)\right] \\ \vdots \\ \widetilde{\mathcal{D}}^M\!\left[G_{\sigma_d}(0\,|\,\theta,\xi_d)\right] \end{bmatrix},$  (17)
which requires a total of $M\times d$ evaluations of the objective function $J(\theta)$ per iteration, all of which are independent and can be carried out in parallel.
On the other hand, as the quadrature weights and nodes in Eq. (13) are deterministic and known in closed form, the DGS gradient estimator does not introduce the statistical sampling error that affects MC-based estimators such as the one in Eq. (7).
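A compact sketch of the DGS gradient estimator in Eqs. (13) and (17) is shown below, using NumPy's Gauss-Hermite rule to supply the nodes and weights in Eqs. (14)-(15). The function and parameter names are illustrative; generating $\Xi$ as a random orthonormal matrix via a QR factorization is included only as one possible choice, an assumption of this example.

```python
import numpy as np

def dgs_gradient(J, theta, Xi, sigma, M=7):
    """DGS gradient estimate, Eqs. (13) and (17).

    J     : callable, the blackbox objective (e.g., expected total reward)
    theta : current parameter vector, shape (d,)
    Xi    : orthonormal matrix whose rows xi_1, ..., xi_d are the smoothing directions
    sigma : per-direction smoothing radii, shape (d,)
    M     : number of Gauss-Hermite quadrature points per direction
    """
    d = theta.size
    v, w = np.polynomial.hermite.hermgauss(M)    # GH nodes/weights for weight exp(-v^2)
    D = np.zeros(d)
    for i in range(d):
        xi, s = Xi[i], sigma[i]
        # Eq. (13): (1/(sqrt(pi)*s)) * sum_m w_m * G(sqrt(2)*s*v_m | theta, xi) * sqrt(2)*v_m
        G_vals = np.array([J(theta + np.sqrt(2.0) * s * vm * xi) for vm in v])
        D[i] = np.sum(w * G_vals * np.sqrt(2.0) * v) / (np.sqrt(np.pi) * s)
    return Xi.T @ D                               # Eq. (17): assemble the full DGS gradient

def random_orthonormal(d, rng=np.random.default_rng(0)):
    """One way to generate an orthonormal matrix Xi: QR of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q
```

The $M\times d$ evaluations of $J$ inside the double loop are mutually independent, which is what makes the estimator embarrassingly parallel.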
We suggest setting the hyperparameters in Algorithm 1 as follows:
Algorithm 1: The DGS-ES method for reinforcement learning
1: hyperparameters: number of GH quadrature points $M$, smoothing radii $\sigma$, learning rate $\lambda$
2: Input: the initial policy parameters $\theta_0$ and the number of parallel workers
3: Output: the final parameter value
4: Initialize the policy $\pi_{\theta_0}$
5: Set the initial orthonormal matrix $\Xi$ and smoothing radii $\sigma$
6: Broadcast $\theta_0$, $\Xi$ and $\sigma$ to all workers
7: Divide the total GH quadrature points into groups assigned to the workers
8: for each training iteration $n$ do
9: Each worker runs the environment simulations at its assigned quadrature points
10: Each worker sends the collected returns back to the master
11: for $i = 1,\ldots,d$ do
12: Compute $\widetilde{\mathcal{D}}^M[G_{\sigma_i}(0\,|\,\theta_n,\xi_i)]$ using Eq. (13)
13: end for
14: Assemble the DGS gradient $\widetilde{\nabla}^M_{\sigma,\Xi}[J](\theta_n)$ using Eq. (17)
15: Update $\theta_{n+1}$ with the DGS gradient (e.g., via Adam)
16: if the restart criterion is met (e.g., the return fails to improve) then
17: Generate a new random orthonormal matrix $\Xi$
18: Generate new smoothing radii $\sigma$
19: end if
20: Broadcast $\theta_{n+1}$, $\Xi$ and $\sigma$ to all workers
21: Each worker updates the policy to $\pi_{\theta_{n+1}}$
22: end for
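Assuming the helpers sketched earlier (dgs_gradient, random_orthonormal, and an estimate_return-style objective), a serial version of the loop in Algorithm 1 might look as follows. The restart trigger, radius range, learning rate, and plain gradient-ascent update are illustrative assumptions; the distributed version replaces the inner evaluations with parallel workers and, in our experiments, uses Adam for the parameter update.

```python
import numpy as np

def dgs_es_train(J, theta0, n_iters=200, M=7, lr=0.01,
                 radius_range=(0.5, 2.0), patience=5, rng=np.random.default_rng(0)):
    """Serial sketch of the DGS-ES training loop (maximizing J)."""
    d = theta0.size
    theta = theta0.copy()
    Xi = random_orthonormal(d, rng)                  # initial smoothing directions
    sigma = rng.uniform(*radius_range, size=d)       # initial smoothing radii
    best, stall = -np.inf, 0
    for _ in range(n_iters):
        grad = dgs_gradient(J, theta, Xi, sigma, M)  # d*M evaluations, parallelizable
        theta = theta + lr * grad                    # plain ascent; Adam is used in practice
        ret = J(theta)
        if ret > best:
            best, stall = ret, 0
        else:
            stall += 1
        if stall >= patience:                        # restart criterion: refresh Xi and sigma
            Xi = random_orthonormal(d, rng)
            sigma = rng.uniform(*radius_range, size=d)
            stall = 0
    return theta
```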
We evaluate the DGS-ES method using two sets of problems: (a) classic high-dimensional benchmark functions to test the performance of DGS-ES in solving high-dimensional multi-modal optimization problems, and (b) several reinforcement learning benchmark problems.
Here we investigate the performance of DGS-ES on four high-dimensional benchmark functions, namely the Sphere, Ackley, Rastrigin and Lévy functions, whose definitions are given below.
● The Sphere function
$F_1(\mathbf{x}) = \sum_{i=1}^{d} x_i^2,$
where $\mathbf{x} = (x_1,\ldots,x_d)$ and $d$ is the dimension; the function is convex with a unique global minimum at the origin.
● The Ackley function
$F_2(\mathbf{x}) = -a\exp\!\Big(-b\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} x_i^2}\Big) - \exp\!\Big(\tfrac{1}{d}\sum_{i=1}^{d}\cos(c\,x_i)\Big) + a + \exp(1),$
where we use the standard parameter values $a = 20$, $b = 0.2$ and $c = 2\pi$; the function has a nearly flat outer region with many small local minima and a global minimum at the origin.
● The Rastrigin function
$F_3(\mathbf{x}) = 10\,d + \sum_{i=1}^{d}\big[x_i^2 - 10\cos(2\pi x_i)\big],$  (18)
where $d$ is the dimension; the function is highly multimodal, with a large number of regularly distributed local minima and a global minimum at the origin.
● The Lévy function
$F_4(\mathbf{x}) = \sin^2(\pi w_1) + \sum_{i=1}^{d-1}(w_i - 1)^2\big[1 + 10\sin^2(\pi w_i + 1)\big] + (w_d - 1)^2\big[1 + \sin^2(2\pi w_d)\big],$
where $w_i = 1 + (x_i - 1)/4$ for $i = 1,\ldots,d$; the global minimum is attained at $x_i = 1$ for all $i$. (Reference implementations of these four functions are sketched after this list.)
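For completeness, minimal NumPy implementations of the four benchmark functions, with the standard parameter choices noted above, are sketched below; they are provided for illustration only and are not the authors' test harness.

```python
import numpy as np

def sphere(x):
    return np.sum(x**2)

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    d = x.size
    return (-a * np.exp(-b * np.sqrt(np.sum(x**2) / d))
            - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

def rastrigin(x):
    return 10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))

def levy(x):
    w = 1.0 + (x - 1.0) / 4.0
    return (np.sin(np.pi * w[0])**2
            + np.sum((w[:-1] - 1.0)**2 * (1.0 + 10.0 * np.sin(np.pi * w[:-1] + 1.0)**2))
            + (w[-1] - 1.0)**2 * (1.0 + np.sin(2.0 * np.pi * w[-1])**2))
```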
We compare the DGS-ES method with the following blackbox optimization methods: (a) the Evolution Strategy (ES) in [23]; (b) the covariance matrix adaptation evolution strategy (CMA-ES) in [12] (we used the implementation of CMA-ES in the pycma open-source code from https://github.com/CMA-ES/pycma); and (c) the BFGS method in the SciPy library.
Figure 2 shows the scalability of the DGS-ES algorithm with respect to the dimension of the objective function. We perform 20 independent trials for each method to show the statistical performance. For the Sphere function, the convergence rate does not change when we increase the dimension from 10 to 1000. This property empirically carries over to the non-convex Lévy and Rastrigin functions; the reason is that both functions have a globally near-convex structure once their local minima are smoothed out. In contrast, the Ackley function is highly non-convex: a large part of its surface is almost flat apart from small local minima, so it takes many iterations for DGS-ES to find the global minimum hidden in the middle of the flat region.
Figure 3 compares the DGS-ES with the baselines on the four benchmark functions from Figure 2 in a 2000-dimensional space. For the Sphere function, BFGS is the clear winner, owing to its efficiency on smooth convex problems; among the three ES-type methods, the DGS-ES performs much better than its competitors. For the other three functions, the DGS-ES method shows significant advantages over all the other methods.
To evaluate the DGS-ES algorithm, we test its performance on two classes of reinforcement learning environments: three classical control problems from OpenAI Gym (https://github.com/openai/gym) [4] and three continuous control tasks simulated with the open-source PyBullet library (version 2.6.5, https://pybullet.org/) [8]. Within OpenAI Gym, we demonstrate the proposed approach on three benchmark examples: CartPole-v0 (discrete), MountainCarContinuous-v0 (continuous) and Pendulum-v0 (continuous). The maximum numbers of time steps for these examples are 200, 999 and 200, respectively. More details about the environment and reward settings can be found in [4]. We also examine the DGS-ES algorithm on three challenging continuous robotic control problems in the PyBullet library, namely HopperBulletEnv-v0, InvertedPendulumBulletEnv-v0 and ReacherBulletEnv-v0. In these three tasks, the maximum numbers of time steps are 1000, 1000 and 150, respectively. For reproducible comparison, we employ the original environment settings from OpenAI Gym and the PyBullet library without modifying the rewards or the environments.
For our implementation of DGS-ES, we define the policies as two-layer feed-forward neural networks with 16 hidden nodes and tanh activation functions. For the gradient-based update, we use Adam to adaptively update the network parameters with a fixed learning rate.
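The policy architecture described above admits a direct PyTorch definition. The sketch below is an illustrative reading of that description (interpreted here as two hidden layers of 16 tanh units); the observation/action sizes and the learning rate are placeholder assumptions, since the exact values depend on the environment.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Feed-forward policy with 16-unit tanh hidden layers."""
    def __init__(self, obs_dim, act_dim, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

# DGS-ES perturbs the flattened parameter vector theta; PyTorch utilities
# convert between the vector and the network parameters.
policy = Policy(obs_dim=3, act_dim=1)     # e.g., Pendulum-v0 sizes (placeholder)
theta = torch.nn.utils.parameters_to_vector(policy.parameters())
torch.nn.utils.vector_to_parameters(theta, policy.parameters())
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)  # learning rate is a placeholder
```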
The DGS-ES algorithm is particularly amenable to parallelization, since it only needs to communicate scalars, allowing it to scale to a large number of parallel workers. We apply a distributed implementation of Algorithm 1 to the reinforcement learning tasks. The distributed DGS-ES is implemented using PyTorch [21] combined with Ray [19] (https://github.com/ray-project/ray); it does not rely on a special networking setup and has been tested on large-scale high-performance computing facilities with thousands of computing nodes/workers.
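A minimal illustration of how the $M\times d$ objective evaluations per iteration can be distributed with Ray is sketched below. The worker granularity and the build_policy/estimate_return helpers are placeholders for illustration, not the exact implementation used in our experiments.

```python
import numpy as np
import ray

ray.init()  # connects to an existing Ray cluster if one is configured

@ray.remote
def rollout_return(theta, perturbation):
    """Evaluate the objective at a perturbed parameter vector on a remote worker."""
    # build_policy and estimate_return are placeholders for the policy
    # construction and return-estimation routines sketched earlier.
    policy = build_policy(theta + perturbation)
    return estimate_return(policy)

def parallel_evaluations(theta, perturbations):
    """Launch all evaluations in parallel and gather the scalar returns."""
    futures = [rollout_return.remote(theta, p) for p in perturbations]
    return np.array(ray.get(futures))
```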
Comparison metric. As the motivation of this work is to accelerate the time-to-solution of reinforcement learning under the assumption that sufficient distributed computing resources are available, we use a different metric to evaluate the performance of DGS-ES and the baselines. Specifically, we are interested in the average return as a function of the number of training iterations, rather than as a function of the total number of environment interactions (i.e., the sample complexity).
Baseline methods. We compare Algorithm 1 against several RL baselines, including ES, PPO and TRPO, as well as state-of-the-art algorithms such as ASEBO, DDPG and TD3. The implementations used are listed below.
● ES: The Evolution Strategy proposed in [23]. We used the implementation of ES from the open-source code https://github.com/hardmaru/estool.
● ASEBO: Adaptive ES-Active Subspaces for Blackbox Optimization, which was recently developed by [5]. We used the implementation released by the authors at https://github.com/jparkerholder/ASEBO.
● PPO: Proximal Policy Optimization in [25], which is available in OpenAI's baselines repository at https://github.com/openai/baselines [9].
● TRPO: Trust Region Policy Optimization, developed by [24]. We also used the OpenAI's baselines implementation [9].
● DDPG: Deep Deterministic Policy Gradient, proposed by [14]. We used the implementation from https://github.com/georgesung/TD3 where the benchmark DDPG in PyBullet is provided.
● TD3: Twin Delayed Deep Deterministic Policy Gradient [11], which was built upon the DDPG. The original results were reported for the MuJoCo version environments using the implementation from https://github.com/sfujim/TD3, but we used the PyBullet implementation from https://github.com/georgesung/TD3.
The hyperparameters for all algorithms above are set to match the original papers without further tuning to improve performance on the testing benchmark examples.
Comparative evaluation. Figure 4 shows the comparison results for the CartPole, Pendulum and MountainCar problems from OpenAI Gym. We compare the DGS-ES with the classical ES and the improved ASEBO method. In general, the DGS-ES method features faster convergence than the baselines. For the simplest CartPole problem, the three methods perform equally well. A discrepancy appears in the Pendulum test, where the DGS-ES method not only converges faster than the baselines but also achieves a higher average return. An even bigger discrepancy between the DGS-ES and the baselines appears in the MountainCar test. According to the guideline provided in the OpenAI Gym, the success threshold is an average return of 90. The DGS-ES method reaches this threshold within 500 iterations, while the average returns of the ES and ASEBO methods remain around zero. It is well known that the challenge of this problem is that the surface of the objective function is nearly flat over a large region of the parameter space, so that local search provides little useful gradient information; the nonlocal search direction provided by the DGS gradient explains the superior performance of DGS-ES in this test.
Figure 5 shows the comparison results for the Hopper-v0, InvertedPendulum-v0 and Reacher-v0 problems from the PyBullet library. We compare the DGS-ES with six baselines: ES, ASEBO, PPO, TRPO, DDPG and TD3. As expected, the DGS-ES method shows better performance in terms of convergence speed. For the Hopper-v0 problem, the DGS-ES achieves the highest return of all the methods within 400 iterations. It is worth pointing out that some of the baselines will eventually catch up with the DGS-ES given a sufficiently large number of iterations. For example, DDPG and TD3 do not provide much improvement within 400 iterations, but DDPG could reach an average return of 1650 with over 3000 iterations, and TD3 could reach even higher, around 2200, with over 6000 iterations, according to the baselines provided by OpenAI [9] and [11]. This phenomenon illustrates the fast-convergence feature of DGS-ES. For the InvertedPendulum-v0 problem, DGS-ES achieves the maximum return of 1000 (the default value in the PyBullet library) within about 30 iterations. In comparison, ES and ASEBO can reach the maximum return but require more iterations than DGS-ES. According to the benchmark results for PyBullet environments in [11], DDPG and TD3 cannot converge to the maximum return even with a large number of iterations. For the Reacher-v0 problem, DGS-ES and ASEBO are still the top performers, and the advantage of DGS-ES is, again, faster convergence.
Figure 6 illustrates the effect of the smoothing radius $\sigma$ on the performance of the DGS-ES method.
Now we demonstrate the DGS-ES method on a real-world stiffened shell design problem. Due to its high strength and stiffness, the hierarchical stiffened shell has been widely used in aerospace engineering [31,32]. However, it is challenging to fully exploit its optimal buckling load-carrying capacity. The goal here is to improve the load-carrying capacity by optimizing the representative unit cell of the hierarchical stiffened shell, where the inputs are 9 size variables (widths, heights and thicknesses) of the major and minor stiffeners, as shown in Figure 7(a), and the output is the load-carrying capability factor, which is computed by a high-fidelity numerical simulation, e.g., the finite element method (FEM); see Figure 7(b). Although the high-fidelity FEM simulation is time-consuming, many open-source codes and commercial software packages have improved the computational efficiency via scalable parallelism and GPU acceleration. It is thus feasible to apply the DGS-ES method by running the FEM simulations in parallel on supercomputers. Figure 7 also presents the comparison between DGS-ES and the other blackbox optimization methods: DGS-ES outperforms all the other algorithms and converges faster, within only 100 iterations, and it is also more robust than the alternatives.
Despite the successful demonstrations shown in Section 4, there are several limitations of the DGS-ES method for reinforcement learning. First, it requires a sufficiently powerful cluster or other distributed computing resources to show superior performance. Even though more and more parallel environments for complex RL tasks are becoming available, those parallel codes might not be compatible with a cluster/supercomputer of a specific architecture, so some extra effort may be required to modify the environment codes in order to exploit the advantage of the DGS-ES approach. Second, asynchronization between different environment simulations may drag down the overall performance. In a distributed computing system, all the parallel workers receive the same number of environment simulations; however, as different parameter values may lead to different termination times of the simulations, there is a potential waste of computing resources due to such asynchronization. Thus, a better scheduling algorithm is needed to further improve the performance of DGS-ES in RL. Third, even though the performance of the DGS-ES is not very sensitive to the hyperparameters, especially the smoothing radius $\sigma$, a principled strategy for selecting and adapting the hyperparameters during training is still lacking and is left for future work.
This material was based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under contract and award numbers ERKJ369, by the DOE SciDac FastMath project, and by the Artificial Intelligence Initiative at the Oak Ridge National Laboratory (ORNL). ORNL is operated by UT-Battelle, LLC., for the U.S. Department of Energy under Contract DE-AC05-00OR22725.
[1] | M. Abramowitz and I. Stegun (eds.), Handbook of Mathematical Functions, Dover, New York, 1972. |
[2] | M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel and W. Zaremba, Hindsight experience replay, in Advances in Neural Information Processing Systems, (2017), 5048–5058. |
[3] | A. S. Berahas, L. Cao, K. Choromanski and K. Scheinberg, A theoretical and empirical comparison of gradient approximations in derivative-free optimization, arXiv: 1905.01332. |
[4] | G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang and W. Zaremba, OpenAI Gym, arXiv preprint, arXiv: 1606.01540. |
[5] | K. Choromanski, A. Pacchiano, J. Parker-Holder and Y. Tang, From complexity to simplicity: Adaptive ES-active subspaces for blackbox optimization, NeurIPS. |
[6] | K. Choromanski, A. Pacchiano, J. Parker-Holder and Y. Tang, Provably robust blackbox optimization for reinforcement learning, arXiv: 1903.02993. |
[7] | E. Conti, V. Madhavan, F. P. Such, J. Lehman, K. O. Stanley and J. Clune, Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents, NIPS. |
[8] | E. Coumans and Y. Bai, Pybullet, a python module for physics simulation for games, robotics and machine learning, GitHub Repository. |
[9] | P. Dhariwal, C. Hesse, O. Klimov, A. Nichol, M. Plappert, A. Radford, J. Schulman, S. Sidor, Y. Wu and P. Zhokhov, OpenAI Baselines, https://github.com/openai/baselines, 2017. |
[10] | A. D. Flaxman, A. T. Kalai and H. B. McMahan, Online convex optimization in the bandit setting: Gradient descent without a gradient, Proceedings of the 16th Annual ACM-SIAM symposium on Discrete Algorithms, 385–394, ACM, New York, (2005). |
[11] | S. Fujimoto, H. Van Hoof and D. Meger, Addressing function approximation error in actor-critic methods, arXiv preprint, arXiv: 1802.09477. |
[12] | N. Hansen, The CMA evolution strategy: A comparing review, in Towards a New Evolutionary Computation, Springer, 192 (2006), 75–102. doi: 10.1007/3-540-32494-1_4 |
[13] | N. Hansen and A. Ostermeier, Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation, 9 (2001), 159–195. |
[14] | T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver and D. Wierstra, Continuous control with deep reinforcement learning, ICLR. |
[15] | N. Maheswaranathan, L. Metz, G. Tucker, D. Choi and J. Sohl-Dickstein, Guided evolutionary strategies: Augmenting random search with surrogate gradients, Proceedings of the 36th International Conference on Machine Learning. |
[16] | F. Meier, A. Mujika, M. M. Gauy and A. Steger, Improving gradient estimation in evolutionary strategies with past descent directions, Optimization Foundations for Reinforcement Learning Workshop at NeurIPS. |
[17] | V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver and K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, ICML, 1928–1937. |
[18] | V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature, 518 (2015), 529–533. doi: 10.1038/nature14236 |
[19] | P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al., Ray: A distributed framework for emerging AI applications, in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), (2018), 561–577. |
[20] | Y. Nesterov and V. Spokoiny, Random gradient-free minimization of convex functions, Found. Comput. Math., 17 (2017), 527–566. |
[21] | A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga and A. Lerer, Automatic differentiation in PyTorch. |
[22] | A. Quarteroni, R. Sacco and F. Saleri, Numerical Mathematics, Texts in Applied Mathematics, 37, Springer-Verlag, Berlin, 2007. doi: 10.1007/b98885 |
[23] | T. Salimans, J. Ho, X. Chen, S. Sidor and I. Sutskever, Evolution strategies as a scalable alternative to reinforcement learning, arXiv preprint, arXiv: 1703.03864. |
[24] | J. Schulman, S. Levine, P. Abbeel, M. I. Jordan and P. Moritz, Trust region policy optimization, ICML, 1889–1897. |
[25] | J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov, Proximal policy optimization algorithms, arXiv preprint, arXiv: 1707.06347. |
[26] | F. Sehnke, C. Osendorfer, T. Rückstieß, A. Graves, J. Peters and J. Schmidhuber, Parameter-exploring policy gradients, Neural Networks, 23 (2010), 551–559. |
[27] | F. Stulp and O. Sigaud, Robot skill learning: From reinforcement learning to evolution strategies, Paladyn Journal of Behavioral Robotics, 4 (2013), 49–61. |
[28] | D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. v. d. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, et al., Mastering the game of Go with deep neural networks and tree search, Nature, 529 (2016), 484–489. doi: 10.1038/nature16961 |
[29] | F. P. Such, V. Madhavan, E. Conti, J. Lehman, K. O. Stanley and J. Clune, Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning, arXiv preprint, arXiv: 1712.06567. |
[30] | R. S. Sutton and A. G. Barto (eds.), Reinforcement Learning: An introduction, Second edition. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2018. |
[31] | Fast buckling load numerical prediction method for imperfect shells under axial compression based on POD and vibration correlation technique, Composite Structures, 252 (2020), 112721. |
[32] | K. Tian, Z. Li, L. Huang, K. Du, L. Jiang and B. Wang, Enhanced variable-fidelity surrogate-based optimization framework by Gaussian process regression and fuzzy clustering, Comput. Methods Appl. Mech. Engrg., 366 (2020), 113045, 19 pp. doi: 10.1016/j.cma.2020.113045 |
[33] | J. Zhang, H. Tran, D. Lu and G. Zhang, Enabling long-range exploration in minimization of multimodal functions, Proceedings of 37th on Uncertainty in Artificial Intelligence (UAI). |