Social Networking of Quasi-Species Consortia drive Virolution via Persistence

Luis P. Villarreal; Guenther Witzany; Luis P. Villarreal; Guenther Witzany

doi:10.3934/microbiol.2021010

AIMS Microbiology

2021, Volume 7, Issue 2: 138-162. doi: 10.3934/microbiol.2021010

Previous Article Next Article

Review

Social Networking of Quasi-Species Consortia drive Virolution via Persistence

Luis P. Villarreal ¹,
Guenther Witzany ^{2
,
,}

1..
Center for Virus Research, University of California, Irvine, USA
2..
Telos-Philosophische Praxis, Buermoos, AUSTRIA

Received: 22 March 2021 Accepted: 25 April 2021 Published: 30 April 2021

The emergence of cooperative quasi-species consortia (QS-C) thinking from the more accepted quasispecies equations of Manfred Eigen, provides a conceptual foundation from which concerted action of RNA agents can now be understood. As group membership becomes a basic criteria for the emergence of living systems, we also start to understand why the history and context of social RNA networks become crucial for survival and function. History and context of social RNA networks also lead to the emergence of a natural genetic code. Indeed, this QS-C thinking can also provide us with a transition point between the chemical world of RNA replicators and the living world of RNA agents that actively differentiate self from non-self and generate group identity with membership roles. Importantly the social force of a consortia to solve complex, multilevel problems also depend on using opposing and minority functions. The consortial action of social networks of RNA stem-loops subsequently lead to the evolution of cellular organisms representing a tree of life.
- virolution,
- quasispecies consortia,
- virus persistence,
- membership roles
Citation: Luis P. Villarreal, Guenther Witzany. Social Networking of Quasi-Species Consortia drive Virolution via Persistence[J]. AIMS Microbiology, 2021, 7(2): 138-162. doi: 10.3934/microbiol.2021010

Abstract

The emergence of cooperative quasi-species consortia (QS-C) thinking from the more accepted quasispecies equations of Manfred Eigen, provides a conceptual foundation from which concerted action of RNA agents can now be understood. As group membership becomes a basic criteria for the emergence of living systems, we also start to understand why the history and context of social RNA networks become crucial for survival and function. History and context of social RNA networks also lead to the emergence of a natural genetic code. Indeed, this QS-C thinking can also provide us with a transition point between the chemical world of RNA replicators and the living world of RNA agents that actively differentiate self from non-self and generate group identity with membership roles. Importantly the social force of a consortia to solve complex, multilevel problems also depend on using opposing and minority functions. The consortial action of social networks of RNA stem-loops subsequently lead to the evolution of cellular organisms representing a tree of life.

1. Introduction

Machine learning (ML) is a discipline of artificial intelligence (AI) that uses the theory of statistics in building mathematical models to program computers to "learn" or "discover" algorithms, where an algorithm is a sequence of instructions for solving a problem or performing a computation, to optimize the performance criteria for given tasks using example data or past experiences ^[1]. ML has found applications across various domains, such as image recognition ^[2], healthcare ^[3], natural language processing ^[4], games and strategy ^[5], etc. In ML models, parameters fall into two categories: model parameters (e.g., weights and biases of the connections between neurons of a neural network, centroids in k-means clustering) and hyperparameters (e.g., learning rate, number of hidden layers of a neural network, number of clusters in k-means clustering). While model parameters can be learned and adapted during the training process based on the training dataset, hyperparameters serve as external configuration variables that are to be determined before the training commences. The tuning of hyperparameters has played an essential role both from the methodological perspective, e.g., deep neural networks and shallow models, as well as from the application aspect, with models derived from computer vision, natural language models, speech, etc. Though it seems trivial at first glance that hyperparameters can be tuned by humans, in modern systems, manual tuning can be ineffective and is nearly impossible when dealing with large-scale hyperparameters.

Differing from the human tuning process, which can be tedious and heavily dependent on human expertise and experience, hyperparameter optimization (HPO) aims to automatically tune the relevant hyperparameters for the system, where the performance is measured by certain metrics, e.g., classification accuracy for a classifier model could be tuned to a better point. Apart from saving human labor and boosting model performance to its fullest potential, another advantage lies in enhancing the evaluation of different models. This can be achieved by diligently fine-tuning their respective hyperparameters, even when the models differ from one another. In contrast, the traditional trial-and-error procedure by humans can be ad-hoc and less reproducible, posing challenges in the evaluation process.

In recent years, deep learning (DL) techniques have reached remarkable achievements on various AI tasks, including image classification ^[6], language modeling ^[4], and speech recognition ^[7]. However, these models are sensitive to diverse task-specific configurations, costing substantial expert efforts to redesign them through a trial-and-error process. Automated machine learning (AutoML) ^[8] has been proposed to tackle this problem in areas such as data preparation, feature engineering, model generation, and model evaluation.

Given the search space, optimization methods in model generation can be classified into two categories: HPO and architecture optimization. In this paper, we mainly focus on HPO algorithms. A generic HPO algorithm for an unknown objective function $f \colon \mathcal{X}\xrightarrow{}\mathbb{R}$ , where $\mathcal{X}\in\mathbb{R}^d$ and $d$ is the size of design space of interest, is to find the $\mathbf{x}$ , a hyperparameter configuration, e.g., learning rate and batch size that globally minimizes $f$ :

$\begin{equation} \mathbf{x}^* = \mathop{\arg\min}\limits_{\mathbf{x}\, \in\, \mathcal{X}}f(\mathbf{x}) . \end{equation}$

(1.1)

Although gradient descent ^[9] is popular in DL, which is capable of tuning the learning rate, the cases where it is possible to obtain the training gradient of hyperparameters are rare. Therefore, traditional HPO frameworks treat $f$ as a non-convex black-box function and search for optimum without leveraging derivatives. As shown in Figure 1, classical HPO algorithms cover grid search and random search ^[10], metaheuristic search ^[11,12], Bayesian optimization ^[13], etc.

Figure 1. An overview of the lines of research around the topic of HPO surveyed in this paper: classical optimization methods in Section 2, acceleration techniques in Section 3, DAC in Section 4, and MOHPO in Section 5.

DownLoad: Full-Size Img PowerPoint

Unfortunately, despite the success of liberating humans from the loop ^[14], most conventional HPO algorithms follow a sequential querying approach and therefore suffer delays of the convergence of neural network training, which is always computationally resource-intensive. For instance, the whole training process for the once-for-all (OFA) network ^[15] took 1200 GPU hours with V100 GPUs. The community has thus proposed many acceleration methods to shorten the optimization process and avoid unnecessary repetition. Another drawback of classical HPOs is that they often yield fixed configurations throughout the entire run of AI models, even though the optimum may change at different phases. To tackle this, some dynamic algorithm configuration (DAC) methods ^{[16,17,18,19]} emerge to optimize on-the-fly, i.e., dynamically tune both parameters and hyperparameter schedules during the training procedure. In practical settings, there can be more than one metric to optimize beyond the conventional prediction accuracy (cf. Eqs (1.1) and (5.1)). Additional considerations such as space/time overheads and adherence to specific business constraints can play pivotal roles. The field has thus recently witnessed a growing interest in multi-objective HPO (MOHPO), the HPO counterpart to multi-objective optimization (MOO) ^[20,21].

As a well-established field, HPO has attracted wide and growing attention over the decades from both academia and industry. There exist numerous surveys covering a range of related topics, including the specific topic of HPO itself ^{[22,23,24,25,26,27]}, some closely related topics, e.g., Bayesian optimization ^{[14,28,29,30]}, as well as topics of broader scopes such as AutoML ^[8,31,32].

Existing surveys on general HPO ^{[22,23,24,26,27]} introduce HPO algorithms by categorizing them into several major classes: grid search (GS), random search (RS), Bayesian optimization (BO) and variants, multi-fidelity or Hyperband, population- or evolutionary-based, and gradient-based. This work takes a distinctive approach (Figure 1) and considers these algorithms as (ⅰ) classical methods for HPO, i.e., simple searches, BOs, and metaheuristics, (ⅱ) techniques for acceleration, i.e., multi-fidelity, bandit-based, and early stopping, (ⅲ) DAC optimizers that are either gradient-, population-, or reinforcement learning-based, and (ⅳ) strategies for MOHPO. Notably, DAC and MOHPO algorithms in particular have been either minimally explored or entirely overlooked in ^{[22,23,24,26,27]}. A list of popular frameworks and tools for HPO is compiled and tabulated, serving as an accessible guide for individuals new to the field, students, researchers, or practitioners in search of diverse options or alternatives for their HPO tasks or workflow integration.

This survey is structured as follows. In Section 2, we provide the basics of the primary classical algorithms in HPO. Section 3 then delves into strategies for accelerating these conventional algorithms from different perspectives. This is followed by Sections 4 and 5, where we explore the increasingly recognized DAC and MOHPO algorithms, respectively. After presenting popular tools and frameworks for HPO in Section 6.1, exploring applications in Section 6.2, and providing further discussions in Section 6.3, the paper is concluded in Section 7.

2. Classical optimization methods

In this section, we introduce three kinds of optimization methods that are commonly applied in HPO. These methods have to handle two key considerations: (ⅰ) the exploration vs. exploitation trade-off, which refers to the budget spent on exploring unknown search space or on exploiting known search space, and (ⅱ) the inference vs. search trade-off, referring to the overhead used to analyze existing information to guide the search process versus the budget allocated for the search itself. Table 1 summarizes the classical optimization methods.

Table 1. A concise overview of primary classical optimization methods. The column definitions are as follows: Variable: the types of hyperparameters the method can address. Hierarchy: whether the method can handle complex hierarchy search spaces. Parallelizability: whether the method can propose more than one candidate at the same time. Certain methods, such as sequential model-based algorithm configuration (SMAC), can build the model in parallel but propose only one candidate at a time. Exploration and Exploitation: how the method explores unknown areas, exploits known areas, and manages the trade-off between them. We assume that there are

$d$ hyperparameters with each of them having

$n$ distinct values. Note that this table covers only standard algorithms, which can easily be extended. See Section 2 for more details.

Method	Variable	Hierarchy	Parallelizability	Time complexity	Exploration	Exploitation
GradS	Continuous	$\times$	$\times$	$\mathcal{O}(n^d)$	$\times$	Gradient descent
GS	Continuous, Discrete, Categorical	$\checkmark$	$\checkmark$	$\mathcal{O}(n^d)$	Grid	$\times$
RS	Continuous, Discrete, Categorical	$\checkmark$	$\checkmark$	$\mathcal{O}(n)$	Random	$\times$
BO-GP	Continuous	$\times$	$\times$	$\mathcal{O}(n^3)$	Balanced by acquisition function
SMAC	Continuous, Discrete, Categorical	$\checkmark$	$\times$	$\mathcal{O}(n\log n)$	Balanced by acquisition function
BO-TPE	Continuous, Discrete, Categorical	$\checkmark$	$\times$	$\mathcal{O}(n\log n)$	Balanced by acquisition function
GA	Continuous, Discrete, Categorical	$\times$	$\checkmark$	$\mathcal{O}(n^2)$	Crossover/Mutation	Selection
ES	Continuous	$\times$	$\checkmark$	$\mathcal{O}(n^2)$	Recombination/Mutation	Selection
$\omega$ -PSO	Continuous	$\times$	$\checkmark$	$\mathcal{O}(n\log n)$	Balanced by parameter $\omega$

| Show Table

DownLoad: CSV

2.1. Simple search scheme

2.1.1. Gradient-based search (GradS)

Gradient descent-based methods are extensively used for optimization problems including the HPO task, which often involves non-convex optimization, and a local optimum is acceptable. Hypergradients (HGs) refer to the gradients of the model selection criterion such as cross-validation performance and validation error with respect to hyperparameters. With hypergradients, gradient descent can be employed to handle a large number of hyperparameters efficiently.

Reverse-mode differentiation (RMD) ^[33], known as backpropagation in DL, has been the standard method for computing gradients with respect to parameters in ML models. It was introduced to HPO problems to compute hypergradients ^[34]. While algorithms for computing hypergradients for optimization methods, including gradient descent, were derived in Domke ^[35], their impracticality for DL models due to high memory consumption for storing all intermediate variables has been recognized. In Maclaurin et al. ^[36], reverse-mode differentiation of stochastic gradient descent (SGD) with momentum has been proposed, overcoming the problem by computing intermediate variables during the reverse pass. It is further improved in Pedregosa ^[37] with the adoption of approximate gradients. In Franceschi et al. ^[38], reverse-mode algorithms for hypergradients are explained from the perspective of Lagrangian formulation, and a forward-mode algorithm is introduced for situations involving a small number of hyperparameters. Later, Lorraine et al. ^[39] devised a scalable gradient-based HPO technique capable of handling millions of hyperparameters. Reverse-mode and forward-mode hypergradient algorithms are summarized in Algorithms 1 and 2, respectively, based on the derivations in Franceschi et al. ^[38]. In these two algorithms, $\Phi$ represents parameter optimization methods such as gradient descent, $\mathbf{w}$ denotes parameters or weights of DL models, $\mathbf{x}$ signifies hyperparameters, and $E$ is the validation objective.

Algorithm 1: Reverse-Hypergradient (credit to Franceschi et al. ^[38])
Input: Initial hyperparameter $\mathbf{x}$ , initial parameters $\mathbf{w}_0$
// Train parameters
1 for $t = 1$ to $T$ do
2 Update $\mathbf{w}_t \xleftarrow{} \Phi_t(\mathbf{w}_{t-1}, \mathbf{x})$
// Compute hypergradients reversely
3 $\Delta \mathbf{x} \xleftarrow{}0$
4 $\alpha_T\xleftarrow{}\nabla E(\mathbf{w}_T)$
5 for $t = T-1$ downto $1$ do

Output: Hypergradients $\Delta \mathbf{x}$

Algorithm 2: Forward-Hypergradient(credit to Franceschi et al. ^[38])
Input: Initial hyperparameters $\mathbf{x}$ , initial parameters $\mathbf{w}_0$
1 $\Delta \mathbf{x} \xleftarrow{} 0$
2 $\mathbf{Z}_0 \xleftarrow{} 0 \qquad \text { // } \; \; \mathbf{Z}_t = \frac{\partial \mathbf{w}_t}{\partial \mathbf{x}}$
// Train parameters and compute hypergradients forwardly
3 for $t = 1$ to $T$ do

6 $\Delta \mathbf{x} \xleftarrow{} \nabla E(\mathbf{w}_T) \mathbf{Z}_T$
Output: Hypergradients $\Delta \mathbf{x}$

2.1.2. Grid search (GS)

GS is the most basic HPO method. It requires the user to select a finite subset for each hyperparameter and then exhaustively evaluate every possible combination point to find the optimum. GS inherently supports parallel implementation but fails in efficiency once the search space is large or objective dimensions get high, as the number of combinations to evaluate grows exponentially. To tackle these challenges, a multi-scale grid approach is used in Hsu et al. ^[40], where a coarse grid is first applied to identify a good region, and then a finer grid is conducted on that region. Alternatively, direct search ^[41] queries only the neighbors around current points to update the optimum. When neither improvement nor degradation is observed in any parameter, the search step is reduced until convergence.

2.1.3. Random search (RS)

Contrary to GS, RS randomly and independently samples candidates from the search space until the defined budget is exhausted. While GS is inefficient in high dimensions, RS is less affected, benefiting from lower effective dimensionality where only a few hyperparameters have an influence on the performance ^[10]. Given the same budget, RS can explore a larger area of the effective dimensions compared to GS, leading to better performance. As shown in Figure 2, RS tends to explore a broader space than GS with a limited budget, avoiding lingering in less promising areas. Bergstra and Bengio ^[10] have proved empirically and theoretically that RS is more practical than GS while in some cases sophisticated methods may bring little advantage over RS. Additionally, RS is readily resource-allocated since it can be extended by further samples, and the probability of sampling in different regions can be adjusted manually so exploration of valuable regions can be prioritized. While RS may lead to suboptimal solutions, its performance can be arbitrarily close to the optimum in expectation when provided with enough budgets.

Figure 2. Comparison of GS and RS. As reported in Bergstra and Bengio ^[10], it is always a few hyperparameters dominating the data. The limitations of GS stem from its inefficient exploration of influential dimensions. Figure adapted from Bergstra and Bengio ^[10].

DownLoad: Full-Size Img PowerPoint

2.2. Bayesian optimization (BO) with surrogates and acquisitions

BO ^[13] is an efficient global black-box optimization framework for expensive functions. Recently, it has gained widespread application in HPO problems and achieved state of the art results across various ML domains like image classification ^[13] and speech recognition ^[42].

BO is the most popular sequential model-based optimization (SMBO) algorithm, which has proven its superiority in optimizing expensive black-box functions ^[14] and exhibits impressive performance on even hard-to-tune hyperparameters ^[43]. Figure 3 illustrates a general sequential optimization framework that utilizes a model learned from observations to recommend promising candidates, which then gets queried to generate feedback for updating the model.

Figure 3. A general pipeline of SMBO. When

${t\le N}$ , it is the initialization stage where the observation dataset

$\mathcal{D} = \{(\mathbf{x}_t, y_t)\}$ is populated by points

$\mathbf{x}_t$ sampled randomly from the search space

$\mathcal{X}$ . When

$t > N$ , candidate points

$\mathbf{x}_t\in\mathcal{X}$ are selected using the acquisition function, which is guided by the posterior distribution derived from the surrogate model.

DownLoad: Full-Size Img PowerPoint

A classic BO framework comprises a probabilistic surrogate model and an acquisition function. The surrogate model approximates the unknown objective function $f$ based on the observation dataset $\mathcal{D} = \{(\mathbf{x}_t, y_t)\}_{t = 1}^N$ , where $\mathbf{x}_t \in \mathcal{X}$ and $y_t \in \mathbb{R}$ are the input and the observation value of $f$ , respectively. The prior distribution of the surrogate model captures our knowledge about $f$ and is updated with $\mathcal{D}$ to generate a posterior distribution, which provides predictions and uncertainties over the search space $\mathcal{X}$ . The acquisition function $\alpha$ utilizes the posterior distribution to guide the sequential search. Instead of observing the expensive objective function, BO optimizes the cheap acquisition function globally to generate candidates. The main property of the acquisition function is the trade-off between the exploration of areas with high uncertainty and the exploitation of areas with low predictions (for minimization task).

One common choice for the surrogate model is Gaussian process (GP) ^[44], and for the acquisition function, it is the expected improvement (EI) ^[45]. We summarize the BO algorithm with GP and EI in Algorithm 3, where $\phi$ and $\Phi$ denote the standard normal probability density function (p.d.f.) and cumulative distribution function (c.d.f.), respectively, and $f^*$ is the best-known value.

Algorithm 3: Bayesian-Optimization (credit to Shahriari et al. ^[14])
Input: Initial number $N$ , total number $T$ , noise $\delta_n$
1 Initialize $\mathcal{D} = \{(\mathbf{x}_t, y_t)\}_{t = 1}^N$ randomly
2 for $t = N+1$ to $T$ do

Output: Optimal hyperparameters $\mathbf{x}_T$

2.2.1. Surrogate models

The performance of BO significantly hinges on the choice of surrogate model. GP, random forest (RF), tree-structured Parzen estimator (TPE), and Bayesian neural network (BNN) are the commonly employed surrogate models for BO. A concise comparison of surrogates is presented in Table 2. A detailed discussion of these surrogate models follows.

Table 2. Comparison of four common BO surrogate models.

Surrogates	Time complexity	Fit type
GP	$\mathcal{O}(n^3)$	Regression
RF	$\mathcal{O}(n\log n)$	Regression
TPE	$\mathcal{O}(n\log n)$	KDE and Classification
BNN	$\mathcal{O}(n)$	Regression

| Show Table

DownLoad: CSV

• GP is a nonparametric model that is fully specified by a prior mean function $\mu_0 \colon \mathcal{X}\xrightarrow{}\mathbb{R}$ , which is usually set to a constant, and a covariance function $k \colon \mathcal{X}\times \mathcal{X}\xrightarrow{}\mathbb{R}$ . Any finite collection of GP induces a multivariate normal distribution. The marginalization properties of GP enable them to compute marginals and conditionals in closed form simply and flexibly. Given the observation dataset $\mathcal{D}_t$ at step $t$ , the posterior mean and variance functions are:

$\begin{equation} \mu_{t}(\mathbf{x}) = \mu_0(\mathbf{x}) + \mathbf{k}^{\top}_*\left[\mathbf{K}+\sigma^2_y\mathbf{I}\right]^{-1}\mathbf{y} \end{equation}$

(2.1)

$\begin{equation} \sigma_{t}^2(\mathbf{x}) = k_{**}-\mathbf{k}^{\top}_*\left[\mathbf{K}+\sigma^2_y\mathbf{I}\right]^{-1}\mathbf{k}_* \end{equation}$

(2.2)

where $\mathbf{k}_*$ denotes a vector of covariances between $\mathbf{x}$ and all points in $\mathcal{D}_t$ , $k_{**} = k(\mathbf{x}, \mathbf{x})$ denotes the variance of $\mathbf{x}$ , and $\mathbf{K}$ is the covariance matrix of all points in $\mathcal{D}_t$ . The property of GP is determined by the covariance function $k$ , with Mátern $5/2$ kernel being the most commonly used.

While BO with GP performs well on real-valued hyperparameters, it has difficulty with discrete, non-numeric, or conditional hyperparameters. Special kernels are required to address these situations ^[46]. In addition, GP scales poorly (cubically) with increasing data due to the necessity of the inversion of a dense covariance matrix. To mitigate this, some sparsification techniques have been proposed. Among them are sparse pseudo-input GPs (SPGPs) ^[47], which select a subset of the original dataset as inducing pseudoinputs to reduce the rank of the covariance matrix and compute the approximate posterior quickly. Another drawback of the standard GP is its poor scalability with dimensions, limiting the number of hyperparameters it can tune. The properties of hyperparameter space have been leveraged to design new kernels, such as cylindrical kernels ^[48] and additive kernels ^[49].

• RF, which is used in sequential model-based algorithm configuration (SMAC), models the objective function using an ensemble of regression trees ^[50]. The algorithm works as follows: $B$ regression trees are constructed with $n$ data points randomly sampled with replacement from the dataset of size $n$ . For each split point of a tree, the split criterion is chosen from a subset of size $pd$ that is randomly selected from $d$ hyperparameters, where $p$ is a split ratio that defaults to $5/6$ . When the number of data points on a node falls below a threshold $n_{min}$ , the node is set to a leaf and the leaf's prediction is set to the mean or median of the data points on it. Given a new hyperparameter configuration, these trees can produce $B$ predictions for which the mean and variance can be computed.

SMAC can handle continuous, discrete, categorical, and conditional hyperparameters naturally. The time complexities for fitting and predicting are $\mathcal{O}(n\log n)$ and $\mathcal{O}(\log n)$ , respectively. The limitation of data points on leaves and parallel training of trees can further reduce budgets, making RF suitable for larger datasets compared to GP. The subsampling of hyperparameters also helps it work on high-dimensional search spaces. However, despite RF's good predictive performance in the vicinity of training data, it exhibits poor performance far from the data. In areas with missing data, the variance can be highly erratic, ranging from very large to very small.

• TPE ^[51] uses kernel density estimation and classifies observations instead of regression. In contrast to most approaches modeling $P(y\mid\mathbf{x})$ directly, TPE considers Bayes' rule $P(y\mid\mathbf{x}) = \frac{P(\mathbf{x}\mid y)P(y)}{P(\mathbf{x})}$ and models $P(\mathbf{x}\mid y)$ and $P(y)$ . Two densities, $l(\mathbf{x})$ and $g(\mathbf{x})$ , are built over the search space $\mathcal{X}$ as follows:

$\begin{equation} P(\mathbf{x}\mid y) = \left\{\begin{aligned} l(\mathbf{x}), & \quad y < y^*\\ g(\mathbf{x}), & \quad y\geq y^* \end{aligned} \right. \qquad \end{equation}$

(2.3)

where $y^*$ is a predefined percentile determined by a threshold $\gamma$ such that $P(y < y^*) = \gamma$ . The time complexity is $\mathcal{O}(n\log n)$ . The EI acquisition function is then optimized. By construction and simplification, we have the result:

$\begin{equation} \alpha_{EI}(\mathbf{x}) \propto \left(\gamma + \dfrac{g(\mathbf{x})}{l(\mathbf{x})}(1-\gamma)\right)^{-1} \end{equation}$

(2.4)

Based on this expression, when optimizing $EI_{y^*}(\mathbf{x})$ , finding the optimum point of $\frac{g(\mathbf{x})}{l(\mathbf{x})}$ suffices and it is not necessary to model $P(y)$ . Since the Parzen estimators are organized in a tree structure, TPE can handle conditional hyperparameters naturally and outperform GP in structured HPO tasks ^[52,53].

• BNN places a distribution over neural network parameters, amalgamating the strengths of neural networks and probabilistic models ^[54]. Neural networks have strong capabilities of approximating continuous functions, extending the applicability of BO to more complex tasks. Probabilistic models can generate a complete posterior distribution on the predictions, suitable for Bayesian analysis. In Deep Networks for Global Optimization (DNGO) ^[55], the prior distribution is put on the weights of the output layer, while other parameters are learned via point estimation (typically stochastic training). BOHAMIANN ^[56] also adopts a BNN to construct the response surface, with weights sampled using a stochastic gradient Hamiltonian Monte-Carlo (SGHMC) method ^[57] to evaluate the posterior. BNN is more scalable than GP and is faster when the dataset is large ^[55].

2.2.2. Acquisition functions

Leveraging the predictive posterior, acquisition functions recommend the most promising candidate in the trade-off between the exploitation of known optima and the exploration of uncertainty.

EI is an improvement-based function ^[14]. It measures both the amount and the probability of improvement:

$\begin{equation} \alpha_{EI}(\mathbf{x}) = \mathbb{E}[\max\{(f^*-f(\mathbf{x})), 0\}] \end{equation}$

(2.5)

When $f(\mathbf{x})$ is normally distributed with mean $\mu(\mathbf{x})$ and variance $\sigma(\mathbf{x})$ , the expectation can be computed analytically as:

$\begin{equation} \alpha_{EI}(\mathbf{x}) = (f^*-\mu(\mathbf{x}))\Phi\left(\dfrac{f^*-\mu(\mathbf{x})}{\sigma(\mathbf{x})}\right) + \sigma(\mathbf{x})\phi\left(\dfrac{f^*-\mu(\mathbf{x})}{\sigma(\mathbf{x})}\right) \end{equation}$

(2.6)

where $\phi$ and $\Phi$ denote the standard normal p.d.f. and c.d.f., respectively, and $f^*$ is the best-known value.

Lower confidence bound (LCB), or upper confidence bound (UCB) for maximization problems ^[58], treats uncertainty as an additive incentive. It uses the optimum point of a fixed probability surface according to the model. In the GP case, LCB is computed as:

$\begin{equation} \alpha_{LCB}(\mathbf{x}) = \mu(\mathbf{x}) - \beta \sigma(\mathbf{x}) \end{equation}$

(2.7)

where $\beta$ is a control parameter and this function is to be minimized. Some guidelines for choosing $\beta$ have been proposed to achieve optimal regret ^[58].

Information-based policies aim to deduce the position of optimum point $\mathbf{x}^*$ by considering the posterior distribution $P(\mathbf{x^*}\mid\mathcal{D})$ . Entropy search (ES) ^[59] selects the point that maximally reduces the entropy of $P(\mathbf{x^*}\mid\mathcal{D})$ . It measures the expected information gain about the position of $\mathbf{x^*}$ :

$\begin{equation} \alpha_{ES}(\mathbf{x}) = H(\mathbf{x^*}\mid\mathcal{D})-\mathbb{E}_{y\mid\mathcal{D}, \mathbf{x}}[H(\mathbf{x}^*\mid\mathcal{D}\cup\{(\mathbf{x}, y)\})] \end{equation}$

(2.8)

where $H(\mathbf{x^*}\mid\mathcal{D})$ is the differential entropy of $P(\mathbf{x^*}\mid\mathcal{D})$ , and the expectation is over the predictive distribution of $y = f(\mathbf{x})+\delta_n$ , where $\delta_n$ is the observation noise. However, this function is intractable and is usually approximated by expensive methods such as Monte Carlo (MC) sampling, whose computation cost is quartic.

Predictive entropy search (PES) ^[60] reformulates the function of ES with the symmetric property of mutual information as:

$\begin{equation} \alpha_{PES}(\mathbf{x}) = H(y\mid\mathcal{D}, \mathbf{x})-\mathbb{E}_{\mathbf{x^*}\mid\mathcal{D}}[H(y\mid\mathcal{D}, \mathbf{x}, \mathbf{x^*})] \end{equation}$

(2.9)

where the expectation is over distribution $P(\mathbf{x^*}\mid\mathcal{D})$ . This function is approximated by expectation propagation and Thompson sampling. The computation cost is cubic, and the performance is not worse than ES empirically.

Max-value entropy search (MES) ^[61] uses the information about the optimum value $y^* = f(\mathbf{x^*})$ instead of $\mathbf{x^*}$ . The expected information gain about $y^*$ is expressed as:

$\begin{equation} \alpha_{MES}(\mathbf{x}) = H(y\mid\mathcal{D}, \mathbf{x})-\mathbb{E}_{\mathbf{y^*}\mid\mathcal{D}}[H(y\mid\mathcal{D}, \mathbf{x}, \mathbf{y^*})] \end{equation}$

(2.10)

Here, $y^*$ is sampled via Gumbel distribution or from the posterior distribution, and the expectation is approximated using MC estimation. The computation cost of MES is much lower since the distribution $P(\mathbf{x^*}\mid\mathcal{D})$ is $d$ -dimensional, while $P(y^*\mid\mathcal{D})$ is one-dimensional. Empirically, MES performs at least as well as ES and PES.

2.2.3. Recent BOs

Although optimizers are epoch efficient with little overheads and some parallel versions of algorithms are now available ^[13], two common drawbacks still exist ^[62]: First, the intrinsic observation process can be time-consuming; second, SMBO provides only fixed hyperparameters. Many methods emerge to accelerate the vanilla BO as will be discussed in Section 3. Notably, BFO (Bayesian Functional Optimization) ^[63] has been proposed to optimize hyperparameters on function spaces. 2-OPT (Two-step optimal) ^[64] enables the acquisition functions to look ahead for two steps to alleviate the shortsightedness of BO. In an attempt to scale BO to high-dimensional domains, while LineBO ^[65] decomposes iteratively the global problem into a sequence of one-dimensional sub-problems, TuRBO (trust region BO) ^[66] maintains a collection of local BO models and performs search across trust regions centered around the best solutions. Nguyen and Osborne ^[67] transform the GP to incorporate more foregone information (e.g., cases where classification accuracy is less than 100%). Meanwhile, MiVaBo (mixed-variable BO) ^[68] extends BO to optimize variables of mixed types. Several methods, including BOPrO (BO with a Prior for the Optimum) ^[69], $\pi$ BO ^[70], and PriorBand ^[71], have been proposed to incorporate expert insights on promising configurations into BO, using this prior information to guide the search.

2.3. Metaheuristic search

A metaheuristic is a generic or high-level optimization strategy or algorithmic framework designed to efficiently explore and exploit solution spaces to find approximate solutions to optimization problems ^[11]. Metaheuristic search often draws inspiration from natural phenomena, such as evolution and annealing. In general, metaheuristics make no assumptions about the objective function, and they do not rely on gradient information, enabling them to tackle non-convex, noncontinuous, and non-smooth optimization problems. According to the number of solutions they hold, metaheuristics can be categorized into single-solution-based methods and population-based methods. Population-based methods run a population of solutions in parallel and evaluate their quality using a fitness function, demanding significant computing power. Advances in computer technology and parallel architectures have facilitated the realization of many algorithms. The distinctions among population-based methods lie in how they initialize and update populations, with performance being greatly influenced by parameter settings. This section introduces two types of population-based methods: evolutionary algorithms and swarm intelligence.

2.3.1. Evolutionary algorithms (EA)

Evolutionary algorithms (EAs) ^[73] are inspired by Darwin's evolutionary theory. Generally, EAs update populations through the crossover of two ancestral individuals and mutation. Genetic algorithm (GA) ^[74] is the most commonly used method. As shown in , GA typically represents solutions as chromosomes, often in the form of a fixed-length binary string, and implements crossover and mutation by simple bit manipulation operations. Two critical parameters in GA are the probabilities of crossover and mutation. The procedures of GA are as follows: A population with $N$ individuals is initially generated by randomly initializing chromosomes. Subsequently, a fitness function, whose outcome often reflects the performance of a configuration, is applied to each individual. Based on the results, selection, crossover, and mutation are performed on the population to yield a new generation with $N$ individuals. These two steps are repeated until convergence or some conditions are met. A majority of innovations of GAs are the selection schemes for producing offspring, including roulette-wheel, tournament, and ranking selection ^[75].

Figure 4. Workflow of GAs for HPO.

DownLoad: Full-Size Img PowerPoint

In a certain context, evolution strategy (ES) ^[76] slightly differs from GA by extending the values of GA's chromosomes to the real number domain and introducing the heritable step to guide mutation. Mutation for real numbers is realized by adding noise sampled from a zero-mean normal distribution. The standard deviations for different genes significantly impact the performance of ES. They can be kept constant or adjusted dynamically based on the number of generations and performance ^[77,78]. Selection in ES involves eliminating the worst individuals to maintain a constant population size. While ES was originally designed for real numbers, it can be easily extended to other data types, such as integers ^[79]. ES prefers exploration to exploitation, so it is less prone to getting stuck in local optima.

CMA-ES (Covariance matrix adaptation evolution strategy) ^[80] is the most representative and effective ES algorithm. It dynamically adapts distribution parameters and the step by modifying the covariance matrix. Differential evolution (DE) ^[81] inherits from GA but drives the evolution through mutation based on a differential vector rather than relying on crossover. In the recent decade, EAs have enlightened optimization methods for neural architecture search (NAS) ^[8].

2.3.2. Swarm intelligence (SI)

SI algorithms are inspired by the collective behavior of biological groups, including ants, grey wolves, grasshoppers, etc. ^[82] Within these groups, each individual has limited capability, but they can jointly accomplish complex tasks through local interactions without any centralized control.

The particle swarm optimization (PSO) algorithm is arguably the most popular SI paradigm ^[83]. The vanilla PSO imitates the flocking behavior observed in bird communities to address optimization problems ^[84]. In PSO, each particle represents a potential solution to the problem and is defined by a position and a velocity. The position is initialized randomly, and the velocity starts at zero. A topology is assigned to the swarm to describe the interconnections among particles, where particles connected to a specific particle are considered its neighbors. At each step, a particle's velocity is updated based on the best positions it and its neighbors have found thus far, and the position is updated accordingly. This enables particles to search for the optimum in parallel, sharing the current best position and fitness value with one or more particles to determine their next movements. To prevent the swarm from being trapped in local optima, mutations are introduced by incorporating slight randomness into the update process.

Additionally, inertia PSO ( $\omega$ -PSO) ^[85] introduced the inertia weight $\omega$ , a positive constant or function, to balance the trade-off between exploration and exploitation. The process of standard $\omega$ -PSO is summarized in Algorithm 4 for minimization problems. $x_i$ and $v_i$ are $d$ -dimensional vectors representing the position and velocity of particle $p_i$ , respectively. $r_1$ and $r_2$ are random variables sampled from the uniform distribution in the range $[0, 1]$ . In the algorithm, each particle is the neighbor of all other particles, and thus the topology is a fully connected graph. For alternative topologies, $x_g^*$ in the update formula of $v_i$ should be replaced with the best position of $p_i$ 's neighbors.

Algorithm 4: Particle-Swarm-Optimization (credit to Houssein et al. ^[83])
Input: Particle number $N$ , total steps $T$ , learning parameters $c_1, c_2, \omega$ , fitness function $F$
1 Initialize particles $\mathcal{P} = \{p_i = [\mathbf{x}_{i}, \mathbf{v}_{i}]\}_{i = 1}^N$
// Local/global best fitness
2 Evaluate $\{f_i^* \xleftarrow{} F(\mathbf{x}_i)\}_{i = 1}^N$ , $f_g^\xleftarrow{}\min_{1\leq i\leq N} f_i^$
// Local/global best position
3 Initialize $\{\mathbf{x}_i^\xleftarrow{}\mathbf{x}_i\}_{i = 1}^N, \mathbf{x}_g^$
4 for $t = 1$ to $T$ do

Output: Optimal position $\mathbf{x}_g^*$

2.4. Summary

Table 1 at the beginning of this section provides a concise overview of classical HPO methods just discussed. GradS methods are suitable only in situations where the objective functions are differentiable and obtaining the gradients of hyperparameters is feasible (e.g., learning rate in neural networks). Since gradient provides only local information, these approaches might converge quickly to a local optimum instead of a global optimum ^[16,27]. GS and RS are conceptually simple and are the most basic HPO methods. They can both be readily implemented, and since each hyperparameter configuration evaluation is independent of each other, GS and RS possess the advantage of easier parallelization. In practice, RS is more efficient than GS for large search space and when some hyperparameters have higher importance over others in high-dimensional hyperparameter configuration space ^[10].

In contrast to the simple search schemes above that are either only exploitative (GradS) or explorative (GS and RS), BO allows the exploration of unknown search space and the exploitation of promising regions. While BO is an efficient strategy for the global optimization of expensive black-box functions, its performance fundamentally hinges on the surrogate model and the acquisition function chosen ^[13,86]. Specifically, the quality of BO using GP as the surrogate model, BO-GP, also depends on the choice of the kernel function. Compared to BOs using other surrogates, e.g., RF, TPE, and BNN, BO-GP struggles with large dataset and high dimensions and is only applicable to continuous hyperparameters. Due to the inherently sequential nature of BO algorithms, where the acquisition function guides the selection of subsequent hyperparameter sets based on the current model ^[14], it is challenging to efficiently parallelize the entire optimization process.

Evolutionary algorithms (EAs, includes GA and ES) and swarm intelligence (SI, includes PSO), owning to their population-based nature and the independence of evaluation for different individuals, support parallelization. Similar to BO, they are capable of both exploration (diversification) and exploitation (intensification). The exploration of GA and ES is primarily driven by crossover (recombination for ES ^[87]) and mutation, and the exploitation by the selection mechanism; PSO explores by the stochasticity embedded in its update rules and exploits by the convergence of individuals toward the best-known positions. These advantages come with the cost of additional considerations. In the case of GA and ES, this entails managing supplementary hyperparameters like the fitness function and crossover and mutation rates. For PSO, the challenge lies in careful parameter and topology selection, along with proper population initialization to mitigate the risk of premature convergence ^[83].

3. Acceleration techniques

Approaches addressing HPO can be distilled into two main methodological families: model-free metaheuristic methods and model-based BO methods. Despite the small overheads of meta-guiders, both conventional metaheuristics and BO methods remain computationally resource-intensive. Specifically, metaheuristic methods need to maintain a considerably large population to prevent the collapse of the solution space. On the other hand, BO algorithms suffer from an iterative waiting period due to the time-consuming observation process throughout the sequential search. To overcome these challenges, numerous methods have been proposed to accelerate the search process and optimize the allocation of computational resources.

This section aims to categorize these methods into the following groups: multi-fidelity optimization, bandit-based algorithm, and early stop. These represent prominent lines of research in the ongoing efforts to enhance the efficiency of HPO.

3.1. Multi-fidelity optimization

Multi-fidelity (MF) optimization leverages auxiliary information from various sources to reduce the total evaluation cost. In practice, conventional techniques discard intermediate information and overlook the iterative nature of modern DL algorithms. Strategies that combine information from related tasks to speed up the search efficiency also work. Both approaches revolve around finding a balance between the actual objectives, which are expensive and of high-fidelity, and the auxiliary observations, which are cheap and of low-fidelity.

3.1.1. Multi-fidelity / multi-source

MF-GP-UCB ^[88] pioneered the formalization of a multi-fidelity bandit algorithm by taking GP to establish the relations between low-fidelity approximations and final performance. Impressively, FABOLAS (fast BO on large datasets) ^[89] presents to take the subset size of training data as an auxiliary variable, accelerating the search process 10 to 100 times faster than vanilla BOs for large datasets. FABOLAS has also proposed a classical method for modeling the relation via GP using a product kernel. Simultaneously, misoKG (multi-information source optimization with a Knowledge Gradient) ^[90] describes another generative model that estimates the $i$ -th output by summing up two GPs representing the high-fidelity result and the corresponding discrepancy separately.

Furthermore, methods have emerged to improve multi-fidelity approaches by reinforcing acquisition functions. taKG ^[91] suggests a trace-aware knowledge-gradient algorithm that is provably convergent. MF-MES ^[92] incorporates MES ^[61] to enable a better exploit-explore trade-off for multi-fidelity BO without introducing extra parameters, providing an information-theoretic guarantee.

3.1.2. Multitask / warm start

MTBO (Multitask BO) ^[93] recommends applying a multitask GP via a Kronecker product kernel for a warm start, avoiding unnecessary re-exploration in familiar search space. MI-SMBO (Meta-learning-based initialization SMBO) ^[94] proposed a meta-learning-based initialization to warm start BO. Specifically, BO is initialized by well-performing configurations in similar datasets, with a set of meta-features defining the distances between datasets. ABLR (Adaptive Bayesian linear regression) ^[95] scales MTBO with BNN with a sharing neural network to learn basis features and a Bayesian layer to model the posterior for each output. Recently, WS-CMA-ES (warm starting CMA-ES) ^[96] has also proposed warm-starting the initialization of CMA-ES to address the conflict between the costly adaptation phase and limited evaluation budget.

3.2. Bandit-based algorithm

Bandit-based algorithms derived from RS have been proven to be compelling in the allocation of limited resources. The successive halving (SHA) algorithm ^[97] dynamically allocates budgets to top-performing configurations by regularly discarding the least promising half.

A notable extension of SHA is the Hyperband algorithm ^[98], a multi-armed bandit algorithm designed to terminate poorly performing configurations. Algorithm 5 summarizes the Hyperband process. Compared to SHA, which refrains from allocating resources to underperforming configurations, Hyperband takes a step further by dividing the budget into iterations. This strategy aims to strike a balance between exploration and exploitation, enhancing the algorithm's ability to navigate the search space effectively. Benefiting from the elegant simplicity and flexibility, Hyperband typically outperforms RS and vanilla BO in practice. Another adaptation made to SHA is asynchronous SHA (ASHA) ^[99], leveraging asynchrony for parallelization. The main idea is to promptly promote configurations to the next rung level, foregoing the necessity to wait for rung completion. While this decision rule may lead to unfavorable configurations being promoted, by the law of large number the impact is expected to diminish as the total number of configurations grows ^[99].

Algorithm 5: Hyperband (credit to Li et al. ^[98])
Input: budgets $[b_\text{min}, b_\text{max}]$ , $\eta$ (default = 3)
1 $s_\text{max}\xleftarrow{}\left\lfloor\log_{\eta}\frac{b_\text{max}}{b_\text{min}}\right\rfloor$
2 for $s\in\{s_{max}, s_{max}-1, \cdots, 0\}$ do

Output: best configuration

However, the convergence of Hyperband is constrained by its randomly drawn strategy, leading to underutilization of known observations. To address this, BOHB (BO and Hyperband) ^[53] proposed a Hyperband and BO hybrid, replacing the random selection of each Hyperband iteration with a modified TPE surrogate to guide the search. Notably, BOHB employs a multi-dimensional kernel density estimation (KDE) instead of the hierarchy of one-dimensional KDEs in TPE. BOHB is often regarded as the best previous off-the-shelf optimizer.

Recent developments have seen the emergence of methods building upon Hyperband and BOHB. HyperSTAR ^[100] adapts the surrogate model to specific tasks and ranks the hyperparameters by estimation in a joint dataset-hyperparameter space. DEHB (DE and Hyperband) ^[101] combines DE and Hyperband, achieving more robust performance for high-dimensional and combinatorial data than BOHB. The overheads of model-free DE operations remain constant and precede BOHB by up to one order of magnitude. In addition to Hyperband's random sampling, PriorBand ^[71] includes also prior-based and incumbent-based sampling strategies.

3.3. Early stop

Besides model-free Hyperband, many other stopping criteria are exploited to early terminate poor configurations.

3.3.1. Curve estimation

Modeling the learning curve to allocate resources and stop running individuals dynamically is in vogue. Freeze-thaw BO ^[102] extends the BO framework with a strong assumption that the training loss curve follows an exponential decay toward some value. A well-designed time kernel gets developed to support this nonstationary prior. However, Freeze-thaw BO has been proven ineffective in describing learning curves of deep networks in practice ^[103].

To enhance representational capacity, learning curve extrapolation (LCE) ^[103] suggests modeling the curves with a set of parametric model families. Various increasing and saturating functions, each of which may capture certain aspects of curves, are ensembled to describe the entire learning process. To further get rid of manually designed functions, which may introduce undue strong assumptions, Klein et al. ^[104] recommends using BNN with basis function layers for prediction. Baker et al. ^[105] incorporates sequential regression models to estimate curves, achieving notable advancements in both NAS and HPO.

3.3.2. Other criteria

Lately, BO-BOS ^[106] has been proposed to unify the BO and Bayesian optimal stop (BOS) mechanism to eliminate unnecessary queries. A significant difference between BO-BOS and multi-fidelity methods is that BO-BOS determines whether to stop in the training process. BOIL (BO for iterative learning) ^[107] inherits the advantages of both multi-fidelity BOs and curve estimation techniques by optimizing the numeric score of curves compression instead of averaged final performance. Makarova et al. ^[108] suggests an automatic termination criterion based on the discrepancy between the actual HPO objective and the BO-optimized target function.

3.4. Summary

This section has introduced three main strategies for speeding up the hyperparameter optimizing process: Multi-fidelity, bandit-based algorithms, and early stopping. Multi-fidelity optimization methods operate by leveraging auxiliary information and exploiting more cost-effective approximations of the expensive objective function. In navigating the trade-off between optimization performance and runtime, practical implementations often witness the speed improvements outweighing the errors introduced by the approximations ^[27]. These methods can either actively or adaptively determine the appropriate fidelity (or fidelities), or transfer knowledge from previous experiments or similar tasks. Some literature, e.g., Yang and Shami ^[24] and Shawi et al. ^[31], consider also bandit-based algorithms such as SHA ^[97] and Hyperband ^[98] introduced in Section 3.2 as a subset of multi-fidelity approaches. While most methods that enhance the efficiency of HPO have been predominantly proposed within the context of BO frameworks, learning curve-based prediction for early termination proposed by Domhan et al. ^[103] and Baker et al. ^[105] work on deep neural networks and are agnostic to the hyperparameter optimizer used.

4. Dynamic algorithm configuration

The above strategies have demonstrated their superiority through various acceleration tricks. However, most HPO algorithms only search for fixed configurations, while dynamic or scheduling hyperparameters can be more welcomed in practice. In this section, we broadly convey recent trends of the emerging DAC algorithms covering the following aspects: gradient-based optimizers, population-based algorithms, and reinforcement learning methods. We also explore a few miscellaneous approaches. A succinct summary of these algorithms is presented in Table 3. They are compared from several aspects, including the types of hyperparameters they can handle and their approach to managing hyperparameters and parameters during the training process.

Table 3. A concise overview of representative methods that optimize hyperparameters on the fly, i.e., dynamically tune configurations (schedule) in the training process. Column Coverage shows the types of hyperparameters suitable for each algorithm. Some algorithms are designed for specific hyperparameters like learning rate (LR), while others can accommodate most continuous hyperparameters. Exploration and Exploitation columns show how hyperparameters and parameters are treated separately. In the last column, "Keep" means the parameters remain unaltered, and "Overwrite" means former bad parameters are replaced with parameters copied from other well-performing models.

Method	Coverage	Base	Exploration	Exploitation
HD ^[109]	LR	Hypergradient	Hypergradient descent	Keep
RTHO ^[38]	Continuous	Hypergradient	Hypergradient descent	Keep
MARTHE ^[17]	LR	Hypergradient	Hypergradient descent	Keep
FSLA ^[111]	Continuous	Hypergradient	Hypergradient descent	Keep
PBT ^[62]	Continuous	PBT	Mutation / Re-sample	Overwrite
PB2 ^[18]	Continuous	PBT	Time-varying GP-bandit	Overwrite
HPM ^[112]	Continuous	PBT	Teacher+Hypergradient	Overwrite
PB2-Mix ^[113]	Mixed	PBT	Mixed GP-bandit	Overwrite
BG-PBT ^[114]	Mixed	PBT	TR-GP-BO	Overwrite
HOOF ^[117]	RL parameters	Reinforcement	Off-policy Estimate	Keep
Biedenkapp et al. ^[19]	Mixed	Reinforcement	Dynamic movement primitives	Keep
AutoLRS ^[126]	LR	Curve Estimation	Forecasting+GP-bandit	Keep
Multistage QHM ^[127]	BS+SGD parameters	Momentum	Quasi-hyperbolic momentum	Keep

| Show Table

DownLoad: CSV

4.1. Gradient-based optimizers

In Section 2, we have discussed gradient-based search and introduced both reverse-mode and forward-mode algorithms. While reverse-mode is less computationally demanding with a large number of hyperparameters, forward-mode exhibits greater memory efficiency and is suitable for real-time hyperparameter updates.

Baydin et al. ^[109] derived the hypergradient with respect to the learning rate and updated the learning rate at each iteration. The hypergradient is based on the partial derivative of one-step parameter training, and parameters from the previous time step are irrelevant to the current learning rate. SGD, SGD with Nesterov, and Adam are generalized to online update forms, termed hypergradient descent (HD) variants. HD is considered shortsighted ^[110]. RTHO (Real-time hyperparameter optimization) ^[38] modifies the forward-mode hypergradient algorithm in Algorithm 1 to a real-time update form, which is longsighted but too slow to adapt to abrupt changes in the loss surface. MARTHE (Moving Average Real-Time Hyperparameter Estimation) ^[17] (Algorithm 6) tries to find the balance between HD and RTHO by introducing a discount factor $\mu$ , providing globally useful update directions while formal methods fail to cope with loss surface that varies too fast or too slow. Both RTHO and HD can be interpreted as special cases of MARTHE when $\mu = 1$ and $\mu = 0$ , respectively. Li et al. ^[111] unified hypergradient approximation methods including back propagation through time, Neumann series, and conjugate gradient descent under the same framework, and has proposed a fully single loop algorithm (FSLA) ^[111] for bi-level optimization based on this framework.

Algorithm 6: MARTHE (credit to Donini et al. ^[17])
Input: Initial hyperparameters $\mathbf{x}$ , initial parameters $\mathbf{w}_0$ , hyper-learning rate $\beta$ , discount factor $\mu$ , objective function $E\colon\mathbb{R}^{d}\xrightarrow{}\mathbb{R}^{+}$ , weight update dynamics $\Phi_t\colon\mathbb{R}^{d}\times\mathbb{R}^{+}\xrightarrow{}\mathbb{R}^{d}$
1 $\Delta \mathbf{x} \xleftarrow{} 0$
2 $\mathbf{Z}_0 \xleftarrow{} 0 \qquad \text{//}\; \mathbf{Z}_t = \frac{\partial \mathbf{w}_t}{\partial \mathbf{x}}$
3 for $t = 1$ to $T$ do

4.2. Population-based algorithms

Population-based training (PBT) ^[62] derives from evolution search but updates both weights and hyperparameters of the population of agents in a single training process instead. Proportional agents with poor performances get eliminated regularly to generate new individuals exploited from the best-performing ones. Typically, neural network weights are copied from one of the tops while hyperparameters are randomly resampled from the whole space or slightly mutated from best values.

However, the reliance on metaheuristics for exploration leads to space collapse when the population is small. PB2 ^[18] proposed to guide the exploration by a time-varying GP with sublinear regret guarantees, which enable it to search with a small computational budget. A novel trick of batch sampling for the acquisition function helps maintain the diversification of the population. HPM (Hyperparameter mutation) ^[112] regards the agents as students that update their hyperparameters with hypergradient. Meanwhile, an attention-based teacher model is leveraged to learn a mutation schedule. While PB2-Mix ^[113] uses a hierarchical approach to allow PB2 to tackle hyperparameters of continuous and categorical types, BG-PBT (Bayesian generational PBT) ^[114] extends PBT-style methods to consider also ordinal (in addition to continuous and categorical) variables by employing trust-region (TR), GP-based BO.

4.3. Reinforcement learning methods

Reinforcement learning (RL) ^[115] has gained widespread application in optimizing algorithm configurations ^[116]. Typically, a controller is used to sample new candidates to get a return from the environment, where an evaluator scores the return to generate a reward. The controller then gets updated based on the received reward and current network states.

HOOF (Hyperparameter Optimization on the Fly) ^[117] proposed an off-policy method to adapt sensitive RL hyperparameters to changing environments. In HOOF, candidates are regularly generated and estimated by trajectories sampled from the current policy. The policy then gets updated greedily using the values of candidates. A novel generation strategy with importance sampling and a Kullback-Leibler (KL) constraint reduces the computation cost further.

While HOOF is tailored for RL tasks, Biedenkapp et al. ^[19] formalized learning DAC as a contextual Markov decision process (MDP) where multiple agents sample sequences of parameters across different stages. A global policy is generated according to the distribution of stages, and self-paced learning (SPL) is adopted to evaluate this policy to maximize its reward.

4.4. Remarks

DAC facilitates the real-time application of HPO while training of the model parameters makes progress. The capability of automatically determining algorithm configuration (or schedule) adaptively in an online fashion is particularly advantageous in scenarios where learning dynamics exhibit high non-stationarity, and a fixed configuration (or schedule) for the entire training duration could potentially be suboptimal ^[62,118]. Successful DAC, however, may require more computational resources compared to static methods ^[16].

Given the iterative nature of many AI algorithms ^[19], especially in the context of RL recognized as a universal modeling framework for sequential decision problems ^[119], dynamically learning optimal hyperparameter configurations that may vary over time during the training process becomes a natural proposition. AutoRL (Automated RL) ^[120,121], a recent area of research, has arisen to address RL algorithms' sensitivity to design choices ^{[122,123,124]} that often leads to laborious and costly manual tuning. AutoRL aims to automate different components of the RL framework, including (PO)MDP modeling, algorithm selection, HPO, and, when DL is used, the architecture search. Considering the nonstationary nature of RL, which adds to the vulnerability of RL algorithms ^[125], integrating dynamic HPO in the AutoRL pipeline can be beneficial ^[120,124].

As DAC is an emerging field, there are approaches optimizing hyperparameter schedules beyond the major classes discussed earlier. AutoLRS (Automatic learning rate scheduler) ^[126] trains a time-series forecasting model to estimate performance based on observations from the initial steps at each training stage and utilize a GP to explore the search spaces further. Multistage QHM (quasi-hyperbolic momentum) ^[127] provides a quasi-hyperbolic momentum modification for almost all momentum variants to dynamically tune hyperparameters, including SGD parameters and batch size.

5. Multi-objective hyperparameter optimization

Methods presented in earlier sections consider HPO problems with a single objective, e.g., prediction accuracy or error-based measure. Many real-world challenges, however, demand the simultaneous optimization of multiple, sometimes conflicting, objectives (or criteria). These objectives or criteria may encompass, for instance, training accuracy/error, generalization capability, model complexity, sensitivity, specificity, energy consumption, and inference time ^[128].

Karl et al. ^[20] gave the definition of the general MOHPO problem, representing the generalized HPO problem taking into account $m\in\mathbb{N}$ objectives or criteria,

$\begin{equation} \mathbf{x}^* = \mathop{\arg\min}\limits_{\mathbf{x}\, \in\, \mathcal{X}}f(\mathbf{x}) = \mathop{\arg\min}\limits_{\mathbf{x}\, \in\, \mathcal{X}}\, (f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots , f_m(\mathbf{x})), \end{equation}$

(5.1)

where $f_1\colon\mathcal{X}\xrightarrow{}\mathbb{R}, \ldots, f_m\colon\mathcal{X}\xrightarrow{}\mathbb{R}$ and $f\colon\mathcal{X}\xrightarrow{}\mathbb{R}^m$ . A hyperparameter configuration $\mathbf{x}\in\mathcal{X}$ (Pareto) dominates another configuration $\mathbf{x}'\in\mathcal{X}$ , if and only if $f(\mathbf{x})\prec f(\mathbf{x}')$ , i.e.,

$\begin{equation} \begin{aligned} \forall\, i & \in\{1, \ldots, m\}\colon f_i(\mathbf{x})\leq f_i(\mathbf{x}')\, \land \\ \exists\, j & \in\{1, \ldots, m\}\colon f_j(\mathbf{x})\leq f_j(\mathbf{x}'). \end{aligned} \end{equation}$

(5.2)

A configuration $\mathbf{x}^*$ is non-dominated, Pareto optimal, or Pareto efficient if and only if there exists no other configuration $\mathbf{x}\in\mathcal{X}$ that dominates $\mathbf{x}^*$ . From illustrating a simple MOO example with two minimizing objectives, it can be seen that in contrast to single-objective optimization, MOO problems lack a single universally optimal solution that satisfies all objectives simultaneously. The set of Pareto optimal solutions is referred to as the Pareto set $\mathcal{P}$ ^[20],

$\begin{equation} \mathcal{P}: = \{\mathbf{x}\in\mathcal{X}\mid\nexists\, \mathbf{x}'\in\mathcal{X}\, \text{ s.t. } \mathbf{x}'\prec\mathbf{x}\}, \end{equation}$

(5.3)

and $f(\mathcal{P})$ is the Pareto front. The Pareto set represents solutions that achieve the best trade-off between objectives, where no objective can be made better without making another worse off. The ideal goals of MOO, and thus MOHPO, algorithms are to ^[129]: (ⅰ) find a set of solutions that lies on the Pareto front, and (ⅱ) ensure the solution set found in (ⅰ) is diverse enough to represent the entire range of the Pareto front.

Figure 5. An example of the objective space of a multi-objective optimization problem with two minimizing objectives.

DownLoad: Full-Size Img PowerPoint

Perhaps the simplest approaches to MOHPO are GS (Section 2.1.2) and RS (Section 2.1.3). To adapt single-objective GS and RS to MOHPO problems, given that each point represents a combination of hyperparameters in the hyperparameter space, one can simply evaluate the performance based on multiple objectives at each chosen point and return all non-dominated solutions as the Pareto set. When dealing with high- or large-dimensional search space, GS and RS can be computationally inefficient and lack the ability to exploit the structure of the search space efficiently. They can, however, serve as simple baselines for specialized MOHPO algorithms, e.g., ^[130,131].

Below we introduce the canonical approaches tailored for MO(HP)O problems, distinguishing them into three main categories: scalarization, metaheuristic-based, and model-based. For detailed discussions on MOHPO, interested readers are referred to Karl et al. ^[20] and Morales-Hernández et al. ^[21].

5.1. Scalarization

Scalarization is a straightforward technique to MOHPO. This approach works by transforming multiple objectives into a single objective function, which then single-objective optimizers can be used ^{[132,133,134]}. Two of the most popular scalarization methods are the (i) weighted sum, which combines the objectives linearly with each objective $f_i(\mathbf{x})$ multiplied by a user-specified weight $\lambda_i$ ,

$\begin{equation} \mathop{\min}\limits_{\mathbf{x}\, \in\, \mathcal{X}}f(\mathbf{x}) = \mathop{\min}\limits_{\mathbf{x}\, \in\, \mathcal{X}}\sum\limits_{i = 1}^{m}\lambda_i f_i(\mathbf{x}), \end{equation}$

(5.4)

and the (ⅱ) $\epsilon$ -constraint ^[135], which retains only one objective $f_i$ and turns the rest of the objectives $f_j, \forall j = 1, \ldots, m, \, j\neq i$ , into constraints, i.e., restrict them to be within user-specified values $\epsilon_j$ ,

$\begin{equation} \mathop{\min}\limits_{\mathbf{x}\, \in\, \mathcal{X}}f_i(\mathbf{x}) \quad\text{subject to}\; f_j(\mathbf{x})\leq \epsilon_j, \forall j = 1, \ldots, m, \, j\ne i. \end{equation}$

(5.5)

Another scalarization function or its variants that has been used in other MOO approaches (such as those introduced in the following sections) is the (weighted) Chebyshev or Tchebycheff function (TCH) ^[132],

$\begin{equation} \mathop{\min}\limits_{\mathbf{x}\, \in\, \mathcal{X}}f(\mathbf{x}) = \mathop{\min}\limits_{\mathbf{x}\, \in\, \mathcal{X}}\mathop{\max}\limits_{i = 1, \ldots, m}[\lambda_i\mid f_i(\mathbf{x}-z_{i}^{*})], \end{equation}$

(5.6)

where $(z_{i}^{*})$ is usually set to the ideal reference point, i.e. $z_{i}^{*} = \min_{\mathbf{x}\in\mathcal{X}}f_i(\mathbf{x})$ for $i = 1, \ldots, m$ .

5.2. Multi-objective metaheuristics

NSGA-Ⅱ (Non-dominated sorting GA Ⅱ) ^[136], building on the primitive NSGA algorithm ^[137], is a widely adopted algorithm for MOO problems. It employs a GA framework and introduces non-dominated sorting to categorize solutions based on their dominance relationships, and crowding distance to sustain a diverse set of solutions on the Pareto front. NSGA-Ⅲ ^[138,139] extends NSGA-Ⅱ to MOO problems with a larger number of objectives by involving predefined reference points to aid in diversity preservation.

SPEA2 (Strength Pareto EA 2) ^[140] employs a strength-based approach within its EA framework, using Pareto dominance to guide the selection, favoring individuals that dominate more other individuals. Diversity is achieved through an external, fully-filled archive. SPEA2 and NSGA-Ⅱ both incorporate elitism in which the best individuals are preserved in each generation.

MOEA/D (Multi-objective EA based on decomposition) ^[141] takes a decomposition approach, explicitly using scalarization to transform the multi-objective problem into a set of single-objective subproblems. These subproblems are simultaneously optimized, achieving diverse and evenly distributed solutions along the Pareto front through properly selected decomposition methods and weight vectors ^[141].

To maximize the hypervolume indicator, where the hypervolume (or the $\mathcal{S}$ -metric) can be considered as the size covered in the objective space by a set of non-dominated solutions with respect to a user-defined reference point ^[142], SMS-EMOA ( $\mathcal{S}$ -metric selection evolutionary MOO algorithm) ^[143] proposes a steady-state EA with constant population size. SMS-EMOA applies non-dominated sorting from NSGA-Ⅱ ^[136] as first ranking criterion and discards individuals with low marginal hypervolume contribution.

Analogous to EAs, PSO has also been used for MOO. NSPSO (Non-dominated sorting PSO) ^[144] extends single-objective PSO to MOO through information sharing, motivating the entire swarm population progress toward the Pareto front. Non-dominated sorting and crowding distance from NSGA-Ⅱ ^[136] are used in NSPSO. Inspired by multi-objective EAs, Coello et al. ^[145] incorporated the concept of Pareto dominance and a secondary (external) population in its MOO approach via PSO. An adaptive mutation operator facilitates throughout exploration, while a historical record of previously non-dominated solutions promotes convergence toward the Pareto front.

5.3. Multi-objective model-based optimization

Approaches for multi-objective BO can broadly be categorized into scalarization, leveraging Pareto hypervolume (PHV) indicator ^[142], and information-theoretic methods.

ParEGO (Pareto efficient global optimization) ^[146] using GP as the surrogate based on the EGO approach ^[45], approximates the Pareto front with a single objective scalarized via the augmented Tchebycheff function ^[146], where the parameterization of the weight vector is sampled uniformly per iteration.

SMS-EGO ^[147] is another approach based on EGO ^[45]. Instead of using scalarization, SMS-EGO models each objective with a separate surrogate model, with the optimization and selection based on the hypervolume indicator ^[142]. SMS-EGO enhances the expected hypervolume improvement (EHI/EHVI) ^[148] acquisition function, initially proposed for surrogate-assisted evolutionary multi-objective search ^[149], by introducing adaptive search space reduction for improved scalability. As EHVI demands high computational complexity, Daulton et al. ^[150] proposed qEHVI, a differentiable hypervolume improvement acquisition function. qEHVI allows gradient-based parallel and sequential greedy optimization via the inclusion-exclusion principle.

PESMO (Predictive entropy search) ^[151] and MESMO (multi-objective maximum entropy search) ^[152] are information-theoretic methods rooted in information theory. Both PESMO and MESMO build an individual surrogate model for each objective. PESMO's acquisition function utilizes input space entropy, selecting the next evaluation that maximizes information gain about the optimal Pareto set. In contrast, MESMO proposes an output-space-entropy-based acquisition function for candidate selection, evaluating the input that maximizes information gain about the optimal Pareto front.

5.4. Remarks

Evolutionary-based methods, particularly exemplified by tools like NSGA-Ⅱ ^[136] (refer to Table 4), have been predominant in MOO, owing to their derivative information-free nature, simplicity for implementation, and flexibility with widespread applicability ^[129]. As we have seen in Section 5, MOO aims to find a set of solutions, the Pareto set, making EAs a natural choice as they inherently rely on population for optimization.

Table 4. Overview of frameworks/tools for HPO. In Language column, cpp: C++, go: Golang, jv: Java, jl: Julia, py: Python, R: R language. MO: multi-objective. MF: multi-fidelity. In Mode column, cent.: centralized, dist.: distributed. UI denotes user interface. In column MF,

$\checkmark$ is provided for tools that offer either their own multi-fidelity option or the original SHA/Hyperband algorithm. Citations are provided for frameworks/tools with publications. Note that the column Problem is relatively loosely specified due to challenges in attributing work to a single problem setting. Frameworks/tools that are not open source are marked with an asterisk (*).

Framework/Tool	Year	Problem	Optimization	Language	MO	MF	Mode	UI	URL
Optuna ^[161]	2019	HPO	GS, RS, TPE, CMA-ES, NSGA-Ⅱ, etc.	py	$\checkmark$	$\times$	dist.	$\checkmark$	https://github.com/optuna/optuna
Ray Tune ^[162]	2018	HPO	GS, RS, BO, EA, Optuna, Nevergrad, etc.	py	$\checkmark$	$\checkmark$	cent., dist., cloud	$\checkmark$	https://docs.ray.io/en/latest/tune/index.html
BoTorch ^[163]	2019	HPO	GP, qEHVI, etc.	py	$\checkmark$	$\checkmark$	cent.	$\times$	https://github.com/pytorch/botorch
Bayesian Optimization	2014	HPO	GP	py	$\times$	$\times$	cent.	$\times$	https://github.com/bayesian-optimization/BayesianOptimization
Hyperopt ^[164]	2015	HPO	RS, TPE, Adaptive TPE	py	$\times$	$\times$	dist.	$\times$	https://hyperopt.github.io/hyperopt/
SMAC3 ^[50,165]	2011	HPO	SMAC, ParEGO, mean aggr.	py	$\checkmark$	$\checkmark$	cent.	$\times$	https://github.com/automl/SMAC3
DEAP ^[166]	2012	HPO	GA, PSO, CMA-ES, NSGA-Ⅱ, NSGA-Ⅲ, SPEA2, MO-CMA-ES	py	$\checkmark$	$\times$	dist.	$\times$	https://github.com/DEAP/deap
Nevergrad	2018	HPO	RS, GA, PSO, BO	py	$\times$	$\times$	cent.	$\times$	https://github.com/facebookresearch/nevergrad
AgileRL	2023	HPO	evolutionary	py	$\times$	$\times$	cent., dist.	$\times$	https://github.com/AgileRL/AgileRL
TuRBO ^[66]	2019	HPO	GP	py	$\times$	$\times$	cent.	$\times$	https://github.com/uber-research/TuRBO
BayesOpt ^[167]	2014	HPO	GP	cpp	$\times$	$\times$	cent.	$\times$	https://github.com/rmcantin/bayesopt
HyperMapper ^[168]	2019	HPO	BO, TCH, etc.	cpp, py	$\checkmark$	$\times$	cent.	$\times$	https://github.com/luinardi/hypermapper
mlr3 ^[22,169,170]	2019	HPO	GS, RS, BO, SHA, Hyperband, CMA-ES, etc.	R	$\checkmark$	$\checkmark$	cent.	$\times$	https://github.com/mlr-org/mlr3
jMetal(Py) ^[171]	2018	HPO	ES, GA, MOEA/D, NSGA-Ⅱ, OMOPSO, etc.	jv, py	$\checkmark$	$\times$	cent.	$\times$	https://github.com/jMetal/jMetalPy
Goptuna	2019	HPO	RS, TPE, CMA-ES, ASHA, etc.	go	$\times$	$\times$	cent.	$\checkmark$	https://github.com/c-bata/goptuna
EvoTorch ^[172]	2022	HPO	evolutionary	py	$\checkmark$	$\times$	cent., dist.	$\times$	https://github.com/nnaisense/evotorch
OpenBox ^[173]	2021	HPO	BO, Hyperband, EHVI, MESMO, etc.	py	$\checkmark$	$\checkmark$	cent., dist., cloud	$\checkmark$	https://github.com/PKU-DAIR/open-box
Dragonfly ^[174]	2020	HPO	GP	py	$\checkmark$	$\checkmark$	cent.	$\times$	https://github.com/dragonfly/dragonfly
DEHB ^[101]	2021	HPO	DE, Hyperband	py	$\times$	$\checkmark$	cent.	$\times$	https://github.com/automl/DEHB
Orion	2022	HPO	RS, GS, Hyperband, PBT, BO, EA, etc.	py	$\times$	$\checkmark$	dist.	$\times$	https://github.com/Epistimio/orion
KerasTuner	2019	HPO	RS, BO, Hyperband	py	$\times$	$\checkmark$	cent., dist.	$\times$	https://github.com/keras-team/keras-tuner
Syne Tune ^[175]	2022	HPO	GS, RS, BO, PBT, DEHB, ASHA, NSGA-Ⅱ, MO-ASHA, etc.	py	$\checkmark$	$\checkmark$	cent., dist., cloud	$\times$	https://github.com/awslabs/syne-tune
Katib ^[176]	2020	HPO, NAS	BO, CMA-ES, Hyperband, PBT, Optuna, Hyperopt, etc.	py	$\times$	$\checkmark$	cent., dist., cloud	$\checkmark$	https://github.com/kubeflow/katib
Propulate ^[177]	2023	HPO	EA	py	$\times$	$\times$	dist.	$\times$	https://github.com/Helmholtz-AI-Energy/propulate
Determined AI	2020	HPO	GS, RS, ASHA, etc.	py	$\times$	$\times$	cent., dist., cloud	$\checkmark$	https://github.com/determined-ai/determined
Facebook Ax	2019	HPO	GP, BoTorch	py	$\checkmark$	$\checkmark$	cent.	$\times$	https://github.com/facebook/Ax
pymoo ^[178]	2018	HPO	GA, DE, CMA-ES, NSGA-Ⅱ, NSGA-Ⅲ, MOEA/D, SMS-EMOA, etc.	py	$\checkmark$	$\times$	cent.	$\times$	https://github.com/anyoptimization/pymoo
Hypernets	2022	HPO, NAS	MCTS, EA, NSGA-Ⅱ, NSGA-Ⅲ, MOEA/D, etc.	py	$\checkmark$	$\times$	cent.	$\times$	https://github.com/DataCanvasIO/Hypernets
Hyperopt.jl	2018	HPO	BO, Hyperband, BOHB, etc.	jl	$\times$	$\checkmark$	cent.	$\times$	https://github.com/baggepinnen/Hyperopt.jl
MLJTuning	2020	HPO	GS, RS, TPE, PSO, etc.	jl	$\times$	$\times$	cent., dist.	$\times$	https://github.com/JuliaAI/MLJTuning.jl
DeepHyper	2019	HPO, NAS	BO, scalarization	py	$\checkmark$	$\checkmark$	cent., dist.	$\times$	https://github.com/deephyper/deephyper
Weights & Biases	2018	HPO	GS, RS, BO, Hyperband	py	$\times$	$\checkmark$	cent., dist., cloud	$\checkmark$	https://github.com/wandb/wandb
Polyaxon (HyperTune)	2018	HPO	GS, RS, BO, Hyperband, Hyperopt	py	$\times$	$\checkmark$	cent., dist., cloud	$\checkmark$	https://github.com/polyaxon/polyaxon
Mango ^[179]	2020	HPO	BO	py	$\times$	$\times$	cent., dist.	$\times$	https://github.com/ARM-software/mango
Gradient-Free-Optimizers (Hyperactive)	2020	HPO	GS, RS, PSO, BO, etc.	py	$\times$	$\times$	cent.	$\times$	https://github.com/SimonBlanke/Gradient-Free-Optimizers
shap-hypetune	2021	HPO	GS, RS, BO	py	$\times$	$\times$	cent.	$\times$	https://github.com/cerlymarco/shap-hypetune
NePS ^[71]	2023	HPO, NAS	BO, Hyperband, $\pi$ BO, PriorBand	py	$\times$	$\checkmark$	cent., dist.	$\times$	https://github.com/automl/neps
Scikit-Optimize	2016	HPO	GS, RS, GP	py	$\times$	$\times$	cent.	$\times$	https://github.com/scikit-optimize/scikit-optimize
Talos	2019	HPO	GS, RS, etc.	py	$\times$	$\times$	cent.	$\times$	https://github.com/autonomio/talos
SHERPA ^[180]	2018	HPO	GS, RS, BO-GP, Hyperband, ASHA, PBT	py	$\times$	$\checkmark$	cent.	$\times$	https://github.com/sherpa-ai/sherpa
FAR-HO ^[38]	2017	HPO	ReverseHG, RTHO, ForwardHG	py	$\times$	$\times$	cent.	$\times$	https://github.com/lucfra/FAR-HO
FEDOT ^[181,182]	2021	AutoML	Hyperopt, Optuna, etc.	py	$\checkmark$	$\times$	cent.	$\times$	https://github.com/aimclub/FEDOT
TPOT ^[183]	2016	AutoML	GA	py	$\times$	$\times$	dist.	$\times$	https://github.com/EpistasisLab/tpot
AutoGL ^[184]	2021	AutoML	GS, BO, CMA-ES, MO-CMA-ES, etc.	py	$\checkmark$	$\times$	cent.	$\times$	https://github.com/THUMNLab/AutoGL
Auto-Sklearn ^[185,186]	2015	AutoML	SMAC	py	$\backslash$	$\times$	dist.	$\times$	https://github.com/automl/auto-sklearn
Auto-PyTorch ^[187]	2020	AutoML	SMAC, Hyperband	py	$\times$	$\checkmark$	cent.	$\times$	https://github.com/automl/Auto-PyTorch
AutoKeras ^[188,189]	2019	AutoML	GP	py	$\times$	$\times$	cent., cloud	$\times$	https://github.com/keras-team/autokeras
AutoGluon ^[190]	2020	AutoML	RS, BO	py	$\times$	$\times$	cent.	$\times$	https://github.com/autogluon/autogluon
TransmogrifAI	2018	AutoML	GS, RS, BO	Scala	$\times$	$\times$	cent.	$\times$	https://github.com/salesforce/TransmogrifAI
EvalML	2019	AutoML	GS, RS, BO	py	$\times$	$\times$	cent.	$\times$	https://github.com/alteryx/evalml
MLJAR AutoML	2021	AutoML	RS, TPE (Optuna)	py	$\times$	$\times$	cent.	$\checkmark$	https://github.com/mljar/mljar-supervised
Microsoft NNI	2021	AutoML	GS, RS, EA, Hyperband, PBT, BO, etc.	py	$\times$	$\checkmark$	cent., dist., cloud	$\checkmark$	https://github.com/microsoft/nni
Microsoft FLAML ^[191]	2021	AutoML	GS, RS, Optuna	py, .NET	$\checkmark$	$\times$	cent., dist, cloud	$\times$	https://github.com/microsoft/FLAML
Microsoft Archai	2020	NAS	RS, SHA, evolution, etc.	py	$\checkmark$	$\times$	cent., dist., cloud	$\times$	https://github.com/microsoft/archai
*Microsoft AzureML AutoML ^[192]	2018	AutoML	GS, RS, BO	py	$\times$	$\times$	cent., cloud	$\checkmark$	https://azure.microsoft.com/en-us/products/machine-learning/automatedml/
*Oracle AutoML ^[193]	2020	AutoML	gradient-based	py	$\times$	$\times$	cent., cloud	$\checkmark$	https://docs.oracle.com/en/database/oracle/machine-learning/
*Google Vizier ^[194]	2017	HPO	RS, GS, BO	cpp, go, py	$\times$	$\times$	cloud	$\checkmark$	https://cloud.google.com/vertex-ai
*Amazon SageMaker ^[195]	2017	AutoML	GS, Hyperband, RS, BO	py, R	$\checkmark$	$\checkmark$	cloud	$\checkmark$	https://aws.amazon.com/machine-learning

| Show Table

DownLoad: CSV

Apart from the prominent approaches introduced earlier, recently there has been growing interest in encompassing additional considerations when working with MOO problems. For instance, observing that observations are often subject to noise while optimization formulations assume noiseless observations, Daulton et al. ^[153] proposed NEHVI (noisy expected hypervolume improvement) for noisy multi-objective BOs. Lin et al. ^[154] and Misitano et al. ^[155] advocated incorporating decision-maker preferences. Instead of searching for the Pareto front, Malkomes et al. ^[156] proposed the constraint active search (CAS) formulation to search for diverse solutions satisfying objectives-turned-constraints.

In MOHPO, efforts have also been put toward adapting advanced single-objective HPO techniques to the multi-objective counterpart. For instance, a multi-fidelity method for MOHPO building upon the Hyperband algorithm ^[98] and scalarization was proposed by Schmucker et al. ^[131]. Similarly, Chen et al. ^[157] adapted BOHB ^[53] to MOHPO, leveraging the integrated information from a multi-fidelity ensemble model effectively in an online fashion. Dushatskiy et al. ^[158] introduced MO-PBT, the multi-objective version of PBT ^[62], demonstrating superior performance over MO-ASHA ^[159], the multi-objective version of ASHA ^[99].

6. Tools, applications, and further discussion

6.1. Public frameworks and tools

In the past, hyperparameters were usually tuned manually, which is time-consuming. To reduce the bar of using HPO algorithms and enhance their widespread applicability, various libraries and frameworks have been developed. We tabulate the popular frameworks used for HPO, referencing notable works such as ^{[22,23,24,31,32,160]}.

Existing HPO frameworks can be categorized in several ways. Based on the distribution of computing resources, they fall into three main categories: centralized, distributed, and cloud-based ^[31]. When considering optimization methods, they can be distinguished as RS-based, BO-based, metaheuristic-based, and so on. They can also be classified based on the type of problems they address, including HPO, NAS, and AutoML. HPO tools are primarily designed for tuning given hyperparameters; NAS tools are used to automatically discover the optimal architecture or structure of a neural network for a given task; AutoML considers all or most tasks in building ML models, including feature engineering, model construction, and HPO. Overview of existing HPO frameworks and tools is provided in Table 4.

6.2. HPO applications

HPO has emerged as a pivotal component of AutoML, finding extensive applications across diverse domains and tasks. For instance, it is applied for software sensor design ^[196], electroencephalography (EEG) data decoding ^[197], smart grid ^[198], P systems optimization ^[199], and healthcare ^[200].

Data-driven models have become integral in the realm of software sensor design for vehicles, where the performance is notably influenced by hyperparameters. A specific study ^[196] focused on designing a roll angle estimator based on an artificial neural network (ANN), employing techniques including GS, Hyperband, BO, and GA. The results emphasized the significant impact of search space size on performance, with knowledge-based methods outperforming their counterparts.

In the domain of EEG data decoding for brain-computer interfaces (BCIs), HPO methods play a crucial role in optimizing DL models for the classification of EEG recordings. In Stober et al. ^[201], hyperparameters associated with the convolutional neural networks (CNN) used to classify EEG recordings of rhythm perception were optimized with BO. In another work by Drouin-Picaro and Falk ^[202], GS for HPO was used in the classification of EEG signals corresponding to natural saccade with CNN and multilayer perceptron (MLP). Coonet et al. ^[197] suggested that more experiments should be conducted to investigate whether the generalization of HPOs across subjects is feasible.

In a smart grid, the consumption of electricity varies from time to time, which burdens the electricity systems. Prediction with the ML models can help achieve the balance between demand and supply, and manage the energy efficiently ^[198]. ML models such as support vector machine (SVM), ANN, and BNN are commonly used for this problem. To achieve a good predictive performance, tuning the hyperparameters is of vital importance. For example, Zhou et al. ^[203] proposed a system based on autoencoders and tuned the hyperparameters using GA. Additionally, AutoML frameworks are likely to be applied to this field in the future search.

In recent attempts to create bridges between adaptive P systems and ML paradigms, the optimization of one of the hyperparameters, which is the type of the spiking neuron used in spiking neural (SN) P systems, prior to the training of P systems, can potentially benefit from HPO ^[199].

Biomedical applications, encompassing healthcare, biomedical research, and big biomedical data, have also witnessed the advantages of integrating HPO methods in the ML pipelines^[200]. Hospitals can deploy ML models for health outcome improvement, healthcare cost reduction, and clinical research advancement ^[204]. In fact, certain frameworks are developed exclusively for healthcare. One example is AutoPrognosis ^[205], which uses BO to optimize models and hyperparameters for clinical prognosis. HPO also has non-negligible impact when working with medical images using AI techniques ^[206]. For instance, BO was used to optimize the network architecture of Nishio et al. ^[207] for lung segmentation using chest X-ray (CXR) images with severe abnormalities. BO with TPE as the surrogate was also used in Abdellatif et al. ^[208] on the Improved Weighted RF for predicting the presence of cardiovascular disease and patient survival. Experiments by Belete and Huchaiah ^[209] employing GS-based HPO on eight ML models have shown improvement over statistical way of optimization in predicting HIV/AIDS test results. Similarly, Nematzadeh et al. ^[210] demonstrated that by employing dedicated HPO algorithms for ML models in the classification and regression of biomedical datasets, training process and model performance improved when compared to using blindly chosen hyperparameters. Despite these advancements, challenges persist. This includes the limited interpretability of ML models contributing to user skepticism regarding predictions, efficiency concerns as current methods struggle to swiftly identify a near-optimal hyperparameter configuration, and the protracted trial-and-error processes presenting an impracticality given the demand for approaching thousands of predictive modeling problems in personalized medicine. In light of these challenges, DAC algorithms emerge as a promising avenue.

6.3. Further discussion

In this paper, we have discussed four kinds of HPO methods that are closely connected. Classical techniques, with their foundational concepts proposed over a decade ago, have evolved alongside advancements in ML models. On one hand, these techniques are employed to tune hyperparameters, enhancing the performance of ML models. On the other hand, ML models contribute to the expansion and evolution of these techniques. ML models such as RF and BNNs can serve as surrogate models of BO. As ML models experience longer training times, there is a growing demand for more efficient HPO techniques. Researchers have responded by designing various frameworks to carefully manage computing resources, while retaining the core ideas. DAC algorithms, for instance, aim to utilize resources efficiently within a single training cycle to achieve near-optimal performance. These algorithms often involve structural modifications to classical techniques, such as gradient-based and population-based methods, allowing for on-the-fly updates to hyperparameters. However, a common challenge faced by these methods is the inherent trade-off between performance and efficiency. Driven by real-world situations where practical considerations involve additional metrics or criteria, framing the HPO problem as an MOHPO problem is a more pragmatic approach. MOHPO, being an extension of the broader MOO, leverages existing MOO methodologies, with recent endeavors adapting single-objective HPO algorithms for multi-objective cases.

HPO algorithms encounter several challenges and issues. First, the diversity of hyperparameter types and complex search spaces can impact the usability of algorithms, often requiring disharmonious extensions that may affect performance and theoretical properties. Second, the high-dimensionality of hyperparameters makes convergence challenging due to the vast number of samples required to explore the search space. Identifying the few influential hyperparameters is a daunting task. Third, handling the intricate relationships among hyperparameters poses difficulty, as the value of one hyperparameter may significantly affect another, while others remain independent. Capturing these relations can reduce the complexity of the search space. Additionally, scalability is a concern, with standard BO-GP having a complexity of $\mathcal{O}(n^3)$ , which becomes impractical with a large number of samples. Furthermore, the issue of transferring hyperparameters from task to task or leveraging knowledge from previous search results remains a challenge. Most algorithms treat different tasks independently, even when sharing the same datasets or models, leading to inefficiencies. Lastly, the optimal hyperparameters may change as datasets grow in size. Many existing algorithms require a search from scratch, which is not conducive to the dynamic nature of big data.

In this paper, we have only reviewed approaches to general HPO, acceleration techniques, and algorithms for the emerging DAC and MOHPO. It is noteworthy that various efforts explore HPO from distinct perspectives, warranting attention. For instance, work looking at constrained HPO ^{[211,212,213,214]} or constrained BO ^{[215,216,217]} and investigations on robustness ^{[53,180,218,219,220]}. Explorations of HPO with meta-learning and AutoML ^{[221,222,223,224]} also offer valuable insights into related areas.

7. Conclusion and outlook

Complex computing systems, exemplified by modern ML pipelines, have found diverse applications across various domains. In the pursuit of more effective and efficient deployment of these pipelines, researchers have increasingly focused on the performance and efficiency of HPO algorithms – a focal point of this paper. We begin by providing a broad overview of HPO, delving notably into strategies tailored for DL algorithms. Subsequently, we explore methods to accelerate optimization procedures. This is followed by comprehensive and systematic reviews of the emerging DAC and MOHPO, offering insights for future research. We also present a comparative analysis of existing HPO tools and their applications across different domains, laying the groundwork for potential future research endeavors.

The potential applications of HPO span diverse areas, presenting a promising landscape for exploration and advancement. From the ML application perspective, here we discuss important areas that underscore the impact of HPO. One key domain is NAS, wherein HPO manifests as a task within a discrete search space. First, NAS can be regarded as an HPO task on discrete search space, and recent works have explored various search methods, including RL ^[225], EA ^[226], BO ^[227], and gradient-based methods ^{[228,229,230]}. Noteworthy applications of NAS extend to complex vision tasks such as object detection ^[231,232] and segmentation ^[233,234]. Recent work by Wang et al. ^[235] introduces a NAS framework based on the “mergeNAS” technique ^[229], allowing for searches within any custom search spaces across a spectrum of vision tasks.

In addition to ML, the fusion of combinatorial optimization ^[236,237] with HPO within the realm of applied mathematics offers a platform for further exploration. This includes but is not limited to, tackling challenges like the traveling salesman problem ^[238], vehicle routing problem ^[239], mix-integer programming ^[240], boolean satisfiability ^[241], portfolio optimization ^[242], graph matching, either through ML approaches ^[2434] or conventional methods ^[244], graph clustering ^[245], and graph edit distance computing ^[246].

Despite HPO's long history spanning over half a century, numerous promising areas warrant continued efforts. We envision that this comprehensive survey will not only equip researchers with diverse backgrounds to understand the current state and evolution of HPO, facilitating its seamless integration into future work, but also provide practical insights for practitioners in ML and HPO-related domains.

Use of AI tools declaration

The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

Acknowledgment

This work was partly supported by National Natural Science Foundation of China (623B1009, 62073333), and Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102). The authors are also thankful for the anonymous reviewers for their valuable suggestions and comments to help improve this survey in all aspects.

Conflict of interest

Junchi Yan is an editorial board member for Mathematical Biosciences and Engineering and was not involved in the editorial review or the decision to publish this article. All authors declare that there are no competing interests.

Author contributions

Both authors have read and agreed to the published version of the manuscript.

Funding

The authors confirm no financial support by any institution.

Conflicts of interest

The authors declare no conflict of interest.

References

[1]	Lehman N (2015) The RNA World: 4,000,000,050 years old. Life 5: 1583-1586. doi: 10.3390/life5041583
[2]	Díaz Arenas C, Lehman N (2010) Quasispecies-like behavior observed in catalytic RNA populations evolving in a test tube. BMC Evol Biol 10: 80. doi: 10.1186/1471-2148-10-80
[3]	Yeates JAM, Lehman N (2016) RNA networks at the origins of life. Biochemist 38: 8-12. doi: 10.1042/BIO03802008
[4]	Di Mauro E, Saladino R, Trifonov EN (2014) The path to life's origins. Remaining hurdles. J Biomol Struct Dyn 32: 512-522. doi: 10.1080/07391102.2013.783509
[5]	Adamski P, Eleveld M, Sood A, et al. (2020) From self-replication to replicator systems en route to de novo life. Nat Rev Chem 4: 386-403. doi: 10.1038/s41570-020-0196-x
[6]	Vaidya N, Walker SI, Lehman N (2013) Recycling of informational units leads to selection of replicators in a prebiotic soup. Chem Biol 20: 241-252. doi: 10.1016/j.chembiol.2013.01.007
[7]	French RK, Holmes EC (2020) An ecosystems perspective on virus evolution and emergence. Trends Microbiol 28: 165-175. doi: 10.1016/j.tim.2019.10.010
[8]	Villarreal LP, Witzany G (2013) Rethinking quasispecies theory: From fittest type to cooperative consortia. World J Biol Chem 4: 79-90. doi: 10.4331/wjbc.v4.i4.79
[9]	Janzen E, Blanco C, Peng H, et al. (2020) Promiscuous ribozymes and their proposed role in prebiotic evolution. Chem Rev 120: 4879-4897. doi: 10.1021/acs.chemrev.9b00620
[10]	Furubayashi T, Ueda K, Bansho Y, et al. (2020) Emergence and diversification of a host-parasite RNA ecosystem through Darwinian evolution. Elife 9: e56038. doi: 10.7554/eLife.56038
[11]	Witzany G (2011) The agents of natural genome editing. J Mol Cell Biol 3: 181-189. doi: 10.1093/jmcb/mjr005
[12]	Witzany G (2019) Communication as the Main Characteristic of Life. Handbook of Astrobiology Boka Raton: CrC Press, 91-105.
[13]	Villarreal LP, Witzany G (2019) That is life: communicating RNA networks from viruses and cells in continuous interaction. Ann NY Acad Sci 1447: 5-20. doi: 10.1111/nyas.14040
[14]	Harish A, Caetano-Anollés G (2012) Ribosomal history reveals origins of modern protein synthesis. PLoS One 7: e32776. doi: 10.1371/journal.pone.0032776
[15]	Root-Bernstein M, Root-Bernstein R (2015) The ribosome as a missing link in the evolution of life. J Theor Biol 367: 130-158. doi: 10.1016/j.jtbi.2014.11.025
[16]	Pak D, Root-Bernstein R, Burton ZF (2017) tRNA structure and evolution and standardization to the three nucleotide genetic code. Transcription 8: 205-219. doi: 10.1080/21541264.2017.1318811
[17]	Mizuno CM, Guyomar C, Roux S, et al. (2019) Numerous cultivated and uncultivated viruses encode ribosomal proteins. Nat Commun 10: 752. doi: 10.1038/s41467-019-08672-6
[18]	Villarreal LP (2009) Origin of Group Identity: Viruses, Addiction and Cooperation New York: Springer.
[19]	Villarreal LP, Witzany G (2015) When competing viruses unify: evolution, conservation, and plasticity of genetic identities. J Mol Evol 80: 305-318. doi: 10.1007/s00239-015-9683-y
[20]	Hayden EJ, Lehman N, Unrau P (2019) Transitions. RNA and Ribozymes in the Development of Life. Handbook of Astrobiology Boka Raton: CrC Press, 379-394.
[21]	Villarreal LP (2004) Can Viruses Make Us Human? Proc Am Phl Soc 148: 296-323.
[22]	Villarreal LP (2015) Can Virolution Help Us Understand Recent Human Evolution? Astrobiology. An Evolutionary Approach Boca Raton: CRC Press, 441-472.
[23]	Gemmell P, Hein J, Katzourakis A (2019) The exaptation of HERV-H: evolutionary analyses reveal the genomic features of highly transcribed elements. Front Immunol 10: 1339. doi: 10.3389/fimmu.2019.01339
[24]	Larouche JD, Trofimov A, Hesnard L, et al. (2020) Widespread and tissue-specific expression of endogenous retroelements in human somatic tissues. Genome Med 12: 40. doi: 10.1186/s13073-020-00740-7
[25]	Witzany G (2006) Natural genome-editing competences of viruses. Acta Biotheor 54: 235-253. doi: 10.1007/s10441-006-9000-7
[26]	Witzany G (2009) Noncoding RNAs: persistent viral agents as modular tools for cellular needs. Ann NY Acad Sci 1178: 244-267. doi: 10.1111/j.1749-6632.2009.04989.x
[27]	Villarreal LP, Witzany G (2010) Viruses are essential agents within the roots and stem of the tree of life. J Theor Biol 262: 698-710. doi: 10.1016/j.jtbi.2009.10.014
[28]	Manrubia SC, Briones C (2007) Modular evolution and increase of functional complexity in replicating RNA molecules. RNA 13: 97-107. doi: 10.1261/rna.203006
[29]	Villarreal LP (2005) Viruses and The Evolution of Life Washington: ASM Press. doi: 10.1128/9781555817626
[30]	Villarreal LP (2009) Persistence pays: how viruses promote host group survival. Curr Opin Microbiol 12: 467-472. doi: 10.1016/j.mib.2009.06.014
[31]	Gómez J, Ariza-Mateos A, Cacho I (2015) Virus is a Signal for the Host Cell. Biosemiotics 8: 483-491. doi: 10.1007/s12304-015-9245-0
[32]	Feschotte C (2008) Transposable elements and the evolution of regulatory networks. Nat Rev Genet 9: 397-405. doi: 10.1038/nrg2337
[33]	Moelling K, Broecker F (2019) Viruses and Evolution - Viruses First? A Personal Perspective. Front Microbiol 10: 523. doi: 10.3389/fmicb.2019.00523
[34]	Ryan F (2009) Virolution London: William Collins.
[35]	Villarreal LP (2006) Virus-host symbiosis mediated by persistence. Symbiosis 44: 1-9.
[36]	Villarreal LP (2008) The Widespread Evolutionary Significance of Viruses. Origin and Evolution of Viruses New York: Academic Press, 477-516. doi: 10.1016/B978-0-12-374153-0.00021-7
[37]	Villareal LP, Ryan F (2011) Viruses in host evolution: General principles and future extrapolations. Curr Top Virol 9: 79-90.
[38]	Pereira L, Rodrigues T, Carrapiço F (2012) A Symbiogenic Way in the Origin of Life. In The Beginning: Precursors of Life, Chemical Models and Early Biological Evolution, Cellular Origin, Life in Extreme Habitats and Astrobiology Dordrecht: Springer, 723-742. doi: 10.1007/978-94-007-2941-4_36
[39]	Villarreal LP, Witzany G (2018) Editorial: Genome Invading RNA-Networks. Frontiers Microbiol 9: 581. doi: 10.3389/fmicb.2018.00581
[40]	Domingo E, Sheldon J, Perales C (2012) Viral quasispecies evolution. Microbiol Mol Biol Rev 76: 159-216. doi: 10.1128/MMBR.05023-11
[41]	Vaidya N, Manapat ML, Chen IA, et al. (2012) Spontaneous network formation among cooperative RNA replicators. Nature 491: 72-77. doi: 10.1038/nature11549
[42]	Higgs PG, Lehman N (2015) The RNA World: molecular cooperation at the origins of life. Nat Rev Genet 16: 7-17. doi: 10.1038/nrg3841
[43]	Nelson JW, Breaker RR (2017) The lost language of the RNA World. Sci Signal 10: eaam8812. doi: 10.1126/scisignal.aam8812
[44]	Villarreal LP (2015) Force for ancient and recent life: viral and stemloop RNA consortia promote life. Ann NY Acad Sci 1341: 25-34. doi: 10.1111/nyas.12565
[45]	Krupovic M, Bamford DH (2009) Does the evolution of viral polymerases reflect the origin and evolution of viruses? Nat Rev Microbiol 7: 250. doi: 10.1038/nrmicro2030-c1
[46]	Forterre P, Prangishvili D (2013) The major role of viruses in cellular evolution: facts and hypotheses. Curr Opin Virol 3: 558-565. doi: 10.1016/j.coviro.2013.06.013
[47]	Wolf YI, Kazlauskas D, Iranzo J, et al. (2018) Origins and evolution of the Global RNA virome. mBio 9: e02329-18.
[48]	Feschotte C, Gilbert C (2012) Endogenous viruses: insights into viral evolution and impact on host biology. Nat Rev Genet 13: 283-296. doi: 10.1038/nrg3199
[49]	Vignuzzi M, López CB (2019) Defective viral genomes are key drivers of the virus-host interaction. Nat Microbiol 4: 1075-1087. doi: 10.1038/s41564-019-0465-y
[50]	Koonin EV, Dolja VV, Krupovic M (2015) Origins and evolution of viruses of eukaryotes: The ultimate modularity. Virology 479–480: 2-25. doi: 10.1016/j.virol.2015.02.039
[51]	Briones C, Stich M, Manrubia SC (2009) The dawn of the RNA world: toward functional complexity through ligation of random RNA oligomers. RNA 15: 743-749. doi: 10.1261/rna.1488609
[52]	Tjhung KF, Shokhirev MN, Horning DP, et al. (2020) An RNA polymerase ribozyme that synthesizes its own ancestor. Proc Natl Acad Sci USA 117: 2906-2913. doi: 10.1073/pnas.1914282117
[53]	Petrov AS, Bernier CR, Hsiao C, et al. (2014) Evolution of the ribosome at atomic resolution. Proc Natl Acad Sci USA 111: 10251-10256. doi: 10.1073/pnas.1407205111
[54]	Villarreal LP, Witzany G (2013) The DNA Habitat and its RNA Inhabitants: At the Dawn of RNA Sociology. Genomics Ins 6: 1-12.
[55]	Bertalanffy L, Laviolette PA (1981) A Systems View of Man New York: Routledge.
[56]	Schudoma C, Larhlimi A, Walther D (2011) The influence of the local sequence environment on RNA loop structures. RNA 17: 1247-1257. doi: 10.1261/rna.2550211
[57]	Witzany G (2020) What is Life? Front Astron Space Sci 7: 7. doi: 10.3389/fspas.2020.00007
[58]	Root-Bernstein RS, Merrill SJ (1997) The necessity of cofactors in the pathogenesis of AIDS: a mathematical model. J Theor Biol 187: 135-146. doi: 10.1006/jtbi.1997.0449
[59]	Witzany G (2014) RNA sociology: group behavioral motifs of RNA consortia. Life 4: 800-818. doi: 10.3390/life4040800
[60]	Witzany G (2014) Pragmatic turn in biology: From biological molecules to genetic content operators. World J Biol Chem 5: 279-285. doi: 10.4331/wjbc.v5.i3.279
[61]	Schudoma C (2011) It's a loop world—single strands in RNA as structural and functional elements. Biomol Conc 2: 171-181. doi: 10.1515/bmc.2011.016
[62]	Bergthaler A, Menche J (2017) The immune system as a social network. Nat Immunol 18: 481-482. doi: 10.1038/ni.3727
[63]	Witzany G (2016) Crucial steps to life: From chemical reactions to code using agents. Biosystems 140: 49-57. doi: 10.1016/j.biosystems.2015.12.007
[64]	Koonin EV (2016) Viruses and mobile elements as drivers of evolutionary transitions. Philos Trans R Soc Lond B Biol Sci 371: 20150442. doi: 10.1098/rstb.2015.0442
[65]	Sullivan MB, Weitz JS, Wilhelm S (2017) Viral ecology comes of age. Environ Microbiol Rep 9: 33-35. doi: 10.1111/1758-2229.12504
[66]	Bushman FD (2003) Targeting survival: Integration site selection by retroviruses and LTR-retrotransposons. Cell 115: 135-138. doi: 10.1016/S0092-8674(03)00760-8
[67]	Hayden EJ, Lehman N (2006) Self-assembly of a group I intron from inactive oligonucleotide fragments. Chem Biol 13: 909-918. doi: 10.1016/j.chembiol.2006.06.014
[68]	Bartel DP (2009) MicroRNAs: target recognition and regulatory functions. Cell 136: 215-233. doi: 10.1016/j.cell.2009.01.002
[69]	Belfort M, Curcio MJ, Lue NF (2011) Telomerase and retrotransposons: reverse transcriptases that shaped genomes. Proc Natl Acad Sci USA 108: 20304-20310. doi: 10.1073/pnas.1100269109
[70]	Pyle AM (2016) Group II Intron Self-Splicing. Annu Rev Biophys 45: 183-205. doi: 10.1146/annurev-biophys-062215-011149
[71]	Villarreal LP (2011) Viruses and host evolution: virus-mediated self identity. Adv Exp Med Biol 738: 185-217. doi: 10.1007/978-1-4614-1680-7_12
[72]	Villarreal LP (2012) The addiction module as a social force. Viruses: Essential Agents of Life Dordrecht: Springer Science + Business Media, 107-146. doi: 10.1007/978-94-007-4899-6_6
[73]	Villarreal LP (2016) Persistent virus and addiction modules: an engine of symbiosis. Curr Opin Microbiol 31: 70-79. doi: 10.1016/j.mib.2016.03.005
[74]	Kobayashi I (2001) Behavior of restriction-modification systems as selfish mobile elements and their impact on genome evolution. Nucleic Acids Res 29: 3742-3756. doi: 10.1093/nar/29.18.3742
[75]	Mruk I, Kobayashi I (2014) To be or not to be: regulation of restriction-modification systems and other toxin-antitoxin systems. Nucleic Acids Res 42: 70-86. doi: 10.1093/nar/gkt711
[76]	Gerdes K, Wagner EG (2007) RNA antitoxins. Curr Opin Microbiol 10: 117-124. doi: 10.1016/j.mib.2007.03.003
[77]	Witzany G (2020) Evolution of Genetic Information without Error Replication. Theoretical Information Studies: Information in the World Singapore: World Scientific, 295-320. doi: 10.1142/9789813277496_0014
[78]	Schuster P (2011) Mathematical modeling of evolution. Solved and open problems. Theory Biosci 130: 71-89. doi: 10.1007/s12064-010-0110-z
[79]	Eigen M (1971) Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften 58: 465-523. doi: 10.1007/BF00623322
[80]	Koonin EV (2009) On the origin of cells and viruses: primordial virus world scenario. Ann N Y Acad Sci 1178: 47-64. doi: 10.1111/j.1749-6632.2009.04992.x
[81]	Comeau AM, Hatfull GF, Krisch HM, et al. (2008) Exploring the prokaryotic virosphere. Res Microbiol 159: 306-313. doi: 10.1016/j.resmic.2008.05.001
[82]	Zhang YZ, Chen YM, Wang W, et al. (2019) Expanding the RNA Virosphere by Unbiased Metagenomics. Annu Rev Virol 6: 119-139. doi: 10.1146/annurev-virology-092818-015851
[83]	Moelling K (2013) What contemporary viruses tell us about evolution: a personal view. Arch Virol 158: 1833-1848. doi: 10.1007/s00705-013-1679-6
[84]	Stedman KM (2015) Deep recombination: RNA and ssDNA virus genes in DNA virus and host genomes. Annu Rev Virol 2: 203-217. doi: 10.1146/annurev-virology-100114-055127
[85]	Orgel LE, Crick FH (1980) Selfish DNA: the ultimate parasite. Nature 284: 604-607. doi: 10.1038/284604a0
[86]	Doolittle WF, Sapienza C (1980) Selfish genes, the phenotype paradigm and genome evolution. Nature 284: 601-603. doi: 10.1038/284601a0
[87]	Volff JN (2006) Turning junk into gold: domestication of transposable elements and the creation of new genes in eukaryotes. Bioessays 28: 913-922. doi: 10.1002/bies.20452
[88]	Biémont C, Vieira C (2006) Genetics: junk DNA as an evolutionary force. Nature 443: 521-524. doi: 10.1038/443521a
[89]	Ryan F (2006) Genomic creativity and natural selection: a modern synthesis. Biol J Linn Soc 88: 655-672. doi: 10.1111/j.1095-8312.2006.00650.x
[90]	Root-Bernstein RS, Dillon PF (1997) Molecular complementarity I: the complementarity theory of the origin and evolution of life. J Theor Biol 188: 447-749. doi: 10.1006/jtbi.1997.0476
[91]	Stadler PF, Schuster P (1992) Mutation in autocatalytic reaction networks. An analysis based on perturbation theory. J Math Biol 30: 597-631. doi: 10.1007/BF00948894
[92]	Naville M, Volff JN (2016) Endogenous Retroviruses in Fish Genomes: From Relics of Past Infections to Evolutionary Innovations? Front Microbiol 7: 1197. doi: 10.3389/fmicb.2016.01197
[93]	Broecker F, Moelling K (2019) Evolution of Immune Systems. From Viruses and Transposable Elements. Front Microbiol 10: 51. doi: 10.3389/fmicb.2019.00051
[94]	Naville M, Warren IA, Haftek-Terreau Z, et al. (2016) Not so bad after all: retroviruses and long terminal repeat retrotransposons as a source of new genes in vertebrates. Clin Microbiol Infect 22: 312-323. doi: 10.1016/j.cmi.2016.02.001
[95]	Cech TR (2012) The RNA worlds in context. Cold Spring Harb Perspect Biol 4: a006742.
[96]	Eigen M, Schuster P (1977) A principle of natural self-organization. Naturwissenschaften 64: 541-565. doi: 10.1007/BF00450633
[97]	Szathmáry E (1993) Coding coenzyme handles: a hypothesis for the origin of the genetic code. Proc Natl Acad Sci USA 90: 9916-9920. doi: 10.1073/pnas.90.21.9916
[98]	Domingo E, Schuster P (2016) Quasispecies: From Theory to Experimental Systems Cham: Springer. doi: 10.1007/978-3-319-23898-2
[99]	Domingo E, Grande-Pérez A, Martín V (2008) Future prospects for the treatment of rapidly evolving viral pathogens: insights from evolutionary biology. Expert Opin Biol Ther 8: 1455-1460. doi: 10.1517/14712598.8.10.1455
[100]	Villarreal LP (2016) Viruses and the placenta: the essential virus first view. APMIS 124: 20-30. doi: 10.1111/apm.12485
[101]	Chaudhari N, Hahn WE (1983) Genetic expression in the developing brain. Science 220: 924-928. doi: 10.1126/science.6189184
[102]	Khaitovich P, Muetzel B, She X, et al. (2004) Regional patterns of gene expression in human and chimpanzee brains. Genome Res 14: 1462-1473. doi: 10.1101/gr.2538704
[103]	Watanabe H, Fujiyama A, Hattori M, et al. (2004) DNA sequence and comparative analysis of chimpanzee chromosome 22. Nature 429: 382-388. doi: 10.1038/nature02564
[104]	Wetterbom A, Sevov M, Cavelier L (2006) Comparative genomic analysis of human and chimpanzee indicates a key role for indels in primate evolution. J Mol Evol 63: 682-690. doi: 10.1007/s00239-006-0045-7
[105]	Mills RE, Luttig CT, Larkins CE, et al. (2006) An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res 16: 1182-1190. doi: 10.1101/gr.4565806
[106]	Mager DL, Stoye JP (2014) Mammalian endogenous retroviruses. Microbiol Spectrum 3: MDNA3-0009-2014.
[107]	Sakate R, Suto Y, Imanishi T, et al. (2007) Mapping of chimpanzee full-length cDNAs onto the human genome unveils large potential divergence of the transcriptome. Gene 399: 1-10. doi: 10.1016/j.gene.2007.04.013
[108]	Zeh DW, Zeh JA, Ishida Y (2009) Transposable elements and an epigenetic basis for punctuated equilibria. Bioessays 31: 715-726. doi: 10.1002/bies.200900026
[109]	Keightley PD, Lercher MJ, Eyre-Walker A (2005) Evidence for widespread degradation of gene control regions in hominid genomes. PLoS Biol 3: e42. doi: 10.1371/journal.pbio.0030042
[110]	Keightley PD, Lercher MJ, Eyre-Walker A (2006) Understanding the degradation of hominid gene control. PLoS Comput Biol 2: e19. doi: 10.1371/journal.pcbi.0020019
[111]	Toder R, Grützner F, Haaf T, et al. (2001) Species-specific evolution of repeated DNA sequences in great apes. Chromosome Res 9: 431-435. doi: 10.1023/A:1011605824530
[112]	Bird CP, Stranger BE, Liu M, et al. (2007) Fast-evolving noncoding sequences in the human genome. Genome Biol 8: R118. doi: 10.1186/gb-2007-8-6-r118
[113]	Bejerano G, Haussler D, Blanchette M (2004) Into the heart of darkness: large-scale clustering of human non-coding DNA. Bioinformatics 20: i40-48. doi: 10.1093/bioinformatics/bth946
[114]	Bush EC, Lahn BT (2005) Selective constraint on noncoding regions of hominid genomes. PLoS Comput Biol 1: e73. doi: 10.1371/journal.pcbi.0010073
[115]	Barry G, Mattick JS (2012) The role of regulatory RNA in cognitive evolution. Trends Cogn Sci 16: 497-503. doi: 10.1016/j.tics.2012.08.007
[116]	Mattick JS (2009) Deconstructing the dogma: a new view of the evolution and genetic programming of complex organisms. Ann N Y Acad Sci 1178: 29-46. doi: 10.1111/j.1749-6632.2009.04991.x
[117]	Mattick JS (2011) The central role of RNA in human development and cognition. FEBS Lett 585: 1600-1616. doi: 10.1016/j.febslet.2011.05.001
[118]	Mattick JS (2005) The functional genomics of noncoding RNA. Science 309: 1527-1528. doi: 10.1126/science.1117806
[119]	Mattick JS, Taft RJ, Faulkner GJ (2010) A global view of genomic information--moving beyond the gene and the master regulator. Trends Genet 26: 21-28. doi: 10.1016/j.tig.2009.11.002
[120]	Claverie JM (2005) Fewer genes, more noncoding RNA. Science 309: 1529-1530. doi: 10.1126/science.1116800
[121]	Khalil AM, Guttman M, Huarte M, et al. (2009) Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc Natl Acad Sci USA 106: 11667-11672. doi: 10.1073/pnas.0904715106
[122]	Mehler MF, Mattick JS (2007) Noncoding RNAs and RNA editing in brain development, functional diversification, and neurological disease. Physiol Rev 87: 799-823. doi: 10.1152/physrev.00036.2006
[123]	Mattick JS, Mehler MF (2008) RNA editing, DNA recoding and the evolution of human cognition. Trends Neurosci 31: 227-233. doi: 10.1016/j.tins.2008.02.003
[124]	Mercer TR, Dinger ME, Mariani J, et al. (2008) Noncoding RNAs in Long-Term Memory Formation. Neuroscientist 14: 434-445. doi: 10.1177/1073858408319187
[125]	Nghe P, Hordijk W, Kauffman SA, et al. (2015) Prebiotic network evolution: six key parameters. Mol Biosyst 11: 3206-3217. doi: 10.1039/C5MB00593K
[126]	Kaliatsi EG, Giarimoglou N, Stathopoulos C, et al. (2020) Non-Coding RNA-Driven Regulation of rRNA Biogenesis. Int J Mol Sci 21: 9738. doi: 10.3390/ijms21249738
[127]	Shepherd JD (2018) Arc-An endogenous neuronal retrovirus? Semin Cell Dev Biol 77: 73-78. doi: 10.1016/j.semcdb.2017.09.029
[128]	Pastuzyn ED, Day CE, Kearns RB, et al. (2018) The neuronal gene arc encodes a repurposed retrotransposon gag protein that mediates intercellular RNA transfer. Cell 172: 275-288. doi: 10.1016/j.cell.2017.12.024
[129]	Vignuzzi M, Stone JK, Arnold JJ, et al. (2006) Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population. Nature 439: 344-348. doi: 10.1038/nature04388
[130]	Korboukh VK, Lee CA, Acevedo A, et al. (2014) RNA virus population diversity, an optimum for maximal fitness and virulence. J Biol Chem 289: 29531-29544. doi: 10.1074/jbc.M114.592303
[131]	Briones C, de Vicente A, Molina-París C, et al. (2006) Minority memory genomes can influence the evolution of HIV-1 quasispecies in vivo. Gene 384: 129-138. doi: 10.1016/j.gene.2006.07.037
[132]	Briones C, Domingo E (2008) Minority report: hidden memory genomes in HIV-1 quasispecies and possible clinical implications. AIDS Rev 10: 93-109.
[133]	Bordería AV, Stapleford KA, Vignuzzi M (2011) RNA virus population diversity: implications for inter-species transmission. Curr Opin Virol 1: 643-648. doi: 10.1016/j.coviro.2011.09.012
[134]	Grande-Pérez A, Lázaro E, Lowenstein P, et al. (2005) Suppression of viral infectivity through lethal defection. Proc Natl Acad Sci USA 102: 4448-4452. doi: 10.1073/pnas.0408871102
[135]	Domingo E, Gomez J (2007) Quasispecies and its impact on viral hepatitis. Virus Res 127: 131-150. doi: 10.1016/j.virusres.2007.02.001
[136]	Ojosnegros S, Perales C, Mas A, et al. (2011) Quasispecies as a matter of fact: viruses and beyond. Virus Res 162: 203-215. doi: 10.1016/j.virusres.2011.09.018
[137]	Lauring AS, Andino R (2010) Quasispecies theory and the behavior of RNA viruses. PLoS Pathog 6: e1001005. doi: 10.1371/journal.ppat.1001005
[138]	Arbiza J, Mirazo S, Fort H (2010) Viral quasispecies profiles as the result of the interplay of competition and cooperation. BMC Evol Biol 10: 137. doi: 10.1186/1471-2148-10-137
[139]	Brooks K, Jones BR, Dilernia DA, et al. (2020) HIV-1 variants are archived throughout infection and persist in the reservoir. PLoS Pathog 16: e1008378. doi: 10.1371/journal.ppat.1008378
[140]	Levin SR, Gandon S, West SA (2020) The social coevolution hypothesis for the origin of enzymatic cooperation. Nat Ecol Evol 4: 132-137. doi: 10.1038/s41559-019-1039-3
[141]	Atkins JF, Loughran G, Bhatt PR, et al. (2016) Ribosomal frameshifting and transcriptional slippage: From genetic steganog- raphy and cryptography to adventitious use. Nucleic Acids Res 44: 7007-7078.
[142]	Urtel GC, Rind T, Braun D (2017) Reversible Switching of Cooperating Replicators. Phys Rev Lett 118: 078102. doi: 10.1103/PhysRevLett.118.078102
[143]	Bastet L, Turcotte P, Wade JT, et al. (2018) Maestro of regulation: Riboswitches orchestrate gene expression at the levels of translation, transcription and mRNA decay. RNA Biol 15: 679-682.
[144]	Gesteland RF, Cech TR, Atkins JF (2005) The RNA World New York: Cold Spring Harbor Laboratory Press.
[145]	Altman S Rna008-BioTheory (2013) .Available from: http://tbio.molpit.ru/main-results/rna-world/rna008.
[146]	Holmes EC (2013) What can we predict about viral evolution and emergence? Curr Opin Virol 3: 180-184. doi: 10.1016/j.coviro.2012.12.003
[147]	Wildschutte JH, Williams ZH, Montesion M, et al. (2016) Discovery of unfixed endogenous retrovirus insertions in diverse human populations. Proc Natl Acad Sci USA 113: E2326-2334. doi: 10.1073/pnas.1602336113
[148]	Randau L, Söll D (2008) Transfer RNA genes in pieces. EMBO Rep 9: 623-628. doi: 10.1038/embor.2008.101
[149]	Tamura K (2015) Origins and early evolution of the tRNA molecule. Life 5: 1687-1699. doi: 10.3390/life5041687
[150]	DeFarias ST, Rêgo TG, José MV (2019) Origin of the 16S ribosomal molecule from ancestor tRNAs. Sciences 1: 8.
[151]	Daly T, Chen XS, Penny D (2011) How old are RNA networks? Adv Exp Med Biol 722: 255-273. doi: 10.1007/978-1-4614-0332-6_17
[152]	Frenkel-Pinter M, Haynes JW, Mohyeldin AM, et al. (2020) Mutually stabilizing interactions between proto-peptides and RNA. Nat Commun 11: 3137. doi: 10.1038/s41467-020-16891-5
[153]	Yarus M (2011) The meaning of a minuscule ribozyme. Philos Trans R Soc Lond B Biol Sci 366: 2902-2909. doi: 10.1098/rstb.2011.0139
[154]	Martin LL, Unrau PJ, Müller UF (2015) RNA synthesis by in vitro selected ribozymes for recreating an RNA world. Life 5: 247-268. doi: 10.3390/life5010247
[155]	Smit S, Yarus M, Knight R (2006) Natural selection is not required to explain universal compositional patterns in rRNA secondary structure categories. RNA 12: 1-14. doi: 10.1261/rna.2183806
[156]	Yarus M (2011) Getting past the RNA world: the initial Darwinian ancestor. Cold Spring Harb Perspect Biol 3: a003590. doi: 10.1101/cshperspect.a003590
[157]	Vitas M, Dobovišek A (2018) In the Beginning was a Mutualism - On the Origin of Translation. Orig Life Evol Biosph 48: 223-243. doi: 10.1007/s11084-018-9557-6
[158]	Müller S, Appel B, Krellenberg T, et al. (2012) The many faces of the hairpin ribozyme: structural and functional variants of a small catalytic RNA. IUBMB Life 64: 36-47. doi: 10.1002/iub.575
[159]	Cheng LK, Unrau PJ (2010) Closing the circle: replicating RNA with RNA. Cold Spring Harb Perspect Biol 2: a002204. doi: 10.1101/cshperspect.a002204
[160]	Gwiazda S, Salomon K, Appel B, et al. (2012) RNA self-ligation: from oligonucleotides to full length ribozymes. Biochimie 94: 1457-1463. doi: 10.1016/j.biochi.2012.03.015
[161]	Behrouzi R, Roh JH, Kilburn D, et al. (2012) Cooperative tertiary interaction network guides RNA folding. Cell 149: 348-357. doi: 10.1016/j.cell.2012.01.057
[162]	Ferré-D'Amaré AR, Scott WG (2010) Small self-cleaving ribozymes. Cold Spring Harb Perspect Biol 2: a003574.
[163]	Breaker RR (2018) Riboswitches and Translation Control. Cold Spring Harb Perspect Biol 10: a032797. doi: 10.1101/cshperspect.a032797
[164]	Matzke MA, Mosher RA (2014) RNA-directed DNA methylation: an epigenetic pathway of increasing complexity. Nat Rev Genet 15: 394-408. doi: 10.1038/nrg3683
[165]	Perreault J, Weinberg Z, Roth A, et al. (2011) Identification of hammerhead ribozymes in all domains of life reveals novel structural variations. PLoS Comput Biol 7: e1002031. doi: 10.1371/journal.pcbi.1002031
[166]	Kumar RM, Joyce GF (2003) A modular, bifunctional RNA that integrates itself into a target RNA. Proc Natl Acad Sci USA 100: 9738-9743. doi: 10.1073/pnas.1334190100
[167]	Maizels N, Weiner AM (1994) Phylogeny from function: evidence from the molecular fossil record that tRNA originated in replication, not translation. Proc Natl Acad Sci USA 91: 6729-6734. doi: 10.1073/pnas.91.15.6729
[168]	de Farias ST, José MV (2020) Transfer RNA: The molecular demiurge in the origin of biological systems. Prog Biophys Mol Biol 53: 28-34. doi: 10.1016/j.pbiomolbio.2020.02.006
[169]	Demongeot J, Seligmann H (2019) More pieces of ancient than recent theoretical minimal proto-tRNA-like RNA rings in genes coding for tRNA synthetases. J Mol Evol 87: 152-174. doi: 10.1007/s00239-019-09892-6
[170]	Root-Bernstein R (2012) A modular hierarchy-based theory of the chemical origins of life based on molecular complementarity. Acc Chem Res 45: 2169-2177. doi: 10.1021/ar200209k
[171]	Dick TP, Schamel WA (1995) Molecular evolution of transfer RNA from two precursor hairpins: implications for the origin of protein synthesis. J Mol Evol 41: 1-9.
[172]	Belfort M, Weiner A (1997) Another bridge between kingdoms: tRNA splicing in archaea and eukaryotes. Cell 89: 1003-1006. doi: 10.1016/S0092-8674(00)80287-1
[173]	Lambowitz AM, Zimmerly S (2011) Group II introns: mobile ribozymes that invade DNA. Cold Spring Harb Perspect Biol 3: a003616. doi: 10.1101/cshperspect.a003616
[174]	Fujishima K, Kanai A (2014) tRNA gene diversity in the three domains of life. Front Genet 5: 142. doi: 10.3389/fgene.2014.00142
[175]	Rodin AS, Szathmáry E, Rodin SN (2011) On origin of genetic code and tRNA before translation. Biol Direct 6: 14. doi: 10.1186/1745-6150-6-14
[176]	Wolf YI, Koonin EV (2007) On the origin of the translation system and the genetic code in the RNA world by means of natural selection, exaptation, and subfunctionalization. Biol Direct 2: 14. doi: 10.1186/1745-6150-2-14
[177]	Sun FJ, Caetano-Anollés G (2008) Transfer RNA and the origins of diversified life. Sci Prog 91: 265-284. doi: 10.3184/003685008X360650
[178]	Petrov AS, Gulen B, Norris AM, et al. (2015) History of the ribosome and the origin of translation. Proc Natl Acad Sci USA 112: 15396-15401. doi: 10.1073/pnas.1509761112
[179]	Di Giulio M (2013) A polyphyletic model for the origin of tRNAs has more support than a monophyletic model. J Theor Biol 318: 124-128. doi: 10.1016/j.jtbi.2012.11.012
[180]	Liow LH, Van Valen L, Stenseth NC (2011) Red Queen: from populations to taxa and communities. Trends Ecol Evol 26: 349-358. doi: 10.1016/j.tree.2011.03.016
[181]	Hayne CK, Schmidt CA, Haque MI, et al. (2020) Reconstitution of the human tRNA splicing endonuclease complex: insight into the regulation of pre-tRNA cleavage. Nucleic Acids Res 48: 7609-7622. doi: 10.1093/nar/gkaa438
[182]	Bowman JC, Petrov AS, Frenkel-Pinter M, et al. (2020) Root of the Tree: The Significance, Evolution, and Origins of the Ribosome. Chem Rev 120: 4848-4878. doi: 10.1021/acs.chemrev.9b00742
[183]	Root-Bernstein R, Root-Bernstein M (2019) The Ribosome as a Missing Link in Prebiotic Evolution III: Over-Representation of tRNA- and rRNA-Like Sequences and Plieofunctionality of Ribosome-Related Molecules Argues for the Evolution of Primitive Genomes from Ribosomal RNA Modules. Int J Mol Sci 20: 140. doi: 10.3390/ijms20010140
[184]	Arenas CD, Lehman N (2010) The continuous evolution in vitro technique. Curr Protoc Nucleic Acid Chem Chapter 9: 9.7.1-9.7.17.
[185]	Dadon Z, Samiappan M, Wagner N, et al. (2012) Chemical and light triggering of peptide networks under partial thermodynamic control. Chem Commun (Camb) 48: 1419-1421. doi: 10.1039/C1CC14301H
[186]	Ameta S, Arsène S, Foulon S, et al. (2019) Darwinian properties and their trade-offs in autocatalytic RNA reaction networks. bioRxiv Available from: https://doi.org/10.1101/726497.
[187]	Hordijk W, Steel M (2012) Predicting template-based catalysis rates in a simple catalytic reaction model. J Theor Biol 295: 132-138. doi: 10.1016/j.jtbi.2011.11.024
[188]	Jalasvuori M, Bamford JK (2008) Structural co-evolution of viruses and cells in the primordial world. Orig Life Evol Biosph 38: 165-181. doi: 10.1007/s11084-008-9121-x
[189]	Kun Á, Szilágyi A, Könnyű B, et al. (2015) The dynamics of the RNA world: insights and challenges. Ann NY Acad Sci 1341: 75-95. doi: 10.1111/nyas.12700
[190]	Rieckmann JC, Geiger R, Hornburg D, et al. (2017) Social network architecture of human immune cells unveiled by quantitative proteomics. Nat Immunol 18: 583-593. doi: 10.1038/ni.3693
[191]	Caetano-Anollés G, Sun FJ (2014) The natural history of transfer RNA and its interactions with the ribosome. Front Genet 5: 127.
[192]	Thornlow BP, Armstrong J, Holmes AD, et al. (2020) Predicting transfer RNA gene activity from sequence and genome context. Genome Res 30: 85-94. doi: 10.1101/gr.256164.119
[193]	Withers M, Wernisch L, dos Reis M (2006) Archaeology and evolution of transfer RNA genes in the Escherichia coli genome. RNA 12: 933-942. doi: 10.1261/rna.2272306
[194]	Ariza-Mateos A, Briones C, Perales C, et al. (2019) The archaeology of coding RNA. Ann NY Acad Sci 1447: 119-134. doi: 10.1111/nyas.14173
[195]	Stevenson DS (2002) Co-evolution of the genetic code and ribozyme replication. J Theor Biol 217: 235-253. doi: 10.1006/jtbi.2002.3013
[196]	Witzany G (2012) From Molecular Entities to Competent Agents: Viral Infection-Derived Consortia Act as Natural Genetic Engineers. Viruses: Essential Agents of Life Dordrecht: Springer, 407-419. doi: 10.1007/978-94-007-4899-6_20
[197]	Hesselberth JR (2013) Lives that introns lead after splicing. Wiley Interdiscip Rev RNA 4: 677-691. doi: 10.1002/wrna.1187
[198]	Belfort M (2017) Mobile self-splicing introns and inteins as environmental sensors. Curr Opin Microbiol 38: 51-58. doi: 10.1016/j.mib.2017.04.003
[199]	Lennon CW, Belfort M (2017) Inteins. Curr Biol 27: 204-206. doi: 10.1016/j.cub.2017.01.016
[200]	Villareal LP (2015) Virolution Can Help Us Understand the Origin of Life. Astrobiology. An Evolutionary Approach Boca Raton: CRC Press, 421-440.
[201]	Villarreal LP, Witzany G (2021) Infectious thoughts: discovering biology as a social science. PrePrint .
[202]	Vandevenne M, Delmarcelle M, Galleni M (2019) RNA Regulatory Networks as a Control of Stochasticity in Biological Systems. Front Genet 10: 403. doi: 10.3389/fgene.2019.00403
[203]	Lozada-Chávez I, Stadler PF, Prohaska SJ (2011) Hypothesis for the modern RNA world: a pervasive non-coding RNA-based genetic regulation is a prerequisite for the emergence of multicellular complexity. Orig Life Evol Biosph 41: 587-607. doi: 10.1007/s11084-011-9262-1

This article has been cited by:

1.	Yaniasih Yaniasih, Aris Yaman, Siska Pebiana, 2024, Assessment of Hyperparameter Optimization Techniques for Cross-Stitched Multi-Task Learning, 979-8-3315-4231-3, 179, 10.1109/IC3INA64086.2024.10732471
2.	Ruiyao Yang, 2024, Privacy-Enhanced Algorithms for E-commerce Fraud Detection, 9798400709999, 124, 10.1145/3708036.3708057

Reader Comments

Your name:*

Email:*
© 2021 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)

通讯作者: 陈斌, bchen63@163.com

1.
沈阳化工大学材料科学与工程学院沈阳 110142

AIMS Microbiology

2.7 7.0

Metrics

Article views(3325) PDF downloads(140) Cited by(9)

Preview PDF

Download XML

Export Citation

Article outline

Show full outline

AIMS Microbiology

Social Networking of Quasi-Species Consortia drive Virolution via Persistence

Abstract

1. Introduction

2. Classical optimization methods

2.1. Simple search scheme

2.1.1. Gradient-based search (GradS)

2.1.2. Grid search (GS)

2.1.3. Random search (RS)

2.2. Bayesian optimization (BO) with surrogates and acquisitions

2.2.1. Surrogate models

2.2.2. Acquisition functions

2.2.3. Recent BOs

2.3. Metaheuristic search

2.3.1. Evolutionary algorithms (EA)

2.3.2. Swarm intelligence (SI)

2.4. Summary

3. Acceleration techniques

3.1. Multi-fidelity optimization

3.1.1. Multi-fidelity / multi-source

3.1.2. Multitask / warm start

3.2. Bandit-based algorithm

3.3. Early stop

3.3.1. Curve estimation

3.3.2. Other criteria

3.4. Summary

4. Dynamic algorithm configuration

4.1. Gradient-based optimizers

4.2. Population-based algorithms

4.3. Reinforcement learning methods

4.4. Remarks

5. Multi-objective hyperparameter optimization

5.1. Scalarization

5.2. Multi-objective metaheuristics

5.3. Multi-objective model-based optimization

5.4. Remarks

6. Tools, applications, and further discussion

6.1. Public frameworks and tools

6.2. HPO applications

6.3. Further discussion

7. Conclusion and outlook

Use of AI tools declaration

Acknowledgment

Conflict of interest

References

This article has been cited by:

Reader Comments

通讯作者: 陈斌, bchen63@163.com

Metrics

Other Articles By Authors

Related pages

Tools

Export File

Citation

Format

Content

Catalog