In the last decades, several bootstrap procedures, including both parametric and non-parametric resampling schemes, have been proposed to infer properties of a statistic of interest in the context of time series. However, the effectiveness of the different bootstrap procedures depends on their ability to capture the dependence structure of the underlying stochastic process, and on the analytical properties of the particular statistic considered.
When it is reasonable to impose parametric assumptions on the process, the residual bootstrap is usually a sensible choice. In this general bootstrap scheme, the dependence structure of the series is modelled explicitly by using a parametric model, and the bootstrap sample is drawn from the fitted model. This approach can be easily implemented and leads to very efficient results, but it is very sensitive to model misspecification, in which case the resulting bootstrap estimators might not be consistent. In this latter case, or when it is possible to account only for some mixing or weak dependence structure, more complex and fully non-parametric bootstrap methods are required. In this context, sieve bootstrap schemes have become useful and powerful tools to capture the dependence structure of the data without imposing any rigid parametric model specification.
The basic idea of this approach is to use non-parametric estimators as sieve approximators. In particular, the stochastic process under study is approximated by a family of (semi-)parametric models which, in a proper sense, contains the original process. Once an appropriate model selection rule has been fixed, a model is picked from the identified family and estimated on the observed dataset. Residual bootstrap is then implemented on the previously estimated model. In the context of sieve bootstrap schemes, the most widely used approach is the AR-Sieve bootstrap procedure [1,2,3]. It is based on the method of autoregressive process sieves, in which an AR($p_T$) model is fitted to the observed data and a bootstrap sample is generated by resampling from the centred residuals. This resampling scheme retains the simplicity of the classical residual bootstrap while being nonparametric, and it enjoys the properties of a plug-in rule. Moreover, it does not exhibit artefacts in the dependence structure, unlike the blockwise bootstrap, and there is no need to 'pre-vectorize' the original observations. For these reasons, the AR-Sieve bootstrap has been widely used in the literature for constructing prediction intervals for linear processes [4,5,6,7,8,9], for unit root testing [10] and stationarity testing [11], and for fractionally integrated and non-invertible processes [12,13]. More recently, it has also been used in the context of functional time series [14] and spatial processes [15].
However, the AR-Sieve bootstrap performs better than other bootstrap techniques only when the data generating process is linear and representable as an AR(∞) process. For more general processes, it is expected to deliver consistent results only if the asymptotic distribution of the given statistic is determined solely by the first and second order moment structure [16], or if the original time series is transformed and a more complex residual bootstrap scheme is applied [17].
For general nonlinear processes, an approach based on the use of feedforward Neural Networks (NNs) as sieve approximators has been proposed [18]. The NN-Sieve resampling scheme, which is non-parametric in spirit, retains the conceptual simplicity of the AR-Sieve bootstrap and delivers consistent results for quite general nonlinear processes. Moreover, it performs similarly to the AR-Sieve bootstrap for linear processes, while it outperforms both the AR-Sieve bootstrap and the moving block bootstrap for nonlinear processes, in terms of both bias and variability [19].
However, despite their proven theoretical capabilities of non-parametric, data-driven universal approximation of general nonlinear functions, NNs face challenging issues concerning the computational burden, which can be very heavy, especially for complex nonlinear generating processes. This may be a serious concern, even with the computational power available nowadays, when computer intensive model selection techniques (such as cross-validation) are used to tune the neural network hyper-parameters (such as hidden layer size, weight decay, etc.) within a resampling scheme involving neural networks. Moreover, in many applications of the bootstrap, the statistical model needs to be re-estimated on each bootstrap resample in order to incorporate the uncertainty due to model estimation; this makes the bootstrap unfeasible when applied to complex estimation procedures. Finally, many statistical problems can be solved in practice only by using the iterated bootstrap, which of course requires very efficient estimation procedures.
To overcome these problems, our proposal is to estimate the neural network model in the NN-Sieve procedure by using learning without iterative tuning. This approach, known as Extreme Learning Machines (ELM) in the computational intelligence and machine learning literature [20,21], has been extensively studied, and remarkable contributions have been made both in theory and in applications. By using ELMs, a nonlinear autoregressive sieve bootstrap scheme can be implemented for general nonlinear time series. This scheme has the advantage of dramatically reducing the computational burden of the overall procedure; moreover, its performance is comparable to the NN-Sieve bootstrap and its computing time is comparable to the AR-Sieve bootstrap.
The paper is organized as follows. In section 2, a brief review of neural networks for time series is given in the context of Nonlinear Autoregressive (NAR) time series. In section 3, the extreme learning machine approach is presented and discussed, highlighting the advantages of its use with respect to the classical neural network approach. In section 4, the NAR-Sieve bootstrap based on ELMs is proposed, emphasizing its improvements with respect to alternative bootstrap schemes. In section 5, a simulation experiment is performed in order to evaluate the performance of the proposed bootstrap procedure and to compare it with the NN approach. Some remarks close the paper.
Let {Yt,t∈Z} be a real valued stochastic process modeled as a nonlinear autoregressive process with exogenous components:
$$ Y_t = m(Y_{t-1}, \ldots, Y_{t-p}, X_t) + \varepsilon_t \qquad (2.1) $$
where $m(\cdot)$ is an unknown (possibly nonlinear) function, the $\varepsilon_t$ are i.i.d. innovations with mean 0 and finite variance, and $X_t$ is a $d$-dimensional stochastic process of additional explanatory variables, useful in predicting $Y_t$ or other functionals related to $Y_t$.
Denoting by $I_t$ the information set available at time $t$ and writing $Z_t = (Y_{t-1}, \ldots, Y_{t-p}, X_t)$, the conditional expectation of $Y_t$ given $I_t$ is:
$$ E(Y_t \mid I_t) = m(Z_t) \qquad (2.2) $$
There are many different parametric approaches to modelling the function m, and they can give quite different answers in the range of perhaps most interest to practitioners. In many cases, practitioners are not willing to assume any parametric form (avoiding, in this way, model misspecification errors), and this motivates a nonparametric approach because of the greater flexibility in functional form thereby allowed. For models with a simple lag structure and without exogenous variables, the problem has been thoroughly investigated using local smoothers, as in [22].
However, the use of local smoothers, due to the so-called "curse of dimensionality", does not allow complex lag structures or the inclusion of other explanatory variables. These issues led Franke and Diagne [23] to propose an alternative approach based on feedforward neural networks with one hidden layer. The function m can be approximated by using neural networks with a single output and additive nodes in the class:
$$ \mathcal{F} = \{ f(z, \eta) : z \in \mathbb{R}^{p+d},\ \eta \in \mathbb{R}^{r(p+d+2)} \} \qquad (2.3) $$
with:
$$ f_r(z, \eta) = \sum_{k=1}^{r} \beta_k \, \psi(a_k' z + b_k) \qquad (2.4) $$
where $r$ is the hidden layer size; $\psi(\cdot)$ is a sigmoidal activation function; $\eta = (\beta_1, \ldots, \beta_r, a_1', a_2', \ldots, a_r', b_1, \ldots, b_r)'$; $\{a_k\}$ are the $(p+d)$-dimensional vectors of weights for the connections between the input layer and the hidden layer; $\{\beta_k\}$ are the weights of the links between the hidden layer and the output; $\{b_k\}$ are the bias terms of the hidden neurons.
Suppose that the processes $Y_t$ and $X_t$ are observed for $T$ consecutive time periods, generating the time series $y_1, y_2, \ldots, y_T$ and $x_1, x_2, \ldots, x_T$ along with appropriate initial conditions. Let $u_t = (y_t, x_t)$ and $z_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-p}, x_t)$. A consistent estimate of the regression function $m(\cdot)$ can be obtained as:
$$ \hat{m} = \arg\min_{f \in \mathcal{F}} \| f(z_t, \eta) - y \| \qquad (2.5) $$
where $\|\cdot\|$ denotes the $L_2$-norm. Alternatively, to improve the stability of the network solution, a regularized version of the optimization problem (2.5) can be used:
$$ \hat{m} = \arg\min_{f \in \mathcal{F}} \| f(z_t, \eta) - y \| + \lambda \|\eta\| \qquad (2.6) $$
where the tuning parameter λ is usually fixed by cross-validation.
Neural networks provide an arbitrarily accurate approximation to any unknown target function satisfying certain smoothness conditions. In particular, under quite general conditions on the activation function $\psi$, there exists a sequence of network functions $\{f_r\}$ approximating any given continuous target function m to within any expected learning error. A deterministic approximation rate (in $L_2$-norm) of $r_T^{-1/2}$ for NNs with sigmoid activation functions has been obtained in [24]; for better rates see [25,26,27]. If the network model is fitted to the data in such a way that the complexity of the network is allowed to increase at a proper rate with the sample size, the resulting function estimator can be viewed as a nonparametric sieve estimator [28,29]. Moreover, estimation of the hidden layer size ($r_T$) seems to be less critical than estimation of the window size in local nonparametric approaches. Finally, the extension of the sieve bootstrap procedure to high dimensional models is (much) more straightforward than for other nonparametric approaches (absence of the 'curse of dimensionality').
However, despite their proven theoretical capabilities of non-parametric, data-driven universal approximation of a general class of nonlinear functions, NNs face other challenging issues. Firstly, the use of NNs requires the specification of the network topology in accordance with the underlying structure of the series. This involves the specification of the size and structure of the input layer, the size of the hidden layer, and the signal processing within nodes (i.e., the choice of the activation function). Furthermore, in the context of time series analysis, the characteristics of the series and the presence of deterministic and/or stochastic components such as trend, seasonality, structural breaks and level shifts also require accurate feature selection. Many strategies have been proposed to solve these problems (for example, [30,31]), but the difficulty of finding a unique method able to automatically identify the optimal NN remains an open issue.
Moreover, once the neural network architecture is fixed, the parameters can be estimated by using the backpropagation algorithm, which is essentially a first order gradient method for parameter optimization and suffers from slow convergence and the local minima problem. In this context too, various ways to improve the efficiency or optimality of neural network training have been proposed, including second order optimization methods [32], subset selection methods [33] and global optimization methods [34]. Although these methods lead to faster training and, in general, better generalization performance compared to backpropagation, most of them still cannot guarantee a globally optimal solution.
Recently, the extreme learning machine (ELM) approach for training NNs has attracted attention in the literature [35,36,37] as a possible way to overcome some of the challenges faced by the other techniques.
The essence of ELMs is that, unlike traditional learning algorithms such as backpropagation-based neural networks, the hidden node weights are randomly generated and need not be tuned, so that the algorithm determines the output weights of the NN analytically.
Basically, an ELM trains a NN in two main stages. In the first, the ELM randomly initializes the hidden layer to map the input data into a feature space through some nonlinear functions. As in NN theory, these can be any nonlinear piecewise continuous functions, such as the sigmoid or hyperbolic tangent functions. The hidden node parameters $(a, b)$ are randomly generated according to any continuous probability distribution, so that the matrix:
$$ H = \begin{bmatrix} \psi(a_1' z_1 + b_1) & \cdots & \psi(a_r' z_1 + b_r) \\ \vdots & \ddots & \vdots \\ \psi(a_1' z_T + b_1) & \cdots & \psi(a_r' z_T + b_r) \end{bmatrix} \qquad (3.1) $$
is completely known. In the second stage, the output weights β are estimated by solving the following minimization problem:
$$ \hat{\beta} = \arg\min_{\beta} \| H\beta - y \| \qquad (3.2) $$
where y is the training data target vector and ‖⋅‖ denotes the L2-norm.
If $H^{\dagger}$ denotes the Moore-Penrose generalized inverse of the matrix $H$, the optimal solution to the previous optimization problem is:
$$ \hat{\beta} = H^{\dagger} y \qquad (3.3) $$
The matrix $H^{\dagger}$ can be calculated by using one of the numerous methods proposed in the literature, which include orthogonal projection, orthogonalization methods, iterative methods and the singular value decomposition, the last being the most general.
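To make the two-stage procedure concrete, the following sketch (in Python with NumPy, purely for illustration; the paper's own implementation is in R, and the function names here are hypothetical) draws the hidden weights at random and obtains the output weights through the SVD-based Moore-Penrose pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(Z, y, r):
    """Stage 1: draw hidden weights at random (never tuned).
    Stage 2: output weights by least squares via the pseudoinverse."""
    d = Z.shape[1]
    A = rng.normal(size=(d, r))                # hidden weight vectors a_k
    b = rng.normal(size=r)                     # hidden biases b_k
    H = 1.0 / (1.0 + np.exp(-(Z @ A + b)))     # sigmoid hidden output matrix, Eq. (3.1)
    beta = np.linalg.pinv(H) @ y               # beta_hat = H^+ y, Eq. (3.3), SVD-based
    return A, b, beta

def elm_predict(Z, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(Z @ A + b)))
    return H @ beta

# toy check on a smooth nonlinear regression function
Z = rng.uniform(-2, 2, size=(300, 1))
y = np.sin(Z[:, 0]) + 0.05 * rng.standard_normal(300)
A, b, beta = elm_fit(Z, y, r=25)
mse = np.mean((elm_predict(Z, A, b, beta) - y) ** 2)
```

Note that the only data-dependent computation is the single least-squares solve; no gradient iterations or random restarts are needed.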
The estimation of the parameter vector β can also be obtained via regularized ELM [38]. If H has more rows than columns (T>r), which is usually the case when the number of training data is larger than the number of hidden neurons, the following closed form solution can be obtained:
$$ \hat{\beta} = \left( H'H + \frac{I}{C} \right)^{-1} H' y \qquad (3.4) $$
where $I$ is an identity matrix of dimension $r$ and $C$ is a properly chosen regularization constant.
If the number of training data is less than the number of hidden neurons (T<r), an estimate for β can be obtained as:
$$ \hat{\beta} = H' \left( HH' + \frac{I}{C} \right)^{-1} y $$
where I is an identity matrix of dimension T this time.
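The two closed forms are the same ridge estimator: by the push-through identity $(H'H + I/C)^{-1} H' = H' (HH' + I/C)^{-1}$, they coincide whenever both are computable, and the branch merely selects the smaller linear system. A minimal NumPy sketch (illustrative, hypothetical helper name):

```python
import numpy as np

rng = np.random.default_rng(1)

def elm_ridge_beta(H, y, C):
    """Regularized ELM output weights: solve whichever system is smaller."""
    T, r = H.shape
    if T > r:
        # Eq. (3.4): an r x r system when observations outnumber hidden neurons
        return np.linalg.solve(H.T @ H + np.eye(r) / C, H.T @ y)
    # otherwise a T x T system; same estimator by the push-through identity
    return H.T @ np.linalg.solve(H @ H.T + np.eye(T) / C, y)

# the two closed forms agree whenever both are computable
H = rng.standard_normal((50, 20))
y = rng.standard_normal(50)
C = 10.0
b_tall = np.linalg.solve(H.T @ H + np.eye(20) / C, H.T @ y)
b_wide = H.T @ np.linalg.solve(H @ H.T + np.eye(50) / C, y)
agree = bool(np.allclose(b_tall, b_wide))
```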
It can be shown that Eq. 3.4 actually aims at minimizing:
$$ \hat{\beta} = \arg\min_{\beta} \| H\beta - y \| + \frac{1}{C} \|\beta\| $$
Compared to the standard ELM, whose target is to minimize $\|H\beta - y\|$, an extra penalty term $\frac{1}{C}\|\beta\|$ is added to the objective. This is consistent with the theory that smaller output weights $\beta$ play an important role in achieving better generalization ability for ELMs.
The ELM approach has several advantages. Firstly, it has good generalization performance, in the sense that it reaches a small training error and, at the same time, the smallest norm of the output weights. Secondly, learning can be done without iteratively tuning the hidden nodes, which can be generated independently of the training data. Moreover, ELMs, like NNs, enjoy the property of being universal approximators. Given any non-constant piecewise continuous function ψ, if
$$ \mathrm{span}\{ \psi(a, b, z) : (a, b) \in \mathbb{R}^{p+d} \times \mathbb{R} \} $$
is dense in $L_2$, then for any continuous target function m, any sequence $\psi(a_k' z + b_k)$, $k = 1, \ldots, r$, randomly generated according to any continuous sampling distribution, and output weights $\hat{\beta}_k$ determined by ordinary least squares to minimize:
$$ \left\| m(z) - \sum_{k=1}^{r} \hat{\beta}_k \, \psi(a_k' z + b_k) \right\| $$
it can be shown [39,40,41] that, with probability one,
$$ \lim_{r \to \infty} \| m - f_r \| = 0 \qquad (3.5) $$
This result states the universal approximation capability of ELMs without imposing any restrictive assumption on the activation function, unlike the NN paradigm, where a continuous and differentiable activation function is needed. In practice, since the hidden layer is randomly generated, ELMs usually require more hidden neurons than NNs to reach a given performance. However, this does not seem to be a serious problem, given the computational efficiency of ELMs. Moreover, ELMs are well suited for large-scale data processing and, even when a model selection process is implemented to search for an optimal structure, the running time of ELMs is always lower than that of competing strategies. In any case, parallel and cloud computing techniques [42] can also be used for even faster implementations of ELMs.
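The limit (3.5) can be checked empirically: with randomly drawn sigmoid features and least-squares output weights, the in-sample approximation error of a smooth target shrinks as the hidden layer grows. A small NumPy illustration (the target function and sizes are arbitrary choices for this sketch):

```python
import numpy as np

rng = np.random.default_rng(4)

z = np.linspace(-3, 3, 400)[:, None]
m = np.sin(2 * z[:, 0])              # a continuous target function

def elm_rmse(r):
    """In-sample error of an ELM with r random sigmoid hidden nodes."""
    A = rng.standard_normal((1, r))  # random, untuned hidden weights
    b = rng.standard_normal(r)
    H = 1.0 / (1.0 + np.exp(-(z @ A + b)))
    beta = np.linalg.pinv(H) @ m     # least-squares output weights
    return float(np.sqrt(np.mean((H @ beta - m) ** 2)))

# error decreases markedly as the hidden layer grows
err_small, err_large = elm_rmse(5), elm_rmse(200)
```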
Given a general real valued stochastic process $\{Y_t, t \in \mathbb{Z}\} \sim M_0$ modeled as in Eq. (2.1), the basic idea of a sieve bootstrap scheme is to approximate the process by a family of (semi-)parametric models $\mathcal{M} = \{M_j, j \in \mathbb{N}\}$ such that $\cup_{j=1}^{\infty} M_j$ contains (in some sense) the original process $M_0$. Once a proper model selection rule is fixed, a model is picked from the set $\mathcal{M}$ and estimated. Residual bootstrap is then based on the previously estimated model. Hence, a key issue is the selection of a proper model family.
Algorithm 1. The NAR-Sieve bootstrap scheme.
1: Fix $B$, the number of bootstrap runs.
2: Consider the time series $\{u_t, t = 1, 2, \ldots, T\}$ with $u_t = (y_t, x_t)$.
3: Let $z_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-p}, x_t)$, $t = p+1, \ldots, T$.
4: Fix $n_1$, the number of initial observations to be discarded in order to make the effect of the starting values negligible.
5: Let $N = n_1 + T$.
6: Estimate $m(\cdot)$ by using an ELM, obtaining $\hat{m}(\cdot)$.
7: Compute the residuals $\hat{\varepsilon}_t = y_t - \hat{m}(z_t)$.
8: Compute the centered residuals $\tilde{\varepsilon}_t = \hat{\varepsilon}_t - (T-p)^{-1} \sum_{t=p+1}^{T} \hat{\varepsilon}_t$.
9: Denote the empirical distribution function of the centered residuals by $F_{\tilde{\varepsilon}}(x) = (T-p)^{-1} \sum_{t=p+1}^{T} I(\tilde{\varepsilon}_t \le x)$, where $I(\cdot)$ is the indicator function.
10: for $b = 1, 2, \ldots, B$ do
11: Resample, for $t = p+1, p+2, \ldots, N$, $\varepsilon^*_{b,t} \overset{iid}{\sim} F_{\tilde{\varepsilon}}$.
12: Fix $y^*_{b,t} = \bar{y}$ for $t = 1, \ldots, p$ and define $z^*_{b,t} = (y^*_{b,t-1}, \ldots, y^*_{b,t-p}, x_t)$.
13: Define $y^*_{b,t}$ by the recursion $y^*_{b,t} = \hat{m}(z^*_{b,t}) + \varepsilon^*_{b,t}$, $t = p+1, \ldots, N$.
14: Compute $\hat{\theta}^*_{b,T} = q(u^*_{b,1}, u^*_{b,2}, \ldots, u^*_{b,T})$.
15: end for
16: Use the empirical distribution function $\hat{F}^*(x) = B^{-1} \sum_{b=1}^{B} I(\hat{\theta}^*_{b,T} \le x)$ to approximate the unknown sampling distribution of the estimator $\hat{\theta}_T$.
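As an illustration, a condensed Python/NumPy sketch of Algorithm 1 for the purely autoregressive case (no exogenous variables; all names are hypothetical, and the ELM fit is the plain pseudoinverse variant):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nar_sieve_bootstrap(y, p=1, r=20, B=200, n1=50, stat=np.mean):
    """Condensed Algorithm 1 for a purely autoregressive series."""
    T = len(y)
    # lagged design matrix: row for time t holds (y_{t-1}, ..., y_{t-p})
    Z = np.column_stack([y[p - j - 1:T - j - 1] for j in range(p)])
    target = y[p:]
    # step 6: ELM estimate of m via random hidden layer + pseudoinverse
    A = rng.normal(size=(p, r))
    b = rng.normal(size=r)
    H = sigmoid(Z @ A + b)
    beta = np.linalg.pinv(H) @ target
    m_hat = lambda z: sigmoid(z @ A + b) @ beta
    # steps 7-9: centred residuals define the resampling distribution
    eps = target - m_hat(Z)
    eps = eps - eps.mean()
    N = n1 + T
    stats = np.empty(B)
    for i in range(B):                        # steps 10-15
        e_star = rng.choice(eps, size=N, replace=True)
        y_star = np.full(N, y.mean())         # step 12: start from the sample mean
        for t in range(p, N):
            z_star = y_star[t - p:t][::-1][None, :]   # (y*_{t-1}, ..., y*_{t-p})
            y_star[t] = m_hat(z_star)[0] + e_star[t]
        stats[i] = stat(y_star[n1:])          # discard the n1 burn-in values
    return stats

# toy usage: bootstrap distribution of the sample mean of a nonlinear AR(1)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.6 * np.tanh(y[t - 1]) + 0.5 * rng.standard_normal()
boot = nar_sieve_bootstrap(y, p=1, r=10, B=100)
se_hat = boot.std()   # bootstrap estimate of the standard error of the mean
```

Since the hidden layer is fixed once, each bootstrap run only costs one recursion through the series; this is where the computational advantage over a retrained NN comes from.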
To elaborate, let $\{u_t = (y_t, x_t), t = 1, 2, \ldots, T\}$ be the observed time series, let θ be a finite dimensional parameter of interest, and let $\hat{\theta}_T = q(u_1, u_2, \ldots, u_T)$ be a scalar-, vector- or curve-valued estimator, which is a measurable function of the data. Inference on θ can be obtained by using the NN-Sieve bootstrap approach. The procedure, proposed in [18], exploits the good properties of neural network modelling; it has been shown to be asymptotically justified, delivering consistent results for quite general nonlinear models, and it yields satisfactory results for finite sample sizes [19].
Here, we propose to approximate the unknown function m(⋅) with the class of ELMs with fixed input neurons and hidden layer size going to infinity with the time series length at a proper rate. The resampling procedure is detailed in algorithm 1.
The proposed resampling scheme has some advantages which make it effective in many fields of application. Like the NN-Sieve bootstrap, this approach is asymptotically justified and delivers consistent results for quite general nonlinear processes [19]. Moreover, it does not suffer from the so-called 'curse of dimensionality'. Theoretically, ELMs are expected to perform better than other approximation methods, since the approximation form is not so sensitive to increasing dimension, at least within the confines of particular classes of functions.
Like neural networks, ELMs are global nonparametric methods, and their use may stress different features and data structures when compared to local nonparametric methods. With respect to the NN-Sieve bootstrap, this scheme dramatically reduces the computational burden of the overall procedure, with a computing time comparable to the AR-Sieve bootstrap.
In this section, we discuss the results of a simulation experiment performed in order to evaluate the performance of the proposed procedure and to compare it with the NN approach. All computations were implemented in the language R (version 3.6.0) using the package nnet (for feedforward neural networks with regularization) and an ad hoc implementation by the authors for ELMs and ELMs with regularization. The latter implementation is based on the package ridge [43], which also includes the selection of the regularization parameter. All computations were made by using two different workstations: the first running MacOS (version 10.14.4) with a 4 GHz Intel Core i7 and 16 GB 1600 MHz DDR3 memory, using the standard R math library, and the second running Ubuntu Linux (version 18.04) with a 3.50 GHz Intel Xeon E5-1650 and 32 GB ECC DDR4 memory, using OpenBlas.
In the first place, the computational advantage of ELMs is evaluated in terms of computing time of the learning process. In Tables 1 and 2 some descriptive statistics for the computing time for NNs, ELMs and ELMs with regularization are reported for a regression problem with two levels of complexity. The first model has 2 predictors, 2 neurons in the hidden layer and it has been estimated on a sample of 300 observations. In this case, the median execution time for the learning process of a full neural network (with ten random restarts) is 54 times slower with respect to the learning process of ELM with regularization and about 300 times slower with respect to a learning process based on the Moore-Penrose generalized inverse. The second model is much more complex, with 16 predictors, 40 neurons in the hidden layer and it has been estimated on a sample of 2000 observations. In this case, the median execution time for the learning process of a full neural network (with ten random restarts) is about 478 times slower with respect to the learning process of ELM with regularization and about 1575 times slower with respect to a learning process based on the Moore-Penrose generalized inverse. The more complex the model, the higher the advantage of using ELMs. As a remark, note that it is well known that the standard R math library does not deliver the best performance for linear algebra operations. So the figures reported in Tables 1 and 2 could be significantly improved using better Basic Linear Algebra System (BLAS) implementations such as the OpenBlas or Intel's Math Kernel Library (MKL), as evident from the results reported in the following.
Table 1. Descriptive statistics of the computing time for the simpler model (2 predictors, 2 hidden neurons, 300 observations).
Method | min | 1st quart. | mean | median | 3rd quart. | max | runs |
NN | 37.54 | 48.22 | 54.14 | 54.75 | 59.02 | 73.29 | 100 |
Reg | 0.75 | 0.95 | 1.40 | 1.02 | 1.13 | 24.56 | 100 |
ELM | 0.11 | 0.16 | 0.40 | 0.18 | 0.20 | 21.99 | 100 |
Table 2. Descriptive statistics of the computing time for the more complex model (16 predictors, 40 hidden neurons, 2000 observations).
Method | min | 1st quart. | mean | median | 3rd quart. | max | runs |
NN | 15886.80 | 17575.38 | 17906.21 | 17791.85 | 18252.19 | 20371.12 | 100 |
Reg | 32.86 | 35.11 | 41.46 | 37.18 | 43.88 | 256.92 | 100 |
ELM | 8.62 | 10.85 | 11.97 | 11.29 | 11.77 | 22.29 | 100 |
In Figures 1 and 2, the relative execution times of NN, ELM and ELM with regularization are reported for different numbers of input neurons d∈{4,8,12,16}, different hidden layer sizes r=2,3,…,40 and samples of different sizes {300,500,1000,2000}. In all cases the computational gain when using ELMs is very significant, ranging from about 300 times (for the simplest model, with two input neurons and two hidden neurons) to 6,000 times for the most complex model considered (16 input neurons and 40 hidden neurons) when using OpenBlas. Note that, to make the comparison between the OpenBlas implementation and the standard R math library implementation (which is single threaded) fair, we forced OpenBlas to use a single thread. Therefore, even better performance might be expected with a multi-threaded OpenBlas implementation. All neural networks have been estimated by restarting the learning process ten times, to avoid being trapped in local minima (an operation not necessary when using ELMs). In many applications, it is standard to use 50 random restarts, making the advantage of ELMs even larger. As a further remark, note that with the regularized ELM learning process the computational advantage is reduced by a factor of 4 on average; however, this is still an important figure. In the NN learning process, the regularization parameter has not been estimated on each dataset but fixed on the basis of ad hoc choices. The selection of this tuning parameter for NNs is usually based on cross-validation, making the computational burden heavy and, in many applications, unfeasible.
The computational advantage of using ELMs (with or without regularization) in bootstrap schemes is so large that the whole bootstrap distribution of a given statistic of interest can be estimated in basically the same computational time that is needed for a single neural network estimation (considering 1000 or 2000 bootstrap runs to approximate the bootstrap distribution). Moreover, consider that in several applications the model needs to be re-estimated on each bootstrap run. This is the case, for example, when computing bootstrap prediction intervals, where model re-estimation is needed to incorporate the uncertainty due to model estimation. In this latter case, using NNs is almost unfeasible, while using ELMs makes the overall bootstrap procedure possible, with very reasonable computing times.
As a further step in the simulation design, we evaluate and compare the performance of the bootstrap procedure using NNs and ELMs, in order to check whether there is any loss in accuracy in the bootstrap inference process. The experimental setup is based on datasets generated by different nonlinear models with different degrees of nonlinearity.
We consider the class of STAR models as specified in [45]:
$$ Y_t = \phi_1 Y_{t-1} - (\phi_1 - \phi_2) F(Y_{t-1}, \gamma) Y_{t-1} + \varepsilon_t $$
with $\varepsilon_t \sim N(0,1)$. The function F determines different transition processes. If $F(u,\gamma) = 1 - \exp(-\gamma u^2)$ we get the exponential STAR model (ESTAR), while using $F(u,\gamma) = 1/(1 + \exp(-\gamma u))$ we get the logistic smooth transition model (LSTAR). These models, very popular in several applications in different fields, have been chosen as representative of different kinds of dynamic behaviour, since their flexibility allows the generation of quite different time series structures.
The degree of non-linearity in the LSTAR/ESTAR models is controlled by the parameter γ in the transition function. When γ→0, the transition function tends towards 0, and the model will be a simple autoregressive process. When γ→∞, the transition function converges towards unity, which implies that the model is a different autoregressive model with coefficients equal to the mean of the autoregressive parameters of the two regimes.
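For reproducing this experimental setup, the STAR data generating process above can be simulated directly. The sketch below (Python/NumPy, hypothetical function name, with a burn-in to remove the effect of the starting value) implements both transition functions:

```python
import numpy as np

rng = np.random.default_rng(3)

def star_series(T, phi1, phi2, gamma, kind="lstar", burn=200):
    """Simulate Y_t = phi1*Y_{t-1} - (phi1 - phi2)*F(Y_{t-1}, gamma)*Y_{t-1} + eps_t."""
    if kind == "estar":
        F = lambda u: 1.0 - np.exp(-gamma * u ** 2)      # exponential transition
    elif kind == "lstar":
        F = lambda u: 1.0 / (1.0 + np.exp(-gamma * u))   # logistic transition
    else:
        raise ValueError("kind must be 'estar' or 'lstar'")
    y = np.zeros(T + burn)
    eps = rng.standard_normal(T + burn)                  # eps_t ~ N(0, 1)
    for t in range(1, T + burn):
        u = y[t - 1]
        y[t] = phi1 * u - (phi1 - phi2) * F(u) * u + eps[t]
    return y[burn:]

# specification v2 of Table 3: phi1 = 0.6, phi2 = -0.6, gamma = 5
y_lstar = star_series(500, 0.6, -0.6, 5.0, kind="lstar")
y_estar = star_series(500, 0.6, -0.6, 5.0, kind="estar")
```

Since the effective autoregressive coefficient always lies between phi2 and phi1, the simulated series is stable for the parameter values of Table 3.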
The parameters ϕ1, ϕ2 and γ have been fixed according to the values in Table 3, leading to four different specifications for each class of models (denoted as v1, v2, v3 and v4).
Table 3. Parameter values for the four STAR model specifications.
 | v1 | v2 | v3 | v4 |
ϕ1 | 0.1 | 0.6 | 0.1 | 0.6 |
ϕ2 | -0.1 | -0.6 | -0.1 | -0.6 |
γ | 5 | 5 | 25 | 25 |
In order to evaluate the distribution of both linear and simple nonlinear estimators, we consider, for all the models, four statistics: the sample mean, the sample median, the sample variance and the sample autocovariance (at lag 1). Let $\hat{\theta}$ be the statistic of interest and let $R^* = (\hat{\theta}^* - E^*[\hat{\theta}^*])/\sigma_T$ be the bootstrap counterpart of the root $R = (\hat{\theta} - E[\hat{\theta}])/\sigma_T$, where $\sigma_T = \sqrt{\mathrm{Var}(\hat{\theta})}$ denotes the true standard deviation. The true standard errors $\sigma_T$ have been estimated by using a Monte Carlo simulation with 100,000 runs. The distribution of $R^*$ has been generated by using the NAR-Sieve bootstrap based on NNs, ELMs and regularized ELMs.
All simulations are based on N=500 Monte Carlo runs with time series of length T, with T∈{300,500,1000,2000}. The bootstrap distributions have been estimated using B=1,000 bootstrap replicates. The full experiment is based on 384 design points (2 classes of models, 4 model specifications, 4 statistics, 4 time series lengths and 3 bootstrap implementations).
The results are reported in Figures 3 and 4. For all the statistics considered, the performance of the bootstrap based on ELMs is comparable with that of the bootstrap based on NNs, or even slightly better in some cases. For linear functionals, such as the mean, the performance of the three methods is very close, with low variability, showing excellent accuracy when using the bootstrap to estimate the true unknown standard error. For nonlinear functionals, such as the median, the results are very similar and remain stable across the two classes of models and the different degrees of nonlinearity. For complex functionals which can be expressed as functions of means, such as the variance and the autocovariance, the results remain consistent, but they show lower accuracy for shorter time series.
Moreover, the degree of nonlinearity of the models might reduce the accuracy of the bootstrap standard error estimation. However, ELMs appear to deliver better results in these latter cases, both in terms of accuracy and bias, while requiring only a fraction of the computational time needed for NNs. Finally, all estimators appear to be consistent, with an apparent convergence to the reference value and decreasing variability as the time series length increases.
As an application to real data, we consider normalized tree-ring widths in dimensionless units. The data were recorded by Donald A. Graybill in 1980, from Gt Basin Bristlecone Pine 2805M, 3726–11810 in Methuselah Walk, California. It is a univariate time series with 7981 yearly observations, from 6000 BC to 1979. Tree-ring data are of great importance in ecology in general and in climate change studies in particular. Other fields of interest include archaeology (for dating materials and artefacts made from wood), chemistry (where tree-ring based methods are used to calibrate radiocarbon dates) and dendrology (which also includes forestry management and conservation).
In this application, we limit the analysis to the period ranging from 1001 to 1979 (roughly the last thousand years). The time plot of the data and its autocorrelation function (for the first 20 lags) are reported in Figure 5. The data generating process appears to be stationary, with a decreasing autocorrelation function where the first five lags are statistically significant at the 5% level. To draw inference on the true autocorrelations, given the observed time series, the sampling distribution of the estimates is derived by using the sieve bootstrap based both on NNs and on ELMs. Kernel density estimates of the bootstrap distributions for the first six lags are reported in Figure 6. Clearly, the bootstrap estimates seem to be able to capture the asymmetry of the true sampling distribution of the autocorrelations. There is a slight difference between the NN and ELM sieve bootstrap for the first lag, but in all other cases the two approaches considered in the paper appear to deliver similar results.
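A minimal sketch of a sieve bootstrap for an autocorrelation: a classical linear AR(p) fit stands in for the paper's NN/ELM sieve, and a simulated AR(1) series stands in for the tree-ring data.

```python
import numpy as np

rng = np.random.default_rng(1)

def acf(x, lag):
    """Sample autocorrelation at a given lag."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

def ar_sieve_bootstrap(x, p=2, B=300, lag=1):
    """AR(p)-sieve bootstrap of the lag-`lag` autocorrelation.
    (The paper replaces this linear AR fit with an NN or ELM fit.)"""
    # Fit AR(p) by least squares: x_t on x_{t-1}, ..., x_{t-p}.
    X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    D = np.column_stack([np.ones(len(x) - p), X])
    coef, *_ = np.linalg.lstsq(D, x[p:], rcond=None)
    resid = x[p:] - D @ coef
    resid = resid - resid.mean()          # centre residuals before resampling
    stats = np.empty(B)
    for b in range(B):
        eps = rng.choice(resid, size=len(x) + 100)  # 100 burn-in draws
        xb = np.zeros(len(x) + 100)
        for t in range(p, len(xb)):
            xb[t] = coef[0] + coef[1:] @ xb[t - p : t][::-1] + eps[t]
        stats[b] = acf(xb[100:], lag)     # statistic on the pseudo-series
    return stats

# Simulated AR(1) series with coefficient 0.5 stands in for the real data.
x = np.empty(600); x[0] = 0.0
for t in range(1, 600):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()

boot = ar_sieve_bootstrap(x)
print(boot.mean(), boot.std())
```

The array `boot` is the bootstrap sampling distribution whose kernel density estimate corresponds to the curves in Figure 6.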
Given the bootstrap sampling distribution, confidence intervals with nominal level 95% are derived and plotted in Figure 7 for the first 20 lags. Both procedures identify the first four lags as significant, while all other lags cannot be considered different from zero at the given confidence level. The different conclusion at lag five can be explained by the higher accuracy of bootstrap-based inference with respect to the normal approximation used to construct the confidence intervals reported in Figure 5.
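Given a vector of bootstrap replicates, an equal-tailed percentile interval and the corresponding significance check can be obtained as follows (illustrated with synthetic replicates standing in for a bootstrap sample of an autocorrelation estimate):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic stand-in for a bootstrap sample of an autocorrelation estimate.
boot = rng.normal(loc=0.2, scale=0.05, size=1000)

# Equal-tailed 95% percentile interval from the bootstrap distribution.
lo, hi = np.quantile(boot, [0.025, 0.975])
# A lag is deemed significant when zero lies outside the interval.
significant = not (lo <= 0.0 <= hi)
print(f"95% CI: [{lo:.3f}, {hi:.3f}], significant: {significant}")
```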
In this paper, a novel nonlinear autoregressive sieve bootstrap scheme based on the use of ELMs has been proposed and discussed. To evaluate the performance of the proposed approach, a Monte Carlo simulation experiment has been implemented. Bootstrap schemes based on neural networks appear to be an encouraging solution for extending sieve bootstrap techniques to nonlinear time series. In this framework, ELMs can dramatically reduce the computational burden of the overall procedure, with performance comparable to the NN-Sieve bootstrap and computing time comparable to the AR-Sieve bootstrap.
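The computational advantage comes from the one-shot nature of the ELM fit: the hidden layer is drawn at random and never trained, and the output weights solve a single linear least-squares problem. A minimal sketch (the function names and the toy nonlinear target are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_elm(X, y, n_hidden=50):
    """One-shot ELM fit: random hidden layer, least-squares output weights."""
    W = rng.normal(size=(X.shape[1], n_hidden))   # random, never trained
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                        # hidden-layer design matrix
    beta = np.linalg.pinv(H) @ y                  # Moore-Penrose solution
    return W, b, beta

def predict_elm(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy nonlinear regression: y = sin(x) + noise.
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(400)
W, b, beta = fit_elm(X, y)
mse = np.mean((predict_elm(X, W, b, beta) - y) ** 2)
print(mse)
```

No iterative backpropagation is involved, which is what makes each of the B bootstrap refits cheap.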
Moreover, alternative algorithms for ELM estimation can be considered. Although the orthogonal projection method can be used to compute the Moore-Penrose inverse efficiently, so that the solution is obtained easily and quickly, regularized versions of least squares have proved to be more stable. These techniques penalize the magnitude of the coefficients (controlling how large they grow) while minimizing the error between predicted and actual observations. Adding a penalty term can improve the stability of ELMs by reducing the variance of the estimates. Even when using regularized versions based on Tikhonov (L2) regularization, the computational advantage of ELMs over classical NNs is maintained. Extensions of the bootstrap resampling scheme using L1 regularization (such as the LASSO) or a combination of L1 and L2 regularization (such as the elastic net) are still under study.
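A ridge-regularized variant replaces the pseudoinverse solution with the Tikhonov normal equations, β = (HᵀH + λI)⁻¹Hᵀy. A sketch under the same toy setup as above (the values of `lam` and `n_hidden` are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_elm_ridge(X, y, n_hidden=50, lam=1e-2):
    """ELM output weights via Tikhonov (L2 / ridge) regularization:
    beta = (H'H + lam*I)^{-1} H'y instead of the pseudoinverse."""
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)
    beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ y)
    return W, b, beta

# Same toy nonlinear target: y = sin(x) + noise.
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(400)
W, b, beta = fit_elm_ridge(X, y)
mse = np.mean((np.tanh(X @ W + b) @ beta - y) ** 2)
print(mse)
```

The penalty shrinks the output weights, stabilizing the solution when the hidden-layer matrix H is ill-conditioned, at essentially no extra computational cost.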
The performance of the sieve bootstrap based on ELMs also depends on the Basic Linear Algebra Subprograms (BLAS) implementation. Using OpenBLAS in place of the standard R library (even in single-threaded mode) can significantly improve performance, with better scaling properties for larger problems. Other BLAS implementations that could be considered are Intel's Math Kernel Library (MKL) and ATLAS (Automatically Tuned Linear Algebra Software), a general-purpose, automatically tuned library.
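Since the ELM fit is dominated by dense linear algebra, a crude way to gauge the installed backend (here from Python rather than R) is to time the kind of operation that dominates the procedure; absolute timings are machine- and BLAS-dependent:

```python
import time
import numpy as np

# Time a batch of pseudoinverse computations: this is the dense
# linear-algebra kernel that the installed BLAS/LAPACK accelerates.
n = 500
A = np.random.default_rng(5).standard_normal((n, n))
t0 = time.perf_counter()
for _ in range(5):
    np.linalg.pinv(A)
elapsed = time.perf_counter() - t0
print(f"{elapsed:.3f} s for 5 pinv calls on a {n}x{n} matrix")
```

Comparing this figure across BLAS builds (reference BLAS, OpenBLAS, MKL, ATLAS) on the same machine gives a quick estimate of the speed-up available to the whole bootstrap procedure.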
However, several aspects should be further explored through a more extensive simulation study. These include the sensitivity and stability of the NAR-Sieve bootstrap with respect to lag-structure misspecification and to the choice of the hidden layer size. These topics are still under investigation and beyond the scope of this paper. As a final remark, note that ELMs have been extended to deal effectively with large-scale data problems. The feasibility of bootstrap resampling schemes in that framework is still under investigation.
The authors wish to thank the Associate Editor and the anonymous referees for their helpful comments and suggestions.
The authors declare no conflict of interest.
| Method | min | 1st quart. | mean | median | 3rd quart. | max | runs |
|--------|-----|------------|------|--------|------------|-----|------|
| NN | 37.54 | 48.22 | 54.14 | 54.75 | 59.02 | 73.29 | 100 |
| Reg | 0.75 | 0.95 | 1.40 | 1.02 | 1.13 | 24.56 | 100 |
| ELM | 0.11 | 0.16 | 0.40 | 0.18 | 0.20 | 21.99 | 100 |
| Method | min | 1st quart. | mean | median | 3rd quart. | max | runs |
|--------|-----|------------|------|--------|------------|-----|------|
| NN | 15886.80 | 17575.38 | 17906.21 | 17791.85 | 18252.19 | 20371.12 | 100 |
| Reg | 32.86 | 35.11 | 41.46 | 37.18 | 43.88 | 256.92 | 100 |
| ELM | 8.62 | 10.85 | 11.97 | 11.29 | 11.77 | 22.29 | 100 |
| | v1 | v2 | v3 | v4 |
|---|----|----|----|----|
| ϕ1 | 0.1 | 0.6 | 0.1 | 0.6 |
| ϕ2 | -0.1 | -0.6 | -0.1 | -0.6 |
| γ | 5 | 5 | 25 | 25 |