Intelligent breast cancer diagnostic system empowered by deep extreme gradient descent optimization

Muhammad Bilal Shoaib Khan; Atta-ur-Rahman; Muhammad Saqib Nawaz; Rashad Ahmed; Muhammad Adnan Khan; Amir Mosavi; Muhammad Bilal Shoaib Khan; Atta-ur-Rahman; Muhammad Saqib Nawaz; Rashad Ahmed; Muhammad Adnan Khan; Amir Mosavi

doi:10.3934/mbe.2022373

Mathematical Biosciences and Engineering

2022, Volume 19, Issue 8: 7978-8002. doi: 10.3934/mbe.2022373

Previous Article Next Article

Research article Special Issues

Intelligent breast cancer diagnostic system empowered by deep extreme gradient descent optimization

1.
Department of Information Technology, Akhuwat College University, Lahore 54000, Pakistan
2.
Department of Computer Science, College of Computer Science and Information Technology (CCSIT), Imam Abdulrahman Bin Faisal University (IAU), P.O. Box 1982, Dammam 31441, Saudi Arabia
3.
Department of Computer Science & IT, Minhaj University Lahore, Lahore 54000, Pakistan
4.
ICS Department, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia
5.
Department of Software, Gachon University, Seongnam 13120, Korea
6.
John von Neumann Faculty of Informatics, Obuda University, Budapest, Hungary
7.
Institute of Information Engineering, Automation and Mathematics, Slovak University of Technology in Bratislava, Bratislava, Slovakia
8.
Institute of Information Society, University of Public Service, 1083 Budapest, Hungary

Academic Editor: Soheila Borhani

Received: 02 April 2022 Revised: 25 April 2022 Accepted: 09 May 2022 Published: 30 May 2022

Cancer is a manifestation of disorders caused by the changes in the body's cells that go far beyond healthy development as well as stabilization. Breast cancer is a common disease. According to the stats given by the World Health Organization (WHO), 7.8 million women are diagnosed with breast cancer. Breast cancer is the name of the malignant tumor which is normally developed by the cells in the breast. Machine learning (ML) approaches, on the other hand, provide a variety of probabilistic and statistical ways for intelligent systems to learn from prior experiences to recognize patterns in a dataset that can be used, in the future, for decision making. This endeavor aims to build a deep learning-based model for the prediction of breast cancer with a better accuracy. A novel deep extreme gradient descent optimization (DEGDO) has been developed for the breast cancer detection. The proposed model consists of two stages of training and validation. The training phase, in turn, consists of three major layers data acquisition layer, preprocessing layer, and application layer. The data acquisition layer takes the data and passes it to preprocessing layer. In the preprocessing layer, noise and missing values are converted to the normalized which is then fed to the application layer. In application layer, the model is trained with a deep extreme gradient descent optimization technique. The trained model is stored on the server. In the validation phase, it is imported to process the actual data to diagnose. This study has used Wisconsin Breast Cancer Diagnostic dataset to train and test the model. The results obtained by the proposed model outperform many other approaches by attaining 98.73 % accuracy, 99.60% specificity, 99.43% sensitivity, and 99.48% precision.

Keywords:

Citation: Muhammad Bilal Shoaib Khan, Atta-ur-Rahman, Muhammad Saqib Nawaz, Rashad Ahmed, Muhammad Adnan Khan, Amir Mosavi. Intelligent breast cancer diagnostic system empowered by deep extreme gradient descent optimization[J]. Mathematical Biosciences and Engineering, 2022, 19(8): 7978-8002. doi: 10.3934/mbe.2022373

Related Papers:

[1]	Karl Hajjar, Lénaïc Chizat . On the symmetries in the dynamics of wide two-layer neural networks. Electronic Research Archive, 2023, 31(4): 2175-2212. doi: 10.3934/era.2023112
[2]	Eray Önler . Feature fusion based artificial neural network model for disease detection of bean leaves. Electronic Research Archive, 2023, 31(5): 2409-2427. doi: 10.3934/era.2023122
[3]	Dong-hyeon Kim, Se-woon Choe, Sung-Uk Zhang . Recognition of adherent polychaetes on oysters and scallops using Microsoft Azure Custom Vision. Electronic Research Archive, 2023, 31(3): 1691-1709. doi: 10.3934/era.2023088
[4]	Ziqing Yang, Ruiping Niu, Miaomiao Chen, Hongen Jia, Shengli Li . Adaptive fractional physical information neural network based on PQI scheme for solving time-fractional partial differential equations. Electronic Research Archive, 2024, 32(4): 2699-2727. doi: 10.3934/era.2024122
[5]	Ilyоs Abdullaev, Natalia Prodanova, Mohammed Altaf Ahmed, E. Laxmi Lydia, Bhanu Shrestha, Gyanendra Prasad Joshi, Woong Cho . Leveraging metaheuristics with artificial intelligence for customer churn prediction in telecom industries. Electronic Research Archive, 2023, 31(8): 4443-4458. doi: 10.3934/era.2023227
[6]	Kai Huang, Chang Jiang, Pei Li, Ali Shan, Jian Wan, Wenhu Qin . A systematic framework for urban smart transportation towards traffic management and parking. Electronic Research Archive, 2022, 30(11): 4191-4208. doi: 10.3934/era.2022212
[7]	Ruyu Yan, Jiafei Jin, Kun Han . Reinforcement learning for deep portfolio optimization. Electronic Research Archive, 2024, 32(9): 5176-5200. doi: 10.3934/era.2024239
[8]	Mohd. Rehan Ghazi, N. S. Raghava . Securing cloud-enabled smart cities by detecting intrusion using spark-based stacking ensemble of machine learning algorithms. Electronic Research Archive, 2024, 32(2): 1268-1307. doi: 10.3934/era.2024060
[9]	Manal Abdullah Alohali, Mashael Maashi, Raji Faqih, Hany Mahgoub, Abdullah Mohamed, Mohammed Assiri, Suhanda Drar . Spotted hyena optimizer with deep learning enabled vehicle counting and classification model for intelligent transportation systems. Electronic Research Archive, 2023, 31(7): 3704-3721. doi: 10.3934/era.2023188
[10]	Jiaxin Zhang, Hoang Tran, Guannan Zhang . Accelerating reinforcement learning with a Directional-Gaussian-Smoothing evolution strategy. Electronic Research Archive, 2021, 29(6): 4119-4135. doi: 10.3934/era.2021075

Abstract

1. Introduction

The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure which appears, for instance, in the context of natural language processing, face recognition, fraud detection, and game intelligence. Although there exist a large number of numerical simulations in which GD type optimization schemes are effectively used to train ANNs with ReLU activation, till this day in the scientific literature there is in general no mathematical convergence analysis which explains the success of GD type optimization schemes in the training of such ANNs.

GD type optimization schemes can be regarded as temporal discretization methods for the gradient flow (GF) differential equations associated to the considered optimization problem and, in view of this, it seems to be a natural direction of research to first aim to develop a mathematical convergence theory for time-continuous GF differential equations and, thereafter, to aim to extend such a time-continuous convergence theory to implementable time-discrete GD type optimization methods.

Although there is in general no theoretical analysis which explains the success of GD type optimization schemes in the training of ANNs in the literature, there are several auspicious analysis approaches as well as several promising partial error analyses regarding the training of ANNs via GD type optimization schemes and GFs, respectively, in the literature. For convex objective functions, the convergence of GF and GD processes to the global minimum in different settings has been proved, e.g., in ^[1,2,3,4,5]. For general non-convex objective functions, even under smoothness assumptions GF and GD processes can show wild oscillations and admit infinitely many limit points, cf., e.g., ^[6]. A standard condition which excludes this undesirable behavior is the Kurdyka-Łojasiewicz inequality and we point to ^{[7,8,9,10,11,12,13,14,15,16]} for convergence results for GF and GD processes under Łojasiewicz type assumptions. It is in fact one of the main contributions of this work to demonstrate that the objective functions occurring in the training of ANNs with ReLU activation satisfy an appropriate Kurdyka-Łojasiewicz inequality, provided that both the target function and the density of the probability distribution of the input data are piecewise polynomial. For further abstract convergence results for GF and GD processes in the non-convex setting we refer, e.g., to ^{[17,18,19,20,21]} and the references mentioned therein.

In the overparametrized regime, where the number of training parameters is much larger than the number of training data points, GF and GD processes can be shown to converge to global minima in the training of ANNs with high probability, cf., e.g., ^{[22,23,24,25,26,27,28]}. As the number of neurons increases to infinity, the corresponding GF processes converge (with appropriate rescaling) to a measure-valued process which is known in the scientific literature as Wasserstein GF. For results on the convergence behavior of Wasserstein GFs in the training of ANNs we point, e.g., to ^[29,30,31], [32, Section 5.1], and the references mentioned therein.

A different approach is to consider only very special target functions and we refer, in particular, to ^[33,34] for a convergence analysis for GF and GD processes in the case of constant target functions and to ^[35] for a convergence analysis for GF and GD processes in the training of ANNs with piecewise linear target functions. In the case of linear target functions, a complete characterization of the non-global local minima and the saddle points of the risk function has been obtained in ^[36].

In this article we establish two basic results for GF differential equations in the training of fully-connected feedforward ANNs with one hidden layer and ReLU activation. Specifically, in the first main result of this article, see Theorem 1.1 below, we establish in the training of such ANNs under the assumption that the probability distribution of the input data of the considered supervised learning problem is absolutely continuous with a bounded density function that every GF differential equation possesses for every initial value a solution which is also unique among a suitable class of solutions (see (1.6) in Theorem 1.1 for details). In the second main result of this article, see Theorem 1.2 below, we prove in the training of such ANNs under the assumption that the target function and the density function are piecewise polynomial (see (1.8) below for details) that every non-divergent GF trajectory converges with an appropriate speed of convergence (see (1.11) below) to a critical point.

In Theorems 1.1 and 1.2 we consider ANNs with $d \in \mathbb{N} = \{ 1, 2, 3, \dots \}$ neurons on the input layer ( $d$ -dimensional input), $H \in \mathbb{N}$ neurons on the hidden layer ( $H$ -dimensional hidden layer), and $1$ neuron on the output layer ( $1$ -dimensional output). There are thus $H d$ scalar real weight parameters and $H$ scalar real bias parameters to describe the affine linear transformation between $d$ -dimensional input layer and the $H$ -dimensional hidden layer and there are thus $H$ scalar real weight parameters and 1 scalar real bias parameter to describe the affine linear transformation between the $H$ -dimensional hidden layer and the $1$ -dimensional output layer. Altogether there are thus

$\begin{equation} \mathfrak{d} = H d + H + H + 1 = H d + 2 H + 1 \end{equation}$

(1.1)

real numbers to describe the ANNs in Theorems 1.1 and 1.2. We also refer to for a graphical illustration of the architecture of an example ANN with $d = 4$ neurons on the input layer and $H = 5$ neurons on the hidden layer.

Figure 1. Graphical illustration of the architecture of an example fully-connected feedforward ANN with one hidden layer with

$4$ neurons on the input layer,

$5$ neurons on the hidden layer, and

$1$ neuron on the output layer corresponding to

$d = 4$ and

$H = 5$ in Theorems 1.1 and 1.2. In this example there are

$H d = 20$ arrows from the input layer to the hidden layer corresponding to

$H d = 20$ weight parameters to describe the affine linear transformation from the input layer to the hidden layer, there are

$H = 5$ bias parameters to describe the affine linear transformation from the input layer to the hidden layer, there are

$H = 5$ arrows from the hidden layer to the output layer corresponding to

$H = 5$ weight parameters to describe the affine linear transformation from the hidden layer to the output layer, and there is

$1$ bias parameter to describe the affine linear transformation from the hidden layer to the output layer. The overall number

$\mathfrak{d} \in \mathbb{N}$ of ANN parameters thus satisfies

$\mathfrak{d} = H d + H + H + 1 = Hd + 2 H + 1 = 20 + 10 + 1 = 31$ (cf. (1.1), Theorems 1.1 and 1.2.

DownLoad: Full-Size Img PowerPoint

The real numbers $\mathscr{a} \in \mathbb{R}$ , $\mathscr{b} \in (\mathscr{a}, \infty)$ in Theorems 1.1 and 1.2 are used to specify the set $[\mathscr{a}, \mathscr{b}]^d$ in which the input data of the considered supervised learning problem takes values in and the function $f \colon [\mathscr{a}, \mathscr{b}]^d \to \mathbb{R}$ in Theorem 1.1 specifies the target function of the considered supervised learning problem.

In Theorem 1.1 we assume that the target function is an element of the set $C([\mathscr{a}, \mathscr{b}]^d, \mathbb{R})$ of continuous functions from $[\mathscr{a}, \mathscr{b}]^d$ to $\mathbb{R}$ but beside this continuity hypothesis we do not impose further regularity assumptions on the target function.

The function $\mathfrak{p} \colon [\mathscr{a}, \mathscr{b}]^d \to [0, \infty)$ in Theorems 1.1 and 1.2 is an unnormalized density function of the probability distribution of the input data of the considered supervised learning problem and in Theorem 1.1 we impose that this unnormalized density function is bounded and measurable.

In Theorems 1.1 and 1.2 we consider ANNs with the ReLU activation function

$\begin{equation} \mathbb{R} \ni x \mapsto \max \left\{ {{ x, 0 }} \right\} \in \mathbb{R} . \end{equation}$

(1.2)

The ReLU activation function fails to be differentiable and this lack of regularity also transfers to the risk function of the considered supervised learning problem; cf. (1.5) below. We thus need to employ appropriately generalized gradients of the risk function to specify the dynamics of the GFs. As in [34, Setting 2.1 and Proposition 2.3] (cf. also ^[33,37]), we accomplish this, first, by approximating the ReLU activation function through continuously differentiable functions which converge pointwise to the ReLU activation function and whose derivatives converge pointwise to the left derivative of the ReLU activation function and, thereafter, by specifying the generalized gradient function as the limit of the gradients of the approximated risk functions; see (1.3) and (1.5) in Theorem 1.1 and (1.9) and (1.10) in Theorem 1.2 for details.

We now present the precise statement of Theorem 1.1 and, thereafter, provide further comments regarding Theorem 1.2.

Theorem 1.1 (Existence and uniqueness of solutions of GFs in the training of ANNs). Let $d, H, \mathfrak{d} \in \mathbb{N}$ , $\mathscr{a} \in \mathbb{R}$ , $\mathscr{b} \in (\mathscr{a}, \infty)$ , $f \in C ([\mathscr{a}, \mathscr{b}]^d, \mathbb{R})$ satisfy $\mathfrak{d} = d H + 2 H + 1$ , let $\mathfrak{p} \colon [\mathscr{a}, \mathscr{b}]^d \to [0, \infty)$ be bounded and measurable, let $\mathfrak{R}_r \in C (\mathbb{R}, \mathbb{R})$ , $r \in \mathbb{N} \cup \left\{ {{ \infty }} \right\}$ , satisfy for all $x \in \mathbb{R}$ that $(\cup_{r \in \mathbb{N}} \left\{ {{ \mathfrak{R}_r }} \right\}) \subseteq C^1(\mathbb{R}, \mathbb{R})$ , $\mathfrak{R}_\infty (x) = \max \left\{ {{ x, 0 }} \right\}$ , $\sup_{r \in \mathbb{N}} \sup_{y \in [-|x|, |x|] } | (\mathfrak{R}_r)'(y)| < \infty$ , and

$\begin{equation} \limsup\nolimits_{r \to \infty} \left( {{ | \mathfrak{R}_r ( x ) - \mathfrak{R} _\infty ( x ) | + | ( \mathfrak{R}_r)' ( x ) - \mathbf{1}_{\smash{{(0, \infty)}}} ( x ) | }} \right) = 0, \end{equation}$

(1.3)

for every $\theta = (\theta_1, \ldots, \theta_ \mathfrak{d}) \in \mathbb{R}^ \mathfrak{d}$ let $\mathbf{D} ^\theta \subseteq \mathbb{N}$ satisfy

$\begin{equation} \mathbf{D}^\theta = \big\{ i \in \left\{ {{ 1, 2, \ldots, H }} \right\} \colon |\theta_{ H d + i } | + \sum _{j = 1}^d |\theta_{(i - 1 ) d + j } | = 0 \big\}, \end{equation}$

(1.4)

for every $r \in \mathbb{N} \cup \left\{ {{ \infty }} \right\}$ let $\mathcal{L}_r \colon \mathbb{R}^ \mathfrak{d} \to \mathbb{R}$ satisfy for all $\theta = (\theta_1, \ldots, \theta_ \mathfrak{d}) \in \mathbb{R}^{ \mathfrak{d}}$ that

$\begin{eqnarray} \mathcal{L}_r ( \theta ) = \int_{[ \mathscr{a} , \mathscr{b}]^d} \bigl( f ( x_1, \ldots, x_d ) \\ - \theta_{ \mathfrak{d}} - \sum\nolimits_{i = 1}^ H \theta_{ H ( d + 1 ) + i } \left[ {{ \mathfrak{R}_r ( \theta_{ H d + i} + \sum\nolimits_{j = 1}^d\theta_{(i-1)d + j } x_j ) }} \right] \bigr) ^2 \mathfrak{p} ( x ) \, \mathrm{d} (x_1, \ldots, x_d ) , \end{eqnarray}$

(1.5)

let $\theta \in \mathbb{R}^ \mathfrak{d}$ , and let $\mathcal{G} \colon \mathbb{R}^ \mathfrak{d} \to \mathbb{R}^ \mathfrak{d}$ satisfy for all $\vartheta \in \left\{ {{ v \in \mathbb{R}^ \mathfrak{d} \colon ((\nabla \mathcal{L}_r) (v)) _{r \in \mathbb{N}}\; \mathit{\text{is convergent}} }} \right\}$ that $\mathcal{G} (\vartheta) = \lim_{r \to \infty} (\nabla \mathcal{L}_r) (\vartheta)$ . Then

(i) it holds that $\mathcal{G}$ is locally bounded and measurable and

(ii) there exists a unique $\Theta \in C([0, \infty), \mathbb{R}^ \mathfrak{d})$ which satisfies for all $t \in [0, \infty)$ , $s \in [t, \infty)$ that $\mathbf{D}^{\Theta_t} \subseteq \mathbf{D}^{\Theta_s }$ and

$\begin{equation} \Theta_t = \theta - \int_0^t \mathcal{G} ( \Theta_u ) \, \mathrm{d} u . \end{equation}$

(1.6)

Theorem 1.1 is a direct consequence of Theorem 3.3 below. In Theorem 1.2 we also assume that the target function $f \colon [\mathscr{a}, \mathscr{b}]^d \to \mathbb{R}$ is continuous but additionally assume that, roughly speaking, both the target function $f \colon [\mathscr{a}, \mathscr{b}]^d \to \mathbb{R}$ and the unnormalized density function $\mathfrak{p} \colon [\mathscr{a}, \mathscr{b}]^d \to [0, \infty)$ coincide with polynomial functions on suitable subsets of their domain of definition $[\mathscr{a}, \mathscr{b}]^d$ . In Theorem 1.2 the $(n \times d)$ -matrices $\alpha^k_i \in \mathbb{R}^{ n \times d }$ , $i \in \left\{ {{ 1, 2, \ldots, n }} \right\}$ , $k \in \left\{ {{0, 1}} \right\}$ , and the $n$ -dimensional vectors $\beta^k_i \in \mathbb{R}^n$ , $i \in \left\{ {{ 1, 2, \ldots, n }} \right\}$ , $k \in \left\{ {{0, 1}} \right\}$ , are used to describe these subsets and the functions $P^k_i \colon \mathbb{R}^d \to \mathbb{R}$ , $i \in \left\{ {{ 1, 2, \ldots, n }} \right\}$ , $k \in \left\{ {{0, 1}} \right\}$ , constitute the polynomials with which the target function and the unnormalized density function should partially coincide. More formally, in (1.8) in Theorem 1.2 we assume that for every $x \in [\mathscr{a}, \mathscr{b}] ^d$ we have that

$\begin{equation} \mathfrak{p} (x) = \sum _{i \in \left\{ {{1, 2, \ldots, n }} \right\}, \, \alpha_i^0 x + \beta_i^0 \in [0, \infty)^n } P_i^0 ( x ) \quad\text{and}\quad f(x) = \sum _{i \in \left\{ {{1, 2, \ldots, n }} \right\}, \, \alpha_i^1 x + \beta_i^1 \in [0, \infty)^n } P_i^1 ( x ) . \end{equation}$

(1.7)

In (1.11) in Theorem 1.2 we prove that there exists a strictly positive real number $\beta \in (0, \infty)$ such that for every GF trajectory $\Theta \colon [0, \infty) \to \mathbb{R}^{ \mathfrak{d} }$ which does not diverge to infinity in the sense^* that $\liminf_{t \to \infty} ||\Theta_t || < \infty$ we have that $\Theta_t \in \mathbb{R}^{ \mathfrak{d} }$ , $t \in [0, \infty)$ , converges with order $\beta$ to a critical point $\vartheta \in \mathcal{G}^{ - 1 }(\left\{ {{ 0 }} \right\}) = \left\{ {{ \theta \in \mathbb{R}^{ \mathfrak{d} } \colon \mathcal{G} (\theta) = 0 }} \right\}$ and we have that the risk $\mathcal{L} (\Theta_t) \in \mathbb{R}$ , $t \in [0, \infty)$ , converges with order 1 to the risk $\mathcal{L} (\vartheta)$ of the critical point $\vartheta$ . We now present the precise statement of Theorem 1.2.

^*Note that the functions $|| \cdot || \colon (\cup_{n \in \mathbb{N}} \mathbb{R} ^n) \to \mathbb{R}$ and $\left\langle {{ \cdot, \cdot }} \right\rangle \colon (\cup_{n \in \mathbb{N}} (\mathbb{R}^n \times \mathbb{R}^n)) \to \mathbb{R}$ satisfy for all $n \in \mathbb{N}$ , $x = (x_1, \ldots, x_n)$ , $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ that $|| x || = [\sum _{i = 1}^n \left| x_i \right| ^2] ^{1/2}$ and $\left\langle {{ x, y }} \right\rangle = \sum _{i = 1}^ \mathfrak{d} x_i y_i$ .

Theorem 1.2 (Convergence rates for GFs trajectories in the training of ANNs). Let $d, H, \mathfrak{d}, n \in \mathbb{N}$ , $\mathscr{a} \in \mathbb{R}$ , $\mathscr{b} \in (\mathscr{a}, \infty)$ , $f \in C([\mathscr{a}, \mathscr{b}] ^d, \mathbb{R})$ satisfy $\mathfrak{d} = d H + 2 H + 1$ , for every $i \in \left\{ {{1, 2, \ldots, n}} \right\}$ , $k \in \left\{ {{0, 1}} \right\}$ let $\alpha_{i}^k \in \mathbb{R}^{n \times d}$ , let $\beta_{i }^k \in \mathbb{R}^n$ , and let $P_i^k \colon \mathbb{R}^d \to \mathbb{R}$ be a polynomial, let $\mathfrak{p} \colon [\mathscr{a}, \mathscr{b}] ^d \to [0, \infty)$ satisfy for all $k \in \left\{ {{0, 1}} \right\}$ , $x \in [\mathscr{a}, \mathscr{b}] ^d$ that

$\begin{equation} k f ( x ) + ( 1 - k ) \mathfrak{p} ( x ) = \sum _{i = 1}^n \left[ {{ P_i^k ( x ) \mathbf{1}_{\smash{{[0, \infty )^n}}} \left( {{ \alpha^k _i x + \beta^k_i }} \right) }} \right] , \end{equation}$

(1.8)

let $\mathfrak{R}_r \in C (\mathbb{R}, \mathbb{R})$ , $r \in \mathbb{N} \cup \left\{ {{ \infty }} \right\}$ , satisfy for all $x \in \mathbb{R}$ that $(\cup_{r \in \mathbb{N}} \left\{ {{ \mathfrak{R}_r }} \right\}) \subseteq C^1(\mathbb{R}, \mathbb{R})$ , $\mathfrak{R}_\infty (x) = \max \left\{ {{ x, 0 }} \right\}$ , $\sup_{r \in \mathbb{N}} \sup_{y \in [-|x|, |x|] } | (\mathfrak{R}_r)'(y)| < \infty$ , and

(1.9)

$\begin{eqnarray} \mathcal{L}_r ( \theta ) = \int_{[ \mathscr{a} , \mathscr{b}]^d} \bigl( f ( x_1, \ldots, x_d ) \\ - \theta_{ \mathfrak{d}} - \sum\nolimits_{i = 1}^ H \theta_{ H ( d + 1 ) + i } \left[ {{ \mathfrak{R}_r ( \theta_{ H d + i} + \sum\nolimits_{j = 1}^d\theta_{(i-1)d + j } x_j ) }} \right] \bigr)^2 \mathfrak{p} ( x ) \, \mathrm{d} (x_1, \ldots, x_d ) , \end{eqnarray}$

(1.10)

let $\mathcal{G} \colon \mathbb{R}^ \mathfrak{d} \to \mathbb{R}^ \mathfrak{d}$ satisfy for all $\theta \in \left\{ {{ \vartheta \in \mathbb{R}^ \mathfrak{d} \colon ((\nabla \mathcal{L}_r) (\vartheta)) _{r \in \mathbb{N}}\; \mathit{\text{is convergent}} }} \right\}$ that $\mathcal{G} (\theta) = \lim_{r \to \infty} (\nabla \mathcal{L}_r) (\theta)$ , and let $\Theta \in C ([0, \infty), \mathbb{R}^ \mathfrak{d})$ satisfy $\liminf_{t \to \infty } ||\Theta_t || < \infty$ and $\forall \, t \in [0, \infty) \colon \Theta_t = \Theta_0 - \int_0^t \mathcal{G} (\Theta_s) \, \mathrm{d} s$ . Then there exist $\vartheta \in \mathcal{G}^{ - 1 } (\left\{ {{ 0 }} \right\})$ , $\mathfrak{C}, \beta \in (0, \infty)$ which satisfy for all $t \in [0, \infty)$ that

$\begin{equation} ||\Theta_t - \vartheta|| \leq \mathfrak{C} ( 1 + t ) ^{- \beta } \qquad\mathit{\text{and}}\qquad | \mathcal{L}_\infty ( \Theta_t ) - \mathcal{L}_\infty ( \vartheta ) | \leq \mathfrak{C} ( 1 + t ) ^{ - 1 } . \end{equation}$

(1.11)

Theorem 1.2 above is an immediate consequence of Theorem 5.4 in Subsection 5.3 below. Theorem 1.2 is related to Theorem 1.1 in our previous article ^[37]. In particular, [37, Theorem 1.1] uses weaker assumptions than Theorem 1.2 above but Theorem 1.2 above establishes a stronger statement when compared to [37, Theorem 1.1]. Specifically, on the one hand in [37, Theorem 1.1] the target function is only assumed to be a continuous function and the unnormalized density is only assumed to be measurable and integrable while in Theorem 1.2 it is additionally assumed that both the target function and the unnormalized density are piecewise polynomial in the sense of (1.8) above. On the other hand [, Theorem 1.1] only asserts that the risk of every bounded GF trajectory converges to the risk of critical point while Theorem 1.2 assures that every non-divergent GF trajectory converges with a strictly positive rate of convergence to a critical point (the rate of convergence is given through the strictly positive real number $\beta \in (0, \infty)$ appearing in the exponent on the left inequality in (Eq 1.11) in Theorem 1.2) and also assures that the risk of the non-divergent GF trajectory converges with rate 1 to the risk of the critical point (the convergence rate $1$ is ensured through the $1$ appearing in the exponent on the right inequality in (Eq 1.11) in Theorem 1.2).

We also point out that Theorem 1.2 assumes that the GF trajectory is non-divergent in the sense that $\liminf_{ t \to \infty } ||\Theta_t || < \infty$ . In general, it remains an open problem to establish sufficient conditions which ensure that the GF trajectory has this non-divergence property. In this aspect we also refer to Gallon et al. ^[38] for counterexamples for which it has been proved that every GF trajectory with sufficiently small initial risk does in the training of ANNs diverge to $\infty$ in the sense that $\liminf_{ t \to \infty } ||\Theta_t || = \infty$ .

The remainder of this article is organized in the following way. In Section 2 we establish several regularity properties for the risk function of the considered supervised learning problem and its generalized gradient function. In Subsection 3 we employ the findings from Section 2 to establish existence and uniqueness properties for solutions of GF differential equations. In particular, in Subsection 3 we present the proof of Theorem 1.1 above. In Subsection 4 we establish under the assumption that both the target function $f \colon [\mathscr{a}, \mathscr{b}]^d \to \mathbb{R}$ and the unnormalized density function $\mathfrak{p} \colon [\mathscr{a}, \mathscr{b}]^d \to [0, \infty)$ are piecewise polynomial that the risk function is semialgebraic in the sense of Definition 4.3 in Subsection 4 (see Corollary 4.10 in Subsection 4 for details). In Subsection 5 we engage the results from Sections 2 and 4 to establish several convergence rate results for solutions of GF differential equations and, thereby, we also prove Theorem 1.2 above.

2. Properties of the risk function and its generalized gradient function

In this section we establish several regularity properties for the risk function $\mathcal{L} \colon \mathbb{R}^ \mathfrak{d} \to \mathbb{R}$ and its generalized gradient function $\mathcal{G} \colon \mathbb{R}^ \mathfrak{d} \to \mathbb{R}^ \mathfrak{d}$ . In particular, in Proposition 2.12 in Subsection 2.5 below we prove for every parameter vector $\theta \in \mathbb{R}^{ \mathfrak{d} }$ in the ANN parameter space $\mathbb{R}^{ \mathfrak{d} } = \mathbb{R}^{ d H + 2 H + 1 }$ that the generalized gradient $\mathcal{G} (\theta)$ is a limiting subdifferential of the risk function $\mathcal{L} \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}$ at $\theta$ . In Definition 2.8 in Subsection 2.5 we recall the notion of subdifferentials (which are sometimes also referred to as Fréchet subdifferentials in the scientific literature) and in Definition 2.9 in Subsection 2.5 we recall the notion of limiting subdifferentials. In the scientific literature Definitions 2.8 and 2.9 can in a slightly different presentational form, e.g., be found in Rockafellar & Wets [39, Definition 8.3] and Bolte et al. [9, Definition 2.10], respectively.

Our proof of Proposition 2.12 uses the continuously differentiability result for the risk function in Proposition 2.3 in Subsection 2.2 and the local Lipschitz continuity result for the generalized gradient function in Corollary 2.7 in Subsection 2.4. Corollary 2.7 will also be employed in Subsection 3 below to establish existence and uniqueness results for solutions of GF differential equations. Proposition 2.3 follows directly from [37, Proposition 2.10, Lemmas 2.11 and 2.12]. Our proof of Corollary 2.7, in turn, employs the known representation result for the generalized gradient function in Proposition 2.2 in Subsection 2.2 below and the local Lipschitz continuity result for certain parameter integrals in Corollary 2.6 in Subsection 2.4. Statements related to Proposition 2.2 can, e.g., be found in [37, Proposition 2.2], [33, Proposition 2.3], and [34, Proposition 2.3].

Our proof of Corollary 2.6 uses the elementary abstract local Lipschitz continuity result for certain parameter integrals in Lemma 2.5 in Subsection 2.4 and the local Lipschitz continuity result for active neuron regions in Lemma 2.4 in Subsection 2.3 below. Lemma 2.4 is a generalization of [35, Lemma 7], Lemma 2.5 is a slight generalization of [35, Lemma 6], and Corollary 2.6 is a generalization of [37, Lemma 2.12] and [35, Corollary 9]. The proof of Lemma 2.5 is therefore omitted.

In Setting 2.1 in Subsection 2.1 below we present the mathematical setup to describe ANNs with ReLU activation, the risk function $\mathcal{L} \colon \mathbb{R}^ \mathfrak{d} \to \mathbb{R}$ , and its generalized gradient function $\mathcal{G} \colon \mathbb{R}^ \mathfrak{d} \to \mathbb{R}^ \mathfrak{d}$ . Moreover, in (2.6) in Setting 2.1 we define for a given parameter vector $\theta \in \mathbb{R}^ \mathfrak{d}$ the set of hidden neurons which have all input parameters equal to zero. Such neurons are sometimes called degenerate (cf. Cheridito et al. ^[36]) and can cause problems with the differentiability of the risk function, which is why we exclude degenerate neurons in Proposition 2.3 and Corollary 2.7 below.

2.1. Mathematical description of artificial neural networks (ANNs)

In this subsection we present in Setting 2.1 below the mathematical setup that we employ to state most of the mathematical results of this work. We also refer to Figure 2 below for a table in which we briefly list the mathematical objects introduced in Setting 2.1.

Figure 2. List of the mathematical objects introduced in Setting 2.1.

DownLoad: Full-Size Img PowerPoint

Setting 2.1. Let $d, H, \mathfrak{d} \in \mathbb{N}$ , $\mathscr{a} \in \mathbb{R}$ , $\mathscr{b} \in (\mathscr{a}, \infty)$ , $f \in C ([\mathscr{a}, \mathscr{b}]^d, \mathbb{R})$ satisfy $\mathfrak{d} = d H + 2 H + 1$ , let $\mathfrak{w} = ((\mathfrak{w}^{{ \theta }}_{ i, j })_{ (i, j) \in \left\{ {{ 1, \ldots, H }} \right\} \times \left\{ {{ 1, \ldots, d }} \right\} })_{ \theta \in \mathbb{R}^{ \mathfrak{d} } } \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ H \times d }$ , $\mathfrak{b} = ((\mathfrak{b}^{{ \theta }}_1, \dots, \mathfrak{b}^{{ \theta }}_{ H }))_{ \theta \in \mathbb{R}^{ \mathfrak{d} } } \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ H }$ , $\mathfrak{v} = ((\mathfrak{v}^{{\theta}}_1, \dots, \mathfrak{v}^{{\theta}}_{ H }))_{ \theta \in \mathbb{R}^{ \mathfrak{d} } } \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ H }$ , and $\mathfrak{c} = (\mathfrak{c}^{{ \theta }})_{ \theta \in \mathbb{R}^{ \mathfrak{d} } } \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}$ satisfy for all $\theta = (\theta_1, \ldots, \theta_{ \mathfrak{d}}) \in \mathbb{R}^{ \mathfrak{d}}$ , $i \in \left\{ {{ 1, 2, \ldots, H }} \right\}$ , $j \in \left\{ {{ 1, 2, \ldots, d }} \right\}$ that

$\begin{equation} \mathfrak{w}^{{ \theta }}_{ i , j } = \theta_{ (i - 1 ) d + j}, \qquad \mathfrak{b}^{{\theta}}_i = \theta_{ H d + i }, \qquad \mathfrak{v}^{{\theta}}_i = \theta_{ H (d + 1) + i }, \qquad\mathit{\text{and}}\qquad \mathfrak{c}^{{\theta}} = \theta_{ \mathfrak{d} } , \end{equation}$

(2.1)

let $\mathfrak{R}_r \in C^1(\mathbb{R}, \mathbb{R})$ , $r \in \mathbb{N}$ , satisfy for all $x \in \mathbb{R}$ that

$\begin{equation} \limsup\nolimits_{ r \to \infty } \left( {{ | \mathfrak{R}_r( x ) - \max\left\{ {{ x , 0}} \right\} | + | ( \mathfrak{R}_r )'( x ) - \mathbf{1}_{\smash{{ (0, \infty) }}}( x ) | }} \right) = 0 \end{equation}$

(2.2)

and $\sup_{r \in \mathbb{N}} \sup_{y \in [-|x|, |x|] } |(\mathfrak{R}_r)'(y)| < \infty$ , let $\lambda \colon \mathcal{B} (\mathbb{R}^d) \to [0, \infty]$ be the Lebesgue–Borel measure on $\mathbb{R}^d$ , let $\mathfrak{p} \colon [\mathscr{a}, \mathscr{b}] ^d \to [0, \infty)$ be bounded and measurable, let $\mathscr{N} = (\mathscr{N} ^{ { \theta } })_{ \theta \in \mathbb{R}^{ \mathfrak{d} } } \colon \mathbb{R}^{ \mathfrak{d} } \to C(\mathbb{R}^d, \mathbb{R})$ and $\mathcal{L} \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}$ satisfy for all $\theta \in \mathbb{R}^{ \mathfrak{d} }$ , $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ that

$\begin{equation} \mathscr{N} ^{ {\theta} }(x) = \mathfrak{c}^{{\theta}} + \sum _{ i = 1 }^{ H } \mathfrak{v}^{{\theta}}_i \left\{ {{ \mathfrak{b}^{{\theta}}_i + \sum _{ j = 1 }^d \mathfrak{w}^{{\theta}}_{ i, j } x_j , 0 }} \right\} \end{equation}$

(2.3)

and $\mathcal{L} (\theta) = \int_{[\mathscr{a}, \mathscr{b}] ^d} (f (y) - \mathscr{N} ^{ {\theta} } (y))^2 \mathfrak{p} (y) \, \lambda (\mathrm{d} y)$ , for every $r \in \mathbb{N}$ let $\mathfrak{L}_r \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}$ satisfy for all $\theta \in \mathbb{R}^{ \mathfrak{d} }$ that

$\begin{equation} \mathfrak{L}_r( \theta ) = \int_{ [ \mathscr{a} , \mathscr{b} ] ^d } \bigl( f(y) - \mathfrak{c}^{{\theta}} - \sum _{ i = 1 }^{ H } \mathfrak{v}^{{\theta}}_i \left[ {{ \mathfrak{R}_r \left( {{ \mathfrak{b}^{{\theta}}_i + \sum _{j = 1}^d \mathfrak{w}^{{\theta}}_{i , j } y _j }} \right) }} \right] \bigr)^2 \, \mathfrak{p}( y ) \, \lambda( \mathrm{d} y ) , \end{equation}$

(2.4)

for every $\varepsilon \in (0, \infty)$ , $\theta \in \mathbb{R}^ \mathfrak{d}$ let $B_\varepsilon (\theta) \subseteq \mathbb{R}^ \mathfrak{d}$ satisfy $B_\varepsilon (\theta) = \left\{ {{\vartheta \in \mathbb{R}^ \mathfrak{d} \colon ||\theta - \vartheta || < \varepsilon }} \right\}$ , for every $\theta \in \mathbb{R}^{ \mathfrak{d} }$ , $i \in \left\{ {{ 1, 2, \ldots, H }} \right\}$ let $I_i^\theta \subseteq \mathbb{R}^d$ satisfy

$\begin{equation} I_i^{ \theta } = \left\{ {{ x = ( x_1, \ldots, x_d ) \in [ \mathscr{a}, \mathscr{b} ]^d \colon \mathfrak{b}^{{ \theta }}_i + \sum _{ j = 1 }^d \mathfrak{w}^{{ \theta }}_{ i, j } x_d > 0 }} \right\} , \end{equation}$

(2.5)

for every $\theta \in \mathbb{R}^ \mathfrak{d}$ let $\mathbf{D} ^\theta \subseteq \mathbb{N}$ satisfy

$\begin{equation} \mathbf{D}^{ \theta } = \left\{ {{ i \in \left\{ {{ 1, 2, \dots, H }} \right\} \colon | \mathfrak{b}^{{ \theta }}_i | + \sum _{ j = 1 }^d | \mathfrak{w}^{{ \theta }}_{ i, j } | = 0 }} \right\} , \end{equation}$

(2.6)

and let $\mathcal{G} = (\mathcal{G}_1, \dots, \mathcal{G}_ \mathfrak{d}) \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ \mathfrak{d} }$ satisfy for all $\theta \in \left\{ {{ \vartheta \in \mathbb{R}^ \mathfrak{d} \colon ((\nabla \mathfrak{L}_r)(\vartheta))_{r \in \mathbb{N}}\; \mathit{\text{is convergent}} }} \right\}$ that $\mathcal{G}(\theta) = \lim_{r \to \infty} (\nabla \mathfrak{L}_r) (\theta)$ .

Next we add some explanations regarding the mathematical framework presented in Setting 2.1 above. In Setting 2.1

● the natural number $d \in \mathbb{N}$ represents the number of neurons on the input layer of the considered ANNs,

● the natural number $H \in \mathbb{N}$ represents the number of neurons on the hidden layer of the considered ANNs, and

● the natural number $\mathfrak{d} \in \mathbb{N}$ measures the overall number of parameters of the considered ANNs

(cf. (1.1) and above). The real numbers $\mathscr{a} \in \mathbb{R}$ , $\mathscr{b} \in (\mathscr{a}, \infty)$ in Setting 2.1 are employed to specify the $d$ -dimensional set $[\mathscr{a}, \mathscr{b}]^d \subseteq \mathbb{R}^d$ in which the input data of the supervised learning problem considered in Setting 2.1 takes values in and which, thereby, also serves as the domain of definition of the target function of the considered supervised learning problem.

In Setting 2.1 the function $f \colon [\mathscr{a}, \mathscr{b}]^d \to \mathbb{R}$ represents the target function of the considered supervised learning problem. In Setting 2.1 the target function $f$ is assumed to be an element of the set $C([\mathscr{a}, \mathscr{b}]^d, \mathbb{R})$ of continuous functions from the $d$ -dimensional set $[\mathscr{a}, \mathscr{b}]^d$ to the reals $\mathbb{R}$ (first line in Setting 2.1).

The matrix valued function $\mathfrak{w} = ((\mathfrak{w}^{{ \theta }}_{ i, j })_{ (i, j) \in \left\{ {{ 1, \ldots, H }} \right\} \times \left\{ {{ 1, \ldots, d }} \right\} })_{ \theta \in \mathbb{R}^{ \mathfrak{d} } } \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ H \times d }$ in Setting 2.1 is used to represent the inner weight parameters of the ANNs considered in Setting 2.1. In particular, in Setting 2.1 we have for every $\theta \in \mathbb{R}^{ \mathfrak{d} }$ that the $H \times d$ -matrix $\mathfrak{w}^{{ \theta }} = (\mathfrak{w}^{{ \theta }}_{ i, j })_{ (i, j) \in \left\{ {{ 1, \ldots, H }} \right\} \times \left\{ {{ 1, \ldots, d }} \right\} } \in \mathbb{R}^{ H \times d }$ represents the weight parameter matrix for the affine linear transformation from the $d$ -dimensional input layer to the $H$ -dimensional hidden layer of the ANN associated to the ANN parameter vector $\theta \in \mathbb{R}^{ \mathfrak{d} }$ (cf. (2.1), (2.3), and Figure 1).

The vector valued function $\mathfrak{b} = ((\mathfrak{b}^{{ \theta }}_1, \dots, \mathfrak{b}^{{ \theta }}_{ H }))_{ \theta \in \mathbb{R}^{ \mathfrak{d} } } \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ H }$ in Setting 2.1 is used to represent the inner bias parameters of the ANNs considered in Setting 2.1. In particular, in Setting 2.1 we have for every $\theta \in \mathbb{R}^{ \mathfrak{d} }$ that the $d$ -dimensional vector $\mathfrak{b}^{{ \theta }} = (\mathfrak{b}^{{ \theta }}_1, \dots, \mathfrak{b}^{{ \theta }}_{ H }) \in \mathbb{R}^{ H }$ represents the bias parameter vector for the affine linear transformation from the $d$ -dimensional input layer to the $H$ -dimensional hidden layer of the ANN associated to the ANN parameter vector $\theta \in \mathbb{R}^{ \mathfrak{d} }$ (cf. (2.1), (2.3), and Figure 1).

The vector valued function $\mathfrak{v} = ((\mathfrak{v}^{{\theta}}_1, \dots, \mathfrak{v}^{{\theta}}_{ H }))_{ \theta \in \mathbb{R}^{ \mathfrak{d} } } \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ H }$ in Setting 2.1 is used to describe the outer weight parameters of the ANNs considered in Setting 2.1. In particular, in Setting 2.1 we have for every $\theta \in \mathbb{R}^{ \mathfrak{d} }$ that the transpose of the $H$ -dimensional vector $\mathfrak{v}^{{ \theta }} = (\mathfrak{v}^{{\theta}}_1, \dots, \mathfrak{v}^{{\theta}}_{ H }) \in \mathbb{R}^{ H }$ represents the weight parameter matrix for the affine linear transformation from the $H$ -dimensional hidden layer to the $1$ -dimensional output layer of the ANN associated to the ANN parameter vector $\theta \in \mathbb{R}^{ \mathfrak{d} }$ (cf. (2.1), (2.3), and Figure 1).

The real valued function $\mathfrak{c} = (\mathfrak{c}^{{ \theta }})_{ \theta \in \mathbb{R}^{ \mathfrak{d} } } \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}$ in Setting 2.1 is used to represent the outer bias parameters of the ANNs considered in Setting 2.1. In particular, in Setting 2.1 we have for every $\theta \in \mathbb{R}^{ \mathfrak{d} }$ that the real number $\mathfrak{c}^{{ \theta }} \in \mathbb{R}$ describes the bias parameter for the affine linear transformation from the $H$ -dimensional hidden layer to the $1$ -dimensional output layer of the ANN associated to the ANN parameter vector $\theta \in \mathbb{R}^{ \mathfrak{d} }$ (cf. (2.1), (2.3), and Figure 1).

In Setting 2.1 we consider ANNs with the ReLU activation function $\mathbb{R} \ni x \mapsto \max\{ x, 0 \} \in \mathbb{R}$ (cf. (1.2)). The ReLU activation function fails to be differentiable and this lack of differentiability typically transfers from the activation function to the realization functions $\mathscr{N} ^{ { \theta } } \colon \mathbb{R}^d \to \mathbb{R}$ , $\theta \in \mathbb{R}^{ \mathfrak{d} }$ , of the considered ANNs and the risk function $\mathcal{L} \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}$ of the considered supervised learning problem, both, introduced in (2.3) in Setting 2.1. In general, there thus do not exist standard derivatives and standard gradients of the risk function and, in view of this, we need to introduce suitably generalized gradients of the risk function to specify the GF dynamics. As in [34, Setting 2.1 and Proposition 2.3] (cf. also ^[33,37]), we accomplish this,

● first, by approximating the ReLU activation function through appropriate continuously differentiable functions which converge pointwise to the ReLU activation function and whose derivatives converge pointwise to the left derivative of the ReLU activation function,

● then, by using these continuously differentiable approximations of the ReLU activation function to specify approximated risk functions, and,

● finally, by specifying the generalized gradient function as the pointwise limit of the standard gradients of the approximated risk functions.

In Setting 2.1 the functions $\mathfrak{R}_r \colon \mathbb{R} \to \mathbb{R}$ , $r \in \mathbb{N}$ , serves as such appropriate continuously differentiable approximations of the ReLU activation function and the hypothesis in (2.2) ensures that these functions converge pointwise to the ReLU activation function and that the derivatives of these functions converge pointwise to the left derivative of the ReLU activation function (cf. also (1.3) in Theorem 1.1 and (1.9) in Theorem 1.2). These continuously differentiable approximations of the ReLU activation function are then used in (2.4) in Setting 2.1 (cf. also (1.5) in Theorem 1.1 and (1.10) in Theorem 1.2) to introduce continuously differentiable approximated risk functions $\mathfrak{L}_r \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}$ , $r \in \mathbb{N}$ , which converge pointwise to the risk function $\mathcal{L} \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}$ (cf., e.g., [, Proposition 2.2]). Finally, the standard gradients of the approximated risk functions $\mathfrak{L}_r \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}$ , $r \in \mathbb{N}$ , are then used to introduce the generalized gradient function $\mathcal{G} = (\mathcal{G}_1, \dots, \mathcal{G}_ \mathfrak{d}) \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ \mathfrak{d} }$ in Setting 2.1. In this regard we also note that Proposition 2.2 in Subsection 2.2 below, in particular, ensures that the function $\mathcal{G} = (\mathcal{G}_1, \dots, \mathcal{G}_ \mathfrak{d}) \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ \mathfrak{d} }$ in Setting 2.1 is indeed uniquely defined.

2.2. Differentiability properties of the risk function

Proposition 2.2. Assume Setting 2.1. Then it holds for all $\theta \in \mathbb{R}^{ \mathfrak{d}}$ , $i \in \left\{ {{ 1, 2, \ldots, H }} \right\}$ , $j \in \left\{ {{ 1, 2, \ldots, d }} \right\}$ that

$\begin{equation} \begin{split} \mathcal{G}_{ ( i - 1 ) d + j } ( \theta) & = 2 \mathfrak{v}^{{\theta}}_i \int_{I_i^\theta} x_j ( \mathscr{N} ^{ {\theta} } (x) - f ( x ) ) \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ) , \\ \mathcal{G}_{ H d + i} ( \theta) & = 2 \mathfrak{v}^{{\theta}}_i \int_{I_i^\theta} (\mathscr{N} ^{ {\theta} } (x) - f ( x ) ) \mathfrak{p} ( x ) \, \lambda( \mathrm{d} x ) , \\ \mathcal{G}_{ H (d + 1 ) + i} ( \theta) & = 2 \int_{[ \mathscr{a} , \mathscr{b} ] ^d} \left[ {{ \max \left\{ {{ \mathfrak{b}^{{\theta}}_i + \sum\nolimits_{j = 1}^d \mathfrak{w}^{{\theta}}_{i, j} x_j , 0 }} \right\} }} \right] ( \mathscr{N} ^{ {\theta} }(x) - f ( x ) ) \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ), \\ \mathit{\text{and}} \qquad \mathcal{G}_{ \mathfrak{d}} ( \theta) & = 2 \int_{[ \mathscr{a} , \mathscr{b} ] ^d} (\mathscr{N} ^{ {\theta} } (x) - f ( x ) ) \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ) . \end{split} \end{equation}$

(2.7)

Proof of Proposition 2.2. Observe that, e.g., [37, Proposition 2.2] establishes 2.7. The proof of Proposition 2.2 is thus complete.

Proposition 2.3. Assume Setting 2.1 and let $U \subseteq \mathbb{R}^ \mathfrak{d}$ satisfy $U = \left\{ {{\theta \in \mathbb{R}^ \mathfrak{d} \colon \mathbf{D}^\theta = \varnothing }} \right\}$ . Then

(i) it holds that $U \subseteq \mathbb{R}^ \mathfrak{d}$ is open,

(ii) it holds that $\mathcal{L} | _U \in C^1 (U, \mathbb{R})$ , and

(iii) it holds that $\nabla (\mathcal{L} |_U) = \mathcal{G} |_U$ .

Proof of Proposition 2.3. Note that [37, Proposition 2.10,Lemmas 2.11 and 2.12] establish items (i), (ii), and (iii). The proof of Proposition 2.3 is thus complete.

2.3. Local Lipschitz continuity of active neuron regions

Lemma 2.4. Let $d \in \mathbb{N}$ , $\mathscr{a} \in \mathbb{R}$ , $\mathscr{b} \in (\mathscr{a}, \infty)$ , for every $v = (v_1, \ldots, v_{d+1}) \in \mathbb{R}^{d+1}$ let $I^v \subseteq [\mathscr{a}, \mathscr{b}]^d$ satisfy $I^v = \left\{ {{ x \in [\mathscr{a}, \mathscr{b}] ^d \colon v_{d+1} + \sum _{i = 1}^d v_i x_i > 0 }} \right\}$ , for every $n \in \mathbb{N}$ let $\lambda_n \colon \mathcal{B} (\mathbb{R}^n) \to [0, \infty]$ be the Lebesgue–Borel measure on $\mathbb{R}^n$ , let $\mathfrak{p} \colon [\mathscr{a}, \mathscr{b}] ^d \to [0, \infty)$ be bounded and measurable, and let $u \in \mathbb{R}^{d+1} \backslash \left\{ {{0}} \right\}$ . Then there exist $\varepsilon, \mathfrak{C} \in (0, \infty)$ such that for all $v, w \in \mathbb{R}^{d+1}$ with $\max \left\{ {{ ||u - v||, ||u - w || }} \right\} \le \varepsilon$ it holds that

$\begin{equation} \int_{ I^v \Delta I^w } \mathfrak{p} ( x ) \, \lambda_d ( \mathrm{d} x) \leq \mathfrak{C} ||v - w ||. \end{equation}$

(2.8)

Proof of Lemma 2.4. Observe that for all $v, w \in \mathbb{R}^{d+1}$ we have that

$\begin{equation} \int_{ I^v \Delta I^w } \mathfrak{p} ( x ) \, \lambda_d ( \mathrm{d} x) \leq \left( {{\sup \nolimits_{x \in [ \mathscr{a} , \mathscr{b} ]^d} \mathfrak{p} ( x ) }} \right) \lambda_d ( I^v \Delta I^w ). \end{equation}$

(2.9)

Moreover, note that the fact that for all $y \in \mathbb{R}$ it holds that $y \geq - |y|$ ensures that for all $v = (v_1, \ldots, v_{d+1}) \in \mathbb{R}^{d+1}$ , $i \in \left\{ {{1, 2, \ldots, d+1 }} \right\}$ with $||u-v|| < |u_i|$ it holds that

$\begin{equation} u_i v_i = (u_i)^2 + ( v_i - u_i ) u_i \geq |u_i|^2 - |u_i - v_i| |u_i| \geq |u_i|^2 - ||u-v|| |u_i| > 0. \end{equation}$

(2.10)

Next observe that for all $v_1, v_2, w_1, w_2 \in \mathbb{R}$ with $\min \left\{ {{|v_1|, |w_1|}} \right\} > 0$ it holds that

$\begin{equation} \left| \tfrac{v_2}{v_1} - \tfrac{w_2}{w_1} \right| = \tfrac{|v_2 w_1 - w_2 v_1|}{|v_1 w_1 |} = \tfrac{| v_2 ( w_1 - v_1 ) + v_1 ( v_2 - w_2 ) |}{|v_1 w_1 |} \le \left[ {{ \tfrac{|v_2| + |v_1|}{|v_1 w_1|}}} \right] \left[ {{ |v_1 - w_1| + |v_2 - w_2 |}} \right] . \end{equation}$

(2.11)

Combining this and 2.10 demonstrates for all $v = (v_1, \ldots, v_{d+1})$ , $w = (w_1, \ldots, w_{d+1}) \in \mathbb{R}^{d+1}$ , $i \in \left\{ {{1, 2, \ldots, d}} \right\}$ with $\max \left\{ {{||v-u||, ||w-u||}} \right\} < |u_1|$ that $v_1 w_1 > 0$ and

$\begin{equation} \left| \tfrac{v_i}{v_1} - \tfrac{w_i}{w_1} \right| \le \left[ {{\tfrac{2 ||v||}{ |v_1 w_1 |}}} \right] [2 ||v - w || ] \le \left[ {{\tfrac{4 ||v - u || + 4 ||u||}{ |v_1 w_1 |}}} \right] ||v - w || . \end{equation}$

(2.12)

Hence, we obtain for all $v = (v_1, \ldots, v_{d+1})$ , $w = (w_1, \ldots, w_{d+1}) \in \mathbb{R}^{d+1}$ , $i \in \left\{ {{1, 2, \ldots, d}} \right\}$ with $\max \left\{ {{||v-u||, ||w-u||}} \right\} \le \frac{|u_1|}{2}$ and $|u_1| > 0$ that $v_1 w_1 > 0$ and

$\begin{equation} \left| \tfrac{v_i}{v_1} - \tfrac{w_i}{w_1} \right| \le \tfrac{(2 |u_1| + 4 ||u|| ) ||v - w || }{|u_1 + (v_1 - u_1)||u_1 + (w_1 - u_1 ) |} \le \tfrac{6 ||u|| ||v - w ||}{ ( |u_1| - ||v-u|| ) ( |u_1| - ||w - u || )} \le \left[ {{ \tfrac{24 ||u||}{ |u_1|^2}}} \right] ||v - w || . \end{equation}$

(2.13)

In the following we distinguish between the case $\max_{i \in \left\{ {{1, 2, \ldots, d }} \right\} } |u_i| = 0$ , the case $(\max_{i \in \left\{ {{1, 2, \ldots, d }} \right\} } |u_i|, d) \in (0, \infty) \times [2, \infty)$ , and the case $(\max_{i \in \left\{ {{1, 2, \ldots, d }} \right\} } |u_i|, d) \in (0, \infty) \times \left\{ {{1}} \right\}$ . We first prove 2.8 in the case

$\begin{equation} \max\nolimits_{i \in \left\{ {{1, 2, \ldots, d }} \right\} } |u_i| = 0. \end{equation}$

(2.14)

Note that (2.14) and the assumption that $u \in \mathbb{R}^{d+1} \backslash \left\{ {{ 0 }} \right\}$ imply that $| u_{d+1} | > 0$ . Moreover, observe that (2.14) shows that for all $v = (v_1, \ldots, v_{d+1}) \in \mathbb{R}^{d+1}$ , $x = (x_1, \ldots, x_d) \in I^u \Delta I^v$ we have that

$\begin{equation} \begin{split} &\left| \left( {{ \left[ {\sum\nolimits_{i = 1}^d v_i x_i } \right]+ v_{d+1} }} \right) - \left( {{ \left[ {\sum\nolimits_{i = 1}^d u_i x_i } \right]+ u_{d+1} }} \right) \right| \\ & = \left| \left[ {\sum\nolimits_{i = 1}^d v_i x_i } \right]+ v_{d+1} \right| + \left| \left[ {\sum\nolimits_{i = 1}^d u_i x_i } \right]+ u_{d+1} \right| \geq \left| \left[ {\sum\nolimits_{i = 1}^d u_i x_i } \right]+ u_{d+1} \right| = |u_{d+1} |. \end{split} \end{equation}$

(2.15)

In addition, note that for all $v = (v_1, \ldots, v_{d+1}) \in \mathbb{R}^{d+1}$ , $x = (x_1, \ldots, x_d) \in [\mathscr{a}, \mathscr{b}] ^d$ it holds that

$\begin{equation} \begin{split} & \left| \left( {{ \left[ {\sum\nolimits_{i = 1}^d v_i x_i } \right]+ v_{d+1} }} \right) - \left( {{ \left[ {\sum\nolimits_{i = 1}^d u_i x_i } \right]+ u_{d+1} }} \right) \right| \leq \left[ {\sum\nolimits_{i = 1}^d |v_i - u_i| |x_i| } \right]+ | v_{d+1} - u_{d+1} | \\ & \leq \max \left\{ {{ | \mathscr{a}|, | \mathscr{b}| }} \right\} \left[ {\sum\nolimits_{i = 1}^d |v_i - u_i| } \right]+ | v_{d+1} - u_{d+1} | \leq ( 1 + d \max \left\{ {{ | \mathscr{a} , \mathscr{b} | }} \right\} ) || v - u ||. \end{split} \end{equation}$

(2.16)

This and (2.15) prove that for all $v \in \mathbb{R}^{d+1}$ with $|| u - v || \le \frac{|u_{d+1}|}{2 + d \max \left\{ {{ | \mathscr{a}, \mathscr{b} | }} \right\}}$ we have that $I^u \Delta I^v = \varnothing$ , i.e., $I^u = I^v$ . Therefore, we get for all $v, w \in \mathbb{R}^{d+1}$ with $\max \left\{ {{ || u - v ||, ||u - w|| }} \right\} \le \frac{|u_{d+1}|}{2 + d \max \left\{ {{ | \mathscr{a}, \mathscr{b} | }} \right\}}$ that $I^v = I^w = I^u$ . Hence, we obtain for all $v, w \in \mathbb{R}^{d+1}$ with $\max \left\{ {{ || u - v ||, ||u - w|| }} \right\} \le \frac{|u_{d+1}|}{2 + d \max \left\{ {{ | \mathscr{a}, \mathscr{b} | }} \right\}}$ that $\lambda_d(I^v \Delta I^w) = 0$ . This establishes (2.8) in the case $\max_{i \in \left\{ {{1, 2, \ldots, d }} \right\} } |u_i| = 0$ . In the next step we prove 2.8 in the case

$\begin{equation} (\max\nolimits_{i \in \left\{ {{1, 2, \ldots, d }} \right\} } |u_i| , d ) \in (0, \infty ) \times [ 2 , \infty ). \end{equation}$

(2.17)

For this we assume without loss of generality that $| u_1 | > 0$ . In the following let $J_x^{v, w} \subseteq \mathbb{R}$ , $x \in [\mathscr{a}, \mathscr{b}]^{d-1}$ , $v, w \in \mathbb{R}^{d+1}$ , satisfy for all $x = (x_2, \ldots, x_d) \in [\mathscr{a}, \mathscr{b}] ^{d-1}$ , $v, w \in \mathbb{R}^{d+1}$ that $J_x^{v, w} = \left\{ {{ y \in [\mathscr{a}, \mathscr{b}] \colon (y, x_2, \ldots, x_d) \in I^v \backslash I^w }} \right\}$ . Next observe that Fubini's theorem and the fact that for all $v \in \mathbb{R}^{d+1}$ it holds that $I^v$ is measurable show that for all $v, w \in \mathbb{R}^{d+1}$ we have that

$\begin{equation} \begin{split} &\lambda_d ( I^v \Delta I^w ) = \int_{[ \mathscr{a} , \mathscr{b} ]^d } \mathbf{1}_{\smash{{I^v \Delta I^w}}} ( x ) \, \lambda_d ( \mathrm{d} x ) = \int_{[ \mathscr{a} , \mathscr{b} ] ^d } \left( {{ \mathbf{1}_{\smash{{I^v \backslash I^w}}} ( x ) + \mathbf{1}_{\smash{{I^w \backslash I^v }}} ( x ) }} \right) \, \lambda_d ( \mathrm{d} x) \\ & = \int_{[ \mathscr{a} , \mathscr{b}]^{d-1} } \int_{[ \mathscr{a} , \mathscr{b} ]} \left( {{ \mathbf{1}_{\smash{{I^v \backslash I^w}}} ( y, x_2, \ldots, x_d ) + \mathbf{1}_{\smash{{I^w \backslash I^v }}} ( y, x_2, \ldots, x_d ) }} \right) \, \lambda_1 ( \mathrm{d} y ) \, \lambda_{d-1} ( \mathrm{d} ( x_2, \ldots, x_d ) ) \\ & = \int_{[ \mathscr{a} , \mathscr{b} ] ^{d-1} } \int_{[ \mathscr{a} , \mathscr{b} ] } \left( {{ \mathbf{1}_{\smash{{J_x^{v , w}}}} ( y ) + \mathbf{1}_{\smash{{J_x^{w , v}}}} ( y ) }} \right) \, \lambda_1 ( \mathrm{d} y ) \, \lambda_{d-1} ( \mathrm{d} x ) \\ & = \int_{[ \mathscr{a} , \mathscr{b} ]^{d-1} } ( \lambda_1 ( J_x^{v , w} ) + \lambda_1 ( J_x^{w , v} ) ) \, \lambda_{d-1} ( \mathrm{d} x). \end{split} \end{equation}$

(2.18)

Furthermore, note that for all $x = (x_2, \ldots, x_d) \in [\mathscr{a}, \mathscr{b}] ^{d-1}$ , $v = (v_1, \ldots, v_{d+1})$ , $w = (w_1, \ldots, w_{d+1}) \in \mathbb{R}^{d+1}$ , $\mathfrak{s} \in \left\{ {{ - 1, 1 }} \right\}$ with $\min \left\{ {{ \mathfrak{s} v_1, \mathfrak{s} w_1 }} \right\} > 0$ it holds that

$\begin{equation} \begin{split} J_x^{v , w} & = \left\{ {{y \in [ \mathscr{a} , \mathscr{b} ] \colon (y, x_2, \ldots, x_d) \in I^v \backslash I^w }} \right\}\\ & = \left\{ {{ y \in [ \mathscr{a} , \mathscr{b} ] \colon v_1 y + \left[ {\sum\nolimits_{i = 2}^d v_i x_i } \right]+ v_{d+1} > 0 \geq w_1 y + \left[ {\sum\nolimits_{i = 2}^d w_i x_i } \right]+ w_{d+1} }} \right\}\\ & = \left\{ {{y \in [ \mathscr{a} , \mathscr{b} ] \colon - \tfrac{ \mathfrak{s} }{v_1} \left( {{ \left[ {\sum\nolimits_{i = 2}^d v_i x_i } \right]+ v_{d+1} }} \right) < \mathfrak{s} y \leq -\tfrac{ \mathfrak{s} }{w_1} \left( {{ \left[ {\sum\nolimits_{i = 2}^d w_i x_i } \right]+ w_{d+1} }} \right) }} \right\} . \end{split} \end{equation}$

(2.19)

Hence, we obtain for all $x = (x_2, \ldots, x_d) \in [\mathscr{a}, \mathscr{b}] ^{d-1}$ , $v = (v_1, \ldots, v_{d+1})$ , $w = (w_1, \ldots, w_{d+1}) \in \mathbb{R}^{d+1}$ , $\mathfrak{s} \in \left\{ {{ - 1, 1 }} \right\}$ with $\min \left\{ {{ \mathfrak{s} v_1, \mathfrak{s} w_1 }} \right\} > 0$ that

$\begin{equation} \begin{split} \lambda_1 ( J_x^{v , w} ) &\leq \left| \tfrac{ \mathfrak{s} }{v_1} \left( {{ \left[ {\sum\nolimits_{i = 2}^d v_i x_i } \right]+ v_{d+1} }} \right) -\tfrac{ \mathfrak{s} }{w_1} \left( {{ \left[ {\sum\nolimits_{i = 2}^d w_i x_i } \right]+ w_{d+1} }} \right) \right| \\ &\leq \left[ {\sum\nolimits_{i = 2}^d \left| \tfrac{v_i}{v_1} - \tfrac{w_i}{w_1} \right| |x_i|} \right] + \left| \tfrac{v_{d+1}}{v_1} - \tfrac{w_{d+1}}{w_1} \right| \\ &\leq \max \left\{ {{ | \mathscr{a}| , | \mathscr{b}| }} \right\} \left[ {\sum\nolimits_{i = 2}^d \left| \tfrac{v_i}{v_1} - \tfrac{w_i}{w_1} \right| } \right] + \left| \tfrac{v_{d+1}}{v_1} - \tfrac{w_{d+1}}{w_1} \right|. \end{split} \end{equation}$

(2.20)

Furthermore, observe that (2.10) demonstrates for all $v = (v_1, \ldots, v_{d+1}) \in \mathbb{R}^{d+1}$ with $||u - v|| < |u_1|$ that $u_1 v_1 > 0$ . This implies that for all $v = (v_1, \ldots, v_{d+1})$ , $w = (w_1, \ldots, w_{d+1}) \in \mathbb{R}^{d+1}$ with $\max \left\{ {{||u - v||, ||u - w|| }} \right\} < |u_1|$ there exists $\mathfrak{s} \in \left\{ {{-1, 1 }} \right\}$ such that $\min \left\{ {{ \mathfrak{s} v_1, \mathfrak{s} w_1 }} \right\} > 0$ . Combining this and (2.13) with (2.20) proves that there exists $\mathfrak{C} \in \mathbb{R}$ such that for all $x \in [\mathscr{a}, \mathscr{b}]^{d-1}$ , $v, w \in \mathbb{R}^{d+1}$ with $\max \left\{ {{ || u - v ||, ||u - w || }} \right\} \le \frac{ | u_1 | }{2}$ we have that $\lambda_1(J_x^{v, w}) + \lambda_1 (J_x^{w, v}) \leq \mathfrak{C} || v - w ||$ . This, (2.18), and (2.9) establish (2.8) in the case $(\max_{i \in \left\{ {{1, 2, \ldots, d }} \right\} } |u_i|, d) \in (0, \infty) \times [2, \infty)$ . Finally, we prove (2.8) in the case

$\begin{equation} (\max\nolimits_{i \in \left\{ {{1, 2, \ldots, d }} \right\} } |u_i| , d ) \in (0, \infty ) \times \left\{ {{ 1 }} \right\}. \end{equation}$

(2.21)

Note that (2.21) demonstrates that $|u_1| > 0$ . In addition, observe that for all $v = (v_1, v_2)$ , $w = (w_1, w_2) \in \mathbb{R}^{2}$ , $\mathfrak{s} \in \left\{ {{ - 1, 1 }} \right\}$ with $\min \left\{ {{ \mathfrak{s} v_1, \mathfrak{s} w_1 }} \right\} > 0$ it holds that

$\begin{equation} \begin{split} I^v \backslash I^w & = \left\{ {{ y \in [ \mathscr{a} , \mathscr{b} ] \colon v_1 y + v_2 > 0 \geq w_1 y + w_2}} \right\} = \left\{ {{ y \in [ \mathscr{a} , \mathscr{b} ] \colon - \tfrac{ \mathfrak{s} v_2}{v_1} < \mathfrak{s} y \leq - \tfrac{s w_2}{w_1} }} \right\} \\ & \subseteq \left\{ {{ y \in \mathbb{R} \colon - \tfrac{ \mathfrak{s} v_2}{v_1} < \mathfrak{s} y \leq - \tfrac{s w_2}{w_1} }} \right\}. \end{split} \end{equation}$

(2.22)

Therefore, we get for all $v = (v_1, v_2)$ , $w = (w_1, w_2) \in \mathbb{R}^{2}$ , $\mathfrak{s} \in \left\{ {{ - 1, 1 }} \right\}$ with $\min \left\{ {{ \mathfrak{s} v_1, \mathfrak{s} w_1 }} \right\} > 0$ that

$\begin{equation} \lambda_1 ( I^v \backslash I^w ) \leq \left| \left( {{ - \tfrac{s v_2}{v_1}}} \right) - \left( {{ - \tfrac{ \mathfrak{s} w_2}{w_1} }} \right) \right| = \left| \tfrac{v_2}{v_1} - \tfrac{w_2}{w_1} \right|. \end{equation}$

(2.23)

Furthermore, note that (2.10) ensures for all $v = (v_1, v_2) \in \mathbb{R}^2$ with $||u - v || < |u_1|$ that $u_1 v_1 > 0$ . This proves that for all $v = (v_1, v_2)$ , $w = (w_1, w_2) \in \mathbb{R}^2$ with $\max \left\{ {{||u - v ||, ||u - w|| }} \right\} < |u_1|$ there exists $\mathfrak{s} \in \left\{ {{-1, 1 }} \right\}$ such that $\min \left\{ {{ \mathfrak{s} v_1, \mathfrak{s} w_1 }} \right\} > 0$ . Combining this with (2.23) demonstrates for all $v = (v_1, v_2)$ , $w = (w_1, w_2) \in \mathbb{R}^2$ with $\max \left\{ {{||u - v ||, ||u - w || }} \right\} < |u_1|$ that $\min \left\{ {{ |v_1|, |w_1|}} \right\} > 0$ and

$\begin{equation} \lambda_1 ( I^v \Delta I^w ) = \lambda_1 ( I^v \backslash I^w ) + \lambda_1 ( I^w \backslash I^v ) \leq 2 \left| \tfrac{v_2}{v_1} - \tfrac{w_2}{w_1} \right|. \end{equation}$

(2.24)

This, (2.13), and (2.9) establish (2.8) in the case $(\max_{i \in \left\{ {{1, 2, \ldots, d }} \right\} } |u_i|, d) \in (0, \infty) \times \left\{ {{ 1 }} \right\}$ . The proof of Lemma 2.4 is thus complete.

2.4. Local Lipschitz continuity properties for the generalized gradient function

Lemma 2.5. Let $d, n \in \mathbb{N}$ , $\mathscr{a} \in \mathbb{R}$ , $\mathscr{b} \in (\mathscr{a}, \infty)$ , $x \in \mathbb{R}^n$ , $\mathfrak{C}, \varepsilon \in (0, \infty)$ , let $\phi \colon \mathbb{R}^n \times [\mathscr{a}, \mathscr{b}]^d \to \mathbb{R}$ be locally bounded and measurable, assume for all $r \in (0, \infty)$ that

$\begin{equation} \sup\nolimits_{y , z \in \mathbb{R}^n , \, ||y|| + ||z|| \le r , \, y \not = z } \sup\nolimits_{s \in [ \mathscr{a} , \mathscr{b} ] ^d } \tfrac{|\phi ( y , s ) - \phi ( z , s ) |}{ ||y-z||} < \infty, \end{equation}$

(2.25)

let $\mu \colon \mathcal{B}([\mathscr{a}, \mathscr{b}]^d) \to [0, \infty)$ be a finite measure, let $I^y \in \mathcal{B} ([\mathscr{a}, \mathscr{b}]^d)$ , $y \in \mathbb{R}^n$ , satisfy for all $y, z \in \left\{ {{ v \in \mathbb{R}^n \colon ||x-v|| \le \varepsilon }} \right\}$ that $\mu (I^y \Delta I^z) \leq \mathfrak{C} ||y - z ||$ , and let $\Phi \colon \mathbb{R}^n \to \mathbb{R}$ satisfy for all $y \in \mathbb{R}^n$ that

$\begin{equation} \Phi ( y ) = \int_{I^y} \phi (y , s ) \, \mu ( \mathrm{d} s ). \end{equation}$

(2.26)

Then there exists $\mathscr{C} \in \mathbb{R}$ such that for all $y, z \in \left\{ {{ v \in \mathbb{R}^n \colon ||x - v|| \le \varepsilon }} \right\}$ it holds that $|\Phi (y) - \Phi (z) | \leq \mathscr{C} ||y - z||$ .

Proof of Lemma 2.5. The proof is analogous to the proof of [35, Lemma 6].

Corollary 2.6. Assume Setting 2.1, let $\phi \colon \mathbb{R}^ \mathfrak{d} \times [\mathscr{a}, \mathscr{b}]^d \to \mathbb{R}$ be locally bounded and measurable, and assume for all $r \in (0, \infty)$ that

$\begin{equation} \sup\nolimits_{\theta , \vartheta \in \mathbb{R}^ \mathfrak{d} , \, ||\theta || + ||\vartheta|| \le r , \, \theta \not = \vartheta } \sup\nolimits_{x \in [ \mathscr{a} , \mathscr{b} ] ^d } \tfrac{|\phi ( \theta , x ) - \phi ( \vartheta , x ) |}{ ||\theta - \vartheta ||} < \infty. \end{equation}$

(2.27)

Then

(i) it holds that

$\begin{equation} \mathbb{R}^ \mathfrak{d} \ni \theta \mapsto \int_{[ \mathscr{a} , \mathscr{b} ] ^d} \phi (\theta , x ) \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ) \in \mathbb{R} \end{equation}$

(2.28)

is locally Lipschitz continuous and

(ii) it holds for all $i \in \left\{ {{ 1, 2, \ldots, H}} \right\}$ that

$\begin{equation} \left\{ {{\vartheta \in \mathbb{R}^ \mathfrak{d} \colon i \notin \mathbf{D}^\vartheta }} \right\} \ni \theta \mapsto \int_{I_i^\theta } \phi (\theta , x ) \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ) \in \mathbb{R} \end{equation}$

(2.29)

is locally Lipschitz continuous.

Proof of Corollary 2.6. First observe that Lemma 2.5 (applied for every $\theta \in \mathbb{R}^ \mathfrak{d}$ with $n \curvearrowleft \mathfrak{d}$ , $x \curvearrowleft \theta$ , $\mu \curvearrowleft (\mathcal{B} ([\mathscr{a}, \mathscr{b}] ^d) \ni A \mapsto \int_A \mathfrak{p} (x) \, \lambda (\, \mathrm{d} x) \in [0, \infty))$ , $(I^y)_{y \in \mathbb{R}^n} \curvearrowleft ([\mathscr{a}, \mathscr{b}]^d)_{y \in \mathbb{R}^ \mathfrak{d}}$ in the notation of Lemma 2.5) establishes item (i). In the following let $i \in \left\{ {{1, 2, \ldots, H}} \right\}$ , $\theta \in \left\{ {{\vartheta \in \mathbb{R}^ \mathfrak{d} \colon i \notin \mathbf{D}^\vartheta }} \right\}$ . Note that Lemma 2.4 shows that there exist $\varepsilon, \mathfrak{C} \in (0, \infty)$ which satisfy for all $\vartheta_1, \vartheta_2 \in \mathbb{R}^ \mathfrak{d}$ with $\max \left\{ {{ ||\theta - \vartheta_1||, ||\theta - \vartheta_2|| }} \right\} \le \varepsilon$ that

$\begin{equation} \int_{ I_i^{\vartheta_1} \Delta I_i^{\vartheta_2} } \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ) \leq \mathfrak{C} ||\vartheta_1 - \vartheta_2 || . \end{equation}$

(2.30)

Combining this with Lemma 2.5 (applied for every $\theta \in \mathbb{R}^ \mathfrak{d}$ with $n \curvearrowleft \mathfrak{d}$ , $x \curvearrowleft \theta$ , $\mu \curvearrowleft (\mathcal{B} ([\mathscr{a}, \mathscr{b}] ^d) \ni A \mapsto \int_A \mathfrak{p} (x) \, \lambda (\, \mathrm{d} x) \in [0, \infty))$ , $(I^y)_{y \in \mathbb{R}^n} \curvearrowleft (I_i^y)_{ y \in \mathbb{R}^ \mathfrak{d}}$ in the notation of Lemma 2.5) demonstrates that there exists $\mathscr{C} \in \mathbb{R}$ such that for all $\vartheta_1, \vartheta_2 \in \mathbb{R}^ \mathfrak{d}$ with $\max \left\{ {{ ||\theta - \vartheta_1||, ||\theta - \vartheta_2|| }} \right\} \le \varepsilon$ it holds that

$\begin{equation} \left| \int_{I_i^{\vartheta_1}} \phi ( \vartheta_1 , x ) \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ) - \int_{I_i^{\vartheta_2}} \phi ( \vartheta_2 , x ) \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ) \right| \leq \mathscr{C} ||\vartheta_1 - \vartheta_2 ||. \end{equation}$

(2.31)

This establishes item (ii). The proof of Corollary 2.6 is thus complete.

Corollary 2.7. Assume Setting 2.1. Then

(i) it holds for all $k \in \mathbb{N} \cap (H d + H, \mathfrak{d}]$ that

$\begin{equation} \mathbb{R}^ \mathfrak{d} \ni \theta \mapsto \mathcal{G}_k ( \theta ) \in \mathbb{R} \end{equation}$

(2.32)

is locally Lipschitz continuous,

(ii) it holds for all $i \in \left\{ {{ 1, 2, \ldots, H}} \right\}$ , $j \in \left\{ {{ 1, 2, \ldots, d}} \right\}$ that

$\begin{equation} \left\{ {{\vartheta \in \mathbb{R}^ \mathfrak{d} \colon i \notin \mathbf{D}^\vartheta }} \right\} \ni \theta \mapsto \mathcal{G}_{(i - 1 ) d + j } ( \theta ) \in \mathbb{R} \end{equation}$

(2.33)

is locally Lipschitz continuous, and

(iii) it holds for all $i \in \left\{ {{1, 2, \ldots, H}} \right\}$ that

$\begin{equation} \left\{ {{\vartheta \in \mathbb{R}^ \mathfrak{d} \colon i \notin \mathbf{D}^\vartheta }} \right\} \ni \theta \mapsto \mathcal{G}_{ H d + i } ( \theta ) \in \mathbb{R} \end{equation}$

(2.34)

is locally Lipschitz continuous.

Proof of Corollary 2.7. Observe that (2.7) and Corollary 2.6 establish items (i), (ii), and (iii). The proof of Corollary 2.7 is thus complete.

2.5. Subdifferentials

Definition 2.8 (Subdifferential). Let $n \in \mathbb{N}$ , $f \in C(\mathbb{R}^n, \mathbb{R})$ , $x \in \mathbb{R}^n$ . Then we denote by $\hat{\partial} f(x) \subseteq \mathbb{R}^n$ the set given by

$\begin{equation} \hat{\partial} f ( x ) = \left\{ {{ y \in \mathbb{R}^n \colon \liminf\limits_{ \mathbb{R}^n \backslash \left\{ {{ 0 }} \right\} \ni h \to 0 } \left( {{ \frac{f(x + h ) - f ( x ) - \left\langle {{y , h }} \right\rangle }{||h||} }} \right) \geq 0 }} \right\}. \end{equation}$

(2.35)

Definition 2.9 (Limiting subdifferential). Let $n \in \mathbb{N}$ , $f \in C(\mathbb{R}^n, \mathbb{R})$ , $x \in \mathbb{R}^n$ . Then we denote by $\partial f(x) \subseteq \mathbb{R}^n$ the set given by

$\begin{equation} \partial f ( x ) = \bigcap_{\varepsilon \in (0, \infty ) } \overline{\left[ {{ \bigcup\nolimits_{y \in \left\{ {{z \in \mathbb{R}^n \colon ||x-z|| < \varepsilon }} \right\}} \hat{\partial} f ( y ) }} \right] } \end{equation}$

(2.36)

(cf. Definition 2.8).

Lemma 2.10. Let $n \in \mathbb{N}$ , $f \in C(\mathbb{R}^n, \mathbb{R})$ , $x \in \mathbb{R}^n$ . Then

$\begin{eqnarray} \partial f ( x ) = \bigl\{ y \in \mathbb{R}^n \colon \exists \, z = (z_1, z_2) \colon \mathbb{N} \to \mathbb{R}^n \times \mathbb{R}^n \colon \bigl( \left[ {{\forall \, k \in \mathbb{N} \colon z_2 ( k ) \in \hat{\partial} f(z_1(k))}} \right] , \\ \left[ {{\limsup\nolimits_{k \to \infty} ( ||z_1(k) - x || + ||z_2(k) - y || ) = 0}} \right] \bigr) \bigr\} \end{eqnarray}$

(2.37)

(cf. Definitions 2.8 and 2.9).

Proof of Lemma 2.10. Note that (2.36) establishes (2.37). The proof of Lemma 2.10 is thus complete.

Lemma 2.11. Let $n \in \mathbb{N}$ , $f \in C (\mathbb{R}^n, \mathbb{R})$ , let $U \subseteq \mathbb{R}^n$ be open, assume $f | _U \in C^1 (U, \mathbb{R})$ , and let $x \in U$ . Then $\hat{\partial} f(x) = \partial f(x) = \left\{ {{ (\nabla f) (x) }} \right\}$ (cf. Definitions 2.8 and 2.9).

Proof of Lemma 2.11. This is a direct consequence of, e.g., Rockafellar & Wets [39, Exercise 8.8]. The proof of Lemma 2.11 is thus complete.

Proposition 2.12. Assume Setting 2.1 and let $\theta \in \mathbb{R}^ \mathfrak{d}$ . Then $\mathcal{G} (\theta) \in \partial \mathcal{L} (\theta)$ (cf. Definition 2.9).

Proof of Proposition 2.12. Throughout this proof let $\vartheta = (\vartheta_n)_{n \in \mathbb{N}} \colon \mathbb{N} \to \mathbb{R}^ \mathfrak{d}$ satisfy for all $n \in \mathbb{N}$ , $i \in \left\{ {{ 1, 2, \ldots, H}} \right\}$ , $j \in \left\{ {{ 1, 2, \ldots, d}} \right\}$ that $\mathfrak{w}^{{\vartheta_n}}_{i, j} = \mathfrak{w}^{{\theta}}_{i, j}$ , $\mathfrak{b}^{{\vartheta_n}}_i = \mathfrak{b}^{{\theta}}_i - \frac{1}{n} \mathbf{1}_{\smash{{ \mathbf{D}^\theta }}} (i)$ , $\mathfrak{v}^{{\vartheta_n}} _ i = \mathfrak{v}^{{\theta}}_i$ , and $\mathfrak{c}^{{\vartheta_n}} = \mathfrak{c}^{{\theta}}$ . We prove Proposition 2.12 through an application of Lemma 2.10. Observe that for all $n \in \mathbb{N}$ , $i \in \left\{ {{ 1, 2, \ldots, H }} \right\} \backslash \mathbf{D}^\theta$ it holds that $\mathfrak{b}^{{\vartheta_n}}_i = \mathfrak{b}^{{\theta}}_i$ . This implies for all $n \in \mathbb{N}$ , $i \in \left\{ {{ 1, 2, \ldots, H }} \right\} \backslash \mathbf{D}^\theta$ that

$\begin{equation} i \notin \mathbf{D}^{\vartheta_n}. \end{equation}$

(2.38)

In addition, note that for all $n \in \mathbb{N}$ , $i \in \mathbf{D}^\theta$ it holds that $\mathfrak{b}^{{\vartheta_n}}_i = - \frac{1}{n} < 0$ . This shows for all $n \in \mathbb{N}$ , $i \in \mathbf{D}^\theta$ that

$\begin{equation} i \notin \mathbf{D}^{\vartheta_n}. \end{equation}$

(2.39)

Hence, we obtain for all $n \in \mathbb{N}$ that $\mathbf{D}^{\vartheta_n} = \varnothing$ . Combining this with Proposition 2.3 and Lemma 2.11 demonstrates that for all $n \in \mathbb{N}$ it holds that $\hat{\partial} \mathcal{L} (\vartheta_n) = \left\{ {{ (\nabla \mathcal{L}) (\vartheta_n) }} \right\} = \left\{ {{ \mathcal{G} (\vartheta_n) }} \right\}$ (cf. Definition 2.8). Moreover, observe that $\lim_{n \to \infty} \vartheta_n = \theta$ . It thus remains to show that $\mathcal{G} (\vartheta _n)$ , $n \in \mathbb{N}$ , converges to $\mathcal{G} (\theta)$ . Note that Corollary 2.7 ensures that for all $k \in \mathbb{N} \cap (H d + H, \mathfrak{d}]$ it holds that

$\begin{equation} \lim\nolimits_{n \to \infty} \mathcal{G}_{k } ( \vartheta _n ) = \mathcal{G} _ {k } ( \theta ). \end{equation}$

(2.40)

Furthermore, observe that Corollary 2.7, (2.38) and (2.39) assure that for all $i \in \left\{ {{ 1, 2, \ldots, H}} \right\} \backslash \mathbf{D}^\theta$ , $j \in \left\{ {{ 1, 2, \ldots, d }} \right\}$ it holds that

$\begin{equation} \lim\nolimits_{n \to \infty} \mathcal{G}_{(i - 1 ) d + j} ( \vartheta _n ) = \mathcal{G} _ {(i - 1 ) d + j } ( \theta ) \qquad\text{and}\qquad \lim\nolimits_{n \to \infty} \mathcal{G}_{ H d + i } ( \vartheta_n ) = \mathcal{G}_{ H d + i } ( \theta ). \end{equation}$

(2.41)

In addition, note that for all $n \in \mathbb{N}$ , $i \in \mathbf{D}^\theta$ we have that $I_i^{\vartheta_n } = I_i^\theta = \varnothing$ . Hence, we obtain for all $i \in \mathbf{D}^\theta$ , $j \in \left\{ {{ 1, 2, \ldots, d}} \right\}$ that

$\begin{equation} \lim\nolimits_{n \to \infty} \mathcal{G}_{(i - 1 ) d + j} ( \vartheta_n ) = 0 = \mathcal{G}_{(i - 1 ) d + j} ( \theta ) \qquad\text{and}\qquad \lim\nolimits_{n \to \infty} \mathcal{G} _{ H d + i } ( \vartheta_n ) = 0 = \mathcal{G}_{ H d + i } ( \theta ). \end{equation}$

(2.42)

Combining this, (2.40) and (2.41) demonstrates that $\lim_{n \to \infty} \mathcal{G} (\vartheta _ n) = \mathcal{G} (\theta)$ . This and Lemma 2.10 assure that $\mathcal{G} (\theta) \in \partial \mathcal{L} (\theta)$ . The proof of Proposition 2.12 is thus complete.

3. Existence and uniqueness properties for solutions of gradient flows (GFs)

In this section we employ the local Lipschitz continuity result for the generalized gradient function in Corollary 2.7 from Section 2 to establish existence and uniqueness results for solutions of GF differential equations. Specifically, in Proposition 3.1 in Subsection 3.1 below we prove the existence of solutions GF differential equations, in Lemma 3.2 in Subsection 3.2 below we establish the uniqueness of solutions of GF differential equations among a suitable class of GF solutions, and in Theorem 3.3 in Subsection 3.3 below we combine Proposition 3.1 and Lemma 3.2 to establish the unique existence of solutions of GF differential equations among a suitable class of GF solutions. Theorem 1.1 in the introduction is an immediate consequence of Theorem 3.3.

Roughly speaking, we show in Theorem 3.3 the unique existence of solutions of GF differential equations among the class of GF solutions which satisfy that the set of all degenerate neurons of the GF solution at time $t \in [0, \infty)$ is non-decreasing in the time variable $t \in [0, \infty)$ . In other words, in Theorem 3.3 we prove the unique existence of GF solutions with the property that once a neuron has become degenerate it will remain degenerate for subsequent times.

Our strategy of the proof of Theorem 3.3 and Proposition 3.1, respectively, can, loosely speaking, be described as follows. Corollary 2.7 above implies that the components of the generalized gradient function $\mathcal{G} \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ \mathfrak{d} }$ corresponding to non-degenerate neurons are locally Lipschitz continuous so that the classical Picard-Lindelöf local existence and uniqueness theorem for ordinary differential equations can be brought into play for those components. On the other hand, if at some time $t \in [0, \infty)$ the $i$ -th neuron is degenerate, then Proposition 2.2 above shows that the corresponding components of the generalized gradient function $\mathcal{G} \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ \mathfrak{d} }$ vanish. The GF differential equation is thus satisfied if the neuron remains degenerate at all subsequent times $s \in [t, \infty)$ . Using these arguments we prove in Proposition 3.1 the existence of GF solutions by induction on the number of non-degenerate neurons of the initial value.

3.1. Existence properties for solutions of GF differential equations

Proposition 3.1. Assume Setting 2.1 and let $\theta \in \mathbb{R}^ \mathfrak{d}$ . Then there exists $\Theta \in C([0, \infty), \mathbb{R}^ \mathfrak{d})$ which satisfies for all $t \in [0, \infty)$ , $s \in [t, \infty)$ that

$\begin{equation} \Theta_t = \theta - \int_0^t \mathcal{G} ( \Theta_u ) \, \mathrm{d} u \qquad\mathit{\text{and}}\qquad \mathbf{D}^{\Theta_t} \subseteq \mathbf{D}^{\Theta_s}. \end{equation}$

(3.1)

Proof of Proposition 3.1. We prove the statement by induction on the quantity $H - \# (\mathbf{D}^\theta) \in \mathbb{N} \cap [0, H]$ . Assume first that $H - \# (\mathbf{D}^\theta) = 0$ , i.e., $\mathbf{D}^\theta = \left\{ {{1, 2, \ldots, H}} \right\}$ . Observe that this implies that $\mathfrak{w}^{{\theta}} = 0$ and $\mathfrak{b}^{{\theta}} = 0$ . In the following let $\kappa \in \mathbb{R}$ satisfy

$\begin{equation} \kappa = \int_{ [ \mathscr{a} , \mathscr{b} ] ^d} f ( x ) \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ). \end{equation}$

(3.2)

Note that the Picard–Lindelöf Theorem shows that there exists a unique $c \in C([0, \infty), \mathbb{R})$ which satisfies for all $t \in [0, \infty)$ that

$\begin{equation} c(0) = \mathfrak{c}^{{\theta}} \qquad \text{and} \qquad c(t) = c(0) + 2 \kappa t - 2 \left( {{\int_{ [ \mathscr{a} , \mathscr{b} ] ^d } \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ) }} \right) \left( {{ \int_0^t c(s) \, \mathrm{d} s }} \right). \end{equation}$

(3.3)

Next let $\Theta \in C([0, \infty), \mathbb{R}^ \mathfrak{d})$ satisfy for all $t \in [0, \infty)$ , $i \in \left\{ {{ 1, 2, \ldots, H }} \right\}$ , $j \in \left\{ {{1, 2, \ldots, d}} \right\}$ that

$\begin{equation} \mathfrak{w}^{{\Theta_t}}_{i , j} = \mathfrak{w}^{{\theta}}_{i, j} = \mathfrak{b}^{{\Theta_t}}_i = \mathfrak{b}^{{\theta}}_i = 0, \qquad \mathfrak{v}^{{\Theta_t}}_i = \mathfrak{v}^{{\theta}}_i, \qquad \text{and} \qquad \mathfrak{c}^{{\Theta_t}} = c ( t ) . \end{equation}$

(3.4)

Observe that (2.7), (3.3), and (3.4) ensure for all $t \in [0, \infty)$ that

$\begin{equation} \begin{split} \mathfrak{c}^{{\Theta_t }} & = \mathfrak{c}^{{\theta}} + 2 \kappa t - 2 \left( {{\int_{ [ \mathscr{a} , \mathscr{b} ] ^d } \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ) }} \right) \left( {{ \int_0^t \mathfrak{c}^{{\Theta_s }} \, \mathrm{d} s }} \right) \\ & = \mathfrak{c}^{{\theta}} - 2 \int_0^t \left( {{ - \kappa + \int_{ [ \mathscr{a} , \mathscr{b} ] ^d } \mathfrak{c}^{{\Theta_s}} \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ) }} \right) \, \mathrm{d} s \\ & = \mathfrak{c}^{{\theta}} - 2 \int_0^t \int_{ [ \mathscr{a} , \mathscr{b} ]^d } \left( {{\mathfrak{c}^{{\Theta_s}} + \sum\nolimits_{i = 1}^ H \left[ {{ \mathfrak{v}^{{\Theta_s}}_i \max \left\{ {{ \mathfrak{b}^{{\Theta_s}}_i + \sum\nolimits_{j = 1}^d \mathfrak{w}^{{\Theta_s }}_{i , j} x_j , 0 }} \right\} }} \right] - f ( x ) }} \right) \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ) \, \mathrm{d} s\\ & = \mathfrak{c}^{{\theta}} - 2\int_0^t \int_{ [ \mathscr{a} , \mathscr{b} ] ^d } ( \mathscr{N} ^{ {\Theta_s} } ( x ) - f ( x ) ) \mathfrak{p} ( x ) \, \lambda ( \mathrm{d} x ) \, \mathrm{d} s = \mathfrak{c}^{{\theta}} - \int_0^t \mathcal{G}_ \mathfrak{d} ( \Theta_s ) \, \mathrm{d} s. \end{split} \end{equation}$

(3.5)

Next note that (3.4) and (2.7) show for all $t \in [0, \infty)$ , $i \in \mathbb{N} \cap [1, \mathfrak{d})$ that $\mathbf{D}^{\Theta_t} = \left\{ {{1, 2, \ldots, H}} \right\}$ and $\mathcal{G}_i (\Theta_t) = 0$ . Combining this with (3.4) and (3.5) proves that $\Theta$ satisfies 3.1. This establishes the claim in the case $\# (\mathbf{D}^\theta) = H$ .

For the induction step assume that $\# (\mathbf{D}^\theta) < H$ and assume that for all $\vartheta \in \mathbb{R}^ \mathfrak{d}$ with $\# (\mathbf{D}^\vartheta) > \# (\mathbf{D}^\theta)$ there exists $\Theta \in C([0, \infty), \mathbb{R}^ \mathfrak{d})$ which satisfies for all $t \in [0, \infty)$ , $s \in [t, \infty)$ that $\Theta_t = \vartheta - \int_0^t \mathcal{G} (\Theta_u) \, \mathrm{d} u$ and $\mathbf{D}^{\Theta_t} \subseteq \mathbf{D}^{\Theta_s}$ . In the following let $U \subseteq \mathbb{R}^ \mathfrak{d}$ satisfy

$\begin{equation} U = \left\{ {{\vartheta \in \mathbb{R}^ \mathfrak{d} \colon \mathbf{D}^\vartheta \subseteq \mathbf{D}^\theta }} \right\} \end{equation}$

(3.6)

and let $\mathfrak{G} \colon U \to \mathbb{R}^ \mathfrak{d}$ satisfy for all $\vartheta \in U$ , $i \in \left\{ {{ 1, 2, \ldots, \mathfrak{d} }} \right\}$ that

$\begin{equation} \mathfrak{G} _ i ( \vartheta ) = \begin{cases} 0 & \colon i \in \left\{ {{(\ell - 1 ) d + j \colon \ell \in \mathbf{D}^\theta , j \in \mathbb{N} \cap [1 , d] }} \right\} \cup \left\{ {{ H d + \ell \colon \ell \in \mathbf{D}^\theta }} \right\} \\ \mathcal{G}_i ( \vartheta ) & \colon \text{else}. \end{cases} \end{equation}$

(3.7)

Observe that (3.6) assures that $U \subseteq \mathbb{R}^ \mathfrak{d}$ is open. In addition, note that Corollary 2.7 implies that $\mathfrak{G}$ is locally Lipschitz continuous. Combining this with the Picard–Lindelöf Theorem demonstrates that there exist a unique maximal $\tau \in (0, \infty]$ and $\Psi \in C([0, \tau), U)$ which satisfy for all $t \in [0, \tau)$ that

$\begin{equation} \Psi_t = \theta - \int_0^t \mathfrak{G} ( \Psi_u ) \, \mathrm{d} u . \end{equation}$

(3.8)

Next observe that (3.7) ensures that for all $t \in [0, \tau)$ , $i \in \mathbf{D}^\theta$ , $j \in \left\{ {{1, 2, \ldots, d}} \right\}$ we have that

$\begin{equation} \mathfrak{w}^{{\Psi_t}}_{i , j} = \mathfrak{w}^{{\theta}}_{i , j} = \mathfrak{b}^{{\Psi_t}}_i = \mathfrak{b}^{{\theta}}_i = 0 \qquad\text{and}\qquad \mathfrak{v}^{{\Psi_t}}_i = \mathfrak{v}^{{\theta}}_i. \end{equation}$

(3.9)

This, (3.7), and (2.7) demonstrate for all $t \in [0, \tau)$ that $\mathcal{G} (\Psi_t) = \mathfrak{G} (\Psi_t)$ . In addition, note that (3.6) and (3.9) imply for all $t \in [0, \tau)$ that $\mathbf{D}^{\Psi_t} = \mathbf{D}^\theta$ . Hence, if $\tau = \infty$ then $\Psi$ satisfies (3.1). Next assume that $\tau < \infty$ . Observe that the Cauchy-Schwarz inequality and [, Lemma 3.1] prove for all $s, t \in [0, \tau)$ with $s \leq t$ that

$\begin{equation} \begin{split} ||\Psi_t - \Psi_s|| &\leq \int_s^t || \mathcal{G} ( \Psi_u ) || \, \mathrm{d} u \leq (t-s)^{1/2} \left[ {{\int_s^t || \mathcal{G} ( \Psi_u ) || ^2 \, \mathrm{d} u }} \right]^{1/2} \\ &\leq (t-s)^{1/2} \left[ {{\int_0^t || \mathcal{G} ( \Psi_u ) || ^2 \, \mathrm{d} u }} \right]^{1/2} = (t - s )^{1/2} \left( {{ \mathcal{L} ( \Psi_0 ) - \mathcal{L} ( \Psi_t )}} \right)^{1/2} \\ &\leq (t - s )^{1/2} \left( {{ \mathcal{L} ( \Psi_0 ) }} \right)^{1/2} . \end{split} \end{equation}$

(3.10)

Hence, we obtain for all $(t_n) _{n \in \mathbb{N}} \subseteq [0, \tau)$ with $\liminf_{n \to \infty} t_n = \tau$ that $(\Psi_{t_n})$ is a Cauchy sequence. This implies that $\vartheta : = \lim_{t \uparrow \tau} \Psi_t \in \mathbb{R}^ \mathfrak{d}$ exists. Furthermore, note that the fact that $\tau$ is maximal proves that $\vartheta \notin U$ . Therefore, we have that $\mathbf{D}^\vartheta \backslash \mathbf{D}^\theta \not = \varnothing$ . Moreover, observe that (3.9) shows that for all $i \in \mathbf{D}^\theta$ , $j \in \left\{ {{1, 2, \ldots, d}} \right\}$ it holds that $\mathfrak{w}^{{\vartheta}}_{i, j} = \mathfrak{b}^{{\vartheta}}_i = 0$ and, therefore, $i \in \mathbf{D}^\vartheta$ . This demonstrates that $\# (\mathbf{D}^\vartheta) > \# (\mathbf{D}^\theta)$ . Combining this with the induction hypothesis ensures that there exists $\Phi \in C([0, \infty), \mathbb{R}^ \mathfrak{d})$ which satisfies for all $t \in [0, \infty)$ , $s \in [t, \infty)$ that

$\begin{equation} \Phi_t = \vartheta - \int_0^t \mathcal{G} ( \Phi_u ) \, \mathrm{d} u \qquad\text{and}\qquad \mathbf{D}^{\Phi_t} \subseteq \mathbf{D}^{\Phi_s} . \end{equation}$

(3.11)

In the following let $\Theta \colon [0, \infty) \to \mathbb{R}^ \mathfrak{d}$ satisfy for all $t \in [0, \infty)$ that

$\begin{equation} \Theta_t = \begin{cases} \Psi_t & \colon t \in [0, \tau) \\ \Phi_{t - \tau} & \colon t \in [\tau , \infty ). \end{cases} \end{equation}$

(3.12)

Note that the fact that $\vartheta = \lim_{t \uparrow \tau} \Psi_t$ and the fact that $\Phi_0 = \vartheta$ imply that $\Theta$ is continuous. Furthermore, observe that the fact that $\mathcal{G}$ is locally bounded and (3.8) ensure that

$\begin{equation} \Theta_\tau = \vartheta = \lim\limits_{t \uparrow \tau} \Psi_t = \lim\limits_{t \uparrow \tau} \left[ {{ \theta - \int_0^t \mathcal{G} ( \Psi_s ) \, \mathrm{d} s }} \right] = \theta - \int_0^\tau \mathcal{G} ( \Psi_s ) \, \mathrm{d} s = \theta - \int_0^\tau \mathcal{G} ( \Theta_s ) \, \mathrm{d} s. \end{equation}$

(3.13)

Hence, we obtain for all $t \in [\tau, \infty)$ that

$\begin{equation} \begin{split} \Theta_t & = (\Theta_t - \Theta_\tau ) + \Theta_\tau = (\Phi_{t - \tau} - \Phi_0 ) + \Theta_\tau = - \int_0^{t - \tau} \mathcal{G} ( \Phi_s ) \, \mathrm{d} s + \theta - \int_0^\tau \mathcal{G} ( \Theta_s ) \, \mathrm{d} s \\ & = - \int_t^\tau \mathcal{G} ( \Theta_s ) + \theta - \int_0^\tau \mathcal{G} ( \Theta_s ) \, \mathrm{d} s = \theta - \int_0^t \mathcal{G} ( \Theta_s ) \, \mathrm{d} s. \end{split} \end{equation}$

(3.14)

This shows that $\Theta$ satisfies (3.1). The proof of Proposition 3.1 is thus complete.

3.2. Uniqueness properties for solutions of GF differential equations

Lemma 3.2. Assume Setting 2.1 and let $\theta \in \mathbb{R}^ \mathfrak{d}$ , $\Theta^1, \Theta^2 \in C([0, \infty), \mathbb{R}^ \mathfrak{d})$ satisfy for all $t \in [0, \infty)$ , $s \in [t, \infty)$ , $k \in \left\{ {{ 1, 2}} \right\}$ that

$\begin{equation} \Theta_t^k = \theta - \int_0^t \mathcal{G} ( \Theta_u^k ) \, \mathrm{d} u \qquad\mathit{\text{and}}\qquad \mathbf{D}^{\Theta_t^k } \subseteq \mathbf{D}^{\Theta_s^k }. \end{equation}$

(3.15)

Then it holds for all $t \in [0, \infty)$ that $\Theta_t^1 = \Theta_t^2$ .

Proof of Lemma 3.2. Assume for the sake of contradiction that there exists $t \in [0, \infty)$ such that $\Theta_t^1 \not = \Theta_t^2$ . By translating the variable $t$ if necessary, we may assume without loss of generality that $\inf \left\{ {{t \in [0, \infty) \colon \Theta_t^1 \not = \Theta_t^2}} \right\} = 0$ . Next note that the fact that $\Theta^1$ and $\Theta^2$ are continuous implies that there exists $\delta \in (0, \infty)$ which satisfies for all $t \in [0, \delta]$ , $k \in \left\{ {{ 1, 2}} \right\}$ that $\mathbf{D}^{\Theta_t^k} \subseteq \mathbf{D}^\theta$ . Furthermore, observe that 3.15 ensures for all $t \in [0, \infty)$ , $i \in \mathbf{D}^\theta$ , $k \in \left\{ {{ 1, 2}} \right\}$ that $i \in \mathbf{D}^{\Theta_t^k}$ . Hence, we obtain for all $t \in [0, \infty)$ , $i \in \mathbf{D}^\theta$ , $j \in \left\{ {{1, 2, \ldots, d}} \right\}$ , $k \in \left\{ {{ 1, 2}} \right\}$ that

$\begin{equation} \mathcal{G}_{(i - 1 ) d + j } ( \Theta_t^k ) = \mathcal{G}_{ H d + i } ( \Theta_t^k ) = \mathcal{G}_{ H ( d+1 ) + i } ( \Theta _t^k ) = 0 . \end{equation}$

(3.16)

In addition, note that the fact that $\Theta^1$ and $\Theta^2$ are continuous implies that there exists a compact $K \subseteq \left\{ {{\vartheta \in \mathbb{R}^ \mathfrak{d} \colon \mathbf{D}^\vartheta \subseteq \mathbf{D}^\theta }} \right\}$ which satisfies for all $t \in [0, \delta]$ , $k \in \left\{ {{1, 2}} \right\}$ that $\Theta_t^k \in K$ . Moreover, observe that Corollary 2.7 proves that for all $i \in \left\{ {{1, 2, \ldots, H}} \right\} \backslash \mathbf{D}^\theta$ , $j \in \left\{ {{1, 2, \ldots, d}} \right\}$ it holds that $\mathcal{G}_{(i - 1) d + j }, \mathcal{G}_{ H d + i }, \mathcal{G}_{ H (d+1) + i }, \mathcal{G} _ \mathfrak{d} \colon K \to \mathbb{R}$ are Lipschitz continuous. This and (3.16) show that there exists $L \in (0, \infty)$ such that for all $t \in [0, \delta]$ we have that

$\begin{equation} || \mathcal{G} ( \Theta_t^1 ) - \mathcal{G} ( \Theta_t^2 ) || \leq L ||\Theta_t^1 - \Theta_t^2||. \end{equation}$

(3.17)

In the following let $M \colon [0, \infty) \to [0, \infty)$ satisfy for all $t \in [0, \infty)$ that $M_t = \sup_{s \in (0, t] } ||\Theta_s^1 - \Theta_s^2||$ . Note that the fact that $\inf \left\{ {{t \in [0, \infty) \colon \Theta_t^1 \not = \Theta_t^2}} \right\} = 0$ proves for all $t \in (0, \infty)$ that $M_t > 0$ . Moreover, observe that (3.17) ensures for all $t \in (0, \delta)$ that

$\begin{equation} \begin{split} ||\Theta_t^1 - \Theta_t^2|| & = \left\| {{ \int_0^t \mathcal{G} ( \Theta_u^1) \, \mathrm{d} u - \int_0^t \mathcal{G} ( \Theta_u^2 ) \, \mathrm{d} u }} \right\| \leq \int_0^t || \mathcal{G} ( \Theta_u^1) - \mathcal{G} ( \Theta_u^2 ) || \, \mathrm{d} u \\ &\leq L \int_0^t ||\Theta_u^1 - \Theta_u^2|| \, \mathrm{d} u \leq L t M_t. \end{split} \end{equation}$

(3.18)

Combining this with the fact that $M$ is non-decreasing shows for all $t \in (0, \delta)$ , $s \in (0, t]$ that

$\begin{equation} ||\Theta_s^1 - \Theta_s^2|| \leq L s M_s \leq L t M_t. \end{equation}$

(3.19)

This demonstrates for all $t \in (0, \min \left\{ {{L^{-1}, \delta }} \right\})$ that

$\begin{equation} 0 < M_t \leq Lt M_t < M_t, \end{equation}$

(3.20)

which is a contradiction. The proof of Lemma 3.2 is thus complete.

3.3. Existence and uniqueness properties for solutions of GF differential equations

Theorem 3.3. Assume Setting 2.1 and let $\theta \in \mathbb{R}^ \mathfrak{d}$ . Then there exists a unique $\Theta \in C([0, \infty), \mathbb{R}^ \mathfrak{d})$ which satisfies for all $t \in [0, \infty)$ , $s \in [t, \infty)$ that

$\begin{equation} \Theta_t = \theta - \int_0^t \mathcal{G} ( \Theta_u ) \, \mathrm{d} u \qquad\mathit{\text{and}}\qquad \mathbf{D}^{\Theta_t} \subseteq \mathbf{D}^{\Theta_s }. \end{equation}$

(3.21)

Proof of Theorem 3.3. Proposition 3.1 establishes the existence and Lemma 3.2 establishes the uniqueness. The proof of Theorem 3.3 is thus complete.

4. Semialgebraic sets and functions

In this section we establish in Corollary 4.10 in Subsection 4.3 below that under the assumption that both the target function $f \colon [\mathscr{a}, \mathscr{b}] ^d \to \mathbb{R}$ and the unnormalized density function $\mathfrak{p} \colon [\mathscr{a}, \mathscr{b}]^d \to [0, \infty)$ are piecewise polynomial in the sense of Definition 4.9 in Subsection 4.3 we have that the risk function $\mathcal{L} \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}$ is a semialgebraic function in the sense of Definition 4.3 in Subsection 4.1. In Definition 4.9 we specify precisely what we mean by a piecewise polynomial function, in Definition 4.2 in Subsection 4.1 we recall the notion of a semialgebraic set, and in Definition 4.3 we recall the notion of a semialgebraic function. In the scientific literature Definitions 4.2 and 4.3 can in a slightly different presentational form, e.g., be found in Bierstone & Milman [40, Definitions 1.1 and 1.2] and Attouch et al. [8, Definition 2.1].

Note that the risk function $\mathcal{L} \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}$ is given through a parametric integral in the sense that for all $\theta \in \mathbb{R}^{ \mathfrak{d} }$ we have that

$\begin{equation} \mathcal{L}( \theta ) = \int_{[ \mathscr{a} , \mathscr{b} ] ^d} ( f ( y ) - \mathscr{N} ^{ {\theta} } (y) )^2 \, \mathfrak{p} ( y ) \, \lambda ( \mathrm{d} y ) . \end{equation}$

(4.1)

In general, parametric integrals of semialgebraic functions are no longer semialgebraic functions and the characterization of functions that can occur as such integrals is quite involved (cf. Kaiser ^[41]). This is the reason why we introduce in Definition 4.6 in Subsection 4.2 below a suitable subclass of the class of semialgebraic functions which is rich enough to contain the realization functions of ANNs with ReLU activation (cf. (4.30) in Subsection 4.2 below) and which can be shown to be closed under integration (cf. Proposition 4.8 in Subsection 4.2 below for the precise statement).

4.1. Semialgebraic sets and functions

Definition 4.1 (Set of polynomials). Let $n \in \mathbb{N}_0$ . Then we denote by $\mathscr{P}_n \subseteq C(\mathbb{R}^n, \mathbb{R})$ the set^† of all polynomials from $\mathbb{R}^n$ to $\mathbb{R}$ .

^†Note that $\mathbb{R}^0 = \{ 0 \}$ , $C(\mathbb{R}^0, \mathbb{R}) = C(\{ 0 \}, \mathbb{R})$ , and $\#(C(\mathbb{R}^0, \mathbb{R})) = \#(C(\{ 0 \}, \mathbb{R})) = \infty$ . In particular, this shows for all $n \in \mathbb{N}_0$ that $\operatorname{dim}(\mathbb{R}^n) = n$ and $\#(C(\mathbb{R}^n, \mathbb{R})) = \infty$ .

Definition 4.2 (Semialgebraic sets). Let $n \in \mathbb{N}$ and let $A \subseteq \mathbb{R}^n$ be a set. Then we say that $A$ is a semialgebraic set if and only if there exist $k \in \mathbb{N}$ and $(P_{i, j, \ell })_{ (i, j, \ell) \in \left\{ {{1, 2, \ldots, k}} \right\} ^2 \times \left\{ {{0, 1}} \right\}} \subseteq \mathscr{P}_n$ such that

$\begin{equation} A = \bigcup\limits_{i = 1}^k \bigcap\limits_{j = 1}^k \bigl\{ x \in \mathbb{R}^n \colon P_{i, j, 0} ( x ) = 0 < P_{i , j , 1} ( x ) \bigr\} \end{equation}$

(4.2)

(cf. Definition 4.1).

Definition 4.3 (Semialgebraic functions). Let $m, n \in \mathbb{N}$ and let $f \colon \mathbb{R}^n \to \mathbb{R}^m$ be a function. Then we say that $f$ is a semialgebraic function if and only if it holds that $\left\{ {{ (x, f (x)) \colon x \in \mathbb{R}^n }} \right\} \subseteq \mathbb{R}^{m+n}$ is a semialgebraic set (cf. Definition 4.2).

Lemma 4.4. Let $n \in \mathbb{N}$ and let $f, g \colon \mathbb{R}^n \to \mathbb{R}$ be semialgebraic functions (cf. Definition 4.3). Then

(i) it holds that $\mathbb{R}^n \ni x \mapsto f(x) + g(x) \in \mathbb{R}$ is semialgebraic and

(ii) it holds that $\mathbb{R}^n \ni x \mapsto f(x) g(x) \in \mathbb{R}$ is semialgebraic.

Proof of Lemma 4.4. Note that, e.g., Coste [42, Corollary 2.9] (see, e.g., also Bierstone & Milman [40, Section 1]) establishes items (i) and (ii). The proof of Lemma 4.4 is thus complete.

4.2. On the semialgebraic property of certain parametric integrals

Definition 4.5 (Set of rational functions). Let $n \in \mathbb{N}$ . Then we denote by $\mathscr{R}_n$ the set given by

$\begin{equation} \mathscr{R}_n = \left\{ {{R \colon \mathbb{R}^n \to \mathbb{R} \colon \left[ {{ \exists \, P, Q \in \mathscr{P}_n \colon \forall \, x \in \mathbb{R}^n \colon R(x) = \begin{cases} \frac{P(x) }{ Q ( x ) } & \colon Q ( x ) \not = 0 \\[0.5ex] 0 & \colon Q ( x ) = 0 \end{cases} }} \right]}} \right\} \end{equation}$

(4.3)

(cf. Definition 4.1).

Definition 4.6. Let $m \in \mathbb{N}$ , $n \in \mathbb{N}_0$ . Then we denote by $\mathscr{A}_{m, n}$ the $\mathbb{R}$ -vector space given by

$\begin{eqnarray} \mathscr{A}_{m, n} = \operatorname{span} \Bigl( \Bigl\{ f \colon \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R} \colon \Bigl[ \exists \, r \in \mathbb{N}, \, A_1, A_2, \ldots, A_r \in \left\{ {{ \left\{ {{0}} \right\}, [0, \infty ), (0, \infty )}} \right\}, \\ R \in \mathscr{R}_m , \, Q \in \mathscr{P}_n, \, P = (P_{i, j})_{ (i, j) \in \left\{ {{1, 2, \ldots, r }} \right\} \times \left\{ {{0, 1, \ldots, n }} \right\}} \subseteq \mathscr{P}_m \colon \forall \, \theta \in \mathbb{R}^m , \, x = (x_1, \ldots, x_n) \in \mathbb{R}^n \colon \\ f ( \theta , x ) = R ( \theta ) Q ( x ) \left[ {{ \prod\nolimits_{i = 1}^r \mathbf{1}_{\smash{{A_i}}} \left( {{ P_{i, 0} ( \theta ) + \sum\nolimits_{j = 1}^n P_{i, j} ( \theta ) x_j }} \right) }} \right] \Bigr] \Bigr\} \Bigr) \end{eqnarray}$

(4.4)

(cf. Definitions 4.1 and 4.5).

Lemma 4.7. Let $m \in \mathbb{N}$ , $f \in \mathscr{A}_{m, 0 }$ (cf. Definition 4.6). Then $f$ is semialgebraic (cf. Definition 4.3).

Proof of Lemma 4.7. Throughout this proof let $r \in \mathbb{N}$ , $A_1, A_2, \ldots, A_r \in \left\{ {{ \left\{ {{0}} \right\}, [0, \infty), (0, \infty)}} \right\}$ , $R \in \mathscr{R}_m$ , $P = (P_i)_{ i \in \left\{ {{1, 2, \ldots, r }} \right\}} \subseteq \mathscr{P}_m$ , and let $g \colon \mathbb{R}^m \to \mathbb{R}$ satisfy for all $\theta \in \mathbb{R}^m$ that

$\begin{equation} g(\theta) = R ( \theta ) \prod _{i = 1}^r \mathbf{1}_{\smash{{A_i}}} \left( {{ P_i ( \theta ) }} \right) \end{equation}$

(4.5)

(cf. Definitions 4.1 and 4.5). Due to the fact that sums of semialgebraic functions are again semialgebraic (cf. Lemma 4.4), it suffices to show that $g$ is semialgebraic. Furthermore, observe that for all $y \in \mathbb{R}$ it holds that $\mathbf{1}_{\smash{{(0, \infty)}}} (y) = 1 - \mathbf{1}_{\smash{{[0, \infty) }}} (- y)$ and $\mathbf{1}_{\smash{{\left\{ {{0}} \right\}}}} (y) = \mathbf{1}_{\smash{{[0, \infty) }}} (y) \mathbf{1}_{\smash{{[0, \infty) }}} (- y)$ . Hence, by linearity we may assume for all $i \in \left\{ {{1, 2, \ldots, r }} \right\}$ that $A_i = [0, \infty)$ . Next let $Q_1, Q_2 \in \mathscr{P}_m$ satisfy for all $x \in \mathbb{R}^m$ that

$\begin{equation} R(x) = \begin{cases} \frac{Q_1 ( x ) }{ Q_2 ( x ) } & \colon Q_2 ( x ) \not = 0 \\ 0 & \colon Q_2 ( x ) = 0. \end{cases} \end{equation}$

(4.6)

Note that the graph of $\mathbb{R}^m \ni \theta \mapsto R(\theta) \in \mathbb{R}$ is given by

$\begin{equation} \left\{ {{(\theta , y ) \in \mathbb{R}^m \times \mathbb{R} \colon Q_2(\theta ) = 0 , \, y = 0 }} \right\} \cup \left\{ {{(\theta , y ) \in \mathbb{R}^m \times \mathbb{R} \colon Q_2 (\theta ) \not = 0 , \, Q_2 ( \theta ) y - Q_1 ( \theta ) = 0}} \right\}. \end{equation}$

(4.7)

Since both of these sets are described by polynomial equations and inequalities, it follows that $\mathbb{R}^m \ni \theta \mapsto R(\theta) \in \mathbb{R}$ is semialgebraic. In addition, observe that for all $i \in \left\{ {{1, 2, \ldots, r}} \right\}$ the graph of $\mathbb{R}^m \ni \theta \mapsto \mathbf{1}_{\smash{{[0, \infty) }}} \left({{ P_i (\theta) }} \right) \in \mathbb{R}$ is given by

$\begin{equation} \left\{ {{(\theta , y ) \in \mathbb{R}^m \times \mathbb{R} \colon P_i ( \theta ) < 0 , \, y = 0 }} \right\} \cup \left\{ {{(\theta , y ) \in \mathbb{R}^m \times \mathbb{R} \colon P_i ( \theta ) \geq 0 , \, y = 1 }} \right\}. \end{equation}$

(4.8)

This demonstrates for all $i \in \left\{ {{1, 2, \ldots, r}} \right\}$ that $\mathbb{R}^m \ni \theta \mapsto \mathbf{1}_{\smash{{[0, \infty) }}} \left({{ P_i (\theta) }} \right) \in \mathbb{R}$ is semialgebraic. Combining this and (4.5) with Lemma 4.4 demonstrates that $g$ is semialgebraic. The proof of Lemma 4.7 is thus complete.

Proposition 4.8. Let $m, n \in \mathbb{N}$ , $\mathscr{a} \in \mathbb{R}$ , $\mathscr{b} \in (\mathscr{a}, \infty)$ , $f \in \mathscr{A}_{m, n}$ (cf. Definition 4.6). Then

$\begin{equation} \left[ {{ \mathbb{R}^m \times \mathbb{R}^{n-1} \ni (\theta, x_1, \ldots, x_{n-1} ) \mapsto \int_ \mathscr{a}^ \mathscr{b} f ( \theta , x_1, \ldots, x_n ) \, \mathrm{d} x_n \in \mathbb{R} }} \right] \in \mathscr{A}_{m, n-1}. \end{equation}$

(4.9)

Proof of Proposition 4.8. By linearity of the integral it suffices to consider a function $f$ of the form

$\begin{equation} f(\theta , x ) = R ( \theta ) Q ( x ) \prod\limits_{i = 1}^r \mathbf{1}_{\smash{{A_i}}} \left( {{ P_{i, 0} ( \theta ) + \sum\nolimits_{j = 1}^n P_{i , j} ( \theta ) x_j }} \right) \end{equation}$

(4.10)

where $r \in \mathbb{N}$ , $\left({{P_{i, j}}} \right)_{(i, j) \in \left\{ {{1, 2, \ldots, r}} \right\} \times \left\{ {{0, 1, \ldots, n }} \right\} } \subseteq \mathscr{P}_m$ , $A_1, A_2, \ldots, A_r \in \left\{ {{ \left\{ {{0}} \right\}, (0, \infty), [0, \infty) }} \right\}$ , $Q \in \mathscr{P}_n$ , and $R \in \mathscr{R}_m$ (cf. Definitions 4.1 and 4.5). Moreover, note that for all $y \in \mathbb{R}$ it holds that $\mathbf{1}_{\smash{{(0, \infty)}}} (y) = 1 - \mathbf{1}_{\smash{{[0, \infty) }}} (- y)$ and $\mathbf{1}_{\smash{{\left\{ {{0}} \right\}}}} (y) = \mathbf{1}_{\smash{{[0, \infty) }}} (y) \mathbf{1}_{\smash{{[0, \infty) }}} (- y)$ . Hence, by linearity we may assume that $A_i = [0, \infty)$ for all $i \in \left\{ {{1, 2, \ldots, r }} \right\}$ . Furthermore, by linearity we may assume that $Q$ is of the form

$\begin{equation} Q(x_1, \ldots, x_n) = \prod _{\ell = 1}^n ( x_\ell ) ^{i_\ell} \end{equation}$

(4.11)

with $i_1, i_2, \ldots, i_n \in \mathbb{N}_0$ . In the following let $\mathfrak{s} \colon \mathbb{R} \to \mathbb{R}$ satisfy for all $x \in \mathbb{R}$ that $\mathfrak{s} (x) = \mathbf{1}_{\smash{{(0, \infty)}}} (x) - \mathbf{1}_{\smash{{(0, \infty) }}} (- x)$ , for every $\theta \in \mathbb{R}^m$ , $k \in \left\{ {{-1, 0, 1}} \right\}$ let $\mathcal{S}_k^\theta \subseteq \left\{ {{1, 2, \ldots, r}} \right\}$ satisfy $\mathcal{S}_k^\theta = \left\{ {{ i \in \left\{ {{1, 2, \ldots, r }} \right\} \colon \mathfrak{s} (P_{i, n} (\theta)) = k }} \right\}$ , and for every $i \in \left\{ {{1, 2, \ldots, r}} \right\}$ let $Z_i \colon \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ satisfy for all $(\theta, x) \in \mathbb{R}^m \times \mathbb{R}^n$ that

$\begin{equation} Z_i ( \theta , x ) = - P_{i, 0} ( \theta ) - \sum\nolimits_{j = 1}^{n-1} P_{i, j} ( \theta ) x_j . \end{equation}$

(4.12)

Observe that (4.10), (4.11), and (4.12) imply for all $\theta \in \mathbb{R}^m$ , $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ that

$\begin{equation} f ( \theta , x ) = R ( \theta ) \left( {{ \prod _{\ell = 1}^n ( x_\ell ) ^{i_\ell} }} \right) \left( {{ \prod _{i = 1}^r \mathbf{1}_{\smash{{ [ 0 , \infty ) }}} \left( {{ P_{i, n } ( \theta ) x_n - Z_i ( \theta , x ) }} \right) }} \right) . \end{equation}$

(4.13)

This shows that $f(\theta, x)$ can only be nonzero if

$\begin{equation} \begin{split} \forall \, i \in \mathcal{S}^\theta_1 &\colon x_n \geq \frac{Z_i ( \theta , x )}{ P_{i , n} ( \theta ) }, \\ \forall \, i \in \mathcal{S}^\theta_{-1} &\colon x_n \leq \frac{Z_i ( \theta , x )}{ P_{i , n} ( \theta ) }, \\ \forall \, i \in \mathcal{S}^\theta_0 &\colon - Z_i ( \theta , x ) \ge 0. \end{split} \end{equation}$

(4.14)

Hence, if for given $\theta \in \mathbb{R}^m$ , $(x_1, \ldots, x_{n-1}) \in \mathbb{R}^{n-1}$ there exists $x_n \in [\mathscr{a}, \mathscr{b}]$ which satisfies these conditions then (4.13) and the fact that $\int y ^{i_ n } \, \mathrm{d} y = \frac{1}{i_n + 1 } y^{i_n + 1 }$ imply that

$\begin{equation} \begin{split} & \int_ \mathscr{a}^ \mathscr{b} f ( \theta , x_1, \ldots, x_n ) \, \mathrm{d} x_n \\ & = \frac{ R ( \theta )}{i_n + 1 } \left( {{ \prod\nolimits_{\ell = 1}^{n-1} x_\ell^{i_\ell} }} \right) \left[ {{ \left( {{ \min \left\{ {{ \mathscr{b} , \min\limits_{j \in \mathcal{S}_{-1}^\theta} \frac{Z_j ( \theta , x ) }{ P_{j , n} ( \theta ) } }} \right\} }} \right)^{\! i_n + 1} \! \! - \left( {{ \max \left\{ {{ \mathscr{a}, \max\limits_{j \in \mathcal{S}_1^\theta} \frac{ Z_j ( \theta , x ) }{ P_{j , n} ( \theta ) } }} \right\} }} \right)^{\! i_n + 1} }} \right]. \end{split} \end{equation}$

(4.15)

Otherwise, we have that $\int_ \mathscr{a}^ \mathscr{b} f (\theta, x_1, \ldots, x_n) \, \mathrm{d} x_n = 0$ . It remains to write these expressions in the different cases as a sum of functions of the required form in Definition 4.6 by introducing suitable indicator functions. Note that there are four possible cases where the integral is nonzero:

● It holds that $\mathscr{a} < \max_{j \in \mathcal{S}_1^\theta} \frac{ Z_j (\theta, x) }{ P_{j, n} (\theta) } < \min_{j \in \mathcal{S}_{-1}^\theta} \frac{Z_j (\theta, x) }{ P_{j, n} (\theta) } < \mathscr{b}$ . In this case, we have

$\begin{equation} \begin{split} & \int_ \mathscr{a}^ \mathscr{b} f ( \theta , x_1, \ldots, x_n ) \, \mathrm{d} x_n \\ & = \frac{ R ( \theta )}{i_n + 1 } \left( {{ \prod\nolimits_{\ell = 1}^{n-1} x_\ell^{i_\ell} }} \right) \left[ {{ \left( {{ \min\limits_{j \in \mathcal{S}_{-1}^\theta} \frac{Z_j ( \theta , x ) }{ P_{j , n} ( \theta ) }}} \right)^{i_n + 1} - \left( {{ \max\limits_{j \in \mathcal{S}_1^\theta} \frac{ Z_j ( \theta , x ) }{ P_{j , n} ( \theta ) }}} \right)^{i_n + 1} }} \right]. \end{split} \end{equation}$

(4.16)

● It holds that $\mathscr{a} < \max_{j \in \mathcal{S}_1^\theta} \frac{ Z_j (\theta, x) }{ P_{j, n} (\theta) } < \mathscr{b} \le \min_{j \in \mathcal{S}_{-1}^\theta} \frac{Z_j (\theta, x) }{ P_{j, n} (\theta) }$ . In this case, we have

$\begin{equation} \int_ \mathscr{a}^ \mathscr{b} f ( \theta , x_1, \ldots, x_n ) \, \mathrm{d} x_n = \frac{ R ( \theta )}{i_n + 1 } \left( {{ \prod _{\ell = 1}^{n-1} x_\ell^{i_\ell} }} \right) \left[ {{ \mathscr{b}^{i_n + 1} - \left( {{ \max\limits_{j \in \mathcal{S}_1^\theta} \frac{ Z_j ( \theta , x ) }{ P_{j , n} ( \theta ) }}} \right)^{i_n + 1} }} \right]. \end{equation}$

(4.17)

● It holds that $\max_{j \in \mathcal{S}_1^\theta} \frac{ Z_j (\theta, x) }{ P_{j, n} (\theta) } \le \mathscr{a} < \min_{j \in \mathcal{S}_{-1}^\theta} \frac{Z_j (\theta, x) }{ P_{j, n} (\theta) } < \mathscr{b}$ . In this case, we have

$\begin{equation} \int_ \mathscr{a}^ \mathscr{b} f ( \theta , x_1, \ldots, x_n ) \, \mathrm{d} x_n = \frac{ R ( \theta )}{i_n + 1 } \left( {{ \prod _{\ell = 1}^{n-1} x_\ell^{i_\ell} }} \right) \left[ {{ \left( {{ \min\limits_{j \in \mathcal{S}_{-1}^\theta} \frac{Z_j ( \theta , x ) }{ P_{j , n} ( \theta ) }}} \right)^{i_n + 1} - \mathscr{a}^{i_n + 1} }} \right]. \end{equation}$

(4.18)

● It holds that $\max_{j \in \mathcal{S}_1^\theta} \frac{ Z_j (\theta, x) }{ P_{j, n} (\theta) } \le \mathscr{a} < \mathscr{b} \le \min_{j \in \mathcal{S}_{-1}^\theta} \frac{Z_j (\theta, x) }{ P_{j, n} (\theta) }$ . In this case, we have

(4.19)

Since these four cases are disjoint, by summing over all possible choices $A, B, C \subseteq \left\{ {{1, 2, \ldots, r}} \right\}$ of the sets $\mathcal{S}^\theta_k$ , $k \in \left\{ {{-1, 0, 1}} \right\}$ , and all choices of (non-empty) subsets $\mathcal{I}, \mathcal{J}$ of $\mathcal{S}^\theta_1$ , $\mathcal{S}_{-1}^\theta$ where the maximal/minimal values are achieved, we can write

(4.20)

where $(I), (II), (III), (IV)$ denote the functions of $\theta \in \mathbb{R}^m$ and $(x_1, \ldots, x_{n-1}) \in \mathbb{R}^{ n - 1 }$ given by

$\begin{equation} \begin{split} (I) & = \sum\limits_{ A \dot{\cup} B \dot{\cup} C = \left\{ {{1, \ldots, r }} \right\} } \left[ {{ \prod\limits_{j \in A} \mathbf{1}_{\smash{{(0, \infty ) }}} ( P_{ j , n} ( \theta ) ) \prod\limits_{j \in B} \mathbf{1}_{\smash{{(0, \infty ) }}} ( - P_{j , n} ( \theta ) ) \prod\limits_{j \in C} \left( {{ \mathbf{1}_{\smash{{ \left\{ {{0 }} \right\} }}} ( P_{j , n} ( \theta ) ) \mathbf{1}_{\smash{{[0, \infty ) }}} ( - Z_j ( \theta , x ) }} \right) }} \right] \\ & \sum\limits_{ \varnothing \not = \mathcal{I} \subseteq A} \sum\limits_{ \varnothing \not = \mathcal{J} \subseteq B} \Biggl[ \Biggl[ \prod\limits_{i \in \mathcal{I} } \left( {{ \mathbf{1}_{\smash{{( \mathscr{a} , \mathscr{b} ) }}} \left( {{ \frac{Z_i ( \theta , x ) }{ P_{i , n } ( \theta ) }}} \right) \mathbf{1}_{\smash{{\left\{ {{0}} \right\}}}} \left( {{ \frac{ Z_i ( \theta , x )}{P_{i , n} ( \theta ) } - \frac{ Z_{\min \mathcal{I}} ( \theta , x ) }{P_{ \min \mathcal{I} , n} ( \theta ) } }} \right) }} \right) \Biggr. \Biggr. \\ & \times \prod\limits_{j \in A \backslash \mathcal{I} } \mathbf{1}_{\smash{{(0, \infty ) }}} \left( {{ \frac{ Z_{\min \mathcal{I}} ( \theta , x ) }{P_{ \min \mathcal{I} , n} ( \theta ) } - \frac{ Z_j ( \theta , x )}{P_{j , n} ( \theta ) } }} \right) \prod\limits_{i \in \mathcal{J} } \left( {{ \mathbf{1}_{\smash{{( \mathscr{a} , \mathscr{b} ) }}} \left( {{ \frac{Z_i ( \theta , x ) }{ P_{i , n } ( \theta ) }}} \right) \mathbf{1}_{\smash{{\left\{ {{0}} \right\}}}} \left( {{ \frac{ Z_i ( \theta , x )}{P_{i , n} ( \theta ) } - \frac{ Z_{\min \mathcal{J}} ( \theta , x ) }{P_{ \min \mathcal{J} , n} ( \theta ) } }} \right) }} \right) \\ & \Biggl. \times \prod\limits_{j \in B \backslash \mathcal{J} } \mathbf{1}_{\smash{{(0, \infty ) }}} \left( {{ \frac{ Z_j ( \theta , x )}{P_{j , n} ( \theta ) } - \frac{ Z_{\min \mathcal{J}} ( \theta , x ) }{P_{ \min \mathcal{J} , n} ( \theta ) } }} \right) \mathbf{1}_{\smash{{(0, \infty ) }}} \left( {{ \frac{ Z_{\min \mathcal{J}} ( \theta , x ) }{P_{ \min \mathcal{J} , n} ( \theta ) } - \frac{ Z_{\min \mathcal{I}} ( \theta , x ) }{P_{ \min \mathcal{I} , n} ( \theta ) } }} \right) \Biggr] \\ & \Biggl. \times \left[ {{ \left( {{ \frac{ Z_{\min \mathcal{J}} ( \theta , x ) }{P_{ \min \mathcal{J} , n } ( \theta ) } }} \right)^{i_n + 1 } - \left( {{\frac{ Z_{\min \mathcal{I}} ( \theta , x ) }{P_{ \min \mathcal{I} , n} ( \theta ) } }} \right) ^{i_n + 1 } }} \right] \Biggr], \end{split} \end{equation}$

(4.21)

$\begin{equation} \begin{split} (II) & = \sum\limits_{ A \dot{\cup} B \dot{\cup} C = \left\{ {{1, \ldots, r }} \right\} } \left[ {{ \prod\limits_{j \in A} \mathbf{1}_{\smash{{(0, \infty ) }}} ( P_{j , n} ( \theta ) ) \prod\limits_{j \in B} \mathbf{1}_{\smash{{(0, \infty ) }}} ( - P_{j , n} ( \theta ) ) \prod\limits_{j \in C} \left( {{ \mathbf{1}_{\smash{{ \left\{ {{0 }} \right\} }}} ( P_{j , n} ( \theta ) ) \mathbf{1}_{\smash{{[0, \infty ) }}} ( - Z_j ( \theta , x ) }} \right) }} \right] \\ & \sum\limits_{ \varnothing \not = \mathcal{I} \subseteq A} \Biggl[ \Biggl[ \prod\limits_{i \in \mathcal{I} } \left( {{ \mathbf{1}_{\smash{{( \mathscr{a} , \mathscr{b} ) }}} \left( {{ \frac{Z_i ( \theta , x ) }{ P_{ i , n } ( \theta ) }}} \right) \mathbf{1}_{\smash{{\left\{ {{0}} \right\}}}} \left( {{ \frac{ Z_i ( \theta , x )}{P_{i , n} ( \theta ) } - \frac{ Z_{\min \mathcal{I}} ( \theta , x ) }{P_{ \min \mathcal{I} , n} ( \theta ) } }} \right) }} \right) \Biggr. \Biggr. \\ & \times \prod\limits_{j \in A \backslash \mathcal{I} } \mathbf{1}_{\smash{{(0, \infty ) }}} \left( {{ \frac{ Z_{\min \mathcal{I}} ( \theta , x ) }{P_{ \min \mathcal{I} , n} ( \theta ) } - \frac{ Z_j ( \theta , x )}{P_{j , n} ( \theta ) } }} \right) \prod\limits_{i \in B } \left( {{ \mathbf{1}_{\smash{{[ \mathscr{b} , \infty ) }}} \left( {{ \frac{Z_i ( \theta , x ) }{ P_{i , n} ( \theta ) }}} \right) }} \right) \\ & \Biggl. \times \left[ {{ \mathscr{b}^{i_n + 1 } - \left( {{\frac{ Z_{\min \mathcal{I}} ( \theta , x ) }{P_{ \min \mathcal{I} , n} ( \theta ) } }} \right) ^{i_n + 1 } }} \right] \Biggr], \end{split} \end{equation}$

(4.22)

$\begin{equation} \begin{split} (III) & = \sum\limits_{ A \dot{\cup} B \dot{\cup} C = \left\{ {{1, \ldots, r }} \right\} } \left[ {{ \prod\limits_{j \in A} \mathbf{1}_{\smash{{(0, \infty ) }}} ( P_{j , n} ( \theta ) ) \prod\limits_{j \in B} \mathbf{1}_{\smash{{(0, \infty ) }}} ( - P_{j , n} ( \theta ) ) \prod\limits_{j \in C} \left( {{ \mathbf{1}_{\smash{{ \left\{ {{0 }} \right\} }}} ( P_{j , n} ( \theta ) ) \mathbf{1}_{\smash{{[0, \infty ) }}} ( - Z_j ( \theta , x ) }} \right) }} \right] \\ & \sum\limits_{ \varnothing \not = \mathcal{J} \subseteq B} \Biggl[ \Biggl[ \prod\limits_{i \in A } \left( {{ \mathbf{1}_{\smash{{(- \infty , \mathscr{a} ] }}} \left( {{ \frac{Z_i ( \theta , x ) }{ P_{ i , n } ( \theta ) }}} \right) }} \right) \prod\limits_{i \in \mathcal{J} } \left( {{ \mathbf{1}_{\smash{{( \mathscr{a} , \mathscr{b} ) }}} \left( {{ \frac{Z_i ( \theta , x ) }{ P_{ i , n } ( \theta ) }}} \right) \mathbf{1}_{\smash{{\left\{ {{0}} \right\}}}} \left( {{ \frac{ Z_i ( \theta , x )}{P_{ i , n} ( \theta ) } - \frac{ Z_{\min \mathcal{J}} ( \theta , x ) }{P_{ \min \mathcal{J} , n } ( \theta ) } }} \right) }} \right) \\ & \Biggl. \times \prod\limits_{j \in B \backslash \mathcal{J} } \mathbf{1}_{\smash{{(0, \infty ) }}} \left( {{ \frac{ Z_j ( \theta , x )}{P_{j , n} ( \theta ) } - \frac{ Z_{\min \mathcal{J}} ( \theta , x ) }{P_{ \min \mathcal{J} , n} ( \theta ) } }} \right) \Biggr] \times \left[ {{ \left( {{ \frac{ Z_{\min \mathcal{J}} ( \theta , x ) }{P_{ \min \mathcal{J} , n} ( \theta ) } }} \right)^{i_n + 1 } - \mathscr{a} ^{i_n + 1 } }} \right] \Biggr], \end{split} \end{equation}$

(4.23)

and

$\begin{equation} \begin{split} (IV) & = \sum\limits_{ A \dot{\cup} B \dot{\cup} C = \left\{ {{1, \ldots, r }} \right\} } \left[ {{ \prod\limits_{j \in A} \mathbf{1}_{\smash{{(0, \infty ) }}} ( P_{j , n} ( \theta ) ) \prod\limits_{j \in B} \mathbf{1}_{\smash{{(0, \infty ) }}} ( - P_{j , n} ( \theta ) ) \prod\limits_{j \in C} \left( {{ \mathbf{1}_{\smash{{ \left\{ {{0 }} \right\} }}} ( P_{j , n} ( \theta ) ) \mathbf{1}_{\smash{{[0, \infty ) }}} ( - Z_j ( \theta , x ) }} \right) }} \right] \\ & \times \left( {{ \prod\limits_{i \in A } \mathbf{1}_{\smash{{(- \infty , \mathscr{a} ] }}} \left( {{ \frac{Z_i ( \theta , x ) }{ P_{ i , n } ( \theta ) } }} \right)\prod\limits_{i \in B } \mathbf{1}_{\smash{{[ \mathscr{b} , \infty ) }}} \left( {{ \frac{Z_i ( \theta , x ) }{ P_{i , n } ( \theta ) } }} \right) }} \right) \left[ {{ \mathscr{b}^{i_n + 1 } - \mathscr{a} ^{i_n + 1 } }} \right] . \end{split} \end{equation}$

(4.24)

Note that the first products over all elements of $A, B, C$ precisely describe the conditions that $\mathcal{S}_1^\theta = A$ , $\mathcal{S}_{ - 1 }^\theta = B$ , $\mathcal{S}_0^\theta = C$ , and $\forall \, j \in \mathcal{S}_0^\theta \colon - Z_j (\theta, x) \ge 0$ . Furthermore, observe that, e.g., in $(I)$ we we must have for all $i \in \mathcal{I}$ , $j \in A \backslash \mathcal{I}$ that $\frac{Z_j (\theta, x)}{ P_{j, n } (\theta) } < \frac{ Z_{\min \mathcal{I}} (\theta, x) }{P_{ \min \mathcal{I}, n} (\theta) } = \frac{Z_i (\theta, x)}{ P_{i, n } (\theta) } \in (\mathscr{a}, \mathscr{b})$ in order to obtain a non-zero value. In other words, the maximal value of $\frac{Z_i (\theta, x)}{ P_{i, n } (\theta) }$ , $i \in A$ , is achieved exactly for $i \in \mathcal{I}$ , and similarly the minimal value of $\frac{Z_j (\theta, x)}{ P_{j, n } (\theta) }$ , $j \in B$ , is achieved exactly for $j \in \mathcal{J}$ (and analogously in $(II), (III)$ ). Moreover, note that we have for all $i \in \mathcal{I} \subseteq A$ that

$\begin{equation} \begin{split} \mathbf{1}_{\smash{{( \mathscr{a} , \mathscr{b} ) }}} \left( {{ \frac{Z_i ( \theta , x ) }{ P_{ i , n } ( \theta ) }}} \right) & = \mathbf{1}_{\smash{{( \mathscr{a} , \infty ) }}} \left( {{ \frac{Z_i ( \theta , x ) }{ P_{ i , n } ( \theta ) }}} \right) \mathbf{1}_{\smash{{(- \infty , \mathscr{b} ) }}} \left( {{ \frac{Z_i ( \theta , x ) }{ P_{i , n } ( \theta ) }}} \right) \\ & = \mathbf{1}_{\smash{{(0, \infty ) }}} \left( {{ Z_i ( \theta , x ) - \mathscr{a} P_{ i , n} ( \theta ) }} \right) \mathbf{1}_{\smash{{(0, \infty ) }}} \left( {{ \mathscr{b} P_{ i , n} ( \theta ) - Z_i ( \theta , x ) }} \right). \end{split} \end{equation}$

(4.25)

Here $Z_i (\theta, x)$ is polynomial in $\theta$ and linear in $x_1, \ldots, x_{n-1}$ , and thus of the form required by Definition 4.6. Similarly, the other indicator functions can be brought into the correct form, taking into account the different signs of $P_{j, n} (\theta)$ for $j \in A$ and $j \in B$ . Moreover, observe that the remaining terms can be written as linear combinations of rational functions in $\theta$ and polynomials in $x$ . Hence, we obtain that the functions defined by $(I), (II), (III), (IV)$ are elements of $\mathscr{A}_{m, n-1}$ . The proof of Proposition 4.8 is thus complete.

4.3. On the semialgebraic property of the risk function

Definition 4.9. Let $d \in \mathbb{N}$ , let $A \subseteq \mathbb{R}^d$ be a set, and let $f \colon A \to \mathbb{R}$ be a function. Then we say that $f$ is piecewise polynomial if and only if there exist $n \in \mathbb{N}$ , $\alpha_1, \alpha_2, \ldots, \alpha_n \in \mathbb{R}^{n \times d}$ , $\beta_{1}, \beta_2, \ldots, \beta_n \in \mathbb{R}^n$ , $P_1, P_2, \ldots, P_n \in \mathscr{P}_d$ such that for all $x \in A$ it holds that

$\begin{equation} f(x) = \sum\nolimits_{i = 1}^n \left[ {{ P_i(x) \mathbf{1}_{\smash{{[0, \infty )^n}}} \left( {{ \alpha_i x + \beta_i }} \right) }} \right] \end{equation}$

(4.26)

(cf. Definition 4.1).

Corollary 4.10. Assume Setting 2.1 and assume that $f$ and $\mathfrak{p}$ are piecewise polynomial (cf. Definition 4.9). Then $\mathcal{L}$ is semialgebraic (cf. Definition 4.3).

Proof of Corollary 4.10. Throughout this proof let $F \colon \mathbb{R}^d \to \mathbb{R}$ and $\mathfrak{P} \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $x \in \mathbb{R}^d$ that

$\begin{equation} F ( x ) = \begin{cases} f(x) & \colon x \in [ \mathscr{a} , \mathscr{b} ] ^d \\ 0 & \colon x \notin [ \mathscr{a} , \mathscr{b} ] ^d \end{cases} \qquad\text{and}\qquad \mathfrak{P} ( x ) = \begin{cases} \mathfrak{p} ( x ) & \colon x \in [ \mathscr{a} , \mathscr{b} ] ^d \\ 0 & \colon x \notin [ \mathscr{a} , \mathscr{b} ] ^d. \end{cases} \end{equation}$

(4.27)

Note that (4.27) and the assumption that $f$ and $\mathfrak{p}$ are piecewise polynomial assure that

$\begin{equation} \left[ {{ \mathbb{R}^ \mathfrak{d} \times \mathbb{R}^d \ni (\theta , x ) \mapsto F ( x ) \in \mathbb{R} }} \right] \in \mathscr{A}_{ \mathfrak{d} , d} \quad\text{and}\quad \left[ {{ \mathbb{R}^ \mathfrak{d} \times \mathbb{R}^d \ni (\theta , x ) \mapsto \mathfrak{P} ( x ) \in \mathbb{R} }} \right] \in \mathscr{A}_{ \mathfrak{d} , d} \end{equation}$

(4.28)

(cf. Definition 4.6). In addition, observe that the fact that for all $\theta \in \mathbb{R}^ \mathfrak{d}$ , $x \in \mathbb{R}^d$ we have that

$\begin{equation} \begin{split} \mathscr{N} ^{ {\theta} } ( x ) & = \mathfrak{c}^{{\theta}} + \sum\limits_{i = 1}^ H \mathfrak{v}^{{\theta}}_i \max \{\sum\limits_{\ell = 1}^d \mathfrak{w}^{{\theta}}_{i , \ell} x_\ell + \mathfrak{b}^{{\theta}}_i , 0\} \\ & = \mathfrak{c}^{{\theta}} + \sum\limits_{i = 1}^ H \mathfrak{v}^{{\theta}}_i \left( {\sum\nolimits_{\ell = 1}^d \mathfrak{w}^{{\theta}}_{i , \ell} x_\ell + \mathfrak{b}^{{\theta}}_i} \right) \mathbf{1}_{\smash{{[0, \infty ) }}} \left( {\sum\nolimits_{\ell = 1}^d \mathfrak{w}^{{\theta}}_{i , \ell} x_\ell + \mathfrak{b}^{{\theta}}_i } \right) \end{split} \end{equation}$

(4.29)

demonstrates that

$\begin{equation} \left[ {{ \mathbb{R}^ \mathfrak{d} \times \mathbb{R}^d \ni (\theta , x ) \mapsto \mathscr{N} ^{ {\theta} } ( x ) \in \mathbb{R} }} \right] \in \mathscr{A}_{ \mathfrak{d} , d} . \end{equation}$

(4.30)

Combining this with (4.28) and the fact that $\mathscr{A}_{ \mathfrak{d}, d}$ is an algebra proves that

$\begin{equation} \left[ {{ \mathbb{R}^ \mathfrak{d} \times \mathbb{R}^d \ni (\theta , x ) \mapsto ( \mathscr{N} ^{ {\theta} } ( x ) - F ( x ) ) ^2 \mathfrak{P} ( x ) \in \mathbb{R} }} \right] \in \mathscr{A}_{ \mathfrak{d} , d } . \end{equation}$

(4.31)

This, Proposition 4.8, and induction demonstrate that

$\begin{equation} \left[ {{ \mathbb{R}^ \mathfrak{d} \ni \theta \mapsto \int_ \mathscr{a}^ \mathscr{b} \int_ \mathscr{a}^ \mathscr{b} \cdots \int_ \mathscr{a}^ \mathscr{b} ( \mathscr{N} ^{ {\theta} } ( x ) - F ( x ) ) ^2 \mathfrak{P} ( x ) \, \mathrm{d} x_d \cdots \, \mathrm{d} x_2 \, \mathrm{d} x_1 \in \mathbb{R} }} \right] \in \mathscr{A}_{ \mathfrak{d} , 0}. \end{equation}$

(4.32)

Fubini's theorem hence implies that $\mathcal{L} \in \mathscr{A}_{ \mathfrak{d}, 0 }$ . Combining this and Lemma 4.7 shows that $\mathcal{L}$ is semialgebraic. The proof of Corollary 4.10 is thus complete.

5. Convergence rates for solutions of GF differential equations

In this section we employ the findings from Sections 2 and 4 to establish in Proposition 5.2 in Subsection 5.2 below, in Proposition 5.3 in Subsection 5.2, and in Theorem 5.4 in Subsection 5.3 below several convergence rate results for solutions of GF differential equations. Theorem 1.2 in the introduction is a direct consequence of Theorem 5.4. Our proof of Theorem 5.4 is based on an application of Proposition 5.3 and our proof of Proposition 5.3 uses Proposition 5.2. Our proof of Proposition 5.2, in turn, employs Proposition 5.1 in Subsection 5.1 below. In Proposition 5.1 we establish that under the assumption that the target function $f \colon [\mathscr{a}, \mathscr{b}] ^d \to \mathbb{R}$ and the unnormalized density function $\mathfrak{p} \colon [\mathscr{a}, \mathscr{b}] ^d \to [0, \infty)$ are piecewise polynomial (see Definition 4.9 in Subsection 4.3) we have that the risk function $\mathcal{L} \colon \mathbb{R}^ \mathfrak{d} \to \mathbb{R}$ satisfies an appropriately generalized Kurdyka-Łojasiewicz inequality.

In the proof of Proposition 5.1 the classical Łojasiewicz inequality for semialgebraic or subanalytic functions (cf., e.g., Bierstone & Milman ^[40]) is not directly applicable since the generalized gradient function $\mathcal{G} \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ \mathfrak{d} }$ is not continuous. We will employ the more general results from Bolte et al. ^[9] which also apply to not necessarily continuously differentiable functions.

The arguments used in the proof of Proposition 5.2 are slight adaptions of well-known arguments in the literature; see, e.g., Kurdyka et al. [12, Section 1], Bolte et al. [9, Theorem 4.5], or Absil et al. [6, Theorem 2.2]. On the one hand, in Kurdyka et al. [, Section 1] and Absil et al. [, Theorem 2.2] it is assumed that the object function of the considered optimization problem is analytic and in Bolte et al. [, Theorem 4.5] it is assumed that the objective function of the considered optimization problem is convex or lower $C^2$ and Proposition 5.2 does not require these assumptions. On the other hand, Bolte et al. [, Theorem 4.5] consider more general differential dynamics and the considered gradients are allowed to be more general than the specific generalized gradient function $\mathcal{G} \colon \mathbb{R}^{ \mathfrak{d} } \to \mathbb{R}^{ \mathfrak{d} }$ which is considered in Proposition 5.2.

5.1. Generalized Kurdyka-Łojasiewicz inequality for the risk function

Proposition 5.1 (Generalized Kurdyka-Łojasiewicz inequality). Assume Setting 2.1, assume that $\mathfrak{p}$ and $f$ are piecewise polynomial, and let $\vartheta \in \mathbb{R}^ \mathfrak{d}$ (cf. Definition 4.9). Then there exist $\varepsilon, \mathfrak{D} \in (0, \infty)$ , $\alpha \in (0, 1)$ such that for all $\theta \in B_\varepsilon (\vartheta)$ it holds that

$\begin{equation} | \mathcal{L} ( \theta ) - \mathcal{L} ( \vartheta ) | ^\alpha \leq \mathfrak{D} || \mathcal{G} ( \theta ) ||. \end{equation}$

(5.1)

Proof of Proposition 5.1. Throughout this proof let $\mathbf{M} \colon \mathbb{R}^ \mathfrak{d} \to [0, \infty]$ satisfy for all $\theta \in \mathbb{R}^ \mathfrak{d}$ that

$\begin{equation} \mathbf{M}( \theta ) = \inf \left( {{ \left\{ {{||h|| \colon h \in \partial \mathcal{L} ( \theta ) }} \right\} \cup \left\{ {{ \infty }} \right\} }} \right). \end{equation}$

(5.2)

Note that Proposition 2.12 implies for all $\theta \in \mathbb{R}^ \mathfrak{d}$ that

$\begin{equation} \mathbf{M}( \theta ) \leq || \mathcal{G} ( \theta ) || . \end{equation}$

(5.3)

Furthermore, observe that Corollary 4.10, the fact that semialgebraic functions are subanalytic, and Bolte et al. [, Theorem 3.1 and Remark 3.2] ensure that there exist $\varepsilon, \mathfrak{D} \in (0, \infty)$ , $\mathfrak{a} \in [0, 1)$ which satisfy for all $\theta \in B_\varepsilon (\vartheta)$ that

$\begin{equation} | \mathcal{L} ( \theta ) - \mathcal{L} ( \vartheta ) | ^ \mathfrak{a} \leq \mathfrak{D} \mathbf{M} ( \theta ). \end{equation}$

(5.4)

Combining this and (5.3) with the fact that $\sup_{\theta \in B_\varepsilon (\vartheta) } | \mathcal{L} (\theta) - \mathcal{L} (\vartheta) | < \infty$ demonstrates that for all $\theta \in B_\varepsilon (\vartheta)$ , $\alpha \in (\mathfrak{a}, 1)$ we have that

$\begin{equation} \begin{split} | \mathcal{L} ( \theta ) - \mathcal{L} ( \vartheta ) | ^{\alpha} & \le | \mathcal{L} ( \theta ) - \mathcal{L} ( \vartheta ) | ^ \mathfrak{a} \left( {{ \sup\nolimits_{\psi \in B_\varepsilon ( \vartheta ) } | \mathcal{L} ( \psi ) - \mathcal{L} ( \vartheta ) | ^{\alpha - \mathfrak{a}} }} \right) \\ &\le \left( {{ \mathfrak{D} \sup\nolimits_{\psi \in B_\varepsilon ( \vartheta ) } | \mathcal{L} ( \psi ) - \mathcal{L} ( \vartheta ) | ^{\alpha - \mathfrak{a}} }} \right) || \mathcal{G} ( \theta ) ||. \end{split} \end{equation}$

(5.5)

This completes the proof of Proposition 5.1.

5.2. Local convergence for solutions of GF differential equations

Proposition 5.2. Assume Setting 2.1 and let $\vartheta \in \mathbb{R}^ \mathfrak{d}$ , $\varepsilon, \mathfrak{D} \in (0, \infty)$ , $\alpha \in (0, 1)$ satisfy for all $\theta \in B_\varepsilon (\vartheta)$ that

$\begin{equation} | \mathcal{L} ( \theta ) - \mathcal{L} ( \vartheta ) | ^\alpha \leq \mathfrak{D} || \mathcal{G}( \theta ) || . \end{equation}$

(5.6)

Then there exists $\delta \in (0, \varepsilon)$ such that for all $\Theta \in C([0, \infty), \mathbb{R}^ \mathfrak{d})$ with $\Theta_0 \in B_\delta (\vartheta)$ , $\forall \, t \in [0, \infty) \colon \Theta_t = \Theta_0 - \int_0^t \mathcal{G} (\Theta_s) \, \mathrm{d} s$ , and $\inf_{t \in \left\{ {{ s \in [0, \infty) \colon \Theta_s \in B_\varepsilon (\vartheta) }} \right\} } \mathcal{L} (\Theta_t) \geq \mathcal{L} (\vartheta)$ there exists $\psi \in \mathcal{L}^{ - 1 }(\left\{ {{ \mathcal{L}(\vartheta) }} \right\})$ such that for all $t \in [0, \infty)$ it holds that

$\begin{equation} \Theta_t \in B_{ \varepsilon }( \vartheta ) , \qquad \int_0^\infty || \mathcal{G} ( \Theta_s ) || \, \mathrm{d} s \leq \varepsilon , \qquad | \mathcal{L}( \Theta_t ) - \mathcal{L}( \psi ) | \leq ( 1 + \mathfrak{D}^{ - 2 } t )^{ - 1 } , \end{equation}$

(5.7)

$\begin{equation} \mathit{\text{and}} \qquad ||\Theta_t - \psi || \leq \left[ {{ 1 + \left( {{ \mathfrak{D}^{-1 / \alpha} ( 1 - \alpha ) }} \right)^{\frac{\alpha}{1 - \alpha } } t }} \right]^ { - \min \left\{ {{1, \frac{1 - \alpha}{ \alpha } }} \right\} } . \end{equation}$

(5.8)

Proof of Proposition 5.2. Note that the fact that $\mathcal{L}$ is continuous implies that there exists $\delta \in (0, \varepsilon / 3)$ which satisfies for all $\theta \in B_\delta (\vartheta)$ that

$\begin{equation} | \mathcal{L} ( \theta ) - \mathcal{L} ( \vartheta ) | ^{1 - \alpha } \leq \min \left\{ {{ \frac{\varepsilon ( 1 - \alpha ) }{3 \mathfrak{D} }, \frac{1 - \alpha}{ \mathfrak{D}} , 1 }} \right\}. \end{equation}$

(5.9)

In the following let $\Theta \in C([0, \infty), \mathbb{R}^ \mathfrak{d})$ satisfy $\forall \, t \in [0, \infty) \colon \Theta_t = \Theta_0 - \int_0^t \mathcal{G} (\Theta_s) \, \mathrm{d} s$ , $\Theta_0 \in B_\delta (\vartheta)$ , and

$\begin{equation} \inf\nolimits_{t \in \left\{ {{ s \in [0, \infty ) \colon \Theta_s \in B_\varepsilon ( \vartheta ) }} \right\} } \mathcal{L} ( \Theta_t ) \geq \mathcal{L} ( \vartheta ). \end{equation}$

(5.10)

In the first step we show that for all $t \in [0, \infty)$ it holds that

$\begin{equation} \Theta_t \in B_\varepsilon ( \vartheta ). \end{equation}$

(5.11)

Observe that, e.g., [, Lemma 3.1] ensures for all $t \in [0, \infty)$ that

$\begin{equation} \mathcal{L} ( \Theta_t ) = \mathcal{L} ( \Theta_0 ) - \int_0^t || \mathcal{G} ( \Theta_s ) || ^2 \, \mathrm{d} s. \end{equation}$

(5.12)

This implies that $[0, \infty) \ni t \mapsto \mathcal{L} (\Theta_t) \in [0, \infty)$ is non-increasing. Next let $L \colon [0, \infty) \to \mathbb{R}$ satisfy for all $t \in [0, \infty)$ that

$\begin{equation} L(t) = \mathcal{L} ( \Theta_t ) - \mathcal{L} ( \vartheta ) \end{equation}$

(5.13)

and let $T \in [0, \infty]$ satisfy

$\begin{equation} T = \inf \left( {{ \left\{ {{t \in [0, \infty ) \colon ||\Theta_t - \vartheta || \geq \varepsilon }} \right\} \cup \left\{ {{\infty}} \right\} }} \right). \end{equation}$

(5.14)

We intend to show that $T = \infty$ . Note that (5.10) assures for all $t \in [0, T)$ that $L(t) \geq 0$ . Moreover, observe that (5.12) and (5.13) ensure that for almost all $t \in [0, T)$ it holds that $L$ is differentiable at $t$ and satisfies $L ' (t) = \frac{ \mathrm{d}}{ \mathrm{d} t} (\mathcal{L} (\Theta_t)) = - || \mathcal{G} (\Theta_t) || ^2$ . In the following let $\tau \in [0, T]$ satisfy

$\begin{equation} \tau = \inf \left( {{ \left\{ {{t \in [0, T) \colon L ( t ) = 0 }} \right\} \cup \left\{ {{T }} \right\} }} \right). \end{equation}$

(5.15)

Note that the fact that $L$ is non-increasing implies that for all $s \in [\tau, T)$ it holds that $L(s) = 0$ . Combining this with (5.12) demonstrates for almost all $s \in (\tau, T)$ that $\mathcal{G} (\Theta_s) = 0$ . This proves for all $s \in [\tau, T)$ that $\Theta_s = \Theta_\tau$ . Next observe that (5.6) ensures that for all $t \in [0, \tau)$ it holds that

$\begin{equation} 0 < [ L ( t ) ] ^\alpha = | \mathcal{L} ( \Theta_t ) - \mathcal{L} ( \vartheta ) | ^\alpha \leq \mathfrak{D} || \mathcal{G} ( \Theta_t ) ||. \end{equation}$

(5.16)

Combining this with the chain rule proves for almost all $t \in [0, \tau)$ that

$\begin{equation} \begin{split} \frac{ \mathrm{d} }{ \mathrm{d} t} ([ L ( t ) ]^{1 - \alpha } ) & = (1-\alpha) [ L ( t ) ]^{-\alpha} \left( {{ - || \mathcal{G}(\Theta_t)||^2 }} \right) \\ &\leq - ( 1 - \alpha ) \mathfrak{D}^{-1} || \mathcal{G} ( \Theta_t ) ||^{-1} || \mathcal{G} ( \Theta_t ) || ^2 = - \mathfrak{D}^{-1} (1 - \alpha ) || \mathcal{G}(\Theta_t)||. \end{split} \end{equation}$

(5.17)

In addition, note that the fact that $[0, \infty) \ni t \mapsto L(t) \in \mathbb{R}$ is absolutely continuous and the fact that for all $r \in (0, \infty)$ it holds that $(r, \infty) \ni y \mapsto y^{1 - \alpha } \in \mathbb{R}$ is Lipschitz continuous demonstrate for all $t \in [0, \tau)$ that $[0, t] \ni s \mapsto [L (s)]^{ 1 - \alpha } \in \mathbb{R}$ is absolutely continuous. Integrating (5.17) hence shows for all $s, t \in [0, \tau)$ with $t \le s$ that

$\begin{equation} \int _t ^s || \mathcal{G}(\Theta_u ) || \, \mathrm{d} u \leq - \mathfrak{D} \left( {{1 - \alpha }} \right) ^{-1} \left( {{ [ L ( s ) ] ^{1 - \alpha } - [ L ( t ) ] ^{ 1 - \alpha } }} \right) \leq \mathfrak{D} \left( {{1 - \alpha}} \right)^{-1} [ L ( t ) ]^{1 - \alpha} . \end{equation}$

(5.18)

This and the fact that for almost all $s \in (\tau, T)$ it holds that $\mathcal{G} (\Theta_s) = 0$ ensure that for all $s, t \in [0, T)$ with $t \le s$ we have that

$\begin{equation} \int_t^s || \mathcal{G}(\Theta_u ) || \, \mathrm{d} u \leq \mathfrak{D} \left( {{ 1 - \alpha}} \right)^{-1} [ L ( t ) ]^{1 - \alpha} . \end{equation}$

(5.19)

Combining this with (5.9) demonstrates for all $t \in [0, T)$ that

$\begin{equation} ||\Theta_t - \Theta_{0} || = \left\| {{\int_{0}^t \mathcal{G} ( \Theta_s ) \, \mathrm{d} s}} \right\| \leq \int _{0} ^t || \mathcal{G}(\Theta_s ) || \, \mathrm{d} s \leq \frac{ \mathfrak{D} | \mathcal{L} ( \Theta_0 ) - \mathcal{L} ( \vartheta ) |^{1-\alpha} }{1 - \alpha} \leq \min \left\{ {{ \frac{\varepsilon}{3} , 1 }} \right\}. \end{equation}$

(5.20)

This, the fact that $\delta < \varepsilon / 3$ , and the triangle inequality assure for all $t \in [0, T)$ that

$\begin{equation} ||\Theta_t - \vartheta || \leq ||\Theta_t - \Theta_{0} || + ||\Theta_{0} - \vartheta|| \leq \frac{\varepsilon}{3} + \delta \leq \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \frac{2 \varepsilon }{3 }. \end{equation}$

(5.21)

Combining this with (5.14) proves that $T = \infty$ . This establishes (5.11).

Next observe that the fact that $T = \infty$ and (5.20) prove that

$\begin{equation} \int_0^\infty || \mathcal{G} ( \Theta_s ) || \, \mathrm{d} s \leq \min \left\{ {{ \frac{\varepsilon}{3} , 1 }} \right\} \le \varepsilon < \infty. \end{equation}$

(5.22)

In the following let $\sigma \colon [0, \infty) \to [0, \infty)$ satisfy for all $t \in [0, \infty)$ that

$\begin{equation} \sigma ( t ) = \int_t^\infty || \mathcal{G} ( \Theta_s )|| \, \mathrm{d} s. \end{equation}$

(5.23)

Note that (5.22) proves that $\limsup_{t \to \infty} \sigma (t) = 0$ . In addition, observe that (5.22) assures that there exists $\psi \in \mathbb{R}^ \mathfrak{d}$ such that

$\begin{equation} \limsup \nolimits_{t \to \infty} ||\Theta_t - \psi || = 0. \end{equation}$

(5.24)

In the next step we combine the weak chain rule for the risk function in (5.12) with (5.11) and (5.6) to obtain that for almost all $t \in [0, \infty)$ we have that

$\begin{equation} L ' ( t ) = - || \mathcal{G} ( \Theta_t ) || ^2 \leq - \mathfrak{D}^{-2} [ L ( t ) ] ^{ 2 \alpha}. \end{equation}$

(5.25)

In addition, note that the fact that $L$ is non-increasing and (5.9) ensure that for all $t \in [0, \infty)$ it holds that $L (t) \leq L (0) \leq 1$ . Therefore, we get for almost all $t \in [0, \infty)$ that

$\begin{equation} L ' ( t ) \leq - \mathfrak{D}^{-2} [ L ( t ) ] ^{ 2 }. \end{equation}$

(5.26)

Combining this with the fact that for all $t \in [0, \tau)$ it holds that $L (t) > 0$ establishes for almost all $t \in [0, \tau)$ that

$\begin{equation} \frac{ \mathrm{d}}{ \mathrm{d} t} \left( {{ \frac{ \mathfrak{D}^2}{L ( t ) } }} \right) = - \frac{ \mathfrak{D}^2 L ' ( t )}{[ L ( t ) ] ^2} \geq 1. \end{equation}$

(5.27)

The fact that for all $t \in [0, \tau)$ it holds that $[0, t] \ni s \mapsto L (s) \in (0, \infty)$ is absolutely continuous hence demonstrates for all $t \in [0, \tau)$ that

$\begin{equation} \frac{ \mathfrak{D}^2}{L(t)} \geq \frac{ \mathfrak{D}^2}{L ( 0 )} + t \geq \mathfrak{D}^2 + t. \end{equation}$

(5.28)

Therefore, we infer for all $t \in [0, \tau)$ that

$\begin{equation} L ( t ) \leq \mathfrak{D}^2 \left( {{ \mathfrak{D}^2 + t }} \right)^{-1} = \left( {{ 1 + \mathfrak{D}^{-2}t }} \right)^{-1}. \end{equation}$

(5.29)

This and the fact that for all $t \in [\tau, \infty)$ it holds that $L(t) = 0$ prove that for all $t \in [0, \infty)$ we have that

$\begin{equation} | \mathcal{L} ( \Theta_t ) - \mathcal{L} ( \vartheta ) | = L ( t ) \leq \left( {{ 1 + \mathfrak{D}^{-2}t }} \right)^{-1}. \end{equation}$

(5.30)

Furthermore, observe that (5.24) and the fact that $\mathcal{L}$ is continuous imply that $\limsup_{t \to \infty} | \mathcal{L} (\Theta_t) - \mathcal{L} (\psi) | = 0$ . Hence, we obtain that $\mathcal{L} (\psi) = \mathcal{L} (\vartheta)$ . This shows for all $t \in [0, \infty)$ that

$\begin{equation} | \mathcal{L} ( \Theta_t ) - \mathcal{L} ( \psi ) | \leq \left( {{ 1 + \mathfrak{D}^{-2}t }} \right)^{-1}. \end{equation}$

(5.31)

In the next step we establish a convergence rate for the quantity $||\Theta_t - \psi||$ , $t \in [0, \infty)$ . We accomplish this by employing an upper bound for the tail length of the curve $\Theta_t \in \mathbb{R}^ \mathfrak{d}$ , $t \in [0, \infty)$ . More formally, note that (5.19), (5.11), and (5.6) demonstrate for all $t \in [0, \infty)$ that

$\begin{equation} \begin{split} \sigma(t) & = \int_t^\infty || \mathcal{G} ( \Theta_u )|| \, \mathrm{d} u = \lim\limits_{s \to \infty} \left[ {{ \int_t^s || \mathcal{G} ( \Theta_u ) || \, \mathrm{d} u }} \right] \\ & \leq \mathfrak{D} \left( {{1 - \alpha}} \right)^{-1} \left( L ( t ) \right)^{1-\alpha} \leq \mathfrak{D} \left( {{1-\alpha}} \right)^{-1} \left( {{ \mathfrak{D} || \mathcal{G}(\Theta_t) || }} \right)^{\frac{1-\alpha}{\alpha}}. \end{split} \end{equation}$

(5.32)

Next observe that the fact that for all $t \in [0, \infty)$ it holds that $\sigma (t) = \int_0^\infty || \mathcal{G} (\Theta_s)|| \, \mathrm{d} s - \int_0^t || \mathcal{G} (\Theta_s) || \, \mathrm{d} s$ shows that for almost all $t \in [0, \infty)$ we have that $\sigma' (t) = - || \mathcal{G} (\Theta_t) ||$ . This and (5.32) yield for almost all $t \in [0, \infty)$ that $\sigma(t) \leq \mathfrak{D}^{1 / \alpha} \left({{1-\alpha}} \right)^{-1} \left[{{- \sigma ' (t) }} \right]^{\frac{1-\alpha}{\alpha}}$ . Therefore, we obtain for almost all $t \in [0, \infty)$ that

$\begin{equation} \sigma ' ( t ) \leq - \left[ {{ ( 1-\alpha ) \mathfrak{D}^{-1 / \alpha} \sigma ( t ) }} \right]^{\frac{\alpha}{1-\alpha}}. \end{equation}$

(5.33)

Combining this with the fact that $\sigma$ is absolutely continuous implies for all $t \in [0, \infty)$ that

$\begin{equation} \sigma(t) - \sigma(0) \leq - \left[ {{ ( 1-\alpha ) \mathfrak{D}^{-1 / \alpha} }} \right]^{\frac{\alpha}{1-\alpha}} \int _{0}^t [ \sigma ( s ) ]^{\frac{\alpha}{1-\alpha}} \, \mathrm{d} s. \end{equation}$

(5.34)

In the following let $\beta, \mathfrak{C} \in (0, \infty)$ satisfy $\beta = \max \left\{ {{ 1, \frac{\alpha}{1-\alpha} }} \right\}$ and $\mathfrak{C} = \left({{ (1-\alpha) \mathfrak{D} ^{ -1 / \alpha} }} \right)^{\frac{\alpha}{1-\alpha}}$ . Note that (5.34) and the fact that for all $t \in [0, \infty)$ it holds that $\sigma (t) \leq \sigma (0) \leq 1$ ensure that for all $t \in [0, \infty)$ it holds that

$\begin{equation} \sigma(t) \leq \sigma(0) - \mathfrak{C} \int _{0}^t [ \sigma ( s ) ]^\beta \, \mathrm{d} s. \end{equation}$

(5.35)

This, the fact that $\sigma$ is non-increasing, and the fact that for all $t \in [0, \infty)$ it holds that $0 \leq \sigma(t) \leq 1$ prove that for all $t \in [0, \infty)$ we have that

$\begin{equation} \left( \sigma ( t ) \right)^\beta \le \sigma ( t ) \leq \sigma(0) - \mathfrak{C} [ \sigma ( t ) ]^\beta t \leq 1 - \mathfrak{C} t [ \sigma ( t ) ]^\beta . \end{equation}$

(5.36)

Hence, we obtain for all $t \in [0, \infty)$ that $\sigma(t) \leq \left({{ 1 + \mathfrak{C} t }} \right)^{-\frac{1}{\beta}}$ . Combining this with the fact that for all $t \in [0, \infty)$ it holds that

$\begin{equation} \begin{split} ||\Theta_t - \psi|| &\leq \limsup\limits_{s \to \infty} ||\Theta_t - \Theta_s || = \limsup\limits_{s \to \infty} \left\| {{ \int_t^s \mathcal{G} ( \Theta_u ) \, \mathrm{d} u }} \right\| \leq \limsup\limits_{s \to \infty} \left[ {{ \int_t^s || \mathcal{G} ( \Theta_u ) || \, \mathrm{d} u }} \right] \\ & = \int_t^\infty || \mathcal{G} ( \Theta_u ) || \, \mathrm{d} u = \sigma ( t ) \end{split} \end{equation}$

(5.37)

shows that for all $t \in [0, \infty)$ we have that $||\Theta_t - \psi || \le (1 + \mathfrak{C} t) ^{- 1 / \beta}$ . This, (5.11), (5.22), and (5.31) establish (5.8). The proof of Proposition 5.2 is thus complete.

5.3. Global convergence for solutions of GF differential equations

Proposition 5.3. Assume Setting 2.1, assume that $\mathfrak{p}$ and $f$ are piecewise polynomial, and let $\Theta \in C ([0, \infty), \mathbb{R}^ \mathfrak{d})$ satisfy

$\begin{equation} \liminf\limits_{t \to \infty } ||\Theta_t || < \infty \qquad\mathit{\text{and}}\qquad \forall \, t \in [0, \infty ) \colon \Theta_t = \Theta_0 - \int_0^t \mathcal{G} ( \Theta_s ) \, \mathrm{d} s \end{equation}$

(5.38)

(cf. Definition 4.9). Then there exist $\vartheta \in \mathcal{G}^{ - 1 } (\left\{ {{ 0 }} \right\})$ , $\mathfrak{C}, \tau, \beta \in (0, \infty)$ which satisfy for all $t \in [\tau, \infty)$ that

$\begin{equation} ||\Theta_t - \vartheta|| \leq \left( {{ 1 + \mathfrak{C} ( t - \tau ) }} \right)^{- \beta} \qquad\mathit{\text{and}}\qquad | \mathcal{L} ( \Theta_t ) - \mathcal{L} ( \vartheta ) | \leq \left( {{1 + \mathfrak{C} ( t - \tau ) }} \right) ^{-1}. \end{equation}$

(5.39)

Proof of Proposition 5.3. First observe that [, Lemma 3.1] ensures that for all $t \in [0, \infty)$ it holds that

$\begin{equation} \mathcal{L}( \Theta_t ) = \mathcal{L}( \Theta_0 ) - \int_0^t || \mathcal{G} ( \Theta_s ) || ^2 \, \mathrm{d} s . \end{equation}$

(5.40)

This implies that $[0, \infty) \ni t \mapsto \mathcal{L} (\Theta_t) \in [0, \infty)$ is non-increasing. Hence, we obtain that there exists $\mathbf{m} \in [0, \infty)$ which satisfies that

$\begin{equation} \mathbf{m} = \limsup\nolimits_{t \to \infty} \mathcal{L} ( \Theta_t ) = \liminf\nolimits_{t \to \infty} \mathcal{L} ( \Theta_t ) = \inf\nolimits_{t \in [0, \infty )} \mathcal{L} ( \Theta_t ). \end{equation}$

(5.41)

Moreover, note that the assumption that $\liminf_{t \to \infty } ||\Theta_t || < \infty$ ensures that there exist $\vartheta \in \mathbb{R}^ \mathfrak{d}$ and $\tau = (\tau_n)_{n \in \mathbb{N}} \colon \mathbb{N} \to [0, \infty)$ which satisfy $\liminf_{n \to \infty} \tau_n = \infty$ and

$\begin{equation} \limsup\nolimits_{n \to \infty} ||\Theta_{\tau_n} - \vartheta || = 0. \end{equation}$

(5.42)

Combining this with (5.41) and the fact that $\mathcal{L}$ is continuous shows that

$\begin{equation} \mathcal{L} ( \vartheta ) = \mathbf{m} \qquad\text{and}\qquad \forall \, t \in [0, \infty ) \colon \mathcal{L} ( \Theta_t ) \geq \mathcal{L} ( \vartheta ). \end{equation}$

(5.43)

Next observe that Proposition 5.1 demonstrates that there exist $\varepsilon, \mathfrak{D} \in (0, \infty)$ , $\alpha \in (0, 1)$ such that for all $\theta \in B_\varepsilon (\vartheta)$ we have that

$\begin{equation} | \mathcal{L} ( \theta ) - \mathcal{L} ( \vartheta ) |^\alpha \leq \mathfrak{D} || \mathcal{G} ( \theta ) || . \end{equation}$

(5.44)

Combining this and (5.42) with Proposition 5.2 demonstrates that there exists $\delta \in (0, \varepsilon)$ which satisfies for all $\Phi \in C([0, \infty), \mathbb{R}^ \mathfrak{d})$ with $\Phi_0 \in B_\delta (\vartheta)$ , $\forall \, t \in [0, \infty) \colon \Phi_t = \Phi_0 - \int_0^t \mathcal{G} (\Phi_s) \, \mathrm{d} s$ , and $\inf_{ t \in \left\{ {{ s \in [0, \infty) \colon \Phi_s \in B_{ \varepsilon }(\vartheta) }} \right\} } \mathcal{L}(\Phi_t) \ge \mathcal{L}(\vartheta)$ that it holds for all $t \in [0, \infty)$ that

$\begin{equation} \Phi_t \in B_{ \varepsilon }( \vartheta ) , \qquad | \mathcal{L}( \Phi_t ) - \mathcal{L}( \vartheta ) | \leq ( 1 + \mathfrak{D}^{ - 2 } t )^{ - 1 } , \end{equation}$

(5.45)

$\begin{equation} \text{and} \qquad ||\Phi_t - \vartheta || \leq \left[ {{ 1 + \left( {{ \mathfrak{D}^{-1 / \alpha} ( 1 - \alpha ) }} \right)^{\frac{\alpha}{1 - \alpha } } t }} \right]^ { - \min \left\{ {{1, \frac{1 - \alpha}{ \alpha } }} \right\} } . \end{equation}$

(5.46)

Moreover, note that (5.42) ensures that there exists $n \in \mathbb{N}$ which satisfies $\Theta_{\tau_n } \in B_\delta (\vartheta)$ . Next let $\Phi \in C([0, \infty), \mathbb{R}^ \mathfrak{d})$ satisfy for all $t \in [0, \infty)$ that

$\begin{equation} \Phi_t = \Theta_{t + \tau_n }. \end{equation}$

(5.47)

Observe that (5.43) and (5.47) assure that

$\begin{equation} \Phi_0 \in B_\delta ( \vartheta ) , \qquad \inf\nolimits_{ t \in [0, \infty) } \mathcal{L}( \Phi_t ) \ge \mathcal{L}( \vartheta ) , \qquad\text{and}\qquad \forall \, t \in [0, \infty ) \colon \Phi_t = \Phi_0 - \int_0^t \mathcal{G} ( \Phi_s ) \, \mathrm{d} s . \end{equation}$

(5.48)

Combining this with (5.46) proves for all $t \in [\tau_n, \infty)$ that

$\begin{equation} | \mathcal{L} ( \Theta_t ) - \mathcal{L} ( \vartheta ) | \le \left( {{ 1 + \mathfrak{D}^{-2} ( t - \tau_n ) }} \right)^{-1} \end{equation}$

(5.49)

and

$\begin{equation} ||\Theta_t - \vartheta || \leq \left[ {{ 1 + \left( {{ \mathfrak{D}^{-1 / \alpha} ( 1 - \alpha ) }} \right)^{\frac{\alpha}{1 - \alpha } } ( t - \tau_n ) }} \right]^ { - \min \left\{ {{1, \frac{1 - \alpha}{ \alpha } }} \right\} } . \end{equation}$

(5.50)

Next note that [, Corollary 2.15] shows that $\mathbb{R}^ \mathfrak{d} \ni \theta \mapsto || \mathcal{G} (\theta) || \in [0, \infty)$ is lower semicontinuous. The fact that $\liminf_{s \to \infty} || \mathcal{G} (\Theta_s) || = 0$ and the fact that $\limsup_{t \to \infty} ||\Theta_t - \vartheta || = 0$ hence imply that $\mathcal{G} (\vartheta) = 0$ . Combining this with (5.49) and (5.50) establishes (5.39). The proof of Proposition 5.3 is thus complete.

By choosing a sufficiently large $\mathscr{C} \in (0, \infty)$ we can conclude a simplified version of Proposition 5.3. This is precisely the subject of the next result, Theorem 5.4 below. Theorem 1.2 in the introduction is a direct consequence of Theorem 5.4.

Theorem 5.4. Assume Setting 2.1, assume that $\mathfrak{p}$ and $f$ are piecewise polynomial, and let $\Theta \in C ([0, \infty), \mathbb{R}^ \mathfrak{d})$ satisfy $\liminf_{t \to \infty } ||\Theta_t || < \infty$ and $\forall \, t \in [0, \infty) \colon \Theta_t = \Theta_0 - \int_0^t \mathcal{G} (\Theta_s) \, \mathrm{d} s$ (cf. Definition 4.9). Then there exist $\vartheta \in \mathcal{G}^{ - 1 } (\left\{ {{ 0 }} \right\})$ , $\mathscr{C}, \beta \in (0, \infty)$ which satisfy for all $t \in [0, \infty)$ that

$\begin{equation} ||\Theta_t - \vartheta|| \leq \mathscr{C} ( 1 + t ) ^{ - \beta } \qquad\mathit{\text{and}}\qquad | \mathcal{L} ( \Theta_t ) - \mathcal{L} ( \vartheta ) | \leq \mathscr{C} ( 1 + t ) ^{ - 1 } . \end{equation}$

(5.51)

Proof of Theorem 5.4. Observe that Proposition 5.3 assures that there exist $\vartheta \in \mathcal{G}^{ - 1 } (\left\{ {{ 0 }} \right\})$ , $\mathfrak{C}, \tau, \beta \in (0, \infty)$ which satisfy for all $t \in [\tau, \infty)$ that

$\begin{equation} ||\Theta_t - \vartheta|| \leq \left( {{ 1 + \mathfrak{C} ( t - \tau ) }} \right)^{- \beta} \end{equation}$

(5.52)

and

$\begin{equation} | \mathcal{L} ( \Theta_t ) - \mathcal{L} ( \vartheta ) | \leq \left( {{1 + \mathfrak{C} ( t - \tau ) }} \right) ^{-1}. \end{equation}$

(5.53)

In the following let $\mathscr{C} \in (0, \infty)$ satisfy

$\begin{equation} \mathscr{C} = \max \left\{ {{ \mathfrak{C}^{-1} , 1 + \tau , \mathfrak{C}^{- \beta}, (1 + \tau )^\beta , ( 1 + \tau ) ^\beta \left[ {{ \sup\nolimits_{s \in [0, \tau ] } ||\Theta_s - \vartheta|| }} \right] , (1 + \tau ) \mathcal{L} ( \Theta_0 ) }} \right\}. \end{equation}$

(5.54)

Note that (5.53), (5.54), and the fact that $[0, \infty) \ni t \mapsto \mathcal{L} (\Theta_t) \in [0, \infty)$ is non-increasing show for all $t \in [0, \tau]$ that

$\begin{equation} ||\Theta_t - \vartheta || \le \sup\nolimits_{s \in [0, \tau ]} ||\Theta_s - \vartheta|| \le \mathscr{C} ( 1 + \tau ) ^{- \beta } \le \mathscr{C} ( 1 + t ) ^{-\beta } \end{equation}$

(5.55)

and

$\begin{equation} | \mathcal{L} ( \Theta_t ) - \mathcal{L} ( \vartheta ) | = \mathcal{L} ( \Theta_t ) - \mathcal{L} ( \vartheta) \le \mathcal{L} ( \Theta_t ) \le \mathcal{L} ( \Theta_0 ) \le \mathscr{C} ( 1 + \tau )^{-1} \le \mathscr{C} ( 1 + t ) ^{-1}. \end{equation}$

(5.56)

Moreover, observe that (5.52) and (5.54) imply for all $t \in [\tau, \infty)$ that

$\begin{equation} ||\Theta_t - \vartheta || \leq \mathscr{C} \left( {{ \mathscr{C}^{1 / \beta} + \mathfrak{C} \mathscr{C}^{1 / \beta} ( t - \tau ) }} \right) ^{ - \beta } \le \mathscr{C} \left( {{ \mathscr{C}^{1 / \beta} - \tau + t }} \right)^{-\beta} \le \mathscr{C} (1 + t ) ^{-\beta } . \end{equation}$

(5.57)

In addition, note that (5.53) and (5.54) demonstrate for all $t \in [\tau, \infty)$ that

$\begin{equation} | \mathcal{L} ( \Theta_t ) - \mathcal{L} ( \vartheta ) | \le \mathscr{C} \left( {{ \mathscr{C} + \mathfrak{C} \mathscr{C} ( t - \tau ) }} \right) ^{-1} \le \mathscr{C} \left( {{ \mathscr{C} - \tau + t }} \right)^{-1} \le \mathscr{C} ( 1 + t ) ^{-1} . \end{equation}$

(5.58)

This completes the proof of Theorem 5.4.

Acknowledgments

This project has been partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy EXC 2044-390685587, Mathematics Münster: Dynamics-Geometry-Structure.

Conflict of interest

The authors declare that there are no conflicts of interest.

References

[1]	E. Aličković, A. Subasi, Breast cancer diagnosis using GA feature selection and Rotation Forest, Neural Comput. Appl., 28 (2015), 753–763. https://doi.org/10.1007/s00521-015-2103-9 doi: 10.1007/s00521-015-2103-9
[2]	World Health Organization, Breast cancer 2021, 2021. Available from: https://www.who.int/news-room/fact-sheets/detail/breast-cancer.
[3]	Y. S. Sun, Z. Zhao, Z. N. Yang, F. Xu, H. J. Lu, Z. Y. Zhu, et al., Risk factors and preventions of breast cancer, Int. J. Biol. Sci., 13 (2017), 1387–1397. https://doi.org/10.7150/ijbs.21635 doi: 10.7150/ijbs.21635
[4]	J. B. Harford, Breast-cancer early detection in low-income and middle-income countries: Do what you can versus one size fits all, Lancet Oncol., 12 (2011), 306–312. https://doi.org/10.1016/s1470-2045(10)70273-4 doi: 10.1016/s1470-2045(10)70273-4
[5]	C. Lerman, M. Daly, C. Sands, A. Balshem, E. Lustbader, T. Heggan, et al., Mammography adherence and psychological distress among women at risk for breast cancer, J. Natl. Cancer Inst., 85 (1993), 1074–1080. https://doi.org/10.1093/jnci/85.13.1074 doi: 10.1093/jnci/85.13.1074
[6]	P. T. Huynh, A. M. Jarolimek, S. Daye, The false-negative mammogram, Radiographics, 18 (1998), 1137–1154. https://doi.org/10.1148/radiographics.18.5.9747612 doi: 10.1148/radiographics.18.5.9747612
[7]	M. G. Ertosun, D. L. Rubin, Probabilistic Visual Search for Masses within mammography images using Deep Learning, in 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), (2015), 1310–1315. https://doi.org/10.1109/bibm.2015.7359868
[8]	Y. Lu, J. Y. Li, Y. T. Su, A. A. Liu, A review of breast cancer detection in medical images, in 2018 IEEE Visual Communications and Image Processing, (2018), 1–4. https://doi.org/10.1109/vcip.2018.8698732
[9]	J. Ferlay, I. Soerjomataram, R. Dikshit, S. Eser, C. Mathers, M. Rebelo, et al., Cancer incidence and mortality worldwide: Sources, methods and major patterns in Globocan 2012, Int. J. Cancer, 136 (2014), E359–E386. https://doi.org/10.1002/ijc.29210 doi: 10.1002/ijc.29210
[10]	N. Mao, P. Yin, Q. Wang, M. Liu, J. Dong, X. Zhang, et al., Added value of Radiomics on mammography for breast cancer diagnosis: A feasibility study, J. Am. Coll. Radiol., 16 (2019), 485–491. https://doi.org/10.1016/j.jacr.2018.09.041 doi: 10.1016/j.jacr.2018.09.041
[11]	H. Wang, J. Feng, Q. Bu, F. Liu, M. Zhang, Y. Ren, et al., Breast mass detection in digital mammogram based on Gestalt Psychology, J. Healthc. Eng., 2018 (2018), 1–13. https://doi.org/10.1155/2018/4015613 doi: 10.1155/2018/4015613
[12]	S. McGuire, World cancer report 2014, Switzerland: World Health Organization, international agency for research on cancer, Adv. Nutrit. Int. Rev., 7 (2016), 418–419. https://doi.org/10.3945/an.116.012211 doi: 10.3945/an.116.012211
[13]	M. K. Gupta, P. Chandra, A comprehensive survey of Data Mining, Int. J. Comput. Technol., 12 (2020), 1243–1257. https://doi.org/10.1007/s41870-020-00427-7 doi: 10.1007/s41870-020-00427-7
[14]	T. Zou, T. Sugihara, Fast identification of a human skeleton-marker model for motion capture system using stochastic gradient descent method, in 2020 8th IEEE RAS/EMBS International Conference for Biomedical Robotics and Biomechatronics (BioRob)., (2020), 181–186. https://doi.org/10.1109/biorob49111.2020.9224442
[15]	A. Reisizadeh, A. Mokhtari, H. Hassani, R. Pedarsani, An exact quantized decentralized gradient descent algorithm, IEEE Trans. Signal Process., 67 (2019), 4934–4947. https://doi.org/10.1109/tsp.2019.2932876 doi: 10.1109/tsp.2019.2932876
[16]	D. Maulud, A. M. Abdulazeez, A review on linear regression comprehensive in machine learning, J. Appl. Sci. Technol. Trends, 1 (2020), 140–147. https://doi.org/10.38094/jastt1457 doi: 10.38094/jastt1457
[17]	D. R. Wilson, T. R. Martinez, The general inefficiency of batch training for gradient descent learning, Neural Networks, 16 (2003) 1429–1451. https://doi.org/10.1016/s0893-6080(03)00138-2 doi: 10.1016/s0893-6080(03)00138-2
[18]	D. Yi, S. Ji, S. Bu, An enhanced optimization scheme based on gradient descent methods for machine learning, Symmetry, 11 (2019), 942. https://doi.org/10.3390/sym11070942 doi: 10.3390/sym11070942
[19]	D. A. Zebari, D. Q. Zeebaree, A. M. Abdulazeez, H. Haron, H. N. Hamed, Improved threshold based and trainable fully automated segmentation for breast cancer boundary and pectoral muscle in mammogram images, IEEE Access, 8 (2020), 203097–203116. https://doi.org/10.1109/access.2020.3036072 doi: 10.1109/access.2020.3036072
[20]	D. Q. Zeebaree, H. Haron, A. M. Abdulazeez, D. A. Zebari, Trainable model based on new uniform LBP feature to identify the risk of the breast cancer, in 2019 International Conference on Advanced Science and Engineering (ICOASE), 2019. https://doi.org/10.1109/icoase.2019.8723827
[21]	D. Q. Zeebaree, A. M. Abdulazeez, L. M. Abdullrhman, D. A. Hasan, O. S. Kareem, The prediction process based on deep recurrent neural networks: A Review, Asian J. Comput. Inf. Syst., 10 (2021), 29–45. https://doi.org/10.9734/ajrcos/2021/v11i230259 doi: 10.9734/ajrcos/2021/v11i230259
[22]	D. Q. Zeebaree, A. M. Abdulazeez, D. A. Zebari, H. Haron, H. N. A. Hamed, Multi-level fusion in ultrasound for cancer detection based on uniform LBP features, Comput. Matern. Contin., 66 (2021), 3363–3382. https://doi.org/10.32604/cmc.2021.013314 doi: 10.32604/cmc.2021.013314
[23]	M. Muhammad, D. Zeebaree, A. M. Brifcani, J. Saeed, D. A. Zebari, A review on region of interest segmentation based on clustering techniques for breast cancer ultrasound images, J. Appl. Sci. Technol. Trends, 1 (2020), 78–91. https://doi.org/10.38094/jastt1328 doi: 10.38094/jastt1328
[24]	P. Kamsing, P. Torteeka, S. Yooyen, An enhanced learning algorithm with a particle filter-based gradient descent optimizer method, Neural Comput. Appl., 32 (2020), 12789–12800. https://doi.org/10.1007/s00521-020-04726-9 doi: 10.1007/s00521-020-04726-9
[25]	Y. Hamid, L. Journaux, J. A. Lee, M. Sugumaran, A novel method for network intrusion detection based on nonlinear SNE and SVM, J. Artif. Intell. Soft Comput. Res., 6 (2018), 265. https://doi.org/10.1504/ijaisc.2018.097280 doi: 10.1504/ijaisc.2018.097280
[26]	H. Sadeeq, A. M. Abdulazeez, Hardware implementation of firefly optimization algorithm using fpgas, in 2018 International Conference on Advanced Science and Engineering, (2018), 30–35. https://doi.org/10.1109/icoase.2018.8548822
[27]	D. P. Hapsari, I. Utoyo, S. W. Purnami, Fractional gradient descent optimizer for linear classifier support vector machine, in 2020 Third International Conference on Vocational Education and Electrical Engineering (ICVEE), (2020), 1–5.
[28]	M. S. Nawaz, B. Shoaib, M. A. Ashraf, Intelligent cardiovascular disease prediction empowered with gradient descent optimization, Heliyon, 7 (2021), 1–10. https://doi.org/10.1016/j.heliyon.2021.e06948 doi: 10.1016/j.heliyon.2021.e06948
[29]	Y. Qian, Exploration of machine algorithms based on deep learning model and feature extraction, J. Math. Biosci. Eng., 18 (2021), 7602–7618. https://doi.org/10.3934/mbe.2021376 doi: 10.3934/mbe.2021376
[30]	Z. Wang, M. Li, H. Wang, H. Jiang, Y. Yao, H. Zhang, et al., Breast cancer detection using extreme learning machine based on feature fusion with CNN deep features, IEEE Access, 7 (2019), 105146–105158. https://doi.org/10.1109/access.2019.2892795 doi: 10.1109/access.2019.2892795
[31]	UCI Machine Learning Repository, Breast Cancer Wisconsin (Diagnostic) Data Set. Available from: https://archive.ics.uci.edu/ml/datasets/Breast+ Cancer + Wisconsin + (Diagnostic).
[32]	R. V. Anji, B. Soni, R. K. Sudheer, Breast cancer detection by leveraging machine learning, ICT Express, 6 (2020), 320–324. https://doi.org/10.1016/j.icte.2020.04.009 doi: 10.1016/j.icte.2020.04.009
[33]	Z. Salod, Y. Singh, Comparison of the performance of machine learning algorithms in breast cancer screening and detection: A Protocol, J. Public Health Res., 8 (2019). https://doi.org/10.4081/jphr.2019.1677 doi: 10.4081/jphr.2019.1677
[34]	Y. Lin, H. Luo, D. Wang, H. Guo, K. Zhu, An ensemble model based on machine learning methods and data preprocessing for short-term electric load forecasting, Energies, 10 (2017), 1186. https://doi.org/10.3390/en10081186 doi: 10.3390/en10081186
[35]	M. Amrane, S. Oukid, I. Gagaoua, T. Ensari, Breast cancer classification using machine learning, in 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT), (2018), 1–4. https://doi.org/10.1109/ebbt.2018.8391453
[36]	R. Sumbaly, N. Vishnusri, S. Jeyalatha, Diagnosis of breast cancer using decision tree data mining technique, Int. J. Comput. Appl., 98 (2014), 16–24. https://doi.org/10.5120/17219-7456 doi: 10.5120/17219-7456
[37]	B. Zheng, S. W. Yoon, S. S. Lam, Breast cancer diagnosis based on feature extraction using a hybrid of k-means and support vector machine algorithms, Expert Syst. Appl., 41 (2014), 1476–1482. https://doi.org/10.1016/j.eswa.2013.08.044 doi: 10.1016/j.eswa.2013.08.044
[38]	T. Araújo, G. Aresta, E. Castro, J. Rouco, P. Aguiar, C. Eloy, et al., Classification of breast cancer histology images using convolutional neural networks, Plos One, 12 (2017), e0177544. https://doi.org/10.1371/journal.pone.0177544 doi: 10.1371/journal.pone.0177544
[39]	S. P. Rajamohana, A. Dharani, P. Anushree, B. Santhiya, K. Umamaheswari, Machine learning techniques for healthcare applications: early autism detection using ensemble approach and breast cancer prediction using SMO and IBK, in Cognitive Social Mining Applications in Data Analytics and Forensics, (2019), 236–251. https://doi.org/10.4018/978-1-5225-7522-1.ch012
[40]	L. G. Ahmad, Using three machine learning techniques for predicting breast cancer recurrence, J. Health Med. Inf., 4 (2013), 10–15. https://doi.org/10.4172/2157-7420.1000124 doi: 10.4172/2157-7420.1000124
[41]	B. Padmapriya, T. Velmurugan, Classification algorithm based analysis of breast cancer data, Int. J. Data Min. Tech. Appl., 5 (2016), 43–49. https://doi.org/10.20894/ijdmta.102.005.001.010 doi: 10.20894/ijdmta.102.005.001.010
[42]	S. Bharati, M. A. Rahman, P. Podder, Breast cancer prediction applying different classification algorithm with comparative analysis using Weka, in 2018 4th International Conference on Electrical Engineering and Information & Communication Technology (ICEEiCT), (2018), 581–584. https://doi.org/10.1109/ceeict.2018.8628084
[43]	K. Williams, P. A. Idowu, J. A. Balogun, A. I. Oluwaranti, Breast cancer risk prediction using data mining classification techniques, Trans. Networks Commun., 3 (2015), 17–23. https://doi.org/10.14738/tnc.32.662 doi: 10.14738/tnc.32.662
[44]	P. Mekha, N. Teeyasuksaet, Deep learning algorithms for predicting breast cancer based on tumor cells, in 2019 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT-NCON), 2019. https://doi.org/10.1109/ecti-ncon.2019.8692297
[45]	C. Shah, A. G. Jivani, Comparison of data mining classification algorithms for breast cancer prediction, in 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT), 2013. https://doi.org/10.1109/icccnt.2013.6726477
[46]	A. A. Bataineh, A comparative analysis of nonlinear machine learning algorithms for breast cancer detection, Int. J. Mach. Learn. Comput., 9 (2019), 248–254. https://doi.org/10.18178/ijmlc.2019.9.3.794 doi: 10.18178/ijmlc.2019.9.3.794
[47]	M. S. M. Prince, A. Hasan, F. M. Shah, An efficient ensemble method for cancer detection, in 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), 2019. https://doi.org/10.1109/icasert.2019.8934817
[48]	S. Aruna, A novel SVM based CSSFFS feature selection algorithm for Detecting Breast Cancer, Int. J. Comput., 31 (2011), 14–20. https://doi.org/10.5120/3844-5346 doi: 10.5120/3844-5346
[49]	G. Carneiro, J. Nascimento, A. P. Bradley, Automated analysis of unregistered Multi-View Mammograms with deep learning, IEEE Trans. Med. Imaging, 36 (2017), 2355–2365. https://doi.org/10.1109/tmi.2017.2751523 doi: 10.1109/tmi.2017.2751523
[50]	Z. Sha, L. Hu, B. D. Rouyendegh, Deep learning and optimization algorithms for Automatic Breast Cancer Detection, Int. J. Imaging Syst. Technol., 30 (2020), 495–506. https://doi.org/10.1002/ima.22400 doi: 10.1002/ima.22400
[51]	M. Mahmoud, Breast cancer classification in histopathological images using convolutional neural network, Int. J. Comput. Sci. Appl., 9 (2018), 12–15. https://doi.org/10.14569/ijacsa.2018.090310 doi: 10.14569/ijacsa.2018.090310
[52]	Z. Jiao, X. Gao, Y. Wang, J. Li, A deep feature based framework for Breast Masses classification, Neurocomputing, 197 (2016), 221–231. https://doi.org/10.1016/j.neucom.2016.02.060 doi: 10.1016/j.neucom.2016.02.060
[53]	M. H. Yap, G. Pons, J. Marti, S. Ganau, M. Sentis, R. Zwiggelaar, et al., Automated breast ultrasound lesions detection using convolutional neural networks, IEEE. J. Biomed. Health Inf., 22 (2018), 1218–1226. https://doi.org/10.1109/jbhi.2017.2731873 doi: 10.1109/jbhi.2017.2731873
[54]	N. Wahab, A. Khan, Y. S. Lee, Transfer learning based deep CNN for segmentation and detection of mitoses in breast cancer histopathological images, Microscopy, 68 (2019), 216–233. https://doi.org/10.1093/jmicro/dfz002 doi: 10.1093/jmicro/dfz002
[55]	Z. Wang, G. Yu, Y. Kang, Y. Zhao, Q. Qu, Breast tumor detection in digital mammography based on Extreme Learning Machine, Neurocomputing, 128 (2014), 175–184. https://doi.org/10.1016/j.neucom.2013.05.053 doi: 10.1016/j.neucom.2013.05.053
[56]	Y. Qiu, Y. Wang, S. Yan, M. Tan, S. Cheng, H. Liu, et al., An initial investigation on developing a new method to predict short-term breast cancer risk based on Deep Learning Technology, Comput. Aided. Des., 2016. https://doi.org/10.1117/12.2216275 doi: 10.1117/12.2216275
[57]	X. W. Chen, X. Lin, Big data deep learning: Challenges and perspectives, IEEE Access, 2 (2014), 514–525. https://doi.org/10.1109/access.2014.2325029 doi: 10.1109/access.2014.2325029
[58]	J. Arevalo, F. A. González, R. R. Pollán, J. L. Oliveira, M. A. G. Lopez, Representation learning for mammography mass lesion classification with convolutional neural networks, Comput. Methods Programs Biomed., 127 (2016), 248–257. https://doi.org/10.1016/j.cmpb.2015.12.014 doi: 10.1016/j.cmpb.2015.12.014
[59]	Y. Kumar, A. Aggarwal, S. Tiwari, K. Singh, An efficient and robust approach for biomedical image retrieval using zernike moments, Biomed. Signal Process. Control, 39 (2018), 459–473. https://doi.org/10.1016/j.bspc.2017.08.018 doi: 10.1016/j.bspc.2017.08.018
[60]	K. Kalaiarasi, R. Soundaria, N. Kausar, P. Agarwal, H. Aydi, H. Alsamir, Optimization of the average monthly cost of an EOQ inventory model for deteriorating items in machine learning using Python, Therm. Sci., 25 (2021), 347–358. https://doi.org/10.2298/tsci21s2347k doi: 10.2298/tsci21s2347k
[61]	M. Franulović, K. Marković, A. Trajkovski, Calibration of material models for the human cervical spine ligament behaviour using a genetic algorithm, Facta Univ., Series: Mechan. Eng., 19 (2021) 751. https://doi.org/10.22190/fume201029023f doi: 10.22190/fume201029023f
[62]	M. Fayaz, D. H. Kim, A prediction methodology of energy consumption based on Deep Extreme Learning Machine and comparative analysis in residential buildings, Electronics, 7 (2018), 222. https://doi.org/10.3390/electronics7100222 doi: 10.3390/electronics7100222
[63]	G. B. Huang, D. H. Wang, Y. Lan, Extreme learning machines: A survey, Int. J. Mach. Learn. Cybern., 2 (2011), 107–122. https://doi.org/10.1007/s13042-011-0019-y doi: 10.1007/s13042-011-0019-y
[64]	H. Tang, S. Gao, L. Wang, X. Li, B. Li, S. Pang, A novel intelligent fault diagnosis method for rolling bearings based on Wasserstein generative adversarial network and Convolutional Neural Network under Unbalanced Dataset, Sensors, 21 (2021), 6754. https://doi.org/10.3390/s21206754 doi: 10.3390/s21206754
[65]	J. Wei, H. Liu, G. Yan, F. Sun, Multi-modal deep extreme learning machine for robotic grasping recognition, Proceed. Adapt., Learn. Optim., (2016), 223–233. https://doi.org/10.1007/978-3-319-28373-9_19 doi: 10.1007/978-3-319-28373-9_19
[66]	N. S. Naz, M. A. Khan, S. Abbas, A. Ather, S. Saqib, Intelligent routing between capsules empowered with deep extreme machine learning technique, SN Appl. Sci., 2 (2019), 1–14. https://doi.org/10.1007/s42452-019-1873-6 doi: 10.1007/s42452-019-1873-6
[67]	J. Cai, J. Luo, S. Wang, S. Yang, Feature selection in Machine Learning: A new perspective, Neurocomputing, 300 (2018), 70–79. https://doi.org/10.1016/j.neucom.2017.11.077 doi: 10.1016/j.neucom.2017.11.077
[68]	L. M. Abualigah, A. T. Khader, E. S. Hanandeh, A new feature selection method to improve the document clustering using particle swarm optimization algorithm, J. Comput. Sci., 25 (2018), 456–466. https://doi.org/10.1016/j.jocs.2017.07.018 doi: 10.1016/j.jocs.2017.07.018
[69]	P. A. Flach, ROC analysis, encyclopedia of machine learning and data mining, Encycl. Mach. Learn. Data Min., (2016), 1–8. https://doi.org/10.1007/978-1-4899-7502-7_739-1 doi: 10.1007/978-1-4899-7502-7_739-1
[70]	Q. Wuniri, W. Huangfu, Y. Liu, X. Lin, L. Liu, Z. Yu, A generic-driven wrapper embedded with feature-type-aware hybrid bayesian classifier for breast cancer classification, IEEE Access, 7 (2019), 119931–119942. https://doi.org/10.1109/access.2019.2932505 doi: 10.1109/access.2019.2932505
[71]	J. Zheng, D. Lin, Z. Gao, S. Wang, M. He, J. Fan, Deep Learning assisted efficient ADABOOST algorithm for breast cancer detection and early diagnosis, IEEE Access, 8 (2020), 96946–96954. https://doi.org/10.1109/access.2020.2993536 doi: 10.1109/access.2020.2993536
[72]	X. Zhang, D. He, Y. Zheng, H. Huo, S. Li, R. Chai, et al., Deep learning based analysis of breast cancer using advanced ensemble classifier and linear discriminant analysis, IEEE Access, 8 (2020), 120208–120217. https://doi.org/10.1109/access.2020.3005228 doi: 10.1109/access.2020.3005228
[73]	Y. Yari, T. V. Nguyen, H. T. Nguyen, Deep learning applied for histological diagnosis of breast cancer, IEEE Access, 8 (2020), 162432–162448. https://doi.org/10.1109/access.2020.3021557 doi: 10.1109/access.2020.3021557
[74]	A. H. Osman, H. M. Aljahdali, An effective of ensemble boosting learning method for breast cancer virtual screening using neural network model, IEEE Access, 8 (2020), 39165–39174. https://doi.org/10.1109/access.2020.2976149 doi: 10.1109/access.2020.2976149
[75]	Y. Li, J. Wu, Q. Wu, Classification of breast cancer histology images using multi-size and discriminative patches based on Deep Learning, IEEE Access, 7 (2019), 21400–21408. https://doi.org/10.1109/access.2019.2898044 doi: 10.1109/access.2019.2898044
[76]	D. M. Vo, N. Q. Nguyen, S. W. Lee, Classification of breast cancer histology images using incremental boosting convolution networks, Inf. Sci., 482 (2019), 123–138. https://doi.org/10.1016/j.ins.2018.12.089 doi: 10.1016/j.ins.2018.12.089
[77]	S. Y. Siddiqui, M. A. Khan, S. Abbas, F. Khan, Smart occupancy detection for road traffic parking using deep extreme learning machine, J. K.S.U. Comput. Inf. Sci., 34 (2022), 727–733. https://doi.org/10.1016/j.jksuci.2020.01.016 doi: 10.1016/j.jksuci.2020.01.016
[78]	M. A. Khan, S. Abbas, K. M. Khan, M. A. A. Ghamdi, A. Rehman, Intelligent forecasting model of covid-19 novel coronavirus outbreak empowered with deep extreme learning machine, Comput. Matern. Contin., 64 (2020), 1329–1342. https://doi.org/10.32604/cmc.2020.011155 doi: 10.32604/cmc.2020.011155
[79]	S. Abbas, M. A. Khan, L. E. F. Morales, A. Rehman, Y. Saeed, Modelling, simulation and optimization of power plant energy sustainability for IoT enabled smart cities empowered with deep extreme learning machine, IEEE Access, 8 (2020), 39982–39997. https://doi.org/10.1109/ACCESS.2020.2976452 doi: 10.1109/ACCESS.2020.2976452
[80]	A. Rehman, A. Athar, M. A. Khan, S. Abbas, A. Fatima, M. Zareei, et al., Modelling, simulation, and optimization of diabetes type ii prediction using deep extreme learning machine, J. Ambient Intell. Smart Environ., 12 (2020), 125–138. https://doi.org/10.3233/AIS-200554 doi: 10.3233/AIS-200554
[81]	A. Haider, M. A. Khan, A. Rehman, H. S. Kim, A real-time sequential deep extreme learning machine cybersecurity intrusion detection system, Comput. Matern. Contin., 66 (2021), 1785–1798. https://doi.org/10.32604/cmc.2020.013910 doi: 10.32604/cmc.2020.013910
[82]	M. A. Khan, A. Rehman, K. M. Khan, M. A. A. Ghamdi, S. H. Almotiri, Enhance intrusion detection in computer networks based on deep extreme learning machine, Comput. Matern. Contin., 66 (2021), 467–480. https://doi.org/10.32604/cmc.2020.013121 doi: 10.32604/cmc.2020.013121
[83]	U. Ahmed, G. F. Issa, M. A. Khan, S. Aftab, M. F. Khan, R. A. T. Said, et al., Prediction of diabetes empowered with fused machine learning, IEEE Access, 10 (2022), 8529–8538. https://doi.org/10.1109/ACCESS.2022.3142097 doi: 10.1109/ACCESS.2022.3142097
[84]	S. Y. Siddiqui, A. Haider, T. M. Ghazal, M. A. Khan, I. Naseer, S. Abbas, et al., IoMT cloud-based intelligent prediction of breast cancer stages empowered with deep learning, IEEE Access, 9 (2021), 146478–146491. https://doi.org/10.1109/ACCESS.2021.3123472 doi: 10.1109/ACCESS.2021.3123472
[85]	M. Ahmad, M. Alfayad, S. Aftab, M. A. Khan, A. Fatima, B. Shoaib, et.al., Data and machine learning fusion architecture for cardiovascular disease prediction, Comput. Matern. Contin., 69 (2021), 2717–2731. https://doi.org/10.32604/cmc.2021.019013 doi: 10.32604/cmc.2021.019013

This article has been cited by:

1.	W. Jung, C.A. Morales, Training neural networks from an ergodic perspective, 2023, 0233-1934, 1, 10.1080/02331934.2023.2239852
2.	Steffen Dereich, Arnulf Jentzen, Sebastian Kassing, On the Existence of Minimizers in Shallow Residual ReLU Neural Network Optimization Landscapes, 2024, 62, 0036-1429, 2640, 10.1137/23M1556241

Reader Comments

Your name:*

Email:*
© 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)

通讯作者: 陈斌, bchen63@163.com

1.
沈阳化工大学材料科学与工程学院沈阳 110142

Mathematical Biosciences and Engineering

3.9

Metrics

Article views(3358) PDF downloads(169) Cited by(9)

Preview PDF

Download XML

Export Citation

Article outline

Show full outline

Mathematical Biosciences and Engineering

Intelligent breast cancer diagnostic system empowered by deep extreme gradient descent optimization

Related Papers:

Abstract

1. Introduction

2. Properties of the risk function and its generalized gradient function

2.1. Mathematical description of artificial neural networks (ANNs)

2.2. Differentiability properties of the risk function

2.3. Local Lipschitz continuity of active neuron regions

2.4. Local Lipschitz continuity properties for the generalized gradient function

2.5. Subdifferentials

3. Existence and uniqueness properties for solutions of gradient flows (GFs)

3.1. Existence properties for solutions of GF differential equations

3.2. Uniqueness properties for solutions of GF differential equations

3.3. Existence and uniqueness properties for solutions of GF differential equations

4. Semialgebraic sets and functions

4.1. Semialgebraic sets and functions

4.2. On the semialgebraic property of certain parametric integrals

4.3. On the semialgebraic property of the risk function

5. Convergence rates for solutions of GF differential equations

5.1. Generalized Kurdyka-Łojasiewicz inequality for the risk function

5.2. Local convergence for solutions of GF differential equations

5.3. Global convergence for solutions of GF differential equations

Acknowledgments

Conflict of interest

References

This article has been cited by:

Reader Comments

通讯作者: 陈斌, bchen63@163.com

Metrics

Figures and Tables

Other Articles By Authors

Related pages

Tools

Export File

Citation

Format

Content

Catalog

Abstract

1. Introduction

2. Properties of the risk function and its generalized gradient function

2.1. Mathematical description of artificial neural networks (ANNs)

2.2. Differentiability properties of the risk function

2.3. Local Lipschitz continuity of active neuron regions

2.4. Local Lipschitz continuity properties for the generalized gradient function

2.5. Subdifferentials

3. Existence and uniqueness properties for solutions of gradient flows (GFs)

3.1. Existence properties for solutions of GF differential equations

3.2. Uniqueness properties for solutions of GF differential equations

3.3. Existence and uniqueness properties for solutions of GF differential equations

4. Semialgebraic sets and functions

4.1. Semialgebraic sets and functions

4.2. On the semialgebraic property of certain parametric integrals

4.3. On the semialgebraic property of the risk function

5. Convergence rates for solutions of GF differential equations

5.1. Generalized Kurdyka-Łojasiewicz inequality for the risk function

5.2. Local convergence for solutions of GF differential equations

5.3. Global convergence for solutions of GF differential equations

Acknowledgments

Conflict of interest

References