
Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation

  • The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure. GD type optimization schemes can be regarded as temporal discretization methods for the gradient flow (GF) differential equations associated to the considered optimization problem and, in view of this, it seems to be a natural direction of research to first aim to develop a mathematical convergence theory for time-continuous GF differential equations and, thereafter, to aim to extend such a time-continuous convergence theory to implementable time-discrete GD type optimization methods. In this article we establish two basic results for GF differential equations in the training of fully-connected feedforward ANNs with one hidden layer and ReLU activation. In the first main result of this article we establish in the training of such ANNs under the assumption that the probability distribution of the input data of the considered supervised learning problem is absolutely continuous with a bounded density function that every GF differential equation admits for every initial value a solution which is also unique among a suitable class of solutions. In the second main result of this article we prove in the training of such ANNs under the assumption that the target function and the density function of the probability distribution of the input data are piecewise polynomial that every non-divergent GF trajectory converges with an appropriate rate of convergence to a critical point and that the risk of the non-divergent GF trajectory converges with rate 1 to the risk of the critical point. We establish this result by proving that the considered risk function is semialgebraic and, consequently, satisfies the Kurdyka-Łojasiewicz inequality, which allows us to show convergence of every non-divergent GF trajectory.

    Citation: Simon Eberle, Arnulf Jentzen, Adrian Riekert, Georg S. Weiss. Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation[J]. Electronic Research Archive, 2023, 31(5): 2519-2554. doi: 10.3934/era.2023128




    The training of artificial neural networks (ANNs) with rectified linear unit (ReLU) activation via gradient descent (GD) type optimization schemes is nowadays a common industrially relevant procedure which appears, for instance, in the context of natural language processing, face recognition, fraud detection, and game intelligence. Although a large number of numerical simulations show that GD type optimization schemes can effectively train ANNs with ReLU activation, to this day the scientific literature in general contains no mathematical convergence analysis which explains the success of GD type optimization schemes in the training of such ANNs.

    GD type optimization schemes can be regarded as temporal discretization methods for the gradient flow (GF) differential equations associated to the considered optimization problem and, in view of this, it seems to be a natural direction of research to first aim to develop a mathematical convergence theory for time-continuous GF differential equations and, thereafter, to aim to extend such a time-continuous convergence theory to implementable time-discrete GD type optimization methods.

    Although the literature in general contains no theoretical analysis which explains the success of GD type optimization schemes in the training of ANNs, there are several promising analysis approaches as well as several partial error analyses regarding the training of ANNs via GD type optimization schemes and GFs, respectively. For convex objective functions, the convergence of GF and GD processes to the global minimum in different settings has been proved, e.g., in [1,2,3,4,5]. For general non-convex objective functions, even under smoothness assumptions GF and GD processes can show wild oscillations and admit infinitely many limit points, cf., e.g., [6]. A standard condition which excludes this undesirable behavior is the Kurdyka-Łojasiewicz inequality and we point to [7,8,9,10,11,12,13,14,15,16] for convergence results for GF and GD processes under Łojasiewicz type assumptions. It is in fact one of the main contributions of this work to demonstrate that the objective functions occurring in the training of ANNs with ReLU activation satisfy an appropriate Kurdyka-Łojasiewicz inequality, provided that both the target function and the density of the probability distribution of the input data are piecewise polynomial. For further abstract convergence results for GF and GD processes in the non-convex setting we refer, e.g., to [17,18,19,20,21] and the references mentioned therein.

    In the overparametrized regime, where the number of training parameters is much larger than the number of training data points, GF and GD processes can be shown to converge to global minima in the training of ANNs with high probability, cf., e.g., [22,23,24,25,26,27,28]. As the number of neurons increases to infinity, the corresponding GF processes converge (with appropriate rescaling) to a measure-valued process which is known in the scientific literature as Wasserstein GF. For results on the convergence behavior of Wasserstein GFs in the training of ANNs we point, e.g., to [29,30,31], [32, Section 5.1], and the references mentioned therein.

    A different approach is to consider only very special target functions and we refer, in particular, to [33,34] for a convergence analysis for GF and GD processes in the case of constant target functions and to [35] for a convergence analysis for GF and GD processes in the training of ANNs with piecewise linear target functions. In the case of linear target functions, a complete characterization of the non-global local minima and the saddle points of the risk function has been obtained in [36].

    In this article we establish two basic results for GF differential equations in the training of fully-connected feedforward ANNs with one hidden layer and ReLU activation. Specifically, in the first main result of this article, see Theorem 1.1 below, we establish in the training of such ANNs under the assumption that the probability distribution of the input data of the considered supervised learning problem is absolutely continuous with a bounded density function that every GF differential equation possesses for every initial value a solution which is also unique among a suitable class of solutions (see (1.6) in Theorem 1.1 for details). In the second main result of this article, see Theorem 1.2 below, we prove in the training of such ANNs under the assumption that the target function and the density function are piecewise polynomial (see (1.8) below for details) that every non-divergent GF trajectory converges with an appropriate speed of convergence (see (1.11) below) to a critical point.

    In Theorems 1.1 and 1.2 we consider ANNs with $d \in \mathbb{N} = \{1, 2, 3, \ldots\}$ neurons on the input layer ($d$-dimensional input), $H \in \mathbb{N}$ neurons on the hidden layer ($H$-dimensional hidden layer), and $1$ neuron on the output layer ($1$-dimensional output). There are thus $Hd$ scalar real weight parameters and $H$ scalar real bias parameters to describe the affine linear transformation between the $d$-dimensional input layer and the $H$-dimensional hidden layer and there are thus $H$ scalar real weight parameters and $1$ scalar real bias parameter to describe the affine linear transformation between the $H$-dimensional hidden layer and the $1$-dimensional output layer. Altogether there are thus

    \[ \mathfrak{d} = Hd + H + H + 1 = Hd + 2H + 1 \tag{1.1} \]

    real numbers to describe the ANNs in Theorems 1.1 and 1.2. We also refer to Figure 1 for a graphical illustration of the architecture of an example ANN with $d = 4$ neurons on the input layer and $H = 5$ neurons on the hidden layer.

    Figure 1.  Graphical illustration of the architecture of an example fully-connected feedforward ANN with one hidden layer with 4 neurons on the input layer, 5 neurons on the hidden layer, and 1 neuron on the output layer corresponding to $d = 4$ and $H = 5$ in Theorems 1.1 and 1.2. In this example there are $Hd = 20$ arrows from the input layer to the hidden layer corresponding to $Hd = 20$ weight parameters to describe the affine linear transformation from the input layer to the hidden layer, there are $H = 5$ bias parameters to describe the affine linear transformation from the input layer to the hidden layer, there are $H = 5$ arrows from the hidden layer to the output layer corresponding to $H = 5$ weight parameters to describe the affine linear transformation from the hidden layer to the output layer, and there is $1$ bias parameter to describe the affine linear transformation from the hidden layer to the output layer. The overall number $\mathfrak{d} \in \mathbb{N}$ of ANN parameters thus satisfies $\mathfrak{d} = Hd + H + H + 1 = Hd + 2H + 1 = 20 + 10 + 1 = 31$ (cf. (1.1) and Theorems 1.1 and 1.2).
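    The parameter count in (1.1) can be checked with a few lines of Python. This is only a minimal sketch; the function name `num_params` is an illustrative assumption and does not appear in the article.

```python
def num_params(d: int, H: int) -> int:
    """Number of real parameters of a one-hidden-layer ANN, cf. (1.1)."""
    inner_weights = H * d  # weights from the input layer to the hidden layer
    inner_biases = H       # one bias per hidden neuron
    outer_weights = H      # weights from the hidden layer to the output neuron
    outer_bias = 1         # single bias of the output neuron
    return inner_weights + inner_biases + outer_weights + outer_bias


# Architecture of Figure 1: d = 4 input neurons, H = 5 hidden neurons.
print(num_params(4, 5))  # 31
```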

    The real numbers $a \in \mathbb{R}$, $b \in (a, \infty)$ in Theorems 1.1 and 1.2 specify the set $[a,b]^d$ in which the input data of the considered supervised learning problem takes values, and the function $f \colon [a,b]^d \to \mathbb{R}$ in Theorem 1.1 specifies the target function of the considered supervised learning problem.

    In Theorem 1.1 we assume that the target function is an element of the set $C([a,b]^d, \mathbb{R})$ of continuous functions from $[a,b]^d$ to $\mathbb{R}$ but beside this continuity hypothesis we do not impose further regularity assumptions on the target function.

    The function $p \colon [a,b]^d \to [0,\infty)$ in Theorems 1.1 and 1.2 is an unnormalized density function of the probability distribution of the input data of the considered supervised learning problem and in Theorem 1.1 we impose that this unnormalized density function is bounded and measurable.

    In Theorems 1.1 and 1.2 we consider ANNs with the ReLU activation function

    \[ \mathbb{R} \ni x \mapsto \max\{x, 0\} \in \mathbb{R}. \tag{1.2} \]

    The ReLU activation function fails to be differentiable and this lack of regularity also transfers to the risk function of the considered supervised learning problem; cf. (1.5) below. We thus need to employ appropriately generalized gradients of the risk function to specify the dynamics of the GFs. As in [34, Setting 2.1 and Proposition 2.3] (cf. also [33,37]), we accomplish this, first, by approximating the ReLU activation function through continuously differentiable functions which converge pointwise to the ReLU activation function and whose derivatives converge pointwise to the left derivative of the ReLU activation function and, thereafter, by specifying the generalized gradient function as the limit of the gradients of the approximated risk functions; see (1.3) and (1.5) in Theorem 1.1 and (1.9) and (1.10) in Theorem 1.2 for details.
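    To make the approximation requirement concrete, the following sketch gives one possible family of continuously differentiable approximations of the ReLU activation function. The piecewise quadratic smoothing and the names `relu_approx` and `relu_approx_deriv` are assumptions chosen for illustration; the article only requires the convergence properties and does not prescribe a particular family. For this family the derivative at $0$ equals $0$ for every $r$, in agreement with the left derivative of the ReLU activation function, and all derivatives are bounded by $1$.

```python
def relu_approx(r: int, x: float) -> float:
    """C^1 smoothing of max{x, 0}: quadratic on (0, 1/r), affine beyond."""
    if x <= 0.0:
        return 0.0
    if x < 1.0 / r:
        return 0.5 * r * x * x
    return x - 0.5 / r


def relu_approx_deriv(r: int, x: float) -> float:
    """Derivative of relu_approx; continuous and bounded by 1."""
    if x <= 0.0:
        return 0.0
    if x < 1.0 / r:
        return r * x
    return 1.0


# Pointwise convergence at a fixed x > 0 as r grows, and derivative 0 at x = 0.
for r in (1, 10, 100, 1000):
    print(r, relu_approx(r, 0.3), relu_approx_deriv(r, 0.3), relu_approx_deriv(r, 0.0))
```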

    We now present the precise statement of Theorem 1.1 and, thereafter, provide further comments regarding Theorem 1.2.

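    Theorem 1.1 (Existence and uniqueness of solutions of GFs in the training of ANNs). Let $d, H, \mathfrak{d} \in \mathbb{N}$, $a \in \mathbb{R}$, $b \in (a, \infty)$, $f \in C([a,b]^d, \mathbb{R})$ satisfy $\mathfrak{d} = dH + 2H + 1$, let $p \colon [a,b]^d \to [0,\infty)$ be bounded and measurable, let $\mathfrak{R}_r \in C(\mathbb{R}, \mathbb{R})$, $r \in \mathbb{N} \cup \{\infty\}$, satisfy for all $x \in \mathbb{R}$ that $(\bigcup_{r \in \mathbb{N}} \{\mathfrak{R}_r\}) \subseteq C^1(\mathbb{R}, \mathbb{R})$, $\mathfrak{R}_\infty(x) = \max\{x, 0\}$, $\sup_{r \in \mathbb{N}} \sup_{y \in [-|x|, |x|]} |(\mathfrak{R}_r)'(y)| < \infty$, and

    \[ \limsup_{r \to \infty} \bigl( |\mathfrak{R}_r(x) - \mathfrak{R}_\infty(x)| + |(\mathfrak{R}_r)'(x) - \mathbb{1}_{(0,\infty)}(x)| \bigr) = 0, \tag{1.3} \]

    for every $\theta = (\theta_1, \ldots, \theta_{\mathfrak{d}}) \in \mathbb{R}^{\mathfrak{d}}$ let $\mathbf{D}^\theta \subseteq \mathbb{N}$ satisfy

    \[ \mathbf{D}^\theta = \Bigl\{ i \in \{1, 2, \ldots, H\} \colon |\theta_{Hd+i}| + \sum_{j=1}^d |\theta_{(i-1)d+j}| = 0 \Bigr\}, \tag{1.4} \]

    for every $r \in \mathbb{N} \cup \{\infty\}$ let $\mathcal{L}_r \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$ satisfy for all $\theta = (\theta_1, \ldots, \theta_{\mathfrak{d}}) \in \mathbb{R}^{\mathfrak{d}}$ that

    \[ \mathcal{L}_r(\theta) = \int_{[a,b]^d} \Bigl( f(x_1, \ldots, x_d) - \theta_{\mathfrak{d}} - \sum_{i=1}^H \theta_{H(d+1)+i} \Bigl[ \mathfrak{R}_r\Bigl( \theta_{Hd+i} + \sum_{j=1}^d \theta_{(i-1)d+j} x_j \Bigr) \Bigr] \Bigr)^2 p(x) \, d(x_1, \ldots, x_d), \tag{1.5} \]

    let $\theta \in \mathbb{R}^{\mathfrak{d}}$, and let $\mathcal{G} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^{\mathfrak{d}}$ satisfy for all $\vartheta \in \{ v \in \mathbb{R}^{\mathfrak{d}} \colon ((\nabla \mathcal{L}_r)(v))_{r \in \mathbb{N}} \text{ is convergent} \}$ that $\mathcal{G}(\vartheta) = \lim_{r \to \infty} (\nabla \mathcal{L}_r)(\vartheta)$. Then

    (i) it holds that $\mathcal{G}$ is locally bounded and measurable and

    (ii) there exists a unique $\Theta \in C([0,\infty), \mathbb{R}^{\mathfrak{d}})$ which satisfies for all $t \in [0,\infty)$, $s \in [t,\infty)$ that $\mathbf{D}^{\Theta_t} \subseteq \mathbf{D}^{\Theta_s}$ and

    \[ \Theta_t = \theta - \int_0^t \mathcal{G}(\Theta_u) \, du. \tag{1.6} \]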
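    The integral equation (1.6) can be approximated numerically by an explicit Euler scheme, which is the plain GD recursion with a constant step size. The sketch below is a generic illustration under stated assumptions: the function name `euler_gradient_flow` is hypothetical, and the toy gradient field $\theta \mapsto \theta$ (the gradient of $\theta \mapsto \tfrac12 \|\theta\|^2$) is a stand-in for the generalized gradient function of Theorem 1.1, chosen because its exact flow $\Theta_t = e^{-t}\theta$ is known.

```python
import math


def euler_gradient_flow(G, theta0, step, n_steps):
    """Explicit Euler steps theta <- theta - step * G(theta); returns the whole path."""
    theta = list(theta0)
    path = [list(theta)]
    for _ in range(n_steps):
        g = G(theta)
        theta = [t - step * gi for t, gi in zip(theta, g)]
        path.append(list(theta))
    return path


# Toy stand-in for the generalized gradient (an assumption): G(theta) = theta.
path = euler_gradient_flow(lambda th: list(th), [1.0, -2.0], step=1e-3, n_steps=1000)
# Compare the Euler value at time t = 1 with the exact flow exp(-1) * theta0.
print(path[-1], [math.exp(-1.0) * 1.0, math.exp(-1.0) * -2.0])
```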

    Theorem 1.1 is a direct consequence of Theorem 3.3 below. In Theorem 1.2 we also assume that the target function $f \colon [a,b]^d \to \mathbb{R}$ is continuous but additionally assume that, roughly speaking, both the target function $f \colon [a,b]^d \to \mathbb{R}$ and the unnormalized density function $p \colon [a,b]^d \to [0,\infty)$ coincide with polynomial functions on suitable subsets of their domain of definition $[a,b]^d$. In Theorem 1.2 the $(n \times d)$-matrices $\alpha_i^k \in \mathbb{R}^{n \times d}$, $i \in \{1, 2, \ldots, n\}$, $k \in \{0, 1\}$, and the $n$-dimensional vectors $\beta_i^k \in \mathbb{R}^n$, $i \in \{1, 2, \ldots, n\}$, $k \in \{0, 1\}$, are used to describe these subsets and the functions $P_i^k \colon \mathbb{R}^d \to \mathbb{R}$, $i \in \{1, 2, \ldots, n\}$, $k \in \{0, 1\}$, constitute the polynomials with which the target function and the unnormalized density function should partially coincide. More formally, in (1.8) in Theorem 1.2 we assume that for every $x \in [a,b]^d$ we have that

    \[ p(x) = \sum_{i \in \{1, 2, \ldots, n\},\ \alpha_i^0 x + \beta_i^0 \in [0,\infty)^n} P_i^0(x) \qquad \text{and} \qquad f(x) = \sum_{i \in \{1, 2, \ldots, n\},\ \alpha_i^1 x + \beta_i^1 \in [0,\infty)^n} P_i^1(x). \tag{1.7} \]
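    As a small worked instance of the representation in (1.7), the sketch below encodes the function $x \mapsto |x|$ in dimension $d = 1$ with $n = 2$ pieces. The helper name `piecewise_poly` and the concrete matrices, vectors, and polynomials are illustrative assumptions, not taken from the article. The second row of each matrix is zero, so that each region is described by a single inequality while every matrix still has $n = 2$ rows.

```python
def piecewise_poly(x, pieces):
    """Sum of poly(x) over all pieces whose region alpha @ x + beta lies in [0, inf)^n."""
    total = 0.0
    for alpha, beta, poly in pieces:
        rows = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
                for row, b_i in zip(alpha, beta)]
        if all(value >= 0.0 for value in rows):
            total += poly(x)
    return total


# |x| = x * 1[x >= 0] + (-x) * 1[-x >= 0]; both pieces vanish at x = 0.
abs_pieces = [
    ([[1.0], [0.0]], [0.0, 0.0], lambda x: x[0]),    # region {x >= 0}, polynomial x
    ([[-1.0], [0.0]], [0.0, 0.0], lambda x: -x[0]),  # region {x <= 0}, polynomial -x
]

for point in (-0.25, 0.0, 0.5):
    print(point, piecewise_poly([point], abs_pieces))
```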

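    In (1.11) in Theorem 1.2 we prove that there exists a strictly positive real number $\beta \in (0,\infty)$ such that for every GF trajectory $\Theta \colon [0,\infty) \to \mathbb{R}^{\mathfrak{d}}$ which does not diverge to infinity in the sense* that $\liminf_{t \to \infty} \|\Theta_t\| < \infty$ we have that $\Theta_t \in \mathbb{R}^{\mathfrak{d}}$, $t \in [0,\infty)$, converges with order $\beta$ to a critical point $\vartheta \in \mathcal{G}^{-1}(\{0\}) = \{\theta \in \mathbb{R}^{\mathfrak{d}} \colon \mathcal{G}(\theta) = 0\}$ and we have that the risk $\mathcal{L}_\infty(\Theta_t) \in \mathbb{R}$, $t \in [0,\infty)$, converges with order $1$ to the risk $\mathcal{L}_\infty(\vartheta)$ of the critical point $\vartheta$. We now present the precise statement of Theorem 1.2.

    *Note that the functions $\|\cdot\| \colon (\bigcup_{n \in \mathbb{N}} \mathbb{R}^n) \to \mathbb{R}$ and $\langle \cdot, \cdot \rangle \colon (\bigcup_{n \in \mathbb{N}} (\mathbb{R}^n \times \mathbb{R}^n)) \to \mathbb{R}$ satisfy for all $n \in \mathbb{N}$, $x = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ that $\|x\| = [\sum_{i=1}^n |x_i|^2]^{1/2}$ and $\langle x, y \rangle = \sum_{i=1}^n x_i y_i$.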

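    Theorem 1.2 (Convergence rates for GF trajectories in the training of ANNs). Let $d, H, \mathfrak{d}, n \in \mathbb{N}$, $a \in \mathbb{R}$, $b \in (a, \infty)$, $f \in C([a,b]^d, \mathbb{R})$ satisfy $\mathfrak{d} = dH + 2H + 1$, for every $i \in \{1, 2, \ldots, n\}$, $k \in \{0, 1\}$ let $\alpha_i^k \in \mathbb{R}^{n \times d}$, let $\beta_i^k \in \mathbb{R}^n$, and let $P_i^k \colon \mathbb{R}^d \to \mathbb{R}$ be a polynomial, let $p \colon [a,b]^d \to [0,\infty)$ satisfy for all $k \in \{0, 1\}$, $x \in [a,b]^d$ that

    \[ k f(x) + (1 - k) p(x) = \sum_{i=1}^n \bigl[ P_i^k(x) \, \mathbb{1}_{[0,\infty)^n}(\alpha_i^k x + \beta_i^k) \bigr], \tag{1.8} \]

    let $\mathfrak{R}_r \in C(\mathbb{R}, \mathbb{R})$, $r \in \mathbb{N} \cup \{\infty\}$, satisfy for all $x \in \mathbb{R}$ that $(\bigcup_{r \in \mathbb{N}} \{\mathfrak{R}_r\}) \subseteq C^1(\mathbb{R}, \mathbb{R})$, $\mathfrak{R}_\infty(x) = \max\{x, 0\}$, $\sup_{r \in \mathbb{N}} \sup_{y \in [-|x|, |x|]} |(\mathfrak{R}_r)'(y)| < \infty$, and

    \[ \limsup_{r \to \infty} \bigl( |\mathfrak{R}_r(x) - \mathfrak{R}_\infty(x)| + |(\mathfrak{R}_r)'(x) - \mathbb{1}_{(0,\infty)}(x)| \bigr) = 0, \tag{1.9} \]

    for every $r \in \mathbb{N} \cup \{\infty\}$ let $\mathcal{L}_r \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$ satisfy for all $\theta = (\theta_1, \ldots, \theta_{\mathfrak{d}}) \in \mathbb{R}^{\mathfrak{d}}$ that

    \[ \mathcal{L}_r(\theta) = \int_{[a,b]^d} \Bigl( f(x_1, \ldots, x_d) - \theta_{\mathfrak{d}} - \sum_{i=1}^H \theta_{H(d+1)+i} \Bigl[ \mathfrak{R}_r\Bigl( \theta_{Hd+i} + \sum_{j=1}^d \theta_{(i-1)d+j} x_j \Bigr) \Bigr] \Bigr)^2 p(x) \, d(x_1, \ldots, x_d), \tag{1.10} \]

    let $\mathcal{G} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^{\mathfrak{d}}$ satisfy for all $\theta \in \{ \vartheta \in \mathbb{R}^{\mathfrak{d}} \colon ((\nabla \mathcal{L}_r)(\vartheta))_{r \in \mathbb{N}} \text{ is convergent} \}$ that $\mathcal{G}(\theta) = \lim_{r \to \infty} (\nabla \mathcal{L}_r)(\theta)$, and let $\Theta \in C([0,\infty), \mathbb{R}^{\mathfrak{d}})$ satisfy $\liminf_{t \to \infty} \|\Theta_t\| < \infty$ and $\forall\, t \in [0,\infty) \colon \Theta_t = \Theta_0 - \int_0^t \mathcal{G}(\Theta_s) \, ds$. Then there exist $\vartheta \in \mathcal{G}^{-1}(\{0\})$, $C, \beta \in (0,\infty)$ which satisfy for all $t \in [0,\infty)$ that

    \[ \|\Theta_t - \vartheta\| \le C (1 + t)^{-\beta} \qquad \text{and} \qquad |\mathcal{L}_\infty(\Theta_t) - \mathcal{L}_\infty(\vartheta)| \le C (1 + t)^{-1}. \tag{1.11} \]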

    Theorem 1.2 above is an immediate consequence of Theorem 5.4 in Subsection 5.3 below. Theorem 1.2 is related to Theorem 1.1 in our previous article [37]. In particular, [37, Theorem 1.1] uses weaker assumptions than Theorem 1.2 above but Theorem 1.2 above establishes a stronger statement when compared to [37, Theorem 1.1]. Specifically, on the one hand, in [37, Theorem 1.1] the target function is only assumed to be a continuous function and the unnormalized density is only assumed to be measurable and integrable, while in Theorem 1.2 it is additionally assumed that both the target function and the unnormalized density are piecewise polynomial in the sense of (1.8) above. On the other hand, [37, Theorem 1.1] only asserts that the risk of every bounded GF trajectory converges to the risk of a critical point, while Theorem 1.2 assures that every non-divergent GF trajectory converges with a strictly positive rate of convergence to a critical point (the rate of convergence is given through the strictly positive real number $\beta \in (0,\infty)$ appearing in the exponent in the left inequality in (1.11) in Theorem 1.2) and also assures that the risk of the non-divergent GF trajectory converges with rate $1$ to the risk of the critical point (the convergence rate $1$ is ensured through the $1$ appearing in the exponent in the right inequality in (1.11) in Theorem 1.2).

    We also point out that Theorem 1.2 assumes that the GF trajectory is non-divergent in the sense that $\liminf_{t \to \infty} \|\Theta_t\| < \infty$. In general, it remains an open problem to establish sufficient conditions which ensure that the GF trajectory has this non-divergence property. In this aspect we also refer to Gallon et al. [38] for counterexamples for which it has been proved that every GF trajectory with sufficiently small initial risk does in the training of ANNs diverge to $\infty$ in the sense that $\liminf_{t \to \infty} \|\Theta_t\| = \infty$.

    The remainder of this article is organized in the following way. In Section 2 we establish several regularity properties for the risk function of the considered supervised learning problem and its generalized gradient function. In Section 3 we employ the findings from Section 2 to establish existence and uniqueness properties for solutions of GF differential equations. In particular, in Section 3 we present the proof of Theorem 1.1 above. In Section 4 we establish under the assumption that both the target function $f \colon [a,b]^d \to \mathbb{R}$ and the unnormalized density function $p \colon [a,b]^d \to [0,\infty)$ are piecewise polynomial that the risk function is semialgebraic in the sense of Definition 4.3 in Section 4 (see Corollary 4.10 in Section 4 for details). In Section 5 we engage the results from Sections 2 and 4 to establish several convergence rate results for solutions of GF differential equations and, thereby, we also prove Theorem 1.2 above.

    In this section we establish several regularity properties for the risk function $\mathcal{L} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$ and its generalized gradient function $\mathcal{G} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^{\mathfrak{d}}$. In particular, in Proposition 2.12 in Subsection 2.5 below we prove for every parameter vector $\theta \in \mathbb{R}^{\mathfrak{d}}$ in the ANN parameter space $\mathbb{R}^{\mathfrak{d}} = \mathbb{R}^{dH + 2H + 1}$ that the generalized gradient $\mathcal{G}(\theta)$ is a limiting subdifferential of the risk function $\mathcal{L} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$ at $\theta$. In Definition 2.8 in Subsection 2.5 we recall the notion of subdifferentials (which are sometimes also referred to as Fréchet subdifferentials in the scientific literature) and in Definition 2.9 in Subsection 2.5 we recall the notion of limiting subdifferentials. In the scientific literature Definitions 2.8 and 2.9 can, in a slightly different presentational form, be found, e.g., in Rockafellar & Wets [39, Definition 8.3] and Bolte et al. [9, Definition 2.10], respectively.

    Our proof of Proposition 2.12 uses the continuous differentiability result for the risk function in Proposition 2.3 in Subsection 2.2 and the local Lipschitz continuity result for the generalized gradient function in Corollary 2.7 in Subsection 2.4. Corollary 2.7 will also be employed in Section 3 below to establish existence and uniqueness results for solutions of GF differential equations. Proposition 2.3 follows directly from [37, Proposition 2.10, Lemmas 2.11 and 2.12]. Our proof of Corollary 2.7, in turn, employs the known representation result for the generalized gradient function in Proposition 2.2 in Subsection 2.2 below and the local Lipschitz continuity result for certain parameter integrals in Corollary 2.6 in Subsection 2.4. Statements related to Proposition 2.2 can, e.g., be found in [37, Proposition 2.2], [33, Proposition 2.3], and [34, Proposition 2.3].

    Our proof of Corollary 2.6 uses the elementary abstract local Lipschitz continuity result for certain parameter integrals in Lemma 2.5 in Subsection 2.4 and the local Lipschitz continuity result for active neuron regions in Lemma 2.4 in Subsection 2.3 below. Lemma 2.4 is a generalization of [35, Lemma 7], Lemma 2.5 is a slight generalization of [35, Lemma 6], and Corollary 2.6 is a generalization of [37, Lemma 2.12] and [35, Corollary 9]. The proof of Lemma 2.5 is therefore omitted.

    In Setting 2.1 in Subsection 2.1 below we present the mathematical setup to describe ANNs with ReLU activation, the risk function $\mathcal{L} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$, and its generalized gradient function $\mathcal{G} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^{\mathfrak{d}}$. Moreover, in (2.6) in Setting 2.1 we define for a given parameter vector $\theta \in \mathbb{R}^{\mathfrak{d}}$ the set of hidden neurons which have all input parameters equal to zero. Such neurons are sometimes called degenerate (cf. Cheridito et al. [36]) and can cause problems with the differentiability of the risk function, which is why we exclude degenerate neurons in Proposition 2.3 and Corollary 2.7 below.
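    The set of degenerate neurons can be read off directly from a flat parameter vector. The following sketch assumes the index layout of (1.4) (inner weights first, then inner biases) and uses the hypothetical helper name `degenerate_neurons`; neurons are numbered from $1$ to $H$ as in the article, while the Python list is indexed from $0$.

```python
def degenerate_neurons(theta, d, H):
    """Indices i in {1, ..., H} whose inner bias and all inner weights vanish."""
    result = set()
    for i in range(H):  # i is the 0-based neuron index
        bias = theta[H * d + i]
        weights = theta[i * d:(i + 1) * d]
        if abs(bias) + sum(abs(w) for w in weights) == 0.0:
            result.add(i + 1)  # report the 1-based index used in the article
    return result


# d = 2, H = 2: neuron 1 has weights (0, 0) and bias 0, neuron 2 has weights (1, 2) and bias 3.
theta = [0.0, 0.0, 1.0, 2.0, 0.0, 3.0, 5.0, 6.0, 7.0]
print(degenerate_neurons(theta, d=2, H=2))  # {1}
```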

    In this subsection we present in Setting 2.1 below the mathematical setup that we employ to state most of the mathematical results of this work. We also refer to Figure 2 below for a table in which we briefly list the mathematical objects introduced in Setting 2.1.

    Figure 2.  List of the mathematical objects introduced in Setting 2.1.

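    Setting 2.1. Let $d, H, \mathfrak{d} \in \mathbb{N}$, $a \in \mathbb{R}$, $b \in (a, \infty)$, $f \in C([a,b]^d, \mathbb{R})$ satisfy $\mathfrak{d} = dH + 2H + 1$, let $\mathfrak{w} = ((\mathfrak{w}^\theta_{i,j})_{(i,j) \in \{1, \ldots, H\} \times \{1, \ldots, d\}})_{\theta \in \mathbb{R}^{\mathfrak{d}}} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^{H \times d}$, $\mathfrak{b} = ((\mathfrak{b}^\theta_1, \ldots, \mathfrak{b}^\theta_H))_{\theta \in \mathbb{R}^{\mathfrak{d}}} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^H$, $\mathfrak{v} = ((\mathfrak{v}^\theta_1, \ldots, \mathfrak{v}^\theta_H))_{\theta \in \mathbb{R}^{\mathfrak{d}}} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^H$, and $\mathfrak{c} = (\mathfrak{c}^\theta)_{\theta \in \mathbb{R}^{\mathfrak{d}}} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$ satisfy for all $\theta = (\theta_1, \ldots, \theta_{\mathfrak{d}}) \in \mathbb{R}^{\mathfrak{d}}$, $i \in \{1, 2, \ldots, H\}$, $j \in \{1, 2, \ldots, d\}$ that

    \[ \mathfrak{w}^\theta_{i,j} = \theta_{(i-1)d+j}, \qquad \mathfrak{b}^\theta_i = \theta_{Hd+i}, \qquad \mathfrak{v}^\theta_i = \theta_{H(d+1)+i}, \qquad \text{and} \qquad \mathfrak{c}^\theta = \theta_{\mathfrak{d}}, \tag{2.1} \]

    let $\mathfrak{R}_r \in C^1(\mathbb{R}, \mathbb{R})$, $r \in \mathbb{N}$, satisfy for all $x \in \mathbb{R}$ that

    \[ \limsup_{r \to \infty} \bigl( |\mathfrak{R}_r(x) - \max\{x, 0\}| + |(\mathfrak{R}_r)'(x) - \mathbb{1}_{(0,\infty)}(x)| \bigr) = 0 \tag{2.2} \]

    and $\sup_{r \in \mathbb{N}} \sup_{y \in [-|x|, |x|]} |(\mathfrak{R}_r)'(y)| < \infty$, let $\lambda \colon \mathcal{B}(\mathbb{R}^d) \to [0,\infty]$ be the Lebesgue–Borel measure on $\mathbb{R}^d$, let $p \colon [a,b]^d \to [0,\infty)$ be bounded and measurable, let $\mathcal{N} = (\mathcal{N}^\theta)_{\theta \in \mathbb{R}^{\mathfrak{d}}} \colon \mathbb{R}^{\mathfrak{d}} \to C(\mathbb{R}^d, \mathbb{R})$ and $\mathcal{L} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$ satisfy for all $\theta \in \mathbb{R}^{\mathfrak{d}}$, $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ that

    \[ \mathcal{N}^\theta(x) = \mathfrak{c}^\theta + \sum_{i=1}^H \mathfrak{v}^\theta_i \max\Bigl\{ \mathfrak{b}^\theta_i + \sum_{j=1}^d \mathfrak{w}^\theta_{i,j} x_j, 0 \Bigr\} \tag{2.3} \]

    and $\mathcal{L}(\theta) = \int_{[a,b]^d} (f(y) - \mathcal{N}^\theta(y))^2 p(y) \, \lambda(dy)$, for every $r \in \mathbb{N}$ let $\mathcal{L}_r \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$ satisfy for all $\theta \in \mathbb{R}^{\mathfrak{d}}$ that

    \[ \mathcal{L}_r(\theta) = \int_{[a,b]^d} \Bigl( f(y) - \mathfrak{c}^\theta - \sum_{i=1}^H \mathfrak{v}^\theta_i \Bigl[ \mathfrak{R}_r\Bigl( \mathfrak{b}^\theta_i + \sum_{j=1}^d \mathfrak{w}^\theta_{i,j} y_j \Bigr) \Bigr] \Bigr)^2 p(y) \, \lambda(dy), \tag{2.4} \]

    for every $\varepsilon \in (0,\infty)$, $\theta \in \mathbb{R}^{\mathfrak{d}}$ let $B_\varepsilon(\theta) \subseteq \mathbb{R}^{\mathfrak{d}}$ satisfy $B_\varepsilon(\theta) = \{\vartheta \in \mathbb{R}^{\mathfrak{d}} \colon \|\theta - \vartheta\| < \varepsilon\}$, for every $\theta \in \mathbb{R}^{\mathfrak{d}}$, $i \in \{1, 2, \ldots, H\}$ let $I^\theta_i \subseteq \mathbb{R}^d$ satisfy

    \[ I^\theta_i = \Bigl\{ x = (x_1, \ldots, x_d) \in [a,b]^d \colon \mathfrak{b}^\theta_i + \sum_{j=1}^d \mathfrak{w}^\theta_{i,j} x_j > 0 \Bigr\}, \tag{2.5} \]

    for every $\theta \in \mathbb{R}^{\mathfrak{d}}$ let $\mathbf{D}^\theta \subseteq \mathbb{N}$ satisfy

    \[ \mathbf{D}^\theta = \Bigl\{ i \in \{1, 2, \ldots, H\} \colon |\mathfrak{b}^\theta_i| + \sum_{j=1}^d |\mathfrak{w}^\theta_{i,j}| = 0 \Bigr\}, \tag{2.6} \]

    and let $\mathcal{G} = (\mathcal{G}_1, \ldots, \mathcal{G}_{\mathfrak{d}}) \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^{\mathfrak{d}}$ satisfy for all $\theta \in \{ \vartheta \in \mathbb{R}^{\mathfrak{d}} \colon ((\nabla \mathcal{L}_r)(\vartheta))_{r \in \mathbb{N}} \text{ is convergent} \}$ that $\mathcal{G}(\theta) = \lim_{r \to \infty} (\nabla \mathcal{L}_r)(\theta)$.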
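    The index bookkeeping in (2.1) and the realization function in (2.3) can be sketched as follows. The helper names `unpack` and `realization` are illustrative assumptions; the code stores the parameter vector as a flat Python list with 0-based indices, so the article's index $(i-1)d + j$ becomes `i * d + j` for 0-based `i` and `j`.

```python
def unpack(theta, d, H):
    """Split a flat parameter vector into inner weights, inner biases, outer weights, outer bias."""
    assert len(theta) == d * H + 2 * H + 1
    w = [[theta[i * d + j] for j in range(d)] for i in range(H)]  # H x d inner weights
    b = [theta[H * d + i] for i in range(H)]                       # H inner biases
    v = [theta[H * (d + 1) + i] for i in range(H)]                 # H outer weights
    c = theta[-1]                                                  # outer bias
    return w, b, v, c


def realization(theta, d, H, x):
    """Realization function of the ReLU ANN with parameter vector theta at input x."""
    w, b, v, c = unpack(theta, d, H)
    hidden = [max(b[i] + sum(w[i][j] * x[j] for j in range(d)), 0.0) for i in range(H)]
    return c + sum(v[i] * hidden[i] for i in range(H))


# d = 2, H = 2: w = [[1, 0], [0, 1]], b = [-1, 0], v = [2, 3], c = 0.5.
theta = [1.0, 0.0, 0.0, 1.0, -1.0, 0.0, 2.0, 3.0, 0.5]
print(realization(theta, 2, 2, [2.0, 1.0]))  # 5.5
```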

    Next we add some explanations regarding the mathematical framework presented in Setting 2.1 above. In Setting 2.1

    ● the natural number $d \in \mathbb{N}$ represents the number of neurons on the input layer of the considered ANNs,

    ● the natural number $H \in \mathbb{N}$ represents the number of neurons on the hidden layer of the considered ANNs, and

    ● the natural number $\mathfrak{d} \in \mathbb{N}$ measures the overall number of parameters of the considered ANNs

    (cf. (1.1) and Figure 1 above). The real numbers $a \in \mathbb{R}$, $b \in (a, \infty)$ in Setting 2.1 are employed to specify the $d$-dimensional set $[a,b]^d \subseteq \mathbb{R}^d$ in which the input data of the supervised learning problem considered in Setting 2.1 takes values and which, thereby, also serves as the domain of definition of the target function of the considered supervised learning problem.

    In Setting 2.1 the function $f \colon [a,b]^d \to \mathbb{R}$ represents the target function of the considered supervised learning problem. In Setting 2.1 the target function $f$ is assumed to be an element of the set $C([a,b]^d, \mathbb{R})$ of continuous functions from the $d$-dimensional set $[a,b]^d$ to the reals $\mathbb{R}$ (first line in Setting 2.1).

    The matrix valued function $\mathfrak{w} = ((\mathfrak{w}^\theta_{i,j})_{(i,j) \in \{1, \ldots, H\} \times \{1, \ldots, d\}})_{\theta \in \mathbb{R}^{\mathfrak{d}}} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^{H \times d}$ in Setting 2.1 is used to represent the inner weight parameters of the ANNs considered in Setting 2.1. In particular, in Setting 2.1 we have for every $\theta \in \mathbb{R}^{\mathfrak{d}}$ that the $H \times d$-matrix $\mathfrak{w}^\theta = (\mathfrak{w}^\theta_{i,j})_{(i,j) \in \{1, \ldots, H\} \times \{1, \ldots, d\}} \in \mathbb{R}^{H \times d}$ represents the weight parameter matrix for the affine linear transformation from the $d$-dimensional input layer to the $H$-dimensional hidden layer of the ANN associated to the ANN parameter vector $\theta \in \mathbb{R}^{\mathfrak{d}}$ (cf. (2.1), (2.3), and Figure 1).

    The vector valued function $\mathfrak{b} = ((\mathfrak{b}^\theta_1, \ldots, \mathfrak{b}^\theta_H))_{\theta \in \mathbb{R}^{\mathfrak{d}}} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^H$ in Setting 2.1 is used to represent the inner bias parameters of the ANNs considered in Setting 2.1. In particular, in Setting 2.1 we have for every $\theta \in \mathbb{R}^{\mathfrak{d}}$ that the $H$-dimensional vector $\mathfrak{b}^\theta = (\mathfrak{b}^\theta_1, \ldots, \mathfrak{b}^\theta_H) \in \mathbb{R}^H$ represents the bias parameter vector for the affine linear transformation from the $d$-dimensional input layer to the $H$-dimensional hidden layer of the ANN associated to the ANN parameter vector $\theta \in \mathbb{R}^{\mathfrak{d}}$ (cf. (2.1), (2.3), and Figure 1).

    The vector valued function $\mathfrak{v} = ((\mathfrak{v}^\theta_1, \ldots, \mathfrak{v}^\theta_H))_{\theta \in \mathbb{R}^{\mathfrak{d}}} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^H$ in Setting 2.1 is used to describe the outer weight parameters of the ANNs considered in Setting 2.1. In particular, in Setting 2.1 we have for every $\theta \in \mathbb{R}^{\mathfrak{d}}$ that the transpose of the $H$-dimensional vector $\mathfrak{v}^\theta = (\mathfrak{v}^\theta_1, \ldots, \mathfrak{v}^\theta_H) \in \mathbb{R}^H$ represents the weight parameter matrix for the affine linear transformation from the $H$-dimensional hidden layer to the $1$-dimensional output layer of the ANN associated to the ANN parameter vector $\theta \in \mathbb{R}^{\mathfrak{d}}$ (cf. (2.1), (2.3), and Figure 1).

    The real valued function $\mathfrak{c} = (\mathfrak{c}^\theta)_{\theta \in \mathbb{R}^{\mathfrak{d}}} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$ in Setting 2.1 is used to represent the outer bias parameter of the ANNs considered in Setting 2.1. In particular, in Setting 2.1 we have for every $\theta \in \mathbb{R}^{\mathfrak{d}}$ that the real number $\mathfrak{c}^\theta \in \mathbb{R}$ describes the bias parameter for the affine linear transformation from the $H$-dimensional hidden layer to the $1$-dimensional output layer of the ANN associated to the ANN parameter vector $\theta \in \mathbb{R}^{\mathfrak{d}}$ (cf. (2.1), (2.3), and Figure 1).

    In Setting 2.1 we consider ANNs with the ReLU activation function $\mathbb{R} \ni x \mapsto \max\{x, 0\} \in \mathbb{R}$ (cf. (1.2)). The ReLU activation function fails to be differentiable and this lack of differentiability typically transfers from the activation function to the realization functions $\mathcal{N}^\theta \colon \mathbb{R}^d \to \mathbb{R}$, $\theta \in \mathbb{R}^{\mathfrak{d}}$, of the considered ANNs and to the risk function $\mathcal{L} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$ of the considered supervised learning problem, both introduced in (2.3) in Setting 2.1. In general, there thus do not exist standard derivatives and standard gradients of the risk function and, in view of this, we need to introduce suitably generalized gradients of the risk function to specify the GF dynamics. As in [34, Setting 2.1 and Proposition 2.3] (cf. also [33,37]), we accomplish this,

    ● first, by approximating the ReLU activation function through appropriate continuously differentiable functions which converge pointwise to the ReLU activation function and whose derivatives converge pointwise to the left derivative of the ReLU activation function,

    ● then, by using these continuously differentiable approximations of the ReLU activation function to specify approximated risk functions, and,

    ● finally, by specifying the generalized gradient function as the pointwise limit of the standard gradients of the approximated risk functions.

    In Setting 2.1 the functions $\mathfrak{R}_r \colon \mathbb{R} \to \mathbb{R}$, $r \in \mathbb{N}$, serve as such appropriate continuously differentiable approximations of the ReLU activation function and the hypothesis in (2.2) ensures that these functions converge pointwise to the ReLU activation function and that the derivatives of these functions converge pointwise to the left derivative of the ReLU activation function (cf. also (1.3) in Theorem 1.1 and (1.9) in Theorem 1.2). These continuously differentiable approximations of the ReLU activation function are then used in (2.4) in Setting 2.1 (cf. also (1.5) in Theorem 1.1 and (1.10) in Theorem 1.2) to introduce continuously differentiable approximated risk functions $\mathcal{L}_r \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$, $r \in \mathbb{N}$, which converge pointwise to the risk function $\mathcal{L} \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$ (cf., e.g., [37, Proposition 2.2]). Finally, the standard gradients of the approximated risk functions $\mathcal{L}_r \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}$, $r \in \mathbb{N}$, are then used to introduce the generalized gradient function $\mathcal{G} = (\mathcal{G}_1, \ldots, \mathcal{G}_{\mathfrak{d}}) \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^{\mathfrak{d}}$ in Setting 2.1. In this regard we also note that Proposition 2.2 in Subsection 2.2 below, in particular, ensures that the function $\mathcal{G} = (\mathcal{G}_1, \ldots, \mathcal{G}_{\mathfrak{d}}) \colon \mathbb{R}^{\mathfrak{d}} \to \mathbb{R}^{\mathfrak{d}}$ in Setting 2.1 is indeed uniquely defined.

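    Proposition 2.2. Assume Setting 2.1. Then it holds for all $\theta \in \mathbb{R}^{\mathfrak{d}}$, $i \in \{1, 2, \ldots, H\}$, $j \in \{1, 2, \ldots, d\}$ that

    \[ \begin{aligned} \mathcal{G}_{(i-1)d+j}(\theta) &= 2 \mathfrak{v}^\theta_i \int_{I^\theta_i} x_j (\mathcal{N}^\theta(x) - f(x)) p(x) \, \lambda(dx), \\ \mathcal{G}_{Hd+i}(\theta) &= 2 \mathfrak{v}^\theta_i \int_{I^\theta_i} (\mathcal{N}^\theta(x) - f(x)) p(x) \, \lambda(dx), \\ \mathcal{G}_{H(d+1)+i}(\theta) &= 2 \int_{[a,b]^d} \Bigl[ \max\Bigl\{ \mathfrak{b}^\theta_i + \sum_{j=1}^d \mathfrak{w}^\theta_{i,j} x_j, 0 \Bigr\} \Bigr] (\mathcal{N}^\theta(x) - f(x)) p(x) \, \lambda(dx), \\ \text{and} \qquad \mathcal{G}_{\mathfrak{d}}(\theta) &= 2 \int_{[a,b]^d} (\mathcal{N}^\theta(x) - f(x)) p(x) \, \lambda(dx). \end{aligned} \tag{2.7} \]

    Proof of Proposition 2.2. Observe that, e.g., [37, Proposition 2.2] establishes (2.7). The proof of Proposition 2.2 is thus complete.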

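    Proposition 2.3. Assume Setting 2.1 and let $U \subseteq \mathbb{R}^{\mathfrak{d}}$ satisfy $U = \{\theta \in \mathbb{R}^{\mathfrak{d}} \colon \mathbf{D}^\theta = \emptyset\}$. Then

    (i) it holds that $U \subseteq \mathbb{R}^{\mathfrak{d}}$ is open,

    (ii) it holds that $\mathcal{L}|_U \in C^1(U, \mathbb{R})$, and

    (iii) it holds that $\nabla(\mathcal{L}|_U) = \mathcal{G}|_U$.

    Proof of Proposition 2.3. Note that [37, Proposition 2.10, Lemmas 2.11 and 2.12] establish items (i), (ii), and (iii). The proof of Proposition 2.3 is thus complete.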
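    The representation of the generalized gradient in Proposition 2.2 and its agreement with the standard gradient at non-degenerate parameters in Proposition 2.3 can be illustrated numerically. The sketch below is restricted to $d = 1$ and $H = 2$ and rests on several assumptions that are not part of the article: the target function $f(x) = x^2$, the density $p \equiv 1$ on $[0,1]$, a midpoint quadrature rule in place of the exact integrals, and all function names. Both the risk and the gradient formula use the same quadrature nodes, so the comparison with central finite differences checks the formula rather than the quadrature error.

```python
H = 2  # hidden neurons; with d = 1 the layout is theta = (w_1, w_2, b_1, b_2, v_1, v_2, c)
N_NODES = 2000
STEP = 1.0 / N_NODES
NODES = [(k + 0.5) * STEP for k in range(N_NODES)]  # midpoint rule on [0, 1]


def f(x):
    return x * x  # assumed target function


def p(x):
    return 1.0  # assumed unnormalized density


def net(theta, x):
    w, b, v, c = theta[0:H], theta[H:2 * H], theta[2 * H:3 * H], theta[3 * H]
    return c + sum(v[i] * max(b[i] + w[i] * x, 0.0) for i in range(H))


def risk(theta):
    return sum(STEP * (f(x) - net(theta, x)) ** 2 * p(x) for x in NODES)


def generalized_gradient(theta):
    """Quadrature version of the four formulas in (2.7) for d = 1."""
    w, b, v = theta[0:H], theta[H:2 * H], theta[2 * H:3 * H]
    g = [0.0] * (3 * H + 1)
    for x in NODES:
        err = 2.0 * (net(theta, x) - f(x)) * p(x) * STEP
        for i in range(H):
            pre = b[i] + w[i] * x
            if pre > 0.0:                        # x lies in the active region of neuron i
                g[i] += v[i] * x * err           # inner weight component
                g[H + i] += v[i] * err           # inner bias component
            g[2 * H + i] += max(pre, 0.0) * err  # outer weight component
        g[3 * H] += err                          # outer bias component
    return g


def finite_difference(theta, eps=1e-5):
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((risk(up) - risk(down)) / (2.0 * eps))
    return grad


theta = [1.0, -2.0, -0.3, 1.0, 0.7, -0.5, 0.1]  # no degenerate neuron
max_gap = max(abs(g - fd) for g, fd in zip(generalized_gradient(theta), finite_difference(theta)))
print(max_gap)
```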

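    Lemma 2.4. Let $d\in\mathbb{N}$, $a\in\mathbb{R}$, $b\in(a,\infty)$, for every $v=(v_1,\ldots,v_{d+1})\in\mathbb{R}^{d+1}$ let $I^v\subseteq[a,b]^d$ satisfy $I^v=\{x\in[a,b]^d\colon v_{d+1}+\sum_{i=1}^{d}v_ix_i>0\}$, for every $n\in\mathbb{N}$ let $\lambda_n\colon\mathcal{B}(\mathbb{R}^n)\to[0,\infty]$ be the Lebesgue–Borel measure on $\mathbb{R}^n$, let $p\colon[a,b]^d\to[0,\infty)$ be bounded and measurable, and let $u\in\mathbb{R}^{d+1}\setminus\{0\}$. Then there exist $\varepsilon,\mathfrak{C}\in(0,\infty)$ such that for all $v,w\in\mathbb{R}^{d+1}$ with $\max\{\|u-v\|,\|u-w\|\}\le\varepsilon$ it holds that

    $$\int_{I^v\Delta I^w}p(x)\,\lambda_d(dx)\le\mathfrak{C}\|v-w\|.\tag{2.8}$$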

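    Proof of Lemma 2.4. Observe that for all $v,w\in\mathbb{R}^{d+1}$ we have that

    $$\int_{I^v\Delta I^w}p(x)\,\lambda_d(dx)\le\Bigl(\sup\nolimits_{x\in[a,b]^d}p(x)\Bigr)\lambda_d(I^v\Delta I^w).\tag{2.9}$$

    Moreover, note that the fact that for all $y\in\mathbb{R}$ it holds that $y\ge-|y|$ ensures that for all $v=(v_1,\ldots,v_{d+1})\in\mathbb{R}^{d+1}$, $i\in\{1,2,\ldots,d+1\}$ with $\|u-v\|<|u_i|$ it holds that

    $$u_iv_i=(u_i)^2+(v_i-u_i)u_i\ge|u_i|^2-|u_i-v_i||u_i|\ge|u_i|^2-\|u-v\||u_i|>0.\tag{2.10}$$

    Next observe that for all $v_1,v_2,w_1,w_2\in\mathbb{R}$ with $\min\{|v_1|,|w_1|\}>0$ it holds that

    $$\Bigl|\frac{v_2}{v_1}-\frac{w_2}{w_1}\Bigr|=\frac{|v_2w_1-w_2v_1|}{|v_1w_1|}=\frac{|v_2(w_1-v_1)+v_1(v_2-w_2)|}{|v_1w_1|}\le\Bigl[\frac{|v_2|+|v_1|}{|v_1w_1|}\Bigr]\bigl[|v_1-w_1|+|v_2-w_2|\bigr].\tag{2.11}$$

    Combining this and (2.10) demonstrates for all $v=(v_1,\ldots,v_{d+1})$, $w=(w_1,\ldots,w_{d+1})\in\mathbb{R}^{d+1}$, $i\in\{1,2,\ldots,d\}$ with $\max\{\|v-u\|,\|w-u\|\}<|u_1|$ that $v_1w_1>0$ and

    $$\Bigl|\frac{v_i}{v_1}-\frac{w_i}{w_1}\Bigr|\le\Bigl[\frac{2\|v\|}{|v_1w_1|}\Bigr]\bigl[2\|v-w\|\bigr]\le\Bigl[\frac{4\|v-u\|+4\|u\|}{|v_1w_1|}\Bigr]\|v-w\|.\tag{2.12}$$

    Hence, we obtain for all $v=(v_1,\ldots,v_{d+1})$, $w=(w_1,\ldots,w_{d+1})\in\mathbb{R}^{d+1}$, $i\in\{1,2,\ldots,d\}$ with $\max\{\|v-u\|,\|w-u\|\}\le\frac{|u_1|}{2}$ and $|u_1|>0$ that $v_1w_1>0$ and

    $$\Bigl|\frac{v_i}{v_1}-\frac{w_i}{w_1}\Bigr|\le\frac{(2|u_1|+4\|u\|)\|v-w\|}{|u_1+(v_1-u_1)||u_1+(w_1-u_1)|}\le\frac{6\|u\|\|v-w\|}{(|u_1|-\|v-u\|)(|u_1|-\|w-u\|)}\le\Bigl[\frac{24\|u\|}{|u_1|^2}\Bigr]\|v-w\|.\tag{2.13}$$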

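    In the following we distinguish between the case $\max_{i\in\{1,2,\ldots,d\}}|u_i|=0$, the case $(\max_{i\in\{1,2,\ldots,d\}}|u_i|,d)\in(0,\infty)\times[2,\infty)$, and the case $(\max_{i\in\{1,2,\ldots,d\}}|u_i|,d)\in(0,\infty)\times\{1\}$. We first prove (2.8) in the case

    $$\max\nolimits_{i\in\{1,2,\ldots,d\}}|u_i|=0.\tag{2.14}$$

    Note that (2.14) and the assumption that $u\in\mathbb{R}^{d+1}\setminus\{0\}$ imply that $|u_{d+1}|>0$. Moreover, observe that (2.14) shows that for all $v=(v_1,\ldots,v_{d+1})\in\mathbb{R}^{d+1}$, $x=(x_1,\ldots,x_d)\in I^u\Delta I^v$ we have that

    $$\begin{aligned}
    \Bigl|\Bigl(\bigl[\textstyle\sum_{i=1}^{d}v_ix_i\bigr]+v_{d+1}\Bigr)-\Bigl(\bigl[\textstyle\sum_{i=1}^{d}u_ix_i\bigr]+u_{d+1}\Bigr)\Bigr|
    &=\Bigl|\bigl[\textstyle\sum_{i=1}^{d}v_ix_i\bigr]+v_{d+1}\Bigr|+\Bigl|\bigl[\textstyle\sum_{i=1}^{d}u_ix_i\bigr]+u_{d+1}\Bigr|\\
    &\ge\Bigl|\bigl[\textstyle\sum_{i=1}^{d}u_ix_i\bigr]+u_{d+1}\Bigr|=|u_{d+1}|.
    \end{aligned}\tag{2.15}$$

    In addition, note that for all $v=(v_1,\ldots,v_{d+1})\in\mathbb{R}^{d+1}$, $x=(x_1,\ldots,x_d)\in[a,b]^d$ it holds that

    $$\begin{aligned}
    \Bigl|\Bigl(\bigl[\textstyle\sum_{i=1}^{d}v_ix_i\bigr]+v_{d+1}\Bigr)-\Bigl(\bigl[\textstyle\sum_{i=1}^{d}u_ix_i\bigr]+u_{d+1}\Bigr)\Bigr|
    &\le\bigl[\textstyle\sum_{i=1}^{d}|v_i-u_i||x_i|\bigr]+|v_{d+1}-u_{d+1}|\\
    &\le\max\{|a|,|b|\}\bigl[\textstyle\sum_{i=1}^{d}|v_i-u_i|\bigr]+|v_{d+1}-u_{d+1}|\\
    &\le(1+d\max\{|a|,|b|\})\|v-u\|.
    \end{aligned}\tag{2.16}$$

    This and (2.15) prove that for all $v\in\mathbb{R}^{d+1}$ with $\|u-v\|\le\frac{|u_{d+1}|}{2+d\max\{|a|,|b|\}}$ we have that $I^u\Delta I^v=\emptyset$, i.e., $I^u=I^v$. Therefore, we get for all $v,w\in\mathbb{R}^{d+1}$ with $\max\{\|u-v\|,\|u-w\|\}\le\frac{|u_{d+1}|}{2+d\max\{|a|,|b|\}}$ that $I^v=I^w=I^u$. Hence, we obtain for all $v,w\in\mathbb{R}^{d+1}$ with $\max\{\|u-v\|,\|u-w\|\}\le\frac{|u_{d+1}|}{2+d\max\{|a|,|b|\}}$ that $\lambda_d(I^v\Delta I^w)=0$. This establishes (2.8) in the case $\max_{i\in\{1,2,\ldots,d\}}|u_i|=0$. In the next step we prove (2.8) in the case

    $$(\max\nolimits_{i\in\{1,2,\ldots,d\}}|u_i|,d)\in(0,\infty)\times[2,\infty).\tag{2.17}$$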

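    For this we assume without loss of generality that $|u_1|>0$. In the following let $J_x^{v,w}\subseteq\mathbb{R}$, $x\in[a,b]^{d-1}$, $v,w\in\mathbb{R}^{d+1}$, satisfy for all $x=(x_2,\ldots,x_d)\in[a,b]^{d-1}$, $v,w\in\mathbb{R}^{d+1}$ that $J_x^{v,w}=\{y\in[a,b]\colon(y,x_2,\ldots,x_d)\in I^v\setminus I^w\}$. Next observe that Fubini's theorem and the fact that for all $v\in\mathbb{R}^{d+1}$ it holds that $I^v$ is measurable show that for all $v,w\in\mathbb{R}^{d+1}$ we have that

    $$\begin{aligned}
    \lambda_d(I^v\Delta I^w)&=\int_{[a,b]^d}\mathbb{1}_{I^v\Delta I^w}(x)\,\lambda_d(dx)=\int_{[a,b]^d}\bigl(\mathbb{1}_{I^v\setminus I^w}(x)+\mathbb{1}_{I^w\setminus I^v}(x)\bigr)\,\lambda_d(dx)\\
    &=\int_{[a,b]^{d-1}}\int_{[a,b]}\bigl(\mathbb{1}_{I^v\setminus I^w}(y,x_2,\ldots,x_d)+\mathbb{1}_{I^w\setminus I^v}(y,x_2,\ldots,x_d)\bigr)\,\lambda_1(dy)\,\lambda_{d-1}(d(x_2,\ldots,x_d))\\
    &=\int_{[a,b]^{d-1}}\int_{[a,b]}\bigl(\mathbb{1}_{J_x^{v,w}}(y)+\mathbb{1}_{J_x^{w,v}}(y)\bigr)\,\lambda_1(dy)\,\lambda_{d-1}(dx)\\
    &=\int_{[a,b]^{d-1}}\bigl(\lambda_1(J_x^{v,w})+\lambda_1(J_x^{w,v})\bigr)\,\lambda_{d-1}(dx).
    \end{aligned}\tag{2.18}$$

    Furthermore, note that for all $x=(x_2,\ldots,x_d)\in[a,b]^{d-1}$, $v=(v_1,\ldots,v_{d+1})$, $w=(w_1,\ldots,w_{d+1})\in\mathbb{R}^{d+1}$, $s\in\{-1,1\}$ with $\min\{sv_1,sw_1\}>0$ it holds that

    $$\begin{aligned}
    J_x^{v,w}&=\{y\in[a,b]\colon(y,x_2,\ldots,x_d)\in I^v\setminus I^w\}\\
    &=\Bigl\{y\in[a,b]\colon v_1y+\bigl[\textstyle\sum_{i=2}^{d}v_ix_i\bigr]+v_{d+1}>0\ge w_1y+\bigl[\textstyle\sum_{i=2}^{d}w_ix_i\bigr]+w_{d+1}\Bigr\}\\
    &=\Bigl\{y\in[a,b]\colon-\tfrac{s}{v_1}\Bigl(\bigl[\textstyle\sum_{i=2}^{d}v_ix_i\bigr]+v_{d+1}\Bigr)<sy\le-\tfrac{s}{w_1}\Bigl(\bigl[\textstyle\sum_{i=2}^{d}w_ix_i\bigr]+w_{d+1}\Bigr)\Bigr\}.
    \end{aligned}\tag{2.19}$$

    Hence, we obtain for all $x=(x_2,\ldots,x_d)\in[a,b]^{d-1}$, $v=(v_1,\ldots,v_{d+1})$, $w=(w_1,\ldots,w_{d+1})\in\mathbb{R}^{d+1}$, $s\in\{-1,1\}$ with $\min\{sv_1,sw_1\}>0$ that

    $$\begin{aligned}
    \lambda_1(J_x^{v,w})&\le\Bigl|\tfrac{s}{v_1}\Bigl(\bigl[\textstyle\sum_{i=2}^{d}v_ix_i\bigr]+v_{d+1}\Bigr)-\tfrac{s}{w_1}\Bigl(\bigl[\textstyle\sum_{i=2}^{d}w_ix_i\bigr]+w_{d+1}\Bigr)\Bigr|\\
    &\le\Bigl[\textstyle\sum_{i=2}^{d}\bigl|\tfrac{v_i}{v_1}-\tfrac{w_i}{w_1}\bigr||x_i|\Bigr]+\bigl|\tfrac{v_{d+1}}{v_1}-\tfrac{w_{d+1}}{w_1}\bigr|\\
    &\le\max\{|a|,|b|\}\Bigl[\textstyle\sum_{i=2}^{d}\bigl|\tfrac{v_i}{v_1}-\tfrac{w_i}{w_1}\bigr|\Bigr]+\bigl|\tfrac{v_{d+1}}{v_1}-\tfrac{w_{d+1}}{w_1}\bigr|.
    \end{aligned}\tag{2.20}$$

    Furthermore, observe that (2.10) demonstrates for all $v=(v_1,\ldots,v_{d+1})\in\mathbb{R}^{d+1}$ with $\|u-v\|<|u_1|$ that $u_1v_1>0$. This implies that for all $v=(v_1,\ldots,v_{d+1})$, $w=(w_1,\ldots,w_{d+1})\in\mathbb{R}^{d+1}$ with $\max\{\|u-v\|,\|u-w\|\}<|u_1|$ there exists $s\in\{-1,1\}$ such that $\min\{sv_1,sw_1\}>0$. Combining this and (2.13) with (2.20) proves that there exists $\mathcal{C}\in\mathbb{R}$ such that for all $x\in[a,b]^{d-1}$, $v,w\in\mathbb{R}^{d+1}$ with $\max\{\|u-v\|,\|u-w\|\}\le\frac{|u_1|}{2}$ we have that $\lambda_1(J_x^{v,w})+\lambda_1(J_x^{w,v})\le\mathcal{C}\|v-w\|$. This, (2.18), and (2.9) establish (2.8) in the case $(\max_{i\in\{1,2,\ldots,d\}}|u_i|,d)\in(0,\infty)\times[2,\infty)$. Finally, we prove (2.8) in the case

    $$(\max\nolimits_{i\in\{1,2,\ldots,d\}}|u_i|,d)\in(0,\infty)\times\{1\}.\tag{2.21}$$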

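    Note that (2.21) demonstrates that $|u_1|>0$. In addition, observe that for all $v=(v_1,v_2)$, $w=(w_1,w_2)\in\mathbb{R}^2$, $s\in\{-1,1\}$ with $\min\{sv_1,sw_1\}>0$ it holds that

    $$I^v\setminus I^w=\{y\in[a,b]\colon v_1y+v_2>0\ge w_1y+w_2\}=\Bigl\{y\in[a,b]\colon-\tfrac{sv_2}{v_1}<sy\le-\tfrac{sw_2}{w_1}\Bigr\}\subseteq\Bigl\{y\in\mathbb{R}\colon-\tfrac{sv_2}{v_1}<sy\le-\tfrac{sw_2}{w_1}\Bigr\}.\tag{2.22}$$

    Therefore, we get for all $v=(v_1,v_2)$, $w=(w_1,w_2)\in\mathbb{R}^2$, $s\in\{-1,1\}$ with $\min\{sv_1,sw_1\}>0$ that

    $$\lambda_1(I^v\setminus I^w)\le\Bigl|\bigl(-\tfrac{sv_2}{v_1}\bigr)-\bigl(-\tfrac{sw_2}{w_1}\bigr)\Bigr|=\Bigl|\frac{v_2}{v_1}-\frac{w_2}{w_1}\Bigr|.\tag{2.23}$$

    Furthermore, note that (2.10) ensures for all $v=(v_1,v_2)\in\mathbb{R}^2$ with $\|u-v\|<|u_1|$ that $u_1v_1>0$. This proves that for all $v=(v_1,v_2)$, $w=(w_1,w_2)\in\mathbb{R}^2$ with $\max\{\|u-v\|,\|u-w\|\}<|u_1|$ there exists $s\in\{-1,1\}$ such that $\min\{sv_1,sw_1\}>0$. Combining this with (2.23) demonstrates for all $v=(v_1,v_2)$, $w=(w_1,w_2)\in\mathbb{R}^2$ with $\max\{\|u-v\|,\|u-w\|\}<|u_1|$ that $\min\{|v_1|,|w_1|\}>0$ and

    $$\lambda_1(I^v\Delta I^w)=\lambda_1(I^v\setminus I^w)+\lambda_1(I^w\setminus I^v)\le2\Bigl|\frac{v_2}{v_1}-\frac{w_2}{w_1}\Bigr|.\tag{2.24}$$

    This, (2.13), and (2.9) establish (2.8) in the case $(\max_{i\in\{1,2,\ldots,d\}}|u_i|,d)\in(0,\infty)\times\{1\}$. The proof of Lemma 2.4 is thus complete.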
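    The Lipschitz-type estimate of Lemma 2.4 can be checked numerically in the simplest situation. The following is a minimal sketch, assuming $d=1$, $[a,b]=[-1,1]$, $p\equiv1$, and $u=(1,0)$; the function names and the sampling scheme are illustrative choices and are not taken from the article. For these values, (2.9), (2.13), and (2.24) suggest the admissible constant $48\|u\|/|u_1|^2=48$ on the ball of radius $|u_1|/2$ around $u$.

```python
import math
import random

def active_interval(v, a, b):
    """Closure of I^v = {y in [a, b] : v[0]*y + v[1] > 0} as (lo, hi); assumes v[0] != 0."""
    root = -v[1] / v[0]
    lo, hi = (max(a, root), b) if v[0] > 0 else (a, min(b, root))
    return (lo, hi) if lo < hi else (a, a)  # zero-length interval encodes the empty set

def sym_diff_measure(v, w, a, b):
    """Lebesgue measure of the symmetric difference of I^v and I^w for d = 1."""
    (lo_v, hi_v), (lo_w, hi_w) = active_interval(v, a, b), active_interval(w, a, b)
    overlap = max(0.0, min(hi_v, hi_w) - max(lo_v, lo_w))
    return (hi_v - lo_v) + (hi_w - lo_w) - 2.0 * overlap

def max_ratio(u=(1.0, 0.0), a=-1.0, b=1.0, samples=2000, seed=0):
    """Largest observed ratio measure(I^v sym-diff I^w) / ||v - w|| for v, w near u."""
    rng = random.Random(seed)
    radius = 0.35 * abs(u[0])  # keeps ||u - v|| <= 0.35 * sqrt(2) * |u_1| < |u_1| / 2
    worst = 0.0
    for _ in range(samples):
        v = tuple(ui + rng.uniform(-radius, radius) for ui in u)
        w = tuple(ui + rng.uniform(-radius, radius) for ui in u)
        dist = math.dist(v, w)
        if dist > 0.0:
            worst = max(worst, sym_diff_measure(v, w, a, b) / dist)
    return worst
```

    The observed ratios stay far below the constant from the proof, which is expected since the proof does not aim at sharp constants.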

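    Lemma 2.5. Let $d,n\in\mathbb{N}$, $a\in\mathbb{R}$, $b\in(a,\infty)$, $x\in\mathbb{R}^n$, $C,\varepsilon\in(0,\infty)$, let $\phi\colon\mathbb{R}^n\times[a,b]^d\to\mathbb{R}$ be locally bounded and measurable, assume for all $r\in(0,\infty)$ that

    $$\sup_{y,z\in\mathbb{R}^n,\,\|y\|+\|z\|\le r,\,y\ne z}\ \sup_{s\in[a,b]^d}\frac{|\phi(y,s)-\phi(z,s)|}{\|y-z\|}<\infty,\tag{2.25}$$

    let $\mu\colon\mathcal{B}([a,b]^d)\to[0,\infty)$ be a finite measure, let $I^y\in\mathcal{B}([a,b]^d)$, $y\in\mathbb{R}^n$, satisfy for all $y,z\in\{v\in\mathbb{R}^n\colon\|x-v\|\le\varepsilon\}$ that $\mu(I^y\Delta I^z)\le C\|y-z\|$, and let $\Phi\colon\mathbb{R}^n\to\mathbb{R}$ satisfy for all $y\in\mathbb{R}^n$ that

    $$\Phi(y)=\int_{I^y}\phi(y,s)\,\mu(ds).\tag{2.26}$$

    Then there exists $\mathcal{C}\in\mathbb{R}$ such that for all $y,z\in\{v\in\mathbb{R}^n\colon\|x-v\|\le\varepsilon\}$ it holds that $|\Phi(y)-\Phi(z)|\le\mathcal{C}\|y-z\|$.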

    Proof of Lemma 2.5. The proof is analogous to the proof of [35, Lemma 6].

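    Corollary 2.6. Assume Setting 2.1, let $\phi\colon\mathbb{R}^{\mathfrak{d}}\times[a,b]^d\to\mathbb{R}$ be locally bounded and measurable, and assume for all $r\in(0,\infty)$ that

    $$\sup_{\theta,\vartheta\in\mathbb{R}^{\mathfrak{d}},\,\|\theta\|+\|\vartheta\|\le r,\,\theta\ne\vartheta}\ \sup_{x\in[a,b]^d}\frac{|\phi(\theta,x)-\phi(\vartheta,x)|}{\|\theta-\vartheta\|}<\infty.\tag{2.27}$$

    Then

    (i) it holds that

    $$\mathbb{R}^{\mathfrak{d}}\ni\theta\mapsto\int_{[a,b]^d}\phi(\theta,x)\,p(x)\,\lambda(dx)\in\mathbb{R}\tag{2.28}$$

    is locally Lipschitz continuous and

    (ii) it holds for all $i\in\{1,2,\ldots,H\}$ that

    $$\{\vartheta\in\mathbb{R}^{\mathfrak{d}}\colon i\notin D^{\vartheta}\}\ni\theta\mapsto\int_{I_i^{\theta}}\phi(\theta,x)\,p(x)\,\lambda(dx)\in\mathbb{R}\tag{2.29}$$

    is locally Lipschitz continuous.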

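    Proof of Corollary 2.6. First observe that Lemma 2.5 (applied for every $\theta\in\mathbb{R}^{\mathfrak{d}}$ with $n\curvearrowleft\mathfrak{d}$, $x\curvearrowleft\theta$, $\mu\curvearrowleft(\mathcal{B}([a,b]^d)\ni A\mapsto\int_Ap(x)\,\lambda(dx)\in[0,\infty))$, $(I^y)_{y\in\mathbb{R}^n}\curvearrowleft([a,b]^d)_{y\in\mathbb{R}^{\mathfrak{d}}}$ in the notation of Lemma 2.5) establishes item (i). In the following let $i\in\{1,2,\ldots,H\}$, $\theta\in\{\vartheta\in\mathbb{R}^{\mathfrak{d}}\colon i\notin D^{\vartheta}\}$. Note that Lemma 2.4 shows that there exist $\varepsilon,\mathfrak{C}\in(0,\infty)$ which satisfy for all $\vartheta_1,\vartheta_2\in\mathbb{R}^{\mathfrak{d}}$ with $\max\{\|\theta-\vartheta_1\|,\|\theta-\vartheta_2\|\}\le\varepsilon$ that

    $$\int_{I_i^{\vartheta_1}\Delta I_i^{\vartheta_2}}p(x)\,\lambda(dx)\le\mathfrak{C}\|\vartheta_1-\vartheta_2\|.\tag{2.30}$$

    Combining this with Lemma 2.5 (applied for every $\theta\in\mathbb{R}^{\mathfrak{d}}$ with $n\curvearrowleft\mathfrak{d}$, $x\curvearrowleft\theta$, $\mu\curvearrowleft(\mathcal{B}([a,b]^d)\ni A\mapsto\int_Ap(x)\,\lambda(dx)\in[0,\infty))$, $(I^y)_{y\in\mathbb{R}^n}\curvearrowleft(I_i^y)_{y\in\mathbb{R}^{\mathfrak{d}}}$ in the notation of Lemma 2.5) demonstrates that there exists $\mathcal{C}\in\mathbb{R}$ such that for all $\vartheta_1,\vartheta_2\in\mathbb{R}^{\mathfrak{d}}$ with $\max\{\|\theta-\vartheta_1\|,\|\theta-\vartheta_2\|\}\le\varepsilon$ it holds that

    $$\Bigl|\int_{I_i^{\vartheta_1}}\phi(\vartheta_1,x)\,p(x)\,\lambda(dx)-\int_{I_i^{\vartheta_2}}\phi(\vartheta_2,x)\,p(x)\,\lambda(dx)\Bigr|\le\mathcal{C}\|\vartheta_1-\vartheta_2\|.\tag{2.31}$$

    This establishes item (ii). The proof of Corollary 2.6 is thus complete.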

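    Corollary 2.7. Assume Setting 2.1. Then

    (i) it holds for all $k\in\mathbb{N}\cap(Hd+H,\mathfrak{d}]$ that

    $$\mathbb{R}^{\mathfrak{d}}\ni\theta\mapsto\mathcal{G}_k(\theta)\in\mathbb{R}\tag{2.32}$$

    is locally Lipschitz continuous,

    (ii) it holds for all $i\in\{1,2,\ldots,H\}$, $j\in\{1,2,\ldots,d\}$ that

    $$\{\vartheta\in\mathbb{R}^{\mathfrak{d}}\colon i\notin D^{\vartheta}\}\ni\theta\mapsto\mathcal{G}_{(i-1)d+j}(\theta)\in\mathbb{R}\tag{2.33}$$

    is locally Lipschitz continuous, and

    (iii) it holds for all $i\in\{1,2,\ldots,H\}$ that

    $$\{\vartheta\in\mathbb{R}^{\mathfrak{d}}\colon i\notin D^{\vartheta}\}\ni\theta\mapsto\mathcal{G}_{Hd+i}(\theta)\in\mathbb{R}\tag{2.34}$$

    is locally Lipschitz continuous.

    Proof of Corollary 2.7. Observe that (2.7) and Corollary 2.6 establish items (i), (ii), and (iii). The proof of Corollary 2.7 is thus complete.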

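    Definition 2.8 (Subdifferential). Let $n\in\mathbb{N}$, $f\in C(\mathbb{R}^n,\mathbb{R})$, $x\in\mathbb{R}^n$. Then we denote by $\hat{\partial}f(x)\subseteq\mathbb{R}^n$ the set given by

    $$\hat{\partial}f(x)=\Bigl\{y\in\mathbb{R}^n\colon\liminf_{\mathbb{R}^n\setminus\{0\}\ni h\to0}\Bigl(\frac{f(x+h)-f(x)-\langle y,h\rangle}{\|h\|}\Bigr)\ge0\Bigr\}.\tag{2.35}$$

    Definition 2.9 (Limiting subdifferential). Let $n\in\mathbb{N}$, $f\in C(\mathbb{R}^n,\mathbb{R})$, $x\in\mathbb{R}^n$. Then we denote by $\partial f(x)\subseteq\mathbb{R}^n$ the set given by

    $$\partial f(x)=\bigcap_{\varepsilon\in(0,\infty)}\overline{\Bigl[\bigcup\nolimits_{y\in\{z\in\mathbb{R}^n\colon\|x-z\|<\varepsilon\}}\hat{\partial}f(y)\Bigr]}\tag{2.36}$$

    (cf. Definition 2.8).

    Lemma 2.10. Let $n\in\mathbb{N}$, $f\in C(\mathbb{R}^n,\mathbb{R})$, $x\in\mathbb{R}^n$. Then

    $$\partial f(x)=\Bigl\{y\in\mathbb{R}^n\colon\exists\,z=(z_1,z_2)\colon\mathbb{N}\to\mathbb{R}^n\times\mathbb{R}^n\colon\Bigl(\bigl[\forall\,k\in\mathbb{N}\colon z_2(k)\in\hat{\partial}f(z_1(k))\bigr],\ \bigl[\limsup\nolimits_{k\to\infty}\bigl(\|z_1(k)-x\|+\|z_2(k)-y\|\bigr)=0\bigr]\Bigr)\Bigr\}\tag{2.37}$$

    (cf. Definitions 2.8 and 2.9).

    Proof of Lemma 2.10. Note that (2.36) establishes (2.37). The proof of Lemma 2.10 is thus complete.

    Lemma 2.11. Let $n\in\mathbb{N}$, $f\in C(\mathbb{R}^n,\mathbb{R})$, let $U\subseteq\mathbb{R}^n$ be open, assume $f|_U\in C^1(U,\mathbb{R})$, and let $x\in U$. Then $\hat{\partial}f(x)=\partial f(x)=\{(\nabla f)(x)\}$ (cf. Definitions 2.8 and 2.9).

    Proof of Lemma 2.11. This is a direct consequence of, e.g., Rockafellar & Wets [39, Exercise 8.8]. The proof of Lemma 2.11 is thus complete.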

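    Proposition 2.12. Assume Setting 2.1 and let $\theta\in\mathbb{R}^{\mathfrak{d}}$. Then $\mathcal{G}(\theta)\in\partial\mathcal{L}(\theta)$ (cf. Definition 2.9).

    Proof of Proposition 2.12. Throughout this proof let $\vartheta=(\vartheta_n)_{n\in\mathbb{N}}\colon\mathbb{N}\to\mathbb{R}^{\mathfrak{d}}$ satisfy for all $n\in\mathbb{N}$, $i\in\{1,2,\ldots,H\}$, $j\in\{1,2,\ldots,d\}$ that $w_{i,j}^{\vartheta_n}=w_{i,j}^{\theta}$, $b_i^{\vartheta_n}=b_i^{\theta}-\frac{1}{n}\mathbb{1}_{D^{\theta}}(i)$, $v_i^{\vartheta_n}=v_i^{\theta}$, and $c^{\vartheta_n}=c^{\theta}$. We prove Proposition 2.12 through an application of Lemma 2.10. Observe that for all $n\in\mathbb{N}$, $i\in\{1,2,\ldots,H\}\setminus D^{\theta}$ it holds that $b_i^{\vartheta_n}=b_i^{\theta}$. This implies for all $n\in\mathbb{N}$, $i\in\{1,2,\ldots,H\}\setminus D^{\theta}$ that

    $$i\notin D^{\vartheta_n}.\tag{2.38}$$

    In addition, note that for all $n\in\mathbb{N}$, $i\in D^{\theta}$ it holds that $b_i^{\vartheta_n}=-\frac{1}{n}<0$. This shows for all $n\in\mathbb{N}$, $i\in D^{\theta}$ that

    $$i\notin D^{\vartheta_n}.\tag{2.39}$$

    Hence, we obtain for all $n\in\mathbb{N}$ that $D^{\vartheta_n}=\emptyset$. Combining this with Proposition 2.3 and Lemma 2.11 demonstrates that for all $n\in\mathbb{N}$ it holds that $\hat{\partial}\mathcal{L}(\vartheta_n)=\{(\nabla\mathcal{L})(\vartheta_n)\}=\{\mathcal{G}(\vartheta_n)\}$ (cf. Definition 2.8). Moreover, observe that $\lim_{n\to\infty}\vartheta_n=\theta$. It thus remains to show that $\mathcal{G}(\vartheta_n)$, $n\in\mathbb{N}$, converges to $\mathcal{G}(\theta)$. Note that Corollary 2.7 ensures that for all $k\in\mathbb{N}\cap(Hd+H,\mathfrak{d}]$ it holds that

    $$\lim_{n\to\infty}\mathcal{G}_k(\vartheta_n)=\mathcal{G}_k(\theta).\tag{2.40}$$

    Furthermore, observe that Corollary 2.7, (2.38), and (2.39) assure that for all $i\in\{1,2,\ldots,H\}\setminus D^{\theta}$, $j\in\{1,2,\ldots,d\}$ it holds that

    $$\lim_{n\to\infty}\mathcal{G}_{(i-1)d+j}(\vartheta_n)=\mathcal{G}_{(i-1)d+j}(\theta)\quad\text{and}\quad\lim_{n\to\infty}\mathcal{G}_{Hd+i}(\vartheta_n)=\mathcal{G}_{Hd+i}(\theta).\tag{2.41}$$

    In addition, note that for all $n\in\mathbb{N}$, $i\in D^{\theta}$ we have that $I_i^{\vartheta_n}=I_i^{\theta}=\emptyset$. Hence, we obtain for all $i\in D^{\theta}$, $j\in\{1,2,\ldots,d\}$ that

    $$\lim_{n\to\infty}\mathcal{G}_{(i-1)d+j}(\vartheta_n)=0=\mathcal{G}_{(i-1)d+j}(\theta)\quad\text{and}\quad\lim_{n\to\infty}\mathcal{G}_{Hd+i}(\vartheta_n)=0=\mathcal{G}_{Hd+i}(\theta).\tag{2.42}$$

    Combining this, (2.40), and (2.41) demonstrates that $\lim_{n\to\infty}\mathcal{G}(\vartheta_n)=\mathcal{G}(\theta)$. This and Lemma 2.10 assure that $\mathcal{G}(\theta)\in\partial\mathcal{L}(\theta)$. The proof of Proposition 2.12 is thus complete.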

    In this section we employ the local Lipschitz continuity result for the generalized gradient function in Corollary 2.7 from Section 2 to establish existence and uniqueness results for solutions of GF differential equations. Specifically, in Proposition 3.1 in Subsection 3.1 below we prove the existence of solutions of GF differential equations, in Lemma 3.2 in Subsection 3.2 below we establish the uniqueness of solutions of GF differential equations among a suitable class of GF solutions, and in Theorem 3.3 in Subsection 3.3 below we combine Proposition 3.1 and Lemma 3.2 to establish the unique existence of solutions of GF differential equations among a suitable class of GF solutions. Theorem 1.1 in the introduction is an immediate consequence of Theorem 3.3.

    Roughly speaking, we show in Theorem 3.3 the unique existence of solutions of GF differential equations among the class of GF solutions which satisfy that the set of all degenerate neurons of the GF solution at time $t\in[0,\infty)$ is non-decreasing in the time variable $t\in[0,\infty)$. In other words, in Theorem 3.3 we prove the unique existence of GF solutions with the property that once a neuron has become degenerate it will remain degenerate for subsequent times.

    Our strategy of the proof of Theorem 3.3 and Proposition 3.1, respectively, can, loosely speaking, be described as follows. Corollary 2.7 above implies that the components of the generalized gradient function $\mathcal{G}\colon\mathbb{R}^{\mathfrak{d}}\to\mathbb{R}^{\mathfrak{d}}$ corresponding to non-degenerate neurons are locally Lipschitz continuous so that the classical Picard–Lindelöf local existence and uniqueness theorem for ordinary differential equations can be brought into play for those components. On the other hand, if at some time $t\in[0,\infty)$ the $i$-th neuron is degenerate, then Proposition 2.2 above shows that the corresponding components of the generalized gradient function $\mathcal{G}\colon\mathbb{R}^{\mathfrak{d}}\to\mathbb{R}^{\mathfrak{d}}$ vanish. The GF differential equation is thus satisfied if the neuron remains degenerate at all subsequent times $s\in[t,\infty)$. Using these arguments we prove in Proposition 3.1 the existence of GF solutions by induction on the number of non-degenerate neurons of the initial value.
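    The observation that a degenerate neuron contributes vanishing gradient components can be illustrated with a small numerical experiment. The following is a minimal sketch, assuming $d=1$, $[a,b]=[0,1]$, $p\equiv1$, the target $f(x)=x^2$, a midpoint quadrature rule, and an explicit Euler discretization of the GF differential equation (that is, plain gradient descent); all of these choices and all function names are illustrative assumptions and not part of the article. The gradient components follow the formulas in (2.7).

```python
def risk_and_gradient(w, b, v, c, target, n_quad=200):
    """Midpoint-rule approximation of the risk and of the generalized gradient (2.7), d = 1 on [0, 1]."""
    H = len(w)
    h = 1.0 / n_quad
    risk, gc = 0.0, 0.0
    gw, gb, gv = [0.0] * H, [0.0] * H, [0.0] * H
    for k in range(n_quad):
        x = (k + 0.5) * h
        pre = [b[i] + w[i] * x for i in range(H)]
        res = c + sum(v[i] * max(pre[i], 0.0) for i in range(H)) - target(x)
        risk += res * res * h
        for i in range(H):
            if pre[i] > 0.0:  # x lies in I_i^theta (strict inequality: left derivative of ReLU)
                gw[i] += 2.0 * v[i] * x * res * h
                gb[i] += 2.0 * v[i] * res * h
            gv[i] += 2.0 * max(pre[i], 0.0) * res * h
        gc += 2.0 * res * h
    return risk, gw, gb, gv, gc

def euler_flow(w, b, v, c, target, step=0.01, n_steps=500):
    """Explicit Euler steps for Theta' = -G(Theta); returns final parameters and both risks."""
    w, b, v = list(w), list(b), list(v)
    initial_risk = risk_and_gradient(w, b, v, c, target)[0]
    for _ in range(n_steps):
        _, gw, gb, gv, gc = risk_and_gradient(w, b, v, c, target)
        w = [wi - step * g for wi, g in zip(w, gw)]
        b = [bi - step * g for bi, g in zip(b, gb)]
        v = [vi - step * g for vi, g in zip(v, gv)]
        c = c - step * gc
    final_risk = risk_and_gradient(w, b, v, c, target)[0]
    return w, b, v, c, initial_risk, final_risk

# Neuron 0 is non-degenerate, neuron 1 is degenerate (w = b = 0).
w_end, b_end, v_end, c_end, risk_start, risk_end = euler_flow(
    w=[1.0, 0.0], b=[-0.2, 0.0], v=[0.5, 0.7], c=0.1, target=lambda x: x * x
)
```

    In this sketch the degenerate neuron keeps its inner weight and bias at zero and its outer weight unchanged in every step, while the remaining parameters move and the approximate risk decreases.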

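    Proposition 3.1. Assume Setting 2.1 and let $\theta\in\mathbb{R}^{\mathfrak{d}}$. Then there exists $\Theta\in C([0,\infty),\mathbb{R}^{\mathfrak{d}})$ which satisfies for all $t\in[0,\infty)$, $s\in[t,\infty)$ that

    $$\Theta_t=\theta-\int_0^t\mathcal{G}(\Theta_u)\,du\quad\text{and}\quad D^{\Theta_t}\subseteq D^{\Theta_s}.\tag{3.1}$$

    Proof of Proposition 3.1. We prove the statement by induction on the quantity $H-\#(D^{\theta})\in\mathbb{N}_0\cap[0,H]$. Assume first that $H-\#(D^{\theta})=0$, i.e., $D^{\theta}=\{1,2,\ldots,H\}$. Observe that this implies that $w^{\theta}=0$ and $b^{\theta}=0$. In the following let $\kappa\in\mathbb{R}$ satisfy

    $$\kappa=\int_{[a,b]^d}f(x)\,p(x)\,\lambda(dx).\tag{3.2}$$

    Note that the Picard–Lindelöf Theorem shows that there exists a unique $\mathfrak{c}\in C([0,\infty),\mathbb{R})$ which satisfies for all $t\in[0,\infty)$ that

    $$\mathfrak{c}(0)=c^{\theta}\quad\text{and}\quad\mathfrak{c}(t)=\mathfrak{c}(0)+2\kappa t-2\Bigl(\int_{[a,b]^d}p(x)\,\lambda(dx)\Bigr)\Bigl(\int_0^t\mathfrak{c}(s)\,ds\Bigr).\tag{3.3}$$

    Next let $\Theta\in C([0,\infty),\mathbb{R}^{\mathfrak{d}})$ satisfy for all $t\in[0,\infty)$, $i\in\{1,2,\ldots,H\}$, $j\in\{1,2,\ldots,d\}$ that

    $$w_{i,j}^{\Theta_t}=w_{i,j}^{\theta}=b_i^{\Theta_t}=b_i^{\theta}=0,\quad v_i^{\Theta_t}=v_i^{\theta},\quad\text{and}\quad c^{\Theta_t}=\mathfrak{c}(t).\tag{3.4}$$

    Observe that (2.7), (3.3), and (3.4) ensure for all $t\in[0,\infty)$ that

    $$\begin{aligned}
    c^{\Theta_t}&=c^{\theta}+2\kappa t-2\Bigl(\int_{[a,b]^d}p(x)\,\lambda(dx)\Bigr)\Bigl(\int_0^tc^{\Theta_s}\,ds\Bigr)=c^{\theta}-2\int_0^t\Bigl(-\kappa+\int_{[a,b]^d}c^{\Theta_s}p(x)\,\lambda(dx)\Bigr)ds\\
    &=c^{\theta}-2\int_0^t\int_{[a,b]^d}\Bigl(c^{\Theta_s}+\textstyle\sum_{i=1}^{H}\bigl[v_i^{\Theta_s}\max\bigl\{b_i^{\Theta_s}+\sum_{j=1}^{d}w_{i,j}^{\Theta_s}x_j,0\bigr\}\bigr]-f(x)\Bigr)p(x)\,\lambda(dx)\,ds\\
    &=c^{\theta}-2\int_0^t\int_{[a,b]^d}(\mathcal{N}^{\Theta_s}(x)-f(x))\,p(x)\,\lambda(dx)\,ds=c^{\theta}-\int_0^t\mathcal{G}_{\mathfrak{d}}(\Theta_s)\,ds.
    \end{aligned}\tag{3.5}$$

    Next note that (3.4) and (2.7) show for all $t\in[0,\infty)$, $i\in\mathbb{N}\cap[1,\mathfrak{d})$ that $D^{\Theta_t}=\{1,2,\ldots,H\}$ and $\mathcal{G}_i(\Theta_t)=0$. Combining this with (3.4) and (3.5) proves that $\Theta$ satisfies (3.1). This establishes the claim in the case $\#(D^{\theta})=H$.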
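    For orientation, the scalar equation (3.3) from the base case can be solved in closed form. The following worked computation is a sketch under the additional assumption (not made in the text) that $\mathfrak{m}:=\int_{[a,b]^d}p(x)\,\lambda(dx)>0$; the symbol $\mathfrak{m}$ is introduced here only for illustration.

```latex
% Differentiating (3.3) gives a linear ODE with constant coefficients:
\mathfrak{c}'(t) = 2\kappa - 2\mathfrak{m}\,\mathfrak{c}(t), \qquad \mathfrak{c}(0) = c^{\theta}.
% Its unique solution is
\mathfrak{c}(t) = \frac{\kappa}{\mathfrak{m}} + \Bigl(c^{\theta} - \frac{\kappa}{\mathfrak{m}}\Bigr) e^{-2\mathfrak{m}t},
\qquad t \in [0,\infty).
% In particular, \mathfrak{c}(t) \to \kappa/\mathfrak{m} as t \to \infty, i.e., the bias converges
% exponentially fast to the p-weighted mean of the target function f.
% If instead \mathfrak{m} = 0, then p = 0 almost everywhere, hence \kappa = 0 and \mathfrak{c}(t) = c^{\theta}.
```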

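    For the induction step assume that $\#(D^{\theta})<H$ and assume that for all $\vartheta\in\mathbb{R}^{\mathfrak{d}}$ with $\#(D^{\vartheta})>\#(D^{\theta})$ there exists $\Theta\in C([0,\infty),\mathbb{R}^{\mathfrak{d}})$ which satisfies for all $t\in[0,\infty)$, $s\in[t,\infty)$ that $\Theta_t=\vartheta-\int_0^t\mathcal{G}(\Theta_u)\,du$ and $D^{\Theta_t}\subseteq D^{\Theta_s}$. In the following let $U\subseteq\mathbb{R}^{\mathfrak{d}}$ satisfy

    $$U=\{\vartheta\in\mathbb{R}^{\mathfrak{d}}\colon D^{\vartheta}\subseteq D^{\theta}\}\tag{3.6}$$

    and let $\mathfrak{G}\colon U\to\mathbb{R}^{\mathfrak{d}}$ satisfy for all $\vartheta\in U$, $i\in\{1,2,\ldots,\mathfrak{d}\}$ that

    $$\mathfrak{G}_i(\vartheta)=\begin{cases}0&\colon i\in\{(\ell-1)d+j\colon\ell\in D^{\theta},\,j\in\mathbb{N}\cap[1,d]\}\cup\{Hd+\ell\colon\ell\in D^{\theta}\}\\\mathcal{G}_i(\vartheta)&\colon\text{else}.\end{cases}\tag{3.7}$$

    Observe that (3.6) assures that $U\subseteq\mathbb{R}^{\mathfrak{d}}$ is open. In addition, note that Corollary 2.7 implies that $\mathfrak{G}$ is locally Lipschitz continuous. Combining this with the Picard–Lindelöf Theorem demonstrates that there exist a unique maximal $\tau\in(0,\infty]$ and $\Psi\in C([0,\tau),U)$ which satisfy for all $t\in[0,\tau)$ that

    $$\Psi_t=\theta-\int_0^t\mathfrak{G}(\Psi_u)\,du.\tag{3.8}$$

    Next observe that (3.7) ensures that for all $t\in[0,\tau)$, $i\in D^{\theta}$, $j\in\{1,2,\ldots,d\}$ we have that

    $$w_{i,j}^{\Psi_t}=w_{i,j}^{\theta}=b_i^{\Psi_t}=b_i^{\theta}=0\quad\text{and}\quad v_i^{\Psi_t}=v_i^{\theta}.\tag{3.9}$$

    This, (3.7), and (2.7) demonstrate for all $t\in[0,\tau)$ that $\mathfrak{G}(\Psi_t)=\mathcal{G}(\Psi_t)$. In addition, note that (3.6) and (3.9) imply for all $t\in[0,\tau)$ that $D^{\Psi_t}=D^{\theta}$. Hence, if $\tau=\infty$ then $\Psi$ satisfies (3.1). Next assume that $\tau<\infty$. Observe that the Cauchy–Schwarz inequality and [37, Lemma 3.1] prove for all $s,t\in[0,\tau)$ with $s\le t$ that

    $$\begin{aligned}
    \|\Psi_t-\Psi_s\|&\le\int_s^t\|\mathcal{G}(\Psi_u)\|\,du\le(t-s)^{1/2}\Bigl[\int_s^t\|\mathcal{G}(\Psi_u)\|^2\,du\Bigr]^{1/2}\le(t-s)^{1/2}\Bigl[\int_0^t\|\mathcal{G}(\Psi_u)\|^2\,du\Bigr]^{1/2}\\
    &=(t-s)^{1/2}(\mathcal{L}(\Psi_0)-\mathcal{L}(\Psi_t))^{1/2}\le(t-s)^{1/2}(\mathcal{L}(\Psi_0))^{1/2}.
    \end{aligned}\tag{3.10}$$

    Hence, we obtain for all $(t_n)_{n\in\mathbb{N}}\subseteq[0,\tau)$ with $\liminf_{n\to\infty}t_n=\tau$ that $(\Psi_{t_n})_{n\in\mathbb{N}}$ is a Cauchy sequence. This implies that $\vartheta:=\lim_{t\uparrow\tau}\Psi_t\in\mathbb{R}^{\mathfrak{d}}$ exists. Furthermore, note that the fact that $\tau$ is maximal proves that $\vartheta\notin U$. Therefore, we have that $D^{\vartheta}\nsubseteq D^{\theta}$. Moreover, observe that (3.9) shows that for all $i\in D^{\theta}$, $j\in\{1,2,\ldots,d\}$ it holds that $w_{i,j}^{\vartheta}=b_i^{\vartheta}=0$ and, therefore, $i\in D^{\vartheta}$. This demonstrates that $\#(D^{\vartheta})>\#(D^{\theta})$. Combining this with the induction hypothesis ensures that there exists $\Phi\in C([0,\infty),\mathbb{R}^{\mathfrak{d}})$ which satisfies for all $t\in[0,\infty)$, $s\in[t,\infty)$ that

    $$\Phi_t=\vartheta-\int_0^t\mathcal{G}(\Phi_u)\,du\quad\text{and}\quad D^{\Phi_t}\subseteq D^{\Phi_s}.\tag{3.11}$$

    In the following let $\Theta\colon[0,\infty)\to\mathbb{R}^{\mathfrak{d}}$ satisfy for all $t\in[0,\infty)$ that

    $$\Theta_t=\begin{cases}\Psi_t&\colon t\in[0,\tau)\\\Phi_{t-\tau}&\colon t\in[\tau,\infty).\end{cases}\tag{3.12}$$

    Note that the fact that $\vartheta=\lim_{t\uparrow\tau}\Psi_t$ and the fact that $\Phi_0=\vartheta$ imply that $\Theta$ is continuous. Furthermore, observe that the fact that $\mathcal{G}$ is locally bounded and (3.8) ensure that

    $$\Theta_\tau=\vartheta=\lim_{t\uparrow\tau}\Psi_t=\lim_{t\uparrow\tau}\Bigl[\theta-\int_0^t\mathcal{G}(\Psi_s)\,ds\Bigr]=\theta-\int_0^\tau\mathcal{G}(\Psi_s)\,ds=\theta-\int_0^\tau\mathcal{G}(\Theta_s)\,ds.\tag{3.13}$$

    Hence, we obtain for all $t\in[\tau,\infty)$ that

    $$\begin{aligned}
    \Theta_t&=(\Theta_t-\Theta_\tau)+\Theta_\tau=(\Phi_{t-\tau}-\Phi_0)+\Theta_\tau=-\int_0^{t-\tau}\mathcal{G}(\Phi_s)\,ds+\theta-\int_0^\tau\mathcal{G}(\Theta_s)\,ds\\
    &=-\int_\tau^t\mathcal{G}(\Theta_s)\,ds+\theta-\int_0^\tau\mathcal{G}(\Theta_s)\,ds=\theta-\int_0^t\mathcal{G}(\Theta_s)\,ds.
    \end{aligned}\tag{3.14}$$

    This shows that $\Theta$ satisfies (3.1). The proof of Proposition 3.1 is thus complete.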

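    Lemma 3.2. Assume Setting 2.1 and let $\theta\in\mathbb{R}^{\mathfrak{d}}$, $\Theta^1,\Theta^2\in C([0,\infty),\mathbb{R}^{\mathfrak{d}})$ satisfy for all $t\in[0,\infty)$, $s\in[t,\infty)$, $k\in\{1,2\}$ that

    $$\Theta_t^k=\theta-\int_0^t\mathcal{G}(\Theta_u^k)\,du\quad\text{and}\quad D^{\Theta_t^k}\subseteq D^{\Theta_s^k}.\tag{3.15}$$

    Then it holds for all $t\in[0,\infty)$ that $\Theta_t^1=\Theta_t^2$.

    Proof of Lemma 3.2. Assume for the sake of contradiction that there exists $t\in[0,\infty)$ such that $\Theta_t^1\ne\Theta_t^2$. By translating the variable $t$ if necessary, we may assume without loss of generality that $\inf\{t\in[0,\infty)\colon\Theta_t^1\ne\Theta_t^2\}=0$. Next note that the fact that $\Theta^1$ and $\Theta^2$ are continuous implies that there exists $\delta\in(0,\infty)$ which satisfies for all $t\in[0,\delta]$, $k\in\{1,2\}$ that $D^{\Theta_t^k}\subseteq D^{\theta}$. Furthermore, observe that (3.15) ensures for all $t\in[0,\infty)$, $i\in D^{\theta}$, $k\in\{1,2\}$ that $i\in D^{\Theta_t^k}$. Hence, we obtain for all $t\in[0,\infty)$, $i\in D^{\theta}$, $j\in\{1,2,\ldots,d\}$, $k\in\{1,2\}$ that

    $$\mathcal{G}_{(i-1)d+j}(\Theta_t^k)=\mathcal{G}_{Hd+i}(\Theta_t^k)=\mathcal{G}_{H(d+1)+i}(\Theta_t^k)=0.\tag{3.16}$$

    In addition, note that the fact that $\Theta^1$ and $\Theta^2$ are continuous implies that there exists a compact $K\subseteq\{\vartheta\in\mathbb{R}^{\mathfrak{d}}\colon D^{\vartheta}\subseteq D^{\theta}\}$ which satisfies for all $t\in[0,\delta]$, $k\in\{1,2\}$ that $\Theta_t^k\in K$. Moreover, observe that Corollary 2.7 proves that for all $i\in\{1,2,\ldots,H\}\setminus D^{\theta}$, $j\in\{1,2,\ldots,d\}$ it holds that $\mathcal{G}_{(i-1)d+j}|_K,\mathcal{G}_{Hd+i}|_K,\mathcal{G}_{H(d+1)+i}|_K,\mathcal{G}_{\mathfrak{d}}|_K\colon K\to\mathbb{R}$ are Lipschitz continuous. This and (3.16) show that there exists $L\in(0,\infty)$ such that for all $t\in[0,\delta]$ we have that

    $$\|\mathcal{G}(\Theta_t^1)-\mathcal{G}(\Theta_t^2)\|\le L\|\Theta_t^1-\Theta_t^2\|.\tag{3.17}$$

    In the following let $M\colon[0,\infty)\to[0,\infty)$ satisfy for all $t\in[0,\infty)$ that $M_t=\sup_{s\in[0,t]}\|\Theta_s^1-\Theta_s^2\|$. Note that the fact that $\inf\{t\in[0,\infty)\colon\Theta_t^1\ne\Theta_t^2\}=0$ proves for all $t\in(0,\infty)$ that $M_t>0$. Moreover, observe that (3.17) ensures for all $t\in(0,\delta)$ that

    $$\|\Theta_t^1-\Theta_t^2\|=\Bigl\|\int_0^t\mathcal{G}(\Theta_u^1)\,du-\int_0^t\mathcal{G}(\Theta_u^2)\,du\Bigr\|\le\int_0^t\|\mathcal{G}(\Theta_u^1)-\mathcal{G}(\Theta_u^2)\|\,du\le L\int_0^t\|\Theta_u^1-\Theta_u^2\|\,du\le LtM_t.\tag{3.18}$$

    Combining this with the fact that $M$ is non-decreasing shows for all $t\in(0,\delta)$, $s\in(0,t]$ that

    $$\|\Theta_s^1-\Theta_s^2\|\le LsM_s\le LtM_t.\tag{3.19}$$

    This demonstrates for all $t\in(0,\min\{L^{-1},\delta\})$ that

    $$0<M_t\le LtM_t<M_t,\tag{3.20}$$

    which is a contradiction. The proof of Lemma 3.2 is thus complete.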

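    Theorem 3.3. Assume Setting 2.1 and let $\theta\in\mathbb{R}^{\mathfrak{d}}$. Then there exists a unique $\Theta\in C([0,\infty),\mathbb{R}^{\mathfrak{d}})$ which satisfies for all $t\in[0,\infty)$, $s\in[t,\infty)$ that

    $$\Theta_t=\theta-\int_0^t\mathcal{G}(\Theta_u)\,du\quad\text{and}\quad D^{\Theta_t}\subseteq D^{\Theta_s}.\tag{3.21}$$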

    Proof of Theorem 3.3. Proposition 3.1 establishes the existence and Lemma 3.2 establishes the uniqueness. The proof of Theorem 3.3 is thus complete.

    In this section we establish in Corollary 4.10 in Subsection 4.3 below that under the assumption that both the target function $f\colon[a,b]^d\to\mathbb{R}$ and the unnormalized density function $p\colon[a,b]^d\to[0,\infty)$ are piecewise polynomial in the sense of Definition 4.9 in Subsection 4.3 we have that the risk function $\mathcal{L}\colon\mathbb{R}^{\mathfrak{d}}\to\mathbb{R}$ is a semialgebraic function in the sense of Definition 4.3 in Subsection 4.1. In Definition 4.9 we specify precisely what we mean by a piecewise polynomial function, in Definition 4.2 in Subsection 4.1 we recall the notion of a semialgebraic set, and in Definition 4.3 we recall the notion of a semialgebraic function. In the scientific literature Definitions 4.2 and 4.3 can, in a slightly different presentational form, be found, e.g., in Bierstone & Milman [40, Definitions 1.1 and 1.2] and Attouch et al. [8, Definition 2.1].

    Note that the risk function $\mathcal{L}\colon\mathbb{R}^{\mathfrak{d}}\to\mathbb{R}$ is given through a parametric integral in the sense that for all $\theta\in\mathbb{R}^{\mathfrak{d}}$ we have that

    $$\mathcal{L}(\theta)=\int_{[a,b]^d}(f(y)-\mathcal{N}^{\theta}(y))^2\,p(y)\,\lambda(dy).\tag{4.1}$$

    In general, parametric integrals of semialgebraic functions are no longer semialgebraic functions and the characterization of functions that can occur as such integrals is quite involved (cf. Kaiser [41]). This is the reason why we introduce in Definition 4.6 in Subsection 4.2 below a suitable subclass of the class of semialgebraic functions which is rich enough to contain the realization functions of ANNs with ReLU activation (cf. (4.30) in Subsection 4.2 below) and which can be shown to be closed under integration (cf. Proposition 4.8 in Subsection 4.2 below for the precise statement).

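    Definition 4.1 (Set of polynomials). Let $n\in\mathbb{N}_0$. Then we denote by $\mathcal{P}_n\subseteq C(\mathbb{R}^n,\mathbb{R})$ the set of all polynomials from $\mathbb{R}^n$ to $\mathbb{R}$.

    Note that $\mathbb{R}^0=\{0\}$, $C(\mathbb{R}^0,\mathbb{R})=C(\{0\},\mathbb{R})$, and $\#(C(\mathbb{R}^0,\mathbb{R}))=\#(C(\{0\},\mathbb{R}))=\infty$. In particular, this shows for all $n\in\mathbb{N}_0$ that $\dim(\mathbb{R}^n)=n$ and $\#(C(\mathbb{R}^n,\mathbb{R}))=\infty$.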

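    Definition 4.2 (Semialgebraic sets). Let $n\in\mathbb{N}$ and let $A\subseteq\mathbb{R}^n$ be a set. Then we say that $A$ is a semialgebraic set if and only if there exist $k\in\mathbb{N}$ and $(P_{i,j,\ell})_{(i,j,\ell)\in\{1,2,\ldots,k\}^2\times\{0,1\}}\subseteq\mathcal{P}_n$ such that

    $$A=\bigcup_{i=1}^{k}\bigcap_{j=1}^{k}\{x\in\mathbb{R}^n\colon P_{i,j,0}(x)=0<P_{i,j,1}(x)\}\tag{4.2}$$

    (cf. Definition 4.1).

    Definition 4.3 (Semialgebraic functions). Let $m,n\in\mathbb{N}$ and let $f\colon\mathbb{R}^n\to\mathbb{R}^m$ be a function. Then we say that $f$ is a semialgebraic function if and only if it holds that $\{(x,f(x))\colon x\in\mathbb{R}^n\}\subseteq\mathbb{R}^{m+n}$ is a semialgebraic set (cf. Definition 4.2).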
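    As a simple illustration of Definitions 4.2 and 4.3 (a worked example added here for orientation, not taken from the article), the ReLU activation function $\mathbb{R}\ni x\mapsto\max\{x,0\}\in\mathbb{R}$ is semialgebraic, since its graph is a finite union of sets of the form $\{P_0=0<P_1\}$ with polynomials $P_0,P_1$ on $\mathbb{R}^2$:

```latex
\{(x,y)\in\mathbb{R}^2 \colon y=\max\{x,0\}\}
  = \{(x,y)\in\mathbb{R}^2 \colon y = 0 < -x\}
    \cup \{(x,y)\in\mathbb{R}^2 \colon y - x = 0 < x\}
    \cup \{(x,y)\in\mathbb{R}^2 \colon x^2 + y^2 = 0 < 1\}.
% The three sets cover the cases x < 0, x > 0, and x = 0, respectively.
% Repeating each set in the inner intersection brings this into the form (4.2) with k = 3.
```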

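    Lemma 4.4. Let $n\in\mathbb{N}$ and let $f,g\colon\mathbb{R}^n\to\mathbb{R}$ be semialgebraic functions (cf. Definition 4.3). Then

    (i) it holds that $\mathbb{R}^n\ni x\mapsto f(x)+g(x)\in\mathbb{R}$ is semialgebraic and

    (ii) it holds that $\mathbb{R}^n\ni x\mapsto f(x)g(x)\in\mathbb{R}$ is semialgebraic.

    Proof of Lemma 4.4. Note that, e.g., Coste [42, Corollary 2.9] (see, e.g., also Bierstone & Milman [40, Section 1]) establishes items (i) and (ii). The proof of Lemma 4.4 is thus complete.

    Definition 4.5 (Set of rational functions). Let $n\in\mathbb{N}$. Then we denote by $\mathcal{R}_n$ the set given by

    $$\mathcal{R}_n=\Bigl\{R\colon\mathbb{R}^n\to\mathbb{R}\colon\Bigl[\exists\,P,Q\in\mathcal{P}_n\colon\forall\,x\in\mathbb{R}^n\colon R(x)=\begin{cases}\frac{P(x)}{Q(x)}&\colon Q(x)\ne0\\0&\colon Q(x)=0\end{cases}\Bigr]\Bigr\}\tag{4.3}$$

    (cf. Definition 4.1).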

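    Definition 4.6. Let $m\in\mathbb{N}$, $n\in\mathbb{N}_0$. Then we denote by $\mathcal{A}_{m,n}$ the $\mathbb{R}$-vector space given by

    $$\begin{aligned}
    \mathcal{A}_{m,n}=\operatorname{span}\Bigl(\Bigl\{f\colon\mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}\colon\Bigl[&\exists\,r\in\mathbb{N},\ A_1,A_2,\ldots,A_r\in\{\{0\},[0,\infty),(0,\infty)\},\ R\in\mathcal{R}_m,\ Q\in\mathcal{P}_n,\\
    &P=(P_{i,j})_{(i,j)\in\{1,2,\ldots,r\}\times\{0,1,\ldots,n\}}\subseteq\mathcal{P}_m\colon\forall\,\theta\in\mathbb{R}^m,\ x=(x_1,\ldots,x_n)\in\mathbb{R}^n\colon\\
    &f(\theta,x)=R(\theta)Q(x)\Bigl[\textstyle\prod_{i=1}^{r}\mathbb{1}_{A_i}\bigl(P_{i,0}(\theta)+\sum_{j=1}^{n}P_{i,j}(\theta)x_j\bigr)\Bigr]\Bigr]\Bigr\}\Bigr)
    \end{aligned}\tag{4.4}$$

    (cf. Definitions 4.1 and 4.5).

    Lemma 4.7. Let $m\in\mathbb{N}$, $f\in\mathcal{A}_{m,0}$ (cf. Definition 4.6). Then $f$ is semialgebraic (cf. Definition 4.3).

    Proof of Lemma 4.7. Throughout this proof let $r\in\mathbb{N}$, $A_1,A_2,\ldots,A_r\in\{\{0\},[0,\infty),(0,\infty)\}$, $R\in\mathcal{R}_m$, $P=(P_i)_{i\in\{1,2,\ldots,r\}}\subseteq\mathcal{P}_m$, and let $g\colon\mathbb{R}^m\to\mathbb{R}$ satisfy for all $\theta\in\mathbb{R}^m$ that

    $$g(\theta)=R(\theta)\prod_{i=1}^{r}\mathbb{1}_{A_i}(P_i(\theta))\tag{4.5}$$

    (cf. Definitions 4.1 and 4.5). Due to the fact that sums of semialgebraic functions are again semialgebraic (cf. Lemma 4.4), it suffices to show that $g$ is semialgebraic. Furthermore, observe that for all $y\in\mathbb{R}$ it holds that $\mathbb{1}_{(0,\infty)}(y)=1-\mathbb{1}_{[0,\infty)}(-y)$ and $\mathbb{1}_{\{0\}}(y)=\mathbb{1}_{[0,\infty)}(y)\mathbb{1}_{[0,\infty)}(-y)$. Hence, by linearity we may assume for all $i\in\{1,2,\ldots,r\}$ that $A_i=[0,\infty)$. Next let $Q_1,Q_2\in\mathcal{P}_m$ satisfy for all $x\in\mathbb{R}^m$ that

    $$R(x)=\begin{cases}\frac{Q_1(x)}{Q_2(x)}&\colon Q_2(x)\ne0\\0&\colon Q_2(x)=0.\end{cases}\tag{4.6}$$

    Note that the graph of $\mathbb{R}^m\ni\theta\mapsto R(\theta)\in\mathbb{R}$ is given by

    $$\{(\theta,y)\in\mathbb{R}^m\times\mathbb{R}\colon Q_2(\theta)=0,\ y=0\}\cup\{(\theta,y)\in\mathbb{R}^m\times\mathbb{R}\colon Q_2(\theta)\ne0,\ Q_2(\theta)y-Q_1(\theta)=0\}.\tag{4.7}$$

    Since both of these sets are described by polynomial equations and inequalities, it follows that $\mathbb{R}^m\ni\theta\mapsto R(\theta)\in\mathbb{R}$ is semialgebraic. In addition, observe that for all $i\in\{1,2,\ldots,r\}$ the graph of $\mathbb{R}^m\ni\theta\mapsto\mathbb{1}_{[0,\infty)}(P_i(\theta))\in\mathbb{R}$ is given by

    $$\{(\theta,y)\in\mathbb{R}^m\times\mathbb{R}\colon P_i(\theta)<0,\ y=0\}\cup\{(\theta,y)\in\mathbb{R}^m\times\mathbb{R}\colon P_i(\theta)\ge0,\ y=1\}.\tag{4.8}$$

    This demonstrates for all $i\in\{1,2,\ldots,r\}$ that $\mathbb{R}^m\ni\theta\mapsto\mathbb{1}_{[0,\infty)}(P_i(\theta))\in\mathbb{R}$ is semialgebraic. Combining this and (4.5) with Lemma 4.4 demonstrates that $g$ is semialgebraic. The proof of Lemma 4.7 is thus complete.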

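    Proposition 4.8. Let $m,n\in\mathbb{N}$, $a\in\mathbb{R}$, $b\in(a,\infty)$, $f\in\mathcal{A}_{m,n}$ (cf. Definition 4.6). Then

    $$\Bigl[\mathbb{R}^m\times\mathbb{R}^{n-1}\ni(\theta,x_1,\ldots,x_{n-1})\mapsto\int_a^bf(\theta,x_1,\ldots,x_n)\,dx_n\in\mathbb{R}\Bigr]\in\mathcal{A}_{m,n-1}.\tag{4.9}$$

    Proof of Proposition 4.8. By linearity of the integral it suffices to consider a function $f$ of the form

    $$f(\theta,x)=R(\theta)Q(x)\prod_{i=1}^{r}\mathbb{1}_{A_i}\Bigl(P_{i,0}(\theta)+\sum_{j=1}^{n}P_{i,j}(\theta)x_j\Bigr)\tag{4.10}$$

    where $r\in\mathbb{N}$, $(P_{i,j})_{(i,j)\in\{1,2,\ldots,r\}\times\{0,1,\ldots,n\}}\subseteq\mathcal{P}_m$, $A_1,A_2,\ldots,A_r\in\{\{0\},(0,\infty),[0,\infty)\}$, $Q\in\mathcal{P}_n$, and $R\in\mathcal{R}_m$ (cf. Definitions 4.1 and 4.5). Moreover, note that for all $y\in\mathbb{R}$ it holds that $\mathbb{1}_{(0,\infty)}(y)=1-\mathbb{1}_{[0,\infty)}(-y)$ and $\mathbb{1}_{\{0\}}(y)=\mathbb{1}_{[0,\infty)}(y)\mathbb{1}_{[0,\infty)}(-y)$. Hence, by linearity we may assume that $A_i=[0,\infty)$ for all $i\in\{1,2,\ldots,r\}$. Furthermore, by linearity we may assume that $Q$ is of the form

    $$Q(x_1,\ldots,x_n)=\prod_{\ell=1}^{n}(x_\ell)^{i_\ell}\tag{4.11}$$

    with $i_1,i_2,\ldots,i_n\in\mathbb{N}_0$. In the following let $s\colon\mathbb{R}\to\mathbb{R}$ satisfy for all $x\in\mathbb{R}$ that $s(x)=\mathbb{1}_{(0,\infty)}(x)-\mathbb{1}_{(0,\infty)}(-x)$, for every $\theta\in\mathbb{R}^m$, $k\in\{-1,0,1\}$ let $S_k^{\theta}\subseteq\{1,2,\ldots,r\}$ satisfy $S_k^{\theta}=\{i\in\{1,2,\ldots,r\}\colon s(P_{i,n}(\theta))=k\}$, and for every $i\in\{1,2,\ldots,r\}$ let $\mathcal{Z}_i\colon\mathbb{R}^m\times\mathbb{R}^n\to\mathbb{R}$ satisfy for all $(\theta,x)\in\mathbb{R}^m\times\mathbb{R}^n$ that

    $$\mathcal{Z}_i(\theta,x)=-P_{i,0}(\theta)-\sum_{j=1}^{n-1}P_{i,j}(\theta)x_j.\tag{4.12}$$

    Observe that (4.10), (4.11), and (4.12) imply for all $\theta\in\mathbb{R}^m$, $x=(x_1,\ldots,x_n)\in\mathbb{R}^n$ that

    $$f(\theta,x)=R(\theta)\Bigl(\prod_{\ell=1}^{n}(x_\ell)^{i_\ell}\Bigr)\Bigl(\prod_{i=1}^{r}\mathbb{1}_{[0,\infty)}\bigl(P_{i,n}(\theta)x_n-\mathcal{Z}_i(\theta,x)\bigr)\Bigr).\tag{4.13}$$

    This shows that $f(\theta,x)$ can only be nonzero if

    $$\forall\,i\in S_1^{\theta}\colon x_n\ge\frac{\mathcal{Z}_i(\theta,x)}{P_{i,n}(\theta)},\qquad\forall\,i\in S_{-1}^{\theta}\colon x_n\le\frac{\mathcal{Z}_i(\theta,x)}{P_{i,n}(\theta)},\qquad\forall\,i\in S_0^{\theta}\colon-\mathcal{Z}_i(\theta,x)\ge0.\tag{4.14}$$

    Hence, if for given $\theta\in\mathbb{R}^m$, $(x_1,\ldots,x_{n-1})\in\mathbb{R}^{n-1}$ there exists $x_n\in[a,b]$ which satisfies these conditions then (4.13) and the fact that $\int y^{i_n}\,dy=\frac{1}{i_n+1}y^{i_n+1}$ imply that

    $$\int_a^bf(\theta,x_1,\ldots,x_n)\,dx_n=\frac{R(\theta)}{i_n+1}\Bigl(\prod_{\ell=1}^{n-1}x_\ell^{i_\ell}\Bigr)\Bigl[\Bigl(\min\Bigl\{b,\min_{j\in S_{-1}^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}\Bigr\}\Bigr)^{i_n+1}-\Bigl(\max\Bigl\{a,\max_{j\in S_1^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}\Bigr\}\Bigr)^{i_n+1}\Bigr].\tag{4.15}$$

    Otherwise, we have that $\int_a^bf(\theta,x_1,\ldots,x_n)\,dx_n=0$. It remains to write these expressions in the different cases as a sum of functions of the required form in Definition 4.6 by introducing suitable indicator functions. Note that there are four possible cases where the integral is nonzero: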

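    ● It holds that $a<\max_{j\in S_1^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}<\min_{j\in S_{-1}^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}<b$. In this case, we have

    $$\int_a^bf(\theta,x_1,\ldots,x_n)\,dx_n=\frac{R(\theta)}{i_n+1}\Bigl(\prod_{\ell=1}^{n-1}x_\ell^{i_\ell}\Bigr)\Bigl[\Bigl(\min_{j\in S_{-1}^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}\Bigr)^{i_n+1}-\Bigl(\max_{j\in S_1^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}\Bigr)^{i_n+1}\Bigr].\tag{4.16}$$

    ● It holds that $a<\max_{j\in S_1^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}<b\le\min_{j\in S_{-1}^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}$. In this case, we have

    $$\int_a^bf(\theta,x_1,\ldots,x_n)\,dx_n=\frac{R(\theta)}{i_n+1}\Bigl(\prod_{\ell=1}^{n-1}x_\ell^{i_\ell}\Bigr)\Bigl[b^{i_n+1}-\Bigl(\max_{j\in S_1^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}\Bigr)^{i_n+1}\Bigr].\tag{4.17}$$

    ● It holds that $\max_{j\in S_1^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}\le a<\min_{j\in S_{-1}^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}<b$. In this case, we have

    $$\int_a^bf(\theta,x_1,\ldots,x_n)\,dx_n=\frac{R(\theta)}{i_n+1}\Bigl(\prod_{\ell=1}^{n-1}x_\ell^{i_\ell}\Bigr)\Bigl[\Bigl(\min_{j\in S_{-1}^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}\Bigr)^{i_n+1}-a^{i_n+1}\Bigr].\tag{4.18}$$

    ● It holds that $\max_{j\in S_1^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}\le a<b\le\min_{j\in S_{-1}^{\theta}}\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}$. In this case, we have

    $$\int_a^bf(\theta,x_1,\ldots,x_n)\,dx_n=\frac{R(\theta)}{i_n+1}\Bigl(\prod_{\ell=1}^{n-1}x_\ell^{i_\ell}\Bigr)\bigl[b^{i_n+1}-a^{i_n+1}\bigr].\tag{4.19}$$

    Since these four cases are disjoint, by summing over all possible choices $A,B,C\subseteq\{1,2,\ldots,r\}$ of the sets $S_k^{\theta}$, $k\in\{-1,0,1\}$, and all choices of (non-empty) subsets $I,J$ of $S_1^{\theta}$, $S_{-1}^{\theta}$ where the maximal/minimal values are achieved, we can write

    $$\int_a^bf(\theta,x_1,\ldots,x_n)\,dx_n=\frac{R(\theta)}{i_n+1}\Bigl(\prod_{\ell=1}^{n-1}x_\ell^{i_\ell}\Bigr)\bigl[(I)+(II)+(III)+(IV)\bigr],\tag{4.20}$$

    where $(I),(II),(III),(IV)$ denote the functions of $\theta\in\mathbb{R}^m$ and $(x_1,\ldots,x_{n-1})\in\mathbb{R}^{n-1}$ given by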

    (I)=A˙B˙C={1,,r}[jA1(0,)(Pj,n(θ))jB1(0,)(Pj,n(θ))jC(1{0}(Pj,n(θ))1[0,)(Zj(θ,x))]IAJB[[iI(1(a,b)(Zi(θ,x)Pi,n(θ))1{0}(Zi(θ,x)Pi,n(θ)ZminI(θ,x)PminI,n(θ)))×jAI1(0,)(ZminI(θ,x)PminI,n(θ)Zj(θ,x)Pj,n(θ))iJ(1(a,b)(Zi(θ,x)Pi,n(θ))1{0}(Zi(θ,x)Pi,n(θ)ZminJ(θ,x)PminJ,n(θ)))×jBJ1(0,)(Zj(θ,x)Pj,n(θ)ZminJ(θ,x)PminJ,n(θ))1(0,)(ZminJ(θ,x)PminJ,n(θ)ZminI(θ,x)PminI,n(θ))]×[(ZminJ(θ,x)PminJ,n(θ))in+1(ZminI(θ,x)PminI,n(θ))in+1]], (4.21)
    (II)=A˙B˙C={1,,r}[jA1(0,)(Pj,n(θ))jB1(0,)(Pj,n(θ))jC(1{0}(Pj,n(θ))1[0,)(Zj(θ,x))]IA[[iI(1(a,b)(Zi(θ,x)Pi,n(θ))1{0}(Zi(θ,x)Pi,n(θ)ZminI(θ,x)PminI,n(θ)))×jAI1(0,)(ZminI(θ,x)PminI,n(θ)Zj(θ,x)Pj,n(θ))iB(1[b,)(Zi(θ,x)Pi,n(θ)))×[bin+1(ZminI(θ,x)PminI,n(θ))in+1]], (4.22)
    (III)=A˙B˙C={1,,r}[jA1(0,)(Pj,n(θ))jB1(0,)(Pj,n(θ))jC(1{0}(Pj,n(θ))1[0,)(Zj(θ,x))]JB[[iA(1(,a](Zi(θ,x)Pi,n(θ)))iJ(1(a,b)(Zi(θ,x)Pi,n(θ))1{0}(Zi(θ,x)Pi,n(θ)ZminJ(θ,x)PminJ,n(θ)))×jBJ1(0,)(Zj(θ,x)Pj,n(θ)ZminJ(θ,x)PminJ,n(θ))]×[(ZminJ(θ,x)PminJ,n(θ))in+1ain+1]], (4.23)

    and

    (IV)=A˙B˙C={1,,r}[jA1(0,)(Pj,n(θ))jB1(0,)(Pj,n(θ))jC(1{0}(Pj,n(θ))1[0,)(Zj(θ,x))]×(iA1(,a](Zi(θ,x)Pi,n(θ))iB1[b,)(Zi(θ,x)Pi,n(θ)))[bin+1ain+1]. (4.24)

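    Note that the first products over all elements of $A,B,C$ precisely describe the conditions that $S_1^{\theta}=A$, $S_{-1}^{\theta}=B$, $S_0^{\theta}=C$, and $\forall\,j\in S_0^{\theta}\colon-\mathcal{Z}_j(\theta,x)\ge0$. Furthermore, observe that, e.g., in $(I)$ we must have for all $i\in I$, $j\in A\setminus I$ that $\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}<\frac{\mathcal{Z}_{\min I}(\theta,x)}{P_{\min I,n}(\theta)}=\frac{\mathcal{Z}_i(\theta,x)}{P_{i,n}(\theta)}\in(a,b)$ in order to obtain a non-zero value. In other words, the maximal value of $\frac{\mathcal{Z}_i(\theta,x)}{P_{i,n}(\theta)}$, $i\in A$, is achieved exactly for $i\in I$, and similarly the minimal value of $\frac{\mathcal{Z}_j(\theta,x)}{P_{j,n}(\theta)}$, $j\in B$, is achieved exactly for $j\in J$ (and analogously in $(II),(III)$). Moreover, note that we have for all $i\in I\subseteq A$ that

    $$\mathbb{1}_{(a,b)}\Bigl(\frac{\mathcal{Z}_i(\theta,x)}{P_{i,n}(\theta)}\Bigr)=\mathbb{1}_{(a,\infty)}\Bigl(\frac{\mathcal{Z}_i(\theta,x)}{P_{i,n}(\theta)}\Bigr)\mathbb{1}_{(-\infty,b)}\Bigl(\frac{\mathcal{Z}_i(\theta,x)}{P_{i,n}(\theta)}\Bigr)=\mathbb{1}_{(0,\infty)}\bigl(\mathcal{Z}_i(\theta,x)-aP_{i,n}(\theta)\bigr)\mathbb{1}_{(0,\infty)}\bigl(bP_{i,n}(\theta)-\mathcal{Z}_i(\theta,x)\bigr).\tag{4.25}$$

    Here $\mathcal{Z}_i(\theta,x)$ is polynomial in $\theta$ and linear in $x_1,\ldots,x_{n-1}$, and thus of the form required by Definition 4.6. Similarly, the other indicator functions can be brought into the correct form, taking into account the different signs of $P_{j,n}(\theta)$ for $j\in A$ and $j\in B$. Moreover, observe that the remaining terms can be written as linear combinations of rational functions in $\theta$ and polynomials in $x$. Hence, we obtain that the functions defined by $(I),(II),(III),(IV)$ are elements of $\mathcal{A}_{m,n-1}$. The proof of Proposition 4.8 is thus complete.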
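    The mechanism behind Proposition 4.8 can be seen in the smallest non-trivial instance. The following is a minimal sketch, assuming $m=n=1$, $r=1$, $R\equiv1$, $Q\equiv1$, $A_1=[0,\infty)$, $P_{1,0}(\theta)=\theta$, and $P_{1,1}(\theta)=-1$, so that $f(\theta,x)=\mathbb{1}_{[0,\infty)}(\theta-x)$; the closed form below and all function names are illustrative and not taken from the article. Integrating over $x\in[a,b]$ gives $(\theta-a)\,\mathbb{1}_{(0,\infty)}(\theta-a)\,\mathbb{1}_{(0,\infty)}(b-\theta)+(b-a)\,\mathbb{1}_{[0,\infty)}(\theta-b)$, which is again a sum of polynomials in $\theta$ times indicator functions of polynomials in $\theta$.

```python
def indicator(condition):
    """Indicator function: 1.0 if the condition holds, else 0.0."""
    return 1.0 if condition else 0.0

def closed_form(theta, a, b):
    """Integral over [a, b] of 1_{[0, inf)}(theta - x) dx, written with indicators only."""
    inner = (theta - a) * indicator(theta - a > 0) * indicator(b - theta > 0)
    saturated = (b - a) * indicator(theta - b >= 0)
    return inner + saturated

def midpoint_integral(theta, a, b, n=10000):
    """Midpoint-rule approximation of the same integral, used only as a cross-check."""
    h = (b - a) / n
    return sum(h for k in range(n) if theta - (a + (k + 0.5) * h) >= 0)
```

    For instance, with $[a,b]=[0,1]$ both functions agree up to the quadrature error for $\theta$ below, inside, and above the interval.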

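    Definition 4.9. Let $d\in\mathbb{N}$, let $A\subseteq\mathbb{R}^d$ be a set, and let $f\colon A\to\mathbb{R}$ be a function. Then we say that $f$ is piecewise polynomial if and only if there exist $n\in\mathbb{N}$, $\alpha_1,\alpha_2,\ldots,\alpha_n\in\mathbb{R}^{n\times d}$, $\beta_1,\beta_2,\ldots,\beta_n\in\mathbb{R}^n$, $P_1,P_2,\ldots,P_n\in\mathcal{P}_d$ such that for all $x\in A$ it holds that

    $$f(x)=\sum_{i=1}^{n}\bigl[P_i(x)\,\mathbb{1}_{[0,\infty)^n}(\alpha_ix+\beta_i)\bigr]\tag{4.26}$$

    (cf. Definition 4.1).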
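    To make the representation (4.26) concrete, the following minimal sketch evaluates a function given in this form and instantiates it for a hat function on $A=[0,1]$ with $d=1$ and $n=2$. The helper name `piecewise_polynomial` and the specific choice of pieces are illustrative assumptions and are not taken from the article. Since the indicator sets in (4.26) are closed, the second piece is chosen as a correction term $1-2x$ on $[1/2,1]$ rather than as $1-x$, which avoids counting the point $x=1/2$ twice.

```python
def piecewise_polynomial(x, pieces):
    """Evaluate sum_i P_i(x) * 1_{[0, inf)^n}(alpha_i x + beta_i) for x given as a list of length d."""
    total = 0.0
    for poly, alpha, beta in pieces:
        # The indicator is 1 exactly when every component of alpha x + beta is non-negative.
        active = all(
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i >= 0
            for row, b_i in zip(alpha, beta)
        )
        if active:
            total += poly(x)
    return total

# Hat function on [0, 1]: x on [0, 1/2] and 1 - x on [1/2, 1].
hat_pieces = [
    # P_1(x) = x on {x >= 0, 1 - x >= 0}
    (lambda x: x[0], [[1.0], [-1.0]], [0.0, 1.0]),
    # P_2(x) = 1 - 2x on {x - 1/2 >= 0, 1 - x >= 0}
    (lambda x: 1.0 - 2.0 * x[0], [[1.0], [-1.0]], [-0.5, 1.0]),
]

hat_values = [piecewise_polynomial([x], hat_pieces) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
```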

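    Corollary 4.10. Assume Setting 2.1 and assume that $f$ and $p$ are piecewise polynomial (cf. Definition 4.9). Then $\mathcal{L}$ is semialgebraic (cf. Definition 4.3).

    Proof of Corollary 4.10. Throughout this proof let $F\colon\mathbb{R}^d\to\mathbb{R}$ and $P\colon\mathbb{R}^d\to\mathbb{R}$ satisfy for all $x\in\mathbb{R}^d$ that

    $$F(x)=\begin{cases}f(x)&\colon x\in[a,b]^d\\0&\colon x\notin[a,b]^d\end{cases}\quad\text{and}\quad P(x)=\begin{cases}p(x)&\colon x\in[a,b]^d\\0&\colon x\notin[a,b]^d.\end{cases}\tag{4.27}$$

    Note that (4.27) and the assumption that $f$ and $p$ are piecewise polynomial assure that

    $$\bigl[\mathbb{R}^{\mathfrak{d}}\times\mathbb{R}^d\ni(\theta,x)\mapsto F(x)\in\mathbb{R}\bigr]\in\mathcal{A}_{\mathfrak{d},d}\quad\text{and}\quad\bigl[\mathbb{R}^{\mathfrak{d}}\times\mathbb{R}^d\ni(\theta,x)\mapsto P(x)\in\mathbb{R}\bigr]\in\mathcal{A}_{\mathfrak{d},d}\tag{4.28}$$

    (cf. Definition 4.6). In addition, observe that the fact that for all $\theta\in\mathbb{R}^{\mathfrak{d}}$, $x\in\mathbb{R}^d$ we have that

    $$\mathcal{N}^{\theta}(x)=c^{\theta}+\sum_{i=1}^{H}v_i^{\theta}\max\Bigl\{\sum_{\ell=1}^{d}w_{i,\ell}^{\theta}x_\ell+b_i^{\theta},0\Bigr\}=c^{\theta}+\sum_{i=1}^{H}v_i^{\theta}\Bigl(\sum_{\ell=1}^{d}w_{i,\ell}^{\theta}x_\ell+b_i^{\theta}\Bigr)\mathbb{1}_{[0,\infty)}\Bigl(\sum_{\ell=1}^{d}w_{i,\ell}^{\theta}x_\ell+b_i^{\theta}\Bigr)\tag{4.29}$$

    demonstrates that

    $$\bigl[\mathbb{R}^{\mathfrak{d}}\times\mathbb{R}^d\ni(\theta,x)\mapsto\mathcal{N}^{\theta}(x)\in\mathbb{R}\bigr]\in\mathcal{A}_{\mathfrak{d},d}.\tag{4.30}$$

    Combining this with (4.28) and the fact that $\mathcal{A}_{\mathfrak{d},d}$ is an algebra proves that

    $$\bigl[\mathbb{R}^{\mathfrak{d}}\times\mathbb{R}^d\ni(\theta,x)\mapsto(\mathcal{N}^{\theta}(x)-F(x))^2P(x)\in\mathbb{R}\bigr]\in\mathcal{A}_{\mathfrak{d},d}.\tag{4.31}$$

    This, Proposition 4.8, and induction demonstrate that

    $$\Bigl[\mathbb{R}^{\mathfrak{d}}\ni\theta\mapsto\int_a^b\int_a^b\cdots\int_a^b(\mathcal{N}^{\theta}(x)-F(x))^2P(x)\,dx_d\cdots dx_2\,dx_1\in\mathbb{R}\Bigr]\in\mathcal{A}_{\mathfrak{d},0}.\tag{4.32}$$

    Fubini's theorem hence implies that $\mathcal{L}\in\mathcal{A}_{\mathfrak{d},0}$. Combining this and Lemma 4.7 shows that $\mathcal{L}$ is semialgebraic. The proof of Corollary 4.10 is thus complete.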

    In this section we employ the findings from Sections 2 and 4 to establish in Proposition 5.2 in Subsection 5.2 below, in Proposition 5.3 in Subsection 5.2, and in Theorem 5.4 in Subsection 5.3 below several convergence rate results for solutions of GF differential equations. Theorem 1.2 in the introduction is a direct consequence of Theorem 5.4. Our proof of Theorem 5.4 is based on an application of Proposition 5.3 and our proof of Proposition 5.3 uses Proposition 5.2. Our proof of Proposition 5.2, in turn, employs Proposition 5.1 in Subsection 5.1 below. In Proposition 5.1 we establish that under the assumption that the target function $f\colon[a,b]^d\to\mathbb{R}$ and the unnormalized density function $p\colon[a,b]^d\to[0,\infty)$ are piecewise polynomial (see Definition 4.9 in Subsection 4.3) we have that the risk function $\mathcal{L}\colon\mathbb{R}^{\mathfrak{d}}\to\mathbb{R}$ satisfies an appropriately generalized Kurdyka-Łojasiewicz inequality.

    In the proof of Proposition 5.1 the classical Łojasiewicz inequality for semialgebraic or subanalytic functions (cf., e.g., Bierstone & Milman [40]) is not directly applicable since the generalized gradient function $\mathcal{G}\colon\mathbb{R}^{\mathfrak{d}}\to\mathbb{R}^{\mathfrak{d}}$ is not continuous. We will employ the more general results from Bolte et al. [9] which also apply to not necessarily continuously differentiable functions.

    The arguments used in the proof of Proposition 5.2 are slight adaptations of well-known arguments in the literature; see, e.g., Kurdyka et al. [12, Section 1], Bolte et al. [9, Theorem 4.5], or Absil et al. [6, Theorem 2.2]. On the one hand, in Kurdyka et al. [12, Section 1] and Absil et al. [6, Theorem 2.2] it is assumed that the objective function of the considered optimization problem is analytic and in Bolte et al. [9, Theorem 4.5] it is assumed that the objective function of the considered optimization problem is convex or lower $C^2$, whereas Proposition 5.2 does not require these assumptions. On the other hand, Bolte et al. [9, Theorem 4.5] consider more general differential dynamics and the considered gradients are allowed to be more general than the specific generalized gradient function $\mathcal{G}\colon\mathbb{R}^{\mathfrak{d}}\to\mathbb{R}^{\mathfrak{d}}$ which is considered in Proposition 5.2.

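    Proposition 5.1 (Generalized Kurdyka-Łojasiewicz inequality). Assume Setting 2.1, assume that $p$ and $f$ are piecewise polynomial, and let $\vartheta\in\mathbb{R}^{\mathfrak{d}}$ (cf. Definition 4.9). Then there exist $\varepsilon,\mathfrak{D}\in(0,\infty)$, $\alpha\in(0,1)$ such that for all $\theta\in B_\varepsilon(\vartheta)$ it holds that

    $$|\mathcal{L}(\theta)-\mathcal{L}(\vartheta)|^{\alpha}\le\mathfrak{D}\|\mathcal{G}(\theta)\|.\tag{5.1}$$

    Proof of Proposition 5.1. Throughout this proof let $\mathbf{M}\colon\mathbb{R}^{\mathfrak{d}}\to[0,\infty]$ satisfy for all $\theta\in\mathbb{R}^{\mathfrak{d}}$ that

    $$\mathbf{M}(\theta)=\inf\bigl(\{\|h\|\colon h\in\partial\mathcal{L}(\theta)\}\cup\{\infty\}\bigr).\tag{5.2}$$

    Note that Proposition 2.12 implies for all $\theta\in\mathbb{R}^{\mathfrak{d}}$ that

    $$\mathbf{M}(\theta)\le\|\mathcal{G}(\theta)\|.\tag{5.3}$$

    Furthermore, observe that Corollary 4.10, the fact that semialgebraic functions are subanalytic, and Bolte et al. [9, Theorem 3.1 and Remark 3.2] ensure that there exist $\varepsilon,\mathfrak{D}\in(0,\infty)$, $\mathfrak{a}\in[0,1)$ which satisfy for all $\theta\in B_\varepsilon(\vartheta)$ that

    $$|\mathcal{L}(\theta)-\mathcal{L}(\vartheta)|^{\mathfrak{a}}\le\mathfrak{D}\mathbf{M}(\theta).\tag{5.4}$$

    Combining this and (5.3) with the fact that $\sup_{\theta\in B_\varepsilon(\vartheta)}|\mathcal{L}(\theta)-\mathcal{L}(\vartheta)|<\infty$ demonstrates that for all $\theta\in B_\varepsilon(\vartheta)$, $\alpha\in(\mathfrak{a},1)$ we have that

    $$|\mathcal{L}(\theta)-\mathcal{L}(\vartheta)|^{\alpha}\le|\mathcal{L}(\theta)-\mathcal{L}(\vartheta)|^{\mathfrak{a}}\Bigl(\sup\nolimits_{\psi\in B_\varepsilon(\vartheta)}|\mathcal{L}(\psi)-\mathcal{L}(\vartheta)|^{\alpha-\mathfrak{a}}\Bigr)\le\Bigl(\mathfrak{D}\sup\nolimits_{\psi\in B_\varepsilon(\vartheta)}|\mathcal{L}(\psi)-\mathcal{L}(\vartheta)|^{\alpha-\mathfrak{a}}\Bigr)\|\mathcal{G}(\theta)\|.\tag{5.5}$$

    This completes the proof of Proposition 5.1.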
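    To see what an inequality of the form (5.1) expresses, it may help to look at a one-dimensional toy function (this example is added for orientation only; it is not an ANN risk function and is not taken from the article). Consider $L\colon\mathbb{R}\to\mathbb{R}$ with $L(\theta)=\theta^2$ and the critical point $\vartheta=0$:

```latex
% Kurdyka-Lojasiewicz-type inequality with exponent alpha = 1/2 and constant D = 1/2:
|L(\theta) - L(0)|^{1/2} = |\theta| = \tfrac{1}{2}\,|L'(\theta)| , \qquad \theta \in \mathbb{R}.
% The associated gradient flow \Theta_t' = -L'(\Theta_t) = -2\Theta_t has the explicit solution
\Theta_t = \Theta_0\, e^{-2t}, \qquad L(\Theta_t) = (\Theta_0)^2 e^{-4t}, \qquad t \in [0,\infty),
% so both the trajectory and the risk converge to the critical point and its risk value.
```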

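    Proposition 5.2. Assume Setting 2.1 and let $\vartheta\in\mathbb{R}^{\mathfrak{d}}$, $\varepsilon,\mathfrak{D}\in(0,\infty)$, $\alpha\in(0,1)$ satisfy for all $\theta\in B_\varepsilon(\vartheta)$ that

    $$|\mathcal{L}(\theta)-\mathcal{L}(\vartheta)|^{\alpha}\le\mathfrak{D}\|\mathcal{G}(\theta)\|.\tag{5.6}$$

    Then there exists $\delta\in(0,\varepsilon)$ such that for all $\Theta\in C([0,\infty),\mathbb{R}^{\mathfrak{d}})$ with $\Theta_0\in B_\delta(\vartheta)$, $\forall\,t\in[0,\infty)\colon\Theta_t=\Theta_0-\int_0^t\mathcal{G}(\Theta_s)\,ds$, and $\inf_{t\in\{s\in[0,\infty)\colon\Theta_s\in B_\varepsilon(\vartheta)\}}\mathcal{L}(\Theta_t)\ge\mathcal{L}(\vartheta)$ there exists $\psi\in\mathcal{L}^{-1}(\{\mathcal{L}(\vartheta)\})$ such that for all $t\in[0,\infty)$ it holds that

    $$\Theta_t\in B_\varepsilon(\vartheta),\qquad\int_0^\infty\|\mathcal{G}(\Theta_s)\|\,ds\le\varepsilon,\qquad|\mathcal{L}(\Theta_t)-\mathcal{L}(\psi)|\le(1+\mathfrak{D}^{-2}t)^{-1},\tag{5.7}$$
    $$\text{and}\qquad\|\Theta_t-\psi\|\le\Bigl[1+\bigl(\mathfrak{D}^{-1/\alpha}(1-\alpha)\bigr)^{\frac{\alpha}{1-\alpha}}t\Bigr]^{-\min\{1,\frac{1-\alpha}{\alpha}\}}.\tag{5.8}$$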

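    Proof of Proposition 5.2. Note that the fact that $\mathcal{L}$ is continuous implies that there exists $\delta \in (0, \varepsilon/3)$ which satisfies for all $\theta \in B_\delta(\vartheta)$ that

    $$|\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{1-\alpha} \le \min\Bigl\{\frac{\varepsilon(1-\alpha)}{3D}, \frac{1-\alpha}{D}, 1\Bigr\}. \tag{5.9}$$

    In the following let $\Theta \in C([0,\infty), \mathbb{R}^{\mathfrak{d}})$ satisfy $\forall\, t \in [0,\infty) \colon \Theta_t = \Theta_0 - \int_0^t \mathcal{G}(\Theta_s) \, ds$, $\Theta_0 \in B_\delta(\vartheta)$, and

    $$\inf_{t \in \{s \in [0,\infty) \colon \Theta_s \in B_\varepsilon(\vartheta)\}} \mathcal{L}(\Theta_t) \ge \mathcal{L}(\vartheta). \tag{5.10}$$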

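    In the first step we show that for all $t \in [0,\infty)$ it holds that

    $$\Theta_t \in B_\varepsilon(\vartheta). \tag{5.11}$$

    Observe that, e.g., [37, Lemma 3.1] ensures for all $t \in [0,\infty)$ that

    $$\mathcal{L}(\Theta_t) = \mathcal{L}(\Theta_0) - \int_0^t \|\mathcal{G}(\Theta_s)\|^2 \, ds. \tag{5.12}$$

    This implies that $[0,\infty) \ni t \mapsto \mathcal{L}(\Theta_t) \in [0,\infty)$ is non-increasing. Next let $\mathbb{L} \colon [0,\infty) \to \mathbb{R}$ satisfy for all $t \in [0,\infty)$ that

    $$\mathbb{L}(t) = \mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta) \tag{5.13}$$

    and let $T \in [0,\infty]$ satisfy

    $$T = \inf\bigl(\{t \in [0,\infty) \colon \|\Theta_t - \vartheta\| \ge \varepsilon\} \cup \{\infty\}\bigr). \tag{5.14}$$

    We intend to show that $T = \infty$. Note that (5.10) assures for all $t \in [0,T)$ that $\mathbb{L}(t) \ge 0$. Moreover, observe that (5.12) and (5.13) ensure that for almost all $t \in [0,T)$ it holds that $\mathbb{L}$ is differentiable at $t$ and satisfies $\mathbb{L}'(t) = \frac{d}{dt}(\mathcal{L}(\Theta_t)) = -\|\mathcal{G}(\Theta_t)\|^2$. In the following let $\tau \in [0,T]$ satisfy

    $$\tau = \inf\bigl(\{t \in [0,T) \colon \mathbb{L}(t) = 0\} \cup \{T\}\bigr). \tag{5.15}$$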

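    Note that the fact that $\mathbb{L}$ is non-increasing implies that for all $s \in [\tau,T)$ it holds that $\mathbb{L}(s) = 0$. Combining this with (5.12) demonstrates for almost all $s \in (\tau,T)$ that $\mathcal{G}(\Theta_s) = 0$. This proves for all $s \in [\tau,T)$ that $\Theta_s = \Theta_\tau$. Next observe that (5.6) ensures that for all $t \in [0,\tau)$ it holds that

    $$0 < [\mathbb{L}(t)]^{\alpha} = |\mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta)|^{\alpha} \le D \|\mathcal{G}(\Theta_t)\|. \tag{5.16}$$

    Combining this with the chain rule proves for almost all $t \in [0,\tau)$ that

    $$\frac{d}{dt}\bigl([\mathbb{L}(t)]^{1-\alpha}\bigr) = (1-\alpha)[\mathbb{L}(t)]^{-\alpha}\bigl(-\|\mathcal{G}(\Theta_t)\|^2\bigr) \le -(1-\alpha) D^{-1} \|\mathcal{G}(\Theta_t)\|^{-1} \|\mathcal{G}(\Theta_t)\|^2 = -D^{-1}(1-\alpha)\|\mathcal{G}(\Theta_t)\|. \tag{5.17}$$

    In addition, note that the fact that $[0,\infty) \ni t \mapsto \mathbb{L}(t) \in \mathbb{R}$ is absolutely continuous and the fact that for all $r \in (0,\infty)$ it holds that $(r,\infty) \ni y \mapsto y^{1-\alpha} \in \mathbb{R}$ is Lipschitz continuous demonstrate for all $t \in [0,\tau)$ that $[0,t] \ni s \mapsto [\mathbb{L}(s)]^{1-\alpha} \in \mathbb{R}$ is absolutely continuous. Integrating (5.17) hence shows for all $s, t \in [0,\tau)$ with $t \le s$ that

    $$\int_t^s \|\mathcal{G}(\Theta_u)\| \, du \le -D(1-\alpha)^{-1}\bigl([\mathbb{L}(s)]^{1-\alpha} - [\mathbb{L}(t)]^{1-\alpha}\bigr) \le D(1-\alpha)^{-1}[\mathbb{L}(t)]^{1-\alpha}. \tag{5.18}$$

    This and the fact that for almost all $s \in (\tau,T)$ it holds that $\mathcal{G}(\Theta_s) = 0$ ensure that for all $s, t \in [0,T)$ with $t \le s$ we have that

    $$\int_t^s \|\mathcal{G}(\Theta_u)\| \, du \le D(1-\alpha)^{-1}[\mathbb{L}(t)]^{1-\alpha}. \tag{5.19}$$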

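    Combining this with (5.9) demonstrates for all $t \in [0,T)$ that

    $$\|\Theta_t - \Theta_0\| = \Bigl\|\int_0^t \mathcal{G}(\Theta_s) \, ds\Bigr\| \le \int_0^t \|\mathcal{G}(\Theta_s)\| \, ds \le \frac{D |\mathcal{L}(\Theta_0) - \mathcal{L}(\vartheta)|^{1-\alpha}}{1-\alpha} \le \min\Bigl\{\frac{\varepsilon}{3}, 1\Bigr\}. \tag{5.20}$$

    This, the fact that $\delta < \varepsilon/3$, and the triangle inequality assure for all $t \in [0,T)$ that

    $$\|\Theta_t - \vartheta\| \le \|\Theta_t - \Theta_0\| + \|\Theta_0 - \vartheta\| \le \frac{\varepsilon}{3} + \delta \le \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \frac{2\varepsilon}{3}. \tag{5.21}$$

    Combining this with (5.14) proves that $T = \infty$. This establishes (5.11).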
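
    The length estimates (5.19) and (5.20) can be checked by hand in a one-dimensional toy case. The following computation is a sketch under the assumption that the risk is replaced by $g(x) = x^4$ with critical point $0$, so that an inequality of the form (5.6) holds with exponent $3/4$ and constant $1/4$; this $g$ is purely illustrative and is not covered by Setting 2.1.

```latex
% Assumptions: g(x) = x^4, gradient flow x'(t) = -4 x(t)^3 with x(0) = x_0 > 0.
% Inequality of type (5.6) at 0: |g(x)|^{3/4} = |x|^3 = \tfrac14 |g'(x)|,
% i.e. exponent \alpha = 3/4 and constant D = 1/4.
% The solution is positive and decreasing to 0, hence for 0 \le t \le s:
\[
  \int_t^s |g'(x(u))| \, du = x(t) - x(s) \le x(t) .
\]
% Right-hand side of (5.19), with the risk gap g(x(t)) - g(0) = x(t)^4:
\[
  D (1-\alpha)^{-1} \bigl[g(x(t)) - g(0)\bigr]^{1-\alpha}
  = \tfrac14 \cdot 4 \cdot \bigl(x(t)^4\bigr)^{1/4} = x(t) .
\]
```

    In this toy case the bound is attained in the limit $s \to \infty$, which suggests that the constant $D(1-\alpha)^{-1}$ in the length estimate cannot be improved in general.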

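    Next observe that the fact that $T = \infty$ and (5.20) prove that

    $$\int_0^\infty \|\mathcal{G}(\Theta_s)\| \, ds \le \min\Bigl\{\frac{\varepsilon}{3}, 1\Bigr\} \le \varepsilon < \infty. \tag{5.22}$$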

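    In the following let $\sigma \colon [0,\infty) \to [0,\infty)$ satisfy for all $t \in [0,\infty)$ that

    $$\sigma(t) = \int_t^\infty \|\mathcal{G}(\Theta_s)\| \, ds. \tag{5.23}$$

    Note that (5.22) proves that $\limsup_{t \to \infty} \sigma(t) = 0$. In addition, observe that (5.22) assures that there exists $\psi \in \mathbb{R}^{\mathfrak{d}}$ such that

    $$\limsup_{t \to \infty} \|\Theta_t - \psi\| = 0. \tag{5.24}$$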

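    In the next step we combine the weak chain rule for the risk function in (5.12) with (5.11) and (5.6) to obtain that for almost all $t \in [0,\infty)$ we have that

    $$\mathbb{L}'(t) = -\|\mathcal{G}(\Theta_t)\|^2 \le -D^{-2}[\mathbb{L}(t)]^{2\alpha}. \tag{5.25}$$

    In addition, note that the fact that $\mathbb{L}$ is non-increasing and (5.9) ensure that for all $t \in [0,\infty)$ it holds that $\mathbb{L}(t) \le \mathbb{L}(0) \le 1$. Therefore, we get for almost all $t \in [0,\infty)$ that

    $$\mathbb{L}'(t) \le -D^{-2}[\mathbb{L}(t)]^{2}. \tag{5.26}$$

    Combining this with the fact that for all $t \in [0,\tau)$ it holds that $\mathbb{L}(t) > 0$ establishes for almost all $t \in [0,\tau)$ that

    $$\frac{d}{dt}\Bigl(\frac{D^2}{\mathbb{L}(t)}\Bigr) = -\frac{D^2 \mathbb{L}'(t)}{[\mathbb{L}(t)]^2} \ge 1. \tag{5.27}$$

    The fact that for all $t \in [0,\tau)$ it holds that $[0,t] \ni s \mapsto \mathbb{L}(s) \in (0,\infty)$ is absolutely continuous hence demonstrates for all $t \in [0,\tau)$ that

    $$\frac{D^2}{\mathbb{L}(t)} \ge \frac{D^2}{\mathbb{L}(0)} + t \ge D^2 + t. \tag{5.28}$$

    Therefore, we infer for all $t \in [0,\tau)$ that

    $$\mathbb{L}(t) \le D^2 (D^2 + t)^{-1} = (1 + D^{-2} t)^{-1}. \tag{5.29}$$

    This and the fact that for all $t \in [\tau,\infty)$ it holds that $\mathbb{L}(t) = 0$ prove that for all $t \in [0,\infty)$ we have that

    $$|\mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta)| = \mathbb{L}(t) \le (1 + D^{-2} t)^{-1}. \tag{5.30}$$

    Furthermore, observe that (5.24) and the fact that $\mathcal{L}$ is continuous imply that $\limsup_{t \to \infty} |\mathcal{L}(\Theta_t) - \mathcal{L}(\psi)| = 0$. Hence, we obtain that $\mathcal{L}(\psi) = \mathcal{L}(\vartheta)$. This shows for all $t \in [0,\infty)$ that

    $$|\mathcal{L}(\Theta_t) - \mathcal{L}(\psi)| \le (1 + D^{-2} t)^{-1}. \tag{5.31}$$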

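    In the next step we establish a convergence rate for the quantity $\|\Theta_t - \psi\|$, $t \in [0,\infty)$. We accomplish this by employing an upper bound for the tail length of the curve $\Theta_t \in \mathbb{R}^{\mathfrak{d}}$, $t \in [0,\infty)$. More formally, note that (5.19), (5.11), and (5.6) demonstrate for all $t \in [0,\infty)$ that

    $$\sigma(t) = \int_t^\infty \|\mathcal{G}(\Theta_u)\| \, du = \lim_{s \to \infty}\Bigl[\int_t^s \|\mathcal{G}(\Theta_u)\| \, du\Bigr] \le D(1-\alpha)^{-1}(\mathbb{L}(t))^{1-\alpha} \le D(1-\alpha)^{-1}\bigl(D\|\mathcal{G}(\Theta_t)\|\bigr)^{\frac{1-\alpha}{\alpha}}. \tag{5.32}$$

    Next observe that the fact that for all $t \in [0,\infty)$ it holds that $\sigma(t) = \int_0^\infty \|\mathcal{G}(\Theta_s)\| \, ds - \int_0^t \|\mathcal{G}(\Theta_s)\| \, ds$ shows that for almost all $t \in [0,\infty)$ we have that $\sigma'(t) = -\|\mathcal{G}(\Theta_t)\|$. This and (5.32) yield for almost all $t \in [0,\infty)$ that $\sigma(t) \le D^{1/\alpha}(1-\alpha)^{-1}[-\sigma'(t)]^{\frac{1-\alpha}{\alpha}}$. Therefore, we obtain for almost all $t \in [0,\infty)$ that

    $$\sigma'(t) \le -\bigl[(1-\alpha) D^{-1/\alpha} \sigma(t)\bigr]^{\frac{\alpha}{1-\alpha}}. \tag{5.33}$$

    Combining this with the fact that $\sigma$ is absolutely continuous implies for all $t \in [0,\infty)$ that

    $$\sigma(t) - \sigma(0) \le -\bigl[(1-\alpha) D^{-1/\alpha}\bigr]^{\frac{\alpha}{1-\alpha}} \int_0^t [\sigma(s)]^{\frac{\alpha}{1-\alpha}} \, ds. \tag{5.34}$$

    In the following let $\beta, C \in (0,\infty)$ satisfy $\beta = \max\{1, \frac{\alpha}{1-\alpha}\}$ and $C = ((1-\alpha) D^{-1/\alpha})^{\frac{\alpha}{1-\alpha}}$. Note that (5.34) and the fact that for all $t \in [0,\infty)$ it holds that $\sigma(t) \le \sigma(0) \le 1$ ensure that for all $t \in [0,\infty)$ it holds that

    $$\sigma(t) \le \sigma(0) - C \int_0^t [\sigma(s)]^{\beta} \, ds. \tag{5.35}$$

    This, the fact that $\sigma$ is non-increasing, and the fact that for all $t \in [0,\infty)$ it holds that $0 \le \sigma(t) \le 1$ prove that for all $t \in [0,\infty)$ we have that

    $$(\sigma(t))^{\beta} \le \sigma(t) \le \sigma(0) - C[\sigma(t)]^{\beta} t \le 1 - C t [\sigma(t)]^{\beta}. \tag{5.36}$$

    Hence, we obtain for all $t \in [0,\infty)$ that $\sigma(t) \le (1 + Ct)^{-\frac{1}{\beta}}$. Combining this with the fact that for all $t \in [0,\infty)$ it holds that

    $$\|\Theta_t - \psi\| \le \limsup_{s \to \infty} \|\Theta_t - \Theta_s\| = \limsup_{s \to \infty} \Bigl\|\int_t^s \mathcal{G}(\Theta_u) \, du\Bigr\| \le \limsup_{s \to \infty}\Bigl[\int_t^s \|\mathcal{G}(\Theta_u)\| \, du\Bigr] = \int_t^\infty \|\mathcal{G}(\Theta_u)\| \, du = \sigma(t) \tag{5.37}$$

    shows that for all $t \in [0,\infty)$ we have that $\|\Theta_t - \psi\| \le (1 + Ct)^{-1/\beta}$. This, (5.11), (5.22), and (5.31) establish (5.8). The proof of Proposition 5.2 is thus complete.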
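
    The two rates in Proposition 5.2 can be sanity-checked numerically on a toy problem. The sketch below is not taken from the article: it assumes the one-dimensional function $g(x) = x^4$ in place of the risk (so that an inequality of the form (5.6) holds at the critical point $0$ with exponent $3/4$ and constant $1/4$), uses the closed-form solution of the associated gradient flow, and compares it with the right-hand sides of (5.7) and (5.8). All function names are hypothetical.

```python
import math

# Assumed toy risk (not from the article): g(x) = x**4 with critical point 0.
# It satisfies |g(x)|**ALPHA = D * |g'(x)| with the constants below.
ALPHA = 0.75
D = 0.25


def risk(x: float) -> float:
    """Toy risk g(x) = x^4; its value at the limit point 0 is 0."""
    return x ** 4


def flow(x0: float, t: float) -> float:
    """Closed-form solution of x'(t) = -4 x(t)^3 with x(0) = x0."""
    return x0 / math.sqrt(1.0 + 8.0 * x0 ** 2 * t)


def risk_bound(t: float) -> float:
    """Right-hand side of the risk estimate in (5.7)."""
    return 1.0 / (1.0 + t / D ** 2)


def distance_bound(t: float) -> float:
    """Right-hand side of the distance estimate (5.8)."""
    rate = ((1.0 - ALPHA) * D ** (-1.0 / ALPHA)) ** (ALPHA / (1.0 - ALPHA))
    exponent = min(1.0, (1.0 - ALPHA) / ALPHA)
    return (1.0 + rate * t) ** (-exponent)


def bounds_hold(x0: float, times) -> bool:
    """Check both estimates along the trajectory started at x0."""
    return all(
        risk(flow(x0, t)) <= risk_bound(t)
        and abs(flow(x0, t)) <= distance_bound(t)
        for t in times
    )


if __name__ == "__main__":
    grid = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
    print(bounds_hold(0.5, grid))
```

    For this toy function the trajectory decays like $t^{-1/2}$ while (5.8) only guarantees $t^{-1/3}$, so the bounds hold with room to spare; the initial value $0.5$ is an arbitrary choice that is compatible with the smallness condition (5.9) for these constants.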

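    Proposition 5.3. Assume Setting 2.1, assume that $p$ and $f$ are piecewise polynomial, and let $\Theta \in C([0,\infty), \mathbb{R}^{\mathfrak{d}})$ satisfy

    $$\liminf_{t \to \infty} \|\Theta_t\| < \infty \qquad \text{and} \qquad \forall\, t \in [0,\infty) \colon \Theta_t = \Theta_0 - \int_0^t \mathcal{G}(\Theta_s) \, ds \tag{5.38}$$

    (cf. Definition 4.9). Then there exist $\vartheta \in \mathcal{G}^{-1}(\{0\})$, $C, \tau, \beta \in (0,\infty)$ which satisfy for all $t \in [\tau,\infty)$ that

    $$\|\Theta_t - \vartheta\| \le (1 + C(t - \tau))^{-\beta} \qquad \text{and} \qquad |\mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta)| \le (1 + C(t - \tau))^{-1}. \tag{5.39}$$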

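    Proof of Proposition 5.3. First observe that [37, Lemma 3.1] ensures that for all $t \in [0,\infty)$ it holds that

    $$\mathcal{L}(\Theta_t) = \mathcal{L}(\Theta_0) - \int_0^t \|\mathcal{G}(\Theta_s)\|^2 \, ds. \tag{5.40}$$

    This implies that $[0,\infty) \ni t \mapsto \mathcal{L}(\Theta_t) \in [0,\infty)$ is non-increasing. Hence, we obtain that there exists $m \in [0,\infty)$ which satisfies that

    $$m = \limsup_{t \to \infty} \mathcal{L}(\Theta_t) = \liminf_{t \to \infty} \mathcal{L}(\Theta_t) = \inf_{t \in [0,\infty)} \mathcal{L}(\Theta_t). \tag{5.41}$$

    Moreover, note that the assumption that $\liminf_{t \to \infty} \|\Theta_t\| < \infty$ ensures that there exist $\vartheta \in \mathbb{R}^{\mathfrak{d}}$ and $\tau = (\tau_n)_{n \in \mathbb{N}} \colon \mathbb{N} \to [0,\infty)$ which satisfy $\liminf_{n \to \infty} \tau_n = \infty$ and

    $$\limsup_{n \to \infty} \|\Theta_{\tau_n} - \vartheta\| = 0. \tag{5.42}$$

    Combining this with (5.41) and the fact that $\mathcal{L}$ is continuous shows that

    $$\mathcal{L}(\vartheta) = m \qquad \text{and} \qquad \forall\, t \in [0,\infty) \colon \mathcal{L}(\Theta_t) \ge \mathcal{L}(\vartheta). \tag{5.43}$$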

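    Next observe that Proposition 5.1 demonstrates that there exist $\varepsilon, D \in (0,\infty)$, $\alpha \in (0,1)$ such that for all $\theta \in B_\varepsilon(\vartheta)$ we have that

    $$|\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \le D \|\mathcal{G}(\theta)\|. \tag{5.44}$$

    Combining this and (5.42) with Proposition 5.2 demonstrates that there exists $\delta \in (0,\varepsilon)$ which satisfies for all $\Phi \in C([0,\infty), \mathbb{R}^{\mathfrak{d}})$ with $\Phi_0 \in B_\delta(\vartheta)$, $\forall\, t \in [0,\infty) \colon \Phi_t = \Phi_0 - \int_0^t \mathcal{G}(\Phi_s) \, ds$, and $\inf_{t \in \{s \in [0,\infty) \colon \Phi_s \in B_\varepsilon(\vartheta)\}} \mathcal{L}(\Phi_t) \ge \mathcal{L}(\vartheta)$ that it holds for all $t \in [0,\infty)$ that

    $$\Phi_t \in B_\varepsilon(\vartheta), \qquad |\mathcal{L}(\Phi_t) - \mathcal{L}(\vartheta)| \le (1 + D^{-2} t)^{-1}, \tag{5.45}$$
    $$\text{and} \qquad \|\Phi_t - \vartheta\| \le \bigl[1 + (D^{-1/\alpha}(1-\alpha))^{\frac{\alpha}{1-\alpha}} t\bigr]^{-\min\{1, \frac{1-\alpha}{\alpha}\}}. \tag{5.46}$$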

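    Moreover, note that (5.42) ensures that there exists $n \in \mathbb{N}$ which satisfies $\Theta_{\tau_n} \in B_\delta(\vartheta)$. Next let $\Phi \in C([0,\infty), \mathbb{R}^{\mathfrak{d}})$ satisfy for all $t \in [0,\infty)$ that

    $$\Phi_t = \Theta_{t + \tau_n}. \tag{5.47}$$

    Observe that (5.43) and (5.47) assure that

    $$\Phi_0 \in B_\delta(\vartheta), \qquad \inf_{t \in [0,\infty)} \mathcal{L}(\Phi_t) \ge \mathcal{L}(\vartheta), \qquad \text{and} \qquad \forall\, t \in [0,\infty) \colon \Phi_t = \Phi_0 - \int_0^t \mathcal{G}(\Phi_s) \, ds. \tag{5.48}$$

    Combining this with (5.46) proves for all $t \in [\tau_n,\infty)$ that

    $$|\mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta)| \le (1 + D^{-2}(t - \tau_n))^{-1} \tag{5.49}$$

    and

    $$\|\Theta_t - \vartheta\| \le \bigl[1 + (D^{-1/\alpha}(1-\alpha))^{\frac{\alpha}{1-\alpha}} (t - \tau_n)\bigr]^{-\min\{1, \frac{1-\alpha}{\alpha}\}}. \tag{5.50}$$

    Next note that [37, Corollary 2.15] shows that $\mathbb{R}^{\mathfrak{d}} \ni \theta \mapsto \|\mathcal{G}(\theta)\| \in [0,\infty)$ is lower semicontinuous. The fact that $\liminf_{s \to \infty} \|\mathcal{G}(\Theta_s)\| = 0$ and the fact that $\limsup_{t \to \infty} \|\Theta_t - \vartheta\| = 0$ hence imply that $\mathcal{G}(\vartheta) = 0$. Combining this with (5.49) and (5.50) establishes (5.39). The proof of Proposition 5.3 is thus complete.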

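    By choosing a sufficiently large $\mathcal{C} \in (0,\infty)$ we can conclude a simplified version of Proposition 5.3. This is precisely the subject of the next result, Theorem 5.4 below. Theorem 1.2 in the introduction is a direct consequence of Theorem 5.4.

    Theorem 5.4. Assume Setting 2.1, assume that $p$ and $f$ are piecewise polynomial, and let $\Theta \in C([0,\infty), \mathbb{R}^{\mathfrak{d}})$ satisfy $\liminf_{t \to \infty} \|\Theta_t\| < \infty$ and $\forall\, t \in [0,\infty) \colon \Theta_t = \Theta_0 - \int_0^t \mathcal{G}(\Theta_s) \, ds$ (cf. Definition 4.9). Then there exist $\vartheta \in \mathcal{G}^{-1}(\{0\})$, $\mathcal{C}, \beta \in (0,\infty)$ which satisfy for all $t \in [0,\infty)$ that

    $$\|\Theta_t - \vartheta\| \le \mathcal{C}(1 + t)^{-\beta} \qquad \text{and} \qquad |\mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta)| \le \mathcal{C}(1 + t)^{-1}. \tag{5.51}$$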

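    Proof of Theorem 5.4. Observe that Proposition 5.3 assures that there exist $\vartheta \in \mathcal{G}^{-1}(\{0\})$, $C, \tau, \beta \in (0,\infty)$ which satisfy for all $t \in [\tau,\infty)$ that

    $$\|\Theta_t - \vartheta\| \le (1 + C(t - \tau))^{-\beta} \tag{5.52}$$

    and

    $$|\mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta)| \le (1 + C(t - \tau))^{-1}. \tag{5.53}$$

    In the following let $\mathcal{C} \in (0,\infty)$ satisfy

    $$\mathcal{C} = \max\Bigl\{C^{-1},\, 1 + \tau,\, C^{-\beta},\, (1 + \tau)^{\beta},\, (1 + \tau)^{\beta}\bigl[\sup\nolimits_{s \in [0,\tau]} \|\Theta_s - \vartheta\|\bigr],\, (1 + \tau)\mathcal{L}(\Theta_0)\Bigr\}. \tag{5.54}$$

    Note that (5.53), (5.54), and the fact that $[0,\infty) \ni t \mapsto \mathcal{L}(\Theta_t) \in [0,\infty)$ is non-increasing show for all $t \in [0,\tau]$ that

    $$\|\Theta_t - \vartheta\| \le \sup\nolimits_{s \in [0,\tau]} \|\Theta_s - \vartheta\| \le \mathcal{C}(1 + \tau)^{-\beta} \le \mathcal{C}(1 + t)^{-\beta} \tag{5.55}$$

    and

    $$|\mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta)| = \mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta) \le \mathcal{L}(\Theta_t) \le \mathcal{L}(\Theta_0) \le \mathcal{C}(1 + \tau)^{-1} \le \mathcal{C}(1 + t)^{-1}. \tag{5.56}$$

    Moreover, observe that (5.52) and (5.54) imply for all $t \in [\tau,\infty)$ that

    $$\|\Theta_t - \vartheta\| \le \mathcal{C}\bigl(\mathcal{C}^{1/\beta} + C\,\mathcal{C}^{1/\beta}(t - \tau)\bigr)^{-\beta} \le \mathcal{C}\bigl(\mathcal{C}^{1/\beta} - \tau + t\bigr)^{-\beta} \le \mathcal{C}(1 + t)^{-\beta}. \tag{5.57}$$

    In addition, note that (5.53) and (5.54) demonstrate for all $t \in [\tau,\infty)$ that

    $$|\mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta)| \le \mathcal{C}\bigl(\mathcal{C} + \mathcal{C}\,C(t - \tau)\bigr)^{-1} \le \mathcal{C}(\mathcal{C} - \tau + t)^{-1} \le \mathcal{C}(1 + t)^{-1}. \tag{5.58}$$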

    This completes the proof of Theorem 5.4.
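
    The role of the six terms in the maximum in (5.54) can be made concrete with a small numerical sketch. The code below is not part of the article: it evaluates the maximum in (5.54) for given constants of Proposition 5.3 and checks the late-time chains (5.57) and (5.58) on a grid of times. The function names and the sample values in the usage part are hypothetical.

```python
def theorem_constant(c, tau, beta, sup_dist, initial_risk):
    """Maximum from (5.54), built from the constants of Proposition 5.3.

    c, tau, beta: constants from (5.52) and (5.53);
    sup_dist: supremum of the distance to the limit point on [0, tau];
    initial_risk: risk at time 0.
    """
    return max(
        1.0 / c,
        1.0 + tau,
        c ** (-beta),
        (1.0 + tau) ** beta,
        (1.0 + tau) ** beta * sup_dist,
        (1.0 + tau) * initial_risk,
    )


def late_time_bounds_hold(c, tau, beta, big_c, times):
    """Check that the shifted rates are dominated by big_c * (1 + t)^(-rate)."""
    slack = 1.0 + 1e-12  # tolerance for floating-point rounding
    for t in times:
        if t < tau:
            raise ValueError("times must satisfy t >= tau")
        shifted = 1.0 + c * (t - tau)
        if shifted ** (-beta) > big_c * (1.0 + t) ** (-beta) * slack:
            return False
        if shifted ** (-1.0) > big_c * (1.0 + t) ** (-1.0) * slack:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical sample values, chosen only for illustration.
    c, tau, beta = 0.5, 2.0, 1.0 / 3.0
    big_c = theorem_constant(c, tau, beta, sup_dist=1.5, initial_risk=4.0)
    print(big_c)
    print(late_time_bounds_hold(c, tau, beta, big_c, [2.0, 3.0, 10.0, 1e3, 1e6]))
```

    The first four terms of the maximum are what make the chains (5.57) and (5.58) work for $t \ge \tau$, while the last two terms cover the initial interval $[0,\tau]$ as in (5.55) and (5.56).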

    This project has been partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy EXC 2044-390685587, Mathematics Münster: Dynamics-Geometry-Structure.

    The authors declare that there are no conflicts of interest.



    [1] F. Bach, E. Moulines, Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n), in Advances in Neural Information Processing Systems, 26 (2013), 773–781. Available from: https://proceedings.neurips.cc/paper/2013/file/7fe1f8abaad094e0b5cb1b01d712f708-Paper.pdf.
    [2] A. Jentzen, B. Kuckuck, A. Neufeld, P. von Wurstemberger, Strong error analysis for stochastic gradient descent optimization algorithms, IMA J. Numer. Anal., 41 (2021), 455–492. https://doi.org/10.1093/imanum/drz055
    [3] E. Moulines, F. Bach, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, in Advances in Neural Information Processing Systems, 24 (2011), 451–459. Available from: https://proceedings.neurips.cc/paper/2011/file/40008b9a5380fcacce3976bf7c08af5b-Paper.pdf.
    [4] Y. Nesterov, Introductory Lectures on Convex Optimization, 2004. https://doi.org/10.1007/978-1-4419-8853-9
    [5] A. Rakhlin, O. Shamir, K. Sridharan, Making gradient descent optimal for strongly convex stochastic optimization, in Proceedings of the 29th International Conference on Machine Learning, Madison, WI, USA, (2012), 1571–1578.
    [6] P. A. Absil, R. Mahony, B. Andrews, Convergence of the iterates of descent methods for analytic cost functions, SIAM J. Optim., 16 (2005), 531–547. https://doi.org/10.1137/040605266
    [7] H. Attouch, J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Math. Program., 116 (2009), 5–16. https://doi.org/10.1007/s10107-007-0133-5
    [8] H. Attouch, J. Bolte, B. F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods, Math. Program., 137 (2013), 91–129. https://doi.org/10.1007/s10107-011-0484-9
    [9] J. Bolte, A. Daniilidis, A. Lewis, The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM J. Optim., 17 (2007), 1205–1223. https://doi.org/10.1137/050644641
    [10] S. Dereich, S. Kassing, Convergence of stochastic gradient descent schemes for Lojasiewicz-landscapes, preprint, arXiv: 2102.09385.
    [11] H. Karimi, J. Nutini, M. Schmidt, Linear convergence of gradient and proximal-gradient methods under the Polyak-Lojasiewicz condition, in Machine Learning and Knowledge Discovery in Databases, (2016), 795–811. https://doi.org/10.1007/978-3-319-46128-1_50
    [12] K. Kurdyka, T. Mostowski, A. Parusiński, Proof of the gradient conjecture of R. Thom, Ann. Math., 152 (2000), 763–792. https://doi.org/10.2307/2661354
    [13] J. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M. Jordan, B. Recht, First-order methods almost always avoid strict saddle points, Math. Program., 176 (2019), 311–337. https://doi.org/10.1007/s10107-019-01374-3
    [14] J. D. Lee, M. Simchowitz, M. I. Jordan, B. Recht, Gradient descent only converges to minimizers, in 29th Annual Conference on Learning Theory, 49 (2016), 1246–1257. Available from: http://proceedings.mlr.press/v49/lee16.html.
    [15] S. Łojasiewicz, Sur les trajectoires du gradient d'une fonction analytique, in Geometry Seminars, (1983), 115–117.
    [16] P. Ochs, Unifying abstract inexact convergence theorems and block coordinate variable metric iPiano, SIAM J. Optim., 29 (2019), 541–570. https://doi.org/10.1137/17M1124085
    [17] D. P. Bertsekas, J. N. Tsitsiklis, Gradient convergence in gradient methods with errors, SIAM J. Optim., 10 (2000), 627–642. https://doi.org/10.1137/S105262349733106
    [18] B. Fehrman, B. Gess, A. Jentzen, Convergence rates for the stochastic gradient descent method for non-convex objective functions, J. Mach. Learn. Res., 21 (2020), 5354–5401. Available from: https://dl.acm.org/doi/abs/10.5555/3455716.3455852.
    [19] Y. Lei, T. Hu, G. Li, K. Tang, Stochastic gradient descent for nonconvex learning without bounded gradient assumptions, IEEE Trans. Neural Networks Learn. Syst., 31 (2019), 4394–4400. https://doi.org/10.1109/TNNLS.2019.2952219
    [20] V. Patel, Stopping criteria for, and strong convergence of, stochastic gradient descent on Bottou-Curtis-Nocedal functions, Math. Program., 195 (2022), 693–734. https://doi.org/10.1007/s10107-021-01710-6
    [21] F. Santambrogio, Euclidean, metric, and Wasserstein gradient flows: an overview, Bull. Math. Sci., 7 (2017), 87–154. https://doi.org/10.1007/s13373-017-0101-1
    [22] S. Arora, S. Du, W. Hu, Z. Li, R. Wang, Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks, in Proceedings of the 36th International Conference on Machine Learning, 97 (2019), 322–332. Available from: http://proceedings.mlr.press/v97/arora19a.html.
    [23] L. Chizat, E. Oyallon, F. Bach, On lazy training in differentiable programming, in Advances in Neural Information Processing Systems, 32 (2019). Available from: https://proceedings.neurips.cc/paper/2019/file/ae614c557843b1df326cb29c57225459-Paper.pdf.
    [24] S. S. Du, X. Zhai, B. Póczos, A. Singh, Gradient descent provably optimizes over-parameterized neural networks, in International Conference on Learning Representations, 2019. Available from: https://openreview.net/forum?id=S1eK3i09YQ.
    [25] W. E, C. Ma, L. Wu, A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics, Sci. China Math., 63 (2020), 1235–1258. https://doi.org/10.1007/s11425-019-1628-5
    [26] A. Jacot, F. Gabriel, C. Hongler, Neural tangent kernel: convergence and generalization in neural networks, in Advances in Neural Information Processing Systems, 31 (2018). Available from: https://proceedings.neurips.cc/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.
    [27] A. Jentzen, T. Kröger, Convergence rates for gradient descent in the training of overparameterized artificial neural networks with biases, preprint, arXiv: 2102.11840.
    [28] G. Zhang, J. Martens, R. Grosse, Fast convergence of natural gradient descent for over-parameterized neural networks, in Advances in Neural Information Processing Systems, 32 (2019), 8082–8093. Available from: https://proceedings.neurips.cc/paper/2019/file/1da546f25222c1ee710cf7e2f7a3ff0c-Paper.pdf.
    [29] Z. Chen, G. Rotskoff, J. Bruna, E. Vanden-Eijnden, A dynamical central limit theorem for shallow neural networks, in Advances in Neural Information Processing Systems, 33 (2020), 22217–22230. Available from: https://proceedings.neurips.cc/paper/2020/file/fc5b3186f1cf0daece964f78259b7ba0-Paper.pdf.
    [30] L. Chizat, Sparse optimization on measures with over-parameterized gradient descent, Math. Program., 194 (2022), 487–532. https://doi.org/10.1007/s10107-021-01636-z
    [31] L. Chizat, F. Bach, On the global convergence of gradient descent for over-parameterized models using optimal transport, in Advances in Neural Information Processing Systems, 31 (2018), 3036–3046. Available from: https://proceedings.neurips.cc/paper/2018/file/a1afc58c6ca9540d057299ec3016d726-Paper.pdf.
    [32] W. E, C. Ma, S. Wojtowytsch, L. Wu, Towards a mathematical understanding of neural network-based machine learning: what we know and what we don't, preprint, arXiv: 2009.10713.
    [33] P. Cheridito, A. Jentzen, A. Riekert, F. Rossmannek, A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions, J. Complexity, 72 (2022), 101646. https://doi.org/10.1016/j.jco.2022.101646
    [34] A. Jentzen, A. Riekert, A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions, Z. Angew. Math. Phys., 73 (2022), 188. https://doi.org/10.1007/s00033-022-01716-w
    [35] A. Jentzen, A. Riekert, A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions, J. Mach. Learn. Res., 23 (2022), 1–50. Available from: https://www.jmlr.org/papers/volume23/21-0962/21-0962.pdf.
    [36] P. Cheridito, A. Jentzen, F. Rossmannek, Landscape analysis for shallow neural networks: complete classification of critical points for affine target functions, J. Nonlinear Sci., 32 (2022), 64. https://doi.org/10.1007/s00332-022-09823-8
    [37] A. Jentzen, A. Riekert, Convergence analysis for gradient flows in the training of artificial neural networks with ReLU activation, J. Math. Anal. Appl., 517 (2023), 126601. https://doi.org/10.1016/j.jmaa.2022.126601
    [38] D. Gallon, A. Jentzen, F. Lindner, Blow up phenomena for gradient descent optimization methods in the training of artificial neural networks, preprint, arXiv: 2211.15641.
    [39] R. T. Rockafellar, R. Wets, Variational Analysis, Springer-Verlag, Berlin, 1998. https://doi.org/10.1007/978-3-642-02431-3
    [40] E. Bierstone, P. D. Milman, Semianalytic and subanalytic sets, Inst. Hautes Études Sci. Publ. Math., 67 (1988), 5–42. https://doi.org/10.1007/BF02699126
    [41] T. Kaiser, Integration of semialgebraic functions and integrated Nash functions, Math. Z., 275 (2013), 349–366. https://doi.org/10.1007/s00209-012-1138-1
    [42] M. Coste, An introduction to semialgebraic geometry, 2000. Available from: http://blogs.mat.ucm.es/jesusr/wp-content/uploads/sites/52/2020/03/SAG.pdf.
  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)