
On the symmetries in the dynamics of wide two-layer neural networks

  • Received: 05 December 2022; Revised: 18 January 2023; Accepted: 01 February 2023; Published: 22 February 2023
  • We consider the idealized setting of gradient flow on the population risk for infinitely wide two-layer ReLU neural networks (without bias), and study the effect of symmetries on the learned parameters and predictors. We first describe a general class of symmetries which, when satisfied by the target function f and the input distribution, are preserved by the dynamics. We then study more specific cases. When f is odd, we show that the dynamics of the predictor reduces to that of a (non-linearly parameterized) linear predictor, and its exponential convergence can be guaranteed. When f has a low-dimensional structure, we prove that the gradient flow PDE reduces to a lower-dimensional PDE. Furthermore, we present informal and numerical arguments that suggest that the input neurons align with the lower-dimensional structure of the problem.

    Citation: Karl Hajjar, Lénaïc Chizat. On the symmetries in the dynamics of wide two-layer neural networks[J]. Electronic Research Archive, 2023, 31(4): 2175-2212. doi: 10.3934/era.2023112




    We consider a family of scalar conservation laws defined on an oriented graph $\Gamma$ consisting of $m$ incoming and $n$ outgoing edges $\Omega_\ell$, $\ell = 1, \dots, m+n$, joining at a single vertex. Incoming edges are parametrized by $x \in (-\infty, 0]$ and outgoing edges by $x \in [0, \infty)$, in such a way that the junction is always located at $x = 0$. We use the index $i$, $i = 1, \dots, m$, to refer to incoming edges and $j$, $j = m+1, \dots, m+n$, for the outgoing ones.
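    This bookkeeping can be mirrored in a small data structure. The following Python sketch (a hypothetical helper, not from the paper) records the orientation and indexing conventions just described: edges $0,\dots,m-1$ are incoming, the remaining $n$ are outgoing, and the junction sits at $x = 0$.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class StarGraph:
    """A single junction with m incoming and n outgoing edges.

    Incoming edges are parametrized by x in (-inf, 0], outgoing ones by
    x in [0, +inf), so the junction always sits at x = 0.  Edge ell
    carries its own flux; indices 0..m-1 are incoming, m..m+n-1 outgoing.
    """
    m: int                                                  # incoming edges
    n: int                                                  # outgoing edges
    fluxes: List[Callable[[float], float]] = field(default_factory=list)

    def is_incoming(self, ell: int) -> bool:
        return ell < self.m

# Example: a 2-in / 3-out junction where every edge carries f(r) = r(1 - r).
junction = StarGraph(m=2, n=3, fluxes=[lambda r: r * (1.0 - r)] * 5)
assert junction.is_incoming(0) and not junction.is_incoming(3)
```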

    On each edge $\Omega_\ell$ we introduce a scalar conservation law describing the evolution of a density $\rho_\ell$. On the incoming edges we have

    $$\partial_t \rho_i + \partial_x f_i(\rho_i) = 0, \qquad t > 0,\; x < 0,\; i = 1, \dots, m, \tag{1}$$

    and on the outgoing ones

    $$\partial_t \rho_j + \partial_x f_j(\rho_j) = 0, \qquad t > 0,\; x > 0,\; j = m+1, \dots, m+n. \tag{2}$$

    The fluxes $f_1, \dots, f_{m+n}$ differ in general; however, we assume that they are bell-shaped (unimodal), Lipschitz continuous and non-degenerately nonlinear, i.e.

    (H.1) for each $\ell \in \{1, \dots, m+n\}$, $f_\ell \in C^2([0,1])$, $f_\ell(0) = f_\ell(1) = 0$, $f_\ell \ge 0$, and there exists $\bar\rho_\ell \in (0,1)$ such that $f_\ell'(\rho)\,(\bar\rho_\ell - \rho) > 0$ for every $\rho \in [0,1] \setminus \{\bar\rho_\ell\}$;

    (H.2) for any $\ell \in \{1, \dots, m+n\}$, $|\{\rho : f_\ell''(\rho) = 0\}| = 0$.
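    A concrete flux satisfying (H.1)-(H.2) is the LWR-type choice $f(\rho) = \rho(1-\rho)$, for which $\bar\rho = 1/2$ and $f'' \equiv -2$. The short numerical spot check below (a sketch; flux and grid are our own illustrative choices, not from the paper) verifies both hypotheses on a grid.

```python
import numpy as np

# Bell-shaped flux f(r) = r(1 - r): f(0) = f(1) = 0, maximum at rbar = 1/2.
f = lambda r: r * (1.0 - r)
fp = lambda r: 1.0 - 2.0 * r                 # f'
fpp = lambda r: -2.0 * np.ones_like(r)       # f'' never vanishes -> (H.2)

rbar = 0.5
r = np.linspace(0.0, 1.0, 1001)

assert abs(f(0.0)) < 1e-14 and abs(f(1.0)) < 1e-14    # f(0) = f(1) = 0
assert np.all(f(r) >= 0.0)                            # f >= 0 on [0, 1]
mask = np.abs(r - rbar) > 1e-12
assert np.all(fp(r[mask]) * (rbar - r[mask]) > 0.0)   # (H.1): unimodality
assert np.all(np.abs(fpp(r)) > 0.0)                   # (H.2): nondegeneracy
```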

    We augment (1) and (2) with the initial conditions

    $$\begin{cases} \rho_i(0,x) = \rho_{i,0}(x), & x < 0,\; i = 1, \dots, m,\\ \rho_j(0,x) = \rho_{j,0}(x), & x > 0,\; j = m+1, \dots, m+n, \end{cases} \tag{3}$$

    assuming that

    (H.3) $\rho_{1,0}, \dots, \rho_{m,0} \in L^1(-\infty,0) \cap BV(-\infty,0)$, $\rho_{m+1,0}, \dots, \rho_{m+n,0} \in L^1(0,\infty) \cap BV(0,\infty)$, and $0 \le \rho_{1,0}, \dots, \rho_{m+n,0} \le 1$.

    Finally, we introduce the necessary conservation assumption at the node, which transforms our family of independent equations into a single problem

    $$\sum_{i=1}^{m} f_i(\rho_i(t,0-)) = \sum_{j=m+1}^{m+n} f_j(\rho_j(t,0+)) \qquad \text{for a.e. } t \ge 0.$$

    Questions related to the existence, uniqueness and stability of solutions for problems of this kind have been extensively investigated in recent years, mainly in connection with traffic modeling; the interested reader can refer to [7,13] for an overview of the subject. Here our point of view is different, as we do not focus on a specific model. We consider a parabolic regularization of the problem, similarly to what has been done in [11,10], but instead of enforcing a continuity condition at the node for the regularized solutions, we introduce a more general set of transmission conditions on the parabolic fluxes.

    In this work we adopt the following definition of weak solution for the problem (1), (2), and (3). We stress that this definition is certainly not sufficient to ensure uniqueness; rather, it fixes a minimal set of properties that any reasonable solution is expected to satisfy. See [3] and the references therein for a more detailed discussion of this point.

    Definition 1.1. Let $\rho_1, \dots, \rho_m \colon [0,\infty) \times (-\infty,0] \to \mathbb{R}$ and $\rho_{m+1}, \dots, \rho_{m+n} \colon [0,\infty) \times [0,\infty) \to \mathbb{R}$ be functions. We say that $(\rho_1, \dots, \rho_{m+n})$ is a weak solution of (1), (2), and (3) if

    (D.1) $f_1(\rho_1), \dots, f_m(\rho_m) \in BV_{loc}((0,\infty) \times (-\infty,0))$ and $f_{m+1}(\rho_{m+1}), \dots, f_{m+n}(\rho_{m+n}) \in BV_{loc}((0,\infty) \times (0,\infty))$;

    (D.2) for every $i \in \{1,\dots,m\}$, every $c \in \mathbb{R}$ and every nonnegative test function $\varphi \in C^\infty(\mathbb{R} \times (-\infty,0))$ with compact support,

    $$\int_0^\infty\!\!\int_{-\infty}^0 \Big( |\rho_i - c|\, \partial_t \varphi + \operatorname{sign}(\rho_i - c)\big(f_i(\rho_i) - f_i(c)\big)\, \partial_x \varphi \Big)\, dx\, dt + \int_{-\infty}^0 |\rho_{i,0}(x) - c|\, \varphi(0,x)\, dx \ge 0;$$

    (D.3) for every $j \in \{m+1,\dots,m+n\}$, every $c \in \mathbb{R}$ and every nonnegative test function $\varphi \in C^\infty(\mathbb{R} \times (0,\infty))$ with compact support,

    $$\int_0^\infty\!\!\int_0^\infty \Big( |\rho_j - c|\, \partial_t \varphi + \operatorname{sign}(\rho_j - c)\big(f_j(\rho_j) - f_j(c)\big)\, \partial_x \varphi \Big)\, dx\, dt + \int_0^\infty |\rho_{j,0}(x) - c|\, \varphi(0,x)\, dx \ge 0;$$

    (D.4) $\displaystyle\sum_{i=1}^{m} f_i(\rho_i(t,0-)) = \sum_{j=m+1}^{m+n} f_j(\rho_j(t,0+))$ for a.e. $t \ge 0$.

    Figure 1.  A junction consisting of m incoming and n outgoing edges.

    In [10] the authors approximated (1), (2), and (3) in the following way

    $$\begin{cases}
    \partial_t\rho_{i,\varepsilon} + \partial_x f_i(\rho_{i,\varepsilon}) = \varepsilon\,\partial^2_{xx}\rho_{i,\varepsilon}, & t>0,\ x<0,\ \forall i,\\
    \partial_t\rho_{j,\varepsilon} + \partial_x f_j(\rho_{j,\varepsilon}) = \varepsilon\,\partial^2_{xx}\rho_{j,\varepsilon}, & t>0,\ x>0,\ \forall j,\\
    \rho_{i,\varepsilon}(t,0) = \rho_{j,\varepsilon}(t,0), & t>0,\ \forall i,j,\\
    \displaystyle\sum_{i=1}^{m}\big(f_i(\rho_{i,\varepsilon}(t,0)) - \varepsilon\,\partial_x\rho_{i,\varepsilon}(t,0)\big) = \sum_{j=m+1}^{m+n}\big(f_j(\rho_{j,\varepsilon}(t,0)) - \varepsilon\,\partial_x\rho_{j,\varepsilon}(t,0)\big), & t>0,\\
    \rho_{i,\varepsilon}(0,x) = \rho_{i,0,\varepsilon}(x), & x<0,\ \forall i,\\
    \rho_{j,\varepsilon}(0,x) = \rho_{j,0,\varepsilon}(x), & x>0,\ \forall j,
    \end{cases}\tag{4}$$

    where $i \in \{1,\dots,m\}$, $j \in \{m+1,\dots,m+n\}$, and $\rho_{i,0,\varepsilon}, \rho_{j,0,\varepsilon}$ are smooth approximations of $\rho_{i,0}, \rho_{j,0}$. In this setting they showed that

    $$\begin{gathered}
    \rho_{i,\varepsilon} \to \rho_i \quad \text{a.e. in } (0,\infty)\times(-\infty,0) \text{ and in } L^p_{loc}((0,\infty)\times(-\infty,0)),\ 1 \le p < \infty, \text{ as } \varepsilon \to 0, \text{ for every } i,\\
    \rho_{j,\varepsilon} \to \rho_j \quad \text{a.e. in } (0,\infty)\times(0,\infty) \text{ and in } L^p_{loc}((0,\infty)\times(0,\infty)),\ 1 \le p < \infty, \text{ as } \varepsilon \to 0, \text{ for every } j,
    \end{gathered}$$

    where $(\rho_1, \dots, \rho_{m+n})$ is a weak solution of (1), (2), and (3) in the sense of Definition 1.1.
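    For $m = n = 1$ the continuity condition in (4) lets one glue the two half-lines into a single interval carrying the $x$-dependent flux $f(x,\cdot) = f_1$ for $x < 0$ and $f_2$ for $x > 0$, which is straightforward to discretize. The following Python sketch (truncated domain, explicit Rusanov flux plus central diffusion; all fluxes and parameters are hypothetical choices, not the scheme of [10]) illustrates this viscous approximation.

```python
import numpy as np

# Viscous approximation in the spirit of (4) for m = n = 1: the continuity
# condition at x = 0 lets us glue the two half-lines into one interval
# [-L, L] carrying the flux f(x, r) = f1(r) for x < 0 and f2(r) for x > 0.
f1 = lambda r: r * (1.0 - r)
f2 = lambda r: 0.5 * r * (1.0 - r)

L, N, eps, T, a = 2.0, 400, 0.05, 0.5, 1.0   # a >= max |f1'|, |f2'| on [0,1]
x = np.linspace(-L, L, N)
dx = x[1] - x[0]
dt = 0.2 * min(dx / a, dx**2 / (2.0 * eps))  # CFL for advection + diffusion
rho = np.where(x < 0.0, 0.8, 0.2)            # Riemann-type initial data

t = 0.0
while t < T:
    F = np.where(x < 0.0, f1(rho), f2(rho))
    # Rusanov numerical flux plus viscous flux at the N-1 cell interfaces
    Fh = 0.5 * (F[:-1] + F[1:]) - (0.5 * a + eps / dx) * (rho[1:] - rho[:-1])
    rho[1:-1] -= dt / dx * (Fh[1:] - Fh[:-1])
    rho[0], rho[-1] = rho[1], rho[-2]        # crude outflow boundaries
    t += dt

print("value at the junction (x = 0):", rho[N // 2])
```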

    In this paper we modify the transmission condition of (4): inspired by [14], we consider the following viscous approximation of (1), (2), and (3):

    $$\begin{cases}
    \partial_t\rho_{i,\varepsilon} + \partial_x f_i(\rho_{i,\varepsilon}) = \varepsilon\,\partial^2_{xx}\rho_{i,\varepsilon}, & t>0,\ x<0,\ \forall i,\\
    \partial_t\rho_{j,\varepsilon} + \partial_x f_j(\rho_{j,\varepsilon}) = \varepsilon\,\partial^2_{xx}\rho_{j,\varepsilon}, & t>0,\ x>0,\ \forall j,\\
    f_i(\rho_{i,\varepsilon}(t,0)) - \varepsilon\,\partial_x\rho_{i,\varepsilon}(t,0) = \beta_i(\rho_{1,\varepsilon}(t,0),\dots,\rho_{m+n,\varepsilon}(t,0)), & t>0,\ \forall i,\\
    f_j(\rho_{j,\varepsilon}(t,0)) - \varepsilon\,\partial_x\rho_{j,\varepsilon}(t,0) = \beta_j(\rho_{1,\varepsilon}(t,0),\dots,\rho_{m+n,\varepsilon}(t,0)), & t>0,\ \forall j,\\
    \rho_{i,\varepsilon}(0,x) = \rho_{i,0,\varepsilon}(x), & x<0,\ \forall i,\\
    \rho_{j,\varepsilon}(0,x) = \rho_{j,0,\varepsilon}(x), & x>0,\ \forall j,
    \end{cases}\tag{5}$$

    where, of course,

    $$\sum_{i=1}^{m} \beta_i(\rho_{1,\varepsilon}(t,0), \dots, \rho_{m+n,\varepsilon}(t,0)) = \sum_{j=m+1}^{m+n} \beta_j(\rho_{1,\varepsilon}(t,0), \dots, \rho_{m+n,\varepsilon}(t,0)). \tag{6}$$

    The additional assumptions we make on the functions $\beta_\ell$ and on the initial conditions $\rho_{\ell,0,\varepsilon}$ are postponed to the next section.

    The main result of the paper is the following.

    Theorem 1.2. Assume (H.1), (H.2), and (H.3). There exist a sequence $\{\varepsilon_k\}_{k\in\mathbb{N}} \subset (0,\infty)$, $\varepsilon_k \to 0$, and a solution $(\rho_1, \dots, \rho_{m+n})$ of (1), (2), and (3), in the sense of Definition 1.1, such that

    $$\rho_{i,\varepsilon_k} \to \rho_i \quad \text{a.e. and in } L^p_{loc}((0,\infty)\times(-\infty,0)), \tag{7}$$
    $$\rho_{j,\varepsilon_k} \to \rho_j \quad \text{a.e. and in } L^p_{loc}((0,\infty)\times(0,\infty)), \tag{8}$$
    $$f_1(\rho_1), \dots, f_m(\rho_m) \in BV((0,\infty)\times(-\infty,0)), \qquad f_{m+1}(\rho_{m+1}), \dots, f_{m+n}(\rho_{m+n}) \in BV((0,\infty)\times(0,\infty)), \tag{9}$$

    for every $1 \le p < \infty$, $i \in \{1,\dots,m\}$, $j \in \{m+1,\dots,m+n\}$, where $(\rho_{1,\varepsilon_k}, \dots, \rho_{m+n,\varepsilon_k})$ is the corresponding solution of (5).

    It is worth mentioning that a complete characterization of the limit solution obtained from (4) as $\varepsilon \to 0$ is given in [3], where the authors adapt to the star-shaped graph setting some ideas and techniques originally developed for conservation laws with discontinuous flux; see in particular [2,4,5].

    At the moment we are not able to formulate a similar characterization of the limit of (5). In general, however, the limits obtained from parabolic regularizations subject to the two different kinds of transmission conditions differ.

    To show this, consider the simple case of a junction with one incoming and one outgoing edge. We then have the conservation law

    $$\partial_t \rho_1 + \partial_x f_1(\rho_1) = 0, \qquad t > 0,\; x < 0, \tag{10}$$

    on the incoming edge and

    $$\partial_t \rho_2 + \partial_x f_2(\rho_2) = 0, \qquad t > 0,\; x > 0, \tag{11}$$

    on the outgoing one. Assume that

    $$\begin{gathered}
    f_1(0) = f_1(1) = f_2(0) = f_2(1) = 0, \qquad f_1'',\ f_2'' < 0,\\
    \text{there exist } 0 < \check\rho < \hat\rho < 1 \text{ and } G > 0 \text{ such that } f_1(\hat\rho) = f_2(\check\rho) = G(\hat\rho - \check\rho).
    \end{gathered}\tag{12}$$

    Consider the simplified version of (5)

    $$\begin{cases}
    \partial_t\rho_{1,\varepsilon} + \partial_x f_1(\rho_{1,\varepsilon}) = \varepsilon\,\partial^2_{xx}\rho_{1,\varepsilon}, & t>0,\ x<0,\\
    \partial_t\rho_{2,\varepsilon} + \partial_x f_2(\rho_{2,\varepsilon}) = \varepsilon\,\partial^2_{xx}\rho_{2,\varepsilon}, & t>0,\ x>0,\\
    f_1(\rho_{1,\varepsilon}(t,0)) - \varepsilon\,\partial_x\rho_{1,\varepsilon}(t,0) = f_2(\rho_{2,\varepsilon}(t,0)) - \varepsilon\,\partial_x\rho_{2,\varepsilon}(t,0) = G\big(\rho_{1,\varepsilon}(t,0) - \rho_{2,\varepsilon}(t,0)\big), & t>0,\\
    \rho_{1,\varepsilon}(0,x) = \hat\rho, & x<0,\\
    \rho_{2,\varepsilon}(0,x) = \check\rho, & x>0.
    \end{cases}\tag{13}$$

    The unique solution of (13) is

    $$\rho_{1,\varepsilon}(\cdot,\cdot) \equiv \hat\rho, \qquad \rho_{2,\varepsilon}(\cdot,\cdot) \equiv \check\rho, \qquad \varepsilon > 0. \tag{14}$$

    Therefore, as $\varepsilon \to 0$ we get the following solution of (10)-(11):

    $$\rho_1(\cdot,\cdot) \equiv \hat\rho, \qquad \rho_2(\cdot,\cdot) \equiv \check\rho. \tag{15}$$

    This stationary solution is not admissible in the sense of the classical vanishing viscosity germ, see [5, Sec. 5], as it consists of a nonclassical shock. However, when dealing with conservation laws with discontinuous flux, it is well known that infinitely many $L^1$-contractive semigroups of solutions exist, also in relation with different physical applications. In particular, when the right and left fluxes are bell-shaped, as we assume in condition (H.1), each of those notions of admissible solution is uniquely determined by the choice of an $(A,B)$-connection, see [1,5,9,12] for precise definitions and examples. In the example above the couple $(\hat\rho, \check\rho)$ is a connection.
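    For instance, taking $f_1 = f_2 = \rho(1-\rho)$ and $(\hat\rho, \check\rho) = (0.8, 0.2)$ (hypothetical values chosen purely for illustration), the compatibility conditions in (12) and the stationarity of (14) can be checked directly:

```python
# Spot check of (12)-(14): with f1 = f2 = r(1-r), the pair
# (rho_hat, rho_check) = (0.8, 0.2) is a connection.
f1 = lambda r: r * (1.0 - r)
f2 = lambda r: r * (1.0 - r)

rho_hat, rho_check = 0.8, 0.2
G = f1(rho_hat) / (rho_hat - rho_check)      # G = 0.16 / 0.6 > 0

assert 0.0 < rho_check < rho_hat < 1.0
assert abs(f1(rho_hat) - f2(rho_check)) < 1e-14
assert abs(f1(rho_hat) - G * (rho_hat - rho_check)) < 1e-14
# The constant states solve (13): every x-derivative vanishes and the
# transmission condition f1(rho_hat) = f2(rho_check) = G(rho_hat - rho_check)
# holds, so (14) is a stationary solution for every eps > 0.
print("connection:", (rho_hat, rho_check), "with G =", G)
```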

    It is worth noticing that entropy solutions admissible in the sense of an $(A,B)$-connection can be obtained as limits of a sequence of parabolic approximations with adapted viscosities but with a classical continuity condition at the interface, see [5, Sec. 6.2] for a general result, and also [2,15] for an application to the Buckley-Leverett equation.

    It is difficult, however, to establish a direct equivalence between the aforementioned results and the one we put forward in this paper. In particular, in the present case we lack information on the boundary layers at the parabolic level, and we do not know how the transmission conditions we impose on the parabolic fluxes translate into a condition for the hyperbolic problem.

    This means in particular that we have little information on the germ associated with the family of limit solutions obtained in Theorem 1.2 and, so far, we have not been able to prove that this germ is $L^1$-dissipative. We conjecture, however, that this is due to a technical obstruction and that uniqueness of the limit solutions holds.

    The paper is organized as follows: Section 2 contains the precise list of assumptions on the initial and transmission conditions in the parabolic problem (5). In Section 3 we present the proofs of all necessary a priori estimates on (5). Finally, in Section 4 we detail the proof of Theorem 1.2.

    The initial conditions $\rho_{\ell,0}$, $\ell = 1, \dots, m+n$, of the hyperbolic problem (1), (2), and (3) satisfy (H.3).

    Once the functions $\rho_{\ell,0}$ are fixed, we impose on (5) initial conditions $\rho_{\ell,0,\varepsilon}$ such that

    $$\begin{gathered}
    \rho_{i,0,\varepsilon}\in C^\infty((-\infty,0])\cap L^1(-\infty,0),\qquad \rho_{j,0,\varepsilon}\in C^\infty([0,\infty))\cap L^1(0,\infty),\qquad \varepsilon>0,\\
    \rho_{i,0,\varepsilon}\to\rho_{i,0}\ \text{a.e. in }(-\infty,0)\ \text{and in }L^p_{loc}(-\infty,0),\ 1\le p<\infty,\ \text{as }\varepsilon\to 0,\\
    \rho_{j,0,\varepsilon}\to\rho_{j,0}\ \text{a.e. in }(0,\infty)\ \text{and in }L^p_{loc}(0,\infty),\ 1\le p<\infty,\ \text{as }\varepsilon\to 0,\\
    0\le\rho_{i,0,\varepsilon},\ \rho_{j,0,\varepsilon}\le 1,\qquad \varepsilon>0,\\
    \|\rho_{i,0,\varepsilon}\|_{L^1(-\infty,0)}\le\|\rho_{i,0}\|_{L^1(-\infty,0)},\qquad \|\rho_{j,0,\varepsilon}\|_{L^1(0,\infty)}\le\|\rho_{j,0}\|_{L^1(0,\infty)},\qquad \varepsilon>0,\\
    \|\rho_{i,0,\varepsilon}\|_{L^2(-\infty,0)}\le\|\rho_{i,0}\|_{L^2(-\infty,0)},\qquad \|\rho_{j,0,\varepsilon}\|_{L^2(0,\infty)}\le\|\rho_{j,0}\|_{L^2(0,\infty)},\qquad \varepsilon>0,\\
    \|\partial_x\rho_{i,0,\varepsilon}\|_{L^1(-\infty,0)}\le TV(\rho_{i,0}),\qquad \|\partial_x\rho_{j,0,\varepsilon}\|_{L^1(0,\infty)}\le TV(\rho_{j,0}),\qquad \varepsilon>0,\\
    \varepsilon\,\|\partial^2_{xx}\rho_{i,0,\varepsilon}\|_{L^1(-\infty,0)},\ \varepsilon\,\|\partial^2_{xx}\rho_{j,0,\varepsilon}\|_{L^1(0,\infty)}\le C,\qquad \varepsilon>0,
    \end{gathered}\tag{16}$$

    for some constant $C > 0$ independent of $\varepsilon$.

    The functions $\beta_\ell$ appearing in the transmission conditions in (5) take the form

    $$\begin{aligned}
    \beta_i(\rho_{1,\varepsilon}(t,0),\dots,\rho_{m+n,\varepsilon}(t,0))&=\sum_{j=m+1}^{m+n}G_{i,j}\big(\rho_{i,\varepsilon}(t,0),\rho_{j,\varepsilon}(t,0)\big)\\
    &\quad+\varepsilon\Big(\sum_{h=1}^{m}K_{i,h}\big(\rho_{i,\varepsilon}(t,0),\rho_{h,\varepsilon}(t,0)\big)-\sum_{h=1}^{m+n}K_{h,i}\big(\rho_{h,\varepsilon}(t,0),\rho_{i,\varepsilon}(t,0)\big)\Big);
    \end{aligned}\tag{17}$$

    for $i \in \{1, \dots, m\}$, and, for $j \in \{m+1, \dots, m+n\}$,

    $$\begin{aligned}
    \beta_j(\rho_{1,\varepsilon}(t,0),\dots,\rho_{m+n,\varepsilon}(t,0))&=\sum_{i=1}^{m}G_{i,j}\big(\rho_{i,\varepsilon}(t,0),\rho_{j,\varepsilon}(t,0)\big)\\
    &\quad+\varepsilon\Big(\sum_{h=m+1}^{m+n}K_{h,j}\big(\rho_{h,\varepsilon}(t,0),\rho_{j,\varepsilon}(t,0)\big)-\sum_{h=1}^{m+n}K_{j,h}\big(\rho_{j,\varepsilon}(t,0),\rho_{h,\varepsilon}(t,0)\big)\Big).
    \end{aligned}\tag{18}$$

    The functions $G_{i,j} \in C^1(\mathbb{R}^2)$, $i \in \{1,\dots,m\}$, $j \in \{m+1,\dots,m+n\}$, and $K_{h,\ell} \in C^1(\mathbb{R}^2)$, $h, \ell \in \{1,\dots,m+n\}$, satisfy

    $$\begin{gathered}
    \partial_v G_{i,j}(\cdot,\cdot)\le 0\le\partial_u G_{i,j}(\cdot,\cdot),\qquad G_{i,j}(0,0)=G_{i,j}(1,1)=0,\\
    \partial_u K_{h,\ell}(\cdot,\cdot)\ge 0\ge\partial_v K_{h,\ell}(\cdot,\cdot),\qquad K_{h,\ell}(0,0)=K_{h,\ell}(1,1)=0.
    \end{gathered}\tag{19}$$

    In particular, (19) implies

    $$\begin{gathered}
    \big(\operatorname{sign}(u)-\operatorname{sign}(v)\big)\,\nabla G_{i,j}(\cdot,\cdot)\cdot(u,v)\ge 0,\qquad u,v\in\mathbb{R},\\
    \big(\operatorname{sign}(u)-\operatorname{sign}(v)\big)\,\nabla K_{h,\ell}(\cdot,\cdot)\cdot(u,v)\ge 0,\qquad u,v\in\mathbb{R},\\
    \big(\operatorname{sign}(u-u')-\operatorname{sign}(v-v')\big)\,\big(G_{i,j}(u,v)-G_{i,j}(u',v')\big)\ge 0,\qquad u,u',v,v'\in\mathbb{R},\\
    \big(\operatorname{sign}(u-u')-\operatorname{sign}(v-v')\big)\,\big(K_{h,\ell}(u,v)-K_{h,\ell}(u',v')\big)\ge 0,\qquad u,u',v,v'\in\mathbb{R},\\
    \big(\chi_{(-\infty,0)}(u)-\chi_{(-\infty,0)}(v)\big)\,G_{i,j}(u,v)\le 0,\qquad u,v\in\mathbb{R},\\
    \big(\chi_{(-\infty,0)}(u)-\chi_{(-\infty,0)}(v)\big)\,K_{h,\ell}(u,v)\le 0,\qquad u,v\in\mathbb{R},
    \end{gathered}\tag{20}$$

    where $\chi_{(-\infty,0)}$ is the characteristic function of the set $(-\infty,0)$.
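    For a concrete pair satisfying (19), such as the linear choices $G_{i,j}(u,v) = K_{h,\ell}(u,v) = u - v$ (our own illustrative example, not from the paper), the sign conditions in (20) can be spot-checked numerically:

```python
import numpy as np

# Hypothetical kernels satisfying (19): both increasing in u, decreasing
# in v, and vanishing at (0,0) and (1,1).
G = lambda u, v: u - v
dG = lambda u, v: (np.ones_like(u), -np.ones_like(v))   # (dG/du, dG/dv)
K = lambda u, v: u - v
dK = lambda u, v: (np.ones_like(u), -np.ones_like(v))
chi = lambda z: (z < 0.0).astype(float)  # characteristic fn of (-inf, 0)

rng = np.random.default_rng(1)
u, v, up, vp = rng.uniform(-1.0, 1.0, (4, 10_000))
s = np.sign

Gu, Gv = dG(u, v); Ku, Kv = dK(u, v)
assert np.all((s(u) - s(v)) * (Gu * u + Gv * v) >= 0.0)               # line 1
assert np.all((s(u) - s(v)) * (Ku * u + Kv * v) >= 0.0)               # line 2
assert np.all((s(u - up) - s(v - vp)) * (G(u, v) - G(up, vp)) >= 0.0) # line 3
assert np.all((s(u - up) - s(v - vp)) * (K(u, v) - K(up, vp)) >= 0.0) # line 4
assert np.all((chi(u) - chi(v)) * G(u, v) <= 0.0)                     # line 5
assert np.all((chi(u) - chi(v)) * K(u, v) <= 0.0)                     # line 6
```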

    This specific form of transmission conditions is reminiscent of the parabolic transmission conditions considered in [14,8], which were originally inspired by the Kedem-Katchalsky conditions for membrane permeability introduced in [16]:

    $$G_{h,\ell}(u,v) = c_{h,\ell}\,(u - v), \tag{21}$$

    for some constants $c_{h,\ell} > 0$. Our conditions are more general; in particular, we can notice that the function $G_{h,\ell}$ above satisfies

    $$G_{h,\ell}(u,v)\,(u - v) \ge 0, \tag{22}$$

    which allows the authors in [14] to obtain the $L^2$ estimate (see Lemma 3.3 below).

    We can observe that the equality (6) holds as

    $$\begin{aligned}
    \sum_{i=1}^{m}\beta_i(\rho_{1,\varepsilon}(t,0),\dots,\rho_{m+n,\varepsilon}(t,0))
    &=\sum_{i=1}^{m}\sum_{j=m+1}^{m+n}G_{i,j}\big(\rho_{i,\varepsilon}(t,0),\rho_{j,\varepsilon}(t,0)\big)+\varepsilon\sum_{i=1}^{m}\Big(\sum_{h=1}^{m}K_{i,h}\big(\rho_{i,\varepsilon}(t,0),\rho_{h,\varepsilon}(t,0)\big)-\sum_{h=1}^{m+n}K_{h,i}\big(\rho_{h,\varepsilon}(t,0),\rho_{i,\varepsilon}(t,0)\big)\Big)\\
    &=\sum_{i=1}^{m}\sum_{j=m+1}^{m+n}\Big(G_{i,j}\big(\rho_{i,\varepsilon}(t,0),\rho_{j,\varepsilon}(t,0)\big)-\varepsilon\,K_{j,i}\big(\rho_{j,\varepsilon}(t,0),\rho_{i,\varepsilon}(t,0)\big)\Big)
    \end{aligned}\tag{23}$$

    and analogously

    $$\sum_{j=m+1}^{m+n}\beta_j(\rho_{1,\varepsilon}(t,0),\dots,\rho_{m+n,\varepsilon}(t,0))=\sum_{j=m+1}^{m+n}\sum_{i=1}^{m}\Big(G_{i,j}\big(\rho_{i,\varepsilon}(t,0),\rho_{j,\varepsilon}(t,0)\big)-\varepsilon\,K_{j,i}\big(\rho_{j,\varepsilon}(t,0),\rho_{i,\varepsilon}(t,0)\big)\Big).\tag{24}$$
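    The cancellation expressed by (23)-(24) holds for any choice of the kernels, so it is easy to verify numerically. The sketch below instantiates (17)-(18) for a 2-in/3-out junction with the illustrative choices $G_{i,j}(u,v) = K_{h,\ell}(u,v) = u - v$ and checks (6) on random junction states.

```python
import numpy as np

m, n, eps = 2, 3, 0.1
G = lambda u, v: u - v   # increasing in u, decreasing in v; G(0,0)=G(1,1)=0
K = lambda u, v: u - v   # same monotonicity; K(0,0)=K(1,1)=0

rng = np.random.default_rng(0)
rho = rng.uniform(0.0, 1.0, m + n)   # traces rho_ell(t, 0) at the junction

def beta_in(i):   # transmission flux (17) on incoming edge i
    g = sum(G(rho[i], rho[j]) for j in range(m, m + n))
    k = sum(K(rho[i], rho[h]) for h in range(m)) \
        - sum(K(rho[h], rho[i]) for h in range(m + n))
    return g + eps * k

def beta_out(j):  # transmission flux (18) on outgoing edge j
    g = sum(G(rho[i], rho[j]) for i in range(m))
    k = sum(K(rho[h], rho[j]) for h in range(m, m + n)) \
        - sum(K(rho[j], rho[h]) for h in range(m + n))
    return g + eps * k

lhs = sum(beta_in(i) for i in range(m))
rhs = sum(beta_out(j) for j in range(m, m + n))
assert abs(lhs - rhs) < 1e-12        # conservation (6) at the node
```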

    This section is devoted to establishing a priori estimates, uniform with respect to $\varepsilon$, which are necessary for the proof of our main convergence result in the next section.

    For every $\varepsilon > 0$, let $(\rho_{1,\varepsilon}, \dots, \rho_{m+n,\varepsilon})$ be a solution of (5) satisfying (16).

    Lemma 3.1 ($L^\infty$ estimate). We have that

    $$0 \le \rho_{i,\varepsilon},\ \rho_{j,\varepsilon} \le 1, \qquad \forall\, i, j. \tag{25}$$

    Proof. Consider the function $\eta(\xi) = -\xi\,\chi_{(-\infty,0)}(\xi)$. Since $\eta'(\xi) = -\chi_{(-\infty,0)}(\xi)$, using (19) we obtain

    $$\begin{aligned}
    &\frac{d}{dt}\Big(\sum_{i=1}^{m}\int_{-\infty}^{0}\eta(\rho_{i,\varepsilon})\,dx+\sum_{j=m+1}^{m+n}\int_{0}^{\infty}\eta(\rho_{j,\varepsilon})\,dx\Big)
    =\sum_{i}\int_{-\infty}^{0}\eta'(\rho_{i,\varepsilon})\,\partial_t\rho_{i,\varepsilon}\,dx+\sum_{j}\int_{0}^{\infty}\eta'(\rho_{j,\varepsilon})\,\partial_t\rho_{j,\varepsilon}\,dx\\
    &\quad=-\sum_{i}\int_{-\infty}^{0}\chi_{(-\infty,0)}(\rho_{i,\varepsilon})\,\partial_t\rho_{i,\varepsilon}\,dx-\sum_{j}\int_{0}^{\infty}\chi_{(-\infty,0)}(\rho_{j,\varepsilon})\,\partial_t\rho_{j,\varepsilon}\,dx\\
    &\quad=\sum_{i}\int_{-\infty}^{0}\chi_{(-\infty,0)}(\rho_{i,\varepsilon})\,\partial_x\big(f_i(\rho_{i,\varepsilon})-\varepsilon\,\partial_x\rho_{i,\varepsilon}\big)\,dx+\sum_{j}\int_{0}^{\infty}\chi_{(-\infty,0)}(\rho_{j,\varepsilon})\,\partial_x\big(f_j(\rho_{j,\varepsilon})-\varepsilon\,\partial_x\rho_{j,\varepsilon}\big)\,dx\\
    &\quad=\sum_{i}\chi_{(-\infty,0)}(\rho_{i,\varepsilon}(t,0))\big(f_i(\rho_{i,\varepsilon}(t,0))-\varepsilon\,\partial_x\rho_{i,\varepsilon}(t,0)\big)-\sum_{j}\chi_{(-\infty,0)}(\rho_{j,\varepsilon}(t,0))\big(f_j(\rho_{j,\varepsilon}(t,0))-\varepsilon\,\partial_x\rho_{j,\varepsilon}(t,0)\big)\\
    &\qquad+\underbrace{\sum_{i}\int_{-\infty}^{0}\partial_x\rho_{i,\varepsilon}\big(f_i(\rho_{i,\varepsilon})-\varepsilon\,\partial_x\rho_{i,\varepsilon}\big)\,d\delta_{\{\rho_{i,\varepsilon}=0\}}}_{\le 0}+\underbrace{\sum_{j}\int_{0}^{\infty}\partial_x\rho_{j,\varepsilon}\big(f_j(\rho_{j,\varepsilon})-\varepsilon\,\partial_x\rho_{j,\varepsilon}\big)\,d\delta_{\{\rho_{j,\varepsilon}=0\}}}_{\le 0}\\
    &\quad\le\sum_{j=m+1}^{m+n}\sum_{i=1}^{m}\big(\chi_{(-\infty,0)}(\rho_{i,\varepsilon}(t,0))-\chi_{(-\infty,0)}(\rho_{j,\varepsilon}(t,0))\big)\Big(G_{i,j}\big(\rho_{i,\varepsilon}(t,0),\rho_{j,\varepsilon}(t,0)\big)-\varepsilon\,K_{j,i}\big(\rho_{j,\varepsilon}(t,0),\rho_{i,\varepsilon}(t,0)\big)\Big)\le 0,
    \end{aligned}$$

    where $\delta_{\{\rho_{i,\varepsilon}=0\}}$ and $\delta_{\{\rho_{j,\varepsilon}=0\}}$ are the Dirac deltas concentrated on the sets $\{\rho_{i,\varepsilon}=0\}$ and $\{\rho_{j,\varepsilon}=0\}$, respectively, and we apply [6, Lemma 2] to compute the value of the integrals as a limit. Integrating over $(0,t)$ and using (16) we get

    $$0\le\sum_{i=1}^{m}\int_{-\infty}^{0}\eta(\rho_{i,\varepsilon}(t,x))\,dx+\sum_{j=m+1}^{m+n}\int_{0}^{\infty}\eta(\rho_{j,\varepsilon}(t,x))\,dx\le\sum_{i=1}^{m}\int_{-\infty}^{0}\eta(\rho_{i,0,\varepsilon})\,dx+\sum_{j=m+1}^{m+n}\int_{0}^{\infty}\eta(\rho_{j,0,\varepsilon})\,dx=0$$

    and then

    $$\rho_{i,\varepsilon},\ \rho_{j,\varepsilon}\ge 0,\qquad \forall\, i,j,$$

    which proves the lower bounds in (25). The upper bounds in (25) can be proved in the same way using the function $\xi \mapsto (\xi-1)\,\chi_{(1,\infty)}(\xi)$.

    Lemma 3.2 ($L^1$ estimate). We have that

    $$\sum_{i=1}^{m}\|\rho_{i,\varepsilon}(t,\cdot)\|_{L^1(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\rho_{j,\varepsilon}(t,\cdot)\|_{L^1(0,\infty)}\le\sum_{i=1}^{m}\|\rho_{i,0}\|_{L^1(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\rho_{j,0}\|_{L^1(0,\infty)},\qquad t\ge 0.\tag{26}$$

    Proof. Thanks to (5), (23), (24), and (25), we have that

    $$\begin{aligned}
    \frac{d}{dt}\Big(\sum_{i=1}^{m}\int_{-\infty}^{0}|\rho_{i,\varepsilon}|\,dx+\sum_{j=m+1}^{m+n}\int_{0}^{\infty}|\rho_{j,\varepsilon}|\,dx\Big)
    &=\frac{d}{dt}\Big(\sum_{i}\int_{-\infty}^{0}\rho_{i,\varepsilon}\,dx+\sum_{j}\int_{0}^{\infty}\rho_{j,\varepsilon}\,dx\Big)
    =\sum_{i}\int_{-\infty}^{0}\partial_t\rho_{i,\varepsilon}\,dx+\sum_{j}\int_{0}^{\infty}\partial_t\rho_{j,\varepsilon}\,dx\\
    &=-\sum_{i}\int_{-\infty}^{0}\partial_x\big(f_i(\rho_{i,\varepsilon})-\varepsilon\,\partial_x\rho_{i,\varepsilon}\big)\,dx-\sum_{j}\int_{0}^{\infty}\partial_x\big(f_j(\rho_{j,\varepsilon})-\varepsilon\,\partial_x\rho_{j,\varepsilon}\big)\,dx\\
    &=-\sum_{i}\beta_i(\rho_{1,\varepsilon}(t,0),\dots,\rho_{m+n,\varepsilon}(t,0))+\sum_{j}\beta_j(\rho_{1,\varepsilon}(t,0),\dots,\rho_{m+n,\varepsilon}(t,0))=0.
    \end{aligned}$$

    Integrating over (0,t) and using (16) we get (26).
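    Both the $L^\infty$ bound (25) and the $L^1$ bound (26) have discrete counterparts for monotone schemes. The sketch below (a single truncated edge with zero-flux boundaries standing in for the full network; all parameters are illustrative assumptions, not from the paper) checks them numerically.

```python
import numpy as np

# Monotone (Rusanov + central diffusion) scheme on one truncated edge,
# with zero total flux at the boundaries so that mass cannot enter.
f = lambda r: r * (1.0 - r)
L, N, eps, T, a = 2.0, 400, 0.05, 0.5, 1.0   # a >= max |f'| on [0, 1]
x = np.linspace(-L, 0.0, N)
dx = x[1] - x[0]
dt = 0.2 * min(dx / a, dx**2 / (2.0 * eps))  # CFL keeps the scheme monotone

rho = np.clip(0.5 + 0.5 * np.sin(4.0 * np.pi * x) ** 2 * np.exp(-x**2), 0.0, 1.0)
mass0 = dx * rho.sum()

t = 0.0
while t < T:
    F = f(rho)
    Fh = 0.5 * (F[:-1] + F[1:]) - (0.5 * a + eps / dx) * (rho[1:] - rho[:-1])
    Ftot = np.concatenate(([0.0], Fh, [0.0]))   # zero flux at both ends
    rho -= dt / dx * (Ftot[1:] - Ftot[:-1])
    t += dt

assert rho.min() >= -1e-9 and rho.max() <= 1.0 + 1e-9  # discrete (25)
assert dx * rho.sum() <= mass0 + 1e-9                  # discrete (26)
print("discrete maximum principle and L1 bound verified")
```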

    Lemma 3.3 ($L^2$ estimate). We have that

    $$\begin{aligned}
    &\sum_{i=1}^{m}\|\rho_{i,\varepsilon}(t,\cdot)\|^2_{L^2(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\rho_{j,\varepsilon}(t,\cdot)\|^2_{L^2(0,\infty)}+2\varepsilon\int_0^t\Big(\sum_{i=1}^{m}\|\partial_x\rho_{i,\varepsilon}(s,\cdot)\|^2_{L^2(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\partial_x\rho_{j,\varepsilon}(s,\cdot)\|^2_{L^2(0,\infty)}\Big)\,ds\\
    &\qquad\le\sum_{i=1}^{m}\|\rho_{i,0}\|^2_{L^2(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\rho_{j,0}\|^2_{L^2(0,\infty)}+2\Big(\sum_{\ell=1}^{m+n}\|\beta_\ell\|_{L^\infty((0,1)^{m+n})}+\sum_{i=1}^{m}\|f_i\|_{L^1(0,1)}\Big)t,
    \end{aligned}\tag{27}$$

    for every t0.

    Proof. Thanks to (5), we have that

    $$\begin{aligned}
    &\frac{d}{dt}\Big(\sum_{i=1}^{m}\int_{-\infty}^{0}\frac{\rho_{i,\varepsilon}^2}{2}\,dx+\sum_{j=m+1}^{m+n}\int_{0}^{\infty}\frac{\rho_{j,\varepsilon}^2}{2}\,dx\Big)
    =\sum_{i}\int_{-\infty}^{0}\rho_{i,\varepsilon}\,\partial_t\rho_{i,\varepsilon}\,dx+\sum_{j}\int_{0}^{\infty}\rho_{j,\varepsilon}\,\partial_t\rho_{j,\varepsilon}\,dx\\
    &\quad=-\sum_{i}\int_{-\infty}^{0}\rho_{i,\varepsilon}\,\partial_x\big(f_i(\rho_{i,\varepsilon})-\varepsilon\,\partial_x\rho_{i,\varepsilon}\big)\,dx-\sum_{j}\int_{0}^{\infty}\rho_{j,\varepsilon}\,\partial_x\big(f_j(\rho_{j,\varepsilon})-\varepsilon\,\partial_x\rho_{j,\varepsilon}\big)\,dx\\
    &\quad=-\sum_{i}\rho_{i,\varepsilon}(t,0)\big(f_i(\rho_{i,\varepsilon}(t,0))-\varepsilon\,\partial_x\rho_{i,\varepsilon}(t,0)\big)+\sum_{j}\rho_{j,\varepsilon}(t,0)\big(f_j(\rho_{j,\varepsilon}(t,0))-\varepsilon\,\partial_x\rho_{j,\varepsilon}(t,0)\big)\\
    &\qquad+\sum_{i}\int_{-\infty}^{0}\partial_x\Big(\int_0^{\rho_{i,\varepsilon}(t,x)}f_i(\xi)\,d\xi\Big)\,dx+\sum_{j}\int_{0}^{\infty}\partial_x\Big(\int_0^{\rho_{j,\varepsilon}(t,x)}f_j(\xi)\,d\xi\Big)\,dx-\varepsilon\sum_{i}\int_{-\infty}^{0}(\partial_x\rho_{i,\varepsilon})^2\,dx-\varepsilon\sum_{j}\int_{0}^{\infty}(\partial_x\rho_{j,\varepsilon})^2\,dx\\
    &\quad=-\sum_{i}\rho_{i,\varepsilon}(t,0)\,\beta_i(\rho_{1,\varepsilon}(t,0),\dots,\rho_{m+n,\varepsilon}(t,0))+\sum_{j}\rho_{j,\varepsilon}(t,0)\,\beta_j(\rho_{1,\varepsilon}(t,0),\dots,\rho_{m+n,\varepsilon}(t,0))\\
    &\qquad+\sum_{i}\int_0^{\rho_{i,\varepsilon}(t,0)}f_i(\xi)\,d\xi\underbrace{-\sum_{j}\int_0^{\rho_{j,\varepsilon}(t,0)}f_j(\xi)\,d\xi}_{\le 0}-\varepsilon\sum_{i}\int_{-\infty}^{0}(\partial_x\rho_{i,\varepsilon})^2\,dx-\varepsilon\sum_{j}\int_{0}^{\infty}(\partial_x\rho_{j,\varepsilon})^2\,dx\\
    &\quad\le\sum_{\ell=1}^{m+n}\|\beta_\ell\|_{L^\infty((0,1)^{m+n})}+\sum_{i=1}^{m}\|f_i\|_{L^1(0,1)}-\varepsilon\sum_{i}\int_{-\infty}^{0}(\partial_x\rho_{i,\varepsilon})^2\,dx-\varepsilon\sum_{j}\int_{0}^{\infty}(\partial_x\rho_{j,\varepsilon})^2\,dx.
    \end{aligned}$$

    Integrating over (0,t) and using (16) we get (27).

    Lemma 3.4 ($BV$ estimate). We have that

    $$\sum_{i=1}^{m}\|\partial_t\rho_{i,\varepsilon}(t,\cdot)\|_{L^1(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\partial_t\rho_{j,\varepsilon}(t,\cdot)\|_{L^1(0,\infty)}\le(m+n)\,C+\sum_{i=1}^{m}\|f_i'\|_{L^\infty(0,1)}\,TV(\rho_{i,0})+\sum_{j=m+1}^{m+n}\|f_j'\|_{L^\infty(0,1)}\,TV(\rho_{j,0}),\tag{28}$$

    for every t0.

    Proof. From (5) we get

    $$\begin{aligned}
    &\partial^2_{tt}\rho_{i,\varepsilon}+\partial_x\big(f_i'(\rho_{i,\varepsilon})\,\partial_t\rho_{i,\varepsilon}\big)=\varepsilon\,\partial^3_{txx}\rho_{i,\varepsilon},\qquad
    \partial^2_{tt}\rho_{j,\varepsilon}+\partial_x\big(f_j'(\rho_{j,\varepsilon})\,\partial_t\rho_{j,\varepsilon}\big)=\varepsilon\,\partial^3_{txx}\rho_{j,\varepsilon},\\
    &f_i'(\rho_{i,\varepsilon}(t,0))\,\partial_t\rho_{i,\varepsilon}(t,0)-\varepsilon\,\partial^2_{tx}\rho_{i,\varepsilon}(t,0)=\sum_{j=m+1}^{m+n}\nabla G_{i,j}\big(\rho_{i,\varepsilon}(t,0),\rho_{j,\varepsilon}(t,0)\big)\cdot\big(\partial_t\rho_{i,\varepsilon}(t,0),\partial_t\rho_{j,\varepsilon}(t,0)\big)\\
    &\qquad+\varepsilon\sum_{h=1}^{m}\nabla K_{i,h}\big(\rho_{i,\varepsilon}(t,0),\rho_{h,\varepsilon}(t,0)\big)\cdot\big(\partial_t\rho_{i,\varepsilon}(t,0),\partial_t\rho_{h,\varepsilon}(t,0)\big)-\varepsilon\sum_{h=1}^{m+n}\nabla K_{h,i}\big(\rho_{h,\varepsilon}(t,0),\rho_{i,\varepsilon}(t,0)\big)\cdot\big(\partial_t\rho_{h,\varepsilon}(t,0),\partial_t\rho_{i,\varepsilon}(t,0)\big),\\
    &f_j'(\rho_{j,\varepsilon}(t,0))\,\partial_t\rho_{j,\varepsilon}(t,0)-\varepsilon\,\partial^2_{tx}\rho_{j,\varepsilon}(t,0)=\sum_{i=1}^{m}\nabla G_{i,j}\big(\rho_{i,\varepsilon}(t,0),\rho_{j,\varepsilon}(t,0)\big)\cdot\big(\partial_t\rho_{i,\varepsilon}(t,0),\partial_t\rho_{j,\varepsilon}(t,0)\big)\\
    &\qquad+\varepsilon\sum_{h=m+1}^{m+n}\nabla K_{h,j}\big(\rho_{h,\varepsilon}(t,0),\rho_{j,\varepsilon}(t,0)\big)\cdot\big(\partial_t\rho_{h,\varepsilon}(t,0),\partial_t\rho_{j,\varepsilon}(t,0)\big)-\varepsilon\sum_{h=1}^{m+n}\nabla K_{j,h}\big(\rho_{j,\varepsilon}(t,0),\rho_{h,\varepsilon}(t,0)\big)\cdot\big(\partial_t\rho_{j,\varepsilon}(t,0),\partial_t\rho_{h,\varepsilon}(t,0)\big).
    \end{aligned}$$

    Thanks to (20), we have that

    $$\begin{aligned}
    &\frac{d}{dt}\Big(\sum_{i=1}^{m}\int_{-\infty}^{0}|\partial_t\rho_{i,\varepsilon}|\,dx+\sum_{j=m+1}^{m+n}\int_{0}^{\infty}|\partial_t\rho_{j,\varepsilon}|\,dx\Big)
    =\sum_{i}\int_{-\infty}^{0}\partial^2_{tt}\rho_{i,\varepsilon}\,\operatorname{sign}(\partial_t\rho_{i,\varepsilon})\,dx+\sum_{j}\int_{0}^{\infty}\partial^2_{tt}\rho_{j,\varepsilon}\,\operatorname{sign}(\partial_t\rho_{j,\varepsilon})\,dx\\
    &\quad=-\sum_{i}\int_{-\infty}^{0}\operatorname{sign}(\partial_t\rho_{i,\varepsilon})\,\partial_x\big(f_i'(\rho_{i,\varepsilon})\,\partial_t\rho_{i,\varepsilon}-\varepsilon\,\partial^2_{tx}\rho_{i,\varepsilon}\big)\,dx-\sum_{j}\int_{0}^{\infty}\operatorname{sign}(\partial_t\rho_{j,\varepsilon})\,\partial_x\big(f_j'(\rho_{j,\varepsilon})\,\partial_t\rho_{j,\varepsilon}-\varepsilon\,\partial^2_{tx}\rho_{j,\varepsilon}\big)\,dx\\
    &\quad=-\sum_{i}\operatorname{sign}(\partial_t\rho_{i,\varepsilon}(t,0))\big(f_i'(\rho_{i,\varepsilon}(t,0))\,\partial_t\rho_{i,\varepsilon}(t,0)-\varepsilon\,\partial^2_{tx}\rho_{i,\varepsilon}(t,0)\big)+\sum_{j}\operatorname{sign}(\partial_t\rho_{j,\varepsilon}(t,0))\big(f_j'(\rho_{j,\varepsilon}(t,0))\,\partial_t\rho_{j,\varepsilon}(t,0)-\varepsilon\,\partial^2_{tx}\rho_{j,\varepsilon}(t,0)\big)\\
    &\qquad+\underbrace{\sum_{i}\int_{-\infty}^{0}\partial^2_{tx}\rho_{i,\varepsilon}\big(f_i'(\rho_{i,\varepsilon})\,\partial_t\rho_{i,\varepsilon}-\varepsilon\,\partial^2_{tx}\rho_{i,\varepsilon}\big)\,d\delta_{\{\partial_t\rho_{i,\varepsilon}=0\}}}_{\le 0}+\underbrace{\sum_{j}\int_{0}^{\infty}\partial^2_{tx}\rho_{j,\varepsilon}\big(f_j'(\rho_{j,\varepsilon})\,\partial_t\rho_{j,\varepsilon}-\varepsilon\,\partial^2_{tx}\rho_{j,\varepsilon}\big)\,d\delta_{\{\partial_t\rho_{j,\varepsilon}=0\}}}_{\le 0}\\
    &\quad\le-\sum_{i=1}^{m}\sum_{j=m+1}^{m+n}\big(\operatorname{sign}(\partial_t\rho_{i,\varepsilon}(t,0))-\operatorname{sign}(\partial_t\rho_{j,\varepsilon}(t,0))\big)\,\nabla G_{i,j}\big(\rho_{i,\varepsilon}(t,0),\rho_{j,\varepsilon}(t,0)\big)\cdot\big(\partial_t\rho_{i,\varepsilon}(t,0),\partial_t\rho_{j,\varepsilon}(t,0)\big)\\
    &\qquad+\varepsilon\sum_{i=1}^{m}\sum_{j=m+1}^{m+n}\big(\operatorname{sign}(\partial_t\rho_{i,\varepsilon}(t,0))-\operatorname{sign}(\partial_t\rho_{j,\varepsilon}(t,0))\big)\,\nabla K_{j,i}\big(\rho_{j,\varepsilon}(t,0),\rho_{i,\varepsilon}(t,0)\big)\cdot\big(\partial_t\rho_{j,\varepsilon}(t,0),\partial_t\rho_{i,\varepsilon}(t,0)\big)\le 0,
    \end{aligned}$$

    where $\delta_{\{\partial_t\rho_{i,\varepsilon}=0\}}$ and $\delta_{\{\partial_t\rho_{j,\varepsilon}=0\}}$ are the Dirac deltas concentrated on the sets $\{\partial_t\rho_{i,\varepsilon}=0\}$ and $\{\partial_t\rho_{j,\varepsilon}=0\}$, respectively, and we apply [6, Lemma 2].

    Integrating over (0,t) and using (16), (25) we get

    $$\begin{aligned}
    &\sum_{i=1}^{m}\|\partial_t\rho_{i,\varepsilon}(t,\cdot)\|_{L^1(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\partial_t\rho_{j,\varepsilon}(t,\cdot)\|_{L^1(0,\infty)}
    \le\sum_{i}\|\partial_t\rho_{i,\varepsilon}(0,\cdot)\|_{L^1(-\infty,0)}+\sum_{j}\|\partial_t\rho_{j,\varepsilon}(0,\cdot)\|_{L^1(0,\infty)}\\
    &\quad=\sum_{i}\big\|\varepsilon\,\partial^2_{xx}\rho_{i,0,\varepsilon}-\partial_x f_i(\rho_{i,0,\varepsilon})\big\|_{L^1(-\infty,0)}+\sum_{j}\big\|\varepsilon\,\partial^2_{xx}\rho_{j,0,\varepsilon}-\partial_x f_j(\rho_{j,0,\varepsilon})\big\|_{L^1(0,\infty)}\\
    &\quad\le\sum_{i}\Big(\varepsilon\,\|\partial^2_{xx}\rho_{i,0,\varepsilon}\|_{L^1(-\infty,0)}+\|f_i'(\rho_{i,0,\varepsilon})\|_{L^\infty(-\infty,0)}\,\|\partial_x\rho_{i,0,\varepsilon}\|_{L^1(-\infty,0)}\Big)+\sum_{j}\Big(\varepsilon\,\|\partial^2_{xx}\rho_{j,0,\varepsilon}\|_{L^1(0,\infty)}+\|f_j'(\rho_{j,0,\varepsilon})\|_{L^\infty(0,\infty)}\,\|\partial_x\rho_{j,0,\varepsilon}\|_{L^1(0,\infty)}\Big)\\
    &\quad\le(m+n)\,C+\sum_{i=1}^{m}\|f_i'\|_{L^\infty(0,1)}\,TV(\rho_{i,0})+\sum_{j=m+1}^{m+n}\|f_j'\|_{L^\infty(0,1)}\,TV(\rho_{j,0}),
    \end{aligned}$$

    that is (28).

    Lemma 3.5 (Stability estimate). Let $(\rho_{1,\varepsilon},\dots,\rho_{m+n,\varepsilon})$ and $(\bar\rho_{1,\varepsilon},\dots,\bar\rho_{m+n,\varepsilon})$ be two solutions of (5). The following estimate holds:

    $$\sum_{i=1}^{m}\|\rho_{i,\varepsilon}(t,\cdot)-\bar\rho_{i,\varepsilon}(t,\cdot)\|_{L^1(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\rho_{j,\varepsilon}(t,\cdot)-\bar\rho_{j,\varepsilon}(t,\cdot)\|_{L^1(0,\infty)}\le\sum_{i=1}^{m}\|\rho_{i,0,\varepsilon}-\bar\rho_{i,0,\varepsilon}\|_{L^1(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\rho_{j,0,\varepsilon}-\bar\rho_{j,0,\varepsilon}\|_{L^1(0,\infty)},\qquad t\ge 0.\tag{29}$$

    Proof. From (5) we get

    $$\begin{aligned}
    \partial_t(\rho_{i,\varepsilon}-\bar\rho_{i,\varepsilon})+\partial_x\big(f_i(\rho_{i,\varepsilon})-f_i(\bar\rho_{i,\varepsilon})\big)&=\varepsilon\,\partial^2_{xx}(\rho_{i,\varepsilon}-\bar\rho_{i,\varepsilon}),\\
    \partial_t(\rho_{j,\varepsilon}-\bar\rho_{j,\varepsilon})+\partial_x\big(f_j(\rho_{j,\varepsilon})-f_j(\bar\rho_{j,\varepsilon})\big)&=\varepsilon\,\partial^2_{xx}(\rho_{j,\varepsilon}-\bar\rho_{j,\varepsilon}).
    \end{aligned}$$

    Thanks to (5), (20), and (25), we have that

    $$\begin{aligned}
    &\frac{d}{dt}\Big(\sum_{i=1}^{m}\int_{-\infty}^{0}|\rho_{i,\varepsilon}-\bar\rho_{i,\varepsilon}|\,dx+\sum_{j=m+1}^{m+n}\int_{0}^{\infty}|\rho_{j,\varepsilon}-\bar\rho_{j,\varepsilon}|\,dx\Big)\\
    &\quad=\sum_{i}\int_{-\infty}^{0}\operatorname{sign}(\rho_{i,\varepsilon}-\bar\rho_{i,\varepsilon})\,\partial_t(\rho_{i,\varepsilon}-\bar\rho_{i,\varepsilon})\,dx+\sum_{j}\int_{0}^{\infty}\operatorname{sign}(\rho_{j,\varepsilon}-\bar\rho_{j,\varepsilon})\,\partial_t(\rho_{j,\varepsilon}-\bar\rho_{j,\varepsilon})\,dx\\
    &\quad=-\sum_{i}\int_{-\infty}^{0}\operatorname{sign}(\rho_{i,\varepsilon}-\bar\rho_{i,\varepsilon})\,\partial_x\Big(\big(f_i(\rho_{i,\varepsilon})-f_i(\bar\rho_{i,\varepsilon})\big)-\varepsilon\,\partial_x(\rho_{i,\varepsilon}-\bar\rho_{i,\varepsilon})\Big)\,dx\\
    &\qquad-\sum_{j}\int_{0}^{\infty}\operatorname{sign}(\rho_{j,\varepsilon}-\bar\rho_{j,\varepsilon})\,\partial_x\Big(\big(f_j(\rho_{j,\varepsilon})-f_j(\bar\rho_{j,\varepsilon})\big)-\varepsilon\,\partial_x(\rho_{j,\varepsilon}-\bar\rho_{j,\varepsilon})\Big)\,dx\\
    &\quad=-\sum_{i=1}^{m}\sum_{j=m+1}^{m+n}\big(\operatorname{sign}(\rho_{i,\varepsilon}(t,0)-\bar\rho_{i,\varepsilon}(t,0))-\operatorname{sign}(\rho_{j,\varepsilon}(t,0)-\bar\rho_{j,\varepsilon}(t,0))\big)\Big(G_{i,j}\big(\rho_{i,\varepsilon}(t,0),\rho_{j,\varepsilon}(t,0)\big)-G_{i,j}\big(\bar\rho_{i,\varepsilon}(t,0),\bar\rho_{j,\varepsilon}(t,0)\big)\Big)\\
    &\qquad+\varepsilon\sum_{i=1}^{m}\sum_{j=m+1}^{m+n}\big(\operatorname{sign}(\rho_{i,\varepsilon}(t,0)-\bar\rho_{i,\varepsilon}(t,0))-\operatorname{sign}(\rho_{j,\varepsilon}(t,0)-\bar\rho_{j,\varepsilon}(t,0))\big)\Big(K_{j,i}\big(\rho_{j,\varepsilon}(t,0),\rho_{i,\varepsilon}(t,0)\big)-K_{j,i}\big(\bar\rho_{j,\varepsilon}(t,0),\bar\rho_{i,\varepsilon}(t,0)\big)\Big)\\
    &\qquad+\underbrace{\sum_{i}\int_{-\infty}^{0}\partial_x(\rho_{i,\varepsilon}-\bar\rho_{i,\varepsilon})\Big(\big(f_i(\rho_{i,\varepsilon})-f_i(\bar\rho_{i,\varepsilon})\big)-\varepsilon\,\partial_x(\rho_{i,\varepsilon}-\bar\rho_{i,\varepsilon})\Big)\,d\delta_{\{\rho_{i,\varepsilon}=\bar\rho_{i,\varepsilon}\}}}_{\le 0}\\
    &\qquad+\underbrace{\sum_{j}\int_{0}^{\infty}\partial_x(\rho_{j,\varepsilon}-\bar\rho_{j,\varepsilon})\Big(\big(f_j(\rho_{j,\varepsilon})-f_j(\bar\rho_{j,\varepsilon})\big)-\varepsilon\,\partial_x(\rho_{j,\varepsilon}-\bar\rho_{j,\varepsilon})\Big)\,d\delta_{\{\rho_{j,\varepsilon}=\bar\rho_{j,\varepsilon}\}}}_{\le 0}\ \le 0,
    \end{aligned}$$

    where we use [6, Lemma 2] and denote by $\delta_{\{\rho_{i,\varepsilon}=\bar\rho_{i,\varepsilon}\}}$ and $\delta_{\{\rho_{j,\varepsilon}=\bar\rho_{j,\varepsilon}\}}$ the Dirac deltas concentrated on the sets $\{\rho_{i,\varepsilon}=\bar\rho_{i,\varepsilon}\}$ and $\{\rho_{j,\varepsilon}=\bar\rho_{j,\varepsilon}\}$, respectively.

    Integrating over (0,t) we get (29).
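    The contraction property (29) likewise holds, step by step, for monotone conservative discretizations (by the Crandall-Tartar lemma). The following sketch evolves two initial data with the same single-edge scheme used earlier (again with illustrative parameters) and checks that their discrete $L^1$ distance never increases.

```python
import numpy as np

f = lambda r: r * (1.0 - r)
L, N, eps, a, steps = 2.0, 300, 0.05, 1.0, 1500
x = np.linspace(-L, 0.0, N)
dx = x[1] - x[0]
dt = 0.2 * min(dx / a, dx**2 / (2.0 * eps))

def step(rho):
    """One step of the monotone Rusanov + diffusion scheme, zero-flux ends."""
    F = f(rho)
    Fh = 0.5 * (F[:-1] + F[1:]) - (0.5 * a + eps / dx) * (rho[1:] - rho[:-1])
    Ftot = np.concatenate(([0.0], Fh, [0.0]))
    return rho - dt / dx * (Ftot[1:] - Ftot[:-1])

rho = np.clip(0.5 + 0.4 * np.sin(2 * np.pi * x), 0.0, 1.0)
rho_bar = np.clip(0.5 + 0.4 * np.cos(3 * np.pi * x), 0.0, 1.0)

dist = dx * np.abs(rho - rho_bar).sum()
for _ in range(steps):
    rho, rho_bar = step(rho), step(rho_bar)
    new_dist = dx * np.abs(rho - rho_bar).sum()
    assert new_dist <= dist + 1e-10     # discrete analogue of (29)
    dist = new_dist
print("final L1 distance:", dist)
```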

    The well-posedness of smooth solutions for (5) can be proved following the argument used in [10, Theorem 1.2] to establish the well-posedness of smooth solutions for (4). Indeed, the existence of a linear semigroup of solutions in the linear case (i.e., when $f_\ell \equiv 0$) is shown in [14]. Then the Duhamel formula, estimates similar to those in the previous section, and a fixed-point argument lead to the result.

    The main result of this section is the following.

    Lemma 4.1. Let $(\rho_{1,\varepsilon}, \dots, \rho_{m+n,\varepsilon})$ be the solution of (5). There exist a sequence $\{\varepsilon_k\}_{k\in\mathbb{N}} \subset (0,\infty)$, $\varepsilon_k \to 0$, and $m+n$ maps $\rho_1, \dots, \rho_{m+n}$ such that

    $$\rho_1, \dots, \rho_m \in L^1((0,\infty)\times(-\infty,0)) \cap L^\infty((0,\infty)\times(-\infty,0)), \tag{30}$$
    $$\rho_{m+1}, \dots, \rho_{m+n} \in L^1((0,\infty)\times(0,\infty)) \cap L^\infty((0,\infty)\times(0,\infty)), \tag{31}$$
    $$0 \le \rho_\ell \le 1, \qquad \ell \in \{1, \dots, m+n\}, \tag{32}$$
    $$\rho_{i,\varepsilon_k} \to \rho_i \quad \text{a.e. and in } L^p_{loc}((0,\infty)\times(-\infty,0)), \tag{33}$$
    $$\rho_{j,\varepsilon_k} \to \rho_j \quad \text{a.e. and in } L^p_{loc}((0,\infty)\times(0,\infty)), \tag{34}$$

    for every $1 \le p < \infty$, $i \in \{1,\dots,m\}$, $j \in \{m+1,\dots,m+n\}$. Moreover, we have that

    $$\sum_{i=1}^{m}\|\rho_i(t,\cdot)\|_{L^1(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\rho_j(t,\cdot)\|_{L^1(0,\infty)}\le\sum_{i=1}^{m}\|\rho_{i,0}\|_{L^1(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\rho_{j,0}\|_{L^1(0,\infty)},\tag{35}$$
    $$\sum_{i=1}^{m}\|\rho_i(t,\cdot)\|^2_{L^2(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\rho_j(t,\cdot)\|^2_{L^2(0,\infty)}\le\sum_{i=1}^{m}\|\rho_{i,0}\|^2_{L^2(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\rho_{j,0}\|^2_{L^2(0,\infty)}+2\Big(\sum_{\ell=1}^{m+n}\|\beta_\ell\|_{L^\infty((0,1)^{m+n})}+\sum_{i=1}^{m}\|f_i\|_{L^1(0,1)}\Big)t,\tag{36}$$
    $$\begin{aligned}
    \sum_{i=1}^{m}TV\big(f_i(\rho_i(t,\cdot))\big)+\sum_{j=m+1}^{m+n}TV\big(f_j(\rho_j(t,\cdot))\big)&=\sum_{i=1}^{m}\|\partial_t\rho_i(t,\cdot)\|_{\mathcal{M}(-\infty,0)}+\sum_{j=m+1}^{m+n}\|\partial_t\rho_j(t,\cdot)\|_{\mathcal{M}(0,\infty)}\\
    &\le(m+n)\,C+\sum_{i=1}^{m}\|f_i'\|_{L^\infty(0,1)}\,TV(\rho_{i,0})+\sum_{j=m+1}^{m+n}\|f_j'\|_{L^\infty(0,1)}\,TV(\rho_{j,0}).
    \end{aligned}\tag{37}$$

    Thanks to the genuine nonlinearity of $f_1, \dots, f_{m+n}$ guaranteed by (H.2), we can use Tartar's compensated compactness method [18] to obtain strong convergence of a subsequence of the viscosity approximations. In the statements below, $R$ can stand for $(0,\infty)$ or $(-\infty,0)$.

    Theorem 4.2 (Tartar). Let $\{v_\nu\}_{\nu>0}$ be a family of functions defined on $(0,\infty)\times R$ such that

    $$\|v_\nu\|_{L^\infty((0,T)\times R)} \le M_T, \qquad T, \nu > 0,$$

    and such that the family

    $$\{\partial_t \eta(v_\nu) + \partial_x q(v_\nu)\}_{\nu>0}$$

    is compact in $H^{-1}_{loc}((0,\infty)\times R)$ for every convex $\eta \in C^2(\mathbb{R})$, where $q' = f'\eta'$. Then there exist a sequence $\{\nu_n\}_{n\in\mathbb{N}} \subset (0,\infty)$, $\nu_n \to 0$, and a map $v \in L^\infty((0,T)\times R)$, $T > 0$, such that

    $$v_{\nu_n} \to v \quad \text{a.e. and in } L^p_{loc}((0,\infty)\times R),\ 1 \le p < \infty.$$

    The following compact embedding result of Murat [17] is also useful.

    Theorem 4.3 (Murat). Let $\Omega$ be a bounded open subset of $\mathbb{R}^N$, $N \ge 2$. Suppose the sequence $\{\mathcal{L}_n\}_{n\in\mathbb{N}}$ of distributions is bounded in $W^{-1,\infty}(\Omega)$ and can be decomposed as

    $$\mathcal{L}_n = \mathcal{L}_{1,n} + \mathcal{L}_{2,n},$$

    where $\{\mathcal{L}_{1,n}\}_{n\in\mathbb{N}}$ lies in a compact subset of $H^{-1}_{loc}(\Omega)$ and $\{\mathcal{L}_{2,n}\}_{n\in\mathbb{N}}$ lies in a bounded subset of $L^1_{loc}(\Omega)$. Then $\{\mathcal{L}_n\}_{n\in\mathbb{N}}$ lies in a compact subset of $H^{-1}_{loc}(\Omega)$.

    Proof of Lemma 4.1. Let us fix $i \in \{1,\dots,m\}$ and prove the lemma for the incoming edges; the proof for the outgoing ones is analogous.

    Let $\eta \colon \mathbb{R} \to \mathbb{R}$ be any convex $C^2$ entropy function, and let $q_i \colon \mathbb{R} \to \mathbb{R}$ be the corresponding entropy flux defined by $q_i' = \eta' f_i'$. Multiplying the $i$-th equation in (5) by $\eta'(\rho_{i,\varepsilon})$ and using the chain rule, we get

    $$\partial_t \eta(\rho_{i,\varepsilon}) + \partial_x q_i(\rho_{i,\varepsilon}) = \underbrace{\varepsilon\,\partial^2_{xx}\eta(\rho_{i,\varepsilon})}_{=:\,\mathcal{L}_{1,\varepsilon}}\ \underbrace{-\ \varepsilon\,\eta''(\rho_{i,\varepsilon})\,(\partial_x\rho_{i,\varepsilon})^2}_{=:\,\mathcal{L}_{2,\varepsilon}}. \tag{38}$$

    We claim that

    $$\begin{gathered}
    \mathcal{L}_{1,\varepsilon} \to 0 \ \text{in } H^{-1}((0,T)\times(-\infty,0)),\ T > 0,\ \text{as } \varepsilon \to 0,\\
    \{\mathcal{L}_{2,\varepsilon}\}_{\varepsilon>0} \ \text{is uniformly bounded in } L^1((0,T)\times(-\infty,0)),\ T > 0.
    \end{gathered}\tag{39}$$

    Indeed, (25) and (27) imply

    Indeed, denoting by $C_T$ the right-hand side of (27) with $t = T$ and with $\rho_{\ell,0}$ replaced by $\rho_{\ell,0,\varepsilon}$,

    $$\begin{aligned}
    \|\varepsilon\,\partial_x\eta(\rho_{i,\varepsilon})\|_{L^2((0,T)\times(-\infty,0))}&\le\|\eta'\|_{L^\infty(0,1)}\,\|\varepsilon\,\partial_x\rho_{i,\varepsilon}\|_{L^2((0,T)\times(-\infty,0))}\le\sqrt{\tfrac{\varepsilon}{2}}\,\|\eta'\|_{L^\infty(0,1)}\,C_T^{1/2}\longrightarrow 0,\\
    \|\varepsilon\,\eta''(\rho_{i,\varepsilon})\,(\partial_x\rho_{i,\varepsilon})^2\|_{L^1((0,T)\times(-\infty,0))}&\le\tfrac{1}{2}\,\|\eta''\|_{L^\infty(0,1)}\,C_T.
    \end{aligned}$$

    Due to (16), (39) follows. Therefore, Theorems 4.3 and 4.2 give the existence of a subsequence $\{\rho_{i,\varepsilon_k}\}_{k\in\mathbb{N}}$ and a limit function $\rho_i$ satisfying (30) such that, as $k \to \infty$,

    $$\rho_{i,\varepsilon_k} \to \rho_i \ \text{in } L^p_{loc}((0,\infty)\times(-\infty,0)) \ \text{for any } p \in [1,\infty), \qquad \rho_{i,\varepsilon_k} \to \rho_i \ \text{a.e. in } (0,\infty)\times(-\infty,0), \tag{40}$$

    which guarantees (32) and (33).

    Finally, thanks to Lemmas 3.2, 3.3, and 3.4 we have (35), (36), and (37).

    Proof of Theorem 1.2. The first part of the statement, concerning the convergence of the vanishing viscosity approximations, has been proved in Lemma 4.1.

    Let us fix $i \in \{1,\dots,m\}$ and prove (9) for the incoming edges; the case of the outgoing ones is analogous.

    Thanks to (28) and (33), for all $\varphi \in C^\infty((0,\infty)\times(-\infty,0))$ with compact support we have

    $$\Big|\int_0^\infty\!\!\int_{-\infty}^{0}\rho_i\,\partial_t\varphi\,dx\,dt\Big|=\lim_{k\to\infty}\Big|\int_0^\infty\!\!\int_{-\infty}^{0}\rho_{i,\varepsilon_k}\,\partial_t\varphi\,dx\,dt\Big|=\lim_{k\to\infty}\Big|\int_0^\infty\!\!\int_{-\infty}^{0}\partial_t\rho_{i,\varepsilon_k}\,\varphi\,dx\,dt\Big|\le\|\varphi\|_{L^\infty((0,\infty)\times(-\infty,0))}\Big((m+n)\,C+\sum_{i=1}^{m}\|f_i'\|_{L^\infty(0,1)}\,TV(\rho_{i,0})+\sum_{j=m+1}^{m+n}\|f_j'\|_{L^\infty(0,1)}\,TV(\rho_{j,0})\Big),$$

    therefore

    $$\partial_t\rho_i \in \mathcal{M}((0,\infty)\times(-\infty,0)), \tag{41}$$

    where $\mathcal{M}((0,\infty)\times(-\infty,0))$ is the set of all Radon measures on $(0,\infty)\times(-\infty,0)$. Moreover, from the equations in (1) and (2) we also have

    $$\partial_x f_i(\rho_i) \in \mathcal{M}((0,\infty)\times(-\infty,0)). \tag{42}$$

    Clearly, (41) and (42) give (9), and so the trace at the junction $f_i(\rho_i(t,0-))$ exists for a.e. $t > 0$.

    We now prove that the identity

    $$\sum_{i=1}^{m} f_i(\rho_i(t,0-)) = \sum_{j=m+1}^{m+n} f_j(\rho_j(t,0+)) \tag{43}$$

    holds for a.e. $t > 0$; consequently, the functions $\rho_1, \dots, \rho_{m+n}$ provide a solution of (1), (2), and (3) in the sense of Definition 1.1.

    Let $\varphi \in C^1([0,\infty))$ be a function with compact support such that $\varphi(0) = 0$. Consider a sequence $\{r_\nu\}_{\nu\in\mathbb{N}\setminus\{0\}} \subset C^2([0,\infty))$ of cut-off functions satisfying

    $$0 \le r_\nu(x) \le 1, \qquad r_\nu(0) = 1, \qquad \operatorname{supp}(r_\nu) \subset [0, 1/\nu], \tag{44}$$

    for every $x \ge 0$ and $\nu \ge 1$. Moreover, for every $\nu \ge 1$, we define $\tilde r_\nu \in C^2((-\infty,0])$ by setting $\tilde r_\nu(x) = r_\nu(-x)$ for every $x \le 0$.

    From (5) we have that

    $$\begin{aligned}
    0&=\sum_{i=1}^{m}\int_0^\infty\!\!\int_{-\infty}^{0}\big(\partial_t\rho_{i,\varepsilon_k}+\partial_x f_i(\rho_{i,\varepsilon_k})-\varepsilon_k\,\partial^2_{xx}\rho_{i,\varepsilon_k}\big)\,\varphi(t)\,\tilde r_\nu(x)\,dx\,dt+\sum_{j=m+1}^{m+n}\int_0^\infty\!\!\int_0^\infty\big(\partial_t\rho_{j,\varepsilon_k}+\partial_x f_j(\rho_{j,\varepsilon_k})-\varepsilon_k\,\partial^2_{xx}\rho_{j,\varepsilon_k}\big)\,\varphi(t)\,r_\nu(x)\,dx\,dt\\
    &=-\sum_{i}\int_0^\infty\!\!\int_{-\infty}^{0}\big(\rho_{i,\varepsilon_k}\,\varphi'(t)\,\tilde r_\nu(x)+f_i(\rho_{i,\varepsilon_k})\,\varphi(t)\,\tilde r_\nu'(x)-\varepsilon_k\,\partial_x\rho_{i,\varepsilon_k}\,\varphi(t)\,\tilde r_\nu'(x)\big)\,dx\,dt\\
    &\qquad-\sum_{j}\int_0^\infty\!\!\int_0^\infty\big(\rho_{j,\varepsilon_k}\,\varphi'(t)\,r_\nu(x)+f_j(\rho_{j,\varepsilon_k})\,\varphi(t)\,r_\nu'(x)-\varepsilon_k\,\partial_x\rho_{j,\varepsilon_k}\,\varphi(t)\,r_\nu'(x)\big)\,dx\,dt\\
    &\qquad+\sum_{i}\int_0^\infty\big(f_i(\rho_{i,\varepsilon_k}(t,0))-\varepsilon_k\,\partial_x\rho_{i,\varepsilon_k}(t,0)\big)\,\varphi(t)\,dt-\sum_{j}\int_0^\infty\big(f_j(\rho_{j,\varepsilon_k}(t,0))-\varepsilon_k\,\partial_x\rho_{j,\varepsilon_k}(t,0)\big)\,\varphi(t)\,dt\\
    &=-\sum_{i}\int_0^\infty\!\!\int_{-\infty}^{0}\big(\rho_{i,\varepsilon_k}\,\varphi'(t)\,\tilde r_\nu(x)+f_i(\rho_{i,\varepsilon_k})\,\varphi(t)\,\tilde r_\nu'(x)-\varepsilon_k\,\partial_x\rho_{i,\varepsilon_k}\,\varphi(t)\,\tilde r_\nu'(x)\big)\,dx\,dt\\
    &\qquad-\sum_{j}\int_0^\infty\!\!\int_0^\infty\big(\rho_{j,\varepsilon_k}\,\varphi'(t)\,r_\nu(x)+f_j(\rho_{j,\varepsilon_k})\,\varphi(t)\,r_\nu'(x)-\varepsilon_k\,\partial_x\rho_{j,\varepsilon_k}\,\varphi(t)\,r_\nu'(x)\big)\,dx\,dt,
    \end{aligned}$$

    where the boundary terms cancel because, by the transmission conditions in (5), they equal $\int_0^\infty\big(\sum_i\beta_i-\sum_j\beta_j\big)\,\varphi(t)\,dt$, which vanishes thanks to (6).

    As $k \to \infty$, due to (27), (33), and (34),

    $$0=-\sum_{i=1}^{m}\int_0^\infty\!\!\int_{-\infty}^{0}\big(\rho_i\,\varphi'(t)\,\tilde r_\nu(x)+f_i(\rho_i)\,\varphi(t)\,\tilde r_\nu'(x)\big)\,dx\,dt-\sum_{j=m+1}^{m+n}\int_0^\infty\!\!\int_0^\infty\big(\rho_j\,\varphi'(t)\,r_\nu(x)+f_j(\rho_j)\,\varphi(t)\,r_\nu'(x)\big)\,dx\,dt.$$

    Finally, sending $\nu \to \infty$,

    $$0=-\sum_{i=1}^{m}\int_0^\infty f_i(\rho_i(t,0-))\,\varphi(t)\,dt+\sum_{j=m+1}^{m+n}\int_0^\infty f_j(\rho_j(t,0+))\,\varphi(t)\,dt,$$

    which gives (43) and concludes the proof.



  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)