    The ability of neural networks to learn rich representations—or features—of their input data is commonly observed in state-of-the-art models [1,2] and is often thought to be the reason behind their good practical performance [3, Chap. 1]. Yet, our theoretical understanding of how feature learning arises from simple gradient-based training algorithms remains limited. Much progress (discussed in Section 1.3) has been made recently in understanding the power and limitations of gradient-based learning with neural networks, showing in particular their superiority over fixed-feature methods on some difficult tasks. However, positive results are often obtained for algorithms that differ in substantial ways from plain (stochastic) gradient descent (e.g., the layers are trained separately, or the algorithm makes only one truly non-linear step).

    In this work, we take the algorithm as a given and instead adopt a descriptive approach. Our goal is to improve our understanding of how neural networks behave in the presence of symmetries in the data when trained with plain gradient descent (GD), focusing on two-layer fully-connected ReLU neural networks. To this end, we investigate situations with strong symmetries in the data, the target function, and the initial parameters, and study the properties of the training dynamics and the learned predictor in this context.

    We denote by $d$ the input dimension, by $\rho$ the input data distribution, which we assume to be uniform over the unit sphere $\mathbb S^{d-1}$ of $\mathbb R^d$, and by $\mathcal P_2(\Omega)$ the space of probability measures with finite second moment over a measurable space $\Omega$. We call $\sigma$ the activation function, which we take to be ReLU, that is $\sigma(z)=\max(0,z)$, and $\ell:\mathbb R\times\mathbb R\to\mathbb R$ the loss function, which we assume to be continuous in both arguments and continuously differentiable w.r.t. its second argument; we denote this derivative by $\partial_2\ell$.

    Mean-field limit of two-layer networks. In this work, we consider the infinite-width limit, in the mean-field regime, of the training dynamics of two-layer networks without intercept and with a ReLU activation function. Given a measure $\mu\in\mathcal P_2(\mathbb R\times\mathbb R^d)$, we consider the infinitely wide two-layer network parameterized by $\mu$, defined, for any input $x\in\mathbb R^d$, by

    $$f(\mu;x)=\int_{c\in\mathbb R^{1+d}}\phi(c;x)\,d\mu(c), \tag{1.1}$$

    where, for any $c=(a,b)\in\mathbb R\times\mathbb R^d$, $\phi(c;x)=a\,\sigma(b^\top x)$. Note that width-$m$ two-layer networks with input weights $(b_j)_{j\in[1,m]}\in(\mathbb R^d)^m$ and output weights $(a_j)_{j\in[1,m]}\in\mathbb R^m$ can be recovered by a measure $\mu_m=(1/m)\sum_{j=1}^m\delta_{(m a_j,\,b_j)}$ with $m$ atoms.
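    To make this finite-width correspondence concrete, here is a minimal NumPy sketch (not from the paper; sizes and sampling are arbitrary) showing that the width-$m$ sum over neurons and the integral of $\phi(\cdot;x)$ against the empirical measure $\mu_m$ give the same prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

d, m = 10, 512
a = rng.normal(size=m) / m           # output weights a_j
B = rng.normal(size=(m, d))          # input weights b_j (rows)
x = rng.normal(size=d)

f_sum = np.sum(a * relu(B @ x))            # sum_j a_j sigma(b_j^T x)
atoms_a = m * a                            # atoms (m a_j, b_j) of mu_m, each with mass 1/m
f_int = np.mean(atoms_a * relu(B @ x))     # integral of phi(c; x) against mu_m, Eq (1.1)
assert np.isclose(f_sum, f_int)
```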

    Objective and Wasserstein gradient flow. We consider the problem of minimizing the population loss objective for a given target function $f^*:\mathbb R^d\to\mathbb R$, which we assume to be bounded on the unit sphere, that is

    $$\min_{\mu\in\mathcal P_2(\mathbb R\times\mathbb R^d)}\Big(F(\mu):=\mathbb E_{x\sim\rho}\big[\ell\big(f^*(x),f(\mu;x)\big)\big]\Big). \tag{1.2}$$

    The Fréchet derivative of the objective function $F$ at $\mu$ is given by the function $F'_\mu(c)=\mathbb E_{x\sim\rho}\big[\partial_2\ell\big(f^*(x),f(\mu;x)\big)\,\phi(c;x)\big]$ for any $c=(a,b)\in\mathbb R\times\mathbb R^d$ (for more details, see Appendix B.1). Starting from a given measure $\mu_0\in\mathcal P_2(\mathbb R\times\mathbb R^d)$, we study the Wasserstein gradient flow (GF) of the objective (1.2), which is a path $(\mu_t)_{t\ge0}$ in the space of probability measures satisfying, in the sense of distributions, the partial differential equation (PDE) known as the continuity equation:

    $$\partial_t\mu_t=-\operatorname{div}(v_t\,\mu_t),\qquad v_t(c):=-\nabla F'_{\mu_t}(c). \tag{1.3}$$

    Initialization. We make the following assumption on the initial measure $\mu_0\in\mathcal P_2(\mathbb R\times\mathbb R^d)$: $\mu_0$ decomposes as $\mu_0=\mu_0^1\otimes\mu_0^2$ where $(\mu_0^1,\mu_0^2)\in\mathcal P_2(\mathbb R)\times\mathcal P_2(\mathbb R^d)$. This follows the standard initialization procedure at finite width. Because no direction should a priori be favored, we assume $\mu_0^2$ to have spherical symmetry, i.e., it is invariant under any orthogonal transformation, and we additionally assume that $|a|=\|b\|$ almost surely at initialization. It is shown in [4, Lemma 26] and [5, Section 2.5] that, with this assumption, $\mu_t$ stays supported on the set $\{|a|=\|b\|\}$ for any $t\ge0$.
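    At finite width, these assumptions can be instantiated as in the following sketch (the specific choice of $a_j$ uniform on $\{-1,+1\}$ and $b_j$ uniform on the sphere is the one used later in Section 4):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_atoms(m, d):
    """Independent layers, spherically symmetric input weights, |a_j| = ||b_j||."""
    b = rng.normal(size=(m, d))
    b /= np.linalg.norm(b, axis=1, keepdims=True)   # b_j uniform on the sphere S^{d-1}
    a = rng.choice([-1.0, 1.0], size=m)             # so that |a_j| = ||b_j|| = 1
    return a, b

a, B = init_atoms(m=1024, d=20)
```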

    Comment on the assumptions. The assumption that $\mu_0$ decomposes as a product of two measures is there to stay as close as possible to what is done in practice (independent initialization of the different layers). The assumption that $|a|=\|b\|$ is of a technical nature and, along with the regularity conditions on the loss and the input data distribution $\rho$, ensures that the Wasserstein GF (1.3) is well-defined [5, Lemmas 3.1 and 3.9] when using ReLU as an activation function (which bears technical difficulties because of its non-smoothness). The results of Section 2 hold for other activation functions, which potentially require less restrictive assumptions on $\mu_0$ and $\rho$, but still require $\mu_0$ to decompose as a product of measures. In contrast, the results of Sections 3 and 4 are specific to $\sigma=\mathrm{ReLU}$ and thus require the assumptions above on $\mu_0$ and $\rho$. Since our work focuses mostly on ReLU, we choose to state the results of all sections with the (more restrictive) assumptions on $\mu_0$ and $\rho$ stated above.

    Relationship with finite-width GD. If $\mu_0=(1/m)\sum_{j=1}^m\delta_{(a_j(0),b_j(0))}$ is discrete, the Wasserstein GF (1.3) is exactly continuous-time GD on the parameters of a standard finite-width neural network, and discretization errors (w.r.t. the number of neurons) can be provided [6,7].

    Our main object of study is the gradient flow of the population risk of infinitely wide two-layer ReLU neural networks without intercept. Our motivation to consider this idealistic setting—infinite data and infinite width—is that it allows, under suitable choices for ρ and μ0, the emergence of exact symmetries which are only approximate in the non-asymptotic setting*.

    *In contrast, our focus on GF is only for theoretical convenience and most of our results could be adapted to the case of GD.

    Symmetries, structure, and convergence. In this work, we are interested in the structures learned by the predictor f(μt;) under GF as t grows large. Specifically, we make the following contributions:

    ● In Section 2, we prove that if f is invariant under some orthogonal linear map T, then f(μt;) inherits this invariance under GF (Theorem 2.1).

    ● In Section 3, we study the case when f is an odd function and show that the network converges to the best linear approximator of f at an exponential rate (Theorem 3.2). Linear predictors are optimal over the hypothesis class in that case, in particular because there is no intercept in our model.

    ● In Section 4, we consider the multi-index model where f depends only on the orthogonal projection of its input onto some sub-space H of dimension dH. We prove that the dynamics can be reduced to a PDE in dimension dH. If in addition, f is the Euclidean norm of the projection of the input, we show that the dynamics reduce to a one-dimensional PDE (Theorem 4.3). In the latter case, we were not able to prove theoretically the convergence of the neurons of the first layer towards H, and leave this as an open problem but we provide numerical evidence in favor of this result.

    The code to reproduce the results of the numerical experiments can be found at: https://github.com/karl-hajjar/learning-structure.

    Infinite-width dynamics. It has been shown rigorously that for infinitely wide networks there is a clear distinction between a feature-learning regime and a kernel regime [8,9]. For shallow networks, this difference stems from a different scale (w.r.t. width) of the initialization, where a large initialization leads to the Neural Tangent Kernel (NTK) regime (a.k.a. the "lazy regime"), which is equivalent to a kernel method with random features [10], whereas a small initialization leads to the so-called mean-field (MF) limit, where features are learned from the first layer [8,9]. However, it is unclear in this setting exactly what those features are and what underlying structures are learned by the network. The aim of the present work is to study this phenomenon from a theoretical perspective for infinitely wide networks and to understand the relationship between the ability of networks to learn specific structures and the symmetries of a given task.

    A flurry of works study the dynamics of infinitely wide two-layer neural networks. [5,6,11,12,13] study the gradient flow dynamics of the MF limit and show that they are well-defined in general settings and lead to convergence results (local or global depending on the assumptions). On the other hand, Jacot et al. [10] study the dynamic of the NTK parameterization in the infinite-width limit and show that it amounts to learning a linear predictor on top of random features (fixed kernel), so that there is no feature learning.

    Convergence rates. In the MF limit, convergence rates are in general difficult to obtain in a standard setting. For instance, [5,11] show the convergence of the GF to a global optimum in a general setting, but this does not allow convergence rates to be provided. To illustrate the convergence of the parameterizing measure to a global optimum in the MF limit, Ma et al. [14] prove local convergence (see Section 7) at rate $O(t^{-1})$, where $t$ is the time step, for one-dimensional inputs and a specific choice of target function. At finite width, Daneshmand and Bach [15] also prove convergence of the parameters to a global optimum at rate $O(t^{-1})$, using an algebraic idea which is specific to the ad-hoc structure they consider (inputs in two dimensions and target functions with a finite number of atoms).

    In Section 3, we show convergence of the MF limit at an exponential rate when the target function is odd. In the setting of this section, the training dynamics are degenerate and although input neurons move, the symmetries of the problem imply that the predictor is linear.

    Low-dimensional structure. Studying how neural networks can adapt to hidden low-dimensional structures is a way of approaching theoretically the feature-learning abilities of neural networks. Bach [16] studies the statistical properties of infinitely wide two-layer networks, and shows that when the target function only depends on the projection on a low-dimensional sub-space, these networks circumvent the curse of dimensionality with generalization bounds which only depend on the dimension of the sub-space. In a slightly different context, Chizat and Bach [4] show that for a binary classification task, when there is a low-dimensional sub-space for which the projection of the data has sufficiently large inter-class distance, only the dimension of the sub-space (and not that of the ambient space) appears in the upper bound on the probability of misclassification. Whether or not such a low-dimensional sub-space is actually learned by GD is not addressed in these works.

    Similarly, [17,18] focus on learning functions which have a hidden low-dimensional structure with neural networks. They consider a single step of GD on the input layer weights and show that the approximation / generalization error adapts to the structure of the problem: they provide bounds on the number of data points / parameters needed to achieve negligible error, which depend on the reduced dimension and not the dimension of the ambient space. In a similar context, Mousavi-Hosseini et al. [19] consider (S)GD on the first layer only of a finite-width two-layer network and show that with sufficient L2-regularization and with a standard normal distribution on the input data the first layer weights align with the lower-dimensional sub-space when trained for long enough. They then use this property to then provide statistical results on networks trained with SGD.

    In a setting close to ours but on a classification task with finite-data and at finite-width, Paccolat et al. [20] compare the feature learning regime with the NTK regime in the presence of hidden low-dimensional structure and quantify for each regime the scaling law of the test error w.r.t. the number of training samples, mostly focusing on the case dH=1.

    In a similar setting to that of [16], Abbe et al. [21] study how GF for infinitely wide two-layer networks can learn specific classes of functions which have a hidden low-dimensional structure when the inputs are Rademacher variables. This strong symmetry assumption ensures that the learned predictor shares the same low-dimensional structure at any time step (from the t=0) and this allows them to characterize precisely what classes of target functions can or cannot be learned by GF in this setting. In contrast, we are interested in how infinitely wide networks learn those low-dimensional structures during training, and in the role of symmetries in enabling such a behaviour after initialization.

    Learning representations. An existing line of work [18,22,23,24,25] studies in depth the representations learned by neural networks trained with (S)GD at finite-width from a different perspective focusing on the advantages of feature-learning in terms of performance comparatively to using random features. In contrast, our aim is to describe the representations themselves in relationship with the symmetries of the problem.

    Symmetries. We stress that the line of work on symmetries of neural networks dealing with finding network architectures whose output is invariant (w.r.t. its input or parameters) under some group of transformations (see [26,27,28], and references therein) is entirely different from what we are concerned with in the present work. In contrast, the setting of Mei et al. [6] is much closer to ours, as they study how the invariances of the target function / input data can lead to simplifications in the dynamics of infinitely wide two-layer networks in the mean-field regime, which allows them to prove global convergence results.

    We denote by $\mathcal M_+(\Omega)$ the space of non-negative measures over a measurable space $\Omega$. For any measure $\mu$ and measurable map $T$, $T_\#\mu$ denotes the pushforward measure of $\mu$ by $T$. We denote by $O(p)$ and $\mathrm{id}_{\mathbb R^p}$ respectively the orthogonal group and the identity map of $\mathbb R^p$ for any $p\in\mathbb N$. Finally, $\langle\cdot,\cdot\rangle$ is the Euclidean inner product and $\|\cdot\|$ the corresponding norm.

    In this section, we demonstrate that if the target function $f^*$ is invariant under some orthogonal transformation $T$, then, since the input data distribution is also invariant under $T$, the predictor $f(\mu_t;\cdot)$ is invariant under $T$ as well for any $t\ge0$. This invariance property of the dynamics w.r.t. orthogonal symmetries is exact with an infinite number of neurons but only approximate at finite width. It is noteworthy that the results of this section hold for any activation function $\sigma$ and input data distribution $\rho$ which has the same symmetries as $f^*$, provided that the Wasserstein GF (1.3) is unique. We start with a couple of definitions:

    Definition 2.1 (Function invariance). Let $T$ be a map from $\mathbb R^d$ to $\mathbb R^d$, and $f:\mathbb R^d\to\mathbb R$. Then, $f$ is said to be invariant (resp. anti-invariant) under $T$ if for any $x\in\mathbb R^d$, $f(T(x))=f(x)$ (resp. $f(T(x))=-f(x)$).

    Definition 2.2 (Measure invariance). Let $\Omega\subseteq\mathbb R^d$, $T$ be a measurable map from $\Omega$ to $\Omega$, and $\mu$ be a measure on $\Omega$. Then, $\mu$ is said to be invariant under $T$ if $T_\#\mu=\mu$, or equivalently, if for any continuous and compactly supported $\varphi:\Omega\to\mathbb R$, $\int\varphi(x)\,d\mu(x)=\int\varphi(T(x))\,d\mu(x)$.

    We are now ready to state the two main results of this section.

    Proposition 2.1 (Learning invariance). Let $T\in O(d)$, and assume that $f^*$ is invariant under $T$. Then, for any $t\ge0$, the Wasserstein GF $\mu_t$ of Eq (1.3) is invariant under $\tilde T:(a,b)\in\mathbb R\times\mathbb R^d\mapsto(a,T(b))$, and the corresponding predictor $f(\mu_t;\cdot)$ is invariant under $T$.

    Proposition 2.2 (Learning anti-invariance). Under the same assumptions as in Proposition 2.1, except that we now assume $f^*$ is anti-invariant under $T$, and assuming further that $\partial_2\ell(-y,-\hat y)=-\partial_2\ell(y,\hat y)$ for any $y,\hat y\in\mathbb R$ and that $\mu_0^1$ is symmetric around $0$ (i.e., invariant under $a\in\mathbb R\mapsto-a$), we then have that for any $t\ge0$, the Wasserstein GF $\mu_t$ of Eq (1.3) is invariant under $\tilde T:(a,b)\in\mathbb R\times\mathbb R^d\mapsto(-a,T(b))$, and the corresponding predictor $f(\mu_t;\cdot)$ is anti-invariant under $T$.

    Remark. The results above also hold for networks with intercepts at both layers. The conditions of Proposition 2.2 are satisfied by both the squared loss and the logistic loss (a.k.a. the cross-entropy loss).

    Essentially, those results show that training with GF preserves the orthogonal symmetries of the problem: the invariance of the target function under an orthogonal transformation leads to the same invariance for $\mu_t$ and $f(\mu_t;\cdot)$. The proof, presented in Appendix C, relies crucially on the fact that $T$ is an orthogonal map, which combines well with the structure of $\phi(c;x)$ involving an inner product. The idea is essentially that the orthogonality of $T$ allows us to relate the gradient of $\phi$ (and consequently of $F'_{\mu_t}$) w.r.t. $c$ at $(\tilde T(c);x)$ to the same gradient at $(c;T^{-1}(x))$, and then to use the invariance of $f^*$ and $\rho$ to conclude.

    In the following sections we discuss the particular cases where functions are (anti-)invariant under $-\mathrm{id}_{\mathbb R^d}$ (i.e., even or odd functions) or under some sub-group of $O(d)$.

    We consider here an odd target function, i.e., for any $x\in\mathbb R^d$, $f^*(-x)=-f^*(x)$.

    Linearity of odd predictors. Proposition 2.2 ensures that the predictor $f(\mu_t;\cdot)$ associated with the Wasserstein GF of Eq (1.3) is also odd at any time $t\ge0$, and we can thus write, for any $x$, $f(\mu_t;x)=\frac12\big(f(\mu_t;x)-f(\mu_t;-x)\big)$, which yields

    $$f(\mu_t;x)=\frac12\int_{a,b}a\,\big[\sigma(b^\top x)-\sigma(-b^\top x)\big]\,d\mu_t(a,b)=\frac12\int_{a,b}a\,(b^\top x)\,d\mu_t(a,b),$$

    where the last equality stems from the fact that, for ReLU, $\sigma(z)-\sigma(-z)=z$. Put differently, the predictor is linear: it is the same as replacing $\sigma$ by the map $z\mapsto z/2$, and $f(\mu_t;x)=\langle w(t),x\rangle$, where

    $$w(t):=\frac12\int_{a,b}a\,b\,d\mu_t(a,b)\in\mathbb R^d. \tag{3.1}$$
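    This linearization is easy to check numerically at finite width. The sketch below (not from the paper; sizes are arbitrary) symmetrizes a random set of atoms under $S:(a,b)\mapsto(-a,-b)$ to obtain an odd predictor (as in Lemma 3.1 below) and verifies that it coincides with $x\mapsto\langle w,x\rangle$ for $w$ as in Eq (3.1):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

d, m = 8, 2000
a = rng.normal(size=m)                      # atom values for the output weight
B = rng.normal(size=(m, d))                 # atom values for the input weights

# Symmetrize under S:(a,b) -> (-a,-b); the empirical measure over the 2m atoms
# below parameterizes the odd part of the original predictor.
a_sym = np.concatenate([a, -a])
B_sym = np.vstack([B, -B])

def f(a_atoms, B_atoms, x):
    # f(mu; x) for the empirical measure mu = (1/#atoms) sum_j delta_{(a_j, b_j)}
    return np.mean(a_atoms * relu(B_atoms @ x))

w = 0.5 * np.mean(a_sym[:, None] * B_sym, axis=0)   # w = (1/2) \int a b dmu

x = rng.normal(size=d)
print(f(a_sym, B_sym, x), w @ x)   # the two numbers agree up to float error
```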

    This degeneracy is not surprising as in fact, a linear predictor is the best one can hope for in this setting. Indeed, consider the following assumption and the next lemma:

    Assumption 1 (Squared loss function). The loss function is the squared loss, i.e., $\ell(y,\hat y)=\frac12(y-\hat y)^2$, which satisfies the condition of Proposition 2.2.

    We make this assumption in order to provide an explicit convergence rate in Theorem 3.2 below.

    Lemma 3.1 (Optimality of odd predictors). Let $f$ be a predictor in the hypothesis class $\mathcal F:=\big\{x\mapsto\int a\,\sigma(b^\top x)\,d\mu(a,b)\,;\ \mu\in\mathcal P_2(\mathbb R\times\mathbb R^d)\big\}$. Then, denoting $f_{\mathrm{odd}}(x):=\frac12(f(x)-f(-x))$ (resp. $f_{\mathrm{even}}(x):=\frac12(f(x)+f(-x))$) the odd (resp. even) part of $f$, one has:

    $$\text{(i)}\ f_{\mathrm{odd}}\in\mathcal F,\qquad\text{(ii)}\ L(f):=\mathbb E_{x\sim\rho}\big[(f^*(x)-f(x))^2\big]\ \ge\ \mathbb E_{x\sim\rho}\big[(f^*(x)-f_{\mathrm{odd}}(x))^2\big]=:L(f_{\mathrm{odd}}),\qquad\text{(iii)}\ \text{equality holds if and only if }f\text{ is odd }\rho\text{-almost surely}.$$

    Proof. The result readily follows from the decomposition f=fodd+feven which leads to

    $$L(f)=L(f_{\mathrm{odd}})\ +\ \underbrace{\mathbb E_{x\sim\rho}\big[(f_{\mathrm{even}}(x))^2\big]}_{\ge 0}\ -\ 2\,\underbrace{\mathbb E_{x\sim\rho}\big[(f^*(x)-f_{\mathrm{odd}}(x))\,f_{\mathrm{even}}(x)\big]}_{=0\text{ by symmetry}}.$$

    We then get that $L(f)\ge L(f_{\mathrm{odd}})$, with equality if and only if $\mathbb E_{x\sim\rho}[(f_{\mathrm{even}}(x))^2]=0$, i.e., $f_{\mathrm{even}}(x)=0$ for $\rho$-almost every $x$. Finally, if $\mu\in\mathcal P_2(\mathbb R^{d+1})$, then $\nu:=\frac12(\mu+S_\#\mu)\in\mathcal P_2(\mathbb R^{d+1})$, where $S:(a,b)\in\mathbb R^{d+1}\mapsto(-a,-b)$, and $f(\nu;\cdot)=f_{\mathrm{odd}}(\mu;\cdot)$, which shows $f_{\mathrm{odd}}(\mu;\cdot)\in\mathcal F$.

    Since, as shown above, any odd predictor turns out to be linear because of the symmetries of ReLU, in this context, the best one can expect is thus to learn the best linear predictor.

    Exponential convergence for linear networks. We are thus reduced to studying the dynamics of linear networks (which in our case are infinitely wide), an interesting object of study in its own right (Ji and Telgarsky [29] show a result similar to ours below in the finite-width case, with the logistic loss on a binary classification task). In this case, the Wasserstein GF (1.3) (with ReLU replaced by $z\mapsto z/2$) is defined for more general input distributions $P\in\mathcal P_2(\mathbb R^d)$ (e.g., empirical measures) and target functions $f^*$. The objective in this context is thus to learn:

    $$w^*\in\operatorname*{arg\,min}_{w\in\mathbb R^d}\Big(Q(w):=\frac12\,\mathbb E_{x\sim P}\big[\big(f^*(x)-\langle w,x\rangle\big)^2\big]\Big) \tag{3.2}$$

    with the dynamics of linear infinitely wide two-layer networks described by the Wasserstein GF (1.3) where the activation function $\sigma$ is replaced by $z\mapsto z/2$. Theorem 3.2 below shows exponential convergence to a global minimum of $Q$ as soon as the problem is strongly convex. Note that although in this case both $\phi(\cdot\,;\cdot)$ (see Eq (1.1)) and the predictor in the objective $Q$ are linear w.r.t. the input, only the predictor in $Q$ is linear in the parameters (ordinary least squares).

    Theorem 3.2. Assume that the smallest eigenvalue $\lambda_{\min}$ of $\mathbb E_{x\sim P}[xx^\top]$ is positive. Let $(\mu_t)_{t\ge0}$ be the Wasserstein GF associated to (1.3) with activation function $z\mapsto z/2$ instead of $\sigma=\mathrm{ReLU}$, and call $w(t)=\frac12\int a\,b\,d\mu_t(a,b)\in\mathbb R^d$. Then, there exist $\eta>0$ and $t_0>0$ such that, for any $t\ge t_0$,

    $$Q(w(t))-Q(w^*)\ \le\ e^{-2\eta\lambda_{\min}(t-t_0)}\,\big(Q(w(t_0))-Q(w^*)\big).$$

    Remark. Note that as soon as P has spherical symmetry, the problem becomes strongly convex by Lemma B.3. Note that although F(μt)=Q(w(t)), (w(t))t0 is not a gradient flow for the (strongly) convex objective Q (which would immediately guarantee exponential convergence to the global minimum).

    The proof, provided in Appendix E, proceeds in two steps: first it is shown that $w'(t)=-H(t)\,\nabla Q(w(t))$ for some positive definite matrix $H(t)$ whose smallest eigenvalue is always lower-bounded by a positive quantity; then we prove that this leads to exponential convergence. Figure 1 illustrates that the dynamics of GF on $F$ remain non-linear in that they do not reduce to GF on $Q$ (although the paths are close). To simulate GF on $F$ we use a large (but finite) number of neurons $m=1024$ and a small (but positive) step-size $10^{-2}$ and simply proceed to do GD on the corresponding finite-dimensional objective (see the comment in Section 1.1 on the relationship between the Wasserstein GF and finite-width GD). The dashed curves are the trajectories of $w(t)$ (see Theorem 3.2) for 10 random initializations, which corresponds to GD on $F$, whereas the blue curve is the trajectory of GD on $Q$ starting from $0$. $P=\frac1n\sum_{i=1}^n\delta_{x_i}$ is the empirical distribution over $n=100$ samples, with $x_i\sim\mathcal U([-1,1]^d)$ and $y_i=f^*(x_i)$ drawn i.i.d. from $\mathcal N(\bar y,2)$ where $\bar y\sim\mathcal N(0,1)$.

    Figure 1.  GD path for two of the coordinates: two-layer linear network vs pure linear model.
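    As a rough illustration of this simulation protocol, the NumPy sketch below (not the authors' code; the labels, step counts, and the m-scaled learning rate are illustrative assumptions) trains a finite-width two-layer network with activation $z\mapsto z/2$ by gradient descent and tracks $w(t)=\frac12\int a\,b\,d\mu_t$, alongside plain GD on the least-squares objective $Q$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Problem sizes follow the description of Figure 1; labels are illustrative.
d, n, m, eta, steps = 5, 100, 1024, 1e-2, 2000
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = rng.normal(size=n)   # illustrative labels; the paper's exact data generation differs

# Two-layer "linear" network: f(x) = (1/m) sum_j a_j * (b_j . x) / 2.
a = rng.choice([-1.0, 1.0], size=m)
B = rng.normal(size=(m, d))
B /= np.linalg.norm(B, axis=1, keepdims=True)

def net(a, B, X):
    return (a * (X @ B.T / 2.0)).mean(axis=1)          # predictions, shape (n,)

w_net_path = []
for _ in range(steps):
    r = net(a, B, X) - y                                # residuals
    grad_a = (r[:, None] * (X @ B.T) / 2.0).mean(axis=0) / m
    grad_B = (a[None, :, None] * r[:, None, None] * X[:, None, :] / 2.0).mean(axis=0) / m
    # m-scaled steps (standard mean-field scaling) so the updates track the flow
    a -= m * eta * grad_a
    B -= m * eta * grad_B
    w_net_path.append(0.5 * (a[:, None] * B).mean(axis=0))   # w(t) = (1/2) int a b dmu

# Plain GD on the least-squares objective Q, started from 0, for comparison.
w = np.zeros(d)
for _ in range(steps):
    w -= eta * (X.T @ (X @ w - y)) / n

print(w_net_path[-1], w)   # close, but the two paths are not identical
```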

    Consider a linear sub-space $H$ of dimension $d_H<d$ (potentially much smaller than the ambient dimension), and assume $f^*$ has the following structure: $f^*(x)=f_H(p_H(x))$, where $p_H$ is the orthogonal projection onto $H$ (which we also write $x_H$ for simplicity, and we reserve subscripts for denoting entries of vectors) and $f_H:H\to\mathbb R$ is a given function.

    In this context it is natural to study whether the learned function shares the same structure as $f^*$. As observed in Figure 2, this is not the case in finite time, but it is however reasonable to think that the learned predictor $f(\mu_t;\cdot)$ shares the same structure as $f^*$ as $t\to\infty$, and we give numerical evidence in this direction. On the other hand, we prove rigorously that the structure of the problem allows the dynamics to be reduced to a lower-dimensional PDE. In this section, we consider for simplicity that $\mu_0^1$ is the uniform distribution over $\{-1,+1\}$ and that $\mu_0^2$ is the uniform distribution over $\mathbb S^{d-1}$.

    Figure 2.  $f(\mu_t;u_H+re_1)$ vs $r$ and $t$ for a random $u_H\in\mathbb S^{d_H-1}$, with $d=20$, $d_H=5$.

    Comment on the assumptions for this section. The assumption that $|a|=\|b\|$ on the support of $\mu_0$ is crucial here. It ensures that the Wasserstein GF (1.3) is well-defined and that $\mu_t$ stays supported on the set $\{|a|=\|b\|\}$ for any $t\ge0$, a fact which is used in the proofs. The assumption that $\rho$ is the uniform distribution over the unit sphere bears some importance, but could likely be replaced by other measures with spherical symmetry, provided the dynamics remain well-defined and at the cost of more technical proofs.

    The structure of $f^*$ implies that it is invariant under any $T\in O(d)$ which preserves $H$, i.e., such that its restrictions to $H$ and $H^\perp$ satisfy $T|_H=\mathrm{id}_H$ and $T|_{H^\perp}\in O(d_\perp)$, where $O(d_\perp)$ is the orthogonal group of $H^\perp$, whose dimension is $d_\perp=d-d_H$. By Proposition 2.1, such transformations also leave the predictor $f(\mu_t;\cdot)$ invariant for any $t\ge0$ since $\rho$ is spherically symmetric. Lemma 4.1 below then ensures that $f(\mu_t;x)$ depends on the projection $x_\perp$ onto $H^\perp$ only through its norm, that is $f(\mu_t;x)=\tilde f_t(x_H,\|x_\perp\|)$ for some $\tilde f_t:H\times\mathbb R_+\to\mathbb R$.

    Lemma 4.1 (Invariance by a sub-group of $O(d)$). Let $f:\mathbb R^d\to\mathbb R$ be invariant under any $T\in O(d)$ such that $T|_H=\mathrm{id}_H$ and $T|_{H^\perp}\in O(d_\perp)$. Then, there exists some $\tilde f:H\times\mathbb R_+\to\mathbb R$ such that for any $x\in\mathbb R^d$, $f(x)=\tilde f(x_H,\|x_\perp\|)$.

    Proof. Consider $\tilde f:(x_H,r)\in H\times\mathbb R_+\mapsto f(x_H+re_1)$, where $e_1$ is the first vector of an orthonormal basis of $H^\perp$, and let $x\in\mathbb R^d$. If $x_\perp=0$, the result is obvious. Otherwise, consider an orthogonal linear map $T_x$ such that $T_x|_H=\mathrm{id}_H$ and $T_x$ sends $x_\perp/\|x_\perp\|$ to $e_1$. The invariance of $f$ under $T_x$ implies $f(x)=f(T_x(x))=f(x_H+\|x_\perp\|e_1)=\tilde f(x_H,\|x_\perp\|)$.

    Figure 2 shows that the dependence on $\|x_\perp\|$ cannot be removed in finite time: $f(\mu_t;u_H+re_1)$ does depend on the distance $r\in\mathbb R_+$ to $H$, but this dependence tends to vanish as $t\to\infty$. The plots of Figure 2 are obtained by discretizing the initial measure $\mu_{0,m}=\frac1m\sum_{j=1}^m\delta_{(a_j(0),b_j(0))}$ with $m=1024$ atoms, sampling $a_j(0)\sim\mathcal U(\{-1,+1\})$ and $b_j(0)\sim\mathcal U(\mathbb S^{d-1})$. We perform GD with a finite step-size $\eta$ and a finite number $n=256$ of fresh i.i.d. samples from the data distribution per step, with $f^*(x)=\|x_H\|$, $d=20$ and $d_H=5$.
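    For reference, here is a minimal sketch of this training protocol (not the authors' code; the step size, step count, and the probing at the end are illustrative assumptions, since the exact values are not all given in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

d, d_H, m, n, eta, steps = 20, 5, 1024, 256, 0.1, 1000
a = rng.choice([-1.0, 1.0], size=m)
B = rng.normal(size=(m, d))
B /= np.linalg.norm(B, axis=1, keepdims=True)

relu = lambda z: np.maximum(0.0, z)
f_star = lambda X: np.linalg.norm(X[:, :d_H], axis=1)   # f*(x) = ||x_H||, H = first d_H coords

def sample_sphere(n, d):
    X = rng.normal(size=(n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

for _ in range(steps):
    X = sample_sphere(n, d)                              # fresh i.i.d. samples each step
    pre = X @ B.T                                        # (n, m) pre-activations
    r = (a * relu(pre)).mean(axis=1) - f_star(X)         # residuals f(mu_m; x) - f*(x)
    grad_a = (r[:, None] * relu(pre)).mean(axis=0) / m
    grad_B = ((a * (pre > 0) * r[:, None])[..., None] * X[:, None, :]).mean(axis=0) / m
    a -= m * eta * grad_a                                # m-scaled steps track the MF flow
    B -= m * eta * grad_B

# Probe the dependence on the distance r to H, as in Figure 2
# (coordinate d_H plays the role of e_1 in the orthogonal complement).
u_H = np.zeros(d); u_H[0] = 1.0
for r_probe in [0.0, 0.5, 1.0]:
    x = u_H.copy(); x[d_H] = r_probe
    print(r_probe, (a * relu(B @ x)).mean())
```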

    Dynamics over the sphere $\mathbb S^{d-1}$. Using the positive 1-homogeneity of ReLU, and with the assumptions on $\mu_0$, the dynamics on $\mu_t\in\mathcal P_2(\mathbb R^{d+1})$ can be reduced to dynamics on the space $\mathcal M_+(\mathbb S^{d-1})$ of non-negative measures over $\mathbb S^{d-1}$: only the direction of the neurons matters, and their norm only affects the total mass. From this point of view, neurons with positive and negative output weights behave differently and have separate dynamics. Indeed, consider the pair of measures $(\nu_t^+,\nu_t^-)\in\mathcal M_+(\mathbb S^{d-1})^2$ characterized by the property that for any continuous test function $\varphi:\mathbb S^{d-1}\to\mathbb R$,

    $$\int_u\varphi(u)\,d\nu_t^\pm(u)=\int_{\pm a\ge0,\,b}|a|\,\|b\|\,\varphi\Big(\frac{b}{\|b\|}\Big)\,d\mu_t(a,b), \tag{4.1}$$

    where we have used the superscript $\pm$ to denote either $\nu_t^+$ or $\nu_t^-$, and the right-hand side (the integration domain) is changed accordingly depending on the sign $+$ or $-$. Because ReLU is positively 1-homogeneous, we have $f(\mu_t;x)=\int\sigma(u^\top x)\,d(\nu_t^+-\nu_t^-)(u)$. It is shown in Appendix E.1.1 that $\nu_t^\pm$ satisfies, in the sense of distributions, the equation

    $$\partial_t\nu_t^\pm=-\operatorname{div}\!\big({\pm}\tilde v_t\,\nu_t^\pm\big)\pm 2\,g_t\,\nu_t^\pm, \tag{4.2}$$

    where, for any $u\in\mathbb S^{d-1}$,

    $$g_t(u)=-\int_y\partial_2\ell\big(f^*(y),f(\mu_t;y)\big)\,\sigma(u^\top y)\,d\rho(y),\qquad\tilde v_t(u)=-\int_y\partial_2\ell\big(f^*(y),f(\mu_t;y)\big)\,\sigma'(u^\top y)\,\big[y-(u^\top y)\,u\big]\,d\rho(y). \tag{4.3}$$

    Equation (4.2) can be interpreted as a Wasserstein-Fisher-Rao GF [30] on the sphere since $\tilde v_t(u)=\operatorname{proj}_{\{u\}^\perp}\!\big(\nabla g_t(u)\big)$.

    Closed dynamics over $[0,\pi/2]\times\mathbb S^{d_H-1}$. The dynamics of the pair $(\nu_t^+,\nu_t^-)$ can be further reduced to dynamics over $[0,\pi/2]\times\mathbb S^{d_H-1}$. Indeed, by positive 1-homogeneity of $f(\mu_t;\cdot)$ we may restrict ourselves to inputs $u\in\mathbb S^{d-1}$, and $f(\mu_t;u)$ depends only on $u_H$ and $\|u_\perp\|$. However, because $\|u_H\|^2+\|u_\perp\|^2=1$, this dependence translates into a dependence on the direction $u_H/\|u_H\|$ of the projection onto $H$ and the norm $\|u_H\|$. The former is an element of $\mathbb S^{d_H-1}$ while the latter is given by the angle $\theta$ between $u$ and $H$, that is $\theta:=\arccos(u^\top u_H/\|u_H\|)=\arccos(\|u_H\|)$. This simplification leads to the following lemma:

    Lemma 4.2. Define the measures $\tau_t^+,\tau_t^-$ by $\tau_t^\pm=P_\#\nu_t^\pm\in\mathcal M_+([0,\pi/2]\times\mathbb S^{d_H-1})$, via $P:u\in\mathbb S^{d-1}\setminus H^\perp\mapsto(\arccos(\|u_H\|),\,u_H/\|u_H\|)\in[0,\pi/2]\times\mathbb S^{d_H-1}$. Then, the measures $\tau_t^+,\tau_t^-$ satisfy the equation

    $$\partial_t\tau_t^\pm=-\operatorname{div}\!\big({\pm}V_t\,\tau_t^\pm\big)\pm 2\,G_t\,\tau_t^\pm, \tag{4.4}$$

    where Gt:[0,π/2]×SdH1R, and Vt:[0,π/2]×SdH1RdH+1 are functions depending only on (τ+t,τt), and furthermore, f(μt;) can be expressed solely using τ+t,τt (exact formulas are provided in Appendix E.1.2).

    Abbe et al. [21] show a similar result with lower-dimensional dynamics in the context of infinitely wide two-layer networks when the input data have i.i.d. coordinates distributed uniformly over $\{-1,+1\}$ (i.e., Rademacher variables), except that, thanks to their choice of input data distribution, they do not have the added dimension due to the angle $\theta$ as we do.

    Lemma 4.2 above illustrates how the GF dynamics of infinitely wide two-layer networks adapt to the lower-dimensional structure of the problem: the learned predictor and the dynamics can be described only in terms of the angle $\theta$ between the input neurons and $H$ and of their projection onto the unit sphere of $H$.

    Since the predictors we consider are positively homogeneous, one cannot hope to do better than learn a positively homogeneous function. A natural choice of such a target function is the Euclidean norm. With the additional structure that the target only depends on the projection onto $H$, this leads to considering $f^*(x)=\|x_H\|$, which has additional symmetries compared to the general case presented above: it is invariant under any linear map $T$ such that $T|_H\in O(d_H)$ and $T|_{H^\perp}\in O(d_\perp)$. By Proposition 2.1 those symmetries are shared by $\mu_t$ and $f(\mu_t;\cdot)$, and we show that in this case the dynamics reduce to one-dimensional dynamics over the angle $\theta$ between input neurons and $H$.

    We prove a general disintegration result for the uniform measure on the sphere in the Appendix (see Lemma B.4), which, along with some spherical harmonics analysis, allows us to describe the reduced dynamics and characterize the objective that they optimize. This leads to the following result:

    Theorem 4.3 (1d dynamics over the angle $\theta$). Assume that $f^*(x)=\|x_H\|$, and define the measures $(\tau_t^+,\tau_t^-)\in\mathcal M_+([0,\pi/2])^2$ from $(\nu_t^+,\nu_t^-)$ via $P:u\in\mathbb S^{d-1}\mapsto\arccos(\|u_H\|)\in[0,\pi/2]$: $\tau_t^\pm=P_\#\nu_t^\pm$. Then, the pair $(\tau_t^+,\tau_t^-)$ follows the Wasserstein-Fisher-Rao GF for the objective $A(\tau^+,\tau^-):=\mathbb E\big[\ell\big(f(\tau^+,\tau^-;x),f^*(x)\big)\big]$ over the space $\mathcal M_+([0,\pi/2])\times\mathcal M_+([0,\pi/2])$, where $f(\tau^+,\tau^-;x)$ is the expression (with a slight overloading of notation) of $f(\mu;x)$ as a function of $(\tau^+,\tau^-)$ (see Appendix E.2 for more details):

    $$d\tau_0^\pm(\theta)=\frac{1}{B\!\left(\frac{d_H}{2},\frac{d_\perp}{2}\right)}\cos(\theta)^{d_H-1}\sin(\theta)^{d_\perp-1}\,d\theta,\qquad\partial_t\tau_t^\pm=-\operatorname{div}\!\big({\pm}V_t\,\tau_t^\pm\big)\pm 2\,G_t\,\tau_t^\pm, \tag{4.5}$$

    where B is the Beta function, and

    $$G_t(\theta)=-\int_y\partial_2\ell\big(f^*(y),f(\mu_t;y)\big)\,\sigma\big(\cos(\theta)\,y_{H,1}+\sin(\theta)\,y_{\perp,1}\big)\,d\rho(y),\qquad V_t(\theta)=G_t'(\theta).$$

    Additionally, f(μt;), Gt, and Vt only depend on the pair (τ+t,τt), and for any t0, it holds that F(μt)=A(τ+t,τt).

    Remark. The result should still hold for general ρ which are spherically symmetric as long as the Wasserstein GF (1.3) is well-defined but the proof is more technical. In addition, this result shows that even with more structure than in Lemma 4.2, the dynamics of infinitely wide two-layer networks are still able to adapt to this setting: these dynamics, as well as the learned predictor, can be fully characterized solely by the one-dimensional dynamics over the angle θ between input neurons and H. This is noteworthy since this angle determines the alignment of the neurons with H, and thus measures how much the representations learned by the network have adapted to the structure of the problem. Furthermore, as discussed below, this reduction with exact formulas enables efficient numerical simulation in one dimension.

    Daneshmand and Bach [15] prove the global convergence of a reduced one-dimensional dynamics in a context similar to ours but their original problem is two-dimensional and with a choice of activation function that leads to specific algebraic properties.

    Expression of $f(\mu_t;\cdot)$. Because of the symmetries of $f(\mu_t;\cdot)$, which result from those of $f^*$, $f(\mu_t;x)$ depends only on $\|x_H\|$ and $\|x\|$. What is more, since $f(\mu_t;\cdot)$ is positively 1-homogeneous (because ReLU is), it actually holds that $f(\mu_t;x)=\|x\|\,\tilde f_t(\varphi_x)$, where $\varphi_x=\arccos(\|x_H\|/\|x\|)$ is the angle between $x$ and $H$, and $\tilde f_t(\varphi):=\int_\theta\tilde\phi(\theta;\varphi)\,d(\tau_t^+-\tau_t^-)(\theta)$, with $\tilde\phi$ depending only on $\sigma$ and fixed probability measures (see Appendix E.1.2 for an exact formula).

    Learning the low-dimensional structure as $t\to\infty$. Although, as shown in Figure 2, $f(\mu_t;\cdot)$ does not learn the low-dimensional structure in finite time, it is reasonable to expect that, as $t\to\infty$, the measures $\tau_t^\pm$ put mass only on $\theta=0$, indicating that the only part of the space the predictor is concerned with for large $t$ is the sub-space $H$. Since we assume here that the target function $f^*$ is non-negative, the most natural limits for $\tau_t^+$ and $\tau_t^-$ are $\tau_t^+\to\alpha\,\delta_0$ with $\alpha>0$, and $\tau_t^-\to0$ (in the sense that $\tau_t^-([0,\pi/2])\to0$) as $t\to\infty$, because then the "negative" output weights do not participate in the prediction in the large-$t$ limit.

    The global convergence result of Chizat and Bach [5,11] still holds, but it is not quantitative and moreover does not guarantee that the limit is the one described above. We leave the proof of this result as an open problem, but we provide numerical evidence supporting this conjecture. Indeed, we take advantage of the one-dimensional reduction from Theorem 4.3 and numerically simulate the resulting dynamics by parameterizing $\tau_t^\pm$ via weight and position [31], as $\tau_{m,t}^\pm=(1/m)\sum_{j=1}^m c_j^\pm(t)\,\delta_{\theta_j^\pm(t)}$, and simulating the corresponding dynamics for $c_j^\pm(t)$ and $\theta_j^\pm(t)$. The corresponding results are depicted in Figure 3, which is again obtained by discretizing the initial measures $\tau_0^+,\tau_0^-$ and performing GD with a finite step-size (see more details in the Appendix). Figure 3(a), (b) show that the mass of $\tau_t^+$ tends to concentrate around $0$ while that of $\tau_t^-$ tends to concentrate around $\pi/2$, indicating that $\tau_t^+$ adapts to the part of the space relevant to learning $f^*$ while $\tau_t^-$ puts mass close to the orthogonal complement of that space.

    Figure 3.  Angle distributions τ+t/τt and position / mass distances with m=1024, d=30 and dH=5. (a) (resp. (b)) τ+t (resp. τt) as a histogram for different t. (c) distances (in log-log scales) of the mass and positions of positive (blue) / negative (orange) particles to the intuitively expected limits: the distance in position is the Wasserstein-2 distance of the normalized (probability) measures ˜τ±t to the corresponding Dirac measures while the distance in mass is the absolute error to the expected mass as t.

    Total mass of particles at convergence. If $\tau_\infty^-=0$ and $\tau_\infty^+=\alpha\,\delta_0$ as described above, we have $f(\mu_\infty;x)=\alpha\,\|x\|\,\tilde\phi(0;\varphi_x)=\alpha\,\frac{\Gamma(d_H/2)}{2\sqrt{\pi}\,\Gamma((d_H+1)/2)}\,\|x\|\cos(\varphi_x)=\alpha\,\frac{\Gamma(d_H/2)}{2\sqrt{\pi}\,\Gamma((d_H+1)/2)}\,\|x_H\|$. To recover exactly $f^*$, it must hold that $\alpha=\tau_\infty^+([0,\pi/2])=\frac{2\sqrt{\pi}\,\Gamma((d_H+1)/2)}{\Gamma(d_H/2)}$. Defining the normalized probability measure $\tilde\tau_t^\pm=\tau_t^\pm/\tau_t^\pm([0,\pi/2])$, we thus expect $\tilde\tau_t^+$ to grow close to $\delta_0$ and $\tilde\tau_t^-$ to $\delta_{\pi/2}$. In terms of total mass, we expect that $\tau_t^+([0,\pi/2])$ gets closer to $\alpha$ while $\tau_t^-([0,\pi/2])$ gets closer to $0$.
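    The constant above can be sanity-checked numerically. The sketch below (not from the paper) reads $\tilde\phi(0;\cdot)$ as the spherical average of the ReLU over $\mathbb S^{d_H-1}$, which is an assumption of ours, and compares the closed-form value of the constant (hence of $\alpha$) with a Monte-Carlo estimate:

```python
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(0)

d_H = 5
# Closed-form constant appearing above: Gamma(d_H/2) / (2 sqrt(pi) Gamma((d_H+1)/2)),
# interpreted here as E_{u ~ Unif(S^{d_H - 1})}[relu(u_1)].
const = gamma(d_H / 2) / (2 * np.sqrt(np.pi) * gamma((d_H + 1) / 2))
alpha = 1.0 / const      # total mass needed to recover f*(x) = ||x_H||

# Monte-Carlo check of the constant.
U = rng.normal(size=(1_000_000, d_H))
U /= np.linalg.norm(U, axis=1, keepdims=True)
print(const, np.maximum(0.0, U[:, 0]).mean())   # the two values should be close
print("alpha =", alpha)
```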

    The numerical behaviour depicted in Figure 3(c) seems to follow our intuitive description, at least until a critical time $t^*$ in the numerical simulation, which corresponds to the first time $t$ where $\tau_t^+([0,\pi/2])>\alpha$. While the total mass of $\tau_t^\pm$ (dashed lines) seems to approach its limit rapidly before $t^*$, it slowly moves further away from it for $t\ge t^*$. On the other hand, while the angles only change slowly before $t^*$, they start converging fast towards the corresponding Dirac measures after $t^*$. It is unclear whether this slight difference in behaviour (around the critical time $t^*$) between what we intuitively expected and the numerical simulation is an artefact of the finite width and finite step size, or if it actually corresponds to some phenomenon present in the limiting model. For more details concerning the numerical experiments, see the Appendix. Note that there is a priori not a unique global optimum: $\tau_\infty^+$ and $\tau_\infty^-$ (if they exist) can compensate on parts of the space $[0,\pi/2]$ and lead to the same optimal predictor for different choices of measures. Our numerical experiments suggest that the GF dynamics select a "simple" solution where $\tau_\infty^+$ is concentrated on $\{\theta=0\}$ and $\tau_\infty^-$ vanishes (puts $0$ mass everywhere), which is a form of implicit bias.

    We have explored the symmetries of infinitely wide two-layer ReLU networks and we have seen that: (i) they adapt to the orthogonal symmetries of the problem, (ii) they reduce to the dynamics of a linear network in the case of an odd target function and lead to exponential convergence, and (iii) when the target function depends only on the orthogonal projection onto a lower-dimensional sub-space $H$, the dynamics can be reduced to a lower-dimensional PDE. In particular, when $f^*$ is the Euclidean norm of the projection, this PDE is over a one-dimensional space corresponding to the angle $\theta$ between the particles and $H$. We have presented numerical experiments indicating that the positive particles converge to the sub-space $H$ in this case, and we leave the proof of this result as an open problem. We also leave as an open question whether the results of Section 2 extend to deeper networks.

    Karl Hajjar received full support from the Agence Nationale de la Recherche (ANR), reference ANR-19-CHIA-0021-01 "BiSCottE". The authors thank Christophe Giraud for useful discussions and comments.

    The authors declare there is no conflict of interest.

    We introduce in this section additional notation that we use throughout the Appendix.

    Residual: we call $R_t(y):=-\partial_2\ell\big(f^*(y),f(\mu_t;y)\big)$ the "residual", which is equal to the difference $f^*(y)-f(\mu_t;y)$ when $\ell$ is the squared loss.

    Identity matrix: we denote by Ip the identity matrix in Rp×p for any pN.

    Indicator functions: we denote by 1A the indicator of a set A, that is 1A(z)=1zA, and 1A(z)=0 otherwise.

    Total variation: for any measure ν, we denote by |ν| its total variation, which should cause no confusion with the absolute value given the context.

    Beta / Gamma function and distribution: for $\alpha,\beta>0$, we denote by $B(\alpha,\beta)$ the Beta function, equal to $\Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ where $\Gamma$ is the Gamma function, and by $\mathrm{Beta}(\alpha,\beta)$ the beta law with density $u^{\alpha-1}(1-u)^{\beta-1}/B(\alpha,\beta)$ on $[0,1]$.

    Gaussian / spherical measures: we call ρp the standard Gaussian measure in Rp (corresponding to N(0,Ip)) for any pN.

    Whenever τM+(Ω) has finite and non-zero total variation, we denote by ˜τP2(Ω) its normalized counterpart (which is a probability measure), that is ˜τ=τ/τ(Ω)=τ/|τ|.

    For any pN, we call ωp the Lebesgue (spherical) measure over the unit sphere Sp1 of Rp, that is the measure such that ˜ωp is the uniform measure on Sp1. We then denote by |Sp1| the surface area of Sp1, that is |Sp1|:=|ωp|=ωp(Sp1)=2πp/2/Γ(p/2).

    Smooth functions: we denote by C(Ω) (resp. C1c(Ω)) the set of continuous (resp. continuously differentiable and compactly supported) functions from a set Ω to R.

    In this section, we list a number of lemmas related to symmetries of measures and functions which will prove helpful in the proofs presented in the Appendix.

    Lemma A.1 (Invariance under invertible maps). Let μ be a measure invariant under some measurable and invertible map T. Then, assuming T1 is also measurable, one has that μ is also invariant under T1.

    Remark. A similar result holds for a function f invariant under an invertible map.

    Proof. Because μ is invariant under T, we have for any measurable set A, μ(A)=μ(T1(A)). Since T1 is assumed to be measurable, for any measurable set A, T(A) is also measurable (T(A)=(T1)1(A)) and thus μ(T(A))=μ(T1(T(A))=μ(A) which shows μ is invariant under T1.

    Lemma A.2 (Invariance of the density). Let ν be a measure with density p w.r.t. some measure μ, and assume both ν and μ are σ-finite and invariant under some measurable and invertible map T, whose inverse T1 is also measurable. Then p is also invariant under T μ-almost everywhere, i.e., , p(T(x))=p(x) for μ-almost every x.

    Proof. For any measurable φ (w.r.t. μ, and thus w.r.t. ν as well), φT1 is also measurable, and we have, on the one hand

    $$\int\varphi\circ T^{-1}\,d\nu=\int(\varphi\circ T^{-1})\,p\,d\mu=\int\varphi\,(p\circ T)\,d\mu,$$

    and on the other hand

    $$\int\varphi\circ T^{-1}\,d\nu=\int\varphi\,d\nu=\int\varphi\,p\,d\mu,$$

    which shows that $\int\varphi\,(p\circ T)\,d\mu=\int\varphi\,p\,d\mu$, and thus that $p\circ T=p$ $\mu$-almost everywhere.

    Lemma A.3 (Projected variance with spherical symmetry). Let ζ be a spherically symmetric measure on Rp (i.e., , such that for any orthogonal linear map TO(p), T#ζ=ζ), with finite second moment. Then we have the following matrix identity:

    $$\int_z zz^\top\,d\zeta(z)=v_\zeta\,I_p,\qquad v_\zeta:=\int_z(z_1)^2\,d\zeta(z)=\frac1p\int_z\|z\|^2\,d\zeta(z).$$

    Proof. The $(i,j)$-th entry of the matrix on the left-hand side is $\int_z z_iz_j\,d\zeta(z)$, and it is readily seen that the terms outside the diagonal are $0$. Indeed, let $(i,j)\in[1,p]^2$ with $i\ne j$, and consider the orthogonal map $T_j:z\in\mathbb R^p\mapsto(z_1,\dots,z_{j-1},-z_j,z_{j+1},\dots,z_p)$. The spherical symmetry of $\zeta$ implies that it is invariant under $T_j$, which yields $\int_z z_iz_j\,d\zeta(z)=-\int_z z_iz_j\,d\zeta(z)$, thereby showing that the latter is $0$. To see that the diagonal terms are all equal, it suffices to consider the orthogonal map $S_i$ which swaps the 1st and $i$-th coordinates of a vector $z$. The invariance of $\zeta$ under $S_i$ yields $\int_z(z_1)^2\,d\zeta(z)=\int_z(z_i)^2\,d\zeta(z)$, which concludes the proof.
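    A quick numerical check of the identity (not from the paper), using the uniform measure on the sphere, for which $v_\zeta=1/p$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte-Carlo check of Lemma A.3 for a spherically symmetric measure
# (uniform distribution on S^{p-1}, so v_zeta = 1/p).
p, n = 4, 200_000
Z = rng.normal(size=(n, p))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

second_moment = Z.T @ Z / n          # estimates the matrix \int z z^T d zeta(z)
print(np.round(second_moment, 3))    # approximately (1/p) * I_p
print("1/p =", 1 / p)
```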

    Consider $u\in\mathbb S^{d-1}$. $u$ is determined by: (i) its angle $\theta:=\arccos(\|u_H\|)\in[0,\pi/2]$ with $H$ (i.e., its angle with its projection $u_H$ onto $H$), (ii) the direction $z_H=u_H/\|u_H\|\in\mathbb S^{d_H-1}$ of its projection $u_H$ onto $H$, and finally (iii) the direction $z_\perp=u_\perp/\|u_\perp\|\in\mathbb S^{d_\perp-1}$ of its projection $u_\perp$ onto $H^\perp$. Since $\|u_H\|^2+\|u_\perp\|^2=1$, the angle $\theta$ gives both the norms of the projections onto $H$ and $H^\perp$: $\|u_H\|=\cos(\theta)$ and $\|u_\perp\|=\sin(\theta)$.

    When $u$ ranges over the unit sphere $\mathbb S^{d-1}$, the angle $\theta$ and the directions $z_H,z_\perp$ range over $[0,\pi/2]$, $\mathbb S^{d_H-1}$, and $\mathbb S^{d_\perp-1}$ respectively. We wish to understand what measures we obtain on these three sets when $u$ is distributed on the sphere according to the Lebesgue measure $\omega_d$. We show below that, after the change of coordinates described above (from $u\in\mathbb S^{d-1}$ to $(\theta,z_H,z_\perp)\in[0,\pi/2]\times\mathbb S^{d_H-1}\times\mathbb S^{d_\perp-1}$), the corresponding measures over $\mathbb S^{d_H-1}$ and $\mathbb S^{d_\perp-1}$ are uniform measures and the measure over $\theta$ is given by a push-forward of a Beta distribution, as defined below:

    Definition A.1 (Distribution γ of the angle θ). We define the measure γ on [0,π/2] with the following density w.r.t. the Lebesgue measure on [0,π/2]:

    $$d\gamma(\theta):=\cos(\theta)^{d_H-1}\sin(\theta)^{d_\perp-1}\,d\theta.$$

    Remark. The normalized measure $\tilde\gamma$ is in fact simply the push-forward of the $\mathrm{Beta}(d_H/2,d_\perp/2)$ distribution by $x\mapsto\arccos(\sqrt{x})$. Note that the total variation of $\gamma$ is $|\gamma|=\gamma([0,\pi/2])=\frac12 B\!\left(\frac{d_H}{2},\frac{d_\perp}{2}\right)$, and the corresponding normalized (probability) measure is $d\tilde\gamma(\theta)=d\gamma(\theta)/|\gamma|=\frac{2}{B\left(\frac{d_H}{2},\frac{d_\perp}{2}\right)}\cos(\theta)^{d_H-1}\sin(\theta)^{d_\perp-1}\,d\theta$.
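    This remark can be verified empirically (a sketch, not from the paper): for $u$ uniform on $\mathbb S^{d-1}$, $\theta=\arccos(\|u_H\|)$ follows $\tilde\gamma$, i.e., $\|u_H\|^2$ follows $\mathrm{Beta}(d_H/2,d_\perp/2)$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

d, d_H = 30, 5
U = rng.normal(size=(100_000, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
sq_norm_H = (U[:, :d_H] ** 2).sum(axis=1)       # ||u_H||^2, with H = first d_H coordinates

# Kolmogorov-Smirnov comparison with Beta(d_H/2, d_perp/2)
ks = stats.kstest(sq_norm_H, stats.beta(d_H / 2, (d - d_H) / 2).cdf)
print(ks.statistic)                              # small KS statistic: distributions match
```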

    We now state the disintegration theorem and give its proof:

    Theorem A.4 (Disintegration of the Lebesgue measure on the sphere). Let $\omega_d$ denote the Lebesgue measure on the sphere of $\mathbb R^d$, and let $\gamma$ be the measure of Definition A.1. Then, one has

    $$\omega_d=\Phi_\#\big(\gamma\otimes\omega_{d_H}\otimes\omega_{d_\perp}\big),$$

    where

    $$\Phi:[0,\pi/2]\times\mathbb S^{d_H-1}\times\mathbb S^{d_\perp-1}\to\mathbb S^{d-1},\qquad(\theta,z_H,z_\perp)\mapsto\cos(\theta)\,z_H+\sin(\theta)\,z_\perp.$$

    Proof. Denote by $\tilde\omega_d$ the uniform measure on the sphere, by $|\mathbb S^{d-1}|:=2\pi^{d/2}/\Gamma(d/2)$ the surface area of the sphere in dimension $d$, and by $\rho_p$ the standard Gaussian distribution in $\mathbb R^p$ for any $p$. Using the well-known fact that $\tilde\omega_d=\Pi_\#\rho_d$ with $\Pi:x\in\mathbb R^d\setminus\{0\}\mapsto x/\|x\|\in\mathbb S^{d-1}$, we have, for any measurable test function $\varphi:\mathbb S^{d-1}\to\mathbb R$,

    $$\begin{aligned}\int\varphi\,d\omega_d&=|\mathbb S^{d-1}|\int\varphi\,d\tilde\omega_d=|\mathbb S^{d-1}|\int_x\varphi\Big(\frac{x}{\|x\|}\Big)\,d\rho_d(x)=|\mathbb S^{d-1}|\int_{x_H,x_\perp}\varphi\Big(\frac{x_H+x_\perp}{\|x_H+x_\perp\|}\Big)\,d\rho_{d_H}(x_H)\,d\rho_{d_\perp}(x_\perp)\\&=C_d\int\varphi\Big(\frac{r_Hz_H+r_\perp z_\perp}{\|r_Hz_H+r_\perp z_\perp\|}\Big)\,r_H^{d_H-1}e^{-r_H^2/2}\,r_\perp^{d_\perp-1}e^{-r_\perp^2/2}\,dr_H\,dr_\perp\,d\omega_{d_H}(z_H)\,d\omega_{d_\perp}(z_\perp)\\&=C_d\int_{z_H,z_\perp}\int_{r_H,r_\perp}\varphi\Big(\frac{r_Hz_H+r_\perp z_\perp}{\sqrt{r_H^2+r_\perp^2}}\Big)\,r_H^{d_H-1}\,r_\perp^{d_\perp-1}\,e^{-(r_H^2+r_\perp^2)/2}\,dr_H\,dr_\perp\,d\omega_{d_H}(z_H)\,d\omega_{d_\perp}(z_\perp),\end{aligned}$$

    with

    $$C_d:=\frac{|\mathbb S^{d-1}|}{(2\pi)^{d_H/2}(2\pi)^{d_\perp/2}}=\frac{|\mathbb S^{d-1}|}{(2\pi)^{d/2}}=\frac{2\pi^{d/2}}{2^{d/2}\pi^{d/2}\,\Gamma(d/2)}=\frac{1}{2^{(d-2)/2}\,\Gamma(d/2)}.$$

    Doing the polar change of variables $(r_H,r_\perp)\in\mathbb R_+^2\mapsto(R,\theta)\in\mathbb R_+\times[0,\pi/2]$, we get:

    $$\int\varphi\,d\omega_d=C_d'\int_{z_H,z_\perp}\int_\theta\varphi\big(\cos(\theta)z_H+\sin(\theta)z_\perp\big)\cos(\theta)^{d_H-1}\sin(\theta)^{d_\perp-1}\,d\theta\,d\omega_{d_H}(z_H)\,d\omega_{d_\perp}(z_\perp),$$

    where

    $$C_d':=C_d\int_0^{+\infty}R^{d-2}e^{-R^2/2}\,R\,dR=C_d\int_0^{+\infty}R^{d-1}e^{-R^2/2}\,dR=C_d\times 2^{(d-2)/2}\,\Gamma(d/2)=1,$$

    which concludes the proof.

    Remark. A similar disintegration result holds for the uniform measure $\tilde\omega_d$ on the sphere. The corresponding measures, which are then pushed forward by the same $\Phi$, are the normalized counterparts of the measures in the theorem above: $\tilde\omega_d=\Phi_\#(\tilde\gamma\otimes\tilde\omega_{d_H}\otimes\tilde\omega_{d_\perp})$. This readily comes from noting that a simple calculation yields $|\omega_d|=|\omega_{d_H}|\,|\omega_{d_\perp}|\,|\gamma|$.

    Given a functional F:P2(Rp)R, its first variation or Fréchet derivative at μP2(Rp) is defined as a measurable function, denoted δFδμ(μ):RpR, such that, for any νP2(Rp) for which μ+tνP2(Rp) in a neighborhood (in t) of t=0,

    $$\frac{d}{dt}F(\mu+t\nu)\Big|_{t=0}=\int_z\frac{\delta F}{\delta\mu}(\mu)[z]\,d\nu(z).$$

    See Santambrogio [32,Definition 7.12], or [33,p.29] for more details on the first variation.

    In the case of the functional defined in Eq (1.2) corresponding to the population loss objective, using the differentiability of the loss w.r.t. its second argument, one readily has that

    $$F'_\mu(c):=\frac{\delta F}{\delta\mu}(\mu)[c]=\int_x\partial_2\ell\big(f^*(x),f(\mu;x)\big)\,\phi(c;x)\,d\rho(x),$$

    since

    $$\frac{d}{dt}\,\ell\big(f^*(x),f(\mu;x)+t\,f(\nu;x)\big)=\partial_2\ell\big(f^*(x),f(\mu;x)+t\,f(\nu;x)\big)\int_c\phi(c;x)\,d\nu(c).$$

    A Wasserstein gradient flow for the objective F defined in Eq (1.2) is a path (μt)t0 in the space of probability measures P2(Rd+1) which satisfies the continuity equation with a vector field vt which is equal to the opposite of the gradient of the first variation of the functional F. This means that we have, in the sense of distributions,

    $$\partial_t\mu_t=\operatorname{div}\!\Big(\nabla\Big(\frac{\delta F}{\delta\mu}\Big)\,\mu_t\Big).$$

    That a pair $((\mu_t)_{t\ge0},v_t)$ consisting of a path in $\mathcal P_2(\mathbb R^p)$ and a (time-dependent) vector field in $\mathbb R^p$ satisfies the continuity equation $\partial_t\mu_t=-\operatorname{div}(v_t\,\mu_t)$ in the sense of distributions simply means that for any test function $\varphi\in C^1_c(\mathbb R^p)$,

    $$\partial_t\int\varphi\,d\mu_t=\int\langle v_t,\nabla\varphi\rangle\,d\mu_t,$$

    where $\partial_t$ stands for the time derivative $\frac{d}{dt}$. Similarly, when we say that the advection-reaction equation $\partial_t\mu_t=-\operatorname{div}(v_t\,\mu_t)+g_t\,\mu_t$ is satisfied for some function $g_t:\mathbb R^p\to\mathbb R$, we mean that it holds in the sense of distributions: for any test function $\varphi\in C^1_c(\mathbb R^p)$,

    $$\partial_t\int\varphi\,d\mu_t=\int\big(\langle v_t,\nabla\varphi\rangle+g_t\,\varphi\big)\,d\mu_t.$$

    An alternative description of the Wasserstein gradient flow of the objective F is to consider a flow X() in R+×Rd+1 such that, for any cRd+1,

    $$X_0(c)=c,\qquad\frac{d}{dt}X_t(c)=-\nabla\Big(\frac{\delta F}{\delta\mu}\Big)(X_t(c)),$$

    and to define μt=(Xt)#μ0.

    For more details on Wasserstein gradient flows in the space of probability measures see Santambrogio [32,Section 5.3], and [33,Section 4], and for more details on the equivalence between the continuity equation and the flow-based representation of the solution see Santambrogio [32,Theorem 4.4].

    There are two main ideas behind the proof. Call ˜T:(a,b)R×Rd(±a,T(b)) (depending on whether f is invariant or anti-invariant under T) and consider the following two facts:

    Structure of $\phi((a,b);x)$. Since $T$ is orthogonal, so is $\tilde T$, and the structure of $\phi((a,b);x)=a\,\sigma(b^\top x)$ is such that $\phi(\tilde T(a,b);x)=\pm\phi\big((a,b);T^{-1}(x)\big)$, because $T$ is orthogonal (its adjoint is thus its inverse).

    Conjugate gradients. Computing the gradient of a function whose input has been transformed by $\tilde T^{-1}$ is the same as the conjugate action of $\tilde T$ on the gradient: $\nabla(\varphi\circ\tilde T^{-1})=\tilde T\circ(\nabla\varphi)\circ\tilde T^{-1}$ (this is due to the fact that the adjoint of $\tilde T^{-1}$ is $\tilde T$ because $\tilde T$ is orthogonal). Note that we similarly get $\nabla(\varphi\circ\tilde T)=\tilde T^{-1}\circ(\nabla\varphi)\circ\tilde T$.

    We present here arguments that are present in both the proofs of Propositions 2.1 and 2.2. Let T be a linear orthogonal map such that f(T(x))=±f(x), where the ± is because we deal with both cases at the same time since the logic is the same. Let t0, and define νt:=˜T1#μt. We aim to show that (νt)t0 is also a Wasserstein gradient flow for the same objective as (μt)t0.

    Prediction function. Let $x\in\mathbb R^d$. We have, using the fact that $T$ is orthogonal (and thus that $\langle T(x),y\rangle=\langle x,T^{-1}(y)\rangle$),

    $$f(\nu_t;x)=\int_{a,b}a\,\sigma(b^\top x)\,d\nu_t(a,b)=\int_{a,b}\pm a\,\sigma\big(T^{-1}(b)^\top x\big)\,d\mu_t(a,b)=\pm\int_{a,b}a\,\sigma\big(b^\top T(x)\big)\,d\mu_t(a,b)=\pm f(\mu_t;T(x)).$$

    Time derivative. Let $\varphi\in C^1_c(\mathbb R^{1+d})$. Because $\mu_t$ satisfies the continuity Eq (1.3) in the sense of distributions, and using the remark above on conjugate gradients as well as the orthogonality of $\tilde T$, we have:

    $$\partial_t\int\varphi\,d\nu_t=\partial_t\int\varphi\circ\tilde T^{-1}\,d\mu_t=\int\big\langle\nabla(\varphi\circ\tilde T^{-1}),v_t\big\rangle\,d\mu_t=\int\big\langle\tilde T\circ\nabla\varphi\circ\tilde T^{-1},v_t\big\rangle\,d\mu_t=\int\big\langle\nabla\varphi\circ\tilde T^{-1},\tilde T^{-1}\circ v_t\big\rangle\,d\mu_t=\int\big\langle\nabla\varphi,\tilde T^{-1}\circ v_t\circ\tilde T\big\rangle\,d\nu_t.$$

    Conjugate velocity field. The equality above actually shows that $\nu_t$ satisfies the continuity equation with the conjugate velocity field $\tilde T^{-1}\circ v_t\circ\tilde T$ instead of $v_t$. We show below that the former is closely related to the latter (and is in fact equal to $-\nabla F'_{\nu_t}$ under sufficient assumptions on $\partial_2\ell$, which is the step proven in Appendices C.2 and C.3). Indeed, because $v_t$ is a gradient, $v_t=-\nabla F'_{\mu_t}$, we have, using again the remark above on conjugate gradients:

    $$\tilde T^{-1}\circ v_t\circ\tilde T=-\nabla\big(F'_{\mu_t}\circ\tilde T\big).$$

    Computing the function on the right-hand-side, for any (a,b)R×Rd, we get, using the remark above on the structure of ϕ,

    $$F'_{\mu_t}\big(\tilde T(a,b)\big)=\int_y\partial_2\ell\big(f^*(y),f(\mu_t;y)\big)\,\phi\big(\tilde T(a,b);y\big)\,d\rho(y)=\pm\int_y\partial_2\ell\big(f^*(y),f(\mu_t;y)\big)\,\phi\big((a,b);T^{-1}(y)\big)\,d\rho(y).$$

    $\rho$ is invariant under $T$ since it is spherically symmetric by assumption (and thus invariant under any orthogonal map), and we can therefore replace $y$ by $T(y)$ in the integral above, which yields

    $$F'_{\mu_t}\big(\tilde T(a,b)\big)=\pm\int_y\partial_2\ell\big(f^*(T(y)),f(\mu_t;T(y))\big)\,\phi((a,b);y)\,d\rho(y)=\pm\int_y\partial_2\ell\big({\pm}f^*(y),{\pm}f(\nu_t;y)\big)\,\phi((a,b);y)\,d\rho(y),$$

    and thus we get

    $$\nabla\big(F'_{\mu_t}\circ\tilde T\big)(a,b)=\pm\int_y\partial_2\ell\big({\pm}f^*(y),{\pm}f(\nu_t;y)\big)\,\nabla_{(a,b)}\phi\big((a,b);y\big)\,d\rho(y).$$

    One can already notice that if $f^*$ is invariant under $T$ (as opposed to anti-invariant), that is, if we keep the "$+$" in $\pm$, we get $\tilde T^{-1}\circ v_t\circ\tilde T=-\nabla F'_{\nu_t}$.

    Proof. We first prove ν0=μ0 and then prove that both (μt)t0 and (νt)t0 are Wasserstein gradient flows of the objective F defined in Eq (1.2), starting from the initial condition μ0 at t=0. The unicity of such a gradient flow then guarantees that μt=νt and thus f(μt;T(x))=f(μt;x) by the preliminaries above on the prediction function (see Appendix C.1).

    Initialization: ν0=μ0. By definition, ˜T(a,b)=(a,T(b)). Since μ0=μ10μ20 by assumption, and μ20 is invariant under T since it has spherical symmetry, it is clear that μ0 is invariant under ˜T, and thus under ˜T1 by Lemma A.1, which gives ν0=μ0 because νt=˜T1#μt for any t by definition.

    Time derivative. From the preliminary results above (see Appendix C.1) we have

    $$\partial_t\nu_t=\operatorname{div}\big(\nabla F'_{\nu_t}\,\nu_t\big),$$

    which shows that (νt)t0 is also a Wasserstein gradient flow of the objective F. By unicity of the latter (starting from the initial condition μ0), it must hold that μt=νt for any t0 which concludes the proof.

    The proof follows the exact same pattern as that of Proposition 2.1 (see Appendix C.2). We now have, by definition, $\tilde T(a,b)=(-a,T(b))$, and the added symmetry assumption on $\mu_0^1$ ensures that $\nu_0=\mu_0$ still holds in this case. As for the time derivative, the preliminaries above (see Appendix C.1) ensure that

    $$\nabla\big(F'_{\mu_t}\circ\tilde T\big)(a,b)=-\int_y\partial_2\ell\big({-}f^*(y),{-}f(\nu_t;y)\big)\,\nabla_{(a,b)}\phi\big((a,b);y\big)\,d\rho(y)=\int_y\partial_2\ell\big(f^*(y),f(\nu_t;y)\big)\,\nabla_{(a,b)}\phi\big((a,b);y\big)\,d\rho(y),$$

    where we have used the extra assumption that $\partial_2\ell(-y,-\hat y)=-\partial_2\ell(y,\hat y)$. This yields

    $$\partial_t\nu_t=\operatorname{div}\big(\nabla F'_{\nu_t}\,\nu_t\big),$$

    and the conclusion follows from the same logic as for Proposition 2.1.

    Proof. The proof is divided into three steps: (i) we derive the dynamics in time of the vector $w(t)=\frac12\int a\,b\,d\mu_t(a,b)$, (ii) we show that the positive definite matrix $H(t)$ appearing in these dynamics has its smallest eigenvalue lower-bounded by some positive constant after some $t_0>0$, and (iii) we show that this implies exponential convergence to the global minimum.

    Generalities on the objective Q. Expanding the square in the definition of Q (3.2), we have

    $$Q(w)=\frac12\Big[\mathbb E_{x\sim P}\big[f^*(x)^2\big]-2\,\beta^\top w+w^\top Cw\Big],\qquad C:=\mathbb E_{x\sim P}\big[xx^\top\big]\in\mathbb R^{d\times d},\quad\beta:=\mathbb E_{x\sim P}\big[f^*(x)\,x\big]\in\mathbb R^d.$$

    If $C\succ0$, $Q(w)\to\infty$ as $\|w\|\to\infty$ and, since $Q$ is lower-bounded by $0$, it thus admits at least one global minimum. This minimizer $w^*$ is unique as soon as $Q$ is strongly convex, i.e., $C$ is positive definite, which holds here since we have assumed the smallest eigenvalue $\lambda_{\min}$ of $C$ to be positive. Note that $\nabla Q(w)=Cw-\beta=\int_x\big((x^\top w)-f^*(x)\big)\,x\,dP(x)\in\mathbb R^d$.

    First step: dynamics of $w(t)$. Let $k\in\{1,\dots,d\}$; the $k$-th coordinate $w_k(t)$ of $w(t)$ is given by $w_k(t)=\frac12\int a\,b_k\,d\mu_t(a,b)$, and its time derivative is given by

    $$w_k'(t)=\frac12\int\big\langle\nabla_{(a,b)}(a\,b_k),\,v_t(a,b)\big\rangle\,d\mu_t(a,b),$$

    where $v_t$ is given by Eq (1.3), except that we replace $\sigma$ by $z\mapsto z/2$ and $\rho$ by $P$ in $F'_{\mu_t}$, that is

    $$v_t(a,b)=\frac12\int_y R_t(y)\begin{pmatrix}b^\top y\\ a\,y\end{pmatrix}dP(y)\in\mathbb R^{1+d}.$$

    On the other hand, $\nabla_{(a,b)}(a\,b_k)=\begin{pmatrix}b_k\\ a\,e_k\end{pmatrix}\in\mathbb R^{1+d}$, where $e_k$ is the $k$-th element of the canonical orthonormal basis of $\mathbb R^d$. Note that here, $R_t(y)=f^*(y)-\langle w(t),y\rangle$. We thus get

    $$w_k'(t)=\frac14\Big\langle\int_{a,b}b_k\,b\,d\mu_t(a,b),\ \int_y\big(f^*(y)-w(t)^\top y\big)\,y\,dP(y)\Big\rangle+\frac14\Big\langle\int_{a,b}a^2\,e_k\,d\mu_t(a,b),\ \int_y\big(f^*(y)-w(t)^\top y\big)\,y\,dP(y)\Big\rangle.$$

    Note that the term on the right in the inner products is in fact equal to $-\nabla Q(w(t))$, which yields the following dynamics for the vector $w(t)$:

    $$w'(t)=-H(t)\,\nabla Q(w(t)),\qquad H(t):=\frac14\Big(\int bb^\top\,d\mu_t(a,b)+\int a^2\,d\mu_t(a,b)\,I_d\Big)\in\mathbb R^{d\times d}.$$

    Second step: lower bound on the smallest eigenvalue of $H(t)$. At initialization, by symmetry one has $w(0)=0$, and using Lemma A.3, one has that $H(0)=\frac14\big(\frac1d+1\big)I_d$, so that

    $$\frac{d}{dt}Q(w(t))\Big|_{t=0}=\big\langle w'(0),\nabla Q(w(0))\big\rangle=-\frac{d+1}{4d}\,\big\|\nabla Q(0)\big\|^2=-\frac{d+1}{4d}\,\|\beta\|^2.$$

    If $\beta=0$, then $\nabla Q(0)=0$ and, since $w(0)=0$, $w(t)$ starts at the global optimum and thus stays constant equal to $0$. Otherwise, if $\|\beta\|>0$, one has $\frac{d}{dt}Q(w(t))\big|_{t=0}<0$, which ensures that there is a $t_0>0$ such that $Q(w(t))<Q(w(0))=Q(0)$ for any $t\in(0,t_0]$. Call $\varepsilon:=[Q(0)-Q(w(t_0))]/2>0$. The continuity of $Q$ at $0$ guarantees that there is a $\delta>0$ such that for any $w\in\mathbb R^d$, if $\|w\|<\delta$, then $|Q(w)-Q(0)|\le\varepsilon$.

    Now assume that there exists $t_1\ge t_0$ such that $\int a^2\,d\mu_{t_1}(a,b)\le\delta$. Then, one has

    $$\|w(t_1)\|=\Big\|\frac12\int a\,b\,d\mu_{t_1}(a,b)\Big\|\le\frac12\int|a|\,\|b\|\,d\mu_{t_1}(a,b)\le\frac12\int a^2\,d\mu_{t_1}(a,b)\le\frac{\delta}{2}<\delta,$$

    where we have used in the penultimate inequality that $\mu_{t_1}$ is supported on the set $\{|a|=\|b\|\}$ because of the assumptions on the initialization $\mu_0$ (see Section 1.1). This ensures that $|Q(w(t_1))-Q(0)|\le\varepsilon$. Since $t\mapsto Q(w(t))$ is decreasing ($Q(w(t))=F(\mu_t)$, and it is classical that the objective decreases along the gradient flow path, see the third step below) and $t_1\ge t_0$, this means that

    $$0<Q(0)-Q(w(t_0))\le Q(0)-Q(w(t_1))\le\varepsilon=[Q(0)-Q(w(t_0))]/2,$$

    which is a contradiction. Therefore, for any $t\ge t_0$, $\int a^2\,d\mu_t(a,b)>\delta$. Calling $\eta:=\delta/4>0$, we thus have that for any $t\ge t_0$, the smallest eigenvalue of $H(t)$ is larger than $\eta$, because $H(t)$ is the sum of the positive semi-definite matrix $\frac14\int bb^\top\,d\mu_t(a,b)$ and of the positive definite matrix $\frac14\int a^2\,d\mu_t(a,b)\,I_d$, whose smallest eigenvalue is at least $\eta$ for $t\ge t_0$.

    Third step: exponential convergence. We have:

    $$\frac{d}{dt}Q(w(t))=\big\langle w'(t),\nabla Q(w(t))\big\rangle=-\nabla Q(w(t))^\top H(t)\,\nabla Q(w(t))\le 0,$$

    which shows that because H(t) is positive definite, the objective Q is decreasing along the path (w(t))t0. Since after t0>0, the smallest eigenvalue of H(t) is lower bounded by a constant η>0, we have that, for any tt0:

    \[
    \frac{d}{dt}Q(w(t))\le-\eta\,\|\nabla Q(w(t))\|^2. \qquad (\mathrm{D.1})
    \]

    Because $Q$ is $\lambda_{\min}$-strongly convex (as the smallest eigenvalue of $C$ is $\lambda_{\min}>0$), one has the classical inequality

    \[
    \frac{1}{2}\,\|\nabla Q(w)\|^2\ge\lambda_{\min}\big(Q(w)-Q(w^\star)\big).
    \]
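
    For completeness, this Polyak–Łojasiewicz-type inequality can be checked directly in the present quadratic setting: using $\nabla Q(w)=C(w-w^\star)$ (which follows from $Cw^\star=\beta$, established at the end of the proof) and diagonalizing $C$,

    \[
    \frac{1}{2}\|\nabla Q(w)\|^2=\frac{1}{2}(w-w^\star)^\top C^2(w-w^\star)\ \ge\ \frac{\lambda_{\min}}{2}\,(w-w^\star)^\top C(w-w^\star)=\lambda_{\min}\big(Q(w)-Q(w^\star)\big).
    \]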

    Plugging this into Eq (D.1) gives

    \[
    \frac{d}{dt}\big(Q(w(t))-Q(w^\star)\big)\le-2\eta\lambda_{\min}\big(Q(w(t))-Q(w^\star)\big),
    \]

    which by Gronwall's lemma in turn yields, for any $t\ge t_0$,

    \[
    0\le Q(w(t))-Q(w^\star)\le e^{-2\eta\lambda_{\min}(t-t_0)}\big(Q(w(t_0))-Q(w^\star)\big),
    \]

    thereby proving exponential convergence.
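
    For readers who want the Gronwall step spelled out: setting $u(t):=Q(w(t))-Q(w^\star)$, the previous differential inequality gives

    \[
    \frac{d}{dt}\Big(e^{2\eta\lambda_{\min}(t-t_0)}\,u(t)\Big)=e^{2\eta\lambda_{\min}(t-t_0)}\big(u'(t)+2\eta\lambda_{\min}\,u(t)\big)\le 0,
    \]

    so that $e^{2\eta\lambda_{\min}(t-t_0)}u(t)\le u(t_0)$ for all $t\ge t_0$, which is exactly the bound above.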

    Exponential convergence in distance. Given that $\nabla Q(w^\star)=0$ because $w^\star$ is an optimum, it holds that $Cw^\star=\beta$. Using this fact, it easily follows that

    \[
    Q(w)-Q(w^\star)=\frac{1}{2}\big\langle C(w-w^\star),\,w-w^\star\big\rangle,
    \]

    and the right-hand side is lower bounded by $\frac{\lambda_{\min}}{2}\|w-w^\star\|^2$, from which we conclude that

    \[
    \|w(t)-w^\star\|^2\le\frac{2}{\lambda_{\min}}\big(Q(w(t))-Q(w^\star)\big),
    \]

    and the exponential decrease of the right-hand side allows us to conclude.

    We wish to show here that the pair of measures $(\nu_t^+,\nu_t^-)$ defined through Eq (4.1) satisfies Eq (4.2), and that the corresponding dynamics are closed in the sense that they can be expressed solely using $(\nu_t^+,\nu_t^-)$ (without expressing quantities as functions of $\mu_t$). Below, we use $\kappa(z)=\max(0,z)$. We use this notation to differentiate it from the activation function $\sigma$ (which is also equal to the ReLU) and thus avoid confusion: the $\kappa$ appearing below has nothing to do with the activation function of the network and simply comes from the integration domain in the calculations.

    Equations of the dynamics on the sphere. Let $\varphi\in C_c^1(\mathbb{S}^{d-1})$. One has

    \[
    \frac{d}{dt}\int\varphi\,d\nu_t^{\pm}=\frac{d}{dt}\int_{\pm a\ge 0,\,b}|a|\,\|b\|\,\varphi\Big(\frac{b}{\|b\|}\Big)\,d\mu_t(a,b)=\frac{d}{dt}\int_{a,b}\kappa(\pm a)\,\|b\|\,\varphi\Big(\frac{b}{\|b\|}\Big)\,d\mu_t(a,b)=\int_{a,b}\nabla_{(a,b)}\Big(\kappa(\pm a)\,\|b\|\,\varphi\Big(\frac{b}{\|b\|}\Big)\Big)\cdot v_t(a,b)\,d\mu_t(a,b).
    \]

    Let us compute the components of the gradient above. We have

    \[
    \partial_a\Big(\kappa(\pm a)\,\|b\|\,\varphi\Big(\frac{b}{\|b\|}\Big)\Big)=\pm\,\kappa'(\pm a)\,\|b\|\,\varphi\Big(\frac{b}{\|b\|}\Big)=\pm\,\mathbf{1}_{\{\pm a\ge 0\}}\,\|b\|\,\varphi\Big(\frac{b}{\|b\|}\Big).
    \]

    The Jacobian of the map $b\in\mathbb{R}^d\mapsto b/\|b\|$ is equal to $\frac{1}{\|b\|}\big(\mathrm{Id}-bb^\top/\|b\|^2\big)$, which is a symmetric (or self-adjoint) matrix, so that the gradient w.r.t. $b$ is

    \[
    \nabla_b\Big(\kappa(\pm a)\,\|b\|\,\varphi\Big(\frac{b}{\|b\|}\Big)\Big)=\mathbf{1}_{\{\pm a\ge 0\}}\,|a|\left[\varphi\Big(\frac{b}{\|b\|}\Big)\frac{b}{\|b\|}+\Big(\mathrm{Id}-\frac{b}{\|b\|}\Big(\frac{b}{\|b\|}\Big)^{\!\top}\Big)\nabla\varphi\Big(\frac{b}{\|b\|}\Big)\right].
    \]

    On the other hand, the first component of vt(a,b) (corresponding to the gradient w.r.t. a) is

    \[
    v_t^1(a,b)=\int_y R_t(y)\,\kappa(b\cdot y)\,d\rho(y)=\|b\|\int_y R_t(y)\,\kappa\Big(\frac{b}{\|b\|}\cdot y\Big)\,d\rho(y),
    \]

    and the last d components (corresponding to the gradient w.r.t. b) are

    \[
    v_t^2(a,b)=\int_y R_t(y)\,a\,\kappa'(b\cdot y)\,y\,d\rho(y)=a\int_y R_t(y)\,\kappa'\Big(\frac{b}{\|b\|}\cdot y\Big)\,y\,d\rho(y),
    \]

    When computing the inner product $\nabla_{(a,b)}\big(\kappa(\pm a)\,\|b\|\,\varphi(b/\|b\|)\big)\cdot v_t(a,b)$, we can re-arrange the terms to keep one term where $\varphi$ appears and another where $\nabla\varphi$ appears. Using the facts that the Jacobian computed above is symmetric, that $\kappa(z)=\kappa'(z)\,z$ for any $z\in\mathbb{R}$, and that $\mathbf{1}_{\{\pm a\ge 0\}}\,a=\pm\mathbf{1}_{\{\pm a\ge 0\}}\,|a|=\pm\kappa(\pm a)$, we get

    \[
    \nabla_{(a,b)}\Big(\kappa(\pm a)\,\|b\|\,\varphi\Big(\frac{b}{\|b\|}\Big)\Big)\cdot v_t(a,b)=\pm\mathbf{1}_{\{\pm a\ge 0\}}\,\|b\|\,\|b\|\,\varphi\Big(\frac{b}{\|b\|}\Big)\,g_t\Big(\frac{b}{\|b\|}\Big)\,+\,\pm\mathbf{1}_{\{\pm a\ge 0\}}\,|a|\,|a|\,\varphi\Big(\frac{b}{\|b\|}\Big)\,g_t\Big(\frac{b}{\|b\|}\Big)\,+\,\pm\mathbf{1}_{\{\pm a\ge 0\}}\,|a|\,|a|\,\nabla\varphi\Big(\frac{b}{\|b\|}\Big)\cdot\tilde v_t\Big(\frac{b}{\|b\|}\Big),
    \]

    where, for $u\in\mathbb{S}^{d-1}$,

    \[
    g_t(u):=\int_y R_t(y)\,\sigma(u\cdot y)\,d\rho(y),\qquad \tilde v_t(u):=\int_y R_t(y)\,\sigma'(u\cdot y)\,\big[y-(u\cdot y)\,u\big]\,d\rho(y).
    \]

    Finally, because $\mu_t$ stays on the cone $\{(a,b)\in\mathbb{R}^{d+1}\,;\,|a|=\|b\|\}$ for any $t$ (see Chizat and Bach [4, Lemma 26], Wojtowytsch [5, Section 2.5]), when integrating against $\mu_t$ we can replace $\|b\|$ by $|a|$ and vice-versa. We thus get that the time derivative we initially computed is the sum of two terms:

    \[
    \frac{d}{dt}\int\varphi\,d\nu_t^{\pm}=\pm\,2\int_{\pm a\ge0,\,b}|a|\,\|b\|\,\varphi\Big(\frac{b}{\|b\|}\Big)\,g_t\Big(\frac{b}{\|b\|}\Big)\,d\mu_t(a,b)\;\pm\;\int_{\pm a\ge0,\,b}|a|\,\|b\|\,\nabla\varphi\Big(\frac{b}{\|b\|}\Big)\cdot\tilde v_t\Big(\frac{b}{\|b\|}\Big)\,d\mu_t(a,b)=\pm\,2\int_{\mathbb{S}^{d-1}}\varphi(u)\,g_t(u)\,d\nu_t^{\pm}(u)\;\pm\;\int_{\mathbb{S}^{d-1}}\nabla\varphi(u)\cdot\tilde v_t(u)\,d\nu_t^{\pm}(u),
    \]

    which shows that ν±t satisfies Eq (4.2) in the sense of distributions.

    Closed dynamics. We want to show that gt and ˜vt can be expressed using only ν+t and νt. Both these quantities depend on t only through the residual Rt, which itself only depends on t through f(μt;). We thus show that the latter can be expressed using only ν+t and νt, which easily follows from writing, for any yRd,

    \[
    f(\mu_t;y)=\int a\,\sigma(b\cdot y)\,d\mu_t(a,b)=\int a\,\|b\|\,\sigma\Big(\Big\langle\frac{b}{\|b\|},y\Big\rangle\Big)\,d\mu_t(a,b)=\int_{a\ge0,\,b}|a|\,\|b\|\,\sigma\Big(\Big\langle\frac{b}{\|b\|},y\Big\rangle\Big)\,d\mu_t(a,b)-\int_{a\le0,\,b}|a|\,\|b\|\,\sigma\Big(\Big\langle\frac{b}{\|b\|},y\Big\rangle\Big)\,d\mu_t(a,b)=\int_{\mathbb{S}^{d-1}}\sigma(u\cdot y)\,d\nu_t^+(u)-\int_{\mathbb{S}^{d-1}}\sigma(u\cdot y)\,d\nu_t^-(u).
    \]

    Proof. We first prove that Eq (4.4) for $\tau_t^{\pm}$ holds in the sense of distributions, and then show that the corresponding dynamics are closed because the $V_t$ and $G_t$ appearing in Eq (4.4) can be expressed with $(\tau_t^+,\tau_t^-)$ (and not only with $(\nu_t^+,\nu_t^-)$ for instance). We show this by expressing $f(\mu_t;\cdot)$ only as a function of the pair $(\tau_t^+,\tau_t^-)$.

    The pair (τ+t,τt) satisfy Eq (4.4). First, we show that gt and ˜vt defined in Eq (4.3) admit modified expressions that match the structure of the pushforward transforming ν±t into τ±t. Indeed, since ρ is assumed to be spherically symmetric, it is invariant by any orthogonal transformation. In particular, for a fixed uSd1 such that u0, we consider the orthogonal map Tu:RdRd such that Tu|H=idH and Tu|H sends the canonical orthonormal basis (e1,,ed) of H on (u/||u||,u2,,ud) where (u2,,ud)(H)d1 is an orthonormal family, orthogonal to u, so that for any yH with coordinates y1,,yd in the basis (e1,,ed), Tu|H(y)=y1u/||u||+hu(y) with hu(y)u.

    Note that since f(y)=fH(yH) and f(μt;y)=˜ft(yH,||y||), the residual Rt(y)=f(y)f(μt;y) is invariant by any orthogonal transformation which preserves H (and in particular by Tu). We thus have

    gt(u)=yRt(y)σ(uH,yH+y1||u||)dρ=:˜gt(uH,||u||),˜vt(u)=yRt(y)σ(uH,yH+y1||u||)[yH+Tu|H(y)(uH,yH+y1||u||)u]dρ.

    Now consider, for any (θ,zH)[0,π/2]×SdH1,

    Gt(θ,zH):=˜gt(cos(θ)zH,sin(θ))Vt(θ,zH):=yRt(y)σ(cos(θ)zH,yH+y1sin(θ))(y1cos(θ)sin(θ)zH,yHyHcos(θ)zH,yHzHcos(θ))dρ

    We show below that (τ+t,τt) satisfy Eq (4.4) with the Gt and Vt defined above. Let φC1c([0,π/2]×SdH1). Since τ±t is defined as a push-forward measure obtained from ν±t we have:

    tφ(θ,zH)dτ±t(θ,zH)= tφ(arccos(||uH||),uH||uH||)dν±t(u)=±2φ(arccos(||uH||),uH||uH||)˜gt(uH,||u||)dν±t(u) +±u(φ(arccos(||uH||),uH||uH||))˜vt(u)dν±t(u).

    By definition of the pushforward, and since uH=cos(arccos(||uH||))uH/||uH|| and ||u||=sin(arccos(||uH||)) for uSd1, the first integral is equal φ(θ,zH)Gt(θ,zH)dτ±t(u). For the second integral, let us first compute the gradient. One has

    u(φ(arccos(||uH||),uH||uH||))=θφ(arccos(||uH||),uH||uH||)11||uH||2uH||uH|| +1||uH||[IdHuH(uH)||uH||2](zHφ)(arccos(||uH||),uH||uH||).

    We observe that the gradient above belongs to H which implies that when computing its inner product with ˜vt(u) we can consider only the component of the latter along H. Additionally, we note that IdHuH(uH)/||uH||2 is actually the orthogonal projection onto {uH}, so that it yields 0 when applied to u. Using that ||u||=1||uH||2 for uSd1, we then get:

    u(φ(arccos(||uH||),uH||uH||))˜vt(u)=φ(arccos(||uH||),uH||uH||)Vt((arccos(||uH||),uH||uH||).

    where φ(θ,zH)=(θφ(θ,zH)zHφ(θ,zH)). This shows that

    tφ(θ,zH)dτ±t(θ,zH)=±2φ(θ,zH)Gt(θ,zH)dτ±t(θ,zH) +±φ(θ,zH)Vt(θ,zH)dτ±t(θ,zH),

    which proves that $\tau_t^{\pm}$ indeed satisfies Eq (4.4) in the sense of distributions.

    The dynamics are closed in the pair (τ+t,τt). The only thing left to prove to show that the dynamics are closed for the pair (τ+t,τt) is that Gt and Vt can be expressed using only the pair (τ+t,τt). The only dependence of these quantities on t is through the residual Rt which itself depends on t only through f(μt;). Let yRd. We have already shown at the end of the previous Section E.1.1 that by definition of ν+t and νt, we have

    \[
    f(\mu_t;y)=\int_{\mathbb{S}^{d-1}}\sigma(u\cdot y)\,d(\nu_t^+-\nu_t^-)(u).
    \]

    On the other hand, we show below that the integral of any measurable function φ:Sd1R against ν±t can be expressed as an integral against τ±t in the case where ν±t admits a density w.r.t. the uniform measure on Sd1 (which is the case for ν±0), the case of a general measure ν±t being a simple extension via a weak convergence argument. Thus call p±t the density of ν±t w.r.t. ˜ωd. Since ν±t is invariant by any linear map T such that T|H=idH T|HO(d) (because of the symmetries on μt given by Proposition 2.1), and since this is also the case for ˜ωd because ˜ωd has spherical symmetry and T is orthogonal, we have by Lemma A.2 that p±t is invariant by any such T, which then leads to pt having the form pt(u)=˜p±t(uH,||u||) by Lemma 4.1.

    First step. We show that τ±t has the density q±t(θ,zH)=|Sd1|˜p±t(cos(θ)zH,sin(θ)) w.r.t. ˜γ˜ωdH where the measure ˜γ is the normalized counterpart of the measure in Definition A.1. Indeed, let φ:[0,π/2]×Sd1R be any measurable function w.r.t. τ±t. Using the disintegration Lemma A.4 on ˜ωd, one has that

    φ(θ,zH)dτ±t(θ,zH)=φ(arccos(||uH||),uH||uH||)dν±t(u)=φ(arccos(||uH||),uH||uH||)˜p±t(uH,||u||)d˜ωd(u)=φ(θ,zH)˜p±t(cos(θ)zH,sin(θ))d˜γ(θ)d˜ωdH(zH)d˜ωd(z)=φ(θ,zH)˜p±t(cos(θ)zH,sin(θ))d˜γ(θ)d˜ωdH(zH),

    which proves the desired density for τ±t.

    Second step. Consider a measurable φ:Sd1R w.r.t. ν±t. One has with similar calculations as above

    uφ(u)dν±t(u)=uφ(u)˜p±t(uH,||u||)d˜ωd(u)=φ(cos(θ)zH+sin(θ)z)˜p±t(cos(θ)zH,sin(θ))d˜γ(θ)d˜ωdH(zH)d˜ωd(z)=θ,zH(zφ(cos(θ)zH+sin(θ)z)d˜ωd(z))q±t(θ,zH)d˜γ(θ)d˜ωdH(zH)=θ,zH(zφ(cos(θ)zH+sin(θ)z)d˜ωd(z))dτ±t(θ,zH).

    Applying this to f(μt;y) shows that the latter quantity can be expressed solely using (τ+t,τt), which proves that the dynamics is indeed closed and therefore concludes the proof when ν±t has a density.

    Third step: extending to any measure. It is known that for any measure ν over Sd1, there exists a sequence of measure (νn)nN such that: (i) νn has a density pn w.r.t. the uniform measure ˜ωd over Sd1, and (ii) the sequence (νn)nN converges weakly to ν, that is, for any continuous (and thus automatically bounded because the unit sphere is compact) φ, φdνnnφdν. Let thus νM+(Sd1), and consider a sequence (νn)nN with density converging weakly towards ν. Let τ (resp. τn) be defined from ν (resp. νn) as τ±t is defined from ν±t, that is for any measurable φ:[0,π/2]×SdH1R,

    φ(θ,zH)dτ(θ,zH)=φ(arccos(||uH||),uH||uH||)dν(u),φ(θ,zH)dτn(θ,zH)=φ(arccos(||uH||),uH||uH||)dνn(u).

    Let thus φ be a continuous map from Sd1R (having in mind the example of :uσ(uy) for a fixed y). By the result of Step 2, since νn has a density for every n, we have that

    φ(u)dνn(u)=θ,zH(zφ(cos(θ)zH+sin(θ)z)d˜ωd(z))dτn(θ,zH), (E.1)

    and taking the limit n, the left-hand-side of Eq (E.1) converges to φdν by assumption. Now let us look at the right-hand-side of (E.1). Calling ψ(θ,zH)=zφ(cos(θ)zH+sin(θ)z)d˜ωd(z) and Φ(u)=zφ(uH+||u||z)d˜ωd(z), the right-hand-side is in fact ψdτn and, for any nN, is equal to:

    ψdτn=uψ(arccos(||uH||),uH||uH||)dνn(u)=uzφ(||uH||uH||uH||+||u||z)d˜ωd(z)dνn(u)=uzφ(uH+||u||z)d˜ωd(z)dνn(u)=Φdνn,

    and a similar result holds for τ and ν. Now, the continuity of Φ is readily obtained from that of φ, and thus the right-hand-side in the last equality above converges to Φdν which is also equal to ψdτ by the same calculations as above. The right-hand-side in (E.1) therefore converges to ψdτ, and since the limits of both sides are equal, we get φdν=ψdτ, which is the claim of Step 2 for a general measure ν which does not necessarily admit a density, thereby concluding the proof.

    Here, we give the proof of Theorem 4.3, which shows that when $f(x)=\|x_H\|$ the dynamics can be reduced to a single variable: the angle $\theta\in[0,\pi/2]$ between the particles and the subspace $H$.

    We decompose the proof into three steps: first, we show that the pair of measures $(\tau_t^+,\tau_t^-)\in\mathcal{M}_+([0,\pi/2])^2$ as defined in Section 4.2 indeed follows Eq (4.5); then, we show that the dynamics are indeed closed by proving that the terms $V_t$ and $G_t$ appearing in the GF depend only on $(\tau_t^+,\tau_t^-)$; and finally, we show that Eq (4.5) indeed corresponds to a Wasserstein-Fisher-Rao GF on a given objective functional over $\mathcal{M}_+([0,\pi/2])^2$.

    Proof. We first use the added symmetry to simplify the terms gt and ˜vt which appear in the GF with (ν+t,νt) (see Section 4.1) and express them only with ||uH|| and ||u||. Then we use the equations satisfied by (ν+t,νt) to obtain equations for (τ+t,τt).

    Equations for (τ+t,τt). Let φC1c([0,π/2]). We have

    tφdτ±t=tφ(arccos(||uH||))dν±t(u)=±u(φ(arccos(||uH||)))˜vt(u)dν±t(u)±2φ(arccos(||uH||))gt(u)dν±t(u).

    One has that

    \[
    \nabla_u\big(\varphi(\arccos(\|u_H\|))\big)=-\varphi'(\arccos(\|u_H\|))\times\frac{1}{\sqrt{1-\|u_H\|^2}}\,\frac{u_H}{\|u_H\|},
    \]

    which belongs to H. We recall here the expressions of ˜vt and gt: for any uSd1, we have

    \[
    g_t(u)=\int_y R_t(y)\,\sigma(u\cdot y)\,d\rho(y),\qquad \tilde v_t(u)=\int_y R_t(y)\,\sigma'(u\cdot y)\,\big[y-(u\cdot y)\,u\big]\,d\rho(y).
    \]

    Since, f(x)=||xH||, f is now invariant under any orthogonal map T preserving H and H, that is such that the restrictions T|HO(dH) and T|HO(d). Proposition 2.1 then ensures that so is f(μt,), which in turn implies that the residual Rt()=2(f(μt;),f()) also shares that invariance property. Using a similar change of variable as in Appendix E.1.2, and because ρ is spherically symmetric, one gets that gt can be re-written

    gt(u)=yRt(y)σ(yH1||uH||+y1||u||)dρ(y).

    Calling

    Gt(θ):=yRt(y)σ(yH1cos(θ)+y1sin(θ))dρ(y),

    one has gt(u)=Gt(arccos(||uH||)) because uSd1, so that ||u||=1||uH||2. Then, by definition of τ±t, the second integral in the time derivative above is equal to φ(θ)Gt(θ)dτ±t. For the first integral appearing in that time derivative, we get

    u(φ(arccos(||uH||)))˜vt(u)=φ(arccos(||uH||))||u||||uH||yRt(y)σ(uy)[(uy)uy]uHdρ.

    Expanding the inner product inside the integral, we have

    [(uy)uy]uH=(uH,yH+u,y)||uH||2uH,yH=||uH||2u,y(1||uH||2)uH,yH=||uH||2u,y||u||2uH,yH.

    Calling

    Vt(θ):=yRt(y)σ(yH1cos(θ)+y1sin(θ))[y1cos(θ)yH1sin(θ)]dρ(y)=G(θ),

    and using again the spherical symmetry of ρ, with the same change of variable in the integral as for gt, we get that

    u(φ(arccos(||uH||)))˜vt(u)=φ(arccos(||uH||))Vt(arccos(||uH||)).

    Finally, this combined with the previous result on the integral with gt yields

    \[
    \frac{d}{dt}\int\varphi\,d\tau_t^{\pm}=\pm\int\varphi'(\theta)\,V_t(\theta)\,d\tau_t^{\pm}(\theta)\pm 2\int\varphi(\theta)\,G_t(\theta)\,d\tau_t^{\pm}(\theta),
    \]

    which leads to the desired equation

    \[
    \partial_t\tau_t^{\pm}=-\mathrm{div}\big(\pm V_t\,\tau_t^{\pm}\big)\pm 2\,G_t\,\tau_t^{\pm}.
    \]

    The proof follows closely that of Appendix E.1.2 (where we prove closed dynamics), except that here we take advantage of the added symmetry of the dynamics. As in Appendix E.1.2, we have

    \[
    f(\mu_t;y)=\int_{\mathbb{S}^{d-1}}\sigma(u\cdot y)\,d(\nu_t^+-\nu_t^-)(u),
    \]

    and the only thing to prove is that this quantity can be expressed using only (τ+t,τt). As in Appendix E.1.2, we first prove this when ν±t has a density, which is the case for ν±0 and should thus remain so during the dynamics.

    Similarly to what occurs in Appendix E.1.2, ν±t is invariant by any orthogonal map T which preserves H and H because μt has those symmetries given by Proposition 2.1, and if ν±t has a density pν±t w.r.t. ˜ωd, then p±t is also invariant by any such map T, and thus depends only on the norms ||uH|| and ||u|| of its input uSd1. But since its input is on the sphere, those norms are determined by the angle θ=arccos(||uH||) between the input u and H. Calling q±t such that p±t(u)=q±t(arccos(||uH||)), this will lead τ±t to have the density q±t w.r.t. ˜γ. Then, we show below that similarly to Appendix E.1.2, the integral of any measurable φ:Sd1R against ν±t can be expressed as an integral against τ±t. Indeed, using the disintegration Lemma A.4,

    φdν±t=θ[0,π/2]φ(u)q±t(arccos(||uH||))d˜ωd(u)=uφ(cos(θ)zH+sin(θ)z)q±t(θ)d˜ωdH(zH)d˜ωd(z)d˜γ(θ)=θ[0,π/2]˜φ(θ)q±t(θ)d˜γ(θ)=θ[0,π/2]˜φ(θ)dτ±t(θ)

    where

    ˜φ(θ):=zH,zφ(cos(θ)zH+sin(θ)z)d˜ωdH(zH)d˜ωd(z),

    which concludes the proof if ν±t has a density w.r.t. the uniform measure ˜ωd on the sphere Sd1. The general case is obtained by a weak convergence argument (of measures with density) as in the third step of Section E.1.2.

    Proof. Recall that γ is the measure in Definition 3, and consider the following objective functional over M([0,π/2])2:

    \[
    A(\tau^+,\tau^-):=\int_{\varphi\in[0,\pi/2]}\ell\big(\cos(\varphi),\tilde f(\tau^+,\tau^-;\varphi)\big)\,d\tilde\gamma(\varphi),\qquad \tilde f(\tau^+,\tau^-;\varphi):=\int_{\theta\in[0,\pi/2]}\tilde\phi(\theta;\varphi)\,d(\tau^+-\tau^-)(\theta),\qquad \tilde\phi(\theta;\varphi):=\int_{r,s\in[-1,1]}\sigma\big(r\cos(\varphi)\cos(\theta)+s\sin(\varphi)\sin(\theta)\big)\,d\tilde\gamma_{d_H}(r)\,d\tilde\gamma_{d_\perp}(s),
    \]

    where, for any $p\in\mathbb{N}$, $d\gamma_p(r)=(1-r^2)^{(p-3)/2}\,dr$, and $\tilde\gamma_p=\gamma_p/|\gamma_p|$ with the normalizing factor $|\gamma_p|=B(1/2,(p-1)/2)=\sqrt{\pi}\,\Gamma((p-1)/2)/\Gamma(p/2)=|\mathbb{S}^{p-1}|/|\mathbb{S}^{p-2}|$. Note that $\tilde\gamma_p$ can be simply expressed as the law of $\epsilon\sqrt{X}$ where $\epsilon\sim\mathcal{U}(\{-1,+1\})$ and $X\sim\mathrm{Beta}(1/2,(p-1)/2)$.
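
    The Beta representation above also gives a convenient way to draw samples from $\tilde\gamma_p$, which is useful for the Monte-Carlo scheme of the numerical simulations below. The following short Python sketch (illustrative code with our own names) samples from $\tilde\gamma_p$ and compares it with the law of one coordinate of a uniform point on $\mathbb{S}^{p-1}$, which has the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 6, 200_000

# Sample from gamma_tilde_p as eps * sqrt(X), eps ~ U{-1,+1}, X ~ Beta(1/2, (p-1)/2).
eps = rng.choice([-1.0, 1.0], size=n)
r = eps * np.sqrt(rng.beta(0.5, (p - 1) / 2, size=n))

# One coordinate of a uniform point on S^{p-1} has density prop. to (1-t^2)^((p-3)/2) as well.
u = rng.standard_normal((n, p))
u /= np.linalg.norm(u, axis=1, keepdims=True)
t = u[:, 0]

# The two samples should agree up to Monte-Carlo error (both second moments equal 1/p).
print(np.mean(r**2), np.mean(t**2))
print(np.mean(np.abs(r)), np.mean(np.abs(t)))
```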

    Computing the first variation, or Fréchet derivative, of the functional $A$ w.r.t. its first and second arguments yields, for any $\theta\in[0,\pi/2]$,

    \[
    \frac{\delta A}{\delta\tau^{\pm}}(\tau^+,\tau^-)[\theta]=\pm\int_\varphi \partial_2\ell\big(\cos(\varphi),\tilde f(\tau^+,\tau^-;\varphi)\big)\,\tilde\phi(\theta;\varphi)\,d\tilde\gamma(\varphi).
    \]

    To conclude, one needs only observe that the quantity above is simply equal to $\mp G_t(\theta)$, up to a fixed multiplicative constant. Since we have assumed $\rho$ to be the uniform measure over $\mathbb{S}^{d-1}$ to ensure that the Wasserstein GF (1.3) is well-defined, the constant is one here, but in the case of a general $\rho$ with spherical symmetry the result should also hold (as long as the Wasserstein GF (1.3) is well-defined), although the proof is more technical and different constants might appear.

    Simplifying f(μt;). Using the results from Appendix E.2.2, we have for any φ,zH,z[0,π/2]×SdH1×Sd1 (so that u=cos(φ)zH+sin(φ)zSd1)

    f(μt;cos(φ)zH+sin(φ)z)=ψξH,ξσ(cos(ψ)cos(φ)ξH,zH+sin(ψ)sin(φ)ξ,z)d˜ωdH(ξH)d˜ωd(ξ)d(τ+tτt)(ψ)

    Now, because of the integration against uniform measures on the unit spheres, and the inner products involved, we can use some spherical harmonics theory to simplify those calculations. Using the Funk-Hecke formula (see Atkinson and Han [34, Theorem 2.22], with $n=0$, $d=d_H$ or $d=d_\perp$), we get

    f(μt;cos(φ)zH+sin(φ)z)=|SdH2||Sd2||SdH1||Sd1|ψr,sσ(rcos(ψ)cos(φ)+ssin(ψ)sin(φ))dγdH(r)dγd(s)d(τ+tτt)(ψ)=1|γdH||γd||γdH||γd|ψ[0,π/2]˜ϕ(ψ;φ)d(τ+tτt)(ψ)=˜f(τ+t,τt;φ).

    Simplifying f(cos(φ)zH+sin(φ)z). Because f(y)=||yH||, f(cos(φ)zH+sin(φ)z) is simply ||cos(φ)zH||=cos(φ) because zHSdH1.

    With the previous expressions for f(μt;) and f we have that for any function Φ:R2R,

    Φ(f(cos(φ)zH+sin(φ)z),f(μt;cos(φ)zH+sin(φ)z))=Φ(cos(φ),˜f(τ+t,τt;φ)).

    Note that this applies both to $\Phi(y,\hat y)=\ell(y,\hat y)$ and $\Phi(y,\hat y)=\partial_2\ell(y,\hat y)$.

    Proof that F(μt)=A(τ+t,τt).

    Using the disintegration Lemma A.4 for the uniform measure on the unit sphere Sd1, we have

    F(μt)=y(f(y),f(μt;y))dρ(y)=(f(),f(μt;))(cos(φ)zH+sin(φ)z)d˜ωdH(zH)d˜ωd(z)d˜γ(φ)=(cos(φ),˜f(τ+t,τt;φ))d˜ωdH(zH)d˜ωd(z)d˜γ(φ)=(cos(φ),˜f(τ+t,τt;φ))d˜γ(φ),

    where we have used in the last equality the fact that the integrand does not depend on zH or z and that ˜ωdH and ˜ωd are probability measures (and thus their total mass is 1).

    Simplifying Gt. Using the disintegration Lemma A.4, we have:

    Gt(θ)=yRt(y)σ(yH1cos(θ)+y1sin(θ))dρ(y)=Rt(cos(φ)zH+sin(φ)z)σ(zH1cos(φ)cos(θ)+z1sin(φ)sin(θ))˜ωdH(zH)d˜ωd(z)d˜γ(φ).

    Similarly to what we did for simplifying f(μt;), we can simplify the integrals against ˜ωdH and ˜ωd using spherical harmonics theory to get:

    \[
    G_t(\theta)=-\int_{\varphi\in[0,\pi/2]}\partial_2\ell\big(\cos(\varphi),\tilde f(\tau_t^+,\tau_t^-;\varphi)\big)\,\tilde\phi(\varphi;\theta)\,d\tilde\gamma(\varphi).
    \]

    This shows that

    \[
    \frac{\delta A}{\delta\tau^+}(\tau_t^+,\tau_t^-)[\theta]=-G_t(\theta),\qquad \frac{\delta A}{\delta\tau^-}(\tau_t^+,\tau_t^-)[\theta]=G_t(\theta),
    \]

    which proves that Eq (4.5) indeed describes the evolution of the Wasserstein-Fisher-Rao for the objective functional A over M([0,π/2])2, given by the pair (τ+t,τt).

    Measure discretization. Discretizing $\mu_t$ via $\mu_{m,t}=\frac{1}{m}\sum_{j=1}^m\delta_{(a_j(t),b_j(t))}$, we get that $\tau_{m,t}:=\tau_{m,t}^+-\tau_{m,t}^-=\frac{1}{m}\sum_{j=1}^m c_j(t)\,\delta_{\theta_j(t)}$, where

    \[
    c_j(t)=\varepsilon_j\,|a_j(t)|\,\|b_j(t)\|,\qquad \varepsilon_j=\mathrm{sign}(a_j(0)),\qquad \theta_j(t)=\arccos\Big(\frac{\|(b_j(t))_H\|}{\|b_j(t)\|}\Big).
    \]

    Initializing through $a_j(0)\sim\mathcal{U}(\{-1,+1\})$ and $b_j(0)\sim\tilde\omega_d=\mathcal{U}(\mathbb{S}^{d-1})$ yields $c_j(0)\sim\mathcal{U}(\{-1,+1\})$ and $\theta_j(0)\sim\tilde\gamma$, i.i.d. over $j$. The gradient flow of Eq (4.5) translates into the following ODEs on $(c_j)_{j\in[1,m]}$ and $(\theta_j)_{j\in[1,m]}$:

    \[
    \frac{d}{dt}c_j(t)=2\,\varepsilon_j\,G_t(\theta_j(t))\,c_j(t),\qquad \frac{d}{dt}\theta_j(t)=\varepsilon_j\,V_t(\theta_j(t)),
    \]

    where εj=aj(0){1,+1} denotes whether the corresponding quantity appears in τ+t,m (ε=+1) or τt,m (ε=1).

    Time discretization. Simulating these ODEs via the discrete Euler scheme with step η>0, leads, for any iteration kN, to:

    \[
    c_j(k+1)=\big(1+2\,\eta\,\varepsilon_j\,G_k(\theta_j(k))\big)\,c_j(k),\qquad \theta_j(k+1)=\theta_j(k)+\eta\,\varepsilon_j\,V_k(\theta_j(k)). \qquad (\mathrm{F.1})
    \]

    Approximating integrals numerically. The only thing that needs to be dealt with numerically is estimating the values of Gt and Vt which are defined by integrals. With the discretization of the measures, we have:

    \[
    G_k(\theta)=\int_\varphi\big(\cos(\varphi)-\tilde f(\tau_k^+,\tau_k^-;\varphi)\big)\,\tilde\phi(\varphi;\theta)\,d\tilde\gamma(\varphi),\qquad \tilde f(\tau_k^+,\tau_k^-;\varphi)=\frac{1}{m}\sum_{j=1}^m c_j(k)\,\tilde\phi(\theta_j(k);\varphi),\qquad \tilde\phi(\theta;\varphi)=\int_{r,s\in[-1,1]}\sigma\big(r\cos(\varphi)\cos(\theta)+s\sin(\varphi)\sin(\theta)\big)\,d\tilde\gamma_{d_H}(r)\,d\tilde\gamma_{d_\perp}(s).
    \]

    We thus get:

    \[
    G_k(\theta)=\int \psi(r,s;\theta,\varphi)\Big(\cos(\varphi)-\frac{1}{m}\sum_{j=1}^m c_j(k)\,\psi(r',s';\theta_j(k),\varphi)\Big)\,d\tilde\gamma(\varphi)\,d(\tilde\gamma_{d_H})^{\otimes 2}(r,r')\,d(\tilde\gamma_{d_\perp})^{\otimes 2}(s,s'),
    \]

    with

    \[
    \psi(r,s;\theta,\varphi):=\sigma\big(r\cos(\varphi)\cos(\theta)+s\sin(\varphi)\sin(\theta)\big).
    \]

    Similarly, we have:

    \[
    V_k(\theta)=\int \chi(r,s;\theta,\varphi)\Big(\cos(\varphi)-\frac{1}{m}\sum_{j=1}^m c_j(k)\,\psi(r',s';\theta_j(k),\varphi)\Big)\,d\tilde\gamma(\varphi)\,d(\tilde\gamma_{d_H})^{\otimes 2}(r,r')\,d(\tilde\gamma_{d_\perp})^{\otimes 2}(s,s'),
    \]

    with

    \[
    \chi(r,s;\theta,\varphi):=\partial_\theta\psi(r,s;\theta,\varphi)=\sigma'\big(r\cos(\theta)\cos(\varphi)+s\sin(\theta)\sin(\varphi)\big)\big[-r\sin(\theta)\cos(\varphi)+s\cos(\theta)\sin(\varphi)\big].
    \]

    We use Monte-Carlo estimation through sampling to approximate the integrals against the five variables $(\varphi,r,r',s,s')$ by drawing $N$ samples from the corresponding distributions. We get:

    \[
    G_k(\theta_j(k))\approx\frac{1}{mN}\sum_{i=1}^N\sum_{l=1}^m\Psi_{ji}\big(\cos(\Phi_i)-c_l(k)\,\tilde\Psi_{li}\big),\qquad \Psi_{ji}(k)=\psi(R_i,S_i;\theta_j(k),\Phi_i),\qquad \tilde\Psi_{ji}(k)=\psi(R_i',S_i';\theta_j(k),\Phi_i),
    \]

    and similarly

    \[
    V_k(\theta_j(k))\approx\frac{1}{mN}\sum_{i=1}^N\sum_{l=1}^m\chi_{ji}\big(\cos(\Phi_i)-c_l(k)\,\tilde\Psi_{li}\big),\qquad \chi_{ji}(k)=\chi(R_i,S_i;\theta_j(k),\Phi_i),
    \]

    where we have drawn the samples i.i.d. over i[1,N]:

    \[
    \Phi_i\sim\tilde\gamma,\qquad R_i,R_i'\sim\tilde\gamma_{d_H},\qquad S_i,S_i'\sim\tilde\gamma_{d_\perp}.
    \]

    Iterations in the numerical simulation. Defining the vectors $c(k)=(c_j(k))_{j\in[1,m]}$, $\theta(k)=(\theta_j(k))_{j\in[1,m]}$, and $\varepsilon=(\varepsilon_j)_{j\in[1,m]}$, the update Eq (F.1) can then be written in terms of update rules using the matrices $\Psi(k)=(\Psi_{ji}(k))_{j,i\in[1,m]\times[1,N]}$, $\tilde\Psi(k)=(\tilde\Psi_{ji}(k))_{j,i\in[1,m]\times[1,N]}$, and finally $\chi(k)=(\chi_{ji}(k))_{j,i\in[1,m]\times[1,N]}$, and the vectors $(\Phi,R,R',S,S')=(\Phi_i,R_i,R_i',S_i,S_i')_{i\in[1,N]}$, which are re-sampled at each iteration $k\in[0,K]$, where $K\in\mathbb{N}$:

    \[
    c(k+1)=\big(1+2\,\eta\,\varepsilon\odot\hat G_k\big)\odot c(k),\qquad \theta(k+1)=\theta(k)+\eta\,\varepsilon\odot\hat V_k,
    \]

    where

    \[
    \hat G_k=\frac{1}{N}\,\Psi(k)\Big(\cos(\Phi)-\frac{1}{m}\,\tilde\Psi(k)^{\top}c(k)\Big),\qquad \hat V_k=\frac{1}{N}\,\chi(k)\Big(\cos(\Phi)-\frac{1}{m}\,\tilde\Psi(k)^{\top}c(k)\Big),
    \]

    and $\odot$ denotes the Hadamard (element-wise) product of two vectors. One can compute the loss through sampling in a similar way.
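
    To make these update rules concrete, here is a minimal Python sketch of one possible implementation (our own illustrative code, not the authors' implementation; the dimensions $d$, $d_H$ and the parameters $m$, $N$, $\eta$, $K$ are placeholder values, smaller than those used in the experiments described next, and $\sigma$ is the ReLU).

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_H = 10, 3                        # ambient dimension and dim(H): placeholder values
d_perp = d - d_H
m, N, eta, K = 256, 500, 5e-3, 2_000  # width, MC samples, step-size, iterations (placeholders)

relu = lambda z: np.maximum(z, 0.0)
relu_prime = lambda z: (z > 0).astype(float)

def sample_gamma_tilde(p, size):
    # Law of eps*sqrt(X), eps ~ U{-1,+1}, X ~ Beta(1/2,(p-1)/2): density prop. to (1-r^2)^((p-3)/2).
    return rng.choice([-1.0, 1.0], size=size) * np.sqrt(rng.beta(0.5, (p - 1) / 2, size=size))

def sample_theta(size):
    # theta = arccos(||u_H||) for u uniform on S^{d-1}, i.e. theta distributed as gamma_tilde on [0, pi/2].
    u = rng.standard_normal((size, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    return np.arccos(np.linalg.norm(u[:, :d_H], axis=1))

# Initialization: c_j(0) = eps_j in {-1,+1}, theta_j(0) ~ gamma_tilde.
eps_sign = rng.choice([-1.0, 1.0], size=m)
c = eps_sign.copy()
theta = sample_theta(m)

for k in range(K):
    # Fresh i.i.d. samples (Phi_i, R_i, R'_i, S_i, S'_i) at each iteration.
    Phi = sample_theta(N)
    R, Rp = sample_gamma_tilde(d_H, N), sample_gamma_tilde(d_H, N)
    S, Sp = sample_gamma_tilde(d_perp, N), sample_gamma_tilde(d_perp, N)

    # Psi[j, i] = psi(R_i, S_i; theta_j, Phi_i); Psip uses the primed samples.
    arg = np.outer(np.cos(theta), R * np.cos(Phi)) + np.outer(np.sin(theta), S * np.sin(Phi))
    argp = np.outer(np.cos(theta), Rp * np.cos(Phi)) + np.outer(np.sin(theta), Sp * np.sin(Phi))
    Psi, Psip = relu(arg), relu(argp)
    # Chi[j, i] = d/dtheta psi(R_i, S_i; theta_j, Phi_i).
    Chi = relu_prime(arg) * (np.outer(-np.sin(theta), R * np.cos(Phi))
                             + np.outer(np.cos(theta), S * np.sin(Phi)))

    residual = np.cos(Phi) - Psip.T @ c / m   # cos(Phi_i) - f_tilde(tau_k^+, tau_k^-; Phi_i)
    G_hat = Psi @ residual / N                # estimates G_k(theta_j)
    V_hat = Chi @ residual / N                # estimates V_k(theta_j)

    c = (1.0 + 2.0 * eta * eps_sign * G_hat) * c
    theta = theta + eta * eps_sign * V_hat

alpha_exp = c[eps_sign > 0].sum() / m         # experimental value of alpha
print(alpha_exp)
```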

    Experimental value for α and parameters of the numerical simulation. For the numerical simulations, we fix the number of atoms of the measure (or equivalently the width of the network) to $m=1024$, the learning rate to $\eta=5\cdot10^{-3}$, the number of samples for the Monte-Carlo scheme to $N=1000$, and the total number of iterations to $K=20{,}000$. The experimental value for $\alpha$ (see Section 4.2) is computed through $\alpha_{\exp}=\tau_{m,K}^+([0,\pi/2])$, that is

    \[
    \alpha_{\exp}=\frac{1}{m}\sum_{j\in J_+}c_j(K),\qquad J_+:=\{j\in[1,m]\ ;\ \varepsilon_j=+1\}.
    \]

    As mentioned in the main text, the behaviour of the numerical simulation depends a lot on the step-size $\eta$. Some of the differences between our observations and our intuitive description of the limiting model (infinite width and continuous time) can come from too big a step-size. We have thus run the numerical simulation with $\eta=2\cdot10^{-5}$ as well, for $K=230{,}000$ steps, but the same differences still appear (e.g., $\tau_{m,k}^+([0,\pi/2])$ still grows larger than the theoretically expected limit $\alpha$ after some time, albeit by a smaller margin), and after the critical time some negative particles seem to go slightly beyond $\pi/2$, even with a very small step-size, a fact which cannot happen for the limiting model. Consequently, in Figure 3, the first histogram bin right after $\pi/2$ has been merged with the one before.


