
The ability of neural networks to learn rich representations—or features—of their input data is commonly observed in state-of-the-art models [1,2] and is often thought to be the reason behind their good practical performance [3,Chap. 1]. Yet, our theoretical understanding of how feature learning arises from simple gradient-based training algorithms remains limited. Much progress (discussed in Section 1.3) has been made recently to understand the power and limitations of gradient-based learning with neural networks, showing in particular their superiority over fixed-feature methods on some difficult tasks. However, positive results are often obtained for algorithms that differ in substantial ways from plain (stochastic) gradient descent (e.g., the layers are trained separately, or the algorithm makes just one truly non-linear step).
In this work, we take the algorithm as given and instead adopt a descriptive approach. Our goal is to improve our understanding of how two-layer fully-connected ReLU neural networks trained with plain gradient descent (GD) behave in the presence of symmetries in the data. To this end, we investigate situations with strong symmetries on the data, the target function and the initial parameters, and study the properties of the training dynamics and of the learned predictor in this context.
We denote by d the input dimension, ρ the input data distribution which we assume to be uniform over the unit sphere Sd−1 of Rd, and by P2(Ω) the space of probability measures with finite second moments over a measurable space Ω. We call σ the activation function, which we take to be ReLU, that is σ(z)=max(0,z), ℓ:R×R→R the loss function, which we assume to be continuous in both arguments and continuously differentiable w.r.t. its second argument and we denote by ∂2ℓ this derivative.
Mean-field limit of two-layer networks. In this work, we consider the infinite-width limit in the mean-field regime of the training dynamics of two-layer networks without intercept with a ReLU activation function. Given a measure μ∈P2(R×Rd), we consider the infinitely wide two-layer network parameterized by μ, defined, for any input x∈Rd, by
f(μ;x)=∫c∈R1+dϕ(c;x)dμ(c), | (1.1) |
where, for any c=(a,b)∈R×Rd, ϕ(c;x)=aσ(b⊤x). Note that width-m two-layer networks with input weights (bj)j∈[1,m]∈(Rd)m and output weights (aj)j∈[1,m]∈Rm can be recovered by a measure μm=(1/m)∑mj=1δ(maj,bj) with m atoms.
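As a concrete illustration (a hypothetical NumPy snippet, not taken from the paper's repository), the following checks that evaluating Eq (1.1) on the discrete measure μm with atoms (maj,bj) returns the usual width-m network ∑jajσ(b⊤jx):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 20, 8
relu = lambda z: np.maximum(z, 0.0)          # sigma = ReLU

a = rng.standard_normal(m) / m               # output weights a_j
b = rng.standard_normal((m, d))              # input weights b_j

x = rng.standard_normal(d)
atoms_a = m * a                              # atoms of mu_m = (1/m) sum_j delta_{(m a_j, b_j)}
f_measure = np.mean(atoms_a * relu(b @ x))   # (1/m) sum_j phi((m a_j, b_j); x), cf. Eq (1.1)
f_network = np.sum(a * relu(b @ x))          # standard width-m two-layer network
assert np.isclose(f_measure, f_network)
```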
Objective and Wasserstein gradient flow. We consider the problem of minimizing the population loss objective for a given target function f∗:Rd→R, which we assume to be bounded on the unit sphere, that is
minμ∈P2(R×Rd)(F(μ):=Ex∼ρ[ℓ(f∗(x),f(μ;x))]). | (1.2) |
The Fréchet derivative of the objective function F at μ is given by the function F′μ(c)=Ex∼ρ[∂2ℓ(f∗(x),f(μ;x))ϕ(c;x)] for any c=(a,b)∈R×Rd (for more details, see Appendix B.1). Starting from a given measure μ0∈P2(R×Rd), we study the Wasserstein gradient flow (GF) of the objective (1.2) which is a path (μt)t≥0 in the space of probability measures satisfying, in the sense of distributions, the partial differential equation (PDE) known as the continuity equation:
∂tμt=−div(vtμt),vt(c):=−∇F′μt(c). | (1.3) |
Initialization. We make the following assumption on the initial measure μ0∈P2(R×Rd): μ0 decomposes as μ0=μ10⊗μ20 where (μ10,μ20)∈P2(R)×P2(Rd). This mirrors the standard initialization procedure at finite width. Because no direction should a priori be favored, we assume μ20 to have spherical symmetry, i.e., it is invariant under any orthogonal transformation, and we additionally assume that |a|=||b|| almost surely at initialization. It is shown in [4,Lemma 26], and [5,Section 2.5], that with this assumption, μt stays supported on the set {|a|=||b||} for any t≥0.
Comment on the assumptions. The assumption that μ0 decomposes as a product of two measures is made to stay as close as possible to what is done in practice (independent initialization for the two layers). The assumption that |a|=||b|| is of a technical nature, and, along with the regularity conditions on the loss ℓ and the input data distribution ρ, it ensures that the Wasserstein GF (1.3) is well-defined [5,Lemmas 3.1 and 3.9] when using ReLU as an activation function (which brings technical difficulties because of its non-smoothness). The results of Section 2 hold for other activation functions, which potentially require less restrictive assumptions on μ0 and ρ, but they still require μ0 to decompose as a product of measures. In contrast, the results of Sections 3 and 4 are specific to σ=ReLU and thus require the assumptions above on μ0 and ρ. Since our work focuses mostly on ReLU, we choose to state the results of all sections with the (more restrictive) assumptions stated above on μ0 and ρ.
Relationship with finite-width GD. If μ0=(1/m)∑mj=1δ(aj(0),bj(0)) is discrete, the Wasserstein GF (1.3) is exactly continuous-time GD on the parameters of a standard finite-width neural network, and discretization errors (w.r.t. the number of neurons) can be provided [6,7].
Our main object of study is the gradient flow of the population risk of infinitely wide two-layer ReLU neural networks without intercept. Our motivation for considering this idealized setting—infinite data and infinite width—is that it allows, under suitable choices for ρ and μ0, the emergence of exact symmetries which are only approximate in the non-asymptotic setting*.
*In contrast, our focus on GF is only for theoretical convenience and most of our results could be adapted to the case of GD.
Symmetries, structure, and convergence. In this work, we are interested in the structures learned by the predictor f(μt;⋅) under GF as t grows large. Specifically, we make the following contributions:
● In Section 2, we prove that if f∗ is invariant under some orthogonal linear map T, then f(μt;⋅) inherits this invariance under GF (Proposition 2.1).
● In Section 3, we study the case when f∗ is an odd function and show that the network converges to the best linear approximator of f∗ at an exponential rate (Theorem 3.2). Linear predictors are optimal over the hypothesis class in that case, in particular because there is no intercept in our model.
● In Section 4, we consider the multi-index model where f∗ depends only on the orthogonal projection of its input onto some sub-space H of dimension dH. We prove that the dynamics can be reduced to a PDE in dimension dH. If, in addition, f∗ is the Euclidean norm of the projection of the input, we show that the dynamics reduce to a one-dimensional PDE (Theorem 4.3). In the latter case, we were not able to prove theoretically the convergence of the neurons of the first layer towards H, and we leave this as an open problem, but we provide numerical evidence in favor of this result.
The code to reproduce the results of the numerical experiments can be found at: https://github.com/karl-hajjar/learning-structure.
Infinite-width dynamics. It has been shown rigorously that for infinitely wide networks there is a clear distinction between a feature-learning regime and a kernel regime [8,9]. For shallow networks, this difference stems from a different scale (w.r.t. width) of the initialization where a large initialization leads to the Neural Tangent Kernel (NTK) (a.k.a. the "lazy regime") which is equivalent to a kernel method with random features [10] whereas a small initialization leads to the so-called mean-field (MF) limit where features are learned from the first layer [8,9]. However, it is unclear in this setting exactly what those features are and what underlying structures are learned by the network. The aim of the present work is to study this phenomenon from a theoretical perspective for infinitely wide networks and to understand the relationship between the ability of networks to learn specific structures and the symmetries of a given task.
A flurry of works study the dynamics of infinitely wide two-layer neural networks. [5,6,11,12,13] study the gradient flow dynamics of the MF limit and show that they are well-defined in general settings and lead to convergence results (local or global depending on the assumptions). On the other hand, Jacot et al. [10] study the dynamics of the NTK parameterization in the infinite-width limit and show that it amounts to learning a linear predictor on top of random features (fixed kernel), so that there is no feature learning.
Convergence rates. In the MF limit, convergence rates are in general difficult to obtain in a standard setting. For instance, [5,11] show the convergence of the GF to a global optimum in a general setting but this does not allow convergence rates to be provided. To illustrate the convergence of the parameterizing measure to a global optimum in the MF limit, Ma et al. [14] prove local convergence (see Section 7 therein) for one-dimensional inputs and a specific choice of target function in O(t−1) where t is the time step. At finite width, Daneshmand and Bach [15] also prove convergence of the parameters to a global optimum in O(t−1) using an algebraic idea which is specific to the ad-hoc structure they consider (inputs in two dimensions and target functions with a finite number of atoms).
In Section 3, we show convergence of the MF limit at an exponential rate when the target function is odd. In the setting of this section, the training dynamics are degenerate and although input neurons move, the symmetries of the problem imply that the predictor is linear.
Low-dimensional structure. Studying how neural networks can adapt to hidden low-dimensional structures is a way of approaching theoretically the feature-learning abilities of neural networks. Bach [16] studies the statistical properties of infinitely wide two-layer networks, and shows that when the target function only depends on the projection on a low-dimensional sub-space, these networks circumvent the curse of dimensionality with generalization bounds which only depend on the dimension of the sub-space. In a slightly different context, Chizat and Bach [4] show that for a binary classification task, when there is a low-dimensional sub-space for which the projection of the data has sufficiently large inter-class distance, only the dimension of the sub-space (and not that of the ambient space) appears in the upper bound on the probability of misclassification. Whether or not such a low-dimensional sub-space is actually learned by GD is not addressed in these works.
Similarly, [17,18] focus on learning functions which have a hidden low-dimensional structure with neural networks. They consider a single step of GD on the input layer weights and show that the approximation / generalization error adapts to the structure of the problem: they provide bounds on the number of data points / parameters needed to achieve negligible error, which depend on the reduced dimension and not on the dimension of the ambient space. In a similar context, Mousavi-Hosseini et al. [19] consider (S)GD on the first layer only of a finite-width two-layer network and show that, with sufficient L2-regularization and a standard normal distribution on the input data, the first-layer weights align with the lower-dimensional sub-space when trained for long enough. They then use this property to provide statistical results on networks trained with SGD.
In a setting close to ours but on a classification task with finite-data and at finite-width, Paccolat et al. [20] compare the feature learning regime with the NTK regime in the presence of hidden low-dimensional structure and quantify for each regime the scaling law of the test error w.r.t. the number of training samples, mostly focusing on the case dH=1.
In a similar setting to that of [16], Abbe et al. [21] study how GF for infinitely wide two-layer networks can learn specific classes of functions which have a hidden low-dimensional structure when the inputs are Rademacher variables. This strong symmetry assumption ensures that the learned predictor shares the same low-dimensional structure at any time step (from t=0), and this allows them to characterize precisely what classes of target functions can or cannot be learned by GF in this setting. In contrast, we are interested in how infinitely wide networks learn those low-dimensional structures during training, and in the role of symmetries in enabling such a behaviour after initialization.
Learning representations. An existing line of work [18,22,23,24,25] studies in depth the representations learned by neural networks trained with (S)GD at finite width from a different perspective, focusing on the advantages of feature learning in terms of performance compared with using random features. In contrast, our aim is to describe the representations themselves in relationship with the symmetries of the problem.
Symmetries. We stress that the line of work around symmetries of neural networks dealing with finding network architectures for which the output is invariant (w.r.t. its input or parameters) under some group of transformations (see [26,27,28], and references therein) is entirely different from what we are concerned with in the present work. In contrast, the setting of Mei et al. [6] is much closer to ours, as they study how the invariances of the target function / input data can lead to simplifications in the dynamics of infinitely wide two-layer networks in the mean-field regime, which allows them to prove global convergence results.
We denote by M+(Ω) the space of non-negative measures over a measurable space Ω. For any measure μ and measurable map T, T#μ denotes the pushforward measure of μ by T. We denote by O(p) and idRp respectively the orthogonal group and the identity map of Rp for any p∈N. Finally, ⟨⋅,⋅⟩ is the Euclidean inner product and ||⋅|| the corresponding norm.
In this section, we demonstrate that if the target function f∗ is invariant under some orthogonal transformation T, then, since the input data distribution is also invariant under T, f(μt;⋅) is invariant under T as well for any t≥0. This invariance property of the dynamics w.r.t. orthogonal symmetries is exact with an infinite number of neurons but only approximate at finite width. It is noteworthy that the results of this section hold for any activation function σ and any input data distribution ρ which has the same symmetries as f∗, provided that the Wasserstein GF (1.3) is unique. We start with a couple of definitions:
Definition 2.1 (Function invariance). Let T be a map from Rd to Rd, and f:Rd→R. Then, f is said to be invariant (resp. anti-invariant) under T if for any x∈Rd, f(T(x))=f(x) (resp. f(T(x))=−f(x)).
Definition 2.2 (Measure invariance). Let Ω⊂Rd, T be a measurable map from Ω to Ω, and μ be a measure on Ω. Then, μ is said to be invariant under T if T#μ=μ, or equivalently, if for any continuous and compactly supported φ:Ω→R, ∫φ(x)dμ(x)=∫φ(T(x))dμ(x).
We are now ready to state the two main results of this section.
Proposition 2.1 (Learning invariance). Let T∈O(d), and assume that f∗ is invariant under T. Then, for any t≥0, the Wasserstein GF μt of Eq (1.3) is invariant under ˜T:(a,b)∈R×Rd↦(a,T(b)), and the corresponding predictor f(μt;⋅) is invariant under T.
Proposition 2.2 (Learning anti-invariance). Under the same assumptions as in Proposition 2.1, except that we now assume f∗ is anti-invariant under T, and assuming further that ∂2ℓ(−y,−ˆy)=−∂2ℓ(y,ˆy) for any y,ˆy∈R, and that μ10 is symmetric around 0 (i.e., invariant under a∈R↦−a), we then have that for any t≥0, the Wasserstein GF μt in Eq (1.3) is invariant under ˜T:(a,b)∈R×Rd↦(−a,T(b)), and the corresponding predictor f(μt;⋅) is anti-invariant under T.
Remark. The results above also hold for networks with intercepts at both layers. The conditions of Proposition 2.2 are satisfied by both the squared loss and the logistic loss (a.k.a. the cross-entropy loss).
Essentially, those results show that training with GF preserves the orthogonal symmetries of the problem: the invariance of the target function under an orthogonal transformation leads to the same invariance for μt and f(μt;⋅). The proof, presented in Appendix C, relies crucially on the fact that T is an orthogonal map which combines well with the structure of ϕ(c;x) involving an inner product. The idea is essentially that the orthogonality of T allows us to relate the gradient of ϕ (and consequently of F′μt) w.r.t. c at (T(c);x) to the same gradient at (c;T−1(x)) and then to use the invariance of f∗ and ρ to conclude.
In the following sections, we discuss the particular cases where functions are (anti-)invariant under −idRd (i.e., even or odd functions) or under some sub-group of O(d).
We consider here an odd target function, i.e., for any x∈Rd, f∗(−x)=−f∗(x).
Linearity of odd predictors. Proposition 2.2 ensures that the predictor f(μt;⋅) associated with the Wasserstein GF of Eq (1.3) is also odd at any time t≥0, and we can thus write, for any x, f(μt;x)=12(f(μt;x)−f(μt;−x)), which yields
f(μt;x)=12(∫a,ba[σ(b⊤x)−σ(−b⊤x)]dμt(a,b))=12∫a,ba(b⊤x)dμt(a,b), |
where the last equality stems from the fact that, for ReLU, σ(z)−σ(−z)=z for any z∈R. Put differently, the predictor is linear: it is the same as replacing σ by 12idRd, and f(μt;x)=w(t)⊤x, where
w(t):=12∫a,babdμt(a,b)∈Rd. | (3.1) |
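This identity can be checked numerically at finite width: for any ReLU network f(x)=∑jajσ(b⊤jx), the odd part 12(f(x)−f(−x)) coincides exactly with the linear map x↦w⊤x where w=12∑jajbj, the finite-width counterpart of Eq (3.1). A minimal hypothetical sketch with arbitrary weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 10, 512
relu = lambda z: np.maximum(z, 0.0)

a = rng.standard_normal(m)                       # arbitrary output weights
b = rng.standard_normal((m, d))                  # arbitrary input weights

f = lambda x: relu(x @ b.T) @ a                  # f(x) = sum_j a_j sigma(b_j^T x)
w = 0.5 * (a[:, None] * b).sum(axis=0)           # w = 1/2 sum_j a_j b_j

x = rng.standard_normal((1000, d))
f_odd = 0.5 * (f(x) - f(-x))                     # odd part of the predictor
assert np.allclose(f_odd, x @ w)                 # it is exactly the linear map x -> w^T x
```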
This degeneracy is not surprising as in fact, a linear predictor is the best one can hope for in this setting. Indeed, consider the following assumption and the next lemma:
Assumption 1 (Squared loss function). The loss function ℓ is the squared loss, i.e., ℓ(y,ˆy)=12(y−ˆy)2, and thus satisfies the condition of Proposition 2.2.
We make this assumption in order to provide an explicit convergence rate in Theorem 3.2 below.
Lemma 3.1 (Optimality of odd predictors). Let f be a predictor in the hypothesis class F:={x↦∫aσ(b⊤x)dμ(a,b);μ∈P2(R×Rd)}. Then, denoting fodd(x):=12(f(x)−f(−x)) (resp. feven(x):=12(f(x)+f(−x))) the odd (resp. even) part of f, one has:
(i) fodd∈F, (ii) L(f):=Ex∼ρ[(f∗(x)−f(x))2]≥Ex∼ρ[(f∗(x)−fodd(x))2]=:L(fodd), (iii) equality holds in (ii) if and only if f is odd ρ-almost surely.
Proof. The result readily follows from the decomposition f=fodd+feven which leads to
L(f)=L(fodd)+Ex∼ρ[(feven(x))2]−2Ex∼ρ[(f∗(x)−fodd(x))feven(x)], where the second term is non-negative and the last term is 0 by symmetry (its integrand is an odd function of x and ρ is symmetric).
We then get that L(f)≥L(fodd) with equality if and only if Ex∼ρ[(feven(x))2]=0, i.e., , feven(x)=0 for ρ-almost every x. Finally, if μ∈P2(Rd+1), then ν:=12(μ+S#μ)∈P2(Rd+1), where S:(a,b)∈Rd+1↦(−a,−b), and f(ν;⋅)=fodd(μ;⋅), which shows fodd(μ;⋅)∈F.
Since, as shown above, any odd predictor turns out to be linear because of the symmetries of ReLU, in this context, the best one can expect is thus to learn the best linear predictor.
Exponential convergence for linear networks. We are thus reduced to studying the dynamics of linear networks (which in our case are infinitely wide), which is an interesting object of study in its own right (Ji and Telgarsky [29] show a result similar to ours below in the finite-width case, with the logistic loss on a binary classification task). In this case, the Wasserstein GF (1.3) (with ReLU replaced by 12idRd) is defined for more general input distributions P∈P2(Rd) (e.g., empirical measures) and target functions f∗. The objective in this context is thus to learn:
w⋆∈argminw∈Rd(Q(w):=12Ex∼P[(f∗(x)−⟨w,x⟩)2]) | (3.2) |
with the dynamics of linear infinitely wide two-layer networks described by the Wasserstein GF (1.3) where the activation function σ is replaced by 12idRd. Theorem 3.2 below shows exponential convergence to a global minimum of Q as soon as the problem is strongly convex. Note that although in this case both ϕ(⋅;⋅) (see Eq (1.1)) and the predictor in the objective Q are linear w.r.t. the input, only the predictor in Q is linear in the parameters (ordinary least squares).
Theorem 3.2. Assume that the smallest eigenvalue λmin of Ex∼P[xx⊤] is positive. Let (μt)t≥0 be the Wasserstein GF associated to (1.3) with activation function 12idRd instead of σ=ReLU, and call w(t)=12∫abdμt(a,b)∈Rd. Then, there exist η>0 and t0>0 such that, for any t≥t0,
(Q(w(t))−Q(w⋆))≤e−2ηλmin(t−t0)(Q(w(t0))−Q(w⋆)). |
Remark. Note that as soon as P has spherical symmetry, the problem becomes strongly convex by Lemma B.3. Note that although F(μt)=Q(w(t)), (w(t))t≥0 is not a gradient flow for the (strongly) convex objective Q (which would immediately guarantee exponential convergence to the global minimum).
The proof, provided in Appendix E, proceeds in two steps: first it is shown that w′(t)=−H(t)∇Q(w(t)) for some positive definite matrix H(t) whose smallest eigenvalue is always lower-bounded by a positive quantity, and then we prove that this leads to exponential convergence. Figure 1 illustrates that the dynamics of GF on F remain non-linear in that they do not reduce to GF on Q (although the paths are close). To simulate GF on F we use a large (but finite) number of neurons m=1024 and a small (but positive) step size 10−2, and simply perform GD on the corresponding finite-dimensional objective (see the comment in Section 1.1 on the relationship between the Wasserstein GF and finite-width GD). The dashed curves are the trajectories of w(t) (see Theorem 3.2) for 10 random initializations, which correspond to GD on F, whereas the blue curve is the trajectory of GD on Q starting from 0. P=1n∑ni=1δxi is the empirical distribution over n=100 samples, with xi∼U([−1,1]d) and yi=f∗(xi) drawn i.i.d. from N(y∗,2) where y∗∼N(0,1).
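One possible (hypothetical) implementation of this comparison is sketched below: it runs an Euler discretization of the Wasserstein GF on a width-m network with the activation replaced by z↦z/2, tracks w(t)=12∫abdμt(a,b), and runs plain GD on Q in parallel; the target values and hyperparameters are placeholders rather than those used for Figure 1.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n, dt, n_steps = 2, 1024, 100, 1e-2, 500

# Empirical data distribution P and placeholder targets (the exact law of Figure 1 is not reproduced here).
X = rng.uniform(-1.0, 1.0, size=(n, d))
y = rng.standard_normal(n)

# Width-m network in mean-field parameterization, f(x) = (1/m) sum_j A_j * (b_j^T x) / 2.
A = rng.choice([-1.0, 1.0], size=m)
B = rng.standard_normal((m, d))
B /= np.linalg.norm(B, axis=1, keepdims=True)            # |A_j| = ||b_j|| = 1 at initialization

w_of = lambda A, B: 0.5 * (A[:, None] * B).mean(axis=0)  # w(t) = (1/2) int a b dmu_t
w_lin = np.zeros(d)                                      # plain GD on Q, started at 0

traj_net, traj_lin = [], []
for _ in range(n_steps):
    # Euler step on the Wasserstein GF velocity (sigma replaced by id/2).
    r = 0.5 * (X @ B.T) @ A / m - y                      # residuals f(mu_t; x_i) - y_i
    grad_A = 0.5 * (X @ B.T).T @ r / n                   # E_P[r(x) * b_j^T x / 2]
    grad_B = 0.5 * np.outer(A, X.T @ r / n)              # E_P[r(x) * A_j * x / 2]
    A -= dt * grad_A
    B -= dt * grad_B
    traj_net.append(w_of(A, B))
    # Plain GD on the (strongly convex) objective Q.
    w_lin = w_lin - dt * (X.T @ (X @ w_lin - y) / n)
    traj_lin.append(w_lin)

# traj_net and traj_lin can then be plotted to compare the two trajectories in the plane.
```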
Consider a linear sub-space H of dimension dH<d (potentially much smaller than the ambient dimension), and assume f∗ has the following structure: f∗(x)=fH(pH(x)) where pH is the orthogonal projection onto H (which we also write xH for simplicity, and we reserve sub-scripts for denoting entries of vectors) and fH:H→R is a given function.
In this context it is natural to study whether the learned function shares the same structure as f∗. As observed in Figure 2, this is not the case in finite time, but it is nonetheless reasonable to think that the learned predictor f(μt;⋅) shares the same structure as f∗ as t→∞, and we give numerical evidence in this direction. On the other hand, we prove rigorously that the structure of the problem allows the dynamics to be reduced to a lower-dimensional PDE. In this section, we consider for simplicity that μ10 is the uniform distribution over {−1,+1} and that μ20 is the uniform distribution over Sd−1.
Comment on the assumptions for this section. The assumption that |a|=||b|| on the support of μ0 is crucial here. This ensures that the Wasserstein GF (1.3) is well-defined and that μt stays supported on the set {|a|=||b||} for any t≥0, a fact which is used in the proofs. The assumption that ρ is the uniform distribution over the unit sphere bears some importance, but it could likely be replaced by other measures with spherical symmetry, provided that the dynamics remain well-defined, at the cost of more technical proofs.
The structure of f∗ implies that it is invariant by any T∈O(d) which preserves H, i.e., such that its restrictions to H and H⊥ satisfy T|H=idH and T|H⊥∈O(d⊥), where O(d⊥) is the orthogonal group of H⊥, whose dimension is d⊥=d−dH. By Proposition 2.1, such transformations also leave the predictor f(μt;⋅) invariant for any t≥0 since ρ is spherically symmetric. Lemma 4.1 below then ensures that f(μt;x) depends on the projection x⊥ onto H⊥ only through its norm, that is f(μt;x)=˜ft(xH,||x⊥||) for some ˜ft:H×R+→R.
Lemma 4.1 (Invariance by a sub-group of O(d)). Let f:Rd→R be invariant under any T∈O(d) such that T|H=idH and T|H⊥∈O(d⊥). Then, there exists some ˜f:H×R+→R such that for any x∈Rd, f(x)=˜f(xH,||x⊥||).
Proof. Consider ˜f:(xH,r)∈H×R+↦f(xH+re⊥1) where e⊥1 is the first vector of an orthonormal basis of H⊥, and let x∈Rd. If x⊥=0, the result is obvious. Otherwise, consider an orthogonal linear map Tx such that Tx|H=idH and Tx sends x⊥/||x⊥|| to e⊥1. The invariance of f under Tx implies f(x)=f(Tx(x))=f(xH+||x⊥||e⊥1)=˜f(xH,||x⊥||).
Figure 2 shows that the dependence in ||x⊥|| cannot be removed in finite time: f(μt;uH+re⊥1) does depend on the distance r∈R+ to H, but this dependence tends to vanish as t→∞. The plots of Figure 2 are obtained by discretizing the initial measure μ0,m=1m∑mj=1δ(aj(0),bj(0)) with m=1024 atoms, and sampling aj(0)∼U({−1,+1}) and bj(0)∼U(Sd−1). We perform GD with a finite step size η and a finite number n=256 of fresh i.i.d. samples from the data distribution per step, with f∗(x)=||xH||, d=20 and dH=5.
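A possible (hypothetical) NumPy sketch of this experiment is given below; the step size is illustrative since the value used for Figure 2 is not specified here:

```python
import numpy as np

rng = np.random.default_rng(3)
d, dH, m, batch, dt, n_steps = 20, 5, 1024, 256, 0.05, 3000    # dt is illustrative
relu = lambda z: np.maximum(z, 0.0)

def sphere(k, p):
    g = rng.standard_normal((k, p))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

f_star = lambda x: np.linalg.norm(x[:, :dH], axis=1)    # f*(x) = ||x_H||, H = span of the first dH coordinates

A = rng.choice([-1.0, 1.0], size=m)                     # a_j(0) ~ U({-1,+1})
B = sphere(m, d)                                        # b_j(0) ~ U(S^{d-1}), so |a_j| = ||b_j||
predict = lambda x: relu(x @ B.T) @ A / m               # f(mu_{m,t}; x)

for _ in range(n_steps):
    x = sphere(batch, d)                                # n = 256 fresh samples from rho at every step
    pre = x @ B.T
    r = relu(pre) @ A / m - f_star(x)                   # residuals for the squared loss
    A -= dt * relu(pre).T @ r / batch                   # Euler step on the GF velocity for a_j
    B -= dt * ((pre > 0) * A * r[:, None]).T @ x / batch  # ... and for b_j

# Probe the dependence of the predictor on the distance r to H (cf. Figure 2); ideally f* = 1 along this line.
uH, e_perp = np.zeros(d), np.zeros(d)
uH[0], e_perp[-1] = 1.0, 1.0                            # unit vectors in H and in H-perp
for rad in (0.0, 0.5, 1.0, 2.0):
    print(rad, predict((uH + rad * e_perp)[None, :])[0])
```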
Dynamics over the sphere Sd−1. Using the positive 1-homogeneity of ReLU, and with the assumptions on μ0, the dynamics on μt∈P2(Rd+1) can be reduced to dynamics on the space M+(Sd−1) of non-negative measures over Sd−1: only the direction of neurons matter and their norm only affects the total mass. From this point of view, neurons with positive and negative output weights behave differently and have separate dynamics. Indeed, consider the pair of measures (ν+t,ν−t)∈M+(Sd−1)2 characterized by the property that for any continuous test function φ:Sd−1→R,
∫uφ(u)dν±t(u)=∫±a≥0,b|a|||b||φ(b||b||)dμt(a,b), | (4.1) |
where we have used the superscript ± to denote either ν+t or ν−t, and the right-hand side is changed accordingly (through the integration domain) depending on the sign + or −. Because ReLU is positively 1-homogeneous, we have f(μt;x)=∫σ(u⊤x)d(ν+t−ν−t)(u). It is shown in Appendix E.1.1 that ν±t satisfies, in the sense of distributions, the equation
∂tν±t=−div(±˜vtν±t)±2gtν±t, | (4.2) |
where, for any u∈Sd−1,
gt(u)=−∫y∂2ℓ(f∗(y),f(μt;y))σ(u⊤y)dρ(y),˜vt(u)=−∫y∂2ℓ(f∗(y),f(μt;y))σ′(u⊤y)[y−(u⊤y)u]dρ(y). | (4.3) |
Equation (4.2) can be interpreted as a Wasserstein-Fisher-Rao GF [30] on the sphere since ˜vt(u)=proj{u}⊥(∇gt(u)).
Closed dynamics over [0,π/2]×SdH−1. The dynamics on the pair (ν+t,ν−t) can be further reduced to dynamics over [0,π/2]×SdH−1. Indeed, by positive 1-homogeneity of f(μt;⋅) we may restrict ourselves to inputs u∈Sd−1, and f(μt;u) depends only on uH and ||u⊥||. However, because ||uH||2+||u⊥||2=1, this dependence translates into a dependence on the direction uH/||uH|| of the projection onto H and the norm ||uH||. The former is an element of SdH−1 while the latter is given by the angle θ between u and H, that is θ:=arccos(u⊤uH/||uH||)=arccos(||uH||). This simplification leads to the following lemma:
Lemma 4.2. Define the measures τ+t,τ−t by τ±t=P#ν±t∈M+([0,π/2]×SdH−1) via P:u∈Sd−1∖H⊥↦(arccos(||uH||),uH/||uH||)∈[0,π/2]×SdH−1. Then, the measures τ+t,τ−t satisfy the equation
∂tτ±t=−div(±Vtτ±t)±2Gtτ±t, | (4.4) |
where Gt:[0,π/2]×SdH−1→R, and Vt:[0,π/2]×SdH−1→RdH+1 are functions depending only on (τ+t,τ−t), and furthermore, f(μt;⋅) can be expressed solely using τ+t,τ−t (exact formulas are provided in Appendix E.1.2).
Abbe et al. [21] show a similar result with lower-dimensional dynamics in the context of infinitely wide two-layer networks when the input data have i.i.d. coordinates distributed uniformly over {−1,+1} (i.e., Rademacher variables), except that they do not have the added dimension due to the angle θ as we do, thanks to their choice of input data distribution.
Lemma 4.2 above illustrates how the GF dynamics of infinitely wide two-layer networks adapt to the lower-dimensional structure of the problem: the learned predictor and the dynamics can be described only in terms of the angle θ between the input neurons and H and of their projection on the unit sphere of H.
Since the predictors we consider are positively homogeneous, one cannot hope to do better than learn a positively homogeneous function. A natural choice of such a target function to learn is the Euclidean norm. With the additional structure that the target only depends on the projection onto H, this leads to considering f∗(x)=||xH||, which has additional symmetries compared to the general case presented above: it is invariant by any linear map T such that T|H∈O(dH) and T|H⊥∈O(d⊥). By Proposition 2.1 those symmetries are shared by μt and f(μt;⋅), and we show that in this case the dynamics reduce to one-dimensional dynamics over the angle θ between input neurons and H.
We prove a general disintegration result for the uniform measure on the sphere in the Appendix (see Lemma B.4) which, along with some spherical harmonics analysis, allows us to describe the reduced dynamics and characterize the objective that they optimize. This leads to the following result:
Theorem 4.3 (1d dynamics over the angle θ). Assume that f∗(x)=||xH||, and define the measures (τ+t,τ−t)∈M+([0,π/2])2 from (ν+t,ν−t) via P:u∈Sd−1↦arccos(||uH||)∈[0,π/2]: τ±t=P#ν±t. Then, the pair (τ+t,τ−t) follows the Wasserstein-Fisher-Rao GF for the objective A(τ+,τ−):=E[ℓ(f(τ+,τ−;x),f∗(x))] over the space M+([0,π/2])×M+([0,π/2]), where f(τ+,τ−;x) is the expression (with a slight overloading of notations) of f(μ;x) in function of (τ+,τ−) (see Appendix E.2 for more details):
dτ±0(θ)=1B(dH2,d⊥2)cos(θ)dH−1sin(θ)d⊥−1dθ,∂tτ±t=−div(±Vtτ±t)±2Gtτ±t, | (4.5) |
where B is the Beta function, and
Gt(θ)=−∫y∂2ℓ(f∗(y),f(μt;y))σ(cos(θ)yH1+sin(θ)y⊥1)dρ(y),Vt(θ)=G′t(θ). |
Additionally, f(μt;⋅), Gt, and Vt only depend on the pair (τ+t,τ−t), and for any t≥0, it holds that F(μt)=A(τ+t,τ−t).
Remark. The result should still hold for general ρ which are spherically symmetric as long as the Wasserstein GF (1.3) is well-defined but the proof is more technical. In addition, this result shows that even with more structure than in Lemma 4.2, the dynamics of infinitely wide two-layer networks are still able to adapt to this setting: these dynamics, as well as the learned predictor, can be fully characterized solely by the one-dimensional dynamics over the angle θ between input neurons and H. This is noteworthy since this angle determines the alignment of the neurons with H, and thus measures how much the representations learned by the network have adapted to the structure of the problem. Furthermore, as discussed below, this reduction with exact formulas enables efficient numerical simulation in one dimension.
Daneshmand and Bach [15] prove the global convergence of a reduced one-dimensional dynamics in a context similar to ours but their original problem is two-dimensional and with a choice of activation function that leads to specific algebraic properties.
Expression of f(μt;⋅). Because of the symmetries of f(μt;⋅), which result from that of f∗, f(μt;x) depends only on ||xH|| and ||x⊥||. What is more, since f(μt;⋅) is positively 1-homogeneous (because ReLU is) it actually holds that f(μt;x)=||x||˜ft(φx) where φx=arccos(||xH||/||x||) is the angle between x and H, and ˜ft(φ):=∫θ˜ϕ(θ;φ)d(τ+t−τ−t)(θ), ˜ϕ depending only on σ and fixed probability measures (see Appendix E.1.2 for an exact formula).
Learning the low-dimensional structure as t→∞. Although, as shown in Figure 2, f(μt;⋅) does not learn the low-dimensional structure in finite-time, it is reasonable to expect that as t→∞, the measures τ±t put mass only on θ=0, indicating that the only part of the space that the predictor is concerned with for large t is the sub-space H. Since we assume here that the target function f∗ is non-negative, the most natural limits for τ+t and τ−t are τ+t→αδ0 with α>0, and τ−t→0 (in the sense that τ−t([0,π/2])→0) as t→∞, because then the "negative" output weights do not participate in the prediction in the large t limit.
The global convergence result of Chizat and Bach [5,11] still holds but is not quantitative and moreover does not guarantee that the limit is the one described above. We leave the proof of this result as an open problem, but we provide numerical evidence supporting this conjecture. Indeed, we take advantage of the one-dimensional reduction from Theorem 4.3, and numerically simulate the resulting dynamics by parameterizing τ±t via weight and position [31] as τ±m,t=(1/m)∑mj=1c±j(t)δθ±j(t), and simulating the corresponding dynamics for c±j(t) and θ±j(t). The corresponding results are depicted in Figure 3; they are again obtained by discretizing the initial measures τ+0,τ−0 and performing GD with a finite step size (see the Appendix for more details). Figure 3(a), (b) show that the mass of τ+t tends to concentrate around 0 while that of τ−t tends to concentrate around π/2, indicating that τ+t adapts to the part of the space relevant to learning f∗ while τ−t puts mass close to the orthogonal to that space.
Total mass of particles at convergence. If τ−∞=0 and τ+∞=αδ0 as described above, we have f(μ∞;x)=α||x||˜ϕ(0;φx)=αΓ(dH/2)/(2√πΓ((dH+1)/2))||x||cos(φx)=αΓ(dH/2)/(2√πΓ((dH+1)/2))||xH||. To recover exactly f∗, it must hold that α=τ+∞([0,π/2])=2√πΓ((dH+1)/2)/Γ(dH/2). Defining the normalized probability measure ˜τ±t=τ±t/τ±t([0,π/2]), we thus expect ˜τ+t to grow close to δ0 and ˜τ−t to δπ/2. In terms of total mass, we expect that τ+t([0,π/2]) gets closer to α while τ−t([0,π/2]) gets closer to 0.
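As a sanity check of this constant (a hypothetical snippet, not part of the paper's code), Γ(dH/2)/(2√πΓ((dH+1)/2)) equals Eu∼Unif(SdH−1)[σ(u⊤e)] for any unit vector e∈H, which can be verified by Monte Carlo; α is its inverse:

```python
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(0)
dH, n = 5, 1_000_000

c = gamma(dH / 2) / (2 * np.sqrt(np.pi) * gamma((dH + 1) / 2))   # closed-form constant in phi~(0; phi) = c*cos(phi)
alpha = 1.0 / c                                                  # = 2 sqrt(pi) Gamma((dH+1)/2) / Gamma(dH/2)

u = rng.standard_normal((n, dH))
u /= np.linalg.norm(u, axis=1, keepdims=True)                    # u ~ Unif(S^{dH-1})
mc = np.maximum(u[:, 0], 0.0).mean()                             # Monte Carlo estimate of E[sigma(u^T e)]

print(c, mc, alpha)   # for dH = 5: c = 0.1875, alpha = 16/3 ~ 5.33
```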
The numerical behaviour depicted in Figure 3(c) seems to follow our intuitive description, at least until a critical time t∗ in the numerical simulation, which corresponds to the first time t where τ+t([0,π/2])>α. While the total mass of τ±t (dashed lines) seems to approach its limit rapidly before t∗, it slowly moves further away from it for t≥t∗. On the other hand, while the angles only slowly change before t∗, they start converging fast towards the corresponding Dirac measures after t∗. It is unclear whether this slight difference in behaviour (around the critical time t∗) between what we intuitively expected and the numerical simulation is an artefact of the finite width and finite step size or if it actually corresponds to some phenomenon present in the limiting model. For more details concerning the numerical experiments, see the Appendix. Note that there is a priori not a unique global optimum: τ+∞ and τ−∞ (if they exist) can compensate on parts of the space [0,π/2] and lead to the same optimal predictor for different choices of measures. Our numerical experiments suggest that the GF dynamics select a "simple" solution where τ+∞ is concentrated on {θ=0} and τ−∞ vanishes (puts 0 mass everywhere), which is a form of implicit bias.
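As a rough, hypothetical full-dimensional proxy for this behaviour (not the authors' reduced 1-D simulation of Figure 3), one can train a finite-width network directly in Rd and monitor the finite-width analogues of the masses and angles of τ±t; the hyperparameters below are illustrative, and over moderate training horizons the reported quantities are only expected to move in the direction described above:

```python
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(4)
d, dH, m, batch, dt, n_steps = 20, 5, 1024, 256, 0.05, 5000    # illustrative hyperparameters
relu = lambda z: np.maximum(z, 0.0)

def sphere(k, p):
    g = rng.standard_normal((k, p))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

f_star = lambda x: np.linalg.norm(x[:, :dH], axis=1)           # f*(x) = ||x_H||
A, B = rng.choice([-1.0, 1.0], size=m), sphere(m, d)

for _ in range(n_steps):                                       # same training loop as for Figure 2
    x = sphere(batch, d)
    pre = x @ B.T
    r = relu(pre) @ A / m - f_star(x)
    A -= dt * relu(pre).T @ r / batch
    B -= dt * ((pre > 0) * A * r[:, None]).T @ x / batch

# Finite-width analogues of tau_t^{+/-}: mass |A_j| * ||b_j|| / m at angle theta_j = arccos(||b_{j,H}|| / ||b_j||).
norms = np.linalg.norm(B, axis=1)
theta = np.arccos(np.clip(np.linalg.norm(B[:, :dH], axis=1) / norms, 0.0, 1.0))
weights = np.abs(A) * norms / m
pos, neg = A > 0, A < 0

alpha = 2 * np.sqrt(np.pi) * gamma((dH + 1) / 2) / gamma(dH / 2)   # predicted limit mass of tau^+
print("mass of tau+:", weights[pos].sum(), "(predicted limit:", alpha, ")")
print("mass of tau-:", weights[neg].sum(), "(predicted limit: 0)")
print("mass-weighted mean angle of tau+:", (weights[pos] * theta[pos]).sum() / weights[pos].sum())
print("mass-weighted mean angle of tau-:", (weights[neg] * theta[neg]).sum() / weights[neg].sum())
```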
We have explored the symmetries of infinitely wide two-layer ReLU networks and we have seen that: (i) they adapt to the orthogonal symmetries of the problem, (ii) they reduce to the dynamics of a linear network in the case of an odd target function and lead to exponential convergence, and (iii) when the target function depends only on the orthogonal projection onto a lower-dimensional sub-space H, the dynamics can be reduced to a lower-dimensional PDE. In particular, when f∗ is the Euclidean norm, this PDE is over a one-dimensional space corresponding to the angle θ between the particles and H. We have presented numerical experiments indicating that the positive particles converge to the subspace H in this case and leave the proof of this result as an open problem. We also leave as an open question whether the results of Section 2 extend to deeper networks.
Karl Hajjar received full support from the Agence Nationale de la Recherche (ANR), reference ANR-19-CHIA-0021-01 "BiSCottE". The authors thank Christophe Giraud for useful discussions and comments.
The authors declare there is no conflict of interest.
We introduce in this section additional notation that we use throughout the Appendix.
Residual: we call Rt(y):=−∂2ℓ(f∗(y),f(μt;y)), the "residual", which is equal to the difference f∗(y)−f(μt;y) when ℓ is the squared loss.
Identity matrix: we denote by Ip the identity matrix in Rp×p for any p∈N.
Indicator functions: we denote by 1A the indicator of a set A, that is 1A(z)=1⟺z∈A, and 1A(z)=0 otherwise.
Total variation: for any measure ν, we denote by |ν| its total variation, which should cause no confusion with the absolute value given the context.
Beta / Gamma function and distribution: for α,β>0, we denote by B(α,β) the Beta function, equal to Γ(α)Γ(β)/Γ(α+β) where Γ is the Gamma function, and by Beta(α,β) the beta law with density uα−1(1−u)β−1/B(α,β) on [0,1].
Gaussian / spherical measures: we call ρp the standard Gaussian measure in Rp (corresponding to N(0,Ip)) for any p∈N.
Whenever τ∈M+(Ω) has finite and non-zero total variation, we denote by ˜τ∈P2(Ω) its normalized counterpart (which is a probability measure), that is ˜τ=τ/τ(Ω)=τ/|τ|.
For any p∈N, we call ωp the Lebesgue (spherical) measure over the unit sphere Sp−1 of Rp, that is the measure such that ˜ωp is the uniform measure on Sp−1. We then denote by |Sp−1| the surface area of Sp−1, that is |Sp−1|:=|ωp|=ωp(Sp−1)=2πp/2/Γ(p/2).
Smooth functions: we denote by C(Ω) (resp. C1c(Ω)) the set of continuous (resp. continuously differentiable and compactly supported) functions from a set Ω to R.
In this section, we list a number of lemmas related to symmetries of measures and functions which will prove helpful in the proofs presented in the Appendix.
Lemma A.1 (Invariance under invertible maps). Let μ be a measure invariant under some measurable and invertible map T. Then, assuming T−1 is also measurable, one has that μ is also invariant under T−1.
Remark. A similar result holds for a function f invariant under an invertible map.
Proof. Because μ is invariant under T, we have for any measurable set A, μ(A)=μ(T−1(A)). Since T−1 is assumed to be measurable, for any measurable set A, T(A) is also measurable (T(A)=(T−1)−1(A)) and thus μ(T(A))=μ(T−1(T(A))=μ(A) which shows μ is invariant under T−1.
Lemma A.2 (Invariance of the density). Let ν be a measure with density p w.r.t. some measure μ, and assume both ν and μ are σ-finite and invariant under some measurable and invertible map T, whose inverse T−1 is also measurable. Then p is also invariant under T μ-almost everywhere, i.e., p(T(x))=p(x) for μ-almost every x.
Proof. For any measurable φ (w.r.t. μ, and thus w.r.t. ν as well), φ∘T−1 is also measurable, and we have, on the one hand
∫φ∘T−1dν=∫(φ∘T−1)pdμ=∫φ(p∘T)dμ, |
and on the other hand
∫φ∘T−1dν=∫φdν=∫φpdμ, |
which shows that ∫φ(p∘T)dμ=∫φpdμ, and thus that p∘T=p μ-almost everywhere.
Lemma A.3 (Projected variance with spherical symmetry). Let ζ be a spherically symmetric measure on Rp (i.e., such that for any orthogonal linear map T∈O(p), T#ζ=ζ), with finite second moment. Then we have the following matrix identity:
∫zzz⊤dζ(z)=vζIp,vζ:=∫z(z1)2dζ(z)=1p∫z||z||2dζ(z). |
Proof. The (i,j)-th entry of the matrix on the left-hand-side is ∫zzizjdζ(z), and it is readily seen that the terms outside the diagonal are 0. Indeed, let (i,j)∈[1,p]2 with i≠j, and consider the orthogonal map Tj:z∈Rp↦(z1,…,zj−1,−zj,zj+1,…,zp)⊤. The spherical symmetry of ζ implies that it is invariant under Tj, which yields ∫zzizjdζ(z)=−∫zzizjdζ(z), thereby showing that the latter is 0. To see that the diagonal terms are all equal, it suffices to consider the orthogonal map Si which swaps the 1st and i-th coordinates of a vector z. The invariance of ζ under Si yields ∫z(z1)2dζ(z)=∫z(zi)2dζ(z), which concludes the proof.
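This identity can also be checked quickly by Monte Carlo, e.g., for ζ the uniform measure on Sp−1, for which vζ=1/p (a hypothetical snippet):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 200_000
z = rng.standard_normal((n, p))
z /= np.linalg.norm(z, axis=1, keepdims=True)      # zeta = Unif(S^{p-1}), spherically symmetric
M = z.T @ z / n                                    # Monte Carlo estimate of int z z^T d zeta
print(np.round(M, 3))                              # approximately (1/p) * I_p, here 0.25 * I_4
```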
Consider a u∈Sd−1. u is determined by: (i) its angle θ:=arccos(||uH||)∈[0,π/2] with H (i.e., its angle with its projection uH onto H), (ii) the direction zH=uH/||uH||∈SdH−1 of its projection uH onto H, and finally (iii) the direction z⊥=u⊥/||u⊥||∈Sd⊥−1 of its projection u⊥ onto H⊥. Since ||uH||2+||u⊥||2=1, the angle θ gives both the norms of the projections onto H and H⊥: ||uH||=cos(θ) and ||u⊥||=sin(θ).
When z ranges over the unit sphere Sd−1, the angle θ and the directions zH,z⊥ range over [0,π/2], SdH−1, and Sd⊥−1 respectively. We wish to understand which measures we obtain on these three sets when z is distributed on the sphere according to the Lebesgue measure ωd. We show below that after the change of coordinates described above (from u∈Sd−1 to (θ,zH,z⊥)∈[0,π/2]×SdH−1×Sd⊥−1), the corresponding measures over SdH−1 and Sd⊥−1 are uniform measures and the measure over θ is given by a push-forward of a Beta distribution as defined below:
Definition A.1 (Distribution γ of the angle θ). We define the measure γ on [0,π/2] with the following density w.r.t. the Lebesgue measure on [0,π/2]:
dγ(θ):=cos(θ)dH−1sin(θ)d⊥−1dθ. |
Remark. γ is in fact simply given by (arccos∘√⋅)#Beta(dH/2,d⊥/2). Note that the total variation of γ is |γ|=γ([0,π/2])=12B(dH2,d⊥2), and the corresponding normalized (probability) measure is d˜γ(θ)=dγ(θ)/|γ|=2B(dH2,d⊥2)cos(θ)dH−1sin(θ)d⊥−1dθ.
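This description can be verified numerically (a hypothetical snippet): for u uniform on Sd−1 and H the span of the first dH coordinates, ||uH||2 should follow Beta(dH/2,d⊥/2), i.e., θ=arccos(||uH||) should follow ˜γ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, dH, n = 20, 5, 200_000
d_perp = d - dH

u = rng.standard_normal((n, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)            # u ~ Unif(S^{d-1})
cos2 = (u[:, :dH] ** 2).sum(axis=1)                      # ||u_H||^2 with H = first dH coordinates

# ||u_H||^2 should follow Beta(dH/2, d_perp/2); large p-value = no evidence against the claim.
print(stats.kstest(cos2, stats.beta(dH / 2, d_perp / 2).cdf))
```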
We now state the disintegration theorem and give its proof:
Theorem A.4 (Disintegration of the Lebesgue measure on the sphere). Let ωd denote the Lebesgue measure on the sphere of Rd, and let γ be the measure of Definition A.1. Then, one has
ωd=Φ#(ωdH⊗ωd⊥⊗γ) |
where
Φ:[0,π/2]×SdH−1×Sd⊥−1→Sd−1(θ,zH,z⊥)↦cos(θ)zH+sin(θ)z⊥. |
Proof. Denote by ˜ωd the uniform measure on the sphere, by |Sd−1|:=2πd/2/Γ(d/2) the surface area of the sphere in dimension d, and by ρp the standard Gaussian distribution in Rp for any p. Using the well-known fact that ˜ωd=Π#ρd with Π:x∈Rd∖{0}↦x/||x||∈Sd−1, we have, for any measurable test function φ:Sd−1→R,
∫φdωd=|Sd−1|∫φd˜ωd=|Sd−1|∫xφ(x||x||)dρd(x)=|Sd−1|∫xH,x⊥φ(xH+x⊥||xH+x⊥||)dρdH(xH)dρd⊥(x⊥)=Cd∫φ(rHzH+r⊥z⊥||rHzH+r⊥z⊥||)rdH−1He−r2H/2rd⊥−1⊥e−r2⊥/2drHdr⊥dωdH(zH)dωd⊥(z⊥)=Cd∫zH,z⊥∫rH,r⊥φ(rHzH+r⊥z⊥√r2H+r2⊥)rdH−1Hrd⊥−1⊥e−(r2H+r2⊥)/2drHdr⊥dωdH(zH)dωd⊥(z⊥), |
with
Cd:=|Sd−1|/((2π)dH/2(2π)d⊥/2)=|Sd−1|/(2π)d/2=2πd/2/(2d/2πd/2Γ(d/2))=1/(2(d−2)/2Γ(d/2)).
Doing the polar change of variables (rH,r⊥)∈R2+→(R,θ)∈R+×[0,π/2], we get:
∫φdωd=C′d∫zH,z⊥∫θφ(cos(θ)zH+sin(θ)z⊥)cos(θ)dH−1sin(θ)d⊥−1dθdωdH(zH)dωd⊥(z⊥) |
where
C′d:=Cd∫+∞0Rd−2e−R2/2RdR=Cd∫+∞0Rd−1e−R2/2dR=Cd×2(d−2)/2Γ(d/2)=1. |
which concludes the proof.
Remark. A similar disintegration result holds for the uniform measure ˜ωd on the sphere. The corresponding measures which are then pushed-forward by the same Φ are the normalized counterparts of the measures in the theorem above: ˜ωd=Φ#(˜ωdH⊗˜ωd⊥⊗˜γ). This readily comes from noting that a simple calculation yields |ωd|=|ωdH||ωd⊥||γ|.
Given a functional F:P2(Rp)→R, its first variation or Fréchet derivative at μ∈P2(Rp) is defined as a measurable function, denoted δFδμ(μ):Rp→R, such that, for any ν∈P2(Rp) for which μ+tν∈P2(Rp) in a neighborhood (in t) of t=0,
ddtF(μ+tν)|t=0=∫zδFδμ(μ)[z]dν(z). |
See Santambrogio [32,Definition 7.12], or [33,p.29] for more details on the first variation.
In the case of the functional defined in Eq (1.2) corresponding to the population loss objective, using the differentiability of the loss ℓ w.r.t. its second argument, one readily has that
F′μ(c):=δFδμ(μ)[c]=∫x∂2ℓ(f∗(x),f(μ;x))ϕ(c;x)dρ(x) |
since
ddtℓ(f∗(x),f(μ;x)+tf(ν;x))=∂2ℓ(f∗(x),f(μ;x)+tf(ν;x))∫cϕ(c;x)dν(c). |
A Wasserstein gradient flow for the objective F defined in Eq (1.2) is a path (μt)t≥0 in the space of probability measures P2(Rd+1) which satisfies the continuity equation with a vector field vt which is equal to the opposite of the gradient of the first variation of the functional F. This means that we have, in the sense of distributions,
∂tμt=−div(−∇(δFδμ)μt). |
That a pair ((μt)t≥0,vt) consisting of a path in P2(Rp) and a (time-dependent) vector field in Rp satisfies the continuity equation ∂tμt=−div(vtμt) in the sense of the distributions simply means that for any test function φ∈C1c(Rp),
∂t∫φdμt=∫v⊤t∇φdμt, |
where ∂t stands for the time derivative ddt. Similarly, when we say that the advection-reaction equation ∂tμt=−div(vtμt)+gtμt is satisfied for some function gt:Rp→R, we mean that it is in the sense of distributions: for any test function φ∈C1c(Rp),
∂t∫φdμt=∫(v⊤t∇φ+gt)dμt. |
An alternative description of the Wasserstein gradient flow of the objective F is to consider a flow X∙(⋅) in R+×Rd+1 such that, for any c∈Rd+1,
X0(c)=c,ddtXt(c)=−∇(δFδμ)(Xt(c))
and to define μt=(Xt)#μ0.
For more details on Wasserstein gradient flows in the space of probability measures see Santambrogio [32,Section 5.3], and [33,Section 4], and for more details on the equivalence between the continuity equation and the flow-based representation of the solution see Santambrogio [32,Theorem 4.4].
There are two main ideas behind the proof. Call ˜T:(a,b)∈R×Rd↦(±a,T(b)) (depending on whether f∗ is invariant or anti-invariant under T) and consider the following two facts:
Structure of ϕ((a,b);x). Since T is orthogonal, so is ˜T, and the structure of ϕ((a,b);x)=aσ(b⊤x) is such that ϕ(˜T(a,b);x)=±ϕ((a,b);T−1(x)), because the adjoint of the orthogonal map T is its inverse.
Conjugate gradients. Computing the gradient of a function whose input has been transformed by ˜T−1 is the same as the conjugate action of ˜T on the gradient: ∇(φ∘˜T−1)=˜T∘(∇φ)∘˜T−1 (this is due to the fact that the adjoint of ˜T−1 is ˜T because ˜T is orthogonal). Note that we similarly get ∇(φ∘˜T)=˜T−1∘(∇φ)∘˜T.
We present here arguments that are present in both the proofs of Propositions 2.1 and 2.2. Let T be a linear orthogonal map such that f∗(T(x))=±f∗(x), where the ± is because we deal with both cases at the same time since the logic is the same. Let t≥0, and define νt:=˜T−1#μt. We aim to show that (νt)t≥0 is also a Wasserstein gradient flow for the same objective as (μt)t≥0.
Prediction function. Let x∈Rd. We have, using the fact that T is orthogonal (and thus that ⟨T(x),y⟩=⟨x,T−1(y)⟩),
f(νt;x)=∫a,baσ(b⊤x)dνt(a,b)=∫a,b±aσ(T−1(b)⊤x)dμt(a,b)=±∫a,baσ(b⊤T(x))dμt(a,b)=±f(μt;T(x)). |
Time derivative. Let φ∈C1c(Rd). Because μt satisfies the continuity Eq (1.3) in the sense of distributions, and using the remark above on conjugate gradients as well as the orthogonality of ˜T, we have:
∂t∫φdνt=∂t∫φ∘˜T−1dμt=∫⟨∇(φ∘˜T−1),vt⟩dμt=∫⟨˜T∘∇φ∘˜T−1,vt⟩dμt=∫⟨∇φ∘˜T−1,˜T−1∘vt⟩dμt=∫⟨∇φ,˜T−1∘vt∘˜T⟩dνt. |
Conjugate velocity field. The equality above actually shows that νt satisfies the continuity equation with the conjugate velocity field ˜T−1∘vt∘˜T instead of vt. We show below that the former is closely related to the latter (and is in fact equal to −∇F′νt with sufficient assumptions on ∂2ℓ, which is the step proven in Appendices C.2 and C.3). Indeed, because vt is a gradient: vt=−∇F′μt, we have using again the remark above on conjugate gradients:
˜T−1∘vt∘˜T=−∇(F′μt∘˜T). |
Computing the function on the right-hand-side, for any (a,b)∈R×Rd, we get, using the remark above on the structure of ϕ,
F′μt(˜T(a,b))=∫y∂2ℓ(f∗(y),f(μt;y))ϕ(˜T(a,b);y)dρ(y)=±∫y∂2ℓ(f∗(y),f(μt;y))ϕ((a,b);T−1(y))dρ(y). |
ρ is invariant under T since it is spherically symmetric by assumption (and thus invariant under any orthogonal map), and we can therefore replace y by T(y) in the integral above, which yields
F′μt(˜T(a,b))=±∫y∂2ℓ(f∗(T(y)),f(μt;T(y)))ϕ((a,b);y)dρ(y)=±∫y∂2ℓ(±f∗(y),±f(νt;y))ϕ((a,b);y)dρ(y), |
and thus we get
∇(F′μt∘˜T)(a,b)=±∫y∂2ℓ(±f∗(y),±f(νt;y))∇(a,b)ϕ((a,b);y)dρ(y). |
One can already notice that if f∗ is invariant under T (as opposed to anti-invariant), that is if we keep the "+" in ±, we get ˜T−1∘vt∘˜T=−∇F′νt.
Proof. We first prove ν0=μ0 and then prove that both (μt)t≥0 and (νt)t≥0 are Wasserstein gradient flows of the objective F defined in Eq (1.2), starting from the initial condition μ0 at t=0. The unicity of such a gradient flow then guarantees that μt=νt and thus f(μt;T(x))=f(μt;x) by the preliminaries above on the prediction function (see Appendix C.1).
Initialization: ν0=μ0. By definition, ˜T(a,b)=(a,T(b)). Since μ0=μ10⊗μ20 by assumption, and μ20 is invariant under T since it has spherical symmetry, it is clear that μ0 is invariant under ˜T, and thus under ˜T−1 by Lemma A.1, which gives ν0=μ0 because νt=˜T−1#μt for any t by definition.
Time derivative. From the preliminary results above (see Appendix C.1) we have
∂tνt=−div(−∇F′νtνt), |
which shows that (νt)t≥0 is also a Wasserstein gradient flow of the objective F. By unicity of the latter (starting from the initial condition μ0), it must hold that μt=νt for any t≥0 which concludes the proof.
The proof follows the exact same pattern as that of Proposition 2.1 (see Appendix C.2). We now have by definition, ˜T(a,b)=(−a,T(b)) and the added symmetry assumption on μ10 ensures that ν0=μ0 still holds in this case. As for the time derivative, the preliminaries above (see Appendix C.1) ensure that
∇(F′μt∘˜T)(a,b)=−∫y∂2ℓ(−f∗(y),−f(νt;y))∇(a,b)ϕ((a,b);y)dρ(y)=∫y∂2ℓ(f∗(y),f(νt;y))∇(a,b)ϕ((a,b);y)dρ(y), |
where we have used the extra assumption that ∂2ℓ(−y,−ˆy)=−∂2ℓ(y,ˆy). This yields
∂tνt=−div(−∇F′νtνt) |
and the conclusion follows from the same logic as for Proposition 2.1.
Proof. The proof is divided into three steps: (i) we derive the dynamics in time of the vector w(t)=12∫abdμt(a,b), (ii) we show that the positive definite matrix H(t) appearing in these dynamics has its smallest eigenvalue lower-bounded by some positive constant after some t0>0, and (iii) we show that this implies exponential convergence to the global minimum.
Generalities on the objective Q. Expanding the square in the definition of Q (3.2), we have
Q(w)=12[Ex∼P[f∗(x)2]−2β⊤w+w⊤Cw],C:=Ex∼P[xx⊤]∈Rd×d,β:=Ex∼P[f∗(x)x]∈Rd. |
If C≠0, Q(w)→∞ as ||w||→∞ and since Q is lower-bounded by 0, it thus admits at least one global minimum. This minimizer w⋆ is unique as soon as Q is strongly convex, i.e., C is positive definite, which holds in this case as we have assumed the smallest eigenvalue λmin of C to be positive. Note that ∇Q(w)=Cw−β=∫x((x⊤w)−f∗(x))xdP(x)∈Rd.
First step: dynamics of w(t). Let k∈{1,…,d}; the k-th coordinate wk(t) of w(t) is given by wk(t)=12∫abkdμt(a,b), and its time derivative is given by
w′k(t)=12∫(∇(a,b)(abk))⊤vt(a,b)dμt(a,b) |
where vt is given by Eq (1.3) except we replace σ by 12idRd and ρ by P in Fμt, that is
vt(a,b)=12∫yRt(y)(b⊤yay)dP(y)∈R1+d. |
On the other hand, ∇(a,b)(abk)=(bkaek)∈R1+d where ek is the k-th element of the canonical orthonormal basis of Rd. Note that here, Rt(y)=f∗(y)−⟨w(t),y⟩. We thus get
w′k(t)= 14⟨∫a,bbkbdμt(a,b),∫y(f∗(y)−(w(t)⊤y))ydP(y)⟩+ 14⟨∫a,ba2ekdμt(a,b),∫y(f∗(y)−(w(t)⊤y))ydP(y)⟩. |
Note that the term on the right in the inner products is in fact equal to −∇Q(w(t)), which yields the following dynamics for the vector w(t):
w′(t)=−H(t)∇Q(w(t)),H(t):=14(∫bb⊤dμt(a,b)+∫a2dμt(a,b)Id)∈Rd×d. |
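Before moving to the second step, note that this identity is purely algebraic and can be checked numerically at finite width by computing both sides directly from the atoms and the data. The following snippet is a hypothetical illustration (the agreement holds up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(5)
d, m, n = 3, 64, 50
X = rng.standard_normal((n, d))                    # empirical distribution P
y = rng.standard_normal(n)                         # target values f*(x_i)

A = rng.choice([-1.0, 1.0], size=m)
B = rng.standard_normal((m, d))
B /= np.linalg.norm(B, axis=1, keepdims=True)      # |A_j| = ||b_j|| = 1

w = 0.5 * (A[:, None] * B).mean(axis=0)            # w = 1/2 int a b dmu
grad_Q = X.T @ (X @ w - y) / n                     # grad Q(w) = C w - beta

# Wasserstein GF velocities of the atoms (sigma replaced by id/2).
r = X @ w - y                                      # residuals f(mu; x_i) - y_i
A_dot = -0.5 * (B @ (X.T @ r / n))                 # dA_j/dt = -1/2 E[r(x) b_j^T x]
B_dot = -0.5 * np.outer(A, X.T @ r / n)            # db_j/dt = -1/2 A_j E[r(x) x]

w_dot = 0.5 * ((A_dot[:, None] * B) + (A[:, None] * B_dot)).mean(axis=0)
H = 0.25 * (B.T @ B / m + (A ** 2).mean() * np.eye(d))   # H(t) as defined above
assert np.allclose(w_dot, -H @ grad_Q)             # w'(t) = -H(t) grad Q(w(t))
```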
Second step: lower bound on the smallest eigenvalue of H(t). At initialization, by symmetry one has w(0)=0, and using Lemma A.3, one has that H(0)=14(1d+1)Id, so that
ddtQ(w(t))|t=0=⟨w′(0),∇Q(w(0))⟩=−d+14d||∇Q(0)||2=−d+14d||β||2 |
If β=0, then ∇Q(0)=0 and since w(0)=0, w(t) starts at the global optimum and thus stays constant equal to 0. Otherwise, if ||β||>0, one has ddtQ(w(t))|t=0<0, which ensures that there is a t0>0 such that Q(w(t))<Q(w(0))=Q(0) for any t∈(0,t0]. Call ε:=[Q(0)−Q(w(t0))]/2>0. The continuity of Q at 0 guarantees that there is a δ>0 such that for any w∈Rd, if ||w||<δ, then |Q(w)−Q(0)|≤ε.
Now assume that there exists t1≥t0 such that ∫a2dμt1(a,b)≤δ. Then, one has
||w(t1)||=||12∫abdμt1(a,b)||≤12∫|a|||b||dμt1(a,b)≤12∫a2dμt1(a,b)≤δ2<δ, |
where we have used in the penultimate inequality that μt1 is supported on the set {|a|=||b||} because of the assumptions on the initialization μ0 (see Section 1.1). This ensures that |Q(w(t1))−Q(0)|≤ε. Since t↦Q(w(t)) is decreasing (Q(w(t))=F(μt) and it is classical that the objective is decreasing along the gradient flow path, see the third step below) and t1≥t0, this means that
0<Q(0)−Q(w(t0))≤Q(0)−Q(w(t1))≤ε=[Q(0)−Q(w(t0))]/2 |
which is a contradiction. Therefore, for any t≥t0, ∫a2dμt(a,b)≥δ. Calling η:=δ/4>0, we thus have that for any t≥t0, the smallest eigenvalue of H(t) is larger than η because H(t) is the sum of the positive semi-definite matrix 14∫bb⊤dμt(a,b) and of the positive definite matrix 14∫a2dμt(a,b)Id whose smallest eigenvalue is at least η for t≥t0.
Third step: exponential convergence. We have:
ddtQ(w(t))=⟨w′(t),∇Q(w(t))⟩=−∇Q(w(t))⊤H(t)∇Q(w(t))≤0,
which shows that because H(t) is positive definite, the objective Q is decreasing along the path (w(t))t≥0. Since after t0>0, the smallest eigenvalue of H(t) is lower bounded by a constant η>0, we have that, for any t≥t0:
ddtQ(w(t))≤−η||∇Q(w(t))||2. | (D.1) |
Because Q is λmin-strongly convex (as the smallest eigenvalue of C is λmin>0), one has the classical inequality
12||∇Q(w)||2≥λmin(Q(w)−Q(w⋆)).
Plugging this into Eq (D.1) gives
ddt(Q(w(t))−Q(w⋆))≤−2ηλmin(Q(w(t))−Q(w⋆)),
which by Gronwall's lemma in turn yields for any t≥t0
0≤Q(w(t))−Q(w⋆)≤e−2ηλmin(t−t0)(Q(w(t0))−Q(w⋆)),
thereby proving exponential convergence.
Exponential convergence in distance. Given that ∇Q(w⋆)=0 because w⋆ is an optimum, it holds that Cw⋆=β. Using this fact, it easily follows that
Q(w)−Q(w⋆)=12⟨C(w−w⋆),w−w⋆⟩, |
and the right-hand-side is lower bounded by λmin2||w−w⋆||2, from which we conclude that
||w(t)−w⋆||2≤2λmin(Q(w(t))−Q(w⋆)),
and the exponential decrease of the right-hand-side allows to conclude.
We wish to show here that the pair of measures (ν+t,ν−t) defined through Eq (4.1) satisfies Eq (4.2) and that the corresponding dynamics are closed in the sense that they can be expressed solely using (ν+t,ν−t) (without needing to express quantities in terms of μt). Below, we use κ(z)=max(0,z). We introduce this notation to differentiate it from the activation function σ (which is also equal to ReLU) and to avoid confusion: the κ appearing below has nothing to do with the activation function of the network and simply comes from the integration domain in the calculations.
Equations of the dynamics on the sphere. Let φ∈C1c(Sd−1). One has
∂t∫φdν±t= ∂t∫±a≥0,b|a|||b||φ(b||b||)dμt(a,b)= ∂t∫a,bκ(±a)||b||φ(b||b||)dμt(a,b)= ∫a,b∇(a,b)(κ(±a)||b||φ(b||b||))⊤vt(a,b)dμt(a,b)
Let us compute the components of the gradient above. We have
∇a(κ(±a)||b||φ(b||b||))=±κ′(±a)||b||φ(b||b||)=±1{±a≥0}||b||φ(b||b||).
The Jacobian of the map b∈Rd↦b/||b|| is equal to 1||b||(Id−bb⊤/||b||2), which is a symmetric (or self-adjoint) matrix, so that the gradient w.r.t. b is
∇b(κ(±a)||b||φ(b||b||))=1{±a≥0}|a|[φ(b||b||)b||b||+(Id−b||b||(b||b||)⊤)∇φ(b||b||)]. |
On the other hand, the first component of vt(a,b) (corresponding to the gradient w.r.t. a) is
v1t(a,b)=∫yRt(y)κ(b⊤y)dρ(y)=||b||∫yRt(y)κ((b||b||)⊤y)dρ(y), |
and the last d components (corresponding to the gradient w.r.t. b) are
v2t(a,b)=∫yRt(y)aκ′(b⊤y)ydρ(y)=a∫yRt(y)κ′((b||b||)⊤y)ydρ(y). |
When computing the inner product ∇(a,b)(κ(±a)||b||φ(b||b||))⊤vt(a,b), we can re-arrange the terms to keep one term where φ appears and the other where ∇φ appears. Using the facts that the Jacobian computed above is symmetric, that κ(z)=κ′(z)z for any z∈R, and that 1{±a≥0}a=±1{±a≥0}|a|=±κ(±a), we get,
∇(a,b)(κ(±a)||b||φ(b||b||))⊤vt(a,b)= ±1{±a≥0}||b||||b||φ(b||b||)gt(b||b||)+ ±1{±a≥0}|a||a|φ(b||b||)gt(b||b||)+ ±1{±a≥0}|a||a|∇φ(b||b||)⊤˜vt(b||b||), |
where, for u∈Sd−1
gt(u):=∫yRt(y)σ(u⊤y)dρ(y),˜vt(u):=∫yRt(y)σ′(u⊤y)[y−(u⊤y)u]dρ(y). |
Finally, because μt stays on the cone {(a,b)∈Rd+1;|a| =||b||} for any t (see Chizat and Bach [4,Lemma 26], Wojtowytsch [5,Section 2.5]), when integrating against μt, we can replace ||b|| by |a| and vice-versa. We thus get that the time derivative we initially computed is the sum of two terms:
∂t∫φdν±t= ±2∫±a≥0,b|a|||b||φ(b||b||)gt(b||b||)dμt(a,b) ± ∫±a≥0,b|a|||b||∇φ(b||b||)⊤˜vt(b||b||)dμt(a,b)= ±2∫u∈Sd−1φ(u)gt(u)dν±t(u) ± ∫u∈Sd−1∇φ(u)⊤˜vt(u)dν±t(u),
which shows that ν±t satisfies Eq (4.2) in the sense of distributions.
Closed dynamics. We want to show that gt and ˜vt can be expressed using only ν+t and ν−t. Both these quantities depend on t only through the residual Rt, which itself only depends on t through f(μt;⋅). We thus show that the latter can be expressed using only ν+t and ν−t, which easily follows from writing, for any y∈Rd,
f(μt;y)=∫aσ(b⊤y)dμt(a,b)=∫a‖b‖σ(⟨b‖b‖,y⟩)dμt(a,b)=∫a≥0,b|a|‖b‖σ(⟨b‖b‖,y⟩)dμt(a,b)−∫a≤0,b|a|‖b‖σ(⟨b‖b‖,y⟩)dμt(a,b)=∫u∈Sd−1σ(u⊤y)dν+t(u)−∫u∈Sd−1σ(u⊤y)dν−t(u).
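As a minimal finite-particle sketch of the rewriting above (the width m, the dimension d, the particle values and the test point y are assumed for illustration), the following numpy snippet checks that the direct evaluation of f(μ;y) coincides with the expression through ν+ and ν−, using the positive homogeneity of the ReLU.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 64, 6
a = rng.choice([-1.0, 1.0], size=m) * rng.uniform(0.5, 2.0, size=m)  # signed outer weights a_j
b = rng.normal(size=(m, d))                                          # inner weights b_j
y = rng.normal(size=d)
relu = lambda t: np.maximum(t, 0.0)

# Direct evaluation: f(mu_m; y) = (1/m) sum_j a_j sigma(b_j^T y).
f_direct = np.mean(a * relu(b @ y))

# Evaluation through the projected measures nu_+ / nu_-:
# directions u_j = b_j / ||b_j|| and weights |a_j| ||b_j||.
u = b / np.linalg.norm(b, axis=1, keepdims=True)
w = np.abs(a) * np.linalg.norm(b, axis=1)
pos, neg = a >= 0, a < 0
f_split = (w[pos] * relu(u[pos] @ y)).sum() / m - (w[neg] * relu(u[neg] @ y)).sum() / m

print(np.isclose(f_direct, f_split))  # True: the two expressions coincide
```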
Proof. We first prove that Eq (4.4) for τ±t holds in the sense of distributions, and then show that the corresponding dynamics are closed because the Vt and Gt appearing in Eq (4.4) can be expressed with (τ+t,τ−t) (and not only with (ν+t,ν−t) for instance). We show this by expressing f(μt;⋅) solely in terms of the pair (τ+t,τ−t).
The pair (τ+t,τ−t) satisfy Eq (4.4). First, we show that gt and ˜vt defined in Eq (4.3) admit modified expressions that match the structure of the pushforward transforming ν±t into τ±t. Indeed, since ρ is assumed to be spherically symmetric, it is invariant by any orthogonal transformation. In particular, for a fixed u∈Sd−1 such that u⊥≠0, we consider the orthogonal map Tu:Rd→Rd such that Tu|H=idH and Tu|H⊥ sends the canonical orthonormal basis (e⊥1,…,e⊥d⊥) of H⊥ on (u⊥/||u⊥||,u2,…,ud⊥) where (u2,…,ud⊥)∈(H⊥)d⊥−1 is an orthonormal family, orthogonal to u⊥, so that for any y⊥∈H⊥ with coordinates y⊥1,…,y⊥d⊥ in the basis (e⊥1,…,e⊥d⊥), Tu|H⊥(y⊥)=y⊥1u⊥/||u⊥||+hu(y⊥) with hu(y⊥)⊥u⊥.
Note that since f∗(y)=fH(yH) and f(μt;y)=˜ft(yH,||y⊥||), the residual Rt(y)=f∗(y)−f(μt;y) is invariant by any orthogonal transformation which preserves H (and in particular by Tu). We thus have
gt(u)=∫yRt(y)σ(⟨uH,yH⟩+y⊥1||u⊥||)dρ=:˜gt(uH,||u⊥||),˜vt(u)=∫yRt(y)σ′(⟨uH,yH⟩+y⊥1||u⊥||)[yH+Tu|H⊥(y⊥)−(⟨uH,yH⟩+y⊥1||u⊥||)u]dρ. |
Now consider, for any (θ,zH)∈[0,π/2]×SdH−1,
Gt(θ,zH):=˜gt(cos(θ)zH,sin(θ)),Vt(θ,zH):=∫yRt(y)σ′(cos(θ)⟨zH,yH⟩+y⊥1sin(θ))(y⊥1cos(θ)−sin(θ)⟨zH,yH⟩,[yH−⟨zH,yH⟩zH]/cos(θ))dρ(y)
We show below that (τ+t,τ−t) satisfy Eq (4.4) with the Gt and Vt defined above. Let φ∈C1c([0,π/2]×SdH−1). Since τ±t is defined as a push-forward measure obtained from ν±t we have:
∂t∫φ(θ,zH)dτ±t(θ,zH)= ∂t∫φ(arccos(||uH||),uH||uH||)dν±t(u)=±2∫φ(arccos(||uH||),uH||uH||)˜gt(uH,||u⊥||)dν±t(u) +±∫∇u(φ(arccos(||uH||),uH||uH||))⊤˜vt(u)dν±t(u). |
By definition of the pushforward, and since uH=cos(arccos(||uH||))uH/||uH|| and ||u⊥||=sin(arccos(||uH||)) for u∈Sd−1, the first integral is equal to ∫φ(θ,zH)Gt(θ,zH)dτ±t(θ,zH). For the second integral, let us first compute the gradient. One has
∇u(φ(arccos(||uH||),uH||uH||))=∂θφ(arccos(||uH||),uH||uH||)−1√1−||uH||2uH||uH|| +1||uH||[IdH−uH(uH)⊤||uH||2](∇zHφ)(arccos(||uH||),uH||uH||). |
We observe that the gradient above belongs to H, which implies that when computing its inner product with ˜vt(u) we can consider only the component of the latter along H. Additionally, we note that IdH−uH(uH)⊤/||uH||2 is actually the orthogonal projection onto {uH}⊥, so that it yields 0 when applied to uH. Using that ||u⊥||=√1−||uH||2 for u∈Sd−1, we then get:
∇u(φ(arccos(||uH||),uH||uH||))⊤˜vt(u)=∇φ(arccos(||uH||),uH||uH||)⊤Vt(arccos(||uH||),uH||uH||).
where ∇φ(θ,zH)=(∂θφ(θ,zH),∇zHφ(θ,zH)). This shows that
∂t∫φ(θ,zH)dτ±t(θ,zH)=±2∫φ(θ,zH)Gt(θ,zH)dτ±t(θ,zH) +±∫∇φ(θ,zH)⊤Vt(θ,zH)dτ±t(θ,zH), |
which proves that τ±t indeed satisfies Eq (4.4) in the sense of distributions.
The dynamics are closed in the pair (τ+t,τ−t). The only thing left to prove to show that the dynamics are closed for the pair (τ+t,τ−t) is that Gt and Vt can be expressed using only the pair (τ+t,τ−t). The only dependence of these quantities on t is through the residual Rt which itself depends on t only through f(μt;⋅). Let y∈Rd. We have already shown at the end of the previous Section E.1.1 that by definition of ν+t and ν−t, we have
f(μt;y)=∫u∈Sd−1σ(u⊤y)d(ν+t−ν−t)(u). |
On the other hand, we show below that the integral of any measurable function φ:Sd−1→R against ν±t can be expressed as an integral against τ±t in the case where ν±t admits a density w.r.t. the uniform measure on Sd−1 (which is the case for ν±0), the case of a general measure ν±t being a simple extension via a weak convergence argument. Thus call p±t the density of ν±t w.r.t. ˜ωd. Since ν±t is invariant by any linear map T such that T|H=idH and T|H⊥∈O(d⊥) (because of the symmetries on μt given by Proposition 2.1), and since this is also the case for ˜ωd because ˜ωd has spherical symmetry and T is orthogonal, we have by Lemma A.2 that p±t is invariant by any such T, which then leads to p±t having the form p±t(u)=˜p±t(uH,||u⊥||) by Lemma 4.1.
First step. We show that τ±t has the density q±t(θ,zH)=|Sd⊥−1|˜p±t(cos(θ)zH,sin(θ)) w.r.t. ˜γ⊗˜ωdH, where the measure ˜γ is the normalized counterpart of the measure in Definition A.1. Indeed, let φ:[0,π/2]×SdH−1→R be any measurable function w.r.t. τ±t. Using the disintegration Lemma A.4 on ˜ωd, one has that
∫φ(θ,zH)dτ±t(θ,zH)=∫φ(arccos(||uH||),uH||uH||)dν±t(u)=∫φ(arccos(||uH||),uH||uH||)˜p±t(uH,||u⊥||)d˜ωd(u)=∫φ(θ,zH)˜p±t(cos(θ)zH,sin(θ))d˜γ(θ)d˜ωdH(zH)d˜ωd⊥(z⊥)=∫φ(θ,zH)˜p±t(cos(θ)zH,sin(θ))d˜γ(θ)d˜ωdH(zH), |
which proves the desired density for τ±t.
Second step. Consider a measurable φ:Sd−1→R w.r.t. ν±t. One has with similar calculations as above
∫uφ(u)dν±t(u)=∫uφ(u)˜p±t(uH,||u⊥||)d˜ωd(u)=∫φ(cos(θ)zH+sin(θ)z⊥)˜p±t(cos(θ)zH,sin(θ))d˜γ(θ)d˜ωdH(zH)d˜ωd⊥(z⊥)=∫θ,zH(∫z⊥φ(cos(θ)zH+sin(θ)z⊥)d˜ωd⊥(z⊥))q±t(θ,zH)d˜γ(θ)d˜ωdH(zH)=∫θ,zH(∫z⊥φ(cos(θ)zH+sin(θ)z⊥)d˜ωd⊥(z⊥))dτ±t(θ,zH). |
Applying this to f(μt;y) shows that the latter quantity can be expressed solely using (τ+t,τ−t), which proves that the dynamics is indeed closed and therefore concludes the proof when ν±t has a density.
Third step: extending to any measure. It is known that for any measure ν over Sd−1, there exists a sequence of measures (νn)n∈N such that: (i) νn has a density pn w.r.t. the uniform measure ˜ωd over Sd−1, and (ii) the sequence (νn)n∈N converges weakly to ν, that is, for any continuous (and thus automatically bounded because the unit sphere is compact) φ, ∫φdνn→n→∞∫φdν. Let thus ν∈M+(Sd−1), and consider a sequence (νn)n∈N of measures with densities converging weakly to ν. Let τ (resp. τn) be defined from ν (resp. νn) as τ±t is defined from ν±t, that is, for any measurable φ:[0,π/2]×SdH−1→R,
∫φ(θ,zH)dτ(θ,zH)=∫φ(arccos(||uH||),uH||uH||)dν(u),∫φ(θ,zH)dτn(θ,zH)=∫φ(arccos(||uH||),uH||uH||)dνn(u). |
Let thus φ be a continuous map from Sd−1→R (having in mind the example of :u↦σ(u⊤y) for a fixed y). By the result of Step 2, since νn has a density for every n, we have that
∫φ(u)dνn(u)=∫θ,zH(∫z⊥φ(cos(θ)zH+sin(θ)z⊥)d˜ωd⊥(z⊥))dτn(θ,zH), | (E.1) |
and taking the limit n→∞, the left-hand-side of Eq (E.1) converges to ∫φdν by assumption. Now let us look at the right-hand-side of (E.1). Calling ψ(θ,zH)=∫z⊥φ(cos(θ)zH+sin(θ)z⊥)d˜ωd⊥(z⊥) and Φ(u)=∫z⊥φ(uH+||u⊥||z⊥)d˜ωd⊥(z⊥), the right-hand-side is in fact ∫ψdτn and, for any n∈N, is equal to:
∫ψdτn=∫uψ(arccos(||uH||),uH||uH||)dνn(u)=∫u∫z⊥φ(||uH||uH||uH||+||u⊥||z⊥)d˜ωd⊥(z⊥)dνn(u)=∫u∫z⊥φ(uH+||u⊥||z⊥)d˜ωd⊥(z⊥)dνn(u)=∫Φdνn, |
and a similar result holds for τ and ν. Now, the continuity of Φ is readily obtained from that of φ, and thus the right-hand-side in the last equality above converges to ∫Φdν which is also equal to ∫ψdτ by the same calculations as above. The right-hand-side in (E.1) therefore converges to ∫ψdτ, and since the limits of both sides are equal, we get ∫φdν=∫ψdτ, which is the claim of Step 2 for a general measure ν which does not necessarily admit a density, thereby concluding the proof.
Here, we give the proof of Theorem 4.3, which shows that, when f∗(x)=||xH||, the dynamics can be reduced to a single variable: the angle θ∈[0,π/2] between particles and the subspace H.
We decompose the proof in three steps: first we show that the pair of measures (τ+t,τ−t)∈M+([0,π/2]) as defined in Section 4.2 indeed follows Eq (4.5); then we show that the dynamics are indeed closed by proving that the terms Vt and Gt appearing in the GF depend only on (τ+t,τ−t); and finally, we show that Eq (4.5) indeed corresponds to a Wasserstein-Fisher-Rao GF on a given objective functional over M+([0,π/2])2.
Proof. We first use the added symmetry to simplify the terms gt and ˜vt which appear in the GF with (ν+t,ν−t) (see Section 4.1) and express them only with ||uH|| and ||u⊥||. Then we use the equations satisfied by (ν+t,ν−t) to obtain equations for (τ+t,τ−t).
Equations for (τ+t,τ−t). Let φ∈C1c([0,π/2]). We have
∂t∫φdτ±t=∂t∫φ(arccos(||uH||))dν±t(u)=±∫∇u(φ(arccos(||uH||)))⊤˜vt(u)dν±t(u)±2∫φ(arccos(||uH||))gt(u)dν±t(u). |
One has that
∇u(φ(arccos(||uH||)))=φ′(arccos(||uH||))×−1√1−||uH||2uH||uH||, |
which belongs to H. We recall here the expressions of ˜vt and gt: for any u∈Sd−1, we have
gt(u)=∫yRt(y)σ(u⊤y)dρ(y),˜vt(u)=∫yRt(y)σ′(u⊤y)[y−(u⊤y)u]dρ(y). |
Since f∗(x)=||xH||, f∗ is now invariant under any orthogonal map T preserving H and H⊥, that is, such that the restrictions T|H∈O(dH) and T|H⊥∈O(d⊥). Proposition 2.1 then ensures that so is f(μt;⋅), which in turn implies that the residual Rt(⋅)=∂2ℓ(f(μt;⋅),f∗(⋅)) also shares that invariance property. Using a change of variable similar to that in Appendix E.1.2, and because ρ is spherically symmetric, one gets that gt can be re-written
gt(u)=∫yRt(y)σ(yH1||uH||+y⊥1||u⊥||)dρ(y). |
Calling
Gt(θ):=∫yRt(y)σ(yH1cos(θ)+y⊥1sin(θ))dρ(y), |
one has gt(u)=Gt(arccos(||uH||)) because u∈Sd−1, so that ||u⊥||=√1−||uH||2. Then, by definition of τ±t, the second integral in the time derivative above is equal to ∫φ(θ)Gt(θ)dτ±t. For the first integral appearing in that time derivative, we get
∇u(φ(arccos(||uH||)))⊤˜vt(u)=[φ′(arccos(||uH||))/(||u⊥||||uH||)]∫yRt(y)σ′(u⊤y)[(u⊤y)u−y]⊤uHdρ(y).
Expanding the inner product inside the integral, we have
[(u⊤y)u−y]⊤uH=(⟨uH,yH⟩+⟨u⊥,y⊥⟩)||uH||2−⟨uH,yH⟩=||uH||2⟨u⊥,y⊥⟩−(1−||uH||2)⟨uH,yH⟩=||uH||2⟨u⊥,y⊥⟩−||u⊥||2⟨uH,yH⟩. |
Calling
Vt(θ):=∫yRt(y)σ′(yH1cos(θ)+y⊥1sin(θ))[y⊥1cos(θ)−yH1sin(θ)]dρ(y)=G′t(θ).
and using again the spherical symmetry of ρ, with the same change of variable in the integral as for gt, we get that
∇u(φ(arccos(||uH||)))⊤˜vt(u)=φ′(arccos(||uH||))Vt(arccos(||uH||)). |
Finally, this combined with the previous result on the integral with gt yields
∂t∫φdτ±t=±∫φ′(θ)Vt(θ)dτ±t(θ)±2∫φ(θ)Gt(θ)dτ±t(θ), |
which leads to the desired equation
∂tτ±t=−div(±Vtτ±t)±2Gtτ±t.
The proof follows closely that of Appendix E.1.2 (where we prove closed dynamics), except that here we take advantage of the added symmetry of the dynamics. As in Appendix E.1.2, we have
f(μt;y)=∫u∈Sd−1σ(u⊤y)d(ν+t−ν−t)(u), |
and the only thing to prove is that this quantity can be expressed using only (τ+t,τ−t). As in Appendix E.1.2, we first prove this when ν±t has a density, which is the case for ν±0 and should thus remain so during the dynamics.
Similarly to what occurs in Appendix E.1.2, ν±t is invariant by any orthogonal map T which preserves H and H⊥ because μt has those symmetries given by Proposition 2.1, and if ν±t has a density p±t w.r.t. ˜ωd, then p±t is also invariant by any such map T, and thus depends only on the norms ||uH|| and ||u⊥|| of its input u∈Sd−1. But since its input is on the sphere, those norms are determined by the angle θ=arccos(||uH||) between the input u and H. Calling q±t such that p±t(u)=q±t(arccos(||uH||)), this leads τ±t to have the density q±t w.r.t. ˜γ. Then, we show below that, similarly to Appendix E.1.2, the integral of any measurable φ:Sd−1→R against ν±t can be expressed as an integral against τ±t. Indeed, using the disintegration Lemma A.4,
∫φdν±t=∫u∈Sd−1φ(u)q±t(arccos(||uH||))d˜ωd(u)=∫θ,zH,z⊥φ(cos(θ)zH+sin(θ)z⊥)q±t(θ)d˜ωdH(zH)d˜ωd⊥(z⊥)d˜γ(θ)=∫θ∈[0,π/2]˜φ(θ)q±t(θ)d˜γ(θ)=∫θ∈[0,π/2]˜φ(θ)dτ±t(θ)
where
˜φ(θ):=∫zH,z⊥φ(cos(θ)zH+sin(θ)z⊥)d˜ωdH(zH)d˜ωd⊥(z⊥), |
which concludes the proof if ν±t has a density w.r.t. the uniform measure ˜ωd on the sphere Sd−1. The general case is obtained by a weak convergence argument (of measures with density) as in the third step of Section E.1.2.
Proof. Recall that γ is the measure in Definition 3, and consider the following objective functional over M([0,π/2])2:
A(τ+,τ−):=∫φ∈[0,π/2]ℓ(cos(φ),˜f(τ+,τ−;φ))d˜γ(φ),˜f(τ+,τ−;φ):=∫θ∈[0,π/2]˜ϕ(θ;φ)d(τ+−τ−)(θ),˜ϕ(θ;φ):=∫r,s∈[−1,1]σ(rcos(φ)cos(θ)+ssin(φ)sin(θ))d˜γdH(r)d˜γd⊥(s) |
where, for any p∈N, dγp(r)=(1−r2)(p−3)/2dr, and ˜γp=γp/|γp| with the normalizing factor |γp|=B(1/2,(p−1)/2)=√πΓ((p−1)/2)/Γ(p/2)=|Sp−1|/|Sp−2|. Note that ˜γp can be simply expressed as the law of ϵ×√X where ϵ∼U({−1,+1}) and X∼Beta(1/2,(p−1)/2).
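To make this last remark concrete, here is a short numpy sketch (the value p=7, the sample size and the moment check are assumptions for illustration) that samples from ˜γp through the representation ϵ√X and verifies the second moment, which for this law equals 1/p.

```python
import numpy as np

def sample_gamma_tilde(p, size, rng):
    # Draws from the normalized measure gamma_tilde_p on [-1, 1], whose density is
    # proportional to (1 - r^2)^((p-3)/2), via eps * sqrt(X) with X ~ Beta(1/2, (p-1)/2).
    eps = rng.choice([-1.0, 1.0], size=size)
    x = rng.beta(0.5, (p - 1) / 2.0, size=size)
    return eps * np.sqrt(x)

rng = np.random.default_rng(0)
r = sample_gamma_tilde(p=7, size=200_000, rng=rng)
print(r.mean(), (r ** 2).mean())  # approximately 0 and 1/p = 1/7, up to Monte-Carlo error
```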
Computing the first variation (or Fréchet derivative) of the functional A w.r.t. its first and second arguments yields, for any θ∈[0,π/2],
δAδτ±(τ+,τ−)[θ]=±∫φ∂2ℓ(cos(φ),˜f(τ+,τ−;φ))˜ϕ(θ;φ)d˜γ(φ). |
To conclude, one need only observe that the quantity above is simply equal to ∓Gt(θ), up to a fixed multiplicative constant. Since we have assumed ρ to be the uniform measure over Sd−1 to ensure that the Wasserstein GF (1.3) is well-defined, the constant is one here; in the case of a general ρ with spherical symmetry, the result should also hold (as long as the Wasserstein GF (1.3) is well-defined), but the proof is more technical and different constants might appear.
Simplifying f(μt;⋅). Using the results from Appendix E.2.2, we have for any (φ,zH,z⊥)∈[0,π/2]×SdH−1×Sd⊥−1 (so that u=cos(φ)zH+sin(φ)z⊥∈Sd−1)
f(μt;cos(φ)zH+sin(φ)z⊥)=∫ψ∫ξH,ξ⊥σ(cos(ψ)cos(φ)⟨ξH,zH⟩+sin(ψ)sin(φ)⟨ξ⊥,z⊥⟩)d˜ωdH(ξH)d˜ωd⊥(ξ⊥)d(τ+t−τ−t)(ψ) |
Now, because of the integration against uniform measures on the unit spheres, and the inner products involved, we can use some spherical harmonics theory to simplify those calculations. Using the Funk-Hecke formula (see Atkinson and Han [34,Theorem 2.22], with n=0 and d=dH or d=d⊥), we get
f(μt;cos(φ)zH+sin(φ)z⊥)=[|SdH−2||Sd⊥−2|/(|SdH−1||Sd⊥−1|)]∫ψ∫r,sσ(rcos(ψ)cos(φ)+ssin(ψ)sin(φ))dγdH(r)dγd⊥(s)d(τ+t−τ−t)(ψ)=[1/(|γdH||γd⊥|)]|γdH||γd⊥|∫ψ∈[0,π/2]˜ϕ(ψ;φ)d(τ+t−τ−t)(ψ)=˜f(τ+t,τ−t;φ).
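The n=0 Funk-Hecke reduction used in this computation can be checked numerically; the sketch below (with assumed values for the sphere dimension p, the scale c inside the ReLU and the sample size) compares the spherical average of σ(c⟨ξ,e1⟩) with the corresponding average against ˜γp.

```python
import numpy as np

rng = np.random.default_rng(0)
p, c, n = 6, 0.8, 400_000            # assumed dimension, scale and Monte-Carlo sample size
relu = lambda t: np.maximum(t, 0.0)

# Left-hand side: average of sigma(c * <xi, e_1>) over xi uniform on S^{p-1}.
xi = rng.normal(size=(n, p))
xi /= np.linalg.norm(xi, axis=1, keepdims=True)
lhs = relu(c * xi[:, 0]).mean()

# Right-hand side: average of sigma(c * r) for r ~ gamma_tilde_p,
# sampled as eps * sqrt(X) with X ~ Beta(1/2, (p-1)/2).
r = rng.choice([-1.0, 1.0], size=n) * np.sqrt(rng.beta(0.5, (p - 1) / 2.0, size=n))
rhs = relu(c * r).mean()

print(lhs, rhs)  # the two Monte-Carlo estimates agree up to sampling error
```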
Simplifying f∗(cos(φ)zH+sin(φ)z⊥). Because f∗(y)=||yH||, f∗(cos(φ)zH+sin(φ)z⊥) is simply ||cos(φ)zH||=cos(φ) because zH∈SdH−1.
With the previous expressions for f(μt;⋅) and f∗ we have that for any function Φ:R2→R,
Φ(f∗(cos(φ)zH+sin(φ)z⊥),f(μt;cos(φ)zH+sin(φ)z⊥))=Φ(cos(φ),˜f(τ+t,τ−t;φ)). |
Note that this applies both to Φ(y,ˆy)=ℓ(y,ˆy) and Φ(y,ˆy)=−∂2ℓ(y,ˆy).
Proof that F(μt)=A(τ+t,τ−t).
Using the disintegration Lemma A.4 for the uniform measure on the unit sphere Sd−1, we have
F(μt)=∫yℓ(f∗(y),f(μt;y))dρ(y)=∫ℓ∘(f∗(⋅),f(μt;⋅))(cos(φ)zH+sin(φ)z⊥)d˜ωdH(zH)d˜ωd⊥(z⊥)d˜γ(φ)=∫ℓ(cos(φ),˜f(τ+t,τ−t;φ))d˜ωdH(zH)d˜ωd⊥(z⊥)d˜γ(φ)=∫ℓ(cos(φ),˜f(τ+t,τ−t;φ))d˜γ(φ), |
where we have used in the last equality the fact that the integrand does not depend on zH or z⊥ and that ˜ωdH and ˜ωd⊥ are probability measures (and thus their total mass is 1).
Simplifying Gt. Using the disintegration Lemma A.4, we have:
Gt(θ)=∫yRt(y)σ(yH1cos(θ)+y⊥1sin(θ))dρ(y)=∫Rt(cos(φ)zH+sin(φ)z⊥)σ(zH1cos(φ)cos(θ)+z⊥1sin(φ)sin(θ))d˜ωdH(zH)d˜ωd⊥(z⊥)d˜γ(φ).
Similarly to what we did for simplifying f(μt;⋅), we can simplify the integrals against ˜ωdH and ˜ωd⊥ using spherical harmonics theory to get:
Gt(θ)=−∫φ∈[0,π/2]∂2ℓ(cos(φ),˜f(τ+t,τ−t;φ))˜ϕ(φ;θ)d˜γ(φ). |
This shows that
−δAδτ+(τ+t,τ−t)[θ]=Gt(θ), δAδτ−(τ+t,τ−t)[θ]=Gt(θ),
which proves that Eq (4.5) indeed describes the evolution of the pair (τ+t,τ−t) as a Wasserstein-Fisher-Rao gradient flow for the objective functional A over M([0,π/2])2.
Measure discretization. Discretizing μt via μm,t=1m∑mj=1δ(aj(t),bj(t)), we get that τm,t:=τ+m,t−τ−m,t=1m∑mj=1cj(t)δθj(t) where
cj(t)=εj|aj(t)|||bj(t)||,εj=sign(aj(0)),θj(t)=arccos(||(bj(t))H||/||bj(t)||).
Initializing with aj(0)∼U({−1,+1}) and bj(0)∼˜ωd=U(Sd−1) yields cj(0)∼U({−1,+1}) and θj(0)∼˜γ, i.i.d. over j. The gradient flow of Eq (4.5) translates into the following ODEs on (cj)j∈[1,m] and (θj)j∈[1,m]:
ddtcj(t)=2εjGt(θj(t))cj(t),ddtθj(t)=εjVt(θj(t)). |
where εj=aj(0)∈{−1,+1} indicates whether the corresponding particle contributes to τ+m,t (εj=+1) or to τ−m,t (εj=−1).
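A minimal initialization sketch in numpy (the width m, the dimensions dH and d⊥, and the choice of H as the span of the first dH coordinates are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_H, d_perp = 1024, 2, 8                      # assumed width and dimensions; H = first d_H coordinates
d = d_H + d_perp

a0 = rng.choice([-1.0, 1.0], size=m)             # a_j(0) ~ U({-1, +1})
b0 = rng.normal(size=(m, d))
b0 /= np.linalg.norm(b0, axis=1, keepdims=True)  # b_j(0) ~ U(S^{d-1})

eps = np.sign(a0)
c0 = eps * np.abs(a0) * np.linalg.norm(b0, axis=1)       # c_j(0) = eps_j here since |a_j(0)| = ||b_j(0)|| = 1
theta0 = np.arccos(np.linalg.norm(b0[:, :d_H], axis=1))  # theta_j(0): angle between b_j(0) and H
```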
Time discretization. Simulating these ODEs via the discrete Euler scheme with step η>0, leads, for any iteration k∈N, to:
cj(k+1)=(1+2ηεjGk(θj(k)))cj(k),θj(k+1)=θj(k)+ηεjVk(θj(k)). | (F.1)
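A minimal sketch of one step of Eq (F.1), assuming hypothetical callables G_hat and V_hat that return the estimates of Gk and Vk at each particle angle (see the Monte-Carlo sketch further below for one way of computing them):

```python
def euler_step(c, theta, eps, eta, G_hat, V_hat):
    # One explicit Euler step of Eq (F.1): multiplicative update of the masses c_j
    # and additive update of the angles theta_j.
    g = G_hat(theta)                       # estimates of G_k(theta_j(k)), shape (m,)
    v = V_hat(theta)                       # estimates of V_k(theta_j(k)), shape (m,)
    c_next = (1.0 + 2.0 * eta * eps * g) * c
    theta_next = theta + eta * eps * v
    return c_next, theta_next
```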
Approximating integrals numerically. The only thing that needs to be dealt with numerically is estimating the values of Gt and Vt which are defined by integrals. With the discretization of the measures, we have:
Gk(θ)=∫φ(cos(φ)−˜f(τ+k,τ−k;φ))˜ϕ(φ;θ)d˜γ(φ),˜f(τ+k,τ−k;φ)=1m∑mj=1cj(k)˜ϕ(θj(k);φ),˜ϕ(θ;φ)=∫r,s∈[−1,1]σ(rcos(φ)cos(θ)+ssin(φ)sin(θ))d˜γdH(r)d˜γd⊥(s).
We thus get:
Gk(θ)=1m∫ψ(r,s;θ,φ)∑mj=1(cos(φ)−cj(k)ψ(r′,s′;θj(k),φ))d˜γ(φ)(d˜γdH)2(r,r′)(d˜γd⊥)2(s,s′),
with
ψ(r,s;θ,φ):=σ(rcos(φ)cos(θ)+ssin(φ)sin(θ)). |
Similarly, we have:
Vk(θ)=1m∫χ(r,s;θ,φ)∑mj=1(cos(φ)−cj(k)ψ(r′,s′;θj(k),φ))d˜γ(φ)(d˜γdH)2(r,r′)(d˜γd⊥)2(s,s′),
with
χ(r,s;θ,φ):=∂∂θψ(r,s;θ,φ)=σ′(cos(θ)cos(φ)r+sin(θ)sin(φ)s)[−sin(θ)cos(φ)r+cos(θ)sin(φ)s]. |
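For reference, a vectorized numpy sketch of the two integrands ψ and χ (the function names are ours; the a.e. derivative of the ReLU is used for σ′):

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
drelu = lambda t: (t > 0.0).astype(float)       # a.e. derivative of the ReLU

def psi(r, s, theta, phi):
    # psi(r, s; theta, phi) = sigma(r cos(phi) cos(theta) + s sin(phi) sin(theta))
    return relu(r * np.cos(phi) * np.cos(theta) + s * np.sin(phi) * np.sin(theta))

def chi(r, s, theta, phi):
    # chi = d psi / d theta
    z = r * np.cos(phi) * np.cos(theta) + s * np.sin(phi) * np.sin(theta)
    return drelu(z) * (-r * np.cos(phi) * np.sin(theta) + s * np.sin(phi) * np.cos(theta))
```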
We use Monte-Carlo estimation through sampling to approximate the integrals against the five variables (φ,r,r′,s,s′) by drawing N samples from the corresponding distributions. We get:
Gk(θj(k))≈1mNN∑i=1m∑l=1Ψji(k)(cos(Φi)−cl(k)˜Ψli(k)),Ψji(k)=ψ(Ri,Si;θj(k),Φi),˜Ψji(k)=ψ(R′i,S′i;θj(k),Φi),
and similarly
Vk(θj(k))≈1mNN∑i=1m∑l=1χji(k)(cos(Φi)−cl(k)˜Ψli(k)),χji(k)=χ(Ri,Si;θj(k),Φi),
where we have drawn the samples i.i.d. over i∈[1,N]:
Φi∼˜γ,Ri,R′i∼˜γdH,Si,S′i∼˜γd⊥. |
Iterations in the numerical simulation. Defining the vectors c(k)=(cj(k))j∈[1,m], θ(k)=(θj(k))j∈[1,m], and ε=(εj)j∈[1,m], the update Eq (F.1) can then be written in terms of update rules using the matrices Ψ(k)=(Ψji(k))j,i∈[1,m]×[1,N], ˜Ψ(k)=(˜Ψji(k))j,i∈[1,m]×[1,N], and finally χ(k)=(χji(k))j,i∈[1,m]×[1,N], and the vectors (Φ,R,R′,S,S′)=(Φi,Ri,R′i,Si,S′i)i∈[1,N], which are re-sampled at each iteration k∈[0,K], where K∈N:
c(k+1)=(1+2ηε⊙ˆGk)⊙c(k),θ(k+1)=θ(k)+ηε⊙ˆVk, |
where
ˆGk=1NΨ(k)(cos(Φ)−1m˜Ψ(k)⊤c(k)),ˆVk=1Nχ(k)(cos(Φ)−1m˜Ψ(k)⊤c(k)),
and ⊙ denotes the Hadamard (element-wise) product of two vectors. One can compute the loss through sampling in a similar way.
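To make the whole scheme concrete, below is a minimal self-contained numpy sketch of the simulation loop (sampling of (Φ,R,R′,S,S′), construction of the matrices Ψ, ˜Ψ, χ, the estimates ˆGk and ˆVk, and the update). The values of m, dH, d⊥, N, η and K, the choice of H as the span of the first dH coordinates, and all function names are assumptions made for illustration; ψ and χ are repeated from the sketch above so that the block runs on its own. This is a sketch of the scheme described here, not the authors' exact code.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_H, d_perp = 1024, 2, 8          # assumed width and dimensions (H = first d_H coordinates)
N, eta, K = 1000, 5e-3, 200          # Monte-Carlo samples, step-size; K = 20,000 in the reported runs

relu = lambda t: np.maximum(t, 0.0)
drelu = lambda t: (t > 0.0).astype(float)

def psi(r, s, th, ph):
    # sigma(r cos(phi) cos(theta) + s sin(phi) sin(theta))
    return relu(r * np.cos(ph) * np.cos(th) + s * np.sin(ph) * np.sin(th))

def chi(r, s, th, ph):
    # d/d theta of psi
    z = r * np.cos(ph) * np.cos(th) + s * np.sin(ph) * np.sin(th)
    return drelu(z) * (-r * np.cos(ph) * np.sin(th) + s * np.sin(ph) * np.cos(th))

def sample_gamma_tilde(p, size):
    # eps * sqrt(X) with X ~ Beta(1/2, (p-1)/2) has law gamma_tilde_p.
    return rng.choice([-1.0, 1.0], size=size) * np.sqrt(rng.beta(0.5, (p - 1) / 2.0, size=size))

def sample_angle(size):
    # Phi ~ gamma_tilde: angle between a uniform point of S^{d-1} and H.
    g = rng.normal(size=(size, d_H + d_perp))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return np.arccos(np.linalg.norm(g[:, :d_H], axis=1))

# Initialization: c_j(0) = eps_j in {-1, +1}, theta_j(0) ~ gamma_tilde.
eps = rng.choice([-1.0, 1.0], size=m)
c = eps.copy()
theta = sample_angle(m)

for k in range(K):
    # Re-sample (Phi_i, R_i, R'_i, S_i, S'_i), i = 1..N, at every iteration.
    Phi = sample_angle(N)
    R, Rp = sample_gamma_tilde(d_H, N), sample_gamma_tilde(d_H, N)
    S, Sp = sample_gamma_tilde(d_perp, N), sample_gamma_tilde(d_perp, N)

    Th = theta[:, None]                                     # shape (m, 1) for broadcasting
    Psi = psi(R[None, :], S[None, :], Th, Phi[None, :])     # Psi[j, i]       = psi(R_i, S_i; theta_j, Phi_i)
    Psit = psi(Rp[None, :], Sp[None, :], Th, Phi[None, :])  # Psi_tilde[j, i] = psi(R'_i, S'_i; theta_j, Phi_i)
    Chi = chi(R[None, :], S[None, :], Th, Phi[None, :])     # Chi[j, i]       = chi(R_i, S_i; theta_j, Phi_i)

    resid = np.cos(Phi) - (Psit.T @ c) / m                  # cos(Phi_i) - f_tilde(tau_k; Phi_i)
    G_hat = (Psi @ resid) / N                               # estimates of G_k(theta_j(k))
    V_hat = (Chi @ resid) / N                               # estimates of V_k(theta_j(k))

    c = (1.0 + 2.0 * eta * eps * G_hat) * c                 # Eq (F.1), mass update
    theta = theta + eta * eps * V_hat                       # Eq (F.1), position update
```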
Experimental value for α and parameters of the numerical simulation. For the numerical simulations, we fix the number of atoms of the measure (or equivalently the width of the network) to m=1024, the learning rate to η=5×10−3, the number of samples for the Monte-Carlo scheme to N=1000, and the total number of iterations to K=20,000. The experimental value for α (see Section 4.2) is computed through αexp=τ+m,K([0,π/2]), that is
αexp=1m∑j∈J+cj(K),J+:={j∈[1,m] ; εj=1}. |
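In code, assuming the arrays c, eps and the width m from the simulation sketch above, this quantity reads:

```python
# Experimental value of alpha: (1/m) * sum of c_j(K) over {j : eps_j = +1},
# assuming c, eps and m come from the simulation sketch above.
alpha_exp = c[eps > 0].sum() / m
print(alpha_exp)
```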
As mentioned in the main text, the behaviour of the numerical simulation depends strongly on the step-size η. Some of the differences between our observations and our intuitive description of the limiting model (infinite width and continuous time) may come from too large a step-size. We have thus also run the numerical simulation with η=2×10−5, for K=230,000 steps, but the same differences still appear (e.g., τ+m,k([0,π/2]) still grows larger than the theoretically expected limit α after some time, albeit by a smaller margin), and after the critical time t∗ some negative particles seem to go slightly beyond π/2, even with a very small step-size, a fact which cannot happen for the limiting model. Consequently, in Figure 3, the first histogram bin right after π/2 has been merged with the one before.