Research note · Special Issues

On the symmetries in the dynamics of wide two-layer neural networks

  • Received: 05 December 2022 · Revised: 18 January 2023 · Accepted: 01 February 2023 · Published: 22 February 2023
  • We consider the idealized setting of gradient flow on the population risk for infinitely wide two-layer ReLU neural networks (without bias), and study the effect of symmetries on the learned parameters and predictors. We first describe a general class of symmetries which, when satisfied by the target function $ f^* $ and the input distribution, are preserved by the dynamics. We then study more specific cases. When $ f^* $ is odd, we show that the dynamics of the predictor reduces to that of a (non-linearly parameterized) linear predictor, and its exponential convergence can be guaranteed. When $ f^* $ has a low-dimensional structure, we prove that the gradient flow PDE reduces to a lower-dimensional PDE. Furthermore, we present informal and numerical arguments that suggest that the input neurons align with the lower-dimensional structure of the problem.

    Citation: Karl Hajjar, Lénaïc Chizat. On the symmetries in the dynamics of wide two-layer neural networks[J]. Electronic Research Archive, 2023, 31(4): 2175-2212. doi: 10.3934/era.2023112
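
    To fix notation, here is a sketch of the standard mean-field formulation of the setting described in the abstract; it is not taken verbatim from the paper, and the precise scaling, loss, and initialization used by the authors may differ. An infinitely wide two-layer ReLU network without bias is written as an integral over a distribution $ \mu $ of neuron parameters,
    $$ f(x; \mu) = \int a\, \sigma(\langle b, x \rangle)\, \mathrm{d}\mu(a, b), \qquad \sigma(u) = \max(u, 0), $$
    and gradient flow on the population risk $ F(\mu) = \tfrac{1}{2}\, \mathbb{E}_{x \sim \rho}\big[ (f(x; \mu) - f^*(x))^2 \big] $ corresponds to a Wasserstein gradient flow, i.e., the continuity equation
    $$ \partial_t \mu_t = \mathrm{div}\big( \mu_t\, \nabla_{(a, b)} F'[\mu_t] \big), \qquad F'[\mu](a, b) = \mathbb{E}_{x \sim \rho}\big[ (f(x; \mu) - f^*(x))\, a\, \sigma(\langle b, x \rangle) \big]. $$
    Symmetries of the pair $ (f^*, \rho) $ that leave this PDE invariant are the objects studied in the article.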

  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
