We consider the idealized setting of gradient flow on the population risk for infinitely wide two-layer ReLU neural networks (without bias), and study the effect of symmetries on the learned parameters and predictors. We first describe a general class of symmetries which, when satisfied by the target function $ f^* $ and the input distribution, are preserved by the dynamics. We then study more specific cases. When $ f^* $ is odd, we show that the dynamics of the predictor reduces to that of a (non-linearly parameterized) linear predictor, and its exponential convergence can be guaranteed. When $ f^* $ has a low-dimensional structure, we prove that the gradient flow PDE reduces to a lower-dimensional PDE. Furthermore, we present informal and numerical arguments that suggest that the input neurons align with the lower-dimensional structure of the problem.
Citation: Karl Hajjar, Lénaïc Chizat, On the symmetries in the dynamics of wide two-layer neural networks, Electronic Research Archive, 31 (2023), 2175–2212. https://doi.org/10.3934/era.2023112
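For reference, here is a minimal sketch of the standard mean-field formulation that the abstract refers to, written in commonly used notation (the measure $ \mu $ over neurons, the input distribution $ \rho $, and the quadratic loss below are assumptions of this sketch; the paper's body fixes its own normalization and conventions). A two-layer ReLU network without bias is parameterized by a measure $ \mu $ over output/input weight pairs $ (c, w) $:

$$ f(x; \mu) = \int c \, \max(w^\top x, 0) \, \mathrm{d}\mu(c, w), \qquad F(\mu) = \frac{1}{2} \, \mathbb{E}_{x \sim \rho}\big[ (f(x; \mu) - f^*(x))^2 \big], $$

and gradient flow on the population risk $ F $ corresponds, in the infinite-width limit, to a Wasserstein gradient flow on measures,

$$ \partial_t \mu_t = \mathrm{div}\big( \mu_t \, \nabla_{(c, w)} F'(\mu_t) \big), \qquad F'(\mu)(c, w) = \mathbb{E}_{x \sim \rho}\big[ (f(x; \mu) - f^*(x)) \, c \, \max(w^\top x, 0) \big], $$

where $ F'(\mu) $ denotes the first variation of $ F $ at $ \mu $. As an elementary illustration of the odd case discussed in the abstract (not a reproduction of the paper's argument): since $ \max(u, 0) - \max(-u, 0) = u $ for the bias-free ReLU, a pair of neurons at $ w $ and $ -w $ with opposite output weights $ c $ and $ -c $ contributes $ c \, w^\top x $ to the predictor, i.e., acts as a linear function of the input.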