The Adam algorithm is a common choice for optimizing neural network models. However, its use often brings challenges, such as susceptibility to local optima, overfitting, and convergence problems caused by unstable learning-rate behavior. In this article, we introduce an enhanced Adam optimization algorithm that integrates Warmup and cosine annealing techniques to alleviate these challenges. By incorporating a warmup phase into the traditional Adam algorithm, we gradually ramp up the learning rate during the initial training stage, effectively avoiding instability. In addition, we adopt a dynamic cosine annealing strategy that adaptively adjusts the learning rate, which mitigates convergence to poor local optima and enhances the model's generalization ability. To validate the effectiveness of the proposed method, we conducted extensive comparative experiments against traditional Adam and other optimization algorithms on several standard datasets. On MNIST, CIFAR10 and CIFAR100, the improved algorithm achieved accuracies of 98.87%, 87.67% and 58.88%, respectively, a significant improvement over the other algorithms. The experimental results clearly indicate that our joint enhancement of the Adam algorithm yields significant gains in convergence speed and generalization performance. These promising results highlight the potential of the enhanced Adam algorithm across a wide range of deep learning tasks.
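The schedule the abstract describes, a warmup ramp followed by cosine annealing of Adam's learning rate, can be sketched as below. This is a minimal illustration under assumed hyperparameters (function name, warmup length, and learning-rate bounds are illustrative, not the paper's exact formulation):

```python
import math

def wuc_lr(step, base_lr=1e-3, min_lr=1e-5, warmup_steps=500, total_steps=10000):
    """Illustrative learning-rate schedule: linear warmup + cosine annealing.

    For the first `warmup_steps`, the rate ramps linearly from near zero up
    to `base_lr`, stabilizing the early phase of training. Afterwards it
    decays from `base_lr` to `min_lr` along a half-cosine curve.
    """
    if step < warmup_steps:
        # Linear warmup: fraction of base_lr grows with the step index.
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing over the remaining steps (progress in [0, 1]).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a typical training loop this value would be assigned to the optimizer's learning rate each step (e.g., via a scheduler hook in the framework being used); the rest of the Adam update proceeds unchanged.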
Citation: Can Zhang, Yichuan Shao, Haijing Sun, Lei Xing, Qian Zhao, Le Zhang. The WuC-Adam algorithm based on joint improvement of Warmup and cosine annealing algorithms[J]. Mathematical Biosciences and Engineering, 2024, 21(1): 1270-1285. doi: 10.3934/mbe.2024054