Research on Optimization Algorithms in Deep Learning
Abstract:
As a popular research direction centered on neural networks, deep learning has attracted considerable attention in recent years. A deep learning model is a multi-layer network whose parameters, which ultimately determine the model's performance, must be trained by an optimizer, so optimization algorithms for deep learning have become a research hotspot both in China and internationally. This paper reviews first-order optimization algorithms in deep learning. It first introduces classical stochastic gradient descent and its momentum variants, then presents the adaptive learning rate algorithms that have become more popular in recent years, and finally summarizes the field and discusses future directions for optimization algorithms in deep learning.
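As a minimal illustration of the two algorithm families surveyed here, the NumPy sketch below applies stochastic gradient descent with momentum and an Adam-style adaptive update to a toy quadratic loss. The loss, step counts, and hyperparameter values are common defaults assumed only for this example and are not taken from the paper.

import numpy as np

# Toy loss f(w) = 0.5 * ||w||^2, whose gradient is simply w.
def grad(w):
    return w

def sgd_momentum(w, steps=100, lr=0.1, beta=0.9):
    # Classical SGD with (heavy-ball) momentum: v accumulates past gradients.
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        v = beta * v + g
        w = w - lr * v
    return w

def adam(w, steps=100, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: per-coordinate adaptive step sizes built from bias-corrected
    # first- and second-moment estimates of the gradient.
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([5.0, -3.0])
print("SGD with momentum:", sgd_momentum(w0.copy()))
print("Adam:             ", adam(w0.copy()))

Both runs drive the parameters toward the minimizer at the origin; the momentum method uses a single global learning rate, while Adam rescales each coordinate by its own gradient statistics, which is the distinction the review develops in detail.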