Proximal gradient descent and its accelerated variant are effective methods for minimizing the sum of a smooth and a non-smooth function. When the smooth part can be written as a sum of many component functions, the stochastic proximal gradient method performs well; however, its accelerated version is far less understood. This paper proposes a proximal stochastic accelerated gradient (PSAG) method for problems that combine a smooth component, given as the average of multiple block sums, with a non-smooth component. In addition, most existing convergence analyses hold only in expectation. To address this, under some mild conditions we establish almost sure convergence with unbiased gradient estimates in the non-smooth setting. Moreover, we show that the minimum of the squared gradient mapping norm converges to zero with probability one.
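To make the problem setting concrete, the following is a minimal Python sketch of a generic proximal stochastic accelerated gradient iteration: a Nesterov-style momentum step, a uniformly sampled component gradient as an unbiased estimate, and a proximal step on the non-smooth part. It is an illustration under assumed toy data, step size, and momentum schedule, not the paper's PSAG algorithm or its exact update rules.

```python
import numpy as np

# Sketch of a proximal stochastic accelerated gradient loop for
#   min_x  F(x) = (1/n) * sum_i f_i(x) + h(x),
# where each f_i is smooth and h is non-smooth. Here h(x) = lam * ||x||_1,
# whose proximal operator is soft-thresholding. All problem data and
# hyperparameters below are assumptions chosen only for illustration.

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def psag_sketch(grad_fns, prox, x0, step, n_iters, rng=None):
    """Generic proximal stochastic accelerated gradient iteration.

    grad_fns : list of callables, grad_fns[i](x) returns grad f_i(x)
    prox     : callable, prox(v, step) returns prox_{step*h}(v)
    """
    rng = rng or np.random.default_rng(0)
    x_prev = x0.copy()
    x = x0.copy()
    t_prev = 1.0
    for _ in range(n_iters):
        # Nesterov-type extrapolation (momentum) point
        t = (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2)) / 2.0
        y = x + ((t_prev - 1.0) / t) * (x - x_prev)
        # Unbiased stochastic gradient: one component sampled uniformly
        i = rng.integers(len(grad_fns))
        g = grad_fns[i](y)
        # Proximal (forward-backward) step on the non-smooth part
        x_prev, x, t_prev = x, prox(y - step * g, step), t
    return x

if __name__ == "__main__":
    # Toy sparse least-squares instance: f_i(x) = 0.5 * (a_i^T x - b_i)^2
    rng = np.random.default_rng(1)
    A, b, lam = rng.standard_normal((50, 10)), rng.standard_normal(50), 0.1
    grads = [lambda x, a=A[i], bi=b[i]: a * (a @ x - bi) for i in range(50)]
    x_hat = psag_sketch(grads, lambda v, s: soft_threshold(v, lam * s),
                        np.zeros(10), step=0.05, n_iters=2000)
    print("sparse iterate:", np.round(x_hat, 3))
```

The quantity analyzed in the almost sure convergence results, the gradient mapping, would here be measured as the (scaled) difference between the extrapolation point and its proximal update.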