In recent years, Convolutional Neural Networks (CNNs) have enabled unprecedented progress on a wide range of computer vision tasks. However, training large CNNs is resource-intensive: it requires specialized Graphics Processing Units (GPUs) and highly optimized implementations to extract peak performance from the hardware. GPU memory is a major bottleneck of CNN training, limiting the size of both the inputs and the model architecture. In this paper, we propose to alleviate this memory bottleneck by leveraging an under-utilized resource of modern systems: the device-to-host bandwidth. Our method, termed CPU offloading, transfers hidden activations to the CPU as soon as they are computed, in order to free GPU memory for the computations of subsequent layers during the forward pass. These activations are then transferred back to the GPU as they are needed by the gradient computations of the backward pass. The key challenge of our method is to efficiently overlap data transfers with computations so as to minimize the wall-time overhead induced by the additional transfers. On a typical workstation with an Nvidia Titan X GPU, our method compares favorably to gradient checkpointing: we reduce the memory consumption of training a VGG19 model by 35% with an additional wall-time overhead of only 21%. Further experiments detail the impact of each of the optimizations we propose. Our method is orthogonal to other memory-reduction techniques such as quantization and sparsification, so it can easily be combined with them for further savings.
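As a concrete illustration of the mechanism described above, the sketch below offloads saved activations to pinned host memory during the forward pass and brings them back on demand during the backward pass, using PyTorch's saved-tensor hooks (available in PyTorch 1.10+). This is a minimal sketch, not the authors' implementation: it offloads every tensor saved for backward and issues transfers on the default CUDA stream, whereas the paper's method targets hidden activations specifically and overlaps transfers with compute to keep the wall-time overhead low.

```python
import torch
import torch.nn as nn
from torch.autograd.graph import saved_tensors_hooks

def pack_to_cpu(t):
    # Copy the freshly computed activation into pinned host memory.
    # With a pinned destination, non_blocking=True makes the device-to-host
    # copy asynchronous, so it can overlap with the kernels of later layers
    # enqueued on the same CUDA stream.
    buf = torch.empty(t.size(), dtype=t.dtype, pin_memory=True)
    buf.copy_(t, non_blocking=True)
    return (t.device, buf)

def unpack_from_cpu(packed):
    device, buf = packed
    # Transfer the activation back to the GPU right before the backward
    # pass needs it for its gradient computation.
    return buf.to(device, non_blocking=True)

# Any tensor saved for the backward pass inside this context is kept on
# the CPU between the forward and the backward pass (requires a CUDA GPU).
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
x = torch.randn(64, 4096, device="cuda")
with saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    loss = model(x).sum()
loss.backward()
```

PyTorch also provides a built-in context manager, torch.autograd.graph.save_on_cpu(pin_memory=True), which implements the same pack/unpack pattern. In either case, what determines the overhead is how well the host-device transfers are hidden behind the surrounding computations, e.g. by issuing them on a dedicated CUDA stream synchronized with events.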