Action Recognition Based on Two-Stage Distillation

DOI: 10.12677/airr.2025.142036, PP. 362-375

Keywords: Feature Distillation, Model Fusion, Two-Stage Distillation, Action Recognition


Abstract:

In the field of computer vision, CNNs and Transformers excel at local information extraction and global feature modeling, respectively, and how to fuse the two has become an active research topic. Some works introduce convolution operations directly into the Transformer encoder; however, this changes the Transformer's original structure and limits the global modeling ability of self-attention. Other works perform knowledge distillation between CNN and Transformer only at the logit output layer, and thus fail to exploit the CNN's feature-level information. To address these problems, this paper proposes a feature alignment distillation module, which dimensionally aligns the Transformer's feature layers with the CNN's feature layers, realizing feature-level distillation between the two and allowing the Transformer to learn the CNN's local modeling ability. Because the feature alignment operation introduces convolutions and increases the model's computation, this paper further proposes a feature-mapping logit distillation module, which maps the Transformer's features to logits and thereby provides a general feature-level distillation method between Transformer and CNN. To enable the student model to learn both local modeling ability and long-range dependency modeling ability, this paper proposes a two-stage distillation framework in which a CNN teacher and a Transformer teacher jointly guide the student. Experimental results show that the proposed method achieves feature-level distillation between CNN and Transformer, and that under the joint guidance of the CNN teacher and the Transformer teacher the student model learns both local modeling and long-range dependency modeling, improving the accuracy of the baseline model on downstream action recognition tasks.
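The abstract describes three pieces: a feature alignment distillation module, a feature-mapping logit distillation module, and a two-stage framework in which a CNN teacher and a Transformer teacher take turns guiding the student. The paper's implementation is not reproduced here, so the following is only a minimal PyTorch sketch of the latter two ideas; the names (FeatureMappingLogitHead, two_stage_step), the pooling choices, the temperature, and the loss weights are assumptions made for illustration, not the authors' code.

```python
# Minimal sketch (assumptions, not the paper's implementation):
# feature-mapping logit distillation plus a two-stage teacher schedule.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureMappingLogitHead(nn.Module):
    """Hypothetical head that maps intermediate features to class logits,
    so feature-level knowledge can be distilled in logit space without
    dimension-aligning convolutions."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Accept (B, C, H, W) CNN maps or (B, N, C) Transformer tokens;
        # pool to a (B, C) vector before projecting to logits.
        if feats.dim() == 4:
            feats = feats.flatten(2).mean(-1)
        else:
            feats = feats.mean(1)
        return self.proj(feats)


def kd_loss(student_logits, teacher_logits, T: float = 4.0) -> torch.Tensor:
    """Soft-label KL distillation loss (Hinton et al.), with temperature T."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)


def two_stage_step(student, cnn_teacher, vit_teacher, feat_head, x, y, stage):
    """One training step of the assumed two-stage schedule: stage 1 distills
    from the CNN teacher (local cues), stage 2 from the Transformer teacher
    (long-range cues). Loss weights are illustrative."""
    s_feats, s_logits = student(x)        # assumed to return (features, logits)
    ce = F.cross_entropy(s_logits, y)

    teacher = cnn_teacher if stage == 1 else vit_teacher
    with torch.no_grad():
        t_logits = teacher(x)

    # Feature-mapping logit distillation: the student's intermediate features
    # are mapped to logits and matched against the teacher's soft predictions.
    fd = kd_loss(feat_head(s_feats), t_logits)
    # Ordinary logit distillation on the student's final outputs.
    ld = kd_loss(s_logits, t_logits)
    return ce + 0.5 * fd + 0.5 * ld
```

A training run consistent with the abstract would alternate stages: a first block of epochs with the CNN teacher (stage 1), then a second block with the Transformer teacher (stage 2), so the student receives both local and long-range supervision; the exact schedule and weighting are not specified here and are left as assumptions.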

