A Survey of Online Action Detection Based on Deep Learning Methods
Abstract:
Action detection automatically identifies the categories of the actions that occur in a video, together with their start and end times, after the algorithm has observed the entire video. It is used in robotics, smart homes, urban security, and other fields. In practice, however, many scenarios require feedback as soon as an event begins, so the detection algorithm must receive video information online, as a stream. Traditional action detection algorithms perform poorly in this setting because the information they observe is incomplete. Based on the current state of research on online action detection, this paper reviews the mainstream methods used for online detection and summarises the challenges that current research faces.