
Two-Stage Weakly Supervised Temporal Action Localization

DOI: 10.12677/CSA.2023.134065, pp. 657-671

Keywords: Weakly Supervised, Action Localization, Two-Stage, Prototype Learning, Feature Enhancement


Abstract:

Because weakly supervised temporal action localization models have no frame-level supervision signal, they are prone to two problems when recognizing action instances near their boundaries: focusing too heavily on the most discriminative part of an action while ignoring its other parts, which leads to under-localization; and failing to separate action boundaries from the background they closely resemble, which leads to over-localization. To classify action snippets more effectively and alleviate the under- and over-localization of hard boundary samples, we propose a two-stage weakly supervised temporal action localization method. In the first stage, we extract RGB and optical flow features from the input video frames and design a hard sample mining strategy that produces a set of hard boundary samples and a set of easy action samples. We also design a prototype generation module that yields a prototype center for each action class, converting the second-stage action classification task into a distance problem between the embedding space and the prototype centers. In the second stage, the hard sample set obtained in the first stage is fed into a prototype matching module to produce class-specific temporal class activation maps. Furthermore, because optical flow features capture motion dynamics, they deserve particular attention: we design a method that computes the similarity between the hard sample set and the easy action sample set to obtain enhanced optical flow features, enabling more accurate action predictions for hard boundary samples. Finally, to further refine the action labels predicted by the model, a pseudo-labeling strategy is adopted to provide an effective frame-level supervision signal. Experiments on the THUMOS'14 and ActivityNet v1.2 datasets show that the method outperforms existing weakly supervised temporal action localization methods.
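To make the two core mechanisms concrete, the following is a minimal PyTorch sketch, not the paper's released code: all function names, tensor shapes, and the temperature parameter are illustrative assumptions. It shows (a) prototype matching, where each snippet embedding is scored by its distance to the per-class prototype centers to form a temporal class activation map, and (b) a similarity-weighted enhancement of hard-sample optical flow features using the easy action set.

```python
# Minimal sketch of prototype-based snippet scoring and flow enhancement.
# Shapes, names, and the temperature hyperparameter are hypothetical.
import torch
import torch.nn.functional as F

def prototype_tcam(embeddings: torch.Tensor, prototypes: torch.Tensor,
                   temperature: float = 0.1) -> torch.Tensor:
    """embeddings: (T, D) snippet features; prototypes: (C, D) class centers.
    Returns a (T, C) temporal class activation map: the closer a snippet
    lies to a class prototype in the embedding space, the higher its score."""
    # Squared Euclidean distance between every snippet and every prototype.
    dists = torch.cdist(embeddings, prototypes) ** 2          # (T, C)
    # Negate distances so that smaller distance -> larger class score.
    return F.softmax(-dists / temperature, dim=-1)

def enhance_flow(hard_flow: torch.Tensor, easy_flow: torch.Tensor) -> torch.Tensor:
    """hard_flow: (Th, D) optical-flow features of hard boundary snippets;
    easy_flow: (Te, D) features of easy action snippets. Each hard snippet
    is augmented with a cosine-similarity-weighted average of the easy set."""
    sim = F.softmax(F.normalize(hard_flow, dim=-1)
                    @ F.normalize(easy_flow, dim=-1).t(), dim=-1)  # (Th, Te)
    return hard_flow + sim @ easy_flow                              # (Th, D)

# Usage example: 200 snippets with 2048-D features, 20 action classes.
tcam = prototype_tcam(torch.randn(200, 2048), torch.randn(20, 2048))
```

Framing classification as distance to prototype centers, rather than as a linear classifier over snippet features, lets hard boundary snippets in the second stage be scored against class-level evidence aggregated in the first stage.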

