
Video Summarization Using a Dilated Convolutional Multi-Dimension Attention Network

DOI: 10.12677/SEA.2023.123043, PP. 433-443

Keywords: Video Summarization, Dilated Convolution, Deep Learning, Self-Attention, Computer Vision


Abstract:

The ubiquity of smartphones and video cameras has led to exponential daily growth in video data, so accurate and reliable video summarization techniques are of great value for video condensation, browsing, and retrieval. Mainstream summarization methods are built on LSTMs (Long Short-Term Memory) and convolutional neural networks, but both have inherent limitations: an LSTM processes only one frame per time step, which makes training slow and hard to parallelize, while the repeated down-sampling and max-pooling operations in convolutional networks discard a large amount of detail. To address these problems, this paper proposes DCMAN (Dilated Convolutional Multi-Dimension Attention Network), a multi-dimensional attention model based on dilated convolution. The model first extracts short-term temporal information and visual features of the video with a cascaded dilated convolutional network that uses no max pooling, choosing dilation rates so that the network's receptive field is not reduced. Second, combined spatial and channel attention captures the long-term dependencies of the video and assigns an importance score to each frame. Finally, additional skip connections fuse information at more scales, and differently initialized attention query vectors capture more complete video information. Experiments on four public video summarization datasets show that the proposed DCMAN model clearly outperforms other state-of-the-art video summarization methods.
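Two of the abstract's claims can be made concrete with a short numerical sketch (this is not the authors' code; function names, dilation rates, and feature sizes are illustrative assumptions). The first function shows why cascaded dilated convolutions can keep a large receptive field without pooling: each layer with dilation d adds (kernel_size − 1) · d frames of coverage. The second shows, in miniature, how an attention query vector can turn per-frame features into normalized importance scores.

```python
import numpy as np

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated 1-D convolutions (stride 1).

    Each layer with dilation d adds (kernel_size - 1) * d frames of
    coverage, so suitable dilation rates grow the receptive field
    without any pooling or down-sampling."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def frame_importance(features, query):
    """Scaled dot-product attention scores: one importance weight per frame.

    features: (n_frames, dim) array of per-frame feature vectors.
    query:    (dim,) query vector (learned in the model; fixed here)."""
    logits = features @ query / np.sqrt(features.shape[1])
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Three cascaded 3-tap convolutions with dilations 1, 2, 4 already
# cover a 15-frame window, with no detail lost to pooling.
rf = receptive_field(3, [1, 2, 4])
print(rf)  # 15

rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 4))  # 6 frames, 4-dim features (toy sizes)
query = rng.standard_normal(4)
scores = frame_importance(feats, query)
print(scores.sum())  # softmax-normalised: sums to 1
```

In the full model the per-frame features would come from the dilated convolutional backbone and the query vectors would be learned; differently initialized queries then attend to different aspects of the video, which is the intuition behind the abstract's "more complete video information" claim.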

