%0 Journal Article %T 基于空洞卷积的多维度注意力视频摘要
Video Summarization Using a Dilated Convolutional Multi-Dimension Attention Network %A 秦佳林 %J Software Engineering and Applications %P 433-443 %@ 2325-2278 %D 2023 %I Hans Publishing %R 10.12677/SEA.2023.123043 %X 智能手机、摄像机的普及导致视频数据每天呈指数级上升,准确可靠的视频摘要技术对视频概括、视频浏览和视频检索等具有重大意义。目前主流的视频摘要方法主要基于LSTM (Long Short-Term Memory)与卷积神经网络。但这些方法有固有的局限:一是LSTM每个时间步只能处理一帧,训练速度慢,不利于并行化;二是卷积网络中重复的下采样与最大池化操作导致大量细节信息丢失。基于上述问题本文提出一种基于空洞卷积的多维度注意力模型DCMAN (Dilated Convolutional Multi-Dimension Attention Network)。该模型首先利用级联空洞卷积网络提取视频的短期时间信息和视觉特征,不使用最大池化,同时设置合适的膨胀系数保证网络感受野不受影响。其次,空间与通道注意力的结合捕获视频的长期依赖,给每个视频帧赋予对应的重要性分数。最后,更多的跳连接结构融合更多尺度的信息,不一样的初始化注意力查询向量带来更完整的视频信息。实验在四个公共的视频摘要数据集上进行,实验结果表明本文提出的DCMAN模型明显优于其它最新的视频摘要方法。
The popularity of smartphones and video cameras has led to an exponential daily increase in video data, so accurate and reliable video summarization techniques are of great significance for video condensation, video browsing, and video retrieval. Current mainstream video summarization methods are based on LSTM (Long Short-Term Memory) and convolutional neural networks. However, these methods have inherent limitations: first, an LSTM can process only one frame per time step, which makes training slow and hinders parallelization; second, the repeated down-sampling and max-pooling operations in convolutional networks discard a large amount of detail. To address these problems, this paper proposes a dilated-convolution-based multi-dimension attention model, DCMAN (Dilated Convolutional Multi-Dimension Attention Network). The model first extracts short-term temporal information and visual features of the video with a cascaded dilated convolutional network that uses no max pooling, setting appropriate dilation rates so that the network's receptive field is unaffected. Second, a combination of spatial and channel attention captures the video's long-term dependencies and assigns an importance score to each video frame. Finally, additional skip connections fuse information at more scales, and differently initialized attention query vectors bring more complete video information. Experiments are conducted on four public video summarization datasets, and the results show that the proposed DCMAN model significantly outperforms other state-of-the-art video summarization methods. %K 视频摘要,空洞卷积,深度学习,自注意力机制,计算机视觉 %K Video Summarization %K Dilated Convolution %K Deep Learning %K Self-Attention %K Computer Vision %U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=66949