%0 Journal Article
%T 基于两阶段蒸馏的动作识别
Action Recognition Based on Two-Stage Distillation
%A 陈凯
%A 党存远
%A 蔡子当
%A 夏雨涵
%A 孙永宣
%J Artificial Intelligence and Robotics Research
%P 362-375
%@ 2326-3423
%D 2025
%I Hans Publishing
%R 10.12677/airr.2025.142036
%X 在计算机视觉领域,CNN与Transformer分别在局部信息提取和全局特征建模方面具有优势,如何融合CNN与Transformer成为研究热点之一。一些工作直接在Transformer编码器中引入卷积运算,然而这会改变Transformer的原有结构,限制自注意力的全局建模能力。另一些工作在CNN与Transformer的logit输出层进行知识蒸馏,然而其未能利用CNN的特征层信息。针对上述问题,本文提出特征对齐蒸馏模块,通过将Transformer的特征层与CNN的特征层进行维度对齐,实现了Transformer与CNN的特征层蒸馏,使Transformer学习到了CNN的局部建模能力。针对特征对齐操作会引入卷积操作增加模型计算量的问题,本文又提出了特征映射logit蒸馏模块,通过将Transformer的特征层映射为logit,实现了Transformer与CNN特征层的通用蒸馏方法。为了使学生模型同时学习局部建模能力和长距离依赖建模能力,本文提出了两阶段蒸馏框架,实现了CNN教师和Transformer教师对学生模型的协同指导。实验结果表明,本文方法实现了CNN与Transformer的特征层蒸馏,并使学生模型在CNN教师和Transformer教师的协同指导下,同时学习到了局部建模能力和长距离依赖建模能力,提高了基准模型在动作识别下游任务上的准确率。
In the field of computer vision, CNNs and Transformers excel at local information extraction and global feature modeling, respectively, and how to fuse the two has become a research hotspot. Some works directly introduce convolutional operations into the Transformer encoder; however, this changes the original structure of the Transformer and limits the global modeling ability of self-attention. Other works perform knowledge distillation at the logit output layer between CNN and Transformer; however, they fail to exploit the CNN's feature-layer information. To address these problems, this paper proposes a feature alignment distillation module, which realizes feature-layer distillation between Transformer and CNN by dimensionally aligning the Transformer's feature layer with the CNN's feature layer, so that the Transformer learns the CNN's local modeling ability. To address the problem that the feature alignment operation introduces convolution operations that increase model computation, this paper further proposes a feature mapping logit distillation module, which realizes a general distillation method for the feature layers of Transformer and CNN by mapping the Transformer's feature layer to logits. To enable the student model to learn both local modeling ability and long-range dependency modeling ability, this paper proposes a two-stage distillation framework, which realizes the collaborative guidance of the student model by a CNN teacher and a Transformer teacher. Experimental results show that the proposed method achieves feature-layer distillation between CNN and Transformer and enables the student model, under the collaborative guidance of the CNN teacher and the Transformer teacher, to learn both local modeling ability and long-range dependency modeling ability, improving the accuracy of the baseline model on the downstream action recognition task.
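The abstract describes the feature mapping logit distillation module only at a high level. Below is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation: the linear mapping heads, mean pooling, and temperature value are all assumptions. It shows how intermediate features of student and teacher can each be mapped to class logits and matched with a temperature-scaled KL loss, so no convolutional alignment layer is required.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureMappingLogitDistillation(nn.Module):
        # Hypothetical sketch: map intermediate features of student and
        # teacher to class logits, then distill with temperature-scaled KL.
        def __init__(self, s_dim: int, t_dim: int, num_classes: int, tau: float = 4.0):
            super().__init__()
            # Assumed linear mapping heads; the paper's exact mapping is not given here.
            self.s_head = nn.Linear(s_dim, num_classes)
            self.t_head = nn.Linear(t_dim, num_classes)
            self.tau = tau

        def forward(self, s_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
            # Pool token / spatial positions to one vector per sample: (B, N, C) -> (B, C).
            s_logit = self.s_head(s_feat.mean(dim=1)) / self.tau
            t_logit = self.t_head(t_feat.mean(dim=1)) / self.tau
            # Standard soft-label KD loss between the mapped logits;
            # the teacher branch is detached since the teacher stays frozen.
            return F.kl_div(
                F.log_softmax(s_logit, dim=-1),
                F.softmax(t_logit.detach(), dim=-1),
                reduction="batchmean",
            ) * self.tau ** 2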
%K 特征蒸馏
%K 模型融合
%K 两阶段蒸馏
%K 动作识别
%K Feature Distillation
%K Model Fusion
%K Two-Stage Distillation
%K Action Recognition
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=110073