%0 Journal Article
%T End-to-End Trainable Head Pose Estimation with Vision Transformer Based on Multi-Scale Dilated Separable Convolution
%A 尧京京
%J Modeling and Simulation
%P 426-434
%@ 2324-870X
%D 2025
%I Hans Publishing
%R 10.12677/mos.2025.143235
%X In this paper, we propose a novel approach for head pose estimation from RGB images, building on the Hopenet network and the Vision Transformer. Our method introduces a new architecture comprising three key components: (1) a backbone network, (2) a Vision Transformer, and (3) a prediction head. To strengthen feature extraction, we further improve the backbone by incorporating multi-scale dilated separable convolutions. Compared with feature extraction in conventional convolutional neural networks and Vision Transformers, our backbone preserves critical information more effectively while reducing image resolution. Ablation studies confirm that the backbone equipped with multi-scale dilated separable convolutions outperforms conventional deep convolutional networks and Vision Transformer architectures in feature retention. We conduct extensive experiments and ablation studies on the 300W-LP and AFLW2000 datasets. The results demonstrate that the proposed method significantly improves both accuracy and robustness in head pose estimation over Hopenet and Transformer encoder-based methods such as HeadPosr.
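%Z The abstract describes the key building block only at a high level; below is a minimal, hypothetical PyTorch sketch of a multi-scale dilated separable convolution, assuming the common design of parallel depthwise 3x3 branches with different dilation rates fused by a pointwise (1x1) convolution. The class name, dilation rates, and channel sizes are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class MultiScaleDilatedSeparableConv(nn.Module):
    """Hypothetical sketch: parallel depthwise 3x3 convolutions with
    different dilation rates, fused by a pointwise convolution."""
    def __init__(self, in_channels, out_channels, dilations=(1, 2, 4)):
        super().__init__()
        # One depthwise branch per dilation rate; padding=dilation keeps
        # the spatial size unchanged for a 3x3 kernel.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, in_channels, kernel_size=3,
                      padding=d, dilation=d, groups=in_channels, bias=False)
            for d in dilations
        ])
        # Pointwise convolution fuses the concatenated multi-scale features.
        self.pointwise = nn.Conv2d(in_channels * len(dilations), out_channels,
                                   kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(self.bn(self.pointwise(feats)))

# Example: a 224x224 RGB crop, as used by typical head pose pipelines.
x = torch.randn(1, 3, 224, 224)
block = MultiScaleDilatedSeparableConv(3, 64)
print(block(x).shape)  # torch.Size([1, 64, 224, 224])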
%K Head Pose Estimation
%K Multi-Scale Dilated Separable Convolutions
%K Vision Transformer
%K Transformer Encoder
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=110198