End-to-End Trainable Head Pose Estimation with a Vision Transformer Based on Multi-Scale Dilated Separable Convolutions
Abstract:
In this paper, we propose a novel approach for head pose estimation from RGB images, building on the Hopenet network and the Vision Transformer. Our method introduces an architecture comprising three key components: (1) a backbone network, (2) a Vision Transformer, and (3) a prediction head. To enhance feature extraction, we further improve the backbone network by incorporating multi-scale dilated separable convolutions. Compared with conventional convolutional neural networks and Vision Transformers used for feature extraction, our backbone preserves critical information more effectively while reducing image resolution. Ablation studies confirm that the proposed backbone, equipped with multi-scale dilated separable convolutions, outperforms conventional deep convolutional networks and Vision Transformer-based architectures in feature retention. We conduct extensive experiments and ablation studies on the 300W-LP and AFLW2000 datasets. The results demonstrate that our approach significantly improves both accuracy and robustness in head pose estimation, outperforming Hopenet and several Transformer-encoder-based methods such as HeadPosr.
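The abstract's central building block, the multi-scale dilated separable convolution, can be sketched as follows. This is a minimal NumPy illustration of the general technique (a depthwise convolution applied at several dilation rates, concatenated, then fused by a 1×1 pointwise convolution); the function names, kernel shapes, and dilation rates here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def dilated_depthwise_conv(x, w, dilation):
    """Depthwise 2D convolution with dilation and 'same' padding.
    x: (C, H, W) feature map; w: (C, k, k), one kernel per channel."""
    C, H, W = x.shape
    k = w.shape[1]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x, dtype=float)
    # Accumulate the k*k shifted-and-scaled copies of the input.
    for i in range(k):
        for j in range(k):
            out += w[:, i:i + 1, j:j + 1] * \
                xp[:, i * dilation:i * dilation + H,
                      j * dilation:j * dilation + W]
    return out

def pointwise_conv(x, w):
    """1x1 convolution mixing channels. x: (C, H, W); w: (C_out, C)."""
    return np.einsum('oc,chw->ohw', w, x)

def multiscale_dilated_separable(x, dw_kernels, pw_weight,
                                 dilations=(1, 2, 3)):
    """Apply the depthwise kernels at several dilation rates (several
    receptive-field sizes), concatenate along channels, then fuse the
    scales with a single pointwise convolution."""
    feats = [dilated_depthwise_conv(x, dw_kernels, d) for d in dilations]
    stacked = np.concatenate(feats, axis=0)    # (C * len(dilations), H, W)
    return pointwise_conv(stacked, pw_weight)  # (C_out, H, W)

# Hypothetical shapes for illustration only.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))    # 8 input channels
dw = rng.standard_normal((8, 3, 3))     # one 3x3 kernel per channel
pw = rng.standard_normal((32, 8 * 3))   # fuse 3 scales into 32 channels
y = multiscale_dilated_separable(x, dw, pw)
print(y.shape)  # (32, 16, 16): spatial size preserved by 'same' padding
```

Because the depthwise and pointwise stages are factored apart, the multi-scale branches add receptive-field diversity at a fraction of the parameter cost of full multi-scale standard convolutions, which is the usual motivation for separable designs.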
[1] Bulat, A. and Tzimiropoulos, G. (2017) How Far Are We from Solving the 2D & 3D Face Alignment Problem? (And a Dataset of 230,000 3D Facial Landmarks). 2017 IEEE International Conference on Computer Vision (ICCV), Venice, 22-29 October 2017, 1021-1030. https://doi.org/10.1109/iccv.2017.116
[2] Bisogni, C., Nappi, M., Pero, C. and Ricciardi, S. (2021) FASHE: A Fractal Based Strategy for Head Pose Estimation. IEEE Transactions on Image Processing, 30, 3192-3203. https://doi.org/10.1109/tip.2021.3059409
[3] Murphy-Chutorian, E., Doshi, A. and Trivedi, M.M. (2007) Head Pose Estimation for Driver Assistance Systems: A Robust Algorithm and Experimental Evaluation. 2007 IEEE Intelligent Transportation Systems Conference, Bellevue, 30 September-3 October 2007, 709-714. https://doi.org/10.1109/itsc.2007.4357803
[4] Chong, E., Ruiz, N., Wang, Y., Zhang, Y., Rozga, A. and Rehg, J.M. (2018) Connecting Gaze, Scene, and Attention: Generalized Attention Estimation via Joint Modeling of Gaze and Scene Saliency. In: Ferrari, V., Hebert, M., Sminchisescu, C. and Weiss, Y., Eds., Computer Vision—ECCV 2018, Springer, 397-412. https://doi.org/10.1007/978-3-030-01228-1_24
[5] Ruiz, N., Chong, E. and Rehg, J.M. (2018) Fine-Grained Head Pose Estimation without Keypoints. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, 18-22 June 2018, 2074-2083. https://doi.org/10.1109/cvprw.2018.00281
[6] Kazemi, V. and Sullivan, J. (2014) One Millisecond Face Alignment with an Ensemble of Regression Trees. 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 23-28 June 2014, 1867-1874. https://doi.org/10.1109/cvpr.2014.241
[7] Kumar, A., Alavi, A. and Chellappa, R. (2017) KEPLER: Keypoint and Pose Estimation of Unconstrained Faces by Learning Efficient H-CNN Regressors. 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, 30 May-3 June 2017, 258-265. https://doi.org/10.1109/fg.2017.149
[8] Fanelli, G., Weise, T., Gall, J. and Van Gool, L. (2011) Real Time Head Pose Estimation from Consumer Depth Cameras. In: Mester, R. and Felsberg, M., Eds., Pattern Recognition, Springer, 101-110. https://doi.org/10.1007/978-3-642-23123-0_11
[9] Meyer, G.P., Gupta, S., Frosio, I., Reddy, D. and Kautz, J. (2015) Robust Model-Based 3D Head Pose Estimation. 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 7-13 December 2015, 3649-3657. https://doi.org/10.1109/iccv.2015.416
[10] Vaswani, A., et al. (2017) Attention Is All You Need. arXiv: 1706.03762.
[11] Dosovitskiy, A., et al. (2020) An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv: 2010.11929.
[12] Dhingra, N. (2021) HeadPosr: End-to-End Trainable Head Pose Estimation Using Transformer Encoders. 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), Jodhpur, 15-18 December 2021, 1-8. https://doi.org/10.1109/fg52635.2021.9667080
[13] Liu, F., Xu, H., Qi, M., Liu, D., Wang, J. and Kong, J. (2022) Depth-Wise Separable Convolution Attention Module for Garbage Image Classification. Sustainability, 14, Article 3099. https://doi.org/10.3390/su14053099
[14] Cao, Z., Chu, Z., Liu, D. and Chen, Y. (2021) A Vector-Based Representation to Enhance Head Pose Estimation. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, 3-8 January 2021, 1187-1196. https://doi.org/10.1109/wacv48630.2021.00123
[15] Liang, L., Zhang, T. and He, W. (2019) Head Pose Estimation with Multi-Scale Convolutional Neural Networks. Laser & Optoelectronics Progress, 56(13), 79-86. (In Chinese)
[16] Zhu, X., Lei, Z., Liu, X., Shi, H. and Li, S.Z. (2016) Face Alignment across Large Poses: A 3D Solution. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 27-30 June 2016, 146-155. https://doi.org/10.1109/cvpr.2016.23