OALib Journal (ISSN: 2333-9721)

Hybrid CNN and ViT for Self-Supervised Knowledge Distillation Monocular Depth Estimation Method

DOI: 10.12677/mos.2024.133260, PP. 2868-2880

Keywords: Monocular Depth Estimation, Self-Supervised Learning, Knowledge Distillation, Vision Transformer


Abstract:

Monocular depth estimation is a challenging task, and existing methods cannot efficiently exploit both the long-range correlations and the local information of features. To address this problem, this paper proposes HCVNet, a self-supervised knowledge distillation monocular depth estimation method that hybridizes a CNN with a ViT (Vision Transformer). HCVNet investigates the effective combination of CNN and Vision Transformer, designing a hybrid CNN-ViT feature encoder that models local and global contextual information and extracts detail features that better express the scene. A channel feature aggregation module is employed to capture long-range dependencies: by aggregating highly discriminative features along the channel dimension, it enhances the network's perception of scene structure. Self-supervised knowledge distillation is introduced, in which a structurally identical teacher model provides additional supervision signals for training the student model, further improving network performance. Experimental results on the KITTI and Make3D datasets show that the depth estimation performance of this method surpasses current mainstream methods, that it generalizes well, and that it produces depth maps with more complete structure and clearer details.
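The abstract does not give the internals of the channel feature aggregation module, but "aggregating discriminative features in the channel dimension to capture long-range dependencies" is the shape of a standard channel-attention operation. The sketch below is a hypothetical NumPy illustration of that idea, not the paper's actual module: each channel is reweighted by its affinity to every other channel, so channels that co-vary with many others are amplified across the whole feature map.

```python
import numpy as np

def channel_feature_aggregation(x):
    """Illustrative channel-attention step (assumed form, not HCVNet's
    exact module): mix channels by their pairwise affinity so that
    discriminative channels are amplified over the whole map.
    x: feature map of shape (C, H, W)."""
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                 # one descriptor per channel
    affinity = flat @ flat.T                   # (C, C) channel-to-channel similarity
    # row-wise softmax -> attention weights between channels
    affinity -= affinity.max(axis=1, keepdims=True)
    weights = np.exp(affinity)
    weights /= weights.sum(axis=1, keepdims=True)
    aggregated = weights @ flat                # aggregate channels by attention
    return aggregated.reshape(C, H, W) + x     # residual connection

x = np.random.rand(8, 4, 4)
y = channel_feature_aggregation(x)
print(y.shape)  # (8, 4, 4)
```

Because the attention matrix relates every channel to every other channel regardless of spatial position, this kind of operation captures dependencies that a local convolution cannot; the residual connection preserves the original local features alongside the aggregated ones.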

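The distillation scheme described in the abstract adds the teacher's depth prediction as an extra supervision signal on top of the usual self-supervised objective. A minimal sketch of such a combined loss, under the assumption that the teacher term is a simple L1 penalty with an assumed weight `alpha` (neither is specified in the abstract):

```python
import numpy as np

def distillation_loss(student_depth, teacher_depth, photometric_loss, alpha=0.5):
    """Hypothetical self-distillation objective: the structurally identical
    teacher's depth map acts as a pseudo-label for the student, added to the
    self-supervised photometric loss. alpha is an assumed weighting."""
    pseudo_label_term = np.abs(student_depth - teacher_depth).mean()
    return photometric_loss + alpha * pseudo_label_term

# Toy usage: student and teacher depth maps plus a photometric loss value.
loss = distillation_loss(np.ones((2, 2)), np.zeros((2, 2)), photometric_loss=0.1)
print(loss)  # 0.6
```

How the teacher's weights are obtained (e.g. a separately trained copy or an averaged snapshot of the student) is not stated in the abstract, so this sketch simply treats the teacher's output as given.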
