Improving ICAFusion for Multimodal Object Detection with Dynamic Gating and Attention Masking
Abstract:
Traditional convolutional feature-fusion methods (e.g., CNN-based fusion) are constrained by local receptive fields, which makes it difficult to capture long-range inter-modality relationships and leaves them sensitive to image misalignment. Transformers offer global modeling capability, but stacking them directly causes computational complexity and parameter counts to surge. ICAFusion partially addresses these issues through an iterative cross-modal attention mechanism, yet two limitations remain: 1) the Cross-modal Feature Enhancement (CFE) module lacks dynamic weight adjustment and therefore adapts poorly to quality differences between modalities; 2) the Iterative Cross-modal Feature Enhancement (ICFE) module has limited capacity for local feature optimization and fine-grained refinement. To address these shortcomings, this paper proposes an improved multimodal feature-fusion framework. A dynamic gating mechanism and an attention masking strategy are introduced into the CFE module to adaptively balance the contributions of the modalities and filter out irrelevant information. A Fine-grained Feature Refinement Module (FRFM) is incorporated into the ICFE module; it combines local convolution, linear transformation, and gating to refine features, strengthening modality complementarity and feature representation. Experimental results demonstrate that the improved model significantly increases object detection accuracy and robustness on the KAIST and FLIR datasets; on FLIR, the high-threshold metrics mAP75 and mAP50-95 improve by 2.7% and 2.4%, respectively.
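The dynamic gating and attention masking idea in the improved CFE module can be illustrated with a minimal NumPy sketch: RGB tokens attend to infrared tokens, low-similarity attention scores are masked out before the softmax, and a sigmoid gate blends the attended features with the original ones. This is a simplified illustration under our own assumptions (a single scalar gate logit `w_gate`, a fixed score threshold `mask_thresh`), not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_masked_cross_attention(x_rgb, x_ir, w_gate, mask_thresh=0.0):
    """Cross-modal attention: RGB queries attend to IR keys/values.

    Scores below `mask_thresh` are masked before the softmax (attention
    masking), and a sigmoid gate adaptively balances the attended IR
    features against the original RGB features (dynamic gating).
    Illustrative sketch only; parameter names are our assumptions.
    """
    d = x_rgb.shape[-1]
    scores = x_rgb @ x_ir.T / np.sqrt(d)                   # (N_rgb, N_ir)
    scores = np.where(scores < mask_thresh, -1e9, scores)  # mask weak links
    attn = softmax(scores, axis=-1)
    attended = attn @ x_ir                                 # aggregated IR features
    g = 1.0 / (1.0 + np.exp(-w_gate))                      # gate in (0, 1)
    return g * attended + (1.0 - g) * x_rgb                # adaptive balance

rng = np.random.default_rng(0)
x_rgb = rng.standard_normal((4, 8))
x_ir = rng.standard_normal((4, 8))
out = gated_masked_cross_attention(x_rgb, x_ir, w_gate=0.5)
print(out.shape)  # (4, 8)
```

With a strongly negative gate logit the output collapses to the original RGB features, which is the intended fallback when the complementary modality is uninformative.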
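The FRFM's combination of local convolution, linear transformation, and gating can likewise be sketched on (tokens, channels) features: a depthwise 1-D convolution captures local structure, a pointwise linear layer transforms each token, and a learned gate fuses the two paths on top of a residual connection. All shapes and parameter names here are our assumptions for illustration.

```python
import numpy as np

def frfm_refine(x, conv_kernel, w_lin, b_lin, w_gate, b_gate):
    """Fine-grained feature refinement sketch.

    x: (T, C) token features; conv_kernel: (K, C) depthwise 1-D kernel.
    Combines a local (convolutional) path and a pointwise linear path
    through a sigmoid gate, with a residual connection.
    """
    T, C = x.shape
    K = conv_kernel.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # depthwise conv: each channel only sees its own local neighbourhood
    local = np.stack([(xp[t:t + K] * conv_kernel).sum(axis=0) for t in range(T)])
    linear = x @ w_lin + b_lin                        # pointwise linear transform
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))  # per-token, per-channel gate
    return x + g * local + (1.0 - g) * linear         # gated fusion + residual

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 4))
kernel = np.zeros((3, 4))
kernel[1] = 1.0  # delta kernel: the conv path reduces to identity
out = frfm_refine(x, kernel, np.eye(4), np.zeros(4),
                  rng.standard_normal((4, 4)), np.zeros(4))
print(out.shape)  # (6, 4)
```

With the delta kernel and identity linear weights both paths return `x`, so the output is exactly `2 * x` regardless of the gate, which makes the blending easy to sanity-check.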