ProMahaVQA: Improving Zero-Shot Visual Question Answering with Prototype Learning and Contrastive Loss
Abstract:
Visual Question Answering (VQA) is a challenging artificial intelligence task that requires models to comprehend image content and natural language questions through cross-modal semantic integration. However, existing methods often struggle with deep visual-language interaction, and their generalization is particularly limited in zero-shot scenarios. To address these challenges, we propose ProMahaVQA, a novel model that incorporates a cross-modal prototype matrix, a prototype query mechanism, and a Mahalanobis distance-based multi-label contrastive loss. These components significantly enhance feature discrimination and model robustness. Notably, this work is the first to integrate prototype learning into zero-shot VQA, enabling the model to recognize unseen answers via a memory matrix. Experimental results on the F-VQA dataset, under both traditional zero-shot learning (TZSL) and generalized zero-shot learning (GZSL) settings, demonstrate that ProMahaVQA substantially outperforms existing approaches, exhibiting superior generalization and cross-modal reasoning capabilities.
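The abstract only names the Mahalanobis distance-based multi-label contrastive loss without giving its formulation, so the following is a minimal illustrative sketch rather than the authors' implementation. It pairs a squared Mahalanobis distance to a bank of answer prototypes with a softmax-style multi-label contrastive objective; the function names, tensor shapes, shared inverse covariance `cov_inv`, and temperature `tau` are all assumptions introduced here.

```python
import torch
import torch.nn.functional as F

def mahalanobis_dist_sq(x, protos, cov_inv):
    # x: (B, D) fused image-question features; protos: (K, D) answer prototypes.
    # cov_inv: (D, D) inverse covariance, e.g. estimated from training features.
    diff = x.unsqueeze(1) - protos.unsqueeze(0)            # (B, K, D)
    # d_M^2(x, p) = (x - p)^T S^{-1} (x - p), computed for every (sample, prototype) pair.
    return torch.einsum('bkd,de,bke->bk', diff, cov_inv, diff)

def multilabel_contrastive_loss(x, protos, cov_inv, targets, tau=0.1):
    # targets: (B, K) multi-hot answer labels. The loss pulls each feature toward
    # the prototypes of all its correct answers and pushes it from the rest.
    logits = -mahalanobis_dist_sq(x, protos, cov_inv) / tau  # smaller distance -> larger logit
    log_prob = F.log_softmax(logits, dim=1)
    pos = targets / targets.sum(dim=1, keepdim=True).clamp(min=1)
    return -(pos * log_prob).sum(dim=1).mean()
```

Under this reading, replacing the inverse covariance with the identity matrix recovers an ordinary Euclidean prototype loss, which is one way to see what the Mahalanobis weighting adds: it discounts directions of high feature variance when comparing features to prototypes.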