ProMahaVQA: Enhancing Zero-Shot Visual Question Answering with Prototype Learning and Contrastive Loss

DOI: 10.12677/sa.2025.146145, pp. 29-41

Keywords: Visual Question Answering, Prototype Learning, Mahalanobis Distance, Zero-Shot Learning, Cross-Modal Fusion


Abstract:

Visual Question Answering (VQA) is a challenging artificial intelligence task that requires models to comprehend image content and natural language questions through cross-modal semantic integration. However, existing methods often struggle with deep visual-language interaction, and their generalization is limited in zero-shot scenarios in particular. To address these challenges, we propose ProMahaVQA, a novel model that incorporates a cross-modal prototype matrix, a prototype query mechanism, and a Mahalanobis distance-based multi-label contrastive loss. These components significantly enhance feature discrimination and model robustness. Notably, this work is the first to integrate prototype learning into zero-shot VQA, enabling the model to recognize unseen answers via a memory matrix. Experimental results under the F-VQA, TZSL, and GZSL settings demonstrate that ProMahaVQA substantially outperforms existing approaches, exhibiting superior generalization and cross-modal reasoning capabilities.
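
The mechanisms named in the abstract (a prototype memory matrix queried by Mahalanobis distance and trained with a multi-label contrastive objective) can be illustrated concretely. The squared Mahalanobis distance from a fused embedding z to prototype mu_k with covariance Sigma_k is d^2(z, mu_k) = (z - mu_k)^T Sigma_k^{-1} (z - mu_k). What follows is a minimal PyTorch sketch under stated assumptions: a learnable diagonal covariance per prototype and a multi-hot binary cross-entropy objective standing in for the paper's contrastive loss. All names (PrototypeQuery, contrastive_loss, n_prototypes) are illustrative and not drawn from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeQuery(nn.Module):
    """Scores a fused image-question embedding against a learnable
    prototype (memory) matrix via diagonal Mahalanobis distance."""
    def __init__(self, n_prototypes: int, dim: int):
        super().__init__()
        # One prototype row per answer class; keeping rows for unseen
        # answers in the same matrix is an assumption of this sketch.
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim))
        # Log-variances parameterize a diagonal covariance per prototype.
        self.log_var = nn.Parameter(torch.zeros(n_prototypes, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, dim) fused embedding -> (batch, n_prototypes) scores.
        diff = z.unsqueeze(1) - self.prototypes.unsqueeze(0)   # (B, P, D)
        inv_var = torch.exp(-self.log_var).unsqueeze(0)        # (1, P, D)
        sq_maha = (diff.pow(2) * inv_var).sum(dim=-1)          # (B, P)
        return -sq_maha  # negative distance serves as a logit

def contrastive_loss(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Multi-label objective: pull embeddings toward the prototypes of all
    # correct answers, push them away from the rest. targets is a
    # (batch, n_prototypes) multi-hot matrix of ground-truth answers.
    return F.binary_cross_entropy_with_logits(scores, targets.float())

# Usage: 16 fused embeddings scored against 500 answer prototypes.
head = PrototypeQuery(n_prototypes=500, dim=256)
z = torch.randn(16, 256)
targets = torch.zeros(16, 500)
targets[torch.arange(16), torch.randint(0, 500, (16,))] = 1.0
loss = contrastive_loss(head(z), targets)

Because the score is a negative distance, minimizing the loss shrinks the Mahalanobis distance to correct-answer prototypes while inflating it for incorrect ones; at inference, unseen answers can be scored the same way if their prototype rows are present in the memory matrix.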

