ProMahaVQA: Improving Zero-Shot Visual Question Answering with Prototype Learning and Contrastive Loss
Abstract:
Visual Question Answering (VQA) is a challenging artificial intelligence task that requires models to comprehend image content and natural language questions through cross-modal semantic integration. However, existing methods often struggle with deep visual-language interaction, and their generalization is particularly limited in zero-shot scenarios. To address these challenges, we propose ProMahaVQA, a novel model that incorporates a cross-modal prototype matrix, a prototype query mechanism, and a Mahalanobis distance-based multi-label contrastive loss. These components significantly enhance feature discrimination and model robustness. Notably, this work is the first to integrate prototype learning into zero-shot VQA, enabling the model to recognize unseen answers via a memory matrix. Experimental results on the F-VQA dataset, under both traditional zero-shot learning (TZSL) and generalized zero-shot learning (GZSL) settings, demonstrate that ProMahaVQA substantially outperforms existing approaches, exhibiting superior generalization and cross-modal reasoning capabilities.
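The abstract only names the Mahalanobis distance-based multi-label contrastive loss without giving its formulation, so the following is a minimal illustrative sketch rather than the authors' implementation. It pairs a squared Mahalanobis distance to a bank of answer prototypes with a softmax-style multi-label contrastive objective; the function names, tensor shapes, shared inverse covariance `cov_inv`, and temperature `tau` are all assumptions introduced here.

```python
import torch
import torch.nn.functional as F

def mahalanobis_dist_sq(x, protos, cov_inv):
    # x: (B, D) fused image-question features; protos: (K, D) answer prototypes.
    # cov_inv: (D, D) inverse covariance, e.g. estimated from training features.
    diff = x.unsqueeze(1) - protos.unsqueeze(0)            # (B, K, D)
    # d_M^2(x, p) = (x - p)^T S^{-1} (x - p), computed for every (sample, prototype) pair.
    return torch.einsum('bkd,de,bke->bk', diff, cov_inv, diff)

def multilabel_contrastive_loss(x, protos, cov_inv, targets, tau=0.1):
    # targets: (B, K) multi-hot answer labels. The loss pulls each feature toward
    # the prototypes of all its correct answers and pushes it from the rest.
    logits = -mahalanobis_dist_sq(x, protos, cov_inv) / tau  # smaller distance -> larger logit
    log_prob = F.log_softmax(logits, dim=1)
    pos = targets / targets.sum(dim=1, keepdim=True).clamp(min=1)
    return -(pos * log_prob).sum(dim=1).mean()
```

Under this reading, replacing the inverse covariance with the identity matrix recovers an ordinary Euclidean prototype loss, which is one way to see what the Mahalanobis weighting adds: it discounts directions of high feature variance when comparing features to prototypes.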