Sub-Med VQA: Medical Visual Question Answering Combining Sub-Question Generation and Multimodal Reasoning
Abstract:
Medical Visual Question Answering (Medical VQA) supports clinical diagnosis and decision-making by answering natural language questions about medical images. However, existing approaches fall short in multi-step reasoning, fine-grained understanding, and interpretability. This paper proposes a model that decomposes complex medical queries into simpler sub-questions through a sub-question generation mechanism and, combined with multimodal alignment and dynamic knowledge injection modules, reasons over them step by step. The model dynamically focuses on key regions of medical images, integrates query-relevant semantics, and improves the accuracy and reliability of answer generation. Experiments on the SLAKE and VQA-MED datasets show that the proposed method outperforms existing approaches in answer accuracy, reasoning capability, and interpretability, offering an efficient solution for multimodal information integration and complex reasoning in Medical VQA and providing new insights for clinical diagnosis and intelligent medical research.
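The abstract describes the pipeline only at a high level: a generator decomposes the query, each sub-question is answered against the image, and the intermediate results are combined into a final answer. The sketch below illustrates that control flow in Python. It is a minimal mock-up under stated assumptions; every name in it (generate_sub_questions, answer_sub_question, SubAnswer, the confidence-weighted aggregation) is a hypothetical placeholder, not the authors' actual API or architecture.

```python
# Minimal sketch of the sub-question-driven VQA control flow described in
# the abstract. All components are hypothetical stand-ins: in the actual
# model the generator, per-step reasoner, and aggregator would be learned
# neural modules operating on image and text features.

from dataclasses import dataclass
from typing import List


@dataclass
class SubAnswer:
    sub_question: str
    answer: str
    confidence: float  # assumption: each reasoning step scores its own answer


def generate_sub_questions(question: str) -> List[str]:
    """Decompose a complex medical query into simpler sub-questions.
    (Placeholder: a learned sub-question generator would do this.)"""
    return [
        f"Which image region is relevant to: '{question}'?",
        "What finding is visible in that region?",
        f"Given the finding, what is the answer to: '{question}'?",
    ]


def answer_sub_question(image_features, sub_question: str) -> SubAnswer:
    """Answer one sub-question via multimodal alignment and knowledge
    injection. (Placeholder: returns a dummy answer instead of running
    a network over image_features.)"""
    return SubAnswer(sub_question, answer="<predicted>", confidence=0.9)


def answer(image_features, question: str) -> str:
    """Progressive reasoning: solve sub-questions in order, then aggregate."""
    steps = [
        answer_sub_question(image_features, sq)
        for sq in generate_sub_questions(question)
    ]
    # Assumed aggregation rule: pick the most confident step's answer;
    # the paper may instead fuse all intermediate states.
    best = max(steps, key=lambda s: s.confidence)
    return best.answer


if __name__ == "__main__":
    print(answer(image_features=None,
                 question="Is there a lesion in the left lung?"))
```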