A Zero-Shot Multimodal Tongue Segmentation Method Based on SAM
Abstract:
Tongue diagnosis assesses health status by observing tongue characteristics. Tongue segmentation is a key step in intelligent tongue diagnosis: it must accurately separate the tongue body from the background to lay the foundation for subsequent feature extraction and health analysis. However, tongue segmentation currently faces two main challenges: data scarcity, and the dependence of existing large segmentation models (such as the Segment Anything Model, SAM) on manual prompts. To address these issues, this paper proposes a zero-shot multimodal segmentation method. The method combines SAM with multimodal prompting and is implemented as a two-stage framework: 1) initial segmentation and similarity clustering, where SAM generates initial segmentation results and a similarity clustering decoder selects the potentially valid segmentations; 2) refined segmentation, where a multimodal large language model analyzes tongue characteristics to generate precise point prompts, which are fed back into SAM to achieve high-precision segmentation. The method enables intelligent segmentation with SAM in tongue diagnosis without task-specific training or annotated data. Experimental results show that, compared with the original SAM, the method improves mIoU on three tongue diagnosis datasets by 27.3%, 18.2%, and 29.7%, respectively.
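The abstract reports its gains in mIoU (mean intersection over union). As a minimal sketch of how such a metric is commonly computed for binary tongue masks, the Python below averages the per-image IoU over a dataset. The names `iou` and `mean_iou`, the nested-list mask representation, and the convention that two empty masks score 1.0 are assumptions made for illustration; the abstract does not specify whether the authors average over images or over classes.

```python
from typing import List, Sequence

# Binary mask as rows of 0/1 values: 1 = tongue, 0 = background (assumed encoding).
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: List[Mask], targets: List[Mask]) -> float:
    """Average the per-image IoU over a dataset (one possible mIoU definition)."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and of equal length")
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)
```

For example, a prediction that covers the one true tongue pixel plus one extra pixel has an IoU of 1/2, and averaging it with a perfect prediction gives 0.75.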