Zero-Shot Multimodal Tongue Image Segmentation Based on SAM

DOI: 10.12677/csa.2025.153055, PP. 29-38

Keywords: Tongue Image Segmentation, Zero-Shot Learning, Multimodal Large Language Model, Similarity Clustering, Medical Image Processing


Abstract:

Tongue diagnosis assesses health status by observing tongue characteristics. Tongue segmentation, a key step in intelligent tongue diagnosis, must accurately separate the tongue body from the background to provide a basis for subsequent feature extraction and health analysis. However, tongue segmentation currently faces two main challenges: the scarcity of data and the dependence of existing large segmentation models (such as the Segment Anything Model, SAM) on manual prompts. To address these issues, this paper proposes a zero-shot multimodal segmentation method. The method combines the SAM model with multimodal prompting and is implemented as a two-stage framework: 1) initial segmentation and similarity clustering, where the SAM model generates initial segmentation results and a similarity-clustering decoder screens them for potentially valid segmentations; 2) refined segmentation, where a multimodal large language model analyzes tongue characteristics and generates precise point prompts, which are fed back into the SAM model to achieve high-precision segmentation. The method enables intelligent segmentation with the SAM model in the tongue-diagnosis domain without task-specific training or annotated data. Experimental results show that, compared with the original SAM model, the method improves mIoU on three tongue-diagnosis datasets by 27.3%, 18.2%, and 29.7%, respectively.
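
To make the two-stage framework concrete, the following is a minimal sketch in Python, assuming the open-source segment-anything package (sam_model_registry, SamAutomaticMaskGenerator, SamPredictor). The functions filter_candidate_masks and query_mllm_for_tongue_point are hypothetical placeholders for the paper's similarity-clustering decoder and multimodal-LLM prompt generator; they only illustrate where those components plug in and are not the authors' implementation.

# Minimal sketch of the two-stage pipeline described above. Assumes the
# open-source segment-anything package; filter_candidate_masks and
# query_mllm_for_tongue_point are hypothetical stand-ins for the paper's
# similarity-clustering decoder and MLLM prompt generator.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator, SamPredictor

# Assumed checkpoint path; any SAM checkpoint matching the registry key works.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
auto_gen = SamAutomaticMaskGenerator(sam)
predictor = SamPredictor(sam)

def filter_candidate_masks(masks, image_shape):
    """Placeholder for the similarity-clustering decoder: keep proposals whose
    area is plausible for a tongue region (2%-50% of the image)."""
    h, w = image_shape[:2]
    kept = [m for m in masks if 0.02 * h * w < m["area"] < 0.5 * h * w]
    return kept or masks  # fall back to all proposals if the filter is too strict

def query_mllm_for_tongue_point(image, candidates):
    """Hypothetical stand-in for the multimodal LLM that analyzes tongue
    features and returns one foreground point prompt (x, y). Here: the
    centroid of the highest-confidence candidate mask."""
    best = max(candidates, key=lambda m: m["predicted_iou"])
    ys, xs = np.nonzero(best["segmentation"])
    return int(xs.mean()), int(ys.mean())

def segment_tongue(image_bgr):
    image = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    # Stage 1: prompt-free mask proposals, then candidate screening.
    proposals = auto_gen.generate(image)
    candidates = filter_candidate_masks(proposals, image.shape)
    # Stage 2: point prompt from the (stubbed) MLLM, refined by SAM.
    point = query_mllm_for_tongue_point(image, candidates)
    predictor.set_image(image)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([point]),
        point_labels=np.array([1]),  # 1 marks a foreground point
        multimask_output=True,
    )
    return masks[int(np.argmax(scores))]  # highest-scoring refined mask

def binary_iou(pred, gt):
    """Per-image IoU between binary masks; the reported mIoU averages this."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

# Usage: mask = segment_tongue(cv2.imread("tongue.jpg"))

Requesting multiple masks (multimask_output=True) and keeping the highest-scoring one mirrors the refinement idea of stage 2: the single point prompt constrains SAM to the tongue body while SAM's own scoring resolves residual ambiguity.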

