Computer Science and Application 2025

Zero-Shot Multimodal Tongue Image Segmentation Based on SAM

DOI: 10.12677/csa.2025.153055, PP. 29-38

钟甫广, 邓森耀, 曾军英, 冯跃, 钟甫东, 贾旭东

Keywords: Tongue Image Segmentation, Zero-Shot Learning, Multimodal Large Language Model, Similarity Clustering, Medical Image Processing


Abstract:

Tongue diagnosis assesses health status by observing characteristics of the tongue. Tongue segmentation, a key step in intelligent tongue diagnosis, must accurately separate the tongue body from the background to lay the foundation for subsequent feature extraction and health analysis. However, tongue segmentation currently faces two main challenges: data scarcity, and the dependence of existing large segmentation models (such as the Segment Anything Model, SAM) on manual prompts. To address these issues, this paper proposes a zero-shot multimodal segmentation method. The method combines SAM with multimodal prompting and is implemented as a two-stage framework: 1) initial segmentation and similarity clustering, in which SAM generates initial segmentation results and a similarity clustering decoder selects the potentially valid segmentations; 2) refined segmentation, in which a multimodal large language model analyzes tongue characteristics to generate precise point prompts, which are fed back into SAM to achieve high-precision segmentation. The method enables intelligent segmentation with SAM in tongue diagnosis without task-specific training or annotated data. Experimental results show that, compared with the original SAM, the method improves mIoU on three tongue diagnosis datasets by 27.3%, 18.2%, and 29.7%, respectively.
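For readers unfamiliar with the reported metric, the following is a minimal sketch of how mIoU might be computed for binary tongue masks. It assumes mIoU is the mean, over images, of the foreground intersection-over-union. The abstract does not state the exact averaging convention, so this is an illustration rather than the authors' evaluation code. The names `iou` and `mean_iou` and the nested-list mask representation are hypothetical.

```python
from typing import Sequence

# Binary mask as rows of 0/1 values: 1 = tongue, 0 = background (assumed encoding).
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-image IoU scores (assumed definition of mIoU)."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal in length")
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    predicted = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [0, 0]]
    print(mean_iou([predicted, ground_truth], [ground_truth, ground_truth]))
```

Under this definition, a mask that covers the tongue plus an equal area of background scores 0.5, which is the kind of over-segmentation that better point prompts are meant to reduce.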

