A CLIP-Based Method for Image-to-Image Retrieval
Abstract:
In recent years, the rapid advancement of artificial intelligence in computer vision and natural language processing (NLP) has driven a deep integration of the two fields, significantly expanding the technological boundaries and application prospects of intelligent systems. This cross-domain convergence not only spurs technological innovation but also opens new paths for novel research and applications. This paper introduces CLIP-Retrieval, an image retrieval method designed for the Cats vs. Dogs dataset and railway-related datasets, aimed at addressing the retrieval challenges posed by complex backgrounds and multi-angle photography in both public and specialized domains. CLIP-Retrieval uses the image encoder of the CLIP model as its core architecture: it extracts image features, constructs a similarity matrix to compute similarity scores between images, and ranks the results to display the most relevant ones. To verify the robustness and stability of CLIP-Retrieval, we conducted comparative experiments and anti-interference experiments. The results show a significant performance improvement and strong retrieval quality. In particular, CLIP-Retrieval effectively handles complex backgrounds and pose variations across datasets, providing accurate and efficient retrieval.
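The pipeline described above (encode images, build a similarity matrix, rank by score) can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the feature vectors here are random stand-ins for the 512-dimensional embeddings that CLIP's image encoder would produce, and cosine similarity is assumed as the scoring function.

```python
import numpy as np

# Stand-in features: in the real method these would come from the CLIP
# image encoder; random vectors keep the sketch self-contained.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 512))   # 5 gallery images, 512-d features
query = rng.normal(size=(1, 512))     # 1 query image

def l2_normalize(x):
    """Normalize rows to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

gallery_n = l2_normalize(gallery)
query_n = l2_normalize(query)

# Similarity matrix: cosine similarity between the query and each gallery image.
scores = query_n @ gallery_n.T        # shape (1, 5)

# Rank gallery images by descending similarity and keep the top-3 matches.
ranking = np.argsort(-scores[0])
top3 = ranking[:3]
print(top3)
```

With multiple queries, `query` simply gains extra rows and `scores` becomes the full query-by-gallery similarity matrix; the per-row `argsort` then yields each query's ranked result list.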