Cross-Modal Retrieval Optimization Based on Ranking Loss
Abstract:
Cross-modal retrieval aims to retrieve data in one modality (such as text or images) given a query in another. Traditional cross-modal retrieval methods rely primarily on modality alignment and similarity measures to match features across modalities. This paper proposes a ranking-based cross-modal retrieval method that optimizes the retrieval process by introducing a ranking loss, so that the items most relevant to the query are ranked highest in the results. Experimental results show that the ranking loss significantly improves cross-modal retrieval performance, particularly on text-image matching tasks, providing a new methodological perspective and a solid technical foundation for future research in the field.
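To make the idea concrete, the following is a minimal sketch of one common form of ranking loss for image-text retrieval: a bidirectional margin-based (triplet-style) hinge loss over a batch of paired embeddings, which pushes each matched image-text pair to score higher than every mismatched pair by a margin. The cosine-similarity scoring, the margin value, and the function names here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Unit-normalize rows so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge ranking loss over a batch of paired
    image/text embeddings (row i of each matrix is a matched pair).
    Matched pairs should outscore every mismatched pair by `margin`.
    Illustrative sketch; the paper's actual loss may differ."""
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    scores = img @ txt.T                # cosine similarity matrix
    pos = np.diag(scores)               # matched-pair scores
    # image -> text: penalize captions that outrank the true caption
    cost_i2t = np.maximum(0.0, margin + scores - pos[:, None])
    # text -> image: penalize images that outrank the true image
    cost_t2i = np.maximum(0.0, margin + scores - pos[None, :])
    n = scores.shape[0]
    mask = ~np.eye(n, dtype=bool)       # exclude the positive pairs
    return (cost_i2t[mask].sum() + cost_t2i[mask].sum()) / n

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.05 * rng.normal(size=(4, 8))  # well-aligned pairs -> small loss
print(triplet_ranking_loss(img, txt))
```

Minimizing this quantity directly encourages the relevance ordering described above: the more mismatched pairs that outrank a true pair, the larger the loss, so relevant items are driven toward the top of the retrieved list.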