YW-FSVOD: Open-Vocabulary Few-Shot Object Detection Method Based on YOLO-World

DOI: 10.12677/sea.2025.142024, PP. 261-270

Keywords: Few-Shot Object Detection, Vision-Language Model, Open-Vocabulary, Two-Stage Fine-Tuning, YOLO-World

Abstract:

Few-shot object detection, an important branch of computer vision, aims to replicate the human ability to learn new object categories from only a handful of samples. Existing methods rely on large amounts of annotated data and remain constrained by rigid, predefined category representations, which makes them difficult to adapt to open-vocabulary scenarios. Meanwhile, vision-language pretraining models such as CLIP have demonstrated zero-shot inference potential through cross-modal alignment, but they focus primarily on image classification, and the performance degradation of object detectors under few-shot conditions remains a key challenge. To address these issues, this paper proposes YW-FSVOD, an open-vocabulary few-shot object detection method based on YOLO-World. The method builds a multi-scale feature alignment mechanism guided by language descriptions, embedding textual semantics into YOLO-World's visual encoding space to strengthen generalization to unseen categories, and it replaces full language-model computation with precomputed text embeddings, achieving a significant speedup in inference while preserving detection accuracy. Experiments show that YW-FSVOD performs strongly on COCO and LVIS, with accuracy significantly surpassing that of traditional few-shot object detection frameworks.
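The precomputed-embedding idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `encode_text` is a hypothetical stand-in for YOLO-World's frozen CLIP text branch, and `PrecomputedVocabulary` shows the general pattern of encoding the class vocabulary once offline so that detection-time scoring reduces to a matrix multiply with no language-model call.

```python
import hashlib
import numpy as np

def encode_text(prompt: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a CLIP-style text encoder; the real
    method would use YOLO-World's frozen text branch instead."""
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:4], "big")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)  # unit-norm embedding

class PrecomputedVocabulary:
    """Encode the class vocabulary once, offline, so inference never
    invokes the language model."""

    def __init__(self, class_names):
        self.class_names = list(class_names)
        # (C, D) matrix of unit-norm text embeddings, computed once.
        self.embeddings = np.stack([encode_text(c) for c in self.class_names])

    def classify_regions(self, region_feats: np.ndarray):
        """Cosine-score each region feature against the cached vocabulary
        and return the best-matching class name per region."""
        feats = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
        sims = feats @ self.embeddings.T  # (N, C) region-text similarities
        return [self.class_names[i] for i in sims.argmax(axis=1)]

# Usage: the vocabulary can be swapped at inference time without retraining.
vocab = PrecomputedVocabulary(["person", "dog", "traffic light"])
regions = np.stack([encode_text("dog"), encode_text("person")])
print(vocab.classify_regions(regions))
```

The design point is that the text embeddings depend only on the class names, not on the input image, so caching them trades a one-time encoding cost for the removal of the language model from the per-image inference path.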
