%0 Journal Article
%T YW-FSVOD: Open-Vocabulary Few-Shot Object Detection Method Based on YOLO-World
%A 张艺博
%A 张索非
%J Software Engineering and Applications
%P 261-270
%@ 2325-2278
%D 2025
%I Hans Publishing
%R 10.12677/sea.2025.142024
%X Few-shot object detection, an important branch of computer vision, aims to emulate the human ability to learn to detect new objects from very few samples. Existing methods rely on large amounts of annotated data and remain constrained by rigid, predefined category representations, which makes them hard to adapt to open-vocabulary scenarios. Vision-language pretrained models such as CLIP show zero-shot inference potential through cross-modal alignment, but they focus on image classification, and the performance degradation of detection models under few-shot conditions remains a key challenge. To address this, we propose YW-FSVOD, an open-vocabulary few-shot object detection method based on YOLO-World. The method builds a multi-scale feature alignment mechanism guided by language descriptions, embedding text semantics into YOLO-World's visual encoding space to improve generalization to unseen categories. It also replaces full language-model computation with precomputed text embeddings, which significantly improves inference speed while maintaining detection accuracy. Experiments show that YW-FSVOD performs well on the COCO and LVIS datasets, with accuracy significantly higher than that of traditional few-shot object detection frameworks.
%K Few-Shot Object Detection
%K Vision-Language Model
%K Open-Vocabulary
%K Two-Stage Fine-Tuning
%K YOLO-World
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=112143