OALib Journal
ISSN: 2333-9721

An Analysis of OpenSeeD for Video Semantic Labeling

DOI: 10.4236/jcc.2025.131005, PP. 59-71

Keywords: Semantic Segmentation, Detection, Labeling, OpenSeeD, Open-Vocabulary, Walking Tours Dataset, Videos


Abstract:

Semantic segmentation is a core task in computer vision that allows AI models to understand and interact with their surrounding environment. Much as humans subconsciously segment scenes, this ability is crucial for scene understanding. However, many semantic segmentation models face a shortage of training data: existing video datasets are limited to short, low-resolution clips that are not representative of real-world footage. One of our key contributions is therefore a customized semantic segmentation version of the Walking Tours Dataset, which features hour-long, high-resolution, real-world footage from walking tours of different cities. We also evaluate the open-vocabulary segmentation model OpenSeeD on this custom dataset and discuss future implications.
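The evaluation the abstract describes — scoring a model's predicted per-pixel labels against ground-truth annotations — is commonly done with mean intersection-over-union (mIoU). The sketch below is a minimal, generic illustration of that metric on flattened label maps; the label maps and class ids are made-up examples, and it does not use the actual OpenSeeD API or the Walking Tours annotations.

```python
# Minimal sketch of per-frame semantic-labeling evaluation via mean IoU.
# Label maps are flattened lists of integer class ids; values are illustrative.

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Hypothetical flattened 2x4 label maps for one frame
# (0 = road, 1 = building, 2 = sky).
gt = [0, 0, 1, 1, 2, 2, 2, 0]
pred = [0, 0, 1, 2, 2, 2, 2, 0]
score = mean_iou(pred, gt, num_classes=3)
```

In a full evaluation pipeline this per-frame score would be accumulated over all frames of a video (or over the whole dataset's confusion matrix) before averaging, since per-frame averaging can over-weight rare classes.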

