This paper proposes a novel model fusion approach that enhances the predictive capabilities of vision and language models by strategically integrating object detection with large language models. We name this multimodal integration approach VOLTRON (Vision Object Linguistic Translation for Responsive Observation and Narration). VOLTRON aims to improve the responses of self-driving vehicles when detecting small objects crossing the road and identifying merged or narrower lanes. The models are fused through a single layer that supplies LLaMA 2 (Large Language Model Meta AI) with object detection probabilities from YOLOv8-n (You Only Look Once), translated into sentences. Experiments on specialized datasets showed accuracy improvements of up to 88.16%. We provide a comprehensive exploration of the theoretical principles underlying our model fusion approach and detail the methodology used to merge these two disparate models.
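As a rough illustration of the detection-to-sentence step described above, the following minimal sketch runs the lightweight YOLOv8-n detector via the ultralytics package and converts its class probabilities into a natural-language sentence that could be included in an LLaMA 2 prompt. The helper name `detections_to_sentence`, the confidence threshold, the image path, and the sentence template are illustrative assumptions; the actual VOLTRON fusion layer and prompt format are not shown here.

```python
# Minimal sketch (assumptions noted in the text above): translate YOLOv8-n
# detection probabilities into a sentence for a language model prompt.
from ultralytics import YOLO


def detections_to_sentence(result, conf_threshold: float = 0.25) -> str:
    """Turn one YOLOv8 result into a sentence listing objects and probabilities."""
    phrases = []
    for box in result.boxes:
        conf = float(box.conf)          # detection probability for this box
        if conf < conf_threshold:
            continue
        label = result.names[int(box.cls)]  # class index -> class name
        phrases.append(f"a {label} with probability {conf:.2f}")
    if not phrases:
        return "No objects were detected in the scene."
    return "The scene contains " + ", ".join(phrases) + "."


# Run the YOLOv8-n detector on a road scene (hypothetical image path) and
# build the sentence that would be passed to LLaMA 2 as part of its prompt.
detector = YOLO("yolov8n.pt")
result = detector("road_scene.jpg")[0]
prompt = detections_to_sentence(result)
print(prompt)
```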