Visual Relationship Detection Methods Based on Deep Learning and Their Applications
Abstract:
With the continued development and wide application of deep learning, many areas of computer vision have made considerable progress, for example in image classification, object detection, and image segmentation. Visual relationship detection (VRD) is an important computer vision task that aims to recognize the relationships or interactions between objects in an image. It is important for understanding images and the visual world more broadly, and it is a key step in applied computer vision research. Compared with general object detection, VRD requires predicting not only the category and trajectory of each object but also the relationships between objects. Researchers have proposed many methods for this task, and deep learning methods built on deep neural networks have brought notable advances in recent years. This paper introduces the VRD task, the basic methods of deep learning, traditional VRD methods, a categorization of deep-learning-based VRD models and their frameworks, and the applications of VRD in computer vision.