
OALib Journal
ISSN: 2333-9721


Multitask Contrastive Learning for Self-Supervised Video Representation

DOI: 10.12677/CSA.2023.133041, PP. 433-443

Keywords: Self-Supervised, Spatial Feature, Temporal Feature, Multitask Contrastive Learning Method, Spatiotemporal Self-Attention


Abstract:

Most existing self-supervised approaches use a single spatial or temporal pretext task. A single pretext task provides only one supervision signal from unlabeled data, which is insufficient to capture the difference between spatial and temporal features for video representation learning. In this paper, we propose a multitask contrastive learning method that learns discriminative spatiotemporal features with spatiotemporal self-attention through contrastive learning across multiple spatial and temporal pretext tasks. Different spatial features are learned by multiple spatial pretext tasks, including spatial rotation and spatial jigsaw; different temporal features are learned by multiple temporal pretext tasks, including temporal order and temporal pace. We represent a video as multiple distinct features, one per pretext task, and design a pretext task-based contrastive loss to separate the spatial and temporal features learned from a single video. This loss encourages different pretext tasks to learn dissimilar features and the same pretext task to learn similar features, yielding discriminative features for each pretext task within one video. Experiments show that our method outperforms existing self-supervised learning methods on action recognition on the UCF-101 and HMDB-51 datasets.
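The pretext task-based contrastive loss described above can be sketched as an InfoNCE-style objective: embeddings produced by the same pretext task are pulled together, while embeddings produced by different pretext tasks serve as negatives. The following is a minimal, illustrative NumPy sketch under that assumption; the function name, shapes, and temperature value are hypothetical and are not the authors' implementation.

```python
import numpy as np

def pretext_contrastive_loss(features, task_ids, temperature=0.1):
    """Illustrative pretext task-based contrastive loss (hypothetical
    re-implementation of the idea in the abstract, not the paper's code).

    features : (N, D) array, one embedding per (clip, pretext-task) view.
    task_ids : (N,) array; rows sharing a task id are treated as
               positives (same pretext task on the same video), all
               other rows as negatives (different pretext tasks).
    """
    # Cosine similarities, scaled by a temperature (InfoNCE-style).
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature

    n = len(task_ids)
    loss, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and task_ids[j] == task_ids[i]]
        if not positives:
            continue
        # Denominator over all pairs except the self-similarity term.
        others = np.delete(sim[i], i)
        log_denom = np.log(np.exp(others).sum())
        for j in positives:
            # -log( exp(sim_positive) / sum_k exp(sim_k) )
            loss += log_denom - sim[i, j]
            count += 1
    return loss / count
```

With embeddings that cluster by pretext task, this loss is small; if the task labels are scrambled so that dissimilar embeddings are marked as positives, it grows, which is the separation behavior the abstract attributes to the loss.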


