Software Engineering and Applications 2025

Transformer-Based Video Frame Interpolation with MB Mask Guidance

DOI: 10.12677/sea.2025.142019, PP. 201-216

石明光, 王晓红, 马春运

Keywords: Video Frame Interpolation, Optical Flow Estimation, Mask, Transformer

Abstract:

To improve the generation quality of flow-based video frame interpolation methods in changing regions, we propose a novel two-stage video frame interpolation framework that guides the refinement of intermediate frames under the constraint of optical flow motion information. Because capturing long-range correlations improves the accuracy of optical flow estimation, we propose BWT-FlowNet for optical flow estimation, which integrates a bi-level window Transformer with a content-aware mechanism to capture long-range spatio-temporal interactions in video sequences. A Motion Boundary Mask (MB Mask) is then predicted from the motion information in the optical flow and is used to help the network focus on content-changing regions while refining the intermediate frames. We also develop a Motion Boundary-Aware Refinement Net (MBAR Net) to refine the intermediate frames. Pyramid MB Masks are used in the sub-layers of MBAR Net to highlight motion regions. In addition, a Mask Perceptual Loss is introduced to constrain content-changing regions effectively, improving the quality of the predicted frames. Experiments show that the proposed method achieves excellent performance on several public benchmarks.
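The abstract does not specify how the MB Mask, the pyramid masks, or the loss are computed, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes that motion boundaries can be approximated by thresholding the spatial gradient magnitude of the flow field, that the pyramid is built by 2x2 max pooling, and it substitutes a plain mask-weighted L1 term for the paper's Mask Perceptual Loss. The function names, the threshold value, and the pooling scheme are all hypothetical.

```python
import numpy as np


def motion_boundary_mask(flow, threshold=1.0):
    """Approximate a motion boundary mask from a flow field (assumed heuristic).

    flow: array of shape (H, W, 2) holding horizontal and vertical displacement.
    Returns a float32 array of shape (H, W) with 1.0 where the flow changes sharply.
    """
    # Spatial gradients of each flow component (axis 0 = rows, axis 1 = columns).
    du_dy, du_dx = np.gradient(flow[..., 0])
    dv_dy, dv_dx = np.gradient(flow[..., 1])
    grad_mag = np.sqrt(du_dx**2 + du_dy**2 + dv_dx**2 + dv_dy**2)
    return (grad_mag > threshold).astype(np.float32)


def mask_pyramid(mask, levels=3):
    """Build a coarse-to-fine list of masks by repeated 2x2 max pooling (assumed)."""
    pyramid = [mask]
    for _ in range(levels - 1):
        m = pyramid[-1]
        # Crop to even size so the array splits cleanly into 2x2 blocks.
        h, w = (m.shape[0] // 2) * 2, (m.shape[1] // 2) * 2
        pooled = m[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
        pyramid.append(pooled)
    return pyramid


def masked_l1_loss(pred, target, mask, eps=1e-8):
    """Mean absolute error restricted to masked pixels.

    A simple stand-in for a mask-weighted loss; the paper's Mask Perceptual
    Loss is not defined in the abstract and may differ substantially.
    pred, target: (H, W, C); mask: (H, W).
    """
    per_pixel = np.abs(pred - target).mean(axis=-1)
    return float((per_pixel * mask).sum() / (mask.sum() + eps))


if __name__ == "__main__":
    # Toy flow: the right half of the image moves 5 pixels horizontally.
    flow = np.zeros((8, 8, 2), dtype=np.float32)
    flow[:, 4:, 0] = 5.0
    mask = motion_boundary_mask(flow)
    for level, m in enumerate(mask_pyramid(mask, levels=3)):
        print(level, m.shape, int(m.sum()))
```

In this toy case the mask fires only along the vertical seam between the static and moving halves, which is the kind of content-changing region the refinement stage is meant to emphasize.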

