A Change-Focused Video Frame Interpolation Method Based on Transformer and Motion Boundary Masks
Abstract:
To improve the generation quality of flow-based video frame interpolation methods in changing regions, we propose a novel two-stage video frame interpolation framework that guides the refinement of intermediate frames under the constraint of optical flow motion information. Because capturing long-range correlations can improve the accuracy of optical flow estimation, we propose BWT-FlowNet for optical flow estimation, which integrates a bi-level window Transformer with a content-aware mechanism to capture long-range spatial-temporal interactions in video sequences. A Motion Boundary Mask (MB Mask) is then predicted from the motion information in the optical flow to help the network focus on content-changing regions during intermediate-frame refinement. We also develop a Motion Boundary-Aware Refinement Net (MBAR Net) to refine the intermediate frames. Pyramid MB Masks are used in the sub-layers of MBAR Net to highlight motion regions. In addition, a Mask Perceptual Loss is introduced to effectively constrain content-changing regions, improving the quality of the predicted frames. Experiments show that the proposed method achieves excellent performance on several public benchmarks.
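The abstract does not specify how the MB Mask is predicted or how the loss is formed, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a hand-crafted rule (thresholding the spatial gradient of the flow field) in place of the learned mask predictor, 2x2 max pooling to build the mask pyramid, and a mask-weighted L1 error as a simple pixel-level stand-in for the Mask Perceptual Loss. All function names, the `threshold` parameter, and the toy flow field are hypothetical.

```python
# Hypothetical sketch: names and rules below are assumptions, not from the paper.

def motion_boundary_mask(flow_u, flow_v, threshold=1.0):
    """Mark pixels where the flow changes sharply between neighbours.

    flow_u, flow_v: 2D lists (H x W) with horizontal / vertical flow.
    Returns a 2D list of 0.0 / 1.0 values (assumed hand-crafted rule).
    """
    h, w = len(flow_u), len(flow_u[0])
    mask = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Forward differences, clamped at the image border.
            x1, y1 = min(x + 1, w - 1), min(y + 1, h - 1)
            du_dx = flow_u[y][x1] - flow_u[y][x]
            du_dy = flow_u[y1][x] - flow_u[y][x]
            dv_dx = flow_v[y][x1] - flow_v[y][x]
            dv_dy = flow_v[y1][x] - flow_v[y][x]
            grad_sq = du_dx ** 2 + du_dy ** 2 + dv_dx ** 2 + dv_dy ** 2
            if grad_sq > threshold ** 2:
                mask[y][x] = 1.0
    return mask


def downsample_mask(mask):
    """Halve the resolution with 2x2 max pooling so thin boundaries survive."""
    h, w = len(mask) // 2, len(mask[0]) // 2
    return [
        [
            max(mask[2 * y][2 * x], mask[2 * y][2 * x + 1],
                mask[2 * y + 1][2 * x], mask[2 * y + 1][2 * x + 1])
            for x in range(w)
        ]
        for y in range(h)
    ]


def mask_pyramid(mask, levels=3):
    """Return [full-res mask, 1/2 res, 1/4 res, ...] with `levels` entries."""
    pyramid = [mask]
    for _ in range(levels - 1):
        pyramid.append(downsample_mask(pyramid[-1]))
    return pyramid


def masked_l1(pred, target, mask, eps=1e-6):
    """Mean absolute error over masked pixels only (pixel-level stand-in)."""
    total = sum(
        m * abs(p - t)
        for p_row, t_row, m_row in zip(pred, target, mask)
        for p, t, m in zip(p_row, t_row, m_row)
    )
    count = sum(m for m_row in mask for m in m_row)
    return total / (count + eps)


# Toy 4x4 flow: the right half of the image moves 3 px to the right.
flow_u = [[0.0, 0.0, 3.0, 3.0] for _ in range(4)]
flow_v = [[0.0] * 4 for _ in range(4)]

mb_mask = motion_boundary_mask(flow_u, flow_v)
pyramid = mask_pyramid(mb_mask, levels=3)

pred = [[0.0] * 4 for _ in range(4)]
target = [[2.0] * 4 for _ in range(4)]
loss = masked_l1(pred, target, mb_mask)
```

In this toy case the mask is set only in the column where the flow jumps from 0 to 3, and the loss is averaged over those boundary pixels alone, which is the sense in which a mask can steer refinement toward content-changing regions.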