High-Fidelity 3D Face Texture Generation Based on a Cross-Domain Guided Diffusion Model
Abstract:
Generating high-fidelity 3D facial textures from a single in-the-wild image is a challenging task. Existing methods have made significant progress in recovering color and illumination, but they still struggle to reconstruct mid- and high-frequency texture details. The main cause is the scarcity of real facial UV texture datasets: most existing models are trained on synthetic UV texture maps, which, lacking ground-truth supervision, inevitably differ substantially from real UV textures, so the models learn an incorrect texture distribution. Motivated by this, we use the detailed textures in the original image space to guide the generation of texture maps in UV space, and we propose a two-stage training scheme to mitigate the loss of personalized detail caused by training solely on synthetic UV texture maps. In addition, building on the strong performance of diffusion models in image generation, we design a cross-domain guided diffusion model that encodes detail information from both the spatial and frequency domains into high-level semantic conditions, which guide the diffusion model's generation process and enable near-exact reconstruction. Finally, we embed the cross-domain guided diffusion model as the UV texture generator in a 3D reconstruction framework to reconstruct high-fidelity 3D facial textures. Experimental results show that the proposed model generates mid- and high-frequency texture details well and clearly outperforms other 3D facial texture generation methods in both quantitative and qualitative analyses.
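To make the notion of frequency-domain detail concrete, the following is a minimal sketch of how an image can be separated into a low-frequency part (overall color and shading) and a mid-to-high-frequency part (fine texture such as pores and wrinkles) using a 2D Fourier transform. It is an illustration only and not the encoder used in the paper: the function name `split_frequency_bands`, the `cutoff` value, and the use of a hard circular mask are all assumptions made for this example.

```python
import numpy as np


def split_frequency_bands(image, cutoff=0.1):
    """Split a 2D grayscale image into low- and high-frequency components.

    `cutoff` is a hypothetical radius in normalized frequency
    (cycles per pixel); it is not a value taken from the paper.
    """
    height, width = image.shape
    spectrum = np.fft.fft2(image)

    # Per-axis normalized frequencies, laid out to match fft2's output order.
    freq_y = np.fft.fftfreq(height)[:, None]
    freq_x = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(freq_x ** 2 + freq_y ** 2)

    # Hard circular mask: frequencies inside the radius count as "low".
    low_mask = radius <= cutoff

    # The mask is symmetric, so the inverse transforms are real up to
    # floating-point error; .real discards that residual imaginary part.
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    texture = rng.random((64, 64))
    low, high = split_frequency_bands(texture, cutoff=0.1)
    # The two bands are complementary and sum back to the input.
    print(np.allclose(low + high, texture))
```

Because the two masks are complementary, the bands always add back up to the input image; a learned encoder such as the one the abstract describes would instead map this kind of detail information to a compact condition vector.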