
Speech-Driven Facial Generation

DOI: 10.12677/csa.2025.151020, PP. 199-208

Keywords: Face Generation, Deep Learning, Wav2vec, Cross-Attention Mechanism, Conditional Convolution


Abstract:

Speech-driven face generation aims to synthesize a talking-face video that preserves the identity of a reference face while matching the content of a driving speech signal. To address the weak identity preservation and poor facial detail of existing methods, we propose LTFG-GAN, a landmark-based speech-driven talking-face video generation model. The model first adopts an unsupervised pre-trained model fine-tuned for speech recognition as its speech encoder, and predicts facial landmarks by combining convolution with attention mechanisms. During face generation, a cross-attention mechanism is introduced to retrieve information from the original reference face, and the warped high-dimensional deformed face features are fused with the original face features through conditional convolution and spatially-adaptive normalization. The output is a talking-face video synchronized with the input speech. Experimental results show that the proposed method markedly improves the quality of the generated faces.
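
The first stage described above encodes raw audio with a pre-trained speech model fine-tuned for speech recognition (wav2vec) before predicting facial landmarks. Below is a minimal PyTorch sketch of that stage, assuming the Hugging Face wav2vec 2.0 checkpoint; the checkpoint name and the small landmark head are illustrative stand-ins, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class SpeechToLandmarks(nn.Module):
    """Sketch: wav2vec 2.0 speech encoder followed by a per-frame landmark head."""

    def __init__(self, num_landmarks: int = 68):
        super().__init__()
        # wav2vec 2.0 encoder fine-tuned on speech recognition data
        # (assumed checkpoint; downloads weights on first use)
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        hidden = self.encoder.config.hidden_size  # 768 for the base model
        # Stand-in for the paper's convolution + attention landmark predictor
        self.head = nn.Sequential(
            nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, num_landmarks * 2)
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (B, num_samples) raw 16 kHz audio
        feats = self.encoder(waveform).last_hidden_state  # (B, T, hidden)
        return self.head(feats)  # (B, T, num_landmarks * 2) per-frame (x, y) coordinates


if __name__ == "__main__":
    model = SpeechToLandmarks()
    audio = torch.randn(1, 16000)  # one second of 16 kHz audio
    print(model(audio).shape)      # (1, T, 136), T ≈ 49 frames for 1 s of audio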
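
The second stage fuses the warped (deformed) face features with information retrieved from the original reference face. The sketch below pairs a cross-attention block (queries from the warped features, keys/values from the reference) with a SPADE-style spatially-adaptive normalization as a stand-in for the conditional-convolution + spatially-adaptive-normalization fusion the abstract describes; all module names, channel sizes, and the wiring are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionFusion(nn.Module):
    """Queries come from the warped face features; keys/values from the reference face."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, warped: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        # warped, reference: (B, C, H, W) feature maps
        b, c, h, w = warped.shape
        q = warped.flatten(2).transpose(1, 2)      # (B, H*W, C)
        kv = reference.flatten(2).transpose(1, 2)  # (B, H*W, C)
        out, _ = self.attn(q, kv, kv)              # attend to reference appearance
        out = self.norm(out + q)                   # residual connection
        return out.transpose(1, 2).reshape(b, c, h, w)


class SPADEBlock(nn.Module):
    """Spatially-adaptive normalization: BN statistics, condition-driven scale/bias."""

    def __init__(self, channels: int, cond_channels: int, hidden: int = 128):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU(inplace=True)
        )
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # cond: conditioning map (e.g. fused reference features or landmark heatmaps)
        cond = F.interpolate(cond, size=x.shape[2:], mode="nearest")
        h = self.shared(cond)
        return self.bn(x) * (1 + self.gamma(h)) + self.beta(h)


if __name__ == "__main__":
    fusion = CrossAttentionFusion(channels=64)
    spade = SPADEBlock(channels=64, cond_channels=64)
    warped = torch.randn(1, 64, 32, 32)     # warped high-dimensional face features
    reference = torch.randn(1, 64, 32, 32)  # original reference-face features
    fused = fusion(warped, reference)
    out = spade(warped, fused)              # modulate warped features by fused condition
    print(out.shape)                        # torch.Size([1, 64, 32, 32])
```

One plausible reading of the design: cross-attention re-injects identity detail that warping destroys, while the spatially-adaptive normalization lets the conditioning vary per pixel instead of applying one global scale and bias.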


