%0 Journal Article
%T Speech-Driven Facial Generation
%A 李昊渊
%J Computer Science and Application
%P 199-208
%@ 2161-881X
%D 2025
%I Hans Publishing
%R 10.12677/csa.2025.151020
%X Speech-driven face generation aims to produce a talking-face video that preserves the identity of a reference face while matching the content of the driving speech. To address the poor identity preservation and weak facial detail of existing methods, a landmark-based speech-driven talking-face video generation model, LTFG-GAN, is proposed. The model first employs an unsupervised pre-trained model fine-tuned for speech recognition as the speech encoder, and predicts facial landmarks by combining convolution with an attention mechanism. Next, a cross-attention mechanism is introduced into the face generation stage to retrieve information from the original reference face, and conditional convolution with spatially adaptive normalization fuses the warped high-dimensional deformed face features with the original face features. The result is a talking-face video synchronized with the speech. Experimental results show that the method yields a clear improvement in face generation quality.
%K Facial Generation
%K Deep Learning
%K Wav2vec
%K Cross-Attention Mechanism
%K Conditional Convolution
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=106334