Research on a Speech Enhancement Algorithm Based on the Graph Fourier Transform
Abstract:
In speech enhancement, the spectral representation of the speech signal is commonly used as the feature input for subsequent training and enhancement. The most common approach applies the short-time Fourier transform (STFT) to the speech signal and takes its magnitude spectrum as the feature input; at the reconstruction stage, the phase of the noisy speech is then reused as the phase of the enhanced speech. This practice inevitably discards phase information. This paper proposes combining the graph Fourier transform (GFT) with a non-negative matrix factorization (NMF) algorithm and, separately, with a fully convolutional neural network (FCNN) model to enhance noisy speech. Experiments show that the GFT-NMF algorithm performs comparably to the STFT-NMF algorithm, while GFT-FCNN speech enhancement outperforms STFT-FCNN speech enhancement.
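To make the GFT feature pipeline concrete, below is a minimal sketch of a graph Fourier transform applied to one speech frame. It assumes a directed cycle graph over the frame's samples, a common construction for 1-D signals under which the GFT reduces to the DFT; the paper's exact graph is not specified in the abstract, and the helper names (`gft_basis`, `gft`, `igft`) are hypothetical. The point it illustrates is the one the abstract makes: the GFT coefficients carry the complete (complex) representation of the frame, so reconstruction does not require substituting the noisy signal's phase.

```python
import numpy as np

def gft_basis(n):
    """Build analysis/synthesis GFT matrices for a length-n frame.

    Assumes a directed cycle graph over the frame's samples; for this
    graph the GFT basis is the eigenvector matrix of the adjacency,
    which coincides with the DFT basis.
    """
    # Adjacency of the directed cycle: A[i, (i + 1) % n] = 1.
    A = np.roll(np.eye(n), 1, axis=1)
    # Eigenvectors of the adjacency matrix form the GFT basis.
    _, V = np.linalg.eig(A)
    return np.linalg.inv(V), V  # forward (analysis) and inverse (synthesis)

def gft(frame, F):
    # Complex graph-spectral coefficients: magnitude AND phase retained.
    return F @ frame

def igft(coeffs, V):
    # Exact reconstruction; no noisy-phase substitution is needed.
    return np.real(V @ coeffs)

# Round-trip check on a toy frame.
F, V = gft_basis(8)
x = np.random.randn(8)
assert np.allclose(igft(gft(x, F), V), x)
```

In an enhancement system along the lines the abstract describes, the GFT coefficients of each frame would replace the STFT magnitude spectrum as the feature fed to the NMF or FCNN stage, and the enhanced coefficients would be inverted with `igft` directly.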