Research on Speech Emotion Recognition Based on an Improved Transformer
Abstract:
With the rapid development of artificial intelligence, speech emotion recognition has become a key technology in human-computer interaction, important both for improving the interaction experience and for understanding user intent. Traditional speech emotion recognition methods are limited in feature extraction and in model generalization. The Transformer model, whose self-attention mechanism effectively captures dependencies in long sequences, has achieved remarkable results in fields such as natural language processing and offers a new approach to speech emotion recognition. This paper studies speech emotion recognition with an improved Transformer model that combines the strength of LSTM in handling long-term dependencies with the Transformer's ability to capture global dependencies and process sequences in parallel. Experimental results show that this improvement raises accuracy over that of a single model.