Speech Emotion Recognition Based on CNN-Transformer with Different Loss Function

DOI: 10.4236/jcc.2025.133008, pp. 103-115

Keywords: Speech Emotion Recognition, CNN-Transformer, Mel-Spectrogram, Multi-Head Self-Attention, Loss Function


Abstract:

Speech Emotion Recognition (SER) is crucial for enhancing human-computer interaction by enabling machines to understand and respond appropriately to human emotions. However, accurately recognizing emotions from speech is challenging due to variations across speakers, languages, and environmental contexts. This study introduces a novel SER framework that integrates Convolutional Neural Networks (CNNs) for effective local feature extraction from Mel-Spectrograms with Transformer networks that employ multi-head self-attention to capture long-range temporal dependencies in speech signals. The paper also investigates the impact of four loss functions (L1 Loss, Smooth L1 Loss, Binary Cross-Entropy Loss, and Cross-Entropy Loss) on the model's accuracy and generalization. Experiments on a combined dataset formed from RAVDESS, SAVEE, and TESS demonstrate that the CNN-Transformer model trained with Cross-Entropy Loss achieves the highest accuracy, outperforming the other configurations. These findings highlight the importance of selecting an appropriate loss function to improve the robustness and effectiveness of speech emotion recognition systems.
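As a concrete illustration of the pipeline the abstract describes, the sketch below builds a small CNN-Transformer classifier in PyTorch. This is a minimal example under stated assumptions, not the paper's exact implementation: the two conv/pool blocks, channel widths, model dimension, and mean pooling over time are illustrative choices, and the input is assumed to be a batch of log-Mel spectrograms shaped (batch, 1, n_mels, time).

    import torch
    import torch.nn as nn

    class CNNTransformerSER(nn.Module):
        # A CNN front-end extracts local time-frequency features from the
        # Mel-spectrogram; a Transformer encoder with multi-head self-attention
        # then models long-range temporal dependencies across frames.
        def __init__(self, n_mels=128, n_classes=7, d_model=256, n_heads=4, n_layers=2):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            # Each downsampled time step becomes one token of dimension d_model.
            self.proj = nn.Linear(64 * (n_mels // 4), d_model)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.classifier = nn.Linear(d_model, n_classes)

        def forward(self, x):                        # x: (batch, 1, n_mels, time)
            f = self.cnn(x)                          # (batch, 64, n_mels//4, time//4)
            f = f.permute(0, 3, 1, 2).flatten(2)     # (batch, time//4, 64 * n_mels//4)
            h = self.encoder(self.proj(f))           # self-attention over time steps
            return self.classifier(h.mean(dim=1))    # average-pool frames, then classify

Given logits from such a model, the four losses the paper compares can be evaluated on the same batch. Cross-Entropy consumes integer class labels directly; the other three are element-wise losses, so one plausible setup (again an assumption, since the paper's exact target encoding is not reproduced here) feeds them one-hot targets:

    n_classes = 7
    model = CNNTransformerSER(n_classes=n_classes)
    mels = torch.randn(8, 1, 128, 200)               # dummy batch of spectrograms
    labels = torch.randint(0, n_classes, (8,))

    logits = model(mels)                             # (batch, n_classes)
    one_hot = nn.functional.one_hot(labels, num_classes=n_classes).float()

    ce  = nn.CrossEntropyLoss()(logits, labels)      # class indices
    bce = nn.BCEWithLogitsLoss()(logits, one_hot)    # per-class sigmoid targets
    l1  = nn.L1Loss()(logits.softmax(dim=-1), one_hot)
    sl1 = nn.SmoothL1Loss()(logits.softmax(dim=-1), one_hot)

In this framing, Cross-Entropy is the natural fit for single-label emotion classification, consistent with the abstract's finding that it yields the best accuracy.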

