Speech Emotion Recognition (SER) is crucial for enhancing human-computer interaction by enabling machines to understand and respond appropriately to human emotions. However, accurately recognizing emotions from speech is challenging due to variation across speakers, languages, and acoustic environments. This study introduces an SER framework that integrates Convolutional Neural Networks (CNNs), which extract local features from Mel-spectrograms, with transformer networks that use multi-head self-attention to capture long-range temporal dependencies in the speech signal. The study also investigates how four loss functions (L1 Loss, Smooth L1 Loss, Binary Cross-Entropy Loss, and Cross-Entropy Loss) affect the model's accuracy and generalization. Experiments on a combined dataset drawn from RAVDESS, SAVEE, and TESS show that the CNN-Transformer model trained with Cross-Entropy Loss achieves the highest accuracy, outperforming all other configurations. These findings underscore the importance of loss-function selection for building robust and effective SER systems.
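Because the architecture is described here only at a high level, the following is a minimal PyTorch sketch of the pipeline the abstract outlines: a small CNN front end extracts local features from a Mel-spectrogram, a transformer encoder with multi-head self-attention models temporal dependencies across frames, and the network is trained with Cross-Entropy Loss, the configuration reported to perform best. All layer widths, kernel sizes, head counts, and the eight-class output are illustrative assumptions, not the paper's exact settings.

# Minimal sketch of a CNN-Transformer SER model. All hyperparameters
# (channel counts, d_model, heads, layers, n_classes) are assumptions,
# not the configuration used in the paper.
import torch
import torch.nn as nn

class CNNTransformerSER(nn.Module):
    def __init__(self, n_mels=64, n_classes=8, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        # CNN front end: local time-frequency feature extraction from the
        # Mel-spectrogram (input shape: batch x 1 x n_mels x frames).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),  # halves both frequency and time axes
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Project the flattened frequency axis to the transformer width.
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)
        # Transformer encoder: multi-head self-attention over time frames
        # captures long-range temporal dependencies.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, mel):                    # mel: (B, 1, n_mels, T)
        x = self.cnn(mel)                      # (B, 64, n_mels//4, T//4)
        x = x.permute(0, 3, 1, 2).flatten(2)   # (B, T//4, 64 * n_mels//4)
        x = self.encoder(self.proj(x))         # (B, T//4, d_model)
        x = x.mean(dim=1)                      # average-pool over time
        return self.classifier(x)              # emotion logits

model = CNNTransformerSER()
criterion = nn.CrossEntropyLoss()              # the best-performing loss reported
logits = model(torch.randn(4, 1, 64, 200))     # dummy batch of Mel-spectrograms
loss = criterion(logits, torch.randint(0, 8, (4,)))

At the level of this sketch, the loss-function comparison amounts to swapping criterion, e.g. nn.BCEWithLogitsLoss, nn.L1Loss, or nn.SmoothL1Loss evaluated against one-hot targets; how the paper itself applies the regression-style losses to class targets is not specified here.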