%0 Journal Article
%T Speech Emotion Recognition Based on CNN-Transformer with Different Loss Function
%A Bin Li
%J Journal of Computer and Communications
%P 103-115
%@ 2327-5227
%D 2025
%I Scientific Research Publishing
%R 10.4236/jcc.2025.133008
%X Speech Emotion Recognition (SER) is crucial for enhancing human-computer interaction by enabling machines to understand and respond appropriately to human emotions. However, accurately recognizing emotions from speech is challenging due to variations across speakers, languages, and environmental contexts. This study introduces a novel SER framework that integrates Convolutional Neural Networks (CNNs) for effective local feature extraction from Mel-Spectrograms with Transformer networks employing multi-head self-attention mechanisms to capture long-range temporal dependencies in speech signals. Additionally, the paper investigates the impact of various loss functions (L1 Loss, Smooth L1 Loss, Binary Cross-Entropy Loss, and Cross-Entropy Loss) on the accuracy and generalization performance of the model. Experiments conducted on a combined dataset formed from RAVDESS, SAVEE, and TESS demonstrate that the CNN-Transformer model with Cross-Entropy Loss achieves superior accuracy, outperforming the other configurations. These findings highlight the importance of appropriately selecting loss functions to enhance robustness and effectiveness in speech emotion recognition systems.
%K Speech Emotion Recognition
%K CNN-Transformer
%K Mel-Spectrogram
%K Multi-Head Self-Attention
%K Loss Function
%U http://www.scirp.org/journal/PaperInformation.aspx?PaperID=141585