Audio Scene Classification Based on a Lightweight Convolutional Neural Network
Abstract:
To improve the feature-extraction ability and performance of low-complexity neural networks for audio scene recognition, this paper investigates an audio scene classification method built on convolutional neural networks (CNNs). A separate attention mapping layer is added to the traditional CNN structure and improved, two attention mechanisms suitable for lightweight convolutional networks are improved and compared, and depthwise separable convolutions are adopted in some convolutional layers to reduce the overall parameter count. The original convolutions are further replaced with low-cost grouped strip convolutions, and the overall convolutional design follows a time-frequency separation approach, yielding the proposed SFAC (Sequence Frequency Attention CNN) network model. SFAC is compared against several VGG-based baseline convolutional networks on multi-class audio scene datasets (TAU Urban Acoustic Scenes and UrbanSound8K). The results show that the proposed network achieves higher accuracy than the baseline models while maintaining low complexity.
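The parameter savings that motivate the design choices above can be illustrated with simple weight counts. The sketch below is purely illustrative and uses hypothetical layer sizes; the actual SFAC layer dimensions are not specified in the abstract. It compares a standard k×k convolution, a depthwise separable convolution, and a grouped time-frequency "strip" pair (a 1×k convolution along time followed by a k×1 convolution along frequency), biases omitted.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    # A full k x k convolution mixes all input channels for every output channel.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    # Depthwise stage: one k x k filter per input channel.
    # Pointwise stage: a 1 x 1 convolution that mixes channels.
    return k * k * c_in + c_in * c_out

def grouped_strip_params(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    # Time-frequency separated pair: a 1 x k (time) and a k x 1 (frequency)
    # convolution, each with its input channels split into `groups` groups.
    return 2 * k * (c_in // groups) * c_out

# Hypothetical layer: 64 -> 64 channels, 3x3 receptive field.
c_in, c_out, k = 64, 64, 3
print(standard_conv_params(c_in, c_out, k))             # 36864
print(depthwise_separable_params(c_in, c_out, k))       # 4672
print(grouped_strip_params(c_in, c_out, k, groups=4))   # 6144
```

Both lightweight variants cut the weight count by roughly an order of magnitude for this layer size, which is the trade-off the abstract evaluates against classification accuracy.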