- 2017
A Low-Resource Speech Recognition Method Using Long Short-Term Memory Networks
Abstract:
To address the sharp drop in recognition accuracy that automatic speech recognition systems suffer in low-resource environments, where transcribed training data are scarce, a low-resource speech recognition method based on long short-term memory networks (the LSTM-LRASR method) is proposed. The method builds the acoustic model with a long short-term memory network and improves low-resource recognition performance in three respects: feature extraction, data augmentation, and model optimization. For feature extraction, language-independent, high-level robust features are extracted to reduce the acoustic model's dependence on training data. For data augmentation, speed perturbation is applied to the existing transcribed data, and untranscribed data are recognized automatically, so that additional transcribed data are obtained without manual labeling. For model optimization, sequence-discriminative training improves the model's ability to distinguish easily confused phonemes, and minimum Bayes-risk decoding is used to combine multiple systems, further improving recognition performance. Experiments on the OpenKWS16 evaluation data show that the word error rate of the low-resource speech recognition system built with the LSTM-LRASR method is 29.9% lower than that of the baseline system, and that the actual term-weighted value over all query terms improves by 60.3%.
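As an illustration of the data augmentation step, the following is a minimal sketch of speed perturbation by resampling. It is not the authors' implementation: the function names `speed_perturb` and `augment`, the use of linear interpolation, and the factors 0.9, 1.0, and 1.1 are all assumptions made for this example, since the abstract does not state them.

```python
def speed_perturb(samples, factor):
    """Return `samples` resampled so that it plays `factor` times faster.

    Hypothetical helper: uses linear interpolation, so both the duration
    and the pitch of the waveform change.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(samples)
    n_out = int(round(n_in / factor))
    out = []
    for i in range(n_out):
        # Position in the input that output sample i reads from,
        # clamped so it never runs past the last input sample.
        pos = min(i * factor, n_in - 1)
        left = int(pos)
        right = min(left + 1, n_in - 1)
        frac = pos - left
        out.append((1.0 - frac) * samples[left] + frac * samples[right])
    return out


def augment(samples, factors=(0.9, 1.0, 1.1)):
    """Create one copy of the waveform per speed factor (assumed values)."""
    return {factor: speed_perturb(samples, factor) for factor in factors}
```

Each perturbed copy keeps the transcript of the original utterance, so under these assumed factors the amount of transcribed training audio is roughly tripled without any new labeling.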