- 2018
Distant speech recognition based on attention LSTM and multitask learning
Abstract: Distant speech recognition remains a challenging task owing to background noise, reverberation, and competing acoustic sources. This work describes a long short-term memory (LSTM) acoustic model with an attention mechanism and a multitask learning architecture for distant speech recognition. The attention mechanism is embedded in the acoustic model so that it automatically learns to tune its attention over the spliced context input, which significantly improves the model's ability to model distant speech. To further improve robustness, a multitask learning architecture is introduced in which the model jointly predicts the acoustic states and the clean features. Evaluations on the AMI meeting corpus show that the proposed model reduces the word error rate (WER) by 1.5% absolute over the baseline model.
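The abstract names two ideas: attention weights learned over a spliced window of context frames, and a multitask objective that combines senone classification with clean-feature reconstruction. The sketch below illustrates both in isolation with NumPy; it is a minimal illustration under assumptions, not the authors' model. The scoring vector, the function names, and the interpolation weight `lam` are all hypothetical, since the abstract gives no such details.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend_context(frames, score_w):
    """frames: (T, D) spliced window of context feature frames;
    score_w: (D,) hypothetical scoring vector.
    Returns the attention-weighted summary of the window, shape (D,)."""
    alpha = softmax(frames @ score_w)  # one attention weight per frame
    return alpha @ frames              # weighted sum over the window

def multitask_loss(senone_logits, senone_target, feat_pred, feat_clean,
                   lam=0.1):
    """Joint objective: cross-entropy on the senone posterior plus
    lam * MSE between predicted and clean features (lam is assumed)."""
    p = softmax(senone_logits)
    ce = -np.log(p[senone_target])
    mse = np.mean((feat_pred - feat_clean) ** 2)
    return ce + lam * mse

rng = np.random.default_rng(0)
window = rng.standard_normal((11, 40))   # e.g. +/-5 frames of 40-dim features
summary = attend_context(window, rng.standard_normal(40))
loss = multitask_loss(rng.standard_normal(100), 3,
                      rng.standard_normal(40), rng.standard_normal(40))
print(summary.shape)   # (40,) -- one summary vector for the window
print(loss > 0)        # True
```

In a full acoustic model the attention-weighted summary would feed the LSTM stack, and both task heads would share its hidden layers so the reconstruction task regularizes the classifier.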