
FPGA Implementation of a Pipelined Gaussian Calculation for HMM-Based Large Vocabulary Speech Recognition

DOI: 10.1155/2011/697080


Abstract:

A scalable, large vocabulary, speaker-independent speech recognition system is being developed. It uses Hidden Markov Models (HMMs) for acoustic modeling and a Weighted Finite State Transducer (WFST) to compile sentence, word, and phoneme models. The system comprises a software backend search and an FPGA-based Gaussian calculation, which are covered here. In this paper, we present an efficient pipelined design implemented both as an embedded peripheral and as a scalable, parallel hardware accelerator. Both architectures have been implemented on an Alpha Data XRC-5T1 reconfigurable computer housing a Virtex-5 SX95T FPGA. The core has been tested and can calculate a full set of Gaussian results from 3825 acoustic models in 9.03 ms. Coupled with a 5000-word backend search, this has provided an accuracy of over 80%. Parallel implementations with up to 32 cores have been designed and successfully implemented at a clock frequency of 133 MHz.

1. Introduction

Automated Speech Recognition (ASR) systems can revolutionize the way that we interact with technology. Large vocabulary, speaker-independent systems have potential in all forms of computing, from handheld mobile devices to personal computing and even large-scale data centres. A low-power, real-time embedded system could dramatically change our daily interactions with digital mobile technology [1]. A faster-than-real-time, multi-stream batch decoder could be used in server applications for distributed systems [2] or for data mining [3, 4]. A range of open-source software ASR systems is available [5, 6]. These tools employ Hidden Markov Models and Viterbi decoding to provide a speech decoder that can be configured for a variety of implementations.
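The division of labour described above can be illustrated with a minimal Python sketch. One stage scores every acoustic model against each feature frame, which is the FPGA's role here. A second stage searches over those scores, which is the software backend's role.

The sketch assumes diagonal-covariance Gaussian-mixture states. That is the usual form in HMM-based ASR, but this excerpt does not spell it out. The sketch also substitutes a plain Viterbi search over one small left-to-right HMM for the paper's WFST-based backend. All function names (`make_component`, `gaussian_log_likelihood`, `state_log_likelihood`, `viterbi`) and the toy model parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

LOG_2PI = math.log(2.0 * math.pi)

# Assumption: each HMM state is a mixture of diagonal-covariance Gaussians.
# A component is (log mixture weight, mean, inverse variance, gconst).
Component = Tuple[float, List[float], List[float], float]


def make_component(weight: float, mean: List[float], var: List[float]) -> Component:
    """Precompute everything that does not depend on the input frame."""
    gconst = len(var) * LOG_2PI + sum(math.log(v) for v in var)
    return (math.log(weight), mean, [1.0 / v for v in var], gconst)


def gaussian_log_likelihood(x: Sequence[float], mean: Sequence[float],
                            inv_var: Sequence[float], gconst: float) -> float:
    """Log-density of a diagonal-covariance Gaussian at feature vector x.

    Per dimension: one subtraction and two multiplications feeding an
    accumulator, a loop that maps naturally onto a hardware pipeline.
    """
    acc = 0.0
    for xd, md, ivd in zip(x, mean, inv_var):
        diff = xd - md
        acc += diff * diff * ivd
    return -0.5 * (gconst + acc)


def state_log_likelihood(x: Sequence[float], components: List[Component]) -> float:
    """Log-likelihood of a Gaussian-mixture state, combined with log-sum-exp."""
    scores = [lw + gaussian_log_likelihood(x, m, iv, g) for lw, m, iv, g in components]
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))


def viterbi(frame_scores: List[List[float]], log_trans: List[List[float]],
            log_init: List[float]) -> Tuple[float, List[int]]:
    """Best state path given frame_scores[t][j] = log b_j(o_t).

    Stand-in for the paper's WFST backend search, not a model of it.
    """
    n = len(log_init)
    delta = [log_init[j] + frame_scores[0][j] for j in range(n)]
    backptrs: List[List[int]] = []
    for scores in frame_scores[1:]:
        new_delta, ptrs = [], []
        for j in range(n):
            # Best predecessor of state j under the current partial scores.
            best_i = max(range(n), key=lambda i: delta[i] + log_trans[i][j])
            new_delta.append(delta[best_i] + log_trans[best_i][j] + scores[j])
            ptrs.append(best_i)
        backptrs.append(ptrs)
        delta = new_delta
    state = max(range(n), key=lambda j: delta[j])
    best_score = delta[state]
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    path.reverse()
    return best_score, path


# Toy example: a two-state left-to-right HMM over 1-D features.
NEG_INF = float("-inf")
states = [
    [make_component(1.0, [0.0], [1.0])],  # state 0 centred on 0
    [make_component(1.0, [5.0], [1.0])],  # state 1 centred on 5
]
frames = [[0.1], [0.2], [4.9], [5.1]]
# Stage 1: score every state for every frame (the accelerator's task).
frame_scores = [[state_log_likelihood(x, s) for s in states] for x in frames]
# Stage 2: search over those scores (the software backend's task).
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
score, path = viterbi(frame_scores, log_trans, [0.0, NEG_INF])
print(path)  # [0, 0, 1, 1]
```

Precomputing the frame-independent constant (`gconst`) and the inverse variances leaves only a subtraction and two multiplications per feature dimension. This is why the scoring stage suits a deeply pipelined datapath. Whether the paper's core uses exactly this formulation is not stated in this excerpt.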
Over the last 5 years, however, research into high-performance ASR has focused more on hardware implementations. Many FPGA-based speech recognition systems have been built, although they have generally been limited to small vocabularies [7, 8] or have relied on custom hardware to provide the resources required for a large vocabulary system [9]. Pairing a softcore processor with a custom IP peripheral is a popular approach that has been proposed in a number of papers [8, 10], but a system operating on large vocabularies in real time has yet to be demonstrated. This is partly due to the low operating frequencies of softcore processors. Another problem is interfacing with off-chip, high-capacity RAM, which can introduce large delays that cripple a high-bandwidth system like speech

References

[1]  O. Viikki, I. Kiss, and J. Tian, “Speaker- and language-independent speech recognition in mobile communication systems,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 5–8, May 2001.
[2]  A. Bernard and A. Alwan, “Low-bitrate distributed speech recognition for packet-based and wireless communication,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 8, pp. 570–579, 2002.
[3]  N. Leavitt, “Let’s hear it for audio mining,” Computer, vol. 35, no. 10, pp. 23–25, 2002.
[4]  S. Douglas, D. Agarwal, T. Alonso, et al., “Mining customer care dialogs for ‘daily news’,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 652–660, 2005.
[5]  W. Walker, P. Lamere, P. Kwok, et al., “Sphinx-4: a flexible open source framework for speech recognition,” Sun Microsystems Whitepaper, 2004.
[6]  S. J. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book, Version 3.4, Cambridge University Press, Cambridge, Mass, USA, 2006.
[7]  E. C. Lin, K. Yu, R. A. Rutenbar, and T. Chen, “A 1000-word vocabulary, speaker-independent, continuous live-mode speech recognizer implemented in a single FPGA,” in Proceedings of the 15th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '07), pp. 60–68, February 2007.
[8]  O. Cheng, W. Abdulla, and Z. Salcic, “Hardware-software co-design of automatic speech recognition system for embedded real-time applications,” to appear in IEEE Transactions on Industrial Electronics.
[9]  E. C. Lin and R. A. Rutenbar, “A multi-FPGA 10x-real-time high-speed search engine for a 5000-word vocabulary speech recognizer,” in Proceedings of the 7th ACM SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '09), pp. 83–92, February 2009.
[10]  K. You, H. Lim, and W. Sung, “Architectural design and implementation of an FPGA softcore based speech recognition system,” in Proceedings of the 6th IEEE International Workshop on System on Chip for Real Time Applications (IWSOC '06), pp. 50–55, December 2006.
[11]  M. Mohri, F. Pereira, and M. Riley, “Weighted finite-state transducers in speech recognition,” Computer Speech and Language, vol. 16, no. 1, pp. 69–88, 2002.
[12]  R. Veitch, L.-M. Aubert, R. Woods, and S. Fischaber, “Acceleration of HMM-based speech recognition system by parallel FPGA Gaussian calculation,” in Proceedings of the 6th Southern Conference on Programmable Logic, 2010.
[13]  L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[14]  J. Chong, Y. Yi, A. Faria, N. Satish, and K. Keutzer, “Data-parallel large vocabulary continuous speech recognition on graphics processors,” in Proceedings of the 1st Annual Workshop on Emerging Applications and Many Core Architectures, June 2008.
[15]  S. Molau, M. Pitz, R. Schlüter, and H. Ney, “Computing mel-frequency cepstral coefficients on the power spectrum,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 73–76, May 2001.
[16]  B. Milner and X. Shao, “Clean speech reconstruction from MFCC vectors and fundamental frequency using an integrated front-end,” Speech Communication, vol. 48, no. 6, pp. 697–715, 2006.
[17]  E. Bocchieri and D. Blewett, “A decoder for LVCSR based on fixed-point arithmetic,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), pp. 1113–1116, May 2006.
[18]  Xilinx, “Virtex-5 Family Overview,” http://www.xilinx.com/support/documentation/data_sheets/ds100.pdf.
