|
- 2017
Optimization of the language model unit set based on a hierarchical structure
|
Abstract:
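For large-vocabulary speech recognition systems, choosing an appropriate basic unit is crucial. Using the word as the basic unit avoids complex steps such as determining word boundaries, but in many derivational structures (e.g., agglutinative languages) words are rather long, and many scripts (e.g., Chinese and Japanese) do not mark word boundaries, so natural language processing applications have no fixed pattern for selecting the basic unit set. Taking a Uyghur large-vocabulary speech recognition system as an example, this paper studies speech recognition systems built on units of each hierarchical granularity. Speech recognition results based on the various hierarchical unit sets are compared and the misrecognition patterns are analyzed; the misrecognized unit sequences are collected as a training sample library for choosing the better alternative within a two-layer unit sequence structure. The strengths and weaknesses of the unit sets are compared, and a method is proposed that balances the advantages of long-unit and short-unit sets. Experimental results show that the method not only effectively improves speech recognition accuracy but also greatly reduces the lexicon size.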
Abstract: Selecting an appropriate lexicon unit set is an important first step in developing large vocabulary continuous speech recognition (LVCSR) systems. Using the word as the lexicon unit avoids word boundary detection problems. However, the choice is less straightforward for languages with derivative morphological structure (e.g., agglutinative languages). Furthermore, many languages, such as Chinese and Japanese, have no written word boundaries. This paper uses a Uyghur LVCSR system to analyze automatic speech recognition (ASR) systems built on lexicon units of various granularities. The ASR results for the different linguistic layers are compared in order to develop a method that balances the advantages of two lexicon layers. The ASR results for the two layers are aligned and compared to analyze error patterns and to extract samples as training data for a method that selects between the alternatives. Tests show that this method effectively improves ASR accuracy while using a small lexicon.