2015

Chinese word segmentation method for short Chinese text based on conditional random fields

Keywords: Chinese word segmentation, conditional random field (CRF), machine learning


Abstract:

Chinese word segmentation is a prerequisite task for information retrieval. With the arrival of the big-data era, information retrieval places increasingly high demands on the precision and recall of Chinese word segmentation. This paper presents a word segmentation method for short Chinese texts. The method first uses a conditional random field (CRF) model to produce a preliminary segmentation of the short text, and then uses a traditional dictionary-based segmentation method to correct that preliminary result. To suit the characteristics of short Chinese texts, the CRF tag set and feature templates are tuned accordingly. Test results show that the method alleviates the loss of precision and recall that traditional dictionary-based segmentation suffers because of out-of-vocabulary words and overlapping ambiguities, and that it achieves F-scores above 0.95 on all four test corpora of the Sighan Bakeoff 2005. The experiments indicate that the method is well suited to short-text Chinese word segmentation for information retrieval.
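The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the tag set, the feature templates, or the dictionary correction rules. The sketch assumes the common four-tag B/M/E/S scheme, takes the CRF output as an already predicted tag list, and uses a hypothetical greedy merge rule (`merge_with_dictionary`, with an assumed `max_len` limit) as a stand-in for the dictionary-based correction. The last function computes word-level precision, recall, and F-score by comparing word spans.

```python
from typing import List, Set, Tuple


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Turn per-character B/M/E/S tags (assumed tag set) into words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") and current:
            # A new word starts, so flush any unfinished one.
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


def merge_with_dictionary(words: List[str], dictionary: Set[str],
                          max_len: int = 4) -> List[str]:
    """Hypothetical correction pass: greedily merge adjacent segments
    whose concatenation is a dictionary word of at most max_len chars."""
    merged, i = [], 0
    while i < len(words):
        best_end, candidate = i + 1, words[i]
        for j in range(i + 1, len(words)):
            candidate += words[j]
            if len(candidate) > max_len:
                break
            if candidate in dictionary:
                best_end = j + 1  # keep the longest dictionary match
        merged.append("".join(words[i:best_end]))
        i = best_end
    return merged


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmentation to the set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def precision_recall_f(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F-score of pred against gold."""
    g, p = word_spans(gold), word_spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    text = "我爱北京天安门"
    predicted_tags = ["S", "S", "B", "E", "B", "E", "S"]  # stand-in for CRF output
    preliminary = tags_to_words(text, predicted_tags)
    corrected = merge_with_dictionary(preliminary, {"北京", "天安门"})
    gold = ["我", "爱", "北京", "天安门"]
    print(preliminary, precision_recall_f(gold, preliminary))
    print(corrected, precision_recall_f(gold, corrected))
```

In this toy case the preliminary tagging splits 天安门 into two segments and the dictionary pass restores it, which is the kind of correction the second stage is meant to provide. A real system would obtain the tags from a trained CRF rather than a hard-coded list.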

