%0 Journal Article
%T Information Retrieval Oriented Adaptive Chinese Word Segmentation System
%A CAO Yong-Gang
%A CAO Yu-Zhong
%A JIN Mao-Zhong
%A LIU Chao
%J Journal of Software
%D 2006
%X New word recognition and ambiguity resolution have a vital effect on information retrieval precision. This paper presents a statistical-model-based algorithm for adaptive Chinese word segmentation. A new word segmentation system called BUAASEISEG is then designed and implemented using this algorithm. BUAASEISEG can recognize new words in various domains, resolve ambiguities, and segment words of arbitrary length. It uses an iterative bigram method for word segmentation. Candidate word selection and disambiguation are performed through online statistical analysis of the target article, combined with an offline word frequency dictionary or the inverted index of the search engine. On top of the statistical methods, post-processing with a stopword list, a quantity suffix word list, and a surname list is applied to further improve precision. A comparative evaluation against the well-known Chinese word segmentation system ICTCLAS, using news and papers as test text, shows that BUAASEISEG outperforms ICTCLAS in new word recognition and disambiguation.
%K word segmentation system
%K word segmentation algorithm
%K information retrieval
%K new word recognition
%K disambiguation
%U http://www.alljournals.cn/get_abstract_url.aspx?pcid=5B3AB970F71A803DEACDC0559115BFCF0A068CD97DD29835&cid=8240383F08CE46C8B05036380D75B607&jid=7735F413D429542E610B3D6AC0D5EC59&aid=886AE514DC464103&yid=37904DC365DD7266&vid=BCA2697F357F2001&iid=38B194292C032A66&sid=35E8A259891FB32F&eid=8C27CCA578E52082&journal_id=1000-9825&journal_name=软件学报&referenced_num=12&reference_num=14