%0 Journal Article
%T Chinese word segmentation method for short Chinese text based on conditional random fields
%A 刘泽文
%A 丁冬
%A 李春文
%J 清华大学学报(自然科学版)
%D 2015
%X Chinese word segmentation is a prerequisite for information retrieval. With the arrival of the big data era, information retrieval demands ever higher precision and recall from word segmentation. This paper presents a word segmentation method for short Chinese texts. The method first uses a conditional random field (CRF) model to produce a preliminary segmentation of the short text, then corrects the preliminary result with a traditional dictionary-based segmentation method. To suit the characteristics of short Chinese texts, the CRF tag selection and feature templates are optimized accordingly. Tests show that the method mitigates the loss of precision and recall that traditional dictionary-based segmentation suffers because of out-of-vocabulary words and overlapping ambiguities, and that it achieves F-scores above 0.95 on all four test corpora of the SIGHAN Bakeoff 2005. The experiments indicate that the method is well suited to short-text Chinese word segmentation for information retrieval.
%K Chinese word segmentation
%K conditional random field (CRF)
%K machine learning
%U http://jst.tsinghuajournals.com/CN/Y2015/V55/I8/906#FigureTableTab