%0 Journal Article %T λ-active learning based microblog-oriented Chinese word segmentation %A 张婧 %A 黄德根 %A 黄锴宇 %A 刘壮 %A 孟祥主 %J 清华大学学报(自然科学版) %D 2018 %R 10.16511/j.cnki.qhdxxb.2018.26.011 %X Manually segmented corpora for Chinese microblogs are relatively scarce, so both conventional Chinese word segmentation (CWS) systems and deep learning based CWS systems perform poorly on microblog data. To address this problem, this paper presents an active learning method that selects the samples with the highest annotation value from a large unlabelled microblog corpus. Because repeated samples often occur in microblog data, a parameter λ is introduced into the active learning iterations to control how many repeated samples are selected, which keeps the selected samples diverse. Three strategies (Max, Avg and AvgMax) measure the overall annotation value of each sample, based on the uncertainty of its character-level labelling results and the diversity of its contexts. In addition to the context of the current character, the initial segmenter used for active learning takes as a feature the likelihood that the current character is a stop character, which is computed automatically from character embeddings. Experimental results show that the method improves the F-score over the baseline system by 0.84% to 1.49%, a larger gain than that of the state-of-the-art active learning method based on word boundary annotation (WBA). %K word information processing %K Chinese word segmentation %K active learning %K diversity of samples %K microblog-oriented data %U http://jst.tsinghuajournals.com/CN/Y2018/V58/I3/260