- 2018
Chinese microblog word segmentation based on a λ-active learning method
Abstract:
Manually segmented corpora for Chinese microblogs are scarce, so Chinese word segmentation (CWS) systems, whether built on conventional methods or on deep learning, perform poorly on microblog text. To address this, this paper proposes an active learning method that selects the samples with the highest annotation value from a large unlabelled microblog corpus. Because microblog data contain many repeated samples, a parameter λ is introduced into the active learning iterations to control how many repeated samples are selected, which keeps the selected samples diverse. Three strategies (Max, Avg and AvgMax) measure the overall annotation value of a sample from the uncertainty of its character-level tagging results and the diversity of its contexts. In addition to the context of the current character, the initial segmenter used for active learning takes as a feature the likelihood that the current character is a stop character, computed automatically from character embeddings. Experiments show that the method improves the F-score over the baseline system by 0.84% to 1.49%, a larger gain than that of the current best active learning method, which is based on word boundary annotation (WBA).
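The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything concrete here is an assumption: character-level uncertainty is taken to be the entropy of a tag distribution, AvgMax is taken to be the mean of the Max and Avg scores, "repeated samples" are taken to be exact text duplicates, and λ is read as a cap on how many copies of one text may be selected. The context-diversity term mentioned in the abstract is left out. The names `char_entropy`, `sample_score` and `select_batch` are hypothetical.

```python
import math
from collections import Counter


def char_entropy(tag_probs):
    """Shannon entropy of one character's tag distribution (assumed uncertainty measure)."""
    return -sum(p * math.log(p) for p in tag_probs if p > 0)


def sample_score(char_tag_probs, strategy="Avg"):
    """Aggregate per-character uncertainty into one score for the whole sample."""
    scores = [char_entropy(probs) for probs in char_tag_probs]
    max_score = max(scores)
    avg_score = sum(scores) / len(scores)
    if strategy == "Max":
        return max_score
    if strategy == "Avg":
        return avg_score
    if strategy == "AvgMax":
        # ASSUMPTION: the abstract does not define AvgMax; the mean of Max and Avg is a guess.
        return 0.5 * (max_score + avg_score)
    raise ValueError(f"unknown strategy: {strategy}")


def select_batch(pool, batch_size, lam, strategy="Avg"):
    """Pick the highest-scoring samples, keeping at most `lam` copies of any one text.

    pool: list of (text, char_tag_probs) pairs, one tag distribution per character.
    lam:  ASSUMED reading of the paper's lambda, a cap on repeated samples.
    """
    ranked = sorted(pool, key=lambda item: sample_score(item[1], strategy), reverse=True)
    seen = Counter()
    batch = []
    for text, _ in ranked:
        if seen[text] >= lam:
            continue  # skip further duplicates so the batch stays diverse
        seen[text] += 1
        batch.append(text)
        if len(batch) == batch_size:
            break
    return batch


# Toy pool: three identical highly uncertain samples and two distinct ones.
uniform = [[0.5, 0.5], [0.5, 0.5]]
pool = [
    ("aa", uniform),
    ("aa", uniform),
    ("aa", uniform),
    ("bb", [[0.9, 0.1], [0.6, 0.4]]),
    ("cc", [[1.0, 0.0], [1.0, 0.0]]),
]
batch_lam1 = select_batch(pool, batch_size=3, lam=1)
batch_lam2 = select_batch(pool, batch_size=3, lam=2)
```

With the cap set to 1 the duplicated sample is chosen once and the remaining slots go to other texts; with a cap of 2 it takes two of the three slots. This only shows how such a cap trades raw uncertainty for diversity, not how the paper tunes λ.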