- 2015
Entity Set Expansion Based on LDA and Label Propagation
Abstract:
Entity set expansion takes a few example entities of a category as seeds and expands them into a more complete set of entities belonging to that category. Traditional approaches rely mainly on co-occurrence between entities and expand the set iteratively according to their similarity. Such iterative bootstrapping can be applied with only a small amount of supervision but scales poorly to very large corpora, and it suffers from a well-known problem called semantic drift: as bootstrapping proceeds, unreliable patterns are likely to produce false extractions, which lowers precision. To address this issue, a hybrid method is proposed that first uses the latent Dirichlet allocation (LDA) topic model to obtain the semantic information of the seed set and then performs entity set expansion through label propagation. Considering the semantic information carried by an entity list as a whole avoids the ambiguity that a single word may introduce, and using LDA to mine the topics of the contexts in which entity lists occur enriches the semantic information available during expansion and resolves the semantic drift problem. Experiments on real-world datasets give good results, and the evaluation demonstrates the effectiveness, efficiency, and scalability of the proposed method.
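The propagation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each candidate entity already has a topic distribution (for example, inferred by an LDA model from the contexts of the lists it appears in), and it uses a simple clamped label propagation over a Gaussian-kernel similarity graph. The entity names, topic vectors, the `sigma` bandwidth, and the use of one negative seed are all hypothetical choices made for illustration.

```python
import math


def similarity(p, q, sigma=0.3):
    """Gaussian kernel on the squared Euclidean distance between two
    topic distributions (sigma is an assumed bandwidth)."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-dist_sq / sigma ** 2)


def label_propagation(topics, seeds, iterations=100):
    """Propagate seed labels over a fully connected similarity graph.

    topics: dict mapping entity -> topic distribution (assumed given)
    seeds:  dict mapping entity -> 1.0 (in the class) or 0.0 (not in it)
    Returns a dict mapping entity -> class-membership score in [0, 1].
    """
    names = list(topics)
    # Row-normalised edge weights from each entity to every other entity.
    trans = {}
    for a in names:
        weights = {b: similarity(topics[a], topics[b]) for b in names if b != a}
        total = sum(weights.values())
        trans[a] = {b: w / total for b, w in weights.items()}

    # Unlabelled entities start at a neutral score.
    scores = {n: seeds.get(n, 0.5) for n in names}
    for _ in range(iterations):
        updated = {}
        for a in names:
            if a in seeds:
                updated[a] = seeds[a]  # clamp seeds to their known label
            else:
                updated[a] = sum(w * scores[b] for b, w in trans[a].items())
        scores = updated
    return scores


# Hypothetical topic distributions over three topics.
topics = {
    "banana": [0.90, 0.05, 0.05],
    "orange": [0.80, 0.10, 0.10],
    "microsoft": [0.05, 0.90, 0.05],
    "pear": [0.85, 0.10, 0.05],
    "google": [0.10, 0.85, 0.05],
}
# Two positive seeds for the target class and one assumed negative seed.
seeds = {"banana": 1.0, "orange": 1.0, "microsoft": 0.0}

scores = label_propagation(topics, seeds)
# Rank the non-seed candidates by their propagated score.
ranked = sorted((n for n in topics if n not in seeds),
                key=lambda n: scores[n], reverse=True)
print(ranked)
```

Because the scores are driven by topic distributions rather than by raw co-occurrence with a single seed, a candidate whose contexts share the seeds' topics ranks above one that merely resembles a seed on the surface, which is the intuition behind using topic information to limit semantic drift.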