Information Extraction for Knowledge Graphs
Abstract:
With the advent of the big data era, massive volumes of data continue to emerge, and the demand to find useful information in them and extract the corresponding knowledge has grown accordingly. Knowledge graph technology arose in response to this demand and plays an increasingly important role in interlinking knowledge. Information extraction, the foundational technology for constructing knowledge graphs, obtains structured named entities together with their attributes and relationships from large-scale data. Its diverse implementation methods have also broadened the application domains and scenarios of information extraction and increased recognition of the value and necessity of research on it. This paper first discusses the significance of information extraction research against the background of the knowledge graph construction framework. It then reviews the history of information extraction from the perspective of three international evaluation conferences: MUC, ACE, and ICDM. Finally, it introduces the key technologies of information extraction for both closed (restricted) domains and open domains, including entity extraction, relation extraction, and attribute extraction.