A Domain Dictionary Extraction Algorithm Based on Mapping Relationships
Abstract:
A domain dictionary is a representation of domain knowledge and an important reference for data normalization and data cleaning. A mapping relationship is the correspondence between two columns of a table. Domain dictionaries are built and extended mainly from Web tables, which requires joining and extending the local mapping relationships found across many Web tables. Because Web tables are heterogeneous and suffer from data quality problems, data integration techniques such as schema matching cannot be relied on alone. This paper proposes a domain dictionary extraction algorithm based on mapping relationships. First, IDF-weighted Jaccard maximum containment and edit distance are used for approximate string matching, and a Gaussian mixture model is used to discretize numerical values, which addresses heterogeneity at the data level. Next, pointwise mutual information and functional dependencies are used to identify candidate tables that contain mapping relationships. Compatibility and mutual exclusion between candidate tables are then defined, and a mapping relationship graph model is built to join the candidate tables, so that the domain dictionary is extracted in the form of mapping relationships. Finally, a conflict resolution step is added to ensure the quality of the dictionary. In the experiments, the algorithm is compared on a real estate data set with other algorithms that acquire domain knowledge from the Web, and the results verify its effectiveness and reliability.
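To make two of the building blocks named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that IDF-weighted containment of a token set A in B is the IDF mass of their shared tokens divided by the IDF mass of A, that "maximum containment" takes the larger of the two directions, and that a column pair counts as a candidate mapping when most rows agree with the majority right-hand value for each left-hand value. The function names, the `log(1 + N/df)` smoothing, and the sample real estate values are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def idf_weights(token_sets):
    """Smoothed IDF per token: log(1 + N / document frequency) (assumed form)."""
    n = len(token_sets)
    df = Counter(tok for s in token_sets for tok in set(s))
    return {tok: math.log(1 + n / count) for tok, count in df.items()}


def weighted_containment(a, b, idf):
    """Share of a's IDF weight that is carried by tokens also present in b."""
    total = sum(idf.get(t, 0.0) for t in a)
    if total == 0.0:
        return 0.0
    return sum(idf.get(t, 0.0) for t in a & b) / total


def max_containment(a, b, idf):
    """Larger of the two directed containment scores (assumed definition)."""
    return max(weighted_containment(a, b, idf), weighted_containment(b, a, idf))


def fd_support(rows):
    """Fraction of (left, right) rows that agree with the majority right
    value of their left value; 1.0 means left -> right holds exactly."""
    if not rows:
        return 0.0
    by_left = defaultdict(Counter)
    for left, right in rows:
        by_left[left][right] += 1
    agreeing = sum(max(counts.values()) for counts in by_left.values())
    return agreeing / len(rows)


# Hypothetical tokenized cell values from Web tables.
cells = [
    {"sunny", "garden", "phase", "1"},
    {"sunny", "garden"},
    {"green", "lake", "villa"},
]
idf = idf_weights(cells)
same_entity_score = max_containment(cells[0], cells[1], idf)
different_entity_score = max_containment(cells[0], cells[2], idf)

# Hypothetical (community, district) rows with one conflicting row.
rows = [("A", "x"), ("A", "x"), ("A", "y"), ("B", "z")]
support = fd_support(rows)
```

Here the shorter name is fully contained in the longer one, so the two cells would be treated as a match, while the single conflicting row lowers the dependency support below 1. A threshold on each score would be needed in practice; the abstract does not state one.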