
OALib Journal
ISSN: 2333-9721



Chinese Address Correction Completion Method Based on LEBERT-CRF and Knowledge Graph

DOI: 10.12677/CSA.2023.134080, PP. 808-818

Keywords: Chinese Address Segmentation, Chinese Address Matching, LEBERT, CRF, Knowledge Graph


Abstract:

To solve address-resolution errors caused by inaccurate manual input of Chinese addresses, this paper first combines the lexicon-enhanced Transformer-based bidirectional encoder model, Lexicon Enhanced Bidirectional Encoder Representations from Transformers (LEBERT), with Conditional Random Fields (CRF) to propose a LEBERT-CRF model. Compared with the BERT-Bidirectional Long Short-Term Memory-CRF (BERT-BiLSTM-CRF) model, its segmentation precision, recall, and F-score improved by 1.45%, 1.89%, and 1.67%, respectively. Next, an address knowledge graph database is constructed from standard hierarchical address data, enriched with address information such as aliases and old names. Finally, working from the segmented address data and the several error types that can occur in it, a matching algorithm based on the address knowledge graph database is designed to match and correct segmented addresses and recover accurate address information. Compared with the Chinese Province City Area mapper (CPCA), address-resolution accuracy on first-level, second-level, and third-level addresses improved by 2.12%, 2.36%, and 1.12%, respectively.

