Research on Chinese Text Proofreading Technology with a Feedback Mechanism
Abstract:
Chinese text proofreading has made considerable progress, but much current work relies on deep learning. As language models grow more complex, training costs rise rapidly, which makes practical deployment difficult. To address this, this paper proposes an iterative, unsupervised automatic text proofreading technique that simultaneously corrects redundant characters, missing characters, transposed characters, and wrongly written characters, and adds a feedback mechanism so that erroneous proofreading results can be reported and corrected in real time. The model uses a cross-position fusion algorithm to locate the indices of erroneous words. For each detected position, a parallel multi-channel candidate construction strategy produces a sequence of candidate words, and a score correction algorithm selects the best candidate. In experiments on the public SIGHAN dataset and a self-built dataset, correction accuracy and precision improved by 6.39% and 5.17%, respectively, exceeding deep learning models such as Transformer, while the training cost remains low. The method can serve as a reference scheme for the wider adoption of automatic text proofreading.
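The abstract names the stages of the pipeline but does not define them, so the following is only a toy sketch of a generic detect, generate, score, and feedback loop. Everything in it is an assumption made for illustration: the three-sentence corpus, the `CONFUSION` table, the bigram-count score, and all function names are hypothetical, and the detection and scoring rules are simple stand-ins rather than the paper's cross-position fusion or score correction algorithms.

```python
from collections import Counter

# Toy corpus standing in for unsupervised language statistics (assumption).
CORPUS = ["我们今天去公园", "今天天气很好", "我们去公园散步"]
# Hypothetical table of easily confused characters.
CONFUSION = {"圆": ["园", "元"]}

# Character bigram counts collected from the corpus.
BIGRAMS = Counter(s[i:i + 2] for s in CORPUS for i in range(len(s) - 1))


def score(text):
    """Sum of corpus bigram counts over the text (stand-in scoring rule)."""
    return sum(BIGRAMS[text[i:i + 2]] for i in range(len(text) - 1))


def suspicious_positions(text):
    """Flag index i when every bigram touching text[i] is unseen in the corpus."""
    flagged = []
    for i in range(len(text)):
        neighbours = []
        if i > 0:
            neighbours.append(text[i - 1:i + 1])
        if i < len(text) - 1:
            neighbours.append(text[i:i + 2])
        if neighbours and all(BIGRAMS[b] == 0 for b in neighbours):
            flagged.append(i)
    return flagged


def candidates(text, i):
    """Build candidates from three channels: substitution, deletion, swap."""
    subs = [text[:i] + c + text[i + 1:] for c in CONFUSION.get(text[i], [])]
    deletion = [text[:i] + text[i + 1:]]
    swap = []
    if i + 1 < len(text):
        swap = [text[:i] + text[i + 1] + text[i] + text[i + 2:]]
    return subs + deletion + swap


def correct(text, rejected=frozenset(), max_rounds=5):
    """Iteratively apply the best-scoring candidate, skipping rejected edits."""
    for _ in range(max_rounds):
        options = [c for i in suspicious_positions(text)
                   for c in candidates(text, i)
                   if (text, c) not in rejected]
        best = max(options, key=score, default=text)
        if score(best) <= score(text):
            break  # no candidate improves the score
        text = best
    return text


print(correct("我们今天去公圆"))  # 我们今天去公园

# Feedback: a correction the user rejected is not proposed again.
feedback = {("我们今天去公圆", "我们今天去公园")}
print(correct("我们今天去公圆", rejected=feedback))  # 我们今天去公圆
```

In this sketch the feedback is just a set of rejected (input, correction) pairs that is consulted before scoring. A missing-character insertion channel is omitted for brevity.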