A Chinese Character and Pinyin Mixed Sensitive Word Filtering Algorithm Based on DFA
Abstract:
Sensitive words on the Internet are often "camouflaged" through various forms of interference, which makes them difficult for general-purpose filtering methods to detect. This paper proposes a filtering algorithm, based on a deterministic finite automaton (DFA), for sensitive words written in a mixture of Chinese characters and Pinyin. It improves both the recall and the precision of filtering for text that contains such words. The algorithm has four parts: an expansion algorithm for the mixed Chinese and Pinyin sensitive word library, a construction algorithm for the sensitive word tree, a preprocessing algorithm for the text to be checked, and the sensitive word filtering algorithm itself. In experiments, the algorithm achieved a precision of 100% and a recall of about 95% to 100%. Its complexity is low, so it meets the needs of practical applications.
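The four components named in the abstract (library expansion, tree construction, preprocessing, and filtering) can be illustrated with a minimal Python sketch. It is not the paper's implementation, whose details are not given here. The tiny `PINYIN` table, the `END` marker, all function names, and the choice to treat every non-letter character as interference are assumptions made only for illustration.

```python
from itertools import product

# Hypothetical, hand-written Pinyin table (assumption); a real system
# would need a complete character-to-Pinyin dictionary.
PINYIN = {"赌": "du", "博": "bo"}

END = "__end__"  # assumed marker key for an accepting state


def expand(word):
    """Return every mixture of `word` in which each character is either
    kept or replaced by its Pinyin."""
    choices = [(ch, PINYIN[ch]) if ch in PINYIN else (ch,) for ch in word]
    return {"".join(parts) for parts in product(*choices)}


def build_tree(words):
    """Build a nested-dict trie; each node plays the role of a DFA state."""
    root = {}
    for word in words:
        for variant in expand(word):
            node = root
            for ch in variant:
                node = node.setdefault(ch, {})
            node[END] = word  # remember which base word this state accepts
    return root


def preprocess(text):
    """Keep only letters (including CJK), lower-cased, paired with their
    index in the original text. Dropping everything else is an assumption
    about what counts as interference."""
    return [(i, ch.lower()) for i, ch in enumerate(text) if ch.isalpha()]


def find_sensitive(text, root):
    """Return (start, end, base_word) spans over the original text,
    taking the longest match at each starting position."""
    kept = preprocess(text)
    hits = []
    i = 0
    while i < len(kept):
        node, best, j = root, None, i
        while j < len(kept) and kept[j][1] in node:
            node = node[kept[j][1]]
            j += 1
            if END in node:
                best = (j, node[END])
        if best:
            end_j, word = best
            hits.append((kept[i][0], kept[end_j - 1][0] + 1, word))
            i = end_j
        else:
            i += 1
    return hits


if __name__ == "__main__":
    tree = build_tree(["赌博"])
    print(find_sensitive("来DU*博吧", tree))  # [(1, 5, '赌博')]
```

In this sketch the preprocessing step keeps original indices so that a match found in the cleaned text can be mapped back to the span to mask. Stripping all non-letter characters is a blunt choice that can join unrelated neighbouring words, so a production filter would likely need a more careful definition of interference.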