
Chinese Hidden-Head Poem Sensitive Word Detection Algorithm Based on BiLSTM-CRF (基于BiLSTM-CRF的中文藏头诗敏感词检测算法)

DOI: 10.12677/SEA.2023.126089, PP. 915-921

Keywords: Acrostic Poetry, Sensitive Word Detection, BiLSTM-CRF


Abstract:

In the era of digitization and social media, acrostic poetry, a literary form that combines cultural heritage with modern expression, poses a content-monitoring challenge for internet platforms. Because the first characters of the lines spell out a specific meaning when read together, an acrostic poem can be used to hide sensitive information. On social media and instant messaging platforms in particular, users may use acrostic poems to circumvent sensitive word filtering. This study proposes a sensitive word detection algorithm for acrostic poetry based on a bidirectional long short-term memory network combined with a conditional random field (BiLSTM-CRF). The algorithm first uses word embeddings to represent the text as high-dimensional vectors, then uses the BiLSTM model to capture the forward and backward semantic context of the poem and the dependencies among acrostic words across sentences in the text sequence. Finally, the CRF model outputs the label sequence based on the correlations between labels. We tested the algorithm on datasets of different types of acrostic poetry, and the results show that it identifies sensitive words effectively, with high accuracy and recall. The algorithm is valuable for monitoring automatically generated text content, particularly with respect to protecting cultural heritage and complying with internet regulations.
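The final step of the pipeline described in the abstract, in which a CRF picks the label sequence using correlations between labels, can be illustrated with a small Viterbi decoder. This is a minimal sketch and not the authors' implementation: the BIO tag set, the toy emission scores (standing in for BiLSTM outputs), the transition penalty, the placeholder poem, and the function names `head_characters` and `viterbi` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed BIO scheme: B/I mark the begin/inside of a hidden sensitive word.
TAGS = ["O", "B", "I"]
FORBIDDEN = -1e4  # large penalty for label sequences that violate BIO


def head_characters(poem: str) -> List[str]:
    """Return the first character of every non-empty line."""
    return [ln.strip()[0] for ln in poem.splitlines() if ln.strip()]


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Highest-scoring tag path given per-position and tag-to-tag scores."""
    if not emissions:
        return []
    # A sequence may not start with "I".
    best = [{t: emissions[0][t] + (FORBIDDEN if t == "I" else 0.0)
             for t in TAGS}]
    back: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        cur: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p]
                       + transitions.get((p, tag), 0.0))
            cur[tag] = (best[-1][prev]
                        + transitions.get((prev, tag), 0.0) + scores[tag])
            ptr[tag] = prev
        best.append(cur)
        back.append(ptr)
    # Follow back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]


# Placeholder poem; only the first character of each line matters here.
poem = "A line one\nB line two\nC line three\nD line four"
heads = head_characters(poem)

# Toy emission scores, one dict per head character (hypothetical values).
emissions = [
    {"O": 2.0, "B": 0.5, "I": 0.0},
    {"O": 0.4, "B": 0.5, "I": 1.0},
    {"O": 0.2, "B": 0.1, "I": 1.5},
    {"O": 2.0, "B": 0.1, "I": 0.3},
]
transitions = {("O", "I"): FORBIDDEN}  # "I" must follow "B" or "I"

tags = viterbi(emissions, transitions)
flagged = "".join(ch for ch, t in zip(heads, tags) if t != "O")
print(tags, flagged)
```

With these toy scores, choosing the best tag at each position independently would give O, I, I, O, which is not a valid BIO sequence. The transition penalty steers the decoder to O, B, I, O instead, which is the kind of label correlation the CRF layer is meant to capture.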

