Email Classification via Gradient Boosting That Fuses Co-occurrence Network Features and Knowledge-Enhanced Semantics
Abstract:
Existing email classification algorithms make little use of knowledge network features and are costly to train. To address both problems, this paper applies complex network theory and a knowledge-enhanced semantic model to design a gradient boosting algorithm based on email knowledge co-occurrence network features and knowledge-enhanced semantics. It studies how the email knowledge network and the knowledge representations of an enhanced deep learning model can improve classification performance. First, lexical co-occurrence is used to construct a co-occurrence network of email knowledge. Second, the Vivaldi algorithm maps the nodes of the co-occurrence network into a tensor space, producing a spatial embedding for each knowledge node. Third, centrality features of the co-occurrence network are computed and combined with the Vivaldi semantic space embeddings, then fused with the text semantic features generated by the knowledge-enhanced semantic model. Finally, a gradient boosting algorithm learns the email classifier. In experiments, the proposed algorithm clearly improves on current leading models in accuracy, precision, and recall. These results verify its effectiveness and show that email knowledge network features can strengthen existing models by complementing their representational capacity.
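The network-feature part of the pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not specify: a sliding-window definition of co-occurrence, degree centrality as the centrality feature, a target distance of 1/weight between co-occurring words, a simplified constant-step Vivaldi update that moves both endpoints of an edge, and averaging of node features over the words of an email. All function names are hypothetical. The semantic vector stands in for the output of a knowledge-enhanced semantic model, and the gradient boosting step is left out.

```python
import math
import random
from collections import Counter


def build_cooccurrence(docs, window=3):
    """Count how often two distinct words fall inside the same sliding window.

    Assumption: `window` is the span size in tokens; the abstract does not
    define how co-occurrence is measured.
    """
    weights = Counter()
    for tokens in docs:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + window]:
                if v != w:
                    weights[tuple(sorted((w, v)))] += 1
    return weights


def degree_centrality(weights):
    """Fraction of the other nodes that each node is connected to."""
    neighbors = {}
    for a, b in weights:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


def vivaldi_embed(weights, dim=2, rounds=200, delta=0.05, seed=0):
    """Simplified Vivaldi-style spring relaxation.

    Assumption: the target distance of an edge is 1 / weight, so words that
    co-occur more often are pulled closer together. Both endpoints move,
    which is a symmetric variation of the original one-sided update.
    """
    rng = random.Random(seed)
    nodes = sorted({w for pair in weights for w in pair})
    pos = {n: [rng.uniform(-1, 1) for _ in range(dim)] for n in nodes}
    edges = [(a, b, 1.0 / c) for (a, b), c in weights.items()]
    for _ in range(rounds):
        for a, b, target in edges:
            diff = [pa - pb for pa, pb in zip(pos[a], pos[b])]
            dist = math.sqrt(sum(d * d for d in diff)) or 1e-9
            # Positive error pushes the pair apart, negative pulls it together.
            step = delta * (target - dist) / dist
            for k in range(dim):
                pos[a][k] += step * diff[k]
                pos[b][k] -= step * diff[k]
    return pos


def email_features(tokens, centrality, pos, semantic_vec):
    """Average node features over known words, then append the semantic vector.

    Assumption: averaging is one simple way to turn node-level features into
    an email-level vector; the abstract does not state the fusion rule.
    """
    dim = len(next(iter(pos.values()), []))
    known = [t for t in tokens if t in pos]
    if not known:
        network_part = [0.0] * (dim + 1)
    else:
        network_part = [sum(centrality[t] for t in known) / len(known)]
        network_part += [
            sum(pos[t][k] for t in known) / len(known) for k in range(dim)
        ]
    return network_part + list(semantic_vec)


if __name__ == "__main__":
    docs = [["meeting", "budget", "report"], ["budget", "report", "deadline"]]
    w = build_cooccurrence(docs)
    feats = email_features(
        docs[0], degree_centrality(w), vivaldi_embed(w), [0.1, 0.2]
    )
    print(len(feats))  # 1 centrality value + 2 coordinates + 2 semantic values
```

In a full pipeline, each email's fused feature vector would be passed, together with its label, to a gradient boosting classifier for training.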