Research on Spam Detection Based on the SMOTE Algorithm
Abstract:
Spam detection has long been a research focus in big data and artificial intelligence. This paper carries out a complete data-processing workflow on a spam dataset from the Kaggle platform, covering data preprocessing, text feature construction, and the building of spam detection models. Because the proportions of ham (legitimate mail) and spam in the dataset are severely imbalanced, the SMOTE algorithm is used to oversample the spam class. Four learning algorithms (logistic regression, support vector machine, decision tree, and random forest) are then used to build spam detection models. The performance of the four models is compared before and after applying SMOTE, with particular attention to accuracy, precision, recall, and F1-score, together with the confusion matrices. The experimental results show that SMOTE effectively improves the accuracy of spam detection and that the SMOTE-based detection models perform well.
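As an illustration of the oversampling step, the following is a minimal sketch of the core SMOTE idea: a synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. It is not the paper's implementation. The function name `smote`, its parameters, and the toy feature vectors are assumptions made for this example, and the base sample is chosen at random rather than by iterating over every minority sample.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy feature vectors standing in for vectorised e-mails (illustrative only).
ham = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.3, 0.3], [0.2, 0.4], [0.1, 0.0]]
spam = [[0.9, 0.8], [0.8, 1.0]]

# Oversample spam until both classes have the same number of rows.
balanced_spam = spam + smote(spam, n_synthetic=len(ham) - len(spam), k=1)
```

In a comparison like the one described above, oversampling would normally be applied to the training split only, so that the test metrics are computed on real e-mails rather than synthetic ones.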