Acta Electronica Sinica, 2014

An Imbalanced Data Mining Method Based on a Fusion Mechanism of One-Sided Selection Link and Sample Distribution Density

DOI: 10.3969/j.issn.0372-2112.2014.07.011, PP. 1311-1319

Keywords: imbalanced data classification, one-sided selection link, distribution density, resampling


Abstract:

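Classifying imbalanced data sets is a major challenge in machine learning. The Synthetic Minority Over-Sampling Technique (SMOTE) has become an effective and widely adopted way to address it. When generating new samples, however, SMOTE synthesizes from all minority-class samples, which creates an overfitting bottleneck. To better address this problem, we propose a new imbalanced data mining method based on the one-sided selection link and sample distribution density (One-Sided Link & Distribution Density-SMOTE, OSLDD-SMOTE). OSLDD-SMOTE uses the one-sided selection link to pick out the minority-class samples that lie on the classification boundary, and generates new samples according to the dynamic distribution density of those samples. We then analyze the effect of the degree of sample synthesis on the number of nodes and on minority-class precision, and compare the classification performance of OSLDD-SMOTE with that of similar methods using three metrics: G-mean, F-measure, and AUC. The experimental results show that OSLDD-SMOTE effectively improves the classification accuracy of minority-class samples.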
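For orientation, the baseline that the abstract starts from can be sketched in a few lines. The block below is a minimal illustration of plain SMOTE interpolation over all minority samples, plus the usual definitions of G-mean and F-measure from confusion-matrix counts. It is not OSLDD-SMOTE: the abstract gives no details of the one-sided selection link or the density-based generation step, so neither is modeled here. The function names (`smote`, `g_mean`, `f_measure`) and the parameters `k`, `n_per_sample`, and `seed` are hypothetical choices made for this sketch.

```python
import math
import random


def smote(minority, k=3, n_per_sample=1, seed=0):
    """Plain SMOTE sketch: every minority sample is used as a base point.

    Assumes `minority` holds at least two distinct points given as tuples
    of floats. Names and defaults are illustrative, not from the paper.
    """
    rng = random.Random(seed)
    synthetic = []
    for base in minority:
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        for _ in range(n_per_sample):
            other = rng.choice(neighbours)
            gap = rng.random()  # uniform in [0, 1)
            # new point lies on the segment between `base` and `other`
            synthetic.append(
                tuple(b + gap * (o - b) for b, o in zip(base, other))
            )
    return synthetic


def g_mean(tp, fn, tn, fp):
    """Geometric mean of minority recall and majority recall."""
    return math.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))


def f_measure(tp, fn, fp, beta=1.0):
    """F-measure of the minority (positive) class; assumes tp > 0."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


if __name__ == "__main__":
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote(minority, k=2, n_per_sample=1):
        print(point)
    print(g_mean(tp=8, fn=2, tn=90, fp=10))
    print(f_measure(tp=8, fn=2, fp=10))
```

Because every minority sample serves as a base point, samples deep inside the minority region are replicated as readily as those near the class boundary. This is the behaviour the abstract identifies as a source of overfitting and that OSLDD-SMOTE addresses by restricting synthesis to boundary samples.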

