The Wisconsin Breast Cancer Dataset has been heavily cited as a benchmark dataset for classification. Neural network techniques, including standard, probabilistic, and regression neural networks, have been shown to perform very well on this dataset. However, despite its practical importance and implications for cancer research, a thorough investigation of modern classification techniques on this dataset remains to be done. In this paper we examine how accurately several classifiers label the masses in the dataset as benign or malignant: Random Forests with varying numbers of trees, Support Vector Machines with different kernels, a Naive Bayes model, and neural networks. Results indicate that Support Vector Machines with a Radial Basis Function (RBF) kernel give the best accuracy of all the models attempted. This suggests that the dataset contains non-linearities and that the RBF kernel maps the data into a higher-dimensional space in which the classes become linearly separable by a large-margin classifier such as the Support Vector Machine. These results suggest that modern machine learning methods could improve the accuracy of early prediction of cancerous tumors.
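The separability argument can be illustrated with a small toy sketch. It uses neither the Wisconsin data nor an SVM solver, and it is not the method used in this paper. Instead it trains a kernel perceptron, a simpler stand-in chosen here for brevity, on four XOR points that no straight line can separate. The function names, the XOR data, and the value gamma = 1.0 are all assumptions made for illustration.

```python
import math

def linear_kernel(x, z):
    """Plain dot product: corresponds to no feature mapping at all."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel exp(-gamma * ||x - z||^2); gamma=1.0 is an arbitrary choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_value(x, points, labels, alpha, kernel):
    """Kernel expansion of the decision function over the training points."""
    return sum(a * y * kernel(p, x) for a, p, y in zip(alpha, points, labels))

def train_kernel_perceptron(points, labels, kernel, epochs=100):
    """Return (alpha, converged); converged means one full pass made no mistakes."""
    alpha = [0] * len(points)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(points, labels)):
            if y * decision_value(x, points, labels, alpha, kernel) <= 0:
                alpha[i] += 1  # strengthen the vote of the misclassified point
                mistakes += 1
        if mistakes == 0:
            return alpha, True
    return alpha, False

# Toy XOR data (hypothetical, not taken from the Wisconsin dataset).
XOR_POINTS = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
XOR_LABELS = [-1, -1, 1, 1]

linear_alpha, linear_ok = train_kernel_perceptron(XOR_POINTS, XOR_LABELS, linear_kernel)
rbf_alpha, rbf_ok = train_kernel_perceptron(XOR_POINTS, XOR_LABELS, rbf_kernel)

print("linear kernel separates XOR:", linear_ok)
print("RBF kernel separates XOR:", rbf_ok)
```

With the linear kernel the perceptron never completes a mistake-free pass, whereas with the RBF kernel it does. This is the same qualitative effect the abstract attributes to the RBF-kernel Support Vector Machine, which additionally maximizes the margin.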