
Study on the Development and Implementation of Different Big Data Clustering Methods

DOI: 10.4236/ojapps.2023.137092, PP. 1163-1177

Keywords: Clustering, K-Means, Fuzzy c-Means, Expectation Maximization, BIRCH


Abstract:

Clustering is an unsupervised learning method that organizes raw data so that items with the same or similar characteristics fall into the same class and dissimilar items fall into different classes. The very rapid increase in the amount of data being produced brings new challenges for analyzing and storing that data. Recently, there has been growing interest in key areas such as real-time data mining, which reveals an urgent need to process very large data sets under strict performance constraints. The objective of this paper is to survey four data clustering algorithms, namely K-Means, Fuzzy c-Means (FCM), Expectation Maximization (EM) and BIRCH, and to show their strengths and weaknesses. A further task is to compare the results obtained by applying each of these algorithms to the same data and to draw a conclusion from these results.
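
To make the idea of grouping similar items concrete, the following is a minimal sketch of K-Means in the form of Lloyd's algorithm, written in plain Python. It is an illustration only, not the implementation evaluated in the paper. The function name `kmeans`, the toy data, and the choice to seed the centroids with the first `k` points are all assumptions made for this example; practical implementations usually use random or k-means++ initialization.

```python
def kmeans(points, k, iters=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Assumption for this sketch: centroids start at the first k points.
    """
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # centroids stopped moving, so assignments are stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups of 2-D points.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(data, k=2)
print(labels)
print(centroids)
```

On this toy data the two groups are recovered even though both starting centroids lie in the same group. In general K-Means only reaches a local optimum, so the result depends on the initialization.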

