Semantic duplicates in databases are an important data quality challenge today and lead to bad decisions. Large databases can contain tens of thousands of duplicates, which makes automatic deduplication necessary. Deduplication first requires detecting the duplicates, with a method reliable enough to find as many of them as possible and efficient enough to run in a reasonable time. This paper proposes effective duplicate detection methods for the automatic deduplication of files based on names, and compares them on real data. The methods work with French or English texts and with the names of people or places, in Africa or in the West. After establishing a more complete classification of semantic duplicates than the usual classifications, we introduce several duplicate detection methods whose observed average complexity is less than O(2n). Through a simple model, we highlight a global efficacy rate that combines precision and recall. We also propose a new metric distance between records, as well as rules for automatic duplicate detection. Analyses on a database containing real data for an administration in Central Africa, and on a well-known standard database of restaurant names in the USA, show better results than those of known methods, at lower complexity.
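To make the evaluation terms concrete, the following minimal sketch computes precision and recall over duplicate pairs. It is an illustration under stated assumptions, not the paper's model: the abstract does not define the global efficacy rate, so the F1 score (the harmonic mean of precision and recall) is used here only as a familiar stand-in for a combined score. The helper names `pair_set`, `precision_recall`, and `f1`, and the toy record ids, are hypothetical.

```python
from itertools import combinations


def pair_set(clusters):
    """Expand clusters of record ids into the set of unordered duplicate pairs."""
    return {
        frozenset(pair)
        for cluster in clusters
        for pair in combinations(sorted(cluster), 2)
    }


def precision_recall(predicted, truth):
    """Precision and recall of predicted duplicate pairs against the true pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall.

    Assumption: a stand-in combined score, not the paper's global efficacy rate.
    """
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Toy data (hypothetical): records 1, 2, 3 are the same entity, as are 4 and 5.
truth = pair_set([{1, 2, 3}, {4, 5}])
# A detector that finds (1, 2) and (4, 5) but wrongly groups 6 and 7.
predicted = pair_set([{1, 2}, {4, 5}, {6, 7}])

p, r = precision_recall(predicted, truth)
score = f1(p, r)
```

Counting pairs rather than clusters is one common convention for scoring duplicate detection; it lets a partially recovered group, such as finding only two of three matching records, earn partial credit.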