Application Research of Computers (计算机应用研究), 2010
Approach for detecting approximately duplicate records based on cluster of inner code's sequence value
Abstract:
Detecting and eliminating approximately duplicate records is one of the main problems to be solved in data cleaning and data quality improvement. To address this problem, this paper presents an approach for detecting approximately duplicate records based on clustering by the inner code's sequence value. The proposed method first chooses the key field, or some bits of it, and, according to the inner code's sequence value of each character, clusters a large dataset into many small datasets. Then, in terms of rank-bas...
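
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the "inner code's sequence value" of a character is the integer value of its GB2312-encoded bytes, and that records are dictionaries with a string key field. The names `inner_code_value`, `cluster_by_inner_code`, and `prefix_len` are hypothetical. The later steps of the method are truncated in the abstract and are not sketched here.

```python
from collections import defaultdict


def inner_code_value(ch, encoding="gb2312"):
    """Integer value of the character's encoded bytes.

    Assumption: this stands in for the paper's "inner code sequence value".
    Characters outside the encoding are replaced with '?' before conversion.
    """
    return int.from_bytes(ch.encode(encoding, errors="replace"), "big")


def cluster_by_inner_code(records, key_field, prefix_len=1):
    """Partition records into small clusters keyed by the inner code
    values of the first `prefix_len` characters of the key field."""
    clusters = defaultdict(list)
    for rec in records:
        prefix = rec[key_field][:prefix_len]
        bucket = tuple(inner_code_value(c) for c in prefix)
        clusters[bucket].append(rec)
    return dict(clusters)


records = [{"name": "张三"}, {"name": "张三丰"}, {"name": "李四"}]
clusters = cluster_by_inner_code(records, "name")
for bucket, members in clusters.items():
    print(bucket, [m["name"] for m in members])
```

Under these assumptions, records whose key fields start with the same character fall into the same small cluster, so any pairwise duplicate comparison only needs to run within each cluster rather than across the whole dataset.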