Journal of Computer Applications (计算机应用), 2006
Approximately duplicated records examining method and its application in ETL of data warehouse
Abstract:
Detecting and eliminating approximately duplicated records is one of the main problems to be solved in data cleaning and data quality improvement. Position-coding technology is introduced into the ETL process of a data warehouse, and a novel algorithm for detecting approximately duplicated records, named the Position-Coding Method (PCM), is presented. The algorithm applies to the Chinese character set as well as to Western character sets. Experimental comparison with previous work indicates that the method is effective.