|
Polibits 2012
String Distances for Near-duplicate Detection
Keywords: near-duplicate detection, string similarity measures, database, data mining.
Abstract: Near-duplicate detection is important when dealing with large, noisy databases in data mining tasks. In this paper, we present the results of applying the rank distance and the Smith-Waterman distance, along with more popular string similarity measures such as the Levenshtein distance, together with a disjoint set data structure, to the problem of near-duplicate detection.
|
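The combination named in the abstract, a string distance paired with a disjoint set data structure, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the all-pairs comparison, and the `max_distance` threshold of 2 are assumptions chosen for illustration, and only the Levenshtein distance is shown (the rank distance and Smith-Waterman distance are omitted).

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


class DisjointSet:
    """Union-find with path halving and union by size."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]


def group_near_duplicates(records: List[str], max_distance: int = 2) -> List[List[str]]:
    """Merge records whose pairwise distance is within max_distance.

    The threshold and the quadratic all-pairs loop are illustrative
    assumptions, not values or strategies taken from the paper.
    """
    ds = DisjointSet(len(records))
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if levenshtein(records[i], records[j]) <= max_distance:
                ds.union(i, j)
    groups: Dict[int, List[str]] = {}
    for i, record in enumerate(records):
        groups.setdefault(ds.find(i), []).append(record)
    return list(groups.values())
```

Because the disjoint set merges transitively, two records can end up in the same group even when their direct distance exceeds the threshold, as long as a chain of close records connects them. Whether that behaviour is desirable depends on the data.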