|
A Novel Approach to Detect the Near Duplicate by Refining Provenance Matrix

Keywords: near-duplicates, provenance, distrusted, provenance matrix, trustworthiness

Abstract: In this paper, the provenance matrix is refined by adding two more factors, ‘How’ and ‘Why’, to detect near-duplicates with greater accuracy and efficiency. The performance of web search depends on search results that are free of duplicate or redundant information. Greater redundancy consumes more time and more storage, which is why search engines try to avoid indexing duplicate documents. The provenance model combines content-based and trust-based factors to classify documents as near-duplicates or originals, since many near-duplicates nowadays come from distrusted websites.
|
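The idea of combining a content-based factor with a trust-based factor can be illustrated with a minimal sketch. This is not the paper's provenance matrix: the function names, the word-shingle Jaccard similarity, the numeric trust score in [0, 1], and both thresholds are assumptions introduced here only for illustration. The sketch flags a document as a near-duplicate when it is highly similar to a reference document and comes from a source with low trust.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(candidate_text, reference_text, candidate_trust,
             sim_threshold=0.5, trust_threshold=0.5):
    """Label a candidate as 'near-duplicate' or 'original'.

    Assumed rule (not taken from the paper): high content similarity
    combined with a low trust score marks the candidate as a near-duplicate.
    """
    sim = jaccard(shingles(candidate_text), shingles(reference_text))
    if sim >= sim_threshold and candidate_trust < trust_threshold:
        return "near-duplicate"
    return "original"


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    candidate = "the quick brown fox jumps over the lazy cat"
    # Same content, but the label depends on the assumed trust score.
    print(classify(candidate, reference, candidate_trust=0.2))
    print(classify(candidate, reference, candidate_trust=0.9))
```

In this sketch the trust score is supplied by the caller; how trust would actually be derived from provenance factors such as ‘How’ and ‘Why’ is not stated in the abstract and is left open here.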