Journal of Computer Applications (计算机应用), 2006
Concept-based algorithm for dealing with near-replicas of documents on the Web
Abstract:
To handle near-replicas among the Web documents returned by a search engine, a similarity-processing algorithm is proposed. Using concepts extracted from the Web pages together with an inverted file, the algorithm builds a model that reduces the number of Web pages that must be processed. The algorithm saves considerable time and space and provides a good foundation for near-replica detection.
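
The abstract does not give the details of the model, so the following is only a minimal sketch of the general idea, under stated assumptions: each page is already reduced to a set of concept strings, an inverted file maps each concept to the pages containing it, and only pages that share enough concepts are compared. The function names, the `min_shared` and `threshold` parameters, and the use of Jaccard similarity are illustrative assumptions, not the authors' method.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_file(doc_concepts):
    """Map each concept to the set of document ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, concepts in doc_concepts.items():
        for concept in concepts:
            inverted[concept].add(doc_id)
    return inverted


def candidate_pairs(inverted, min_shared=2):
    """Return document pairs sharing at least `min_shared` concepts.

    Pairs with fewer shared concepts are never compared, which is
    how the inverted file shrinks the amount of work (assumed filter).
    """
    shared = defaultdict(int)
    for doc_ids in inverted.values():
        for pair in combinations(sorted(doc_ids), 2):
            shared[pair] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


def jaccard(a, b):
    """Jaccard similarity of two concept sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_replicas(doc_concepts, min_shared=2, threshold=0.5):
    """Return sorted candidate pairs whose similarity reaches `threshold`."""
    inverted = build_inverted_file(doc_concepts)
    pairs = candidate_pairs(inverted, min_shared)
    return sorted(
        pair for pair in pairs
        if jaccard(doc_concepts[pair[0]], doc_concepts[pair[1]]) >= threshold
    )


# Hypothetical concept sets for four pages.
docs = {
    "a": {"search", "engine", "index", "web"},
    "b": {"search", "engine", "index", "crawler"},
    "c": {"football", "league", "goal"},
    "d": {"web", "football"},
}
print(near_replicas(docs))
```

In this toy input only pages `a` and `b` share enough concepts to become a candidate pair, so a single similarity computation is performed instead of the six needed to compare every pair of the four pages.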