Journal of Computer Applications (计算机应用), 2006
Focused crawling method based on improved C4.5 exploiting anchor text
Abstract:
A new focused crawling method based on anchor text and an improved C4.5 decision tree algorithm is proposed. The method uses the anchor text of URLs to train a decision tree, then applies the resulting model to decide whether a downloaded page is on topic and which URL to visit next. A prototype system named DTFC was implemented based on this method, and experiments targeting the topic "academic report" were carried out on four university websites. The experimental results show that DTFC outperforms two standard crawlers in focused crawling.
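
The abstract does not say how C4.5 was improved or how DTFC ranks URLs, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the standard C4.5 gain ratio criterion, binary "anchor text contains word" features, and a one-level tree (a stump) used to order the crawl frontier. The training samples, URLs, and all function names are hypothetical.

```python
import heapq
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(samples, word):
    """Standard C4.5 gain ratio for the binary feature 'anchor contains word'.

    samples is a list of (set_of_anchor_words, is_on_topic) pairs.
    """
    labels = [label for _, label in samples]
    with_word = [label for words, label in samples if word in words]
    without_word = [label for words, label in samples if word not in words]
    if not with_word or not without_word:
        return 0.0  # the feature does not split the data
    total = len(samples)
    remainder = sum(len(part) / total * entropy(part) for part in (with_word, without_word))
    info_gain = entropy(labels) - remainder
    split_info = entropy([word in words for words, _ in samples])
    return info_gain / split_info


def best_split_word(samples):
    """Pick the anchor word with the highest gain ratio (root of the tree)."""
    vocabulary = sorted(set().union(*(words for words, _ in samples)))
    return max(vocabulary, key=lambda w: gain_ratio(samples, w))


def score_anchor(anchor_text, split_word):
    """One-level stump: 1.0 if the anchor contains the split word, else 0.0."""
    return 1.0 if split_word in anchor_text.lower().split() else 0.0


def order_frontier(links, split_word):
    """Return URLs in descending score order; links is a list of (url, anchor_text)."""
    heap = [(-score_anchor(text, split_word), i, url) for i, (url, text) in enumerate(links)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Hypothetical labelled anchor texts (True = link led to an on-topic page).
TRAINING = [
    ({"academic", "report", "on", "graph", "theory"}, True),
    ({"invited", "academic", "report"}, True),
    ({"report", "series", "spring"}, True),
    ({"campus", "map"}, False),
    ({"admissions", "office"}, False),
    ({"library", "opening", "hours"}, False),
]

# Hypothetical outgoing links found on a downloaded page.
LINKS = [
    ("http://example.edu/map", "Campus map"),
    ("http://example.edu/talks/1", "Academic report on graph theory"),
]

if __name__ == "__main__":
    root = best_split_word(TRAINING)
    print(root, order_frontier(LINKS, root))
```

A full C4.5 tree would recurse on each branch and prune afterwards; the stump is used here only to keep the sketch short. The priority queue reflects the "which URL to visit next" decision: links whose anchor text the model rates as on topic are dequeued first.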