Domain Web Pages Discovery Based on Ranking Mechanism

DOI: 10.12677/HJDM.2022.124031, PP. 320-333

Keywords: Focused Crawling, Page Rank, Domain Web Pages Discovery

Abstract:

Domain Web discovery is critical to many Web data integration applications. Existing focused crawling strategies remain effective, and many focused crawlers have been built on them; however, configuring a focused crawler is difficult and time-consuming. This paper therefore proposes a domain Web page discovery algorithm based on a ranking mechanism. Building on existing focused crawling strategies, the algorithm starts from a given set of sample web pages and uses a ranking-based method to systematically combine multiple web page discovery strategies, iteratively discovering and extracting new web pages in the domain. Experiments show that the method discovers pages with high accuracy, which supports its effectiveness.
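The abstract does not specify the ranking function or the individual discovery strategies, so the following is only a minimal sketch of the general idea (pool candidates from several strategies, rank them against the sample pages, keep the best, repeat), not the paper's algorithm. The names `discover`, `forward_links`, `backlinks`, the Jaccard score, the thresholds, and the toy link graph are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set

# Hypothetical type: a strategy maps the currently known page URLs to a set
# of candidate URLs (e.g. by following forward links or backlinks).
Strategy = Callable[[Set[str]], Set[str]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap, used here as a stand-in relevance score."""
    return len(a & b) / len(a | b) if a | b else 0.0


def discover(seeds: Set[str],
             strategies: List[Strategy],
             tokens: Dict[str, Set[str]],
             rounds: int = 3,
             top_k: int = 2,
             min_score: float = 0.2) -> Set[str]:
    """Iteratively grow `seeds` with the best-ranked candidates per round."""
    known = set(seeds)
    # Domain profile: union of the tokens seen on the sample pages.
    profile = set().union(*(tokens[u] for u in seeds))

    def score(url: str) -> float:
        return jaccard(tokens.get(url, set()), profile)

    for _ in range(rounds):
        # 1. Pool the candidates proposed by every strategy.
        candidates = set().union(*(s(known) for s in strategies)) - known
        # 2. Rank the pooled candidates (best score first, URL as tie-break).
        ranked = sorted(candidates, key=lambda u: (-score(u), u))
        # 3. Keep the top-k candidates that clear the relevance threshold.
        accepted = [u for u in ranked[:top_k] if score(u) >= min_score]
        if not accepted:
            break  # nothing relevant left to add
        known.update(accepted)
    return known


# Toy in-memory "web" (assumed data, for illustration only).
LINKS = {"a": {"b", "x"}, "b": {"c"}, "c": set(), "x": set()}
TOKENS = {
    "a": {"crawler", "web", "domain"},
    "b": {"crawler", "web", "ranking"},
    "c": {"crawler", "domain", "seed"},
    "x": {"recipe", "pasta"},
}


def forward_links(known: Set[str]) -> Set[str]:
    return set().union(*(LINKS.get(u, set()) for u in known))


def backlinks(known: Set[str]) -> Set[str]:
    return {u for u, outs in LINKS.items() if outs & known}


found = discover({"a"}, [forward_links, backlinks], TOKENS)
```

In this toy run the off-topic page `x` is proposed by the forward-link strategy in every round but never clears the relevance threshold, while `b` and `c` are added in successive rounds. A real system would replace the token overlap with a learned or tuned ranking and could weight strategies by how productive they have been.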

References

[1]  Tang, Y., Lin, D., Fan, A.H., and Wu, W.W. (2018) Big Data Analysis and Computing [大数据分析与计算]. Tsinghua University Press, Beijing. (In Chinese)
[2]  Krishnamurthy, Y., Pham, K., Santos, A., and Freire, J. (2016) Interactive Exploration for Domain Discovery on the Web. ACM KDD Workshop on Interactive Data Exploration and Analytics (IDEA), 64-71. https://nyuscholars.nyu.edu/en/publications/interactive-exploration-for-domain-discovery-on-the-web
[3]  Barbosa, L., Bangalore, S., and Sridhar, V.K.R. (2011) Crawling Back and Forth: Using Back and Out Links to Locate Bilingual Sites. Proceedings of 5th International Joint Conference on Natural Language Processing, Chiang Mai, 8-13 November 2011, 429-437. https://aclanthology.org/I11-1048
[4]  Qiu, D.S., Barbosa, L., Dong, X.L., Shen, Y.Y., and Srivastava, D. (2015) Dexter: Large-Scale Discovery and Extraction of Product Specifications on the Web. Proceedings of the VLDB Endowment, 8, 2194-2205.
https://doi.org/10.14778/2831360.2831372
[5]  Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002) Finite-Time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47, 235-256.
[6]  Dean, J. and Henzinger, M.R. (1999) Finding Related Pages in the World Wide Web. Computer Networks, 31(11), 1467-1479.
[7]  Murata, T. (2001) Finding Related Web Pages Based on Connectivity Information from a Search Engine. Poster Proceedings of 10th International Conference on World Wide Web (WWW), Hong Kong, 1-5 May 2001, 18-19. http://www10.org/cdrom/posters/frame.html
[8]  Vieira, K., Barbosa, L., Silva, A.S., Freire, J., and Moura, E. (2016) Finding Seeds to Bootstrap Focused Crawlers. World Wide Web, 19, 449-474.
https://doi.org/10.1007/s11280-015-0331-7
[9]  Barbosa, L. and Freire, J. (2007) An Adaptive Crawler for Locating Hidden-Web Entry Points. In Proceedings of the 16th International Conference on World Wide Web (WWW), New York, 8 May 2007, 441-450.
https://doi.org/10.1145/1242572.1242632
[10]  Chakrabarti, S., Punera, K., and Subramanyam, M. (2002) Accelerated Focused Crawling through Online Relevance Feedback. In Proceedings of the 11th International Conference on World Wide Web (WWW), New York, 7 May 2002, 148-159.
https://doi.org/10.1145/511446.511466
[11]  Chakrabarti, S., van den Berg, M., and Dom, B. (1999) Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery. Computer Networks, 31, 1623-1640.
https://doi.org/10.1016/S1389-1286(99)00052-3
[12]  Ester, M., Kriegel, H.-P., and Schubert, M. (2004) Accurate and Efficient Crawling for Relevant Websites. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), Toronto, 31 August-3 September 2004, 396-407.
https://doi.org/10.1016/B978-012088469-8.50037-1
[13]  Meusel, R., Mika, P., and Blanco, R. (2014) Focused Crawling for Structured Data. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (CIKM), New York, 3 November 2014, 1039-1048.
https://doi.org/10.1145/2661829.2661902
