Acta Electronica Sinica (电子学报), 2014

A Multi-Topic Information Collection Method Based on Search Strategies

DOI: 10.3969/j.issn.0372-2112.2014.12.003, PP. 2352-2358

Zhong Zhaoman (仲兆满), Li Cunhua (李存华), Liu Zongtian (刘宗田), Guan Yan (管燕)

Keywords: multi-topic information collection, atomic rule, built-in search, general-purpose search, relevance computation

Abstract:

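To address the low efficiency of multi-topic information collection, this paper investigates how the search results for topic rules differ between built-in (site-internal) search engines and general-purpose search engines, proposes splitting topic rules into atomic rules, and analyzes three relations among atomic rules: identity, interchangeability, and inclusion. Based on these relations, separate atomic-rule allocation strategies are designed for built-in search and for general-purpose search, which improves the precision of topic information collection and also reduces the number of search requests. Because searching with atomic rules directly yields results of low precision, a sentence-group-based method is proposed for filtering results by the relevance between topic and information. Experiments were run with 138 topic rules (8223 atomic rules after splitting), 14 built-in search engines, and 4 general-purpose search engines, comparing two measures: the total number of items collected per unit time and the number of relevant items collected. The results show that the proposed method performs well on both measures.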
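The abstract does not give the syntax of a topic rule or formal definitions of the three relations, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a topic rule is an AND of OR-groups of keywords, that an atomic rule is one keyword picked from each group, and that "identical", "interchangeable", and "inclusion" mean same keywords in the same order, same keywords in a different order, and one keyword set strictly containing the other. The names `split_into_atomic_rules` and `relation` and the sample keywords are hypothetical.

```python
from itertools import product


def split_into_atomic_rules(topic_rule):
    """Expand a topic rule into atomic rules.

    ASSUMPTION: a topic rule is a list of OR-groups that are ANDed together,
    e.g. [["flood", "earthquake"], ["rescue", "relief"]] means
    (flood OR earthquake) AND (rescue OR relief).
    Each atomic rule is a tuple holding one keyword from every group.
    """
    return [tuple(combo) for combo in product(*topic_rule)]


def relation(rule_a, rule_b):
    """Classify the relation between two atomic rules (tuples of keywords).

    ASSUMPTION: these definitions are illustrative guesses at the three
    relations named in the abstract.
    """
    if rule_a == rule_b:
        return "identical"
    set_a, set_b = set(rule_a), set(rule_b)
    if set_a == set_b:
        return "interchangeable"  # same keywords, different order
    if set_a < set_b or set_b < set_a:
        return "inclusion"  # one rule's keywords strictly contain the other's
    return "unrelated"


# Hypothetical topic rule: (flood OR earthquake) AND (rescue OR relief)
topic_rule = [["flood", "earthquake"], ["rescue", "relief"]]
atomic_rules = split_into_atomic_rules(topic_rule)
```

Under these assumptions, a rule with OR-groups of sizes n1, n2, ... expands into n1 × n2 × ... atomic rules. Knowing which atomic rules are identical, interchangeable, or included in one another is what would let an allocation strategy avoid issuing redundant searches.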
