Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2013

Web Content Extraction Based on a Text Density Model

pp. 667-672

Zhu Zede (朱泽德), Li Miao (李淼), Zhang Jian (张健), Chen Lei (陈雷), Zeng Xinhua (曾新华)

Keywords: Web mining, content extraction, text density, Gaussian smoothing, maximum subsequence


Abstract:

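To obtain useful content from large amounts of irrelevant information, main-text extraction has become an indispensable component of Web data applications. This paper proposes a method for extracting the main text of news Web pages based on a text density model. A statistical model that fuses Web page structure and language features converts the page, line by line, into a sequence of positive and negative density values. The density sequence is then corrected with Gaussian smoothing, according to the content continuity of neighboring lines. Finally, an improved maximum subsequence segmentation is applied to the sequence to extract the main text. The method preserves the integrity of the main text and excludes noise, and it requires neither manual intervention nor repeated training. Experimental results show that text-density-based extraction adapts well to different data sources, and that its precision and recall exceed those of existing statistical models.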
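
The three stages named in the abstract (per-line density scoring, Gaussian smoothing, maximum subsequence segmentation) can be illustrated with a minimal Python sketch. It is not the authors' model: the abstract does not give the density formula, so `line_score` (characters per tag minus a fixed `threshold`) is a hypothetical stand-in, the kernel parameters `radius` and `sigma` are assumed values, and the segmentation step is the plain maximum-sum contiguous run rather than the paper's improved variant. Each input line is assumed to be a `(text, tag_count)` pair produced by some earlier HTML pre-processing step.

```python
import math


def line_score(text_chars, tag_count, threshold=20.0):
    # Hypothetical density: characters per tag, shifted so that sparse
    # lines score negative and text-rich lines score positive.
    return text_chars / max(tag_count, 1) - threshold


def gaussian_kernel(radius=2, sigma=1.0):
    # Discrete Gaussian weights, normalized to sum to 1.
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(scores, radius=2, sigma=1.0):
    # Weighted average over neighboring lines; indices are clamped at
    # the edges so the first and last lines reuse their own score.
    kernel = gaussian_kernel(radius, sigma)
    n = len(scores)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * scores[j]
        out.append(acc)
    return out


def max_subsequence(scores):
    # Plain maximum-sum contiguous run (Kadane's algorithm).
    # Returns a half-open (start, end) range; (0, 0) if nothing is positive.
    best_sum, best = 0.0, (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            cur_sum, cur_start = s, i
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


def extract_main_text(lines, threshold=20.0):
    # lines: list of (text, tag_count) pairs, one per line of the page.
    scores = [line_score(len(text), tags, threshold) for text, tags in lines]
    start, end = max_subsequence(smooth(scores))
    return [text for text, _ in lines[start:end]]
```

Calling `extract_main_text` on a list whose first and last entries are tag-heavy navigation or footer lines and whose middle entries are long text lines returns the middle run. A short line sandwiched between long content lines is kept as part of that run, which is the continuity effect the abstract attributes to smoothing.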

