%0 Journal Article
%T An Algorithm for Noise Reduction in Web Pages Based on a Group of Content-related Rules (一种基于内容规则的网页去噪算法)
%A Wang Jiandong (王建冬)
%A Wang Jimin (王继民)
%A Tian Feijia (田飞佳)
%J 现代图书情报技术 (New Technology of Library and Information Service)
%D 2008
%I
%X This paper presents a new algorithm for eliminating noise in Web pages based on a group of content-related rules. First, the authors present an algorithm that peels off noise by iteratively comparing the tables on the same level of the page's table tree. Next, an algorithm is presented to evaluate the similarity between the topic of an anchor text and the content of the page. Because the new algorithm takes some semantic features of Web pages into consideration, it achieves higher accuracy than purely rule-based algorithms while keeping time complexity low. Experimental results indicate that the algorithm performs effectively when purifying large collections of Web pages.
%K Noise reduction in Web pages
%K Levenshtein distance
%K Web page purification (网页净化)
%K Edit distance (编辑距离)
%U http://www.alljournals.cn/get_abstract_url.aspx?pcid=B5EDD921F3D863E289B22F36E70174A7007B5F5E43D63598017D41BB67247657&cid=E46382710BF131B2&jid=24AADBCD0D5373C73F37F78D10E2F717&aid=130DB5CC3E4A5389864FC63A8CDE0CBA&yid=67289AFF6305E306&vid=B91E8C6D6FE990DB&iid=38B194292C032A66&sid=987EDA49D8A7A635&eid=318E4CC20AED4940&journal_id=1003-3513&journal_name=现代图书情报技术&referenced_num=0&reference_num=14