Journal of Computer Science and Technology, 2002
Innovating Web page classification through reducing noise
Abstract:
This paper presents a new method that eliminates noise in Web page classification. It first describes a representation of a Web page based on HTML tags. Then, through a novel distance formula, it eliminates the noise in the similarity measure. After carefully analyzing Web pages, we design an algorithm that distinguishes related hyperlinks from noisy ones. Using the non-noisy hyperlinks improves the performance of Web page classification (the CAWN algorithm): any page can be classified through the text and categories of the neighbor pages related to it. The experimental results show that our approach improves classification accuracy.

This work is supported by the National Natural Science Foundation of China (Nos. 60075019 and 9010402) and the National Science Foundation of Beijing (No. 4011003).

LI Xiaoli received his Ph.D. degree from the Institute of Computing Technology, The Chinese Academy of Sciences, in 2001. He taught artificial intelligence in the Graduate School of the University of Science and Technology of China in 1999. His research interests include Web mining, information retrieval and natural language processing. He has published more than 20 papers in international conferences and journals. Since 2000, he has been working as a research staff member at the National University of Singapore.

SHI Zhongzhi received his B.E. and M.E. degrees from the University of Science and Technology of China in 1964 and 1968, respectively. He is currently the Executive Director of the Department of Intelligent Computer Science, Institute of Computing Technology. His research interests include artificial intelligence, neural computing, cognitive science, advanced database technology, and new-generation computers. He has published 10 books and more than 300 technical papers. He is a member of the Standing Steering Committee of PRICAI, Vice President of the Chinese Artificial Intelligence Society, and Secretary-General of the China Computer Federation. He is also the Vice President of the Chinese Society of Machine Learning and Vice President of the Chinese Society of Knowledge Engineering.