%0 Journal Article
%T Parallel Text Categorization of Massive Text Based on Hadoop (基于Hadoop平台的海量文本分类的并行化)
%A XIANG Xiao-jun (向小军)
%A GAO Yang (高阳)
%A SHANG Lin (商琳)
%A YANG Yu-bin (杨育彬)
%J 计算机科学 (Computer Science)
%D 2011
%I
%X Automatic text categorization, one of the key techniques in information retrieval and data mining, has been studied extensively and has progressed rapidly in recent years. As text data grows exponentially, managing such large volumes of data effectively requires efficient algorithms that run in a distributed environment. In this paper, we implement a simple and effective text categorization algorithm on Hadoop: the TFIDF classifier, which is based on the vector space model and uses cosine similarity as its metric. Experiments on two datasets show that the parallel algorithm is effective on large volumes of data and can be applied in practice.
%K Text categorization
%K Parallelization
%K Massive data
%K Hadoop
%K 文本分类
%K 并行化
%K 海量数据
%U http://www.alljournals.cn/get_abstract_url.aspx?pcid=5B3AB970F71A803DEACDC0559115BFCF0A068CD97DD29835&cid=8240383F08CE46C8B05036380D75B607&jid=64A12D73428C8B8DBFB978D04DFEB3C1&aid=17DDCED190714E79A98FEEFBE3D758E7&yid=9377ED8094509821&vid=16D8618C6164A3ED&iid=F3090AE9B60B7ED1&sid=798FBE8DE1A255B1&eid=847B14427F4BF76A&journal_id=1002-137X&journal_name=计算机科学&referenced_num=0&reference_num=0