A Web Document Similarity Measure That Embeds Distributional Information

pp. 66-70

Keywords: Web page similarity measure, VSM, distributional information, Web page classification


Abstract:

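Measuring the similarity between Web documents is key to Web text classification, and an effective similarity measure can improve classification accuracy. The classical vector space model (VSM) considers only how often each word occurs in a page and makes no effective use of how the word is distributed within it, which limits classification accuracy. This paper computes the mean and variance of each word's positions within a page and incorporates them into the similarity computation, yielding a new Web page similarity measure that embeds distributional information directly. By making reasonable use of both word frequency and word distribution, the method improves and extends the classical Web page similarity measure. Experimental results show that the proposed measure is effective and feasible.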
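
The abstract does not give the paper's formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes word positions are normalized to [0, 1], and it uses a hypothetical exponential penalty (controlled by an assumed parameter `alpha`) that damps a shared word's contribution to the cosine similarity when the mean or spread of its positions differs between the two pages. The names `term_stats` and `distribution_aware_similarity` are illustrative.

```python
import math
from collections import defaultdict


def term_stats(tokens):
    """Return {term: (tf, mean_pos, var_pos)} for one tokenized page.

    Positions are normalized to [0, 1] so pages of different length
    are comparable (an assumption of this sketch).
    """
    n = len(tokens)
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i / (n - 1) if n > 1 else 0.0)

    stats = {}
    for term, pos in positions.items():
        mean = sum(pos) / len(pos)
        var = sum((p - mean) ** 2 for p in pos) / len(pos)
        stats[term] = (len(pos) / n, mean, var)
    return stats


def distribution_aware_similarity(tokens_a, tokens_b, alpha=1.0):
    """Cosine similarity over term frequencies, damped per shared term.

    The damping factor exp(-alpha * gap) is a hypothetical choice:
    gap grows with the difference in mean position and in positional
    standard deviation. With alpha = 0 this is the plain VSM cosine.
    """
    sa, sb = term_stats(tokens_a), term_stats(tokens_b)

    dot = 0.0
    for term in sa.keys() & sb.keys():
        tf_a, mu_a, var_a = sa[term]
        tf_b, mu_b, var_b = sb[term]
        gap = abs(mu_a - mu_b) + abs(math.sqrt(var_a) - math.sqrt(var_b))
        dot += tf_a * tf_b * math.exp(-alpha * gap)

    norm_a = math.sqrt(sum(tf ** 2 for tf, _, _ in sa.values()))
    norm_b = math.sqrt(sum(tf ** 2 for tf, _, _ in sb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    page_a = "web page classification web".split()
    page_b = "classification page web web".split()
    # Same bag of words, different word order.
    print(distribution_aware_similarity(page_a, page_b, alpha=0.0))
    print(distribution_aware_similarity(page_a, page_b, alpha=1.0))
```

In this sketch, two pages with the same word counts but different word placement score lower once `alpha` is positive, which is the kind of distinction a frequency-only VSM cannot make.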

