|
- 2017
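Identifying the Potential Temporal Intention of Academic Literature Based on Multi-Label Classification. Abstract (translated from Chinese): To improve the temporal relevance of retrieval results, text feature extraction and multi-label classification algorithms are applied to the classification of potential temporal intentions in literature retrieval. From the perspective of classifying the potential temporal intentions of retrieval, an automatic classification algorithm for the potential temporal intentions of literature is proposed, based on temporal information extraction from text and Labeled LDA (a labeled topic model). First, on the basis of the temporal information obtained from the literature, the potential temporal intention of literature retrieval is mapped to specific temporal categories. Second, to reduce the effect of the sparsity of temporal information on the learning of classification features, the distribution features of temporal phrases across disciplines are used to optimize the label selection process of the Labeled LDA classification model. Finally, the proposed algorithm is compared experimentally with other multi-label classification algorithms, and the accuracy of automatic classification of potential temporal intentions in literature retrieval is analyzed and evaluated. The results show that the AUC of the proposed algorithm reaches 79.6%, about 10.9% higher than the comparable baseline algorithm ECC (Ensembles of Classifier Chains), and that it achieves good classification performance across different disciplines, making it an effective method for learning the potential temporal intentions of literature retrieval. English abstract: In order to enhance the temporal relevance of retrieval results, text feature extraction and multi-label classification algorithms were applied to the classification of potential temporal intentions in literature retrieval. From the perspective of classifying the potential temporal intentions of retrieval, an algorithm was proposed to automatically classify the potential temporal intentions of literature, based on text temporal information extraction and Labeled LDA. Firstly, the potential temporal intention of literature retrieval was mapped onto specific temporal categories based on the temporal information gained from the literature. Secondly, the distribution features of temporal phrases across disciplines were used to optimize the label selection process of the Labeled LDA classification model, in order to reduce the impact of the sparsity of temporal information on the learning of classification features. Finally, the proposed algorithm was compared with other multi-label classification algorithms in experiments, and the accuracy of automated classification of potential temporal intentions in literature retrieval was analyzed and evaluated. The results show that the AUC value of the proposed algorithm reaches 94.3%, an increase of approximately 4.3% compared with the ECC (Ensembles of Classifier Chains) algorithm. In addition, the proposed algorithm achieved favorable classification results across different disciplines. Thus, it is an effective learning method for the potential temporal intentions of literature retrieval.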
|