%0 Journal Article
%T Extreme Learning Machine for Protein Subcellular Localization from Primary Sequence
%A 石峰
%A 陈洪
%A 熊慧娟
%J Hans Journal of Data Mining
%P 6-11
%@ 2163-1468
%D 2013
%I Hans Publishing
%R 10.12677/HJDM.2013.31002
%X Predicting protein subcellular localization from primary sequence is crucial to genome annotation, protein function prediction, drug discovery, and related fields. The extreme learning machine (ELM) is a machine learning method that has emerged in recent years. This paper explores the potential of ELM for protein subcellular localization prediction. To this end, a new feature extraction strategy is first proposed that represents each primary sequence as a 25-dimensional numerical vector. Three methods are then compared numerically on 852 mycobacterial protein sequences: a support vector machine (SVM) with the new features, ELM with the new features, and an existing SVM method with pseudo amino acid composition features. The sequences are taken from the Swiss-Prot 48 database and belong to four different classes. Five-fold cross-validation on these 852 sequences shows that ELM with the new features achieves the highest accuracy, 97.2%, compared with 96.4% for SVM with the new features and 95.2% for SVM with pseudo amino acid composition features.
%K Protein Subcellular Localization
%K Extreme Learning Machine
%K Homologous Protein
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=9373