
PLOS ONE 2014

Ensemble Positive Unlabeled Learning for Disease Gene Identification

DOI: 10.1371/journal.pone.0097079

Peng Yang, Xiaoli Li, Hon-Nian Chua, Chee-Keong Kwoh, See-Kiong Ng


Abstract:

In recent years, a growing number of genes have been experimentally confirmed as causative for various human diseases. Machine learning methods can exploit this newly available knowledge to discover additional, as yet unknown genes that are likely to be associated with diseases. In particular, positive unlabeled learning (PU learning) methods, which require only a positive training set P (confirmed disease genes) and an unlabeled set U (unknown candidate genes) rather than a negative training set N, have been shown to be effective in uncovering new disease genes in this setting. However, predictions based on a single data source are susceptible to bias arising from incompleteness and noise in the genomic data, and a single machine learning predictor is prone to bias caused by the inherent limitations of the individual method. In this paper, we propose a PU learning framework that integrates multiple biological data sources and an ensemble of machine learning classifiers for disease gene identification. The method first integrates data from multiple biological sources to train PU learning classifiers. A novel ensemble-based PU learning method, EPU, then combines these classifiers to achieve accurate and robust disease gene predictions. In evaluation experiments across six disease groups, EPU achieved significantly better results than various state-of-the-art prediction methods and ensemble learning classifiers. By integrating multiple biological data sources for training and combining the outputs of an ensemble of PU learning classifiers for prediction, we are able to minimize the potential bias and errors of individual data sources and machine learning algorithms. EPU also provides a framework for integrating additional biological and computational resources in the future to further improve disease gene predictions.
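
To make the PU learning setting concrete, the following is a minimal Python sketch of an ensemble trained from a positive set P and an unlabeled set U only. It is not the EPU method from the paper, whose details are not given in the abstract. It assumes a simple bagging-style scheme: each round treats a random subset of U as provisional negatives, fits a nearest-centroid scorer, and the rounds are averaged. The function names, the toy two-dimensional feature vectors, and the parameter `n_rounds` are all hypothetical and chosen for illustration.

```python
import math
import random


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def train_centroid_scorer(pos, neg):
    """Return a scorer that is higher for points closer to pos than to neg."""
    c_pos, c_neg = centroid(pos), centroid(neg)

    def score(x):
        return math.dist(x, c_neg) - math.dist(x, c_pos)

    return score


def pu_ensemble_scores(positives, unlabeled, n_rounds=25, seed=0):
    """Average the scores of base models trained on P vs. random subsets of U.

    Assumption: a bagging-style PU ensemble, used here only as a stand-in
    for the ensemble idea described in the abstract.
    """
    rng = random.Random(seed)
    k = min(len(positives), len(unlabeled))
    totals = [0.0] * len(unlabeled)
    for _ in range(n_rounds):
        # Provisional negatives: a random sample of the unlabeled genes.
        pseudo_neg = rng.sample(unlabeled, k)
        scorer = train_centroid_scorer(positives, pseudo_neg)
        for i, x in enumerate(unlabeled):
            totals[i] += scorer(x)
    return [t / n_rounds for t in totals]


# Toy data (hypothetical): confirmed disease genes cluster near (1, 1).
positives = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]]
# Unlabeled candidates: indices 0 and 5 are hidden positives.
unlabeled = [
    [1.1, 0.9],
    [-1.0, -1.0],
    [-1.2, -0.8],
    [-0.9, -1.1],
    [-1.1, -1.0],
    [1.0, 1.2],
]

scores = pu_ensemble_scores(positives, unlabeled)
# Rank unlabeled candidates from most to least disease-like.
ranking = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
print(ranking[:2])
```

In this toy setup the two hidden positives should rank above the remaining candidates, because averaging over many random negative samples dampens the effect of any single round in which true positives were mistakenly drawn as negatives. A real pipeline would replace the centroid scorer with stronger base classifiers and build the feature vectors from several biological data sources.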

