The wide coverage and biological relevance of the Gene Ontology (GO), confirmed through its successful use in protein function prediction, have made it increasingly popular. To exploit the breadth of biological knowledge that GO offers for describing genes or groups of genes, an efficient, scalable similarity measure for GO terms and GO-annotated proteins is needed. While several GO similarity measures exist, none adequately addresses all issues surrounding the design and usage of the ontology. We introduce a new metric for measuring the distance between two GO terms using the intrinsic topology of the GO-DAG, thus enabling the measurement of functional similarities between proteins based on their GO annotations. We assess the performance of this metric using a ROC analysis on human protein-protein interaction datasets and a correlation coefficient analysis on the selected set of protein pairs from the CESSM online tool. The metric performs well compared to existing annotation-based GO measures. We also used the metric to assess functional similarity between orthologues, and show that it is effective at determining whether orthologues are annotated with similar functions and at identifying cases where annotation is inconsistent between orthologues.

1. Introduction

Worldwide DNA sequencing efforts have led to a rapid increase in sequence data in the public domain. Unfortunately, many of these newly sequenced genes and proteins lack functional annotations. From 20% to 50% of genes within a genome [1] are still labeled unknown, uncharacterized, or hypothetical, and this limits our ability to exploit these data. Therefore, automatic genome annotation, which consists of assigning functions to genes and their products, has to be performed to ensure that maximal benefit is derived from these sequencing efforts.
This requires a systematic description of the attributes of genes and proteins using a standardized syntax and semantics in a format that is human readable and understandable, as well as being interpretable computationally. The terms used for describing functional annotations should have definitions and be placed within a structure of relationships. Therefore, an ontology is required in order to represent annotations of known genes and proteins and to use these to predict functional annotations of those which are identified but as yet uncharacterized. By capturing knowledge about a domain in a shareable and computationally accessible form, ontologies can provide defined and computable semantics
References
[1]
F. Enault, K. Suhre, and J. M. Claverie, “Phydbac “gene function predictor”: a gene annotation tool based on genomic context analysis,” BMC Bioinformatics, vol. 6, p. 247, 2005.
[2]
P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble, “Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation,” Bioinformatics, vol. 19, no. 10, pp. 1275–1283, 2003.
[3]
M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.
[4]
X. Mao, T. Cai, J. G. Olyarchuk, and L. Wei, “Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary,” Bioinformatics, vol. 21, no. 19, pp. 3787–3793, 2005.
[5]
Q. Zheng and X. J. Wang, “GOEAST: a web-based software toolkit for gene ontology enrichment analysis,” Nucleic Acids Research, vol. 36, pp. W358–W363, 2008.
[6]
GO-Consortium, “The gene ontology in 2010: extensions and refinements,” Nucleic Acids Research, vol. 38, no. 1, Article ID gkp1018, pp. D331–D335, 2009.
[7]
GO-Consortium, “The gene ontology (GO) project in 2006,” Nucleic Acids Research, vol. 34, pp. D322–D326, 2006.
[8]
S. Carbon, A. Ireland, C. J. Mungall et al., “AmiGO: online access to ontology and annotation data,” Bioinformatics, vol. 25, no. 2, pp. 288–289, 2009.
[9]
C. Pesquita, D. Faria, A. O. Falcão, P. Lord, and F. M. Couto, “Semantic similarity in biomedical ontologies,” PLoS Computational Biology, vol. 5, no. 7, Article ID e1000443, 2009.
[10]
E. Camon, M. Magrane, D. Barrell et al., “The gene ontology annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and interpro,” Genome Research, vol. 13, no. 4, pp. 662–672, 2003.
[11]
E. Camon, D. Barrell, V. Lee, E. Dimmer, and R. Apweiler, “The gene ontology annotation (GOA) database—an integrated resource of GO annotations to the UniProt knowledgebase,” In Silico Biology, vol. 4, no. 1, pp. 5–6, 2004.
[12]
E. Camon, M. Magrane, D. Barrell et al., “The gene ontology annotation (GOA) database: sharing knowledge in UniProt with gene ontology,” Nucleic Acids Research, vol. 32, pp. D262–D266, 2004.
[13]
D. Barrell, E. Dimmer, R. P. Huntley, D. Binns, C. O'Donovan, and R. Apweiler, “The GOA database in 2009—an integrated gene ontology annotation resource,” Nucleic Acids Research, vol. 37, no. 1, pp. D396–D403, 2009.
[14]
E. C. Dimmer, R. P. Huntley, D. G. Barrell, et al., “The gene ontology—providing a functional role in proteomic studies,” Proteomics, vol. 8, supplement 23-24, pp. 2–11, 2008.
[15]
L. N. Soldatova and R. D. King, “Are the current ontologies in biology good ontologies?” Nature Biotechnology, vol. 23, no. 9, pp. 1095–1098, 2005.
[16]
J. Shon, J. Y. Park, and L. Wei, “Beyond similarity-based methods to associate genes for the inference of function,” Drug Discovery Today, vol. 1, no. 3, pp. 89–96, 2003.
[17]
F. Shi, Q. Chen, and X. Niu, “Functional similarity analyzing of protein sequences with empirical mode decomposition,” in Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD '07), vol. 2, pp. 766–770, 2007.
[18]
T. Kambe, T. Suzuki, M. Nagao, and Y. Yamaguchi-Iwai, “Sequence similarity and functional relationship among eukaryotic ZIP and CDF transporters,” Genomics, Proteomics and Bioinformatics, vol. 4, no. 1, pp. 1–9, 2006.
[19]
J. L. Sevilla, V. Segura, A. Podhorski, et al., “Correlation between gene expression and GO semantic similarity,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 4, pp. 330–338, 2005.
[20]
T. J. Hestilow and Y. Huang, “Clustering of gene expression data based on shape similarity,” Eurasip Journal on Bioinformatics and Systems Biology, vol. 2009, Article ID 195712, 2009.
[21]
W. Wang, J. M. Cherry, Y. Nochomovitz, E. Jolly, D. Botstein, and H. Li, “Inference of combinatorial regulation in yeast transcriptional networks: a case study of sporulation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 6, pp. 1998–2003, 2005.
[22]
Z. Wu and M. S. Palmer, “Verb semantics and lexical selection,” in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL '94), pp. 133–138, 1994.
[23]
V. Pekar and S. Staab, “Taxonomy learning: factoring the structure of a taxonomy into a semantic classification decision,” in Proceedings of the 19th International Conference on Computational Linguistics, pp. 1–7, Association for Computational Linguistics, Morristown, NJ, USA, 2002.
[24]
R. Gentleman, Visualizing and Distances Using GO, http://bioconductor.org/packages/2.6/bioc/vignettes/GOstats/inst/doc/GOvis.pdf, 2005.
[25]
S. Benabderrahmane, M. Smail-Tabbone, O. Poch, A. Napoli, and M. D. Devignes, “IntelliGO: a new vector-based semantic similarity measure including annotation origin,” BMC Bioinformatics, vol. 11, p. 588, 2010.
[26]
M. H. Seddiqui and M. Aono, “Metric of intrinsic information content for measuring semantic similarity in an ontology,” in Proceedings of the 7th Asia-Pacific Conference on Conceptual Modelling (APCCM '10), vol. 110, pp. 89–96, Brisbane, Australia, 2010.
[27]
G. Yu, F. Li, Y. Qin, X. Bo, Y. Wu, and S. Wang, “GOSemSim: an R package for measuring semantic similarity among GO terms and gene products,” Bioinformatics, vol. 26, no. 7, Article ID btq064, pp. 976–978, 2010.
[28]
J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, and C. F. Chen, “A new method to measure the semantic similarity of GO terms,” Bioinformatics, vol. 23, no. 10, pp. 1274–1281, 2007.
[29]
P. Resnik, “Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language,” Journal of Artificial Intelligence Research, vol. 11, pp. 95–130, 1999.
[30]
D. Lin, “An information-theoretic definition of similarity,” in Proceedings of the 15th International Conference on Machine Learning, pp. 296–304, 1998.
[31]
A. Schlicker, F. S. Domingues, J. Rahnenführer, and T. Lengauer, “A new measure for functional similarity of gene products based on gene ontology,” BMC Bioinformatics, vol. 7, p. 302, 2006.
[32]
P. Zhang, J. Zhang, H. Sheng, J. J. Russo, B. Osborne, and K. Buetow, “Gene functional similarity search tool (GFSST),” BMC Bioinformatics, vol. 7, p. 135, 2006.
[33]
G. K. Mazandu and N. J. Mulder, “Using the underlying biological organization of the MTB functional network for protein function prediction,” Infection, Genetics and Evolution, vol. 12, no. 5, pp. 922–932, 2011.
[34]
M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitányi, “The similarity metric,” IEEE Transactions on Information Theory, vol. 50, no. 12, pp. 3250–3264, 2004.
[35]
D. Martin, C. Brun, E. Remy, P. Mouren, D. Thieffry, and B. Jacq, “GOToolBox: functional analysis of gene datasets based on gene ontology,” Genome Biology, vol. 5, no. 12, p. R101, 2004.
[36]
C. Pesquita, D. Faria, H. Bastos, A. E. N. Ferreira, A. O. Falcão, and F. M. Couto, “Metrics for GO based protein semantic similarity: a systematic evaluation,” BMC Bioinformatics, vol. 9, supplement 5, p. S4, 2008.
[37]
M. Alvarez, X. Qi, and C. Yan, “A shortest-path graph kernel for estimating gene product semantic similarity,” Journal of Biomedical Semantics, vol. 2, no. 3, pp. 1–9, 2011.
[38]
C. Pesquita, D. Faria, H. Bastos, A. O. Falcão, and F. M. Couto, Evaluating GO-based Semantic Similarity Measures, http://xldb.fc.ul.pt/xldb/publications/Pesquita.etal:EvaluatingGO-basedSemantic:2007_document.pdf, 2007.
[39]
A. Tversky, “Features of similarity,” Psychological Review, vol. 84, no. 4, pp. 327–352, 1977.
[40]
S. Jain and G. D. Bader, “An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology,” BMC Bioinformatics, vol. 11, p. 562, 2010.
[41]
B. Aranda, P. Achuthan, Y. Alam-Faruque et al., “The IntAct molecular interaction database in 2010,” Nucleic Acids Research, vol. 38, no. 1, Article ID gkp878, pp. D525–D531, 2009.
[42]
I. Xenarios, L. Salwinski, X. J. Duan, et al., “DIP, the database of interacting proteins: a research tool for studying cellular networks of protein interactions,” Nucleic Acids Research, vol. 30, no. 1, pp. 303–305, 2002.
[43]
G. D. Bader, I. Donaldson, C. Wolting, B. F. F. Ouellette, T. Pawson, and C. W. V. Hogue, “BIND—the biomolecular interaction network database,” Nucleic Acids Research, vol. 29, no. 1, pp. 242–245, 2001.
[44]
P. Pagel, S. Kovac, M. Oesterheld et al., “The MIPS mammalian protein-protein interaction database,” Bioinformatics, vol. 21, no. 6, pp. 832–834, 2005.
[45]
A. Ceol, C. A. Aryamontri, L. Licata, et al., “Mint, the molecular interaction database: 2009 update,” Nucleic Acids Research, vol. 38, supplement 1, pp. D532–D539, 2010.
[46]
C. Stark, B. J. Breitkreutz, A. Chatr-Aryamontri et al., “The BioGRID interaction database: 2011 update,” Nucleic Acids Research, vol. 39, no. 1, pp. D698–D704, 2011.
[47]
G. K. Mazandu and N. J. Mulder, “Generation and analysis of large-scale data-driven mycobacterium tuberculosis functional networks for drug target identification,” Advances in Bioinformatics, vol. 2011, Article ID 801478, 14 pages, 2011.
[48]
P. Hu, G. Bader, D. A. Wigle, and A. Emili, “Computational prediction of cancer-gene function,” Nature Reviews Cancer, vol. 7, no. 1, pp. 23–34, 2007.
[49]
L. V. Zhang, S. L. Wong, O. D. King, and F. P. Roth, “Predicting co-complexed protein pairs using genomic and proteomic data integration,” BMC Bioinformatics, vol. 5, p. 38, 2004.
[50]
A. Ben-Hur and W. S. Noble, “Choosing negative examples for the prediction of protein-protein interactions,” BMC Bioinformatics, vol. 7, supplement 1, p. S2, 2006.
[51]
T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer, “ROCR: visualizing classifier performance in R,” Bioinformatics, vol. 21, no. 20, pp. 3940–3941, 2005.
[52]
P. H. Guzzi, M. Mina, C. Guerra, and M. Cannataro, “Semantic similarity analysis of protein data: assessment with biological features and issues,” Briefings in Bioinformatics, Advance Access, 17 pages, 2012.
[53]
R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2010.
[54]
R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2011.
[55]
M. Mistry and P. Pavlidis, “Gene ontology term overlap as a measure of gene functional similarity,” BMC Bioinformatics, vol. 9, p. 327, 2008.
[56]
E. B. Camon, D. G. Barrell, E. C. Dimmer et al., “An evaluation of GO annotation retrieval for BioCreAtIvE and GOA,” BMC Bioinformatics, vol. 6, supplement 1, p. S17, 2005.
[57]
C. Pesquita, D. Pessoa, D. Faria, and F. Couto, “CESSM: collaborative evaluation of semantic similarity measures,” in JB2009: Challenges in Bioinformatics, pp. 1–5, 2009.
[58]
E. Jain, A. Bairoch, S. Duvaud et al., “Infrastructure for the life sciences: design and implementation of the UniProt website,” BMC Bioinformatics, vol. 10, p. 136, 2009.
[59]
UniProt-Consortium, “The universal protein resource (UniProt) in 2010,” Nucleic Acids Research, vol. 38, no. 1, Article ID gkp846, pp. D142–D148, 2009.
[60]
P. Flicek, M. R. Amode, D. Barrell et al., “Ensembl 2011,” Nucleic Acids Research, vol. 39, no. 1, pp. D800–D806, 2011.
[61]
R. J. Kinsella, A. Kähäri, S. Haider, et al., “Ensembl biomarts: a hub for data retrieval across taxonomic space,” Database, vol. 2011, Article ID bar030, 2011.
[62]
R. Apweiler, A. Bairoch, C. H. Wu et al., “UniProt: the universal protein knowledgebase,” Nucleic Acids Research, vol. 32, pp. D115–D119, 2004.
[63]
S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
[64]
S. F. Altschul, T. L. Madden, A. A. Schäffer et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
[65]
S. F. Altschul, “Amino acid substitution matrices from an information theoretic perspective,” Journal of Molecular Biology, vol. 219, no. 3, pp. 555–565, 1991.
[66]
G. K. Mazandu and N. J. Mulder, “Scoring protein relationships in functional interaction networks predicted from sequence data,” PLoS ONE, vol. 6, no. 4, Article ID e18607, 2011.
[67]
S. P. Calderon-Copete, G. Wigger, C. Wunderlin et al., “The Mycoplasma conjunctivae genome sequencing, annotation and analysis,” BMC Bioinformatics, vol. 10, supplement 6, p. S7, 2009.
[68]
W. C. Wong, S. Maurer-Stroh, and F. Eisenhaber, “More than 1,001 problems with protein domain databases: transmembrane regions, signal peptides and the issue of sequence homology,” PLoS Computational Biology, vol. 6, no. 7, Article ID e1000867, 2010.