|
Exploring protein structural dissimilarity to facilitate structure classification

Abstract: We compute a coefficient of dissimilarity (Ω) between proteins, based on structural and sequence-based descriptors characterising their constituent secondary structure elements (SSEs). For a set of 1,661 pairs of proteins with sequence identity up to 35%, the performance of Ω in predicting shared Class, Fold and Super-family levels is comparable to that of the DaliLite Z score, and Ω shows a greater than four-fold increase in the true positive rate (TPR) for proteins sharing the Family level. On a larger set of 600 domains representing 200 families, the performance of the Z score in predicting a shared Family improves, but it still achieves only about half the TPR of Ω. The TPR for structures sharing a Super-family is lower than in the first dataset, but Ω performs slightly better than the Z score. Overall, the sensitivity of Ω in predicting a common Fold level is higher than that of the DaliLite Z score.

Classification to a deeper level in the hierarchy is specific and difficult, so the efficiency of Ω may be attractive to the curators and end-users of SCOP. We suggest that Ω may be a better measure for structure classification than the DaliLite Z score, with the caveat that we are currently restricted to comparing structures with an equal number of SSEs.

The increased pace of protein structure determination, due to high-throughput, synchrotron-based X-ray crystallography and multi-dimensional NMR, promises rapid growth in the number of known protein structures [1-3]. Comparison and classification of newly resolved structures contributes to our understanding of the structural architecture, evolution and function of proteins, especially those with low sequence identity to well characterised proteins [4,5].
This information is important for the identification of new protein folds, drug discovery, and phylogenetic analysis of the proteome.

Classification schemes, such as SCOP (Structural Classification Of Proteins) [6] and CATH [7], are well established. SCOP is a curated database and probably the leading classif
|