|
BMC Bioinformatics 2007
Background frequencies for residue variability estimates: BLOSUM revisited

Abstract: In this work we suggest generalizing Shannon's expression to a function with similar mathematical properties that, at the same time, includes the observed propensities of residue types to mutate to each other. To do that, we revisit the original construction of BLOSUM matrices and re-interpret them as mutation probability matrices. These probabilities are then used as background frequencies in the revised residue conservation measure.

We show that joint entropy with BLOSUM-proportional probabilities as a reference distribution enables detection of protein functional sites comparable in quality to a time-costly maximum-likelihood evolution simulation method (rate4site), and offers greater resolution than the Shannon entropy alone, in particular when the available sequences are of narrow evolutionary scope.

As groundwork for the mutational study of a protein, many researchers will choose the comparative analysis of the protein's homologues. Column entropy in the multiple sequence alignment [1,2] has proven itself over time as a workhorse of such endeavors, giving an excellent estimate of residue variability and proving difficult to beat in terms of its predictive power. One of its limitations, which we address in this paper, is its inability to differentiate between amino acid residue types. For example, its straightforward application is blind to the fact that an isoleucine, a residue of a type that mutates easily, when found conserved over a large evolutionary distance, should appear more conspicuous than a conserved proline.
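As a minimal illustration of this limitation, the sketch below computes the plain Shannon entropy of single alignment columns. It is not the revised measure proposed in this paper; the function name `column_entropy` and the toy columns are hypothetical, and gaps are assumed to be absent.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the residue types in one alignment column.

    `column` is a string with one residue letter per aligned sequence.
    Assumption for this sketch: the column contains no gap characters.
    """
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Toy columns (hypothetical data): two fully conserved, one variable.
conserved_ile = "IIIIIIII"
conserved_pro = "PPPPPPPP"
variable = "ILVMILVM"

for name, col in [("Ile", conserved_ile), ("Pro", conserved_pro), ("mixed", variable)]:
    print(name, column_entropy(col))
```

Both conserved columns receive the same entropy of zero, regardless of how readily the residue type mutates, while the variable column scores higher.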
Shannon's entropy is unable to distinguish between the two cases, and thus its resolution stops at the level of residues which are completely conserved across the aligned homologue set. This is illustrated in Figure 1, where entropy (green dashed line) is compared with a prediction from a detailed simulation of evolutionary events, provided by the rate4site program [3] (red thick full line; the thin line gives a preview of the method described
|