|
BMC Bioinformatics 2008
Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sitesAbstract: A-GLAM, a user-friendly computer program for identifying sequence motifs, now incorporates a Bayesian model systematically combining sequence and positional information. A-GLAM's predictions with and without positional information were compared on two human TFBS datasets, each containing sequences corresponding to the interval [-2000, 0] bases upstream of a known TSS. A rigorous statistical analysis showed that positional information significantly improved the prediction of sequence motifs, and an extensive cross-validation study showed that A-GLAM's model was robust against mild misspecification of its parameters. As expected, when sequences in the datasets were successively truncated to the intervals [-1000, 0], [-500, 0] and [-250, 0], positional information aided motif prediction less and less, but never hurt it significantly.Although sequence truncation is a viable strategy when searching for biologically active motifs with a positional preference, a probabilistic model (used reasonably) generally provides a superior and more robust strategy, particularly when the sequence motifs' positional preferences are not well characterized.Transcription factor binding sites (TFBSs) provide a specific example of biologically functional sequence motifs that sometimes have positional preferences. TFBSs contribute substantially to the control of gene expression, and because of their biological importance, much experimental effort has been expended in identifying them. Because experimental identification is expensive, there are now many computational tools that identify TFBSs as the subsequences, or "motifs", common to a set of sequences. Most TFBSs correspond to short and imprecise motifs [1], however, so all computational tools in a recent contest performed rather poorly in identifying known TFBSs [2].Although some tools have an ad hoc basis [3-5], other tools have a basis in the calculus of probability, and can therefore immediately and systematically combine sequence with
|