|
BMC Bioinformatics 2008
NestedMICA as an ab initio protein motif discovery tool

Abstract: In general, NestedMICA recovered most of the short (3–9 amino acid long) test protein motifs spiked into a test set of sequences at different frequencies. We also showed that it can be used to find multiple motifs at the same time. In all the assessment experiments we carried out, its overall motif discovery performance was better than that of MEME.

NestedMICA proved itself to be a robust and sensitive ab initio protein motif finder, even for relatively short motifs that exist in only a small fraction of sequences.

NestedMICA is available under the Lesser GPL open-source license from: http://www.sanger.ac.uk/Software/analysis/nmica/

Discovering linear sequence motifs common to a set of protein sequences has long been an important problem in biology. It is possible to check whether a set of proteins contains a known sequence motif by searching protein motif or domain databases. Databases including Pfam [1], the Eukaryotic Linear Motif database (ELM) [2], Prosite [3] and ScanSite [4] contain sequence motifs and domains in the form of regular expressions or profile HMMs. However, these resources cannot be used to discover a novel or unannotated sequence motif that is suspected to be a common feature of a given protein set. While new protein domains such as those contained in Pfam can be defined from alignments of evolutionarily related sequences, the identification of short sequence motifs, potentially shared between proteins that appear evolutionarily unrelated, is much harder.

To tackle this problem, several multiple alignment approaches [5,6] have been proposed. One such tool, Dilimot [7], is a recent protein motif search tool that aims to find relatively short overrepresented motifs by aligning only sequence regions that are likely to contain a linear motif.
It filters out regions, including globular domains and coiled-coil regions, that are reported or predicted by other algorithms, before searching for known motifs in several protein databases such as Pfam, and fina
|