%0 Journal Article
%T MS4 - Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences
%A Eduardo Corel
%A Florian Pitschi
%A Ivan Laprevotte
%A Gilles Grasseau
%A Gilles Didier
%A Claudine Devauchelle
%J BMC Bioinformatics
%D 2010
%I BioMed Central
%R 10.1186/1471-2105-11-406
%X Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is here replaced by the average repetitivity κ of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on usually discarded repetitive regions (like the non-coding part of LTR). The method MS4 satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The only parameter of MS4, κ, seems to give reasonable results even at its default value, which can be a great advantage for sequence sets for which little information is available. The classification of biological sequences is one of the fundamental tasks of bioinformatics, and faces special challenges in the genomic and post-genomic era. While it is a classical paradigm to base it on an initial multiple alignment of the sequences, a current trend is to provide alignment-free classification methods (subword-based [1], kernel-based [2], composition vector-based [3,4]...), in order to tackle datasets that are not amenable to multiple sequence alignment (MSA) methods. Approaches based on k-mers have also been used for more than a decade to detect anchoring zones for whole genome alignments [5-8]. In this paper, we describe a method for the alignment-free classification of families of nucleic or protein sequences (composed of a few hundred members). Our aim is to rapidly detect similarity segments shared by these sequences without having to consider the order in which they occur inside the sequences.
Our approach allows us to take into account shuffled domains as well as repeated segments. The lo
%U http://www.biomedcentral.com/1471-2105/11/406