|
BMC Bioinformatics 2008
The comparative analysis of statistics, based on the likelihood ratio criterion, in the automated annotation problem

Abstract: A number of new statistics for the automated annotation of biological sequences are introduced. All of these statistics are based on the likelihood ratio criterion. Some of the statistics yield a prediction quality that is significantly higher (up to 1.5 times higher) than the results obtained with the A4-procedure.

Many biological databanks, both those dealing with protein sequences (e.g., SWISS-PROT) and those dealing with nucleotide sequences (e.g., GenBank), contain not only the primary structures of sequences (i.e., sequences of letters: amino acids or nucleotides) but also information about the functions and properties of these sequences. This information is stored in the so-called description fields of the sequences. There are different types of description fields: KW (KeyWords), DE (Descriptions), ..., FT (Feature Table), ...; the elements of description fields are referred to as words. Words from the KW, DE, ... fields describe a sequence as a whole, while words from FT fields correspond to certain positions (letters) of a sequence.

The automated annotation problem can be described as follows. Consider a biological sequence (referred to as a query sequence) with a known primary structure (i.e., letter sequence) but unknown properties and functions (i.e., description fields). The task is to determine the functions and properties of this sequence (in other words, to restore its description fields) on the basis of its primary structure. The annotation should be fully automated. This is the subject of the current paper.

There are two main approaches to the solution of this problem.
In the first approach (which can be called static), a fixed protein classification specified beforehand (a grouping of proteins according to similarity in structure and/or function) is used: for a query protein, a related group (superfamily) is searched for on the basis of primary structure, and the properties and functions of this group are extended to the query protein. An example of this approach is describ
|