|
A Biological Sequence Compression Based on Cross Chromosomal Similarities Using Variable length LUT

Keywords: Biological Sequences, Chromosome, Cross Chromosomal Similarity, Compression Gain, Prediction.

Abstract: While modern hardware can provide vast amounts of inexpensive storage for biological databases, the compression of biological sequences is still of paramount importance because reducing disk traffic makes search and retrieval operations faster. This issue becomes even more important in light of the recent increase in very large data sets, such as metagenomes. Existing biological sequence compression algorithms work by finding similar repeated regions within a biological sequence and then encoding these repeated regions together. Previous research on chromosome sequence similarity reveals that the similar repeated regions within one chromosome account for only about 4.5% of the total sequence length. Because these repeated regions are so short, the compression gain is often not high. It is well recognized, however, that similarities also exist among different chromosome sequences, which implies that similar repeated sequences can be found across chromosomes as well as within them. Here, we apply cross-chromosomal similarity to biological sequence compression. The length and location of similar repeated regions among the different biological sequences are studied. It is found that the average percentage of similar subsequences between two chromosome sequences is about 10%, of which 8% comes from cross-chromosomal prediction and 2% from self-chromosomal prediction. Across the different biological sequences studied, the percentage of similar subsequences is about 18%, of which only 1.2% comes from self-chromosomal prediction while the rest comes from cross-chromosomal prediction. This suggests that cross-chromosomal similarities are significant, in addition to self-chromosomal similarities, in biological sequence compression. An additional 23% of storage space could be saved on average by using both self-chromosomal and cross-chromosomal predictions in compressing the different biological sequences.
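
The distinction between self-chromosomal and cross-chromosomal prediction can be illustrated with a minimal Python sketch. It is not the method of this paper: it assumes exact matching of fixed-length k-mers, and the function names `kmer_set` and `similarity_breakdown` and the parameter `k` are hypothetical. It only reports what fraction of a target sequence is covered by a k-mer seen earlier in the same sequence (self) and what further fraction is covered only by k-mers present in a second, reference sequence (cross).

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity_breakdown(target, reference, k=4):
    """Return (self_fraction, cross_fraction) for target.

    self_fraction:  share of target positions covered by a k-mer that
                    already occurred earlier in target.
    cross_fraction: share of target positions covered only by k-mers
                    that occur in reference (and not counted as self).
    Exact k-mer matching is an illustrative assumption.
    """
    n = len(target)
    if n == 0:
        return 0.0, 0.0

    ref_kmers = kmer_set(reference, k)
    seen = set()                 # k-mers seen so far in target
    self_cov = [False] * n       # positions predictable from target itself
    cross_cov = [False] * n      # positions predictable from reference

    for i in range(n - k + 1):
        kmer = target[i:i + k]
        if kmer in seen:
            for j in range(i, i + k):
                self_cov[j] = True
        elif kmer in ref_kmers:
            for j in range(i, i + k):
                cross_cov[j] = True
        seen.add(kmer)

    self_count = sum(self_cov)
    cross_count = sum(1 for s, c in zip(self_cov, cross_cov) if c and not s)
    return self_count / n, cross_count / n


if __name__ == "__main__":
    # Toy sequences, not real chromosome data.
    print(similarity_breakdown("ACGTACGT", "TTTTTTTT", k=4))
    print(similarity_breakdown("GGGGACGT", "ACGTACGT", k=4))
```

In the first toy call the repeat lies inside the target, so all coverage is attributed to self prediction; in the second the only match is found in the reference, so all coverage is attributed to cross prediction. A real compressor would use approximate matching and much longer sequences.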
|