OALib Journal期刊
ISSN: 2333-9721
费用：99美元

投递稿件

查看量	下载量

相关文章
更多...

Algorithms for Molecular Biology 2011

ReCoil - an algorithm for compression of extremely large datasets of dna data

DOI: 10.1186/1748-7188-6-23

Vladimir Yanovsky

Full-Text Cite this paper Add to My Lib

Abstract:

While compression based on encoding mismatches between the dataset and a similar reference can yield high compression rate, good quality reference sequence may be unavailable. Instead, ReCoil's compression is based on encoding the differences between similar or overlapping reads. As such reads may appear at large distances from each other in the dataset and since random access memory is a limited resource, ReCoil is designed to work efficiently in external memory, leveraging high bandwidth of modern hard disk drives.High speeds and relatively low prices of High Throughput Sequencing (HTS) technologies led to their widespread use for various kinds of applications, making it important to store high volumes of generated sequencing data efficiently.Given a genetic sequence, an HTS sequencer outputs a set of subsequences, called reads, of the sequence. Unlike the more expensive sequencing technologies that were used, for example, for the Human Genome Project, the HTS reads usually have higher error rate and shorter lengths. On the other hand, the datasets produced by an HTS sequencer are usually of high coverage, i.e. they can have many different reads overlapping at each position, making them highly compressible. In this work we address the problem of compression of datasets of HTS reads.Previous research [1] show that for the sequence of a human genome it is hard to achieve compression rate significantly better than a trivial two bits per nucleotide. Hence the algorithms for compression of HTS datasets must take advantage of the self-similarity due to read overlaps. One difficulty that must be overcome is that for huge HTS datasets similar or overlapping reads can be at great distance from each other in the input and splitting it into smaller chunks will miss these similarities. Hence our goal was a compression algorithm that works on the whole dataset at once, using external memory without a significant hit in performance.DNA sequence contains a large number of approx

Full-Text

Contact Us

service@oalib.com

QQ:3279437679

WhatsApp +8615387084133