CallSim: Evaluation of Base Calls Using Sequencing Simulation

DOI: 10.5402/2012/371718

Abstract:

Accurate base calls generated from sequencing data are required for downstream biological interpretation, particularly in the case of rare variants. CallSim is a software application that provides evidence for the validity of base calls believed to be sequencing errors, and it is applicable to Ion Torrent and 454 data. The algorithm processes a single read using a Monte Carlo approach to sequencing simulation and does not depend on information from any other read in the data set. Three examples, drawn from general read correction and from error-or-variant classification, demonstrate its effectiveness as a robust base corrector for low-volume read processing. Specifically, correction of errors in Ion Torrent reads from a study involving mutations in multidrug-resistant Staphylococcus aureus illustrates an ability to classify an erroneous homopolymer call. In addition, support for a rare variant in 454 data for a mixed viral population demonstrates “base rescue” capabilities. CallSim provides evidence regarding the validity of base calls in sequences produced by 454 or Ion Torrent systems and is intended for hands-on downstream processing analysis. These downstream efforts, although time-consuming, are necessary steps for accurate identification of rare variants.

1. Introduction

Accurate base calling in high-throughput DNA sequencing can be a very challenging task [1, 2], since errors of either biological or technical origin can be introduced. Post-processing of the read data can mitigate some of these errors, though various error types can remain in the output. The sequencing technologies of interest here are those that involve sequential flows of each nucleotide, in particular the Roche 454 and Ion Torrent systems. For 454, a pyrosequencing [3] approach is used, while Ion Torrent technology detects changes in pH during base incorporation [4]. One well-known source of error in these systems is incomplete extension [5]. That is, a single base or a base within a homopolymeric region might not be incorporated during a flow and is instead added during the next flow of the same nucleotide. This dephasing phenomenon, illustrated in Figure 1, accumulates as the number of flows increases and perturbs the experimental (measured) signal in the flowgram. An incorrect base call or insertion/deletion (indel) call occurs when this signal perturbation is large enough to cause an incorrect determination of the number of bases incorporated into the DNA molecule during a flow.

Figure 1: Illustration of the simulated DNA molecules and the polymerase position. Only a single
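
The effect of incomplete extension on a flowgram can be illustrated with a small Monte Carlo sketch. This is not CallSim's implementation, only a toy model under stated assumptions: the function names, the flow order "TACG", the number of simulated molecules, and the per-base failure probability `p_incomplete` are all hypothetical choices made for illustration. Many copies of one template are extended flow by flow; at each base, a copy may fail to incorporate and then lags until the next flow of that nucleotide. The averaged signal is then rounded to call homopolymer lengths.

```python
import random


def simulate_flowgram(template, flow_order="TACG", n_flows=16,
                      n_molecules=2000, p_incomplete=0.0, seed=0):
    """Average per-flow incorporation over many copies of one template.

    Toy model (assumption): each attempted base incorporation fails
    independently with probability p_incomplete.
    """
    rng = random.Random(seed)
    positions = [0] * n_molecules  # polymerase position on each copy
    signal = []
    for flow in range(n_flows):
        base = flow_order[flow % len(flow_order)]
        incorporated = 0
        for m in range(n_molecules):
            pos = positions[m]
            # Extend through the homopolymer run matching the flowed base.
            while pos < len(template) and template[pos] == base:
                if rng.random() < p_incomplete:
                    break  # copy falls behind until the next flow of this base
                pos += 1
                incorporated += 1
            positions[m] = pos
        signal.append(incorporated / n_molecules)
    return signal


def call_bases(signal, flow_order="TACG"):
    """Call the homopolymer length in each flow by rounding the signal."""
    calls = []
    for i, value in enumerate(signal):
        calls.append(flow_order[i % len(flow_order)] * int(value + 0.5))
    return "".join(calls)


template = "TTACGGA"
ideal = simulate_flowgram(template, p_incomplete=0.0)
noisy = simulate_flowgram(template, p_incomplete=0.1)
print("ideal:", [round(v, 2) for v in ideal[:8]], call_bases(ideal))
print("noisy:", [round(v, 2) for v in noisy[:8]], call_bases(noisy))
```

With `p_incomplete=0.0` every flow yields an integer signal and the template is recovered exactly. With a nonzero failure probability, the first T flow drops below 2 and signal appears in the second T flow, where the ideal flowgram has none; in longer reads such carry-over accumulates and can push a flow across a rounding boundary.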

References

[1]  M. L. Metzker, “Sequencing technologies the next generation,” Nature Reviews Genetics, vol. 11, no. 1, pp. 31–46, 2010.
[2]  J. M. Perkel, “Sanger Who? sequencing the next generation,” Science, vol. 324, no. 5924, pp. 275–279, 2009.
[3]  M. Ronaghi, “Pyrosequencing sheds light on DNA sequencing,” Genome Research, vol. 11, no. 1, pp. 3–11, 2001.
[4]  J. M. Rothberg, W. Hinz, T. M. Rearick et al., “An integrated semiconductor device enabling non-optical genome sequencing,” Nature, vol. 475, no. 7356, pp. 348–352, 2011.
[5]  W. Brockman, P. Alvarez, S. Young et al., “Quality scores and SNP detection in sequencing-by-synthesis systems,” Genome Research, vol. 18, no. 5, pp. 763–770, 2008.
[6]  R. Nielsen, “Genomics: in search of rare human variants,” Nature, vol. 467, no. 7319, pp. 1050–1051, 2010.
[7]  M. Gerlinger, A. J. Rowan, S. Horswell et al., “Intratumor heterogeneity and branched evolution revealed by multiregion sequencing,” The New England Journal of Medicine, vol. 366, pp. 883–892, 2012.
[8]  S. P. Shah, A. Roth, R. Goya, et al., “The clonal and mutational evolution spectrum of primary triple-negative breast cancers,” Nature, vol. 486, pp. 395–399, 2012.
[9]  M. R. Henn, C. L. Boutwell, P. Charlebois, et al., “Whole genome deep sequencing of HIV-1 reveals the impact of early minor variants upon immune recognition during acute infection,” PLoS Pathogens, vol. 8, article e1002529, 2012.
[10]  O. Zagordi, A. Bhattacharya, N. Eriksson, and N. Beerenwinkel, “ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data,” BMC Bioinformatics, vol. 12, article 119, 2011.
[11]  A. R. Quinlan, D. A. Stewart, M. P. Strömberg, and G. T. Marth, “Pyrobayes: an improved base caller for SNP discovery in pyrosequences,” Nature Methods, vol. 5, no. 2, pp. 179–181, 2008.
[12]  F. Meacham, D. Boffelli, J. Dhahbi, D. I. Martin, M. Singer, and L. Pachter, “Identification and correction of systematic error in high-throughput sequence data,” BMC Bioinformatics, vol. 12, article 451, 2011.
[13]  L. Ilie, F. Fazayeli, and S. Ilie, “HiTEC: accurate error correction in high-throughput sequencing data,” Bioinformatics, vol. 27, no. 3, pp. 295–302, 2011.
[14]  P. Skums, Z. Dimitrova, D. Campo et al., “Efficient error correction for next-generation sequencing of viral amplicons,” BMC Bioinformatics, vol. 13, supplement 10, article S6, 2012.
[15]  C. Quince, A. Lanzén, T. P. Curtis et al., “Accurate determination of microbial diversity from 454 pyrosequencing data,” Nature Methods, vol. 6, no. 9, pp. 639–641, 2009.
[16]  B. P. Howden, C. R. E. McEvoy, D. L. Allen et al., “Evolution of multidrug resistance during Staphylococcus aureus infection involves mutation of the essential two component regulator WalKR,” PLoS Pathogens, vol. 7, article e1002359, 2011.
[17]  A. R. Macalalad, M. C. Zody, P. Charlebois et al., “Highly sensitive and specific detection of rare variants in mixed viral populations from massively parallel sequence data,” PLoS Computational Biology, vol. 8, article e1002417, 2012.
[18]  N. Metropolis and S. Ulam, “The Monte Carlo method,” Journal of the American Statistical Association, vol. 44, no. 247, pp. 335–341, 1949.
[19]  SRA Toolkit, http://www.ncbi.nlm.nih.gov/.
[20]  JfreeChart library, http://www.jfree.org/jfreechart/.
[21]  A. Mellmann, D. Harmsen, C. A. Cummings et al., “Prospective genomic characterization of the German enterohemorrhagic Escherichia coli O104:H4 outbreak by rapid next generation sequencing technology,” PLoS ONE, vol. 6, no. 7, article e22751, 2011.
[22]  Sequence Read Archive, http://sra.dnanexus.com/.
[23]  MUMmer 3, “Ultra-fast alignment of large-scale DNA and protein sequences,” http://mummer.sourceforge.net/.
[24]  H. Rohde, et al., “Open-source genomic analysis of shiga-toxin-producing E. coli O104:H4,” The New England Journal of Medicine, vol. 365, pp. 718–724, 2011.
[25]  D. Li, F. Xi, M. Zhao, et al., “Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011): genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen,” GigaScience. In press.
[26]  Bowtie, http://bowtie-bio.sourceforge.net/index.shtml/.
[27]  “Integrative genomics viewer,” http://www.broadinstitute.org/igv/.
