%0 Journal Article
%T CallSim: Evaluation of Base Calls Using Sequencing Simulation
%A Jarrett D. Morrow
%A Brandon W. Higgs
%J ISRN Bioinformatics
%D 2012
%R 10.5402/2012/371718
%X Accurate base calls generated from sequencing data are required for downstream biological interpretation, particularly in the case of rare variants. CallSim is a software application that provides evidence for the validity of base calls believed to be sequencing errors, and it is applicable to Ion Torrent and 454 data. The algorithm processes a single read using a Monte Carlo approach to sequencing simulation and does not depend on information from any other read in the data set. Three examples from general read correction, as well as from error-or-variant classification, demonstrate its effectiveness as a robust base corrector for low-volume read processing. Specifically, correction of errors in Ion Torrent reads from a study involving mutations in multidrug-resistant Staphylococcus aureus illustrates an ability to classify an erroneous homopolymer call. In addition, support for a rare variant in 454 data for a mixed viral population demonstrates “base rescue” capabilities. CallSim provides evidence regarding the validity of base calls in sequences produced by 454 or Ion Torrent systems and is intended for hands-on downstream processing analysis. These downstream efforts, although time consuming, are necessary steps for accurate identification of rare variants.

1. Introduction

Accurate base calling in high-throughput DNA sequencing can be a very challenging task [1, 2], where errors of either biological or technical origin can be introduced. Methods for post-processing of the read data can help mitigate some of these errors, though various error types can remain in the output. The sequencing technologies that involve sequential flows of each nucleotide are of interest here, in particular the Roche 454 and Ion Torrent systems.
For 454, a pyrosequencing [3] approach is used, while Ion Torrent technology detects changes in pH during base incorporation [4]. One well-known source of error in these systems is incomplete extension [5]. That is, a single base or a base within a homopolymeric region might not be incorporated during a flow and is instead added during the next flow of the same nucleotide. This dephasing phenomenon, illustrated in Figure 1, accumulates as the number of flows increases and perturbs the experimental (measured) signal in the flowgram. An incorrect base call or insertion/deletion (indel) call occurs when this signal perturbation is large enough to cause an incorrect determination of the number of bases incorporated in the DNA molecule during a flow.

Figure 1: Illustration of the simulated DNA molecules and the polymerase position. Only a single
%U http://www.hindawi.com/journals/isrn.bioinformatics/2012/371718/
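
The dephasing effect described in the introduction can be illustrated with a toy Monte Carlo simulation. The sketch below is not CallSim's algorithm; it is a minimal illustration under stated assumptions: the function name `simulate_flowgram`, the flow order `TACG`, the parameter `p_incomplete`, and the all-or-nothing stalling of a homopolymer run are all hypothetical choices made for this example. It tracks the polymerase position on a population of identical molecules and reports, for each flow, the mean number of bases incorporated per molecule.

```python
import random


def simulate_flowgram(read, flow_order="TACG", n_flows=8,
                      n_molecules=2000, p_incomplete=0.0, seed=0):
    """Toy flow-based sequencing simulation (illustrative assumptions only).

    Each molecule keeps its own polymerase position on `read`. In a flow,
    a molecule whose next bases match the flowed nucleotide either
    incorporates the whole homopolymer run or, with probability
    `p_incomplete`, incorporates nothing and waits for the next flow of
    the same nucleotide (incomplete extension).
    """
    rng = random.Random(seed)
    positions = [0] * n_molecules
    signal = []
    for flow in range(n_flows):
        nucleotide = flow_order[flow % len(flow_order)]
        incorporated = 0
        for m in range(n_molecules):
            start = positions[m]
            end = start
            # Measure the homopolymer run of the flowed nucleotide.
            while end < len(read) and read[end] == nucleotide:
                end += 1
            run_length = end - start
            if run_length == 0:
                continue  # nothing to incorporate in this flow
            if rng.random() < p_incomplete:
                continue  # stalled: this molecule falls out of phase
            positions[m] = end
            incorporated += run_length
        # Mean bases incorporated per molecule stands in for the signal.
        signal.append(incorporated / n_molecules)
    return signal


ideal = simulate_flowgram("TTACGGA", p_incomplete=0.0)
noisy = simulate_flowgram("TTACGGA", p_incomplete=0.05)
for flow, (a, b) in enumerate(zip(ideal, noisy)):
    print(f"flow {flow}: ideal={a:.3f} dephased={b:.3f}")
```

With `p_incomplete=0.0` every flow value is a whole number of bases. With a nonzero value, signal leaks into later flows of the same nucleotide, which is the kind of perturbation that can shift a homopolymer call. Real flowgrams also include effects this sketch ignores, such as measurement noise.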