MOSAIK is a stable, sensitive and open-source program for mapping second- and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific Biosciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm, a method well suited to capturing mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs), and generates outputs tailored to aid SV discovery. All variant discovery benefits from an accurate description of read placement confidence. To this end, MOSAIK uses a neural-network-based training scheme to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient greater than 0.98 between MOSAIK-assigned and actual mapping qualities. To support studies of any genome, a training pipeline is provided to optimize mapping quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into our command and pipeline launcher system GKNO (http://gkno.me).
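To make the phrase "hash clustering strategy coupled with the Smith-Waterman algorithm" concrete, the following Python sketch illustrates the general seed, cluster and align idea. It is not MOSAIK's implementation: the function names, the k-mer size `k`, the clustering tolerance `max_shift`, the window padding `pad` and the scoring values are all assumptions chosen for illustration, and the alignment uses a simple linear gap penalty.

```python
from collections import defaultdict

def build_hash_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def best_cluster(read, index, k, max_shift=3):
    """Cluster k-mer hits by diagonal (ref_pos - read_pos); return the
    diagonal range (lowest, highest) of the largest cluster, or None."""
    diagonals = sorted(
        ref_pos - read_pos
        for read_pos in range(len(read) - k + 1)
        for ref_pos in index.get(read[read_pos:read_pos + k], ())
    )
    if not diagonals:
        return None
    clusters = [[diagonals[0]]]
    for d in diagonals[1:]:
        if d - clusters[-1][-1] <= max_shift:  # small shifts tolerate short indels
            clusters[-1].append(d)
        else:
            clusters.append([d])
    best = max(clusters, key=len)
    return best[0], best[-1]

def smith_waterman(read, window, match=2, mismatch=-1, gap=-2):
    """Local alignment score and exclusive end position within the window."""
    rows, cols = len(read) + 1, len(window) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_end = 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if read[i - 1] == window[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + step,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_end = H[i][j], j
    return best, best_end

def map_read(read, reference, k=5, pad=5):
    """Seed with hashes, pick the densest cluster, then align only that window.
    Returns (score, exclusive end position in the reference) or None."""
    index = build_hash_index(reference, k)
    cluster = best_cluster(read, index, k)
    if cluster is None:
        return None
    lowest, highest = cluster
    start = max(0, lowest - pad)
    end = min(len(reference), highest + len(read) + pad)
    score, end_in_window = smith_waterman(read, reference[start:end])
    return score, start + end_in_window

# Hypothetical toy data: a read copied from reference[10:30] with one substitution.
reference = "TTGACCGTAAGCTAGGCTAACGTTAGCCATGACGGATCCA"
exact_read = reference[10:30]
mutated_read = exact_read[:10] + "T" + exact_read[11:]
print(map_read(exact_read, reference))
print(map_read(mutated_read, reference))
```

Restricting the comparatively expensive Smith-Waterman step to the window suggested by the hash cluster is what keeps this kind of approach tractable, while the dynamic programming step is what allows mismatches and short insertions or deletions to be recovered.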