As next-generation sequencing technologies become more efficient and less expensive, RNA-Seq is becoming a widely used technique for transcriptome studies. Computational analysis of RNA-Seq data often starts with mapping millions of short reads back to the genome or transcriptome, a process in which some reads map equally well to multiple genomic locations (multimapping reads). We have developed the Minimum Unique Length Tool (MULTo), a framework for efficient and comprehensive representation of mappability information that identifies, for each genomic coordinate, the shortest read length at which a read starting there becomes unique in the genome and transcriptome. Using this minimum unique length information, we compared different uniqueness compensation approaches for quantifying transcript expression levels and demonstrate that the best compensation is achieved by discarding multimapping reads and correctly adjusting gene model lengths. We also explored uniqueness within specific regions of the mouse genome and in enhancer mapping experiments. Finally, by making MULTo available to the community, we hope to facilitate the use of uniqueness compensation in RNA-Seq analysis and to eliminate the need to generate additional mappability files.
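The idea of a minimum unique length, and of compensating expression estimates with it, can be illustrated with a small sketch. This is not MULTo's implementation or file format: the function names, the toy sequence, and the brute-force counting are all assumptions made for illustration, and the sketch looks at one strand only (a real mappability resource would also account for the reverse complement and for mismatches allowed by the aligner).

```python
from collections import Counter


def min_unique_lengths(genome, max_len):
    """Hypothetical helper: for each start position, return the shortest
    k <= max_len such that genome[i:i+k] occurs exactly once in genome,
    or None if no such k exists. Brute force, forward strand only."""
    n = len(genome)
    result = [None] * n
    for k in range(1, max_len + 1):
        # Count every k-mer once, then mark positions that just became unique.
        counts = Counter(genome[i:i + k] for i in range(n - k + 1))
        for i in range(n - k + 1):
            if result[i] is None and counts[genome[i:i + k]] == 1:
                result[i] = k
    return result


def unique_positions(mul, start, end, read_len):
    """Number of read start positions in [start, end) whose read of length
    read_len fits in the region and is unique (the adjusted model length)."""
    return sum(
        1
        for i in range(start, end - read_len + 1)
        if mul[i] is not None and mul[i] <= read_len
    )


def compensated_rpk(unique_reads, unique_len):
    """Reads per kilobase using only uniquely mapped reads and the
    uniqueness-adjusted length (assumed form of the compensation)."""
    return unique_reads * 1000 / unique_len if unique_len else 0.0


toy_genome = "ACGTACGTTT"  # made-up sequence, for illustration only
mul = min_unique_lengths(toy_genome, max_len=5)
adjusted_len = unique_positions(mul, 0, len(toy_genome), read_len=3)
print(mul)           # [5, 4, 3, 2, 5, 4, 3, 3, None, None]
print(adjusted_len)  # 4
```

Because the minimum unique length is stored once per coordinate, the same table answers the uniqueness question for any read length up to `max_len`, which is why a single resource of this kind can stand in for separate mappability files per read length.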