RNA-Seq is becoming a promising replacement for microarrays in transcriptome profiling and differential gene expression studies. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the increasing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow was tested by using it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, and the average cost was $3.50 per sample. Using Amazon Web Services as the infrastructure for Stormbow allows us to scale up easily to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost-effective, and open-source-based tool for large-scale RNA-Seq data analysis. Stormbow can be freely downloaded and used out of the box to process Illumina RNA-Seq datasets.
1. Introduction
RNA-Seq is the direct sequencing of transcripts by high-throughput sequencing technology and can profile an entire transcriptome at single-base resolution whilst concurrently quantifying gene expression levels on a genome-wide scale [1–3]. RNA-Seq not only has considerable advantages for examining transcriptome fine structure (for example, in the detection of novel transcripts, allele-specific expression, and alternative splicing) but also provides a far more precise measurement of transcript levels than other methods [4, 5]. With no probes or primers to design, RNA-Seq delivers unbiased and unparalleled information about the transcriptome and gene expression. Early studies have demonstrated that RNA-Seq is very reliable in terms of technical reproducibility [6, 7].
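As a rough illustration of the scale figures quoted in the abstract (178 samples, 6 to 8 hours and about $3.50 per sample), the Python sketch below estimates total cost and wall-clock time for a batch of samples. The function `estimate_run` and its parameters are hypothetical and are not part of Stormbow. It assumes that each worker processes one sample at a time and that the per-sample cost does not depend on how many workers run concurrently.

```python
import math


def estimate_run(n_samples, cost_per_sample, hours_per_sample, n_workers):
    """Estimate total cost and wall-clock hours for a batch of samples.

    Assumptions (illustrative only): every sample takes the same time,
    each worker handles one sample at a time, and cost scales with the
    number of samples rather than with the number of workers.
    """
    total_cost = n_samples * cost_per_sample
    # Samples are processed in rounds of at most n_workers at a time.
    rounds = math.ceil(n_samples / n_workers)
    wall_hours = rounds * hours_per_sample
    return total_cost, wall_hours


# Upper-bound time per sample (8 hours) taken from the abstract.
serial = estimate_run(178, 3.50, 8, n_workers=1)
parallel = estimate_run(178, 3.50, 8, n_workers=178)
print(serial)
print(parallel)
```

Under these assumptions the total cost is the same in both cases, while the wall-clock time drops from the sum of all per-sample times to the time for a single sample. This is the practical argument for processing samples in parallel on on-demand resources.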
Compared to microarray-based profiling, RNA-Seq can detect the expression of low-abundance transcripts and subtle changes under different conditions; has a wider dynamic range; and avoids technical issues in microarrays related to probe performance, such as cross-hybridization, limited detection range of individual probes, and nonspecific hybridization [8, 9]. RNA-Seq is now becoming an attractive approach for profiling gene expression and evaluating differential expression [10–13]. Until recently, sequencing has primarily been carried out in large genome centers which have invested heavily in computational infrastructure
References
[1]
Z. Wang, M. Gerstein, and M. Snyder, “RNA-Seq: a revolutionary tool for transcriptomics,” Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.
[2]
S. Marguerat and J. Bähler, “RNA-seq: from technology to biology,” Cellular and Molecular Life Sciences, vol. 67, no. 4, pp. 569–579, 2010.
[3]
K. O. Mutz, A. Heilkenbrinker, M. Lönne, J. G. Walter, and F. Stahl, “Transcriptome analysis using next-generation sequencing,” Current Opinion in Biotechnology, vol. 24, no. 1, pp. 22–30, 2013.
[4]
P. J. Hurd and C. J. Nelson, “Advantages of next-generation sequencing versus the microarray in epigenetic research,” Briefings in Functional Genomics and Proteomics, vol. 8, no. 3, pp. 174–183, 2009.
[5]
J. H. Malone and B. Oliver, “Microarrays, deep sequencing and the true measure of the transcriptome,” BMC Biology, vol. 9, article 34, 2011.
[6]
L. M. McIntyre, K. K. Lopiano, A. M. Morse et al., “RNA-seq: technical variability and sampling,” BMC Genomics, vol. 12, article 293, 2011.
[7]
J. C. Marioni, C. E. Mason, S. M. Mane, M. Stephens, and Y. Gilad, “RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays,” Genome Research, vol. 18, no. 9, pp. 1509–1517, 2008.
[8]
I. Nookaew, M. Papini, N. Pornputtapong et al., “A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae,” Nucleic Acids Research, vol. 40, no. 20, pp. 10084–10097, 2012.
[9]
M. A. Stalteri and A. P. Harrison, “Interpretation of multiple probe sets mapping to the same gene in Affymetrix GeneChips,” BMC Bioinformatics, vol. 8, article 13, 2007.
[10]
A. Oshlack, M. D. Robinson, and M. D. Young, “From RNA-seq reads to differential expression results,” Genome Biology, vol. 11, no. 12, article 220, 2010.
[11]
J. H. Bullard, E. Purdom, K. D. Hansen, and S. Dudoit, “Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments,” BMC Bioinformatics, vol. 11, article 94, 2010.
[12]
S. Tarazona, F. García-Alcalde, J. Dopazo, A. Ferrer, and A. Conesa, “Differential expression in RNA-seq: a matter of depth,” Genome Research, vol. 21, no. 12, pp. 2213–2223, 2011.
[13]
J. Lee, Y. Ji, S. Liang, G. Cai, and P. Müller, “On differential gene expression using RNA-Seq data,” Cancer Informatics, vol. 10, pp. 205–215, 2011.
[14]
M. Baker, “Next-generation sequencing: adjusting to data overload,” Nature Methods, vol. 7, no. 7, pp. 495–499, 2010.
[15]
M. C. Schatz, B. Langmead, and S. L. Salzberg, “Cloud computing and the DNA data race,” Nature Biotechnology, vol. 28, no. 7, pp. 691–693, 2010.
[16]
U. S. Evani, D. Challis, J. Yu et al., “Atlas2 cloud: a framework for personal genome analysis in the cloud,” BMC Genomics, vol. 13, supplement 6, article S19, 2012.
[17]
M. Garber, M. G. Grabherr, M. Guttman, and C. Trapnell, “Computational methods for transcriptome annotation and quantification using RNA-seq,” Nature Methods, vol. 8, no. 6, pp. 469–477, 2011.
[18]
J. Chen, F. Qian, W. Yan, and B. Shen, “Translational biomedical informatics in the cloud: present and future,” BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.
[19]
L. D. Stein, “The case for cloud computing in genome informatics,” Genome Biology, vol. 11, no. 5, article 207, 2010.
[20]
A. Rosenthal, P. Mork, M. H. Li, J. Stanford, D. Koester, and P. Reynolds, “Cloud computing: a new business paradigm for biomedical information sharing,” Journal of Biomedical Informatics, vol. 43, no. 2, pp. 342–353, 2010.
[21]
D. P. Wall, P. Kudtarkar, V. A. Fusaro, R. Pivovarov, P. Patil, and P. J. Tonellato, “Cloud computing for comparative genomics,” BMC Bioinformatics, vol. 11, article 259, 2010.
[22]
R. S. Thakur, R. Bandopadhyay, B. Chaudhary, and S. Chatterjee, “Now and next-generation sequencing techniques: future of sequence analysis using cloud computing,” Frontiers in Genetics, vol. 3, article 280, 2012.
[23]
M. C. Schatz, “CloudBurst: highly sensitive read mapping with MapReduce,” Bioinformatics, vol. 25, no. 11, pp. 1363–1369, 2009.
[24]
B. Langmead, M. C. Schatz, J. Lin, M. Pop, and S. L. Salzberg, “Searching for SNPs with cloud computing,” Genome Biology, vol. 10, no. 11, article R134, 2009.
[25]
B. Langmead, K. D. Hansen, and J. T. Leek, “Cloud-scale RNA-sequencing differential expression analysis with Myrna,” Genome Biology, vol. 11, no. 8, article R83, 2010.
[26]
S. V. Angiuoli, J. R. White, M. Matalka, O. White, and W. F. Fricke, “Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing,” PLoS ONE, vol. 6, no. 10, Article ID e26624, 2011.
[27]
T. Nguyen, W. Shi, and D. Ruden, “CloudAligner: a fast and full-featured MapReduce based tool for sequence mapping,” BMC Research Notes, vol. 4, article 171, 2011.
[28]
X. Feng, R. Grossman, and L. Stein, “PeakRanger: a cloud-enabled peak caller for ChIP-seq data,” BMC Bioinformatics, vol. 12, article 139, 2011.
[29]
S. Anders and W. Huber, “Differential expression analysis for sequence count data,” Genome Biology, vol. 11, no. 10, article R106, 2010.
[30]
M. D. Robinson, D. J. McCarthy, and G. K. Smyth, “edgeR: a bioconductor package for differential expression analysis of digital gene expression data,” Bioinformatics, vol. 26, no. 1, pp. 139–140, 2010.
[31]
T. J. Hardcastle and K. A. Kelly, “BaySeq: empirical bayesian methods for identifying differential expression in sequence count data,” BMC Bioinformatics, vol. 11, article 422, 2010.
[32]
S. Zhao, K. Prenger, L. Smith et al., “Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing,” BMC Genomics, vol. 14, article 425, 2013.
[33]
Amazon Simple Storage Service (Amazon S3), http://aws.amazon.com/s3/.
J. Hu, H. Ge, M. Newman, and K. Liu, “OSA: a fast and accurate alignment tool for RNA-Seq,” Bioinformatics, vol. 28, no. 14, pp. 1933–1934, 2012.
[36]
C. Trapnell, B. A. Williams, G. Pertea et al., “Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation,” Nature Biotechnology, vol. 28, no. 5, pp. 511–515, 2010.
B. Li and C. N. Dewey, “RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome,” BMC Bioinformatics, vol. 12, article 323, 2011.
B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-efficient alignment of short DNA sequences to the human genome,” Genome Biology, vol. 10, no. 3, article R25, 2009.