Reliable identification of copy number aberrations (CNA) from comparative genomic hybridisation data would be improved by the availability of a generalised method for processing large datasets. To this end, we developed swatCGH, a data analysis framework and region detection heuristic for computational grids. swatCGH analyses sequentially displaced (sliding) windows of neighbouring probes and applies adaptive thresholds of varying stringency to identify the 10% of each chromosome that contains the most frequently occurring CNAs. We used the method to analyse a published dataset, comparing data preprocessed using four different DNA segmentation algorithms and two methods for prioritising the detected CNAs. The consolidated list of the most commonly detected aberrations confirmed the value of swatCGH as a simplified high-throughput method for identifying biologically significant CNA regions of interest.
1. Introduction
Correlating specific genomic copy number aberrations (CNA) with disease is an important and challenging first step in biomarker discovery [1]. Detecting CNAs that define genomic regions of interest using array comparative genomic hybridisation (aCGH) requires precise integration of probe signal amplitude, the size (i.e., width) of the copy number imbalanced region, and the frequency of imbalance across a sample set, all referenced to relevant clinico-pathologic features. There are two broad methods of aCGH data interpretation for biomarker discovery. The first, exemplified by the R Bioconductor package cghMCR [2], identifies regions showing the most frequent CNAs within a sample set, ranked by average signal amplitude. This approach to prioritisation may under-call low-prevalence, high-level CNAs, such as homozygous deletions or gene amplifications that occur in small subsets of the samples analysed.
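The under-calling risk of frequency-first prioritisation can be illustrated with a toy example. The sketch below is not the cghMCR algorithm; it is a minimal illustration that assumes a fixed log2-ratio call threshold and a minimum recurrence cut-off. The function names (`aberration_frequency`, `mean_amplitude`, `prioritise`), both parameter values, and the data are hypothetical.

```python
# Toy frequency-first prioritisation. Illustrative only: this is NOT the
# cghMCR algorithm. Thresholds, function names, and data are hypothetical.

def aberration_frequency(log_ratios, threshold=0.3):
    """Per probe, the fraction of samples whose |log2 ratio| exceeds threshold.

    log_ratios: list of samples, each a list of per-probe log2 ratios.
    """
    n_samples = len(log_ratios)
    n_probes = len(log_ratios[0])
    return [
        sum(1 for sample in log_ratios if abs(sample[p]) > threshold) / n_samples
        for p in range(n_probes)
    ]


def mean_amplitude(log_ratios, threshold=0.3):
    """Per probe, the mean |log2 ratio| over aberrant samples (0.0 if none)."""
    amplitudes = []
    for p in range(len(log_ratios[0])):
        calls = [abs(s[p]) for s in log_ratios if abs(s[p]) > threshold]
        amplitudes.append(sum(calls) / len(calls) if calls else 0.0)
    return amplitudes


def prioritise(log_ratios, threshold=0.3, min_recurrence=0.5):
    """Keep probes aberrant in at least min_recurrence of the samples,
    then rank the survivors by mean amplitude (highest first)."""
    freq = aberration_frequency(log_ratios, threshold)
    amp = mean_amplitude(log_ratios, threshold)
    kept = [p for p in range(len(freq)) if freq[p] >= min_recurrence]
    return sorted(kept, key=lambda p: amp[p], reverse=True)


# Five samples, four probes. Probe 0 carries a modest gain in four samples;
# probe 2 carries a high-level amplification in a single sample.
samples = [
    [0.5, 0.0, 0.0, 0.0],
    [0.4, 0.0, 0.0, 0.0],
    [0.5, 0.0, 3.0, 0.0],
    [0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

frequencies = aberration_frequency(samples)
selected = prioritise(samples)
```

In this toy data the recurrence filter keeps probe 0 and discards probe 2, even though probe 2 has by far the largest amplitude. This is the kind of low-prevalence, high-level event that a frequency-first ranking can miss.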
The second method, targeted gene identification, exemplified by the genome topography scanning (GTS) algorithm [3] and Genomic Identification of Significant Targets in Cancer (GISTIC) module [4], is designed to localize regions of copy number imbalance most likely to be of functional significance. The GTS method models CNAs using parameters of signal intensity, region width and recurrence across a sample set, moderated by gene content. While this approach is able to identify significant regions of imbalance in heterogeneous samples, it relies on prior knowledge. GISTIC calculates the background rate of random chromosomal aberrations and identifies regions that are aberrant more often than would be expected by chance, with greater weight given to high
References
[1] A. Kallioniemi, “CGH microarrays and cancer,” Current Opinion in Biotechnology, vol. 19, no. 1, pp. 36–40, 2008.
[2] A. J. Aguirre, C. Brennan, G. Bailey et al., “High-resolution characterization of the pancreatic adenocarcinoma genome,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 24, pp. 9067–9072, 2004.
[3] R. Wiedemeyer, C. Brennan, T. P. Heffernan et al., “Feedback circuit among INK4 tumor suppressors constrains human glioblastoma development,” Cancer Cell, vol. 13, no. 4, pp. 355–364, 2008.
[4] R. Beroukhim, G. Getz, L. Nghiemphu et al., “Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 50, pp. 20007–20012, 2007.
[5] F. Sanchez-Garcia, U. D. Akavia, E. Mozes, and D. Pe'er, “JISTIC: identification of significant targets in cancer,” BMC Bioinformatics, vol. 11, p. 189, 2010.
[6] R. Chari, W. W. Lockwood, and W. L. Lam, “Computational methods for the analysis of array comparative genomic hybridization,” Cancer Informatics, vol. 2, pp. 48–58, 2006.
[7] D. Pinkel and D. G. Albertson, “Comparative genomic hybridization,” Annual Review of Genomics and Human Genetics, vol. 6, pp. 331–354, 2005.
[8] R. C. Gentleman, V. J. Carey, D. M. Bates et al., “Bioconductor: open software development for computational biology and bioinformatics,” Genome Biology, vol. 5, no. 10, p. R80, 2004.
[9] A. B. Olshen, E. S. Venkatraman, R. Lucito, and M. Wigler, “Circular binary segmentation for the analysis of array-based DNA copy number data,” Biostatistics, vol. 5, no. 4, pp. 557–572, 2004.
[10] E. S. Venkatraman and A. B. Olshen, “A faster circular binary segmentation algorithm for the analysis of array CGH data,” Bioinformatics, vol. 23, no. 6, pp. 657–663, 2007.
[11] P. Hupé, N. Stransky, J. P. Thiery, F. Radvanyi, and E. Barillot, “Analysis of array CGH data: from signal ratio to gain and loss of DNA regions,” Bioinformatics, vol. 20, no. 18, pp. 3413–3422, 2004.
[12] J. Fridlyand, A. M. Snijders, D. Pinkel, D. G. Albertson, and A. N. Jain, “Hidden Markov models approach to the analysis of array CGH data,” Journal of Multivariate Analysis, vol. 90, no. 1, pp. 132–153, 2004.
[13] J. C. Marioni, N. P. Thorne, and S. Tavaré, “BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data,” Bioinformatics, vol. 22, no. 9, pp. 1144–1146, 2006.
[14] D. Thain, T. Tannenbaum, and M. Livny, “Distributed computing in practice: the Condor experience,” Concurrency and Computation: Practice and Experience, vol. 17, no. 2–4, pp. 323–356, 2005.
[15] B. P. P. van Houte, T. W. Binsl, H. Hettling, W. Pirovano, and J. Heringa, “CGHnormaliter: an iterative strategy to enhance normalization of array CGH data with imbalanced aberrations,” BMC Genomics, vol. 10, p. 401, 2009.