%0 Journal Article
%T A High-Throughput Computational Framework for Identifying Significant Copy Number Aberrations from Array Comparative Genomic Hybridisation Data
%A Ian Roberts
%A Stephanie A. Carter
%A Cinzia G. Scarpini
%A Konstantina Karagavriilidou
%A Jenny C. J. Barna
%A Mark Calleja
%A Nicholas Coleman
%J Advances in Bioinformatics
%D 2012
%I Hindawi Publishing Corporation
%R 10.1155/2012/876976
%X Reliable identification of copy number aberrations (CNA) from comparative genomic hybridisation data would be improved by the availability of a generalised method for processing large datasets. To this end, we developed swatCGH, a data analysis framework and region detection heuristic for computational grids. swatCGH analyses sequentially displaced (sliding) windows of neighbouring probes and applies adaptive thresholds of varying stringency to identify the 10% of each chromosome that contains the most frequently occurring CNAs. We used the method to analyse a published dataset, comparing data preprocessed using four different DNA segmentation algorithms and two methods for prioritising the detected CNAs. The consolidated list of the most commonly detected aberrations confirmed the value of swatCGH as a simplified high-throughput method for identifying biologically significant CNA regions of interest.
1. Introduction
Correlating specific genomic copy number aberrations (CNA) with disease is an important and challenging first step in biomarker discovery [1]. Detecting CNAs that define genomic regions of interest using array comparative genomic hybridisation (aCGH) requires precise integration of probe signal amplitude, the size (i.e., width) of the copy number imbalanced region, and the frequency of imbalance across a sample set, all referenced to relevant clinico-pathologic features. There are two broad methods of aCGH data interpretation for biomarker discovery. The first, exemplified by the R Bioconductor package cghMCR [2], identifies regions showing the most frequent CNAs within a sample set, ranked by average signal amplitude. This approach to prioritisation may under-call low-prevalence, high-level CNAs, such as homozygous deletions or gene amplifications that occur in small subsets of the samples analysed.
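The under-calling risk of frequency-based prioritisation can be illustrated with a minimal Python sketch. It is not the cghMCR implementation: the function name, the log2-ratio gain threshold, the minimum frequency, and the toy data are all hypothetical, and only gains are considered.
```python
# Illustrative sketch only; every name, threshold and value here is an assumption.
def frequent_regions(log2_ratios, threshold=0.3, min_frequency=0.3):
    """Keep regions gained in enough samples, ranked by mean amplitude.
    log2_ratios maps a region name to a non-empty list of per-sample log2 ratios.
    Returns (region, gain_frequency, mean_amplitude) tuples, highest amplitude first."""
    kept = []
    for region, values in log2_ratios.items():
        # Fraction of samples whose log2 ratio exceeds the assumed gain threshold.
        frequency = sum(v > threshold for v in values) / len(values)
        if frequency >= min_frequency:
            # Average signal amplitude across all samples for this region.
            kept.append((region, frequency, sum(values) / len(values)))
    kept.sort(key=lambda item: item[2], reverse=True)
    return kept
# Toy data: ten samples and three hypothetical regions.
toy = {
    "low_level_common": [0.4] * 6 + [0.0] * 4,  # low-level gain in 6 of 10 samples
    "moderate_common": [0.8] * 4 + [0.0] * 6,   # moderate gain in 4 of 10 samples
    "high_level_rare": [2.5] + [0.0] * 9,       # amplification in 1 of 10 samples
}
ranking = frequent_regions(toy)
for region, frequency, amplitude in ranking:
    print(f"{region}: frequency={frequency:.1f}, mean amplitude={amplitude:.2f}")
```
In this toy set, the region amplified in a single sample never reaches the ranking because it fails the frequency filter, although its amplitude is the highest of the three.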
The second method, targeted gene identification, exemplified by the genome topography scanning (GTS) algorithm [3] and the Genomic Identification of Significant Targets in Cancer (GISTIC) module [4], is designed to localise regions of copy number imbalance most likely to be of functional significance. The GTS method models CNAs using parameters of signal intensity, region width, and recurrence across a sample set, moderated by gene content. While this approach is able to identify significant regions of imbalance in heterogeneous samples, it relies on prior knowledge. GISTIC calculates the background rate of random chromosomal aberrations and identifies regions that are aberrant more often than would be expected by chance, with greater weight given to high
%U http://www.hindawi.com/journals/abi/2012/876976/