Standard statistical approaches for prioritizing variants for functional testing in fine-mapping studies either rank variants by marginal association statistics or estimate the posterior probability that each variant is causal under simplifying assumptions. Here, we present a probabilistic framework that integrates association strength with functional genomic annotation data to improve accuracy in selecting plausible causal variants for functional validation. A key feature of our approach is that it empirically estimates the contribution of each functional annotation to the trait of interest directly from summary association statistics while allowing for multiple causal variants at any risk locus. We devise efficient algorithms that estimate the parameters of our model jointly across all risk loci to further increase performance. Using simulations starting from the 1000 Genomes data, we find that our framework consistently outperforms current state-of-the-art fine-mapping methods, reducing the number of variants that must be selected to capture 90% of the causal variants from an average of 13.3 SNPs per locus (for the next-best performing strategy) to 10.4. Furthermore, we introduce a cost-to-benefit optimization framework for determining the number of variants to be followed up in functional assays and assess its performance using real and simulated data. We validate our findings using a large-scale meta-analysis of four blood lipid traits and find that the relative probability of causality is increased for variants in exons and transcription start sites and decreased in repressed genomic regions at the risk loci of these traits. Using these highly predictive, trait-specific functional annotations, we estimate causality probabilities across all traits and variants, reducing the size of the 90% confidence set from an average of 17.5 to 13.5 variants per locus in these data.
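To make the notion of a 90% set concrete, the following is a minimal sketch of one common way to build such a set from per-variant posterior probabilities at a single locus: rank variants by posterior and keep the top-ranked ones until they account for 90% of the total posterior mass. The function name `credible_set`, the `coverage` parameter, and the example posteriors are hypothetical and chosen for illustration. This is not necessarily the exact procedure used in this work. Because multiple causal variants are allowed, the posteriors need not sum to one, so the target is taken relative to their sum.

```python
def credible_set(posteriors, coverage=0.9):
    """Return indices of the top-ranked variants whose summed posterior
    reaches `coverage` times the total posterior mass at the locus.

    Illustrative assumption: `posteriors` holds one per-variant posterior
    probability of causality; the values may sum to more than 1 when a
    locus harbours several causal variants.
    """
    if not posteriors:
        return []
    # Rank variants from highest to lowest posterior (ties keep input order).
    order = sorted(range(len(posteriors)), key=lambda i: -posteriors[i])
    target = coverage * sum(posteriors)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += posteriors[i]
        # Small tolerance guards against floating-point round-off.
        if mass >= target - 1e-12:
            break
    return selected


# Hypothetical locus with four variants.
print(credible_set([0.05, 0.6, 0.1, 0.25]))
```

In this toy locus, three of the four variants are needed to reach 90% of the posterior mass. Sharper posteriors, for example from informative functional annotations, would shrink the set.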