GAGA: A New Algorithm for Genomic Inference of Geographic Ancestry Reveals Fine Level Population Substructure in Europeans

DOI: 10.1371/journal.pcbi.1003480

Attempts to detect genetic population substructure in humans are hampered by the fact that the vast majority of observed genetic variation is present within populations rather than between populations. Here we introduce a new algorithm for transforming a genetic distance matrix that considerably reduces the within-population variation. Extensive computer simulations revealed that the transformed matrix captured genetic population differentiation better than the original matrix, which was based on the T1 statistic. In an empirical genomic data set comprising 2,457 individuals from 23 different European subpopulations, the proportion of individuals whose genetic neighbour was another individual from the same sampling location increased from 25% with the original matrix to 52% with the transformed matrix. Similarly, the percentage of genetic variation explained between populations by Analysis of Molecular Variance (AMOVA) increased from 1.62% to 7.98%. Furthermore, the first two dimensions of a classical multidimensional scaling (MDS) using the transformed matrix explained 15% of the variance, compared to 0.7% obtained with the original matrix. Applying MDS with Mclust, SPA with Mclust, and GemTools to the same data set also showed that the transformed matrix gave a better association of the genetic clusters with the sampling locations, particularly when it was used in the AMOVA framework with a genetic algorithm. Overall, the matrix transformation introduced here substantially reduces the within-population genetic differentiation and can be broadly applied to methods such as AMOVA to enhance their sensitivity for revealing population substructure.
We herewith provide a publicly available, model-free method for improved detection of genetic population substructure that can be applied to data from humans as well as any other species in future studies relevant to evolutionary biology, behavioural ecology, medicine, and forensics.
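
Two of the evaluation measures mentioned in the abstract can be illustrated with a minimal sketch. It does not implement the GAGA matrix transformation itself, which is not described here; it only shows how one might compute, for any given distance matrix, the share of individuals whose nearest genetic neighbour comes from the same sampling location and the share of variance captured by the first two classical MDS dimensions. The function names, the use of NumPy, and the convention of dividing by the sum of positive eigenvalues are assumptions made for illustration, not the authors' code.

```python
import numpy as np


def same_location_neighbour_rate(dist, locations):
    """Fraction of individuals whose nearest neighbour shares their location.

    dist      : (n, n) symmetric distance matrix (hypothetical input)
    locations : length-n sequence of sampling-location labels
    """
    d = np.array(dist, dtype=float)      # copy so the caller's matrix is untouched
    np.fill_diagonal(d, np.inf)          # an individual is not its own neighbour
    nearest = d.argmin(axis=1)           # index of the closest other individual
    locations = np.asarray(locations)
    return float(np.mean(locations[nearest] == locations))


def mds_variance_explained(dist, k=2):
    """Share of variance explained by the first k classical MDS dimensions.

    Assumption: the share is computed over positive eigenvalues only.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering   # double-centred Gram matrix
    eig = np.linalg.eigvalsh(b)[::-1]             # eigenvalues, largest first
    positive = eig[eig > 0]
    return float(positive[:k].sum() / positive.sum())


if __name__ == "__main__":
    # Toy data: two well-separated groups on a line (not real genotypes).
    coords = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
    toy_dist = np.abs(coords[:, None] - coords[None, :])
    labels = ["A", "A", "A", "B", "B", "B"]
    print(same_location_neighbour_rate(toy_dist, labels))
    print(mds_variance_explained(toy_dist, k=2))
```

Running both functions on the original and on a transformed distance matrix would give the kind of before/after comparison reported above (25% versus 52%, and 0.7% versus 15%), assuming the same definitions are used.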

