The availability of high-throughput parallel methods for sequencing microbial communities is increasing our knowledge of the microbial world at an unprecedented rate. Though most attention has focused on determining lower bounds on the α-diversity, i.e. the total number of different species present in the environment, tight bounds on this quantity may be highly uncertain because a small fraction of the environment could be composed of a vast number of different species. To better assess what remains unknown, we propose instead to predict the fraction of the environment that belongs to unsampled classes. Modeling samples as draws with replacement of colored balls from an urn with an unknown composition, and under the sole assumption that there are still undiscovered species, we show that conditionally unbiased predictors and exact prediction intervals (of constant length on a logarithmic scale) are possible for the fraction of the environment that belongs to unsampled classes. Our predictions are based on a Poissonization argument, which we have implemented in what we call the Embedding algorithm. For fixed, i.e. non-randomized, sample sizes, the algorithm leads to very accurate predictions on a sub-sample of the original sample. We quantify the effect of fixed sample sizes on our prediction intervals and test our methods and others found in the literature against simulated environments, which we devise taking into account datasets from human-gut and human-hand microbiota. Our methodology applies to any dataset that can be conceptualized as a sample with replacement from an urn. For example, it could be applied to quantify the proportion of all the unseen solutions to a binding site problem in a random RNA pool, or to reassess the surveillance of a certain terrorist group by predicting the conditional probability that it deploys a new tactic in its next attack.
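To make the urn model concrete, the following minimal Python sketch simulates draws with replacement from an urn of known composition and computes the quantity of interest: the fraction of the urn belonging to colors absent from the sample. For comparison it also computes the classical Good-Turing estimate (number of colors seen exactly once, divided by the sample size). This is not the Embedding algorithm, whose details are not given here. The urn composition, the sample size, and all function names are assumptions chosen only for illustration.

```python
import random
from collections import Counter


def draw_sample(proportions, n, rng):
    """Draw n balls with replacement; colors are indices into `proportions`."""
    colors = range(len(proportions))
    return rng.choices(colors, weights=proportions, k=n)


def unseen_fraction(proportions, sample):
    """True fraction of the urn belonging to colors absent from the sample."""
    seen = set(sample)
    return sum(p for color, p in enumerate(proportions) if color not in seen)


def good_turing_unseen(sample):
    """Good-Turing estimate: colors observed exactly once, divided by sample size."""
    counts = Counter(sample)
    singletons = sum(1 for k in counts.values() if k == 1)
    return singletons / len(sample)


# Hypothetical urn: 5 dominant colors (90% of the urn) plus 1000 rare colors (10%).
proportions = [0.18] * 5 + [0.10 / 1000] * 1000

rng = random.Random(0)  # fixed seed so the run is reproducible
sample = draw_sample(proportions, 2000, rng)

print("true unseen fraction:  ", unseen_fraction(proportions, sample))
print("Good-Turing estimate:  ", good_turing_unseen(sample))
```

In a real dataset the urn composition is unknown, so only the estimate can be computed. The simulation is useful precisely because it lets the estimate be checked against the true unseen fraction.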