Abstract:
This paper studies the problem of testing whether a function is monotone from a nonparametric Bayesian perspective. Two new families of tests are constructed. The first uses constrained smoothing splines, together with a hierarchical stochastic-process prior that explicitly controls the prior probability of monotonicity. The second uses regression splines, together with two proposals for the prior over the regression coefficients. The finite-sample performance of the tests is shown via simulation to improve upon existing frequentist and Bayesian methods. The asymptotic properties of the Bayes factor for comparing monotone versus non-monotone regression functions in a Gaussian model are also studied. Our results significantly extend those currently available, which chiefly focus on determining the dimension of a parametric linear model.

Abstract:
In this article we describe Bayesian nonparametric procedures for two-sample hypothesis testing. Namely, given two sets of samples $\mathbf{y}^{\scriptscriptstyle(1)}\;\stackrel{\scriptscriptstyle{iid}}{\sim}\;F^{\scriptscriptstyle(1)}$ and $\mathbf{y}^{\scriptscriptstyle(2)}\;\stackrel{\scriptscriptstyle{iid}}{\sim}\;F^{\scriptscriptstyle(2)}$, with $F^{\scriptscriptstyle(1)},F^{\scriptscriptstyle(2)}$ unknown, we wish to evaluate the evidence for the null hypothesis $H_0:F^{\scriptscriptstyle(1)}\equiv F^{\scriptscriptstyle(2)}$ versus the alternative $H_1:F^{\scriptscriptstyle(1)}\neq F^{\scriptscriptstyle(2)}$. Our method is based upon a nonparametric P\'{o}lya tree prior centered either subjectively or using an empirical procedure. We show that the P\'{o}lya tree prior leads to an analytic expression for the marginal likelihood under the two hypotheses and hence an explicit measure of the probability of the null $\mathrm{Pr}(H_0|\{\mathbf{y}^{\scriptscriptstyle(1)},\mathbf{y}^{\scriptscriptstyle(2)}\})$.
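
The closed-form marginal likelihood mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: the tree is centered at a standard normal distribution (so the data are taken to be roughly standardized), the Beta parameters at depth $m$ are $c\,m^2$, the tree is truncated at a finite depth, and the two hypotheses have equal prior probability by default. All function names are hypothetical. Factors that depend only on the centering density are omitted because they are identical under $H_0$ and $H_1$ and cancel in the ratio.

```python
import math
from statistics import NormalDist


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def bin_counts(u, level):
    """Count the points of u (values in [0, 1]) in each of the 2**level dyadic bins."""
    n_bins = 2 ** level
    counts = [0] * n_bins
    for x in u:
        # Clamp so that x == 1.0 falls in the last bin.
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    return counts


def log_marginal(u, depth, c=1.0):
    """Log marginal likelihood of a truncated Polya tree, up to terms that
    depend only on the centering density (they cancel in the Bayes factor)."""
    total = 0.0
    for level in range(1, depth + 1):
        a = c * level ** 2  # assumed Beta(a, a) prior at this depth
        counts = bin_counts(u, level)
        # Each consecutive pair of bins is the (left, right) split of one parent node.
        for k in range(0, len(counts), 2):
            left, right = counts[k], counts[k + 1]
            total += log_beta(a + left, a + right) - log_beta(a, a)
    return total


def polya_tree_posterior_h0(y1, y2, depth=8, c=1.0, prior_h0=0.5):
    """Posterior probability of H0 (both samples share one distribution)."""
    centre = NormalDist()  # assumed centering distribution
    u1 = [centre.cdf(v) for v in y1]
    u2 = [centre.cdf(v) for v in y2]
    log_m0 = log_marginal(u1 + u2, depth, c)  # one tree for the pooled data
    log_m1 = log_marginal(u1, depth, c) + log_marginal(u2, depth, c)  # two trees
    log_odds = math.log(prior_h0) - math.log(1.0 - prior_h0) + log_m0 - log_m1
    # Numerically stable logistic transform of the posterior log odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

Calling `polya_tree_posterior_h0(y1, y2)` on two lists of floats returns a number in $[0,1]$. With no data it returns the prior probability, and one would expect it to fall towards zero as the two samples become more clearly different.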

Abstract:
The regression discontinuity (RD) design is a popular approach to causal inference in non-randomized studies because it can be used to identify and estimate causal effects under mild conditions. Specifically, for each subject, the RD design assigns a treatment or non-treatment, depending on whether or not an observed value of an assignment variable exceeds a fixed and known cutoff value. In this paper, we propose a Bayesian nonparametric regression modeling approach to RD designs, which exploits a local randomization feature. In this approach, the assignment variable is treated as a covariate, and a scalar-valued confounding variable (which may be a multivariate confounder score) is treated as a dependent variable. Then, over the model's posterior distribution of locally-randomized subjects that cluster around the cutoff of the assignment variable, inferences for causal effects are made within this random cluster, via two-group statistical comparisons of treatment outcomes and non-treatment outcomes. We illustrate the Bayesian nonparametric approach through the analysis of a real educational data set, to investigate the causal link between basic skills and teaching ability.

Abstract:
In this article, we propose a new method for the fundamental task of testing for dependence between two groups of variables. The response densities under the null hypothesis of independence and the alternative hypothesis of dependence are specified by nonparametric Bayesian models. Under the null hypothesis, the joint distribution is modeled by the product of two independent Dirichlet Process Mixture (DPM) priors; under the alternative, the full joint density is modeled by a multivariate DPM prior. The test is then based on the posterior probability of the alternative hypothesis. The proposed test not only performs well against other popular nonparametric tests when testing for linear dependence, but is also preferred to other methods for many of the nonlinear dependencies we explored. In the analysis of gene expression data, we compare different methods for testing pairwise dependence between genes. The results show that the proposed test identifies some dependence structures that are not detected by other tests.

Abstract:
In this paper we study Bayesian answers to testing problems when the hypotheses are not well separated and propose a general approach with a special focus on testing shape constraints. We then apply our method to several testing problems, including testing for positivity and monotonicity in a nonparametric regression setting. For each of these problems, we show that our approach leads to the optimal separation rate of testing, which indicates that our tests have the best power. To our knowledge, separation rates have not been studied in the Bayesian literature so far.

Abstract:
A key problem in statistical modeling is model selection: how to choose a model at an appropriate level of complexity. This problem appears in many settings, most prominently in choosing the number of clusters in mixture models or the number of factors in factor analysis. In this tutorial we describe Bayesian nonparametric methods, a class of methods that side-steps this issue by allowing the data to determine the complexity of the model. This tutorial is a high-level introduction to Bayesian nonparametric methods and contains several examples of their application.
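
As a rough illustration of how the data can determine model complexity, the sketch below draws cluster labels from a Chinese restaurant process, one standard Bayesian nonparametric construction. The abstract does not name this construction, so its use here is an assumption, and the function name `sample_crp` and the concentration parameter `alpha` are hypothetical choices for the example. The number of clusters is not fixed in advance: each new observation joins an existing cluster with probability proportional to its size, or opens a new one with probability proportional to `alpha`.

```python
import random


def sample_crp(n, alpha, rng):
    """Draw cluster labels for n observations from a Chinese restaurant process."""
    cluster_sizes = []  # cluster_sizes[k] = number of observations in cluster k
    labels = []
    for i in range(n):
        # Total unnormalized weight: i existing observations plus alpha for a new cluster.
        r = rng.random() * (i + alpha)
        chosen = len(cluster_sizes)  # default: open a new cluster
        cumulative = 0.0
        for k, size in enumerate(cluster_sizes):
            cumulative += size
            if r < cumulative:
                chosen = k
                break
        if chosen == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[chosen] += 1
        labels.append(chosen)
    return labels
```

For example, `len(set(sample_crp(1000, 1.0, random.Random(0))))` gives the number of clusters occupied in one draw. That number is random and tends to grow slowly as `n` increases.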

Abstract:
A Bayesian approach to the classification problem is proposed in which random partitions play a central role. It is argued that the partitioning approach has the capacity to take advantage of a variety of large-scale spatial structures, if they are present in the unknown regression function $f_0$. An idealized one-dimensional problem is considered in detail. The proposed nonparametric prior uses random split points to partition the unit interval into a random number of pieces. This prior is found to provide a consistent estimate of the regression function in the $L^p$ topology, for any $1 \leq p < \infty$, and for arbitrary measurable $f_0:[0,1] \to [0,1]$. A Markov chain Monte Carlo (MCMC) implementation is outlined and analyzed. Simulation experiments are conducted to show that the proposed estimate compares favorably with a variety of conventional estimators. A striking resemblance between the posterior mean estimate and the bagged CART estimate is noted and discussed. For higher dimensions, a generalized prior is introduced which employs a random Voronoi partition of the covariate-space. The resulting estimate displays promise on a two-dimensional problem, and extends with a minimum of additional computational effort to arbitrary metric spaces.

Abstract:
We present clustering methods for multivariate data exploiting the underlying geometry of the graphical structure between variables. As opposed to standard approaches that assume known graph structures, we first estimate the edge structure of the unknown graph using Bayesian neighborhood selection approaches, wherein we account for the uncertainty of graphical structure learning through model-averaged estimates of the suitable parameters. Subsequently, we develop a nonparametric graph clustering model on the lower dimensional projections of the graph based on Laplacian embeddings using Dirichlet process mixture models. In contrast to standard algorithmic approaches, this fully probabilistic approach allows incorporation of uncertainty in estimation and inference for both graph structure learning and clustering. More importantly, we formalize the arguments for Laplacian embeddings as suitable projections for graph clustering by providing theoretical support for the consistency of the eigenspace of the estimated graph Laplacians. We develop fast computational algorithms that allow our methods to scale to a large number of nodes. Through extensive simulations we compare our clustering performance with standard clustering methods. We apply our methods to a novel pan-cancer proteomic data set, and evaluate protein networks and clusters across multiple cancer types.

Abstract:
Data analysis sometimes requires the relaxation of parametric assumptions in order to gain modeling flexibility and robustness against mis-specification of the probability model. In the Bayesian context, this is accomplished by placing a prior distribution on a function space, such as the space of all probability distributions or the space of all regression functions. Unfortunately, posterior distributions ranging over function spaces are highly complex and hence sampling methods play a key role. This paper provides an introduction to a simple, yet comprehensive, set of programs for the implementation of some Bayesian nonparametric and semiparametric models in R, DPpackage. Currently, DPpackage includes models for marginal and conditional density estimation, receiver operating characteristic curve analysis, interval-censored data, binary regression data, item response data, longitudinal and clustered data using generalized linear mixed models, and regression data using generalized additive models. The package also contains functions to compute pseudo-Bayes factors for model comparison and for eliciting the precision parameter of the Dirichlet process prior, and a general purpose Metropolis sampling algorithm. To maximize computational efficiency, the actual sampling for each model is carried out using compiled C, C++ or Fortran code.

Abstract:
In recent years, Bayesian nonparametric statistics has attracted extraordinary attention. Nonetheless, relatively little work has been devoted to Bayesian nonparametric hypothesis testing. In this paper, a novel Bayesian nonparametric approach to the two-sample problem is established. Precisely, given two samples $\mathbf{X}=X_1,\ldots,X_{m_1} \overset{i.i.d.}{\sim} F$ and $\mathbf{Y}=Y_1,\ldots,Y_{m_2} \overset{i.i.d.}{\sim} G$, with $F$ and $G$ being unknown continuous cumulative distribution functions, we wish to test the null hypothesis $\mathcal{H}_0:~F=G$. The method is based on the Kolmogorov distance and approximate samples from the Dirichlet process centered at the standard normal distribution with concentration parameter 1. It is demonstrated that the proposed test is robust with respect to any prior specification of the Dirichlet process. A power comparison with several well-known tests is included. In particular, the proposed test dominates the standard Kolmogorov-Smirnov test in all the cases examined in the paper.
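
The Kolmogorov distance used above is the largest vertical gap between two cumulative distribution functions. The sketch below computes it for the empirical distribution functions of two samples. It covers only this one ingredient and is not the paper's procedure, which also involves approximate samples from a Dirichlet process. The function name is hypothetical.

```python
def kolmogorov_distance(xs, ys):
    """Supremum over t of |F_m(t) - G_n(t)| for the empirical CDFs of xs and ys."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(xs), sorted(ys)
    m, n = len(xs), len(ys)
    i = j = 0
    distance = 0.0
    # Walk through the pooled sample in increasing order. Once one sample is
    # exhausted, the gap can only shrink, so the loop may stop there.
    while i < m and j < n:
        t = min(xs[i], ys[j])
        # Move past every observation equal to t in both samples (handles ties).
        while i < m and xs[i] == t:
            i += 1
        while j < n and ys[j] == t:
            j += 1
        distance = max(distance, abs(i / m - j / n))
    return distance
```

The result lies in $[0,1]$: it is 0 when the two empirical distribution functions coincide and 1 when every value in one sample lies below every value in the other.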