Abstract:
It has long been recognized that statistical dependence in data poses a significant challenge to large-scale multiple testing. Failure to take the dependence into account can result in a severe drop in the performance of multiple testing procedures. In particular, the detection power of large-scale multiple tests is known to suffer when the False Discovery Proportion must be controlled. However, it often happens that the dependence structure is unknown and only a single, albeit very high dimensional, observation of the test statistic vector is available. This makes large-scale multiple testing under dependence considerably harder. Our work addresses this problem for the case of a stationary, ergodic signal vector with low signal strength and known noise distribution. Our main contribution in this setting is a new approach for improved recovery of a long sequence of dependent binary signals embedded in noisy observations. The novel aspect of our approach is the approximation and numerical computation of the posterior probabilities of binary signals at individual sites of the process, drawing strength from observations at nearby sites without assuming the availability of their joint prior distribution. Although we only consider signal vectors registered as a time series, the approach may in principle apply to random fields as well.

Abstract:
An important aspect of multiple hypothesis testing is controlling the significance level, or the level of Type I error. When the test statistics are not independent it can be particularly challenging to deal with this problem without resorting to very conservative procedures. In this paper we show that, in the context of contemporary multiple testing problems, where the number of tests is often very large, the difficulties caused by dependence are less serious than in classical cases. This is particularly true when the null distributions of test statistics are relatively light-tailed, for example, when they can be based on Normal or Student's $t$ approximations. In that setting, if the test statistics can fairly be viewed as being generated by a linear process, an analysis founded on the incorrect assumption of independence is asymptotically correct as the number of hypotheses diverges. In particular, the point process representing the null distribution of the indices at which statistically significant test results occur is approximately Poisson, just as in the case of independence. The Poisson process also has the same mean as in the independence case, and of course exhibits no clustering of false discoveries. However, this result can fail if the null distributions are particularly heavy-tailed. There, clusters of statistically significant results can occur even when the null hypothesis is correct. We give an intuitive explanation for these disparate properties in the light- and heavy-tailed cases, and provide rigorous theory underpinning the intuition.

Abstract:
This paper studies the problem of high-dimensional multiple testing and sparse recovery from the perspective of sequential analysis. In this setting, the probability of error is a function of the dimension of the problem. A simple sequential testing procedure is proposed. We derive necessary conditions for reliable recovery in the non-sequential setting and contrast them with sufficient conditions for reliable recovery using the proposed sequential testing procedure. Applications of the main results to several commonly encountered models show that sequential testing can be exponentially more sensitive to the difference between the null and alternative distributions (in terms of the dependence on dimension), implying that subtle cases can be much more reliably determined using sequential methods.
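The abstract does not spell out the proposed sequential procedure, but the basic mechanism it exploits, deciding each coordinate with a data-dependent number of samples rather than a fixed one, can be illustrated with a classical per-coordinate sequential probability ratio test. The Gaussian means, boundaries, and function name below are illustrative assumptions, not the paper's construction:

```python
def sprt(sample, mu0=0.0, mu1=1.0, sigma=1.0, a=-4.0, b=4.0, max_steps=1000):
    """Sequential probability ratio test sketch for N(mu0, sigma^2) vs
    N(mu1, sigma^2): keep drawing observations and updating the
    log-likelihood ratio until it crosses the lower boundary (accept
    the null) or the upper boundary (accept the alternative)."""
    llr = 0.0
    for n in range(1, max_steps + 1):
        x = sample()
        # log [f1(x) / f0(x)] for Gaussian densities with common sigma
        llr += ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
        if llr >= b:
            return "alternative", n
        if llr <= a:
            return "null", n
    return "undecided", max_steps
```

The point of contrast with non-sequential testing is that easy coordinates terminate after a handful of observations, so the sample budget concentrates on the subtle ones.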

Abstract:
Test statistics are often strongly dependent in large-scale multiple testing applications. Most corrections for multiplicity are unduly conservative for correlated test statistics, resulting in a loss of power to detect true positives. We show that the Westfall--Young permutation method has asymptotically optimal power for a broad class of testing problems with a block-dependence and sparsity structure among the tests, when the number of tests tends to infinity.
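For concreteness, a single-step max-T variant of the Westfall--Young permutation idea can be sketched as follows; the two-sample $t$-statistic setup, function names, and parameters are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def max_t_adjusted_pvalues(X, groups, n_perm=500, seed=0):
    """Westfall--Young style permutation adjustment (single-step max-T
    variant). X is an (n_samples, n_tests) data matrix and groups a
    binary label vector; permuting the labels preserves the correlation
    among the test statistics, which is what makes the method less
    conservative than Bonferroni under dependence."""
    rng = np.random.default_rng(seed)

    def t_stats(g):
        a, b = X[g == 0], X[g == 1]
        num = a.mean(0) - b.mean(0)
        den = np.sqrt(a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
        return np.abs(num / den)

    obs = t_stats(groups)
    max_null = np.empty(n_perm)
    for i in range(n_perm):
        # null distribution of the maximum statistic over all tests
        max_null[i] = t_stats(rng.permutation(groups)).max()
    # adjusted p-value: fraction of permutations whose max exceeds each
    # observed statistic (with the usual +1 finite-sample correction)
    return (1 + (max_null[:, None] >= obs[None, :]).sum(0)) / (n_perm + 1)
```

The full Westfall--Young procedure is typically run step-down on min-p or max-T; the single-step version above keeps the sketch short while showing how the permutation null absorbs the dependence.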

Abstract:
In this article, we consider the problem of simultaneous testing of hypotheses when the individual test statistics are not necessarily independent. Specifically, we consider the problem of simultaneous testing of point null hypotheses against two-sided alternatives about the mean parameters of normally distributed random variables. We assume that conditionally given the vector of means, these random variables jointly follow a multivariate normal distribution with a known but arbitrary covariance matrix. We consider a Bayesian framework where each unknown mean is modeled via a two-component point mass mixture prior, whereby unconditionally the test statistics jointly have a mixture of multivariate normal distributions. A new testing procedure is developed that uses the dependence among the test statistics and works in a step-down-like manner. This procedure is general enough to be applied even to non-normal data. A decision theoretic justification in favor of the proposed testing procedure has been provided by showing that, unlike the traditional $p$-value based stepwise procedures, this new method possesses a certain convexity property which is essential for the admissibility of a multiple testing procedure with respect to the vector risk function. Consistent estimation of the unknown proportion of alternative hypotheses and of the variance of the distribution of the non-zero means is theoretically investigated. An alternative representation of the proposed test statistics has also been established, resulting in a great reduction in computational complexity. It is demonstrated through extensive simulations that, for various forms of dependence and a wide range of sparsity levels, the proposed testing procedure compares quite favourably, in terms of overall misclassification probability, with several existing multiple testing procedures available in the literature.

Abstract:
The performance of multiple hypothesis testing is known to be affected by the statistical dependence among the random variables involved. The mechanisms responsible for this, however, are not well understood. We study the effects of the dependence structure of a finite state hidden Markov model (HMM) on the likelihood ratios critical for optimal multiple testing on the hidden states. Various convergence results are obtained for the likelihood ratios as the observations of the HMM form an increasingly long chain. Analytic expansions of the first and second order derivatives are obtained for the case of binary states, explicitly showing the effects of the parameters of the HMM on the likelihood ratios.
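The hidden-state posteriors that such likelihood ratios determine are computable by the standard forward-backward recursion. The sketch below, for a binary-state HMM with Gaussian emissions (the means, transition matrix, and function name are illustrative assumptions, not the paper's model), shows how information from the whole chain feeds into the posterior at each site:

```python
import numpy as np

def hmm_state_posteriors(x, trans, init, mu=(0.0, 2.0), sigma=1.0):
    """Forward-backward posteriors P(theta_t = s | x_1,...,x_n) for a
    two-state HMM with Gaussian emissions N(mu[s], sigma^2). These
    posteriors are monotone transforms of the likelihood ratios that
    drive optimal multiple testing on the hidden states."""
    x = np.asarray(x, float)
    n = len(x)
    # emission likelihoods up to a constant, shape (n, 2)
    e = np.exp(-0.5 * ((x[:, None] - np.asarray(mu)[None, :]) / sigma) ** 2)
    alpha = np.zeros((n, 2))
    beta = np.ones((n, 2))
    alpha[0] = init * e[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):                      # forward pass
        alpha[t] = (alpha[t - 1] @ trans) * e[t]
        alpha[t] /= alpha[t].sum()             # rescale for stability
    for t in range(n - 2, -1, -1):             # backward pass
        beta[t] = trans @ (beta[t + 1] * e[t + 1])
        beta[t] /= beta[t].sum()
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)
```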

Abstract:
Recently, an exact binomial test called SGoF (Sequential Goodness-of-Fit) has been introduced as a new method for handling high dimensional testing problems. SGoF looks for statistical significance when comparing the number of null hypotheses individually rejected at level γ = 0.05 with the number expected under the intersection null, and then proceeds to declare a number of effects accordingly. SGoF detects an increasing proportion of true effects with the number of tests, unlike other methods for which the opposite is true. It is worth mentioning that the choice γ = 0.05 is not essential to the SGoF procedure, and more power may be reached at other values of γ depending on the situation. In this paper we enhance the possibilities of SGoF by letting γ vary over the whole interval (0,1). In this way, we introduce the ‘SGoFicance Trace’ (from SGoF's significance trace), a graphical complement to SGoF which can help to make decisions in multiple-testing problems. A script has been written for the computation in R of the SGoFicance Trace. This script is available from the web site http://webs.uvigo.es/acraaj/SGoFicance.htm.
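The binomial metatest at the core of SGoF can be sketched in a few lines. This is one common reading of the rule, for a fixed γ: count the p-values below γ, compare that count with its Binomial(n, γ) null distribution, and declare effects for the excess over the one-sided critical value. The function names and the exact rule for the number of declared effects are illustrative assumptions, not the authors' R implementation:

```python
from math import comb

def binom_sf(k, n, p):
    """Exact upper tail P(Bin(n, p) >= k)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def sgof(pvalues, gamma=0.05, alpha=0.05):
    """SGoF-style metatest sketch: if the observed count of p-values
    below gamma significantly exceeds the Binomial(n, gamma) null,
    declare the excess over the critical count as effects (attached to
    the smallest p-values)."""
    n = len(pvalues)
    r = sum(1 for p in pvalues if p <= gamma)
    # smallest count b whose one-sided binomial tail is at most alpha
    b = next(k for k in range(n + 2) if binom_sf(k, n, gamma) <= alpha)
    return max(r - b + 1, 0)
```

Sweeping `gamma` over (0,1) and plotting the resulting counts is essentially what the SGoFicance Trace displays.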

Abstract:
The severity of type II errors is frequently ignored when deriving a multiple testing procedure, even though utilizing it properly can greatly help in making correct decisions. This paper puts forward a theory behind developing a multiple testing procedure that can incorporate the type II error severity and is optimal in the sense of minimizing a measure of false non-discoveries among all procedures controlling a measure of false discoveries. The theory is developed under a general model allowing arbitrary dependence by taking a compound decision theoretic approach to multiple testing with a loss function incorporating the type II error severity. We present this optimal procedure in its oracle form and offer numerical evidence of its superior performance over relevant competitors.

Abstract:
Large-scale multiple testing tasks often exhibit dependence, and leveraging the dependence between individual tests remains a challenging and important problem in statistics. With recent advances in graphical models, it is feasible to use them to perform multiple testing under dependence. We propose a multiple testing procedure which is based on a Markov-random-field-coupled mixture model. The ground truth of hypotheses is represented by a latent binary Markov random field, and the observed test statistics appear as the coupled mixture variables. The parameters in our model can be automatically learned by a novel EM algorithm. We use an MCMC algorithm to infer the posterior probability that each hypothesis is null (termed the local index of significance), and the false discovery rate can be controlled accordingly. Simulations show that the numerical performance of multiple testing can be improved substantially by using our procedure. We apply the procedure to a real-world genome-wide association study on breast cancer, and we identify several SNPs with strong association evidence.
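Once posterior null probabilities (local indices of significance) are available, FDR control reduces to a simple thresholding step: reject the hypotheses with the smallest LIS values while their running average, an estimate of the marginal FDR of the rejection set, stays below the target level. A minimal sketch of that final step (the EM and MCMC machinery upstream of it is not reproduced here, and the function name is an illustrative assumption):

```python
def lis_threshold(lis, alpha=0.10):
    """Reject hypotheses in increasing order of their posterior null
    probability (LIS) while the running average of rejected LIS values,
    which estimates the FDR of the rejection set, stays at or below
    alpha. Returns the indices of the rejected hypotheses."""
    order = sorted(range(len(lis)), key=lambda i: lis[i])
    total, rejected = 0.0, []
    for k, i in enumerate(order, start=1):
        total += lis[i]
        if total / k > alpha:
            break
        rejected.append(i)
    return rejected
```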

Abstract:
Multiple scale homogenization problems are reduced to single scale problems in higher dimension. It is shown that sparse tensor product Finite Element Methods (FEM) allow the numerical solution in complexity independent of the dimension and of the length scale. Problems with stochastic input data are reformulated as high dimensional deterministic problems for the statistical moments of the random solution. Sparse tensor product FEM give a deterministic solution algorithm of log-linear complexity for statistical moments.
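The complexity claim rests on a standard counting argument for sparse tensor products: retaining only hierarchical level pairs $(l_1, l_2)$ with $l_1 + l_2 \le L$ gives $O(2^L L)$ degrees of freedom versus $O(4^L)$ for the full tensor grid. A small combinatorial sketch in two dimensions (function names are illustrative, and this counts index-set sizes only, not the FEM assembly):

```python
def full_grid_size(L, d=2):
    """Degrees of freedom on a full tensor product grid of level L in
    d dimensions: (2^L)^d."""
    return (2 ** L) ** d

def sparse_grid_size(L):
    """Degrees of freedom on the 2-D sparse tensor product grid: sum of
    2^(l1 + l2) over level pairs with l1 + l2 <= L, which evaluates to
    L * 2^(L+1) + 1, i.e. log-linear in the finest mesh resolution."""
    return sum(2 ** (l1 + l2)
               for l1 in range(L + 1)
               for l2 in range(L + 1 - l1))
```

This gap between $O(2^L L)$ and $O(4^L)$ is what makes the high-dimensional reformulation tractable.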