Abstract:
We investigate the maximal size of distinguished submatrices of a Gaussian random matrix. Of interest are submatrices whose entries have average greater than or equal to a positive constant, and submatrices whose entries are well-fit by a two-way ANOVA model. We identify size thresholds and associated (asymptotic) probability bounds for both large-average and ANOVA-fit submatrices. Results are obtained when the matrix and submatrices of interest are square, and in rectangular cases when the matrix and submatrices of interest have fixed aspect ratios. In addition, we obtain a strong, interval concentration result for the size of large-average submatrices in the square case. A simulation study shows good agreement between the observed and predicted sizes of large-average submatrices in matrices of moderate size.
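To make the large-average criterion concrete, the following is a minimal sketch, not the simulation procedure of the paper. It searches a small Gaussian matrix exhaustively for the largest k such that some k x k submatrix has average at least tau. The function names `largest_square_submatrix` and `gaussian_matrix`, the matrix size, and the values of tau are illustrative assumptions; the exhaustive search over row subsets is only feasible for small matrices.

```python
import itertools
import random


def largest_square_submatrix(matrix, tau):
    """Return the largest k such that some k x k submatrix has average >= tau.

    Hypothetical helper for illustration: enumerates all row subsets, so the
    cost grows exponentially with the number of rows.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    for k in range(min(n_rows, n_cols), 0, -1):
        for rows in itertools.combinations(range(n_rows), k):
            # For a fixed row set, the best k columns are those with the
            # largest column sums restricted to these rows.
            col_sums = sorted(
                (sum(matrix[r][c] for r in rows) for c in range(n_cols)),
                reverse=True,
            )
            if sum(col_sums[:k]) / (k * k) >= tau:
                return k
    return 0  # no single entry reaches tau


def gaussian_matrix(n, seed=0):
    """n x n matrix of independent standard normal entries (assumed model)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


if __name__ == "__main__":
    w = gaussian_matrix(10)
    for tau in (0.5, 1.0, 1.5):
        print(tau, largest_square_submatrix(w, tau))
```

Because any submatrix with average at least tau also has average at least any smaller threshold, the returned size is non-increasing in tau.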

Abstract:
This paper is concerned with the problem of recovering a finite, deterministic time series from observations that are corrupted by additive, independent noise. A distinctive feature of this problem is that the available data exhibit long-range dependence and, as a consequence, existing statistical theory and methods are not readily applicable. This paper gives an analysis of the denoising problem that extends recent work of Lalley, but begins from first principles. Both positive and negative results are established. The positive results show that denoising is possible under somewhat restrictive conditions on the additive noise. The negative results show that, under more general conditions on the noise, no procedure can recover the underlying deterministic series.

Abstract:
We show that if $\mathcal{X}$ is a complete separable metric space and $\mathcal{C}$ is a countable family of Borel subsets of $\mathcal{X}$ with finite VC dimension, then, for every stationary ergodic process with values in $\mathcal{X}$, the relative frequencies of sets $C\in\mathcal{C}$ converge uniformly to their limiting probabilities. Beyond ergodicity, no assumptions are imposed on the sampling process, and no regularity conditions are imposed on the elements of $\mathcal{C}$. The result extends existing work of Vapnik and Chervonenkis, among others, who have studied uniform convergence for i.i.d. and strongly mixing processes. Our method of proof is new and direct: it does not rely on symmetrization techniques, probability inequalities or mixing conditions. The uniform convergence of relative frequencies for VC-major and VC-graph classes of functions under ergodic sampling is established as a corollary of the basic result for sets.
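As an illustration of uniform convergence of relative frequencies under ergodic sampling, the sketch below uses one simple example chosen here, not taken from the paper: an irrational rotation of the unit interval, which is ergodic with respect to Lebesgue measure, and the family of intervals [0, q), which has VC dimension 1. The finite grid of endpoints q, the rotation angle, and the function names are assumptions made for the example.

```python
import math

# Rotation by an irrational angle is ergodic for Lebesgue measure on [0, 1).
GOLDEN_ALPHA = (math.sqrt(5.0) - 1.0) / 2.0


def rotation_path(n, x0=0.0, alpha=GOLDEN_ALPHA):
    """First n points of the orbit x_t = (x0 + t * alpha) mod 1."""
    return [(x0 + t * alpha) % 1.0 for t in range(n)]


def sup_deviation(n, grid=100):
    """Largest gap between relative frequency and length over sets [0, j/grid)."""
    path = rotation_path(n)
    worst = 0.0
    for j in range(1, grid):
        q = j / grid
        freq = sum(1 for x in path if x < q) / n  # relative frequency of [0, q)
        worst = max(worst, abs(freq - q))         # limiting probability is q
    return worst


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, sup_deviation(n))
```

The printed supremum should shrink as n grows, which is the behaviour the theorem guarantees for every stationary ergodic process and every countable family with finite VC dimension.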

Abstract:
We define a notion of entropy for an infinite family $\mathcal{C}$ of measurable sets in a probability space. We show that the mean ergodic theorem holds uniformly for $\mathcal{C}$ under every ergodic transformation if and only if $\mathcal{C}$ has zero entropy. When the entropy of $\mathcal{C}$ is positive, we establish a strong converse showing that the uniform mean ergodic theorem fails generically in every isomorphism class, including the isomorphism classes of Bernoulli transformations. As a corollary of these results, we establish that every strong mixing transformation is uniformly strong mixing on $\mathcal{C}$ if and only if the entropy of $\mathcal{C}$ is zero, and obtain a corresponding result for weak mixing transformations.

Abstract:
Let F be a family of Borel measurable functions on a complete separable metric space. The gap (or fat-shattering) dimension of F is a combinatorial quantity that measures the extent to which functions f in F can separate finite sets of points at a predefined resolution gamma > 0. We establish a connection between the gap dimension of F and the uniform convergence of its sample averages under ergodic sampling. In particular, we show that if the gap dimension of F at resolution gamma > 0 is finite, then for every ergodic process the sample averages of functions in F are eventually within 10 gamma of their limiting expectations uniformly over the class F. If the gap dimension of F is finite for every resolution gamma > 0 then the sample averages of functions in F converge uniformly to their limiting expectations. We assume only that F is uniformly bounded and countable (or countably approximable). No smoothness conditions are placed on F, and no assumptions beyond ergodicity are placed on the sampling processes. Our results extend existing work for i.i.d. processes.

Abstract:
We show that the sets in a family with finite VC dimension can be uniformly approximated within a given error by a finite partition. Immediate corollaries include the fact that VC classes have finite bracketing numbers, satisfy uniform laws of averages under strong dependence, and exhibit uniform mixing. Our results are based on recent work concerning uniform laws of averages for VC classes under ergodic sampling.

Abstract:
For any family of measurable sets in a probability space, we show that either (i) the family has infinite Vapnik-Chervonenkis (VC) dimension or (ii) for every epsilon > 0 there is a finite partition pi such that the pi-boundary of each set has measure at most epsilon. Immediate corollaries include the fact that a family with finite VC dimension has finite bracketing numbers, and satisfies uniform laws of large numbers for every ergodic process. From these corollaries, we derive analogous results for VC major and VC graph families of functions.

Abstract:
Interactions between proteins and DNA facilitate and regulate many basic cellular functions, including transcription, DNA replication, recombination, and DNA repair. For example, the process of transcription is regulated by a class of proteins referred to as transcription factors, which often bind to specific DNA sequences upstream of gene coding regions. This control mechanism allows cells to respond to developmental or environmental signals by using the same transcription factor to coordinate expression of many genes. Therefore, it is of interest to determine where regulatory proteins of this and other types are bound to the genome. The genomic-binding location of transcription factors can be determined using chromatin immunoprecipitation (ChIP) followed by detection of the enriched fragments by DNA microarray hybridization. This procedure, also known as ChIP-chip, has been reviewed extensively [1-5]. To appreciate the unique properties of the data generated by the ChIP-chip procedure, it is useful to review briefly the main points of the experimental procedure (Figure 1). After growing the cells of interest under the desired conditions, chromatin is usually cross-linked with formaldehyde to preserve sites of interaction between proteins and DNA. The cross-linked chromatin is then sheared by sonication or enzymatic digestion. Shearing creates a population of chromatin fragments of varying size, generally ranging from 200 to 1,000 base-pairs. The protein of interest, along with the DNA associated with it, is then isolated by using an antibody specific to that protein or by affinity purification utilizing an epitope or affinity tag fused to the protein. The ChIPed DNA is then purified. Because yields from most samples are low, amplification is often required. DNA fragments enriched in the procedure are then detected by comparative hybridization to a DNA microarray. Standard technical recommendations common to all microarray experiments (for example, the need for dye s

Abstract:
This paper considers estimation of a univariate density from an individual numerical sequence. It is assumed that (i) the limiting relative frequencies of the numerical sequence are governed by an unknown density, and (ii) there is a known upper bound for the variation of the density on an increasing sequence of intervals. A simple estimation scheme is proposed, and is shown to be $L_1$ consistent when (i) and (ii) apply. In addition it is shown that there is no consistent estimation scheme for the set of individual sequences satisfying only condition (i).
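The abstract does not spell out the proposed scheme, so the sketch below is only a generic fixed-width histogram estimate applied to a deterministic sequence, not the paper's estimator. As an assumed example it uses the base-2 van der Corput sequence, whose limiting relative frequencies on [0, 1) are governed by the uniform density, and measures the L1 distance between the histogram and that density. All function names and the bin count are hypothetical choices.

```python
def van_der_corput(i, base=2):
    """Radical inverse of the non-negative integer i in the given base."""
    x, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        x += digit / denom
    return x


def histogram_density(points, bins):
    """Histogram density estimate on [0, 1) with equal-width bins."""
    width = 1.0 / bins
    counts = [0] * bins
    for x in points:
        counts[min(int(x * bins), bins - 1)] += 1
    # Height of each bar: fraction of points in the bin divided by bin width.
    return [c / (len(points) * width) for c in counts]


def l1_distance_to_uniform(heights):
    """L1 distance between a histogram on [0, 1) and the uniform density."""
    width = 1.0 / len(heights)
    return sum(abs(h - 1.0) * width for h in heights)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        seq = [van_der_corput(i) for i in range(n)]
        print(n, l1_distance_to_uniform(histogram_density(seq, bins=8)))
```

A practical scheme would let the number of bins grow with the sequence length; the known variation bound in condition (ii) is the kind of information that could guide that choice, though how the paper uses it is not stated here.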

Abstract:
We consider univariate regression estimation from an individual (non-random) sequence $(x_1,y_1),(x_2,y_2), ... \in \real \times \real$, which is stable in the sense that for each interval $A \subseteq \real$, (i) the limiting relative frequency of $A$ under $x_1, x_2, ...$ is governed by an unknown probability distribution $\mu$, and (ii) the limiting average of those $y_i$ with $x_i \in A$ is governed by an unknown regression function $m(\cdot)$. A computationally simple scheme for estimating $m(\cdot)$ is exhibited, and is shown to be $L_2$ consistent for stable sequences $\{(x_i,y_i)\}$ such that $\{y_i\}$ is bounded and there is a known upper bound for the variation of $m(\cdot)$ on intervals of the form $(-i,i]$, $i \geq 1$. Complementing this positive result, it is shown that there is no consistent estimation scheme for the family of stable sequences whose regression functions have finite variation, even under the restriction that $x_i \in [0,1]$ and $y_i$ is binary-valued.