Abstract:
This paper provides an overview of a recently developed class of strategies for model selection, known as the fence methods. It also offers directions for future research as well as challenging problems.

1. Introduction

On the morning of March 16, 1971, Hirotugu Akaike, as he was taking a seat on a commuter train, came up with the idea of a connection between the relative Kullback-Leibler discrepancy and the empirical log-likelihood function, which led to a procedure later named Akaike's information criterion, or AIC (Akaike [1, 2]; see Bozdogan [3] for the historical note). The idea has allowed major advances in model selection and related fields. See, for example, de Leeuw [4]. A number of similar criteria have since been proposed, including the Bayesian information criterion (BIC; Schwarz [5]), a criterion due to Hannan and Quinn (HQ; [6]), and the generalized information criterion (GIC; Nishii [7], Shibata [8]). All of the information criteria can be expressed as

$\mathrm{GIC}(M) = \hat{Q}(M) + \lambda_n |M|$,

where $\hat{Q}(M)$ is a measure of lack-of-fit by the model, $M$; $|M|$ is the dimension of $M$, defined as the number of free parameters under $M$; and $\lambda_n$ is a penalty for complexity of the model, which may depend on the effective sample size, $n$. Although the information criteria are broadly used, difficulties are often encountered, especially in some nonconventional situations. We discuss a number of such cases below.

(1) The Effective Sample Size. In many cases, the effective sample size, $n$, is not the same as the number of data points. This often happens when the data are correlated. Consider two extreme cases. In the first case, the observations are independent; therefore, the effective sample size should be the same as the number of observations. In the second case, the data are so strongly correlated that all of the data points are identical. In this case, the effective sample size is 1, regardless of the number of data points.
A practical situation may be somewhere between these two extreme cases, such as cases of mixed effects models (e.g., Jiang [9]), which makes the effective sample size difficult to determine. (2) The Dimension of a Model. The dimension of a model, $|M|$, can also cause difficulties. In some cases, such as ordinary linear regression, this is simply the number of parameters under $M$, but in other situations, where nonlinear, adaptive models are fitted, this can be substantially different. Ye [10] developed the concept of generalized degrees of freedom (gdf) to track model complexity. For example, in the case of multivariate adaptive regression splines (Friedman [11]), nonlinear terms can have an
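
To make the role of the effective sample size concrete, the following is a minimal sketch of the "lack-of-fit plus penalty times dimension" form described above, with AIC (penalty 2) and BIC (penalty log n) as special cases. The function names are hypothetical. The formula used for the effective sample size, n / (1 + (n - 1) rho) for the mean of n equicorrelated observations, is an assumption chosen for illustration and is not taken from the paper; it reproduces the two extreme cases discussed in the text (rho = 0 gives n, rho = 1 gives 1).

```python
import math


def gic(lack_of_fit, dim, penalty):
    """Generic criterion: lack-of-fit plus penalty times model dimension."""
    return lack_of_fit + penalty * dim


def aic(neg2_loglik, dim):
    """AIC: lack-of-fit is -2 log-likelihood, penalty is 2."""
    return gic(neg2_loglik, dim, 2.0)


def bic(neg2_loglik, dim, n_eff):
    """BIC: penalty is log of the (effective) sample size."""
    return gic(neg2_loglik, dim, math.log(n_eff))


def effective_sample_size(n, rho):
    """Illustrative assumption: effective sample size for the mean of
    n equicorrelated observations with common correlation rho."""
    return n / (1.0 + (n - 1) * rho)
```

With 100 observations, the BIC penalty per parameter is log(100) when the data are independent but log(1) = 0 when all observations are identical, so the same fitted model can receive very different scores depending on how the effective sample size is defined.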

Abstract:
In mixed linear models with nonnormal data, the Gaussian Fisher information matrix is called a quasi-information matrix (QUIM). The QUIM plays an important role in evaluating the asymptotic covariance matrix of the estimators of the model parameters, including the variance components. Traditionally, there are two ways to estimate the information matrix: the estimated information matrix and the observed one. Because the analytic form of the QUIM involves parameters other than the variance components, for example, the third and fourth moments of the random effects, the estimated QUIM is not available. On the other hand, because of the dependence and nonnormality of the data, the observed QUIM is inconsistent. We propose an estimator of the QUIM that consists partially of an observed form and partially of an estimated one. We show that this estimator is consistent and computationally very easy to operate. The method is used to derive large sample tests of statistical hypotheses that involve the variance components in a non-Gaussian mixed linear model. Finite sample performance of the test is studied by simulations and compared with the delete-group jackknife method that applies to a special case of non-Gaussian mixed linear models.

Abstract:
We give an answer to an open problem regarding consistency of the maximum likelihood estimators (MLEs) in generalized linear mixed models (GLMMs) involving crossed random effects. The solution to the open problem introduces an interesting, nonstandard approach to proving consistency of the MLEs in cases of dependent observations. Using the new technique, we extend the results to MLEs under a general GLMM. An example is used to further illustrate the technique.

Abstract:
Nucleosome positioning dictates the DNA accessibility for regulatory proteins, and thus is critical for gene expression and regulation. It has been well documented that only a subset of nucleosomes are reproducibly positioned in eukaryotic genomes. The most prominent example of phased nucleosomes is in the context of genes, where phased nucleosomes flank the transcription start sites (TSSs). It is unclear, however, what factors determine nucleosome positioning in regions that are not close to genes. We mapped both nucleosome positioning and DNase I hypersensitive site (DHS) datasets across the rice genome. We discovered that DHSs located in a variety of contexts, both genic and intergenic, were flanked by strongly phased nucleosome arrays. Phased nucleosomes were also found to flank DHSs in the human genome. Our results suggest the barrier model may represent a general feature of nucleosome organization in eukaryotic genomes. Specifically, regions bound with regulatory proteins, including intergenic regions, can serve as barriers that organize phased nucleosome arrays on both sides. Our results also suggest that rice DHSs often span a single, phased nucleosome, similar to the H2A.Z-containing nucleosomes observed in DHSs in the human genome.

Abstract:
The term "empirical predictor" refers to a two-stage predictor of a linear combination of fixed and random effects. In the first stage, a predictor is obtained but it involves unknown parameters; thus, in the second stage, the unknown parameters are replaced by their estimators. In this paper, we consider mean squared errors (MSE) of empirical predictors under a general setup, where ML or REML estimators are used for the second stage. We obtain a second-order approximation to the MSE as well as an estimator of the MSE correct to the same order. The general results are applied to mixed linear models to obtain a second-order approximation to the MSE of the empirical best linear unbiased predictor (EBLUP) of a linear mixed effect and an estimator of the MSE of EBLUP whose bias is correct to second order. The general mixed linear model includes the mixed ANOVA model and the longitudinal model as special cases.

Abstract:
We propose an iterative estimating equations procedure for analysis of longitudinal data. We show that, under very mild conditions, the probability that the procedure converges at an exponential rate tends to one as the sample size increases to infinity. Furthermore, we show that the limiting estimator is consistent and asymptotically efficient, as expected. The method applies to semiparametric regression models with unspecified covariances among the observations. In the special case of linear models, the procedure reduces to iteratively reweighted least squares. Finite sample performance of the procedure is studied by simulations and compared with other methods. A numerical example from a medical study is considered to illustrate the application of the method.
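
For the linear special case mentioned above, the idea of alternating between estimating the regression coefficients and re-estimating the unspecified covariance can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: it assumes a balanced design (every subject observed at the same T time points, with more subjects than time points), an unstructured covariance estimated by the average outer product of residuals, and a hypothetical function name `iterative_wls`.

```python
import numpy as np


def iterative_wls(X, y, n_iter=50, tol=1e-10):
    """Alternate weighted least squares for beta with a residual-based
    update of the T x T within-subject covariance V.

    X: array of shape (m, T, p), covariates for m subjects at T times.
    y: array of shape (m, T), responses.
    """
    m, T, p = X.shape
    V = np.eye(T)          # start from working independence (ordinary LS)
    beta = np.zeros(p)
    for _ in range(n_iter):
        V_inv = np.linalg.inv(V)
        # Weighted normal equations: sum_i X_i' V^{-1} X_i and sum_i X_i' V^{-1} y_i
        A = np.einsum("itp,ts,isq->pq", X, V_inv, X)
        b = np.einsum("itp,ts,is->p", X, V_inv, y)
        beta_new = np.linalg.solve(A, b)
        # Re-estimate the covariance from the current residuals
        resid = y - X @ beta_new
        V = resid.T @ resid / m
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, V
```

The first pass is ordinary least squares; each later pass reweights by the inverse of the covariance estimated from the previous residuals, which is the sense in which the scheme is iteratively reweighted.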

Abstract:
Many model search strategies involve trading off model fit with model complexity in a penalized goodness-of-fit measure. Asymptotic properties for these types of procedures in settings like linear regression and ARMA time series have been studied, but these do not naturally extend to nonstandard situations such as mixed effects models, where a simple definition of the sample size is not meaningful. This paper introduces a new class of strategies, known as fence methods, for mixed model selection, which includes linear and generalized linear mixed models. The idea involves a procedure to isolate a subgroup of what are known as correct models (of which the optimal model is a member). This is accomplished by constructing a statistical fence, or barrier, to carefully eliminate incorrect models. Once the fence is constructed, the optimal model is selected from among those within the fence according to a criterion which can be made flexible. In addition, we propose two variations of the fence. The first is a stepwise procedure to handle situations of many predictors; the second is an adaptive approach for choosing a tuning constant. We give sufficient conditions for consistency of the fence and its variations, a desirable property for a good model selection procedure. The methods are illustrated through simulation studies and real data analysis.
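
The two steps described above (build a fence that excludes poorly fitting models, then pick the optimal model inside it) can be sketched as follows. This is a simplified illustration, not the paper's procedure: it assumes the fence admits every candidate whose lack-of-fit is within a fixed, user-supplied `threshold` of the best lack-of-fit, and it assumes the selection criterion inside the fence is minimal dimension with ties broken by lack-of-fit. The function name and the candidate tuples are hypothetical.

```python
def fence_select(candidates, threshold):
    """Select a model by a simplified fence rule.

    candidates: list of (name, lack_of_fit, dimension) tuples.
    threshold:  assumed fixed width of the fence; how to choose it
                is exactly the tuning problem the abstract mentions.
    """
    # Baseline: the smallest lack-of-fit among all candidates.
    q_min = min(q for _, q, _ in candidates)
    # Step 1: keep only the models inside the fence.
    in_fence = [c for c in candidates if c[1] <= q_min + threshold]
    # Step 2: among those, prefer the smallest dimension, then the best fit.
    name, _, _ = min(in_fence, key=lambda c: (c[2], c[1]))
    return name


models = [
    ("x1", 50.0, 1),
    ("x2+x3", 30.0, 2),
    ("x1+x2", 12.0, 2),
    ("x1+x2+x3", 11.5, 3),
]
chosen = fence_select(models, threshold=2.0)
```

With a threshold of 2.0, only "x1+x2" and "x1+x2+x3" fall inside the fence, and the smaller of the two is chosen. A threshold of 0 always returns the best-fitting model, while a very large threshold returns the smallest model, which is why the choice of the tuning constant matters.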

Abstract:
We study the behavior of the restricted maximum likelihood (REML) estimator under a misspecified linear mixed model (LMM) that has received much attention in recent genome-wide association studies. The asymptotic analysis establishes consistency of the REML estimator of the variance of the errors in the LMM, and convergence in probability of the REML estimator of the variance of the random effects in the LMM to a certain limit, which is equal to the true variance of the random effects multiplied by the limiting proportion of the nonzero random effects present in the LMM. The asymptotic results also establish the convergence rate (in probability) of the REML estimators as well as a result regarding convergence of the asymptotic conditional variance of the REML estimator. The asymptotic results are fully supported by the results of empirical studies, which include extensive simulation studies that compare the performance of the REML estimator (under the misspecified LMM) with other existing methods.

Abstract:
The current method to classify graphite morphology types of grey cast iron is based on traditional subjective observation, and it cannot be used for quantitative analysis. Since microstructures have a great effect on the mechanical properties of grey cast iron and different types have totally different characters, six types of grey cast iron are discussed, and an image-processing software subsystem is described that performs the classification and quantitative analysis automatically, based on a composed feature vector and an artificial neural network (ANN). Three kinds of texture features (fractal dimension, roughness, and two-dimensional autoregression) are used as the extracted feature input vector of the ANN classifier. Compared with using only one kind of feature, the rate of correct classification increased greatly. In addition, to achieve the quantitative analysis and show the different types clearly, the region segmentation idea was applied to the system. The percentages of the regions of different types are reported correctly. Furthermore, this paper tentatively introduces a new empirical method to decide the number of ANN hidden nodes, a choice usually considered difficult in ANN structure design. It was found that the optimum hidden node number for the experimental data was the same as that obtained using the new method.

Abstract:
Potato is the third most important food crop worldwide. However, genetic and genomic research of potato has lagged behind other major crops due to the autopolyploidy and highly heterozygous nature of the potato genome. Reliable and technically undemanding techniques are not available for functional gene assays in potato. Here we report the development of a transient gene expression and silencing system in potato. Gene expression or RNAi-based gene silencing constructs were delivered into potato leaf cells using Agrobacterium-mediated infiltration. Agroinfiltration of various gene constructs consistently resulted in potato cell transformation and spread of the transgenic cells around infiltration zones. The efficiency of agroinfiltration was affected by potato genotype, concentration of Agrobacterium, and plant growth conditions. We demonstrated that agroinfiltration-based transient gene expression can be used to detect potato proteins in subcellular compartments in living cells. We established a double agroinfiltration procedure that allows testing whether a specific gene is associated with the potato late blight resistance pathway mediated by the resistance gene RB. This procedure provides a powerful approach for high-throughput functional assays of a large number of candidate genes in potato late blight resistance.