Search Results: 1 - 10 of 879 matches for "Ernest Fokoue"
All listed articles are free for downloading (OA Articles)
A Taxonomy of Big Data for Optimal Predictive Machine Learning and Data Mining
Ernest Fokoue
Statistics, 2015
Abstract: Big data comes in various ways, types, shapes, forms and sizes. Indeed, almost all areas of science, technology, medicine, public health, economics, business, linguistics and social science are bombarded by ever increasing flows of data begging to be analyzed efficiently and effectively. In this paper, we propose a rough idea of a possible taxonomy of big data, along with some of the most commonly used tools for handling each particular category of bigness. The dimensionality p of the input space and the sample size n are usually the main ingredients in the characterization of data bigness. The specific statistical machine learning technique used to handle a particular big data set will depend on which category it falls in within the bigness taxonomy. Large p small n data sets, for instance, require a different set of tools from the large n small p variety. Among other tools, we discuss Preprocessing, Standardization, Imputation, Projection, Regularization, Penalization, Compression, Reduction, Selection, Kernelization, Hybridization, Parallelization, Aggregation, Randomization, Replication, and Sequentialization. Indeed, it is important to emphasize right away that the so-called no free lunch theorem applies here, in the sense that there is no universally superior method that outperforms all other methods on all categories of bigness. It is also important to stress the fact that simplicity, in the sense of Ockham's razor non-plurality principle of parsimony, tends to reign supreme when it comes to massive data. We conclude with a comparison of the predictive performance of some of the most commonly used methods on a few data sets.
Probit Normal Correlated Topic Models
Xingchen Yu, Ernest Fokoue
Computer Science, 2014
Abstract: The logistic normal distribution has recently been adapted via the transformation of multivariate Gaussian variables to model the topical distribution of documents in the presence of correlations among topics. In this paper, we propose a probit normal alternative approach to modelling correlated topical structures. Our use of the probit model in the context of topic discovery is novel, as many authors have so far concentrated solely on the logistic model, partly due to the formidable inefficiency of the multinomial probit model even in the case of very small topical spaces. We herein circumvent the inefficiency of multinomial probit estimation by using an adaptation of the diagonal orthant multinomial probit in the topic models context, resulting in the ability of our topic modelling scheme to handle corpora with a large number of latent topics. An additional and very important benefit of our method lies in the fact that, unlike the logistic normal model, whose non-conjugacy leads to the need for sophisticated sampling schemes, our approach exploits the natural conjugacy inherent in the auxiliary formulation of the probit model to achieve greater simplicity. The application of our proposed scheme to a well known Associated Press corpus not only helps discover a large number of meaningful topics but also captures compellingly intuitive correlations among certain topics. Besides, our proposed approach lends itself to even further scalability thanks to various existing high performance algorithms and architectures capable of handling millions of documents.
Adaptive Random SubSpace Learning (RSSL) Algorithm for Prediction
Mohamed Elshrif, Ernest Fokoue
Computer Science, 2015
Abstract: We present a novel adaptive random subspace learning algorithm (RSSL) for prediction. This new framework is flexible in that it can be combined with any learning technique. In this paper, we tested the algorithm on regression and classification problems. In addition, we provide a variety of weighting schemes to increase the robustness of the developed algorithm. These different weighting flavors were evaluated on simulated as well as real-world data sets, considering the cases where the ratio between features (attributes) and instances (samples) is large and vice versa. The framework of the new algorithm consists of several stages: first, calculate the weights of all features on the data set using the correlation coefficient and F-statistic statistical measurements. Second, randomly draw n samples with replacement from the data set. Third, perform regular bootstrap sampling (bagging). Fourth, draw without replacement the indices of the chosen variables. The decision is taken based on the heuristic subspacing scheme. Fifth, call base learners and build the model. Sixth, use the model for prediction on the test set of the data. The results show that the adaptive RSSL algorithm improves on conventional machine learning algorithms in most of the cases.
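The stages listed in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`rssl_fit`, `rssl_predict`, `knn_predict`), the k-nearest-neighbor base learner, the square-root subspace size, and the use of absolute Pearson correlation as the only weighting scheme are all choices made here for a regression toy case.

```python
import math
import random

def abs_correlation(xs, ys):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)

def weighted_sample_without_replacement(weights, k, rng):
    """Draw k distinct indices with probability proportional to weights."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        # Small offset keeps the draw valid even if every weight is zero.
        w = [weights[j] + 1e-12 for j in remaining]
        pick = rng.choices(range(len(remaining)), weights=w, k=1)[0]
        chosen.append(remaining.pop(pick))
    return chosen

def knn_predict(train_X, train_y, x, features, k=3):
    """Assumed base learner: k-NN regression restricted to `features`."""
    dists = sorted(
        (sum((row[j] - x[j]) ** 2 for j in features), target)
        for row, target in zip(train_X, train_y)
    )
    nearest = dists[:k]
    return sum(target for _, target in nearest) / len(nearest)

def rssl_fit(X, y, n_learners=25, subspace_size=None, seed=0):
    """Build an ensemble of (bootstrap sample, feature subset) pairs."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    d = min(p, subspace_size or max(1, round(math.sqrt(p))))
    # Stage 1: weight every feature by its association with the response.
    weights = [abs_correlation([row[j] for row in X], y) for j in range(p)]
    ensemble = []
    for _ in range(n_learners):
        # Stages 2-3: bootstrap the observations (sampling with replacement).
        idx = [rng.randrange(n) for _ in range(n)]
        # Stage 4: draw feature indices without replacement, guided by weights.
        feats = weighted_sample_without_replacement(weights, d, rng)
        # Stage 5: the lazy k-NN learner only needs its data and its subspace.
        ensemble.append(([X[i] for i in idx], [y[i] for i in idx], feats))
    return ensemble

def rssl_predict(ensemble, x):
    """Stage 6: average the base learners' predictions for a new point."""
    preds = [knn_predict(bX, by, x, feats) for bX, by, feats in ensemble]
    return sum(preds) / len(preds)
```

Calling `rssl_fit(X, y)` on a list of feature rows and a list of responses, then `rssl_predict(ensemble, x_new)`, gives the aggregated prediction. For classification, the abstract mentions the F-statistic as the feature weight, which would replace `abs_correlation` here.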
A Comparison of Classifiers in Performing Speaker Accent Recognition Using MFCCs
Zichen Ma, Ernest Fokoue
Computer Science, 2015
Abstract: An algorithm involving Mel-Frequency Cepstral Coefficients (MFCCs) is provided to perform signal feature extraction for the task of speaker accent recognition. Different classifiers are then compared based on the MFCC features. For each signal, the mean vector of the MFCC matrix is used as the input vector for pattern recognition. A sample of 330 signals, containing 165 US voices and 165 non-US voices, is analyzed. In this comparison, k-nearest neighbors yields the highest average test accuracy, after using a cross-validation of size 500, and requires the least computation time.
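The feature construction described in this abstract (one mean MFCC vector per signal, fed to a classifier) can be illustrated with a small sketch. It assumes the MFCC matrix has already been computed and is stored as a list of frames, each a list of coefficients. The function names and the majority-vote k-nearest-neighbor rule with Euclidean distance are assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter

def mean_mfcc_vector(mfcc_matrix):
    """Average an MFCC matrix (frames x coefficients) over its frames."""
    n_frames = len(mfcc_matrix)
    n_coeffs = len(mfcc_matrix[0])
    return [
        sum(frame[c] for frame in mfcc_matrix) / n_frames
        for c in range(n_coeffs)
    ]

def knn_classify(train_vectors, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    dists = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Each signal is reduced to a single fixed-length vector regardless of its duration, which is what allows standard classifiers to be compared on the same inputs.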
Robust Classification of High Dimension Low Sample Size Data
Necla Gunduz, Ernest Fokoue
Statistics, 2015
Abstract: The robustification of pattern recognition techniques has been the subject of intense research in recent years. Despite the multiplicity of papers on the subject, very few articles have deeply explored the topic of robust classification in the high dimension low sample size context. In this work, we explore and compare the predictive performances of robust classification techniques with a special concentration on robust discriminant analysis and robust PCA applied to a wide variety of large $p$ small $n$ data sets. We also explore the performance of random forest by way of comparing and contrasting single model methods and ensemble methods in this context. Our work reveals that random forest, although not inherently designed to be robust to outliers, substantially outperforms the existing techniques specifically designed to achieve robustness. Indeed, random forest emerges as the best predictively on both real life and simulated data.
An Information-Theoretic Alternative to the Cronbach's Alpha Coefficient of Item Reliability
Ernest Fokoue, Necla Gunduz
Statistics, 2015
Abstract: We propose an information-theoretic alternative to the popular Cronbach alpha coefficient of reliability. Particularly suitable for contexts in which instruments are scored on a strictly nonnumeric scale, our proposed index is based on functions of the entropy of the distributions defined on the sample space of responses. Our reliability index tracks the Cronbach alpha coefficient uniformly while offering several other advantages discussed in great detail in this paper.
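For orientation, the two ingredients named in this abstract can be computed as below. This sketch shows the standard Cronbach alpha formula and the Shannon entropy of an empirical response distribution. It does not reproduce the authors' proposed index, whose exact form is not given in the abstract, and the function names are assumptions.

```python
import math
from collections import Counter

def cronbach_alpha(scores):
    """Standard Cronbach alpha.

    scores: list of respondents, each a list of k numeric item scores.
    """
    k = len(scores[0])

    def variance(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def shannon_entropy(responses):
    """Entropy (in bits) of the empirical distribution of responses.

    Works for nonnumeric categories such as "agree" / "disagree".
    """
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Note that `cronbach_alpha` needs numeric scores, whereas `shannon_entropy` only needs category labels, which is the situation the abstract describes for strictly nonnumeric scales.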
Pattern Discovery in Students' Evaluations of Professors: A Statistical Data Mining Approach
Necla Gunduz, Ernest Fokoue
Statistics, 2015
Abstract: The evaluation of instructors by their students has been practiced at most universities for many decades, and there has always been a great interest in a variety of aspects of the evaluations. Are students mature and knowledgeable enough to provide useful and dependable feedback for the improvement of their instructors' teaching skills/abilities? Does the level of difficulty of the course have a strong relationship with the rating the student gives an instructor? In this paper, we attempt to answer questions such as these using some state of the art statistical data mining techniques such as support vector machines, classification and regression trees, boosting, random forest, factor analysis, k-means clustering, and hierarchical clustering. We explore various aspects of the data from both the supervised and unsupervised learning perspectives. The data set analyzed in this paper was collected from a university in Turkey. The application of our techniques to this data reveals some very interesting patterns in the evaluations, like the strong association between the student's seriousness and dedication (measured by attendance) and the kind of scores they tend to assign to their instructors.
On the Predictive Properties of Binary Link Functions
Necla Gunduz, Ernest Fokoue
Statistics, 2015
Abstract: This paper provides a theoretical and computational justification of the long-held claim of the similarity of the probit and logit link functions often used in binary classification. Despite this widespread recognition of the strong similarities between these two link functions, very few (if any) researchers have dedicated time to carry out a formal study aimed at establishing and characterizing firmly all the aspects of the similarities and differences. This paper proposes a definition of both structural and predictive equivalence of link function-based binary regression models, and explores the various ways in which they are either similar or dissimilar. From a predictive analytics perspective, it turns out that not only are probit and logit perfectly predictively concordant, but the other link functions like cauchit and complementary log-log enjoy a very high percentage of predictive equivalence. Throughout this paper, simulated and real life examples demonstrate all the equivalence results that we prove theoretically.
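The closeness of the probit and logit links can be checked numerically with a short sketch. The code below is not the paper's definition of structural or predictive equivalence. It only illustrates two familiar facts under the assumption that both models share the same linear predictor: the logistic curve with its argument scaled by about 1.702 stays within 0.01 of the standard normal CDF, and both links give the same class label at the 0.5 threshold. The function names and the scaling constant are choices made for this illustration.

```python
import math

def logistic_cdf(z):
    """Inverse logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_cdf(z):
    """Inverse probit link: the standard normal CDF, via math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_link_gap(grid, scale=1.702):
    """Largest gap between probit(z) and logistic(scale * z) over a grid."""
    return max(abs(probit_cdf(z) - logistic_cdf(scale * z)) for z in grid)

def labels_agree(grid, threshold=0.5):
    """True if both links assign the same class at every grid point."""
    return all(
        (probit_cdf(z) >= threshold) == (logistic_cdf(z) >= threshold)
        for z in grid
    )

# Linear predictor values from -6.0 to 6.0 in steps of 0.1.
grid = [i / 10 for i in range(-60, 61)]
gap = max_link_gap(grid)
same = labels_agree(grid)
```

Because both curves are increasing and equal 0.5 at zero, the predicted label depends only on the sign of the linear predictor, which is why `same` holds on this grid.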
Random Subspace Learning Approach to High-Dimensional Outliers Detection
Bohan Liu, Ernest Fokoue
Statistics, 2015
Abstract: We introduce and develop a novel approach to outlier detection based on an adaptation of random subspace learning. Our proposed method handles both high-dimension low-sample size and traditional low-dimensional high-sample size datasets. Essentially, we avoid the computational bottleneck of techniques like minimum covariance determinant (MCD) by computing the needed determinants and associated measures in much lower dimensional subspaces. Both the theoretical and computational development of our approach reveal that it is computationally more efficient than the regularized methods in the high-dimensional low-sample size setting, and often competes favorably with existing methods as far as the percentage of correct outlier detection is concerned.
Dimensionality Reduction of High-Dimensional Highly Correlated Multivariate Grapevine Dataset
Uday Kant Jha, Peter Bajorski, Ernest Fokoue, Justine Vanden Heuvel, Jan van Aardt, Grant Anderson
Open Journal of Statistics (OJS), 2017, DOI: 10.4236/ojs.2017.74049
Abstract: Viticulturists traditionally have a keen interest in studying the relationship between the biochemistry of grapevines’ leaves/petioles and their associated spectral reflectance in order to understand the fruit ripening rate, water status, nutrient levels, and disease risk. In this paper, we use imaging spectroscopy (hyperspectral) reflectance data, for the reflective 330 - 2510 nm wavelength region (986 total spectral bands), to assess vineyard nutrient status; this constitutes a high dimensional dataset with a covariance matrix that is ill-conditioned. The identification of the variables (wavelength bands) that contribute useful information for nutrient assessment and prediction plays a pivotal role in multivariate statistical modeling. In recent years, researchers have successfully developed many continuous, nearly unbiased, sparse and accurate variable selection methods to overcome this problem. This paper compares four regularized regression methods and one functional regression method for wavelength variable selection: Elastic Net, Multi-Step Adaptive Elastic Net, Minimax Concave Penalty, iterative Sure Independence Screening, and Functional Data Analysis. Thereafter, the predictive performance of these regularized sparse models is enhanced using stepwise regression. This comparative study of regression methods using a high-dimensional and highly correlated grapevine hyperspectral dataset revealed that Elastic Net variable selection yields the best predictive ability.