Comparison between Common Statistical Modeling Techniques Used in Research, Including: Discriminant Analysis vs Logistic Regression, Ridge Regression vs LASSO, and Decision Tree vs Random Forest

DOI: 10.4236/oalib.1108414, PP. 1-19

Subject Areas: Mathematical Analysis, Applied Statistical Mathematics

Keywords: Supervised Learning, Logistic Regression, Discriminant Analysis, KNN, Ridge Regression, LASSO, Decision Tree, Random Forests, PCA, Clustering


Abstract

Statistical techniques are important tools in research modeling. However, misleading outcomes can arise if sufficient care is not taken in choosing the right approach. Selecting the correct analysis for any research work requires a thorough understanding of the differences between these tools, and choosing the wrong modeling technique can create serious problems in the interpretation of the findings and affect the conclusions of the study. Each technique makes its own assumptions about the data and follows its own estimation procedures. This paper compares common statistical approaches, including regression vs. classification, discriminant analysis vs. logistic regression, ridge regression vs. LASSO, and decision tree vs. random forest. The results show that each approach has unique statistical characteristics that should be well understood before deciding on its use in research.
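
To make the contrasted pairs concrete, below is a minimal sketch in Python with scikit-learn (not code from the paper itself; the dataset sizes, penalty strengths, and synthetic data from make_classification and make_regression are illustrative assumptions). It fits LDA vs. logistic regression and a single decision tree vs. a random forest on one classification task, and ridge vs. LASSO on one regression task, where LASSO's L1 penalty can set coefficients exactly to zero while ridge only shrinks them.

# Sketch (scikit-learn, synthetic data) of the technique pairs compared in the paper:
# LDA vs. logistic regression, decision tree vs. random forest, ridge vs. LASSO.
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

SEED = 42  # arbitrary seed for reproducibility

# Classification task: compare LDA with logistic regression, and a single tree with a forest.
Xc, yc = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=SEED)
classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=SEED),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=SEED),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, Xc, yc, cv=5, scoring="accuracy").mean()
    print(f"{name:>20s}: 5-fold CV accuracy = {acc:.3f}")

# Regression task with many irrelevant predictors: LASSO's L1 penalty drops some
# coefficients to exactly zero, whereas the ridge (L2) penalty only shrinks them.
Xr, yr = make_regression(n_samples=500, n_features=20, n_informative=5, noise=10.0, random_state=SEED)
ridge = Ridge(alpha=1.0).fit(Xr, yr)
lasso = Lasso(alpha=1.0).fit(Xr, yr)
print("Ridge non-zero coefficients:", int(np.sum(ridge.coef_ != 0)))
print("LASSO non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))

A fuller comparison would tune the penalty strengths (e.g., by cross-validation) and verify each method's assumptions about the data, which is the kind of check the paper argues should precede the choice of technique.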

Cite this paper

Abdulhafedh, A. (2022). Comparison between Common Statistical Modeling Techniques Used in Research, Including: Discriminant Analysis vs Logistic Regression, Ridge Regression vs LASSO, and Decision Tree vs Random Forest. Open Access Library Journal, 9, e8414. doi: http://dx.doi.org/10.4236/oalib.1108414.
