Comparison between Common Statistical Modeling Techniques Used in Research, Including: Discriminant Analysis vs Logistic Regression, Ridge Regression vs LASSO, and Decision Tree vs Random Forest
Statistical techniques are important tools for modeling in research. However, they can produce misleading outcomes if insufficient care is taken in choosing the right approach. Selecting the correct analysis for a study requires a thorough understanding of the differences between these tools. An incorrectly chosen modeling technique can cause serious problems when interpreting the findings and can affect the conclusions of the study. Each technique carries its own assumptions about the data and its own procedures. This paper compares common statistical approaches, including regression vs classification, discriminant analysis vs logistic regression, ridge regression vs LASSO, and decision tree vs random forest. Results show that each approach has distinct statistical characteristics that should be well understood before it is used in research.
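As a small illustration of how two of the compared techniques can differ in character, the sketch below contrasts ridge and LASSO shrinkage in the simplified special case of an orthonormal design matrix, where both estimators have closed forms in terms of the ordinary least squares (OLS) coefficients. This is not taken from the paper: the function names, the example coefficients, the penalty value `lam`, and the particular scaling of the two objectives (stated in the comments) are all assumptions made for illustration.

```python
# Illustrative sketch only: assumes an orthonormal design (X^T X = I), so each
# coefficient can be shrunk independently starting from its OLS estimate b.
#
# Assumed objectives (a scaling choice made for this example):
#   ridge: 0.5 * ||y - X beta||^2 + 0.5 * lam * ||beta||_2^2
#   LASSO: 0.5 * ||y - X beta||^2 + lam * ||beta||_1


def ridge_shrink(b, lam):
    """Ridge solution per coefficient: proportional shrinkage b / (1 + lam)."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """LASSO solution per coefficient: soft-thresholding of b at lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # coefficients with |b| <= lam are set exactly to zero


# Hypothetical OLS coefficients and penalty, chosen only for the example.
ols_coefs = [3.0, -1.5, 0.4, -0.2]
lam = 0.5

ridge_coefs = [ridge_shrink(b, lam) for b in ols_coefs]
lasso_coefs = [lasso_shrink(b, lam) for b in ols_coefs]

for b, r, l in zip(ols_coefs, ridge_coefs, lasso_coefs):
    print(f"OLS {b:+.3f}  ridge {r:+.3f}  LASSO {l:+.3f}")
```

Under these assumptions, ridge shrinks every coefficient toward zero but leaves all four nonzero, while LASSO sets the two small coefficients exactly to zero. That is why LASSO can also act as a variable selection method and ridge cannot.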
Cite this paper
Abdulhafedh, A. (2022). Comparison between Common Statistical Modeling Techniques Used in Research, Including: Discriminant Analysis vs Logistic Regression, Ridge Regression vs LASSO, and Decision Tree vs Random Forest. Open Access Library Journal, 9, e8414. doi: http://dx.doi.org/10.4236/oalib.1108414.