Incorporating Multiple Linear Regression in Predicting the House Prices Using a Big Real Estate Dataset with 80 Independent Variables

doi:10.4236/oalib.1108346

OALib Journal
ISSN: 2333-9721

Open Access Library Journal, Vol. 9, 2022

DOI: 10.4236/oalib.1108346, PP. 1-21

Azad Abdulhafedh

Subject Areas: Applied Statistical Mathematics, Civil Engineering

Keywords: Multiple Linear Regression, Ames House Price Prediction, RSE, RMSE, MSE, K-Fold, LOOCV

Abstract

This paper uses multiple linear regression to predict the final sale price of a house in a large real estate dataset. The data describe the sale of individual properties in Ames, Iowa, USA from 2006 to 2010, along with various features and details of each home. The dataset comprises 80 explanatory variables: 23 nominal, 23 ordinal, 14 discrete, and 20 continuous. The goal was to use the training data to predict the sale prices of the houses in the testing data. The most important predictors were identified with a random forest and kept in the analysis, and highly correlated predictors were dropped from the dataset. All assumptions of linear regression were checked, and the final predictive model retained only the most influential predictors. The model accuracy assessments produced very good results, with an adjusted R-squared value of 0.9283, a residual standard error (RSE) of 0.094, and a root mean squared error (RMSE) of 0.12792. In addition, the prediction error (mean squared error, MSE) of the final model was found to be very small (12%) under different cross-validation techniques, including the validation set approach, the K-fold approach, and the leave-one-out cross-validation (LOOCV) approach. The results show that multiple linear regression can predict house prices precisely on a big dataset with a large number of both categorical and numerical predictors.
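The three cross-validation schemes named in the abstract can be illustrated with a minimal sketch. It is not the paper's code: the data are synthetic stand-ins rather than the Ames dataset, the helper names (`fit_ols`, `predict`, `kfold_mse`) are hypothetical, and the 150/50 validation split and k = 10 are assumed values chosen only for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, with an intercept column prepended."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    """Predictions from coefficients returned by fit_ols."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def kfold_mse(X, y, k, seed=0):
    """Mean held-out MSE over k folds; k == len(y) gives LOOCV."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for test_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, test_idx)
        beta = fit_ols(X[train_idx], y[train_idx])
        resid = y[test_idx] - predict(beta, X[test_idx])
        errors.append(np.mean(resid ** 2))
    return float(np.mean(errors))

# Synthetic stand-in data (assumption): 200 rows, 3 numeric predictors,
# Gaussian noise with standard deviation 0.1.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.1, size=200)

# Validation set approach: fit on the first 150 rows, score on the rest.
split = 150
beta = fit_ols(X[:split], y[:split])
val_mse = float(np.mean((y[split:] - predict(beta, X[split:])) ** 2))

# K-fold (k = 10) and leave-one-out cross-validation.
cv10_mse = kfold_mse(X, y, k=10)
loocv_mse = kfold_mse(X, y, k=len(y))

print(f"validation-set MSE: {val_mse:.4f}")
print(f"10-fold MSE:        {cv10_mse:.4f}")
print(f"LOOCV MSE:          {loocv_mse:.4f}")
print(f"LOOCV RMSE:         {np.sqrt(loocv_mse):.4f}")
```

On well-behaved data the three estimates should land close to one another and near the noise variance; a large gap between them would suggest the model is sensitive to how the data are split.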

Cite this paper

Abdulhafedh, A. (2022). Incorporating Multiple Linear Regression in Predicting the House Prices Using a Big Real Estate Dataset with 80 Independent Variables. Open Access Library Journal, 9, e8346. doi: http://dx.doi.org/10.4236/oalib.1108346.
