This paper uses multiple linear regression to predict the final sale price of houses in a large real estate dataset. The data describe the sale of individual residential properties in Ames, Iowa, USA, from 2006 to 2010, together with the features and details of each home. The dataset comprises 80 explanatory variables: 23 nominal, 23 ordinal, 14 discrete, and 20 continuous. The goal was to use the training data to predict the sale prices of the houses in the testing data. The most important predictors were identified with a random forest and retained in the analysis, and highly correlated predictors were dropped from the dataset. All assumptions of linear regression were checked, and the final predictive model was obtained by keeping only the most influential predictors. The model accuracy assessments produced very good results, with an adjusted R-squared of 0.9283, a residual standard error (RSE) of 0.094, and a root mean squared error (RMSE) of 0.12792. In addition, the prediction error (mean squared error, MSE) of the final model was found to be very small (12%) under several cross-validation techniques: the validation set approach, the K-fold approach, and leave-one-out cross-validation (LOOCV). The results show that multiple linear regression can predict house prices accurately, even with a large dataset and a large number of both categorical and numerical predictors.
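The accuracy measures and cross-validation schemes named above can be illustrated with a minimal sketch. It is not the paper's code and does not use the Ames data: the data are synthetic, the function names (`fit_ols`, `fit_metrics`, `kfold_mse`) are hypothetical, and the formulas are the standard textbook definitions for a model with p predictors and n observations, which are assumed here to match those used in the paper.

```python
import math
import random


def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[p] = A[p], A[c]
        for i in range(k):
            if i != c:
                f = A[i][c] / A[c][c]
                A[i] = [a - f * b for a, b in zip(A[i], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(beta, X):
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X]


def fit_metrics(y, y_hat, p):
    """Adjusted R-squared, RSE and RMSE for a model with p predictors."""
    n = len(y)
    mean_y = sum(y) / n
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    tss = sum((a - mean_y) ** 2 for a in y)
    r2 = 1 - rss / tss
    return {
        "adj_r2": 1 - (1 - r2) * (n - 1) / (n - p - 1),
        "rse": math.sqrt(rss / (n - p - 1)),
        "rmse": math.sqrt(rss / n),
    }


def kfold_mse(X, y, k, seed=0):
    """Mean squared prediction error from k-fold CV; k = len(y) gives LOOCV."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        beta = fit_ols([X[i] for i in train], [y[i] for i in train])
        preds = predict(beta, [X[i] for i in fold])
        sq_err += sum((y[i] - q) ** 2 for i, q in zip(fold, preds))
    return sq_err / len(y)


# Synthetic data (an assumption for illustration): two predictors, small noise.
rng = random.Random(42)
X = [[rng.uniform(0, 10), rng.uniform(0, 5)] for _ in range(60)]
y = [2.0 + 0.5 * a - 0.3 * b + rng.gauss(0, 0.1) for a, b in X]

beta = fit_ols(X, y)
metrics = fit_metrics(y, predict(beta, X), p=2)
cv10 = kfold_mse(X, y, k=10)       # 10-fold cross-validation
loocv = kfold_mse(X, y, k=len(y))  # leave-one-out cross-validation
print(metrics, cv10, loocv)
```

In this sketch the RSE divides the residual sum of squares by n - p - 1 while the RMSE divides by n, so the two values differ slightly even on the same fitted data. The validation set approach corresponds to a single train/test split rather than the rotating folds shown here.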
Cite this paper
Abdulhafedh, A. (2022). Incorporating Multiple Linear Regression in Predicting the House Prices Using a Big Real Estate Dataset with 80 Independent Variables. Open Access Library Journal, 9, e8346. doi: http://dx.doi.org/10.4236/oalib.1108346.