With the increasing interest in e-commerce shopping, customer reviews have become
one of the most important elements that determine customer satisfaction
regarding products. This demonstrates the importance of working with Text Mining. This study is based on the Women’s Clothing E-Commerce Reviews database, which consists of reviews written by real
customers. The aim of this paper is to apply a Text Mining approach to a set
of customer reviews. Each review was classified as either positive or
negative. Four tree-based methods were applied to solve the
classification problem, namely Classification Tree, Random Forest, Gradient
Boosting and XGBoost. The dataset was split into training and test sets.
The results indicate that Random Forest overfits, XGBoost overfits
when the number of trees is too high, Classification Tree detects
negative reviews well but positive reviews poorly, and Gradient Boosting
shows stable values, with quality measures above 77% on the test dataset.
The applied methods agree on which terms are important for classification.
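
The comparison described above depends on computing the same quality measures on the training and test sets and checking how far they drift apart. The following is a minimal sketch of that check, not the authors' implementation. It assumes the quality measures are accuracy, sensitivity (share of positive reviews detected) and specificity (share of negative reviews detected), which the abstract does not state. The function names and the toy labels are hypothetical.

```python
from typing import Dict, Sequence


def quality_measures(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for binary labels.

    Assumed encoding (not from the paper): 1 = positive review, 0 = negative review.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives detected
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negatives detected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives missed
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives missed

    def ratio(num: int, den: int) -> float:
        # Return NaN instead of failing when a class is absent.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


def train_test_gap(train: Dict[str, float], test: Dict[str, float]) -> Dict[str, float]:
    """Difference (train minus test) per measure; a large gap hints at overfitting."""
    return {name: train[name] - test[name] for name in train}


# Hypothetical toy labels: a model that is perfect on training data
# but noticeably worse on held-out data.
train_true = [1, 1, 1, 1, 0, 0, 0, 0]
train_pred = [1, 1, 1, 1, 0, 0, 0, 0]
test_true = [1, 1, 1, 1, 0, 0, 0, 0]
test_pred = [1, 0, 1, 0, 0, 0, 0, 1]

train_scores = quality_measures(train_true, train_pred)
test_scores = quality_measures(test_true, test_pred)
gap = train_test_gap(train_scores, test_scores)

print("train:", train_scores)
print("test: ", test_scores)
print("gap:  ", gap)
```

In this toy case the higher specificity than sensitivity on the test labels mirrors the pattern reported for the Classification Tree, which detects negative reviews better than positive ones, and the train-test gap is the kind of signal that would flag overfitting.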