A Statistical Analysis of Textual E-Commerce Reviews Using Tree-Based Methods

DOI: 10.4236/ojs.2022.123023, PP. 357-372

Keywords: Text Mining, Supervised Classification, Tree-Based Methods, Classification Trees, Random Forest, Gradient Boosting, XGBoost

Abstract:

With growing interest in e-commerce shopping, customer reviews have become one of the most important indicators of customer satisfaction with products, which makes Text Mining an important tool for analyzing them. This study is based on the Women's Clothing E-Commerce Reviews database, which consists of reviews written by real customers. The aim of this paper is to apply a Text Mining approach to this set of customer reviews. Each review was classified as either positive or negative using a classification method. Four tree-based methods were applied to the classification problem: Classification Tree, Random Forest, Gradient Boosting and XGBoost. The dataset was split into training and test sets. The results indicate that Random Forest overfits, that XGBoost overfits if the number of trees is too high, that the Classification Tree is good at detecting negative reviews but poor at detecting positive ones, and that Gradient Boosting shows stable values, with quality measures above 77% on the test dataset. The applied methods agree on which terms are important for classification.
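The abstract does not say which quality measures were used, so the following is only a minimal sketch, assuming they are accuracy, sensitivity (share of positive reviews detected) and specificity (share of negative reviews detected). It shows how a gap between training and test scores can signal overfitting, and how a model can be good on one class and poor on the other. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def quality_measures(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for binary labels (1 = positive review)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negatives found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives missed
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


def train_test_gap(train: Dict[str, float], test: Dict[str, float]) -> Dict[str, float]:
    """Training score minus test score; a large positive gap suggests overfitting."""
    return {name: train[name] - test[name] for name in train}


# Toy labels invented for illustration only (not data from the paper).
train_true = [1, 1, 1, 0, 0, 0, 1, 0]
train_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # perfect fit on the training set
test_true = [1, 1, 1, 1, 0, 0, 0, 0]
test_pred = [1, 0, 0, 1, 0, 0, 0, 1]  # misses half of the positive reviews

train_scores = quality_measures(train_true, train_pred)
test_scores = quality_measures(test_true, test_pred)
gap = train_test_gap(train_scores, test_scores)

print(test_scores)
print(gap)
```

In this toy case the model is perfect on the training labels but reaches only 0.5 sensitivity and 0.75 specificity on the test labels, the kind of pattern the abstract describes as overfitting combined with weaker detection of positive reviews.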

