Stocks in the Chinese stock market can be divided into ST (Special Treatment) stocks and normal stocks. To help investors avoid buying stocks that may become ST stocks, this paper first applies SMOTEENN resampling to the minority ST class as a data preprocessing step and selects 139 financial indicators and technical factors as predictive features. It then combines the Boruta algorithm with the Copula entropy method for feature selection, which effectively improves the machine learning models' performance in ST stock classification, with the AUC values of the two models reaching 98% on the test set. For model selection and optimization, the paper builds six models (logistic regression, XGBoost, AdaBoost, LightGBM, CatBoost, and MLP) and tunes them with the Optuna framework. XGBoost is selected as the best model because its AUC exceeds 95% and its running time is shorter. Finally, the XGBoost model is explained using SHAP, and interactions between features are identified; exploiting them further improves the model's accuracy and AUC by about 0.6%, verifying the effectiveness of the model.
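Since the models above are compared by AUC, it may help to recall what that number measures: the probability that a randomly chosen ST stock receives a higher predicted score than a randomly chosen normal stock. The following is a minimal sketch of that definition, not the paper's evaluation code; the function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (assumption): 1 = ST stock, 0 = normal stock.
toy_labels = [1, 0, 1, 0, 0, 1]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

On this toy data, eight of the nine ST/normal pairs are ordered correctly, so `toy_auc` equals 8/9. The pairwise loop is quadratic in the sample size and is meant only to make the definition concrete; a rank-based computation would be preferable on a full test set.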