An Improved LightGBM Algorithm Based on Sampling Enhancement and Dynamic Histograms
Abstract:
Gradient boosting algorithms are limited chiefly by their training speed on large-scale data. This paper addresses two limitations of LightGBM: its gradient-based one-side sampling relies only on first-order derivatives, which hurts accuracy, and its histogram binning ignores the characteristics of the data distribution, which leads to redundant computation. We propose a Newton-based gradient one-side sampling method that incorporates second-order derivatives to improve sampling precision, together with a dynamic histogram algorithm that performs adaptive binning aware of both the data distribution and the labels. Experiments on the Epsilon and MNIST8M datasets show that the new method improves model performance while reducing training time by 20.7% and 9.8%, respectively.
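To make the sampling idea concrete, the following is a minimal sketch of one-side sampling that ranks instances using second-order information. The abstract does not state the exact ranking criterion, so the score used here, the magnitude of the per-sample Newton step |g| / (h + λ), is an assumption, as are the function name `newton_goss_sample` and the parameters `top_rate`, `other_rate`, and `reg_lambda`. The reweighting factor (1 - a) / b is the one used by standard gradient-based one-side sampling.

```python
import numpy as np


def newton_goss_sample(grad, hess, top_rate=0.2, other_rate=0.1,
                       reg_lambda=1.0, seed=0):
    """One-side sampling ranked by a Newton-step score (illustrative only).

    Returns the indices of the retained samples and their weights.
    """
    rng = np.random.default_rng(seed)
    n = grad.shape[0]

    # ASSUMPTION: rank by |g| / (h + lambda), the size of the per-sample
    # Newton step. Standard one-side sampling would rank by |g| alone.
    score = np.abs(grad) / (hess + reg_lambda)

    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Keep every sample in the top fraction by score.
    order = np.argsort(-score)
    top_idx = order[:n_top]
    rest_idx = order[n_top:]

    # Draw a uniform random subset of the remaining samples.
    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)

    idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(idx.shape[0])
    # Amplify the randomly drawn samples by (1 - a) / b so that their
    # gradient sum approximates that of the full low-score set.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return idx, weights


# Toy usage: gradients and Hessians of the logistic loss on random data.
demo_rng = np.random.default_rng(42)
y = demo_rng.integers(0, 2, size=1000)
p = demo_rng.uniform(0.01, 0.99, size=1000)
grad = p - y            # first derivative of log-loss w.r.t. the raw score
hess = p * (1.0 - p)    # second derivative of log-loss w.r.t. the raw score

idx, w = newton_goss_sample(grad, hess)
print(len(idx), "samples kept out of", len(grad))
```

The weighted gradients and Hessians of the retained samples would then feed histogram construction in place of the full dataset. Whether this score matches the paper's criterion cannot be determined from the abstract alone.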