Health Index Modeling Based on Physical Examination Big Data
Abstract:
In recent years, with the rapid development of health care big data platforms, more and more physical examination data have been integrated into them. How to mine and use this massive volume of health care data to improve the quality of medical services and the level of doctor-patient communication is a new challenge. In this paper, we apply machine learning algorithms to 3,529,829 physical examination records from 45,374 users, carrying out exploratory data analysis and feature engineering. Building on a personal credit risk scoring model, we change the prediction model from gradient-boosted decision trees to a LASSO regression model, which increases the interpretability of the scorecard. Combined with the application scenario and input data of physical examinations, this yields a physical examination scoring model. Experimental results on the physical examination data set show that the health index score approximately follows a normal distribution, which is consistent with the prior assumption of the linear regression model. The scoring model is both robust and discriminative. It integrates the individual examination indicators, describes a user's health status fairly objectively, reduces the communication cost between users and doctors, and encourages users to pay more attention to their overall health.
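The interpretability argument for LASSO can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the indicator names, the penalty `alpha=0.5`, and the helper functions `fit_lasso` and `score_with_contributions` are all assumptions made for illustration. It fits an L1-penalized linear model by coordinate descent and shows that irrelevant indicators receive a weight of exactly zero, so each score decomposes into a few per-indicator contributions.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso(X, y, alpha, n_iter=100):
    """Coordinate descent for (1/(2n)) * ||y - b - Xw||^2 + alpha * ||w||_1.

    The intercept b is not penalized. Returns (b, w).
    """
    n, p = len(X), len(X[0])
    weights = [0.0] * p
    intercept = sum(y) / n
    residual = [yi - intercept for yi in y]
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            norm_sq = sum(c * c for c in col) / n
            if norm_sq == 0.0:
                continue
            # Correlation of feature j with the residual that excludes j.
            rho = sum(c * r for c, r in zip(col, residual)) / n
            rho += norm_sq * weights[j]
            new_w = soft_threshold(rho, alpha) / norm_sq
            delta = new_w - weights[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                weights[j] = new_w
        # Re-center: the unpenalized intercept absorbs the mean residual.
        shift = sum(residual) / n
        intercept += shift
        residual = [r - shift for r in residual]
    return intercept, weights


def score_with_contributions(intercept, weights, x):
    """Return the linear score and the per-indicator terms that sum to it."""
    contributions = [w * v for w, v in zip(weights, x)]
    return intercept + sum(contributions), contributions


# Synthetic, standardized "indicators" (hypothetical names). Only the first
# two influence the target; the remaining three are pure noise.
rng = random.Random(0)
names = ["bmi_z", "systolic_bp_z", "noise_a", "noise_b", "noise_c"]
X = [[rng.gauss(0, 1) for _ in names] for _ in range(400)]
y = [70 - 6 * row[0] - 3 * row[1] + rng.gauss(0, 1) for row in X]

intercept, weights = fit_lasso(X, y, alpha=0.5)
score, parts = score_with_contributions(intercept, weights, X[0])

for name, w in zip(names, weights):
    print(f"{name:>14}: {w:+.3f}")
print(f"score for first user: {score:.2f}")
```

Because the fitted model is linear, the list returned by `score_with_contributions` shows how much each indicator raises or lowers an individual's score, which is the kind of explanation a tree ensemble does not provide directly.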