Missing Value Interpolation for Medical Big Data Based on Missing Forest

Journal of Jilin University (Information Science Edition), 2022, Vol. 40, Issue (4): 616-620.

BAI Hongtao^a,b, LUAN Xue^a, HE Lili^a,b, BI Yaru^c, ZHANG Tingting^b, SUN Chenglin^c

a. College of Software; b. College of Computer Science and Technology, Jilin University, Changchun 130022, China; c. First Hospital, Jilin University, Changchun 130012, China

Received: 2022-03-29 Online: 2022-08-16 Published: 2022-08-17
Corresponding author: SUN Chenglin (1975- ), male (Korean ethnicity), born in Yanji, Jilin; professor and doctoral supervisor at Jilin University, PhD; main research area: clinical endocrinology. (Tel) 86-13944855718 (E-mail) clsun213@163.com.
About the author: BAI Hongtao (1975- ), male, born in Yushu, Jilin; professor at Jilin University, PhD; main research areas: machine learning and parallel computing methods. (Tel) 86-13604367893 (E-mail) baiht@jlu.edu.cn.
Supported by: National Key Research and Development Program of China (2017YFC1309805); Natural Science Foundation of the Jilin Provincial Department of Science and Technology (20210101181JC)

Abstract

Missing data in medical datasets degrade classifier performance and harm downstream tasks. To address this, we use the missing forest imputation method to fill in missing values in medical datasets. The method first trains a random forest model on the observed values of the complete data in the dataset, then uses the trained model to predict the missing data, and repeats this process iteratively until the missing values are filled in. In tests on two medical datasets, missing forest imputation yields lower error, as measured by NRMSE (Normalized Root Mean Squared Error) and PFC (Proportion of Falsely Classified), than K-nearest neighbor imputation, multiple imputation, and GAIN (Generative Adversarial Imputation Nets) imputation. In addition, the stability of missing forest imputation is demonstrated on a diabetes dataset by analyzing the dose-response relationship between alanine aminotransferase (ALT) and diabetes.

Key words: missing data imputation; missing forest imputation; big data; dose-response relationship between alanine aminotransferase (ALT) and diabetes
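
The train, predict, and repeat loop described in the abstract, together with the two error measures, can be illustrated with a minimal sketch. It is not the authors' implementation. To stay dependency-free, it swaps in a hypothetical nearest-neighbour learner (`nearest_neighbour_predict`) where missing forest would use a random forest, which could be plugged in through the `fit_predict` parameter. The function names, the mean-based initial guess, the "stop when nothing changes" rule, and the exact NRMSE and PFC formulas (taken here as the definitions commonly used with missing forest, evaluated on artificially masked entries) are assumptions, since the abstract does not spell them out.

```python
import math
from statistics import mean, pvariance


def nearest_neighbour_predict(train_x, train_y, query_x):
    """Stand-in learner (hypothetical, NOT a random forest): for each query
    row, return the target of the closest training row."""
    predictions = []
    for q in query_x:
        closest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], q))
        predictions.append(train_y[closest])
    return predictions


def iterative_impute(rows, fit_predict=nearest_neighbour_predict, max_iter=10):
    """Fill None entries in a numeric table (list of equal-length lists).
    Assumes every column has at least one observed value."""
    n_rows, n_cols = len(rows), len(rows[0])
    missing = [[value is None for value in row] for row in rows]
    filled = [list(row) for row in rows]

    # Initial guess: mean of the observed values in each column.
    for j in range(n_cols):
        column_mean = mean(row[j] for row in rows if row[j] is not None)
        for i in range(n_rows):
            if missing[i][j]:
                filled[i][j] = column_mean

    for _ in range(max_iter):
        previous = [list(row) for row in filled]
        for j in range(n_cols):
            observed = [i for i in range(n_rows) if not missing[i][j]]
            to_fill = [i for i in range(n_rows) if missing[i][j]]
            if not to_fill:
                continue

            def predictors(i):
                # All columns of row i except the target column j.
                return filled[i][:j] + filled[i][j + 1:]

            # Train on rows where column j is observed, predict the rest.
            predictions = fit_predict(
                [predictors(i) for i in observed],
                [filled[i][j] for i in observed],
                [predictors(i) for i in to_fill],
            )
            for i, value in zip(to_fill, predictions):
                filled[i][j] = value
        if filled == previous:  # simplified stopping rule: nothing changed
            break
    return filled


def nrmse(true_values, imputed_values):
    """Assumed definition: sqrt(mean squared error / variance of true values),
    computed over the masked continuous entries only."""
    mse = mean((t - p) ** 2 for t, p in zip(true_values, imputed_values))
    return math.sqrt(mse / pvariance(true_values))


def pfc(true_labels, imputed_labels):
    """Assumed definition: share of masked categorical entries imputed wrongly."""
    wrong = sum(t != p for t, p in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


# Toy table with one missing entry (None) in the second column.
data = [[1.0, 2.0], [2.0, 4.0], [2.1, None], [4.0, 8.0]]
completed = iterative_impute(data)
```

For both measures, lower values indicate better imputation. The simplified stopping rule and the fixed column order are choices made for brevity in this sketch and should not be read as part of the published method.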

CLC number: TP391

BAI Hongtao, LUAN Xue, HE Lili, BI Yaru, ZHANG Tingting, SUN Chenglin. Missing Value Interpolation for Medical Big Data Based on Missing Forest[J]. Journal of Jilin University (Information Science Edition), 2022, 40(4): 616-620.