%0 Journal Article
%T Survey of structured data cleaning methods
%A 郝爽
%A 李国良
%A 冯建华
%A 王宁
%J 清华大学学报(自然科学版)
%D 2018
%R 10.16511/j.cnki.qhdxxb.2018.22.053
%X Data cleaning is the process of detecting and repairing dirty data, and it is the foundation of data analysis and management. This paper classifies and summarizes classical and emerging data cleaning techniques and identifies directions for further research. It first formally defines the data cleaning problem, then describes in detail the detection methods for four types of data noise: missing data, redundant data, conflicting data, and erroneous data. Noise elimination methods are then grouped by cleaning approach: integrity-constraint-based, rule-based, statistical, and human-in-the-loop data cleaning. Commonly used evaluation datasets and noise injection tools are introduced, and key directions for future research are discussed.
%K data cleaning
%K dirty data
%K error detection
%K error elimination
%U http://jst.tsinghuajournals.com/CN/Y2018/V58/I12/1037