%0 Journal Article
%T An Anomalies Processing Algorithm Based on Abnormal Itemsets (基于反常项集的异常值处理算法)
%A 崔晨
%A 李贵
%A 李征宇
%A 韩子扬
%A 曹科研
%J Hans Journal of Data Mining
%P 150-166
%@ 2163-1468
%D 2021
%I Hans Publishing
%R 10.12677/HJDM.2021.113014
%X Anomalies are noisy and inconsistent values in data. Detecting and processing them usually depends on constraint rules such as conditional functional dependencies, denial constraints, and editing rules. In a specific domain, however, these rules must be written by domain experts, and they are difficult to discover efficiently with data mining and machine learning algorithms. This paper proposes the concept of an abnormal itemset for data cleaning. Similar in spirit to outlier detection based on data distribution density, an abnormal itemset is a combination of attribute values that is unlikely to occur in the data. On this basis, the paper introduces the weighted harmonic lift and its properties, and mines abnormal itemsets with low lift using an improved equivalence class transformation (Eclat) algorithm. It then uses quasi-abnormal itemsets to precompute data corrections and presents an anomaly repair algorithm, similar to nearest-neighbor imputation, to ensure repair quality. Experiments on a real estate information data set show that anomaly detection and processing based on abnormal itemsets achieves high accuracy and avoids introducing new anomalies during data repair.
%K Anomalies Data Processing
%K Data Cleaning
%K Pattern Mining
%K Abnormal Itemset
%U http://www.hanspub.org/journal/PaperInformation.aspx?PaperID=42926