|
基于反常项集的异常值处理算法
|
Abstract:
异常值指的是数据中的噪声和不一致值。异常值检测与处理往往依赖于约束规则,通常的约束规则包括条件函数依赖、否定约束、编辑规则等。但对于特定领域,这些领域约束规则需要由领域专家制定,基于数据挖掘和机器学习算法,难以高效地发现这些领域约束规则。本文提出了一种用于数据清洗的反常项集的概念,与基于数据分布密度的异常值检测算法类似,反常项集是数据中不太可能出现的非常态取值组合。在此基础上,本文引入了加权调和提升度的概念及特性,利用改进的等价类变换算法挖掘低提升度的反常项集。并采用准反常项集对数据更正进行预计算,给出了一种类似于近邻插补算法的异常值更正算法,以保证异常值处理质量。在房地产信息数据集下的实验表明,基于反常项集的异常值检测与处理算法具有较高的精度,同时能够避免在数据修复中引入新的异常。
Anomalies refer to the noise and inconsistent values in the data. The detection and processing of anomalies often depend on domain constraints, which usually include conditional functional de-pendencies, negative constraints and editing rules, etc. However, for specific domains, these domain constraint rules need to be made by domain experts, and it is difficult to find these domain con-straint rules efficiently based on data mining and machine learning algorithms. In this paper, a concept of abnormal itemset for data cleaning is proposed. Similar to the outlier detection algo-rithm based on data distribution density, abnormal itemset is an unlikely combination of abnormal values in data. Then, some characteristics of lifting degree are introduced to mine abnormal itemset with low lifting degree by using the improved equivalence class transformation algorithm. Fur-thermore, this paper proposes an anomalies repair algorithm similar to the nearest neighbor in-terpolation algorithm to ensure the repair quality. Experiments under the real estate information data set show that the anomalies detection and processing algorithm based on abnormal itemset have high accuracy and will not introduce new anomalies by data repairing.
[1] | Hemalatha, C.S., Vaidehi, V. and Lakshmi, R. (2015) Minimal Infrequent Pattern Based Approach for Mining Outliers in Data Streams. Expert Systems with Applications, 42, 1998-2012. https://doi.org/10.1016/j.eswa.2014.09.053 |
[2] | Duncan, K. and Wells, D. (1999) A Rule Based Data Cleansing. Journal of Data Warehousing, 4, 146-159. |
[3] | Szathmary, L., Napoli, A. and Valtchev, P. (2007) Towards Rare Item-set Mining. 19th IEEE International Conference on Tools with Artificial Intelligence, Patras, 29-31 October 2007, 305-312. https://doi.org/10.1109/ICTAI.2007.30 |
[4] | Kolahi, S. and Lakshmanan, L. (2009) On Approximating Optimum Repairs for Functional Dependency Violations. Database Theory-ICDT, International Conference, St Peters-burg, March 2009, 53.
https://doi.org/10.1145/1514894.1514901 |
[5] | Chu, X., Ilyas, I.F. and Papotti, P. (2014) Discovering Denial Con-straints. Proceedings of the Vldb Endowment, 6, 1498-1509. https://doi.org/10.14778/2536258.2536262 |
[6] | Beskales, G., Ilyas, I.F., Golab, L. and Galiullin, A. (2012) On the Relative Trust between Inconsistent Data and Inaccurate Constraints. 2013 IEEE 29th International Conference on Data Engineering (ICDE), Brisbane, 8-12 April 2013, 541-552. https://doi.org/10.1109/ICDE.2013.6544854 |
[7] | 刘波, 耿寅融. 数据质量检测规则挖掘方法[J]. 模式识别与人工智能, 2012(5): 835-844. |
[8] | 林印华, 张春海, 刘洁. 基于清洗规则和主数据的数据复算法实现[J]. 计算机科学, 2012, 39(11): 174-176. |
[9] | 王凤梅, 胡丽霞. 一种基于近邻规则的缺失数据填补方法[J]. 计算机工程, 2012, 38(21): 53-55. |
[10] | Szathmary, L., Valtchev, P. and Napoli, A. (2010) Generating Rare Association Rules Using the Minimal Rare Item Sets Family. International Journal of Soft-ware and Informatics, 4, 219-238. |
[11] | Adda, M., Lei, W. and Yi, F. (2007) Rare Itemset Mining. International Con-ference on Machine Learning & Applications, Cincinnati, 13-15 December 2007, 73-80. https://doi.org/10.1109/ICMLA.2007.106 |
[12] | Szathmary, L., Valtchev, P. and Napoli, A. (2010) Finding Minimal Rare Itemsets and Rare Association Rules. International Conference on Knowledge Science, Engineering and Manage-ment, Belfast, 1-3 September 2010, 16-27.
https://doi.org/10.1007/978-3-642-15280-1_5 |
[13] | Ouyang, W.M. and Huang, Q.H. (2013) Mining Indirect Tem-poral Sequential Patterns in Large Transaction Databases. Applied Mechanics and Materials, 385-386, 1362-1365.
https://doi.org/10.4028/www.scientific.net/AMM.385-386.1362 |
[14] | Szathmary, L., Valtchev, P., Napoli, A. and Godin, R. (2009) Efficient Vertical Mining of Frequent Closures and Generators. Advances in Intelligent Data Analysis VIII, 8th International Symposium on Intelligent Data Analysis, IDA 2009, Lyon, 31 August-2 September 2009, 393-404. https://doi.org/10.1007/978-3-642-03915-7_34 |
[15] | Zaki, M.J. and Hsiao, C.J. (2002) CHARM: An Efficient Al-gorithm for Closed Itemset Mining. Proceedings of the Second SIAM International Conference on Data Mining, Arling-ton, 11-13 April 2002, 457-473.
https://doi.org/10.1137/1.9781611972726.27 |
[16] | Rammelaere, J., Geerts, F. and Goethals, B. (2017) Cleaning Data with Forbidden Itemsets. IEEE International Conference on Data Engineering, San Diego, 19-22 April 2017, 897-908. https://doi.org/10.1109/ICDE.2017.138 |
[17] | Boriah, S., Chandola, V. and Kumar, V. (2008) Similarity Measures for Categorical Data: A Comparative Evaluation. Proceedings of the SIAM International Conference on Data Mining, SDM 2008, Atlanta, 24-26 April 2008, 243-254.
https://doi.org/10.1137/1.9781611972788.22 |
[18] | Chiang, F. and Miller, R.J. (2008) Discovering Data Quality Rules. Proceedings of the Vldb Endowment, 1, 1166-1177.
https://doi.org/10.14778/1453856.1453980 |