The paper deals with classification in privacy-preserving data mining. An algorithm, the Random Response Forest, is introduced as an extension of Random Forest to privacy-preserving problems; it constructs many binary decision trees. Random Response Forest uses the Random Response idea from among the anonymization methods, which, instead of generalization, keeps the original data but mixes them. An anonymity metric is defined for the indistinguishability of two mixed data sets. This metric, the binary anonymity, is investigated and taken into consideration for the optimal coding of the binary variables. The accuracy of Random Response Forest is presented at the end of the paper.

1. Introduction

In the data mining area, privacy is a prominent issue: the anonymity of the persons behind the data must be preserved in the models. The goal of privacy-preserving data mining (PPDM) [1] is to develop data mining models without increasing the risk of misuse of the data used to generate those models. There are two broad approaches in the literature, based on different views of privacy [2]. The randomization approach focuses on individual privacy; fortunately, data mining models do not necessarily require individual records, only distributions. This approach therefore preserves privacy by perturbing the data, and since the perturbing distribution is known, it can be used to reconstruct aggregate distributions, that is, the probability distribution of the data set. In the other approach, called Secure Multiparty Computation (SMC), the aim is to build a data mining model across multiple databases without revealing the individual records in each database to the other databases [3]; this paper does not deal with that approach. PPDM methods for data modification include perturbation, blocking, merging, swapping, and sampling. Perturbation is accomplished by the alteration of an attribute value to a new value (i.e., changing a 1-value to a 0-value, or adding noise).
Blocking means the replacement of an existing attribute value with a fixed sign or character representing a missing value. Merging is the combination of several values into a coarser category [4]. Data swapping refers to interchanging the values of individual records [5]. Sampling refers to releasing data for only a sample of a population. A widespread approach in the PPDM literature is data perturbation, where the original data are perturbed and the data mining model is built on the randomized data; this paper focuses on only this method. The data perturbation should take two opposite requirements into consideration: the privacy of the
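As a concrete illustration of the perturbation-and-reconstruction idea described above, the following is a minimal sketch of Warner's randomized response for a single binary attribute (the function names and the keep-probability parameter theta are illustrative, not taken from the paper):

```python
import random

def perturb(bits, theta, rng=None):
    """Warner's randomized response: report each bit truthfully with
    probability theta, otherwise report the flipped bit."""
    rng = rng or random.Random()
    return [b if rng.random() < theta else 1 - b for b in bits]

def reconstruct(perturbed, theta):
    """Recover the original proportion of 1s from the perturbed data.

    The observed proportion satisfies p1 = theta*pi + (1 - theta)*(1 - pi),
    which solves to pi = (p1 - (1 - theta)) / (2*theta - 1) for theta != 0.5.
    """
    p1 = sum(perturbed) / len(perturbed)
    return (p1 - (1 - theta)) / (2 * theta - 1)
```

With theta close to 0.5 an individual reported bit reveals almost nothing about the original value, but many records are needed for an accurate aggregate estimate; with theta close to 1 the opposite holds, which is exactly the trade-off between privacy and model accuracy discussed above.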
References
[1] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '00), vol. 29, pp. 439–450, ACM, New York, NY, USA, May 2000.
[2] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, “Privacy-preserving data publishing: a survey of recent developments,” ACM Computing Surveys, vol. 42, no. 4, article 14, pp. 1–53, 2010.
[3] X. Qi and M. Zong, “An overview of privacy preserving data mining,” Procedia Environmental Sciences, vol. 12, pp. 1341–1347, 2012.
[4] V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin, and Y. Theodoridis, “State-of-the-art in privacy preserving data mining,” SIGMOD Record, vol. 33, no. 1, pp. 50–57, 2004.
[5] C. C. Aggarwal and P. S. Yu, Privacy-Preserving Data Mining: Models and Algorithms, vol. 34 of The Kluwer International Series on Advances in Database Systems, Springer, New York, NY, USA, 2008.
[6] N. Zhang, W. Zhao, and J. Chen, “Performance measurements for privacy preserving data mining,” in Proceedings of the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD '05), T. B. Ho, D. Cheung, and H. Liu, Eds., vol. 3518 of Lecture Notes in Computer Science, pp. 43–49, Hanoi, Vietnam, May 2005.
[7] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[8] S. L. Warner, “Randomized response: a survey technique for eliminating evasive answer bias,” Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–66, 1965.
[9] W. Du and Z. Zhan, “Using randomized response techniques for privacy-preserving data mining,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), L. Getoor, T. Senator, P. Domingos, and C. Faloutsos, Eds., pp. 505–510, ACM, New York, NY, USA, August 2003.
[10] X. Xiao and Y. Tao, “Personalized privacy preservation,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '06), pp. 229–240, ACM, New York, NY, USA, June 2006.
[11] A. Skowron and C. Rauszer, Intelligent Decision Support: Handbook of Applications and Advances of the Rough Set Theory, vol. 11, Springer, 1992.
[12] R. J. Bayardo and R. Agrawal, “Data privacy through optimal k-anonymization,” in Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE '05), pp. 217–228, Tokyo, Japan, April 2005.
[13] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, “Mondrian multidimensional k-anonymity,” in Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE '06), p. 25, Atlanta, Ga, USA, April 2006.
[14] B. C. M. Fung, K. Wang, and P. S. Yu, “Anonymizing classification data for privacy preservation,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 5, pp. 711–725, 2007.
[15] T. Bylander, “Estimating generalization error on two-class datasets using out-of-bag estimates,” Machine Learning, vol. 48, no. 1–3, pp. 287–297, 2002.
[16] Census Income Data Set, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.