%0 Journal Article
%T Random Response Forest for Privacy-Preserving Classification
%A Gábor Szűcs
%J Journal of Computational Engineering
%D 2013
%R 10.1155/2013/397096
%X The paper deals with classification in privacy-preserving data mining. It introduces the Random Response Forest, an algorithm that constructs many binary decision trees as an extension of Random Forest to privacy-preserving problems. Among the anonymization methods, Random Response Forest uses the Random Response idea, which keeps the original data instead of generalizing them, but mixes them. An anonymity metric is defined for the indistinguishability of two mixed sets of data. This metric, the binary anonymity, is investigated and taken into account for the optimal coding of the binary variables. The accuracy of Random Response Forest is presented at the end of the paper. 1. Introduction In data mining, privacy is an important issue: the anonymity of persons must be preserved in the models. The goal of privacy-preserving data mining (PPDM) [1] is to develop data mining models without increasing the risk of misuse of the data used to generate those models. There are two broad approaches in the literature, based on different views of privacy [2]. The randomization approach focuses on individual privacy; fortunately, data mining models do not necessarily require individual records, only distributions. This approach preserves privacy by perturbing the data, and since the perturbing distribution is known, it can be used to reconstruct aggregate distributions, that is, the probability distribution of the data set. In the other approach, so-called Secure Multi-party Computation (SMC), the aim is to build a data mining model across multiple databases without revealing the individual records in each database to the other databases [3]; this paper does not deal with that approach. PPDM methods for data modification can include perturbation, blocking, merging, swapping, or sampling. Perturbation is the alteration of an attribute value by a new value (e.g., changing a 1-value to a 0-value, or adding noise).
Blocking means the replacement of an existing attribute value with a fixed sign or character representing the missing value. Merging is the combination of several values into a coarser category [4]. Data swapping refers to interchanging values of individual records [5]. Sampling refers to releasing data for only a sample of a population. A widely used approach in the PPDM literature is data perturbation, the only method this paper focuses on, where the original data are perturbed and the data mining model is built on the randomized data. Data perturbation must take two opposing requirements into consideration: the privacy of the
%U http://www.hindawi.com/journals/jcengi/2013/397096/