Enhancing Protection in High-Dimensional Data - Distributed Differential Privacy With Feature Selection
Keywords: Protection; High-dimensional data; Distributed; Differential privacy; Feature selection

Abstract: The computational cost for implementing data privacy protection tends to rise as the dimensions increase, especially on correlated datasets. For this reason, a faster data protection mechanism is needed to handle high-dimensional data while balancing utility and privacy. This study introduces an innovative framework to improve the performance by leveraging distributed computing strategies. The framework integrates specific feature selection algorithms and distributed mutual information computation, which is crucial for sensitivity assessment. Additionally, it is optimized using a hyperparameter tuning technique based on Bayesian optimization, which minimizes either a combined score of the Bayesian Information Criterion (BIC) and Akaike's Information Criterion (AIC) or the Maximal Information Coefficient (MIC) score individually. Extensive testing on 12 datasets with tens to thousands of features was conducted for classification and regression tasks. With our method, the sensitivity of the resulting data is lower than with alternative approaches, requiring less perturbation for an equivalent level of privacy. Using a novel Privacy Deviation Coefficient (PDC) metric, we assess the performance disparity between original and perturbed data. Overall, our method achieves a significant execution time improvement of 64.30%, providing valuable insights for practical applications.
1. Introduction
In the data-driven era, enterprises strive to innovate by harnessing their datasets for analyses that augment transparency and
facilitate informed decision-making, necessitating collecting or sharing data from diverse sources. Data sharing encompasses raw
data and collaborative utilization of models, particularly in machine learning (Fang et al., 2024; Ge et al., 2023; Li et al., 2022;
Sinaci et al., 2024). Nevertheless, deploying machine learning in corporate settings encounters challenges in balancing data privacy
with optimizing data utility (H.R. et al., 2023; Kumar et al., 2023; Li et al., 2022; Xu et al., 2023). Despite incorporating protective
elements like regularization in machine learning systems, which help safeguard training data, determined adversaries may still
extract sensitive information from the encoded representation of the machine learning model. This vulnerability is demonstrated by
model-inversion attacks, such as in image recovery from facial recognition systems (Abadi et al., 2016).
In response to the privacy concerns, researchers have developed a series of protection methods, including encryption, secure
multi-party computing, k-anonymity, t-closeness, l-diversity, and the widely used perturbation technique, with Differential Privacy
(DP) being a prominent example (El Mestari et al., 2024; Himeur et al., 2022; Sathish Kumar et al., 2023; Zhang, Zhu et al., 2022).
Encryption and secure multi-party computation are stronger alternatives; however, these technologies carry notable operational and computational costs, particularly when dealing with large datasets (Cheu et al., 2019; Kiehn & Car, 2018; Yao et al., 2023). On the other hand, DP offers a robust, standardized mathematical framework ensuring strong privacy protection
and finds extensive application in big data privacy. It protects the confidential data of individuals within the dataset while still
guaranteeing the extraction of relevant information. However, recent research shows that the common DP actions on a correlated
dataset may elevate the risk of privacy breaches, thereby complicating the task of balancing privacy and utility aspects of DP (Lv
& Zhu, 2019; Wang & Wang, 2021; Wang et al., 2023). Furthermore, applying DP strategies to correlated datasets incurs higher
computational costs due to intricate variable relationships, necessitating additional calculations to ensure privacy while maintaining
utility in the data analysis.
We observed that one way to determine the accuracy of a DP model is to compute an important measurement, namely Mutual Information (MI), which is crucial for assessing the relationship between variables and is necessary to calculate the sensitivity of the dataset. Despite its usefulness, however, this procedure is highly time-intensive (Shen et al., 2023; Zhao, Fan et al., 2021). The issue becomes particularly pronounced when dealing with high-dimensional datasets containing many features and substantial sample sizes, posing a significant challenge in the computing process.
A prevalent strategy for handling high-dimensional datasets in preprocessing involves integrating feature selection within the
dataset framework. However, employing a reliable feature selection method to minimize information loss is essential, particularly
when adding noise during differential privacy, which can further impact data utility. On the other hand, leveraging distributed
computing offers a potential solution to address computational challenges associated with large-scale data processing. Employing
Hadoop or Spark has proven highly effective in mitigating computational complexities in privacy-preserving techniques such as data
anonymization and l-diversity (Ashkouti et al., 2021; Nayahi & Kavitha, 2017). Additionally, parallel attribute reduction algorithms
applied to extensive datasets have shown promising results in enhancing efficiency and scalability (Yin et al., 2021).
Building upon previous research findings, we introduce a customized DP framework adapted to handle large-scale correlated
datasets in distributed environments. Our approach begins by identifying relevant features using a selection technique within a
distributed computing environment, leveraging cluster resources to optimize computational efficiency. Simultaneously, in determining the optimal noise levels to ensure differential privacy, we defined parameters for distributed computing. This method leads to
faster processing while maintaining comparable results when evaluated with classification and regression models on datasets with
a sufficiently large number of features.
The remainder of this manuscript is structured as follows: Section 2 explores existing research related to the field. Section 3
outlines the objectives of our study. Section 4 describes the method used for our proposed framework. Experimental setup, including
system specification, data collection, and testing procedure, are described in Section 5. Section 6 provides the results with a thorough
analysis that demonstrates the efficacy of our approach and suggests possible directions for future research. To conclude our research,
Section 7 summarizes the main insights.
2. Related works
A DP method, introduced by Dwork (2006), is a technique used to address the problem of privacy leaks in datasets. Essentially,
DP methods entail the introduction of controlled random noise into aggregate data. The ultimate goal of DP is to strike a balance
between preserving privacy and data usability, ensuring that adversaries face uncertainty in identifying specific information about
individuals in a dataset. Since its inception, various enhancements have been suggested to improve privacy preservation under DP
in general scenarios and specific case studies.
In traditional DP, researchers often rely on the assumption that the records in a dataset are independent. For instance, a study
by Kairouz et al. (2015) explores multi-party computing in the context of DP, which allows each party to broadcast messages
interactively while preserving privacy. However, this approach does not account for scenarios involving correlated data; instead, it
assumes data independence among the parties involved. Meanwhile, real-world datasets often exhibit dependencies among records
and between their attributes, implying that removing one record can influence others. For example, salary information might be
closely linked to education level and occupation in a dataset. Such correlations can inadvertently reveal more information than
expected to the adversaries (Kifer & Machanavajjhala, 2011). This raises the challenge of adding appropriate noise to maintain data
privacy in correlated datasets. Adding too much noise to correlated datasets degrades data utility, while insufficient noise fails to protect privacy.
Other challenges include applying DP to complex and high-volume network structures, high-velocity, and high-dimensional data.
The complexity of these datasets necessitates adding more noise to achieve DP, which leads to the ‘‘curse of dimensionality’’. This
phenomenon occurs when perturbing a high-dimensional dataset results in a low signal-to-noise ratio and a nearly useless dataset.
In addition, the computational complexity of these data sets can be too high to be practically implemented (Jiang, Gao et al., 2022).
Table 1
Computation expenses associated with MI and sensitivity.

                              Breast     Adult             Slice
Instances                     569        30,162 (3000)a    53,500 (3000)a
Features                      30         12                384
Est. compute costs            264,585    234,000           224,073,000
MIC and sensitivity (mins)    97.34b     259.20b           –
Other (mins)                  0.84       2.20              –
Total time (mins)             98.18      261.40            –

a Values in parentheses indicate the sampled instances used.
b Observed using the CDP-DR method on a Windows OS platform with 16 GB RAM and an 8-core processor.
Advancements in differential privacy protection have addressed privacy concerns associated with releasing correlated data (Ou
et al., 2018; Wang & Wang, 2021; Wang et al., 2021). A significant challenge with existing approaches is the low utility, which results
from the sensitivity of these methods to noise escalation due to minor changes in data correlation, thus significantly reducing data
utility. Motivated by these findings, a study by Shen et al. (2023) proposes an approach called Correlated Differential Privacy for
Data Release (CDP-DR). It performs DP with correlated data analysis for data releases specific to machine learning. The study assesses
feature correlation strength by computing their MI and constructing a feature matrix. Subsequently, it calculates the eigenvalues
and eigenvectors of the matrix, which serve as the basis for determining the dataset’s maximum global sensitivity—an essential
factor for quantifying the necessary perturbation level. Although it demonstrates impressive performance in handling correlated
data, the method’s applicability to larger high-dimensional datasets remains uncertain. Another study by Liu et al. (2024) proposes correlated DP by introducing a correlated degree mechanism to compute correlated sensitivity, which measures the correlation between every pair of records in the dataset. The study employs Pearson correlation to gauge the linear relationship between two
variables and Mahalanobis distance to determine the distance between a point and a distribution. This method aims to provide
a more accurate representation of the real situation by considering the degree of correlation between records, thereby enhancing
the privacy protection mechanism. However, the study focuses on datasets with limited feature dimensionality, which may limit
its applicability to datasets with higher feature dimensions or different correlation patterns. A study by Cai et al. (2024) proposes
a method to address issues in balancing privacy protection and utility. It computes MI to estimate the high-dimensional mutual
information of graph embedding vectors, which is then used to measure privacy protection and data quality. Similarly, calculating
MI has also been found to be significantly time-consuming. To address this, the study introduces MINE-GE, a technique designed to
expedite the MI computation for graph embedding vectors.
Research focusing on privacy concerns in high-dimensional data has also been studied extensively. High-dimensional data is
common in real-world situations and allows for the extraction of more detailed information. However, these datasets often contain
a significant amount of sensitive information, and directly releasing them can lead to privacy breaches. Some studies tackle privacy
issues in high-dimensional datasets using decision trees and data encryption for privacy preservation (Zhang, Chen et al., 2022; Zhang
et al., 2023; Zhang, Yang et al., 2022). Nonetheless, these solutions are prone to overfitting and face exponential computational
growth due to the curse of dimensionality. In machine learning, reducing data dimensionality is crucial as it effectively addresses
the problem of high computational demands associated with high-dimensional data. Studies by Zhang et al. (2020) and Zhao, Ren
et al. (2021) introduced privacy protection mechanisms through dimension reduction to extract key features from large datasets,
simplifying tasks such as classification and similarity judgment. However, their approach necessitates repeated feature importance
evaluation algorithms, significantly increasing computational costs and resulting in performance loss.
The feasibility of privacy preservation in large datasets has been explored through decentralized or distributed computation
strategies. For example, a study by Lv and Zhu (2019) proposes a parallel computing approach to enhance privacy protection by
dividing the data into blocks, thus addressing performance issues related to big data. However, this study’s algorithm, which utilizes
a traditional Maximal Information Coefficient (MIC) measurement, is designed for modest datasets with few features. The study
suggests future research on improving the algorithm for handling high-dimensional datasets. Other approaches include utilizing
encryption and Blockchain-based privacy-preserving methods (Mohammadi et al., 2024; Moulahi et al., 2023), or a decentralized
method such as Federated Learning, a promising novel solution for privacy protection that avoids data transfer and preserves
individual data confidentiality (Basudan, 2024; Liu, Yan et al., 2024; Wang et al., 2024). However, these approaches involve
intensive computations that extend training times and introduce latency.
Based on our observations, a significant computational burden when applying DP arises during calculations to assess the
sensitivity of the dataset and determine the noise level required to ensure differential privacy. One of the DP methods that utilizes the
MIC calculation approach, especially for correlated datasets, typically involves many nested iterations, resulting in a computational complexity of $O(R \cdot C^2)$, where $R$ corresponds to the number of rows and $C$ represents the number of columns. This highlights the potentially high computational cost for large datasets, especially those with a significant number of columns. Table 1 provides a
synopsis of these computations. As depicted, the computation times are notably extensive, even for processing datasets with few
samples and features. The duration increases exponentially as the size and dimensions of the dataset grow. This trend becomes
even more pronounced when dealing with datasets containing high-dimensional features. Therefore, addressing these challenges
and practically applying privacy-preserving techniques across various applications where high-dimensional correlated datasets are
common becomes critical.
2.4. Opportunities
Researchers propose a distributed computing approach to address the performance issue, improve data security, and manage
large data sets in other privacy-preserving scenarios. For instance, a study by Nayahi and Kavitha (2017) introduces a clustering
algorithm within a Hadoop environment aimed at achieving k-anonymization and l-diversity. This method demonstrates how
distributed computing can address privacy concerns by leveraging parallel processing to handle data more efficiently. Likewise, Wang
et al. (2022) leveraged MapReduce on the Hadoop platform for implementing DP protection. They observed that these approaches
significantly enhance model accuracy and operational efficiency, substantially reducing computational costs. Apart from that,
using other distributed computing, such as Apache Spark, has also demonstrated high-performance caching, excellent scalability,
and efficiency in data management. For example, a study by Ashkouti et al. (2021) presents a distributed approach to achieve
multidimensional anonymization for the l-diversity privacy model, employing Apache Spark. Although both Hadoop and Spark are
frequently used distributed computing platforms, the study emphasizes the superior performance and scalability of the Spark-based
approach in terms of computational complexity and execution time. Related research shows Spark’s processing speed is much faster
than the MapReduce computing model under the Hadoop framework. Furthermore, Spark has proven effective in parallel attribute
reduction tasks on extensive datasets (Jiang, Du et al., 2022; Yin et al., 2021).
Previous studies show various advantages of using distributed frameworks such as Hadoop and Spark in other contexts (Brito
et al., 2023; Nayahi & Kavitha, 2017; Palma-Mendoza et al., 2019; Putrama & Martinek, 2023). This inspired us to explore their
potential for enhancing computational differential privacy on correlated datasets, providing an opportunity for our investigation.
3. Objectives
The primary aim of this study is to investigate the effectiveness of leveraging distributed computing techniques to enhance the
execution efficiency of processing differential privacy. This involves reducing the number of features in the dataset while maintaining
a balance between privacy preservation and utility optimization. Specifically, we seek to achieve the following objectives:
1. Develop a distributed privacy preservation framework integrating feature selection techniques to balance privacy and utility
in high-dimensional datasets.
2. Propose an innovative approach to computing global sensitivity in a distributed environment, essential for determining the
required noise for data perturbation.
3. Customize an existing feature selection method for compatibility with distributed environments, enhancing efficiency and
scalability.
4. Evaluate the proposed method on diverse datasets with feature dimensions ranging from tens to thousands, assessing its
performance against alternative methods.
5. Introduce a novel metric to gauge the performance of the tested models on both the original and perturbed datasets.
4. Method
This section covers the fundamental principles of high-dimensional feature selection, sensitivity analysis, and perturbation-based
privacy preservation. These concepts form the foundation of our proposed framework, as detailed in the following subsection.
Datasets characterized by high-dimensional features, where the number of features exceeds the number of rows, present challenges. This becomes particularly problematic for extensive datasets with significant correlation or multi-collinearity, potentially leading to model overfitting. Moreover, privacy preservation techniques involve introducing noise into the data, potentially reducing their utility. Hence, incorporating feature selection with correlation analysis methods before data perturbation becomes essential. The goal is to reduce the dimensionality of the dataset to improve computational efficiency while balancing utility and trade-offs with privacy.
This study adopts the Repeated Elastic Net Technique (RENT) framework to perform feature selection, which offers a versatile
and efficient solution, especially beneficial for high-dimensional datasets (Jenul et al., 2021). In addition, its architectural design
facilitates the partitioning of extensive datasets into discrete subsets, which allows parallel processing capabilities. Our approach
modifies this architecture to support distributed computing, which we call d-RENT, to differentiate it from the original, expressed
as follows.
Definition 1. Consider a dataset $X_{train} = \{x_i : i = 1, \dots, I_{train}\}$, where $x_i$ represents an instance in the dataset. The dataset $X_{train}$ will be sampled and partitioned uniquely into $N$ subsets $X_{train}^{(n)} \subset X_{train}$, of size $I_{train}^{(n)}$ for each $n = 1, \dots, N$. Each subset will be processed in parallel by a set of computing nodes $R_p$, where $p = 1, \dots, \infty$, in the Spark cluster. It is possible that the same node $r_p$ processes two or more subsets of the dataset.

For each subset $X_{train}^{(n)}$, a dedicated model $M_n$ is trained independently using the data stored in the corresponding Resilient Distributed Dataset (RDD) on node $r_p$. During the training process, weights $\beta_{n,f}$ are computed for each feature $f$ in $X_{train}$, where $f = 1, \dots, F$, forming a vector $\boldsymbol{\beta}_f = (\beta_{1,f}, \dots, \beta_{N,f})$.

Subsequently, the master node in the Spark cluster performs feature selection by aggregating the vectors $\boldsymbol{\beta}_f$ from each node into a matrix $\boldsymbol{B}$ of dimension $(N \times F)$.
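To make the partition-and-train step concrete, the following is a minimal sketch of how Definition 1 can be realized with PySpark, assuming scikit-learn's ElasticNet as the per-subset model; the function names, hyperparameter values, and synthetic data are illustrative assumptions, not the framework's exact implementation.

```python
# Sketch of the d-RENT partition-and-train step (Definition 1).
import numpy as np
from pyspark.sql import SparkSession
from sklearn.linear_model import ElasticNet

spark = SparkSession.builder.appName("d-RENT").getOrCreate()
sc = spark.sparkContext

def train_subset(args):
    """Train one elastic net model M_n on a subset X^(n) and return its weights."""
    X_sub, y_sub = args
    model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000)
    model.fit(X_sub, y_sub)
    return model.coef_  # beta_{n,f} for f = 1..F

def sample_subsets(X, y, n_subsets, subset_size, seed=0):
    """Yield n_subsets uniquely sampled (X, y) subsets of the training data."""
    rng = np.random.default_rng(seed)
    for _ in range(n_subsets):
        idx = rng.choice(len(X), size=subset_size, replace=False)
        yield X[idx], y[idx]

N, I_sub = 20, 200                          # number of subsets, rows per subset
X = np.random.default_rng(0).normal(size=(1000, 50))   # stand-in for X_train
y = X[:, 0] + 0.1 * np.random.default_rng(1).normal(size=1000)

subsets = list(sample_subsets(X, y, N, I_sub))
# Each subset is trained in parallel on a worker node; the driver collects the
# N x F weight matrix B used by the tau_1..tau_3 selection criteria.
B = np.vstack(sc.parallelize(subsets, numSlices=N).map(train_subset).collect())
```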
A candidate feature is included for selection if:

1. The relative frequency $\tau_1(\boldsymbol{\beta}_f)$ is large, i.e., many of the $N$ elastic net models select the feature:
$$\tau_1(\boldsymbol{\beta}_f) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\left[\beta_{n,f} \neq 0\right] \tag{1}$$

2. The variation of $\boldsymbol{\beta}_f$ must be small, i.e., $\tau_2(\boldsymbol{\beta}_f)$ ideally has features with weights of the same sign:
$$\tau_2(\boldsymbol{\beta}_f) = \frac{1}{N} \left| \sum_{n=1}^{N} \mathrm{sign}(\beta_{n,f}) \right| \tag{2}$$

3. The distribution mean arising from the $N$ parameter estimations in $\boldsymbol{\beta}_f$, i.e., $\tau_3(\boldsymbol{\beta}_f)$, is significantly non-zero:
$$\tau_3(\boldsymbol{\beta}_f) = t_{N-1}\!\left( \frac{|\mu(\boldsymbol{\beta}_f)|}{\sqrt{\frac{1}{N}\,\sigma^2(\boldsymbol{\beta}_f)}} \right) \tag{3}$$

where $t_{N-1}(\cdot)$ represents the cumulative distribution function of Student's t-distribution with $N-1$ degrees of freedom.

To perform feature selection using the above metrics, threshold values $t_1, t_2, t_3 \in [0, 1]$ are defined. Specifically, a feature $f$ is added to the selected feature set $F^*$ if it satisfies all the criteria: $\tau_i \geq t_i, \forall i \in \{1, 2, 3\}$.
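A minimal sketch of these three criteria over the aggregated weight matrix $\boldsymbol{B}$ is given below; the threshold values are illustrative placeholders, not the thresholds used in the paper.

```python
# Sketch of the tau_1..tau_3 selection criteria of Eqs. (1)-(3) over the
# aggregated N x F weight matrix B; thresholds t1..t3 are illustrative.
import numpy as np
from scipy import stats

def select_features(B, t1=0.9, t2=0.9, t3=0.975):
    N, F = B.shape
    tau1 = (B != 0).mean(axis=0)                   # Eq. (1): selection frequency
    tau2 = np.abs(np.sign(B).sum(axis=0)) / N      # Eq. (2): sign stability
    mu, var = B.mean(axis=0), B.var(axis=0, ddof=1)
    tstat = np.abs(mu) / np.sqrt(var / N + 1e-12)  # guard against zero variance
    tau3 = stats.t.cdf(tstat, df=N - 1)            # Eq. (3): significance of the mean
    mask = (tau1 >= t1) & (tau2 >= t2) & (tau3 >= t3)
    return np.flatnonzero(mask)                    # indices of the selected set F*

B = np.random.default_rng(0).normal(size=(20, 50))  # stand-in for the aggregated matrix
selected = select_features(B)
```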
While the vanilla RENT framework proved effective for datasets with moderate size and features, it lacks runtime optimizations
highlighted by the authors, leading to a linear increase in runtime with the number of ensemble models used. The framework
identifies the best model using Elastic Net (Cui et al., 2019). It employs Bayesian Information Criterion (BIC) scores to select the most relevant features, weighing the information content of the model against its complexity in terms of optimized parameters.
However, BIC tends to penalize model complexity in proportion to the number of data points and favors simpler models throughout
the selection process. Consequently, this method may not consistently yield relevant features in some instances, particularly when the
penalty imposed by BIC outweighs the model improvement gained from adding new features. In such cases, threshold adjustments
are often necessary at the expense of model utility. To overcome this problem, we propose a strategy that combines BIC with
Akaike’s Information Criterion (AIC) calculations, as demonstrated by previous studies (Lumley & Scott, 2015; Sclove, 2021), into
our approach. Unlike BIC, AIC imposes a less severe, fixed penalty per additional parameter on model complexity, leading to more flexibility in model selection.
Meanwhile, using grid search during the hyperparameter exploration to find the optimal combination hinders the overall
performance, especially when evaluating datasets with numerous features. To overcome these problems, our study suggests
using Bayesian Optimization (BO) to perform the search configured for distributed computation. BO aims to find the optimal
hyperparameters 𝑥∗ of a machine learning model by modeling the unknown objective function 𝑓 (𝑥) as a probabilistic surrogate
𝑓̂(𝑥), typically a Gaussian Process. At each iteration, it selects the next evaluation point 𝑥𝑡 that maximizes an acquisition function
𝛼(𝑥), balancing exploration and exploitation. Compared to Grid Search, Bayesian Optimization efficiently explores promising regions
of the search space. It converges to the global optimum with fewer evaluations, particularly in high-dimensional and noisy search
spaces. For further insights into using BO for accelerating hyperparameter tuning, readers are encouraged to refer to Nguyen and
Kingdom (2019). In addition, for the case of some specific datasets, we propose an alternative optimization using MIC score (Kraskov
et al., 2004) calculation as a surrogate. This method guides hyperparameter optimization towards a solution that aligns with
the data sensitivity calculation criteria. However, this approach extends the duration of the optimization process. Therefore, it is
recommended to choose it based on the desired level of accuracy, especially for datasets with smaller dimensions or fewer instances,
unless the lead time for execution is negligible. By minimizing either the combined (BIC + AIC) score or the MIC score, our approach exploits the advantages of each criterion, yielding more reliable feature selection results. Please see Algorithm 1 for a pseudocode illustration.
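The sketch below illustrates this idea using scikit-optimize's gp_minimize over the elastic net hyperparameters, with the combined (BIC + AIC) score as the objective. The Gaussian-likelihood forms of AIC and BIC (correct up to an additive constant) and the search ranges are our assumptions, not the paper's exact objective.

```python
# Sketch: Bayesian optimization of elastic net hyperparameters, minimizing
# a combined BIC + AIC score computed from the residual sum of squares.
import numpy as np
from sklearn.linear_model import ElasticNet
from skopt import gp_minimize
from skopt.space import Real

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))                        # stand-in training data
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=300)

def combined_ic(params):
    alpha, l1_ratio = params
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=5000).fit(X, y)
    n = len(y)
    k = np.count_nonzero(model.coef_) + 1             # selected weights + intercept
    rss = float(np.sum((y - model.predict(X)) ** 2))
    ll_term = n * np.log(rss / n + 1e-12)             # Gaussian log-likelihood term
    aic = ll_term + 2 * k                             # fixed penalty per parameter
    bic = ll_term + k * np.log(n)                     # penalty grows with n
    return aic + bic

space = [Real(1e-4, 1.0, prior="log-uniform", name="alpha"),
         Real(0.05, 1.0, name="l1_ratio")]
result = gp_minimize(combined_ic, space, n_calls=30, random_state=0)
best_alpha, best_l1_ratio = result.x
```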
Definition 2 (Differential Privacy (Dwork, 2006)). Consider a dataset $D$ with $n$ records, and let $f$ be a query function that operates on $D$ to yield a result. DP is attained through a randomized mechanism $\mathcal{M}$, a ‘‘noisy’’ version of $f$, where $\mathcal{M}(D) = f(D) + N$. Here, $N$ is a random variable drawn from a predefined noise distribution. DP asserts that $D$ and $D'$ are neighbors if they differ in a single record. Using $\epsilon$ as a positive scalar, the mechanism $\mathcal{M}$ ensures $\epsilon$-DP if, for all $S \subseteq \mathrm{Range}(\mathcal{M})$, it satisfies:
$$\Pr[\mathcal{M}(D) \in S] \leq e^{\epsilon} \cdot \Pr[\mathcal{M}(D') \in S] \tag{4}$$
where $\epsilon$ is the privacy budget. The lower its value, the higher the privacy level of the mechanism $\mathcal{M}$.
The mechanism $\mathcal{M}$ typically takes the form of noise following a Laplace distribution. A noise sequence denoted as vector $\boldsymbol{z} \sim Lap(\lambda)$ guarantees that the mechanism satisfies $\epsilon$-differential privacy. The scale parameter $\lambda$ has the Probability Density Function (PDF) $p(\lambda, z \in \boldsymbol{z}) = \frac{1}{2\lambda} \exp\!\left(-\frac{|z|}{\lambda}\right)$, where $\lambda = \frac{\Delta f}{\epsilon}$. Therefore:
$$\mathcal{M}(D) = f(D) + Lap\!\left(\frac{\Delta f}{\epsilon}\right) \tag{5}$$
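A minimal NumPy sketch of Eq. (5), assuming an element-wise query output:

```python
# Laplace mechanism of Eq. (5): noise with scale lambda = delta_f / epsilon
# is added element-wise to the query output f(D).
import numpy as np

def laplace_mechanism(f_of_D, delta_f, epsilon, rng=None):
    rng = rng or np.random.default_rng(0)
    scale = delta_f / epsilon          # smaller epsilon -> more noise -> more privacy
    return f_of_D + rng.laplace(loc=0.0, scale=scale, size=np.shape(f_of_D))

noisy = laplace_mechanism(np.array([42.0]), delta_f=1.0, epsilon=0.5)
```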
Definition 3 (Global Sensitivity (Dwork, 2006)). The global sensitivity measures how much the output of function $f$ can change when a single data point in the dataset is modified. The global sensitivity $\Delta f$ of the function $f$ is defined as:
$$\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1 \tag{6}$$
where $D$ and $D'$ are neighboring datasets differing in a single record.
In addition to distributing the feature selection processing, we conduct distributed computation to determine the maximum
Global Sensitivity value. This process is depicted in Fig. 1 and formalized as follows.
Let $X_{train}$ be the original high-dimensional dataset with full features, and let $\mathcal{D}$ denote the derived dataset (with reduced features) of $X_{train}$, with $m$ rows representing the number of samples and $n$ columns representing the number of features. The goal is to compute the sensitivity of the dataset $\mathcal{D}$ concerning alterations caused by removing individual samples, as quantified by MIC scores. We propose a parallelized computation method leveraging a Spark environment to achieve this. First, we generate a set $\mathcal{F}$ consisting of pairs of feature indices $(i, j)$, where $i$ and $j$ denote the indices of two distinct features. Concurrently, we generate corresponding subsets $\mathcal{D}'_k$ of the dataset by removing one sample from $\mathcal{D}$. Formally, $\mathcal{D}'_k = \{\mathcal{D}_{\setminus k} \mid k = 0, 1, \dots, m-1\}$, where $\mathcal{D}_{\setminus k}$ represents $\mathcal{D}$ with the $k$th sample removed. Each subset $\mathcal{D}'_k$ is paired with the set of feature indices $\mathcal{F}$ to form $m$ data tuples $\mathcal{T}_k = (\mathcal{D}'_k, \mathcal{F})$. Each tuple is distributed across the Spark cluster, assigning each worker node a distinct computation task. On each worker node, the MIC score is computed for the designated feature pairs $\mathcal{F}$ on the associated dataset $\mathcal{D}'_k$ specified by the tuple. A square matrix is created, and its eigenvalues are determined. The final score for the tuple is the maximum absolute difference between the eigenvalue of this
matrix and that of the original data. The maximum value acquired across all tuples provides insight into the dataset’s maximum
global sensitivity. This parallelized approach significantly enhances computational efficiency, enabling rapid evaluation of dataset
sensitivity. The sensitivity is then used to compute the noise required to perturb the model’s output.
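The following is a minimal PySpark sketch of this tuple-wise computation. Using minepy's MINE estimator for the MIC scores and a small synthetic stand-in for the reduced-feature dataset are our assumptions; the real pipeline operates on the actual derived dataset across the cluster.

```python
# Sketch of the distributed maximum-global-sensitivity computation: each task
# recomputes the pairwise MIC matrix with one sample removed and reports the
# maximal eigenvalue deviation from the full-data matrix.
import numpy as np
from itertools import combinations
from minepy import MINE
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("mic-sensitivity").getOrCreate().sparkContext

def mic_matrix(D, pairs):
    """Symmetric matrix of pairwise MIC scores for the given index pairs."""
    M = np.eye(D.shape[1])
    mine = MINE()
    for i, j in pairs:
        mine.compute_score(D[:, i], D[:, j])
        M[i, j] = M[j, i] = mine.mic()
    return M

def tuple_score(k, D, pairs, base_eigs):
    """Max absolute eigenvalue difference when the k-th sample is removed."""
    eigs = np.linalg.eigvalsh(mic_matrix(np.delete(D, k, axis=0), pairs))
    return float(np.max(np.abs(eigs - base_eigs)))

D = np.random.default_rng(0).normal(size=(100, 8))  # stand-in reduced dataset
pairs = list(combinations(range(D.shape[1]), 2))    # the index-pair set F
base_eigs = np.linalg.eigvalsh(mic_matrix(D, pairs))

# One task per removed sample; the driver takes the maximum across all tuples.
max_sensitivity = (sc.parallelize(range(D.shape[0]))
                     .map(lambda k: tuple_score(k, D, pairs, base_eigs))
                     .max())
```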
Output perturbation introduces randomness or noise in the final results, thus obscuring individual data contributions and
providing privacy guarantees. This ensures an adversary cannot reliably infer sensitive information about any individual in the
dataset. To perturb the output, noise is added to the coefficients and intercept of the regression model as:
$$\hat{y} = \hat{w} \cdot x' + \hat{b} \tag{7}$$
where $\hat{w}$ and $\hat{b}$ represent the perturbed coefficients and intercept, respectively, and $x'$ is the perturbed dataset. For the classification model, the perturbed output is given by:
$$\hat{y} = \mathrm{sign}(\hat{w} \cdot x' + \hat{b}) \tag{8}$$
where sign is a masking function for the class labels (see Algorithm 2).
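A minimal sketch of Eqs. (7)-(8), assuming a fitted scikit-learn linear model and an illustrative sensitivity value:

```python
# Output perturbation: Laplace noise scaled by sensitivity/epsilon is added
# to the trained model's coefficients and intercept before releasing outputs.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)  # stand-in data
model = LinearRegression().fit(X, y)

def perturb_model(w, b, sensitivity, epsilon):
    scale = sensitivity / epsilon                  # lambda = delta_f / epsilon
    w_hat = w + rng.laplace(0.0, scale, size=w.shape)
    b_hat = b + rng.laplace(0.0, scale)
    return w_hat, b_hat

w_hat, b_hat = perturb_model(model.coef_, model.intercept_,
                             sensitivity=0.05, epsilon=0.5)
X_prime = X                                        # stands in for the perturbed dataset x'
y_reg = X_prime @ w_hat + b_hat                    # regression output, Eq. (7)
y_cls = np.sign(X_prime @ w_hat + b_hat)           # classification output, Eq. (8)
```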
To measure the performance of our approach, we calculate the accuracy/loss scores of classification/regression tasks based on
both the original and the perturbed datasets. These scores are then plotted against the noise level, with epsilon representing the
degree of privacy protection, ranging from 0 to 1. We introduce a Privacy Deviation Coefficient (PDC) metric to compare the model
performance against the two datasets.
Definition 4 (Privacy Deviation Coefficient (PDC)). For a given method, let $S_o$ be the evaluation score on the original dataset and $S_p$ be the evaluation score on the perturbed dataset, with $\epsilon$ as the perturbation level. The PDC is defined as follows:

Theorem 1. For classification (maximization) tasks, where evaluation scores range from 0 to 1 (0 being the worst and 1 being the best):
$$PDC = \mathrm{mean}\left[ (1 - S_o) + \frac{S_o - S_p}{\epsilon} \right] \tag{9}$$

Theorem 2. For regression (minimization) tasks, where evaluation scores range from 1 to 0 (1 being the worst and 0 being the best):
$$PDC = \mathrm{mean}\left[ S_o - \frac{S_o - S_p}{\epsilon} \right] \tag{10}$$
The PDC score ranges from $-\infty$ to $+\infty$ and has the following properties:
1. $PDC = 0$ indicates that the performance on the original and perturbed datasets matches exactly at its best.
2. $PDC > 0$ indicates that the performance on the perturbed dataset is worse than on the original dataset.
3. $PDC < 0$ indicates that the performance on the perturbed dataset is better than on the original dataset.
Since performance on the perturbed dataset typically falls below that on the original dataset, a positive PDC is expected and not necessarily unfavorable. Nonetheless, when comparing PDC across methods, a lower value indicates superior performance.
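A minimal sketch of the PDC computation, averaging over a grid of epsilon values as in the evaluation plots; the score arrays are illustrative:

```python
# PDC metric of Definition 4, averaged over an epsilon grid.
import numpy as np

def pdc(s_o, s_p, eps, task="classification"):
    """s_o, s_p: per-epsilon scores on original/perturbed data; eps: epsilon grid."""
    s_o, s_p, eps = map(np.asarray, (s_o, s_p, eps))
    if task == "classification":                # Eq. (9): higher score is better
        return float(np.mean((1 - s_o) + (s_o - s_p) / eps))
    return float(np.mean(s_o - (s_o - s_p) / eps))  # Eq. (10): lower is better

score = pdc(s_o=[0.92, 0.92], s_p=[0.85, 0.90], eps=[0.1, 0.5])
```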
The architecture of our proposed framework is depicted in Fig. 2. The objective of the workflow is to perform dimensionality
reduction on correlated datasets by selecting the most prominent features while ensuring privacy preservation for data releases. The
entire workflow takes place within the distributed environment of Apache Spark, where large datasets are automatically partitioned
into smaller subsets. Each partition is processed independently on distinct nodes within the cluster.
Initially, the dataset is read and processed for distributed feature selection as elaborated in Section 4.1. This reduces the
dimension while keeping only the most important features, allowing for faster processing while maintaining most information.
During this stage, several sub-processes of the feature selection run independently within the cluster resources. These sub-processes compute weights for the features to determine the selected features that satisfy the criteria stated in Definition 1. The modification implemented to facilitate distributed computation involved replacing the RENT framework's built-in parallel computation so that it operates within the Spark environment. Based on the output of this feature selection process, a derived
dataset containing only the important set of features is obtained. Subsequently, the MIC computation is performed against the derived
dataset outlined by Section 4.4. During this process, we perform standardization of the derived dataset by default unless the dataset
indicates a relatively similar range of values.
Finally, we carry out a comparative analysis based on the results obtained compared with the results of the baseline methods by
specifically evaluating the execution time performance and model evaluation scores using the given metric as defined by Definition 4.
5. Experimental setup
In this experiment, we used a virtual machine with a Linux Operating System (OS) for testing. The baseline methods were
executed on a single OS machine (the standalone task). In contrast, our method was executed on a Spark cluster (the distributed
Table 2
Specifications of hardware and software environment.

Environment   Hardware                                   Software
Standalone    Type: Virtual (VMware)                     Jupyter Notebook
              OS: Ubuntu Server 22.04 LTS
              Processor: Intel Core i9, 32 cores
              RAM: 32 GB
Distributed   Type: Virtual (VMware)                     Jupyter Notebook,
              OS: Ubuntu Server 22.04 LTS                PySpark
              1 driver: 8-core processor, 8 GB RAM
              3 workers: 8-core processor, 8 GB RAM
Table 3
Dataset details.

Dataset       Task            Original instances   Used instances   Original features   Used features
Titanic       Classification  891                  891              12                  7
Adult         Classification  48,842               3000             15                  13
Breast        Classification  569                  569              32                  30
Sonar         Classification  208                  208              61                  61
Malware       Classification  3000                 3000             241                 241
Gina          Classification  3153                 1000             971                 971
Gisette       Classification  6000                 1000             4956                4956
Friedman      Regression      1000                 1000             101                 101
Residential   Regression      372                  372              105                 105
MIP-2016      Regression      1090                 1090             148                 148
Slice         Regression      53,500               750              386                 386
Santander     Regression      4459                 1000             4993                4993

Items in bold indicate the actual value used is less than the original.
task). The standalone machine was set up using Python in a Jupyter Notebook environment, while the distributed cluster utilized
Jupyter and PySpark for execution. The cluster consisted of one driver node and three worker nodes, with each node allocated 8 GB
of RAM and an 8-core processor as defined in the Spark configuration file. On the other hand, the standalone setup was equipped
with 32 GB of RAM and a 32-core processor to ensure a fair comparison. For more details, please refer to Table 2.
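For illustration, a SparkSession matching Table 2's cluster sizing could be configured as sketched below; the master URL is hypothetical, and the configuration keys are standard Spark settings rather than the authors' exact configuration file.

```python
# Illustrative SparkSession configuration for one driver and three workers,
# each with 8 cores and 8 GB of RAM, as specified in Table 2.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ddp-feature-selection")
         .master("spark://driver-host:7077")     # hypothetical cluster URL
         .config("spark.executor.instances", "3")
         .config("spark.executor.cores", "8")
         .config("spark.executor.memory", "8g")
         .config("spark.driver.memory", "8g")
         .getOrCreate())
```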
To ensure comprehensive testing, we gathered open-source datasets containing thousands of features. We sampled several
datasets to align with our available computing capacity, as seen in Table 3.
In the first batch, we gathered datasets for classification tasks. We use the same datasets, such as Titanic, Breast Cancer, and
Adult datasets, as utilized by Shen et al. (2023). Using these datasets, we conducted experiments to demonstrate the computational
efficiency of our proposed method compared to the baseline while maintaining a balance between privacy and utility.
• Titanic. This dataset is commonly used in data science and machine learning for binary classification tasks. It contains 891
examples with 12 numerical and categorical features. We removed unnecessary features (Passenger ID, Name, Ticket, and Cabin),
leaving seven features for further analysis. After encoding categorical features like Embarked (the target feature) and filling
rows with missing values, the final dataset consists of 714 instances and seven features for the experiment.
• Breast Cancer. This dataset comprises 569 examples with 32 features, where Diagnosis is the target feature. Following encoding
and cleaning, the final dataset dimensions are 569 × 30.
• Adult. This dataset originally consisted of 48,842 examples with 15 features and solely numerical values. We resampled the
original data to meet the 3000 instances used in the CDP-DR codebase. After excluding one feature (Fnlwgt ) and filling in the
missing values, we end up with a dataset of 3000 rows and 13 columns.
Additionally, we incorporate datasets with more features, such as Sonar and Gina from OpenML.1
• Sonar. This dataset consists of 208 examples with 61 features. It was well-formatted and required no additional preprocessing,
so the dataset was used as is.
• Malware. The original dataset comprises 3000 examples and 241 features. It was also well-formatted and did not require
additional preprocessing, allowing us to use it directly.
• Gina. The original dataset consists of 3153 examples and 971 features. However, we reduced the sample size to 1000 instances
to expedite processing. Given the dataset’s clean condition, no further preprocessing was required.
1 https://www.openml.org/search?type=data.
• Gisette. The initial dataset contained 6000 examples and 4956 features. To expedite processing, we reduced the sample size to
1000 instances. Since the dataset was already clean, no additional preprocessing was necessary.
In the second batch, we collected datasets for regression tasks from OpenML repositories, such as Friedman, MIP-2016, and Santander. The Residential and Slice datasets were taken from the UCI2 repository.
• Friedman. This artificially generated dataset contains 1000 examples and 101 features. It was already in clean condition, so
no further preprocessing was needed.
• MIP-2016. This dataset, sourced from the same repository, consists of 1090 examples and 148 features. We removed
non-numerical features, filled rows with missing values, and rescaled each row.
• Residential. This dataset contains 372 instances with 105 features. The dataset was in clean condition, so no further
preprocessing was done.
• Slice. Originally containing 53,500 examples, this dataset was resampled to 750 records due to its higher number of rows and
features. Like the previous dataset, it was clean and required no further preprocessing.
• Santander. The original dataset contains 4459 examples and 4993 features, making it one of the datasets with the largest feature counts. To expedite processing, we reduced the sample size to 1000 instances. Since the dataset was already clean, no additional preprocessing was needed.
To assess the effectiveness of our proposed framework, we compared its performance against several methods, namely CDP-DR,
k-Best, and RENT, which were executed in a standalone environment. In contrast, we evaluated our approach using a modified
version of the RENT framework in a distributed environment. The model’s accuracy was measured for classification, while the
model’s mean squared error (MSE) was recorded for regression. Depending on the task, these measurements were performed for
original and perturbed datasets.
2 https://archive.ics.uci.edu/.
Table 4
Datasets with computation results.

              CDP-DR                   k-Best                   RENT                       Our approach
Dataset       Fn    Gs     PDC        Fn   Gs     PDC          Fn   Gs     PDC            Fn   Gs     PDC
Titanic       6     0.138  0.761      5    0.012  0.688        4    0.004  0.331          5    0.004  0.326
Adult         9     0.301  0.837      10   0.394  0.877        10   0.043  0.323          10   0.043  0.320
Breast        30    0.036  0.145a     5    0.018  1.302        4    0.728  0.862          5    0.013  0.324
Sonar         33    1.112  1.141      6    0.404  0.797        6    0.023  0.248          6    0.023  0.242
Malware       –     –      –          35   0.264  0.797        21   0.439  0.981          38   0.119  0.462
Gina          –     –      –          50   0.006  0.917        37   0.680  1.117          57   0.516  1.066
Gisette       –     –      –          60   0.471  1.167        54   0.689  1.041          87   0.391  1.095
Friedman      –     –      –          5    0.007  0.758        5    0.006  0.766          4    0.005  0.756
Residential   –     –      –          15   0.025  1.872        13   0.021  1.544          16   0.023  2.226
MIP-2016      –     –      –          7    0.499  544.876      7    0.003  1.194          7    0.003  1.199
Slice         –     –      –          25   0.071  30.668       25   0.050  2.876          24   0.048  2.774
Santander     –     –      –          10   0.199  98.947       10   0.699  1262.127       9    0.057  590.338
AVG-1               0.397  0.721           0.207  0.916             0.200  0.441               0.021  0.303
AVG-2                                      0.198  56.972            0.282  106.117             0.104  50.094

–: Not applicable. Fn: the number of selected features. Gs: the global sensitivity score. PDC: the PDC score. AVG-1: the average score over the first four datasets (including CDP-DR). AVG-2: the average score over all datasets (excluding CDP-DR).
a The best scores are highlighted in bold.
We conducted 12 tasks (seven classification and five regression) on the prepared datasets (as shown in Table 3) using various methods,
including our proposed methodology. The experiment results are summarized in Table 4, covering important metrics for classification
and regression tasks across datasets. For each method, column Fn indicates the number of features selected after reduction, column
Gs shows the computed global sensitivity scores, and column PDC presents the computed PDC scores. In the following subsections,
we present the results of comparisons between each method and our approach. A discussion follows after the comparisons are
presented.
6.1.1. CDP-DR
CDP-DR works only for classification tasks, as demonstrated by the authors; therefore, we excluded regression tasks for this
method. Additionally, without loss of generality, we did not test datasets with more than 100 features for this method because the
cost estimation for datasets of this size is significant, and it already shows that the process could not be completed within 8 h, as
shown in Table 1.
Initially, we executed the code shared by the authors on datasets used in their studies, namely Titanic, Adult, and Breast
Cancer datasets. Additionally, we executed the code on the Sonar dataset, which has 61 features. Internally, its algorithm performs
dimensionality reduction using Principal Component Analysis (PCA). By executing the respective methods on the same datasets, we
found that our approach achieved the lowest global sensitivity (below 0.05 on average) and PDC scores (below 0.4 on average),
except for the Breast Cancer dataset, where CDP-DR had one of the lowest scores of 0.145. Our method tends to select only a few
of the most important features out of many initial ones, allowing for better score computation. Depending on the characteristics of
a dataset, the feature selection process based on the RENT method can be slow due to the optimal parameter search involved in its
hyperparameter tuning processes. However, with the help of the distributed computation process used for sensitivity computation
(which tends to be time-consuming) as shown in Fig. 1, and referring to the execution time for this particular Breast Cancer dataset shown in Fig. 3, our method performed the most efficiently, taking only 1.87 min to execute, while CDP-DR took the longest at 119.17 min. Consequently, our method operates approximately 98.43% faster than CDP-DR in this instance, indicating a significantly better performance.
Finally, referring to the average scores where this CDP-DR was tested on the first four datasets, our method still gets the lowest
global sensitivity and PDC scores of 0.021 and 0.303, respectively.
6.1.2. k-BEST
The k-BEST method tends to be much faster during the feature selection process. However, depending on the characteristics of the
selected features, it may result in slower computation during global sensitivity computation. In terms of processing time, as shown in
Figs. 3 & 4, DP involving the k-BEST method is faster in computation compared to CDP-DR on the Titanic (45.04%), Breast (79.73%),
and Sonar (98.66%) datasets, which yields 74.48% faster on average. It is also faster compared to the approach using the RENT
method on the Titanic (42.92%), Adult (4.78%), Sonar (43.14%), Friedman (10.3%), Residential (66.38%), Slice (35.99%), and
Santander (35.33%) datasets, which yields 34.12% faster on average. However, compared to our method, it is, on average, 68.70%
slower. Regarding computing global sensitivity and PDC scores, this method excels with scores of 0.006 and 0.917, respectively, for
the Gina dataset. Although it achieves the lowest PDC score for the Santander dataset (98.947), our method still attains the lowest
score for the corresponding global sensitivity (0.057). Comparing this method with our approach across the remaining datasets, our
approach consistently achieves better sensitivity scores, requiring lower perturbation levels while maintaining model utility. This
can be seen in the average global sensitivity and PDC scores across all datasets, where our method achieved 47.47% and 12.07%
lower scores, respectively, compared to this method. However, the high PDC scores for the Santander dataset across methods require
further exploration of hyperparameter tunings to optimize feature selection and reduce these scores.
6.1.3. RENT
When comparing our method with the approach using plain RENT feature selection, we observe closely matched results, which is expected given that both methods rely on the same feature selection technique. However, our method incorporates optimizations during feature selection, such as minimizing the combined BIC and AIC score, or the MIC-based sensitivity score for smaller datasets, resulting in 63.12% and 52.79% better global sensitivity and PDC scores, respectively. Additionally, Figs. 3 & 4 reveal that our method requires less than half the execution time of RENT feature selection without the distributed approach across all tests. This suggests that utilizing our approach offers significant advantages.
Fig. 5. The correlation matrices comparison of our proposed approach with the baseline method on the Breast Cancer dataset: (a) CDP-DR; (b) k-Best; (c) RENT;
(d) Ours.
Table 5
Percentage gain in execution time (minutes) for each method across different datasets.

Method     Titanic  Adult  Breast  Sonar  Malware  Gina    Gisette  Friedman  Residential  MIP-2016  Slice  Santander
CDP-DR     2.42     39.96  119.17  21.66  –        –       –        –         –            –         –      –
k-Best     1.33a    42.19  24.15   0.29   184.96   361.38  528.66   8.36      1.17         9.12      12.39  13.89
RENT       2.33     44.31  4.4     0.51   68.58    129.86  201.13   9.32      3.48         7.92      27.17  21.48
Ours       0.34     2.99   1.87    0.22   41.02    25.56   36.57    0.88      0.59         3.04      12.67  7.78
Gain (%)   74.44    92.52  57.50   24.14  77.82    80.32   93.08    89.47     49.57        61.62     27.14  43.99

Avg. Gain (%): 64.30
a The 2nd fastest execution times are highlighted in bold.
6.2. Analysis
Fig. 6. Boxplot comparison: (a) & (b) Sensitivity and PDC scores for the first four datasets (including CDP-DR); (c) & (d) Sensitivity and PDC scores for all
datasets (excluding CDP-DR).
6.3. Discussion
Dissecting the results obtained from our experiments shows that our approach implements differential privacy efficiently while offering substantial utility improvements compared to existing baseline methods. The important aspect is incorporating the feature selection method and pruning redundant features within high-dimensional datasets without sacrificing model utility. The distributed computing approach yields substantial time savings. Additionally, our architecture provides the flexibility to plug in a wide selection of feature selection models according to the required levels of utility and privacy.
Fig. 7. Evaluation performance of our proposed method compared with baseline methods for several datasets: (a) Titanic (7 features); (b) Adult (13 features);
(c) Residential (102 features); (d) Slice (386 features).
Our testing, however, uncovered several limitations related to using the distributed Spark framework on virtual machines.
Repeated Jupyter Notebook executions occasionally indicate interconnection issues between worker nodes, disrupting an efficient
and smooth testing workflow. As a result, we had to rerun the same code on the same dataset several times and select the
best outcomes. Furthermore, depending on the number of records processed and the capabilities of each compute node, complex
orchestration of distributed processes – like hyperparameter tuning via Bayesian optimization – requires continuous data transfer
between driver and worker nodes, which makes larger datasets require longer execution times. Spark essentially provides a
multitude of adjustable options to fine-tune performance. For the most part, we used default configurations during our investigation.
Consequently, determining the best configurations for a given dataset and exploring other distributed framework technologies
for performance optimization remain challenging research topics. In addition, exploring additional distributed baseline methods
for comparison is another topic for investigation. Our work currently focuses on enhancing computational differential privacy
by combining distributed computing with feature selection methods to efficiently process datasets with numerous features. This
study demonstrates the approach's feasibility. However, optimizing feature selection by comparing various methods within the same distributed environment, or against other distributed differential privacy techniques, remains a challenging direction to explore that would yield valuable insights for practical use.
It is important to note that the data used in our experiment was sampled and taken from a larger dataset. Given our available
computing resources, this strategy was used to expedite processing while preserving computational equivalency between clusters
and standalone configurations. More research is required to determine how well our approach preserves data privacy while retaining utility on larger datasets, possibly using a comparable practical computing environment such as a cloud computing alternative.
Finally, methods like RENT for feature selection have yielded a sparse set of features from many initial features. This tendency
may lead to underfitting, where the model becomes too simplistic to adequately capture the inherent structure and patterns within
Table 6
Complexity details of Algorithms 1 & 2 combined.

[Feature selection]
∙ Select feature: the complexity is $O(f(m, n))$, where $m$ is the number of instances and $n$ is the number of features.
∙ AIC and BIC computation: $O(m \cdot d)$, where $d$ is the number of selected features.
∙ Sensitivity: the non-distributed version is $O(m \cdot d^2)$; the distributed version therefore has complexity $O(\frac{m}{k} \cdot d^2)$, where $k$ is the number of nodes in the cluster.
∙ Bayesian optimization: the objective function is evaluated over several iterations. For $I$ iterations, with feature selection processes of $O(f(m, n) + m \cdot d + m \cdot d^2)$, the complexity becomes $O(\frac{I}{k} \cdot (f(m, n) + m \cdot d + m \cdot d^2))$.
∙ Time complexity: $O(f(m, n) + \frac{I}{k} \cdot (f(m, n) + m \cdot d^2))$.
∙ Space complexity: $O(n + \frac{I}{k} \cdot (f(m, n) + m \cdot d^2))$.

[Maximum sensitivity calculation]
∙ Generate feature pairs: $O(d^2)$.
∙ Generate tuples $T = (D'_k, F)$: each subset $D'_k$ is of size $(m - 1) \cdot d$; for all subsets, $O(m \cdot d)$.
∙ Distributed MIC score computation: $O(\frac{1}{k} \cdot d^2 + m \cdot d) + O(\log k)$, where $O(\log k)$ is the typical communication overhead of the aggregation.
∙ Time complexity: $O(m \cdot d + \frac{d^2}{k} + \log k)$.
∙ Space complexity: $O(d^2 + m \cdot d)$.

[Adding noise]
∙ Generate Laplace noise: $O(m \cdot d)$, where $m$ is the number of instances and $d$ is the number of selected features.
∙ Perturbation: assuming the perturbation involves a constant-time operation on each element, it gives $O(d^2)$.
∙ Matrix multiplication: typically $O(m \cdot d^2)$.
∙ Time complexity: $O(m \cdot d^2)$.
∙ Space complexity: $O(m \cdot d^2)$.

[Perform perturbation]
∙ Generate model: a distributed training algorithm for SVR is approximately $O(\frac{m^2}{k} \cdot d)$.
∙ Compute prediction for $y' = w' \cdot X' + b'$: $O(d)$ for adding noise to the coefficients and $O(1)$ for the intercept; each node computes predictions for its partition, giving $O(\frac{m}{k} \cdot d)$.
∙ Time complexity: $O(\frac{m^2}{k} \cdot d)$.
∙ Space complexity: $O(\frac{m \cdot d}{k})$.

∙ Total time complexity: focusing only on the highest-degree term, $O(\frac{m^2}{k} \cdot d)$.
∙ Total space complexity: $O(\frac{I}{k} \cdot m \cdot d^2)$.
the original data, potentially resulting in suboptimal performance when applied to unseen data. Consequently, it poses an interesting
research question to compare the model performance with an alternative approach utilizing feature reduction instead of feature
selection. Feature reduction techniques aim to transform the original feature space while retaining as much of the original variance
as possible. While this may enhance the model’s utility, careful consideration must be given to computing the perturbation level
appropriately to achieve lower sensitivity and, consequently, better privacy outcomes.
7. Conclusion
This study introduces an innovative approach to data privacy by utilizing distributed computing techniques. The proposed method
leverages a proven feature selection algorithm to enhance execution efficiency while preserving the model’s utility performance.
Through comprehensive experimentation across various open datasets, we evaluate the approach’s effectiveness in classification
and regression tasks, comparing its performance against several alternative approaches employing different feature selection
methodologies. Our findings reveal a notably promising performance exhibited by our method, surpassing other baseline techniques
across most examined scenarios and datasets.
Despite these promising results, our approach entails certain limitations. Primarily, our experimentation is constrained by using
sampled data containing a limited number of instances, and the testing was conducted on a virtual setup. Moreover, we
primarily rely on default configurations for existing feature selection methods with an introduced alternative for hyperparameter
tuning. Consequently, exploring options for computing environments, such as physical or cloud-based setups, to accommodate larger
datasets with higher dimensional features presents an intriguing topic for future research.
In summary, our proposed approach demonstrates the benefits of combining feature selection with distributed processing, outperforming other baseline methods when tested against various open datasets and achieving an average efficiency gain of 64.30% overall. These findings provide valuable insights for practitioners seeking scalable solutions to implement differential privacy in real-world scenarios.
CRediT authorship contribution statement

I Made Putrama: Writing – original draft, Visualization, Validation, Software, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Péter Martinek: Writing – review & editing, Validation, Supervision, Resources.
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Data availability
Acknowledgments
We sincerely thank Hua Shen (Hubei University of Technology) for providing access to the codebase and dataset, which were critical to the success of this research.
Appendix A. Proof of Theorem 1 (classification tasks)

Assumptions:
• $S_o$ and $S_p$ are the evaluation scores on the original and perturbed datasets, respectively, with $\epsilon > 0$.
• Evaluation scores $S_o$ and $S_p$ range from 0 to 1, where 0 is the worst score and 1 is the best score.
1. Case $S_o = S_p$:
• Statement: When $S_o = S_p$, the PDC should indicate that the performance on both datasets is the same.
• Proof:
$$PDC = \mathrm{mean}\left[ (1 - S_o) + \frac{S_o - S_p}{\epsilon} \right] = \mathrm{mean}\left[ (1 - S_o) + \frac{S_o - S_o}{\epsilon} \right] = \mathrm{mean}(1 - S_o)$$
Therefore:
- If $S_o = 1$, then $PDC = 0$
- If $S_o = 0$, then $PDC = 1$
- If $S_o = c$, then $PDC = 1 - c$ for any $c \in (0, 1)$
2. Case $S_o > S_p$:
• Statement: When $S_o > S_p$, the PDC should indicate a positive value, showing that the performance on the perturbed dataset is worse than on the original dataset.
• Proof:
$$PDC = \mathrm{mean}\left[ (1 - S_o) + \frac{S_o - S_p}{\epsilon} \right] = \mathrm{mean}\left[ (1 - S_o) + \frac{\Delta}{\epsilon} \right]$$
where $\Delta = S_o - S_p > 0$. Since $\Delta > 0$, for small $\epsilon$, $\frac{\Delta}{\epsilon}$ is large, making PDC positive. For large $\epsilon$, $\frac{\Delta}{\epsilon}$ is small, making $PDC \approx 1 - S_o$. Thus:
- If $\epsilon \to 0$, then $PDC \to +\infty$
- If $\epsilon \to \infty$, then $PDC \to 1 - S_o$
3. Case $S_o < S_p$:
• Statement: When $S_o < S_p$, the PDC should indicate a negative value, showing that the performance on the perturbed dataset is better than on the original dataset.
• Proof:
$$PDC = \mathrm{mean}\left[ (1 - S_o) + \frac{S_o - S_p}{\epsilon} \right] = \mathrm{mean}\left[ (1 - S_o) + \frac{-\Delta}{\epsilon} \right]$$
where $\Delta = S_p - S_o > 0$. Since $\Delta > 0$, for small $\epsilon$, $\frac{\Delta}{\epsilon}$ is large, making PDC negative. For large $\epsilon$, $\frac{\Delta}{\epsilon}$ is small, making $PDC \approx 1 - S_o$. Thus:
- If $\epsilon \to 0$, then $PDC \to -\infty$
- If $\epsilon \to \infty$, then $PDC \to 1 - S_o$
Regression tasks (lower scores are better). Assume that:

• $S_o$ and $S_p$ are the evaluation scores on the original and perturbed datasets, respectively, with $\epsilon > 0$.
• Evaluation scores $S_o$ and $S_p$ range from 0 to 1, where 1 is the worst score and 0 is the best score.
1. Case $S_o = S_p$:
• Statement: When $S_o = S_p$, the PDC should indicate that the performance on both datasets is the same.
• Proof:
\[
\mathrm{PDC} = \mathrm{mean}\left[S_o - \frac{S_o - S_p}{\epsilon}\right] = \mathrm{mean}\left[S_o - \frac{S_o - S_o}{\epsilon}\right] = \mathrm{mean}(S_o)
\]
Therefore:
\[
\begin{cases}
\text{If } S_o = 0, & \text{then } \mathrm{PDC} = 0\\
\text{If } S_o = 1, & \text{then } \mathrm{PDC} = 1\\
\text{If } S_o = c, & \text{then } \mathrm{PDC} = c \text{ for any } c \in (0, 1)
\end{cases}
\]
2. Case $S_o > S_p$:
• Statement: When $S_o > S_p$, the PDC should be negative, showing that the performance on the perturbed dataset is better than on the original dataset.
• Proof:
\[
\mathrm{PDC} = \mathrm{mean}\left[S_o - \frac{S_o - S_p}{\epsilon}\right] = \mathrm{mean}\left[S_o - \frac{\Delta}{\epsilon}\right]
\]
where $\Delta = S_o - S_p > 0$. Since $\Delta > 0$, for small $\epsilon$, $\frac{\Delta}{\epsilon}$ is large, making the PDC negative; for large $\epsilon$, $\frac{\Delta}{\epsilon}$ is small, making $\mathrm{PDC} \approx S_o$. Thus:
\[
\begin{cases}
\text{If } \epsilon \to 0, & \text{then } \mathrm{PDC} \to -\infty\\
\text{If } \epsilon \to \infty, & \text{then } \mathrm{PDC} \to S_o
\end{cases}
\]
3. Case $S_o < S_p$:
• Statement: When $S_o < S_p$, the PDC should be positive, showing that the performance on the perturbed dataset is worse than on the original dataset.
• Proof:
\[
\mathrm{PDC} = \mathrm{mean}\left[S_o - \frac{S_o - S_p}{\epsilon}\right] = \mathrm{mean}\left[S_o - \frac{-\Delta}{\epsilon}\right] = \mathrm{mean}\left[S_o + \frac{\Delta}{\epsilon}\right]
\]
where $\Delta = S_p - S_o > 0$. Since $\Delta > 0$, for small $\epsilon$, $\frac{\Delta}{\epsilon}$ is large, making the PDC large and positive; for large $\epsilon$, $\frac{\Delta}{\epsilon}$ is small, making $\mathrm{PDC} \approx S_o$. Thus:
\[
\begin{cases}
\text{If } \epsilon \to 0, & \text{then } \mathrm{PDC} \to +\infty\\
\text{If } \epsilon \to \infty, & \text{then } \mathrm{PDC} \to S_o
\end{cases}
\]
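Mirroring the classification check with hypothetical values: for a single model with $S_o = 0.2$ and $S_p = 0.3$ (lower is better), $\mathrm{PDC} = 0.2 + \frac{0.1}{\epsilon}$, which equals $0.3$ at $\epsilon = 1$ and approaches $S_o = 0.2$ as $\epsilon \to \infty$, correctly signaling degraded performance on the perturbed data.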
For both classification and regression tasks, the Privacy Deviation Coefficient (PDC) effectively measures the deviation between the
evaluation performance of the model on the original versus the perturbed dataset:

• Classification: $\mathrm{PDC} = \mathrm{mean}\left[(1 - S_o) + \frac{S_o - S_p}{\epsilon}\right]$
• Regression: $\mathrm{PDC} = \mathrm{mean}\left[S_o - \frac{S_o - S_p}{\epsilon}\right]$
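To make the two conventions concrete, the following is a minimal Python sketch of the PDC computation as defined above. The function name pdc, its signature, and the example values are illustrative assumptions for exposition only, not part of the released codebase.

```python
import numpy as np

def pdc(scores_orig, scores_pert, epsilon, task="classification"):
    """Illustrative sketch of the Privacy Deviation Coefficient (PDC).

    scores_orig, scores_pert: per-model evaluation scores S_o and S_p.
    epsilon: privacy budget, must be > 0.
    For classification, scores lie in [0, 1] with 1 the best;
    for regression, scores lie in [0, 1] with 0 the best (e.g. a normalized error).
    """
    s_o = np.asarray(scores_orig, dtype=float)
    s_p = np.asarray(scores_pert, dtype=float)
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if task == "classification":
        # PDC = mean[(1 - S_o) + (S_o - S_p) / epsilon]
        return np.mean((1.0 - s_o) + (s_o - s_p) / epsilon)
    # Regression: PDC = mean[S_o - (S_o - S_p) / epsilon]
    return np.mean(s_o - (s_o - s_p) / epsilon)

# Identical scores reduce to mean(1 - S_o) for classification:
print(pdc([0.9, 0.8], [0.9, 0.8], epsilon=1.0))           # ~0.15
# A drop on the perturbed data adds a positive deviation term:
print(pdc([0.9, 0.8], [0.7, 0.6], epsilon=1.0))           # ~0.35
# Regression, where a higher perturbed score means worse performance:
print(pdc([0.2], [0.3], epsilon=1.0, task="regression"))  # ~0.3
```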
References
Abadi, M., McMahan, H. B., Chu, A., Mironov, I., Zhang, L., Goodfellow, I., & Talwar, K. (2016). Deep learning with differential privacy. In 23rd ACM conf.
comput. commun. secur.. http://dx.doi.org/10.48550/arXiv.1607.00133.
Ashkouti, F., Khamforoosh, K., & Sheikhahmadi, A. (2021). DI-Mondrian: Distributed improved Mondrian for satisfaction of the L-diversity privacy model using
Apache Spark. Information Sciences (Ny), 546, 1–24. http://dx.doi.org/10.1016/j.ins.2020.07.066.
Basudan, S. (2024). A privacy-preserving federated learning protocol with a secure data aggregation for the Internet of Everything. Computer Communications,
223, 1–14. http://dx.doi.org/10.1016/j.comcom.2024.05.005.
Brito, C. V., Ferreira, P. G., Portela, B. L., Oliveira, R. C., & Paulo, J. T. (2023). Privacy-Preserving Machine Learning on Apache Spark. IEEE Access, 11,
127907–127930. http://dx.doi.org/10.1109/ACCESS.2023.3332222.
Cai, L., Tang, J., Dang, S., & Chen, G. (2024). Privacy protection and utility trade-off for social graph embedding. Information Sciences (Ny), 676, Article 120866.
http://dx.doi.org/10.1016/j.ins.2024.120866.
Cheu, A., Smith, A., Ullman, J., Zeber, D., & Zhilyaev, M. (2019). Distributed differential privacy via shuffling. In Lecture notes in computer science: vol. 11476,
(pp. 375–403). http://dx.doi.org/10.1007/978-3-030-17653-2_13.
Cui, L., Bai, L., Zhang, Z., Wang, Y., & Hancock, E. R. (2019). Identifying the most informative features using a structurally interacting elastic net. Neurocomputing,
336, 13–26. http://dx.doi.org/10.1016/j.neucom.2018.06.081.
Dwork, C. (2006). Differential Privacy. In ICALP 2006 autom. lang. program. (pp. 1–12). http://dx.doi.org/10.1007/978-3-030-96398-9_2.
El Mestari, S. Z., Lenzini, G., & Demirci, H. (2024). Preserving data privacy in machine learning systems. Computers & Security, 137, Article 103605.
http://dx.doi.org/10.1016/j.cose.2023.103605.
Fang, C., Dziedzic, A., Zhang, L., Oliva, L., Verma, A., Razak, F., Papernot, N., & Wang, B. (2024). Decentralised, collaborative, and privacy-preserving machine
learning for multi-hospital data. eBioMedicine, 101, Article 105006. http://dx.doi.org/10.1016/j.ebiom.2024.105006.
Ge, L., Li, H., Wang, X., & Wang, Z. (2023). A review of secure federated learning: Privacy leakage threats, protection technologies, challenges and future
directions. Neurocomputing, 561, Article 126897. http://dx.doi.org/10.1016/j.neucom.2023.126897.
Himeur, Y., Sohail, S. S., Bensaali, F., Amira, A., & Alazab, M. (2022). Latest trends of security and privacy in recommender systems: A comprehensive review
and future perspectives. Computers & Security, 118, Article 102746. http://dx.doi.org/10.1016/j.cose.2022.102746.
H.R., R., S., K., S., E., & M.S., S. (2023). A hybrid deep learning framework for privacy preservation in edge computing. Computers & Security, 129, Article
103209. http://dx.doi.org/10.1016/j.cose.2023.103209.
Jenul, A., Schrunner, S., Liland, K. H., Indahl, U. G., Futsaether, C. M., & Tomic, O. (2021). Rent - Repeated elastic net technique for feature selection. IEEE
Access, 9, 152333–152346. http://dx.doi.org/10.1109/ACCESS.2021.3126429.
Jiang, K., Du, S., Zhao, F., Huang, Y., Li, C., & Luo, Y. (2022). Effective data management strategy and RDD weight cache replacement strategy in Spark.
Computer Communications, 194, 66–85. http://dx.doi.org/10.1016/j.comcom.2022.07.008.
Jiang, H., Gao, Y., Sarwar, S. M., GarzaPerez, L., & Robin, M. (2022). CCIS: vol. 1536, Differential privacy in privacy-preserving big data and learning: challenge and
opportunity. Springer International Publishing., http://dx.doi.org/10.1007/978-3-030-96057-5_3, arXiv:2112.01704.
Kairouz, P., Oh, S., & Viswanath, P. (2015). Secure multi-party differential privacy. Advances in Neural Information Processing Systems, 2015-Janua, 2008–2016.
Li, C., Zhou, P., Xiong, L., Wang, Q., & Wang, T. (2018). Differentially private distributed online learning. IEEE Transactions on Knowledge and Data Engineering,
30, 1440–1453. http://dx.doi.org/10.1109/TKDE.2018.2794384.
Kifer, D., & Machanavajjhala, A. (2011). No free lunch in data privacy. In Proc. ACM SIGMOD Int. Conf. Manag. Data (pp. 193–204). http://dx.doi.org/10.
1145/1989323.1989345.
Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E : Statistical Physics, Plasmas, Fluids, and Related
Interdisciplinary Topics, 69(16), http://dx.doi.org/10.1103/PhysRevE.69.066138, arXiv:0305641.
Kumar, A., Saxena, R., Awasthi, A., & Sunil, M. P. (2023). Measurement : Sensors Privacy preserved data sharing using blockchain and support vector machine
for industrial IOT applications. Measurement: Sensors, 29, Article 100891. http://dx.doi.org/10.1016/j.measen.2023.100891.
Li, Z., Mao, F., & Wu, C. (2022). Can we share models if sharing data is not an option? Patterns, 3, Article 100603. http://dx.doi.org/10.1016/j.patter.2022.100603.
Liu, M., Song, X., Li, Y., & Li, W. (2024). Correlated differential privacy based logistic regression for supplier data protection. Computers & Security, 136, Article
103542. http://dx.doi.org/10.1016/j.cose.2023.103542.
Liu, L., Yan, Z., Zhang, T., Gao, Z., Cai, H., & Wang, J. (2024). Data privacy protection: A novel federated transfer learning scheme for bearing fault diagnosis.
Knowledge-Based Systems, 291, Article 111587. http://dx.doi.org/10.1016/j.knosys.2024.111587.
Lumley, T., & Scott, A. (2015). AIC and BIC for modeling with complex survey data. Journal of Survey Statistics and Methodology, 3, 1–18. http://dx.doi.org/10.
1093/jssam/smu021.
Lv, D., & Zhu, S. (2019). Achieving correlated differential privacy of big data publication. Computers & Security, 82, 184–195. http://dx.doi.org/10.1016/j.cose.
2018.12.017.
Mohammadi, S., Balador, A., Sinaei, S., & Flammini, F. (2024). Balancing Privacy and Performance in Federated Learning: a Systematic Literature Review on
Methods and Metrics. Journal of Parallel and Distributed Computing, 192, Article 104918. http://dx.doi.org/10.1016/j.jpdc.2024.104918.
Moulahi, W., Jdey, I., Moulahi, T., Alawida, M., & Alabdulatif, A. (2023). A blockchain-based federated learning mechanism for privacy preservation of healthcare
IoT data. Computers in Biology and Medicine, 167, Article 107630. http://dx.doi.org/10.1016/j.compbiomed.2023.107630.
Nayahi, J. J. V., & Kavitha, V. (2017). Privacy and utility preserving data clustering for data anonymization and distribution on Hadoop. Future Generation
Computer Systems, 74, 393–408. http://dx.doi.org/10.1016/j.future.2016.10.022.
Nguyen, V. (2019). Bayesian optimization for accelerating hyper-parameter tuning. In 2019 IEEE second int. conf. artif. intell. knowl. eng..
http://dx.doi.org/10.1109/AIKE.2019.00060.
Ou, L., Qin, Z., Liao, S., Hong, Y., & Jia, X. (2018). Releasing Correlated Trajectories: Towards High Utility and Optimal Differential Privacy. IEEE Transactions
on Dependable and Secure Computing, 17, 1109–1123. http://dx.doi.org/10.1109/TDSC.2018.2853105.
Palma-Mendoza, R. J., De-Marcos, L., Rodriguez, D., & Alonso-Betanzos, A. (2019). Distributed correlation-based feature selection in spark. Information Sciences
(Ny), 496, 287–299. http://dx.doi.org/10.1016/j.ins.2018.10.052.
Putrama, I. M., & Martinek, P. (2023). A hybrid architecture for secure big-data integration and sharing in smart manufacturing. In Proc. IEEE-international spring
semin. electron. technol. 2023-May. http://dx.doi.org/10.1109/ISSE57496.2023.10168508.
Sathish Kumar, G., Premalatha, K., Uma Maheshwari, G., & Rajesh Kanna, P. (2023). No more privacy Concern: A privacy-chain based homomorphic
encryption scheme and statistical method for privacy preservation of user’s private and sensitive data. Expert Systems with Applications, 234, Article 121071.
http://dx.doi.org/10.1016/j.eswa.2023.121071.
Sclove, S. L. (2021). Using Model Selection Criteria to Choose the Number of Principal Components. Journal of Statistical Theory and Applications, 20, 450–461.
http://dx.doi.org/10.1007/s44199-021-00002-4.
Shen, H., Li, J., Wu, G., & Zhang, M. (2023). Data release for machine learning via correlated differential privacy. Information Processing and Management, 60,
Article 103349. http://dx.doi.org/10.1016/j.ipm.2023.103349.
Sinaci, A. A., Gencturk, M., Alvarez-Romero, C., Laleci Erturkmen, G. B., Martinez-Garcia, A., Escalona-Cuaresma, M. J., & Parra-Calderon, C. L. (2024). Privacy-
preserving federated machine learning on FAIR health data: A real-world application. Computational and Structural Biotechnology Journal, 24, 136–145.
http://dx.doi.org/10.1016/j.csbj.2024.02.014.
Wang, Z., Duan, S., Wu, C., Lin, W., Zha, X., Han, P., & Liu, C. (2024). Generative Data Augmentation for Non-IID Problem in Decentralized Clinical Machine
Learning. Vol. 160, In Proc. - 2022 4th int. conf. data intell. secur. ICDIS 2022 (pp. 336–343). http://dx.doi.org/10.1109/ICDIS55630.2022.00058.
Wang, X., Fan, W., He, J., & Chi, C. H. (2022). A Novel Distributed Differential Privacy Preserving Based on Random Forest in Data Centers. Procedia Computer
Science, 214, 1531–1540. http://dx.doi.org/10.1016/j.procs.2022.11.340.
Wang, H., & Wang, H. (2021). Correlated tuple data release via differential privacy. Information Sciences (Ny), 560, 347–369. http://dx.doi.org/10.1016/j.ins.
2021.01.058.
Wang, Y., Wang, Q., Zhao, L., & Wang, C. (2023). Differential privacy in deep learning: Privacy and beyond. Future Generation Computer Systems, 148, 408–424.
http://dx.doi.org/10.1016/j.future.2023.06.010.
Wang, H., Xu, Z., Jia, S., Xia, Y., & Zhang, X. (2021). Why current differential privacy schemes are inapplicable for correlated data publishing? World Wide Web,
24, 1–23. http://dx.doi.org/10.1007/s11280-020-00825-8.
Xu, J., Hong, N., Xu, Z., Zhao, Z., Wu, C., Kuang, K., Wang, J., Zhu, M., Zhou, J., Ren, K., Yang, X., Lu, C., Pei, J., & Shum, H. (2023). Data-driven learning
for data rights, data pricing, and privacy computing. Engineering, http://dx.doi.org/10.1016/j.eng.2022.12.008.
Yao, A., Li, G., Li, X., Jiang, F., Xu, J., & Liu, X. (2023). Differential privacy in edge computing-based smart city applications: Security issues, solutions and
future directions. Array, 19, Article 100293. http://dx.doi.org/10.1016/j.array.2023.100293.
Yin, L., Qin, L., Jiang, Z., & Xu, X. (2021). A fast parallel attribute reduction algorithm using Apache Spark. Knowledge-Based Systems, 212, Article 106582.
http://dx.doi.org/10.1016/j.knosys.2020.106582.
Zhang, M., Chen, Y., & Susilo, W. (2022). Decision Tree Evaluation on Sensitive Datasets for Secure e-Healthcare Systems. IEEE Transactions on Dependable and
Secure Computing, 20, 3988–4001. http://dx.doi.org/10.1109/TDSC.2022.3219849.
Zhang, M., Huang, S., Shen, G., & Wang, Y. (2023). PPNNP: A privacy-preserving neural network prediction with separated data providers using multi-client
inner-product encryption. Computer Standards & Interfaces, 84, http://dx.doi.org/10.1016/j.csi.2022.103678.
Zhang, M., Yang, M., & Shen, G. (2022). SSBAS-FA: A secure sealed-bid e-auction scheme with fair arbitration based on time-released blockchain. Journal of
Systems Architecture, 129, Article 102619. http://dx.doi.org/10.1016/j.sysarc.2022.102619.
Zhang, T., Zhu, T., Xiong, P., Huo, H., Tari, Z., & Zhou, W. (2020). Correlated Differential Privacy: Feature Selection in Machine Learning. IEEE Transactions on
Industrial Informatics, 16, 2115–2124. http://dx.doi.org/10.1109/TII.2019.2936825, arXiv:2010.03094.
Zhang, G., Zhu, X., Yin, L., Pedrycz, W., & Li, Z. (2022). Granular data representation under privacy protection: Tradeoff between data utility and privacy via
information granularity. Applied Soft Computing, 131, Article 109808. http://dx.doi.org/10.1016/j.asoc.2022.109808.
Zhao, B., Fan, K., Yang, K., & Wang, Z. (2021). Anonymous and Privacy-Preserving Federated Learning With Industrial Big Data. IEEE Transactions on Industrial
Informatics, 17, 6314–6323. http://dx.doi.org/10.1109/TII.2021.3052183.
Zhao, F., Ren, X., Yang, S., Han, Q., Zhao, P., & Yang, X. (2021). Latent Dirichlet Allocation Model Training with Differential Privacy. IEEE Transactions on
Information Forensics and Security, 16, 1290–1305. http://dx.doi.org/10.1109/TIFS.2020.3032021, arXiv:2010.04391.