Data Leakage Detection
PROJECT REPORT
Submitted by
PRITHIGA A 810421243037
RESHMA R 810421243042
SIGNATURE SIGNATURE
Dr. Shree K.V.M, M.E., Ph.D., Ms. K. Murugeswari, M.E.,
HEAD OF THE DEPARTMENT, SUPERVISOR,
ACKNOWLEDGEMENT
We express our gratitude and thanks to our parents first for giving us health and a sound
mind to complete this project. We give all the glory and thanks to our almighty GOD for
showering upon us the necessary wisdom and grace for accomplishing this project.
It is our pleasant duty to express our deep sense of gratitude to our Honourable Chancellor
Shri. A. Srinivasan, for his kind encouragement. We have unique pleasure in thanking
our Principal Prof. Dr. D. Shanmugasundaram, M.E., Ph.D., F.I.E., C.Eng., our Dean
Dr. K. Anbarasan, M.E., Ph.D., and our COE Dr. K. Velmurugan, M.E., Ph.D., for their
unflinching devotion, which led us to complete this project.
We express our faithful and sincere gratitude to our Head of the Department
Dr. Shree K.V.M, M.E., Ph.D., Artificial Intelligence and Data Science, for her valuable
guidance and the support she gave us throughout the project.
We express our faithful and sincere gratitude to our Project Coordinator A. Siva Sankari,
M.E., MCA., M.Phil., Artificial Intelligence and Data Science, for her valuable guidance and
the support she gave us throughout the project.
ABSTRACT
TABLE OF CONTENTS
1.0 INTRODUCTION
1.1 Overview of Project
2.0 LITERATURE REVIEW
3.0 ARCHITECTURE
4.0 METHODOLOGY
5.0 IMPLEMENTATION
5.1 Code Structure
6.0 DISCUSSION
6.1 Evaluation Metrics
6.2 Performance Analysis
7.0 CONCLUSION
8.0 REFERENCES
CHAPTER 1
INTRODUCTION
Data leakage is a critical concern in data science, especially when building predictive
models.
It refers to the unintentional inclusion of information in the training data that would not be
available at the time of prediction, leading to overly optimistic performance estimates and
ultimately resulting in models that fail in real-world applications.
Understanding and preventing data leakage is essential for developing robust and reliable
machine learning models.
At its core, data leakage occurs when the model has access to data during training that it
wouldn't have during actual prediction.
This can take many forms, such as including future data, target leakage, or using
variables that contain information about the target variable.
For example, if a model predicting credit default uses data that includes payment history
up to the point of default, this information would not be available when making a prediction
before the default occurs. This scenario gives the model an unfair advantage, leading to inflated
performance metrics during validation.
One common type of data leakage is temporal leakage, where information from the
future is used to predict past events. In time-series data, this is particularly problematic as it
directly violates the temporal order of events.
For instance, using stock prices from the future to predict past prices would be an
obvious case of temporal leakage. Ensuring the chronological integrity of data is crucial in these
contexts to avoid misleading results.
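As a minimal sketch of such a chronological split (the file name, the 'date' column, and the
cutoff date below are hypothetical, not taken from this project), the data can be sorted by
timestamp and cut at a fixed date instead of being sampled at random:

import pandas as pd

# Hypothetical time-series data with a 'date' column (assumed names, for illustration only)
df = pd.read_csv('prices.csv', parse_dates=['date'])
df = df.sort_values('date')

# Split on a cutoff date so the test set lies strictly in the future of the training set
cutoff = pd.Timestamp('2023-01-01')
train = df[df['date'] < cutoff]
test = df[df['date'] >= cutoff]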
Another form is target leakage, where the predictors include information that would not
be known at prediction time but is directly related to the outcome. This often happens
inadvertently when the target variable or its proxies slip into the training features. In such
cases, accuracy would appear high during validation, but the model would fail when applied to
new, unseen data where such information is unavailable.
Feature leakage, a more subtle form, occurs when features indirectly capture
information about the target variable. This can happen through complex relationships or when
data preprocessing is not done correctly.
For example, if a dataset includes a feature that is a summary statistic of the target
variable, even if not obvious at first glance, it can introduce leakage. Thoroughly understanding
the data and the relationships between features and the target variable is key to preventing
feature leakage.
Preventing data leakage involves careful data preprocessing, feature selection, and
validation practices. Data should be split into training, validation, and test sets in a manner that
preserves the temporal or logical sequence of events.
Collaboration with subject matter experts can help in understanding the nuances of the
data and ensuring that only appropriate features are included.
Regularly reviewing and validating the data pipeline is also vital. Automated checks
can help detect anomalies or changes in data that might introduce leakage. Version control and
documentation of data processing steps ensure transparency and reproducibility, which are
essential for maintaining the integrity of the modeling process.
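One minimal sketch of such an automated check (the file name and 'target' column below are
hypothetical) is to flag any feature whose correlation with the target is suspiciously close to
perfect:

import pandas as pd

# Hypothetical file and target column names, used only for illustration
data = pd.read_csv('training_data.csv')
corr = data.corr(numeric_only=True)['target'].drop('target')

# Near-perfect correlation with the target is a common symptom of leakage
suspicious = corr[corr.abs() > 0.95]
print('Features to review for possible leakage:')
print(suspicious)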
Data leakage is a significant challenge in data science that can severely undermine the
reliability of predictive models. By understanding its various forms and implementing robust
data handling practices, data scientists can mitigate the risk of leakage and build models that
generalize well to new data. This not only enhances the credibility of the models but also ensures
their practical applicability in real-world scenarios.
CHAPTER 2
LITERATURE REVIEW
In paper [1], the authors make use of a sequence alignment method for detecting
complex data-leakage patterns. The algorithm is designed to recognize long and significant
data patterns, and this detection is paired with a sampling algorithm.
In paper [2], two algorithms are implemented for detecting transformed leaked
information. The framework achieves high detection accuracy and finds transformed leaks that
state-of-the-art inspection systems miss. The authors parallelize their design on graphics
processing units and demonstrate the strong scalability that a detection solution deployed by a
sizable organization requires.
In paper [4], the authors developed the Aquifer security system, which places host export
restrictions on all data handled as part of a user interface (UI) workflow. The key insight is that
when applications in modern operating systems share data, the sharing is part of a larger
workflow that performs a user task. In doing so, Aquifer lets applications retain reasonable
control of their data after it has been shared as part of the user's tasks.
Paper [5] presents Attire, an application for computers and smartphones that presents
the user with an avatar. Attire conveys real-time data exposure in a lightweight and
unobtrusive manner by updating the avatar's clothing.
Paper [6] presents the Data-Driven Semi-Global Alignment (DDSGA) method. In terms of
security effectiveness, DDSGA improves the scoring systems by adopting distinct alignment
parameters for each user, and it adapts to changes in user behavior by updating each user's
signature according to their current behavior. To optimize the runtime overhead, DDSGA
reduces the alignment overhead and parallelizes both the detection and the signature update.
In paper [7], the authors propose a novel method for capturing richer semantics of the
user's intent. The technique relies on the observation that, for most text-based applications, the
user's intent is displayed fully on screen as text, and the user will make corrections if what is on
screen is not what they intended.
The evaluation results show that Gyrus successfully blocks modern malware, and the
study demonstrates that it would be very difficult for future attacks to defeat it. Finally, the
performance analysis shows that Gyrus is a viable option for deployment on standalone PCs
with continuous user interaction.
Gyrus fills an important gap, enabling security mechanisms that take user intent into
account when determining the legitimacy of network traffic.
Another surveyed approach partitions network traffic into subsets while making sure each
subset contains every event relevant to a detection case. The proposed partitioning method is
based on the concept of detection scope, i.e., the smallest "slice" of traffic that a detector must
examine to perform its function. Since this concept has some general applicability, the designed
model supports both simple per-flow detection techniques and more complex, high-level
detectors.
The findings of paper [9] show that detecting essential data is not straightforward because
the content may be transformed. Transformations (for example, insertion and deletion) result in
significantly unpredictable leakage patterns.
Paper [10] shows that many of the apparent distance metrics used for computing
behavioral similarity between network hosts fail to capture the semantic meaning imbued by
network protocols. Moreover, they tend to neglect the long-term temporal structure of the
objects being measured.
To account for these semantic and temporal attributes, the authors create a new
behavioral distance metric for network hosts and compare its performance with a metric that
disregards such information.
Specifically, they propose semantically meaningful metrics for common data types found
in network data, show how these metrics can be combined to treat network data as a unified
metric space, and describe a temporal sequencing algorithm that captures long-term causal
relationships.
Shoulin Yin et al. [11] introduced the novel concept of searchable asymmetric encryption,
which is useful for security and search operations on encrypted data. It greatly enhances
information protection and prevents leakage of the user's search criteria (the search pattern).
Paper [12] describes the Kerberos-assisted mobile ad-hoc network (KAMAN) protocol,
which is used to prevent user information leakage in a cloud environment under virtual
side-channel attacks. Moez Altayeb et al. [13] described the concepts of radiation and data
leaks in wireless sensor networks; their system locates the leaking station and controls the
stations' power consumption by sending a special command to it from the server node.
CHAPTER 3
ARCHITECTURE
1. Training Phase:
2. Preprocessing Phase:
Scaling or Normalization:
Scaling or normalizing the data using information from the entire dataset before
splitting it into training and test sets. This lets information from the test set influence the
training process, biasing the performance metrics.
Imputation:
Imputing missing values using information from the entire dataset, including the
test set.
Incorrect Validation:
Evaluating the model on the same data that was used for training (not splitting into
independent training and test sets).
4. Deployment Phase:
Real-time Data:
If the deployed model is trained on outdated data and the real-time data differs
significantly, it can lead to inaccurate predictions.
The architecture diagram illustrates a Data Leakage Detection System designed to monitor and
prevent unauthorized sharing or exposure of confidential documents.
CHAPTER 4
METHODOLOGY
Data leakage happens when data that would not be available at the time of prediction is
included in the training set, for example, using target variables that are influenced by future
events that would not be known at the time of prediction.
Recall that validation is meant to be a measure of how the model does on data that it
hasn't considered before. You can corrupt this process in subtle ways if the validation data
affects the preprocessing behavior. This is sometimes called train-test contamination.
For example, imagine you run preprocessing (like fitting an imputer for missing values)
before calling train_test_split(). The end result? Your model may get good validation scores,
giving you great confidence in it, but perform poorly when you deploy it to make decisions.
After all, you incorporated data from the validation or test set into how you make
predictions, so the model may do well on that particular data even if it cannot generalize to new
data. This problem becomes even more subtle (and more dangerous) when you do more complex
feature engineering.
If your validation is based on a simple train-test split, exclude the validation data from
any type of fitting, including the fitting of preprocessing steps. This is easier if you use scikit-
learn pipelines. When using cross-validation, it's even more critical that you do your
preprocessing inside the pipeline!
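As a minimal sketch of this (the file name and target column below are hypothetical), keeping
the imputer inside a scikit-learn Pipeline ensures that, during cross-validation, it is re-fitted on
each training fold only:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical dataset and target column names, used only for illustration
data = pd.read_csv('training_data.csv')
X = data.drop(columns=['target'])
y = data['target']

# The imputer is fitted inside each cross-validation training fold,
# so the held-out fold never influences the preprocessing
model = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('rf', RandomForestClassifier(random_state=42)),
])
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(scores.mean())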
Non-Random Sampling:
If data is not properly sampled, it can lead to leakage. For example, in time-
series data, if training and test sets are not split chronologically, information from the
future can leak into the past.
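A minimal sketch of a chronology-preserving evaluation, assuming the rows are already ordered
from oldest to newest (the arrays below are randomly generated placeholders), uses scikit-learn's
TimeSeriesSplit so that every training fold precedes its test fold:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder arrays standing in for time-ordered features and labels
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Each split trains on earlier rows and tests on later rows only
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    print(f'train up to row {train_idx.max()}, test from row {test_idx.min()}')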
Target Leakage: This occurs when features that are highly correlated with the target
variable, but not actually available at prediction time, are included in the model. For example, a
customer churn prediction model might include information about whether a customer has
already churned.
Train-Test Contamination:
Issue:
When data from the test set inadvertently influences the training process. For
example, if preprocessing steps like scaling or imputation are done without separating
train and test data, the model may indirectly learn information from the test set.
Solution:
Always split your data into train and test sets before any preprocessing steps.
Ensure that any transformations or adjustments are learned from the training data and
applied consistently to the test data.
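A minimal sketch of this (X and y are hypothetical feature and label objects, not defined in this
report): split first, then learn the scaling parameters from the training portion only and reuse
them on the test portion.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X and y are assumed to exist already; split BEFORE any preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling parameters are learned from the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and then applied, unchanged, to the test data
X_test_scaled = scaler.transform(X_test)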
Using Future Information:
Issue:
Including data in the training process that would not be available at the time of
prediction. For instance, using features that are calculated using information that would
not be known at the time of prediction can lead to unrealistic performance estimates.
Solution:
Be mindful of the temporal order of your data. Ensure that the training data only
includes information that would realistically be available at the time of making
predictions.
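One hedged illustration (the file and column names below are hypothetical): when engineering
features for time-ordered data, shift() can be used so that each row only sees values from strictly
earlier rows.

import pandas as pd

# Hypothetical daily sales data, ordered by date (assumed column names)
df = pd.read_csv('sales.csv', parse_dates=['date']).sort_values('date')

# Each feature uses only values strictly before the current row,
# so nothing from the prediction moment or later leaks into training
df['sales_lag_1'] = df['sales'].shift(1)
df['sales_mean_7'] = df['sales'].shift(1).rolling(window=7).mean()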
Target Leakage:
Solution:
Remove any columns that directly leak information about the target variable
from your predictors before training your model.
Data Preprocessing Steps:
Issue:
Performing data preprocessing steps such as scaling or feature selection
across the entire dataset (train + test) without separating them.
Solution:
Always fit preprocessing steps (e.g., scaling, imputation) on the training data
and then transform both the training and test sets using the fitted preprocessing
parameters.
Cross-Validation Leakage:
Issue:
Incorrectly applying cross-validation in a way that leaks information across
folds. For example, performing feature selection or parameter tuning separately on
each fold without considering the entire training set.
Solution:
Ensure that all cross-validation steps, including preprocessing, feature
selection, and hyperparameter tuning, are applied within each fold of the cross-
validation loop. Use techniques like nested cross-validation for robust evaluation.
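A minimal nested cross-validation sketch (X, y, and the parameter grid below are hypothetical):
the inner search tunes hyperparameters on each outer training fold only, while the outer loop
provides the performance estimate, so tuning never sees the outer test folds.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Inner CV: hyperparameter search; outer CV: unbiased performance estimate.
# X and y are assumed to exist already.
inner = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [100, 300], 'max_depth': [None, 10]},
    cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())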
CHAPTER 5
IMPLEMENTATION
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load dataset
data = pd.read_csv('AER_credit_card_data.csv')

# 'card' is assumed here to be the name of the target column in this dataset
target = 'card'
features = data.columns.drop(target)

# One-hot encode the categorical predictors
data_encoded = pd.get_dummies(
    data,
    columns=data[features].select_dtypes(include=['object']).columns,
    drop_first=True)
features_encoded = data_encoded.columns.drop(target)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data_encoded[features_encoded], data_encoded[target],
    test_size=0.3, random_state=42)

# Train a Random Forest classifier and evaluate it on the test set
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print('Confusion Matrix:')
print(cm)

# Plot feature importances
importances = pd.Series(rf.feature_importances_, index=features_encoded).sort_values()
importances.plot(kind='barh')
plt.title('Feature Importances')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.show()

# Detect potential data leakage (see the sketch below)
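The leakage check itself is only sketched here. Reusing the fitted rf and the encoded feature
names from the code above, one simple heuristic is to rank the features by importance and
inspect the columns suspected of carrying post-outcome information (discussed in the next
chapter) before deciding whether to drop them:

# Rank features by importance; unusually dominant features deserve a manual leakage review
importance = pd.Series(rf.feature_importances_, index=features_encoded)
importance = importance.sort_values(ascending=False)
print(importance.head())

# Columns that the discussion flags as possible sources of leakage
suspected = ['share', 'expenditure', 'reports']
print(importance[importance.index.isin(suspected)])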
CHAPTER 6
DISCUSSION
Model Performance
The model's performance is evaluated using accuracy and a confusion matrix. These metrics
provide insight into the model's ability to correctly classify instances of credit card default.
Accuracy :
The accuracy of the Random Forest model on the test set is 0.9773. This means that
approximately 97.73% of the predictions made by the model are correct.
Accuracy: 0.9773
Confusion Matrix :
The confusion matrix provides a detailed breakdown of the model's performance in terms
of true positives, true negatives, false positives, and false negatives.
Confusion Matrix:
[[ 90 0]
[ 9 297]]
True Negatives: 90 instances where the model correctly predicted a non-default (first row, first column).
False Positives: 0 instances where the model incorrectly predicted a default for a non-default (first row, second column).
False Negatives: 9 instances where the model incorrectly predicted a non-default for an actual default (second row, first column).
True Positives: 297 instances where the model correctly predicted a default (second row, second column).
The confusion matrix indicates that the model is highly accurate in predicting both defaults
and non-defaults, with only a small number of false negatives and no false positives.
The following features were flagged as potential sources of leakage:
- share
- expenditure
- reports
These features have high importance scores in the Random Forest model and might include
information that could lead to data leakage. For instance:
Share : This feature might include information about the proportion of a customer's
spending that was shared or utilized in a certain way, potentially revealing future spending
behavior.
Expenditure : This feature could directly reflect the customer's future spending patterns,
which would not be available at the time of making a prediction.
Reports : This feature might include information from financial or credit reports
generated after the prediction period, thus leaking future information into the model.
If confirmed, these features should be excluded from the model training process
to ensure that the model's performance metrics are valid and that the model can generalize well
to new, unseen data.
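A hedged sketch of that follow-up step, reusing X_train, X_test, y_train, y_test, and
features_encoded from the implementation chapter: drop the suspected columns, refit the same
model, and compare the new accuracy against the original 0.9773 to see how much of the score
depended on those features.

from sklearn.ensemble import RandomForestClassifier

# Retrain after removing the columns suspected of leaking future information
suspected = ['share', 'expenditure', 'reports']
clean_features = [c for c in features_encoded if c not in suspected]

rf_clean = RandomForestClassifier(random_state=42)
rf_clean.fit(X_train[clean_features], y_train)
print('Accuracy without suspected features:',
      rf_clean.score(X_test[clean_features], y_test))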
CHAPTER 7
CONCLUSION
Data leakage is a critical issue in data science that can severely compromise the
integrity and reliability of machine learning models. It occurs when information from outside the
training dataset is used improperly to create or evaluate the model.
By prioritizing data integrity and rigorous validation practices, data scientists can build
more trustworthy models that generalize well to unseen data and yield reliable insights for
decision-making purposes.
The Random Forest model achieved high accuracy in predicting credit card defaults,
with an accuracy score of 0.9773 and a well-performing confusion matrix.
However, potential data leakage features were identified, which need to be addressed to
ensure the model's robustness and reliability in real-world applications. Future work should focus
on validating and possibly removing these features to prevent data leakage and maintain the
integrity of the model's predictions.
CHAPTER 8
REFERENCES
[1] P. Papadimitriou and H. Garcia-Molina, “Data Leakage Detection,” IEEE Trans. Knowledge
and Data Engineering, vol. 23, no. 1, pp. 51-63, 2011.
[2] R. Agrawal and J. Kiernan, “Watermarking Relational Databases,” Proc. 28th Int’l Conf.
Very Large Data Bases (VLDB ’02), VLDB Endowment, pp. 155-166, 2002.
[3] P. Bonatti, S.D.C. di Vimercati, and P. Samarati, “An Algebra for Composing Access
Control Policies,” ACM Trans. Information and System Security, vol. 5, no. 1, pp. 1-35, 2002.
[4] P. Buneman, S. Khanna, and W.C. Tan, “Why and Where: A Characterization of Data
Provenance,” Proc. Eighth Int’l Conf. Database Theory (ICDT ’01), J.V. den Bussche and V.
Vianu, eds., pp. 316-330, Jan. 2001
[5] P. Buneman and W.-C. Tan, “Provenance in Databases,” Proc. ACM SIGMOD, pp. 1171-
1173, 2007.
[6] Y. Cui and J. Widom, “Lineage Tracing for General Data Warehouse Transformations,” The
VLDB J., vol. 12, pp. 41-58, 2003.
[7] F. Hartung and B. Girod, “Watermarking of Uncompressed and Compressed Video,” Signal
Processing, vol. 66, no. 3, pp. 283-301, 1998.
[8] S. Jajodia, P. Samarati, M.L. Sapino, and V.S. Subrahmanian,“Flexible Support for Multiple
Access Control Policies,” ACM Trans. Database Systems, vol. 26, no. 2, pp. 214-260, 2001.
[10] B. Mungamuru and H. Garcia-Molina, “Privacy, Preservation and Performance: The 3 P’s
of Distributed Data Management,” technical report, Stanford Univ.