Data Leakage Detection
PROJECT REPORT
Submitted by
PRITHIGA A 810421243037
RESHMA R 810421243042
SIGNATURE SIGNATURE
Dr. Shree K.V.M, M.E., Ph.D., Ms. K. Murugeswari, M.E.,
HEAD OF THE DEPARTMENT, SUPERVISOR,
ACKNOWLEDGEMENT
We express our gratitude and thanks to our parents first for giving us health and a sound
mind to complete this project. We give all the glory and thanks to our almighty GOD for
showering upon us the necessary wisdom and grace for accomplishing this project.
It is our pleasant duty to express our deep sense of gratitude to our Honourable Chancellor
Shri. A. Srinivasan, for his kind encouragement. We have unique pleasure in thanking
our Principal Prof. Dr. D. Shanmugasundaram, M.E., Ph.D., F.I.E., C.Eng., our Dean
Dr. K. Anbarasan, M.E., Ph.D., and our COE Dr. K. Velmurugan, M.E., Ph.D., for their
unflinching devotion, which led us to complete this project.
We express our faithful and sincere gratitude to our Head of the Department
Dr. Shree K.V.M, M.E., Ph.D., Artificial Intelligence and Data Science, for her valuable
guidance and the support she gave us throughout the project.
We express our faithful and sincere gratitude to our Project Coordinator A. Siva Sankari,
M.E., MCA., M.Phil., Artificial Intelligence and Data Science, for her valuable guidance and
the support she gave us throughout the project.
ABSTRACT
TABLE OF CONTENTS
1.0 INTRODUCTION
1.1 Overview of Project
2.0 LITERATURE REVIEW
3.0 ARCHITECTURE
4.0 METHODOLOGY
5.0 IMPLEMENTATION
5.1 Code Structure
6.0 DISCUSSION
6.1 Evaluation Metrics
6.2 Performance Analysis
7.0 CONCLUSION
8.0 REFERENCES
CHAPTER 1
INTRODUCTION
Data leakage is a critical concern in data science, especially when building predictive
models.
It refers to the unintentional inclusion of information in the training data that would not be
available at the time of prediction, leading to overly optimistic performance estimates and
ultimately resulting in models that fail in real-world applications.
Understanding and preventing data leakage is essential for developing robust and reliable
machine learning models.
At its core, data leakage occurs when the model has access to data during training that it
wouldn't have during actual prediction.
This can take many forms, such as including future data, target leakage, or using
variables that contain information about the target variable.
For example, if a model predicting credit default uses data that includes payment history
up to the point of default, this information would not be available when making a prediction
before the default occurs. This scenario gives the model an unfair advantage, leading to inflated
performance metrics during validation.
One common type of data leakage is temporal leakage, where information from the
future is used to predict past events. In time-series data, this is particularly problematic as it
directly violates the temporal order of events.
For instance, using stock prices from the future to predict past prices would be an
obvious case of temporal leakage. Ensuring the chronological integrity of data is crucial in these
contexts to avoid misleading results.
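As a minimal sketch of such a chronological split (the file name, the 'date' column, and the
cutoff date below are hypothetical, not taken from this project), the data can be sorted by
timestamp and cut at a fixed date instead of being sampled at random:

import pandas as pd

# Hypothetical time-series data with a 'date' column (assumed names, for illustration only)
df = pd.read_csv('prices.csv', parse_dates=['date'])
df = df.sort_values('date')

# Split on a cutoff date so the test set lies strictly in the future of the training set
cutoff = pd.Timestamp('2023-01-01')
train = df[df['date'] < cutoff]
test = df[df['date'] >= cutoff]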
Another form is target leakage, where the predictors include information that would not
be known at prediction time but is directly related to the outcome. This often happens
inadvertently when the target variable or its proxies slip into the training features. In such
cases, accuracy would appear high during validation, but the model would fail when applied to
new, unseen data where such information is unavailable.
Feature leakage, a more subtle form, occurs when features indirectly capture
information about the target variable. This can happen through complex relationships or when
data preprocessing is not done correctly.
For example, if a dataset includes a feature that is a summary statistic of the target
variable, even if not obvious at first glance, it can introduce leakage. Thoroughly understanding
the data and the relationships between features and the target variable is key to preventing
feature leakage.
Preventing data leakage involves careful data preprocessing, feature selection, and
validation practices. Data should be split into training, validation, and test sets in a manner that
preserves the temporal or logical sequence of events.
Collaboration with subject matter experts can help in understanding the nuances of the
data and ensuring that only appropriate features are included.
Regularly reviewing and validating the data pipeline is also vital. Automated checks
can help detect anomalies or changes in data that might introduce leakage. Version control and
documentation of data processing steps ensure transparency and reproducibility, which are
essential for maintaining the integrity of the modeling process.
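One minimal sketch of such an automated check (the file name and 'target' column below are
hypothetical) is to flag any feature whose correlation with the target is suspiciously close to
perfect:

import pandas as pd

# Hypothetical file and target column names, used only for illustration
data = pd.read_csv('training_data.csv')
corr = data.corr(numeric_only=True)['target'].drop('target')

# Near-perfect correlation with the target is a common symptom of leakage
suspicious = corr[corr.abs() > 0.95]
print('Features to review for possible leakage:')
print(suspicious)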
Data leakage is a significant challenge in data science that can severely undermine the
reliability of predictive models. By understanding its various forms and implementing robust
data handling practices, data scientists can mitigate the risk of leakage and build models that
generalize well to new data. This not only enhances the credibility of the models but also ensures
their practical applicability in real-world scenarios.
CHAPTER 2
LITERATURE REVIEW
In paper [1], the authors make use of a sequence alignment method for detecting
complex data-leakage patterns. The algorithm is designed to recognize long and significant
data patterns, and this detection is paired with a sampling algorithm.
In paper [2], two algorithms are implemented for detecting transformed leaked
information. The framework achieves high detection accuracy and finds transformed leaks that
state-of-the-art inspection systems miss. The authors parallelize their design on graphics
processing units and demonstrate the strong scalability that a detection solution deployed by a
sizable organization requires.
In paper [4], the authors developed the Aquifer security system, which places host export
restrictions on all data handled as part of a user interface (UI) workflow. The key insight is that
when applications in modern operating systems share data, the sharing is part of a larger
workflow that performs a user task. In doing so, Aquifer lets applications retain reasonable
control of their data after it has been shared as part of the user's tasks.
Paper [5] presents Attire, an application for computers and smartphones that presents
the user with an avatar. Attire conveys real-time data exposure in a lightweight and
unobtrusive manner by updating the avatar's clothing.
Paper [6] presents the Data-Driven Semi-Global Alignment (DDSGA) method. In terms of
security effectiveness, DDSGA improves the scoring systems by adopting distinct alignment
parameters for each user, and it adapts to changes in user behavior by updating each user's
signature according to their current behavior. To optimize the runtime overhead, DDSGA
reduces the alignment overhead and parallelizes both the detection and the signature update.
In paper [7], the authors propose a novel method for capturing richer semantics of the
user's intent. The technique relies on the observation that, for most text-based applications, the
user's intent is displayed fully on screen as text, and the user will make corrections if what is on
screen is not what they intended.
The evaluation results show that Gyrus successfully blocks modern malware, and the
study demonstrates that it would be very difficult for future attacks to defeat it. Finally, the
performance analysis shows that Gyrus is a viable option for deployment on standalone PCs
with continuous user interaction.
Gyrus fills an important gap, enabling security mechanisms that take user intent into
account when determining the legitimacy of network traffic.
Another surveyed approach partitions network traffic into subsets while making sure each
subset contains every event relevant to a detection case. The proposed partitioning method is
based on the concept of detection scope, i.e., the smallest "slice" of traffic that a detector must
examine to perform its function. Since this concept has some general applicability, the designed
model supports both simple per-flow detection techniques and more complex, high-level
detectors.
The findings of paper [9] show that detecting essential data is not straightforward because
the content may be transformed. Transformations (for example, insertion and deletion) result in
significantly unpredictable leakage patterns.
Paper [10] shows that many of the apparent distance metrics used for computing
behavioral similarity between network hosts fail to capture the semantic meaning imbued by
network protocols. Moreover, they tend to neglect the long-term temporal structure of the
objects being measured.
To account for these semantic and temporal attributes, the authors create a new
behavioral distance metric for network hosts and compare its performance with a metric that
disregards such information.
Specifically, they propose semantically meaningful metrics for common data types found
in network data, show how these metrics can be combined to treat network data as a unified
metric space, and describe a temporal sequencing algorithm that captures long-term causal
relationships.
Shoulin Yin et al. [11] introduced the novel concept of searchable asymmetric encryption,
which is useful for security and search operations on encrypted data. It greatly enhances
information protection and prevents leakage of the user's search criteria (the search pattern).
Paper [12] describes the Kerberos-assisted mobile ad-hoc network (KAMAN) protocol,
which is used to prevent user information leakage in a cloud environment under virtual
side-channel attacks. Moez Altayeb et al. [13] described the concepts of radiation and data
leaks in wireless sensor networks; their system locates the leaking station and controls the
stations' power consumption by sending a special command to it from the server node.
CHAPTER 3
ARCHITECTURE
1. Training Phase:
2. Preprocessing Phase:
Scaling or Normalization:
Scaling or normalizing the data using information from the entire dataset before
splitting it into training and test sets. This lets information from the test set influence the
training process, biasing the performance metrics.
Imputation:
Imputing missing values using information from the entire dataset, including the
test set.
Incorrect Validation:
Evaluating the model on the same data that was used for training (not splitting into
independent training and test sets).
4. Deployment Phase:
Real-time Data:
If the deployed model is trained on outdated data and the real-time data differs
significantly, it can lead to inaccurate predictions.
The architecture diagram illustrates a Data Leakage Detection System designed to monitor and
prevent unauthorized sharing or exposure of confidential documents.
CHAPTER 4
METHODOLOGY
Data leakage happens when data that would not be available at the time of prediction is
included in the training set, for example, using target variables that are influenced by future
events that would not be known at the time of prediction.
Recall that validation is meant to be a measure of how the model does on data that it
hasn't considered before. You can corrupt this process in subtle ways if the validation data
affects the preprocessing behavior. This is sometimes called train-test contamination.
For example, imagine you run preprocessing (like fitting an imputer for missing values)
before calling train_test_split(). The end result? Your model may get good validation scores,
giving you great confidence in it, but perform poorly when you deploy it to make decisions.
After all, you incorporated data from the validation or test set into how you make
predictions, so the model may do well on that particular data even if it cannot generalize to new
data. This problem becomes even more subtle (and more dangerous) when you do more complex
feature engineering.
If your validation is based on a simple train-test split, exclude the validation data from
any type of fitting, including the fitting of preprocessing steps. This is easier if you use scikit-
learn pipelines. When using cross-validation, it's even more critical that you do your
preprocessing inside the pipeline!
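As a minimal sketch of this (the file name and target column below are hypothetical), keeping
the imputer inside a scikit-learn Pipeline ensures that, during cross-validation, it is re-fitted on
each training fold only:

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical dataset and target column names, used only for illustration
data = pd.read_csv('training_data.csv')
X = data.drop(columns=['target'])
y = data['target']

# The imputer is fitted inside each cross-validation training fold,
# so the held-out fold never influences the preprocessing
model = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('rf', RandomForestClassifier(random_state=42)),
])
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(scores.mean())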
Non-Random Sampling:
If data is not properly sampled, it can lead to leakage. For example, in time-
series data, if training and test sets are not split chronologically, information from the
future can leak into the past.
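A minimal sketch of a chronology-preserving evaluation, assuming the rows are already ordered
from oldest to newest (the arrays below are randomly generated placeholders), uses scikit-learn's
TimeSeriesSplit so that every training fold precedes its test fold:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder arrays standing in for time-ordered features and labels
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Each split trains on earlier rows and tests on later rows only
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    print(f'train up to row {train_idx.max()}, test from row {test_idx.min()}')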
Target Leakage: This occurs when features that are highly correlated with the target
variable, but not actually available at prediction time, are included in the model. For example, a
customer churn prediction model might include information about whether a customer has
already churned.
Train-Test Contamination:
Issue:
When data from the test set inadvertently influences the training process. For
example, if preprocessing steps like scaling or imputation are done without separating
train and test data, the model may indirectly learn information from the test set.
Solution:
Always split your data into train and test sets before any preprocessing steps.
Ensure that any transformations or adjustments are learned from the training data and
applied consistently to the test data.
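A minimal sketch of this (X and y are hypothetical feature and label objects, not defined in this
report): split first, then learn the scaling parameters from the training portion only and reuse
them on the test portion.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X and y are assumed to exist already; split BEFORE any preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling parameters are learned from the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# ...and then applied, unchanged, to the test data
X_test_scaled = scaler.transform(X_test)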
Using Future Information:
Issue:
Including data in the training process that would not be available at the time of
prediction. For instance, using features that are calculated using information that would
not be known at the time of prediction can lead to unrealistic performance estimates.
Solution:
Be mindful of the temporal order of your data. Ensure that the training data only
includes information that would realistically be available at the time of making
predictions.
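One hedged illustration (the file and column names below are hypothetical): when engineering
features for time-ordered data, shift() can be used so that each row only sees values from strictly
earlier rows.

import pandas as pd

# Hypothetical daily sales data, ordered by date (assumed column names)
df = pd.read_csv('sales.csv', parse_dates=['date']).sort_values('date')

# Each feature uses only values strictly before the current row,
# so nothing from the prediction moment or later leaks into training
df['sales_lag_1'] = df['sales'].shift(1)
df['sales_mean_7'] = df['sales'].shift(1).rolling(window=7).mean()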
Target Leakage:
Solution:
Remove any columns that directly leak information about the target variable
from your predictors before training your model.
Data Preprocessing Steps:
Issue:
Performing data preprocessing steps such as scaling or feature selection
across the entire dataset (train + test) without separating them.
Solution:
Always fit preprocessing steps (e.g., scaling, imputation) on the training data
and then transform both the training and test sets using the fitted preprocessing
parameters.
Cross-Validation Leakage:
Issue:
Incorrectly applying cross-validation in a way that leaks information across
folds. For example, performing feature selection or parameter tuning separately on
each fold without considering the entire training set.
Solution:
Ensure that all cross-validation steps, including preprocessing, feature
selection, and hyperparameter tuning, are applied within each fold of the cross-
validation loop. Use techniques like nested cross-validation for robust evaluation.
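A minimal nested cross-validation sketch (X, y, and the parameter grid below are hypothetical):
the inner search tunes hyperparameters on each outer training fold only, while the outer loop
provides the performance estimate, so tuning never sees the outer test folds.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Inner CV: hyperparameter search; outer CV: unbiased performance estimate.
# X and y are assumed to exist already.
inner = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={'n_estimators': [100, 300], 'max_depth': [None, 10]},
    cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())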
CHAPTER 5
IMPLEMENTATION
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load dataset
data = pd.read_csv('AER_credit_card_data.csv')

# 'card' is assumed here to be the name of the target column in this dataset
target = 'card'
features = data.columns.drop(target)

# One-hot encode the categorical predictors
data_encoded = pd.get_dummies(
    data,
    columns=data[features].select_dtypes(include=['object']).columns,
    drop_first=True)
features_encoded = data_encoded.columns.drop(target)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data_encoded[features_encoded], data_encoded[target],
    test_size=0.3, random_state=42)

# Train a Random Forest classifier and evaluate it on the test set
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print('Confusion Matrix:')
print(cm)

# Plot feature importances
importances = pd.Series(rf.feature_importances_, index=features_encoded).sort_values()
importances.plot(kind='barh')
plt.title('Feature Importances')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.show()

# Detect potential data leakage (see the sketch below)
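The leakage check itself is only sketched here. Reusing the fitted rf and the encoded feature
names from the code above, one simple heuristic is to rank the features by importance and
inspect the columns suspected of carrying post-outcome information (discussed in the next
chapter) before deciding whether to drop them:

# Rank features by importance; unusually dominant features deserve a manual leakage review
importance = pd.Series(rf.feature_importances_, index=features_encoded)
importance = importance.sort_values(ascending=False)
print(importance.head())

# Columns that the discussion flags as possible sources of leakage
suspected = ['share', 'expenditure', 'reports']
print(importance[importance.index.isin(suspected)])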
CHAPTER 6
DISCUSSION
Model Performance
The model's performance is evaluated using accuracy and a confusion matrix. These metrics
provide insight into the model's ability to correctly classify instances of credit card default.
Accuracy :
The accuracy of the Random Forest model on the test set is 0.9773. This means that
approximately 97.73% of the predictions made by the model are correct.
Accuracy: 0.9773
Confusion Matrix :
The confusion matrix provides a detailed breakdown of the model's performance in terms
of true positives, true negatives, false positives, and false negatives.
Confusion Matrix:
[[ 90 0]
[ 9 297]]
True Negatives: 90 instances where the model correctly predicted a non-default (first row, first column).
False Positives: 0 instances where the model incorrectly predicted a default for a non-default (first row, second column).
False Negatives: 9 instances where the model incorrectly predicted a non-default for an actual default (second row, first column).
True Positives: 297 instances where the model correctly predicted a default (second row, second column).
The confusion matrix indicates that the model is highly accurate in predicting both defaults
and non-defaults, with only a small number of false negatives and no false positives.
The following features were flagged as potential sources of leakage:
- share
- expenditure
- reports
These features have high importance scores in the Random Forest model and might include
information that could lead to data leakage. For instance:
Share : This feature might include information about the proportion of a customer's
spending that was shared or utilized in a certain way, potentially revealing future spending
behavior.
Expenditure : This feature could directly reflect the customer's future spending patterns,
which would not be available at the time of making a prediction.
Reports : This feature might include information from financial or credit reports
generated after the prediction period, thus leaking future information into the model.
If confirmed, these features should be excluded from the model training process
to ensure that the model's performance metrics are valid and that the model can generalize well
to new, unseen data.
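A hedged sketch of that follow-up step, reusing X_train, X_test, y_train, y_test, and
features_encoded from the implementation chapter: drop the suspected columns, refit the same
model, and compare the new accuracy against the original 0.9773 to see how much of the score
depended on those features.

from sklearn.ensemble import RandomForestClassifier

# Retrain after removing the columns suspected of leaking future information
suspected = ['share', 'expenditure', 'reports']
clean_features = [c for c in features_encoded if c not in suspected]

rf_clean = RandomForestClassifier(random_state=42)
rf_clean.fit(X_train[clean_features], y_train)
print('Accuracy without suspected features:',
      rf_clean.score(X_test[clean_features], y_test))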
CHAPTER 7
CONCLUSION
Data leakage is a critical issue in data science that can severely compromise the
integrity and reliability of machine learning models. It occurs when information from outside the
training dataset is used improperly to create or evaluate the model.
By prioritizing data integrity and rigorous validation practices, data scientists can build
more trustworthy models that generalize well to unseen data and yield reliable insights for
decision-making purposes.
The Random Forest model achieved high accuracy in predicting credit card defaults,
with an accuracy score of 0.9773 and a well-performing confusion matrix.
However, potential data leakage features were identified, which need to be addressed to
ensure the model's robustness and reliability in real-world applications. Future work should focus
on validating and possibly removing these features to prevent data leakage and maintain the
integrity of the model's predictions.
CHAPTER 8
REFERENCES
[1] P. Papadimitriou and H. Garcia-Molina, “Data Leakage Detection,” IEEE Trans. Knowledge
and Data Engineering, vol. 23, no. 1, pp. 51-63, 2011.
[2] R. Agrawal and J. Kiernan, “Watermarking Relational Databases,” Proc. 28th Int’l Conf.
Very Large Data Bases (VLDB ’02), VLDB Endowment, pp. 155-166, 2002.
[3] P. Bonatti, S.D.C. di Vimercati, and P. Samarati, “An Algebra for Composing Access
Control Policies,” ACM Trans. Information and System Security, vol. 5, no. 1, pp. 1-35, 2002.
[4] P. Buneman, S. Khanna, and W.C. Tan, “Why and Where: A Characterization of Data
Provenance,” Proc. Eighth Int’l Conf. Database Theory (ICDT ’01), J.V. den Bussche and V.
Vianu, eds., pp. 316-330, Jan. 2001
[5] P. Buneman and W.-C. Tan, “Provenance in Databases,” Proc. ACM SIGMOD, pp. 1171-
1173, 2007.
[6] Y. Cui and J. Widom, “Lineage Tracing for General Data Warehouse Transformations,” The
VLDB J., vol. 12, pp. 41-58, 2003.
[7] F. Hartung and B. Girod, “Watermarking of Uncompressed and Compressed Video,” Signal
Processing, vol. 66, no. 3, pp. 283-301, 1998.
[8] S. Jajodia, P. Samarati, M.L. Sapino, and V.S. Subrahmanian,“Flexible Support for Multiple
Access Control Policies,” ACM Trans. Database Systems, vol. 26, no. 2, pp. 214-260, 2001.
[10] B. Mungamuru and H. Garcia-Molina, “Privacy, Preservation and Performance: The 3 P’s
of Distributed Data Management,” technical report, Stanford Univ.