0% found this document useful (0 votes)
18 views25 pages

Data Leakage Detection

The project report outlines the development of a Data Leakage Detection System using machine learning, specifically the Random Forest algorithm, to identify unauthorized data disclosures. It emphasizes the importance of preventing data leakage in predictive models and details the methodologies for data preprocessing, feature engineering, and model evaluation. The system aims to enhance data security by integrating with existing security infrastructure to monitor and analyze data transactions in real-time.

Uploaded by

Rohi Rocks
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
18 views25 pages

Data Leakage Detection

The project report outlines the development of a Data Leakage Detection System using machine learning, specifically the Random Forest algorithm, to identify unauthorized data disclosures. It emphasizes the importance of preventing data leakage in predictive models and details the methodologies for data preprocessing, feature engineering, and model evaluation. The system aims to enhance data security by integrating with existing security infrastructure to monitor and analyze data transactions in real-time.

Uploaded by

Rohi Rocks
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 25

DATA LEAKAGE DETECTION SYSTEM

PROJECT REPORT

Submitted by

HALETHA BEGAM A 810421243018

PRITHIGA A 810421243037

RESHMA R 810421243042

in partial fulfilment for the award of the degree


of
BACHELOR OF TECHNOLOGY
in
ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

DHANALAKSHMI SRINIVASAN ENGINEERING COLLEGE


(AUTONOMOUS)
PERAMBALUR – 621 212

ANNA UNIVERSITY: CHENNAI 600 025


JULY 2024
DATA LEAKAGE DETECTION SYSTEM

PROJECT REPORT

Submitted by

HALETHA BEGAM A 810421243018

PRITHIGA A 810421243037

RESHMA R 810421243042

in partial fulfilment for the award of the degree


of
BACHELOR OF TECHNOLOGY
in
ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

DHANALAKSHMI SRINIVASAN ENGINEERING COLLEGE


(AUTONOMOUS)
PERAMBALUR – 621 212

ANNA UNIVERSITY: CHENNAI 600 025


JULY 2024
DHANALAKSHMI SRINIVASAN ENGINEERING COLLEGE
(AUTONOMOUS)
PERAMBALUR – 621 212
BONAFIDE CERTIFICATE

Certified that this Project report on “DATA LEAKAGE DETECTION


SYSTEM” is the bonafide work of HALETHA BEGAM A (810421243018),
PRITHIGA A (810421243037) , RESHMA R (810421243045) , who
carried out the project work under my supervision.

SIGNATURE SIGNATURE
Dr. Shree K.V.M, M.E., Ph.D., Ms.K.Murugeswari M.E.,
HEAD OF THE DEPARTMENT, SUPERVISOR,

Department of Artificial Intelligence Department of Artificial Intelligence


and Data Science, and Data Science,

Dhanalakshmi Srinivasan Engineering Dhanalakshmi Srinivasan Engineering


College (Autonomous), College (Autonomous),
Perambalur – 621212 Perambalur - 621212

Submitted for the Project Viva-Voce Examination held on

INTERNAL EXAMINER EXTERNAL EXAMINER

I
ACKNOWLEDGEMENT

We express our gratitude and thanks to Our Parents first for giving health and a sound
mind for completing this project. We give all the glory and thanks to our almighty GOD for
showering upon, the necessary wisdom and grace for accomplishing this project.

It is our pleasant duty to express deep sense of gratitude to our Honourable Chancellor
Shri. A. Srinivasan, for his kind encouragement. We have unique pleasure in thanking
our Principal Prof. Dr. D. Shanmugasundaram, M.E., PhD., F.I.E., C.Eng., our Dean
Dr. K .Anbarasan, M.E., Ph.D., and our COE Dr. K. Velmurugan, M.E., Ph.D., for their
unflinching devotion, which leads us to complete this project.

We express our faithful and sincere gratitude to our Head of the Department
Dr. Shree K.V.M, M.E., Ph.D., Artificial Intelligence and Data Science, for her valuable
guidance and support the gave us during the project time.

We express our faithful and sincere gratitude to our Project Coordinator A. Siva Sankari
M.E.,MCA.,M. Phil., Artificial Intelligence and Data Science, for her valuable guidance andsupport the
gave us during the project time.

We are also thankful to our internal project guide Ms.K.Murugeswari M.E., of


Department of Artificial Intelligence and Data Science for her valuable guidance and precious
suggestion to complete this project work successfully.

We render our thanks to all Faculty members and Programmers of Department of


Computer Science and Engineering for their timely assistance

II
3
ABSTRACT

Data leakage, the unauthorized exposure of sensitive information, is a critical threat to


organizations, potentially leading to significant financial losses and reputational damage. This
project aims to develop an advanced data leakage detection system using machine learning,
specifically the Random Forest algorithm, to effectively identify and mitigate unauthorized data
disclosures. By collecting a comprehensive dataset that includes instances of both legitimate data
usage and known leakage cases, the model can be trained to discern patterns indicative of data
leakage. The process begins with thorough data preprocessing to clean, normalize, and standardize
the dataset, ensuring consistency and improving the algorithm's performance. Feature engineering
is employed to identify and extract the most relevant features, optimizing the dataset for better
accuracy. The Random Forest algorithm, known for its robustness and ability to handle large
datasets, constructs multiple decision trees and aggregates their results to improve detection
accuracy. The model is rigorously evaluated using metrics such as accuracy, precision, recall, and
F1-score to ensure its effectiveness in detecting data leakage. Cross-validation techniques are
applied to confirm the model's generalizability and robustness. Once trained, the model is deployed
into a real-time data leakage detection system, integrated with existing security infrastructure to
continuously monitor and analyze data transactions. This approach aims to provide a high-
performing data leakage detection model that enhances data security by timely identifying and
preventing unauthorized data exposure. The resulting system is scalable and adaptable, suitable
for integration into various organizational environments to maintain data integrity and protect
sensitive information.

KEYWORDS: DATA LEAKAGE, MACHINE LEARNING, RANDOM FOREST,


ENSEMBLE LEARNING, FEATURE ENGINEERING, MODEL VALIDATION,
CONFUSION MATRIX.

1
TABLE OF CONTENT
TITLE PAGE NO
1.0 INTRODUCTION 03
1.1 Overview of project

2.0 LITERATURE REVIEW 05

3.0 ARCHITECTURE 08

4.0 IMPLEMENTATION 10

4.1Tools and Libraries

4.2Code Structure

5. RESULT 13

6. DISCUSSION 16

6.1Evaluation Metrics

6.2Performance Analysis

7.0 CONCLUSION 19

8.0 REFERENCE 20

2
CHAPTER 1
INTRODUCTION

Data leakage is a critical concern in data science, especially when building predictive
models.

It refers to the unintentional inclusion of information in the training data that would not be
available at the time of prediction, leading to overly optimistic performance estimates and
ultimately resulting in models that fail in real-world applications.

Understanding and preventing data leakage is essential for developing robust and reliable
machine learning models.

At its core, data leakage occurs when the model has access to data during training that it
wouldn't have during actual prediction.

This can take many forms, such as including future data, target leakage, or using
variables that contain information about the target variable.

For example, if a model predicting credit default uses data that includes payment history
up to the point of default, this information would not be available when making a prediction
before the default occurs. This scenario gives the model an unfair advantage, leading to inflated
performance metrics during validation.

One common type of data leakage is temporal leakage, where information from the
future is used to predict past events. In time-series data, this is particularly problematic as it
directly violates the temporal order of events.

For instance, using stock prices from the future to predict past prices would be an
obvious case of temporal leakage. Ensuring the chronological integrity of data is crucial in these
contexts to avoid misleading results.

Another form is target leakage, where the predictors include information that would not
be known at prediction time but is directly related to the outcome. This often happens
inadvertently when the target variable or its proxies slip into the training features

For example, in a healthcare setting, if a model predicting patient mortality includes


data recorded post-mortem, such as autopsy results, it would lead to target leakage. The model's

3
accuracy would appear high during validation but would fail when applied to new, unseen data
where such information is unavailable.

Feature leakage, a more subtle form, occurs when features indirectly capture
information about the target variable. This can happen through complex relationships or when
data preprocessing is not done correctly.

For example, if a dataset includes a feature that is a summary statistic of the target
variable, even if not obvious at first glance, it can introduce leakage. Thoroughly understanding
the data and the relationships between features and the target variable is key to preventing
feature leakage.

Preventing data leakage involves careful data preprocessing, feature selection, and
validation practices. Data should be split into training, validation, and test sets in a manner that
preserves the temporal or logical sequence of events.

Cross-validation techniques should be employed thoughtfully, ensuring that data from


the future does not inform the past. Additionally, domain knowledge plays a crucial role in
identifying potential sources of leakage.

Collaboration with subject matter experts can help in understanding the nuances of the
data and ensuring that only appropriate features are included.

Regularly reviewing and validating the data pipeline is also vital. Automated checks
can help detect anomalies or changes in data that might introduce leakage. Version control and
documentation of data processing steps ensure transparency and reproducibility, which are
essential for maintaining the integrity of the modeling process.

Data leakage is a significant challenge in data science that can severely undermine the
reliability of predictive models. By understanding its various forms and implementing robust
data handling practices, data scientists can mitigate the risk of leakage and build models that
generalize well to new data. This not only enhances the credibility of the models but also ensures
their practical applicability in real-world scenarios.

4
CHAPTER-2
LITRATURE REVEIW
Inthis paper [1],we makes use of sequence alignment method for searching
complex data-leakage patterns. This algorithm is engaged for recognizing long as well as
important data patterns. This identification is paired with a sampling algorithm.

That allows one to look at the similitude of two independently tested


successions. This structure accomplishes great discovery exactness in perceiving
transformed leakage.

Paper [2] we, implemented two algorithms for searching and transformed leakage
information. This framework fulfills high recognition exactness and finds transformed
leakage appeared differently in relation to the cutting edge inspection systems.

They parallelize their design on graphics preparing unit as well as exhibit the solid
scalability of their detection solution needed by a sizable association.

In paper [3] we have designs fuzzy fingerprint, which is a privacy-preserving


data-leak detection system also provides its realization. By making use of special digests, the
exposure of the vital data is kept to very less while detection. we have conducted few tests to
conform the accuracy, privacy, as well as efficiency of our solutions.

In paper [4]we developed the Aquifer security system that assigns host export
limitations on all data taken as part of a user interface (UI) workflow. Key understanding
was that when applications in modern working frameworks offer data, it is a piece of an
enormous work process to play out a user task.

Each application on the UI work process is a potential information owner, and in


addition thus can add to the security limitations. The restrictions are held with data as it is
composed to storage and propagated to future UI work forms that read it.

In doing all things considered they engage applications to sensibly hold control of their
data after it has been shared as a major aspect of the client's tasks.

In paper [5] wepresent Attire: an app for computers as well as smart phones which
shows the user with an avatar. Attire conveys real-time data exposure in a light weight and
unobtrusive manner via updating to the avatar’s clothing.

5
In paper [6] we given the Data-Driven Semi-Global Alignment (DDSGA),
DDSGA method. From the point of security effectiveness, DDSGA increase the scoring systems
by adopting distinct alignment parameters for every single user.

Also, it endures few transformations in user command sequences by permitting few


changes in the low-level representation of the commands functionality. It additionally

adjusts to modification in the user conduct by upgrading the signature of a user as per its present
behavior. To optimize the run time overhead, DDSGA reduce the alignment overhead as well as
parallelizes the detection and also modify.

In paper [7] we, proposed novel method for getting richer semantics of the user’s
determined. The technique is depending on the observation which for most text-based
applications, the user’s determined are shown fully on screen, as text, as well as the user will do
some modifications if what is on screen is not what he needed.

Depending on this concept, development of prototype known as Gyrus that enforces


right working of applications by taking user determinant is done. By making use of Gyrus,
representation of stopping destructive activities which can modify the host system to forward
destructive traffic, like social network impersonation attacks, as well as online financial
services fraud is done.

The evaluation outcome shows that Gyrus successfully prohibits modern malware, as
well as study demonstrated that it would be very tough for future attacks to defeat it. At last, the
performance analysis demonstrated that Gyrus is a countable option for positioning on
standalone pc with continues user interaction.

Gyrus fills an important gap, enabling security actions that taking user concentration
in finding the legitimacy of network traffic.

In paper [8] we implemented a domain-specific concurrency model that backs a large


class of IDS analysis not depends on a particular detection technique. Implemented technique
divides the stream of network events in subsets that the IDS will process not related.

while making sure each subset has each event relevant to a detection case. Proposed
partitioning method is based on the concept of detection scope, i.e., the less “slice” of traffic
that a detector must study for performing its function. As this concept has

Int J Elec & Comp Eng ISSN: 2088-8708 A Survey: Data Leakage Detection
Techniques (K. S. Wagh) 2249 some common applicability, designed model will support
simple, per-flow detection technique and more complex, high-level detectors.

6
Pindings of we [9], the introduction of essential data is not basic because of information
change in the content. Transformations (for example, insertion, and deletion) results in
significantly unpredictable leakage patterns.

Present automata-based string coordinating algorithms are illogical for finding


transformed data leakage as a result of its formidable complexity nature while exhibiting
the required consistent expressions. They create two novel algorithms for recognizing long
and also wrong data leakage. Their framework achieves high detection precision in perceiving
changed breaks contrasted and the best in class inspection techniques. They parallelize our
design on graphics processing unit and in addition demonstrated the solid scalability of data
leakage detection arrangement examining enormous information.

Paper [10] we given that number of the apparent distance metrics utilized for
computing behavioral similarity between network hosts fail to capture the semantic
significance imbued by network protocols. Moreover, they also tend to neglect long-term
temporal structure of the objects being counted.

To consider the role of these semantic as well as temporal attributes, they create
another behavioral distance metric for network hosts as well as compare its execution with a
metric which disregards information like this.

Specifically, they propose semantically important metrics for common data types found
in network data, indicate how these metrics can be consolidated to treat network information as a
unified metric space, as well as depict a temporal sequencing algorithm which captures long-
term causal relationships. Shoulin Yin et al.

[11] introduced novel concept searchable asymmetric encryption, which is useful for
security and search operations on encrypted data. It greatly enhances the information
protection, and prevent the leakage of the user's search criteria-Search Pattern.

In paper [12] we describes the Kaman- Kerberos assistant mobile ad-hoc network
(KAMAN) protocol to avoid users information leak in cloud environment for virtual side
channel attack. Moez Altayeb et al.

[13] described the concepts of radiation leaks and data in wireless sensor network.
To locate leakage station and control the stations power consumption by sending a special
command to it from server node.

7
CHAPTER 3
ARCHITECTURE

1. Phase:
Training

Inclusion of Future Information: Using data that would not be available at


the time of prediction. For example, using target variables that are influenced by
future events not known at prediction time.

Inclusion of Metadata: Including metadata or derived features


that contain information about the target variable, thus indirectly
leaking information.

2. Preprocessing Phase:

Scaling or Normalization:

Scaling or normalizing data using information from the entire dataset before
splitting into training and test sets. This can lead to the test set being influenced by the
training set, biasing performance metrics.
Imputation:

Imputing missing values using information from the entire dataset, including the
test set.

3. Feature Engineering Phase:

Direct Use of Test Data:


Creating features based directly on test data or using information that would not be
available at prediction time.

Incorrect Validation:
8
Evaluating the model on the same data that was used for training (not splitting into
independent training and test sets).

4. Deployment Phase:

Real-time Data:
If the deployed model is trained on outdated data and the real-time data differs
significantly, it can lead to inaccurate predictions.

fig .1 . DATA LEKAGE DEDUCTION SYSTEM

The diagram illustrates a *Data Leakage Detection System* designed to monitor and prevent
unauthorized sharing or exposure of confidential documents. Here's a step-by-step explanation of
the components and flow:

Components and Flow:

1. Web Data Sources


2. Data Leakage Detuction Syste

9
CHAPTER 4
METHADOLOGY
This happens when data that would not be available at the time of prediction is included
in the training set. For example, using target variables that are influenced by future events that
would not be known at the time of prediction.

Train Test Contamination:


A different type of leak occurs when you aren't careful to distinguish training data from
validation data.

Recall that validation is meant to be a measure of how the model does on data that it
hasn't considered before. You can corrupt this process in subtle ways if the validation data
affects the preprocessing behavior. This is sometimes called train-test contamination.

For example, imagine you run preprocessing (like fitting an imputer for missing values)
before calling train_test_split(). The end result? Your model may get good validation scores,
giving you great confidence in it, but perform poorly when you deploy it to make decisions.

After all, you incorporated data from the validation or test data into how you make
predictions, so the may do well on that particular data even if it can't generalize to new data. This
problem becomes even more subtle (and more dangerous) when you do more complex feature
engineering.

If your validation is based on a simple train-test split, exclude the validation data from
any type of fitting, including the fitting of preprocessing steps. This is easier if you use scikit-
learn pipelines. When using cross-validation, it's even more critical that you do your
preprocessing inside the pipeline!

Non-Random Sampling:

If data is not properly sampled, it can lead to leakage. For example, in time-
series data, if training and test sets are not split chronologically, information from the
future can leak into the past.

Target Leakage: This occurs when features that are highly correlated with the target
variable but available in practice are included in the model. For example, using customer churn
prediction models that include information about whether a customer has already churned.

10
This occurs when features that are highly correlated with the target variable but available in
practice are included in the model. For example, using customer churn prediction models that
include information about whether a customer has already churned.

Information Leaks from External Sources:

Sometimes, external data sources or unintentional access to private information


provide insights that the model could inadvertently use.

Train-Test Contamination:

Issue:
When data from the test set inadvertently influences the training process. For
example, if preprocessing steps like scaling or imputation are done without separating
train and test data, the model may indirectly learn information from the test set.
Solution:
Always split your data into train and test sets before any preprocessing steps.
Ensure that any transformations or adjustments are learned from the training data and
applied consistently to the test data.
Using Future Information:
Issue:
Including data in the training process that would not be available at the time of
prediction. For instance, using features that are calculated using information that would
not be known at the time of prediction can lead to unrealistic performance estimates.
Solution:
Be mindful of the temporal order of your data. Ensure that the training data only
includes information that would realistically be available at the time of making
predictions.

Data Containing Target Information:


Issue:
When predictors include information about the target variable that would
not be available during prediction time. This could artificially inflate model
performance during training.

Solution:
Remove any columns that directly leak information about the target variable
from your predictors before training your model.

11
Data Preprocessing Steps:

Issue:
Performing data preprocessing steps such as scaling or feature selection
across the entire dataset (train + test) without separating them.

Solution:
Always fit preprocessing steps (e.g., scaling, imputation) on the training data
and then transform both the training and test sets using the fitted preprocessing
parameters.
Cross-Validation Leakage:
Issue:
Incorrectly applying cross-validation in a way that leaks information across
folds. For example, performing feature selection or parameter tuning separately on
each fold without considering the entire training set.
Solution:
Ensure that all cross-validation steps, including preprocessing, feature
selection, and hyperparameter tuning, are applied within each fold of the cross-
validation loop. Use techniques like nested cross-validation for robust evaluation.

12
CHAPTER 5
IMPLEMENTATION
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, confusion_matrix

import matplotlib.pyplot as plt

import seaborn as sns

Load dataset
data = pd.read_csv('AER_credit_card_data.csv')

Inspect the first few rows of the dataset


print(data.head())

target variable and fea Define tures

target = 'card' # Assuming 'default' is the column indicating if a customer defaulted

features = data.columns.drop(target)

One-hot encode categorical features

data_encoded = pd.get_dummies(data,
columns=data[features].select_dtypes(include=['object']).columns, drop_first=True)

Update features after encoding


features_encoded = data_encoded.columns.drop(target)

13
Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data_encoded[features_encoded],
data_encoded[target], test_size=0.3, random_state=42)

Initialize and train Random Forest model


rf = RandomForestClassifier(n_estimators=100, random_state=42)

rf.fit(X_train, y_train)

Make predictions on the test set


y_pred = rf.predict(X_test)

Evaluate model performance


accuracy = accuracy_score(y_test, y_pred)

cm = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {accuracy}')

print('Confusion Matrix:')

print(cm)

Plot feature importances


feature_importances = pd.Ser t.figure(figsize=(10, 6))
sns.barplot(x=feature_importances, y=feature_importances.index)

plt.title('Feature Importances')

plt.xlabel('Importance Score')

plt.ylabel('Features')
plt.show()

14
Detect potential data leakage

potential_leakage_features = feature_importances[feature_importances > 0.05].index.tolist()

print('Potential Data Leakage Features:', potential_leakage_features)

15
CHAPTER 6

RESULT AND DISCUSSION

Model Performance

The model's performance is evaluated using accuracy and a confusion matrix. These metrics
provide insight into the model's ability to correctly classify instances of credit card default.

Accuracy :

The accuracy of the Random Forest model on the test set is 0.9773. This means that
approximately 97.73% of the predictions made by the model are correct.

Accuracy: 0.9773
Confusion Matrix :
The confusion matrix provides a detailed breakdown of the model's performance in terms
of true positives, true negatives, false positives, and false negatives.

16
Confusion Matrix:
[[ 90 0]

[ 9 297]]

The confusion matrix can be interpreted as follows:

True Negatives (TN) :


90 instances where the model correctly predicted no default (first row, first column).

False Positives (FP) :


0 instances where the model incorrectly predicted a default (first row, second column).

False Negatives (FN) :


9 instances where the model incorrectly predicted no default (second row, first column).

True Positive :
297 instances where the model correctly predicted a default (second row, second column).

The confusion matrix indicates that the model is highly accurate in predicting both defaults
and non-defaults, with only a small number of false negatives and no false positives.

Potential Data Leakage Features


Data leakage occurs when the model has access to information during training that
would not be available at the time of prediction. This can lead to artificially high performance
metrics and poor generalization to new data.
In this project, the following features were identified as potential sources of data leakage:

- share

- expenditure

- reports

These features have high importance scores in the Random Forest model and might include
information that could lead to data leakage. For instance:

Share : This feature might include information about the proportion of a customer's
spending that was shared or utilized in a certain way, potentially revealing future spending
behavior.

17
Expenditure : This feature could directly reflect the customer's future spending patterns,
which would not be available at the time of making a prediction.

Reports : This feature might include information from financial or credit reports
generated after the prediction period, thus leaking future information into the model.

To address potential data leakage, it is crucial to carefully examine these features


and determine if they contain future information or information that would not be available at the
time of prediction.

If confirmed, these features should be excluded from the model training process
to ensure that the model's performance metrics are valid and that the model can generalize well
to new, unseen data.

18
CHAPTER 7
CONCLUSION
Data leakage is a critical issue in data science that can severely compromise the
integrity and reliability of machine learning models. It occurs when information from outside the
training dataset is used improperly to create or evaluate the model.

Leading to unrealistically optimistic performance metrics during development but


poor performance in real-world applications.

Detecting and preventing data leakage requires meticulous preprocessing and


validation steps to ensure that the model only learns from the data it would realistically
encounter during deployment.
Implementing robust validation strategies, such as cross-validation and proper
separation of training and validation sets, is essential to mitigate this risk.

Furthermore, maintaining vigilance throughout the entire data science lifecycle—from


data collection to model deployment—is crucial in identifying and addressing potential sources
of leakage

By prioritizing data integrity and rigorous validation practices, data scientists can build
more trustworthy models that generalize well to unseen data and yield reliable insights for
decision-making purposes.

The Random Forest model achieved high accuracy in predicting credit card defaults,
with an accuracy score of 0.9773 and a well-performing confusion matrix.

However, potential data leakage features were identified, which need to be addressed to
ensure the model's robustness and reliability in real-world applications. Future work should focus
on validating and possibly removing these features to prevent data leakage and maintain the
integrity of the model's predictions.

19
CHAPTER 8
REFRENCE

[1] “Data Leakage Detection”Panagiotis Papadimitriou, Student Member, IEEE, and Hector
Garcia-Molina, Member.

[2] R. Agrawal and J. Kiernan, “Watermarking Relational Databases,” Proc. 28th Int’l Conf.
Very Large Data Bases (VLDB ’02), VLDB Endowment, pp. 155-166, 2002.

[3] P. Bonatti, S.D.C. di Vimercati, and P. Samarati, “An Algebra for Composing Access
Control Policies,” ACM Trans. Information and System Security, vol. 5, no. 1, pp. 1-35, 2002.

[4] P. Buneman, S. Khanna, and W.C. Tan, “Why and Where: A Characterization of Data
Provenance,” Proc. Eighth Int’l Conf. Database Theory (ICDT ’01), J.V. den Bussche and V.
Vianu, eds., pp. 316-330, Jan. 2001

[5] P. Buneman and W.-C. Tan, “Provenance in Databases,” Proc. ACM SIGMOD, pp. 1171-
1173, 2007.

[6] Y. Cui and J. Widom, “Lineage Tracing for General Data Warehouse Transformations,” The
VLDB J., vol. 12, pp. 41-58, 2003.

[7] F. Hartung and B. Girod, “Watermarking of Uncompressed and Compressed Video,” Signal
Processing, vol. 66, no. 3, pp. 283-301, 1998.

[8] S. Jajodia, P. Samarati, M.L. Sapino, and V.S. Subrahmanian,“Flexible Support for Multiple
Access Control Policies,” ACM Trans. Database Systems, vol. 26, no. 2, pp. 214-260, 2001.

[9] Y. Li, V. Swarup, and S. Jajodia, “Fingerprinting RelationalDatabases: Schemes and


Specialties,” IEEE Trans. Dependable and Secure Computing, vol. 2, no. 1, pp. 34-45, Jan.-Mar.
2005.

[10] B. Mungamuru and H. Garcia-Molina, “Privacy, Preservation and Performance: The 3 P’s
of Distributed Data Management.

20

You might also like