Detecting Data Leakage: Panagiotis Papadimitriou Hector Garcia-Molina

This document discusses detecting data leakage by modeling it as a maximum likelihood problem and distributing data to minimize overlaps between what is provided to different agents. It describes the problem entities, such as a dataset, agents requesting data, and leaked profiles. It then covers guilt models for determining the probability an agent is guilty of leaking data. Finally, it proposes distribution strategies for sample and explicit data requests that aim to provide disjoint data sets to different agents to better identify guilty parties if a leak occurs.

Detecting Data Leakage

Panagiotis Papadimitriou

Hector Garcia-Molina

Leakage Problem
[Figure: profiles such as "Name: Sarah, Sex: Female" and "Name: Mark, Sex: Male"; people Kathryn, Jeremy, Sarah and Mark; applications U1 and U2; other sources, e.g., Sarah's network.]

Stanford Infolab

Outline
Problem Description
Guilt Models: e.g., Pr{U1 leaked data} = 0.7, Pr{U2 leaked data} = 0.2
Distribution Strategies

Problem Entities
Entity                               Dataset
Distributor: Facebook                T: set of all Facebook profiles
Agents: Facebook apps U1, ..., Un    R1, ..., Rn (Ri: set of profiles of the people who have added application Ui)
Leaker                               S: set of leaked profiles

Agents' Data Requests

Sample
100 profiles of Stanford people

Explicit
All people who added the application (the example we used so far)
All Stanford profiles

Guilt Models (1/3)

p: posterior probability that a leaked profile comes from other sources (e.g., Sarah's network)

Guilty agent: an agent who leaks at least one profile
Pr{Gi|S}: probability that agent Ui is guilty, given the leaked set of profiles S

Guilt Models (2/3)

Agents leak each of their data items independently
Agents leak all their data items OR nothing

[Figure: for two leaked profiles, the alternative origin scenarios are labeled with probabilities p^2, p(1-p), (1-p)p and (1-p)^2.]
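As a rough sketch of how Pr{Gi|S} could be computed when agents leak items independently: assume each leaked profile came from other sources with probability p, and otherwise from one of the agents holding it, each equally likely. These assumptions are not stated on the slides, and the names `guilt_probability` and `allocations`, as well as the example data, are hypothetical.

```python
def guilt_probability(agent, leaked, allocations, p):
    """Estimate Pr{G_agent | S} under the assumptions stated above."""
    prob_innocent = 1.0
    for profile in leaked & allocations[agent]:
        # Number of agents that were given this profile.
        holders = sum(1 for r in allocations.values() if profile in r)
        # (1 - p) / holders: assumed chance that this agent leaked the profile.
        prob_innocent *= 1.0 - (1.0 - p) / holders
    return 1.0 - prob_innocent

# Hypothetical allocation: U1 holds both leaked profiles, U2 only one.
allocations = {
    "U1": {"Sarah", "Mark"},
    "U2": {"Mark", "Jeremy"},
}
leaked = {"Sarah", "Mark"}
g1 = guilt_probability("U1", leaked, allocations, p=0.2)
g2 = guilt_probability("U2", leaked, allocations, p=0.2)
print(g1 > g2)
```

Under these assumptions the agent holding more of the leaked profiles, and sharing fewer of them with others, receives the higher guilt estimate.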

Guilt Models (3/3)

[Figure: two plots with axes Pr{G1} and Pr{G2}, one for agents leaking "Independently" and one for "NOT Independently".]

The Distributor's Objective (1/2)

[Figure: agents U1, U2, U3 and U4 hold sets R1, R2, R3 and R4; the leaked set S is compared against these sets.]

Pr{G1|S} >> Pr{G2|S}
Pr{G1|S} >> Pr{G4|S}

The Distributor's Objective (2/2)

To achieve his objective, the distributor has to distribute sets R1, ..., Rn that minimize

$$\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|, \quad i, j = 1, \ldots, n$$

Intuition: Minimized data sharing among agents makes leaked data reveal the guilty agents
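The quantity the distributor minimizes can be evaluated directly for a candidate distribution. The sketch below assumes every Ri is non-empty; the function name `overlap_objective` and the example sets are hypothetical.

```python
def overlap_objective(allocations):
    """Sum over agents i of (1/|R_i|) * sum over j != i of |R_i & R_j|."""
    total = 0.0
    for i, r_i in enumerate(allocations):
        # Total number of items agent i shares with every other agent.
        shared = sum(len(r_i & r_j)
                     for j, r_j in enumerate(allocations) if j != i)
        total += shared / len(r_i)
    return total

disjoint = [{"Kathryn", "Jeremy"}, {"Sarah", "Mark"}]
overlapping = [{"Kathryn", "Jeremy"}, {"Jeremy", "Sarah"}]
print(overlap_objective(disjoint), overlap_objective(overlapping))
```

Disjoint sets give the smallest possible value, zero, which matches the intuition that less sharing makes a leak easier to attribute.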

Distribution Strategies: Sample (1/4)

Set T has four profiles: Kathryn, Jeremy, Sarah and Mark
There are 4 agents: U1, U2, U3 and U4
Each agent requests a sample of any 2 profiles of T for a market survey

Distribution Strategies: Sample (2/4)

Poor: minimize $\sum_{i \neq j} |R_i \cap R_j|$

[Figure: allocation of the profiles to agents U1, U2, U3 and U4.]

Distribution Strategies: Sample (3/4)

Optimal distribution: avoid full overlaps and minimize $\sum_{i} \frac{1}{|R_i|} \sum_{j \neq i} |R_i \cap R_j|$

[Figure: allocation of the profiles to agents U1, U2, U3 and U4.]
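To illustrate why avoiding full overlaps matters in addition to the summed overlap, the sketch below compares two hypothetical allocations for the four-profile, four-agent example. The allocations are assumptions made for illustration and are not taken from the slide figures; `sum_objective` and `max_relative_overlap` are likewise illustrative names.

```python
from itertools import permutations

def sum_objective(sets):
    """Sum over ordered pairs (i, j), i != j, of |R_i & R_j| / |R_i|."""
    return sum(len(a & b) / len(a) for a, b in permutations(sets, 2))

def max_relative_overlap(sets):
    """Largest fraction of one agent's set that is shared with another agent."""
    return max(len(a & b) / len(a) for a, b in permutations(sets, 2))

# Hypothetical allocation with full overlaps: agents pair up on identical sets.
full_overlaps = [{"Kathryn", "Jeremy"}, {"Kathryn", "Jeremy"},
                 {"Sarah", "Mark"}, {"Sarah", "Mark"}]
# Hypothetical cyclic allocation: neighbors share exactly one profile.
cyclic = [{"Kathryn", "Jeremy"}, {"Jeremy", "Sarah"},
          {"Sarah", "Mark"}, {"Mark", "Kathryn"}]

for name, sets in [("full overlaps", full_overlaps), ("cyclic", cyclic)]:
    print(name, sum_objective(sets), max_relative_overlap(sets))
```

Both allocations have the same summed overlap, but only the cyclic one gives every agent a distinct set, so a leaked pair of profiles can point to a single agent.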

Distribution Strategies: Sample (4/4)

Distribution Strategies

Sample Data Requests
The distributor has the freedom to select the data items to provide to the agents
General idea: provide agents with sets of data that are as disjoint as possible
Problem: there are cases where the distributed data must overlap, e.g., |R1| + ... + |Rn| > |T|

Explicit Data Requests
The distributor must provide agents with the data they request
General idea: add fake data to the distributed data to minimize the overlap of the distributed data
Problem: agents can collude and identify fake data (NOT COVERED in this talk)
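The condition under which sample requests must overlap is a counting argument: if the requested sizes add up to more than |T|, some item has to be given to at least two agents. A minimal check, with the hypothetical name `overlap_is_unavoidable`:

```python
def overlap_is_unavoidable(request_sizes, dataset_size):
    """True if |R1| + ... + |Rn| > |T|, so disjoint sets cannot exist."""
    return sum(request_sizes) > dataset_size

# Four agents each requesting 2 of 4 profiles: 8 > 4, overlap is forced.
print(overlap_is_unavoidable([2, 2, 2, 2], 4))
# Two agents each requesting 2 of 4 profiles: disjoint sets are possible.
print(overlap_is_unavoidable([2, 2], 4))
```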

Conclusions
Data leakage modeled as a maximum likelihood problem
Data distribution strategies that help identify the guilty agents

Thank You!
