Chapter 1-Introduction to Data Privacy
Data Privacy
N. Venkataramanan and A. Shriram, Data Privacy -
Principles and Practice, CRC Press, 2017.
1
Contents
• 1. Introduction
• 2. What is Data Privacy and Why Is It Important?
• 3. Use cases: Need for Sharing Data
• 4. Methods for Protecting Data
• 5. Importance of Balancing Data Privacy and Utility
• 6. Introduction to Anonymization Design Principles
• 7. Nature of Data in Enterprise
2
1. Introduction
• Organizations dealing with banking, insurance, retail, healthcare, and
manufacturing across the globe collect large amounts of data about their
customers. This is a valuable asset to the organizations as these data can be
mined to extract a lot of insights about their customers.
– For example, mining these data can throw light on customers’ spending/buying,
credit card usage, and investment patterns and health issues, to name a few.
3
1. Introduction
• Business applications contain sensitive information, such as personal
or financial and health-related data.
• Sharing such data can potentially violate individual privacy and lead
to financial loss to the company.
• Serious concerns have been expressed by the general public about
exposing person-specific information. The issue of data leakage,
either intentional or accidental exposure of sensitive information, is
becoming a major security issue.
• An IDC survey claims that data leakage is the number one threat,
ranked higher than viruses, Trojan horses, and worms.
4
1. Introduction
• To address the privacy of an individual’s data, governments across the globe have mandated regulations that
companies have to adhere to:
– HIPAA (Health Insurance Portability and Accountability Act) in the United States
– FIPPA (Freedom of Information and Protection of Privacy Act) in Canada
– Sarbanes–Oxley Act
– Video Privacy Protection Act
– EU’s Data Protection Directive
– Swiss Data Protection Act
– ...
• Companies need to look into methods and tools to anonymize sensitive data.
• Data anonymization techniques have been the subject of intense investigation in recent years for many kinds of
structured data, including tabular data, transactional data, and graph data.
5
2. What is Data Privacy and Why Is It Important?
• There are numerous incidents where customers’ confidential personal
information has been attacked by or lost to a data snooper.
• When such untoward incidents occur, organizations face legal suits,
financial loss, loss of image, and, importantly, the loss of their
customers.
• There are many stakeholders of data privacy in an organization.
6
2. What is Data Privacy and Why Is It Important?
7
2. What is Data Privacy and Why Is It Important?
• Companies spend millions of dollars to protect the privacy of
customer data.
– Why is it so important?
– What constitutes personal information?
• Personal information consists of name, identifiers like social security number,
geographic and demographic information, and general sensitive information, for example,
financial status, health issues, shopping patterns, and location data.
• Loss of this information means loss of privacy—one’s right to freedom from intrusion by
others.
8
2. What is Data Privacy and Why Is It Important?
Protecting Sensitive Data
• “I know where you were yesterday!” Google knows your location when you
use Google Maps.
• Google maps can track you wherever you go when you use it on a smart
phone.
• Mobile companies know your exact location when you use a mobile phone.
• You have no place to hide. You have lost your privacy. This is the flip side of
using devices like smart phones, global positioning systems (GPS), and radio
frequency identification (RFID).
• Sensitive data: Location, health issues, financial status, ...
9
2. What is Data Privacy and Why Is It Important?
Protecting Sensitive Data
10
2. What is Data Privacy and Why Is It Important?
Protecting Sensitive Data
11
2. What is Data Privacy and Why Is It Important?
Protecting Sensitive Data
12
2. What is Data Privacy and Why Is It Important?
Protecting Sensitive Data
• Data D in the tables contains four disjoint data sets:
– 1. Explicit identifiers (EI): Attributes that identify a customer (also called record
owner) directly. These include attributes like social security number (SSN),
insurance ID, and name.
– 2. Quasi-identifiers (QI): Attributes that include geographic and demographic
information, phone numbers, and e-mail IDs. Quasi-identifiers are also defined as
those attributes that are publicly available, for example, a voters database.
– 3. Sensitive data (SD): Attributes that contain confidential information about the
record owner, such as health issues, financial status, and salary, which cannot be
compromised at any cost.
– 4. Nonsensitive data (NSD): Data that are not sensitive for the given context.
13
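The four-way partition above can be sketched in code. This is a minimal illustration: the attribute names and their assignment to EI/QI/SD/NSD are invented assumptions, and in practice the classification depends on the application domain.

```python
# Hypothetical record: attribute names and their classification into the
# four disjoint sets are assumptions for this sketch, not a standard.
RECORD = {
    "ssn": "123-45-6789",            # EI: directly identifies the record owner
    "name": "John Doe",              # EI
    "zip_code": "56001",             # QI: linkable to public sources
    "date_of_birth": "1980-03-14",   # QI
    "gender": "Male",                # QI
    "income": 95000,                 # SD: confidential
    "diagnosis": "diabetes",         # SD
    "preferred_language": "en",      # NSD: not sensitive in this context
}

CLASSIFICATION = {
    "EI": {"ssn", "name"},
    "QI": {"zip_code", "date_of_birth", "gender"},
    "SD": {"income", "diagnosis"},
    "NSD": {"preferred_language"},
}

def partition(record, classification):
    """Split one record into the four disjoint attribute sets."""
    return {
        cls: {k: v for k, v in record.items() if k in fields}
        for cls, fields in classification.items()
    }

parts = partition(RECORD, CLASSIFICATION)
print(parts["EI"])  # the attributes that must be completely masked
```

The disjointness of the four sets is what later allows each to be handled by a different rule (mask, de-identify, or pass through).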
2. What is Data Privacy and Why Is It Important?
Protecting Sensitive Data
• The first two data sets, the EI and QI, uniquely identify a record owner and
when combined with sensitive data become sensitive or confidential.
• The data set D is considered a matrix of m rows and n columns, where
each row and each column is a vector
D = [DEI][DQI][DSD]
• Each of the data sets, EI, QI, and SD, are matrices with m rows and i, j, and k
columns, respectively.
– We need to keep an eye on the index j (representing QI), which plays a major role in
keeping the data confidential.
14
2. What is Data Privacy and Why Is It Important?
Protecting Sensitive Data
• Data protection techniques (such as cryptography, anonymization)
are used prior to sharing data.
• Anonymization is a process of logically separating the identifying
information (PII) from sensitive data.
• The anonymization approach ensures that EI and QI are logically
separated from SD.
– As a result, an adversary will not be able to easily identify the record owner
from his sensitive data.
15
2. What is Data Privacy and Why Is It Important?
Privacy and Anonymization
• Under the condition of privacy, we have knowledge of a person’s
identity, but not of an associated personal fact.
• Under the condition of anonymity, we have knowledge of a personal
fact, but not of the associated person’s identity.
16
2. What is Data Privacy and Why Is It Important?
Privacy and Anonymization
• Table 1.4 illustrates an anonymized
table where PII is protected and
sensitive data are left in their original
form. Sensitive data should be in
original form so that the data can be
used to mine useful knowledge.
17
2. What is Data Privacy and Why Is It Important?
Privacy and Anonymization
• Anonymization is a two-step process:
– Data masking and de-identification.
• Data masking is a technique applied to systematically substitute, suppress, or
scramble data that call out an individual, such as names, IDs, account numbers,
SSNs, etc.
– Masking techniques are simple techniques that perturb original data.
• De-identification is applied on QI fields.
– QI fields such as date of birth, gender, and zip code have the capacity to uniquely
identify individuals.
– By de-identifying, the values of QI are modified carefully so that the relationship is
still maintained but identities cannot be inferred.
18
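The de-identification step above can be sketched with two common generalization tactics: a zip-code hierarchy that blanks trailing digits, and age banding. The rules and field choices are illustrative assumptions, not HIPAA's Safe Harbor list.

```python
# Minimal generalization sketch for QI fields (illustrative rules only).
def generalize_zip(zip_code, level):
    """Blank out the last `level` digits (level 0 returns the original)."""
    if level == 0:
        return zip_code
    return zip_code[:-level] + "*" * level

def age_band(age, width=10):
    """Map an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

print([generalize_zip("56001", k) for k in range(3)])  # ['56001', '5600*', '560**']
print(age_band(37))                                     # '30-39'
```

Each extra level of generalization trades utility for privacy: more records share the same QI values, so linking becomes harder.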
2. What is Data Privacy and Why Is It Important?
Privacy and Anonymization
• The original data set is D which is anonymized, resulting in data set D’.
D’ = T(D) or
D’ = T([DEI][DQI][DSD]), where T is the transformation function.
• Anonymization process
– Data masking: EI is completely masked and no longer relevant in D’.
– No transformation is applied on SD and it is left in its original form.
– As a result, transformation is applied only on QI
D’ = T([DQI])
• D′ can be shared: QI is transformed and SD is in its original form, but it is very difficult to
identify the record owner.
• Re-identification becomes highly challenging.
19
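The full transformation D′ = T(D) described above can be sketched end to end: EI is fully masked out, QI fields are de-identified, and SD passes through unchanged. Field names, the masking token, and the generalization rules are all assumptions for illustration.

```python
# Sketch of D' = T(D): mask EI, de-identify QI, leave SD untouched.
def transform(record, ei_fields, qi_fields):
    out = {}
    for field, value in record.items():
        if field in ei_fields:
            out[field] = "XXXX"                     # data masking: EI suppressed
        elif field in qi_fields:
            out[field] = de_identify(field, value)  # QI carefully modified
        else:
            out[field] = value                      # SD left in original form
    return out

def de_identify(field, value):
    if field == "zip_code":
        return value[:3] + "**"
    if field == "date_of_birth":
        return value[:4]                            # keep only the birth year
    return value

row = {"name": "Alice", "zip_code": "56001",
       "date_of_birth": "1975-07-02", "income": 120_000}
print(transform(row, {"name"}, {"zip_code", "date_of_birth"}))
# {'name': 'XXXX', 'zip_code': '560**', 'date_of_birth': '1975', 'income': 120000}
```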
3. Use Cases: Need for Sharing Data
• Organizations tend to share customer data as there is much insight to
be gained from customer-sensitive data.
– For example, a healthcare provider’s database could contain how patients
have reacted to a particular drug or treatment. This information would be
useful to a pharmaceutical company.
– However, these sensitive data cannot be shared or released due to legal,
financial, compliance, and moral issues.
– But for the benefit of the organization and the customer, there is a need to
share these data responsibly, which means the data are shared without
revealing the PII of the customer.
20
3. Use Cases: Need for Sharing Data
• These use cases can be
classified under two
categories:
– 1. Privacy protection of
sensitive data at rest
• Data mining and analysis
• Application testing
– 2. Privacy protection of
sensitive data in motion (at
run-time)
• Business operation
• Application support
• Auditing and reporting for
regulatory compliance
21
3. Use Cases: Need for Sharing Data
• PII is found in all layers of the architecture.
• Whether data are shared with internal departments or external
vendors, it is critical to protect the privacy of customer data.
• It is important to have an enterprise-wide privacy preservation design
strategy to address the heterogeneity in data sources, data structures,
and usage scenarios.
22
3. Use Cases: Need for Sharing Data
Data Mining and Analysis
• Banks would want to understand their customers’ credit card usage
patterns.
• A retail company would like to study customers’ buying habits.
• A healthcare company would like to understand the effectiveness of a
drug or treatment provided to their patients.
• All these patterns are hidden in the massive amounts of data they hold.
The intention of the company is to gather comprehensive knowledge
from these hidden patterns.
23
3. Use Cases: Need for Sharing Data
Data Mining and Analysis
• Data mining is the technique that is used to gather knowledge and
predict outcomes from large quantities of data.
• The goal of data mining is to extract knowledge, discover patterns,
predict, learn, and so on.
• The key functional blocks in most data mining applications are
classification, clustering, and association pattern mining.
24
3. Use Cases: Need for Sharing Data
Software Application Testing
• A number of companies across the globe outsource their application testing.
– Outsourcing of testing is growing at about 20% every year.
• Application testing comprises functional requirements and
nonfunctional requirements testing.
• Successful application testing requires high-quality test data.
• High-quality test data are present in production systems and this is
sensitive customer data, which must be anonymized before sharing
with testers.
25
3. Use Cases: Need for Sharing Data
Software Application Testing
• A high-level process of test data manufacturing
26
3. Use Cases: Need for Sharing Data
Business Operations
• Many large companies across the globe outsource their business
operations to business process outsourcing (BPO) companies in
countries like India, China, and Brazil.
– For example, a financial services company outsources its business operations
to a BPO company in India. Then that BPO company will assist customers of
the financial services company in their business operations such as securities
trading, managing their financial transactions, and so on.
– But access to a customer’s trade account during these processes would expose
a lot of sensitive information and this is not acceptable to the customer, and
the regulation in the country will not permit such access.
27
3. Use Cases: Need for Sharing Data
Business Operations
• Therefore, these data have to be protected.
• But the question is how and what data are to be protected?
– A technique known as tokenization is used here wherein the sensitive data
that should not be seen by the BPO employee are replaced with a token.
– This token has no relationship with the original data, and outside the context
of the application the token has no meaning at all.
– All these are executed during run-time.
28
4. Methods of Protecting Data
• One of the most daunting tasks in information security is protecting
sensitive data in enterprise applications, which are often complex
and distributed.
• Some of the methods available are
– Cryptography
– Anonymization
– Tokenization
29
4. Methods of Protecting Data
Cryptography
• Cryptographic techniques are probably one of the oldest known
techniques for data protection. When done right, they are probably
one of the safest techniques to protect data in motion and at rest.
• Encrypted data have high protection, but are not readable, so how can
we use such data?
• For the use cases discussed, cryptographic techniques are not used
widely.
30
4. Methods of Protecting Data
Anonymization
• Anonymization is a set of techniques used to modify the original data in such a manner that it
does not resemble the original value but maintains the semantics and syntax.
• Regulatory compliance and ethical issues drive the need for anonymization.
• The intent is that anonymized data can be shared freely with other parties, who can perform their
own analysis on the data.
• Anonymization is an optimization problem, in that when the original data are modified, they
lose some of their utility.
– But modification of the data is required to protect it.
31
4. Methods of Protecting Data
Tokenization
• Tokenization is a technique that replaces the original sensitive data with non-sensitive
placeholders referred to as tokens.
• The fundamental difference between tokenization and the other techniques is that in
tokenization, the original data are completely replaced by a surrogate that has no
connection to the original data.
– Tokens have the same format as the original data.
– As tokens are not derived from the original data, they exhibit very powerful data protection
features.
• Another interesting property of tokens is that, although the token is usable within its native
application environment, it is completely useless elsewhere.
– Therefore, tokenization is ideal to protect sensitive identifying information.
32
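A token vault is one common way to realize the tokenization idea above: the token is a random surrogate with the same format as the original value and no mathematical relationship to it, and the mapping lives only inside the vault. This is a minimal sketch assuming digit-string inputs such as card numbers; real tokenization products differ.

```python
import secrets

# Minimal token-vault sketch (assumes digit-string values, e.g. card numbers).
class TokenVault:
    def __init__(self):
        self._to_token = {}
        self._to_value = {}

    def tokenize(self, value):
        if value in self._to_token:            # same value -> same token
            return self._to_token[value]
        # format-preserving surrogate: same number of digits, randomly chosen,
        # with no derivation from the original value
        token = "".join(secrets.choice("0123456789") for _ in value)
        while token in self._to_value:         # avoid collisions in the vault
            token = "".join(secrets.choice("0123456789") for _ in value)
        self._to_token[value] = token
        self._to_value[token] = value
        return token

    def detokenize(self, token):
        """Only the application holding the vault can reverse a token."""
        return self._to_value[token]

vault = TokenVault()
token = vault.tokenize("4111111111111111")
print(token, vault.detokenize(token))
```

Outside the application that holds the vault, the token carries no information at all, which is why tokenization suits run-time use cases such as BPO operations.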
5. Importance of Balancing Data Privacy and Utility
• Privacy preservation should also ensure utility of data.
– In other words, the provisioned data should protect the individual’s privacy and at the same time
ensure that the anonymized data are useful for knowledge discovery.
33
5. Importance of Balancing Data Privacy and Utility
• As a transformation function is applied on QI, it is obvious that the correlation
between QI fields and SD fields is affected or weakened, and this indicates
how useful the transformed data are for the given purpose.
• An example from the healthcare domain to illustrate this important
relationship between privacy and utility
– HIPAA states that if any of the data elements are associated with health information,
it makes that information personally identifiable. HIPAA defines 18 attributes as PII
that include name, SSN, geographic information, demographic information,
telephone number, admission date, etc.
– Therefore, in any privacy preserving data analysis of health data, it should be
ensured that any of these 18 attributes, if present, should be completely anonymized.
34
5. Importance of Balancing Data Privacy and Utility
35
5. Importance of Balancing Data Privacy and Utility
• If so much information is stripped off, then how can the remaining
data be useful for the analysis?
– Let us take an example of a patient getting admitted to a hospital. According
to the HIPAA privacy rules, the admission date is part of the patient’s PII and
therefore should be anonymized.
– The healthcare provider can share the patient’s medical data to external
partners for the analysis, but it will be impossible to analyze the efficacy of
the treatment as the date of admission is anonymized as per HIPAA privacy
laws.
36
5. Importance of Balancing Data Privacy and Utility
• HIPAA’s intention is to protect patient privacy, but it impacts medical research in the
process.
• Therefore, it is extremely important to ensure the utility of the data while preserving
privacy. In other words, there needs to be a balance between privacy and utility of
anonymized data.
- Cryptographic mechanism provides low
utility (0) and high privacy (1).
- The privacy or utility in a cryptographic
mechanism is either black (0) or white (1),
whereas in anonymization methods, it is
“shades of gray,” meaning that it is possible
to control the levels of privacy or utility.
37
5. Importance of Balancing Data Privacy and Utility
• Anonymization can be viewed as constrained optimization: produce a data set
with the smallest distortion that also satisfies the given set of privacy
requirements.
• But how do you balance the two contrasting features, privacy and utility?
• Anonymized data are utilized in many areas of an organization like data
mining, analysis, or creating test data.
• An important point to remember here is each type of requirement or analysis
warrants a different anonymization design.
– This means that there is no single privacy versus utility measure.
38
5. Importance of Balancing Data Privacy and Utility
• Let us consider the original data given in Table 1.6
– 1. Original data table with no privacy but high utility
– 2. High correlation between QI and SD (attributes fields)
• Although many rows have not been shown here, let us assume that the ZIP CODE and
INCOME are correlated, in that the ZIP CODE 56001 primarily consists of high-income
individuals.
39
5. Importance of Balancing Data Privacy and Utility
• Table 1.7 is a modified version of Table 1.6. We can see that the
names have been changed, the original ZIP CODES have been
replaced with different values and INCOME values are unchanged.
40
5. Importance of Balancing Data Privacy and Utility
41
5. Importance of Balancing Data Privacy and Utility
• Another design can have just “XXXX” for all names, 56001 for all zip codes, and
“Male” for all gender values. We can agree that this anonymization design scores
well in terms of privacy, but utility is pathetic.
– Privacy gain: Names are completely suppressed, financial standing cannot be inferred, and
geographical location is not compromised.
– Utility loss: Presence of females in the population, meaningless names lose demographic
clues, flat value of zip code annuls the correlation.
42
5. Importance of Balancing Data Privacy and Utility
• Anonymization design drives the extent of privacy and utility, which
are always opposed to each other.
• The two designs above also show that privacy or utility need not be 0
and 1 as in encryption; rather, both are shades of gray as stated earlier.
• A good design can achieve a balance between them and achieve both
goals to a reasonable extent.
43
5. Importance of Balancing Data Privacy and Utility
• One way to quantify privacy is on the basis of how much
information an adversary can obtain about the SD of an individual
from different dimensions in the data set.
• It means that SD fields can be identified (or estimated/deduced) using
QI fields.
– This is a very simple way to quantify privacy.
– In fact, this model does not capture many important dimensions, such as
background knowledge of the adversary, adversary’s knowledge of some of
the sensitive data, the complexity of the data structure, etc.
44
5. Importance of Balancing Data Privacy and Utility
• The utility loss of a particular anonymization technique is measured
against the utility provided by the original data set. A measure of
utility is also the correlation between QI and SD preserved in the
anonymized data.
• There are many anonymization techniques in use today, which can be
broadly classified into perturbative and nonperturbative techniques.
– Each of these techniques provides its own privacy versus utility model. The
core goals of these anonymization techniques are
• (1) to prevent an adversary from identifying SD fields
• (2) to ensure minimal utility loss in the anonymized data set by ensuring high correlation
between the QI and SD fields.
45
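The second goal above, retaining QI-SD correlation, can be checked directly: compute the correlation in the original data and again after the QI is generalized. This is a pure-Python Pearson correlation over invented toy data, purely for illustration.

```python
# Compare QI-SD correlation before and after generalizing the QI.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

age    = [25, 32, 41, 48, 55, 63]      # QI (toy data)
income = [30, 45, 60, 72, 85, 90]      # SD, in thousands (toy data)
corr_original = pearson(age, income)

banded_age = [20, 30, 40, 40, 50, 60]  # age generalized to 10-year bands
corr_anonymized = pearson(banded_age, income)

# utility loss shows up as a drop in the preserved correlation
print(round(corr_original, 3), round(corr_anonymized, 3))
```

Here the banding weakens the correlation only slightly, suggesting the generalized data would still support the analysis; heavier generalization would push the correlation further down.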
5. Importance of Balancing Data Privacy and Utility
Measuring Privacy of Anonymized Data
• Given a data set D, a data anonymizer can create different anonymized data
sets D1′, D2′,..., Dn′ based on different anonymization algorithm combinations
for each attribute.
– Each of these anonymized data sets will have different privacy versus utility trade-
offs.
• Privacy is a relative measure.
– This means that the privacy of D1′ is measured against another anonymized data set
D2′.
– There are multiple ways to measure the difference in privacy. These approaches are
broadly classified into statistical and probabilistic methods.
– Some statistical approaches measure privacy in terms of the difference or variation in
perturbed variables. The larger the variance, the better the privacy of the perturbed
data. This technique is generally used for statistical databases.
46
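The variance-based statistical measure mentioned above can be sketched as follows: perturb a numeric column with additive Gaussian noise and take the variance of the perturbation as a crude privacy indicator, where larger variance suggests better privacy. The data and noise scales are invented for illustration.

```python
import random

# Additive-noise perturbation with variance as a (crude) privacy indicator.
def perturb(values, noise_scale, rng):
    return [v + rng.gauss(0, noise_scale) for v in values]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(42)                       # fixed seed for reproducibility
salaries = [50_000, 62_000, 71_000, 80_000, 95_000]
low_noise  = perturb(salaries, 1_000, rng)
high_noise = perturb(salaries, 10_000, rng)

# variance of the perturbation itself: larger => harder to recover originals
noise_var_low  = variance([p - o for p, o in zip(low_noise, salaries)])
noise_var_high = variance([p - o for p, o in zip(high_noise, salaries)])
print(noise_var_low < noise_var_high)
```

The same trade-off appears here as everywhere in this chapter: the larger the noise variance, the better the privacy, but the less faithful any statistics computed from the perturbed column.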
5. Importance of Balancing Data Privacy and Utility
Measuring Privacy of Anonymized Data
• Probabilistic methods measure privacy loss when an adversary
has knowledge of the distribution of the data in the original data
set and background information about some tuples in the data set.
• Bob is the adversary and has some background
information about Alice as she is his neighbor. Bob
knows that Alice smokes heavily but does not really
know what disease she is suffering from.
• However, he has knowledge about the distribution
of the sensitive fields in a table containing medical
records of a hospital that he has noticed Alice
visiting. Bob then uses the knowledge of the
distribution of SD fields and background
information about Alice to identify her illness,
which is cancer.
47
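The Bob-and-Alice attack above is essentially Bayesian inference: Bob combines the published distribution of the SD field with his background knowledge (Alice smokes) to sharpen his guess about her illness. All probabilities below are invented for the sketch.

```python
# Toy background-knowledge attack: Bayes' rule over the SD distribution.
prior = {"cancer": 0.2, "flu": 0.5, "fracture": 0.3}               # P(disease) in the table
likelihood_smoker = {"cancer": 0.9, "flu": 0.2, "fracture": 0.2}   # P(smoker | disease), assumed

# Bayes: P(disease | smoker) is proportional to P(smoker | disease) * P(disease)
unnorm = {d: prior[d] * likelihood_smoker[d] for d in prior}
z = sum(unnorm.values())
posterior = {d: p / z for d, p in unnorm.items()}

best_guess = max(posterior, key=posterior.get)
print(best_guess, round(posterior[best_guess], 2))  # cancer 0.53
```

Even though "cancer" is not the most common disease in the table, the background knowledge more than doubles Bob's confidence in it, which is exactly the privacy loss a probabilistic measure tries to capture.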
5. Importance of Balancing Data Privacy and Utility
Measuring Utility of Anonymized Data
• Assume that in the original data D, QI, and SD are highly correlated.
– An example could be the correlation between demographic and geographic
information, such as year of birth, country of birth, locality code, and income.
• Data set D contains the truth about the relationship between
demographic and geographic information and income. While
anonymizing D, the truth should be preserved for the data to be useful.
• When D is anonymized to D′ using a transformation function T, D′ =
T(D), the QI fields are distorted to some extent in D′.
– Now, how true is D′?
– Does the correlation between QI and SD fields in D′ still exist?
48
5. Importance of Balancing Data Privacy and Utility
Measuring Utility of Anonymized Data
• Each anonymization function will provide different levels of
distortion.
• If Q is the distribution of QI fields in D and Q′ is the distribution of
QI fields in D′, then the statistical distance measure of Q and Q′
provides an indication of the utility of D′. This provides a number of
approaches to measure utility.
49
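The distance between Q and Q′ described above can be computed in several ways; total variation distance is one simple choice, sketched here over invented zip-code data. Smaller distance suggests higher retained utility.

```python
from collections import Counter

# Distribution distance between a QI field in D and in D'.
def distribution(values):
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}

def total_variation(p, q):
    """0 when the distributions match, 1 when their supports are disjoint."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0) - q.get(v, 0)) for v in support)

zips_original   = ["56001", "56001", "56002", "56003", "56003", "56003"]
zips_anonymized = ["560**", "560**", "560**", "560**", "560**", "560**"]

Q  = distribution(zips_original)
Qp = distribution(zips_anonymized)
print(total_variation(Q, Qp))
```

Collapsing every zip code to a single value yields the maximal distance of 1.0: the anonymized distribution retains none of the original shape, i.e. full privacy at the cost of all distributional utility for this field.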
6. Introduction to Anonymization Design Principles
• Anonymization design is not
straightforward.
– Achieving a balance between privacy and
utility has many dependencies.
50
6. Introduction to Anonymization Design Principles
• Appendix A: Anonymization Design Principles for Multidimensional
Data
• Each principle is structured into two parts—rationale and
implications.
– Rationale: details what that principle is and how to use it for
anonymization design
– Implications: tells what happens if you do not follow the principle
51
6. Introduction to Anonymization Design Principles
• Anonymization Design Principles
– 1. Principle of classification—Classify the data set D into EI, QI, SD, and NSD with
clear boundaries between them.
– 2. Principle of concealment—Completely mask EI.
– 3. Principle of specialization—Understand the application domain to decide on the
anonymization design.
– 4. Principle of consistency—Ensure consistency in masking data across applications
in a domain.
– 5. Principle of utilization—Understand the application scenario to decide on the
anonymization design. For example, analytical utility of QI in data mining may not
be required in TDM.
– 6. Principle of threat modeling—Identify possible threats for a given environment,
setting, or data type.
52
6. Introduction to Anonymization Design Principles
– 7. Principle of correlation—Maintain correlation between attributes. For example, locality and zip code
or DOB and age.
– 8. Principle of contextual anonymization—Understand the context. (From whom are you trying to
protect the data? What is the environment?)
– 9. Principle of value-based anonymization—Understand the semantics of the data in the context of the
application so as to apply the correct or appropriate anonymization technique on the data.
– 10. Principle of data structure complexity—Anonymization design is dependent on the data structure.
– 11. Principle of correlated shuffling—Maintain correlation between related attributes while shuffling
data. For example, correlation between locality, city, and zip code.
– 12. Principle of randomization—Maintain statistical properties (like distribution) when randomly
perturbing the data.
– 13. Principle for protection against identity disclosure—Define a privacy model to prevent identity
disclosure via record linkage.
53
6. Introduction to Anonymization Design Principles
Principle of Classification (1)
• Principle of Classification: Classify the Data Set D into EI, QI, SD,
and NSD with Clear Boundaries between Them
• Rationale
– Given a table T with data set D, the first step in anonymization design is to
classify the data into EI (EIi, EIi+1,..., EIn), QI (QIj, QIj+1,..., QIm),
SD (SDk, SDk+1,..., SDp), and NSD.
– This classification is an essential first step as it will help in determining which
attributes must be masked and which attributes must be identifiable.
54
6. Introduction to Anonymization Design Principles
Principle of Classification (1)
55
6. Introduction to Anonymization Design Principles
Principle of Classification (1)
• Classification is extremely challenging when dealing with a data set
of high dimensionality. For example, a personal loan application of a
bank has over 200 fields; a mortgage loan application has many more
fields.
• In such a situation, the following questions arise:
– What constitutes EI?
– What constitutes QI and SD?
– How do you determine the boundary between QI and SD?
56
6. Introduction to Anonymization Design Principles
Principle of Classification (1)
• What constitutes EI?
– All identifiers that directly identify the record owner.
– Examples of EI are name of the record owner, social security number, driving
license number, passport number, insurance ID, and any other attribute that
can directly identify the record owner. It is a relatively easy task to pick out
the EI.
57
6. Introduction to Anonymization Design Principles
Principle of Classification (1)
• What constitutes QI?
– Attributes in the data set that can be traced to or linked to an external
publicly available data source are termed as quasi-identifiers.
– They are generally composed of demographic and geographic information of
the record owner.
– It is difficult to clearly quantify the amount of publicly available information,
especially in the current era of social media.
58
6. Introduction to Anonymization Design Principles
Principle of Classification (1)
• How do you determine the boundary between QI and SD?
– It is very difficult to define a clear boundary between QI and SD.
– The reasons could be the dimensions of QI and also the complexity of the
business domain.
– Let us assume that an HR employee is the adversary; then the background
knowledge of the adversary is more than what is present in the external
source.
• It is important to understand from whom we are trying to protect the data. This
helps sometimes to draw a boundary between QI and SD.
59
6. Introduction to Anonymization Design Principles
Principle of Classification (1)
– Another example from the mortgage domain illustrates the difficulty in
identifying QI and the boundary between QI and SD.
• The table contains the geographical information of the record owner’s current residence
and that of the property he has acquired.
• Both these addresses are available in a public data source. According to our earlier
definitions of QI, both these addresses should be included in the QI and be de-identified.
• By anonymizing the address of the acquired property, the analytic utility of the data set is
reduced, but not doing so will lead to identity disclosure.
60
6. Introduction to Anonymization Design Principles
Principle of Classification (1)
• Implications
– Incorrect identification of EI, QI, and SD attributes could lead to privacy loss
or utility loss.
– Are phone numbers and e-mail addresses EI or QI?
61
7. Nature of Data in the Enterprise
• Multidimensional Data
• Transaction Data
• Longitudinal Data
• Graph Data
• Time Series Data
62
7. Nature of Data in the Enterprise
Multidimensional Data
• Multidimensional data, also referred to as relational data, are the most
common format of data available today in many enterprises.
• In a relational table, each row is a vector that represents an entity. The
columns represent the attributes of the entity.
• A row of data in a relational table is classified into explicit identifiers,
quasi-identifiers, sensitive data, and nonsensitive data.
– As a rule, EI are completely masked out, QI are anonymized, and SD are left
in their original form.
63
7. Nature of Data in the Enterprise
Multidimensional Data
• The fundamental differences between anonymizing multidimensional data and
other data structures are as follows:
– In a multidimensional data table, each record or row is independent of others;
therefore, anonymizing a few of the records will not affect other records.
– Anonymizing a tuple in a record will not affect other tuples in the record.
• Other complex data structures, such as graph, longitudinal, or time series data,
cannot be viewed in this way.
• Privacy preservation for multidimensional data can be classified into
– (1) random perturbation methods and
– (2) group anonymization techniques, such as k-anonymity or l-diversity.
These techniques are used to prevent identity disclosure and attribute disclosure.
64
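The group-anonymization idea named above can be made concrete with a k-anonymity check: a table is k-anonymous when every combination of QI values is shared by at least k records. The rows and QI choice below are invented for illustration.

```python
from collections import Counter

# Check k-anonymity: every QI-value combination must occur at least k times.
def is_k_anonymous(rows, qi_fields, k):
    groups = Counter(tuple(r[f] for f in qi_fields) for r in rows)
    return all(count >= k for count in groups.values())

rows = [
    {"zip": "560**", "age": "30-39", "income": 45_000},
    {"zip": "560**", "age": "30-39", "income": 72_000},
    {"zip": "561**", "age": "40-49", "income": 60_000},
    {"zip": "561**", "age": "40-49", "income": 88_000},
]
print(is_k_anonymous(rows, ["zip", "age"], 2))  # True
```

l-diversity goes one step further, additionally requiring diverse SD values within each QI group so that attribute disclosure is also prevented.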
7. Nature of Data in the Enterprise
Multidimensional Data
Challenges in Privacy Preservation of Multidimensional Data
• The challenges in this kind of data preservation are as follows:
– 1. Difficulty in identifying the boundary between QI and SD in the presence
of background knowledge of the adversary
– 2. High dimensionality of data poses a big challenge to privacy preservation
– 3. Clusters in sensitive data set
– 4. Difficulty in achieving realistic balance between privacy and utility
65
7. Nature of Data in the Enterprise
Transaction Data
• Transaction data are a classic example of sparse high-dimensional data.
• A transaction database holds transactions of a customer at a supermarket.
• Privacy of transaction data is very critical as an adversary who has access to
this database can obtain the shopping preferences of customers and exploit
that information.
• But the problem with a transaction database is that it is of very high
dimensionality and sparsely filled.
– A supermarket will have thousands of products contributing to the high
dimensionality of the transaction database.
– Moreover, the transactional data contained in the database are binary—either 0 or 1.
An event of a transaction is represented by 1; otherwise, it would be a 0.
66
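The sparse binary representation described above can be sketched directly: each row is a customer, each column a product, and a cell is 1 only if that customer bought that product. The products and baskets below are invented for illustration.

```python
# Binary customer-by-product matrix for transaction data (toy example).
products = ["milk", "bread", "eggs", "soap", "wine"]
baskets = {
    "cust1": {"milk", "bread"},
    "cust2": {"wine"},
    "cust3": {"milk", "eggs", "soap"},
}

def to_binary_row(basket, products):
    """1 if the product appears in the basket, 0 otherwise."""
    return [1 if p in basket else 0 for p in products]

matrix = {c: to_binary_row(b, products) for c, b in baskets.items()}
print(matrix["cust1"])  # [1, 1, 0, 0, 0]
```

With thousands of product columns and only a handful of 1s per customer, almost every basket is unique, which is exactly why high dimensionality and sparsity make transaction data so hard to anonymize.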
7. Nature of Data in the Enterprise
Transaction Data
68
7. Nature of Data in the Enterprise
Longitudinal Data
• Longitudinal studies are carried out extensively in the healthcare
domain.
• An example would be the study of the effects of a treatment or
medicine on an individual over a period of time.
• The measurement of the effects is repeatedly taken over that period of
time on the same individual.
• The goal of longitudinal study is to characterize the response of the
individual to the treatment.
• Longitudinal studies also help in understanding the factors that
influence the changes in response.
69
7. Nature of Data in the Enterprise
Longitudinal Data
• The table contains a longitudinal set D, which has three disjoint sets of data: EI, QI, and SD.
A few important characteristics of the data set D that must be considered while designing an
anonymization approach are as follows:
– Data are clustered—composed of repeated measurements obtained from a single individual at different
points in time.
– The data within the cluster are correlated.
– The data within the cluster have a temporal order, which means the first measurement is
followed by the second and so on.
70
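The three characteristics above can be sketched with a hypothetical longitudinal set: each patient's cluster of repeated measurements comes from one individual and is kept in temporal order.

```python
from datetime import date

# Hypothetical longitudinal set: repeated blood-pressure readings per
# patient, each cluster stored as (date, reading) in temporal order.
longitudinal = {
    "patient_a": [(date(2024, 1, 1), 150), (date(2024, 2, 1), 142),
                  (date(2024, 3, 1), 135)],
    "patient_b": [(date(2024, 1, 15), 160), (date(2024, 2, 15), 158)],
}

for patient, cluster in longitudinal.items():
    # All measurements in a cluster belong to the same individual and
    # must remain in temporal order after anonymization.
    assert cluster == sorted(cluster, key=lambda m: m[0])
    print(patient, [reading for _, reading in cluster])
```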
7. Nature of Data in the Enterprise
Longitudinal Data
Challenges in Anonymizing Longitudinal Data
• Anonymization design for longitudinal data should consider two
aspects:
– 1. The characteristics of longitudinal data in the anonymized data set D′
should be maintained.
– 2. The design should prevent both identity and attribute disclosure.
71
7. Nature of Data in the Enterprise
Graph Data
• Graph data appear in many domains, such as social networks, electronics,
transportation, software, and telecom.
• A graph G = (V,E) consists of a set of vertices together with a set of
vertex pairs or edges.
• Graphs are interesting as they model almost any relationship. This is
especially relevant in modeling networks like financial networks and
also social networks like Facebook, LinkedIn, and Twitter.
72
7. Nature of Data in the Enterprise
Graph Data
• Social networks have many users and contain a lot of personal information, such as
network of friends and personal preferences.
• Social network data analytics is a rich source of information for many companies that
want to understand how their products are received by the customers.
– For example, a bank would like to get feedback from their customers about the various financial
products and services they offer.
– The bank can have its own page on, say, Facebook where its customers provide their views and
feedback.
– Publishing these data for mining and analysis will compromise the privacy of the customers.
– Therefore, it is required to anonymize the data before provisioning it for analytics.
– However, graph data are complex in nature.
73
7. Nature of Data in the Enterprise
Graph Data
• Figure 1.7 depicts a network with original data of users. The same has been
anonymized in Figure 1.8.
• Will this anonymization be enough to thwart an adversary’s attempt to re-
identify the users?
– The simple answer is no.
74
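A minimal sketch of why relabeling alone fails (the network and user names here are hypothetical): node degrees survive pseudonymization, so an adversary with background knowledge of a target's number of friends can still single the target out.

```python
from collections import defaultdict

# Hypothetical social network, anonymized by replacing user names
# with pseudonyms while keeping the edge structure intact.
edges = [("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
         ("bob", "carol"), ("dave", "erin")]

ids = {}

def pseudo(name):
    """Map a user name to a stable pseudonym v1, v2, ..."""
    if name not in ids:
        ids[name] = f"v{len(ids) + 1}"
    return ids[name]

anon_edges = [(pseudo(u), pseudo(v)) for u, v in edges]

# Degrees are unchanged by relabeling. If the adversary knows the
# target has exactly 3 friends, one node matches.
degree = defaultdict(int)
for u, v in anon_edges:
    degree[u] += 1
    degree[v] += 1

candidates = [n for n, d in degree.items() if d == 3]
print(candidates)  # ['v1'], i.e., alice is re-identified by her degree
```

This is why graph anonymization must also perturb structure (e.g., by adding or removing edges), not just relabel nodes.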
7. Nature of Data in the Enterprise
Graph Data
Challenges in Anonymizing Graph Data
• Privacy breaches in graph data can be classified into three categories:
– 1. Identity disclosure
• Identity disclosure occurs when it is possible to identify the users in the network.
– 2. Link disclosure
• Links between users are highly sensitive and can be used to identify relationships
between users.
– 3. Content/attribute disclosure
• Just as in a relational table, sensitive content is associated with each node (entity).
75
7. Nature of Data in the Enterprise
Time Series Data
• Time series data result from taking measurements from a process at regular intervals of time.
• Examples include temperature readings from a sensor, the daily price of a stock, the net asset
value of a fund taken on a daily basis, and weekly blood pressure measurements of a patient.
• We looked at longitudinal data where we considered the response of a patient to blood pressure
medication. The measurements have a temporal order. So, what is the difference between
longitudinal data and time series data?
– Longitudinal data are extensively used in the healthcare domain, especially in clinical trials. Longitudinal data
represent repeated measurements taken on a person. These measurements are responses to a treatment or drug,
and there is a strong correlation among these measurements.
– A univariate time series is a long sequence of measurements of a single variable taken at regular
intervals, say, blood pressure measurements of a patient taken over a period of time. These
measurements need not be a response to a drug or treatment.
76
7. Nature of Data in the Enterprise
Time Series Data
Challenges in Privacy Preservation of Time Series Data
• Some of the challenges in privacy preservation of the time series data
are as follows:
– High dimensionality
– Retaining the statistical properties of the original time series data like mean,
variance, and so on
– Supporting various types of queries like range query or pattern matching
query
– Preventing identity disclosure and linkage attacks
78
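The tension between perturbation and retaining statistical properties can be sketched as follows, using hypothetical values: additive noise is recentred so that the perturbed series keeps the original mean exactly, while the individual values (and hence query answers) change.

```python
import random

# A sketch of random perturbation for a time series. The series
# values are hypothetical.
random.seed(0)
series = [120.0, 118.5, 121.0, 119.0, 122.5]

noise = [random.gauss(0, 2.0) for _ in series]
shift = sum(noise) / len(noise)

# Subtracting the noise mean makes the perturbation sum to zero,
# so the perturbed series keeps the original mean.
perturbed = [x + n - shift for x, n in zip(series, noise)]

orig_mean = sum(series) / len(series)
pert_mean = sum(perturbed) / len(perturbed)
print(round(orig_mean, 3), round(pert_mean, 3))  # identical means
# Variance and range/pattern queries are still distorted: the
# privacy/utility trade-off the challenges above refer to.
```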