CSC 303 Data Protection Techniques Notes
There are many stakeholders in data privacy within an organization; these are shown in
Figure 1.1 below. Let us define these stakeholders.
Adversary/data snooper: Data are precious and their theft is very common. An
adversary can be internal or external to the organization. The anonymization design
should be such that it can thwart an adversary’s effort to identify a record owner in
the database.
What constitutes personal information?
Personal information consists of name, identifiers like social security number,
geographic and demographic information, and general sensitive information, for
example, financial status, health issues, shopping patterns, and location data. Loss of
this information means loss of privacy—one’s right to freedom from intrusion by
others.
Let us look at a sample bank customer table and an account table. The customer table
taken by itself contains nothing confidential, as most of the information in it is
also available in public voter databases and on social networking sites like
Facebook. Sensitivity arises when the customer table is combined with the
accounts table. A logical representation of Tables 1.1 and 1.2 is shown in Table 1.3.
Data D in the tables contains four disjoint data sets:
1. Explicit identifiers (EI): Attributes that identify a customer (also called record
owner) directly. These include attributes like social security number (SSN), insurance
ID, and name.
2. Quasi-identifiers (QI): Attributes that include geographic and demographic
information, phone numbers, and e-mail IDs. Quasi-identifiers are also defined as
those attributes whose values are publicly available, for example, in a voter database.
3. Sensitive data (SD): Attributes that contain confidential information about the
record owner, such as health issues, financial status, and salary, which cannot be
compromised at any cost.
4. Nonsensitive data (NSD): Data that are not sensitive for the given context.
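To make the four categories concrete, the sketch below partitions a single record into EI, QI, SD, and NSD. The column names are hypothetical, since Tables 1.1 through 1.3 are not reproduced in these notes; they simply stand in for the kinds of attributes described above.

# Hypothetical attribute names for the bank customer/account data (illustrative only).
EI = {"ssn", "insurance_id", "name"}                              # explicit identifiers
QI = {"zip_code", "date_of_birth", "gender", "phone", "email"}    # quasi-identifiers
SD = {"account_balance", "salary", "health_condition"}            # sensitive data


def classify(record: dict) -> dict:
    """Partition a single record's attributes into the four disjoint data sets."""
    buckets = {"EI": {}, "QI": {}, "SD": {}, "NSD": {}}
    for attribute, value in record.items():
        if attribute in EI:
            buckets["EI"][attribute] = value
        elif attribute in QI:
            buckets["QI"][attribute] = value
        elif attribute in SD:
            buckets["SD"][attribute] = value
        else:                                  # everything else is nonsensitive here
            buckets["NSD"][attribute] = value
    return buckets


row = {"ssn": "123-45-6789", "name": "Alice", "zip_code": "10001",
       "gender": "F", "account_balance": 25000, "preferred_branch": "Downtown"}
print(classify(row))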
The first two data sets, EI and QI, uniquely identify a record owner and, when
combined with the sensitive data, become sensitive or confidential. The data set D is
treated as a matrix of m rows and n columns, in which each row and each column is a
vector. Each of the data sets EI, QI, and SD is a matrix with m rows and i, j, and k
columns, respectively, so that

D = [D_EI | D_QI | D_SD]
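This column-wise partition of D can be expressed directly in code. The sketch below is a minimal illustration using NumPy, with made-up dimensions m, i, j, and k; it splits a stand-in data matrix into the EI, QI, and SD submatrices and checks that concatenating them recovers D.

import numpy as np

m, i, j, k = 6, 2, 3, 1           # made-up dimensions: m rows; i, j, k columns
n = i + j + k                     # total number of columns in D

D = np.arange(m * n).reshape(m, n)   # stand-in data matrix (m x n)

# Column-wise partition D = [D_EI | D_QI | D_SD]
D_EI = D[:, :i]                   # explicit identifiers  (m x i)
D_QI = D[:, i:i + j]              # quasi-identifiers     (m x j)
D_SD = D[:, i + j:]               # sensitive data        (m x k)

# Concatenating the three blocks recovers the original matrix
assert np.array_equal(np.hstack([D_EI, D_QI, D_SD]), D)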
We need to keep an eye on the index j (representing QI), which
plays a major role in keeping the data confidential. Apart from assuring their
customers’ privacy, organizations also have to comply with various regulations in that
region/country, as mentioned earlier. Most countries have strong privacy laws to
protect citizens’ personal data. Organizations that fail to protect the privacy of their
customers or do not comply with the regulations face stiff financial penalties, loss of
reputation, loss of customers, and legal issues. This is the primary reason
organizations pay so much attention to data privacy. They find themselves in a Catch-
22 as they have huge amounts of customer data, and there is a compelling need to
share these data with specialized data analysis companies. Most often, data
protection techniques, such as cryptography and anonymization, are used prior to
sharing data.
Anonymization is a process of logically separating personally identifiable
information (PII) from sensitive data. Referring to Table 1.3, the anonymization
approach ensures that EI and QI are logically separated from SD. As a result, an
adversary cannot easily identify the record owner from the sensitive data. This is
easier said than done.
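As a first intuition of what "logical separation" means, the sketch below splits each record into an identifying part (EI and QI) and a sensitive part (SD), linked only by a random pseudonym. The record values and column names are hypothetical, and this is only an illustration of the idea, not a complete anonymization scheme; it does not, for example, protect against linkage attacks on the quasi-identifiers.

import secrets

# Hypothetical records combining Tables 1.1 and 1.2 (column names are illustrative).
records = [
    {"ssn": "123-45-6789", "name": "Alice", "zip_code": "10001", "balance": 25000},
    {"ssn": "987-65-4321", "name": "Bob",   "zip_code": "94105", "balance": 1200},
]

identity_table = []   # EI and QI, kept under strict access control
sensitive_table = []  # SD, shared for analysis without direct identifiers

for record in records:
    pseudonym = secrets.token_hex(8)  # random link between the two tables
    identity_table.append({"pseudonym": pseudonym,
                           "ssn": record["ssn"],
                           "name": record["name"],
                           "zip_code": record["zip_code"]})
    sensitive_table.append({"pseudonym": pseudonym,
                            "balance": record["balance"]})

# An analyst who sees only sensitive_table cannot directly tell whose balance is whose.
print(sensitive_table)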
How to effectively anonymize the data?