
CSC 303 DATA PROTECTION TECHNIQUES NOTES

There are many stakeholders in data privacy within an organization; these are shown in Figure 1.1 below. Let us define these stakeholders.

Company: Any organization, such as a bank, an insurance company, or an e-commerce, retail, healthcare, or social networking company, that holds large amounts of customer-specific data. Companies are the custodians of customer data, which are considered very sensitive, and have the responsibility of protecting the data at all costs. Any loss of these sensitive data will result in the company facing legal suits, financial penalties, and loss of reputation.
Customer/record owner: An organization’s customer could be an individual or another organization that shares data with the company. For example, an individual shares his personal information, also known as personally identifiable information (PII), such as his name, address, gender, date of birth, phone numbers, e-mail address, and income, with a bank. PII is considered sensitive, as any disclosure or loss could lead to undesired identification of the customer or record owner.
Government: The government defines the data protection regulations that companies must comply with. Examples of such regulations are the HIPAA Act, the EU Data Protection Act, and the Swiss Data Protection Act. It is mandatory for companies to follow government regulations on data protection.
Data anonymizer: A person who anonymizes and provides data for analysis or as test
data.
Data analyst: This person uses the anonymized data to carry out data mining
activities like prediction, knowledge discovery, and so on. Following government
regulations, such as the Data Moratorium Act, only anonymized data can be used for
data mining. Therefore, it is important that the provisioned data support data mining
functionalities.
Tester: Outsourcing of software testing is common among many companies. High-
quality testing requires high-quality test data, which is present in production systems
and contains customer-sensitive information. In order to test the software system, the
tester needs data to be extracted from production systems, anonymized, and
provisioned for testing. Since test data contain customer-sensitive data, it is
mandatory to adhere to regulatory compliance in that region/country.
Business operations employee: Data analysts and software testers use anonymized data that are at rest or static, whereas business operations employees access production data because they need to support customers’ business requirements.

Adversary/data snooper: Data are precious and their theft is very common. An
adversary can be internal or external to the organization. The anonymization design
should be such that it can thwart an adversary’s effort to identify a record owner in
the database.
What constitutes personal information?
Personal information consists of name, identifiers like social security number,
geographic and demographic information, and general sensitive information, for
example, financial status, health issues, shopping patterns, and location data. Loss of
this information means loss of privacy—one’s right to freedom from intrusion by
others.

Protecting Sensitive Data


“I know where you were yesterday!” Google knows your location when you use Google Maps, which can track you wherever you go on a smartphone. Mobile companies know your exact location when you use a mobile phone. You have no place to hide; you have lost your privacy. This is the flip side of using devices like smartphones, the Global Positioning System (GPS), and radio frequency identification (RFID). Why should others know where you were yesterday? Similarly, why should others know your health issues or financial status? All these are sensitive data and should be well protected, as they could fall into the wrong hands and be exploited.

Let us look at a sample bank customer table and an account table. The customer table taken by itself contains nothing confidential, as most of the information in it is also available in public voter databases and on social networking sites like Facebook. Sensitivity arises when the customer table is combined with the accounts table. A logical representation of Tables 1.1 and 1.2 is shown in Table 1.3. Data D in the tables comprises four disjoint data sets (a sketch of this partition follows the list):
1. Explicit identifiers (EI): Attributes that identify a customer (also called record
owner) directly. These include attributes like social security number (SSN), insurance
ID, and name.
2. Quasi-identifiers (QI): Attributes that include geographic and demographic information, phone numbers, and e-mail IDs. Quasi-identifiers are also defined as those attributes that are publicly available, for example, in a voter database.
3. Sensitive data (SD): Attributes that contain confidential information about the
record owner, such as health issues, financial status, and salary, which cannot be
compromised at any cost.
4. Nonsensitive data (NSD): Data that are not sensitive for the given context.
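
As a concrete illustration of this partition, the sketch below splits the columns of a toy bank customer/account record into the four sets. The column names are assumptions, since Tables 1.1 through 1.3 are not reproduced here.

```python
# Hypothetical partition of a bank customer/account record into the
# four disjoint data sets EI, QI, SD, and NSD. Column names are invented.
RECORD = {
    "ssn": "078-05-1120", "name": "Alice Roy",            # EI
    "zip": "560001", "dob": "1984-07-12", "gender": "F",  # QI
    "balance": 1_250_000, "diagnosis": "diabetes",        # SD
    "preferred_branch": "MG Road",                        # NSD
}

EI = {"ssn", "name"}            # directly identify the record owner
QI = {"zip", "dob", "gender"}   # publicly available; identifying in combination
SD = {"balance", "diagnosis"}   # confidential facts about the record owner
NSD = {"preferred_branch"}      # not sensitive in this context

def partition(record):
    """Split a record into the four disjoint data sets."""
    sets = {"EI": EI, "QI": QI, "SD": SD, "NSD": NSD}
    return {name: {k: v for k, v in record.items() if k in cols}
            for name, cols in sets.items()}

print(partition(RECORD))
```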

The first two data sets, the EI and QI, uniquely identify a record owner, and when combined with sensitive data they become sensitive or confidential. The data set D is considered as a matrix of m rows and n columns. Matrix D can be viewed as a vector space in which each row and each column is a vector:

D = [D_{EI} | D_{QI} | D_{SD}]
Each of the data sets EI, QI, and SD is a matrix with m rows and i, j, and k columns, respectively. We need to keep an eye on the index j (representing QI), which plays a major role in keeping the data confidential. Apart from assuring their customers’ privacy, organizations also have to comply with the various regulations of their region/country, as mentioned earlier. Most countries have strong privacy laws to protect citizens’ personal data. Organizations that fail to protect the privacy of their customers or do not comply with the regulations face stiff financial penalties, loss of reputation, loss of customers, and legal issues. This is the primary reason organizations pay so much attention to data privacy. They find themselves in a Catch-22, as they hold huge amounts of customer data and there is a compelling need to share these data with specialized data analysis companies. Most often, data protection techniques, such as cryptography and anonymization, are applied prior to sharing data.
Anonymization is a process of logically separating the identifying information (PII) from sensitive data. Referring to Table 1.3, the anonymization approach ensures that EI and QI are logically separated from SD. As a result, an adversary will not be able to easily identify the record owner from his sensitive data. This is easier said than done: how can the data be anonymized effectively?

Privacy and Anonymity: Two Sides of the Same Coin


This brings up the interesting distinction between privacy and anonymity. According to Skopek [1], under the condition of privacy, we have knowledge of a person’s identity but not of an associated personal fact, whereas under the condition of anonymity, we have knowledge of a personal fact but not of the associated person’s identity. In this sense, privacy and anonymity are two sides of the same coin. Tables 1.4 and 1.5 illustrate the fundamental differences between privacy and anonymity.
There is a subtle difference between privacy and anonymity. The word privacy is also used in a generic way to mean anonymity, and there are specific use cases for both. Table 1.4 illustrates an anonymized table in which PII is protected and sensitive data are left in their original form. Sensitive data should remain in original form so that they can be used to mine useful knowledge.
Anonymization is a two-step process: data masking and de-identification. Data masking is a technique applied to systematically substitute, suppress, or scramble data that call out an individual, such as names, IDs, account numbers, and SSNs. Masking techniques are simple techniques that perturb the original data. De-identification is applied on the QI fields. QI fields such as date of birth, gender, and zip code have the capacity to uniquely identify individuals; combine them with SD, such as income, and a Warren Buffett or Bill Gates is easily identified in the data set. In de-identification, the values of QI are modified carefully so that the relationships are still maintained but identities cannot be inferred.
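
Here is a minimal sketch of the two steps in Python. The substitution, suppression, and generalization rules are simple illustrative choices, not prescribed ones; real masking and de-identification designs are far more careful.

```python
import hashlib

def mask_ei(record):
    """Step 1 - data masking: substitute or suppress explicit identifiers."""
    masked = dict(record)
    # Substitute the name with an opaque surrogate.
    masked["name"] = "CUST-" + hashlib.sha256(record["name"].encode()).hexdigest()[:8]
    masked["ssn"] = "XXX-XX-" + record["ssn"][-4:]   # partial suppression
    return masked

def deidentify_qi(record):
    """Step 2 - de-identification: generalize quasi-identifiers so records blend in."""
    deid = dict(record)
    deid["zip"] = record["zip"][:3] + "***"   # coarsen geography
    deid["dob"] = record["dob"][:4]           # keep only the birth year
    return deid

row = {"name": "Alice Roy", "ssn": "078-05-1120",
       "zip": "560001", "dob": "1984-07-12", "income": 1_250_000}
print(deidentify_qi(mask_ei(row)))   # SD ("income") is left in its original form
```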
In Equation 1.1, the original data set D is anonymized, resulting in the data set

D′ = T(D) = T([D_{EI} | D_{QI} | D_{SD}])    (1.1)

where T is the transformation function. As a first step in the anonymization process, EI is completely masked and is no longer relevant in D′. As mentioned earlier, no transformation is applied on SD; it is left in its original form. This results in

D′ = [T(D_{QI}) | D_{SD}]

which means that the transformation is applied only on QI, as EI is masked and not considered part of D′, and SD is left in its original form. D′ can be shared: QI is transformed and SD is in its original form, yet it is very difficult to identify the record owner. Coming up with the transformation function is key to the success of an anonymization design, and this is nontrivial.
The other scenario is protecting SD, as shown in Table 1.5, which applies to data in motion. Implementing this is also very challenging. The situation is dichotomous: organizations take the utmost care in protecting the privacy of their customers’ data, yet the same customers provide a whole lot of personal information when they register on social networking sites like Facebook (of course, many of the fields are not mandatory, but most people do provide sufficient personal information), including address, phone numbers, date of birth (DOB), details of education and qualifications, work experience, and so on. Sweeney [2] reports that zip code, DOB, and gender are sufficient to uniquely identify 87% of the population in the United States. With the amount of PII available on social networking sites, a data snooper with some background knowledge could use the publicly available information to re-identify customers in corporate databases; a toy version of such a linkage attack is sketched below. In the era of social networks, de-identification becomes highly challenging.
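
To make the risk concrete, the sketch below simulates a linkage attack of the kind Sweeney describes: an adversary joins a public, voter-roll-style list with an "anonymized" table on (zip, DOB, gender). All rows are invented.

```python
# Toy linkage attack: join public records with an "anonymized" table
# on the quasi-identifiers. All data below are fabricated.
public = [  # e.g., a voter roll, where names are known
    {"name": "Alice Roy", "zip": "560001", "dob": "1984-07-12", "gender": "F"},
    {"name": "Bob Das",   "zip": "560034", "dob": "1979-01-30", "gender": "M"},
]
anonymized = [  # EI removed, but QI left intact alongside SD
    {"zip": "560001", "dob": "1984-07-12", "gender": "F", "diagnosis": "diabetes"},
]

QI = ("zip", "dob", "gender")
for pub in public:
    for anon in anonymized:
        if all(pub[k] == anon[k] for k in QI):  # unique QI match re-identifies the owner
            print(f"Re-identified {pub['name']}: {anon['diagnosis']}")
```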

Methods of Protecting Data


One of the most daunting tasks in information security is protecting sensitive data in enterprise applications, which are often complex and distributed. What methods are available to protect sensitive data? Some of the available methods are cryptography, anonymization, and tokenization, which are briefly discussed in this section; detailed coverage is provided in the other chapters of the book. Of course, there are also other one-way functions, like hashing.
Cryptographic techniques are probably among the oldest known techniques for data protection. When done right, they are probably among the safest techniques for protecting data in motion and at rest. Encrypted data have high protection but are not readable, so how can we use such data? Another issue associated with cryptography is key management: any compromise of the key means complete loss of privacy.
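
As a small illustration of encrypting data at rest, the sketch below uses the Fernet recipe from the third-party Python cryptography package. Note how both issues raised above show up: the ciphertext is unusable for analysis, and everything hinges on safeguarding the key.

```python
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()   # compromise of this key means complete loss of privacy
cipher = Fernet(key)

token = cipher.encrypt(b"SSN=078-05-1120")  # protected, but unreadable and unanalyzable
print(token)
print(cipher.decrypt(token))                # only a key holder recovers the original
```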
Anonymization is a set of techniques used to modify the original data in such a manner that they do not resemble the original values but maintain the semantics and syntax. Regulatory compliance and ethical issues drive the need for anonymization. The intent is that anonymized data can be shared freely with other parties, who can perform their own analysis on the data. Anonymization is an optimization problem in that when the original data are modified they lose some of their utility, yet modification of the data is required to protect them. An anonymization design is a balancing act between data privacy and utility. Privacy goals are set by the data owners, and utility goals are set by the data users. Now, is it really possible to optimally achieve this balance between privacy and utility?
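
The privacy-utility trade-off can be made concrete with generalization. In the toy hierarchy below, each extra level of coarsening raises privacy (more record owners share each QI value) and lowers utility (analysis becomes less precise); choosing the level is the balancing act the text refers to.

```python
def generalize_zip(zipcode: str, level: int) -> str:
    """Level 0 keeps full precision; each further level suppresses one more digit."""
    keep = max(len(zipcode) - level, 0)
    return zipcode[:keep] + "*" * (len(zipcode) - keep)

for level in range(4):
    print(level, generalize_zip("560001", level))
# 0 560001  <- full utility, weakest privacy
# 1 56000*
# 2 5600**
# 3 560***  <- stronger privacy, but area-level analysis degrades
```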
Tokenization is a data protection technique that has been used extensively in the credit card industry and is currently being adopted in other domains as well. It replaces the original sensitive data with nonsensitive placeholders referred to as tokens. The fundamental difference between tokenization and the other techniques is that in tokenization the original data are completely replaced by a surrogate that has no connection to the original data. Tokens have the same format as the original data. As tokens are not derived from the original data, they exhibit very powerful data protection features. Another interesting property of tokens is that, although a token is usable within its native application environment, it is completely useless elsewhere. Therefore, tokenization is ideal for protecting sensitive identifying information.
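
A minimal vault-based tokenization sketch follows. The token preserves the 16-digit card format but is drawn at random, so it has no mathematical connection to the original value and is useless anywhere the vault is unreachable. The format and vault design here are illustrative assumptions.

```python
import secrets

vault = {}  # token -> original value; the only link back to the real data

def tokenize(pan: str) -> str:
    """Replace a 16-digit card number with a random surrogate of the same format."""
    token = "".join(str(secrets.randbelow(10)) for _ in range(16))
    vault[token] = pan   # the mapping lives only inside the secured vault
    return token

def detokenize(token: str) -> str:
    return vault[token]  # works only where the vault is reachable

t = tokenize("4111111111111111")
print(t, detokenize(t))  # same format as the original, but not derived from it
```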
For some time, the middle ground has been to use lighter privacy protection mechanisms, such as data masking or pseudonymization. These processes aim to protect data by removing or altering its direct, and sometimes indirect, identifiers. The term "anonymization" is frequently seen in references to these methods; however, the two have distinct legal and technical implications. Pseudonymization, or data masking, is commonly used to protect data privacy. It consists of altering data, most of the time direct identifiers, to protect individuals’ privacy in the data sets. There are several techniques for producing pseudonymized data (a sketch follows the list):
• Encryption: hiding sensitive data using a cipher protected by an encryption key.
• Shuffling: scrambling data within a column to disassociate it from the record’s other attributes.
• Suppression: nulling out or removing the sensitive columns from the data set.
• Redaction: masking out parts or the entirety of a column’s values.
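
The sketch below applies three of these techniques (shuffling, suppression, and redaction) to a toy column-oriented table; encryption was illustrated earlier. The table contents are invented.

```python
import random

table = {
    "name":   ["Alice Roy", "Bob Das", "Carol Lim"],
    "email":  ["alice@x.com", "bob@y.org", "carol@z.net"],
    "salary": [90_000, 72_000, 65_000],
}

random.shuffle(table["salary"])                # shuffling: break the row-level association
table["email"] = [None] * len(table["email"])  # suppression: null out a sensitive column
table["name"] = [n[0] + "*" * (len(n) - 1)     # redaction: mask most of each value
                 for n in table["name"]]

print(table)
```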
