Reference 4 Cropped
Katharine Jarmul
Chapter 1. Data Governance and Simple Privacy Approaches
Data privacy is a large and long-lived field. I want you to picture it like an
old road, packed with interesting side streets and diversions but hard to
navigate if you don’t know the way. This chapter is your initial orientation
to this road. In this chapter and throughout this book, I’ll help you map
important parts of the privacy landscape, and you’ll find areas where you
want to learn more and deviate from the original path. Applying this map
within your organization means uncovering who is doing what, what their
responsibilities are, and what data privacy needs exist in your organization.1
You might have heard the phrase data governance only once or hundreds of
times, but it is often left unexplained or open for interpretation. In this
chapter, you’ll learn where data governance overlaps with data privacy for
practical data science purposes and learn simpler approaches for solving
privacy problems with data, such as pseudonymization. You’ll also learn
how governance techniques like documentation and lineage tracking can
help identify privacy problems or ways to implement privacy techniques at
the appropriate step.
TIP
If you already know or work in data governance, I recommend skimming or skipping this chapter.
If governance and data management are new to you, this chapter will show you the foundations
needed to apply the advanced techniques you’ll learn in later chapters.
This chapter will give you tools and systems to identify, track, and
manage sensitive data. Without this foundation, it will be difficult to assess
privacy risk and mitigate it. Starting with governance makes
sense, because privacy fits well into governance frameworks and
paradigms, and these areas of work support one another in data systems.
Data Governance: What Is It?
Data governance is often used as an “all-encompassing” way to think about
data decisions, like whether to opt in to letting a service contact
you or who has access rights to a given database. But what
does the phrase really refer to, and how can you make it actionable?
Data governance is literally governing data. One way governing happens is
through a transfer of rights that people individually and communally possess.
Those rights are passed on to elected officials who manage tasks and
responsibilities for individuals who have no time, expertise, or interest. In
data governance, individuals transfer rights when data is given to an
organization. When you use a website, service, or application, you agree to
whatever privacy policy, terms and conditions, or contract is presented by
those data processors or collectors at that time. This is similar to living in a
particular state and implicitly agreeing to follow the laws of that land.
Data governance helps manage whose data you collect, how you collect and
enhance it, and what you do with it after collection. Figure 1-1 illustrates
how privacy and security relate to data governance, via an imaginary island
where users and their data are properly protected by both privacy and
security initiatives. In this diagram, you can see the sensitive data inside a
tower. Security initiatives are supported by Privacy by Design.2 Regulations
and compliance provide a moat that keeps sensitive data separate. Privacy
technologies you will learn in this book are bridges for users and data
stakeholders, allowing them to gather insights and make decisions with
sensitive data without violating individual privacy.
Figure 1-1. Mapping data governance
Data reliability/Knowledge:
Where did the data come from?
How did the processing change the data?
Is the metadata for lineage information easily accessible and queryable?
Data privacy and security:
What laws or internal policies apply to this data?
What was the privacy policy and terms at collection time?
Did the data come from a third party? If so, what are the restrictions and obligations, contractual or otherwise, for this data?
You are likely already focused on many of these questions since data is a
major part of your job. You might have personally suffered from a lack of
data documentation, incomplete understanding of how a certain database
came to be, and issues with data labeling and quality. Now you have a new
word to use to describe these qualities: governance!
Working on the governance side of data administration or management is
really about focusing on how to collect and update information about the
data throughout its lifecycle. The regulatory, privacy, and security concerns
shape that information and ensure governance decisions and frameworks
expedite measures like individual data rights and appropriate usage of data.
If your data does not come from individuals, there may be other concerns
with regard to proprietary data or related security issues that guide
governance initiatives.
When you think about governing data in a concrete way, you begin to look
at tasks such as documenting the ever-changing data flows at your
organization. It seems obvious and easy, but on closer look it is anything
but.
Let’s say you have a huge data lake that gets fed from 10 different sources,
some external, some internal. How can you actually begin to govern that
data? What would a scalable and easy-to-use solution look like? What
happens when those data flows change? It may be enough just to document
the code or the workflows that are actively running and in use and to leave
the rest for future work. But what do you do with data from partners or
other external data collection systems? You’ll need to coordinate this
documentation so the legal, privacy, and risk departments can use it for
auditing and assessment. This work should not be handled with piecemeal
and temporary solutions but instead addressed as holistically as possible.
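To make the documentation problem more concrete, here is a minimal Python sketch of a source registry for a data lake. It is an illustration under stated assumptions, not a recommended tool: the `DataSource` fields, the source names, and the `needs_review` and `audit_report` helpers are all hypothetical, chosen to mirror questions raised in this chapter (where the data came from, which privacy policy applied at collection time, and whether third-party restrictions apply).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataSource:
    """Governance record for one feed into the data lake (hypothetical schema)."""
    name: str
    origin: str                                   # "internal" or "external"
    owner: Optional[str] = None                   # team responsible for the feed
    privacy_policy_version: Optional[str] = None  # policy in force at collection time
    third_party_restrictions: list[str] = field(default_factory=list)
    documented: bool = False                      # True once the workflow is written up


def needs_review(source: DataSource) -> list[str]:
    """Return the open governance gaps for one source as plain-language notes."""
    gaps = []
    if not source.documented:
        gaps.append("workflow is not documented")
    if source.owner is None:
        gaps.append("no owner assigned")
    if source.privacy_policy_version is None:
        gaps.append("privacy policy at collection time is unknown")
    if source.origin == "external" and not source.third_party_restrictions:
        gaps.append("third-party restrictions are not recorded")
    return gaps


def audit_report(sources: list[DataSource]) -> dict[str, list[str]]:
    """Map each source that has open gaps to its list of gaps."""
    report = {}
    for source in sources:
        gaps = needs_review(source)
        if gaps:
            report[source.name] = gaps
    return report


# Made-up example sources; a real registry would be loaded from a catalog.
sources = [
    DataSource("web_signups", "internal", owner="growth-team",
               privacy_policy_version="2023-04", documented=True),
    DataSource("partner_feed", "external", owner="data-eng", documented=True),
    DataSource("legacy_crm_export", "internal"),
]

for name, gaps in audit_report(sources).items():
    print(f"{name}: {'; '.join(gaps)}")
```

A shared, queryable record like this is one way to give legal, privacy, and risk teams a single place to look during an audit. In practice, you would more likely keep these records in a data catalog or metadata store than in application code.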
To begin, let’s identify which data is the most important to protect for the
purpose of practical data privacy. How can you identify sensitive data?
What exactly is sensitive data?