Guidelines For GDPR Compliance in Big Data Systems

Keywords: The General Data Protection Regulation (GDPR), Big Data analytics, Privacy, Security

Abstract: The implementation of the GDPR, which aims at protecting European citizens' privacy, is still a real challenge. In particular, in Big Data systems, where data are voluminous and heterogeneous, it is hard to track data evolution through its complex life cycle ranging from collection, ingestion and storage to analytics. In this context, from 2016 to 2021 research has been conducted and several security tools designed. However, they are either specific to particular applications or address the regulation's articles only partially. To identify the covered parts, the missing ones and the necessary metrics for comparing different works, we propose a framework for GDPR compliance. The framework identifies the main components for the regulation's implementation by mapping requirements aligned with GDPR's provisions to IT design requirements. Based on this framework, we compare the main GDPR solutions in the Big Data domain and we propose a guideline for GDPR verification and implementation in Big Data systems.
1. Introduction

The General Data Protection Regulation (GDPR) [1] sets new requirements on security and data protection through 99 articles and 173 recitals and aims to protect the rights and freedom of natural persons. Every organization that deals with personal data has to comply with GDPR to protect these rights and to be accountable while improving business models [2]. Accountability aims at demonstrating how controllers comply with data protection principles. Each organization must answer the following questions: what information is processed? why? how and where is data stored? who can access it and why? is it up-to-date and accurate? how long will it be kept? how will it be safeguarded and how should accountability be reached?

Some previous works, predating the GDPR, present how to extract technical requirements from legal requirements [3]. Nevertheless, no design patterns or best practices can be directly applied in the Big Data context for GDPR-compliance implementation.

In the last few years, topics about GDPR have been discussed across a range of academic publications and industry papers from different theoretical and practical perspectives, including numerous implementations and design concepts for GDPR compliance [4]. These works are still in their infancy with a limited scope. Fully developed and approved tools that implement GDPR articles are still missing, especially in the area of Big Data analytics. The term "Big Data analytics" refers to the entire data management life cycle, from ingestion and storage to analysis of high volumes of data with heterogeneous formats from different sources. As presented in Fig. 1, the reference architecture of Big Data systems covers 5 main layers [5]: data sources, ingestion, processing, storage, distribution and services. At the processing layer, sophisticated algorithms are being developed to analyze large amounts of data to gain valuable insights for accurate decision-making and to detect unprecedented opportunities such as finding meaningful patterns, presuming situations, and predicting and inferring behaviors. Due to the large data volume and the complexity of processing, tracking data dependencies and verifying privacy are challenging. For this purpose, the data security and governance layer is a cross-layer generally used for data security and management. Consequently, it represents a key part of the system in implementing GDPR requirements.

Recent academic and industrial tools [6,7] implement some GDPR requirements by automatically translating privacy policies into software in order to provide accountability. However, these works address, only partially, GDPR principles and their related articles such as purpose limitation, data minimization, storage limitation, transparency or security [8,9]. Other works concentrate on particular articles of the regulation [10] like the right to data portability, the right to be forgotten, the access right or the right to be informed [11]. Also, these works
generally address one particular type of data source (logs, IoT sensors or classical SQL databases). It is not clear how to apply the proposed solutions, which consider uniform data, to Big Data architectures with multi-channel data sources, different purposes and intensive processing. Consequently, we still lack guidelines to verify GDPR compliance and to implement the regulation in a Big Data context. As a starting point, in order to address this issue, a comprehensive overview of the regulation and a common understanding of its key concepts are necessary. Afterward, the analysis of GDPR documentation and the study of recent works on privacy and GDPR allow the identification of the main privacy requirements and building blocks for GDPR compliance verification. As an outcome of this study, and based on the different experimentation carried out in the state of the art, we propose a framework with well-defined components to implement the regulation. According to these components, we situate the different works carried out on GDPR in the domain of Big Data. Furthermore, we provide an overview of how to use the framework to assist IT developers and Big Data system designers in building GDPR-compliant systems and applications. As an illustration of the framework usage, we consider the example of an e-health application, and we illustrate how we used the framework to help privacy by design implementation in the considered application.

This paper's contribution can be summarized as follows:

• An analysis of GDPR principles and entities for a better understanding of the regulation by IT developers and Big Data system designers.
• A translation from GDPR principles' requirements to IT design requirements.
• A framework for GDPR compliance verification and implementation in Big Data systems.
• A classification of the state of the art conducted on GDPR solutions implemented between 2016 and 2021 in both academic and industrial areas.
• A use case demonstrating the framework usage.

This work is an extended version of our previous work [12], which was restricted to a survey and a first version of the proposed framework. In this paper, we propose a translation from the regulation's requirements to IT design requirements, which allows us to have a more precise and fine-grained framework. Furthermore, an IoT use-case is proposed to illustrate the framework usage and helps us identify missing parts in the use-case management system. Furthermore, we extended the related works' section and the GDPR tools section with recent solutions, mainly from the industry. The up-to-date view of the studied solutions allows us to provide some key guidelines for GDPR implementation. Finally, the evaluation of the ameliorated solution shows an acceptable overhead when implementing GDPR-compliance.

This paper is structured as follows. Section 2 is an overview of GDPR principles and main entities. In Section 3, we present the related works and highlight the contribution of this paper. Section 4 presents the problem statement and illustrates the main GDPR challenges in Big Data systems. In Section 5, we extract the main IT design requirements starting from GDPR principles and we describe our framework for GDPR compliance in Big Data systems. In Section 6, we use the presented framework to classify GDPR tools for reuse purposes. We describe the framework used for GDPR-compliance implementation and evaluation in Section 7. Finally, Section 8 provides a summary of the main findings of this paper and highlights new opportunities for future work.

2. GDPR entities and principles

The GDPR aims at delivering harmonized, consistent and high-level data protection across Europe. It has 99 articles and 173 recitals grouped into 11 chapters. In those chapters, it addresses a set of principles, entities, obligations and legal requirements. GDPR is a complex law and hard to understand and analyze by Big Data system designers and IT developers. In this section, we will illustrate a big picture of GDPR requirements and entities through a top-down approach.

2.1. GDPR entities

There are six main entities in the regulation [1]:

• Data Subject (DS): an identified or identifiable natural person, directly or indirectly, by data to be used by the controller or by any other natural or legal person. A data subject is any person whose personal data are being collected, held or processed.
• Controller: a natural or legal person, public authority, agency or any other body which alone or jointly with others determines the purposes, conditions and means of personal data processing. It ensures compliance with GDPR principles related to the processing of the personal data (Accountability), implements data protection policies and data security measures, carries out data protection impact assessments (DPIA) for high-risk processing, informs data subjects of their rights, notifies the supervisory authority within 72 h in case of a personal data breach, and transfers personal data to a third country or international organization per specific safeguarding provisions.
• Processor: (a person or a legal entity) processes personal data on behalf of the controller. Specifically, it collects personal data online through registration, contact forms, email or digital payments and invoicing. It also stores, uses, records, organizes, retrieves, discloses and deletes the collected personal data on behalf of, and under the instructions of, the controller, and creates inventories for all the above-mentioned data processing categories.
• Data Protection Officer (DPO): (a person or a legal entity) manages and supervises all data protection activities. Specifically, he monitors compliance with the GDPR personal data protection and security provisions and cooperates with the supervisory authority.
• Supervisory Authority (SA): Article 46 states that supervisory authorities "are responsible for monitoring the application of this Regulation and for contributing to its consistent application". The independent public authority is responsible for monitoring regulated entities' compliance with GDPR.
• Third party: refers to a natural or legal person, public authority, agency or body other than the data subject, controller, processor and persons who, under the direct authority of the controller or processor, are authorized to process personal data.

Dependencies between these entities are defined by GDPR articles. For example, the data subject can declare his consent to the controller (Art.4). He can also request from the controller access to his data (Art.13 and Art.14), data rectification, processing restriction and information about the life cycle of his data. On the other hand, the controller provides information to data subjects (Art.13 and Art.14) and communicates data breaches to them (Art.34). The previous work [12] gives a big picture of the main article-based relationships between the different entities. From this network, it becomes evident that the center of activity in this regulation revolves around the data subject and that the main entity in the GDPR is the controller. A large set of actions is necessary. From a technical perspective, systems should be able to provide options for storing and revoking consent, as well as to restrict processing on a fine-grained level (Art.4, Art.21, and Art.18). The ability to deliver complete and coherent data to data subjects or transfer it to competitors has to be implemented (Art.20). The right to data rectification or deletion (Art.16 and Art.17) poses further challenges. In the next section, we will present the big picture of GDPR requirements summarized in its principles.

2.2. GDPR principles

GDPR sets out seven key principles for the processing of personal data, stipulated in Article 5 [1]. They can be summarized in the following points:

• Lawfulness, fairness and transparency: Personal data shall be processed lawfully, fairly and in a transparent manner in relation to the data subject.
• Purpose limitation: This principle aims to make clear that personal data should be collected for specified, explicit and legitimate purposes and not further processed in a manner incompatible with those purposes. Controllers should define and document the purpose for data usage and provide the possibility to update purposes and to check the coherence between them.
• Data minimization: This principle makes clear that personal data should be adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed. In other words, the minimum amount of personal data is revealed to satisfy the application's purpose.
• Accuracy: Controllers should ensure the accuracy of any personal data created or updated, with the right to rectification, which gives data subjects the right to have incorrect personal data rectified.
• Storage limitation: Controllers should consider which data to store, why and for how long. So, even if they collect and use personal data fairly and lawfully, they cannot keep it for longer than needed. GDPR does not set specific time limits for different types of data. Once information is no longer needed, personal data must be securely deleted.
• Integrity and confidentiality: Controllers must have appropriate security measures put in place to protect the personal data they hold and must keep records to demonstrate compliance.
• Accountability: The accountability principle requires that controllers take responsibility for what they do with personal data and how they comply with the other principles. They must have appropriate measures and records in place to be able to prove their compliance.

As we see, the GDPR principles give a big picture of the law requirements. It is really hard and complex to analyze the regulation and apply it to an IT product life cycle. In the next section, we will detail the proposed guideline scope.

2.3. Guidelines scope

In this paper, our goal is to help Big Data developers and system architects better address GDPR requirements for Big Data applications. The GDPR calls for having the "appropriate technical and organizational measures" established when processing personal data. These appropriate technical and organizational measures vary and depend on the use case and scenarios. In this work, we focus on the technical measures for systems adopting a classical Big Data architecture as presented in Fig. 1. We followed a top-down analysis of the GDPR. We start from the general description of Article 5, which refers to many articles and recitals as summarized in Table 1. For example:

"Lawfulness, fairness and transparency" are some of the main goals of the GDPR. Important articles include Art. 5(1)(a) as well as the obligation to provide information stipulated in Articles 12 through 14. A separate recital about transparency (Recital 58 and 39(1-4)) provides some technical recommendations, and other articles and recitals describe the transparency principle requirements as presented in Table 1.

"Purpose limitation" is requested particularly through Art. 5(1)(b) in accordance with Recital 39(6). Many articles and recitals are linked to purpose limitation, such as Art. 6 and Recital 50.

The main contributor to "data minimization" is Art. 5(1)(c) in accordance with Recital 39(7-8).

"Accuracy" is illustrated in Article 5(1)(d) but it is linked to data subject rights since personal data should be "accurate and, where necessary, kept up to date. Every reasonable step must be taken to ensure that inaccurate personal data, having regard to the purposes for which they are processed, are erased or rectified without delay". So requirements in Chapter 3, "Rights of the data subject", should be linked to this principle (articles and the corresponding recitals), such as Art. 16, Art. 17 and Art. 18, as mentioned in Table 1.

GDPR is a complex law and targets software compliance as well as internal enterprise processes. In this work, we focus on the software compliance part, whose articles and recitals are summarized in Table 1. We also limit the scope of our work to consent as the legal basis of processing. GDPR requirements of a contractual nature between its entities are not studied in this work because our target is IT developers. In addition, this work addresses a guideline for controllers. So, we are not addressing guidelines for processors' obligations or for supervisory authorities' obligations. Also, joint controllers (Art. 29) and data transfer will be addressed in future work.

GDPR is a hard-to-understand law whose principles, articles and recitals are tightly interlinked. We provided a summarized overview of the dependencies between principles, articles and recitals to highlight that all GDPR articles and recitals are linked. That is why many organizations such as Cloudera [13] addressed GDPR compliance by analyzing and studying its principles (adopting the same top-down approach). Also, academic works [14] followed a similar methodology, grouping GDPR articles by principle in their understanding of the law.

3. Related works

Works on GDPR compliance can be divided into 3 main categories: (1) GDPR analysis, (2) frameworks for GDPR compliance and (3) IT tools for GDPR implementation. The first category of works presents theoretical interpretations of GDPR. The second category is more technical and presents some guidelines to implement GDPR-compliant systems. The third category, which focuses on IT tools and implementations in Big Data systems, is the scope of Section 6 of this paper. In this section, we focus on the first two categories.
Table 1
GDPR requirements analysis.

GDPR principles | GDPR articles | GDPR recitals
Lawfulness, fairness and transparency | 5(1)(a), 6, 7, 12, 13, 14, 15, 19, 24(1), 25(1,2), 30, 32(1), 33, 34 | 39(1–4), 42, 43, 58, 60, 61, 62, 63, 74(1–3), 78, 86, 87
Purpose limitation | 5(1)(b), 6(4), 22, 24(2) | 29, 39(6), 39(9), 50(1), 71(1–2)
Data minimization | 5(1)(c), 25(2) | 39(7–8), 39(10), 78(2–3), 156
Accuracy | 5(1)(d), 16, 17, 18, 21, 13(2)(c), 14(2)(d), 15(1)(e) | 39(11), 63(4), 65(1–4), 66, 67(1–3), 69(1–2), 71
Storage limitation | 32(1)(b,c,d), 5(1)(e), 13(2)(a), 20(1,2), 13(2)(a), 49, 15(1) | 39(10), 63, 68(3–6), 83(3)
Integrity and confidentiality | 5(1)(f), 28(3)(b), 32(1)(b), 25(2), 29, 32(2), 32(1), 32(1)(b), 89(1), 25(1), 32(1)(a), 40(2)(d), 6(4)(f) | 25(1), 28, 29, 39(12), 49(1–2), 75, 78, 81, 83(2–3), 84, 88, 89, 90, 91, 95
Accountability | 5(2), 24, 30, 32(1)(a,b,c), 34, 35(11), 37(7) | 82
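The mapping in Table 1 can itself be kept as machine-readable metadata, so that every control implemented later can be traced back to the provisions it addresses. The following minimal Python sketch is our own illustration of that idea and is not part of any existing tool; the dictionary only reproduces two abbreviated rows of Table 1.

```python
# Illustrative only: part of Table 1 encoded as data, so that audit reports can
# state which GDPR provisions a given implemented control traces back to.
PRINCIPLE_PROVISIONS = {
    "purpose_limitation": {
        "articles": ["5(1)(b)", "6(4)", "22", "24(2)"],
        "recitals": ["29", "39(6)", "39(9)", "50(1)", "71(1-2)"],
    },
    "data_minimization": {
        "articles": ["5(1)(c)", "25(2)"],
        "recitals": ["39(7-8)", "39(10)", "78(2-3)", "156"],
    },
}

def provisions_for(principle: str) -> list:
    """Return the articles and recitals linked to a principle, for traceability reports."""
    entry = PRINCIPLE_PROVISIONS[principle]
    return [f"Art. {a}" for a in entry["articles"]] + \
           [f"Recital {r}" for r in entry["recitals"]]

if __name__ == "__main__":
    print(provisions_for("data_minimization"))
```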
3.1. GDPR analysis

Several authors and organizations have analyzed GDPR and privacy by design. However, they generally provide documentation and support for understanding the law without providing a practical framework or guidelines to apply it in a company's projects or products, as in [15].

Other works generally focus on specific applications such as healthcare [16–18] and private banking [19], or on specific tasks like data storage [20]. They analyze and discuss the impact of GDPR compliance on the specific fields of interest. For example, in [17] the fundamental legal issues are identified, as well as challenges and opportunities for an e-health scenario. Architectural guidelines are also proposed and discussed, but without a proof of concept. In [21], the authors analyze GDPR articles and propose an ontological data protection model for an organization. In [22], the authors highlighted the effects of GDPR on designing assistive environments without proposing implementations. In [23] and [24], the authors analyze the principle of data protection by design without proposing implementation guidelines. In [25], the authors provide guidelines for implementing GDPR from a user experience perspective for subsidiary companies, but no implementation or use case validation is proposed. In [26], the authors present an overview of the GDPR in terms of the entities involved and provide a systematic representation of their interactions. Consequently, the paper presents an analysis of the entities categorized according to their role as defined by GDPR, the nature of information flows between these entities, and the requirements for their interoperability. These studies helped us to understand more clearly GDPR challenges and requirements. They also helped in our translation from legal requirements to technical requirements [14].

3.2. Frameworks for GDPR compliance

A framework is proposed by the Special privacy project in [6] and provides a set of functionalities for consent, transparency and compliance checking. This work illustrates the approach to consent management and consent implementation in a Big Data context. The authors studied the transparency requirements of GDPR and provided a compliance and transparency dashboard. In our work, we address not only consent and transparency but also other GDPR principles like accountability, data minimization and purpose limitation. Regarding consent, we provide more fine-grained consent policies based on attribute-based access control.

In [27], the authors provide guidelines for managing consent and personal data in ICT businesses, taking into account the provisions of the General Data Protection Regulation (GDPR). They started from the analysis of previous studies on consent management models and GDPR requirements. In our work, our starting point is the GDPR law with its principles, articles and recitals, from which we extracted IT requirements as guidelines for IT developers. Furthermore, in [27], no implementation or evaluation is provided; the result of the study can just help data controllers to improve the integration of consent and data management in their systems.

Another project, called PrivacyGuide [28], is based on machine learning and natural language processing techniques to classify privacy policy content and to calculate the risk behind each policy. A visualized summary is provided to illustrate the relevant privacy aspects associated with the identified level of risk. This work helps in identifying risk assessment and accountability principles. In our work, we provide an assessment and risk manager as an IT component of our framework. Its role is to compare business policies against consent policies for non-compliance detection in a static way. Inspired by the PrivacyGuide project's ideas, this module can be enhanced for a dynamic privacy risk analysis.

A framework for GDPR compliance for SMEs (Small and Medium-sized Enterprises) is proposed in [29]. The framework distinguishes 3 phases to follow: analysis, design and implementation. The design phase of the project focuses on three elements: routines, policies and templates. Work routines concern the handling of personal data during a regular working day for all employees. Starting from the work routines and the SME policies, a template is generated and displayed in a GUI. In [29], the authors actually provide a very general framework that may cover different SMEs. Given the presented framework, organizations need to work with experts in order to adapt it to their routines. This work can be applied to Big Data systems for data flow control. However, for GDPR implementation, the proposed solution is very general, and additional work is required with experts and IT system design.

In [30], the authors propose a framework for GDPR-compliance in smart city IoT platforms. The adopted approach is different from ours. Indeed, instead of starting from the study of GDPR principles to deduce IT requirements, the authors identify a set of security and privacy requirements related to IoT systems and then match this list of requirements against GDPR features extracted from GDPR principles. In our work, our starting point is the GDPR articles and recitals, from which we extract IT requirements. The proposed framework is guided by the IoT architecture and the focus is put on securing M2M communications. Compared to their framework, ours is guided by the particular reference architecture of Big Data systems, which has many similarities with IoT. Indeed, we have in common the management components related to data collection, storage and processing. In our framework, we additionally add a data flow manager that represents a global view of data flow stages (sources, edges, processors, etc.), which is needed for accountability. We also adopt a more fine-grained architecture where risk management, distribution management, user consent, storage and processing are handled in separate components. This provides more flexibility in managing GDPR principles. Furthermore, we show, in this paper, that our framework can be directly applied to IoT as a particular use case of Big Data.

GDPR-compliance is currently handled in a lot of EU projects, namely BPR4GDPR [31], DEFeND [32], SMOOTH [33], PDP4E [34], PAPAYA [35] and PoSeID-on [36]. These projects address the lack of specific, operational solutions that respond to the challenges and legal
innovations posed by GDPR, by providing systematic methods, detailed techniques and software tools [37].

The goal of BPR4GDPR [31] (Business Process Re-engineering and functional toolkit for GDPR compliance) is to provide a holistic framework able to support end-to-end GDPR-compliant intra- and inter-organizational ICT-enabled processes at various scales. The proposed solutions are very general and aim at covering the full process lifecycle, from its initial identification or specification to its enactment and execution. They target diverse application domains, while our work focuses on the specific domain of Big Data systems.

DEFeND [32] is a data governance framework designed to assist organizations in implementing GDPR. DEFeND adopts the same approach as ours, where project members start from the study of GDPR articles and principles. Then, they identify the framework components implementing these principles. However, their scope is larger than ours since they target coordinators from different sectors (e-health, banking, energy, etc.) and they propose solutions both for operations and for technical implementations [38,39]. If we compare the implementation module of the DEFeND project to our framework, we see that we are more guided by the Big Data architecture and thus provide a more fine-grained architecture. For instance, since the data collection layer is distinct from the processing and storage layers in Big Data, we propose a manager for each layer (the collection manager, the processing manager and the storage manager) to deal with each layer separately.

As we can see, many projects are emerging in the area of GDPR. These projects have different objectives and different scopes. As it is not yet feasible to address all the issues, each project has its own target public and/or focuses on some specific aspects of GDPR. Although work is being carried out, several aspects of GDPR remain open, such as purpose limitation, data minimization and storage limitation [37].

To summarize, compared to existing projects and proposed frameworks for GDPR-compliance, and to the best of our knowledge, our work presents the first technical GDPR framework and IT guidelines for GDPR compliance in Big Data systems. Although some ideas and design components are similar to some proposed solutions, our framework is more adapted to the management of data touchpoints and to the layers of the Big Data architecture. We also provide a set of technical implementations fulfilling Big Data challenges and we illustrate the framework usage through an IoT use-case.

4. Problem statement

In this section, we discuss the major GDPR challenges in Big Data systems [40]. From our perspective, the current GDPR and privacy challenges can be grouped into four categories following the architecture layers presented in Fig. 1, as follows:

• Challenges in the Data Sources layer: Regarding the privacy principles, both the consent and purpose limitation principles must be considered before beginning the collection phase. Each data subject has the right to know the reasons behind collecting each data item from the sources. Hence, the data subject is asked to set his preferences about the collection frequency, the data granularity and the set of information he allows to be disclosed to third-party applications. Preserving privacy at the source layer is essential and can affect the whole data life cycle. One may see that the principle of "data minimization" and Big Data are at first sight contradictory and very challenging to reconcile, because the perceived opportunities in Big Data provide incentives to collect as much data as possible and to retain this data as long as possible for yet unidentified future purposes.
• Challenges in the Data Ingestion layer: Big Data applications typically tend to collect data from diverse sources and without careful verification of the relevance or accuracy of the data thus collected. This can produce false analysis results and affect data quality. For example, in the field of e-health, incorrect data about the patient's health or environment can lead to an erroneous diagnosis which puts the life of the data subject in danger. Anomaly detection systems are required for accuracy. Indeed, it is important to make sure that data are not modified during the transmission phase, and that malicious entities that try to inject data in order to congest the network or influence the analysis results are detected.
• Challenges in the Data Processing and Storage layers: Data are stored and processed to provide advanced and calculated information for the services layer. However, personal data should be stored for a well-defined time duration (storage limitation). Thus, data retention and disclosure limitation are required at this phase. Consequently, the necessary mechanisms must be deployed for destroying data when expired. Furthermore, a lot of data are collected for non-defined purposes, mainly for Big Data analytics, which requires the maximum of input to improve algorithms' accuracy. However, the blunt statement that data are collected for any possible analytics is not a sufficiently accepted purpose. The principle of storage limitation may undermine the ability to be predictive, which is one of the opportunities rendered possible by Big Data analytics. Indeed, if Big Data analytics allows predictability, it is precisely because algorithms can compare current data to stored past data in order to determine what is going to happen in the future.
Another challenging issue for securing a Big Data system is data sharing. For example, road traffic data can be collected by deployed cameras or by travelers' smartphones and GPS in a crowdsourcing way. During global road planning, it is challenging to define the access policy and enable privacy-preserving data sharing among the involved applications and services. Therefore, Big Data storage and sharing require the deployment of appropriate techniques in order to respect the user consent and privacy while providing innovative analytic processing for different purposes.
Once a Big Data application has resolved all the previous challenges for the processing and storage layer, a controller needs to demonstrate this. Here comes the role of transparency and accountability. The controller needs to provide all information about the processed data: where data are stored and how they are manipulated or processed. This task can be easy for classical applications, but in a Big Data context it is challenging. Big Data processing is complex, with different purposes, intensive processing and sometimes opaque processing operations. Building transparency and tracking data usage and storage is very difficult.
• Challenges in the Distribution and Services layers: Third-party applications have access to the analytic results calculated from citizen data. The communicated information, even anonymous, may reveal personal data. It is important that data sharing be controlled with regard to citizens' consent. The challenge exists when there is a large number and diversity of applications using personal data and communicating from the data sources to the processing layers. In that case, tracking data access and data breach notification becomes difficult. Also, in the context of Big Data analytics, the processing can be opaque, whereas individuals (data subjects) must be given clear information on what data are processed. They have to be better informed on how and for what purposes their information is used and, in some cases, they require the logic used in algorithms to determine assumptions and predictions about them.

From the above description, the core GDPR principles seem, for the most part, to contradict some of the key features of Big Data applications and Big Data analytics. Nevertheless, rethinking some processing activities and IT developments may help to respect privacy, notably by having well-managed, up-to-date and relevant data while preserving
Big Data spirit. Ultimately, this may also improve data quality and thus contribute to the analytics. Addressing GDPR principles requires a coordinated strategy involving different organizational entities, including legal, human resources, IT security and more. GDPR includes key requirements that directly impact the way organizations implement IT security. Unfortunately, it is not possible to buy a GDPR-compliant product and consider a system compliant. Because GDPR is more about security processes and managing risk, no single product will solve all of the privacy problems. What is needed is to ensure that solutions work together to be truly GDPR compliant [41].

The next section will detail the steps and building blocks of the proposed framework. We translated GDPR requirements into technical requirements, trying to face the different challenges in the Big Data architecture and provide a GDPR-compliant solution.

5. From GDPR principles to IT GDPR framework

To design GDPR-compliant systems, GDPR obligations have to be interpreted as technical requirements, which is not straightforward. We need a valid means to write simple and understandable requirements. Some academics started to address this step as soon as GDPR appeared, such as in [42] and [43]; the study of these similar efforts helped us in identifying and confirming the right IT requirements. This is the scope of this section, where we follow the steps presented in Fig. 2. Indeed, GDPR principles are interpreted and detailed as GDPR requirements that we then translate into IT design requirements. Table 2 shows 12 requirements (Req 1 - Req 12) detailing the GDPR principles. These requirements are built based on the articles and recitals related to each principle in Table 1. For example, the lawfulness, fairness and transparency principle leads to at least four requirements. First, data subject (DS) privacy preferences have to be collected and applied when processing and communicating his data. This is the scope of the first requirement about "DS consent management". In particular, when a data breach occurs, the DS must be notified (Req 2), and each time his data are used, data access has to be checked (Req 3) and communicated to the DS (Req 4). In addition, in our translation to IT requirements, we consider not only the GDPR requirements but also the Big Data privacy challenges, layer by layer, against GDPR. For example, in the ingestion layer, we provide Req 1, Req 2, Req 3, Req 4, Req 7 and Req 8. Following the Big Data architecture life-cycle, we provide requirements for every layer. We detail in the next section how we analyzed the Big Data architecture to detect data touchpoints, and how our requirements control the pipeline and the data flow from collection to distribution.

The second step consists of extracting the main IT design requirements from the obtained GDPR requirements. We present hereafter the list of identified IT requirements.

• IT Req 1: DS consent management. A controller has to provide the API to the data subject (DS) in order to express his privacy preferences (definition of consent, withdrawal of consent, storage of consent, and compliance with consent). The DS has to be able to describe what data are being collected, the source, the reason for collection, the access rights of processors, where the data is allowed to be stored, how long the data is retained, who has access to the data, and to where and to whom the data is being transferred. Then, the controller translates each preference into a machine-readable policy language. Furthermore, the controller system must support signed consent to authorize the usage, access and management of Data Types. The concept of Data Type is derived from GDPR and can be regarded as a Data Category. According to GDPR, the authorization/delegation to manage personal data (Types) provided by a user to the Big Data platform management must be performed by using a signed consent, and the grant can be revoked at any time by the user. So, technically speaking, a consent is translated into a privacy policy with a set of conditions and constraints; consent management refers to privacy policies management (a minimal illustrative sketch of such a machine-readable policy is given after IT Req 6).
• IT Req 2: Data usage verification: data are used as expected. A controller has to map all relevant personal data and data flows to understand what to do with that data. In this phase, metadata about identified data elements relevant for GDPR can be loaded into a data governance platform, classified and then displayed, for instance, as a graph or dashboard. According to GDPR, users can delegate access rights to other parties. The delegation can be fine-grained by specifying data attributes. Attributes are about data sets, devices, third-party applications, storage spaces, users, purposes, etc. For example, in e-health, a patient may be interested in granting a partner access only to the glucose level and keeping the other information private. Therefore, only the owners or delegated users can access the data. On the other hand, a controller has to manage data ownership (permitting the change of ownership) and the access delegations. In the delegation management, it must be possible to list the delegations (check the grants provided) and to revoke a delegation or the consent.
• IT Req 3: DS notifications in case of a data breach. A controller has to ensure the communication of policy interference and breach to the DS and DPO in order to negotiate a modification or a denial of access. Real-time monitoring and notification management must be implemented. Also, a controller has to inform users about the security level at which the solution may work according to the level of security taken (it may depend on the kind of sensitive data managed). Furthermore, the GDPR requires support for data breach detection in a short time whenever some data or Data Type has been tampered with or leaked.
• IT Req 4: Data usage communication to the DS and continuous checking of the lawful basis of data processing. A controller needs to implement automated discovery of relevant or partial personal data. It also has to harvest metadata from heterogeneous solutions: data management, data warehouse, data integration, extract-transform-load, business intelligence, Big Data and Hadoop technologies [44]. This allows data identification through the exploration of connected metadata, in addition to using a privacy dashboard or navigable visualization so that the DS can explore and track his data flows between system services. For example, the developers of Big Data applications need to create connections with dashboards (for presenting data and collecting actions from users or services), storage (for getting access to historical data, or for saving additional data and the results of some data analytics) and with ingestion layers (brokers in the case of IoT applications, for subscribing to the data drive or sending/receiving messages), etc. Big Data applications also invoke and implement data analytics processes exploiting a large amount of data storage, for example by using machine learning approaches. Thus, the authentications to establish these connections have to be automated. This means that the developers are not forced to use credentials in the source code to establish authenticated connections (for example, with the IoT brokers, dashboards, storage, etc.). This can be implemented as an orchestration component facilitating the connection between the different components.
• IT Req 5: Purpose definition and documentation for data usage. A controller has to enrich and extend its access control policy with purposes for data usage to limit access to the specified policy. These policies need to be stored and evaluated with a policy engine for each data access (a minimal sketch of such a purpose check is given after Table 2).
• IT Req 6: Purpose update. A controller has to provide an update user interface and the flexibility to update defined policies for the defined purposes. If the controller performs some form of processing different from the one initially defined, the controller shall ensure that it is analyzed, justified and documented why the new purpose is considered consistent with the old one. Updating a privacy policy and/or revoking it means revoking the consent, which is itself expressed as a privacy policy.
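As an illustration of IT Req 1, the sketch below shows one possible machine-readable form of a signed consent, reusing the e-health glucose example mentioned in IT Req 2. It is a minimal sketch under our own assumptions: the class and field names are hypothetical and do not correspond to a specific policy language or product.

```python
# Minimal sketch of a consent translated into a machine-readable privacy policy
# (IT Req 1). Field names are assumptions made for this illustration only.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class ConsentPolicy:
    data_subject_id: str
    data_categories: List[str]       # e.g. ["glucose_level"] (Data Types / Data Categories)
    allowed_purposes: List[str]      # purpose limitation
    allowed_processors: List[str]    # processors acting on behalf of the controller
    storage_location: str            # where storage is allowed (e.g. "EU")
    retention: timedelta             # storage limitation (DTTL)
    signed_at: datetime = field(default_factory=datetime.utcnow)
    revoked: bool = False            # withdrawal of consent

    def expired(self, now: Optional[datetime] = None) -> bool:
        """True when consent is withdrawn or the retention period has elapsed,
        i.e. the data must be deleted or anonymized."""
        now = now or datetime.utcnow()
        return self.revoked or now > self.signed_at + self.retention

# Example: a patient consents to glucose data being used for diabetes monitoring only.
consent = ConsentPolicy(
    data_subject_id="ds-42",
    data_categories=["glucose_level"],
    allowed_purposes=["diabetes_monitoring"],
    allowed_processors=["hospital-analytics"],
    storage_location="EU",
    retention=timedelta(days=365),
)
print(consent.expired())  # False while the retention period is running
```

Such a structure can then be stored in the consent policies storage and evaluated by a policy engine, as discussed for IT Req 5.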
Table 2
From GDPR principles to GDPR requirements.

GDPR principles | GDPR requirements
Lawfulness, fairness and transparency | - Req 1: DS consent management. - Req 2: DS notifications in case of data breach. - Req 3: Data usage verification: data are used as expected. - Req 4: Data usage communication to the DS and continuous checking of lawful basis of data processing.
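To make the purpose-limitation requirements (IT Req 5 and IT Req 6) more concrete, the following minimal sketch shows the kind of check a policy engine could run on each data access. The policy layout and function name are hypothetical and only illustrate the idea; they are not taken from an existing engine.

```python
# Minimal purpose-limitation check (IT Req 5): every data access is evaluated
# against the scope recorded in the consent policy. Illustrative names only.
consent_policy = {
    "data_subject_id": "ds-42",
    "data_categories": {"glucose_level"},
    "allowed_purposes": {"diabetes_monitoring"},
    "allowed_processors": {"hospital-analytics"},
    "revoked": False,
}

def access_allowed(policy: dict, processor: str, category: str, purpose: str) -> bool:
    """Return True only if the requested access stays within the consented scope."""
    if policy["revoked"]:
        return False
    return (
        processor in policy["allowed_processors"]
        and category in policy["data_categories"]
        and purpose in policy["allowed_purposes"]
    )

# An unconsented purpose is rejected even for an authorized processor.
print(access_allowed(consent_policy, "hospital-analytics", "glucose_level", "diabetes_monitoring"))  # True
print(access_allowed(consent_policy, "hospital-analytics", "glucose_level", "marketing"))            # False
```

A purpose update (IT Req 6) then amounts to rewriting the allowed purposes and re-running the same check over the affected data flows.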
• IT Req 7: Data collection limitation to the purpose. A controller has to filter data at the ingestion layer (collection phase) based on the purposes linked to every data usage, by using annotation techniques (data minimization). When data are annotated with the appropriate purpose, they can be easily filtered. Anonymization or pseudonymisation might be helpful in order to minimize the collection of excess data. Also, storage limitation requirements play an important role in minimizing data. (The controller may only collect and process data that are necessary for the defined and documented purpose. This includes each individual data attribute as well as the overall data set.)
• IT Req 8: Accuracy verification of any personal data created or updated, and consideration of accuracy challenges and mistakes. A controller has to check data sources against the expected data sources defined in the consent policies. To provide data quality, techniques such as trusted execution environments (TEE) [45] and remote attestation [46] can be used to build trust in data sources. Personal data shall be accurate and, when necessary, kept up to date. Every reasonable step must be taken to ensure that inaccurate data, taking into consideration the purposes of the processing, are erased or rectified without delay. Here, the controller has to implement CRUD functionality for the DS dashboard (data management: Create, Read, Update and Delete). This requirement refers to implementing the rights of the data subject.
• IT Req 9: Dynamic verification of the data policy implementation. A controller has to give the DS the ability to create or update his policies dynamically by accessing the policy storage and checking interference problems. In this requirement, we need to implement appropriate algorithms for policy checking. The controller also has to respect the DS rights by providing the sharing, deletion, rectification and calculation of DS data as features.
• IT Req 10: Consideration of which data to store, why and for how long. A controller has to extend the policy implementation in order to define conditions for storage limitation. The controller has to give the DS the ability to express, for a given data item, why it should be collected and how long it will be stored or processed. These policies need to be stored and evaluated with a policy engine for each data access. This shall provide the controller with the ability to identify any personal data no longer required. This helps the controller to identify and transform stored personal data no longer required into anonymous data. The anonymous data shall be in a format that prevents de-anonymization with realistic effort. When data are anonymized, the system shall replace all copies of the original data by the anonymized data, unless they are deleted. In addition, the controller should have the ability to delete personal data identified as no longer required, including all instances of the data such as backups.
• IT Req 11: Security policies definition as integrity and confidentiality constraints. A controller has to provide authentication and authorization techniques for processors, third-party users and IT personnel. The controller can propose different protocols and modalities (push/pull) to authenticate and establish secure connections with third-party platforms, devices or users. For example, authentication can be based on certificates and/or access tokens, with SSL/TLS connections supported by mutual authentication in the best cases. Secure communications are required for all kinds of connections involving data source devices, ingestion brokers, Big Data applications, dashboards and storage containers. A controller can also protect data by using adequate encryption algorithms such as attribute-based encryption (ABE) [47]. Besides, following the DS consent, the controller has to hide the identity by using a pseudonym and ensure a pseudonymous identity that cannot be linked to a real identity during online interactions with third-party users and IT personnel.
• IT Req 12: Compliance demonstration for DPO and DS: demonstrating the principles' implementation. A controller has to monitor, block and audit by collecting, consolidating, securing,
and analyzing audit logs. Tracking the risk assessment history highlights how much progress is being made over time. Also, audit logs are used to demonstrate accountability by comparing the processing scope against the consent policy defined by the DS. The controller has to support auditing for each data subject (DS), to monitor who has accessed their personal data. The DS has to be able to access the auditing data, obtaining details about the accesses, such as when, where, how, and who accessed the data. This feature is requested explicitly by the GDPR. The processing of personal data should be documented, and this documentation should be versioned and kept up to date. Also, continuous testing for software vulnerabilities should be performed, re-testing all security requirements for new releases as well as re-evaluating whether new security requirements have emerged through progress in the state of the art.

Overall, GDPR addresses the key security tenets of confidentiality, integrity, and availability of systems and data. Starting from these IT design requirements, we propose the framework for GDPR compliance implementation as technical components in the next section.

5.1. Framework architecture for GDPR compliance implementation in Big Data systems

To implement the main IT design requirements in a Big Data system for GDPR compliance, we analyzed the data life-cycle within a Big Data system and we highlighted the different data touchpoints by users or applications following the different layers of Fig. 1. Our analysis resulted in five data touchpoints going from the data sources layer to the distribution layer, presented in Fig. 3. Data touchpoints are a set of contact points with data by a user or by an application. For example, in the distribution layer, a service or an application from the services layer may ask to access the data. The touchpoint is the communication channel that has to be controlled so that only authorized data are disseminated to the application.

Fig. 3 highlights the main functional components of our framework. The framework components are typically located in the data security and governance layer of the Big Data architecture and can be distributed following the workload and application type. Each component communicates with the appropriate layer of the architecture, following the data touchpoints distribution. These components are flexible, adaptable, and interact in a complementary way to provide compliance with the target system and workload. We identified a central component, the Data Flow Manager, that plays the role of orchestrator for the other components. Furthermore, this component centralizes the global view of data as it flows in the Big Data system. Indeed, we noticed that data tracking and information control require a global view of the data dissemination. Thus, a graph-based representation of the system and data is very useful to control data access rights and to detect security breaches. For instance, we provide in Fig. 3 a graphical representation of data dissemination in a Big Data system.

We present in this section the main components of the framework and then describe their operation and inter-communication.

• C1: Data Flow manager. With a centralized representation of data flow throughout the Big Data system, this component can manage the data life-cycle (from collection to deletion) and detect security breaches. It relies on the other components (C2-C10) for specific tasks like notifying the DS in case of a data breach and keeping the DS informed about his data usage (IT Req 3 and IT Req 4). An illustrative sketch of this graph-based checking is given at the end of this subsection.
• C2: Access and Consent manager. It implements four IT design requirements: IT Req 1, IT Req 3, IT Req 5 and IT Req 10. It provides the API to get the consent policies from the user and stores them in the Consent Policies Storage component C8. It also implements and enforces the GDPR DS rights: the right to be informed, to access, to erase, to update, the right of portability, processing restriction, and the right to object.
• C3: Collection manager. It implements the following IT design requirements: IT Req 2, IT Req 4, IT Req 7 and IT Req 8. It is in charge of annotating raw data with the consent policies defined in the Access and Consent manager. Data minimization is enforced in this component by filtering the collected data at a very early stage.
• C4: Security manager. It implements IT Req 11. It defines security policies as integrity and confidentiality constraints. Also, as explained in IT Req 11, it provides pseudonymisation, which is required by the GDPR. For this goal, Big Data technologies such as Apache Ranger [48] can be used for pseudonymisation. Another technique is tokenization, where data are permanently replaced by a substitute value, making the original data completely unrecoverable. Another solution is encrypting the identity of the DS using a robust algorithm such as ABE [47].
• C5: Processing manager. It implements IT Req 2 and IT Req 3. It helps controllers check the processing scopes against the consent policies defined by the DS in the consent manager component. Also, every processing activity is stored in the Logs and Processing Records component C9, which is a database.
• C6: Storage manager. It implements the IT Req 6, IT Req 8 and IT Req 10 requirements. It helps controllers manage the stored data (raw data and calculated data). This component allows the management of data location and Data Time To Live (DTTL) as a policy, by regularly checking the storage purpose, the DTTL, and record updates. It controls data transfer by comparing the data location defined in the consent policy and the current physical location of the data.
• C7: Distribution manager. It implements authentication and authorization techniques to control user or service access to personal data (IT Req 11). It can also provide the data-finding feature (IT Req 4) if the user is a data subject.
• C10: Assessment and Risk manager. It implements IT Req 12. It automates different functions of a privacy program, such as deploying Privacy Impact Assessments (PIAs), locating risk gaps, demonstrating compliance and helping privacy officers scale complex tasks requiring spreadsheets, data entry and reporting. This component can be used by the data protection officer (DPO) as an audit service. For example, it provides as output to the DPO the GDPR-compliance status along with accountability information and demonstration. This is achieved by comparing the scope of the consent policy collected from the DS against the business policy derived from the services' processing. Then, it reports a compliance status for each service and demonstrates compliance or non-compliance for users.

In addition to these components, two metadata databases are generally required in this GDPR architecture design: Logs and Processing Records (C9) and Consent Policies Storage (C8). These two storage points help to store all processing records and logs in order to show compliance to the DPO, and also to store the consent collected from the DS as security policies. These proposed components interact with each other; their relationships are detailed in the next section.
• Action 1: The DS uses a mobile or web application to access his Big Data application. During this setup, he goes through a human-language consent request associated with a data usage policy. Before starting the collection of the DS's data in the ingestion layer, a controller should obtain the DS consent.
• Action 2: C1 collects the consent signed by the DS and sends it to the Collection manager C3. C3 annotates the collected data with the adequate policies. Metadata management techniques can typically be used to extract the data flow as a graph representing data lineage. Afterward, the collected data (a node in the data flow graph) are labeled with the collected DS consent information as annotations or tags.
• Action 3: Inside C1 and based on the data flow graph, a policy checker verifies that each data annotation takes into account the preferences of the DS and that each processing or storage layer has access to authorized data only. This task is performed each time new data are received or whenever newly calculated data are obtained. If everything goes right in the configuration process, policies are enforced and stored in C8 (C1 sends the policies to C8); otherwise, a notification is sent to the DS for consent rectification in C2.
• Actions 4-5: After security checking, C1 sends the collected consent from C2 to C4 for integrity and confidentiality implementation. Following the annotation, data are either encrypted or, when private attributes are annotated with "PII", hidden (anonymized).
• Action 6: In the processing and storage layer, the data lineage or data flow graph of all stored or processed data is captured. As presented in Fig. 3, the DS's data are stored in a data lake and processed in batch or real-time mode for the purposes defined in his signed consent. A unique token is generated for the DS's data and returned to him as a confirmation ID for his account creation in the application. Using this unique ID (or Token), he can access and control all his collected and processed data and his consent. He can also manage his rights as defined in the GDPR articles. C5 and C6 monitor and control the data storage and usage, then return the result (as a data flow graph) to C1, which executes the configuration checking and either returns a notification to the DS if a problem exists or simply displays accountability information in the DS dashboard and the DPO dashboard.
• Action 7: In order to demonstrate compliance to the DPO, an Assessment and Risk manager dashboard is provided. A compliance checking is performed here: C10 retrieves the consent policies stored in C8 (the Consent Policies Storage database) and, from C9, the logs and processing records for each data category.
• Action 8: Finally, a Transparency and Compliance dashboard is also provided to the DS. First, the transparency dashboard gives all details on the processing of the collected data in order to check that the DS consent is respected. The DS can decide to revoke the given consent and ask the controller to delete all of his data; the information stored in the storage points to the data he is referring to, hence all traces are automatically deleted. Alternatively, he can simply rectify his consent, in which case the policy configuration checker is activated by C1 with every change to keep the policy implementation valid. In addition, the controller needs to provide a simple representation of the DS data flow graph and the propagation of his consent with fine granularity. First, C1 extracts the graph from C5 and C6 via the provided TrackToken and sends it to C2. Then, C2 redefines the data flow graph using a graph calculation framework. The DS thus has a complete overview and full control of his collected data and consent.

In this section, we started from the GDPR principles and obtained as output a set of well-defined functional components. As shown, the framework components' operations are valid for a wide range of Big Data applications and domains. Now that the requirements are defined, we need to provide an implementation technology for each component. In the next section, we study the GDPR tools and identify the best implementation for every component.

6. IT tools for GDPR implementation

GDPR-oriented tools are divided into 3 main categories: (1) Academic GDPR tools, (2) Industrial GDPR tools and (3) Apache tools that are built into Big Data solutions. The next sections summarize these three categories.
6.1. Academic GDPR tools

In the past three years, many authors have worked to provide privacy tools for GDPR. These tools partially cover GDPR principles and articles.

In [20], the authors analyzed the impact of GDPR on storage systems and extracted the security requirements and the adequate storage features to provide GDPR compliance. They then took the case of Redis [49], extended its features in order to be GDPR compliant, and measured the performance overhead of each modification. They found that achieving strict compliance efficiently is difficult. The authors highlighted three key challenges for GDPR compliance: efficient logging, efficient deletion and efficient metadata indexing. This work addresses three components of our framework: the Storage manager, the Security manager, and the Collection manager.

Authors in [10] propose privacyTracker, a GDPR-compliant tool that covers data traceability and transparency. They implement some GDPR rights such as data portability and the right to erasure. The privacyTracker framework is an approach that empowers consumers with appropriate controls to trace the disclosure of data as collected by companies, and to assess the integrity of this multi-handled data. This is accomplished by constructing a tree-like data structure of all entities that received the digital record, while maintaining references that allow traversal of the tree from any node, both in a top-down and a bottom-up manner. A prototype was developed based on the privacyTracker principles as a proof of concept of their viability. This work addresses the Collection manager and the Distribution manager.

For GDPR accountability in IoT systems, an IoT Databox model is proposed, providing the mechanisms to build trust relations in IoT [11]. The IoT Databox is an edge solution that implements the local control recommendation and collates personal data on a networked device situated at home. It meets the external accountability requirement by surfacing the interactions between connected devices and data processors and by articulating the social actors and activities in which machine-to-machine interactions are embedded, through a distinctive range of computational mechanisms. This model touches more than one component of our framework: the Access and Consent manager, the Collection manager and the Distribution manager.

In a previous work [50], we proposed a GDPR controller for IoT systems where security, transparency and purpose limitation are implemented. In that work, we start by providing three components: the Access and Consent manager, the Security manager and the Data Flow manager using Kafka topics. Also, in [51], we propose a security model for data privacy and an original solution where a GDPR consent manager is integrated using a Complex Event Processing (CEP) system [52], following the edge computing paradigm. We show, through a smart home IoT system, the efficiency of our approach in terms of flexibility and scalability. We express policies in the 5W policy model: it is crucial for an individual to be sure that what he has shared is exactly what he wants to be shared, to whom, for what purpose and when. Individuals must have control over their data and must be able to give or revoke permission to access their data for a given service whenever they want.

In [53], the authors present TagUBig - Taming Your Big Data, a tool to control and improve transparency, privacy, availability and usability when users interact with applications. For IoT systems, ADvoCATE [9] allows data subjects to easily control consents regarding access to their personal data; the proposed solution is based on Blockchain technology. Juan Camilo proposed another Blockchain-based solution to implement consent in GDPR [54]. This work provides data subjects with a tool to assert their rights and get control over their consents and personal data. These works are different implementations of the framework components using Blockchain technology.

In [55], the authors discussed how static program analysis can be applied to detect privacy violations in programs. The solution is based on classical information flow control techniques, tainting and backward slicing. Although important, the solution addresses a limited part of the data control requirements in GDPR.

To the best of our knowledge and from our study of the related works presented in Section 2 and the academic tools, no work implements all the framework components to face the challenges of a Big Data architecture. Industrials, attempting to be GDPR compliant and to avoid GDPR penalties, reinforced their products by adding GDPR features or by providing GDPR tools for that purpose.

6.2. Industrial GDPR tools

In addition to academic tools, industrial security tools have been proposed and are classified in our comparative table. Generally, the companies providing these tools do not give details on the implementation. We identify some of them in this section:

• The Absolute Platform: This tool provides visibility and control. It addresses GDPR prerequisites by observing and verifying PII (Personally Identifiable Information), avoiding data breaches and automating remediation. The main features of this tool are not very well detailed, but we can consider that it partially provides the Collection manager component functionalities [56].
• AlgoSec: AlgoSec is an automation solution for network security policy management. With AlgoSec, security policy changes can be processed accurately in minutes or hours, not days or weeks. Using intelligent, highly customizable workflows, AlgoSec streamlines and automates the entire security policy change process, from planning and design to proactive risk analysis, implementation on the device, validation and auditing. This tool partially addresses the Access and Consent manager [57].
• Collibra: The Collibra Platform is built on a foundation of data control and governance to ensure the security of user and enterprise data. It creates and maintains a rigorous control security framework built around regulatory, legal and statutory requirements as well as industry best practices. This tool addresses data privacy rights and finding data through cataloging and lineage techniques to get the full story behind data [58].
• Compliance Forge: It offers project management tools for privacy by design. It uses automation to integrate security and privacy controls into standard project management processes [59].
• MY DATA manager: MY DATA manager was developed to address GDPR challenges by simplifying and automating the management of GDPR compliance specifications and processes. It provides a set of features such as data mapping, compliance assessment, data inventory activities and a data explorer [60].
• Alien Vault USM: This tool helps to detect data breaches and monitor data security. The unified platform centralizes essential capabilities like asset discovery, vulnerability scanning, intrusion detection, behavioral monitoring, log management and threat intelligence updates [61].
• BigId: This tool assures data minimization through duplication discovery and correlation. It satisfies customer data portability and supports and enables the right to be forgotten. In addition, it reveals the enforcement of customer consent for personal data collection, data residency flows and risk profiling with breach notification windows [62].
• BWise GDPR Compliance solution: This tool helps to build data views, data control and compliance. It helps to efficiently collect, access, transfer or share data assets and to safeguard data privacy and data protection [63].
• Consentua: This is a consent choice and control tool that enables users to choose and control their personal data. It empowers an increasingly trusted and straightforward relationship between the client and the service provider. It captures consent throughout the customer journey as needed. It then gives the user the ability to control data processing in real time. Finally, it gives a picture of how, why and where consent was collected [64].
• PrivacyPerfect: This work is composed of a set of tools, such as assessment, processing and dashboard tools specially designed for chief privacy officers, reports, legal processing grounds and graphical overviews [65].
• Hashicorp Vault: Hashicorp Vault manages secrets and protects sensitive data securely; it stores and tightly controls access to tokens, passwords, certificates and encryption keys for protecting secrets and other sensitive data using a UI, CLI, or HTTP API. This product addresses a set of IT design requirements such as establishing control with policies and rules, access control and data protection. It is also considered by the Hashicorp company as a product for GDPR compliance [7].
• One Trust: It automates the intake and fulfillment of consumer and subject rights requests. It also leverages intelligent risk mitigation to discover and address risks faster, and provides assessment automation and targeted data discovery. It addresses a set of GDPR articles in order to provide compliance. The main features are: assessment automation, data inventory and mapping, and targeted data discovery [66].
• Skyhigh Networks: This helps gain complete visibility into data, context, and user behavior across all cloud services and devices [67].

6.3. Apache tools

Apache has developed a set of tools to provide security features to Big Data system architectures. These technologies can be used to address parts of the GDPR requirements when used appropriately. Here are some popular solutions:

• Apache Eagle: Apache Eagle is an open-source solution for identifying security and performance issues instantly on Big Data platforms like Apache Hadoop and Apache Spark [68]. It analyzes data activities and daemon logs. It provides a state-of-the-art alert engine to identify security breaches and performance issues and to show insights [69].
• Apache Atlas: Apache Atlas is an open-source solution used for data tagging. It provides open metadata management and governance capacities for organizations to make a catalog of their data assets, classify and govern these assets, and provide collaboration capabilities around these data assets for data scientists, analysts and the data governance team. It helps implement the Finding Data and classification IT design requirements in order to build a data flow graph to track data [70].
• Apache Ranger: Apache Ranger is an open solution that helps developers to enable, monitor and manage data security across the entire Hadoop platform. The vision with Ranger is to provide a framework for the central administration of security policies and user access monitoring [48].

6.4. A comparative study

The comparison of the studied tools is tedious work since they come from different communities and target different objectives and varying application contexts. We propose situating these tools against the framework components to measure their compliance with GDPR and also for reuse purposes. Indeed, instead of reinventing the wheel, some ideas and implementations can be reused when designing Big Data systems, even though the work context can be different. Consequently, a global picture of these solutions and the parts of the regulation they implement can be very helpful to Big Data system designers. Using the defined framework, we can set up the comparative table below. We adopt the following notations:

• ✓: indicates that the component is implemented by the referenced work.
• ∼: indicates that the component is partially addressed by the referenced work and that some of its IT design requirements and features are not implemented.
• ×: indicates that the component is not implemented by the referenced work.

In Table 3, the framework components defined in Fig. 3 are partially implemented in each solution. There are no tools that implement all components. More precisely, we can see that the components that are most covered are C1, C2 and C8. They mainly focus on security policy definition and management. This can be explained by the fact that involving people in the definition of their security constraints and the tracking of their data has been the focus of many research works, even before GDPR was adopted. Indeed, providing a practical and intuitive API for users who are not necessarily security experts was considered, for many years, a priority in privacy-sensitive systems like e-health and other IoT systems. On the other hand, the C3 component, addressing data annotation, is less implemented compared to the other components. With the data heterogeneity and multiple sources of Big Data, data annotation becomes necessary for tracking and controlling data flows; this technique is rarely required in small systems with a uniform data format and source. Also, C9 and C10 are considered only in a few recent works. They address a requirement newly created with GDPR, which consists of interfacing with the data protection officer (DPO) and the supervisory authority.

7. The framework implementation and application

In this section, we propose an implementation of the framework based on the tools studied in Section 6. We select from Table 3 the best candidate technology for each component in the context of our use case. Then, we present the framework application in order to improve a previous work on GDPR-compliance in e-health systems. We describe the framework usage and evaluate its overhead on the application performance.

Our implementation is based on Apache Ranger [48] and Atlas [70]. Indeed, these technologies represent the de facto standard governance layer in Big Data systems. Furthermore, following Table 3, Apache Ranger is already adopted for the masking features in C4, access control in C5, C6 and C7, and auditing information in C9 and C10. Also, Apache Atlas provides a global view of the collected data as data lineage in C3, C5 and C6 (data discovery) and allows data annotation. Moreover, it can be configured with the chosen Apache Ranger to translate the defined tags or annotations automatically into policies (tag-based policies). The main functionalities of these standards are as follows:

• The Atlas-Ranger integration unites the data classification and metadata store capabilities of Atlas with the security enforcement in Ranger. Ranger implements dynamic classification-based security policies (tag-based policies). Ranger's centralized platform empowers data administrators to define security policies based on Atlas metadata tags or attributes and to apply these policies in real time to the entire hierarchy of entities, including databases, tables and columns, thereby preventing security violations. Ranger tags are attribute-based; every tag can have attributes, and tag attribute values are used in the tag-based policies to control the authorization decision. When configuring Atlas-Ranger to work together, TagSync is activated. Ranger TagSync is used to synchronize the tag store with the external metadata service, Apache Atlas; it receives tag details from Apache Atlas via change notifications. As tags are added to, updated on or deleted from resources in Apache Atlas, Ranger TagSync receives notifications and updates the tag store in order to keep the policy implementation valid (a minimal illustration of this flow is sketched below).
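To make the Atlas-to-Ranger flow concrete, the sketch below shows how a DS's 5W consent could be attached as an Atlas classification on an ingested data asset, so that Ranger TagSync can pick it up and the corresponding tag-based policy is enforced. This is an illustrative sketch, not code released with the paper: the Atlas endpoint and credentials, the custom GDPR_5W classification type, its attributes and the entity GUID are all assumptions and must be adapted to the deployed Atlas/Ranger versions.

```python
# Hypothetical sketch: attach a GDPR 5W classification to a data asset in
# Apache Atlas so that Ranger TagSync can turn it into a tag-based policy.
# The classification type "GDPR_5W", its attributes, the endpoint and the
# entity GUID are illustrative assumptions.
import requests

ATLAS_URL = "http://atlas-host:21000/api/atlas/v2"   # assumed endpoint
AUTH = ("admin", "admin")                             # assumed credentials

def tag_entity_with_5w(entity_guid: str, consent: dict) -> None:
    """Post a 5W classification on the Atlas entity identified by its GUID."""
    classification = {
        "typeName": "GDPR_5W",            # assumed custom classification type
        "attributes": {
            "what":  consent["what"],     # e.g. "blood_pressure"
            "why":   consent["why"],      # e.g. "diagnostic"
            "who":   ",".join(consent["who"]),
            "where": consent["where"],    # e.g. "EU"
            "when":  consent["when"],     # data time-to-live (DTTL)
        },
    }
    resp = requests.post(
        f"{ATLAS_URL}/entity/guid/{entity_guid}/classifications",
        json=[classification],
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    # Once TagSync has synchronized the tag store, a Ranger tag-based policy
    # defined on "GDPR_5W" applies to every entity carrying this tag.

if __name__ == "__main__":
    alice_consent = {
        "what": "blood_pressure",
        "why": "diagnostic",
        "who": ["diagnostic_service", "state_prediction_service"],
        "where": "EU",
        "when": "P6M",   # illustrative 6-month retention period
    }
    tag_entity_with_5w("entity-guid-placeholder", alice_consent)
```

On the Ranger side, the corresponding tag-based policy would then, for example, grant read access on any resource tagged GDPR_5W only to the services listed in the who attribute; the endpoint path and payload shape should be checked against the Atlas REST API documentation of the version in use.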
Table 3
Comparative table of the IT GDPR tools against the framework components.

Category / Tool                          C1  C2  C3  C4  C5  C6  C7  C8  C9  C10

GDPR analysis
  GDPR for healthcare [18]               ×   ×   ×   ×   ×   ×   ×   ×   ×   ×
  GDPR in Health Clinics [16]            ×   ×   ×   ×   ×   ×   ×   ×   ×   ×
  GDPR investigation [21]                ×   ×   ∼   ∼   ×   ∼   ×   ×   ×   ×

Academic GDPR tools
  PrivacyTracker [10]                    ∼   ×   ✓   ∼   ×   ×   ✓   ×   ×   ×
  Storage system for GDPR [20]           ×   ×   ✓   ✓   ×   ✓   ×   ∼   ×   ×
  IoT Databox [11]                       ∼   ✓   ∼   ×   ×   ∼   ✓   ∼   ×   ×
  GDPR Controller [50]                   ✓   ✓   ×   ✓   ×   ×   ×   ∼   ∼   ×
  TagUBig [53]                           ✓   ∼   ×   ✓   ∼   ×   ×   ×   ∼   ×
  Consent management [54]                ✓   ✓   ∼   ×   ×   ×   ×   ✓   ×   ×
  Special privacy [6]                    ✓   ✓   ∼   ∼   ×   ∼   ×   ✓   ✓   ∼
  GDPR in smart home systems [51]        ∼   ✓   ∼   ×   ✓   ∼   ∼   ✓   ×   ×
  SAC [71]                               ×   ×   ∼   ×   ∼   ∼   ×   ×   ∼   ∼

Industrial GDPR tools
  The Absolute Platform [56]             ∼   ×   ✓   ∼   ×   ∼   ×   ∼   ∼   ∼
  Alien Vault USM [61]                   ✓   ✓   ∼   ×   ∼   ×   ∼   ∼   ✓   ∼
  BigId [62]                             ✓   ✓   ∼   ×   ×   ×   ×   ✓   ×   ×
  BWise GDPR solution [63]               ✓   ×   ✓   ×   ×   ×   ✓   ∼   ∼   ∼
  Consentua [64]                         ∼   ✓   ∼   ∼   ×   ×   ×   ✓   ×   ×
  PrivacyPerfect [65]                    ✓   ×   ∼   ×   ∼   ×   ×   ×   ∼   ×
  Algosec [57]                           ∼   ✓   ∼   ×   ∼   ✓   ✓   ∼   ∼   ∼
  Hashicorp Vault [7]                    ×   ×   ∼   ✓   ∼   ✓   ×   ∼   ∼   ∼
  Collibra [58]                          ✓   ∼   ✓   ×   ∼   ×   ∼   ×   ∼   ×
  One Trust [66]                         ✓   ✓   ✓   ∼   ∼   ∼   ∼   ∼   ✓   ∼
  Skyhigh Networks [67]                  ×   ✓   ∼   ×   ∼   ×   ×   ∼   ×   ∼
  Compliance forge [59]                  ∼   ✓   ∼   ×   ∼   ×   ×   ✓   ∼   ×
  MY DATA manager [60]                   ∼   ✓   ∼   ×   ∼   ×   ×   ∼   ✓   ∼

Apache tools
  Apache Eagle [69]                      ×   ×   ∼   ∼   ∼   ∼   ∼   ✓   ×   ×
  Apache Atlas [70]                      ✓   ∼   ✓   ×   ✓   ✓   ×   ∼   ✓   ×
  Apache Ranger [48]                     ∼   ×   ∼   ✓   ×   ✓   ✓   ✓   ∼   ×
• For certain business use cases, data should have an expiration date for business usage (Data Time To Live). This can be achieved with Atlas and Ranger: Apache Atlas can assign an expiration date to a data tag, and Ranger inherits the expiration date and automatically denies access to the tagged data after it. This feature helps considerably in controlling data storage and data access in a GDPR context.
• Controlling data access with location-specific access policies, similar to time-based access policies. This helps take into consideration the geographical location of the user when accessing the data.
• Visualizing the data lineage in the Big Data application, delivering a complete view of data movement across several analytic engines such as Apache Storm, Kafka, Falcon, Hive and, recently, Spark. As this tracking is carried out at the platform level, any application that uses these engines will be natively tracked with Atlas and secured with Ranger.

Apache Ranger and Apache Atlas represent the core of our framework implementation. We take advantage of the data interception techniques, the connectors to the manager layer and all the described features of data lineage construction, policy checking and enforcement. In our implementation, we added the following functionalities:

• A policy model to express the DS consent following an extended GDPR 5W model.
• A security policy configuration checker compliant with the 5W model.
• Attribute-based encryption using CP-ABE.
• A set of user-friendly interfaces for the DS and the DPO.
• A notification system about security breaches using Kafka.

Regarding the security policy model, we proposed a taxonomy for privacy policies called 5W [51], but other taxonomies can be used. In the 5W model, the DS is asked to respond to the 5W questions: what data are to be processed? why? how and where are his data stored? who can access them and why? are they up-to-date and accurate? and how long will they be kept? Following the defined 5W policy, a configuration checking is performed. For this security policy checking, we proposed in a previous work [51] the following formal check. Let D be the set of data items D = {d1, d2, …, dn}, P the set of processors P = {p1, p2, …, pn}, E the set of storage spaces E = {e1, e2, …, en} and L the set of labels (DS preferences) expressed in the 5W GDPR policy form. For an incoming data item d annotated with the 5W policy (with the tags what, who, when, where, why), the security configuration is correct if it meets the user preferences regarding the same data owner. More formally, let L(x) denote the label assigned to an element x ∈ P ∪ E ∪ D; the configuration is accepted if:

• For p ∈ P and d ∈ D, a process p is authorized to process a data item d if L(d) ⊆ L(p), i.e., d.what == p.what, d.why == p.why, d.where == p.where, d.who ⊆ p.who and d.when > p.when.
• For e ∈ E and d ∈ D, a storage space e is authorized to store a data item d if L(d) ⊆ L(e), i.e., d.what == e.what, d.why == e.why, d.where == e.where, d.who ⊆ e.who and d.when > e.when.

After installing and configuring Atlas-Ranger, we implemented in Atlas a GDPR classification following the GDPR 5W policy [51]. We provide a user interface so that a DS can define his constraints following the 5W policy. The policy is then sent to Atlas via a REST API to tag the collected DS data. Atlas annotates the adequate selected data captured by its hooks with 5W tags and propagates them to the rest of the data lineage to control the calculated data.

The policy configuration checker is executed before enforcing this data. We capture the newly annotated lineage from Atlas via its REST API and execute the checker to verify whether a problem is detected (we used a Java implementation for the policy configuration checker). The policy checker verifies that each data annotation takes into account the preferences of the DS and that each processing or storage layer has access to authorized data only. This task is performed each time new data are received or whenever newly calculated data are obtained, as explained in the proposed security model [51]. Once the configuration is performed, if a problem exists a notification is sent to the DS for
consent update. If no problem is detected, the 5W policy is enforced and the data are secured. One of the 5W policy attributes indicates whether data should be encrypted or not; if yes, a CP-ABE encryption algorithm is executed [50].

Once Atlas-Ranger is configured, the TagSync module is activated. More precisely, policies are stored and enforced in Apache Ranger as tag-based policies automatically, using the TagSync module. TagSync is used to populate the tag store from the tag details available in an external system, in our case Apache Atlas [70]. TagSync is a daemon process; in the currently used release, Ranger TagSync supports receiving tag details from Apache Atlas via change notifications. As tags are added to, updated on or deleted from resources in Apache Atlas, Ranger TagSync receives the notifications and updates the tag store. The GDPR 5W policies are enforced as presented in Fig. 4(b) as a "Tag-Based Policy". Each W is automatically mapped to one of the Apache Ranger policies, as illustrated in Fig. 4(b). The topic of the What tag can be a database, a table or a column, and the masking option is a simple tag such as PII (for the value "true" in the policy): if the data are tagged as PII, they are hidden by Apache Ranger (data masking). The When tag is a period for the DTTL; we can set the start time and the end time, and the collection time is compared to the current time (system date) for each request. The Who tag contains two options: the owner, who is the connected user and owner of the Token, and the processors, which are listed and grouped by user group. The Why tag also contains two options: the purpose, defined in the description of the policy (in future work, we will work on the semantics of the purpose as defined by the DS), and the how option, defined as permissions (alter, create, read, update, write). Finally, the Where tag is detailed in three tags: the source, verified in the security checking process as the first data collection point; the destination, which is the set of all storage spaces in the graph; and the transfer right, represented as a boolean and controlled by the location-specific access feature of Ranger.

In addition to the provided consent form, we display a user-friendly data flow graph for the DS, built from the JSON captured from the Apache Atlas lineage through its REST API. We then redefine the data flow graph using a graph calculation framework, Apache TinkerPop [72], with a lighter representation compared to the Apache Atlas lineage in Fig. 4. Also, Apache Atlas does not provide a hook (connector) to collect metadata from Apache Spark processes. This is why we used a recent implementation of the Spark-Atlas-Connector SAC [73], from our Table 3, to obtain the lineage of the data processed by Apache Spark. After performing the right configuration, Apache Atlas can detect all Spark processes, as illustrated in Fig. 4. Hence, our framework is ready for any processing type, even complex Spark processes.

Fig. 5 displays the main technical components selected for the framework implementation.

• C1 implementation: For task orchestration and inter-component communication, Apache Kafka [74] is classically used [50]. Indeed, the publish/subscribe broker of Kafka allows for loosely coupled and scalable communication between the different framework components. Components are notified about new data, new security policies and security breaches when they occur. All components are information producers and also information consumers, which allows C1 to play the role of a central hub in the management layer. A particular consumer component is the policy checker, which relies on the annotated graph for checking policy compliance between annotated data on one side and storage spaces and processes on the other side [50].
• C2 implementation: We provide a user interface so that a DS can define his constraints following the 5W policy and track his data throughout the whole Big Data life-cycle with fine granularity.
• C3 implementation: In the Gateway, we need to filter the collected data in order to provide data minimization. The DS consent is used both as a data annotation and as an access control policy in Atlas-Ranger.
• C4 implementation: We adopt the same implementation of the Crypto-Engine, using a particular ABE algorithm called CP-ABE [50]. Attribute-Based Encryption (ABE) is a form of public-key encryption. ABE algorithms are a good candidate to achieve privacy and fine-grained access control for Big Data applications running on Cloud servers. Furthermore, [75] shows that the proposed scheme can not only achieve fine-grained access control but also resist collusive attacks. Our choice is additionally motivated by the CP-ABE evaluation in [50], where the authors show an acceptable overhead; we confirm this result in the evaluation part of our work.
• C5 and C6 implementation: These two components are responsible for controlling the storage and processing of data. Thanks to Apache Atlas and Apache Ranger, we collect in real time the process scope and manage the stored data, the storage location and the "data time to live" against the consent policy defined by the DS. Every processing and storage operation is stored in the Logs and Processing Records database (C9) via a specified Kafka topic of C1, in addition to the logs provided by Apache Ranger. In case of a breach, a notification is sent to the DS and to the DPO. We adopted Apache Atlas hooks: Atlas hooks (connectors) are provided in order to collect all metadata for transparency and accountability purposes, and the Ranger plugin verifies whether a storage space or a process is allowed to hold the data, also activating the policy configuration checker. Because Big Data applications need complex processing such as ML (Machine Learning) processes, we used Apache Spark in batch mode to process the stored collected data.
• C7 implementation: The distribution layer helps to control users' or applications' touchpoints with the newly calculated or stored data by evaluating access control with a Ranger plugin against the policy retrieved from C8. The data query is captured by the Distribution manager and then sent to the Ranger plugin for evaluation and access control. While authorizing an access request, the Apache Ranger plugin evaluates the applicable Ranger policies for the resource being accessed.
• C10 implementation: As Table 3 shows, Assessment and Risk management is not really implemented in existing systems, except in [6]. In our work, we use the logs and processing records to feed the Assessment and Risk manager component, C10, in order to demonstrate compliance to the Data Protection Officer. C10 provides a security dashboard and real-time logs. This component uses the consent policies together with the processing scopes as business policies (the processing logs provided by Kafka) to check that data processing and sharing comply with the relevant usage control policies. The Logs and Processing Records database is fed with Ranger logs collected from the Ranger plugin in the Distribution manager, where the access of all services is controlled.

The output of the framework implementation is detailed and illustrated in the next section as a real e-health application with a real data set.

7.1. Application to e-health GDPR-compliance

In this section, we use our implemented framework components to improve our previous work [50]. This work was developed as part of a client project for a digital healthcare publisher that offers software for remote monitoring, diagnostic assistance, and digital therapeutic education authorizing remote patient care. Their products assist medical staff in taking care of their patients remotely. They are designed around four pillars: remote monitoring (making it possible to monitor patients remotely and better adapt their care), diagnosis and prediction (assisting caregivers in the detection of relapses and therapeutic toxicities), and support, administrative and health statistics (allowing the therapeutic education of patients from a distance to guide and reassure
them and prepare recommendations based on historical data). Their applications and services are IoT based. We therefore consider a classical e-health IoT system with a set of sensors collecting a patient's state (let us call Alice the DS in our use case) and sending this data to a Gateway. A set of services collaborate to analyze the received data and to compute a diagnosis (Processing and Storage Layer). Typically, Alice's data are collected to develop a diagnosis (diagnostic service) and to predict changes in Alice's state (state prediction service). Data are selectively accessible by different services depending on the type of disease, and by research laboratories according to Alice's consent (Distribution Layer): for example, if Alice's data are communicated to a research lab or other services without Alice's consent, this event has to be detected, as data cannot be communicated without her consent. Shared and critical data must be protected against unauthorized access while providing accurate and fine-grained access control for the authorized actors. This use case highlights the need for setting multilevel and dynamic security policies. Alice can delegate her data control to third parties such as the medical staff, her doctor Bob or a parent. Alice, as a Data Subject (DS), has the right to define and modify her security policies (her consent) and is notified if they are violated (e.g., her data being used by auxiliary or administrative services). In our previous work, a multi-level security model is proposed to describe fine-grained access control policies. Data subjects can track their data flows and can be notified about any illicit access. Data are annotated following the security constraints and access control is executed at run time. However, it was not clear to us what exactly was covered by GDPR and what was missing; the framework allowed us to accomplish this task.

As shown in the comparative Table 3, many functionalities are missing in our previous work [50]. More precisely, three components are implemented (C1, C2 and C4), two components are partially implemented (C8 and C9) and five components (C3, C5, C6, C7 and C10) are missing. In the new version of this work, we reuse the implemented framework for GDPR-compliance with a Python/Spark process-based pipeline. In the use case implementation, we used batch-mode processing: data coming from the sensors or the GW are captured by the adequate Kafka topic in the ingestion layer, then stored in the hospital storage layer (HBase). The stored data are processed by Apache Spark for diagnostic purposes and by Hive for statistical and administrative purposes. Finally, the result is displayed to Alice's doctor in the services layer (a typical Big Data pipeline). In the use case architecture of Fig. 5, we highlight the different data touchpoints that need to be controlled by the proposed framework: the GW touchpoint, the ingestion layer touchpoint, the storage layer touchpoint, the processing layer touchpoint and finally the distribution layer touchpoint.
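To make this pipeline concrete, the sketch below shows what the diagnostic batch step could look like. It is an illustrative example, not the project's actual code: the Hive table names (hospital.blood_pressure, hospital.diagnosis_summary), the column names and the threshold are assumptions, and lineage capture presupposes that the job is submitted with the Spark-Atlas connector on the classpath, as described later in Section 7.2.2.

```python
# Minimal sketch of the diagnostic batch step, assuming the ingested
# measures are exposed as a Hive table named "hospital.blood_pressure".
# Table/column names are illustrative; when the job runs with the
# Spark-Atlas connector configured, the lineage of the derived table is
# reported to Atlas and the enforced 5W tag-based policies apply to it.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("diagnostic-batch")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the stored measures (storage-layer touchpoint).
measures = spark.table("hospital.blood_pressure")

# Simple diagnostic aggregation per patient (processing-layer touchpoint):
# average systolic/diastolic pressure and a flag for hypertensive readings.
diagnosis = (
    measures
    .groupBy("patient_id")
    .agg(
        F.avg("apHi").alias("avg_systolic"),
        F.avg("ap_lo").alias("avg_diastolic"),
        F.max(F.when(F.col("apHi") > 140, 1).otherwise(0)).alias("hypertension_flag"),
    )
)

# Persist the result for the services layer (distribution touchpoint);
# access to this table is then mediated by the Ranger plugin.
diagnosis.write.mode("overwrite").saveAsTable("hospital.diagnosis_summary")

spark.stop()
```

Submitted as, for example, spark-submit --jars spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar diagnostic_batch.py (the script name is hypothetical), the connector reports the hospital.blood_pressure → hospital.diagnosis_summary lineage to Atlas, where the 5W tags defined by Alice are propagated and then enforced by Ranger.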
Let us consider the following scenario for the implemented use case and check that the DS consent is respected. We suppose that we have 4 services in our use case: a diagnostic service, a state prediction service, a statistical and administrative service, and an auxiliary service. The framework is used to check the GDPR compliance of these services and whether Alice's consent is respected. A user interface asks Alice to respond to a 5W form: she must specify what data are to be processed, why, how and where her data are stored, who can access them and why, whether they are up-to-date and accurate, and how long they will be kept. Alice's policy says that the following IoT data can be collected: blood pressure, heart rate, temperature and location. These data are stored in the hospital's servers in the EU. The hospital (which plays the role of the controller) additionally asks if these data can be shared and used by the medical lab; Alice accepts this option of data forwarding. Fig. 6 shows the consent request presented as a 5W GDPR form to be filled in by Alice.

After collecting and securing her data, the controller, as mentioned, needs to demonstrate that Alice's consent is implemented as desired and designed, both to Alice as a DS and to the DPO, for transparency and compliance purposes.

To demonstrate compliance to the DPO, an Assessment and Risk manager dashboard is provided: here, a compliance checking is performed as explained in the components' implementation. Fig. 7(b) shows some results of the compliance dashboard that can be used by a controller to demonstrate accountability to the DPO. First, an investigation is presented to the DPO: for each data type, we can see all the related processing, purposes and storage spaces. If we take the example of blood pressure data, Fig. 7 shows the different services related to the processing of this data: the diagnostic, state prediction, administrative and auxiliary services. Here, the investigation process shows that Alice's medical data are used by the 4 services. Following Alice's consent, blood pressure data are not to be collected for administrative or auxiliary purposes. The compliance checking automatically starts the comparison after building the 5W business policy as a result of the investigation process. In Fig. 7(b), compliant and non-compliant services are detected in the hospital as a GDPR status. In case of compliance failure, it is important to explain which parts of the business policies are not adequate and cause the compliance checking to fail. It may also be useful to suggest possible corrections and generate a report to the SA.

Finally, a Transparency and Compliance dashboard is also provided to Alice, as a DS, to track and control her data. First, a timeline is illustrated, using the 5W consent attributes as a filter, with all the details on the processing of her data (Why, What, How, Who, Where and When), in order to check that her consent is respected. For example, in Fig. 8, Alice wants to know all the information about the processing of her blood pressure data. She filters by the "What" annotation and gets in her timeline when the processing of her blood pressure took place (date and hour), who collected her blood pressure data, for what purpose and how it is used.

Alice can now decide to revoke the given consent and ask the hospital to delete all of her data, as highlighted in Fig. 8. In addition to her timeline, Alice needs to see her data flow graph and how her consent is propagated. Therefore, the controller provides a simple representation: a high-level data flow displayed to Alice in her user interface with fine granularity, extracting exactly the data used in the query or the process. As a result, we obtained a user-friendly data flow graph for our use case, as shown in Fig. 9. We have 3 types of nodes: data D, process P or storage space S, as shown in Fig. 9 and as used in the policy configuration checker model definition. As we can see, Alice has a complete overview and full control of her collected data.

7.2. Evaluation of GDPR-compliance overhead

We distinguish two evaluations: performance overhead and engineering overhead. In the first evaluation, we measure the execution time for the security checking and encryption/decryption tasks. We also compare our use-case execution with and without the introduced security layer (implementing our framework). For the engineering overhead, we describe the engineering effort that is required to embed the security management layer into an existing Big Data solution. We explain that, since our implementation is based on Big Data technology (Apache Kafka, Apache Atlas, Apache Ranger), the performance and engineering overheads are minimal.

7.2.1. Performance overhead

Our evaluation settings are as follows:

• Environment: The evaluation is carried out on a PC acting as a gateway, with an Intel Core i5 up to 2.4 GHz and 8 GB of memory. The use case application is implemented in a Hadoop ecosystem on a cluster with 300 GB of memory and 24 cores. The cluster has three workers, each with 8 cores and 100 GB.
• Data: In this implementation, we used a real data set extracted from a Proxym-IT client database: an IoT e-health platform (e-Health, Big Data, Quantified Self, and Digital Health). Our industrial client maintains the patients' data, which doubles in size every 11 months. The health services and products have supported more than 500,000 patients since 2013. Patients' data are collected from different IoT devices such as smart bands, smart bathroom scales, smart blood pressure monitors, etc. The platform enables the creation of intelligent, rigorous, and engaging digital therapeutic supports and analyzes the collected data. In this evaluation, we focused on blood pressure measures. We extracted these measures for 100,000 patients, collected by smart blood pressure monitors. We set the smart blood pressure monitors to send one sensor reading every 10 min, i.e., 144 measures per day for each patient. The obtained data set contains almost 2 592 million measures of 100,000 patients over about 5 months. The provided data set contains 76 attributes, but our experiments use a subset of them. In particular, the following attributes are used: Patient ID (an id number representing the patient), Patient Name (a text string representing the patient name), Origin (a text string representing the source of the collected data: the device ID), Age (a text string representing the age of the patient), Gender (a text string representing the gender of the patient), height (cm), weight (kg), apHi (systolic blood pressure), ap_lo (diastolic blood pressure), cholesterol (1: normal, 2: above normal, 3: well above normal), gluc (1: normal, 2: above normal, 3: well above normal), smoke (a text string indicating whether the patient smokes or not), temp (a text string indicating the patient's temperature).
• Type of processing: In our evaluation, we evaluated one type of service performing simple operations. In the state prediction and monitoring service, the medical staff performs patient state monitoring based on the patients' historical data in batch mode, to better adapt their care and diagnostics. In addition, several patients have made many queries to send additional data or to get data about themselves. In this service process, we perform many Python/Spark operations such as "grouping", "selecting", "joining", "sorting", "fitting" and "predicting".

Considering the provided settings, we compare the execution time of two cases, with and without the data security and governance layer for GDPR-compliance. We measure the response time starting from the ingestion layer to the services layer (the diagnostic result is displayed to the doctor). We vary the number of users using the application (patients/second) to reach up to 600 users/second.

The measured response time does not include the time for sending data from the GW to the ingestion point nor the time for accessing data in the storage layer. We have 5 touchpoints: in the GW, data need to be encrypted; in the ingestion and storage layers, data need to be decrypted for use; in the processing layer, Spark processes need to be controlled; and finally, in the services layer, access to data by the doctor needs
to be controlled. Controlling the Spark processes and the access to data means applying the attributes of the enforced 5W policy with the Ranger plugins and executing the policy configuration checker to keep the policy implementation valid. We obtained the result shown in Fig. 10. The graph shows that, compared to the classical use case with no GDPR compliance, we still have an acceptable latency with low variations. This variation is due to the different data interceptions and processing at the touchpoints: process access control to data, the policy configuration checker and the encryption/decryption operations. To have more details about the impact of each interception, we evaluate each one separately.

First, we evaluate the access control to data. Queries to data are captured by the Distribution manager to decide whether data can be communicated to services. This operation may introduce a large processing delay if the access control verification is performed for many data items. However, thanks to Apache Ranger's scalability [48,76], this processing overhead is reduced to a few milliseconds.

Second, we evaluate the performance of our Policy Configuration Checker, which is in fact an extension to the Atlas-Ranger couple and may slow down their performance. This checker is executed at deployment time and at run time whenever the security configuration changes. The key measures to consider are memory and execution time. The memory depends on the number of edges E and vertices V of the data flow graphs. The graph is generally composed of few edges and vertices, but in our evaluation we place ourselves in complex Big Data applications composed of several data sources and several service destinations going through several processors. The worst case is reached when each destination is connected to all sources in the graph. In this assessment, we consider graphs ranging from 5 to 1500 nodes and we consider fully connected graphs. Fig. 11 shows the variation of the execution time for verifying policies in the considered data flow graphs. The execution time increases with the number of nodes in the data flow graph of the Apache Atlas lineage. It remains very acceptable: it does not exceed 4 s even for large, fully interconnected graphs (1500 nodes).

For the memory consumption, we consider the same experiment with nodes ranging from 5 to 1500 and fully connected graphs. We compared the memory consumption with and without the framework integration. Fig. 11(b) shows that the execution of the policy configuration checker consumes slightly more memory compared to the initial evaluation. This confirms that the memory depends on the number of edges E and vertices V of the data flow graphs. In a Big Data context and given the observed low variation, we can state that memory consumption is not significantly affected by the execution of the policy configuration checker algorithms.

Finally, we evaluate the overhead of using CP-ABE for encryption. In addition to the evaluation presented in [50], we evaluate both the encryption and decryption times, which increase linearly with the number of considered data attributes. We vary the number of attributes up to 500 and we consider 1 GB of sensor data coming from IoT devices at the same time. Encryption and decryption operations are executed at the DS GW and at the ingestion layer. Even with this exaggerated configuration, which is very unlikely to occur in IoT systems, the evaluation still provides acceptable execution times. Fig. 12 shows that the time taken by the encryption and decryption algorithms varies linearly with the number of attributes and remains acceptable compared to other evaluations [50].

7.2.2. Engineering overhead

In Big Data applications, the management layer with Apache Ranger and Atlas is classically installed and configured with their hooks for the processing and storage data touchpoints, thanks to the Apache Ambari [77] sandbox. We used Ambari for our framework installation and configuration, comprising the Big Data components (Apache Kafka, Apache Ranger, Apache Atlas, Apache Spark, etc.). The hooks configuration depends on the technology used for each data touchpoint to push metadata changes to Apache Atlas for data control. For example, for the ingestion layer touchpoint we need a Kafka hook to capture which data are collected by Atlas; similarly, for the storage data touchpoint we need HDFS/Hive hooks, and for the processing touchpoint a Spark hook. The developer just has to write his Spark code in an installed notebook like Apache Zeppelin [78] on the provided sandbox with a connector as the hook configuration, or simply run it from the command line on the sandbox passing the connector as a parameter (bin/spark-shell --jars spark-atlas-connector_2.11-0.1.0-SNAPSHOT.jar). To summarize, since our implementation is based on the Apache Ranger-Atlas solution, we use the hook and connector techniques provided by these technologies to embed the management layer. Similar hook approaches can be applied if other technologies are used for the framework implementation.

8. Conclusion and future work

This work aims at helping IT designers and developers understand GDPR and implement GDPR-compliant Big Data systems. For this, we analyze GDPR requirements and translate them to IT design requirements. Then, a framework is proposed that details the main components for GDPR compliance verification and implementation. To implement this framework, we classified and compared different tools related to GDPR implementation in Big Data systems. This comparison is guided by the identified IT requirements and the framework components designed in this paper. An implementation is proposed extending the Big Data governance layer, Apache Ranger associated with Atlas. To validate this implementation part, we consider an e-health application, and
we show how GDPR is respected using our solution without significant overhead on system performance or engineering effort.

Fig. 10. Global evaluation: the impact of the framework implementation on a classical e-health use case.

The framework is a first step towards GDPR-compliance in Big Data systems. It simplifies the understanding of GDPR for the IT community and provides guidelines for Big Data GDPR compliance. Nevertheless, it has been evaluated on a single kind of application using Kafka and Spark technologies; we need to evaluate the framework on other kinds of Big Data pipelines. Furthermore, our use-case application runs in batch mode. Real-time and streaming applications (using CEP [52] or Spark Streaming) have more challenging requirements in terms of response time, so it is important to check the GDPR-compliance overhead induced by our framework on real-time processing. Moreover, the framework components' implementation can be enhanced by introducing other security implementations such as lattice-based encryption [79] and more user-friendly APIs for the DS and the DPO. As future work, we are interested in applying the framework to streaming applications and evaluating GDPR-compliance for this kind of application.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
[60] My Data Manager. My Data Manager. 2020, Online; https://www.mydatamanager.eu/conhecer-my-data-manager?lang=en. [Accessed 02 March 2020].
[61] Vault A. Alien Vault USM. 2020, Online; www.alienvault.com. [Accessed 15 January 2020].
[62] BigId. BigID. 2020, Online; https://bigid.com/eu-gdpr/. [Accessed 15 January 2020].
[63] BWise. BWise GDPR compliance solution. 2020, Online; www.bwise.com/solutions/regulatory-compliance-management/global-data-protection-regulation-gdpr. [Accessed 15 January 2020].
[64] Consentua. Consentua. 2020, Online; https://consentua.com. [Accessed 15 January 2020].
[65] Perfect P. PrivacyPerfect. 2020, Online; https://www.privacyperfect.com/fr. [Accessed 15 January 2020].
[66] One Trust. One Trust. 2020, Online; https://www.onetrust.com. [Accessed 02 March 2020].
[67] Skyhigh Networks. Skyhigh Networks. 2020, Online; https://www.skyhighnetworks.com. [Accessed 02 March 2020].
[68] Apache. Apache Spark. 2020, Online; http://spark.apache.org/. [Accessed 03 March 2020].
[69] Apache. Apache Eagle. 2020, Online; https://eagle.apache.org/. [Accessed 15 January 2020].
[70] Apache. Apache Atlas. 2020, Online; https://atlas.apache.org/. [Accessed 15 January 2020].
[71] Tang M, Shao S, Yang W, Liang Y, Yu Y, Saha B, et al. SAC: A system for big data lineage tracking. In: 2019 IEEE 35th international conference on data engineering. IEEE; 2019, p. 1964–7.
[72] Apache. Apache TinkerPop. 2020, Online; http://tinkerpop.apache.org/. [Accessed 03 March 2020].
[73] Tang M, Shao S, Yang W, Liang Y, Yu Y, Saha B, et al. SAC: A system for big data lineage tracking. In: 2019 IEEE 35th international conference on data engineering. IEEE; 2019, p. 1964–7.
[74] Apache. Apache Kafka. 2020, Online; https://kafka.apache.org/. [Accessed 15 January 2020].
[75] Li Z, Huan S. Multi-level attribute-based encryption access control scheme for big data. In: MATEC web of conferences, vol. 173. EDP Sciences; 2018, p. 03047.
[76] Cloudera. Providing authorization with Apache Ranger. 2020, Online; https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/authorization-ranger/sec_authorization_ranger.pdf. [Accessed 06 March 2020].
[77] Apache. Apache Ambari. 2020, Online; https://ambari.apache.org/. [Accessed 03 March 2020].
[78] Apache. Apache Zeppelin. 2020, Online; https://zeppelin.apache.org/. [Accessed 03 March 2020].
[79] Dai W, Doröz Y, Polyakov Y, Rohloff K, Sajjadpour H, Savaş E, et al. Implementation and evaluation of a lattice-based key-policy ABE scheme. IEEE Trans Inf Forensics Secur 2017;13(5):1169–84.