Nelson 2016
Abstract—Big data is currently a hot research topic, with four million hits on Google Scholar in October 2016. One reason for the popularity of big data research is the knowledge that can be extracted from analyzing these large data sets. However, data can contain sensitive information, and data must therefore be sufficiently protected as it is stored and processed. Furthermore, it might also be required to provide meaningful, proven, privacy guarantees if the data can be linked to individuals.

To the best of our knowledge, there exists no systematic overview of the overlap between big data and the area of security and privacy. Consequently, this review aims to explore security and privacy research within big data, by outlining and providing structure to what research currently exists. Moreover, we investigate which papers connect security and privacy with big data, and which categories these papers cover. Ultimately, is security and privacy research for big data different from the rest of the research within the security and privacy domain?

To answer these questions, we perform a systematic literature review (SLR), where we collect recent papers from top conferences, and categorize them in order to provide an overview of the security and privacy topics present within the context of big data. Within each category we also present a qualitative analysis of papers representative for that specific area. Furthermore, we explore and visualize the relationship between the categories. Thus, the objective of this review is to provide a snapshot of the current state of security and privacy research for big data, and to discover where further research is required.

I. INTRODUCTION

Big data processing presents new opportunities due to its analytic powers. Business areas that can benefit from analyzing big data include the automotive industry, the energy distribution industry, health care and retail. Examples from these areas include analyzing driving patterns to discover anomalies in driving behaviour [1], making use of smart grid data to create energy load forecasts [2], analyzing search engine queries to detect influenza epidemics [3] and utilizing customers' purchase history to generate recommendations [4]. However, all of these examples include data linked to individuals, which makes the underlying data potentially sensitive.

Furthermore, while big data provides analytic support, big data in itself is difficult to store, manage and process efficiently due to the inherent characteristics of big data [5]. These characteristics were originally divided into three dimensions referred to as the three Vs [6], but are today often divided into four or even five Vs [2, 5, 7]. The original three Vs are volume, variety and velocity, and the newer Vs are veracity and value. Volume refers to the amount of data, which Kaisler et al. [5] define to be in the range of 10^18 bytes to be considered big data. Variety denotes the problem of big data being able to consist of different formats of data, such as text, numbers, videos and images. Velocity represents the speed at which the data grows, that is, at what speed new data is generated. Furthermore, veracity concerns the accuracy and trustworthiness of data. Lastly, value corresponds to the usefulness of data, indicating that some data points, or a combination of points, may be more valuable than others. Due to the potentially large scale of data processing for big data, there exists a need for efficient, scalable solutions that also take security and privacy into consideration.

To the best of our knowledge, there exist no peer-reviewed articles that systematically review big data papers with a security and privacy perspective. Hence, we aim to fill that gap by conducting a systematic literature review (SLR) of recent big data papers with a security and privacy focus. While this review does not cover the entire, vast, landscape of security and privacy for big data, it provides an insight into the field, by presenting a snapshot of what problems and solutions exist within the area.

In this paper, we select papers from top security and privacy conferences, as well as top conferences on data format and machine learning for further analysis. The papers are recent publications, published between 2012 and 2015, which we manually categorize to provide an overview of security and privacy papers in a big data context. The categories are chosen to be relevant for big data, security or privacy respectively. Furthermore, we investigate and visualize what categories relate to each other in each reviewed paper, to show what connections exist and which ones are still unexplored. We also visualize the proportion of papers belonging to each category, and the proportion of papers published in each conference. Lastly, we analyze and present a representative subset of papers from each of the categories.

The paper is organized as follows. First, the method for gathering and reviewing papers is explained in Section II. Then, the quantitative and qualitative results are presented in Section III, where each of the categories and their corresponding papers are further analyzed in the subsection with their corresponding name. A discussion of the findings and directions for future work is presented in Section IV. Lastly,
would require a labor-intensive analysis just to eliminate those irrelevant papers. Furthermore, we believe that the papers related to security or privacy would mention this in their title. Thus, we have focused on a smaller, relevant, subset.

Query A focuses on finding papers related to security or privacy in one of the big data conferences. This query is intentionally constructed to catch a wide range of security and privacy papers, including relevant papers that have omitted 'big data' from the title. Furthermore, query B is designed to find big data papers in any of the conferences, unlike query A. The reason to also include query B is foremost to capture big data papers in security and privacy conferences. Query B will also be able to find big data papers in the other conferences, which provides the opportunity to catch security or privacy papers that were not already captured by query A.

Step 2: After the papers have been collected, we manually filter them to perform both a selection and a quality assessment, in accordance with the guidelines for an SLR. First, we filter away talks, tutorials, panel discussions and papers only containing abstracts from the collected papers. We also verify that no papers are duplicates to ensure that the data is not skewed. Then, as a quality assessment we analyze the papers' full corpora to determine if they belong to security or privacy. Papers that do not discuss security or privacy are excluded. Thus, the irrelevant papers, mainly captured by query B, and other potential false positives, are eliminated.

To further assess the quality of the papers, we investigate each paper's relevance for big data. To determine if it is a big data paper we include the entire corpus of the paper, and look for evidence of scalability in the proposed solution by examining if the paper relates to the five Vs. The full list of included and excluded papers is omitted in this paper due to space restrictions, but it is available from the authors upon request.

Step 3: Then, each paper is categorized into one or more of the categories shown in Table III. These categories were chosen based on the five Vs, with additional security and privacy categories added to the set. Thus the categories capture both the inherent characteristics of big data, as well as security and privacy.

Category               V                   Security or Privacy
Confidentiality (iv)
Data Analysis          Value
Data Format            Variety, Volume
Data Integrity         Veracity
Privacy (v)
Stream Processing      Velocity, Volume
Visualization          Value, Volume

TABLE III: Categories used in the review, chosen based on the five Vs. A checkmark in the third column means that the category is a security or privacy category.

(iv) As defined by ISO 27000:2016 [9]
(v) Anonymization as defined by ISO 29100:2011 [10]

In total, 208 papers match the search criteria when we run both queries in Google Scholar. After filtering away papers and performing the quality assessment, 82 papers remain. Query A results in 78 papers, and query B contributes with four unique papers that were not already found by query A. In Table IV the number of papers from each conference is shown for query A and query B respectively.

Conference         Query A                 Query B
Acronym            Number    Percentage    Number    Percentage
                   of Papers of Papers     of Papers of Papers
DCC                0         0%            0         0%
ICDE               22        28%           0         0%
ICDM               4         5%            0         0%
SIGKDD             0         0%            0         0%
SIGMOD             21        26%           1         25%
VLDB               25        31%           1         25%
WSDM               0         0%            0         0%
ICML               5         6.3%          0         0%
NIPS               1         1.3%          0         0%
S&P                -         -             1         25%
USENIX Security    -         -             0         0%
CCS                -         -             1         25%
Total:             78        100%          4         100%

TABLE IV: The number, and percentage, of papers picked from each conference, for query A and query B

Step 4: Then, as part of the data synthesis which is the last step in the review protocol in Table I, the quantitative results from the queries are visualized, both as circle packing diagrams, where the proportion of papers and conferences is visualized, and as a circular network diagram where relationships between categories are visualized. Thereafter a qualitative analysis is performed on the papers, where the novel idea and the specific topics covered are extracted from the papers' corpora. A representative set of the papers is then presented.

III. RESULTS

In this section, we quantitatively and qualitatively analyze the 82 papers. Figure 1 (a) visualizes where each paper originates from, using circle packing diagrams. The size of each circle corresponds to the proportion of papers picked from a conference. As can be seen, most papers have been published in ICDE, SIGMOD or VLDB. Furthermore, the distribution of the different categories is illustrated in Figure 1 (b), where the size of a circle represents the number of papers covering that category. Prominent categories are privacy, data analysis and confidentiality.

Furthermore, some papers discuss more than one category and therefore belong to more than one category. Therefore, the total number of papers when all categories are summed will exceed 82. To illustrate this overlap of categories, the relationship between the categories is visualized as a circular network diagram in Figure 2. Each line between two categories means that there exists at least one paper that discusses both categories. The thickness of the line reflects the number of papers that contain the two categories connected by the line. Privacy and data analysis as well as confidentiality and data format are popular combinations. Stream processing and
visualization are each connected only to privacy, and by a single paper.

(a) Conferences, grouped by research field (b) Categories, grouped by similarity
Fig. 1: Circle packing diagrams, showing the proportion of papers belonging to conferences (a) and categories (b)

Fig. 2: Connections between categories, where the thickness of the link represents the number of papers that connect the two categories

Since there is not enough room to describe each paper in the qualitative analysis, we have chosen a representative set for each category. This representative set is chosen to give an overview of the papers for each category. Each selected paper is then presented in a table to show which categories it belongs to. An overview of the rest of the papers is shown in Table V.

A. Confidentiality

Confidentiality is a key attribute to guarantee when sensitive data is handled, especially since being able to store and process data while guaranteeing confidentiality could be an incentive to get permission to gather data. In total, 23 papers were categorized as confidentiality papers. Most papers used different types of encryption, but there was no specific topic that had a majority of papers. Instead, the papers were spread across a few different topics. In Table VI, an overview of all papers presented in this section is given.

Five papers use homomorphic encryption, which is a technique that allows certain arithmetic operations to be performed on encrypted data. Of those five papers, one uses fully homomorphic encryption, which supports any arithmetic operation, whereas the rest use partially homomorphic encryption, which supports only specific arithmetic operations. Liu et al. [11] propose a secure method for comparing trajectories, for example to compare different routes using GPS data, by using partially homomorphic encryption. Furthermore, Chu et al. [12] use fully homomorphic encryption to provide a protocol for similarity ranking.

Another topic covered by several papers is access control. In total, four papers discuss access control. For example, Bender et al. [13] proposed a security model where policies must be explainable. By explainable in this setting Bender et al. refer to the fact that every time a query is denied due to missing privileges, an explanation as to what additional privileges are needed is returned. This security model is an attempt to make it easier to implement the principle of least privilege, rather than giving users too generous privileges. Additionally, Meacham and Shasha [14] propose an application that provides access control in a database, where all records are encrypted if the user does not have the appropriate privileges. Even though the solutions by Bender et al. and Meacham and Shasha use SQL, traditionally not associated with big data, their main ideas are still applicable, since applying them only requires changing the database to one of the RDBMSs for big data that have been proposed earlier, such as Vertica [15] or Zhu et al.'s [16] distributed query engine.

Other topics covered were secure multiparty computation, a concept where multiple entities perform a computation while keeping each entity's input confidential, oblivious transfer, where a sender may or may not transfer a piece of information to the receiver without knowing which piece is sent, as well as different encrypted indexes used for improving search time efficiency. In total, three papers use secure multiparty computation, two use oblivious transfer and two use encrypted indexes.

B. Data Integrity

Data integrity is the validity and quality of data. It is therefore strongly connected to veracity, one of the five Vs. In total, five papers covered data integrity. Since there is only a small set of data integrity papers, no apparent topic trend was spotted. Nonetheless, one paper shows an attack on integrity,
Author Short Title C DA DF DI P SP V
Akcora et al. Privacy in Social Networks
Allard et al. Chiaroscuro
Bonomi and Xiong Mining Frequent Patterns with Differential Privacy
Bonomi et al. LinkIT
Cao et al. A hybrid private record linkage scheme
Chen and Zhou Recursive Mechanism
Dev Privacy Preserving Social Graphs for High Precision Community Detection
Dong et al. When Private Set Intersection Meets Big Data
Fan et al. FAST
Gaboardi et al. Dual Query
Guarnieri and Basin Optimal Security-aware Query Processing
Guerraoui et al. D2P
Haney et al. Design of Policy-aware Differentially Private Algorithms
He et al. Blowfish Privacy
He et al. DPT
He et al. SDB
Hu et al. Authenticating Location-based Services Without Compromising Location Privacy
Hu et al. Private search on key-value stores with hierarchical indexes
Hu et al. VERDICT
Jain and Thakurta (Near) Dimension Independent Risk Bounds for Differentially Private Learning
Jorgensen and Cormode Conservative or liberal?
Kellaris and Papadopoulos Practical differential privacy via grouping and smoothing
Khayyat et al. BigDansing
Kozak and Zezula Efficiency and Security in Similarity Cloud Services
Li and Miklau An Adaptive Mechanism for Accurate Query Answering Under Differential Privacy
Li et al. A Data- and Workload-aware Algorithm for Range Queries Under Differential Privacy
Li et al. DPSynthesizer
Li et al. Fast Range Query Processing with Strong Privacy Protection for Cloud Computing
Li et al. PrivBasis
Lin and Kifer Information Preservation in Statistical Privacy and Bayesian Estimation of Unattributed Histograms
Lu et al. Generating private synthetic databases for untrusted system evaluation
Mohan et al. GUPT
Nock et al. Rademacher observations, private data, and boosting
Oktay et al. SEMROD
Pattuk et al. Privacy-aware dynamic feature selection
Potluru et al. CometCloudCare (C3)
Qardaji et al. Differentially private grids for geospatial data
Qardaji et al. PriView
Qardaji et al. Understanding Hierarchical Methods for Differentially Private Histograms
Rahman et al. Privacy Implications of Database Ranking
Rana et al. Differentially Private Random Forest with High Utility
Ryu et al. Curso
Sen et al. Bootstrapping Privacy Compliance in Big Data Systems
Shen and Jin Privacy-Preserving Personalized Recommendation
Terrovitis et al. Privacy Preservation by Disassociation
To et al. A Framework for Protecting Worker Location Privacy in Spatial Crowdsourcing
Wong et al. Secure Query Processing with Data Interoperability in a Cloud Database Environment
Xiao et al. DPCube
Xu et al. Differentially private frequent sequence mining via sampling-based candidate pruning
Xue et al. Destination prediction by sub-trajectory synthesis and privacy protection against such prediction
Yang et al. Bayesian Differential Privacy on Correlated Data
Yaroslavtsev et al. Accurate and efficient private release of datacubes and contingency tables
Yi et al. Practical k nearest neighbor queries with location privacy
Yuan et al. Low-rank Mechanism
Zeng et al. On Differentially Private Frequent Itemset Mining
Zhang et al. Functional Mechanism
Zhang et al. Lightweight privacy-preserving peer-to-peer data integration
Zhang et al. Private Release of Graph Statistics Using Ladder Functions
Zhang et al. PrivBayes
Zhang et al. PrivGene
TABLE V: The reviewed papers omitted from the reference list, showing categories covered by each paper. C = Confidentiality,
DA = Data Analysis, DF = Data Format, DI= Data Integrity, P = Privacy, SP = Stream Processing, V = Visualization.
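A table such as Table V, which records the categories covered by each paper, is enough to derive both the per-category totals and the pairwise overlaps that a diagram like Fig. 2 visualizes. The following is a minimal sketch of that counting, not the authors' tooling: the paper labels and category assignments are hypothetical placeholders (they are not taken from Table V), and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Hypothetical paper-to-category assignments, using the abbreviations of
# Table V. These rows are placeholders, NOT the actual review data.
papers = {
    "paper_1": {"P", "DA"},
    "paper_2": {"C", "DF"},
    "paper_3": {"P", "DA", "DF"},
    "paper_4": {"P", "SP"},
    "paper_5": {"DI"},
}

def category_counts(assignments):
    """Number of papers covering each category."""
    return Counter(cat for cats in assignments.values() for cat in cats)

def edge_weights(assignments):
    """Number of papers that cover both categories of each pair."""
    edges = Counter()
    for cats in assignments.values():
        # Sort so that each pair has one canonical key, e.g. ("DA", "P").
        for pair in combinations(sorted(cats), 2):
            edges[pair] += 1
    return edges

counts = category_counts(papers)
edges = edge_weights(papers)

# A paper with several categories is counted once per category, so the
# summed category counts can exceed the number of papers.
print(sum(counts.values()), "category memberships for", len(papers), "papers")
for (a, b), weight in sorted(edges.items()):
    print(f"{a} - {b}: {weight}")
```

Each pair with a nonzero weight corresponds to one line in the network diagram, and the weight corresponds to the thickness of that line. Pairs that never occur together, such as stream processing and data integrity in the review, simply do not appear in the result.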
two papers are on error correction and data cleansing, and two papers use tamper-proof hardware to guarantee integrity of the data. An overview of all papers covered in this section is shown in Table VII.

Author                      C  DA  DF  DI  P  SP  V
Bender et al. [13]
Chu et al. [12]
Liu et al. [11]
Meacham and Shasha [14]

TABLE VI: A set of confidentiality papers, showing categories covered by each paper. A checkmark indicates the paper on that row contains the category.

Xiao et al. [17] show that it is enough to poison 5% of the training set, the data used solely to train a machine learning algorithm, in order for feature selection to fail. Feature selection is the step where relevant attributes are being decided, and it is therefore an important step since the rest of the algorithm will depend on these features. Thus, Xiao et al. show that feature selection is not secure unless the integrity of the data can be verified.

Furthermore, Arasu et al. [18] implemented a SQL database called Cipherbase that focuses on confidentiality of data as well as integrity in the cloud. To maintain the integrity of the cryptographic keys, they use FPGA-based custom hardware to provide tamper-proof storage. Lallali et al. [19] also used tamper-resistant hardware where they enforce confidentiality for queries performed in personal clouds. The tamper-resistant hardware is in the form of a secure token which prevents any data disclosure during the execution of a query. While the secure tokens ensure a closed execution environment, they possess limited processing power due to the hardware constraints, which adds to the technical challenge.

Author                      C  DA  DF  DI  P  SP  V
Arasu et al. [18]
Lallali et al. [19]
Xiao et al. [17]

TABLE VII: A set of data integrity papers, showing categories covered by each paper

C. Privacy

Privacy is an important notion for big data, since such data can potentially contain sensitive data about individuals. To mitigate the privacy problem, data can be de-identified by removing attributes that would identify an individual. This is an approach that works, if done correctly, both when data is managed and when released. However, under certain conditions it is still possible to re-identify individuals even when some attributes have been removed [20, 21, 22]. Lu et al. [7] also point out that the risk of re-identification can increase with big data, as more external data from other sources than the set at hand can be used to cross-reference and infer additional information about individuals.

Several privacy models, such as k-anonymity [23], l-diversity [24], t-closeness [25] and differential privacy [26], can be used to anonymize data. The first three are techniques for releasing entire sets of data through privacy-preserving data publishing (PPDP), whereas differential privacy is used for privacy-preserving data mining (PPDM). Thus, differential privacy is obtained without processing the entire data set, unlike the others. Therefore, anonymizing larger data sets with the PPDP techniques can be difficult from an efficiency perspective. However, larger sets have greater potential to hide individual data points within the set [27].

Out of a total of 61 privacy papers, one paper [28] uses k-anonymity, and another paper [29] uses l-diversity and t-closeness but also differential privacy to anonymize data. Furthermore, Cao and Karras [30] introduce a successor to t-closeness, called β-likeness, which they claim is more informative and comprehensible. In comparison, a large portion, 46 papers, of the privacy oriented papers focus only on differential privacy as their privacy model. Most of them propose methods for releasing differentially private data structures. Among these are differentially private histograms [31] and different data structures for differentially private multidimensional data [32].

An interesting observation by Hu et al. [33] is that differential privacy can have a large impact on accuracy of the result. When Hu et al. enforced differential privacy on their telecommunications platform, they got between 15% and 30% accuracy loss. In fact, guaranteeing differential privacy while maintaining high utility of the data is not trivial. From the reviewed papers, 15 of them investigated utility in combination with differential privacy.

One example of a paper that investigates the utility of differentially private results, and how to improve it, is Proserpio et al. [34]. The work of Proserpio et al. is a continuation of the differentially private querying language PINQ [35], which they enhance by decreasing the importance of challenging entries, which induce high noise, in order to improve accuracy of the results.

The papers reviewed in this section can be seen in Table VIII.

Author                      C  DA  DF  DI  P  SP  V
Acs et al. [31]
Cao and Karras [30]
Cormode et al. [32]
Hu et al. [33]
Jurczyk et al. [29]
Proserpio et al. [34]
Wang and Zheng [28]

TABLE VIII: A set of privacy papers, showing categories covered by each paper

D. Data Analysis

Data analysis is the act of extracting knowledge from data. It includes both general algorithms for knowledge discovery, and machine learning. Out of 26 papers categorized as data analysis papers, 15 use machine learning. Apart from machine learning, other topics included frequent sequence mining, where reoccurring patterns are detected, and different versions
of the k-nearest neighbor (kNN) algorithm, which finds the k closest points given a point of reference. All papers from this section are shown in Table IX.

Jain and Thakurta [36] implemented differentially private learning using kernels. The problem investigated by Jain and Thakurta is keeping the features, which are different attributes of an entity, of a learning set private while still providing useful information.

Furthermore, Elmehdwi et al. [37] implemented a secure kNN algorithm, based on partially homomorphic encryption. Here, Elmehdwi et al. propose a method for performing kNN in the cloud, where both the query and the database are encrypted. Similarly, Yao et al. [38] investigated the secure nearest neighbour (SNN) problem, which asks a third party to find the point closest to a given point, without revealing any of the points to the third party. They show attacks for existing methods for SNN, and design a new SNN method that withstands the attacks.

Author                      C  DA  DF  DI  P  SP  V
Elmehdwi et al. [37]
Jain and Thakurta [36]
Yao et al. [38]

TABLE IX: A set of data analysis papers, showing categories covered by each paper

E. Visualization

Visualization of big data provides a quick overview of the data points. It is an important technique, especially while exploring a new data set. However, it is not trivial to implement for big data. Gordov and Gubarev [39] point out visual noise, large image perception, information loss, high performance requirements and high rate of image change as the main challenges when visualizing big data.

One paper, by To et al. [40], shown in Table X, was categorized as a visualization paper. To et al. implemented a toolbox for visualizing and assigning tasks based on an individual's location. In this toolbox, location privacy is provided while at the same time allowing for allocation strategies of tasks to be analyzed. Thus, it presents a privacy-preserving way of analyzing how parameters in a system should be tuned to result in a satisfactory trade-off between privacy and accuracy.

Author                      C  DA  DF  DI  P  SP  V
To et al. [40]

TABLE X: All visualization papers, showing categories covered by each paper

F. Stream Processing

Stream processing is an alternative to the traditional store-then-process approach, which can allow processing of data in real-time. The main idea is to perform analysis on data as it is being gathered, to directly address the issue of data velocity. Processing streamed data also allows an analyst to only save the results from the analysis, thus requiring less storage capacity in comparison with saving the entire data set. Furthermore, if it is carried out in real-time, stream processing can also completely remove the bottleneck of first writing data to disk and then reading it back in order to process it.

One paper, by Kellaris et al. [41], shown in Table XI, combines stream processing with privacy, and provides a differentially private way of querying streamed data. Their approach enforces w event-level based privacy rather than user-level privacy, which makes each event in the stream private, rather than the user that continuously produces events. Event-level based privacy, originally introduced by Dwork et al. [42], is more suitable in this case due to the fact that differential privacy requires the number of queries connected to the same individual to be known in order to provide user-level based privacy. In the case of streaming, however, data is gathered continuously, making it impossible to estimate how many times a certain individual will produce events in the future.

Author                      C  DA  DF  DI  P  SP  V
Kellaris et al. [41]

TABLE XI: All stream processing papers, showing categories covered by each paper

G. Data Format

In order to store and access big data, it can be structured in different ways. Out of the 19 papers labeled as data format papers, most used a distributed file system, database or cloud that made them qualify in this category. An overview of all papers from this section can be found in Table XII.

One example of combining data format and privacy is the work by Peng et al. [43] that focuses on query optimization under differential privacy. The main challenge faced when enforcing differential privacy on databases is the interactive nature of the database, where new queries are issued in real-time. Since the number of queries is not known in advance, it is difficult to spend the privacy budget wisely enough to still provide high utility of query answers. The privacy budget is used to guarantee differential privacy and essentially keeps track of how many queries can be asked. Therefore, Peng et al. implemented the query optimizer Pioneer, which makes use of old query replies when possible in order to consume as little as possible of the remaining privacy budget.

Furthermore, Sathiamoorthy et al. [44] focus on data integrity, and present an alternative to standard Reed-Solomon codes, which are erasure codes used for error-correction, that is more efficient and offers higher reliability. They implemented their erasure codes in Hadoop's distributed file system, HDFS, and were able to show that the network traffic could be reduced, at the cost of requiring more storage space than traditional Reed-Solomon codes.

Lastly, Wang and Ravishankar [45] point out that providing both efficient and confidential queries in databases is challenging. Inherently, the problem stems from the fact that indexes invented to increase performance of queries also leak information that can allow adversaries to reconstruct
the plaintext, as Wang and Ravishankar show. Consequently, Wang and Ravishankar present an encrypted index that provides both confidentiality and efficiency for range queries, tackling the usual trade-off between security and performance.

Author                      C  DA  DF  DI  P  SP  V
Peng et al. [43]
Sathiamoorthy et al. [44]
Wang and Ravishankar [45]

TABLE XII: A set of data format papers, showing categories covered by each paper

IV. DISCUSSION AND FUTURE WORK

While this review investigates security and privacy for big data, it does not cover all papers available within the topic, since it would be infeasible to manually review them all. Instead, the focus of this review is to explore recent papers and to provide both a qualitative and a quantitative analysis, in order to create a snapshot of the current state-of-the-art. By selecting papers from top conferences and assessing their quality manually before selecting them, we include only papers relevant for big data, security and privacy.

A potential problem with only picking papers from top conferences is that, while the quality of the papers is good, the conferences might only accept papers with groundbreaking ideas. After conducting this review, however, we believe most big data solutions with respect to security and privacy are not necessarily groundbreaking ideas, but rather new twists on existing ideas. From the papers collected for this review, none of the topics covered are specific for big data; rather, the papers present new combinations of existing topics. Thus, it seems that security and privacy for big data is not different from other security and privacy research, as the ideas seem to scale well.

Another part of the methodology that can be discussed is the two queries used to collect papers. Query A was constructed to cover a wide range of papers, and query B was set to only include big data papers. Unfortunately, query A contributed with far more hits than query B after the filtering step from Table I. This means that most papers might not have been initially intended for big data, but they were included after the quality assessment step, since the methods used were deemed scalable. Consequently, widening the scope of query B might include papers that present security or privacy solutions solely intended for big data.

Regarding the categories, confidentiality was covered by almost a third of the papers, but had no dominating topic. While privacy was covered by a large portion of papers, only two papers use an existing privacy-preserving data publishing (PPDP) technique. Moreover, one paper introduces a new PPDP technique called β-likeness. A reason why this topic might not be getting a lot of attention is the fact that PPDP is dependent on the size of the data set. Thus PPDP is harder to apply to big data, since the entire data set must be processed in order to anonymize it. Consequently, further work may be required in this area to see how PPDP can be applied to big data.

We have also detected a gap in the knowledge considering stream processing and visualization in combination with either data integrity or confidentiality, as no papers covered two of these topics. Data integrity is also one of the topics that were underrepresented, with five papers out of 82 papers in total, which is significantly lower than the number of confidentiality and privacy papers. However, it might be explained by the fact that the word 'integrity' was not part of any of the queries. This is a possible expansion of the review.

V. CONCLUSION

There are several interesting ideas for addressing security and privacy issues within the context of big data. In this paper, 208 recent papers have been collected from A* conferences, to provide an overview of the current state-of-the-art. In the end, 82 were categorized after passing the filtering and quality assessment stage. All reviewed papers can be found in tables in Section III.

In summary, since papers can belong to more than one category, 61 papers investigate privacy, 25 data analysis, 23 confidentiality, 19 data format, 5 data integrity, one stream processing and one visualization. Prominent topics were differential privacy, machine learning and homomorphic encryption. None of the identified topics are unique for big data.

Categories such as privacy and data analysis are covered in a large portion of the reviewed papers, and 20 of them investigate the combination of privacy and data analysis. However, there are certain categories where interesting connections could be made that do not yet exist. For example, one combination that is not yet represented is stream processing with either confidentiality or data integrity. Visualization is another category that was only covered by one paper.

In the end, we find that security and privacy research for big data, based on the reviewed papers, is not different from security and privacy research in general.

ACKNOWLEDGEMENTS

This research was sponsored by the BAuD II project (2014-03935) funded by VINNOVA, the Swedish Governmental
Rather, it contained a wide spread of different cryptographic Agency for Innovation Systems.
techniques and access control. Furthermore, privacy was well R EFERENCES
represented, with 61 papers in the review. A large portion of
[1] G. Fuchs et al. “Constructing semantic interpretation
these papers used differential privacy, the main reason prob-
of routine and anomalous mobility behaviors from big
ably being the fact that most differentially private algorithms
data”. In: SIGSPATIAL Special 7.1 (May 2015), pp. 27–
are independent of the data set’s size, which makes it beneficial
34.
for large data sets.
3700
[2] M. Chen et al. “Big Data: A Survey”. en. In: Mobile [15] C. Bear et al. “The vertica database: SQL RDBMS
Networks and Applications 19.2 (Jan. 2014), pp. 171– for managing big data”. In: Proceedings of the 2012
209. workshop on Management of big data systems. ACM,
[3] J. Ginsberg et al. “Detecting influenza epidemics using 2012, pp. 37–38.
search engine query data”. English. In: Nature 457.7232 [16] F. Zhu et al. “A Fast and High Throughput SQL Query
(Feb. 2009), pp. 1012–4. System for Big Data”. In: Web Information Systems En-
[4] O. Tene and J. Polonetsky. “Privacy in the Age of Big gineering - WISE 2012. Ed. by X. S. Wang et al. Lecture
Data: A Time for Big Decisions”. In: Stanford Law Notes in Computer Science 7651. DOI: 10.1007/978-
Review Online 64 (Feb. 2012), p. 63. 3-642-35063-4 66. Springer Berlin Heidelberg, 2012,
[5] S. Kaisler et al. “Big Data: Issues and Challenges Mov- pp. 783–788.
ing Forward”. English. In: System Sciences (HICSS), [17] H. Xiao et al. “Is Feature Selection Secure against
2013 46th Hawaii International Conference on. IEEE, Training Data Poisoning?” In: Proceedings of the 32nd
Jan. 2013, pp. 995–1004. International Conference on Machine Learning (ICML-
[6] D. Laney. 3D Data Management: Controlling Data 15). 2015, pp. 1689–1698.
Volume, Velocity, and Variety. Tech. rep. META Group, [18] A. Arasu et al. “Secure Database-as-a-service with Ci-
Feb. 2001. pherbase”. In: Proceedings of the 2013 ACM SIGMOD
[7] R. Lu et al. “Toward efficient and privacy-preserving International Conference on Management of Data. SIG-
computing in big data era”. English. In: Network, IEEE MOD ’13. New York, NY, USA: ACM, 2013, pp. 1033–
28.4 (Aug. 2014), pp. 46–50. 1036.
[8] B. Kitchenham. Procedures for performing systematic [19] S. Lallali et al. “A Secure Search Engine for the
reviews. Joint Technical Report. Keele, UK: Software Personal Cloud”. In: Proceedings of the 2015 ACM
Engineering Group Department of Computer Science SIGMOD International Conference on Management of
Keele University, UK, and Empirical Software Engi- Data. SIGMOD ’15. New York, NY, USA: ACM, 2015,
neering, National ICT Australia Ltd, 2004, p. 26. pp. 1445–1450.
[9] International Organization for Standardization. Informa- [20] M. Barbaro and T. Zeller. “A Face Is Exposed for AOL
tion technology – Security techniques – Information se- Searcher No. 4417749”. In: The New York Times (Aug.
curity management systems – Overview and vocabulary. 2006).
Standard. Geneva, CH: International Organization for [21] A. Narayanan and V. Shmatikov. “Robust De-
Standardization, Feb. 2016. anonymization of Large Sparse Datasets”. In: IEEE
[10] International Organization for Standardization. Informa- Symposium on Security and Privacy, 2008. SP 2008.
tion technology – Security techniques – Privacy frame- May 2008, pp. 111–125.
work. Standard. Geneva, CH: International Organization [22] P. Samarati and L. Sweeney. Protecting privacy when
for Standardization, Dec. 2011. disclosing information: k-anonymity and its enforce-
[11] A. Liu et al. “Efficient secure similarity computation ment through generalization and suppression. Tech. rep.
on encrypted trajectory data”. In: 2015 IEEE 31st SRI International, 1998.
International Conference on Data Engineering (ICDE). [23] L. Sweeney. “k-anonymity: A model for protecting
2015 IEEE 31st International Conference on Data En- privacy”. In: International Journal of Uncertainty,
gineering (ICDE). 2015, pp. 66–77. Fuzziness and Knowledge-Based Systems 10.05 (2002),
[12] Y.-W. Chu et al. “Privacy-Preserving SimRank over Dis- pp. 557–570.
tributed Information Network”. In: 2012 IEEE 12th In- [24] A. Machanavajjhala et al. “L-diversity: Privacy beyond
ternational Conference on Data Mining (ICDM). 2012 k -anonymity”. In: ACM Transactions on Knowledge
IEEE 12th International Conference on Data Mining Discovery from Data 1.1 (2007), 3–es.
(ICDM). 2012, pp. 840–845. [25] N. Li et al. “t-Closeness: Privacy Beyond k-Anonymity
[13] G. Bender et al. “Explainable Security for Relational and l-Diversity.” In: ICDE. Vol. 7. 2007, pp. 106–115.
Databases”. In: Proceedings of the 2014 ACM SIGMOD [26] C. Dwork. “Differential privacy”. In: Automata, lan-
International Conference on Management of Data. SIG- guages and programming. Springer, 2006, pp. 1–12.
MOD ’14. New York, NY, USA: ACM, 2014, pp. 1411– [27] H. Zakerzadeh et al. “Privacy-preserving big data pub-
1422. lishing”. In: Proceedings of the 27th International Con-
[14] A. Meacham and D. Shasha. “JustMyFriends: Full SQL, ference on Scientific and Statistical Database Manage-
Full Transactional Amenities, and Access Privacy”. In: ment. ACM, June 2015, p. 26.
Proceedings of the 2012 ACM SIGMOD International [28] Y. Wang and B. Zheng. “Preserving privacy in social
Conference on Management of Data. SIGMOD ’12. networks against connection fingerprint attacks”. In:
New York, NY, USA: ACM, 2012, pp. 633–636. 2015 IEEE 31st International Conference on Data
Engineering (ICDE). 2015 IEEE 31st International Con-
ference on Data Engineering (ICDE). 2015, pp. 54–65.
3701
[29] P. Jurczyk et al. “DObjects+: Enabling Privacy- ference on Data Engineering (ICDE). 2014, pp. 664–
Preserving Data Federation Services”. In: 2012 IEEE 675.
28th International Conference on Data Engineering [38] B. Yao et al. “Secure nearest neighbor revisited”. In:
(ICDE). 2012 IEEE 28th International Conference on 2013 IEEE 29th International Conference on Data En-
Data Engineering (ICDE). 2012, pp. 1325–1328. gineering (ICDE). 2013 IEEE 29th International Con-
[30] J. Cao and P. Karras. “Publishing Microdata with a ference on Data Engineering (ICDE). 2013, pp. 733–
Robust Privacy Guarantee”. In: Proc. VLDB Endow. 744.
5.11 (2012), pp. 1388–1399. [39] E. Y. Gorodov and V. V. Gubarev. “Analytical review of
[31] G. Acs et al. “Differentially Private Histogram Pub- data visualization methods in application to big data”.
lishing through Lossy Compression”. In: 2012 IEEE In: Journal of Electrical and Computer Engineering
12th International Conference on Data Mining (ICDM). 2013 (Jan. 2013), p. 22.
2012 IEEE 12th International Conference on Data Min- [40] H. To et al. “PrivGeoCrowd: A toolbox for studying
ing (ICDM). 2012, pp. 1–10. private spatial Crowdsourcing”. In: 2015 IEEE 31st
[32] G. Cormode et al. “Differentially Private Spatial De- International Conference on Data Engineering (ICDE).
compositions”. In: 2012 IEEE 28th International Con- 2015 IEEE 31st International Conference on Data En-
ference on Data Engineering (ICDE). 2012 IEEE 28th gineering (ICDE). 2015, pp. 1404–1407.
International Conference on Data Engineering (ICDE). [41] G. Kellaris et al. “Differentially Private Event Se-
2012, pp. 20–31. quences over Infinite Streams”. In: Proc. VLDB Endow.
[33] X. Hu et al. “Differential Privacy in Telco Big Data Plat- 7.12 (2014), pp. 1155–1166.
form”. In: Proc. VLDB Endow. 8.12 (2015), pp. 1692– [42] C. Dwork et al. “Differential privacy under contin-
1703. ual observation”. In: Proceedings of the forty-second
[34] D. Proserpio et al. “Calibrating Data to Sensitivity in ACM symposium on Theory of computing. ACM, 2010,
Private Data Analysis: A Platform for Differentially- pp. 715–724.
private Analysis of Weighted Datasets”. In: Proc. VLDB [43] S. Peng et al. “Query optimization for differentially
Endow. 7.8 (2014), pp. 637–648. private data management systems”. In: 2013 IEEE 29th
[35] F. D. McSherry. “Privacy integrated queries: an ex- International Conference on Data Engineering (ICDE).
tensible platform for privacy-preserving data analysis”. 2013 IEEE 29th International Conference on Data En-
In: Proceedings of the 2009 ACM SIGMOD Interna- gineering (ICDE). 2013, pp. 1093–1104.
tional Conference on Management of data. ACM, 2009, [44] M. Sathiamoorthy et al. “XORing elephants: novel
pp. 19–30. erasure codes for big data”. In: Proceedings of the 39th
[36] P. Jain and A. Thakurta. “Differentially private learning international conference on Very Large Data Bases.
with kernels”. In: Proceedings of the 30th International VLDB’13. Trento, Italy: VLDB Endowment, 2013,
Conference on Machine Learning (ICML-13). 2013, pp. 325–336.
pp. 118–126. [45] P. Wang and C. V. Ravishankar. “Secure and effi-
[37] Y. Elmehdwi et al. “Secure k-nearest neighbor query cient range queries on outsourced databases using Rp-
over encrypted data in outsourced environments”. In: trees”. In: 2013 IEEE 29th International Conference
2014 IEEE 30th International Conference on Data En- on Data Engineering (ICDE). 2013 IEEE 29th Interna-
gineering (ICDE). 2014 IEEE 30th International Con- tional Conference on Data Engineering (ICDE). 2013,
pp. 314–325.
3702