SoK: Cryptographically Protected Database Search
Benjamin Fuller∗, Mayank Varia†, Arkady Yerukhimovich‡, Emily Shen‡, Ariel Hamlin‡,
Vijay Gadepally‡, Richard Shay‡, John Darby Mitchell‡, and Robert K. Cunningham‡
∗ University of Connecticut
Email: [email protected]
† Boston University
Email: [email protected]
‡ MIT Lincoln Laboratory
Email: {arkady, emily.shen, ariel.hamlin, vijayg, richard.shay, mitchelljd, rkc}@ll.mit.edu
arXiv:1703.02014v2 [cs.CR] 2 Jun 2017
Abstract—Protected database search systems cryptographically isolate the roles of reading from, writing to, and administering the database. This separation limits unnecessary administrator access and protects data in the case of system breaches. Since protected search was introduced in 2000, the area has grown rapidly; systems are offered by academia, start-ups, and established companies.

However, there is no best protected search system or set of techniques. Design of such systems is a balancing act between security, functionality, performance, and usability. This challenge is made more difficult by ongoing database specialization, as some users will want the functionality of SQL, NoSQL, or NewSQL databases. This database evolution will continue, and the protected search community should be able to quickly provide functionality consistent with newly invented databases.

At the same time, the community must accurately and clearly characterize the tradeoffs between different approaches. To address these challenges, we provide the following contributions:
1) An identification of the important primitive operations across database paradigms. We find there are a small number of base operations that can be used and combined to support a large number of database paradigms.
2) An evaluation of the current state of protected search systems in implementing these base operations. This evaluation describes the main approaches and tradeoffs for each base operation. Furthermore, it puts protected search in the context of unprotected search, identifying key gaps in functionality.
3) An analysis of attacks against protected search for different base queries.
4) A roadmap and tools for transforming a protected search system into a protected database, including an open-source performance evaluation platform and initial user opinions of protected search.

Index Terms—searchable symmetric encryption, property preserving encryption, database search, oblivious random access memory, private information retrieval

This material is based upon work supported under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the U.S. Air Force. The work of B. Fuller was performed in part while at MIT Lincoln Laboratory. The work of M. Varia was performed under NSF Grant No. 1414119 and additionally while a consultant at MIT Lincoln Laboratory.

I. INTRODUCTION

The importance of collecting, storing, and sharing data is widely recognized by governments [1], companies [2], [3], and individuals [4]. When these are done properly, tremendous value can be extracted from data, enabling better decisions, improved health, economic growth, and the creation of entire industries and capabilities.

Important and sensitive data are stored in database management systems (DBMSs), which support ingest, storage, search, and retrieval, among other functionality. DBMSs are vital to most businesses and are used for many different purposes. We distinguish between the core database, which provides mechanisms for efficiently indexing and searching over dynamic data, and the DBMS, which is software that accesses data stored in a database. A database’s primary purpose is efficient storage and retrieval of data. DBMSs perform many other functions as well: enforcing data access policies, defining data structures, providing external applications with strong transaction guarantees, serving as building blocks in complex applications (such as visualization and data presentation), replicating data, integrating disparate data sources, and backing up important sources. Recently introduced DBMSs also perform analytics on stored data. We concentrate on the database’s core functions of data insertion, indexing, and search.

As the scale, value, and centralization of data increase, so too do security and privacy concerns. There is demonstrated risk that the data stored in databases will be compromised. Nation-state actors target other governments’ systems, corporate repositories, and individual data for espionage and competitive advantages [5]. Criminal groups create and use underground markets to buy and sell stolen personal information [6]. Devastating attacks occur against government [7] and commercial [8] targets.

Protected database search technology cryptographically separates the roles of providing, administering, and accessing data. It reduces the risks of a data breach, since the server(s) hosting the database can no longer access data contents. Whereas most traditional databases require the server to be able to read all data contents in order to perform fast search and retrieval, protected search technology uses cryptographic techniques on data that is encrypted or otherwise encoded, so that the server can quickly answer queries without being able to read the plaintext data.
A. Protected Search Systems Today

Protected database search has reached an inflection point in maturity. In 2000, Song, Wagner, and Perrig provided the first scheme with communication proportional to the description of the query and the server performing (roughly) a linear scan of the encrypted database [9]. Building on this, the field quickly moved from theoretical interest to the design and implementation of working systems.

Protected database search solutions encompass a wide variety of cryptographic techniques, including property-preserving encryption [10], searchable symmetric encryption [11], private information retrieval by keyword [12], and techniques from oblivious random access memory (ORAM) [13]. Like the cryptographic elements used in their construction, protected search systems provide provable security based on the hardness of certain computational problems. Provable security comes with several other benefits: a rigorous definition of security, a thorough description of protocols, and a clear statement of assumptions.

Many of these systems have been implemented. Protected search implementations have been tested and found to scale moderately well, reporting performance results on datasets of billions of records [14]–[22].

In the commercial space, a number of established and startup companies offer products with protected search functionality, including Bitglass [23], Ciphercloud [24], CipherQuery [25], Crypteron [26], IQrypt [27], Kryptnostic [28], Google’s Encrypted BigQuery [29], Microsoft’s SQL Server 2016 [30], Azure SQL Database [31], PreVeil [32], Skyhigh [33], StealthMine [34], and ZeroDB [35]. While not all commercial systems have undergone thorough security analysis, their core ideas come from published work. For this reason, this paper focuses on systems with a published description.

Designing a protected search system is a balance between security, functionality, performance, and usability. Security descriptions focus on the information that is revealed, or leaked, to an attacker with access to the database server. Functionality is primarily characterized by the query types that a protected database can answer. Queries are usually expressed in a standard language such as the structured query language (SQL). Performance and usability are affected by the database’s data structures and indexing mechanisms, as well as required computational and network cost.

There is a wide range of protected database systems that are appropriate for different applications. With such a range of choices, it is natural to ask: Are there solutions for every database setting? If so, which solution is best?

B. Our Contribution

The answers to these questions are complex. Protected search replicates the functionality of some database paradigms, but the unprotected database landscape is diverse and rapidly changing. Even for database paradigms with mature protected search solutions, making an informed choice requires understanding the tradeoffs.

The goal of this work is twofold: first, to inform protected search designers on the current and future state of database technology, enabling focus on techniques that will be useful in future DBMSs, and second, to help security and database experts understand the tradeoffs between protected search systems so they can make an informed decision about which technology, if any, is most appropriate for their setting.

We accomplish these goals with the following contributions:
1) A characterization of database search functionality in terms of base and combined queries. Traditional databases efficiently answer a small number of queries, called a basis. Other queries are answered by combining these basis operations [36]. Protected search systems have implicitly followed this basis and combination approach. Although there are many database paradigms, the number of distinct bases of operations is small. We advocate for explicitly adopting this basis and combination approach.
2) An identification of the bases of current protected search systems and black-box ways to combine basis queries to achieve richer functionality. We then put protected search in the context of unprotected search by identifying basis functions currently unaddressed by protected search systems.
3) An evaluation of current attacks that exploit leakage of various protected search approaches to learn sensitive information. This gives a snapshot of the current security of available base queries.
4) A roadmap and tools for transforming a protected search system into a protected DBMS capable of deployment. We present an open-source software package developed by our team that aids with performance evaluation; our tool has evaluated protected search at the scale of 10TB of data. We also present preliminary user opinions of protected search. Lastly, we summarize systems that have made the transition to full systems, and we challenge other designers to think in terms of full DBMS functionality.

C. Organization

The remainder of this work is organized as follows: Section II introduces background on databases and protected search systems, Section III describes protected search base queries and leakage attacks against these queries, Section IV describes techniques for combining base queries and discusses remaining functionality gaps, Section V shows how to transform from queries to a full system, and Section VI concludes.

II. OVERVIEW OF DATABASE SYSTEMS

This section provides background on the databases and protected search systems that we study in this paper. We first describe unprotected database paradigms and their query bases. Next we define the types of users and operations of a database. We then describe the protected search problem, including its security goals and the security imperfections
known as leakage that schemes may exhibit. Finally, we give examples of common leakage functions found in the literature.

A. Database Definition and Evolution

A database is a partially-searchable, dynamic data store that is optimized to respond to targeted queries (e.g., those that return less than 5% of the total data). Database servers respond to queries through a well established API. Databases typically perform search operations in time sublinear in the database size due to the use of parallel architectures or data structures such as binary trees and B-trees.

Several styles of database engines have evolved over the past few decades. Relational or SQL-style databases dominated the database market from the 1970s to the 1990s. Over the past decade, there has been a focus on database systems that support many sizes of data management workloads [37]. NoSQL and NewSQL have emerged as new database paradigms, gaining traction in the database market [38], [39].

1) SQL: Relational databases (often called SQL databases) typically provide strong transactional guarantees and have a well known interface. Relational databases are vertically scalable: they achieve better performance through greater hardware resources. SQL databases comply with ACID (Atomicity, Consistency, Isolation, and Durability) requirements [40].

2) NoSQL: NoSQL (short for “not only SQL”) databases emerged in the mid 2000s. NoSQL optimizes the architecture for fast data ingest, flexible data structures, and relaxed transactional guarantees [41]. These changes were made to accommodate increasing amounts of unstructured data. NoSQL databases, for the most part, excel at horizontal scaling and when data models closely align with future computation.

3) NewSQL: NewSQL systems bring together the scalability of NoSQL databases and the transactional guarantees of relational databases [42]. Several NewSQL variants are being developed, such as in-memory databases that closely resemble the data models and programming interface of SQL databases, and array data stores that are optimized for numerical data analysis.

4) Future Systems: We expect the proliferation of customized engines that are tuned to perform a relatively small set of operations efficiently. While these systems will have different characteristics, we believe that each system will efficiently implement a small set of basis operations. There are several federated or polystore systems being developed [43]–[45].

The heterogeneous nature of current and future databases demands a variety of protected search systems. While providing such variety is challenging, there are a small number of base operations that can be combined to provide much of the functionality of the aforementioned systems.

B. Query Bases

To reduce the space of possible queries that must be secured, we borrow an approach from developers of software specifications and mathematical libraries [46]. In these fields, it is common to determine a core set of low-level kernels and then express other operations using these kernels. Similarly, many database technologies have a query basis: a small set of base operations that can be combined to provide complex search functionality. Furthermore, multiple technologies share the same query basis. In some cases the basis was not explicit in the original design but was formalized in later work. Apache Accumulo’s native API does not have a rigorous mathematical design, but frameworks such as D4M [47], [48] and Pig [49] used to manipulate data in Accumulo do.

Leveraging an underlying query basis will allow the protected search community to keep pace with new database systems. We discuss three bases found in database systems. First, relational algebra forms the backbone of many SQL and NewSQL systems [42]. Second, associative arrays provide a mathematical basis for SQL, NoSQL, and NewSQL systems [50]. Third, linear algebraic operations form a basis for some NewSQL databases. These bases and database paradigms are summarized in Table I.

1) Relational Algebra: Relational algebra, proposed by Codd in 1970 as a model for SQL [36], consists of the following primitives: set union, set difference, Cartesian product (joins), projection, and selection. Complex queries can typically be generated by composing these operations. Relational algebra and the composability of operations allow a server-side query planner to optimize query execution by rearranging operations to still produce the same result [68].

2) Associative Arrays: Associative arrays are a mathematical basis for several styles of database engines [50]. They provide a mathematical foundation for key-value store NoSQL databases. Associative array algebra consists of the following base operations: construction, find, associative array addition, associative array element-wise multiplication, and array multiplication [47]. Associative arrays are built on top of the algebraic concept of a semiring (a ring without an additive inverse). Addition or multiplication in an associative array can denote any two binary operations from an algebraic semiring. Usually, these two operations are the traditional × and +, but in the min-plus algebra the two operations are min and + (in the max-plus algebra the two operations are max and +).

3) Linear Algebra: A number of newer NewSQL databases support linear algebraic operations. GraphBLAS is a current standardization effort underway for graph algorithms [69]. In GraphBLAS, graph data is stored using sparse matrices, and the linear algebraic base operations of construction, find, matrix addition, matrix multiplication, and element-wise multiplication are composed to create graph algorithms. Examples of how the GraphBLAS can be applied to popular graph algorithms are given in [70], [71].

C. Database Roles and Operations

We consider five important database roles, analogous to roles in database systems like Microsoft SQL Server 2016 [72]:
∙ A provider, who provides and modifies the data.
∙ A querier, who wishes to learn things about the data.
∙ A server, who handles storage and processing.
TABLE I
Summary of a (not exhaustive) set of popular current and emerging query bases together with their corresponding database technologies. Characteristics, strengths, weaknesses, and examples refer to the technologies, not the query bases.

Query basis: Relational algebra (set union, set difference, products/joins, projection, selection)
  Technology: SQL [36], relational
    Fundamental characteristics: transaction support; ACID guarantees; table representation of data
    Strengths: popular interface; common data model [51]
    Weaknesses: upfront schema design; low insert & query rate
    Examples: MySQL [52]; Oracle DB [53]; Postgres [54]
  Technology: NewSQL [42], relational
    Fundamental characteristics: use of in-memory, new system arch. or simplified data model
    Strengths: popular interface; transactional support; ACID guarantees
    Weaknesses: req. expensive hardware; often only relational data model
    Examples: Spanner [55]; MemSQL [56]; Spark SQL [57]
  Technology: Federated [58]
    Fundamental characteristics: relational model; partitioned/replicated tables
    Strengths: transactional support; high performance; ACID guarantees
    Weaknesses: upfront schema design; often only relational data model
    Examples: Garlic [58]; DB2 [59]

Query basis: Associative array algebra (construct, find, array (+, ×), element-wise ×)
  Technology: NoSQL [41], key-value
    Fundamental characteristics: horizontal scalability; data rep. as key-value pairs; BASE guarantees [60]
    Strengths: high insert rates; cell-level security; flexible schema
    Weaknesses: sacrifice one of the following: consistency, availability, or partition tolerance
    Examples: BigTable [41]; Accumulo [61]; HBase [62]; mongoDB [63]

Query basis: Linear algebra (construct, find, matrix (+, ×), element-wise ×)
  Technology: NoSQL [64], graph databases
    Fundamental characteristics: data represented as adjacency or incidence matrix; horizontal scalability; graph operation API
    Strengths: natural data representation; amenable to graph algs.
    Weaknesses: performance; diverse data models; difficult to optimize
    Examples: Neo4j [64]; System G [65]
  Technology: NewSQL [66], array databases
    Fundamental characteristics: ACID guarantees; data represented as arrays (dense or sparse)
    Strengths: high performance; transactional support; good for scientific comp.
    Weaknesses: data model restrictions; lack of iterator support
    Examples: SciDB [66]; TileDB [67]

Query basis: Multiple bases
  Technology: Polystore [43]
    Fundamental characteristics: disparate DBMSs
    Strengths: high performance; flexible data stores; diverse data/programming models
    Weaknesses: requires middleware
    Examples: BigDAWG [43]; Myria [45]
∙ An authorizer, who specifies data- and query-based rules.
∙ An enforcer, who ensures that rules are applied.

Databases provide an expressive language for representing permissions, or rules. Rules are enforced by authenticating the roles possessed by a valid user and granting her the appropriate powers. In general, each user may perform multiple roles, and each role may be performed by multiple users.

While databases offer a wide range of features, we focus on four operations: 𝐈𝐧𝐢𝐭, 𝐐𝐮𝐞𝐫𝐲, 𝐔𝐩𝐝𝐚𝐭𝐞, and 𝐑𝐞𝐟𝐫𝐞𝐬𝐡. These operations are common across the database paradigms described above; we describe them below in the context of protected search.
∙ 𝐈𝐧𝐢𝐭: The initialization protocol occurs between the provider (who has data to load) and the server. The server obtains a protected database representing the loaded data.
∙ 𝐐𝐮𝐞𝐫𝐲: The query protocol occurs between the querier (with a query), the server (with the protected database), the enforcer (with the rules), and possibly the provider. The querier obtains the query results if the rules are satisfied.
∙ 𝐔𝐩𝐝𝐚𝐭𝐞: The update protocol occurs between the provider (with a set of updates) and the server. The server obtains an updated protected database. Updates include insertions, deletions, and record modifications.
∙ 𝐑𝐞𝐟𝐫𝐞𝐬𝐡: The refresh protocol occurs between the provider and the server. The server obtains a new protected database that represents the same data but is designed to achieve better performance and/or security.

All systems considered in this work support 𝐈𝐧𝐢𝐭 and 𝐐𝐮𝐞𝐫𝐲, but only some systems support 𝐔𝐩𝐝𝐚𝐭𝐞 and 𝐑𝐞𝐟𝐫𝐞𝐬𝐡; see Tables II and V for details.

D. Protected Database Search Systems

Informally, a protected search system is a database system that supports the roles and operations defined above, in which each party learns only its intended outputs and nothing else. In particular, a protected search system aims to ensure that the server learns nothing about the data stored in the protected database or about the queries, and the querier learns nothing beyond the query results. These security goals can be formalized using the real-ideal style of cryptographic definition. In this paradigm, one imagines an ideal version of a protected search system, in which a trusted external party performs storage, queries, and modifications correctly and reveals only the intended outputs to each party. The real system is said to be secure if no party can learn more from its real world interactions than it can learn in the ideal system.

We restrict our attention in this work to protected database search systems that provide formally defined security guarantees based upon the strength of their underlying cryptographic primitives. Some of the commercial systems mentioned in the introduction lack formal security reductions; although they are based on techniques with proofs of security, analysis is required to determine whether differences from those proven techniques affect security.

Scenarios: Only a few existing protected search systems consider the enforcement of rules (i.e., include an authorizer and enforcer). Therefore, in this paper we focus primarily on two scenarios: a three-party scenario comprising a provider, a querier, and a server, and a two-party scenario in which a single user acts as both the provider and the querier (we denote
this combined entity as the client). The latter scenario models a cloud storage outsourcing application in which a client uploads files to the cloud that she can later request. In the two-party setting, the client has the right to know all information in the database so it is only necessary to consider security against an adversarial server. In this work, we focus on protected search in the case of a single provider and a single querier; for the more general setting in which multiple users can perform a single role, see Section V and [73].

We stress that a secure search system for one scenario does not automatically extend to another scenario. Additionally, despite the limited attention in the literature thus far, we believe that the authorizer and enforcer roles are an important aspect of the continued maturation of protected search systems; see Section V-A for additional discussion.

Threats: There are two types of entities that may pose security threats to a database: a valid user known as an insider who performs one or more of the roles, and an outsider who can monitor and potentially alter network communications between valid users. We distinguish between adversaries that are semi-honest (or honest-but-curious), meaning they follow the prescribed protocols but may passively attempt to learn additional information from the messages they observe, and those that are malicious, meaning they are actively willing to perform any action necessary to learn additional information or influence the execution of the system. An outsider adversary (even a malicious one) can be thwarted using secure point-to-point channels. Furthermore, we distinguish between adversaries that persist for the lifetime of the database and those that obtain a snapshot at a single point in time [74]. The bulk of active research in protected search technology considers semi-honest security against a persistent insider adversary.

Performance and Leakage: While unprotected databases are often I/O bound, protected systems may be compute or network bound. We can measure the performance of a protected operation by calculating the computational overhead and the additional network use (in both the number of messages and the total amount of data transmitted). The type of cryptographic operations matters as well: whenever possible, slower public-key operations should be avoided or minimized in favor of faster symmetric-key primitives.

In order to improve performance, many protected search systems reveal or leak information during some or all operations. Leakage should be thought of as an imperfection of the scheme. The real-ideal security definition is parameterized by the system’s specific leakage profile, which comprises a sequence of functions that formally describe all information that is revealed to each party beyond the intended output. A security proof demonstrates that the claimed leakage profile is an upper bound on what is actually revealed to an adversary. Protected search systems’ security is primarily distinguished by their leakage profile; our security discussion focuses on leakage.

While leakage profiles are comprehensive, it is often difficult to interpret them and to assess their impact on the security of a particular application (see Section III-B). To help with this task, the next section identifies common types of leakage.

E. Common Leakage Profiles

This section provides a vocabulary (partially based on Kamara [75]) to describe common features of leakage systematically. While the exact descriptions of leakage profiles are often complex, their essence can mostly be derived from four characteristics: the objects that leak, the type of information leaked about them, which operation leaks, and the party that learns the leakage.

The following types of objects within a protected search system are vulnerable to leakage.
1) Data items, and any indexing data structures.
2) Queries.
3) Records returned in response to queries, or other relationships between the data items and the queries (e.g., records that partially match a conjunction query).
4) Access control rules and the results of their application.

Next, we categorize the information leaked from each object. The leakage may occur independently for each query or response, or it may depend upon the prior history of queries and responses. For complex queries like Booleans, leakage may also depend on the connections between the clauses of a query. While the details of leakage may depend on the specific data structures used for representing and querying the data, we list five general categories of information that may be leaked from objects, ranked from the least to most damaging. We use this ranking throughout our discussion of base queries.
○ Structure: properties of an object only concealable via padding, such as the length of a string, the cardinality of a set, or the circuit or tree representation of an object.
◔ Identifiers: pointers to objects so that their past/future accesses are identifiable.
◑ Predicates: identifiers plus additional information on objects. Examples include “matches the intersection of 2 clauses within a query” and “within a common (known) range.”
◕ Equalities: which objects have the same value.
● Order (or more): numerical or lexicographic ordering of objects, or perhaps even partial plaintext data.

Each of the four database operations may leak information. During 𝐈𝐧𝐢𝐭, the server may receive leakage about the initial data items. Every party may receive leakage during a 𝐐𝐮𝐞𝐫𝐲: the querier may learn about the rules and the current data items; the server may learn about the query, the rules, and the current data items; the provider may learn about the query and rules; and the enforcer may learn about the query and current data items. During 𝐔𝐩𝐝𝐚𝐭𝐞, the server may receive leakage about the prior and new data records. During a 𝐑𝐞𝐟𝐫𝐞𝐬𝐡, the server may receive leakage about the current data items.

In a two-party protected search system without 𝐔𝐩𝐝𝐚𝐭𝐞 or rules it suffices to describe the leakage to the server during 𝐈𝐧𝐢𝐭 and 𝐐𝐮𝐞𝐫𝐲. In this setting, common components of leakage profiles include: equalities of queries (often called search patterns); identifiers of data items returned across multiple
queries (often called access patterns); the circuit topology of backwards compatibility comes at a cost to security. The
a boolean query; and cardinalities and lengths of data items, Custom Index (or 𝙲𝚞𝚜𝚝𝚘𝚖) approach achieves lower leakage
queries, and query responses. Dynamic databases must also at the expense of designing special-purpose protected indices
consider leakage during 𝐔𝐩𝐝𝐚𝐭𝐞 and 𝐑𝐞𝐟 𝐫𝐞𝐬𝐡. Three-party together with customized protocols that enable the querier and
databases with access restrictions must also consider leakage server to traverse the indices together. We highlight a third
to the provider and querier about any objects they didn’t approach Oblivious Index (or 𝙾𝚋𝚕𝚒𝚟), which is a subset of
produce themselves. 𝙲𝚞𝚜𝚝𝚘𝚖 that provides stronger security by obscuring object
identifiers (i.e., hiding repeated retrieval of the same record).
F. Comparison with Other Approaches
We intentionally define protected database search by its objective rather than the techniques used. As we will see in Section III, many software-based techniques suffice to construct protected database search. Many hardware-based solutions like [76] are viable and valuable as well; however, they use orthogonal assumptions and techniques to software-only approaches. To maintain a single focus in this SoK, we restrict our attention to software-only approaches.
Within software-only approaches, the cryptographic community has developed several general primitives that address all or part of the protected database search problem.
∙ Secure multi-party computation [77]–[79], fully homomorphic encryption [80]–[82], and functional encryption [83] hide data while computing queries on it.
∙ Private information retrieval [12], [84], [85] and oblivious random-access memory (ORAM) [13] hide access patterns over the data retrieved. On their own, they typically do not support searches beyond a simple index retrieval; however, several schemes we discuss in the next section use ORAM in their protocols to hide access patterns while performing expressive database queries.
Protected search techniques in the literature often draw heavily from these primitives, but rarely rely exclusively on one of them in its full generality. Instead, they tend to use specialized protocols, often with some leakage, with the goal of improving performance.
Another related area of research known as authenticated data structures ensures correctness in the presence of a malicious server but does not provide confidentiality (e.g., [86]–[90]). In general, authenticated data structures do not easily compose with protected database search systems.
III. BASE QUERIES
In this section, we identify basis functions that currently exist in protected search. The section provides systematic reviews of the different cryptographic approaches used across query types and an evaluation of known attacks against them. Due to length limitations, we focus on the Pareto frontier of schemes providing the currently best available combinations of functionality, performance, and security. This means that we omit any older schemes that have been superseded by later work. For a historical perspective including such schemes, we refer readers to relevant surveys [73], [91].
We categorize the schemes into three high-level approaches. The Legacy Index (or 𝙻𝚎𝚐𝚊𝚌𝚢) approach can be used with an unprotected database server; it merely modifies the provider’s data insertions and the querier’s requests. However, this
A. Base Query Implementations
Cryptographic protocols have been developed for several classes of base queries. The most common constructions are for equality, range, and boolean queries (which evaluate boolean expressions over equality and/or range clauses), though additional query types have been developed as well. Here, we summarize some of the techniques for providing these functionalities, splitting them based on the approach used.
The text below focuses on the distinct benefits of each base query mechanism; Table II systematizes the common security, performance, and usability dimensions along which each scheme can be measured. From a security point of view, we list the index approach, threat model (cf. Section II-D), and the amount of leakage that the server learns about the data items during 𝐈𝐧𝐢𝐭 and 𝐐𝐮𝐞𝐫𝐲 (cf. Section II-E). Performance and usability are described along three dimensions: the scale of updates and queries that each scheme has been shown to support, the type and amount of cryptography required to support updates and queries, and the network latency and bandwidth characteristics.
1) 𝙻𝚎𝚐𝚊𝚌𝚢: Property-preserving encryption [10] produces ciphertexts that preserve some property (e.g., equality or order) of the underlying plaintexts. Thus, protected searches (e.g., equality or range queries) can be supported by inserting ciphertexts into a traditional database, without changing the indexing and querying mechanisms. As a result, 𝙻𝚎𝚐𝚊𝚌𝚢 schemes immediately inherit decades of advances and optimizations in database management systems.
Equality: Deterministic encryption (DET) [15], [92] applies a randomized-but-fixed permutation to all messages so equality of ciphertexts implies equality of plaintexts, enabling lookups over encrypted data. All other properties are obscured. However, deterministic encryption typically reveals equalities between data items to the server even without the querier making any queries.
Range: Order-preserving encryption (OPE) [93]–[95] preserves the relative order of the plaintexts, enabling range queries to be performed over ciphertexts. This approach requires no changes to a traditional database, but comes at the cost of quite significant leakage: roughly, in addition to revealing the order of data items, it also leaks the upper half of the bits of each message [94]. Improving on this, Boldyreva et al. [95] show how to hide message contents until queries are made against the database. Mavroforakis et al. [96] further strengthen security using fake queries. Finally, mutable
OPE [97] reveals only the order of ciphertexts at the expense of added interactivity during insertion and query execution.
Many 𝙻𝚎𝚐𝚊𝚌𝚢 approaches can easily be extended to perform boolean queries and joins by simply combining the results of the equality or range queries over the encrypted data. CryptDB [15] handles these query types using a layered or onion approach that only reveals properties of ciphertexts as necessary to process the queries being made. They demonstrate at most 30% performance overhead over MySQL, though this value can be much smaller depending on the networking and computing characteristics of the environment.
𝙻𝚎𝚐𝚊𝚌𝚢 approaches have been adopted industrially [98] and deployed in commercial systems [23]–[35]. However, as we will explain in Section III-B and Table III, even the strongest 𝙻𝚎𝚐𝚊𝚌𝚢 schemes reveal substantial information about queries and data to a dedicated attacker.
2) 𝙲𝚞𝚜𝚝𝚘𝚖 Inverted Index: Several works over the past decade support equality searches on single-table databases via a reverse lookup that maps each keyword to a list of identifiers for the database records containing the keyword (e.g., [11], [99]). Newer works provide additional features and optimizations for such equality searches. Blind Storage [100] shows how to do this with low communication and a very simple server, while Sophos [101] shows how to achieve a notion of forward security hiding whether new records match older queries (this essentially runs 𝐑𝐞𝐟𝐫𝐞𝐬𝐡 on every 𝐈𝐧𝐬𝐞𝐫𝐭).
OSPIR-OXT [18]–[21] additionally supports boolean queries: the inverted index finds the set of records matching the first term in a query, and a second index containing a list of (record identifier, keyword) pairs is used to check whether the remaining terms of the query are also satisfied. Cryptographically, the main challenge is to link the two indices obliviously, so that the server only learns the connections between terms in the same query. Going beyond boolean queries, Kamara and Moataz [102] intelligently combine several inverted indices in order to support the selection, projection, and Cartesian product operations of relational algebra with little overhead on top of the underlying inverted index (specifically, only using symmetric cryptography). They do so at the expense of introducing additional leakage. Moataz’s Clusion library implements many inverted index-based schemes [103], [104].
Cash and Tessaro demonstrate that secure inverted indices must necessarily be slower than their insecure counterparts, requiring extra storage space, several non-local read requests, or large overall information transfer [105].
3) 𝙲𝚞𝚜𝚝𝚘𝚖 Tree Traversal: Another category of 𝙲𝚞𝚜𝚝𝚘𝚖 schemes uses indices with a tree-based structure. Here a query is executed (roughly) by traversing the tree and returning the leaf nodes at which the query terminates. The main cryptographic challenge here is to hide the traversal pattern through the tree, which can depend upon the data and query. For equality queries, Kamara and Papamanthou [106] show how to do this in a parallelizable manner; with enough parallel processing they can achieve an amortized constant query cost. Stefanov et al. [107] show how to achieve forward privacy using a similar approach.
The BLIND SEER system [16], [17] supports boolean queries by using an index containing a search tree whose leaves correspond to records in the database, and whose nodes contain (encrypted) Bloom filters storing the set of all keywords contained in their descendants. A Bloom filter is a data structure that allows for efficient set membership queries. To execute a conjunctive query, the querier and server jointly traverse the tree securely using Yao’s garbled circuits [108], a technique from secure two-party computation, following branches whose Bloom filters match all terms in the conjunction. Chase and Shen [109] design a protection method based on suffix trees to enable substring search.
Tree-based indices are also amenable to range searches. The Arx-RANGE protocol [110] builds an index for answering range queries without revealing all order relationships to the server. The index stores all encrypted values in a binary tree so range queries can be answered by traversing this tree for the end points. Using Yao’s garbled circuits, the server traverses the index without learning the values it is comparing or the result of the comparison at each stage. Roche et al.’s partial OPE protocol [111] provides a different tradeoff between performance and security with a scheme optimized for fast insertion that achieves essentially free insertion and (amortized) constant time search at the expense of leaking a partial order of the plaintexts.
4) Other 𝙲𝚞𝚜𝚝𝚘𝚖 Indices: We briefly mention protected search mechanisms supporting other query types: ranking results of boolean queries [112], [113], calculating the inner product with a fixed vector [114], [115], and computing the shortest distance on a graph [116]. These schemes mostly work by building encrypted indices out of specialized data structures for performing the specific query computation. For example, Meng et al.’s GRECS system [116] provides several different protocols with different leakage/performance tradeoffs that encrypt a sketch-based (graph) distance oracle to enable secure shortest distance queries.
5) 𝙾𝚋𝚕𝚒𝚟: This class of protected search schemes aims to hide common results between queries. Oblivious RAM (ORAM) has been a topic of research for twenty years [117] and the performance of ORAM schemes has progressed steadily. Many of the latest implementations are based on the Path ORAM scheme [118]. However, applying ORAM techniques to protected search is still challenging [119].
𝙾𝚋𝚕𝚒𝚟 schemes typically hide data identifiers across queries by re-encrypting and moving data around in a data structure (e.g., a tree) stored on the server. Several equality schemes use the 𝙾𝚋𝚕𝚒𝚟 approach. Roche et al.’s vORAM+HIRB scheme [120] observes that search requires an ORAM capable of storing varying size blocks since different queries may result in different numbers of results. They design an efficient variable-size ORAM (vORAM) and combine it with a history independent data structure to build a keyword search scheme. Garg et al.’s TWORAM scheme [121] focuses on reducing the number of rounds required by an ORAM-type secure search. They use a garbled RAM-like [122] construction to build a two-round ORAM resulting in a four-round search scheme
for equality queries. Moataz and Blass [123] design oblivious versions of suffix arrays and suffix trees to provide an 𝙾𝚋𝚕𝚒𝚟 scheme for substring queries. While offering greater security, these schemes still tend to be slower than the constructions in the other classes.
An alternative approach is to increase the number of parties. This approach is taken by Faber et al.’s 3PC-ORAM scheme [124] and Ishai et al.’s shared-input shared-output symmetric private information retrieval (SisoSPIR) protocol [22] to support range queries. 3PC-ORAM shows how, by adding a second non-colluding server, one can build an ORAM scheme that is much simpler than previous constructions. SisoSPIR uses a distributed protocol between a client and two non-colluding servers to traverse a (per-field) B-tree in a way that neither server learns anything about which records are accessed. By deviating from the standard ORAM paradigm, these schemes are able to approach the efficiency typically achieved by Custom Index schemes that do not hide access patterns.
6) Supporting Updates: Another important aspect of secure search schemes is whether they support 𝐔𝐩𝐝𝐚𝐭𝐞. While update functionality is critical for many database applications, it is not supported by many protected search schemes in the 𝙲𝚞𝚜𝚝𝚘𝚖 and 𝙾𝚋𝚕𝚒𝚟 categories. Those that support updates do so in one of two ways. For ease of presentation, consider a newly inserted record. In most 𝙻𝚎𝚐𝚊𝚌𝚢 schemes the new value is immediately inserted into the database index, allowing for queries to efficiently return this value immediately after insertion. In many 𝙲𝚞𝚜𝚝𝚘𝚖 schemes, e.g., [16], new values are inserted into a side index on which a less efficient (typically, linear time) search can be used. Periodically performing 𝐑𝐞𝐟𝐫𝐞𝐬𝐡 incorporates this side index into the main index; however, due to the cost of 𝐑𝐞𝐟𝐫𝐞𝐬𝐡 it is not possible to do this very frequently. Thus, depending on the frequency and size of updates, update capability may be a limiting functionality of protected search. In particular, a major open question is to build protected search capable of supporting the very high ingest rates typical of NoSQL databases. We return to this open problem in Section V. Roche et al. [111] take a step in this direction with a 𝙲𝚞𝚜𝚝𝚘𝚖 scheme for range queries capable of supporting very high insert rates.
Table II systematizes the protected search techniques discussed in this section along with some basic information about the (admittedly nuanced) leakage profiles that they have been proven to meet. There are several correlations between columns of the table; some of these connections reveal fundamental privacy-performance tradeoffs whereas others simply reflect the current state of the art. To provide one example in the latter category: most 𝙻𝚎𝚐𝚊𝚌𝚢 systems leak information at ingestion, whereas most 𝙲𝚞𝚜𝚝𝚘𝚖 systems only leak information after queries have been made against the system. The recent Arx-EQ [14] bucks this trend by requiring the client to remember the frequency of each keyword.
B. Leakage Inference Attacks
In this subsection and Table III, we summarize leakage inference attacks that can exploit the leakage revealed by a protected search system in order to recover some information about sensitive data or queries. Hence, this section details the real-world impact of the leakage bounds and threat models depicted in Table II. The two tables are connected via a JOIN on the “𝑆 leakage” columns: a protected search scheme is affected by an attack if the scheme’s leakage to the server is at least as large as the attack’s required minimum leakage.
We stress that leakage inference is a new and rapidly evolving field. As a consequence, the attacks in Table III only cover a subset of leakage profiles included in Table II. Additionally, this section merely provides lower bounds on the impact of leakage because attacks only improve over time.
We start by introducing the different dimensions that characterize attack requirements and efficacy. Then, we sketch a couple of representative attacks from the literature. Finally, we describe how the provider and querier should use these attacks to inform their choice of a search system that adequately protects their interests.
1) Attack Requirements: We classify attacks along four dimensions: attacker goal, required leakage, attacker model, and prior knowledge. The attacker is the server in all of the attacks we consider, except for the Communication Volume Attack of [125], which can be executed by a network observer who knows the size of the dataset. We expect future research on attacks using leakage available to other insiders.
a) Attacker Goal: Current attacks try to recover either a set of queries asked by the querier (query recovery) or the data being stored at the server (data recovery).
b) Required Leakage: This is the leakage function that must be available to the attacker. We focus on the common leakage functions on the dataset and responses identified in Section II-E. Examples include the cardinality of a response set, the ordering of records in the database, and identifiers of the returned records. Some attacks require leakage on the entire dataset while others only require leakage on query responses.
c) Attacker Model: Current inference attacks assume one of two attacker models. The first is a semi-honest attacker as discussed in Section II-D. The second is an attacker capable of data injection: it can create specially crafted records and have the provider insert them into the database. Note that this capability falls outside the usual malicious model for the server. The attacker’s ability to perform data injection depends on the use case. For example, if a server can send an email to a user that automatically updates the protected database, this model is reasonable. On the other hand, it might be harder to insert an arbitrary record into a database of hospital medical records.
d) Attacker Prior Knowledge: All current attacks assume some prior knowledge, which is usually information about the stored data but may include information about the queries made. Attack success is judged by the ability to learn information beyond the prior knowledge. The following types of prior
knowledge (ordered from most to least information) help to execute attacks.
● Contents of full dataset: the data items contained in the database. The only possible attacker goal in this case is query recovery.
◕ Contents of a subset of dataset: a set of items contained in the database. Both attacker goals are interesting in this case.
◐ Distributional knowledge of dataset: information about the probability distribution from which database entries are drawn. For example, this could include knowledge of the frequency of first names in English-speaking countries. This type of knowledge can be gained by examining correlated datasets.
◔ Distributional knowledge of queries: information about the probability distribution from which queries are drawn. As above, this might be knowledge that names will be queried according to their frequency in the overall population.
○ Keyword universe: knowledge of the possible values for each field.
Naturally, attacks that require full knowledge of the data are more effective; the reasonableness of this assumption should be evaluated for each use case.
2) Attack Efficacy: We evaluate attack efficacy qualitatively in terms of three metrics: 1) the runtime of the attack, including time required to create any inserted records; 2) the sensitivity of the recovery rate to the amount of prior knowledge; and 3) the keyword universe size attacked. Note that the strength of an attack is strongly application-dependent; an attack that is devastating on one dataset may be completely ineffective on another dataset.
Table III characterizes currently known attacks based upon their requirements and efficacy. All of the attacks described in the table only require modest computing resources.
3) Attack Techniques: Leakage inference attacks against protected search systems have evolved rapidly over the last few years, with Islam et al. [132] in 2012 inspiring many other papers. Most of the attacks in Table III rely on the following two facts: 1) different keywords are associated with different numbers of records, and 2) most systems reveal keyword identifiers for a record either at rest (e.g., DET [15] reveals during 𝐈𝐧𝐢𝐭 if records share keywords) or when it is returned across multiple queries (e.g., Blind Seer [16] reveals during 𝐐𝐮𝐞𝐫𝐲 which returned records share keywords). To give intuition for how these attacks work, we briefly summarize two entries of Table III.
Cash et al.’s [128] Count Attack is a conceptually simple way to exploit this information. Assume the attacker has full knowledge of the database and is trying to learn the query. The attacker sees how many records are returned in response to a query. If that number is unique, it can identify the query. Furthermore, by identifying the query, the attacker learns that every returned record is associated with that keyword.
For example, suppose the attacker learns the first query was for LastName = ‘Smith’. Now consider a second query for an unknown first name. The query does not return a unique number of records, so the method above cannot be used. Suppose that FirstName=‘John’ and FirstName=‘Matthew’ both return 1000 records. The attacker can also check how many records are in common with the previous query. This creates an additional constraint: for example, there may be 100 records with name ‘John Smith’ but only 10 records with name ‘Matthew Smith’. By checking record overlap with the previously identified query, the attacker can identify the queried first name. This attack iteratively identifies queries and uses them as additional constraints to identify unknown queries.
Cash et al.’s attack is fairly simple and performs well if the keyword universe size is at most 5000. However, it requires a large portion of the dataset to be known to the attacker. With 80% of the dataset known to the attacker, Cash et al. [128] achieve a 40% keyword recovery rate.
Zhang et al. [127] extend the Count Attack to a malicious adversary setting, allowing a server to inject a set of constructed records. This capability greatly improves keyword recovery. By carefully constructing a small number of these records (e.g., nine records for a universe of 5000 keywords), it is possible to search the keyword universe and identify the keyword. Although the records are fairly large, the attack extends to databases that allow only a limited number of keywords per data record. This attack recovers more keywords than the attack of Cash et al.: 40% of the data must be leaked to obtain a 40% keyword recovery rate.
4) Discussion: The provider and querier rely upon protected search to protect themselves against the server, or anyone who compromises the server. Our systematization of attacks shows that they should consider the following four questions before choosing a protected search technique to use.
∙ How large is the keyword universe?
∙ How much of the dataset or query keyword universe (and corresponding frequency) can the attacker predict?
∙ Can an attacker reasonably insert crafted records?
∙ Does the adversary have persistent access to the server, or merely a snapshot of it at a single point in time?
Answers to the first three questions depend upon the intended use case. For example, a system with a smaller leakage profile may be necessary in a setting where the keyword universe is small and the attacker has the ability to add records. A system with a larger leakage profile may suffice in a setting where the keyword universe is very large.
The fourth question pertains to adversaries who compromise the server. 𝙻𝚎𝚐𝚊𝚌𝚢 schemes tend to leak information about the entire database to the server. Thus, using the terminology of Grubbs et al. [74], they are susceptible to an adversary who only gets a snapshot of the database at some point in time. In contrast, 𝙲𝚞𝚜𝚝𝚘𝚖 schemes tend to reveal information about records only during record retrieval or index modification as part of the querying process, so they require a persistent adversary who can observe the evolution of the database state over time. (We note however that many Boolean schemes have additional leakage about data statistics for the entire database.)
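The Count Attack walk-through in Section III-B3 can be illustrated with a minimal sketch. This is not the implementation of Cash et al. [128]; it only mirrors the two steps described in the text (unique result counts, then overlap with already identified queries). The function name `count_attack`, the toy keyword names, and the opaque record identifiers are all hypothetical, and the sketch assumes the attacker knows the full dataset and sees which opaque identifiers each query returns.

```python
def count_attack(known_index, observed):
    """Toy query recovery from result counts and result overlaps.

    known_index: keyword -> set of plaintext record ids (attacker's prior knowledge).
    observed:    list of sets of opaque record ids, one set per query.
    Returns a list with the recovered keyword (or None) for each query.
    """
    recovered = [None] * len(observed)
    progress = True
    while progress:
        progress = False
        for i, result in enumerate(observed):
            if recovered[i] is not None:
                continue
            solved = [j for j, kw in enumerate(recovered) if kw is not None]
            candidates = [
                kw for kw, recs in known_index.items()
                # Constraint 1: the result count must match.
                if len(recs) == len(result)
                # Constraint 2: overlap with every identified query must match.
                and all(
                    len(recs & known_index[recovered[j]]) == len(result & observed[j])
                    for j in solved
                )
            ]
            if len(candidates) == 1:
                recovered[i] = candidates[0]
                progress = True
    return recovered


# Hypothetical toy data: 'Smith' has a unique count; 'John' and 'Matthew' tie.
known_index = {
    "last:Smith": {1, 2, 3, 4, 5},
    "first:John": {1, 2, 3, 6},
    "first:Matthew": {4, 7, 8, 9},
}
# The server only sees opaque identifiers (here letters) for returned records.
observed = [{"a", "b", "c", "d", "e"}, {"a", "b", "c", "f"}]
recovered = count_attack(known_index, observed)
```

In this toy run the first query is identified by its unique count, and the second is disambiguated because it shares three records with the first, which matches ‘John’ but not ‘Matthew’.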
Column groups: Threats, 𝑆 leakage, Scale, Crypto, Network
Column headers: Adversarial 𝑄, Implemented?, Adversarial 𝑆, Query: # ops, # round trips, Insert: # ops, Crypto type, Scale tested, # of parties, Query type, Updatable?, Data sent, Query, Init
Scheme (References) | Approach | Unique feature
Equality
Arx-EQ [14] | 𝙻𝚎𝚐𝚊𝚌𝚢 | 2 — ◐ ○ ◔ ✔ ◐ | legacy compliant
Kamara-Papamanthou [106] | 𝙲𝚞𝚜𝚝𝚘𝚖 | 2 — ◐ ○ ◔ — — ○ ○ | parallelizable
Blind Storage [100] | 𝙲𝚞𝚜𝚝𝚘𝚖 | 2 — ◐ ○ ◔ ✔ ◐ ◐ ◐ | low 𝑆 work
Row-group labels: Query; Recovery
○ ◔ — ◕ ◐ | Partially Known Documents [128]
○ ◔ ✔ ◕ ◐ ○ | Hierarchical-Search Attack [127]
○ ◔ — ◐ | Count Attack [128]
○ ◔ — ◐ ◐ | Graph Matching Attack [129]
◕ — — ◐ ○ ? ○ | Frequency Analysis [130]
◕ — ✔ ◐ ○ ? | Active Attacks [128]
— — ◐ ○ ○ | Non-Crossing Attack [131]
TABLE III
Summary of current leakage inference attacks against protected search base queries. 𝑆 is the server and the assumed attacker for all attacks listed. 𝑆 leakage symbols have the same meaning as in Table II. Each attack is relevant to schemes in Table II with at least the 𝑆 leakage specified in this table. Some attacks require the attacker to be able to inject data by having the provider insert it into the database. Legends for the rest of the columns follow. In all columns except “Keyword universe tested,” bubbles that are more filled in represent properties that are better for the scheme and worse for the attacker.
Prior knowledge: ● Contents of full dataset; ◕ Contents of a subset of dataset; ◐ Distributional knowledge of dataset; ◔ Distributional knowledge of queries; ○ Keyword universe
Runtime (in # of keywords): ● More than quadratic; ◐ Quadratic; ○ Linear
Sensitivity to prior knowledge: ● High; ○ Low; ? Untested
Keyword universe tested: ● > 1000; ◐ 500 to 1000; ○ < 500
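The rule that links Tables II and III (an attack applies to a scheme whose 𝑆 leakage is at least the attack’s required minimum) can be sketched as a small join. This is only an illustration under stated assumptions: the bubbles are treated as an ordinal scale from ○ to ●, the comparison is made separately on the 𝐈𝐧𝐢𝐭 and 𝐐𝐮𝐞𝐫𝐲 columns, and the scheme and attack names and their leakage values are hypothetical rather than taken from the tables.

```python
from dataclasses import dataclass

# Assumption: leakage bubbles form a total order, least to most leakage.
LEVEL = {"○": 0, "◔": 1, "◐": 2, "◕": 3, "●": 4}


@dataclass(frozen=True)
class Leakage:
    init: str   # leakage to the server S during Init
    query: str  # leakage to the server S during Query


def is_affected(scheme: Leakage, required: Leakage) -> bool:
    """A scheme is affected if it leaks at least what the attack needs.

    Comparing Init and Query column by column is an assumption of this sketch.
    """
    return (LEVEL[scheme.init] >= LEVEL[required.init]
            and LEVEL[scheme.query] >= LEVEL[required.query])


def join_on_leakage(schemes, attacks):
    """Return all (scheme, attack) pairs where the attack is relevant."""
    return sorted(
        (s, a)
        for s, s_leak in schemes.items()
        for a, a_leak in attacks.items()
        if is_affected(s_leak, a_leak)
    )


# Hypothetical entries, not values read from Table II or Table III.
schemes = {"scheme_A": Leakage("◕", "◕"), "scheme_B": Leakage("○", "◔")}
attacks = {"attack_X": Leakage("◕", "○"), "attack_Y": Leakage("○", "◔")}
pairs = join_on_leakage(schemes, attacks)
```

Here `scheme_B` leaks nothing at 𝐈𝐧𝐢𝐭, so the hypothetical `attack_X`, which needs leakage at rest, does not apply to it.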
In summary, each protected search approach has a distinct leakage profile that results in qualitatively different attacks. If queries only touch a small portion of the dataset or the adversary only has a snapshot, the impact of leakage from 𝙲𝚞𝚜𝚝𝚘𝚖 systems is less than from 𝙻𝚎𝚐𝚊𝚌𝚢 schemes. If queries regularly return a large fraction of the dataset, this distinction disappears and an 𝙾𝚋𝚕𝚒𝚟 scheme may be appropriate. Recently, Kellaris et al. [125] showed an attack on 𝙾𝚋𝚕𝚒𝚟 schemes, but it requires significantly smaller database and keyword universe sizes than attacks against non-𝙾𝚋𝚕𝚒𝚟 schemes.
Open Problems: The area of leakage attacks against protected search is expanding. Published attacks consider attackers who insert specially crafted data records but have not considered an attacker who may issue crafted queries. Furthermore, all prior attacks have considered the leakage profile of the server. Future attacks should consider the implications of leakage to the querier and provider. Current attacks have targeted Equality and Range queries; we encourage the study of leakage attacks on other query types such as Boolean queries.
On the reverse side, it is important to understand what these leakage attacks mean in real-world application scenarios. Specifically, is it possible to identify appropriate real-world use-cases where the known leakage attacks do not disclose too much information? Understanding this will enable solutions that better span the security, performance, and functionality tradeoff space.
Lastly, on the defensive side we encourage designers to implement 𝐑𝐞𝐟𝐫𝐞𝐬𝐡 mechanisms. 𝐑𝐞𝐟𝐫𝐞𝐬𝐡 mechanisms have only been implemented for Equality systems.
IV. EXTENDING FUNCTIONALITY
A. Query Composition
We now describe techniques to combine the base queries described in Section III (equality, Boolean, and range queries) to obtain richer queries. We restrict our attention to techniques that are black box (i.e., they do not depend on the implementation of the base queries).
As a general principle, schemes that support a given query type by composing base queries tend to have more leakage than schemes that natively support the same query type as a base query. However, using query composition, a scheme that supports the necessary base queries can be extended straightforwardly to support multiple query types, whereas supporting those all as base queries requires significant effort. Thus, we see value in advancing both base and composed queries.
Table IV summarizes the techniques we describe below. In the table and the text, we cite the first work proposing each approach, though we note that several ideas appear to have been developed independently and concurrently. We defer the description of string queries (substrings and wildcards) to Appendix A.
1) Equality using range: Equality queries can be supported using a range query scheme. To obtain the records equal to 𝑎, the querier performs a range query for the range [𝑎, 𝑎].
2) Disjunction of equalities/ranges using equality/range: Disjunctions of equalities or ranges can be supported using an equality or a range scheme, respectively. To obtain the records that equal any of a set of 𝑘 keywords 𝑤_1, …, 𝑤_𝑘, the querier can perform an equality query for each keyword 𝑤_𝑖 and combine the results. Similarly, to retrieve all records that are in any of 𝑘 ranges, the querier can perform a range query for each range and combine the results. This approach reveals to the server the leakage associated with each equality or range query, e.g., the exact or approximate number of records matching each clause (not just the number of records matching the disjunction overall).
3) Conjunction of equalities using equality: Conjunctions of equalities can be supported using an equality scheme. To support querying for records that match all of the keywords 𝑤_1, …, 𝑤_𝑘, one builds an equality scheme containing 𝑘-tuples of keywords. The querier then performs an equality search on the 𝑘-tuple representing her query to retrieve the records that contain all of those keywords. The storage for this approach grows exponentially with 𝑘 but is viable for targeted keyword combinations or a small number of fields.
4) Stemming using equality: Stemming reduces words to their root form; stemming queries allow matching on word variations. For example, a stemming query for ‘run’ will also return results for ‘ran’ and ‘running’. The Porter stemmer is a widely used algorithm [135], [136]. Stemming can be supported easily by using the stemmed version of keywords at both initialization and query time, and thus performing the match using a single equality query.
5) Proximity using equality: Proximity queries find values that are ‘close’ to the search term. Li et al. [137] support proximity queries by building an equality scheme associating each neighbor of any record with its set of neighbors in the dataset at initialization; a proximity query is then an equality query, which will return a record if it matches the queried value or is a neighbor of it. Boldyreva and Chenette [133] improve on the security of this scheme by revealing only pairwise neighbor relationships instead of neighbor sets. They also pad the number of inserted keywords to the maximum number of neighbors. This solution multiplies storage by the maximum number of neighbors of a record. If disjunctive searches are permitted, one can trade off storage space with the number of terms in the search.
Another approach uses locality-sensitive hashing [138], [139], which preserves closeness by mapping ‘close’ inputs to identical values and ‘far’ inputs to uncorrelated values. Proximity queries can be supported by inserting the output of a locality-sensitive hash as a keyword in an equality scheme. Returning only ‘close’ records requires matching the output of multiple hashes. Parameters vary widely depending on the notion of closeness. This approach has been demonstrated for Jaccard distance [140] and Hamming distance [137], [141]–[144].
6) Small-domain range query using equality [134]: To support range queries on a searchable attribute 𝐴 with domain 𝐷, we build two equality-searchable indices. The first index
Composed Query | Base Query Calls | Additional Storage | Leakage | Work
1. Equality (EQ) | 1 range | none | Same as range | —
2. Disjunction (OR) of 𝑘 EQs (or ranges) | 𝑘 EQs (or ranges) | none | Identifiers of records matching each clause, if EQ leaks ≥ ◔ | —
3. Conjunction (AND) of 𝑘 EQs | 1 EQ | (𝛽 choose 𝑘) | Same as EQ | —
4. Stemming | 1 EQ | 1 | Identifiers of records sharing stem, if EQ leaks ≥ ◔ | —
5. Proximity | 1 EQ | 𝓁 | Identifiers of neighbor pairs, if EQ leaks ≥ ◔ | [133]
6. Range w/ small domain | (2 + 𝑟) EQs | 1 | No leakage if refresh between queries | [134]
7. Range | OR of (2 log 𝑚) EQs | log 𝑚 | Distributional info, if EQ leaks ≥ ◔ | [16]
8. Negation | OR of 2 ranges | 1 | Same as OR of ranges | [16]
9. Substring (𝜌 = 𝜅) | 1 EQ | 𝛼 − 𝜅 + 1 | Identifiers of records sharing 𝜅-grams, if EQ leaks ≥ ◔ | [22]
10. Substring (𝜌 ≤ 𝜅) | 1 range | 𝛼 − 𝜅 + 1 | Same as range, on 𝜅-grams | [22]
11. Anchored Substring (𝜌 ≥ 𝜅) | AND of (𝜌 − 𝜅 + 1) EQs | 𝛼 − 𝜅 + 1 | If EQ leaks ≥ ◔, rec. ids. w/ 𝜅-grams in same positions; if AND leaks # clauses, 𝜌 | [18]
12. Substring | OR of (𝛼 − 𝜅 + 1) ANDs of (𝜌 − 𝜅 + 1) EQs | 𝛼 − 𝜅 + 1 | If EQ leaks ≥ ◔, rec. ids. w/ 𝜅-grams in same positions; if AND leaks # clauses, 𝜌 | [18]
13. Anchored Wildcard | AND of (𝜌 − 𝜅 + 1) EQs | 𝛼 − 𝜅 + 1 | If EQ leaks ≥ ◔, rec. ids. w/ 𝜅-grams in same positions; if AND leaks # clauses, 𝜌 | [18]
14. Wildcard | OR of (𝛼 − 𝜅 + 1) ANDs of (𝜌 − 𝜅 + 1) EQs | 𝛼 − 𝜅 + 1 | If EQ leaks ≥ ◔, rec. ids. w/ 𝜅-grams in same positions; if AND leaks # clauses, 𝜌 | [18]
TABLE IV
Summary of query combiners using equality (EQ), conjunction (AND), disjunction (OR), and range base query types. Storage is given as additional storage beyond that required for the base equality or range queries, as a multiplicative factor over the base storage. Composed query leakage depends on the leakage of the base queries used; the table gives the composed query leakage if the base equality scheme leaks identities. “Anchored” refers to a search that occurs at either the beginning or the end of a string.
Boolean notation: 𝑘 = # of clauses in Boolean; 𝛽 = max # of keywords per record
Proximity, range notation: 𝓁 = max # of neighbors of a record; 𝑚 = size of domain; 𝑟 = # query results
String notation: 𝜅 = length of grams; 𝜌 = length of query string; 𝛼 = max length of data string (padded if necessary)
maps each value 𝑎 ∈ 𝐷 to the number of records in the database smaller than 𝑎 and the number of records larger than 𝑎. With two equality queries into this index, the querier can learn the location of the lower and upper bounds of a range query. The second index is an ordered list of records sorted by 𝐴, from which the client reads the relevant subset.

This approach requires blinding factors to prevent the client from learning the positions of the results while still being able to search the second index [134]. Also, this approach only works for attributes with small domain, since the first index has size proportional to the domain size.

7) Large-domain range using equality and disjunction [16], [134]: Range queries can be performed over exponential size domains via range covers, which are a specialization of set covers that effectively pre-compute the results of canonical range queries that would be asked during a binary search of each record. For instance, consider the domain 𝐷 = [0, 8) with size 𝑚 = 8. To insert a record with attribute 𝐴 = 3, we insert keywords corresponding to each of the canonical ranges [0, 8), [0, 4), [2, 4), and [3, 4). Range queries are split into canonical ranges; for instance, the range [2, 5) would be split into [2, 4) and [4, 5). Combining this technique with disjunctions yields range queries [16].

Demertzis et al. [145] provide a variety of range cover schemes with different tradeoffs between leakage, storage, and computation. At the extremes, they can support constant storage with query cost linear in the range size, or 𝑚² multiplicative storage with constant-sized keyword queries. They recommend a balanced approach similar to [16], [134], although their recommended scheme has false positives.

8) Negations using range and disjunction [16]: As above, consider an ordered domain 𝐷 with minimum and maximum values 𝑎_min and 𝑎_max, respectively. To search for all records not matching 𝐴 = 𝑎, compute a disjunction of the queries [𝑎_min, 𝑎) and (𝑎, 𝑎_max].

B. The Functionality Gap

We now review gaps in query functionality based on current protected base and combined queries. Our discussion is divided among the three query bases from Section II-A.

a) Relational Algebra: Cartesian product, which corresponds to the JOIN keyword in SQL, has been demonstrated in 𝙻𝚎𝚐𝚊𝚌𝚢 schemes. The one 𝙲𝚞𝚜𝚝𝚘𝚖 scheme that supports Cartesian product is the work of Kamara and Moataz [102], but their scheme does not support updates.

The JOIN keyword makes a system relational. Secure JOIN is a crucial capability for protected search systems. The key challenge is to create a data structure capable of linking different values that reveals no information to any party. This challenge also arises in Boolean 𝙲𝚞𝚜𝚝𝚘𝚖 systems. Systems overcome this challenge by placing values that could be linked in a single joint data structure. It is difficult to scale this approach to the JOIN operation as the columns involved are not known ahead of time (and there are many more possibilities).

Open Problem: Support secure Cartesian product using 𝙲𝚞𝚜𝚝𝚘𝚖 and 𝙾𝚋𝚕𝚒𝚟 approaches.

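
As an illustration of the range-cover construction in item 7) and the negation composition in item 8) above, the following minimal Python sketch reproduces the worked example with 𝑚 = 8. It is a plaintext toy, not the construction of [16] or [134]: the names `canonical_ranges`, `range_cover`, and `ToyRangeIndex` are hypothetical, the domain size is assumed to be a power of two, and an ordinary dictionary lookup stands in for a protected equality (EQ) query, so nothing here provides any security.

```python
from collections import defaultdict


def canonical_ranges(value, m):
    """Dyadic ranges [lo, hi) containing `value`, from [0, m) down to [value, value + 1)."""
    if m <= 0 or m & (m - 1):
        raise ValueError("this sketch assumes the domain size m is a power of two")
    if not 0 <= value < m:
        raise ValueError("value lies outside the domain [0, m)")
    lo, hi = 0, m
    ranges = [(lo, hi)]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if value < mid:
            hi = mid
        else:
            lo = mid
        ranges.append((lo, hi))
    return ranges


def range_cover(a, b, m):
    """Split the query range [a, b) into canonical dyadic ranges of the domain [0, m)."""
    cover = []

    def visit(lo, hi):
        if hi <= a or b <= lo:        # node is disjoint from the query
            return
        if a <= lo and hi <= b:       # node lies entirely inside the query
            cover.append((lo, hi))
            return
        mid = (lo + hi) // 2          # partial overlap: descend into both halves
        visit(lo, mid)
        visit(mid, hi)

    visit(0, m)
    return cover


class ToyRangeIndex:
    """Plaintext stand-in for an equality-searchable index (no cryptography)."""

    def __init__(self, m):
        self.m = m
        self.postings = defaultdict(set)

    def insert(self, record_id, value):
        # One keyword per canonical range, i.e. log2(m) + 1 keywords per record.
        for keyword in canonical_ranges(value, self.m):
            self.postings[keyword].add(record_id)

    def equality(self, keyword):
        # Stands in for one protected EQ query.
        return self.postings.get(keyword, set())

    def range_query(self, a, b):
        # Disjunction (OR) of one EQ query per canonical range in the cover.
        result = set()
        for keyword in range_cover(a, b, self.m):
            result |= self.equality(keyword)
        return result

    def negation(self, a):
        # Records with attribute != a: OR of the ranges [0, a) and [a + 1, m).
        return self.range_query(0, a) | self.range_query(a + 1, self.m)


index = ToyRangeIndex(8)
for rid, val in [("r1", 3), ("r2", 5), ("r3", 7)]:
    index.insert(rid, val)

print(canonical_ranges(3, 8))          # [(0, 8), (0, 4), (2, 4), (3, 4)]
print(range_cover(2, 5, 8))            # [(2, 4), (4, 5)]
print(sorted(index.range_query(2, 5))) # ['r1']
print(sorted(index.negation(5)))       # ['r1', 'r3']
```

The sketch makes the storage and query costs in Table IV visible: each record is indexed under a logarithmic number of keywords, and each range query becomes a disjunction over a logarithmic number of equality lookups. In a protected system the leakage of the composed query would follow from the leakage of the underlying EQ and OR schemes.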
b) Associative Arrays: The main workhorse of associative arrays is the ability to quickly add and multiply arrays. 𝙻𝚎𝚐𝚊𝚌𝚢 schemes have shown how to support limited addition through the use of somewhat homomorphic encryption. There is extensive work on private addition and multiplication using secure computation. However, this problem has not received substantial attention in the protected search literature. We see adaptation of (parallelizable) arithmetic techniques into protected search as a key to supporting associative arrays.

Open Problem: Incorporate secure computation into protected search systems to support array (+, ×).

In addition, associative arrays are often constructed for string objects. In this setting, multiplication and addition are usually replaced with the concatenate function and an application-defined ‘minimum’ function that selects one of the two values. Finding the minimum is connected to the comparison operation. The comparison operation has been identified as a core gadget in the secure computation literature [146], [147]. We encourage adaptation of this gadget to protected search.

Open Problem: Support protected queries to output the minimum of two strings.

c) Linear Algebra: The main gap in supporting linear algebra is how to privately multiply two matrices. This problem is made especially challenging because, for different data types, the addition (+) and multiplication (×) operations may be defined arbitrarily. Furthermore, linear algebra databases store data as sparse matrices. Access patterns to a sparse matrix may leak information about its contents. This problem has begun to receive attention in the learning literature [148], as matrix multiplication enables many linear classification approaches. However, current work requires specializing storage to a particular algorithm, such as shortest path [116], [149].

Open Problem: Support efficient secure matrix multiplication and storage.

V. FROM QUERIES TO DATABASE SYSTEMS

In addition to search, a DBMS enforces rules, defines data structures, and provides transactional guarantees to an application. In this section, we highlight important components that are affected by security and need to be addressed to enable a protected search system to become a full DBMS. We then discuss current protected search systems and their applicability for different DB settings.

A. Controls, Rules and Enforcement

Classical database security includes a broad set of control measures, including access control, inference control, flow control, and data encryption [150].

Access control assigns a principal (such as a user, role, account, or program) privileges to interact with objects like tables, records, columns, views, or operations in a given context [151]. Discretionary access control balances usability with security and is used in most applications. Mandatory access control is used where a strict hierarchy is important and available for individuals and data. Inference control is used with statistical databases and restricts the ability of a principal to infer a fact about a stored datum from the result returned by an aggregate function such as average or count. Flow control ensures that information in an object does not flow to another object of lesser privilege. Data encryption in classical systems is used for transmitting data from the database back to the client and user. Some systems also encrypt the data at rest and use fine-grained encryption for access control [152]. These techniques are covered in most database textbooks.

A new complementary approach is called query control [153]. Query control limits which queries are acceptable, not which objects are visible to a user. As an example, a user may be required to specify at least five columns in a query, ensuring the query is sufficiently “targeted.” It enables database designers to match legal requirements written in this style. Query control can be expressed using a query policy, which regulates the set of query controls.

Most current protected search designs do not consider either an authorizer or enforcer. Integrating this functionality is an important part of maturing protected search and complements the cryptographic protections provided by the basic protocols.

B. Performance Characterization

Database system adoption depends on response time on the expected set of queries. Databases are highly tuned, often creating indices on the fly in response to queries. This makes fair and fast evaluation difficult. To address this challenge, we developed a performance evaluation platform. Our platform has been open-sourced with a BSD license (https://github.com/mit-ll/SPARTA). Design details can be found in [154]–[156]. It has been used to test protected search systems at scales of 10 TB. Prior works [16], [17], [19], [22] report performance numbers generated by our platform. While the platform has been used to evaluate SQL-style databases, it was designed with reusability and extensibility in mind to allow generalization to other types of databases.

Our platform evaluates: 1) integrity of responses and modifications (when occurring individually and while other operations are being performed) and 2) query latency and throughput under a wide variety of conditions. The system can vary environmental characteristics, the size of the database, query types, how many records will be returned by each query, and query policy. Each of these factors can be measured independently to create performance cross-sections.

In our experiments, we found protected search response time depends heavily on:
1) Network capacity, load, and number of records returned by a query. Protected search systems often have more rounds of communication and network traffic than unprotected databases.
2) The ordering of terms and subclauses within a query. Query planning is difficult for protected search systems as they do not know statistics of the data. Protected search generates a plan based on only query type.
3) The existence and complexity of rules (query policy and access control). Protected search systems use advanced
cryptography like multi-party computation to evaluate these rules, resulting in substantial overhead.

C. User Perceptions of Performance

We conducted a human-subjects usability evaluation to further the understanding of current protected search usability. This evaluation considered the performance of multiple protected search technologies and the perception of performance by human subjects (our procedure was approved by our Institutional Review Board). In this evaluation, subjects interacted with different protected search systems through an identical web interface. Here, we focus on thoughts shared by participants during discussion. (An informal overview of our procedure is in Appendix B.)

Our participants discussed several themes that are salient for furthering the usability of protected search:
∙ Participants cared more about predictability of response times than minimizing the mean response time. When response times were unpredictable, participants were unsure whether they should wait for a query to complete or do something else.
∙ Participants felt the protected technologies were slower than an unprotected system. Participants felt this performance was acceptable if it gave them access to additional data, but did not want to migrate current databases to a protected system. Note that this feedback is from end users, not administrators.
∙ Participants expected performance to be correlated with the number of records returned and the length of the query. Participants were surprised that different types of queries might have different performance characteristics.

D. Current Protected Search Databases

Some protected search systems have made the transition to full database solutions. These systems report performance analysis, perform rule enforcement, and support dynamic data. These systems are summarized in Table V. CryptDB replicates most DBMS functionality with a performance overhead of under 30% [15]. This approach has been extended to NoSQL key-value stores [157], [158]. Arx is built on a NoSQL key-value store called mongoDB [63]. Arx reports a performance overhead of approximately 10% when used to replace the database of a web application (ShareLatex). Blind Seer [16] reports slowdown of between 20% and 300% for most queries, while OSPIR-OXT [18] reports that it occasionally outperforms a baseline MySQL 5.5 system with a cold cache and is an order of magnitude slower than MySQL with a warm cache. The SisoSPIR system [22] reports performance slowdown of 500% compared to a baseline MySQL system on keyword equality and range queries.

Given these performance numbers, we now ask which solution, if any, is appropriate for different database settings.

1) Relational Algebra without Cartesian product: CryptDB, Blind Seer, OSPIR-OXT, and SisoSPIR all provide functionality that supports most of relational algebra except for the Cartesian product operation. These systems offer different performance/leakage tradeoffs. CryptDB is the fastest and easiest to deploy. However, once a column is used in a query, CryptDB reveals statistics about the entire dataset’s values in this column. The security impact of this leakage should be evaluated for each application (see Section III-B). Blind Seer and OSPIR-OXT also leak information to the server but primarily on data returned by the query. Thus, they are appropriate in settings where a small fraction of the database is queried. Finally, SisoSPIR is appropriate if a large fraction of the data is regularly queried. However, SisoSPIR does not support Boolean queries, which is limiting in practice.

2) Full Relational Algebra: CryptDB is the only system for relational algebra that supports Cartesian product. (As stated, Kamara and Moataz [102] support Cartesian product but do not support dynamic data.)

3) Associative Array - NoSQL Key-Value: The Arx system built on mongoDB provides functionality necessary to support associative arrays. In addition, other commercial systems (e.g., Google’s Encrypted BigQuery [29]) and academic works [157], [158] apply 𝙻𝚎𝚐𝚊𝚌𝚢 techniques to build a NoSQL protected system.

Blind Seer, OSPIR-OXT, and SisoSPIR have sufficient query functionality to support associative arrays. However, their techniques concentrate on query performance. Associative array databases often have insert rates of over a million records per second. The insert rates of Blind Seer, OSPIR-OXT, and SisoSPIR are multiple orders of magnitude smaller.

Suppose a record is being updated. In an unprotected system this causes a small change to the primary index structure. However, in the protected setting, if only a few locations are modified the server may learn about the statistics of the updated record. This creates a tension between efficiency and security. Efficient updates are even more difficult if the provider does not have the full unprotected dataset.

Open Problem: Construct 𝙲𝚞𝚜𝚝𝚘𝚖 and 𝙾𝚋𝚕𝚒𝚟 techniques that can handle millions of inserts per second.

To support very large insert rates, NoSQL key-value stores commonly distribute the data across many machines. This introduces the challenge of synchronizing queries, updates, and data across these machines. This synchronization is difficult as none of the servers are supposed to know what queries, updates, or data they are processing!

Open Problem: Construct protected search systems that leverage distributed server architectures.

4) Linear Algebra and Others: No current protected search system supports the linear algebra basis used to implement complex graph databases. In addition, as federated and polystore databases emerge it will be important to interface between different protected search systems that are designed for different query bases.

Inherent Limitations: Protected search systems are still in development, so it is important to distinguish between transient limitations and inherent limitations of protected search. Protected search inherently reduces data visibility in order to prevent abuse. To achieve high performance under these

TABLE V
This table summarizes protected search databases that have been developed and evaluated at scale. The Supported Operations columns describe the queries naturally supported by each scheme. Properties and Features columns describe the system and available functionality. Finally, Leakage and Performance describe the whole, complex system, and are therefore relative (vs. the more precisely defined values for individual operations used earlier).

Column headings: System; Supported Operations (Equality, Boolean, Keyword, Range, Substring, Wildcard, Sum, Join, Update); Properties and Features (Approach, # of parties, Multi-client, Access control, Query policy, User auth., Code available); Leakage; Performance.

System | Approach | # of parties
CryptDB [15] | 𝙻𝚎𝚐𝚊𝚌𝚢 | 2
Arx [14] | 𝙲𝚞𝚜𝚝𝚘𝚖 | 2
Blind Seer [16], [17] | 𝙲𝚞𝚜𝚝𝚘𝚖 | 3
OSPIR-OXT [18]–[21], [103], [104] | 𝙲𝚞𝚜𝚝𝚘𝚖 | 3
SisoSPIR [22] | 𝙾𝚋𝚕𝚒𝚟 | 3

[Per-cell marks for supported operations, features, leakage, and performance are not legible in this copy of the table.]

conditions, many design decisions such as the schema and the choice of which indices to build must be made before data is ingested and stored on the server. In particular, if an index has not been built for a particular field, then it simply cannot be searched without returning the entire database to the querier. In general, it is not possible to dynamically permit a type of search without retrieving the entire dataset.

Additionally, if the database malfunctions, debugging efforts are complicated by the reduced visibility into server processes and logs. More generally, protected search systems are more complicated to manage and don’t yet have an existing community of qualified, certified administrators.

Throughout this work we’ve identified a few transient limitations that can (and should!) be mitigated with future advances. Each potential user must make her own judgment as to whether the value of improved security outweighs the performance limitations.

VI. CONCLUSION AND OUTLOOK

Several established and startup companies have commercialized protected search. Most of these products today use the 𝙻𝚎𝚐𝚊𝚌𝚢 technique, but we believe both 𝙲𝚞𝚜𝚝𝚘𝚖 and 𝙾𝚋𝚕𝚒𝚟 approaches will find their way into products with broad user bases.

Governments and companies are finding value in lacking access to individuals’ data [159]. Proactively protecting data mitigates the (ever-increasing) risk of server compromise, reduces the insider threat, can be marketed as a feature, and frees developers’ time to work on other aspects of products and services. The recent HITECH US Health Care Law [160] establishes a requirement to disclose breaches involving more than 500 patients but exempts companies if the data is encrypted: “if your practice has a breach of encrypted data [...] it would not be considered a breach of unsecured data, and you would not have to report it” [161].

Protected database technology can also open up new markets, such as those cases where there is great value in recording and sharing information but the risk of data spills is too high. For example, companies recognize the value of sharing cyber threat and attack information [162], but uncontrolled sharing of this information presents a risk to reputation and intellectual property.

This paper provides a snapshot of current protected search solutions. There is currently no dominant solution for all use cases. Adopters need to understand system characteristics and tradeoffs for their use case.

Protected databases will see widespread adoption. Protected search has developed rapidly since 2000, advancing from linear time equality queries on static data to complex searches on dynamic data, now with overhead between 30% and 500% over standard SQL.

At the same time, the database landscape is rapidly changing, specializing, adding new functionality, and federating approaches. Integrating protected search in a unified design requires close interaction between cryptographers, protected search designers, and database experts. To spur that integration, we describe a three-pronged approach to this collaboration: 1) developing base queries that are useful in many applications, 2) understanding how to combine queries to support multiple applications, and 3) rapidly applying techniques to emerging database technologies.

DBMSs are more than just efficient search systems; they are highly optimized and complex systems. Protected search has shown that database and cryptography communities can work together. The next step is to transform protected search systems into protected DBMSs.

ACKNOWLEDGMENTS

The authors thank David Cash, Carl Landwehr, Konrad Vesey, Charles Wright, and the anonymous reviewers for helpful feedback in improving this work.

REFERENCES

[1] R. Powers and D. Beede, “Fostering innovation, creating jobs, driving better decisions: The value of government data,” Office of the Chief Economist, Economics and Statistics Administration, US Department of Commerce, July 2014.
[2] G. S. Linoff and M. J. Berry, Mining the Web: Transforming Customer Data into Customer Value. New York, NY, USA: John Wiley & Sons, Inc., 2002.
[3] “Big & fast data: The rise of insight-driven business,” 2015. [Online]. Available: https://www.capgemini.com/resource-file-access/resource/pdf/big_fast_data_the_rise_o
[4] B. Mons, H. van Haagen, C. Chichester, P.-B. t. Hoen, J. T. den Dunnen, G. van Ommen, E. van Mulligen, B. Singh, R. Hooft, M. Roos, J. Hammond, B. Kiesel, B. Giardine, J. Velterop, P. Groth, and E. Schultes, “The value of data,” Nat Genet, vol. 43, no. 4, pp. 281–283, Apr 2011. [Online]. Available: http://dx.doi.org/10.1038/ng0411-281
[5] Mandiant, http://intelreport.mandiant.com/Mandiant_APT1_Report.pdf, Feb 2013.
[6] M. Motoyama, D. McCoy, K. Levchenko, S. Savage, and G. M. Voelker, “An analysis of underground forums,” in Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, ser. IMC ’11. New York, NY, USA: ACM, 2011, pp. 71–80.
[7] N. Y. Times, “Hacking linked to China exposes millions of U.S. workers,” http://www.nytimes.com/2015/06/05/us/breach-in-a-federal-computer-system-exposes-personnel-data.html, June 4, 2015, accessed: 2015-07-09.
[8] ——, “9 recent cyberattacks against big businesses,” http://www.nytimes.com/interactive/2015/02/05/technology/recent-cyberattacks.html, February 5, 2015, accessed: 2015-07-09.
[9] D. X. Song, D. Wagner, and A. Perrig, “Practical techniques for searches on encrypted data,” in 2000 IEEE Symposium on Security and Privacy. IEEE Computer Society Press, May 2000, pp. 44–55.
[10] O. Pandey and Y. Rouselakis, “Property preserving symmetric encryption,” in EUROCRYPT 2012, ser. LNCS, D. Pointcheval and T. Johansson, Eds., vol. 7237. Springer, Heidelberg, Apr. 2012, pp. 375–391.
[11] R. Curtmola, J. A. Garay, S. Kamara, and R. Ostrovsky, “Searchable symmetric encryption: improved definitions and efficient constructions,” in ACM CCS 06, A. Juels, R. N. Wright, and S. Vimercati, Eds. ACM Press, Oct. / Nov. 2006, pp. 79–88.
[12] B. Chor, N. Gilboa, and M. Naor, “Private information retrieval by keywords,” Cryptology ePrint Archive, Report 1998/003, 1998, http://eprint.iacr.org/1998/003.
[13] O. Goldreich, “Towards a theory of software protection and simulation by oblivious RAMs,” in 19th ACM STOC, A. Aho, Ed. ACM Press, May 1987, pp. 182–194.
[14] R. Poddar, T. Boelter, and R. A. Popa, “Arx: A strongly encrypted database system,” Cryptology ePrint Archive, Report 2016/591, 2016, http://eprint.iacr.org/2016/591.
[15] R. A. Popa, C. M. S. Redfield, N. Zeldovich, and H. Balakrishnan, “CryptDB: processing queries on an encrypted database,” Commun. ACM, vol. 55, no. 9, pp. 103–111, 2012. [Online]. Available: http://doi.acm.org/10.1145/2330667.2330691
[16] V. Pappas, F. Krell, B. Vo, V. Kolesnikov, T. Malkin, S. G. Choi, W. George, A. D. Keromytis, and S. Bellovin, “Blind seer: A scalable private DBMS,” in 2014 IEEE Symposium on Security and Privacy. IEEE Computer Society Press, May 2014, pp. 359–374.
[17] B. A. Fisch, B. Vo, F. Krell, A. Kumarasubramanian, V. Kolesnikov, T. Malkin, and S. M. Bellovin, “Malicious-client security in Blind Seer: A scalable private DBMS,” in 2015 IEEE Symposium on Security and Privacy. IEEE Computer Society Press, May 2015, pp. 395–410.
[18] D. Cash, S. Jarecki, C. S. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, “Highly-scalable searchable symmetric encryption with support for Boolean queries,” in CRYPTO 2013, Part I, ser. LNCS, R. Canetti and J. A. Garay, Eds., vol. 8042. Springer, Heidelberg, Aug. 2013, pp. 353–373.
[19] S. Jarecki, C. S. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, “Outsourced symmetric private information retrieval,” in ACM CCS 13, A.-R. Sadeghi, V. D. Gligor, and M. Yung, Eds. ACM Press, Nov. 2013, pp. 875–888.
[20] D. Cash, J. Jaeger, S. Jarecki, C. S. Jutla, H. Krawczyk, M.-C. Rosu, and M. Steiner, “Dynamic searchable encryption in very-large databases: Data structures and implementation,” in NDSS 2014. The Internet Society, Feb. 2014.
[21] S. Faber, S. Jarecki, H. Krawczyk, Q. Nguyen, M.-C. Rosu, and M. Steiner, “Rich queries on encrypted data: Beyond exact matches,” in ESORICS 2015, Part II, ser. LNCS, G. Pernul, P. Y. A. Ryan, and E. R. Weippl, Eds., vol. 9327. Springer, Heidelberg, Sep. 2015, pp. 123–145.
[22] Y. Ishai, E. Kushilevitz, S. Lu, and R. Ostrovsky, “Private large-scale databases with distributed searchable symmetric encryption,” in CT-RSA 2016, ser. LNCS, K. Sako, Ed., vol. 9610. Springer, Heidelberg, Feb. / Mar. 2016, pp. 90–107.
[23] “Bitglass.” [Online]. Available: http://www.bitglass.com/
[24] “Ciphercloud.” [Online]. Available: http://www.ciphercloud.com
[25] “Cipherquery.” [Online]. Available: https://privatemachines.com
[26] “Crypteron.” [Online]. Available: https://www.crypteron.com/
[27] “IQrypt.” [Online]. Available: http://iqrypt.com/
[28] “Kryptnostic.” [Online]. Available: https://www.kryptnostic.com/
[29] Google, “Encrypted BigQuery client.” [Online]. Available: https://github.com/google/encrypted-bigquery-client
[30] Microsoft Corporation, “Always Encrypted (Database Engine) - SQL Server 2016.” [Online]. Available: https://msdn.microsoft.com/en-us/library/mt163865.aspx
[31] ——, “Always Encrypted Cryptography - SQL Server 2016.” [Online]. Available: https://msdn.microsoft.com/en-us/library/mt653971.aspx
[32] “PreVeil.” [Online]. Available: https://www.preveil.com/
[33] “Skyhigh networks: Cloud security software.” [Online]. Available: https://www.skyhighnetworks.com
[34] “StealthMine.” [Online]. Available: http://stealthmine.com/
[35] “ZeroDB.” [Online]. Available: https://zerodb.com/
[36] E. F. Codd, “A relational model of data for large shared data banks,” Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970.
[37] M. Stonebraker and U. Cetintemel, “One size fits all: an idea whose time has come and gone,” in 21st International Conference on Data Engineering (ICDE’05). IEEE, 2005, pp. 2–11.
[38] J. D. Ullman, A first course in database systems. Pearson Education India, 1982.
[39] M. Stonebraker and J. M. Hellerstein, Readings in database systems. Morgan Kaufmann Publishers, 1988.
[40] T. Haerder and A. Reuter, “Principles of transaction-oriented database recovery,” ACM Computing Surveys (CSUR), vol. 15, no. 4, pp. 287–317, 1983.
[41] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured data,” ACM Transactions on Computer Systems (TOCS), vol. 26, no. 2, p. 4, 2008.
[42] A. Pavlo and M. Aslett, “What’s really new with NewSQL?” SIGMOD Record, 2016.
[43] A. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, U. Cetintemel, V. Gadepally, J. Heer, B. Howe, J. Kepner, T. Kraska et al., “A demonstration of the BigDAWG polystore system,” Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1908–1911, 2015.
[44] V. Gadepally, P. Chen, J. Duggan, A. Elmore, B. Haynes, J. Kepner, S. Madden, T. Mattson, and M. Stonebraker, “The BigDAWG polystore system and architecture,” in 2016 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2016, pp. 1–6.
[45] D. Halperin, V. Teixeira de Almeida, L. L. Choo, S. Chu, P. Koutris, D. Moritz, J. Ortiz, V. Ruamviboonsuk, J. Wang, A. Whitaker et al., “Demonstration of the Myria big data management service,” in Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 2014, pp. 881–884.
[46] R. A. Van De Geijn and E. S. Quintana-Ortí, The science of programming matrix computations, 2008.
[47] J. Kepner and V. Gadepally, “Adjacency matrices, incidence matrices, database schemas, and associative arrays,” in International Parallel & Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2014.
[48] V. Gadepally, J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, L. Edwards, M. Hubbell, P. Michaleas, J. Mullen et al., “D4M: Bringing associative arrays to database engines,” in High Performance Extreme Computing Conference (HPEC), 2015 IEEE. IEEE, 2015, pp. 1–6.
[49] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig latin: a not-so-foreign language for data processing,” in Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008, pp. 1099–1110.
[50] J. Kepner, V. Gadepally, D. Hutchison, H. Jananthan, T. Mattson, S. Samsi, and A. Reuther, “Associative array model of SQL, NoSQL, and NewSQL databases,” in 2016 IEEE High Performance Extreme Computing Conference, 2016.
[51] D. J. Abadi, “Data management in the cloud: limitations and opportunities.” IEEE Data Eng. Bull., vol. 32, no. 1, pp. 3–12, 2009.
[52] “MySQL.” [Online]. Available: https://www.mysql.com/
[53] K. Loney, Oracle database 10g: the complete reference. McGraw-Hill/Osborne, 2004.
[54] M. Stonebraker and L. A. Rowe, The design of Postgres. ACM, 1986, vol. 15, no. 2.
[55] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, [79] O. Goldreich, S. Micali, and A. Wigderson, “How to play any mental
S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild et al., “Spanner: game or A completeness theorem for protocols with honest majority,”
Google’s globally distributed database,” ACM Transactions on Com- in 19th ACM STOC, A. Aho, Ed. ACM Press, May 1987, pp. 218–229.
puter Systems (TOCS), vol. 31, no. 3, p. 8, 2013. [80] C. Gentry, “Fully homomorphic encryption using ideal lattices,” in 41st
[56] N. Shamgunov, “The MemSQL in-memory database system.” in ACM STOC, M. Mitzenmacher, Ed. ACM Press, May / Jun. 2009,
IMDM@ VLDB, 2014. pp. 169–178.
[57] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, [81] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, “(Leveled) fully ho-
X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi et al., “Spark SQL: momorphic encryption without bootstrapping,” in ITCS 2012, S. Gold-
Relational data processing in spark,” in Proceedings of the 2015 ACM wasser, Ed. ACM, Jan. 2012, pp. 309–325.
SIGMOD International Conference on Management of Data. ACM, [82] C. Gentry, S. Halevi, and N. P. Smart, “Better bootstrapping in fully
2015, pp. 1383–1394. homomorphic encryption,” in PKC 2012, ser. LNCS, M. Fischlin,
[58] M. J. Carey, L. M. Haas, P. M. Schwarz, M. Arya, W. F. Cody, J. Buchmann, and M. Manulis, Eds., vol. 7293. Springer, Heidelberg,
R. Fagin, M. Flickner, A. W. Luniewski, W. Niblack, D. Petkovic et al., May 2012, pp. 1–16.
“Towards heterogeneous multimedia information systems: The garlic [83] S. Garg, C. Gentry, S. Halevi, M. Raykova, A. Sahai, and B. Waters,
approach,” in Research Issues in Data Engineering, 1995: Distributed “Candidate indistinguishability obfuscation and functional encryption
Object Management, Proceedings. RIDE-DOM’95. Fifth International for all circuits,” in 54th FOCS. IEEE Computer Society Press, Oct.
Workshop on. IEEE, 1995, pp. 124–131. 2013, pp. 40–49.
[59] “IBM DB2.” [Online]. Available: [84] B. Chor, O. Goldreich, E. Kushilevitz, and M. Sudan, “Private infor-
http://www.ibm.com/analytics/us/en/technology/db2/ mation retrieval,” in 36th FOCS. IEEE Computer Society Press, Oct.
[60] D. Pritchett, “BASE: An ACID alternative,” Queue, vol. 6, no. 3, pp. 1995, pp. 41–50.
48–55, 2008. [85] Y. Gertner, Y. Ishai, E. Kushilevitz, and T. Malkin, “Protecting data
privacy in private information retrieval schemes,” in 30th ACM STOC.
[61] J. Kepner, W. Arcand, D. Bestor, B. Bergeron, C. Byun, V. Gadepally,
ACM Press, May 1998, pp. 151–160.
M. Hubbell, P. Michaleas, J. Mullen, A. Prout et al., “Achieving
[86] M. T. Goodrich, R. Tamassia, N. Triandopoulos, and R. Cohen,
100,000,000 database inserts per second using accumulo and D4M,”
“Authenticated data structures for graph and geometric searching,”
in 2014 IEEE High Performance Extreme Computing Conference
in CT-RSA 2003, ser. LNCS, M. Joye, Ed., vol. 2612. Springer,
(HPEC). IEEE, 2014, pp. 1–6.
Heidelberg, Apr. 2003, pp. 295–313.
[62] L. George, HBase: the definitive guide. " O’Reilly Media, Inc.", 2011. [87] C. Papamanthou and R. Tamassia, “Time and space efficient algorithms
[63] “mongoDB.” [Online]. Available: https://www.mongodb.com/ for two-party authenticated data structures,” in ICICS 07, ser. LNCS,
[64] J. Webber, “A programmatic introduction to Neo4j,” in Proceedings of S. Qing, H. Imai, and G. Wang, Eds., vol. 4861. Springer, Heidelberg,
the 3rd annual conference on Systems, programming, and applications: Dec. 2008, pp. 1–15.
software for humanity. ACM, 2012, pp. 217–218. [88] M. Etemad and A. Küpçü, “Database outsourcing with hierarchical
[65] “IBM system G.” [Online]. Available: http://systemg.research.ibm.com/ authenticated data structures,” in ICISC 13, ser. LNCS, H.-S. Lee and
[66] P. G. Brown, “Overview of sciDB: large scale array storage, processing D.-G. Han, Eds., vol. 8565. Springer, Heidelberg, Nov. 2014, pp.
and analysis,” in Proceedings of the 2010 ACM SIGMOD International 381–399.
Conference on Management of data. ACM, 2010, pp. 963–968. [89] M. Backes, M. Barbosa, D. Fiore, and R. M. Reischuk, “ADSNARK:
[67] N. Li, Scalable database query processing. Johns Hopkins University, Nearly practical and privacy-preserving proofs on authenticated data,”
2012. in 2015 IEEE Symposium on Security and Privacy. IEEE Computer
[68] J. M. Smith and P. Y.-T. Chang, “Optimizing the performance of a Society Press, May 2015, pp. 271–286.
relational algebra database interface,” Communications of the ACM, [90] J. H. Ahn, D. Boneh, J. Camenisch, S. Hohenberger, a. shelat, and
vol. 18, no. 10, pp. 568–579, 1975. B. Waters, “Computing on authenticated data,” in TCC 2012, ser.
[69] J. Kepner, D. Bader, A. Buluç, J. Gilbert, T. Mattson, and H. Meyer- LNCS, R. Cramer, Ed., vol. 7194. Springer, Heidelberg, Mar. 2012,
henke, “Graphs, matrices, and the GraphBLAS: Seven good reasons,” pp. 1–20.
Procedia Computer Science, vol. 51, pp. 2453–2462, 2015. [91] A. Hamlin, N. Schear, E. Shen, M. Varia, S. Yakoubov, and A. Yerukhi-
[70] V. Gadepally, J. Bolewski, D. Hook, D. Hutchison, B. Miller, and movich, “Cryptography for big data security,” in Big Data: Storage,
J. Kepner, “Graphulo: Linear algebra graph kernels for NoSQL Sharing, and Security, F. Hu, Ed. Taylor & Francis LLC, CRC Press,
databases,” in International Parallel & Distributed Processing Sym- 2016.
posium Workshops (IPDPSW). IEEE, 2015. [92] M. Bellare, A. Boldyreva, and A. O’Neill, “Deterministic and efficiently
[71] D. Hutchison, J. Kepner, V. Gadepally, and A. Fuchs, “Graphulo searchable encryption,” in CRYPTO 2007, ser. LNCS, A. Menezes, Ed.,
implementation of server-side sparse matrix multiply in the accu- vol. 4622. Springer, Heidelberg, Aug. 2007, pp. 535–552.
mulo database,” in High Performance Extreme Computing Conference [93] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, “Order-preserving
(HPEC), 2015 IEEE. IEEE, 2015, pp. 1–7. encryption for numeric data,” in Proceedings of the ACM SIGMOD
International Conference on Management of Data, 2004, pp. 563–574.
[72] Microsoft Corporation, “Database-level roles.” [Online]. Available:
[Online]. Available: http://doi.acm.org/10.1145/1007568.1007632
https://msdn.microsoft.com/en-us/library/ms189121.aspx
[94] A. Boldyreva, N. Chenette, Y. Lee, and A. O’Neill, “Order-preserving
[73] C. Bösch, P. Hartel, W. Jonker, and A. Peter, “A survey symmetric encryption,” in EUROCRYPT 2009, ser. LNCS, A. Joux,
of provably secure searchable encryption,” ACM Comput. Surv., Ed., vol. 5479. Springer, Heidelberg, Apr. 2009, pp. 224–241.
vol. 47, no. 2, pp. 18:1–18:51, August 2014. [Online]. Available:
[95] A. Boldyreva, N. Chenette, and A. O’Neill, “Order-preserving encryp-
http://doi.acm.org/10.1145/2636328
tion revisited: Improved security analysis and alternative solutions,” in
[74] P. Grubbs, R. McPherson, M. Naveed, T. Ristenpart, and V. Shmatikov, CRYPTO 2011, ser. LNCS, P. Rogaway, Ed., vol. 6841. Springer,
“Breaking web applications built on top of encrypted data,” in ACM Heidelberg, Aug. 2011, pp. 578–595.
CCS 16. ACM Press, 2016, pp. 1353–1364. [96] C. Mavroforakis, N. Chenette, A. O’Neill, G. Kollios, and
[75] S. Kamara, “Structured encryption and leakage suppression,” presented R. Canetti, “Modular order-preserving encryption, revisited,” in
at Encryption for Secure Search and Other Algorithms, Bertinoro, Italy, Proceedings of the 2015 ACM SIGMOD International Conference
June 2015. on Management of Data, 2015, pp. 763–777. [Online]. Available:
[76] S. Bajaj and R. Sion, “TrustedDB: A trusted hardware-based database http://doi.acm.org/10.1145/2723372.2749455
with privacy and data confidentiality,” IEEE Transactions on Knowl- [97] R. A. Popa, F. H. Li, and N. Zeldovich, “An ideal-security protocol for
edge and Data Engineering, vol. 26, no. 3, pp. 752–765, 2014. order-preserving encoding,” in 2013 IEEE Symposium on Security and
[77] A. C.-C. Yao, “Protocols for secure computations (extended abstract),” Privacy. IEEE Computer Society Press, May 2013, pp. 463–477.
in 23rd FOCS. IEEE Computer Society Press, Nov. 1982, pp. 160– [98] P. Grofig, M. Härterich, I. Hang, F. Kerschbaum, M. Kohler, A. Schaad,
164. A. Schröpfer, and W. Tighzert, “Experiences and observations on
[78] M. Ben-Or, S. Goldwasser, and A. Wigderson, “Completeness theorems the industrial implementation of a system to search over outsourced
for non-cryptographic fault-tolerant distributed computation (extended encrypted data,” in Sicherheit, 2014, pp. 115–125. [Online]. Available:
abstract),” in 20th ACM STOC. ACM Press, May 1988, pp. 1–10. http://subs.emis.de/LNI/Proceedings/Proceedings228/article7.html
[99] M. Chase and S. Kamara, “Structured encryption and controlled disclosure,” in ASIACRYPT 2010, ser. LNCS, M. Abe, Ed., vol. 6477. Springer, Heidelberg, Dec. 2010, pp. 577–594.
[100] M. Naveed, M. Prabhakaran, and C. A. Gunter, “Dynamic searchable encryption via blind storage,” in 2014 IEEE Symposium on Security and Privacy. IEEE Computer Society Press, May 2014, pp. 639–654.
[101] R. Bost, “Σ𝑜𝜙𝑜𝜍: Forward secure searchable encryption,” in ACM CCS 16. ACM Press, 2016, pp. 1143–1154.
[102] S. Kamara and T. Moataz, “SQL on structurally-encrypted databases,” Cryptology ePrint Archive, Report 2016/453, 2016, http://eprint.iacr.org/2016/453.
[103] ——, “Boolean searchable symmetric encryption with worst-case sub-linear complexity,” in EUROCRYPT 2017, 2017.
[104] T. Moataz, “Searchable symmetric encryption: Implementation of 2Lev, ZMF, IEX-2Lev, IEX-ZMF,” https://github.com/orochi89/Clusion.
[105] D. Cash and S. Tessaro, “The locality of searchable symmetric encryption,” in EUROCRYPT 2014, ser. LNCS, P. Q. Nguyen and E. Oswald, Eds., vol. 8441. Springer, Heidelberg, May 2014, pp. 351–368.
[106] S. Kamara and C. Papamanthou, “Parallel and dynamic searchable symmetric encryption,” in FC 2013, ser. LNCS, A.-R. Sadeghi, Ed., vol. 7859. Springer, Heidelberg, Apr. 2013, pp. 258–274.
[107] E. Stefanov, C. Papamanthou, and E. Shi, “Practical dynamic searchable encryption with small leakage,” in NDSS 2014. The Internet Society, Feb. 2014.
[108] A. C.-C. Yao, “How to generate and exchange secrets (extended abstract),” in 27th FOCS. IEEE Computer Society Press, Oct. 1986, pp. 162–167.
[109] M. Chase and E. Shen, “Substring-searchable symmetric encryption,” PoPETs, vol. 2015, no. 2, pp. 263–281, 2015. [Online]. Available: http://www.degruyter.com/view/j/popets.2015.2015.issue-2/popets-2015-0014/popets-2015-0014.xml
[110] T. Boelter, R. Poddar, and R. A. Popa, “A secure one-roundtrip index for range queries,” Cryptology ePrint Archive, Report 2016/568, 2016, http://eprint.iacr.org/2016/568.
[111] D. S. Roche, D. Apon, S. G. Choi, and A. Yerukhimovich, “POPE: Partial order preserving encoding,” in ACM CCS 16. ACM Press, 2016, pp. 1131–1142.
[112] F. Baldimtsi and O. Ohrimenko, “Sorting and searching behind the curtain,” in FC 2015, ser. LNCS, R. Böhme and T. Okamoto, Eds., vol. 8975. Springer, Heidelberg, Jan. 2015, pp. 127–146.
[113] M. Strizhov and I. Ray, “Multi-keyword similarity search over encrypted cloud data,” Cryptology ePrint Archive, Report 2015/137, 2015, http://eprint.iacr.org/2015/137.
[114] E. Shen, E. Shi, and B. Waters, “Predicate privacy in encryption systems,” in TCC 2009, ser. LNCS, O. Reingold, Ed., vol. 5444. Springer, Heidelberg, Mar. 2009, pp. 457–473.
[115] C. Bösch, Q. Tang, P. H. Hartel, and W. Jonker, “Selective document retrieval from encrypted database,” in ISC 2012, ser. LNCS, D. Gollmann and F. C. Freiling, Eds., vol. 7483. Springer, Heidelberg, Sep. 2012, pp. 224–241.
[116] X. Meng, S. Kamara, K. Nissim, and G. Kollios, “GRECS: Graph encryption for approximate shortest distance queries,” in ACM CCS 15, I. Ray, N. Li, and C. Kruegel, Eds. ACM Press, Oct. 2015, pp. 504–517.
[117] O. Goldreich and R. Ostrovsky, “Software protection and simulation on oblivious rams,” Journal of the ACM (JACM), vol. 43, no. 3, pp. 431–473, 1996.
[118] E. Stefanov, M. van Dijk, E. Shi, C. W. Fletcher, L. Ren, X. Yu, and S. Devadas, “Path ORAM: an extremely simple oblivious RAM protocol,” in ACM CCS 13, A.-R. Sadeghi, V. D. Gligor, and M. Yung, Eds. ACM Press, Nov. 2013, pp. 299–310.
[119] M. Naveed, “The fallacy of composition of oblivious RAM and searchable encryption,” Cryptology ePrint Archive, Report 2015/668, 2015, http://eprint.iacr.org/2015/668.
[120] D. S. Roche, A. J. Aviv, and S. G. Choi, “A practical oblivious map data structure with secure deletion and history independence,” in 2016 IEEE Symposium on Security and Privacy. IEEE Computer Society Press, 2016, pp. 178–197.
[121] S. Garg, P. Mohassel, and C. Papamanthou, “TWORAM: Efficient oblivious RAM in two rounds with applications to searchable encryption,” ser. LNCS. Springer, Heidelberg, Aug. 2016, pp. 563–592.
[122] S. Lu and R. Ostrovsky, “How to garble RAM programs,” in EUROCRYPT 2013, ser. LNCS, T. Johansson and P. Q. Nguyen, Eds., vol. 7881. Springer, Heidelberg, May 2013, pp. 719–734.
[123] T. Moataz and E.-O. Blass, “Oblivious substring search with updates,” Cryptology ePrint Archive, Report 2015/722, 2015, http://eprint.iacr.org/2015/722.
[124] S. Faber, S. Jarecki, S. Kentros, and B. Wei, “Three-party ORAM for secure computation,” in ASIACRYPT 2015, Part I, ser. LNCS, T. Iwata and J. H. Cheon, Eds., vol. 9452. Springer, Heidelberg, Nov. / Dec. 2015, pp. 360–385.
[125] G. Kellaris, G. Kollios, K. Nissim, and A. O’Neill, “Generic attacks on secure outsourced databases,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, ser. CCS ’16. New York, NY, USA: ACM, 2016, pp. 1329–1340. [Online]. Available: http://doi.acm.org/10.1145/2976749.2978386
[126] E. Chen, I. Gomez, B. Saavedra, and J. Yucra, “Cocoon: Encrypted substring search,” https://courses.csail.mit.edu/6.857/2016/files/29.pdf, May 2015, accessed: 2016-07-15.
[127] Y. Zhang, J. Katz, and C. Papamanthou, “All your queries are belong to us: The power of file-injection attacks on searchable encryption,” in 25th USENIX Security Symposium, USENIX Security 16, Austin, TX, USA, August 10-12, 2016, 2016, pp. 707–720. [Online]. Available: https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/zh
[128] D. Cash, P. Grubbs, J. Perry, and T. Ristenpart, “Leakage-abuse attacks against searchable encryption,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, October 12-16, 2015, 2015, pp. 668–679. [Online]. Available: http://doi.acm.org/10.1145/2810103.2813700
[129] D. Pouliot and C. V. Wright, “The shadow nemesis: Inference attacks on efficiently deployable, efficiently searchable encryption,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, October 12-16, 2015, 2015, pp. 644–655. [Online]. Available: http://doi.acm.org/10.1145/2810103.2813651
[130] M. Naveed, S. Kamara, and C. V. Wright, “Inference attacks on property-preserving encrypted databases,” in 23rd ACM Conference on Computer and Communications Security, Vienna, Austria, October 24-28, 2016, 2016.
[131] P. Grubbs, K. Sekniqi, V. Bindschaedler, M. Naveed, and T. Ristenpart, “Leakage-abuse attacks against order-revealing encryption,” Cryptology ePrint Archive, Report 2016/895, http://eprint.iacr.org/2016/895.
[132] M. S. Islam, M. Kuzu, and M. Kantarcioglu, “Access pattern disclosure on searchable encryption: Ramification, attack and mitigation,” in 19th Annual Network and Distributed System Security Symposium, NDSS 2012, San Diego, California, USA, February 5-8, 2012, 2012.
[133] A. Boldyreva and N. Chenette, “Efficient fuzzy search on encrypted data,” in FSE 2014, ser. LNCS, C. Cid and C. Rechberger, Eds., vol. 8540. Springer, Heidelberg, Mar. 2015, pp. 613–633.
[134] G. D. Crescenzo and A. Ghosh, “Privacy-preserving range queries from keyword queries,” in Data and Applications Security and Privacy XXIX, ser. LNCS, vol. 9149. Springer, 2015, pp. 35–50.
[135] M. F. Porter, “An algorithm for suffix stripping,” Program, vol. 14, no. 3, pp. 130–137, 1980.
[136] P. Willett, “The porter stemming algorithm: then and now,” Program, vol. 40, no. 3, pp. 219–223, 2006.
[137] J. Li, Q. Wang, C. Wang, N. Cao, K. Ren, and W. Lou, “Fuzzy keyword search over encrypted data in cloud computing,” in INFOCOM 2010. 29th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 15-19 March 2010, San Diego, CA, USA, 2010, pp. 441–445. [Online]. Available: http://dx.doi.org/10.1109/INFCOM.2010.5462196
[138] P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality,” in Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 1998, pp. 604–613.
[139] A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high dimensions via hashing,” in VLDB, vol. 99, no. 6, 1999, pp. 518–529.
[140] M. Kuzu, M. S. Islam, and M. Kantarcioglu, “Efficient similarity search over encrypted data,” in IEEE 28th International Conference on Data Engineering (ICDE), 2012, pp. 1156–1167. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2012.23
[141] H. Park, B. H. Kim, D. H. Lee, Y. D. Chung, and J. Zhan, “Secure similarity search,” in 2007 IEEE International Conference on Granular Computing, GrC 2007, San Jose, California, USA, 2-4 November 2007, 2007, p. 598. [Online]. Available: http://dx.doi.org/10.1109/GRC.2007.140
[142] M. Adjedj, J. Bringer, H. Chabanne, and B. Kindarji, “Biometric identification over encrypted data made feasible,” in Information Systems Security, 5th International Conference, ICISS 2009, Kolkata, India, December 14-18, 2009, Proceedings, 2009, pp. 86–100. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-10772-6_8
[143] J. Bringer, H. Chabanne, and B. Kindarji, “Error-tolerant searchable encryption,” in Proceedings of IEEE International Conference on Communications, ICC 2009, Dresden, Germany, 14-18 June 2009, 2009, pp. 1–6. [Online]. Available: http://dx.doi.org/10.1109/ICC.2009.5199004
[144] C. Wang, K. Ren, S. Yu, and K. M. R. Urs, “Achieving usable and privacy-assured similarity search over outsourced cloud data,” in Proceedings of the IEEE INFOCOM 2012, Orlando, FL, USA, March 25-30, 2012, 2012, pp. 451–459. [Online]. Available: http://dx.doi.org/10.1109/INFCOM.2012.6195784
[145] I. Demertzis, S. Papadopoulos, O. Papapetrou, A. Deligiannakis, and M. Garofalakis, “Practical private range search revisited,” in ACM SIGMOD/PODS Conference, 2016.
[146] I. Damgard, M. Geisler, and M. Kroigard, “Homomorphic encryption and secure comparison,” International Journal of Applied Cryptography, vol. 1, no. 1, pp. 22–31, 2008.
[147] F. Kerschbaum, D. Biswas, and S. de Hoogh, “Performance comparison of secure comparison protocols,” in Database and Expert Systems Application, 2009. DEXA’09. 20th International Workshop on. IEEE, 2009, pp. 133–136.
[148] S. Han and W. K. Ng, “Privacy-preserving linear fisher discriminant analysis,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2008, pp. 136–147.
[149] X. S. Wang, K. Nayak, C. Liu, T.-H. H. Chan, E. Shi, E. Stefanov, and Y. Huang, “Oblivious data structures,” in ACM CCS 14, G.-J. Ahn, M. Yung, and N. Li, Eds. ACM Press, Nov. 2014, pp. 215–226.
[150] R. Elmasri and S. Navathe, Fundamentals of Database Systems. Boston, MA, USA: Addison-Wesley, 2011.
[151] E. Bertino and R. Sandhu, “Database security-Concepts, Approaches, and Challenges,” IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 1, 2005.
[152] A. Fuchs, “Accumulo–extensions to Google’s Bigtable design,” National Security Agency, Tech. Rep, 2012.
[153] IARPA, “Broad agency announcement IARPA-BAA-11-01: Security and privacy assurance research (SPAR) program.” February 2011. [Online]. Available: https://www.fbo.gov/notices/c55e38dbde30cb668f687897d8f01e69
[154] A. Hamlin and J. Herzog, “A test-suite generator for database systems,” in 2014 IEEE High Performance Extreme Computing Conference, 2014, pp. 1–6.
[155] M. Varia, B. Price, N. Hwang, A. Hamlin, J. Herzog, J. Poland, M. Reschly, S. Yakoubov, and R. K. Cunningham, “Automated assessment of secure search systems,” Operating Systems Review, vol. 49, no. 1, pp. 22–30, 2015.
[156] M. Varia, S. Yakoubov, and Y. Yang, “HEtest: A homomorphic encryption testing framework,” in FC 2015 Workshops, ser. LNCS, M. Brenner, N. Christin, B. Johnson, and K. Rohloff, Eds., vol. 8976. Springer, Heidelberg, Jan. 2015, pp. 213–230.
[157] J. Kepner, V. Gadepally, P. Michaleas, N. Schear, M. Varia, A. Yerukhimovich, and R. K. Cunningham, “Computing on masked data: a high performance method for improving big data veracity,” in 2014 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2014, pp. 1–6.
[158] V. Gadepally, B. Hancock, B. Kaiser, J. Kepner, P. Michaleas, M. Varia, and A. Yerukhimovich, “Computing on masked data to improve the security of big data,” in IEEE International Symposium on Technologies for Homeland Security (HST). IEEE, 2015, pp. 1–6.
[163] “Amazon Web Services (AWS) - cloud computing services.” [Online]. Available: https://aws.amazon.com/

APPENDIX A
SUBSTRING AND WILDCARD QUERY COMBINERS

9) Bounded-length substring using keyword equality [22]: Searches for substrings of a fixed length 𝜅 can be supported simply by inserting all length-𝜅 substrings (𝜅-grams) into an equality-searchable index during initialization. Given a field with maximum length 𝛼, this technique requires adding up to 𝛼 − 𝜅 + 1 keywords during insertion and making one keyword search during query execution.

10) Short substring using range [22]: By inserting the 𝜅-grams into a range index, queries for substrings of length up to 𝜅 can also be supported. We explain by example: one can query for the two-character string “hi” by searching for the range [“hia”, “hiz”] in an index of length-3 substrings.

11) Anchored substring using conjunction [18]: We now consider the converse of the above situation: supporting searches of long substrings of length at least 𝜅, with storage overhead decreasing in 𝜅. We begin with an “anchored” search, where the substring occurs either at the beginning or end of the string.

By way of example, suppose we wish to support substring searches on the record 𝑎 = “teststring”. In a conjunction-searchable index, we insert the 𝜅-grams of the string along with their locations: (1, “tes”), (2, “est”), . . . , (8, “ing”). To search for all records beginning with “test”, the client asks for all records matching both (1, “tes”) and (2, “est”). Searching from the end of the string can be accomplished using negative indexing, that is, by also inserting (-1, “ing”), (-2, “rin”), (-3, “tri”), . . . , (-8, “tes”) in the above example.

12) Substring using disjunction of conjunctions [18]: Removing the anchoring restriction from the above technique requires the use of disjunctions, since the starting location of the substring is unknown. To find the substring “test”, the querier must search for the disjunction, over every possible starting position 𝑖, of the conjunction of (𝑖, “tes”) and (𝑖+1, “est”). The number of terms in this formula depends on the maximum string length.

13) and 14) Wildcard using conjunctions [18]: The above technique also supports single-character wildcard queries. For instance, to search for “tes_str”, the client asks for a conjunction of (1, “tes”) and (5, “str”). Note that at least 𝜅 letters are required on each side of the wildcard. This technique can be extended to unanchored queries as above, and it supports multiple-character wildcards by incrementing the expected positions of the 𝜅-grams.
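The positional 𝜅-gram combiners of Appendix A (techniques 11 through 14) can be illustrated with a small plaintext sketch. This is not an implementation from [18] or [22]: the class PositionalIndex, the helper names, and the choice 𝜅 = 3 are assumptions made for illustration, and an ordinary in-memory dictionary stands in for the conjunction-searchable encrypted index, so the sketch shows only how queries decompose into terms and provides no security.

```python
from collections import defaultdict

K = 3  # gram length (kappa); an illustrative choice, not fixed by the text


def kgrams(s, k=K):
    """Return (1-based position, k-gram) pairs for every k-gram of s."""
    return [(i + 1, s[i:i + k]) for i in range(len(s) - k + 1)]


def wildcard_terms(pattern, k=K, wildcard="_"):
    """Positional terms for a start-anchored pattern with 1-char wildcards."""
    terms, offset = [], 0
    for run in pattern.split(wildcard):
        if run:
            if len(run) < k:
                raise ValueError("each literal run needs at least k characters")
            terms += [(offset + pos, g) for pos, g in kgrams(run, k)]
        offset += len(run) + 1  # skip the run plus one wildcard character
    return terms


class PositionalIndex:
    """Plaintext stand-in (hypothetical) for a conjunction-searchable index."""

    def __init__(self, k=K):
        self.k = k
        self.postings = defaultdict(set)  # (position, gram) -> record ids
        self.max_len = 0

    def insert(self, rec_id, value):
        grams = kgrams(value, self.k)
        n = len(grams)
        for pos, gram in grams:
            self.postings[(pos, gram)].add(rec_id)          # counted from start
            self.postings[(pos - n - 1, gram)].add(rec_id)  # counted from end
        self.max_len = max(self.max_len, len(value))

    def conjunction(self, terms):
        """Records matching every (position, gram) term."""
        sets = [self.postings.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    def prefix_search(self, q):
        """Technique 11, anchored at the beginning of the string."""
        return self.conjunction(kgrams(q, self.k))

    def suffix_search(self, q):
        """Technique 11, anchored at the end via negative positions."""
        grams = kgrams(q, self.k)
        return self.conjunction([(pos - len(grams) - 1, g) for pos, g in grams])

    def substring_search(self, q):
        """Technique 12: one conjunction per possible starting position."""
        hits = set()
        for shift in range(self.max_len - len(q) + 1):
            hits |= self.conjunction(
                [(pos + shift, g) for pos, g in kgrams(q, self.k)])
        return hits

    def wildcard_search(self, pattern):
        """Techniques 13 and 14, anchored at the beginning of the string."""
        return self.conjunction(wildcard_terms(pattern, self.k))


idx = PositionalIndex()
for rec_id, value in [(1, "teststring"), (2, "contest"), (3, "string")]:
    idx.insert(rec_id, value)

print(sorted(idx.prefix_search("test")))       # records beginning with "test"
print(sorted(idx.substring_search("string")))  # records containing "string"
print(wildcard_terms("tes_str"))               # terms sent for "tes_str"
```

In this sketch every query must contain at least 𝜅 consecutive literal characters, matching the restriction noted for the wildcard technique, and the loop in substring_search makes visible why the number of terms in technique 12 grows with the maximum string length.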
[159] B. Schneier, “Data is a toxic asset,” 2016. [Online]. Available: https://www.schneier.com/essays/archives/2016/03/data_is_a_toxic_asse.html
[160] D. Blumenthal, “Launching HITECH,” New England Journal of Medicine, vol. 362, no. 5, pp. 382–385, 2010.
[161] The Office of the National Coordinator for Health Information Technology, “Guide to privacy and security of electronic health information,” 2015. [Online]. Available: https://www.healthit.gov/sites/default/files/pdf/privacy/privacy-and-security-guide.pdf
[162] S. Barnum, “Standardizing cyber threat intelligence information with the Structured Threat Information eXpression (STIX),” MITRE Corporation, vol. 11, 2012.

APPENDIX B
PROCEDURE FOR PILOT STUDY

We installed and configured multiple protected search systems. For each, we ingested ten million records of real application data, and conducted sessions with 10 users over a ten-day period. Our Institutional Review Board reviewed our protocols and questionnaires, determined that they represented a minimal risk, and approved the procedure. Software for the
procedure resided in an Amazon Web Services (AWS) [163]
network. Data was drawn from a genuine application source
and was converted to a single, static table with over one
hundred columns and ten million records.
Participants had a mix of technical and non-technical backgrounds; six were men and four were women. All participants had prior experience interacting with web interfaces that use database backends to present results. Participants were aware that they were using different systems, but each system was identified only by a single letter. Participants were not given any information about the capabilities of the technologies.
Each participant took part in three types of sessions, each
of which lasted 30 minutes: 1) training on the web interface;
2) scripted interaction with each of the technologies; and
3) exploratory sessions with each of the technologies. Users interacted with the secure database technology through a web application that included a visual query builder. Participants used the query builder to create queries, and the web server then submitted each query to the protected search system.