Authenticated Multistep Nearest Neighbor Search
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011
1 INTRODUCTION
Let DB be a D-dimensional data set. Each record P ∈ DB can be thought of as a point in the space defined by the D attribute domains, and in the sequel we use the terms record and point interchangeably. Given a point Q, a nearest neighbor (NN) query retrieves the record {P ∈ DB : DST(Q, P) ≤ DST(Q, P′) ∀ P′ ∈ DB}, where DST(Q, P) denotes the distance between Q and P. Likewise, a kNN query returns the k closest points to Q. NN and kNN queries are common in similarity retrieval. Specifically, since similarity between records is inversely proportional to their distance, a kNN query returns the k most similar records to Q. The multistep framework [11], [19] has been proposed for NN and similarity retrieval in domains that entail high-dimensional data (e.g., time series, medical, image, biological, and document databases), expensive distance functions (e.g., road network distance, dynamic time warping), or a combination of both factors. In this paper, we focus on authenticated multistep NN search for applications that require a proof of result correctness. For instance, [3] argues that the most cost-effective way for medical facilities to maintain radiology images is to outsource all image management tasks to specialized commercial providers. Clients issue similarity queries to a provider. The latter returns the result set and additional verification information, based on which the client establishes that the result is indeed correct, i.e., that it contains exactly the records of DB that satisfy the query conditions, and that these records
. S. Papadopoulos, L. Wang, Y. Yang, and D. Papadias are with the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong. E-mail: {stavros, lxwang, yini, dimitris}@cse.ust.hk.
. P. Karras is with the School of Computing, National University of Singapore, Computing 1, #02-18, Computing Drive, Singapore 117417. E-mail: [email protected].
Manuscript received 18 Sept. 2009; revised 12 Dec. 2009; accepted 9 Feb. 2010; published online 30 Aug. 2010.
Recommended for acceptance by D. Srivastava.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2009-09-0665.
Digital Object Identifier no. 10.1109/TKDE.2010.157.
1041-4347/11/$26.00 © 2011 IEEE
indeed originate from their legitimate data source (i.e., the corresponding medical facility). A similar situation occurs for data replication, i.e., when a data owner stores DB at several servers. Clients issue their queries to the closest (in terms of network latency) server, but they wish to be assured that the result is the same as if the queries were sent to the original source of DB. In other cases, correctness is guaranteed by a trusted third party. For instance, notarization services [20] have been proposed to safeguard against tampering in document databases (the motivating example being Enron). Authenticated query processing assures the client that the received result complies with the validated DB.
Initially, we study the problem assuming that the entire DB resides at a single server. Our first contribution is AMN, an adaptation of a multistep algorithm that is optimal in terms of DST computations. AMN requires transmission of false hits, i.e., records that are not in the result, but are nevertheless necessary for its verification. In addition to the network overhead, false hits impose a significant burden on the client, which has to verify them. The second contribution, C-AMN, alleviates this problem through an elaborate scheme that reduces the size of the false hits. Finally, we consider a distributed setting, where the database is horizontally partitioned over several servers. Our third contribution, ID-AMN, incrementally retrieves data, gradually eliminating servers that cannot contribute results.
The rest of the paper is organized as follows: Section 2 surveys related work. Section 3 presents the indexing scheme and the query algorithms of AMN. Section 4 focuses on C-AMN and the minimization of false hits. Section 5 deals with distributed servers and ID-AMN. Section 6 contains an extensive experimental evaluation with real data sets, and Section 7 concludes the paper.
2 BACKGROUND
Section 2.1 describes multistep query processing. Section 2.2 overviews similarity search for high-dimensional data. Section 2.3 surveys background on authenticated queries. Section 2.4 focuses on the MR-Tree, which is used by the proposed techniques.
Published by the IEEE Computer Society

Algorithm MultistepNN(Q, k)
1. Retrieve the k NNs {P1, ..., Pk} of Q according to dst
2. RS = {P1, ..., Pk}, sorted according to DST
3. DSTmax = DST(Q, Pk) // the current kth NN DST
4. P = next NN of Q according to dst
5. While dst(Q, P) < DSTmax
6.   If DST(Q, P) < DSTmax
7.     Insert P into RS and remove previous kth NN
8.     Update DSTmax over RS
9.   P = next NN of Q according to dst
Fig. 1. Algorithm MultistepNN.
2.1 Multistep NN Framework
The multistep NN framework is motivated by applications that entail expensive distance computations. Specifically, let DST(Q, P) be the actual distance between a query Q and a data point P ∈ DB. The applicability of the multistep framework rests on the existence of a filter distance metric dst, which is cheap to evaluate and satisfies the lower bounding property, i.e., for every possible Q and P: dst(Q, P) ≤ DST(Q, P). Multistep NN search was introduced in [11]. Subsequently, Seidl and Kriegel [19] proposed the algorithm of Fig. 1, which is optimal in terms of DST computations. In order to provide a concrete context, our explanation focuses on road networks [18], where DST and dst refer to the network and euclidean distance, respectively. Compared to euclidean distance (dst), network distance (DST) computations are significantly more expensive because they entail shortest path algorithms in large graphs. Moreover, the euclidean kNNs can be efficiently retrieved using conventional NN search on a spatial index. Assuming that DB is indexed by an R*-Tree [1], the multistep kNN algorithm first retrieves the k euclidean NNs of Q using an incremental algorithm (e.g., [7]). These points are inserted into a result set RS, and their network (DST) distances are computed. Let DSTmax be the network distance1 between Q and its current kth NN Pk. The next euclidean NN P is then retrieved. As long as dst(Q, P) < DSTmax, the algorithm computes DST(Q, P) and compares it against DSTmax. If DST(Q, P) < DSTmax, P is inserted into RS, the previous Pk is expunged, and DSTmax is updated. The loop of Lines 5-9 terminates when dst(Q, P) ≥ DSTmax; because of the lower bounding property of the euclidean distance, any point lying farther in the euclidean space cannot be closer than DSTmax in the network. Independently of the application domain, the algorithm of Fig. 1 performs the minimum number of DST computations.
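To make the filter-and-refine loop concrete, the following Python sketch mirrors Fig. 1. The names multistep_knn, candidates, and DST are our own illustrative assumptions; candidates is assumed to yield (dst, point) pairs in increasing order of the cheap filter distance, as an incremental NN search on an index would.

```python
import heapq

def multistep_knn(Q, k, candidates, DST):
    """Sketch of the multistep kNN loop of Fig. 1.

    candidates: iterable of (dst(Q, P), P) in increasing dst order.
    DST: the expensive exact metric, with dst(Q, P) <= DST(Q, P).
    Returns the k NNs as (DST, point) pairs sorted by DST.
    """
    it = iter(candidates)
    heap = []                        # max-heap on DST via negated keys
    for _ in range(k):               # seed RS with the k filter-space NNs
        _, P = next(it)
        heapq.heappush(heap, (-DST(Q, P), P))
    DSTmax = -heap[0][0]             # current kth NN distance
    for dst_qp, P in it:
        if dst_qp >= DSTmax:         # lower bound: no later point can qualify
            break
        d = DST(Q, P)                # expensive computation only when needed
        if d < DSTmax:
            heapq.heapreplace(heap, (-d, P))  # evict previous kth NN
            DSTmax = -heap[0][0]
    return sorted((-negd, P) for negd, P in heap)
```

The early break corresponds exactly to the termination condition of Line 5 in Fig. 1: by the lower bounding property, once the filter distance reaches DSTmax, no unseen point can beat the current kth NN.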
Specifically, in addition to RS, the DST distances are computed only for false hits, i.e., the set of points FH = {P ∈ DB − RS : dst(Q, P) ≤ DST(Q, Pk)}, where Pk is the final kth NN. The rest of the records are not accessed at all (if they reside in pruned nodes of the R*-Tree), or they are eliminated using their dst to Q.
1. To avoid tedious details in our discussion, we assume that all data distances to the query point are different.
2.2 High-Dimensional Similarity Search Using Multistep NN
Several applications, including image, medical, time series, and document databases, involve high-dimensional data. Similarity retrieval in these applications based on low-dimensional indexes, such as the R*-Tree [1], is very expensive due to the dimensionality curse [2]. Specifically, even for moderate dimensionality (i.e., D ≥ 20), a sequential scan that computes DST(Q, P) for every P ∈ DB is usually cheaper than conventional NN algorithms using the index. Consequently, numerous specialized structures have been proposed for exact [8] and approximate [22] kNN search in high dimensions. The GEMINI framework [6], [11] follows a different approach, combining multistep search with a dimensionality reduction technique that exhibits the lower bounding property. Specifically, each record P ∈ DB is mapped to a low-dimensional representation p in d dimensions (d ≪ D). The resulting d-dimensional data set db is indexed by an R*-Tree, or any low-dimensional index. The query Q is also transformed to a d-dimensional point q and processed using a multistep method. For instance, in the algorithm of Fig. 1, DST (resp. dst) computations involve high- (resp. low-) dimensional points. The index prunes most nodes and records using the cheap filter (dst) distances,2 whereas the expensive DST computations are necessary only for the points in the result RS and the false hit set FH. GEMINI is the most common approach for performing similarity search over high-dimensional data, and especially time series. Numerous dimensionality reduction methods have been used extensively, including the Discrete Fourier Transform (DFT), Singular Value Decomposition (SVD), the Discrete Wavelet Transform (DWT), Piecewise Linear Approximation (PLA), Piecewise Aggregate Approximation (PAA), Adaptive Piecewise Constant Approximation (APCA), and Chebyshev Polynomials (CP).
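As a concrete instance of a lower bounding reduction, the sketch below keeps the first d coefficients of the orthonormal DFT; by Parseval's theorem, the euclidean distance between the truncated coefficient vectors can never exceed the distance between the full series. This is a minimal sketch assuming NumPy; the function names are ours, not GEMINI's.

```python
import numpy as np

def reduce_dft(x, d):
    # keep the first d coefficients of the orthonormal DFT of series x
    return np.fft.fft(x, norm="ortho")[:d]

def dst_filter(qc, pc):
    # euclidean distance in the reduced space; lower-bounds the
    # full-space euclidean distance by Parseval's theorem
    return np.linalg.norm(qc - pc)
```

Because dropping coefficients can only discard (nonnegative) energy from the squared distance, dst_filter satisfies the lower bounding property required by the multistep framework.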
Their effectiveness is measured by the number of records that they can prune using only the low-dimensional representations (i.e., it is inversely proportional to the cardinality of FH). Ding et al. [5] experimentally compare various techniques, concluding that their effectiveness depends on the data characteristics.
2.3 Authenticated Query Processing
In authenticated query processing, a server maintains a data set DB signed by a trusted authority (e.g., the data owner, or a notarization service). The signature sig is usually based on a public-key cryptosystem (e.g., RSA [16]). The server receives and processes queries from clients. Each query returns a result set RS ⊆ DB that satisfies certain predicates. Moreover, the client must be able to establish that RS is correct, i.e., that it contains all records of DB that satisfy the query conditions, and that these records have not been modified by the server or another entity. Since sig captures the entire DB (including records not in the query result), in addition to RS, the server returns a verification object (VO). Given the VO, the client can verify RS based on sig and the signer's public key. VO generation at the server is usually performed using an authenticated data structure (ADS). The most influential ADS is
2. Note that in GEMINI, DST and dst may be based on the same definition (e.g., they may both be euclidean distances). In this case, dst is cheaper because of the lower dimensionality.
Fig. 2. An MH-Tree over 16 records; the digests shown in gray are included in the VO for requested record P6.
the Merkle Hash Tree (MH-Tree) [15], a main-memory binary tree originally proposed for single record authentication. Each leaf in the MH-Tree stores the digest of a record, calculated using a one-way, collision-resistant hash function h, such as SHA-1 [16]. An inner node stores a digest computed on the concatenation of the digests in its children. The trusted authority signs the root digest. Fig. 2 illustrates an MH-Tree over 16 records. Assume that a client requests record P6. When traversing the tree to locate P6, the server produces a VO that contains the digests (shown in gray) of the siblings of the visited nodes: VO = [[h25 [[h5 P6] h20]] h30]. Tokens [ and ] signify the scope of a node. The VO and sig are transmitted to the client, which subsequently simulates a reverse tree traversal. Specifically, from h5 and P6 it computes h19, then h26 (using h19 and h20), h29 (using h25 and h26), and finally h31 (using h29 and h30). Due to the collision resistance of the hash function, if P6 is modified, then the digest h31 reconstructed by the client will not match sig; hence, the verification will fail. The MH-Tree constitutes the basis of a large number of subsequent ADSs that support: 1) range queries on a single attribute [13], 2) multidimensional ranges [23], 3) continuous queries on data streams [24], 4) search in generalized directed acyclic graphs [14], 5) search over peer-to-peer networks [21], 6) XML queries [12], and 7) similarity-based document retrieval [17]. In the following, we focus on the MR-Tree [23], the state-of-the-art multidimensional ADS, which is utilized by our methods.
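A minimal sketch of the MH-Tree construction, VO generation, and verification just described. It assumes the records fit in memory and their number is a power of two; the function names are ours, and SHA-1 is used as in the paper's examples.

```python
import hashlib

def h(*parts: bytes) -> bytes:
    # SHA-1 digest of the concatenation of byte strings
    return hashlib.sha1(b"".join(parts)).digest()

def build_mh_tree(records):
    # bottom-up levels of digests; len(records) must be a power of two
    level = [h(r) for r in records]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels                     # levels[-1][0] is the (signed) root digest

def make_vo(levels, idx):
    # sibling digests along the path from leaf idx to the root
    vo = []
    for level in levels[:-1]:
        vo.append(level[idx ^ 1])     # sibling at this level
        idx //= 2
    return vo

def verify(record, idx, vo, root):
    # reverse tree traversal: recompute the root from the record and the VO
    digest = h(record)
    for sib in vo:
        digest = h(digest, sib) if idx % 2 == 0 else h(sib, digest)
        idx //= 2
    return digest == root
```

For 16 records and requested leaf index 5 (i.e., P6), make_vo returns four sibling digests, corresponding to h5, h20, h25, and h30 in the example above; tampering with the record makes verification fail.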
2.4 The MR-Tree
The MR-Tree [23] combines the concepts of the MH-Tree and the R*-Tree [1]. A leaf node contains entries elf of the form (pgP, P), where P is an indexed point, and pgP is a pointer to the page accommodating the record of P. An internal node N stores entries ein of the form (pgNc, MBRNc, hNc), where pgNc points to ein's child node Nc. If N is at level 1 (the leaves being at level 0), MBRNc is the minimum bounding rectangle (MBR) of the points in Nc, and hNc is a digest computed on the concatenation of the binary representations of the points in Nc. If N lies at a higher level, MBRNc is the MBR of all the MBR values in Nc, and hNc is the digest of the concatenation of all pairs (MBRi, hi) in Nc. Fig. 3 illustrates a two-dimensional MR-Tree assuming a maximum capacity of three entries per node. A signature sig is produced on the root digest hroot = h(MBRN2 | hN2 | MBRN3 | hN3).
Upon receiving a range query QR, the server performs a depth-first traversal of the MR-Tree, using the algorithm of Fig. 4, to retrieve the set RS of points in QR. Furthermore, it generates a VOR that contains: 1) all the points outside QR that reside in a leaf whose MBR overlaps QR, and 2) a pair (MBRN, hN) for every node N pruned during query processing. In the example of Fig. 3, given the shaded range QR, we have RS = {P2, P3, P7}, and VOR = [[[P1 result result] [MBRN5, hN5]] [[result P8 P9] [MBRN7, hN7]]]. The token result signifies an object in RS, according to the order of appearance in RS. For instance, [P1 result result] corresponds to node N4; the first occurrence of result refers to P2, and the second one to P3. In order to distinguish the type of each element in the VO, MR_Range includes a header prior to each token, digest, and point in the VO. This header consumes 3 bits, which suffice to represent eight different element types. For simplicity, we omit the headers in our presentation, since the element type is implied by its name. The server sends RS, VO, and sig to the client, which first verifies that all points of RS indeed fall in QR. Then, it scans VOR and reconstructs hroot bottom-up (using a process similar to the MH-Tree verification). In our example, the client initially substitutes the first two result tokens with P2 and P3 from RS, and uses P1, P2, and P3 to compute the digest hN4 = h(P1 | P2 | P3) and MBRN4. Then, it utilizes (MBRN4, hN4) and (MBRN5, hN5) to evaluate hN2 = h(MBRN4 | hN4 | MBRN5 | hN5) and MBRN2. Likewise, it subsequently calculates hN3 and MBRN3 from the partial VO [[result P8 P9] [MBRN7, hN7]]. Eventually, it computes hroot = h(MBRN2 | hN2 | MBRN3 | hN3). If hroot matches sig, the client is assured that the points in RS are sound. Furthermore, if all MBR values and points in the VO lie outside QR, then RS is complete. The MR-Tree can also process NN queries [23]. Assume, for instance, a query asking for the three NNs of point Q in
Algorithm MR_Range(QR, N)
1. Append [ to VO
2. For each entry e in N // entries must be enumerated in original order
3.   If N is leaf
4.     If e falls in QR
5.       Insert e into RS and append a result token to VO
6.     Else // e falls out of QR
7.       Append e.P to VO
8.   Else // N is internal node
9.     If e.MBRNc overlaps QR, MR_Range(QR, e.pgNc)
10.    Else append (e.MBRNc, e.hNc) to VO // a pruned child node
11. Append ] to VO
Fig. 4. Algorithm MR_Range.
Figs. 3 and 5 (structure of the running example): root N1 (entries e1, e2) at level 2, nodes N2, N3 (entries e3-e6) at level 1, and leaves at level 0. Entry annotations include e2 = (pgN3, MBRN3, hN3) with hN3 = h(MBRN6 | hN6 | MBRN7 | hN7); in the AMN variant of Fig. 5, e3 = (pgN4, MBRN4, hN4) with hN4 = h(hp1 | hP1 | hp2 | hP2 | hp3 | hP3) and hp1 = h(p1).
Fig. 3. The server: 1) retrieves the kNNs of Q (i.e., P2, P3, and P7) using any conventional algorithm, 2) computes the distance DSTmax of the kth NN (i.e., DSTmax = DST(Q, P7)), and 3) finally executes MR_Range treating (Q, DSTmax) as the range. For this example, the VO is identical to that of QR in Fig. 3, since both the resulting points and the accessed nodes are the same. The verification process of the client is also identical to the one performed for range queries. However, as an adaptation of the R*-Tree, the MR-Tree also suffers from the dimensionality curse [2]. Therefore, the application of the aforementioned method on high-dimensional data has very limited pruning power. Specifically, for numerous dimensions, nearly all leaf nodes must be visited (leading to high server cost); consequently, the majority of points are inserted in the VO (leading to high communication overhead); finally, the client has to verify almost the entire data set.
3 AUTHENTICATED MULTISTEP NN
Our work adopts the GEMINI framework because: 1) it has been proven effective in nonauthenticated similarity retrieval, especially for numerous (i.e., D > 100) dimensions, where even high-dimensional indexes fail;3 2) it can be extended to authenticated query processing based on a low-dimensional ADS, i.e., the MR-Tree, whereas currently there are no authenticated high-dimensional structures; and 3) it is general, i.e., it can also be applied when the expensive distance computations are due to the nature of the distance definition (e.g., network distance), rather than the data dimensionality (in which case D = d).
We assume a client-server architecture, where the server maintains data signed by a trusted authority. There are two versions of the signed data set: a D-dimensional DB and a d-dimensional db (d ≪ D), produced from DB using any dimensionality reduction technique that satisfies the lower bounding property. For instance, DB may be a set of high-dimensional time series and db their low-dimensional representations obtained by DFT. There is a single signature sig, generated by a public-key cryptosystem (e.g., RSA), that captures both DB and db. DST (dst) refers to the distance metric used in the D- (d-) dimensional space. For ease of illustration, we use euclidean distance for both the DST and dst metrics. Nevertheless, the proposed techniques are independent of these metrics, as well as of the underlying dimensionality reduction technique.
The proposed Authenticated Multistep NN (AMN) extends the multistep NN algorithm of [19] to our setting. As opposed to optimizing the processing cost at the server, the major objective of AMN (and of any query authentication technique, in general) is to minimize 1) the network overhead due to the transmission of the VO, and 2) the verification cost at the client (which is assumed to have limited resources compared to the server). Section 3.1 describes the indexing scheme of AMN, while Section 3.2 presents the query processing and verification algorithms. Table 1 summarizes the most important symbols used throughout the paper.
3. High-dimensional indexes for exact NN retrieval, such as the state-of-the-art i-distance [8], are designed for up to about 50 dimensions, whereas we perform experiments with D up to 1,024.
3.1 Indexing Scheme
The server indexes db using an MR-Tree. Since the authentication information should capture both low- and high-dimensional representations, AMN necessitates the following modifications to the structure of the MR-Tree. Each leaf (level 0) entry elf has the form (pgP, p, hP), where p ∈ db is the reduced representation of P ∈ DB, and hP is the digest of the binary representation of P. Pointer pgP points to the disk page(s) storing P. An intermediate MR-Tree node entry ein has the form (pgNc, hNc, MBRNc), where pgNc is a pointer to a child node (let Nc), and MBRNc is the MBR of the points in Nc. The value hNc depends on the level. For level 1, hNc = h(hp1 | hP1 | hp2 | hP2 | ... | hpf | hPf), where hpi (hPi) denotes the digest of pi (resp. Pi) in the ith entry of Nc. At higher levels, hNc is computed as in the traditional MR-Tree. Observe that the digests of both the reduced and the original points in the tree are incorporated into the root digest hroot. The trusted authority generates a signature sig by signing hroot. The server maintains DB, the MR-Tree, and sig. Fig. 5 outlines the indexing scheme of AMN, assuming the data points and node structure of Fig. 3. Note that the proposed techniques are independent of the underlying index. For instance, an authenticated high-dimensional index (if such an ADS existed) would permit higher values of d (compared to the MR-Tree).
3.2 Query Processing and Verification
Let Q be a D-dimensional query point received at the server. The goal of AMN is to return to the corresponding client the kNNs of Q, in a verifiable manner. The server starts processing Q by reducing it to a d-dimensional point q, using the same dimensionality reduction technique as for
Algorithm AMN_server(Q, k) // Q is the query and q its reduced representation
1. Retrieve the k NNs {p1, ..., pk} of q in d-dimensional space
2. Retrieve the respective {P1, ..., Pk} in D-dimensional space
3. Set FH = ∅, RS = {P1, ..., Pk}, and compute DSTmax over RS
4. p = next NN of q in d-dimensional space
5. While (dstmin = dst(q, p)) < DSTmax
6.   Retrieve P // D-dimensional representation of p
7.   If DST(Q, P) > DSTmax, Insert P into FH
8.   Else
9.     Insert P into RS
10.    Remove Pk from RS and insert it into FH
11.    Update DSTmax over RS
12.  p = next NN of q in d-dimensional space
13. VOR = MR_Range((q, DSTmax), root)
14. Send RS, FH, VOR and sig to the client
Fig. 6. Algorithm AMN_server.
Fig. 7. Finding the 1NN. (a) 1st NN candidate. (b) 2nd NN candidate.
DB. Fig. 6 provides the pseudocode for AMN at the server for arbitrary values of k. Initially (Lines 1-4), the algorithm: 1) extracts the k closest points {p1, ..., pk} to q using an incremental NN algorithm (e.g., [7]) on the MR-Tree; 2) retrieves their high-dimensional representations P1, ..., Pk from the disk, following the pointers from the leaf entries of the MR-Tree; 3) initializes a false hit set FH = ∅ and a result set RS = {P1, ..., Pk}, and computes DSTmax, i.e., the DST of the current kth NN Pk; and 4) obtains the next NN (i.e., the (k+1)th NN) point p of q from the tree. The procedure then enters the loop of Lines 5-12, where it computes the distance dstmin = dst(q, p). Observe that dstmin is the minimum distance between q and any point p whose high-dimensional representation has not yet been retrieved. If dstmin > DSTmax, the algorithm terminates. Otherwise, it retrieves the true representation P of p from the disk. If DST(Q, P) > DSTmax, then P is a false hit and is appended to FH; otherwise, P is a candidate result and is inserted into RS. This causes the deletion of the current kth NN Pk from RS, and its insertion into FH. The algorithm updates DSTmax over the new RS, and proceeds to retrieve the next NN of q in the tree. After the NN loop terminates, Line 13 performs the authenticated range query qR = (q, DSTmax), using the most updated value of DSTmax. This produces a VOR that contains 1) a pair (hN, MBRN) for every pruned node N, 2) a pair (hP, p) for every point p in a leaf node that intersects qR, but whose representation P is not in RS ∪ FH, and 3) a result (false_hit) token for every index point whose true point is in RS (FH).
We illustrate AMN using Fig. 7, assuming that k = 1 and d = 2. Lines 1-3 (see Fig. 6) set FH = ∅, RS = {P1}, and DSTmax = DST(Q, P1). Let SR (search region) be the area within distance (dst(q, p1), DSTmax) from q, i.e., the shaded area in Fig. 7a. Only points in SR are candidate results.
The server proceeds by retrieving the next NN p2 of q in db, and its high-dimensional representation P2. Assuming that DST(Q, P2) < DST(Q, P1) (= DSTmax), P2 becomes the new NN of Q and is inserted into RS. Moreover, P1 becomes a false hit, and is moved from RS to FH. The server then updates DSTmax to DST(Q, P2), which leads to the shrinking of the SR,
as shown in Fig. 7b. The next NN p3 of q falls out of the SR, and NN retrieval terminates with RS = {P2} and FH = {P1}. Next, the server performs an authenticated range query qR = (q, DSTmax), where DSTmax = DST(Q, P2). The result of qR contains the low-dimensional representations p1 and p2. Considering that the MR-Tree in Fig. 7 consists of root node N1 and its two children N2 and N3, the VO of qR is VOR = [[result false_hit (p3, hP3)] (hN3, MBRN3)], i.e., it is computed as in the traditional MR-Tree, with two differences: 1) p3 is inserted along with its associated digest hP3 (since this is necessary for computing hroot), and 2) two tokens are used as placeholders, one corresponding to the reduced representation of an actual result (result), and one to that of a false hit (false_hit). Note that it is redundant to provide p1 and p2, because the client can derive them from the P1 and P2 included in RS ∪ FH. Signature sig, RS, FH, and VOR are transmitted to the client. Having RS, FH, and VOR, the client can reconstruct the signed digest hroot using the verification algorithm of the MR-Tree with two alterations: 1) it derives the reduced point p and digest hP of every point P in RS (FH), and substitutes the corresponding result (false_hit) token in VOR with (p, hP); 2) it computes the level-1 digests as in Fig. 5. Fig. 8 provides the pseudocode for the verification algorithm. After reconstructing and matching hroot against sig, the procedure computes DSTmax over RS. It then establishes that the points in FH are indeed false hits. Finally, it confirms that all points and MBRs in VOR (other than the result/false_hit tokens) lie farther than DSTmax from q. In the running example, the client first computes p1 and p2 from P1 and P2, and MBRN2 using p1, p2, and p3. Then, it generates hN2 = h(hp1 | hP1 | hp2 | hP2 | hp3 | hP3), and subsequently hroot = h(MBRN2 | hN2 | MBRN3 | hN3). If hroot matches sig, the client evaluates DSTmax = DST(Q, P2) and ascertains that DST(Q, P1) > DSTmax.
Eventually, it establishes that the distance of p3 from q, as well as the minimum distance between MBRN3 and q, is indeed larger than DSTmax .
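The first client-side alteration, substituting each result/false_hit token with the (p, hP) pair derived from the corresponding point, can be sketched as follows. This is a simplified flat-VO sketch with hypothetical reduce and hash helpers; the real VO is nested, and token order follows the order of appearance in RS and FH.

```python
def substitute_tokens(vo, rs, fh, reduce_fn, h):
    """Replace each 'result'/'false_hit' token in vo with the (p, hP)
    pair derived from the next point of RS/FH, in order of appearance."""
    rs_it, fh_it = iter(rs), iter(fh)
    out = []
    for elem in vo:
        if elem == "result":
            P = next(rs_it)
            out.append((reduce_fn(P), h(P)))   # derive p and hP from P
        elif elem == "false_hit":
            P = next(fh_it)
            out.append((reduce_fn(P), h(P)))
        else:
            out.append(elem)                   # digests, points, MBRs pass through
    return out
```

After substitution, the client holds a fully materialized VO and can reconstruct hroot exactly as in the MH-Tree/MR-Tree verification.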
Algorithm AMN_client(RS, FH, VOR)
1. MR_verify(RS, FH, VOR)
2. Compute DSTmax over RS
3. For each point Pi in FH, Verify DST(Q, Pi) > DSTmax
4. For each point pi (MBRi) in VOR,
5.   Verify dst(q, pi) > DSTmax (mindist(q, MBRi) > DSTmax)
Fig. 8. Algorithm AMN_client.
Fig. 9. Example of C-AMN: the full representations of Q, P1, and P2 (D = 16 values each), their reduced representations q, p1, p2, and the distances dst(q, p1) = 2.2023, dst(q, p2) = 3.6479, DST(Q, P1) = 30.1662, and DST(Q, P2) = 28.443 (= DSTmax).
Fig. 10. The MH-Tree over P1 (root digest h31 = hP1); the components highlighted in gray constitute VOSP1.
Proof of Correctness. We distinguish two cases: 1) The server returns to the client a set RS′, which is derived from the correct RS after substituting a point P with another P′ that does not exist in DB (i.e., P′ is bogus). Since the signature does not incorporate authentication information about P′, the reconstructed hroot does not match sig due to the collision resistance of the hash function, and thus, the client is alarmed. 2) The server returns to the client a set RS′, which is derived from RS after substituting a point P with another P′ that belongs to DB (i.e., P′ is legitimate). Let DSTmax (DST′max) denote the distance between Q and its kth NN in RS (RS′). Since P′ is not a true result, DST′max > DSTmax. The server also generates a VO′R, which verifies RS′, i.e., it authenticates the range q′R = (q, DST′max). Point p lies in range q′R because dst(q, p) ≤ DST(Q, P) ≤ DSTmax < DST′max. Given that P is not in RS′, it must be included in FH (otherwise, the verification of VO′R will fail). Thus, Line 3 of algorithm AMN_client detects the violation and the client is alarmed. □
Since AMN follows the optimal framework of Fig. 1, it is also optimal in terms of DST distance computations. A more important question in our setting regards its performance in terms of the communication overhead and the verification cost at the client. Recall that, along with the result, the client needs a representation of each P ∈ FH in order to verify that P is indeed a false hit. For typical data series applications, D can be up to 1-3 orders of magnitude larger than d. Therefore, FH emerges as the most important performance factor in authenticated NN search, especially for very high-dimensional data. However, as we show next, FH does not have to include the complete representations of the false hits.
4 COMMUNICATION-EFFICIENT AMN
Depending on the dimensionality reduction technique, the values of D, d, and k, and the data set characteristics, there may be numerous false hits in FH, each containing hundreds or thousands (i.e., D) of values. Next, we propose communication-efficient AMN (C-AMN), which decreases the size of the false hits, significantly reducing the transmission and verification cost without compromising the security of AMN. Section 4.1 explains the main concepts of C-AMN, whereas Section 4.2 presents the concrete algorithm for false hit reduction.
4.1 General Framework
We illustrate the main idea of C-AMN through Fig. 9, where 1) Q, P1, P2 are 16-dimensional time series (i.e., D = 16), 2) q, p1, p2 are their two-dimensional representations obtained by taking the first two coefficients of the DFT decomposition (i.e., d = 2), and 3) DST and dst correspond to euclidean distances. The coefficients and distances are computed using the real values of Q, P1, and P2. Since DST(Q, P2) < DST(Q, P1), P2 is the 1NN of Q. Furthermore, dst(q, p2) > dst(q, p1), which signifies that P1 is a false hit; thus, all its 16 values are included in FH by AMN. Let P[i] (1 ≤ i ≤ D) be the ith value of P (e.g., P1[2] = 6; see Fig. 9), and SP be an ordered subset of the values of P, e.g., SP1 = (P1[1], P1[6], P1[15]) = (25, 20, 13). SQ1 contains the values of Q in the same positions as SP1, e.g., SQ1 = (Q[1], Q[6], Q[15]) = (5, 40, 3). Then, DST(SQ1, SP1) = 28.4605 > DST(Q, P2) = 28.443. Thus, instead of the complete P1, it suffices for the server to include in FH only SP1. The client computes SQ1 from Q, establishes that DST(Q, P2) < DST(SQ1, SP1) ≤ DST(Q, P1), and confirms P1 as a false hit. Assuming that each value has size Sv = 8 bytes, SP1 consumes 24 bytes, as opposed to 128 bytes for P1. By sending SP1 to the client, the server reduces the communication cost significantly (this is the largest possible reduction for the current example). However, it must also prove that SP1 is not falsified. We next demonstrate a solution to this problem, assuming, for simplicity, that D is a power of 2. The server maintains an MH-Tree over P1, as shown in Fig. 10. Recall that in AMN, the digest hP1 included in the MR-Tree leaf accommodating p1 is computed on the concatenation of the binary representations of the values of P1, i.e., hP1 = h(P1[1] | P1[2] | ... | P1[16]). Instead, in C-AMN, hP1 is the digest h31 in the root of the MH-Tree, which summarizes authentication information about the entire P1. The rest of the MR-Tree structure remains the same.
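The subset-distance argument relies on the fact that, for euclidean distance, summing squared differences over a subset of positions can only lower-bound the full sum, i.e., DST(SQ, SP) ≤ DST(Q, P). A minimal sketch of the client-side check (the names are ours; SP is modeled as a mapping from positions to values of P):

```python
import math

def subset_dst(Q, SP):
    """Euclidean distance restricted to the positions in SP; since every
    omitted term is nonnegative, this lower-bounds DST(Q, P)."""
    return math.sqrt(sum((Q[i] - v) ** 2 for i, v in SP.items()))

def confirms_false_hit(Q, SP, DSTmax):
    # if even the subset distance exceeds DSTmax, P cannot be a result
    return subset_dst(Q, SP) > DSTmax
```

The server's task is thus to pick a subset SP whose distance to the corresponding SQ already exceeds DSTmax, while keeping SP (and its verification information) small.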
Returning to our running example, after producing SP1, the server constructs a VO (denoted as VOSP1) that can verify SP1. VOSP1 contains the necessary components for reconstructing hP1. These components are highlighted in dark gray in Fig. 10: VOSP1 = [25 h2 h18 h5 20 h20 h27 h23 13 h16] (scope tokens within the brackets are omitted for brevity). Note that the positions4 of the SP1 values can be computed based on the structure of VOSP1. For example, prior to 20 in VOSP1, there is one value (25), two level-0 digests (h2, h5), and a level-1 digest (h18). This implies that 20 is the 6th value in P1.
4. The positions of the P1 values used in SP1 are necessary so that the client can compute SQ1 from Q.
Given VO_SP1, the client produces hP1 by performing successive hash operations on the contents of corresponding [ and ] tokens, as explained in Section 2.3. Finally, recall that the digest of p1, hp1, is needed for computing the level-1 digest of N2: hN2 = h(hp1 | hP1 | hp2 | hP2 | hp3 | hP3). In AMN, the client could compute hp1 by first reducing P1 (fully included in FH) to p1, and then calculating hp1 = h(p1). In C-AMN, however, the client does not have the entire P1, and it cannot generate p1 from SP1. Therefore, the server must send hp1 to the client as well.

The additional authentication information increases the communication overhead. Continuing the example, and assuming that a digest has size Sh = 20 bytes (a typical value), VO_SP1 and hp1 consume a total of 184 bytes, which is larger than P1 (128 bytes). This is due to the fact that the values of SP1 are dispersed in the MH-Tree, causing the insertion of numerous digests in VO_SP1. Consider now that we modify SP1 to include values P1[1]-P1[8]. In this case, DST(SQ1, SP1) = 28.46 > DSTmax, and VO_SP1 = [25, 6, 10, 4, 7, 20, 2, 5, h30] contains a single digest, h30 (light gray, see Fig. 10). The total cost is now 106 bytes, which is actually the lowest possible in this example. Note that, for simplicity, we omit the headers of the VO elements (see Section 2.4) in the size computations, because their space consumption is negligible.

Summarizing, C-AMN aims at replacing each false hit P with a verifiable representation SP that consumes less space. C-AMN necessitates some modifications over the AMN indexing scheme: 1) for every P ∈ DB, the server maintains an MH-Tree, and 2) each digest hP at the leaf level of the MR-Tree is set to the root digest of the MH-Tree of P. Query processing at the server proceeds as in Fig.
6, but after computing RS, FH, and VO_R, the server calls a function ReduceFH to replace every false hit P with a pair (VO_SP, hp), where VO_SP contains the subset SP of P that proves that P is a false hit, along with verification information, and hp is the digest of P's indexed point p. Verification at the client is similar to AMN_client (see Fig. 8), with the following alterations: 1) MR_verify computes, for every pair (VO_SP, hp) ∈ FH, the digest hP of the corresponding false hit P, simply simulating the initial calculation of hroot in the MH-Tree of P. 2) For every VO_SP in FH, Line 3 extracts SP from VO_SP, computes the respective SQ of Q, and verifies that DST(SQ, SP) > DSTmax. The proof of correctness of C-AMN is identical to that in AMN, given that there is no way for the server to falsify SP (otherwise, the reconstructed hroot would not match sig due to the collision-resistance property of the hash function).
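The MH-Tree machinery (construction at the server, VO extraction, root reconstruction at the client) can be sketched as follows. This is a simplified stand-alone model, not the paper's implementation: the value encoding and the (level, kind, payload) VO entries are illustrative, and D is assumed to be a power of 2:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def leaf(v) -> bytes:
    return h(str(v).encode())

def build_mh_tree(values):
    """Bottom-up Merkle hash tree over the D values (D a power of 2).
    Returns levels[0] = leaf digests, ..., levels[-1] = [root digest]."""
    level = [leaf(v) for v in values]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def make_vo(values, levels, revealed):
    """Server side: emit the revealed values, and one digest for every
    maximal subtree that contains no revealed position."""
    def rec(lvl, idx):
        first = idx << lvl
        if not any(first <= i < first + (1 << lvl) for i in revealed):
            return [(lvl, 'digest', levels[lvl][idx])]
        if lvl == 0:
            return [(0, 'value', values[idx])]
        return rec(lvl - 1, 2 * idx) + rec(lvl - 1, 2 * idx + 1)
    return rec(len(levels) - 1, 0)

def root_from_vo(vo, height):
    """Client side: recompute the root digest from the VO stream alone."""
    pos = 0
    def rec(lvl):
        nonlocal pos
        e_lvl, kind, payload = vo[pos]
        if e_lvl == lvl:                       # entry covers this whole subtree
            pos += 1
            return leaf(payload) if kind == 'value' else payload
        return h(rec(lvl - 1) + rec(lvl - 1))  # descend into both children
    return rec(height)

# Toy example with D = 8: the server reveals only positions 0 and 5.
P = [25, 6, 10, 4, 7, 20, 2, 5]
levels = build_mh_tree(P)
vo = make_vo(P, levels, revealed={0, 5})
assert root_from_vo(vo, height=3) == levels[-1][0]   # root matches: SP genuine
```

Each VO entry either reveals a value of SP or summarizes an untouched subtree by its digest, which is exactly why the client can recover the positions of the revealed values, and the root digest, from the VO structure alone; any tampering with a revealed value changes the reconstructed root.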
4.2 False Hit Reduction Algorithm

Ideally, for each false hit P, ReduceFH should derive the subset SP with the minimum length. Intuitively, this task is at least as difficult as the Knapsack Problem: we need to select a subset of items (SP of P values), each assigned a cost (communication overhead) and a weight (distance DST(SQ, SP)), such that the sum of costs is minimized and the sum of weights exceeds DSTmax. An additional complication is that when we select one item, the cost of the rest changes (i.e., unlike knapsack, where the costs are fixed).

Theorem 1. Determining the optimal subset SP of P, which leads to the minimal VO_SP under the constraint that DST(SQ, SP) > DSTmax, is NP-hard.

Proof. This can be proven by a straightforward polynomial reduction from the NP-hard Precedence Constraint Knapsack Problem [9]. ∎

We next propose a greedy version of ReduceFH. Let dist_i = (P[i] - Q[i])^2 be the contribution of P[i] to DST(SQ, SP), and comm_i the additional cost if P[i] is inserted in the current VO_SP. To assist the iterative decisions of our greedy algorithm, we assign to each unselected value P[i] a benefit B_i = dist_i / comm_i that captures the potential gain if P[i] is included in SP. For example, in Fig. 10, suppose that SP1 = (P1[6]); hence, VO_SP1 = [h25, h5, 20, h20, h30]. If we insert P1[1] in SP1, the new VO_SP1 verifying SP1 = (P1[1], P1[6]) becomes [25, h2, h18, h5, 20, h20, h30]. The increase of VO_SP1 due to the insertion of P1[1] is comm_1 = Sh + Sv = 28 (the value 25 is added, and h2, h18 substitute h25 in VO_SP1).

At each step, the unselected P[i] with the highest benefit B_i is greedily inserted in SP. Intuitively, we prefer values with large dist (to satisfy DST(SQ, SP) > DSTmax with as few values as possible) and small comm (to keep VO_SP as small as possible). After inserting P[i] in SP, we set B_i = 0 (so that it cannot be selected again). We also update (if necessary) the benefits of the yet unselected P values, since comm depends on the VO_SP verifying the current SP. The algorithm terminates when the condition DST(SQ, SP) > DSTmax is satisfied.

The most expensive operation is the update of the benefits at each step. A naive algorithm would take O(D^2) time overall; upon each selection, it would scan all unselected P values and update their benefits as necessary. We next present an implementation with complexity O(D log D). We demonstrate the process on P1 of our running example using Fig. 11. The algorithm initially sets SP1 = ∅ and VO_SP1 = ∅. Moreover, it assigns comm_i = 4·Sh + Sv = 88 for all P1[i], since inserting any P1[i] in VO_SP1 would cause its increase by four digests and one value. Subsequently, it calculates the benefit vector B of P1, and constructs an auxiliary binary tree on B, where 1) each leaf node ni stores B[i], and 2) inner nodes are created bottom-up, storing the maximum value in their subtree.

In the first step, ReduceFH performs a root-to-leaf traversal of the tree, always visiting the node with the maximum value. In Fig. 11a, the leaves with the maximum benefits are n1 and n6, because they have the largest distances to Q[1] and Q[6], respectively. Suppose that the algorithm (randomly) visits n6; it inserts P1[6] = 20 into SP1, and updates the value of n6 to 0 (so that n6 will not be selected in the future). VO_SP1 becomes [h25, h5, 20, h20, h30], and DST(SQ1, SP1) = 20 < DSTmax. Therefore, the algorithm continues. The next action regards the updating of the benefits in the rest of the tree. Observe that, if after the inclusion of P1[6] in SP1, we also insert P1[5], digest h5 in VO_SP1 will be replaced by a value. Thus, the size of VO_SP1 will decrease, because Sh (20 bytes) is larger than Sv (8 bytes). In general, depending on the relative sizes of Sh and Sv, there is a cutoff level (cl) such that, if a value P[i] is inserted in SP, all the benefits in the same subtree as ni rooted at cl will become negative, because of the reduction of the communication cost. To include the corresponding values in SP, we set their respective benefits to ∞. In our example, cl = 2, and the
Fig. 11. Using the auxiliary tree of ReduceFH. (a) Choosing to include P1 6 in SP1 . (b) After inserting P1 6. (c) After inserting P1 5, P1 7, P1 8, and P1 1 in SP1 .
selection of P1[6] will assign ∞ to the benefits of nodes n5, n7, n8, n19, n20, n26, n29, n31. The benefits of the remaining nodes are updated as shown in Fig. 11b. In the three next steps, ReduceFH follows the nodes with ∞ benefit and inserts P1[5], P1[7], and P1[8] into SP1, updating their corresponding benefits (as well as the values of their ancestors) to 0. Observe that these insertions do not influence the benefits of the nodes in the subtrees rooted at n25 and n30. The general rule is that comm_i changes only if the insertion of a P[j] into SP causes the substitution (in VO_SP) of a digest that was taken into account in comm_i. Note that each such digest corresponds to a subtree of the auxiliary tree in ReduceFH, since the latter has a one-to-one correspondence with the MH-Tree of P. For example, in Fig. 11b (i.e., after P1[6] has been inserted in SP1), comm_9 considers digests h10, h22, and h28, corresponding to nodes n10, n22, and n28, respectively. After the insertion of P1[5], comm_9 still involves the above digests, and thus, does not change.

ReduceFH marks every node of the tree visited during the selection of the next P1 value. The marked nodes in Fig. 11 are highlighted in gray. A marked node signifies that there is at least one leaf ni in its subtree such that P1[i] is in SP1. In each new traversal, if a visited node n is not marked, then the benefits in the subtree of n's sibling must be updated. For instance, in Fig. 11b, comm_9 takes into account h28, corresponding to n28, which is not marked. If n28 is visited, the benefit of n9 (i.e., of P1[9]), and of all leaves in the subtree rooted at n27, will change.

Returning to our example, at this point, SP1 = (P1[6], P1[5], P1[7], P1[8]) = (20, 7, 2, 5), and DST(SQ1, SP1) = 20.224 < DSTmax. In the next step, the procedure inserts P1[1] = 25 into SP1 (it has the largest benefit), and updates the benefits of n2, n3, n4, n18, n25, n29, n31 to ∞, as shown in Fig. 11c.
Subsequently, P1[2], P1[3], P1[4] are inserted because their benefit is ∞. The resulting SP1 = (P1[6], P1[5], P1[7], P1[8], P1[1], P1[2], P1[3], P1[4]) = (20, 7, 2, 5, 25, 6, 10, 4) satisfies the constraint DST(SQ1, SP1) = 28.461 > DSTmax. Its corresponding VO is equal to VO_SP1 = [25, 6, 10, 4, 7, 20, 2, 5, h30]. Observe that SP1 and VO_SP1 are the same as the ones we constructed for Fig. 9, and are optimal for this example.

Fig. 12 provides the generalized pseudocode of ReduceFH. Function ConstructTree in Line 2 computes the initial benefit vector B of P and builds a binary tree over it. The loop in Lines 3-5 first calls function FindNextValue, which retrieves the next value P[v] with the maximum benefit. It also updates the benefits of the affected nodes. As soon as the condition in Line 3 is satisfied, Lines 6-8 include the values with benefit ∞ into SP as well. ReduceFH eventually returns SP.

Complexity analysis. The construction of the tree takes O(D) time. Furthermore, the loops in Lines 3-5 and 6-8 involve O(D) steps. FindNextValue involves two operations: 1) finding the value with the maximum benefit, which takes O(log D) time, and 2) updating the benefits of the affected nodes. Instead of calculating the latter cost in each step of FindNextValue, we compute it collectively over the entire execution of ReduceFH. We mentioned that the benefit of a node ni is updated only when the subtree corresponding to a digest involved in comm_i is marked for the first time. Since the maximum number of digests entailed in comm_i is log D, the total time required for updating all values is O(D log D), which is also the complexity of ReduceFH for each false hit.
5 AMN IN DISTRIBUTED SERVERS
In this setting, we assume that the database is horizontally partitioned and distributed over m > 1 servers. Specifically, each server S^i stores a subset DB^i such that DB^1 ∪ ... ∪ DB^m = DB and DB^i ∩ DB^j = ∅, for all 1 ≤ i ≠ j ≤ m. In addition, S^i maintains an MR-Tree on the corresponding reduced data set db^i, which is signed by a signature sig_i. A query result comprises the kNNs over all servers. Minimization of transmissions (of the high-dimensional data) is particularly important for this setting, especially for large values of m. Section 5.1 presents SD-AMN (short for simple distributed AMN), used as a benchmark in our experimental evaluation. Section 5.2 proposes ID-AMN
Fig. 13. Distributed authenticated kNN processing. (a) Distances to query. (b) ID-AMN after 1st round. (c) ID-AMN after 2nd round. (d) ID-AMN after 3rd round.

(short for incremental distributed AMN), a more elaborate method, which quickly eliminates servers that cannot contribute results.

5.1 Simple Distributed AMN

In SD-AMN, a client sends its kNN query Q to all servers. Each server S^i retrieves the partial result RS^i on the local DB^i using the conventional multistep algorithm of Fig. 1, and generates a vector kDST^i with the distance values of the kNN set RS^i in S^i. The client collects the vectors from the servers and determines the global kth nearest distance DSTmax over all m·k collected distances. Then, it transmits a range query qR = (q, DSTmax). Each server S^i executes qR using its MR-Tree and returns VO_R^i, RS^i, and FH^i. VO_R^i has the same meaning as in centralized processing, i.e., it is the VO of qR. RS^i (resp. FH^i) is the set of results (resp. false hits), i.e., points of db^i that fall in qR and whose high-dimensional representations have distance from Q smaller (resp. larger) than DSTmax. The size of FH can be reduced through C-AMN.

We demonstrate SD-AMN using the example of Fig. 13a, assuming a 3NN query Q. There are four servers, each containing four points. For ease of illustration, we sort these points on their distance to Q and display their DST, e.g., P1^2 is the first point in S^2 and DST(Q, P1^2) = 3. The diagram also includes the distances in the reduced space, e.g., dst(q, p1^2) = 2. Given kDST^1 = (1, 2, 5), kDST^2 = (3, 6, 12), kDST^3 = (5, 7, 10), and kDST^4 = (7, 8, 9), the client computes DSTmax = 3 (the first two NNs are in S^1, and the third one in S^2), and transmits the range qR = (q, 3). S^1 returns VO_R^1, RS^1 = {P1^1, P2^1}, and FH^1 = {P3^1, P4^1}; P3^1 and P4^1 are necessary in order for the client to establish that they are indeed false hits. At the other extreme, S^4 returns VO_R^4 and RS^4 = FH^4 = ∅, as it does not contain results or false hits (for all points in S^4, their dst from q exceeds 3). If each VO_R^i is verified successfully, and for every point P in each FH^i it holds that DST(Q, P) ≥ DSTmax, then the client is assured that RS is correct.

Proof of Correctness. In SD-AMN, the client computes a DSTmax (e.g., 3, see Fig. 13a) such that the range (Q, DSTmax) contains exactly k high-dimensional points in DB^1 ∪ ... ∪ DB^m. At the last step of SD-AMN, all servers perform a verifiable range query qR = (q, DSTmax), during which they cannot cheat. Thus, they can only misreport the distance vectors kDST^i, leading the client to compute a DST'max that is different from the real DSTmax. We distinguish two cases with respect to the relationship between DSTmax and DST'max. If DST'max > DSTmax, the client will obtain all results (possibly, in addition to more points) through the verifiable range (q, DST'max), and will detect the discrepancy between DST'max and DSTmax. If DST'max < DSTmax, the client receives fewer than k objects in RS^1 ∪ ... ∪ RS^m and is alarmed. Therefore, it can always establish result correctness. ∎

As an example, suppose that in Fig. 13a, S^2 misreports DST(Q, P1^2) as 4 (instead of 3). Consequently, the client computes the distance to its 3rd NN as DST'max = 4 > DSTmax, and receives a VO for range (q, 4) from every server. S^2 will then have to return P1^2 as a result, and the difference between the reported and actual DST(Q, P1^2) will be revealed. For the second case, consider that server S^2 misreports DST(Q, P1^2) as 2, leading the client to compute DST'max = 2 < DSTmax. Upon receiving the VOs for range (q, 2) from all servers, it discovers that there are only two points (P1^1, P2^1) in RS^1 ∪ ... ∪ RS^4. Note that servers can misreport distances of points in DB − RS − FH without being caught, provided that the false DSTs reported are larger than DSTmax. In the example of Fig. 13a, there are no low-dimensional objects of S^4 within (q, 3). Therefore, S^4 can send to the client any vector that contains distances larger than 3, because the verification of (q, 3) does not involve the transmission of any point (RS^4 = FH^4 = ∅). Clearly, this type of false values does not affect the result.
5.2 Incremental Distributed AMN

SD-AMN is optimal in terms of high-dimensional point transmissions, because the client receives D-dimensional representations only for points in qR. All these points (results and false hits) are necessary to establish correctness anyway. However, it must transmit Q to all servers. Moreover, each server S^i has to compute RS^i, although none of the points of RS^i may participate in the global result (e.g., S^4, see Fig. 13). ID-AMN avoids these problems by gradually eliminating servers that cannot contribute results. Specifically, ID-AMN incrementally retrieves distance values from servers to compute the final DSTmax, postponing local NN computations at the servers until they are required.

We present the pseudocode of ID-AMN (at the client) in Fig. 14, and explain its functionality by continuing the example of Fig. 13 (k = 3). Initially, the client marks each server S^i as a candidate, and transmits q to S^i, which responds with two values: the distance dst_i (kdst_i) of its 1st (kth) NN in the low-dimensional space. For instance, in Fig. 13, the client receives (1, 2), (2, 11), (2, 9), (5, 7) from S^1, S^2, S^3, S^4, respectively. Intuitively, a low kdst_i implies a promising server that may contain k good results. DST_i and dst_i are used for server pruning and selection, to be discussed shortly. Let S^j be the server (e.g., S^1) with the minimum kdst_j. The client directs Q to S^j and obtains a vector kDST with the distance values of the kNN set RS^j in S^j. DSTmax is set to the kth distance in kDST, and S^j ceases to be a candidate. Continuing the example,
Algorithm ID-AMN_client (Q, k)
1.  For each server S^i
2.      Set Candidate[i] = 1
3.      (dst_i, kdst_i) = get_smallest_dist(q, S^i)
4.      DST_i = dst_i
5.  Let S^j be the server with the minimum kdst_j
6.  Set vector kDST = get_k_smallest_DSTs(Q, S^j)
7.  Set DSTmax = maximum value in kDST; Set Candidate[j] = 0
8.  While there are candidate servers
9.      For each server S^i
10.         If DST_i >= DSTmax, Set Candidate[i] = 0
11.     Select candidate server S^i with minimum DST_i
12.     Set DSTnew = get_next_smallest_DST(Q, S^i) from server S^i
13.     If DSTnew >= DSTmax, Set Candidate[i] = 0
14.     Else // DSTnew < DSTmax
15.         Insert DSTnew into kDST (keeping the k smallest values)
16.         Set DSTmax = maximum value in kDST
17.         DST_i = DSTnew
18. For each server S^i
19.     (VO_R^i, RS^i, FH^i) = MR_Range((q, DSTmax), root_i)
20.     Verify(VO_R^i) and incorporate RS^i into RS
kDST = (1, 2, 5), and DSTmax = 5. Fig. 13b illustrates the server-to-client transmissions during these steps. The while loop (Lines 8-17) starts by eliminating each server such that DST_i ≥ DSTmax (initially, DST_i = dst_i). For instance, DST_4 = 7 ≥ DSTmax = 5, and the client discards S^4 without sending Q. Since the subsequent verification of S^4 does not require Q either, there is no transmission of high-dimensional data (query, or points) between the client and S^4. Line 11 selects the candidate server S^i with the minimum DST_i, and asks for the distance DSTnew of the next NN in S^i. If DSTnew ≥ DSTmax, S^i is purged. Assuming that the selected server is S^3 (DST_3 = DST_2 = 2), then DST(Q, P1^3) = 5 ≥ DSTmax = 5, causing the elimination of S^3 without changing kDST. Fig. 13c shows the pruned servers S^1, S^3, S^4 in gray. The next iteration of the loop selects the last candidate S^2, and retrieves DSTnew = 3. Since 3 < DSTmax, DSTnew is inserted into kDST, and DSTmax changes to 3. The loop terminates because all servers have been eliminated (see Fig. 13d). Lines 18-20 simply verify the range qR = (q, DSTmax) in each server. All the result points (RS^i), as well as the false hits (FH^i), are transmitted during this step. The client generates the final result RS locally from the union of all RS^i. C-AMN can be applied to reduce the size of the false hits. Note that Line 12 may call get_next_smallest_DST multiple times on the same server S^i. In this case, the client needs to transmit the full query Q only the first time; for subsequent requests, it suffices to send the query ID.

Proof of Correctness. The client obtains all results and false hits at the end through the verifiable range (Lines 18-20). As shown in the proof of SD-AMN, any DST misreporting that leads to the computation of a DST'max ≠ DSTmax can be detected by the client. Let us now consider that some server S^i sends false dst_i and kdst_i.
The value of kdst_i is only used as an estimator for the selection of the initial server (Line 5), and it only affects the efficiency (but not the correctness) of the algorithm. For instance, if S^3 reports kdst_3 = 1 (instead of 9), it will become the initial
server, increasing the communication overhead (S^4 cannot be immediately eliminated), without however altering the result. Moreover, as discussed in Section 5.1, any false distance smaller than DSTmax will be caught by the verification. Similarly, dst_i is used as a lower bound for DST_i. If S^i sends a value of dst_i lower than the actual one, it can only be selected earlier than necessary during the while loop, without affecting correctness. On the other hand, if the transmitted dst_i exceeds the actual one, 1) S^i is selected later during the loop, or 2) it is eliminated altogether if the reported dst_i exceeds DSTmax. Case 1) only affects the efficiency, whereas case 2) is detected during the verification, because S^i has to send objects within the range qR = (q, DSTmax). ∎

Similar to SD-AMN, ID-AMN is optimal in terms of data point transmissions. Now let us consider query transmissions. Ideally, an optimal algorithm would send the complete query Q only to servers that contribute nearest neighbors (S^1, S^2, see Fig. 13), because the distances of these NNs are necessary for obtaining the final value of DSTmax. Additionally, ID-AMN sends Q to some false candidates (e.g., S^3) that cannot be eliminated. Specifically, let DST_1^i be the distance of the 1st NN in S^i. S^i is a false candidate if dst_i < DSTmax < DST_1^i. Regarding the CPU cost, in ID-AMN some eliminated servers (e.g., S^4) do not perform any high-dimensional distance computations. False candidates return a single DST_1^i, thus they only need to retrieve the first local NN (which may involve multiple DST computations, if there are false hits). The rest of the servers must retrieve either k_i or (k_i + 1) NNs, where k_i ≤ k is the contribution of S^i to the final kNN set. The distance of the additional (+1) NN is sometimes required to eliminate S^i. In comparison, SD-AMN transmits Q to all servers, and each server performs the necessary computations to obtain the k local NNs.
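The client-side loop of ID-AMN can be sketched with mocked servers. The Server stub and its distance lists are illustrative (the fourth distance of each server is invented, and the low-dimensional bounds dst_i, kdst_i are approximated by the true distances, so the pruning pattern differs slightly from the paper's run); real servers would answer the probes remotely and attach VOs:

```python
class Server:
    """Mock server holding the sorted DST values of its local points.
    A real deployment would answer these probes remotely and attach VOs."""
    def __init__(self, dists):
        self.dists = sorted(dists)
        self.cursor = 0

    def smallest_dists(self, k):
        # (dst_i, kdst_i) stand-in: the low-dimensional bounds are
        # approximated here by the true distances for brevity
        return self.dists[0], self.dists[min(k, len(self.dists)) - 1]

    def next_smallest_dst(self):
        d = self.dists[self.cursor]
        self.cursor += 1
        return d

def id_amn_client(servers, k):
    """Incremental elimination: returns the global kth NN distance DSTmax."""
    # Step 1: cheap probes (dst_i, kdst_i) from every server
    info = {i: s.smallest_dists(k) for i, s in enumerate(servers)}
    # Step 2: fetch the k local NN distances of the most promising server
    j = min(info, key=lambda i: info[i][1])
    kDST = sorted(servers[j].next_smallest_dst() for _ in range(k))
    dst_max = kDST[-1]
    candidate = {i: info[i][0] for i in info if i != j}  # DST_i lower bounds
    while candidate:
        # prune servers whose lower bound already exceeds DSTmax (Line 10)
        candidate = {i: d for i, d in candidate.items() if d < dst_max}
        if not candidate:
            break
        i = min(candidate, key=candidate.get)        # Line 11
        d_new = servers[i].next_smallest_dst()       # Line 12
        if d_new >= dst_max:
            del candidate[i]                         # Line 13
        else:
            kDST = sorted(kDST + [d_new])[:k]        # Lines 15-16
            dst_max = kDST[-1]
            candidate[i] = d_new                     # Line 17
    return dst_max

# Fig. 13-style example: four servers, 3NN query; DSTmax converges to 3.
servers = [Server([1, 2, 5, 6]), Server([3, 6, 12, 15]),
           Server([5, 7, 10, 11]), Server([7, 8, 9, 13])]
assert id_amn_client(servers, k=3) == 3
```

Note how the loop refines DSTmax monotonically downward: each probe either tightens the global threshold or eliminates a server, so servers whose lower bound never drops below the threshold are never sent the high-dimensional query.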
6 EXPERIMENTAL EVALUATION
We use four real data sets that capture different combinations of dimensionality D, cardinality N, and application domain: 1) Corel (D = 64, N = 68,040), 2) Chlorine (D = 128, N = 4,310), 3) Stock (D = 512, N = 10,000), and 4) Mallat (D = 1,024, N = 2,400). Corel5 can be downloaded from archive.ics.uci.edu/ml/, while the rest are available at www.cs.ucr.edu/~eamonn/time_series_data/. We decrease the dimensionality of each data set using Chebyshev polynomials [4]. The value of d is a parameter with range [2-16] and default value 8. Each reduced data set is indexed by an MR-Tree using a page size of 4 KB. Every digest is created by SHA-1 [16]. We assume that both DST and dst are based on the Euclidean distance. Section 6.1 compares AMN and C-AMN considering a single server. Section 6.2 evaluates SD-AMN and ID-AMN assuming multiple servers.
5. Corel has four attribute sets. We use the first two sets (Color Histogram and Color Histogram Layout, of 32 attributes each) to derive the 64-D data set used in the experiments.
Fig. 15. Communication overhead versus d (k 8). (a) Corel (D 64; N 68;040). (b) Chlorine (D 128; N 4;310). (c) Stock (D 512; N 10;000). (d) Mallat (D 1;024; N 2;400).
Fig. 16. Communication overhead versus k (d 8). (a) Corel (D 64; N 68;040). (b) Chlorine (D 128; N 4;310). (c) Stock (D 512; N 10;000). (d) Mallat (D 1;024; N 2;400).
6.1 Single Server

The measures of interest are the communication overhead, and the CPU cost at the server and the client. We assess the communication overhead based on the verification information sent to the client. The transmission of the query and the result is omitted, because it is necessary in any method. The CPU cost is measured in terms of elementary distance computations. Specifically, D elementary computations are required to derive the Euclidean distance of two D-dimensional points. We exclude the I/O cost at the server, because it is identical for both AMN and C-AMN (and similar to that of the conventional multistep algorithm): in any case, we have to retrieve the low-dimensional NNs using the MR-Tree. For each experiment, we select a random data point as the query, and report the average results over 10 queries.

Fig. 15 fixes the number k of NNs to 8, and investigates the effect of d on the communication overhead of the verification information. Specifically, the overhead is measured in Mbytes, assuming that each value consumes Sv = 8 bytes (a double-precision number) and each digest Sh = 20 bytes (the typical size for SHA-1). We indicate the number |FH| of false hits below the x-axis. As d increases, |FH| drops, because the reduced representation captures more information about the corresponding point. In all cases, C-AMN leads to a significant decline of the overhead. The savings grow with D, and exceed an order of magnitude for Mallat, because long series provide more optimization opportunities. On the other hand, the gains decrease as d grows, due to the smaller FH. In order to demonstrate the effect of the false hits, we include inside each column of the diagrams the contribution of FH as a percentage of the total overhead. For high D and low d, FH constitutes the dominant factor, especially for AMN (e.g., at least 98 percent in Mallat), corroborating the importance of C-AMN.
The absolute overhead is lower (in both AMN and C-AMN) for high values of d, due to the decrease of |FH|. The exception is Corel, where the communication cost actually grows when d exceeds a threshold (8 for AMN, 4 for C-AMN). This is explained as follows: a typical record (i.e., image) in Corel has very low values (< 0.005) on most (> 60) dimensions, and relatively large values (> 0.1) on the rest. Furthermore, the large values of different records usually concentrate on different dimensions. Dimensionality reduction using Chebyshev polynomials [4] captures those important dimensions effectively, even for low d. Consequently, there is a small number of false hits (for d = 2, |FH| ≈ 0.28% of N, whereas in the other data sets |FH| is 50-75 percent of N). As d grows, |FH| does not drop significantly. On the other hand, the verification information transmitted to the client contains more boundary records and node MBRs, increasing the VO size.

Fig. 16 illustrates the communication overhead as a function of the number of neighbors k to be retrieved (d = 8). Note that the minimum value of k is 2, because the query is also a data point, i.e., its first NN is itself. The number of false hits increases with k, leading to a higher transmission cost. However, compared to Fig. 15, the difference is rather small, because k has a lower impact on |FH| than d. For the same reason, the absolute performance of both methods is rather insensitive to k. The benefits of C-AMN are again evident, especially if D ≥ 128.

Fig. 17 investigates the elementary distance computations at the server as a function of d. C-AMN is always more expensive than AMN, due to the additional cost of ReduceFH. This cost drops with increasing d, because there are fewer false hits to be reduced. The numbers inside each column denote the CPU time (in milliseconds) using an Intel Core 2 Duo 2.33 GHz processor. For Corel, the CPU time is too low to be accurately measured. However, for Chlorine, Stock, and Mallat it may reach several seconds, due to the large values of D and |FH|.

Fig. 18 shows the server computations versus the number of required neighbors. The cost increases slightly with k, but similar to Fig. 16, the effect is not as pronounced as that of d. Note that the diagrams do not include the I/O cost, which is identical for both methods.
I/O operations normally dominate the processing overhead (since large records must be retrieved from the disk), so the performance difference between the two methods in terms of the overall cost diminishes. Moreover, the difference between C-AMN and AMN would decrease further if DST were based on a more
Fig. 17. Server computations versus d (k 8). (a) Corel (D 64; N 68;040). (b) Chlorine (D 128; N 4;310). (c) Stock (D 512; N 10;000). (d) Mallat (D 1;024; N 2;400).
Fig. 18. Server computations versus k (d 8). (a) Corel (D 64; N 68;040). (b) Chlorine (D 128; N 4;310). (c) Stock (D 512; N 10;000). (d) Mallat (D 1;024; N 2;400).
Fig. 19. Client computations versus d (k 8). (a) Corel (D 64; N 68;040). (b) Chlorine (D 128; N 4;310). (c) Stock (D 512; N 10;000). (d) Mallat (D 1;024; N 2;400).
Fig. 20. Client computations versus k (d 8). (a) Corel (D 64; N 68;040). (b) Chlorine (D 128; N 4;310). (c) Stock (D 512; N 10;000). (d) Mallat (D 1;024; N 2;400).
expensive distance function than dst (e.g., DTW versus the Euclidean distance, as in [10]) and we applied the optimization of Section 4.3. This is because ReduceFH entails only cheap dst computations, which would be dominated by the more expensive DST calculations, common to both methods.

Fig. 19 illustrates the number of elementary distance computations at the client as a function of d. C-AMN leads to significant gains, sometimes dropping the processing cost by more than an order of magnitude. Since this cost is proportional to the amount of data received by the client, the diagrams are correlated with those in Fig. 15; accordingly, the benefits of C-AMN are more significant for settings that involve large values of D and |FH|. Fig. 20 investigates the effect of k on the client. Similar to Figs. 16 and 18, the CPU cost increases with k, but the impact of k is rather small.

Summarizing, compared to AMN, C-AMN imposes additional CPU cost on the server, in order to reduce the communication overhead and the verification effort at the client. This is a desirable tradeoff in client-server architectures, because 1) data transmissions constitute the main bottleneck in most applications, especially those involving wireless networks, and 2) clients are assumed to have limited resources, whereas servers are powerful. Finally,
note that the transmission overhead can also be reduced by conventional compression techniques. We do not include experiments with this approach since it benefits both AMN and C-AMN. Moreover, it increases the computational burden of the client, which has to decompress the data before their verification.
6.2 Distributed Servers

Next, we compare SD-AMN and ID-AMN, considering that the database is horizontally partitioned over m servers. Recall that both methods first collect distance information, based on which they determine the range that contains the result. The NNs and the false hits are obtained during the verification of this range, which is identical in SD-AMN and ID-AMN. Thus, when measuring the communication cost, we focus on their differences, which regard the transmission of query points and the distance information. The CPU overhead is again based on elementary distance computations. Finally, due to the identical verification process, the client cost is similar, and the corresponding experiments are omitted. Fig. 21 shows the communication cost as a function of the number m of servers. Since we do not count the
[Figure: log-scale plots of communication overhead (Kbytes) for SD-AMN and ID-AMN.]
Fig. 21. Communication overhead versus m (d = 8, k = 8). (a) Corel (D = 64, N = 68,040). (b) Chlorine (D = 128, N = 4,310). (c) Stock (D = 512, N = 10,000). (d) Mallat (D = 1,024, N = 2,400).
[Figure: log-scale plots of elementary distance computations (×10^5) for SD-AMN and ID-AMN.]
Fig. 22. Server computations versus m (d = 8, k = 8). (a) Corel (D = 64, N = 68,040). (b) Chlorine (D = 128, N = 4,310). (c) Stock (D = 512, N = 10,000). (d) Mallat (D = 1,024, N = 2,400).
common data transmissions, the dominant factor is the number of high-dimensional query (Q) transmissions. SD-AMN sends Q to all servers, yielding an overhead of D · m values. On the other hand, ID-AMN transmits Q only to candidate servers. In the best case, all results may be found in a single server, and the rest are eliminated using the dst bound; in the worst case, Q must be sent to all servers, if they all constitute candidates. In general, the number of eliminated servers increases with their total number, leading to the savings of ID-AMN. Fig. 22 compares the two methods on elementary distance computations at the server versus m. The retrieval of a kNN set involves a number of computations linear in (k + |FH|) · (d + D), because the distances of all results and false hits must be evaluated in both the low- and high-dimensional spaces. In SD-AMN, each of the m servers must retrieve the kNNs; thus, the total cost increases linearly with both m and k. In ID-AMN, a server performs a number of computations proportional to its contribution ki ≤ k to the result set. The value of m affects the number of computations only indirectly, by increasing the number of false candidates. In general, ID-AMN clearly outperforms SD-AMN in all settings.
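The D · m versus D · |candidates| tradeoff behind Fig. 21 can be sketched under a simplified cost model. This is our own illustration, not the paper's exact protocol: a coordinator eliminates every server whose best dst lower bound already exceeds the kth exact distance collected so far, and ships the D-dimensional Q only to the survivors. All names and the cost accounting below are assumptions.

```python
def query_shipping_costs(D, lb_best, kth_dist):
    """Compare query-transmission costs for SD-AMN versus an
    ID-AMN-style elimination, given each server's smallest dst lower
    bound (lb_best, a dict server -> bound) and the kth-smallest exact
    distance gathered so far (kth_dist).  Illustrative model only:
    common transmissions (results, false hits) are not counted."""
    # SD-AMN: the D-dimensional Q is sent to all m servers.
    sd_cost = D * len(lb_best)
    # ID-AMN: a server whose best lower bound exceeds kth_dist cannot
    # hold a closer point, so it never receives Q.
    survivors = [s for s, lb in sorted(lb_best.items()) if lb <= kth_dist]
    id_cost = D * len(survivors)
    return sd_cost, id_cost, survivors
```

Under this model the two costs coincide only in the worst case where every server remains a candidate, matching the discussion above.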
7 CONCLUSIONS
The importance of authenticated query processing grows with the amount of information available at sources that are untrustworthy, unreliable, or simply unfamiliar. This is the first work to address authenticated similarity retrieval from such sources using the multistep kNN framework. We show that a direct integration of optimal NN search with an authenticated data structure incurs excessive communication overhead. We therefore develop C-AMN, a technique that addresses the communication-specific aspects of NN search and minimizes the transmission overhead and the verification effort of the clients. Furthermore, we propose ID-AMN, which incrementally retrieves distance information from distributed servers, eliminating those that cannot contribute results.

ACKNOWLEDGMENTS
This work was supported by grant HKUST 618108 from the Hong Kong RGC.

REFERENCES
[1] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles," Proc. ACM SIGMOD, 1990.
[2] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When Is Nearest Neighbor Meaningful?" Proc. Int'l Conf. Database Theory (ICDT '99), 1999.
[3] R. Bryan, "The Digital Revolution: The Millennial Change in Medical Imaging," Radiology, vol. 229, pp. 299-304, Nov. 2003.
[4] Y. Cai and R. Ng, "Indexing Spatio-Temporal Trajectories with Chebyshev Polynomials," Proc. ACM SIGMOD, 2004.
[5] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh, "Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures," Proc. VLDB Endowment, vol. 1, pp. 1542-1552, 2008.
[6] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases," Proc. ACM SIGMOD, 1994.
[7] G. Hjaltason and H. Samet, "Distance Browsing in Spatial Databases," ACM Trans. Database Systems, vol. 24, no. 2, pp. 265-318, 1999.
[8] H. Jagadish, B. Ooi, K. Tan, C. Yu, and R. Zhang, "iDistance: An Adaptive B+-Tree Based Indexing Method for Nearest Neighbor Search," ACM Trans. Database Systems, vol. 30, no. 2, pp. 364-397, 2005.
[9] H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack Problems. Springer, 2004.
[10] E.J. Keogh and C.A. Ratanamahatana, "Exact Indexing of Dynamic Time Warping," Knowledge and Information Systems, vol. 7, no. 3, pp. 358-386, 2005.
[11] F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas, "Fast Nearest Neighbor Search in Medical Image Databases," Proc. Int'l Conf. Very Large Data Bases (VLDB '96), 1996.
[12] A. Kundu and E. Bertino, "Structural Signatures for Tree Data Structures," Proc. Int'l Conf. Very Large Data Bases (VLDB '08), 2008.
[13] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, "Dynamic Authenticated Index Structures for Outsourced Databases," Proc. ACM SIGMOD, 2006.
[14] C. Martel, G. Nuckolls, P. Devanbu, M. Gertz, A. Kwong, and S. Stubblebine, "A General Model for Authenticated Data Structures," Algorithmica, vol. 39, no. 1, pp. 21-41, 2004.
[15] R. Merkle, "A Certified Digital Signature," Proc. CRYPTO, 1989.
[16] A. Menezes, P. van Oorschot, and S. Vanstone, Handbook of Applied Cryptography. CRC Press, 1996.
[17] H. Pang and K. Mouratidis, "Authenticating the Query Results of Text Search Engines," Proc. Int'l Conf. Very Large Data Bases (VLDB '08), 2008.
[18] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao, "Query Processing in Spatial Network Databases," Proc. Int'l Conf. Very Large Data Bases (VLDB '03), 2003.
[19] T. Seidl and H.-P. Kriegel, "Optimal Multi-Step k-Nearest Neighbor Search," Proc. ACM SIGMOD, 1998.
[20] R.T. Snodgrass, S.S. Yao, and C. Collberg, "Tamper Detection in Audit Logs," Proc. Int'l Conf. Very Large Data Bases (VLDB '04), 2004.
[21] R. Tamassia and N. Triandopoulos, "Efficient Content Authentication over Distributed Hash Tables," Proc. Int'l Conf. Applied Cryptography and Network Security (ACNS '07), 2007.
[22] Y. Tao, K. Yi, C. Sheng, and P. Kalnis, "Quality and Efficiency in High Dimensional Nearest Neighbor Search," Proc. ACM SIGMOD, 2009.
[23] Y. Yang, S. Papadopoulos, D. Papadias, and G. Kollios, "Spatial Outsourcing for Location-Based Services," Proc. Int'l Conf. Data Eng. (ICDE '08), 2008.
[24] K. Yi, F. Li, M. Hadjieleftheriou, G. Kollios, and D. Srivastava, "Randomized Synopses for Query Assurance on Data Streams," Proc. Int'l Conf. Data Eng. (ICDE '08), 2008.

Stavros Papadopoulos received the BSc degree in computer science from the Aristotle University of Thessaloniki, Greece. He is currently working toward the PhD degree at the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His research interests include authenticated query processing, private information retrieval, spatio-temporal databases, and cloud computing.

Lixing Wang received the BE degree in software engineering from Fudan University, China, and the BS degree from University College Dublin, Ireland, both in computer science. He is currently working toward the PhD degree at the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His research interests include spatio-temporal databases and query processing in wireless sensor networks.
Yin Yang received the BE degree in computer science and engineering from Shanghai Jiao Tong University and the PhD degree in computer science from the Hong Kong University of Science and Technology. He is currently a visiting scholar at the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His research interests include database outsourcing, query processing in data stream management systems, keyword search, and spatio-temporal databases.

Dimitris Papadias is currently a professor in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. In 1997, he worked and studied at the German National Research Center for Information Technology (GMD), the National Center for Geographic Information and Analysis (NCGIA, Maine), the University of California at San Diego, the Technical University of Vienna, the National Technical University of Athens, Queen's University (Canada), and the University of Patras (Greece). He serves or has served on the editorial boards of the VLDB Journal, IEEE Transactions on Knowledge and Data Engineering, and Information Systems.

Panagiotis Karras received the PhD degree in computer science from the University of Hong Kong and the MEng degree in electrical and computer engineering from the National Technical University of Athens. He is currently a Lee Kuan Yew postdoctoral fellow at the National University of Singapore. In 2008, he received the Hong Kong Young Scientist Award. He has also held positions at the University of Zurich and the Technical University of Denmark. His research interests include data mining, algorithms, spatial data management, anonymization, indexing, and similarity search. His work has been published in major database and data mining conferences and journals.
For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.