
IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. XX, NO. XX, XXXX

AF-Dedup: Secure Data Deduplication Based on Adaptive Dynamic Merkle Hash Forest PoW for Cloud Storage

Abstract— For encrypted data deduplication, Proof of Ownership (PoW) verifies a client's ownership of an entire file, preventing malicious users from exploiting a single segment of information to gain access to the file. By establishing the identity of two users who possess the same file, the CSP can maintain a single copy of the file, enabling deduplication. However, existing PoW schemes based on the Merkle hash tree (MHT) cannot guarantee the security of small files. Therefore, we propose a novel data structure called the adaptive dynamic Merkle hash forest (ADMHF) for PoW, and present an encrypted data deduplication scheme called AF-Dedup. It reduces the risks of data content exposure resulting from multiple ownership verification attempts in traditional schemes. Specifically, we first construct the file tag as a unique identifier of the file. Second, different encryption schemes are employed depending on the popularity of the data. Then, the corresponding ADMHF is generated for subsequent ownership verifications. After security analysis and simulation experiments, our scheme is proven to significantly enhance the security of small files. In a given situation, for files with only 2 blocks, our scheme achieves the same level of security as the existing scheme for a file with 91 blocks.

Index Terms— ADMHF, bilinear mapping, encrypted data deduplication, proof of ownership

I. INTRODUCTION

Currently, an increasing number of users prefer to upload their local files to cloud storage platforms in order to free up storage space and access their data anytime and anywhere. However, this trend results in a significant amount of duplicated data stored in the cloud, leading to a waste of storage space. To address this issue, the mainstream methods include data compression and data deduplication, with this paper focused on the latter. Encrypted data deduplication is a technique where the cloud server stores a single copy of a file and generates an access link for other authorized users. Studies indicate that encrypted data deduplication ratios can range from 1:10 to 1:500, allowing for over 90% savings in storage space for backup file systems [1].

The practice of outsourcing private data to cloud service providers (CSPs) necessitates that users place unconditional trust in the security of their data. In reality, users are often concerned about data confidentiality. One viable solution involves encrypting the data before uploading it to the CSP. However, traditional encryption schemes use different keys for different users, resulting in the same plaintext being encrypted into different ciphertexts. This makes deduplication a complex and challenging problem [2].

In order to address the aforementioned challenge, Convergent Encryption (CE) [3] is proposed. However, CE cannot resist dictionary attacks. There are also variants [4] [5] of CE, but they still cannot address the core issue. To enhance the security of CE, several schemes introduce the use of a Trusted Third Party (TTP) [6]. However, TTPs are considered to be fully trusted in these schemes, posing implementation challenges in real-world applications.

Proof of ownership (PoW) plays a vital role in encrypted data deduplication, as it can efficiently prove that a user indeed possesses a certain file [7]. The current method for PoW involves using the file hash to differentiate between identical files, commonly referred to as hash-as-proof. However, if an attacker can enumerate the hash value offline, he may gain ownership of the file even if he does not own it. In response to this, Halevi et al. suggest the Merkle hash tree (MHT) [8] as a means of validating data [9]. Compared to the hash-as-proof scheme, this scheme enables users to efficiently prove ownership of an entire file to the CSP, rather than just a fragment of the file. For a prover that passes MHT-PoW, it can be inferred with high probability that the prover possesses most of the leaves associated with the file.

However, the security of the MHT-PoW scheme is compromised, especially for small files. As demonstrated by our analysis in Section V-C, considering side-channel attacks, each verification attempt poses a potential risk of exposing relevant nodes. With an increasing number of verifications Ncr, the percentage of exposed nodes γ also increases. Once Ncr exceeds a certain value, the MHT becomes fully exposed, rendering the PoW mechanism ineffective. We then perform experiments in Section VI-A to assess the correlation between Ncr and γ.

Building on the aforementioned concerns, we propose a secure encrypted data deduplication scheme, namely AF-Dedup. Our contributions are as follows:
• We propose an encrypted data deduplication scheme that does not rely on any TTPs and guarantees semantic security of ciphertexts.
• We propose a PoW method based on the ADMHF, thereby preventing the exposure of MHT node information via multiple verifications. For example, when the file is divided into 32 blocks, the percentage of exposed nodes in a single PoW round decreases by a factor of 7.199 compared to the traditional MHT-PoW scheme.
• We utilize blind signatures to generate file tags and enhance system security by employing bilinear mapping for signature verification.

TABLE I: Comparison between our scheme and existing works

Method              Semantic security   Without TTP   Secure communication   PoW
CE [3]              ×                   ✓             ×                      ×
MLE [5]             ×                   ✓             ×                      ×
Xu-CDE [14]         ×                   ✓             ×                      ×
Key-sharing [17]    ✓                   ×             ×                      ✓
TEE [19]            ✓                   ✓             ×                      ✓
MHT [9]             ✓                   ✓             ×                      ✓
Our scheme          ✓                   ✓             ✓                      ✓

II. RELATED WORK

The majority of existing deduplication schemes are based on Convergent Encryption (CE), which addresses the issue of incongruity between conventional encryption algorithms and deduplication systems. Douceur et al. [3] are the first to
introduce the concept of CE. This method entails encrypting a
file by utilizing its hash value as the key, thereby allowing for
the same plaintext to be encrypted into identical ciphertext.
Despite its efficiency and straightforwardness, CE remains
susceptible to dictionary attacks as its data entropy is low [4].
Moreover, it lacks the ability to achieve semantic security. To
formalize the concept of CE, Bellare et al. [5] propose a new
cryptographic primitive–Message Lock Encryption (MLE),
which complicates the key computation and encryption meth-
ods but has no change in its core idea compared to CE, thus
also fails to achieve semantic security [10] [11]. In the follow-
up works, Chen et al. [12] develop a novel scheme named
Block-Level Message-Locked Encryption (BL-MLE), which enables both file-level and block-level deduplication to be performed simultaneously with a smaller set of metadata [13].

Fig. 1: System Model
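The CE primitive discussed above can be sketched in a few lines: the key is derived from the file's own hash, so identical plaintexts always yield identical ciphertexts. The SHA-256 keystream below is a toy stand-in for a real block cipher and is not from the paper; it only illustrates the determinism.

```python
# Toy sketch of Convergent Encryption: the key is the file's own hash,
# so identical plaintexts yield identical ciphertexts, which is what
# enables cross-user deduplication. The SHA-256 keystream is a
# stand-in for a real block cipher, purely for illustration.
import hashlib

def ce_encrypt(data: bytes) -> bytes:
    key = hashlib.sha256(data).digest()   # K = H(F)
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

f = b"the same plaintext"
assert ce_encrypt(f) == ce_encrypt(f)     # deterministic across users
```

This determinism is exactly why CE cannot be semantically secure: anyone who can guess F can recompute its ciphertext and test the guess offline.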
Subsequently, Xu et al. [14] propose a multi-client crossover
deduplication scheme named Xu-CDE as a natural extension
of CE. However, the PoW credentials used in this scheme lack real-time protection, leaving it vulnerable to replay attacks [15].

There are some works based on a Trusted Third Party (TTP) in encrypted data deduplication. Stanek et al. [18] put forward a concept of popularity to classify data into popular data and unpopular data. In this scheme, the TTP consists of an identity provider and an indexing service. The former is employed to mitigate sybil attacks, while the latter serves to prevent the leakage of unpopular data information to the CSP. Other schemes also employ a TTP [16] [17]. Nevertheless, the practicality of finding a suitable TTP remains a significant challenge. To get rid of the TTP, Fan et al. [19] present a secure deduplication scheme that leverages the Trusted Execution Environment (TEE) [20]. Due to the distinctiveness of the TEE, sensitive operations such as key management are restricted to the TEE, thereby eliminating the need for a TTP.

There is another line of works based on the proof of ownership (PoW) technique. Early deduplication schemes based on the hash-as-proof mechanism are vulnerable to brute-force attacks, as the hash is easy to obtain. Then, Halevi et al. [9] propose a scheme to thwart counterfeit attacks utilizing the Merkle hash tree (MHT). In this approach, the client and server encode the file into a buffer, which is segmented into fixed-size blocks, with the hash of each data block serving as a leaf node. A parent node is obtained by hashing its two child nodes, and ultimately the root node of the entire tree is computed. During the MHT-PoW verification process, the server randomly selects k leaf nodes as challenges, and the client returns the path information from the leaf nodes to the root node as a response. Subsequently, the server can calculate the value of the root node based on the client's response. By comparing the two root node values, the server can determine whether the user possesses the file. However, the scheme does not take into consideration the insecurity of the communication channel [21] [22]. If an adversary can eavesdrop on the content exchanged between the server and the client, the adversary can recover the whole MHT after a few PoW verifications, especially for small files.

In summary, existing schemes have multiple shortcomings in terms of their resistance to brute-force attacks, and some of their prerequisites are unfeasible for real-world application. So this study aims to address the prevailing weaknesses in current encrypted data deduplication schemes. In order to illustrate the distinctions between our scheme and the existing works, we conduct a comparative analysis, and the results are presented in Table I.

III. SYSTEM AND ADVERSARY MODEL

A. System Model

In the proposed scheme, the system model primarily includes two entities: User and CSP. These entities are illustrated in Fig. 1.

User: User is an entity that outsources data storage to the CSP and can subsequently access the downloaded data from the cloud. In a deduplicated cloud storage system, users can minimize network transfers by uploading data files only once, instead of uploading files that already exist in the cloud storage. Users can be further classified into two categories: initial uploaders and subsequent uploaders.

CSP: CSP is an entity that consists of a main server and a storage server. The main server of CSP is tasked with conducting cross-user file duplicity detection, verifying data ownership, and generating tags. The storage server, on the other hand, is responsible for securely storing the encrypted

file data blocks and facilitating user access requests by communicating with the main server.

B. Adversary Model

In the proposed scheme, two possible adversaries are considered: internal adversary and external adversary.

Internal adversary: The internal adversary refers to someone with authorized access to the CSP who may attempt to retrieve users' encrypted data without permission. This adversary may be a cloud service provider or one of its employees, and can access ciphertext without the knowledge or consent of the respective user.

External adversary: In the context of cloud computing, external adversaries are typically defined as malicious users who engage in unauthorized access of private data belonging to other users. These adversaries can employ various tactics to achieve their objectives, such as attacking legitimate clients to obtain partial information about specific copies of data.

C. Design Goals

The design goal of this scheme is that the CSP can securely accomplish the deduplication of encrypted data. Therefore, our scheme should satisfy the following properties:
• Security: 1) The uploaded ciphertext is semantically secure. 2) The security of file tags, including their unforgeability and distinguishability. 3) The security of the ADMHF-PoW process.
• Efficiency: The proposed solution should be efficient in terms of time and storage.

IV. PROPOSED DEDUPLICATION SCHEME

A. Preliminary

1) Bilinear mapping: Let (G, +) and (GT, ×) be additive and multiplicative groups with the same prime order p, respectively, and let g be the generator of G. Let e : G × G → GT be a bilinear map which satisfies the following properties.
• Bilinear: ∀a, b ∈ Zp*, ∀P, Q ∈ G, e(aP, bQ) = e(P, Q)^ab.
• Non-degenerate: ∃P, Q ∈ G such that e(P, Q) ≠ 1.
• Computable: There is an efficient algorithm to compute e(P, Q) for P, Q ∈ G.

B. Implementation

Our scheme consists of four processes: SystemSet, CheckTagGen, FileUpload, and FileDownload. The overview is shown in Fig. 2.

1) SystemSet:
• Step 1. Two cyclic groups G1 and GT of prime order p are chosen, and g is the generating element of G1. Define the bilinear mapping e : G1 × G1 → GT.
• Step 2. CSP randomly chooses g′, h, h′ ← G1 and r ← Zp*, and calculates g1 = g^r, Z = e(g^r, g′). Set the system main public key as mpk = g^r; the public parameter is (G1, GT, p, e, g, g1, g′, h, h′, Z, H, SH).
• Step 3. Each user in the system picks a random number s ← Zp*, calculates the user's public key as pk = g^s, and the private key as sk = (d1, d2, d3) = (g′^r, (g1^pk · h)^s, g^s).
• Step 4. Input the security parameter λ to update the symmetric encryption algorithm.

2) CheckTagGen: When a user U intends to upload a file F to CSP, s/he first executes the remote attestation to establish a secure communication channel. Then U tries to acquire a blind signature of F from CSP. The process is shown in Alg. 1.

Algorithm 1 Check Tag Generation
1: User:
2: Randomly chooses q ← Zp*.
3: Computes the short hash value of the file hF = SH(F).
4: Calculates F′ = hF^q.
5: User → CSP: Sends F′ for blind signature.
6: CSP: Computes α′ = F′^r.
7: CSP → User: Sends α′ to U.
8: User: Calculates α = α′^(q⁻¹).
9: if e(α, g) = e(hF, g1) holds then
10:   // Signature α′ from CSP is right.
11:   User:
12:   Computes K1 = H(α).
13:   Computes the ciphertext of F, CF = E(K1, F).
14:   Computes the file tag TagF = H(CF).
15:   Stores TagF to verify data integrity.
16:   User → CSP: Sends TagF as the file tag.
17:   if TagF already exists in CSP then
18:     CSP → User: U is a subsequent uploader.
19:     Execute Algorithm 2.2.
20:   else
21:     CSP → User: U is an initial uploader.
22:     Execute Algorithm 2.1.
23:   end if
24: else
25:   Upload termination.
26: end if

3) FileUpload: Based on the result of Alg. 1, U is classified as an initial uploader or a subsequent uploader.

Case 1: For initial uploaders, to achieve a balance between the security and the efficiency of data encryption, we adopt a strategy that classifies the data into two categories: popular data and unpopular data. For popular data with a low level of privacy, it is sufficient to upload CF as ciphertext. For unpopular data, a two-layered encryption is required. When the number of authorized users reaches the popularity threshold t, that is, CountF = t, the outer-layer encryption of the unpopular data is decrypted. Notably, t is a constant determined by the application scenario and security requirements. As t increases, the security of the system improves, but so does the overhead for users to decrypt during the file download phase. Regarding the specific value of t in real-world settings, readers can refer to [18].

After obtaining the ciphertext, the corresponding ADMHF is generated for subsequent verifications. To mitigate the potential vulnerabilities associated with the MHT, we propose using a new verification data structure called the adaptive dynamic Merkle hash forest (ADMHF). It is a collection of m mutually independent MHTs. Moreover, each ADMHF corresponds

Fig. 2: The Overview of AF-Dedup

to a particular file. The number of trees contained within the ADMHF is determined by the file size, according to a specified formula:

y = F(x) = a / (1 + e^(b − x/c))   (1)

Equation (1) illustrates the dynamic relationship between y, which denotes the number of trees in the ADMHF, and x, representing the number of data blocks. The graph of the function appears as an inverted S-curve. Parameter a determines the upper bound of the function value. Parameters b and c affect the critical point and the change tendency of the curve.

To achieve the diversity of MHTs in the ADMHF, we introduce salt into the hash function in the MHT generation process. Salt = {s1, s2, ..., sm} is a set of randomly generated strings, in which each element si, i ∈ {1, ..., m}, corresponds to a unique MHT in the forest. The value of a parent node is obtained by concatenating the values of its two child nodes with a salt value in Salt, and then performing a hash operation on the resulting string. The detailed process is shown in Alg. 2.1.

Algorithm 2.1 File Upload: Initial Uploader
1: // CountF = 0 < t, which implies the data is unpopular data now.
2: User:
3: Randomly chooses L ← Zp* for two-layered encryption.
4: C1 = g^L, C2 = CF ⊕ Z^L,
5: C3 = (g1^pk · h)^L, C4 = (g1^hF · h′)^L.
6: Computes the ciphertext CF′ = C1 || C2 || C3 || C4.
7: Computes K2 = H(F), τF = K2 ⊕ L.
8: User → CSP: Sends <CF′, τF, Tagpop = 1> to CSP.
9: CSP:
10: Divides CF into blocks {Bi}, i ∈ {1, ..., Numt}.
11: Calculates m according to (1).
12: Generates random salt values Salt = {s1, s2, ..., sm}.
13: for i = 1 to m do
14:   Generates MHTi and adds si to MHTi when calculating the hash value of each node.
15:   Signs the root node nRooti and obtains SigFi = nRooti^r to verify the integrity of MHTi.
16:   Stores si and the root node nRooti.
17: end for

Among them, Tagpop is a variable that indicates the current data popularity. When CountF < t, F is unpopular data, that is, Tagpop = 1. When CountF > t, F is popular data, then Tagpop = 0. When CountF = t, then Tagpop = 2, which indicates that the file is undergoing a transition from unpopular data to popular data, and it is considered popularity-transitioning data.

Case 2: For subsequent uploaders, CSP tends to perform PoW on U. The ADMHF-PoW process consists of Challenge, Response and Verification. The detailed process is shown in Alg. 2.2.

Algorithm 2.2 File Upload: Subsequent Uploader
1: Challenge:
2: // CSP generates a challenge set for PoW.
3: Randomly selects an MHTi, i ∈ {1, 2, ..., m}, from the ADMHF.
4: Computes the hash value of the current root node.
5: if e(SigFi, g) = e(nRooti, g1) then
6:   Generates a challenge set ch = {ch1, ch2, ..., chk}, chj ∈ {1, 2, ..., Numt}.
7:   CSP → User: Sends <si, ch, τF> to U.
8: else
9:   The data integrity of MHTi has been corrupted.
10: end if
11:

12: Response:
13: // U generates a response to prove his ownership.
14: Computes K2 = H(F), L = τF ⊕ K2.
15: Encrypts CF with L to obtain CF′.
16: Divides CF′ into blocks {Bi}, i ∈ {1, ..., Numt}.
17: Adds salt si to each of the nodes to obtain MHTi′.
18: Computes the corresponding response res = {leafi, Bro(leafi)}, i ∈ ch, for each node indexed by ch.
19: User → CSP: Sends res to CSP.
20:
21: Verification:
22: // CSP verifies the correctness of the response.
23: CSP uses the values in res to calculate the root node of MHTi.
24: if the calculated value is identical to the actual value then
25: Adds U to the owner list, CountF += 1.
26: if CountF < t then
27:   CSP → User: Returns Tagpop = 1.
28: else if CountF = t then
29:   CSP → User: Returns Tagpop = 2.
30: else
31:   CSP → User: Returns Tagpop = 0.
32: end if
33: Execute Algorithm 3.
34: else
35: The verification fails.
36: end if

Fig. 3: ADMHF-PoW Process

Here is the correctness proof for the outer-layer decryption:

(C2 · e(C3, d3)) ⊕ (e(d1, C1) · e(d2, C1))
= ((CF ⊕ e(g^r, g′)^L) · e(C3, d3)) ⊕ (e(d1, C1) · e(d2, C1))
= CF ⊕ (e(g^r, g′)^L · e((g1^pk · h)^L, g^s)) ⊕ (e(g′^r, g^L) · e((g1^pk · h)^s, g^L))
= CF ⊕ (e(g^r, g′)^L · e((g1^pk · h)^L, g^s)) ⊕ (e(g^r, g′)^L · e((g1^pk · h)^L, g^s))
= CF ⊕ 0 = CF.
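The Challenge/Response/Verification round of Alg. 2.2 over a single salted MHT can be sketched as follows. This is a minimal toy version assuming SHA-256 and an 8-block file; the function names and the concrete salt are ours, not the paper's.

```python
# Toy sketch of one salted-MHT PoW round: build the tree, answer a
# challenged leaf with its sibling path Bro(leaf), recompute the root.
import hashlib

def h(data: bytes, salt: bytes) -> bytes:
    # Salted hash used for every node, as in the ADMHF construction.
    return hashlib.sha256(salt + data).digest()

def build_mht(blocks, salt):
    """Build one salted MHT; returns all levels, leaves first."""
    level = [h(b, salt) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[j] + level[j + 1], salt)
                 for j in range(0, len(level), 2)]
        levels.append(level)
    return levels

blocks = [bytes([i]) * 16 for i in range(8)]   # 8 toy data blocks
salt = b"s_i"                                  # per-tree salt s_i

root = build_mht(blocks, salt)[-1][0]          # CSP stores nRoot_i

def respond(levels, idx):
    """Prover: challenged leaf plus its sibling path Bro(leaf)."""
    leaf, proof = levels[0][idx], []
    for level in levels[:-1]:
        proof.append(level[idx ^ 1])           # sibling on this level
        idx //= 2
    return leaf, proof

def verify(leaf, proof, idx, root, salt):
    """CSP: recompute the root from the response and compare."""
    node = leaf
    for sibling in proof:
        pair = sibling + node if idx & 1 else node + sibling
        node = h(pair, salt)
        idx //= 2
    return node == root

for ch in (3, 7):                              # challenge set ch = {3, 7}
    leaf, proof = respond(build_mht(blocks, salt), ch)
    assert verify(leaf, proof, ch, root, salt)
```

Salting each tree with its own si is what keeps the m trees mutually independent: node values exposed while answering challenges on MHTi do not reveal node values of the other trees in the forest.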

Among them, in the process of Response, Route(leafi) denotes the set of nodes located on the path from the leaf node leafi to the root node, and Bro(leafi) denotes the set of sibling nodes for each node within Route(leafi).

The ADMHF-PoW process is illustrated with the example shown in Fig. 3. In this case, the file is split into 8 blocks. During the challenge process, CSP randomly selects an MHTi and generates a challenge set ch = {3, 7}, then sends <si, ch, τF> to U. U calculates the response res = {n31, n43, n44, n33, n47, n48} with si. Then CSP uses the node values in res to calculate the root node to verify the response. If U passes the PoW, it can be assumed that U owns the file. Then data deduplication needs to be performed; the detailed process is shown in Alg. 3.

Algorithm 3 Data Deduplication
1: if Tagpop = 2 then
2:   // It is considered popularity-transitioning data.
3:   User:
4:   Decrypts CF′ using L.
5:   Then obtains CF = (C2 · e(C3, d3)) ⊕ (e(d1, C1) · e(d2, C1)).
6:   User → CSP: Uploads <CF, Tagpop = 0> to replace the original tuple <CF′, τF, Tagpop = 1>.
7: else
8:   // Data deduplication.
9:   U does not need to do anything.
10: end if

4) FileDownload: When U intends to download file F, he sends a request to CSP. CSP authenticates U and searches the list of file owners to determine whether U has access. If found, CSP returns the current data popularity to U; otherwise, the request will be rejected. In addition, the user is informed if there is a change in data popularity. The encryption and decryption process is shown in Alg. 4.

Algorithm 4 File Download
1: CSP → User: Sends <CF′, τF, Tagpop = 1> or <CF, Tagpop = 0>.
2: if Tagpop = 1 then
3:   User: Decrypts CF′ using L.
4:   User: Obtains CF = (C2 · e(C3, d3)) ⊕ (e(d1, C1) · e(d2, C1)).
5: end if
6: Computes TagF′ = H(CF).
7: if TagF′ = TagF then
8:   User: Computes F = D(K1, CF) to get the plaintext.
9: else
10:   Data integrity is compromised.
11: end if

C. Computation Complexity Analysis

We analyze the computation complexity of users and CSP in different phases in Table II. Among them, FSize represents the size of the file, Numt is the number of leaf nodes of the MHT, m represents the number of MHTs in the ADMHF, and k is the number of nodes CSP challenges at each round.

TABLE II: Computation complexity analysis of different entities in different phases

Phase          User                         CSP
CheckTagGen    O(FSize) + O(log p)          O(log p)
OuterLayerEnc  O(log p)                     −
ADMHFGen       −                            O(m · Numt · log Numt)
Challenge      −                            O(k)
Response       O(Numt) + O(k · log Numt)    −
Verify         −                            O(k · log Numt)

V. SECURITY ANALYSIS

As described in Section III-C, this paper focuses on achieving three key security goals: ensuring the security of file tags, maintaining semantic security for ciphertexts, and providing the security of the PoW process. In the following security proof, we will explain how AF-Dedup achieves these security goals.

A. Security of File Tag

First, we analyze the correctness of the file tag, which is ensured by the fact that the CSP returns the correct signature during the file tag generation phase.

Theorem 1. For the file F, if e(α, g) = e(hF, g1) holds, the CSP must have returned the correct signature α′ during the generation process of the file tag.

Proof. According to the properties of bilinear mapping:

e(α, g) = e(α′^(q⁻¹), g) = e(hF^(q·r·q⁻¹), g) = e(hF^r, g) = e(hF, g^r) = e(hF, g1).

If the equation above holds, U can confirm that the signature is valid, thus ensuring the correctness of the file tag.

The security of file tags includes their unforgeability and distinguishability. In the following text, Theorem 2 analyzes the distinguishability of file tags, while Theorem 3 demonstrates the unforgeability of file tags.

Lemma 1. For a secure hash function H : {0,1}* → G1, ∀F1, F2 ∈ {0,1}*, F1 ≠ F2, the probability that H(F1) = H(F2) is negligible, that is:

Pr[H(F1) = H(F2) | F1 ≠ F2] ≤ ε. (2)

Theorem 2. Let the file tag uploaded by user Ui be TagF and the file tag uploaded by Uj be TagF′. If F ≠ F′, the probability that TagF = TagF′ is negligible. That is:

Pr[TagF = TagF′ | F ≠ F′] ≤ ε. (3)

Proof. The proof of this theorem can be done by contradiction. Suppose there exists F ≠ F′ such that TagF = TagF′. According to the scheme of this paper:

TagF = TagF′
⇔ H(CF) = H(CF′)
⇔ H(E(H(αF), F)) = H(E(H(αF′), F′))
⇔ H(E(H(hF^r), F)) = H(E(H(hF′^r), F′))
⇔ H(E(H(SH(F)^r), F)) = H(E(H(SH(F′)^r), F′)).

If the equation above holds, then according to Lemma 1 and the determinism of the encryption algorithm, there must be F = F′. However, this contradicts the assumption F ≠ F′; therefore the assumption does not hold.

Theorem 3. Let the file tag uploaded by user Ui be TagF and the file tag uploaded by Uj be TagF′. If TagF = TagF′, then the probability that F ≠ F′ is negligible, that is:

Pr[F ≠ F′ | TagF = TagF′] ≤ ε. (4)

Proof. Without loss of generality, from the steps in the scheme, we can conclude:

TagF = H(E(H(SH(F)^r), F)). (5)

If F ≠ F′, we discuss attacks on the file tag TagF by the adversary A from the following two perspectives:
(1) Assuming A is a malicious user, the parameters F and r are unknown.
(2) Assuming A is a CSP, the parameter F is unknown.
In both of these cases, A cannot construct a file tag that satisfies TagF = TagF′.

B. Data Privacy

We illustrate the semantic security of the ciphertext by analyzing its resistance to offline brute-force attacks. Based on our proofs, for an attacker with limited computational power, it is impossible to obtain any important information about the plaintext from the ciphertext.

1) Offline brute-force attack: If an internal adversary A, such as the CSP, acquires the file tag, ciphertext and data popularity, he performs an offline brute-force attack on F. In addition, for unpopular data, A also gets τF.

As for popular data:
• Step 1. A lists a possible data set {Fi}, |F| = FSize, where i ∈ [1, 2^FSize]. Then he computes {hFi = SH(Fi)}.
• Step 2. A lists all possible private key values of CSP {rj}, j ∈ [1, p − 1], then calculates {αij = hFi^rj}.
• Step 3. Then A calculates the ciphertext set {CFij = E(H(αij), Fi)}.

As for unpopular data:
• Step 1. A lists a possible data set {Fi}, |F| = FSize, where i ∈ [1, 2^FSize]. Then A computes {K2i = H(Fi)}, {Li = K2i ⊕ τF}.
• Step 2. A uses the above method for popular data to get the ciphertext {CFi}.
• Step 3. A encrypts {CFi} with {Li} to obtain {CF′i}.

If there exists CFi = CF or CF′i = CF′, where i ∈ [1, 2^FSize], A can obtain the plaintext data. The time complexity of this process is O(p · 2^FSize), which is computationally infeasible.

Lemma 2. The Discrete Logarithm problem (DL problem): Given g^a ∈ G, where a ∈ Zp*, computing a is hard.

Theorem 4. The scheme is resistant to offline brute-force attacks.

Proof. By Lemma 2, guessing r in α = hF^r is difficult. Therefore, A cannot compute the plaintext from the ciphertext even if it performs an offline brute-force attack.

By analyzing the scheme's resistance to online attacks, it can be proved that an adversary cannot gain ownership of a certain file by interacting with the CSP without knowing the plaintext.

2) Online brute-force attack: After eavesdropping on the communication channel, the external adversary A obtains the file tag TagF. Using this information, A performs an online brute-force attack on F.
• Step 1. A uploads TagF to CSP. As TagF is previously stored, A is recognized as a subsequent uploader.
• Step 2. CSP generates a challenge ch for A, assuming that |ch| = 1.
• Step 3. A generates a response by enumerating the hash values in res. For one hash value, A lists all its possible values {hi}, |h| = HashLen, i ∈ [1, 2^HashLen]. In one verification round, A needs to give a total of log(Numt) + 1 such hash values.
• Step 4. A sends res to CSP for verification.
• Step 5. CSP returns the verification result.

Theorem 5. If A knows TagF but does not know F, the probability that A wins the Game to obtain the ownership of F is negligible. That is:

Pr[A wins] ≤ ε. (6)

Proof. The probability of A guessing one hash value is (1/2)^HashLen. A must guess all log(Numt) + 1 hash values correctly at the same time to win the game. Hence, the probability that A wins is (1/2)^(HashLen·(log Numt + 1)). For SHA-256, HashLen is 256 bits. Furthermore, in practice, the value of |ch| is usually greater than 1, and this probability decreases sharply as |ch| increases. Therefore, it is computationally infeasible for A to win the Game.

C. Security of PoW Process

1) Exposure of Merkle Hash Tree: Typically, PoW is designed to operate reliably, regardless of the number of times it occurs. However, in the MHT-PoW scheme [9], a portion of the MHT nodes becomes exposed with each ownership verification attempt. If the adversary can collect enough response sets res, all nodes of the entire MHT can be reconstructed. At this point, no matter what challenge CSP initiates, the adversary can return the correct response, thereby gaining ownership of the entire file. It should be noted that for smaller files, the adversary needs to collect fewer distinct response sets, making the MHT constructed on them more vulnerable.

Theorem 6. If the adversary collects Numt/2 distinct response sets of the same file, where Numt represents the number of leaf nodes, then the adversary can reconstruct this MHT.

Proof. Suppose that for file F, the adversary has collected Numt/2 distinct response sets. We represent this with a matrix, where each row of the matrix represents the response set collected by the adversary in one round:

⎡ a1,1  a1,2  ···  a1,n ⎤
⎢ a2,1  a2,2  ···  a2,n ⎥
⎢  ⋮     ⋮    ⋱    ⋮   ⎥   (7)
⎣ ac,1  ac,2  ···  ac,n ⎦

For the sake of convenience, we use n to represent the length of res and c to denote Numt/2. Here, we discuss the scenario where only one challenge node is initiated by CSP. In practical applications, there are typically multiple challenge nodes, resulting in the exposure of a greater number of nodes.

During the Verification process, to enable the CSP to correctly generate the root node based on res, it is necessary to label whether each node is a left child or a right child. However, adversaries can exploit this information along with the collected response sets to launch attacks. The last column in the matrix corresponds to the second level of the MHT. Since the second level has only two nodes, this column has only two distinct values. Based on this, attackers can reconstruct the second level. By repeating this process, attackers can use the same method to reconstruct all subsequent nodes and thereby reconstruct the entire MHT.

2) Security of ADMHF-PoW: The previous section demonstrated the insecurity of the MHT scheme. In the following section, we will demonstrate how the ADMHF-PoW scheme significantly enhances the security of the PoW process.

Theorem 7. Compared with the MHT-PoW scheme, the ADMHF-PoW scheme significantly reduces the node exposure rate in each round of verification.

Proof. Assuming the MHT is a complete binary tree with 2^i leaf nodes, the total number of nodes in the MHT is Numsum = 2^(i+1) − 1. Assuming |ch| = 1 in each verification attempt, as the user needs to provide the values of all the sibling nodes on the path from the challenge node in ch to the root node, we have |res| = i + 1. Since some of the other nodes can be computed from the nodes in res as their children, the number of exposed nodes is |Expose| = 2i + 1.

We define γ as the percentage of the exposed content of the MHT, which is shown in (8):

γ = |Expose| / Numsum = (2i + 1) / (2^(i+1) − 1) × 100%   (8)

For the ADMHF scheme, CSP generates an ADMHF with m MHTs. Among them, m can be calculated by (1). So the
8 IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, VOL. XX, NO. XX, XXXX
TABLE III: Description of symbols

Symbol   Description
SC       Ciphertext size
ST       File tag size
Stree    MHT size, including the root and the salt
Ssk      Private key size
Sk       Encryption key size
Smap     Bilinear map size
Sh       Hash size
SξF      Authentication information size
Ss       Authentication parameter size
SAm      Verification initial value size
SAα      Privilege set size
Std      Trapdoor size

Fig. 4: Percentage of exposure rate in one challenge-response round. [Plot of percentage of exposure (%) versus number of file blocks (0-500), comparing the MHT scheme and the ADMHF scheme.]
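The tree-reconstruction attack on MHT-PoW described in Section C can be sketched concretely: once enough responses have leaked the leaf-level hashes, an adversary recomputes every interior level bottom-up and can then answer any challenge. The sketch below assumes the common node rule H(left || right); the paper's exact node-hash construction may differ.

```python
import hashlib

# Sketch of the MHT reconstruction attack: rebuild all interior levels
# from leaked leaf hashes. The combining rule H(left || right) is an
# assumption, not necessarily the paper's exact construction.

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def rebuild(leaf_hashes):
    """Return the levels of a complete MHT, leaves first, root last."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[k] + prev[k + 1])
                       for k in range(0, len(prev), 2)])
    return levels

# Eight leaked leaf hashes suffice to recover all 15 nodes of the tree.
leaves = [h(b"block-%d" % k) for k in range(8)]
levels = rebuild(leaves)
print(sum(len(lvl) for lvl in levels))   # 15 nodes in total
```

With the full tree in hand, the adversary can return the sibling path for any challenged leaf, which is exactly the failure mode Fig. 4 quantifies.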
TABLE IV: Communication overhead comparison

Scheme        Communication overhead
Our scheme    2Smap + ST + SC + Sk
Key-sharing   Stree + ST + SC + Smap
TEE           SξF + Ss + SC + SAm

exposure rate of nodes in one verification attempt of the ADMHF scheme, γ′, is shown in (9):

γ′ = |Expose| / (m × Numsum) × 100% = γ / m    (9)
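Equations (8) and (9) can be checked numerically. The sketch below assumes a complete binary MHT with 2^i leaves and |ch| = 1, and treats the forest size m as a free parameter, since the paper derives m from Eq. (1), which is outside this excerpt.

```python
# Numerical check of Eqs. (8)-(9): per-round node exposure of a single
# MHT versus an ADMHF of m trees. Assumes a complete binary tree with
# 2**i leaves and |ch| = 1; m is left as a parameter here (the paper
# computes it from Eq. (1)).

def gamma_mht(i: int) -> float:
    """Eq. (8): percentage of MHT nodes exposed in one verification."""
    num_sum = 2 ** (i + 1) - 1      # total nodes of the complete tree
    expose = 2 * i + 1              # response values plus computable parents
    return expose / num_sum * 100

def gamma_admhf(i: int, m: int) -> float:
    """Eq. (9): exposure rate when the tree is one of m MHTs."""
    return gamma_mht(i) / m

print(gamma_mht(1))                  # a 2-leaf tree is fully exposed: 100.0
print(round(gamma_mht(10), 3))       # 1024 leaves: 21/2047 -> 1.026
print(round(gamma_admhf(10, 4), 3))  # same tree in a 4-tree forest -> 0.256
```

Because the exposed-node count 2i + 1 grows linearly while the node total grows exponentially, γ falls quickly with tree depth, and splitting a file across m trees divides the rate by m again.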
Given a = 8, b = 2.5, c = 488 in (1), the relationship between the exposure rate of nodes in one verification attempt and Numt is shown in Fig. 4. It can be observed that the ADMHF scheme for a file with 2 blocks achieves the same level of security as the MHT scheme for a file with 91 blocks. As a result, ADMHF-PoW significantly improves the security of small files.

VI. EXPERIMENT AND EVALUATION
All the experiments are performed on multiple devices with an Intel(R) Core(TM) i5-8500T CPU running at 2.1 GHz and 4GB of RAM on a Linux operating system. Two of the devices act as the main server and the storage server, and the remaining devices act as users.
The proposed scheme is implemented in C++, using the PBC [24] and OpenSSL [25] libraries to realize the cryptographic algorithms. We use SHA-1 to compute the short hash value of the file, and SHA-256 as the hash function to calculate the hash values of the file, the encryption key, and the file tag, and to generate the ADMHF. Encryption and decryption are based on the AES-256 symmetric encryption algorithm.
To simulate real-world scenarios, we store 3000 different files on a cloud server, with sizes ranging from 1KB to 256MB. The number of users owning each file is randomly assigned. We set the popularity threshold to t = 7, resulting in a rough ratio of 3:2 between unpopular data and popular data.

A. ADMHF Evaluation
Let Ncr denote the number of verification attempts, and γ represent the percentage of the exposed MHT content. We test the correlation between Ncr and γ experimentally with different numbers of challenge blocks. The results are shown in Fig. 5.
The data indicate that there is a positive correlation between Ncr and γ. Furthermore, as Ncr increases, the rate of increase in γ slows down. Simultaneously, as the number of challenge blocks increases, γ increases. Furthermore, it can be observed that the value of γ decreases as the number of file blocks increases. For example, in Fig. 5(a), γ reaches 99.327% when the file block number is 500. In Fig. 5(b), the value of γ is 92.156% with a file block number of 800 and the other parameters kept the same.
Then, we conduct experiments to obtain the value of Ncr required to lead to full exposure. Let N0 be the minimum number of Ncr that actually causes full exposure of the MHT content in our test. We evaluate the relation between N0 and the number of file blocks by simulating multiple PoW processes. The results are shown in Fig. 6.
The results demonstrate that, in the case of small files with the same number of blocks, the value of N0 derived from the ADMHF scheme is significantly larger than that derived from the MHT scheme. Furthermore, the value of N0 based on ADMHF exhibits an increase as the file size increases, indicating an improvement in security. Upon reaching a certain file size, such as 600 blocks, it can be observed that ADMHF degrades into a single MHT, resulting in a negligible difference between the values of N0 derived from ADMHF- and MHT-based methods.

B. Performance Analysis
We conduct a theoretical analysis of the scheme, examining its performance in terms of communication overhead and storage overhead. The measurement symbols used in our analysis are defined in Table III.
We first analyze the communication overhead required by the proposed scheme and compare it with the key-sharing scheme [17] and the TEE scheme [19]. The results are shown in Table IV. As the proposed scheme does not require any TTP, some communication costs are eliminated.
The storage overhead analysis includes the overhead on the cloud server side, the client side, and the additional TTP side. As a third-party server is not used in the proposed scheme, no additional storage overhead is required. The results are shown in Table V.
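The hashing choices listed in the implementation description above (SHA-1 for the short file hash, SHA-256 for the file hash and key material) can be sketched with Python's standard library. The AES-256 encryption step and the blind signature over the short hash are omitted here, and all function names are illustrative, not the paper's API.

```python
import hashlib

# Illustrative sketch of the hashing pipeline from Section VI: SHA-1
# yields the short hash used for tag generation; SHA-256 yields the
# file hash and a content-derived (convergent-style) encryption key.
# AES-256 encryption and the blind signature are omitted; the "key|"
# domain-separation prefix is an assumption, not the paper's scheme.

def short_hash(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()        # 20-byte short hash

def file_hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()      # 32-byte file hash

def encryption_key(data: bytes) -> bytes:
    # Key derived from the content itself, as in convergent encryption.
    return hashlib.sha256(b"key|" + data).digest()

data = b"example file content"
print(len(short_hash(data)), len(file_hash(data)), len(encryption_key(data)))
# -> 20 32 32
```

Hashing the short digest rather than the whole file before signing is what keeps tag generation nearly size-independent, which matches the negligible tag-generation cost reported later in Fig. 7.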
AUTHOR et al.: AF-DEDUP: SECURE DATA DEDUPLICATION BASED ON ADAPTIVE DYNAMIC MERKLE HASH FOREST POW FOR CLOUD STORAGE 9

Fig. 5: Percentage of content exposure. (a) File block number = 500; (b) File block number = 800. [Plots of percentage of exposure (%) versus the number of challenge-response rounds (0-300), with curves for challenge block numbers 10, 15, 20, and 25.]

Fig. 6: Actual number of challenge-response rounds resulting in full exposure. (a) Challenge block number = 10; (b) Challenge block number = 20. [Plots of the number of challenge-response rounds required versus block number (0-1000) for the MHT scheme and the ADMHF scheme.]

Fig. 7: Computation overhead for different file sizes. (a) Popular data computation overhead; (b) Unpopular data computation overhead. [Computation overhead (s) of key generation, data encryption, tag generation, file upload, and ADMHF generation for file sizes 16MB to 256MB.]
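The full-exposure experiment behind Fig. 6 can be approximated with a toy simulation: challenge random leaves of a heap-numbered complete MHT and count the rounds until every node value has appeared in some response. The response model here (one challenged leaf per round; its path to the root plus the siblings of those path nodes) is a simplification of the paper's protocol, which challenges |ch| blocks per round.

```python
import random

# Toy simulation of the N0 experiment (cf. Fig. 6): count how many
# challenge-response rounds leak every node of a complete MHT with
# 2**i leaves. Nodes are numbered 1..2**(i+1)-1 in heap order, so a
# node k > 1 has sibling k ^ 1 and parent k // 2.

def rounds_to_full_exposure(i: int, seed: int = 0) -> int:
    rng = random.Random(seed)
    leaves = 2 ** i
    total = 2 ** (i + 1) - 1        # all nodes of the complete tree
    exposed = set()
    rounds = 0
    while len(exposed) < total:
        rounds += 1
        node = leaves + rng.randrange(leaves)   # leaves occupy 2**i..total
        while node >= 1:
            exposed.add(node)                   # path node (recomputable)
            if node > 1:
                exposed.add(node ^ 1)           # sibling sent in response
            node //= 2
    return rounds

print(rounds_to_full_exposure(1))          # 2-leaf tree: one round suffices
print(rounds_to_full_exposure(6) >= 32)    # 64 leaves: each round covers
                                           # at most 2 leaves, so >= 32
```

Because each round exposes at most two new leaves, full exposure needs at least leaves/2 rounds, and the coupon-collector effect pushes the observed N0 well above that bound, consistent with the growth seen in Fig. 6.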
TABLE V: Storage overhead comparison

Scheme        Server side                 Client side       TTP side
Our scheme    SC + N·Stree + Ssk + Sk     ST + Sk           -
Key-sharing   Sk + ST + SC                Ssk               SC + Ssk
TEE           SC + Ss + SξF + SAm         Sh + bSAα + Std   -

C. Computation Overhead
The computation overhead of the proposed scheme is evaluated by conducting experiments on five distinct file sets, ranging in size from 16MB to 256MB. Measurements are taken at various stages of the file upload process, including key generation, data encryption, file tag generation, file uploading, and ADMHF generation. The experimental results are presented in Fig. 7.
Fig. 7 highlights that the computation overhead associated with key and file tag generation is practically negligible. This is attributed to the fact that during the tag generation phase, we compute the short hash value for files of any size; the subsequent blind signature is applied to this hash value, resulting in higher efficiency. Besides, the computation overhead for ADMHF generation demonstrates little difference between files of different sizes, which exhibits good scalability: for small files, more MHTs need to be generated, while for large files, although the number of MHTs is reduced, the computation overhead of generating a single MHT increases. In addition, the computation overhead required for both the file encryption and upload processes increases with file size.
We compare the latency of the PoW process with the MHT-PoW scheme [9] and the key-sharing scheme [17]. The results are illustrated in Fig. 8(a). In our experiment, we set the number of challenge nodes to 10% of Numt, where Numt represents the number of leaf nodes in the MHT. As shown in Fig. 8(a), our scheme has lower latency compared to [17], but slightly higher than [9]. Although the efficiency of [9] is higher, it comes at the cost of sacrificing security. Specifically, it can only prove security for a more restrictive set of input distributions and under an assumption about the linear code. On the other hand, [17] has a fixed number of leaf nodes in the construction of the MHT, resulting in

Fig. 8: (a) Comparison of computation overhead in the PoW process. (b) Total computation overhead comparison. [Panel (a): computation overhead (s) of MHT-PoW, key-sharing, and our scheme for file sizes 16MB to 512MB. Panel (b): total computation overhead (s) of our scheme (popular and unpopular data), key-sharing, ClouDedup, and TEE for file sizes 16MB to 256MB.]

higher computation overhead for small files. Considering both security and efficiency, our scheme is superior.
To assess the performance of our scheme, we conduct experiments to evaluate the total computation overhead and compare it with other schemes, including key-sharing [17], ClouDedup [26], and TEE [19]. The results, as shown in Fig. 8(b), indicate that our scheme exhibits superior performance in terms of total computation overhead for popular data. However, for unpopular data, our total computation overhead is slightly higher due to the implementation of two-layered encryption, which enhances data security.

D. Extending to Large Files
To test the performance of our scheme with large files and its availability in real-world scenarios, we utilize a real-world dataset collected from the Storage Lab at Stony Brook University [27]. For a file of size 1GB, we select a file chunk size of 1KB. At this configuration, the computation time for file tag generation is 0.473s, file encryption takes 69.754s, and the PoW process requires 7.242s. We then conduct tests with a file chunk size of 32KB. In this case, the times for the three steps are 0.475s, 69.241s, and 1.391s, respectively. In comparison with the PoW process of the scheme presented in [9], our scheme demonstrates a remarkable 95.089% reduction in computation overhead.

VII. CONCLUSION
This paper introduces the design of ADMHF, a novel data structure for PoW, which is applied to cloud storage deduplication. A file tag is constructed to check whether the file already exists, and if not, the ciphertext is uploaded. Data popularity is categorized to optimize security and efficiency. In addition, subsequent uploaders must execute ADMHF-PoW. Security analysis and simulation experiments show that the percentage of exposure significantly decreases compared with the traditional MHT scheme. In addition, for small files, the scheme increases the value of N0, the number of verification attempts required for full exposure of the MHT. Both of these indicators prove that our scheme improves the security of files. At the same time, for large files, the computation overhead is relatively low.

REFERENCES
[1] Jianwei Yin, Yan Tang, Shuiguang Deng, Bangpeng Zheng, and Albert Y. Zomaya. MUSE: A multi-tiered and SLA-driven deduplication framework for cloud storage systems. IEEE Transactions on Computers, 2021.
[2] W. You and B. Chen. Proofs of ownership on encrypted cloud data via Intel SGX. In The First ACNS Workshop on Secure Cryptographic Implementation (SCI '20), in conjunction with ACNS '20, 2020.
[3] J. R. Douceur, A. Adya, W. J. Bolosky, P. Simon, and M. Theimer. Reclaiming space from duplicate files in a serverless distributed file system. In Proceedings of the 22nd International Conference on Distributed Computing Systems, 2002.
[4] Chen Zhang, Yinbin Miao, Qingyuan Xie, Yu Guo, Hongwei Du, and Xiaohua Jia. Privacy-preserving deduplication of sensor compressed data in distributed fog computing. IEEE Transactions on Parallel and Distributed Systems, 33(12), 2022.
[5] Mihir Bellare, Sriram Keelveedhi, and Thomas Ristenpart. Message-locked encryption and secure deduplication. 2013.
[6] C. Guo, X. Jiang, K.-K. R. Choo, and Y. Jie. R-Dedup: Secure client-side deduplication for encrypted data without involving a third-party entity. Journal of Network and Computer Applications, 162:102664, 2020.
[7] M. Miao, G. Tian, and W. Susilo. New proofs of ownership for efficient data deduplication in the adversarial conspiracy model. International Journal of Intelligent Systems, (3), 2021.
[8] Ralph C. Merkle. A certified digital signature. 1989.
[9] S. Halevi, D. Harnik, B. Pinkas, and A. Shulman-Peleg. Proofs of ownership in remote storage systems. ACM, page 491, 2011.
[10] Xue Yang, Rongxing Lu, Jun Shao, Xiaohu Tang, and Ali A. Ghorbani. Achieving efficient secure deduplication with user-defined access control in cloud. IEEE Transactions on Dependable and Secure Computing, 19(1), 2022.
[11] Yuan Zhang, Chunxiang Xu, Nan Cheng, and Xuemin Shen. Secure password-protected encryption key for deduplicated cloud storage systems. IEEE Transactions on Dependable and Secure Computing, 19(4), 2022.
[12] R. Chen, Y. Mu, G. Yang, and F. Guo. BL-MLE: Block-level message-locked encryption for secure large file deduplication. IEEE Transactions on Information Forensics & Security, 10(12):2643–2652, 2015.
[13] Mohammed Gharib and MohammadAmin Fazli. Secure cloud storage with anonymous deduplication using ID-based key management. J. Supercomput., 79(2):2356–2382, 2023.
[14] X. Jia, E. C. Chang, and J. Zhou. Weak leakage-resilient client-side deduplication of encrypted data in cloud storage. In Proceedings of the 8th ACM SIGSAC Symposium on Information, Computer and Communications Security, 2013.
[15] SPARK: Secure pseudorandom key-based encryption for deduplicated storage. Computer Communications, 154:148–159, 2020.
[16] S. Li, C. Xu, and Y. Zhang. CSED: Client-side encrypted deduplication scheme based on proofs of ownership for cloud storage. Journal of Information Security and Applications, 46:250–258, 2019.

[17] W. A. Liang, A Bw, S. A. Wei, and B Zz. A key-sharing based secure deduplication scheme in cloud storage. Information Sciences, 504:48–60, 2019.
[18] J. Stanek, A. Sorniotti, E. Androulaki, and L. Kencl. A secure data deduplication scheme for cloud storage. In Financial Cryptography, 2014.
[19] Y. Fan, X. Lin, W. Liang, G. Tan, and P. Nanda. A secure privacy preserving deduplication scheme for cloud computing. Future Generation Computer Systems, 101, 2019.
[20] Garima Verma. Secure client-side deduplication scheme for cloud with dual trusted execution environment. IETE Journal of Research, 0(0):1–11, 2022.
[21] Yang Ming, Chenhao Wang, Hang Liu, Yi Zhao, Jie Feng, Ning Zhang, and Weisong Shi. Blockchain-enabled efficient dynamic cross-domain deduplication in edge computing. IEEE Internet Things J., 9(17):15639–15656, 2022.
[22] Xiaoyu Zheng, Yuyang Zhou, Yalan Ye, and Fagen Li. A cloud data deduplication scheme based on certificateless proxy re-encryption. J. Syst. Archit., 102, 2020.
[23] Giuseppe Ateniese, Randal Burns, Reza Curtmola, Joseph Herring, Lea Kissner, Zachary Peterson, and Dawn Song. Provable data possession at untrusted stores. Association for Computing Machinery, 2007.
[24] Guangquan Xu, Longlong Rao, and Jiangang Shi. Novel android malware detection method based on multi-dimensional hybrid features extraction and analysis. Intelligent Automation and Soft Computing, 25, 2019.
[25] Shuai Wang, Yuyan Bao, Xiao Liu, Pei Wang, Danfeng Zhang, and Dinghao Wu. Identifying cache-based side channels through secret-augmented abstract interpretation, 2019.
[26] P. Puzio, R. Molva, Melek Önen, and S. Loureiro. ClouDedup: Secure deduplication with encrypted data for cloud storage. In IEEE International Conference on Cloud Computing Technology & Science, 2013.
[27] Vasily Tarasov, Amar Mudrankit, Will Buik, Philip Shilane, Geoff Kuenning, and Erez Zadok. Generating realistic datasets for deduplication analysis. In Proceedings of the 2012 USENIX Annual Technical Conference, 2012.