Detecting Suspicious File Migration or Replication in the Cloud

Adam Bowers†, Cong Liao‡, Douglas Steiert†, Dan Lin†, Anna Squicciarini‡, Ali Hurson§
† Department of Electrical Engineering and Computer Science, University of Missouri
Email: {acbqbd,djsg38,lindan}@missouri.edu
‡ Information Science and Technology, Pennsylvania State University
Email: {cxl491, asquicciarini}@ist.psu.edu
§ Department of Computer Science, Missouri University of Science and Technology
Email: [email protected]
Abstract—There has been a prolific rise in the popularity of cloud storage in recent years. While cloud storage offers many advantages such as flexibility and convenience, users are typically unable to tell or control the actual locations of their data. This limitation may affect users' confidence and trust in the storage provider, or even render cloud storage unsuitable for data with strict location requirements. To address this issue, we propose a system called LAST-HDFS, which integrates a Location-Aware Storage Technique (LAST) into the open-source Hadoop Distributed File System (HDFS). The LAST-HDFS system enforces location-aware file allocation and continuously monitors file transfers to detect potentially illegal transfers in the cloud. Illegal transfers here refer to attempts to move sensitive data outside the ("legal") boundaries specified by the file owner and its policies. Our underlying algorithms model file transfers among nodes as a weighted graph and maximize the probability of storing data items with similar privacy preferences in the same region. We equip each cloud node with a socket monitor that is capable of monitoring the real-time communication among cloud nodes. Based on the real-time data transfer information captured by the socket monitors, our system calculates the probability of a given transfer being illegal. We have implemented our proposed framework and carried out an extensive experimental evaluation in a large-scale real cloud environment to demonstrate the effectiveness and efficiency of our proposed system.

1 INTRODUCTION

With the ever-increasing popularity of cloud computing, the demand for cloud storage has also increased exponentially. Computing firms are no longer the only consumers of cloud storage and cloud computing; average businesses, and even end users, are taking advantage of the immense capabilities that cloud services can provide. While enjoying the flexibility and convenience brought by cloud storage, cloud users release control over their data and, in particular, are often unable to locate where their data actually resides; this could be in-state, in-country, or even out-of-country. Lack of location control may cause privacy breaches for cloud users (e.g., hospitals) who store sensitive data (e.g., medical records) that is governed by laws to remain within certain geographic boundaries and borders. Another situation where this problem arises is with governmental entities that require all data to be stored in the same country in which the government operates; this challenge has been complicated by cloud service providers (CSPs) quietly moving data out of the country or being bought out by foreign companies. For example, Canadian law demands that personally identifiable data must be stored in Canada. However, a large cloud infrastructure like the Amazon Cloud has more than 40 zones distributed all over the world [1], which makes it very challenging to provide guaranteed adherence to regulatory compliance. Even Hadoop, which historically has been managed as a geographically confined distributed file system, is now deployed at large scale across different regions (see Facebook Prism [2] or a recent patent [3]).

To date, various tools have been proposed to help users verify the exact location of data stored in the cloud [4]–[6], with emphasis on post-allocation compliance. However, recent work has acknowledged the importance of proactive location control for data placement consistent with adopters' location requirements [4], [7], [8], to allow users to have stronger control over their data and to guarantee the location where the data is stored.

In this work, we delve into one of the most widely adopted cloud data storage systems, the Hadoop Distributed File System (HDFS), and design an enhanced HDFS system called LAST-HDFS. LAST-HDFS extends HDFS' capabilities to achieve location-aware file allocation and file transfer monitoring. Specifically, LAST-HDFS provides the following new functions: (i) it consistently enforces location-aware data loading and storage by assigning datanodes according to user-specified privacy policies; (ii) it actively tracks and dynamically corrects possible data migration (due to balancing or data replication needs) within the cluster that might violate data placement policies; and (iii) it detects potentially illegal data migration by monitoring socket communication between individual datanodes and correlating it with the constraints imposed by the policy.
The idea of our approach is that, once data is allocated per users' location preferences, our framework monitors real-time file transfers in the cloud and is capable of detecting potentially illegal transfers. An illegal transfer in our context denotes moving sensitive data outside the legal boundaries specified by the file owner (e.g., storing a file in a physical location other than what the file owner desires). Our approach builds on the observation that users' location preferences are often consistent with privacy laws and regulations. As a result, files can be gathered into groups in which multiple users share similar, if not the same, location preferences. Accordingly, our system allocates cloud nodes based on the similarity of users' location preferences. More specifically, we model the file transfers among nodes as a weighted graph and then maximize the probability that files with similar privacy preferences will be stored in the same region. We then devise socket monitoring functions to monitor the real-time communication among cloud nodes. Based on our legal file transfer graph and the communication detected between the nodes, we are able to calculate the probability of a transfer being illegal. Figure 1 shows an overview of the proposed system, whereby the name node in HDFS is equipped with our proposed location-aware file allocator, and the data nodes are equipped with the proposed illegal file transfer detector that analyzes information collected by socket monitors.

Fig. 1: Architecture of the Name Node and Data Nodes

We carry out extensive experimental studies in both a real cloud testbed and a large-scale simulated cloud environment to demonstrate the efficiency and effectiveness of our proposed system. Experimental results confirm the correctness of location enforcement in the file uploading and load balancing processes with little computational overhead, as well as the capability to verify data placement under potential attack and multi-user scenarios through socket analysis.

The rest of the paper is organized as follows. Section 2 presents the use case of our proposed system. Section 3 briefly reviews the background of the Hadoop system. Then, Section 4 gives an overview of the proposed LAST-HDFS system, followed by detailed implementation algorithms in Section 5. Section 6 reports the experimental results. After that, Section 7 reviews related work in secure cloud storage. Finally, Section 8 concludes the paper and outlines future research directions.

2 USE CASE

In order to better motivate our work and clarify our goal, we discuss the impact of our proposed approach in realistic distributed cloud settings. Take, for instance, the case of data legally required to stay in its country of origin. In this case, false positives in particular are unacceptable due to contractual and service level agreement (SLA) bindings. A compelling example is government systems. With the recent Executive Orders [9] to further increase the security of our cyberspace, there is a push to move government data to cloud infrastructure. The reasoning behind this is limited physical infrastructure and the other common advantages such as flexibility, convenience, and scalability. However, since the data being stored on the cloud is governmental data, there are instances in which not everyone should be able to access the data, such as that from the Department of Defense (DoD), whereas Department of Energy (DoE) data is usually public. These two different data types, public and private, need to be stored differently. The private data can be set up on private cloud servers that can be assured to reside within the United States, while the public data may be stored in a public commercial cloud such as Amazon S3 and be analyzed using big-data analytic tools such as Hadoop in Amazon EC2. Yet, for such public governmental data, there still needs to be assurance that the data does not get processed or placed on nodes that reside outside the United States. To overcome this challenge, our proposed approach can be directly adopted by public cloud services such as Amazon EC2 to enhance location protection when Hadoop is used for data analysis. As for storage services like Amazon S3, our proposed approach can be adopted with some minor tweaks. We can introduce a name node manager as used in [4] to perform the proposed location-aware file allocation tasks and to validate transfers whose legality we are probabilistically unsure of. The data storage nodes will still be equipped with socket monitors to collect suspicious transfer information. Given that we know the probability of an illegal data transfer from one region to another, a transfer with high probability of being illegal will be sent to the name node for further inspection.

3 BACKGROUND ON HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

In this section, we provide some background information on the Hadoop Distributed File System (HDFS), which is the base of our system. HDFS [10] is an essential component of the open source Hadoop software. HDFS is a distributed file system designed to run on commodity hardware and to support distributed data storage and access by applications running on top of it with high throughput and fault tolerance. It adopts a master-slave architecture, which consists of a single namenode as the master node and multiple datanodes as the slave nodes. The namenode manages file system meta-data and orchestrates file accesses. The datanodes serve read/write requests issued
by clients and perform the actual read/write operations on disk blocks as instructed by the namenode. In what follows, we briefly review the data storage and load balancing mechanisms adopted by the current HDFS, since our proposed system revises these two functions to achieve location-aware storage.

3.1 Write Mechanism in HDFS

For a data owner (client) to upload a file to HDFS, he needs to first initiate a write request to the namenode asking to create a new file in HDFS. Once the namenode approves the request, the client will begin writing data to the stream, where the data is split into packets. Each packet represents a data block of the file that will be written to the datanodes. A separate thread in the client will pick up a packet and contact the namenode, from which a list of candidate datanodes will be returned to the client. Then, the client will send the write packet to the first datanode in the list, where the data block will be stored. Subsequently, the data block will be replicated to the following datanodes in the list in a pipeline manner.

3.2 Load Balancing in HDFS

Load balancing is of great importance to the overall performance of HDFS clusters, especially when a new datanode is added to the cluster or the disk space of certain datanodes is saturated. Hadoop provides a balancer tool that allows a cloud administrator to balance the disk space usage in an HDFS cluster. An outline of the load balancing process is described below:

1) The balancer partitions all the datanodes into two groups: (i) an under-utilized node group and (ii) an over-utilized node group, based on their data block usage reports.
2) The balancer randomly selects one datanode from each group to form a pair of nodes whose load will be balanced by transferring a certain amount of data from one to the other.
3) The balancer randomly selects a list of data blocks in the over-utilized datanode and transfers the data to the under-utilized datanode in the same pair.
4) The balancer iterates the above three steps until all the datanodes in the cluster reach a certain utilization threshold, i.e., the system achieves a balanced load.

4 AN OVERVIEW OF THE PROPOSED LAST-HDFS SYSTEM

In this section, we first lay out the system design goals and the threat model. Then, we present an overview of our proposed LAST-HDFS system.

4.1 Design Goals

We consider a cloud architecture similar to the Amazon Cloud, which is partitioned into multiple zones, where each zone contains a number of cloud nodes (e.g., 50,000 [11]). Each node supports a distributed file system such as HDFS (Hadoop Distributed File System). A file is typically partitioned into chunks of equal size, which are then replicated three times when stored in the cloud. In our work, we will simply refer to the "file chunks" as "files".

Our overarching goal is to enable HDFS to support location-aware data storage so that data owners' location privacy policies are strongly enforced when storing their data in the cloud. Recall that in the existing HDFS, the locations of a user-uploaded file are determined by two factors: (i) data replication for the purpose of fault tolerance, and (ii) load balancing to optimize cluster space utilization. In other words, users' file chunks will be replicated to multiple datanodes when the files are uploaded for the first time, and it is very likely that file blocks on saturated nodes may be transferred to under-utilized nodes at a later time. Therefore, in order to enforce users' location settings during the lifespan of their data in the cloud, we need to achieve the following design goals:

1) When uploading files to the cloud, users should be allowed to specify the location constraints (e.g., regions, countries) within which their data is allowed to be placed in the cloud.
2) The location constraints (i.e., location privacy policies) specified by the users should be consistently enforced during the data replication process.
3) The location constraints (i.e., location privacy policies) should also be consistently enforced during the load balancing process.
4) Any data movement (caused by malicious attacks) that violates the location constraints should be detected.

4.2 Threat Model

In our system, we consider the following three types of entities:

• File loader: It uploads files to the cloud on behalf of users.
• Namenode: It is the master node in HDFS, which manages the entire file system and also interacts with users.
• Datanodes: They are the nodes that actually store the user data.

Accordingly, we make the following assumptions in our threat model. The namenode is the core node in the system and is assumed fully trusted for the following reasons. There is typically one name node per cluster, with a couple of backups. That means the number of name nodes is far smaller than the number of data nodes. The name node controls all the file directories, which are extremely important to the service provider to ensure the availability of the whole cloud service, and hence the name node is typically much better protected and already closely monitored by the service provider. With that said, the namenode will faithfully handle requests from users. On the other hand, since the number of data nodes is huge, it is much more challenging for the service providers to keep track of the behavior of all the data nodes. Attacks on data nodes are more silent, frequent, and hard to notice. Thus, we do not assume all the datanodes are fully trusted. Compromised datanodes could intentionally transfer or copy users' data to other nodes that may reside outside the legal regions specified by the users. Attackers may do this for various purposes such as analyzing users' data for advertising,
selling users' data to obtain financial gains, stealing one's private information, or even using it to hide malicious data. Our proposed system aims to protect users' data from these kinds of attacks.

4.3 LAST-HDFS System Components

We now provide an overview of our proposed LAST-HDFS system. With respect to the goals and threat model described above, LAST-HDFS adds two new features to the existing HDFS: (i) location-aware file allocation and (ii) real-time file transfer analysis.

The location-aware file allocation feature is realized through three new components:

• Location-aware File Loading: This is a file loader that runs on the namenode (master node) to perform the file loading operations upon users' requests. Along with the file uploading request, the file loader will also accept the users' location privacy policies, if any. In the location privacy policy, the user can clearly specify which regions/countries are allowed to store their data and the maximum number of copies of the data that can be stored.
• Location-aware File Replication: This function is performed by the namenode in order to allocate datanodes that satisfy the user's location privacy policy. It replaces the original file replication function in HDFS. Specifically, once the user's request has been submitted to the namenode by the file loader program, the namenode will return a list of candidate nodes according to the required regions and the replication factor specified in the policy file. The policy will be enforced by the namenode such that only qualified datanodes will be selected and the number of chosen datanodes equals the replication factor. In case there is not enough space or a sufficient number of qualified datanodes available, the namenode will select only those that meet the criteria, reduce the replication factor accordingly, and inform the user about this.
• Location-aware Load Balancing: This function is also performed by the namenode to balance the loads on individual datanodes while enforcing location privacy policies. It replaces the original load balancer in HDFS. As we know, load balancing is essential to ensure the optimal performance of cloud storage services. Our proposed location-aware load balancer inherits this important property while taking into account the location privacy concerns. During the load balancing process, whenever a particular data block is selected to be moved or copied from one datanode to another, the location-aware load balancer will check the policy associated with the data and verify whether the destination datanodes are permitted by the policy. If not, another datanode will be selected and verified similarly until qualifying nodes are found.

The real-time file transfer analysis is realized with one new component called Host-based Socket Monitoring. The basic idea of the host-based socket monitoring is the following. Since the data transfer between the datanodes relies on socket communication, monitoring socket connections between individual datanodes within an HDFS cluster provides useful insight into how data is moved. The socket monitor is in charge of monitoring all of the communication that occurs between the nodes within the cloud infrastructure. Note that the socket monitor only detects packets transferred between nodes, not the content of the packets. The challenge here is to address the following question: "How can the cloud service provider optimally allocate storage nodes for users so that the socket monitoring component can easily conclude that there is a violation during a file transfer?"

Consider the example in Figure 2. Without loss of generality, we adopt the following format to represent users' location preferences regarding their data in the cloud.

Definition 1: Given a user u, his/her location privacy policy Pu is of the form Pu((f1, Υ1), (f2, Υ2), ..., (fk, Υk)), where fi represents user u's files and Υi is a set of regions in the cloud that are allowed to store fi.

Fig. 2: An Example of File Storage and Transfer in the Cloud

User u1 has a policy Pu1((f11, {R1, R2}), (f12, {R3})), user u2 has a policy Pu2((f21, {R1, R2})), and user u3 has a policy Pu3((f31, {R2, R3}), (f32, {R1, R3})). Assume that chunks of these three users' files are allocated to the available cloud nodes A, B and C based on individual privacy preferences as illustrated in the figure.

Suppose that the socket monitor at node A detects a file transfer from node A in region R1 to node C in region R3. Since both files f11 and f21 stored in node A are not allowed to be stored in region R3, we can easily conclude that this file transfer must be illegal even though the socket monitor does not know which file is transferred. The conclusion in this instance is based on the knowledge that no file in node A may be transferred to region R3, so there is no need to determine which files are being transferred. However, in another case, when the socket monitor at node B detects a file transfer from itself to node D in region R3, we would not be able to know for sure whether this file transfer is legal or illegal. This is because one of the file items (f31) at node B is allowed to be stored in region R3, whereas the other file f11 is not allowed to be stored at R3. If this file transfer is about f31, then it is legal. Otherwise, it is an illegal transfer. From the above example, we can observe that if our proposed file allocation method enables the cloud server (i.e., the name node in HDFS) to allocate files associated with similar location preferences together, our system would be more efficient at detecting illegal file transfers.
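To make the example above concrete, the following minimal Java sketch (illustrative only, not part of the LAST-HDFS implementation; the class, enum and method names are our own assumptions) classifies a detected transfer using nothing but the allowed-region sets of the files stored on the source node:

    import java.util.*;

    // Illustrative sketch: classify a detected transfer from a source node to a
    // destination region using only the per-file allowed-region sets on that node.
    public class TransferCheck {
        enum Verdict { DEFINITELY_ILLEGAL, DEFINITELY_LEGAL, UNCERTAIN }

        // allowedRegions: for each file stored on the source node, the set of regions
        // permitted by its owner's location privacy policy (Definition 1).
        static Verdict classify(List<Set<String>> allowedRegions, String destRegion) {
            long allowed = allowedRegions.stream()
                    .filter(r -> r.contains(destRegion)).count();
            if (allowed == 0) return Verdict.DEFINITELY_ILLEGAL;      // e.g., node A -> R3
            if (allowed == allowedRegions.size()) return Verdict.DEFINITELY_LEGAL;
            return Verdict.UNCERTAIN;                                  // e.g., node B -> R3
        }

        public static void main(String[] args) {
            // Node A from Figure 2 stores f11 and f21, both restricted to {R1, R2}.
            List<Set<String>> nodeA = List.of(Set.of("R1", "R2"), Set.of("R1", "R2"));
            System.out.println(classify(nodeA, "R3")); // DEFINITELY_ILLEGAL

            // Node B stores f11 ({R1, R2}) and f31 ({R2, R3}).
            List<Set<String>> nodeB = List.of(Set.of("R1", "R2"), Set.of("R2", "R3"));
            System.out.println(classify(nodeB, "R3")); // UNCERTAIN
        }
    }

The uncertain case is exactly what the location-aware allocation in Section 5 tries to avoid by grouping files with similar preferences on the same nodes.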
Fig. 3: An Example of the LP-tree

5 DETAILED ALGORITHMS FOR IMPLEMENTING THE LAST-HDFS SYSTEM

In this section, we present the detailed algorithms that support the two major functionalities in the proposed LAST-HDFS system: (i) location-aware file allocation and (ii) real-time file transfer analysis.

5.1 Location-aware File Allocation

We will present the algorithms first and then discuss the system implementation details.

5.1.1 Algorithms

Due to the increasing number of users adopting cloud services, large numbers of cloud storage requests are received continuously over time by cloud service providers. When a user has an incoming storage request, our proposed location-aware file allocation aims at finding the cloud nodes that store the files with location preferences most similar to that of the newly uploaded file, so as to help identify additional illegal file transfers in the future. A straightforward way to perform this step is to simply compare the location preference of the new data item with the location preferences of all the existing data items already stored in the cloud. However, considering the scale of the cloud, this naive solution is obviously very time consuming to carry out. Therefore, we propose an efficient approach, the Location Preference (LP) tree, to help speed up this process.

Our proposed LP-tree indexes the location preferences of the files stored in each cloud node. The tree is maintained and updated by the name node whenever there is an update to the file storage, such as cloud nodes or files being added or deleted. An example of the LP-tree is shown in Figure 3, whereby N0, ..., N5 denote the names of the nodes in the index, and the # symbol indicates the number of users who have their information indexed in the same index node. The leaf nodes of the tree contain the IDs of the cloud nodes which store files with similar geographical location preferences. The internal nodes of the LP-tree record the aggregated location preferences of their corresponding children nodes, so as to facilitate the search for suitable cloud nodes that have available space for incoming storage requests. The aggregated location preferences include two kinds of information: (i) the IDs of the allowed regions; and (ii) the number of data items associated with each allowed region. For example, as shown in Figure 3, the first entry in node N1 is (4R1, 5R2), which means that 4 files stored in the cloud nodes indexed by N1 have a location preference of region R1 and 5 files prefer region R2. The example LP-tree corresponds to the example in Figure 2. Specifically, cloud node A stores the data items f11 and f21, which have their location preferences constrained to R1 and R2. Assume that some other cloud nodes F and G store data items with the same location preferences as node A. Then, cloud nodes A, F and G are recorded in the same leaf node N2 in the tree as illustrated in Figure 3, where # = 4 indicates that the files belong to 4 different users. Similarly, node N3 in the tree shows that both of the cloud nodes B and E store data items with the same location preferences R2 and R3, and these files belong to 2 different users.
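The paper does not spell out the LP-tree data structures; the simplified Java sketch below (all names are illustrative assumptions, not the authors' code) shows one way an index node could keep the aggregated region counts just described and propagate updates toward the root:

    import java.util.*;

    // Simplified sketch of an LP-tree node holding the aggregated location
    // preferences described above: allowed-region IDs and the number of files
    // associated with each region. Leaf entries would additionally keep the IDs
    // of the cloud nodes they index.
    public class LPTreeNode {
        final Map<String, Integer> regionCounts = new HashMap<>(); // e.g., {R1=4, R2=5}
        final List<LPTreeNode> children = new ArrayList<>();
        LPTreeNode parent;

        // Record one more file whose policy allows the given regions, and propagate
        // the aggregation update to all ancestor nodes, e.g., (4R1, 5R2) becomes
        // (5R1, 6R2) after adding a file constrained to {R1, R2}.
        void addFile(Set<String> allowedRegions) {
            for (LPTreeNode n = this; n != null; n = n.parent) {
                for (String r : allowedRegions) {
                    n.regionCounts.merge(r, 1, Integer::sum);
                }
            }
        }

        // True if this subtree indexes every region the new file allows; used when
        // descending from the root to find candidate cloud nodes.
        boolean covers(Set<String> allowedRegions) {
            return regionCounts.keySet().containsAll(allowedRegions);
        }
    }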
The construction of the LP-tree is as follows. Starting from the first new data item uploaded to the cloud, the name node will look for an empty cloud node within the satisfying regions (e.g., R1 and R2 for f22). If an empty cloud node is found, the new data item will be stored in that cloud node. Since the LP-tree is empty at this point, a root node will be created. One of the root node's entries will be used to record this new data item's indexing information (e.g., the ID of the cloud node that stores this item). For subsequent insertions of data items into the LP-tree, the first step is to search the LP-tree to identify potential cloud nodes that store data items with the same location preferences as the new data item. If such a cloud node is found and has capacity to store the new item, the aggregation information in its parent node in the LP-tree will be updated to include the new item. For example, a new file with location preferences R1 and R2 may be stored in cloud node F, and we only need to update N1's aggregation information from (4R1, 5R2) to (5R1, 6R2). The update will also propagate to all the ancestor nodes.

If none of the cloud nodes indexed by the LP-tree has sufficient storage space, the name node will identify a new empty cloud node and create a new index entry. The new index entry will be inserted into the leaf node whose location preferences are the same as the new data item's. If a leaf node in the LP-tree is full, it will be split into two nodes and the aggregation information at the parent level will be adjusted. Such adjustment may propagate all the way up to the root. In this way, the LP-tree's height increases gradually. For example, if a new file with location preferences R1 and R2 arrives in the cloud, the name node will start checking the root node of the LP-tree. It will find that the first entry in the root node contains (R1, R2, R3, R4), which includes the new data item's location preferences of R1 and R2. Then, it retrieves
the child node N1 of that first entry. By inspecting all the entries in N1, it finds that the first entry of N1 contains the new data item's preferred locations, and hence checks all the cloud nodes indexed by N2, which are cloud nodes A, F and G. If A, F and G do not have sufficient space for the new data item, the name node will find an empty cloud node (say H). An index entry for H will be created and inserted into node N2. Suppose that N2 is full; then N2 will be split into two index nodes: one contains the entries for A and F, and the other contains the entries for G and H. A parent index entry for the newly created node will be inserted into node N1. Similar tree adjustments will occur in all the ancestor nodes of this new node.

Algorithm 1 Data Allocation
procedure AssignData(data, node, minMatch)        ▷ find node to add data to
    if node = null then return false
    else if node.children = null and node.estimateConfidenceLevel(data) > minMatch then
        node.addData(data)
        return true
    end if
    bestMatch = null
    bestScore = 0
    for child in node.children do
        score = child.estimateConfidenceLevel(data)
        if score > bestScore then
            bestMatch = child
            bestScore = score
        end if
    end for
    return AssignData(data, bestMatch, minMatch)
end procedure

In the case when none of the cloud nodes indexed by the LP-tree has the same location preferences as the new data item, we propose a metric called the confidence level to help identify the most suitable node for the new data item, so that such storage could maximize the chance of detecting non-compliant file transfers. As shown in Algorithm 1, the algorithm still starts the search from the root node of the LP-tree. Instead of only looking for exact matches of the location preferences, the algorithm also checks the nodes which partially match the new data item's location preferences. These tree nodes are examined in descending order of the number of matching location preferences. Once the best answer is found (e.g., an exact match), we do not need to check the remaining tree nodes. When the algorithm reaches a leaf node of the LP-tree, we compute the confidence level of the candidate cloud node by assuming that the new data item would be stored there (without actually storing it yet). The higher the confidence level, the more confidently a communication incurred between two nodes can later be judged as an allowed file transfer or not. We choose the cloud node with the highest confidence level to actually store the new file. The confidence level is computed as follows:

    C_{i→j} = \bar{S}_{i→j} / (S_{i→j} + \bar{S}_{i→j}),  if S_{i→j} ≤ \bar{S}_{i→j};
              S_{i→j} / (S_{i→j} + \bar{S}_{i→j}),        otherwise                     (1)

    DC_i = ( Σ_{j=1}^{N_r} C_{i→j} ) / N_r                                              (2)

Equation 1 computes the confidence level of detecting an illegal transfer from region i to region j. Without loss of generality, we assume each file chunk is of the same size. Then, S_{i→j} denotes the number of file chunks (including the new data item) in the cloud node under consideration that are allowed to be transferred from region i to region j based on the data owners' privacy preferences. In contrast, \bar{S}_{i→j} denotes the number of file chunks that are not allowed to be transferred to region j. The intuition behind the calculation is that the more files that are allowed (or not allowed) to be transferred to the same place, the easier it is to determine whether a file transfer is illegal. Specifically, if all files at region i are allowed to be transferred to region j, it is easy to conclude that any file transfer detected between these two regions later on should be legal, and hence the confidence level C_{i→j} of making this conclusion is as high as 1. On the other hand, if none of the files at region i are allowed to be stored at region j, it is also easy to conclude that any file transfer between these two nodes is illegal, and hence the confidence level C_{i→j} is also 1. Uncertainty is highest when half of the files at region i are allowed to be transferred to region j whereas the other half are not. In this case, the confidence level is lowered to 50%, which is similar to a random guess. After computing the confidence level of the candidate node i with respect to all other regions, we aggregate the confidence levels to obtain the final detection confidence (DC_i) in Equation 2, whereby N_r denotes the total number of regions specified in the files of the node being considered. Choosing the node with the highest DC_i value maximizes the chance of making the correct judgement on a file transfer.

Table 1 shows a simple example of the confidence level calculation, assuming there are 10 data items in total in the cloud node i. As reported, the confidence level is 1 in the case when 0 or all files are allowed to be transferred to region j. Similarly, when there is only one file which is allowed or disallowed to be transferred to region j, the confidence level is the same at 0.9. When half of the files are allowed (or disallowed) to be transferred to region j, the confidence level of detecting an illegal transfer decreases to the lowest value of 0.5, which is basically a random guess.

TABLE 1: Examples of Confidence Score Calculation
    S_{i→j}   C_{i→j}
    0         1
    1         0.9
    2         0.8
    3         0.7
    4         0.6
    5         0.5
    6         0.6
    7         0.7
    8         0.8
    9         0.9
    10        1
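As a sanity check of Equations 1 and 2, the short Java sketch below (illustrative, with hypothetical class and method names) computes the per-region confidence and the aggregated detection confidence; its output for a node holding 10 files reproduces the values in Table 1:

    // Sketch of Equations (1) and (2). For a candidate node, allowed is the number
    // of file chunks (including the new item) allowed to move from region i to
    // region j, and disallowed is the number that is not.
    public class ConfidenceLevel {
        static double edgeConfidence(int allowed, int disallowed) {
            double total = allowed + disallowed;
            // High confidence when nearly all files agree; 0.5 when they split evenly.
            return Math.max(allowed, disallowed) / total;
        }

        static double detectionConfidence(int[] allowed, int[] disallowed) {
            double sum = 0;
            for (int j = 0; j < allowed.length; j++) {
                sum += edgeConfidence(allowed[j], disallowed[j]);
            }
            return sum / allowed.length; // N_r = number of regions considered
        }

        public static void main(String[] args) {
            // Reproduces Table 1: 10 files in total on candidate node i.
            for (int s = 0; s <= 10; s++) {
                System.out.printf("S=%d -> C=%.1f%n", s, edgeConfidence(s, 10 - s));
            }
        }
    }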
To gain a better understanding of the algorithm, let us look at the following example. Assume that a new data item is requested to be confined to only regions R1 or R6, and should ideally not be transferred to any other regions. Based on the LP-tree in Figure 3, there is no node with an exact match of the location preferences of the new data item. Rather, there are several candidate nodes that fulfill part of this new file's location preferences, which are cloud nodes A, F, G and K. Nodes A, F, G contain 5 files with a location preference of R1 partially matching the new file's location preferences, and their detection confidence after adding the new file would be
73%. Node K contains only 2 files with location preferences partially matching the new file, and its detection confidence after considering the new file would be 67%. By comparing their detection confidences, we can find that nodes A, F, G are the better candidates.

With the aid of the LP-tree, we only need to check log(n) nodes in the LP-tree to locate candidate cloud nodes to store a newly uploaded file, where n is the total number of cloud storage nodes. The space complexity of the LP-tree in the worst case is log(C(r, x)), where r is the number of regions, x is the number of allowed regions per policy, and C(r, x) denotes the binomial coefficient "r choose x". The reason for this is that there are C(r, x) possible policies that need to be represented by the tree. It is worth noting that the actual policies can be stored on the hard drive. Only the top few levels (typically 2 or 3 levels) of the LP-tree (a few MB) need to be stored in main memory for quick retrieval.

5.1.2 System Implementation

To realize the proposed location-aware file allocation, we need to extend three components in the existing HDFS, as elaborated in the following.

Location-Aware File Loader

The file loader is a Java application program that takes data location policy files as input and prepares for the data replication on the specified nodes. Instead of using the FileSystem APIs that are normally designated for user programs, we leverage public APIs provided by the DFSClient class to improve efficiency. Specifically, one of the public methods, named create, has a particular input parameter FavoredNodes, which allows users to specify their preferred nodes for storing the data. Hence, this particular method is used by default by our file loader when handling user requests.

The data location policy file is designed as a simple text file containing multiple file entries. Each entry has the following format:

src path, dest path, replica, region ID

where src path is the file location on the local host, and dest path is the file location in HDFS. replica denotes the replication factor, which allows users to store multiple copies of data for the purpose of fault tolerance. Lastly, region ID denotes the locations where the data should be stored, represented by user-friendly texts such as "EAST US" or "WEST US". These representations are predefined and associated with a list of IP addresses of datanodes, respectively. The mapping between regions and datanodes is hard-coded in the file loader as described in Section 6.1.

Users have the choice of submitting their data either with or without a data location policy file. If a policy file is provided, the locations in the file will be extracted and serve as the input value of the parameter FavoredNodes in the corresponding API call. Otherwise, the location is considered null when the create method is invoked.
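For illustration, a minimal Java sketch of parsing such a policy file is given below; it is not the authors' file loader, the record and method names are assumptions, and the comma delimiter follows the format stated above. Resolving a region ID to its datanode IP addresses would still happen separately, via the hard-coded mapping mentioned above.

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    // Illustrative sketch: parse the data location policy file described above.
    // Each non-empty line is assumed to have the form:
    //   src_path, dest_path, replica, region_ID
    public class PolicyFileParser {
        record PolicyEntry(String srcPath, String destPath, short replication, String regionId) {}

        static List<PolicyEntry> parse(Path policyFile) throws IOException {
            List<PolicyEntry> entries = new ArrayList<>();
            for (String line : Files.readAllLines(policyFile)) {
                if (line.isBlank()) continue;
                String[] f = line.split(",");
                entries.add(new PolicyEntry(f[0].trim(), f[1].trim(),
                        Short.parseShort(f[2].trim()), f[3].trim()));
            }
            return entries;
        }

        public static void main(String[] args) throws IOException {
            // Example entry: /data/report.csv, /user/report.csv, 3, EAST US
            for (PolicyEntry e : parse(Path.of(args[0]))) {
                System.out.println(e);
            }
        }
    }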
Location-Aware Replicator

This step aims to store the user data at the specified locations. When the create method is invoked by the file loader, a request is sent to the namenode asking for a list of datanodes to store the data, as described with the write mechanism of HDFS in Section 3.1. The datanodes are selected according to the class BlockPlacementPolicy. In Hadoop's default implementation of BlockPlacementPolicy, candidate datanodes are first drawn from the list of FavoredNodes specified by the user. However, there is no guarantee that a candidate datanode will actually be selected unless it meets a series of criteria, e.g., enough space and low network latency. In case of disqualified candidate datanodes in the FavoredNodes list, additional datanodes will be selected from nodes that do not belong to the preferred list, in order to make sure that the number of returned datanodes equals the replication factor. As a result, it is possible that some copies of user data will be stored in locations that violate the data location policy.

To enforce the location policy in the process of data replication, we extend the default implementation of BlockPlacementPolicy and override the original procedure of selecting candidate datanodes. In our design, if at least one candidate datanode from the FavoredNodes list is disqualified, we reduce the replication factor to the number of datanodes that are eventually selected by the namenode, instead of selecting other possible datanodes outside the scope of the FavoredNodes list. As for changing the replication factor, we leverage the Hadoop shell command hdfs setrep. Specifically, we use the -w option so that the change only takes effect after the replication process has ended. In this way, we can ensure that the data location policy is consistently enforced in the replication process. For files whose replication factor cannot be met at the initial uploading, our system invokes the location-aware replication process again whenever resources are released, in order to eventually produce the desired number of copies.
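The plain-Java sketch below illustrates the selection logic just described (keep only qualified favored nodes and cap the effective replication factor). It deliberately does not use the real BlockPlacementPolicy API; all types, names, and the qualification predicate are illustrative assumptions rather than the authors' implementation.

    import java.util.*;
    import java.util.function.Predicate;

    // Illustrative sketch of the replicator's selection logic: keep only favored
    // datanodes that qualify, and cap the effective replication factor instead of
    // falling back to datanodes outside the user's preferred regions.
    public class FavoredNodeSelection {
        record Placement(List<String> chosenNodes, short effectiveReplication) {}

        static Placement choose(List<String> favoredNodes,
                                short requestedReplication,
                                Predicate<String> qualifies) { // space, latency, region checks
            List<String> chosen = new ArrayList<>();
            for (String node : favoredNodes) {
                if (chosen.size() == requestedReplication) break;
                if (qualifies.test(node)) chosen.add(node);
            }
            // If fewer nodes qualify, reduce the replication factor rather than
            // spilling onto nodes outside the policy; the reduced factor would then
            // be applied with the hdfs setrep command as described above.
            return new Placement(chosen,
                    (short) Math.min(requestedReplication, chosen.size()));
        }
    }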
Location-Aware Load Balancer

During data processing in Hadoop, load balancing may occur once in a while to maximize system performance. If we rely on the default Hadoop load balancer, user data may be moved to nodes that do not satisfy the data location policies, since the default Hadoop load balancer does not consider location privacy issues. In order to consistently enforce the location policy during load balancing, we enhanced the Hadoop load balancer by adding an additional procedure to check whether the outgoing location of the selected data block on the over-utilized node conforms to the policy specified in the data location configuration file. In particular, we add
the isDataBlockLocationValid function to the current implementation of the Hadoop balancer, as illustrated in Figure 4.

Fig. 4: Work Flow of isDataBlockLocationValid

The challenge in implementing the above process is to find the mapping from data blocks to filenames, because this mapping is not exposed by the programmable API in HDFS. To address this issue, we studied the Hadoop tool fsck, which can provide the current storage status of HDFS, and identified a useful command that can serve our purpose. The options are -files, -blocks, -locations, with which the data blocks and corresponding locations of a given file become known. Specifically, for those files protected by data location policies, we execute the following command:

hdfs fsck <path> -files -blocks -locations

where <path> denotes the absolute path of the file in HDFS. It yields a list of data blocks and their whereabouts for the given file as output. We further process the output to create the mapping from the data blocks to the filename and store the mapping in a file. The file is loaded when we launch our balancer tool. The above process is encapsulated in a script, and the code for parsing the fsck output is implemented in the ParseFsckOutput Java program.
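The script and the ParseFsckOutput program themselves are not reproduced in this excerpt; the rough Java sketch below is our own illustration (with assumed regular expressions, not the authors' code) of how fsck output shaped like the excerpt in Section 6.1 could be reduced to a block-to-filename mapping:

    import java.util.*;
    import java.util.regex.*;

    // Rough sketch (not the authors' ParseFsckOutput program) of turning
    // "hdfs fsck <path> -files -blocks -locations" output into a map from block ID
    // to the HDFS file it belongs to. It assumes output lines shaped like:
    //   /user/file1.txt 1572864 bytes, 1 block(s): OK
    //   0.BP-...:blk_1073741843_1019 len=1572864 ... [218.193.126.202:50010]
    public class FsckOutputSketch {
        private static final Pattern FILE_LINE =
                Pattern.compile("^(/\\S+) \\d+ bytes, \\d+ block\\(s\\)");
        private static final Pattern BLOCK_LINE = Pattern.compile("(blk_\\d+_\\d+)");

        static Map<String, String> blockToFile(List<String> fsckOutput) {
            Map<String, String> mapping = new HashMap<>();
            String currentFile = null;
            for (String line : fsckOutput) {
                Matcher f = FILE_LINE.matcher(line);
                if (f.find()) { currentFile = f.group(1); continue; }
                Matcher b = BLOCK_LINE.matcher(line);
                if (b.find() && currentFile != null) {
                    mapping.put(b.group(1), currentFile); // block ID -> HDFS file name
                }
            }
            return mapping;
        }
    }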
5.2 Real-time File Transfer Analysis

We now proceed to discuss how to perform real-time file transfer analysis to capture potentially illegal transfers.

5.2.1 Illegal File Transfer Detection Algorithm

The main idea of the illegal file transfer detection is to build a dynamic Legal File Transfer (LFT) graph along the file allocation process to capture the possible file transfer routes in the cloud. Then, by comparing a detected file transfer with the expected routes in the LFT graph, we estimate the probability of the detected file transfer being illegal or not. The formal definition of the LFT graph is given below:

Definition 2: The Legal File Transfer graph is a weighted directed bipartite graph LFT = (V ∪ R, E), where V denotes the set of cloud nodes, R denotes the cloud regions, and E denotes the edges from the cloud nodes to the cloud regions. Each edge e_{ij} is associated with a weight value w_{ij} = S_{i→j} / (S_{i→j} + \bar{S}_{i→j}), which indicates the percentage of files located at cloud node i that are allowed to be transferred to cloud region j.

In the LFT graph, each edge indicates a possible file transfer route from a cloud node to a cloud region. The weight value on the edge further helps to determine the probability that a file transfer could be illegal.

Recall the example in Figure 2. User u1 has a policy Pu1((f11, {R1, R2}), (f12, {R3})), user u2 has a policy Pu2((f21, {R1, R2})), and user u3 has a policy Pu3((f31, {R2, R3}), (f32, {R1, R3})). Figure 5 shows an example of a partial LFT graph built based on this running example. Each region in the cloud is represented as a node in the graph, such as nodes R1, R2 and R3 in the figure, and each cloud node that has been used for storage is also represented as a node in the graph, such as nodes A, B, C, D, E, and F. If files stored in a cloud node A are allowed to be transferred from region R1 to R2, then there are edges in the LFT graph from A to R1 and R2. The weight value on the edge A → Ri indicates the percentage of the files that are allowed to be transferred from A to Ri. Since all the files in node A are allowed to be transferred to R1 and R2, the weight values are 1. Similarly, the edge from a node to its own region always has a weight value of 1, because the files are definitely allowed to be transferred within their current region. In the example, node C stores files f12 and f32. Since only f32 (not f12) is allowed to be stored in region R1, the edge from cloud node C to R1 is weighted 0.5.

Fig. 5: An Example of the Legal File Transfer Graph

The LFT graph can be incrementally updated during the file allocation process. When a new file has been assigned a cloud node (say A) for storage, the name node will check whether its location privacy policy introduces a new allowable cloud region (say R3) that has not been recorded in the graph yet. If so, a new edge from the cloud node A to the cloud region R3 will be added. Otherwise, the name node just needs to update the weight value on the edge that is related to the new file. In the case when a file has been removed, a similar process can
be conducted to reflect the change in the LFT graph.

Based on the graph, we can assess the compliance (or lack thereof) of a file transfer as follows. When the socket monitor at a cloud node (say A) detects a communication from A to B, the system will search the legal file transfer (LFT) graph. If there is no edge between node A and the region that node B belongs to, we can conclude that this transfer is illegal with 100% confidence. If there is an edge between these two nodes, we compute the probability of this transfer being illegal as 1 − w_{A→B}, where w_{A→B} is the weight value on the edge from node A to the region of node B. If the probability is higher than a threshold ξ, our system will report it as a highly suspected transfer. In the experiments, we set the threshold to 90%, which yields a very high detection rate.
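A minimal Java sketch of this decision rule is shown below; the LFT graph is assumed to be available as a map from a cloud node to its per-region edge weights, and the class and field names are illustrative assumptions:

    import java.util.*;

    // Minimal sketch of the decision rule above: given the LFT edge weights
    // w(node -> region), estimate the probability that a detected transfer is
    // illegal and flag it when the probability exceeds the threshold ξ.
    public class IllegalTransferDetector {
        private final Map<String, Map<String, Double>> weights; // node -> (region -> w)
        private final double threshold; // ξ, set to 0.9 in the experiments

        IllegalTransferDetector(Map<String, Map<String, Double>> weights, double threshold) {
            this.weights = weights;
            this.threshold = threshold;
        }

        double illegalProbability(String sourceNode, String destRegion) {
            Double w = weights.getOrDefault(sourceNode, Map.of()).get(destRegion);
            // No edge in the LFT graph: the transfer is illegal with 100% confidence.
            return (w == null) ? 1.0 : 1.0 - w;
        }

        boolean isHighlySuspected(String sourceNode, String destRegion) {
            return illegalProbability(sourceNode, destRegion) > threshold;
        }
    }

Highly suspected transfers would then be forwarded to the name node for further inspection, as described in Section 2.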
5.2.2 Host-based Socket Implementation

The proposed real-time file transfer analysis relies on the collected communication information among cloud nodes. To collect such information, we equip each cloud node with a socket monitor. The socket monitor aims to intercept information sent out of the datanode so that, by analyzing such information, we will know if data has been transferred to disqualified locations. The socket monitor is implemented as a threaded Java program running on the data hosts (i.e., individual datanodes). It has two major functionalities: capturing socket connections on the host and storing the captured information in a centralized database on the master node.

To capture socket connections, we employ netstat, which is a tool in Linux [12]. All datanodes in our system have netstat available in the operating system. When executing the command "netstat" on a terminal, a list of existing socket connections on the host is generated as output. We parse consecutive netstat outputs taken a very short time apart to extract detailed information (e.g., source IP, destination IP, protocol, process name or ID, timestamp) about each opened or closed socket connection. In particular, the actual command we use is:

netstat -t -n -a --inet --program

which gives us TCP socket connections on both listening and non-listening ports, with IP and port in numeric format and the process name or ID displayed.
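A simplified sketch of such a capture loop is shown below; it runs the netstat command quoted above through the standard ProcessBuilder API and repeats every 100 ms, while the parsing of the raw lines and the push to the central MySQL database are omitted. The class name is illustrative and this is not the authors' monitor code:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    // Simplified sketch of the socket capturer loop: run netstat with the flags
    // quoted above, collect its raw output, and repeat every 100 ms. Parsing the
    // lines into (source IP, destination IP, process, timestamp) records and
    // pushing them to the central database are omitted here.
    public class SocketCapturerSketch {
        static List<String> captureOnce() throws Exception {
            Process p = new ProcessBuilder("netstat", "-t", "-n", "-a", "--inet", "--program")
                    .redirectErrorStream(true).start();
            List<String> lines = new ArrayList<>();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) lines.add(line);
            }
            p.waitFor();
            return lines;
        }

        public static void main(String[] args) throws Exception {
            while (!Thread.currentThread().isInterrupted()) {
                List<String> snapshot = captureOnce();
                // ... compare with the previous snapshot and record new or closed connections ...
                Thread.sleep(100); // the workflow repeats every 100 ms
            }
        }
    }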
Figure 6 outlines the workflow of our proposed socket monitor. As shown in Figure 6, our socket capturer executes the predefined netstat command shown above, and the socket parser processes the corresponding output to extract useful socket information. The whole workflow is executed every 100 ms in a loop, and thus we achieve continuous monitoring of file transfers.

Fig. 6: Work Flow of Socket Monitor

The socket monitor keeps track of all the socket connections established on the host, especially those related to Hadoop daemons (namenode, datanode, etc.). The collected socket information is sent to and stored in a centralized MySQL database on the master node. Moreover, location policy files submitted by the users are also kept in the database for post-mortem analysis. The overall system setup is shown in Figure 7.

Fig. 7: Hadoop Cluster and Socket Monitoring Flow

Other network analysis commands such as lsof may be used in a similar way. It is worth noting that although some commands like lsof are able to detect the names of the files being transferred, the files detected at the OS level do not map one-to-one to single user files in HDFS. Instead, these detected files could include the contents of multiple users' files in HDFS. This is because users' original files are partitioned and physically stored in files with entirely different names and formats managed by HDFS. Therefore, even though a command like lsof can obtain information about opened files on a machine, it does not directly help differentiate which user's file in HDFS is being transferred via sockets. This is why we choose netstat, which is sufficient to complete the task and retrieves minimal information, for efficiency.

6 EXPERIMENTAL STUDY

In order to evaluate the efficiency and effectiveness of our proposed system, we carry out a series of experiments on a real cloud testbed as well as in a large-scale simulated cloud environment.

The real cloud testbed consists of 16 virtual machines on a VMWare virtual platform with Intel(R) Xeon(R) E5440 2.83 GHz CPUs and 8GB memory, running Ubuntu 12.04 Linux OS. One VM acts as the master node and the rest are slave nodes, i.e., datanodes in HDFS. All the datanodes are assigned to different regions identified by region ID, e.g., region1, region2. Each region consists of three datanodes. The region ID is used in the location policy file to specify the desired
location. The mapping between region ID and datanode IP address is summarized in Table 2. Each virtual machine is installed with our proposed LAST-HDFS system, which is built upon Hadoop 2.6.0. On each individual slave node, we also deploy a socket monitor running as an independent daemon service. In addition, we set up a MySQL database on the master node to store socket information and the location policy files submitted by the users.

TABLE 2: Mapping Between Region ID, Node ID and IP Address
    Region ID   Node ID              IP Address
    master      master-node          218.193.126.201
    region1     slave-node-{1∼3}     218.193.126.{202∼204}
    region2     slave-node-{4∼6}     218.193.126.{205∼207}
    region3     slave-node-{7∼9}     218.193.126.{208∼210}
    region4     slave-node-{10∼12}   218.193.126.{211∼213}
    region5     slave-node-{13∼15}   218.193.126.{214∼216}

The real cloud testbed is focused on evaluating the efficiency of our proposed system. In order to test the accuracy of detecting illegal file transfers in various scenarios that may exist in the real world, we also build a simulated large-scale cloud network that resembles the Amazon Global Infrastructure [1]. Specifically, we simulate about 40 regions (called zones by Amazon). In each region, we randomly generate a set of cloud nodes, and we vary the total number of nodes in the whole cloud from 1,000 to 20,000. Each user file is associated with a location privacy policy that specifies up to five randomly generated regions. Then, we simulate legal (or illegal) file transfers by randomly selecting a file transfer destination node that satisfies (or does not satisfy) the location policy of the file.

6.1 Performance of Location-aware File Loader and Replicator

The first round of experiments is conducted in the real cloud setup. We aim to evaluate two aspects of our proposed location-aware file loader and replicator: (i) whether or not our system correctly enforces the location privacy policies for each file; and (ii) what performance overhead is introduced by our location-aware file loading and replicating compared to the original HDFS.

First, we carry out a proof-of-concept experiment to check whether the policy is enforced at the time a file is uploaded. We choose two files with sizes of 1.4 MB and 1.5 MB. Each file has a different location policy. For simplicity, we set the replication factor to 1, since it is not a relevant variable for this experiment. The location policy being tested is shown below.

∼/file1.txt /user/file1.txt 1 region1
∼/file2.txt /user/file2.txt 1 region2

According to this policy, file1 and file2 will be uploaded to different locations indicated by region1 and region2. To check correctness, we first run our file loader with the above policy, and then check the final file locations with the following HDFS shell command:

hdfs fsck <file> -files -locations -blocks

where <file> denotes the file path in HDFS. The output of the above command is the following:

/user/ <dir>
/user/file1.txt 1572864 bytes, 1 block(s): OK
0.BP-1231142416-127.0.1.1-1445285059107:blk_1073741843_1019 len=1572864 rep1=1 [218.193.126.202:50010]
/user/file2.txt 1423803 bytes, 1 block(s): OK
0.BP-1231142416-127.0.1.1-1445285059107:blk_1073741844_1020 len=1423803 rep1=1 [218.193.126.207:50010]

The results explicitly show that file1 and file2 are stored in different datanodes, indicated by the IP addresses, and each datanode belongs to a different region in our setting. Hence, the policy is indeed enforced during the process of file uploading. We then repeated this experiment using various policies and file types, and all the uploading operations were correct.

Next, we test the performance of LAST-HDFS by comparing it with the default file uploading method supported by HDFS. Specifically, files are uploaded in HDFS by the HDFS shell command:

hdfs dfs -copyFromLocal <localsrc> URI

where <localsrc> is the source path of the file in the local system and URI is the destination path in HDFS.

In this test, we use five different files with sizes of 2.2 GB, 4.4 GB, 6.6 GB, 8.8 GB, and 11 GB. In addition, we also vary the replication factor from 1 to 15 for each file. For each combination of file size and replication factor, we run the default command and our file uploader with a policy file on the master node, respectively, and measure the elapsed time of the file uploading process. The result is shown in Figure 8.

Fig. 8: Performance Comparison Between Two Methods (elapsed time in mm:ss versus replication factor, for the default uploading method and LAST-HDFS on files of 2.2 GB to 11 GB)

As we can see from the result, for a given replication factor, it takes more time to replicate larger files than smaller ones to multiple datanodes, which is reasonable under the assumption of similar network conditions between datanodes. For individual files, the replication period is prolonged as the replication factor increases. This is due to the replication mechanism, in which data is replicated to multiple datanodes in a pipeline manner.

6.2 Performance of Location-aware Load Balancing

We also evaluate the location-aware load balancer in the real cloud testbed. Recall that we extend the default load balancer to comply with location constraints during data movement. We
11

6.2 Performance of Location-aware Load Balancing

We also evaluate the location-aware load balancer in the real cloud test. Recall that we extend the default load balancer to comply with location constraints during data movement. We aim to check whether the location policy is also enforced by our proposed balancer, as described in Section 5.1.2.

In order to closely control the saturation level of individual datanodes, we conduct this test in a local three-node cluster, with one master node and two slave nodes. The cluster consists of three machines, each with two Intel(R) Xeon(R) X5550 2.67 GHz CPUs, 48 GB of memory, and Ubuntu 12.04 LTS Linux OS. Each machine is installed with the same software as in the large testbed described earlier. The two slave nodes represent two regions, respectively.

We employ 10 files, each 2.2 GB in size. We set the ratio of files with location policies to 50%, meaning that half of the data is subject to location constraints. We start the Hadoop cluster with only one datanode, denoted region1, and upload the files without location policies to the cluster using the default command. For the remaining files, we run our file loader with a policy stating that they should be uploaded to region1. Then, we add another datanode to the cluster as region2. As a result, one region is saturated compared to the other. Lastly, we launch our extended balancer and check the location changes of every file afterwards using the fsck command, whose output is parsed and summarized in Table 3.

TABLE 3: Block Location Comparison Before and After Load Balancing

File ID | Total Blocks | Blocks Before [# in region1, # in region2] | Blocks After [# in region1, # in region2]
1       | 35           | 35, 0                                      | same
2       | 35           | 35, 0                                      | same
3       | 35           | 35, 0                                      | same
4       | 35           | 35, 0                                      | same
5       | 35           | 35, 0                                      | same
6       | 35           | 35, 0                                      | 29, 6
7       | 35           | 35, 0                                      | 21, 14
8       | 35           | 35, 0                                      | 30, 5
9       | 35           | 35, 0                                      | 30, 5
10      | 35           | 35, 0                                      | 20, 15

As we can see from Table 3, the blocks of the first five files stay in region1, while some blocks of the remaining five files were moved to region2 after the load balancing operation. Therefore, location policies are indeed enforced by the load balancer during data movement.
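As a concrete illustration of the parse-and-summarize step above, the sketch below invokes the standard hdfs fsck command for each file and tallies block replicas per region. The mapping from datanode address to region (DATANODE_REGION) is an assumed lookup table for this example and not something HDFS provides; the regular expression simply pulls the ip:port pairs that appear in the fsck block report.

    import re
    import subprocess
    from collections import Counter

    # Assumed mapping from datanode IP to region; in a real deployment this
    # would come from the cluster inventory, not from HDFS itself.
    DATANODE_REGION = {"10.0.0.11": "region1", "10.0.0.12": "region2"}

    def block_locations_per_region(hdfs_path):
        """Run 'hdfs fsck <path> -files -blocks -locations' and count the
        block replicas reported for each region."""
        report = subprocess.run(
            ["hdfs", "fsck", hdfs_path, "-files", "-blocks", "-locations"],
            capture_output=True, text=True, check=True).stdout
        counts = Counter()
        for ip in re.findall(r"(\d+\.\d+\.\d+\.\d+):\d+", report):
            counts[DATANODE_REGION.get(ip, "unknown")] += 1
        return counts

    if __name__ == "__main__":
        for i in range(1, 11):
            path = "/user/file%d" % i   # hypothetical file names
            print(path, dict(block_locations_per_region(path)))

Running such a script before and after the balancer is one way to obtain the per-region block counts summarized in Table 3.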
We also assess how enforcing location constraints affects the performance of common load balancing tasks. We conduct the same test mentioned above but vary the ratio of files with location constraints from 20% to 80%. At each round, we measure the elapsed time taken by our extended load balancer to move data across nodes according to the location constraints applied to the data. We compare it against the overhead of the default Hadoop balancer in the absence of such constraints. For each experiment, we run our balancer and the default one five times each. The overall performance is reported in Table 4.

TABLE 4: Balancer Performance Under Different Policy Ratios

Policy Ratio          | 0%     | 20%    | 40%    | 60%    | 80%
Avg Time (unit: hour) | 1.6514 | 1.6572 | 1.6856 | 1.6856 | 1.6532

As we can see from Table 4, there is no significant difference between the time taken by our proposed load balancer under the various policy ratios and that of the default balancer. Hence, our proposed load balancer does not introduce extra overhead.

6.3 Accuracy of Illegal File Detection

In this subsection, we aim to demonstrate how the proposed location-aware file allocation strategy helps with illegal file transfer detection. Since the real cloud testbed has only a limited number of nodes, which is not sufficient to cover the variety of file transfer scenarios that may occur in a large-scale real-world cloud such as the Amazon cloud, we adopt a simulated cloud environment that resembles the structure of the Amazon cloud, with 40+ regions and thousands of nodes. The specific parameters of the simulated cloud are introduced in each corresponding experiment.

We compare our system with a baseline approach that simply assigns each individual file to a random node in one of the regions specified in its location privacy policy. The detection accuracy is defined as the rate at which file transfers, both legal and illegal, are classified correctly. Specifically, after files are allocated to the cloud nodes by both approaches, we simulate both illegal and legal file transfers. Illegal transfers move files to regions that are not allowed by the corresponding location policies, whereas legal file transfers move files to regions that are allowed. Let N_{illegal} and N_{legal} denote the number of illegal and legal transfers, respectively. Let N_{i→i} denote the number of illegal transfers that are detected, and N_{l→l} denote the number of legal transfers that are correctly marked as legal. The detection accuracy is defined as follows:

Correctness = \frac{N_{i→i} + N_{l→l}}{N_{illegal} + N_{legal}}    (3)
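To make the metric concrete, the following is a small sketch of how Equation (3) can be computed from a list of simulated transfers and the detector's verdicts; the Transfer fields are illustrative only.

    from dataclasses import dataclass

    @dataclass
    class Transfer:
        is_illegal: bool        # ground truth of the simulated transfer
        flagged_illegal: bool   # verdict produced by the detection system

    def detection_correctness(transfers):
        """Equation (3): (N_{i->i} + N_{l->l}) / (N_illegal + N_legal)."""
        n_ii = sum(1 for t in transfers if t.is_illegal and t.flagged_illegal)
        n_ll = sum(1 for t in transfers if not t.is_illegal and not t.flagged_illegal)
        return (n_ii + n_ll) / len(transfers) if transfers else 0.0

    if __name__ == "__main__":
        # Three illegal transfers (two detected) and two legal ones (one passed).
        sample = [Transfer(True, True), Transfer(True, True), Transfer(True, False),
                  Transfer(False, False), Transfer(False, True)]
        print(detection_correctness(sample))  # 3 correct out of 5 -> 0.6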
It is important to note that in any case, neither approach will have a perfect (100%) detection correctness rate. This is because our correctness function considers both false positives and false negatives, and it is common that some suspicious file transfers are simply false alarms.

6.3.1 Effect of the Number of Files to Be Stored in the Cloud

In this round of experiments, we vary the total number of data files from 100K to 1M. Without loss of generality, we assume each file has the same size. The total number of cloud regions is set to 40 and the total number of cloud nodes is 1,000. Each cloud node can store up to 1,000 files. Each data file specifies up to 5 allowable regions in its location privacy policy. We simulated 1,000 legal file transfers and 1,000 illegal file transfers. The performance of our approach is compared against the baseline described previously.
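The sketch below shows one way such a simulated workload could be generated: each file receives a random policy with up to 5 allowed regions out of 40, and equal numbers of labeled legal and illegal transfers are drawn. The parameters mirror this experiment, but the generator itself is our own illustration rather than the simulator used here.

    import random

    NUM_REGIONS = 40
    NUM_FILES = 100_000            # varied from 100K to 1M in this experiment
    NUM_LEGAL = NUM_ILLEGAL = 1000

    def random_policy():
        """Each file allows up to 5 of the 40 regions."""
        return set(random.sample(range(NUM_REGIONS), random.randint(1, 5)))

    def simulate_transfers(policies):
        """Return (file_id, destination_region, is_illegal) tuples."""
        transfers = []
        for _ in range(NUM_LEGAL):
            f = random.randrange(len(policies))
            transfers.append((f, random.choice(sorted(policies[f])), False))
        for _ in range(NUM_ILLEGAL):
            f = random.randrange(len(policies))
            forbidden = [r for r in range(NUM_REGIONS) if r not in policies[f]]
            transfers.append((f, random.choice(forbidden), True))
        return transfers

    if __name__ == "__main__":
        policies = [random_policy() for _ in range(NUM_FILES)]
        print(len(simulate_transfers(policies)), "simulated transfers")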
Figure 9 reports the detection correctness. As we can see, our LAST-HDFS has much higher detection accuracy than the baseline approach. This is because the baseline approach considers each file independently when allocating it, and hence files with different privacy preferences are very likely to be placed in the same cloud node.
For example, a cloud node A may store a file f1 that is allowed to be transferred to node B, but also another file f2 that is not allowed to be transferred to B. As a result, when a communication with node B is detected, it is very hard for the baseline approach to ascertain whether this communication is transferring file f1, which is legal, or file f2, which is illegal. Our proposed LAST-HDFS system considers multiple files' privacy preferences simultaneously and allocates them in a way (i.e., through the use of the LFT graph) that helps effectively detect illegal transfers. Therefore, we achieve much higher accuracy.

Fig. 9: Detection Correctness When Varying the Number of Files

6.3.2 Effect of the Percentage of Illegal Transfers

In this set of experiments, we aim to study the effect of the proportion of illegal file transfers among the total file transfers in the cloud. The total number of files is set to 100,000. The total number of cloud regions is 40 and the total number of cloud nodes is 1,000. We simulated a total of 2,000 file transfers and vary the percentage of illegal file transfers from 1% to 50%. Figure 10 reports the results.

Fig. 10: Impact of Percentage of Illegal File Transfers

As shown, when the percentage of illegal transfers is small, ranging from 1% to 10%, our LAST-HDFS approach achieves a much higher detection rate than the baseline. This is because the baseline mixes files with different location preferences in the same node, which makes it very challenging to distinguish legal and illegal transfers without knowing the communication content between two nodes. The baseline approach reports any transfer from one region to another as illegal so long as one file in the same cloud node does not have the destination node in its location preference list. Therefore, the baseline approach will detect 100% of illegal transfers. However, for a legal transfer, the baseline approach is essentially taking a random guess, and the detection probability would be around 50%. In a real cloud, where there are very few illegal transfers, the baseline approach will generate too many false positives and hence is not suitable for illegal file transfer detection. Compared to the baseline, LAST-HDFS reports 99.9% of illegal transfers and has a lower false positive rate of around 30% when illegal transfers are few. This is attributed to the better file allocation strategy adopted by LAST-HDFS, which enhances the chance of making the correct judgment. Moreover, by reporting all suspected file transfers, the baseline also increases the workload at the name node, which has to examine every suspected transfer one by one. In contrast, our proposed LAST-HDFS significantly reduces the fine-grained examination needed at the name node.
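The node-level decision rule attributed to the baseline above can be written compactly: a transfer from a node toward a destination region is flagged whenever any file stored on that node disallows that destination. The function below is our own sketch of that rule under the stated assumptions, not code from LAST-HDFS.

    def baseline_flags_transfer(files_on_node, dest_region):
        """Baseline rule: flag the transfer as suspicious if at least one file on
        the source node does not allow the destination region in its policy.
        files_on_node is a list of sets, each holding a file's allowed regions."""
        return any(dest_region not in allowed for allowed in files_on_node)

    if __name__ == "__main__":
        # Node A stores f1 (allowed to go to regionB) and f2 (not allowed).
        node_a = [{"regionA", "regionB"},   # f1's allowed regions
                  {"regionA"}]              # f2's allowed regions
        # Any communication toward regionB is flagged, even when it carries f1,
        # so legal transfers of f1 become false positives.
        print(baseline_flags_transfer(node_a, "regionB"))  # True

Grouping files with similar policies on the same node, as LAST-HDFS does, shrinks the set of destinations that can trigger this rule and therefore reduces such false positives.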
6.3.3 Effect of the Total Number of Cloud Nodes

Next, we study the scalability of our approach by increasing the total number of cloud nodes up to 20,000. We fix the total number of files to 250,000 and the total number of regions to 40, as in the previous experiments. The number of legal and illegal file transfers is still 1,000 each. Figure 11 reports the detection correctness rate. We can see that our proposed LAST-HDFS performs consistently better than the baseline approach by detecting more illegal file transfers, for the same reason as previously discussed. In addition, observe that the correct detection rate of both approaches increases with the number of cloud nodes. This is because the more cloud nodes there are, the lower the chance that files with different policies are placed in the same node. Thus, it becomes easier to identify illegal transfers through the socket monitors.

Fig. 11: Detection Correctness When Varying the Number of Nodes

6.3.4 Effect of the Number of Cloud Regions

In this round of experiments, we vary the number of cloud regions from 10 to 50. The total number of cloud nodes is 1,000. The total number of files to be stored in the cloud is set to 250,000, and each file still specifies up to 5 desired locations, as in the previous round of experiments. Again, 1,000 legal file transfers and 1,000 illegal file transfers were simulated. Figure 12 shows the detection correctness. LAST-HDFS again achieves a much higher detection rate than the baseline approach. In addition, we also observe that the detection accuracy of our approach decreases as the number of cloud regions increases.
The possible reason is that the fewer the cloud regions, the more likely it is that files will have similar location preferences, which leads to better grouping results in our approach.

Fig. 12: Detection Correctness When Varying the Total Number of Cloud Regions

6.3.5 Effect of the Number of Regions Allowed in Each Policy

In the last set of tests, we further examine the effect of the number of regions specified in each location privacy policy. We increase the number of allowable regions in each policy from 5 to 25, and keep the other parameters the same as in the previous experiment. Figure 13 shows the detection correctness rate. As in the previous experiments, our LAST-HDFS yields much higher detection accuracy than the baseline approach. The baseline approach deteriorates as the number of allowable regions in each policy increases. Specifically, the detection correctness of the baseline approach decreases to a random guess (50%) when each file specifies 25 desired regions (out of 40). The possible reason is that the more regions a file is allowed to be stored in, the more likely it is that files with different location preferences will be stored together by the baseline approach. As a result, more file transfers become suspicious, since there may always be a file that is allowed to go to the destination region, yet the baseline approach will still report such a transfer as potentially illegal if other files on the node did not include that destination region in their policies.

Fig. 13: Detection Correctness When Varying the Number of Regions in Each Policy

7 RELATED WORK

Data location in the cloud environment has been recognized as an important factor in providing users with assurance of data security and privacy [13]. There have been some efforts on the research problem of data placement control in cloud storage systems. Peterson et al. [14] defined the notion of "data sovereignty" and proposed a MAC-based proof of data possession (PDP) technique to authenticate the geographic locations of data stored in the cloud. Benson et al. [15] addressed the problem of determining the physical locations of data stored in geographically distributed data centers by using passive distance measurements and a linear regression model to estimate in which data center the data is stored. Later, Gondree and Peterson [16] proposed a general framework, named constraint-based data geo-location (CBDG), that binds latency-based geo-location techniques with a probabilistic PDP, building on the previous solutions in [14], [15]. In addition, Watson et al. [17] considered the case of collusion between malicious service providers and suggested a proof of location (PoL) scheme that deploys trusted landmarks to verify the existence of a file on a host using a proof of retrievability (PoR) protocol. In [18], [19], PoR was also combined with a time-based distance-bounding protocol to provide strong geographic location assurance.

Instead of verifying file locations afterwards, another common approach is to require users to encrypt their data before uploading it to the cloud [20]. The rationale is that if the cloud does not have the original plain-text data, users would have fewer concerns about data location. This approach, however, imposes a large computational burden on the users, and it renders the data hard to index and analyze on cloud premises.

The above two types of approaches do not provide any mechanisms for cloud providers to directly enforce location-aware storage, and hence are very difficult for cloud providers to implement. Unlike any existing work, our proposed system addresses location-aware data storage from a system-oriented perspective by embedding location checking directly in the datanode assignment as well as the load balancing process. It relieves the burden on the client side and also provides the cloud service provider with an efficient and effective way to honor clients' SLAs regarding data location constraints.

8 CONCLUSION

In this paper, we build, on top of the existing HDFS, a novel LAST-HDFS system to address the data placement control problem in the cloud. LAST-HDFS supports policy-driven file loading that enables location-aware storage in cloud sites. More importantly, it also ensures that the location policy is enforced regardless of the data replication and load balancing processes that may affect policy compliance. Specifically, an efficient LP-tree and a Legal File Transfer graph were designed to optimally allocate files with similar location preferences to the most suitable cloud nodes, which in turn enhances the chance of detecting illegal file transfers. We have conducted extensive experimental studies in both a real cloud testbed and a large-scale simulated cloud environment. Our experimental results have shown the effectiveness and efficiency of the proposed LAST-HDFS system.

In the future, we plan to take into account more complicated policies that capture privacy requirements other than
location. We will adopt a more sophisticated policy analysis algorithm [21] and compute the integrated policy as the representative policy [22] at each node to help speed up policy comparison and the selection of nodes for newly uploaded files. Moreover, we also plan to leverage Intel SGX technology to protect the socket monitors from being compromised.

ACKNOWLEDGEMENT

This work is partially supported by the National Science Foundation under project DGE-1433659.

REFERENCES

[1] Amazon, "AWS global infrastructure," https://aws.amazon.com/about-aws/global-infrastructure/, 2017.
[2] C. Metz, "Facebook tackles (really) big data with Project Prism," https://www.wired.com/2012/08/facebook-prism/, 2012.
[3] K. V. Shvachko, Y. Aahlad, J. Sundar, and P. Jeliazkov, "Geographically-distributed file system using coordinated namespace replication," https://www.google.com/patents/WO2015153045A1?cl=zh, 2014.
[4] C. Liao, A. Squicciarini, and D. Lin, "LAST-HDFS: Location-aware storage technique for Hadoop distributed file system," in IEEE International Conference on Cloud Computing (CLOUD), 2016.
[5] N. Paladi and A. Michalas, "“One of our hosts in another country”: Challenges of data geolocation in cloud storage," in International Conference on Wireless Communications, Vehicular Technology, Information Theory and Aerospace & Electronic Systems (VITAE), 2014, pp. 1–6.
[6] Z. N. Peterson, M. Gondree, and R. Beverly, "A position paper on data sovereignty: The importance of geolocating data in the cloud," in HotCloud, 2011.
[7] A. Squicciarini, D. Lin, S. Sundareswaran, and J. Li, "Policy driven node selection in MapReduce," in 10th International Conference on Security and Privacy in Communication Networks (SecureComm), 2014.
[8] J. Li, A. Squicciarini, D. Lin, S. Liang, and C. Jia, "SecLoc: Securing location-sensitive storage in the cloud," in ACM Symposium on Access Control Models and Technologies (SACMAT), 2015.
[9] "Presidential executive order on strengthening the cybersecurity of federal networks and critical infrastructure," https://www.whitehouse.gov/the-press-office/2017/05/11/presidential-executive-order-strengthening-cybersecurity-federal, 2017.
[10] "HDFS architecture," http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html.
[11] R. Miller, "Inside Amazon cloud computing infrastructure," http://datacenterfrontier.com/inside-amazon-cloud-computing-infrastructure/, 2015.
[12] T. Bujlow, K. Balachandran, S. L. Hald, M. T. Riaz, and J. M. Pedersen, "Volunteer-based system for research on the internet traffic," Telfor Journal, vol. 4, no. 1, pp. 2–7, 2012.
[13] M. Geist, "Location matters up in the cloud," http://www.thestar.com/business/2010/12/04/geist_location_matters_up_in_the_cloud.html.
[14] Z. N. Peterson, M. Gondree, and R. Beverly, "A position paper on data sovereignty: The importance of geolocating data in the cloud," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, 2011.
[15] K. Benson, R. Dowsley, and H. Shacham, "Do you know where your cloud files are?" in Proceedings of the 3rd ACM Workshop on Cloud Computing Security. ACM, 2011, pp. 73–82.
[16] M. Gondree and Z. N. Peterson, "Geolocation of data in the cloud," in Proceedings of the Third ACM Conference on Data and Application Security and Privacy. ACM, 2013, pp. 25–36.
[17] G. J. Watson, R. Safavi-Naini, M. Alimomeni, M. E. Locasto, and S. Narayan, "LoSt: Location based storage," in Proceedings of the 2012 ACM Workshop on Cloud Computing Security. ACM, 2012, pp. 59–70.
[18] A. Albeshri, C. Boyd, and J. G. Nieto, "GeoProof: Proofs of geographic location for cloud computing environment," in Distributed Computing Systems Workshops (ICDCSW), 2012 32nd International Conference on. IEEE, 2012, pp. 506–514.
[19] A. Albeshri, C. Boyd, and J. G. Nieto, "Enhanced GeoProof: Improved geographic assurance for data in the cloud," International Journal of Information Security, vol. 13, no. 2, pp. 191–198, 2014.
[20] A. Michalas and K. Y. Yigzaw, "LocLess: Do you really care where your cloud files are?" ACM/IEEE, 2016.
[21] D. Lin, P. Rao, R. Ferrini, E. Bertino, and J. Lobo, "A similarity measure for comparing XACML policies," IEEE Trans. Knowl. Data Eng., vol. 25, no. 9, pp. 1946–1959, 2013.
[22] P. Rao, D. Lin, E. Bertino, N. Li, and J. Lobo, "Fine-grained integration of access control policies," Computers & Security, vol. 30, no. 2-3, pp. 91–107, 2011.

Adam Bowers received a B.S. in Computer Science from the Missouri University of Science and Technology in 2016. He is currently a PhD student in Computer Science at Missouri University of Science and Technology. His current research focus is cloud security and privacy.

Cong Liao received a BE degree in Mechanical Design, Manufacturing and Automation from the University of Electronic Science and Technology of China, and an MS degree in Robotics from the University of Pennsylvania. He is currently working towards a PhD degree in the College of Information Sciences and Technology at the Pennsylvania State University. His current research interests include cloud security and adversarial machine learning.

Douglas Steiert is a graduate student enrolled in the Computer Science PhD program at Missouri University of Science and Technology. He received his B.S. in Computer Science from Missouri University of Science and Technology in 2015. His main research focus has been on privacy within smartphone and social media areas.

Dan Lin is an associate professor and Director of the Cybersecurity Lab at Missouri University of Science and Technology. She received the PhD degree in Computer Science from the National University of Singapore in 2007, and was a postdoctoral research associate at Purdue University for two years. Her main research interests cover many areas in the fields of database systems and information security.

Anna Squicciarini is an associate professor at the College of Information Sciences and Technology at Pennsylvania State University. She received the PhD degree in Computer Science from the University of Milan, Italy, in 2006. Her research is currently funded by the National Science Foundation, Hewlett-Packard, and a Google Research Award.

Ali Hurson received a B.S. degree in Physics from the University of Tehran in 1970, an M.S. degree in Computer Science from the University of Iowa in 1978, and a Ph.D. from the University of Central Florida in 1980. He was a Professor of Computer Science at the Pennsylvania State University until 2008, when he joined the Missouri University of Science and Technology. He has published over 300 technical papers in areas including multi-databases, global information sharing and processing, computer architecture and cache memory, and mobile and pervasive computing. He serves as an ACM distinguished speaker, area editor of the CSI Journal of Computer Science and Engineering, and Co-Editor-in-Chief of Advances in Computers.