Detecting Suspicious File Migration or Replication in the Cloud
University of Missouri
Email: {acbqbd,djsg38,lindan}@missouri.edu
‡ Information Science and Technology
Abstract—There has been a prolific rise in the popularity of cloud storage in recent years. While cloud storage offers many advantages such as flexibility and convenience, users are typically unable to tell or control the actual locations of their data. This limitation may affect users' confidence and trust in the storage provider, or even render the cloud unsuitable for storing data with strict location requirements. To address this issue, we propose a system called LAST-HDFS, which integrates a Location-Aware Storage Technique (LAST) into the open-source Hadoop Distributed File System (HDFS). The LAST-HDFS system enforces location-aware file allocation and continuously monitors file transfers to detect potentially illegal transfers in the cloud. Illegal transfers here refer to attempts to move sensitive data outside the ("legal") boundaries specified by the file owner and its policies. Our underlying algorithms model file transfers among nodes as a weighted graph and maximize the probability of storing data items with similar privacy preferences in the same region. We equip each cloud node with a socket monitor that is capable of monitoring the real-time communication among cloud nodes. Based on the real-time data transfer information captured by the socket monitors, our system calculates the probability that a given transfer is illegal. We have implemented our proposed framework and carried out an extensive experimental evaluation in a large-scale real cloud environment to demonstrate the effectiveness and efficiency of our proposed system.

1 INTRODUCTION
With the ever-increasing popularity of cloud computing, the demand for cloud storage has also increased exponentially. Computing firms are no longer the only consumers of cloud storage and cloud computing; average businesses, and even end users, are taking advantage of the immense capabilities that cloud services can provide. While enjoying the flexibility and convenience brought by cloud storage, cloud users release control over their data and, in particular, are often unable to locate their actual data, which could be in-state, in-country, or even out-of-country. Lack of location control may cause privacy breaches for cloud users (e.g., hospitals) who store sensitive data (e.g., medical records) that is governed by laws to remain within certain geographic boundaries and borders. Another situation where this problem arises is with governmental entities that require all data to be stored in the same country in which the government operates; this requirement has been challenged by cloud service providers (CSPs) quietly moving data out of the country or being bought out by foreign companies. For example, Canadian law demands that personally identifiable data be stored in Canada. However, a large cloud infrastructure like the Amazon Cloud has more than 40 zones distributed all over the world [1], which makes it very challenging to provide guaranteed adherence to regulatory compliance. Even Hadoop, which historically has been managed as a geographically confined distributed file system, is now deployed at large scale across different regions (see Facebook Prism [2] or a recent patent [3]).

To date, various tools have been proposed to help users verify the exact location of data stored in the cloud [4]–[6], with an emphasis on post-allocation compliance. However, recent work has acknowledged the importance of proactive location control for data placement consistent with adopters' location requirements [4], [7], [8], to allow users to have stronger control over their data and to guarantee the location where the data is stored.

In this work, we delve into one of the most widely adopted cloud data storage systems, the Hadoop Distributed File System (HDFS), and design an enhanced HDFS system called LAST-HDFS. LAST-HDFS extends HDFS' capabilities to achieve location-aware file allocation and file transfer monitoring. Specifically, LAST-HDFS provides the following new functions: (i) it consistently enforces location-aware data loading and storage by assigning datanodes according to user-specified privacy policies; (ii) it actively tracks and dynamically corrects possible data migration (due to balancing or data
[Figure: system architecture showing the Name Node with the Location-Aware File Allocator]

Finally, Section 8 concludes the paper and outlines future research directions.
by clients and perform the actual read/write operations on disk blocks as instructed by the namenode. In what follows, we briefly review the data storage and load balancing mechanisms adopted by the current HDFS, since our proposed system revises these two functions to achieve location-aware storage.

3.1 Write Mechanism in HDFS
For a data owner (client) to upload a file to HDFS, he first needs to initiate a write request to the namenode asking to create a new file in HDFS. Once the namenode approves the request, the client begins writing data to a stream in which the data is split into packets. Each packet represents a data block of the file that will be written to the datanodes. A separate thread in the client picks up a packet and contacts the namenode, from which a list of candidate datanodes is returned to the client. The client then sends the write packet to the first datanode in the list, where the data block will be stored. Subsequently, the data block is replicated to the following datanodes in the list in a pipeline manner.
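To make the client-side write path concrete, the following minimal sketch (ours, not part of the paper) creates a file through the standard FileSystem API and streams bytes to it; the namenode address and HDFS path are placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             // create() issues the write request to the namenode described above.
             FSDataOutputStream out = fs.create(new Path("/user/alice/example.txt"))) {
            // Bytes written to the stream are split into packets and pipelined
            // through the datanodes chosen by the namenode.
            out.write("hello HDFS".getBytes(StandardCharsets.UTF_8));
        }
    }
}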
3.2 Load Balancing in HDFS
Load balancing is of great importance to the overall performance of HDFS clusters, especially when a new datanode is added to the cluster or the disk space of certain datanodes is saturated. Hadoop provides a balancer tool that allows a cloud administrator to balance the disk space usage in an HDFS cluster. An outline of the load balancing process is described below (a simplified sketch of this loop follows the list):
1) The balancer partitions all the datanodes into two groups: (i) an under-utilized node group and (ii) an over-utilized node group, based on their data block usage reports.
2) The balancer randomly selects one datanode from each group to form a pair of nodes whose load will be balanced by transferring a certain amount of data from one to the other.
3) The balancer randomly selects a list of data blocks in the over-utilized datanode and transfers the data to the under-utilized datanode in the same pair.
4) The balancer iterates the above three steps until all the datanodes in the cluster reach a certain utilization threshold, i.e., the system achieves a balanced load.
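The following is a minimal, self-contained sketch of the pairing-and-moving loop outlined above (our illustration, not Hadoop's actual Balancer code); the NodeUsage record, the utilization threshold, and the moveBlocks helper are assumptions made for the example.

import java.util.*;

public class BalancerLoopSketch {
    // Hypothetical view of a datanode: its id and current disk utilization (0.0-1.0).
    record NodeUsage(String id, double utilization) {}

    static final double THRESHOLD = 0.10;  // acceptable deviation from the cluster average

    public static void balance(List<NodeUsage> nodes, Random rnd) {
        while (true) {
            double avg = nodes.stream().mapToDouble(NodeUsage::utilization).average().orElse(0);
            // Step 1: partition datanodes into over- and under-utilized groups.
            List<NodeUsage> over = nodes.stream().filter(n -> n.utilization() > avg + THRESHOLD).toList();
            List<NodeUsage> under = nodes.stream().filter(n -> n.utilization() < avg - THRESHOLD).toList();
            if (over.isEmpty() || under.isEmpty()) break;  // Step 4: all nodes within the threshold.

            // Step 2: randomly pick one node from each group to form a pair.
            NodeUsage src = over.get(rnd.nextInt(over.size()));
            NodeUsage dst = under.get(rnd.nextInt(under.size()));

            // Step 3: shift load from the over-utilized node to the under-utilized
            // one (block movement is simulated here by adjusting the usage figures).
            nodes = moveBlocks(nodes, src, dst, avg);
        }
    }

    // Hypothetical helper: returns the node list with some load shifted from src to dst.
    static List<NodeUsage> moveBlocks(List<NodeUsage> nodes, NodeUsage src, NodeUsage dst, double avg) {
        double shift = Math.min(src.utilization() - avg, avg - dst.utilization());
        List<NodeUsage> updated = new ArrayList<>();
        for (NodeUsage n : nodes) {
            if (n.id().equals(src.id()))      updated.add(new NodeUsage(n.id(), n.utilization() - shift));
            else if (n.id().equals(dst.id())) updated.add(new NodeUsage(n.id(), n.utilization() + shift));
            else                              updated.add(n);
        }
        return updated;
    }
}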
4 AN OVERVIEW OF THE PROPOSED LAST-HDFS SYSTEM
In this section, we first lay out the system design goals and the threat model. Then, we present an overview of our proposed LAST-HDFS system.

4.1 Design Goals
We consider a cloud architecture similar to the Amazon Cloud, which is partitioned into multiple zones, where each zone contains a number of cloud nodes (e.g., 50,000 [11]). Each node supports a distributed file system such as HDFS (Hadoop Distributed File System). A file is typically partitioned into chunks of equal size, which are then replicated three times when stored in the cloud. In our work, we will simply refer to the "file chunks" as "files".

Our overarching goal is to enable HDFS to support location-aware data storage so that data owners' location privacy policies are strongly enforced when storing their data in the cloud. Recall that in the existing HDFS, the locations of a user-uploaded file are determined by two factors: (i) data replication for the purpose of fault tolerance, and (ii) load balancing to optimize cluster space utilization. In other words, users' file chunks will be replicated to multiple datanodes when the files are uploaded for the first time, and it is very likely that the file blocks on saturated nodes may be transferred to under-utilized nodes at a later time. Therefore, in order to enforce users' location settings during the lifespan of their data in the cloud, we need to achieve the following design goals:
1) When uploading files to the cloud, users should be allowed to specify the location constraints (e.g., regions, countries) within which their data is allowed to be placed in the cloud.
2) The location constraints (i.e., location privacy policies) specified by the users should be consistently enforced during the data replication process.
3) The location constraints (i.e., location privacy policies) should also be consistently enforced during the load balancing process.
4) Any data movement (caused by malicious attacks) that violates the location constraints should be detected.

4.2 Threat Model
In our system, we consider the following three types of entities:
• File loader: It uploads files to the cloud on behalf of users.
• Namenode: It is the master node in HDFS, which manages the entire file system and also interacts with users.
• Datanodes: They are the nodes that actually store the user data.
Accordingly, we make the following assumptions and threat model. The namenode is the core node in the system and is assumed to be fully trusted for the following reasons. There is typically one namenode per cluster, along with a couple of backups, which means the number of namenodes is far smaller than the number of datanodes. The namenode controls all the file directories, which are extremely important for the service provider to ensure the availability of the whole cloud service; hence, the namenode is typically much better protected and already closely monitored by the service provider. With that said, the namenode will faithfully handle requests from users. On the other hand, since the number of datanodes is huge, it is much more challenging for service providers to keep track of the behavior of all the datanodes. Attacks on datanodes are more silent, frequent, and hard to notice. Thus, we do not assume all the datanodes are fully trusted. Compromised datanodes could intentionally transfer or copy users' data to other nodes that may reside outside the legal regions specified by the users. Attackers may do this for various purposes such as analyzing users' data for advertising,
5 DETAILED ALGORITHMS FOR IMPLEMENTING THE LAST-HDFS SYSTEM
In this section, we present the detailed algorithms that support the two major functionalities in the proposed LAST-HDFS system, including (i) location-aware file allocation and (ii) real-time file transfer analysis.

5.1 Location-aware File Allocation
We will present the algorithms first and then discuss the system implementation details.

5.1.1 Algorithms
Due to the increasing number of users adopting cloud services, large amounts of cloud storage requests are received continuously over time by cloud service providers. When a user issues a storage request, our proposed location-aware file allocation aims at finding the cloud nodes that store the files with location preferences most similar to those of the newly uploaded file, so as to help identify illegal file transfers in the future. A straightforward way to perform this step is to simply compare the location preference of the new data item with the location preferences of all the existing data items already stored in the cloud. However, considering the scale of the cloud, this naive solution is obviously very time consuming to carry out. Therefore, we propose an efficient approach, the Location Preference (LP) tree, to help speed up this process.

Our proposed LP-tree indexes the location preferences of the files stored in each cloud node. The tree is maintained and updated by the name node whenever there is an update to the file storage, such as cloud nodes or files being added or deleted. An example LP-tree is shown in Figure 3, whereby N0, ..., N5 denote the names of the nodes in the index, and the # symbol indicates the number of users who have their information indexed in the same index node. The leaf nodes of the tree contain the IDs of the cloud nodes which store files with similar geographical location preferences. The internal nodes of the LP-tree record the aggregated location preferences of their corresponding children nodes, so as to facilitate the search for suitable cloud nodes that have available space for incoming storage requests. The aggregated location preferences include two kinds of information: (i) the IDs of the allowed regions, and (ii) the number of data items associated with each allowed region. For example, as shown in Figure 3, the first entry in node N1 is (4R1, 5R2), which means 4 files stored in the cloud nodes indexed by N1 have a location preference for region R1 and 5 files prefer region R2. The example LP-tree corresponds to the example in Figure 2. Specifically, cloud node A stores the data items f11 and f21, which have their location preferences constrained to R1 and R2. Assume that some other cloud nodes F and G store data items with the same location preferences as node A. Then, cloud nodes A, F and G are recorded in the same leaf node N2 in the tree, as illustrated in Figure 3, where # = 4 indicates that the files belong to 4 different users. Similarly, node N3 in the tree shows that both of the cloud nodes B and E store data items with the same location preferences R2 and R3, and these files belong to 2 different users.

The construction of the LP-tree is as follows. Starting from the first new data item uploaded to the cloud, the name node will look for an empty cloud node within the satisfying regions (e.g., R1 and R2 for f22). If an empty cloud node is found, the new data item will be stored in that cloud node. Since the LP-tree is empty at this point, a root node will be created, and one of the root node's entries will be used to record this new data item's indexing information (e.g., the ID of the cloud node that stores this item). For subsequent insertions of data items into the LP-tree, the first step is to search the LP-tree to identify potential cloud nodes which store data items with the same location preferences as the new data item. If such a cloud node is found and has capacity to store the new item, the aggregation information in its parent node in the LP-tree will be updated to include the new item. For example, a new file with location preferences R1 and R2 may be stored in cloud node F, and we only need to update N1's aggregation information from (4R1, 5R2) to (5R1, 6R2). The update will also propagate to all the ancestor nodes.
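A minimal sketch of the bookkeeping just described (our illustration, not the paper's code; field names such as regionCounts and the helper methods are assumptions). It shows how inserting a file updates the per-region aggregation, e.g., from (4R1, 5R2) to (5R1, 6R2), and propagates the update to ancestor nodes.

import java.util.*;

public class LpTreeSketch {
    // One index node of the LP-tree; leaves hold cloud-node IDs, internal nodes hold children.
    static class LpNode {
        LpNode parent;
        final List<LpNode> children = new ArrayList<>();
        final List<String> cloudNodeIds = new ArrayList<>();       // used only in leaves
        final Map<String, Integer> regionCounts = new HashMap<>();  // aggregation, e.g., R1 -> 4, R2 -> 5

        boolean coversRegions(Set<String> regions) {
            return regionCounts.keySet().containsAll(regions);
        }
    }

    // Record that a new file with the given location preferences was stored under this leaf.
    static void insertFile(LpNode leaf, Set<String> allowedRegions) {
        // Update the aggregation of the leaf and propagate it up to the root,
        // e.g., (4R1, 5R2) becomes (5R1, 6R2) after a file allowing {R1, R2}.
        for (LpNode n = leaf; n != null; n = n.parent) {
            for (String region : allowedRegions) {
                n.regionCounts.merge(region, 1, Integer::sum);
            }
        }
    }

    // Find a leaf whose indexed cloud nodes store files with exactly these location preferences.
    static LpNode findLeaf(LpNode node, Set<String> allowedRegions) {
        if (node.children.isEmpty()) {
            return node.regionCounts.keySet().equals(allowedRegions) ? node : null;
        }
        for (LpNode child : node.children) {
            if (child.coversRegions(allowedRegions)) {          // follow entries whose aggregated
                LpNode leaf = findLeaf(child, allowedRegions);  // regions include the new file's
                if (leaf != null) return leaf;
            }
        }
        return null;
    }
}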
If none of the cloud nodes indexed by the LP-tree has sufficient storage space, the name node will identify a new empty cloud node and create a new index entry. The new index entry will be inserted into the leaf node whose location preferences are the same as the new data item's. If a leaf node in the LP-tree is full, it will be split into two nodes and the aggregation information at the parent level will be adjusted. Such adjustment may propagate all the way up to the root. In this way, the LP-tree's height will increase gradually. For example, if a new file with location preferences R1 and R2 arrives in the cloud, the name node will start checking the root node of the LP-tree. It will find that the first entry in the root node contains (R1, R2, R3, R4), which includes the new data item's location preferences of R1 and R2. Then, it retrieves
TABLE 1: Examples of Confidence Score Calculation
Si→j   Ci→j
0      1
1      0.9
2      0.8
3      0.7
4      0.6
5      0.5
6      0.6
7      0.7
8      0.8
9      0.9
10     1
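The values listed in Table 1 are symmetric around S = 5; a one-line helper that reproduces exactly the listed values (our observation from the table itself, not a formula given in the text) would be:

public class ConfidenceTableSketch {
    // Reproduces the Ci->j values listed in Table 1 for S in [0, 10]:
    // 0 -> 1.0, 1 -> 0.9, ..., 5 -> 0.5, 6 -> 0.6, ..., 10 -> 1.0.
    static double confidence(int s) {
        return Math.max(1.0 - s / 10.0, s / 10.0);
    }
}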
73%. Node K contains only 2 files with location preferences partially matching the new file, and its detection confidence after considering the new file would be 67%. By comparing their detection confidence, we can find that nodes A, F and G are better candidates.

With the aid of the LP-tree, we only need to check log(n) nodes in the LP-tree to locate candidate cloud nodes to store a newly uploaded file, where n is the total number of cloud storage nodes. The space complexity in the worst case for the LP-tree is log C(r, x), where r is the total number of regions and x is the number of regions allowed in a policy; the reason for this is that there are C(r, x) possible policies that need to be represented by the tree. It is worth noting that the actual policies can be stored on the hard drive; only the top few levels (typically 2 or 3 levels) of the LP-tree (a few MB) need to be stored in main memory for quick retrieval.
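For a rough, illustrative sense of the size of this policy space (our arithmetic, using the simulation parameters that appear later in Section 6.3: r = 40 regions and x = 5 allowed regions per policy), the number of distinct policies with exactly five allowed regions is

C(40, 5) = (40 · 39 · 38 · 37 · 36) / 5! = 658,008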
5.1.2 System Implementation
To realize the proposed location-aware file allocation, we need to extend three components in the existing HDFS, as elaborated in the following.

Location-Aware File Loader
The file loader is a Java application program that takes data location policy files as input and prepares the data replication on the specified nodes. Instead of using the FileSystem APIs that are normally designated for user programs, we leverage the public APIs provided by the DFSClient class to improve efficiency. Specifically, one of the public methods, named create, has a particular input parameter, FavoredNodes, which allows users to specify their preferred nodes for storing the data. Hence, this particular method is used by default by our file loader when handling user requests.
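As a concrete illustration of this create call, here is a minimal sketch (ours, not the paper's loader). The region-to-datanode mapping helper and all host names are hypothetical, and while the paper drives DFSClient directly, this self-contained sketch goes through DistributedFileSystem, whose create(...) overload accepting an InetSocketAddress[] of favored nodes varies slightly across Hadoop versions.

import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LocationAwareLoaderSketch {
    // Hypothetical mapping from a user-friendly region ID (e.g., "EAST US") to the
    // datanodes associated with that region (hard-coded in the real loader, Sec. 6.1).
    static InetSocketAddress[] datanodesForRegion(String regionId) {
        return new InetSocketAddress[] {
            new InetSocketAddress("dn1.east.example.com", 50010),
            new InetSocketAddress("dn2.east.example.com", 50010)
        };
    }

    public static void main(String[] args) throws Exception {
        // One policy entry in the format: src path, dest path, replica, region ID
        String entry = "/local/data/report.csv,/user/alice/report.csv,3,EAST US";
        String[] f = entry.split(",");
        Path dest = new Path(f[1]);
        short replication = Short.parseShort(f[2]);
        InetSocketAddress[] favoredNodes = datanodesForRegion(f[3]);

        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        DistributedFileSystem dfs = (DistributedFileSystem) dest.getFileSystem(conf);

        // create(...) overload with favored nodes: the namenode is asked to prefer
        // the datanodes of the requested region when placing the blocks.
        try (FSDataOutputStream out = dfs.create(dest, FsPermission.getFileDefault(),
                true /* overwrite */, conf.getInt("io.file.buffer.size", 4096),
                replication, dfs.getDefaultBlockSize(dest), null /* progress */,
                favoredNodes)) {
            out.writeBytes("contents of " + f[0]);   // placeholder payload
        }
    }
}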
The data location policy file is designed as a simple text file containing multiple file entries. Each entry has the following format:

src path, dest path, replica, region ID

where src path is the file location on the local host and dest path is the file location in HDFS; replica denotes the replication factor, which allows users to store multiple copies of data for the purpose of fault tolerance; lastly, region ID denotes the locations where the data should be stored, represented by user-friendly text such as "EAST US" or "WEST US". These representations are predefined and each is associated with a list of IP addresses of datanodes; the mapping between regions and datanodes is hard-coded in the file loader, as described in Section 6.1.

Users have the choice of submitting their data either with or without a data location policy file. If a policy file is provided, the locations in the file will be extracted and serve as the input value of the FavoredNodes parameter in the corresponding API call. Otherwise, the location is considered null when the create method is invoked.

Location-Aware Replicator
This step aims to store the user data at the specified locations. When the create method is invoked by the file loader, a request is sent to the namenode asking for a list of datanodes to store the data, as described for the write mechanism of HDFS in Section 3.1. The datanodes are selected according to the class BlockPlacementPolicy. In Hadoop's default implementation of BlockPlacementPolicy, candidate datanodes are first drawn from the list of FavoredNodes specified by the user. However, there is no guarantee that a candidate datanode will actually be selected unless it meets a series of criteria, e.g., enough space and low network latency. In the case of disqualified candidate datanodes in the FavoredNodes list, additional datanodes will be selected from nodes that do not belong to the preferred list, in order to make sure that the number of returned datanodes equals the replication factor. As a result, it is possible that some copies of user data will be stored in locations against the data location policy.

To enforce the location policy in the process of data replication, we extend the default implementation of BlockPlacementPolicy and override the original procedure of selecting candidate datanodes. In our design, if at least one candidate datanode from the FavoredNodes list is disqualified, we reduce the replication factor to the number of datanodes that are eventually selected by the namenode, instead of selecting other possible datanodes outside the scope of the FavoredNodes list. As for changing the replication factor, we leverage the Hadoop shell command hdfs setrep. Specifically, we add a command option -w so that the change will only be made after the replication process has ended. In this way, we can ensure that the data location policy is consistently enforced in the replication process. For files whose replication factor cannot be met at the initial upload, our system will invoke the location-aware replication process again whenever resources are released, in order to eventually produce the desired number of copies.
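The block placement override itself depends on internal Hadoop interfaces, so rather than reproduce that API we sketch only the decision it implements (our conceptual illustration, not the actual BlockPlacementPolicy subclass): keep only the targets drawn from the favored list and shrink the effective replication factor accordingly, instead of spilling over to nodes outside the policy.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class LocationAwarePlacementSketch {
    // Given the datanodes the namenode would use and the policy's favored set,
    // keep only compliant targets. The size of the returned list is the reduced
    // replication factor that the system later tries to raise back to the
    // requested value (e.g., via "hdfs dfs -setrep -w") once resources free up.
    static List<String> restrictToFavored(List<String> chosenDatanodes,
                                          Set<String> favoredDatanodes,
                                          short requestedReplication) {
        List<String> compliant = new ArrayList<>();
        for (String dn : chosenDatanodes) {
            if (favoredDatanodes.contains(dn) && compliant.size() < requestedReplication) {
                compliant.add(dn);   // policy-compliant target, keep it
            }
            // Non-favored candidates are dropped rather than used as substitutes,
            // so no replica ever lands outside the regions allowed by the policy.
        }
        return compliant;
    }
}

For example, with favored nodes {dn1, dn2} and a requested replication factor of 3, a chosen list [dn1, dn5, dn2] yields [dn1, dn2], i.e., an effective replication factor of 2 until more compliant space becomes available.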
Location-Aware Load Balancer
During data processing in Hadoop, load balancing may occur once in a while to maximize system performance. If we rely on the default Hadoop load balancer, user data may be moved to nodes that do not satisfy the data location policies, since the default Hadoop load balancer does not consider location privacy issues. In order to consistently enforce the location policy during load balancing, we enhanced the Hadoop load balancer by adding an additional procedure to check whether the outgoing location of the selected data block on the over-utilized node conforms to the policy specified in the data location configuration file. In particular, we add
[Figure residue: a workflow ("Select a Block ID", "Find the corresponding ...") and an example graph of cloud nodes A–F grouped into regions R1–R3 with their stored files (e.g., (f11, f21) at node A, (f11, f31) at node B, (f12, f32) at node C) and weighted edges (e.g., 1, 0.9, 0.5); architecture labels: Namenode, File Loader, Datanode]

...information sent out of the datanode so that, by analyzing such information, we will know if data has been transferred to disqualified locations. The socket monitor is implemented as
aim to check if the location policy is also enforced by our proposed balancer, as described in Section 5.1.2.

In order to closely control the saturation level of individual datanodes, we conduct this test in a three-node cluster locally, with one machine being the master node and two being slave nodes. The cluster consists of three machines with two Intel(R) Xeon(R) X5550 2.67 GHz CPUs, 48 GB of memory, and Ubuntu 12.04 LTS Linux. Each machine is installed with the same software as in the large testbed described earlier. The two slave nodes represent two regions, respectively.

We employ 10 files with an identical size of 2.2 GB. We set the ratio of files with location policies to 50%, meaning that half of the data is subject to location constraints. We start the Hadoop cluster with one datanode only, denoted by region1, and upload the files without location policies to the cluster using the default command. For the remaining files, we run our file loader with the policy stating that they should be uploaded to region1. Then, we add another datanode to the cluster as region2. As a result, one region is considered saturated compared to the other. Lastly, we launch our extended balancer and check the location changes of every file afterwards using the fsck command, whose output is parsed and summarized in Table 3.

TABLE 3: Block Location Comparison Before and After Load Balancing
File ID   Total Blocks   Block Locations (Before)        Block Locations (After)
1         35             35 in region1, 0 in region2     same
2         35             35 in region1, 0 in region2     same
3         35             35 in region1, 0 in region2     same
4         35             35 in region1, 0 in region2     same
5         35             35 in region1, 0 in region2     same
6         35             35 in region1, 0 in region2     29 in region1, 6 in region2
7         35             35 in region1, 0 in region2     21 in region1, 14 in region2
8         35             35 in region1, 0 in region2     30 in region1, 5 in region2
9         35             35 in region1, 0 in region2     30 in region1, 5 in region2
10        35             35 in region1, 0 in region2     20 in region1, 15 in region2

As we can see from Table 3, the first five files stay in region1, while the remaining five files were moved to region2 after the load balancing operation. Therefore, location policies are indeed enforced by the load balancer during data movement.

We also assess how enforcing location constraints affects the performance of common load balancing tasks. We conduct the same test mentioned above but vary the ratio of files with location constraints from 20% to 80%. At each round, we measure the elapsed time taken by our extended load balancer to move data across nodes according to the location constraints applied to the data. We compare it against the overhead of the default Hadoop balancer in the absence of such constraints. For each experiment, we run our balancer and the default one five times separately. The overall performance is reported in Table 4.

TABLE 4: Balancer Performance Under Different Policy Ratios
Policy Ratio            0%       20%      40%      60%      80%
Avg Time (unit: hour)   1.6514   1.6572   1.6856   1.6856   1.6532

As we can see from Table 4, there is no significant difference between the time taken by our proposed load balancer under various policy ratios and that of the default one. Hence, our proposed load balancer does not introduce extra overhead to overall performance.

6.3 Accuracy of Illegal File Detection
In this subsection, we aim to demonstrate how the proposed location-aware file allocation strategy helps with illegal file transfer detection. Since the real cloud testbed only has a limited number of nodes, which would not be sufficient to cover the various scenarios of file transfers that may happen in a large-scale real-world cloud like the Amazon cloud, we adopt a simulated cloud environment that resembles the structure of the Amazon cloud, with 40+ regions and thousands of nodes. The specific parameters of the simulated cloud will be introduced in each corresponding experiment.

We compare our system with a baseline approach that simply assigns each individual file to a random node in one of the regions specified in the location privacy policy. The detection accuracy is defined as the accuracy rate of detecting the correct types of file transfers, including legal and illegal ones. Specifically, after files are allocated to the cloud nodes by both approaches, we simulate both illegal and legal file transfers. Illegal transfers move files to regions that are not allowed in the corresponding location policies, whereas legal file transfers move files to regions that are allowed. Let Nillegal and Nlegal denote the number of illegal and legal transfers, respectively. Let Ni→i denote the number of illegal transfers being detected, and Nl→l denote the number of legal transfers that have been correctly marked as legal. The detection accuracy is defined as follows:

Correctness = (Ni→i + Nl→l) / (Nillegal + Nlegal)    (3)

It is important to note that in any case, neither approach will have a perfect (100%) detection correctness rate. This is because our correctness function considers both false positives and false negatives, and it is common that some suspicious file transfers could just be false alarms.

6.3.1 Effect of the Number of Files to Be Stored in the Cloud
In this round of experiments, we vary the total number of data files from 100K to 1M. Without loss of generality, we assume each file has the same size. The total number of cloud regions is set to 40 and the total number of cloud nodes to 1,000. Each cloud node can store up to 1,000 files. Each data file specifies up to 5 allowable regions in its location privacy policy. We simulated 1,000 legal file transfers and 1,000 illegal file transfers. The performance of our approach is compared against the baseline described previously.

Figure 9 reports the detection correctness. As we can see, our LAST-HDFS has much higher detection accuracy than the baseline approach. This is because the baseline approach considers each file independently when allocating them, and hence files with different privacy preferences are very likely to
be placed in the same cloud node. For example, a cloud node A may store a file f1 that is allowed to be transferred to node B, but also another file f2 that is not allowed to be transferred to B. As a result, when a communication with node B is detected, it is very hard for the baseline approach to ascertain whether this communication is transferring file f1, which is legal, or file f2, which is illegal. Our proposed LAST-HDFS system considers multiple files' privacy preferences simultaneously and allocates them in a way (i.e., through the use of the LFT graph) that helps effectively detect illegal transfers. Therefore, we achieve much higher accuracy.

[Figure 9 (plot): Correct Prediction % of the Baseline and LAST-HDFS]

...approach will detect 100% of illegal transfers. However, for a legal transfer, the baseline approach is like taking a random guess, and the detection probability would be around 50%. In a real cloud, where there are very few illegal transfers, the baseline approach will generate too many false positives and hence will not be suitable for illegal file transfer detection. Compared to the baseline, LAST-HDFS reports 99.9% of illegal files and has a lower false positive rate, around 30%, when the illegal files are few. This is attributed to the better file allocation strategy adopted by LAST-HDFS, which enhances the chance of making the correct judgment. Moreover, by reporting all suspected file transfers, the baseline also increases the workload at the name node. The name node has to examine every suspected transfer one by one. In contrast, our proposed LAST-HDFS has significantly helped reduce the fine-grained ...

... file transfers from 1% to 50%. Figure 10 reports the results.

[Figure 10 (plot): Correct Prediction % of the Baseline and LAST-HDFS]
...of the cloud regions. The possible reason is that the fewer the cloud regions, the more likely it is that files will have similar location preferences and hence lead to better grouping results in our approach.

[Fig. 13 (plot): Detection Correctness When Varying the Number of Regions in Each Policy; x-axis: Allowable Regions in Each Policy (5–25), y-axis: Correct Prediction %, for the Baseline and LAST-HDFS]

7 RELATED WORK
Data location in the cloud environment has been recognized as an important factor in providing users with assurance of data security and privacy [13]. There have been some efforts on the research problem of data placement control in cloud storage systems. Peterson et al. [14] defined the notion of "data sovereignty" and proposed a MAC-based proof of data possession (PDP) technique to authenticate the geographic locations of data stored in the cloud. Benson et al. [15] addressed the problem of determining the physical locations of
...problem in the cloud. LAST-HDFS supports policy-driven file loading that enables location-aware storage in cloud sites. More importantly, it also ensures that the location policy is enforced regardless of the data replication and load balancing processes that may affect policy compliance. Specifically, an efficient LP-tree and Legal File Transfer graph were designed to help optimally allocate files with similar location preferences to the most suitable cloud nodes, which in turn enhances the chance of detecting illegal file transfers. We have conducted extensive experimental studies in both a real cloud testbed and a large-scale simulated cloud environment. Our experimental results have shown the effectiveness and efficiency of the proposed LAST-HDFS system.

In the future, we plan to take into account more complicated policies to capture privacy requirements other than location. We will adopt a more sophisticated policy analysis algorithm [21] and compute the integrated policy as the representative policy [22] at each node to help speed up the policy comparison and selection of nodes for newly uploaded files. Moreover, we also plan to leverage Intel SGX technology to secure the socket monitors from being compromised.
REFERENCES
[1] Amazon, "AWS global infrastructure," https://aws.amazon.com/about-aws/global-infrastructure/, 2017.
[2] C. Metz, "Facebook tackles (really) big data with Project Prism," https://www.wired.com/2012/08/facebook-prism/, 2012.
[3] K. V. Shvachko, Y. Aahlad, J. Sundar, and P. Jeliazkov, "Geographically-distributed file system using coordinated namespace replication," https://www.google.com/patents/WO2015153045A1?cl=zh, 2014.
[4] C. Liao, A. Squicciarini, and D. Lin, "LAST-HDFS: Location-aware storage technique for Hadoop distributed file system," in IEEE International Conference on Cloud Computing (CLOUD), 2016.
[5] N. Paladi and A. Michalas, ""One of our hosts in another country": Challenges of data geolocation in cloud storage," in International Conference on Wireless Communications, Vehicular Technology, Information Theory and Aerospace & Electronic Systems (VITAE), 2014, pp. 1–6.
[6] Z. N. Peterson, M. Gondree, and R. Beverly, "A position paper on data sovereignty: The importance of geolocating data in the cloud," in HotCloud, 2011.
[7] A. Squicciarini, D. Lin, S. Sundareswaran, and J. Li, "Policy driven node selection in MapReduce," in 10th International Conference on Security and Privacy in Communication Networks (SecureComm), 2014.
[8] J. Li, A. Squicciarini, D. Lin, S. Liang, and C. Jia, "SecLoc: Securing location-sensitive storage in the cloud," in ACM Symposium on Access Control Models and Technologies (SACMAT), 2015.
[9] Executive Order, "Presidential executive order on strengthening the cybersecurity of federal networks and critical infrastructure," https://www.whitehouse.gov/the-press-office/2017/05/11/presidential-executive-order-strengthening-cybersecurity-federal, 2017.
[10] "HDFS architecture," http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html.
[11] R. Miller, "Inside Amazon cloud computing infrastructure," http://datacenterfrontier.com/inside-amazon-cloud-computing-infrastructure/, 2015.
[12] T. Bujlow, K. Balachandran, S. L. Hald, M. T. Riaz, and J. M. Pedersen, "Volunteer-based system for research on the internet traffic," Telfor Journal, vol. 4, no. 1, pp. 2–7, 2012.
[13] M. Geist, "Location matters up in the cloud," http://www.thestar.com/business/2010/12/04/geist location matters up in the cloud.html.
[14] Z. N. Peterson, M. Gondree, and R. Beverly, "A position paper on data sovereignty: The importance of geolocating data in the cloud," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, 2011.
[15] K. Benson, R. Dowsley, and H. Shacham, "Do you know where your cloud files are?" in Proceedings of the 3rd ACM Workshop on Cloud Computing Security, ACM, 2011, pp. 73–82.
[16] M. Gondree and Z. N. Peterson, "Geolocation of data in the cloud," in Proceedings of the Third ACM Conference on Data and Application Security and Privacy, ACM, 2013, pp. 25–36.
[17] G. J. Watson, R. Safavi-Naini, M. Alimomeni, M. E. Locasto, and S. Narayan, "LoSt: Location based storage," in Proceedings of the 2012 ACM Workshop on Cloud Computing Security, ACM, 2012, pp. 59–70.
[18] A. Albeshri, C. Boyd, and J. G. Nieto, "GeoProof: Proofs of geographic location for cloud computing environment," in Distributed Computing Systems Workshops (ICDCSW), 2012 32nd International Conference on, IEEE, 2012, pp. 506–514.
[19] A. Albeshri, C. Boyd, and J. G. Nieto, "Enhanced GeoProof: Improved geographic assurance for data in the cloud," International Journal of Information Security, vol. 13, no. 2, pp. 191–198, 2014.
[20] A. Michalas and K. Y. Yigzaw, "LocLess: Do you really care where your cloud files are?" ACM/IEEE, 2016.
[21] D. Lin, P. Rao, R. Ferrini, E. Bertino, and J. Lobo, "A similarity measure for comparing XACML policies," IEEE Trans. Knowl. Data Eng., vol. 25, no. 9, pp. 1946–1959, 2013.
[22] P. Rao, D. Lin, E. Bertino, N. Li, and J. Lobo, "Fine-grained integration of access control policies," Computers & Security, vol. 30, no. 2-3, pp. 91–107, 2011.

Cong Liao received a BE degree in Mechanical Design, Manufacturing and Automation from University of Electronic Science and Technology of China, and an MS degree in Robotics from University of Pennsylvania. He is currently working towards a PhD degree in the College of Information Sciences and Technology at the Pennsylvania State University. His current research interests include cloud security and adversarial machine learning.

Douglas Steiert is a graduate student enrolled in a Computer Science PhD program at Missouri University of Science and Technology. He received his B.S. in Computer Science from Missouri University of Science and Technology in 2015. His main research focus has been on privacy within smartphone and social media areas.

Dan Lin is an associate professor and Director of the Cybersecurity Lab at Missouri University of Science and Technology. She received the PhD degree in Computer Science from the National University of Singapore in 2007, and was a postdoctoral research associate at Purdue University for two years. Her main research interests cover many areas in the fields of database systems and information security.

Anna Squicciarini is an associate professor at the College of Information Sciences and Technology at Pennsylvania State University. She received the PhD degree in Computer Science from University of Milan, Italy in 2006. Her research is currently funded by the National Science Foundation, Hewlett-Packard, and a Google Research Award.

Ali Hurson received a B.S. degree in Physics from the University of Tehran in 1970, an M.S. degree in Computer Science from the University of Iowa in 1978, and a Ph.D. from the University of Central Florida in 1980. He was a Professor of Computer Science at the Pennsylvania State University until 2008, when he joined the Missouri University of Science and Technology. He has published over 300 technical papers in areas including multi-databases, global information sharing and processing, computer architecture and cache memory, and mobile and pervasive computing. He serves as an ACM distinguished speaker, area editor of the CSI Journal of Computer Science and Engineering, and Co-Editor-in-Chief of Advances in Computers.