An Access Control Scheme For Big Data Processing: Vincent C. Hu, Tim Grance, David F. Ferraiolo, D. Rick Kuhn

This document proposes an access control scheme for distributed big data processing clusters. It first describes how big data is difficult to process due to its large size and complex formats. It then discusses how access control is critical for security but becomes more challenging for complex systems like big data processing. The document proposes collaborating access control across domains and distributed management to protect big data processing components and users.
An Access Control Scheme for Big Data Processing

Vincent C. Hu, Tim Grance, David F. Ferraiolo, D. Rick Kuhn


National Institute of Standards and Technology
Gaithersburg, MD, USA
vhu, grance, dferraiolo, [email protected]

Abstract: Access Control (AC) systems are among the most critical of network security components. A system’s privacy and security controls are more likely to be compromised due to the misconfiguration of access control policies rather than the failure of cryptographic primitives or protocols. This problem becomes increasingly severe as software systems become more and more complex, such as Big Data (BD) processing systems, which are deployed to manage a large amount of sensitive information and resources organized into a sophisticated BD processing cluster. Basically, BD access control requires the collaboration among cooperating processing domains to be protected as computing environments that consist of computing units under distributed AC managements. Many BD architecture designs were proposed to address BD challenges; however, most of them were focused on the processing capabilities of the “three Vs” (Velocity, Volume, and Variety). Considerations for security in protecting BD are mostly ad hoc and patch efforts. Even with some inclusion of security in recent BD systems, a critical security component, AC (Authorization), for protecting BD processing components and their users from insider attacks, remains elusive. This paper proposes a general purpose AC scheme for distributed BD processing clusters.

Keywords: Access Control, Authorization, Big Data, Distributed System

I. INTRODUCTION

Data IQ News [1] estimates that the global data population will reach 44 zettabytes (one zettabyte is 1 billion terabytes) by 2020. This growth trend is influencing the way data is being mass collected and produced for high-performance computing or operations and planning analysis. Big Data (BD) refers to large data that is difficult to process by using a traditional data processing system, for example, to analyze Internet data traffic, or edit video data of hundreds of gigabytes. (Note that each case depends on the capabilities of a system; it has been argued that for some organizations, terabytes of text, audio, and video data per day can be processed, thus, it is not BD, but for those organizations that cannot process it efficiently, it is BD [2]). BD technology is gradually reshaping current data systems and practices. Government Computer News [3] estimates that the volume of data stored by federal agencies alone will increase from 1.6 to 2.6 petabytes within two years, and U.S. state and local governments are just as keen on harnessing the power of BD to boost security, prevent fraud, enhance service delivery, and improve emergency response. It is estimated that successfully leveraging technologies for BD can reduce the IT cost by an average of 48% [4].

BD has denser and higher resolutions such as media, photos, and videos from sources such as social media, mobile applications, public records, and databases; the data is either in static batches or dynamically generated by machines and users, enabled by the advanced capacities of hardware, software, and network technologies. Examples include data from sensor networks or tracking user behavior. Rapidly increasing volumes of data and data objects add enormous pressure on existing IT infrastructures with scaling difficulties such as capabilities for data storage, advanced analysis, and security. These difficulties result from BD’s large and growing files, at high speed, and in various formats, as is measured by: Velocity (the data comes at high speed, e.g., scientific data such as data from weather patterns); Volume (the data results from large files, e.g., Facebook generates 25TB of data daily); and Variety (the files come in various formats: audio, video, text messages, etc. [2]). Therefore, BD data processing systems must be able to deal with collecting, analyzing, and securing BD data that requires processing very large data sets that defy conventional data management, analysis, and security technologies. In the simplest cases, some solutions use a dedicated system for their BD processing. However, to maximize scalability and performance, most BD processing systems apply massively parallel software running on many commodity computers in distributed computing frameworks that may include columnar databases and other BD management solutions [5].

Access Control (AC) systems are among the most critical of network security components. It is more likely that privacy or security will be compromised due to the misconfiguration of access control policies than from a failure of a cryptographic primitive or protocol. This problem becomes increasingly severe as software systems become more and more complex, such as BD processing systems, which are deployed to manage a large amount of sensitive information and resources organized into a sophisticated BD processing cluster. Basically, BD AC systems require collaboration among cooperating processing domains as protected computing environments, which consist of computing units under distributed AC management [6].

Many architecture designs have been proposed to address BD challenges; however, most of them have been focused on the processing capabilities of the “three Vs” (Velocity, Volume, and Variety). Considerations for security in protecting BD are mostly ad hoc and patch efforts. Even with the inclusion of some security capability in recent BD systems,
practical AC (authorization) for BD processing components is not readily available.

This paper proposes a general AC scheme for distributed BD processing clusters. Section II describes current BD tools and implementations. Section III discusses BD AC requirements. Section IV introduces related work. Section V illustrates our BD AC scheme. Section VI discusses implementation considerations for the general BD model. Section VII concludes the paper.

II. GENERAL BIG DATA MODEL AND EXAMPLE

The fundamental model for most of the current BD architecture designs is based on the concept of distributed processing [2, 4], which contains a set of generic processing systems as shown in Figure 1:

1. Master System (MS) receives data from BD data source providers, and determines processing steps in response to a user’s request. MS has the following three major functions:
• Task Distribution (TD) function is responsible for distributing processes to the cooperated (slave) systems of the BD cluster.
• Data Distribution (DD) function is responsible for distributing data to the cooperated systems of the BD cluster.
• Result Collection (RC) function collects and analyzes information provided by cooperating systems of the BD cluster, and generates aggregated results for users.
Unless restricted by specific applications, the three functions are usually installed and managed in the same host machine for easy and secure management and maintenance.

2. Cooperated System (CS) (or slave system) is assigned and trusted by MS for BD processing. CS reports progress or problems to the TD and DD; otherwise, it returns the computed result to the RC of MS.

Fig. 1. General BD model

We define a BD Cluster as a BD distributed system that networks MS and CSs to serve BD users’ requests to process BD source data. The model in Figure 1 represents a generic distributed BD architecture such as the Apache Foundation’s open source software Hadoop [7] (Figure 2), which is used throughout industry and government. Hadoop keeps data and processing resources in close proximity within the cluster. It runs on a distributed model composed of numerous low-cost computers (e.g., Linux-based machines with simple architecture) and has two main MS components, TD (MapReduce) and DD (the Hadoop Distributed File System, HDFS), plus a set of tools. The TD provides distributed data processing across the cluster, and the DD distributes large data sets across the servers in the cluster [3]. Hadoop’s CSs are called Slaves, and each has two components: Task Tracker and Data Node. The MS contains two additional components: Job Tracker and Name Node. Job Tracker and Task Tracker are grouped as MapReduce, and Name Node and Data Node fall under HDFS.

Fig. 2. Hadoop BD cluster example

Hadoop combines storage, servers, and networking to break data as well as computation down to small pieces. Each computation is assigned a small piece of data, so that instead of one big computation, numerous small computations are performed much faster, with the result aggregated and sent back to the application [2]. Thus, Hadoop provides linear scalability, using as many computers as required. Cluster communication between MS and CSs manages what traditional data service systems would not be able to handle.

III. BIG DATA ACCESS CONTROL CHALLENGES

Enterprises want the same security capabilities for BD as are in place for “non-BD” information systems, including user authentication and authorization (AC). According to [4], the biggest challenge in deploying BD technologies is security (50% of those surveyed), and the biggest challenge working with and leveraging technologies for BD data is to maintain data security (47% of those surveyed). One of the fundamental security techniques is AC policy enforcement and management [8], which allows organizations to safeguard their BD in order to meet security and privacy mandates. However, the three Vs of data are overwhelming for existing system models, which were not designed and built with AC capability in mind [9]. Thus, most of them fail to adequately manage the creation, use, and dissemination of BD data and process. As a result, they either introduce friction into collaboration through excessively
strict rules, or risk serious data loss by sharing data too permissively [5].

Authentication is different from authorization, as distinguished in [10]; the authentication management function is not directly related to the data content. For BD, as for non-BD data systems, authentication is generally handled by MS and CSs independently. The focus of our BD scheme is on authorization (AC), which is more complex than in non-BD systems, because of the need to synchronize access privileges between the MS and CSs.

BD AC must not only enforce access control policies on data leaving the MS, it must also control access to the CSs’ resources. Depending on the sensitivity of the data, it needs to make certain that BD applications, the MS, and CSs have permissions to access the data that they are analyzing, and deal with the access to the distributed BD process and data from their local users [11]. The characteristics of the BD distributed computing model, as illustrated below, pertain to a unique set of challenges for BD AC, which requires a different set of concepts and considerations.

As in Hadoop, support for BD’s three V features complicates a system’s AC implementation, because the difficulties are in general handled by the following techniques, each with its own security challenge.

• Distributed computing -- BD data is processed anywhere resources are available, enabling massively parallel computation between MS and CSs. This creates complicated environments that are highly vulnerable to attack, as opposed to the centralized repositories that are monolithic and easier to secure.
• Fragmented/redundant data -- Data within BD clusters is fluid, with multiple copies moving to and from MS and CSs to ensure redundancy and resiliency. Data can become sliced into fragments that are shared across them. This fragmentation adds complexity to the data integrity and confidentiality.
• Node-to-node communication -- MS and CSs usually communicate through insecure protocols such as RPC over TCP/IP [9].

IV. RELATED WORK

Tools and techniques for BD AC should protect BD processes and data, ensuring that security policies are enforced in a cost-effective and timely manner. Currently, only a few approaches that address the unique architecture of distributed computing can meet the security requirements of BD AC. Some provide an enterprise-class security solution by generally applying traditional perimeter security solutions for a control point (gateway/perimeter such as firewalls and intrusion detection/prevention technologies) where data and commands enter the MS or CS. But traditional approaches that rely on perimeter security are unable to adequately secure a BD cluster. For example, firewalls attempt to map IP to actual AD (Active Directory) credentials, but this is problematic in the BD cluster, because it requires specific network design (i.e., no Network Address Translation (NAT) from internal corporate sub-nets). Even with special network configuration, a firewall can only restrict access on an IP/port basis, and knows nothing of the architecture of the BD cluster. As a result, security administrators have to segregate sensitive data on separate servers in order to control access. It would require the creation of a second BD cluster to contain sensitive data, and even then would only provide two levels of security for the data. Even without those drawbacks, perimeter security solutions represent a single layer of defense around a soft interior; for example, once a firewall is breached, the system is wide open for attack [9].

Hadoop, like many open source technologies, was not created with security in mind. It uses the MapReduce facility and a distributed file system with no built-in security. The Hadoop community realized that more robust security controls were needed, and decided to focus on security by applying technologies including Kerberos, firewalls, and basic HDFS permissions [9, 11]. The Kerberos implementation utilized the token-based framework to support a flexible authorization enforcement engine that aims to replace (but be backwards compatible with) the current AC Lists (ACLs) approaches for AC, thus to support an advanced authorization model, focusing on Attribute Based AC (ABAC) and the XACML standard. However, Kerberos is difficult to install and configure on the MS and CSs, and to integrate with Active Directory (AD) and Lightweight Directory Access Protocol (LDAP) services. A malicious developer could easily write code to impersonate users’ Hadoop services (e.g., writing a new TaskTracker and registering itself as a Hadoop service, or impersonating the HDFS or mapped users, deleting everything in HDFS, etc.). In addition, DataNodes enforced no AC; a malicious user could read arbitrary data blocks from DataNodes, bypassing AC restrictions, or write garbage data to DataNodes, undermining the integrity of the data to be analyzed. Further, anyone could submit a job to a JobTracker and it could be arbitrarily executed.

Some components of the Hadoop ecosystem have applied their own security as a layer over Hadoop; for example, Apache Accumulo [12] provides cell-level authorization, and HBase [13] provides AC at the column and family level [11]. Some of them configured Hadoop to perform AC based on user and group permissions by ACLs, but this may not be enough for every organization, because many organizations use flexible and dynamic AC policies based on security attributes of users and resources and business processes, so the ACL approach is certainly limited [11].

For decades, relational, or SQL-based, databases have been the database schema of choice to store and manage data; such databases allow data to be stored by predefined schema such as an RDBMS’s rows and columns in table format. SQL-based databases support AC on data queries by assigning columns or rows security attributes so that they conform to the Attribute-Based AC (ABAC) [10] model that is central to many database security frameworks. But in the era of BD, the traditional database model has difficulty dealing with the multitude of unstructured data types, as well as the massive amounts of data that must be stored, managed, and manipulated. Many applications employ NoSQL [14] for handling unstructured, messy, and unpredictable data. NoSQL
encompasses a wide variety of different database technologies that were developed in response to a rise in the volume of data stored. They are built to allow the insertion of data without a predefined schema; the NoSQL taxonomy supports key-value stores, document stores, BigTable, and graph databases. However, NoSQL is useful when constraints and validation logic are not required to be implemented in a database. Thus, most BD environments only offer AC at the schema level, with no finer granularity to address attributes of users and resources. Therefore, if the AC for BD data query is based on the structure attributes of BD data, then NoSQL needs to support the capability to create and manage AC information; such a capability may require an application layer on top of the existing NoSQL mechanisms [15].

V. GENERAL BIG DATA ACCESS CONTROL SCHEME

Fundamentally, the AC requirements for a BD cluster are no different than non-BD systems. However, due to the facts that (1) BD is processed by distributing its processes and data from MS to CSs, and (2) BD data has no formal scheme for database management, BD AC needs AC capabilities beyond those of non-BD systems. A BD cluster is a construct of an enterprise system that requires MS’s AC mechanism to be incorporated with CSs’. In terms of AC privilege as defined in [10], when MS passes the process/data to a CS, the MS is the Subject, the BD process/data and required CS local resources are the Objects, and the required actions are the Actions in CS’s AC policy. Figure 3 shows our proposed AC scheme based on the general BD model described in Section II. The scheme includes AC components to meet the BD AC requirements as described below.

Security Agreement (SA) is a mutual agreement between BD source provider and the MS for defining security classes of BD source. The purpose of SA is for the BD source provider and the MS to define and agree upon security classes (ranks), so that they can be referred to by the MS and CSs to decide the levels of security (or trust) that a CS is qualified for processing the BD. For instance, a BD source may be an email log file; considering the confidential level, the log can be accessed only by a CS with security class, say, from 1 to 3.

Trust CS List (TCSL) lists the trusted CSs recognized by the MS. The TCSL categorizes CSs by the security classes according to the SAs worked out with BD source providers. TCSL is managed by the MS security officer based on their knowledge of the associated CSs. For example, CS-i is assigned to class 1, CS-j and CS-k are assigned to class 2, and CS-l is assigned to class 3. In other words, TCSL allows the MS to determine how CSs are trusted for the distributions of BD process and data.

MS AC Policy (MSP) is managed by the MS security officer. MSP specifies a set of AC rules that are imposed by MS to enforce AC on CSs. For example, the distributed BD data can be read only by subjects with attribute company employee, or the BD process can be executed only by processes with subject attribute system administrator in a CS.

CS AC Policy (CSP) is managed by the CS security officer. CSP allows the CS to control the access to the distributed BD data/process by considering the processing capabilities (e.g., system load) and security requirements of the CS. For example, the distributed BD data cannot be written to disk space that is shared by other local CS users that have nothing to do with the BD, or BD data can be printed only from local printers. Additionally, the CSP needs to handle a situation when AC rules from other CS local policies conflict with CSP rules.

Federated Attribute Definitions (FAD) list the common attributes used by MS and CSs, so that the MSP and CSPs can be composed using the common attributes in the FAD dictionary. For example, attribute local user is defined as all users who can log into the CS system, company employee is defined as all the CS users who have the company’s employee identifications, and system administrator is defined as the CS user who has system administrator privilege on the CS system. So, the FAD serves as the federated dictionary of AC attributes that should be syntactically and semantically agreed upon by the MS and CSs.

To apply the scheme, the following tasks need to be performed before the application:

• Coordinate BD source providers and MS for SA agreements;
• MS collects information about CSs based on the knowledge of CSs’ security capabilities, levels of assurances, or trust. The information is required for MS to define security classes in sync with BD source providers for SA;
• Coordinate MS and CSs to define attributes in FAD based on the common syntactic and semantic values of attributes for the BD processing needs;
• CS prepares information about how it trusts the MS (i.e., what local resources are available for the BD processing) and considerations for performance and security capabilities of the CS’s local system, as well as responsibility for disseminating the distributed BD process and data. All the information will be translated into CSP rules;
• CS composes meta-rules to handle conflicts between CS’s local AC policies and CSP, as well as between MSP and CSP. Note that unless specifically agreed upon, CSP rules should have higher priority than MSP rules when determining access privileges. If CS denies a BD process’s access, it should notify MS for transferring the task/data to other CSs; and
• In addition to authorization enforcement, MS and CSs may consider including AC activity audit capability for BD access logs.
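The SA and TCSL examples from this section, together with the task that a denying CS hands the work back to the MS, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the names `trusted_for` and `dispatch`, the source label "email-log", and the `cs_accepts` callback are hypothetical and do not come from the paper; only the class numbers and the CS names are taken from the examples in the text.

```python
from typing import Callable, Dict, List, Optional, Set

# Values follow the examples in the text; the layout itself is an assumption.
SA: Dict[str, Set[int]] = {"email-log": {1, 2, 3}}   # BD source -> agreed security classes
TCSL: Dict[str, int] = {"CS-i": 1, "CS-j": 2, "CS-k": 2, "CS-l": 3}  # CS -> assigned class


def trusted_for(source: str) -> List[str]:
    """Return the CSs whose TCSL class is among the classes agreed in the SA."""
    classes = SA.get(source, set())
    return sorted(cs for cs, cls in TCSL.items() if cls in classes)


def dispatch(source: str, cs_accepts: Callable[[str], bool]) -> Optional[str]:
    """MS side: offer the task to each trusted CS in turn.

    A CS that denies access (cs_accepts returns False) causes the MS to
    transfer the task to the next trusted CS; None means no CS accepted.
    """
    for cs in trusted_for(source):
        if cs_accepts(cs):
            return cs
    return None


eligible = trusted_for("email-log")
chosen = dispatch("email-log", lambda cs: cs == "CS-k")  # hypothetical: only CS-k accepts
```

A source without an SA entry yields an empty list, so the MS never distributes data for which no security classes were agreed.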
Fig. 3. A generic BD architecture

Formally, the proposed scheme can be represented by:

• A set of security classes from $c_1$ to $c_n$ in the set $C = \{c_1, \ldots, c_n\}$.
• A set of BD source providers from $bd_1$ to $bd_n$ in the set $BD = \{bd_1, \ldots, bd_n\}$.
• A set of CSs from $cs_1$ to $cs_n$ in the set $CS = \{cs_1, \ldots, cs_n\}$.
• A set of federated attributes from $at_1$ to $at_n$ in the set $FAD = \{at_1, \ldots, at_n\}$.
• A set of $(bd_x, c_x)$ pairs in the set $SA \subseteq BD \times C$.
• A set of $(cs_x, c_x)$ pairs in the set $TCSL \subseteq CS \times C$.
• A set of MS policy rules from $mp_1$ to $mp_n$ in the set $MSP = \{mp_1, \ldots, mp_n\}$, where each $mp_i = (at_x, a_x, bd_x)$ is a tuple in which $at_x \in FAD$ is an attribute, $a_x$ is an action, and $bd_x \in BD$ is a source provider, meaning that a subject with attribute $at_x$ is permitted to perform action $a_x$ on an object from $bd_x$.
• A set of AC policy rules for CS $cs_i$ in the set $CSP_i = \{csp_{i1}, \ldots, csp_{in}\}$, where each $csp_{ij} = (bd_x, a_x, rs_x)$ is a tuple in which $bd_x \in BD$, $a_x$ is an action, and $rs_x$ is a local resource from $cs_i$, meaning that a subject from $bd_x$ is permitted to perform action $a_x$ on object $rs_x$.

Let $BDU = (u, a_u, bd_u)$ represent a BD user request, where $u$ is a BD user authenticated by the MS, $a_u$ is a requested action, and $bd_u$ is the BD source provider of the data/process on which $u$ requests to perform $a_u$. The AC algorithm to accept $(u, a_u, bd_u)$ on the CS $cs_l$ is:

BDAC {
    $C_u = \{c_1, \ldots, c_k\}$ such that $(bd_u, c_i) \in SA$;
    if $C_u = \emptyset$ {
        request = deny    /* for this $cs_l$ */
    } else {
        $CS_u = \{cs_1, \ldots, cs_k\}$ such that $(cs_i, c_i) \in TCSL$ and $c_i \in C_u$;
        if $CS_u = \emptyset$ or $cs_l \notin CS_u$ {
            request = deny    /* for this $cs_l$ */
        } else {
            if (there exists $mp_x = (at_x, a_x, bd_x) \in MSP$ such that $a_u = a_x$ and $bd_u = bd_x$)
               and (there exists $csp_x = (bd_x, a_x, rs_x) \in CSP_l$ such that $a_u = a_x$ and $bd_u = bd_x$ and $rs_x$ = resource required) {
                request = grant    /* for this $cs_l$ */
                if perform $a_u$ on $rs_x$ = success {
                    return result to RC
                } else {
                    return “RC resource from $cs_l$ unavailable”
                }
            } else {
                request = deny    /* for this $cs_l$ */
            }
        }
    }
}

The following demonstrates an example of the algorithm. The BD source provider x enforces security class 2 as defined in SA. Class 2 is a level of moderate trust by some mutual criteria between x and MS. MS sets up MSP and regulates that only system administrator can process x’s BD, after the subject attribute system administrator is syntactically and semantically agreed upon between MS and participating CSs. MS also assigns CS-i, CS-j, and CS-k to security classes greater than or equal to class 2 in the TCSL (assuming the higher the number, the higher the security class). Assume CS-i’s local AC policy does not allow processing of nonlocal data, and CS-i’s CSP gives local policy higher priority than MSP. In addition, CS-j does not give any system administrator privilege to nonlocal processes; thus, only CS-k can process requests for x’s BD.
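A minimal Python sketch of the BDAC decision, run on the example with source provider x and the systems CS-i, CS-j, and CS-k, may make the control flow easier to follow. Several things here are assumptions rather than part of the paper: the names `MsRule`, `CsRule`, and `bdac`; the resource label "cpu"; the extra system "CS-m"; modelling "class 2 or higher" as SA pairs (x, 2) and (x, 3); and modelling the refusals of CS-i and CS-j as empty CSP rule sets. The sketch stops at the grant/deny decision and does not perform the action or return results to the RC.

```python
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class MsRule(NamedTuple):      # mp = (at, a, bd)
    attribute: str
    action: str
    source: str


class CsRule(NamedTuple):      # csp = (bd, a, rs)
    source: str
    action: str
    resource: str


def bdac(action: str, source: str, cs: str, resource: str,
         sa: Set[Tuple[str, int]], tcsl: Set[Tuple[str, int]],
         msp: Iterable[MsRule], csp: Dict[str, List[CsRule]]) -> str:
    """Decide 'grant' or 'deny' for request (u, action, source) on system cs."""
    classes = {c for (bd, c) in sa if bd == source}           # C_u
    if not classes:
        return "deny"
    eligible = {s for (s, c) in tcsl if c in classes}          # CS_u
    if cs not in eligible:
        return "deny"
    # As in the pseudocode, the MSP check matches on action and source only.
    ms_ok = any(r.action == action and r.source == source for r in msp)
    cs_ok = any(r.source == source and r.action == action and r.resource == resource
                for r in csp.get(cs, []))
    return "grant" if ms_ok and cs_ok else "deny"


# Hypothetical encoding of the worked example.
sa = {("x", 2), ("x", 3)}
tcsl = {("CS-i", 2), ("CS-j", 2), ("CS-k", 3), ("CS-m", 1)}
msp = [MsRule("system administrator", "process", "x")]
csp = {"CS-i": [], "CS-j": [], "CS-k": [CsRule("x", "process", "cpu")]}

decisions = {cs: bdac("process", "x", cs, "cpu", sa, tcsl, msp, csp)
             for cs in ("CS-i", "CS-j", "CS-k", "CS-m")}
```

Under these assumptions only CS-k is granted: CS-i and CS-j pass the SA and TCSL checks but have no matching CSP rule, and the hypothetical CS-m is rejected earlier because its class is not covered by the SA.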
Figure 4 depicts the BD AC control/manage domains, where the MS AC domain is enforced collectively by information in the SA, TCSL, MSP, and FAD entries, which are all configured and managed by MS. The CS AC domain is enforced collectively by information in MSP, FAD, and CSP entries; however, only CSP entries are configured and managed by the CS. Note that $at_x$ in the FAD and $cs_x$ in TCSL are used to determine if a CS is permitted to process the BD process request. The $cs_x$ is independently decided by the MS, but $at_x$ needs to share the responsibilities with trusted CSs for access privilege decisions. Thus, the chain of trust is from BD source provider to MS through SA, then from MS to CS through TCSL, MSP and FAD, and also from CS to MS through FAD and CSP, which shows that both MS’s and CS’s AC domains do not allow unauthorized access without the coordination between MS and CSs.

[Figure 4: an authenticated BD user request $(u, a_u, bd_u)$ is checked against SA entries $(bd_u, c_x)$, TCSL entries $(c_x, cs_x)$, FAD entries $at_x$, MSP entries $(at_x, a_u, bd_u)$, and CSP entries $(bd_u, a_u, rs_u)$; unauthorized MS access and unauthorized CS access are marked.]

Fig. 4. BD AC control/manage domain.

Note that to be general, we provide a core concept of the scheme. For practical implementations, additional sets and rules for the algorithm that cover other system components such as backup, logs, etc., might need to be included.

VI. IMPLEMENTATION CONSIDERATIONS

In addition to the scheme, operational considerations that are common for other enterprise BD AC mechanisms also need to be addressed as below.

Trust establishment -- For better performance, MS might need to separate parallel TD and DD functions (e.g., Hadoop) that process immense volumes of BD. In this case, TDs and DDs should also protect the BD data from an untrusted CS, especially when their processes are delegated in a virtual environment such as that of a Cloud. For example, some suggest ensuring the trustworthiness of CSs by Mandatory AC (MAC) mechanisms, so that a CS must be authenticated and given properties by MS, and only when CSs are competent can they be assigned CS tasks. After this qualification, periodic updates must be made to ensure CSs consistently meet established policies [16].

Content Attributes -- If instead of SAs, the security classes for MSP are dynamically determined by referring to attributes from the BD sources’ contents [15], then the issue would be that as data grows in volume and complexity, it becomes harder to retrieve required attributes, for example, by traditional database queries or analytic tools. This is because established procedures for manipulating data typically rely upon imposition of a rigid structure or schema (defining elements of the data to be a location, a device type, a zip code, a user ID, etc.) and the pre-calculation of indexes to recall specific data fields (attributes), such as used in RDBMS or NoSQL. These approaches continue to excel in managing highly structured data but are less well suited for BD [5].

AC auditing -- The nature of distributed processing in BD clusters poses a challenge for AC activity auditing, which tends to be incapable of analyzing the interconnected data between MS and CSs, which are responsible for tracking the actual interactions with files and resources spread across their operating systems and applications. To minimize the problems, organizations may want to implement auditing functions in the infrastructure layer, simplify complexity in the application space, and adopt authentication and MAC. However, scaling these technologies to the levels necessary to accommodate BD can present its own set of unique challenges [5, 16].

Combining with Cloud computing environment -- With the emergence of Cloud platforms, many might consider processing BD in a Cloud environment, to handle the growing volume and complexity of BD. However, Cloud computing has both positive and negative effects on the BD. As data becomes more accessible through the Cloud, there are three major threats to securing BD processing: malfunctioning infrastructure components, infrastructure outside attacks, and infrastructure inside attacks. To address these threats, the MS should improve trustworthiness and the usability of CSs by strengthening the MSP and fine-grained FAD.

Many AC models and mechanisms can be applied to support the proposed scheme. One of the most versatile is the Attribute-Based Access Control (ABAC) model, because of its capabilities in configuring and managing attributes and AC policies. It also supports the AC requirements of enterprise systems, where BD is usually serviced, as described in [10], which provides guidance for ABAC enterprise implementation by the initiation, acquisition and development, implementation and assessment, and operations and maintenance phases of an enterprise. The guidance may also be applied to BD AC implementation.

VII. SUMMARY AND CONCLUSIONS

To harvest BD benefits, security challenges must be overcome. Security professionals apply most controls at the very edges of the network. However, if attackers penetrate the BD cluster, they will have full and unrestricted access to the BD [9]. Thus, many organizations are being required to enforce AC and privacy restrictions on BD to meet regulatory requirements. In this paper, we presented an AC scheme for BD data processing in a distributed processing environment.
We introduced the definition of BD, illustrated a general BD process model abstracted from a general distributed processing environment, and used the popular BD process application Hadoop as an example. We discussed BD AC issues with related work. Finally, we presented the BD AC scheme and considerations based on trust between BD source providers and the BD Master System (MS), and consequently, between MS and Cooperating Systems (CSs). The scheme is focused on authorization (access control) to protect BD processing and data from insider attacks in the BD cluster, under the assumption that authentication is already established. In addition to AC components, we demonstrated the formal sets and algorithm, and the domain of protection for the scheme, which showed that no unauthorized privileges either from MS or CSs are possible.

Depending on the security requirements of the BD application, many AC models and mechanisms can be applied to the scheme. The proposed scheme is devised to be generally applicable for distributed system-based BD AC, and stems from the fundamental AC mechanism that uses attributes of subjects, objects, actions, and sometimes environment conditions, which are the building blocks of Attribute-Based AC (ABAC) mechanisms, to determine access permission. We further discussed some operational and implementation issues for practical application. These issues are tied to the BD application, and should be handled according to the BD security requirements.

REFERENCES

[1] “Big data to turn ‘mega’ as capacity will hit 44 zettabytes by 2020,” DataIQ News, http://www.dataiq.co.uk/news/20140410/big-data-turn-mega-capacity-will-hit-44-zettabytes-2020, Oct. 2014.
[2] H. Mir, “Hadoop Tutorial 1 What is Hadoop,” ZeroToProTraining, http://ZeroTOProTraining.com http://nusmv.irst.itc.it/.
[3] J. Moore, “How big data is remaking the government data center,” GCN, http://gcn.com/articles/2014/02/14/big-data-data-centers.aspx, Feb. 2014.
[4] W. Bell, “The Big Data Cure,” MeriTalk, http://www.meritalk.com/bigdatacure, 2014.
[5] P. Miller, “Applying big data analytics to human-generated data,” GIGAOM RESEARCH, http://research.gigaom.com/report/applying-big-data-analytics-to-human-generated-data/, Jan. 2014.
[6] V. Hu, R. Kuhn, T. Xie, and J. Hwang, “Model Checking for Verification of Mandatory Access Control Models and Properties,” International Journal of Software Engineering and Knowledge Engineering (IJSEKE), regular issue, Vol. 21, No. 1, 2011.
[7] Hadoop.apache.org
[8] V. Hu and K. Scarfone, “Guidelines for Access Control System Evaluation Metrics,” NIST Interagency Report 7874, Gaithersburg, MD, USA, 2012.
[9] “The Big Data Security Gap: Protecting the Hadoop Cluster,” White Paper, Zettaset, http://www.zettaset.com/wp-content/uploads/2014/04/zettaset_wp_security_0413.pdf, 2014.
[10] V. Hu, D. Ferraiolo, R. Kuhn, A. Schnitzer, K. Sandlin, R. Miller, and K. Scarfone…, “Attribute Based Access Control Definition and Consideration,” NIST Special Publication 800-162, Gaithersburg, MD, USA, 2013.
[11] K. T. Smith, “Big Data Security: The Evolution of Hadoop’s Security Model,” InfoQ, http://www.infoq.com/articles/HadoopSecurityModel, Aug. 2014.
[12] “Apache Accumulo,” https://accumulo.apache.org
[13] Hbase.apache.org
[14] “NoSQL Databases Explained,” mongoDB Inc., http://www.mongodb.com/nosql-explained, 2014.
[15] W. Zeng, Y. Yang, B. Luo, “Access Control for Big Data using Data Content,” in Proc. 2013 IEEE International Conference on Big Data, 2013.
[16] S. Shea, “CSA top 10 big data security, privacy challenges and how to solve them,” TechTarget, SearchSecurity, Nov. 2013.
