An Access Control Scheme For Big Data Processing
Vincent C. Hu, Tim Grance, David F. Ferraiolo, D. Rick Kuhn
Abstract— Access Control (AC) systems are among the most critical of network security components. A system's privacy and security controls are more likely to be compromised due to the misconfiguration of access control policies than by the failure of cryptographic primitives or protocols. This problem becomes increasingly severe as software systems become more and more complex, such as Big Data (BD) processing systems, which are deployed to manage a large amount of sensitive information and resources organized into a sophisticated BD processing cluster. Basically, BD access control requires the collaboration among cooperating processing domains to be protected as computing environments that consist of computing units under distributed AC management. Many BD architecture designs have been proposed to address BD challenges; however, most of them focus on the processing capabilities of the "three Vs" (Velocity, Volume, and Variety). Considerations for security in protecting BD are mostly ad hoc and patchwork efforts. Even with some inclusion of security in recent BD systems, a critical security component, AC (authorization), for protecting BD processing components and their users from insider attacks remains elusive. This paper proposes a general-purpose AC scheme for distributed BD processing clusters.

Keywords—Access Control, Authorization, Big Data, Distributed System

I. INTRODUCTION

Data IQ News [1] estimates that the global data population will reach 44 zettabytes (one zettabyte is a billion terabytes) by 2020. This growth trend is influencing the way data is being mass collected and produced for high-performance computing or for operations and planning analysis. Big Data (BD) refers to data that is too large to process with a traditional data processing system, for example, when analyzing Internet data traffic or editing video data of hundreds of gigabytes. (Note that each case depends on the capabilities of a system; it has been argued that an organization that can efficiently process terabytes of text, audio, and video data per day is not handling BD, while for organizations that cannot process such data efficiently, it is BD [2].) BD technology is gradually reshaping current data systems and practices. Government Computer News [3] estimates that the volume of data stored by federal agencies alone will increase from 1.6 to 2.6 petabytes within two years, and U.S. state and local governments are just as keen on harnessing the power of BD to boost security, prevent fraud, enhance service delivery, and improve emergency response. It is estimated that successfully leveraging BD technologies can reduce IT cost by an average of 48% [4].

BD has denser and higher-resolution content such as media, photos, and videos from sources such as social media, mobile applications, public records, and databases; the data arrives either in static batches or is dynamically generated by machines and users through the advanced capabilities of hardware, software, and network technologies. Examples include data from sensor networks or from tracking user behavior. Rapidly increasing volumes of data and data objects put enormous pressure on existing IT infrastructures, which face scaling difficulties in capabilities such as data storage, advanced analysis, and security. These difficulties result from BD's large and growing files, arriving at high speed and in various formats, as measured by: Velocity (the data comes at high speed, e.g., scientific data such as data from weather patterns); Volume (the data results from large files, e.g., Facebook generates 25 TB of data daily); and Variety (the files come in various formats: audio, video, text messages, etc. [2]). Therefore, BD processing systems must be able to collect, analyze, and secure BD, which requires processing very large data sets that defy conventional data management, analysis, and security technologies. In simple cases, some solutions use a dedicated system for their BD processing. However, to maximize scalability and performance, most BD processing systems apply massively parallel software running on many commodity computers in distributed computing frameworks that may include columnar databases and other BD management solutions [5].

Access Control (AC) systems are among the most critical of network security components. It is more likely that privacy or security will be compromised due to the misconfiguration of access control policies than from a failure of a cryptographic primitive or protocol. This problem becomes increasingly severe as software systems become more and more complex, such as BD processing systems, which are deployed to manage a large amount of sensitive information and resources organized into a sophisticated BD processing cluster. Basically, BD AC systems require collaboration among cooperating processing domains as protected computing environments, which consist of computing units under distributed AC management [6].
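To make the notion of computing units under distributed AC management concrete, the following minimal Python sketch shows one way a request could be checked against the local policy of each cooperating domain before a distributed task is accepted. It is only an illustration; the policy format and all names (LOCAL_POLICIES, local_check, authorize_distributed_task) are assumptions made for this sketch, not the scheme proposed in this paper.

```python
# Illustrative sketch: each cooperating computing unit enforces its own
# local access control policy before it accepts a distributed BD task.
# (Policy format and names are assumed for illustration only.)

LOCAL_POLICIES = {
    "domain-a": {("analyst", "read"), ("admin", "read"), ("admin", "write")},
    "domain-b": {("admin", "read"), ("admin", "write")},
}


def local_check(domain, subject_role, action):
    # Each domain answers only for its own resources.
    return (subject_role, action) in LOCAL_POLICIES.get(domain, set())


def authorize_distributed_task(task_domains, subject_role, action):
    # A BD task touching several domains needs every involved domain to agree.
    denied = [d for d in task_domains if not local_check(d, subject_role, action)]
    return (len(denied) == 0, denied)


if __name__ == "__main__":
    ok, denied = authorize_distributed_task(["domain-a", "domain-b"], "analyst", "read")
    print(ok, denied)  # False ['domain-b'] -- domain-b's local policy blocks the analyst
```

The point of the sketch is only that authorization decisions are made per domain rather than by a single central policy store, which is what distributed AC management across cooperating domains implies.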
Many architecture designs have been proposed to address BD challenges; however, most of them have focused on the processing capabilities of the "three Vs" (Velocity, Volume, and Variety). Considerations for security in protecting BD are mostly ad hoc and patchwork efforts. Even with the inclusion of some security capability in recent BD systems, practical AC (authorization) for BD processing components is not readily available.

This paper proposes a general AC scheme for distributed BD processing clusters. Section II describes current BD tools and implementations. Section III discusses BD AC requirements. Section IV introduces related work. Section V illustrates our BD AC scheme. Section VI discusses implementation considerations for the general BD model. Section VII concludes the paper.

II. GENERAL BIG DATA MODEL AND EXAMPLE

The fundamental model for most of the current BD
architecture designs is based on the concept of distributed
processing [2, 4], which contains a set of generic processing
systems as shown in Figure 1:
1. Master System (MS) receives data from BD data source
providers, and determines processing steps in response to a
user’s request. MS has the following three major functions:
• Task Distribution (TD) function is responsible for
distributing processes to the cooperated (slave) systems of the
BD cluster.
• Data Distribution (DD) function is responsible for distributing data to the cooperated systems of the BD cluster.
• Result Collection (RC) function processes, collects, and analyzes information provided by the cooperated systems of the BD cluster, and generates an aggregated result for users.
Unless restricted by specific applications, the three
functions are usually installed and managed in the same host
machine for easy and secure management and maintenance.
2. Cooperated System (CS) (or slave system) is assigned and trusted by the MS for BD processing. A CS reports progress or problems to the TD and DD; otherwise, it returns the computed result to the RC of the MS.
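The following minimal Python sketch is only an illustration of the division of labor just described: a master that splits the data (DD), assigns the task to cooperated systems (TD), and aggregates their reports (RC). The class and function names (MasterSystem, CooperatedSystem, td, dd, rc) are hypothetical and do not correspond to any particular BD framework.

```python
# Minimal sketch of the general BD processing model (illustrative only).
# MS = Master System with Task Distribution (TD), Data Distribution (DD),
# and Result Collection (RC); CS = Cooperated (slave) System.
from concurrent.futures import ThreadPoolExecutor


class CooperatedSystem:
    """A CS executes an assigned task on its assigned data partition."""

    def __init__(self, name):
        self.name = name

    def run(self, task, partition):
        try:
            return {"cs": self.name, "ok": True, "result": task(partition)}
        except Exception as err:  # report problems back to the MS
            return {"cs": self.name, "ok": False, "error": str(err)}


class MasterSystem:
    """The MS distributes tasks (TD) and data (DD), then collects results (RC)."""

    def __init__(self, slaves):
        self.slaves = slaves

    def dd(self, data):
        # DD: split the data set into one partition per cooperated system.
        n = len(self.slaves)
        return [data[i::n] for i in range(n)]

    def td(self, task, partitions):
        # TD: assign the task and a partition to every cooperated system.
        with ThreadPoolExecutor(max_workers=len(self.slaves)) as pool:
            futures = [pool.submit(cs.run, task, part)
                       for cs, part in zip(self.slaves, partitions)]
            return [f.result() for f in futures]

    def rc(self, reports):
        # RC: aggregate the partial results returned by the cooperated systems.
        return sum(r["result"] for r in reports if r["ok"])

    def process(self, task, data):
        return self.rc(self.td(task, self.dd(data)))


if __name__ == "__main__":
    ms = MasterSystem([CooperatedSystem(f"cs-{i}") for i in range(3)])
    # Example request: total the values held in the BD source.
    print(ms.process(sum, list(range(100))))  # -> 4950
```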
A widely used implementation of this general BD model is Apache Hadoop, an open-source framework deployed throughout industry and government. Hadoop keeps data and processing resources in close proximity within the cluster. It runs on a distributed model composed of numerous low-cost computers (e.g., Linux-based machines with simple architecture), with two main MS components, TD (MapReduce) and DD (the Hadoop Distributed File System, HDFS), and a set of tools. The TD provides distributed data processing across the cluster, and the DD distributes large data sets across the servers in the cluster [3]. Hadoop's CSs are called Slaves, and each has two components: a Task Tracker and a Data Node. The MS contains two additional components: a Job Tracker and a Name Node. The Job Tracker and Task Trackers are grouped as MapReduce, and the Name Node and Data Nodes fall under HDFS.

Fig. 2. Hadoop BD cluster example

Hadoop combines storage, servers, and networking to break data as well as computation down into small pieces. Each computation is assigned a small piece of data, so that instead of one big computation, numerous small computations are performed much faster, with the results aggregated and sent back to the application [2]. Thus, Hadoop provides linear scalability, using as many computers as required. Cluster communication between the MS and CSs manages what traditional data service systems would not be able to handle.
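The "numerous small computations" idea can be illustrated without any Hadoop dependency. The toy word count below is a conceptual sketch of the MapReduce pattern that the Job Tracker and Task Trackers carry out at scale; it does not use the Hadoop API, and the function names (split_records, map_task, reduce_results) are assumptions made for this example.

```python
# Conceptual word-count sketch of the MapReduce pattern (not the Hadoop API).
from collections import Counter
from functools import reduce


def split_records(lines, n_workers):
    # The DD/Name Node role: break the input into small pieces (splits).
    return [lines[i::n_workers] for i in range(n_workers)]


def map_task(split):
    # The Task Tracker role: one small computation on one small piece of data.
    counts = Counter()
    for line in split:
        counts.update(line.lower().split())
    return counts


def reduce_results(partials):
    # The RC/Job Tracker role: aggregate partial results for the application.
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    lines = ["big data needs access control",
             "access control protects big data",
             "data data data"]
    partial_counts = [map_task(s) for s in split_records(lines, n_workers=2)]
    print(reduce_results(partial_counts).most_common(3))
    # -> [('data', 5), ('big', 2), ('access', 2)]
```

Each map_task call is independent, which is what lets the work scale linearly with the number of cooperated systems, while the single reduce step plays the role of result collection at the MS.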