Chapman & Hall/CRC
Big Data Series
SERIES EDITOR
Sanjay Ranka
PUBLISHED TITLES
BIG DATA: ALGORITHMS, ANALYTICS, AND APPLICATIONS
Kuan-Ching Li, Hai Jiang, Laurence T. Yang, and Alfredo Cuzzocrea
Chapman & Hall/CRC
Big Data Series
BIG DATA
Algorithms, Analytics,
and Applications
Edited by
Kuan-Ching Li
Providence University
Taiwan
Hai Jiang
Arkansas State University
USA
Laurence T. Yang
St. Francis Xavier University
Canada
Alfredo Cuzzocrea
ICAR-CNR & University of Calabria
Italy
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2015 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Foreword by Jack Dongarra
It is apparent that in the era of Big Data, when every major field of science, engineering, business, and finance is producing and needs to (repeatedly) process truly extraordinary amounts of data, the many unsolved problems of when, where, and how all those data are to be produced, transformed, and analyzed have taken center stage. This book, Big Data: Algorithms, Analytics, and Applications, addresses and examines important areas such as management, processing, data stream techniques, privacy, and applications.
The collection presented in the book covers fundamental and realistic issues about Big Data, including efficient algorithmic methods to process data, better analytical strategies to digest data, and representative applications in diverse fields such as medicine, science, and engineering. It seeks to bridge the gap between huge amounts of data and appropriate computational methods for scientific and social discovery, and to bring together technologies for media/data communication, elastic media/data storage, cross-network media/data fusion, SaaS, and others. It also aims at interesting applications related to Big Data.
This timely book edited by Kuan-Ching Li, Hai Jiang, Laurence T. Yang, and Alfredo Cuzzocrea gives a quick introduction to the understanding and use of Big Data and provides examples and insights into the process of handling and analyzing problems. It presents introductory concepts as well as complex issues. The book has five major sections: management, processing, streaming techniques and algorithms, privacy, and applications. Throughout the book, examples and practical approaches for dealing with Big Data are provided to help reinforce the concepts presented in the chapters.
This book is essential reading for anyone working in a major field of science, engineering, business, or finance. It explores Big Data in depth and provides different perspectives on how to approach Big Data and how they are currently being handled. I have enjoyed and learned from this book, and I feel confident that you will as well.
Jack Dongarra
University of Tennessee, Knoxville
Foreword by Dr. Yi Pan
In 1945, mathematician and physicist John von Neumann and his colleagues wrote an article, "First Draft of a Report on the EDVAC," to describe their new machine EDVAC, based on ideas from J. Presper Eckert and John Mauchly. The proposed concept of the stored-program computer, known as the von Neumann machine, has been widely adopted in modern computer architectures. Both the program and the data for an application are kept in computer memory for execution flexibility. Computers were no longer dedicated to single jobs.
In 1965, Intel's cofounder Gordon Moore proposed the so-called Moore's law to predict that the number of transistors doubles every 18 months. Computer hardware has roughly followed Moore's law over the years. However, to ensure that computing capability doubles every 18 months, computer hardware has to work closely with system software and application software. For each computer job, both the computation and the data aspects should be handled properly for better performance.
To scale up computing systems, multiple levels of parallelism, including instruction-level, thread-level, and task-level parallelism, have been exploited. Multicore and many-core processors as well as symmetric multiprocessing (SMP) have been used in hardware. Data parallelism is explored through byte-level parallelization, vectorization, and SIMD architectures. To tolerate the latency of data access in memory, a memory hierarchy with multiple levels of cache is utilized to help overcome the memory wall issue.
To scale out, multicomputer systems such as computer clusters, peer-to-peer systems, grids, and clouds are adopted. Although the aggregated processing and data storage capabilities have increased, communication overhead and data locality remain the major issues. Still, scaled-up and scaled-out systems have been deployed to support computer applications over the years.
Recently, Big Data has become a buzzword in the computer world, as the size of data increases exponentially and has reached the terabyte and petabyte ranges. Representative Big Data sets come from fields in science, engineering, government, the private sector, and daily life. Data digestion has overtaken data generation as the leading challenge for the current computer world. Several leading countries and their governments have noticed this new trend and released new policies for further investigation and research activities. As distributed systems and clouds become popular, and compared to the traditional data in common applications, Big Data exhibit some distinguishing characteristics, such as volume, variety, velocity, variability, veracity, value, and complexity. The manners of data collection, data storage, data representation, data fusion, data processing, and visualization have to change. Since modern computers still follow the von Neumann architecture, and other potential ones such as biocomputers and quantum computers are still in their infancy, algorithmic and software adjustments remain the main consideration for Big Data.
Li, Jiang, Yang, and Cuzzocrea's timely book addresses the Big Data issue. These editors are active researchers who have done a great deal of work in the area of Big Data, and they have assembled a group of outstanding authors. The book's content mainly focuses on algorithm, analytics, and application aspects in five separate sections: Big Data management, Big Data processing, Big Data stream techniques and algorithms, Big Data privacy, and Big Data applications. Each section contains several case studies to demonstrate how the related issues are addressed.
Several Big Data management strategies, such as indexing and clustering, are introduced to illustrate how to organize data sets. Some processing and scheduling schemes, as well as representative frameworks such as MapReduce and Hadoop, are also included to show how to speed up Big Data applications. In particular, stream techniques are explained in detail due to their importance in Big Data processing. Privacy remains a concern in Big Data, and proper attention is necessary. Finally, the text includes several actual Big Data applications in finance, media, biometrics, geoscience, and the social sector. They help demonstrate how the Big Data issue has been addressed in various fields.
I hope that the publication of this book will help university students, researchers, and professionals understand the major issues in Big Data applications as well as the corresponding strategies to tackle them. This book might also stimulate research activities in all aspects of Big Data. As the size of data still increases dramatically every day, the Big Data issue might become even more challenging than people expect. However, this may also be a good opportunity for new discoveries in this exciting field. I highly recommend this timely and valuable book. I believe that it will benefit many readers and contribute to the further development of Big Data research.
Dr. Yi Pan
Distinguished University Professor of Computer Science
Interim Associate Dean of Arts and Science
Georgia State University, Atlanta
Foreword by D. Frank Hsu
Due to instrumentation and interconnection in the past decades, the Internet and the World Wide Web have transformed our society into a complex cyber–physical–social ecosystem where everyone is connected to everything (the Internet of Everything [IOE]) and everybody is an information user as well as an information provider. More recently, Big Data with high volume and wide variety have been generated at a rapid velocity. Everyone agrees that Big Data have great potential to transform the way human beings work, live, and behave. However, their true value to the individual, organization, society, planet Earth, and the intergalactic universe hinges on understanding and solving fundamental and realistic issues about Big Data. These include efficient algorithms to process data, intelligent informatics to digest and analyze data, wide applications to academic disciplines in the arts and sciences, and operational practices in the public and private sectors. The current book, Big Data: Algorithms, Analytics, and Applications (BDA3), comes at the right time with the right purpose.
In his presentation to the Information and Telecommunication Board of the National Research Council (http://www.TheFourthParadigm.com), Jim Gray describes the long history of the scientific inquiry process and knowledge discovery method as having gone through three paradigms: from empirical (thousands of years ago), to theoretical (in the last few hundred years, since the 1600s), and then to modeling (in the last few decades). More recently, the complexity of Big Data (structured vs. unstructured, spatial vs. temporal, logical vs. cognitive, data-driven vs. hypothesis-driven, etc.) poses great challenges not only in application domains but also in analytic methods. These domain applications include climate change, environmental issues, health, legal affairs, and other critical infrastructures such as banking, education, finance, energy, healthcare, information and communication technology, manufacturing, and transportation. In the scientific discovery process, efficient and effective methods for analyzing data are needed in a variety of disciplines, including STEM (astronomy, biomedicine, chemistry, ecology, engineering, geology, computer science, information technology, and mathematics) and other professional studies (architecture, business, design, education, journalism, law, medicine, etc.). The current book, BDA3, covers many of these application areas and academic disciplines.
The Big Data phenomenon has altered the way we analyze data using statistics and computing. Traditional statistical problems tend to have many observations but few parameters, with a small number of hypotheses. More recently, problems such as analyzing fMRI data sets and social networks have diverted our attention to situations with a small number of observations, a large number of variables (such as cues or features), and a relatively larger number of hypotheses (also see D.J. Spiegelhalter, Science, 345: 264–265, 2014). In the Big Data environment, the use of statistical significance (the P value) may not always be appropriate. In analytics terms, correlation is not equivalent to causality, and the normal distribution may not be that normal. An ensemble of multiple models is often used to improve forecasting, prediction, or decision making. Traditional computing problems use static databases, take input from logical and structured data, and run deterministic algorithms. In the Big Data era, a relational database has to be supplemented with other structures due to scalability issues. Moreover, input data are often unstructured and illogical (due to acquisition through cognition, speech, or perception). Due to the rapid streaming of incoming data, it is necessary to bring computing to the data acquisition point. Intelligent informatics will use data mining and semisupervised machine learning techniques to deal with the uncertainty of the complex Big Data environment. The current book, BDA3, includes many of these data-centric methods and analysis techniques.
The editors have assembled an impressive book consisting of 22 chapters, written by 57 authors from 12 countries across the Americas, Europe, and Asia. The chapters are properly divided into five sections on Big Data: management, processing, stream technologies and algorithms, privacy, and applications. Although the authors come from different disciplines and subfields, their journey is the same: to discover and analyze Big Data and to create value from them, for their organizations and society, and for the whole world. The chapters are well written by authors who are active researchers or practicing experts in areas related to or in Big Data. BDA3 will contribute tremendously to the emerging new paradigm (the fourth paradigm) of the scientific discovery process and will help generate many new research fields and disciplines such as those in computational x and x-informatics (where x can be biology, neuroscience, social science, or history), as Jim Gray envisioned. On the other hand, it will stimulate technology innovation and possibly inspire entrepreneurship. In addition, it will have a great impact on cybersecurity, cloud computing, and mobility management for the public and private sectors.
I would like to thank and congratulate the four editors of BDA3—Kuan-Ching Li, Hai Jiang, Laurence T. Yang, and Alfredo Cuzzocrea—for their energy and dedication in putting together this significant volume. In the Big Data era, many institutions and enterprises in the public and private sectors have launched their own Big Data strategies and platforms. The current book, BDA3, is different from those strategies and platforms and focuses on essential Big Data issues, such as management, processing, streaming technologies, privacy, and applications. This book has great potential to provide fundamental insight and privacy to individuals, long-lasting value to organizations, and security and sustainability to the cyber–physical–social ecosystem on the planet.
D. Frank Hsu
Fordham University, New York
Preface
As data sets are being generated at an exponential rate all over the world, Big Data has become an indispensable issue. While organizations are capturing exponentially larger amounts of data than ever before, they have to rethink and figure out how to digest it. The implicit meaning of data can be interpreted in practice through novel and evolving algorithms, analytics techniques, and innovative and effective use of hardware and software platforms, so that organizations can harness the data, discover hidden patterns, and use newly acquired knowledge to act meaningfully for competitive advantage. This challenging vision has attracted a great deal of attention from the research community, which has reacted with a number of proposals focusing on fundamental issues, such as managing Big Data, querying and mining Big Data, making Big Data privacy-preserving, and designing and running sophisticated analytics over Big Data, as well as critical applications, which span a large family of cases, from biomedical (Big) Data to graph (Big) Data, from social networks to sensor and spatiotemporal stream networks, and so forth.
A conceptually relevant point that inspired our work is the recognition that classical management, querying, and mining algorithms, even those developed for very large data sets, are not suitable to cope with Big Data, due to both methodological and performance issues. As a consequence, there is an emerging need for devising innovative models, algorithms, and techniques capable of managing and mining Big Data while dealing with their inherent properties, such as volume, variety, and velocity.
Inspired by this challenging paradigm, this book covers fundamental and realistic issues about Big Data, including efficient algorithmic methods to process data, better analytical strategies to digest data, and representative applications in diverse fields such as medicine, science, and engineering. It seeks to bridge the gap between huge amounts of data and appropriate computational methods for scientific and social discovery, and to bring together technologies for media/data communication, elastic media/data storage, cross-network media/data fusion, Software as a Service (SaaS), and others. It also aims at interesting applications involving Big Data.
According to this methodological vision, this book is organized into five main sections:
• "Big Data Management," which focuses on research issues related to the effective and efficient management of Big Data, including indexing and scalability aspects.
• "Big Data Processing," which moves the attention to the problem of processing Big Data in a widespread collection of resource-intensive computational settings.
• "Big Data Stream Techniques and Algorithms," which explores research issues concerning the management and mining of Big Data in streaming environments, a typical scenario where Big Data show their most problematic drawbacks to deal with—here, the focus is on how to manage Big Data on the fly, with limited resources and approximate computations.
• "Big Data Privacy," which focuses on models, techniques, and algorithms that aim at making Big Data privacy-preserving, that is, protecting them against privacy breaches that may compromise the anonymity of Big Data in conventional settings (e.g., cloud environments).
• "Big Data Applications," which, finally, addresses a rich collection of practical applications of Big Data in several domains, ranging from finance applications to multimedia tools, from biometrics applications to satellite (Big) Data processing, and so forth.
In the following, we provide a description of the chapters contained in the book, according to the previous five sections.
The first section (i.e., "Big Data Management") is organized into the following chapters.
Chapter 1, "Scalable Indexing for Big Data Processing," by Hisham Mohamed and Stéphane Marchand-Maillet, focuses on the K-nearest neighbor (K-NN) search problem, which is the way to find and predict the objects closest and most similar to a given query. It finds many applications in information retrieval and visualization, machine learning, and data mining. The context of Big Data imposes the finding of approximate solutions. Permutation-based indexing is one of the most recent techniques for approximate similarity search in large-scale domains. Data objects are represented by a list of references (pivots), which are ordered with respect to their distances from the object. In this context, the authors show different distributed algorithms for efficient indexing and searching based on permutation-based indexing and evaluate them on big high-dimensional data sets.
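To make the permutation-based idea concrete, the following minimal Python sketch (an illustration under assumed names and parameters, not the authors' distributed implementation) encodes each object by the order in which it "sees" a set of pivots and ranks candidate neighbors by how similar their pivot permutations are, using the Spearman footrule distance.

```python
import numpy as np

def permutation_signature(obj, pivots):
    """Return pivot indices ordered by increasing distance from obj."""
    dists = np.linalg.norm(pivots - obj, axis=1)
    return np.argsort(dists)

def footrule(sig_a, sig_b):
    """Spearman footrule distance between two pivot permutations."""
    pos_a = np.argsort(sig_a)  # position of each pivot in signature a
    pos_b = np.argsort(sig_b)
    return int(np.abs(pos_a - pos_b).sum())

rng = np.random.default_rng(0)
data = rng.random((1000, 64))                              # toy high-dimensional data set
pivots = data[rng.choice(len(data), 16, replace=False)]    # reference objects (pivots)

index = [permutation_signature(x, pivots) for x in data]   # build the permutation index

query = rng.random(64)
q_sig = permutation_signature(query, pivots)
# Rank objects by permutation similarity; a real system would re-check the top
# candidates with the true distance to return approximate K-NN results.
candidates = sorted(range(len(data)), key=lambda i: footrule(q_sig, index[i]))[:10]
print(candidates)
```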
Chapter 2, "Scalability and Cost Evaluation of Incremental Data Processing Using Amazon's Hadoop Service," by Xing Wu, Yan Liu, and Ian Gorton, considers the case of Hadoop, which, based on the MapReduce model and the Hadoop Distributed File System (HDFS), enables the distributed processing of large data sets across clusters with scalability and fault tolerance. Many data-intensive applications involve continuous and incremental updates of data. Understanding the scalability and cost of a Hadoop platform in handling small and independent updates of data sets sheds light on the design of scalable and cost-effective data-intensive applications. With these ideas in mind, the authors introduce a motivating movie recommendation application implemented in the MapReduce model and deployed on Amazon Elastic MapReduce (EMR), a Hadoop service provided by Amazon. In particular, the authors present the deployment architecture with implementation details of the Hadoop application. With metrics collected by Amazon CloudWatch, they present an empirical scalability and cost evaluation of the Amazon Hadoop service on processing continuous and incremental data streams. The evaluation result highlights the potential of autoscaling for cost reduction on Hadoop services.
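For readers new to the MapReduce model referenced throughout this section, the pure-Python sketch below simulates the map, shuffle, and reduce phases on a toy ratings data set (it is not the chapter's EMR application; the record layout and names are invented for illustration).

```python
from collections import defaultdict

# Toy input: (user, movie, rating) records, e.g. lines from a ratings log.
records = [("u1", "m1", 4.0), ("u2", "m1", 5.0), ("u1", "m2", 3.0)]

def map_phase(record):
    """Emit (movie, rating) key-value pairs, as a Hadoop mapper would."""
    user, movie, rating = record
    yield movie, rating

def reduce_phase(movie, ratings):
    """Aggregate all ratings for one movie into an average."""
    return movie, sum(ratings) / len(ratings)

# Shuffle: group intermediate values by key (done by the framework in Hadoop).
grouped = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        grouped[key].append(value)

results = [reduce_phase(movie, ratings) for movie, ratings in grouped.items()]
print(results)  # e.g. [('m1', 4.5), ('m2', 3.0)]
```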
Chapter 3, "Singular Value Decomposition, Clustering, and Indexing for Similarity Search for Large Data Sets in High-Dimensional Spaces," by Alexander Thomasian, addresses a popular paradigm, that is, representing objects such as images by their feature vectors and searching for similarity according to the distances of the points representing them in high-dimensional space via the K-nearest neighbors (K-NNs) of a target image. The author discusses a combination of singular value decomposition (SVD), clustering, and indexing to reduce the cost of processing K-NN queries for large data sets with high-dimensional data. The chapter first reviews dimensionality reduction methods with emphasis on SVD and related methods, followed by a survey of clustering and indexing methods for high-dimensional numerical data. The author describes combining SVD and clustering as a framework and the main-memory-resident ordered partition (OP)-tree index to speed up K-NN queries. Finally, the chapter discusses techniques to save the OP-tree on disk and specifies the stepwise dimensionality increasing (SDI) index suited for K-NN queries on dimensionally reduced data.
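The general pipeline the chapter builds on (reduce dimensionality with SVD, then answer K-NN queries in the reduced space) can be sketched in a few lines of NumPy; this is only an illustrative baseline with brute-force search, and the OP-tree and SDI indexes described in the chapter are not implemented here.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((500, 100))            # 500 objects, 100-dimensional feature vectors

# Center the data, compute the SVD, and keep the k strongest directions.
k = 10
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
X_reduced = (X - mean) @ Vt[:k].T     # project onto the top-k right singular vectors

def knn(query, data, K=5):
    """Brute-force K-NN in the reduced space (an index would replace this scan)."""
    d = np.linalg.norm(data - query, axis=1)
    return np.argsort(d)[:K]

q = (rng.random(100) - mean) @ Vt[:k].T   # project the query the same way
print(knn(q, X_reduced))
```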
Chapter 4, "Multiple Sequence Alignment and Clustering with Dot Matrices, Entropy, and Genetic Algorithms," by John Tsiligaridis, presents a set of algorithms, and their efficiency, for the Multiple Sequence Alignment (MSA) and clustering problems, including solutions in distributed environments with Hadoop. The strength, adaptability, and effectiveness of genetic algorithms (GAs) for both problems are pointed out. MSA is among the most important tasks in computational biology. In biological sequence comparison, emphasis is given to the simultaneous alignment of several sequences. GAs are stochastic approaches for efficient and robust search that can play a significant role in MSA and clustering. The divide-and-conquer principle ensures undisturbed consistency during vertical sequence segmentations. Indeed, the divide-and-conquer method (DCGA) can provide a solution for MSA utilizing appropriate cut points. As far as clustering is concerned, the aim is to divide the objects into clusters so that the variation inside clusters is minimized. As an internal measure of cluster validity, the sum of squared error (SSE) is used. A clustering genetic algorithm with the SSE criterion (CGA_SSE), a hybrid approach using the most popular algorithm, K-means, is presented. The CGA_SSE combines local and global search procedures. A comparison of K-means and CGA_SSE is provided in terms of the accuracy and quality of the solution for clusters of different sizes and densities. The complexity of all proposed algorithms is examined. Hadoop, for the distributed environment, provides an alternate solution to the CGA_SSE, following the MapReduce paradigm. Simulation results are provided.
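As a point of reference for the SSE criterion mentioned above, the short Python sketch below computes the sum of squared error of a clustering, i.e., the summed squared distance of every point to its cluster centroid; this is the quantity that K-means and the CGA_SSE described in the chapter seek to minimize, and the data and labels here are toy values.

```python
import numpy as np

def sse(points, labels):
    """Sum of squared error: total squared distance of points to their cluster centroids."""
    total = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])       # a toy two-cluster assignment
print(sse(pts, labels))               # small SSE indicates tight, well-separated clusters
```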
The second section (i.e., "Big Data Processing") is organized into the following chapters.
Chapter 5, "Approaches for High-Performance Big Data Processing: Applications and Challenges," by Ouidad Achahbar, Mohamed Riduan Abid, Mohamed Bakhouya, Chaker El Amrani, Jaafar Gaber, Mohammed Essaaidi, and Tarek A. El Ghazawi, puts the emphasis on social media websites, such as Facebook, Twitter, and YouTube, and job posting websites like LinkedIn and CareerBuilder, which involve a huge amount of data that are very useful for economic assessment and societal development. These sites provide the sentiments and interests of people connected to web communities and a lot of other information. The Big Data collected from the web are considered an unprecedented source to fuel data processing and business intelligence. However, collecting, storing, analyzing, and processing these Big Data as quickly as possible creates new challenges for both scientists and analysts. For example, analyzing Big Data from social media is now widely accepted by many companies as a way of testing the acceptance of their products and services based on customers' opinions. Opinion mining or sentiment analysis methods have recently been proposed for extracting positive/negative words from Big Data. However, highly accurate and timely processing and analysis of the huge amount of data to extract their meaning require new processing techniques. More precisely, a technology is needed to deal with the massive amounts of unstructured and semistructured information in order to understand hidden user behavior. Existing solutions are time consuming given the increase in data volume and complexity. It is possible to use high-performance computing technology to accelerate data processing through MapReduce ported to cloud computing. This will allow companies to deliver more business value to their end customers in the dynamic and changing business environment. This chapter discusses approaches proposed in the literature and their use in the cloud for Big Data analysis and processing.
Chapter 6, "The Art of Scheduling for Big Data Science," by Florin Pop and Valentin Cristea, moves the attention to applications that generate Big Data, like social networking and social influence programs, cloud applications, public websites, scientific experiments and simulations, data warehouses, monitoring platforms, and e-government services. Data grow rapidly, since applications produce continuously increasing volumes of both unstructured and structured data. The impact on data processing, transfer, and storage is the need to reevaluate the approaches and solutions to better answer user needs. In this context, scheduling models and algorithms have an important role. A large variety of solutions for specific applications and platforms exist, so a thorough and systematic analysis of existing solutions for scheduling models, methods, and algorithms used in Big Data processing and storage environments has high importance. This chapter presents the best of the existing solutions and creates an overview of current and near-future trends. It highlights, from a research perspective, the performance and limitations of existing solutions and offers an overview of the current situation in the area of scheduling and resource management related to Big Data processing.
Chapter 7, "Time–Space Scheduling in the MapReduce Framework," by Zhuo Tang, Ling Qi, Lingang Jiang, Kenli Li, and Keqin Li, focuses on the significance of Big Data, that is, analyzing people's behavior, intentions, and preferences in the growing and popular social networks and, in addition to this, processing data with nontraditional structures and exploring their meanings. Big Data is often used to describe a company's large amount of unstructured and semistructured data. Loading these data into a relational database for analysis would require too much time and money. Big Data analysis and cloud computing are often linked together because real-time analysis of large data requires a framework similar to MapReduce to assign work to hundreds or even thousands of computers. After several years of criticism, questioning, discussion, and speculation, Big Data finally ushered in its own era. Hadoop presents MapReduce as an analytics engine, and under the hood it uses a distributed storage layer referred to as the Hadoop Distributed File System (HDFS). As an open-source implementation of MapReduce, Hadoop is, so far, one of the most successful realizations of large-scale data-intensive cloud computing platforms. It has been realized that when and where to start the reduce tasks are the key problems in enhancing MapReduce performance. In this context, the chapter proposes a framework for supporting time–space scheduling in MapReduce. For time scheduling, a self-adaptive reduce task scheduling policy for reduce tasks' start times in the Hadoop platform is proposed. It can decide the start time point of each reduce task dynamically according to each job's context, including the task completion time and the size of the map output. For space scheduling, suitable algorithms are presented, which synthesize the network locations and sizes of reducers' partitions in their scheduling decisions in order to mitigate network traffic and improve MapReduce performance, thus providing several ways to avoid scheduling delay, scheduling skew, poor system utilization, and a low degree of parallelism.
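The chapter's self-adaptive policy is not reproduced here; the toy Python sketch below only illustrates the kind of decision involved in time scheduling, namely delaying reduce start until enough map output exists that shuffling can overlap with the remaining map work. All thresholds, estimates, and numbers are invented for illustration.

```python
def should_start_reduce(maps_done, maps_total, map_output_bytes,
                        shuffle_bw_bytes_per_s, avg_map_time_s):
    """Toy heuristic: start reducers once the estimated time to shuffle the map
    output produced so far is at least the estimated remaining map time, so that
    copying overlaps with mapping instead of starting too early or too late."""
    remaining_maps = maps_total - maps_done
    est_remaining_map_time = remaining_maps * avg_map_time_s
    est_shuffle_time = map_output_bytes / shuffle_bw_bytes_per_s
    return est_shuffle_time >= est_remaining_map_time

# Example: 90 of 100 maps done, 25 GB of map output so far, 100 MB/s shuffle
# bandwidth, 20 s per map task on average.
print(should_start_reduce(90, 100, 25e9, 100e6, 20))  # True -> launch reducers now
```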
Chapter 8, "GEMS: Graph Database Engine for Multithreaded Systems," by Alessandro Morari, Vito Giovanni Castellana, Oreste Villa, Jesse Weaver, Greg Williams, David Haglin, Antonino Tumeo, and John Feo, considers the specific case of organizing, managing, and analyzing massive amounts of data in several contexts, like social network analysis, financial risk management, threat detection in complex network systems, and medical and biomedical databases. For these areas, there is a problem not only in terms of size but also in terms of performance, because the processing should happen sufficiently fast to be useful. Graph databases appear to be a good candidate to manage these data: They provide an efficient data structure for heterogeneous data or data that are not themselves rigidly structured. However, exploring large-scale graphs on modern high-performance machines is challenging. These systems include processors and networks optimized for regular, floating-point intensive computations and large, batched data transfers. In contrast, exploring graphs generates fine-grained, unpredictable memory and network accesses, is mostly memory bound, and is synchronization intensive. Furthermore, graphs are often difficult to partition, making their processing prone to load imbalance. Following this evidence, the chapter describes the Graph Engine for Multithreaded Systems (GEMS), a full software stack that implements a graph database on a commodity cluster and enables scaling in data set size while maintaining a constant query throughput when adding more cluster nodes. The GEMS software stack comprises a SPARQL-to-data-parallel C++ compiler, a library of distributed data structures, and a custom multithreaded runtime system. An evaluation of GEMS on a typical SPARQL benchmark and on a Resource Description Framework (RDF) data set is also presented.
Chapter 9, "KSC-net: Community Detection for Big Data Networks," by Raghvendra Mall and Johan A.K. Suykens, demonstrates the applicability of the kernel spectral clustering (KSC) method for community detection in Big Data networks, also providing a practical exposition of the KSC method on large-scale synthetic and real-world networks with up to 10⁶ nodes and 10⁷ edges. The KSC method uses a primal–dual framework to construct a model on a smaller subset of the Big Data network. The original large-scale kernel matrix cannot fit in memory, so smaller subgraphs are selected using a fast and unique representative subset (FURS) selection technique. These subsets are used for training and validation, respectively, to build the model and obtain the model parameters. This results in a powerful out-of-sample extension property, which allows inferring the community affiliation of unseen nodes. The KSC model requires a kernel function, which can have kernel parameters, and it needs to identify the number of clusters k in the network. A memory- and computationally efficient model selection technique named balanced angular fitting (BAF), based on angular similarity in the eigenspace, was proposed in the literature. Another, parameter-free KSC model was proposed as well. Here, the model selection technique exploits the structure of projections in the eigenspace to automatically identify the number of clusters and suggests that a normalized linear kernel is sufficient for networks with millions of nodes. This model selection technique uses the concepts of entropy and balanced clusters to identify the number of clusters k. Within this context, the chapter describes the software KSC-net, which obtains the representative subset by FURS, builds the KSC model, performs one of the two (BAF and parameter-free) model selection techniques, and uses out-of-sample extensions for community affiliation in the Big Data network.
Chapter 10, "Making Big Data Transparent to the Software Developers' Community," by Yu Wu, Jessica Kropczynski, and John M. Carroll, investigates the open-source software (OSS) development community, which has allowed technology to progress at a rapid pace around the globe through shared knowledge, expertise, and collaborations. The broad-reaching open-source movement bases itself on a share-alike principle that allows anybody to use or modify software, and upon completion of a project, its source code is made publicly available. Programmers who are part of this community contribute by voluntarily writing and exchanging code through a collaborative development process in order to produce high-quality software. This method has led to the creation of popular software products including Mozilla Firefox, Linux, and Android. Most OSS development activities are carried out online through formalized platforms (such as GitHub), incidentally creating a vast amount of interaction data across an ecosystem of platforms that can be used not only to characterize open-source development work activity more broadly but also to create Big Data awareness resources for OSS developers. The intention of these awareness resources is to enhance the ability to seek out much-needed information necessary to produce high-quality software in this unique environment that is not conducive to ownership or profits. Currently, it is problematic that interconnected resources are archived across stand-alone websites. Along these research lines, this chapter describes the process through which these resources can be made more conspicuous through Big Data, in three interrelated sections about the context and issues of the collaboration process in online space and a fourth section on how Big Data can be obtained and utilized.
The third section (i.e., "Big Data Stream Techniques and Algorithms") is organized into the following chapters.
Chapter 11, "Key Technologies for Big Data Stream Computing," by Dawei Sun, Guangyan Zhang, Weimin Zheng, and Keqin Li, focuses on the two main mechanisms for Big Data computing, that is, Big Data stream computing (BDSC) and Big Data batch computing. BDSC is a model of straight-through computing, such as Twitter Storm and Yahoo! S4, which does for stream computing what Hadoop does for batch computing, while Big Data batch computing is a model of storing and then computing, such as the MapReduce framework open-sourced by the Hadoop implementation. Essentially, Big Data batch computing is not sufficient for many real-time application scenarios, where a data stream changes frequently over time and the latest data are the most important and most valuable. For example, when analyzing data from real-time transactions (e.g., financial trades, e-mail messages, user search requests, sensor data tracking), a data stream grows monotonically over time as more transactions take place. Ideally, a real-time application environment can be supported by BDSC. In this specific applicative setting, this chapter introduces data stream graphs and the system architecture for BDSC, as well as key technologies for BDSC systems. Among other contributions, the authors present the system architecture and key technologies of four popular example BDSC systems, that is, Twitter Storm, Yahoo! S4, Microsoft TimeStream, and Microsoft Naiad.
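To contrast the two models in miniature, the following Python sketch (purely illustrative and unrelated to the internals of Storm, S4, TimeStream, or Naiad) processes a transaction stream item by item, keeping a running aggregate that is always up to date, instead of storing everything and computing afterwards as a batch job would.

```python
from collections import defaultdict

def transaction_stream():
    """Stand-in for an unbounded source such as trades or search requests."""
    yield ("AAPL", 101.0)
    yield ("MSFT", 55.0)
    yield ("AAPL", 102.5)

# Straight-through (stream) computing: update state per item as it arrives.
count = defaultdict(int)
total = defaultdict(float)
for symbol, price in transaction_stream():
    count[symbol] += 1
    total[symbol] += price
    running_avg = total[symbol] / count[symbol]
    print(f"latest {symbol} average after {count[symbol]} trades: {running_avg:.2f}")

# A batch (store-then-compute) job would instead materialize the whole stream
# first, which is exactly what becomes impractical when the stream is unbounded.
```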
Chapter 12, "Streaming Algorithms for Big Data Processing on Multicore Architecture," by Marat Zhanikeev, studies Hadoop and MapReduce, which are well known as de facto standards in Big Data processing today. Although they are two separate technologies, they form a single package as far as Big Data processing—not just storage—is concerned, and this chapter treats them as one package. Today, Hadoop and/or MapReduce lacks popular alternatives. Hadoop solves the practical problem of not being able to store Big Data on a single machine by distributing the storage over multiple nodes. MapReduce is a framework on which one can run jobs that process the contents of the storage—also in a distributed manner—and generate statistical summaries. The chapter shows that performance improvements mostly target MapReduce. There are several fundamental problems with MapReduce. First, the map and reduce operators are restricted to key–value hashes (the data type, not a hash function), which places a cap on usability. For example, while data streaming is a good alternative for Big Data processing, MapReduce fails to accommodate the necessary data types. Second, MapReduce jobs create heterogeneous environments where jobs compete for the same resources with no guarantee of fairness. Finally, MapReduce jobs lack time awareness, while some algorithms might need to process data in their time sequence or using a time window. The core premise of this chapter is to replace MapReduce with time-aware storage and processing logic. Big Data is replayed along the timeline, and all the jobs get the time-ordered sequence of data items. The major difference here is that the new method collects all the jobs in one place—the node that replays the data—while MapReduce sends jobs to remote nodes so that data can be processed locally. This architecture is chosen for the sole purpose of accommodating a wide range of data streaming algorithms and the data types they create.
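As a rough illustration of the time-aware replay idea (not the chapter's actual system), the sketch below replays time-stamped items in order and hands each one to several registered jobs, one of which maintains a time-window aggregate that a key–value-only MapReduce job would struggle to express. All class names and data are made up.

```python
from collections import deque

events = [(1.0, "a"), (2.5, "b"), (3.0, "a"), (7.2, "c"), (8.1, "a")]  # (time, item)

class WindowCounter:
    """Counts items seen in the last `window` time units (a time-aware job)."""
    def __init__(self, window):
        self.window = window
        self.buf = deque()
    def consume(self, t, item):
        self.buf.append((t, item))
        while self.buf and self.buf[0][0] < t - self.window:
            self.buf.popleft()          # expire items that fell out of the window
        print(f"t={t}: {len(self.buf)} items in the last {self.window}s")

class TotalCounter:
    """A second job sharing the same replayed stream."""
    def __init__(self):
        self.n = 0
    def consume(self, t, item):
        self.n += 1

jobs = [WindowCounter(window=5.0), TotalCounter()]
for t, item in sorted(events):          # replay data along the timeline
    for job in jobs:                    # every job sees the same time-ordered stream
        job.consume(t, item)
```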
Chapter 13, "Organic Streams: A Unified Framework for Personal Big Data Integration and Organization Towards Social Sharing and Individualized Sustainable Use," by Xiaokang Zhou and Qun Jin, moves the attention to the rapid development of emerging computing paradigms, which are often applied to our dynamically changing work, life, play, and learning in the highly developed information society, a kind of seamless integration of the real physical world and the cyber digital space. More and more people have become accustomed to sharing their personal content across social networks due to the high accessibility of social media along with the increasingly widespread adoption of wireless mobile computing devices. User-generated information has spread more widely and quickly and has provided people with opportunities to obtain more knowledge and information than ever before, which leads to an explosive increase in data scale, containing big potential value for individual, business, domestic, and national economic development. Thus, it has become an increasingly important issue to sustainably manage and utilize personal Big Data in order to mine useful insight and real value to better support information seeking and knowledge discovery. To deal with this situation in the Big Data era, a unified approach to the aggregation and integration of personal Big Data from life logs in accordance with individual needs is considered essential and effective, which can benefit the sustainable information sharing and utilization process in the social networking environment. Based on this main consideration, this chapter introduces and defines a new concept of organic stream, which is designed as a flexibly extensible data carrier, to provide a simple but efficient means to formulate, organize, and represent personal Big Data. As an abstract data type, organic streams can be regarded as a logical metaphor, which aims to meaningfully process the raw stream data into an associatively and methodically organized form, but no concrete implementation for physical data structure and storage is defined. Under the conceptual model of organic streams, a heuristic method is proposed and applied to extract diversified individual needs from the tremendous amount of social stream data flowing through social media. An integrated mechanism is developed to aggregate and integrate the relevant data together based on individual needs in a meaningful way; thus, personal data can be physically stored and distributed in private personal clouds and logically represented and processed by a set of newly introduced metaphors named heuristic stone, associative drop, and associative ripple. The architecture of the system with its foundational modules is described, and the prototype implementation with experimental results is presented to demonstrate the usability and effectiveness of the framework and system.
Chapter 14, "Managing Big Trajectory Data: Online Processing of Positional Streams," by Kostas Patroumpas and Timos Sellis, considers location-based services, which have become all the more important in social networking, mobile applications, advertising, traffic monitoring, and many other domains, following the proliferation of smartphones and GPS-enabled devices. Managing the locations and trajectories of numerous people, vehicles, vessels, commodities, and so forth must be efficient and robust, since this information must be processed online and should provide answers to users' requests in real time. In this geostreaming context, such long-running continuous queries must be repeatedly evaluated against the most recent positions relayed by moving objects, for instance, reporting which people are now moving in a specific area or finding the friends closest to the current location of a mobile user. In essence, modern processing engines must cope with huge amounts of streaming, transient, uncertain, and heterogeneous spatiotemporal data, which can be characterized as big trajectory data. Inspired by this methodological trend, this chapter examines Big Data processing techniques over frequently updated locations and trajectories of moving objects. Indeed, the Big Data issues regarding volume, velocity, variety, and veracity also arise in this case. Thus, the authors foster a close synergy between the established stream processing paradigm and the spatiotemporal properties inherent in motion features. Taking advantage of the spatial locality and temporal timeliness that characterize each trajectory, the authors present methods and heuristics that address such problems.
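A continuous range query of the kind mentioned above ("which objects are now inside a given area?") can be caricatured in a few lines of Python; this toy keeps only each object's most recent position and re-evaluates the query as updates stream in, whereas the engines discussed in the chapter add windows, spatial indexes, and uncertainty handling. Identifiers and coordinates are invented.

```python
latest = {}                                  # object id -> most recent (x, y)

def in_area(pos, area):
    (x, y), (xmin, ymin, xmax, ymax) = pos, area
    return xmin <= x <= xmax and ymin <= y <= ymax

def on_position_update(obj_id, x, y, area):
    """Continuous query: re-evaluate membership in `area` on every update."""
    latest[obj_id] = (x, y)
    inside = [o for o, p in latest.items() if in_area(p, area)]
    print(f"after {obj_id} -> ({x}, {y}): objects in area = {inside}")

monitored = (0.0, 0.0, 10.0, 10.0)           # a rectangular area of interest
for update in [("bus7", 3.0, 4.0), ("taxi2", 20.0, 1.0), ("bus7", 12.0, 4.0)]:
    on_position_update(*update, monitored)
```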
The fourth section (i.e., "Big Data Privacy") is organized into the following chapters.
Chapter 15, "Personal Data Protection Aspects of Big Data," by Paolo Balboni, focuses on specific legal aspects of managing and processing Big Data, also providing a relevant view of European privacy and data protection laws. In particular, the analysis considers applicable EU data protection provisions and their impact on both businesses and consumers/data subjects, and it introduces and conceptually assesses a methodology to determine whether (1) data protection law applies and (2) personal Big Data can be (further) processed (e.g., by way of analytic software programs). Looking into more detail, this chapter deals with diverse aspects of data protection, providing an understanding of Big Data from the perspective of personal data protection using the Organization for Economic Co-operation and Development's four-step life cycle of personal data along the value chain, paying special attention to the concept of compatible use. The author also sheds light on the development of the concept of personal data and its relevance in terms of data processing. Further focus is placed on aspects such as pseudonymization, anonymous data, and reidentification. Finally, conclusions and recommendations that focus on the privacy implications of Big Data processing and the importance of strategic data protection compliance management are illustrated.
Chapter 16, "Privacy-Preserving Big Data Management: The Case of OLAP," by Alfredo Cuzzocrea, highlights the security and privacy of Big Data repositories as being among the most challenging topics in Big Data research. As a relevant instance, the author considers the case of cloud systems, which are very popular now. Here, cloud nodes are likely to exchange data very often; therefore, the risk of privacy breaches arises, as distributed data repositories can be accessed from one node to another, and hence sensitive information can be inferred. Another relevant data management context for Big Data research is represented by the issue of effectively and efficiently supporting data warehousing and online analytical processing (OLAP) over Big Data, as multidimensional data analysis paradigms are likely to become an "enabling technology" for analytics over Big Data, a collection of models, algorithms, and techniques oriented to extracting useful knowledge from cloud-based Big Data repositories for decision-making and analysis purposes. At the convergence of the three axioms introduced (i.e., security and privacy of Big Data, data warehousing and OLAP over Big Data, and analytics over Big Data), a critical research challenge is represented by the issue of effectively and efficiently computing privacy-preserving OLAP data cubes over Big Data. It is easy to foresee that this problem will become more and more important in future years, as it not only involves relevant theoretical and methodological aspects, not all explored by the current literature, but also concerns significant modern scientific applications. Inspired by these clear and evident trends, this chapter moves the attention to privacy-preserving OLAP data cubes over Big Data and provides two kinds of contributions: (1) a complete survey of privacy-preserving OLAP approaches available in the literature, with respect to both centralized and distributed environments, and (2) an innovative framework that relies on flexible sampling-based data cube compression techniques for computing privacy-preserving OLAP aggregations on data cubes.
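The chapter's framework is not reproduced here; the short Python sketch below only conveys the flavor of sampling-based aggregation for OLAP-style queries, answering a SUM query from a scaled-up random sample so the exact base records need not be exposed, trading some accuracy for compression and a degree of protection. All table contents and numbers are invented.

```python
import random

random.seed(42)
# Toy fact table: (region, product, sales) rows standing in for a cube's base data.
facts = [("EU", "p1", random.randint(1, 100)) for _ in range(10000)] + \
        [("US", "p2", random.randint(1, 100)) for _ in range(10000)]

def sampled_sum(rows, keep, rate=0.05):
    """Estimate SUM(sales) for rows where keep(row) holds, using a random sample
    scaled by 1/rate, so the query is answered from compressed, sampled data."""
    sample = [r for r in rows if random.random() < rate]
    total = sum(r[2] for r in sample if keep(r))
    return total / rate   # scale the sample aggregate up to the full population

exact = sum(sales for region, _, sales in facts if region == "EU")
approx = sampled_sum(facts, lambda r: r[0] == "EU")
print(f"exact EU sales: {exact}, sampled estimate: {approx:.0f}")
```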
The fifth section (i.e., "Big Data Applications") is organized into the following chapters.
Chapter 17, "Big Data in Finance," by Taruna Seth and Vipin Chaudhary, addresses Big Data in the context of the financial industry, which has always been driven by data. Today, Big Data is prevalent at various levels in this field, ranging from the financial services sector to capital markets. The availability of Big Data in this domain has opened up new avenues for innovation and has offered immense opportunities for growth and sustainability. At the same time, it has presented several new challenges that must be overcome to gain the maximum value from it. Indeed, in recent years, the financial industry has seen an upsurge of interest in Big Data. This comes as no surprise to finance experts, who understand the potential value of data in this field and are aware that no industry can benefit more from Big Data than the financial services industry. After all, the industry not only is driven by data but also thrives on data. Today, the data, characterized by the four Vs, which refer to volume, variety, velocity, and veracity, are prevalent at various levels of this field, ranging from capital markets to the financial services industry. Capital markets, in particular, have gone through an unprecedented change, resulting in the generation of massive amounts of high-velocity and heterogeneous data. For instance, about 70% of US equity trades today are generated by high-frequency trades (HFTs) and are machine driven. In this context, the chapter considers the impact and applications of Big Data in the financial domain. It examines some of the key advancements and transformations driven by Big Data in this field. The chapter also highlights important Big Data challenges that remain to be addressed in the financial domain.
Chapter 18, "Semantic-Based Heterogeneous Multimedia Big Data Retrieval," by Kehua Guo and Jianhua Ma, considers multimedia retrieval, an important technology in many applications such as web-scale multimedia search engines, mobile multimedia search, remote video surveillance, automation creation, and e-government. With the widespread use of multimedia documents, our world is being swamped with multimedia content such as massive numbers of images, videos, audio, and other content. Therefore, traditional multimedia retrieval has been shifting into a Big Data environment, and research into solving problems according to the features of multimedia Big Data retrieval is attracting considerable attention. Against this application setting, this chapter proposes a heterogeneous multimedia Big Data retrieval framework that can achieve good retrieval accuracy and performance. The authors begin by addressing the particularity of heterogeneous multimedia retrieval in a Big Data environment and introducing the background of the topic. Then, literature related to current multimedia retrieval approaches is briefly reviewed, and the general concept of the proposed framework is introduced. The authors provide a detailed description of this framework, including semantic information extraction, representation, storage, and multimedia Big Data retrieval. Finally, the proposed framework's performance is experimentally evaluated against several multimedia data sets.
Chapter 19, "Topic Modeling for Large-Scale Multimedia Analysis and Retrieval," by Juan Hu, Yi Fang, Nam Ling, and Li Song, similarly puts the emphasis on the exponential growth of multimedia data in recent years, with the arrival of the Big Data era, thanks to the rapid increase in processor speed, cheaper data storage, and the prevalence of digital content capture devices, as well as the flood of social media like Facebook and YouTube. New data generated each day had reached 2.5 quintillion bytes as of 2012. In particular, more than 10 hours of video are uploaded onto YouTube every minute, and millions of photos become available online every week. The explosion of multimedia data in social media raises a great demand for developing effective and efficient computational tools to facilitate producing, analyzing, and retrieving large-scale multimedia content. Probabilistic topic models have proved to be an effective way to organize large volumes of text documents, while far fewer related models have been proposed for other types of unstructured data such as multimedia content, partly due to the high computational cost. With the emergence of cloud computing, topic models are expected to become increasingly applicable to multimedia data. Furthermore, the growing demand for a deep understanding of multimedia data on the web drives the development of sophisticated machine learning methods. Thus, it is greatly desirable to develop topic modeling approaches to multimedia applications that are consistently effective, highly efficient, and easily scalable. Following this methodological scheme, this chapter presents a review of topic models for large-scale multimedia analysis and shows the current challenges from various perspectives by presenting a comprehensive overview of related work that addresses these challenges. Finally, the chapter discusses several research directions in the field.
Chapter 20, "Big Data Biometrics Processing: A Case Study of an Iris Matching Algorithm on Intel Xeon Phi," by Xueyan Li and Chen Liu, investigates the application setting of Big Data biometrics repositories. The authors recognize that, in the drive towards achieving higher computation capability, the most advanced computing systems have been adopting alternatives to the traditional general-purpose processors (GPPs) as their main components to better prepare for Big Data processing. NVIDIA's graphics processing units (GPUs) have powered many of the top-ranked supercomputer systems since 2008. In the latest list published by Top500.org, two systems with Intel Xeon Phi coprocessors have claimed positions 1 and 7. While it is clear that the need to improve efficiency for Big Data processing will continuously drive changes in hardware, it is important to understand that these new systems have their own advantages as well as limitations. The effort required from researchers to port their codes onto the new platforms is also of great significance. Unlike other coprocessors and accelerators, the Intel Xeon Phi coprocessor does not require learning a new programming language or new parallelization techniques; it presents an opportunity for researchers to share parallel programming with the GPP. The platform follows the standard parallel programming model, which is familiar to developers who already work with x86-based parallel systems. From another perspective, with the rapidly expanding biometric data collected by various sources for identification and verification purposes, how to manage and process such Big Data draws great concern. On the one hand, biometric applications normally involve comparing a huge number of samples and templates, which places strict requirements on the computational capability of the underlying hardware platform. On the other hand, the number of cores and associated threads that hardware can support has increased greatly; an example is the newly released Intel Xeon Phi coprocessor. Hence, Big Data biometrics processing demands the execution of applications at a higher level of parallelism. Taking an iris matching algorithm as a case study, the authors propose an OpenMP version of the algorithm to examine its performance on the Intel Xeon Phi coprocessor. Their goal is to evaluate their parallelization approach and the influence of the optimal number of threads, the impact of thread-to-core affinity, and the built-in vector engine. This does not mean that achieving good performance on this platform is simple. The hardware, while presenting many similarities with other existing multicore systems, has its own characteristics and unique features. In order to port the code in an efficient way, those aspects are fully discussed in the chapter.
Chapter 21, "Storing, Managing, and Analyzing Big Satellite Data: Experiences and Lessons Learned from a Real-World Application," by Ziliang Zong, recognizes that Big Data has shown great capability in yielding extremely useful information and extraordinary potential in revolutionizing scientific discoveries and traditional commercial models. Indeed, numerous corporations have started to utilize Big Data to understand their customers' behavior at a fine-grained level, rethink their business process workflow, and increase their productivity and competitiveness. Scientists are using Big Data to make new discoveries that were not possible before. As the volume, velocity, variety, and veracity of Big Data keep increasing, significant challenges arise with respect to innovative Big Data management, efficient Big Data analytics, and low-cost Big Data storage solutions. This chapter provides a case study on how the big satellite data (at the petabyte level) of the world's largest satellite imagery distribution system are captured, stored, and managed by the National Aeronautics and Space Administration (NASA) and the US Geological Survey (USGS), and it gives a unique example of how a changed policy could significantly affect the traditional ways of Big Data storage and distribution, which is quite different from typical commercial cases driven by sales. The chapter also discusses how the USGS Earth Resources Observation and Science (EROS) Center swiftly overcame the challenge of moving from serving a few government users to serving hundreds of thousands of global users, how data visualization and data mining techniques are used to analyze the characteristics of millions of requests, and how they can be used to improve the performance, cost, and energy efficiency of the EROS system. Finally, the chapter summarizes the experiences and lessons learned from conducting this Big Data project over the past 4 years.
Chapter 22, “Barriers to the Adoption of Big Data Applications in the Social Sector,” by Elena Strange, focuses on the social aspects of dealing with Big Data. The author recognizes that effectively working with and leveraging Big Data has the potential to change the world. Indeed, if there is a ceiling on realizing the benefits of Big Data algorithms, applications, and techniques, we have not yet reached it. The research field is maturing rapidly. No longer are we seeking to understand what “Big Data” is and whether it is useful. No longer is Big Data processing the province of niche computer science research. Rather, the concept of Big Data has been widely accepted as important and inexorable, and the buzzwords “Big Data” have found their way beyond computer science into the essential tools of business, government, and media. Tools and algorithms to leverage Big Data have been increasingly democratized over the last 10 years. By 2010, over 100 organizations reported using the distributed file system and framework Hadoop. Early adopters leveraged Hadoop on in-house Beowulf clusters to process tremendous amounts of data. Today, well over 1000 organizations use Hadoop. That number is climbing and now includes companies with a range of technical competencies, both with and without access to internal clusters and other tools. Yet the benefits of Big Data have not been fully realized by businesses, governments, and particularly the social sector. Against this background, this chapter describes the impact of this gap on the social sector and its broader implications. It also highlights the opportunity gap—the unrealized potential of Big Data in the social sector—and explores the historical limitations and context that have led up to the current state of Big Data. Finally, it describes the current perceptions of and reactions to Big Data algorithms and applications in the social sector and offers some recommendations to accelerate the adoption of Big Data.
Overall, this book represents a solid research contribution to state-of-the-art studies and practical achievements in algorithms, analytics, and applications on Big Data, and it sets the basis for further efforts in this challenging scientific field, which will increasingly play a leading role in next-generation database, data warehousing, data mining, and cloud computing research. The editors are confident that this book will stand as an authoritative milestone on this very challenging scientific road.
Editors
the International Journal of Embedded Systems (IJES); an editorial board member for the International Journal of Big Data Intelligence (IJBDI), The Scientific World Journal (TSWJ), the Open Journal of Internet of Things (OJIOT), and the GSTF Journal on Social Computing (JSC); and a guest editor for the IEEE Systems Journal, International Journal of Ad Hoc and Ubiquitous Computing, Cluster Computing, and The Scientific World Journal for multiple special issues. He has also served as a general chair or program chair for several major conferences/workshops and has been involved in 90 conferences and workshops as a session chair or as a program committee member. He has reviewed six cloud computing–related books (Distributed and Cloud Computing, Virtual Machines, Cloud Computing: Theory and Practice, Virtualized Infrastructure and Cloud Services Management, Cloud Computing: Technologies and Applications Programming, The Basics of Cloud Computing) for publishers such as Morgan Kaufmann, Elsevier, and Wiley. He serves as a review board member for a large number of international journals. He is a professional member of ACM and the IEEE Computer Society. Locally, he serves as US NSF XSEDE (Extreme Science and Engineering Discovery Environment) Campus Champion for Arkansas State University.
Professor Laurence T. Yang is with the Department of Computer Science at St. Francis Xavier University, Canada. His research includes parallel and distributed computing, embedded and ubiquitous/pervasive computing, cyber–physical–social systems, and Big Data.
He has published 200+ refereed international journal papers in the above areas; about 40% are in IEEE/ACM transactions/journals, and the rest are mostly in Elsevier, Springer,
and Wiley journals. He has been involved in conferences and workshops as a program/gen-
eral/steering conference chair and as a program committee member. He served as the vice
chair of the IEEE Technical Committee of Supercomputing Applications (2001–2004), the
chair of the IEEE Technical Committee of Scalable Computing (2008–2011), and the chair of
the IEEE Task Force on Ubiquitous Computing and Intelligence (2009–2013). He was on the
steering committee of the IEEE/ACM Supercomputing Conference series (2008–2011) and
on the National Resource Allocation Committee (NRAC) of Compute Canada (2009–2013).
In addition, he is the editor in chief and editor of several international journals. He is the
author/coauthor or editor/coeditor of more than 25 books. Mobile Intelligence (Wiley 2010)
received an Honorable Mention by the American Publishers Awards for Professional and
Scholarly Excellence (the PROSE Awards). He has won several best paper awards (includ-
ing IEEE best and outstanding conference awards, such as the IEEE 20th International
Conference on Advanced Information Networking and Applications [IEEE AINA-06]),
one best paper nomination, the Distinguished Achievement Award (2005, 2011), and the
Canada Foundation for Innovation Award (2003). He has given 30 keynote talks at various
international conferences and symposia.
Yu Wu
College of Information Sciences and Technology
Pennsylvania State University
State College, Pennsylvania
I
Big Data Management
CHAPTER 1
Scalable Indexing for Big Data Processing
Hisham Mohamed and Stéphane Marchand-Maillet
CONTENTS
1.1 Introduction 4
1.2 Permutation-Based Indexing 5
1.2.1 Indexing Model 5
1.2.2 Technical Implementation 6
1.3 Related Data Structures 6
1.3.1 Metric Inverted Files 6
1.3.2 Brief Permutation Index 7
1.3.3 Prefix Permutation Index 7
1.3.4 Neighborhood Approximation 7
1.3.5 Metric Suffix Array 8
1.3.6 Metric Permutation Table 8
1.4 Distributed Indexing 9
1.4.1 Data Based 9
1.4.1.1 Indexing 9
1.4.1.2 Searching 9
1.4.2 Reference Based 10
1.4.2.1 Indexing 10
1.4.2.2 Searching 11
1.4.3 Index Based 12
1.4.3.1 Indexing 12
1.4.3.2 Searching 12
1.5 Evaluation 13
1.5.1 Recall and Position Error 14
1.5.2 Indexing and Searching Performance 16
1.5.3 Big Data Indexing and Searching 17
1.6 Conclusion 18
Acknowledgment 18
References 18
ABSTRACT
The K-nearest neighbor (K-NN) search problem is the task of finding the closest and most similar objects to a given query. It finds many applications in information retrieval and visualization, machine learning, and data mining. The context of Big Data imposes the use of approximate solutions. Permutation-based indexing is one of the most recent techniques for approximate similarity search in large-scale domains. Data objects are represented by a list of references (pivots), which are ordered with respect to their distances from the object. In this chapter, we present different distributed algorithms for efficient indexing and searching based on permutation-based indexing and evaluate them on big high-dimensional data sets.
1.1 INTRODUCTION
Similarity search [1] aims to extract the most similar objects to a given query. It is a fundamental operation for many applications, such as genome comparison, to find all the similarities between one or more genes, and multimedia retrieval, to find the most similar picture or video to a given example.
Similarity search is known to be achievable in at least three ways. The first technique is exhaustive search. For a given query, the distance between the query and each database object is calculated while keeping track of which objects are similar (near) to the query. The main problem with this technique is that it does not scale to large collections. The second technique is exact search. Using space decomposition techniques, the number of objects that are compared to the query is reduced. This technique is not efficient for high-dimensional data sets, due to the curse of dimensionality [2]. As the number of dimensions increases, a large part of the database needs to be scanned. Hence, the performance becomes similar to that of exhaustive search. The third technique is approximate search. It reduces the amount of data that needs to be accessed by using some space partitioning (generally done using pivots) [3–5] or space transformation techniques [6–10]. It provides fast and scalable response times while accepting some imprecision in the results. The most common techniques are locality-sensitive hashing (LSH) [7], FastMap [6], and the M-tree [8]. A survey can be found in Reference 11. Recently, a new technique based on permutations has been proposed, called permutation-based indexing (PBI) [12].
The idea of PBI was first proposed in Reference 12. It relies on the selection and ordering of a set of reference objects (pivots) by their distance to every database object. Order permutations are used as an encoding to estimate the real ordering of distance values between the submitted query and the objects in the database (more details in Section 1.2). The technique is efficient; however, with the exponential increase of data, parallel and distributed computations are needed.
In this chapter, we review the basics of PBI and show its efficiency in a distributed environment. Section 1.2 introduces the PBI model and its basic implementation. Section 1.3 reviews the related data structures that are based on PBI. Section 1.4 introduces indexing and searching algorithms for PBI on distributed architectures. Finally, we present our evaluation on large-scale data sets in Section 1.5 and conclude in Section 1.6.
Definition 1
Given a set of N objects oi, D = {o1,…,oN} in an m-dimensional space, a set of reference objects R = {r1,…,rn} ⊂ D, and a distance function d that follows the metric space postulates, we define the ordered list of R relative to o ∈ D, L(o,R), as the ordering of elements in R with respect to their increasing distance from o:

L(o,R) = {r_i1, …, r_in} such that d(o, r_ij) ≤ d(o, r_i(j+1)), for all j = 1, …, n − 1.

Then, for any r ∈ R, L(o,R)|r indicates the position of r in L(o,R). In other words, L(o,R)|r = j such that r_ij = r. Further, given ñ > 0, L̃(o,R) is the pruned ordered list of the ñ first elements of L(o,R).
Figure 1.1b and c gives the ordered lists L(oi,R) (n = 4) and the pruned ordered lists L̃(oi,R) (ñ = 2) for the D and R illustrated in Figure 1.1a.
In K-nearest neighbor (K-NN) similarity queries, we are interested in ranking objects (to extract the K first elements) and not in the actual interobject distance values. PBI relaxes distance calculations by assuming that they will be approximated in terms of their ordering when comparing the ordered lists of objects. The Spearman footrule distance (dSFD) is considered to define the similarity between ordered lists. Formally,

dSFD(q, oi) = Σ_{rj ∈ R} | L(q,R)|rj − L(oi,R)|rj |.
(b) L(o1,R) = (r3,r4,r1,r2); L(o2,R) = (r3,r2,r4,r1); L(o3,R) = (r3,r4,r1,r2); L(o4,R) = (r2,r3,r1,r4); L(o5,R) = (r2,r1,r3,r4); L(o6,R) = (r2,r3,r1,r4); L(o7,R) = (r1,r2,r4,r3); L(o8,R) = (r4,r1,r3,r2); L(o9,R) = (r1,r4,r2,r3); L(o10,R) = (r4,r1,r3,r2); L(q,R) = (r4,r3,r1,r2).
(c) L̃(o1,R) = (r3,r4); L̃(o2,R) = (r3,r2); L̃(o3,R) = (r3,r4); L̃(o4,R) = (r2,r3); L̃(o5,R) = (r2,r1); L̃(o6,R) = (r2,r3); L̃(o7,R) = (r1,r2); L̃(o8,R) = (r4,r1); L̃(o9,R) = (r1,r4); L̃(o10,R) = (r4,r1); L̃(q,R) = (r4,r3).
FIGURE 1.1 (a) White circles are data objects oi; black circles are reference objects rj; the gray circle is the query object q. (b) Ordered lists L(oi,R), n = 4. (c) Pruned ordered lists L̃(oi,R), ñ = 2.
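To make Definition 1 concrete, the following minimal Python sketch (an illustration added here, not code from the chapter) builds the ordered list L(o,R) and its pruned version L̃(o,R) for a given object; the Euclidean distance and all helper names are assumptions chosen for the example.

import math

def euclidean(a, b):
    # metric distance between two m-dimensional points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ordered_list(o, references, dist=euclidean):
    # L(o,R): indices of the references sorted by increasing distance from o
    return sorted(range(len(references)), key=lambda j: dist(o, references[j]))

def pruned_list(o, references, n_tilde, dist=euclidean):
    # L~(o,R): only the n_tilde closest references
    return ordered_list(o, references, dist)[:n_tilde]

# toy data in the spirit of Figure 1.1: four references and one object
R = [(0.0, 0.0), (2.0, 1.0), (1.0, 3.0), (3.0, 3.0)]
o = (1.2, 2.4)
print(ordered_list(o, R))    # full permutation of reference indices
print(pruned_list(o, R, 2))  # pruned list with n~ = 2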
To efficiently answer users’ queries, the authors of References 13 through 17 proposed several strategies to decrease the complexity of the dSFD, which we discuss in detail in Section 1.3.
FIGURE 1.2 (a) Metric inverted file. (b) Prefix permutation index. (c) Neighborhood approximation.
list of the objects (saved in the inverted list). After checking the inverted lists of all the reference points, the objects are sorted based on their accumulator values. Objects with a small accumulator value are more similar to the query. Later, the authors of Reference 19 proposed a distributed implementation of the MIF.
the similar objects to the query are the objects where L̃(oi,R) ∩ L̃(q,R) ≠ ∅. To reduce the complexity, they proposed another term, the threshold value t. It defines the number of reference objects that must be shared between two ordered lists in order to define their similarity. The data structure related to this algorithm is similar to the MIF except for two things. The first difference is that the t value is defined at indexing time, and it affects the performance. The second difference is the data compression of the inverted lists to save storage. The indexing time of this technique is high, especially due to the compression time. The searching time and efficiency depend on the indexing technique and the threshold value. Figure 1.2c shows the NAPP for the partial ordered lists in Figure 1.1c.
b_ij = (B/ñ) · L̃(oi,R)|rj   and   b_qj = (B/ñ) · L̃(q,R)|rj   (1.2)
FIGURE 1.3 (a) Metric suffix array. (b) Metric permutation table.
d(q, oi) ≃ Σ_{rj ∈ L̃(q,R) ∩ L̃(oi,R)} | b_qj − b_ij |.   (1.3)
For each reference, B buckets are allocated. Each bucket contains a list of objects z, which holds the IDs of the objects that fall into that bucket. Figure 1.3b shows the MPT for the partial ordered lists in Figure 1.1c. These lists are then compressed using delta encoding [29].
For searching, Equation 1.3 counts the discrepancy in quantized ranking from common reference objects between each object and the query; each object oi thereby accumulates a score si. Objects oi are then sorted according to their decreasing si scores. This approximate ranking can be improved by DDC. For a K-NN query, we use a DDC on the Kc = Δ·K first objects in our sorted candidate list and call Δ > 1 the DDC factor.
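As an illustration of Equations 1.2 and 1.3 (a sketch written for this text, not the chapter's implementation), the fragment below quantizes rank positions into B buckets and scores an object against a query; the dict-based representation of the pruned lists and the parameter names are assumptions.

def bucket(position, B, n_tilde):
    # Equation 1.2: quantize a rank position (1..n_tilde) into one of B buckets
    return int((B / float(n_tilde)) * position)

def score(query_list, object_list, B, n_tilde):
    # Equation 1.3: summed bucket discrepancy over the references shared by
    # the pruned lists of the query and of the object
    shared = set(query_list) & set(object_list)
    return sum(abs(bucket(query_list[r], B, n_tilde) -
                   bucket(object_list[r], B, n_tilde)) for r in shared)

# pruned lists from Figure 1.1c written as {reference: position}
q  = {'r4': 1, 'r3': 2}
o1 = {'r3': 1, 'r4': 2}
o2 = {'r3': 1, 'r2': 2}
print(score(q, o1, B=2, n_tilde=2))  # o1 shares r3 and r4 with the query
print(score(q, o2, B=2, n_tilde=2))  # o2 shares only r3 with the query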
searching process. There is a broker process (line 1). This broker process is responsible for handling users’ requests. Once a query is received, it is broadcast to all the processes, which we call the workers (line 2). Each worker creates L̃(q,R) and searches through its local partial data structure DSp (lines 3–6). Once the search is finished, the list of objects from each process OLp is sent back to the broker (line 7). The broker organizes the received result lists OL1,…,OLP and sends them to the user as OL (lines 8–11). The organization of the results is simple. The similarity between the query and the objects in any data domain is identified using the same full reference list. Hence, organizing the lists consists of merging them and sorting them based on the similarity score. This similarity score is defined by the searching algorithm that is used.
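The broker/worker exchange described above maps naturally onto a few MPI primitives. The following mpi4py fragment is only a sketch of that interaction (the chapter's MPI implementation is not reproduced here); local_search, the query object, and the merge step are placeholders.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def local_search(query):
    # placeholder: search the local partial data structure DS_p and return
    # a list of (object_id, similarity_score) pairs
    return []

query = {'id': 'q', 'features': [0.1, 0.2]} if rank == 0 else None
query = comm.bcast(query, root=0)       # broker broadcasts the query to all workers

partial = local_search(query)           # each worker searches its own DS_p
results = comm.gather(partial, root=0)  # partial lists come back to the broker

if rank == 0:
    # merge the per-worker lists and sort by the similarity score; the scores are
    # comparable across workers because every worker uses the same full reference list
    merged = sorted((item for lst in results for item in lst), key=lambda pair: pair[1])
    print(merged[:10])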
That means that the data are divided between the processes based on the references, not randomly as in the previous technique. The complexity of indexing is O(N(n + n log n)/P) + t1, where t1 is the time needed to transfer the object information from the indexing processes to the coordinating process.
IN: Dp of size N/P, R of size n, Rp of size n/P, ñ
OUT: Data structure DSp
1. if(indexer)
2.   For oi ∈ Dp
3.     For rj ∈ R
4.       b[j].dis = d(oi,rj)
5.       b[j].indx = j
6.     L(oi,R) = quicksort(b,n)
7.     L̃(oi,R) = partiallist(L(oi,R), ñ)
8.     For rc ∈ L̃(oi,R)
9.       Send the global object ID, the position L̃(oi,R)|rc, and the reference ID rc to the coordinating process.
10. if(coordinator)
11.   Recv. data from any indexer.
12.   Store the received data in DSp.
1.4.2.2 Searching
Unlike in the previous search scenario, the processes that participate in answering the query are the processes that hold the references located in L̃(q,R). Algorithm 4 shows the searching process. Once a broker process receives a query, the partial ordered list L̃(q,R) is created (line 2). An active process is a process whose references are located in L̃(q,R). More formally, a process p is defined as active if there exists rj ∈ Rp such that rj ∈ L̃(q,R). The broker notifies the active processes and sends them the positions of the related references L̃(q,R)|rj (lines 3–4). Each active process searches its local partial data structure and sends the results back to the broker (lines 5–9). The broker organizes the received results and sends them to the user (lines 10–13). The organization of the results is not simple. In this approach, each process defines the similarity between the query and the related objects based on a partial reference list Rp. That means that the same object can be seen by different processes, so the object can be found in the output list of each process with a partial
similarity score. Hence, the broker has to add these partial similarity scores that are related
to each object and then sort the objects based on the summed similarity score.
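A small sketch of that final combination step (an illustration with assumed data shapes, not the chapter's code) is given below: partial scores reported for the same object by different active processes are summed, and the objects are then ranked by the combined score.

from collections import defaultdict

def merge_partial_scores(per_process_results):
    # per_process_results: one {object_id: partial_score} dict per active process
    totals = defaultdict(float)
    for partial in per_process_results:
        for obj_id, s in partial.items():
            totals[obj_id] += s          # the same object may appear in several lists
    # rank by the summed similarity score (smaller meaning more similar, as with
    # the accumulator-style scores described earlier)
    return sorted(totals.items(), key=lambda kv: kv[1])

worker_outputs = [{'o1': 2.0, 'o3': 1.0}, {'o1': 1.0, 'o7': 3.0}]
print(merge_partial_scores(worker_outputs))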
1.4.3.2 Searching
Algorithm 5 shows the searching process. The broker process has information about all the references R = {R1 ∪ R2 ∪ … ∪ RP}. Hence, once a query q is submitted, the partial ordered list L̃(q,R) is created using the full reference list. If the query is located within a certain data domain Dp, the reference points that were selected from that data domain Rp are located as the closest references to the query. Hence, the searching is only done by the process that is responsible for the corresponding data domain (lines 1–4). If the query is located between two or more data domains, the closest references are shared among these
different domains. Formally, a process p is an active process if there exists rj ∈ Rp such that rj ∈ L̃(q,R). After searching, the results from each process cannot be sent to the broker directly. The similarity between the query and each object at each process is defined by a different set of reference points Rp, and these sets are not related to each other. To combine and rank them, each process computes the distances between the submitted query and the Kcp = Δ × K first objects in its candidate list. We call Δ ≥ 1 the DDC factor. Then, this distance information with the related object IDs is sent to the broker process (lines 5–10). Finally, the broker ranks these objects based on their distances and sends the results back to the user. For example, if Δ = 2, for K-NNquery(q,30), each process returns the most similar K = 30 objects to the query. Each process searches within its local data structure and gets a top candidate list of Kcp = 2 × 30 = 60 objects. Then, a DDC is performed between the query and this top Kcp = 60 by each process. If the references of the query ordered list are located within the sets of 3 processes, the broker receives 3 × 60 = 180 distances. These 180 distances are sorted, and the top 30 objects are sent back to the user.
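The DDC step in this index-based search amounts to re-ranking a short candidate list by true distances. The sketch below illustrates that step only (the distance function and data layout are assumptions, not the chapter's code).

import math

def ddc_rerank(query, candidates, objects, K, delta):
    # candidates: object IDs ordered by the approximate (permutation) score
    # objects:    dict mapping object ID to its feature vector
    # take the first Kc = delta * K candidates and compute true distances
    kc = delta * K
    top = candidates[:kc]
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    scored = [(oid, dist(query, objects[oid])) for oid in top]
    # the broker collects such lists from every active process, sorts by
    # distance, and returns the K closest objects to the user
    return sorted(scored, key=lambda pair: pair[1])[:K]

objs = {'o1': (0.0, 1.0), 'o2': (2.0, 2.0), 'o3': (0.5, 0.5)}
print(ddc_rerank((0.0, 0.0), ['o2', 'o1', 'o3'], objs, K=2, delta=1))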
1.5 EVALUATION
Large-scale experiments were conducted to evaluate the efficiency and the scalability of the three approaches in terms of recall (RE), position error (PE), indexing time, and searching time. The three approaches were implemented using the message passing interface (MPI) [30]. The results section is divided into three subsections.
In Section 1.5.1, we measure the average RE and PE for the three distributed approaches. The experiments were performed using a data set consisting of 5 million objects. The data set is composed of visual shape features (21 dimensional) and was extracted from the 12-million-image ImageNet corpus [31]. Three sets of references of sizes 1000, 2000, and 3000 were selected randomly from the data set. The average RE and PE are based on 250 different queries, which were selected randomly from the data set.
In Section 1.5.2, we measure the scalability of the three approaches using the same data set in terms of searching and indexing time. We compare the implementations of the three approaches using MPI in terms of indexing and searching time. The average searching time is based on 250 different queries, which were selected randomly from the data set. For these experiments, the MSA data structure was used. All the evaluation, including indexing and searching time, is performed using the full permutation list, as we consider using the full ordered list to be the worst-case scenario. That means that whatever the approach used, the searching is done by all the processes.
In Section 1.5.3, we perform large-scale experiments using the full content-based photo image retrieval (CoPhIR) data set [32] and the best distributed approach. The CoPhIR data set [32] consists of 106 million MPEG-7 features. Each MPEG-7 feature contains five descriptors (scalable color, color structure, color layout, edge histogram, and homogeneous texture). These five features are combined to build a 280-dimensional vector. The database size is about 75 GB. The evaluation was performed using a set of references of size 2000. The number of nearest points ñ is 10.
MPICH2 1.4.1p1 is installed on a Linux cluster of 20 dual-core computers (40 cores in total), each holding 8 GB of memory and 512 GB of local disk storage, led by a master eight-core computer holding 32 GB of memory and a terabyte of storage capacity.
RE = |S ∩ SA| / |S|,   (1.5)

PE = ( Σ_{o ∈ SA} | P(X,o) − P(SA,o) | ) / ( |SA| · N ),   (1.6)
where S and SA are the ordering of the top-K ranked objects to q for exact similarity search
and approximate similarity search, respectively. X is the ordering of items in the data set
D with respect to their distance from q, P(X,o) indicates the position of o in X, and P(SA,o)
indicates the position of o in SA. An RE = 0.5 indicates that the output results of the searching algorithm contain 50% of the K best objects retrieved by the exact search. A PE = 0.001 indicates that the average shift of the best K objects ranked by the searching algorithm, with respect to the rank obtained by the exact search, is 0.1% of the data set.
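Equations 1.5 and 1.6 translate directly into a few lines; the following sketch (with assumed list-based inputs) is given purely for illustration.

def recall(exact_topk, approx_topk):
    # Equation 1.5: fraction of the exact top-K found by the approximate search
    return len(set(exact_topk) & set(approx_topk)) / float(len(exact_topk))

def position_error(full_ranking, approx_topk, N):
    # Equation 1.6: average rank shift of the approximate results with respect
    # to the exact ordering X of the whole data set, normalized by N
    pos_exact = {o: i for i, o in enumerate(full_ranking)}
    total = sum(abs(pos_exact[o] - i) for i, o in enumerate(approx_topk))
    return total / float(len(approx_topk) * N)

X = ['o3', 'o1', 'o7', 'o2', 'o5']          # exact ordering of the data set
print(recall(X[:3], ['o3', 'o7', 'o5']))     # 2 of the 3 best objects found
print(position_error(X, ['o3', 'o7', 'o5'], N=len(X)))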
Figure 1.4 shows the average RE and PE for 10 K-NNs using 10 and 20 processes. The figure represents four different strategies, as follows. The first strategy is sequential (Seq). The second strategy is the data-based (DB) approach. The third strategy is the reference-based (RB) approach. The last strategy is the index-based (IB) approach. The DDC factor for the IB approach is equal to 1 for a fair comparison.
Figure 1.4a and b shows that the Seq, DB, and RB approaches give the same RE and PE. Also, the number of processes does not affect the performance. This is expected, as all the
FIGURE 1.4 Average RE and PE (21-dimensional data set, MSA, 250 queries) for 10 K-NNs using different sets of reference points (1000, 2000, and 3000) and 10 and 20 processes. (a) RE and PE (1 DDC, 10 nodes). (b) RE and PE (1 DDC, 20 nodes).
objects, whatever their domain, are indexed using the same set of references. Hence, the DB and RB approaches give the same RE and PE as the sequential algorithm, whatever the number of processes.
The IB approach gives a better RE and PE than the other three approaches, Seq, DB, and RB. In addition, its efficiency increases when the number of processes increases. There are two reasons for that. First, the data domain of each process is independent, which helps to decrease the search space for each process and identifies the objects in a better way using the local references. The second reason is the DDC between the query and the candidate list from each process. Hence, the IB approach gives a better recall and position error than the sequential and the other distributed implementations. In addition, a better performance is achieved when the number of processes increases. As the number of processes increases, the number of objects that are assigned to each process decreases, which improves the locality and provides a better representation of the objects, which leads to a better performance.
Figure 1.5 shows the effect of the DDC factor used in the IB approach for 10 K-NNs, using 20 processes and the same sets of references. It is clear from the figure that the RE
FIGURE 1.5 Effect of changing the DDC for 10 K-NNs using 20 processes (21-dimensional data set, MSA, 250 queries). (a) Average recall. (b) Average position error.
increases and PE decreases when the DDC increases. Also, due to the data distribution, a
better recall can be achieved with a low DDC value. For example, a recall of 0.9 is achieved
with DDC equal to 4. On the other hand, when DDC increases, more objects are compared
with the query per process, which affects the searching time.
FIGURE 1.6 Indexing and searching time (21-dimensional data set, MSA) in seconds using five processes. (a) Indexing time: 5 processes. (b) Searching time: 5 processes.
FIGURE 1.7 Indexing and searching time (21 dimensional data set, MSA) in seconds using 10
processes. (a) Indexing time: 10 processes. (b) Searching time: 10 processes.
FIGURE 1.8 Indexing and searching time (21 dimensional data set, MSA) in seconds using 20
processes. (a) Indexing time: 20 processes. (b) Searching time: 20 processes.
that there is no communication time between the processes. The second reason is the division of the data and the references, which makes the indexing within each process fast compared to the sequential one.
Searching. The x-axis shows the number of reference points, and the y-axis shows the searching time in seconds using the full ordered list L(oi,R). For the three algorithms, when the number of processes increases, the searching time decreases, with the same RE and PE for the DB and RB approaches and with a better RE and PE for the IB approach. Also, when the number of reference points increases, the searching time increases. For the RB approach, searching is very slow compared to the other two approaches. The main reason for that is the gathering process. In RB, all the nodes participate in answering a user's query, and each node has n/P references, where each reference is responsible for N objects, as we use the full ordered list. If we have 20 processes and 20 reference points, the broker process receives 20 lists of size N. After that, these lists are combined and sorted. The combination process is difficult, as the similarity with respect to the objects is computed using only a portion of the references, not the full reference list. Hence, this increases the running time, which means that most of the time for the RB approach is consumed in combining the results from the processes to give the final output to the user.
The IB searching algorithm gave the best running time, although there is some overhead due to hard disk access and the DDC between the query and the top Kcp objects.
TABLE 1.1 Indexing and Searching the CoPhIR Data Set Using the MPT Data Structure

(2000 references)   Sequential            Distributed
K-NN                10         100        10         100
RE                  0.64       0.7        0.76       0.86
Indexing            25 h                  10 min
Searching           6 s        6.5 s      1.8 s      2 s
The experiments were performed using the MPT data structure (Section 1.3.6). The table shows the results for the sequential implementation and for the distributed implementation using 34 processes based on the IB approach. The number of nearest references is ñ = 50 out of the 2000 reference points that were selected randomly from each database portion. The DDC factor is Δ = 40 for both the sequential and the distributed implementations. For the distributed implementation, the number of reference points for each processor is 60.
The table shows that the distributed implementation gives a better RE and PE than the sequential implementation for the top 10 and 100 K-NNs, due to the locality and the better representation of each object in the distributed environment. In terms of indexing, the distributed implementation is much faster than the sequential implementation due to the division of the data and the references. For searching, the distributed implementation did not achieve as large a speedup as the indexing, due to the number of DDCs that need to be performed by each process.
1.6 CONCLUSION
PBI is one of the most recent techniques for approximate K-NN similarity queries in large-scale domains. It aims to predict the similarity between data objects based on how they see their surroundings through a set of reference objects. With the exponential increase of data, parallel and distributed computations are needed. In this chapter, we presented three different distributed algorithms for efficient indexing and searching using PBI. The three algorithms are distinguished by whether they distribute the data, the references, or both. We evaluated them on big high-dimensional data sets and showed that the technique that distributes both the data and the references gives the best performance, due to the better representation of the objects by the local references and the low number of objects per process.
ACKNOWLEDGMENT
This work is jointly supported by the Swiss National Science Foundation (SNSF) via the
Swiss National Center of Competence in Research (NCCR) on Interactive Multimodal
Information Management (IM2) and the European Cooperation in Science and Technology
(COST) Action on Multilingual and Multifaceted Interactive Information Access (MUMIA)
via the Swiss State Secretariat for Education and Research (SER).
REFERENCES
1. H. V. Jagadish, A. O. Mendelzon and T. Milo. Similarity-based queries. In Proceedings of the
Fourteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems,
PODS ’95, pages 36–45, New York, 1995. ACM.
2. H. Samet. Foundations of Multidimensional and Metric Data Structures. The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling. Morgan Kaufmann, San Francisco, CA, 2006.
3. B. Bustos, O. Pedreira and N. Brisaboa. A dynamic pivot selection technique for similarity
search. In Proceedings of the First International Workshop on Similarity Search and Applications,
pages 105–112, Washington, DC, 2008. IEEE Computer Society.
4. L. G. Ares, N. R. Brisaboa, M. F. Esteller, Ó. Pedreira and Á. S. Places. Optimal pivots to minimize
the index size for metric access methods. In Proceedings of the 2009 Second International Workshop
on Similarity Search and Applications, pages 74–80, Washington, DC, 2009. IEEE Computer Society.
5. B. Bustos, G. Navarro and E. Chávez. Pivot selection techniques for proximity searching in
metric spaces. Pattern Recognition Letters, 24(14):2357–2366, 2003.
6. C. Faloutsos and K.-I. Lin. Fastmap: A fast algorithm for indexing, data-mining and visualiza-
tion of traditional and multimedia datasets. SIGMOD Record, 24(2):163–174, 1995.
7. A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor
in high dimensions. Communications of the ACM, 51(1):117–122, 2008.
8. P. Ciaccia, M. Patella and P. Zezula. M-tree: An efficient access method for similarity search
in metric spaces. In M. Jarke, M. J. Carey, K. R. Dittrich, F. H. Lochovsky, P. Loucopoulos
and M. A. Jeusfeld, editors, VLDB ’97, Proceedings of 23rd International Conference on Very
Large Data Bases, August 25–29, 1997, Athens, Greece, pages 426–435. Morgan Kaufmann, San
Francisco, CA, 1997.
9. O. Egecioglu, H. Ferhatosmanoglu and U. Ogras. Dimensionality reduction and similar-
ity computation by inner-product approximations. Knowledge and Data Engineering, IEEE
Transactions on, 16(6):714–726, 2004.
10. Ü. Y. Ogras and H. Ferhatosmanoglu. Dimensionality reduction using magnitude and shape
approximations. In Proceedings of the Twelfth International Conference on Information and
Knowledge Management, CIKM ’03, pages 99–107, New York, 2003. ACM.
11. P. Zezula, G. Amato, V. Dohnal and M. Batko. Similarity Search: The Metric Space Approach,
volume 32 of Advances in Database Systems. Springer, Secaucus, NJ, 2006.
12. E. C. Gonzalez, K. Figueroa and G. Navarro. Effective proximity retrieval by ordering permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9):1647–1658, 2008.
13. H. Mohamed and S. Marchand-Maillet. Quantized ranking for permutation-based indexing. In
N. Brisaboa, O. Pedreira and P. Zezula, editors, Similarity Search and Applications, volume 8199
of Lecture Notes in Computer Science, pages 103–114. Springer, Berlin, 2013.
14. G. Amato and P. Savino. Approximate similarity search in metric spaces using inverted files. In
Proceedings of the 3rd International Conference on Scalable Information Systems, InfoScale ’08,
pages 28:1–28:10, ICST, Brussels, Belgium, 2008. ICST.
15. A. Esuli. Mipai: Using the PP-index to build an efficient and scalable similarity search system.
In Proceedings of the 2009 Second International Workshop on Similarity Search and Applications,
pages 146–148, Washington, DC, 2009. IEEE Computer Society.
16. E. S. Tellez, E. Chavez and G. Navarro. Succinct nearest neighbor search. Information Systems,
38(7):1019–1030, 2013.
17. G. Amato, C. Gennaro and P. Savino. MI-File: Using inverted files for scalable approximate
similarity search. Multimedia Tools and Applications, 71(3):1333–1362, 2014.
18. J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2):6,
2006.
19. H. Mohamed and S. Marchand-Maillet. Parallel approaches to permutation-based indexing
using inverted files. In 5th International Conference on Similarity Search and Applications
(SISAP), Toronto, CA, August 2012.
20. E. S. Tellez, E. Chavez and A. Camarena-Ibarrola. A brief index for proximity searching. In E.
Bayro-Corrochano and J.-O. Eklundh, editors, Progress in Pattern Recognition, Image Analysis,
Computer Vision, and Applications, volume 5856 of Lecture Notes in Computer Science, pages
529–536. Springer, Berlin, 2009.
21. R. W. Hamming. Error detecting and error correcting codes. Bell System Technical Journal,
29:147–160, 1950.
22. E. S. Tellez and E. Chavez. On locality sensitive hashing in metric spaces. In Proceedings of the
Third International Conference on Similarity Search and Applications, SISAP ’10, pages 67–74,
New York, 2010. ACM.
23. A. Esuli. PP-index: Using permutation prefixes for efficient and scalable approximate similarity
search. Proceedings of LSDSIR 2009, i(July):1–48, 2009.
24. H. Mohamed and S. Marchand-Maillet. Metric suffix array for large-scale similarity search. In
ACM WSDM 2013 Workshop on Large Scale and Distributed Systems for Information Retrieval,
Rome, IT, February 2013.
25. H. Mohamed and S. Marchand-Maillet. Permutation-based pruning for approximate K-NN
search. In DEXA Proceedings, Part I, Prague, Czech Republic, August 26–29, 2013, pages 40–47,
2013.
26. U. Manber and E. W. Myers. Suffix arrays: A new method for on-line string searches. SIAM
Journal on Computing, 22(5):935–948, 1993.
27. K.-B. Schumann and J. Stoye. An incomplex algorithm for fast suffix array construction. Software: Practice and Experience, 37(3):309–329, 2007.
28. H. Mohamed and M. Abouelhoda. Parallel suffix sorting based on bucket pointer refinement.
In Biomedical Engineering Conference (CIBEC), 2010 5th Cairo International, Cairo, Egypt,
pages 98–102, December 2010.
29. S. W. Smith. The Scientist and Engineer’s Guide to Digital Signal Processing. California Technical
Publishing, San Diego, CA, 1997.
30. The MPI Forum. MPI: A message passing interface, 1993. Available at http://www.mpi-forum.org.
31. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei. ImageNet: A large-scale hierarchical
image database. In IEEE Computer Vision and Pattern Recognition (CVPR), 2009.
32. P. Bolettieri, A. Esuli, F. Falchi, C. Lucchese, R. Perego, T. Piccioli and F. Rabitti. CoPhIR: A test
collection for content-based image retrieval. CoRR, abs/0905.4627v2, 2009.
CHAPTER 2
CONTENTS
2.1 Introduction 22
2.2 Introduction of MapReduce and Apache Hadoop 22
2.3 A Motivating Application: Movie Ratings from Netflix Prize 24
2.4 Implementation in Hadoop 25
2.5 Deployment Architecture 27
2.6 Scalability and Cost Evaluation 30
2.7 Discussions 33
2.8 Related Work 34
2.9 Conclusion 35
Acknowledgment 35
References 35
Appendix 2.A: Source Code of Mappers and Reducers 36
ABSTRACT
Based on the MapReduce model and Hadoop Distributed File System (HDFS),
Hadoop enables the distributed processing of large data sets across clusters with scal-
ability and fault tolerance. Many data-intensive applications involve continuous and
incremental updates of data. Understanding the scalability and cost of a Hadoop
platform to handle small and independent updates of data sets sheds light on the
design of scalable and cost-effective data-intensive applications. In this chapter, we
introduce a motivating movie recommendation application implemented in the
MapReduce model and deployed on Amazon Elastic MapReduce (EMR), a Hadoop
2.1 INTRODUCTION
Processing large-scale data is an increasingly common and important problem in many domains [1]. The de facto standard programming model MapReduce [2] and the associated run-time systems were originally adopted by Google. Subsequently, an open-source platform named Hadoop [3] that supports the same programming model has gained tremendous popularity. However, MapReduce was not designed to efficiently process small and independent updates to existing data sets. This means the MapReduce jobs must be run again over both the newly updated data and the old data. Given enough computing resources, MapReduce's scalability makes this approach feasible. However, reprocessing the entire data set discards the work done in earlier runs and makes latency proportional to the size of the entire data set, rather than the size of an update [4].
In this chapter, we present an empirical scalability and cost evaluation of Hadoop on processing continuously and incrementally updated data streams. We first introduce the programming model of MapReduce for designing software applications. We select an application that finds, for each movie, the movies with the most mutual fans for recommendations, using Netflix data [5]. We then discuss the implementation and deployment options available in Amazon Web Services (AWS), a public cloud environment. We design experiments to evaluate the scalability and resource usage cost of running this Hadoop application and explore the insights from the empirical results using monitoring data at both the system and the platform level.
access to application data, Hadoop is reliable and efficient for Big Data analysis on large clusters.
The interface for application developers using Apache Hadoop is simple and convenient. Developers are only required to implement two interfaces: mapper and reducer. As Source Code 1 shows, the two void functions map and reduce correspond to the map and reduce phases of the MapReduce model. Hadoop takes care of the other operations on the data, such as shuffling, sorting, dividing input data, and so on.
Source Code 1: Mapper and Reducer in Java
@Deprecated
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
FIGURE 2.3 Key/value extraction in Hadoop Streaming: for each line of Ratings.txt, the characters before the first tab character form the key and the remainder forms the value (e.g., key "Jack", value "1_5").
@Deprecated
public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
By default, Hadoop is programmed with Java. With the Hadoop Streaming interface, developers can choose any programming language to implement the MapReduce functions. Using Hadoop Streaming, any executable or script can be specified as the mapper or the reducer. When map/reduce tasks are running, the input file is converted into lines. These lines are then fed to the standard input of the specified mapper/reducer. Figure 2.3 is an example showing how Hadoop Streaming extracts key/value pairs from lines. By default, the characters before the first tab character are considered the key, while the rest is the value. If there is no tab character in a line, then the entire line is the key, and the value is null. The standard output of the mapper/reducer is collected, and key/value pairs are extracted from each line in the same way. In this chapter, we choose Python to implement mappers and reducers using the Hadoop Streaming interface.
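As a small, generic illustration of this convention (not code from the chapter), a streaming script can recover the key/value pairs from its standard input by splitting each line at the first tab:

import sys

for line in sys.stdin:
    line = line.rstrip('\n')
    if '\t' in line:
        key, value = line.split('\t', 1)   # key = text before the first tab
    else:
        key, value = line, None            # no tab: whole line is the key
    # a real reducer would aggregate consecutive lines sharing the same key here
    print('%s\t%s' % (key, value))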
movies the user may like. Collaborative filtering (CF) is a widely used algorithm for recommendation systems [7]. It has two major forms, namely, user-based and item-based CF. User-based CF aims to recommend to a user the movies that other users like [8,9], while item-based CF recommends to a user the movies that are similar to the user's watched list or high-rating movies [10–12].
As users keep rating and watching movies, both algorithms cannot avoid the problem of incremental data processing, which means analyzing the new ratings and most recent watch histories and updating the recommendation lists. As the numbers of users and movies grow, incremental data processing can impose immense demands on computation and memory usage. Take the item-based CF algorithm in Reference 12 as an example. Assuming there are M users and N movies, the time complexity to compute the similarity between two movies is O(M). For the final recommendation result, the similarities for all possible movie pairs must be computed, which has a complexity of O(M·N²). With millions of users and tens of thousands of movies, the complexity becomes extremely large, and scalability becomes a serious problem. Implementing the CF algorithms with the MapReduce model on scalable platforms is a reasonable solution. Hence, a movie recommendation application on real-world data sets provides an excellent motivating scenario to evaluate the scalability and cost of the Hadoop platform in a cloud environment.
For this evaluation, we use the sample data sets from the Netflix Prize [5]. The size is approximately 2 GB in total. The data set contains 17,770 files. The MovieIDs range from 1 to 17,770 sequentially, with one file per movie. Each file contains three fields, namely, UserID, Rating, and Date. The UserID ranges from 1 to 2,649,429, with gaps. There are, in total, 480,189 users. Ratings are on a five-star scale from 1 to 5. Dates have the format YYYY-MM-DD. We merged all 17,770 files into one file, with each line in the format “UserID MovieID Rating Date,” ordered by ascending date, to create the input data for our application.
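The merging step itself is not shown in the chapter; a possible preprocessing sketch is given below, assuming the layout of the published Netflix Prize files (a "MovieID:" header line followed by "UserID,Rating,Date" lines) and file names of the form mv_*.txt.

import glob

records = []
for path in glob.glob('training_set/mv_*.txt'):       # one file per movie (assumed layout)
    with open(path) as f:
        movie_id = f.readline().strip().rstrip(':')   # header line, e.g., "123:"
        for line in f:
            user_id, rating, date = line.strip().split(',')
            records.append((date, user_id, movie_id, rating))

records.sort()                                         # ascending by date
with open('ratings.txt', 'w') as out:                  # "UserID MovieID Rating Date"
    for date, user_id, movie_id, rating in records:
        out.write('%s %s %s %s\n' % (user_id, movie_id, rating, date))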
In the item-based CF algorithm, we deine a movie fan as a user who rates a particular
movie higher than three stars. Mutual fans of two movies are the users who have rated
both of the movies higher than three stars. hen we measure the similarity between two
movies by the number of their mutual fans. As a result, we output a recommendation list
that contains top-N most similar movies to each movie. Following this application logic,
we present the implementation in Hadoop and highlight key artifacts in the MapReduce
programming model.
Round 1. Map and sort the user–movie pairs. Each pair implies that the user is a fan of
the movie. A reducer is not needed in this round. Figure 2.4 shows an example of the
input and output.
FIGURE 2.4 Round 1 example: the original input lines (UserID, MovieID, Rating) are mapped and sorted into user–movie pairs. The mapper reads lines from standard input, removes leading and trailing whitespace, skips malformed lines, and prints a "UserID<tab>MovieID" pair for every rating above three stars.
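A minimal round 1 mapper in that spirit (a simplified sketch consistent with the snippet in Figure 2.4; the chapter's full code is in Appendix 2.A) could look as follows:

#!/usr/bin/env python
import sys

# round 1 mapper: input lines are "UserID MovieID Rating Date"
for line in sys.stdin:
    words = line.strip().split()
    if len(words) < 3:
        continue
    if int(words[2]) > 3:                       # keep only fans (rating above three stars)
        print('%s\t%s' % (words[0], words[1]))  # emit "UserID<tab>MovieID"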
Round 2. Calculate the number of mutual fans of each movie pair. Figure 2.5 demonstrates the processing with an example. Assume Jack is a fan of movies 1, 2, and 3; then movies 1 and 2 have one mutual fan, Jack. Likewise, movies 1 and 3 and movies 2 and 3 also have Jack as a mutual fan. A mapper finds all the movie pairs for which each user is a mutual fan and outputs one line per such pair. The reducer then aggregates the results by movie pair and counts the number of mutual fans. The sample code of the mapper and reducer is shown in Appendix 2.A.
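The chapter's round 2 code is listed in Appendix 2.A; the simplified sketch below conveys the same idea for Hadoop Streaming. The mapper buffers the movies of each user (the round 1 output arrives sorted by user) and emits one line per movie pair, and the reducer counts the mutual fans of each pair. The command-line role switch is an assumption made for this sketch.

#!/usr/bin/env python
import sys
from itertools import combinations

def emit_pairs(movies):
    # one "Movie1-Movie2<tab>1" line for every pair of movies liked by the same user
    for a, b in combinations(sorted(movies, key=int), 2):
        print('%s-%s\t1' % (a, b))

def mapper():
    current_user, movies = None, []
    for line in sys.stdin:                  # "UserID<tab>MovieID", sorted by UserID
        user, movie = line.strip().split('\t')
        if user != current_user and movies:
            emit_pairs(movies)
            movies = []
        current_user = user
        movies.append(movie)
    if movies:
        emit_pairs(movies)

def reducer():
    current_pair, count = None, 0
    for line in sys.stdin:                  # "Movie1-Movie2<tab>1", sorted by pair
        pair, one = line.strip().split('\t')
        if pair != current_pair and current_pair is not None:
            print('%s\t%d' % (current_pair, count))
            count = 0
        current_pair = pair
        count += int(one)
    if current_pair is not None:
        print('%s\t%d' % (current_pair, count))

if __name__ == '__main__':
    # pass "map" or "reduce" on the command line to select the role
    (mapper if sys.argv[1:] == ['map'] else reducer)()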
Round 3. Extract the movie pairs from the result of round 2 and find the top-N movies that have the most mutual fans with each movie. The mapper extracts the movie IDs from the movie pairs, and the reducer aggregates the results and orders the recommended movies for each movie by the number of their mutual fans. As Figure 2.6 demonstrates, movie pair 1–2 has a mutual fan count of two, and movie pair 1–3 has one mutual fan. Therefore, movie ID 1 has mutual fans with movie IDs 2 and 3.
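Round 3 can be sketched in the same style (again a simplified illustration; the chapter's code is in Appendix 2.A): the mapper splits each counted pair into two directed records so that every movie sees all of its counterparts, and the reducer keeps the top-N counterparts by mutual-fan count. TOP_N and the role switch are assumptions.

#!/usr/bin/env python
import sys

TOP_N = 10  # assumed length of the recommendation list

def mapper():
    # input: "Movie1-Movie2<tab>count"; output: "Movie<tab>Other,count" in both directions
    for line in sys.stdin:
        pair, count = line.strip().split('\t')
        m1, m2 = pair.split('-')
        print('%s\t%s,%s' % (m1, m2, count))
        print('%s\t%s,%s' % (m2, m1, count))

def emit_top(movie, partners):
    ranked = sorted(partners, key=lambda p: -p[1])[:TOP_N]
    print('%s\t%s' % (movie, ' '.join(m for m, _ in ranked)))

def reducer():
    current, partners = None, []
    for line in sys.stdin:                  # sorted by movie ID
        movie, rest = line.strip().split('\t')
        other, count = rest.split(',')
        if movie != current and current is not None:
            emit_top(current, partners)
            partners = []
        current = movie
        partners.append((other, int(count)))
    if current is not None:
        emit_top(current, partners)

if __name__ == '__main__':
    (mapper if sys.argv[1:] == ['map'] else reducer)()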
FIGURE 2.5 Round 2 example: user–movie pairs are mapped to movie-pair tuples (one line per mutual fan), which the reducer aggregates into a mutual-fan count for each movie pair.
[Figure: deployment architecture—the movie recommendation job runs on an Amazon EMR Hadoop cluster consisting of one master EC2 instance and 10 core EC2 instances, reading the Netflix data set from Amazon S3 and writing the results back to S3.]
cluster configuration, hardware configuration, and steps. We use the default settings for the other sections.
In the cluster configuration, we set up our cluster name and the log folder on S3, as Figure 2.8 shows.
In the hardware configuration, as Figure 2.9 shows, we leave the network and EC2 subnet at their defaults and change the EC2 instance type and instance count to what we need.
As Figure 2.10 shows, the steps section is where we configure the Hadoop jobs. Since the recommendation application requires three rounds of MapReduce operations, we add three steps here. Each step is a MapReduce job. The specification of the S3 locations of the mapper, the reducer, the input files, and the output folder for each step is shown in Figure 2.11.
We use Amazon CloudWatch [13] to monitor the status of Hadoop jobs and tasks
and EC2 instances. For EC2 instances, CloudWatch provides a free service named Basic
Monitoring, which includes metrics for CPU utilization, data transfer, and disk usage at a
5 min frequency. For the EMR service, CloudWatch has metrics about HDFS, S3, nodes,
and map/reduce tasks. All these built-in EC2 and EMR metrics are automatically collected
and reported. AWS users can access these metrics in real time on the CloudWatch website
or through the CloudWatch application programming interfaces (APIs).
experiments that process different-sized rating files. First, we order the Netflix data set by time. Assuming 200 ratings are made by users every second, the data set contains 200 ratings per second * 24 (hours) * 3600 (seconds) * N records on the Nth day. We run our movie recommendation jobs over periods of N = 1, 3, and 6 days (Figure 2.12) and compare the results.
Understanding how the minimal time interval or frequency of data processing varies with the data update size helps to make scaling decisions on the optimal resource provisioning settings. Table 2.3 shows the elapsed time of the Hadoop jobs for different sizes of input data. The elapsed time implies, for a given data size, the minimal time required to run the Hadoop application that updates the movie recommendation list. For example, the first entry in Table 2.3 means that, given the current capacity, it is only possible to rerun the Hadoop movie recommendation application with a frequency of once per hour for processing rating data updated in 1 day. For any shorter frequency or larger data set, more EMR instances need to be provisioned. Further analysis in Figure 2.13 shows the linear trend of the elapsed time of this Hadoop application with respect to the data size.
FIGURE 2.13 Linear trend between elapsed time and data sizes (elapsed time versus data size in MB for the 1-, 3-, and 6-day inputs; linear fit with R² = 0.9968).
From the platform status, Figure 2.14 shows the number of running tasks, comparing the results of the 1-day input and the 3-day input. The evaluation environment has 10 core instances. Each instance has two mappers and one reducer. The running map and reduce tasks are at their maximum most of the time. In addition, all the mappers and reducers finish their jobs at approximately the same time, which implies that there is no data skew problem here. In Section 2.4, “Implementation in Hadoop,” we presented the three rounds of MapReduce jobs of this movie recommendation application. The time spent running each round is listed in Figure 2.14.
Figure 2.15 shows the system status, comparing the 1-day data input and the 3-day data input. As Figure 2.15c shows, in both experiments there are no data written to HDFS, and the data read from HDFS total less than 5 KB/s at all times. This is mainly due to the fact that EMR uses S3 for input and output instead of HDFS. In the EMR service, the mappers read from S3 and hold intermediate results on the local disks of the EC2 instances. Then, in the shuffling stage, these intermediate results are sent through the network to the corresponding reducers, which may sit on other instances. In the end, the reducers write the final results to S3.
In Figure 2.15a, the experiment with the 3-day input has a CPU utilization of 100% most of the time, except at three low points at 100, 125, and 155 min, which are the times when the shuffling or the writing of reduce-task results occurs. The average network I/O measurement in Figure 2.15b shows spikes at around the same time points, 100, 125, and 155 min, respectively, while at other times the average network I/O stays quite low. The highest network I/O rate is still below the bandwidth of Amazon's network in the same region, which is 1 Gbps to our best knowledge. The experiment with the 1-day input shows the same pattern with different time frames. From these observations, we can infer that CPU utilization would be the bottleneck for processing data at a higher frequency.
Table 2.4 shows the cost of the EMR service for different data sizes and updating frequencies. For example, with a once-per-3-hours updating frequency and 1.2 GB input files, we need to run the Hadoop recommendation jobs 8 times a day. Each run takes 2 h and 57 min. The EMR price on AWS is 0.08 USD per instance per hour. For sub-hour usage, the billing system rounds up to a full hour. As we have used 11 instances in total, it costs $0.08*11*3*8 = $21.12 per day. The NA (not applicable) entries in the table mean that the processing time for data of that size is longer than the updating interval, so that more instances need to be provisioned.
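The daily figure quoted above follows directly from the rounding-to-whole-hours billing rule; the small sketch below simply reproduces that arithmetic (the price, instance count, and run time are the values used in the experiment).

import math

def daily_emr_cost(price_per_instance_hour, instances, job_minutes, runs_per_day):
    billed_hours = int(math.ceil(job_minutes / 60.0))  # sub-hour usage is rounded up
    return price_per_instance_hour * instances * billed_hours * runs_per_day

# 1.2 GB input updated every 3 hours: 8 runs per day, 2 h 57 min each, 11 instances
print(round(daily_emr_cost(0.08, 11, 2 * 60 + 57, 8), 2))  # 21.12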
FIGURE 2.14 Number of running map and reduce tasks over time for the 1-day and 3-day inputs, together with the time spent in each MapReduce round (round 1: 4 min for the 1-day input, 5 min for the 3-day input).
FIGURE 2.15 System status of Hadoop cluster. (a) Core instance average CPU utilization. (b) Core
instance average network in/out. (c) HDFS bytes read/written. (d) S3 bytes read/written.
2.7 DISCUSSIONS
Using AWS EMR, S3, CloudWatch, and the billing service, we are able to observe both performance and cost metrics. Based on our experience, we discuss other features of the Hadoop platform related to the processing of incremental data sets.
Programmability. Hadoop is an open-source project, and it provides a simple programming interface for developers. Hadoop supports multiple programming languages through the Hadoop Streaming interface. Figure 2.4 is an example of a mapper programmed with Python.
Deployment complexity. Hadoop deployment is complicated and includes MapReduce modules, HDFS, and other complex configuration files. However, Hadoop is a widely used platform and has a mature community, so it is easy to find documentation or to choose an integrated Hadoop service like Amazon EMR. With Amazon EMR, developers no longer
need to set up the Hadoop environment, and a simple configuration can get Hadoop jobs running, as we present in “Deployment Architecture,” Section 2.5.
Integration with other tools. Hadoop can output metrics to data files or to real-time monitoring tools. Integration with other monitoring tools can be done by editing the configuration file. For example, Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids [14]. Hadoop can output metrics to Ganglia by editing hadoop-metrics.properties as follows:
dfs.class = org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period = 10
dfs.servers = @GANGLIA@:8649
mapred.class = org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period = 10
mapred.servers = @GANGLIA@:8649
Autoscaling options. Amazon EMR allows users to increase or decrease the number of task
instances. For core instances that contain HDFS, users can increase but not decrease them. All
these scaling operations can be made when the Hadoop cluster is running, which enables the
autoscaling ability for EMR if the autoscaling rules are set up based on the Hadoop metrics.
2.9 CONCLUSION
In this chapter, we evaluate the scalability and cost of the Amazon Hadoop service through a
movie recommendation scenario, which requires frequent updates on continuous input data
streams. We implement this sample application with the MapReduce programming model and
deploy it on the Amazon EMR service. We present the monitoring techniques to collect both
system- and platform-level metrics to observe the performance and scalability status. Our eval-
uation experiments help to identify the minimal time interval or the highest frequency of data
processing, which informs the autoscaling settings for optimal resource provisioning.
ACKNOWLEDGMENT
Copyright 2014 Carnegie Mellon University
This material is based upon work funded and supported by the Department of Defense under contract no. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.
No warranty. This Carnegie Mellon University and Software Engineering Institute Material is furnished on an “as-is” basis. Carnegie Mellon University makes no warranties of any kind, either expressed or implied, as to any matter including, but not limited to, warranty of fitness for purpose or merchantability, exclusivity, or results obtained from use of the material. Carnegie Mellon University does not make any warranty of any kind with respect to freedom from patent, trademark, or copyright infringement.
This material has been approved for public release and unlimited distribution.
DM-0001345
REFERENCES
1. Paul Zikopoulos, and Chris Eaton. Understanding Big Data: Analytics for Enterprise Class
Hadoop and Streaming Data (1st ed.). McGraw-Hill Osborne Media, Columbus, OH (2011).
2. Jeffrey Dean, and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters.
Communications of the ACM 51, 1 (2008): 107–113.
3. What is Apache Hadoop? (2014). Available at http://hadoop.apache.org/.
4. Daniel Peng, and Frank Dabek. Large-scale incremental processing using distributed transac-
tions and notifications. In Proceedings of the 9th USENIX Symposium on Operating Systems
Design and Implementation. USENIX (2010).
5. Netflix Prize (2009). Available at http://www.netflixprize.com/.
6. Gediminas Adomavicius et al. Incorporating contextual information in recommender systems
using a multidimensional approach. ACM Transactions on Information Systems (TOIS) 23, 1
(2005): 103–145.
7. John S. Breese, David Heckerman, and Carl Kadie. Empirical analysis of predictive algo-
rithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in
Artificial Intelligence. Morgan Kaufmann Publishers Inc. (1998).
8. Jonathan L. Herlocker et al. An algorithmic framework for performing collaborative filter-
ing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval. ACM (1999).
9. Paul Resnick et al. GroupLens: An open architecture for collaborative filtering of netnews. In
Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work. ACM (1994).
10. Mukund Deshpande, and George Karypis. Item-based top-n recommendation algorithms.
ACM Transactions on Information Systems (TOIS) 22, 1 (2004): 143–177.
11. Greg Linden, Brent Smith, and Jeremy York. Amazon.com recommendations: Item-to-item
collaborative filtering. Internet Computing, IEEE 7, 1 (2003): 76–80.
12. Badrul Sarwar et al. Item-based collaborative filtering recommendation algorithms. In Pro-
ceedings of the 10th International Conference on World Wide Web. ACM (2001).
13. Amazon CloudWatch (2014). Available at http://aws.amazon.com/cloudwatch/.
14. Ganglia (2014). Available at http://ganglia.sourceforge.net/.
15. Pramod Bhatotia et al. Incoop: MapReduce for incremental computations. In Proceedings of the 2nd
ACM Symposium on Cloud Computing (SOCC ’11). ACM, New York, Article 7, 14 pages (2011).
16. Dionysios Logothetis et al. Stateful bulk processing for incremental analytics. In Proceedings of
the 1st ACM symposium on Cloud computing (SoCC ’10). ACM, New York, 51–62 (2010).
17. Shadi Ibrahim et al. Evaluating MapReduce on virtual machines: The Hadoop case. Cloud
Computing, Lecture Notes in Computer Science 5931 (2009): 519–528.
18. Zacharia Fadika et al. Evaluating Hadoop for data-intensive scientific operations. In Proceedings of
the 2012 IEEE 5th International Conference on Cloud Computing (CLOUD), pp. 67–74 (June 24–29, 2012).
19. Elif Dede et al. Performance evaluation of a MongoDB and Hadoop platform for scientific data
analysis. In Proceedings of the 4th ACM Workshop on Scientific Cloud Computing (Science Cloud
’13). ACM, New York, 13–20 (2013).
20. Matei Zaharia et al. Improving MapReduce performance in heterogeneous environments. In
Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation
(OSDI ’08). USENIX Association, Berkeley, CA, 29–42 (2008).
#!/usr/bin/env python
import sys

# Mapper excerpt: collect the IDs of the movies rated by each user and emit
# movie-ID pairs via print_id_pairs (the input loop and the helper's definition
# are omitted from this excerpt).
movie_ids = list()
user_id = 0
print_id_pairs(movie_ids)
#!/usr/bin/env python
import sys

# Reducer excerpt: accumulate the number of mutual fans per movie pair and write
# the totals to standard output; input lines are tab-separated (movie_pair, count)
# records sorted by movie_pair.
movie_pair_cnt = 0
movie_pair = 0
for line in sys.stdin:
    words = line.strip().split('\t')
    if len(words) < 2:
        continue
    if movie_pair != words[0] and movie_pair != 0:
        print('%s\t%d' % (movie_pair, movie_pair_cnt))
        movie_pair_cnt = 0
    movie_pair = words[0]
    movie_pair_cnt += int(words[1])
# Emit the count for the final movie pair.
if movie_pair != 0:
    print('%s\t%d' % (movie_pair, movie_pair_cnt))
#!/usr/bin/env python
import sys

# Reducer excerpt: maintain the top list of similar movies for each movie and
# write it out via print_top_lists (the input loop and the helper's definition
# are omitted from this excerpt).
top_lists = []
movie_id = 0
print_top_lists(movie_id, top_lists)
CHAPTER 3
Singular Value
Decomposition, Clustering,
and Indexing for Similarity
Search for Large Data Sets
in High-Dimensional Spaces
Alexander Thomasian
CONTENTS
3.1 Introduction 40
3.2 Data Reduction Methods and SVD 42
3.3 Clustering Methods 44
3.3.1 Partitioning Methods 45
3.3.2 Hierarchical Clustering 46
3.3.3 Density-Based Methods 48
3.3.4 Grid-Based Methods 49
3.3.5 Subspace Clustering Methods 49
3.4 Steps in Building an Index for k-NN Queries 52
3.5 Nearest Neighbors Queries in High-Dimensional Space 54
3.6 Alternate Method Combining SVD and Clustering 56
3.7 Survey of High-Dimensional Indices 57
3.8 Conclusions 62
Acknowledgments 62
References 62
Appendix 3.A: Computing Approximate Distances with Dimensionality-Reduced Data 68
ABSTRACT
Representing objects such as images by their feature vectors and searching for simi-
larity according to the distances of the points representing them in high-dimensional
space via k-nearest neighbors (k-NNs) to a target image is a popular paradigm.
We discuss a combination of singular value decomposition (SVD), clustering, and
indexing to reduce the cost of processing k-NN queries for large data sets with high-
dimensional data. We first review dimensionality reduction methods with emphasis
on SVD and related methods, followed by a survey of clustering and indexing meth-
ods for high-dimensional numerical data. We describe combining SVD and cluster-
ing as a framework and the main memory-resident ordered partition (OP)-tree index
to speed up k-NN queries. We discuss techniques to save the OP-tree on disk and
specify the stepwise dimensionality increasing (SDI) index suited for k-NN queries
on dimensionally reduced data.
3.1 INTRODUCTION
IBM’s Query by Image Content (QBIC) project in the 1990s [1] utilized content-based image
retrieval (CBIR) or similarity search based on features extracted from images. We are con-
cerned with the storage space and processing requirements for CBIR, which are important
because of the high data volumes and the associated processing requirements, especially if
the similarity search is conducted by a naive sequential scan of the whole data set. The clus-
tering with singular value decomposition (CSVD) method and the ordered partition (OP)-
tree index [2] are used in Reference 3 in applying CBIR to indexing satellite images [4].
CBIR has two steps [5]: (1) Characterize images by their features and define similarity
measures. (2) Extract features such as color, texture, and shape [6]. Texture features include
the fractal dimension, coarseness, entropy, circular Moran autocorrelation functions, and
spatial gray-level difference (SGLD) statistics [7]. In addition to photographic, medical, and
satellite images, content-based retrieval is used in conjunction with audio and video clips
and so forth.
Given a target image represented by a point in high-dimensional space, images similar
to it can be determined via a range query, which determines all points within a certain
radius. Since a range query with a small radius may not yield any query points, k-nearest
neighbor (k-NN) queries are used instead [8]. The distance may be based on the Euclidean
or more complex distance functions, such as the Mahalanobis distance [8]. k-NN queries
can be implemented by a sequential scan of the data set holding the feature vectors of all
images, while keeping track of the nearest k. The naive algorithm has a running time of
O(MN), where M is the cardinality of the data set (number of images) and N is the dimen-
sionality (number of features). Amazingly, naive search may outperform multidimensional
indexing methods in higher-dimensional data spaces [9].
Clustering is one method to reduce the cost of k-NN search from the viewpoint of CPU
and disk processing. Given H clusters with an equal number of points per cluster, the cost of
k-NN queries is reduced H-fold, provided that only the data points in one cluster are to be
searched. A survey of clustering methods with emphasis on disk-resident high-dimensional
data is provided in this chapter. With the advent of storage-class memory (SCM) [10], more
and more data are expected to migrate from magnetic disks [11] to such memory.
Clusters are usually envisioned as hyperspheres represented by their centroid and radius
(the distance of the farthest point from the centroid). A query point may reside in the
volume shared by two or more hyperspheres or hypercubes depending on the indexing
structure, so that multiple clusters need to be searched to locate the k-NNs. k-NN search
has to be trimmed as much as possible to reduce query processing cost [3]. Indexing is a
form of clustering; however, most indexing structures, such as those in the R-tree family,
perform poorly in higher-dimensional data spaces [8]. More details on this topic appear in
Section 3.7.
Principal component analysis (PCA) [12], singular value decomposition (SVD), and
Karhunen–Loève transform (KLT) are related methods to reduce the number of dimensions
of high-dimensional data after rotating coordinates to attain minimal loss in information
for the desired level of data compression [8,13]. Dimensionality reduction can be option-
ally applied to the original data set before applying clustering. Dimensionality reduction
applied to clusters takes advantage of local correlations, resulting in a smaller normalized
mean square error (NMSE) for the same reduction in data volume as when SVD is applied
to the whole data set. The method is therefore called recursive clustered SVD (RCSVD),
although SVD was applied to clusters in References 3 and 14.
As an example of how combining clustering with SVD yields higher dimensionality
reduction than applying SVD alone, consider points surrounding multiple nonintersecting
lines in three-dimensional space: each cluster can be specified by its single dominant
dimension, a threefold reduction in space requirements and also in the computational
cost of k-NN queries. Experimentation with several real-life data sets has shown
lower information loss with the same compression ratio or vice versa when clustering is
combined with SVD.
The local dimensionality reduction (LDR) method [15] finds local correlations and per-
forms dimensionality reduction on these data. However, it is shown in Reference 3 that
CSVD with an off-the-shelf clustering method, such as k-means, outperforms LDR.
A further reduction in cost can be attained by reducing the number of points to be con-
sidered by building an index for the data in each cluster. It turns out that popular indexes
such as R-trees are inefficient in processing k-NN queries in high dimensions. Indexing
methods are surveyed in References 16 and 17, but a more focused and up-to-date survey
for high-dimensional data is provided in this chapter.
The chapter is organized as follows. Section 3.2 discusses the issue of dimensionality
reduction. Section 3.3 presents clustering methods with emphasis on large, high-
dimensional, disk-resident data sets. The method in Reference 3 to build the data structures
for k-NN queries is described in Section 3.4. Section 3.5 specifies the method for determining
similarity. Section 3.6 discusses the LDR method in Reference 15. Section 3.7 surveys
indexing structures. Section 3.8 summarizes the discussion in the chapter. A more efficient
method [18] to compute the Euclidean distance with dimensionally reduced data with SVD
is given in the appendix.
1. SVD, PCA, and KLT are related dimensionality reduction methods, which are dis-
cussed in this section.
2. Discrete wavelet transform (DWT) has many variations [13], such as Daubechies transform
[22]. It has been utilized in Reference 23 for data compression in a relational database.
3. Regression, in its simplest form, is the linear regression of a single variable model,
that is, variable Y as a linear function of variable X, Y = α + βX, but regression can be
extended to more than one variable [13].
4. Log-linear models are a methodology for approximating discrete multidimensional
probability distributions; for example, the multiway table of joint probabilities is
approximated by a product of lower-order tables [24].
5. Histograms approximate the data in one or more attributes of a relation by their
frequencies [25].
6. Clustering techniques are discussed in Section 3.3.
7. Indexing structures are discussed in Section 3.7.
8. Sampling is an attempt to represent a large set of data by a small random sample of its
data elements. Sampling in databases is reviewed in Reference 26, and Reference 27 is
a representative publication.
We only consider SVD and related methods, in the context of the
discussion in References 3 and 14. Consider matrix X, whose M rows are the feature vec-
tors for images. Each image has N features, which is the number of columns in the matrix.
Subtracting the mean from the elements in each column yields columns with a zero mean.
The columns may furthermore be studentized by dividing the elements by the standard
deviation. The SVD of X is given as follows [8,28]:
X = U \Sigma V^T, \qquad (3.1)

where the columns of U and V are orthonormal and \Sigma is a diagonal matrix holding the
nonnegative singular values of X. As an example,

X = \begin{pmatrix} 1 & 0 & 0 & 0 & 2 \\ 0 & 0 & 3 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 & 0 \end{pmatrix}, \quad
U = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 1 & 0 & 0 & 0 \end{pmatrix},

\Sigma = \begin{pmatrix} 4 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 & 0 \\ 0 & 0 & \sqrt{5} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \quad
V^T = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ \sqrt{0.2} & 0 & 0 & 0 & \sqrt{0.8} \\ 0 & 0 & 0 & 1 & 0 \\ -\sqrt{0.8} & 0 & 0 & 0 & \sqrt{0.2} \end{pmatrix}.
C = \frac{1}{M} X^T X = V \Lambda V^T, \qquad (3.2)
where V is the matrix of eigenvectors (as in Equation 3.1) and Λ is a diagonal matrix hold-
ing the eigenvalues of C. C is positive-semidefinite, so that its N eigenvectors are orthonor-
mal and its eigenvalues are nonnegative. The trace (sum of eigenvalues) of C is invariant
under rotation. The computational cost in this case is higher than that of SVD, requiring a
matrix multiplication, O(MN²), followed by an eigenvalue decomposition, O(N³).
We assume that the eigenvalues, similar to the singular values, are in nonincreasing
order, so that λ1 ≥ λ2 ≥ … ≥ λN. To show the relation between singular values and eigenval-
ues, we substitute X in Equation 3.2 with its decomposition in Equation 3.1.
C = \frac{1}{M} X^T X = \frac{1}{M} (V \Sigma U^T)(U \Sigma V^T) = \frac{1}{M} V \Sigma^2 V^T.

It follows that

\Lambda = \frac{1}{M} \Sigma^2. \qquad (3.3)
The rotated coordinates (the principal components) are obtained as Y = XV. (3.4)
Retaining the first n dimensions of Y maximizes the fraction of preserved variance and
minimizes the NMSE:

NMSE = \frac{\sum_{i=1}^{M} \sum_{j=n+1}^{N} y_{i,j}^2}{\sum_{i=1}^{M} \sum_{j=1}^{N} y_{i,j}^2}
     = \frac{\sum_{j=n+1}^{N} \lambda_j}{\sum_{j=1}^{N} \lambda_j}. \qquad (3.5)
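The following NumPy sketch (illustrative only; the matrix, the number of retained dimensions n, and the variable names are choices for this example) rotates a centered data matrix with SVD and evaluates Equations 3.3 through 3.5:

import numpy as np

X = np.array([[1., 0., 0., 0., 2.],
              [0., 0., 3., 0., 0.],
              [0., 0., 0., 0., 0.],
              [0., 4., 0., 0., 0.]])
Xc = X - X.mean(axis=0)                    # zero-mean columns
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
lam = s**2 / Xc.shape[0]                   # eigenvalues of C (Equation 3.3)
Y = Xc @ Vt.T                              # rotated coordinates Y = XV (Equation 3.4)
n = 2                                      # number of retained dimensions
nmse = lam[n:].sum() / lam.sum()           # Equation 3.5
print(Y[:, :n], nmse)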
1. Distance contrast, defined as (D_max − D_min)/D_min, where D_max and D_min are the distances of the
farthest and closest points from a target point, decreases with increasing dimension-
ality. Setting 0 ≤ p ≤ 1 in the Minkowski distance has been proposed in References 37 and 38 to provide more
meaningful distance measures:

D(u, v) = \left( \sum_{i=1}^{n} |u_i - v_i|^p \right)^{1/p}. \qquad (3.6)
Given n points and k clusters, the number of possible clusterings is a Stirling number of the
second kind, defined as follows:

S(n, k) = \frac{1}{k!} \sum_{t=0}^{k} (-1)^t \binom{k}{t} (k - t)^n.

S(n, 1) = S(n, n) = 1, and S(4, 2) = 7 because there are seven ways to partition four objects a,
b, c, d into two groups: {a, bcd}, {b, acd}, {c, abd}, {d, abc}, {ab, cd}, {ac, bd}, {ad, bc}.
The recursion S(n, k) = kS(n − 1, k) + S(n − 1, k − 1), 1 ≤ k < n, can be used to compute S(n, k) and
to show that S(n, k) increases rapidly, for example, S(10, 5) = 42525.
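A small Python check of this recursion (a sketch; the function name is arbitrary):

def stirling2(n, k):
    # S(n, 1) = S(n, n) = 1; otherwise S(n, k) = k*S(n-1, k) + S(n-1, k-1).
    if k == 1 or k == n:
        return 1
    if k < 1 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

print(stirling2(4, 2), stirling2(10, 5))   # 7 42525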
Cluster analysis can be classified into the following methods.
In some methods, the assignment of points to clusters is fuzzy, that is,
not all or nothing [39]. In fact, the expectation maximization (EM) method provides soft
assignment of points to clusters, such that each point has a probability of belonging to
each cluster.
k-means is a good example of partitioning-based clustering methods, which can be
specified succinctly as follows [40] (a short code sketch follows the list):
1. Randomly select H points from the data set to serve as preliminary centroids for clus-
ters. Centroid spacing is improved by using the method in Reference 41.
2. Assign points to the closest centroid based on the Euclidean distance to form H
clusters.
3. Recompute the centroid for each cluster.
4. Go back to step 2, unless there is no new reassignment.
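A minimal NumPy sketch of the four steps above (illustrative only; it uses simple random initialization rather than the centroid-spacing method of Reference 41, and the function and parameter names are choices for this sketch):

import numpy as np

def kmeans(X, H, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=H, replace=False)]        # step 1
    for _ in range(max_iter):
        # step 2: assign each point to the closest centroid (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # step 3: recompute the centroid of each cluster
        new_centroids = np.array([X[labels == h].mean(axis=0) if np.any(labels == h)
                                  else centroids[h] for h in range(H)])
        if np.allclose(new_centroids, centroids):                    # step 4: stop when stable
            break
        centroids = new_centroids
    return labels, centroids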
The quality of the clusters varies significantly from run to run, so the clustering
experiment is repeated and the run yielding the smallest sum of squared (SSQ) distances to
the centroids of all clusters is selected:

SSQ = \sum_{h=1}^{H} \sum_{i \in \mathcal{C}^{(h)}} \sum_{j=1}^{N} \left( x_{i,j} - c_j^{(h)} \right)^2. \qquad (3.7)

\mathcal{C}^{(h)} denotes the hth cluster, with centroid C_h, whose coordinates are given by c^{(h)}, and radius
R_h, which is the distance of the farthest point in the cluster from C_h. SSQ decreases with
increasing H, but a point of diminishing returns is soon reached. The k-means clustering
method does not scale well to large high-dimensional data sets [42].
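Selecting the best of several runs by the SSQ of Equation 3.7, using the kmeans sketch above (H = 5 and 10 repetitions are arbitrary choices):

best = None
for seed in range(10):                            # repeat clustering with different seeds
    labels, centroids = kmeans(X, H=5, seed=seed)
    ssq = ((X - centroids[labels])**2).sum()      # Equation 3.7
    if best is None or ssq < best[0]:
        best = (ssq, labels, centroids)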
Clustering Large Applications Based on Randomized Search (CLARANS) [43] extends
Clustering Large Applications (CLARA), described in Reference 33. Clustering is carried
out with respect to representative points referred to as medoids. his method, also known
as k-medoids, is computationally expensive because (1) each iteration requires trying out
new partitioning representatives through an exchange process and (2) a large number of
passes over the data set may be required. CLARANS improves efficiency by performing
iterative hill climbing over a smaller sample.
1. Agglomerative or bottom up, where at first, each point is its own cluster, and pairs of
clusters are merged as one moves up the hierarchy
2. Divisive or top down, where all observations start in one cluster, and splits are per-
formed recursively as one moves down the hierarchy
For a cluster of m points, BIRCH maintains a clustering feature CF = (m, LS, SS), where for
each dimension j the linear sum and square sum are

LS_j = \sum_{i=1}^{m} x_{i,j}, \qquad SS_j = \sum_{i=1}^{m} x_{i,j}^2.
In the first phase, BIRCH scans the database to build an initial in-memory CF-tree,
which can be viewed as a multilevel compression of the data that tries to preserve the
inherent clustering structure of data. In the second phase, BIRCH applies a clustering algo-
rithm to cluster the leaf nodes of the CF-tree, which removes sparse clusters as outliers and
groups dense clusters into larger ones. Using CF, we can easily derive many useful statistics
of a cluster: the cluster’s centroid, radius, and diameter. To merge two clusters C1 and C2,
we simply have the following:
CF = CF_1 + CF_2 = (m_1 + m_2, LS_1 + LS_2, SS_1 + SS_2).
For a two-dimensional data set with coordinates (x, y),

LS = (LS_X, LS_Y) = \left( \sum_{i=1}^{m} x_i, \; \sum_{i=1}^{m} y_i \right), \qquad
SS = (SS_X, SS_Y) = \left( \sum_{i=1}^{m} x_i^2, \; \sum_{i=1}^{m} y_i^2 \right),

and the cluster radius is

R = \sqrt{ \frac{SS_X + SS_Y}{m} - \frac{LS_X^2 + LS_Y^2}{m^2} }.

Generalizing to N dimensions with subscript 1 ≤ n ≤ N,

R = \sqrt{ \frac{1}{m} \sum_{n=1}^{N} SS_n - \frac{1}{m^2} \sum_{n=1}^{N} LS_n^2 }.
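A Python sketch of this CF bookkeeping (the class and method names are choices for this illustration, not BIRCH's actual implementation):

import numpy as np

class ClusteringFeature:
    """CF = (m, LS, SS) for one cluster, with per-dimension sums."""
    def __init__(self, points):
        P = np.asarray(points, dtype=float)
        self.m = len(P)
        self.LS = P.sum(axis=0)            # linear sum per dimension
        self.SS = (P ** 2).sum(axis=0)     # square sum per dimension

    def merge(self, other):
        # Merging two clusters only adds the summaries.
        merged = object.__new__(ClusteringFeature)
        merged.m, merged.LS, merged.SS = self.m + other.m, self.LS + other.LS, self.SS + other.SS
        return merged

    def centroid(self):
        return self.LS / self.m

    def radius(self):
        # R = sqrt( sum(SS)/m - sum(LS^2)/m^2 )
        return float(np.sqrt(self.SS.sum() / self.m - (self.LS ** 2).sum() / self.m ** 2))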
Ordering Points to Identify the Clustering Structure (OPTICS) [52] can be seen as a gener-
alization of DBSCAN to multiple ranges, effectively replacing the ε parameter with a maxi-
mum search radius. To detect meaningful data clusters of varying densities, the points of
the database are (linearly) ordered, and points that are spatially closest become neighbors
in the ordering.
Bottom-up subspace clustering methods rely on the downward closure property of density,
which means that if there are dense units in k dimensions, there are dense units in all
(k − 1)-dimensional projections. Algorithms first create a histogram for each dimension
and select those bins with densities above a given threshold. Candidate subspaces in two
dimensions can then be formed using only those dimensions that contain dense units, dra-
matically reducing the search space. The algorithm proceeds until there are no more dense
units found. Adjacent dense units are then combined to form clusters.
CLIQUE is the first subspace clustering algorithm combining density and grid-based
clustering [54], which was developed in conjunction with the Quest data mining project
at IBM [57]. An a priori-style search method was developed for association rule mining
(ARM) to find dense subspaces [54]. Once the dense subspaces are found, they are sorted by
coverage; only subspaces with the greatest coverage are kept, and the rest are pruned. The
algorithm then looks for adjacent dense grid units in each of the selected subspaces using
a depth-first search. Clusters are formed by combining these units using a greedy growth
scheme. The algorithm starts with an arbitrary dense unit and greedily grows a maximal
region in each dimension, until the union of all the regions covers the entire cluster. The
weakness of this method is that the subspaces are aligned with the original dimensions.
The Entropy-Based Clustering Method (ENCLUS) is based heavily on the CLIQUE algo-
rithm, but instead of measuring density or coverage, it measures entropy, based on the
observation that a subspace with clusters typically has lower entropy than a subspace with-
out clusters, that is, entropy decreases as cell density increases [58]. ENCLUS uses the same
a priori-style, bottom-up approach as CLIQUE to mine significant subspaces. Entropy can
be used to measure all three clusterability criteria: coverage, density, and correlation. The
search is accomplished using the downward closure property of entropy and the upward
closure property of correlation to find minimally correlated subspaces. If a subspace is
highly correlated, all of its superspaces must not be minimally correlated. Since non–
minimally correlated subspaces might be of interest, ENCLUS searches for interesting sub-
spaces by calculating interest gain and finding subspaces whose entropy exceeds a certain
threshold. Once interesting subspaces are found, clusters can be identified using the same
methodology as CLIQUE or any other existing clustering algorithms.
Merging of Adaptive Finite Intervals (and more than a CLIQUE) (MAFIA) extends
CLIQUE by using an adaptive grid based on the distribution of data to improve efficiency
and cluster quality [59]. MAFIA initially creates a histogram to determine the minimum
number of bins for a dimension. he algorithm then combines adjacent cells of similar
density to form larger cells. In this manner, the dimension is partitioned based on the data
distribution, and the resulting boundaries of the cells capture the cluster perimeter more
accurately than fixed-sized grid cells. Once the bins have been defined, MAFIA proceeds
much like CLIQUE, using an a priori-style algorithm to generate the list of clusterable
subspaces by building up from one dimension.
Cell-based clustering (CBF) addresses scalability issues associated with many bottom-up
algorithms [60]. One problem for other bottom-up algorithms is that the number of bins
created increases dramatically as the number of dimensions increases. CBF uses a cell cre-
ation algorithm that creates optimal partitions based on minimum and maximum values
on a given dimension, which results in the generation of fewer bins. CBF also addresses
scalability with respect to the number of instances in the data set, since other approaches
often perform poorly when the data set is too large to fit in main memory. CBF stores
the bins in an efficient filtering-based index structure, which results in improved retrieval
performance.
A cluster tree (CLTree) is built using a bottom-up strategy, evaluating each dimension
separately and using dimensions with high density in further steps [61]. It uses a modi-
fied decision tree algorithm to adaptively partition each dimension into bins, separating
areas of high density from areas of low density. The decision tree splits correspond to the
boundaries of bins.
Density-Based Optimal Projective Clustering (DOC) is a Monte Carlo algorithm combin-
ing the grid-based approach used by the bottom-up approaches and the iterative improve-
ment method from the top-down approaches [62].
Projected clustering (PROCLUS) is a top-down subspace clustering algorithm [63].
Similar to CLARANS, PROCLUS samples the data and then selects a set of k medoids
and iteratively improves the clustering. The three phases of PROCLUS are as follows.
(1) Initialization phase: Select a set of potential medoids that are far apart using a greedy
algorithm. (2) Iteration phase: Select a random set of k medoids from this reduced data set
to determine if clustering quality improves by replacing current medoids with randomly
chosen new medoids. Cluster quality is based on the average distance between instances
and the nearest medoid. For each medoid, a set of dimensions is chosen whose average
distances are small compared to statistical expectation. Once the subspaces have been
selected for each medoid, average Manhattan segmental distance is used to assign points
to medoids, forming clusters. (3) Refinement phase: Compute a new list of relevant dimen-
sions for each medoid based on the clusters formed and reassign points to medoids, remov-
ing outliers. The distance-based approach of PROCLUS is biased toward clusters that are
hyperspherical in shape. PROCLUS is actually somewhat faster than CLIQUE due to the
sampling of large data sets; however, using a small number of representative points can
cause PROCLUS to miss some clusters entirely.
An oriented projected cluster (ORCLUS) looks for non–axis-parallel subspaces [64].
Similarly to References 3 and 14, this method is based on the observation that many data sets
contain interattribute correlations. The algorithm can be divided into three steps. (1) Assign
phase: Assign data points to the nearest cluster centers. (2) Subspace determination: Redefine
the subspace associated with each cluster by calculating the covariance matrix for a cluster
and selecting the orthonormal eigenvectors with the smallest eigenvalues. (3) Merge phase:
Merge clusters that are near each other and have similar directions of least spread.
The Fast and Intelligent Subspace Clustering Algorithm Using Dimension Voting
(FINDIT) uses dimension-oriented distance (DOD) to count the number of dimensions
in which two points are within a threshold distance of each other, based on the assump-
tion that in higher dimensions, it is more meaningful for two points to be close in several
dimensions rather than in a few [65]. The algorithm has three phases. (1) Sampling phase:
Select two small sets generated through random sampling to determine initial representa-
tive medoids of the clusters. (2) Cluster forming phase: Find correlated dimensions using
the DOD measure for each medoid. (3) Assignment phase: Assign instances to medoids
based on the subspaces found. FINDIT employs sampling techniques like the other top-
down algorithms to improve performance with very large data sets.
The δ-clusters algorithm uses a distance measure to capture the coherence exhibited by
a subset of instances on a subset of attributes [66]. Coherent instances may not be close, but
instead both follow a similar trend, offset from each other. One coherent instance can be
derived from another by shifting it by an offset. The algorithm starts with initial seeds and
iteratively improves the overall quality of the clustering by randomly swapping attributes
and data points to improve individual clusters. Residue measures the decrease in coherence
that a particular entry (attribute or instance) brings to the cluster. The iterative process
terminates when the improvement of individual clusters levels off.
Clustering on Subsets of Attributes (COSA) assigns weights to each dimension for each
instance, not each cluster [67]. Starting with equally weighted dimensions, the algorithm
examines the k-NNs of each instance. These neighborhoods are used to calculate the respec-
tive dimension weights for each instance, with higher weights assigned to those dimen-
sions that have a smaller dispersion within the k-NN group. These weights are used to
calculate dimension weights for pairs of instances, which are used to update the distances
used in the k-NN calculation. The process is repeated using the new distances until the
weights stabilize. The neighborhoods for each instance become increasingly enriched with
instances belonging to its own cluster. The dimension weights are refined as the dimen-
sions relevant to a cluster receive larger weights. The output is a distance matrix based
on weighted inverse exponential distance and is suitable as input to any distance-based
clustering method. After clustering, the weights for each dimension of cluster members
are compared, and an overall importance value for each dimension is calculated for each
cluster.
There are many others, which are reviewed in Reference 61. The multilevel Mahalanobis-
based dimensionality reduction (MMDR) [68] clusters high-dimensional data sets using a
low-dimensional subspace based on the Mahalanobis rather than the Euclidean distance,
since it is argued that locally correlated clusters are elliptically shaped. The Mahalanobis
distance uses different coefficients according to the dimension [8].
A detailed description of the steps of building an index for k-NN queries is as follows.
1. CSVD supports three alternative objective functions: (1) the index space compres-
sion, (2) the NMSE (defined in Equation 3.5), and (3) the desired recall.
The compression of the index space is defined as the ratio of the original size of
the database (equal to N · M) to the size of its CSVD description, which is specified as
follows:

V = N \cdot H + \sum_{h=1}^{H} \left( N \cdot p_h + m_h \cdot p_h \right), \qquad (3.8)

where H is the number of clusters, m_h is the number of points in cluster h, and p_h is the
number of dimensions retained for cluster h. The NMSE across the H clusters is

NMSE = \frac{ \sum_{h=1}^{H} \sum_{i=1}^{m_h} \sum_{j=n_h+1}^{N} \left( y_{i,j}^{(h)} \right)^2 }
            { \sum_{h=1}^{H} \sum_{i=1}^{m_h} \sum_{j=1}^{N} \left( y_{i,j}^{(h)} \right)^2 }
     = \frac{ \sum_{h=1}^{H} m_h \sum_{j=n_h+1}^{N} \lambda_j^{(h)} }
            { \sum_{h=1}^{H} m_h \sum_{j=1}^{N} \lambda_j^{(h)} }. \qquad (3.9)
The Euclidean distance between two points u and v can be computed as

D(u, v) = \left[ (u - v)^T (u - v) \right]^{1/2}
        = \left[ \|u\|^2 + \|v\|^2 - 2 u^T v \right]^{1/2}
        = \left[ \sum_{i=1}^{n} (u_i - v_i)^2 \right]^{1/2}. \qquad (3.10)
Given precomputed vector norms, we need to compute only the inner product of the two
vectors.
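A NumPy sketch of this optimization (function and variable names are illustrative): the squared data-point norms are computed once and cached, so each query costs only one matrix-vector product.

import numpy as np

def knn(X, x_sq_norms, q, k):
    # Squared Euclidean distances via Equation 3.10: ||x||^2 + ||q||^2 - 2 x.q
    d2 = x_sq_norms + (q ** 2).sum() - 2.0 * (X @ q)
    return np.argsort(d2)[:k]              # indices of the k nearest points

# x_sq_norms = (X ** 2).sum(axis=1) is precomputed once for the whole data set.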
Similarity search based on k-NN queries, which determine the k images that are clos-
est to a target image, is expensive when the number of dimensions per image (N) is in the
hundreds and the number of images (M) is in the millions. The k-NNs can be determined
by a sequential scan of an M × N matrix X. Retrieved points are inserted into a heap, as long
as they are no further than the k nearest points extracted so far. Determining k-NN queries
in high dimensions is not a trivial task, due to “the curse of dimensionality” [71].
As dimensionality increases, all points begin to appear almost equidistant from one
another. hey are efectively arranged in a d-dimensional sphere around the query,
no matter where the query (point) is located. he radius of the sphere increases,
while the width of the sphere remains unchanged, and the space close to the query
point remains unchanged [71].
1. A 2^d partition of a unit hypercube has a limited number of regions for small d, but with
d = 100 there are 2^100 ≈ 10^30 regions, so that even with billions of points almost all of
the regions are empty.
2. A 0.95 × 0.95 square enclosed in a 1 × 1 square occupies 90.25% of its area, while for
100 dimensions, 0.95^100 ≈ 0.00592.
3. A circle with radius 0.5 enclosed in a unit square occupies 3.14159/4 = 78.54% of its area.
For d = 40, the volume of the hypersphere is 3.278 × 10^−21. All the space is in the corners,
and approximately 3 × 10^20 points are required to expect one point inside the sphere.
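These figures can be checked with a few lines of Python; the hypersphere volume uses the standard formula π^{d/2} r^d / Γ(d/2 + 1):

import math

print(2 ** 100)                                  # ~1.27e30 regions for d = 100
print(0.95 ** 100)                               # ~0.00592
d, r = 40, 0.5
vol = math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)
print(vol, 1 / vol)                              # ~3.28e-21 and ~3.05e20 points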
The clusters are sorted in increasing order of distance, with the primary cluster in
first position.
3. Searching the primary cluster: This step produces a list of k results, in increasing order
of distance; let dmax be the distance of the farthest point in the list.
4. Searching the other clusters: If the distance to the next cluster does not exceed dmax,
then the cluster is searched, otherwise, the search is terminated. While searching a
cluster, if points closer to the query than dmax are found, they are added to the list of
k current best results, and dmax is updated.
False alarms are discarded by referring to the original data set. Noting the relationship
between range and k-NN queries, the latter can be processed as follows [72]: (1) Find the
k-NNs of the query point Q in the subspace. (2) Determine the farthest actual distance to
Q among these k points (dmax). (3) Issue a range query centered on Q on the subspace with
radius ε = dmax. (4) For all points obtained in this manner, find their original distances to Q,
by referring to the original data set and ranking the points, that is, select the k-NNs.
The exact k-NN processing method was extended to multiple clusters in Reference 30,
where we compare the CPU cost of the two methods as the NMSE is varied using two data
sets. An offline experiment was used to determine k*, which yields a recall R ≈ 1. The CPU
time required by the exact method for a sequential scan is lower than the approximate
method, even when NMSE = 0.8. This is attributable to the fact that the exact method issues
a k-NN query only once, and this is followed by less costly range queries. Optimal data
reduction is studied in the context of the exact query processing [73].
RCSVD was explored in Reference 14 in the context of sequential data set scans. The
OP-tree index [2], a main-memory index suited for k-NN queries,
is utilized in Reference 3. Two methods to make this index persistent are described in
References 74 and 75. We also proposed a new index, the stepwise dimensionality increasing
(SDI) index, which is shown to outperform other indices [74].
1. Initial selection of centroids. H points with pairwise distances of at least threshold are
selected.
2. Clustering. Each point is associated to the closest centroid, whose distance is less than
ε or ignored otherwise. The coordinates of the centroid are the average of the associ-
ated points.
3. Apply SVD to each cluster. Each cluster now has a reference frame.
4. Assign points to clusters. Each point in the data set is analyzed in the reference frame
of each cluster. For each cluster, determine the minimum number of retained dimen-
sions so that the squared error in representing points does not exceed MaxReconDist².
The cluster requiring the minimum number of dimensions Nmin is determined. If
Nmin ≤ MaxDim, the point is associated with that cluster, and the required number of
dimensions is recorded; otherwise, the point becomes an outlier.
5. Compute the subspace dimensionality of each cluster. Find the minimum number of
dimensions ensuring that at most FracOutliers of the points assigned to the clus-
ter in step 4 do not satisfy the MaxReconDist² constraint. Each cluster now has a
subspace.
6. Recluster points. Each point is associated with the first cluster whose subspace is at a
distance less than MaxReconDist from the point.
7. Remove small clusters. Clusters with size smaller than MinSize are removed and the
corresponding points reclustered in step 6.
8. Recursively apply steps 1–7 to the outlier set. Make sure that the centroids selected in
step 1 have a distance of at least threshold from the current valid clusters. The procedure
terminates when no new cluster is identified.
9. Create an outlier list. As in Reference 18, this is searched using a linear scan. Varying
the NMSE may yield different compression ratios. The smaller the NMSE, the higher
the number of points that cannot be approximated accurately.
The values for H, threshold, ε, MaxReconDist, MaxDim, FracOutliers, and MinSize
need to be provided by the user. This constitutes the main difficulty with applying the
LDR method. The invariant parameters are MaxReconDist, MaxDim, and MinSize. The
method can produce values of H, threshold, and overall FracOutliers that are signifi-
cantly different from the inputs. The fact that the LDR method determines the number of
clusters is one of its advantages with respect to the k-means method.
The LDR method also takes advantage of LBP [8,72] to attain exact query processing.
It produces fewer false alarms by incorporating the reconstruction distance (the squared
distance of all dimensions that have been eliminated) as an additional dimension with
each point. It is shown experimentally in Reference 3 that RCSVD outperforms LDR in
over 90% of cases.
It is known that the nearest neighbor search algorithms based on the Hjaltason and
Samet (HS) method [70,83] outperform the Roussopoulos, Kelley, and Vincent (RKV)
method [84]. This was also ascertained experimentally as part of carrying out experimental
studies on indexing structures in Reference 85. The RKV algorithm provided with SR-tree
code was replaced with the HS algorithm in this study. In computing distances, we used
the method specified in Reference 73, which is more efficient than the method in Reference
18. This method is specified in Section 3.8.
With the increasing dimensionality of feature vectors, most multidimensional indices
lose their effectiveness. The so-called dimensionality curse [8] is due to an increased over-
lap among the nodes of the index and a low fan-out, which results in increased index
height. For a typical feature vector-based hierarchical index structure, each node corre-
sponds to an 8-kilobyte (KB) page. Given a fixed page size S (minus space dedicated to
bookkeeping), number of dimensions N, and s = sizeof(data type), the fan-out for different
index structures is as follows:
Hybrid trees, proposed in Reference 86, in conjunction with the LDR method discussed
in Section 3.6, combine the positive aspects of DP and SP indexing structures into one.
Experiments on “real” high-dimensional large-size feature databases demonstrate that the
hybrid tree scales well to high dimensionality and large database sizes. It significantly out-
performs both purely DP-based and SP-based index mechanisms as well as linear scan at
all dimensionalities for large-sized databases.
X-trees combine linear and hierarchical structures using supernodes to avoid overlaps to
attain improved performance for k-NN search for higher dimensions [87].
Pyramid trees [88] are not affected by the curse of dimensionality. The iMinMax(θ) [89]
method, which maps points from high- to single-dimensional space, outperforms pyramid
trees in experiments for range queries.
iDistance is an efficient method for k-NN search in a high-dimensional space [90,91].
Data are partitioned into several clusters, and each partition has a reference point, for
example, the centroids of each cluster obtained using the k-means clustering method [40].
The data in each cluster are transformed into a single-dimensional space according to the
similarity with respect to a reference point. The one-dimensional values of different clus-
ters are disjoint. The one-dimensional space is indexed by a B+-tree, and k-NN search is
implemented using range searches. The search starts with a small radius, which is increased
step by step to form a bigger query sphere. iDistance is lossy, since multiple data points in
the high-dimensional space may be mapped to the same value in the single-dimensional
space. The SDI method is compared with the iDistance method in Reference 74.
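A sketch of the iDistance key mapping (the separation constant c is an assumption of this sketch and must exceed any distance to a reference point; the cluster labels and centroids could come, for example, from the kmeans sketch above):

import numpy as np

def idistance_keys(X, labels, centroids, c=1000.0):
    # One-dimensional key: cluster id * c + distance to that cluster's reference point.
    # Keys of different clusters fall into disjoint ranges and can be indexed by a B+-tree.
    d = np.linalg.norm(X - centroids[labels], axis=1)
    return labels * c + d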
A vector approximation file (VA-file) represents each data object by the cell into
which it falls [9]. Cells are defined by a multidimensional grid, where dimension i is par-
titioned 2^{b_i} ways. Due to the sparsity of high-dimensional space, it is very unlikely that
several points share a cell. Nearest neighbor search sequentially scans the VA-file to
determine the upper-bound and lower-bound distance from the query to each cell. This
is followed by a refinement or filtering step; if the approximate lower bound is greater
than the current upper bound, the cell is not considered further. Otherwise, it is a candi-
date. During the refine step, all the candidates are sorted according to their lower-bound
distances.
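A simplified sketch of the cell approximation, assuming a uniform grid and the same number of bits per dimension (real VA-files may use nonuniform partition points; the names here are illustrative):

import numpy as np

def va_cells(X, bits=4):
    # Quantize each dimension into 2^bits grid cells; the VA-file stores only
    # the cell numbers, which bound the true coordinates from below and above.
    lo, hi = X.min(axis=0), X.max(axis=0)
    cells = np.floor((X - lo) / (hi - lo + 1e-12) * (2 ** bits)).astype(int)
    return np.clip(cells, 0, 2 ** bits - 1)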
A metric tree transforms the feature vector space into a metric space and then indexes
the metric space [92]. A metric space is a pair M = (F, d), where F is a domain of feature
values and d is a distance function satisfying the usual metric properties: d(x, y) = d(y, x);
d(x, y) ≥ 0, with equality only for x = y; and d(x, z) ≤ d(x, y) + d(y, z).
The search space is organized based on relative distances of objects, rather than their abso-
lute positions in a multidimensional space.
A vantage point (VP)-tree [93] partitions a data set according to distances between the
objects and a reference or vantage point. The corner point is chosen as the vantage point,
and the median value of the distances is chosen as the separating radius to partition the data
set into two balanced subsets. The same procedure is applied recursively on each subset.
The multiple vantage point (mvp)-tree [94] uses precomputed distances in the leaf nodes to
provide further filtering during search operations. Both trees are built in a top-down man-
ner; balance cannot be guaranteed during insertion and deletion. Costly reorganization is
required to prevent performance degradation.
An M-tree is a paged metric-tree index [95]. It is balanced and able to deal with dynamic
data. Leaf nodes of an M-tree store the feature vectors of the indexed objects Oj and distances
to their parents, whereas internal nodes store routing objects Or, distances to their parents
Op, covering radii r(Or), and corresponding covering tree pointers. The M-tree reduces the
number of distance computations by storing distances.
The OMNI-family is a set of indexing methods based on the same underlying theory:
all the points S_i located between the lower radius ℓ and the upper radius u are candidates for a spherical query
with radius r and query point Q for a specific focus F_i, where ℓ = d(Q, F_i) − r, u = d(Q, F_i) + r
[96]. For multiple foci, the candidates are the intersections of the S_i. Given a data set, a set of
foci is to be found. For each point in the data set, the distance to all of the foci is calculated
and stored. The search process can be applied to sequential scan, B+-trees, and R-trees. For
B+-trees, the distances for each focus F_i are indexed, a range query is performed on each
index, and finally, the intersection is obtained. For R-trees, the distances for all the foci,
which form lower-dimensional data, are indexed, and a single range query is performed.
A Δ-tree is a main memory-resident index structure, which represents each level with
a different number of dimensions [97]. Each level of the index represents the data space
starting with a few dimensions and expanding to full dimensions, while keeping the fan-
out fixed. The number of dimensions increases toward the leaf level, which contains the full
dimensions of the data. This is intended to minimize the cache miss ratio as the dimen-
sionality of feature vectors increases. At level ℓ ≥ 1, the number of dimensions is selected
as the smallest n_ℓ satisfying

\frac{\sum_{k=1}^{n_\ell} \lambda_k}{\sum_{k=1}^{N} \lambda_k} \ge \min(\ell \, p, 1).
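A sketch of this level-wise dimension selection in Python (assuming the level-scaled threshold min(ℓp, 1) as written above; p is a user-chosen fraction and the function name is illustrative):

import numpy as np

def dims_at_level(eigvals, level, p=0.25):
    # Smallest n such that the leading eigenvalues carry a fraction min(level*p, 1) of the variance.
    frac = np.cumsum(eigvals) / np.sum(eigvals)
    target = min(level * p, 1.0)
    return int(np.searchsorted(frac, target - 1e-12) + 1)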
The nodes of the index increase in size from the highest level to the lowest level, and the
tree may not be height balanced, but this is not a problem since the index is main memory
resident. The index with shorter feature vector lengths attains a lower cache miss ratio,
reduced cycles per instruction (CPI), and reduced CPU time [98].
The telescopic vector (TV)-tree is a disk-resident index with nodes corresponding to disk
pages [99]. TV-trees partition the data space using telescopic minimum bounding regions
(TMBRs), which have telescopic vectors as their centers. These vectors can be contracted
and extended dynamically by telescopic functions defined in Reference 99, only if they
have the same number of active dimensions (α). Features are ordered using the KLT applied
to the whole data set, so that the first few dimensions provide the most discrimination.
The discriminatory power of the index is heavily affected by the value of the parameter α,
which is difficult to determine. In case the number of levels is large, the tree will still suffer
from the curse of dimensionality. The top levels of TV-trees have higher fan-outs, which
results in reducing the input/output (I/O) cost for disk accesses. Experimental results on a
data set consisting of dictionary words are reported in Reference 99.
The OP-tree index described in Reference 2 for efficient processing of k-NN queries
recursively equipartitions data points one dimension at a time in a round-robin manner
until the data points can fit into a leaf node. The two properties of OP-trees are ordering
and partitioning. Ordering partitions the search space, and partitioning rejects unwanted
space without actual distance computation. A fast k-NN search algorithm that reduces
distance computations based on this structure is described. Consider the following sample data set:
1: (1, 2, 5), 2: (3, 8, 7), 3: (9, 10, 8), 4: (12, 9, 2), 5: (8, 7, 20), 6: (6, 6, 23),
7: (0, 3, 27), 8: (2, 13, 9), 9: (11, 11, 15), 10: (14, 17, 13), 11: (7, 14, 12), 12: (10, 12, 3).
As far as the first partition is concerned, we partition the points into three regions,
R1(−∞, 3), R2(3, 6), R3(6, +∞).
manner. The data set is recursively split top down using the dimension with the maximum
variance and choosing a pivot, which is approximately the median.
3.8 CONCLUSIONS
Similarity search via k-NN queries applied to feature vectors associated with images is a well-
known paradigm in CBIR. Applying k-NN queries to a large data set, where the number
of images is in the millions and the number of features is in the hundreds, is quite costly,
so we present a combination of dimensionality reduction via SVD and related methods,
clustering, and indexing to achieve a higher efficiency in k-NN queries.
SVD eliminates dimensions with little variance, which has little effect on the outcome
of k-NN queries, although this results in an approximate search, so that precision and
recall are required to quantify this effect. Dimensionality reduction allows more efficiency
in applying clustering and building index structures. Both clustering and indexing reduce
the cost of k-NN processing by reducing the number of points to be considered. Applying
clustering before indexing allows more efficient dimensionality reduction, but in fact, we
have added an extra indexing level.
The memory-resident OP-tree index, which was used in our earlier studies, was made
disk resident. Unlike the pages of R-tree-like indices, which can be loaded one page at a
time, the OP-tree index can be accessed from disk with one sequential access. The SDI
index was also shown to outperform well-known indexing methods.
There are several interesting problems that are not covered in this discussion. The issue
of adding points to the data set without the need for repeating the application of SVD is
addressed in References 75 and 102. PCA can be applied to very large M × N data sets by
computing the N × N covariance matrix after a single data pass. The eigenvectors com-
puted by applying PCA to an N × N matrix can be used to compute the rotated matrix with
a reduced number of dimensions. There is also the issue of applying SVD to very large data
sets, which may be sparse [103,104].
Figures and experimental results are not reported for the sake of brevity. The reader is
referred to the referenced papers for figures, graphs, and tables.
ACKNOWLEDGMENTS
The work reported here is based mainly on the author's collaboration with colleagues
Dr. Vittorio Castelli and Dr. Chung-Sheng Li at the IBM T. J. Watson Research Center in
Yorktown Heights, New York, and PhD students Dr. Lijuan (Catherine) Zhang and Dr. Yue
Li at the New Jersey Institute of Technology (NJIT), Newark, New Jersey.
REFERENCES
1. W. Niblack, R. Barber, W. Equitz, M. Flickner, E. H. Glasman, D. Petkovic, P. Yanker, C.
Faloutsos, and G. Taubin. "The QBIC project: Querying images by content, using color, texture,
and shape.” In Proc. SPIE Vol. 1908: Storage and Retrieval for Image and Video Databases, San
Jose, CA, January 1993, pp. 173–187.
2. B. Kim, and S. Park. "A fast k-nearest-neighbor finding algorithm based on the ordered partition."
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 8(6): 761–766 (1986).
3. V. Castelli, A. Thomasian, and C. S. Li. "CSVD: Clustering and singular value decomposition
for approximate similarity search in high dimensional spaces.” IEEE Transactions on Knowledge
and Data Engineering (TKDE) 14(3): 671–685 (2003).
4. C.-S. Li, and V. Castelli. “Deriving texture feature set for content-based retrieval of satellite
image database.” In Proc. Int’l. Conf. on Image Processing (ICIP ’97), Santa Barbara, CA, October
1997, pp. 576–579.
5. D. A. White, and R. Jain. “Similarity indexing with the SS-tree.” In Proc. 12th IEEE Int’l. Conf.
on Data Engineering (ICDE), New Orleans, LA, March 1996, pp. 516–523.
6. V. Castelli, and L. D. Bergman (editors). Image Databases: Search and Retrieval of Digital
Imagery. John Wiley and Sons, New York, 2002.
7. B. S. Manjunath, and W.-Y. Ma. “Texture features for image retrieval.” In Image Databases:
Search and Retrieval of Digital Imagery, V. Castelli, and L. D. Bergman (editors). Wiley-
Interscience, 2002, pp. 313–344.
8. C. Faloutsos. Searching Multimedia Databases by Content (Advances in Database Systems).
Kluwer Academic Publishers (KAP)/Elsevier, Burlingame, MA, 1996.
9. R. Weber, H.-J. Schek, and S. Blott. “A quantitative analysis and performance study for
similarity-search methods in high-dimensional spaces.” In Proc. 24th Int’l. Conf. on Very Large
Data Bases (PVLDB), New York, August 1998, pp. 194–205.
10. R. F. Freitas, and W. W. Wilcke. "Storage-class memory: The next storage system technology."
IBM Journal of Research and Development 52(4–5): 439–448 (2008).
11. B. Jacob, S. W. Ng, and D. T. Wang. Memory Systems: Cache, DRAM, Disk. Morgan Kaufman
Publishers (MKP)/Elsevier, Burlingame, MA. 2008.
12. I. T. Jolliffe. Principal Component Analysis. Springer, New York, 2002.
13. W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes, 3rd Edition:
The Art of Scientific Computing. Cambridge University Press (CUP), Cambridge, UK, 2007.
14. A. Thomasian, V. Castelli, and C. S. Li. "RCSVD: Recursive clustering and singular value decompo-
sition for approximate high-dimensionality indexing.” In Proc. 7th ACM Int’l. Conf. on Information
and Knowledge Management (CIKM), Baltimore, MD, November 1998, pp. 201–207.
15. K. Chakrabarti, and S. Mehrotra. “Local dimensionality reduction: A new approach to index-
ing high dimensional space.” In Proc. Int’l. Conf. on Very Large Data Bases (PVLDB), Cairo,
Egypt, August 2000, pp. 89–100.
16. V. Gaede, and O. Günther. “Multidimensional access methods.” ACM Computing Surveys 30(2):
170–231 (1998).
17. V. Castelli. “Multidimensional indexing structures for content-based retrieval.” In Image
Databases: Search and Retrieval of Digital Imagery, V. Castelli, and L. D. Bergman (editors).
Wiley-Interscience, Hoboken, NJ, pp. 373–434.
18. F. Korn, H. V. Jagadish, and C. Faloutsos. "Efficiently supporting ad hoc queries in large datasets
of time sequences.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Tucson, AZ,
May 1997, pp. 289–300.
19. R. Ramakrishnan, and J. Gehrke. Database Management Systems, 3rd edition. McGraw-Hill,
New York, 2003.
20. J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques, 3rd edition. Morgan
Kaufmann Publishers (MKP)/Elsevier, Burlingame, MA, 2011.
21. W. Barbara, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. Ioannidis, H. V.
Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. "The New Jersey data
reduction report.” Data Engineering Bulletin 20(4): 3–42 (1997).
22. S. Mallat. A Wavelet Tour of Signal Processing: The Sparse Way. MKP, 2008.
23. K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. “Approximate query processing
using wavelets.” In Proc. Int’l. Conf. on Very Large Data Bases (PVLDB), Cairo, Egypt, August
2000, pp. 111–122.
24. Y. Bishop, S. Fienberg, and P. Holland. Discrete Multivariate Analysis: Theory and Practice. MIT
Press, Cambridge, MA, 1975.
25. V. Poosala. "Histogram-based estimation techniques in databases." PhD Thesis, Univ. of
Wisconsin-Madison, Madison, WI, 1997.
26. F. Olken. “Random sampling from databases.” PhD Dissertation, University of California,
Berkeley, CA, 1993.
27. J. M. Hellerstein, P. J. Haas, and H. J. Wang. “Online aggregation.” In Proc. ACM SIGMOD Int’l.
Conf. on Management of Data, Tucson, AZ, May 1997, pp. 171–182.
28. A. Rajaraman, J. Leskovec, and J. Ullman. Mining of Massive Datasets, 1.3 edition. Cambridge
University Press (CUP), Cambridge, UK, 2013. Available at http://infolab.stanford.edu/~ullman
/mmds.html.
29. IBM Corp. Engineering and Scientific Subroutine Library (ESSL) for AIX V5.1, Guide and
Reference. IBM Redbooks, Armonk, NY. SA23-2268-02, 07/2012.
30. A. Thomasian, Y. Li, and L. Zhang. "Exact k-NN queries on clustered SVD datasets." Information
Processing Letters (IPL) 94(6): 247–252 (2005).
31. J. A. Hartigan. Clustering Algorithms. John Wiley and Sons, New York, 1975.
32. A. K. Jain, and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Upper Saddle River,
NJ, 1988.
33. L. Kaufman, and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis.
John Wiley and Sons, New York, 1990.
34. B. S. Everitt, S. Landau, M. Leese, and D. Stahl. Cluster Analysis, 5th edition. John Wiley and
Sons, New York, 2011.
35. M. J. Zaki, and W. Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms.
Cambridge University Press (CUP), Cambridge, UK, 2013.
36. H. P. Kriegel, P. Kröger, and A. Zimek. "Clustering high-dimensional data: A survey on sub-
space clustering, pattern-based clustering, and correlation clustering.” ACM Transactions on
Knowledge Discovery from Data (TKDD) 3(1): 158 (2009).
37. C. Aggarwal, A. Hinneburg, and D. A. Keim. “On the surprising behavior of distance metrics
in high dimensional spaces." In Proc. Int'l. Conf. on Database Theory (ICDT), London, January
2001, pp. 420–434.
38. C. C. Aggarwal. “Re-designing distance functions and distance-based applications for high
dimensional data.” ACM SIGMOD Record 30(1): 13–18 (2001).
39. J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press,
New York, 1981. Available at http://home.deib.polimi.it/matteucc/Clustering/tutorial_html
/cmeans.html.
40. F. Farnstrom, J. Lewis, and C. Elkan. “Scalability for clustering algorithms revisited.” ACM
SIGKDD Explorations Newsletter 2(1): 51–57 (2000).
41. T. F. Gonzalez. "Clustering to minimize the maximum intercluster distance." Theoretical
Computer Science 38: 293–306 (1985).
42. A. McCallum, K. Nigam, and L. H. Unger. "Efficient clustering of high-dimensional data sets
with applications to reference matching.” In Proc. 6th ACM Int’l. Conf. on Knowledge Discovery
and Data Mining (KDD), Boston, August 2000, pp. 169–178.
43. R. T. Ng, and J. Han. “CLARANS: A method for clustering objects for spatial data mining.”
IEEE Transactions on Knowledge and Data Engineering (TKDE) 14(5): 1003–1016 (2002).
44. T. Zhang, R. Ramakrishnan, and M. Livny. "BIRCH: An efficient data clustering method
for very large databases.” In Proc. 25th ACM SIGMOD Int’l. Conf. on Management of Data,
Montreal, Quebec, Canada, June 1996, pp. 103–114.
45. P. S. Bradley, U. M. Fayyad, and C. Reina. “Scaling clustering algorithms to large databases.” In
Proc. Int’l. Conf. on Knowledge Discovery and Data Mining, New York, August 1998, p. 915.
46. G. Karypis, E.-H. Han, and V. Kumar. “Chameleon: Hierarchical clustering using dynamic
modeling.” IEEE Computer 32(8): 68–75 (1999).
47. B. W. Kernighan, and S. Lin. "An efficient heuristic procedure for partitioning graphs." Bell
Systems Technical Journal (BSTJ) 49: 291–308 (1970).
48. S. Guha, R. Rastogi, and K. Shim. "CURE: An efficient clustering algorithm for large databases."
In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Seattle, WA, June 1998, pp. 73–84.
49. J. L. Bentley. “Multidimensional binary search trees used for associative searching.”
Communications of the ACM 18(9): 509–517 (1975).
50. V. Ganti, R. Ramakrishnan, J. Gehrke, A. L. Powell, and J. C. French. “Clustering large datasets
in arbitrary metric spaces.” In Proc. Int’l. Conf. on Data Engineering (ICDE), Sidney, Australia,
March 1999, pp. 502–511.
51. M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. “A density-based algorithm for discovering clusters
in large spatial databases with noise.” In Proc. 2nd Int’l. Conf. Knowledge Discovery and Data
Mining (KDD-96), Portland, OR, 1996, pp. 226–231.
52. M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. “OPTICS: Ordering points to iden-
tify the clustering structure.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data,
Philadelphia, PA, June 1999, pp. 49–60.
53. W. Wang, J. Yang, and R. Muntz. “STING: A statistical information grid approach to spatial
data mining.” In Proc. 23rd Int’l. Conf. Very Large Data Bases (VLDB), Athens, Greece, August
1997, pp. 186–195.
54. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. “Automatic subspace clustering of
high dimensional data for data mining applications.” In Proc. ACM SIGMOD Int’l. Conf. on
Management of Data, Seattle, WA, June 1998, pp. 94–105.
55. G. Sheikholeslami, S. Chatterjee, and A. Zhang. “WaveCluster: A multi-resolution cluster-
ing approach for very large spatial databases.” In Proc. 24th Int’l. Conf. Very Large Data Bases
(VLDB), New York, August 1998, pp. 428–439.
56. A. Hinneburg, and D. A. Keim. “Optimal grid-clustering: Towards breaking the curse of
dimensionality in high-dimensional clustering.” In Proc. 25th Int’l. Conf. on Very Large Data
Bases (VLDB), Edinburgh, Scotland, UK, September 1999, pp. 506–517.
57. R. Agrawal, M. Mehta, J. Shafer, R. Srikant, A. Arning, and T. Bollinger. "The Quest data mining
system.” In Proc. 2nd Int’l. Conference on Knowledge Discovery in Databases and Data Mining
(KDD), Portland, OR, 1996, pp. 244–249.
58. C.-H. Cheng, A. W. Fu, and Y. Zhang. “Entropy-based subspace clustering for mining numeri-
cal data.” In Proc. 5th ACM SIGKDD Int’l. Conf. on Knowledge Discovery and Data Mining
(KDD), San Diego, CA, August 1999, pp. 84–93.
59. H. S. Nagesh, S. Goil, and A. Choudhary. “A scalable parallel subspace clustering algorithm for
massive data sets.” In Proc. IEEE Int’l. Conf. on Parallel Processing (ICPP), Toronto, Ontario,
August 2000, pp. 477–484.
60. J.-W. Chang, and D.-S. Jin. “A new cell-based clustering method for large, high-dimensional
data in data mining applications.” In Proc. ACM Symposium on Applied Computing (SAC),
Madrid, Spain, March 2002, pp. 503–507.
61. L. Parsons, E. Haque, and H. Liu. “Subspace clustering for high dimensional data: A review.”
ACM SIGKDD Explorations Newsletter 6(1): 90–105 (2004).
62. C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. M. Murali. “A Monte Carlo algorithm for fast
projective clustering.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Madison,
WI, June 2002, pp. 418–427.
63. C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park. "Fast algorithms for projected
clustering.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Philadelphia, PA, June
1999, pp. 61–72.
64. C. C. Aggarwal, and P. S. Yu. “Finding generalized projected clusters in high dimensional spaces.”
In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Dallas, May 2000, pp. 70–81.
65. K. G. Woo, J. H. Lee, M. H. Kim, and Y. J. Lee. “FINDIT: A fast and intelligent subspace clus-
tering algorithm using dimension voting." Information and Software Technology 46(4): 255–271
(2004).
66. J. Yang, W. Wang, H. Wang, and P. S. Yu. “δ-clusters: Capturing subspace correlation in a large
data set.” In Proc. 18th Int’l. Conf. on Data Engineering (ICDE), San Jose, CA, February–March
2002, pp. 517–528.
67. J. H. Friedman, and J. J. Meulman. “Clustering objects on subsets of attributes.” 2002. Available
at http://statweb.stanford.edu/~jhf/.
68. H. Jin, B. C. Ooi, H. T. Shen, C. Yu, and A. Y. Zhou. "An adaptive and efficient dimensionality
reduction algorithm for high-dimensional indexing.” In Proc. 19th IEEE Int’l. Conf. on Data
Engineering (ICDE), Bangalore, India, March 2003, pp. 87–98.
69. Y. Linde, A. Buzo, and R. Gray. “An algorithm for vector quantizer design.” IEEE Transactions
on Communications 28(1): 84–95 (1980).
70. H. Samet. Foundations of Multidimensional and Metric Data Structures. Elsevier, 2007.
71. V. S. Cherkassky, J. H. Friedman, and H. Wechsler. From Statistics to Neural Networks: Theory and Pattern Recognition Applications. Springer-Verlag, New York, 1994.
72. F. Korn, N. Sidiropoulos, C. Faloutsos, E. Siegel, and Z. Protopapas. "Fast and effective retrieval of medical tumor shapes: Nearest neighbor search in medical image databases." IEEE Transactions on Knowledge and Data Engineering (TKDE) 10(6): 889–904 (1998).
73. A. Thomasian, Y. Li, and L. Zhang. "Optimal subspace dimensionality for k-nearest-neighbor queries on clustered and dimensionality reduced datasets with SVD." Multimedia Tools and Applications (MTAP) 40(2): 241–259 (2008).
74. A. Thomasian, and L. Zhang. "Persistent semi-dynamic ordered partition index." The Computer Journal 49(6): 670–684 (2006).
75. A. Thomasian, and L. Zhang. "Persistent clustered main memory index for accelerating k-NN queries on high dimensional datasets." Multimedia Tools Applications (MTAP) 38(2): 253–270
(2008).
76. C. Yu. High-Dimensional Indexing: Transformational Approaches to High-Dimensional Range
and Similarity Searches. Lecture Notes in Computer Science (LNCS), Volume 2431. Springer-
Verlag, New York, 2002.
77. R. A. Finkel, and J. L. Bentley. “Quad trees: A data structure for retrieval of composite keys.”
Acta Informatica 4(1): 1–9 (1974).
78. J. T. Robinson. "The k-d-b tree: A search structure for large multidimensional dynamic
indexes.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Ann Arbor, MI, April
1981, pp. 10–18.
79. A. Guttman. “R-trees: A dynamic index structure for spatial searching.” In Proc. ACM SIGMOD
Int’l. Conf. on Management of Data, Boston, June 1984, pp. 47–57.
80. T. Sellis, N. Roussopoulos, and C. Faloutsos. "The R+-tree: A dynamic index for multi-
dimensional objects.” In Proc. 13th Int’l. Conf. on Very Large Data Bases (VLDB), Brighton,
England, September 1987, pp. 507–518.
81. N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. "The R*-tree: An efficient and robust
access method for points and rectangles.” In Proc. ACM SIGMOD Int’l. Conf. on Management
of Data, Atlantic City, NJ, May 1990, pp. 322–331.
82. N. Katayama, and S. Satoh. "The SR-tree: An index structure for high-dimensional nearest
neighbor queries.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Tucson, AZ,
May 1997, pp. 369–380.
83. G. R. Hjaltason, and H. Samet. “Distance browsing in spatial databases.” ACM Transactions on
Database Systems (TODS) 24(2): 265–318 (1999).
84. N. Roussopoulos, S. Kelley, and F. Vincent. “Nearest neighbor queries.” In Proc. ACM SIGMOD
Int’l. Conf. on Management of Data, San Jose, CA, June 1995, pp. 71–79.
85. L. Zhang. “High-dimensional indexing methods utilizing clustering and dimensionality reduc-
tion.” PhD Dissertation, Computer Science Department, New Jersey Institute of Technology
(NJIT), Newark, NJ, May 2005.
86. K. Chakrabarti, and S. Mehrotra. "The hybrid tree: An index structure for high dimensional
feature spaces.” In Proc. IEEE Int’l. Conf. on Data Engineering (ICDE), 1999, pp. 440–447.
87. S. Berchtold, D. A. Keim, and H. P. Kriegel. "The X-tree: An index structure for high-dimensional data." In Proc. 22nd Int'l. Conf. on Very Large Data Bases (VLDB), San Jose, CA, August 1996,
pp. 28–39.
88. S. Berchtold, C. Böhm, and H.-P. Kriegel. "The pyramid-technique: Towards breaking the curse
of dimensionality.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Seattle, WA,
June 1998, pp. 142–153.
89. B. C. Ooi, K.-L. Tan, C. Yu, and S. Bressan. "Indexing the edges—A simple and yet efficient
approach to high-dimensional indexing.” In Proc. 19th ACM Int’l. Symp. on Principles of
Database Systems (PODS), Dallas, May 2000, pp. 166–174.
90. C. Yu, B. C. Ooi, K.-L. Tan, and H. Jagadish. "Indexing the distance: An efficient method to kNN
processing.” In Proc. 27th Int’l. Conf. on Very Large Data Bases (VLDB), Rome, Italy, September
2001, pp. 421–430.
91. H. V. Jagadish, B. C. Ooi, K. L. Tan, C. Yu, and R. Zhang. “iDistance: An adaptive B+-tree
based indexing method for nearest neighbor search.” ACM Transactions on Database Systems
(TODS) 30(2): 364–397 (2005).
92. J. K. Uhlmann. “Satisfying general proximity/similarity queries with metric trees.” Information
Processing Letters (IPL) 40(4): 175–179 (1991).
93. P. N. Yianilos. “Data structures and algorithms for nearest neighbor search in general metric
spaces.” In Proc. 4th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Austin,
TX, January 1993, pp. 311–321.
94. T. Bozkaya, and M. Ozsoyoglu. “Distance-based indexing for high-dimensional metric
spaces.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Tucson, AZ, May 1997,
pp. 357–368.
95. P. Ciaccia, M. Patella, and P. Zezula. "M-tree: An efficient access method for similarity search in metric spaces." In Proc. 23rd Int'l. Conf. on Very Large Data Bases (VLDB), Athens, Greece,
August 1997, pp. 426–435.
96. R. F. S. Filho, A. Traina, C. Traina Jr., and C. Faloutsos. "Similarity search without tears: The
OMNI family of all-purpose access methods.” In Proc. 17th IEEE Int’l. Conf. on Data Engineering
(ICDE), Heidelberg, Germany, April 2001, pp. 623–630.
97. B. Cui, B. C. Ooi, J. Su, and K.-L. Tan. "Indexing high-dimensional data for efficient in-memory
similarity search.” IEEE Transactions on Knowledge and Data Engineering (TKDE) 17(3): 339–
353 (2005).
98. J. L. Hennessy, and D. A. Patterson. Computer Architecture: A Quantitative Approach, 5th edi-
tion. Elsevier, Burlingame, MA, 2011.
99. K. I. Lin, H. V. Jagadish, and C. Faloutsos. "The TV-tree: An index structure for high-
dimensional data.” VLDB Journal 3(4): 517–542 (1994).
100. A. Thomasian, and L. Zhang. "The Stepwise Dimensionality-Increasing (SDI) index for high dimensional data." The Computer Journal 49(5): 609–618 (2006).
101. D. A. White, and R. Jain. “Similarity indexing: Algorithms and performance.” In Storage and
Retrieval for Image and Video Databases (SPIE), Volume 2670, San Jose, CA, 1996, pp. 62–73.
102. K. V. Ravikanth, D. Agrawal, and A. Singh. “Dimensionality-reduction for similarity searching
in dynamic databases.” In Proc. ACM SIGMOD Int’l. Conf. on Management of Data, Seattle,
WA, June 1998, pp. 166–176.
103. S. Rajamanickam. "Efficient algorithms for sparse singular value decomposition." PhD Thesis, Computer Science Dept., University of Florida, Gainesville, FL, 2009. Available at http://www.cise.ufl.edu/~srajaman/Rajamanickam_S.pdf.
104. A. A. Amini. "High-dimensional principal component analysis." PhD Thesis, Report Number UCB/EECS-2011-104, EECS Dept., University of California, Berkeley, CA, 2011. Available at http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-104.html.
y_{i,j} = \sum_{k=1}^{N} x_{i,k} v_{k,j}, \qquad 1 \le j \le n, \; 1 \le i \le M.
The rest of the columns of the matrix are set to the column mean, which is zero in our case. (Some data sets are studentized by dividing the zero-mean column by the standard deviation [3].) The squared error in representing point P_i, 1 ≤ i ≤ M, is then
e_i = \sum_{j=n+1}^{N} y_{i,j}^2 = \sum_{j=n+1}^{N} \left( \sum_{k=1}^{N} x_{i,k} v_{k,j} \right)^2, \qquad 1 \le i \le M.
x_{i,j} = \sum_{k=1}^{N} s_k u_{i,k} v_{k,j}, \qquad 1 \le j \le N, \; 1 \le i \le M.
\bar{e}_i = \sum_{j=1}^{N} \left( \sum_{k=n+1}^{N} s_k u_{i,k} v_{k,j} \right)^2 = \sum_{j=1}^{N} \left( \sum_{k=n+1}^{N} y_{i,k} v_{k,j} \right)^2.
To show that \bar{e}_i = e_i, 1 ≤ i ≤ M, we use the definition δ_{k,k'} = 1 for k = k' and δ_{k,k'} = 0 for k ≠ k':

\bar{e}_i = \sum_{j=1}^{N} \left( \sum_{k=n+1}^{N} y_{i,k} v_{k,j} \right) \left( \sum_{k'=n+1}^{N} y_{i,k'} v_{k',j} \right) = \sum_{k=n+1}^{N} \sum_{k'=n+1}^{N} y_{i,k} y_{i,k'} \sum_{j=1}^{N} v_{k,j} v_{k',j}
= \sum_{k=n+1}^{N} \sum_{k'=n+1}^{N} y_{i,k} y_{i,k'} \delta_{k,k'} = \sum_{j=n+1}^{N} y_{i,j}^2 = e_i.
Our method is more efficient from the viewpoint of computational cost for nearest neighbor, range, and window queries. In the case of nearest neighbor queries, once the input or query vector is transformed and the appropriate n dimensions are selected, we need only be concerned with these dimensions. This is less costly than applying the inverse transformation.
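As a concrete illustration of the equality \bar{e}_i = e_i (this numerical check is not part of the original text; the matrix sizes and data are illustrative), a short NumPy sketch:

import numpy as np

# Illustrative check: the squared error of dropping the last N - n rotated
# dimensions equals the squared reconstruction error in the original coordinates.
M, N, n = 200, 8, 3
rng = np.random.default_rng(0)
X = rng.standard_normal((M, N))
X -= X.mean(axis=0)                      # zero-mean columns, as assumed above

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
Y = X @ V                                # rotated data: y_{i,j}

e = (Y[:, n:] ** 2).sum(axis=1)          # e_i
X_n = Y[:, :n] @ V[:, :n].T              # reconstruction from the first n dimensions
e_bar = ((X - X_n) ** 2).sum(axis=1)     # e-bar_i

print(np.allclose(e, e_bar))             # True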
CHAPTER 4
CONTENTS
4.1 Introduction 72
4.2 CDM 74
4.3 PEA 76
4.4 Divide and Conquer 79
4.5 GAs 79
4.6 DCGA 80
4.7 K-Means 81
4.8 Clustering Genetic Algorithm with the SSE Criterion 81
4.9 MapReduce Section 83
4.10 Simulation 83
4.11 Conclusion 87
References 87
ABSTRACT
The purpose of this project is to present a set of algorithms and their efficiency for Multiple Sequence Alignment (MSA) and clustering problems, including solutions in distributed environments with Hadoop. The strength, the adaptability, and the effectiveness of the genetic algorithms (GAs) for both problems are pointed out. MSA is among the most important tasks in computational biology. In biological sequence comparison, emphasis is given to the simultaneous alignment of several sequences. GAs are stochastic approaches for efficient and robust search that can play
4.1 INTRODUCTION
Sequence alignment is an arrangement of two or more sequences, highlighting their simi-
larity. Sequence alignment is useful for discovering structural, functional, and evolutionary information in biological sequences. The idea of assigning a score to an alignment and then minimizing over all alignments is at the heart of all biological sequence alignments. We can define a distance or a similarity function for an alignment. A distance function will assign negative values to mismatches or gaps and then aim at minimizing this distance. A similarity function will give high values to matches and low values to gaps and
then maximize the resulting score. A dot matrix analysis is a method for comparing two
sequences when looking for possible alignment. We use a dot plot to depict the pairwise
comparison and identify a region of close similarity between them. The biggest asset of a dot matrix is its ability to find direct repeats and visualize the entire comparison. The
consistent regions comprising aligned spans from all the sequences compared may permit
multiple selections of fragment sets. For Multiple Sequence Alignment (MSA), the divide-
and-conquer method is used with a genetic algorithm (GA) [1–3].
The plots may either contain enough points to trace a consecutive path for each pair of sequences or display only isolated spans of similarity. This can be achieved by the CDM. In addition, when there are no consecutive paths, an optimal subset with the longest subsequence may be selected. The criterion of consistency can guarantee the creation of consistent regions. First, the lists are updated with the indices of the aligned items, and then the MSA is created. The case of having more than one index in the next sequence is also examined. Vectors that follow each other (not crossing over) and an array of lists can provide a solution. Finally, after finding the consistency regions, using CDM, the divide-and-conquer method can be applied by selecting cut points that do not discontinue the consistency.
Moreover, based on the entropy of an alignment, the pilot entropy algorithm (PEA) can provide MSA by shifting lines left or columns right. The difference between CDM and PEA is that PEA is not interested in finding consistency, but it makes alignment according to column entropy scores. CDM or PEA can cooperate with GA under DCGA using the divide-and-conquer method. In addition, faster alignment can be achieved for some part of the sequences (partial MSA) by the cut points and the preferred method (CDM, PEA, GA).
DCGA with the divide-and-conquer technique can effectively reduce the space complexity for MSA so that the consistency achieved from the CDM is not broken. The advantages and disadvantages of each of the methods are presented.
Clustering is the process of grouping data into clusters, where objects within each clus-
ter have high similarity but are dissimilar to the objects in other clusters. The goal of clustering is to find groups of similar objects based on a similarity metric [4]. Some of the
problems associated with current clustering algorithms are that they do not address all
the requirements adequately and that they have high time complexity when dealing with a
large number of dimensions and large data sets.
Since a clustering problem can be defined as an optimization problem, evolutionary approaches may be appropriate here. The idea is to use evolutionary operators and a population of clustering structures to converge into a globally optimal clustering. The most fre-
quently used evolutionary technique in clustering problems is GAs. Candidate clusterings
are encoded as chromosomes. A fitness value is associated with each cluster's structure. A higher fitness value indicates a better cluster structure. GAs perform a globalized search
for solutions, whereas most other clustering procedures perform a localized search. In a
localized search, the solution obtained at the “next iteration” of the procedure is in the
vicinity of the current solution. The K-means algorithm is used in the localized search technique. In the hybrid approach that is proposed, the GA is used just to find good initial cluster centers, and the K-means algorithm is applied to find the final partition. This hybrid approach performs better than the GAs.
The parallel nature of GAs makes them an optimal base for parallelization. Beyond the Message Passing Interface (MPI), the traditional solution for the GA and hybrid GA implementation, Hadoop can improve the parallel behavior of the GAs, becoming a much better solution. In addition, for MPI programming, a sophisticated knowledge of parallel programming (i.e., synchronization, etc.) is needed. MPI does not scale well on clusters where a failure is not unlikely (i.e., long processing programs). MapReduce [5,6] provides an opportunity to easily develop large-scale distributed applications. Several different implementations of MapReduce have been developed [7,8]. Reference 7 is based on another architecture, Phoenix, while Reference 8 deals with the issue of reducing the number of iterations for finding the right set of initial centroids.
The contribution of this chapter is the development of a set of algorithms for both MSA and GA and also the discovery of a distributed solution for CDM and for hybrid GA.
The chapter is organized as follows: CDM and PEA are developed in Sections 4.2 and 4.3, respectively. The divide-and-conquer principle is presented in Section 4.4. Sections 4.5 through 4.7 include the GAs, the GA following the divide-and-conquer method (DCGA), and K-means, which represent the part of the work that is based on GA. Section 4.8 refers to the clustering GA with the SSE criterion (CGA_SSE). Section 4.9 includes the distributed solution with Hadoop. Finally, Section 4.10 includes a simulation.
4.2 CDM
CDM has three phases: (1) the dot matrices’ preparation, (2) the scan phase, and (3) the
link phase. In the first phase, the dot matrices are examined; in the second, the diagonal similarities; and in the third phase, the vertical continuity. The dot matrix running time is proportional to the product of the size of the sequences. Working with a set of dot matrices poses the problem of how to define consistency. Scanning of graphs for a series of 1s reveals similarity or a string of the same characters. The steps of the algorithm are as follows:
(1) Prepare the dot plot matrix [9] and examine the horizontal continuity (preparation phase).
(2) Find gaps (horizontal) and examine the diagonal location (preparation phase). (3) Scan
phase. (4) Examine the vertical continuity (link phase). Steps 1 and 2 belong to the prepara-
tion phase. Create vectors that hold the locations of 1s (as developed in the example). In the
link phase, the locations are combined. The advantage of CDM is that with the link phase, only (n − 1) comparisons are needed (instead of n), where n is the number of sequences.
Vectors have the locations of the similar items, and the index of a vector is the location of the item of the first sequence of each pair.
Example
Let's consider the following sequences.
A: a b c d e, B: b c x, C: a b c
A. Preparation phase
The dot plot A-B:
      a  b  c  d  e
   b  -  1
   c        1
   x
There is a gap (horizontal) before the match b-b. Hold the horizontal gap in an array hor[i,j]; i is the number of the string, and j is the location of the gap.
The dot plot B-C:
      b  c  x
   a  -
   b  1
   c     1
B. Scan phase
Read plot A-B: Find that positions 2(A) and 1(B) have a similar item: b. Find that positions 3(A) and 2(B) have a similar item: c.
There is continuity (horizontal) since there are numbers, one next to the other, [2(A), 3(A)] and [1(B), 2(B)]. There is a gap horizontally.
Create the vectors with the vertical locations of the same item b: V2: 2, 1. Create the vectors with the vertical locations of the same item c: V3: 3, 2.
Read plot B-C: Find that positions 1(B) and 2(C) have a similar item: b. There is a vertical gap. Find that positions 2(B) and 3(C) have a similar item: c.
There is also continuity (horizontal), since there are numbers next to each other, [1(B), 2(B)] and [2(C), 3(C)].
Create the vectors with the vertical locations (a) of the same item b: V1: 1, 2 and (b) of the same item c: V2: 2, 3. The vectors V2: 2, 1 and V1: 1, 2 are in sequence since the y coordinate of V2 (= 1) is equal to the x coordinate of V1, and there is a vertical continuity. The B (= "bcx") is considered as the intermediate string.
C. Link phase
The vectors can be appended since they have a common item: for V2–V1, 1 (last), and for V3–V2, 2 (last).
Hence, the final vectors are FV2: 2, 1, 2 for b and FV3: 3, 2, 3 for c. After that, using the hor[] of the MSA, a vertical continuity for b and c is discovered as follows:
A: a b c d e
B: - b c x
C: a b c
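To make the preparation and scan phases concrete, the following minimal Python sketch (added here for illustration; it is not the chapter's implementation, and the helper names are mine) builds the dot plot for a pair of sequences and collects the match-position vectors used above.

# Minimal sketch: dot plot and match-position vectors for CDM (illustrative only).

def dot_plot(s, t):
    """Return a |t| x |s| 0/1 matrix with 1 where characters match."""
    return [[1 if a == b else 0 for a in s] for b in t]

def match_vectors(s, t):
    """Map each matching character to its (position in s, position in t), 1-based."""
    vectors = {}
    for i, a in enumerate(s, start=1):
        for j, b in enumerate(t, start=1):
            if a == b:
                vectors.setdefault(a, []).append((i, j))
    return vectors

A, B, C = "abcde", "bcx", "abc"
print(match_vectors(A, B))  # {'b': [(2, 1)], 'c': [(3, 2)]}  -> V2: 2, 1 and V3: 3, 2
print(match_vectors(B, C))  # {'b': [(1, 2)], 'c': [(2, 3)]}  -> V1: 1, 2 and V2: 2, 3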
4.3 PEA
PEA develops a methodology for line or column shifting and inserts gaps according to the entropy values. The pilot is the line (which will be moved left) or the column (which will move right). PEA is a dynamic solution for finding the alignment of k columns (k < n = number of columns). It starts with the entropy of column 1, and if the value of the entropy (entr) is less than a predefined limit (lm), then the line with the smaller contribution value (different item) is searched (from left to right) until it finds the item with the higher contribution in the entropy for the examined column.
PEA looks ahead for m steps. It searches the lists (one for each sequence) in order to find if there are similar items in k columns and tries to figure out if the new alignment will have a better (lower) column score.
There are two moves: the left move of the line (LML), where the pilot is the line with the lower participation item (case 1), and the right move of a column (RMC), where the pilot is the column with the lower score (case 2). The LML and RMC are visible only when, after the moves, the total score of the column is greater than the previous total score. In the LML for a column i, the left move happens when a line is moving one step left, and then "left gaps" are put on all positions of column i − 1 except the line with the lower score item. For a column i, when the LML is not visible, then the RMC is examined. If RMC happens, gaps are put before the tested column.
if (LML_F = 'N')
   { read next items of lines except line k,
     recompute entropy }
if (((the new entropy of col i < previous entropy of col i) or
     ((n_entr)i+1 < (p_entr)i)) && RMC test is valid)
   { // accept the new alignment, LML visible
     // right shift of the items in lines (except line k), put
     // "right gaps" for the items of lines ≠ line k
     d = n; }
} // end while
if (entri = 0) go to the next col (i+1) }
Example
A set of three sequences is given.
1 2 3 4
A B C (1)
C A K B (2)
A B K (3)
The column 1 in line 2 has to be tested. Line 2 is the line with the lower participation, since it has "C." The LML is examined. From list 2, the item A appears in the second location, and the entr2 = 0. Case 2 is visible: (n_entr)i+1 < (p_entr)i. The new left scored gaps are put for all the lines except line 2.
1 2 3 4
- A B C (1)
C A K B (2)
- A B K (3)
In testing column 3, we see that there is, in line 2, the item K. If we try to use LML with a new left move of line 2, then it is not valid, since it will destroy the previous alignments. So the LML test is invalid, and the RMC is examined. The pilot is the column (B, K, B), and line 2 provides the lower participation.
Case 2 is visible since (n_entr)i+1 < (p_entr)i, and in column 4, the entr4 = 0.
1 2 3 4 5
- A - B C (1)
C A K B (2)
- A - B K (3)
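The column entropy used by PEA is not spelled out as a formula in the text; the following sketch assumes the standard Shannon entropy of the symbol distribution in a column (gaps ignored), so that a fully conserved column scores 0.

import math
from collections import Counter

# Assumed scoring for PEA (the chapter gives no explicit formula): Shannon
# entropy of the symbols in a column; identical symbols give entropy 0.
def column_entropy(column, ignore_gaps=True):
    symbols = [c for c in column if not (ignore_gaps and c == '-')]
    total = len(symbols)
    if total == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum((k / total) * math.log2(k / total) for k in counts.values())

alignment = ["-ABC",   # line 1 after the left move
             "CAKB",   # line 2
             "-ABK"]   # line 3
for i, col in enumerate(zip(*alignment), start=1):
    print(i, round(column_entropy(col), 3))
# Column 2 (A, A, A) scores 0, which is why the new alignment is accepted.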
4.5 GAs
By imitating the evolution process of organisms in the complicated natural environment,
an evolutionary system contains the operators of selection, crossover, and mutation. In
GAs, chromosomes form a population, and a set of genes forms a chromosome [14–16].
Selection: The system selects chromosomes to perform crossover and mutation operations according to the fitness values of the chromosomes. The larger the fitness value of a chromosome, the higher the chance of the chromosome being chosen. The roulette wheel is used as a method of selection. The selection of the survivors for the next generation is achieved through the fitness function. The fitness score for each alignment is calculated by summing up the individual scores for each column of the sequences. The columns are scored by looking at matches (+1), mismatches (0), and gaps (−1) in the sequences.
Crossover: The system randomly chooses a pair of chromosomes from a population, and a crossover point is randomly generated. The single-point crossover method is used.
Mutation: A selected chromosome will mutate or not.
GA is used after the pairwise alignment for all the sequences using dynamic programming (DP). The time complexity of DP is O(n²), where n is the longest length of sequences for pairwise alignment. To compare different alignments, for MSA, a fitness function is defined based on the number of matching symbols and the number and sizes of gaps. For the clustering problem, the sum of squared error (SSE) criterion is used.
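As an illustration of the column-based fitness just described (added here as a sketch; the chapter does not fix whether columns are scored pairwise, so a sum-of-pairs reading is assumed):

from itertools import combinations

# Assumed reading: each column is scored over all pairs of sequences with
# match = +1, mismatch = 0, and any pair involving a gap = -1; the alignment
# fitness is the sum of the column scores.
def column_score(column):
    score = 0
    for a, b in combinations(column, 2):
        if a == '-' or b == '-':
            score -= 1
        elif a == b:
            score += 1
    return score

def fitness(alignment):
    return sum(column_score(col) for col in zip(*alignment))

print(fitness(["AAT-", "A-TC", "AATC"]))   # 3 - 1 + 3 - 1 = 4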
4.6 DCGA
DCGA, with the divide-and-conquer technique, can effectively reduce the space complexity for MSA. The cut points are defined, and one part is covered by PEA, while the rest, by the GA. DCGA is a combination of PEA or CDM and GA. DCGA diminishes the search space into a part of sequences using PEA or CDM for finding the "conserved areas" (left part), and GA will be applied for the remaining sequences (right part). DCGA with the "conserved areas" has superiority over the GA solution where the chromosome is used from DP without any cut.
Example
A1 = A A T T C C T
A2 = A T C T
A3 = A A T T C G A
The sequences need to be cut according to the cut points (the locations with no consistency) using PEA or CDM so that they do not disturb the consistency. Starting with PEA from left to right, the subsequences' sets are as follows.
SA11 = A A T SA12 = T C C T
SA21 = - A T SA22 = C T
SA31 = A A T SA32 = T C G A
applied into the left segment. It is extended until similar subsequences are discovered. The GA, within DCGA, is applied only to a right set of subsequences that requires a search space smaller than the entire sequences' space. On the contrary, the GA solution requires all the sequence spaces. For GA, the pairwise DP is used in order to find the alignment of the subsequences. Finally, the total alignment can be obtained by
merging the two subsets.
4.7 K-MEANS
The centroid-based K-means algorithm generates initially a random set of K patterns from the data set, known as the centroids. They represent the problem's initial solution. Each point is then assigned to the closest centroid, and each collection of points assigned to a centroid is a cluster. The quality of a cluster Ci can be measured by the within-cluster variation, which is the SSE between all objects in Ci and the centroid ci, defined as
\mathrm{SSE} = \sum_{i=1}^{K} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2,

where SSE is the sum of the squared error for all objects in the data set; p is the point in space representing a given object; and c_i is the centroid of cluster C_i (both p and c_i are multidimensional) [4]. The SSE can measure the quality of a clustering by calculating the error of each data point. For two different sets of clusters that are produced by two different runs of K-means, the cluster with the smallest squared error is preferred. This means that the centroids of the preferred cluster provide better representation of the points in that cluster.
A major problem with this algorithm is that it is sensitive to the selection of the initial
partition and may converge to a local minimum of variation if the original partition is not
properly chosen. The K-means fails to recognize some clusters. The time complexity is O(n *
k * d * i), where n = number of data points, k = number of clusters, d = dimension of data,
and i = number of iterations.
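A short sketch of the SSE criterion and the basic K-means loop described above (added for illustration; the data, K, and the number of iterations are arbitrary):

import numpy as np

# Illustration of K-means with the SSE criterion; all parameters are arbitrary.
def assign(points, centroids):
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)                       # nearest centroid per point

def sse(points, centroids, labels):
    return sum(((points[labels == i] - c) ** 2).sum()
               for i, c in enumerate(centroids))  # within-cluster variation

rng = np.random.default_rng(1)
points = rng.standard_normal((300, 2))
centroids = points[rng.choice(len(points), size=3, replace=False)]
for _ in range(10):                               # a few Lloyd iterations
    labels = assign(points, centroids)
    centroids = np.array([points[labels == i].mean(axis=0) for i in range(3)])
print(sse(points, centroids, assign(points, centroids)))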
nearest cluster center, providing the best solution. The current population is evaluated, and each individual has a new fitness value, updated using the SSE criterion. The fitness function is defined as the inverse of the SSE metric. A new population is selected and replaces the previous one.
The pseudocode of CGA_SSE is as follows:
The process is repeated until there is no change in k cluster centers. This can be achieved by minimizing the objective function of SSE. Hence, the minimum SSE defines an optimum set of clusters. The time complexity of CGA_SSE is O(tm * p * n * k * d), where tm = maximum number of iterations, p = population size, n = number of points, k = number of clusters, and d = dimension of data. The GA with the selection and mutation operations may take time to converge because the initial assignments are arbitrary. For improvement purposes, a K-means algorithm is used. The K-means calculates cluster centers and
reassigns each data point to the cluster with the nearest cluster center according to the
SSE value.
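Since the CGA_SSE pseudocode itself is not reproduced above, the following is only a hedged sketch of the hybrid idea described in the text: a GA (roulette-wheel selection, single-point crossover, mutation) evolves candidate sets of K centers with fitness equal to the inverse of the SSE, and K-means then refines the best individual. All function names, parameters, and data are illustrative, not the author's implementation.

import numpy as np

# Hedged sketch of the hybrid GA + K-means idea (not the chapter's pseudocode).
def sse(points, centers):
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).sum()

def ga_initial_centers(points, k, pop_size=20, generations=30, rng=None):
    rng = rng or np.random.default_rng(0)
    # each chromosome encodes k candidate centers drawn from the data set
    pop = [points[rng.choice(len(points), k, replace=False)] for _ in range(pop_size)]
    for _ in range(generations):
        fit = np.array([1.0 / (1.0 + sse(points, c)) for c in pop])   # fitness = inverse SSE
        parents = [pop[i] for i in rng.choice(pop_size, pop_size, p=fit / fit.sum())]
        children = []
        for a, b in zip(parents[::2], parents[1::2]):                 # single-point crossover
            cut = rng.integers(1, k)
            children += [np.vstack([a[:cut], b[cut:]]), np.vstack([b[:cut], a[cut:]])]
        for c in children:                                            # mutation: replace one center
            if rng.random() < 0.1:
                c[rng.integers(k)] = points[rng.integers(len(points))]
        pop = children
    return min(pop, key=lambda c: sse(points, c))

def kmeans_refine(points, centers, iters=20):
    for _ in range(iters):
        labels = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        centers = np.array([points[labels == i].mean(axis=0) if (labels == i).any()
                            else centers[i] for i in range(len(centers))])
    return centers

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in ((0, 0), (3, 3), (0, 3))])
centers = kmeans_refine(data, ga_initial_centers(data, k=3, rng=rng))
print(sse(data, centers))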
4.10 SIMULATION
[Figure 4.1: Running time of PEA and GA for Test 1 and Test 2.]
point of 3. Similarly, for test 2, there is faster MSA for the area of 8–10 considering
the cut point of 7.
b. CDM and GA for various sequence sizes: Different sizes of sequences are compared for their running time using CDM and GA. In Figure 4.2, there is the case for two sets (5, 8) of sequences where the smaller-size sequences have better performance.
c. DCGA versus GA: A set of sequences with sizes 10, 20, and 30 has been examined. The CDM finds the "conserved areas" starting from left to right. In the next step, DCGA applies the GA only for the right segment. DCGA outperforms GA, since GA has to use the operations (selection, crossover, and mutation) for all the sizes of sequences. The greater the size of sequences, the better the performance of DCGA is (Figure 4.3).
2. Two scenarios have been developed for clustering.
a. 2-D, 3-D data: 2-D and 3-D data for 300 points have been considered for CGA_SSE
with the measure of the running time. The running time is greater for the 3-D
data (Figure 4.4).
[Figure 4.2: Running time of CDM and GA for two sets of sequences (s1: 5 sequences, s2: 8 sequences) and sequence sizes of 10, 20, 50, and 70.]
[Figure 4.3: Running time of DCGA versus GA for increasing sequence sizes.]
[Figure 4.4: Running time of CGA_SSE for 2-D and 3-D data with data sizes from 50 to 300 points.]
[Table: K-means versus CGA_SSE results for the 3-D and wine data sets.]
            3-D      Wine
K-means     95.2%    96.3%
CGA_SSE     97.7%    98.2%
[Figure: Running time for the wine data set, sequential versus distributed execution.]
[Figure: Running time for the wine quality data set, sequential versus distributed execution with two nodes.]
increase the performance of the processes due to the parallel instead of sequen-
tial execution of the code. For the distributed (one node) case, more processing
time is needed due to the Hadoop overheads. The use of a large number of nodes decreases the required processing time.
c. The CDM (distributed) with various size sequences: The time for four sequences with varying size and the number of nodes is examined. From Figure 4.8, the
[Figure 4.8: Running time of distributed CDM with one and two nodes for sequence sizes of 800 KB and 1200 KB.]
[Figure 4.9: Running time of distributed CDM with one and two nodes for inputs of 2 MB, 6 MB, and 8 MB.]
larger the sequence size, the longer the alignment time. When more nodes are
involved, there is significant reduction of time.
d. CDM (distributed) with various numbers of sequences: For all the sizes of sequences,
the Hadoop with two nodes gives better time performance than with just one
node (Figure 4.9).
4.11 CONCLUSION
A set of algorithms has been developed for finding MSA using dot matrices and the divide-and-conquer method with GA. CDM discovers the consistency, while PEA provides the MSA. In DCGA, there are different results in favor of PEA or GA depending on the values of the cut points or the size of the subsequences that are under processing. This provides the opportunity for partial MSA. Moreover, DCGA with the divide-and-conquer technique can effectively reduce the space complexity for MSA so that the consistency achieved from the CDM is not broken. For long sequences with "conserved areas" on their left side, DCGA outperforms GA. This is due to the fact that GA uses the whole size of the sequences. Finally, CGA_SSE, compared with K-means, can discover an optimal cluster-
ing using the SSE criterion.
Hadoop with the MapReduce framework can provide easy solutions by dividing the
applications into many small blocks of work that can be executed in parallel. CGA_SSE
and CDM with Hadoop are examples of MapReduce’s ability to solve large-scale process-
ing problems. Generally, for problems with massive populations, Hadoop provides a new
direction for solving hard problems using MapReduce with a number of nodes. Future
work would deal with MSA, neural networks, and Hadoop.
REFERENCES
1. J. Stoye, A. Dress, S. Perrey, "Improving the divide and conquer approach to sum-of-pairs multiple sequence alignment," Applied Mathematics Letters 10(2):67–73, 1997.
2. S.-M. Chen, C.-H. Lin, “Multiple DNA sequence alignment based on genetic algorithms and
divide and conquer techniques,” Information and Management Science Journal 18(2):76–111,
2007.
3. H. Carrillo, D. Lipman, The Multiple Sequence Alignment Problem in Biology, Addison-Wesley,
Boston, 1989.
4. J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd ed., MK, 2012.
5. C. Lam, Hadoop in Action, Manning, Stamford, CT, 2011.
6. J. Dean, S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Com-
munications of the ACM 51(1):107–113, 2008.
7. R. Raghuraman, A. Penmetsa, G. Bradski, C. Kozyrakis, “Evaluating MapReduce for multi-
core and multiprocessor systems,” Proceedings 2007 IEEE, 13th Intern. Symposium on High
Performance Computer Architecture, January 2007.
8. J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S. Bae, J. Qiu, G. Fox, “Twister: Runtime for
iterative MapReduce,” Proc. 19th ACM Inter. Symposium on High Performance Distributed
Computing, 2010, pp. 810–818.
9. D. Mount, Bioinformatics: Sequence and Genome Analysis, 2nd ed., CSHL Press, 2004.
10. Apache Hadoop. Available at http://hadoop.apache.org.
11. T. White, Hadoop: The Definitive Guide, O'Reilly Media, Yahoo Press, Cambridge, MA, June 5,
2009.
12. R. Taylor, “An overview of the Hadoop/MapReduce/HBase framework and its current applica-
tions in bioinformatics,” BMC Bioinformatics 11(Suppl 12):S1, 2010. Available at http://www
.biomedcentral.com/1471-2105/11/512/51.
13. A. Hughes, Y. Ruan, S. Ekanayake, S. Bae, Q. Dong, M. Rho, J. Qiu, G. Fox, “Interpolative
multidimensional scaling techniques for the identification of clusters in very large sequence
sets,” BMC Bioinformatics 12(Suppl 2):59, 2012. Available at http://www.biomedcentral
.com/1471-2105/13/52/59.
14. C. Zhang, A. Wong, "Towards efficient multiple molecular sequence alignment: A system of genetic algorithm and dynamic programming," IEEE Transactions on Systems, Man, and Cybernetics B 27(6):918–932, 1997.
15. D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-
Wesley, Boston, 1989.
16. K. De Jong, "Learning with genetic algorithms: An overview," Machine Learning 3:121–138, 1988.
17. E. Hruschka, R. Campello, L. Castro, "Improving the efficiency of a clustering genetic algorithm," Advances in Artificial Intelligence, IBERAMIA 2004, vol. 3315 of LNCS, 2004,
pp. 861–870.
18. U. Maulik, S. Bandyopadhyay, "Genetic algorithm–based clustering technique," Pattern
Recognition 33:1455–1465, 2000.
II
Big Data Processing
CHAPTER 5
CONTENTS
5.1 Introduction 92
5.2 Big Data Definition and Concepts 93
5.3 Cloud Computing for Big Data Analysis 96
5.3.1 Data Analytics Tools as SaaS 96
5.3.2 Computing as IaaS 97
5.4 Challenges and Current Research Directions 98
5.5 Conclusions and Perspectives 102
References 102
ABSTRACT
Social media websites, such as Facebook, Twitter, and YouTube, and job posting
websites like LinkedIn and CareerBuilder involve a huge amount of data that are
very useful to economy assessment and society development. These sites provide sentiments and interests of people connected to web communities and a lot of other information. The Big Data collected from the web is considered as an unprecedented source to fuel data processing and business intelligence. However, collecting, storing, analyzing, and processing these Big Data as quickly as possible create new challenges for both scientists and analysts. For example, analyzing Big Data from social media
is now widely accepted by many companies as a way of testing the acceptance of their
products and services based on customers’ opinions. Opinion mining or sentiment
analysis methods have been recently proposed for extracting positive/negative words
from Big Data. However, highly accurate and timely processing and analysis of the
huge amount of data to extract their meaning require new processing techniques.
More precisely, a technology is needed to deal with the massive amounts of unstruc-
tured and semistructured information, in order to understand hidden user behav-
ior. Existing solutions are time consuming given the increase in data volume and
complexity. It is possible to use high-performance computing technology to acceler-
ate data processing, through MapReduce ported to cloud computing. This will allow companies to deliver more business value to their end customers in the dynamic and changing business environment. This chapter discusses approaches proposed in the literature and their use in the cloud for Big Data analysis and processing.
5.1 INTRODUCTION
Societal and technological progress have brought a widespread diffusion of computer services, information, and data characterized by an ever-increasing complexity, pervasiveness, and social meaning. Ever more often, people make use of and are surrounded by devices enabling the rapid establishment of communication means on top of which people may socialize and collaborate, supply or demand services, and query and provide knowledge as it has never been possible before [1]. These novel forms of social services and social organi-
zation are a promise of new wealth and quality for all. More precisely, social media, such as
Facebook, Twitter, and YouTube, and job posting websites like LinkedIn and CareerBuilder
are providing a communication medium to share and distribute information among users.
These media involve a huge amount of data that are very useful to economy assessment
and society development. In other words, the Big Data collected from social media is con-
sidered as an unprecedented source to fuel data processing and business intelligence. A
study done in 2012 by the American Multinational Corporation (AMC) has estimated that
from 2005 to 2020, data will grow by a factor of 300 (from 130 exabytes to 40,000 exabytes),
and therefore, digital data will be doubled every 2 years [2]. IBM estimates that every day,
2.5 quintillion bytes of data are created, and 90% of the data in the world today has been
created in the last 2 years [3]. In addition, Oracle estimated that 2.5 zettabytes of data were
generated in 2012, and this will grow significantly every year [4].
The growth of data constitutes the "Big Data" phenomenon. Big Data can then be defined as large amounts of data that require new technologies and architectures so that it becomes possible to extract value from them by a capturing and analysis process. Due to such a large size of data, it becomes very difficult to perform effective analysis using existing traditional techniques. More precisely, as Big Data grows in terms of volume, velocity, and value, the current technologies for storing, processing, and analyzing data become inefficient and insufficient. Furthermore, unlike data that are structured in a known format (e.g., rows and columns), data extracted, for example, from social media, have unstructured formats and are very complex to process directly using standard tools. Therefore, technologies that enable a
scalable and accurate analysis of Big Data are required in order to extract value from it.
A Gartner survey stated that data growth is considered the largest challenge for organizations [5]. Given this issue, analyzing and high-performance processing of these
Big Data as quickly as possible create new challenges for both scientists and analysts. However, high-performance computing (HPC) technologies still lack the tool sets that fit the current growth of data. In this case, new paradigms and storage tools were integrated with HPC to deal with the current challenges related to data management. Some of these technologies include providing computing as a utility (cloud computing) and introducing new parallel and distributed paradigms. Recently, cloud computing has played an important role as it provides organizations with the ability to analyze and store data economically and efficiently. For example, performing HPC in the cloud was introduced as data has
started to be migrated and managed in the cloud. For example, Digital Communications
Inc. (DCI) stated that by 2020, a significant portion of digital data will be managed in the
cloud, and even if a byte in the digital universe is not stored in the cloud, it will pass, at
some point, through the cloud [6].
Performing HPC in the cloud is known as high-performance computing as a ser-
vice (HPCaaS). Therefore, a scalable HPC environment that can handle the complexity
and challenges related to Big Data is required [7]. Many solutions have been proposed
and developed to improve computation performance of Big Data. Some of them tend
to improve algorithm efficiency, provide new distributed paradigms, or develop power-
ful clustering environments, though few of those solutions have addressed a whole pic-
ture of integrating HPC with the current emerging technologies in terms of storage and
processing.
This chapter introduces the Big Data concepts along with their importance in the modern world and existing projects that are effective and important in changing the concept of science into big science and society too. In this work, we will be focusing on efficient approaches for high-performance mining of Big Data, particularly existing solutions that are proposed in the literature and their efficiency in HPC for Big Data analysis and processing. The remainder of this chapter is structured as follows. The definition of Big Data
and related concepts are introduced in Section 5.2. In Section 5.3, existing solutions for
data processing and analysis are presented. Some remaining issues and further research
directions are presented in Section 5.4. Conclusions and perspectives are presented in
Section 5.5.
extract data for analysis. Data incorporates information of great benefit and insight for
users. For example, many social media websites provide sentiments and opinions of people
connected to web communities and a lot of other information. Sentiments or opinions
are clients’ personal judgments about something, such as whether the product is good or
bad [9]. Sentiment analysis or opinion mining is now widely accepted by companies as an
important core element for automatically detecting sentiments. For example, this research
field has been active since 2002 as a part of online market research tool kits for processing
of large numbers of texts in order to identify clients’ opinions about products.
In past years, many machine learning algorithms for text analysis have been proposed.
They can be classified into two main families: linguistic and nonlinguistic. Linguistic-based
algorithms use knowledge about language (e.g., syntactic, semantic) in order to understand
and analyze texts. Nonlinguistic-based algorithms use a learning process from training
data in order to guess correct sentiment values. However, these algorithms are time con-
suming, and plenty of efforts have been dedicated to parallelize them to get better speed. Therefore, new technologies that can support the volume, velocity, variety, and value of
data were recently introduced. Some of the new technologies are Not Only Structured
Query Language (NoSQL), parallel and distributed paradigms, and new cloud comput-
ing trends that can support the four dimensions of Big Data. For example, NoSQL is the
transition from relational databases to nonrelational databases [10]. It is characterized by
the ability to scale horizontally, the ability to replicate and to partition data over many
servers, and the ability to provide high-performance operations. However, moving from
relational to NoSQL systems has eliminated some of the atomicity, consistency, isolation,
and durability (ACID) transactional properties [11]. In this context, NoSQL properties are
defined by consistency, availability, and partition (CAP) tolerance [12] theory, which states that developers must make trade-off decisions between consistency, availability, and par-
titioning. Some examples of NoSQL tools are Cassandra [13], HBase [14], MongoDB [15],
and CouchDB [16].
In parallel to these technologies, cloud computing becomes the current innovative and
emerging trend in delivering information technology (IT) services that attracts the inter-
est of both academic and industrial fields. Using advanced technologies, cloud computing provides end users with a variety of services, starting from hardware-level services to the application level. Cloud computing is understood as utility computing over the Internet. This means that computing services have moved from local data centers to hosted services that are offered over the Internet and paid for based on a pay-per-use model [17]. As stated in References 18 and 19, cloud deployment models are classified as follows: software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS). SaaS represents application software, operating systems (OSs), and computing resources. End users can view the SaaS model as a web-based application interface where services and complete software applications are delivered over the Internet. Some examples of SaaS applications are Google Docs, Microsoft Office Live, Salesforce Customer Relationship Management
(CRM), and so forth. PaaS allows end users to create and deploy applications on a pro-
vider’s cloud infrastructure. In this case, end users do not manage or control the underly-
ing cloud infrastructure like the network, servers, OSs, or storage. However, they do have
control over the deployed applications by being allowed to design, model, develop, and test
them. Examples of PaaS are Google App Engine, Microsoft Azure, Salesforce, and so forth. IaaS consists of a set of virtualized computing resources such as network bandwidth, storage capacity, memory, and processing power. These resources can be used to deploy and run arbitrary software, which can include OSs and applications. Examples of IaaS providers are Dropbox, Amazon Web Services (AWS), and so forth.
As illustrated in Table 5.1, there are many other providers who offer cloud services with different features and pricing [20]. For example, Amazon (AWS) [21] offers a number of cloud services for all business sizes. Some AWS services are Elastic Compute Cloud; Simple Storage Service; SimpleDB (relational data storage service that stores, processes, and queries data sets in the cloud); and so forth. Google [22] offers high accessibility and usability
in its cloud services. Some of Google’s services include Google App Engine, Gmail, Google
Docs, Google Analytics, Picasa (a tool used to exhibit products and upload their images in
the cloud), and so forth.
Recently, HPC has been adopted to provide high computation capabilities, high band-
width, and low-latency networks in order to handle the complexity of Big Data. HPC fits
these requirements by implementing large physical clusters. However, traditional HPC
faces a set of challenges, which consist of peak demand, high capital, and the high expertise needed to acquire and operate the physical environment [23]. To deal with these issues, HPC experts have leveraged the benefits of new technology trends including cloud technologies
and large storage infrastructures. Merging HPC with these new technologies has led to a
new HPC model, HPCaaS. HPCaaS is an emerging computing model where end users have
on-demand access to pre-existing needed technologies that provide high performance and
a scalable HPC computing environment [24]. HPCaaS provides unlimited benefits because
of the better quality of services provided by the cloud technologies and the better paral-
lel processing and storage provided by, for example, the Hadoop Distributed File System
(HDFS) and MapReduce paradigm. Some HPCaaS benefits are stated in Reference 23 as follows: (1) high scalability, in which resources are scaled up so as to ensure essential resources that fit users' demand in terms of processing large and complex data sets; (2) low cost, in
which end users can eliminate the initial capital outlay, time, and complexity to procure
HPC; and (3) low latency, by implementing the placement group concept that ensures the
execution and processing of data in the same server.
There are many HPCaaS providers in the market. An example of an HPCaaS provider
is Penguin Computing [25], which has been a leader in designing and implementing high-
performance environments for over a decade. Nowadays, it provides HPCaaS with dif-
ferent options: on demand, HPCaaS as private services, and hybrid HPCaaS services.
AWS [26] is also an active HPCaaS in the market; it provides simplified tools to perform HPC over the cloud. AWS allows end users to benefit from HPCaaS features with differ-
ent pricing models: On-Demand, Reserved [27], or Spot Instances [28]. HPCaaS on AWS
is currently used for computer-aided engineering, molecular modeling, genome analysis,
and numerical modeling across many industries, including oil and gas, financial services, and manufacturing [21]. Other leaders of HPCaaS in the market are Microsoft (Windows
Azure HPC) [29] and Google (Google Compute Engine) [30].
For Big Data processing, two main research directions can be identified: (1) deploying
popular tools and libraries in the cloud and providing the service as SaaS and (2) providing
computing and storage resources as IaaS or HPCaaS to allow users to create virtual clusters
and run jobs.
programming model and Hadoop system, to allow users with little knowledge in parallel
and distributed systems to easily implement parallel algorithms and run them on multiple
processor environments or on multithreading machines. It also provides preprogrammed
parallel algorithms, such as support vector machine classifiers, linear and transform
regression, nearest neighbors, K-means, and principal component analysis (PCA). NIMBLE
is another tool kit for implementing parallel data mining and machine learning algorithms
on MapReduce [32]. Its main aim is to allow users to develop parallel machine learning
algorithms and run them on distributed- and shared-memory machines.
Another category of machine learning systems has been developed and uses Hadoop
as a processing environment for data analysis. For example, the Kitenga Analytics
(http://software.dell.com/products/kitenga-analytics-suite/) platform provides ana-
lytical tools with an easy-to-use interface for Big Data sophisticated processing and
analysis. It combines Hadoop, Mahout machine learning, and advanced natural lan-
guage processing in a fully integrated platform for performing fast content mining
and Big Data analysis. Furthermore, it can be considered as the first Big Data search and
analytics platform to integrate and process diverse, unstructured, semistructured, and
structured information. Pentaho Business Analytics (http://www.pentaho.fr/explore
/ pentaho-business-analytics/) is another platform for data integration and analysis. It
offers comprehensive tools that support data preprocessing, data exploration, and data
extraction together with tools for visualization and for distributed execution on the
Hadoop platform. Other systems, such as BigML (https://bigml.com/), Google Prediction
API (https://cloud.google.com/products/prediction-api/), and Eigendog, recently have
been developed by offering services for data processing and analysis. For example,
Google Prediction API is Google’s cloud-based machine learning tool used for ana-
lyzing data. However, these solutions cannot be used for texts that are extracted, for
example, from social media and social networks (e.g., Twitter, Facebook).
Recently, an increased interest has been devoted to text mining and natural language
processing approaches. hese solutions are delivered to users as cloud-based services.
It is worth noting that the main aim of text mining approaches is to extract features,
for example, concept extraction and sentiment or opinion extraction. However, the size
and number of documents that need to be processed require the development of new
solutions. Several solutions are provided via web services. For example, AlchemyAPI
(http://www.alchemyapi.com/) provides natural language processing web services for
processing and analyzing vast amounts of unstructured data. It can be used for perform-
ing key word/entity extraction and sentiment analysis on large amounts of documents
and tweets. In other words, it uses linguistic parsing, natural language processing, and
machine learning to analyze and extract semantic meaning (i.e., valuable information)
from web content.
solutions provide users the possibility to experiment with complex algorithms on their
customized clusters on the cloud.
It is worth noting that performing HPC in the cloud was introduced as data has started
to be migrated and managed in the cloud. As stated in Section 5.2, performing HPC in the
cloud is known as HPCaaS.
In short, HPCaaS offers high-performance, on-demand, and scalable HPC environ-
ments that can handle the complexity and challenges related to Big Data [7]. One of the
most known and adopted parallel and distributed systems is the MapReduce model that
was developed by Google to meet the growth of their web search indexing process [33].
MapReduce computations are performed with the support of a data storage system known
as Google File System (GFS). The success of both GFS and MapReduce inspired the devel-
opment of Hadoop, which is a distributed and parallel system that implements MapReduce
and HDFS [34]. Nowadays, Hadoop is widely adopted by big players in the market because
of its scalability, reliability, and low cost of implementation. Hadoop is also proposed to be
integrated with HPC as an underlying technology that distributes the work across an HPC
cluster [35,36]. With these solutions, users no longer need high skills in HPC-related fields
in order to access computing resources.
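As a simplified, single-process illustration of the MapReduce programming model mentioned above (this toy example is added here; a real Hadoop job would express the same map and reduce functions through its own API and run them over HDFS across a cluster):

from collections import defaultdict

# Toy word count in the MapReduce style: map emits (key, value) pairs, a
# shuffle step groups values by key, and reduce aggregates each group.
def map_phase(doc_id, text):
    for word in text.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    yield word, sum(counts)

def run_mapreduce(documents):
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            groups[key].append(value)
    return dict(kv for key, values in groups.items()
                for kv in reduce_phase(key, values))

docs = {1: "big data needs big clusters", 2: "hadoop implements mapreduce"}
print(run_mapreduce(docs))   # {'big': 2, 'data': 1, ...}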
Recently, several cloud-based HPC clusters have been provided to users together with
software packages (e.g., Octave, R system) for data analysis [31]. The main objective is to
provide users with suitable environments that are equipped with scalable high-performance
resources and statistical software for their day-to-day data analysis and processing. For example, Cloudnumbers.com (http://www.scientific-computing.com/products/product_details.php?product_id=1086) is a cloud-based HPC platform that can be used for time-intensive processing of Big Data from different domains, such as finance and social science. Further-
more, a web interface can be used by users to create, monitor, and easily maintain their
work environments. Another similar environment to Cloudnumbers.com is Opani (http://
opani.com), which provides additional functionalities that allow users to adapt resources
according to data size. While these solutions are scalable, high-level expertise in statistics is required, and thus, there are a limited number of providers in this category of solutions. To
overcome this drawback, some solutions have been proposed to allow users to build their
cloud Hadoop clusters and run their applications. For example, RHIPE (http://www.datadr
.org/) provides a framework that allows users to access Hadoop clusters and launch map/
reduce analysis of complex Big Data. It is an environment composed of R, an interactive
language for data analysis; HDFS; and MapReduce. Other environments, such as Anaconda
(https://store.continuum.io/cshop/anaconda/) and Segue (http://code.google.com/p/segue/),
were developed for performing map/reduce jobs on top of clusters for Big Data analysis.
of techniques, such as association rule mining, decision trees, regression, support vector
machines, and other data mining techniques [37].
However, few of those solutions have addressed the whole picture of integrating HPC
with the current emerging technologies in terms of storage and processing of Big Data,
mainly those extracted from social media. Some of the most popular technologies cur-
rently used in hosting and processing Big Data are cloud computing, HDFS, and Hadoop
MapReduce [38]. At present, the use of HPC in cloud computing is still limited. The first
step towards this research was taken by the Department of Energy (DOE) National Laboratories,
which started exploring the use of cloud services for scientific computing [24].
Besides, in 2009, Yahoo Inc. launched partnerships with top universities in the
United States to conduct more research on cloud computing, distributed systems, and
high-performance computing applications.
Recently, there have been several studies that evaluated the performance of high-performance
computing in the cloud. Most of these studies used Amazon Elastic Compute Cloud (Amazon
EC2) as a cloud environment [26,39–42]. Besides, only a few studies have evaluated the
performance of high-performance computing using the combination of both new emerging distrib-
uted paradigms and the cloud environment [43]. For example, in Reference 39, the authors
have evaluated HPC on three different cloud providers: Amazon EC2, GoGrid cloud, and
IBM cloud. For each cloud platform, they ran HPC on Linux virtual machines (VMs),
and they came to the conclusion that the tested public clouds do not seem to be optimized
for running HPC applications. This was explained by the fact that public cloud platforms
have slow network connections between VMs. Furthermore, the authors in Reference 26
evaluated the performance of HPC applications in today's cloud environments (Amazon
EC2) to understand the trade-offs in migrating to the cloud. Overall results indicated that
running HPC on the EC2 cloud platform limits performance and causes significant vari-
ability. Besides Amazon EC2, research [44] has evaluated the performance–cost trade-offs
of running HPC applications on three different platforms. The first and second platforms
consisted of two physical clusters (Taub and Open Cirrus cluster), and the third platform
consisted of Eucalyptus with kernel-based virtual machine (KVM) virtualization.
Running HPC on these platforms led the authors to conclude that the cloud is more cost effec-
tive for low-communication-intensive applications.
Evaluation of HPC without relating it to new cloud technologies was also performed
using different virtualization technologies [45–48]. For example, in Reference 45, the
authors performed an analysis of virtualization techniques including VMware, Xen, and
OpenVZ. Their findings showed that none of the techniques matches the performance of
the base system perfectly; however, OpenVZ demonstrates high performance in both file
system performance and industry-standard benchmarks. In Reference 46, the authors
compared the performance of KVM and VMware. Overall findings showed that VMware
performs better than KVM. Still, in a few cases, KVM gave better results than VMware.
In Reference 47, the authors conducted quantitative analysis of two leading open-source
hypervisors, Xen and KVM. Their study evaluated the performance isolation, overall per-
formance, and scalability of VMs for each virtualization technology. In short, their find-
ings showed that KVM has substantial problems with guests crashing (when increasing the
number of guests); however, KVM still has better performance isolation than Xen. Finally,
in Reference 48, the authors have extensively compared four hypervisors: Hyper-V, KVM,
VMware, and Xen. Their results demonstrated that there is no perfect hypervisor. However,
despite the evaluation of different technologies, HPCaaS still needs more investigation to
decide on appropriate environments for Big Data analysis.
In our recent work, different experiments have been conducted on three different clusters,
Hadoop physical cluster (HPhC), Hadoop virtualized cluster using KVM (HVC-KVM),
and Hadoop virtualized cluster using VMware Elastic sky X Integrated (HVC-VMware
ESXi), as illustrated in Figure 5.1 [49].
Two main benchmarks, TeraSort and the HDFS I/O saturation benchmark (TestDFSIO)
[50], were used to study the impact of machine virtualization on HPCaaS. TeraSort
does considerable computation, networking, and storage input/output (I/O) and is often
used to benchmark Hadoop clusters.
FIGURE 5.1 Hadoop physical and virtualized clusters. LTS, long term support. (From Achahbar,
O., The Impact of Virtualization on High Performance Computing Clustering in the Cloud,
Master’s thesis report, Al Akhawayn University in Ifrane, Morocco, 2014.)
FIGURE 5.2 Performance on Hadoop physical cluster of (a) TeraSort and (b) TestDFSIO-Write.
(From Achahbar, O., The Impact of Virtualization on High Performance Computing Clustering in
the Cloud, Master’s thesis report, Al Akhawayn University in Ifrane, Morocco, 2014.)
techniques (e.g., association rule mining), and little work has been done for accelerating
other algorithms related to text mining and sentiment analysis. These techniques are data
and computationally intensive, making them good candidates for implementation in cloud
environments. Generally, experimental results demonstrate that virtualized clusters can per-
form much better than physical clusters when processing and handling HPC workloads.
REFERENCES
1. V. De Florio, M. Bakhouya, A. Coronato and G. Di Marzo, “Models and Concepts for Socio-
Technical Complex Systems: Towards Fractal Social Organizations,” Systems Research and
Behavioral Science, vol. 30, no. 6, pp. 750–772, 2013.
2. J. Gantz and D. Reinsel, “The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and
Biggest Growth in the Far East,” IDC IVIEW, 2012, pp. 1–16.
3. M. K. Kakhani, S. Kakhani, and S. R. Biradar, “Research Issues in Big Data Analytics,”
International Journal of Application or Innovation in Engineering and Management, vol. 2, no. 8,
pp. 228–232, 2013.
4. C. Hagen, “Big Data and the Creative Destruction of Today’s,” ATKearney, 2012.
5. Gartner, Inc., “Hunting and Harvesting in a Digital World,” Gartner CIO Agenda Report, 2013,
pp. 1–8.
6. J. Gantz and D. Reinsel, “The Digital Universe Decade—Are You Ready?” IDC IVIEW, 2010,
pp. 1–15.
7. Ch. Vecchiola, S. Pandey and R. Buyya, “High-Performance Cloud Computing: A View of
Scientific Applications,” in the 10th International Symposium on Pervasive Systems, Algorithms
and Networks I-SPAN 2009, IEEE Computer Society, 2009, pp. 4–16.
8. J.-P. Dijcks, “Oracle: Big Data for the Enterprise,” white paper, Oracle Corp., 2013, pp. 1–16.
9. B. Pang and L. Lee, “Opinion Mining and Sentiment Analysis,” Foundations and Trends in
Information Retrieval, vol. 2, no. 1–2, pp. 1–135, 2008.
10. Oracle Corporation, “Oracle NoSQL Database,” white paper, Oracle Corp., 2011, pp. 1–12.
11. S. Yu, “ACID Properties in Distributed Databases,” Advanced eBusiness Transactions for B2B-
Collaborations, 2009.
12. S. Gilbert and N. Lynch, “Brewer’s Conjecture and the Feasibility of Consistent, Available,
Partition-Tolerant Web Services,” ACM SIGACT News, vol. 33, no. 2, p. 51, 2002.
13. A. Lakshman and P. Malik, “Cassandra—A Decentralized Structured Storage System,” ACM
SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35–40, 2010.
14. L. George, HBase: The Definitive Guide, 1st edition, O’Reilly Media, Sebastopol, CA, 556 pp.,
2011.
42. Y. Gu and R. L. Grossman, “Sector and Sphere: The Design and Implementation of a High
Performance Data Cloud,” National Science Foundation, 2008, pp. 1–11.
43. C. Evangelinos and C. N. Hill, “Cloud Computing for Parallel Scientific HPC Applications:
Feasibility of Running Coupled Atmosphere-Ocean Climate Models on Amazon’s EC2,”
Department of Earth, Atmospheric and Planetary Sciences at Massachusetts Institute of
Technology, 2009, pp. 1–6.
44. A. Gupta and D. Milojicic, “Evaluation of HPC Applications on Cloud,” Hewlett-Packard
Development Company, 2011, pp. 1–6.
45. C. Fragni, M. Moreira, D. Mattos, L. Costa, and O. Duarte, “Evaluating Xen, VMware, and
OpenVZ Virtualization Platforms for Network Virtualization,” Federal University of Rio de
Janeiro, 2010, 1 p. Available at http://www.gta.ufrj.br/tp/gta/TechReports/FMM10b.pdf.
46. N. Yaqub, “Comparison of Virtualization Performance: VMWare and KVM,” Master Thesis,
DUO Digitale utgivelser ved UiO, Universitetet i Oslo, Norway, 2012, pp. 30–44. Available at
http://urn.nb.no/URN:NBN:no-33642.
47. T. Deshane, M. Ben-Yehuda, A. Shah and B. Rao, “Quantitative Comparison of Xen and KVM,”
in Xen Summit, 2008, pp. 1–3.
48. J. Hwang, S. Wu and T. Wood, “A Component-Based Performance Comparison of Four
Hypervisors,” George Washington University and IBM T.J. Watson Research Center, 2012,
pp. 1–8.
49. O. Achahbar, “The Impact of Virtualization on High Performance Computing Clustering in the
Cloud,” Master Thesis Report, Al Akhawayn University in Ifrane, Morocco, 2014.
50. M. G. Noll, “Benchmarking and Stress Testing an Hadoop Cluster with TeraSort, TestDFSIO
& Co.,” 2011.
51. E. Wu and Y. Liu, “Emerging Technology about GPGPU,” APCCAS. IEEE Asia Pacific
Conference on Circuits and Systems, 2008.
52. NVIDIA, “NVIDIA CUDA Compute Unified Device Architecture: Programming Guide,”
Version 2.3, July 2009. Available at http://www.cs.ucla.edu/~palsberg/course/cs239/papers
/CudaReferenceManual_2.0.pdf.
53. W. Fang, “Parallel Data Mining on Graphics Processors,” Technical Report HKUST-CS08-07,
October 2008.
CHAPTER 6
The Art of Scheduling for Big Data Science
Florin Pop and Valentin Cristea
CONTENTS
6.1 Introduction 106
6.2 Requirements for Scheduling in Big Data Platforms 107
6.3 Scheduling Models and Algorithms 109
6.4 Data Transfer Scheduling 113
6.5 Scheduling Policies 114
6.6 Optimization Techniques for Scheduling 115
6.7 Case Study on Hadoop and Big Data Applications 116
6.8 Conclusions 118
References 118
ABSTRACT
Many applications generate Big Data, like social networking and social influence pro-
grams, cloud applications, public websites, scientific experiments and simulations,
data warehouses, monitoring platforms, and e-government services. Data grow rap-
idly, since applications produce continuously increasing volumes of unstructured
and structured data. The impact on data processing, transfer, and storage is the need
to reevaluate the approaches and solutions to better answer user needs. In this con-
text, scheduling models and algorithms have an important role. A large variety of
solutions for specific applications and platforms exist, so a thorough and systematic
analysis of existing solutions for scheduling models, methods, and algorithms used
in Big Data processing and storage environments has high importance. This chap-
ter presents the best of existing solutions and creates an overview of current and
near-future trends. It will highlight, from a research perspective, the performance
and limitations of existing solutions and will offer the scientists from academia and
designers from industry an overview of the current situation in the area of scheduling
and resource management related to Big Data processing.
6.1 INTRODUCTION
The rapid growth of data volume requires the processing of petabytes of data per day. Cisco
estimates that mobile data traffic alone will reach 11.16 exabytes per month in
2017. The produced data are subject to different kinds of processing, from real-time pro-
cessing with impact for context-aware applications to data mining analysis for valuable
information extraction. The multi-V (volume, velocity, variety, veracity, and value) model
is frequently used to characterize Big Data processing needs. Volume defines the amount of
data, velocity means the rate of data production and processing, variety refers to data types,
veracity describes how far the data can be trusted given its source, and value refers to the
importance of data relative to a particular context [1].
Scheduling plays an important role in Big Data optimization, especially in reducing the
time for processing. The main goal of scheduling in Big Data platforms is to plan the process-
ing and completion of as many tasks as possible by handling and changing data in an efficient
way with a minimum number of migrations. Various mechanisms are used for resource allo-
cation in cloud, high performance computing (HPC), grid, and peer-to-peer systems, which
have different architectural characteristics. For example, in HPC, the cluster used for data
processing is homogeneous and can handle many tasks in parallel by applying predefined
rules. On the other hand, cloud systems are heterogeneous and widely distributed; task man-
agement and execution are aware of communication rules and offer the possibility to create
particular rules for the scheduling mechanism. The actual scheduling methods used in Big
Data processing frameworks are as follows: first in first out, fair scheduling, capacity sched-
uling, Longest Approximate Time to End (LATE) scheduling, deadline constraint scheduling,
and adaptive scheduling [2,3]. Finding the best method for a particular processing request
remains a significant challenge. We can see Big Data processing as a big “batch” process
that runs on an HPC cluster by splitting a job into smaller tasks and distributing the work to
the cluster nodes. The new types of applications, like social networking, graph analytics, and
complex business workflows, require data transfer and data storage. The processing models
must be aware of data locality when deciding to move data to the computing nodes or to cre-
ate new computing nodes near data locations. The workload optimization strategies are the
key to guarantee profit to resource providers by using resources to their maximum capacity.
For applications that are both computationally and data intensive, the processing models
combine different techniques like in-memory, CPU, and/or graphics processing unit (GPU)
Big Data processing.
Moreover, Big Data platforms face the problem of environments’ heterogeneity due to the
variety of distributed systems types like cluster, grid, cloud, and peer-to-peer, which actually
offer support for advanced processing. At the confluence of Big Data with widely distributed
platforms, scheduling solutions combine approaches designed for efficient problem solving and
parallel data transfers (that hide transfer latency) with techniques for failure man-
agement in highly heterogeneous computing systems. In addition, handling heterogeneous
data sets becomes a challenge for interoperability among various software systems.
This chapter highlights the specific requirements of scheduling in Big Data platforms,
scheduling models and algorithms, data transfer scheduling procedures, policies used in
• Scalability and elasticity: A scheduling algorithm must take into consideration the
peta-scale data volumes and the hundreds of thousands of processors that can be involved in
processing tasks. The scheduler must be aware of execution environment changes and
be able to adapt to workload changes by provisioning or deprovisioning resources.
• General purpose: A scheduling approach should make few assumptions about, and impose
few restrictions on, the various types of applications that can be executed. Interactive jobs,
FIGURE 6.1 Integration of applications and services in Big Data platforms: data flow tasks (Pig), batch tasks (Hive), NewSQL models, real-time streams, and platform services on top of an execution engine (Tez, G-Hadoop, Torque).
distributed and parallel applications, as well as noninteractive batch jobs should all
be supported with high performance. For example, a noninteractive batch job requir-
ing high throughput may prefer time-sharing scheduling; similarly, a real-time job
requiring short-time response prefers space-sharing scheduling.
• Dynamicity: The scheduling algorithm should exploit the full extent of available
resources and may change its behavior to cope, for example, with many computing
tasks. The scheduler needs to continuously adapt to resource availability changes,
paying special attention to cloud systems and HPC clusters (data centers) as reliable
solutions for Big Data [4].
• Transparency: The host(s) on which the execution is performed should not affect
tasks’ behavior and results. From the user perspective, there should be no difference
between local and remote execution, and the user should not be aware about system
changes or data movements for Big Data processing.
• Fairness: Sharing resources among users is a fair way to guarantee that each user
obtains resources on demand. In a pay-per-use model in the cloud, a cluster of
resources can be allocated dynamically or can be reserved in advance.
• Time efficiency: The scheduler should improve the performance of scheduled jobs as
much as possible using different heuristics and state estimation suitable for specific
task models. Multitasking systems can process multiple data sets for multiple users
at the same time by mapping the tasks to resources in a way that optimizes their use.
• Cost (budget) efficiency: The scheduler should lower the total cost of execution by mini-
mizing the total number of resources used while respecting the total money budget. This
aspect requires efficient resource usage. This can be done by optimizing the execu-
tion for mixed tasks using a high-performance queuing system and by reducing the
computation and communication overhead.
• Load balancing: This is used as a scheduling method to share the load among all avail-
able resources. This is a challenging requirement when some resources do not match
the tasks’ properties. There are classical approaches like round-robin scheduling,
but also new approaches that cope with large-scale and heterogeneous systems have been
proposed: least connection, slow start time, or agent-based adaptive balancing.
• Support of data variety and different processing models: This is done by handling multiple
concurrent input streams, structured and unstructured content, multimedia content,
and advanced analytics. Classifying tasks into small or large, high or low priority,
and periodic or sporadic calls for a specific scheduling technique.
• Integration with shared distributed middleware: The scheduler must consider various sys-
tems and middleware frameworks, like sensor integration anywhere following
the Internet of Things paradigm or even mobile cloud solutions that use offloading
techniques to save energy. The integration considers the data access and consumption
and supports various sets of workloads produced by services and applications.
• Capacity awareness: Estimate the percentage of resource allocation for a workload and
understand the volume and the velocity of data that are produced and processed.
• Real-time, latency-free delivery and error-free analytics: Support the service-level agree-
ments, with continuous business process flow and with an integrated labor-intensive
and fault-tolerant Big Data environment.
• API integration: With different operating systems, support various execution virtual
machines (VMs) and wide visibility (end-to-end access through standard protocols
like hypertext transfer protocol [HTTP], file transfer protocol [FTP], and remote
login, and also using a single and simple management console) [5].
• First In First Out (FIFO) (oldest job first)—jobs are ordered according to the arrival time.
The order can be affected by job priority.
• Fair scheduling—each job gets an equal share of the available resources (a toy comparison
with FIFO is sketched after this list).
• Capacity scheduling—provides a minimum capacity guarantee and shares excess capac-
ity; it also considers the priorities of jobs in a queue.
• Adaptive scheduling—balances between resource utilization and jobs’ parallelism in
the cluster and adapts to specific dynamic changes of the context.
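As a toy illustration of how FIFO and fair scheduling differ, the sketch below simulates two jobs competing for a fixed number of task slots; the tick-based model, job sizes, and slot count are invented for illustration and do not correspond to any particular Hadoop scheduler implementation.

```python
def simulate(jobs, slots, policy):
    """Tick-based toy simulation: 'jobs' maps job id -> number of unit tasks
    (dict insertion order = arrival order). Returns the tick at which each job
    finishes under the given policy ('fifo' or 'fair')."""
    remaining = dict(jobs)
    finish, tick = {}, 0
    while remaining:
        tick += 1
        if policy == "fifo":
            active = [next(iter(remaining))]      # all slots go to the oldest job
        else:
            active = list(remaining)              # "fair": every active job gets a share
        share, extra = divmod(slots, len(active))
        for i, job in enumerate(active):
            remaining[job] -= share + (1 if i < extra else 0)
            if remaining[job] <= 0:               # job finished this tick
                finish[job] = tick
                del remaining[job]
    return finish

if __name__ == "__main__":
    jobs = {"job1": 12, "job2": 4}                # job1 was submitted first
    print(simulate(jobs, slots=4, policy="fifo"))  # job2 waits behind job1
    print(simulate(jobs, slots=4, policy="fair"))  # job2 finishes much earlier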
The use of these models takes into account different specific situations: for example,
FIFO is recommended when the number of tasks is less than the cluster size, while fairness
is the best one when the cluster size is less than the number of tasks. Capacity schedul-
ing is used for multiple tasks and priorities specified as response times. In Reference 8, a
solution for scheduling a bag of tasks is presented. Users receive guidance regarding the plau-
sible response time and are able to choose the way the application is executed: with more
money and faster or with less money but slower. An important ingredient in this method
is the phase of profiling the tasks in the actual bag. The basic scheduling is realized with a
bounded knapsack algorithm. Reference 9 presents the idea of scheduling based on scaling
up and down the number of machines in a cloud system. This solution allows users to
choose among several types of VM instances while scheduling each instance’s start-up and
shutdown to respect the deadlines and ensure a reduced cost.
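The budget/deadline trade-off behind such auto-scaling decisions can be illustrated with a small selection routine that picks, from a hypothetical VM catalogue, the cheapest configuration still able to meet a deadline; the catalogue, the perfectly divisible work model, and the hourly billing assumption are illustrative and are not the algorithm of Reference 9.

```python
import math

def cheapest_meeting_deadline(work_units, deadline_hours, vm_types):
    """Pick the cheapest (vm_type, count, cost) whose aggregate throughput finishes
    'work_units' of work within the deadline. Assumes perfectly divisible work
    and hourly billing; both are simplifications for illustration."""
    best = None
    for name, (units_per_hour, price_per_hour) in vm_types.items():
        # Minimum number of instances of this type needed to meet the deadline.
        count = math.ceil(work_units / (units_per_hour * deadline_hours))
        hours = math.ceil(work_units / (units_per_hour * count))
        cost = count * hours * price_per_hour
        if best is None or cost < best[2]:
            best = (name, count, cost)
    return best

if __name__ == "__main__":
    # Hypothetical catalogue: type -> (work units processed per hour, price per hour).
    catalogue = {"small": (10, 0.10), "large": (45, 0.40)}
    print(cheapest_meeting_deadline(work_units=900, deadline_hours=4, vm_types=catalogue))
```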
A scheduling solution based on genetic algorithms is described in Reference 10. The
scheduling is done on grid systems. Grids are different from cloud systems, but the prin-
ciple used by the authors in assigning tasks to resources is the same. The scheduling solu-
tion works with applications that can be modeled as directed acyclic graphs. The idea is
to minimize the duration of the application execution while respecting the budget. This
approach takes into account the system’s heterogeneity.
Reference 11 presents a scheduling model for instance-intensive workflows in the
cloud, which takes into consideration both budget and deadline constraints. The level of user
interaction is very high, the user being able to change dynamically the cost and dead-
line requirements and provide input to the scheduler during the workflow execution. The
interventions can be made at every scheduling round. This is an interesting model because the
user can choose to pay more or less depending on the scenario. The main characteristic
is that the user has more decision power over workflow execution. In addition, the cloud
estimates the time and cost during the workflow execution to provide hints to users and
dynamically reschedule the workload.
The Apache Hadoop framework is a software library that allows the processing of large
data sets distributed across clusters of computers using a simple programming model [3].
The framework facilitates the execution of MapReduce applications. Usually, a cluster
on which the Hadoop system is installed has two masters and several slave components
(Figure 6.2) [12]. One of the masters is the JobTracker, which deals with processing jobs
coming from users and sends them to the scheduler in use at that moment. The other mas-
ter is the NameNode, which manages the file system namespace and the user access control.
The other machines act as slaves. A TaskTracker represents a machine in Hadoop, while a
DataNode handles the operations with the Hadoop Distributed File System (HDFS), which
deals with data replication on all the slaves in the system. This is the way input data get to
map and reduce tasks. Every time an operation occurs on one of the slaves, the results of
the operation are immediately propagated into the system [13].
The Hadoop framework considers the capacity and fair scheduling algorithms. The
Capacity Scheduler has a number of queues. Each of these queues is assigned a part of the
system resources and has specific numbers of map and reduce slots, which are set through
the configuration files. The queues receive users’ requests and order them by the associated
priorities. There is also a per-user limitation for each queue. This prevents a user from
seizing all the resources of a queue.
The Fair Scheduler has pools in which job requests are placed for selection. Each user is
assigned to a pool. Also, each pool is assigned a set of shares and uses them to manage
the resources allocated to jobs, so that each user receives an equal share, no matter the
number of jobs he/she submits. In any case, if the system is not loaded, the remaining shares
are distributed to existing jobs. The Fair Scheduler has been proposed in Reference 14. The
authors demonstrated its special qualities regarding the reduced response time and the
high throughput.
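The redistribution of unused shares described above is essentially max-min fair sharing. The sketch below computes such shares for a set of pools; it is a simplified illustration of the idea, not the Hadoop Fair Scheduler's actual implementation, and the pool demands are invented.

```python
def max_min_fair_shares(total_slots, demands):
    """Compute max-min fair slot shares: every pool gets an equal share, capped at
    its demand; capacity left over by lightly loaded pools is redistributed to the
    pools that are still unsatisfied. Shares may be fractional in this toy model."""
    shares = {pool: 0 for pool in demands}
    unsatisfied = set(demands)
    capacity = total_slots
    while unsatisfied and capacity > 0:
        equal = capacity / len(unsatisfied)
        # Pools demanding less than the equal share are fully satisfied this round.
        satisfied_now = {p for p in unsatisfied if demands[p] - shares[p] <= equal}
        if not satisfied_now:
            for p in unsatisfied:            # remaining pools split capacity equally
                shares[p] += equal
            capacity = 0
        else:
            for p in satisfied_now:
                capacity -= demands[p] - shares[p]
                shares[p] = demands[p]
            unsatisfied -= satisfied_now
    return shares

if __name__ == "__main__":
    # Hypothetical demands (slots wanted) of three user pools, 100 slots in total.
    print(max_min_fair_shares(100, {"alice": 10, "bob": 60, "carol": 80}))
    # -> {'alice': 10, 'bob': 45.0, 'carol': 45.0}
```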
There are several extensions to scheduling models for Hadoop. In Reference 15, a new
scheduling algorithm is presented, LATE, which is highly robust to heterogeneity without
using specific information about nodes. The solution solves the problems posed by heteroge-
neity in virtualized data centers and ensures good performance and flexibility for specula-
tive tasks. In Reference 16, a scheduler is presented that meets deadlines. This scheduler has
a preliminary phase for estimating the possibility of achieving the deadline claimed by the
user, as a function of several parameters: the runtimes of map and reduce tasks, the input
data size, data distribution, and so forth. Jobs are scheduled only if the deadlines can be
met. In comparison with the schedulers mentioned in this section, the genetic scheduler
proposed in Reference 7 approaches the deadline constraints but also takes into account the
environment heterogeneity. In addition, it uses speculative techniques in order to increase
the scheduler’s power. The genetic scheduler has an estimation phase, where the data
processing speed for each application is measured. The scheduler ensures that, once an applica-
tion’s execution has started, that application will end successfully under normal conditions. The
Hadoop On Demand (HOD) [3] virtual cluster uses the Torque resource manager for node
allocation and automatically prepares configuration files. Then it initializes the system based
on the nodes within the virtual cluster. HOD can be used in a relatively independent way.
To support multiuser situations [14,17], the Hadoop framework incorporates several
components that are suitable for Big Data processing (Figure 6.3) since they offer high
scalability to large volumes of data and support access to widely distributed data.
Here is a very short description of these components: Hadoop Common consists of com-
mon utilities that support any Hadoop modules and any new extension. HDFS provides
high-throughput access to application data. Hadoop YARN (Apache Hadoop NextGen
MapReduce) is a framework for job scheduling and cluster resource management that can
be extended across multiple platforms. Hadoop MapReduce is a YARN-based system for par-
allel processing of large data sets.
Facebook’s Corona [18] solution extends and improves the Hadoop model, offering better
scalability and cluster utilization, lower latency for small jobs, the ability to upgrade with-
out disruption, and scheduling based on actual task resource requirements rather than
a count of map and reduce tasks. Corona was designed to answer the most important
Facebook challenges: unique scalability (the largest cluster has more than 100 PB of data)
and processing needs (crunching more than 60,000 Hive queries a day). The data warehouse
inside Facebook has grown by 2500 times between 2008 and 2012, and it is expected to
grow by the same factor until 2016.
Mesos [19] uses a model of resource isolation and sharing across distributed applications
using Hadoop, message passing interface (MPI), Spark, and Aurora in a dynamic way (Figure
6.4). The ZooKeeper quorum is used for master replication to ensure fault tolerance. By
integrating multiple slave executors, Mesos offers support for multiresource scheduling
(memory and CPU aware) using isolation between tasks with Linux Containers. The expected
scalability goes beyond 10,000 nodes.
YARN [20] splits the two major functionalities of the Hadoop JobTracker into two separate
components: resource management and job scheduling/monitoring (application manage-
ment). The Resource Manager is integrated in the data-computation framework and coor-
dinates the resources for all the jobs alive in a Big Data platform. The Resource
Manager has a pure scheduler, that is, it does not monitor or track the application status
and does not offer guarantees about restarting tasks that failed due to either application failure
or hardware failures. It only offers the matching of applications’ jobs to resources.
The new processing models based on bioinspired techniques are used for fault-tolerant
and self-adaptable handling of data-intensive and computation-intensive applications.
These evolutionary techniques approach learning based on history, with the main aim
of finding a near-optimal solution for problems with multiconstraint and multicriteria opti-
mizations. For example, adaptive scheduling is needed for dynamic heterogeneous systems
where we can change the scheduling strategy according to available resources and their
capacity.
maximizing the bandwidth for which the file transfer rates between two end points are
calculated and considering the heterogeneity of server resources.
The Big Data input/output (I/O) scheduler in Reference 22 is a solution for applications
that compete for I/O resources in a shared MapReduce-type Big Data system. The solution,
named Interposed Big Data I/O Scheduler (IBIS), has the main aims of differentiating the I/Os
among competing applications on separate data nodes and performing scheduling
according to applications’ bandwidth demands. IBIS acts as a meta-
scheduler and efficiently coordinates the distributed I/O schedulers across data nodes in
order to allocate the global storage.
In the context of Big Data transfers, a few “big” questions need to be answered in order
to have an efficient cloud environment, more specifically, when and where to migrate. In
this context, an efficient data migration method, focusing on the minimum global time,
is presented in Reference 23. The method, however, does not try to minimize individual
migrations’ duration. In Reference 24, two migration models are described: offline and
online. The offline scheduling model has as a main target the minimization of the maximum
bandwidth usage on all links for all time slots of a planning period. In the online schedul-
ing model, the scheduler has to make fast decisions, and the migrations are revealed to the
migration scheduler in an a priori undefined sequence. Jung et al. [25] treat the data min-
ing parallelization by considering the data transfer delay between two computing nodes.
The delay is estimated by using an autoregressive moving average filter. In Reference 26,
the impact of two resource reservation methods is tested: reservation in source machines
and reservation in target machines. Experimental results proved that resource reservation
in target machines is needed, in order to avoid migration failure. The performance over-
heads of live migration are affected by memory size, CPU, and workload types.
The model proposed in Reference 27 uses a greedy scheduling algorithm for data
transfers through different cloud data centers. This algorithm takes the transfer requests
in first-come-first-served order and sets a time interval in which they can be sent. This
interval is reserved on all the connections the packet has to go through (in this case, there
is a maximum of three hops to the destination, because of the full mesh infrastructure). This
is done until there are no more transfers to schedule, taking into account the previously
reserved time frames for each individual connection. The connections are treated individu-
ally to avoid bottlenecks. For instance, the connections between individual clouds need
to transfer more messages than connections inside the cloud. This way, even if the connec-
tion from a physical machine to a router is unused, the connection between the routers can
be oversaturated. There is no point in scheduling the migration until the transfers that are
currently running between the routers end, even if the connection to the router is unused.
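The greedy reservation scheme sketched in this paragraph can be written down compactly: each transfer, taken in arrival order, receives the earliest time slot that is free on every link of its (at most three-hop) path. The unit-length slots and link names below are illustrative assumptions, not the exact model of Reference 27.

```python
from collections import defaultdict

def greedy_reserve(transfers):
    """transfers: list of (request_id, [link, link, ...]) in arrival order.
    Each transfer occupies one unit-length time slot on every link of its path.
    Returns request_id -> reserved slot index (earliest slot free on all links)."""
    busy = defaultdict(set)          # link -> set of already reserved slot indices
    schedule = {}
    for request_id, path in transfers:
        slot = 0
        while any(slot in busy[link] for link in path):
            slot += 1                # first-come-first-served: probe later slots
        for link in path:
            busy[link].add(slot)
        schedule[request_id] = slot
    return schedule

if __name__ == "__main__":
    # Hypothetical paths: the two inter-cloud transfers share the router-router link r1-r2.
    requests = [
        ("t1", ["pm1-r1", "r1-r2", "r2-pm3"]),
        ("t2", ["pm2-r1", "r1-r2", "r2-pm4"]),   # waits: r1-r2 already reserved at slot 0
        ("t3", ["pm5-r2", "r2-pm6"]),            # intra-cloud, can start immediately
    ]
    print(greedy_reserve(requests))  # {'t1': 0, 't2': 1, 't3': 0}
```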
A general-purpose resource management approach in a cluster used for Big Data pro-
cessing should make some assumptions about policies that are incorporated in service-
level agreements. For example, interactive tasks, distributed and parallel applications, as
well as noninteractive batch tasks should all be supported with high performance. This
property is a straightforward one but, to some extent, difficult to achieve. Because tasks
have different attributes, their requirements to the scheduler may conflict in a shared
environment. For example, a real-time task requiring short-time response prefers space-
sharing scheduling; a noninteractive batch task requiring high throughput may prefer
time-sharing scheduling [6,29]. To be general purpose, a trade-off may have to be made.
The scheduling method focuses on parallel tasks, while providing an acceptable perfor-
mance to other kinds of tasks.
YARN has a pluggable solution for dynamic policy loading, considering two steps for
the resource allocation process (Figure 6.5). Resource allocation is done by YARN, and task
scheduling is done by the application, which permits the YARN platform to be a generic
one while still allowing flexibility in scheduling strategies. The specific policies in YARN
are oriented toward resource splitting according to the schedules provided by applications. In
this way, the YARN scheduler determines how much and in which cluster to allocate
resources based on their availability and on the configured sharing policy.
• Linear programming allows the scheduler to find suitable resources or clusters of
resources, based on defined constraints (a minimal sketch follows this item).
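As a minimal sketch of how such a constraint-based choice can be expressed as a linear program, the example below uses SciPy's linprog to split a batch of tasks across two clusters at minimum cost; the per-task costs, capacities, and the use of SciPy are illustrative assumptions.

```python
from scipy.optimize import linprog

# Decision variables: x[i] = number of tasks placed on cluster i (relaxed to reals).
cost_per_task = [2.0, 3.0]       # hypothetical per-task cost on each cluster
capacity = [70, 60]              # hypothetical cluster capacities (tasks)
total_tasks = 100

result = linprog(
    c=cost_per_task,                          # minimize total placement cost
    A_eq=[[1, 1]], b_eq=[total_tasks],        # every task must be placed somewhere
    bounds=[(0, capacity[0]), (0, capacity[1])],  # respect each cluster's capacity
)
print(result.x)    # optimal split, e.g. [70. 30.]
print(result.fun)  # minimum total cost
```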
generated for a MapReduce application. Last, but not least, execution of Big Data jobs has
to be scheduled and monitored. Instead of writing jobs or other code for scheduling, the Big
Data suite offers the possibility to define and manage execution plans in an easy way.
In a Big Data platform, Hadoop needs to integrate data from all different kinds of technolo-
gies and products. Besides files and SQL databases, Hadoop needs to integrate the NoSQL
databases, used in applications like social media such as Twitter or Facebook; the messages
from middleware or data from business-to-business (B2B) products such as Salesforce or
SAP products; the multimedia streams; and so forth. A Big Data suite integrated with
Hadoop offers connectors from all these different interfaces to Hadoop and back.
The main Hadoop-related projects, developed under the Apache license and supporting
Big Data application execution, include the following [35]:
6.8 CONCLUSIONS
In the past years, scheduling models and tools have been faced with fundamental re-
architecting to fit as well as possible with large many-task computing environments.
The way that existing solutions were redesigned consisted of splitting the tools into mul-
tiple components and adapting each component according to its new role and place.
Workloads and workflow sequences are scheduled at the application side, and then, a
resource manager allocates a pool of resources for the execution phase. So, the schedul-
ing became important at the same time for users and providers, being the most impor-
tant key for any optimal processing in Big Data science.
REFERENCES
1. M. D. Assuncao, R. N. Calheiros, S. Bianchi, M. A. Netto, and R. Buyya. Big Data comput-
ing and clouds: Challenges, solutions, and future directions. arXiv preprint arXiv:1312.4722,
2013. Available at http://arxiv.org/pdf/1312.4722v2.pdf.
2. A. Rasooli, and D. G. Down. COSHH: A classification and optimization based scheduler for
heterogeneous Hadoop systems. Future Generation Computer Systems, vol. 36, pp. 1–15, 2014.
3. M. Tim Jones. Scheduling in Hadoop. An introduction to the pluggable scheduler framework.
IBM Developer Works, Technical Report, December 2011. Available at https://www.ibm.com
/developerworks/opensource/library/os-hadoop-scheduling.
4. L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. C. M. Lau. Moving Big Data to the cloud:
An online cost-minimizing approach. Selected Areas in Communications, IEEE Journal on, vol. 31,
no. 12, pp. 2710–2721, 2013.
5. CISCO Report. Big Data Solutions on Cisco UCS Common Platform Architecture (CPA).
Available at http://www.cisco.com/c/en/us/solutions/data-center-virtualization/big-data/index
.html, May 2014.
6. V. Cristea, C. Dobre, C. Stratan, F. Pop, and A. Costan. Large-Scale Distributed Computing
and Applications: Models and Trends. IGI Global, Hershey, PA, pp. 1–276, 2010. doi:10.4018/
978-1-61520-703-9.
7. D. Pletea, F. Pop, and V. Cristea. Speculative genetic scheduling method for Hadoop envi-
ronments. In 2012 14th International Symposium on Symbolic and Numeric Algorithms for Scientific
Computing. SYNASC, pp. 281–286, 2012.
8. A.-M. Oprescu, T. Kielmann, and H. Leahu. Budget estimation and control for bag-of-tasks
scheduling in clouds. Parallel Processing Letters, vol. 21, no. 2, pp. 219–243, 2011.
9. M. Mao, J. Li, and M. Humphrey. Cloud auto-scaling with deadline and budget constraints.
In The 11th ACM/IEEE International Conference on Grid Computing (Grid 2010). Brussels, Belgium,
2010.
10. J. Yu, and R. Buyya. Scheduling scientific workflow applications with deadline and budget con-
straints using genetic algorithms. Scientific Programming Journal, vol. 14, nos. 3–4, pp. 217–230,
2006.
11. K. Liu, H. Jin, J. Chen, X. Liu, D. Yuan, and Y. Yang. A compromised-time-cost scheduling
algorithm in SwinDeW-C for instance-intensive cost-constrained workflows on a cloud com-
puting platform. International Journal of High Performance Computing Applications, vol. 24, no. 4,
pp. 445–456, 2010.
12. T. White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., Yahoo Press, Sebastopol, CA, 2012.
13. X. Hua, H. Wu, Z. Li, and S. Ren. Enhancing throughput of the Hadoop Distributed File
System for interaction-intensive tasks. Journal of Parallel and Distributed Computing, vol. 74, no. 8,
pp. 2770–2779, 2014. ISSN 0743-7315.
14. M. Zaharia, D. Borthakur, J. S. Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Job scheduling for
multi-user MapReduce clusters. Technical Report, EECS Department, University of California,
Berkeley, CA, April 2009. Available at http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS
-2009-55.html.
15. M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce per-
formance in heterogeneous environments. In Proceedings of OSDI ’08, the 8th USENIX
Conference on Operating Systems Design and Implementation. USENIX Association, Berkeley, CA,
pp. 29–42, 2008.
16. K. Kc, and K. Anyanwu. Scheduling Hadoop jobs to meet deadlines. In Proceedings of CLOUDCOM
’10, the 2010 IEEE Second International Conference on Cloud Computing Technology and
Science. IEEE Computer Society, Washington, DC, pp. 388–392, 2010.
17. Y. Tao, Q. Zhang, L. Shi, and P. Chen. Job scheduling optimization for multi-user MapReduce
clusters. In Parallel Architectures, Algorithms and Programming (PAAP), 2011 Fourth International
Symposium on, Tianjin, China, pp. 213, 217, December 9–11, 2011.
18. Corona. Under the Hood: Scheduling MapReduce jobs more efficiently with Corona, 2012.
Available at https://www.facebook.com/notes/facebook-engineering/under-the-hood-scheduling
-mapreduce-jobs-more-efficiently-with-corona/10151142560538920.
19. B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and
I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings
of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI ’11). USENIX
Association, Berkeley, CA, 22 p., 2011.
20. Hortonworks. Hadoop YARN: A next-generation framework for Hadoop data processing,
2013. Available at http://hortonworks.com/hadoop/yarn/.
21. J. Celaya, and U. Arronategui. A task routing approach to large-scale scheduling. Future
Generation Computer Systems, vol. 29, no. 5, pp. 1097–1111, 2013.
22. Y. Xu, A. Suarez, and M. Zhao. IBIS: Interposed big-data I/O scheduler. In Proceedings of the
22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC ’13).
ACM, New York, pp. 109–110, 2013.
23. J. Hall, J. Hartline, A. R. Karlin, J. Saia, and J. Wilkes. On algorithms for efficient data migra-
tion. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’01).
Society for Industrial and Applied Mathematics, Philadelphia, PA, pp. 620–629, 2001.
24. A. Stage, and T. Setzer. Network-aware migration control and scheduling of differentiated
virtual machine workloads. In Proceedings of the 2009 ICSE Workshop on Software Engineering
Challenges of Cloud Computing (CLOUD ’09). IEEE Computer Society, Washington, DC, pp. 9–14,
2009.
25. G. Jung, N. Gnanasambandam, and T. Mukherjee. Synchronous parallel processing of big-
data analytics services to optimize performance in federated clouds. In Proceedings of the 2012
IEEE Fifth International Conference on Cloud Computing (CLOUD ’12). IEEE Computer Society,
Washington, DC, pp. 811–818, 2012.
26. K. Ye, X. Jiang, D. Huang, J. Chen, and B. Wang. Live migration of multiple virtual machines
with resource reservation in cloud computing environments. In Proceedings of the 2011 IEEE 4th
International Conference on Cloud Computing (CLOUD ’11). IEEE Computer Society, Washington,
DC, pp. 267–274, 2011.
27. M.-C. Nita, C. Chilipirea, C. Dobre, and F. Pop. A SLA-based method for big-data transfers with
multi-criteria optimization constraints for IaaS. In Roedunet International Conference (RoEduNet),
2013 11th, Sinaia, Romania, pp. 1–6, January 17–19, 2013.
28. R. Van den Bossche, K. Vanmechelen, and J. Broeckhove. Online cost-efficient scheduling of
deadline-constrained workloads on hybrid clouds. Future Generation Computer Systems, vol. 29,
no. 4, pp. 973–985, 2013.
29. H. Karatza. Scheduling in distributed systems. In Performance Tools and Applications to Networked
Systems, M. C. Calzarossa and E. Gelenbe (Eds.), Springer Berlin, Heidelberg, pp. 336–356,
2004. ISBN: 978-3-540-21945-3.
30. F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang. Multi-objective scheduling of many tasks in
cloud platforms. Future Generation Computer Systems, vol. 37, pp. 309–320.
31. K. Elmeleegy. Piranha: Optimizing short jobs in Hadoop. Proceedings of the VLDB Endowment,
vol. 6, no. 11, pp. 985–996, 2013.
32. R. Ramakrishnan, and Team Members CISL. Scale-out beyond MapReduce. In Proceedings of
the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’13),
I. S. Dhillon, Y. Koren, R. Ghani, T. E. Senator, P. Bradley, R. Parekh, J. He, R. L. Grossman, and
R. Uthurusamy (Eds.). ACM, New York, 1 p., 2013.
33. L. Wang, J. Tao, R. Ranjan, H. Marten, A. Streit, J. Chen, and D. Chen. G-Hadoop: MapReduce
across distributed data centers for data-intensive computing. Future Generation Computer Systems,
vol. 29, no. 3, pp. 739–750, 2013.
34. B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam. To move or not to move: The economics of
cloud computing. In Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing
(HotCloud ’11). USENIX Association, Berkeley, CA, 5 p., 2011.
35. D. Loshin. Chapter 7—Big Data tools and techniques. In Big Data Analytics, D. Loshin (Ed.),
Morgan Kaufmann, Boston, pp. 61–72, 2013.
CHAPTER 7
Time–Space Scheduling in
the MapReduce Framework
Zhuo Tang, Ling Qi, Lingang Jiang, Kenli Li, and Keqin Li
CONTENTS
7.1 Introduction 122
7.2 Overview of Big Data Processing Architecture 123
7.3 Self-Adaptive Reduce Task Scheduling 125
7.3.1 Problem Analysis 125
7.3.2 Runtime Analysis of MapReduce Jobs 126
7.3.3 A Method of Reduce Task Start-Time Scheduling 127
7.4 Reduce Placement 129
7.4.1 Optimal Algorithms for Cross-Rack Communication Optimization 129
7.4.2 Locality-Aware Reduce Task Scheduling 130
7.4.3 MapReduce Network Traffic Reduction 130
7.4.4 The Source of MapReduce Skews 131
7.4.5 Reduce Placement in Hadoop 131
7.5 NER in Biomedical Big Data Mining: A Case Study 132
7.5.1 Biomedical Big Data 132
7.5.2 Biomedical Text Mining and NER 133
7.5.3 MapReduce for CRFs 133
7.6 Concluding Remarks 136
References 136
ABSTRACT
As data are the basis of information systems, using Hadoop to rapidly extract useful
information from massive data of an enterprise has become an efficient method for
programmers in the process of application development. This chapter introduces the
MapReduce framework, an excellent distributed and parallel computing model. For
the increasing data and cluster scales, to avoid scheduling delays, scheduling skews,
poor system utilization, and low degrees of parallelism, some improved methods
that focus on the time and space scheduling of reduce tasks in MapReduce are pro-
posed in this chapter. Through analyzing the MapReduce scheduling mechanism,
this chapter first illustrates the reasons for system slot resource wasting, which results
in reduce tasks waiting around, and proposes a method for determining the start
times of reduce tasks dynamically according to each job context. And
then, in order to mitigate network traffic and improve the performance of Hadoop,
this chapter addresses several optimizing algorithms to solve the problems of reduce
placement. It makes a Hadoop reduce task scheduler aware of partitions’ network
locations and sizes. Finally, as the implementation, a parallel biomedical data pro-
cessing model using the MapReduce framework is presented as an application of the
proposed methods.
7.1 INTRODUCTION
Data are representations of information; the information content of data is generally
believed to be valuable, and data form the basis of information systems. Using comput-
ers to process data and extract information is a basic function of information systems. In
today’s highly information-oriented society, the web can be said to be currently the largest
information system, of which the data are massive, diverse, heterogeneous, and dynami-
cally changing. Using Hadoop to rapidly extract useful information from the massive data
of an enterprise has become an efficient method for programmers in the process of applica-
tion development.
The significance of Big Data is to analyze people’s behavior, intentions, and preferences
in the growing and popular social networks. It is also to process data with nontraditional
structures and to explore their meanings. Big Data is often used to describe a company’s
large amount of unstructured and semistructured data. Loading these data into a relational
database for analysis would require too much time and money. Big Data
analysis and cloud computing are often linked together, because real-time analysis of large
data requires a framework similar to MapReduce to assign work to hundreds or even thou-
sands of computers. After several years of criticism, questioning, discussion, and specula-
tion, Big Data finally ushered in its own era.
Hadoop presents MapReduce as an analytics engine, and under the hood, it uses a
distributed storage layer referred to as the Hadoop distributed file system (HDFS). As an
open-source implementation of MapReduce, Hadoop is, so far, one of the most successful
realizations of large-scale data-intensive cloud computing platforms. It has been realized
that when and where to start the reduce tasks are the key problems to enhance MapReduce
performance.
For time scheduling in MapReduce, existing work may result in the blocking of reduce
tasks. Especially when the map tasks’ output is large, the performance of a MapReduce
task scheduling algorithm will be influenced seriously. Through analysis of the cur-
rent MapReduce scheduling mechanism, Section 7.3 illustrates the reasons for system
slot resource wasting, which results in reduce tasks waiting around. Then, the section
proposes a self-adaptive reduce task scheduling policy for reduce tasks’ start times in
the Hadoop platform. It can decide the start time point of each reduce task dynamically
according to each job context, including the task completion time and the size of the map
output.
Amazon, Facebook, and so on. Because of the short development time, Hadoop can be
improved in many aspects, such as the problems of intermediate data management and
reduce task scheduling [4].
As shown in Figure 7.1, map and reduce are the two stages in a MapReduce scheduling
algorithm. In Hadoop, each reduce task contains three function phases, that is, copy, sort, and
reduce [5]. The goal of the copy phase is to read the map tasks’ output. The sort phase is
to sort the intermediate data, which are the output from map tasks and will be the input
to the reduce phase. Finally, the eventual results are produced through the reduce phase,
where the copy and sort phases are to preprocess the input data of the reduce phase. In real
applications, copying and sorting may cost a considerable amount of time, especially in
the copy phase. In the theoretical model, the reduce functions start only if all map tasks
are finished [6]. However, in the Hadoop implementation, all copy actions of reduce tasks
will start when the first map task is finished [7]. But for the duration of the slot, if there is any map
task still running, the copy actions will wait around. This will lead to the waste of reduce
slot resources.
In traditional MapReduce scheduling, reduce tasks should start when all the map
tasks are completed. In this way, the output of map tasks should be read and written
to the reduce tasks in the copy process [8]. However, through the analysis of the slot
resource usage in the reduce process, this chapter focuses on the slot idling and delay. In
particular, when the map tasks’ output becomes large, the performance of MapReduce
scheduling algorithms will be influenced seriously [9]. When multiple tasks are running,
inappropriate scheduling of the reduce tasks will lead to the situation where other jobs
in the system cannot be scheduled in a timely manner. These are the stumbling blocks of
Hadoop popularization.
A user needs to supply two functions in the Hadoop framework, that is, a mapper and
a reducer, to process data. Mappers produce a set of files and send them to all the reducers.
Reducers will receive files from all the mappers, which is an all-to-all communication
model. Hadoop runs in a data center environment in which machines are organized in racks.
FIGURE 7.1 Data flow in the MapReduce framework: the master coordinates mappers and reducers, which read input blocks from the DFS and write the final output.
FIGURE 7.2 Performance of the policies with respect to various graph sizes. (a) Job2 map tasks
finished. (b) Job2 reduce tasks finished. (c) Job3 submitted.
The root cause of this problem is that the reduce task of Job3 must wait for all the reduce
tasks of Job1 to be completed, as Job1 takes up all the reduce slots and the Hadoop system does
not support preemption by default. In the early algorithm design, a reduce task can be
scheduled once any map task is finished [12]. One of the benefits is that the reduce tasks
can copy the output of the map tasks as soon as possible. But reduce tasks will have to wait
before all map tasks are finished, and the pending tasks will always occupy the slot resources,
so that other jobs that finish their map tasks cannot start their reduce tasks. All in all, this will
result in long waiting of reduce tasks and will greatly increase the delay of Hadoop jobs.
In practical applications, there are often different jobs running in a shared cluster envi-
ronment, which are from multiple users at the same time. If the situation described in
Figure 7.2 appears among the different users at the same time, and the reduce slot resources
are occupied for a long time, the submitted jobs from other users will not be pushed ahead
until the slots are released. Such inefficiency will extend the average response time of a
Hadoop system, lower the resource utilization rate, and affect the throughput of a Hadoop
cluster.
are ordered by merging, and the data are distributed in different storage locations. One
type is the data in memory. When the data are read from the various maps at the same
time, the data sets with the same keys should be merged. The other type is data in the circular
buffer. Because the memory belonging to the reduce task is limited, the data in the buffer
should be written to disk regularly in advance.
In this way, the data that were written to disk earlier need to be merged later, in a so-
called external sort. The external sort may need to be executed several times if the number
of map tasks is large in practical workloads. The copy and sort processes are customarily
called the shuffle phase. Finally, after finishing the copy and sort processes, the subsequent
functions start, and the reduce tasks can be scheduled to the compute nodes.
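The merging of sorted spill data described above is a classic external (k-way) merge. The sketch below shows the idea with in-memory lists standing in for spill files and groups the values of each key as a reduce function would receive them; it illustrates the technique rather than Hadoop's actual merge code.

```python
import heapq
from itertools import groupby

def external_merge(spills):
    """Merge several key-sorted spill 'files' (here: lists of (key, value) pairs)
    into one sorted stream, then group values by key for the reduce function."""
    merged = heapq.merge(*spills, key=lambda kv: kv[0])   # lazy k-way merge
    for key, group in groupby(merged, key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

if __name__ == "__main__":
    spill_1 = [("apple", 1), ("cat", 2)]                  # each spill is sorted by key
    spill_2 = [("apple", 3), ("bird", 1), ("cat", 1)]
    for key, values in external_merge([spill_1, spill_2]):
        print(key, values)   # apple [1, 3] / bird [1] / cat [2, 1]
```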
The denotations in Equation 7.1 are defined as follows. FTlm is the completion time of
the last map task; STfm is the start time of the first map task; FTcp is the finish time of the
copy phase; FTlr is the finish time of reduce; and STsr is the start time of reduce sort.
In Figure 7.3, t1 is the start time of Map1, Map2, and the reduce task. During t1 to t3,
the main work of the reduce task is to copy the output from Map1 to Map14. he output of
Map15 and Map16 will be copied by the reduce task from t3 to t4. The duration from t4 to
t5 is called the sort stage, which ranks the intermediate results according to the key values.
The reduce function is called at time t5, which continues from t5 to t6. Because during t1
to t3, in the copy phase, the reduce task only copies the output data intermittently, once
any map task is completed, for the most part, it is always waiting around. We hope to have
the copy operations completed within a concentrated duration, which can decrease the waiting
time of the reduce tasks.
As Figure 7.4 shows, if we can start the reduce tasks at t2′, which can be calculated using
the following equations, and make sure these tasks can be finished before t6, then during t1
to t2′, the slots can be used by any other reduce tasks. But if we let the copy operation start
at t3, because the output of all map tasks should be copied from t3, delay will be produced
in this case. As shown in Figure 7.3, the copy phase starts at t2, which just collects the out-
put of the map tasks intermittently. By contrast, the reduce task’s waiting time is decreased
obviously in Figure 7.4, in which case the copy operations are started at t2′.
The SARS algorithm works by delaying the reduce processes. The reduce tasks are sched-
uled when part but not all of the map tasks are finished. For a specific key value, if we assume
that there are s map slots and m map tasks in the current system, and the completion time
and the size of output data of each map task are denoted as t_mapi and m_outj, respectively,
where i, j ∈ [1,m], then the total output data size of the map tasks can be calculated as
N_m = \sum_{j=1}^{m} m\_out_j,   j ∈ [1, m].   (7.2)
In order to predict the time required to transmit the data, we define the speed of the data
transmission from the map tasks to the reduce tasks as transSpeed in the cluster environ-
ment, and the number of concurrent copy threads with reduce tasks is denoted as copyThread.
We denote the start time of the first map task and the first reduce task as startmap
and startreduce, respectively. Therefore, the optimal start time of reduce tasks can be deter-
mined by the following equation:
$$startreduce = startmap + \frac{1}{s}\sum_{i=1}^{m} t\_map_i - \frac{N\_m}{transSpeed \times copyThread}. \tag{7.3}$$
As shown by time t2′ in Figure 7.4, the most appropriate start time of a reduce task is
when all the map tasks for the same key are finished, which lies between the time when
the first map is started and the time when the last map is finished. The second item in Equation 7.3
denotes the required time of the map tasks, and the third item is the time for data transmission.
Because the reduce tasks will be started before the copy processes, this time cost
should be subtracted from the map tasks' completion time. The waiting around of the reduce tasks
may prevent jobs in need of the slot resources from working normally. Through adjusting
the reduce scheduling time, this method can decrease the time wasted in the data replication
process and improve the utilization of the reduce slot resources effectively. The improvement
of these policies is especially important for CPU-bound jobs. For jobs that need more
CPU computing, the data input/output (I/O) of the tasks is smaller, so more slot resources would be
wasted under the default scheduling algorithm.
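Because Equation 7.3 is only partially legible in the source, the following Python sketch illustrates one plausible reading of the SARS start-time estimate: delay the reduce start by the estimated map-phase duration (m map tasks sharing s slots) minus the time needed to copy the N_m bytes of intermediate data at transSpeed with copyThread concurrent copiers. The function, variable names, and numbers are assumptions for illustration only.

def sars_reduce_start_time(start_map, map_times, map_output_sizes,
                           map_slots, trans_speed, copy_threads):
    """Estimate a delayed reduce start time in the spirit of SARS:
    wait roughly until the map phase is about to finish, minus the
    time the copy phase needs to fetch the intermediate data."""
    # Estimated map-phase duration when m map tasks share `map_slots` slots.
    map_phase_time = sum(map_times) / map_slots
    # Total intermediate data produced by the map tasks (Equation 7.2).
    n_m = sum(map_output_sizes)
    # Time to transfer the intermediate data with `copy_threads` parallel copiers.
    copy_time = n_m / (trans_speed * copy_threads)
    return start_map + max(0.0, map_phase_time - copy_time)

# Hypothetical job: 8 map tasks on 4 slots, ~100 MB of map output in total.
start = sars_reduce_start_time(
    start_map=0.0,
    map_times=[30, 28, 35, 31, 29, 33, 30, 32],   # seconds per map task
    map_output_sizes=[12.5e6] * 8,                # bytes per map task
    map_slots=4,
    trans_speed=50e6,                             # bytes/s per copy thread
    copy_threads=5,
)
print(f"Schedule reduce tasks at t = {start:.1f} s")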
begins with a map phase, where each input split is processed in parallel, and a random sample
of the required size will be produced. The split of samples is submitted to the auditor group
(AG); meanwhile, the master and map tasks will wait for the results of the auditor.
AG—The AG carries out a statistical and predictive test to calculate the distribution of the
reduce tasks and then starts the reduce VM [23] at the appropriate place in the PM. The
AG will receive several samples and then will assign to them its members that contain map and
reduce tasks. The distribution of intermediate key/value pairs, which adopts a hashing
approach to distribute the load to the reducers, will be computed in the reduces.
Placement of reduce virtual machine (VM)—The results of the AG decide the placement
of the reduce VMs. For example, in Figure 7.5, if 80% of the key/value pairs of Reduce1 come from
Map2 and the remaining intermediate results are from Map1, the VM of Reduce1 will be
started in the physical machine (PM) that contains the VM of Map2. Similarly, the VM of
Reduce2 will be started in the PM that includes the VM of Map1.
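The placement rule can be summarized with a small sketch (the data structures and names below are hypothetical): each reduce VM is started on the physical machine whose map VM contributes the largest share of its intermediate key/value pairs, as estimated by the AG.

def place_reduce_vms(contribution, map_vm_host):
    """For each reduce task, pick the physical machine hosting the map VM
    that contributes the largest fraction of its intermediate pairs."""
    placement = {}
    for reduce_id, shares in contribution.items():
        dominant_map = max(shares, key=shares.get)
        placement[reduce_id] = map_vm_host[dominant_map]
    return placement

# Hypothetical AG estimate: fraction of each reduce's input coming from each map.
contribution = {
    "Reduce1": {"Map1": 0.2, "Map2": 0.8},
    "Reduce2": {"Map1": 0.7, "Map2": 0.3},
}
map_vm_host = {"Map1": "PM-A", "Map2": "PM-B"}
print(place_reduce_vms(contribution, map_vm_host))
# {'Reduce1': 'PM-B', 'Reduce2': 'PM-A'}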
data contain a wealth of biological information, including significant gene expression situations
and protein–protein interactions. What is more, a disease network, which contains
hidden information associated with the disease and gives biomedical scientists a basis for
hypothesis generation, is constructed based on disease relationship mining in these biomedical
data.
However, the most basic requirements for biomedical Big Data processing are difficult
to meet efficiently. For example, key word searching in biomedical Big Data or on the Internet
can only find long lists of relevant files, and the accuracy is not high, so a lot of valuable
information contained in the text cannot be directly shown to people.
model is widely used in natural language processing (NLP), including NER, part-of-speech
tagging, and so on.
Figure 7.6 shows the CRF model, which computes the conditional probability $P(y \mid x)$
of an output sequence $y = (y_1, y_2, \ldots, y_n)$ under the condition of a given input sequence
$x = (x_1, x_2, \ldots, x_n)$.
A linear CRF, which is used in Bio-NER, is as follows:
$$P(y \mid x) = \frac{1}{Z(x)} \cdot \exp\left( \sum_{i=1}^{n} \sum_{k=1}^{K} \lambda_k f_k(x, i, y_{i-1}, y_i) \right), \tag{7.4}$$
where
$$Z(x) = \sum_{y} \exp\left( \sum_{i=1}^{n} \sum_{k=1}^{K} \lambda_k f_k(x, i, y_{i-1}, y_i) \right), \tag{7.5}$$
i is the position in the input sequence $x = (x_1, x_2, \ldots, x_n)$, $\lambda_k$ is the weight of a feature that does
not depend on the location i, and $\{f_k(x, i, y_{i-1}, y_i)\}_{k=1}^{K}$ are feature functions.
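For a toy-scale illustration of Equations 7.4 and 7.5, the Python sketch below computes P(y | x) by brute-force enumeration of all label sequences over a two-label set; the feature functions, weights, and labels are invented for illustration and are far simpler than real Bio-NER feature sets.

import math
from itertools import product

LABELS = ["O", "B-protein"]          # toy label set

def feature_functions():
    """Two hypothetical binary feature functions f_k(x, i, y_prev, y_cur)."""
    def f1(x, i, y_prev, y_cur):     # current token capitalized and tagged as entity
        return 1.0 if x[i][0].isupper() and y_cur == "B-protein" else 0.0
    def f2(x, i, y_prev, y_cur):     # transition O -> O
        return 1.0 if y_prev == "O" and y_cur == "O" else 0.0
    return [f1, f2]

def score(x, y, weights, feats):
    """Inner sum of Equation 7.4: sum_i sum_k lambda_k * f_k(x, i, y_{i-1}, y_i)."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "START"
        total += sum(w * f(x, i, y_prev, y[i]) for w, f in zip(weights, feats))
    return total

def crf_probability(x, y, weights):
    feats = feature_functions()
    numerator = math.exp(score(x, y, weights, feats))
    # Z(x), Equation 7.5: sum over all possible label sequences y'.
    z = sum(math.exp(score(x, y_alt, weights, feats))
            for y_alt in product(LABELS, repeat=len(x)))
    return numerator / z

x = ["The", "IL-2", "gene"]
y = ["O", "B-protein", "O"]
print(f"P(y | x) = {crf_probability(x, y, [1.5, 0.8]):.4f}")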
For the training process of the CRF model, the main purpose is to seek the parameter
$\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_K)$ that is most in accordance with the training data $T = \{(x_i, y_i)\}_{i=1}^{N}$. Presuming
every $(x, y)$ is independently and identically distributed, the parameter is generally obtained
in this way:
$$L(\lambda) = \sum_{(x, y) \in T} \log P(y \mid x). \tag{7.6}$$
When the log-likelihood function L(λ) reaches its maximum value, the parameter
is close to the best. However, there is no closed-form solution for the parameter that maximizes
the training data likelihood. Hence, we adopt parameter estimation, that is, the
limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm [28], to find
the optimum solution.
(Figure 7.6: the linear-chain CRF model, with the output sequence y1, y2, …, yn conditioned on the input sequence x = (x1, x2, …, xn).)
To find the parameter $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_K)$ that makes the convex function L(λ) reach its maximum,
the L-BFGS algorithm drives its gradient vector $\nabla L = \left( \frac{\partial L}{\partial \lambda_1}, \frac{\partial L}{\partial \lambda_2}, \ldots, \frac{\partial L}{\partial \lambda_K} \right)$ to 0 by iterative
computations, starting from the initial value λ0 = 0. Research shows that the first step, that is, to
calculate ∇Li, which represents the gradient vector in iteration i, is the most time consuming.
Therefore, we focus on the optimized improvement of this step.
Every component in ∇Li is computed as follows:
$$\frac{\partial L(\lambda)}{\partial \lambda_k} = \sum_{(x, y) \in T} \left( \sum_{i=1}^{n} f_k(x, i, y_{i-1}, y_i) - \sum_{y'} P(y' \mid x) \sum_{i=1}^{n} f_k(x, i, y'_{i-1}, y'_i) \right) - \frac{\lambda_k}{\sigma^2}. \tag{7.7}$$
The contribution of every ordered pair $(x, y)$ within the sum over T is mutually independent. So we
can calculate the difference between $\sum_{i=1}^{n} f_k(x, i, y_{i-1}, y_i)$ and $\sum_{y'} P(y' \mid x) \sum_{i=1}^{n} f_k(x, i, y'_{i-1}, y'_i)$
on each of the input sequences in the training set T and then put the results of all the
sequences together. As a result, they can be computed in parallel as shown in Figure 7.7.
We split the calculation inside the sum over T into several map tasks and summarize the
results by a reduce task. The subtraction of the penalty term $\frac{\lambda_k}{\sigma^2}$ is designed to be the
postprocessing.
In the actual situation, it is impossible to schedule one map task for each ordered pair
$(x, y)$ because the number of ordered pairs in a large-scale set of training samples is too
huge to handle effectively. We must partition the training data T into several small
parts and then start the MapReduce plan as described in the two paragraphs above in this
section.
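The map/reduce decomposition of Equation 7.7 can be sketched as follows; this is a simplified single-process illustration, not the authors' Hadoop implementation, and the expectation functions, weights, and chunking are hypothetical stand-ins.

import numpy as np

def map_partial_gradient(chunk, weights, empirical_expectation, model_expectation):
    """Partial gradient over one chunk of training sequences: for every (x, y)
    in the chunk, add E_empirical[f_k] - E_model[f_k] (the bracket in Eq. 7.7)."""
    partial = np.zeros_like(weights)
    for x, y in chunk:
        partial += empirical_expectation(x, y) - model_expectation(x, weights)
    return partial

def reduce_sum(partials):
    """Reduce step: sum the partial gradient vectors from all map tasks."""
    return np.sum(partials, axis=0)

def postprocess(gradient, weights, sigma):
    """Postprocessing: subtract the Gaussian penalty term lambda_k / sigma^2."""
    return gradient - weights / sigma**2

# Hypothetical stand-ins for the CRF expectation computations (K = 3 features).
K = 3
emp = lambda x, y: np.ones(K)            # empirical feature counts for (x, y)
mod = lambda x, w: np.full(K, 0.6)       # model-expected feature counts for x
weights = np.array([0.5, -0.2, 1.0])

chunks = [[("seq1", "tags1"), ("seq2", "tags2")],   # chunk handled by map task 1
          [("seq3", "tags3")]]                      # chunk handled by map task 2
partials = [map_partial_gradient(c, weights, emp, mod) for c in chunks]
gradient = postprocess(reduce_sum(partials), weights, sigma=10.0)
print(gradient)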
For a MapReduce Bio-NER application, data skew leads to an uneven load in the whole
system. Any specific corpus has its own uneven distribution of entities (as shown in
Table 7.1), resulting in a serious data skew problem. Moreover, protean, artificially defined
feature sets exacerbate the problem in both the training and inference processes.
(Figure 7.7: parallel computation of Equation 7.7. Map tasks compute, for each (x, y) in T, the difference between the expectation of fk under the empirical distribution and its expectation under the model distribution for a given x; a reduce task sums the results, and the penalty term λk/σ2 is handled in postprocessing.)
TABLE 7.1 Proportion of Each Type of Entity in the Corpus of the Joint Workshop on
Natural Language Processing in Biomedicine and Its Applications (JNLPBA2004)
              Protein   DNA      RNA     Cell Line   Cell Type
Training Set  59.00%    18.58%   1.85%   7.47%       13.10%
Test Set      58.50%    12.19%   1.36%   5.77%       22.18%
Combined with the schemes given in this chapter, the uneven load can be resolved based on
the modified Hadoop MapReduce. The implementation will further improve system performance
on MapReduce with time–space scheduling.
REFERENCES
1. J. Tan, S. Meng, X. Meng, L. Zhang. Improving reduce task data locality for sequential
MapReduce jobs. International Conference on Computer Communications (INFOCOM), 2013
Proceedings IEEE, April 14–19, 2013, pp. 1627–1635.
2. J. Dean, S. Ghemawat. MapReduce: Simplified data processing on large clusters, Communications
of the ACM—50th Anniversary Issue: 1958–2008, 2008, Volume 51, Issue 1, pp. 137–150.
3. X. Gao, Q. Chen, Y. Chen, Q. Sun, Y. Liu, M. Li. A dispatching-rule-based task scheduling pol-
icy for MapReduce with multi-type jobs in heterogeneous environments. 2012 7th ChinaGrid
Annual Conference (ChinaGrid), September 20–23, 2012, pp. 17–24.
4. J. Xie, F. Meng, H. Wang, H. Pan, J. Cheng, X. Qin. Research on scheduling scheme for Hadoop
clusters. Procedia Computer Science, 2013, Volume 18, pp. 2468–2471.
5. Z. Tang, M. Liu, K. Q. Li, Y. Xu. A MapReduce-enabled scientific workflow framework with opti-
mization scheduling algorithm. 2012 13th International Conference on Parallel and Distributed
Computing, Applications and Technologies (PDCAT), December 14–16, 2012, pp. 599–604.
6. F. Ahmad, S. Lee, M. Thottethodi, T. N. Vijaykumar. MapReduce with communication over-
lap (MaRCO). Journal of Parallel and Distributed Computing, 2013, Volume 73, Issue 5,
pp. 608–620.
7. M. Lin, L. Zhang, A. Wierman, J. Tan. Joint optimization of overlapping phases in MapReduce.
Performance Evaluation, 2013, Volume 70, Issue 10, pp. 720–735.
8. Y. Luo, B. Plale. Hierarchical MapReduce programming model and scheduling algorithms. 12th
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), May
13–16, 2012, pp. 769–774.
9. H. Mohamed, S. Marchand-Maillet. MRO-MPI: MapReduce overlapping using MPI and an
optimized data exchange policy. Parallel Computing, 2013, Volume 39, Issue 12, pp. 851–866.
10. Z. Tang, L. G. Jiang, J. Q. Zhou, K. L. Li, K. Q. Li. A self-adaptive scheduling algorithm for reduce
start time. Future Generation Computer Systems, 2014. Available at http://dx.doi.org/10.1016/j
.future.2014.08.011.
11. D. Linderman, D. Collins, H. Wang, H. Meng. Merge: A programming model for heteroge-
neous multi-core systems. ASPLOSXIII Proceedings of the 13th International Conference on
Architectural Support for Programming Languages and Operating Systems, March 2008, Volume
36, Issue 1, pp. 287–296.
12. B. Palanisamy, A. Singh, L. Liu, B. Langston. Cura: A cost-optimized model for MapReduce in
a cloud. IEEE International Symposium on Parallel and Distributed Processing (IPDPS), IEEE
Computer Society, May 20–24, 2013, pp. 1275–1286.
13. L. Ho, J. Wu, P. Liu. Optimal algorithms for cross-rack communication optimization in
MapReduce framework. 2011 IEEE International Conference on Cloud Computing (CLOUD),
July 4–9, 2011, pp. 420–427.
14. M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, A. Goldberg. Quincy: Fair schedul-
ing for distributed computing clusters. Proceedings of the ACM SIGOPS 22nd Symposium on
Operating Systems Principles (SOSP), October 11–14, 2009, pp. 261–276.
15. Wikipedia, Greedy algorithm [EB/OL]. Available at http://en.wikipedia.org/wiki/Greedy
_algorithm, September 14, 2013.
16. D. D. Sleator, R. E. Tarjan. Self-adjusting binary search trees. Journal of the ACM (JACM), 1985,
Volume 32, Issue 3, pp. 652–686.
17. M. Hammoud, M. F. Sakr. Locality-aware reduce task scheduling for MapReduce. Cloud
Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on,
November 29–December 1, 2011, pp. 570–576.
18. S. Huang, J. Huang, J. Dai, T. Xie, B. Huang. The HiBench benchmark suite: Characterization
of the MapReduce-based data analysis. Data Engineering Workshops (ICDEW), IEEE 26th
International Conference on, March 1–6, 2010, pp. 41–51.
19. M. Hammoud, M. S. Rehman, M. F. Sakr. Center-of-gravity reduce task scheduling to lower
MapReduce network traffic. Cloud Computing (CLOUD), IEEE 5th International Conference on,
June 24–29, 2012, pp. 49–58.
20. Y. C. Kwon, M. Balazinska, B. Howe, J. Rolia. Skew-resistant parallel processing of feature-
extracting scientific user-defined functions. Proceedings of the 1st ACM Symposium on Cloud
Computing (SoCC), June 2010, pp. 75–86.
21. P. Dhawalia, S. Kailasam, D. Janakiram. Chisel: A resource savvy approach for handling skew
in MapReduce applications. Cloud Computing (CLOUD), IEEE Sixth International Conference
on, June 28–July 3, 2013, pp. 652–660.
22. R. Grover, M. J. Carey. Extending Map-Reduce for efficient predicate-based sampling. Data
Engineering (ICDE), 2012 IEEE 28th International Conference on, April 1–5, 2012, pp. 486–497.
23. S. Ibrahim, H. Jin, L. Lu, L. Qi, S. Wu, X. Shi. Evaluating MapReduce on virtual machines:
The Hadoop case. Cloud Computing, Lecture Notes in Computer Science, 2009, Volume 5931,
pp. 519–528.
24. Wikipedia, MEDLINE [EB/OL]. Available at http://en.wikipedia.org/wiki/MEDLINE,
September 14, 2013.
25. J. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, N. Collier. Introduction to the bio-entity recognition
task at JNLPBA. Proceedings of the International Joint Workshop on Natural Language Processing
in Biomedicine and Its Applications (JNLPBA), August 2004, pp. 70–75.
26. L. Li, R. Zhou, D. Huang. Two-phase biomedical named entity recognition using CRFs.
Computational Biology and Chemistry, 2009, Volume 33, Issue 4, pp. 334–338.
27. J. Lafferty, A. McCallum, F. Pereira. Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. Proceedings of the 18th International Conference on
Machine Learning (ICML), June 28–July 1, 2001, pp. 282–289.
28. D. Liu, J. Nocedal. On the limited memory BFGS method for large scale optimization.
Mathematical Programming, 1989, Volume 45, Issue 1–3, pp. 503–528.
CHAPTER 8
CONTENTS
8.1 Introduction 140
8.2 Related Infrastructures 141
8.3 GEMS Overview 142
8.4 GMT Architecture 146
8.4.1 GMT: Aggregation 148
8.4.2 GMT: Fine-Grained Multithreading 149
8.5 Experimental Results 150
8.5.1 Synthetic Benchmarks 151
8.5.2 BSBM 152
8.5.3 RDESC 153
8.6 Conclusions 155
References 155
ABSTRACT
Many fields require organizing, managing, and analyzing massive amounts of data.
Among them, we can find social network analysis, financial risk management, threat
detection in complex network systems, and medical and biomedical databases. For
these areas, there is a problem not only in terms of size but also in terms of performance,
because the processing should happen sufficiently fast to be useful. Graph
databases appear to be a good candidate to manage these data: They provide an
efficient data structure for heterogeneous data or data that are not themselves rigidly
structured. However, exploring large-scale graphs on modern high-performance
machines is challenging. These systems include processors and networks optimized
for regular, floating-point intensive computations and large, batched data transfers.
In contrast, exploring graphs generates fine-grained, unpredictable memory
and network accesses, is mostly memory bound, and is synchronization intensive.
Furthermore, graphs often are difficult to partition, making their processing prone
to load imbalance.
In this book chapter, we describe the Graph Engine for Multithreaded Systems (GEMS),
a full software stack that implements a graph database on a commodity cluster and
enables scaling in data set size while maintaining a constant query throughput when
adding more cluster nodes. GEMS employs the SPARQL (SPARQL Protocol and RDF
Query Language) language for querying the graph database. The GEMS software
stack comprises a SPARQL-to-data parallel C++ compiler; a library of distributed
data structures; and a custom, multithreaded runtime system.
We provide an overview of the stack, describe its advantages compared with other
solutions, and focus on how we solved the challenges posed by irregular behaviors.
We finally propose an evaluation of GEMS on a typical SPARQL benchmark and
on a Resource Description Framework (RDF) data set currently curated at Pacific Northwest
National Laboratory.
8.1 INTRODUCTION
Many very diverse application areas are experiencing an explosive increase in the avail-
ability of data to process. hey include inance; science ields such as astronomy, biology,
genomics, climate and weather, and material sciences; the web; geographical systems;
transportation; telephones; social networks; and security. he data sets of these applica-
tions have quickly surpassed petabytes in size and keep exponentially growing. Enabling
computing systems to process these large amounts of data, without compromising perfor-
mance, is becoming of paramount importance. Graph databases appear a most natural and
convenient way to organize, represent, and store the data of these applications, as they usu-
ally contain a large number of relationships among their elements, providing signiicant
beneits with respect to relational databases.
Graph databases organize the data in the form of subject–predicate–object triples following
the Resource Description Framework (RDF) data model (Klyne et al. 2004). A set
of triples naturally represents a labeled, directed multigraph. An analyst can query semantic
graph databases through languages such as SPARQL (W3C SPARQL Working Group
2013), in which the fundamental query operation is graph matching. This is different from
conventional relational databases that employ schema-specific tables to store data and
perform select and conventional join operations when executing queries. With relational
approaches, graph-oriented queries on large data sets can quickly become unmanageable
in both space and time due to the large sizes of the intermediate results created when performing
conventional joins.
for which multinode cluster support is available. GEMS adopts in-memory processing: It
stores all data structures in RAM. In-memory processing potentially allows increasing the
data set size while maintaining constant query throughput by adding more cluster nodes.
Some approaches leverage MapReduce infrastructures for RDF-encoded databases.
SHARD (Rohloff and Schantz 2010) is a triplestore built on top of Hadoop, while YARS2
(Harth et al. 2007) is a bulk-synchronous, distributed query-answering system. Both
exploit hash partitioning to distribute triples across nodes. These approaches work well
for simple index lookups, but they also present high communication overheads for moving
data through the network with more complex queries, as well as introduce load-balancing
issues in the presence of data skew.
More general graph libraries, such as Pregel (Malewicz et al. 2010), Giraph (Apache
Giraph, n.d.), and GraphLab (GraphLab, n.d.), may also be exploited to explore semantic
databases, once the source data have been converted into a graph. However, they require
significant additions to work in a database environment, and they still rely on bulk-synchronous,
parallel models that do not perform well for large and complex queries. Our
system relies on a custom runtime that provides specific features to support exploration of
a semantic database through graph-based methods.
Urika is a commercial shared-memory system from YarcData (YarcData, Inc., n.d.) targeted
at Big Data analytics. Urika exploits custom nodes with purpose-built multithreaded
processors (barrel processors with up to 128 threads and a very simple cache) derived from
the Cray XMT. Besides multithreading, which allows tolerating the latencies for accessing
data on remote nodes, the system has hardware support for a scrambled global address
space and fine-grained synchronization. These features allow more efficient execution of
irregular applications, such as graph exploration. On top of this hardware, YarcData interfaces
with the Jena framework to provide a front-end application programming interface
(API). GEMS, instead, exploits clusters built with commodity hardware that are cheaper
to acquire and maintain and that are able to evolve more rapidly than custom hardware.
FIGURE 8.1 High-level overview of Graph Engine for Multithreaded Systems (GEMS). HL-IR,
high level-intermediate representation; LL-IR, low level-intermediate representation.
generates the graph database and the related dictionary by ingesting the triples. Triples
can, for example, be RDF triples stored in N-Triples format.
The approach implemented by our system to extract information from the semantic graph
database is to solve a structural graph pattern-matching problem. GEMS employs a variation
of Ullmann's (1976) subgraph isomorphism algorithm. The approach basically enumerates
all the possible mappings of the graph pattern vertices to those in the graph data set through
a depth-first tree search. A path from root to leaves in the resulting tree denotes a complete
mapping. If all the vertices that are neighbors in a path are also neighbors both in the graph
pattern and in the graph data set (i.e., adjacency is preserved), then the path represents a
match. Even if the resulting solution space has exponential complexity, the algorithm can
prune early the subtrees that do not lead to feasible mappings. Moreover, the compiler can perform
further optimizations by reasoning on the query structure and the data set statistics.
The data structures implemented within the SGLib layer support the operation of a
query search using the features provided by the GMT layer. When loaded, the size of the
SGLib data structures is expected to be larger than what can fit into a single node's memory,
and they are therefore implemented using the global memory of GMT. The two most
fundamental data structures are the graph and the dictionary. Supplemental data structures
are the array and the table.
The ingest phase of the GEMS workflow initializes a graph and a dictionary. The dictionary
is used to encode string labels with unique integer identifiers (UIDs). Therefore, each RDF
triple is encoded as a sequence of three UIDs. The graph indexes RDF triples in subject–
predicate–object and object–predicate–subject orders. Each index is range-partitioned so
that each node in the cluster gets an almost equal number of triples. Subject–predicate and
object–predicate pairs with highly skewed proportions of associated triples are specially
distributed among nodes so as to avoid load imbalance as the result of data skew.
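As a rough illustration of this ingest step (plain Python, not GEMS code; the class and method names are invented), the sketch below interns string labels into UIDs and indexes encoded triples in both subject–predicate and object–predicate order so that a graph walk can follow edges in either direction.

from collections import defaultdict

class TripleStore:
    """Toy ingest of RDF-like triples: a string dictionary plus SPO/OPS indexes."""
    def __init__(self):
        self.uid = {}                      # string label -> unique integer id
        self.label = []                    # uid -> string label
        self.spo = defaultdict(list)       # (subject, predicate) -> [objects]
        self.ops = defaultdict(list)       # (object, predicate) -> [subjects]

    def encode(self, term):
        if term not in self.uid:
            self.uid[term] = len(self.label)
            self.label.append(term)
        return self.uid[term]

    def ingest(self, triples):
        for s, p, o in triples:
            s_id, p_id, o_id = self.encode(s), self.encode(p), self.encode(o)
            self.spo[(s_id, p_id)].append(o_id)
            self.ops[(o_id, p_id)].append(s_id)

    def objects(self, subject, predicate):
        return [self.label[o] for o in self.spo[(self.uid[subject], self.uid[predicate])]]

    def subjects(self, obj, predicate):
        return [self.label[s] for s in self.ops[(self.uid[obj], self.uid[predicate])]]

store = TripleStore()
store.ingest([("alice", "owns", "car1"), ("bob", "owns", "car2"),
              ("car1", "of_type", "SUV")])
print(store.objects("alice", "owns"))    # ['car1']
print(store.subjects("SUV", "of_type"))  # ['car1']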
The GMT layer provides the key features that enable management of the data structures
and load balancing across the nodes of the cluster. GMT is codesigned with the upper layers
of the graph database engine so as to better support the irregularity of graph pattern-matching
operations. GMT provides a Partitioned Global Address Space (PGAS) data model,
hiding the complexity of the distributed memory system. GMT exposes to SGLib an API that
permits allocating, accessing, and freeing data in the global address space. Differently from
other PGAS libraries, GMT employs a control model typical of shared-memory systems:
fork-join parallel constructs that generate thousands of lightweight tasks. These lightweight
tasks allow hiding the latency for accessing data on remote cluster nodes; they are switched in
and out of processor cores while communication proceeds. Finally, GMT aggregates operations
before communicating to other nodes, increasing network bandwidth utilization.
The sequence of Figures 8.2 through 8.6 presents a simple example of a graph database,
of a SPARQL query, and of the conversion performed by the GEMS SPARQL-to-C++
compiler.
Figure 8.2 shows the data set in N-Triples format (a common serialization format for RDF)
(Carothers and Seaborne 2014). Figure 8.3 shows the corresponding graph representation
has_name = get_label("has_name")
of_type = get_label("of_type")
owns = get_label("owns")
suv = get_label("SUV")
forall e1 in edges(*, of_type, suv)
    ?car1 = source_node(e1)
    forall e2 in edges(*, owns, ?car1)
        ?person = source_node(e2)
        forall e3 in edges(?person, owns, *)
            ?car2 = target_node(e3)
            if (?car1 != ?car2)
                forall e4 in edges(?person, has_name, *)
                    ?name = target_node(e4)
                    tuples.add(<?name>)
of the data set. Figure 8.4 shows the SPARQL description of the query. Figure 8.5 illustrates
the corresponding graph of the query. Finally, Figure 8.6 shows pseudocode conceptually
similar to the C/C++ generated by the GEMS compiler and executed, through SGLib, by
the GMT runtime library.
As illustrated in Figure 8.6, the bulk of the query is executed as a parallel graph walk. The
forAll method is used to call a function for all matching edges, in parallel. Thus, the graph
walk can be conceptualized as nested loops. At the end (or "bottom") of a graph walk, results
are buffered in a loader object that is associated with a table object. When all the parallel edge
traversals are complete, the loader finalizes by actually inserting the results into its associated
table. At this point, operations like deduplication (DISTINCT) or grouping (GROUP BY) are
performed using the table. Results are effectively structs containing the different variables of the
results. Many variables will be bound to UIDs, but some may contain primitive values.
GEMS has minimal system-level library requirements: Besides Pthreads, it only needs a
message passing interface (MPI) for the GMT communication layer and Python for some
compiler phases and for glue scripts. The current implementation of GEMS also requires
x86-compatible processors because GMT employs optimized context switching routines.
However, because of the limited requirements in terms of libraries, porting to other
architectures should be mostly confined to replacing these context switching routines.
1. Worker: executes application code, in the form of lightweight user tasks, and gener-
ates commands directed to other nodes
2. Helper: manages global address space and synchronization and handles incoming
commands from other nodes
(Figure: high-level architecture of GMT. On each node, application code calls the GMT API; workers, helpers, and a communication server exchange commands; the communication server uses MPI over the network, and nodes 1 to N together expose a virtual global address space across the cluster.)
3. Communication server: end point for the network, which manages all incoming/out-
going communication at the node level in the form of network messages, which con-
tain the commands
The specialized threads are implemented as POSIX threads, each one pinned to a core.
The communication server employs MPI to send and receive messages to and from other
nodes. There are multiple workers and helpers per node. Usually, we use an equal number
of workers and helpers, but this is one of the parameters that can be adjusted, depending
on empirical testing on the target machine. There is, instead, a single communication
server, because while building the runtime, we verified that a single communication server
is already able to provide the peak MPI bandwidth of the network with reasonably sized
packets, removing the need to manage multiple network end points (which, in turn, may
bring challenges and further overheads for synchronization).
SGLib data structures are implemented using shared arrays in GMT's global address
space. Among them are the graph data structure and the terms dictionary. The
SPARQL-to-C++ compiler assumes operation on a shared-memory system and does not
need to reason about the physical partitioning of the database. However, as is common in
PGAS libraries, GMT also exposes locality information, allowing reduction of data movements
whenever possible. Because graph exploration algorithms mostly have loops that
run through edge or vertex lists (as the pseudocode in Figure 8.6 shows), GMT provides a
parallel loop construct that maps loop iterations to lightweight tasks. GMT supports task
generation from nested loops and allows specifying the number of iterations of a loop
mapped to a task. GMT also allows controlling code locality, enabling spawning (or moving)
of tasks on preselected nodes, instead of moving data. SGLib routines exploit these
features to better manage their internal data structures. SGLib routines access data through
put and get communication primitives, moving data into local space for manipulation and
writing them back to the global space. The communication primitives are available with
both blocking and nonblocking semantics. GMT also provides atomic operations, such
as atomic addition and test-and-set, on data allocated in the global address space. SGLib
exploits them to protect parallel operations on the graph data sets and to implement global
synchronization constructs for database management and querying.
into aggregation queues when (a) it is full or (b) it has been waiting longer than a predetermined
time interval. Condition (a) is true when all the available entries are occupied with
commands or when the equivalent size in bytes of the commands (including any attached
data) reaches the size of the aggregation buffer. Condition (b) allows setting a configurable
upper bound for the latency added by aggregation. After pushing a command block, when
a worker or a helper finds that the aggregation queue has sufficient data to fill an aggregation
buffer, it starts popping command blocks from the aggregation queue and copying
them with the related data into an aggregation buffer (4, 5, and 6). Aggregation buffers also
are preallocated and recycled to save memory space and eliminate allocation overhead.
After the copy, command blocks are returned to the command block pool (7). When the
aggregation buffer is full, the worker (or helper) pushes it into a channel queue (8). Channel
queues are high-throughput, single-producer, single-consumer queues that workers and
helpers use to exchange data with the communication server. If the communication server
finds a new aggregation buffer in one of the channel queues, it pops it (9) and performs a
nonblocking MPI send (10). The aggregation buffer is then returned into the pool of free
aggregation buffers.
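A much simplified sketch of this aggregation policy follows; it is a plain-Python illustration rather than GMT's implementation, and the buffer size, wait interval, and send stand-in are arbitrary. Commands destined for the same remote node accumulate per destination and are flushed either when the accumulated bytes reach the buffer capacity (condition a, simplified here to a byte count) or when the oldest queued command has waited too long (condition b).

import time

class Aggregator:
    """Per-destination command aggregation with size and age flush triggers."""
    def __init__(self, buffer_bytes=4096, max_wait_s=0.001, send=print):
        self.buffer_bytes = buffer_bytes
        self.max_wait_s = max_wait_s
        self.send = send                    # stand-in for the nonblocking MPI send
        self.buffers = {}                   # dest node -> (commands, bytes, first_ts)

    def push(self, dest, command, size):
        cmds, total, first_ts = self.buffers.get(dest, ([], 0, time.monotonic()))
        cmds.append(command)
        total += size
        self.buffers[dest] = (cmds, total, first_ts)
        self._maybe_flush(dest)

    def _maybe_flush(self, dest):
        cmds, total, first_ts = self.buffers[dest]
        too_big = total >= self.buffer_bytes                        # condition (a)
        too_old = time.monotonic() - first_ts >= self.max_wait_s    # condition (b)
        if too_big or too_old:
            self.send(f"to node {dest}: {len(cmds)} commands, {total} bytes")
            del self.buffers[dest]

    def flush_all(self):
        for dest in list(self.buffers):
            cmds, total, _ = self.buffers[dest]
            self.send(f"to node {dest}: {len(cmds)} commands, {total} bytes")
            del self.buffers[dest]

agg = Aggregator(buffer_bytes=256)
for i in range(40):
    agg.push(dest=1, command=("put", i), size=16)   # flushes every 16 commands
agg.flush_all()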
The size of the aggregation buffers and the time intervals for pushing out aggregated data
are configurable parameters that depend on the interconnect of the cluster on which
GEMS runs. Rules of thumb for setting these parameters are as follows: Buffers should be
sufficiently large to maximize network throughput, while time intervals should not increase
the latency beyond the values maskable through multithreading. The right values may be
derived through empirical testing on the target cluster with toy examples.
(Figure: fine-grained multithreading in GMT. A worker executes tasks in the TASK_RUNNING state, re-queues tasks waiting for remote operations in the TASK_WAITING state, and pops new blocks of loop iterations; the communication server receives messages through MPI, and a helper parses the incoming commands.)
queue (5), and it decreases the counter of t and pushes it back into the queue (6). The worker
creates t tasks (6) and pushes them into its private task queue (7).
At this point, the idle worker can pop a task from its task queue (8). If the task is executable
(i.e., all the remote operations completed), the worker restores the task's context and
executes it (9). Otherwise, it pushes the task back into the task queue. If the task contains a
blocking remote request, the task enters a waiting state (10) and is reinserted into the task
queue for future execution (11).
This mechanism provides load balancing at the node level because each worker gets new
tasks from the itb queue as soon as its task queue is empty. At the cluster level, GMT evenly
splits tasks across nodes when it encounters a parallel for-loop construct.
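The worker's scheduling policy can be modeled roughly as below (an illustrative Python sketch, not GMT code; the task representation is invented): the worker pops a task, runs it if its remote operations have completed, re-queues it otherwise, and pulls a new block of loop iterations when its private queue runs dry.

from collections import deque

class Task:
    def __init__(self, name, pending_remote_ops=0):
        self.name = name
        self.pending_remote_ops = pending_remote_ops   # outstanding blocking requests

    def run(self):
        print(f"running {self.name}")

def worker_loop(task_queue, get_new_tasks):
    """Simplified worker: execute ready tasks, re-queue waiting ones,
    and refill the private queue from the shared iteration queue when empty."""
    while True:
        if not task_queue:
            new_tasks = get_new_tasks()          # pull a block of loop iterations
            if not new_tasks:
                break                            # nothing left to execute
            task_queue.extend(new_tasks)
        task = task_queue.popleft()
        if task.pending_remote_ops == 0:
            task.run()                           # context restored and executed
        else:
            task.pending_remote_ops -= 1         # pretend one remote op completed
            task_queue.append(task)              # TASK_WAITING: try again later

# Hypothetical shared queue handing out two blocks of tasks, then running dry.
blocks = deque([[Task("iter-0"), Task("iter-1", pending_remote_ops=2)],
                [Task("iter-2")]])
worker_loop(deque(), lambda: blocks.popleft() if blocks else [])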
utilization. We then show experimental results of the whole GEMS system on a well-established
benchmark, the Berlin SPARQL Benchmark (BSBM) (Bizer and Schultz 2009),
and on a data set currently curated at Pacific Northwest National Laboratory (PNNL) for
the Resource Discovery for Extreme Scale Collaboration (RDESC) project.
FIGURE 8.10 Transfer rates of put operations between two nodes while increasing concurrency.
FIGURE 8.11 Transfer rates of put operations among 128 nodes (one to all) while increasing
concurrency.
bandwidth among 128 nodes. The figures show how increasing the concurrency increases
the transfer rates, because there is a higher number of messages that GMT can aggregate.
For example, across two nodes (Figure 8.10) with 1024 tasks each, puts of 8 bytes reach
a bandwidth of 8.55 MB/s. With 15,360 tasks, instead, GMT reaches 72.48 MB/s. When
increasing message sizes to 128 bytes, 15,360 tasks provide almost 1 GB/s. For reference,
32 MPI processes with 128-byte messages only reach 72.26 MB/s. With more destination
nodes, the probability of aggregating enough data to fill a buffer for a specific remote
node decreases. Although there is a slight degradation, Figure 8.11 shows that GMT is still
very effective. For example, 15,360 tasks with 16-byte messages reach 139.78 MB/s, while
32 MPI processes only provide up to 9.63 MB/s.
8.5.2 BSBM
BSBM defines a set of SPARQL queries and data sets to evaluate the performance of semantic
graph databases and of systems that map RDF into other kinds of storage systems. Berlin
data sets are based on an e-commerce use case with millions to billions of commercial
transactions, involving many product types, producers, vendors, offers, and reviews. We
run queries 1 through 6 (Q1–Q6) of the business intelligence use case on data sets with 100
million, 1 billion, and 10 billion triples.
Tables 8.1 through 8.3 respectively show the build time of the database and the execution
time of the queries on 100 million, 1 billion, and 10 billion triples, while progressively
increasing the number of cluster nodes. The sizes of the input files respectively are 21 GB
(100 million), 206 GB (1 billion), and 2 TB (10 billion). In all cases, the build time scales
with the number of nodes. Considering all three tables together, we can appreciate how
GEMS scales in data set size by adding new nodes and how it can exploit the additional
parallelism available. With 100 million triples, Q1 and Q3 scale for all the experiments
up to 16 nodes. Increasing the number of nodes for the other queries, instead, provides
constant or slightly worse execution times. Their execution time is very short (under 0.5 s),
and the small data set does not provide sufficient data parallelism. These queries only have
two graph walks with two-level nesting, and even with larger data sets, GEMS is able to
exploit all the available parallelism already with a limited number of nodes. Furthermore,
the database has the same overall size but is partitioned on more nodes; thus, the communication
increases, slightly reducing the performance. With 1 billion triples, we see
similar behavior. In this case, however, Q1 stops scaling at 32 nodes. With 64 nodes, GEMS
can execute queries on 10 billion triples. Q3 still scales in performance up to 128 nodes,
while the other queries, except Q1, approximately maintain stable performance. Q1 experiences
the highest decrease in performance when using 128 nodes because its tasks present
higher communication intensity than the other queries, and GEMS already exploited
all the available parallelism with 64 nodes. These data confirm that GEMS can maintain
constant throughput when running sets of mixed queries in parallel, that is, in typical
database usage.
8.5.3 RDESC
In this section, we describe some results of testing the use of GEMS to answer queries
for the RDESC project (Table 8.4). The RDESC project is a 3-year effort funded by the
TABLE 8.4 Specifications of the Queries Executed on the RDESC Data Set
Q1 Find all instruments related to data resources containing measurements related to soil
moisture
Q2 Find all locations having measurements related to soil moisture that are taken with at least
ten instruments
Q3 Find locations (spatial locations) that have daily soil moisture profile data since 1990 with at
least 10 points
Q4 Find locations that have soil moisture profile data, ranked by how many resources are
available for that location
Nodes             8                      16                     32
1 run        Exe [s]   Throughput   Exe [s]   Throughput   Exe [s]   Throughput
                       [q/s]                  [q/s]                  [q/s]
Load       2577.45909              1388.4894              1039.5300
Build       607.482135              301.1487               161.8761
Q1             0.0059   169.2873       0.0076   131.6247       0.0066   151.9955
Q2             0.0078   127.729        0.0088   113.3613       0.0093   107.9400
Q3             0.0017   592.6100       0.0020   477.9567       0.0017   592.2836
Q4             0.0154    64.9224       0.0119    83.7866       0.0119    84.2356

Nodes             8                      16                     32
Avg. 100     Exe [s]   Throughput   Exe [s]   Throughput   Exe [s]   Throughput
runs                   [q/s]                  [q/s]                  [q/s]
Load       2593.6120               1367.2713              1062.4514
Build       583.0361                303.8100               153.7527
Q1             0.0057   174.1774       0.0059   168.5413       0.0070   142.5891
Q2             0.0074   135.0213       0.0080   124.4752       0.0091   109.6614
Q3             0.0017   582.8798       0.0044   229.1323       0.0017   640.1768
Q4             0.0124    80.3632       0.0147    67.8223       0.0101    98.6041
Using 8, 16, and 32 nodes of Olympus (equating to 120, 240, and 480 GMT workers,
respectively), Q3 correctly produces no results, since there are currently no data that
match the query, in the range of 0.0017 to 0.0044 s. Although there are no results, the query
still needs to explore several levels of the graph. However, our approach allows early pruning
of the search, significantly limiting the execution time.
Q1, Q2, and Q4, instead, under the same circumstances, run in the ranges of 0.0059–
0.0076, 0.0074–0.0093, and 0.0101–0.0154 s with 76, 1, and 68 results, respectively.
For the query in Q2, the one location with at least 10 "points" (instruments) for soil
moisture data is the ARM facility sgp.X1 (that is, the first experimental facility in ARM's
Southern Great Plains site). The reason these queries take approximately the same amount
of time regardless of the number of nodes is that the amount of parallelism in executing
the query is limited, so that eight nodes are sufficient for fully exploiting the parallelism in
the query (in the way that we perform the query). We could not run the query on four
nodes or fewer due to the memory needs of the graph and dictionary. This emphasizes our
point that the need for more memory exceeds the need for more parallel computation. We
needed the eight nodes not necessarily to make the query faster but, rather, to keep the data
and query in memory to make the query feasible.
8.6 CONCLUSIONS
In this chapter, we presented GEMS, a full software stack for semantic graph databases
on commodity clusters. Different from other solutions, GEMS proposes an integrated
approach that primarily utilizes graph-based methods across all the layers of its stack.
GEMS includes a SPARQL-to-C++ compiler, a library of algorithms and data structures,
and a custom runtime. The custom runtime system (GMT) provides to all the other layers
several features that simplify the implementation of the exploration methods and make
their execution on commodity clusters more efficient. GMT provides a global address
space, fine-grained multithreading (to tolerate latencies for accessing data on remote
nodes), remote message aggregation (to maximize network bandwidth utilization), and
load balancing. We have demonstrated how this integrated approach provides scaling in
size and performance as more nodes are added to the cluster on two example data sets (the
BSBM and the RDESC data sets).
REFERENCES
Apache Giraph. (n.d.). Retrieved May 1, 2014, from http://incubator.apache.org/giraph/.
Apache Jena—Home. (n.d.). Retrieved March 19, 2014, from https://jena.apache.org/.
ARQ—A SPARQL Processor for Jena. (n.d.). Retrieved May 1, 2014, from http://jena.sourceforge.net
/ARQ/.
Bizer, C., and Schultz, A. (2009). The Berlin SPARQL benchmark. Int. J. Semantic Web Inf. Syst., 5
(2), 1–24.
Carothers, G., and Seaborne, A. (2014). RDF 1.1 N-Triples. Retrieved March 19, 2014, from http://
www.w3.org/TR/2014/REC-n-triples-20140225/.
GraphLab. (n.d.). Retrieved May 1, 2014, from http://graphlab.org.
Harth, A., Umbrich, J., Hogan, A., and Decker, S. (2007). YARS2: A federated repository for querying
graph structured data from the web. ISWC ’07/ASWC ’07: 6th International Semantic Web and
2nd Asian Semantic Web Conference (pp. 211–224).
International Soil Moisture Network. (n.d.). Retrieved February 17, 2014, from http://ismn.geo
.tuwien.ac.at.
Klyne, G., Carroll, J. J., and McBride, B. (2004). Resource Description Framework (RDF): Concepts
and Abstract Syntax. Retrieved December 2, 2013, from http://www.w3.org/TR/2004/REC
-rdf-concepts-20040210/.
Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N. et al. (2010). Pregel:
A system for large-scale graph processing. SIGMOD ’10: ACM International Conference on
Management of Data (pp. 135–146).
National Aeronautics and Space Administration. (2013). Global Change Master Directory, Online,
Version 9.9. Retrieved May 1, 2014, from http://gcmd.nasa.gov.
156 ◾ Alessandro Morari et al.
Ohio State University. (n.d.). OSU Micro-Benchmarks. Retrieved May 1, 2014, from http://mvapich
.cse.ohio-state.edu/benchmarks/.
openRDF.org. (n.d.). Retrieved May 1, 2014, from http://www.openrdf.org.
Redland RDF Libraries. (n.d.). Retrieved May 1, 2014, from http://librdf.org.
Rohloff, K., and Schantz, R. E. (2010). High-performance, massively scalable distributed systems
using the MapReduce software framework: The SHARD triple-store. PSI EtA '10: Programming
Support Innovations for Emerging Distributed Applications, 4, Reno (pp. 1–5).
Top500.org. (n.d.). PNNL’s Olympus Entry. Retrieved May 1, 2014, from http://www.top500.org
/system/177790.
Ullmann, J. (1976). An algorithm for subgraph isomorphism. J. ACM, 23 (1), 31–42.
Virtuoso Universal Server. (n.d.). Retrieved May 1, 2014, from http://virtuoso.openlinksw.com.
W3C SPARQL Working Group. (2013). SPARQL 1.1 Overview. Retrieved February 18, 2014, from
http://www.w3.org/TR/2013/REC-sparql11-overview-20130321/.
Weaver, J. (2012). A scalability metric for parallel computations on large, growing datasets (like
the web). Proceedings of the Joint Workshop on Scalable and High-Performance Semantic Web
Systems, Boston.
YarcData, Inc. (n.d.) Urika Big Data Graph Appliance. Retrieved May 1, 2014, from http://www.cray
.com/Products/BigData/uRiKA.aspx.
CHAPTER 9
KSC-net
Community Detection for
Big Data Networks
CONTENTS
9.1 Introduction 158
9.2 KSC for Big Data Networks 160
9.2.1 Notations 160
9.2.2 FURS Selection 161
9.2.3 KSC Framework 162
9.2.3.1 Training Model 162
9.2.3.2 Model Selection 163
9.2.3.3 Out-of-Sample Extension 163
9.2.4 Practical Issues 165
9.3 KSC-net Software 165
9.3.1 KSC Demo on Synthetic Network 165
9.3.2 KSC Subfunctions 167
9.3.3 KSC Demo on Real-Life Network 169
9.4 Conclusion 171
Acknowledgments 171
References 172
ABSTRACT
In this chapter, we demonstrate the applicability of the kernel spectral clustering
(KSC) method for community detection in Big Data networks. We give a practical
exposition of the KSC method [1] on large-scale synthetic and real-world networks
with up to 10^6 nodes and 10^7 edges. The KSC method uses a primal–dual framework
to construct a model on a smaller subset of the Big Data network. The original large-scale
kernel matrix cannot fit in memory, so we select smaller subgraphs using a fast
and unique representative subset (FURS) selection technique as proposed in Reference
2. These subsets are used for training and validation, respectively, to build the model
9.1 INTRODUCTION
In the modern era, with the proliferation of data and the ease of its availability due to the
advancement of technology, the concept of Big Data has emerged. Big Data refers to a
massive amount of information, which can be collected by means of cheap sensors and
the wide usage of the Internet. In this chapter, we focus on Big Data networks, which are ubiquitous
in current life. Their omnipresence is confirmed by the modern massive online
social networks, communication networks, collaboration networks, financial networks,
and so forth.
Figure 9.1 represents a synthetic network with 5000 nodes and a real-world YouTube
network with over a million nodes. The networks are visualized using the Gephi software
(https://gephi.org/).
Real-world complex networks are represented as graphs G(V, E), where the vertices V
represent the entities and the edges E represent the relationships between these entities.
For example, in a scientific collaboration network, the entities are the researchers, and the
presence or absence of an edge between two researchers depicts whether or not they have collaborated
in the given network. Real-life networks exhibit community-like structure.
This means that nodes that are part of one community are densely connected to each other and
sparsely connected to nodes belonging to other communities. This problem of community
detection can also be framed as graph partitioning and graph clustering [4] and, of late,
has received wide attention [5–16]. Among these techniques, one class of methods used for
community detection is referred to as spectral clustering [13–16].
In spectral clustering, an eigen-decomposition of the graph Laplacian matrix (L) is performed.
L is derived from the similarity or affinity matrix of the nodes in the network.
Once the eigenvectors are obtained, the communities can be detected using the k-means
clustering technique. The major disadvantage of the spectral clustering technique is that
we need to create a large N × N kernel matrix for the entire network to perform unsupervised
community detection. Here, N represents the number of nodes in the large-scale network.
However, as the size of the network increases, calculating and storing the N × N matrix
becomes computationally infeasible.
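For reference, the sketch below shows the classical spectral clustering pipeline that the preceding paragraph describes: build the normalized Laplacian from the full N × N affinity matrix, take its leading eigenvectors, and run k-means on the rows of the spectral embedding. It uses NumPy and scikit-learn purely for illustration, and it is exactly this O(N × N) construction that becomes infeasible for Big Data networks.

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(adjacency, k):
    """Classical spectral clustering: eigen-decompose the normalized Laplacian
    of the full N x N affinity matrix and cluster the spectral embedding."""
    degrees = adjacency.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(degrees, 1e-12)))
    laplacian = np.eye(len(adjacency)) - d_inv_sqrt @ adjacency @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    embedding = eigvecs[:, :k]                  # k smallest eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)

# Two 3-node cliques joined by a single edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(spectral_clustering(A, k=2))   # e.g., [0 0 0 1 1 1]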
A spectral clustering method based on weighted kernel principal component analysis
(PCA) with a primal–dual framework was proposed in Reference 17. The formulation results
in a model built on a representative subset of the data with a powerful out-of-sample extension
property. This representative subset captures the inherent cluster structure present in
the data set. This property allows community affiliation for previously unseen data points
in the data set. The kernel spectral clustering (KSC) method was extended for community
detection on moderate-size graphs in Reference 11 and for large-scale graphs in Reference 1.
We use the fast and unique representative subset (FURS) selection technique introduced
in Reference 2 for selection of the subgraphs on which we build our training and validation
models (Figure 9.2). An important step in the KSC method is to estimate the model parameters,
that is, the kernel parameters, if any, and the number of communities k in the large-scale
network. In the case of large-scale networks, the normalized linear kernel is an effective kernel
as shown in Reference 18. Thus, we use the normalized linear kernel, which is parameter-free,
in our experiments. This saves us from tuning an additional kernel parameter. In order
to estimate the number of communities k, we exploit the eigen-projections in the eigenspace to
come up with a balanced angular fitting (BAF) criterion [1] and another self-tuned approach
[3] where we use the concepts of entropy and balance to automatically obtain k.
In this chapter, we give a practical exposition of the KSC method by first briefly explaining
the steps involved and then showcasing its usage with the KSC-net software. We demonstrate
the options available during the usage of KSC-net by means of two demos and
explain the internal structure and functionality in terms of the KSC methodology.
9.2.1 Notations
1. A graph is represented as G = (V, E), where V represents the set of nodes and E ⊆ V × V
represents the set of edges in the network. The nodes represent the entities in the network,
and the edges represent the relationships between them.
2. The cardinality of the set V is given as N.
3. The cardinality of the set E is given as Etotal.
4. The adjacency matrix A is an N × N sparse matrix.
5. The affinity/kernel matrix of nodes is given by Ω, and the affinity matrix of projections
is depicted by S.
6. For unweighted graphs, Aij = 1 if (vi, vj) ∈ E; otherwise Aij = 0.
7. The subgraph generated by the subset of nodes B is represented as G(B). Mathematically,
G(B) = (B, Q), where B ⊂ V and Q = (B × B) ∩ E represents the set of edges in the
subgraph.
8. The degree distribution function is given by f(V). For the graph G, it can be written as
f(V), while for the subgraph B, it can be presented as f(B). Each vertex vi ∈ V has a
degree represented as f(vi).
9. The degree matrix is represented as D. It is a diagonal matrix with diagonal entries
$d_{i,i} = \sum_j A_{ij}$.
$$\max_{B}\ J(B) = \sum_{j=1}^{s} D(v_j) \quad \text{such that}\ B \cap c_i \neq \emptyset,\ i = 1, \ldots, k, \tag{9.1}$$
where D(vj) represents the degree centrality of the node vj, s is the size of the subset, ci represents
the ith community, and k represents the number of communities in the network,
which cannot be determined explicitly beforehand.
The FURS selection technique is a greedy solution to the aforementioned problem. It is formulated
as an optimization problem where we maximize the sum of the degree centralities of
the nodes in the selected subset B, such that the neighbors of a selected node are deactivated, that is,
they cannot be selected in that iteration. By deactivating its neighborhood, we move from one dense
region of the graph to another, thereby approximately covering all the communities in the network.
If all the nodes are deactivated in one iteration and we have not yet reached the required
subset size s, then these deactivated nodes are reactivated, and the procedure is repeated until we
reach the required size s. The optimization problem is formulated in Equation 9.2:
$$J(B) = 0; \quad \text{while } |B| < s:\ \ \max_{B}\ J(B) := J(B) + \sum_{j=1}^{s_t} D(v_j), \tag{9.2}$$
where st is the size of the set of nodes selected by FURS during iteration t.
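A compact Python sketch of this greedy procedure is given below; it is an illustration of the idea behind Equation 9.2, not the released FURS code, and it uses networkx and an invented synthetic graph purely for the example.

import networkx as nx

def furs_select(graph, subset_size):
    """Greedy FURS-style selection: take high-degree nodes while deactivating
    their neighborhoods, so the subset hops between dense regions."""
    selected = set()
    active = set(graph.nodes())
    while len(selected) < subset_size:
        if not active:                                  # everyone deactivated:
            active = set(graph.nodes()) - selected      # start a new iteration
            if not active:                              # every node already selected
                break
        candidate = max(active, key=graph.degree)       # highest-degree active node
        selected.add(candidate)
        active.discard(candidate)
        active -= set(graph.neighbors(candidate))       # deactivate its neighbors
    return selected

# Example on a small synthetic network with planted communities.
G = nx.planted_partition_graph(l=4, k=25, p_in=0.4, p_out=0.01, seed=7)
subset = furs_select(G, subset_size=15)
print(sorted(subset))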
Several approaches have been proposed for sampling a graph, including References
2 and 19 through 21. The FURS selection approach was proposed in Reference 2 and used
in Reference 1. A comprehensive comparison of various sampling techniques is given
in detail in Reference 2. We use the FURS selection technique in the KSC-net software
for training and validation set selection.
$$\min_{w^{(l)}, e^{(l)}, b_l}\ \frac{1}{2} \sum_{l=1}^{maxk-1} {w^{(l)}}^{\top} w^{(l)} - \frac{1}{2 N_{tr}} \sum_{l=1}^{maxk-1} \gamma_l\, {e^{(l)}}^{\top} D^{-1} e^{(l)} \tag{9.3}$$
$$\text{such that } e^{(l)} = \Phi w^{(l)} + b_l 1_{N_{tr}}, \quad l = 1, \ldots, maxk - 1,$$
where $e^{(l)} = [e_1^{(l)}; \ldots; e_{N_{tr}}^{(l)}]$ are the projections onto the eigenspace; $l = 1, \ldots, maxk - 1$ indicates
the number of score variables required to encode the maxk communities; $D^{-1} \in \mathbb{R}^{N_{tr} \times N_{tr}}$
is the inverse of the degree matrix associated to the kernel matrix Ω; Φ is the $N_{tr} \times d_h$ feature
matrix, $\Phi = [\phi(x_1)^{\top}; \ldots; \phi(x_{N_{tr}})^{\top}]$; and $\gamma_l \in \mathbb{R}^{+}$ are the regularization constants. We
note that $N_{tr} \ll N$, that is, the number of nodes in the training set is much less than the total
number of nodes in the large-scale network. The kernel matrix Ω is obtained by calculating
the similarity between the adjacency lists of each pair of nodes in the training set. Each element
of Ω, denoted as $\Omega_{ij} = K(x_i, x_j) = \phi(x_i)^{\top}\phi(x_j)$, is obtained by calculating the cosine similarity
between the adjacency lists $x_i$ and $x_j$. Thus, $\Omega_{ij} = \frac{x_i^{\top} x_j}{\|x_i\|\,\|x_j\|}$ can be calculated efficiently
using notions of set unions and intersections. This corresponds to using a normalized linear
kernel function $K(x, z) = \frac{x^{\top} z}{\|x\|\,\|z\|}$ [22]. The clustering model is then represented by
$$e_i^{(l)} = {w^{(l)}}^{\top} \phi(x_i) + b_l, \quad i = 1, \ldots, N_{tr}. \tag{9.4}$$
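The cosine similarity between adjacency lists can be computed with plain set operations, as described above. The following sketch is an illustration of that computation, not the KSC-net implementation; the adjacency lists are made up.

import numpy as np

def normalized_linear_kernel(adj_lists):
    """Cosine similarity between binary adjacency lists:
    Omega_ij = |N(i) & N(j)| / sqrt(|N(i)| * |N(j)|)."""
    nodes = list(adj_lists)
    n = len(nodes)
    omega = np.zeros((n, n))
    for a in range(n):
        for b in range(n):
            na, nb = adj_lists[nodes[a]], adj_lists[nodes[b]]
            if na and nb:
                omega[a, b] = len(na & nb) / np.sqrt(len(na) * len(nb))
    return omega

# Hypothetical training subgraph: node -> set of neighbors in the full network.
adj_lists = {
    "v1": {"v2", "v3", "v4"},
    "v2": {"v1", "v3"},
    "v3": {"v1", "v2", "v4"},
    "v4": {"v1", "v3", "v5"},
}
print(np.round(normalized_linear_kernel(adj_lists), 2))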
$$D^{-1} M_D \Omega\, \alpha^{(l)} = \lambda_l \alpha^{(l)}, \tag{9.5}$$
where $M_D = I_{N_{tr}} - \frac{1_{N_{tr}} 1_{N_{tr}}^{\top} D^{-1}}{1_{N_{tr}}^{\top} D^{-1} 1_{N_{tr}}}$ is a centering matrix, the $\alpha^{(l)}$ are the dual variables, and the kernel
function K plays the role of the similarity function of the network.
In the case of KSC, due to the centering matrix MD, the eigenvectors have zero mean, and
the optimal threshold for binarizing the eigenvectors is self-determined (equal to 0). So we
need k − 1 eigenvectors. However, in real-world networks, the communities exhibit overlap
and do not have piecewise constant eigenvectors.
$$\text{sign}\left(e_{test}^{(l)}\right) = \text{sign}\left(\Omega_{test}\, \alpha^{(l)} + b_l 1_{N_{test}}\right), \tag{9.6}$$
where $l = 1, \ldots, k - 1$; $\Omega_{test}$ is the $N_{test} \times N_{tr}$ kernel matrix evaluated using the test nodes, with
entries $\Omega_{r,i} = K(x_r, x_i)$, $r = 1, \ldots, N_{test}$, $i = 1, \ldots, N_{tr}$. This natural extension to out-of-sample
nodes is the main advantage of KSC. In this way, the clustering model can be trained,
validated, and tested in an unsupervised learning procedure. For the test set, we use the
entire network. If the network cannot fit in memory, we divide it into blocks and calculate
the cluster memberships for each test block in parallel.
(Figure: BAF criterion value versus the number of clusters k.)
(Figure 9.4: block diagonal similarity matrix (S), and plot of threshold t versus F-measure value for self-tuned KSC; the maximum F-measure of 2.709 occurs at t = 0.2.)
network. he adjusted Rand index (ARI) [25] value is 0.999, which means that the commu-
nity membership provided by the KSC methodology is as good as the original ground truth.
If we put mod_sel = 'Self' in the demo script of Algorithm 3, then we perform the
self-tuned model selection. It uses the concept of entropy and balance clusters to identify
the optimal number of clusters k in the network. Figure 9.4 showcases the block diagonal
ainity matrix S generated from the eigen-projections of the validation set using the self-
tuned model selection procedure. From Figure 9.4, we can observe that there are seven
blocks in the ainity matrix S. In order to calculate the number of these blocks, the self-
tuned procedure uses the concept of entropy and balance together as an F-measure. Figure
9.4 shows that the F-measure is maximum for threshold t = 0.2, for which it takes the value
2.709. he number of clusters k identiied corresponding to this threshold value is 7, which
is equivalent to the number of ground truth communities.
Data:
netname = 'network';//Name of the file and extension should be '.txt'
baseinfo = 1; //Ground truth exists
cominfo = 'community'; //Name of ground truth file and extension should be '.txt'
weighted = 0; //Whether a network is weighted or unweighted
frac_network = 15; //Percentage of network to be used as training and
//validation set
maxk = 10; //maxk is maximum value of k to be used in eigen-decomposition, use
//maxk = 10 for k < = 10 else use maxk = 100
mod_sel = 'BAF'; //Method for model selection (options are 'BAF' or 'Self')
output = KSCnet(netname,baseinfo,cominfo,weighted,frac_network,maxk,mod_sel);
Result
output = 1; //When community detection operation completes
netname_numclu_mod_sel.csv; //Number of clusters detected
netname_outputlabel.csv; //Cluster labels assigned to the nodes
netname_ARI.csv; //Valid only when ground truth is known, provides information
//about ARI and number of communities covered
KSC-net ◾ 167
In the case of the BAF model selection criterion, we utilize the KSCcodebook and
KSCmembership functionality to estimate the optimal number of communities k in the
FIGURE 9.5 Output of FURS for selecting training and validation set.
168 ◾ Raghvendra Mall and Johan A.K. Suykens
large-scale network. However, the "Self" model selection technique works independently of
these functions and uses the concepts of entropy and balance to calculate the number of
block diagonals in the affinity matrix S generated by the validation set eigen-projections
after reordering by means of the ground truth information. Figure 9.6 represents a snippet
that we obtain when trying to identify the optimal number of communities k in the large-scale
network.
In Figure 9.6, we observe a "DeprecationWarning," which we ignore, as it is related to
some functionality in the "scipy" library that we are not using. We also observe that
we obtain "Elapsed time…" at two places. It occurs once after building the training
model, and a second time after we have performed the model selection using
the validation set. Figure 9.7 represents the time required to obtain the test set community
affiliation. Since there are only 5000 nodes in the entire network and we divide the test
set into blocks of 5000 nodes due to memory constraints, we observe "I am in block 1" in
Figure 9.7. Figure 9.8 showcases the community-based structure of the synthetic network.
We get the same result in either case, whether we use the BAF selection criterion or the "Self"
selection technique.
FIGURE 9.7 Output snippet obtained while estimating the test cluster membership.
FIGURE 9.8 Communities detected by KSC methodology for the synthetic network.
(Figure: BAF value versus the number of clusters k for the real-life network.)
BAF-versus-k curve. However, we select the peak at which the BAF value is maximum. This occurs at k = 4, for a BAF value of 0.8995. Since we provide the cluster memberships for the entire network to the end user, the end user can use various internal quality criteria to estimate the quality of the resulting communities.
If we put mod_sel = 'Self' in the demo script of Algorithm 4, then we would be using the parameter-free model selection technique. Since the ground truth communities are not known beforehand, we try to estimate the block diagonal structure using the greedy selection technique mentioned in Algorithm 2 to determine the SizeCt vector. Figure 9.10 shows that the F-measure is maximum for t = 0.1, for which it takes the value 0.224. The number of communities k identified for this threshold value is 36 (Figure 9.11).
FIGURE 9.10 F-measure versus threshold t.
FIGURE 9.11 Community structure detected for the YouTube network by the "BAF" and "Self" model selection techniques, respectively, visualized using the software provided in Reference 26. (From Lancichinetti, A. et al., PLoS One, 6(e18961), 2011.)
Data:
netname = 'youtube_mod'; //Name of the file and extension should be '.txt'
baseinfo = 0; //Ground truth does not exist
cominfo = []; //No ground truth file so empty set as input
weighted = 0; //Whether a network is weighted or unweighted
frac_network = 10; //Percentage of network to be used as training and
//validation set
maxk = 100; //maxk is maximum value of k to be used in eigen-decomposition;
//use maxk = 10 for k <= 10, else use maxk = 100
mod_sel = 'BAF'; //Method for model selection (options are 'BAF' or 'Self')
output = KSCnet(netname,baseinfo,cominfo,weighted,frac_network,maxk,mod_sel);
Result:
output = 1; //When community detection operation completes
netname_numclu_mod_sel.csv; //Number of clusters detected
netname_outputlabel.csv; //Cluster labels assigned to the nodes
9.4 CONCLUSION
In this chapter, we gave a practical exposition of a methodology to perform community detection for Big Data networks. The technique was built on the concept of spectral clustering and is referred to as KSC. The core concept was to build a model on a small representative subgraph that captured the inherent community structure of the large-scale network. The subgraph was obtained by the FURS [2] selection technique. Then, the KSC model was built on this subgraph. The model parameters were obtained by one of two techniques: (1) the BAF criterion or (2) the self-tuned technique. The KSC model has a powerful out-of-sample extension property. This property was used for community affiliation of previously unseen nodes in the Big Data network. We also explained and demonstrated the usage of the KSC-net software, which uses the KSC methodology as its underlying core concept.
ACKNOWLEDGMENTS
EU: The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC AdG A-DATADRIVE-B (290923). This chapter reflects only the authors' views; the Union is not liable for any use that may be made of the contained information. Research Council KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; PhD/postdoc grants. Flemish government—FWO: projects G.0377.12 (structured systems), G.088114N (tensor-based data similarity); PhD/postdoc grants. IWT: project SBO POM (100031); PhD/postdoc grants. iMinds Medical Information Technologies: SBO 2014. Belgian Federal Science Policy Office: IUAP P7/19 (DYSCO, dynamical systems, control and optimization, 2012–2017).
REFERENCES
1. R. Mall, R. Langone and J.A.K. Suykens; Kernel spectral clustering for Big Data networks.
Entropy (Special Issue: Big Data), 15(5):1567–1586, 2013.
2. R. Mall, R. Langone and J.A.K. Suykens; FURS: Fast and Unique Representative Subset selec-
tion retaining large scale community structure. Social Network Analysis and Mining, 3(4):1–21,
2013.
3. R. Mall, R. Langone and J.A.K. Suykens; Self-tuned kernel spectral clustering for large scale
networks. In Proceedings of the IEEE International Conference on Big Data (IEEE BigData
2013). Santa Clara, CA, October 6–9, 2013.
4. S. Schaefer; Algorithms for nonuniform networks. PhD thesis. Espoo, Finland: Helsinki
University of Technology, 2006.
5. L. Danon, A. Díaz-Guilera, J. Duch and A. Arenas; Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 9:P09008, 2005.
6. S. Fortunato; Community detection in graphs. Physics Reports, 486:75–174, 2009.
7. A. Clauset, M. Newman and C. Moore; Finding community structure in very large scale net-
works. Physical Review E, 70:066111, 2004.
8. M. Girvan and M. Newman; Community structure in social and biological networks.
Proceedings of the National Academy of Sciences of the United States of America, 99(12):7821–
7826, 2002.
9. A. Lancichinetti and S. Fortunato; Community detection algorithms: A comparative analysis.
Physical Review E, 80:056117, 2009.
10. M. Rosvall and C. Bergstrom; Maps of random walks on complex networks reveal commu-
nity structure. Proceedings of the National Academy of Sciences of the United States of America,
105:1118–1123, 2008.
11. R. Langone, C. Alzate and J.A.K. Suykens; Kernel spectral clustering for community detection
in complex networks. In IEEE WCCI/IJCNN, pp. 2596–2603, 2012.
12. V. Blondel, J. Guillaume, R. Lambiotte and E. Lefebvre; Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10:P10008, 2008.
13. A.Y. Ng, M.I. Jordan and Y. Weiss; On spectral clustering: Analysis and an algorithm. In
Proceedings of the Advances in Neural Information Processing Systems, Dietterich, T.G., Becker,
S., Ghahramani, Z., editors. Cambridge, MA: MIT Press, pp. 849–856, 2002.
14. U. von Luxburg; A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007.
15. L. Zelnik-Manor and P. Perona; Self-tuning spectral clustering. In Advances in Neural
Information Processing Systems, Saul, L.K., Weiss, Y., Bottou, L., editors. Cambridge, MA: MIT
Press, pp. 1601–1608, 2005.
16. J. Shi and J. Malik; Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
17. C. Alzate and J.A.K. Suykens; Multiway spectral clustering with out-of-sample extensions
through weighted kernel PCA. IEEE Transactions on Pattern Analysis and Machine Intelligence,
32(2):335–347, 2010.
18. L. Mulikhah; Document clustering using concept space and concept similarity measurement.
In ICCTD, pp. 58–62, 2009.
19. A. Maiya and T. Berger-Wolf; Sampling community structure. In WWW, pp. 701–710, 2010.
20. U. Kang and C. Faloutsos; Beyond ‘caveman communities’: Hubs and spokes for graph com-
pression and mining. In Proceedings of ICDM, pp. 300–309, 2011.
21. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller and E. Teller; Equation of state calcula-
tions by fast computing machines. Journal of Chemical Physics, 21(6):1087–1092, 1953.
22. J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor and J. Vandewalle; Least Squares
Support Vector Machines. Singapore: World Scientific, 2002.
23. F.R.K. Chung; Spectral Graph Theory. Providence, RI: American Mathematical Society, 1997.
24. J. Baylis; Error Correcting Codes: A Mathematical Introduction. Boca Raton, FL: CRC Press,
1988.
25. R. Rabbany, M. Takafoli, J. Fagnan, O.R. Zaiane and R.J.G.B. Campello; Relative validity cri-
teria for community mining algorithms. In International Conference on Advances in Social
Networks Analysis and Mining (ASONAM), pp. 258–265, 2012.
26. A. Lancichinetti, F. Radicchi, J.J. Ramasco and S. Fortunato; Finding statistically significant communities in networks. PLoS One, 6:e18961, 2011.
CHAPTER 10
Making Big Data Transparent to the Software Developers' Community
Yu Wu, Jessica Kropczynski, and John M. Carroll
CONTENTS
10.1 Introduction 176
10.2 Software Developers' Information Needs 178
10.2.1 Information Needs: Core Work Practice 178
10.2.2 Information Needs: Constructing and Maintaining Relationships 179
10.2.3 Information Needs: Professional/Career Development 179
10.3 Software Developers' Ecosystem 180
10.3.1 Social Media Use 180
10.3.2 The Ecosystem 181
10.4 Information Overload and Awareness Issue 183
10.5 The Application of Big Data to Support the Software Developers' Community 184
10.5.1 Data Generated from Core Practices 185
10.5.2 Software Analytics 186
10.6 Conclusion 187
References 188
ABSTRACT
Software developers in the digital age participate in a set of social media services, such as GitHub, Twitter, and Stack Overflow, to gain access to different resources. Recent research indicates that these social media services form an interconnected ecosystem for software developers to make connections, stay aware of the latest news, and coordinate work. Due to the complexity of the online ecosystem and the large volume of information generated from it, software developers encounter information overload, wherein abundant information makes it hard to be aware of the most relevant resources to meet information needs. Software developers' participation traces in the ecosystem of social media services generate Big Data, which is publicly available for retrieval and analysis. Applying software analytics and related techniques to Big Data can reduce information overload, satisfy information needs, and make collaboration more effective. This chapter reviews the literature on software developers' information needs and participation in social media services, describes the information overload issue, and illustrates the available Big Data sources in the ecosystem and their potential applications.
10.1 INTRODUCTION
The open-source software (OSS) development community has allowed technology to progress at a rapid pace around the globe through shared knowledge, expertise, and collaborations. The broad-reaching open-source movement bases itself on a share-alike principle that allows anybody to use or modify software, and upon completion of a project, its source code is made publicly available. Programmers who are a part of this community contribute by voluntarily writing and exchanging code through a collaborative development process in order to produce high-quality software. This method has led to the creation of popular software products including Mozilla Firefox, Linux, and Android. Most OSS development activities are carried out online through formalized platforms (such as GitHub), incidentally creating a vast amount of interaction data across an ecosystem of platforms that can be used not only to characterize open-source development work activity more broadly but also to create Big Data awareness resources for OSS developers. The intention of these awareness resources is to enhance the ability to seek out much-needed information necessary to produce high-quality software in this unique environment that is not conducive to ownership or profits. Currently, it is problematic that interconnected resources are archived across stand-alone websites.

This chapter describes the process through which these resources can be made more conspicuous through Big Data, in three interrelated sections about the context and issues of the collaborating process in online space (Figure 10.1) and a fourth section on how Big Data can be obtained and utilized. In Section 10.2, we will describe the information needs that developers draw upon to perform basic tasks. Section 10.3 highlights the multiplatform suite of resources utilized by this community and the ways in which these platforms disperse development activities. Section 10.4 discusses the information overload and awareness issue that is generated from the information needs and the ecosystem. Section 10.5 explores ways that the seemingly boundless data generated by this community might be synthesized to increase awareness and lead to a deeper sense of community among developers and strengthen cooperation.

FIGURE 10.1 The context of collaboration in online space: information needs, the software developers' ecosystem, and the information overload and awareness issue.
Software developers have pervasive needs for data, such as activity traces, issue reporting, and coding resources necessary to transform existing code, due to the revision control methods (or version control) necessary for collaboration through web-based interfaces. Developers spend a significant amount of time processing data just to maintain awareness of the activities of fellow developers (Biehl et al. 2007). With the rise of social media and its application in software engineering, software developers need to be aware of an overwhelming amount of information from various sources, such as coworkers, interesting developers, trends, repositories, and conversations. A paradox exists wherein there is simply too much information and too little awareness of the knowledge and resources that this information contains. Community awareness is not only important for day-to-day activities, but it is also necessary in order for developers to stay current with the development landscape and industry change (Singer et al. 2014), to make decisions about projects (Buse and Zimmermann 2012), and to make connections with others based on the data revealed in public (Singer et al. 2013). Millions of software developers all over the world continuously generate new social and productive activities across multiple platforms. It is difficult for developers to maintain awareness of the breadth of the information generated by the larger community while also maintaining awareness of individual projects of which they are a part. The overwhelming amount of activity awareness can reduce the transparency of the underlying community of software developers. In an effort to increase the transparency of a thriving community and to improve awareness of available resources, this chapter will describe how developers' activity log data may be collected and analyzed and will provide potential applications of how results may be presented to the public in order to make resources more available and searchable.
Recent research on OSS developers' collaboration suggests that their interactions are not limited to revision control programming platforms (Singer et al. 2013; Vasilescu et al. 2013, 2014; Wu et al. 2014). Instead, developers make use of social features, such as profiles, activity traces, and bookmarking, to coordinate work across several platforms. All these sites and platforms form an "ecosystem" wherein software developers achieve productive goals, such as code production, making social connections, and professional development. Later in this chapter, we will illustrate the ecosystem of resources that software developers draw upon and describe the purposes that they fulfill, including social media functions that facilitate peer-to-peer interaction and the software practices that are associated with productivity. As one can imagine, the Big Data possibilities with which to explore software developers' interactions become exponentially greater when considering the multiplatform dimensions with which to explore their interactions.

Section 10.5 will provide the greatest emphasis on the kinds of Big Data we can be collecting about the software developers' community, as well as the quantities of data that exist. Online work production platforms for software development, such as GitHub (Dabbish et al. 2012; Marlow et al. 2013), CodePlex, and Bitbucket, among others, are
common platforms where software productive activities take place. These activities are publicly available through application programming interfaces (APIs) (Feeds 2014). The synthesis of the information from this Big Data set is likely to increase the awareness of software developers towards code artifacts, the latest news, and other peer developers, which in turn increases a sense of community among individual developers and facilitates further collaboration. In Sections 10.2 through 10.5, we will provide further details about the software developers' information needs, characterize the ecosystem of resources available to software developers, enumerate types of interaction data for OSS that are available to researchers, and review recent efforts to utilize Big Data analytic techniques to analyze OSS interaction data and provide summaries of information that developers will find useful to their work.
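As a concrete illustration of how such activity traces can be retrieved, the sketch below fetches a page of GitHub's public events feed over its REST API using Python and the requests library; field names, pagination, and rate limits may vary by API version, so treat it as a minimal starting point rather than a production collector.

import requests

def fetch_public_events(per_page=100, token=None):
    """Fetch one page of public GitHub events (pushes, forks, issues, watches, ...)."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:  # an access token raises the unauthenticated rate limit
        headers["Authorization"] = f"token {token}"
    resp = requests.get("https://api.github.com/events",
                        params={"per_page": per_page}, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for event in fetch_public_events(per_page=30):
        # Each event records who did what to which repository, and when.
        print(event["type"], event["repo"]["name"], event["created_at"])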
Bug reporting and fixing is another central software core practice, where differing information needs must be satisfied in order to solve bugs and issues efficiently. Bug tracking plays a central role in supporting collaboration between the developers and users of the software. Breu et al. (2010) found that information needs change over a bug's life cycle; for example, in the beginning, a bug report may include missing information or details for debugging. Later, questions develop, focusing on the correction of a bug and on status inquiries. A bug-tracking system should account for such evolving information needs.
projects that will help them develop the skills and experience necessary to break into a hot area of information technology. In such a role, programmers do not have to start out by taking key development roles but can instead follow a project, try to take on a small role, and ask questions of experts in the field. From the perspective of those hiring programmers, a list of projects and contributions that a programmer has made in an open-source community may offer a better perspective of technical and team skills than a reference. Potential employers seek out developers who are taking full advantage of resources available and are on the cutting edge of the software industry.
While experiences on such projects might help land a job, this is not the only reason that programmers might utilize OSS in their career. Corporations have begun relying on enterprise editions of OSS platforms. Working on OSS is no longer a domain of hobbyists coding in their spare time. Instead, many developers are now working on open-source programs as part of their daily jobs. Many large organizations now rely on having their employees engage with the open-source community, who may contribute to major bug reports and fixes (Arora 2012). In addition to acquiring a workforce to perform these smaller tasks, building in the open-source environment tends to allow for a broader perspective than in teams typically working on single-use cases in the context of one individual company. This environment naturally lends itself to new and innovative practices that encourage adaptation and learning from additional data resources.
different from software engineer groups. Zhao and Rosson (2009) established that the use of Twitter reduces the cost of information sharing. A study led by Black et al. (2010) found that developers also share source code, specification, and design information over social media. Social media use brings other benefits to software developers as well, such as increasing the quality of communication and the visibility of activities across all levels of organizations (Black and Jacobs 2010).

Beyond the functional purposes of the tools embedded in social media, recent research shows that the context of that use can vary widely. Even within one social media platform, software developers discuss a variety of issues relevant to their work. By analyzing 300 sample Tweets sent by software developers, Tian et al. (2012) discovered that software developers tweet on the following topics:
Moreover, social media also allows a new way for software developers to form teams and work collaboratively (Begel et al. 2010). Begel and colleagues (2010) report several typical small group communication functions that social media sites serve for online software development groups to work together: (1) storming, when social media allows rapid feedback that can be used to quickly identify and respond to changing user needs; (2) norming, which emphasizes the knowledge-sharing function of social media among developers; (3) performing, which considers that social media helps various development processes; and (4) adjourning, which highlights the recording mechanism of social media from interactions left behind. Activity traces of all of these interactions can be aggregated from social media to form institutional memory that is accessible to all users. Researchers have also made efforts to incorporate social media in software development tools. Reinhardt (2009) embedded a Twitter client in Eclipse to support Global Software Development.

In short, social media has infiltrated many aspects of software development, from software developers' daily life to team formation, and is not limited to source code sharing and collaboration. On one hand, it greatly connects developers with one another across multiple platforms that serve different purposes; on the other hand, it also introduces complexities, where developers have to manage their profiles and activities in many platforms, and potential issues, such as information overload from various social media sites.
from each other. The most recent work suggests that online communities create and organize a complex information space among community members through several categories of social tools—such as wikis, blogs, forums, social bookmarks, social file repositories, and task-management tools (Matthews et al. 2014). The "core" functions of each type of tool are leveraged in combination to coordinate work effectively. The software community as a whole acts similarly, where many online social media sites serve distinct functions for developers, and their coordination and collaborations are spread over the whole ecosystem space.

Among the types of social media adopted by software developers, GitHub is one of the most prominent platforms, since it has successfully attracted millions of developers to host their own projects and collaborate with other peers (Marlow et al. 2013). Its growing success has become a popular research focus in the past several years. However, previous research on GitHub has not tapped into the expansiveness of the data resource, because it mainly addresses how software developers utilize GitHub features, like user profiles and activity traces, to coordinate work (Begel et al. 2010; Dabbish et al. 2012; Marlow et al. 2013). GitHub is more than a collaborative tool. It also contains an extensive network of relations among millions of software developers. Thung and colleagues (2013) present an initial attempt to analyze the social network structures on GitHub. The sample presented in their study, however, is not well articulated and seems to lack the whole-network or partial-network data necessary to generate a meaningful correlation. The rationale for the formation of GitHub networks is not discussed in previous literature. The gap here is how developers' social connections are correlated with their collaborations.
Researchers have already observed that, with the advancement of social media, individuals are all connected (through several degrees or with the potential to connect) in online space (Hanna et al. 2011). Followship, which is a common feature of most social media sites, allows users to constantly pay attention to others they are interested in so as to receive the latest news and updates. For example, Twitter users can follow one another to receive messages or tweets from interesting people (Kwak et al. 2010). Besides followship, researchers have also found increasing correlations and reciprocity among microblogging services through follow relationships (Java et al. 2007). GitHub provides a similar function: By following others, one will receive the activity traces of the people he/she is following, which is similar to following on other social media, with the exception that it is broadcasting activities rather than short messages. As GitHub is similar to Twitter in terms of the follow function, activities that demonstrate paying attention to others, such as following, are also possibly used to coordinate collaborative work on software projects.

Social media also largely extends users' ability to create, modify, share, and discuss Internet content (Kietzmann et al. 2011). Seven fundamental building blocks of social media were identified by researchers, including identity, conversations, sharing, presence, relationships, reputation, and groups, which form a rich and diverse ecology of social media sites (Kietzmann et al. 2011). The authors argue that each social media site emphasizes these seven perspectives differently. In the software industry, although GitHub is a crucial platform that supports software core practices, many other sites/platforms, like Stack Overflow and Hacker News, are adopted by software developers in order to support
the fundamental purposes (Kietzmann et al. 2011). Sites/platforms emphasize the functions differently, as suggested by Kietzmann et al. (2011); however, little is known about the ways these sites/platforms are organized and form an integral system for software developers.

The software ecosystem has been discussed before, but more from an industry perspective (Bosch 2009; Campbell and Ahmed 2010; Jansen et al. 2009; Messerschmitt and Szyperski 2005), which describes the software industry in the context of users, developers, buyers, sellers, government, and economics, leaving the software developers' perspective missing from the analysis. The recent work of Singer et al. (2013) reported the existence of the ecosystem for software developers, but the missing parts are the exact components that make up the system surrounding software code repositories and software practices, and what effort can be made to make this valuable ecosystem of social and productive resources more transparent to all software developers. The large data set generated from software developers' activity traces in the online environment is a valuable resource to address the issue.
issues in awareness. Fussell and colleagues (1998) identified four types of awareness issues for distributed software development: (1) lacking awareness of peer activities, (2) lacking awareness of each other's availability, (3) lacking awareness of the process, and (4) lacking awareness of what peers are thinking and why.

Many research efforts and technologies have been applied to address the awareness issue in distributed software development. Gutwin et al. (1996) argue that awareness contributes to effective group collaboration in a physical workspace. In distributed software development situations, software developers need to maintain awareness of both the entire team and the people they need to work with (Gutwin et al. 2004), and awareness is usually realized through text-based communication, such as mailing lists and chat systems. However, Bolici et al. (2009) find that these types of communication are lacking in online software development teams: Developers are less likely to communicate with peers other than to look for traces of what others are doing and collaborate with them implicitly. Many of the awareness issues are likely to be alleviated by increasing transparency during collaboration. Dabbish et al. (2012) and Marlow et al. (2013) found that in online work production systems, developers resort to different social cues from others' profiles and activity traces to maintain awareness and coordinate work. However, the issues of awareness in the entire ecosystem are likely to be more complex, where platforms are separated from one another and there is too much information to look for.
available through the timeline and API, which provide critical sources of information for software analytics, such as understanding the evolution of software projects and archiving the development history of the projects.
predictions to identify the critical parts of the system, limiting the gravity of their impact
and allowing a better-organized plan. However, the integration of data from these stand-
alone systems cannot be performed automatically but has to be maintained manually by
developers.
With the emergence of Big Data and the ecosystem, software analytics is likely to play an even greater role. The problem of automating the integration process can be resolved because modern platforms integrate several subsystems together. For instance, GitHub integrates a VCS, an issue tracker, and documentation functions for each repository. Data generated in one software project on GitHub can be identified and collected just by its unique identifier. Also, the large data set generated from the ecosystem has the potential to address the analysis types proposed by Buse and Zimmermann (2012), which include trends, alerts, forecasting, summarization, overlays, goals, modeling, benchmarking, and simulation. One application of Big Data in software analytics is integrating the context with code artifacts. For example, a popular software repository hosted on GitHub is likely to be discussed a lot on Twitter; users of the software might raise many questions on Stack Overflow; and on Hacker News, experts would address the pros and cons of the software, comparing it with other equivalents and listing application scenarios. By extracting and synthesizing information from the ecosystem, Big Data can reveal trends in the software industry, summarizing comprehensive information about a single software repository with user feedback and questions, which provides alerts to the project stakeholders and forecasts the features and errors associated with that repository. Also, as Big Data contains detailed information on the ecosystem, multiple overlays of the data set can better help developers understand the history of a code artifact, which can be used to build models of that history or of trends in recent programming languages. And by exploring the Big Data from the ecosystem, one can better understand users' exact needs and concerns, which facilitates the setting of goals to address users' issues and concerns. Moreover, benchmarking is important in comparing different practices, like similar code artifacts. Big Data provides a larger context for benchmarking to take place, in which more practices and more factors can be taken into consideration.
10.6 CONCLUSION
In this chapter, we discussed the information needs of software developers to conduct various software practices, the emergence of the software developers' ecosystem in which developers coordinate work and make social connections across social media sites, and how the interaction of these two activities raises the information overload and awareness issue. We also argued that the Big Data generated from the ecosystem has the potential to address the information overload and awareness issue by providing accurate and comprehensive information with software analytics techniques.

Software development largely becomes implicit collaboration in online space (Bolici et al. 2009), where explicit communication is lacking and mostly indirect communication happens through various cues, such as the profile information and activity traces on social media that serve as references for software developers to form impressions of others to inform collaboration (Marlow et al. 2013). While information within one platform is
relatively searchable, similar and tangential information has been dispersed among several online resources, which creates the information overload and awareness issue. The Big Data techniques described in Section 10.5 have the most promise to allow for transparency to increase the efficiency and effectiveness of collaboration and innovative software development across the ecosystem.
REFERENCES
Arora, H. (2012). Why big companies are embracing open source? IBM. Accessed on March 20, 2014.
Retrieved from https://www.ibm.com/developerworks/community/blogs/6e6f6d1b-95c3-46df
-8a26-b7efd8ee4b57/entry/why_big_companies_are_embracing_open_source119?lang=en.
Bachmann, A., and Bernstein, A. (2009). Data Retrieval, Processing and Linking for Software Process Data Analysis. Zürich, Switzerland: University of Zürich, Technical Report.
Basili, V. R. (1993). The experience factory and its relationship to other improvement paradigms. In Software Engineering—ESEC '93, I. Sommerville and M. Paul (Eds.) (pp. 68–83). Springer Berlin, Heidelberg.
Begel, A., DeLine, R., and Zimmermann, T. (2010, November). Social media for software engineering. In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research (pp. 33–38). ACM.
Biehl, J. T., Czerwinski, M., Smith, G., and Robertson, G. G. (2007, April). FASTDash: A visual dashboard for fostering awareness in software teams. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1313–1322). ACM.
Black, S., Harrison, R., and Baldwin, M. (2010). A survey of social media use in software systems development. In Proc. of the 1st Workshop on Web 2.0 for Software Engineering (pp. 1–5). ACM.
Black, S., and Jacobs, J. (2010). Using Web 2.0 to improve software quality. In Proc. of the 1st Workshop on Web 2.0 for Software Engineering (pp. 6–11). ACM.
Bolici, F., Howison, J., and Crowston, K. (2009, May). Coordination without discussion? Socio-technical congruence and stigmergy in free and open source software projects. In Socio-Technical Congruence Workshop in Conj. Intl Conf. on Software Engineering. Vancouver, Canada.
Bosch, J. (2009, August). From software product lines to software ecosystems. In Proceedings of the 13th International Software Product Line Conference (pp. 111–119). Carnegie Mellon University.
Bougie, G., Starke, J., Storey, M. A., and German, D. M. (2011, May). Towards understanding Twitter use in software engineering: Preliminary findings, ongoing challenges and future questions. In Proceedings of the 2nd International Workshop on Web 2.0 for Software Engineering (pp. 31–36). ACM.
Breu, S., Premraj, R., Sillito, J., and Zimmermann, T. (2010, February). Information needs in bug
reports: Improving cooperation between developers and users. In Proceedings of the 2010 ACM
Conference on Computer Supported Cooperative Work (pp. 301–310). ACM.
Buse, R. P., and Zimmermann, T. (2010, November). Analytics for software development. In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research (pp. 77–80). ACM.
Buse, R. P., and Zimmermann, T. (2012, June). Information needs for software development analytics. In Proceedings of the 2012 International Conference on Software Engineering (pp. 987–996). IEEE Press.
Campbell, P. R., and Ahmed, F. (2010, August). A three-dimensional view of software ecosystems. In Proceedings of the Fourth European Conference on Software Architecture: Companion Volume (pp. 81–84). ACM.
Crockford, D. (2006). The application/JSON media type for JavaScript object notation (JSON). RFC 4627, July 2006.
Crowston, K. (1997). A coordination theory approach to organizational process design. Organization
Science, 8(2), 157–175.
Dabbish, L., Stuart, C., Tsay, J., and Herbsleb, J. (2012, February). Social coding in GitHub: Transparency and collaboration in an open software repository. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (pp. 1277–1286). ACM.
Davenport, T. H., Harris, J. G., and Morison, R. (2010). Analytics at Work: Smarter Decisions, Better
Results. Harvard Business Press, Cambridge, MA.
Edmunds, A., and Morris, A. (2000). The problem of information overload in business organisations: A review of the literature. International Journal of Information Management, 20(1), 17–28.
Feeds (2014). Accessed on March 11, 2014. Retrieved from http://developer.github.com/v3/activity
/feeds/.
Fussell, S. R., Kraut, R. E., Lerch, F. J., Scherlis, W. L., McNally, M. M., and Cadiz, J. J. (1998, November).
Coordination, overload and team performance: Effects of team communication strategies. In
Proceedings of the 1998 ACM Conference on Computer Supported Cooperative Work (pp. 275–284).
ACM.
Gutwin, C., Greenberg, S., and Roseman, M. (1996). Workspace awareness in real-time distributed
groupware: Framework, widgets, and evaluation. In People and Computers XI, R. J. Sasse, A.
Cunningham, and R. Winder (Eds.) (pp. 281–298). Springer, London.
Gutwin, C., Penner, R., and Schneider, K. (2004, November). Group awareness in distributed software development. In Proceedings of the 2004 ACM Conference on Computer Supported Cooperative Work (pp. 72–81). ACM.
Hanna, R., Rohm, A., and Crittenden, V. L. (2011). We're all connected: The power of the social media ecosystem. Business Horizons 54(3), 265–273.
Hassan, A. E. (2006, September). Mining software repositories to assist developers and support managers. In ICSM '06 22nd IEEE International Conference on Software Maintenance, 2006 (pp. 339–342). IEEE.
Jansen, S., Finkelstein, A., and Brinkkemper, S. (2009, May). A sense of community: A research agenda for software ecosystems. In Software Engineering-Companion Volume, 2009. ICSE-Companion 2009 (pp. 187–190). IEEE.
Java, A., Song, X., Finin, T., and Tseng, B. (2007, August). Why we twitter: Understanding micro-
blogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007
Workshop on Web Mining and Social Network Analysis (pp. 56–65). ACM.
Kietzmann, J. H., Hermkens, K., McCarthy, I. P., and Silvestre, B. S. (2011). Social media? Get seri-
ous! Understanding the functional building blocks of social media. Business Horizons 54(3),
241–251.
Klapp, O. E. (1986). Overload and Boredom: Essays on the Quality of Life in the Information Society
(pp. 98–99). Greenwood Press, Connecticut.
Ko, A. J., DeLine, R., and Venolia, G. (2007, May). Information needs in collocated software development teams. In Proceedings of the 29th International Conference on Software Engineering (pp. 344–353). IEEE Computer Society.
Kwak, H., Lee, C., Park, H., and Moon, S. (2010, April). What is Twitter, a social network or a news
media? In Proceedings of the 19th International Conference on World Wide Web (pp. 591–600).
ACM.
Lietsala, K., and Sirkkunen, E. (2008). Social Media—Introduction to the Tools and Processes of
Participatory Economy. University of Tampere Press, Tampere, Finland.
Marlow, J., Dabbish, L., and Herbsleb, J. (2013, February). Impression formation in online peer production: Activity traces and personal profiles in GitHub. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work (pp. 117–128). ACM.
Matthews, T., Whittaker, S., Badenes, H., and Smith, B. A. (2014). Beyond end user content to col-
laborative knowledge mapping: Interrelations among community social tools. In Proceedings
of the 17th ACM Conference on Computer Supported Cooperative Work and Social Computing
(pp. 900–910). ACM.
CHAPTER 11
Key Technologies for Big Data Stream Computing
Dawei Sun, Guangyan Zhang, Weimin Zheng, and Keqin Li
CONTENTS
11.1 Introduction 194
11.1.1 Stream Computing 195
11.1.2 Application Background 195
11.1.3 Chapter Organization 196
11.2 Overview of a BDSC System 196
11.2.1 Directed Acyclic Graph and Stream Computing 196
11.2.2 System Architecture for Stream Computing 198
11.2.3 Key Technologies for BDSC Systems 199
11.2.3.1 System Structure 199
11.2.3.2 Data Stream Transmission 200
11.2.3.3 Application Interfaces 200
11.2.3.4 High Availability 200
11.3 Example BDSC Systems 202
11.3.1 Twitter Storm 202
11.3.1.1 Task Topology 202
11.3.1.2 Fault Tolerance 203
11.3.1.3 Reliability 203
11.3.1.4 Storm Cluster 204
11.3.2 Yahoo! S4 204
11.3.2.1 Processing Element 205
11.3.2.2 Processing Nodes 205
11.3.2.3 Fail-Over, Checkpointing, and Recovery Mechanism 205
11.3.2.4 System Architecture 206
11.3.3 Microsoft TimeStream and Naiad 206
11.3.3.1 TimeStream 206
11.3.3.2 Naiad 209
11.4 Future Perspective 210
ABSTRACT
As a new trend in data-intensive computing, real-time stream computing is gaining significant attention in the Big Data era. In theory, stream computing is an effective way to support Big Data by providing extremely low-latency processing tools and massively parallel processing architectures for real-time data analysis. However, in most existing stream computing environments, how to efficiently deal with Big Data stream computing and how to build efficient Big Data stream computing systems pose great challenges to Big Data computing research. First, the data stream graphs and the system architecture for Big Data stream computing, and some related key technologies, such as system structure, data transmission, application interfaces, and high availability, are systematically researched. Then, we give a classification of the latest research and depict the development status of some popular Big Data stream computing systems, including Twitter Storm, Yahoo! S4, Microsoft TimeStream, and Microsoft Naiad. Finally, the potential challenges and future directions of Big Data stream computing are discussed.
11.1 INTRODUCTION
Big Data computing is a new trend for future computing, with the quantity of data growing and the speed of data increasing. In general, there are two main mechanisms for Big Data computing, that is, Big Data stream computing (BDSC) and Big Data batch computing. BDSC is a model of straight-through computing, such as Storm [1] and S4 [2], which does for stream computing what Hadoop does for batch computing, while Big Data batch computing is a model of storing then computing, such as the MapReduce framework [3] open-sourced by the Hadoop implementation [4].

Essentially, Big Data batch computing is not sufficient for many real-time application scenarios, where a data stream changes frequently over time and the latest data are the most important and most valuable. For example, when analyzing data from real-time transactions (e.g., financial trades, e-mail messages, user search requests, sensor data tracking), a data stream grows monotonically over time as more transactions take place. Ideally, a real-time application environment can be supported by BDSC. Generally, Big Data stream computing has the following defining characteristics [5,6]. (1) The input data stream is a real-time data stream and needs real-time computing, and the results must be updated
every time the data changes. (2) Incoming data arrive continuously at volumes that far exceed the capabilities of individual machines. (3) Input streams incur multistaged computing at low latency to produce output streams, where any incoming data entry is ideally reflected in the newly generated results in output streams within seconds.
Definition 1

A data stream graph G is a directed acyclic graph, which is composed of a set of vertices and a set of directed edges, has a logical structure and a special function, and is denoted as G = (V(G), E(G)), where V(G) = {v1, v2,…,vn} is a finite set of n vertices, which represent tasks, and E(G) = {e1,2, e1,3,…,en−1,n} is a finite set of directed edges, which represent data streams between vertices. If ∃ei,j ∈ E(G), then vi, vj ∈ V(G), vi ≠ vj, and 〈vi, vj〉 is an ordered pair, where a data stream comes from vi and goes to vj.
The in-degree of vertex vi is the number of incoming edges, and the out-degree of vertex vi is the number of outgoing edges. A source vertex is a vertex whose in-degree is 0, and an end vertex is a vertex whose out-degree is 0. A data stream graph G has at least one source vertex and one end vertex.

For the example data stream graph with 11 vertices shown in Figure 11.1, the set of vertices is V = {va, vb,…,vk}, the set of directed edges is E = {ea,c, eb,c,…,ej,k}, the source vertices are va and vb, and the end vertex is vk. The in-degree of vertex vd is 1, and the out-degree of vertex vd is 2.

FIGURE 11.1 An example data stream graph with 11 vertices, va through vk.
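To make Definition 1 concrete, the Python sketch below stores a data stream graph as an edge list, derives in-degrees and out-degrees, and identifies the source and end vertices. The edge list is a hypothetical reading of Figure 11.1 (the text gives only a few of the edges), chosen so that the stated properties hold: sources va and vb, end vertex vk, and vd with in-degree 1 and out-degree 2.

from collections import defaultdict

# Hypothetical edge list in the spirit of Figure 11.1; each pair (i, j) is a
# directed edge e_{i,j} from vertex v_i to vertex v_j.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("c", "e"), ("d", "f"), ("d", "g"),
         ("e", "h"), ("f", "i"), ("g", "i"), ("h", "j"), ("i", "j"), ("j", "k")]

out_deg, in_deg = defaultdict(int), defaultdict(int)
vertices = set()
for i, j in edges:
    vertices.update((i, j))
    out_deg[i] += 1
    in_deg[j] += 1

sources = sorted(v for v in vertices if in_deg[v] == 0)   # vertices with in-degree 0
ends = sorted(v for v in vertices if out_deg[v] == 0)     # vertices with out-degree 0
print("source vertices:", sources, "end vertices:", ends)
print("in-degree of d:", in_deg["d"], "out-degree of d:", out_deg["d"])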
Definition 2

A subgraph sub-G of the data stream graph G is a subgraph consisting of a subset of the vertices with the edges in between. For vertices vi and vj in the subgraph sub-G and any vertex v in the data stream graph G, v must also be in sub-G if v is on a directed path from vi to vj; that is, ∀vi, vj ∈ V(sub-G), ∀v ∈ V(G), if v ∈ V(p(vi, vj)), then v ∈ V(sub-G).

A subgraph sub-G is logically equivalent to, and can be substituted by, a single vertex. Without the condition above, reducing a subgraph to a single logical vertex could create a graph with a cycle, which would no longer be a DAG.
Definition 3

A path p(vu, vv) from vertex vu to vertex vv is a subset of E(G) that satisfies the following properties for any directed edge ek,l ∈ p(vu, vv): if k ≠ u, then there exists m such that em,k ∈ p(vu, vv); and if l ≠ v, then there exists m such that el,m ∈ p(vu, vv).
The latency lp(vu, vv) of a path from vertex vu to vertex vv is the sum of the latencies of the vertices and edges on the path, as given by Equation 11.1:

l_p(v_u, v_v) = \sum_{v_i \in V(p(v_u, v_v))} c_{v_i} + \sum_{e_{i,j} \in E(p(v_u, v_v))} c_{e_{i,j}}, \quad c_{v_i}, c_{e_{i,j}} \ge 0.    (11.1)
A critical path, also called the longest path, is a path with the longest latency from a
source vertex vs to an end vertex ve in a data stream graph G, which is also the latency of
data stream graph G.
If there are m paths from source vertex vs to end vertex ve in data stream graph G, then the latency l(G) of data stream graph G is given by Equation 11.2:

l(G) = \max\{\, l_{p_1}(v_s, v_e),\; l_{p_2}(v_s, v_e),\; \ldots,\; l_{p_m}(v_s, v_e) \,\},    (11.2)

where l_{p_i}(v_s, v_e) is the latency of the ith path from vertex vs to vertex ve.
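A small Python sketch of Equations 11.1 and 11.2: given per-vertex latencies c_v and per-edge latencies c_e, the critical-path latency l(G) from a source vertex to the end vertex can be computed by memoized depth-first search over the DAG. The graph and the latency values below are illustrative placeholders.

from functools import lru_cache

# Illustrative DAG with vertex latencies c_v and edge latencies c_e.
succ = {"s": ["a", "b"], "a": ["e"], "b": ["e"], "e": []}
c_v = {"s": 1.0, "a": 2.0, "b": 4.0, "e": 1.0}
c_e = {("s", "a"): 0.5, ("s", "b"): 0.5, ("a", "e"): 0.2, ("b", "e"): 0.2}

@lru_cache(maxsize=None)
def longest_latency(v):
    """Latency of the longest path starting at v (Equation 11.1 accumulated)."""
    if not succ[v]:                      # end vertex: only its own cost remains
        return c_v[v]
    return c_v[v] + max(c_e[(v, w)] + longest_latency(w) for w in succ[v])

# l(G) per Equation 11.2: the maximum latency over all source-to-end paths.
print(longest_latency("s"))  # 1.0 + 0.5 + 4.0 + 0.2 + 1.0 = 6.7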
Definition 4
In data stream graph G, if ∃ei,j from vertex vi to vertex vj, then vertex vi is a parent of vertex
vj, and vertex vj is a child of vertex vi.
Definition 5

The throughput t(vi) of vertex vi is the average rate of successful data stream computing in a Big Data environment and is usually measured in bits per second (bps).

We identify the source vertex vs as in the first level, the children of source vertex vs as in the second level, and so on, and the end vertex ve as in the last level.
The throughput t(leveli) of the ith level can be calculated by Equation 11.3:

t(\mathrm{level}_i) = \sum_{k=1}^{n_i} t(v_k),    (11.3)

where n_i is the number of vertices in the ith level.
Definition 6

A topological sort TS(G) = (v_{x_1}, v_{x_2}, …, v_{x_n}) of the vertices V(G) in data stream graph G is a linear ordering of its vertices such that, for every directed edge e_{x_i,x_j} ∈ E(G) from vertex v_{x_i} to vertex v_{x_j}, v_{x_i} comes before v_{x_j} in the topological ordering.

A topological sort is possible if and only if the graph has no directed cycle, that is, it needs to be a directed acyclic graph. Any directed acyclic graph has at least one topological sort.
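Definition 6 can be realized with Kahn's algorithm: repeatedly output a vertex of in-degree zero and decrement the in-degrees of its successors. A minimal Python sketch follows; the adjacency list passed at the bottom is only a toy example.

from collections import deque

def topological_sort(succ):
    """Return one topological ordering of a DAG given as {vertex: [successors]}."""
    in_deg = {v: 0 for v in succ}
    for v in succ:
        for w in succ[v]:
            in_deg[w] += 1
    queue = deque(v for v in succ if in_deg[v] == 0)   # all source vertices
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in succ[v]:
            in_deg[w] -= 1
            if in_deg[w] == 0:
                queue.append(w)
    if len(order) != len(succ):                        # leftover vertices imply a cycle
        raise ValueError("graph has a directed cycle; no topological sort exists")
    return order

print(topological_sort({"a": ["c"], "b": ["c"], "c": ["d"], "d": []}))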
Definition 7

A graph partitioning GP(G) = {GP_1, GP_2,…,GP_m} of the data stream graph G is a topological sort–based split of the vertex set V(G) and the corresponding directed edges. A graph partitioning should meet the nonoverlapping and covering properties; that is, if ∀i ≠ j, i, j ∈ {1, 2,…,m}, then V(GP_i) ∩ V(GP_j) = ∅, and \bigcup_{i=1}^{m} V(GP_i) = V(G).
continuous data streams are computed in real time, and the results must be updated also
in real time. The volume of data is so high that there is not enough space for storage, and
not all data need to be stored. Most data will be discarded, and only a small portion of the
data will be permanently stored in hard disks.
has a special function, and it always receives a data stream from the master node, processes
the data stream, and sends the results to the master node. Usually, the master node is the
bottleneck in the master–slave structure system. If it fails, the whole system will not work.
FIGURE 11.5 Passive standby strategy.
In the active standby strategy (see Figure 11.6), the secondary nodes compute all data
streams in parallel with their primaries. Usually, the recovery time of this strategy is the
shortest.
In the upstream backup strategy (see Figure 11.7), upstream nodes act as backups for
their downstream neighbors by preserving data streams in their output queues while their
downstream neighbors compute them. If a node fails, its upstream nodes replay the logged
data stream on a recovery node. Usually, the runtime overhead of this strategy is the lowest.
A comparison of the three main high-availability strategies, that is, passive standby
strategy, active standby strategy, and upstream backup strategy, in runtime overhead and
recovery time is shown in Figure 11.8. he recovery time of the upstream backup strategy
is the longest, while the runtime overhead of the passive standby strategy is the greatest.
FIGURE 11.6 Active standby strategy.

FIGURE 11.7 Upstream backup strategy.
FIGURE 11.8 Comparison of high-availability strategies in runtime overhead and recovery time.
FIGURE 11.9 An example task topology in Storm, composed of spouts and bolts.
vertex in a task topology can be created in many instances. All those vertices will simultaneously process a data stream, and different parts of the topology can be allocated to different machines. A good allocation strategy will greatly improve system performance.

A data stream grouping defines how that stream should be partitioned among the bolt's tasks; spouts and bolts execute in parallel as many tasks across the cluster. There are seven built-in stream groupings in Storm, such as shuffle grouping, fields grouping, all grouping, global grouping, none grouping, direct grouping, and local or shuffle grouping; a custom stream grouping to meet special needs can also be implemented by the CustomStreamGrouping interface.
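The intent of the two most commonly used groupings can be sketched independently of Storm's actual API: shuffle grouping spreads tuples across a bolt's tasks at random, while fields grouping hashes the chosen field so that equal field values always reach the same task. The Python below is a conceptual illustration only (note that Python's built-in hash is not stable across processes; a real system would use a deterministic hash).

import random

def shuffle_grouping(num_tasks):
    """Pick a destination task uniformly at random for each tuple."""
    return random.randrange(num_tasks)

def fields_grouping(tuple_fields, key, num_tasks):
    """Route tuples with the same value of `key` to the same task."""
    return hash(tuple_fields[key]) % num_tasks

# Example: word counting, where every occurrence of a word must reach the same counter task.
t = {"word": "stream", "count": 1}
print(fields_grouping(t, "word", num_tasks=4))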
11.3.1.3 Reliability
In Storm, the reliability mechanisms guarantee that every spout tuple will be fully processed by the corresponding topology. They do this by tracking the tree of tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed. Every topology has a "message timeout" associated with it. If Storm fails to detect that a spout tuple has been completed within that timeout, then it fails the tuple and replays it later.

The reliability mechanisms of Storm are completely distributed, scalable, and fault tolerant. Storm uses mod hashing to map a spout tuple ID to an acker task. Since every tuple carries with it the spout tuple IDs of all the trees they exist within, they know which acker tasks to communicate with. When a spout task emits a new tuple, it simply sends a message to the appropriate acker telling it that its task ID is responsible for that spout tuple. Then, when an acker sees that a tree has been completed, it knows to which task ID to send the completion message.

An acker task stores a map from a spout tuple ID to a pair of values. The first value is the task ID that created the spout tuple, which is used later on to send completion messages. The second value is a 64-bit number called the "ack val." The ack val is a representation of the state of the entire tuple tree, no matter how big or how small. It is simply the exclusive OR (XOR) of all tuple IDs that have been created and/or acked in the tree. When an acker task sees that an "ack val" has become 0, then it knows that the tuple tree is completed.

FIGURE 11.10 A Storm cluster: a Nimbus master node, a Zookeeper cluster, and worker (slave) nodes.
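The XOR bookkeeping described above can be shown in a few lines of Python: every tuple ID is XORed into the ack val once when the tuple is anchored (created) and once when it is acked, so the accumulator returns to zero exactly when every tuple in the tree has been acknowledged. The Acker class and the random 64-bit IDs are simplified stand-ins for Storm's internals.

import random

class Acker:
    """Tracks one spout tuple's tree as a single 64-bit XOR accumulator."""
    def __init__(self):
        self.ack_val = 0

    def anchor(self, tuple_id):      # called when a downstream tuple is emitted
        self.ack_val ^= tuple_id

    def ack(self, tuple_id):         # called when that tuple is fully processed
        self.ack_val ^= tuple_id

    def tree_completed(self):
        return self.ack_val == 0

acker = Acker()
ids = [random.getrandbits(64) for _ in range(3)]
for i in ids:
    acker.anchor(i)
for i in ids:
    acker.ack(i)
print(acker.tree_completed())        # True: every anchored tuple was acked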
11.3.2 Yahoo! S4
S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for computing continuous unbounded streams of Big Data. The core part of S4 is written in Java. The implementation is modular and pluggable, and S4 applications can be easily and dynamically combined for creating more sophisticated stream processing systems. S4 was initially released by Yahoo! Inc. in October 2010 and has been an Apache Incubator project since September 2011. It is licensed under the Apache 2.0 license [2,22–25].
FIGURE 11.13 Fail-over mechanism. (a) In working state, (b) under failed state, and (c) after recovery state.
In order to improve the availability of the S4 system, S4 should provide a fail-over mechanism to automatically detect failed nodes and redirect the data stream to a standby node. If you have n partitions and start m nodes, with m > n, you get m − n standby nodes. For instance, if there are seven live nodes and four partitions available, four of the nodes pick the available partitions in Zookeeper. The remaining three nodes will be available standby nodes. Each active node consistently receives messages for the partition that it picked, as shown in Figure 11.13a. When Zookeeper detects that one of the active nodes fails, it will notify a standby node to replace the failed node. As shown in Figure 11.13b, the node assigned with partition 1 fails. Unassigned nodes compete for a partition assignment, and only one of them picks it. Other nodes are notified of the new assignment and can reroute the data stream for partition 1, as shown in Figure 11.13c.

If a node is unreachable after a session timeout, Zookeeper will identify this node as dead. The session timeout is specified by the client upon connection and is, at minimum, twice the heartbeat specified in the Zookeeper ensemble configuration.
In order to minimize state loss when a node is dead, a checkpointing and recovery mechanism is employed by S4. The states of PEs are periodically checkpointed and stored. Whenever a node fails, the checkpoint information will be used by the recovery mechanism to recover the state of the failed node to the corresponding standby node. Most of the previous state of a failed node can be seen in the corresponding standby node.
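A stripped-down sketch of the periodic checkpointing and recovery idea, in Python; the checkpoint interval, the in-memory "store," and the event handling below are placeholders and not S4's actual implementation.

import copy

class ProcessingElement:
    """A PE-like stateful operator that checkpoints its state every N events."""
    def __init__(self, checkpoint_every=100):
        self.state = {}                  # e.g., running counts keyed by event attribute
        self.seen = 0
        self.checkpoint_every = checkpoint_every
        self.checkpoint = None           # stands in for a durable checkpoint store

    def on_event(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.seen += 1
        if self.seen % self.checkpoint_every == 0:
            self.checkpoint = copy.deepcopy(self.state)   # periodic checkpoint

    @classmethod
    def recover(cls, checkpoint):
        """Bring up a standby PE from the last checkpoint; events since it are lost."""
        pe = cls()
        pe.state = copy.deepcopy(checkpoint) if checkpoint else {}
        return pe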
11.3.3.1 TimeStream
TimeStream is a distributed system designed specifically for low-latency continuous processing of big streaming data on a large cluster of commodity machines and is based on
FIGURE 11.14 The system architecture of S4 (users U0 through Un−1, resource monitoring, legacy support, and an S4 cluster of processing nodes).
1. Streaming DAG
A streaming DAG is a type of task topology, which can be dynamically reconfigured according to the loading of a data stream. All data streams in the TimeStream system are processed in a streaming DAG. Each vertex in the streaming DAG is allocated to physical machines for execution. As shown in Figure 11.15, the streaming function fv of vertex v is designed by the user. When input data stream i arrives, streaming function fv processes data stream i, updates v's state from τ to τ′, and produces a sequence o of output entries as part of the output streams for downstream vertices (a minimal sketch of such a stateful vertex appears after this list).
A sub-DAG is logically equivalent to, and can be reduced to, one vertex or another sub-DAG. As shown in Figure 11.16, the sub-DAG comprised of vertices v2, v3, v4, and v5 (as well as all their edges) is a valid sub-DAG and can be reduced to a "vertex" with i as its input stream and o as its output stream.

FIGURE 11.15 A vertex v with streaming function fv, state τ/τ′, input stream i, and output stream o.

FIGURE 11.16 A streaming DAG with vertices v1 through v7; the sub-DAG of v2, v3, v4, and v5 can be reduced to a single vertex.

2. Resilient Substitution
Resilient substitution is an important feature of TimeStream. It is used to dynamically adjust and reconfigure the streaming DAG according to the loading change of a data stream. There are three types of resilient substitution in TimeStream. (a) A vertex is substituted by another vertex. When a vertex fails, a new corresponding standby vertex is initiated to replace the failed one and continues execution, possibly on a different machine. (b) A sub-DAG is substituted by another sub-DAG. When the number of instances of a vertex in a sub-DAG needs to be adjusted, a new sub-DAG will replace the old one. For example, as shown in Figure 11.17, a sub-DAG comprised of vertices v2, v3, v4, and v5 implements three stages: hash partitioning, computation, and union. When the load increases, TimeStream can create a new sub-DAG (shown on the left), which uses four partitions instead of two, to replace the original sub-DAG. (c) A sub-DAG is substituted by a vertex. When the load decreases, there is no need for so many steps to finish a special function, and the corresponding sub-DAG can be substituted by a vertex, as shown in Figure 11.16.
FIGURE 11.17 Sub-DAG substitution in TimeStream: hash partition, computation, and union stages, with the number of partitions increased under load.
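As a minimal illustration of the streaming-function contract described in item 1 above (fv maps the current state τ and an input entry to a new state τ′ plus output entries o), the Python sketch below wraps a user-defined function in a vertex object; the word-count function is a made-up example of fv, not TimeStream code.

class StreamingVertex:
    """A DAG vertex v with user-defined streaming function f_v and state tau."""
    def __init__(self, f_v, initial_state):
        self.f_v = f_v
        self.state = initial_state

    def on_input(self, entry):
        self.state, outputs = self.f_v(self.state, entry)   # tau -> tau', plus outputs o
        return outputs                                      # forwarded to downstream vertices

def word_count(state, entry):
    counts = dict(state)
    counts[entry] = counts.get(entry, 0) + 1
    return counts, [(entry, counts[entry])]                 # emit updated (word, count) pairs

v = StreamingVertex(word_count, initial_state={})
print(v.on_input("big"), v.on_input("data"), v.on_input("big"))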
11.3.3.2 Naiad
Naiad is a distributed system for executing data-parallel, cyclic datalow programs. he
core part is written in C#. It ofers high throughput of batch processors and low latency of
stream processors and is able to perform iterative and incremental computations. Naiad is
a prototype implementation of a new computational model, timely datalow [30].
1. Timely Dataflow
Timely dataflow is a computational model based on directed graphs. The data-
flow graph can be a directed acyclic graph, as in other BDSC environments. It
can also be a directed cyclic graph; that is, cycles in the dataflow graph
are explicitly taken into account. In timely dataflow, the time stamps reflect cycle struc-
ture in order to distinguish data that arise in different input epochs and loop
iterations. The external producer labels each message with an integer epoch and
notifies the input vertex when it will not receive any more messages with a given
epoch label.
Timely dataflow graphs are directed graphs with the constraint that the vertices
are organized into possibly nested loop contexts, with three associated system-
provided vertices. Edges entering a loop context must pass through an ingress vertex,
and edges leaving a loop context must pass through an egress vertex. Additionally,
every cycle in the graph must be contained entirely within some loop context and
include at least one feedback vertex that is not nested within any inner loop contexts.
Figure 11.18 shows a single-loop context with ingress (I), egress (E), and feedback (F)
vertices labeled.
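To make the loop-context mechanics more concrete, the following C sketch shows one possible representation of a timely dataflow time stamp: an input epoch paired with a vector of loop counters, one per nesting level, which the three system-provided vertices adjust. This is an illustrative sketch only; the type name, the fixed maximum depth, and the helper functions are assumptions for exposition, not Naiad's actual (C#) implementation.

/* Illustrative sketch: a timely-dataflow-style time stamp holding the input
 * epoch plus one loop counter per nesting level. Names, the fixed maximum
 * depth, and the helpers are assumptions, not Naiad's C# implementation. */
#define MAX_LOOP_DEPTH 8

typedef struct {
    int epoch;                     /* integer epoch set by the external producer */
    int depth;                     /* current loop-nesting depth                 */
    int counters[MAX_LOOP_DEPTH];  /* one counter per enclosing loop context     */
} Timestamp;

/* Ingress vertex: entering a loop context appends a new counter set to 0. */
static void on_ingress(Timestamp *t)  { t->counters[t->depth++] = 0; }

/* Egress vertex: leaving a loop context drops the innermost counter. */
static void on_egress(Timestamp *t)   { if (t->depth > 0) t->depth--; }

/* Feedback vertex: one trip around a cycle increments the innermost counter. */
static void on_feedback(Timestamp *t) { if (t->depth > 0) t->counters[t->depth - 1]++; }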
2. System Architecture
The system architecture of a Naiad cluster is shown in Figure 11.19, with a group of
processes hosting workers that manage a partition of the timely dataflow vertices.
Workers exchange messages locally using shared memory and remotely using TCP
connections between each pair of processes.
A program specifies its timely dataflow graph as a logical graph of stages linked by
typed connectors. Each connector optionally has a partitioning function to control
the exchange of data between stages. At execution time, Naiad expands the logical
graph into a physical graph where each stage is replaced by a set of vertices and each
connector by a set of edges. Figure 11.19 shows a logical graph and a corresponding
physical graph, where the connector from X to Y has a partitioning function H(m) on
typed messages m.
Each Naiad worker is responsible for delivering messages and notifications to ver-
tices in its partition of the timely dataflow graph. When faced with multiple runnable
actions, workers break ties by delivering messages before notifications, in order to
reduce the amount of queued data.
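As a small illustration of the role of a partitioning function, the sketch below (in C, with all names hypothetical) routes a typed message m to one of the physical instances of the downstream stage by hashing it with H(m); delivery then goes through shared memory for a local instance or over a TCP connection for a remote one.

/* Hypothetical sketch of connector-level partitioning (not Naiad's API):
 * H(m) hashes the typed message, and the result selects which physical
 * vertex instance of the downstream stage receives it. */
#include <stdint.h>

typedef struct { uint64_t key; /* ... payload fields ... */ } Message;

static uint64_t H(const Message *m) {            /* assumed partitioning function */
    uint64_t x = m->key;
    x ^= x >> 33; x *= 0x9E3779B97F4A7C15ULL; x ^= x >> 29;
    return x;
}

/* Returns the index of the target vertex instance among n_instances. */
static int route(const Message *m, int n_instances) {
    return (int)(H(m) % (uint64_t)n_instances);
}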
streams a partial data stream to a global computing center when local resources become
insufficient. Future research directions for BDSC include the following:
1. Research on new strategies to optimize a task topology graph, such as subgraph parti-
tioning, subgraph replication, and subgraph allocation strategies, to provide a
high-throughput BDSC environment
2. Research on dynamically extensible data stream strategies, such that a data stream can be
adjusted according to available resources and the QoS of users, to provide a well
load-balanced BDSC environment
3. Research on the impact of a task topology graph with a cycle, and corresponding
task topology graph optimization and resource allocation strategies, to provide
a highly adaptive BDSC environment
4. Research on architectures for large-scale real-time stream computing environ-
ments, such as symmetric and master–slave architectures, to provide a
highly consistent BDSC environment
5. Develop a BDSC system with the features of high throughput, high fault tolerance,
high consistency, and high scalability, and deploy such a system in a real BDSC
environment
ACKNOWLEDGMENTS
This work was supported in part by the National Natural Science Foundation of China under
Grant No. 61170008 and Grant No. 61272055, in part by the National Grand Fundamental
Research 973 Program of China under Grant No. 2014CB340402, in part by the National High
Technology Research and Development Program of China under Grant No. 2013AA01A210,
and in part by the China Postdoctoral Science Foundation under Grant No. 2014M560976.
REFERENCES
1. Storm. Available at http://storm-project.net/ (accessed July 16, 2013).
2. Neumeyer L, Robbins B, Nair A et al. S4: Distributed stream computing platform, Proc. 10th
IEEE International Conference on Data Mining Workshops, ICDMW 2010, Sydney, NSW,
Australia, IEEE Press, December 2010, pp. 170–177.
3. Zhao Y and Wu J. Dache: A data aware caching for big-data applications using the MapReduce
framework, Proc. 32nd IEEE Conference on Computer Communications, INFOCOM 2013, IEEE
Press, April 2013, pp. 35–39.
4. Shang W, Jiang Z M, Hemmati H et al. Assisting developers of Big Data analytics applica-
tions when deploying on Hadoop clouds, Proc. 35th International Conference on Software
Engineering, ICSE 2013, IEEE Press, May 2013, pp. 402–411.
5. Qian Z P, He Y, Su C Z et al. TimeStream: Reliable stream computation in the cloud, Proc. 8th
ACM European Conference on Computer Systems, EuroSys 2013, Prague, Czech Republic, ACM
Press, April 2013, pp. 1–14.
6. Acar U A and Chen Y. Streaming Big Data with self-adjusting computation, Proc. 2013 ACM
SIGPLAN Workshop on Data Driven Functional Programming, Co-located with POPL 2013,
DDFP 2013, Rome, Italy, ACM Press, January 2013, pp. 15–18.
7. Demirkan H and Delen D. Leveraging the capabilities of service-oriented decision support
systems: Putting analytics and Big Data in cloud, Decision Support Systems, vol. 55(1), 2013,
pp. 412–421.
8. Bifet A. Mining Big Data in real time, Informatica (Slovenia), vol. 37(1), 2013, pp. 15–20.
9. Lu J and Li D. Bias correction in a small sample from Big Data, IEEE Transactions on Knowledge
and Data Engineering, vol. 25(11), 2013, pp. 2658–2663.
10. Tien J M. Big Data: Unleashing information, Journal of Systems Science and Systems Engineering,
vol. 22(2), 2013, pp. 127–151.
11. Zhang R, Koudas N, Ooi B C et al. Streaming multiple aggregations using phantoms, VLDB
Journal, vol. 19(4), 2010, pp. 557–583.
12. Hirzel M, Andrade H, Gedik B et al. IBM streams processing language: Analyzing Big Data in
motion, IBM Journal of Research and Development, vol. 57(3/4), 2013, pp. 7:1–7:11.
13. Dayarathna M and Suzumura T. Automatic optimization of stream programs via source pro-
gram operator graph transformations, Distributed and Parallel Databases, vol. 31(4), 2013,
pp. 543–599.
14. Farhad S M, Ko Y, Burgstaller B et al. Orchestration by approximation mapping stream pro-
grams onto multicore architectures, Proc. 16th International Conference on Architectural
Support for Programming Languages and Operating Systems, ASPLOS 2011, ACM Press, March
2011, pp. 357–367.
15. Schneider S, Hirzel M and Gedik B. Tutorial: Stream processing optimizations, Proc. 7th ACM
International Conference on Distributed Event-Based Systems, DEBS 2013, ACM Press, June
2013, pp. 249–258.
16. Khandekar R, Hildrum K, Parekh S et al. COLA: Optimizing stream processing applications
via graph partitioning, Proc. 10th ACM/IFIP/USENIX International Conference on Middleware,
Middleware 2009, ACM Press, November 2009, pp. 1–20.
17. Scalosub G, Marbach P and Liebeherr J. Buffer management for aggregated streaming data
with packet dependencies, IEEE Transactions on Parallel and Distributed Systems, vol. 24(3),
2013, pp. 439–449.
18. Malensek M, Pallickara S L and Pallickara S. Exploiting geospatial and chronological charac-
teristics in data streams to enable efficient storage and retrievals, Future Generation Computer
Systems, vol. 29(4), 2013, pp. 1049–1061.
19. Cugola G and Margara A. Processing flows of information: From data stream to complex event
processing, ACM Computing Surveys, vol. 44(3), 2012, pp. 15:1–15:62.
Streaming Algorithms for Big Data Processing on Multicore Architecture
Marat Zhanikeev
CONTENTS
12.1 Introduction 216
12.2 An Unconventional Big Data Processor 218
12.2.1 Terminology 218
12.2.2 Overview of Hadoop 219
12.2.3 Hadoop Alternative: Big Data Replay 219
12.3 Putting the Pieces Together 220
12.3.1 More on the Scope of the Problem 220
12.3.2 Overview of Literature 222
12.4 The Data Streaming Problem 223
12.4.1 Data Streaming Terminology 223
12.4.2 Related Information Theory and Formulations 224
12.4.3 Practical Applications and Designs 225
12.5 Practical Hashing and Bloom Filters 225
12.5.1 Bloom Filters: Store, Lookup, and Efficiency 225
12.5.2 Unconventional Bloom Filter Designs for Data Streams 227
12.5.3 Practical Data Streaming Targets 228
12.6 Big Data Streaming Optimization 228
12.6.1 A Simple Model of a Data Streaming Process 228
12.6.2 Streaming on Multicore 229
12.6.3 Performance Metrics 230
12.6.4 Example Analysis 231
12.7 Big Data Streaming on Multicore Technology 232
12.7.1 Parallel Processing Basics 232
12.7.2 DLL 233
ABSTRACT
This chapter brings together three topics: hash functions, Bloom filters, and the
recently emerged streaming algorithms. Hashing is the oldest of the three and is
backed by much literature. Bloom filters are based on hash functions and benefit
from hashing efficiency directly. Streaming algorithms use both Bloom filters and
hashing in various ways but impose strict requirements on performance. This chapter
views the three topics from the viewpoint of efficiency and speed. The two main per-
formance metrics are per-unit processing time and the size of the memory footprint.
All algorithms are presented as C/C++ pseudocode. Specific attention is paid to the
feasibility of hardware implementation.
12.1 INTRODUCTION
Hadoop file system (HDFS) [1] and MapReduce are de facto standards in Big Data processing
today. Although they are two separate technologies, they form a single package as far as Big
Data processing—not just storage—is concerned. This chapter will treat them as one package.
Today, Hadoop and/or MapReduce lack popular alternatives [2]. HDFS solves the practical
problem of not being able to store Big Data on a single machine by distributing the storage over
multiple nodes [3]. MapReduce is a framework on which one can run jobs that process the con-
tents of the storage—also in a distributed manner—and generate statistical summaries. This
chapter will show that performance improvements mostly target MapReduce [4].
There are several fundamental problems with MapReduce. First, the map and reduce opera-
tors are restricted to key–value hashes (data type, not hash function), which places a cap on
usability. For example, while data streaming is a good alternative for Big Data processing,
MapReduce fails to accommodate the necessary data types [5]. Second, MapReduce jobs cre-
ate heterogeneous environments where jobs compete for the same resource with no guarantee
of fairness [4]. Finally, MapReduce jobs, or HDFS for that matter, lack time awareness, while
some algorithms might need to process data in their time sequence or using a time window.
The core premise of this chapter is to replace HDFS/MapReduce with a time-aware stor-
age and processing logic. Big Data is replayed along the timeline, and all the jobs get the
time-ordered sequence of data items. The major difference here is that the new method col-
lects all the jobs in one place—the node that replays data—while HDFS/MapReduce sends
jobs to remote nodes so that data can be processed locally. This architecture is chosen for
the sole purpose of accommodating a wide range of data streaming algorithms and the
data types they create [5].
The new method presented in this chapter can be viewed as a generic technology that
exists independently of HDFS and MapReduce; the Big Data processor described here is
only one of the many possible practical applications of that generic method.
problems due to differences between traditional parallelization and parallelization on mul-
ticore technology. Finally, the processing method itself in its traditional form is relatively
primitive, while this chapter will extend it into a more generic form known as data stream-
ing. The new process not only is more generic but also can support a more statistically rigor-
ous processing logic, while the traditional architecture based on MapReduce accepts only
key–value pairs and imposes a fairly strict restraint on how data still being processed
can be used by user code.
12.2.1 Terminology
Hadoop and HDFS (Hadoop file system) are used interchangeably. In fact, for simplicity,
Hadoop in this section will mean Hadoop and MapReduce. A unit of action in Hadoop is a job.
Streaming algorithm and data streaming also denote the same technology. Here, a unit
of data is a sketch—a statistical summary of a bulk of data. Sketches will function as jobs
in the new method.
Data store is the same as data storage. Distributed parts of stores are called shards in
Hadoop or substores in this section. Content itself is split into records, but in this section,
we refer to them as items. Items in the new/alternative design are ordered in time, which
is why the store is referred to as a timeline store.
The new method runs on multicore technology and needs shared memory for the man-
ager to communicate with sketches running on cores. This section employs a ring buffer to
maintain a finite time window of items in shared memory continuously.
When estimating performance of the new method, heterogeneity is the main environ-
mental parameter, while sketchbytes (volume of data processed by all sketches) and over-
head (per data batch per core) are the two quality metrics.
Before the process, let us consider differences in sharding. Sharding is used in TABID as
well, but the store is ordered along the timeline. Note that such a design may not need a name
server, because shards can be designed as chains with one shard pointing to the next. This
discussion is left to further study and will be presented in future publications. Also note that,
as the API will show further on, data items are not key–value pairs but arbitrary strings,
thus accommodating any data type, such as encoded (Base64, for example) JSON [10].
The process starts with defining your sketcher. The format is JSON [10], and a fairly
small size is expected. Unique features of sketchers are discussed at the end of this section.
When you start your sketch, the TABID client will schedule it with the TABID manager.
The schedule involves optimization where sketches are packed onto cores in real time, as will
be shown further in this section. When a replay session starts, your sketch will run on
one of the cores and will have access to the timeline of items for processing. When replay
is over, results are returned to the client in a JSON data type with arbitrary structure.
The following are the unique features/differences of TABID. First, there is no code.
Instead, the sketcher specifies which standard streaming algorithm is used and provides
values for its configuration parameters. In the case of a new streaming algorithm, it should be
added to the library first and then referred to via the sketcher. Substore nodes in TABID are
dumb storage devices and do not run the code, although it should not be difficult to create
a version of TABID that would migrate processing to shards to replay data locally. This version
with its performance evaluation will be studied in future publications.
Data streaming is comparatively new to the literature, having been created relatively
recently—within the last decade or so—which is why it appears under various code names in
the literature, some of which are streaming algorithms, data streaming, and data streams. The
title of this chapter clearly shows that this author prefers the term data streaming. The first
two topics are well known and have been around in both practice and theory for many years.
The two major objectives posed in this chapter are fast per-item processing (hashing) and
efficient lookup with a small memory footprint.
These objectives do not necessarily complement each other. In fact, they can be con-
flicting under certain circumstances. For example, faster hash functions may be inferior
and cause more key collisions on average [11,12]. Such collisions have to be resolved by
the lookup algorithm, which should be designed to allow multiple records under the same
hash key [14]. The alternative of not using collision resolution is a bad design because it
directly results in loss of valuable information.
Data streaming as a topic has appeared in the research community relatively recently.
The main underlying reason is a fundamental change in how large volumes of data had
to be handled. The traditional way to handle large data (Big Data may be a better term),
which is still used in many places today, is to store the data in a database and analyze it
later—where the later is normally referred to as offline [15]. As the Big Data problem—the
problem of having to deal with an extremely large volume of data—starts to appear in
many areas, storing data in any kind of database has become difficult and, in some cases,
impossible. Specifically, by simple logic, it is pointless to store Big Data if its arrival rate
exceeds processing capacity.
Hence the data streaming problem, which is defined as a process that extracts all the
necessary information from the input raw data stream without having to store it. The
first obvious logical outcome from this statement is that such processing has to happen in
real time. In view of this major design feature, the need for both fast hashing and efficient
lookup should be obvious.
There is a long list of practical targets for data streaming; common examples include
computing averages and counting distinct items, both subject to various conditions.
It should be obvious that these two targets would be trivial to achieve
without any special algorithm had they come without the conditionals. For example, it is
easy to calculate the average—simply sum up all the values and divide them by the total
number of values. The same goes for the counting of distinct items. This seemingly small
detail makes for the majority of the complexity in data streaming.
In addition to the above list of practical targets, the following describes the catch of data
streaming:
• There is limited space for storing the current state—otherwise, we would revert back
to the traditional database-oriented design.
• Data have to be accessed in their natural arrival sequence, which is the obvious side
effect of a real-time process—again a major change from the database-backed pro-
cesses, which can access any record in the database.
• There is an upper limit on per-unit processing cost, which, if violated, would break
the continuity of a data streaming algorithm (arguably, buffering can help smooth
out temporary spikes in arrival rate).
Data streaming is the main reason for the new design in this section. Time awareness
and processing during replay are the features of the new design that are implemented with
data streaming in mind. Coincidentally, the resulting architecture can benefit from run-
ning on multicore hardware.
Multicore technology is a special case of parallel processing [24]. It has already been
applied to MapReduce in Reference 4, which improves its performance but does not solve
the other problems discussed in this section. Most existing multicore methods implement
traditional scheduling-based parallelization [4,24,25], which requires intensive message
exchange across processes. This chapter uses a special case of parallelization, which fea-
tures minimum overhead due to a lock-free design. More details are presented further in
this section.
S = o(min{m,n}).
If we want to build a robust and sufficiently generic method, it would pay to design it in
such a way that it would require roughly the same space for a wide range of n and m, that is,
S = O(log(min{m,n})).
When talking about space efficiency, the closest concept in traditional Information
Theory is channel capacity (see the work of Shannon [27] for the original definition). Let
us put function f({a}n) as the cost (time, CPU cycles, etc.) of operation for each item in the
input stream. The cost can be aggregated into F({a}n) to denote the entire output. It is pos-
sible to judge the quality of a given data streaming method by analyzing the latter metric.
The analysis can extend into other efficiency metrics like memory size and so forth, simply
by changing the definition of a per-unit processing cost.
A simple example is in order. Let us define the unit cost by the following check: for each
item in the arrival stream, decide whether it is equal to a given constant C. The unit cost in
this case is the cost of making that comparison for every arriving item. Although it sounds
primitive, the same exact formulation can be used for much more complicated unit functions.
Here is one example of slightly higher complexity. This time, let us phrase the unit cost
as the following condition: upon receiving item ai, update a given record fj ← fj + C. Now,
prior to updating a record, we need to find that record in the current state. Since it
is common that i and j have no easily calculable relation between them, finding the j effi-
ciently can be a challenge.
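A minimal C sketch of this second unit cost is shown below: every arriving item must first be located in the state by hashing before its record can be incremented. The fixed-size open-addressing table, the hash function, and the constant are illustrative assumptions rather than the chapter's own code.

/* Minimal sketch of the second unit cost: on each arriving item, find its
 * record f_j in the current state via hashing, then apply f_j <- f_j + C.
 * Table size, probing scheme, and hash function are assumptions. */
#include <stdint.h>

#define STATE_SLOTS 4096
#define C_INC 1.0                      /* the constant C from the text */

typedef struct { uint64_t key; double f; int used; } Record;
static Record state[STATE_SLOTS];

static uint64_t hash64(uint64_t x) {   /* placeholder mixing function  */
    x ^= x >> 33; x *= 0xFF51AFD7ED558CCDULL; x ^= x >> 33;
    return x;
}

/* Per-item work: locate record j for this item and add C to it. */
static void on_item(uint64_t key) {
    uint64_t h = hash64(key);
    for (int probe = 0; probe < STATE_SLOTS; probe++) {
        Record *r = &state[(h + probe) % STATE_SLOTS];   /* linear probing */
        if (!r->used) { r->used = 1; r->key = key; r->f = 0.0; }
        if (r->key == key) { r->f += C_INC; return; }
    }
    /* State full: a real streaming method would evict or approximate here. */
}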
Note that these formulations may make it look like data streaming is similar to tradi-
tional hashing, where the latter also needs to update its state on every item in the input. This
is a gross misrepresentation of what data streaming is all about. Yes, it is true that some
portion of the state is potentially updated on each item in the arrival stream. However, in
hashing, the method always knows which part of the state is to be updated, given that the
state itself is often just a single 32-bit word. In data streaming, the state is normally much
larger, which means that it takes at least a calculation or an algorithm to find a spot in the
state that is to be updated.
The best way to describe the relation between data streaming and hashing is to state that
data streaming uses hashing as one of its primitive operations. Another primitive opera-
tion is blooming.
The term sketch is often used in relation to data streaming to describe the entire state,
that is, the {f}m set, at a given point of time. Note that f here denotes the value obtained
from the unit function f(), using the same name for convenience.
Some of this section will discuss the traditional design presented by this figure but then
replace it with a more modern design. From earlier in this chapter, we know that blooming
is performed by calculating one or more hash keys and updating the value of the filter by
OR-ing each hash key with its current state. This is referred to as the insert operation. The
lookup operation is done by taking the bitwise AND between a given hash key
and the current state of the filter. The decision making from this point on can go in one of
the following two directions:
following two directions:
• he result of AND is the same as the value of the hash key—this is either true positive
(TP) or false positive (FP), with no way to tell between the two.
• he result of AND is not the same as the value of the hash key—this is a 100% reliable
true negative.
One common way to describe this lookup behavior of Bloom filters is to describe the
filter as a person with memory who can only answer the question "Have you seen this item
before?" reliably [11]. This is not to underestimate the utility of the filter, as the answer to
this exact question is precisely what is needed in many practical situations.
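The insert and lookup operations just described translate into only a few lines of C. The sketch below assumes, purely for illustration, that the whole filter fits into a single 64-bit word and that each item is mapped to an m-bit hash key by a placeholder hash function.

/* Minimal sketch of the insert/lookup logic described above. The 64-bit
 * filter size and the placeholder hash function are assumptions. */
#include <stdint.h>
#include <stdbool.h>

static uint64_t filter_state = 0;                 /* current state of the filter */

static uint64_t hash_key(uint64_t item) {         /* placeholder m-bit hash key  */
    item ^= item >> 31; item *= 0x7FB5D329728EA185ULL; item ^= item >> 27;
    return item;
}

/* Insert: OR the item's hash key into the current state. */
static void bloom_insert(uint64_t item) { filter_state |= hash_key(item); }

/* Lookup: AND the hash key with the state. An unchanged key means TP or FP
 * (no way to tell which); anything else is a 100% reliable true negative. */
static bool bloom_maybe_seen(uint64_t item) {
    uint64_t h = hash_key(item);
    return (filter_state & h) == h;
}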
Let us look at the Bloom filter design from the viewpoint of hashing, especially given
that the state of the filter is gradually built by adding more hash keys onto its state.
Let us put n as the number of items and m as the bit length of hash keys and, therefore, the
filter. We know from before that each bit in the hash key can be set to 1 with 50% probability.
Therefore, omitting details, the optimal number of hash functions can be calculated as

k = ln2 · (m/n) ≈ 0.7 · (m/n).
If each hash function is perfectly independent of all others, then the probability of a bit
remaining 0 after n elements is

p = (1 − 1/m)^(kn) ≈ e^(−kn/m).
The probability of a false positive is then

pFP = (1 − p)^k ≈ (1 − e^(−kn/m))^k ≈ 1/2^k

for the optimal k. Note that with increasing k, the probability of an FP is actually supposed
to decrease, which is an unintuitive outcome, because one would expect the filter to get
filled up with keys earlier.
Let us analyze k. For the majority of cases, m ≪ n, which means that the optimal num-
ber of hash functions is 1. Two functions are feasible only with m > 2.5n. In most realistic
cases, this is almost never so, because n is normally huge, while m is something practical
like 24 or 32 (bits).
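The following short C program, with illustrative values m = 32 and n = 1,000,000, evaluates the formulas above; it confirms that for m ≪ n the optimal k collapses to a single hash function and that the filter quickly saturates (pFP approaches 1), which motivates the unconventional designs discussed next. The parameter values are assumptions, not taken from the chapter.

/* Quick numeric check of the formulas above for illustrative values of m
 * and n; the parameter choices are assumptions, not taken from the chapter. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double m = 32.0, n = 1000000.0;                      /* filter bits, items   */
    double k_opt = (m / n) * log(2.0);                   /* k = ln2 * (m/n)      */
    double k = (k_opt < 1.0) ? 1.0 : floor(k_opt + 0.5); /* use at least one     */
    double p = exp(-k * n / m);                          /* P(bit still 0)       */
    double p_fp = pow(1.0 - p, k);                       /* false-positive prob. */
    printf("optimal k = %g -> use k = %g, pFP = %g\n", k_opt, k, p_fp);
    return 0;
}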
• Stop-additions filter. This filter will stop accepting new keys beyond a given point.
Obviously, this is done in order to keep the FP rate below a given target value.
• Deletion filter. This filter is tricky to build, but if accomplished, it can revert to a given
previous state by forgetting the change introduced by a given key.
• Counting filter. This filter can count both individual bits of potential occurrences and
entire values—combinations of bits. This particular class of filters obviously can find
practical applications in data streaming; a minimal counting-filter sketch is given after
this list. In fact, the existing example of the d-left hashing method in Reference 32 uses
a kind of counting Bloom filter [33]. Another example can be found in Reference 34,
where it is used roughly for the same purpose.
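The counting filter in particular is easy to sketch. The C fragment below keeps a small counter per position instead of a single bit, so an insertion can later be forgotten by decrementing; it is a generic counting filter for illustration only, not the d-left construction of Reference 32, and the table size and the two placeholder hash functions are assumptions.

/* Generic counting-filter sketch (illustrative only; not the d-left
 * construction of Reference 32): counters instead of bits, so inserts
 * can be undone by decrementing. Sizes and hashes are assumptions. */
#include <stdint.h>
#include <stdbool.h>

#define CBF_SLOTS 1024
#define CBF_K 2

static uint8_t cbf[CBF_SLOTS];

static uint32_t cbf_hash(uint64_t item, int i) {     /* placeholder hashes */
    uint64_t x = item + 0x9E3779B97F4A7C15ULL * (uint64_t)(i + 1);
    x ^= x >> 29; x *= 0xBF58476D1CE4E5B9ULL; x ^= x >> 32;
    return (uint32_t)(x % CBF_SLOTS);
}

static void cbf_insert(uint64_t item) {
    for (int i = 0; i < CBF_K; i++)
        if (cbf[cbf_hash(item, i)] < 255) cbf[cbf_hash(item, i)]++;
}

static void cbf_delete(uint64_t item) {              /* forget one insertion */
    for (int i = 0; i < CBF_K; i++)
        if (cbf[cbf_hash(item, i)] > 0) cbf[cbf_hash(item, i)]--;
}

static bool cbf_maybe_contains(uint64_t item) {      /* TP/FP vs. reliable TN */
    for (int i = 0; i < CBF_K; i++)
        if (cbf[cbf_hash(item, i)] == 0) return false;
    return true;
}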
FIGURE 12.3 A generic model representing Bloom filters with dynamic functionality. The two
main changes are (1) the extended design of the Bloom filter structure itself, which is not a bit string
anymore, and (2) the nontrivial manipulation logic dictated by the first change—simply put, one can-
not simply use logical ORs between hashes and Bloom filter states.
There are other kinds of unconventional designs. Reference 35 declares that it can do
with fewer hash functions while providing the same blooming performance. Bloom filters
specific to perfect hashing are proposed in Reference 36.
• Example 1: Finding heavy hitters (beyond the count-min sketch). Find the k most fre-
quently accessed items in a list. One algorithm is proposed in Reference 7. Generally,
more sound algorithms for sliding windows can be found in Reference 23. A minimal
count-min-style sketch is given after this list.
• Example 2: Triangle detection. Detect triangles defined as A talks to B, B talks to C,
and C talks to A (other variants are possible as well) in the input stream. An algo-
rithm is proposed in Reference 22.
• Example 3: Superspreaders. Detect items that access or are accessed by exceedingly
many other items. Related research can be found in Reference 9.
• Example 4: Many-to-many patterns. This is a more generic case of heavy hitters and
superspreaders, but in this definition, the patterns are not known in advance. Earlier
work by this author in Reference 8 is one method. However, the subject is popular:
several methods, such as many-to-many broadcasting [37], various many-to-many
memory structures [38,39], and data representations (like the graph in Reference 40),
have been proposed—all outside of the concept of data streaming. The topic is of high
intrinsic value because it has direct relevance to group communications, where one-to-many
and many-to-many are the two popular types of group communication [41–44].
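For Example 1, the sketch below gives a minimal count-min-style frequency structure in C; it is a generic illustration rather than the specific algorithm of Reference 7. The per-item update touches one counter per row, the estimate for an item is the minimum over its rows, and heavy hitters are simply the items whose estimates exceed a chosen threshold. The row and width choices and the hash functions are assumptions.

/* Minimal count-min-style frequency sketch (illustrative; not the specific
 * algorithm of Reference 7). Row/width choices and hashes are assumptions. */
#include <stdint.h>

#define CM_ROWS 4
#define CM_WIDTH 2048

static uint32_t cm[CM_ROWS][CM_WIDTH];

static uint32_t cm_hash(uint64_t item, int row) {     /* placeholder hashes  */
    uint64_t x = item ^ (0xD6E8FEB86659FD93ULL * (uint64_t)(row + 1));
    x ^= x >> 32; x *= 0xD6E8FEB86659FD93ULL; x ^= x >> 32;
    return (uint32_t)(x % CM_WIDTH);
}

static void cm_update(uint64_t item) {                 /* O(CM_ROWS) per item */
    for (int r = 0; r < CM_ROWS; r++) cm[r][cm_hash(item, r)]++;
}

static uint32_t cm_estimate(uint64_t item) {           /* never underestimates */
    uint32_t est = UINT32_MAX;
    for (int r = 0; r < CM_ROWS; r++) {
        uint32_t c = cm[r][cm_hash(item, r)];
        if (c < est) est = c;
    }
    return est;   /* item is a heavy hitter if est exceeds a chosen threshold */
}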
The processing rate, by definition, cannot be higher than the arrival rate. On the other
hand, the arrival rate is important because a data streaming method has to be able to
support a given rate of arrival in order to be feasible in practice.
Arrival rate is also the least popular topic in related research. In fact, per-unit process-
ing time is not discussed much in the literature, which instead focuses on efficient hashing or
blooming methods [30,32,47,48]. Earlier work by this author in Reference 49 shows that
per-unit processing cost is important—the study specifically shows that too much process-
ing can have a major impact on throughput.
The topic of arrival rate is especially important in packet traffic [28,29,50]. With con-
stantly increasing transmission rates as well as traffic volume, a higher level of efficiency is
demanded of switching equipment. A Click router is one such technology [51], which is in
the active development phase with recent achievements of billion-packets-per-second (pps)
processing rates [52]. The same objectives are pursued by Open vSwitch—a technology in
network virtualization. It is interesting that research in this area uses roughly the same ter-
minology as is found in data streaming. For example, References 52 and 53 talk about the
space efficiency of data structures used to support per-packet decision making. Such research
often uses Bloom filters to improve search and lookup efficiency [13,32]. In general, this author
predicts with high probability that high-rate packet processing research in the near future
will discover the topic of data streaming and will greatly benefit from the discovery [50].
FIGURE 12.5 Big Data timeline and sketches running in groups on multiple cores.
Figure 12.5 shows the design of the TABID manager node. The manager is running on
one core and is in charge of starting and ending sketches as per its current schedule. Note
that this point alone represents much greater flexibility compared to MapReduce, which
has no scheduling component. Read/write access to the time-ordered stream, although
remaining asynchronous, is collision-free by ensuring that the writing and reading cursors
never point to the same position in the stream. Sketches read the data at the now cur-
sor while the manager is writing to a position further along the stream, thus creating a
collision-safe buffer.
which specifies a distribution of values between vmin and vmax configured by a. The tuple
can be used to define heterogeneity, where a is the exponent.
Note that the case of a = 0 is a special homogeneous case. In analy-
sis, values for a are randomly chosen from the list 0, 0.7, 0.3, 0.1, 0.05, 0.01, where a = 0 is
a homogeneous distribution (horizontal line), a between 0.7 and 0.3 creates distributions
with a majority of large values, a = 0.1 is almost a linear trend, and a between 0.05 and 0.01
outputs distributions where most values are small. The last two cases are commonly found
in natural systems and can therefore be referred to as realistic. Note that this selection only
appears arbitrary; in reality, it covers the entire range of all possible distributions.
These distributions apply to sketch life span and per-unit overhead—the two practical
metrics that directly affect the performance of a TABID system.
The packing heuristic is defined as follows. C denotes a set of item counts for all sketches
in all cores, one value per sketch. M is a set of per-core item counts, one value per core. The
optimization objective is then formulated using the var and max operators over these two sets.
For simplicity, values for C and M are normalized within a time window to avoid dealing
with different units of measure in the two terms. In the analysis, the problem is solved using
a genetic algorithm (GA), for which this objective serves as a fitness function. A detailed
description of the GA is omitted due to limited space.
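Since the objective itself is not reproduced here, the following C sketch only illustrates the general idea under an explicit assumption: that the fitness combines the variance of the normalized per-sketch counts C with the maximum of the normalized per-core counts M, with equal weights. The names and the weighting are assumptions, not the chapter's actual objective.

/* Sketch of a fitness function for the packing heuristic, under the stated
 * assumption that the objective sums var(C) and max(M) over normalized
 * values; a GA would minimize this value to balance load across cores. */
static double fitness(const double *C, int n_sketches,
                      const double *M, int n_cores) {
    double mean = 0.0, var = 0.0, max_m = 0.0;
    for (int i = 0; i < n_sketches; i++) mean += C[i];
    mean /= (double)n_sketches;
    for (int i = 0; i < n_sketches; i++)
        var += (C[i] - mean) * (C[i] - mean);
    var /= (double)n_sketches;
    for (int j = 0; j < n_cores; j++)
        if (M[j] > max_m) max_m = M[j];
    return var + max_m;          /* lower is better for the GA */
}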
FIGURE 12.6 Performance of sketchbytes metric over a wide range of the number of sketches and
several heterogeneity setups.
FIGURE 12.7 Performance of max per-core overhead metric over a wide range of the number of
sketches and several heterogeneity setups.
Figure 12.7 shows the performance of this packing heuristic. For a ≥ 0.1, it is difficult to
constrain a continuous increase in maximum overhead (per core, not per sketch). However,
for the realistic cases of a = 0.05 or, even better, a = 0.01, there is a near-saturation
trend, which means that performance is about the same for few sketches as for very many
sketches. The worst case is about 5.4 ms (see the parameters at the head of Section 12.6.4)
per item (10 ms is the configured maximum), which means that the slowest core can still
process roughly 200 items per second per sketch. Note that the total throughput of the system
is much greater because of multiple sketches and multiple cores.
FIGURE 12.8 Two distinct parallel processes, one (above) that requires synchronization and the
other (below) that can work without it.
method or a design that avoids having locks or having to pass messages while processing
the same data stream in parallel.
A design shown later in this section will avoid locks, which is why it will be referred to
as lock-free. However, syncless remains the umbrella term, since both locking and message
passing are means of synchronization.
It may be useful to provide more background on the main problem with having locks
in shared memory or having to pass messages between processes. The problem is that both
these methods are applied in an asynchronous environment, where multiple processes run
continuously in parallel but have to exchange data with each other occasionally. This,
in fact, is the core of the problem. Because the exchange is asynchronous, by definition,
one process never knows when the other process will attempt contact. Naturally,
having a synchronous process would entirely solve the problem but would render the par-
allelization itself useless in practice.
The key point of a lock-free process is to reduce the time communicating processes have
to wait on each other. Existing research shows that multicore technology requires such
a novel approach simply because having locks greatly reduces the overall throughput of
the system. With four- and eight-core architectures becoming common today, such a hard
limitation on throughput is undesirable.
12.7.2 DLL
Figure 12.9 shows a four-way DLL (double-linked list). Strictly speaking, there is only one
big DLL and many small sideways DLLs, one for each element in the big one, but it is
convenient to call the structure a four-way DLL.
The four-way DLL is an excellent solution to collisions of keys in hash functions. Since a colli-
sion literally means that several elements contend for the same spot in an index table (not the DLL
itself!), the sideways DLLs help to connect the multiple colliding elements so that they can be
searched. Note that collisions are relatively rare, which means that there are normally only a few of them.
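A possible node layout for the four-way DLL is sketched below in C: the big list is chained through next/prev, while sidenext/sideprev link only the few elements that collided on the same hash slot. The field names follow Figure 12.9; the key and payload fields are assumptions.

/* Illustrative node layout for the four-way DLL described above. */
#include <stdint.h>

typedef struct DllNode {
    uint64_t        key;        /* full key, used to resolve collisions     */
    void           *item;       /* payload (assumed)                        */
    struct DllNode *next;       /* big DLL: next element                    */
    struct DllNode *prev;       /* big DLL: previous element                */
    struct DllNode *sidenext;   /* sideways DLL: next colliding element     */
    struct DllNode *sideprev;   /* sideways DLL: previous colliding element */
} DllNode;

/* Lookup within one hash slot: walk the (normally very short) sideways list. */
static DllNode *find_in_slot(DllNode *slot_head, uint64_t key) {
    for (DllNode *n = slot_head; n != 0; n = n->sidenext)
        if (n->key == key) return n;
    return 0;
}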
FIGURE 12.11 Shared memory design and logic followed by each sketch (left) and the TABID manager
(right).
the ring buffer, which in turn depends on the size of the shared memory. When a new batch
has been read into the ring buffer, the global now cursor is updated, which allows all
sketches to start processing new items. The manager also monitors all sketch times—
cursors for individual sketches—and can change the packing configuration based on collected
statistics.
The sketch (Figure 12.11, left side) follows the following logic. It waits for the now cur-
sor to advance beyond its own cursor, which is stored independently for each sketch in the
shared memory. When it detects a change, the sketch processes all newly arrived items. Once
its cursor reaches the global now, the sketch returns to the idle state and starts polling for
new change at a given time interval.
Note that the ring buffer is accessed via a C/C++ library call [54] rather than directly
by each party. This is a useful convention because it removes the need for each accessing
party to maintain its current position in the ring buffer. The buffer only appears to be con-
tinuous, while in reality, it has a finite size and has to wrap back to the head when its tail is
reached.
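The sketch-side logic just described can be summarized in a few lines of C. The fragment below is an illustration only: the shared-memory layout, the item size, the names, and the polling interval are assumptions, and the real system goes through the ring buffer library call rather than indexing the buffer directly as done here.

/* Illustrative sketch of the sketch-side loop: wait for the global now cursor
 * to move past this sketch's own cursor, process the newly arrived items,
 * then idle and poll again. Layout, names, and timing are assumptions. */
#include <stdint.h>
#include <unistd.h>

#define RING_SLOTS 65536
#define ITEM_BYTES 256

typedef struct {
    volatile uint64_t now;                      /* advanced only by the manager */
    char items[RING_SLOTS][ITEM_BYTES];         /* finite buffer, wraps at tail */
} SharedRing;

static void process_item(const char *item) { (void)item; /* sketch-specific work */ }

static void sketch_loop(SharedRing *ring, volatile uint64_t *my_cursor) {
    for (;;) {
        while (*my_cursor < ring->now) {                      /* catch up          */
            process_item(ring->items[*my_cursor % RING_SLOTS]);
            (*my_cursor)++;                                   /* reader-only cursor */
        }
        usleep(1000);          /* idle: poll for new change at a fixed interval */
    }
}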
which returns the ID of the newly created and scheduled sketch. Its current status can be
verified using
GET tabidSketchStatus(ID),
12.8 SUMMARY
This chapter presented a brand-new paradigm for Big Data processing. The new paradigm
abandons the "let us distribute everything" approach in search of a more flexible technol-
ogy. Flexibility is in several parts of the new paradigm—it is in how storage is distributed,
how Big Data is processed by user jobs, and finally, how user jobs are distributed across
cores in a multicore architecture.
Setting all that aside, it can be said that the new paradigm is an interesting formulation
simply because it brings together several currently active areas of research. These are specifi-
cally efficient hashing technology, parallel processing on multicore technology, and, finally,
the scientifically rigorous data streaming approach to creating statistical sketches from raw
data. This collection of topics creates an interesting platform for dynamic algorithms and
optimizations aimed at improving the performance of Big Data processing as a whole.
The unconventional Big Data processor was introduced early in this chapter in order to
avoid having to provide a unified background for each separate part of the technology. This
early introduction also presented several major parts that are required for the technology
to work in the first place. A separate section was then dedicated to putting all these parts
together into a complete practical/technical solution.
Having established the background, the chapter continued with specifics. A separate
section was dedicated to issues related to hashing, the use of Bloom filters, and the related effi-
ciency. Another section fully focused on optimization related to assigning jobs in multi-
core environments. Finally, a section was dedicated to the issue of data streaming on top of
a multicore architecture and, in turn, on top of a real-time Big Data replay process.
It should be noted that real-time replay of Big Data is in itself a brand-new way to view
Big Data. While the traditional approach distributes both the storage and processing of
data, this chapter discussed a more orderly, and therefore more controlled, process, where
Big Data is replayed on a single multicore machine and tasks are packed into separate
cores in order to process the replayed stream of data in real time. This chapter hinted that
such a design allows for much higher flexibility, given that the design is based on a natural
trade-off. Although this chapter does not provide examples of various settings that would
showcase the dynamic nature of the design, it is clear from the contents which parameters
can be used to change the balance of the technology.
The issue of scientific rigor when referring to the data streaming method itself was cov-
ered in good detail, where several examples of practical statistical targets were shown. It
was also shown that such a level of statistical rigor cannot be supported by traditional
MapReduce technology, which only supports key–value pairs and also imposes strict
rules on how they can be used—the map and reduce operators themselves. When the data
streaming paradigm is used, the jobs are relieved of any such restraint and can use any data
structure necessary for processing. This single feature is a huge factor in Big
Data, where the authenticity of conclusions drawn from Big Data has to be defended.
Unfortunately, given the large set of topics in this chapter, several topics could not
be expanded with sufficient detail. However, enough references are provided for recent
research on each of the specific scientific problems for the reader to be able to follow up.
REFERENCES
1. Apache Hadoop. Available at: http://hadoop.apache.org/.
2. A. Rowstron, S. Narayanan, A. Donnelly, G. O’Shea, A. Douglas, “Nobody Ever Got Fired for
Using Hadoop on a Cluster,” 1st International Workshop on Hot Topics in Cloud Data Processing,
April 2012.
3. K. Shvachko, "HDFS Scalability: The Limits to Growth," The Magazine of USENIX, vol. 35,
no. 2, pp. 6–16, 2012.
4. R. Chen, H. Chen, B. Zang, “Tiled-MapReduce: Optimizing Resource Usages of Data-Parallel
Applications on Multicore with Tiling,” 19th International Conference on Parallel Architectures
and Compilation Techniques (PACT), pp. 523–534, 2010.
5. S. Muthukrishnan, “Data Streams: Algorithms and Applications,” Foundations and Trends in
Theoretical Computer Science, vol. 1, no. 2, pp. 117–236, 2005.
6. M. Sung, A. Kumar, L. Li, J. Wang, J. Xu, "Scalable and Efficient Data Streaming Algorithms for
Detecting Common Content in Internet Traffic," ICDE Workshop, 2006.
7. M. Charikar, K. Chen, M. Farach-Colton, “Finding Frequent Items in Data Streams,” 29th
International Colloquium on Automata, Languages, and Programming, 2002.
8. M. Zhanikeev, “A Holistic Community-Based Architecture for Measuring End-to-End QoS at
Data Centres,” Inderscience International Journal of Computational Science and Engineering
(IJCSE), (in print) 2013.
9. S. Venkataraman, D. Song, P. Gibbons, A. Blum, “New Streaming Algorithms for Fast Detection
of Superspreaders,” Distributed System Security Symposium (NDSS), 2005.
10. JSON Format. Available at: http://www.json.org.
11. D. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University
Press, West Nyack, NY, 2003.
12. A. Konheim, Hashing in Computer Science: Fifty Years of Slicing and Dicing. Wiley, Hoboken,
NJ, 2010.
13. G. Antichi, A. Pietro, D. Ficara, S. Giordano, G. Procissi, F. Vitucci, “A Heuristic and Hybrid
Hash-Based Approach to Fast Lookup,” International Conference on High Performance Switching
and Routing (HPSR), pp. 1–6, June 2009.
14. M. Zadnik, T. Pecenka, J. Korenek, “NetFlow Probe Intended for High-Speed Networks,”
International Conference on Field Programmable Logic and Applications, pp. 695–698, 2005.
15. R. Kimball, M. Ross, W. Thornthwaite, J. Mundy, B. Becker, The Data Warehouse Lifecycle
Toolkit. John Wiley and Sons, Indianapolis, IN, 2008.
16. Small File Problem in Hadoop (blog). Available at: http://amilaparanawithana.blogspot
.jp/2012/06/small-ile-problem-in-hadoop.html.
17. Z. Ren, X. Xu, J. Wan, W. Shi, M. Zhou, “Workload Characterization on a Production Hadoop
Cluster: A Case Study on Taobao,” IEEE International Symposium on Workload Characterization,
pp. 3–13, 2012.
18. Y. Chen, A. Ganapathi, R. Griffith, R. Katz, "The Case for Evaluating MapReduce Performance
Using Workload Suites,” 19th International Symposium on Modeling, Analysis and Simulation of
Computer and Telecommunication Systems (MASCOTS), pp. 390–399, July 2011.
19. X. Gao, V. Nachankar, J. Qiu, “Experimenting with Lucene Index on HBase in an HPC
Environment,” 1st Annual Workshop on High Performance Computing Meets Databases
(HPCDB), pp. 25–28, 2012.
20. S. Das, Y. Sismanis, K. Beyer, R. Gemulla, P. Haas, J. McPherson, “Ricardo: Integrating R and
Hadoop,” SIGMOD, pp. 987–999, June 2010.
21. A. Rasooli, D. Down, "COSHH: A Classification and Optimization Based Scheduler for
Heterogeneous Hadoop Systems,” Technical Report of McMaster University, Canada, 2013.
22. Z. Bar-Yossef, R. Kumar, D. Sivakumar, “Reductions in Streaming Algorithms, with an
Application to Counting Triangles in Graphs,” 13th ACM-SIAM Symposium on Discrete
Algorithms (SODA), January 2002.
23. M. Datar, A. Gionis, P. Indyk, R. Motwani, “Maintaining Stream Statistics over Sliding
Windows,” SIAM Journal on Computing, vol. 31, no. 6, pp. 1794–1813, 2002.
24. M. Aldinucci, M. Torquati, M. Meneghin, "FastFlow: Efficient Parallel Streaming Applications
on Multi-Core,” Technical Report no. TR-09-12, Universita di Pisa, Italy, September 2009.
25. R. Brightwell, “Workshop on Managed Many-Core Systems,” 1st Workshop on Managed Many-
Core Systems, 2008.
26. M. Kerrisk, The Linux Programming Interface. No Starch Press, San Francisco, 2010.
27. C. Shannon, "A Mathematical Theory of Communication," The Bell System Technical Journal,
vol. 27, pp. 379–423, 1948.
28. M. Zhanikeev, Y. Tanaka, “Popularity-Based Modeling of Flash Events in Synthetic Packet
Traces,” IEICE Technical Report on Communication Quality, vol. 112, no. 288, pp. 1–6, 2012.
29. M. Zhanikeev, Y. Tanaka, "A Graphical Method for Detection of Flash Crowds in Traffic,"
Springer Telecommunication Systems Journal, vol. 57, no. 1, pp. 91–105, 2014.
30. S. Heinz, J. Zobel, H. Williams, "Burst Tries: A Fast, Efficient Data Structure for String Keys,"
ACM Transactions on Information Systems (TOIS), vol. 20, no. 2, pp. 192–223, 2002.
31. F. Putze, P. Sanders, J. Singler, "Cache-, Hash- and Space-Efficient Bloom Filters," Journal of
Experimental Algorithmics (JEA), vol. 14, no. 4, 2009.
32. F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, G. Varghese, "Bloom Filters via d-Left
Hashing and Dynamic Bit Reassignment,” 44th Allerton Conference on Communication,
Control, and Computing, 2006.
33. F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, G. Varghese, "An Improved Construction
for Counting Bloom Filters,” 14th Conference on Annual European Symposium (ESA), vol. 14,
pp. 684–695, 2006.
34. H. Song, S. Dharmapurikar, J. Turner, J. Lockwood, “Fast Hash Table Lookup Using Extended
Bloom Filter: An Aid to Network Processing,” SIGCOMM, 2005.
35. A. Kirsch, M. Mitzenmacher, “Less Hashing, Same Performance: Building a Better Bloom Filter,”
Wiley Inderscience Journal on Random Structures and Algorithms, vol. 33, no. 2, pp. 187–218, 2007.
36. G. Antichi, D. Ficara, S. Giordano, G. Procissi, F. Vitucci, “Blooming Trees for Minimal Perfect
Hashing,” IEEE Global Telecommunications Conference (GLOBECOM), pp. 1–5, December
2008.
37. C. Bhavanasi, S. Iyer, “M2MC: Middleware for Many to Many Communication over Broadcast
Networks," 1st International Conference on Communication Systems Software and Middleware,
pp. 323–332, 2006.
Organic Streams
A Unified Framework for Personal Big Data Integration and Organization Towards Social Sharing and Individualized Sustainable Use
Xiaokang Zhou and Qun Jin
CONTENTS
13.1 Introduction 242
13.2 Overview of Related Work 243
13.3 Organic Stream: Definitions and Organizations 245
13.3.1 Metaphors and Graph Model 245
13.3.2 Definition of Organic Stream 246
13.3.3 Organization of Social Streams 248
13.4 Experimental Result and Analysis 250
13.4.1 Functional Modules 250
13.4.2 Experiment Analysis 250
13.5 Summary 252
References 253
ABSTRACT
This chapter describes a unified framework for dynamically integrating and meaning-
fully organizing personal and social Big Data. With the rapid development of emerg-
ing computing paradigms, we have been continuously experiencing a change in
work, life, play, and learning in the highly developed information society, which
is a kind of seamless integration of the real physical world and cyber digital space.
More and more people have become accustomed to sharing their personal contents
across the social networks due to the high accessibility of social media along with
the increasingly widespread adoption of wireless mobile computing devices. User-
generated information has spread more widely and quickly and provided people with
opportunities to obtain more knowledge and information than ever before, which
leads to an explosive increase of data scale, containing big potential value for indi-
vidual, business, domestic, and national economic development. Thus, it has become
an increasingly important issue to sustainably manage and utilize personal Big Data,
in order to mine useful insight and real value to better support information seeking
and knowledge discovery. To deal with this situation in the Big Data era, a unified
approach to aggregation and integration of personal Big Data from life logs in accor-
dance with individual needs is considered essential and effective, which can benefit
the sustainable information sharing and utilization process in the social networking
environment. In this chapter, a new concept of organic stream, which is designed
as a flexibly extensible data carrier, is introduced and defined to provide a simple
but efficient means to formulate, organize, and represent personal Big Data. As an
abstract data type, organic streams can be regarded as a logic metaphor, which aims
to meaningfully process the raw stream data into an associatively and methodically
organized form, but no concrete implementation for physical data structure and stor-
age is defined. Under the conceptual model of organic streams, a heuristic method
is proposed and applied to extract diversified individual needs from the tremendous
amount of social stream data through social media. And an integrated mechanism is
developed to aggregate and integrate the relevant data together based on individual
needs in a meaningful way, in which personal data can be physically stored and dis-
tributed in private personal clouds and logically represented and processed by a set of
newly introduced metaphors named heuristic stone, associative drop, and associative
ripple. The architecture of the system with the foundational modules is described,
and the prototype implementation with the experiment's result is presented to dem-
onstrate the usability and effectiveness of the framework and system.
13.1 INTRODUCTION
With the continuous development of social media (such as Twitter and Facebook), more
and more people have become involved in this social networking revolution, which leads
to a tremendous increase of data scale, ranging from daily text data to multimedia data
that describe different aspects of people's lives. Websites no longer change in weeks or
days but in hours, minutes, or even seconds, as people are willing to share their personal
contents with each other at all times and places. For instance, every month, Facebook deals
with 570 billion page views, stores 3 billion new photos, and manages 25 billion pieces of
content [1]. Big Data, which "includes data sets with sizes beyond the ability of current
technology, method and theory to capture, manage, and process the data within a tolerable
elapsed time" [2], is an example of "high-volume, high-velocity, and/or high-variety infor-
mation assets that require new forms of processing to enable enhanced decision making,
insight discovery and process optimization" [3]. It has become a big challenge to process
such massive amounts of complex Big Data, which has attracted a lot of attention from
academia, industry, and government as well.
Life logs, a kind of personal Big Data, have also attracted increasing attention in recent
years. Life logs include a variety of data, such as text, location, sound, images, and videos,
which are dynamically produced from multiple sources and with different data structures
or even no data structure. Consequently, it is difficult for an individual (end user) to uti-
lize the useful information hidden in these records, such as a user's personal experience,
or the social knowledge shared in a community, by simply processing the raw data from
life logs. Therefore, it is essential to effectively integrate and mine this kind of personal
Big Data in both cyberspace and the real world, in order to provide people with valuable
individualized services.
Following this discussion, we refer to both stream data in the cyber world and life log
data from the physical world, which represent different aspects of people's information
behaviors and social activities, as social streams. On one hand, the amount of stream data
that people can obtain is growing at a stupendous rate. On the other hand, it is difficult for
one person to simultaneously remember such a tremendous amount of information and
knowledge with few explicit relations among them, especially when he/she faces the
situation where newer information emerges continuously. Thus, we try to find a new way
by which the massive amount of raw stream data can be meaningfully and methodically
organized into a pre-prepared form and the related information and knowledge hidden
in the stream data can be associatively provided to users when they match users' personal
intentions, in order to benefit the individualized information-seeking and retrieval process.
In a previous study, a graph model with a set of metaphors was introduced and pre-
sented, in order to represent a variety of social stream data with a hierarchical structure on
an abstract level [4]. A mechanism was developed to assist users' information seeking that
could best fit users' current needs and interests with two improved algorithms [5]. Based
on these, the organized stream data have further been utilized to facilitate the enrichment
of the user search experience [6]. Moreover, the so-called organic stream with formal
descriptions was introduced and defined, which could discover and represent the poten-
tial relations among data [7]. And a mechanism was proposed to organize the raw stream
data into the methodically pre-prepared form, in order to assist the data aggregation and
integration process [8].
In this chapter, an effective approach to aggregating and integrating personal Big Data in
accordance with individual needs is introduced, which can further benefit the information
fusion and knowledge discovery process and facilitate the sustainable data utilization pro-
cess as well. We describe the extended and refined definition of organic stream in order to
make it an extensible data carrier to formulate and organize personal Big Data from life logs
in a meaningful and flexible way, which can assist the transformation from data to informa-
tion, from information to knowledge, and finally, from knowledge to asset iteratively.
selection (OFS) problem. Zhang et al. [12] presented a parallel rough set–based approach
using MapReduce for knowledge acquisition on the basis of the characteristics of data, in
order to mine knowledge from Big Data. Menon [13] focused on the data warehousing and
analytics platform of Facebook to provide support for batch-oriented analytics applica-
tions. Han et al. [14] introduced a Big Data model that used social network data for infor-
mation recommendation related to a variety of social behaviors.
There are many analyses, as well as applications, that focus on life logs [15–20].
Yamagiwa et al. [15] proposed a system to achieve an ecological lifestyle at home, in
which sensors closely measure a social life log by the temperature, humidity, and inten-
sity of illumination related to human action and living environment. Hori and Aizawa
[16] developed a context-based video retrieval system that utilized various sensor data to
provide video browsing and retrieval functions for life log applications. Hwang and Cho
[17] developed a machine learning method for life log management, in which a proba-
bilistic network model was employed to summarize and manage human experiences by
analyzing various kinds of log data. Kang et al. [18] defined metadata to save and search
life log media, in order to deal with problems such as high-capacity memory space and
long search time cost. Shimojo et al. [19] reengineered the life log common data model
(LLCDM) and life log mash-up API (LLAPI) with the relational database MySQL and
web services, which can help access the standardized data, in order to support the inte-
gration of heterogeneous life log services. Nakamura and Nishio [20] proposed a method
to infer users’ temporal preference according to their interests by analyzing web brows-
ing logs.
As for data aggregation [21–26], Ozdemir [21] presented a reliable data aggregation and
transmission protocol based on the concept of functional reputation, which can improve
the reliability of data aggregation and transmission. Jung et al. [22] proposed and devel-
oped two data aggregation mechanisms based on hybrid clustering, a combination of the
clustering-based data aggregation mechanism and adaptive clustering-based data aggregation,
in order to improve both data aggregation and energy efficiency. Iftikhar and Pedersen [23]
developed a rule-based tool to maintain data at different levels of granularity, which fea-
tures automatic gradual data aggregation. Rahman et al. [24] proposed a privacy-preserving
data aggregation scheme named REBIVE (reliable private data aggregation scheme), which
considers both data accuracy maintenance and privacy protection, in order to provide a
privacy preservation technique to maintain data accuracy for realistic environments. Wei
et al. [25] proposed three prediction-based data aggregation approaches, Grey Model–
based Data Aggregation (GMDA), Kalman Filter–based Data Aggregation (KFDA), and
Combined Grey model and Kalman Filter Data Aggregation (CoGKDA), which can help
reduce redundant data communications. Ren et al. [26] proposed an attribute-aware
data aggregation (ADA) scheme, which consists of a packet-driven timing algorithm and
a special dynamic routing protocol, to improve data aggregation efficiency and further
benefit data redundancy elimination.
Research has also been done on data integration. Getta [27] exploited a system for
online data integration based on the binary operations of relational algebra. Atkinson et
al. [28] developed a framework of data mining, access, and integration, to deal with the
scale-up issue of data integration and data mining across heterogeneous and distributed
data resources and data mining services. Sarma et al. [29] described approximation algo-
rithms to solve the cost-minimization and maximum-coverage problems for data inte-
gration over a large number of dependent sources. Tran et al. [30] developed a platform
for distributed data integration and mining, in which data are processed as streams by
processing elements connected together into workflows. Gong et al. [31] introduced the
architecture and prototype implementation to provide a dynamic solution for integration
of heterogeneous data sources.
Research has been done on processing stream data [32,33]. The Chronicle data model
[32] is the first model proposed to capture and maintain stream data. Babcock et al. [33]
examined models and issues in stream query language and query processing. A host of
research has been done to make use of stream data and create Social Semantic Microblogs
or use Semantic Webs to link and reuse stream data across Web 2.0 platforms [34–39].
Ebner [34] showed how a microblog can be used during a presentation to enhance it
through instant discussions among the individuals in a classroom. In addition
to traditional conference tools, Reinhardt et al. [35] utilized the microblog to enhance
knowledge among a group by connecting a diverse online audience. Ebner et al. [36]
indicated that a microblog should be seen as a completely new form of communication
that can support informal learning beyond classrooms. Studies have also been done to
create a prototype for distributed semantic microblogging [37]. Passant et al. [38] devel-
oped a platform for open, semantic, and distributed microblogs by combining Social
Web principles and state-of-the-art Semantic Web and Linked Data technologies. Bojars
et al. [39] used the Semantic Web to link and reuse distributed stream data across Web
2.0 platforms.
• Drop: A drop is a minimum unit of data streams, such as a message posted to a micro-
blog (e.g., Twitter) by a user or a status change in SNS (e.g., Facebook).
• Stream: A stream is a collection of drops in a timeline, which contains the messages,
activities, and actions of a user.
• River: A river is a confluence of streams from different users that is formed by fol-
lowing or subscribing to their followers/friends. It could be extended to followers’
followers.
• Ocean: An ocean is a combination of all the streams.
FIGURE 13.1 Graph model for social streams. (With kind permission from Springer Science+
Business Media: Multimedia Tools and Applications, “Enriching User Search Experience by
Mining Social Streams with Heuristic Stones and Associative Ripples,” 63, 2013, 129–144, Zhou,
X.K. et al.)
Thus, the personal data posted by each user can be seen as a drop, and the drops com-
ing from one user converge together to form a stream. Then the streams of the user
and his/her friends can form a river. Finally, all the streams come together to form an
ocean. These processes are shown in Figure 13.1.
The following definitions are used for seeking related information that satisfies users'
current needs and interests [5].
• Heuristic stone: It represents one of a specific user's current interests, which may
change dynamically.
• Associative ripple: It is a meaningfully associated collection of the drops related to
some topics of a specific user's interests, which is formed by the heuristic stone in
the river.
• Associative drop: It is a drop distributed in an associative ripple, which is related
to one specific heuristic stone and can be further collected into the organic stream.
Building on these notions, an organic stream can be described as a combination of a set of heuristic stones Hs, a collection of associative drops Ad, and the relations R among them, where
Hs = {Hs[u1, t1], Hs[u2, t2], …, Hs[um, tn]}: a nonempty set of heuristic stones in accor-
dance with different users' intentions, in which each Hs[ui, tj] indicates a specific
heuristic stone of a specific user, ui indicates the owner of this heuristic stone, and
tj indicates the time slice this heuristic stone belongs to;
Ad = {Ad1, Ad2, …, Adn}: a collection of associative drops, which can refer or link to each
other based on inherent or potential logicality; and
R: the relations among heuristic stones and associative drops in the organic stream.
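To make these definitions more concrete, the following minimal Python sketch models heuristic stones, associative drops, and the relation set R as plain data structures. All class and field names here are illustrative choices, not notation from the original papers.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass(frozen=True)
class HeuristicStone:
    """Hs[u_i, t_j]: one current interest of user u_i in time slice t_j."""
    user: str
    time_slice: int
    keyword: str

@dataclass(frozen=True)
class AssociativeDrop:
    """A drop (e.g., a microblog post) selected as relevant to some stone."""
    drop_id: int
    user: str
    timestamp: float
    text: str

@dataclass
class OrganicStream:
    """The combination of stones Hs, drops Ad, and the relations R among them."""
    stones: Set[HeuristicStone] = field(default_factory=set)
    drops: Set[AssociativeDrop] = field(default_factory=set)
    # R is kept as labelled pairs, e.g. ("stone-drop", stone, drop)
    relations: List[Tuple[str, object, object]] = field(default_factory=list)

    def link_stone_drop(self, stone: HeuristicStone, drop: AssociativeDrop) -> None:
        """Record one heuristic stone x associative drop relation."""
        self.stones.add(stone)
        self.drops.add(drop)
        self.relations.append(("stone-drop", stone, drop))
```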
Following these definitions, the relation R in the organic stream can be categorized
into three major types: relations between heuristic stones, relations between associative
drops, and relations between heuristic stones and associative drops. The details are
addressed as follows.
Relation between heuristic stone and associative drop: This type of relation identifies
the relationships between one heuristic stone and a series of associative drops, which can
be represented as heuristic stone × associative drop. It is the basic relation in the organic
stream. In detail, through this relation, the drops related to different heuristic stones can be connected
together in the organic stream.
Relation between heuristic stone and heuristic stone: This type of relation identifies the
relationships among the heuristic stones in the organic stream, which can be represented
as heuristic stone × heuristic stone. Note that the expression of heuristic stones has two
parameters, ui and tj; thus, this kind of relation can further be categorized into two sub-
types, which can be addressed as follows.
Hs[ui, tx] ↔ Hs[ui, ty]: This relation identifies the relationships among the heuristic stones of
a specific user ui. That is, this relation is used to describe the internal relationships or
changes in a specific user's intentions. In detail, given a series of heuristic stones belonging to
a specific user ui, represented as {Hs[ui, t1], Hs[ui, t2], …, Hs[ui, tn]}, the differences from Hs[ui,
t1] to Hs[ui, tn], viewed in sequence, can demonstrate the transitions of this user's interests or
needs in a specific period, which can be employed to infer his/her further intentions.
Hs[ui, tx] ↔ Hs[uj, ty]: This relation identifies the relationships of the heuristic stones between
two different users, ui and uj. That is, this relation is used to describe the external relation-
ships among different users' intentions. In detail, given two heuristic stones, represented as
Hs[ui, ta] and Hs[uj, tb] for two different users, the relationship can demonstrate the potential
connection between these two users in accordance with their dynamic interests or needs.
Relation between associative drop and associative drop: This type of relation identifies
the relationships among those drops that are clustered to the associative ripples and further
selected into the organic stream, which can be represented as associative drop × associative
drop. In detail, the drops connected together based on this relation can represent the whole
trend as well as its changes following the timeline.
By this definition, the organic stream serves as a carrier of hidden knowledge
and potential information, in which three major relations are constructed to describe
the inherent relationships and logic, in order to organize the raw stream data into
meaningful and methodical content. In other words, the organic stream is a pre-prepared
collection that contains diversified and dynamically changing information, which shall also
be extended over time, so that it can associatively provide users with various services in
accordance with their different requirements.
w_i = F(i, t) \cdot \frac{T}{\sum_{j=1}^{n} In(i, t_j)} + \frac{\sum_{j=1}^{n} F(i, t_j)}{M}    (13.2)
where w_i denotes the weight assigned to key word i.
FIGURE 13.2 Image of the generation process of organic streams. (From Zhou, X.K. et al.,
“Organic Stream: Meaningfully Organized Social Stream for Individualized Information Seeking
and Knowledge Mining,” Proc. the 5th IET International Conference on Ubi-Media Computing
[U-Media2012], Xining, China, Aug. 16–18, 2012.)
In Equation 13.2, F(i, t) indicates the frequency of key word i in a specific period t, within
which the heuristic stone will be extracted, while T indicates the whole interval over which
the organic stream will be generated. For example, if T indicates 1 month, then t can be
set to 1 day. M indicates the total number of key words in period T. That is, the former
part, F(i, t) \cdot T / \sum_{j=1}^{n} In(i, t_j), is employed to calculate the transilient interest, while the latter
part, \sum_{j=1}^{n} F(i, t_j) / M, is employed to calculate the durative interest. Finally, any key word with
a weight that is higher than a given threshold δ will be extracted as a heuristic stone.
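Purely as an illustration, the weight of Equation 13.2 could be computed along the following lines. The function and parameter names are hypothetical, T is assumed to be expressed in time-slice units, and In(i, t_j) is interpreted here as an indicator of whether key word i appears in slice t_j, since the chapter does not spell this term out.

```python
from typing import Dict, List

def keyword_weight(freq_per_slice: Dict[int, int], current_slice: int,
                   period_T: float, total_keywords_M: int) -> float:
    """One plausible reading of Equation 13.2:
    w_i = F(i,t) * T / sum_j In(i,t_j)  +  sum_j F(i,t_j) / M

    freq_per_slice[t] -- F(i,t): frequency of key word i in time slice t
    current_slice     -- the slice t for which a heuristic stone is sought
    period_T          -- T, measured in time-slice units (e.g., 30 for a month of days)
    total_keywords_M  -- M: total number of key words observed in period T
    In(i,t_j) is taken as 1 if key word i appears in slice t_j, else 0.
    """
    appearances = sum(1 for f in freq_per_slice.values() if f > 0)   # sum of In(i,t_j)
    transilient = freq_per_slice.get(current_slice, 0) * period_T / max(appearances, 1)
    durative = sum(freq_per_slice.values()) / max(total_keywords_M, 1)
    return transilient + durative

def extract_heuristic_stones(weights: Dict[str, float], delta: float) -> List[str]:
    """Key words whose weight exceeds the threshold delta become heuristic stones."""
    return [kw for kw, w in weights.items() if w > delta]
```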
Based on these, those related drops can be selected into the associative ripples with the
corresponding heuristic stone and go further to generate the organic stream according to
the relation R. The organization algorithm is shown in Figure 13.3.
As a summary, in order to capture an individual’s time-changing needs, the whole
time period should be divided into several time slices (e.g., 1 day, 1 week), in which the
developed TF-IDF method can be employed to extract data to represent individual needs,
concerns, and interests within this specific time slice. After that, the data related to the
extracted individual need can be selected and aggregated according to the heuristic stone ×
associative drop relation in a heuristic way. Based on these, the associative ripples can be
generated, in which the selected data shall be integrated and organized following the
associative drop × associative drop relation. Finally, a series of associative ripples will com-
pose the organic stream according to the heuristic stone × heuristic stone relation in an
extensible way.
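The aggregation step of this workflow (the heuristic stone × associative drop relation) might be sketched as follows. The keyword-containment test is a deliberately naive stand-in for whatever relevance measure an actual implementation would use, and the dictionary-based drop format is an assumption.

```python
from typing import Dict, Iterable, List

def aggregate_drops(stones: Iterable[str],
                    drops: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group raw drops around each heuristic stone (stone x drop relation).

    Each drop is assumed to be a dict like {"id": 3, "user": "u1", "text": "..."};
    a drop is attached to a stone when the stone's key word occurs in its text.
    """
    ripples: Dict[str, List[dict]] = {s: [] for s in stones}
    for drop in drops:
        text = drop.get("text", "").lower()
        for stone in ripples:
            if stone.lower() in text:
                ripples[stone].append(drop)   # the drop becomes an associative drop
    return ripples
```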
FIGURE 13.3 Algorithm for generating organic stream. (From Zhou, X.K. et al., “Organic Stream:
Meaningfully Organized Social Stream for Individualized Information Seeking and Knowledge
Mining,” Proc. the 5th IET International Conference on Ubi-Media Computing [U-Media2012],
Xining, China, Aug. 16–18, 2012.)
FIGURE 13.4 System architecture with the major functional modules. (From Zhou, X.K. et al., "Organic Streams:
Data Aggregation and Integration Based on Individual Needs," Proc. the 6th IEEE International
Conference on Ubi-Media Computing [U-Media2013], Aizu-Wakamatsu, Japan, Nov. 2–4, 2013.)
FIGURE 13.5 An example of experimental analysis results for data aggregation. (From Zhou, X.K. et al., "Organic Streams: Data Aggregation and
Integration Based on Individual Needs," Proc. the 6th IEEE International Conference on Ubi-Media Computing [U-Media2013], Aizu-Wakamatsu,
Japan, Nov. 2–4, 2013.)
[Figure 13.6 illustrates the integration step: numbered raw-data drops are selected as associative drops around a heuristic stone and arranged into an associative ripple, alongside a table listing fifteen sample tweets (with their authors) used as the raw data.]
FIGURE 13.6 Conceptual image of data integration. (From Zhou, X.K. et al., “Organic Streams:
Data Aggregation and Integration Based on Individual Needs,” Proc. the 6th IEEE International
Conference on Ubi-Media Computing [U-Media2013], Aizu-Wakamatsu, Japan, Nov. 2–4, 2013.)
We further integrate the aggregated data according to the relation R discussed in Section
13.3.2. As shown in Figure 13.6, the selected heuristic stone has become the aggregating
center, and the related data converge to it as associative drops. According to the rela-
tion heuristic stone × associative drop, the distance from associative drop to the center,
namely, the heuristic stone, describes the relevance between them, that is, the more related
associative drops would be closer to the center. Moreover, according to the relation asso-
ciative drop × associative drop, the associative drops that have the same relevance to the
heuristic stone are distributed in the same layer. For instance, in Figure 13.6, all associative
drops are distributed in four layers, while within each layer, the more related data stay closer to
each other. Finally, the heuristic stone with these related associative drops will form the
associative ripple.
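The layered arrangement of Figure 13.6 can be emulated with a simple relevance-based bucketing, as sketched below. The relevance function, its assumed [0, 1] range, and the default of four layers are illustrative assumptions.

```python
from typing import Callable, Dict, List

def integrate_into_layers(stone: str, drops: List[dict],
                          relevance: Callable[[str, dict], float],
                          num_layers: int = 4) -> Dict[int, List[dict]]:
    """Place associative drops into concentric layers around a heuristic stone.

    Higher relevance -> closer layer (layer 1). Drops with equal relevance
    end up in the same layer, mirroring the drop x drop relation in the text.
    """
    layers: Dict[int, List[dict]] = {k: [] for k in range(1, num_layers + 1)}
    for drop in drops:
        r = max(0.0, min(1.0, relevance(stone, drop)))   # clamp to [0, 1]
        layer = min(num_layers, 1 + int((1.0 - r) * num_layers))
        layers[layer].append(drop)
    return layers
```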
13.5 SUMMARY
In this chapter, we have introduced and described the organic stream as a unified framework
to integrate and organize personal Big Data from life logs in accordance with individual
needs, in order to benefit the sustainable information utilization and sharing process in the
social networking environment.
We have reviewed our previous works, in which a graph model was presented to
describe various social stream data in a hierarchical way and a set of stream metaphors
was employed to describe the organization process of the social streams. In addition, we
described and discussed the refined organic stream as a flexible and extensible data car-
rier, to aggregate and integrate personal Big Data according to individual needs, in which
the raw stream data can be associatively and methodically organized into a meaningful
form. We introduced in detail the heuristic stone, associative drop, and associative ripple
to represent an individual need, related data, and an organized data set, respectively. Three
major relations, heuristic stone × associative drop, heuristic stone × heuristic stone, and
associative drop × associative drop, were defined and proposed to discover and describe the
intrinsic and potential relationships among the aggregated data. Based on these, an inte-
grated mechanism was developed to extract individuals’ time-varying needs and further
aggregate and integrate personal Big Data into an associatively organized form, which can
further support the information fusion and knowledge discovery process. Finally, we pre-
sented the system architecture with the major functional modules, and the experimental
analysis results demonstrated the feasibility and effectiveness of the proposed framework
and approach.
REFERENCES
1. C.Q. Ji, Y. Li, W.M. Qiu, U. Awada, and K.Q. Li, “Big Data processing in cloud computing
environments,” in Proc. 12th International Symposium on Pervasive Systems, Algorithms and
Networks (ISPAN), December 13–15, 2012, pp. 17–23.
2. “Big Data: Science in the petabyte era,” Nature vol. 455, no. 7209, p. 1, 2008.
3. L. Douglas, The Importance of "Big Data": A Definition, Gartner, Stamford, CT, 2008.
4. H. Chen, X.K. Zhou, H.F. Man, Y. Wu, A.U. Ahmed, and Q. Jin, “A framework of organic
streams: Integrating dynamically diversified contents into ubiquitous personal study," in Proc.
of 2nd International Symposium on Multidisciplinary Emerging Networks and Systems, Xi’an,
China, pp. 386–391, 2010.
5. X.K. Zhou, H. Chen, Q. Jin, and J.M. Yong, “Generating associative ripples of relevant informa-
tion from a variety of data streams by throwing a heuristic stone,” in Proc. of 5th International
Conference on Ubiquitous Information Management and Communication (ACM ICUIMC ’11),
Seoul, Korea, 2011.
6. X.K. Zhou, N.Y. Yen, Q. Jin, and T.K. Shih, “Enriching user search experience by mining social
streams with heuristic stones and associative ripples,” Multimed. Tools Appl. (Springer) vol. 63,
no. 1, pp. 129–144, 2013.
7. X.K. Zhou, J. Chen, Q. Jin, and T.K. Shih, “Organic stream: Meaningfully organized social
stream for individualized information seeking and knowledge mining,” in Proc. the 5th IET
International Conference on Ubi-Media Computing (U-Media2012), Xining, China, August
16–18, 2012.
8. X.K. Zhou, Q. Jin, B. Wu, W. Wang, J. Pan, and W. Zheng, “Organic streams: Data aggregation
and integration based on individual needs,” in Proc. the 6th IEEE International Conference on
Ubi-Media Computing (U-Media2013), Aizu-Wakamatsu, Japan, November 2–4, 2013.
9. S. Alsubaiee, Y. Altowim, H. Altwaijry, A. Behm, V. Borkar, Y. Bu, M. Carey, R. Grover,
Z. Heilbron, Y. Kim, C. Li, N. Onose, P. Pirzadeh, R. Vernica, and J. Wen, “ASTERIX: An open
source system for ‘Big Data’ management and analysis (demo),” Proc. VLDB Endow. vol. 5,
no. 12, pp. 1898–1901, 2012.
10. S. Berkovich, and D. Liao, “On clusterization of ‘Big Data’ streams,” in Proc. the 3rd International
Conference on Computing for Geospatial Research and Applications (COM.Geo ’12), ACM, New
York, Article 26, 6 pp, 2012.
11. S.C.H. Hoi, J. Wang, P. Zhao, and R. Jin, “Online feature selection for mining Big Data,” in
Proc. the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining:
Algorithms, Systems, Programming Models and Applications (BigMine ’12), ACM, New York,
pp. 93–100, 2012.
12. J. Zhang, T. Li, and Y. Pan, “Parallel rough set based knowledge acquisition using MapReduce
from Big Data,” in Proc. the 1st International Workshop on Big Data, Streams and Heterogeneous
Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine ’12),
ACM, New York, pp. 20–27, 2012.
13. A. Menon, “Big Data @ Facebook,” in Proc. the 2012 Workshop on Management of Big Data
Systems (MBDS ’12), ACM, New York, pp. 31–32, 2012.
14. X. Han, L. Tian, M. Yoon, and M. Lee, “A Big Data model supporting information recommen-
dation in social networks,” in Proc. the Second International Conference on Cloud and Green
Computing (CGC), November 1–3, 2012, pp. 810–813.
15. M. Yamagiwa, M. Uehara, and M. Murakami, “Applied system of the social life log for eco-
logical lifestyle in the home,” in Proc. International Conference on Network-Based Information
Systems (NBIS ’09), August 19–21, 2009, pp. 457–462.
16. T. Hori, and K. Aizawa, “Capturing life-log and retrieval based on contexts,” in Proc. IEEE
International Conference on Multimedia and Expo (ICME ’04), June 27–30, 2004, pp. 301–304.
17. K.S. Hwang, and S.B. Cho, “Life log management based on machine learning technique,” in
Proc. IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems
(MFI), August 20–22, 2008, pp. 691–696.
18. H.H. Kang, C.H. Song, Y.C. Kim, S.J. Yoo, D. Han, and H.G. Kim, "Metadata for efficient storage
and retrieval of life log media,” in Proc. IEEE International Conference on Multisensor Fusion
and Integration for Intelligent Systems (MFI), August 20–22, 2008, pp. 687–690.
19. A. Shimojo, S. Matsumoto, and M. Nakamura, “Implementing and evaluating life-log
mashup platform using RDB and web services,” in Proc. the 13th International Conference
on Information Integration and Web-based Applications and Services (iiWAS ’11), ACM,
New York, pp. 503–506, 2011.
20. A. Nakamura and N. Nishio, "User profile generation reflecting user's temporal preference
through web life-log,” in Proc. the 2012 ACM Conference on Ubiquitous Computing (UbiComp
’12), ACM, New York, pp. 615–616, 2012.
21. S. Ozdemir, “Functional reputation based reliable data aggregation and transmission for wire-
less sensor networks,” Comput. Commun. vol. 31, no. 17, pp. 3941–3953, 2008.
22. W.S. Jung, K.W. Lim, Y.B. Ko, and S.J. Park, "Efficient clustering-based data aggregation tech-
niques for wireless sensor networks,” Wirel. Netw. vol. 17, no. 5, pp. 1387–1400, 2011.
23. N. Iftikhar and T.B. Pedersen, "A rule-based tool for gradual granular data aggregation," in
Proc. the ACM 14th International Workshop on Data Warehousing and OLAP (DOLAP ’11),
ACM, New York, pp. 1–8, 2011.
24. F. Rahman, E. Hoque, and S.I. Ahamed, “Preserving privacy in wireless sensor networks using
reliable data aggregation,” ACM SIGAPP Appl. Comput. Rev. vol. 11, no. 3, pp. 52–62, 2011.
25. G. Wei, Y. Ling, B. Guo, B. Xiao, and A.V. Vasilakos, “Prediction-based data aggregation in
wireless sensor networks: Combining grey model and Kalman filter," Comput. Commun.
vol. 34, no. 6, pp. 793–802, 2011.
26. F. Ren, J. Zhang, Y. Wu, T. He, C. Chen, and C. Lin, “Attribute-aware data aggregation using
potential-based dynamic routing in wireless sensor networks,” IEEE Trans. Parallel Distrib.
Syst. vol. 24, no. 5, pp. 881–892, 2013.
27. J.R. Getta, “Optimization of online data integration,” in Proc. 7th International Baltic Conference
on Databases and Information Systems, pp. 91–97, 2006.
28. M.P. Atkinson, J.I. Hemert, L. Han, A. Hume, and C.S. Liew, “A distributed architecture for data
mining and integration,” in Proc. the Second International Workshop on Data-Aware Distributed
Computing (DADC ’09), ACM, New York, pp. 11–20, 2009.
29. A.D. Sarma, X.L. Dong, and A. Halevy, “Data integration with dependent sources,” in Proc.
the 14th International Conference on Extending Database Technology (EDBT/ICDT ’11),
A. Ailamaki, S. Amer-Yahia, J. Pate, T. Risch, P. Senellart, and J. Stoyanovich (Eds.), ACM, New
York, pp. 401–412, 2011.
30. V. Tran, O. Habala, B. Simo, and L. Hluchy, “Distributed data integration and mining,” in Proc.
the 13th International Conference on Information Integration and Web-Based Applications and
Services (iiWAS ’11), ACM, New York, pp. 435–438, 2011.
31. P. Gong, I. Gorton, and D.D. Feng, “Dynamic adapter generation for data integration middle-
ware,” in Proc. the 5th International Workshop on Sotware Engineering and Middleware (SEM
’05), ACM, New York, pp. 9–16, 2005.
32. H.V. Jagadish, I.S. Mumick, and A. Silberschatz, “View maintenance issues for the chronicle
data model,” in Proc. ACM/PODS 1995, pp. 113–124, 1995.
33. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream
systems,” in Proc. of 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database
Systems, Madison, WI, 2002.
34. M. Ebner, “Introducing live microblogging: How single presentations can be enhanced by the
mass,” J. Res. Innov. Teach. vol. 2, no. 1, pp. 108–119, 2009.
35. W. Reinhardt, M. Ebner, G. Beham, and C. Costa, “How people are using Twitter during con-
ferences,” in Proc. of 5th EduMedia Conference, Salzburg, Austria, pp. 145–156, 2009.
36. M. Ebner, C. Lienhardt, M. Rohs, and I. Meyer, “Microblogs in higher education—A chance to
facilitate informal and process-oriented learning?,” Comput. Educ. vol. 55, no. 1, pp. 92–100,
2010.
37. A. Passant, T. Hastrup, U. Bojars, and J. Breslin, “Micro-blogging: A semantic web and distrib-
uted approach,” in Proc. of ESWC/SFSW 2008, Tenerife, Spain, 2008.
38. A. Passant, U. Bojars, J.G. Breslin, T. Hastrup, M. Stankovic, and P. Laublet, “An overview of
SMOB 2: Open, semantic and distributed micro-blogging,” in Proc. of AAAI/ICWSM 2010,
2010.
39. U. Bojars, J. Breslin, A. Finn, and S. Decker, “Using the Semantic Web for linking and reusing
data across Web 2.0 communities,” J. Web Semant. vol. 6, no. 1, pp. 21–28, 2008.
CHAPTER 14
Managing Big Trajectory Data
Kostas Patroumpas and Timos Sellis
CONTENTS
14.1 Introduction
14.2 Trajectory Representation and Management
14.3 Online Trajectory Compression with Spatiotemporal Criteria
14.4 Amnesic Multiresolution Trajectory Synopses
14.5 Continuous Range Search over Uncertain Locations
14.6 Multiplexing of Evolving Trajectories
14.7 Toward Next-Generation Management of Big Trajectory Data
References
ABSTRACT
As smartphones and GPS-enabled devices proliferate, location-based services become
all the more important in social networking, mobile applications, advertising, traffic
monitoring, and many other domains. Managing the locations and trajectories of
numerous people, vehicles, vessels, commodities, and so forth must be efficient and
robust, since this information must be processed online and should provide answers
to users’ requests in real time. In this geostreaming context, such long-running con-
tinuous queries must be repeatedly evaluated against the most recent positions relayed
by moving objects, for instance, reporting which people are now moving in a specific
area or finding friends closest to the current location of a mobile user. In essence,
modern processing engines must cope with huge amounts of streaming, transient,
uncertain, and heterogeneous spatiotemporal data, which can be characterized as
big trajectory data. In this chapter, we examine Big Data processing techniques over
frequently updated locations and trajectories of moving objects. Rapidly evolving tra-
jectory data pose several research challenges with regard to their acquisition, storage,
indexing, analysis, discovery, and interpretation in order to be really useful for intel-
ligent, cost-effective decision making. Indeed, the Big Data issues regarding volume,
velocity, variety, and veracity also arise in this case. Thus, we foster a close synergy
between the established stream processing paradigm and spatiotemporal properties
inherent in motion features. Taking advantage of the spatial locality and tempo-
ral timeliness that characterize each trajectory, we present methods and heuristics
from our recent research results that address such problems. We highlight certain
aspects of big trajectory data management through several case studies. Regarding
volume, we suggest single-pass algorithms that can summarize each object’s course
into succinct, reliable representations. To cope with velocity, an amnesic trajectory
approximation structure may offer fast, multiresolution synopses by dropping details
from obsolete segments. Detection of objects that travel together can lead to trajec-
tory multiplexing, hence reducing the variety inherent in raw positional data. As for
veracity, we discuss a probabilistic method for continuous range monitoring against
user locations with varying degrees of uncertainty, due to privacy concerns in geo-
social networking. Last, but not least, as we are heading toward a next-generation
framework in trajectory data management, we point out interesting open issues that
may provide rich opportunities for innovative research and applications.
14.1 INTRODUCTION
Nowadays, hundreds of millions of GPS-enabled devices are in use globally. The amount
of information exchanged on a daily basis is in the order of terabytes for several social net-
works (like Facebook or Twitter) or financial platforms (e.g., New York Stock Exchange).
Billions of radio-frequency identification (RFID) tags are generated per day, and millions
of sensors collect measurements in diverse application domains, ranging from meteo-
rology and biodiversity to energy consumption, battlefield monitoring, and many more.
These staggering figures are expected to double every 2 years over the next decade, as the
number of Internet users, smartphone holders, online customers, and networked sensors
is growing at a very fast rate. Quite often, this information also includes a spatial aspect,
either explicitly in the form of coordinates (from GPS, RFID, or global system for mobile
communications [GSM]) or implicitly (via geo-tagged photos or addresses). As smart-
phones and GPS-enabled devices proliferate and related platforms penetrate into the mar-
ket, managing the bulk of rapidly accumulating traces of objects’ movement becomes all
the more crucial in modern, high-availability applications. Such location-based services
(LBSs) are capable of identifying the geographical location of a mobile device and then
providing services based on that location. Platforms for traffic surveillance and fleet man-
agement, mobile applications in tourism or advertising, notification systems for natural
resources and hazards, tracing services for stolen objects or elderly people, and so forth
must cope with huge amounts of such flowing, uncertain, and heterogeneous spatiotem-
poral data.
These data sets can certainly be characterized as Big Data [1], not only because of their
increasingly large volumes but also due to their volatility and complexity, which make
their processing and analysis quite difficult, if not impossible, through traditional data
management tools and applications. Obviously, it is too hard for a centralized server to
sustain massive amounts of positional updates from a multitude of people, vehicles, ves-
sels, containers, and so forth, which may also affect network traffic and load balancing.
The main challenge is how such fluctuating, transient, and possibly unbounded posi-
tional data streams can be processed in online fashion. In such a geostreaming con-
text, it is important to provide incremental answers to continuous queries (CQs), that
is, to various and numerous user requests that remain active for long and require their
results to be refreshed upon changes in the incoming data [2]. There are several types of
location-aware CQs [3,4] that examine spatial relationships against the current locations
of numerous moving objects. For instance, a continuous range search must report which
objects are now moving in a specific area of interest (e.g., vehicles in the city center). A
mobile user who wishes to find k of his/her friends that are closest to his/her current
location is an example of a continuous k-nearest neighbor (k-NN) search. As the moni-
tored objects are moving and their relative positions are changing, new results must
be emitted, thus cancelling or modifying previous ones (e.g., 5 min after the previous
answer to this k-NN CQ, the situation has changed, and another friend is now closer to
him/her).
In the course of time, many locations per object are being accumulated and can actually
be used to keep track of its trace. Of course, it is not realistic to maintain a continuous trace
of each object, since practically, point samples can only be collected from the respective
data source at distinct time instants (e.g., every few seconds, a car sends its location mea-
sured by a GPS device). Still, these traces can be a treasure trove for innovative data analy-
sis and intelligent decision making. In fact, data exploration and trend discovery against
such collections of evolving trajectories are also crucial, beyond their efective storage or
the necessity for timely response to user requests. From detection of locks [5] or convoys
[6] in leet management to clustering [7] for wildlife protection, similarity joins [8] in vehi-
cle traces for carpooling services, or even identiication of frequently followed routes [9,10]
for efective traic control, the prospects are enormous.
he core challenges regarding Big Data [11] are usually described through the well-
known three Vs, namely, volume, velocity, and variety, although in Reference 12, it was
proposed to include a fourth one for veracity. In the particular case of big trajectory data
collected from moving objects, these challenges may be further speciied as follows:
• Volume. If numerous vehicles, parcels, smartphones, animals, and so forth are moni-
tored and their locations get relayed very frequently, then large amounts of positional
data are being captured. Hence, the processing mechanism must be scalable, as the
sheer volume of positional data may be overwhelming and could exceed several tera-
bytes per day. Further, processing every single position incurs some overhead but does
not necessarily convey significant movement changes (e.g., if a vessel moves along a
straight line for some time), so opportunities exist for data compression because of
similar redundancies in the input.
• Velocity. As data are generated continuously at various rates and are inherently
streaming in a time-varying fashion, they must be handled and processed in real
time. However, typical spatiotemporal queries like range, k-NN, and so forth are
costly, and their results should not be computed from scratch upon admission of
fresh input. Ideally, evaluation of the current batch of positional data must be com-
pleted before the next one arrives, in order to provide timely results and also keep
pace with the incoming streaming locations.
• Variety. Data can actually come from multiple, possibly heterogeneous sources and
may be represented in various types or formats, either structured (e.g., as relational
tuples, sensor readings) or unstructured (logs, text messages, etc.). Location could
have differing semantics, for example, positions may come from both GPS devices
and GSM cells; hence, accuracy may vary widely, up to hundreds of meters. Besides,
these dynamic positional data might also require interaction with static data sets,
for example, for matching vehicle locations against the underlying road network.
Handling the intricacies of such data, eliminating noise and errors (e.g., positional
outliers), and interpreting latent motion patterns are nontrivial tasks and may be
subject to assumptions at multiple stages of the analysis.
• Veracity. Due to privacy concerns, hiding a user location and not just his/her identity
is important. Hence, positional data may be purposely noisy, obfuscated, or errone-
ous in order to avoid malicious or inappropriate use (e.g., prevent locating of people
or linking them to certain groups in terms of social, cultural, or market habits). But
handling such uncertain, incomplete information adds too much complexity to pro-
cessing; hence, results cannot be accurate and are usually associated with a probabi-
listic confidence margin.
To address such challenging research issues related to trajectory data collection, main-
tenance, and analytics at a very large scale, there have been several recent initiatives, each
proposing methods that take advantage of the unique characteristics of these data. In this
chapter, we focus on results from our own work regarding real-time processing techniques
over frequently updated locations and trajectories of moving objects. We foster a close
synergy between the stream processing paradigm and spatiotemporal properties inherent
in motion features, in line with modern trends regarding mobility of data, fast data access,
and summarization. Thus, we highlight certain case studies on big trajectory data, espe-
cially regarding online data reduction of streaming trajectories and approximate query
answering against scalable data volumes. In particular,
• In Section 14.2, we present fundamental notions about moving objects and trajectory
representation. We also stress certain characteristics that arise in modeling and que-
rying in a geostreaming context, such as the necessity of timestamps in the incoming
locations, the use of sliding windows, and the most common types of online analytics.
• In Section 14.3, we discuss single-pass approximation techniques based on sampling,
which take advantage of the spatial locality and temporal timeliness inherent in tra-
jectory streams. The objective is to tackle volume and maintain a concise yet quite
reliable summary of each object's movement, avoiding any superfluous details and
reducing processing complexity and communication cost.
• In Section 14.4, we present a hierarchical tree structure that reserves more precision
for the most recent object positions, while tolerating increasing error for gradually
aging stream portions. Intended to cope with rapidly updated locations (i.e., velocity),
this time-decaying approach can effectively provide online trajectory approxima-
tions at multiple resolutions and may also offer affordable estimates when counting
distinct moving objects.
• In Section 14.5, we describe a methodology for probabilistic range monitoring over
privacy-preserving user locations of limited veracity. Assuming a continuous uncer-
tainty model for streaming positional updates, novel pruning heuristics based on
spatial and probabilistic properties of the data are employed so as to offer reliable
response with quality guarantees.
• In Section 14.6, we outline an online framework for detecting groups of moving
objects with approximately similar routes over the recent past. Thanks to a flexible
encoding scheme, this technique can cope with the variety in motion features, by
synthesizing an indicative trajectory that collectively represents motion patterns per-
taining to objects in the same group.
• Finally, in Section 14.7, we point out several interesting open issues that may provide
rich opportunities for innovative research and applications involving big trajectory
and spatiotemporal data.
FIGURE 14.1 A three-dimensional trajectory and its two-dimensional trace on the Euclidean
plane.
Streaming positional updates are relayed to the processing engine from each moving source. However, no deletions or updates are
allowed to already-registered locations, so that coherence is preserved among append-
only positional items. On the basis of such current locations, the processor may evaluate
location-aware queries, for example, detect if a vehicle has just entered into a designated
region or if the closest friend has changed. The sequential nature of each object's path is
also very significant in order to inspect movement patterns over a period of time, as well
as interactions between objects (e.g., objects traveling together) or with stationary entities
(e.g., crossing region boundaries). In essence, this sequence of point locations traces the
movement of this object across time. More specifically, the trajectory T of a point object identi-
fied as id and moving over the Euclidean plane is a possibly unbounded sequence of items
consisting of its position p recorded at timestamp τ. Thus, each trajectory is approximated
as an evolving sequence of point samples collected from the respective data source at dis-
tinct time instants (e.g., a GPS reading taken every few seconds).
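A bare-bones representation of such streaming positional items, respecting the append-only semantics described above and the time monotonicity noted below, might look as follows; all names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class PositionUpdate:
    """One trajectory stream item: object id, 2-D position p, timestamp tau."""
    obj_id: str
    x: float
    y: float
    tau: float

# Trajectories accumulate as append-only point sequences per object.
trajectories: Dict[str, List[PositionUpdate]] = {}

def ingest(update: PositionUpdate) -> None:
    """Append a new location; earlier items are never modified (append-only)."""
    seq = trajectories.setdefault(update.obj_id, [])
    if seq and update.tau <= seq[-1].tau:
        return                        # enforce time monotonicity per object
    seq.append(update)
```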
Considering that numerous objects may be monitored, this information creates a trajec-
tory stream of locations concurrently evolving in space and time. It is worth mentioning
two characteristics inherent in such streaming trajectories. First, time monotonicity dictates
a strict ordering of spatial positions taken by a moving object all along its trajectory. Thus,
items of the trajectory stream are in increasing temporal order as time advances. Hence,
positional tuples referring to the same trajectory can be ordered by simply comparing their
time indications. Second, locality in an object’s movement should be expected, assuming
that its next location will be recorded somewhere close to the current one. Therefore, it is
plausible to anticipate consistent object paths in space, with the possible exception of dis-
continuities in movement due to communication failures or noise.
Several models and algorithms have been proposed for managing continuously moving
objects in spatiotemporal databases. In Reference 13, an abstract data model and a query
language were developed toward a foundation for implementing a spatiotemporal Data Base
Management System (DBMS) extension where trajectories are considered as moving points.
Based on that infrastructure, the SECONDO prototype [14] offered several built-in and exten-
sible operations. Due to obvious limitations in storage and communication bandwidth, it is
not computationally efficient to maintain a continuous trace of an object's movement. As a
trade-off between performance and positional accuracy, a discrete model is usually assumed
for movement, by simply collecting point samples along its course. The discrete model pro-
posed in Reference 15 decomposes temporal development into fragments and then uses a
simple function to represent movement along every such “time slice.” For trajectories, each
time slice is a line segment connecting two consecutively recorded locations of a given object.
In moving-object databases [16],
interpolation techniques can be used to estimate intermediate positions between recorded
samples and thus approximately reconstruct an acceptable trace of movement. In addition,
spline methods have also been suggested as a means to provide synchronized, curve-based
trajectory representations by resampling original positional measurements at a fixed rate.
In practice, a trajectory is often reconstructed as a "polyline" vector, that is, a sequence
of line segments for each object that connect each observed location to the one immedi-
ately preceding it. Therefore, from the original stream of successive positions taken by the
moving points, a stream of append-only line segments may be derived online, maintaining
progressively the trace of each object’s movement up to its most recent position. Again, no
updates are allowed to segments already registered, in order to avoid any modification to
historical information and thus preserve coherence among streaming trajectory segments.
We adhere to such representation of trajectories as point sequences across time, implying
that locations are linearly connected. Also, note that trajectories are viewed from a his-
torical perspective, not dealing at all with predictions about future object positions as in
Reference 17. Therefore, robust indexing structures are needed in order to support complex
evaluations in both space and time [18].
In LBS applications, several types of location-aware CQs can be expressed, so as to exam-
ine spatial interactions among moving objects or with speciic static geometries (including
spatial containment, range or distance search). According to the classification in Reference
18, two types of spatiotemporal queries can be distinguished:
1. Coordinate-based queries. In this case, it is the current position of objects that matters
most in typical range [19] and nearest-neighbor search [3].
and both time and space complexity. In the special case of trajectory streams, an addi-
tional requirement is posed: Not only exploit the timely spatiotemporal information but
also take into account and preserve the sequential nature of the data. Therefore, in order to
efficiently maintain trajectories online, there is no other way but to apply compression tech-
niques, thus not only reducing drastically the overall data size but also speeding up query
answering, for example, identifying pairs of trajectories that have recently been moving
close together. By intelligently dropping some points with negligible influence on the gen-
eral movement of an object, a simplified yet quite reliable trajectory representation may
be obtained. Such a procedure may be used as a filter over the incoming spatiotemporal
updates, essentially controlling the stream arrival rate by discarding items before passing
them to further processing. Besides, since each object is considered in isolation, such item-
at-a-time filtering could be applied directly at the sources, with substantial savings both in
communication bandwidth and in computation overhead at the processing engine.
Clearly, there is a need for techniques that can successfully maintain an online summary
consisting of the most significant trajectory locations, with minimal processing cost per point.
Such algorithms should take advantage of the spatiotemporal features that characterize
movement, successfully detecting changes in speed and orientation in order to produce a
representative synopsis as close as possible to the original route. Our key intuition in the
techniques proposed in Reference 23 is that a point should be taken into the sample as long
as it reveals a relative change in the course of a trajectory. If the location of an incoming
point can be safely predicted (e.g., via interpolation) from the current movement pattern,
then this point contributes little information and hence can be discarded without signifi-
cant loss in accuracy. As a result, retained points may be used to approximately reconstruct
the trajectory; discarded locations could be derived via linear interpolation with small
error. It is important to note that locations are examined not on the basis of spatial posi-
tions only but, rather, on velocity considerations (e.g., speed changes, turns, etc.), such that
the sample may catch any significant alterations to the known pattern of movement.
The first algorithm is called threshold-guided sampling because a new point is appended
to the retained trajectory sample once a specified threshold is exceeded for an incoming
location. A decision to accept or reject a point is taken according to user-defined rules
specifying the allowed tolerance (“threshold”) for changes in speed and orientation. In
order to decide whether the current location can be safely predicted from the recent past
and is thus superfluous, this approach takes into consideration both the mean and
instantaneous velocity of the object. As illustrated in Figure 14.2, the mean velocity VT
comes from the last two locations (A, B) stored in the current trajectory sample, while
the instantaneous velocity VC is derived from the last two observed locations (B, C). Also
note that small variations in the orientation of the predicted course are tolerable; hence,
deviation by a few degrees is allowed, as long as it does not change the overall direction of
movement by more than dϕ (threshold parameter in degrees). Actually, the correspond-
ing two loci derived by these criteria are the ring sections SAT and SAC , called “safe areas,”
where the object is normally expected to be found next, according to the mean and current
instantaneous velocities, respectively. As justified in Reference 23, it is not sufficient to use
only one of these loci as a criterion to drop locations, because critical points may be missed
FIGURE 14.2 Finding the joint safe area (SAJ) in threshold-guided sampling.
and errors may be propagated and hence lead to distortions in the compressed trajectories.
Instead, taking the intersection SAJ of the two loci shown in Figure 14.2 as a joint safe area
is a more secure policy. As time goes by, this area is more likely to shrink as the number
of discarded points increases ater the last insertion into the sample, so the probability of
missing any critical points is diminishing. As soon as no intersection of these loci is found,
an insertion will be prompted into the compressed trajectory, regardless of the current
object location. The only problem with this scheme is that the total number of items in
the trajectory sample keeps increasing without eliminating any point already stored, so it
is not possible to accommodate it within a fixed memory space allocated to each trajectory.
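The decision logic of threshold-guided sampling can be approximated by the simplified sketch below, which checks deviations in speed and heading against user-defined tolerances instead of geometrically intersecting the ring-shaped safe areas of Figure 14.2; the tolerance values, helper names, and the exact acceptance test are assumptions, not the published algorithm.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]            # (x, y, timestamp)

def _velocity(p: Point, q: Point) -> Tuple[float, float]:
    """Speed and heading (radians) of the segment p -> q."""
    dt = max(q[2] - p[2], 1e-9)
    vx, vy = (q[0] - p[0]) / dt, (q[1] - p[1]) / dt
    return math.hypot(vx, vy), math.atan2(vy, vx)

def threshold_sample(stream: List[Point],
                     speed_tol: float = 0.2,              # allowed relative speed change
                     angle_tol: float = math.radians(10)  # allowed heading change
                     ) -> List[Point]:
    """Simplified threshold-guided sampling: retain a point when the movement
    it implies deviates from the mean (sample-based) velocity V_T beyond the
    given tolerances, using the instantaneous velocity V_C of the raw stream."""
    if len(stream) < 3:
        return list(stream)
    sample = [stream[0], stream[1]]           # keep the first two points to seed V_T
    last_seen = stream[1]                     # most recent observed location
    for c in stream[2:]:
        v_mean, h_mean = _velocity(sample[-2], sample[-1])   # V_T from the sample
        v_inst, h_inst = _velocity(last_seen, c)             # V_C from raw locations
        speed_dev = abs(v_inst - v_mean) / max(v_mean, 1e-9)
        angle_dev = abs(math.atan2(math.sin(h_inst - h_mean),
                                   math.cos(h_inst - h_mean)))
        if speed_dev > speed_tol or angle_dev > angle_tol:
            sample.append(c)                  # movement pattern changed: retain point
        last_seen = c
    return sample
```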
To remedy this downside, the second algorithm introduced in Reference 23 and called
STTrace is more tailored for streaming environments. he intuition behind STTrace is to
use an insertion scheme based on the recent movement features but at the same time allow-
ing deletions from the sample to make room for the newly inserted points without exceed-
ing allocated memory per trajectory. However, a point candidate for deletion is chosen not
randomly over the current sample contents but according to its significance in trajectory
preservation. In order to quantify the importance of each point in the trajectory, we used
a metric based on the notion of synchronous Euclidean distance (SED) [24]. As shown in
Figure 14.3, for any point B in the retained sample (i.e., the compressed trajectory), SED
is the distance between its actual location and its time-aligned position B’ estimated via
FIGURE 14.3 The notion of synchronous Euclidean distance as used in the STTrace algorithm.
interpolation between its predecessor A and successor point C in the sample. Since this
is essentially the distance between the actual point and its spatiotemporal trace along the
line segment that connects its immediate neighbors in the sequence, this sampling algo-
rithm was named STTrace. Admittedly, it is better to discard a point that will produce the
least distortion to the current trajectory synopsis. As soon as the allocated memory gets
exhausted and a new point is examined for possible insertion, the compressed representa-
tion is searched for the item with the lowest SED. That point represents the least possible
loss of information in case it gets discarded. But this comes at a cost: For every new inser-
tion, the most appropriate candidate point must be searched for deletion over the entire
sample with O(N) worst-case cost, where N is the actual size of the compressed trajectory.
Nevertheless, as N is expectedly very small and the sampled points may be maintained in
an appropriate data structure (e.g., a binary balanced tree) with logarithmic cost for opera-
tions (search, insert, delete), normally, this is an affordable trade-off.
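A compact, simplified rendering of the STTrace idea is given below: the synchronous Euclidean distance is computed by linear interpolation between a point's neighbours, and the lowest-SED interior point is evicted whenever the per-trajectory budget is exceeded. A linear scan replaces the balanced-tree bookkeeping mentioned above, purely for brevity; this is a sketch, not the published implementation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]   # (x, y, timestamp)

def sed(prev: Point, cur: Point, nxt: Point) -> float:
    """Synchronous Euclidean distance: distance between `cur` and its
    time-aligned position interpolated on the segment prev -> nxt."""
    dt = max(nxt[2] - prev[2], 1e-9)
    ratio = (cur[2] - prev[2]) / dt
    ix = prev[0] + ratio * (nxt[0] - prev[0])
    iy = prev[1] + ratio * (nxt[1] - prev[1])
    return math.hypot(cur[0] - ix, cur[1] - iy)

def sttrace(stream: List[Point], budget: int) -> List[Point]:
    """Simplified STTrace: keep at most `budget` (>= 2) points per trajectory
    by evicting the interior point whose removal distorts the synopsis least
    (i.e., the one with the lowest SED). First and last points are preserved."""
    sample: List[Point] = []
    for p in stream:
        sample.append(p)
        if len(sample) > budget:
            victim = min(range(1, len(sample) - 1),
                         key=lambda i: sed(sample[i - 1], sample[i], sample[i + 1]))
            del sample[victim]
    return sample

# Example: compress a 1000-point synthetic course down to 50 points.
if __name__ == "__main__":
    course = [(t * 1.0, math.sin(t / 20.0) * 10.0, float(t)) for t in range(1000)]
    print(len(sttrace(course, 50)))   # -> 50
```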
There have been several other attempts to compress moving-object traces, either
through load shedding policies on incoming locations like in References 25 through 27 or
by taking also into account trajectory information as in References 28 and 29. Still, based
on a series of experiments reported in Reference 23, threshold-guided sampling emerges
as a robust mechanism for semantic load shedding for trajectories, filtering out negligible
locations with minor computational overhead. The actual size of the sample it provides is
a rough indication of the complexity of each trajectory, and the parameters can be fine-
tuned according to trajectory characteristics and memory availability. Besides, it can be
applied close to the data sources instead of a central processor, sparing both transmission
cost and processing power. Regarding efficiency, STTrace always manages to maintain a
small representative sample for a trajectory of unknown size. It outperforms threshold-
guided sampling for small compression rates since it is not easy to define suitable threshold
values in this case. Empirical results show that STTrace incurs some overhead in maintain-
ing the minimum synchronous distance and in-memory adjustment of the sampled points.
However, this cost can be reduced if STTrace is applied in a batch mode, that is, executed
at consecutive time intervals.
FIGURE 14.5 Four snapshots of an example operation of AmTree against numeric streaming items.
Upon the arrival of a new item, the previous content of node R0 is shifted to node L0. As time goes by and new data come in, the contents of each level are
merged using a function g (average in the example of Figure 14.5) and propagated higher
up in the tree, thus retaining less detail. Note that node R0 is the only entry point to the
synopsis maintained by the AmTree. As justified in Reference 31, AmTree updates can be
carried out online in O(1) amortized time per location with only logarithmic requirements
in memory storage.
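One plausible realization of this update scheme is sketched below: every item enters R0 (shifting the previous R0 to L0), and every second arrival at a level triggers a merge via g that is propagated one level up. The class layout and the exact merging discipline are reconstructions for illustration, not the authors' implementation.

```python
from typing import Any, Callable, List, Optional

class AmTree:
    """Minimal amnesic-tree sketch: each level i keeps two nodes [L_i, R_i].
    Recent items stay detailed at level 0, while older material survives only
    in progressively coarser, merged form at higher levels."""

    def __init__(self, g: Callable[[Any, Any], Any]):
        self.g = g                                  # merge function g
        self.levels: List[List[Optional[Any]]] = [] # each entry: [L_i, R_i]
        self.arrivals: List[int] = []               # arrival counter per level

    def _insert_at(self, level: int, item: Any) -> None:
        if level == len(self.levels):
            self.levels.append([None, None])
            self.arrivals.append(0)
        old_R = self.levels[level][1]
        self.levels[level] = [old_R, item]          # previous R_i shifts to L_i
        self.arrivals[level] += 1
        if self.arrivals[level] % 2 == 0:           # every second arrival: merge up
            L, R = self.levels[level]
            self._insert_at(level + 1, self.g(L, R))

    def insert(self, item: Any) -> None:
        """Entry point: new (already f-transformed) items always enter R_0."""
        self._insert_at(0, item)

# Toy run with g = average, loosely following the flavour of Figure 14.5;
# the input values are assumed to be already mapped through f.
if __name__ == "__main__":
    tree = AmTree(lambda a, b: (a + b) / 2)
    for x in [8, 4, 10, 6]:
        tree.insert(x)
    print(tree.levels)   # [[10, 6], [6.0, 8.0], [None, 7.0]]
```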
This framework is best suited for summarizing streams of sequential features, and it has
been applied to create amnesic synopses concerning singleton trajectories. As an alterna-
tive representation to a time series of points, the trajectory of a moving object can be rep-
resented with a polyline composed of consecutive displacements. Every such line segment
connects a pair of successive point locations recorded for this object, eventually providing
a continuous, though approximate, trace of its movement. With respect to compressing a
single trajectory, an AmTree instantiation manipulates all successive displacement tuples
relayed by this object. In direct correspondence to the generic AmTree functionality, map-
ping f converts every current object position into a displacement tuple with respect to its
previous location. This displacement is then inserted into the R0 node, possibly triggering
further updates higher up in the AmTree. When the contents of level i must be merged
to produce a coarser representation, a simple concatenation function g is used to com-
bine the successive displacements stored in nodes Li and Ri. After eliminating the common
articulation point of the two original segments, a single line segment is produced and then
stored in node Ri+1. As a result, end points of all displacements stored in AmTree nodes
correspond to original positional updates, while consecutive displacements remain con-
nected to each other at every level. Evidently, an amnesic behavior is achieved for trajectory
segments through levels of gradually less detail in this bottom-up tree maintenance. As
long as successive displacements are preserved, the movement of a particular object can be
properly reconstructed by choosing points in descending temporal order, starting from its
most recent position and going steadily backward in time. Any trajectory reconstruction
process can be gradually reined by combining information from multiple levels and nodes
of the tree, leading to a multiresolution approximation for a given trajectory, as depicted in
Figure 14.4.
This structure can also provide unbiased estimates for the number of objects that are
moving in an area of interest during a specified time interval. When each object must be
counted only once, the problem is known as distinct counting [33]. We consider a regular
decomposition of the two-dimensional Euclidean plane into equal-area grid cells, which
serve as a simplified spatial reference for the moving objects instead of their exact coordi-
nates. Thus, each cell corresponds to a separate AmTree, which maintains gradually aging
information concerning the number of moving objects inside that cell. Query-oriented
compression is achieved using Flajolet-Martin (FM) sketches [34], which are based on
Probabilistic Counting with Stochastic Averaging (FM_PCSA). Each node of an AmTree
retains m bitmap vectors utilized by the FM_PCSA sketch, as illustrated in Figure 14.6.
Hence, we avoid enumeration of objects, as we are satisfied with an acceptable estimate of
their distinct count (DC) given by the sketching algorithm. In order to estimate the number
of distinct objects moving within a given area a during a time interval Δτ, we first identify
FIGURE 14.6 The three-tier FM-AmTree structure used for approximate distinct counting of
moving objects.
the grid cells that completely cover region a. Those cells determine the group of qualifying
AmTree structures that maintain the aggregates. For each such tree, we need to locate the
set of nodes that overlap the time period Δτ specified by the query; these nodes are identical
for each qualifying tree. By taking the union of the sketches attached to these nodes (i.e., an
OR operation over the respective bitmaps), we can finally provide an approximate answer
to the DC query.
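For reference, a toy FM_PCSA sketch with the OR-based union used above could look like the following. The hash choice, bitmap sizes, and object-id format are arbitrary, and the estimate's accuracy depends on the number of bitmaps m; this is a didactic sketch, not the implementation evaluated in Reference 31.

```python
import hashlib

PHI = 0.77351  # standard correction factor for FM_PCSA estimates

class FMSketch:
    """Tiny FM_PCSA sketch: m bitmaps of w bits each. Supports adding object
    identifiers, OR-style union (as when combining AmTree nodes or grid cells),
    and an approximate distinct-count estimate."""

    def __init__(self, m: int = 16, w: int = 32):
        self.m, self.w = m, w
        self.bitmaps = [0] * m

    def _hash(self, item: str) -> int:
        return int(hashlib.md5(item.encode()).hexdigest(), 16)

    def add(self, item: str) -> None:
        h = self._hash(item)
        j = h % self.m                          # pick one of the m bitmaps
        rest = h // self.m
        rho = 0                                 # position of least-significant 1-bit
        while (rest & 1) == 0 and rho < self.w - 1:
            rest >>= 1
            rho += 1
        self.bitmaps[j] |= (1 << rho)

    def union(self, other: "FMSketch") -> "FMSketch":
        out = FMSketch(self.m, self.w)
        out.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return out

    def estimate(self) -> float:
        # R_j = index of the lowest zero bit in bitmap j
        total = 0
        for bm in self.bitmaps:
            r = 0
            while bm & (1 << r):
                r += 1
            total += r
        return (self.m / PHI) * (2 ** (total / self.m))

# Distinct objects seen in two grid cells over the queried interval:
if __name__ == "__main__":
    cell_a, cell_b = FMSketch(), FMSketch()
    for i in range(500):
        cell_a.add(f"obj{i}")
    for i in range(300, 800):
        cell_b.add(f"obj{i}")
    print(round(cell_a.union(cell_b).estimate()))   # roughly 800, within sketch error
```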
The experimental study conducted in Reference 31 confirms that recent trajectory segments
always remain more accurate, while overall error largely depends on the temporal extent of the
query in the past. Even for heavily compressed trajectories, accuracy is proven quite satisfactory
for answering spatiotemporal range queries. With respect to DC queries, it was observed that a
finer grid partitioning yields increased accuracy at the expense of more processing time.
Overall, the suggested AmTree framework is a modular amnesic structure with expo-
nential decay characteristics. It is especially tailored to cope with streaming positional
updates, and it can retain a compressed outline of entire trajectories, always preserving
contiguity among successive segments for each individual object. Effectively, such a policy
constructs a multiresolution path, offering finer motion paths for the recent past and keep-
ing gradually less and less detail for aging parts of the trajectory.
In conjunction with FM sketches, AmTree can further be used in spatiotemporal aggre-
gation for providing good-quality estimates to DC queries over locations of moving objects.
FIGURE 14.7 (a) Standard bivariate Gaussian distribution N(0, 1) as a model for an uncertain location. (b) Verifier with 7 × 7 elementary boxes for checking an object against range query q.
radius 3σ. If this MBB were uniformly subdivided into λ × λ elementary boxes, then each box would represent diverse cumulative probabilities, indicated by the differing shades in Figure 14.7a. However, for a known σ, the cumulative probability in each elementary box is independent of the parameters of the applied bivariate Gaussian distribution. Once precomputed (e.g., by Monte Carlo), these probabilities can be retained in a lookup table.
The rationale behind this subdivision is that it may be used as a discretized verifier V when probing uncertain Gaussians. Consider the case of query q against an object, shown as a shaded rectangle in Figure 14.7b. Depending on its topological relation with the given query, each elementary box of V can be easily characterized by one of three possible states: (1) T is assigned to elementary boxes totally within the query range; (2) F signifies disjoint boxes, that is, those entirely outside the range; and (3) U marks boxes partially overlapping with the specified query range. Then, summing up the respective cumulative probabilities for each subset of boxes returns three indicators pT, pF, and pU suitable for object validation against the cutoff threshold θ. The confidence margin in the results equals the overall cumulative probability of the U-boxes, which depends entirely on granularity λ. Indeed, a small λ can provide answers quickly, which is critical when coping with numerous objects. In contrast, the finer the subdivision into elementary boxes (i.e., a larger λ), the less the uncertainty in the emitted results.
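A rough Python sketch of this verification step is given below; the data layout, the Monte Carlo precomputation, and the decision rule against θ are illustrative assumptions intended to mirror the T/F/U classification just described, not the exact procedure of Reference 40.

import random

def precompute_box_probs(lam, sigma=1.0, samples=200_000):
    """Monte Carlo estimate of the probability mass falling into each of the
    lam x lam elementary boxes of the 3-sigma bounding box; computed once and
    then reused as a lookup table."""
    probs = [[0.0] * lam for _ in range(lam)]
    half = 3 * sigma
    box = 2 * half / lam
    for _ in range(samples):
        x, y = random.gauss(0, sigma), random.gauss(0, sigma)
        if -half <= x < half and -half <= y < half:
            i, j = int((x + half) // box), int((y + half) // box)
            probs[i][j] += 1 / samples
    return probs

def probe(center, query_rect, probs, sigma=1.0, theta=0.5):
    """Classify each elementary box as T (inside q), F (disjoint), or U (partial
    overlap), sum the per-box probabilities, and validate against threshold theta."""
    lam = len(probs)
    half = 3 * sigma
    box = 2 * half / lam
    qx1, qy1, qx2, qy2 = query_rect
    pT = pF = pU = 0.0
    for i in range(lam):
        for j in range(lam):
            # Absolute extent of elementary box (i, j) around the object's center.
            x1 = center[0] - half + i * box
            y1 = center[1] - half + j * box
            x2, y2 = x1 + box, y1 + box
            if qx1 <= x1 and x2 <= qx2 and qy1 <= y1 and y2 <= qy2:
                pT += probs[i][j]                       # fully inside: T
            elif x2 <= qx1 or x1 >= qx2 or y2 <= qy1 or y1 >= qy2:
                pF += probs[i][j]                       # disjoint: F
            else:
                pU += probs[i][j]                       # partial overlap: U
    if pT >= theta:
        return "qualifies", pT, pU                      # qualifies regardless of U-boxes
    if pT + pU < theta:
        return "pruned", pT, pU                         # cannot reach theta even in the best case
    return "undecided", pT, pU                          # would need finer evaluation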
As a trade-off between timeliness and answer quality in a geostreaming context, in Reference 40, we turned this range search into an (ε, δ) approximation problem, where ε quantifies the error margin in the allowed overestimation when reporting a qualifying object and δ specifies the tolerance that an invalid answer may be given (i.e., a false positive). We introduced several optimizations based on inherent probabilistic and spatial properties of the uncertain streaming data; details can be found in Reference 40.
In a nutshell, this technique can quickly determine whether an item possibly qualifies for the query or safely prune examination of nonqualifying cases altogether. Inevitably, such a probabilistic treatment returns approximate answers, along with confidence margins as a measure of their quality. Simulations over large synthetic data sets indicated that, compared with an exhaustive Monte Carlo evaluation, about 15% of candidates were eagerly rejected, while another 25% were pruned. Most importantly, false negatives were less than 0.1% in all cases, which demonstrates the efficiency of this approach. Qualitative results are similar for varying λ, but finer subdivisions naturally incur increasing execution costs, yet always at least an order of magnitude more affordable than naïve Monte Carlo evaluation.
Hence, pairs of concurrently recorded locations from each object should not deviate more than ε during interval ω. This notion of similarity is confined within the recent past (window ω) and does not extend over the entire history of movement. Moreover, it can be easily generalized for multiple objects with pairwise similar trajectory segments (Figure 14.8). In addition, a threshold is applied when incrementally creating "delegate" traces from such trajectory groupings; a synthetic trace T is returned only if the respective group currently includes more than n objects. Note that no original location in such a group can deviate more than the given tolerance ε from its delegate path T.
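The following minimal Python sketch (function names and the aligned-timestamp representation are assumptions for illustration, not the actual data structures of Reference 41) shows the two checks just described: pairwise deviation within ε over the window, and emission of a delegate only when the group exceeds n members.

import math

def within_tolerance(trace, delegate, eps):
    """True if every location of `trace` deviates at most eps from the concurrently
    recorded location of the delegate path T. Both arguments are lists of (t, x, y)
    samples aligned on the same timestamps within the current window omega."""
    return all(math.hypot(x - dx, y - dy) <= eps
               for (_, x, y), (_, dx, dy) in zip(trace, delegate))

def emit_delegate(group, delegate, eps, n):
    """Return the synthetic trace T only if more than n objects of the group
    currently stay within eps of it; otherwise suppress the delegate."""
    members = [tr for tr in group if within_tolerance(tr, delegate, eps)]
    return delegate if len(members) > n else None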
[Figure 14.8 (layout not reproduced): a delegate trace T approximating a group of trajectories within tolerance ε over sliding window ω up to the current position Pnow and heading angle β, together with compass-rose alphabets of object headings (N, NNE, NE, ENE, E, ...) at varying granularity.]
recent location and rewinding the symbolic sequence backward, that is, reversely visiting
all samples retained within the sliding window frame.
Simulations against synthetic trajectories have shown that this framework has great
potential for data reduction and timely detection of motion trends, without hurting per-
formance or approximation quality. Some preliminary results regarding traffic trends on
the road network of Athens are available in Reference 41.
Overall, we believe that this ongoing work fuses ideas from trajectory clustering [7] and path simplification [23] but goes a step further. Operating in a geostreaming context, not only can it identify important motion patterns in online fashion, but it may also provide concise summaries without resorting to sophisticated spatiotemporal indexing. Symbolic representation of routes was first proposed in Reference 8 for filtering against trajectory databases. Yet, our encoding differs substantially, as it attempts to capture evolving spatiotemporal vectors using a versatile alphabet of tunable object headings instead of simply compiling time-stamped positions in a discretized space. In practice, processed features may serve various needs, such as the following: (i) data compression, by collectively representing traces of multiple objects with a single "delegate" that suitably approximates their common recent movement; (ii) data discovery, by detecting trends or motion patterns from real-time location feeds; (iii) data visualization, by estimating the significance of each multiplexed group of trajectories and illustrating its mutability across time; and (iv) query processing, if multiplexed traces are utilized at the filtering stage when evaluating diverse queries (range, k-NN, aggregates, etc.) instead of the detailed, bulky trajectories.
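To illustrate how such an alphabet of tunable object headings could work in practice, here is a small Python sketch; the 16-symbol alphabet, function names, and angle convention are my own assumptions rather than the encoding actually used by the authors.

import math

# Tunable alphabet of object headings; 16 symbols here, but the granularity
# is a parameter of the encoding (8, 16, 32, ...).
HEADINGS_16 = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
               "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]

def heading_symbol(p_prev, p_now, alphabet=HEADINGS_16):
    """Map the displacement between two successive locations to a symbol of the
    heading alphabet (azimuth measured clockwise from north)."""
    dx, dy = p_now[0] - p_prev[0], p_now[1] - p_prev[1]
    azimuth = math.degrees(math.atan2(dx, dy)) % 360
    sector = 360 / len(alphabet)
    return alphabet[int((azimuth + sector / 2) // sector) % len(alphabet)]

def encode_route(points):
    """Encode a windowed sequence of positions as a sequence of heading symbols."""
    return [heading_symbol(p, q) for p, q in zip(points, points[1:])]

Rewinding such a symbolic sequence backward over the window then amounts to reversing the list of symbols, which keeps the representation compact while still conveying the recent motion trend.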
often been questioned regarding the quality and reliability of the information collected. Preprocessing and noise removal of VGI seems a promising research direction toward addressing uncertainty in these spatiotemporal data.
Last, but not least, next-generation platforms should offer advanced functionality to users in order to better deal with the peculiarities in trajectory data and better interpret hidden properties. Interactive mapping tools should sustain the bulk of the data, by adapting representation detail to the actual map scale, the amount of distinct objects, the complexity of their traces, and so forth. In application development, APIs for fine-grained control over complex events that could correlate dynamic positional data against stationary data sets (e.g., transportation networks or administrative boundaries) or other trajectories would be valuable as well. Finally, dealing with inherent stream imperfections like disorder or noise would offer the ability for dynamic revision of query results, thus increasing the quality of answers.
REFERENCES
1. Agrawal, D., P. Bernstein, E. Bertino, S. Davidson, U. Dayal, M. Franklin, J. Gehrke, L. Haas,
A. Halevy, J. Han, H.V. Jagadish, A. Labrinidis, S. Madden, Y. Papakonstantinou, J.M. Patel,
R. Ramakrishnan, K. Ross, C. Shahabi, D. Suciu, S. Vaithyanathan, and J. Widom. 2012.
Challenges and Opportunities with Big Data—A Community White Paper Developed by
Leading Researchers across the United States. Available at http://www.cra.org/ccc/files/docs
/init/bigdatawhitepaper.pdf (accessed April 30, 2014).
2. Stonebraker, M., U. Cetintemel, and S. Zdonik. 2005. The 8 Requirements of Real-Time Stream
Processing. ACM SIGMOD Record, 34(4):42–47.
3. Mouratidis, K., M. Hadjieleftheriou, and D. Papadias. 2005. Conceptual Partitioning: An Efficient Method for Continuous Nearest Neighbor Monitoring. In Proceedings of the 24th
ACM SIGMOD International Conference on Management of Data, pp. 634–645.
4. Hu, H., J. Xu, and D.L. Lee. 2005. A Generic Framework for Monitoring Continuous Spatial
Queries over Moving Objects. In Proceedings of the 24th ACM SIGMOD International
Conference on Management of Data, pp. 479–490.
5. Vieira, M., P. Bakalov, and V.J. Tsotras. 2009. On-line Discovery of Flock Patterns in Spatio-
Temporal Data. In Proceedings of the 17th ACM SIGSPATIAL International Conference on
Advances in Geographic Information Systems (ACM GIS), pp. 286–295.
6. Jeung, H., M.L. Yiu, X. Zhou, C.S. Jensen, and H.T. Shen. 2008. Discovery of Convoys in
Trajectory Databases. Proceedings of Very Large Data Bases (PVLDB), 1(1):1068–1080.
7. Lee, J., J. Han, and K. Whang. 2007. Trajectory Clustering: A Partition-and-Group Framework.
In Proceedings of the 26th ACM SIGMOD International Conference on Management of Data,
pp. 593–604.
8. Bakalov, P., M. Hadjieleftheriou, E. Keogh, and V.J. Tsotras. 2005. Efficient Trajectory Joins
Using Symbolic Representations. In MDM, pp. 86–93.
9. Chen, Z., H.T. Shen, and X. Zhou. 2011. Discovering Popular Routes from Trajectories. In
Proceedings of the 27th International Conference on Data Engineering (ICDE), pp. 900–911.
10. Sacharidis, D., K. Patroumpas, M. Terrovitis, V. Kantere, M. Potamias, K. Mouratidis, and T.
Sellis. 2008. Online Discovery of Hot Motion Paths. In Proceedings of the 11th International
Conference on Extending Database Technology (EDBT), pp. 392–403.
11. Manyika, J., M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A.H. Byers. 2011.
Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global
Institute, New York. Available at http://www.mckinsey.com/insights/business_technology/big
_data_the_next_frontier_for_innovation (accessed April 30, 2014).
12. Ferguson, M. 2012. Architecting a Big Data Platform for Analytics. Whitepaper Prepared
for IBM.
13. Güting, R.H., M.H. Böhlen, M. Erwig, C.S. Jensen, N.A. Lorentzos, M. Schneider, and M.
Vazirgiannis. 2000. A Foundation for Representing and Querying Moving Objects. ACM
Transactions on Database Systems, 25(1):1–42.
14. Güting, R.H., T. Behr, and C. Duentgen. 2010. SECONDO: A Platform for Moving Objects
Database Research and for Publishing and Integrating Research Implementations. IEEE Data
Engineering Bulletin, 33(2):56–63.
15. Forlizzi, L., R.H. Güting, E. Nardelli, and M. Schneider. 2000. A Data Model and Data Structures
for Moving Objects Databases. In Proceedings of the 19th ACM SIGMOD International
Conference on Management of Data, pp. 319–330.
16. Pfoser, D., and C.S. Jensen. 1999. Capturing the Uncertainty of Moving-Object Representations.
In Proceedings of the 6th International Symposium on Advances in Spatial Databases (SSD),
pp. 111–132.
17. Sistla, A.P., O. Wolfson, S. Chamberlain, and S. Dao. 1997. Modeling and Querying Moving
Objects. In Proceedings of the 13th International Conference on Data Engineering (ICDE),
pp. 422–432.
18. Pfoser, D., C.S. Jensen, and Y. Theodoridis. 2000. Novel Approaches to the Indexing of Moving
Object Trajectories. In Proceedings of 26th International Conference on Very Large Data Bases
(VLDB), pp. 395–406.
19. Mokbel, M., X. Xiong, and W. Aref. 2004. SINA: Scalable Incremental Processing of Continuous
Queries in Spatiotemporal Databases. In Proceedings of the 23rd ACM SIGMOD International
Conference on Management of Data, pp. 623–634.
20. Frentzos, E., K. Gratsias, N. Pelekis, and Y. Theodoridis. 2007. Algorithms for Nearest Neighbor
Search on Moving Object Trajectories. GeoInformatica, 11(2):159–193.
21. Li, Z., B. Ding, J. Han, and R. Kays. 2010. Swarm: Mining Relaxed Temporal Moving Object Clusters. Proceedings of Very Large Data Bases (PVLDB), 3(1–2):723–734.
22. Patroumpas, K., and T. Sellis. 2011. Maintaining Consistent Results of Continuous Queries
under Diverse Window Speciications. Information Systems, 36(1):42–61.
23. Potamias, M., K. Patroumpas, and T. Sellis. 2006. Sampling Trajectory Streams with
Spatiotemporal Criteria. In Proceedings of the 18th International Conference on Scientific and
Statistical Data (SSDBM), pp. 275–284.
24. Meratnia, N., and R. de By. 2004. Spatiotemporal Compression Techniques for Moving Point
Objects. In Proceedings of the 9th International Conference on Extending Database Technology
(EDBT), pp. 765–782.
25. Nehme, R., and E. Rundensteiner. 2007. ClusterSheddy: Load Shedding Using Moving Clusters
over Spatio-Temporal Data Streams. In Proceedings of the 12th International Conference on
Database Systems for Advanced Applications (DASFAA), pp. 637–651.
26. Gedik, B., L. Liu, K. Wu, and P.S. Yu. 2007. Lira: Lightweight, Region-Aware Load Shedding in
Mobile CQ Systems. In Proceedings of the 23rd International Conference on Data Engineering
(ICDE), pp. 286–295.
27. Mokbel, M.F., and W.G. Aref. 2008. SOLE: Scalable On-line Execution of Continuous Queries
on Spatio-Temporal Data Streams. VLDB Journal, 17(5):971–995.
28. Cao, H., O. Wolfson, and G. Trajcevski. 2006. Spatio-Temporal Data Reduction with
Deterministic Error Bounds. VLDB Journal, 15(3):211–228.
29. Lange, R., F. Dürr, and K. Rothermel. 2011. Eicient Real-Time Trajectory Tracking. VLDB
Journal, 25:671–694.
30. Palpanas, T., M. Vlachos, E. Keogh, D. Gunopulos, and W. Truppel. 2004. Online Amnesic
Approximation of Streaming Time Series. In Proceedings of the 20th International Conference
on Data Engineering (ICDE), pp. 338–349.
31. Potamias, M., K. Patroumpas, and T. Sellis. 2007. Online Amnesic Summarization of Streaming
Locations. In Proceedings of the 10th International Symposium on Spatial and Temporal
Databases (SSTD), pp. 148–165.
32. Bulut, A., and A.K. Singh. 2003. SWAT: Hierarchical Stream Summarization in Large
Networks. In Proceedings of the 19th International Conference on Data Engineering (ICDE),
pp. 303–314.
33. Tao, Y., G. Kollios, J. Considine, F. Li, and D. Papadias. 2004. Spatio-Temporal Aggregation
Using Sketches. In Proceedings of the 20th International Conference on Data Engineering (ICDE),
pp. 214–226.
34. Flajolet, P., and G.N. Martin. 1985. Probabilistic Counting Algorithms for Database
Applications. Journal of Computer and Systems Sciences, 31(2):182–209.
35. Mascetti, S., D. Freni, C. Bettini, X.S. Wang, and S. Jajodia. 2011. Privacy in Geo-Social
Networks: Proximity Notification with Untrusted Service Providers and Curious Buddies.
VLDB Journal, 20(4):541–566.
36. Facebook. Available at http://facebook.com/about/location (accessed April 30, 2014).
37. Google Latitude. Available at http://google.com/latitude (accessed August 1, 2013).
38. Foursquare. Available at https://foursquare.com/ (accessed April 30, 2014).
39. Chow, C.-Y., M.F. Mokbel, and W.G. Aref. 2009. Casper*: Query Processing for Location
Services without Compromising Privacy. ACM Transactions on Database Systems, 34(4):24.
40. Patroumpas, K., M. Papamichalis, and T. Sellis. 2012. Probabilistic Range Monitoring of
Streaming Uncertain Positions in GeoSocial Networks. In Proceedings of the 24th International
Conference on Scientific and Statistical Data (SSDBM), pp. 20–37.
41. Patroumpas, K., K. Toumbas, and T. Sellis. 2012. Multiplexing Trajectories of Moving Objects.
In Proceedings of the 24th International Conference on Scientific and Statistical Data (SSDBM),
pp. 595–600.
42. Ester, M., H.-P. Kriegel, J. Sander, and X. Xu. 1996. A Density-Based Algorithm for Discovering
Clusters in Large Spatial Databases with Noise. In Proceedings of the 2nd International
Conference on Knowledge Discovery and Data Mining (KDD), pp. 226–231.
43. Aji, A., F. Wang, H. Vo, R. Lee, Q. Liu, X. Zhang, and J. Saltz. 2013. Hadoop-GIS: A High
Performance Spatial Data Warehousing System over MapReduce. Proceedings of Very Large
Data Bases (PVLDB), 6(11):1009–1020.
44. Cudré-Mauroux, P., E. Wu, and S. Madden. 2010. TrajStore: An Adaptive Storage System for
Very Large Trajectory Data Sets. In Proceedings of the 26th International Conference on Data
Engineering (ICDE), pp. 109–120.
45. Tang, L., Y. Zheng, J. Yuan, J. Han, A. Leung, C. Hung, and W. Peng. 2012. On Discovery of
Traveling Companions from Streaming Trajectories. In Proceedings of the 29th International
Conference on Data Engineering (ICDE), pp. 186–197.
46. Zheng, K., Y. Zheng, N. Jing Yuan, and S. Shang. 2013. On Discovery of Gathering Patterns
from Trajectories. In Proceedings of the 29th IEEE International Conference on Data Engineering
(ICDE 2013), pp. 242–253.
47. Luo, W., H. Tan, L. Chen, and L.M. Ni. 2013. Finding Time Period-Based Most Frequent Path
in Big Trajectory Data. In Proceedings of the 33rd ACM SIGMOD International Conference on
Management of Data, pp. 713–724.
48. Efentakis, A., S. Brakatsoulas, N. Grivas, G. Lamprianidis, K. Patroumpas, and D. Pfoser. 2013.
Towards a Flexible and Scalable Fleet Management Service. In Proceedings of the 6th ACM
SIGSPATIAL International Workshop on Computational Transportation Science (IWCTS), p. 79.
49. Thiagarajan, A., L. Ravindranath, K. LaCurts, S. Madden, H. Balakrishnan, S. Toledo, and J. Eriksson. 2009. vTrack: Accurate, Energy-Aware Road Traffic Delay Estimation Using Mobile
Phones. In Proceedings of the 7th ACM Conference on Embedded Networked Sensor Systems
(SenSys), pp. 85–98.
50. Yang, B., C. Guo, C.S. Jensen, M. Kaul, and S. Shang. 2014. Multi-Cost Optimal Route Planning
under Time-Varying Uncertainty. In Proceedings of the 30th IEEE International Conference on
Data Engineering (ICDE).
51. Zheng, K., S. Shang, N. Jing Yuan, and Y. Yang. 2013. Towards Efficient Search for Activity
Trajectories. In Proceedings of the 29th International Conference on Data Engineering (ICDE),
pp. 230–241.
52. Shang, S., R. Ding, K. Zheng, C.S. Jensen, P. Kalnis, and X. Zhou. 2014. Personalized Trajectory
Matching in Spatial Networks. VLDB Journal, 23(3): 449–468.
53. Yuan, N.J., F. Zhang, D. Lian, K. Zheng, S. Yu, and X. Xie. 2013. We Know How You Live:
Exploring the Spectrum of Urban Lifestyles. In Proceedings of the 1st ACM Conference on
Online Social Networks (COSN), pp. 3–14.
54. Goodchild, M.F. 2007. Citizens as Voluntary Sensors: Spatial Data Infrastructure in the World
of Web 2.0. International Journal of Spatial Data Infrastructures Research (IJSDIR), 2(1):24–32.
55. OpenStreetMap Project. Available at http://www.openstreetmap.org (accessed April 30, 2014).
56. Twitter. Available at https://twitter.com (accessed April 30, 2014).
57. Flickr. Available at https://www.flickr.com (accessed April 30, 2014).
58. Damiani, M.L., C. Silvestri, and E. Bertino. 2011. Fine-Grained Cloaking of Sensitive Positions
in Location-Sharing Applications. Pervasive Computing, 10(4):64–72.
59. Zheng, Y.-T., Z.-J. Zha, and T.-S. Chua. 2012. Mining Travel Patterns from Geotagged Photos.
ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):56.
IV
Big Data Privacy
CHAPTER 15
Personal Data Protection Aspects of Big Data
Paolo Balboni
CONTENTS
15.1 Introduction 284
15.1.1 Topic and Aim 286
15.1.2 Note to the Reader, Structure, and Arguments 287
15.2 Data Protection Aspects 287
15.2.1 Big Data and Analytics in Four Steps 287
15.2.2 Personal Data 289
15.2.2.1 Profiling Activities on Personal Data 292
15.2.2.2 Pseudonymization 292
15.2.2.3 Anonymous Data 293
15.2.2.4 Reidentification 295
15.2.3 Purpose Limitation 296
15.3 Conclusions and Recommendations 297
References 299
ABSTRACT
New technologies have significantly changed both the way that we live and the ways in which we do business. The advent of the Internet and the vast amounts of data disseminated across global networks and databases has brought with it both significant advantages in terms of scientific understanding and business opportunities, and challenges in terms of competition and privacy concerns. More data are currently available than ever before. Neelie Kroes, vice president of the European Commission responsible for the Digital Agenda, defined our era as "the data gold rush." What is Big Data? In its most general sense, the term Big Data refers to large digital data sets often held by corporations, governments, and other organizations that are subsequently analyzed by way of computer algorithms. In a more practical sense, "Big Data means big money!" The most significant challenge in harnessing the value of Big Data is to rightly balance business aspects with social and ethical implications,
15.1 INTRODUCTION
New technologies have significantly changed both the way that we live and the ways in which we do business. The advent of the Internet and the vast amounts of data disseminated across global networks and databases has brought with it both significant advantages in terms of scientific understanding and business opportunities, and challenges in terms of competition and privacy concerns. More data are currently available than ever before. As Neelie Kroes, vice president of the European Commission responsible for the Digital Agenda, pointed out in her March 19, 2014, speech entitled "The Data Gold Rush," not only do we have more data at our disposal than ever before, but also, we have access to a multitude of ways to manipulate, collect, manage, and make use of it [1]. The key, however, as Kroes wisely pointed out, is to find the value amid the multiplying mass of data: "The right infrastructure, the right networks, the right computing capacity and, last but not least, the right analysis methods and algorithms help us break through the mountains of rock to find the gold within" [1].
What is Big Data? In its most general sense, the term Big Data refers to large digital data sets often held by corporations, governments, and other organizations that are subsequently analyzed by way of computer algorithms [2, p. 35]. While the way that Big Data is defined depends largely on your point of reference, Big Data is comprised of aggregated and anonymous data, the personal data that are generated by the 369 million European Union citizens utilizing online platforms and services including but not limited to apps, games, social media, e-commerce, and search engines [3, p. 9]. The personal data that are collected, for example, in order to subscribe to an online service, usually include name, sex, e-mail address, location, IP address, Internet search history, and personal preferences, which are used to help companies provide more personalized and enhanced services and even to reach out to potential customers [3, p. 9].
The worldwide digital economy is "marked by strong, dynamic growth, a high turnover of new services" [3, p. 9], and is characterized by the presence of few dominant industry members. The services that these few large players offer often appear to be "free" but, in all reality, are paid for by the granting of personal data, estimated to be worth 300 billion euro [3, p. 8], a value that is only expected to grow in the coming years. Big Data means big money. Estimates suggest that 4 zettabytes of data were generated across the globe in 2013 [4]. That is a staggering amount of information, which continues to grow by the minute,
presenting possible innovations and new ways to generate value from the same. The power of Big Data and the influence that it has on the global marketplace is therefore immense as it becomes increasingly intertwined with our daily lives as we make use of such new technological devices and services.
The ways in which such data are used by both business and government are in a process of constant evolution. In the past, data were collected in order to provide a specific service. In the age of Big Data, however, the genuine value of such data can be understood in terms of their potential (re)uses, and in fact, the estimated digital value that EU customers place on their data is forecast to rise to 1 trillion euro by 2020 [3, p. 9]. Big Data offers endless potential in terms of bettering our lives, from allowing for an increased provision and efficiency of services, to monitoring climate change, health trends, and disease epidemics, to allowing for increased monitoring of government fraud, abuse, and waste [5, p. 37]. In this way, we can understand Big Data not only as a technological tool but also as a public good, an expression of individual identity, or even property, a new form of value [5, pp. 37–49]. For the purposes of this chapter, however, we will focus on the value generated by personal data and the vital balance of adequate privacy and security safeguards with business opportunity.
In the European Data Protection Supervisor's (EDPS's) preliminary opinion of March 2014, the authority stressed the challenges that Big Data represents in the balancing act between competition, data protection, and consumer protection [3, p. 2]. Indeed, the most significant challenge in harnessing the value of Big Data must also take social and ethical aspects into consideration, especially when consumers, or data subjects, possess a limited understanding of how their personal data are being collected and what becomes of the same. Privacy and security measures, however, need not focus on limitations and protections alone, in the words of Neelie Kroes:
Data generates value, and unlocks the door to new opportunities: you don't need to 'protect' people from their own assets. What you need is to empower people, give them control, give them a fair share of that value. Give them rights over their data—and responsibilities too, and the digital tools to exercise them. And ensure that the networks and systems they use are affordable, flexible, resilient, trustworthy, secure. [1]*
Under President Obama, the United States government has committed itself to promoting the free flow of data and the growth of the digital economy, calling on stakeholders of the private and public sectors in order to "harness the power of data in ways that boost productivity, improve lives, and serve communities" [5, p. 9]. The analysis of transactional and operational data has allowed manufacturers better management of warranties and equipment, logistics optimization, behavioral advertising, customized pricing, and even better fraud detection for banks [5, pp. 37–49]. In this sense, Big Data provides the possibility for improvement, minimization of waste, new customers, and therefore, higher monetary returns.
* See also The World Economic Forum (2014) Rethinking Personal Data: Trust and Context in User-Centred Data Ecosystems, p. 3. Available at: http://www3.weforum.org/docs/WEF_RethinkingPersonalData_TrustandContext_Report_2014.pdf. "To thrive, the growing number of economies that depend on the potential of "Big Data" must earn the trust of individuals, and be centred on empowering those individuals by respecting their needs."
The European Union has also affirmed the increase in productivity, innovation, and services as a result of cloud computing and Big Data [1]. Further economic growth as a result of new technologies, however, must be fostered by embracing new technologies, bettering infrastructure, and providing adequate safeguards for citizens. In this respect, the World Economic Forum rightly stressed the fundamental task of regulation, which "must keep pace with new technology and protect consumers without stifling innovation or deterring uptake" [6, p. 3].
Preliminary Opinion of the European Data Protection Supervisor, Privacy and Competitiveness in the Age of Big Data: The Interplay between Data Protection, Competition Law and Consumer Protection in the Digital Economy (henceforth EDPS Preliminary Opinion on Big Data). [3]
Moreover, the Article 29 Data Protection Working Party issued a relevant opinion on April 10, 2014, on anonymization techniques [7], which, together with Opinion 3/2013 on purpose limitation [2] and Opinion 4/2007 on the concept of personal data [8], constitutes the fundamental set of tools for interpreting how the current legal privacy framework applies to Big Data.
In May 2014, the White House published two very detailed reports:
* "The notice and consent is defeated by exactly the positive benefits that Big Data enables: new, non-obvious, unexpectedly powerful uses of data." The White House (2014) Report to the President Big Data and Privacy: A Technological Perspective, p. 38. Available at: http://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdf. See also "What really matters about Big Data is what it does. Aside from how we define Big Data as a technological phenomenon, the wide variety of potential uses for Big Data analytics raises crucial questions about whether our legal, ethical, and social norms are sufficient to protect privacy and other values in a Big Data world." The White House (2014) Big Data: Seizing Opportunities, Preserving Values, p. 3. Available at: http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf.
In this chapter, I will take into consideration both the US and the EU perspectives on
Big Data; however, being a European lawyer, I will focus my analysis on applicable EU data
protection provisions and their impact on both businesses and consumers/data subjects.
More precisely, I will utilize a methodology to determine whether (1) data protection law applies and (2) personal Big Data can be (further) processed (e.g., by way of analytic software programs).
• Step 1: Collection/access
• Step 2: Storage and aggregation
• Step 3: Analysis and distribution
• Step 4: Usage [9]
As the Article 29 Data Protection Working Party rightly stressed, the value of Big Data
rests in the increasing ability of technology to meaningfully and thoroughly analyze them*
in order to make better and more informed decisions:
‘Big Data’ refers to the exponential growth in availability and automated use of
information: it refers to gigantic digital datasets held by corporations, governments
and other large organisations, which are then extensively analysed using computer
algorithms. Big Data relies on the increasing ability of technology to support the
collection and storage of large amounts of data, but also to analyse, understand and
take advantage of the full value of data (in particular using analytics applications).
The expectation from Big Data is that it may ultimately lead to better and more
informed decisions. [2, p. 45]
Therefore, going back to the OECD scheme, the crucial steps for Big Data are Step 3, analysis, and Step 4, usage, as shown in Figure 15.1.
If "[c]omputational capabilities now make 'finding a needle in a haystack' not only possible, but practical" [5, p. 6], for businesses, two fundamental privacy-related questions that data protection officers of companies should ask themselves are (1) whether the "haystack of data" has been lawfully built and (2) whether analytic software can be lawfully run on that big set of data to find the needle.
* This has been stressed also in the recent US reports on Big Data: "Big Data is big in two different senses. It is big in the quantity and variety of data that are available to be processed. And, it is big in the scale of analysis (termed "analytics") that can be applied to those data, ultimately to make inferences and draw conclusions. By data mining and other kinds of analytics, non-obvious and sometimes private information can be derived from data that, at the time of their collection, seemed to raise no, or only manageable, privacy issues." The White House (2014) Report to the President Big Data and Privacy: A Technological Perspective, p. ix. Available at: http://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdf; and "Computational capabilities now make "finding a needle in a haystack" not only possible, but practical. In the past, searching large datasets required both rationally organized data and a specific research question, relying on choosing the right query to return the correct result. Big Data analytics enable data scientists to amass lots of data, including unstructured data, and find anomalies or patterns." The White House (2014) Big Data: Seizing Opportunities, Preserving Values, p. 6. Available at: http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf.
The first question can be rephrased as follows: Have the personal data been lawfully collected?
The second question can be rephrased as follows: Is (further) processing of personal data by way of analytics compatible with the purposes for which the data were collected (so-called compatible use)?
Given the limited space in this chapter, I will focus only on the second question—which, in my opinion, represents the core question that the entire Big Data phenomenon poses—assuming that the personal data have been lawfully collected.
However, before answering the "compatible use" question, it is first necessary to understand if the processing operations concern personal data at all. We therefore need to focus on the concept of personal data.
* The same position was taken by The White House (2014) Big Data: Seizing Opportunities, Preserving Values, p. 6 and p. xi. Available at: http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf. "As techniques like data fusion make Big Data analytics more powerful, the challenges to current expectations of privacy grow more serious. When data is initially linked to an individual or device, some privacy-protective technology seeks to remove this linkage, or 'de-identify' personally identifiable information—but equally effective techniques exist to pull the pieces back together through 're-identification.'" Moreover, it is stated in the study that "[a]nonymization is increasingly easily defeated by the very techniques that are being developed for many legitimate applications of Big Data. In general, as the size and diversity of available data grows, the likelihood of being able to re-identify individuals (that is, re-associate their records with their names) grows substantially. While anonymization may remain somewhat useful as an added safeguard in some situations, approaches that deem it, by itself, a sufficient safeguard need updating."
† Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Official Journal L 281, 23/11/1995, pp. 0031–0050.
In 2007, the Article 29 Working Party issued an opinion on the concept of personal data [8], elaborating four specific and fundamental elements of the definition:
1. Any information
2. Relating to
3. Identified or identifiable
4. Natural person
1. Any information
The Article 29 Working Party noted that this element underscores the rather broad approach illustrated in Directive 95/46/EC. All information relevant to a person is included, regardless of the "position or capacity of those persons (as consumer, patient, employee, customer, etc.)" [8, p. 7]. In this case, the information can be objective or subjective and does not necessarily have to be true or proven.
The words "any information" also imply information of any form, audio, text, video, images, and so forth. Importantly, the manner in which the information is stored is irrelevant. The Article 29 Working Party expressly mentions biometric data as a special case [8, p. 8], as such data can be considered as information content as well as a link between the individual and the information. Because biometric data are unique to an individual, they can also be used as identifiers.
2. Relating to
Information related to an individual is information about that individual. The relationship between data and an individual is often self-evident, an example of which is when the data are stored in an individual employee file or in a medical record. This is, however, not always the case, especially when the information relates to objects. Such objects belong to individuals, but additional meanings or information is required to create the link to the individual [8, p. 9].
At least one of the following three elements should be present in order to consider information to be related to an individual: content, purpose, or result. An element of content is present when the information is in reference to an individual, regardless of the (intended) use of the information. The purpose element instead refers to whether the information is used or is likely to be used "with the purpose to evaluate, treat in a certain way or influence the status or behaviour of an individual" [8, p. 10]. A result element is present when the use of the data is likely to have an impact on a certain person's rights and interests [8, p. 11]. These elements are alternatives and are not cumulative, implying that one piece of data can relate to different individuals based on diverse elements.
3. Identified or identifiable
"A natural person can be 'identified' when, within a group of persons, he or she is 'distinguished' from all other members of the group" [8, p. 12]. When identification has not occurred but is possible, the individual is considered to be "identifiable."
In order to determine whether those with access to the data are able to identify the individual, all reasonable means likely to be used either by the controller or by any other person should be taken into consideration. The cost of identification, the intended purpose, the way the processing is structured, the advantage expected by the data controller, the interest at stake for the data subjects, and the risk of organizational dysfunctions and technical failures should be taken into account in the evaluation [8, p. 15].
4. Natural person
Directive 95/46/EC is applicable to the personal data of natural persons, a broad concept that calls for protection wholly independent from the residence or nationality of the data subject.
The concept of personality is understood as "the capacity to be the subject of legal relations, starting with the birth of the individual and ending with his death" [8, p. 22]. Personal data thus relate to identified or identifiable living individuals. Data concerning deceased persons or unborn children may, however, indirectly be subject to protection in particular cases. When the data relate to other living persons, or when a data controller makes no differentiation in his/her documentation between living and deceased persons, it may not be possible to ascertain whether the person the data relate to is living or deceased; additionally, some national laws consider deceased or unborn persons to be protected under the scope of Directive 95/46/EC [8, pp. 22–23].
Legal persons are excluded from the protection provided under Directive 95/46/EC. In some cases, however, data concerning a legal person may relate to an individual, such as when a business holds the name of a natural person. Some provisions of Directive 2002/58/EC* (amended by Directive 2009/136/EC)† extend the scope of Directive 95/46/EC to legal persons.‡
* Directive 2002/58/EC of the European Parliament and of the Council of 12 July 2002 concerning the processing of personal data and the protection of privacy in the electronic communications sector (Directive on privacy and electronic communications) [2002]. Official Journal L 31/07/2002, pp. 0037–0047.
† Directive 2009/136/EC of the European Parliament and of the Council of 25 November 2009 amending Directive 2002/22/EC on universal service and users' rights relating to electronic communications networks and services, Directive 2002/58/EC concerning the processing of personal data and the protection of privacy in the electronic communications sector and Regulation (EC) No 2006/2004 on cooperation between national authorities responsible for the enforcement of consumer protection laws Text with EEA relevance [2006]. Official Journal L 337, 18/12/2009, pp. 0011–0036.
‡ In the EDPS Preliminary Opinion on Big Data, it is also noted that: "[c]ertain national jurisdictions (Austria, Denmark, Italy and Luxembourg) extend some protection to legal persons." European Data Protection Supervisor (2014) Preliminary Opinion of the European Data Protection Supervisor Privacy and Competitiveness in the Age of Big Data: The Interplay between Data Protection, Competition Law and Consumer Protection in the Digital Economy, p. 13, footnote 31. Available at: https://secure.edps.europa.eu/EDPSWEB/webdav/site/mySite/shared/Documents/Consultation/Opinions/2014/14-03-26_competitition_law_big_data_EN.pdf.
15.2.2.2 Pseudonymization
Personal data are data that directly or indirectly identify an individual. Identification should be understood broadly; reference by way of a unique number is such an example. It is not necessary to know the name of a person; recognition is sufficient. In other words, it is possible to single out an individual in a group based on the data. The recent Opinion 05/2014 on anonymization techniques of the Article 29 Data Protection Working Party specifies that "pseudonymisation is not a method of anonymisation. It merely reduces the linkability of a dataset with the original identity of a data subject,
* Recommendation CM/Rec (2010) 13 of the Committee of Ministers to member states on the protection of individuals with regard to automatic processing of personal data in the context of profiling (Adopted by the Committee of Ministers on 23 November 2010 at the 1099th meeting of the Ministers' Deputies).
† Proposal for a Regulation of the European Parliament and of the Council on the protection of individuals with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation) [2012] OJ COM (2012) 11 final. Available at: http://ec.europa.eu/justice/data-protection/document/review2012/com_2012_11_en.pdf.
and is accordingly a useful security measure" [7, p. 20]. Pseudonyms indirectly identify the individual and are therefore considered personal data.* The compromise version of the GDPR approved by the EU Parliament on March 12, 2014, Recital 58(a), states that "[p]rofiling based solely on the processing of pseudonymous data should be presumed not to significantly affect the interests, rights or freedoms of the data subject. Where profiling, whether based on a single source of pseudonymous data or on the aggregation of pseudonymous data from different sources, permits the controller to attribute pseudonymous data to a specific data subject, the processed data should no longer be considered to be pseudonymous."
* In this respect, it is interesting to notice that in the 17.12.12 Draft report on the proposal for a regulation of the European Parliament and of the Council on the protection of individuals with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation) (COM(2012)0011–C7-0025/2012–2012/0011(COD)) Committee on Civil Liberties, Justice and Home Affairs, Rapporteur: Jan Philipp Albrecht ('Albrecht's 17.12.12 Draft report on GDPR'), "the rapporteur encourages the pseudonymous and anonymous use of services. For the use of pseudonymous data, there could be alleviations with regard to obligations for the data controller (Articles 4(2)(a), 10), Recital 23)," p. 211. Available at: http://www.europarl.europa.eu/RegData/commissions/libe/projet_rapport/2012/501927/LIBE_PR(2012)501927_EN.doc. Moreover, on pseudonymization, see the recent European Privacy Association paper by Rosario Imperiali: Pseudonymity and legitimate interest: two solutions to reduce data protection impact? Available at: http://www.academia.edu/2308794/Part_1_Pseudonymity_and_legitimate_interest_two_solutions_to_reduce_data_protection_impact.
† This definition of anonymous data is used also in Section 4.1.n of the Italian Personal Data Protection Code Legislative Decree No. 196 of June 30, 2003: "'anonymous data' shall mean any data that either in origin or on account of its having been processed cannot be associated with any identified or identifiable data subject." See also Albrecht's 17.12.12 Draft report on GDPR, "This Regulation should not apply to anonymous data, meaning any data that can not be related, directly or indirectly, alone or in combination with associated data, to a natural person or where establishing such a relation would require a disproportionate amount of time, expense, and effort, taking into account the state of the art in technology at the time of the processing and the possibilities for development during the period for which the data will be processed," p. 15.
‡ Council Recommendation (EC) R97/5 of February 13, 1997, on the protection of medical data, Article 29 Working Party, Opinion 04/2007 on the concept of personal data, 01248/07/EN, WP 136. See also Albrecht's 17.12.12 Draft report on GDPR, p. 15.
Data are considered unintelligible when personal data are securely encrypted with a
standardized secure encryption algorithm, or when hashed with a standardized crypto-
graphic keyed hash function and the key used to encrypt or hash the data has not been
compromised in any security breach or was generated in a way that it cannot be guessed by
exhaustive key searches using technological means [11].
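By way of illustration only, the short Python sketch below shows the kind of keyed cryptographic hashing this paragraph alludes to; HMAC-SHA256, the field names, and the key-handling choices are my own assumptions rather than requirements drawn from the cited guidance, and, per the Working Party opinion quoted earlier, the result remains pseudonymized (not anonymized) data as long as the key holder can repeat the mapping.

import hashlib
import hmac
import secrets

def generate_key() -> bytes:
    """A randomly generated secret key that cannot be guessed by exhaustive search."""
    return secrets.token_bytes(32)

def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace a direct identifier (e.g., an e-mail address) with a keyed hash.
    The mapping stays repeatable for whoever holds the key, so the output is
    pseudonymous data and remains subject to data protection law."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Example: the same identifier always maps to the same pseudonym under one key,
# while recomputing or reversing the mapping without the key is infeasible.
key = generate_key()
record = {"customer": pseudonymize("[email protected]", key), "purchases": 3}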
Data protection experts and authorities agree that when a data controller anonymizes personal data, even though it holds the "key" that would allow for reidentification, the publication of such data does not amount to a disclosure of personal data [12, p. 13]. In this case, then, personal data protection laws do not apply to the disclosed information.
Anonymity of personal data provides important benefits for the protection of personal data and can be seen as a way to advance one's right to privacy [10, p. 527]. Anonymized data transfers ensure a high level of privacy and are particularly beneficial in the processing of sensitive personal data [13]. Additionally, anonymization partially enforces the principle of data minimization [12, p. 12]. Anonymization techniques eliminate personal data or identifying data from the processing if the purpose sought in the individual case can be achieved by using anonymous data.
However, as the Article 29 Data Protection Working Party rightly stated in Opinion 05/2014 on anonymization techniques, "data controllers should consider that an anonymised dataset can still present residual risks to data subjects. Indeed, on the one hand, anonymisation and reidentification are active fields of research and new discoveries are regularly published, and on the other hand even anonymised data, like statistics, may be used to enrich existing profiles of individuals, thus creating new data protection issues. Thus, anonymisation should not be regarded as a one-off exercise and the attending risks should be reassessed regularly by data controllers" [7, p. 9].*
The European data protection legislation does not expressly provide a definition of "anonymous data." The European data protection reform, the "Proposal GDPR" compromise version approved by the European Parliament on March 12, 2014, does not provide a definition of anonymous data either, but reference to anonymous data is made in Recital 23: "The principles of data protection should apply to any information concerning an identified or identifiable natural person. To determine whether a person is identifiable, account should be taken of all the means reasonably likely to be used either by the controller or by any other person to identify or single out the individual directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the individual, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration both available technology at the time of the processing and technological development. The principles of data protection should therefore
* This view has also been confirmed by The White House (2014) Report to the President Big Data and Privacy: A Technological Perspective, p. xi. Available at: http://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdf: "Anonymization is increasingly easily defeated by the very techniques that are being developed for many legitimate applications of Big Data. In general, as the size and diversity of available data grows, the likelihood of being able to re-identify individuals (that is, re-associate their records with their names) grows substantially. While anonymization may remain somewhat useful as an added safeguard in some situations, approaches that deem it, by itself, a sufficient safeguard need updating."
not apply to anonymous data, which is information that does not relate to an identified or identifiable natural person. This Regulation does therefore not concern the processing of such anonymous data, including for statistical and research purposes."
Considering the legislative framework in force, the notion of anonymous data can be derived from some recitals of Directive 95/46/EC and Directive 2002/58/EC (amended by Directive 2009/136/EC). Recital 26 of the Preamble of Directive 95/46/EC refers to the concept of anonymous data: "[t]he principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable." Codes of conduct within the meaning of Article 27 of Directive 95/46/EC may be useful for understanding the ways in which data can be rendered anonymous and retained in a form in which identification of the data subject is not possible.
Further reference to anonymous data is included in Recitals 9, 26, 28, and 33 of Directive 2002/58/EC. Specifically, Recital 9 states that "[t]he Member States (…) should cooperate in introducing and developing the relevant technologies where this is necessary to apply the guarantees provided for by this Directive and taking particular account of the objectives of minimizing the processing of personal data and of using anonymous or pseudonymous data where possible."
15.2.2.4 Reidentification
In order to determine that personal data are appropriately anonymized, it is important to assess whether reidentification of data subjects is possible and whether anyone can perform such an operation.* While data protection legislation does not currently provide guidance in these terms, data protection authorities apply several tests in practice to determine whether the data are appropriately anonymized. For this purpose, the UK Information Commissioner (ICO) has developed a "motivated intruder" test as part of a risk assessment process [12, pp. 22–24]. This test, according to the ICO, should be applied primarily in the context of anonymized information disclosure to the public or a third party and is considered a significant official benchmark.
The test itself is based on consideration of whether a person, without any prior knowledge, but with a desire to achieve reidentification of the individuals the information relates to (a motivated intruder), would be successful. In the test, the motivated intruder represents the general public and is thus a reasonably competent individual who has access to publicly available sources, including the Internet or libraries, but has no special knowledge, such as computer hacking skills or equipment [12, p. 23].
Section 15.2.2 of this chapter examined the concept of personal data and its relevance in terms of data processing. I have highlighted the importance of fully understanding whether or not the principles of data protection are applicable, an aspect that is paramount in the discussion and examination of personal data protection in the field of Big Data. A deep understanding of anonymization, pseudonymization, and reidentification has been
* See generally on reidentification Article 29 Data Protection Working Party Opinion 05/2014 on anonymisation techniques. Adopted on April 10, 2014. Available at: http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf.
provided insofar as they are strictly related to the discussion. It is of utmost importance that the implications of different types of data are adequately understood as they have highly relevant consequences on the legality of plausible processing activities. Section 15.2.3 will explore purpose limitation, compatibility assessments, and the concept of "functional separation," which must, however, consider that the risk of reidentification is increasingly present due to the development of new technologies.
• Personal data must be collected for specified, explicit, and legitimate purposes (so-called purpose specification) [2, p. 11].
• Personal data must not be further processed in a way incompatible with those purposes (so-called compatible use) [2, p. 12].
• The relationship between the purposes for which the personal data have been collected and the purposes of further processing [2, p. 23]
• The context in which the personal data have been collected and the reasonable expectations of the data subjects as to their further use [2, p. 24]
• The nature of the personal data and the impact of the processing on the data subjects [2, p. 25]
• The safeguards adopted by the controller to ensure fair processing and to prevent any undue impact on the data subjects [2, p. 26]
The purpose limitation principle can only be restricted subject to the conditions set forth in Article 13 of Directive 95/46/EC (i.e., national security, defense, public security, or protection of the data subject or of the rights and freedoms of others).
In this opinion, the Article 29 Data Protection Working Party deals with Big Data [2, p. 45ss]. More precisely, the Article 29 Data Protection Working Party specifies that, in order to lawfully process Big Data, in addition to the four key factors of the compatibility assessment to be fulfilled, additional safeguards must be assessed to ensure fair processing and to prevent any undue impact. The Article 29 Data Protection Working Party considers two scenarios to identify such additional safeguards:
1. In the first one, the organizations processing the data want to detect trends and correlations in the information.
2. In the second one, the organizations are interested in individuals (…) [as they specifically want] to analyse or predict personal preferences, behaviour and attitudes of individual customers, which will subsequently inform 'measures or decisions' that are taken with regard to those customers [2, p. 46].
In the first scenario, so-called functional separation plays a major role in deciding whether further use of data may be considered compatible. Examples of functional separation are "full or partial anonymisation, pseudonymisation, or aggregation of the data, privacy enhancing technologies, as well as other measures to ensure that the data cannot be used to take decisions or other actions with respect to individuals" [2, p. 27; see also Section 2.3].
In the second scenario, prior consent of customers/data subjects (i.e., free, specific, informed, and unambiguous opt-in) would be required for further use to be considered compatible. In this respect, the Article 29 Data Protection Working Party specifies that "such consent should be required, for example, for tracking and profiling for purposes of direct marketing, behavioural advertisement, data-brokering, location-based advertising or tracking-based digital market research" [2, p. 46]. Furthermore, as a prerequisite for consent to be informed and for ensuring transparency, data subjects must have access (1) to their profiles, (2) to the algorithm that develops the profiles, and (3) to the source of data that led to the creation of the profiles [2, p. 47]. Furthermore, data subjects should be effectively granted the right to correct or update their profiles. Last but not least, the Article 29 Data Protection Working Party recommends allowing "data portability": "safeguards such as allowing data subjects/customers to have access to their data in a portable, user-friendly and machine readable format [as a way] to enable businesses and data-subjects/consumers to maximise the benefit of Big Data in a more balanced and transparent way" [2].*
his section dealt with the implications of the Article 29 Data Protection Working Party
opinion regarding the principle of purpose limitation as outlined in Directive 95/46/EC
and explored the elements that characterize the principle of limitation, including purpose
speciication and compatibility use. Speciically, the compatibility assessment, to be car-
ried out on a case-by-case basis, and the necessary additional safeguards for fair processing
were highlighted. he concept of functional separation was outlined in terms of its impor-
tance in the lawfulness of further processing activities.
* “For example, access to information about energy consumption in a user-friendly format could make it easier for house-
holds to switch tarifs and get the best rates on gas and electricity, as well as enabling them to monitor their energy
consumption and modify their lifestyles to reduce their bills as well as their environmental impact.”
by the increasing attention that the United States and the European Union have placed on both the regulation and further exploration of the phenomenon.
Section 15.2.1 looked at the OECD’s four-step life cycle of personal data along the value chain, giving special importance to Steps 3 and 4, analysis and usage, respectively. The section also dealt with the question of compatible use, or whether further personal data processing by way of analytics is compatible with the purposes for which the data were collected. The importance of identifying whether or not a company is processing personal Big Data is highlighted in Section 15.2.2 and is paramount to this discussion. The same section therefore focused on the definition of personal data and the importance of understanding what type of data is being processed, a determining factor in the applicability of data protection law. Section 15.2.2 further explored the definitions of pseudonymization, anonymization, and reidentification and the relative consequences of further activities. Finally, Section 15.2.3 provided insights with respect to the fundamental elements of the principle of purpose limitation and the vitality of the compatibility assessment, bringing other notions such as functional separation into the argument. Indeed, anonymization techniques and the aggregation of data are to be considered as key elements in determining compatibility. The inclusion of the Article 29 Data Protection Working Party Opinion 3/2013 therefore underlined the criticality of purpose limitation for companies to understand the conditions for further use for processing to be lawful.
Proliferation of data and increasing computational resources represent a great business opportunity for companies. This chapter has provided a strategic look at the privacy and data protection implications of Big Data processing. The importance of adequate and rigorous data protection compliance management throughout the entire data life cycle is key and was discussed at length. In fact, companies are usually advised on how to store, protect, and analyze a large amount of data and turn them into valuable information to improve their businesses. However, data—unless anonymized*—can only be processed for the purposes they were collected for, or those that are compatible with them.† It is therefore vital to have a strategic and accurate approach to data protection compliance in order to collect personal data in a way that enables further lawful processing activities. The
* As the Article 29 Data Protection Working Party rightly stated in Opinion 05/2014 on anonymisation techniques: “data controllers should consider that an anonymised dataset can still present residual risks to data subjects. Indeed, on the one hand, anonymisation and re-identification are active fields of research and new discoveries are regularly published, and on the other hand even anonymised data, like statistics, may be used to enrich existing profiles of individuals, thus creating new data protection issues. Thus, anonymisation should not be regarded as a one-off exercise and the attending risks should be reassessed regularly by data controllers.” Article 29 Data Protection Working Party Opinion 05/2014 on anonymisation techniques. Adopted on April 10, 2014, p. 9. Available at: http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf. This view has also been confirmed by The White House (2014) Report to the President Big Data and Privacy: A Technological Perspective, p. xi. Available at: http://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/pcast_big_data_and_privacy_-_may_2014.pdf: “Anonymization is increasingly easily defeated by the very techniques that are being developed for many legitimate applications of Big Data. In general, as the size and diversity of available data grows, the likelihood of being able to re-identify individuals (that is, re-associate their records with their names) grows substantially. While anonymization may remain somewhat useful as an added safeguard in some situations, approaches that deem it, by itself, a sufficient safeguard need updating.”
† See extensively Section 2.3.
difference for a company between dying buried under personal data and harnessing their value is directly related to privacy compliance management.
REFERENCES
1. Kroes, N. (2014) “The data gold rush.” European Commission. Athens, March 19. Speech. Available at: http://europa.eu/rapid/press-release_SPEECH-14-229_en.htm.
2. Article 29 Data Protection Working Party (2013) Opinion 03/2013 on purpose limitation. Adopted on April 2, 2013. Available at: http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2013/wp203_en.pdf.
3. European Data Protection Supervisor (2014) Preliminary Opinion of the European Data Protection Supervisor Privacy and Competitiveness in the Age of Big Data: The Interplay between Data Protection, Competition Law and Consumer Protection in the Digital Economy. Available at: https://secure.edps.europa.eu/EDPSWEB/webdav/site/mySite/shared/Documents/Consultation/Opinions/2014/14-03-26_competitition_law_big_data_EN.pdf.
4. Meeker, M. and Yu, L. (2013) Internet Trends. Kleiner Perkins Caufield Byers. Available at: http://www.slideshare.net/kleinerperkins/kpcb-internet-trends-2013.
5. The White House (2014) Big Data: Seizing Opportunities, Preserving Values. Available at: http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf.
6. The World Economic Forum (2012) Big Data, Big Impact: New Possibilities for International Development. Available at: http://www3.weforum.org/docs/WEF_TC_MFS_BigDataBigImpact_Briefing_2012.pdf.
7. Article 29 Data Protection Working Party (2014) Opinion 05/2014 on anonymisation techniques. Adopted on April 10, 2014. Available at: http://ec.europa.eu/justice/data-protection/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf.
8. Article 29 Working Party (2007) Opinion 4/2007 on the concept of personal data. Adopted on June 20. Available at: http://ec.europa.eu/justice/policies/privacy/docs/wpdocs/2007/wp136_en.pdf.
9. Organisation for Economic Co-operation and Development (2013) OECD Digital Economy Papers No. 220 Exploring the Economics of Personal Data: A Survey of Methodologies for Measuring Monetary Value, pp. 4–39.
10. Kerr, I., Lucock, C. and Steeves, V. (Eds.) (2009) Lessons from the Identity Trail: Anonymity, Privacy, and Identity in a Networked Society. Oxford: Oxford University Press.
11. ENISA (2012) Recommendations on Technical Implementation Guidelines of Article 4, April, p. 17. Available at: http://www.enisa.europa.eu.
12. ICO (2012) Anonymisation: Managing Data Protection Risk Code of Practice, November 20. Available at: http://www.ico.gov.uk/news/latest_news/2012/~/media/documents/library/Data_Protection/Practical_application/anonymisation_code.ashx.
13. Nicoll, C., Prins, J. E. J. and van Dellen, M. J. M. (Eds.) (2003) Digital Anonymity and the Law: Tensions and Dimensions, Information Technology and Law Series. The Hague: T.M.C. Asser Press, p. 149.
CHAPTER 16
Privacy-Preserving Big Data Management
The Case of OLAP
Alfredo Cuzzocrea
CONTENTS
16.1 Introduction 302
16.1.1 Problem Definition 303
16.1.2 Chapter Organization 304
16.2 Literature Overview and Survey 305
16.2.1 Privacy-Preserving OLAP in Centralized Environments 305
16.2.2 Privacy-Preserving OLAP in Distributed Environments 306
16.3 Fundamental Definitions and Formal Tools 307
16.4 Dealing with Overlapping Query Workloads 309
16.5 Metrics for Modeling and Measuring Accuracy 309
16.6 Metrics for Modeling and Measuring Privacy 310
16.7 Accuracy and Privacy Thresholds 312
16.8 Accuracy Grids and Multiresolution Accuracy Grids: Conceptual Tools for
Handling Accuracy and Privacy 313
16.9 An Effective and Efficient Algorithm for Computing Synopsis Data Cubes 315
16.9.1 Allocation Phase 315
16.9.2 Sampling Phase 317
16.9.3 Refinement Phase 318
16.9.4 The computeSynDataCube Algorithm 319
16.10 Experimental Assessment and Analysis 320
16.11 Conclusions and Future Work 323
References 323
ABSTRACT
This chapter explores the emerging context of privacy-preserving OLAP over Big Data, a novel topic that is playing a critical role in actual Big Data research, and proposes an innovative framework for supporting intelligent techniques for computing privacy-preserving OLAP aggregations on data cubes. The proposed framework originates from the evidence that state-of-the-art privacy-preserving OLAP approaches lack strong theoretical bases that provide solid foundations to them. In other words, there is no theory underlying such approaches, but rather an algorithmic vision of the problem. A class of methods that clearly confirms this trend is represented by the so-called perturbation-based techniques, which propose to alter the target data cube cell by cell to gain privacy-preserving query processing. This approach exposes us to clear limits, of which the lack of extensibility and scalability is only the tip of an enormous iceberg. With the aim of overcoming this critical drawback, this chapter describes and experimentally assesses a theoretically sound accuracy/privacy-constrained framework for computing privacy-preserving data cubes in OLAP environments. The benefits derived from our proposed framework are twofold. First, we provide and meaningfully exploit solid theoretical foundations to the privacy-preserving OLAP problem that pursue the idea of obtaining privacy-preserving data cubes via balancing accuracy and privacy of cubes by means of flexible sampling methods. Second, we ensure the efficiency and the scalability of the proposed approach, as confirmed by our experimental results, thanks to the idea of leaving the algorithmic vision of the privacy-preserving OLAP problem.
16.1 INTRODUCTION
One of the most challenging topics in Big Data research is represented, without doubt, by the issue of ensuring the security and privacy of Big Data repositories (e.g., References 1 and 2). To become convinced of this, consider the case of cloud systems [3,4], which are very popular now. Here, cloud nodes are likely to exchange data very often. Therefore, the privacy breach risk arises, as distributed data repositories can be accessed from one node to another, and hence, sensitive information can be inferred.
Another relevant data management context for Big Data research is represented by the issue of effectively and efficiently supporting data warehousing and online analytical processing (OLAP) over Big Data [5–7], as multidimensional data analysis paradigms are likely to become an “enabling technology” for analytics over Big Data [8,9], a collection of models, algorithms, and techniques oriented to extracting useful knowledge from cloud-based Big Data repositories for decision-making and analysis purposes.
At the convergence of the three axes introduced here (i.e., security and privacy of Big Data, data warehousing and OLAP over Big Data, analytics over Big Data), a critical research challenge is represented by the issue of effectively and efficiently computing privacy-preserving OLAP data cubes over Big Data [10,11]. It is easy to foresee that this problem will become more and more important in the coming years, as it not only involves relevant theoretical and methodological aspects, not all explored by the actual literature, but also
regards significant modern scientific applications, such as biomedical tools over Big Data [12,13], e-science and e-life Big Data applications [14,15], intelligent tools for exploring Big Data repositories [16,17], and so forth.
Inspired by these clear and evident trends, in this chapter, we focus our attention on privacy-preserving OLAP data cubes over Big Data, and we provide two kinds of contributions:
with our work, as they basically combine a technique inspired by statistical databases with an access control scheme, which are both outside the scope of this chapter. Reference 34 extends the results of Reference 32 by proposing the algorithm FMC, which still works on the cuboid lattice to hide sensitive data that cause inference. Finally, Reference 20 proposes a random data distortion technique, called the zero-sum method, for preserving the privacy of data cells while providing accurate answers to range queries. To this end, Reference 20 iteratively alters the values of the data cells of the target data cube in such a way as to maintain the marginal sums of data cells along the rows and columns of the data cube equal to zero. According to the motivations given in Section 16.1.1, when applied to massive data cubes, Reference 20 clearly introduces excessive overheads, which are not comparable with the low computational requirements of sampling-based techniques like ours. In addition to this, in our framework, we are interested in preserving the privacy of aggregate patterns rather than that of data cells, which, however, can still be captured by introducing aggregate patterns at the coarser degree of aggregation of the input data cube, as stated in Section 16.1.1. In other words, Reference 20 does not introduce a proper notion of privacy OLAP but only restricts the analysis to the privacy of data cube cells. Despite this, we observe that, from the client-side perspective, Reference 20 (1) solves the same problem we investigate, that is, providing privacy-preserving (approximate) answers to OLAP queries against data cubes, and (2) contrary to References 32 and 35, adopts a “data-oriented” approach, which is similar in nature to ours. For these reasons, in our experimental analysis, we test the performance of our framework against that of Reference 20, which, apart from being the state-of-the-art perturbation-based privacy-preserving OLAP technique, will be hereby considered as the comparison technique for testing the effectiveness of the privacy-preserving OLAP technique we propose.
tuples with which they participate in the partition in order to gain row-level privacy and (2) a server is capable of evaluating OLAP queries against perturbed tables via reconstructing the original distributions of the attributes involved by such queries. In Reference 35, the authors demonstrate that the proposed distributed privacy-preserving OLAP model is safe against privacy breaches. Reference 42 is another distributed privacy-preserving OLAP approach that is reminiscent of ours. More specifically, Reference 42 pursues the idea of obtaining a privacy-preserving OLAP data cube model from distributed data sources across multiple sites via applying perturbation-based techniques on aggregate data that are retrieved from each singleton site as a baseline step of the main (distributed) OLAP computation task. Finally, Reference 43 focuses the attention on the significant issue of providing efficient data aggregation while preserving privacy over wireless sensor networks. The proposed solution is represented by two privacy-preserving data aggregation schemes that make use of innovative additive aggregation functions, these schemes being named cluster-based private data aggregation (CPDA) and slice-mix-aggregate (SMART). The proposed aggregation functions fully exploit the topology and dynamics of the underlying wireless sensor network and bridge the gap between collaborative data collection over such networks and data privacy needs.
etc.). Therefore, our framework can be straightforwardly extended to deal with other SQL aggregations different from SUM. However, the latter research aspect is outside the scope of this chapter and is thus left as future work.
Given a query Q against a data cube A, the query region of Q, denoted by R(Q), is defined as the subdomain of A bounded by the ranges $R_{k_0}, R_{k_1}, \ldots, R_{k_{m-1}}$ of Q.
Given an m-dimensional query Q, the accuracy grid $\mathcal{G}(Q)$ of Q is a tuple $\mathcal{G}(Q) = \langle \Delta\ell_{k_0}, \Delta\ell_{k_1}, \ldots, \Delta\ell_{k_{m-1}} \rangle$, such that $\Delta\ell_{k_i}$ denotes the range partitioning Q along the dimension $d_{k_i}$ of A, with $k_i$ belonging to $[0, m-1]$, in a $\Delta\ell_{k_i}$-based (one-dimensional) partition. By combining the one-dimensional partitions along all the dimensions of Q, we finally obtain $\mathcal{G}(Q)$ as a regular multidimensional partition of R(Q). From Section 16.1.1, recall that the elementary cell of the accuracy grid $\mathcal{G}(Q)$ is implicitly defined by the subqueries of Q belonging to the query workload QWL against the target data cube. An example of an accuracy grid is depicted in Figure 16.1: each elementary data cell corresponds to a subquery in QWL.
Based on the latter definitions, in our framework, we consider the broader concept of extended range query $Q^+$, defined as a tuple $Q^+ = \langle Q, \mathcal{G}(Q) \rangle$, such that (1) Q is a “classical” range query, $Q = \langle R_{k_0}, R_{k_1}, \ldots, R_{k_{m-1}} \rangle$, and (2) $\mathcal{G}(Q)$ is the accuracy grid associated with Q, $\mathcal{G}(Q) = \langle \Delta\ell_{k_0}, \Delta\ell_{k_1}, \ldots, \Delta\ell_{k_{m-1}} \rangle$, with the condition that each interval $\Delta\ell_{k_i}$ is defined on the corresponding range $R_{k_i}$ of the dimension $d_{k_i}$ of Q. For the sake of simplicity, here and in the remaining part of the chapter, we assume $Q \equiv Q^+$.
Given an n-dimensional data domain D, we introduce the volume of D, denoted by $\|D\|$, as follows: $\|D\| = |d_0| \times |d_1| \times \cdots \times |d_{n-1}|$, such that $|d_i|$ is the cardinality of the dimension $d_i$ of D. This definition can also be extended to a multidimensional data cube A, thus introducing the volume of A, $\|A\|$, and to a multidimensional range query Q, thus introducing the volume of Q, $\|Q\|$.
Given a data cube A, a range query workload QWL against A is defined as a collection of (range) queries against A, as follows: $QWL = \{Q_0, Q_1, \ldots, Q_{|QWL|-1}\}$, with $R(Q_k) \subseteq R(A)\ \forall\ Q_k \in QWL$. An example query workload is depicted in Figure 16.1.
Given a query workload $QWL = \{Q_0, Q_1, \ldots, Q_{|QWL|-1}\}$, we say that QWL is nonoverlapping if there do not exist two queries $Q_i$ and $Q_j$ belonging to QWL such that $R(Q_i) \cap R(Q_j) \neq \emptyset$. Given a query workload $QWL = \{Q_0, Q_1, \ldots, Q_{|QWL|-1}\}$, we say that QWL is overlapping if there exist two queries $Q_i$ and $Q_j$ belonging to QWL such that $R(Q_i) \cap R(Q_j) \neq \emptyset$. Given a query workload $QWL = \{Q_0, Q_1, \ldots, Q_{|QWL|-1}\}$, the region set of QWL, denoted by R(QWL), is defined as the collection of the regions of the queries belonging to QWL, as follows: $R(QWL) = \{R(Q_0), R(Q_1), \ldots, R(Q_{|QWL|-1})\}$.

FIGURE 16.1 Building the nonoverlapping query workload (plain lines) from an overlapping query workload (bold lines).
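To make the preceding definitions concrete, the following minimal Python sketch models ranges, range queries with their accuracy grids, and query workloads, together with the volume and overlap notions defined above. All class and function names are illustrative and are not taken from the chapter.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A range on a dimension is modeled as a half-open integer interval [lo, hi).
Range = Tuple[int, int]

@dataclass
class RangeQuery:
    """A 'classical' m-dimensional range query: one range per queried dimension."""
    ranges: List[Range]   # R_{k_0}, ..., R_{k_{m-1}}
    grid: List[int]       # accuracy grid steps Delta_l_{k_0}, ..., Delta_l_{k_{m-1}}

    def volume(self) -> int:
        """||Q||: product of the range lengths along all dimensions of Q."""
        v = 1
        for lo, hi in self.ranges:
            v *= (hi - lo)
        return v

def overlaps(q1: RangeQuery, q2: RangeQuery) -> bool:
    """True if R(Q1) and R(Q2) intersect on every dimension, i.e., the regions overlap."""
    return all(lo1 < hi2 and lo2 < hi1
               for (lo1, hi1), (lo2, hi2) in zip(q1.ranges, q2.ranges))

def is_overlapping_workload(qwl: List[RangeQuery]) -> bool:
    """A workload QWL is overlapping if at least one pair of its queries overlaps."""
    return any(overlaps(qwl[i], qwl[j])
               for i in range(len(qwl)) for j in range(i + 1, len(qwl)))

def workload_volume(qwl: List[RangeQuery]) -> int:
    """||QWL||: here taken as the sum of the volumes of the queries in QWL (an assumption)."""
    return sum(q.volume() for q in qwl)

# Example: a 2-D workload whose two queries overlap.
qwl = [RangeQuery([(0, 8), (0, 4)], grid=[2, 2]),
       RangeQuery([(6, 12), (2, 6)], grid=[3, 2])]
print(workload_volume(qwl), is_overlapping_workload(qwl))   # 56 True
```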
Based on the previous definition of $\|QWL\|$, the average relative query error $\bar{E}_Q(QWL)$ for a given query workload QWL can be expressed as a weighted linear combination of the relative query errors $E_Q(Q_k)$ of all the queries $Q_k$ in QWL, as follows:

$$\bar{E}_Q(QWL) = \sum_{k=0}^{|QWL|-1} \frac{\|Q_k\|}{\|QWL\|} \cdot E_Q(Q_k), \qquad \text{that is,} \qquad \bar{E}_Q(QWL) = \sum_{k=0}^{|QWL|-1} \frac{\|Q_k\|}{\sum_{j=0}^{|QWL|-1} \|Q_j\|} \cdot \frac{\left| A(Q_k) - \tilde{A}(Q_k) \right|}{\max\{A(Q_k),\,1\}},$$

under the constraint $\sum_{k=0}^{|QWL|-1} \frac{\|Q_k\|}{\|QWL\|} = 1$.
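As a small worked illustration of the accuracy metric, the sketch below computes the per-query relative error $E_Q(Q_k)$ and its volume-weighted workload average, assuming the exact answers, approximate answers, and query volumes are already available as plain numbers; the function names are illustrative.

```python
from typing import List

def relative_query_error(exact: float, approx: float) -> float:
    """E_Q(Q_k) = |A(Q_k) - A~(Q_k)| / max{A(Q_k), 1}."""
    return abs(exact - approx) / max(exact, 1.0)

def avg_relative_query_error(exact: List[float], approx: List[float],
                             volumes: List[float]) -> float:
    """Average E_Q(QWL): volume-weighted mean of the per-query errors (weights sum to 1)."""
    total_volume = sum(volumes)          # ||QWL||
    return sum((v / total_volume) * relative_query_error(a, a_tilde)
               for a, a_tilde, v in zip(exact, approx, volumes))

# Example: two queries, the second one answered less accurately.
print(avg_relative_query_error(exact=[100.0, 40.0], approx=[95.0, 30.0],
                               volumes=[32, 24]))   # ~0.136
```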
cell, thus inferring sensitive knowledge that is even more useful. Also, by exploiting OLAP hierarchies and the well-known ROLL-UP operator, it is possible to discover aggregations of ranges of data at higher degrees of such hierarchies. It should be noted that the singleton aggregation model, $I(Q_k)$, indeed represents an instance of our privacy OLAP notion targeted to the problem of preserving the privacy of range-SUM queries (the focus of our chapter). As a consequence, $I(Q_k)$ is essentially based on the conventional SQL aggregation operator AVG. Despite this, the underlying theoretical model we propose is general enough to be straightforwardly extended to deal with more sophisticated privacy OLAP notion instances, depending on the particular class of OLAP queries considered. Without loss of generality, given a query $Q_k$ belonging to an arbitrary OLAP query class, in order to handle the privacy preservation of $Q_k$, we only need to define the formal expression of the related singleton aggregation $I(Q_k)$ (like the previous one for the specific case of range-SUM queries). Then, the theoretical framework we propose works in the same way.
Secondly, we study how OLAP client applications can discover sensitive aggregations from the knowledge about approximate answers and, similarly to the previous case, from the knowledge about data cube and query metadata. Starting from the knowledge about the synopsis data cube A′ and the knowledge about the answer to a given query $Q_k$ belonging to the query workload QWL, it is possible to derive an estimation of $I(Q_k)$, denoted by $\tilde{I}(Q_k)$, as follows: $\tilde{I}(Q_k) = \frac{\tilde{A}(Q_k)}{S(Q_k)}$, such that $S(Q_k)$ is the number of samples effectively extracted from $R(Q_k)$ to compute A′ (note that $S(Q_k) < \|Q_k\|$). The relative difference between $I(Q_k)$ and $\tilde{I}(Q_k)$, named relative inference error and denoted by $E_I(Q_k)$, gives us a metric for the privacy of $\tilde{A}(Q_k)$, which is defined as follows: $E_I(Q_k) = \frac{|I(Q_k) - \tilde{I}(Q_k)|}{\max\{I(Q_k),\,1\}}$.
Indeed, while OLAP client applications are aware of the definition and metadata of both the target data cube and the queries of the query workload QWL, the number of samples $S(Q_k)$ (for each query $Q_k$ in QWL) is not disclosed to them. As a consequence, in order to model this aspect of our framework, we introduce the user-perceived singleton aggregation, denoted by $\tilde{I}_U(Q_k)$, which is the effective singleton aggregation perceived by external applications based on the knowledge made available to them. $\tilde{I}_U(Q_k)$ is defined as follows: $\tilde{I}_U(Q_k) = \frac{\tilde{A}(Q_k)}{\|Q_k\|}$.
Based on $\tilde{I}_U(Q_k)$, we derive the definition of the relative user-perceived inference error $E_I^U(Q_k)$, as follows: $E_I^U(Q_k) = \frac{|I(Q_k) - \tilde{I}_U(Q_k)|}{\max\{I(Q_k),\,1\}}$.
Since $S(Q_k) < \|Q_k\|$, it is trivial to demonstrate that $\tilde{I}_U(Q_k)$ provides a better estimation of the singleton aggregation of $Q_k$ than that provided by $\tilde{I}(Q_k)$, as $\tilde{I}_U(Q_k)$ is evaluated with respect to all the items contained within $R(Q_k)$ [i.e., $\|Q_k\|$], whereas $\tilde{I}(Q_k)$ is evaluated with respect to the effective number of samples extracted from $R(Q_k)$ [i.e., $S(Q_k)$]. In other words, $\tilde{I}_U(Q_k)$ is an upper bound for $\tilde{I}(Q_k)$. Therefore, in our framework, we consider $\tilde{I}(Q_k)$ to compute the synopsis data cube, whereas we consider $\tilde{I}_U(Q_k)$ to model inference issues on the OLAP client application side.
$E_I^U(Q_k)$ can be extended to the whole query workload QWL by considering the average relative inference error $\bar{E}_I(QWL)$, which takes into account the contributions of the relative inference errors $E_I(Q_k)$ of all the queries $Q_k$ in QWL. Similarly to what was done for the average relative query error $\bar{E}_Q(QWL)$, we model $\bar{E}_I(QWL)$ as follows:

$$\bar{E}_I(QWL) = \sum_{k=0}^{|QWL|-1} \frac{\|Q_k\|}{\|QWL\|} \cdot E_I(Q_k), \qquad \text{that is,} \qquad \bar{E}_I(QWL) = \sum_{k=0}^{|QWL|-1} \frac{\|Q_k\|}{\sum_{j=0}^{|QWL|-1} \|Q_j\|} \cdot \frac{\left| I(Q_k) - \tilde{I}_U(Q_k) \right|}{\max\{I(Q_k),\,1\}},$$

under the constraint $\sum_{k=0}^{|QWL|-1} \frac{\|Q_k\|}{\|QWL\|} = 1$. Note that $\bar{E}_I(QWL)$ is defined in dependence on $\tilde{I}_U(Q_k)$ rather than $\tilde{I}(Q_k)$. For the sake of simplicity, here and in the remaining part of the chapter, we assume $E_I^U(Q_k) \equiv E_I(Q_k)$.
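The privacy metrics can be sketched in the same style. Here the singleton aggregation $I(Q_k)$ is assumed to be the true AVG over the cells of $R(Q_k)$ (consistent with the AVG-based instance used for range-SUM queries), while $\tilde{I}(Q_k)$ and $\tilde{I}_U(Q_k)$ are derived from the approximate answer using the sample count $S(Q_k)$ and the volume $\|Q_k\|$, respectively. The names and toy numbers are illustrative only.

```python
def singleton_aggregation(exact_sum: float, volume: int) -> float:
    """I(Q_k): AVG-style singleton aggregation, i.e., the exact SUM over the number of cells."""
    return exact_sum / volume

def estimated_singleton(approx_sum: float, n_samples: int) -> float:
    """I~(Q_k) = A~(Q_k) / S(Q_k), computable only where S(Q_k) is known (server side)."""
    return approx_sum / n_samples

def user_perceived_singleton(approx_sum: float, volume: int) -> float:
    """I~_U(Q_k) = A~(Q_k) / ||Q_k||, the estimate a client can form without knowing S(Q_k)."""
    return approx_sum / volume

def relative_inference_error(i_true: float, i_est: float) -> float:
    """E_I(Q_k) = |I(Q_k) - I~(Q_k)| / max{I(Q_k), 1} (same shape for the user-perceived case)."""
    return abs(i_true - i_est) / max(i_true, 1.0)

# Example: a query over 32 cells, answered from 6 samples.
i_true = singleton_aggregation(exact_sum=100.0, volume=32)      # 3.125
i_user = user_perceived_singleton(approx_sum=95.0, volume=32)   # 2.96875
i_srv = estimated_singleton(approx_sum=95.0, n_samples=6)       # ~15.83
print(relative_inference_error(i_true, i_user))   # 0.05
print(relative_inference_error(i_true, i_srv))    # ~4.07
```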
The concepts and definitions above allow us to introduce the singleton aggregation privacy-preserving model $\langle I(\cdot), \tilde{I}(\cdot), \tilde{I}_U(\cdot) \rangle$, which is a fundamental component of the privacy-preserving OLAP framework we propose and which properly realizes our privacy OLAP notion.
Given a query $Q_k \in QWL$ against the target data cube A, in order to preserve the privacy of $Q_k$ under our privacy OLAP notion, we must maximize the inference error $E_I(Q_k)$ while minimizing the query error $E_Q(Q_k)$. While the definition of $E_Q(Q_k)$ can reasonably be considered as an invariant of our theoretical model, the definition of $E_I(Q_k)$ strictly depends on the singleton aggregation model. Therefore, given a particular class of OLAP queries, in order to preserve the privacy of queries of that kind, we only need to appropriately define the corresponding singleton aggregation model. This nice amenity states that the privacy-preserving OLAP framework we propose is orthogonal to the particular class of queries considered and can be straightforwardly adapted to a large family of OLAP query classes.
with approximate query answering techniques. As a result, the parameter $\Phi_I$ can be set according to a two-step approach where first the accuracy constraint is accomplished in dependence of $\Phi_Q$, and then $\Phi_I$ is consequently set by trying to maximize it (i.e., augmenting the privacy of answers) as much as possible, thus following a best-effort approach.
FIGURE 16.2 Two sampling strategies: (a) accuracy grid–constrained sampling and (b) region-constrained sampling.
which could be caused by the alternative strategy (i.e., region-constrained sampling), arbitrarily originates regions in A′ for which the accuracy error is low and the inference error is high (which is a desideratum in our framework), and regions for which the accuracy error is very high and the inference error is very low (which are both undesired effects in our framework). The final, global effect of such a scenario is a limited capability of answering queries while satisfying the accuracy/privacy constraint. Contrary to the latter scenario, accuracy grid–constrained sampling aims at obtaining a fair distribution of samples across A′, so that a large number of queries against A′ can be accommodated while satisfying the accuracy/privacy constraint.
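To illustrate the contrast between the two strategies of Figure 16.2, the sketch below draws a fixed per-cell quota of samples under accuracy grid–constrained sampling and, for comparison, samples the whole query region at once under region-constrained sampling. The cell representation and random source are simplifying assumptions, not the chapter's actual sampling procedure.

```python
import random
from typing import Dict, List, Tuple

Cell = Tuple[int, int]   # (row, col) index of an accuracy grid cell

def grid_constrained_sampling(cells: Dict[Cell, List[float]],
                              samples_per_cell: int,
                              seed: int = 0) -> Dict[Cell, List[float]]:
    """Draw the same number of samples from every accuracy grid cell, so that the
    samples are spread fairly across the query region (grid-constrained strategy)."""
    rng = random.Random(seed)
    return {cell: rng.sample(values, min(samples_per_cell, len(values)))
            for cell, values in cells.items()}

def region_constrained_sampling(cells: Dict[Cell, List[float]],
                                total_samples: int,
                                seed: int = 0) -> List[float]:
    """Draw samples from the whole region at once; nothing prevents them from
    clustering in a few cells (region-constrained strategy)."""
    rng = random.Random(seed)
    pool = [v for values in cells.values() for v in values]
    return rng.sample(pool, min(total_samples, len(pool)))

# Example: a 2 x 2 accuracy grid, each cell holding four cell values.
cells = {(r, c): [float(10 * r + c + i) for i in range(4)]
         for r in range(2) for c in range(2)}
print(grid_constrained_sampling(cells, samples_per_cell=2))
print(region_constrained_sampling(cells, total_samples=8))
```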
When overlapping query workloads are considered, intersection (query) regions pose the issue of dealing with the overlapping of different accuracy grids, which we name multiresolution accuracy grids, meaning that such grids partition the same intersection region of multiple queries by means of cells at different granularities. The more the granularities of such cells differ, the more “problematic” the settings to be handled become, with subqueries of very different volumes, so that, due to geometrical issues, handling both the accuracy and the privacy of answers, as well as dealing with the sampling phase, becomes more challenging. It should be noted that, contrary to what happens with overlapping queries, the accuracy grids of nonoverlapping queries originate subqueries whose volumes are equal to one another, so that both modeling and reasoning tasks become easier.
To overcome the issues deriving from handling multiresolution accuracy grids, we introduce an innovative solution that consists in decomposing nonoverlapping and overlapping query workloads into sets of appropriately selected subqueries, thus achieving the amenity of treating both kinds of query workloads in a unified manner. The baseline operation of this process is the decomposition of a query $Q_k$ of the target query workload QWL. Given an $I_{0,k} \times I_{1,k}$ query $Q_k$ and its accuracy grid $\mathcal{G}(Q_k) = \langle \Delta\ell_{0,k}, \Delta\ell_{1,k} \rangle$, the query decomposition process generates a set of subqueries $\zeta(Q_k)$ on the basis of the nature of $Q_k$. In more detail, if $Q_k$ is nonoverlapping, then $Q_k$ is decomposed into $\frac{I_{0,k}}{\Delta\ell_{0,k}} \cdot \frac{I_{1,k}}{\Delta\ell_{1,k}}$ subqueries given by the cells in $\mathcal{G}(Q_k)$, such that each subquery has volume equal to $\Delta\ell_{0,k} \times \Delta\ell_{1,k}$. Otherwise, if $Q_k$ is overlapping, that is, there exists another query $Q_h$ in QWL such that $R(Q_k) \cap R(Q_h) \neq \emptyset$, the query decomposition process works as follows: (1) The intersection region of $Q_k$ and $Q_h$, denoted by $R_I(Q_k, Q_h)$, is decomposed into the set of subqueries $\zeta(Q_k, Q_h)$ given by the overlapping of $\mathcal{G}(Q_k)$ and $\mathcal{G}(Q_h)$. (2) Let $\pi(Q_k)$ be the set of subqueries
a set of regions $R(QWL) = \{R(Q_0), R(Q_1), \ldots, R(Q_{|QWL|-1})\}$ is obtained. Let $R(Q_k)$ be a region belonging to R(QWL); the amount of storage space allocated to $R(Q_k)$, $B(Q_k)$, is determined according to a proportional approach that considers (1) the nature of the data distribution of $R(Q_k)$ and geometrical issues of $R(Q_k)$ and (2) the latter parameters of $R(Q_k)$ in proportional comparison with the same parameters of all the regions in R(QWL), as follows:

$$B(Q_k) = \frac{\varphi(R(Q_k)) + \Psi(R(Q_k)) \cdot \xi(R(Q_k))}{\sum_{h=0}^{|QWL|-1} \varphi(R(Q_h)) + \sum_{h=0}^{|QWL|-1} \Psi(R(Q_h)) \cdot \xi(R(Q_h))} \cdot B,$$

such that [45] (1) $\Psi(R)$ is a Boolean characteristic function that, given a region R, allows us to decide if the data in R are uniform or skewed; (2) $\varphi(R)$ is a factor that captures the skewness and the variance of R in a combined manner; and (3) $\xi(R)$ is a factor that provides the ratio between the skewness of R and its standard deviation, which, according to Reference 46, allows us to estimate the skewness degree of the data distribution of R. The previous formula can be extended to handle the overall allocation of B across the regions of QWL, thus achieving the formal definition of our proportional storage space allocation scheme over $\langle A, R(Q_0), R(Q_1), \ldots, R(Q_{|QWL|-1}), B \rangle$, via the following system:

$$\begin{cases} B(Q_0) = \dfrac{\varphi(R(Q_0)) + \Psi(R(Q_0)) \cdot \xi(R(Q_0))}{\sum_{k=0}^{|QWL|-1} \varphi(R(Q_k)) + \sum_{k=0}^{|QWL|-1} \Psi(R(Q_k)) \cdot \xi(R(Q_k))} \cdot B \\[2ex] \quad\vdots \\[1ex] B(Q_{|QWL|-1}) = \dfrac{\varphi(R(Q_{|QWL|-1})) + \Psi(R(Q_{|QWL|-1})) \cdot \xi(R(Q_{|QWL|-1}))}{\sum_{k=0}^{|QWL|-1} \varphi(R(Q_k)) + \sum_{k=0}^{|QWL|-1} \Psi(R(Q_k)) \cdot \xi(R(Q_k))} \cdot B \\[2ex] \sum_{k=0}^{|QWL|-1} B(Q_k) \leq B \end{cases} \tag{16.1}$$

In turn, for each query region $R(Q_k)$ of R(QWL), we further allocate the amount of storage space $B(Q_k)$ across the subqueries of $Q_k$, $q_{k,0}, q_{k,1}, \ldots, q_{k,m-1}$, obtained by decomposing $Q_k$ according to our decomposition process (see Section 16.8), via the same allocation scheme (Equation 16.1). Overall, this approach allows us to obtain a storage space allocation for each subquery $q_{k,i}$ of QWL in terms of the maximum sample number $N(q_{k,i}) = \frac{B(q_{k,i})}{32}$ that can be extracted from $q_{k,i}$, $B(q_{k,i})$ being the amount of storage space allocated to $q_{k,i}$.
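The proportional allocation of Equation 16.1 can be sketched as follows. The factors $\varphi(R)$, $\Psi(R)$, and $\xi(R)$ are assumed to be precomputed by the caller (the chapter derives them from the skewness, variance, and standard deviation of each region, following References 45 and 46); the code only reproduces the proportional split and the per-subquery sample budget $N(q_{k,i}) = B(q_{k,i})/32$. Function names are illustrative.

```python
from typing import List

def allocate_storage(phi: List[float], psi: List[bool], xi: List[float],
                     total_budget: float) -> List[float]:
    """Split the storage space B across query regions proportionally to
    phi(R(Q_k)) + Psi(R(Q_k)) * xi(R(Q_k)), as in Equation 16.1."""
    scores = [p + (x if flag else 0.0) for p, flag, x in zip(phi, psi, xi)]
    denom = sum(scores)
    return [s / denom * total_budget for s in scores]

def max_samples(budget_per_subquery: float, space_per_sample: float = 32.0) -> int:
    """N(q_{k,i}) = B(q_{k,i}) / 32; the constant 32 is taken from the chapter,
    whose storage unit is not restated here (an assumption)."""
    return int(budget_per_subquery // space_per_sample)

# Example: three regions, the second one skewed (Psi = True), total budget B = 4096.
budgets = allocate_storage(phi=[1.0, 2.5, 1.5], psi=[False, True, False],
                           xi=[0.0, 1.5, 0.0], total_budget=4096.0)
print(budgets)                               # [~630.2, ~2520.6, ~945.2]
print([max_samples(b) for b in budgets])     # [19, 78, 29]
```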
It should be noted that the described approach allows us to achieve an extremely accurate level of detail in handling the accuracy/privacy issues of the final synopsis data cube A′. To become convinced of this, recall that the granularity of OLAP client applications is that of queries (see Section 16.1.1), which is much greater than that of the subqueries (specifically, the latter depends on the degree of the accuracy grids) we use as the atomic unit of our reasoning. Thanks to this difference between the granularity of input queries and accuracy grid cells, which, in our framework, is made “conveniently” high, we finally obtain a crucial information gain that allows us to efficiently accomplish the accuracy/privacy constraint.
is satisfied as much as possible. This means that, given the input data cube A, we first sample A in order to obtain the current representation of A′. If such a representation satisfies the accuracy/privacy constraint, then the final representation of A′ is achieved and used at query time to answer queries instead of A. Otherwise, if the current representation of A′ does not satisfy the accuracy/privacy constraint, then we perform “corrections” on the current representation of A′, thus refining such representation in order to obtain a final representation that satisfies the constraint, on the basis of a best-effort approach. What we call the refinement process (described in Section 16.9.3) is based on a greedy approach that “moves”* samples from regions of QWL whose queries satisfy the accuracy/privacy constraint to regions of QWL whose queries do not satisfy the constraint, yet ensuring that the former do not violate the constraint.
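At a high level, the sample-then-refine flow just described can be sketched as below. The helpers sample_cube, workload_errors, and move_samples are placeholders for the allocation, sampling, and refinement machinery of Sections 16.9.1 through 16.9.3; they are not functions defined in the chapter.

```python
def compute_synopsis(cube, qwl, phi_q, phi_i, max_moves,
                     sample_cube, workload_errors, move_samples):
    """Sketch of the overall flow: sample the input cube, then greedily move samples
    between query regions until the accuracy/privacy constraint holds (best effort)."""
    synopsis = sample_cube(cube, qwl)                 # initial representation of A'
    for _ in range(max_moves):                        # bounded by the total buffer size
        avg_eq, avg_ei = workload_errors(synopsis, qwl)
        if avg_eq <= phi_q and avg_ei >= phi_i:       # workload satisfiability condition
            break                                     # constraint met: stop refining
        synopsis = move_samples(synopsis, qwl)        # one greedy refinement step
    return synopsis
```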
Given a query $Q_k$ of the target query workload QWL, we say that $Q_k$ satisfies the accuracy/privacy constraint if the following inequalities simultaneously hold: $E_Q(Q_k) \leq \Phi_Q$ and $E_I(Q_k) \geq \Phi_I$.
In turn, given a query workload QWL, we decide about its satisfiability with respect to the accuracy/privacy constraint by inspecting the satisfiability of the queries that compose QWL. Therefore, we say that QWL satisfies the accuracy/privacy constraint if the following inequalities simultaneously hold: $\bar{E}_Q(QWL) \leq \Phi_Q$ and $\bar{E}_I(QWL) \geq \Phi_I$.
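A minimal sketch of the satisfiability test, for a single query and for a whole workload; error values and thresholds are plain floats, and the function names are illustrative.

```python
def query_satisfies(e_q: float, e_i: float, phi_q: float, phi_i: float) -> bool:
    """A query satisfies the accuracy/privacy constraint iff E_Q <= Phi_Q and E_I >= Phi_I."""
    return e_q <= phi_q and e_i >= phi_i

def workload_satisfies(avg_e_q: float, avg_e_i: float,
                       phi_q: float, phi_i: float) -> bool:
    """A workload satisfies the constraint iff the *average* errors meet the same bounds."""
    return avg_e_q <= phi_q and avg_e_i >= phi_i

# Example: accuracy threshold 0.25, privacy threshold 0.60.
print(query_satisfies(e_q=0.18, e_i=0.72, phi_q=0.25, phi_i=0.60))   # True
print(query_satisfies(e_q=0.30, e_i=0.72, phi_q=0.25, phi_i=0.60))   # False
```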
Given the target query workload QWL, the criterion of our greedy approach used during the refinement process is the minimization of the average relative query error $\bar{E}_Q(QWL)$ and the maximization of the average relative inference error $\bar{E}_I(QWL)$, within the minimum number of movements that allows us to accomplish both goals simultaneously [i.e., minimizing $\bar{E}_Q(QWL)$ and maximizing $\bar{E}_I(QWL)$]. Furthermore, the refinement process is bounded by a maximum occupancy of samples moved across the queries of QWL, which we name the total buffer size. The total buffer size depends on several parameters, such as the size of the buffer, the number of sample pages moved at each iteration, the overall available swap memory, and so forth.
* In Section 16.9.3, we describe in detail the meaning of “moving” samples between query regions.
not satisfy the satisfiability condition, and (ii.jj) $Q^F$ is the query of QWL having the greatest negative distance from the satisfiability condition, that is, $Q^F$ is the query of QWL that is most in need of new samples; (3) move enough samples from $Q^T$ to $Q^F$ in such a way as to satisfy the accuracy/privacy constraint on $Q^F$ while, at the same time, ensuring that $Q^T$ does not violate the constraint; and (4) repeat steps 1, 2, and 3 until the current representation of A′ satisfies, as much as possible, the accuracy/privacy constraint with respect to QWL, within the maximum number of iterations bounded by the total buffer size. Regarding step 3, moving ρ samples from $Q^T$ to $Q^F$ means (1) removing ρ samples from $R(Q^T)$, thus obtaining an additional space, named B(ρ); (2) allocating B(ρ) to $R(Q^F)$; and (3) resampling $R(Q^F)$ by considering the additional number of samples that have become available: in practice, this means extracting ρ further samples from $R(Q^F)$.
Let $S^*(Q_k)$ be the number of samples of a query $Q_k \in QWL$ satisfying the accuracy/privacy constraint. From the formal definitions of $E_Q(Q_k)$ (see Section 16.5), $I(Q_k)$, $\tilde{I}(Q_k)$, and $E_I(Q_k)$ (see Section 16.6), and the satisfiability condition, it can easily be demonstrated that $S^*(Q_k)$ is given by the following formula: $S^*(Q_k) = \frac{(1 - \Phi_Q)}{(1 - \Phi_I)} \cdot \|Q_k\|$.
Let $S_{eff}(Q^T)$ and $S_{eff}(Q^F)$ be the numbers of samples effectively extracted from $R(Q^T)$ and $R(Q^F)$, respectively, during the previous sampling phase. Note that $S_{eff}(Q^F) < S^*(Q^F)$ and $S_{eff}(Q^T) \geq S^*(Q^T)$. It is easy to prove that the number of samples to be moved from $Q^T$ to $Q^F$ such that $Q^F$ satisfies the accuracy/privacy constraint and $Q^T$ does not violate the constraint, denoted by $S_{mov}(Q^T, Q^F)$, is finally given by the following formula: $S_{mov}(Q^T, Q^F) = S^*(Q^F) - S_{eff}(Q^F)$, under the constraint $S_{mov}(Q^T, Q^F) < S_{eff}(Q^T) - S^*(Q^T)$.
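The two quantities driving a refinement step follow directly from these formulas. The sketch below assumes the effective sample counts from the sampling phase are known and uses the donor/recipient naming $Q^T$/$Q^F$ from the text; it does not reproduce the full greedy loop of the computeSynDataCube algorithm, and the threshold values are illustrative only.

```python
def required_samples(volume: float, phi_q: float, phi_i: float) -> float:
    """S*(Q_k) = ((1 - Phi_Q) / (1 - Phi_I)) * ||Q_k||."""
    return (1.0 - phi_q) / (1.0 - phi_i) * volume

def samples_to_move(s_eff_donor: float, s_star_donor: float,
                    s_eff_recipient: float, s_star_recipient: float) -> float:
    """S_mov(Q^T, Q^F) = S*(Q^F) - S_eff(Q^F), accepted only if the donor Q^T still keeps
    at least S*(Q^T) samples afterwards; returns 0.0 when the move is not feasible."""
    needed = s_star_recipient - s_eff_recipient
    spare = s_eff_donor - s_star_donor
    return needed if 0.0 < needed < spare else 0.0

# Toy thresholds Phi_Q = 0.5 and Phi_I = 0.2, a donor of volume 64, a recipient of volume 32.
s_star_T = required_samples(volume=64, phi_q=0.5, phi_i=0.2)   # 40.0
s_star_F = required_samples(volume=32, phi_q=0.5, phi_i=0.2)   # 20.0
print(samples_to_move(s_eff_donor=50, s_star_donor=s_star_T,
                      s_eff_recipient=14, s_star_recipient=s_star_F))   # 6.0
```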
Without going into detail, it is possible to demonstrate that, given (1) an arbitrary data cube A, (2) an arbitrary query workload QWL, (3) an arbitrary pair of thresholds $\Phi_Q$ and $\Phi_I$, and (4) an arbitrary storage space B, it is not always possible to make QWL satisfiable via the refinement process. Given this evidence, our idea of using a best-effort approach makes perfect sense.
difficult to compare the two techniques under completely different experimental settings. On the other hand, this aspect highlights the innovative characteristics of our privacy-preserving OLAP technique with respect to Reference 20, which is indeed a state-of-the-art proposal among perturbation-based privacy-preserving OLAP techniques.
Figure 16.4 shows the experimental results concerning the relative query errors of synopsis data cubes built from uniform, skewed, TPC-H, and FCT data, and for several values of s. Figure 16.5 shows instead the results concerning the relative inference errors on the same data cubes. In both figures, our approach is labeled as G, whereas the approach of Reference 20 is labeled as Z. The obtained experimental results confirm the effectiveness of our algorithm, also in comparison with Reference 20, according to the following considerations. First, relative query and inference errors decrease as the selectivity of queries increases, that is, the accuracy of answers increases and the privacy of answers decreases as the selectivity of queries increases. This is because the more data cells are involved by a given query $Q_k$, the better the samples extracted from $R(Q_k)$ are able to “describe” the original data distribution of $R(Q_k)$ (this also depends on the proportional storage space allocation scheme [Equation 16.1]), so that accuracy increases. At the same time, more samples cause a decrease in privacy, since they provide accurate singleton aggregations and, as a consequence, the inference error decreases. Secondly, when s increases, we observe a higher query error (i.e., the accuracy of answers decreases) and a higher inference error (i.e., the privacy of answers increases).
FIGURE 16.4 Relative query errors of synopsis data cubes built from (a) uniform, (b) skewed, (c) TPC-H, and (d) FCT data cubes for several values of s (r = 20%).

FIGURE 16.5 Relative inference errors of synopsis data cubes built from (a) uniform, (b) skewed, (c) TPC-H, and (d) FCT data cubes for several values of s (r = 20%).

In other words, data sparseness influences both accuracy and privacy of answers, with a negative
effect in the first case (i.e., accuracy of answers) and a positive effect in the second case (i.e., privacy of answers). This is because, similarly to the results of Reference 20, we observe that privacy-preserving techniques, being essentially based on mathematical/statistical models and tools, strongly depend on the sparseness of data, since the latter, in turn, influences the nature and, above all, the shape of the data distributions kept in databases and data cubes. Both these pieces of experimental evidence further corroborate our idea of trading off accuracy and privacy of OLAP aggregations to compute the final synopsis data cube. Also, by comparing the experimental results on uniform, skewed, TPC-H, and FCT (input) data, we observe that our technique works better on uniform data, as expected, while its performance degrades gracefully on benchmark and real-life data. This is due to the fact that uniform data distributions can be approximated better than skewed, benchmark, and real-life ones. On the other hand, the experimental results reported in Figures 16.4 and 16.5 confirm the effectiveness and, above all, the reliability of our technique even on the benchmark and real-life data one can find in real-world application scenarios. Finally, Figures 16.4 and 16.5 clearly show that our proposed privacy-preserving OLAP technique outperforms the zero-sum method [20]. This achievement is another relevant contribution of our research.
REFERENCES
1. C. Wu and Y. Guo, “Enhanced User Data Privacy with Pay-by-Data Model,” in: Proceedings of
BigData Conference, 53–57, 2013.
2. M. Jensen, “Challenges of Privacy Protection in Big Data Analytics,” in: Proceedings of BigData
Congress, 235–238, 2013.
3. M. Li et al., “MyCloud: Supporting User-Configured Privacy Protection in Cloud Computing,”
in: Proceedings of ACSAC, 59–68, 2013.
4. S. Betgé-Brezetz et al., “End-to-End Privacy Policy Enforcement in Cloud Infrastructure,” in:
Proceedings of CLOUDNET, 25–32, 2013.
5. M. Weidner, J. Dees and P. Sanders, “Fast OLAP Query Execution in Main Memory on Large
Data in a Cluster,” in: Proceedings of BigData Conference, 518–524, 2013.
6. A. Cuzzocrea, R. Moussa and G. Xu, “OLAP*: Effectively and Efficiently Supporting Parallel
OLAP over Big Data,” in: Proceedings of MEDI, 38–49, 2013.
7. A. Cuzzocrea, L. Bellatreche and I.-Y. Song, “Data Warehousing and OLAP Over Big Data:
Current Challenges and Future Research Directions,” in: Proceedings of DOLAP, 67–70, 2013.
8. A. Cuzzocrea, “Analytics over Big Data: Exploring the Convergence of Data Warehousing, OLAP
and Data-Intensive Cloud Infrastructures,” in: Proceedings of COMPSAC, 481–483, 2013.
9. A. Cuzzocrea, I.-Y. Song and K.C. Davis, “Analytics over Large-Scale Multidimensional Data:
The Big Data Revolution!,” in: Proceedings of DOLAP, 101–104, 2011.
10. A. Cuzzocrea and V. Russo, “Privacy Preserving OLAP and OLAP Security,” J. Wang (ed.),
Encyclopedia of Data Warehousing and Mining, 2nd ed., IGI Global, Hershey, PA 1575–1581,
2009.
11. A. Cuzzocrea and D. Saccà, “Balancing Accuracy and Privacy of OLAP Aggregations on Data
Cubes,” in: Proceedings of the 13th ACM International Workshop on Data Warehousing and
OLAP, 93–98, 2010.
12. X. Chen et al., “OWL Reasoning over Big Biomedical Data,” in: Proceedings of BigData
Conference, 29–36, 2013.
13. M. Paoletti et al., “Explorative Data Analysis Techniques and Unsupervised Clustering
Methods to Support Clinical Assessment of Chronic Obstructive Pulmonary Disease (COPD)
Phenotypes,” Journal of Biomedical Informatics 42(6), 1013–1021, 2009.
14. Y.-W. Cheah et al., “Milieu: Lightweight and Configurable Big Data Provenance for Science,” in:
Proceedings of BigData Congress, 46–53, 2013.
15. A.G. Erdman, D.F. Keefe and R. Schiestl, “Grand Challenge: Applying Regulatory Science and
Big Data to Improve Medical Device Innovation,” IEEE Transactions on Biomedical Engineering
60(3), 700–706, 2013.
16. D. Cheng et al., “Tile Based Visual Analytics for Twitter Big Data Exploratory Analysis,” in:
Proceedings of BigData Conference, 2–4, 2013.
17. N. Ferreira et al., “Visual Exploration of Big Spatio-Temporal Urban Data: A Study of New York
City Taxi Trips,” IEEE Transactions on Visualization and Computer Graphics 19(12), 2149–2158, 2013.
18. L. Sweeney, “k-Anonymity: A Model for Protecting Privacy,” International Journal on
Uncertainty Fuzziness and Knowledge-based Systems 10(5), 557–570, 2002.
19. A. Machanavajjhala et al., “L-diversity: Privacy beyond k-Anonymity,” ACM Transactions on
Knowledge Discovery from Data 1(1), art. no. 3, 2007.
20. S.Y. Sung et al., “Privacy Preservation for Data Cubes,” Knowledge and Information Systems
9(1), 38–61, 2006.
21. J. Han et al., “Efficient Computation of Iceberg Cubes with Complex Measures,” in: Proceedings
of ACM SIGMOD, 1–12, 2001.
22. A. Cuzzocrea and W. Wang, “Approximate Range-Sum Query Answering on Data Cubes with
Probabilistic Guarantees,” Journal of Intelligent Information Systems 28(2), 161–197, 2007.
23. A. Cuzzocrea and P. Serafino, “LCS-Hist: Taming Massive High-Dimensional Data Cube
Compression,” in: Proceedings of the 12th International Conference on Extending Database
Technology, 768–779, 2009.
24. A. Cuzzocrea, “Overcoming Limitations of Approximate Query Answering in OLAP,” in: IEEE
IDEAS, 200–209, 2005.
25. N.R. Adam and J.C. Wortmann, “Security-Control Methods for Statistical Databases: A
Comparative Study,” ACM Computing Surveys 21(4), 515–556, 1989.
26. F.Y. Chin and G. Ozsoyoglu, “Auditing and Inference Control in Statistical Databases,” IEEE
Transactions on Software Engineering 8(6), 574–582, 1982.
27. J. Schlorer, “Security of Statistical Databases: Multidimensional Transformation,” ACM
Transactions on Database Systems 6(1), 95–112, 1981.
28. D.E. Denning and J. Schlorer, “Inference Controls for Statistical Databases,” IEEE Computer
16(7), 69–82, 1983.
29. N. Zhang, W. Zhao and J. Chen, “Cardinality-Based Inference Control in OLAP Systems: An
Information Theoretic Approach,” in: Proceedings of ACM DOLAP, 59–64, 2004.
30. F.M. Malvestuto, M. Mezzini and M. Moscarini, “Auditing Sum-Queries to Make a Statistical
Database Secure,” ACM Transactions on Information and System Security 9(1), 31–60, 2006.
31. L. Wang, D. Wijesekera and S. Jajodia, “Cardinality-based Inference Control in Data Cubes,”
Journal of Computer Security 12(5), 655–692, 2004.
32. L. Wang, S. Jajodia and D. Wijesekera, “Securing OLAP Data Cubes against Privacy Breaches,”
in: Proceedings of IEEE SSP, 161–175, 2004.
33. J. Gray et al., “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-
Tab, and Sub-Totals,” Data Mining and Knowledge Discovery 1(1), 29–53, 1997.
34. M. Hua et al., “FMC: An Approach for Privacy Preserving OLAP,” in: Proceedings of the 7th
International Conference on Data Warehousing and Knowledge Discovery, LNCS Vol. 3589,
408–417, 2005.
35. R. Agrawal, R. Srikant and D. Thomas, “Privacy-Preserving OLAP,” in: Proceedings of the 2005
ACM International Conference on Management of Data, 251–262, 2005.
36. J. Vaidya and C. Clifton, “Privacy Preserving Association Rule Mining in Vertically Partitioned
Data,” in: Proceedings of the 8th ACM International Conference on Knowledge Discovery and
Data Mining, 639–644, 2002.
37. M. Kantarcioglu and C. Clifton, “Privacy-Preserving Distributed Mining of Association Rules
on Horizontally Partitioned Data,” IEEE Transactions on Knowledge and Data Engineering
16(9), 1026–1037, 2004.
38. J. Vaidya and C. Clifton, “Privacy-Preserving K-Means Clustering over Vertically Partitioned
Data,” in: Proceedings of the 9th ACM International Conference on Knowledge Discovery and
Data Mining, 206–215, 2003.
39. G. Jagannathan, K. Pillaipakkamnatt and R. Wright, “A New Privacy-Preserving Distributed
K-Clustering Algorithm,” in: Proceedings of the 2006 SIAM International Conference on Data
Mining, 492–496, 2006.
40. G. Jagannathan and R. Wright, “Privacy-Preserving Distributed K-Means Clustering over
Arbitrarily Partitioned Data,” in: Proceedings of the 11th ACM International Conference on
Knowledge Discovery and Data Mining, 593–599, 2002.
41. C. Clifton et al., “Tools for Privacy Preserving Distributed Data Mining,” SIGKDD Explorations
4(2), 28–34, 2002.
42. Y. Tong et al., “Privacy-Preserving OLAP based on Output Perturbation Across Multiple Sites,”
in: Proceedings of the 2006 International Conference on Privacy, Security and Trust, AICPS
Vol. 380, 46, 2006.
43. W. He et al., “PDA: Privacy-Preserving Data Aggregation in Wireless Sensor Networks,” in:
Proceedings of the 26th IEEE Annual Conference on Computer Communications, 2045–2053,
2007.
44. A. Cuzzocrea, “Accuracy Control in Compressed Multidimensional Data Cubes for Quality
of Answer-based OLAP Tools,” in: Proceedings of the 18th IEEE International Conference on
Scientific and Statistical Database Management, 301–310, 2006.
45. A. Cuzzocrea, “Improving Range-Sum Query Evaluation on Data Cubes via Polynomial
Approximation,” Data & Knowledge Engineering 56(2), 85–121, 2006.
46. A. Stuart and K.J. Ord, Kendall's Advanced Theory of Statistics, Vol. 1: Distribution Theory, 6th
ed., Oxford University Press, New York City, 1998.
47. S. Agarwal et al., “On the Computation of Multidimensional Aggregates,” in: Proceedings of
VLDB, 506–521, 1996.
48. G.K. Zipf, Human Behaviour and the Principle of Least Effort: An Introduction to Human
Ecology, Addison-Wesley, Boston, MA, 1949.
49. Transaction Processing Council, TPC Benchmark H, available at http://www.tpc.org/tpch/.
50. UCI KDD Archive, The Forest CoverType Data Set, available at http://kdd.ics.uci.edu
/databases/covertype/covertype.html.
51. K. Beyer et al., “Extending XQuery for Analytics,” in: Proceedings of the 2005 ACM International
Conference on Management of Data, 503–514, 2005.
52. R.R. Bordawekar and C.A. Lang, “Analytical Processing of XML Documents: Opportunities
and Challenges,” SIGMOD Record 34(2), 27–32, 2005.
V
Big Data Applications
CHAPTER 17
Big Data in Finance
Taruna Seth and Vipin Chaudhary
CONTENTS
17.1 Introduction 330
17.2 Financial Domain Dynamics 331
17.2.1 Historical Landscape versus Emerging Trends 331
17.3 Financial Capital Market Domain: In-Depth View 334
17.3.1 Big Data Origins 334
17.3.2 Information Flow 335
17.3.3 Data Analytics 338
17.4 Emerging Big Data Landscape in Finance 339
17.4.1 Challenges 340
17.4.2 New Models of Computation and Novel Architectures 341
17.5 Impact on Financial Research and Emerging Research Landscape 342
17.5.1 Background 342
17.5.2 UHFD (Big Data)–Driven Research 344
17.5.3 UHFD (Big Data) Implications 348
17.5.4 UHFD (Big Data) Challenges 348
17.6 Summary 349
References 350
BACKGROUND
The financial industry has always been driven by data. Today, Big Data is prevalent at various levels of this field, ranging from the financial services sector to capital markets. The availability of Big Data in this domain has opened up new avenues for innovation and has offered immense opportunities for growth and sustainability. At the same time, it has presented several new challenges that must be overcome to gain the maximum value out of it. This chapter considers the impact and applications of Big Data in the financial domain. It examines some of the key advancements and transformations driven by Big Data in this field. The chapter also highlights important Big Data challenges that remain to be addressed in the financial domain.
17.1 INTRODUCTION
In recent years, the financial industry has seen an upsurge of interest in Big Data. This comes as no surprise to finance experts, who understand the potential value of data in this field and are aware that no industry can benefit more from Big Data than the financial services industry. After all, the industry not only is driven by data but also thrives on data. Today, the data, characterized by the four Vs, which refer to volume, variety, velocity, and veracity, are prevalent at various levels of this field, ranging from capital markets to the financial services industry. In recent years, capital markets have gone through an unprecedented change, resulting in the generation of massive amounts of high-velocity and heterogeneous data. For instance, about 70% of the US equity trades today are generated by high-frequency trades (HFTs) and are machine driven [1]. The prevalence of electronic trading has spurred growth in trading activity and HFT, which, among other factors, have led to the availability of very large-scale ultrahigh-frequency data (UHFD). These high-speed data are already having a huge impact in the field in several areas ranging from risk assessment and management to business intelligence (BI). For example, the availability of UHFD is forcing market participants to rethink the traditional ways of risk assessment and bringing attention to more accurate, short-term risk assessment measures. Similar trends can be observed in the financial services sector, where Big Data is increasingly becoming the most significant, promising, and differentiating asset for financial services companies. For instance, today, customers expect more personalized banking services, and to remain competitive as well as comply with increased regulatory surveillance, the banking services sector is under tremendous pressure to best utilize the breadth and depth of the available data. In recent years, firms have already started using the information obtained from the vast oceans of available data to gain customer knowledge, anticipate market conditions, and better gauge customer preferences and behavior ahead of time, so as to offer highly personalized customer-centric products and services to their customers, such as sentiment analysis–enabled brand strategy management and real-time location-based product offerings, as opposed to the historically offered product-centric services. Moreover, events like the credit crisis of 2008 have further shifted the focus of such financial entities towards Big Data as a strategic imperative for dealing with the acute stresses of renewed economic uncertainty, systemic monitoring, increasing regulatory pressure, and banking sector reforms. Unarguably, similar developments can be seen in other areas like asset management and insurance.
Clearly, such examples are indicative of the transformations ensuing in the finance sector, whereby more and more financial institutions are resorting to Big Data to strategize their business decisions based on reliable factual insights supported by real data rather than just intuition. Additionally, Big Data is now playing a critical role in several areas like investment analysis, econometrics, risk assessment, fraud detection, trading, customer interaction analysis, and behavior modeling.
In this digital era, we create approximately 2.5 quintillion bytes of data every day, and 90% of the data in the world today have been created in the last 2 years alone. The Big Data market is estimated to be at $5.1 billion this year and is expected to grow to $32.1 billion
by 2015 and to $53.4 billion by the year 2017 [2]. Today, almost all sectors of the financial field are inundated with data generated from a myriad of heterogeneous sources, such as hundreds of millions of transactions conducted daily, ultrahigh-frequency trading activities, news, social media, and logs. A recent survey shows that around 62% of companies recognize the ability of Big Data to provide a competitive edge [2], and there is no doubt that the prevalent Big Data offers immense potential and opportunity in the finance sector. However, the enormously large financial data volumes, high generation speeds, and heterogeneity associated with the relevant financial domain data, along with its susceptibility to errors, make the ingestion, processing, and timely analysis of such vast volumes of often heterogeneous data very challenging.
There is no clear consensus among and within financial institutions today on the best strategies to harvest and turn the available Big Data into actionable knowledge. This can be attributed to the fact that a single solution is unlikely to cater to the growing needs of different businesses within the financial domain spectrum. Today, many financial organizations are exploring and adopting customized Big Data solutions that are specifically tailored to their domain-specific needs. Section 17.2 presents an in-depth view of the financial domain with details on the historical trends in this sector and innovations in the field as a result of Big Data. The section also covers the three key elements involved in financial domain dynamics, namely, Big Data sources in finance, information flow, and data analytics.
Dodd–Frank Act have by themselves added hundreds of new rules affecting the banking and securities industries. These directives for greater transparency are leading to enormous increases in the data volumes across such industries and forcing them to redesign their services infrastructure to cater to the new demands. Similar data growth trends can be seen in other parts of the financial domain. For example, there was a tenfold increase in market data volumes between 2008 and 2011, and the data volumes continue to grow strongly in all areas of the financial domain; for example, some of the top European insurers reported a sixfold increase in the amount of data and analytic reporting required by just the first pillar of the Solvency II insurance reform regulation [4]. The New York Stock Exchange (NYSE) by itself creates several terabytes of market and reference data per day covering the use and exchange of financial instruments, whereas Twitter feeds, often analyzed for sentiment analysis in the financial domain, generate about 8 terabytes of data per day of social interactions [4]. There are around 10,000 payment card transactions executed per second across the globe; there were about 210 billion electronic payments generated worldwide in 2010, and the number is expected to double by the end of the decade [4]. Various other developments in the financial system are also contributing enormously to the overall volume of the data in the system. One example of this shift is the emergence of the originate-to-distribute model, which has broken down the traditional origination process into a sequence of highly specialized transactions and has led to an increase in the volume of the data in this domain. In this model, financial products like mortgages are systematically securitized and then structured, repackaged, and distributed again, so the loan details that traditionally might have been recorded only by the original lender and the borrower are now shared across multiple, diverse entities such as the originating bank, borrower, loan servicer, securitization trust and bondholders, as well as buyers and sellers of credit protection derivatives [5]. Besides contributing to the data volume, each new entity in the system adds to the complexity of the involved data. The digital universe is expected to grow nearly 20-fold, to approximately 35 zettabytes of data, by the year 2020 [6].
Traditional data management practices in finance can no longer effectively cope with the ever-increasing, huge, and rapid influx of heterogeneous (structured, semistructured, unstructured) data originating from a wide range of internal processes and external sources, including social media, blogs, audio, video, and so forth. Conventional data management technologies are destined to fail with such growing data volumes, which far exceed the storage and analysis capabilities of many traditional systems, and in many instances they already have. For example, with regard to the volume and complexity of created data, back offices of trading firms have failed to keep up with their own front office processes as well as the emerging data management practices adopted in other industries to handle growing data volumes and a multitude of diverse data types [5]. Moreover, traditional systems are not equipped to handle the wide variety of data, especially unstructured data, from social media, like news, Twitter, blogs, videos, and so forth, that is needed to gain insights about business processes (e.g., risk analysis, trading predictions) and keep up with the evolving needs of the customer in the financial services industry. Such systems often fail when it comes to integration of heterogeneous data or even real-time processing of structured data. In fact, this is becoming a bottleneck for many top-tier global banking systems since the
introduction of new regulations that require the banks to provide a complete horizontal view of risk within their trading arms. This task entails the integration of data from different schemas unique to each of the trade capture systems into a central repository for positions, counterparty information, and trades. Extraction, transformation, cleansing, and integration of such data via traditional extract, transform, load (ETL)–based approaches, coupled with sampling, often span several days and are not very accurate in scenarios where only a sample of the data is used for analysis. New regulations, however, demand that this entire pipeline be executed several times a day, a feat clearly infeasible using the conventional approach. These regulations are similarly applicable to the capital markets, where they necessitate an accurate view of different risk exposures across asset classes, lines of business, and firms to better estimate and manage systemic interplays. These tasks require simulations of a variety of risk situations, which can result in the generation of terabytes of additional data per day [7].
In recent years, it has become increasingly important for financial firms to adopt a data-centric perspective to handle the mounting regulatory pressures and succeed in today's digital, global marketplace. In the past, financial organizations collected large amounts of data. However, these institutions depended primarily on the conventional ETL framework and lacked the ability to process the data and produce actionable knowledge out of it within realistic time frames. This approach prevented them from gaining a full perspective of their business insights and made it difficult for them to anticipate and respond to changing market conditions, business needs, and emerging opportunities, a few must-haves essential to thrive in today's dynamic business environment. As a result, the firms relying on the traditional schemes have started to address the limitations inherent in their conventional systems. Today, a growing number of financial institutions are exploring new ways of unlocking the potential of available data to gain insights that can help them improve their performance and gain competitive advantage through factual and actionable knowledge, timely decisions, risk management and mitigation, and efficient operations in highly complex and often volatile business environments. Figure 17.1 highlights the importance and applicability of Big Data in the financial domain. The figure illustrates the key sectors of finance in which the power of Big Data is being harnessed to address critical business needs ranging from product innovation and fact-driven strategic decision making to the development of novel and intelligent business solutions.
The financial industry has always been one of the most data-driven industries. In the past few years, the prevalence of Big Data has opened up new horizons in the financial field. Several industries have already started extracting value from Big Data for information discovery in areas like predictive analytics based on social behavior mining, deep analytics, fraud detection, and risk management. For example, most credit card companies mine several thousands to millions of records, aggregated from customer transactional data (structured), call records (unstructured), e-mails (semistructured), and claims data (unstructured), to proactively anticipate future risks, accurately predict customer card-switching behavior, and devise measures to improve customer relationships based on such behavioral modeling. Likewise, several firms involved in financial risk management perform risk assessment by integrating large volumes of transactional data with data from other sources like real-time market data as well as aggregate global positions data, pricing calculations, and the value-at-risk (VaR) crossing capacity data of current systems [2]. (Figure 17.1 illustrates the scale involved: billions of transactions every day, petabytes of stored banking data, and hundreds of millions of trades.) In the past couple of years, Big Data has begun to evolve as a front-runner for financial institutions interested in improving their performance and gaining a competitive edge. Section 17.3 exemplifies some of the Big Data innovations in the financial domain and highlights different key players involved in the financial dynamics pipeline of the capital markets.
adding to the heterogeneity of the data. Moreover, the data generated and utilized by these markets are highly voluminous and complex. The complexity of financial data arises because agents acting in markets trade increasingly faster and in more numerous and complex financial instruments, and have better information-acquiring tools than ever before. The heterogeneity stems not just from the fact that agents and regulations across the globe are themselves heterogeneous but also from the more mundane fact that people report information in ways that are not standardized. For example, investment managers reporting holdings to the Securities and Exchange Commission (SEC) often err regarding the proper unit of measurement (thousands or units); performance data are often "dressed" to appear more attractive (most often, smoother); and so on.
Recent projections by the Options Price Reporting Authority (OPRA) for the years 2014–2015 estimate a total of 26.9 to 28.7 billion messages per day, 17.7 to 19 million messages per second, and a maximum output rate of at least 1 million messages per second [10]. For example, the data covering the quotes and transactions from the major US exchanges (the trade and quote database [TAQ]) grow exponentially, now at a rate of hundreds of terabytes per year. Analysis of the operational structures of the underlying data generation entities partially explains the enormity and complexity of the massive data sets. For instance, unlike the traditional days of specialists and natural price discovery, today, the US stock market structure comprises an aggregation of different exchanges, broker-sponsored execution venues, and alternative trading systems, each of which contributes differently to the market data and volumes [11]. Specifically, around 14 exchanges, approximately 50 dark pools, and more than 200 international platforms or venues contribute around 66%, 13%, and 21% of the volume, respectively [8]. Orders are submitted through more than 2000 broker-dealers, and the system is governed by various regulatory agencies, including the SEC, self-regulatory organizations (SROs), and so forth [8]. The estimated average trading volumes for the market include about $50 billion to $100 billion in trade value, at least 2 billion order submissions, and 5 billion share trades [8,12]. The dynamic system environment comprising complex trade workflows (e.g., billions of trades or order submissions, price matches, executions, rejections, modifications, acknowledgments, etc.) and changing market trading practices such as HFT further contribute to the volume and complexity of the generated data. Figure 17.2 portrays an example showing the high-level view of a typical automated electronic trading system. The key blocks shown in the figure are representative of extremely intricate models, strategies, and data, among other factors, which add to the complexity of the overall system. In the United States, HFTs were estimated to account for more than 70% of equity trades in the year 2010 [11]. Besides adding to the volume, such evolving trading techniques are resulting in very high data generation speeds. For example, order matching and subsequent trade execution can now be accomplished in less than 100 μs via a colocated server in the exchange, and algorithmic trading can now be done within microseconds [11].
(Figure 17.2: high-level view of a typical automated electronic trading system; the key blocks include trading models with preprogrammed trading instructions, a securities master, market data and price ticks, the trading system, and orders, modifiers, cancels, and executions.)
order flows [13]. The data generated through the individual market platforms are disseminated to the market participants through different channels. High-speed market data are directly delivered to some entities through the principal electronic communication networks (ECNs) such as INET, whereas the data are delivered to a majority of other entities through distribution channels like the National Association of Securities Dealers Automated Quotations (NASDAQ) dissemination or the Consolidated Trade System (CTS), which directly or indirectly collects all US trades, and the Consolidated Quote System (CQS) [14]. The latter dissemination channel, however, collects data at a much slower pace compared to the speed at which the data are generated [14]. Different market participants often require and utilize data with different price granularities, and hence precision, depending upon their diverse trading objectives. For instance, high-frequency traders and market makers like the NYSE designated market makers (DMMs) generally utilize ultrahigh-frequency data (UHFD, i.e., tick data), are highly sensitive to small price changes, and deal with several thousands of orders per day. In contrast, investors like pension fund investors normally base their investment decisions on low-frequency or aggregated data, are not too sensitive to small price changes (e.g., at the intraday level), and usually deal with no more than a few hundred orders per day. Unregulated investors like hedge funds and other speculators like day traders, on the other hand, generally fall somewhere in between the investors and market makers. Figure 17.3 further exemplifies such market dynamics that exist among different market participants. All these different players and the inherent intricacies of the underlying processes complicate the order and other information flows. For example, about one-third of price discovery nowadays occurs in dark
pools, and even today, these pools remain largely unobservable. Such factors add to the complexity of the market structures and make it difficult to understand the transformations of orders into trades and how they drive the price discovery process.

Market participants usually deploy different trading strategies in line with their trading objectives. Typically, the raw market data, collected by these market agents, are refined, aggregated, and analyzed. Today, many market participants are resorting to novel ways of information discovery and incorporating additional information in their trading strategies, which involve integration of traditional data (e.g., orders and trades) with nontraditional data (e.g., sentiment data from social networks, trending over time, news) and exploratory and deep analysis of the available data through efficient interactive or ad hoc queries. There is no doubt that the vast amounts of information generated by such trading systems, along with the information that exists in the complex networks of legal and business relationships that define the modern financial system, hold the answers required to understand and accurately predict unexpected market events as well as address the most demanding questions that plague the financial systems and, hence, the regulators. In recent years, much of the focus has shifted to finding ways to extract the necessary answers from these ever-growing financial data within realistic time frames. However, to date, the problem remains challenging for many in this field not only because of the Big Data constraints but also due to factors like the lack of transparency; the absence of standardized communication protocols; and ill-defined workflows among different data generation systems, processes, and organizations.
address some of the limitations inherent in the traditional systems and offer unique capabilities to efficiently handle today's Big Data needs. The ultimate goal of a Big Data pipeline is to facilitate analytics on the available data. Big Data analytics provides the ability to infer actionable insights from massive amounts of data and can assist with information discovery during the process. It has become a core component that is being deployed and used by entities operating in various spheres of the financial field. For example, predictive analytics tools are increasingly being deployed by banks to predict and prevent fraud in real time. Predictive analytics applies techniques from data mining, data modeling, and statistics to identify relevant factors or interactions and predict future outcomes based on such interactions. Predictive analytics tools are also increasingly being used by market participants for tasks like decision making, improving trading strategies, and maximizing return on equities.
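As a simple illustration of the kind of predictive analytics described above, the following Python sketch scores incoming card transactions for fraud risk using a logistic regression model trained on labeled historical transactions. It is only a sketch of the general idea, not a description of any bank's actual deployment; the file names, feature columns, and alert threshold are hypothetical.

# Illustrative predictive-analytics sketch: score transactions for fraud risk.
# The CSV files, column names, and the 0.9 alert threshold are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

history = pd.read_csv("transactions_labeled.csv")   # past transactions with an is_fraud label
features = ["amount", "merchant_risk_score", "txn_per_hour", "is_foreign"]
X_train, X_test, y_train, y_test = train_test_split(
    history[features], history["is_fraud"], test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# Score new transactions and flag the riskiest ones for review.
incoming = pd.read_csv("transactions_new.csv")
incoming["fraud_probability"] = model.predict_proba(incoming[features])[:, 1]
alerts = incoming[incoming["fraud_probability"] > 0.9]
print(alerts.head())

In practice, such models are retrained and recalibrated continuously as new labeled cases arrive, which is exactly where the real-time data pipelines discussed in this section come into play.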
Big Data analytics is also being used in the capital markets for governance-based activities such as detection of illegal trading patterns and risk management. For example, NYSE Euronext has deployed a market surveillance platform that employs Big Data analytics to efficiently analyze billions of trades to detect new patterns of illegal trading within realistic time frames. The analytics platform allows Euronext to process approximately 2 terabytes of data every day, and this volume is expected to exceed 10 petabytes a day by 2015. The deployed infrastructure has been reported to decrease the time required to run market surveillance algorithms by more than 99% and improve the ability of regulators or compliance personnel to detect suspicious or illegal patterns in trading activities, allowing them to take proactive investigative action to mitigate risks [15]. Similarly, another system, the market information data analytics system (MIDAS), went online at the SEC in January 2013. MIDAS focuses on business data sources in the financial domain, with particular emphasis on the filings that companies are periodically required to make with the SEC and the Federal Deposit Insurance Corporation (FDIC). The system provides valuable insights about financial institutions at systemic and individual company levels by harnessing the value of market data as well as data archived by the SEC and FDIC. It processes about 1 terabyte of stocks, options, and futures data per day and millions of messages per second. The analytics system is being utilized by subscribers to analyze mini flash crashes, assess impacts of rule changes, and detect abnormal patterns in the captured data [8].
17.4.1 Challenges
Despite being aware of the significant promise that Big Data holds, many companies still have not started to make any investments to reap its benefits. A lot of companies today waste more than half of the data they already hold. If we assess the value of these data based on the Pareto principle, that is, 80% of the value comes from 20% of the data, then clearly, a lot of value is getting lost [6]. There is no doubt that the Big Data domain is rapidly evolving, but the domain is still immature in the financial field, with several factors slowing its growth and adoption in this domain. Data management and sharing have been a difficult problem for capital market firms for decades. The recent financial collapse and the mortgage/credit crisis of 2008 have uncovered some of the bottlenecks and inadequacies inherent in the information structure of the US financial systems. Factors like the lack of standardized data communication protocols, lack of transparency, complex interactions, and data quality gaps across different financial units in the system make it extremely difficult to unravel and connect different systems, processes, and organizations for any kind of analysis within realistic time frames. Many of these limiting factors represent an evolutionary outcome of years of mergers, internal fragmentations, and realignments within the financial institutions and have been made worse by business silos and inflexible information technology (IT) infrastructures. To date, the workflows in the system largely remain ill defined, and data reside in unconnected databases and/or spreadsheets, resulting in multiple formats and inconsistent definitions across different organizational units. Data integration remains point to point and occurs tactically in response to emergencies [9]. The convergence of such factors makes it almost infeasible to ingest, integrate, and analyze large-scale, heterogeneous data efficiently across different financial entities within a financial organization. The lack of best practices, data sharing procedures, quality metrics, mathematical modeling, and fact-based reasoning has left even the federal regulatory agencies unable to ingest market information in a timely manner, permit a proactive response, or even determine what information might be missing [9].
Financial institutions have historically spent vast sums on gathering, organizing, storing, analyzing, and reporting data through traditional data management approaches. As demonstrated in Section 17.2, the conventional data management structures are no longer sufficient to handle the massively large, high-velocity, heterogeneous financial data. Today, it is critical to deploy new supporting infrastructure components, data ingestion and integration platforms, as well as Big Data analytics and reporting tools, to handle extremely large-scale, often real-time, heterogeneous data sets. However, due to the varied data and business requirements of different organizational units, it is not realistic to expect a single solution to fit or even be equally applicable to all the financial entities. Moreover, due to the lack of well-defined solutions, an exploratory, incremental, and possibly iterative approach would be needed to devise customized and efficient Big Data solutions. Therefore, it is important for the financial organizations to devise their Big Data investment strategies with focus on their business-specific goals, like what is needed for risk management, product innovation, risk and market intelligence, cost reduction, services, operations, and so
forth. For instance, risk analytics and reporting requirements would necessitate a platform that could help consolidate risk measures and provide powerful risk management and reporting capabilities. A market data management system, on the other hand, would require a highly scalable platform that could store, process, and analyze massive amounts of heterogeneous data sets in real time.
Europe's ongoing sovereign debt fiasco, a large number of financial organizations did not successfully enforce the Basel Committee on Banking Supervision mandates with respect to their VaR estimations [31]. These adverse financial episodes further underscore the significance of extreme asset price movements, and hence of accurate volatility forecasts, for efficient risk mitigation through appropriate risk measurement and management measures that can adapt to changing market environments. For these reasons, it is not surprising that volatility estimation and inference have drawn a lot of attention in recent years. Another reason volatility is such a popular measure is that, unlike daily returns, which offer very little explanatory power and hence are difficult to predict, the volatility of daily returns, owing to its relatively high persistence and conditional dependence, is predictable [32,33]. Volatility clustering effects were first reported by Mandelbrot [34], who observed that periods of high volatility tend to be followed by periods of high volatility, and periods of low volatility by periods of low volatility. These factors explain the large number of contributions and research efforts dedicated to the measurement and prediction of volatility.
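To make the link between volatility forecasts and risk measurement concrete, one common parametric (textbook) formulation, not taken from the cited references, expresses the one-day-ahead VaR of a position of size $W$ directly in terms of the forecast conditional mean $\hat{\mu}_{t+1}$ and volatility $\hat{\sigma}_{t+1}$ of daily returns:

\mathrm{VaR}_{t+1}^{\alpha} \approx -W\left(\hat{\mu}_{t+1} + z_{\alpha}\,\hat{\sigma}_{t+1}\right),

where $z_{\alpha}$ is the lower $\alpha$-quantile of the assumed return distribution (e.g., $z_{0.05} \approx -1.645$ under normality, so the expression yields a positive loss figure). A more accurate volatility forecast therefore translates directly into a more accurate VaR estimate, which is why the forecasting literature discussed next matters for risk management.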
Despite several advancements, accurate measurement of ex post volatility remains nontrivial, largely because volatility is unobservable and cannot be measured directly from the data [27]. In the past, several models have been developed to forecast volatility. The first popular volatility model was the autoregressive conditional heteroskedasticity (ARCH) model introduced by Engle [35]. Subsequently, Bollerslev [36] proposed a more generalized representation of the ARCH model, namely, the generalized autoregressive conditional heteroskedasticity (GARCH) model. These models have been followed by a large number of models based on different variations of the original ARCH model. These include parametric models like the Glosten–Jagannathan–Runkle GARCH (GJR-GARCH), fractionally integrated GARCH (FIGARCH), component GARCH (CGARCH), stochastic volatility (SV), and other models [37–43].
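For reference, the GARCH(1,1) specification of Bollerslev [36] writes the return innovation as $\varepsilon_t = \sigma_t z_t$, with $z_t$ i.i.d. with zero mean and unit variance, and models the conditional variance recursively as

\sigma_t^2 = \omega + \alpha\,\varepsilon_{t-1}^2 + \beta\,\sigma_{t-1}^2, \qquad \omega > 0,\ \alpha \ge 0,\ \beta \ge 0,

so that today's variance responds both to yesterday's squared shock (the ARCH term) and to yesterday's variance (the GARCH term); the ARCH(q) model of Engle [35] is the special case without lagged variance terms, and the named variants above modify this recursion to capture asymmetry, long memory, or component structure.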
The ARCH class of models incorporates time variation in the conditional distribution mainly through the conditional variance and is geared towards capturing the heavy-tail and long-memory effects in volatility. The stochastic volatility models are usually based on the assumption that the repeated low-frequency observations of asset return patterns are generated by an underlying but unknown stochastic process [44]. These models have been successful in explaining several empirical features of financial return series, such as heavy tails and long-memory effects in volatility. Since their introduction, an extensive literature has developed on modeling the conditional distribution of stock prices, interest rates, and exchange rates [45]. This class of models has been used extensively in the literature to capture the dynamics of the volatility process. Until recent years, the ARCH model and its variants had been used by many in the field to model asset return volatility dynamics for daily, weekly, or higher-interval data across multiple asset classes and institutional settings [37,46]; however, despite their provably good in-sample forecasting performance, their out-of-sample forecasting performance remains questionable. Many studies in the past have reported insignificant forecasting ability for this class of models [47–50], based on the low correlation coefficients obtained in the assessments. These models usually forecast volatility based on low-frequency asset return data at daily or longer time
that is, as the length of the intraday intervals approaches zero [61]. Specifically, it has been shown empirically that the sum of squared high-frequency intraday returns of an asset can be used as an approximation to the daily volatility [52,53,61–67]. This quadratic variation measure is known as the realized volatility estimator of the daily integrated volatility.
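In symbols, using a standard textbook formulation consistent with the description above, if day $t$ is divided into $M$ intraday intervals with log returns $r_{t,i}$, $i = 1, \dots, M$, the realized volatility (realized variance) estimator is

\mathrm{RV}_t = \sum_{i=1}^{M} r_{t,i}^2,

which, in the absence of trading frictions, converges to the integrated variance of day $t$ as the sampling interval shrinks.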
Moreover, the forecasting performance of this estimator has been shown to be superior to that of standard ARCH-type models [52]. Based on the theoretical and empirical properties of realized volatility, several other studies also confirm that precise volatility forecasts can be obtained using UHFD [61,63,68,69]. However, the experimental results do not exactly match the empirical or theoretical justifications, which rely mainly on limit theory and suggest that, by increasing the observation frequency of asset returns and representing the true integrated volatility of the underlying returns process via realized volatility, one can obtain more efficient, less noisy estimates than those obtained using low-frequency data, such as daily data [52,64,70–72].
This discrepancy can be attributed primarily to the presence of market microstructure noise. Market microstructure noise refers collectively to the vast array of frictions inherent in the trading process. These imperfections arise from several factors, such as the bid–ask bounce, infrequent trading, the discreteness of price changes, the varied informational content of price changes, and so forth. For example, changes in market prices occur in discrete units. Specifically, prices usually fluctuate between the bid and the ask prices (the bid–ask spread), and multiple prices may be quoted simultaneously by competing market participants, consistent with the heterogeneous market hypothesis [73], thereby resulting in market microstructure frictions. At ultrahigh frequencies, the problem gets worse because the volatility of the true price process shrinks with the time interval, while the volatility of the noise component remains largely unchanged [74]. Consequently, at extremely high frequencies, the observed market prices reflect values contributed largely by the noise component rather than by the unobserved true price component. As a result, realized volatility measures become biased and are thus not robust at ultrahigh frequencies when the price is contaminated with noise and the bias is not accounted for during subsequent evaluations [52,75,76]. Market microstructure noise effects and their impact in high-frequency scenarios have been discussed and analyzed in several studies [74–81].
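One common practical response to this bias is to sample returns sparsely (for example, every 5 minutes) rather than at every tick. The short Python sketch below illustrates the computation of daily realized volatility under such sparse sampling; the file name, column names, and the 5-minute grid are illustrative assumptions rather than recommendations from the cited studies.

# Daily realized volatility from tick data, sampled on a sparse 5-minute grid
# to reduce microstructure-noise bias. File and column names are illustrative.
import numpy as np
import pandas as pd

ticks = pd.read_csv("trades.csv", parse_dates=["timestamp"])
prices = ticks.set_index("timestamp")["price"]

# Previous-tick sampling: last observed price in each 5-minute interval.
sampled = prices.resample("5min").last().dropna()
log_returns = np.log(sampled).diff().dropna()

# Realized variance per day = sum of squared intraday returns
# (for simplicity, the overnight return is attributed to the following day).
realized_variance = (log_returns ** 2).groupby(log_returns.index.date).sum()
realized_volatility = np.sqrt(realized_variance)
print(realized_volatility.head())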
Since the inception of the high-frequency-based realized volatility measure, several alternative volatility estimation measures have evolved in an attempt to improve upon the basic high-frequency-based realized volatility measure and cater to the growing demands imposed by the availability of trading data at ultrahigh frequencies. For instance, in recent years, many new methods have been developed to estimate spot or instantaneous volatility, for example, volatility per time unit or per transaction, using high-frequency data. Spot volatility is particularly useful to high-frequency traders, who often strategize using the finest granularity of available tick data and to whose operations the immediate revelation of sudden price movements is critical. Various high-frequency data-based methods for spot volatility estimation have been proposed in the literature, ranging from nonlinear state space-based models to nonlinear market microstructure noise-based and particle filter-based models [82–88].
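To convey the flavor of these approaches, one simple kernel-weighted formulation of spot variance (a sketch of the general idea, not the exact estimator of any single cited paper) smooths squared returns locally around the time point of interest $\tau$:

\hat{\sigma}^2(\tau) = \frac{\sum_{i} K_h(t_i - \tau)\, r_i^2}{\sum_{i} K_h(t_i - \tau)\,\Delta_i},

where $r_i$ is the return over the interval $[t_{i-1}, t_i]$ of length $\Delta_i$ and $K_h$ is a kernel with bandwidth $h$; shrinking $h$ localizes the estimate in time, at the cost of using fewer observations and hence producing a noisier estimate.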
more time scales, respectively, to estimate integrated volatility. These estimators have been shown to be robust to time-series-dependent noise [14] and better than the classic realized volatility measure. Other estimators, like the kernel-based estimators, have also been shown to capture important characteristics of market microstructure noise and to outperform the classic realized volatility measure [80]. UHFD also allow learning about jumps in the price process. Jumps represent discontinuous variations or movements in the price process that cannot be reconciled with the observed (diffusive) volatility [104]. In the last couple of years, several parametric and nonparametric methods have been developed that distinguish between jumps and the continuous price process in order to estimate time-varying volatility robustly in the presence of jumps [105–110]. To date, bipower variation (BPV) and its variants remain among the dominant methods used to model jumps. These methods are based on sums of powers and products of powers of absolute returns and have been shown to be robust to rare jumps in the log-price process [56,109,110].
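Concretely, the realized bipower variation of Barndorff-Nielsen and Shephard [56] for day $t$ with $M$ intraday returns is

\mathrm{BPV}_t = \mu_1^{-2} \sum_{i=2}^{M} |r_{t,i}|\,|r_{t,i-1}|, \qquad \mu_1 = \sqrt{2/\pi},

which remains consistent for the continuous (integrated variance) component of price variation even in the presence of rare jumps; the difference $\max(\mathrm{RV}_t - \mathrm{BPV}_t,\, 0)$ then provides a simple nonparametric estimate of the jump contribution to daily price variation [56,134].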
All the high-frequency data-based methods discussed so far are mostly applicable to univariate or single-asset volatility estimation scenarios. Nowadays, volatility estimation in multiple-asset scenarios is becoming increasingly important. Precise estimation of the covariance matrix of multiple asset returns is central to many issues in finance, such as portfolio risk assessment and asset pricing. The availability of UHFD has spurred its use in many recent covariance estimation methods. This is mainly because UHFD better reflect the underlying asset return processes, offer better statistical efficiency, and can greatly improve the accuracy of covariance matrix estimates. However, as discussed earlier in this section, the use of UHFD can introduce noise as a result of market microstructure frictions. Estimation methods in multiple-asset scenarios face additional difficulties due to nonsynchronous trading. These issues arise because the transactions for different assets occur at different, random points in time and are thus nonsynchronous. Due to this mismatch in the time points of the recorded transactions, returns sampled at regular intervals in calendar time will correlate with the previous or successive returns on other assets even in the absence of any underlying correlation structure [111,112]. This phenomenon, called the Epps effect, causes the covariance estimator to be biased towards zero as the sampling frequency increases [113]. Two key approaches have generally been used in the past to address these issues. One attempts to reduce the microstructure noise effects through the use of lead and lag autocovariance terms in the realized covariance estimator based on synchronized returns, whereas the other produces unbiased estimates of the covariance matrix by using the cross-products of all fully and partially overlapping event-time returns [114]. In recent years, several high-frequency data-based volatility estimation methods have been proposed for multiple-asset scenarios, starting with the realized covariance estimator, which is basically the sum of cross-products of intraday returns [115]. Like the realized volatility measure, this measure also suffers from the impact of microstructure noise [114,116–120]. Since then, many methods have been developed to deal with the problems inherent in the multiple-asset setting, like market microstructure noise and the bias due to nonsynchronicity [114,116,118,119,121–123]. Many market participants often need to estimate matrices comprising a large number of assets using high-frequency data. However, many existing estimators can only be used for a small number of asset classes
and become inconsistent as the size of the matrix approaches or exceeds the sample size [124]. A few recent methods suggest different ways to estimate large volatility matrices, some of which are robust to the presence of microstructure noise and nonsynchronicity effects. They incorporate different schemes, ranging from the use of factor models and low-frequency dynamic models to pairwise and all-refresh time schemes [112,124–126].
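For completeness, the basic realized covariance estimator mentioned above can be written, in a standard form that assumes the returns have already been synchronized onto a common intraday grid, as

\widehat{\mathrm{Cov}}_t = \sum_{i=1}^{M} \mathbf{r}_{t,i}\, \mathbf{r}_{t,i}^{\top},

where $\mathbf{r}_{t,i}$ is the vector of asset returns over the $i$th interval of day $t$. Without careful synchronization (e.g., refresh-time sampling), the off-diagonal elements of this estimator are biased towards zero at high sampling frequencies, which is precisely the Epps effect described above.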
the data into a usable form [25,140]. Also, as previously discussed in Sections 17.5.1 through 17.5.4, the use of UHFD generally requires a trade-off between the precision of the estimation technique and the bias induced by market microstructure noise effects. The bias increases with the sampling frequency, which renders the estimator less accurate at high sampling frequencies. The organizational structure and institutional evolution of the equity markets further exacerbate such errors in UHFD [14]. These issues represent some of the challenges being faced by both researchers and practitioners with regard to UHFD. Despite several advancements in this field, a lot more remains to be done to efficiently utilize the available UHFD and extract the best value out of them.
17.6 SUMMARY
The financial industry has always been a data-intensive industry. Recent technological advancements coupled with several other factors, like changing customer preferences and changing business needs, have led to the generation and consumption of prolific amounts of data. Several changes in the last couple of years, driven by the confluence of factors like escalating regulatory pressures, ever-increasing compliance requirements, regulatory oversight, global economic instability, increasing competition in the global markets, growing business demands, growing pressures to optimize capital and liquidity, and so forth, are forcing financial organizations to rethink and restructure the way they do business. Also, the traditional data management practices prevalent in finance can no longer effectively cope with the ever-increasing, huge, and rapid influx of heterogeneous data originating from a wide range of internal processes and external sources, including social media, blogs, audio, and video. Consequently, a growing number of financial institutions are resorting to Big Data to strategize their business decisions based on reliable factual insights supported by real data rather than just intuition. Increasingly, Big Data is being utilized in several areas such as investment analysis, econometrics, risk assessment, fraud detection, trading, customer interactions analysis, and behavior modeling. Efficient utilization of Big Data has become essential to the progress and success of many in this data-driven industry. However, Big Data by itself does not hold much value, and not all of it may be useful at all times. To gain relevant insights from the data, it is very important to deploy efficient solutions that can help analyze, manage, and utilize data. Many solutions have already been deployed in the financial domain to manage relevant data and perform analytics on them. Despite such advancements, many in the industry still lack the ability to address their Big Data needs. This is most likely due to the fact that different organizational units usually have different domain-specific requirements and, hence, solution specifications. So, it is highly unlikely that a solution deployed by one unit would be equally useful to others. This chapter described the impact of Big Data on the financial industry and presented some of the key transformations being driven by the data today. The availability of UHFD has resulted in significant advancements in the field of finance and has made it possible for empirical researchers to address problems that could not be handled using data collected at lower frequencies. Besides showcasing the impact of Big Data in the industrial sector, the chapter also highlighted the Big Data–driven progress in research in the fields of finance, financial econometrics, and statistics. The chapter exemplified some of the key developments in this area, with special focus on financial risk
measurement and management through the use of UHFD-based volatility metrics. In recent years, the increased availability of ultrahigh-frequency trading data has spurred strong growth in the development of techniques that exploit tick-level intraday price data to produce better volatility forecasts and overcome some of the limitations inherent in traditional low-frequency-based systems. The availability of high-frequency data coupled with recent technological advancements has made analysis of large-scale trading data more accessible to market participants. Ultrahigh-frequency-based estimators have important implications in many areas of finance, especially risk assessment and management of single assets and portfolios. Among other things, such data have been shown to be extremely useful in the meaningful ex post evaluation of daily volatility forecasts. However, various factors, such as the underlying market structure, market dynamics, internal process flows, trading frequency, and so forth, present several difficulties in the effective utilization of the vast amounts of high-frequency data available today. In particular, market microstructure noise effects, such as those due to the bid–ask bounce and infrequent trading, introduce a significant bias into estimation procedures based on high-frequency data. The chapter also touched upon the trade-off that is often required between the precision of the estimation technique and the bias induced by market microstructure noise effects. The availability of Big Data in the financial domain has opened up new avenues for innovation and presented immense opportunities for growth and sustainability. Despite significant progress in the development and adoption of Big Data–based solutions in the field, a lot more remains to be done to effectively utilize the relevant data available in this domain and extract insightful information out of it. Compared to the promise Big Data holds in this domain and its potential, the progress in this field is still in its nascent stages, and much more growth in this area remains to be seen in the coming years.
REFERENCES
1. Zervoudakis, F., Lawrence, D., Gontikas, G., and Al Merey, M., Perspectives on high-frequency
trading. Available at http://www0.cs.ucl.ac.uk/staff/f.zervoudakis/documents/Perspectives_on
_High-Frequency_Trading.pdf (accessed February 2014).
2. Connors, S., Courbe, J., and Waishampayan, V., Where have you been all my life? How the
financial services industry can unlock the value in Big Data. PwC FS Viewpoint, October 2013.
3. Gutierrez, D., Why Big Data matters to finance, 2013. Available at http://www.inside-bigdata
.com (accessed February 2014).
4. Versace, M., and Massey, K., The case for Big Data in the financial services industry. IDC
Financial Insights, White paper, September 2012.
5. Flood, M., Mendelowitz, A., and Nichols, W., Monitoring financial stability in a complex world.
In: Lemieux, V. (Ed.), Financial Analysis and Risk Management: Data Governance, Analytics
and Life Cycle Management. Heidelberg: Springer Berlin, pp. 15–45, 2013.
6. Brett, L., Love, R., and Lewis, H., Big Data: Time for a lean approach in financial services. A
Deloitte Analytics paper, 2012.
7. Oracle Corporation, Financial services data management: Big Data technology in financial ser-
vices. Oracle Financial Services, An Oracle White paper, June 2012.
8. Rauchman, M., and Nazaruk, A., Big Data in Capital Markets. Proceedings of the 2013
International Conference on Management of Data. New York: ACM, 2013.
9. Jagadish, H.V., Kyle, A., and Raschid, L., Envisioning the next generation financial cyberinfra-
structure: Transforming the monitoring and regulation of systemic risk. Available at ftp://ftp
.umiacs.umd.edu/incoming/louiqa/PUB2012/HBR-NextGen_v4.pdf (last accessed 2013).
34. Mandelbrot, B., The variation of certain speculative prices. Journal of Business, 36(4), 394–419,
1963.
35. Engle, R., Autoregressive conditional heteroscedasticity with estimates of the variance of
United Kingdom inflation. Econometrica, 50(4), 987–1007, 1982.
36. Bollerslev, T., Generalized autoregressive conditional heteroscedasticity. Journal of Econo-
metrics, 31(3), 307–327, 1986.
37. Bollerslev, T., Chou, R.Y., and Kroner, K.F., ARCH modeling in finance: A review of the theory
and empirical evidence. Journal of Econometrics, 52, 5–59, 1992.
38. Gourieroux, C., ARCH Models and Financial Applications. New York: Springer, 1997.
39. Shephard, N., Statistical aspects of ARCH and stochastic volatility. In: Cox, D.R., Hinkley, D.V.,
and Barndorff-Nielsen, O.E. (Eds.), Time Series Models in Econometrics, Finance, and Other
Fields. London: Chapman and Hall, pp. 1–67, 1996.
40. Wang, Y., Asymptotic nonequivalence of ARCH models and diffusions. The Annals of Statistics,
30, 754–783, 2002.
41. Glosten, L.R., Jagannathan, R., and Runkle, D.E., On the relation between the expected value
and the volatility of the nominal excess return on stocks. Journal of Finance, 48, 1779–1801,
1993.
42. Engle, R.F., and Lee, G.G., A permanent and transitory component model of stock return vola-
tility. In: Engle, R.F., and White, H. (Eds.), Cointegration, Causality and Forecasting: A Festschrift
in Honour of Clive W. J. Granger. Oxford: Oxford University Press, pp. 475–497, 1999.
43. Baillie, R.T., Bollerslev, T., and Mikkelsen, H.O., Fractionally integrated generalized autoregres-
sive conditional heteroskedasticity. Journal of Econometrics, 73, 3–20, 1996.
44. Müller, H.-G., Sen, R., and Stadtmüller, U., Functional data analysis for volatility. Journal of
Econometrics, 165(2), 233–245, 2011.
45. Ghysels, E., and Sinko, A., Volatility forecasting and microstructure noise. Journal of
Econometrics, 160(1), 257–271, 2011.
46. Bollerslev, T., Engle, R., and Nelson, D., ARCH models. In: Engle, R., and McFadden, D. (Eds.),
Handbook of Econometrics, Vol. IV. Amsterdam: Elsevier, 1994.
47. Akgiray, V., Conditional heteroskedasticity in time series of stock returns: Evidence and fore-
casts. Journal of Business, 62, 55–80, 1989.
48. Brailsford, T.F., and Faff, R.W., An evaluation of volatility forecasting techniques. Journal of
Banking and Finance, 20, 419–438, 1996.
49. Figlewski, S., Forecasting volatility. Financial Markets, Institutions and Instruments, 6, 1–88, 1997.
50. Frances, P.H., and Van Dijk, D., Forecasting stock market volatility using (non-linear) GARCH
models. Journal of Forecasting, 15, 229–235, 1995.
51. Cartea, Á., and Karyampas, D., Volatility and covariation of financial assets: A high-frequency
analysis. Journal of Banking and Finance, 35(12), 3319–3334, 2011.
52. Andersen, T.G., Bollerslev, T., Diebold, F.X., and Labys, P., Modeling and forecasting realized
volatility. Econometrica, 71, 579–625, 2003.
53. Barndorff-Nielsen, O., and Shephard, N., Non-Gaussian Ornstein–Uhlenbeck based models and
some of their applications in financial economics. Journal of the Royal Statistical Society B,
63, 167–241, 2001.
54. Chortareas, G., Jiang, Y., and Nankervis, J.C., Forecasting exchange rate volatility using high-
frequency data: Is the euro different? International Journal of Forecasting, 27(4), 1089–1107, 2011.
55. Martens, M., and van Dijk, D., Measuring volatility with the realized range. Journal of
Econometrics, 138(1), 181–207, 2007.
56. Barndorff-Nielsen, O.E., and Shephard, N., Power and bipower variation with stochastic vola-
tility and jumps. Journal of Financial Econometrics, 2(1), 1–48, 2004.
57. Grammig, J., and Wellner, M., Modeling the interdependence of volatility and inter-transaction
duration processes. Journal of Econometrics, 106(2), 369–400, 2002.
58. Bollerslev, T., and Wright, J.H., High-frequency data, frequency domain inference, and volatil-
ity forecasting. Review of Economics and Statistics, 83(4), 596–602, 2001.
59. Comte, F., and Renault, E., Long-memory in continuous-time stochastic volatility models.
Mathematical Finance, 8, 291–323, 1998.
60. McAleer, M., and Medeiros, M., Realized volatility: A review. Econometric Reviews, 27, 10–45,
2008.
61. Andersen, T.G., Bollerslev, T., Diebold, F.X., and Labys, P., The distribution of realized exchange
rate volatility. Journal of the American Statistical Association, 96, 42–55, 2001.
62. Andersen, T.G., and Bollerslev, T., Answering the skeptics: Yes, standard volatility models do
provide accurate forecasts. International Economic Review, 39, 885–905, 1998.
63. Andersen, T.G., and Bollerslev, T., Deutsche Mark–dollar volatility: Intraday activity patterns, mac-
roeconomic announcements, and longer run dependencies. Journal of Finance, 53, 219–265, 1998.
64. Andersen, T.G., Bollerslev, T., Diebold, F.X., and Ebens, H., The distribution of realized stock
return volatility. Journal of Financial Economics, 61, 43–76, 2001.
65. Barndorff-Nielsen, O.E., and Shephard, N., Econometric analysis of realized volatility and its
use in estimating stochastic volatility models. Journal of the Royal Statistical Society B, 64,
253–280, 2002.
66. Barndorff-Nielsen, O.E., and Shephard, N., Estimating quadratic variation using realized vari-
ance. Journal of Applied Econometrics, 17, 457–477, 2002.
67. Andreou, E., and Ghysels, E., Rolling-sample volatility estimators: Some new theoretical, simu-
lation and empirical results. Journal of Business and Economic Statistics, 20, 363–376, 2002.
68. Andersen, T., Bollerslev, T., and Meddahi, N., Correcting the errors: Volatility forecast evalua-
tion using high-frequency data and realized volatilities. Econometrica, 73, 279–296, 2005.
69. Koopman, S., Jungbacker, B., and Hol, E., Forecasting daily variability of the S&P100 stock
index using historical, realised and implied volatility measurements. Journal of Empirical
Finance, 12, 445–475, 2005.
70. Bollerslev, T., Tauchen, G., and Zhou, H., Expected stock returns and variance risk premia.
Review of Financial Studies, 22, 4463–4492, 2009.
71. Bollerslev, T., Gibson, M., and Zhou, H., Dynamic estimation of volatility risk premia and
investor risk aversion from option implied and realized volatilities. Journal of Econometrics,
160, 235–245, 2011.
72. Ghysels, E., and Sinko, A., Comment. Journal of Business and Economic Statistics, 24, 192–194,
2006.
73. Müller, U.A., Dacorogna, M.M., Davé, R.D., Pictet, O.V., Olsen, R.B., and Ward, J.R., Fractals
and intrinsic time—A challenge to econometricians. In: International AEA Conference on Real
Time Econometrics, Luxembourg, October 14–15, 1993.
74. Aït-Sahalia, Y., Mykland, P.A., and Zhang, L., How often to sample a continuous-time process
in the presence of market microstructure noise. Review of Financial Studies, 18, 351–416, 2005.
75. Bandi, F.M., and Russell, J.R., Microstructure noise, realized variance, and optimal sampling.
Review of Economic Studies, 75, 339–369, 2008.
76. Zhang, L., Mykland, P., and Aït-Sahalia, Y., A tale of two time scales: Determining integrated
volatility with noisy high frequency data. Journal of the American Statistical Association,
100, 1394–1411, 2005.
77. Zhang, L., Efficient estimation of stochastic volatility using noisy observations: A multi-scale
approach. Bernoulli, 12, 1019–1043, 2006.
78. Barndorff-Nielsen, O.E., Hansen, P.R., Lunde, A., and Shephard, N., Designing realised ker-
nels to measure the ex-post variation of equity prices in the presence of noise. Econometrica,
76, 1481–1536, 2008.
79. Barndorff-Nielsen, O.E., Hansen, P.R., Lunde, A., and Shephard, N., Subsampling realised ker-
nels. Journal of Econometrics, 160, 204–219, 2011.
80. Hansen, P.R., and Lunde, A., Realized variance and market microstructure noise (with com-
ments and rejoinder). Journal of Business and Economic Statistics, 24, 127–218, 2006.
81. Barucci, E., Magno, D., and Mancino, M., Fourier volatility forecasting with high-frequency
data and microstructure noise. Quantitative Finance, 12(2), 281–293, 2012.
82. Harris, L., Estimation of stock price variances and serial covariances from discrete observa-
tions. Journal of Financial and Quantitative Analysis, 25, 291–306, 1990.
83. Zeng, Y., A partially observed model for micro movement of asset prices with Bayes estimation
via filtering. Mathematical Finance, 13, 411–444, 2003.
84. Fan, J., and Wang, Y., Spot volatility estimation for high-frequency data. Statistics and Its
Interface, 1, 279–288, 2008.
85. Bos, C.S., Janus, P., and Koopman, S.J., Spot variance path estimation and its application to high
frequency jump testing. Discussion Paper TI 2009-110/4, Tinbergen Institute, 2009.
86. Kristensen, D., Nonparametric filtering of the realized spot volatility: A kernel-based approach.
Econometric Theory, 26, 60–93, 2010.
87. Munk, A., and Schmidt-Hieber, J., Nonparametric estimation of the volatility function in a
high-frequency model corrupted by noise. Unpublished manuscript, 2009.
88. Zu, Y., and Boswijk, P., Estimating spot volatility with high frequency financial data. Preprint,
University of Amsterdam, 2010.
89. Shephard, N., and Sheppard, K., Realising the future: Forecasting with high frequency-based
volatility (HEAVY) models. Journal of Applied Econometrics, 25, 197–231, 2010.
90. Brownlees, C.T., and Gallo, G.M., Comparison of volatility measures: A risk management per-
spective. Journal of Financial Econometrics, 8, 29–56, 2010.
91. Maheu, J.M., and McCurdy, T.J., Do high-frequency measures of volatility improve forecasts of
return distributions? Journal of Econometrics, 160, 69–76, 2011.
92. Hansen, P.R., Huang, Z., and Shek, H.H., Realized GARCH: A joint model for returns and real-
ized measures of volatility. Journal of Applied Econometrics, 27, 877–906, 2012.
93. Andersen, T.G., Bollerslev, T., and Meddahi, N., Analytical evaluation of volatility forecasts.
International Economic Review, 45, 1079–1110, 2004.
94. Ghysels, E., Santa-Clara, P., and Valkanov, R., Predicting volatility: Getting the most out of
return data sampled at different frequencies. Journal of Econometrics, 131, 59–95, 2006.
95. Andersen, T.G., Bollerslev, T., Frederiksen, P.H., and Nielsen, M.Ø., Continuous-time models,
realized volatilities, and testable distributional implications for daily stock returns. Working
paper, Northwestern University, 2006.
96. Corsi, F., A simple approximate long-memory model of realized volatility. Journal of Financial
Econometrics, 7(2), 174–196, 2009.
97. Boudt, K., Cornelissen, J., and Payseur, S., High-frequency: Toolkit for the analysis of high fre-
quency financial data in R. Available at http://highfrequency.herokuapp.com (accessed August
2013).
98. Andersen, T.G., Bollerslev, T., Diebold, F.X., and Labys, P., Great realizations. Risk, 13, 105–108,
2000.
99. Hafner, C.M., Cross-correlating wavelet coefficients with applications to high-frequency finan-
cial time series. Journal of Applied Statistics, 39(6), 1363–1379, 2012.
100. Podolskij, M., and Vetter, M., Estimation of volatility functionals in the simultaneous presence
of microstructure noise and jumps. Bernoulli, 15(3), 634–658, 2009.
101. Jacod, J., Li, Y., Mykland, P.A., Podolskij, M., and Vetter, M., Microstructure noise in the con-
tinuous case: The pre-averaging approach. Stochastic Processes and Their Applications, 119(7),
2249–2276, 2009.
102. Jacod, J., and Protter, P., Asymptotic error distributions for the Euler method for stochastic dif-
ferential equations. The Annals of Probability, 26, 267–307, 1998.
103. Meddahi, N., A theoretical comparison between integrated and realized volatility. Journal of
Applied Econometrics, 17, 479–508, 2002.
104. Roberto, R., Jump-diffusion models: Including discontinuous variation. Lecture Notes,
November 2011. Available at http://www.econ-pol.unisi.it/fm20/jump_diffusion_notes.pdf
(accessed May 2013).
105. Mancini, C., Estimation of the characteristics of the jumps of a general Poisson-diffusion process.
Scandinavian Actuarial Journal, 1, 42–52, 2004.
106. Aït-Sahalia, Y., and Jacod, J., Volatility estimators for discretely sampled Lévy processes. Annals
of Statistics, 35, 335–392, 2007.
107. Jacod, J., Asymptotic properties of realized power variations and related functionals of semi-
martingales. Stochastic Processes and Their Applications, 118, 517–559, 2008.
108. Jacod, J., Statistics and high frequency data. In: Kessler, M., Lindner, A., and Sorensen, M.
(Eds.), Statistical Methods for Stochastic Differential Equations. Boca Raton, FL: Chapman and
Hall, pp. 191–310, 2012.
109. Barndorff-Nielsen, O.E., and Shephard, N., Econometrics using bipower variation. Journal of
Financial Econometrics, 4, 1–30, 2006.
110. Barndorff-Nielsen, O.E., Graversen, S.E., Jacod, J., Podolskij, M., and Shephard, N., A cen-
tral limit theorem for realised power and bipower variations of continuous semimartingales.
In: Kabanov, Y., Lipster, R., and Stoyanov, J. (Eds.), From Stochastic Analysis to Mathematical
Finance, Festschrift for Albert Shiryaev. New York: Springer, pp. 33–68, 2006.
111. Fisher, L., Some new stock-market indexes. Journal of Business, 39(1–2), 191–225, 1966.
112. Tao, M., Large volatility matrix inference via combining low frequency and high-frequency
approaches. Journal of the American Statistical Association, 106(495), 1025–1040, 2011.
113. Epps, T.W., Co-movements in stock prices in the very short run. Journal of the American
Statistical Association, 74(366), 291–298, 1979.
114. Griffin, J.E., and Oomen, R.A., Covariance measurement in the presence of non-synchronous
trading and market microstructure noise. Journal of Econometrics, 160(1), 58–68, 2011.
115. Barndorff-Nielsen, O.E., and Shephard, N., Econometric analysis of realized covariation: High
frequency based covariance, regression, and correlation in financial economics. Econometrica,
72, 885–925, 2004.
116. Hayashi, T., and Yoshida, N., On covariance estimation of non-synchronously observed diffu-
sion processes. Bernoulli, 11, 359–379, 2005.
117. Sheppard, K., Realized covariance and scrambling. Unpublished manuscript, 30, 2006.
118. Zhang, L., Estimating covariation: Epps effect and microstructure noise. Journal of Econometrics,
160(1), 33–47, 2011.
119. Voev, V., and Lunde, A., Integrated covariance estimation using high-frequency data in the
presence of noise. Journal of Financial Econometrics, 5(1), 68–104, 2007.
120. Bandi, F.M., Russell, J.R., and Zhu, Y., Using high-frequency data in dynamic portfolio choice.
Econometric Reviews, 27, 163–198, 2008.
121. Bandi, F.M., and Russell, J.R., Realized covariation, realized beta and microstructure noise.
Working paper, Graduate School of Business, University of Chicago, 2005.
122. Martens, M., Estimating unbiased and precise realized covariances. In: EFA 2004 Maastricht
Meetings Paper No. 4299, June 2004.
123. Hautsch, N., Kyj, L., and Oomen, R., A blocking and regularization approach to high dimen-
sional realized covariance estimation. Journal of Applied Econometrics, 27, 625–645, 2012.
124. Tao, M., Wang, Y., and Chen, X., Fast convergence rates in estimating large volatility matrices
using high-frequency financial data. Econometric Theory, 29(4), 838–856, 2013.
125. Noureldin, D., Shephard, N., and Sheppard, K., Multivariate high-frequency-based volatility
(HEAVY) models. Journal of Applied Econometrics, 27(6), 907–933, 2012.
126. Fan, J., Li, Y., and Yu, K., Vast volatility matrix estimation using high-frequency data for port-
folio selection. Journal of the American Statistical Association, 107(497), 412–428, 2012.
127. Giot, P., and Laurent, S., Modeling daily value-at-risk using realized volatility and ARCH type
models. Journal of Empirical Finance, 11, 379–398, 2004.
128. Beltratti, A., and Morana, C., Statistical beneits of value-at-risk with long memory. Journal of
Risk, 7, 21–45, 2005.
129. Angelidis, T., and Degiannakis, S., Volatility forecasting: Intra-day versus inter-day models.
Journal of International Financial Markets, Institutions and Money, 18, 449–465, 2008.
130. Martens, M., van Dijk, D., and Pooter, M., Forecasting S&P 500 volatility: Long memory,
level shits, leverage efects, day of the week seasonality and macroeconomic announcements.
International Journal of Forecasting, 25, 282–303, 2009.
131. Louzis, D.P., Xanthopoulos-Sisinis, S., and Refenes, A., Are realized volatility models good
candidates for alternative value at risk prediction strategies? Germany: University Library of
Munich, 2011.
132. Shao, X.D., Lian, Y.J., and Yin, L.Q., Forecasting value-at-risk using high frequency data: he
realized range model. Global Finance Journal, 20, 128–136, 2009.
133. Christensen, K., and Podolskij, M., Realized range-based estimation of integrated variance.
Journal of Econometrics, 141, 323–349, 2007.
134. Huang, X., and Tauchen, G., he relative contribution of jumps to total price variation. Journal
of Financial Econometrics, 3, 456–499, 2005.
135. De Rossi, G., Zhang, H., Jessop, D., and Jones, C., High frequency data for low frequency man-
agers: Hedging market exposure. White paper, UBS Investment Research, 2012.
136. Aït-Sahalia, Y., and Saglam, M., High frequency traders: Taking advantage of speed
(No. w19531). National Bureau of Economic Research, 2013.
137. Brogaard, J., Hendershott, T., and Riordan, R., High frequency trading and price discovery,
April 2013. Available at http://ssrn.com/abstract=1928510.
138. Bollerslev, T., and Todorov, V., Tails, fears and risk premia. Journal of Finance, 66(6), 2165–
2211, 2011.
139. Todorov, V., and Bollerslev, T., Jumps and betas: A new framework for disentangling and esti-
mating systematic risks. Journal of Econometrics, 157(2), 220–235, 2010.
140. Brownlees, C.T., and Gallo, G.M., Financial econometric analysis at ultra-high frequency: Data
handling concerns. Computational Statistics and Data Analysis, 51(4), 2232–2245, 2006.
CHAPTER 18
Semantic-Based Heterogeneous Multimedia Big Data Retrieval
Kehua Guo and Jianhua Ma
CONTENTS
18.1 Introduction 358
18.2 Related Work 359
18.3 Proposed Framework 361
18.3.1 Overview 361
18.3.2 Semantic Annotation 362
18.3.3 Optimization and User Feedback 364
18.3.4 Semantic Representation 364
18.3.5 NoSQL-Based Semantic Storage 366
18.3.6 Heterogeneous Multimedia Retrieval 366
18.4 Performance Evaluation 367
18.4.1 Running Environment and Software Tools 367
18.4.2 Performance Evaluation Model 369
18.4.3 Precision Ratio Evaluation 370
18.4.4 Time and Storage Cost 371
18.5 Discussions and Conclusions 372
Acknowledgments 373
References 373
ABSTRACT
Nowadays, data heterogeneity is one of the most critical features of multimedia Big Data; searching heterogeneous multimedia documents reflecting users' query intent in a Big Data environment is a difficult task in information retrieval and pattern recognition. This chapter proposes a heterogeneous multimedia Big Data retrieval framework that can achieve good retrieval accuracy and performance. The chapter is organized as follows. In Section 18.1, we address the particularity of heterogeneous
multimedia retrieval in a Big Data environment and introduce the background of the topic. Then the literature on current multimedia retrieval approaches is briefly reviewed, and the general concept of the proposed framework is introduced in Section 18.2. In Section 18.3, the framework is described in detail, including semantic information extraction, representation, storage, and multimedia Big Data retrieval. The performance evaluations are shown in Section 18.4, and finally, we conclude the chapter in Section 18.5.
18.1 INTRODUCTION
Multimedia retrieval is an important technology in many applications such as web-scale multimedia search engines, mobile multimedia search, remote video surveillance, automation creation, and e-government [1]. With the widespread use of multimedia documents, our world will be swamped with multimedia content such as massive images, videos, audios, and other content. Therefore, traditional multimedia retrieval has been shifting into a Big Data environment, and research on problems specific to multimedia Big Data retrieval has attracted considerable attention.
At present, multimedia retrieval from Big Data environments faces two problems. The first is document type heterogeneity. The multimedia content generated by various applications may be huge and unstructured. For example, different types of multimedia services generate images, videos, audios, graphics, or text documents. Such heterogeneity makes it difficult to execute a heterogeneous search. The second is intent expression. In multimedia retrieval, the query intent can generally be represented by text. However, text can express only very limited query intent. Users do not want to enter too many key words, but short text may lead to ambiguity. In many cases, the query intent may be described by content (for example, we can search for portraits similar to an uploaded image), but content-based retrieval may ignore personal understanding because semantic information cannot be described by physical features such as color, texture, shape, and so forth. Therefore, the returned results may be far from satisfying users' search intent.
These features have become new challenges in the research of Big Data–based multimedia information retrieval. Therefore, how to simultaneously consider type heterogeneity and user intent in order to guarantee good retrieval performance and economical efficiency has become an important issue.
Traditional multimedia retrieval approaches can be divided into three classes: text-based, content-based, and semantic-based retrieval [1,2]. The text-based approach has been widely used in many commercial Internet-scale multimedia search engines. In this approach, a user types some key words, and then the search engine searches for the text in a host document and returns the multimedia files whose surrounding text contains the key words. This approach has the following disadvantages: (1) Users can only type short or ambiguous key words because the users' query intent usually cannot be correctly described by text. For example, when a user inputs the key word "Apple," the result may contain fruit, a logo, and an Apple phone. (2) For a multimedia database that stores only multimedia documents and has no surrounding text, the text-based approach will
be useless. For example, if a user stores many animation videos in the database, he/she can only query the videos based on simple information (e.g., filename), because the text related to them is limited.
Another multimedia retrieval approach is content based. In this approach, a user uploads a file (e.g., an image), and the search engine searches for documents that are similar to it using content-based approaches. This approach suffers from three disadvantages: (1) It cannot support heterogeneous search (e.g., uploading an image to search for audio). (2) The search engine ignores the users' query intent and cannot return similar results satisfying users' search intent. (3) Feature extraction consumes considerable computation time and resources.
The third approach is semantic based. In this approach, semantic features of multimedia documents are described by an ontology method and stored in the server's knowledge base; when a match request arrives, the server executes retrieval in the knowledge base. However, if the multimedia document leaves the knowledge base, the retrieval process cannot be executed unless the semantic information is rebuilt.
This chapter is summarized from our recent work [1,2]. In this chapter, we use the semantic-based approach to represent users' intent and propose a storage and retrieval framework supporting heterogeneous multimedia Big Data retrieval. The characteristics of this framework are as follows: (1) It supports heterogeneous multimedia retrieval. We can upload a multimedia document of any type (such as image, video, or audio) to obtain suitable documents of various types. (2) It is convenient to interact with. The framework provides retrieval interfaces similar to traditional commercial search engines for convenient retrieval. (3) It saves data space. We store the text-represented semantic information in the database and then provide links to the real multimedia documents instead of directly processing large multimedia data. (4) It is efficient and economical. We use a NoSQL database on inexpensive computers to store the semantic information and use the open-source Hadoop tool to process the retrieval. The experiment results show that this framework can effectively search heterogeneous multimedia documents reflecting users' query intent.
BigTable [6]. Tables in HBase can serve as the input and output for MapReduce [7] jobs running in Hadoop [8] and may be accessed through typical application program interfaces (APIs), such as Java [9].
Research on heterogeneous multimedia retrieval approaches has mainly concentrated on how to combine the text-based approach with other methods [10,11]. Although it is more convenient for users to type text key words, content-based retrieval (e.g., of images) has been widely used in some commercial search engines (e.g., Google Images search). However, it is very difficult to execute heterogeneous retrieval based on multimedia content [12,13]. For example, given a video and an audio document relevant to the same artist, it is difficult to identify the artist or extract other shared features from the binary data of the two documents because their data formats are different. Therefore, full heterogeneous multimedia retrieval has not been achieved.
The approach proposed in this chapter mainly uses semantic information to support heterogeneous multimedia retrieval. Feature recognition is based not only on low-level visual features, such as color, texture, and shape, but also on full consideration of semantic information, such as event, experience, and sensibility [14,15]. At present, semantic information extraction approaches generally use the model of text semantic analysis, which constructs the relation between a text phrase and a visual description in a latent space or a random field. For an image, a bag-of-words model [16] is widely used to express visual words. Objects can be described by visual words to realize semantic learning. In addition, topic models are widely used for semantic extraction. The typical topic models are probabilistic latent semantic analysis (PLSA) [17] and latent Dirichlet allocation (LDA) [18]. Based on these models, many semantic learning approaches have been proposed. A study [19] proposed an algorithm to improve the training effect of image classification in high-dimensional feature space. Previous works [20,21] have proposed some multiple-class annotation approaches based on supervision.
In the field of unsupervised learning, a study [22] proposed a normalized cut clustering algorithm. Another study [23] presented an improved PLSA model. In addition, some researchers have combined relevance feedback and machine learning. One study [24] used feedback with a collaborative model method to extract the semantic information and obtain recognition results with higher precision. Another [25] used a support vector machine (SVM) active learning algorithm to handle feedback information; the users could choose the image reflecting the query intent.
In the field of system development, one study [26] used the hidden Markov model (HMM) to construct an automatic semantic image retrieval system. This approach could express the relation of image features after being given a semantic class. The Digital Shape Workbench [27–29] provided an approach to realize sharing and reuse of design resources and used ontology technology to describe the resources and high-level semantic information of three-dimensional models. In this system, ontology-driven annotation of shapes and the ShapeAnnotator tool were used for user interaction. Based on this work, another study [30] investigated ontology representation of virtual humans, covering the features, functions, and skills of a human. Purdue University is responsible for the conception of the Engineering Ontology and Engineering Lexicon and proposed a calculable
search framework [31]. A study [32] used the Semantic Web Rule Language to define semantic matching. Another [33] proposed a hierarchical ontology-based knowledge representation model.
In current approaches, semantic features and multimedia documents are stored in the server's relational database, and expensive servers must be purchased to process the retrieval. On the one hand, if the multimedia document is not stored in the knowledge base, the retrieval process cannot be conducted unless the semantic information is rebuilt. On the other hand, large multimedia data costs much storage space. So, our framework presents an effective and economical approach that uses an inexpensive investment to store and retrieve the semantic information of heterogeneous multimedia data. In this framework, we do not directly process large multimedia data; in HBase, we only store the ontology-represented semantic information, which can be processed in parallel on distributed nodes with a MapReduce-based retrieval algorithm.
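As a concrete illustration of this NoSQL-based semantic storage, the sketch below shows how ontology-represented semantic records could be written to and scanned from HBase with the Python happybase client. The table name, column family, and row layout are our own illustrative assumptions, not the schema used by the authors.

```python
# Minimal sketch: storing ontology-represented semantic records in HBase.
# Assumes a running HBase Thrift server, an existing table, and the
# happybase client; the names below are illustrative only.
import happybase

connection = happybase.Connection('localhost')   # HBase Thrift server
table = connection.table('semantic_store')

# One row per multimedia document: the row key is the document location,
# and the ontology-represented annotations are stored as small text cells.
table.put(b'hdfs://media/img/0001.jpg', {
    b'ont:type': b'image',
    b'ont:annotations': b'shore,red,sunset',
    b'ont:weights': b'0.33,0.33,0.33',
})

# Retrieval-side scan: only the lightweight semantic cells are read; the
# multimedia content itself stays outside HBase and is reached via the key.
for row_key, cells in table.scan(columns=[b'ont:annotations']):
    print(row_key, cells[b'ont:annotations'])
```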
In our framework, we apply some valuable assets to facilitate the intent-reflective retrieval of heterogeneous multimedia Big Data. Big Data processing tools (e.g., Hadoop) are open source and convenient in that they can be freely obtained. For example, Hadoop only provides a programming model to perform distributed computing, and we can use a traditional retrieval algorithm after designing a computing model that fits the MapReduce programming specification. In addition, the semantic information of the multimedia documents can be easily obtained and saved because of the existence of many computing models. For example, we can attach semantic information to the multimedia documents through social-user annotation and describe the semantic information with ontology technology. These technologies have been widely used in various fields.
[Figure: overall architecture of the framework. Users annotate multimedia documents with ontology-represented semantic information; ontology files are converted into a map structure, indexed in blocks, and stored across HBase data nodes under a semantic field refinement schema on the Hadoop framework; an uploaded multimedia document is annotated in the same way and matched to produce the returned result.]
FIGURE 18.2 The interface of SMV. (From Guo, K. et al., Wirel. Pers. Commun., 72, 2013; Guo, K. and Zhang, S., Comput. Math. Methods Med., 2013, 2013.)
For each multimedia document mi, the annotations and their weights form an annotation matrix:

$$A_{m_i} = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \\ w_1 & w_2 & \cdots & w_n \end{bmatrix}^{T} \tag{18.1}$$
where ai is the ith annotation and wi is the corresponding weight. Therefore, all the annotation matrices for the multimedia documents can be defined as A = {Am1, Am2,…, AmN}.
After the semantic annotation, for arbitrary mi ∈ C, we assign the initial value of wi as 1/n.
It is evident that wi should not remain constant after retrieval. Annotations used more frequently during the retrieval process express the semantic information better and should be assigned a greater weight. We design an adjustment schema as follows:
$$w_i = w_i + k_i \times \frac{1}{n} \tag{18.2}$$
$$k_i = \begin{cases} 1 & \text{if } m_i \text{ is retrieved based on } a_i \\ 0 & \text{otherwise} \end{cases} \tag{18.3}$$
The initial weight assignment and the adjustment process need to check all the semantic information in the database, which consumes considerable computational resources. To mitigate this cost, we execute the initial assignment only once, when the search engine is built. In addition, the adjustment process can be performed in a background thread.
Rarely used annotations are pruned: an annotation ai is removed when its weight falls below the average, that is, when

$$w_i < \frac{1}{n}\sum_{i=1}^{n} w_i \tag{18.4}$$
After retrieval, this framework will return some multimedia documents. The framework supports user feedback, so for a particular returned document, the user can add additional annotations to enrich the semantic information. These new annotations also receive the initial weight 1/n.
In summary, as retrieval progresses, the annotations become more and more abundant, while rarely used annotations are removed, and new annotations are added to the annotation matrix Ami through user feedback. Therefore, this is a dynamic framework: the longer it is used, the more accurate the results it returns.
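The following minimal Python sketch pulls together the annotation-weight schema of Equations 18.1 through 18.4 and the feedback step; the class and method names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch of the annotation-weight schema (Equations 18.1-18.4);
# names are our own, not the chapter's implementation.
class AnnotationMatrix:
    def __init__(self, annotations):
        n = len(annotations)
        # Equation 18.1: annotations a_1..a_n with initial weights w_i = 1/n.
        self.n = n
        self.weights = {a: 1.0 / n for a in annotations}

    def record_retrieval(self, used_annotation):
        # Equations 18.2 and 18.3: k_i = 1 only for the annotation the
        # document was retrieved by, so its weight grows by 1/n.
        if used_annotation in self.weights:
            self.weights[used_annotation] += 1.0 / self.n

    def add_feedback(self, annotation):
        # User feedback: new annotations enter with the initial weight 1/n.
        self.weights.setdefault(annotation, 1.0 / self.n)

    def prune(self):
        # Equation 18.4: drop rarely used annotations whose weight falls
        # below the average weight (run in a background thread in practice).
        avg = sum(self.weights.values()) / len(self.weights)
        self.weights = {a: w for a, w in self.weights.items() if w >= avg}


ami = AnnotationMatrix(["shore", "red", "sunset"])
ami.record_retrieval("shore")   # document retrieved via "shore"
ami.add_feedback("beach")       # user adds an annotation after retrieval
ami.prune()
print(ami.weights)
```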
and other levels of semantic annotations are provided based on the previous levels. All the information is annotated by the users in the original annotation or feedback process. Figure 18.3 shows the annotation structure of an image.
This framework adopts the composite pattern as the data structure to represent the relation of annotations. In a composite pattern, objects are composed into a tree structure to represent part–whole hierarchies. The pattern treats simple and complex elements uniformly: a client can use the same method to deal with complex elements as with simple ones, so that the internal structure of the complex elements remains independent of the client program. The data structure using the composite pattern is shown in Figure 18.4 [1].
In this structure, OntologyComponent is the declared object interface in the composition. In many cases, this interface implements all the default methods that are shared by all the ontologies. OntologyLeaf represents a leaf node object in the composition; these nodes have no children. In OntologyComposite, the methods of branch nodes are defined to store the children components; the operations relative to children components will be implemented through the OntologyComponent interface. Therefore, the composite pattern allows the user to handle the ontology in a consistent manner.

[Figure: UML class diagram. A Client uses the OntologyComponent interface (Operation, Add, Remove, GetChild); OntologyLeaf and OntologyComposite implement it, with OntologyComposite holding the children.]

FIGURE 18.4 Ontology structure based on composite pattern. (From Guo, K. et al., Wirel. Pers. Commun., 72, 779–793, 2013; Guo, K. and Zhang, S., Comput. Math. Methods Med., 2013, 1–7, 2013.)
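A minimal Python rendering of the composite pattern in Figure 18.4 might look as follows; the method names follow the figure, while the traversal logic is our own illustrative choice.

```python
# Composite-pattern sketch for ontology annotations (after Figure 18.4).
from abc import ABC, abstractmethod


class OntologyComponent(ABC):
    """Declared object interface shared by all ontology nodes."""

    @abstractmethod
    def operation(self):
        ...


class OntologyLeaf(OntologyComponent):
    """Leaf node: a single annotation with no children."""

    def __init__(self, annotation):
        self.annotation = annotation

    def operation(self):
        return [self.annotation]


class OntologyComposite(OntologyComponent):
    """Branch node: stores children yet exposes the same interface."""

    def __init__(self, name):
        self.name = name
        self.children = []

    def add(self, component):
        self.children.append(component)

    def remove(self, component):
        self.children.remove(component)

    def get_child(self, index):
        return self.children[index]

    def operation(self):
        # The client treats branches and leaves uniformly.
        result = [self.name]
        for child in self.children:
            result.extend(child.operation())
        return result


scene = OntologyComposite("scene")
scene.add(OntologyLeaf("shore"))
scene.add(OntologyLeaf("red"))
print(scene.operation())  # ['scene', 'shore', 'red']
```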
FIGURE 18.5 Map structure of block and record. (From Guo, K. et al., J. Syst. Software, 87, 2014.)
FIGURE 18.6 The process of MapReduce-based retrieval. (From Guo, K. et al., J. Syst. Software, 87, 2014.)
The user can upload some multimedia documents to execute the retrieval. Similarly, these documents are assigned semantic information through social annotation. The semantic information is represented as an ontology and converted into a map structure. In the queries, we assign every query a QueryId and a QueryOntology; the returned result is formed as a ReturnedList. The process of MapReduce-based retrieval is shown in Figure 18.6 [34].
In Figure 18.6, the queries are submitted to the Hadoop environment, and then the mapper and reducer run. All the returned records are put into a list. In the queries and the returned list, the information is represented as a byte array. The mapper function takes pairs of record keys (multimedia locations) and record values (ontology information). For each pair, a retrieval engine runs all queries and produces an output for each matching query using a similarity function. The MapReduce tool runs the mapper functions in parallel on each machine. When the map step finishes, the MapReduce tool groups the intermediate output by QueryId. For each QueryId, the reducer function that runs locally on each machine simply takes the results whose similarity is above the average value and outputs them into the ReturnedList.
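The sketch below imitates, on a single machine, the mapper and reducer logic of Figure 18.6. The Jaccard similarity over annotation sets and the in-memory stand-in for HBase are assumptions made for illustration; the chapter's actual similarity function is not specified here.

```python
# Simplified, single-machine sketch of the MapReduce retrieval step in
# Figure 18.6. The Jaccard similarity and the in-memory "HBase" dict are
# stand-ins for the chapter's similarity function and storage layer.
from collections import defaultdict


def similarity(query_ontology, doc_ontology):
    # Placeholder similarity between two sets of ontology annotations.
    q, d = set(query_ontology), set(doc_ontology)
    return len(q & d) / len(q | d) if q | d else 0.0


def mapper(record_key, record_value, queries):
    # record_key: multimedia location; record_value: ontology information.
    # Emit (QueryId, (location, score)) for every query the record matches.
    for query_id, query_ontology in queries.items():
        score = similarity(query_ontology, record_value)
        if score > 0:
            yield query_id, (record_key, score)


def reducer(query_id, values):
    # Keep only results whose similarity is above the average (ReturnedList).
    avg = sum(score for _, score in values) / len(values)
    return query_id, [loc for loc, score in values if score >= avg]


hbase = {"img/0001.jpg": ["shore", "red", "sunset"],
         "vid/0002.mp4": ["shore", "boat"],
         "aud/0003.mp3": ["concert", "piano"]}
queries = {"q1": ["shore", "sunset"]}

grouped = defaultdict(list)
for key, value in hbase.items():
    for query_id, pair in mapper(key, value, queries):
        grouped[query_id].append(pair)

returned_list = dict(reducer(qid, vals) for qid, vals in grouped.items())
print(returned_list)  # {'q1': ['img/0001.jpg']}
```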
[Figure: the evaluation cluster of ten nodes, one master that also hosts the web server (Tomcat 6.0) and nine slaves, all running Ubuntu Linux, Java, Hadoop, and OpenSSH; users upload documents through the web server.]
FIGURE 18.7 Running framework of performance evaluation. (From Guo, K. et al., J. Syst. Software, 87, 2014.)
In this chapter, we developed some additional software tools to verify the effectiveness of our framework. These tools include the following: (1) An annotation interface is used by the users to provide annotations for the multimedia documents; we developed SMV for PC and mobile device users [1]. (2) Our framework provides a convenient operating interface, very similar to traditional commercial search engines (e.g., Google Images search). Users can upload multimedia documents in the interface and submit the information to the server. The interface was developed using HTML5 and can run on a typical terminal. (3) For the web server, the search engine was deployed in Tomcat 6.0. In the cluster, the background process was executed every 24 h. Table 18.1 lists the software tools.
How to construct the data set is an important problem in the experiment. Some general databases have been proposed. However, these databases only support experiments aimed at one particular multimedia type (e.g., image files), whereas heterogeneous multimedia retrieval requires a wide variety of files such as images, videos, audio, and so forth. So these databases are not suitable for our experiments. In this chapter, we constructed a multimedia database containing various multimedia types, including images, videos, audio, and text documents. In the experiment, we used a multimedia database containing 50,000 multimedia documents, including 20,000 images, 10,000 videos, 10,000 audio files, and 10,000 text documents. All the semantic information of the multimedia documents was provided by users manually annotating and by analyzing the text from the host file (e.g., the web page) from which the documents were downloaded.
1. Precision ratio. Precision ratio is a very common measurement for evaluating retrieval performance. In the experiment, we slightly modify the traditional definition of precision ratio. For each retrieval process, we let the user choose the multimedia documents that reflect his/her query intent. We define the set of retrieved results as Rt = {M1, M2,…, Mt} (where t is the number of retrieved multimedia documents) and define the set of all the multimedia documents reflecting users' intent as Rl = {M1, M2,…, Ml} (where l is the number of relevant documents).
The precision ratio is the proportion of retrieved relevant documents to total retrieved documents. Therefore, the modified precision ratio p can be defined as follows:
$$p = \frac{\#|R_l|}{\#|R_t|} \tag{18.5}$$
where tpre is the preprocessing time (converting the semantic information and multimedia location into the map structure) and satisfies
$$t_{pre} = \sum_{i=1}^{N} t_{pre}^{i} \tag{18.7}$$
tref represents the annotation refinement time (eliminating redundant or erroneous semantic information and adding new semantic information obtained from feedback).
The second factor is retrieval time. We define tr as the time cost for retrieval. In fact, tr includes the extraction time (extracting the semantic information from HBase) and the matching time (matching the semantic similarity of the sample document with the stored documents).
3. Storage cost. Because HBase stores the map information, the storage cost must be taken into consideration. The rate of increase in storage ps is defined as follows:

$$p_s = s_{ont}/s_{org} \tag{18.8}$$
where sont is the size of the semantic information of the multimedia documents and sorg is the size of the original multimedia documents, with
$$s_{ont} = \sum_{i=1}^{N} s_{ont}^{i}, \qquad s_{org} = \sum_{i=1}^{N} s_{org}^{i} \tag{18.9}$$
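For reference, the evaluation measures of Equations 18.5 and 18.7 through 18.9 can be computed with a few small helpers; the variable names below are ours.

```python
# Small helpers for the evaluation measures (Equations 18.5, 18.7-18.9).
def precision_ratio(relevant_retrieved, retrieved):
    # Equation 18.5: p = #|R_l| / #|R_t|.
    return len(relevant_retrieved) / len(retrieved)


def total_preprocess_time(per_document_times):
    # Equation 18.7: t_pre is the sum of per-document preprocessing times.
    return sum(per_document_times)


def storage_increase_ratio(semantic_sizes, original_sizes):
    # Equations 18.8 and 18.9: p_s = s_ont / s_org, both summed over all
    # N documents.
    return sum(semantic_sizes) / sum(original_sizes)


print(precision_ratio({"m1", "m2"}, {"m1", "m2", "m3", "m4"}))   # 0.5
print(total_preprocess_time([0.8, 1.1, 0.9]))                    # 2.8 (s)
print(storage_increase_ratio([2.1, 0.9], [5120.0, 2048.0]))      # ~0.0004
```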
[Figure: precision ratio (%) over retrieval runs 01–10 for image, video, audio, and text documents.]
Figure 18.9 indicates that even when retrieving across different multimedia types, the precision ratios are not reduced. This is because the framework completely abandons physical feature extraction and executes the retrieval process based only on semantic information.
However, we cannot ignore that some traditional technologies supporting content-based retrieval also perform well. For example, Google Images search can obtain very good results reflecting users' intent in content-based image retrieval. However, it has the following disadvantages in comparison with our framework: (1) This search pattern cannot support heterogeneous retrieval because it relies on physical feature extraction. (2) Compared with physical features, annotations can better represent the users' query intent, so our framework can obtain more accurate results in semantic multimedia retrieval; if the documents contain more abundant annotations, the retrieval performance will be better. In addition, our framework has the advantage of good speed because it skips physical feature extraction.
[Figure: time cost (s) per retrieval run for image, video, audio, and text documents.]
Also, in our framework, we use open-source tools such as Ubuntu Linux, the Java SDK, and Hadoop. The operating system and software tools can be freely downloaded from the corresponding websites, which saves the software investment. In addition, Apache Hadoop provides simplified programming models for reliable, scalable, distributed computing. It allows the distributed processing of large data sets across clusters of computers using simple programming models, and programmers can adopt Hadoop and MapReduce with a low learning cost. Therefore, the investment cost of heterogeneous multimedia Big Data retrieval using our framework is very economical.
ACKNOWLEDGMENTS
This work is supported by the Natural Science Foundation of China (61202341, 61103203) and the China Scholarship Council (201308430049).
REFERENCES
1. K. Guo, J. Ma, G. Duan. DHSR: A novel semantic retrieval approach for ubiquitous multi-
media. Wireless Personal Communications, 2013, 72(4): 779–793.
2. K. Guo, S. Zhang. A semantic medical multimedia retrieval approach using ontology
information hiding. Computational and Mathematical Methods in Medicine, 2013, 2013(407917):
1–7.
3. C. Liu, J. Chen, L. Yang et al. Authorized public auditing of dynamic Big Data storage on cloud with efficient verifiable fine-grained updates. IEEE Transactions on Parallel and Distributed Systems, 2014, 25(9): 2234–2244.
4. J.R. Smith. Minding the gap. IEEE MultiMedia, 2012, 19(2): 2–3.
5. S. Ghemawat, H. Gobioff, S.T. Leung. The Google file system. ACM SIGOPS Operating Systems Review. ACM, 2003, 37(5): 29–43.
6. F. Chang, J. Dean, S. Ghemawat et al. Bigtable: A distributed storage system for structured data.
ACM Transactions on Computer Systems, 2008, 26(2): 1–26, Article no. 4.
7. J. Dean, S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008, 51(1): 107–113.
8. Apache Hadoop. Available at http://hadoop.apache.org/.
9. Apache Hbase. Available at http://en.wikipedia.org/wiki/HBase.
10. R. Zhao, W.I. Grosky. Narrowing the semantic gap-improved text-based web document
retrieval using visual features. IEEE Transactions on Multimedia, 2002, 4(2): 189–200.
11. Y. Yang, F. Nie, D. Xu et al. A multimedia retrieval architecture based on semi-supervised
ranking and relevance feedback. IEEE Transactions on Pattern Analysis and Machine Intelligence,
2012, 34(4): 723–742.
12. A. Smeulders, M. Worring, S. Santini et al. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(12): 1349–1380.
13. G. Zhou, K. Ting, F. Liu, Y. Yin. Relevance feature mapping for content-based multimedia
information retrieval. Pattern Recognition, 2012, 45(4): 1707–1720.
14. R.C.F. Wong, C.H.C. Leung. Automatic semantic annotation of real-world web images. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2008, 30(11): 1933–1944.
15. A. Gijsenij, T. Gevers. Color constancy using natural image statistics and scene semantics. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2010, 33(4): 687–698.
16. W. Lei, S.C.H. Hoi, Y. Nenghai. Semantics-preserving bag-of-words models and applications.
IEEE Transactions on Image Processing, 2010, 19(7): 1908–1920.
17. T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning,
2001, 42(1–2): 177–196.
18. D.M. Blei, A.Y. Ng, M.I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research,
2003, 3(1): 993–1022.
19. Y. Gao, J. Fan, H. Luo et al. Automatic image annotation by incorporating feature hierarchy and boosting to scale up SVM classifiers. In Proceedings of the 14th ACM International Conference on Multimedia, Santa Barbara, CA, 2006: 901–910.
20. G. Carneiro, A. Chan, P. Moreno, N. Vasconcelos. Supervised learning of semantic classes for
image annotation and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence,
2007, 29(3): 394–410.
21. N. Rasiwasia, P.J. Moreno, N. Vasconcelos. Bridging the gap: Query by semantic example. IEEE
Transactions on Multimedia, 2007, 9(5): 923–938.
22. J. Shi, J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 2000, 22(8): 888–905.
23. F. Monay, D. Gatica-Perez. Modeling semantic aspects for cross-media image indexing. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(10): 1802–1817.
24. D. Djordjevic, E. Izquierdo. An object and user driven system for semantic-based image anno-
tation and retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 2007, 17(3):
313–323.
25. S.C.H. Hoi, J. Rong, J. Zhu, M.R. Lyu. Semi-supervised SVM batch mode active learning for
image retrieval. In Proceedings of 2008 IEEE Conference on Computer Vision and Pattern Recognition,
Anchorage, AK, IEEE Computer Society, 2008: 1–7.
26. J. Li, J.Z. Wang. Automatic linguistic indexing of pictures by a statistical modeling approach.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003, 25(9): 1075–1088.
27. R. Albertoni, R. Papaleo, M. Pitikakis. Ontology-based searching framework for digital shapes.
Lecture Notes in Computer Science, 2005, 3762: 896–905.
28. M. Attene, F. Robbiano, M. Spagnuolo. Part-based annotation of virtual 3d shapes. In Proceedings
of International Conference on Cyberworlds, Hannover, Germany, IEEE Computer Society Press,
2007: 427–436.
29. M. Attene, F. Robbiano, M. Spagnuolo, B. Falcidieno. Semantic annotation of 3d surface meshes
based on feature characterization. Lecture Notes in Computer Science, 2009, 4816: 126–139.
30. M. Gutiérrez, A. García-Rojas, D. Thalmann. An ontology of virtual humans incorporating semantics into human shapes. The Visual Computer, 2007, 23(3): 207–218.
31. Z.J. Li, V. Raskin, K. Ramani. Developing ontologies for engineering information retrieval. IASME Transactions Journal of Computing and Information Science in Engineering, 2008, 8(1): 1–13.
32. X.Y. Wang, T.Y. Lv, S.S. Wang. An ontology and swrl based 3d model retrieval system. Lecture
Notes in Computer Science, 2008, 4993: 335–344.
33. D. Yang, M. Dong, R. Miao. Development of a product configuration system with an ontology-based approach. Computer-Aided Design, 2008, 40(8): 863–878.
34. K. Guo, W. Pan, M. Lu, X. Zhou, J. Ma. An effective and economical architecture for semantic-based heterogeneous multimedia big data retrieval. Journal of Systems and Software, 2014, 87.
CHAPTER 19
Topic Modeling for Large-Scale Multimedia Analysis and Retrieval
Juan Hu, Yi Fang, Nam Ling, and Li Song
CONTENTS
19.1 Introduction 376
19.2 Large-Scale Computing Frameworks 377
19.3 Probabilistic Topic Modeling 379
19.4 Couplings among Topic Models, Cloud Computing, and Multimedia Analysis 382
19.4.1 Large-Scale Topic Modeling 382
19.4.2 Topic Modeling for Multimedia 384
19.4.3 Large-Scale Computing in Multimedia 385
19.5 Large-Scale Topic Modeling for Multimedia Retrieval and Analysis 386
19.6 Conclusions and Future Directions 388
References 389
ABSTRACT
The explosion of multimedia data in social media raises a great demand for developing effective and efficient computational tools to facilitate producing, analyzing, and retrieving large-scale multimedia content. Probabilistic topic models have proved to be an effective way to organize large volumes of text documents, while much fewer related models have been proposed for other types of unstructured data such as multimedia content, partly due to the high computational cost. With the emergence of cloud computing, topic models are expected to become increasingly applicable to multimedia data. Furthermore, the growing demand for a deep understanding of multimedia data on the web drives the development of sophisticated machine learning methods. Thus, it is greatly desirable to develop topic modeling approaches to multimedia applications that are consistently effective, highly efficient, and easily scalable. In this chapter, we present a review of topic models for large-scale multimedia analysis. Our goal is to show the current challenges from various perspectives and to present a
comprehensive overview of related work that addresses these challenges. We will also discuss several research directions in the field.
19.1 INTRODUCTION
With the arrival of the Big Data era, recent years have witnessed an exponential growth of multimedia data, thanks to the rapid increase in processor speed, cheaper data storage, the prevalence of digital content capture devices, as well as the flood of social media like Facebook and YouTube. New data generated each day had reached 2.5 quintillion bytes as of 2012 (Dean and Ghemawat 2008). In particular, more than 10 h of video are uploaded to YouTube every minute, and millions of photos become available online every week. The explosion of multimedia data in social media raises a great demand for developing effective and efficient computational tools to facilitate producing, analyzing, and retrieving large-scale multimedia content. Big Data analysis for basic tasks such as classification, retrieval, and prediction has become ever more popular for multimedia sources in the form of text, graphics, images, audio, and video. The data sets are so large and noisy that the scalability of traditional data mining algorithms needs to be improved. The MapReduce framework designed by Google is very simple to implement and very flexible in that it can be extended to various large-scale data processing functions. This framework is a powerful tool for developing scalable parallel applications that process Big Data on large clusters of commodity machines. The equivalent open-source Hadoop MapReduce developed by Yahoo is now very popular in both the academic community and industry.
In the past decade, much effort has been made in the information retrieval (IR) field to find lower-dimensional representations of the original high-dimensional data, which enable efficient processing of a massive data set while preserving essential features. Probabilistic topic models have proved to be an effective way to organize large volumes of text documents. In natural language processing, a topic model refers to a type of statistical model for representing a collection of documents by discovering abstract topics. At the early stage, a generative probabilistic model for text corpora was developed to address the issues of term frequency and inverse document frequency (TF-IDF), namely, that the dimension reduction effect using TF-IDF is rather small (Papadimitriou et al. 1998). Later, another important topic model named probabilistic latent semantic indexing (PLSI) was created by Thomas Hofmann in 1999. Essentially, PLSI is a two-level hierarchical Bayesian model where each word is generated from a single topic and each document is reduced to a probability distribution over a fixed set of topics. Latent Dirichlet allocation (LDA) (Blei et al. 2003) is a generalization of PLSI that provides a probabilistic model at the level of documents, which avoids the serious overfitting problem because the number of parameters in the model does not grow linearly with the size of the corpus. LDA is now the most common topic model, and many topic models are extensions of LDA obtained by relaxing some of its statistical assumptions. Probabilistic topic models exemplified by LDA aim to discover the hidden themes running through the words, which can help us organize and understand the vast information conveyed by massive data sets.
To improve the scalability of topic models for Big Data analysis, much effort has been put into large-scale topic modeling. Parallel LDA (PLDA) was designed by distributing
Gibbs sampling for LDA across multiple machines (Wang et al. 2009). Another flexible large-scale topic modeling package named Mr.LDA is implemented in MapReduce, where model parameters are estimated by variational inference (Zhai et al. 2012). A novel architecture for parallel topic models has been demonstrated to yield better performance (Smola and Narayanamurthy 2010).
While topic models are proving to be effective methods for corpus analysis, much fewer related models have been proposed for other types of unstructured data such as multimedia content, partly due to the high computational cost. With the emergence of cloud computing, topic models are expected to become increasingly applicable to multimedia data. Furthermore, the growing demand for a deep understanding of multimedia data on the web drives the development of sophisticated machine learning methods. Thus, it is greatly desirable to develop topic modeling approaches for large-scale multimedia applications that are consistently effective, highly efficient, and easily scalable. In this chapter, we present a review of topic models for large-scale multimedia analysis.
The chapter is organized as follows. Section 19.2 gives an overview of several distributed large-scale computing frameworks, followed by a detailed introduction of the MapReduce framework. In Section 19.3, topic models are introduced; we present LDA as a typical example of a topic model as well as inference techniques such as Gibbs sampling and variational inference. More advanced topic models are also discussed in this section. A review of recent work on large-scale topic modeling, topic modeling for multimedia analysis, and large-scale multimedia analysis is presented in Section 19.4. Section 19.5 demonstrates recent efforts in large-scale topic modeling for multimedia retrieval and analysis. Finally, Section 19.6 concludes this chapter.
on racks, with roughly 8 to 64 nodes per rack. The computer nodes on the same rack are connected by gigabit Ethernet, while the racks are interconnected by a switch.
There are two kinds of computer nodes in this framework: the master node and the worker nodes. The master node assigns tasks to the worker nodes and keeps track of the status of each worker node, which can be idle, executing a particular task, or finished with it. Thus, the master node takes the central control role in the whole process, and failure at the master node can be disastrous: the entire process could go down, and all the tasks would need to be restarted. On the other hand, failure in a worker node can be detected by the master node as it periodically pings the worker processes. The master node can manage failures at the worker nodes, so all the computing tasks will eventually complete.
The name MapReduce comes naturally from the two essential functions of this framework, the Map step and the Reduce step, as described in Figure 19.1. The input of the computing process is chunks of data, which can be of any type, such as documents or tuples. The Map function converts input data into key–value pairs. Then the master controller chooses a hash function that is applied to the keys produced in the Map step and yields a bucket number, where the total number of buckets equals the number of Reduce tasks defined by the user. Each key emitted by a Map task is hashed, and its key–value pair is put into one of the local buckets by grouping and aggregation, each of which is destined for one of the Reduce tasks. The Reduce function takes pairs consisting of a key and its list of associated values and combines the values in a user-defined way. The output of a Reduce task is a sequence of key–value pairs, each consisting of a key received from the Map task and the combined value constructed from the list of values that the Reduce task received along with that key. Finally, the outputs from all the Reduce tasks are merged into a single file.
The MapReduce computation framework can best be illustrated with the classic example of word count. Word count is important because it is exactly the term frequency in the IR model. The input file for the framework is a corpus of many documents. The words are the keys of the Map function, and the count of occurrences of each word is the value corresponding to the key. If a word w appears m times among all the documents assigned to a process, the Reduce task simply adds up all the values, so that after grouping and aggregation, the output is a sequence of pairs (w, m).
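A compact, single-process Python rendering of this word-count example is given below; it mirrors the Map, group, and Reduce steps but is only an illustration, not the Google or Hadoop implementation.

```python
# Word count in the MapReduce style described above: the mapper emits
# (word, 1) pairs, pairs are grouped by key, and the reducer sums the
# counts. This runs in a single process purely for illustration.
import itertools


def mapper(document):
    for word in document.split():
        yield word.lower(), 1


def reducer(word, counts):
    return word, sum(counts)


documents = ["big data algorithms", "big data analytics and applications"]

# Map step: emit key-value pairs from every input chunk.
pairs = [pair for doc in documents for pair in mapper(doc)]

# Group step: collect all values that share the same key.
pairs.sort(key=lambda kv: kv[0])
grouped = itertools.groupby(pairs, key=lambda kv: kv[0])

# Reduce step: combine the list of values for each key.
counts = dict(reducer(word, (c for _, c in group)) for word, group in grouped)
print(counts)  # e.g., {'big': 2, 'data': 2, 'algorithms': 1, ...}
```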
[Figure 19.1: MapReduce data flow. The master assigns Map and Reduce workers; intermediate key–value pairs from the Map tasks are grouped by key and passed to the Reduce tasks.]
In the real MapReduce execution, as shown in Figure 19.1, a worker node can handle either a Map task or a Reduce task but will be assigned only one task at a time. It is
reasonable to have a smaller number of Reduce tasks than Map tasks, because each Map task must create an intermediate file for each Reduce task, and if there are too many Reduce tasks, the number of intermediate files explodes. The original implementation of the MapReduce framework was done by Google on top of the Google File System. The open-source implementation by Yahoo is called Hadoop, which can be downloaded, along with the Hadoop file system, from the Apache foundation. Computational processing can occur on data stored either in a distributed file system or in a database. MapReduce can take advantage of data locality, processing data on or near the storage assets to decrease data transmission, while tolerating hardware failure.
MapReduce allows for distributed processing of the map and reduce operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice, parallelism is limited by the number of independent data sources and the number of CPUs near each source. Similarly, a set of "reducers" can perform the reduction tasks. All outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger data sets than "commodity" servers can handle. A large server farm can use MapReduce to sort petabytes of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: If one mapper or reducer fails, the work can be rescheduled as long as the input data are still available.
While the MapReduce framework is very simple, flexible, and powerful, its data-flow model is ideally suited for batch processing of on-disk data, where latency can be very poor and scalability to real-time computation is limited. To address these issues, Facebook's Puma and Yahoo's S4 were proposed for real-time aggregation of unbounded streaming data. A novel columnar storage representation for nested records was proposed in the Dremel framework (Melnik et al. 2010), which improves the efficiency of MapReduce. In-memory computation is allowed in other data-centric programming models: Piccolo (Powell and Li 2010) and Spark, which was created by the Berkeley AMPLab, have been demonstrated to deliver much faster performance. Pregel was proposed for large-scale graph processing (Malewicz et al. 2010).
Big Data sets by giving shorter and lower-dimensional representations of the original documents. PLSI took a remarkable step forward based on LSI (Hofmann 1999). PLSI is essentially a generative probabilistic topic model where each word in a document is considered to be a sample from a mixture model whose mixture components are multinomial distributions over topics. As shown in Figure 19.2a, for each document in the corpus, each word w is generated from a latent topic x with probability p(w|x), where x is drawn from the topic distribution p(x|d) of the document d. PLSI models the probability of occurrence of a word in a document as the following mixture of conditionally independent multinomial distributions:
$$p(w\,|\,d) = \sum_{x} p(w\,|\,x)\, p(x\,|\,d)$$

where p(w|x) is the probability of word w given topic x and p(x|d) is the probability of topic x in document d.
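A short numerical illustration of this mixture, using toy NumPy matrices for p(w|x) and p(x|d), is shown below.

```python
# Numerical illustration of the PLSI mixture: p(w|d) is obtained by mixing
# the per-topic word distributions p(w|x) with the document's topic
# proportions p(x|d). The matrices below are toy values.
import numpy as np

p_w_given_x = np.array([[0.7, 0.2, 0.1],   # topic 1 over a 3-word vocabulary
                        [0.1, 0.3, 0.6]])  # topic 2
p_x_given_d = np.array([0.25, 0.75])       # topic proportions of document d

p_w_given_d = p_x_given_d @ p_w_given_x    # sum_x p(w|x) p(x|d)
print(p_w_given_d, p_w_given_d.sum())      # a word distribution that sums to 1
```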
Although PLSI has proved to be useful, it has an issue in that there is no probabilistic generative model at the document level to generate the proportions of the different topics. LDA was proposed to fix this issue by constructing a three-level hierarchical Bayesian probabilistic topic model (Blei et al. 2003), which is a generative probabilistic model of a corpus. Probabilistic topic models are exemplified by LDA. In the next paragraphs, we introduce LDA in the setting of natural language processing, where we are dealing with text documents.
A topic is defined as a distribution over a fixed vocabulary. Intuitively, documents can exhibit multiple topics with different proportions. For example, a book on topic models can have several topics, such as probabilistic generative models, LSI, and LDA, with LDA having the largest proportion. We assume that the topics are specified before the documents are generated. Thus, all the documents in the corpus share the same set of topics but with different distributions. In the generative probabilistic model, data arise from a generative process with hidden variables.
In LDA, documents are represented as random mixtures over latent topics, and the prior distribution of the hidden topic structure is described by a Dirichlet distribution. Each word
FIGURE 19.2 Graph model representation of PLSI (a) and LDA (b).
is generated by looking up a topic in the document it belongs to and finding the probability of the word within that topic, where we treat each topic as a bag of words and apply the language model. As depicted in Figure 19.2, LDA is a three-level hierarchical probabilistic model, while PLSI has only two levels. LDA has the advantage of defining the document distribution over multiple hidden topics, compared to PLSI, where the number of topics is fixed.
The three-level LDA model is described as a probabilistic graphical model in Figure 19.2, where a corpus consists of N documents and each document has a sequence of D words. The parameter θ describes the topic distributions over the vocabulary, assuming that the dimensionality of the topic variable x is known and fixed. The parameter t describes the topic mixture at the document level, drawn from a Dirichlet distribution Dir(β) over all the topics. Remember that the topics are generated before the documents, so both β and θ are corpus-level parameters. The posterior distribution of the hidden variables given a document, which is the key quantity we need to solve LDA, is given by the following equation:
equation:
$$p(t, x \,|\, w, \theta, \beta) = \frac{p(t, x, w \,|\, \theta, \beta)}{p(w \,|\, \theta, \beta)}$$
where w denotes the observed words of the document, t the document-level topic proportions, and x the per-word topic assignments.
The topic structure is the hidden variable, and our goal is to discover the topic structure automatically from the observed documents. All we need is the posterior distribution, the conditional distribution of the hidden topic structure given all the documents in the corpus. A joint distribution over both the documents and the topic structure is defined to compute this posterior distribution. Unfortunately, the posterior distribution is intractable for exact inference. However, a wide variety of approximate inference algorithms have been developed for LDA, including sampling-based algorithms such as Gibbs sampling and variational algorithms. The sampling-based algorithms approximate the posterior with samples from an empirical distribution. In Gibbs sampling, a Markov chain is defined on the hidden topic variables for the corpus. The Monte Carlo method is used to run the chain for a long time, collect samples, and then approximate the posterior distribution with these samples, since the posterior distribution is the limiting distribution of the Markov chain. On the other hand, variational inference is a deterministic method to approximate the posterior distribution. The basic idea is to find the tightest lower bound by optimization; the optimal lower bound yields the approximate posterior distribution.
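To make the sampling-based route concrete, the following compact collapsed Gibbs sampler runs LDA on a toy corpus; the hyperparameters, corpus, and topic count are arbitrary illustrative choices rather than settings from the literature discussed here.

```python
# A compact collapsed Gibbs sampler for LDA on a toy corpus, illustrating
# the sampling-based inference described above.
import numpy as np

docs = [[0, 1, 2, 0], [2, 3, 3, 4], [0, 1, 4, 2]]   # word ids per document
V, K, alpha, beta = 5, 2, 0.1, 0.01                 # vocab size, topics, priors
rng = np.random.default_rng(0)

ndk = np.zeros((len(docs), K))      # document-topic counts
nkw = np.zeros((K, V))              # topic-word counts
nk = np.zeros(K)                    # words assigned to each topic
z = [[rng.integers(K) for _ in doc] for doc in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(200):                # Gibbs sweeps over every word token
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]             # remove the token's current assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # full conditional p(z = k | rest), up to a constant
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k             # add the token back under the new topic
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

# Estimated topic-word distributions from the final counts.
print((nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True))
```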
There are many other probabilistic topic models, which are more sophisticated than LDA and usually obtained by relaxing the assumptions of LDA. A topic model relaxes the
with a large cluster of computers. However, it might not be able to process tens to hundreds of millions of documents, while today's Internet data, such as Yahoo and Facebook user profiles, can easily exceed this level. To handle hundreds of millions of user profiles, scalable distributed inference of dynamic user interests was designed for a time-varying user model (Ahmed et al. 2011). User interests can change over time, which is very valuable for prediction of user purchase intent. Besides, the topics can also vary over time. Thus, a topic model for user behavior targeting should be dynamic and updated online. However, inference of a topic model for millions of users over several months is very challenging. Even approximate inference, for example, Gibbs sampling, is computationally infeasible. While sequential Monte Carlo (SMC) could be a possible solution, the SMC estimator can quickly become too heavy under long-range dependence, which makes it infeasible to transfer and update across so many computers. Instead, only forward sampling is used for inference, so that we never need to go back to look at old data. The new inference method has been demonstrated to be very efficient for analyzing Yahoo user profiles, providing powerful tools for web applications such as online advertising targeting, content personalization, and social recommendations.
Second, the Gibbs sampler is highly tuned for LDA, which makes it very hard to extend to other applications. To address this issue, a flexible large-scale topic modeling package in MapReduce (Mr.LDA) was proposed using the variational inference method as an alternative (Zhai et al. 2012). Compared to randomized Gibbs sampling, variational inference is deterministic given an initialization, which ensures the same output at each step no matter where or when the implementation is running. This uniformity is important for a MapReduce system to check for failure at a worker node and thus have greater fault tolerance, which is very hard with randomized Gibbs sampling. Thus, variational inference is more suitable for a distributed computing framework such as MapReduce.
Variational inference finds variational parameters that minimize the Kullback–Leibler divergence between the variational distribution and the posterior distribution, which is the target of the LDA problem, as mentioned earlier. The variational distribution is carefully chosen in such a way that the probabilistic dependencies of the true distribution are well maintained. This independence results in natural parallelization of the computation across multiple processors. Furthermore, variational inference takes only dozens of iterations to converge, while Gibbs sampling might take thousands. Most significantly, Mr.LDA is very flexible and can be easily extended to efficiently handle online updates.
While distributed LDA has proved to be efficient for solving large-scale topic modeling problems, regularized latent semantic indexing (RLSI) (Wang et al. 2011) is another smart design for topic modeling parallelization. The text corpus is represented as a term–document matrix, which is then approximated by two matrices: a term–topic matrix and a topic–document matrix. RLSI solves the optimization problem that minimizes a quadratic loss on the term–document occurrences with regularization. Specifically, the topics are regularized with the l1 norm and the documents with the l2 norm. The formulation of RLSI is carefully chosen so that the inference process can be decomposed into many subproblems. This is the key ingredient that allows RLSI to scale up via parallelization. With smooth regularization, the overfitting problem is effectively addressed in RLSI.
The most distinguishing advantage of RLSI is that it can scale to very large data sets without reducing the vocabulary. Most techniques reduce the number of terms to improve efficiency when the matrix becomes very large, which improves scalability at the cost of decreased learning accuracy.
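The sketch below conveys the flavor of this decomposable formulation with a toy alternating minimization in NumPy: an l2-regularized (ridge) update for the topic–document matrix and a soft-thresholded proximal step for the l1-regularized term–topic matrix. It is an illustration under our own assumptions, not the algorithm of Wang et al. (2011).

```python
# Toy alternating minimization in the spirit of RLSI: D ~ U V, with an
# l1 penalty on the term-topic matrix U and an l2 penalty on the
# topic-document matrix V. Penalties and step sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 30, 20, 4                      # terms, documents, topics
D = rng.random((M, N))                   # toy term-document matrix
U = rng.random((M, K))
V = rng.random((K, N))
lam1, lam2 = 0.1, 0.1

for _ in range(50):
    # V-step: ridge regression given U (l2 regularization on documents).
    V = np.linalg.solve(U.T @ U + lam2 * np.eye(K), U.T @ D)
    # U-step: one proximal-gradient (ISTA) step with soft-thresholding,
    # which enforces l1 sparsity on the term-topic matrix.
    step = 1.0 / np.linalg.norm(V @ V.T, 2)
    G = U - step * (U @ V - D) @ V.T
    U = np.sign(G) * np.maximum(np.abs(G) - step * lam1, 0.0)

print("reconstruction error:", np.linalg.norm(D - U @ V))
print("sparsity of U:", float((U == 0).mean()))
```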
from the topics of this multinomial. While this Bayesian network framework does make sampling rather easy, the conditional dependencies between hidden variables become the bottleneck of efficient learning for LDA models. A multiwing harmonium (MWH) was proposed for captioned images by combining a multivariate Poisson distribution for word counts in a caption and a multivariate Gaussian for the color histogram of the image (Xiang et al. 2005). Unlike all the aforementioned models, MWH can be considered an undirected topic model because of the bidirectional relationship between the latent variables and the inputs: the hidden topic structure can be viewed as predictors from a discriminative model taking the inputs, while it also describes how the input is generated. This formalism enjoys a significant advantage for fast inference as it maintains the conditional independence between the hidden variables. MWH has multiple harmoniums, which group observations from all sources using a shared array of hidden variables to model the theme of all sources. This is consistent with the fact that in real applications the input data do not necessarily come from a single source. In Xiang et al.'s (2005) work, contrastive divergence and variational learning algorithms are both designed and evaluated on a dual-wing harmonium (DWH) model for the tasks of image annotation and retrieval on news video collections. DWH is demonstrated to be robust and efficient with both learning algorithms. It is worth noting that while the conditional independence enables fast inference for MWH, it also makes learning more difficult. This trade-off can be acceptable for off-line learning, but in other cases, it might need further investigation.
While image classification is a separate problem from image annotation, the two tasks can be connected, as they provide evidence for each other and share the goal of automatically organizing images. A single coherent topic model was proposed for simultaneous image classification and annotation (Wang et al. 2009). The typical supervised topic model for image classification, supervised LDA (Blei and McAuliffe 2007), was extended to multiple classes and embedded with a probabilistic model of image annotation, followed by substantial development of inference algorithms based on a variational method. The coherent model, examined on real-world image data sets, provides comparable annotation performance and better-than-state-of-the-art classification performance.
All these probabilistic topic models estimate the joint distribution of caption words and regional image features to discover the statistical relationship between visual image features and words in an unsupervised way. While the results from these models are encouraging, simpler topic models such as latent semantic analysis (LSA) and direct image matching can also achieve good performance for automatic image annotation through supervised learning, where image classes associated with a set of words are predefined. A comparison between LSA and probabilistic LSA was presented with an application to the Corel database (Monay and Gatica-Perez 2003). Surprisingly, the results showed that simple LSA based on annotation propagation outperformed probabilistic LSA on this application.
such as topic models to automatically organize multimedia Big Data by capturing the
intrinsic topic structure, which makes good sense in simplifying Big Data analysis.
For another, it is prohibitively challenging to handle such large-scale multimedia content
as images, audio, and videos. Cloud computing provides a new generation of computing
infrastructure to manage and process Big Data through distribution and parallelization.
Multimedia cloud computing has become an emerging technology for providing multimedia
services and applications.
In cloud-based multimedia computing, the multimedia application data can be stored and
accessed in a distributed manner, unlike traditional multimedia processing, which is done on
the client or server ends. One of the big challenges with multimedia data is associated with
heterogeneous data types and services, such as video conferencing and photo sharing. Another
challenge is how to make the multimedia cloud provide distributed parallel processing services
(Zhu et al. 2011). Correspondingly, parallel algorithms such as parallel spectral clustering
(Chen et al. 2011), parallel support vector machines (SVMs) (Chang et al. 2007), and the aforementioned
PLDA provide important tools for mining large-scale rich-media data with cloud
services. More recently, several large-scale multimedia data storage and processing methods
have been proposed based on the Hadoop platform (Lai et al. 2013; Kim et al. 2014). Python-based
content analysis using specialization (PyCASP), a Python-based framework, has also been presented
for multimedia content analysis; it automatically maps computation from Python application
code onto a variety of parallel platforms (Gonina et al. 2014).
makes it infeasible to track ever-growing data at the level of terabytes or petabytes. To solve
the scalability issues, data-intensive scalable computing (DISC) was specially designed as
a new computing paradigm for large-scale data analysis (Bryant 2007). In this paradigm,
the data themselves are emphasized, and special care is taken to handle constantly
growing and changing data collections in addition to performing large-scale computations. The
aforementioned MapReduce framework is a popular application built on top of this paradigm.
The intuition behind MapReduce is very simple: If we have a very large set of tasks,
we can tackle the problem by hiring more workers, distributing the tasks to these
workers, and finally, grouping or combining all the results from the parallelized tasks.
MapReduce is simple to understand; however, the key problem is how to build semantic
concept models and develop efficient learning algorithms that can scale to the
distributed platforms.
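To make the intuition concrete, the following toy C program walks through the three MapReduce phases in a single process: a map step that emits (word, 1) pairs from each document, a sort that stands in for the shuffle, and a reduce step that sums the counts per key. It is only an illustrative sketch; a real deployment runs the map and reduce tasks on many distributed workers, and all names and sizes here are made up for the example.

/* Single-process sketch of the MapReduce flow: map emits (key, 1)
 * pairs, a sort stands in for the shuffle, and reduce sums the counts
 * of consecutive identical keys. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 1024
#define MAX_KEY   32

typedef struct { char key[MAX_KEY]; int value; } Pair;

static Pair pairs[MAX_PAIRS];
static int npairs = 0;

/* Map task: emit one (word, 1) pair per word in the document. */
static void map(const char *document) {
    char buf[256];
    strncpy(buf, document, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (char *tok = strtok(buf, " "); tok != NULL; tok = strtok(NULL, " ")) {
        if (npairs >= MAX_PAIRS) return;
        strncpy(pairs[npairs].key, tok, MAX_KEY - 1);
        pairs[npairs].key[MAX_KEY - 1] = '\0';
        pairs[npairs].value = 1;
        npairs++;
    }
}

static int cmp(const void *a, const void *b) {
    return strcmp(((const Pair *)a)->key, ((const Pair *)b)->key);
}

/* Reduce task: sum the values of consecutive pairs sharing a key. */
static void reduce(void) {
    int i = 0;
    while (i < npairs) {
        int sum = 0, j = i;
        while (j < npairs && strcmp(pairs[j].key, pairs[i].key) == 0) {
            sum += pairs[j].value;
            j++;
        }
        printf("%s\t%d\n", pairs[i].key, sum);
        i = j;
    }
}

int main(void) {
    const char *documents[] = { "big data topic model",
                                "topic model for big data" };
    for (int d = 0; d < 2; d++)
        map(documents[d]);                        /* map phase */
    qsort(pairs, npairs, sizeof(Pair), cmp);      /* shuffle (group by key) */
    reduce();                                     /* reduce phase */
    return 0;
}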
In recent years, the implementation of topic modeling of multimedia data on the
MapReduce framework has provided a good solution for large-scale multimedia retrieval
and analysis. An overview of MapReduce for multimedia data mining was presented for
its application in web-scale computer vision (White et al. 2010). An efficient semantic concept
modeling method named robust subspace bagging (RSB) was proposed, combining random subspace
bagging and forward model selection (Yan et al. 2009). Considering the common problem of
overfitting due to the high dimensionality of multimedia data, this semantic concept model
remarkably reduces the risk of overfitting by iteratively searching for the best models
in the forward model selection step based on a validation collection. A MapReduce implementation
was also presented in Yan et al. (2009) and tested on standard image and video data
sets. Normally, tasks in the same category, either Map tasks or Reduce tasks, would be
assumed to require roughly the same execution time so that longer tasks would not slow
down the distributed computing process. However, the features of multimedia data are
heterogeneous with various dimensions; thus, the tasks cannot be guaranteed to have similar
execution times. Therefore, extra effort should be made to organize these unbalanced
tasks. A task scheduling algorithm was specially designed in Yan et al. (2009) to estimate
the running time of each task and optimize task placement for heterogeneous tasks, which
achieved significantly improved performance compared to the baseline SVM scheduler.
Most of the multimedia data we have discussed so far are text and images. Large-scale
topic modeling also finds success in video retrieval and analysis. Scene understanding
is an established field in computer vision for automatic video surveillance. The two
challenges are the robustness of the features and the computational complexity for scalability
to massive data sets, especially for real-world data. Real-world data streams are
unbounded, and the number of motion patterns is unknown in advance. This makes
traditional LDA inapplicable, as the number of topics in LDA is usually fixed before the
documents are generated. Instead, BNP models are more appealing for discovering unknown
patterns (Teh et al. 2006). The selection of features and the incremental inference corresponding
to a continuous stream are crucial to enable the scalability of the models for
large-scale multimedia data.
A BNP model was designed for large-scale statistical modeling of motion patterns (Rana
and Venkatesh 2012). In the data preprocessing stage, coarse-scale low-level features are
selected as a trade-off between model expressiveness and complexity; that is, a more sophisticated model can
define finer patterns but at a higher computational cost. A robust and efficient principal
component analysis (PCA) was utilized to extract sparse foregrounds in video. To avoid
the costly singular value decomposition operation in each iteration, a rank-1 constraint, justified
by the almost-unchanged background during a short period, was used for PCA. Then the
Dirichlet process mixture was employed to model the sparse components extracted from
PCA. Similar to the generative model for text documents, in the bag-of-events model, each
feature vector was considered as a sample from a motion pattern. The combination of a multinomial
distribution and a mixture of Gaussians defines the location and quantity of the
motion. A decayed Markov chain Monte Carlo (MCMC) incremental inference was developed
for fixed-cost updates in the online setting. The posterior is guaranteed to converge
to the true value on the condition that any point in the past is selected with nonzero
probability. A traditional decay function such as exponential decay depends only on time,
while for the scene-understanding problem, the probability distribution also depends on
the data in the clustering space, that is, the distance between past observations and the
current ones. To improve the scalability to large-scale settings, the distance between clusters
is measured to represent the distance between samples in each cluster. The framework
was tested on a 140-h-long video, whose size is quite large, and it was demonstrated to
provide pattern-discovery performance comparable to existing scene-understanding
algorithms. This work enhances our ability to effectively and efficiently discover unknown
patterns from unbounded video streams, providing a very promising framework for large-scale
topic modeling of multimedia data.
example, comparative studies between an ordinary Gaussian mixture model and Corr-LDA
have demonstrated that Corr-LDA is significantly better for the tasks of automatic image
annotation and text-based image retrieval. Various distributed and parallel computing
frameworks such as PLDA and parallel support vector machines (PSVM) (Chang et al.
2007) have been proposed to solve scalability issues in multimedia computing. The DISC
paradigm is another popular design for large-scale data analysis (Bryant 2007). The popular
Big Data processing framework MapReduce also finds success in multimedia cloud computing
applications such as web-scale computer vision (White et al. 2010). Another major
issue with the high dimensionality of large-scale data is overfitting. A novel semantic concept
model was proposed (Yan et al. 2009) to address the overfitting issue, and the results have
demonstrated the effectiveness of the method.
In future work, scalability will be an even more prominent issue with the ever-growing data
size. Web-based multimedia data mining (White et al. 2010) demonstrates the power of
implementing computer vision algorithms such as training, clustering, and background subtraction
in the MapReduce framework. Cloud-based computing will remain a very active
research area for large-scale multimedia analysis. Moreover, for cloud-based large-scale
multimedia computing, it is very desirable to visualize high-dimensional data with easy
interpretation at the user interface. The current method mainly displays topics by term
frequency (Blei 2012), which is limited, as the hidden topic structure also connects different
documents. The relations among multimedia data and the underlying topics are even more
complex. Visualizing these complex relations will significantly help people consume multimedia
content. In addition, topic models are usually trained in an off-line fashion, where
the whole batch of data is used once to construct the model. This process does not suit
the online setting, where data streams continuously arrive over time. With online learning
(Barbara and Domeniconi 2008), we do not need to rebuild the whole topic model
when new data arrive but can instead incrementally update the parameters based on the new data.
Thus, online learning could provide a much more efficient solution to multimedia analysis,
as some multimedia content is streaming in nature. Furthermore, feature selection
and feature engineering are crucial for multimedia analysis and retrieval. With the recent
advances in Deep Learning (Hinton et al. 2006; Hinton and Salakhutdinov 2006), learning
compact representations and features of multimedia data will become an important
research topic.
REFERENCES
Ahmed, A., Low, Y., and Smola, A. Scalable Distributed Inference of Dynamic User Interests for
Behavioral Targeting. ACM SIGKDD Conference on Knowledge Discovery from Data, 2011.
Barbara, D., and Domeniconi, C. On-line LDA: Adaptive Topic Models for Mining Text Streams with
Applications to Topic Detection and Tracking. IEEE International Conference on Data Mining,
2008.
Barnard, K., Duygulu, P., Forsyth, D., De Freitas, N., Blei, D., and Jordan, M. Matching Words and
Pictures. Journal of Machine Learning Research, 3, 1107–1135, 2003.
Blei, D. Probabilistic Topic Models. Communications of the ACM, 55(4), 77–84, 2012.
Blei, D., and Jordan, M. Modeling Annotated Data. ACM SIGIR Conference on Research and
Development in Information Retrieval, 127–134, 2003.
Blei, D., and Lafferty, J. Dynamic Topic Models. International Conference on Machine Learning, 113–
120, 2006.
Blei, D., and Lafferty, J. A Correlated Topic Model of Science. Annals of Applied Statistics, 1(1), 17–35,
2007.
Blei, D., and McAuliffe, J. Supervised Topic Models. Neural Information Processing Systems, 2007.
Blei, D., Ng, A., and Jordan, M. Latent Dirichlet Allocation. Journal of Machine Learning Research,
3, 993–1022, 2003.
Bryant, R.E. Data-Intensive Supercomputing: The Case for DISC. Technical Report, School of
Computer Science, Carnegie Mellon University, Pittsburgh, PA, 2007.
Chang, E.Y., Zhu, K., Wang, H., and Bai, H. PSVM: Parallelizing Support Vector Machines on dis-
tributed computers. Neural Information Processing Systems, 2007.
Chang, J., and Blei, D. Hierarchical Relational Models for Document Networks. Annals of Applied
Statistics, 4(1), 124–150, 2010.
Chen, Y.W., Song, Y., Bai, H., Lin, C. J., and Chang, E.Y. Parallel Spectral Clustering in Distributed
Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3), 568–586, 2011.
Datta, R., Li, J., and Wang, J. Content-based Image Retrieval: Approaches and Trends of the New Age.
ACM SIGMM International Workshop on Multimedia Information Retrieval, 253–262, 2005.
Dean, J., and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Communications
of ACM, 2008.
Deerwester, S., Dumais, S., Landauer, T., Furnas, G., and Harshman, R. Indexing by latent semantic
analysis. Journal of the American Society of Information Science, 41(6), 391–407, 1990.
Duygulu, P., Barnard, K., de Freitas, N., and Forsyth, D.A. Object Recognition as Machine Translation:
Learning a Lexicon for a Fixed Image Vocabulary. European Conference on Computer Vision,
97–112, 2002.
Gonina, E., Friedland, G., Battenberg, E., Koanantakool, P., Driscoll, M., Georganas, E., and
Keutzer, K. Scalable Multimedia Content Analysis on Parallel Platforms using Python. ACM
Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP),
10(2), 18–38, 2014.
Hinton, G.E., Osindero, S., and Teh, Y. A Fast Learning Algorithm for Deep Belief Nets. Neural
Computation, 18, 1527–1554, 2006.
Hinton, G.E., and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks.
Science, 313(5786), 504–507, 2006.
Hofmann, T. Probabilistic Latent Semantic Indexing. ACM SIGIR Conference on Research and
Development in Information Retrieval, 50–57, 1999.
Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. Dryad: Distributed Data-Parallel Programs
from Sequential Building Blocks. European Conference on Computer Systems (EuroSys), 2007.
Kim, M., Han, S., Jung, J., Lee, H., and Choi, O. A Robust Cloud-based Service Architecture for
Multimedia Streaming Using Hadoop. Lecture Notes in Electrical Engineering, 274, 365–370, 2014.
Lai, W.K., Chen, Y.U., Wu, T.Y., and Obaidat, M.S. Towards a Framework for Large-scale Multimedia
Data Storage and Processing on Hadoop Platform. The Journal of Supercomputing, 68(1), 488–
507, 2013.
Li, F.F., and Perona, P. A Bayesian Hierarchical Model for Learning Natural Scene Categories. IEEE
Conference on Computer Vision and Pattern Recognition, 524–531, 2005.
Malewicz, G., Austern, M., and Czajkowski, G. Pregel: A System for Large-Scale Graph Processing.
ACM SIGMOD Conference, 2010.
Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., and Vassilakis, T. Dremel:
Interactive Analysis of Web-scale Datasets. International Conference on Very Large Data Bases,
330–339, 2010.
Monay, F., and Gatica-Perez, D. On Image Auto-annotation with Latent Space Models. ACM
International Conference on Multimedia, 275–278, 2003.
Newman, D., Asuncion, A., Smyth, P., and Welling, M. Distributed Inference for Latent Dirichlet
Allocation. Neural Information Processing Systems, 1081–1088, 2007.
Papadimitriou, C.H., Raghavan, P., Tamaki, H., and Vempala, S. Latent Semantic Indexing: A
Probabilistic Analysis. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database
Systems, 159–168, 1998.
Power, R., and Li, J. Piccolo: Building Fast, Distributed Programs with Partitioned Tables. USENIX
Conference on Operating Systems Design and Implementation, 2010.
Rana, S., and Venkatesh, S. Large-Scale Statistical Modeling of Motion Patterns: A Bayesian
Nonparametric Approach. Indian Conference on Computer Vision, Graphics and Image
Processing, 2012.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. The Author-Topic Model for Authors and
Documents. Uncertainty in Artificial Intelligence, 2004.
Sivic, J., Russell, B., Zisserman, A., Freeman, W., and Efros, A. Unsupervised Discovery of Visual
Object Class Hierarchies. IEEE Conference on Computer Vision and Pattern Recognition, 2008.
Smola, A., and Narayanamurthy, S. An Architecture for Parallel Topic Models. Proceedings of the VLDB
Endowment, 3(1), 703–710, 2010.
Teh, Y., Jordan, M., Beal, M., and Blei, D. Hierarchical Dirichlet Processes. Journal of American
Statistical Association, 101(476), 1566–1581, 2006.
Wallach, H. Topic Modeling: Beyond Bag of Words. International Conference on Machine Learning,
2006.
Wang, Q., Xu, J., Li, H., and Craswell, N. Regularized Latent Semantic Indexing. ACM SIGIR
Conference on Research and Development in Information Retrieval, 2011.
Wang, Y., Bai, H., Stanton, M., Chen, M., and Chang, E.Y. PLDA: Parallel Latent Dirichlet
Allocation for Large-Scale Applications. Conference on Algorithmic Aspects in Information and
Management, 301–314, 2009.
White, B., Yeh, T., Lin, J., and Davis, L. Web-Scale Computer Vision using MapReduce for Multimedia
Data Mining. ACM KDD Conference Multimedia Data Mining Workshop, 2010.
Xiang, E., Yan, R., and Hauptmann, A. Mining Associated Text and Images with Dual-Wing
Harmoniums. Uncertainty in Artificial Intelligence, 2005.
Yan, R., Fleury, M., and Smith, J., Large-Scale Multimedia Semantic Concept Modeling Using Robust
Subspace Bagging and MapReduce. ACM Workshop on Large-Scale Multimedia Retrieval and
Mining, 2009.
Zhai, K., Boyd-Graber, J., Asadi, N., and Alkhouja, M. Mr. LDA: A Flexible Large Scale Topic Modeling
Package Using Variational Inference in MapReduce. ACM Conference World Wide Web, 2012.
Zhu, W., Luo, C., Wang, J., and Li, S. Multimedia Cloud Computing. IEEE Signal Processing Magazine,
28(3), 59–69, 2011.
CHAPTER 20
Big Data Biometrics Processing
Xueyan Li and Chen Liu
CONTENTS
20.1 Introduction 394
20.2 Background 395
20.2.1 Intel Xeon Phi 395
20.2.2 Iris Matching Algorithm 396
20.2.3 OpenMP 397
20.2.4 Intel VTune Amplifier 397
20.3 Experiments 398
20.3.1 Experiment Setup 398
20.3.2 Workload Characteristics 398
20.3.3 Impact of Different Affinity 399
20.3.4 Optimal Number of Threads 401
20.3.5 Vectorization 401
20.4 Conclusions 403
Acknowledgments 403
References 403
ABSTRACT
With the rapidly expanding biometric data collected by various sources for identification
and verification purposes, how to manage and process such Big Data draws
great concern. On one hand, biometric applications normally involve comparing a
huge number of samples and templates, which places strict requirements on the computational
capability of the underlying hardware platform. On the other hand, the
number of cores and associated threads that hardware can support has increased
greatly; an example is the newly released Intel Xeon Phi coprocessor. Hence, Big Data
20.1 INTRODUCTION
With the drive towards achieving higher computation capability, the most advanced
computing systems have been adopting alternatives to traditional general-purpose
processors (GPPs) as their main components to better prepare for Big Data processing.
NVIDIA's graphics processing units (GPUs) have powered many of the top-ranked supercomputer
systems since 2008. In the latest list published by Top500.org, two systems with
Intel Xeon Phi coprocessors have claimed positions 1 and 7 [1].
While it is clear that the need to improve efficiency for Big Data processing will continuously
drive changes in hardware, it is important to understand that these new systems have their own
advantages as well as limitations. The effort required from researchers to port their codes
onto the new platforms is also of great significance. Unlike other coprocessors and accelerators,
the Intel Xeon Phi coprocessor does not require learning a new programming language or
new parallelization techniques. It presents an opportunity for researchers to share parallel
code between the coprocessor and the GPP. This platform follows the standard parallel programming model,
which is familiar to developers who already work with x86-based parallel systems.
This does not mean that achieving good performance on this platform is simple. The
hardware, while presenting many similarities with other existing multicore systems, has
its own characteristics and unique features. In order to port code in an efficient way,
those aspects must be understood.
The specific application that we focus on in this parallelization study is biometrics. The
importance of biometrics and other identification technologies has increased significantly,
and we can observe biometric systems being deployed everywhere in our daily lives. This
has dramatically expanded the amount of biometric data collected and examined every day
in health care, employment, law enforcement, and security systems. There can be up to petabytes
of biometric data from hundreds of millions of identities that need to be accessed in
real time. For example, the Department of Homeland Security (DHS) Automated Biometric
Identification System (IDENT) database held 110 million identities and enrolled or verified
over 125,000 individuals per day in 2010 [2]. India's unique identification (UID)
program has enrolled over 100 million people to date and is expected to cover more than 1 billion
individuals. This Big Data processing poses a great computational challenge. In addition,
running biometric applications with sizable data sets consumes large amounts of power
and energy. Therefore, efficient power management is another challenge we are facing.
As a result, we propose to execute biometric applications on many-core architectures with
great energy efficiency while achieving superb performance at the same time. In this study,
we want to analyze the performance of an OpenMP implementation of an iris recognition
algorithm on the new Intel Xeon Phi coprocessor platform. The focus is to understand the
workload characteristics, analyze the results with respect to specific features of the hardware platform,
and discuss aspects that could help improve the overall performance on Intel Xeon Phi.
20.2 BACKGROUND
In this section, we will introduce the Xeon Phi coprocessor, the iris matching algorithm,
the OpenMP programming model, and the Intel VTune performance analyzer, all serving
as background knowledge for this research.
need to be highly parallel and vectorized in order to optimize the performance on this platform.
The developers can maximize parallel performance on Xeon Phi in four ways:
They should also vectorize applications to maximize single instruction multiple data
(SIMD) parallelism by including vectorization directives and applying loop transformations as
needed, in case the code cannot be automatically vectorized by the compiler [5].
Xeon Phi can be used in two different modes: native and offload. In our experiments,
we focused on the offload mode. There are two problems we need to handle in the offload
mode: One is managing the data transfer between the host and the coprocessor; the other
is launching the execution of functions (or loops) on the coprocessor. In the offload mode,
all data transfer between the host and the coprocessor goes through the PCIe bus.
Therefore, an attempt should be made to exploit data reuse. This is also a consideration for
applications written using OpenMP that span both the host and the coprocessor. Getting
the benefit from both the host and Xeon Phi is our goal.
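As a rough illustration of the offload mode, the sketch below uses the Intel compiler's offload pragma to ship an array to the coprocessor, run an OpenMP loop there, and copy the result back; the in()/out() clauses are what drive the PCIe transfers mentioned above. The array names and sizes are arbitrary, and the #ifdef guard simply lets the same code fall back to host execution when offload support is unavailable.

/* Sketch of offload-mode execution on Intel Xeon Phi, assuming the
 * Intel compiler's Language Extensions for Offload. The in()/out()
 * clauses manage the data transfer over PCIe. */
#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (float)i;

#ifdef __INTEL_OFFLOAD
    /* Copy a[] to the coprocessor, run the block there, copy b[] back. */
    #pragma offload target(mic:0) in(a) out(b)
#endif
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = 2.0f * a[i];       /* placeholder computation */
    }

    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}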
After sampling the iris region (pupil and limbus), the iris sample is transformed into a
two-dimensional (2-D) image. Then Gabor filtering extracts the most discriminating
information from the iris pattern as a biometric template. Each iris sample is represented
with two 2-D matrices of bits (20 rows by 480 columns). The first one is the template matrix,
representing the iris pattern. The second one is the mask matrix, representing aspects such
as eyelid, eyelash, and noise (e.g., specular reflections), which may occlude parts of the
iris. Two iris samples are compared based on the Hamming distance calculation, with
both matrices from each sample being used. The calculated matching score is normalized
between 0 and 1, where a lower score indicates that the two samples are more similar.
The main body of the algorithm is a 17-round for-loop structure with the index shifting
from −8 to 8, performing bit-wise logical operations (AND, OR, XOR) on the matrices. We
suppose that the matrices of the first iris sample are named template1 and mask1 and the
matrices of the second iris sample are named template2 and mask2. In each round, three
steps are needed to finish the procedure. First, template1 and mask1 are rotated by 2 ×
|shift| columns to form two intermediate matrices, named template1s and mask1s. They
are left-rotated if shift is less than 0 or right-rotated if shift is greater than 0. Second, we
XOR template1s with template2 into temp and OR mask1s with mask2 into mask. Finally,
Matrix temp is ANDed with the complement of Matrix mask to form Matrix result, so that
positions occluded in either sample are excluded. The Hamming distance for this round is
then the number of 1s in Matrix result divided by the total number of matrix entries
(20 × 480) minus the number of 1s in Matrix mask [9]. The minimum Hamming distance
over all 17 iterations measures the two samples' similarity and is returned as the matching
score of the two samples.
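The following sketch implements a shift-compensated fractional Hamming distance along the lines described above. The conventions are assumptions rather than the chapter's exact code: bits are stored unpacked (one byte per bit), a mask bit of 1 marks an occluded position, and occluded positions are excluded from both the numerator and the denominator.

/* Shift-compensated fractional Hamming distance between two iris
 * samples, each a 20 x 480 template matrix plus a 20 x 480 mask. */
#include <stdio.h>

#define ROWS 20
#define COLS 480

typedef unsigned char Bits[ROWS][COLS];

/* Rotate every row of src by `cols` columns (positive = right). */
static void rotate(Bits src, Bits dst, int cols) {
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            dst[r][(c + cols + COLS) % COLS] = src[r][c];
}

/* Minimum Hamming distance over shifts of -8..8 (2 columns per step). */
static double matching_score(Bits t1, Bits m1, Bits t2, Bits m2) {
    double best = 1.0;
    for (int shift = -8; shift <= 8; shift++) {
        Bits t1s, m1s;
        rotate(t1, t1s, 2 * shift);
        rotate(m1, m1s, 2 * shift);

        long diff = 0, valid = 0;
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++) {
                int occluded = m1s[r][c] | m2[r][c];   /* OR of masks  */
                int mismatch = t1s[r][c] ^ t2[r][c];   /* XOR of codes */
                if (!occluded) {
                    valid++;
                    diff += mismatch;
                }
            }
        double hd = (valid > 0) ? (double)diff / (double)valid : 1.0;
        if (hd < best) best = hd;
    }
    return best;   /* lower score = more similar */
}

int main(void) {
    static Bits t1, m1, t2, m2;
    /* Toy data: identical templates except one flipped column. */
    for (int r = 0; r < ROWS; r++) t2[r][0] = 1;
    printf("score = %f\n", matching_score(t1, m1, t2, m2));
    return 0;
}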
20.2.3 OpenMP
OpenMP [10] is an application programming interface (API) that supports multiplatform
shared-memory parallel programming in FORTRAN, C, and C++. It consists of a set of compiler
directives and library routines, and runtime behavior can be controlled by environment
variables [11]. Compiler directives annotate loop bodies and sections of code for parallel
execution and mark variables as private (local) or shared (global). Certain constructs
exist for critical sections, completely independent tasks, or reductions on variables [12].
When a for loop is declared to be parallel, the iterations of the loop are executed concurrently.
The iterations of the loop can be allocated to the working threads according
to three scheduling policies: static, dynamic, and guided. In static scheduling, the iterations
are either partitioned into as many intervals as the number of threads or partitioned
into chunks that are allocated to the threads in a round-robin fashion. Dynamic scheduling
partitions the iterations into chunks; those chunks are dynamically allocated to the threads
using a first-come, first-served policy. Finally, the guided scheduling policy tries to reduce
the scheduling overhead by first allocating a large amount of work to each thread. It geometrically
decreases the amount of work allocated to each thread (down to a given minimum
chunk size) in order to optimize the load balance [12]. In this study, we use the default
static scheduling policy. In addition, the #pragma omp parallel directive can be used to
mark a parallel section, and the #pragma simd directive can be used to perform vectorization.
In this study, we employed both.
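A minimal example of the constructs discussed in this section is given below: a parallel for-loop with an explicit static schedule and a vectorized inner reduction. The chapter uses the Intel-specific #pragma simd directive; this sketch uses the portable OpenMP 4.0 form #pragma omp simd, and the array sizes are illustrative only.

/* Parallel outer loop with static scheduling and a SIMD inner loop. */
#include <omp.h>
#include <stdio.h>

#define N_PAIRS 1000
#define N_BITS  9600      /* 20 x 480 bits per template */

static unsigned char codes[N_PAIRS][N_BITS];
static unsigned char probe[N_BITS];
static int distance[N_PAIRS];

int main(void) {
    /* Each iteration (one template comparison) is independent, so the
     * iterations are shared among the threads; static scheduling hands
     * each thread a fixed, contiguous chunk of iterations. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N_PAIRS; i++) {
        int diff = 0;
        /* The inner loop is a reduction over independent bit positions,
         * a natural candidate for SIMD vectorization. */
        #pragma omp simd reduction(+:diff)
        for (int j = 0; j < N_BITS; j++)
            diff += codes[i][j] ^ probe[j];
        distance[i] = diff;
    }

    printf("threads available: %d, distance[0] = %d\n",
           omp_get_max_threads(), distance[0]);
    return 0;
}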
spent on a basic block, several lines of code, or even a single line of the source code inside a
function. In this way, we can identify the performance-intensive function (hot-spot function)
or lines of code (hot block) in the software algorithm [14].
20.3 EXPERIMENTS
In this section, we first present how we set up the experiments. Then we present the results
from our experimental data in detail.
1. Write the source code in C on BEACON, and then verify its correctness and validate the
output against the original results obtained from Microsoft Visual Studio 2010.
2. Study the workload characteristics of the iris matching algorithm by using Intel
VTune.
3. Based on the results from the previous steps, rewrite the program by adding parallelism
and vectorization features.
4. Compare all the results to obtain the optimal number of threads and the influence of
different affinities, as well as of vectorization, on the performance.
(Pie chart: breakdown of execution time among the functions Main, ReadFile, Shiftbits, GetHammingDistance, and Others.)
FIGURE 20.1 Percentage of execution time of the iris matching algorithm functions.
performs the function breakdown in order to identify the software performance bottleneck.
Table 20.1 shows the environment in which we used VTune to obtain the characteristics
of the iris matching algorithm on an Intel Xeon E5-2630 CPU.
Figure 20.1 shows the functions of the iris matching algorithm along with their consumed
percentage of the total execution time. Note that the time recorded for each function
does not include the time spent on subfunction calls [14]. For example, the time for function
GetHammingDistance does not include the time spent on function shiftbits. The
results show that the ReadFile function, which loads the samples into memory, occupies
the majority of the execution time. This means that data transfer is the bottleneck of the iris
matching algorithm.
number of threads, and the y-axis representing the execution time in seconds. The difference
between the two affinities can be contrasted in three parts:
1. Compact affinity is more effective than scatter affinity when only a small number
of threads are needed, as shown in Figure 20.2. When the number of threads is less
than eight, compact affinity only uses a small number of cores (one to two cores), and
hence, the communication overhead between the threads is small; on the other hand,
scatter affinity uses the same number of cores as the number of threads.
2. Scatter affinity is faster than compact affinity between 16 and 240 threads, as shown
in Figure 20.3. In this case, even though compact affinity uses fewer cores, every core
(Bar chart: execution times in seconds under scatter and compact affinity for 1, 2, 4, and 8 threads.)
FIGURE 20.2 Difference between scatter and compact affinity for one to two cores.
(Bar chart: execution times in seconds under scatter and compact affinity for 16, 30, 60, 120, 180, and 240 threads.)
FIGURE 20.3 Difference between scatter and compact affinity for more than two cores.
(Figure 20.4: execution time in seconds versus the number of threads, from 30 to 1920, under compact and scatter affinity.)
is fully loaded with four threads; the threads compete for hardware resources during
their execution and slow down one another, what we call "interthread
interference," which outweighs the benefit gained from the saved communication overhead.
On the other hand, with scatter affinity, the cores are not as fully loaded,
which shows the benefit.
3. When the number of threads is large (more than 240 threads), the effect of the two affinities
is almost the same, as shown in Figure 20.4. In this case, the number of threads
exceeds the hardware capability (60 cores with 4 threads each, resulting in a total of
240 threads), so all the cores are fully loaded under either affinity.
Overall, when we run a program on Xeon Phi, the suitable affinity must be chosen
based on the number of threads and the corresponding parallelism level.
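With the Intel OpenMP runtime, the affinity itself is usually selected through the KMP_AFFINITY environment variable (for example, KMP_AFFINITY=compact or KMP_AFFINITY=scatter) rather than in the source code. The small Linux-only probe below is an illustration rather than part of the chapter's code; it prints which logical CPU each thread lands on, which makes the difference between the two placements visible.

/* Probe for inspecting thread-to-core placement under a given affinity
 * setting. sched_getcpu() is Linux-specific. */
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    {
        /* Each thread reports the logical CPU it is running on, so the
         * compact and scatter placements can be compared directly. */
        printf("thread %2d of %2d runs on CPU %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}

Running it as, for example, KMP_AFFINITY=scatter OMP_NUM_THREADS=60 ./probe and then again with KMP_AFFINITY=compact shows the contrasting thread placements.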
20.3.5 Vectorization
As we can see from Figure 20.1, function GetHammingDistance takes the second longest
time, next only to the ReadFile function. Since Xeon Phi is equipped with a very powerful
vector engine [19], our target is to utilize the vector engine to improve the overall performance.
As a result, we rewrite the GetHammingDistance function by adding parallelism
(Figure: execution time versus number of threads under scatter affinity (top) and compact affinity (bottom), each with a detail panel for 120, 180, and 240 threads showing the overall optimal number of threads.)
and vectorization. Since we have already discussed the optimal number of threads and the suitable
affinity, in this case, we run the program with scatter affinity. Figure 20.6 shows the execution
time of running the iris matching algorithm on Xeon Phi in the offload mode with
and without vectorization. Please note that the y-axis is in log scale, so 2 corresponds to
100 s, 3 to 1000 s, and so forth. Clearly, in all cases, vectorization improves
the performance. The best time occurs when we use the optimal number of 120 threads.
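One common way to exploit wide vector units for this kind of bit counting, shown in the sketch below, is to pack the 20 × 480 bit matrices into 64-bit words so that XOR, OR, and AND operate on 64 bits at a time and the 1s are counted with a hardware population count. Whether the chapter's vectorized GetHammingDistance uses this particular layout is not stated; the code only illustrates the idea.

/* Packed-bit Hamming distance kernel: the 9600 bits of each template
 * and mask fit in 150 64-bit words, and __builtin_popcountll counts
 * the mismatching bits of each word in one instruction. */
#include <stdint.h>
#include <stdio.h>

#define WORDS 150   /* 20 * 480 bits = 9600 bits = 150 x 64-bit words */

static long hamming_packed(const uint64_t *t1, const uint64_t *m1,
                           const uint64_t *t2, const uint64_t *m2) {
    long diff = 0;
    for (int w = 0; w < WORDS; w++) {
        uint64_t occluded = m1[w] | m2[w];            /* union of masks   */
        uint64_t temp = (t1[w] ^ t2[w]) & ~occluded;  /* valid mismatches */
        diff += __builtin_popcountll(temp);           /* count set bits   */
    }
    return diff;
}

int main(void) {
    static uint64_t t1[WORDS], m1[WORDS], t2[WORDS], m2[WORDS];
    t2[0] = 0xFFULL;   /* toy difference: 8 mismatching bits */
    printf("mismatching bits: %ld\n", hamming_packed(t1, m1, t2, m2));
    return 0;
}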
FIGURE 20.6 Difference between multithreading with vectorization and without vectorization.
20.4 CONCLUSIONS
Big Data biometrics processing poses a great computational need, while at the same time
it is embarrassingly parallel by nature. In this research, as a case study, we implemented
an OpenMP version of the iris matching algorithm and evaluated its performance on the
innovative Intel Xeon Phi coprocessor platform. We performed workload characterization
of the iris matching algorithm, analyzed the impact of different thread-to-core affinity settings
on the performance, identified the optimal number of threads to run the workload,
and, most important of all, vectorized the code to take advantage of the powerful vector
engines that come with Xeon Phi to improve the performance. Based on our experience,
even though the emerging many-core platform is able to provide adequate parallelism,
redesigning the application to take advantage of the specific features of the hardware platform
is very important in order to achieve the optimal performance for Big Data processing.
ACKNOWLEDGMENTS
We thank the National Institute for Computational Sciences (NICS) for providing us access
to the BEACON supercomputer to conduct this research. We would also like to thank
Gildo Torres and Guthrie Cordone for their feedback, which greatly improved the results
of this experiment and the quality of this study. This work is supported by the National
Science Foundation under grant number IIP-1332046. Any opinions, findings, and conclusions
or recommendations expressed in this material are those of the authors and do not
necessarily reflect the views of the National Science Foundation.
REFERENCES
1. Top 500 Supercomputer Sites, November 2013. Available at http://www.top500.org/.
2. A. Sussman, “Biometrics and Cloud Computing,” presented at the Biometrics Consortium
Conference 2012, September 19, 2012.
3. L. Meadows, “Experiments with WRF on Intel Many Integrated Core (Intel MIC) Architecture,”
in Proceedings of the 8th International Conference on OpenMP in a Heterogeneous World, Ser.
IWOMP’12, Springer-Verlag, Berlin, Heidelberg, pp. 130–139, 2012.
4. J. Jeffers and J. Reinders, Intel Xeon Phi Coprocessor High Performance Programming, 1st ed.,
Morgan Kaufmann, Boston, 2013.
5. J. Dokulil, E. Bajrovic, S. Benkner, S. Pllana and M. Sandrieser. “Efficient Hybrid Execution
of C++ Applications using Intel Xeon PhiTM Coprocessor,” Research Group Scientific
Computing, University of Vienna, Austria, November 2012.
6. K. Krommydas, T. R. W. Scogland and W.-C. Feng, “On the Programmability and Performance
of Heterogeneous Platforms,” in Parallel and Distributed Systems (ICPADS), 2013 International
Conference on, pp. 224–231, December 15–18, 2013.
7. P. Sutton, Benchmarks: Intel Xeon Phi vs. Sandy Bridge, 2013. Available at http://blog.xcelerit
.com/benchmarks-intel-xeon-phi-vs-intel-sandybridge/.
8. J. G. Daugman, “High Confidence Visual Recognition of Persons by a Test of Statistical
Independence,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 15, no. 11,
pp. 1148–1161, 1993.
9. G. Torres, J. K.-T. Chang, F. Hua, C. Liu and S. Schuckers, “A Power-Aware Study of Iris
Matching Algorithms on Intel’s SCC,” in 2013 International Workshop on Embedded Multicore
Systems (ICPP-EMS 2013), in conjunction with ICPP 2013, Lyon, France, October 1–4, 2013.
10. OpenMP. Available at http://openmp.org/wp/.
11. E. Saule and U. V. Catalyurek, “An Early Evaluation of the Scalability of Graph Algorithms
on the Intel MIC Architecture,” in 26th International Symposium on Parallel and Distributed
Processing, Workshops and PhD Forum (IPDPSW), Workshop on Multithreaded Architectures
and Applications (MTAAP), 2012.
12. G. Gan, “Programming Model and Execution Model for OpenMP on the Cyclops-64 Manycore
Processor,” Ph.D. Dissertation, University of Delaware, Newark, DE, 2010.
13. Intel® VTune™ Amplifier XE 2013. Available at https://software.intel.com/en-us/intel-vtune-amplifier-xe.
14. J. K.-T. Chang, F. Hua, G. Torres, C. Liu and S. Schuckers, “Workload Characteristics for Iris
Matching Algorithm: A Case Study,” in 13th Annual IEEE Conference on Technologies for
Homeland Security (HST’13), Boston, MA, November 12–14, 2013.
15. Beacon. Available at http://www.nics.tennessee.edu/beacon.
16. The Green500 List, June 2013. Available at http://www.green500.org/lists/green201306.
17. Beacon on Top500. Available at http://www.top500.org/system/177997#.U3MMkl4dufR.
18. P. A. Johnson, P. Lopez-Meyer, N. Sazonova, F. Hua and S. Schuckers, “Quality in Face and Iris
Research Ensemble (Q-FIRE),” in Fourth IEEE International Conference on Biometrics: Theory,
Applications and Systems (BTAS), pp. 1–6, 2010.
19. E. Saule, K. Kaya and U. V. Catalyurek, “Performance Evaluation of Sparse Matrix Multiplication
Kernels on Intel Xeon Phi,” arXiv preprint arXiv:1302.1078, 2013.
CHAPTER 21
Storing, Managing, and Analyzing Big Satellite Data
Ziliang Zong
CONTENTS
21.1 Introduction 405
21.2 The Landsat Program 407
21.3 New Challenges and Solutions 408
21.3.1 The Conventional Satellite Imagery Distribution System 408
21.3.2 The New Satellite Data Distribution Policy 408
21.3.3 Impact on the Data Process Work Flow 409
21.3.4 Impact on the System Architecture, Hardware, and Software 409
21.3.5 Impact on the Characteristics of Users and Their Behaviors 409
21.3.6 The New System Architecture 411
21.4 Using Big Data Analytics to Improve Performance and Reduce Operation Cost 413
21.4.1 Vis-EROS: Big Data Visualization 414
21.4.2 FastStor: Data Mining-Based Multilayer Prefetching 416
21.5 Conclusions: Experiences and Lessons Learned 420
Acknowledgments 422
References 422
21.1 INTRODUCTION
Big Data has shown great capability in yielding extremely useful information and extraordinary
potential in revolutionizing scientific discoveries and traditional commercial models.
Numerous corporations have started to utilize Big Data to understand their customers'
behavior at a fine-grained level, rethink their business process work flow, and increase
their productivity and competitiveness. Scientists are using Big Data to make new discoveries
that were not possible before. As the volume, velocity, variety, and veracity of Big Data
keep increasing, we are facing significant challenges with respect to innovative Big Data
management, efficient Big Data analytics, and low-cost Big Data storage solutions.
In this chapter, we will (1) provide a case study on how the big satellite data (at the petabyte
level) of the world's largest satellite imagery distribution system are captured, stored,
and managed by the National Aeronautics and Space Administration (NASA) and the US
Geological Survey (USGS); (2) provide a unique example of how a changed policy can
significantly affect the traditional ways of storing, managing, and distributing Big Data,
which is quite different from typical commercial cases driven by sales; (3) discuss
how the USGS Earth Resources Observation and Science (EROS) Center swiftly overcame
the challenges of moving from serving a few government laboratories to serving hundreds of thousands
of global users; (4) discuss how data visualization and data mining techniques are used to
analyze the characteristics of millions of requests and how they can be used to improve
the performance, cost, and energy efficiency of the EROS system; and (5) summarize the
experiences and lessons that we learned from conducting this Big Data project in the past
4 years.
The Big Data project we discuss in this chapter has several unique features compared to
other typical Big Data projects:
The case study we conduct in this chapter will provide new perspectives for readers to
think more widely and deeply about the challenges we are facing in today's Big Data projects as
well as possible solutions that can help us transition smoothly toward the exciting Big Data era.
The remainder of the chapter is organized as follows. Section 21.2 provides a brief background
about the NASA/USGS Landsat program. Section 21.3 discusses how a new
policy changed almost everything (work flow, system architecture, system hardware and
software, users, and user behaviors) about the conventional system and how USGS EROS
swiftly overcame the new challenges with great agility. In Section 21.4, we present our
previous research efforts on how to utilize data visualization and data mining techniques
to improve the performance and reduce the operation cost of the newly designed satellite
imagery distribution system. Section 21.5 concludes our study by summarizing our successful
experiences and hard-learned lessons.
FIGURE 21.1 (a) A Landsat satellite capturing images. (b) USGS Earth Resources Observation and
Science (EROS) Center, Sioux Falls, South Dakota.
FIGURE 21.2 (a) An example of using Landsat satellite images to study global climate change.
(b) An example of using Landsat satellite images to study the urban growth of Las Vegas, NV.
Figure 21.2a shows an example of using Landsat satellite images to study the ice melting speed of
Antarctica, and Figure 21.2b shows an example of using Landsat satellite images to study
the urban growth of Las Vegas. All images shown in Figures 21.1 and 21.2 are provided by
the USGS and NASA.
in 1999, would become available for free download by the end of September 2008. At
that time, all Landsat 7 data purchasing options from the USGS, wherein users paid for on-demand
processing to various parameters, would be discontinued. This new policy generated
a significant impact on almost every component of the conventional EROS satellite
imagery distribution system.
users would not request satellite images without a specific research purpose, because the
requested images were usually paid for out of a research grant and the payment needed to be
approved by a project manager. Additionally, users were unlikely to request a substantially
large number of satellite images due to budget limitations. However, in the new system,
with the "Imagery for Everyone" policy, there is no cost to users for requesting satellite
images. This greatly influences the characteristics of users and their behaviors. First, the
number of users has grown significantly. Before the new policy was announced, there were only
hundreds of users, and most of them were from the United States. The new EROS system
now serves hundreds of thousands of global users. Figure 21.3 shows the rapid growth of
monthly download requests since September 2008. It can be observed that the workload
of the EROS system (in terms of the number of requests) quadrupled in the first
29 months after the free download policy was applied, and this growth is expected to continue
in the future. Second, users behave rather aggressively and randomly in the new
system. Figure 21.4 shows the analysis results of more than 60,000 users. We find that the
number of requests sent by the top nine users accounts for 17.5% of the total number of
requests. These users can be classified as aggressive users. On the other hand, the majority
FIGURE 21.3 The number of monthly requests grew significantly in the first 29 months.
(Pie chart: each of the nine aggressive users accounts for 1%–4% of all requests; the remaining roughly 60,000 users account for 82%.)
FIGURE 21.4 The user download behaviors tend to be more aggressive and random.
of users only request one or two satellite images, and they rarely come back to request more.
These users can be classified as random users. In addition, we observe that among
the 2,875,548 total requests, 27.7% are duplicated requests. Users send redundant
requests because there is no cost on their side and they do not need to track which satellite
images have already been requested and which have not.
(Diagram: the database server receives database updates, raw data are fetched from tapes, and download-ready satellite images are moved to the FTP server.)
FIGURE 21.5 The new architecture of the EROS satellite imagery distribution system.
satellite images. Once a work order is completed, the download-ready satellite images are
moved to the FTP server, and a notification e-mail (with download links) is sent
to the user. After receiving the e-mail, users are given at least 7 days to download their
requested images. In other words, the system will not delete a satellite image that is less than
7 days old even when the FTP server is full. Since the size of each requested satellite
image is generally greater than 200 MB, this asynchronous e-mail notification mechanism
successfully improves the usability of the EROS system, because users do not have to worry
about being unable to download the requested satellite images immediately.
Hybrid Storage Subsystem: In the new architecture, the USGS EROS stores its massive
satellite images in long-term archive (LTA) format using a tiered storage subsystem. Table
21.1 outlines the tiered storage components in the USGS EROS satellite imagery distribution
system. The raw data of all satellite images are first stored in the tape system. Once a
satellite image is requested, its raw data are located in the tape system and then loaded
to the SAN system composed of hard disk drives (HDDs) for image processing and optimization.
The processed images are then fetched to the FTP server for downloading.
Since the EROS system is highly data intensive and the majority of requests are "read"
operations, the use of solid-state disks (SSDs) in the FTP server layer can significantly reduce
the data transfer time once the satellite images are ready. However, the high cost of SSDs
also limits the capacity of the FTP servers. Currently, the total amount of satellite data
at EROS is at the petabyte level, while the storage capacity of the FTP server is
designed at the terabyte level. Therefore, only a small portion of the satellite data can reside on
the FTP server, and the large remainder has to be stored in the low-cost tape system.
If the requested satellite images reside on the FTP server, users can download them almost
immediately. However, users may need to wait 20–30 minutes if the requested images
are missing from the FTP server (i.e., they are stored on tapes in the form of raw data). To
improve the quality of the satellite data distribution service, EROS has to make every effort
to identify and keep the most popular satellite images on the FTP server. Most importantly,
this process needs to be automatic and highly efficient. Manual data selection is impossible
for petabytes of data. Meanwhile, the slow response time of the tape system may become
the system bottleneck. Therefore, data mining–based prefetching, which will be discussed
in Section 21.4.2, appears to be an appropriate technique for further improving the performance
of EROS's new system.
Satellite Image Processing Subsystem: The satellite image processing subsystem consists
of nine computational nodes, and each node contains eight cores. One of the computational
nodes serves as the master node, and the other eight computational nodes are slave
nodes. When the requested satellite images cannot be found on the FTP server, one or
more work orders will be issued by the master node. Here, a work order is a basic job
(Figure 21.6: the master node (scheduler) assigns work orders to the eight slave computational nodes in a round-robin fashion, with up to four work orders running on each node.)
executed by a core, which contains 15 different algorithms for raw satellite data preprocessing
and image optimization. The master node decides how to schedule work orders to the
slave nodes based on the round-robin scheduling algorithm shown in Figure 21.6.
Specifically, the scheduler examines the number of work orders being executed on each
slave node in a round-robin way. The next work order will be allocated to the first identified
slave node that is running fewer than four work orders. If all slave nodes are busy (four work
orders running simultaneously on each slave node), the work order will be allocated to
the master node if it is running fewer than four work orders. Otherwise, the work order will be
put in the waiting queue. Did you notice in Figure 21.6 that only half of the available cores
are utilized and wonder why not utilize all eight cores of each node? This question will
be answered in Section 21.5 as one of the lessons we learned. It is also worth noting that
the work orders running on different cores have no knowledge about other corunning
work orders, even those scheduled on the same node. In other words, the
execution flows and data flows of different work orders are completely isolated and proceed in
an arbitrary way, which may cause potential cache and memory contention problems, as
discussed in Section 21.5.
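The placement logic described above can be summarized in a few lines of C. The sketch below follows the text (eight slave nodes, a master fallback, at most four work orders per node), but everything else, including the function names, is illustrative.

/* Round-robin work-order placement: first slave node with a free slot,
 * then the master node, otherwise the order is queued. */
#include <stdio.h>

#define SLAVES        8
#define MASTER        8        /* index of the master node            */
#define MAX_PER_NODE  4
#define QUEUED       -1

static int load[SLAVES + 1];   /* running work orders per node        */
static int next_slave = 0;     /* round-robin starting position       */

/* Returns the node index chosen for the work order, or QUEUED. */
static int schedule_work_order(void) {
    for (int i = 0; i < SLAVES; i++) {
        int node = (next_slave + i) % SLAVES;
        if (load[node] < MAX_PER_NODE) {
            next_slave = (node + 1) % SLAVES;
            load[node]++;
            return node;
        }
    }
    if (load[MASTER] < MAX_PER_NODE) {   /* all slave nodes are full   */
        load[MASTER]++;
        return MASTER;
    }
    return QUEUED;                       /* wait for a core to free up */
}

int main(void) {
    for (int order = 0; order < 40; order++) {
        int node = schedule_work_order();
        if (node == QUEUED)
            printf("work order %2d: queued\n", order);
        else
            printf("work order %2d -> node %d\n", order, node);
    }
    return 0;
}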
behaviors and download patterns, which in turn can facilitate EROS to further improve
system performance and reduce operation cost.
In this section, we provide details on how to find user download patterns by analyzing
historical requests and how these patterns can help improve the quality of the EROS
satellite imagery distribution service.
FIGURE 21.7 Visualization of EROS global download requests in NASA World Wind (left) and
Google Earth (right). The bright color indicates a large number of requests for this area.
features such as points, lines, images, and polygons. The second choice is NASA World
Wind [9], which was developed by NASA using the Java software development kit (SDK). It
uses Java OpenGL (JOGL) [10] to render the globe and display visualization results.
At the beginning, we were not sure which visualization tool kit would be more appropriate
for our needs. Therefore, we started two separate projects, each focusing on one of them.
Figure 21.7 illustrates the first generation of visualization results of the EROS data using
Google Earth (right) and NASA World Wind (left). Although NASA World Wind has its
advantages (e.g., fine-grained control when rendering the globe), it turns out that Google
Earth is easier to use and program. Therefore, in the second generation, we only optimized
the Google Earth version, which includes a better color rendering algorithm and the capability
of displaying visualization results for a smaller region. Note that the satellite images
marked in brighter colors are more popular. Figure 21.8 shows the visualization results
of North America and South America, and Figure 21.9 shows the visualization
results of the United States. Figure 21.10 demonstrates the hot spots of Haiti after the
2010 earthquake. Here, we do not include detailed information about how we visualized
the EROS data using Google Earth and NASA World Wind, simply because visualization
techniques are not the focus of our discussion here. Readers can visit the Vis-EROS website
[6] and refer to our papers [11,12] for implementation details. Our focus in this chapter is
to discuss how visualization can help with Big Data analytics.
The Vis-EROS project clearly answered the aforementioned questions. Popular images
do exist, and they can be identified without sophisticated algorithms. The popularity of
images evolves over time. For example, before the Haiti earthquake happened, no satellite
images of Haiti were considered popular. However, the location of the earthquake did
become popular after the event. The Vis-EROS project provides strong evidence that caching
and prefetching techniques can improve the performance of the EROS system by preprocessing,
loading, and keeping popular images on the FTP servers.
• Few users request many images, while many users only request a few.
• Very few images are very popular, while most of them are unpopular.
• The popularity of images evolves over time. Newly captured images tend to be more
popular than old images in general.
• Some image requests are triggered right after important global events (e.g., earthquakes,
tsunamis, and forest fires).
• Images of extreme historical events (e.g., UFO crash sites and the Chernobyl
nuclear power plant explosion) stay popular for many years.
More details about our analysis report can be found in Reference 13. With these observations
in mind, we started searching for techniques that can achieve high performance for
the EROS system. During this process, we encountered a series of challenges and pertinent
problems, several of which are listed as follows.
• Will cache optimization techniques that are typically applied to very small caches
still remain effective when the cache is orders of magnitude larger?
• New satellite data are added to the EROS system every week, and user download patterns
evolve over time. What techniques can effectively catch newly emerged patterns
with a continually increasing set of information?
• EROS currently has a large number of global users. Some users show aggressive
behavior by requesting a large number of images frequently, while others only download
very few images and never come back. Will it be possible to identify patterns for
each user, and can user-specific algorithms improve system performance?
We decided to address these problems one by one. First, we narrowed our focus
to the caching algorithms. We studied the impact of three widely used cache replacement
algorithms, namely, First In First Out (FIFO), Least Recently Used (LRU), and Least
Frequently Used (LFU), on system performance using historical user download requests.
These algorithms determine which images will be evicted from the FTP servers when the
maximum capacity is reached. Specifically, FIFO evicts the earliest entry in the cache when
cache replacement happens; no action is taken on cache hits (i.e., if an entry in the cache
gets requested again, its position does not change). The LRU algorithm removes the least
recently used entry in the cache when it is full. LFU exploits the overall popularity of entries
rather than their recency: LFU sorts entries by popularity, and the least popular item is always
chosen for eviction.
After observing and analyzing the results generated from the real-world data provided by
EROS [14], we concluded that traditional approaches to caching can be used to successfully
improve performance in environments with large-scale data storage systems. Throughout
the course of the evaluation, the FIFO cache replacement algorithm frequently resulted in a
much lower cache hit rate than either LRU or LFU. The LFU and LRU algorithms result in
similar cache hit rates, but the LFU algorithm is more difficult to implement. Overall, we
found LRU to be the best caching algorithm when considering both performance
and ease of implementation.
Prefetching is another technology that has the potential to significantly increase cache
performance. Prefetching offsets high-latency input/output (I/O) accesses by predicting
user access patterns and preloading data that will be requested before the actual request
arrives. In fact, previous studies have shown that an effective prefetching algorithm can
improve the cache hit ratio by as much as 50% [15].
Initially, we proposed a data mining–based multilayer prefetching framework (as shown in Figure 21.11), which contains two engines: the offline pattern mining engine and the online pattern matching engine. More specifically, the offline pattern mining engine contains three modules: the pattern discovery module, the pattern activation module, and the pattern amendment module. The pattern discovery module takes the user request history table, applies a pattern discovery strategy, and generates the candidate pattern table. The candidate patterns do not take effect until the pattern activation module activates them. Based on the priority or urgency of the requested files, the candidate patterns are categorized into urgent patterns and nonurgent patterns; once activated, they are sent to the Urgent Pattern Table and the Nonurgent Pattern Table, respectively. The upper-level prefetching will prefetch the satellite images identified
by the pattern search algorithm based on the Urgent Pattern Table, while the lower-level prefetching file list is generated from the Nonurgent Pattern Table. This framework appeared to be overly complicated when we tried to implement it. More importantly, it missed some critical factors that need to be considered (e.g., the evolution of image popularity and the presence of aggressive users). In the actual implementation, we did not differentiate patterns based on urgency, because it is very hard to determine which pattern is urgent and which is not. We also dropped the online pattern matching engine, because it usually does not have a sufficient number of requests to analyze within the given window and it is fairly expensive to keep running all the time.
Meanwhile, we implemented two specific prefetching algorithms in our emulator, called Popularity-Based Prefetching and User-Based Prefetching. In the Popularity-Based Prefetching algorithm, we distinguished archived popular images from currently popular images. An archived popular image usually owes its popularity to historically significant events; for example, the satellite image of Chernobyl falls into this category. A currently popular image is popular because users are interested in the newly captured scene. For example, researchers monitoring the effects of global warming may request the newest images of Greenland as soon as they become available, so it would be beneficial to prefetch new satellite images of Greenland once they are available.
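One possible way to read this distinction in code is sketched below; the 30-day recency window, the half-and-half mix, and the function name are illustrative assumptions, not the parameters of the actual EROS implementation:

```python
from collections import Counter

RECENT_WINDOW_DAYS = 30   # assumed recency window, not the EROS setting

def popularity_prefetch_list(requests, today, top_k=100):
    """requests: iterable of (image_id, request_day); returns image ids to prefetch."""
    archived = Counter()   # popularity accumulated over the full request history
    current = Counter()    # popularity within the recent window only
    for image_id, day in requests:
        archived[image_id] += 1
        if today - day <= RECENT_WINDOW_DAYS:
            current[image_id] += 1
    # Mix currently hot scenes (e.g., newly captured Greenland imagery) with
    # long-term favorites (e.g., the Chernobyl scene).
    picks = [img for img, _ in current.most_common(top_k // 2)]
    picks += [img for img, _ in archived.most_common(top_k) if img not in picks]
    return picks[:top_k]
```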
The User-Based Prefetching algorithm is based on the important observation that EROS users behave very differently: some users downloaded thousands of satellite images, while others sent requests once and never came back. Therefore, prefetching rules generated from one user's pattern may not be suitable for another user at all. With this in mind, we designed the User-Based Prefetching algorithm, which generates prefetching rules for a user based solely on that user's historical request pattern. Although this yields more accurate prefetching, generating separate rules for each user is a very ambitious goal because of the nature of the EROS data. There are six attributes in each request that together uniquely identify a satellite image. These attributes are, in order, satellite number, satellite sensor, path, row, acquisition year, and acquisition day of year. Users can switch their download pattern along any of these dimensions between two requests, which makes it hard to establish a pattern library for each user. Additionally, this process must be completed in O(n) time, considering the size of the EROS user space.
Our solution is to represent each important attribute of a request as an integer. Then, we concatenate all six attribute integers to create a single long integer that represents a unique satellite image. Having a long integer representation of an image, where each set of digits corresponds to a different attribute, allows a simple subtraction of two images to capture the movement in the multidimensional space. We can then treat these movements as prefetching rules. For example, if a user requests two images, 7-1-100-100-2000-100 (an image at path 100 and row 100, captured on the 100th day of year 2000 by the first sensor of Landsat 7) and 7-1-101-099-2000-100, we subtract the first image from the second and get the following movement: 711010992000100 − 711001002000100 = 9990000000. The difference, 9990000000, uniquely represents the movement of +1 path and −1 row
because every other attribute stays the same. This simple digitization scheme reduces the time complexity of the prefetching algorithms to O(n), which allows us to effectively generate per-user prefetching rules even at the scale of the EROS user space.
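A minimal sketch of this digitization scheme follows; the fixed field widths (three digits for path, row, and day of year; four for the year) are inferred from the worked example above and may differ from those used in the production system:

```python
def encode(satellite, sensor, path, row, year, day):
    """Concatenate the six request attributes into one integer key."""
    return int(f"{satellite}{sensor}{path:03d}{row:03d}{year:04d}{day:03d}")

def movement(first, second):
    """A prefetching rule is the numeric difference between consecutive requests."""
    return encode(*second) - encode(*first)

# The example from the text: 7-1-100-100-2000-100 followed by 7-1-101-099-2000-100
delta = movement((7, 1, 100, 100, 2000, 100), (7, 1, 101, 99, 2000, 100))
print(delta)  # 9990000000, i.e., +1 path and -1 row
```

Each observed delta can then be stored as a per-user prefetching rule and, when it recurs, applied to the most recent request to predict and prefetch the next image.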
TABLE 21.2 Summary of Average Monthly Performance, Power Consumption, and Electricity Cost
Cache Configuration                     Avg. Monthly Hit Rate   Avg. Monthly Power (watts)   Avg. Monthly Cost ($)
100 TB LRU, no prefetching (current)    65.86%                  13,908,620                   1127
100 TB LRU with prefetching             70.26%                  13,748,657                   1114
200 TB LRU, no prefetching              69.58%                  20,152,915                   1632
200 TB LRU with prefetching             71.83%                  17,052,688                   1381
From this study, we gained the following experiences:
• Experience #1: Big Data is not scary. Even a large-scale system like EROS can adapt to new challenges rather quickly through careful planning and well-thought-out design.
• Experience #2: Big Data analytics is worth exploring because it has great potential to
improve the quality of service and reduce cost.
• Experience #3: Data visualization techniques can sometimes be very helpful in finding hidden patterns, and many visualization tool kits are freely available.
• Experience #4: Big Data analytics is complicated, and it takes substantial effort to figure out a working solution. Do not be overly aggressive at the beginning; take one step at a time. In our case, for example, the Vis-EROS project served as a pilot project that provided support for further efforts and created excitement.
• Experience #5: Having a comprehensive understanding of user behavior or system
workload will help greatly in proposing suitable algorithms for a large-scale system
like EROS. Our proposed Popularity-Based Prefetching and User-Based Prefetching
algorithms are both inspired by our user behavior analysis and workload characterization results.
• Experience #6: If no single solution significantly improves system performance, combining different methods that individually provide small improvements can add up to a noticeable overall gain. For example, we achieved about a 70% hit rate on the FTP server with LRU alone. By applying the current Popularity-Based Prefetching algorithm, we were able to achieve an extra 2% improvement. Finally, we improved the FTP server hit rate to more than 76% by combining the LRU, Popularity-Based Prefetching, and User-Based Prefetching algorithms.
• Experience #7: The proposed algorithms for large-scale systems must be highly efficient. An algorithm with a time complexity larger than O(n²) will not be feasible to implement. Fast but simple solutions are always preferred (e.g., representing a user request as a long integer significantly reduced the time needed to complete User-Based Prefetching).
Meanwhile, we also learned several lessons from our study, listed as follows:
• Lesson #1: Think twice before physically expanding the existing system. Maintaining a Big Data system is fairly expensive, so do not simply scale the system up when the number of users grows rapidly, even though that is the easiest solution. It is wise to examine the entire hardware and software stack for possible performance improvements before investing in new hardware. For example, the EROS user space grew nearly 400% in the first 3 years. To guarantee performance, the managers could have chosen to double the size of the FTP server farm. However, our experiments showed that, by utilizing our proposed prefetching algorithms, the EROS system could maintain the same performance with over 30% energy savings and an $8 million direct cost reduction (the cost of purchasing an additional 100 TB of SSDs), compared to the alternative of doubling the size of the current FTP server farm.
• Lesson #2: Understanding the workload helps reduce hardware cost and improve hardware utilization. Recall from Section 21.3 that only four cores of each eight-core node can be used simultaneously; this is because the typical data file required by each work order is several hundred megabytes. Cores within the same socket could race for cache resources. Additionally, cores within the same
node may also compete for memory resources if too many work orders are assigned to the node. Even worse, cache contention could further cause thread thrashing, which may lead to unpredictable server downtime. The current solution is to limit the number of work orders simultaneously running on each computational node to four, which degrades the potential peak performance by almost 50%. EROS could have purchased more nodes with fewer cores had this problem been identified earlier. These problems will persist or even worsen when the application is ported to many-core systems with Intel Many Integrated Core (MIC) coprocessors or NVIDIA GPU accelerators. Fortunately, EROS did not invest in such coprocessors/accelerators.
• Lesson #3: The volume and variety of Big Data favor simplicity in any implemented framework or algorithm. A complicated framework (e.g., the original data mining–based prefetching framework presented in Figure 21.11) probably needs to be simplified before it can be deployed on a Big Data system.
• Lesson #4: It is possible that no existing solution can be applied directly to your specific problem. Before we started developing our own prefetching algorithms, we spent a substantial amount of time evaluating several well-known data mining and machine learning algorithms, including market basket analysis (MBA) [17], C4.5 [18], KNN [19], naive Bayes [20], Bayesian networks [21], and support vector machines [22]. However, no performance improvement was achieved even when using all possible combinations of input features. These widely used techniques failed because of the unique characteristics of the EROS data and user behavior.
We hope that readers can benefit from our experiences and lessons and have a successful Big Data practice in their own organizations.
ACKNOWLEDGMENTS
The work reported in this chapter is supported by the US National Science Foundation under grant nos. CNS-1212535 and CNS-0915762. We also gratefully acknowledge the support of the US Geological Survey (USGS) Earth Resources Observation and Science (EROS) Center, which provided us with the global user download requests. We also thank the reviewers for their helpful comments. Any opinions, findings, and conclusions or recommendations expressed in this study are those of the author and do not necessarily reflect the views of NSF or USGS.
REFERENCES
1. National Aeronautics and Space Administration (NASA) Landsat Program Introduction.
Available at http://landsat.gsfc.nasa.gov/.
2. U.S. Geological Survey (USGS) Landsat Program Introduction. Available at http://landsathandbook.gsfc.nasa.gov/handbook/handbook_htmls/chapter1/chapter1.html.
3. Available at http://landsat.usgs.gov/documents/USGS_Landsat_Imagery_Release.pdf.
4. USGS Global Visualization Viewer. Available at http://glovis.usgs.gov/.
5. Earth Explorer. Available at http://earthexplorer.usgs.gov/.
6. Vis-EROS. Available at http://cs.txstate.edu/~zz11/viseros/.
Barriers to the Adoption of Big Data Applications in the Social Sector
Elena Strange
CONTENTS
22.1 Introduction 425
22.2 The Potential of Big Data: Benefits to the Social Sector—From Business to Social Enterprise to NGO 426
22.3 How NGOs Can Leverage Big Data to Achieve Their Missions 430
22.4 Historical Limitations and Considerations 432
22.5 The Gap in Understanding within the Social Sector 434
22.6 Next Steps: How to Bridge the Gap 436
22.7 Conclusion 437
References 437
22.1 INTRODUCTION
Effectively working with and leveraging Big Data has the potential to change the world. Indeed, in many ways, it already has [1,2]. Summarizing today's conventional wisdom in PANEL Crowds, Clouds, and Algorithms, Doan et al. [1] credit the pervasiveness of Big Data to the "connectivity of billions of device-enabled people to massive cloud computing infrastructure, [which] has created a new dynamic that is moving data to the forefront of many human endeavors."
If there is a ceiling on realizing the benefits of Big Data algorithms, applications, and techniques, we have not yet reached it. The research field, for which the term "Big Data" was coined in 2001 [1], is maturing rapidly. No longer are we seeking to understand what Big Data is and whether it is useful. No longer is Big Data processing the province of niche computer science research. Rather, the concept of Big Data has been widely accepted as important and inexorable, and the buzzword "Big Data" has found its way beyond computer science into the essential tools of business, government, and media.
Tools and algorithms to leverage Big Data have been increasingly democratized over the last 10 years [1,3]. By 2010, over 100 organizations reported using the distributed file system and framework Hadoop [4]. Early adopters leveraged Hadoop on in-house Beowulf clusters to process tremendous amounts of data. Today, well over 1000 organizations use Hadoop. That number is climbing [5] and now includes companies with a range of technical competencies, both with and without access to internal clusters and other tools.
Whereas Big Data processing once belonged to specialized parallel and distributed programmers, it eventually reached programmers and computer scientists of all subfields and specialties. Today, even nonprogrammers who can navigate a simple web interface have access to all that Big Data has to offer. Foster et al. [6] highlight the accessibility of cloud computing via "[g]ateways [that] provide access to a variety of capabilities including workflows, visualization, resource discovery and job execution services through a browser-based user interface (which can arguably hide much of the complexities)."
Yet, the benefits of Big Data have not been fully realized by businesses, governments, and particularly the social sector. The remainder of this chapter describes the impact of this gap on the social sector and the implications it engenders in a broader context. Section 22.2 highlights the opportunity gap: the unrealized potential of Big Data in the social sector. Section 22.3 lays out the channels through which the social sector has access to Big Data. Section 22.4 reviews the historical limitations and considerations that have shaped that access. Section 22.5 describes the current perceptions of and reactions to Big Data algorithms and applications in the social sector. Section 22.6 offers some recommendations to accelerate the adoption of Big Data. Finally, Section 22.7 offers some concluding remarks.
potentially beneficial technologies such as Big Data, the impact resonates among all these constituent groups and beneficiaries.
As the for-profit sector is learning, the effective use of Big Data is no longer restricted to the technical realm. Businesses of all kinds, including those with little technical competency, are able to leverage its applications and knowledge. As Null describes to his business readers, "Chances are, you're already processing Big Data, even if you aren't aware of it" [9]. Businesses use Big Data tools, including mobile apps and websites, to understand the traffic and user behavior that impact their bottom line.
Consider a use case as simple as the Google Analytics tool [10], which has made the interpretation of website statistics a layman's task. Plaza [11] describes a small Spain-based NGO that leverages Google Analytics to understand how users are attracted to research work generated by the Guggenheim Museum. In addition to gleaning important information from the collected data, the organization uses the tables and charts created by Google Analytics to better position its materials, with few scientific or technical demands on the organization itself. As Big Data becomes increasingly democratized, organizations like this NGO can not only use these tools but also communicate the results and impact more effectively; today, the readers of the NGO's reports understand the language of site traffic patterns and user behavior, and immediately grasp the significance of the Big Data tools and information, with little or no convincing needed.
Tools such as Google Analytics do not create data where there was none but, rather, make existing data more accessible. Without an accessible tool such as Google Analytics, only a highly technical, strongly motivated business owner would be inclined to follow the traditional, work-intensive approach to learning about site traffic: manually downloading server logs and meticulously identifying visitation patterns to the company website. After the tool's wide launch in 2006, however, every business and NGO, including those with no technical background, has been able to set up a Google ID on their website and intuitively navigate the Google Analytics interface to understand and interpret the implications of traffic on their business.
The analysis provided by tools such as Google Analytics, true to the nature of Big Data, yields insight proportional to the quantity of information you have. With the web-based tool, anyone can glean insight about which pages are relevant, where conversions happen, and how the website makes money—and they learn more the more site visitors they have. This is the power of Big Data for businesses: it has become so accessible that more data make it deceptively easy to understand the implications of user behavior and spot relevant patterns.
In between the for-profit businesses that have eagerly embraced Big Data and the NGOs that are lagging behind sits the social enterprise. Social enterprises provide a more fitting example of a sector actively using and benefitting from Big Data.
Unlike NGOs, social enterprises must generate revenue and maintain profitability in order to be successful. As a result, they see the potential benefits of Big Data far more clearly than non-market-driven NGOs do. The remainder of this section explores the role of the social enterprise within the social sector.
The social enterprise, a concept relatively new to both the business world and the social sector, fits squarely between the two. Loosely defined, a social enterprise is an NGO that prioritizes generating revenue or a for-profit business that maximizes revenue for a social goal. The social enterprise represents "a new entrepreneurial spirit focused on social aims," yet it remains "a subdivision of the third sector…. In this sense they reflect a trend, a groundswell involving the whole of the third sector" [12]. These social enterprises, increasingly a large and key component of the social sector, are also well positioned to compete with traditional for-profit businesses [13].
It is social enterprises, more than traditional NGOs, that have led the way in adopting Big Data trends and algorithms. They will continue to do so, with traditional NGOs lagging behind.
Consider the use case of the social enterprise and travel site Couchsurfing, a peer-to-peer portal where travelers seeking places to stay connect with local hosts seeking to forge new connections with travelers. No money changes hands between host and visitor. Visitors benefit from the accommodations and a local connection to a new city, whereas hosts benefit from discovering new friendships and building their online and offline reputations. Couchsurfing itself has a social mission: it is a community-building site that is cultivating a new kind of culture in the digital age.
Couchsurfing was launched as a not-for-profit organization in 2004 and eventually reestablished as a "B-corporation" company: a for-profit business with a social mission. Although some controversy ensued when it gave up its not-for-profit status, this organization represents a key example of a social enterprise that has made tremendous use of Big Data to both increase profits and achieve its social mission.
In addition to its need to turn a profit, the key distinction of Couchsurfing, relative to other social-mission organizations, lies in its origins as a technical organization. As a community-building site, it assumed the use of Big Data to drive decision making from its inception. Like many web start-ups, Couchsurfing relies on Big Data to create and continually improve the user experience on its site. As a traveler, you can use Couchsurfing to find compatible hosts, locations, and events. With over 6 million members, it must use data mining and pattern analysis to connect guests with appropriate hosts and, more importantly, to establish and maintain hosts' and guests' reputations [14].
Big Data cannot be decoupled from a site like Couchsurfing. Patterned after for-profit web businesses such as Airbnb, Facebook, LinkedIn, and Quora, Couchsurfing cannot possibly keep tabs reliably on all of its hosts, nor would it be beneficial for the organization to do so. There is no central system to vet the people offering a place for travelers to stay, yet these hosts must be reliable and safe in order for Couchsurfing to stay in business. The reputations of its hosts and travelers are paramount to the stability and reputability of the site itself. As on other community sites, these reputations are best managed, and engender the most confidence, when they are built and reinforced by community members.
Moreover, the Couchsurfing user experience must be able to reliably connect travelers with hosts based not only on individual preferences but also on patterns of use. It must be able to filter through the millions of hosts in order to provide each traveler with recommendations that suit their needs, or the traveler will not participate.
Straddling the line between for-profit and nonprofit, Couchsurfing has built its platform like any for-profit technical business would, relying on Big Data algorithms and applications to make the most of the user experience both on and off the site. All the while, it has established itself, and seeks to maintain its reputation, as a community-minded enterprise with the best interests of the community at heart. It invests in the engineering and research capabilities that enable it to leverage Big Data because doing so benefits both its profits and its social mission.
In fact, it is this latter need that drove Couchsurfing to transition to a B-corporation. As an NGO, it was unable to raise the capital needed to invest in the infrastructure and core competencies required to effectively leverage Big Data. When it was able to turn to investors and venture capitalists for an infusion of resources, rather than foundations and individual donors, it found a ready audience willing to accept and embrace its high-growth, data-reliant approach to community building.
NGOs may not yet have caught up with Big Data, but social enterprises such as Couchsurfing are well on their way. It is these midway organizations, between for-profit and NGO, that light the way for the adoption of Big Data tools and techniques throughout the social sector.
The first two leverage points directly impact the efficacy of an organization in achieving its social mission. The latter three are indirect: they impact an organization's structure, efficiency, and even size. When these indirect factors are improved, a given NGO is better positioned to achieve any of the goals associated with its mission.
First, let us consider how an NGO can improve its services. An NGO that leverages Big Data can improve the services it makes available to constituents and other end users and beneficiaries. In the social sector, a "service" is a core competency of the NGO made available to constituents for the purpose of achieving its mission. It might be a medical service, a food
service, or a logistical service. Service-based NGOs are constantly trying to assess the services they offer and how and whether they might be improved. To do so, they need information.
Consider a use case such as Doctors Without Borders, an NGO that "provides urgent medical care in countries to victims of war and disaster regardless of race, religion, or politics" [15]. This type of NGO must understand how effective its services are, whether those who partake of the services benefit, and how the services can be improved. It must understand the impact its services have at the individual level, such as the level of care rendered to a patient. It must also understand the organization's impact at a broader level: the effectiveness it achieves in aggregate across its wide group of constituents.
These needed evaluations even lend themselves to data mining metrics and language. Doctors Without Borders needs to understand the effectiveness of its services in terms of precision (the percentage of constituents served who are in its target market) and recall (the percentage of affected victims in a given region the organization was able to reach).
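In standard data mining notation (our formulation of the two metrics just described, writing served for the set of constituents actually reached and target for the group the organization aims to reach):

\[
\text{precision} = \frac{|\text{served} \cap \text{target}|}{|\text{served}|},
\qquad
\text{recall} = \frac{|\text{served} \cap \text{target}|}{|\text{target}|}.
\]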
Without Big Data systems or applications, NGOs such as Doctors Without Borders tend to rely on surveys to gather and analyze information. Though useful, survey-based data alone are insufficient for an NGO to fully understand the scope and impact of its services, particularly as the number of constituents reached and the number of services delivered grow. Implicit data, captured and analyzed by Big Data applications and algorithms, are an effective companion to the explicit data captured through surveys.
Second, let us consider how an NGO can improve its offerings. In the social sector, an NGO's "offering" is a product or service that is sold for a fee, such as a T-shirt with the NGO's logo on it. The revenue captured from these transactions is poured back into the NGO's operational budget. Even NGOs that do not describe themselves as social enterprises often need to generate revenue in order to maintain their sustainability over time. They collect money from end users for products and services rendered; for example, an NGO with a medical mission might charge a patient for a check-up or other medical service.
How does Big Data impact an NGO's offerings? Through the ways it makes such offerings available. Big Data is tremendously influential in creating web platforms—particularly retail sites—that are user friendly and navigable. In a retail context, Big Data algorithms ensure that users find what they are looking for and enjoy a smooth discovery process.
Users' expectations have changed in the retail context. Traditionally, end users tend to be forgiving of NGOs whose missions they support: if the organization has a strong and relevant social mission, users will wade through irrelevant products and slog through a time-consuming purchasing process. Today, however, users' expectations are higher. Thanks to the prevalence of retail platforms that use Big Data to create consumer-friendly experiences, users have come to demand these experiences in all of their online interactions. NGOs are less likely to provide these experiences without the help of Big Data shaping their platforms and interaction models.
Data mining techniques have long been applied in traditional marketing contexts as well. In their seminal data mining paper, Agrawal et al. [16] introduce the paradigm of data mining as a technique to leverage businesses' large data sets. For example, they describe data mining as a tool to determine "what products may be impacted if the store stops selling bagels" and "what to put on sale and how to design coupons" [16].
Retail leaders such as Target pioneered the use of Big Data and data mining techniques to drive sales. Target employed its first statisticians to do so as far back as 2002 [17]. As Duhigg explains, "[f]or decades, Target has collected vast amounts of data on every person who regularly walks into one of its stores… demographic information like your age, whether you are married and have kids, which part of town you live in, how long it takes you to drive to the store, your estimated salary, whether you've moved recently, what credit cards you carry in your wallet and what Web sites you visit. All that information is meaningless, however, without someone to analyze and make sense of it" [17].
Fourth, NGOs can leverage Big Data to streamline operational efficiencies. One of the significant by-products of Big Data is the infrastructure of cloud computing technologies used to support Big Data applications. Cloud computing tools are available—and increasingly accessible—to almost everyone. Even NGOs that do not need Big Data applications per se can use the infrastructure that has been built to support Big Data activities to host their web applications, maintain their databases, and engage in other activities common to even nontechnical organizations.
Like many for-profit businesses, especially smaller ones, many NGOs tend to maintain their information technology (IT) departments in-house: mail servers, networks, and other essential tools are constructed within their own four walls. As an indirect result of the reach and scope of Big Data, NGOs can now take advantage of the proliferation of cloud computing and online tools to maintain their infrastructure.
Fifth and finally, NGOs can leverage Big Data to package their results for funders and supporters. Stories are well known to be "a fundamental part of communication and a powerful part of persuasion" [18, p. 1]. Storytelling is a popular and meaningful way to make a case for financial and emotional support.
Among other indicators, the rise of Big Data is reflected in the jobs and skills that for-profit businesses are hiring for. For example, the job title "data scientist" came into widespread use only in 2008, and the number of workers operating under that title skyrocketed between 2011 and 2012 [19]. Whereas Big Data applications and algorithms emerged from many IT departments, the field has become professionalized over time. Businesses are willing to invest time and resources into roles specifically aimed at Big Data.
In their earliest incarnations, Big Data applications and algorithms were strictly the province of specialized computer scientists. Programmers who specialized in parallel and distributed computing learned how to manipulate memory, processors, and data in order to create tapestries of programs that could work with more data than memory could hold.
Prior to the introduction of the World Wide Web, any enterprise with enough data to constitute Big Data belonged to a specialized cohort of businesses and researchers. The urgency to develop applications to manage large amounts of data faded as the amount of physical memory available in most computer systems increased from the late 1980s on. For a time, while the capacity of available memory grew, the amount of data that needed processing did not. It seemed as if bigger memory and faster processors would be able to manage the data captured by any organization quickly enough, and certainly enough for any individual.
Then came the World Wide Web, shepherding in a new, accessible entry point to the Internet for organizations. New data were created and processed in two significant ways. First, websites started popping up: the number of pages on the Internet grew from under 1 million to an estimated 1 trillion in a short 15-year span [20].
The growth of web pages was just the beginning, however. More significantly, people began interacting on the web. End users passed all manner of direct data and metadata back and forth to companies, including e-commerce orders, personal information, medical histories, and online messages. The number of web pages paled in comparison to the amount of data captured in Internet transactions.
We began to see that, no matter how large RAM became and no matter how fast processors became, personal computers and even supercomputers would never be able to keep up with the growth of data fueled by the Internet.
This trend toward Big Data necessitated changes in computing tools and programming techniques. Programming languages and techniques were developed to serve these specializations, including Open Multi-Processing (OpenMP) [21] and Erlang [22], as well as standards that evolved, such as portable operating system interface (POSIX) threads. A programmer writing parallel code that worked with massive data had to learn a relevant programming paradigm in order to write programs effectively.
Over time, and in parallel with the growth of Internet-fueled data, innovations such as Hadoop [4] and the framework generator (FG) [23] became accessible to traditional computer scientists, the vast majority of whom had no specialized training in parallel programming. These middleware frameworks are responsible for the "glue" code that is common to many parallel programs, regardless of the task a specific application sets out to solve. With these developments, parallel programming and working with massive data sets became increasingly available to computer scientists of all stripes.
Still, the field of parallel computing belonged to computer scientists and programmers, not to organizations or individuals. Over time, these paradigms evolved still further, to the point where nonprogrammers could not only see the power of Big Data applications but also make use of them, through tools and consumer applications that were becoming increasingly accessible.
Systems like Amazon Web Services (AWS), introduced in 2006, provided access to powerful clusters well suited to operating on massive amounts of data. At launch, AWS offered a command-line interface, limited to highly technical entry points best suited to highly technical users. AWS coupled with Hadoop made parallel programming more accessible than ever.
AWS and similar offerings (e.g., Rattle [24]) slowly developed graphical interfaces that made their powerful cloud computing systems more widely available. Gradually, technically minded individuals who were not expert programmers have been able to adopt intuitive tools that enable them to create their own applications on banks of computers in the cloud.
When cloud computing and data mining applications eventually become available and accessible to end users of any technical skill—like the consumer applications we have come to rely on in our everyday lives, such as the layman's tool Google Analytics—for-profit companies will surely make use of Big Data algorithms and techniques in increasing breadth and depth.
There remains a gap, however, between the user accessibility of these highly technical interfaces and a true for-everyone consumer application. Cloud computing and data mining applications are following the same trajectory as enterprise applications. In 5 Ways Consumer Apps are Driving the Enterprise Web, Svane asks, "Since the software I use every day at home and on my phone are so friendly and easy to use, why is my expensive business application so cumbersome and stodgy?" [25].
Like enterprise applications, cloud computing and data mining applications remain seemingly impenetrable to end users who have come to expect "one-click easy" interfaces. Still, both enterprise and Big Data applications are undeniably on the same forward trajectory toward becoming more accessible over time. In the meantime, organizations rely on their technical workers and data scientists to collect and leverage the Big Data intrinsic to their businesses.
a position where they must outlay the appropriate resources to hire people who can work with their data, and they are simply not incented to do so.
This reality has many underlying reasons. First, the risk of adopting Big Data is magnified due to the limited payoffs in a nonmarket reality. Second, NGOs do not have access to the resources and knowledge required to leverage Big Data while its applications remain out of reach for the everyday user. Third, NGOs tend to enjoy less turnover than for-profit businesses, engendering a corporate culture strikingly resistant to change. In this section, we will examine each of these underlying reasons in turn.
First, the benefits of Big Data are seen in the zeitgeist as primarily financial (although this perception is misleading): Big Data provides a competitive advantage to businesses that know how to leverage the data they collect. In their seminal data mining paper, Agrawal et al. [16] argue for the relevance of their work as a key way to improve "business decisions that the management of [a] supermarket has to make, includ[ing] what to put on sale, how to design coupons, how to place merchandise on shelves in order to maximize the profit, etc." All of these decisions, made better and more effective by data mining techniques, are in the service of increasing profits. In the sphere of for-profit businesses, this limited focus is both a necessary and a sufficient incentive to adopt and invest in a new technology.
NGOs, on the other hand, have no such driver. They rely upon foundation and government grants, individual donations, memberships, and even some revenue-generating products, but they do not carry a primary responsibility to sustain profitability. Therefore, the potential upside of increasing profits by leveraging Big Data is greatly diminished, and the related risks are magnified.
To be sure, Big Data can be useful to NGOs. As we have seen in this chapter, Big Data can help NGOs serve their constituents better and improve their offerings. Every NGO has a mission to strive for and constituents to serve. NGOs must manage their income—generated from foundation grants, individual donations, members, and social revenue—well and responsibly, and they risk their organizational reputations if they fail to carry out their social and fiduciary missions. Still, the benefits of Big Data seem indirect and out of reach relative to the risk and resource outlay required to make use of the data in the first place.
Second, NGOs often lack the resources needed to capture, manage, and analyze Big Data. As we have seen, for-profit businesses place cloud computing and data mining applications in the hands of their data scientists and highly technical workers. As these applications become more democratized, end users of all kinds, working at all types of businesses, will take on the roles currently held by data scientists. Until then, for-profit businesses see enough financial benefit in Big Data that they are willing and able to hire data scientists and other dedicated roles to manage their data applications.
Nonprofits, however, are less willing and able to invest in these roles because of the limited payoff they see from leveraging their data. Projects such as Data Without Borders and DataKind [26] are bringing data scientists into the social sector via collaborations and internships, as a way to bridge this gap within the social sector.
Lacking the resources of these technical roles and their associated knowledge of Big Data, NGOs face perceived risk when it comes to the security and privacy associated with
Big Data applications. For a typical organization working with Big Data, it is unlikely to be profitable or beneficial to acquire and maintain a cluster of machines to manipulate its data.
Finally, in general terms, NGOs enjoy less turnover and greater longevity among their employees than the for-profit business sector. One downside of this reality is that long-time employees are less inclined to take risks than newer employees. Indeed, many NGOs were slow to adopt technologies such as Twitter and Facebook despite urging from their young staffers [8]. The managers and executives who had been employed by an NGO for a long time were reluctant to deviate from their known path and were slow to trust younger, newer employees.
than donation- and membership-based nonprofits to include many more social enterprises similar to Couchsurfing. As described in Section 22.2, social enterprises are responsible for generating revenue in addition to carrying out their social missions.
These new types of organizations, many of them highly technical and web based, will lead the way in adopting and using Big Data to their advantage. More traditional NGOs will be inclined to follow suit as they see their sister organizations realize the nonfinancial benefits of leveraging Big Data.
Second, the scientists and researchers in the field of Big Data can improve the way they communicate about advances in the field and related applications. The social sector will be more amenable to Big Data when the applications and benefits are more understandable [18], and it is incumbent upon scientists to translate novel academic and industry research into terms that convey the relevance and real-world applicability of their results. Scientists are responsible not just for the content of their original research but for the communication layer as well. Big Data research, even the most esoteric results, often has straightforward application for many kinds of end users, including for-profit businesses and NGOs.
Third, cloud computing and data mining applications will become increasingly accessible. These tools started out as the sole province of specialized programmers and have become democratized to a much wider range of users over time. Still, they currently remain out of reach for those less comfortable with highly technical tools.
Like the social-web consumer applications before them, Big Data tools will grow in accessibility and usability until they become relevant to even the least technically minded users. When the translation and understanding of Big Data is within reach of anyone at any NGO, organizational Big Data will be truly utilized to its full potential.
22.7 CONCLUSION
This chapter has discussed the unmet potential of Big Data algorithms and applications within the social sector. The implications of this gap are far-reaching and impactful, within and beyond the sector itself.
Fundamentally, NGOs are not incented to seek out the resources and individuals they need to make the most of Big Data. Without the financial drive of the market, the benefits of cloud computing and data mining applications stack up poorly against the risk and resource outlay required to use them.
This dynamic can and will change, however. Social enterprises are leading the way in applying Big Data algorithms and techniques in a mission-driven context. These organizations' leadership, along with the growing accessibility of cloud computing and data mining tools, will accelerate the adoption curve in the social sector.
REFERENCES
1. Doan, A., Kleinberg, J., and Koudas, N. PANEL Crowds, clouds, and algorithms: Exploring
the human side of “Big Data” applications. 2010 Special Interest Group on Management of Data
(SIGMOD ’10), June 6–10, 2010.
2. Liu, B., Hsu, W., Han, H.S., and Xia, Y. Mining changes for real-life applications. 2nd International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2000), pp. 337–346, 2000.
3. Dean, J., and Ghemawat, S. MapReduce: Simplified data processing on large clusters. Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI 2004), pp. 137–150, 2004.
4. Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. Mass
Storage Systems and Technologies (MSST 2010), pp. 3–7, 2010.
5. Rosenbush, S. More companies, drowning in data, are turning to Hadoop. Wall Street Journal,
April 14, 2014.
6. Foster, I., Zhao, Y., Raicu, I., and Lu, S. Cloud computing and grid computing 360-degree com-
pared. Grid Computing Environments Workshop 2008 (GCE ’08), pp. 12–16, November 2008.
7. Lohr, S. The age of Big Data. The New York Times, February 11, 2012.
8. Kanter, B., and Paine, K. Measuring the Networked Nonprofit: Using Data to Change the World.
Jossey-Bass, New York, 2012.
9. Null, C. How small businesses can mine Big Data. PC World Magazine, August 27, 2013.
10. Google, Inc. Available at http://analytics.google.com.
11. Plaza, B. Monitoring web traffic source effectiveness with Google Analytics: An experiment
with time series. Aslib Proceedings, Vol. 61, Issue 5, pp. 474–482, 2009.
12. Defourny, J., and Borzaga, C., eds. From third sector to social enterprise. The Emergence of
Social Enterprise. Routledge, London and New York, pp. 1–18, 2001.
13. Raz, K. Toward an improved legal form for social enterprise. New York University Review of
Law & Social Change, Vol. 36, Issue 283, pp. 238–308, 2012.
14. Frankel, C., and Bromberger, A. The Art of Social Enterprise: Business as if People Mattered. New
Society Publishers, New York, 2013.
15. Doctors without Borders. Available at http://www.doctorswithoutborders.org.
16. Agrawal, R., Imieliński, T., and Swami, A. Mining association rules between sets of items in
large databases. Proceedings of the 1993 ACM SIGMOD International Conference on Manage-
ment of Data, pp. 207–216, 1993.
17. Duhigg, C. How companies learn your secrets. The New York Times, February 16, 2012.
18. Olson, R., Barton, D., and Palermo, B. Connection: Hollywood Storytelling Meets Critical Thinking. Prairie Starfish Productions, Los Angeles, 2013.
19. Davenport, T., and Patil, D. Data scientist: The sexiest job of the 21st century. Harvard Business
Review, October 2012.
20. The Incredible Growth of Web Usage (1984–2013). Available at http://www.whoishostingthis.com/blog/2013/08/21/incredible-growth-web-usage-infographic/.
21. Dagum, L. OpenMP: An industry standard API for shared-memory programming. Compu-
tational Science & Engineering, Vol. 5, Issue 1, pp. 46–55, 1998.
22. Armstrong, J., Virding, R., Wikström, C., and Williams, M. Concurrent programming in
ERLANG, Prentice Hall, Upper Saddle River, NJ, 1993.
23. Cormen, T., and Davidson, E. FG: A framework generator for hiding latency in parallel pro-
grams running on clusters. 17th International Conference on Parallel and Distributed Computing
Systems (PDCS 2004), pp. 127–144, September 2004.
24. Williams, G. Rattle: A data mining GUI for R. The R Journal, Vol. 1, Issue 2, pp. 45–55, 2009.
25. Svane, M. 5 ways consumer apps are driving the enterprise web. Forbes Magazine, August 2011.
26. Data Kind. Available at http://www.datakind.org.